# It's not the reward, it's the environment.
RL from scratch looked good, but its heyday is over.
Policies learned from scratch generalize poorly.
Learning and exploring from scratch is inefficient and needs a lot of data.
(It's been eclipsed by LLMs and RLHF)
[[Reward is Enough, Silver 2021]]
But RL from scratch has agentic capacity.
- long horizon
- implicit understanding of uncertainty
- dynamic environments
**One path forward for RL is real-world, multi-agent learning.**
- multiple robots play soccer based on vision
- from scratch, sim -> real
- Distributional MPO
- skills + self-play (toy sketch after this list)
- complex and annoying, but shows it's possible
- believes the complexity of the real world is fundamentally needed
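A minimal sketch of the self-play-pool idea only, not the actual robot-soccer pipeline: a softmax policy on a toy rock-paper-scissors game is trained against frozen snapshots of its past selves. The game, the REINFORCE-style update, and every hyperparameter here are illustrative stand-ins.

```python
import numpy as np

# Toy self-play sketch: a softmax policy over rock-paper-scissors,
# trained against a pool of frozen past snapshots of itself.
PAYOFF = np.array([[0, -1, 1],
                   [1, 0, -1],
                   [-1, 1, 0]])  # row player's reward

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

rng = np.random.default_rng(0)
logits = np.zeros(3)           # current ("learner") policy parameters
pool = [logits.copy()]         # pool of frozen opponent snapshots
lr, snapshot_every = 0.1, 500

for step in range(5_000):
    opponent = pool[rng.integers(len(pool))]   # sample a past self
    probs, opp_probs = softmax(logits), softmax(opponent)
    a = rng.choice(3, p=probs)
    b = rng.choice(3, p=opp_probs)
    reward = PAYOFF[a, b]
    # REINFORCE-style update: grad log pi(a) = one_hot(a) - probs
    grad = -probs
    grad[a] += 1.0
    logits += lr * reward * grad
    if (step + 1) % snapshot_every == 0:
        pool.append(logits.copy())             # freeze a new opponent

print("final policy:", softmax(logits).round(3))
```

The point of the snapshot pool is that the opponent distribution keeps moving with the learner, so there is no fixed curriculum to design, only the environment plus other agents.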
"we've been too limited in the environments that we've chosen"
The important thing isn't the world, it's the environment.
# The monolith is cracking
Should AI models be monoliths or multitudes?
Distributed Low-Communication training (DiLoCo)
Training happens in independent local chunks, with workers periodically sharing weight updates.
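A rough sketch of that communication pattern on a toy least-squares problem, assuming a Nesterov-style momentum step on the averaged weight delta as the outer update; the inner optimizer is plain gradient descent here rather than AdamW, and all names and hyperparameters are illustrative.

```python
import numpy as np

# DiLoCo-style sketch: K workers each take H local steps on their own data
# shard with no communication, then one outer step applies the averaged
# weight delta (treated as an "outer gradient") with momentum.
rng = np.random.default_rng(0)
dim, K, H = 8, 4, 20
shards = [(rng.normal(size=(64, dim)), rng.normal(size=64)) for _ in range(K)]

def grad(theta, A, b):
    return A.T @ (A @ theta - b) / len(b)   # least-squares gradient on one shard

theta = np.zeros(dim)        # globally shared parameters
velocity = np.zeros(dim)     # outer momentum buffer
inner_lr, outer_lr, mu = 0.05, 0.7, 0.9

for outer_round in range(50):
    local = []
    for A, b in shards:                  # each worker trains independently...
        w = theta.copy()
        for _ in range(H):               # ...for H inner steps, no communication
            w -= inner_lr * grad(w, A, b)
        local.append(w)
    # Communicate once per round: average the deltas, apply one outer step.
    outer_grad = theta - np.mean(local, axis=0)
    velocity = mu * velocity + outer_grad
    theta -= outer_lr * (mu * velocity + outer_grad)  # Nesterov-style update

loss = sum(0.5 * np.mean((A @ theta - b) ** 2) for A, b in shards)
print("final joint loss:", round(loss, 4))
```

Communication drops from every step to once per round of H inner steps, which is what makes training across loosely connected devices plausible.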