# It's not the reward, it's the environment.

RL from scratch looked good, but its heyday is over. Learning from scratch is hard to generalize, inefficient at learning and exploring, and requires a lot of data. (It's been eclipsed by LLMs and RLHF.) [[Reward is Enough, Silver 2021]]

But RL from scratch has agentic capacity:
- long horizon
- implicit understanding of uncertainty
- dynamic environments

**One path forward for RL is real-world, multi-agent learning.**
- multiple robots play soccer based on vision
- from scratch, sim -> real
- Distributional MPO
- skills + self-play
- complex and annoying, but shows it's possible
- believes the complexity of the real world is fundamentally needed

"we've been too limited in the environments that we've chosen"

The important thing isn't the world, it's the environment.

# The monolith is cracking

Should AI models be monoliths or multitudes?

DiLoCo (Distributed Low-Communication training): training proceeds in chunks on separate workers, which only periodically share their weight updates.
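
A minimal sketch of that inner/outer loop, assuming PyTorch; `workers`, `get_batch`, `loss_fn`, and the hyperparameters are hypothetical stand-ins, not names from DiLoCo itself. Each worker trains locally for many steps, then the averaged weight delta is applied as a "pseudo-gradient" by an outer optimizer (the DiLoCo paper uses SGD with Nesterov momentum for the outer step).

```python
import copy
import torch

def diloco_round(global_model, workers, inner_steps, outer_opt, loss_fn, get_batch):
    """One communication round of a DiLoCo-style scheme (sketch, not the reference code)."""
    global_state = copy.deepcopy(global_model.state_dict())

    # Inner phase: every worker starts from the shared weights and trains
    # locally with no communication.
    local_states = []
    for worker in workers:
        worker.model.load_state_dict(global_state)
        inner_opt = torch.optim.AdamW(worker.model.parameters(), lr=1e-4)
        for _ in range(inner_steps):  # e.g. hundreds of local steps
            x, y = get_batch(worker)
            inner_opt.zero_grad()
            loss_fn(worker.model(x), y).backward()
            inner_opt.step()
        local_states.append(worker.model.state_dict())

    # Outer phase: average the per-worker weight deltas into one
    # pseudo-gradient and let the outer optimizer update the shared model.
    outer_opt.zero_grad()
    for name, param in global_model.named_parameters():
        delta = torch.stack(
            [global_state[name] - s[name] for s in local_states]
        ).mean(dim=0)
        param.grad = delta  # treat the averaged delta as a gradient
    outer_opt.step()

# Outer optimizer set up once over the shared weights, e.g.:
# outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7,
#                             momentum=0.9, nesterov=True)
```

Communication happens only once per round, after `inner_steps` local steps, which is what makes the scheme low-communication.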