Imagine if it took a couple decades worth of courses with hundreds of concurrent professors and millions of practice tasks for you to learn how to polish a word file. Even the task count difference understates the gap - the models have to grind their far more numerous tasks each far harder. Whereas a human student might practice a textbook problem once or twice, GRPO has the model generate hundreds to thousands of rollouts per task.
naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo's MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem.
The reason they believe robots haven't generalized like LLMs isn't that the models aren't smart enough, but that the data has been a fraction of a percent of what humans naturally generate every day, captured through interfaces that distort the very behavior they're trying to record.
The implant works by recording electrical signals from the brain's motor cortex, then using machine learning to decode the user's intended movements and send signals to stimulate the muscles, essentially creating a new neural pathway that bypasses the damaged tissue.
AI agents need world models that allow them to predict the consequences of their actions before they take them. This is key to enabling agents that can plan, remember, and reason about complex observations.
3mo ago
Underscored — save the words that stop you in your tracks.