Imagine if it took a couple decades worth of courses with hundreds of concurrent professors and millions of practice tasks for you to learn how to polish a word file. Even the task count difference understates the gap - the models have to grind their far more numerous tasks each far harder. Whereas a human student might practice a textbook problem once or twice, GRPO has the model generate hundreds to thousands of rollouts per task.
It is increasingly hard to quickly demonstrate each generational change as AI has gotten better, since a lot of the old things AI was bad at, like math or counting letters in words, are now trivial for AI to do.
1mo ago
Underscored — save the words that stop you in your tracks.