underscored — Underscored

@underscored

Tag: reinforcement-learning

dwarkesh.com

The next big breakthrough will be AIs learning on the job

But one reason that I think it quite underrated, and also which reveals the canyon walls against which the river of AI progress will only slowly chip away at, is that it is not enough for a domain to be verifiable. It also has to be very grindable—in the sense that you can run lots of parallel rollouts against a deterministic and replayable simulator.
— Dwarkesh Patel

2w ago

dwarkesh.com

The sample efficiency black hole

Imagine if it took a couple decades worth of courses with hundreds of concurrent professors and millions of practice tasks for you to learn how to polish a word file. Even the task count difference understates the gap - the models have to grind their far more numerous tasks each far harder. Whereas a human student might practice a textbook problem once or twice, GRPO has the model generate hundreds to thousands of rollouts per task.

1mo ago
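
A rough illustration of why the clip above talks about many rollouts per task: in GRPO as it is commonly described, each rollout is scored relative to the other rollouts sampled for the same task, so a single attempt carries no signal. The sketch below is only an assumption-laden toy, not the quoted author's code: the function name `group_relative_advantages`, the use of the population standard deviation, the `eps` constant, and the reward values are all made up for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against its own group.

    `rewards` holds one scalar reward per rollout of the SAME task.
    Hypothetical helper: population std and `eps` are illustrative choices.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four made-up rollouts of one task: two succeeded, two failed.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every rollout in the group gets the same reward, all advantages are
# zero and the group contributes no learning signal.
flat = group_relative_advantages([0.0] * 8)
```

Under these assumptions, the successful rollouts get a positive advantage and the failed ones a negative advantage, while a group with uniform rewards teaches nothing, which is one reason a task may need many sampled rollouts.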

dwarkesh.com

Eric Jang – Building AlphaGo from scratch

naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo's MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem.

2mo ago
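
The contrast in the clip above can be sketched in a few lines. With a single end-of-trajectory reward, a naive policy gradient weights every token's log-probability by the same number, so nothing distinguishes the tokens that mattered. A search procedure such as MCTS instead yields a separate target for each move; normalizing visit counts into a distribution is the approach used in AlphaGo Zero style training. Both helper names and all numbers below are hypothetical, chosen only to illustrate the idea.

```python
def naive_pg_token_weights(num_tokens, final_reward):
    """Weight on each token's log-prob gradient under naive policy gradient
    with one terminal reward: identical for every token (hypothetical helper)."""
    return [final_reward] * num_tokens


def search_policy_target(visit_counts):
    """Turn per-action search visit counts for ONE move into a training
    target distribution (hypothetical helper, AlphaGo Zero style)."""
    total = sum(visit_counts)
    return [count / total for count in visit_counts]


# Every one of 5 tokens receives the same credit for the final outcome.
weights = naive_pg_token_weights(5, 1.0)

# Made-up visit counts over 3 candidate actions at a single move.
target = search_policy_target([60, 30, 10])
```

In the first case the learner must infer from many trajectories which tokens were responsible; in the second, each move comes with its own supervised-style target.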
