2025-07-16 Jason Wei. Life lessons from reinforcement learning

Becoming an RL diehard in the past year and thinking about RL for most of my waking hours inadvertently taught me an important lesson about how to live my own life.

One of the big concepts in RL is that you always want to be “on-policy”: instead of mimicking other people’s successful trajectories, you should take your own actions and learn from the reward given by the environment. Obviously imitation learning is useful for bootstrapping to a nonzero pass rate initially, but once the model can take reasonable trajectories, we generally avoid imitation learning, because the best way to leverage the model’s own strengths (which are different from a human’s) is to learn only from its own trajectories. A well-accepted instantiation of this is that RL is a better way to train language models to solve math word problems than simple supervised finetuning on human-written chains of thought.
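
To make the contrast concrete, here is a deliberately tiny sketch of the two training signals. It is not any real training stack: the toy policy, the two canned "solution styles," and the verifier are all invented for illustration, and the updates are bare score bumps rather than real gradient steps.

    import math
    import random

    # Toy illustration only: the "policy" is a preference score over two canned
    # solution styles for a math word problem. All names here are invented for
    # this sketch and do not correspond to any real library or training stack.

    class ToyPolicy:
        def __init__(self):
            self.scores = {"model_style": 0.0, "human_style": 0.0}

        def sample(self):
            # Sample a trajectory with probability proportional to exp(score).
            styles = list(self.scores)
            weights = [math.exp(self.scores[s]) for s in styles]
            return random.choices(styles, weights=weights)[0]

    def imitation_step(policy, demonstration, lr=0.5):
        # Supervised finetuning: always push toward the demonstrated trajectory,
        # whether or not it plays to the model's own strengths.
        policy.scores[demonstration] += lr

    def on_policy_rl_step(policy, verifier, lr=0.5):
        # On-policy RL: sample the model's *own* trajectory and reinforce it
        # only in proportion to the reward the environment hands back.
        trajectory = policy.sample()
        policy.scores[trajectory] += lr * verifier(trajectory)

    # Suppose the model's own style reliably reaches the right answer, while
    # imitating the human style does not (it isn't playing to the model's strengths).
    verifier = lambda trajectory: 1.0 if trajectory == "model_style" else 0.0

    policy = ToyPolicy()
    for _ in range(3):                 # bootstrap with imitation learning
        imitation_step(policy, "human_style")
    for _ in range(100):               # then learn only from the model's own trajectories
        on_policy_rl_step(policy, verifier)
    print(policy.scores)               # "model_style" typically ends up far ahead

The shape of the two updates is the whole point: imitation pulls toward someone else’s trajectory regardless of whether it suits the model, while on-policy RL only ever reinforces trajectories the model itself produced and the environment rewarded.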

Similarly in life, we first bootstrap ourselves via imitation learning (school), which is very reasonable. But even after I graduated from school, I had a habit of studying how other people found success and trying to imitate them. Sometimes it worked, but eventually I realized that I would never surpass the full ability of someone else, because they were playing to strengths that I didn’t have. It could be anything from a researcher doing yolo runs more successfully than me because they built the codebase themselves and I didn’t, to (for a non-AI example) a soccer player keeping possession of the ball by leveraging physical strength that I don’t have.

The lesson of doing on-policy RL is that beating the teacher requires walking your own path and taking your own risks and rewards from the environment. For example, two things I enjoy more than the average researcher are (1) reading a lot of data, and (2) doing ablations to understand the effect of individual components in a system. Once, when collecting a dataset, I spent a few days reading data and giving each human annotator personalized feedback; after that, the data turned out great and I gained valuable insight into the task I was trying to solve. Earlier this year I spent a month going back and ablating each of the decisions that I had previously yolo’ed while working on deep research. It was a sizable amount of time, but through those experiments I learned unique lessons about what type of RL works well. Not only was leaning into my own passions more fulfilling, but I now feel like I’m on a path to carving out a stronger niche for myself and my research.

In short, imitation is good and you have to do it initially. But once you’re bootstrapped enough, if you want to beat the teacher you must do on-policy RL and play to your own strengths and weaknesses :)
