2025-07-16 Jason Wei. Life lessons from reinforcement learning


Becoming an RL diehard in the past year and thinking about RL for most of my waking hours inadvertently taught me an important lesson about how to live my own life.

One of the big concepts in RL is that you always want to be “on-policy”: instead of mimicking other people’s successful trajectories, you should take your own actions and learn from the reward given by the environment. Obviously imitation learning is useful for bootstrapping to a nonzero pass rate initially, but once the model can take reasonable trajectories, we generally avoid imitation learning, because the best way to leverage the model’s own strengths (which differ from a human’s) is to learn only from its own trajectories. A well-accepted instance of this is that RL is a better way to train language models to solve math word problems than simple supervised finetuning on human-written chains of thought.
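To make the on-policy idea concrete, here is a minimal toy sketch, not anything from an actual training setup. It assumes a three-action bandit in which the teacher succeeds with action 0, while the learner's own success rates (the made-up `LEARNER_REWARD` values) favor action 1. `imitate` does supervised updates toward the teacher's action; `on_policy` samples from the learner's own policy and applies a REINFORCE-style update with a running baseline. All names, step counts, and learning rates are illustrative assumptions.

```python
import math
import random

# Hypothetical success probabilities for the LEARNER's three actions.
# The teacher's strength is action 0; the learner is actually best at action 1.
LEARNER_REWARD = [0.3, 0.8, 0.1]
TEACHER_ACTION = 0


def softmax(logits):
    """Turn logits into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits):
    """Expected reward of the softmax policy under the learner's own odds."""
    return sum(p * r for p, r in zip(softmax(logits), LEARNER_REWARD))


def imitate(steps=500, lr=0.1):
    """Imitation learning: cross-entropy steps toward the teacher's action."""
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        for a in range(3):
            target = 1.0 if a == TEACHER_ACTION else 0.0
            logits[a] += lr * (target - probs[a])
    return logits


def on_policy(steps=5000, lr=0.1, seed=0):
    """On-policy RL: sample own actions, learn from the environment's reward."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    baseline = 0.0  # running average reward, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(3), weights=probs)[0]
        reward = 1.0 if rng.random() < LEARNER_REWARD[action] else 0.0
        advantage = reward - baseline
        # REINFORCE gradient of log-probability for a softmax policy.
        for a in range(3):
            indicator = 1.0 if a == action else 0.0
            logits[a] += lr * advantage * (indicator - probs[a])
        baseline += 0.05 * (reward - baseline)
    return logits


if __name__ == "__main__":
    print("imitation :", round(expected_reward(imitate()), 3))
    print("on-policy :", round(expected_reward(on_policy()), 3))
```

In this toy setting the imitating policy settles on the teacher's action and is capped near the learner's success rate for that action, while the on-policy learner should drift toward the action it is actually good at. Real language-model RL is far messier, so treat this only as an illustration of the principle.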

Similarly in life, we first bootstrap ourselves via imitation learning (school), which is very reasonable. But even after I graduated from school, I had a habit of studying how other people found success and trying to imitate them. Sometimes it worked, but eventually I realized that I would never surpass the full ability of someone else, because they were playing to strengths that I didn’t have. For example, a researcher might do yolo runs more successfully than me because they built the codebase themselves and I didn’t. A non-AI example would be a soccer player keeping possession of the ball by leveraging physical strength that I didn’t have.

The lesson of doing on-policy RL is that beating the teacher requires walking your own path, taking risks, and collecting rewards from the environment yourself. For example, two things I enjoy more than the average researcher are (1) reading a lot of data, and (2) doing ablations to understand the effect of individual components in a system. Once when collecting a dataset, I spent a few days reading data and giving each human annotator personalized feedback, and after that the data turned out great and I gained valuable insight into the task I was trying to solve. Earlier this year I spent a month going back and ablating each of the decisions that I had previously yolo’ed while working on deep research. It was a sizable amount of time, but through those experiments I learned unique lessons about what type of RL works well. Not only was leaning into my own passions more fulfilling, but I now feel like I’m on a path to carving out a stronger niche for myself and my research.

In short, imitation is good and you have to do it initially. But once you’re bootstrapped enough, if you want to beat the teacher you must do on-policy RL and play to your own strengths and weaknesses :)
