Dwarkesh Patel 00:00:00
Today I have the honor of chatting with Jeff Dean and Noam Shazeer. Jeff is Google's Chief Scientist, and through his 25 years at the company, he has worked on basically the most transformative systems in modern computing: from MapReduce, BigTable, TensorFlow, and AlphaChip – genuinely, the list doesn't end – to Gemini now.
And Noam is the single person most responsible for the current AI revolution. He has been the inventor or co-inventor of all the main architectures and techniques that are used for modern LLMs: from the Transformer itself, to Mixture of Experts, to Mesh TensorFlow, to many other things. And they are two of the three co-leads of Gemini at Google DeepMind. Awesome. Thanks so much for coming on.
Jeff Dean 00:00:50
Thank you. Super excited to be here.
Dwarkesh Patel 00:00:52
Okay, first question. Both of you have been at Google for 25, or close to 25, years. At some point early on in the company, you probably understood how everything worked. When did that stop being the case? Do you feel like there was a clear moment that happened?
Noam Shazeer 00:01:08
I joined, this was like, end of 2000, and they had this thing: everybody gets a mentor. I knew nothing. I would just ask my mentor everything, and my mentor knew everything. It turned out my mentor was Jeff.
It was not the case that everyone at Google knew everything. It was just the case that Jeff knew everything because he had basically written everything.
Jeff Dean 00:01:33
You're very kind. I think as companies grow, you kind of go through these phases. When I joined, we were 25 people, 26 people, something like that. So eventually you learned everyone's name, and even though we were growing, you kept track of all the people who were joining.
At some point, you lose track of everyone's name in the company, but you still know everyone working on software engineering things. Then you lose track of all the names of people in the software engineering group, but you at least know all the different projects that everyone's working on. Then at some point, the company gets big enough that you get an email that Project Platypus is launching on Friday, and you're like, "What the heck is Project Platypus?"
Noam Shazeer 00:02:15
Usually it's a very good surprise. You're like, "Wow, Project Platypus!" I had no idea we were doing that.
Jeff Dean 00:02:23
But I think it is good to keep track of what's going on in the company, even at a very high level, even if you don't know every last detail. And it's good to know lots of people throughout the company so that you can go ask someone for more details or figure out who to talk to. With one level of indirection, you can usually find the right person in the company if you have a good network of people that you've built up over time.
Dwarkesh Patel 00:02:44
How did Google recruit you, by the way?
Jeff Dean 00:02:46
I kind of reached out to them, actually.
Dwarkesh Patel 00:02:50
And Noam, how did you get recruited?
Noam Shazeer 00:02:53
I actually saw Google at a job fair in 1999, and I assumed that it was already this huge company, that there was no point in joining, because everyone I knew used Google. I guess that was because I was a grad student at Berkeley at the time. I guess I've dropped out of grad programs a few times.
It turns out that actually it wasn't really that large. I didn't apply in 1999, but just kind of sent them a resume on a whim in 2000, because it was my favorite search engine and I figured I should apply to multiple places for a job. But then it turned out to be really fun; it looked like a bunch of smart people doing good stuff. They had this really nice crayon chart on the wall of the daily number of search queries that somebody had just been maintaining. It looked very exponential. I thought, "These guys are going to be very successful, and it looks like they have a lot of good problems to work on." So I was like, "Okay, maybe I'll go work there for a little while and then have enough money to just go work on AI for as long as I want after that."
Dwarkesh Patel 00:04:08
Yeah, yeah. In a way you did that, right?
Noam Shazeer 00:04:10
Yeah, it totally worked out exactly according to plan.
Dwarkesh Patel 00:04:15
You were thinking about AI in 1999?
Noam Shazeer 00:04:17
Yeah, this was like 2000. Yeah, I remember in grad school, a friend of mine at the time had told me that his New Year's resolution for 2000 was to live to see the year 3000, and that he was going to achieve this by inventing AI. I was like, "Oh, that sounds like a good idea."
I didn't get the idea at the time that you could go do it at a big company. But I figured, "Hey, a bunch of people seem to be making a ton of money at startups. Maybe I'll just make some money, and then I'll have enough to live on and just work on AI research for a long time." But yeah, it actually turned out that Google was a terrific place to work on AI.

A relaxed and free work environment.
Jeff Dean 00:05:07
One of the things I like about Google is our ambition has always been sort of something that would require pretty advanced AI. Because I think organizing the world's information and making it universally accessible and useful, actually there is a really broad mandate in there. It's not like the company was going to do this one little thing and stay doing that. And also you could see that what we were doing initially was in that direction, but you could do so much more in that direction.
Dwarkesh Patel 00:05:36
How has Moore's Law over the last two or three decades changed the kinds of considerations you have to take on board when you design new systems, when you figure out what projects are feasible? What are still the limitations? What are things you can now do that you obviously couldn't do before?
Jeff Dean 00:05:51
I think of it as actually changing quite a bit in the last couple of decades. From two decades ago to one decade ago, it was awesome because you just wait, and like 18 months later, you get much faster hardware, and you don't have to do anything. Then more recently, I feel like the general-purpose CPU-based machine scaling has not been as good: the fabrication process improvements are now taking three years instead of two, and the architectural improvements in multi-core processors and so on are not giving you the same boost that we were getting 20 to 10 years ago. But at the same time, we're seeing much more specialized computational devices, like machine learning accelerators, TPUs, and more recently very ML-focused GPUs, making it so that we can actually get really high performance and good efficiency out of the more modern kinds of computations we want to run, which are different than a twisty pile of C++ code trying to run Microsoft Office or something.
Noam Shazeer 00:07:02
It feels like the algorithms are following the hardware. Basically, what's happened is that at this point, arithmetic is very, very cheap, and moving data around is comparatively much more expensive. So pretty much all of deep learning has taken off roughly because of that. You can build it out of matrix multiplications that are N cubed operations and N squared bytes of data communication basically.
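To make Noam's ratio concrete, here's a minimal sketch (illustrative numbers only, not a measurement of any real chip): an N x N matrix multiply does about 2N^3 operations while moving only about 3N^2 values, so the arithmetic done per byte moved grows linearly with N.

```python
def matmul_intensity(n: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte moved for an n x n times n x n matrix multiply."""
    flops = 2 * n ** 3                             # one multiply-add per term
    bytes_moved = 3 * n ** 2 * bytes_per_element   # read A, read B, write C
    return flops / bytes_moved

for n in (128, 1024, 8192):
    print(f"n={n:5d}: {matmul_intensity(n):8.1f} FLOPs/byte")
```

Bigger matmuls keep the arithmetic units busy relative to the memory system, which is why hardware built around them pays off.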
Jeff Dean 00:07:39
Well, I would say that the pivot to hardware oriented around that was an important transition, because before that, we had CPUs and GPUs that were not especially well-suited for deep learning. Then we started to build TPUs at Google that were really just reduced-precision linear algebra machines, and once you have that, you want to exploit it.
Noam Shazeer 00:08:02
It seems like it's all about identifying opportunity costs. This is something Larry Page, I think, used to always say: "Our second biggest cost is taxes, and our biggest cost is opportunity costs." If he didn't say that, then I've been misquoting him for years.
But basically it’s like, what is the opportunity that you have that you're missing out on? In this case, I guess it was that you've got all of this chip area, and you're putting a very small number of arithmetic units on it. Fill the thing up with arithmetic units! You could have orders of magnitude more arithmetic getting done. Now, what else has to change? Okay, the algorithms and the data flow and everything else.
Noam Shazeer 00:08:51
And, oh, by the way, the arithmetic can be really low precision, so then you can squeeze even more multiplier units in.
Dwarkesh Patel 00:08:58
Noam, I want to follow up on what you said, that the algorithms have been following the hardware. Imagine a counterfactual world where the cost of memory had declined more than the cost of arithmetic, where the dynamic you saw was inverted.
Noam Shazeer 00:09:12
Okay, data flow is extremely cheap, and arithmetic is not.
Dwarkesh Patel 00:09:18
What would AI look like today?
Jeff Dean 00:09:20
You'd have a lot more lookups into very large memories.
Noam Shazeer 00:09:25
Yeah, it might look more like AI looked like 20 years ago but in the opposite direction. I'm not sure. I guess I joined Google Brain in 2012. I left Google for a few years, happened to go back for lunch to visit my wife, and we happened to sit down next to Jeff and the early Google Brain team. I thought, "Wow, that's a smart group of people."
Jeff Dean 00:09:55
I think I said, "You should think about deep neural nets. We're making some pretty good progress there."
Noam Shazeer 00:09:59
"That sounds fun." Okay, so I jumped back in…
Jeff Dean 00:10:02
I wooed him back, it was great.
Noam Shazeer 00:10:05
...to join Jeff, that was like 2012. I seem to join Google every 12 years: I rejoined Google in 2000, 2012, and 2024.
Dwarkesh Patel 00:10:14
What's going to happen in 2036?
Noam Shazeer 00:10:16
I don't know. I guess we shall see.
Dwarkesh Patel 00:10:21
What are the trade-offs that you're considering changing for future versions of TPU to integrate how you're thinking about algorithms?
Jeff Dean 00:10:29
I think one general trend is we're getting better at quantizing or having much more reduced-precision models. We started with TPUv1, and we weren't even quite sure we could quantize a model for serving with eight-bit integers. But we sort of had some early evidence that seemed like it might be possible. So we're like, "Great, let's build the whole chip around that."
And then over time, I think you've seen people able to use much lower precision for training as well. But also the inference precision has gone down. People are now using INT4 or FP4. If you'd said to a supercomputing floating-point person 20 years ago, "We're going to use FP4," they'd be like, "What? That's crazy. We like 64 bits in our floats." Or even below that, some people are quantizing models to two bits or one bit, and I think that's a trend that definitely –
Dwarkesh Patel 00:11:25
One bit? Just like a zero-or-one?
Jeff Dean 00:11:27
Yeah, just a 0-1. And then you have a sign bit for a group of bits or something.
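As a rough illustration of the eight-bit idea Jeff mentioned (a toy sketch, not the actual TPU serving path): the simplest form of quantization stores weights as int8 plus a single scale factor and converts back on use.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights to int8 with one symmetric per-tensor scale."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max rounding error:", np.abs(w - dequantize(q, s)).max())
```

Each value now takes one byte instead of four, at the cost of a small rounding error; INT4, FP4, and below push the same trade further.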
Noam Shazeer 00:11:33
It really has to be a co-design thing, because if the algorithm designer doesn't realize that you can get greatly improved performance and throughput with the lower precision, of course they're going to say, "I don't want low precision. That introduces risk, and it adds irritation."
Then if you ask the chip designer, "Okay, what do you want to build?" And then they'll ask the person who's writing the algorithms today, who's going to say, "No, I don't like quantization. It's irritating." So you actually need to basically see the whole picture and figure out, "Oh, wait a minute, we can increase our throughput-to-cost ratio by a lot by quantizing."
Jeff Dean 00:12:27
Then you're like, yes, quantization is irritating, but your model is going to be three times faster, so you're going to have to deal.
Dwarkesh Patel 00:12:33
Through your careers, at various times, you've worked on things that have an uncanny resemblance to what we're actually using now for generative AI. In 1990, Jeff, your senior thesis was about backpropagation. And in 2007 – this is the thing that I didn't realize until I was prepping for this episode – you guys trained a two-trillion-token N-gram model for language modeling.
Just walk me through when you were developing that model. Was this kind of thing in your head? What did you think you guys were doing at the time?
Jeff Dean 00:13:13
Let me start with the undergrad thesis. I got introduced to neural nets in one section of one class on parallel computing that I was taking in my senior year. I needed to do a thesis to graduate, an honors thesis. So I approached the professor and I said, "Oh, it'd be really fun to do something around neural nets."
So, he and I decided I would implement a couple of different ways of parallelizing backpropagation training for neural nets in 1990. I called them something funny in my thesis, like "pattern partitioning" or something. But really, I implemented a model parallelism and data parallelism on a 32-processor Hypercube machine.
In one, you split all the examples into different batches, and every CPU has a copy of the model. In the other one, you pipeline a bunch of examples along to processors that have different parts of the model. I compared and contrasted them, and it was interesting.
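Here's a toy sketch of the first scheme, data parallelism (the names and tiny model are made up for illustration; this is not the thesis code): every worker keeps a full copy of the weights, computes a gradient on its own shard of the examples, and the gradients are averaged. Model parallelism would instead split the weights themselves across workers and pipeline examples through them.

```python
import numpy as np

def grad_fn(w, x, y):
    """Gradient of mean squared error for a linear model y ~ x @ w."""
    return 2 * x.T @ (x @ w - y) / len(x)

def data_parallel_step(w, x, y, lr=0.1, n_workers=4):
    xs, ys = np.array_split(x, n_workers), np.array_split(y, n_workers)
    grads = [grad_fn(w, xi, yi) for xi, yi in zip(xs, ys)]  # each "CPU" works
    return w - lr * np.mean(grads, axis=0)                  # averaged update

rng = np.random.default_rng(0)
x, w_true = rng.normal(size=(64, 8)), rng.normal(size=8)
y = x @ w_true
w = np.zeros(8)
for _ in range(200):
    w = data_parallel_step(w, x, y)
print("max weight error:", np.abs(w - w_true).max())
```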
I was really excited about the abstraction because it felt like neural nets were the right abstraction. They could solve tiny toy problems that no other approach could solve at the time. I thought, naive me, that 32 processors would be able to train really awesome neural nets.
But it turned out we needed about a million times more compute before they really started to work for real problems. Then, starting in the 2008, 2009, 2010 timeframe, we started to have enough compute, thanks to Moore's law, to actually make neural nets work for real things. That was kind of when I re-entered, looking at neural nets.

In fact, the only ones who truly believed were the OpenAI team.
But prior to that, in 2007...
Dwarkesh Patel 00:14:55
Sorry, actually could I ask about this?
Jeff Dean 00:14:57
Oh yeah, sure.
Dwarkesh Patel 00:14:58
First of all, unlike other artifacts of academia, it's actually like four pages, and you can just read it.
Jeff Dean 00:15:07
It was four pages and then 30 pages of C code.
Dwarkesh Patel 00:15:10
But it's just a well-produced artifact. Tell me about how the 2007 paper came together.
Jeff Dean 00:15:15
Oh yeah. We had a machine translation research team at Google led by Franz Och, who had joined Google maybe a year before, and a bunch of other people. Every year they competed in a DARPA contest on translating a couple of different languages to English – I think Chinese to English and Arabic to English.
The Google team had submitted an entry, and the way this works is you get 500 sentences on Monday, and you have to submit the answer on Friday. I saw the results of this, and we'd won the contest by a pretty substantial margin measured in BLEU score, which is a measure of translation quality.
So I reached out to Franz, the head of this winning team. I'm like, "This is great, when are we going to launch it?" And he's like, "Oh, well, we can't launch this. It's not really very practical because it takes 12 hours to translate a sentence." I'm like, "Well, that seems like a long time. How could we fix that?"
It turned out they'd not really designed it for high throughput, obviously. It was doing 100,000 disk seeks in a large language model that they sort of computed statistics over – I wouldn't say "trained" really – for each word that it wanted to translate.
Obviously, doing 100,000 disk seeks is not super speedy. But I said, "Okay, well, let's dive into this." So I spent about two or three months with them, designing an in-memory compressed representation of N-gram data.
An N-gram is basically statistics for how often every N-word sequence occurs in a large corpus. In this case, we had 2 trillion words. Most N-gram models of the day were using two-grams or maybe three-grams, but we decided we would use five-grams.
So, how often every five-word sequence occurs in basically as much of the web as we could process in those days. Then you have a data structure that says, "Okay, 'I really like this restaurant' occurs 17 times in the web," or something.
And so I built a data structure that would let you store all those in memory on 200 machines and then have sort of a batched API where you could say, "Here are the 100,000 things I need to look up in this round for this word," and we'd give you them all back in parallel. That enabled us to go from taking a night to translate a sentence to basically doing something in 100 milliseconds or something.
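A minimal sketch of that data model (the real system compressed counts for 2 trillion words in memory across 200 machines; this toy just shows the shape of a 5-gram table and a batched lookup API):

```python
from collections import Counter

def count_ngrams(tokens, n=5):
    """Count how often every n-word sequence occurs in a corpus."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def batched_lookup(counts, queries):
    """Answer many count queries at once, as in the batched serving API."""
    return {q: counts.get(q, 0) for q in queries}

corpus = "i really like this restaurant and i really like this menu".split()
counts = count_ngrams(corpus, n=5)
query = tuple("i really like this restaurant".split())
print(batched_lookup(counts, [query]))  # the 5-gram occurs once here
```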
Dwarkesh Patel 00:18:03
There's this list of Jeff Dean facts, like Chuck Norris facts. For example: "For Jeff Dean, NP equals 'no problemo.'" One of them is funny because, now that I hear you say it, it's actually kind of true: "The speed of light was 35 miles an hour until Jeff Dean decided to optimize it over a weekend." Just going from 12 hours to 100 milliseconds, I got to do the orders of magnitude there.
Jeff Dean 00:18:36
All of these are very flattering. They're pretty funny. They're like an April Fool's joke gone awry by my colleagues.
Dwarkesh Patel 00:18:45
Obviously, in retrospect, this idea that you can develop a latent representation of the entire internet through just considering the relationships between words is like: yeah, this is large language models. This is Gemini. At the time, was it just a translation idea, or did you see that as being the beginning of a different kind of paradigm?
Jeff Dean 00:19:11
I think once we built that for translation, the serving of large language models started to be used for other things, like completion... you start to type, and it suggests what completions make sense.
So it was definitely the start of a lot of uses of language models in Google. And Noam has worked on a number of other things at Google, like spelling correction systems that use language models.
Noam Shazeer 00:19:36
That was like 2000, 2001, and I think it was all in-memory on one machine.
Jeff Dean 00:19:44
Yeah, I think it was one machine. His spelling correction system he built in 2001 was amazing. He sent out this demo link to the whole company.
I just tried every butchered spelling of every few-word query I could get, like “scrumbled uggs Bundict"—
Noam Shazeer 00:19:59
I remember that one, yeah yeah.
Jeff Dean 00:20:00
—instead of “scrambled eggs benedict”, and it just nailed it every time.
Noam Shazeer 00:20:04
Yeah, I guess that was language modeling.
Dwarkesh Patel 00:20:07
But at the time, when you were developing these systems, did you have this sense of, “look, you make these things more and more sophisticated, don't consider five words, consider 100 words, 1,000 words, then the latent representation is intelligence”. Basically when did that insight hit?
Noam Shazeer 00:20:24
Not really. I don't think I ever felt like, okay, N-gram models are going to–
Jeff Dean 00:20:32
–sweep the world–
Noam Shazeer 00:20:33
–yeah: “be” artificial intelligence. I think at the time, a lot of people were excited about Bayesian networks. That seemed exciting.
Definitely, seeing those early neural language models, there was the magic of "okay, this is doing something extremely cool," and it also struck me as the best problem in the world. For one, it is very, very simple to state: give me a probability distribution over the next word. Also, there's roughly infinite training data out there. There's the text of the web; you have trillions of training examples of unsupervised data.
Jeff Dean 00:21:20
Yeah, or self-supervised.
Noam Shazeer 00:21:22
Self-supervised, yeah.
Jeff Dean 00:21:23
It's nice because you then have the right answer, and then you can train on all but the current word and try to predict the current word. It's this amazing ability to just learn from observations of the world.
Noam Shazeer 00:21:36
And then it's AI complete. If you can do a great job of that, then you can pretty much do anything.
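To spell out how simple the problem statement is, here's a toy next-word model in the same spirit (a bigram counter; real language models condition on vastly more context):

```python
from collections import Counter, defaultdict

def train_next_word(tokens):
    """Self-supervised: the 'label' is just the next word in the corpus."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {w: {n: c / sum(cs.values()) for n, c in cs.items()}
            for w, cs in counts.items()}

model = train_next_word("the cat sat on the mat and the cat slept".split())
print(model["cat"])  # {'sat': 0.5, 'slept': 0.5}
print(model["the"])  # roughly {'cat': 0.67, 'mat': 0.33}
```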
Dwarkesh Patel 00:23:00
There's this interesting discussion in the history of science about whether ideas are just in the air and there's a sort of inevitability to big ideas, or whether they're sort of plucked out of some tangential direction. In this case, this way in which we're laying it out very logically, does that imply basically, how inevitable does this...
Noam Shazeer 00:22:05
It does feel like it's in the air. There were definitely some: there was the neural Turing machine, and a bunch of ideas around attention, like having these key-value stores that could be useful in neural networks to focus on things. I think in some sense, it was in the air, and in some sense, you need some group to go do it.
Jeff Dean 00:22:36
I like to think of a lot of ideas as being partially in the air, where there are a few different, maybe separate research ideas that one is squinting at when you’re trying to solve a new problem. You draw on those for some inspiration, and then there's some aspect that is not solved, and you need to figure out how to solve that. The combination of some morphing of the things that already exist and some new things lead to some new breakthrough or new research result that didn't exist before.
Dwarkesh Patel 00:22:57
Are there key moments that stand out to you where you're looking at a research area, you come up with this idea, and you have this feeling of, "Holy shit, I can't believe that worked?"
Jeff Dean 00:23:06
One thing I remember was in the early days of the Brain team. We were focused on “let’s see if we could build some infrastructure that lets us train really, really big neural nets”. At that time, we didn't have GPUs in our data centers; we just had CPUs. But we know how to make lots of CPUs work together.
So we built a system that enabled us to train pretty large neural nets through both model and data parallelism. We had a system for unsupervised learning on 10 million randomly selected YouTube frames. It was a spatially local representation, so it would build up unsupervised representations based on trying to reconstruct the thing from the high-level representations.
We got that working and training on 2,000 computers using 16,000 cores. After a little while, that model was actually able to build a representation at the highest level where one neuron would get excited by images of cats. It had never been told what a cat was, but it had seen enough examples of them in the training data of head-on facial views of cats that that neuron would turn on for that and not for much else.
Similarly, you'd have other ones for human faces and backs of pedestrians, and this kind of thing. That was kind of cool because it's from unsupervised learning principles, building up these really high-level representations. Then we were able to get very good results on the supervised ImageNet 20,000 category challenge that advanced the state of the art by 60% relative improvement, which was quite good at the time.
That neural net was probably 50x bigger than one that had been trained previously, and it got good results. So that sort of said to me, "Hey, actually scaling up neural nets seems like, I thought it would be a good idea and it seems to be, so we should keep pushing on that."
Dwarkesh Patel 00:25:14
These examples illustrate how these AI systems fit into what you were just mentioning: that Google is fundamentally a company that organizes information. AI, in this context, is finding relationships between information, between concepts, to help get ideas to you faster, information you want to you faster.
Now we're moving beyond that with current AI models. Obviously, you can use BERT in Google Search and you can ask these questions. They are still good at information retrieval, but more fundamentally, they can write your entire code base for you and do actual work, which goes beyond just information retrieval.
So how are you thinking about that? Is Google still an information retrieval company if you're building an AGI? An AGI can do information retrieval, but it can do many other things as well.
Jeff Dean 00:26:14
I think we're an "organize the world's information" company, and that's broader than information retrieval. Maybe: “organizing and creating new information from some guidance you give it”.
"Can you help me write a letter to my veterinarian about my dog? It's got these symptoms," and it'll draft that. Or, "Can you feed in this video, and can you produce a summary of what's happening in the video every few minutes?"
I think our multimodal capabilities are showing that it's more than just text. It's about understanding the world in all the different modalities that information exists in, both human ones but also non-human-oriented ones, like weird lidar sensors on autonomous vehicles, or genomic information, or health information.
And then, how do you extract and transform that into useful insights for people and make use of that in helping them do all kinds of things they want to do? Sometimes it's, "I want to be entertained by chatting with a chatbot." Sometimes it's, "I want answers to this really complicated question, there is no single source to retrieve from." You need to pull information from 100 web pages, figure out what's going on, and make an organized, synthesized version of that data.
Then dealing with multimodal things or coding-related problems. I think it's super exciting what these models are capable of, and they're improving fast, so I'm excited to see where we go.
Noam Shazeer 00:28:42
I am also excited to see where we go. I think definitely organizing information is clearly a trillion-dollar opportunity, but a trillion dollars is not cool anymore. What's cool is a quadrillion dollars.
Obviously the idea is not to just pile up some giant pile of money, but it's to create value in the world, and so much more value can be created when these systems can actually go and do something for you, write your code, or figure out problems that you wouldn't have been able to figure out yourself.
To do that at scale, we're going to have to be very, very flexible and dynamic as we improve the capabilities of these models.
Jeff Dean 00:29:22
Yeah, I'm pretty excited about a lot of fundamental research questions that come about because you see something could be substantially improved if we tried this approach or things in this rough direction. Maybe that'll work, maybe it won't.
But I also think there's value in seeing what we could achieve for end-users and then how can we work backwards from that to actually build systems that are able to do that. As one example: organizing information, that should mean any information in the world should be usable by anyone, regardless of what language they speak.
And that I think we've done some amount of, but it's not nearly the full vision of, "No matter what language you speak, out of thousands of languages, we can make any piece of content available to you and make it usable by you. Any video could be watched in any language." I think that would be pretty awesome. We're not quite there yet, but that's definitely things I see on the horizon that should be possible.
Dwarkesh Patel 00:30:26
Speaking of different architectures you might try, I know one thing you're working on right now is longer context. If you think of Google Search, it's got the entire index of the internet in its context, but it's a very shallow search. And then obviously language models have limited context right now, but they can really think. It's like dark magic, in-context learning. It can really think about what it’s seeing.
How do you think about what it would be like to merge something like Google Search and something like in-context learning?
Jeff Dean 00:30:51
Yeah, I'll take a first stab at it because I've thought about this for a bit. One of the things you see with these models is they're quite good, but they do hallucinate and have factuality issues sometimes. Part of that is you've trained on, say, tens of trillions of tokens, and you've stirred all that together in your tens or hundreds of billions of parameters.
But it's all a bit squishy because you've churned all these tokens together. The model has a reasonably clear view of that data, but it sometimes gets confused and will give the wrong date for something.
Whereas information in the context window, in the input of the model, is really sharp and clear because we have this really nice attention mechanism in transformers. The model can pay attention to things, and it knows the exact text or the exact frames of the video or audio or whatever that it's processing.
Right now, we have models that can deal with millions of tokens of context, which is quite a lot. It's hundreds of pages of PDF, or 50 research papers, or hours of video, or tens of hours of audio, or some combination of those things, which is pretty cool. But it would be really nice if the model could attend to trillions of tokens.
Could it attend to the entire internet and find the right stuff for you? Could it attend to all your personal information for you? I would love a model that has access to all my emails, all my documents, and all my photos.
When I ask it to do something, it can sort of make use of that, with my permission, to help solve what it is I'm wanting it to do. But that's going to be a big computational challenge because the naive attention algorithm is quadratic. You can barely make it work on a fair bit of hardware for millions of tokens, but there's no hope of making that just naively go to trillions of tokens.
So, we need a whole bunch of interesting algorithmic approximations to what you would really want: a way for the model to attend conceptually to lots and lots more tokens, trillions of tokens. Maybe we can put all of the Google code base in context for every Google developer, all the world's source code in context for any open-source developer. That would be amazing.
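A rough illustration of the quadratic wall (back-of-the-envelope numbers, not any real model's): the attention scores alone form an L x L matrix, so the cost grows with the square of the context length L.

```python
def attention_score_flops(seq_len: int, head_dim: int = 128) -> float:
    """FLOPs just for Q @ K^T in one attention head, ignoring everything else."""
    return 2.0 * seq_len ** 2 * head_dim

for L in (1_000_000, 1_000_000_000, 1_000_000_000_000):
    print(f"L={L:>16,d}: {attention_score_flops(L):.2e} FLOPs")
```

Going from a million to a trillion tokens multiplies that term by 10^12, which is why attending "conceptually" to trillions of tokens needs algorithmic approximations rather than brute force.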
Noam Shazeer 00:33:20
It would be incredible. The beautiful thing about model parameters is they are quite memory-efficient at memorizing facts. You can probably memorize on the order of one fact or something per model parameter.
Whereas if you have some token in context, there are lots of keys and values at every layer. It could be a kilobyte, a megabyte of memory per token.
Jeff Dean 00:33:56
You take a word and you blow it up to 10 kilobytes or something.
Noam Shazeer 00:33:59
Yes. There's actually a lot of innovation going on around, okay, A, how do you minimize that? And B, what words do you need to have there? Are there better ways of accessing bits of that information?
Jeff seems like the right person to figure this out. Okay, what does our memory hierarchy look like from the SRAM all the way up to data center worldwide level?
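A back-of-the-envelope version of Noam's comparison (the model shape below is hypothetical): a parameter holds roughly one fact in a byte or two, while each token in context carries keys and values at every layer.

```python
def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128,
                       bytes_per_value=2):
    """Bytes of keys + values stored per token of context."""
    return n_layers * 2 * n_kv_heads * head_dim * bytes_per_value

per_token = kv_bytes_per_token()
print(f"{per_token / 1024:.0f} KiB per token")  # 320 KiB for this shape
print(f"1M-token context: {per_token * 1_000_000 / 2**30:.0f} GiB of KV cache")
```

That gap between bytes-per-fact in the weights and kilobytes-per-token in the cache is what the innovation Noam mentions is trying to close.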
Dwarkesh Patel 00:34:32
I want to talk more about the thing you mentioned: look, Google is a company with lots of code and lots of examples. If you just think about that one use case and what that implies, so you've got the Google monorepo. Maybe you figure out the long context thing, you can put the whole thing in context, or you fine-tune on it. Why hasn't this been already done?
You can imagine the amount of code that Google has proprietary access to, even if you're just using it internally to make your developers more efficient and productive.
Jeff Dean 00:35:09
To be clear, we have actually already done further training on a Gemini model on our internal code base for our internal developers. But that's different than attending to all of it because it sort of stirs together the code base into a bunch of parameters. Having it in context makes things clearer.
Even the further trained model internally is incredibly useful. Sundar, I think, has said that 25% of the characters that we're checking into our code base these days are generated by our AI-based coding models with kind of human oversight.
Dwarkesh Patel 00:35:49
How do you imagine, in the next year or two, based on the capabilities you see around the horizon, your own personal work? What will it be like to be a researcher at Google? You have a new idea or something. With the way in which you're interacting with these models in a year, what does that look like?
Noam Shazeer 00:36:04
Well, I assume we will have these models a lot better and hopefully be able to be much, much more productive.
Jeff Dean 00:36:15
Yeah, in addition to kind of research-y context, anytime you're seeing these models used, I think they're able to make software developers more productive because they can kind of take a high-level spec or sentence description of what you want done and give a pretty reasonable first cut at that. From a research perspective, maybe you can say, "I'd really like you to explore this kind of idea similar to the one in this paper, but maybe let's try making it convolutional or something."
If you could do that and have the system automatically generate a bunch of experimental code, and maybe you look at it and you're like, "Yeah, that looks good, run that." That seems like a nice dream direction to go in.
It seems plausible that in the next year or two, you might make a lot of progress on that.
Dwarkesh Patel 00:38:08
It seems under-hyped because you could have literally millions of extra employees, and you can immediately check their output, the employees can check each other's output, they immediately stream tokens.
Jeff Dean 00:38:21
Sorry, I didn't mean to underhype it. I think it's super exciting. I just don't like to hype things that aren't done yet.
Dwarkesh Patel 00:38:34
I do want to play with this idea more because it seems like a big deal if you have something kind of like an autonomous software engineer, especially from the perspective of a researcher who's like, "I want to build the system." Okay, so let's just play with this idea. As somebody who has worked on developing transformative systems through your careers, the idea that instead of having to code something like whatever today's equivalent of MapReduce or TensorFlow is, you just say, "Here's how I want a distributed AI library to look. Write it up for me."
Do you imagine you could be 10x more productive? 100x more productive?
Jeff Dean 00:39:13
I was pretty impressed. I think it was on Reddit that I saw we have a new experimental coding model that's much better at coding and math and so on. Someone external tried it, and they basically prompted it and said, "I'd like you to implement a SQL processing database system with no external dependencies, and please do that in C."
From what the person said, it actually did a quite good job. It generated a SQL parser and a tokenizer and a query planning system and some storage format for the data on disk and actually was able to handle simple queries. From that prompt, which is like a paragraph of text or something, to get even an initial cut at that seems like a big boost in productivity for software developers.
I think you might end up with other kinds of systems that maybe don't try to do that in a single semi-interactive, "respond in 40 seconds" kind of thing but might go off for 10 minutes and might interrupt you after five minutes saying, "I've done a lot of this, but now I need to get some input. Do you care about handling video or just images or something?" That seems like you'll need ways of managing the workflow if you have a lot of these background activities happening.
Dwarkesh Patel 00:40:44
Can you talk more about that? What interface do you imagine we might need if you could literally have millions of employees, hundreds of thousands of employees you could spin up on command, who are able to type incredibly fast? It's almost like you go from 1930s trading of tickets or something to modern Jane Street. You need some interface to keep track of all this that's going on: for the AIs to integrate into this big monorepo and leverage their own strengths, and for humans to keep track of what's happening. Basically, what is it like to be Jeff or Noam in three years, working day-to-day?
Noam Shazeer 00:42:26
It might be kind of similar to what we have now because we already have sort of parallelization as a major issue. We have lots and lots of really, really brilliant machine learning researchers, and we want them to all work together and build AI.
So actually, the parallelization among people might be similar to parallelization among machines. I think definitely it should be good for things that require a lot of exploration, like, "Come up with the next breakthrough."
If you have a brilliant idea that is just certain to work in the ML domain, then it has a 2% chance of working if you're brilliant. Mostly these things fail, but if you try 100 things or 1,000 things or a million things, then you might hit on something amazing. We have plenty of compute. Like modern top labs these days have probably a million times as much compute as it took to train Transformer.
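Noam's argument in numbers (using his 2% figure): if each idea independently has a small chance of working, volume wins.

```python
p = 0.02  # chance a single brilliant idea works, per Noam's estimate

for n in (1, 100, 1_000):
    p_at_least_one = 1 - (1 - p) ** n
    print(f"{n:>5} tries: P(at least one works) = {p_at_least_one:.3f}")
```

One try almost always fails; 100 tries succeed about 87% of the time; 1,000 tries essentially always hit something, which is why cheap, massively parallel exploration is worth so much compute.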
Dwarkesh Patel 00:43:41
Yeah, actually, so that's a really interesting idea. Suppose in the world today there are on the order of 10,000 AI researchers in this community coming up with a breakthrough-
Jeff Dean 00:43:52
Probably more than that. There were 15,000 at NeurIPS last week.
Noam Shazeer 00:43:55
Wow.
Dwarkesh Patel 00:43:57
100,000, I don't know.
Jeff Dean 00:43:58
Yeah, maybe. Sorry.
Dwarkesh Patel 00:44:00
No, no, it's good to have the correct order of magnitude. The odds that this community every year comes up with a breakthrough on the scale of a Transformer is, let's say, 10%. Now suppose this community is a thousand times bigger, and it is, in some sense, like this sort of parallel search of better architectures, better techniques.
Do we just get like-
Jeff Dean 00:44:22
A breakthrough a day?
Dwarkesh Patel 00:44:23
-breakthroughs every year or every day?
Noam Shazeer 00:44:25
Maybe. Sounds potentially good.
Dwarkesh Patel 00:44:30
But does that feel like what ML research is like? If you are able to try all these experiments…
Noam Shazeer 00:44:37
It's a good question because, I don't know, folks haven't been doing that as much. We definitely have lots of great ideas coming along. Everyone seems to want to run their experiment at maximum scale, but I think that's a human problem.
Jeff Dean 00:44:55
It's very helpful to have a 1/1000th scale problem and then vet 100,000 ideas on that, and then scale up the ones that seem promising.
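A toy sketch of that funnel (the "experiment" here is a random stub standing in for a real training run, and the thresholds are arbitrary): run everything at tiny scale, then promote only the survivors.

```python
import random

def run_experiment(idea: int, scale: float) -> float:
    """Stand-in for training: returns a fake validation score."""
    return random.Random(hash((idea, scale))).random()

ideas = range(100_000)
tiny = {i: run_experiment(i, scale=0.001) for i in ideas}      # cheap runs
survivors = [i for i in ideas if tiny[i] > 0.999]              # top ~0.1%
medium = {i: run_experiment(i, scale=0.1) for i in survivors}  # fewer, bigger
print(f"{len(survivors)} of {len(ideas)} ideas promoted to medium scale")
```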

Many problems can follow this pattern.
Dwarkesh Patel 00:44:06
So, one thing the world might not be taking seriously: people are aware that it's exponentially harder to make a model that's 100x bigger. It's 100x more compute, right? So people are worried that it's an exponentially harder problem to go from Gemini 2 to 3, or so forth.
But maybe people aren't aware of this other trend where Gemini 3 is coming up with all these different architectural ideas, trying them out, and you see what works, and you're constantly coming up with algorithmic progress that makes training the next one easier and easier. How far could you take that feedback loop?
Jeff Dean 00:45:43
I think one thing people should be aware of is that the improvements from generation to generation of these models are often partially driven by hardware and larger scale, but equally and perhaps even more so by major algorithmic improvements and major changes in the model architecture, the training data mix, and so on, which really make the model better per flop applied to it. I think that's a good realization. Then, if we have automated exploration of ideas, we'll be able to vet a lot more ideas and bring them into the actual production training for next generations of these models.
That's going to be really helpful because that's sort of what we're currently doing with a lot of brilliant machine learning researchers: looking at lots of ideas, winnowing them down to ones that seem to work well at small scale, seeing if they work well at medium scale, bringing them into larger-scale experiments, and then settling on adding a whole bunch of new and interesting things to the final model recipe. If we can do that 100 times faster through those machine learning researchers just gently steering a more automated search process, rather than hand-babysitting lots of experiments themselves, that's going to be really, really good.
Noam Shazeer 00:47:03
The one thing that doesn't speed up is experiments at the largest scale. You still end up doing these N = 1 experiments. Really, you just try to put a bunch of brilliant people in the room, have them stare at the thing, and figure out why this is working, why this is not working.
Jeff Dean 00:47:21
For that, more hardware is a good solution. And better hardware.
Noam Shazeer 00:47:25
Yes, we're counting on you.
Dwarkesh Patel 00:47:28
So, naively, there's the software side, the algorithmic improvements, that future AI can make. There's also the stuff you're working on. I'll let you describe it.
But if you get into a situation where, just at the software level, you can be making better and better chips in a matter of weeks and months, and better AIs can presumably do that better, how does this feedback loop not just end up with Gemini 3 taking two years, then Gemini 4 – or the equivalent level jump – taking six months, then level five taking three months, then one month? You get to superhuman intelligence much more rapidly than you might naively think, because of these improvements on both the hardware side and the algorithmic side.
Jeff Dean 00:48:26
I've been pretty excited lately about how we could dramatically speed up the chip design process. As we were talking earlier, the current way in which you design a chip takes you roughly 18 months to go from "we should build a chip" to something that you then hand over to TSMC and then TSMC takes four months to fab it, and then you get it back and you put it in your data centers.
So that's a pretty lengthy cycle, and the fab time in there is a pretty small portion of it today. But if you could make fab time the dominant portion, so that instead of taking 12 to 18 months to design the chip with 150 people, you could shrink that to a few people with a much more automated search process, exploring the whole design space of chips and getting feedback from all aspects of the chip design process on the high-level choices the system is trying to explore, then I think you could get much more exploration and more rapid design of something that you actually want to give to a fab.
That would be great because you can shrink fab time, and you can shrink the deployment time by designing the hardware in the right way, so that you just get the chips back and plug them into some system. That will then enable a lot more specialization and a shorter timeframe for the hardware design, so that you don't have to look out quite as far into what kind of ML algorithms would be interesting. Instead, you're asking what it should be six to nine months from now, rather than two or two and a half years.
That would be pretty cool. I do think that fabrication time, if that's in your inner loop of improvement, you're going to like...
Dwarkesh Patel 00:50:19
How long is it?
那需要多长时间?
Jeff Dean 00:50:20
The leading edge nodes, unfortunately, are taking longer and longer because they have more metal layers than previous, older nodes. So that tends to make it take anywhere from three to five months.
不幸的是,最先进的节点所需时间越来越长,因为它们比以前的老节点拥有更多的金属层。这通常使得制造时间需要三到五个月。
Dwarkesh Patel 00:50:32
Okay, but that's how long training runs take anyways, right? So you could potentially do both at the same time.
好的,但训练运行本来就需要这么长时间,对吧?所以你可能可以同时进行这两件事。
Jeff Dean 00:50:38
Potentially.
有可能。
Dwarkesh Patel 00:50:39
Okay, so I guess you can't get faster than three to five months. But at the same time, yeah, you're rapidly developing new algorithmic ideas.
好的,所以我猜最快也只能三到五个月。不过,同时,你们也在迅速发展新的算法点子。
Noam Shazeer 00:50:47
That can move fast.
那方面可以进展得很快。
Jeff Dean 00:50:48
That can move fast, that can run on existing chips and explore lots of cool ideas.
那方面可以进展得很快,可以在现有芯片上运行,并探索许多酷炫的创意。
Dwarkesh Patel 00:50:54
So, isn't that a situation in which you're... I think people sort of expect that there's going to be a sigmoid. Again, this is not a sure thing. But is this a possibility? The idea that you get an explosion of capabilities very rapidly toward the tail end of human intelligence, something that gets smarter and smarter at a more and more rapid rate?
那么,这是不是一种情况……我觉得人们似乎期待会出现一种 S 型曲线。再次说明,这并不是确定无疑的。但这是否有可能?也就是在逼近人类智能上限的阶段,能力迅速爆发,并以越来越快的速度变得越来越聪明?
Noam Shazeer 00:51:17
Quite possibly.
很有可能。
Jeff Dean 00:51:19
Yeah. I like to think of it like this. Right now, we have models that can take a pretty complicated problem and can break it down internally in the model into a bunch of steps, can sort of puzzle together the solutions for those steps, and can often give you a solution to the entire problem that you're asking.
是的。我喜欢这样看待问题。现在,我们有些模型能够处理相当复杂的问题,并在内部将问题分解为一系列步骤,拼凑出各个步骤的解决方案,通常能给出你所提问题的整体解答。
But it isn't super reliable, and it's good at breaking things down into five to ten steps, not 100 to 1,000 steps. So if you could go from, yeah, 80% of the time it can give you a perfect answer to something that's ten steps long to something that 90% of the time can give you a perfect answer to something that's 100 to 1,000 steps of sub-problem long, that would be an amazing improvement in the capability of these models. We're not there yet, but I think that's what we're aspirationally trying to get to.
不过,这些模型并不十分可靠,它们擅长将问题分解为五到十个步骤,而不是100到1,000个步骤。所以,如果我们能把情况从——80%的时间里它能对十步长的问题给出完美答案——提升到90%的时间里对100到1,000步子问题给出完美答案,那将极大提升这些模型的能力。我们还没达到那个水平,但我认为那是我们的理想目标。
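To get a feel for how big a jump that is: if you assume, as a simplification, that sub-steps succeed independently, the per-step reliability the model needs compounds. A quick sketch (the independence assumption is mine, not how these systems are actually evaluated):

```python
# End-to-end reliability compounds across sub-steps (assuming independence,
# which is a simplification; real failures are correlated).
step_success_10 = 0.80 ** (1 / 10)      # 80% success over 10 steps
step_success_1000 = 0.90 ** (1 / 1000)  # 90% success over 1,000 steps
print(f"{step_success_10:.4f} per-step reliability for 10-step tasks at 80%")
print(f"{step_success_1000:.5f} per-step reliability for 1,000-step tasks at 90%")
# ~97.8% per step today vs. ~99.99% per step in the aspirational regime.
```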
Noam Shazeer 00:52:14
We don't need new hardware for that, but we'll take it.
为此我们不需要新硬件,但有新硬件我们也照单全收。
Jeff Dean 00:52:20
Never look new hardware in the mouth.
送上门的新硬件,可不能嫌弃(白得的马不看牙口)。
Noam Shazeer 00:52:23
One of the big areas of improvement in the near future is inference time compute, applying more compute at inference time. I guess the way I like to describe it is that even a giant language model, even if you’re doing a trillion operations per token, which is more than most people are doing these days, operations cost something like 10 to the negative $18. And so you're getting a million tokens to the dollar.
近期改进的一个重要领域是推理时计算,也就是在推理阶段投入更多计算。我喜欢这样描述:即便是一个巨大的语言模型,即使每个词元执行一万亿次操作——这比如今大多数人做的还多——每次操作的成本大约是10的负18次方美元。这样你就能用一美元处理一百万个词元。
I mean compare that to a relatively cheap pastime: you go out and buy a paper book and read it, you're paying 10,000 tokens to the dollar. Talking to a language model is like 100 times cheaper than reading a paperback.
再拿一个相对廉价的爱好做对比:你出去买一本纸质书阅读,每美元大约能买到价值10,000个词元的内容。而与语言模型对话的成本却比阅读平装书便宜100倍。
So there is a huge amount of headroom there to say, okay, if we can make this thing more expensive but smarter, because we're 100x cheaper than reading a paperback, we're 10,000 times cheaper than talking to a customer support agent, or a million times or more cheaper than hiring a software engineer or talking to your doctor or lawyer. Can we add computation and make it smarter?
因此,这里有巨大的提升空间:假如我们能让这个系统更昂贵但更智能,因为我们的成本比阅读平装书便宜100倍,比与客服对话便宜10,000倍,甚至比雇佣软件工程师或咨询医生、律师便宜一百万倍或更多。那么,我们能否增加计算量使其变得更智能?
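A back-of-the-envelope version of that arithmetic, using the round numbers Noam quotes; these are illustrative orders of magnitude, not actual serving costs:

```python
# Noam's round numbers: ~1e12 ops per token at ~1e-18 dollars per op.
ops_per_token = 1e12
dollars_per_op = 1e-18

dollars_per_token = ops_per_token * dollars_per_op   # 1e-6 dollars per token
tokens_per_dollar = 1 / dollars_per_token            # 1e6 tokens per dollar

paperback_tokens_per_dollar = 1e4   # "10,000 tokens to the dollar"
print(f"{tokens_per_dollar:,.0f} tokens per dollar from the model")
print(f"{tokens_per_dollar / paperback_tokens_per_dollar:.0f}x cheaper than a paperback")
```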
I think a lot of the takeoff that we're going to see in the very near future is of this form. We've been exploiting and improving pre-training a lot in the past, and post-training, and those things will continue to improve. But taking advantage of "think harder" at inference time is just going to be an explosion.
我认为在不久的将来,我们将看到的许多突破正以这种形式出现。我们过去一直在不断利用和改进预训练和后训练,这些技术也会持续进步。但利用推理时“更用力思考”的策略,将带来一场爆炸式的提升。
Jeff Dean 00:54:21
Yeah, and an aspect of inference time is I think you want the system to be actively exploring a bunch of different potential solutions. Maybe it does some searches on its own, gets some information back, consumes that information, and figures out, oh, now I would really like to know more about this thing. So now it iteratively explores how to best solve the high-level problem you pose to this system.
是的,推理时的一个方面是,我认为你希望系统主动探索多种潜在解决方案。也许它会自主搜索,获取信息,再消化这些信息,然后发现,“哦,现在我真的想了解更多关于这件事的内容。”于是,它就迭代地探索如何最好地解决你提出的高层问题。
And I think having a dial where you can make the model give you better answers with more inference time compute seems like we have a bunch of techniques now that can kind of do that. The more you crank up the dial, the more it costs you in terms of compute, but the better the answers get.
而且,我觉得如果有一个调节器,让你通过增加推理时的计算来使模型给出更好的答案,这样的技术我们现在已经有不少了。你把调节器调得越高,计算成本就越高,但答案也会更好。
That seems like a nice trade-off to have, because sometimes you want to think really hard because it's a super important problem. Sometimes you probably don't want to spend enormous amounts of compute to compute “what's the answer to one plus one”. Maybe the system –
这似乎是一个不错的权衡,因为有时你确实需要深入思考,因为那是一个超级重要的问题;而有时你又可能不想花费大量计算资源去算“1加1等于几”。也许系统 ——
Dwarkesh Patel 00:55:22
Shouldn’t decide to come up with new axioms of set theory or whatever!
不应该因此去提出新的集合论公理之类的东西!
Jeff Dean 00:55:25
– should decide to use a calculator tool instead of a very large language model.
——而是应该选择使用计算器工具,而非一个体量庞大的语言模型。
Dwarkesh Patel 00:55:31
Interesting. So are there any impediments to taking inference time, like having some way in which you can just linearly scale up inference time compute? Or is this basically a problem that's sort of solved, and we know how to throw 100x compute, 1000x compute, and get correspondingly better results?
有趣。那么在推理时间上是否存在任何障碍,比如有没有办法使推理计算可以线性扩展?还是说这基本上是个已经“解决”的问题,我们知道如何投入100倍、1000倍的计算,从而获得相应更好的结果?
Noam Shazeer 00:55:50
We're working out the algorithms as we speak. So I believe we'll see better and better solutions to this as the many, more than 10,000, researchers hack at it, many of them at Google.
我们此时此刻就在研究这些算法。所以我相信,随着上万名研究人员(其中很多在谷歌)不断攻关,我们会看到越来越好的解决方案。
Jeff Dean 00:56:06
I think we do see some examples in our own experimental work where, if you apply more inference-time compute, say 10x, you can get better answers than with x amount of inference-time compute. And that seems useful and important.
我认为在我们自己的实验工作中确实看到了一些例子:如果你在推理时投入更多计算,比如 10 倍,你得到的答案会比只投入 x 的推理计算时更好。这看起来既有用又重要。
But I think what we would like is when you apply 10x to get even a bigger improvement in the quality of the answers than we're getting today. And so that's about designing new algorithms, trying new approaches, figuring out how best to spend that 10x instead of x to improve things.
但我认为我们希望的是,通过投入10倍的计算,能获得比今天更大幅度的答案质量提升。所以,这关乎于设计新算法、尝试新方法,找出如何用那10倍而不是x倍的计算来获得改进。
Dwarkesh Patel 00:56:44
Does it look more like search, or does it look more like just keeping going in the linear direction for a longer time?
这看起来更像是搜索,还是仅仅在更长时间内沿着线性方向持续前进?
Jeff Dean 00:56:49
I really like Rich Sutton's paper about the Bitter Lesson. It's effectively this nice one-page paper, and the essence of it is: you can try lots of approaches, but the two techniques that are incredibly effective are learning and search.
我非常喜欢Rich Sutton关于“苦涩教训”写的那篇论文,这篇论文实际上只有一页,但其精髓在于你可以尝试很多方法,而其中两种极为有效的技术是学习和搜索。
You can apply and scale those algorithmically or computationally, and you will often then get better results than with any other kind of approach, and you can apply them to a pretty broad variety of problems.
你可以在算法上或计算上应用并扩展这两种技术,而且通常会得到比其他任何方法都更好的结果,并且可以把它们应用于相当广泛的问题。
Search has got to be part of the solution to spending more inference time. Maybe you explore a few different ways of solving this problem, and that one didn't work, but this one worked better. I'm going to explore that a bit more.
搜索必须成为增加推理时间利用率解决方案的一部分。也许你会探索几种不同的方法来解决这个问题,有的可能行不通,而有的效果更好。我打算进一步探究这一点。
Dwarkesh Patel 00:57:36
How does this change your plans for future data center planning and so forth? Where can this kind of search be done asynchronously? Does it have to be online, offline? How does that change how big of a campus you need and those kinds of considerations?
这如何改变你们对未来数据中心规划等方面的计划?这种搜索可以异步完成吗?它必须在线进行,还是可以离线进行?这又如何改变你所需的数据中心规模以及相关考虑?
Jeff Dean 00:57:55
One general trend that's clear is that inference-time compute, where you have a model that's pretty much already trained and you want to do inference on it, is going to be a growing and important class of computation. Maybe you want to specialize hardware more around that.
一个普遍趋势是,显然推理时间的计算——你拥有一个几乎已经训练好的模型,并且你想对它进行推理——将成为一种日益重要的计算类别。也许你希望在这方面更专门化硬件。
Actually, the first TPU was specialized for inference and wasn't really designed for training. Then subsequent TPUs were really designed more around training and also for inference.
实际上,第一代TPU是专为推理设计的,并非真正为训练而设计。随后推出的TPU则更多地是围绕训练设计,同时也兼顾推理。
But it may be that when you have something where you really want to crank up the amount of compute you use at inference time, that even more specialized solutions will make a lot of sense.
但可能是这样,当你确实希望在推理时大幅增加计算量时,更专门化的解决方案将非常有意义。
Dwarkesh Patel 00:58:38
Does that mean you can accommodate more asynchronous training?
这是否意味着你可以容纳更多的异步训练?
Jeff Dean 00:58:41
Training? Or inference?
训练?还是推理?
Dwarkesh Patel 00:58:42
Or just that the different data centers don't need to talk to each other; you can just have them do a bunch of...
或者说,你可以让不同的数据中心不必相互通信,只需让它们各自处理一堆任务……
Jeff Dean 00:58:52
I like to think of it as, is the inference that you're trying to do latency-sensitive? Like a user is actively waiting for it, or is it a background thing? Maybe I have some inference tasks that I'm trying to run over a whole batch of data, but it's not for a particular user. It's just I want to run inference on it and extract some information.
我倾向于这样考虑:你所要做的推理是否对延迟敏感?比如用户是否在主动等待,还是只是后台任务?也许我有一些推理任务需要在一大批数据上运行,但这并非针对某个特定用户,而只是我想运行推理并提取一些信息。
There's probably a bunch of things that we don't really have very much of right now, but you're seeing inklings of it in our deep research tool that we just released, like a week ago. You can give it a pretty complicated, high-level task like, "Hey, can you go off and research the history of renewable energy and all the trends in costs for wind and solar and other kinds of techniques, and put it in a table and give me a full eight-page report?" And it will come back with an eight-page report with like 50 entries in the bibliography.
目前可能还有很多我们尚未充分实现的东西,但你已经可以在我们大约一周前刚刚发布的深度研究工具中看到一些端倪。你可以给它一个相当复杂的高层次任务,比如,“嘿,你能去研究一下可再生能源的历史以及风能、太阳能和其他技术成本趋势,并整理成表格,给我一份完整的八页报告吗?”它会返回一份包含大约50个参考文献条目的八页报告。
It's pretty remarkable. But you're not actively waiting for that for one second. It takes like a minute or two to go do that.
这真是相当了不起。但你不会为此积极等待哪怕一秒钟。它大概需要一两分钟来完成。
And I think there's going to be a fair bit of that kind of compute, and that's the kind of thing where you have some UI questions around it. Okay, if you're going to have a user with 20 of these kinds of asynchronous tasks happening in the background, and maybe each one of them needs to get more information from the user, like, "I found your flights to Berlin, but there are no non-stop ones. Are you okay with one that has a stop?" How does that flow work when you need a bit more information, and then you want to put it back in the background for it to continue doing, you know, finding the hotels in Berlin or whatever? I think it's going to be pretty interesting, and inference will be useful.
我认为这种计算将会相当普遍,这类任务还涉及一些用户界面的问题。好吧,如果你有一个用户在后台运行20个这样的异步任务,也许每个任务都需要从用户那里获取更多信息,比如,“我找到了飞往柏林的航班,但没有直飞的。你可以接受没有直飞的吗?”当你需要获取更多信息,然后又希望将其放回后台继续处理,比如继续查找柏林的酒店等,这个流程该如何运作?我认为这会非常有趣,而推理也将大有用处。
Noam Shazeer 01:00:33
Inference will be useful. There's also a compute efficiency in training that you don't have in inference. In general, transformers can use the sequence length as a batch during training, but they can't really in inference, because you're generating one token at a time. So there may be different hardware and inference algorithms that we design for the purposes of being efficient at inference.
推理将会很有用。训练中还有一种计算效率,是推理中所没有的。一般来说,Transformer 在训练时可以把序列长度当作一个批次来用,但在推理时基本做不到,因为你是一次生成一个词元。因此,我们可能会设计不同的硬件和推理算法,以便在推理时做到高效。
Jeff Dean 01:01:02
Yeah, a good example of an algorithmic improvement is the use of drafter models. You have a really small language model that decodes one token at a time and predicts, say, the next four tokens. Then you give those to the big model and you say, "Okay, here are the four tokens the little model came up with. Check which ones you agree with."
是的,一个很好的算法改进例子是使用草拟模型。你有一个非常小的语言模型,在解码时一次生成一个词元,先预测出接下来的四个词元。然后你把这些词元交给大模型,并说:“好,这里是小模型生成的四个词元,检查一下你同意哪些。”
If you agree with the first three, then you just advance. Then you've basically been able to do a four-token width parallel computation instead of a one-token width computation in the big model. Those are the kinds of things that people are looking at to improve inference efficiency, so you don't have this single-token decode bottleneck.
如果你同意前三个,那么你就直接推进。这样你基本上就能在大模型中完成一次四个词元宽度的并行计算,而不是一次一个词元的计算。这正是人们为提高推理效率所研究的方向,从而避免单词元解码的瓶颈。
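A minimal sketch of the drafter scheme (greedy speculative decoding). The two toy model functions below are stand-ins of my own, not a real API, and in production the big model scores all draft positions in one batched forward pass, which is where the speedup comes from:

```python
def drafter_next(ctx):             # tiny, cheap model: decodes one token at a time
    return (sum(ctx) + 1) % 50     # toy deterministic "prediction"

def big_model_next(ctx):           # big, expensive model: ground truth here
    return (sum(ctx) + 1) % 50     # identical toy rule, so every draft is
                                   # accepted; real acceptance is partial

def speculative_step(context, k=4):
    # 1) The drafter proposes k tokens autoregressively (cheap).
    draft, ctx = [], list(context)
    for _ in range(k):
        t = drafter_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) The big model checks the draft. In practice all k positions are
    #    scored in one parallel pass instead of k sequential decode steps.
    accepted, ctx = [], list(context)
    for t in draft:
        verdict = big_model_next(ctx)
        accepted.append(verdict)   # the big model's token either way
        ctx.append(verdict)
        if verdict != t:           # first disagreement: stop advancing
            break
    return accepted

print(speculative_step([1, 2, 3]))  # up to 4 tokens per big-model pass
```

The exact-match rule above is the greedy variant; sampling-based verification accepts drafts probabilistically instead.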
Noam Shazeer 01:01:46
Right, basically the big model's being used as a verifier.
没错,基本上大模型被用作验证者。
Jeff Dean 01:01:48
Right, “can you verify”, yeah.
对,“你能验证吗”,是的。
Noam Shazeer 01:01:50
[inaudible] generator and verification you can do.
[听不清] 你可以进行生成和验证。
Jeff Dean 01:01:52
Right. "Hello, how are you?" That sounds great to me. I'm going to advance past that.
对。“你好,你怎么样?”这在我看来没问题,我就接受它并继续往前推进。
Dwarkesh Patel 01:01:56
So, a big discussion has been about how we're already tapping out nuclear power plants in terms of delivering power into one single campus. Do we have to have just two gigawatts in one place, five gigawatts in one place, or can it be more distributed and still be able to train a model? Does this new regime of inference scaling make different considerations there plausible? How are you thinking about multi-data center training now?
因此,大家大谈特谈的是,我们在向单一园区供电方面已经接近核电站的极限。我们是否必须在一个地方只有两吉瓦、五吉瓦,还是可以更加分布式,同时还能训练模型?这种新的推理扩展模式是否使得不同的考虑成为可能?你现在如何看待多数据中心的训练?
Jeff Dean 01:02:31
We're already doing it. We're pro multi-data center training. I think in the Gemini 1.5 tech report, we said we used multiple metro areas and trained with some of the compute in each place. And then a pretty long latency but high bandwidth connection between those data centers, and that works fine.
我们已经在这么做了。我们支持多数据中心训练。我记得在Gemini 1.5技术报告中,我们提到使用了多个都市区,并在每个地方使用部分计算资源进行训练。然后在这些数据中心之间使用了延迟较长但带宽很高的连接,这样运作良好。
Training is kind of interesting because each step in a training process, for a large model, usually takes at least a few seconds or so. So the latency of being 50 milliseconds away doesn't matter that much.
训练过程相当有趣,因为对于大型模型来说,每一步通常至少需要几秒钟。因此,50毫秒的延迟并不那么重要。
Noam Shazeer 01:03:06
Just the bandwidth.
只是带宽问题。
Jeff Dean 01:03:08
Yeah, just bandwidth.
是的,只是带宽。
Noam Shazeer 01:03:10
As long as you can sync all of the parameters of the model across the different data centers and then accumulate all the gradients, in the time it takes to do one step, you're pretty good.
只要你能在不同数据中心之间同步模型的所有参数,并在执行一步的时间内累积所有梯度,就没问题了。
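A sketch of that feasibility check with made-up round numbers; the model size, precision, and step time are assumptions for illustration only:

```python
params = 1e12                    # hypothetical 1T-parameter model
bytes_per_param = 2              # bf16
step_time_s = 5.0                # "a few seconds" per training step

# Each step: ship gradients out and updated parameters back across metros.
bytes_per_step = 2 * params * bytes_per_param

required_bw = bytes_per_step / step_time_s
print(f"~{required_bw / 1e9:.0f} GB/s cross-metro bandwidth")   # ~800 GB/s
# Demanding, but it's a bandwidth problem, not a latency problem:
# a 50 ms round trip is negligible against a 5 s step.
```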
Jeff Dean 01:03:21
And then we have a bunch of work, even from early Brain days, when we were using CPU machines and they were really slow. We needed to do asynchronous training to help scale, where each copy of the model would do some local computation, send gradient updates to a centralized system, and then apply them asynchronously. Another copy of the model would be doing the same thing.
而且我们有很多工作,即便是从早期Brain时代,当时我们使用CPU机器且速度非常慢。我们需要进行异步训练以帮助扩展,每个模型副本都会进行一些局部计算,将梯度更新发送到集中系统,然后异步应用它们。另一个模型副本也会做同样的事情.
It makes your model parameters wiggle around a bit, and it makes people uncomfortable with the theoretical guarantees, but it actually seems to work in practice.
这会让你的模型参数有些波动,也会让人对理论保证感到不安,但实际上在实践中似乎效果不错。
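A toy rendering of that asynchronous scheme, in the spirit of the early Brain parameter-server setups; the one-parameter "model" and learning rate are purely illustrative:

```python
import threading

params = [0.0]                       # the centralized parameter copy
lock = threading.Lock()

def replica(steps=200):
    for _ in range(steps):
        with lock:
            local = params[0]        # fetch a possibly stale snapshot
        grad = 2 * (local - 3.0)     # gradient of (x - 3)^2 at the snapshot
        with lock:
            params[0] -= 0.01 * grad # apply asynchronously: other replicas may
                                     # have moved params since we read them

threads = [threading.Thread(target=replica) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(params[0])  # close to 3.0, but the trajectory "wiggles" nondeterministically
```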
Noam Shazeer 01:03:56
It was so pleasant to go from asynchronous to synchronous because your experiments are now replicable, rather than your results depend on whether there was a web crawler running on the same machine. So, I am so much happier running on TPU pods.
从异步转为同步真是令人愉快,因为这样你的实验现在是可复现的,而不是结果依赖于是否有网络爬虫在同一台机器上运行。因此,我在TPU集群上运行时要开心得多。
Jeff Dean 01:04:20
I love asynchrony. It just lets you scale so much more.
我喜欢异步。它让你能大大扩展规模。
Noam Shazeer 01:04:22
With these two iPhones and an Xbox or whatever.
用这两部iPhone和一台Xbox之类的设备。
Jeff Dean 01:04:25
Yeah, what if we could give you asynchronous but replicable results?
是啊,如果我们能给你异步但可复现的结果呢?
Noam Shazeer 01:04:29
Ooh.
哦。
Jeff Dean 01:04:31
So, one way to do that is you effectively record the sequence of operations, like which gradient update happened and when and on which batch of data. You don't necessarily record the actual gradient update in a log or something, but you could replay that log of operations so that you get repeatability. Then I think you'd be happy.
因此,实现这一点的一种方法是,你有效地记录操作序列,比如哪个梯度更新何时在哪一批数据上发生。你不必记录实际的梯度更新到日志中,但你可以重放该操作日志以获得可重复性。这样我认为你就会满意了。
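One hypothetical way to realize that: during the asynchronous run, log only the order of updates, then replay the log deterministically. A toy version with a one-parameter model:

```python
def grad(x, batch):                  # toy gradient: pull x toward the batch value
    return x - batch

def run_async(arrivals):             # stand-in for a real nondeterministic run
    x, log = 0.0, []
    for replica, batch in arrivals:  # in reality this order varies run to run
        x -= 0.1 * grad(x, batch)
        log.append((replica, batch)) # record which update happened when,
    return x, log                    # not the gradient values themselves

def replay(log):
    x = 0.0
    for replica, batch in log:       # reapply updates in the logged order
        x -= 0.1 * grad(x, batch)
    return x

x_final, log = run_async([(0, 1.0), (2, 5.0), (1, 3.0)])
assert replay(log) == x_final        # bit-identical reproduction
```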
Noam Shazeer 01:04:53
Possibly. At least you could debug what happened, but you wouldn't be able to necessarily compare two training runs. Because, okay, I made one change in the hyperparameter, but also I had a-
可能。至少你可以调试发生了什么,但你不一定能比较两个训练过程。因为,好吧,我在超参数上做了一个改变,但同时我还遇到了—
Jeff Dean 01:05:08
Web crawler.
网络爬虫。
Noam Shazeer 01:05:09
-web crawler messing up, and there were a lot of people streaming the Super Bowl at the same time.
——网络爬虫出问题,而且同时有很多人在直播超级碗。
Jeff Dean 01:05:19
The thing that led us to go from asynchronous training on CPUs to fully synchronous training is the fact that we have these super fast TPU hardware chips and pods, which have incredible amounts of bandwidth between the chips in a pod. Then, scaling beyond that, we have really good data center networks and even cross-metro area networks that enable us to scale to many, many pods in multiple metro areas for our largest training runs. We can do that fully synchronously.
促使我们从在CPU上进行异步训练转为完全同步训练的原因在于,我们拥有这些超高速的TPU硬件芯片和集群,它们在同一集群内的芯片间拥有惊人的带宽。进一步扩展,我们拥有非常优秀的数据中心网络,甚至跨都市区网络,使我们能够在多个都市区扩展到众多集群进行最大规模的训练。我们可以完全同步地实现这一点.
As Noam said, as long as the gradient accumulation and communication of the parameters across metro areas happens fast enough relative to the step time, you're golden. You don't really care. But I think as you scale up, there may be a push to have a bit more asynchrony in our systems than we have now, because we can make it work. Our ML researchers have been really happy with how far we've been able to push synchronous training, because it's an easier mental model to understand. You just have your algorithm fighting you, rather than the asynchrony and the algorithm both battling you.
正如 Noam 所说,只要跨都市区的梯度累积和参数通信相对于步长时间足够快,你就万事大吉了,其实并不用在意。但我认为,随着规模的扩大,我们的系统可能会被推向比现在更多的异步性,因为我们有能力让它可行。我们的机器学习研究人员非常高兴我们能把同步训练推到这么远,因为那是一种更容易理解的思维模型:只有算法在和你对抗,而不是异步性和算法一起和你较劲。
Noam Shazeer 01:06:28
As you scale up, there are more things fighting you. That's the problem with scaling, that you don't always know what it is that's fighting you. Is it the fact that you've pushed quantization a little too far in some place or another? Or is it your data?
随着规模扩大,会有更多因素与你对抗。这就是扩展的问题,你并不总知道到底是什么在与你对抗。是因为你在某处过度推进了量化?还是因为你的数据问题?
Jeff Dean 01:06:46
Maybe it's your adversarial machine MUQQ17 that is setting the seventh bit of the exponent in all your gradients or something.
也许是你那台捣乱的机器 MUQQ17 把你所有梯度的指数第七位都置位了之类的。
Noam Shazeer 01:06:56
Right. And all of these things just make the model slightly worse, so you don't even know that the thing is going on.
没错。而所有这些因素只会让模型稍微变差,因此你甚至不会察觉到这些问题的存在.
Jeff Dean 01:07:04
That's actually a bit of a problem with neural nets, is they're so tolerant of noise. You can have things set up kind of wrong in a lot of ways, and they just figure out ways to work around that or learn.
这实际上是神经网络的一个问题,它们对噪音过于容忍。你可以以各种方式设置得有些错误,而它们总能找到方法去绕过问题或者自行学习.
Noam Shazeer 01:07:15
You could have bugs in your code. Most of the time that does nothing. Some of the time it makes your model worse. Some of the time it makes your model better. Then you discover something new because you never tried this bug at scale before because you didn't have the budget for it.
你的代码中可能会有 bug。大多数时候它们不会产生任何影响,有时会让模型变差,有时反而会让模型变好。然后你会发现一些新现象,因为你之前没有在大规模上尝试过这个 bug,毕竟你当时没有足够的预算.
Dwarkesh Patel 01:07:33
What practically does it look like to debug or decode? You've got these things, some of which are making the model better, some of which are making it worse. When you go into work tomorrow, how do you figure out what the most salient inputs are?
在实际操作中,调试或解码到底是什么样子?有些因素会让模型变好,有些则会让模型变差。当你明天上班时,你如何找出最关键的输入呢?
Noam Shazeer 01:07:50
At small scale, you do lots of experiments. There's one part of the research that involves, okay, I want to invent these improvements or breakthroughs in isolation. In which case you want a nice simple code base that you can fork and hack, and some baselines.
在小规模实验中,你会进行大量试验。研究的一个方面是:好吧,我想单独发明这些改进或突破。在这种情况下,你需要一个简单的代码库,方便你分叉、修改,并且有一些基准供参考.
My dream is I wake up in the morning, come up with an idea, hack it up in a day, run some experiments, get some initial results in a day. Like okay this looks promising, these things worked, and these things didn't work.
我的梦想是:我早晨醒来,灵光一现,花一天时间改造代码,跑几个实验,一天内获得初步结果。比如说,“这看起来很有前景,这些方法奏效了,那些方法却没用.”
I think that is very achievable because-
我认为这是完全可行的,因为——
Jeff Dean 01:08:34
At small scale.
在小规模下是可行的.
Noam Shazeer 01:08:35
At small scale, as long as you keep a nice experimental code base.
只要在小规模下保持一个良好的实验代码库,就能做到.
Jeff Dean 01:08:41
Maybe an experiment takes an hour to run or two hours, not two weeks.
也许一个实验运行一两个小时,而不是两周.
Noam Shazeer 01:08:45
It’s great. So there's that part of the research, and then there's some amount of scaling up. Then you have the part which is integrating, where you want to stack all the improvements on top of each other and see if they work at large scale, and see if they work all in conjunction.
这非常棒。所以研究分为两部分,一部分是小规模的试验,然后是向大规模扩展。接着还有整合部分,即你要把所有改进叠加起来,看看它们在大规模下是否能协同工作.
Jeff Dean 01:09:02
Right, how do they interact? Right, you think maybe they're independent, but actually maybe there's some funny interaction between improving the way in which we handle video data input and the way in which we update the model parameters. Maybe that interacts more for video data than some other thing.
没错,它们之间如何相互作用呢?你可能认为它们是独立的,但实际上改善视频数据输入方式和更新模型参数的方式之间可能会产生一些奇妙的相互作用。也许这种相互作用在处理视频数据时更为显著.
There are all kinds of interactions that can happen that you maybe don't anticipate. So you want to run these experiments where you're then putting a bunch of things together and then periodically making sure that all the things you think are good are good together. If not, understanding why they're not playing nicely.
各种你可能没预料到的相互作用都可能发生。所以你需要做这些实验,把一堆因素组合在一起,然后定期确认你认为有效的所有方法是否能协同运作。如果不行,就得弄清楚为什么它们不能和谐共处.
Dwarkesh Patel 01:09:41
Two questions. One, how often does it end up being the case that things don't stack up well together? Is it like a rare thing or does it happen all the time?
有两个问题。首先,事情最终无法很好地叠加在一起的情况有多频繁?这是偶尔发生,还是经常发生?
Noam Shazeer 01:09:52
It happens 50% of the time.
大约发生 50% 的情况.
Jeff Dean 01:09:55
Yeah, I mean, I think most things you don't even try to stack because the initial experiment didn't work that well, or it showed results that aren't that promising relative to the baseline. Then you sort of take those things and you try to scale them up individually.
是的,我的意思是,大多数情况下你甚至不会尝试叠加那些初步实验效果不佳或者相较于基线结果不够理想的方法。然后你会分别尝试将这些方法单独放大.
Then you're like, "Oh yeah, these ones seem really promising." So I'm going to now include them in something that I'm going to now bundle together and try to advance and combine with other things that seem promising. Then you run the experiments and then you're like, "Oh, well, they didn't really work that well. Let's try to debug why."
接着你会想,“哦,这些看起来真的很有前景。”于是我会把它们整合到一起,与其他看起来有潜力的方法捆绑、推进、组合在一起。然后你运行实验,最后你会说,“哦,效果并不理想,让我们试着调试找出原因.”
Noam Shazeer 01:10:28
And then there are trade offs, because you want to keep your integrated system as clean as you can, because complexity –
然后存在权衡,因为你希望让你的集成系统尽可能保持整洁,因为复杂性 –
Jeff Dean 01:10:38
Codebase-wise.
就代码库而言.
Noam Shazeer 01:10:39
– yeah codebase and algorithmically. Complexity hurts, complexity makes things slower, introduces more risk.
– 是的,既指代码库也指算法上的问题。复杂性有害,会让事物运行变慢,并增加风险.
And then at the same time you want it to be as good as possible. And of course, every individual researcher wants his inventions to go into it. So there are definitely challenges there, but we've been working together quite well.
同时你又希望它尽可能地优秀。当然,每位研究者都希望自己的发明能被采纳。所以这里肯定存在挑战,但我们一直合作得非常顺利.
Dwarkesh Patel 01:11:05
Okay, so then going back to the whole dynamic of "you find better and better algorithmic improvements and the models get better and better over time", even if you take the hardware part out of it. Should the world be thinking more about this, and should you guys be thinking more about it?
好,那回到那个整体动态上,“你不断发现更好的算法改进,模型也随之不断提升”,即使把硬件部分排除在外。全世界应该更多地思考这一点,你们又是否也应如此?
There's one world where AI is a thing that takes two decades to slowly get better over time, and you can sort of refine things as you go. If you've kind of messed something up, you fix it, and it's not that big a deal, right? Each version is not that much better than the previous one you released.
有一种情况是,人工智能需要二十年才能慢慢变得更好,你可以逐步完善它。如果你搞砸了什么,再修正也没什么大不了的,对吧?它可能比你之前发布的版本只好那么一点点。
There's another world where you have this big feedback loop, which means that the two years between Gemini 4 and Gemini 5 are the most important years in human history. Because you go from a pretty good ML researcher to superhuman intelligence because of this feedback loop. To the extent that you think that the second world is plausible, how does that change how you sort of approach these greater and greater levels of intelligence?
而另一种情况则是存在一个巨大的反馈循环,这意味着Gemini 4和Gemini 5之间的两年将成为人类历史上最关键的年份。因为正是这个反馈循环,让你从一个相当优秀的机器学习研究者跃升为超越人类智慧的存在。如果你认为第二种情况是可能的,那这会如何改变你对更高智能层次的应对策略?
Noam Shazeer 01:12:14
I've stopped cleaning my garage because I'm waiting for the robots. So probably I'm more in the second camp of what we're going to see, a lot of acceleration.
我已经不再打扫车库了,因为我在等待机器人的到来。所以,我大概更倾向于第二种情况——即我们将看到大幅度的加速。
Jeff Dean 01:12:24
Yeah, I mean, I think it's super important to understand what's going on and what the trends are. And I think right now the trends are the models are getting substantially better generation over generation. I don't see that slowing down in the next few generations probably.
是啊,我的意思是,我认为理解当前发生了什么以及趋势何在非常重要。而目前的趋势是,模型一代比一代显著进步。我不认为在未来几代中这一趋势会减缓。
So that means the models say two to three generations from now are going to be capable of... Let's go back to the example of breaking down a simple task into 10 sub pieces and doing it 80% of the time, to something that can break down a task, a very high level task, into 100 or 1,000 pieces and get that right 90% of the time. That's a major, major step up in what the models are capable of.
这意味着,从现在起两到三代之后,模型将能做到……举个例子:从把一个简单任务分解成10个子部分,并80%的时间内正确完成,到将一个非常高级的任务分解成100或1000个部分,并有90%的正确率。这是模型能力上的一次巨大飞跃。
So I think it's important for people to understand what is happening in the progress in the field. And then those models are going to be applied in a bunch of different domains. I think it's really good to make sure that we, as a society, get the maximal benefits from what these models can do to improve things. I'm super excited about areas like education and healthcare, making information accessible to all people.
因此,我认为让大家了解该领域的进展十分重要。之后,这些模型将会应用于众多不同领域。我认为确保我们社会能够从这些模型所能改善的一切中获得最大利益非常好。我对诸如教育和医疗等领域感到非常兴奋,这将使所有人都能获取信息。
But we also realize that they could be used for misinformation, they could be used for automated hacking of computer systems, and we want to put as many safeguards and mitigations and understand the capabilities of the models in place as we can. I think Google as a whole has a really good view to how we should approach this. Our Responsible AI principles actually are a pretty nice framework for how to think about trade offs of making better and better AI systems available in different contexts and settings, while also sort of making sure that we're doing the right thing in terms of making sure they're safe and not saying toxic things and things like that.
但我们也意识到,它们可能被用于散布错误信息,可能被用于自动化黑客攻击计算机系统,因此我们希望尽可能多地部署保护措施和缓解策略,并充分了解模型的能力。我认为整个Google对于我们应如何应对这一点有着非常好的看法。我们的负责任人工智能原则实际上为如何在不同环境和场景下权衡提供更优AI系统的优势与风险提供了相当不错的框架,同时确保它们是安全的,不会说出有害或毒性的内容。
Dwarkesh Patel 01:14:21
I guess the thing that stands out to me, if you were zooming out and looking at this period of human history, if we're in the world where, look, if you do post-training on Gemini 3 badly, it can do some misinformation – but then you fix the post training. It's a bad mistake, but it's a fixable mistake, right?
我想令我印象深刻的是,如果你把视野放宽,放眼这段人类历史,如果我们处在这样一个世界中:比如说,如果你对Gemini 3进行后训练时出现问题,它可能会产生一些错误信息——但之后你修正了后训练。这是一个糟糕的错误,但却是一个可以修正的错误,对吧?
Noam Shazeer 01:14:40
Right.
对.
Dwarkesh Patel 01:14:40
Whereas if you have this feedback loop dynamic, which is a possibility, then the mistake of the thing that catapults this intelligence explosion is misaligned, is not trying to write the code you think it's trying to write, and [instead] optimizing for some other objective.
而如果你有这种反馈循环动态,这是一种可能性,那么导致智能爆炸的错误便是目标不对齐,不是在编写你认为它应该编写的代码,而是在为其他目标进行优化。
And on the other end of this very rapid process that lasts a couple of years, maybe less, you have things that are at or beyond Jeff Dean level, or at or beyond Noam Shazeer level. And then you have millions of copies of Jeff Dean-level programmers, and... anyways, that seems like a harder mistake to recover from.
而在这个可能只持续几年甚至更短时间的快速过程的另一端,你将拥有达到甚至超越 Jeff Dean 或 Noam Shazeer 水平的存在。接着你会有数以百万计 Jeff Dean 级别的程序员——总之,这似乎是一个更难挽回的错误。
Noam Shazeer 01:15:29
As these systems do get more powerful, you have to be more and more careful.
随着这些系统变得越来越强大,你必须变得愈发谨慎。
Jeff Dean 01:15:37
One thing I would say is, there are extreme views on either end. There's, "Oh my goodness, these systems are going to be so much better than humans at all things, and we're going to be kind of overwhelmed." And then there's, "These systems are going to be amazing, and we don't have to worry about them at all."
我想说的一点是,两极的观点都非常极端。一种观点认为,“哦天哪,这些系统在所有方面都会远超人类,我们将不堪重负。”而另一种观点则认为,“这些系统将会令人惊叹,我们根本不必担心它们。”
I think I'm somewhere in the middle. I've been a co-author on a paper called "Shaping AI." You know, those two extreme views often kind of view our role as laissez-faire, like we're just going to let AI develop along whatever path it takes.
我认为我处在中间立场。我曾是《塑造 AI》这篇论文的共同作者。你知道,那两种极端观点常常把我们的角色看作是放任不管,好像我们只会让 AI 沿着它自己的路径发展。
And I think there's actually a really good argument to be made that what we're going to do is try to shape and steer the way in which AI is deployed in the world so that it is, you know, maximally beneficial in the areas that we want to capture and benefit from, in education, some of the areas I mentioned, healthcare.
我认为实际上有一个很好的论据,即我们将努力塑造和引导 AI 在全球的部署方式,使其在我们希望收获并受益的领域中(比如教育、医疗等我提到的领域)发挥最大效益。
And steer it as much as we can, maybe with policy-related things, maybe with technical measures and safeguards, away from, you know, the computer taking over and having unlimited control of what it can do. So I think that's an engineering problem: how do you engineer safe systems?
并且尽可能将其引导远离那种“计算机接管、一切随它控制”的局面——或许通过政策措施,或通过技术手段与安全保障。所以我认为这是一个工程问题:如何设计安全的系统?
I think it's kind of the modern equivalent of what we've done in older-style software development. If you look at airplane software development, for example, that has a pretty good record of rigorously developing safe and secure systems for a pretty risky task.
我觉得这有点类似于我们在传统软件开发中所做的工作。比如说,航空软件开发在如何严谨地开发出用于完成高风险任务的安全可靠系统方面,有着相当不错的记录。
Dwarkesh Patel 01:17:18
The difficulty there is that there's not some feedback loop where the 737, you put it in a box with a bunch of compute for a couple of years, and it comes out with the version 1000.
问题在于,并不存在那种反馈循环:你把一架 737 放进装有大量计算资源的“盒子”中,经过几年后它就能升级到 1000 版本。
Noam Shazeer 01:17:27
I think the good news is that analyzing text seems to be easier than generating text. So I believe that the ability of language models to actually analyze language model output and figure out what is problematic or dangerous will actually be the solution to a lot of these control issues.
我认为好消息是,分析文本似乎比生成文本要容易。所以我相信,语言模型分析自身输出、识别出问题或危险之处的能力,实际上将成为解决许多控制问题的关键。
We are definitely working on this stuff. We've got a bunch of brilliant folks at Google working on this now. And I think it's just going to be more and more important, both from a “do something good for people” standpoint, but also from a business standpoint, that you are, a lot of the time, limited in what you can deploy based on keeping things safe.
我们肯定在致力于这些工作。目前有很多杰出的人才在 Google 正在从事这方面的研究。我认为,无论是从“为人们做好事”的角度,还是从商业角度来看,基于安全考虑而对你能部署的内容进行限制,这将变得越来越重要。
And so it becomes very, very important to be really, really good at that.
因此,在这方面做得非常出色就变得极为重要。
Dwarkesh Patel 01:18:48
Yeah, obviously, I know you guys take the potential benefits and costs here seriously, and it's truly remarkable. I know you get credit for it, but not enough. There are just so many different applications you've put out that use these models to make the areas you talked about better.
是的,很明显,我知道你们非常认真地对待这里的潜在利益与成本,这实在令人赞叹。我知道你们因此获得了一定的认可,但远远不够。我认为你们已经推出了如此多的应用,利用这些模型来改善你们所谈论的各个领域。
Um, but I do think that… again, if you have a situation where plausibly there's some feedback loop process, on the other end, you have a model that is as good as Noam Shazeer, as good as Jeff Dean.
嗯,但我确实认为……再说一次,如果出现一种情况,可能存在某种反馈循环过程,在另一端,你将拥有一个和 Noam Shazeer、和 Jeff Dean 同样优秀的模型。
If there's an evil version of you running around, and suppose there's a million of them, I think that's really, really bad. That could be much, much worse than any other risk, maybe short of nuclear war or something. Just think about it, like a million evil Jeff Deans or something.
如果出现一个邪恶版本的你在四处活动,并且假设有一百万个这样的存在,我认为那将是极其糟糕的。这可能比任何其他风险都要严重得多,或许仅次于核战争。试想一下,比如一百万个邪恶的 Jeff Dean。
Jeff Dean 01:19:47
Where do we get the training data?
我们从哪里获取训练数据?
Dwarkesh Patel 01:20:20
But, to the extent that you think that's a plausible output of some quick feedback loop process, what is your plan once we've got Gemini 3 or Gemini 4, and we think it's helping us do a better job of training future versions, and it's writing a bunch of the training code for us? From this point forward, we just kind of look it over and verify it.
但是,如果你认为这是某个快速反馈循环过程可能产生的合理结果,那么你的计划是什么?比如,我们有了 Gemini 3 或 Gemini 4,并认为它正在帮助我们更好地训练未来的版本,为我们编写了大量训练代码。从此以后,我们只需审查并验证它。
Even the verifiers you talked about of looking at the output of these models will eventually be trained by, or a lot of the code will be written by the AIs you make. What do you want to know for sure before we have the Gemini 4 help us with the AI research? We really want to make sure, we want to run this test on it before we let it write our AI code for us.
甚至你提到的那些检查这些模型输出的验证者,最终也将由你们打造的 AI 来训练,或者说很多代码将由它们来编写。在让 Gemini 4 协助我们的 AI 研究之前,你绝对希望知道些什么?我们真的希望在让它为我们编写 AI 代码之前,先对它进行测试。
Jeff Dean 01:21:34
I mean, I think having the system explore algorithmic research ideas seems like something where there's still a human in charge of that. Like, it's exploring the space, and then it's going to, like, get a bunch of results, and we're going to make a decision, like, are we going to incorporate this particular, you know, learning algorithm or change to the system into kind of the core code base?
我的意思是,我认为让系统探索算法研究思路看起来仍然是需要人类主导的事情。系统在探索这一领域,然后会获得一系列结果,我们会做出决定:比如,我们是否要将这种特定的学习算法或系统变更纳入核心代码库?
And so I think you can put in safeguards like that, which enable us to get the benefits of a system that can improve or self-improve with human oversight, without necessarily letting the system go full-on self-improving without any notion of a person looking at what it's doing, right? That's the kind of engineering safeguards I'm talking about, where you want to be looking at the characteristics of the systems you're deploying, not deploying ones that are harmful by some measures, and you have an understanding of what their capabilities are and what they're likely to do in certain scenarios. So, you know, I think it's not an easy problem by any means, but I do think it is possible to make these systems safe.
因此,我认为你可以设置这样的安全保障措施,使我们能够在人工监督下获得系统自我改进的好处,而不必让系统完全自我进化到无人监管的地步,对吧?这正是我所说的工程安全保障措施,你需要关注你所部署系统的特性,而不是部署那些在某些方面可能有害的系统,并且要了解它们的能力以及在特定场景下可能会做出什么样的反应。所以,我认为这绝不是个简单的问题,但我确实认为有可能使这些系统变得安全。
Noam Shazeer 01:36:56
Yeah. I mean, I think we are also going to use these systems a lot to check themselves, check other systems. Even as a human, it is easier to recognize something than to generate it.
是的,我的意思是,我认为我们也会大量使用这些系统来检查它们自己、检查其他系统。即使作为人类,识别某样东西也比生成它要容易。
Jeff Dean 01:37:14
One thing I would say is if you expose the model's capabilities through an API or through a user interface that people interact with, I think then you have a level of control to understand how is it being used and put some boundaries on what it can do. And that I think is one of the tools in the arsenal of how do you make sure that what it's going to do is sort of acceptable by some set of standards you've set out in your mind?
我想说的一点是,如果你通过 API 或用户界面展示模型的能力,让人们与之交互,那么你就能在某种程度上掌控它的使用方式,并对其能做的事情设定一些界限。我认为这正是确保它的行为符合你心中设定的一系列标准的工具之一。
Noam Shazeer 01:37:44
Yeah. I mean, I think the goal is to to empower people, but for the most part we should be mostly letting people do things with these systems that make sense and closing off as few parts of the space as we can. But yeah, if you let somebody take your thing and create a million evil software engineers, then that doesn't empower people because they're going to hurt others with a million evil software engineers. So I'm against that.
是的,我的意思是,我认为目标是赋能于人,但大多数情况下,我们应该让人们用这些系统做合理的事情,并尽可能不封闭任何可能的领域。但是,如果你允许某人拿走你的东西并造出一百万个邪恶的软件工程师,那就无法真正赋能于人,因为这些邪恶的软件工程师会伤害他人。所以我反对这种情况。
Jeff Dean 01:38:14
Me too. I'll go on.
我也是。接下来我继续说。
Dwarkesh Patel 01:38:16
All right, let's talk about a few more fun topics. Make it a little lighter. Over the last 25 years, what was the most fun time? What period of time do you have the most nostalgia over?
好了,让我们谈谈几个更有趣的话题,轻松一点。在过去25年中,哪段时间最有趣?你对哪段时间最怀念?
Jeff Dean 01:38:30
At work, you mean?
你的意思是在工作时?
Noam Shazeer 01:38:31
Yeah. At work. Yeah.
是的,在工作时。是的。
Jeff Dean 01:38:32
I think the early sort of four or five years at Google when I was one of a handful of people working on search and crawling and indexing systems, our traffic was growing tremendously fast. We were trying to expand our index size and make it so we updated it every minute instead of every month, or two months if something went wrong.
我想说的是,早期在 Google 的四五年时间里,当时我是少数几个从事搜索、爬虫和索引系统工作的人之一,我们的流量增长非常迅速。我们努力扩充索引规模,力图将更新频率从每月一次(或者如果出现问题每两个月一次)提升到每分钟一次。
Seeing the growth in usage of our systems was really just personally satisfying. Building something that is used by two billion people a day is pretty incredible.
看到我们系统的使用量不断增长,令我感到非常满足。构建出每天被二十亿人使用的东西,真是不可思议。
But I would also say equally exciting is working with people on the Gemini team today. I think the progress we've been making in what these models can do over the last year and a half is really fun. People are really dedicated, really excited about what we're doing.
但我也会说,与今天 Gemini 团队的人一起工作同样令人兴奋。我认为在过去一年半中,我们在这些模型所能实现的功能上取得的进展非常有趣。大家都非常敬业,对我们的工作充满热情。
I think the models are getting better and better at pretty complex tasks. Like if you showed someone using a computer 20 years ago what these models are capable of, they wouldn't believe it. And even five years ago, they might not believe it. And that's pretty satisfying.
我认为这些模型在处理相当复杂任务方面会越来越好。如果你向20年前使用电脑的人展示这些模型的能力,他们会难以置信;即使是五年前的人也可能不敢相信,而这正令人感到满足。
I think we'll see a similar growth in usage of these models and impact in the world.
我相信我们将看到这些模型在使用量和对世界的影响上实现类似的增长。
Noam Shazeer 01:39:48
Yeah, I'm with you. Early days were super fun. Part of that is just knowing everybody and the social aspect, and the fact that you're just building something that millions and millions of people are using.
是的,我赞同。早期的日子超级有趣,部分原因是认识每个人以及那种社交氛围,再加上你在构建一个每天被数百万人使用的东西。
Same thing today. We got that whole nice micro kitchen area where you get lots of people hanging out. I love being in person, working with a bunch of great people, and building something that's helping millions to billions of people. What could be better?
今天也是如此。我们有那个很棒的小厨房区域,很多人聚在一起。我喜欢面对面工作,与一群优秀的人共事,打造出帮助数百万甚至数十亿人的产品。还有什么比这更美好的呢?
Dwarkesh Patel 01:40:21
What was this micro kitchen?
那个小厨房到底是什么?
Jeff Dean 01:40:23
Oh, we have a micro kitchen area in the building we both sit in, the newly renamed Gradient Canopy. It used to be named Charleston East, and we decided we needed a more exciting name because there's a lot of machine learning and AI research happening in there.
哦,我们俩所在的那栋楼里有个小厨房区域,这栋楼最近改名为 Gradient Canopy。它以前叫 Charleston East,但我们觉得需要一个更带劲的名字,因为那里聚集了大量的机器学习和人工智能研究。
There's a micro kitchen area that we've set up with, normally it's just like an espresso machine and a bunch of snacks, but this particular one has a bunch of space in it. So we've set up maybe 50 desks in there, and so people are just hanging out in there. It's a little noisy because people are always grinding beans and brewing espresso, but you also get a lot of face-to-face ideas of connections, like, "Oh, I've tried that. Did you think about trying this in your idea?" Or, "Oh, we're going to launch this thing next week. How's the load test looking?" There's just lots of feedback that happens.
我们设立了一个小厨房区域,通常那里的设施只是一个浓缩咖啡机和一些零食,但这个区域特别宽敞。所以我们可能在那里设置了大约50个桌子,人们就可以在那里聚集。虽然那里有点吵,因为大家总是在研磨咖啡豆和煮浓缩咖啡,但你也能获得许多面对面的交流,比如,“哦,我试过那个。你有没有想过在你的想法中尝试这个?”或者,“哦,我们下周要发布这个东西,负载测试进行得怎么样?”各种反馈层出不穷。
And then we have our Gemini chat room for people who are not in that micro kitchen. We have a team all over the world, and there's probably 120 chat rooms I'm in related to Gemini things. In this particular very focused topic, we have seven people working on this, and there are exciting results being shared by the London colleagues.
另外,对于不在小厨房的人,我们还有专门的 Gemini 聊天室。我们的团队遍布全球,我大概参与了大约120个与 Gemini 相关的聊天室。在这个非常专注的主题中,有七个人在参与,并且伦敦的同事们分享了令人兴奋的成果。
When you wake up, you see what's happening in there, or it's a big group of people focused on data, and there are all kinds of issues happening in there. It's just fun.
当你早晨醒来,你就能看到那里正在发生的事情,或者看到一大群人专注于数据,各种问题纷至沓来。真是太有趣了。
Dwarkesh Patel 01:41:50
What I find remarkable about some of the calls you guys have made is that you're anticipating a level of demand for compute which at the time wasn't obvious or evident. TPUs are a famous example of this, the first TPU in particular.
我觉得你们所做的一些决策中最令人惊讶的是,你们预见到了计算需求的水平,而这在当时并不明显。TPU 就是一个著名的例子,尤其是第一代 TPU。
That thinking you had in, I guess, 2013 or earlier, if you think about it that way today and you do an estimate of, look, we're going to have these models that are going to be a backbone of our services, and we're going to be doing constant inference for them. We're going to be training future versions. And you think about the amount of compute we'll need by 2030 to accommodate all these use cases, where does the Fermi estimate get you?
这种思维,我猜是在2013年或更早的时候就有的,如果你今天这样考虑,并做出估算,比如说,我们将会拥有这些作为我们服务支柱的模型,并且我们将不断为它们进行推理,同时训练未来的版本。考虑到2030年为了适应所有这些用例我们所需要的计算量,费米估计会给你什么结果呢?
Jeff Dean 01:42:30
Yeah, I mean, I think you're going to want a lot of inference compute. That's the rough, highest-level view of serving these capable models, because if one of the techniques for improving their quality is scaling up the amount of inference compute you use, then all of a sudden what's currently one request to generate some tokens becomes 50 or 100 or 1,000 times as computationally intensive, even though it's producing the same amount of output.
是的,我的意思是,我认为你会需要大量的推理计算。这是对部署这些高性能模型的一个粗略的、最高层次的判断,因为如果提高模型质量的方法之一是扩大推理所用的计算量,那么突然之间,现在生成一些词元的一个请求就会变成 50、100 甚至 1000 倍的计算密集型任务,尽管产出的内容量相同。
And you're also going to then see tremendous scaling up of the uses of these services as not everyone in the world has discovered these chat-based conversational interfaces where you can get them to do all kinds of amazing things. Probably 10% of the computer users in the world have discovered that today, or 20%. As that pushes towards 100% and people make heavier use of it, that's going to be another order of magnitude or two of scaling.
而且,你还会看到这些服务的使用量大幅增长,因为世界上并非所有人都已经发现这些基于对话界面的聊天系统,它们可以让你完成各种令人惊叹的事情。可能今天世界上只有10%或20%的电脑用户已经发现了这一点。随着这一比例向100%推进,人们的使用会更加频繁,这将带来额外一到两个数量级的扩展.
And so you're going to have two orders of magnitude from that, and an order of magnitude or two from that. The models are probably going to be bigger, so you'll get another order of magnitude or two from that. And that's a lot of inference compute you're going to want. So you want extremely efficient hardware for inference on the models you care about.
因此,这一项会带来两个数量级,那一项又会带来一两个数量级;模型本身可能还会更大,这又是一到两个数量级。所以你会需要非常大量的推理计算,因此你需要为你关注的模型配备极其高效的推理硬件。
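Stacking those rough multipliers as a Fermi estimate; each factor below is one hypothetical pick from the ranges Jeff mentions:

```python
inference_scaling = 100   # "50 or 100 or 1,000 times" per request: take 100x
adoption = 10             # 10-20% of users today growing toward everyone
model_size = 10           # "another order of magnitude or two"

print(f"~{inference_scaling * adoption * model_size:,}x today's inference compute")
# ~10,000x with these picks; the exact factors matter less than the stacking.
```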
Dwarkesh Patel 01:43:52
In flops, total global inference in 2030?
以浮点运算次数计,2030年全球总推理计算量是多少?
Noam Shazeer 01:43:58
I think just more is always going to be better. If you just kind of think about, okay, what fraction of world GDP will people decide to spend on AI at that point? And then, like, okay, what do the AI systems look like?
我认为更多总是更好。如果你考虑一下,到那时人们会决定将世界GDP的多少比例花在人工智能上?然后,再想象一下,AI系统会是什么样子?
Well, maybe it's some sort of personal assistant-like thing that is in your glasses and can see everything around you and has access to all your digital information and the world's digital information. And maybe it's like you're Joe Biden, and you have the earpiece and the cabinet that can advise you about anything in real-time and solve problems for you and give you helpful pointers. Or you could talk to it, and it wants to analyze anything that it sees around you for any potential useful impact that it has on you.
或许它会是某种类似个人助理的东西,装在你的眼镜上,能够看到你周围的一切,并且可以访问你所有的数字信息以及全世界的数字信息。也许就像你是乔·拜登,你既有耳机又有内阁班子,可以实时为你提供建议、解决问题并给出有用的提示。或者你可以与它交谈,它会分析你周围所见的一切,寻找任何可能对你有益的影响。
So I mean, I can imagine, okay, and then say it's like your personal assistant or your personal cabinet or something, and that every time you spend 2x as much money on compute, the thing gets like 5, 10 IQ points smarter or something like that. And, okay, would you rather spend $10 a day and have an assistant or $20 a day and have a smarter assistant? And not only is it an assistant in life but an assistant in getting your job done better because now it makes you from a 10x engineer to a 100x or 10 millionx engineer?
所以我的意思是,我可以想象,然后说这就像是你的私人助理或你的私人内阁,每当你在计算上花费双倍资金时,这个东西就会变得更聪明,大约增加5、10个智商点之类的。那么,你是愿意每天花10美元拥有一个助理,还是花20美元拥有一个更聪明的助理?而且,它不仅在生活中是一个助理,还能帮助你更好地完成工作,因为它能让你从一个10倍工程师变成一个100倍甚至一千万倍的工程师.
Okay, so let's see: from first principles, right? So people are going to want to spend some fraction of world GDP on this thing. The world GDP is almost certainly going to go way, way up, two orders of magnitude higher than it is today, due to the fact that we have all of these artificial engineers working on improving things.
好,来看看:从基本原理出发,人们肯定会愿意在这上面花费世界GDP的一部分。由于我们有了所有这些人工工程师在不断改进东西,世界GDP几乎肯定会大幅上升,比现在高出两个数量级.
Probably we'll have solved unlimited energy and carbon issues by that point. So we should be able to have lots of energy. We should be able to have millions to billions of robots building us data centers. Let's see, the sun is what, 10 to the 26 watts or something like that?
到那时,我们可能已经解决了无限能源和碳排放问题。因此,我们应该能够拥有大量能源,也能有数百万到数十亿个机器人为我们建造数据中心。再看看,太阳大概输出10的26次方瓦特左右?
I'm guessing that the amount of compute being used for AI to help each person will be astronomical.
我猜,为了帮助每个人所使用的AI计算量将会是惊人的.
Jeff Dean 01:46:48
I would add on to that. I'm not sure I agree completely, but it's a pretty interesting thought experiment to go in that direction. And even if you get partway there, it's definitely going to be a lot of compute.
我想补充一点。我不确定我完全同意,但朝这个方向进行的思考实验确实非常有趣。即使你仅仅走到一半,也肯定需要大量的计算.
And this is why it's super important to have as cheap a hardware platform for using these models and applying them to problems that Noam described, so that you can then make it accessible to everyone in some form and have as low a cost for access to these capabilities as you possibly can.
这就是为什么拥有一个尽可能廉价的硬件平台来使用这些模型并将它们应用到 Noam 所描述的问题上如此重要,这样你才能以某种形式使其对所有人都可用,并尽可能降低获取这些能力的成本.
And I think that's achievable by focusing on hardware and model co-design kinds of things, we should be able to make these things much, much more efficient than they are today.
我认为通过专注于硬件与模型协同设计,我们应该能够让这些东西比今天更加高效得多.
Dwarkesh Patel 01:47:36
Is Google's data center build-out plan over the next few years aggressive enough given this increase in demand you're expecting?
鉴于你预期的这种需求增长,谷歌未来几年扩建数据中心的计划是否足够激进?
Jeff Dean 01:47:46
I'm not going to comment on our future capital spending because our CEO and CFO would prefer I probably not. But I will say, you can look at our past capital expenditures over the last few years and see that we're definitely investing in this area because we think it's important.
我不打算对我们未来的资本支出发表评论,因为我们的CEO和CFO可能不希望我这么做。但我可以说,你可以看看我们过去几年的资本支出,你会发现我们确实在这个领域进行了大量投资,因为我们认为这很重要.
We are continuing to build new and interesting, innovative hardware that we think really helps us have an edge in deploying these systems to more and more people, both training them and also, how do we make them usable by people for inference?
我们正在持续构建新的、有趣的、创新的硬件,我们认为这将真正帮助我们在向越来越多的人部署这些系统时占据优势,无论是在训练还是在让人们用于推理方面.
Dwarkesh Patel 01:48:21
One thing I've heard you talk a lot about is continual learning, the idea that you could just have a model which improves over time rather than having to start from scratch. Is there any fundamental impediment to that? Because theoretically, you should just be able to keep fine-tuning a model. What does that future look like to you?
我听你多次谈到的一件事是持续学习,即你可以拥有一个随着时间不断改进的模型,而不必从零开始。对此有没有什么根本性的障碍?因为理论上,你应该只需不断微调一个模型。你眼中的那样的未来是什么样子的?
Jeff Dean 01:48:40
Yeah, I've been thinking about this more and more. I've been a big fan of models that are sparse because I think you want different parts of the model to be good at different things. We have our Gemini 1.5 Pro model, and other models are mixture-of-experts style models where you now have parts of the model that are activated for some token and parts that are not activated at all because you've decided this is a math-oriented thing, and this part's good at math, and this part's good at understanding cat images. So, that gives you this ability to have a much more capable model that's still quite efficient at inference time because it has very large capacity, but you activate a small part of it.
是的,我越来越多地在思考这个问题。我一直很喜欢稀疏模型,因为我认为你希望模型的不同部分擅长不同的事情。我们有 Gemini 1.5 Pro 模型,其他模型则采用专家混合的风格,这样你就会有一些部分在处理某个标记时被激活,而其他部分则完全不被激活,因为你已经决定这是一个面向数学的任务,这部分擅长数学,而那部分擅长理解猫的图像。这样就赋予了你构建一个在推理时非常高效的、具备巨大容量但只激活其中一小部分的更强大模型的能力.
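A minimal numpy sketch of that sparse activation: a learned router sends each token to its top-k experts, so only a small slice of the total capacity runs per token. The sizes here are tiny illustrative assumptions, not Gemini's:

```python
import numpy as np

d_model, n_experts, k = 16, 8, 2
rng = np.random.default_rng(0)
W_router = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):                      # x: (d_model,), one token's activations
    logits = x @ W_router
    top = np.argsort(logits)[-k:]      # this token's top-k experts
    w = np.exp(logits[top])
    w /= w.sum()                       # softmax over just the chosen experts
    # Only k of n_experts weight matrices are touched: large total capacity,
    # small per-token compute.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

print(moe_layer(rng.normal(size=d_model)).shape)   # (16,)
```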
But I think the current problem, well, one limitation of what we're doing today is it's still a very regular structure where each of the experts is the same size. The paths merge back together very fast. They don't go off and have lots of different branches for mathy things that don't merge back together with the kind of cat-image thing.
但我认为当前的问题是,我们今天所做的仍然是一个非常规则的结构,每个专家的大小都是相同的。各个路径很快就合并在一起,它们不会分开发展出很多用于数学计算而与处理猫图像那类任务不合并的分支.
I think we should probably have a more organic structure in these things. I also would like it if the pieces of those model of the model could be developed a little bit independently. Like right now, I think we have this issue where we're going to train a model. So, we do a bunch of preparation work on deciding the most awesome algorithms we can come up with and the most awesome data mix we can come up with.
我认为我们可能需要一个更有机的结构。我也希望这些模型的各个部分能够稍微独立地发展。就像现在,我认为我们存在这样一个问题:我们要训练一个模型,所以我们会做大量准备工作,决定我们能提出的最出色的算法和最棒的数据组合.
But there's always trade-offs there, like we'd love to include more multilingual data, but that might come at the expense of including less coding data, and so, the model's less good at coding but better at multilingual, or vice versa. I think it would be really great if we could have a small set of people who care about a particular subset of languages go off and create really good training data, train a modular piece of a model that we can then hook up to a larger model that improves its capability in, say, Southeast Asian languages or in reasoning about Haskell code or something.
但总会有权衡,比如我们希望包含更多的多语种数据,但这可能以牺牲部分编码数据为代价,导致模型在编码上表现较差但在多语种上表现更好,反之亦然。我认为如果我们能有一小部分人专注于某个特定语种的改进,去创建非常好的训练数据,训练出一个模块化的模型部分,然后再将其连接到一个更大的模型上,以提升它在例如东南亚语言或推理Haskell代码方面的能力,那将会非常好.
Then, you also have a nice software engineering benefit where you've decomposed the problem a bit compared to what we do today, which is we have this kind of a whole bunch of people working. But then, we have this kind of monolithic process of starting to do pre-training on this model.
此外,这还具有很好的软件工程优势,因为相比于我们今天那种让一大群人一起工作的单一整体的预训练过程,这种方式能够将问题分解开来.
If we could do that, you could have 100 teams around Google. You could have people all around the world working to improve languages they care about or particular problems they care about and all collectively work on improving the model. And that's kind of a form of continual learning.
如果我们能做到这一点,谷歌周围就可以有100个团队,你可以让全球各地的人专注于改进他们关心的语种或特定问题,共同努力提升模型。这就算是一种持续学习的形式.
Noam Shazeer 01:51:27
That would be so nice. You could just glue models together or rip out pieces of models and shove them into other...
那就太好了。你可以把模型粘合在一起,或者拆下模型的部分,再塞进其他模型里……
Jeff Dean 01:51:33
Upgrade this piece without throwing out the thing...
升级这一部分而不把整个东西废弃掉……
Noam Shazeer 01:51:36
...or you just attach a fire hose and suck all the information out of this model and shove it into another model. I mean, the countervailing interest there is sort of science: we're still in the period of rapid progress, so you want to do controlled experiments, comparing this thing to that thing, because that helps us figure out what to build. In that interest, it's often best to just start from scratch, so you can compare one complete training run to another complete training run at the practical level. It's less exciting but does lead to rapid progress.
……或者你直接接上一根消防水管,把这个模型的所有信息都抽出来,灌进另一个模型里。我的意思是,这里有一个相反的考量,那就是科学:我们仍处在快速进步的阶段,所以你会想做受控实验,比较这个和那个,因为这有助于我们确定未来该构建什么。出于这种考虑,通常最好从零开始,这样你就能在实践层面对比一次完整的训练和另一次完整的训练。这不那么刺激,但确实能带来快速进步。
Jeff Dean 01:52:32
Yeah, I think there may be ways to get a lot of the benefits of that with a version system of modularity. I have a frozen version of my model, and then I include a different variant of some particular module, and I want to compare its performance or train it a bit more. Then, I compare it to the baseline of this thing with now version N prime of this particular module that does Haskell interpretation.
是的,我认为通过模块化的版本系统可能有办法获得许多好处。我有一个冻结版的模型,然后我引入某个特定模块的不同变体,想比较它的性能或进一步训练它。接着,我将其与基线进行比较,这个基线是当前版本中执行 Haskell 解释的特定模块的 N' 版本。
Noam Shazeer 01:52:58
Actually, that could lead to faster research progress, right? You've got some system, and you do something to improve it. And if that thing you're doing to improve it is relatively cheap compared to training the system from scratch, then it could actually make research much, much cheaper and faster.
实际上,这可能会带来更快的研究进展,对吧?你有一个系统,然后对它做些改进。如果这种改进相对于从零开始训练系统来说成本较低,那么它实际上可以让研究变得便宜得多、速度更快。
Jeff Dean 01:53:16
Yeah, and also more parallelizable, I think, across people.
是的,我认为这也能让工作在人员之间更易并行化。
Noam Shazeer 01:53:24
Okay, let's figure it out and do that next.
好吧,那我们找出办法,接下来就这么做。
Dwarkesh Patel 01:53:29
So, this idea that is sort of casually laid out there would actually be a big regime shift compared to how things are done today. If you think the way things are headed, this is a sort of very interesting prediction about... You just have this blob where things are getting pipelined back and forth – and if you want to make something better, you can do like a sort of surgical incision almost.
所以,这个看似随意提出的想法,实际上与今天的做法相比会带来巨大的制度变革。如果你考虑未来的发展方向,这算是一种非常有趣的预测……你就像有一个不断来回流水线作业的整体——如果你想让某个东西变得更好,你几乎可以做出一种类似外科手术般的切口。
Jeff Dean 01:53:55
Right, or grow the model, add another little bit of it here. Yeah, I've been sort of sketching out this vision for a while in Pathways...
对,或者扩展模型,在这里增加一点东西。是的,我在 Pathways 中已经构想这个愿景有一段时间了……
Noam Shazeer 01:54:04
Yeah, you've been building the...
是的,你一直在构建……
Jeff Dean 01:54:05
...and we've been building the infrastructure for it. So, a lot of what Pathways, the system, can support is this kind of twisty, weird model with asynchronous updates to different pieces. And we're using Pathways to train our Gemini models, but we're not making use of some of its capabilities yet. But maybe we should.
Noam Shazeer 01:54:24
Ooh maybe.
Dwarkesh Patel 01:54:27
This is so interesting, and I don't want to lose this thread, but give me one moment.
Noam Shazeer 01:54:33
There have been times when this worked out beautifully, like the way the TPU pods were set up. I don't know who did that, but they did a pretty brilliant job. The low-level software stack and the hardware stack: you've got your nice, regular, high-performance hardware, you've got these great torus-shaped interconnects, and then you've got the right low-level collectives, the all-reduces, et cetera, which I guess came from supercomputing, but it turned out to be just the right thing to build distributed deep learning on top of.
Dwarkesh Patel 01:39:15
Okay, so a couple of questions. One, suppose Noam makes another breakthrough, and now we've got a better architecture. Would you just take each compartment and distill it into this better architecture? And that's how it keeps improving over time?
Jeff Dean 01:39:33
I do think distillation is a really useful tool because it enables you to transform a model in its current model architecture form into a different form. Often, you use it to take a really capable but large and unwieldy model and distill it into a smaller one that maybe you want to serve with really good, fast latency inference characteristics.
But I think you can also view this as something that's happening at the module level. Maybe there'd be a continual process where you have each module, and it has a few different representations of itself. It has a really big one. It's got a much smaller one that is continually distilling into the small version.
And then the small version, once that's finished, you sort of delete the big one and you add a bunch more parameter capacity. Now, start to learn all the things that the distilled small one doesn't know by training it on more data, and then you kind of repeat that process. If you have that kind of running a thousand different places in your modular model in the background, that seems like it would work reasonably well.
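A rough sketch of that distill-then-regrow cycle, assuming softmax-output modules: the "big" and "small" modules here are toy weight matrices, and the hand-rolled training loop is purely illustrative, not how production distillation is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-9):
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1).mean()

d_in, d_out = 8, 16
big = rng.normal(size=(d_in, d_out))           # large, capable module
small = rng.normal(size=(d_in, d_out)) * 0.01  # compact replacement

x = rng.normal(size=(256, d_in))
lr = 0.5
for step in range(200):
    teacher = softmax(x @ big)
    student = softmax(x @ small)
    # Gradient of KL(teacher || student) with respect to student logits.
    grad_logits = (student - teacher) / len(x)
    small -= lr * x.T @ grad_logits

print("final distillation KL:", kl(softmax(x @ big), softmax(x @ small)))
# Once the small module matches, the big one can be dropped and fresh
# capacity added, repeating the cycle in the background.
```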
Dwarkesh Patel 01:40:42
This could be a way of doing inference scaling, like the router decides how much of the big one you want.
Jeff Dean 01:40:47
Yeah, you can have multiple versions. Oh, this is an easy math problem, so I'm going to route it to the really tiny math distilled thing. Oh, this one's really hard, so...
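As a toy illustration of that routing, the sketch below sends easy queries to a small distilled module and hard ones to the full model; the difficulty score is a made-up stand-in for what a learned router would predict.

```python
def tiny_math_module(a, b):
    return a + b  # stands in for a small, distilled expert

def big_model(a, b):
    # Stands in for the large, expensive model (same answer, more compute).
    return sum([a] + [1] * b)

def route(a, b):
    difficulty = abs(a) + abs(b)  # crude proxy for a learned difficulty score
    return tiny_math_module(a, b) if difficulty < 100 else big_model(a, b)

print(route(2, 2))      # easy -> tiny distilled module
print(route(123, 456))  # hard -> big model
```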
Dwarkesh Patel 01:40:56
One, at least from public research, it seems like it's often hard to decode what each expert is doing in Mixture of Experts-type models. If you have something like this, how would you enforce the kind of modularity that would be visible and understandable to us?
Noam Shazeer 01:41:13
Actually, in the past, I found experts to be relatively easy to understand. I mean, the first Mixture of Experts paper, you could just look at the experts.
Dwarkesh Patel 01:41:24
“I don’t know, I'm only the inventor of Mixture of Experts.”
Noam Shazeer 01:57:25
Yeah – oh, the what?
Jeff Dean 01:41:28
Yeah, yeah.
Noam Shazeer 01:41:30
Yeah, you could just see it. We did, you know, a thousand, two thousand experts. Okay, and this expert was getting words referring to cylindrical objects.
Jeff Dean 01:41:42
This one's super good at dates.
Noam Shazeer 01:41:44
Yeah.
Jeff Dean 01:41:45
Talking about times.
Noam Shazeer 01:41:46
Yeah, pretty easy to do. Not that you would need that human understanding to figure out how to work the thing at runtime, because you just have some sort of learned router that's looking at the example.
Jeff Dean 01:42:04
One thing I would say is there is a bunch of work on interpretability of models and what are they doing inside. Sort of expert-level interpretability is a sub-problem of that broader area. I really like some of the work that my former intern, Chris Olah, and others did at Anthropic, where they trained a very sparse autoencoder and were able to deduce what characteristics some particular neuron in a large language model has, so they found a Golden Gate Bridge neuron that's activated when you're talking about the Golden Gate Bridge. And I think you could do that at the expert level, you could do that at a variety of different levels and get pretty interpretable results, and it's a little unclear if you necessarily need that. If the model is just really good at stuff, we don't necessarily care what every neuron in the Gemini model is doing, as long as the collective output and characteristics of the overall system are good. That's one of the beauties of deep learning, is you don't need to understand or hand-engineer every last feature.
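For reference, the sparse-autoencoder recipe Jeff describes looks roughly like the sketch below: train an overcomplete autoencoder on captured activations with an L1 sparsity penalty, then inspect which inputs make each feature fire. All shapes, data, and hyperparameters here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict, n = 32, 128, 4096
acts = rng.normal(size=(n, d_model))  # stand-in for captured LLM activations

W_enc = rng.normal(size=(d_model, d_dict)) * 0.1
W_dec = rng.normal(size=(d_dict, d_model)) * 0.1
lr, l1 = 1e-2, 1e-3

for step in range(300):
    f = np.maximum(acts @ W_enc, 0.0)  # sparse feature activations (ReLU)
    recon = f @ W_dec
    err = recon - acts
    # Hand-rolled gradients for reconstruction MSE + L1 sparsity on features.
    g_f = (err @ W_dec.T + l1 * np.sign(f)) * (f > 0)
    W_dec -= lr * f.T @ err / n
    W_enc -= lr * acts.T @ g_f / n

f = np.maximum(acts @ W_enc, 0.0)
print("fraction of active features:", (f > 0).mean())
# Each learned feature can then be inspected for what makes it fire,
# e.g. the "Golden Gate Bridge" feature in Anthropic's published work.
```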
Dwarkesh Patel 01:43:13
Man, there are so many interesting implications of this that I could just keep asking you about this- I would regret not asking you more about this, so I'll keep going. One implication is, currently, if you have a model that has some tens or hundreds of billions of parameters, you can serve it on a handful of GPUs.
In this system, where any one query might only make its way through a small fraction of the total parameters, but you need the whole thing loaded into memory, the specific kind of infrastructure that Google has invested in with these TPUs that exist in pods of hundreds or thousands would be immensely valuable, right?
Noam Shazeer 01:44:02
For any sort of even existing mixtures of experts, you want the whole thing in memory. I guess there's kind of this misconception running around with Mixture of Experts that, okay, the benefit is that you don't even have to go through those weights in the model.
If some expert is unused, it doesn't mean that you don't have to retrieve that memory because, really, in order to be efficient, you're serving at very large batch sizes.
Jeff Dean 01:44:36
Of independent requests.
Noam Shazeer 01:44:38
Right, of independent requests. So it's not really the case that, okay, at this step, you're either looking at this expert or you're not looking at this expert.
Because if that were the case, then when you did look at the expert, you would be running it at batch size one, which is massively inefficient. Like you've got modern hardware, the operational intensities are whatever, hundreds. So that's not what's happening. It's that you are looking at all the experts, but you only have to send a small fraction of the batch through each one.
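A minimal sketch of the batching point: every expert's weights are touched each step, but each expert runs one matmul over its slice of a large batch, never at batch size one. The top-1 router and all shapes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d, batch = 4, 16, 1024

experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router_w = rng.normal(size=(d, n_experts))
x = rng.normal(size=(batch, d))

assign = np.argmax(x @ router_w, axis=-1)  # top-1 expert per token

out = np.empty_like(x)
for e, W in enumerate(experts):
    idx = np.where(assign == e)[0]
    # One efficient matmul per expert over its sub-batch of the big batch.
    out[idx] = x[idx] @ W
    print(f"expert {e}: {len(idx)} of {batch} tokens")
```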
Jeff Dean 01:45:17
Right, but you still have a smaller batch at each expert that then goes through. And in order to get kind of reasonable balance, one of the things that the current models typically do is they have all the experts be roughly the same compute cost, and then you run roughly the same size batches through them in order to propagate the very large batch you're doing at inference time and have good efficiency.
But I think you often in the future might want experts that vary in computational cost by factors of 100 or 1000. Or maybe paths that go for many layers on one case, and a single layer or even a skip connection in the other case. And there, I think you're going to want very large batches still, but you're going to want to push things through the model a little bit asynchronously at inference time, which is a little easier than training time.
That's part of one of the things that Pathways was designed to support. You have these components, and the components can be variable cost and you kind of can say, for this particular example, I want to go through this subset of the model, and for this example, I want to go through this subset of the model and have the system kind of orchestrate that.
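An illustrative sketch, not the Pathways API, of per-example paths whose compute costs differ wildly; the routing policy is a placeholder for something learned.

```python
def skip_path(x):
    return x + 1  # e.g. a skip connection or a single cheap layer

def deep_stack(x):
    for _ in range(100):  # roughly 100x the compute of the skip path
        x = x * 1.01
    return x

def choose_path(example):
    # Stand-in for a learned policy: route "hard" examples to the deep stack.
    return deep_stack if example > 10 else skip_path

for example in [3, 42]:
    path = choose_path(example)
    print(example, "->", path.__name__, "=", round(path(example), 2))
```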
Dwarkesh Patel 01:46:39
It also would mean that it would take companies of a certain size and sophistication to be able to do this... Right now, anybody can train a sufficiently small model. But if it ends up being the case that this is the best way to train future models, then you would need a company that can basically have a data center serving a single, quote unquote, "blob" or model. So it would be an interesting change in paradigms in that way as well.
Noam Shazeer 01:47:10
You definitely want to have at least enough HBM to hold your whole model. So depending on the size of your model, that's most likely how much HBM you'd want to have at a minimum.
Jeff Dean 01:47:28
It also means you don't necessarily need to grow your entire model footprint to be the size of a data center. You might want it to be a bit below that.
And then have potentially many replicated copies of one particular expert that is being used a lot, so that you get better load balancing. This one's being used a lot because we get a lot of math questions, and this one is an expert on Tahitian dance, and it is called on really rarely.
That one, maybe you even page out to DRAM rather than putting it in HBM. But you want the system to figure all this stuff out based on load characteristics.
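A toy placement policy in that spirit: replicate hot experts in HBM, page cold ones out to DRAM. The thresholds and tiers below are invented knobs, not a real serving system.

```python
from collections import Counter

request_log = ["math"] * 900 + ["code"] * 90 + ["tahitian_dance"] * 2
load = Counter(request_log)

placement = {}
for expert, hits in load.items():
    if hits > 500:
        # Hot expert: keep several replicas in fast memory for load balancing.
        placement[expert] = {"tier": "HBM", "replicas": hits // 300}
    elif hits > 50:
        placement[expert] = {"tier": "HBM", "replicas": 1}
    else:
        # Rarely used expert: page out to slower, cheaper DRAM.
        placement[expert] = {"tier": "DRAM", "replicas": 1}

for expert, plan in placement.items():
    print(expert, plan)
```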
Dwarkesh Patel 01:48:09
Right. Now, language models, obviously, you put in language, you get language out. Obviously, it's multimodal.
But the Pathways blog post talks about so many different use cases that are not obviously of this kind of auto-regressive nature going through the same model. Could you imagine, basically, Google as a company, the product is like Google Search goes through this, Google Images goes through this, Gmail goes through it?
Jeff Dean 01:48:45
You're starting to see some of this by having a lot of uses of Gemini models across Google that are not necessarily fine-tuned. They're just given instructions for this particular use case in this feature in this product setting.
So, I definitely see a lot more sharing of what the underlying models are capable of across more and more services. I do think that's a pretty interesting direction to go, for sure.

(Annotator's note: ByteDance is pursuing the same idea.)
Dwarkesh Patel 01:49:14
Yeah, I feel like people listening might not register how interesting a prediction this is about where AI is going. It's like sort of getting Noam on a podcast in 2018 and being like, "Yeah, so I think language models will be a thing."
It's like, if this is where things go, this is actually incredibly interesting.
Jeff Dean 01:49:36
Yeah, and I think you might see that as a big base model. And then you might want customized versions of that model, with different modules added onto it for different settings that maybe have access restrictions.
Maybe we have an internal one for Google use, for Google employees, that we've trained some modules on internal data, and we don't allow anyone else to use those modules, but we can make use of it. Maybe other companies, you add on other modules that are useful for that company setting and serve it in our cloud APIs.
Dwarkesh Patel 01:50:09
What is the bottleneck to making this sort of system viable? Is it systems engineering? Is it ML?
Jeff Dean 01:50:17
It's a pretty different way of operating than our current Gemini development. So, I think we will explore these kinds of areas and make some progress on them.
But we need to really see evidence that it's the right way, that it has a lot of benefits. Some of those benefits may be improved quality, some may be less concretely measurable, like this ability to have lots of parallel development of different modules. But that's still a pretty exciting improvement because I think that would enable us to make faster progress on improving the model's capabilities for lots of different distinct areas.
Noam Shazeer 01:51:00
Even the data control modularity stuff seems really cool because then you could have the piece of the model that's just trained for me. It knows all my private data.
Jeff Dean 01:51:09
Like a personal module for you would be useful. Another thing might be you can use certain data in some settings but not in other settings.
Maybe we have some YouTube data that's only usable in a YouTube product surface but not in other settings. So, we could have a module that is trained on that data for that particular purpose.
Dwarkesh Patel 01:51:29
Yeah.
Noam Shazeer 01:51:32
We're going to need a million automated researchers to invent all of this stuff.
Jeff Dean 01:51:39
It's going to be great.
Dwarkesh Patel 01:51:41
Yeah, well the thing itself, you build the blob, and it tells you how to make the blob better.
Jeff Dean 01:51:47
Blob 2.0. Or maybe they're not even versions, it's just like an incrementally growing blob.
Dwarkesh Patel 01:51:56
Yeah, that's super fascinating. Okay, Jeff, motivate for me, big picture: why is this a good idea? Why is this the next direction?
Jeff Dean 01:52:06
Yeah, this notion of an organic, not quite so carefully mathematically constructed machine learning model is one that's been with me for a little while. I feel like in the development of neural nets, the artificial neurons, inspiration from biological neurons is a good one and has served us well in the deep learning field.
We've been able to make a lot of progress with that. But I feel like we're not necessarily looking at other things that real brains do as much as we perhaps could, and that's not to say we should exactly mimic that because silicon and wetware have very different characteristics and strengths. But I do think one thing we could draw more inspiration from is this notion of having different specialized portions, sort of areas of a model of a brain that are good at different things.
We have a little bit of that in Mixture of Experts models, but it's still very structured. I feel like this kind of more organic growth of expertise, and when you want more expertise of that, you add some more capacity to the model there and let it learn a bit more on that kind of thing.
Also this notion of adapting the connectivity of the model to the connectivity of the hardware is a good one. I think you want incredibly dense connections between artificial neurons in the same chip and the same HBM because that doesn't cost you that much. But then you want a smaller number of connections to nearby neurons. So, like a chip away, you should have some amount of connections and then, like many, many chips away, you should have a smaller number of connections where you send over a very limited kind of bottlenecky thing: the most important things that this part of the model is learning for other parts of the model to make use of. And even across multiple TPU pods, you'd like to send even less information but the most salient kind of representations. And then across metro areas, you'd like to send even less.
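Schematically, that hardware-matched connectivity could look like bottleneck projections whose width shrinks with distance in the hardware hierarchy; the specific widths below are invented purely for illustration.

```python
import numpy as np

d = 1024
# Representation width allowed across each boundary; a bottleneck layer
# would project down to this size before crossing it.
bottleneck = {
    "same_chip":   d,         # dense connections, essentially free
    "nearby_chip": d // 8,    # fewer connections one chip away
    "cross_pod":   d // 64,   # only the most salient features
    "cross_metro": d // 512,  # tiny, highly compressed summaries
}

rng = np.random.default_rng(0)
h = rng.normal(size=(1, d))
for boundary, width in bottleneck.items():
    proj = rng.normal(size=(d, width)) / np.sqrt(d)  # learned in practice
    print(f"{boundary:>11}: send {width:4d} dims, shape {(h @ proj).shape}")
```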
Dwarkesh Patel 01:54:23
Yeah, and then that emerges organically.
Jeff Dean 01:54:26
Yeah, I'd like that to emerge organically. You could hand-specify these characteristics, but I think you don't know exactly what the right proportions of these kinds of connections are so you should just let the hardware dictate things a little bit. Like if you're communicating over here and this data always shows up really early, you should add some more connections, then it'll take longer and show up at just the right time.
Dwarkesh Patel 01:54:48
Oh, here's another interesting implication. Right now, we think about the growth in AI use as sort of horizontal: suppose you ask how many AI engineers Google will have working for it; you think about how many instances of Gemini 3 will be working at one time. If you have this, whatever you want to call it, this blob, and it can sort of organically decide how much of itself to activate, then if you want 10 engineers' worth of output, it just activates a different pattern or a larger pattern. If you want 100 engineers' worth of output, it's not like calling more agents or more instances, it's just calling different sub-patterns.
Jeff Dean 01:55:34
I think there's a notion of how much compute do you want to spend on this particular inference, and that should vary by factors of 10,000 for really easy things and really hard things, maybe even a million. It might be iterative, you might make a pass through the model, get some stuff, and then decide you now need to call on some other parts of the model. The other thing I would say is this sounds super complicated to deploy because it's this weird, constantly evolving thing with maybe not super optimized ways of communicating between pieces, but you can always distill from that. If you say, "This is the kind of task I really care about, let me distill from this giant kind of organic thing into something that I know can be served really efficiently," you could do that distillation process whenever you want, once a day, once an hour. That seems like it'd be kind of good.
Noam Shazeer 01:56:32
Yeah, we need better distillation.
Jeff Dean 01:56:34
Yeah.
Noam Shazeer 01:56:34
Anyone out there who invents amazing distillation techniques that instantly distill from a giant blob onto your phone, that would be wonderful.
Dwarkesh Patel 01:56:43
How would you characterize what's missing from current distillation techniques?
Noam Shazeer 01:56:46
Well, I just want it to work faster.
Jeff Dean 01:56:49
A related thing is I feel like we need interesting learning techniques during pre-training. I'm not sure we're extracting the maximal value from every token we look at with the current training objective. Maybe we should think a lot harder about some tokens. When you get to "the answer is," maybe the model should, at training time, do a lot more work than when it gets to "the".
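One crude way to picture "think harder on some tokens" is to upweight the loss at high-information positions; the weighting rule below is made up purely to show the shape of the idea.

```python
import numpy as np

tokens = ["the", "answer", "is", "42"]
losses = np.array([0.1, 0.8, 0.3, 2.5])  # toy per-token cross-entropy values

def weight(i, toks):
    # Upweight the token right after "is": the payoff token.
    return 4.0 if i > 0 and toks[i - 1] == "is" else 1.0

weights = np.array([weight(i, tokens) for i in range(len(tokens))])
weighted_loss = (weights * losses).sum() / weights.sum()
print("uniform:", losses.mean(), "weighted:", weighted_loss)
```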
Noam Shazeer 01:57:16
Right. There's got to be some way to get more from the same data, make it learn it forwards and backwards.
Jeff Dean 01:57:24
And every which way. Hide some stuff this way, hide some stuff that way, make it infer from partial information. I think people have been doing this in vision models for a while. You distort the image or you hide parts of it and try to make the model guess the bird from half of it, like guessing that it's a bird from the upper corner of the image or the lower left corner. That makes the task harder, and I feel like there's an analog for more textual or coding-related data where you want to force the model to work harder. You'll get more interesting observations from it.
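A toy version of that idea applied to text: corrupt random spans and ask the model to reconstruct them, denoising-style. This masking scheme is an assumption for illustration, not a specific Google training objective.

```python
import random

def mask_spans(tokens, span_len=2, n_spans=2, mask="<X>"):
    toks = list(tokens)
    targets = []
    for _ in range(n_spans):
        start = random.randrange(0, len(toks) - span_len)
        targets.append(tokens[start:start + span_len])
        toks[start:start + span_len] = [mask] * span_len
    return toks, targets

random.seed(0)
corrupted, targets = mask_spans("the quick brown fox jumps over the dog".split())
print("input :", " ".join(corrupted))
print("target:", targets)  # the model has to reconstruct these spans
```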
Noam Shazeer 01:58:03
Yeah, the image people didn't have enough labeled data so they had to invent all this stuff.
Jeff Dean 01:58:08
And they invented... I mean, dropout was invented on images, but we're mostly not using it for text. One way you could get a lot more learning into a large-scale model without overfitting is to just make like 100 epochs over the world's text data and use dropout. That's pretty computationally expensive, but it does mean we won't run out. Even though people are saying, "Oh no, we're almost out of textual data," I don't really believe that, because I think we can get much more capable models out of the text data that does exist.
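For completeness, the inverted-dropout trick Jeff is referring to fits in a few lines; this is the standard formulation, sketched with toy values.

```python
import numpy as np

def dropout(h, rate, rng):
    keep = rng.random(h.shape) >= rate
    # Scale the survivors so expected activations match at test time.
    return h * keep / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones((2, 8))
print(dropout(h, rate=0.5, rng=rng))
```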
Noam Shazeer 01:58:44
I mean, a person has seen a billion tokens.
Jeff Dean 01:58:47
Yeah, and they're pretty good at a lot of stuff.
Dwarkesh Patel 01:58:54
So obviously human data efficiency sets a lower bound on... or I guess an upper bound? One of them, maybe not.
Jeff Dean 01:59:04
It's an interesting data point.
Dwarkesh Patel 01:59:05
Yes. So there's a sort of modus ponens, modus tollens thing here. One way to look at it is: look, LLMs have so much further to go, therefore we should project orders-of-magnitude improvement in sample efficiency if they could just match humans. Another is: maybe, given the orders-of-magnitude difference, they're doing something clearly different. What's your intuition about what it would take to make these models as sample-efficient as humans are?
Jeff Dean 01:59:33
Yeah, I think we should consider changing the training objective a little bit. Just predicting the next token from the previous ones you've seen seems like not how people learn. It's a little bit related to how people learn, I think, but not entirely. A person might read a whole chapter of a book and then try to answer questions at the back, and that's a different kind of thing. I also think we're not learning from visual data very much. We're training a little bit on video data, but we're definitely not anywhere close to thinking about training on all the visual inputs you could get. So you have visual data that we haven't really begun to train on. Then I think we could extract a lot more information from every bit of data we do see. I think one of the ways people are so sample efficient is they explore the world and take actions in the world and observe what happens. You see it with very small infants picking things up and dropping them; they learn about gravity from that. And that's a much harder thing to learn when you're not initiating the action. I think having a model that can take actions as part of its learning process would be just a lot better than just sort of passively observing a giant dataset.

(Annotator's note: pre-training and inference = learning + practice.)
Dwarkesh Patel 02:00:50
Is Gato the future, then?
Jeff Dean 02:00:53
Something where the model can observe and take actions and observe the corresponding results seems pretty useful.
Noam Shazeer 02:01:04
I mean, people can learn a lot from thought experiments that don't even involve extra input. Einstein learned a lot of stuff from thought experiments, or like Newton went into quarantine and got an apple dropped on his head or something and invented gravity. And like mathematicians -- math didn't have any extra input. Chess, okay, you have the thing play chess against itself and it gets good at chess. That was DeepMind, but also all it needs is the rules of chess. So there's actually probably a lot of learning that you can do even without external data, and then you can make it in exactly the fields that you care about. Of course, there is learning that will require external data, but maybe we can just have this thing talk to itself and make itself smarter.
Dwarkesh Patel 02:02:03
So here's the question I have. What you've just laid out over the last hour is potentially just like the big next paradigm shift in AI. That's a tremendously valuable insight, potentially. Noam, in 2017 you released the Transformer paper on which tens, if not hundreds, of billions of dollars of market value is based in other companies, not to mention all this other research that Google has released over time, which you've been relatively generous with. In retrospect, when you think about divulging this information that has been helpful to your competitors, in retrospect is it like, "Yeah, we'd still do it," or would you be like, "Ah, we didn't realize how big a deal Transformer was. We should have kept it indoors." How do you think about that?
Noam Shazeer 02:02:51
It's a good question because I think probably we did need to see the size of the opportunity, often reflected in what other companies are doing. And also it's not a fixed pie. The current state of the world is pretty much as far from fixed pie as you can get. I think we're going to see orders of magnitude of improvements in GDP, health, wealth, and anything else you can think of. So I think it's definitely been nice that Transformer has got around.
Jeff Dean 02:03:39
It’s transformative.
Noam Shazeer 02:03:51
Woo. Thank God Google's doing well as well. So these days we do publish a little less of what we're doing.
Jeff Dean 02:03:54
There's always this trade-off: should we publish exactly what we're doing right away? Should we put it in the next stages of research and then roll it out into production Gemini models and not publish it at all? Or is there some intermediate point?
And for example, in our computational photography work in Pixel cameras, we've often taken the decision to develop interesting new techniques, like the ability to do super good night sight vision for low-light situations or whatever, put that into the product and then published a real research paper about the system that does that after the product is released.
Different techniques and developments get different treatment. Some things we think are super critical we might not publish. Some things we think are really interesting and important for improving our products; we'll get them out into our products and then make a decision: do we publish this, or do we give kind of a lightweight discussion of it, but maybe not every last detail?
Other things I think we publish openly and try to advance the field and the community because that's how we all benefit from participating. I think it's great to go to conferences like NeurIPS last week with 15,000 people all sharing lots and lots of great ideas. We publish a lot of papers there as we have in the past, and see the field advance is super exciting.
Dwarkesh Patel 02:05:29
How would you account for... so obviously Google had all these insights internally rather early on, including the top researchers. And now Gemini 2 is out. We didn't get a chance much to talk about it, but people know it's a really great model.
Jeff Dean 02:05:53
Such a good model. As we say around the micro-kitchen, “such a good model, such a good model”.
Dwarkesh Patel 02:05:57
So it's top in LMSYS Chatbot Arena. And so now Google's on top. But how would you account for having had all the great insights while, for a couple of years, other competitors had models that were better despite that?
Jeff Dean 02:06:16
We've been working on language models for a long time. Noam's early work on spelling correction in 2001, the work on translation, very large-scale language models in 2007, and seq2seq and word2vec and more recent Transformers and then BERT.
Things like the internal Meena system, which was actually a chatbot system designed to engage people in interesting conversations. We actually had an internal chatbot system that Googlers could play with even before ChatGPT came out. And actually, during the pandemic, when everyone was locked down at home, a lot of Googlers enjoyed spending time chatting with Meena during lunch, because it was like a nice, you know, lunch partner.
I think one of the things, our view of things from a search perspective, was that these models hallucinate a lot; they don't get things right a lot of the time, or some of the time, and that means they aren't as useful as they could be, so we'd like to make that better. From a search perspective, you ideally want to get the right answer 100% of the time and be very high on factuality. These models were not near that bar.
I think what we were a little unsure about was just how incredibly useful they could be. Oh, and they also had all kinds of safety issues, like they might say offensive things, and we had to work on that aspect and get it to a point where we were comfortable releasing the model. But I think what we didn't quite appreciate was how useful they could be for things you wouldn't ask a search engine, right? Like, help me write a note to my veterinarian, or, can you take this text and give me a quick summary of it? I think that's the kind of thing we've seen people really flock to: using chatbots as amazing new capabilities rather than as a pure search engine.
So I think we took our time and got to the point where we actually released quite capable chatbots and have been improving them through Gemini models quite a bit. I think that's actually not a bad path to have taken. Would we like to have released the chatbot earlier? Maybe. But I think we have a pretty awesome chatbot with awesome Gemini models that are getting better all the time. And that's pretty cool.
Dwarkesh Patel 02:08:54
Okay, final question. So we've discussed some of the things you guys have worked on over the last 25 years, and there are so many different fields, right? You start with search and indexing, then distributed systems, then hardware, then AI algorithms. And genuinely, there are a thousand more; just go to either of your Google Scholar pages. What is the trick to having not only this level of career longevity, where you're making breakthroughs over many decades, but also this breadth across different fields? Both of you, in either order: what's the trick to career longevity and breadth?
Jeff Dean 02:09:46
One thing that I like to do is to find out about a new and interesting area, and one of the best ways to do that is to pay attention to what's going on, talk to colleagues, pay attention to research papers that are being published, and look at the kind of research landscape as it's evolving.
Be willing to say, "Oh, chip design. I wonder if we could use reinforcement learning for some aspect of that." Be able to dive into a new area, work with people who know a lot about a different domain or AI for healthcare or something. I've done a bit of working with clinicians about what are the real problems, how could AI help? It wouldn't be that useful for this thing, but it would be super useful for this.
Getting those insights, and often working with a set of five or six colleagues who have different expertise than you do. It enables you to collectively do something that none of you could do individually. Then some of their expertise rubs off on you and some of your expertise rubs off on them, and now you have this bigger set of tools in your tool belt as an engineering researcher to go tackle the next thing.
I think that's one of the beauties of continuing to learn on the job. It's something I treasure. I really enjoy diving into new things and seeing what we can do.
Noam Shazeer 02:11:10
I'd say probably a big thing is humility; like, I'd say I'm the most humble. But seriously, it's being able to say that what I just did is nothing compared to what I can do, or what can be done. And being able to drop an idea as soon as you see something better, whether yours or somebody else's, when you see how what you're thinking about, what they're thinking about, or something totally different could conceivably work better.
I think there is a drive in some sense to say, "Hey, the thing I just invented is awesome, give me more chips." Particularly if there's a lot of top-down resource assignment. But I think we also need to incentivize people to say, "Hey, this thing I am doing is not working at all. Let me just drop it completely and try something else."
Which I think Google Brain did quite well. We had this very bottom-up, UBI-like kind of chip allocation.
Dwarkesh Patel 02:12:39
You had a UBI?
Noam Shazeer 02:12:41
Yeah, it was like basically everyone had one credit and you could pool them.
Gemini has been mostly top-down, which has been very good in some sense because it has led to a lot more collaboration and people working together. You less often have five groups of people all building the same thing or building interchangeable things.
But on the other hand, it does lead to some incentive to say, "Hey, what I'm doing is working great." And then, as a lead, you hear hundreds of groups, and everything is, "So you should give them more chips." There's less of an incentive to say, "Hey, what I'm doing is not actually working that well. Let me try something different."
So I think going forward, we're going to have some amount of top-down, some amount of bottom-up, so as to incentivize both of these behaviors: collaboration and flexibility. I think both those things lead to a lot of innovation.

(Annotator's note: what's missing is a unified philosophy.)
Jeff Dean 02:13:49
I think it's also good to articulate interesting directions you think we should go. I have an internal slide deck called "Go, Jeff, Wacky Ideas." I think those are a little bit more product-oriented things, like, "Hey, I think now that we have these capabilities, we could do these 17 things."
I think that's a good thing because sometimes people get excited about that and want to start working with you on one or more of them. And I think that's a good way to bootstrap where we should go without necessarily ordering people, "We must go here."
Dwarkesh Patel 02:14:32
Alright, this was great.
Jeff Dean 02:14:34
Yeah.
Dwarkesh Patel 02:14:34
Thank you, guys.
Jeff Dean 02:14:35
Appreciate you taking the time, it was great chatting. That was awesome.