The caps on exports of AI chips apply in different ways to different countries and companies.
The 18 close U.S. allies will face no restrictions on purchases of chips. And smaller orders from customers around the world—up to around 1,700 advanced AI chips—won’t require a license or count against caps on countries’ chip purchases, the Commerce Department said.
That leaves the question of whether companies based in the U.S. or its allies can build significant AI capacity in a country falling into a middle zone—neither trusted ally nor top adversary. The Commerce Department said yes, but with limits. Companies that meet high security standards can apply for a status that allows them to place up to 7% of their global AI computing capacity in any single such country. That could be as many as hundreds of thousands of chips, the department said.
A further category of companies based in countries that aren’t U.S. adversaries can apply for a status allowing them to buy up to the equivalent of 320,000 of today’s advanced AI chips over the next two years. Those that don’t get this status can still buy up to the equivalent of 50,000 advanced AI chips.
The limits suggest many countries could be challenged in setting up AI computing facilities capable of competing with the largest and most advanced in the U.S. and its closely allied countries. Some of the biggest AI computing facilities in the U.S. contain huge numbers of Nvidia’s AI chips, including the Colossus supercomputer being built by Elon Musk’s xAI in Memphis, Tenn., which is being scaled up to include 200,000 of them.
The Major Threats
At a very high level, you can think of it like this: Nvidia operated in a fairly niche area for a very long time; they had very limited competition, and that competition wasn't particularly profitable or growing fast enough to ever pose a real threat, since those rivals didn't have the capital needed to really apply pressure to a market leader like Nvidia. The gaming market was large and growing, but didn't feature earth-shattering margins or particularly fabulous year-over-year growth rates.
A few big tech companies started ramping up hiring and spending on machine learning and AI efforts around 2016-2017, but it was never a truly significant line item for any of them on an aggregate basis— more of a "moonshot" R&D expenditure. But once the big AI race started in earnest with the release of ChatGPT in 2022— only a bit over 2 years ago, although it seems like a lifetime ago in terms of developments— that situation changed very dramatically.
Suddenly, big companies were ready to spend many, many billions of dollars incredibly quickly. The number of researchers showing up at the big research conferences like NeurIPS and ICML went up very, very dramatically. All the smart students who might have previously studied financial derivatives were instead studying Transformers, and $1mm+ compensation packages for non-executive engineering roles (i.e., for individual contributors not managing a team) became the norm at the leading AI labs.
It takes a while to change the direction of a massive cruise ship; and even if you move really quickly and spend billions, it takes a year or more to build greenfield data centers and order all the equipment (with ballooning lead times) and get it all set up and working. It takes a long time to hire and onboard even smart coders before they can really hit their stride and familiarize themselves with the existing codebases and infrastructure.
But now, you can imagine that absolutely biblical amounts of capital, brainpower, and effort are being expended in this area. And Nvidia has the biggest target of any player on their back, because they are the ones who are making the lion's share of the profits TODAY, not in some hypothetical future where the AI runs our whole lives.
So the very high level takeaway is basically that "markets find a way"; they find alternative, radically innovative new approaches to building hardware that leverage completely new ideas to sidestep barriers that help prop up Nvidia's moat.
The Hardware Level Threat
Take, for example, the so-called "wafer-scale" AI training chips from Cerebras, which dedicate an entire 300mm silicon wafer to one absolutely gargantuan chip containing orders of magnitude more transistors and cores on a single die (see this recent blog post from them explaining how they were able to solve the "yield problem" that had previously made this approach economically impractical).
To put this into perspective, if you compare Cerebras' newest WSE-3 chip to Nvidia's flagship data-center GPU, the H100, the Cerebras chip has a total die area of 46,225 square millimeters compared to just 814 for the H100 (and the H100 is itself considered an enormous chip by industry standards); that's a multiple of ~57x! And instead of having 132 "streaming multiprocessor" cores enabled on the chip like the H100 has, the Cerebras chip has ~900,000 cores (granted, each of these cores is smaller and does a lot less, but it's still an almost unfathomably large number in comparison). In more concrete apples-to-apples terms, the Cerebras chip can do around ~32x the FLOPS in AI contexts as a single H100 chip. Since an H100 sells for close to $40k a pop, you can imagine that the WSE-3 chip isn't cheap.
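The size comparison above is just arithmetic on the quoted figures; a quick sanity check (die areas and core counts are taken from the text, so this is illustrative arithmetic, not a benchmark):

```python
# Figures quoted above for the Cerebras WSE-3 vs. the Nvidia H100.
wse3_die_mm2 = 46_225   # total die area of the WSE-3, in mm^2
h100_die_mm2 = 814      # H100 die area, in mm^2
wse3_cores = 900_000    # approximate WSE-3 core count
h100_sms = 132          # H100 enabled streaming multiprocessors

area_ratio = wse3_die_mm2 / h100_die_mm2
core_ratio = wse3_cores / h100_sms

print(f"die area ratio: ~{area_ratio:.0f}x")    # ~57x, as quoted
print(f"core count ratio: ~{core_ratio:.0f}x")  # each WSE-3 core is far smaller, though
```

The ~57x area multiple and the ~32x FLOPS figure diverge because the per-core and per-mm² capabilities of the two designs are very different.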
So why does this all matter? Well, instead of trying to battle Nvidia head-on by using a similar approach and trying to match the Mellanox interconnect technology, Cerebras has used a radically innovative approach to do an end-run around the interconnect problem: inter-processor bandwidth becomes much less of an issue when everything is running on the same super-sized chip. You don't even need to have the same level of interconnect because one mega chip replaces tons of H100s.
And the Cerebras chips also work extremely well for AI inference tasks. In fact, you can try it today for free here and use Meta's very respectable Llama-3.3-70B model. It responds basically instantaneously, at ~1,500 tokens per second. To put that into perspective, anything above 30 tokens per second feels relatively snappy to users based on comparisons to ChatGPT and Claude, and even 10 tokens per second is fast enough that you can basically read the response while it's being generated.
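Those tokens-per-second figures translate directly into user-facing wait time. A quick sketch (the 800-token response length is an assumed illustrative value, not from the text):

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a full response at a given generation speed."""
    return num_tokens / tokens_per_second

response_tokens = 800  # assumed length of a longish chat reply

for speed in (10, 30, 1_500):  # tokens/sec figures mentioned above
    t = seconds_to_generate(response_tokens, speed)
    print(f"{speed:>5} tok/s -> {t:6.1f} s to finish the reply")
```

At 10 tok/s the reply takes over a minute to finish streaming; at ~1,500 tok/s it is effectively instantaneous, which is why the difference feels qualitative rather than incremental.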
Cerebras is also not alone; there are other companies, like Groq (not to be confused with the Grok model family trained by Elon Musk's xAI). Groq has taken yet another innovative approach to solving the same fundamental problem. Instead of trying to compete with Nvidia's CUDA software stack directly, they've developed what they call a "tensor streaming processor" (TSP) that is specifically designed for the exact mathematical operations that deep learning models need to perform. Their chips are designed around a concept called "deterministic compute," which means that, unlike traditional GPUs, where the exact timing of operations can vary, their chips execute operations in a completely predictable way every single time.
This might sound like a minor technical detail, but it actually makes a massive difference for both chip design and software development. Because the timing is completely deterministic, Groq can optimize their chips in ways that would be impossible with traditional GPU architectures. As a result, they've been demonstrating for the past 6+ months inference speeds of over 500 tokens per second with the Llama series of models and other open source models, far exceeding what's possible with traditional GPU setups. Like Cerebras, this is available today and you can try it for free here.
Using a comparable Llama3 model with "speculative decoding," Groq is able to generate 1,320 tokens per second, on par with Cerebras and far in excess of what is possible using regular GPUs. Now, you might ask what the point is of achieving 1,000+ tokens per second when users seem pretty satisfied with ChatGPT, which is operating at less than 10% of that speed. And the thing is, it does matter. It makes it a lot faster to iterate and not lose focus as a human knowledge worker when you get instant feedback. And if you're using the model programmatically via the API, which is increasingly where much of the demand is coming from, then it can enable whole new classes of applications that require multi-stage inference (where the output of previous stages is used as input in successive stages of prompting/inference) or which require low-latency responses, such as content moderation, fraud detection, dynamic pricing, etc.
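The multi-stage pattern described above, where each stage's output feeds the next prompt, can be sketched with a mock model. Here `call_model` is a stand-in for a real low-latency inference API client, and the stage prompts are purely hypothetical:

```python
def call_model(prompt: str) -> str:
    # Stand-in for a real inference API call; a fast backend is what makes
    # chaining several of these round-trips per request practical.
    return f"<answer to: {prompt!r}>"

def multi_stage(question: str) -> str:
    # Stage 1: extract the key facts from the raw input.
    facts = call_model(f"List the key facts in: {question}")
    # Stage 2: reason over those facts (previous output feeds the next prompt).
    draft = call_model(f"Using these facts, draft an answer: {facts}")
    # Stage 3: verify/refine the draft before returning it.
    return call_model(f"Check and tighten this answer: {draft}")

print(multi_stage("Is this transaction fraudulent?"))
```

End-to-end latency scales with the number of stages times per-stage latency, so a backend that is 10x faster per stage makes a three-stage pipeline viable where it previously wasn't.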
But even more fundamentally, the faster you can serve requests, the faster you can cycle things, and the busier you can keep the hardware. Although Groq's hardware is extremely expensive, clocking in at $2mm to $3mm for a single server, it ends up costing far less per request fulfilled if you have enough demand to keep the hardware busy all the time.
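The amortization point is easy to check with stylized numbers. The server price below is the midpoint of the $2mm-$3mm range quoted above; the utilization, lifetime, and throughput figures are assumptions for illustration only:

```python
def cost_per_million_tokens(server_cost: float, tokens_per_second: float,
                            utilization: float, lifetime_years: float) -> float:
    """Hardware cost amortized over every token the server ever serves."""
    seconds = lifetime_years * 365 * 24 * 3600
    total_tokens = tokens_per_second * utilization * seconds
    return server_cost / total_tokens * 1e6

# $2.5mm server, 1,320 tok/s (figure quoted above), assumed 3-year life.
busy = cost_per_million_tokens(2_500_000, 1_320, utilization=0.9, lifetime_years=3)
idle = cost_per_million_tokens(2_500_000, 1_320, utilization=0.1, lifetime_years=3)
print(f"90% busy: ${busy:.2f} per million tokens")
print(f"10% busy: ${idle:.2f} per million tokens")
```

Cost per token is inversely proportional to utilization, so the same expensive box is 9x cheaper per request at 90% utilization than at 10%.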
And like Nvidia with CUDA, a huge part of Groq's advantage comes from their own proprietary software stack. They are able to take the same open source models that other companies like Meta, DeepSeek, and Mistral develop and release for free, and decompose them in special ways that allow them to run dramatically faster on their specific hardware.
Like Cerebras, they have made different technical decisions to optimize certain particular aspects of the process, which allows them to do things in a fundamentally different way. In Groq's case, it's because they are entirely focused on inference-level compute, not on training: all their special-sauce hardware and software only deliver these huge speed and efficiency advantages when doing inference on an already-trained model.
But if the next big scaling law that people are excited about applies to inference-level compute, and if the biggest drawback of chain-of-thought (CoT) models is the high latency introduced by having to generate all those intermediate logic tokens before they can respond, then even a company that only does inference compute, but does it dramatically faster and more efficiently than Nvidia, could pose a serious competitive threat in the coming years. At the very least, Cerebras and Groq can chip away at the lofty expectations for Nvidia's revenue growth over the next 2-3 years that are embedded in the current equity valuation.
Besides these particularly innovative, if relatively unknown, startup competitors, there is some serious competition coming from some of Nvidia's biggest customers themselves who have been making custom silicon that specifically targets AI training and inference workloads. Perhaps the best known of these is Google, which has been developing its own proprietary TPUs since 2016. Interestingly, although it briefly sold TPUs to external customers, Google has been using all its TPUs internally for the past several years, and it is already on its 6th generation of TPU hardware.
Amazon has also been developing its own custom chips called Trainium2 and Inferentia2. And while Amazon is building out data centers featuring billions of dollars of Nvidia GPUs, they are also at the same time investing many billions in other data centers that use these internal chips. They have one cluster that they are bringing online for Anthropic that features over 400k chips.
Amazon gets a lot of flak for totally bungling its internal AI model development, squandering massive amounts of internal compute resources on models that ultimately aren't competitive, but the custom silicon is another matter. Again, they don't necessarily need their chips to be better and faster than Nvidia's. What they need is for their chips to be good enough, and to build them at a breakeven gross margin instead of the ~90%+ gross margin that Nvidia earns on its H100 business.
OpenAI has also announced plans to build custom chips, and they (together with Microsoft) are obviously the single largest user of Nvidia's data-center hardware. As if that weren't enough, Microsoft has itself announced its own custom chips!
And Apple, the most valuable technology company in the world, has been blowing away expectations for years now with its highly innovative and disruptive custom silicon operation, whose chips now completely trounce the CPUs from both Intel and AMD in terms of performance per watt, the most important factor in mobile (phone/tablet/laptop) applications. And Apple has been making its own internally designed GPUs and "Neural Engines" for years, even though it has yet to really demonstrate the utility of such chips outside of its own custom applications, like the advanced software-based image processing used in the iPhone's camera.
While Apple's focus seems somewhat orthogonal to these other players in terms of its mobile-first, consumer oriented, "edge compute" focus, if it ends up spending enough money on its new contract with OpenAI to provide AI services to iPhone users, you have to imagine that they have teams looking into making their own custom silicon for inference/training (although given their secrecy, you might never even know about it directly!).
Now, it's no secret that there is a strong power law distribution of Nvidia's hyper-scaler customer base, with the top handful of customers representing the lion's share of high-margin revenue. How should one think about the future of this business when literally every single one of these VIP customers is building their own custom chips specifically for AI training and inference?
When thinking about all this, you should keep one incredibly important thing in mind: Nvidia is largely an IP based company. They don't make their own chips. The true special sauce for making these incredible devices arguably comes more from TSMC, the actual fab, and ASML, which makes the special EUV lithography machines used by TSMC to make these leading-edge process node chips. And that's critically important, because TSMC will sell their most advanced chips to anyone who comes to them with enough up-front investment and is willing to guarantee a certain amount of volume. They don't care if it's for Bitcoin mining ASICs, GPUs, TPUs, mobile phone SoCs, etc.
However much Nvidia's senior chip designers earn per year, surely some of the best of them can be lured away by these other tech behemoths with enough cash and stock. And once they have a team and resources, they can design innovative chips (again, perhaps not even 50% as advanced as an H100, but with Nvidia's gross margin, there is plenty of room to work with) in 2 to 3 years, and thanks to TSMC, they can turn those designs into actual silicon using the exact same process-node technology as Nvidia.
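The margin headroom argument is worth spelling out with numbers. The ~$40k H100 price and the ~90% gross margin are figures cited elsewhere in this piece; the in-house chip's relative performance and unit cost below are purely hypothetical:

```python
h100_price = 40_000             # approximate H100 selling price (from text)
nvidia_gross_margin = 0.90      # ~90% gross margin cited in the text
h100_cost = h100_price * (1 - nvidia_gross_margin)  # implied ~$4k build cost

# A hypothetical in-house chip that is only half as capable per unit...
relative_perf = 0.5
inhouse_cost = 6_000            # assumed: a higher unit cost than Nvidia's

# ...still wins decisively on performance per dollar when built near cost:
perf_per_dollar_h100 = 1.0 / h100_price
perf_per_dollar_inhouse = relative_perf / inhouse_cost
advantage = perf_per_dollar_inhouse / perf_per_dollar_h100
print(f"in-house perf/$ advantage: ~{advantage:.1f}x")
```

This is the crux: a hyperscaler's chip doesn't have to beat the H100 on absolute performance, because Nvidia's margin leaves a huge gap between price and cost to exploit.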
The Software Threat(s)
As if these looming hardware threats weren't bad enough, there have been developments in the software world over the last couple of years that, while they started out slowly, are now picking up real steam and could pose a serious threat to the software dominance of Nvidia's CUDA. The first of these concerns the horrible Linux drivers for AMD GPUs. Remember how we talked about AMD inexplicably allowing these drivers to suck for years, despite leaving massive amounts of money on the table?
Well, amusingly enough, the infamous hacker George Hotz (famous for jailbreaking the original iPhone as a teenager, and currently the CEO of self-driving startup Comma.ai and AI computer company Tiny Corp, which also makes the open-source tinygrad AI software framework) recently announced that he was sick and tired of dealing with AMD's bad drivers and desperately wanted to be able to leverage the lower-cost AMD GPUs in their TinyBox AI computers (which come in multiple flavors, some of which use Nvidia GPUs and some of which use AMD GPUs).
Well, he is making his own custom drivers and software stack for AMD GPUs without any help from AMD themselves; on Jan. 15th of 2025, he tweeted via his company's X account that "We are one piece away from a completely sovereign stack on AMD, the RDNA3 assembler. We have our own driver, runtime, libraries, and emulator. (all in ~12,000 lines!)" Given his track record and skills, it is likely that they will have this all working in the next couple months, and this would allow for a lot of exciting possibilities of using AMD GPUs for all sorts of applications where companies currently feel compelled to pay up for Nvidia GPUs.
OK, well that's just a driver for AMD, and it's not even done yet. What else is there? Well, there are a few other areas on the software side that are a lot more impactful. For one, there is now a massive concerted effort across many large tech companies and the open source software community at large to make more generic AI software frameworks that have CUDA as just one of many "compilation targets".
That is, you write your software using higher-level abstractions, and the system itself can automatically turn those high-level constructs into super well-tuned low-level code that works extremely well on CUDA. But because it's done at this higher level of abstraction, it can just as easily get compiled into low-level code that works extremely well on lots of other GPUs and TPUs from a variety of providers, such as the massive number of custom chips in the pipeline from every big tech company.
The most famous examples of these frameworks are MLX (sponsored primarily by Apple), Triton (sponsored primarily by OpenAI), and JAX (developed by Google). MLX is particularly interesting because it provides a PyTorch-like API that can run efficiently on Apple Silicon, showing how these abstraction layers can enable AI workloads to run on completely different architectures. Triton, meanwhile, has become increasingly popular as it allows developers to write high-performance code that can be compiled to run on various hardware targets without having to understand the low-level details of each platform.
These frameworks allow developers to write their code once using high-powered abstractions and then target tons of platforms automatically. Doesn't that sound like a better way to do things, one that gives you a lot more flexibility in how you actually run the code?
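The "one definition, many targets" idea can be sketched in miniature: the high-level operation is written once, and pluggable backends lower it to whatever the hardware wants. The backend names here are illustrative stand-ins, not real framework APIs:

```python
from typing import Callable, Dict, List

# Registry of "lowerings": each backend supplies its own implementation
# of the same high-level operation.
BACKENDS: Dict[str, Callable[[List[float], List[float]], List[float]]] = {}

def register(name: str):
    def decorator(fn):
        BACKENDS[name] = fn
        return fn
    return decorator

@register("reference")      # plain-Python fallback target
def _mul_reference(a, b):
    return [x * y for x, y in zip(a, b)]

@register("fake_accel")     # stand-in for a CUDA/ROCm/TPU lowering
def _mul_fake_accel(a, b):
    # A real framework would emit tuned device code here; this sketch
    # just simulates a second, interchangeable compilation target.
    return [x * y for x, y in zip(a, b)]

def elementwise_mul(a, b, backend="reference"):
    """One high-level op; the backend argument picks the lowering."""
    return BACKENDS[backend](a, b)

print(elementwise_mul([1.0, 2.0], [3.0, 4.0], backend="fake_accel"))
```

Real frameworks like Triton, MLX, and JAX do this at the compiler level with actual code generation, but the structural point is the same: user code is written against the abstraction, so swapping hardware means swapping the lowering, not rewriting the model.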
In the 1980s, all the most popular, best selling software was written in hand-tuned assembly language. The PKZIP compression utility for example was hand crafted to maximize speed, to the point where a competently coded version written in the standard C programming language and compiled using the best available optimizing compilers at the time, would run at probably half the speed of the hand-tuned assembly code. The same is true for other popular software packages like WordStar, VisiCalc, and so on.
Over time, compilers kept getting better and better, and every time the CPU architectures changed (say, from Intel releasing the 486, then the Pentium, and so on), that hand-rolled assembler would often have to be thrown out and rewritten, something that only the smartest coders were capable of (sort of like how CUDA experts are on a different level in the job market versus a "regular" software developer). Eventually, things converged so that the speed benefits of hand-rolled assembly were outweighed dramatically by the flexibility of being able to write code in a high-level language like C or C++, where you rely on the compiler to make things run really optimally on the given CPU.
Nowadays, very little new code is written in assembly. I believe a similar transformation will end up happening for AI training and inference code, for similar reasons: computers are good at optimization, and flexibility and speed of development is increasingly the more important factor— especially if it also allows you to save dramatically on your hardware bill because you don't need to keep paying the "CUDA tax" that gives Nvidia 90%+ margins.
Yet another area where you might see things change dramatically is that CUDA might very well end up being more of a high level abstraction itself— a "specification language" similar to Verilog (used as the industry standard to describe chip layouts) that skilled developers can use to describe high-level algorithms that involve massive parallelism (since they are already familiar with it, it's very well constructed, it's the lingua franca, etc.), but then instead of having that code compiled for use on Nvidia GPUs like you would normally do, it can instead be fed as source code into an LLM which can port it into whatever low-level code is understood by the new Cerebras chip, or the new Amazon Trainium2, or the new Google TPUv6, etc. This isn't as far off as you might think; it's probably already well within reach using OpenAI's latest O3 model, and surely will be possible generally within a year or two.
The Theoretical Threat
Perhaps the most shocking development, alluded to earlier, happened in just the last couple of weeks. It is the news that has totally rocked the AI world, and which has been dominating the discourse among knowledgeable people on Twitter despite its complete absence from the mainstream media outlets: a small Chinese startup called DeepSeek released two new models with world-competitive performance on par with the best models from OpenAI and Anthropic (blowing past the Meta Llama3 models and other smaller open-source players such as Mistral). These models are called DeepSeek-V3 (basically their answer to GPT-4o and Claude 3.5 Sonnet) and DeepSeek-R1 (basically their answer to OpenAI's O1 model).
Why is this all so shocking? Well, first of all, DeepSeek is a tiny Chinese company that reportedly has under 200 employees. The story goes that they started out as a quant trading hedge fund similar to TwoSigma or RenTec, but after Xi Jinping cracked down on that space, they used their math and engineering chops to pivot into AI research. Who knows if any of that is really true, or if they are merely some kind of front for the CCP or the Chinese military. But the fact remains that they have released two incredibly detailed technical reports, for DeepSeek-V3 and DeepSeek-R1.
These are heavy technical reports, and if you don't know a lot of linear algebra, you probably won't understand much. But what you should really try is to download the free DeepSeek app from the App Store here and install it, using a Google account to log in, and give it a try (you can also install it on Android here), or simply try it out on your desktop computer in the browser here. Make sure to select the "DeepThink" option to enable chain-of-thought (the R1 model) and ask it to explain parts of the technical reports in simple terms.
This will simultaneously show you a few important things:

One, this model is absolutely legit. There is a lot of BS that goes on with AI benchmarks, which are routinely gamed so that models appear to perform great on the benchmarks but then suck in real world tests. Google is certainly the worst offender in this regard, constantly crowing about how amazing their LLMs are, when they are so awful in any real world test that they can't even reliably accomplish the simplest possible tasks, let alone challenging coding tasks. These DeepSeek models are not like that— the responses are coherent, compelling, and absolutely on the same level as those from OpenAI and Anthropic.
Two, DeepSeek has made profound advancements not just in model quality but, more importantly, in model training and inference efficiency. By staying extremely close to the hardware and by layering together a handful of distinct, very clever optimizations, DeepSeek was able to train these incredible models using GPUs in a dramatically more efficient way: by some measurements, over ~45x more efficiently than other leading-edge models. DeepSeek claims that the complete cost to train DeepSeek-V3 was just over $5mm. That is absolutely nothing by the standards of OpenAI, Anthropic, etc., which were well into the $100mm+ level for training costs for a single model as early as 2024.
How in the world could this be possible? How could this little Chinese company completely upstage all the smartest minds at our leading AI labs, which have 100 times more resources, headcount, payroll, capital, GPUs, etc.? Wasn't China supposed to be crippled by Biden's restrictions on GPU exports? Well, the details are fairly technical, but we can at least describe them at a high level. It may simply be that DeepSeek's relative poverty in GPU processing power was the critical ingredient that forced them to be more creative and clever, necessity being the mother of invention and all.
A major innovation is their sophisticated mixed-precision training framework that lets them use 8-bit floating point numbers (FP8) throughout the entire training process. Most Western AI labs train using "full precision" 32-bit numbers (this basically specifies the number of gradations possible in describing the output of an artificial neuron; 8 bits in FP8 lets you store a much wider range of numbers than you might expect— it's not just limited to 256 different equal-sized magnitudes like you'd get with regular integers, but instead uses clever math tricks to store both very small and very large numbers— though naturally with less precision than you'd get with 32 bits.) The main tradeoff is that while FP32 can store numbers with incredible precision across an enormous range, FP8 sacrifices some of that precision to save memory and boost performance, while still maintaining enough accuracy for many AI workloads.
DeepSeek cracked this problem by developing a clever system that breaks numbers into small tiles for activations and blocks for weights, and strategically uses high-precision calculations at key points in the network. Unlike other labs that train in high precision and then compress later (losing some quality in the process), DeepSeek's native FP8 approach means they get the massive memory savings without compromising performance. When you're training across thousands of GPUs, this dramatic reduction in memory requirements per GPU translates into needing far fewer GPUs overall.
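To make the tile/block idea concrete, here is a minimal numpy sketch (my own illustration, not DeepSeek's kernels) that quantizes a tensor with one scale per 128-value tile, using simple integer codes as a stand-in for FP8. When the data contains outliers, as LLM activations often do, per-tile scales preserve far more precision than a single global scale:

```python
import numpy as np

def quantize_blockwise(x, block=128, levels=127):
    """Quantize each 1 x `block` tile with its own scale (int8-style codes
    as a stand-in for FP8), then dequantize back to floats."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / levels  # per-tile scale
    scale[scale == 0] = 1.0
    q = np.round(x / scale)                # 8-bit-ish integer codes
    return (q * scale).reshape(-1)         # dequantized approximation

rng = np.random.default_rng(0)
# activations with a few large outliers, as often seen in LLM layers
x = rng.normal(size=4096).astype(np.float32)
x[::512] *= 100.0

global_dq = quantize_blockwise(x, block=x.size)  # one scale for everything
tile_dq   = quantize_blockwise(x, block=128)     # one scale per 128-value tile

err = lambda dq: np.abs(dq - x).mean()
print(f"global-scale error: {err(global_dq):.5f}")
print(f"per-tile error:     {err(tile_dq):.5f}")  # much smaller
```

The tiles containing outliers get a large scale of their own, so the other tiles keep fine granularity for their small values.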
Another major breakthrough is their multi-token prediction system. Most Transformer-based LLMs do inference by predicting the next token, one token at a time. DeepSeek figured out how to predict multiple tokens at once while maintaining the quality you'd get from single-token prediction. Their approach achieves about 85-90% accuracy on these additional token predictions, which effectively doubles inference speed without sacrificing much quality. The clever part is that they maintain the complete causal chain of predictions, so the model isn't just guessing; it's making structured, contextual predictions.
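The accept-the-longest-correct-prefix logic behind this style of multi-token decoding can be sketched in a few lines (my own toy illustration; DeepSeek's actual MTP module is a trained auxiliary head, not this literal loop):

```python
def accept_prefix(drafted, verified):
    """Keep the longest prefix of drafted tokens that the main model agrees with.

    `drafted`  : tokens proposed cheaply several-at-a-time
    `verified` : tokens the full model would have produced at each position
    Accepted tokens preserve the causal chain, so output quality matches
    one-token-at-a-time decoding; speed comes from accepting several at once.
    """
    n = 0
    for d, v in zip(drafted, verified):
        if d != v:
            break
        n += 1
    # all accepted tokens, plus the model's own correction for the first miss
    return drafted[:n] + verified[n:n + 1]

# toy example: the draft head guesses 4 tokens, the model agrees with 2
print(accept_prefix([42, 7, 99, 3], [42, 7, 13, 8]))  # [42, 7, 13]
```

If ~85-90% of drafted tokens are accepted, you emit nearly two tokens per verification step on average, which is where the roughly 2x inference speedup comes from.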
One of their most innovative developments is what they call Multi-head Latent Attention (MLA). This is a breakthrough in how they handle what are called the Key-Value indices, which are basically how individual tokens are represented in the attention mechanism within the Transformer architecture. Although this is getting a bit too advanced in technical terms, suffice it to say that these KV indices are some of the major uses of VRAM during the training and inference process, and part of the reason why you need to use thousands of GPUs at the same time to train these models— each GPU has a maximum of 96 GB of VRAM, and these indices eat that memory up for breakfast.
Their MLA system finds a way to store a compressed version of these indices that captures the essential information while using far less memory. The brilliant part is this compression is built directly into how the model learns— it's not some separate step they need to do, it's built directly into the end-to-end training pipeline. This means that the entire mechanism is "differentiable" and able to be trained directly using the standard optimizers. All this stuff works because these models are ultimately finding much lower-dimensional representations of the underlying data than the so-called "ambient dimensions". So it's wasteful to store the full KV indices, even though that is basically what everyone else does.
Not only does storing way more numbers than you need waste tons of space, massively inflating the training memory footprint and hurting efficiency (again, slashing the number of GPUs you need to train a world-class model), but compressing the representation can actually end up improving model quality, because it can act like a "regularizer," forcing the model to pay attention to the truly important stuff instead of using the wasted capacity to fit noise in the training data. So not only do you save a ton of memory, but the model might even perform better. At the very least, you don't take a massive hit to performance in exchange for the huge memory savings, which is generally the kind of tradeoff you are faced with in AI training.
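Some rough arithmetic shows why compressing the KV cache matters so much. The dimensions below are illustrative stand-ins of my own choosing, not DeepSeek's exact hyperparameters:

```python
# Rough KV-cache arithmetic: cache one small latent per token instead of the
# full keys and values for every attention head.
n_layers, n_heads, head_dim = 60, 128, 128
d_kv     = 2 * n_heads * head_dim   # full K and V per token, per layer
d_latent = 512                      # shared compressed latent per token, per layer

def cache_gb(per_token_floats, tokens, bytes_per=2):  # fp16/bf16 storage
    return n_layers * per_token_floats * tokens * bytes_per / 1e9

tokens = 32_768  # one long-context sequence
print(f"full KV cache:   {cache_gb(d_kv, tokens):.1f} GB")
print(f"latent KV cache: {cache_gb(d_latent, tokens):.1f} GB")
print(f"compression:     {d_kv // d_latent}x fewer floats per token")
```

With numbers in this ballpark, a single long sequence's full KV cache alone would overflow a GPU, while the compressed latent version fits comfortably, which is exactly the kind of per-GPU memory relief that lets you use fewer GPUs overall.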
They also made major advances in GPU communication efficiency through their DualPipe algorithm and custom communication kernels. This system intelligently overlaps computation and communication, carefully balancing GPU resources between these tasks. They only need about 20 of their GPUs' streaming multiprocessors (SMs) for communication, leaving the rest free for computation. The result is much higher GPU utilization than typical training setups achieve.
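The underlying principle of hiding communication behind computation can be sketched with a toy timing experiment (sleeps stand in for GPU kernels and all-to-all transfers; this shows only the scheduling idea, not the DualPipe algorithm itself):

```python
import time
from concurrent.futures import ThreadPoolExecutor

COMPUTE_S, COMM_S = 0.20, 0.15  # stand-ins for one micro-batch's compute
                                # time and its communication time

def compute():      time.sleep(COMPUTE_S)
def communicate():  time.sleep(COMM_S)

# sequential schedule: every step pays for compute + communication
t0 = time.perf_counter()
for _ in range(4):
    compute(); communicate()
sequential = time.perf_counter() - t0

# overlapped schedule: ship batch i's results while computing batch i+1
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    comm = None
    for _ in range(4):
        c = pool.submit(compute)
        if comm: comm.result()   # previous transfer finishes under this compute
        c.result()
        comm = pool.submit(communicate)
    comm.result()
overlapped = time.perf_counter() - t0

print(f"sequential {sequential:.2f}s vs overlapped {overlapped:.2f}s")
```

When communication is fully hidden under the next compute step, its cost nearly disappears from the total, which is why reserving a small slice of SMs for communication can raise overall GPU utilization so much.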
Another very smart thing they did is to use what is known as a Mixture-of-Experts (MOE) Transformer architecture, but with key innovations around load balancing. As you might know, the size or capacity of an AI model is often measured in terms of the number of parameters the model contains. A parameter is just a number that stores some attribute of the model; either the "weight" or importance a particular artificial neuron has relative to another one, or the importance of a particular token depending on its context (in the "attention mechanism"), etc.
Meta's latest Llama3 models come in a few sizes, for example: a 1 billion parameter version (the smallest), a 70B parameter model (the most commonly deployed one), and even a massive 405B parameter model. This largest model is of limited utility for most users because you would need to have tens of thousands of dollars worth of GPUs in your computer just to run it at tolerable speeds for inference, at least if you deployed it in the naive full-precision version. Therefore most of the real-world usage and excitement surrounding these open source models is at the 8B parameter or highly quantized 70B parameter level, since that's what can fit in a consumer-grade Nvidia 4090 GPU, which you can buy now for under $1,000.
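The back-of-envelope math connecting parameter counts to VRAM is simple: bytes needed just for the weights is roughly parameters × bits ÷ 8. A quick sketch:

```python
# VRAM needed just to hold model weights (ignores activations and KV cache).
def weight_gb(params_b, bits):
    """GB of memory for `params_b` billion parameters at `bits` per weight."""
    return params_b * bits / 8  # (params_b * 1e9 * bits / 8) / 1e9

for params_b in (1, 8, 70, 405):
    fp32, fp16, int4 = (weight_gb(params_b, b) for b in (32, 16, 4))
    print(f"{params_b:>4}B params: {fp32:7.1f} GB fp32 | "
          f"{fp16:7.1f} GB fp16 | {int4:6.1f} GB 4-bit")
# Compare against a 24 GB RTX 4090 or an 80 GB H100.
```

This is why full-precision 405B is out of reach for consumers, and why the action is at 8B (about 16 GB at fp16) and heavily quantized 70B.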
So why does any of this matter? Well, in a sense, the parameter count and precision tells you something about how much raw information or data the model has stored internally. Note that I'm not talking about reasoning ability, or the model's "IQ" if you will: it turns out that models with even surprisingly modest parameter counts can show remarkable cognitive performance when it comes to solving complex logic problems, proving theorems in plane geometry, SAT math problems, etc.
But those small models aren't going to be able to necessarily tell you every aspect of every plot twist in every single novel by Stendhal, whereas the really big models can potentially do that. The "cost" of that extreme level of knowledge is that the models become very unwieldy both to train and to do inference on, because you always need to store every single one of those 405B parameters (or whatever the parameter count is) in the GPU's VRAM at the same time in order to do any inference with the model.
The beauty of the MOE model approach is that you can decompose the big model into a collection of smaller models, each of which knows different, largely non-overlapping pieces of knowledge. DeepSeek's innovation here was developing what they call an "auxiliary-loss-free" load balancing strategy that maintains efficient expert utilization without the usual performance degradation that comes from load balancing. Then, depending on the nature of the inference request, you can intelligently route the inference to the "expert" models within that collection that are most able to answer that question or solve that task.
You can loosely think of it as being a committee of experts who have their own specialized knowledge domains: one might be a legal expert, the other a computer science expert, the other a business strategy expert. So if a question comes in about linear algebra, you don't give it to the legal expert. This is of course a very loose analogy and it doesn't actually work like this in practice.
The real advantage of this approach is that it allows the model to contain a huge amount of knowledge without being very unwieldy, because even though the aggregate number of parameters is high across all the experts, only a small subset of these parameters is "active" at any given time, which means that you only need to store this small subset of weights in VRAM in order to do inference. In the case of DeepSeek-V3, they have an absolutely massive MOE model with 671B parameters, so it's much bigger than even the largest Llama3 model, but only 37B of these parameters are active at any given time— enough to fit in the VRAM of two consumer-grade Nvidia 4090 GPUs (under $2,000 total cost), rather than requiring one or more H100 GPUs which cost something like $40k each.
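A toy numpy sketch of top-k expert routing makes the "only a small subset is active" point concrete (illustrative dimensions and a plain softmax gate of my own devising, not DeepSeek's actual router or their auxiliary-loss-free balancing):

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, top_k = 16, 2        # illustrative, not DeepSeek's config
d_model, d_ff = 128, 256
# each expert is a small 2-layer MLP; only top_k of them run per token
W1 = rng.normal(size=(n_experts, d_model, d_ff)) * 0.02
W2 = rng.normal(size=(n_experts, d_ff, d_model)) * 0.02
Wg = rng.normal(size=(d_model, n_experts)) * 0.02   # router / gate

def moe_forward(x):
    logits = x @ Wg
    idx = np.argsort(logits)[-top_k:]                # pick the top-k experts
    gate = np.exp(logits[idx]); gate /= gate.sum()   # renormalized gate weights
    # weighted sum of the chosen experts' outputs (ReLU MLPs)
    return sum(g * (np.maximum(x @ W1[i], 0) @ W2[i]) for g, i in zip(gate, idx))

y = moe_forward(rng.normal(size=d_model))
print(y.shape)
print(f"active expert params per token: {top_k / n_experts:.1%} of the total")
```

All experts' weights must exist somewhere, but per token only the routed slice does any work, which is the same reason DeepSeek-V3's 671B total parameters need only 37B active at inference time.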
It's rumored that both ChatGPT and Claude use an MoE architecture, with some leaks suggesting that GPT-4 had a total of 1.8 trillion parameters split across 8 models containing 220 billion parameters each. Despite that being a lot more doable than trying to fit all 1.8 trillion parameters in VRAM, it still requires multiple H100-grade GPUs just to run the model because of the massive amount of memory used.
Beyond what has already been described, the technical papers mention several other key optimizations. These include their extremely memory-efficient training framework that avoids tensor parallelism, recomputes certain operations during backpropagation instead of storing them, and shares parameters between the main model and auxiliary prediction modules. The sum total of all these innovations, when layered together, has led to the ~45x efficiency improvement numbers that have been tossed around online, and I am perfectly willing to believe these are in the right ballpark.
One very strong indicator that it's true is the cost of DeepSeek's API: despite this nearly best-in-class model performance, DeepSeek charges something like 95% less money for inference requests via its API than comparable models from OpenAI and Anthropic. In a sense, it's sort of like comparing Nvidia's GPUs to the new custom chips from competitors: even if they aren't quite as good, the value for money is so much better that it can still be a no-brainer depending on the application, as long as you can qualify the performance level, prove that it's good enough for your requirements, and confirm that the API availability and latency are good enough (thus far, people have been amazed at how well DeepSeek's infrastructure has held up despite the truly incredible surge of demand owing to the performance of these new models).
But unlike the case of Nvidia, where the cost differential is the result of them earning monopoly gross margins of 90%+ on their data-center products, the cost differential of the DeepSeek API relative to the OpenAI and Anthropic API could be simply that they are nearly 50x more compute efficient (it might even be significantly more than that on the inference side— the ~45x efficiency was on the training side). Indeed, it's not even clear that OpenAI and Anthropic are making great margins on their API services— they might be more interested in revenue growth and gathering more data from analyzing all the API requests they receive.
Before moving on, I'd be remiss if I didn't mention that many people are speculating that DeepSeek is simply lying about the number of GPUs and GPU hours spent training these models because they actually possess far more H100s than they are supposed to have given the export restrictions on these cards, and they don't want to cause trouble for themselves or hurt their chances of acquiring more of these cards. While it's certainly possible, I think it's more likely that they are telling the truth, and that they have simply been able to achieve these incredible results by being extremely clever and creative in their approach to training and inference. They explain how they are doing things, and I suspect that it's only a matter of time before their results are widely replicated and confirmed by other researchers at various other labs.
A Model That Can Really Think
The newer R1 model and technical report might be even more mind-blowing, since they were able to beat Anthropic to chain-of-thought and are now basically the only ones besides OpenAI who have made this technology work at scale. But note that the O1 preview model was only released by OpenAI in mid-September of 2024. That's only ~4 months ago! Something you absolutely must keep in mind is that, unlike OpenAI, which is incredibly secretive about how these models really work at a low level and won't release the actual model weights to anyone besides partners like Microsoft and others who sign heavy-duty NDAs, these DeepSeek models are both completely open-source and permissively licensed. They have released extremely detailed technical reports explaining how they work, as well as code that anyone can look at and try to copy.
With R1, DeepSeek essentially cracked one of the holy grails of AI: getting models to reason step-by-step without relying on massive supervised datasets. Their DeepSeek-R1-Zero experiment showed something remarkable: using pure reinforcement learning with carefully crafted reward functions, they managed to get models to develop sophisticated reasoning capabilities completely autonomously. This wasn't just about solving problems— the model organically learned to generate long chains of thought, self-verify its work, and allocate more computation time to harder problems.
The technical breakthrough here was their novel approach to reward modeling. Rather than using complex neural reward models that can lead to "reward hacking" (where the model finds bogus ways to boost its rewards that don't actually lead to better real-world model performance), they developed a clever rule-based system that combines accuracy rewards (verifying final answers) with format rewards (encouraging structured thinking). This simpler approach turned out to be more robust and scalable than the process-based reward models that others have tried.
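A minimal sketch of what such a rule-based reward can look like (my own illustration of the accuracy-plus-format idea; the tag names, score values, and exact-match check are assumptions, not DeepSeek's actual code):

```python
import re

def reward(completion: str, gold_answer: str) -> float:
    """Toy rule-based reward: a format reward for a structured
    <think>/<answer> layout plus an accuracy reward for a verifiably
    correct final answer. No learned reward model, so there is nothing
    for the policy to "hack" beyond actually getting the answer right."""
    r = 0.0
    m = re.fullmatch(r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*",
                     completion, re.DOTALL)
    if m:
        r += 0.5                               # format reward
        if m.group(2).strip() == gold_answer:  # accuracy reward, checked by rule
            r += 1.0
    return r

good = "<think>9 * 8 = 72</think><answer>72</answer>"
print(reward(good, "72"))                      # 1.5
print(reward("<answer>72</answer>", "72"))     # 0.0 (no reasoning trace)
```

Because both components are mechanically checkable, the reward signal scales to millions of RL rollouts without a separate reward network in the loop.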
What's particularly fascinating is that during training, they observed what they called an "aha moment," a phase where the model spontaneously learned to revise its thinking process mid-stream when encountering uncertainty. This emergent behavior wasn't explicitly programmed; it arose naturally from the interaction between the model and the reinforcement learning environment. The model would literally stop itself, flag potential issues in its reasoning, and restart with a different approach, all without being explicitly trained to do this.
The full R1 model built on these insights by introducing what they call "cold-start" data— a small set of high-quality examples— before applying their RL techniques. They also solved one of the major challenges in reasoning models: language consistency. Previous attempts at chain-of-thought reasoning often resulted in models mixing languages or producing incoherent outputs. DeepSeek solved this through a clever language consistency reward during RL training, trading off a small performance hit for much more readable and consistent outputs.
The results are mind-boggling: on AIME 2024, one of the most challenging high school math competitions, R1 achieved 79.8% accuracy, matching OpenAI's O1 model. On MATH-500, it hit 97.3%, and it placed in the 96.3rd percentile on Codeforces programming competitions. But perhaps most impressively, they managed to distill these capabilities down to much smaller models: their 14B parameter version outperforms many models several times its size, suggesting that reasoning ability isn't just about raw parameter count but about how you train the model to process information.
The Fallout
The recent scuttlebutt on Twitter and Blind (a corporate rumor website) is that these models caught Meta completely off guard and that they perform better than the new Llama4 models which are still being trained. Apparently, the Llama project within Meta has attracted a lot of attention internally from high-ranking technical executives, and as a result they have something like 13 individuals working on the Llama stuff who each individually earn more per year in total compensation than the combined training cost for the DeepSeek-V3 models which outperform it. How do you explain that to Zuck with a straight face? How does Zuck keep smiling while shoveling multiple billions of dollars to Nvidia to buy 100k H100s when a better model was trained using just 2k H100s for a bit over $5mm?
But you better believe that Meta and every other big AI lab is taking these DeepSeek models apart, studying every word in those technical reports and every line of the open source code they released, trying desperately to integrate these same tricks and optimizations into their own training and inference pipelines. So what's the impact of all that? Well, naively it sort of seems like the aggregate demand for training and inference compute should be divided by some big number. Maybe not by 45, but maybe by 25 or even 30? Because whatever you thought you needed before these model releases, it's now a lot less.
Now, an optimist might say "You are talking about a mere constant of proportionality, a single multiple. When you're dealing with an exponential growth curve, that stuff gets washed out so quickly that it doesn't end up mattering all that much." And there is some truth to that: if AI really is as transformational as I expect, if the real-world utility of this tech is measured in the trillions, if inference-time compute is the new scaling law of the land, if we are going to have armies of humanoid robots running around doing massive amounts of inference constantly, then maybe the growth curve is still so steep and extreme, and Nvidia has a big enough lead, that it will still work out.
But Nvidia is pricing in a LOT of good news in the coming years for that valuation to make sense, and when you start layering all these things together into a total mosaic, it starts to make me at least feel extremely uneasy about spending ~20x the 2025 estimated sales for their shares. What happens if you even see a slight moderation in sales growth? What if it turns out to be 85% instead of over 100%? What if gross margins come in a bit from 75% to 70%— still ridiculously high for a semiconductor company?
Wrapping it All Up
At a high level, NVIDIA faces an unprecedented convergence of competitive threats that make its premium valuation increasingly difficult to justify at 20x forward sales and 75% gross margins. The company's supposed moats in hardware, software, and efficiency are all showing concerning cracks. The whole world— thousands of the smartest people on the planet, backed by untold billions of dollars of capital resources— is trying to assail them from every angle.
On the hardware front, innovative architectures from Cerebras and Groq demonstrate that NVIDIA's interconnect advantage— a cornerstone of its data center dominance— can be circumvented through radical redesigns. Cerebras' wafer-scale chips and Groq's deterministic compute approach deliver compelling performance without needing NVIDIA's complex interconnect solutions. More traditionally, every major NVIDIA customer (Google, Amazon, Microsoft, Meta, Apple) is developing custom silicon that could chip away at high-margin data center revenue. These aren't experimental projects anymore— Amazon alone is building out massive infrastructure with over 400,000 custom chips for Anthropic.
The software moat appears equally vulnerable. New high-level frameworks like MLX, Triton, and JAX are abstracting away CUDA's importance, while efforts to improve AMD drivers could unlock much cheaper hardware alternatives. The trend toward higher-level abstractions mirrors how assembly language gave way to C/C++, suggesting CUDA's dominance may be more temporary than assumed. Most importantly, we're seeing the emergence of LLM-powered code translation that could automatically port CUDA code to run on any hardware target, potentially eliminating one of NVIDIA's strongest lock-in effects.
Perhaps most devastating is DeepSeek's recent efficiency breakthrough, achieving comparable model performance at approximately 1/45th the compute cost. This suggests the entire industry has been massively over-provisioning compute resources. Combined with the emergence of more efficient inference architectures through chain-of-thought models, the aggregate demand for compute could be significantly lower than current projections assume. The economics here are compelling: when DeepSeek can match GPT-4 level performance while charging 95% less for API calls, it suggests either NVIDIA's customers are burning cash unnecessarily or margins must come down dramatically.
The fact that TSMC will manufacture competitive chips for any well-funded customer puts a natural ceiling on NVIDIA's architectural advantages. But more fundamentally, history shows that markets eventually find a way around artificial bottlenecks that generate super-normal profits. When layered together, these threats suggest NVIDIA faces a much rockier path to maintaining its current growth trajectory and margins than its valuation implies. With five distinct vectors of attack— architectural innovation, customer vertical integration, software abstraction, efficiency breakthroughs, and manufacturing democratization— the probability that at least one succeeds in meaningfully impacting NVIDIA's margins or growth rate seems high. At current valuations, the market isn't pricing in any of these risks.
I hope you enjoyed reading this article. If you work at a hedge fund and are interested in consulting with me on NVDA or other AI-related stocks or investing themes, I'm already signed up as an expert on GLG and Coleman Research.