For a while now, companies like OpenAI and Google have been touting advanced "reasoning" capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical "reasoning" displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.
The fragility highlighted in these new results helps support previous research suggesting that LLMs' use of probabilistic pattern matching is missing the formal understanding of underlying concepts needed for truly reliable mathematical reasoning capabilities. "Current LLMs are not capable of genuine logical reasoning," the researchers hypothesize based on these results. "Instead, they attempt to replicate the reasoning steps observed in their training data."
Mix it up
In "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models"—currently available as a pre-print paper—the six Apple researchers start with GSM8K's standardized set of over 8,000 grade-school level mathematical word problems, which is often used as a benchmark for modern LLMs' complex reasoning capabilities. They then take the novel approach of modifying a portion of that testing set to dynamically replace certain names and numbers with new values—so a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.
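For a concrete picture of what that name-and-number swapping looks like, here is a minimal Python sketch of a GSM-Symbolic-style template. The template wording, name pool, and number ranges are illustrative stand-ins, not the paper's actual generation code.

```python
import random

# Turn a static GSM8K-style word problem into a template whose names and
# numbers are re-sampled on every run. All specifics below are invented
# for illustration.
TEMPLATE = (
    "{name} buys {blocks} building blocks for their {relative}. "
    "Each block costs ${price}. How much does {name} spend in total?"
)

NAMES = ["Sophie", "Bill", "Ava", "Marcus"]
RELATIVES = ["nephew", "brother", "niece", "cousin"]

def make_variant(seed: int) -> tuple[str, int]:
    """Return one symbolic variant of the problem and its ground-truth answer."""
    rng = random.Random(seed)
    blocks = rng.randint(5, 60)
    price = rng.randint(1, 9)
    question = TEMPLATE.format(
        name=rng.choice(NAMES),
        relative=rng.choice(RELATIVES),
        blocks=blocks,
        price=price,
    )
    # Only surface details change; the underlying reasoning is always the same.
    return question, blocks * price

if __name__ == "__main__":
    for seed in range(3):
        question, answer = make_variant(seed)
        print(question, "->", answer)
```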
This approach helps avoid any potential "data contamination" that can result from the static GSM8K questions being fed directly into an AI model's training data. At the same time, these incidental changes don't alter the actual difficulty of the inherent mathematical reasoning at all, meaning models should theoretically perform just as well when tested on GSM-Symbolic as GSM8K.
Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to result in worse accuracy than changing the names.
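To make that run-to-run spread concrete, the snippet below shows how a mean accuracy and a best-versus-worst gap would be summarized; the per-run accuracies are made-up numbers for illustration, not figures from the paper.

```python
from statistics import mean

# Illustrative per-run accuracies for one hypothetical model across several
# GSM-Symbolic-style runs (the paper uses 50 runs per model; these values
# are invented to show the summary, not taken from the results).
per_run_accuracy = [0.81, 0.74, 0.88, 0.79, 0.73, 0.85]

print(f"mean accuracy:  {mean(per_run_accuracy):.1%}")
print(f"best-worst gap: {max(per_run_accuracy) - min(per_run_accuracy):.1%}")
```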
This kind of variance—both within different GSM-Symbolic runs and compared to GSM8K results—is more than a little surprising since, as the researchers point out, "the overall reasoning steps needed to solve a question remain the same." The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any "formal" reasoning but are instead "attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data."
Don’t get distracted
Still, the overall variance shown for the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI's ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That's a pretty high success rate using either benchmark, regardless of whether or not the model itself is using "formal" reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).
The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding "seemingly relevant but ultimately inconsequential statements" to the questions. For this "GSM-NoOp" benchmark set (short for "no operation"), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that "five of them [the kiwis] were a bit smaller than average."
Adding in these red herrings led to what the researchers termed "catastrophic performance drops" in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits in using simple "pattern matching" to "convert statements to operations without truly understanding their meaning," the researchers write.
In the example with the smaller kiwis, for instance, most models try to subtract the smaller fruits from the final total because, the researchers surmise, "their training datasets included similar examples that required conversion to subtraction operations." This is the kind of "critical flaw" that the researchers say "suggests deeper issues in [the models'] reasoning processes" that can't be helped with fine-tuning or other refinements.
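As a worked illustration of why that detail should be a no-op, here is kiwi-style arithmetic with made-up daily counts (not the paper's numbers); the second line reproduces the subtraction mistake the researchers describe.

```python
# Kiwi-style GSM-NoOp question with invented counts.
friday, saturday, sunday = 30, 40, 60   # kiwis picked each day
smaller_than_average = 5                # the inconsequential detail GSM-NoOp adds

correct_total = friday + saturday + sunday                    # 130: the detail is ignored
pattern_matched_total = correct_total - smaller_than_average  # 125: the failure mode described above

print(correct_total, pattern_matched_total)
```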
The illusion of understanding
The results of this new GSM-Symbolic paper aren't completely new in the world of AI research. Other recent papers have similarly suggested that LLMs don't actually perform formal reasoning and instead mimic it with probabilistic pattern-matching of the closest similar data seen in their vast training sets.
Still, the new research highlights just how fragile this kind of mimicry can be when the prompt in question pushes it in a direction that doesn't precisely match any training data. It also highlights the inherent limitations in trying to perform high-level reasoning without any underlying model of the logic or world behind it. As Ars' Benj Edwards put it in a July story about AI video generation:

One of the reasons OpenAI's GPT-4 turned heads in text synthesis is that the model finally reached a size where it was large enough to have absorbed enough information (in training data) to give the impression that it might be able to genuinely understand and model the world when, in reality, a key aspect of its success is that it "knows" far more than most humans and can impress us by combining those existing concepts in novel ways. With enough training data and computation, the AI industry will likely reach what you might call "the illusion of understanding" with AI video synthesis eventually...
We're likely seeing a similar "illusion of understanding" with AI's latest "reasoning" models, and seeing how that illusion can break when the model runs into unexpected situations.
AI expert Gary Marcus, in his analysis of the new GSM-Symbolic paper, argues that the next big leap in AI capability will only come when these neural networks can integrate true "symbol manipulation, in which some knowledge is represented truly abstractly in terms of variables and operations over those variables, much as we see in algebra and traditional computer programming..." Until then, we're going to get the kind of brittle "reasoning" that can lead AI models to fail mathematical tests in ways that calculators never do.
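To see what Marcus means by representing knowledge "abstractly in terms of variables and operations over those variables," consider a toy Python version of an invented building-blocks question like the earlier example; this is only an illustration of the idea, not code from Marcus or the paper.

```python
# The problem's logic lives entirely in variables and the operations on them,
# so swapping names or numbers (as GSM-Symbolic does) cannot change the
# procedure. A toy illustration of symbol manipulation with invented details.

def blocks_cost(num_blocks: int, price_per_block: int) -> int:
    """Total spend is the number of blocks times the price per block, for any inputs."""
    return num_blocks * price_per_block

# Sophie's version and Bill's version of the question are the same abstract
# operation; like a calculator, the function is never distracted by surface details.
assert blocks_cost(31, 2) == 62
assert blocks_cost(19, 3) == 57
```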