2025-03-09 Alperen Keles. Verifiability is the Limit

LLMs have created enormous turmoil in the software engineering community over the past five years, much of it revolving around one central question: what is the future of our profession?

A set of competing paradigms with varying degrees of human intervention and control has emerged. On one side are agents with little to no human in the loop: the recently popularized “vibe coding”, which discourages manually modifying code in favor of interacting with the codebase through LLMs, and chatbots that implement standalone modules or features, sometimes even running code via a code-interpretation feature. On the other side are coding assistants that predict what the user wants to implement, acting as “autocomplete on steroids”.

There are many beliefs about the limits, usefulness, and potential of these models. Some believe that, given unprecedented amounts of time and resources, the models will ascend to human-level capabilities and automate the construction of production-grade software systems, while skeptics belittle the models, claiming their code will be mediocre and trivial, unable to scale up to any kind of production.

I am not here to argue about the limits or capabilities of the models themselves; there are experts out there with far deeper theoretical knowledge and practical experience than I have. I will, however, argue the following.

Every piece of software in the world has to be “correct” with respect to some notion of correctness. Before LLMs, our time was split between writing, reading, and verifying code. With LLMs, we are able to offload the code-writing phase, but the verification process cannot be offloaded, only pushed up to a different layer.

Out there, you will see people arguing that correctness only matters in a small set of domains, such as rockets or cryptography. Make no mistake: the degree of rigor correctness requires changes, but the requirement for correctness does not. Software is always created with a purpose, whether it is to consume and transform some data, to provide some information to a user, or to allow a user to interact and create new data. Correctness is the notion that the created software conforms to the intent of its creator, and the creator has to verify such conformance.
In less critical applications and domains, correctness is a matter of trial. If I’m making a website to present information about a party I’m throwing, I will check that the website shows all the relevant information I wanted it to. If the party is just with some friends, that is probably good enough. If the party is for work, I will try it on my phone in addition to my computer, and perhaps even ask a few friends with different devices to check it out and make sure everything looks good.

Even in the example above, the importance of correctness is contextual. As domains and scale change, the mechanisms by which we verify correctness also change. When we are creating software as a business application, experts on “trying” (QAs) are hired to test many paths through the program to discover unexpected behavior, a.k.a. bugs. A typical company spends a great deal of time testing its code in addition to trying it, and the rigor of that testing changes with the number of users, the money at stake, and the criticality of the application domain at hand. A small SaaS company might only write some integration tests covering a few scenarios, or unit tests for some of its components, whereas AWS will mathematically prove the correctness of the S3 implementation that serves petabytes of data every day.

So, based on the discussion so far, it should be possible for an LLM to generate a codebase that does not require rigorous correctness. Perhaps it could even generate tests, or even write proofs about the code it writes, right? Maybe LLMs could really remove humans from the process and work on their own as a set of cooperating agents; if not now, perhaps when they’re better. I’ve heard this argument multiple times in discussions. The problem is that we have not taken into account the verifiability of the code. We have talked about rigor and scale, but rigor is not enough as a parameter for how amenable a domain is to LLMs; we need to talk about how the code an LLM generates is verified, and that includes the testing code too.

Let’s take a step back and consider UI programming (mostly connoted with frontend development). Anecdotal evidence from many practitioners states that LLMs are able to generate entire frontend applications from one prompt, or even from a drawing on a napkin. We see specialized tools such as V0 that focus solely on UI generation. Yet LLMs are not as successful, or at least not as popularized, in other domains. A popular folk explanation for this phenomenon is that LLMs are trained on public-domain repositories, and UI programming makes up a great portion of such code. Yet that does not explain why LLMs are not as popular for web server applications (mostly connoted with backend development), which are perhaps almost as prevalent as UI programming and have even less diversity and more structure in their code.

There are psychological explanations for this discrepancy between frontend and backend popularity, mainly that creating UIs that are shiny and impressive is easier than creating backends that could attract the same level of attention, so anecdotal evidence about the relative popularity or usability of backend versus frontend generation should not be given much credit. Still, I hypothesize that UIs are much easier to verify than anything else we write code for.

Verification of a UI is almost literally “looking”. When we look at a web page, we almost instantly recognize how the generated code diverges from our intent and can provide instant feedback to the LLM. Verification of a web server requires setting up a set of test inputs, sending them to the server, perhaps creating some ephemeral state, and checking that the outputs conform to our expectations. Web servers are, in fact, much easier to write tests for than to verify by trying, whereas writing UI tests is a thirty-year-old pain point that has not yet been solved.
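To make that contrast concrete, here is a minimal sketch of what “trying” a web server looks like in Python. The `/parties` endpoint, the payload shape, and the use of the `requests` library are assumptions made for illustration, not code from the article.

```python
# A minimal sketch of verifying a web server by trial: set up an input,
# send it, create some ephemeral state, and check that the output matches intent.
# The /parties endpoint and payload shape are hypothetical.
import requests

BASE_URL = "http://localhost:8000"

def check_party_endpoint():
    # Create some ephemeral state: a party the server should remember.
    payload = {"name": "Launch party", "date": "2025-03-21"}
    created = requests.post(f"{BASE_URL}/parties", json=payload, timeout=5)
    assert created.status_code == 201

    # Verify the output conforms to our intent: the party is listed back.
    listed = requests.get(f"{BASE_URL}/parties", timeout=5)
    assert listed.status_code == 200
    names = [p["name"] for p in listed.json()]
    assert "Launch party" in names

if __name__ == "__main__":
    check_party_endpoint()
    print("server behaves as intended for this scenario")
```

Even this tiny check needs a running server, state setup, and explicit expectations, whereas the equivalent UI check is a single glance at the rendered page.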

The rise of “vibe coding games” last week seems to stem from the same root. The generated games are certainly impressive, but their verification is possible via trial. Typically implemented on a single server with no persistent state, the games gloss over complex networking or performance problems by degrading the user experience or removing the features that cause those problems rather than solving the problems themselves. This is not to say it is an invalid choice, but the reason for the choice is not that those things do not matter; it is that their correctness is harder to verify, requiring custom testing infrastructure or domain knowledge.

Going back to the title, verifiability is the limit.

It is the limit that tells us what we cannot implement via LLMs, and it cannot be solved with agentic approaches. A security agent might in theory add security checks to an application, but such a check is worthless until the person whose intent the code is meant to realize verifies it. A testing agent might add tests to a program, but the tests themselves are meaningless until we check that they correspond to our intent.

So far, I have theorized about LLMs and the way we produce software to make my point. Now I shall preach about what I think we should do.

If verifiability is the limit, if it is the bottleneck of using LLMs for programming, the natural question is: how do we raise the limit, how do we make verification easier?

I believe the first step we need to take is to concede defeat on the ideology of infinitely powerful and scalable software agents. If verifiability is really the limit, no amount of agents is going to solve the problem. The way I see it, we need better tools and interfaces for verification. For example, instead of the programmer reading the tests an LLM generates, the generated tests could be summarized into a human-readable format, though summarization of course also poses a risk of information loss. We should rely much more on declarative random testing methods such as property-based testing, where the programmer defines a predicate that should hold over all possible inputs, and this predicate is tested by generating random inputs and passing them to the program. Such methods have been used to enhance the reliability of programs in many domains, but they haven’t caught on in the broader software engineering community. The advantage of such methods is that one general predicate is much easier to inspect and understand than many unit tests, in addition to the added testing strength.
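To illustrate, here is a minimal property-based testing sketch in Python using the Hypothesis library. The function under test (`dedupe_keep_order`) and the properties chosen are hypothetical examples, not code from the article; the point is that one declarative predicate, checked against many random inputs, is easier to inspect than a pile of hand-picked unit tests.

```python
# A minimal property-based testing sketch using the Hypothesis library.
# The function under test and the three properties are illustrative
# assumptions; the point is that one declarative predicate replaces
# many hand-written unit tests.
from hypothesis import given, strategies as st

def dedupe_keep_order(xs):
    """Remove duplicates from a list while preserving first-occurrence order."""
    seen = set()
    out = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

@given(st.lists(st.integers()))
def test_dedupe_properties(xs):
    result = dedupe_keep_order(xs)
    # Property 1: the output contains no duplicates.
    assert len(result) == len(set(result))
    # Property 2: no elements are lost or invented.
    assert set(result) == set(xs)
    # Property 3: relative order is preserved (result is a subsequence of xs).
    it = iter(xs)
    assert all(any(x == y for y in it) for x in result)
```

Hypothesis generates many random lists and shrinks any failing input to a minimal counterexample, so the human effort shifts from enumerating cases to reviewing whether the three properties actually capture the intent.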

We also need to expand the vocabulary we use when talking about correctness. We need a wider understanding of performance, security, accessibility, and flexibility, and of how to measure such qualities of programs, so that we know what we are looking for in our applications and have mechanisms for verifying these qualities. In the status quo, correctness typically means functional correctness, i.e., “the input-output behavior is as expected”, while most other expectations of an application are bundled under “non-functional properties”, which the software engineering community has largely neglected to properly measure and verify.

Let me sum up with a prediction. For a very long time, I believed LLMs would not be good coders, even with further improvements. My views shifted after their success in competitive programming; as a reformed competitive programmer, perhaps that was my Lee Sedol moment. I now believe, although tentatively, that LLMs can indeed succeed in all domains with perfect oracles.

It is important to understand that this does not mean LLMs will be gods producing 100x code, because virtually no domain in which software engineering is useful has a perfect oracle. A perfect oracle is a form of feedback that gives you a definitive “correct/incorrect” answer every single time, and such oracles almost only appear in games, as the real world typically doesn’t have perfect models of correctness. Winning or losing a game is a perfect oracle, as is writing a program that passes the judge in a competitive programming contest.
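As a rough sketch of what such an oracle looks like in code, here is a hypothetical competitive-programming judge written in Python; the toy problem (summing two integers) and the function names are assumptions for illustration.

```python
# A minimal sketch of a "perfect oracle" in the competitive-programming sense:
# every candidate solution receives an unambiguous accept/reject verdict.
# The toy problem (sum two integers from one input line) is hypothetical.
def judge(solution, test_cases):
    """Return True only if the solution matches the expected output on every case."""
    for input_text, expected_output in test_cases:
        if solution(input_text).strip() != expected_output.strip():
            return False  # a definite "incorrect"
    return True  # a definite "correct"

def candidate(input_text):
    # A candidate solution: read two integers and return their sum.
    a, b = map(int, input_text.split())
    return str(a + b)

cases = [("1 2", "3"), ("10 -4", "6")]
print("accepted" if judge(candidate, cases) else "rejected")
```

Every candidate gets a binary verdict, which is exactly the kind of feedback most real-world software never receives.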

Even the most formalized corner of programming, theorem provers in which users can prove their code correct, does not give us a perfect oracle: a prover cannot tell you whether you are on a good path at an intermediate point in the proof; it only tells you whether your proof is complete or whether you are stuck. I hope that this imperfect oracle will be enough to win the game of proving, and that LLMs will become better than us at automated proofs. If such a day comes, our next job could be creating new theorems, asking LLMs to produce code and proofs, and scaling such systems into production-grade codebases safely and securely.
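For readers unfamiliar with that feedback loop, here is a tiny Lean 4 sketch of the kind of verdict a prover gives; the theorem chosen is a trivial stand-in, not anything from the article.

```lean
-- A minimal Lean 4 sketch of the prover's feedback: it either accepts a
-- finished proof or reports the goals that remain; it never says whether
-- the current path is promising.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  -- Replacing the next line with `sorry` would leave the goal open:
  -- the oracle reports "stuck", not "almost there".
  exact Nat.add_comm a b
```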

Please let me know what you think of this at [akeles@umd.edu](mailto:akeles@umd.edu), and share with others if you found this interesting.
