Formal verification is still too limited to be useful for most app developers. The article gives an example of an e-commerce platform using it to prove the correctness of managing refunds, but then acknowledges:
> As of today, the formally verified core can handle most effect-free logic—invariants, transitions, conflict resolution. But the UI, network calls, and database interactions typically sit outside the verification boundary. Verification makes the core airtight but doesn’t guarantee end-to-end correctness.
So you can formally prove that your e-commerce refund management logic is correct, except for proving that you processed the refund. You can't even prove anything about recording the refund in your database, say nothing of proving anything about your interactions with your payment processor.
If your app is mostly tricky logic with just a bit of I/O, your app is very unusual, and it's almost certainly not an e-commerce app. E-commerce apps are mostly CRUD apps; I/O with the database, the UI, and third-party APIs (e.g. payment processors) is 99% of the code.
Even property-based testing is mostly unhelpful for e-commerce apps like these.
Instead, think of formal verification as a runtime performance improvement of property-based testing. If property-based testing is useful for your app (it probably isn't), then you may be able to convert some of your property-based tests into formal verifications.
But, honestly, you probably can't do it, not even with a high budget of tokens.
I'd love to be proven wrong, but the way to do it would be to formally prove the correctness of non-trivial open-source code with property tests. Perhaps you could formally verify significant chunks of Postgres! (But I doubt it.)
I don't know a lot about formal verification, but:
> So you can formally prove that your e-commerce refund management logic is correct, except for proving that you /processed the refund/. You can't even prove anything about recording the refund in your database, say nothing of proving anything about your interactions with your payment processor.
You could say the same thing about the viability of functional programming on a CRUD webapp, but languages like clojure have been used to great effect here. The fact that thera are important, even fundamental, bits that you cannot verify, doesn't take out value from the fact that you can eliminate whole dimensions of issues.
I actually did take a formal verification course in college. Our final project was to use the techniques we'd been learning to verify some classic critical-section locking algorithm. I chose to verify an implementation of Lamport's bakery algorithm[0] in C (this was the 90s -- a lot of code was still being written in C).
The problem is that Lamport's algorithm makes an assumption that the "ticket number" is unbounded and any finite implementation in C will almost certainly use a value which is limited to 32 bits or so.
So I was able to formally verify that the algorithm fails to protect the critical section if enough processes are kept waiting to overflow the counter. :)
This probably just means that Lamport's algorithm isn't a great choice for such environments, but I'm still bummed that the professor gave me a B.
You shouldn't have been able to formally verify the algorithm fails to protect the critical section. Wrapping ticket numbers can lead to starvation (literally, if we follow the baker analogy), but the algorithm protects the critical section so long as thread IDs are unique.
The sort of environments in which this is a problem would be extremely uncommon. For a start, you need continuous contention. If you ever get a break in contention you no longer have starvation, and you 'reset' the ticket number monotonicity - to zero, if you actually take a maximum of entering thread ticket numbers instead of a cheap global counter.
If you do actually have an environment in which you expect to have continuous contention over a critical section, some quick napkin math can tell you if it's something to worry about based on the running time of your critical section. If it's over, say, 10ms, you've got a few years of runtime before it's a problem. If it's under 1ms, maybe you want to use 64-bit arithmetic for your ticket number so you can run your system until long after the human species is extinct.
You probably got a 'B' because you didn't give the professor the answer they expected, though, not because of any technical reasons.
I'm quite certain I formally verified that two processes were both able to get into the critical section WITH THE ALGORITHM I WAS VERIFYING, but this was 35 years ago, so details are fuzzy.
Remember that this is long before Wikipedia and even before Google, so I have no idea where I'd have gotten the alleged Baker's algorithm from.
Also you're right that this isn't much of an issue in practice, but that's not what formal verification is about, now is it? :)
Well, I'm someone who barely knows more than jack about formal verification, but in pretty much every case you have to have some kind of model that you are actually verifying.
How close that model sits to the real thing you have modeled is an important question, and you are free to be as close or distant as you want -- e.g. for verifying different properties of a programming language you might decide to not care about CPU instructions, registers, etc, and only care about the semantic model. This has absolutely many use cases (e.g. whether a particular optimization is sound) where this "model mismatch" doesn't matter, this doesn't make formal verification useless in any way or shape, imo.
Getting back to at the "e-commerce refund management" -- you can absolutely have a model that does e.g. a particular database IO call that either succeeds or not. With such a model in place, you can have the rest of your codebase formally verified and know that 'with a properly working database it will always work correctly' [1]. Is that not a very significant and useful finding in and of itself? Would you be more confident in your end-to-end tested software than the above?
Especially that one can then separately test that particular call site as deeply as they want, to determine that the assumed property (it either returns success or fails) is sound.
[1] Given a correct specification, which is not easy to get right
Formal verification is a siren song. The siren sings, "bug-free code is possible in principle!" But it's a trap. Even with LLMs, bug-free code is impractical.
I argued that property-based testing is mostly unhelpful for e-commerce/CRUD apps, and that formal verification is a performance improvement on property-based tests.
In a property-based test, you identify some rule (an invariant) that you want to apply to your code. Then, you fuzz your app, testing it with autogenerated inputs, failing the test if the rule is broken at any point. In formal verification, you prove that the code always satisfies the rule, so you don't have to try millions of inputs.
Whether you're doing property-based testing or formal verification, it's extremely difficult to think of any non-trivial business logic properties that should apply to CRUD apps, even if they could be written in English, translated perfectly into code, and verified formally, instantly.
An actual rule that should always be followed, inflexibly, such that a mathematical proof would be useful (and that actually matters to your business) is so rare in CRUD apps that I'm not sure I've ever seen one.
Even with general-purpose rules (the app should never crash, the app should not leak memory), the property-based fuzzers tend to find bugs that have never happened in production, and probably never will. It's rarely economical for an e-commerce app developer to spend time fixing those bugs, even if finding them cost nothing at all (which is not remotely true, even with LLMs).
And what about UI? Maybe you'd want a rule like: "The title of the product for sale should never overflow its container rectangle in the UI."
OK, well, what if the title is one very long word? But… none of the products you sell happen to contain any words that are 500 characters with no spaces. I guess you could add code to prevent that product from ever being created? (And ensure that data in the database will never allow product titles that violate your business rules… how, exactly?)
Formal verification shines where property-based testing is already useful. It's already useful for many software platforms. It's useful for databases, where reliability is essential. It's useful for parsers, particularly when you expect the end user to be attempting to send you hostile code.
> An actual rule that should always be followed, inflexibly, such that a mathematical proof would be useful (and that actually matters to your business) is so rare in CRUD apps that I'm not sure I've ever seen one.
I see them often. Nearly every CRUD app I have come across in the wild has, for example, employed some form of email validation. To your concern, the rules of email validation are well defined and are unlikely to ever change. Importantly, a large percentage of the implementations I saw got it wrong.
As a user, I have also tried to use CRUD apps that have rejected my valid email address, so this isn't even a class of problems that will never be realized in practice. It is a pain I have felt in the real world. The use of PBTs or proofs would have easily uncovered the implementation failures.
IMO the industry consensus is never to "validate" email addresses syntactically, but simply to ensure that the email address contains at least one @ and to verify the email address by emailing it an activation code.
Proofs would not have uncovered these failures. The proofs would have proved that they rejected your email address as invalid, and the developers would have patted themselves on the back for a job well done.
> The rules of email validation are not remotely well defined!
RFC 5322 fully defines the structure.
> IMO the industry consensus is never to "validate" email addresses syntactically
That is true, but not because it isn't well defined, but because it is hard to get right. Keep in mind that most developers don't even know what a property-based test is, and of those that do, only a small subset of them know how to use them. If you find any testing around it at all, which is a stretch to begin with, you will be lucky to find more than a small set of common addresses without any care or concern for the complex edge cases that lead to problems like I have had as a user in the past. Encouraging developers to only validate for the presence of @ means that there is no additional room to screw things up.
But better than to rely on gimping your code to deal with developers is to use the tools at your disposal.
> and to verify the email address by emailing it an activation code.
That may also be beneficial, but not for the same reason. Not all CRUD use-cases fit that mould.
Don't forget RFC 5321! But the RFCs are ignored in practice by all popular mail servers. There are email addresses that work in practice that don't comply with the RFCs, and there are email addresses that the RFCs permit that don't work in practice.
(This happens with a lot of standards; sometimes people just ignore them and do their own thing. Something similar has happened with SVGs.)
If you write a formal verification of a syntactical email validator that ensures that all/only RFC-compliant email addresses are valid, you'll have completely wasted your time. Don't do it. Just check for at least one @ sign, and email the address to test it.
(This is a perfect example of the trap of formal verification.)
> But the RFCs are ignored in practice by all popular mail servers.
While I agree that you can make a compelling case that sending email is the C in CRUD, usually when someone is talking about CRUD they are referring to systems that satisfy all four letters. U violates the spirit of email, and R and D is usually handled independently of the MTA. So what email servers do here is irrelevant. If you go way back up the comment chain you will even see that Postgres was specifically mentioned. Postgres doesn't care what an email server does, but it does care about data consistency.
> Just check for at least one @ sign
That's a valid specification and in practice you are going to want to make that a PBT to ensure that your implementation actually adheres to the specification. You might try testing foo@bar.com, but what about foo\u0040bar.com? Will you think to test it too? Probably not. Will your code handle it correctly? You may be perfect, but when we get out into the general developer population where all kinds of crazy things show up when they start monkeying with your code, the answer is also probably not.
Looking simple isn’t a reason to not use the tools at your disposal, even if many won’t.
> and email the address to test it.
Poor general advice. That is expressly illegal in some jurisdictions.
I'm speaking from painful experience here: if you assume RFC 5322 has anything whatsoever to do with how email addresses actually work in the wild, you're in for a world of hurt. Popular email providers don't give a shit what RFC 5322 says, and you can't either if you want to have any hope of actually sending and receiving mail. Test messages are the only way to validate an email address, period.
> It's already useful for many software platforms. It's useful for databases, where reliability is essential. It's useful for parsers, particularly when you expect the end user to be attempting to send you hostile code.
> But e-commerce apps?
So the e-commerce app doesn't use a DB?
And the e-commerce app receives zero user input that needs to be parser so doesn't use any parser either?
If you say its useful for, for example, DB and parsers, then any app using these benefits from it.
Read carefully. There's a big difference between developing a database and developing an app that uses a database.
If you're developing a database, you should use property-based tests to ensure that your database behaves as expected (ACID reliability, etc.). If you can formally verify parts of your database, you won't have to fuzz it, because it will have been proven correct. But if you're developing a CRUD app that uses a database, there may be no properties of your app at all that are worth fuzzing.
Similarly, if you write a library that parses Markdown, you should write property-based tests of it, and fuzz it. If you use a library that parses Markdown, you should pass it your Markdown, and let the library handle it.
E-commerce apps typically don't need to do any non-trivial parsing. Effect-free logic is likely to be ~1% of your code base, or less.
This is why we keep talking past each other: the techniques that make sense for real-time operating systems don't make sense for an e-commerce app, a line-of-business admin dashboard, or a single-player game.
> But if you're developing a CRUD app that uses a database, there may be no properties of your app at all that are worth fuzzing.
Surely we have seen enough SQL injection issues over the years to know that, where "uses a database" means using SQL, fuzzing (or formal verification) is a necessity? That only scratches the surface of where fuzzing and property-based testing can be applicable to CRUD apps.
> Effect-free logic is likely to be ~1% of your code base, or less.
Excellent news. Side effects are where property-based testing finds its maximum return on investment. Mutations in any kind of complex system are especially hard to reason about and are easy to get wrong. Property-based testing is great at uncovering the edge cases you failed to consider.
> This is why we keep talking past each other
You keep talking past each other because "CRUD app" or "e-commerce app" is too nebulous. Concretely define exactly what this CRUD app is and then we can all meaningfully zero in on where property-based testing and fuzzing may or may not be useful. As long as you leave everyone to imagine what their own pet CRUD app looks like, you will never find a a shared understanding in which to discuss it.
In this thesis, we develop and evaluate a formal model and contracting framework for data-centric Web services. The central component of our framework is a formal specification of a common Create-Read-Update-Delete (CRUD) data store. We show how this model can be used in the formal specification and verification of both basic and transactional Web service compositions. We demonstrate through both formal proofs and empirical evaluations that our
proposed framework significantly decreases ambiguity about a service, enhances its reuse, and facilitates detection of errors in service-based implementations.
The first part of formal verification is getting a formal specification. I don't know about most developers, but I rarely get a written specification for anything I work on, and when I do, it's no where near what would be needed to turn it into a formal specification.
Anyway, the specification is subject to change at the whim of a hat, so putting a lot of effort into verifying it is foolish.
I do see value in formal verification of IPC/threading communication primitives (locks, semaphores, queues, whatevs), but then formal verification usually require assumptions for hardware behavior and those aren't always correct, so. But I've never used formal methods outside exposure in an undergrad survey class, so I dunno.
At my last job, I got a formal specification for one project. But the implementation happened in parallel for time reasons, so by the time the spec was frozen, 90% of the code was written.
Nitpick: You don't necessarily need any specification at all in order to reap benefits. Formal verification languages come with a lot of conditions that your program must fulfill in order to be accepted: Every loop terminates, every object you want to read/write is non-null, every list or array access is in bounds, etc.
For example, if you load an arbitrary C program into Frama-C, you'll have tons of properties to prove before you can even think of adding your own specification. The promises you get is that the program will always terminate and will never invoke undefined behavior. These properties are extremely useful, and these requirements are unlikely to change! And yes, C is... special. But pretty much any language has lists or arrays, nullable/option types, and unbounded loops. "No crashes/panics/uncaught exceptions" should be worth it.
And then, even absent a specification, you can start modeling invariants of the system anyway. This list is always sorted, this number is always strictly positive, this data structure never contains duplicate entries, etc. Anything that today would be a comment saying: "Important! The caller must ensure that..." This can help you gain confidence in your system. And if some day a requirement comes along that really requires you to violate one of these invariants, well, you can just remove it, the same way you would remove a test that no longer reflects something that should hold.
> "No crashes/panics/uncaught exceptions" should be worth it.
Surprisingly, no. Property-based testing and formal validation make it easy to spend tons of time and money "preventing" bugs that would never have occurred in production, especially uncaught exceptions.
There is code where strong guarantees can be worth it, (databases, platforms/operating systems, parsers accepting hostile input) but it's not most application-level code, and certainly not most e-commerce CRUD apps.
Remember, we're here to make users' lives better, not to write correct code for its own sake.
Oh no, you have just proved formal verification useless! Or, if you have an outer event loop that should potentially run forever, you mark only that one loop accordingly (details depend on the system).
What we get typically is a second or third-hand summary of an analysis someone else did about how the current version works, with no indication of what stuff has to stay as-is in our replacement and what stuff can be improved.
Nothing stops you from writing a formal spec for an implementation or the technical parts of a problem that the product / upstream requirements don't explicitly demand. In fact I would say this is a large part of what software engineering fundamentally is; the reason specs are written in a certain way is because they model a problem domain and encode fundamental (or incidental, which could still be very well-reasoned, eg due to backwards compatibility, limitations of dependencies, or practical constraints) properties that are unlikely to change. Once you've actually written software that other systems/components are depending on, you've in practice created a spec. And it's a false dichotomy to think that just because there is no formalized spec for something before it's implemented at all, that this means that once the domain is better understood or stabilizied, you cannot use formal verification on the de-facto spec that has been arrived at/de-jure rules and structure that are prescribed but not yet formalized.
Even when you are writing something for the first time, like a parser for a certain kind of data format, you often can greatly accelerate your production-safety/security maturity by formalizing your spec enough to be fuzzed and to use existing parsing tooling that avoids the footguns most developers will introduce writing their own parsers from scratch. This is IMO the most common situation in which formal verification becomes very useful; even if the data format is purely an implementaton detail, any time parsing untrusted/user-provided input is involved, it would be much more foolish to try to just throw together some random string processing functions than to use something like flex/bison, an existing serialization format/protocol, or a metaparser (which may themselves be formally verified).
If you have a set of axioms that Postgres works as designed, you can prove that your code updates the database. If you define "the refund was processed" to mean "the refunded column of the order is true" you can prove that.
Property based testing is useful for finding bugs even in these kinds of CRUD heavy apps. There can be a surprising number of bugs and unexpected behaviors in the interaction of multiple sub-systems or APIs, and one way to find those bugs is to come up with properties on traces of calls.
For example, I saw a paper on using metamorphic testing (a particular technique for defining properties to test) to find bugs in the web APIs of Spotify and YouTube[1]. I don't have time to reread the paper just now, but I remember that they found weird behavior in pagination of search results. Not a big deal in that particular case, but I've definitely seen internal APIs with behavior that could be similarly wrong but with a much larger real-world impact.
Personally, I see property-based testing and formal specification more as tools for design and debugging more than full-on correctness. Even with AI models it's still really hard to fully prove something correct, but having even a partial logical specification can let you trade design time for debugging time and lets you find inconsistencies or potential edge-cases when you're initially figuring out a system, rather than waiting until you have a massive codebase to debug and redesign in production.
It's not a panacea and you still have to be careful at the boundary between your nicely modeled system and the real world, but, once I got the hang of working in that style, having some formal properties or partial logical specifications of the behavior I needed has been really nice to have, as much for saving effort as for ensuring correctness.
I've mostly worked in slightly different domains, but I've found property-based testing useful both as a tool to catch bugs but also as a tool for communication. I spent a couple of years building and supporting a supply chain simulation at Target, where I frequently got requests for new metrics from the supply chain planning team. By teaching them how to specify either the whole metric or, at least, some of the expected behaviors of the metric as mathematical properties, it took far fewer back-and-forth conversations to correctly implement the metric in the simulation. We could then test these things by checking these properties over a bunch of random simulation traces. Day-to-day this saved a bunch of time in debugging small simulation mistakes. In the longer-term, having this test suite also let us rewrite the simulation code in a new style in Rust to drastically increase performance. All of this would have been possible without the set of properties, it would have just involved a whole lot more slow, tedious work.
Have you looked at model-based testing? One way to think of it as property-based testing for stateful system, though that's underselling it a little. It's surprisingly easy to come up models/specs for most stateful systems, including CRUD apps.
There's a lot of really important software out there where being able to easily verify effect-free core logic would certainly be very useful. An e-commerce web app is not a good example. Anything safety-critical -- aerospace, defense, medical devices, power generation, industrial machines -- already requires a certification process. Auto-generating proof evidence as part of the cert process (which generally requires a rigorous spec anyway) in the near future seems like a no brainer.
> Formal verification is still too limited to be useful for most app developers
Limited formal verification is useful to most app developers. Consider a simplistic system that requires you to specify variables as strings or integers. You've almost certainly used a language that can express that before. Being able to formally verify that you haven't tried to stuff a string into an integer in order to give your editor the squiggles is a productivity enhancer. Or used to be in the pre-vibe coding days, at least; who knows anymore.
Your tests will eventually tell you the same thing, that is true, but the industry has decided that there is little downside to having at least some degree of formal verification and a whole lot of upsides. The question is always: where do the returns diminish? This is where you get the silly commentary on HN with the Rust guys crying out that Go isn't expressive enough and the Rocq guys laughing at it all.
I don't understand your argument, why precisely is contemporary app development thought to be so simple that it is not even worth thorough property testing, let alone FV? Isn't that attitude (on the part of the industry as a whole) a type of self-fulfilling prophecy?
If you are willing to relax the restrictions, and you probably should, model checking is probably worth its weight in gold for these scenarios.
You won’t get proofs but you will spell out your logic in a formal language[0] and each run of the checker will exhaustively check your invariants[1].
[0] Useful because often you will learn something you hadn’t considered.
[1] A proof will guarantee your statements hold over quantifiers that are much too large for a model checker to check exhaustively. But, you can say that for a model of size N, property Y is guaranteed to hold. The “small model theorem,” posits that if there is an error in your specification, it is more likely to show up in a small model. You sacrifice the completeness of proofs but this trade-off has been worth it to me.
I endorse this, and IME find there are major benefits to “relaxing” even moreso through reductions or decompositions into well-known models rather than defining your own.
A lot of problems in practice are just too complex, too theoretically difficult, or some combination of not-theoretically-interesting-enough for an academic or specialized worker to be attracted/assigned/exposed to it, or just too difficult/expensive/time consuming/unimportant to be worth directly formally modeling. Or, they’re big problem spaces and you want to start where you get the most bang-for-your-buck. A (partial) reduction lets you still realize practical benefits for problems that may not perfectly or entirely be amenable to simple/small models, or which themselves may be “jagged”/gnarly problem domains due to real-world constraints like backwards compatibility.
Reduction to a formal target arise naturally from defining type conversions (likely to a codomain that is either actually a strict superset of the real reduction image, ie overly generalized/non-surjective, or bijective for some partial subset/step). Mathematically or computationally, it may be less than ideal to use an overly broad/flexible/general abstraction to model something more structured. But practically, it lets you use actual software written for the target reduction and apply any results/properties of the more familiar or better studied model.
The less-than-ideal reduction may be easily verifiable/simple and so obviously safe that, in-context, it’s clearly worth choosing the technically-suboptimal/less-elegant/partial approach to free ride off existing formal verification, without needing to wade deep into theoretical territory, than to chase perfection.
For example, if you can model a data format as a Context Free Grammar, or a subset of all context free grammars, a developer could spell out that grammar or grammars and then use existing metaparsing tools to safely process input data rather than write their own parser. And they wouldn't even need the format to be a complete CFG to leverage CFG tooling! Suppose portions are context-sensitive - for example, decoding the payload requires applying some length-prefix extracted from the header - just encode and parse the header format as a CFG with an opaque payload blob suffix with a length equal to eg the max payload/packet/fragment size. Perhaps some part of the header also specifies the payload’s structure in addition to the length: you could have one CFG for reading the header, a simple wrapping CFG for a fixed-length frame containing a header followed by junk (FRAME_HEADER_CFG :=, a CFG for each payload format, and one wrapping/router function that handles the very simple non-cfg step of using the first-pass parsing output to determine the length/format of the second pass.
Then even though your data format is not technically a CFG, the vast majority of your parsing occurs within a CFG, with only a small part that does (in pseudocode). And you did not have to do or run any formal verifications, or use special model checkers, only encode the CFG-compatible subsets of your data format into something like EBNF, then use a compatible formally-verified metaparser, and take a leap of faith that the small amount of non-formally-verified/not-nice-to-model code is safe. That's much more realistic for a developer to do than to implement a formal-verification of a non-cfg format de novo.
I have minimal experience with formal verification, but I’d be curious to hear your view on Deterministic Simulation Testing (DST) and how that slots in to the conversation.
That to me seems to address a lot of the IO facing pitfalls formal verification can struggle with. At the obvious cost of it being a bear to implement.
I mean the cumbersome gold standard is literally just using the program. Fictional dialogue:
A: Is it possible to pay for a product on Firefox mobile?
B: Yeah, I manually test it for each new version
Please let us not forget: any automated solution is just there to reduce the amount of manual testing needed.
Of course that is a lot of work and testing every possible combination of situations, technologies etc is probably not feasible, but you shoud have a checklist that you follow, if there are tests you can't automate.
I have been working on metaparsing and some related tools for spec-based-development, fuzzing, and property-based-testing off and on for the past few months. I hope to make it easier for application developers to get the benefits of formal verification without having to spend inordinate amounts of time or resources writing unfamiliar, difficult formal-verification software or tests. You're right that it's not useful for most developers to concern themselves with this now, but I think that's mostly just because nothing makes it easy enough yet.
> Even property-based testing is mostly unhelpful for e-commerce apps like these.
Only in practice because of lack of application-developer oriented tooling. Why can you not have a callback assertion or database trigger that alerts or opens a ticket when a refund is not processed within a specified amount of time? Why can't you map out your payment processors' APIs and, if amenable, define a more transactional interface on top of their CRUD API to automatically rollback incomplete or partial state transitions? (Obviously, not all payment processors will expose an interface allowing you to do this in every case; but often times they do provide the fundamental pieces in-practice, just not in a way that is easy to discover or implement). Why can't you define cross-dependency static analysis guaranteeing that eg the core checkout flow continues to work before even running any integration tests? With the right tooling you could, it's just unreasonable to expect every developer working on ecommerce to get a PhD in automated theorem proving to build their own tools for that.
> tricky logic with just a bit of I/O
I actually think that IO is the #1 most important and high ROI surface to apply formal verification (and fuzzing/other related tools which are downstream of formal methods). How many ecommerce apps are secure against DOS or runtime panics from an adversarial client sending them junk data? How many write their own janky/insecure parsers against input formats or external APIs and introduce security vulnerabilities there? I would hazard a guess that most 5+ dev webapps are guilty of basic IO sins that formal verification tools could identify in short time and also prevent from being introduced if they were added to the core development process.
The approach I'm taking is to use codegen to automatically convert OpenAPI specs and grpc service definitions (and in some cases actually just methods and http handlers themselves), and wire-format data, and application-level data structures into formalized representations that I can use downstream with autofuzzing/trace/black-box integration testing/etc tools. There are certain things that only make sense to do with AI, that are very hip and envogue, eg black-box or API-surface-informed adversarial integration testing. But you can actually much more cheaply just use AI to set up deterministic tests like this once; the problem up until now is the lack of good end-user tooling and support for those tools, which are hard to write and test especially in real-world applications that don't conform to the specific industries/academic niches that formal verification are already adopted within. That's getting solved too.
I'm not entirely sure what this is showing people don't understand? Especially when going with such silly ill defined concepts as "financial conservation". Just what?
Now model in that it was shipped, but an earthquake caused the delivery truck to be destroyed. Or it was shipped, but the person that ordered passed away before delivery and the estate is refusing to accept packages.
People will want to somehow transfer the model of an online order as similar to an in store purchase. Does that mean that as soon as a customer takes an item through the door that the store is free of any and all obligations on the item?
The answers in all of these will have to be that there are processes in place to be executed. Some may require overrides on state of execution that have to be applied to get things back to a resolved status.
Now, do we want to make sure that normal execution of some code does not leave us in an unresolved status? Of course we do. And many people want to think they can find a way to model the world such that no contested states can exist. I have my doubts. But I welcome efforts to make it so that we surprise ourselves fewer times with some outcomes.
What I think I hear you saying is, do formal verification as much as you can. But also remember that that's never 100%, and therefore you need to leave some kind of escape hatch or alternate process or something.
Absolutely! I always have worries when posting on my personal doubts on any technique, as I genuinely look forward to people making progress in spite of my doubts.
I use formal verification as part of my development process. The needs of the proof guide the development of the code as much as vice-versa. The result is usually cleaner, simpler, smaller and usually more efficient programs developed much faster as debugging effort is minimal. I still create complete test cases. Proof maintenance as code changes is a pain and I would like LLMs and/or other tools to help with that. I would never try to formally verify code written with regular processes!
I've been thinking about formal verification a lot, recently. I've dabbled in it before, but it was clear that it was only used by a small research community, and the effort required to verify anything larger than toy code would be immense. I agree with the author that there is enormous potential to use AI to automate the annoying parts of the verification process. What's more, the current security environment, in which the tiniest security flaw can quickly be exploited, suggests that provably secure code might be the future.
Others are correct to point out that formal verification is too difficult to apply to many types of application code. But there are domains where it is applicable today, and the main reason it is not used there is that developers lack the time and know-how. For example, many file format parsers are exploitable, but they are simple enough that they could be formally verified.
My professor was literally E. Allen Emerson (RIP) in uni so I feel like I'm in a unique position where the jab in the article title doesn't apply to me.
I had fun in a college class that used Dafny, building a pseudo digital wallet, it wasn't the main focus of the class, so didn't get that much out of it
> It’s no longer just for safety-critical systems with the budget for specialized proof engineers. It’s for anyone who has a property worth proving
... and the budget to pay the AI to prove it.
I have quite a bit of experience with formal verification, but I don't understand the claim made in the article. As an aside, AI's ability to reliably prove the correctness of significantly large programs is still theoretical at this point, but let's assume it's possible. The claim in the article is that writing 10,000 lines of proof to prove a 100-line program was very expensive, and that's why it isn't done. But this increase in cost continues with AI! Whether you pay people to write the proofs or you pay an LLM to write the proof, you still have to pay for it. If I run a software company, saying that "verificaton is the AI's problem" isn't much different from saying, "it's the engineers' problem." Either way I'm not doing the work myself, but I am paying for it.
If the premise is that writing proofs was 100x more expensive than testing, I see nothing in this article to even suggest why it wouldn't still be 100x more expensive when an LLM is doing the work.
(BTW, the reason there aren't many specialised proof engineers is because they aren't in high demand; they're not being paid that much more than other engineers at a similar level)
> writing 10,000 lines of proof to prove a 100-line program was very expensive, and that's why it wasn't done.
We are not that silly. We are writing compilers (ie model checkers) which translate the source code to formal proofs. No cost at all, you just need to limit loop sizes and function call depths, to keep the cost of the proof down. And then extrapolate the little proof to the general proof.
Whatever the cost multiplier is, I see no reason why that same multiplier won't remain with AI.
Personally, I don't think that picture is quite accurate. Yes, there is a high cost multiplier for small programs, albeit perhaps not so prohibitive. But for large programs, that multiplier is, for most intents and purposes infinite, unless, perhaps, you have experts who know what's worth proving and what is not.
Anyway, I'd like to see that put to the test. Have an LLM write a 50-100KLOC program and prove all correctness properties - with the properties themselves approved by an expert human - and tell us what it cost. A colleague of mine stopped his AI proof experiment when he got an email from some functionary at the company to stop doing what he was doing with the model, because it was costing too much money.
> Have an LLM write a 50-100KLOC program and prove all correctness properties [...] and tell us what it cost.
Assuming the 50-100KLOC program is of real-world use and not something contrived for the sake of offering something to prove, it is unlikely that proving all correctness properties will be possible, fundamentally. So costs will be nothing — or infinite if you foolishly remain determined to try the impossible.
In the real world we restrict what properties we care about and what models we reason in. Some of those models are woven into the fabric of an LLM. I would think the cost multiplier in those cases is much lower for an LLM as compared to a human that doesn't have an inherit understanding and needs to give it thought. Wouldn't you?
> I would think the cost multiplier in those cases is much lower for an LLM as compared to a human that doesn't have an inherit understanding and needs to give it thought. Wouldn't you?
No. I don't see why proving would require less relative effort for an LLM. In fact, years ago, long before LLMs, I wrote about why it is relatively easy to write sort-of-correct software yet hard to write provably correct software, and I don't see why it's any different for LLMs. Their power lies in inductive "intuition", while deduction requires effort, just as it does for humans: https://pron.github.io/posts/people-dont-write-programs
But there's no need to speculate. Those who think verification-by-LLM is feasible and cost-effective on an industrial scale, are welcome to try it and report what they find. So far I've seen only tiny examples, and even they don't show effortless (i.e. token-light) work by the agent.
Every time you compile a statically-typed programming language you are using formal verification, so we have all kinds of industrial scale examples. The reports suggest that outputting tokens for these languages is as easy for LLMs as Javascript. And actually, I would suggest that the reports indicate that LLMs find it easier to output tokens for those languages than Javascript. LLMs are laughably bad at writing Javascript.
On the other hand we can watch humans struggle to do the same. How often have you heard things like "I won't use Rust because it is too hard to use"? I have never seen an LLM refuse to output Rust because it thought it was too hard. So what in that suggests an equivalent multiplier?
> yet hard to write provably correct software
That isn't just hard. Proving software correct in complete generally is impossible. There are all kinds of practical and fundamental constraints that leave it to be impossible. Verification is only useful when you are acting within the scope of a compressed specification of a system's behaviour.
> Every time you compile a statically-typed programming language you are using formal verification
Yeah, this is not what we're talking about here. We're talking about proving properties with deep alternative quantifiers.
> That isn't just hard. Proving software correct in complete generally is impossible. There are all kinds of practical and fundamental constraints that leave it to be impossible. Verification is only useful when you are acting within the scope of a compressed specification of a system's behaviour.
Nobody said anything about complete generality. We're talking about the practice of applying formal methods. It's not writing in Rust, and it's not a general program verifier, but a practice that's applied in some parts of the industry and not others, as the article says.
Put another way, the question is: for those programs and those properties that humans are able to prove with proof assistants, how expensive is it for LLMs to do that work.
Only that it won't trigger a core meltdown as long as it's not connected to a reactor. Otherwise, all bets are off. One could use safe languages to improve the odds, but even that's not common enough yet. Let alone formal verification.
"The proof was only as good as the spec." Is the sentence which will go on the tombstone of Formal Verification.
The spec has always been the problem. Whether the flaws in the spec arise from the customer or from the developer filling in the gaps in the spec incorrectly.
If I built an entire system and the spec does not mention access control logic, then formal verification will neither prove nor disprove that the software is secure. It's just not in the spec...
You have to know exactly what security properties you want. By the time you've written them down succinctly and correctly as a spec, you might as well have written them down succinctly and correctly as code.
The spec has to be just as detailed and will be just as error-prone as the code itself.
Formal verification is still too limited to be useful for most app developers. The article gives an example of an e-commerce platform using it to prove the correctness of managing refunds, but then acknowledges:
> As of today, the formally verified core can handle most effect-free logic—invariants, transitions, conflict resolution. But the UI, network calls, and database interactions typically sit outside the verification boundary. Verification makes the core airtight but doesn’t guarantee end-to-end correctness.
So you can formally prove that your e-commerce refund management logic is correct, except for proving that you processed the refund. You can't even prove anything about recording the refund in your database, say nothing of proving anything about your interactions with your payment processor.
If your app is mostly tricky logic with just a bit of I/O, your app is very unusual, and it's almost certainly not an e-commerce app. E-commerce apps are mostly CRUD apps; I/O with the database, the UI, and third-party APIs (e.g. payment processors) is 99% of the code.
Even property-based testing is mostly unhelpful for e-commerce apps like these.
Instead, think of formal verification as a runtime performance improvement of property-based testing. If property-based testing is useful for your app (it probably isn't), then you may be able to convert some of your property-based tests into formal verifications.
But, honestly, you probably can't do it, not even with a high budget of tokens.
I'd love to be proven wrong, but the way to do it would be to formally prove the correctness of non-trivial open-source code with property tests. Perhaps you could formally verify significant chunks of Postgres! (But I doubt it.)
I don't know a lot about formal verification, but:
> So you can formally prove that your e-commerce refund management logic is correct, except for proving that you /processed the refund/. You can't even prove anything about recording the refund in your database, say nothing of proving anything about your interactions with your payment processor.
You could say the same thing about the viability of functional programming on a CRUD webapp, but languages like clojure have been used to great effect here. The fact that thera are important, even fundamental, bits that you cannot verify, doesn't take out value from the fact that you can eliminate whole dimensions of issues.
So much this.
I actually did take a formal verification course in college. Our final project was to use the techniques we'd been learning to verify some classic critical-section locking algorithm. I chose to verify an implementation of Lamport's bakery algorithm[0] in C (this was the 90s -- a lot of code was still being written in C).
The problem is that Lamport's algorithm makes an assumption that the "ticket number" is unbounded and any finite implementation in C will almost certainly use a value which is limited to 32 bits or so.
So I was able to formally verify that the algorithm fails to protect the critical section if enough processes are kept waiting to overflow the counter. :)
This probably just means that Lamport's algorithm isn't a great choice for such environments, but I'm still bummed that the professor gave me a B.
[0] https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
You shouldn't have been able to formally verify the algorithm fails to protect the critical section. Wrapping ticket numbers can lead to starvation (literally, if we follow the baker analogy), but the algorithm protects the critical section so long as thread IDs are unique.
The sort of environments in which this is a problem would be extremely uncommon. For a start, you need continuous contention. If you ever get a break in contention you no longer have starvation, and you 'reset' the ticket number monotonicity - to zero, if you actually take a maximum of entering thread ticket numbers instead of a cheap global counter.
If you do actually have an environment in which you expect to have continuous contention over a critical section, some quick napkin math can tell you if it's something to worry about based on the running time of your critical section. If it's over, say, 10ms, you've got a few years of runtime before it's a problem. If it's under 1ms, maybe you want to use 64-bit arithmetic for your ticket number so you can run your system until long after the human species is extinct.
You probably got a 'B' because you didn't give the professor the answer they expected, though, not because of any technical reasons.
I'm quite certain I formally verified that two processes were both able to get into the critical section WITH THE ALGORITHM I WAS VERIFYING, but this was 35 years ago, so details are fuzzy.
Remember that this is long before Wikipedia and even before Google, so I have no idea where I'd have gotten the alleged Baker's algorithm from.
Also you're right that this isn't much of an issue in practice, but that's not what formal verification is about, now is it? :)
Well, I'm someone who barely knows more than jack about formal verification, but in pretty much every case you have to have some kind of model that you are actually verifying.
How close that model sits to the real thing you have modeled is an important question, and you are free to be as close or distant as you want -- e.g. for verifying different properties of a programming language you might decide to not care about CPU instructions, registers, etc, and only care about the semantic model. This has absolutely many use cases (e.g. whether a particular optimization is sound) where this "model mismatch" doesn't matter, this doesn't make formal verification useless in any way or shape, imo.
Getting back to at the "e-commerce refund management" -- you can absolutely have a model that does e.g. a particular database IO call that either succeeds or not. With such a model in place, you can have the rest of your codebase formally verified and know that 'with a properly working database it will always work correctly' [1]. Is that not a very significant and useful finding in and of itself? Would you be more confident in your end-to-end tested software than the above?
Especially that one can then separately test that particular call site as deeply as they want, to determine that the assumed property (it either returns success or fails) is sound.
[1] Given a correct specification, which is not easy to get right
Formal verification is a siren song. The siren sings, "bug-free code is possible in principle!" But it's a trap. Even with LLMs, bug-free code is impractical.
I argued that property-based testing is mostly unhelpful for e-commerce/CRUD apps, and that formal verification is a performance improvement on property-based tests.
In a property-based test, you identify some rule (an invariant) that you want to apply to your code. Then, you fuzz your app, testing it with autogenerated inputs, failing the test if the rule is broken at any point. In formal verification, you prove that the code always satisfies the rule, so you don't have to try millions of inputs.
Whether you're doing property-based testing or formal verification, it's extremely difficult to think of any non-trivial business logic properties that should apply to CRUD apps, even if they could be written in English, translated perfectly into code, and verified formally, instantly.
An actual rule that should always be followed, inflexibly, such that a mathematical proof would be useful (and that actually matters to your business) is so rare in CRUD apps that I'm not sure I've ever seen one.
Even with general-purpose rules (the app should never crash, the app should not leak memory), the property-based fuzzers tend to find bugs that have never happened in production, and probably never will. It's rarely economical for an e-commerce app developer to spend time fixing those bugs, even if finding them cost nothing at all (which is not remotely true, even with LLMs).
And what about UI? Maybe you'd want a rule like: "The title of the product for sale should never overflow its container rectangle in the UI."
OK, well, what if the title is one very long word? But… none of the products you sell happen to contain any words that are 500 characters with no spaces. I guess you could add code to prevent that product from ever being created? (And ensure that data in the database will never allow product titles that violate your business rules… how, exactly?)
Formal verification shines where property-based testing is already useful. It's already useful for many software platforms. It's useful for databases, where reliability is essential. It's useful for parsers, particularly when you expect the end user to be attempting to send you hostile code.
But e-commerce apps? CRUD apps? Not so much.
> An actual rule that should always be followed, inflexibly, such that a mathematical proof would be useful (and that actually matters to your business) is so rare in CRUD apps that I'm not sure I've ever seen one.
I see them often. Nearly every CRUD app I have come across in the wild has, for example, employed some form of email validation. To your concern, the rules of email validation are well defined and are unlikely to ever change. Importantly, a large percentage of the implementations I saw got it wrong.
As a user, I have also tried to use CRUD apps that have rejected my valid email address, so this isn't even a class of problems that will never be realized in practice. It is a pain I have felt in the real world. The use of PBTs or proofs would have easily uncovered the implementation failures.
The rules of email validation are not remotely well defined! Syntactic email validation is an impossibly hard problem. https://www.netmeister.org/blog/email.html
IMO the industry consensus is never to "validate" email addresses syntactically, but simply to ensure that the email address contains at least one @ and to verify the email address by emailing it an activation code.
Proofs would not have uncovered these failures. The proofs would have proved that they rejected your email address as invalid, and the developers would have patted themselves on the back for a job well done.
> The rules of email validation are not remotely well defined!
RFC 5322 fully defines the structure.
> IMO the industry consensus is never to "validate" email addresses syntactically
That is true, but not because it isn't well defined, but because it is hard to get right. Keep in mind that most developers don't even know what a property-based test is, and of those that do, only a small subset of them know how to use them. If you find any testing around it at all, which is a stretch to begin with, you will be lucky to find more than a small set of common addresses without any care or concern for the complex edge cases that lead to problems like I have had as a user in the past. Encouraging developers to only validate for the presence of @ means that there is no additional room to screw things up.
But better than to rely on gimping your code to deal with developers is to use the tools at your disposal.
> and to verify the email address by emailing it an activation code.
That may also be beneficial, but not for the same reason. Not all CRUD use-cases fit that mould.
Don't forget RFC 5321! But the RFCs are ignored in practice by all popular mail servers. There are email addresses that work in practice that don't comply with the RFCs, and there are email addresses that the RFCs permit that don't work in practice.
(This happens with a lot of standards; sometimes people just ignore them and do their own thing. Something similar has happened with SVGs.)
If you write a formal verification of a syntactical email validator that ensures that all/only RFC-compliant email addresses are valid, you'll have completely wasted your time. Don't do it. Just check for at least one @ sign, and email the address to test it.
(This is a perfect example of the trap of formal verification.)
> Don't forget RFC 5321!
As long as you don't forget RFC 6531.
> But the RFCs are ignored in practice by all popular mail servers.
While I agree that you can make a compelling case that sending email is the C in CRUD, usually when someone is talking about CRUD they are referring to systems that satisfy all four letters. U violates the spirit of email, and R and D is usually handled independently of the MTA. So what email servers do here is irrelevant. If you go way back up the comment chain you will even see that Postgres was specifically mentioned. Postgres doesn't care what an email server does, but it does care about data consistency.
> Just check for at least one @ sign
That's a valid specification and in practice you are going to want to make that a PBT to ensure that your implementation actually adheres to the specification. You might try testing foo@bar.com, but what about foo\u0040bar.com? Will you think to test it too? Probably not. Will your code handle it correctly? You may be perfect, but when we get out into the general developer population where all kinds of crazy things show up when they start monkeying with your code, the answer is also probably not.
Looking simple isn’t a reason to not use the tools at your disposal, even if many won’t.
> and email the address to test it.
Poor general advice. That is expressly illegal in some jurisdictions.
I'm speaking from painful experience here: if you assume RFC 5322 has anything whatsoever to do with how email addresses actually work in the wild, you're in for a world of hurt. Popular email providers don't give a shit what RFC 5322 says, and you can't either if you want to have any hope of actually sending and receiving mail. Test messages are the only way to validate an email address, period.
> It's already useful for many software platforms. It's useful for databases, where reliability is essential. It's useful for parsers, particularly when you expect the end user to be attempting to send you hostile code.
> But e-commerce apps?
So the e-commerce app doesn't use a DB?
And the e-commerce app receives zero user input that needs to be parser so doesn't use any parser either?
If you say its useful for, for example, DB and parsers, then any app using these benefits from it.
Read carefully. There's a big difference between developing a database and developing an app that uses a database.
If you're developing a database, you should use property-based tests to ensure that your database behaves as expected (ACID reliability, etc.). If you can formally verify parts of your database, you won't have to fuzz it, because it will have been proven correct. But if you're developing a CRUD app that uses a database, there may be no properties of your app at all that are worth fuzzing.
Similarly, if you write a library that parses Markdown, you should write property-based tests of it, and fuzz it. If you use a library that parses Markdown, you should pass it your Markdown, and let the library handle it.
E-commerce apps typically don't need to do any non-trivial parsing. Effect-free logic is likely to be ~1% of your code base, or less.
This is why we keep talking past each other: the techniques that make sense for real-time operating systems don't make sense for an e-commerce app, a line-of-business admin dashboard, or a single-player game.
> But if you're developing a CRUD app that uses a database, there may be no properties of your app at all that are worth fuzzing.
Surely we have seen enough SQL injection issues over the years to know that, where "uses a database" means using SQL, fuzzing (or formal verification) is a necessity? That only scratches the surface of where fuzzing and property-based testing can be applicable to CRUD apps.
> Effect-free logic is likely to be ~1% of your code base, or less.
Excellent news. Side effects are where property-based testing finds its maximum return on investment. Mutations in any kind of complex system are especially hard to reason about and are easy to get wrong. Property-based testing is great at uncovering the edge cases you failed to consider.
> This is why we keep talking past each other
You keep talking past each other because "CRUD app" or "e-commerce app" is too nebulous. Concretely define exactly what this CRUD app is and then we can all meaningfully zero in on where property-based testing and fuzzing may or may not be useful. As long as you leave everyone to imagine what their own pet CRUD app looks like, you will never find a a shared understanding in which to discuss it.
Check out seL4 and CompCert for real-world examples of complex software proven correct and used in industry.
Formal Specification and Verification of Data-Centric Web Services by Iman Saleh (PhD thesis) - https://www.researchgate.net/publication/235675652_Formal_Sp...
Excerpt:
In this thesis, we develop and evaluate a formal model and contracting framework for data-centric Web services. The central component of our framework is a formal specification of a common Create-Read-Update-Delete (CRUD) data store. We show how this model can be used in the formal specification and verification of both basic and transactional Web service compositions. We demonstrate through both formal proofs and empirical evaluations that our proposed framework significantly decreases ambiguity about a service, enhances its reuse, and facilitates detection of errors in service-based implementations.
Formalizing Data-Centric Web Services by Iman Saleh (book) - https://link.springer.com/book/10.1007/978-3-319-24678-9
Author website - https://www.imansaleh.com/
The first part of formal verification is getting a formal specification. I don't know about most developers, but I rarely get a written specification for anything I work on, and when I do, it's no where near what would be needed to turn it into a formal specification.
Anyway, the specification is subject to change at the whim of a hat, so putting a lot of effort into verifying it is foolish.
I do see value in formal verification of IPC/threading communication primitives (locks, semaphores, queues, whatevs), but then formal verification usually require assumptions for hardware behavior and those aren't always correct, so. But I've never used formal methods outside exposure in an undergrad survey class, so I dunno.
At my last job, I got a formal specification for one project. But the implementation happened in parallel for time reasons, so by the time the spec was frozen, 90% of the code was written.
Nitpick: You don't necessarily need any specification at all in order to reap benefits. Formal verification languages come with a lot of conditions that your program must fulfill in order to be accepted: Every loop terminates, every object you want to read/write is non-null, every list or array access is in bounds, etc.
For example, if you load an arbitrary C program into Frama-C, you'll have tons of properties to prove before you can even think of adding your own specification. The promises you get is that the program will always terminate and will never invoke undefined behavior. These properties are extremely useful, and these requirements are unlikely to change! And yes, C is... special. But pretty much any language has lists or arrays, nullable/option types, and unbounded loops. "No crashes/panics/uncaught exceptions" should be worth it.
And then, even absent a specification, you can start modeling invariants of the system anyway. This list is always sorted, this number is always strictly positive, this data structure never contains duplicate entries, etc. Anything that today would be a comment saying: "Important! The caller must ensure that..." This can help you gain confidence in your system. And if some day a requirement comes along that really requires you to violate one of these invariants, well, you can just remove it, the same way you would remove a test that no longer reflects something that should hold.
> "No crashes/panics/uncaught exceptions" should be worth it.
Surprisingly, no. Property-based testing and formal validation make it easy to spend tons of time and money "preventing" bugs that would never have occurred in production, especially uncaught exceptions.
There is code where strong guarantees can be worth it, (databases, platforms/operating systems, parsers accepting hostile input) but it's not most application-level code, and certainly not most e-commerce CRUD apps.
Remember, we're here to make users' lives better, not to write correct code for its own sake.
One problem with the "won't happen in production" argument is that humans are demonstrably bad at predicting "won't happen in production" situations.
> The promises you get is that the program will always terminate
I don't want my programs to terminate though.
Oh no, you have just proved formal verification useless! Or, if you have an outer event loop that should potentially run forever, you mark only that one loop accordingly (details depend on the system).
What we get typically is a second or third-hand summary of an analysis someone else did about how the current version works, with no indication of what stuff has to stay as-is in our replacement and what stuff can be improved.
Nothing stops you from writing a formal spec for an implementation or the technical parts of a problem that the product / upstream requirements don't explicitly demand. In fact I would say this is a large part of what software engineering fundamentally is; the reason specs are written in a certain way is because they model a problem domain and encode fundamental (or incidental, which could still be very well-reasoned, eg due to backwards compatibility, limitations of dependencies, or practical constraints) properties that are unlikely to change. Once you've actually written software that other systems/components are depending on, you've in practice created a spec. And it's a false dichotomy to think that just because there is no formalized spec for something before it's implemented at all, that this means that once the domain is better understood or stabilizied, you cannot use formal verification on the de-facto spec that has been arrived at/de-jure rules and structure that are prescribed but not yet formalized.
Even when you are writing something for the first time, like a parser for a certain kind of data format, you often can greatly accelerate your production-safety/security maturity by formalizing your spec enough to be fuzzed and to use existing parsing tooling that avoids the footguns most developers will introduce writing their own parsers from scratch. This is IMO the most common situation in which formal verification becomes very useful; even if the data format is purely an implementaton detail, any time parsing untrusted/user-provided input is involved, it would be much more foolish to try to just throw together some random string processing functions than to use something like flex/bison, an existing serialization format/protocol, or a metaparser (which may themselves be formally verified).
If you have a set of axioms that Postgres works as designed, you can prove that your code updates the database. If you define "the refund was processed" to mean "the refunded column of the order is true" you can prove that.
Property based testing is useful for finding bugs even in these kinds of CRUD heavy apps. There can be a surprising number of bugs and unexpected behaviors in the interaction of multiple sub-systems or APIs, and one way to find those bugs is to come up with properties on traces of calls.
For example, I saw a paper on using metamorphic testing (a particular technique for defining properties to test) to find bugs in the web APIs of Spotify and YouTube[1]. I don't have time to reread the paper just now, but I remember that they found weird behavior in pagination of search results. Not a big deal in that particular case, but I've definitely seen internal APIs with behavior that could be similarly wrong but with a much larger real-world impact.
[1]: https://ieeexplore.ieee.org/document/8074764
Personally, I see property-based testing and formal specification more as tools for design and debugging more than full-on correctness. Even with AI models it's still really hard to fully prove something correct, but having even a partial logical specification can let you trade design time for debugging time and lets you find inconsistencies or potential edge-cases when you're initially figuring out a system, rather than waiting until you have a massive codebase to debug and redesign in production.
It's not a panacea and you still have to be careful at the boundary between your nicely modeled system and the real world, but, once I got the hang of working in that style, having some formal properties or partial logical specifications of the behavior I needed has been really nice to have, as much for saving effort as for ensuring correctness.
I've mostly worked in slightly different domains, but I've found property-based testing useful both as a tool to catch bugs but also as a tool for communication. I spent a couple of years building and supporting a supply chain simulation at Target, where I frequently got requests for new metrics from the supply chain planning team. By teaching them how to specify either the whole metric or, at least, some of the expected behaviors of the metric as mathematical properties, it took far fewer back-and-forth conversations to correctly implement the metric in the simulation. We could then test these things by checking these properties over a bunch of random simulation traces. Day-to-day this saved a bunch of time in debugging small simulation mistakes. In the longer-term, having this test suite also let us rewrite the simulation code in a new style in Rust to drastically increase performance. All of this would have been possible without the set of properties, it would have just involved a whole lot more slow, tedious work.
Have you looked at model-based testing? One way to think of it as property-based testing for stateful system, though that's underselling it a little. It's surprisingly easy to come up models/specs for most stateful systems, including CRUD apps.
Source: I've modeled a number of CRUD like and non-CRUD like systems through the Accordant framework (https://github.com/microsoft/accordant)
There's a lot of really important software out there where being able to easily verify effect-free core logic would certainly be very useful. An e-commerce web app is not a good example. Anything safety-critical -- aerospace, defense, medical devices, power generation, industrial machines -- already requires a certification process. Auto-generating proof evidence as part of the cert process (which generally requires a rigorous spec anyway) in the near future seems like a no brainer.
> Formal verification is still too limited to be useful for most app developers
Limited formal verification is useful to most app developers. Consider a simplistic system that requires you to specify variables as strings or integers. You've almost certainly used a language that can express that before. Being able to formally verify that you haven't tried to stuff a string into an integer in order to give your editor the squiggles is a productivity enhancer. Or used to be in the pre-vibe coding days, at least; who knows anymore.
Your tests will eventually tell you the same thing, that is true, but the industry has decided that there is little downside to having at least some degree of formal verification and a whole lot of upsides. The question is always: where do the returns diminish? This is where you get the silly commentary on HN with the Rust guys crying out that Go isn't expressive enough and the Rocq guys laughing at it all.
I don't understand your argument, why precisely is contemporary app development thought to be so simple that it is not even worth thorough property testing, let alone FV? Isn't that attitude (on the part of the industry as a whole) a type of self-fulfilling prophecy?
If you are willing to relax the restrictions, and you probably should, model checking is probably worth its weight in gold for these scenarios.
You won’t get proofs but you will spell out your logic in a formal language[0] and each run of the checker will exhaustively check your invariants[1].
[0] Useful because often you will learn something you hadn’t considered.
[1] A proof will guarantee your statements hold over quantifiers that are much too large for a model checker to check exhaustively. But, you can say that for a model of size N, property Y is guaranteed to hold. The “small model theorem,” posits that if there is an error in your specification, it is more likely to show up in a small model. You sacrifice the completeness of proofs but this trade-off has been worth it to me.
I’ve recently started using PBT at work and it has been more than worth it, and I want to try Model checking next.
PBT is great and often good enough.
I use model checking mostly in the design phase of a project.
Some much bigger projects at MS and Amazon use it continuously throughout their process.
I endorse this, and IME find there are major benefits to “relaxing” even moreso through reductions or decompositions into well-known models rather than defining your own.
A lot of problems in practice are just too complex, too theoretically difficult, or some combination of not-theoretically-interesting-enough for an academic or specialized worker to be attracted/assigned/exposed to it, or just too difficult/expensive/time consuming/unimportant to be worth directly formally modeling. Or, they’re big problem spaces and you want to start where you get the most bang-for-your-buck. A (partial) reduction lets you still realize practical benefits for problems that may not perfectly or entirely be amenable to simple/small models, or which themselves may be “jagged”/gnarly problem domains due to real-world constraints like backwards compatibility.
Reduction to a formal target arise naturally from defining type conversions (likely to a codomain that is either actually a strict superset of the real reduction image, ie overly generalized/non-surjective, or bijective for some partial subset/step). Mathematically or computationally, it may be less than ideal to use an overly broad/flexible/general abstraction to model something more structured. But practically, it lets you use actual software written for the target reduction and apply any results/properties of the more familiar or better studied model.
The less-than-ideal reduction may be easily verifiable/simple and so obviously safe that, in-context, it’s clearly worth choosing the technically-suboptimal/less-elegant/partial approach to free ride off existing formal verification, without needing to wade deep into theoretical territory, than to chase perfection.
For example, if you can model a data format as a Context Free Grammar, or a subset of all context free grammars, a developer could spell out that grammar or grammars and then use existing metaparsing tools to safely process input data rather than write their own parser. And they wouldn't even need the format to be a complete CFG to leverage CFG tooling! Suppose portions are context-sensitive - for example, decoding the payload requires applying some length-prefix extracted from the header - just encode and parse the header format as a CFG with an opaque payload blob suffix with a length equal to eg the max payload/packet/fragment size. Perhaps some part of the header also specifies the payload’s structure in addition to the length: you could have one CFG for reading the header, a simple wrapping CFG for a fixed-length frame containing a header followed by junk (FRAME_HEADER_CFG :=, a CFG for each payload format, and one wrapping/router function that handles the very simple non-cfg step of using the first-pass parsing output to determine the length/format of the second pass.
FRAME_HEADER_CFG := HEADER_CFG {BYTE}
...
Parse(byte[] in) {
}
Then even though your data format is not technically a CFG, the vast majority of your parsing occurs within a CFG, with only a small part that does (in pseudocode). And you did not have to do or run any formal verifications, or use special model checkers, only encode the CFG-compatible subsets of your data format into something like EBNF, then use a compatible formally-verified metaparser, and take a leap of faith that the small amount of non-formally-verified/not-nice-to-model code is safe. That's much more realistic for a developer to do than to implement a formal-verification of a non-cfg format de novo.
I have minimal experience with formal verification, but I’d be curious to hear your view on Deterministic Simulation Testing (DST) and how that slots in to the conversation.
That to me seems to address a lot of the IO facing pitfalls formal verification can struggle with. At the obvious cost of it being a bear to implement.
I mean the cumbersome gold standard is literally just using the program. Fictional dialogue:
Please let us not forget: any automated solution is just there to reduce the amount of manual testing needed.
Of course that is a lot of work and testing every possible combination of situations, technologies etc is probably not feasible, but you shoud have a checklist that you follow, if there are tests you can't automate.
I have been working on metaparsing and some related tools for spec-based-development, fuzzing, and property-based-testing off and on for the past few months. I hope to make it easier for application developers to get the benefits of formal verification without having to spend inordinate amounts of time or resources writing unfamiliar, difficult formal-verification software or tests. You're right that it's not useful for most developers to concern themselves with this now, but I think that's mostly just because nothing makes it easy enough yet.
> Even property-based testing is mostly unhelpful for e-commerce apps like these.
Only in practice because of lack of application-developer oriented tooling. Why can you not have a callback assertion or database trigger that alerts or opens a ticket when a refund is not processed within a specified amount of time? Why can't you map out your payment processors' APIs and, if amenable, define a more transactional interface on top of their CRUD API to automatically rollback incomplete or partial state transitions? (Obviously, not all payment processors will expose an interface allowing you to do this in every case; but often times they do provide the fundamental pieces in-practice, just not in a way that is easy to discover or implement). Why can't you define cross-dependency static analysis guaranteeing that eg the core checkout flow continues to work before even running any integration tests? With the right tooling you could, it's just unreasonable to expect every developer working on ecommerce to get a PhD in automated theorem proving to build their own tools for that.
> tricky logic with just a bit of I/O
I actually think that IO is the #1 most important and high ROI surface to apply formal verification (and fuzzing/other related tools which are downstream of formal methods). How many ecommerce apps are secure against DOS or runtime panics from an adversarial client sending them junk data? How many write their own janky/insecure parsers against input formats or external APIs and introduce security vulnerabilities there? I would hazard a guess that most 5+ dev webapps are guilty of basic IO sins that formal verification tools could identify in short time and also prevent from being introduced if they were added to the core development process.
The approach I'm taking is to use codegen to automatically convert OpenAPI specs and grpc service definitions (and in some cases actually just methods and http handlers themselves), and wire-format data, and application-level data structures into formalized representations that I can use downstream with autofuzzing/trace/black-box integration testing/etc tools. There are certain things that only make sense to do with AI, that are very hip and envogue, eg black-box or API-surface-informed adversarial integration testing. But you can actually much more cheaply just use AI to set up deterministic tests like this once; the problem up until now is the lack of good end-user tooling and support for those tools, which are hard to write and test especially in real-world applications that don't conform to the specific industries/academic niches that formal verification are already adopted within. That's getting solved too.
I'm not entirely sure what this is showing people don't understand? Especially when going with such silly ill defined concepts as "financial conservation". Just what?
Now model in that it was shipped, but an earthquake caused the delivery truck to be destroyed. Or it was shipped, but the person that ordered passed away before delivery and the estate is refusing to accept packages.
People will want to somehow transfer the model of an online order as similar to an in store purchase. Does that mean that as soon as a customer takes an item through the door that the store is free of any and all obligations on the item?
The answers in all of these will have to be that there are processes in place to be executed. Some may require overrides on state of execution that have to be applied to get things back to a resolved status.
Now, do we want to make sure that normal execution of some code does not leave us in an unresolved status? Of course we do. And many people want to think they can find a way to model the world such that no contested states can exist. I have my doubts. But I welcome efforts to make it so that we surprise ourselves fewer times with some outcomes.
What I think I hear you saying is, do formal verification as much as you can. But also remember that that's never 100%, and therefore you need to leave some kind of escape hatch or alternate process or something.
Absolutely! I always have worries when posting on my personal doubts on any technique, as I genuinely look forward to people making progress in spite of my doubts.
I use formal verification as part of my development process. The needs of the proof guide the development of the code as much as vice-versa. The result is usually cleaner, simpler, smaller and usually more efficient programs developed much faster as debugging effort is minimal. I still create complete test cases. Proof maintenance as code changes is a pain and I would like LLMs and/or other tools to help with that. I would never try to formally verify code written with regular processes!
It has bugs!
I've been thinking about formal verification a lot, recently. I've dabbled in it before, but it was clear that it was only used by a small research community, and the effort required to verify anything larger than toy code would be immense. I agree with the author that there is enormous potential to use AI to automate the annoying parts of the verification process. What's more, the current security environment, in which the tiniest security flaw can quickly be exploited, suggests that provably secure code might be the future.
Others are correct to point out that formal verification is too difficult to apply to many types of application code. But there are domains where it is applicable today, and the main reason it is not used there is that developers lack the time and know-how. For example, many file format parsers are exploitable, but they are simple enough that they could be formally verified.
My professor was literally E. Allen Emerson (RIP) in uni so I feel like I'm in a unique position where the jab in the article title doesn't apply to me.
https://en.wikipedia.org/wiki/E._Allen_Emerson
Anyone interested in this should check out my Qed project I've been working on, a formally verified web frontend. https://github.com/JacobAsmuth/qed
I had fun in a college class that used Dafny, building a pseudo digital wallet, it wasn't the main focus of the class, so didn't get that much out of it
> It’s no longer just for safety-critical systems with the budget for specialized proof engineers. It’s for anyone who has a property worth proving
... and the budget to pay the AI to prove it.
I have quite a bit of experience with formal verification, but I don't understand the claim made in the article. As an aside, AI's ability to reliably prove the correctness of significantly large programs is still theoretical at this point, but let's assume it's possible. The claim in the article is that writing 10,000 lines of proof to prove a 100-line program was very expensive, and that's why it isn't done. But this increase in cost continues with AI! Whether you pay people to write the proofs or you pay an LLM to write the proof, you still have to pay for it. If I run a software company, saying that "verificaton is the AI's problem" isn't much different from saying, "it's the engineers' problem." Either way I'm not doing the work myself, but I am paying for it.
If the premise is that writing proofs was 100x more expensive than testing, I see nothing in this article to even suggest why it wouldn't still be 100x more expensive when an LLM is doing the work.
(BTW, the reason there aren't many specialised proof engineers is because they aren't in high demand; they're not being paid that much more than other engineers at a similar level)
> writing 10,000 lines of proof to prove a 100-line program was very expensive, and that's why it wasn't done.
We are not that silly. We are writing compilers (ie model checkers) which translate the source code to formal proofs. No cost at all, you just need to limit loop sizes and function call depths, to keep the cost of the proof down. And then extrapolate the little proof to the general proof.
Whatever the cost multiplier is, I see no reason why that same multiplier won't remain with AI.
Personally, I don't think that picture is quite accurate. Yes, there is a high cost multiplier for small programs, albeit perhaps not so prohibitive. But for large programs, that multiplier is, for most intents and purposes infinite, unless, perhaps, you have experts who know what's worth proving and what is not.
Anyway, I'd like to see that put to the test. Have an LLM write a 50-100KLOC program and prove all correctness properties - with the properties themselves approved by an expert human - and tell us what it cost. A colleague of mine stopped his AI proof experiment when he got an email from some functionary at the company to stop doing what he was doing with the model, because it was costing too much money.
> Have an LLM write a 50-100KLOC program and prove all correctness properties [...] and tell us what it cost.
Assuming the 50-100KLOC program is of real-world use and not something contrived for the sake of offering something to prove, it is unlikely that proving all correctness properties will be possible, fundamentally. So costs will be nothing — or infinite if you foolishly remain determined to try the impossible.
In the real world we restrict what properties we care about and what models we reason in. Some of those models are woven into the fabric of an LLM. I would think the cost multiplier in those cases is much lower for an LLM as compared to a human that doesn't have an inherit understanding and needs to give it thought. Wouldn't you?
> I would think the cost multiplier in those cases is much lower for an LLM as compared to a human that doesn't have an inherit understanding and needs to give it thought. Wouldn't you?
No. I don't see why proving would require less relative effort for an LLM. In fact, years ago, long before LLMs, I wrote about why it is relatively easy to write sort-of-correct software yet hard to write provably correct software, and I don't see why it's any different for LLMs. Their power lies in inductive "intuition", while deduction requires effort, just as it does for humans: https://pron.github.io/posts/people-dont-write-programs
But there's no need to speculate. Those who think verification-by-LLM is feasible and cost-effective on an industrial scale, are welcome to try it and report what they find. So far I've seen only tiny examples, and even they don't show effortless (i.e. token-light) work by the agent.
Every time you compile a statically-typed programming language you are using formal verification, so we have all kinds of industrial scale examples. The reports suggest that outputting tokens for these languages is as easy for LLMs as Javascript. And actually, I would suggest that the reports indicate that LLMs find it easier to output tokens for those languages than Javascript. LLMs are laughably bad at writing Javascript.
On the other hand we can watch humans struggle to do the same. How often have you heard things like "I won't use Rust because it is too hard to use"? I have never seen an LLM refuse to output Rust because it thought it was too hard. So what in that suggests an equivalent multiplier?
> yet hard to write provably correct software
That isn't just hard. Proving software correct in complete generally is impossible. There are all kinds of practical and fundamental constraints that leave it to be impossible. Verification is only useful when you are acting within the scope of a compressed specification of a system's behaviour.
> Every time you compile a statically-typed programming language you are using formal verification
Yeah, this is not what we're talking about here. We're talking about proving properties with deep alternative quantifiers.
> That isn't just hard. Proving software correct in complete generally is impossible. There are all kinds of practical and fundamental constraints that leave it to be impossible. Verification is only useful when you are acting within the scope of a compressed specification of a system's behaviour.
Nobody said anything about complete generality. We're talking about the practice of applying formal methods. It's not writing in Rust, and it's not a general program verifier, but a practice that's applied in some parts of the industry and not others, as the article says.
Put another way, the question is: for those programs and those properties that humans are able to prove with proof assistants, how expensive is it for LLMs to do that work.
That there are bugs in it. That would be the only thing I can 100% guarantee.
That is flawed
Only that it won't trigger a core meltdown as long as it's not connected to a reactor. Otherwise, all bets are off. One could use safe languages to improve the odds, but even that's not common enough yet. Let alone formal verification.
60% of the time it works every time.
it's made with bits of real panther, so you know it's good
"The proof was only as good as the spec." Is the sentence which will go on the tombstone of Formal Verification.
The spec has always been the problem. Whether the flaws in the spec arise from the customer or from the developer filling in the gaps in the spec incorrectly.
If I built an entire system and the spec does not mention access control logic, then formal verification will neither prove nor disprove that the software is secure. It's just not in the spec...
You have to know exactly what security properties you want. By the time you've written them down succinctly and correctly as a spec, you might as well have written them down succinctly and correctly as code.
The spec has to be just as detailed and will be just as error-prone as the code itself.
ACM now stooping to the level of clickbait youtubers. Just great.
You don't know jack, that's why you should subscribe to my ACM channel. As for me? I know two Jacks.
Smash that like button