Large Language Models and Software Quality Assurance

One comparison I've seen offered in defense of using LLMs to generate code is that we already use compilers to generate code from high-level languages. But compilers are by design predictable, and when they are not, that is a bug; LLMs are stochastic in nature, and there is so far no reliable way to tell when they are in error without careful examination of their output. No one currently knows how to make LLMs predictable. There is a second problem: if your coder never tells you, "You're wrong, that doesn't work," or, "I don't think that's what you want," you will never find errors in the specification; that is as true of a human coder as of an LLM.

There seems to be no way to bound the errors of LLM-generated code. I'm not even sure how to measure the errors; testing does not, and cannot, do this. As Dijkstra famously observed, "Testing shows the presence, not the absence, of bugs." The problem shifts from "Is your code correct?" to "Is the prompt (what in the old days we would have called the specification) correct?" with the added problem of "Did the stochastic LLM correctly implement the specification?" The only way I can see to answer that is with careful code review. LLMs are notorious for producing persuasive wrong answers, and that review often is not undertaken; people trust the machine without reason. Without review, can anyone say with any conviction that the code generated by an LLM will reliably be what was asked for? Even with review, there are problems. In natural language, LLMs make unexpected errors, errors a human would not make, and these errors often slide past a human editor. I don't think you'd find any PE or CPA who would sign off on a project developed without careful code review. And they shouldn't! They're in the same position as a lawyer who uses an LLM to generate briefs; notoriously, multiple lawyers have been caught and disciplined for filing unedited, erroneous LLM output in their briefs.

Maybe, with some verification technology, the validation of LLM-generated code could be automated. This is a research topic, but it has no settled result as yet. It may only move the correctness problem: now you have to give your verifier a formal specification, which you have to write! Not every computing project has the demanding accuracy requirements of accounting or engineering. But how many errors are we willing to accept? How do we quantify that? And if we can't quantify it, do we want to generate code this way? There are engineering situations in which stochastically-generated solutions are acceptable, but those are ones in which we know how to bound the probability of error. So far we don't have that for the vast range of problems to which code-generating LLMs are being applied.
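To make "bound the probability of error" concrete: in the narrow case where a program's inputs can be sampled independently from a known distribution, statistical testing can put a confidence bound on the failure rate, though those assumptions rarely hold for general software, which is exactly the difficulty described above. A minimal sketch using a Hoeffding bound (the function name and numbers are illustrative, not from any real project):

```python
import math

def failure_rate_upper_bound(failures: int, trials: int, delta: float = 0.05) -> float:
    """Hoeffding upper confidence bound on the true failure rate,
    given `failures` observed in `trials` independent random tests.
    The bound holds with probability at least 1 - delta, and only
    with respect to the distribution the tests were sampled from."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    empirical = failures / trials
    slack = math.sqrt(math.log(1 / delta) / (2 * trials))
    return min(1.0, empirical + slack)

# Even zero failures in 1,000 independent random tests only bounds the
# failure rate at roughly 3.9% with 95% confidence.
bound = failure_rate_upper_bound(0, 1000)
```

The point of the sketch is how weak the guarantee is: the bound shrinks only as the square root of the number of trials, and it says nothing about inputs outside the sampled distribution, which is where LLM-generated code tends to fail in surprising ways.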

One problem of software quality assurance is "How do we make sure errors are caught and corrected before code is put into production?" Programmer overconfidence and the difficulty of validating code have long made this hard. Compared with a human coder, an LLM has an even worse problem: it doesn't even know what's right! With human coders, we have a collection of procedures, a methodology, that prevents and catches errors; with LLMs we have no such thing. To use LLM technology to generate code with any confidence, we somehow have to be able to answer that question, and so far we cannot with any reliability.

These are the same arguments we've had for decades about quality assurance with human-written code, and after years and years of effort, we made progress on software quality. But LLMs make different errors than humans, and our existing procedures are no longer effective. So we are right back to the software quality problems we had decades ago.
