
*Typical enterprise software executive dashboard.*
Recently, I had a nice solid glimpse of the future and I wanted to tell you about it.
Not long ago, I visited an office with a beautiful view, and I took this photo:

That day, people online had been discussing how good ChatGPT was at guessing where a picture was taken, so I fed it this image and asked it to figure out not just which building, but which floor. It identified the exact building (17 State Street), and then suggested it was the 34th floor (it was the 30th). This was, I thought, surprisingly accurate.
This is another dangerous superpower we’re not prepared to handle, but we’ve had that conversation many times before, and we can have it again tomorrow. Today, I want to go under the hood.
When I looked at its chain of reasoning, this is how it worked: It would “think” for a while, then describe what it was doing, then write some code. One piece of code cropped the image so that it could “zoom in,” and another bit of code calculated latitude and longitude. The code would produce data—real actual numeric or visual data—and the data would be fed back into the “reasoning process.”
That process, sketched in code below, is:
- You enter a prompt.
- The LLM rambles a bit about the prompt.
- The LLM writes some code based on its own rambling.
- The LLM describes how to invoke the code.
- The AI company runs the code on a server somewhere.
- The AI company uses the code output to re-prompt the LLM.
- Go back to step 2 and repeat until some answer shows up.
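Here's a rough sketch of that loop in code. The `call_llm` and `run_code` functions are stand-ins I made up for the model call and the provider-side sandbox; no vendor's actual API looks exactly like this, but the shape is the point:

```python
def call_llm(transcript: str) -> dict:
    # Stand-in for the model API: returns its rambling, any code it wrote,
    # and a final answer once it has one. (Stubbed here for illustration.)
    return {"text": "", "code": None, "answer": "done"}

def run_code(code: str) -> str:
    # Stand-in for the provider running the code on a server somewhere and
    # capturing real output: numbers, cropped images, coordinates.
    return ""

def answer(prompt: str, max_steps: int = 10) -> str:
    transcript = prompt
    for _ in range(max_steps):
        step = call_llm(transcript)        # steps 2-4: ramble, write code
        if step["answer"] is not None:
            return step["answer"]          # an answer showed up
        output = run_code(step["code"])    # step 5: run the code server-side
        transcript += "\n" + step["text"] + "\n" + output  # step 6: re-prompt
    return transcript                      # give up after the step budget
```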
Here’s a sample of the output from my geo-guessing session (emphasis added):
From Battery Park (40.7033° N, -74.0170° W), the bearing to Liberty Island *appears to be* around the north-northeast direction. The latitude difference is about 1.5 km south, and the longitude difference is 2.3 km east. So, the bearing from Battery Park to Liberty *seems to be approximately* 212° southwest, though *I suspect it might shift slightly* based on the exact coordinates, so I need to verify with a quick calculation.
```python
import math

def bearing(lat1, lon1, lat2, lon2):
    # Initial bearing from point 1 to point 2, in degrees clockwise from north.
    lat1, lon1, lat2, lon2 = map(math.radians, [lat1, lon1, lat2, lon2])
    dlon = lon2 - lon1
    x = math.sin(dlon) * math.cos(lat2)
    y = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(dlon)
    brng = math.degrees(math.atan2(x, y))
    return (brng + 360) % 360

print(bearing(40.703, -74.017, 40.6892, -74.0445))
```
This is straight uncut ChatGPT reasoning. See how it’s letting itself…guess? LLMs are very good at generating ambiguities. Meanwhile, code is very reliable. Barring a cosmic ray hitting your microchip, if you ask a spreadsheet to add up a thousand numbers, you will get the same answer every time. That’s why we trust old, boring coding systems to process our credit card transactions, while we would never expect an LLM to do our banking. It just learned how many “r”s are in the word “strawberry.”
My gut tells me this pattern, which doesn't really have a name, is going to be the pattern for working with these technologies to do complex things. The best name I can find for it is "specification repair"—see, for example, this paper. You ask a question and the LLM transforms it into a more detailed specification of the problem, then writes code to test that specification, then refines the specification further, then tests again. This kind of loop is good for many things, like:
- Guessing your location from a photo.
- Teaching a robot how to assemble a circuit board from a photo.
- Estimating the effects of climate change on a given location.
- Predicting the global market for strawberries based on past data.
- Comparing two product marketing funnels and figuring out if you could merge them into one.
- Creating a nutrition plan given a set of health conditions.
- Evaluating a stock portfolio against a thesis.
Critically, in each of these examples, the LLM can be used to kick things off, but ultimately, if the output is going to be genuinely trusted, the LLM should retreat to programming and data analysis (or old-school AI like machine vision) to get to the next step, and it should make that handoff explicit. Take the nutrition plan: you'd want to use an official nutrition database, like this one from the U.S. Department of Agriculture. Otherwise you shouldn't trust the LLM—it could make up all kinds of foods. From a product perspective, that's the approach I'd prefer.
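To make the handoff concrete, here's a minimal sketch of the nutrition case. It assumes the USDA FoodData Central search API; the endpoint, parameters, and response fields are my reading of that service, and the function name is made up for illustration:

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for the USDA FoodData Central search API.
FDC_SEARCH = "https://api.nal.usda.gov/fdc/v1/foods/search"

def lookup_food(name: str, api_key: str) -> dict:
    # The LLM proposes a food by name; this plain code looks it up in the
    # official database so the nutrition facts aren't invented by the model.
    params = urllib.parse.urlencode({"query": name, "api_key": api_key, "pageSize": 1})
    with urllib.request.urlopen(f"{FDC_SEARCH}?{params}") as resp:
        results = json.load(resp)
    foods = results.get("foods", [])
    return foods[0] if foods else {}  # top match, or empty if nothing found
```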
We’ve been testing out the climate-change idea for a few months. If you ask an LLM to simply make a climate adaptation plan for a given location or building, it will draw a lot of random conclusions because of the biases in its data set.
For example, a tornado is very newsworthy in New Jersey because serious tornadoes are rare there—so when a tornado happens, there are a lot of news stories. LLMs spidered the web and indexed that news. Now, when you ask an LLM to help make a plan to prepare for climate change in New Jersey, it will tell you that tornadoes are one of your biggest problems and you should prepare for them. But…they probably aren’t.
However, if you feed an LLM hard data about precipitation and heat risk, and tell it to work with that data, writing code if necessary, it will do a much better job and write a better report.
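Concretely, "work with that data" means the model ends up writing ordinary analysis code over the numbers you handed it. Here's a sketch of that kind of code, assuming a hypothetical CSV of annual precipitation and hot-day counts for one location (the file name and columns are made up):

```python
import csv

def load_history(path: str = "climate_history.csv") -> list[dict]:
    # Hypothetical input: one row per year, with columns "year",
    # "precip_mm", and "days_above_90f". The schema is illustrative only.
    with open(path, newline="") as f:
        return [
            {"year": int(r["year"]),
             "precip_mm": float(r["precip_mm"]),
             "days_above_90f": int(r["days_above_90f"])}
            for r in csv.DictReader(f)
        ]

def recent_vs_full(rows: list[dict]) -> dict:
    # Compare the last ten years against the whole record, so the adaptation
    # plan is grounded in measurements rather than in news coverage.
    latest = max(r["year"] for r in rows)
    recent = [r for r in rows if r["year"] > latest - 10]

    def avg(items, key):
        return sum(r[key] for r in items) / len(items)

    return {
        "precip_mm": (avg(recent, "precip_mm"), avg(rows, "precip_mm")),
        "days_above_90f": (avg(recent, "days_above_90f"), avg(rows, "days_above_90f")),
    }
```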
Why does any of this matter? Because adaptive, code-driven systems like this represent immense possibilities, and everyone knows it. A huge driver behind everything happening in AI right now is figuring out how to make it automate the physical production of goods. The problem is that an LLM alone can't really figure out how to control a robot to manufacture a PC.
But an LLM that has watched a million hours of factory-operations footage and then writes code to control a robot arm based on a prompt, steadily making higher-quality goods with less and less human intervention? That is the dream of the entire industrial part of the economy, and it represents tens of trillions of dollars in opportunity.
Now I’m way ahead of my skis, but my instinct is that faster, reactive systems that can write code to continually improve processes are the absolute endgame for this technology (more than AGI). Over time, the cultural stress we’re experiencing over generative AI will be subsumed by the conversation around AI-based automation. And a huge number of signals point to China being far ahead of the U.S. on this front, and ready to go very fast.
And meanwhile? “Hallucinations” are getting worse, even as the models get smarter. The demand for really robust, reliable loops between LLM and “classic” code is going to be very high. So “specification repair” (or whatever it ends up being called) feels like the future to me—and as a pattern, it finds a balance between what LLMs can do extremely well (transform text into less text) and what computers can do (reliable math and historical data access, at incredible speeds).
I know this is an interesting, confusing moment, but AI is just more software. The truly interesting stuff is going to happen when computers stop trying to be human, and get back to being computers.