Build an Eval Harness That Catches LLM Regressions
Golden datasets, LLM-as-judge scoring, and cheap deterministic gates that prove a prompt change actually improved things.
You tweak a prompt to fix one annoying behavior, the output looks better in the three examples you check, and you ship it. A week later a customer reports that the feature stopped doing something it used to do reliably, something your prompt change quietly broke while you were busy fixing the other thing. You never saw it because you eyeballed a handful of cases and called it good. This is the core problem with shipping changes to anything driven by a language model: the surface you are editing is huge, the behavior is non-deterministic, and a fix in one place routinely breaks behavior somewhere you did not look.
Traditional software has unit tests that catch this. You change a function, the test suite runs, and a red light tells you that you broke something three modules away. LLM-driven features have lacked that safety net, which is why teams ship prompt changes on vibes and discover regressions in production. An evaluation harness is the unit-test equivalent for prompts and models. It is the difference between "the output looks better to me" and "this change improved my measured pass rate from 84 to 91 percent without regressing any of my critical cases."
The golden dataset is your ground truth
The foundation of an eval harness is a golden dataset: a curated set of input-output pairs that serve as the ground truth you measure every change against. These are not random examples. They are the cases you care about, including the ones that have broken before, the edge cases that matter to your business, and the representative happy-path inputs that must keep working. The golden dataset is the static baseline against which every regression delta is measured.
The discipline that makes a golden dataset valuable is curation. Every time you find a bug, a case where the model did the wrong thing, you add that case to the dataset with the correct expected behavior. Over time the dataset becomes a memory of every mistake your system has ever made, and any change that reintroduces an old mistake fails immediately. This is how you stop fixing the same regression twice. The dataset remembers so you do not have to.
A practical structure pairs a lean, deterministic golden set with broader random sampling. The golden set acts as a hard gate before production, deterministic and stable. Random prompt sampling against real or synthetic inputs surfaces new, unexpected failures you did not know to look for. The golden set protects what you know matters; the sampling finds what you did not know was at risk.
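One way to picture that split is a small typed record for golden cases plus a seeded sampler for everything else. This is a minimal sketch, not a prescribed schema: the `GoldenCase` fields, the two example cases, and the `sample_inputs` helper are all assumptions made up for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One curated input with the behavior it must keep producing."""
    case_id: str
    prompt: str
    expected: str
    critical: bool = False       # a failing critical case should block the release
    tags: tuple[str, ...] = ()   # e.g. ("regression",) for a case added after a bug


# Hypothetical cases, invented for illustration only.
GOLDEN_SET = [
    GoldenCase("happy-001", "Summarize: the meeting moved to Friday.",
               "Meeting moved to Friday."),
    GoldenCase("bug-017", "Extract the order ID from: 'ref #A-1932'", "A-1932",
               critical=True, tags=("regression",)),
]


def sample_inputs(pool: list[str], k: int, seed: int) -> list[str]:
    """Draw up to k inputs from a larger pool; the seed makes a run repeatable."""
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))
```

The golden list stays small and hand-curated, while the sampler can be pointed at any pool of real or synthetic inputs. Fixing the seed lets you rerun the same sample when you want to reproduce a failure it surfaced.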
Cheap checks before expensive ones
Not every evaluation needs a language model to grade it, and running an LLM judge on a check a regex could handle is wasteful. The right order is to run deterministic metrics first, then escalate to LLM judging only for the parts that genuinely require judgment.
- Exact match for outputs that must equal a known value.
- Regex match for format validation, like confirming a date or an ID follows the required shape.
- JSON validity for structured outputs that must parse, catching the model wandering off-format before you spend anything on semantic grading. This pairs directly with constraining LLM output to a schema your code can trust, since the schema defines what "valid" means.
These checks are fast, cheap, and reliable, and they catch a large class of structural failures without invoking another model. A change that breaks your JSON output is caught instantly by an IsJson check, no judge required. Reserve the expensive evaluation for the cases where correctness is a matter of meaning rather than form.
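The three checks listed above can be sketched with nothing beyond the Python standard library. The function names are illustrative rather than taken from any particular eval framework, and `is_json` is a stand-in for the IsJson check mentioned above.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Pass only if the output equals the known value, ignoring outer whitespace."""
    return output.strip() == expected.strip()


def regex_match(output: str, pattern: str) -> bool:
    """Pass only if the whole output fits the required shape."""
    return re.fullmatch(pattern, output.strip()) is not None


def is_json(output: str) -> bool:
    """Pass only if the output parses as JSON."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


# Example: validate an ISO-style date and a structured reply.
print(regex_match("2024-01-31", r"\d{4}-\d{2}-\d{2}"))  # True
print(is_json('{"status": "ok"'))                       # False (missing brace)
```

Note that `is_json` only confirms the text parses. Checking that the parsed value has the right fields is a separate step against whatever schema you use.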
LLM-as-a-judge for the rest
Many of the things you care about cannot be checked deterministically. Helpfulness, a summary's faithfulness to its source, and whether the tone matches your brand are all judgment calls. This is where LLM-as-a-judge comes in: a capable secondary model configured to score your target model's outputs against an explicit grading rubric. It is AI evaluating AI, at a scale and consistency no human review team could match.
The judge is most valuable while you are iterating: during the experimental phase, when you are changing prompts and trying model variants, and for regression testing after any update to the model or prompt. It is also where you settle bigger architecture questions cheaply, such as whether fine-tuning beats prompting or just burns money. Run both against the same goldens and let the numbers decide. The efficiency gain is the unlock. Because the judge runs automatically, you can evaluate on every change rather than in periodic manual audits. Every prompt edit, every model swap, every adjustment gets scored against your dataset before it ships, turning evaluation from an occasional review into a continuous gate.
The judge is only as good as its rubric. A vague instruction like "rate the quality from 1 to 10" produces noisy, inconsistent scores. A specific rubric that defines what each score means and what to check for produces gradings you can actually trust to gate a release. Writing that rubric well is the real work of building a judge, and it pays off every time the harness runs. The judge is also a meaningful line item in your token bill, so keep that cost down by running it only where a deterministic check cannot do the job.
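To make "specific rubric" concrete, here is a hedged sketch of a judge wrapper. The rubric text, the `SCORE:` reply convention, and the `call_model` parameter are all assumptions for illustration. `call_model` is a placeholder for whatever client function sends a prompt to your judge model and returns its text, and no real provider API is shown.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric: every score level is defined, not just the scale.
RUBRIC = """Score the SUMMARY for faithfulness to the SOURCE on a 1-3 scale.
3 = every claim in the summary is supported by the source.
2 = a minor unsupported detail that does not change the meaning.
1 = at least one claim contradicts the source or is absent from it.
Reply with a line 'SCORE: <n>' followed by one sentence of justification."""


def build_judge_prompt(source: str, summary: str) -> str:
    """Combine the rubric with the material being graded."""
    return f"{RUBRIC}\n\nSOURCE:\n{source}\n\nSUMMARY:\n{summary}"


def parse_score(reply: str) -> Optional[int]:
    """Extract the score; return None if the judge went off-format."""
    match = re.search(r"SCORE:\s*([1-3])\b", reply)
    return int(match.group(1)) if match else None


def judge(source: str, summary: str,
          call_model: Callable[[str], str]) -> Optional[int]:
    """Grade one output. call_model is an assumed prompt-in, text-out function."""
    return parse_score(call_model(build_judge_prompt(source, summary)))
```

Returning `None` for an unparseable reply keeps a malformed judgment from being silently counted as a pass or a fail, so the harness can flag it for a retry or a manual look.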
Wiring it into a workflow that fits how you ship
You do not need a heavyweight pipeline to get the value. The harness can be a script you run locally before merging a change, and for many teams that is exactly the right shape: a fast local gate rather than a hosted system. You make your prompt change, run the eval, and read the result. Did the pass rate go up? Did any golden case regress? Did the judge scores improve? If a critical case broke, you do not ship. If the numbers improved across the board, you do, with evidence.
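The ship-or-hold decision described above can be reduced to a few lines. This is a sketch of one possible policy, not the only reasonable one: the `CaseResult` type and the rule "no critical failures and no drop in pass rate" are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    passed: bool
    critical: bool = False


def pass_rate(results: list[CaseResult]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def regressions(baseline: list[CaseResult],
                candidate: list[CaseResult]) -> list[str]:
    """IDs of cases that passed in the baseline run but fail in the candidate."""
    base = {r.case_id: r.passed for r in baseline}
    return [r.case_id for r in candidate
            if not r.passed and base.get(r.case_id, False)]


def should_ship(baseline: list[CaseResult],
                candidate: list[CaseResult]) -> bool:
    """Assumed policy: block on any failing critical case or a lower pass rate."""
    critical_broken = any(r.critical and not r.passed for r in candidate)
    return not critical_broken and pass_rate(candidate) >= pass_rate(baseline)
```

In a local script you might print the output of `regressions` and exit with a non-zero status when `should_ship` returns `False`, which is enough for a pre-merge hook to stop the change.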
The same goldens get re-run across model versions and app iterations, generating fresh outputs each time. This is what makes comparison clean: when a new model version comes out, you run your existing golden dataset against it and see exactly where it is better and where it is worse for your specific use case, rather than trusting a generic benchmark that has nothing to do with your application. The same harness is how you would prove that a small local classifier trained on your own data outperforms a frontier model on the one task you care about.
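A per-case comparison between two model versions might look like the following sketch. It assumes each golden is an `(id, prompt, expected)` tuple graded by exact match, and that a "model" is any function from prompt to text. Both are simplifications for illustration, and in practice the grading step would be whichever check or judge fits the case.

```python
from typing import Callable

Golden = tuple[str, str, str]   # (case_id, prompt, expected): assumed shape
Model = Callable[[str], str]    # stand-in for a real model client


def run_goldens(goldens: list[Golden], model: Model) -> dict[str, bool]:
    """Generate a fresh output for every golden and grade it by exact match."""
    return {case_id: model(prompt).strip() == expected
            for case_id, prompt, expected in goldens}


def compare(old: dict[str, bool], new: dict[str, bool]) -> dict[str, list[str]]:
    """List which cases the new version fixed and which it broke."""
    return {
        "improved": sorted(c for c in new if new[c] and not old.get(c, False)),
        "regressed": sorted(c for c in new if not new[c] and old.get(c, False)),
    }
```

Because both runs use the same goldens, the two lists tell you where the new version is better and where it is worse on your own cases, which an aggregate score alone would hide.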
This rigor is the same discipline we apply to any AI feature we build, including LadenX, the AI site-reliability engineer we built, where a model is deciding which commands to run against real servers and getting it wrong has consequences beyond a bad answer. When the stakes are that high, you do not ship model behavior you have not measured, and the harness is what lets you measure it before it touches anything that matters. It is the same posture behind teaching an AI SRE to diagnose root cause rather than just restart the service: prove the behavior first, then let it act. It is exactly the kind of foundation we put under the AI and automation work we deliver, so a model in production is one you can prove improved rather than one you hope did. If retrieval is in the loop, the eval also has to cover the retrieval layer that quietly wrecks your RAG answers, because a perfect prompt over the wrong documents still fails.
What the harness actually buys you
The value of an eval harness is confidence you can defend. Before, "I improved the prompt" was an opinion. After, it is a measurement: this change moved the pass rate from here to there and regressed nothing critical. You can iterate faster because you are no longer afraid every change will silently break something, since the harness will tell you the moment it does. You catch the regression in your terminal during development instead of in a customer's bug report a week later.
There is a quieter benefit too. The act of building the golden dataset forces you to define what correct actually means for your feature, case by case. That definition is valuable on its own, because a surprising number of LLM features ship without anyone having written down what a good output even is. The harness makes you answer that question, and once it is answered, you have a target to optimize toward instead of a feeling to chase.
The teams that ship reliable AI features are not the ones with the best prompts. They are the ones who can prove a change is an improvement before their users find out it was not. Once a change ships, agent observability that traces every span is what lets you replay the failures the harness did not anticipate. Build the golden dataset, run the cheap checks first, judge the rest with a sharp rubric, and gate every change on the result. The regression you catch in your eval harness is the production incident that never happens.






