DERKONLINE

Force LLM Output Into Schemas Your Code Can Actually Trust

Use strict structured outputs, schema-first design, and a validation layer so a model can never silently break the code that acts on its answers.

Derrick S. K. Siawor · 8 min read

A language model that returns prose is a demo. A language model wired into a pipeline that charges a card, files a ticket, or updates a record has to return data your code can parse without flinching. The gap between those two things is where most AI features quietly break. The model answers correctly nine times, then on the tenth it wraps the JSON in a markdown fence, renames a field, or appends a friendly sentence, and your JSON.parse throws in production at 2am.

The fix is not a better prompt. The fix is to stop asking the model to behave and start constraining what it is allowed to emit, then validating what comes back against a schema your code owns. Done well, this turns the model from a source of surprises into a component you can build on.

Why "respond in JSON" is not enough

Telling a model to "respond only in valid JSON" relies on the model choosing to comply at every token. It usually does. Usually is the problem. OpenAI's own benchmark put gpt-4-0613 at under 40 percent on complex JSON schema following when prompted with instructions alone. That number does not mean the model is dumb; it is the expected behaviour of free-form sampling against a structural target it was never forced to hit.

There are two distinct guarantees worth separating, because teams conflate them and ship the weaker one. The cost of getting it wrong scales with volume: every malformed response is a call you pay for and then have to redo, which is also why keeping the LLM bill from eating your margin starts with not burning tokens on those repeats.

JSON mode guarantees the output is syntactically valid JSON. It does not guarantee the output matches your shape. You can ask for { "amount": number, "currency": string } and get back valid JSON that happens to be { "total": "fifty dollars" }. Parseable, useless.
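The gap between "parses" and "matches" is easy to reproduce. The sketch below is dependency-free and uses a hand-written type guard in place of a real schema library; the `Payment` interface and `isPayment` helper are hypothetical names introduced only for this illustration.

```typescript
// Hypothetical shape the calling code expects: { amount: number, currency: string }.
interface Payment {
  amount: number;
  currency: string;
}

// Hand-written shape check; a Zod or Pydantic schema would normally do this job.
function isPayment(value: unknown): value is Payment {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record["amount"] === "number" &&
    Number.isFinite(record["amount"]) &&
    typeof record["currency"] === "string"
  );
}

// Syntactically valid JSON, which is all that JSON mode promises.
const rawReply = '{ "total": "fifty dollars" }';

const parsedReply: unknown = JSON.parse(rawReply); // does not throw
const shapeOk = isPayment(parsedReply); // false: wrong field name, wrong type

console.log({ parsedReply, shapeOk });
```

Nothing here throws, which is exactly the problem: without the second check the bad payload flows straight into whatever code reads `amount`.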

Structured outputs (sometimes called strict mode) goes further: the output is constrained to match a JSON schema you supply, field by field, type by type. With gpt-4o-2024-08-06 and strict mode on, OpenAI measured 100 percent schema compliance on the same evaluation that scored under 40 percent for prompting alone. That is the difference between hoping and knowing.

How constrained decoding actually works

The reason strict mode is a hard guarantee and not a softer "we tried" is that it operates at the token level, not the response level. This is constrained decoding, and it is worth understanding because it shapes how you design schemas.

At each generation step a normal model picks from the full token vocabulary weighted by its learned probabilities. A constrained decoder sits in front of that step and masks out every token that would violate the grammar. If the schema says the next thing must be a closing brace or a comma, every token that is not a closing brace or a comma has its logit driven to negative infinity before sampling. The model still uses its learned distribution, but only over the tokens that keep the output on a valid path.
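A single decoding step can be sketched in a few lines. This is a toy, assuming an invented five-token vocabulary and made-up logits; real decoders do the same masking over the model's full vocabulary.

```typescript
// Toy vocabulary and raw scores ("logits") for one decoding step.
// The values are invented for illustration only.
const vocab = ["}", ",", "Sure", "```", "\"note\""];
const logits = [1.2, 0.4, 3.5, 2.9, 0.1];

// Mask every token the grammar does not allow by setting its logit to -Infinity.
function maskLogits(scores: number[], allowed: Set<number>): number[] {
  return scores.map((score, i) => (allowed.has(i) ? score : -Infinity));
}

// Softmax over the masked scores: masked tokens end up with probability 0.
// Assumes at least one token is allowed, otherwise every score is -Infinity.
function softmax(scores: number[]): number[] {
  const maxScore = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Suppose the schema says the next token must be "}" or ",".
const allowedIds = new Set([0, 1]);
const probs = softmax(maskLogits(logits, allowedIds));

// Greedy pick; sampling would draw from `probs` instead.
const bestId = probs.indexOf(Math.max(...probs));
console.log(vocab[bestId], probs);
```

Unmasked, the highest-scoring token is "Sure". After masking, the model's relative preference between "}" and "," is untouched, but nothing else can be chosen.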

Libraries implement this with a finite state machine or a pushdown automaton derived from your schema. The Outlines library, for example, precompiles a JSON schema into an index structure that gives an O(1) lookup of valid next tokens at each step, so the constraint adds little latency. vLLM, LM Format Enforcer, and the major hosted APIs all ship a version of this. The practical result is that invalid output is not merely caught after the fact; it is made impossible to generate.
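To make the state-machine idea concrete, here is a toy decoder for the tiny schema { "ok": boolean }. It is not how Outlines or any other library is implemented; the token set, the `transitions` table, and the `score` function standing in for the model are all assumptions made for the example.

```typescript
// Tokens of a toy vocabulary, including two the grammar should never allow.
type Token = "{" | "\"ok\"" | ":" | "true" | "false" | "}" | "Sure!" | "```json";
type State = "start" | "key" | "colon" | "value" | "close" | "done";

// Transition table for the schema { "ok": boolean }.
// Built once up front, so finding the valid next tokens is a single Map lookup.
const transitions = new Map<State, Map<Token, State>>([
  ["start", new Map<Token, State>([["{", "key"]])],
  ["key", new Map<Token, State>([["\"ok\"", "colon"]])],
  ["colon", new Map<Token, State>([[":", "value"]])],
  ["value", new Map<Token, State>([["true", "close"], ["false", "close"]])],
  ["close", new Map<Token, State>([["}", "done"]])],
]);

// Stand-in for the model: a hypothetical scorer that likes chatty tokens best.
function score(token: Token): number {
  const preference: Record<Token, number> = {
    "Sure!": 9, "```json": 8, "false": 5, "true": 4, "{": 3, "\"ok\"": 2, ":": 1, "}": 0,
  };
  return preference[token];
}

// Greedy decoding restricted to the tokens the current state allows.
function decode(): string {
  let state: State = "start";
  let output = "";
  while (state !== "done") {
    const allowed = transitions.get(state);
    if (!allowed) throw new Error("no transitions from state " + state);
    let best: Token | undefined;
    for (const token of Array.from(allowed.keys())) {
      if (best === undefined || score(token) > score(best)) best = token;
    }
    if (best === undefined) throw new Error("dead end in state " + state);
    output += best;
    state = allowed.get(best) as State;
  }
  return output;
}

const decoded = decode();
console.log(decoded, JSON.parse(decoded)); // prints {"ok":false} and the parsed object
```

Even though the scorer would rather open with "Sure!" or a markdown fence, those tokens are never candidates, so the only strings this loop can produce are the two the schema permits.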

One team that moved a data extraction pipeline from prompt-based JSON requests to constrained decoding reported post-processing errors dropping from 32 percent to 0.4 percent. That residual 0.4 is the lesson for the next section: constrained decoding fixes shape, not meaning.

Schema first, prompt second

Figure: Schema-first LLM flow: constrained decoding, then validation on return with one targeted retry.

The shift that matters more than any single API flag is making the schema the primary artefact. You define a Zod or Pydantic schema, generate the JSON schema from it, hand that to the model as the strict constraint, and parse the response back through the same schema on return. The schema is the contract. The prompt is just the instruction for filling it.

import { z } from "zod";

const Invoice = z.object({
  vendor: z.string().min(1),
  amount_cents: z.number().int().positive(),
  currency: z.enum(["USD", "GHS", "EUR", "GBP"]),
  due_date: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
  line_items: z.array(z.object({
    description: z.string(),
    cents: z.number().int().nonnegative(),
  })).max(200),
});

Notice the choices. Amounts are integer cents, never floats, so rounding never enters the data. Currency is an enum, so the model cannot return "dollars" or "US Dollar". The date is a constrained string the model cannot mangle. The line items array is capped so a hallucinated thousand-row response cannot blow up downstream memory. Every one of those is a place the model could have surprised you, closed off in the schema instead of patched in a try/catch later.
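Handing the schema to the model as the strict constraint might look like the sketch below. It assumes OpenAI's Chat Completions request shape for structured outputs, and the JSON Schema is written by hand here purely so the example has no dependencies; in a real project you would generate it from the Zod schema. Support for keywords such as string patterns or array length limits varies by provider and model, so this sketch sticks to types, enums, `required`, and `additionalProperties: false`, and leaves the tighter checks to the parse on return. The `buildInvoiceRequest` helper and the prompt text are hypothetical.

```typescript
// JSON Schema mirroring the Invoice shape, hand-written for illustration.
const invoiceJsonSchema = {
  type: "object",
  properties: {
    vendor: { type: "string" },
    amount_cents: { type: "integer" },
    currency: { type: "string", enum: ["USD", "GHS", "EUR", "GBP"] },
    due_date: { type: "string" },
    line_items: {
      type: "array",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          cents: { type: "integer" },
        },
        required: ["description", "cents"],
        additionalProperties: false,
      },
    },
  },
  // Strict mode expects every property to be listed as required
  // and additional properties to be switched off.
  required: ["vendor", "amount_cents", "currency", "due_date", "line_items"],
  additionalProperties: false,
} as const;

// Request body only; sending it to the provider is left to your HTTP client or SDK.
function buildInvoiceRequest(documentText: string) {
  return {
    model: "gpt-4o-2024-08-06",
    messages: [
      { role: "system", content: "Extract the invoice fields from the document." },
      { role: "user", content: documentText },
    ],
    response_format: {
      type: "json_schema",
      json_schema: { name: "invoice", strict: true, schema: invoiceJsonSchema },
    },
  };
}

const requestBody = buildInvoiceRequest("ACME Ltd, total USD 50.00, due 2025-01-31");
console.log(JSON.stringify(requestBody, null, 2));
```

The constraints that did not make it into the JSON Schema (positive amounts, the date pattern, the 200-item cap) are still enforced, just one step later, when the response goes back through the Zod schema.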

Validate even when the API says it is strict

Strict mode constrains structure. It does not constrain truth, and it does not protect you from the seams between your code and the provider. Several failure modes survive a perfectly compliant schema:

  • A field can be the right type and the wrong value. amount_cents: 0 is schema-valid and probably wrong for an invoice.
  • Enums and unions have edge cases. Some providers have documented gaps where strict schemas with deep nesting or certain oneOf constructs degrade.
  • Refusals and length cutoffs return a different shape. A model that refuses, or that hits its token ceiling mid-object, hands you something the schema never described.
  • You may be calling more than one provider, and their strict implementations are not identical.

So the rule is: constrain on the way out, validate again on the way in. Parse the response through your Zod or Pydantic schema even though the API promised compliance. The parse is cheap, and it is the boundary where a provider change, a model swap, or a truncated stream gets caught by your code instead of by your customer.
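A return-path check that covers the failure modes above could be sketched as follows. The `finish_reason`, `content`, and `refusal` fields follow OpenAI's Chat Completions response; the `Choice` slice, `Checked` result type, `checkChoice`, and `validateAmount` are hypothetical helpers, and `validate` stands in for a schema's safe-parse call so the example stays dependency-free.

```typescript
// Minimal slice of a chat completion choice.
interface Choice {
  finish_reason: string;
  message: { content: string | null; refusal?: string | null };
}

type Checked<T> =
  | { ok: true; value: T }
  | { ok: false; reason: "refusal" | "truncated" | "not_json" | "schema"; detail: string };

// `validate` returns an error message, or null when the value is acceptable.
function checkChoice<T>(choice: Choice, validate: (v: unknown) => string | null): Checked<T> {
  if (choice.message.refusal) {
    return { ok: false, reason: "refusal", detail: choice.message.refusal };
  }
  if (choice.finish_reason === "length") {
    return { ok: false, reason: "truncated", detail: "hit the token limit mid-output" };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(choice.message.content ?? "");
  } catch (err) {
    return { ok: false, reason: "not_json", detail: String(err) };
  }
  const problem = validate(parsed);
  return problem === null
    ? { ok: true, value: parsed as T }
    : { ok: false, reason: "schema", detail: problem };
}

// A value check: zero is well-typed but wrong for an invoice.
function validateAmount(v: unknown): string | null {
  const amount = (v as { amount_cents?: unknown } | null)?.amount_cents;
  return typeof amount === "number" && Number.isInteger(amount) && amount > 0
    ? null
    : "amount_cents must be a positive integer";
}

const zeroAmount = checkChoice<{ amount_cents: number }>(
  { finish_reason: "stop", message: { content: '{"amount_cents":0}' } },
  validateAmount,
);
const cutOff = checkChoice<{ amount_cents: number }>(
  { finish_reason: "length", message: { content: '{"amount_cen' } },
  validateAmount,
);
console.log(zeroAmount, cutOff);
```

Returning a tagged result instead of throwing keeps the decision about what to do next (retry, fall back, or surface an error) in the caller, where the business context lives.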

When validation fails, you have a real decision: retry with the validation error fed back into the prompt, fall back to a safe default, or surface a clean error. Retrying blind, the same call with the same inputs, is the worst option because it burns tokens and latency on the same failure. The same care applies when those retries trigger external side effects, so make any tool call the model issues idempotent before a double-fire can double-charge a customer. Feeding the specific validation message back ("amount_cents must be a positive integer, you returned -5") gives the model the information to correct, and usually one retry is enough. This is also the point where a truncated object gets caught, which is why it pays to stream responses that never cut off mid-object rather than parse a half-finished payload.
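The "one targeted retry, then fall back" policy can be sketched like this. `callModel` and `validate` are hypothetical hooks you would supply: one sends a prompt to your provider, the other wraps your schema's parse. The fake model at the bottom exists only to exercise the flow.

```typescript
type CallModel = (prompt: string) => Promise<string>;
type Validate<T> = (raw: string) => { ok: true; value: T } | { ok: false; error: string };

async function extractWithRetry<T>(
  prompt: string,
  callModel: CallModel,
  validate: Validate<T>,
): Promise<{ value: T | null; attempts: number }> {
  const first = validate(await callModel(prompt));
  if (first.ok) return { value: first.value, attempts: 1 };

  // One targeted retry: the model sees exactly what was wrong with its answer.
  const corrective =
    prompt + "\n\nYour previous answer was rejected: " + first.error + ". Return corrected JSON only.";
  const second = validate(await callModel(corrective));
  if (second.ok) return { value: second.value, attempts: 2 };

  // No loop: null tells the caller to use a safe default or surface a clean error.
  return { value: null, attempts: 2 };
}

// Fake model for demonstration: wrong on the first call, right once corrected.
const replies = ['{"amount_cents":-5}', '{"amount_cents":5000}'];
const promptsSeen: string[] = [];
const fakeModel: CallModel = async (p) => {
  promptsSeen.push(p);
  return replies[promptsSeen.length - 1] ?? "";
};

const validateInvoiceAmount: Validate<{ amount_cents: number }> = (raw) => {
  try {
    const data = JSON.parse(raw) as { amount_cents?: unknown } | null;
    const cents = data?.amount_cents;
    if (typeof cents === "number" && Number.isInteger(cents) && cents > 0) {
      return { ok: true, value: { amount_cents: cents } };
    }
    return { ok: false, error: "amount_cents must be a positive integer, you returned " + String(cents) };
  } catch {
    return { ok: false, error: "response was not valid JSON" };
  }
};

const retryDemo = extractWithRetry("Extract the invoice.", fakeModel, validateInvoiceAmount);
retryDemo.then((result) => console.log(result));
```

The attempt count is fixed at two by construction, so a model that keeps failing costs two calls, never an unbounded loop.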

Where this discipline pays off most

The stakes scale with what the output triggers. A schema slip in a summarization feature is cosmetic. A schema slip in an agent that runs shell commands or moves money is an incident. We learned this building LadenX, our AI site-reliability engineer: it reads logs, decides what to do, and acts on a server. Every action it proposes is classified and validated before anything runs, and destructive operations are refused outright without a human sign-off. That same instinct shows up in teaching an AI SRE to diagnose root cause rather than blindly restart, and in putting guardrails around autonomous fixes before they touch production. The model's freedom ends exactly where the schema and the policy begin. That boundary is what makes an autonomous tool safe to point at production rather than a liability, and it pairs naturally with defending the same agent against prompt injection from untrusted content.
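As an illustration of "classified and validated before anything runs", here is a minimal sketch of such a gate. It is not LadenX's actual implementation: the action catalogue, the risk labels, and the `gateAction` function are all hypothetical, and the target check is deliberately simplistic. The one design choice worth copying is that the model picks from a closed set of named actions rather than emitting free-form shell strings.

```typescript
// Hypothetical action catalogue with a risk label per action.
const ACTION_RISK = {
  read_logs: "safe",
  check_disk: "safe",
  restart_service: "destructive",
  delete_files: "destructive",
} as const;

type ActionKind = keyof typeof ACTION_RISK;
type Verdict = "run" | "refuse";

// Type guard: the model may only name an action that exists in the catalogue.
function isActionKind(value: unknown): value is ActionKind {
  return typeof value === "string" && Object.prototype.hasOwnProperty.call(ACTION_RISK, value);
}

// Validate the proposal's shape, classify it, then apply the policy.
function gateAction(proposal: unknown, humanApproved: boolean): Verdict {
  if (typeof proposal !== "object" || proposal === null) return "refuse";
  const { kind, target } = proposal as { kind?: unknown; target?: unknown };
  if (!isActionKind(kind)) return "refuse";
  // Targets are plain service or path names: no spaces, no shell metacharacters.
  if (typeof target !== "string" || !/^[A-Za-z0-9_./-]{1,128}$/.test(target)) return "refuse";
  if (ACTION_RISK[kind] === "destructive" && !humanApproved) return "refuse";
  return "run";
}

console.log(gateAction({ kind: "read_logs", target: "nginx" }, false)); // run
console.log(gateAction({ kind: "delete_files", target: "/var/tmp/cache" }, false)); // refuse
console.log(gateAction({ kind: "rm -rf /", target: "/" }, true)); // refuse
```

The last call is refused even with approval, because an action outside the catalogue is a schema violation, not a policy question.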

The same principle holds for any system where model output feeds the next step automatically. If the downstream code assumes a shape, the model must be forced into that shape and then checked against it. There is no version of "the prompt is good enough" that survives a few hundred thousand calls. When several agents pass structured payloads to one another, that contract is also what lets you orchestrate multiple agents without losing control of the flow.

A checklist that holds up

Before you ship a feature where model output crosses into code that acts on it, confirm:

  • The output shape is a real schema in Zod or Pydantic, not a description in the prompt.
  • Strict structured outputs are enabled, not plain JSON mode.
  • Numbers that represent money are integers, dates and identifiers are constrained strings, and free choices are enums.
  • Arrays and strings have sane maximum sizes so a runaway response cannot exhaust memory.
  • The response is parsed through the same schema on return, every time, even with strict mode on.
  • Validation failures feed the specific error back for one targeted retry, then fall back cleanly rather than looping.

When the failure does happen in production, you want to see exactly which call returned the off-shape payload, which is the job of tracing every span your agent emits.

Once the shape is locked, the next failures hide in the layers around the model: the retrieval layer quietly wrecking your RAG answers and the eval harness that catches regressions before users do are where schema-clean output still goes wrong on meaning.

Get those right and the model stops being the flaky part of your system. It becomes the part you can reason about, because the only outputs it can produce are ones your code already knows how to handle. If you are wiring a model into something that acts on its own, we build these guardrails for a living and would rather help you design the boundary than debug it after launch.