LLM structured output is reliable now. The reliability problem just moved.
Constrained decoding guarantees your JSON parses. It says nothing about whether the values are correct.
Eighteen months ago, the biggest failure mode for LLM structured output was syntax. A stray comma, an unclosed bracket, a string value where an integer was expected, and the pipeline threw a parse error. Teams wrote JSON repair functions. Some called the API twice with a correction prompt. A few wrapped every LLM call in a retry loop that continued until the response parsed cleanly.
Then constrained decoding landed. OpenAI's Strict Mode, Anthropic's tool-use schema enforcement, and Google's response_schema parameter all shipped within roughly six months of each other. With constrained decoding, the model samples only from tokens that could legally extend the current sequence under the provided schema. The JSON always parses now. The syntax problem is genuinely, completely solved.
So why are engineers still debugging LLM structured output failures in production?
What constrained decoding actually guarantees
Constrained decoding restricts the token distribution at inference time. At each sampling step, only tokens that could legally extend the current output under the JSON schema are eligible. OpenAI's Strict Mode reports a parse-valid rate that rounds to 100%. Anthropic routes tool-call responses through a format that structurally cannot produce malformed JSON.
This eliminates a real class of production failure. The guarantee covers: the response is valid JSON; required fields are present; each value matches the declared type; enum fields contain only the declared values.
It says nothing about whether a deadline field contains "2026-09-15T00:00:00Z" or "sometime next quarter". Nothing about whether a customerId field contains an ID that exists in your database or one the model invented. Nothing about whether a confidence field being absent means the model assigned low confidence or simply didn't include it.
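The masking step can be illustrated with a toy example. This is a minimal sketch, not any vendor's implementation: the token scores and the allowed-token set are invented for illustration, and a real decoder would derive the allowed set at each step from a grammar compiled from the schema.

```python
import math


def constrained_argmax(logits: dict[str, float], allowed: set[str]) -> str:
    """Pick the highest-scoring token among those the grammar currently allows."""
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    return max(masked, key=masked.get)


# Toy decoding step at the start of a JSON value (all numbers are made up).
# The model "prefers" the bare word `next`, which cannot begin a JSON value.
logits = {"next": 2.5, '"': 1.0, "{": 0.3, "7": 0.1}
allowed_at_value_start = {'"', "{", "7"}

unconstrained = constrained_argmax(logits, set(logits))
constrained = constrained_argmax(logits, allowed_at_value_start)
print(unconstrained, constrained)
```

The mask forces the opening quote, so the output stays parseable. Once inside the string, though, almost any text token is legal, which is why a value like "next Friday" passes untouched.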
Four failure classes that valid JSON hides
Type-coerced strings. Your schema declares deadline as a string, and your downstream code feeds that string to datetime.fromisoformat(). The model returns "next Friday". Valid under the schema. Runtime exception two lines later. The model has no way to infer that your system demands ISO 8601 format unless you explicitly say so; even then it sometimes doesn't hold.
Hallucinated references. Fields meant to hold a database ID — customerId, productSku, invoiceRef — get populated with values the model constructs that look plausible: the right prefix, the right approximate length, the right character set. They satisfy the string type. They don't exist in your database. The failure surfaces at the lookup, not at parsing, and the error message points at your lookup code.
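Silent optional omissions. A field marked optional can legally be absent. When downstream code reads it with a fallback, as in if result.get("confidence", 0.0) > 0.8:, an absent confidence field silently takes the else branch. No exception. No log entry. Just a business logic path taken incorrectly. (In Python, comparing a bare None against 0.8 would at least raise a TypeError; it is the defaulted read, or a language that treats a missing value as falsy, that hides the problem.) These take time to find because they look like business logic errors rather than LLM output errors.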
Semantic enum collisions. If your schema allows status: "pending" | "processing" | "done" and the input describes a state that should map to "processing", the model may return "pending" — the adjacent value in its representation space. Structurally valid. Business logic branches wrong. Unlike type mismatches, this one is particularly hard to catch in testing because it only appears when the model's concept boundary falls in the wrong place relative to your domain model.
| Failure class | Example | Schema catches it | What catches it |
|---|---|---|---|
| Type-coerced string | "deadline": "next Friday" | No | Runtime format check |
| Hallucinated reference | "customerId": "cus_8j3h2k" | No | Database lookup validation |
| Silent optional omission | confidence field absent | No | Explicit null + null check |
| Semantic enum collision | "status": "pending" for a processing state | No | Runtime business-logic assertion |
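To make the table concrete, here is a small sketch in which one payload passes a structural check and still fails three semantic ones. Everything in it is illustrative: `structurally_valid` is a hand-rolled stand-in for schema validation, and `KNOWN_CUSTOMERS` is a hypothetical substitute for a real database lookup.

```python
from datetime import datetime

ALLOWED_STATUS = {"pending", "processing", "done"}
KNOWN_CUSTOMERS = {"cus_1a2b3c"}  # hypothetical stand-in for a database table

payload = {"deadline": "next Friday", "customerId": "cus_8j3h2k", "status": "pending"}


def structurally_valid(p: dict) -> bool:
    """Roughly what a schema enforces: types, enum membership, optional confidence."""
    return (
        isinstance(p.get("deadline"), str)
        and isinstance(p.get("customerId"), str)
        and p.get("status") in ALLOWED_STATUS
        and ("confidence" not in p or isinstance(p["confidence"], (int, float)))
    )


def semantic_problems(p: dict) -> list[str]:
    """Checks the schema cannot express."""
    problems = []
    try:
        datetime.fromisoformat(p["deadline"])
    except ValueError:
        problems.append("deadline is not ISO 8601")
    if p["customerId"] not in KNOWN_CUSTOMERS:
        problems.append("customerId not found")
    if "confidence" not in p:
        problems.append("confidence omitted")
    # An enum collision needs ground truth about the input, so it is not checked here.
    return problems


print(structurally_valid(payload))
print(semantic_problems(payload))
```

The enum collision is deliberately left out: detecting it requires knowing what the input actually described, which is why it falls to business-logic assertions rather than value checks.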
Why your LLM structured output test suite misses these failures
The standard eval pattern for LLM structured output is schema validation: if the response matches the schema, the test passes. This was appropriate when syntax was the primary failure mode. It's no longer a sufficient check.
Adding golden-answer comparison helps, but it has two gaps. First, LLM-as-judge accuracy degrades for precise values. A judge model regularly treats "2025-03-15" and "March 15, 2025" as equivalent even when your downstream parser doesn't. Testing tools optimised for natural language equivalence don't help with format precision. Second, semantic failures are long-tail and input-dependent: a 50-example eval suite won't surface the failure that only appears when a date is two years in the future, or when a customer name contains Unicode, or when the context is four tool-call rounds deep.
The missing tool is property-based testing for LLM outputs: a harness that generates adversarial inputs targeting boundary conditions in your schema and validates output values against semantic constraints, not structural ones. The closest analogy in traditional software testing is Hypothesis in Python or fast-check in JavaScript. No mainstream library ships an LLM-specific version of this.
The practical consequence is that semantic failures are under-tested until production surfaces them. Teams that discover these issues often trace back to a schema validated against a small set of representative inputs: inputs that happened to produce well-formatted outputs. The tail of the real input distribution didn't make it into the eval suite.
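A rough idea of what such a harness could look like, using only the standard library. This is a sketch under stated assumptions: `fake_extract` is a hypothetical stand-in for the real LLM call and is written to misbehave on far-future dates, so the harness has something to find. A real setup would call the model and would probably use a generator library such as Hypothesis.

```python
import random
from datetime import date, datetime, timedelta

BASE = date(2026, 1, 1)  # fixed so the run is reproducible


def fake_extract(text: str) -> dict:
    """Hypothetical stand-in for the LLM call; deliberately breaks on far-future dates."""
    iso = text.rsplit(" ", 1)[-1]
    if date.fromisoformat(iso).year >= 2028:
        return {"deadline": "in a couple of years"}
    return {"deadline": iso}


def deadline_is_iso(output: dict) -> bool:
    """The semantic property under test: the deadline must parse as ISO 8601."""
    try:
        datetime.fromisoformat(output["deadline"])
        return True
    except ValueError:
        return False


def find_failures(extract, trials: int = 200, seed: int = 0) -> list[date]:
    """Probe boundary offsets plus random ones; return the dates that break the property."""
    rng = random.Random(seed)
    offsets = [0, 365, 730, 1095] + [rng.randint(0, 1095) for _ in range(trials)]
    failures = []
    for days in offsets:
        d = BASE + timedelta(days=days)
        if not deadline_is_iso(extract(f"Ship by {d.isoformat()}")):
            failures.append(d)
    return failures


failures = find_failures(fake_extract)
print(len(failures), min(failures))
```

The point of the design is that the assertion targets a property of the value (it parses as a date) rather than equality with a golden answer, so the same check scales across generated inputs.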
Four patterns that catch semantic failures
Runtime value validation. This is the most direct fix. After parsing the structured output, run a second pass that checks semantic rules programmatically. For date fields: datetime.fromisoformat(val). For ID fields: a lookup or format check against the expected prefix, length, and character set. For numeric fields: explicit bounds. This is mechanical to write and works reliably. Pydantic v2's @field_validator is a clean pattern for this in Python; Zod refinements work similarly in TypeScript.
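The ID and numeric checks might look like the following. It is a minimal sketch: the `cus_` prefix with six lowercase alphanumerics is an assumed ID shape, and `known_ids` stands in for whatever lookup your system actually performs.

```python
import re

# Assumed ID shape, for illustration only: "cus_" plus six lowercase alphanumerics.
CUSTOMER_ID_PATTERN = re.compile(r"cus_[a-z0-9]{6}")


def validate_customer_id(value: str, known_ids: set[str]) -> str:
    """Cheap format check first, then the existence check."""
    if not CUSTOMER_ID_PATTERN.fullmatch(value):
        raise ValueError(f"customer_id has the wrong shape: {value!r}")
    if value not in known_ids:
        raise ValueError(f"customer_id does not exist: {value!r}")
    return value


def validate_confidence(value: float) -> float:
    """Explicit bounds for a numeric field."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"confidence must be between 0 and 1, got: {value!r}")
    return value
```

Note that a hallucinated ID such as "cus_8j3h2k" passes the format check, so the format check only filters obvious garbage; the lookup is what catches invented references.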
Field descriptions as priming context. The description property in JSON Schema is part of the context window — the model reads it when generating the response. Adding "Return as ISO 8601 format, e.g. 2026-06-10T00:00:00Z" to a date field's description measurably reduces format variance. It doesn't guarantee correct output, but it shifts the distribution noticeably. Specific examples with the exact expected format work better than general instructions like "return a valid date".
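As an illustration, a schema fragment carrying that description might be built like this. The surrounding object shape is an assumption; only the description text comes from the paragraph above.

```python
import json

# Illustrative schema fragment; the field name and object layout are assumptions.
task_schema = {
    "type": "object",
    "properties": {
        "deadline": {
            "type": "string",
            "description": (
                "Task deadline. Return as ISO 8601 format, "
                "e.g. 2026-06-10T00:00:00Z"
            ),
        },
    },
    "required": ["deadline"],
    "additionalProperties": False,
}

print(json.dumps(task_schema, indent=2))
```

The description only nudges the model, so it complements a runtime format check rather than replacing it.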
Explicit nulls over absent fields. Make optional fields required but nullable rather than omittable. In Pydantic v2 that means annotating the field as Optional[T] with no default, so the generated schema lists the key as required and allows null; giving the field a default of None keeps it omittable, which is the case to avoid. A response of "confidence": null is a deliberate signal from the model. A response with no confidence field at all is ambiguous: it might mean low confidence, or it might mean the model didn't consider the field. Collapsing that ambiguity forces the model to make the absence intentional and makes your downstream null checks predictable.
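On the consuming side, the distinction can be kept visible by treating the three states differently. This is a sketch with hypothetical helper names, and the choice to treat an explicit null as "not confident" is an assumption about your business rule.

```python
from typing import Optional


def read_confidence(result: dict) -> Optional[float]:
    """Return the value, or None for an explicit null; raise if the key is absent."""
    if "confidence" not in result:
        raise KeyError("confidence missing: the model omitted a required field")
    return result["confidence"]


def is_high_confidence(result: dict, threshold: float = 0.8) -> bool:
    confidence = read_confidence(result)
    if confidence is None:
        # Deliberate null from the model: handled by an explicit, visible rule.
        return False
    return confidence > threshold
```

An omitted key now fails loudly instead of quietly falling into the else branch.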
Shadow evals on production traffic. Run a second call on a random sample of real production inputs and compare the outputs for distributional consistency, not exact match. Are dates consistently in ISO 8601 format? Are numeric amounts consistently in the expected range? Are ID fields consistently in the expected format? This catches semantic drift from model updates, prompt changes, and context-length increases that offline eval suites miss. Even a 1% sample of production traffic surfaces failure modes that a hand-crafted 50-example test suite doesn't.
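A consistency report over sampled outputs could be computed along these lines. It is a sketch only: the `amount` field and its range are assumptions for illustration, and the outputs would come from your shadow calls rather than a hard-coded list.

```python
import random
from datetime import datetime


def is_iso(value) -> bool:
    try:
        datetime.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


def sample_inputs(inputs: list, rate: float, seed: int = 0) -> list:
    """Keep roughly `rate` of the inputs for the shadow call."""
    rng = random.Random(seed)
    return [item for item in inputs if rng.random() < rate]


def conformance_rates(outputs: list[dict]) -> dict[str, float]:
    """Fraction of outputs that satisfy each semantic check."""
    if not outputs:
        return {}
    checks = {
        "deadline_iso": lambda o: is_iso(o.get("deadline")),
        # Assumed field and range, for illustration only.
        "amount_in_range": lambda o: isinstance(o.get("amount"), (int, float))
        and 0 <= o["amount"] <= 1_000_000,
    }
    return {
        name: sum(check(o) for o in outputs) / len(outputs)
        for name, check in checks.items()
    }


shadow_outputs = [
    {"deadline": "2026-09-15", "amount": 120.0},
    {"deadline": "next Friday", "amount": 80},
    {"deadline": "2026-10-01T09:00:00", "amount": -5},
    {"deadline": "2026-11-30", "amount": 42},
]
rates = conformance_rates(shadow_outputs)
print(rates)
```

Tracking these rates over time, and alerting when one drops after a model or prompt change, is what turns the sample into a drift signal.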

    from datetime import datetime
    from typing import Optional

    from pydantic import BaseModel, ValidationError, field_validator


    class ParsedTask(BaseModel):
        deadline: str  # schema allows any string; we need ISO 8601
        customer_id: str  # schema allows any string; we need an existing ID
        # No default: required but nullable, so the model must emit an explicit null
        # instead of silently omitting the field.
        confidence: Optional[float]

        @field_validator("deadline")
        @classmethod
        def deadline_must_be_iso8601(cls, v: str) -> str:
            try:
                # Note: fromisoformat only accepts a trailing "Z" on Python 3.11+
                datetime.fromisoformat(v)
            except ValueError:
                raise ValueError(
                    f"deadline must be ISO 8601 (e.g. 2026-09-15T00:00:00Z), got: {v!r}"
                )
            return v


    # In your calling code (llm_client, prompt, and log_semantic_failure are
    # placeholders for your own client, input, and logger):
    raw = llm_client.parse_structured(prompt, schema=ParsedTask.model_json_schema())
    try:
        task = ParsedTask.model_validate(raw)
    except ValidationError as e:
        # Semantic failure: log, retry with corrected prompt, or surface to user
        log_semantic_failure(raw, e)
        raise

What the tooling landscape doesn't give you yet
Instructor, Outlines, and Marvin focus on the structural layer: making the JSON parse and match the schema. They solve the problem that constrained decoding already handles. They're useful, but they're not the layer where semantic failures live.
Pydantic v2 with @field_validator and model_validator lets you write semantic constraints as code. This works well. The limitation is that it requires you to write the validation logic explicitly; it doesn't infer constraints from field names or schema descriptions. Combined with detailed field descriptions that prime the model's output format, this is the most practical approach available today.
Guardrails AI and similar safety-focused libraries target policy compliance and PII detection, not data quality. They address a different layer of the stack and don't overlap with the semantic validity problem.
The gap is a tool that treats semantic validation as a first-class concern: something that infers plausible constraints from field names and descriptions, generates adversarial inputs against those constraints, and reports failure rates by class. Teams that build this internally get meaningful coverage improvements. No one has shipped a general-purpose open-source version yet. The space between 'the schema enforces structure' and 'the values are semantically correct' is still largely hand-rolled in each codebase.
The production failure mode that drives debugging in 2026 is not 'the JSON didn't parse.' It is the value that parsed cleanly and broke something two steps downstream. Constrained decoding moved the reliability boundary from the syntactic layer to the semantic one. Building the semantic validation layer — runtime checks, adversarial evals, explicit nulls over omissions — is the next piece of infrastructure that LLM-powered products need to treat as standard, not optional.