Function calling, tool use, and structured outputs are not the same thing
Three LLM mechanisms that look alike but work differently, and why choosing wrong costs you at production scale
Three different names, three major providers, one persistent confusion. Most teams building on LLM APIs in 2026 reach for function calling when they need reliable JSON back from a model. It works well enough in testing. At production volume, it becomes the source of intermittent validation failures that are hard to reproduce and expensive to debug.
The terminology does not help. OpenAI shipped 'function calling,' then renamed it to 'tools.' Anthropic calls the same concept 'tool use.' Both providers now support 'structured outputs' as a separate mechanism. Gemini added its own variant. A developer reading the docs could reasonably conclude these are all different names for the same thing.
They are not. The three mechanisms have different internal implementations, different reliability profiles, different latency costs, and different semantic meanings. Picking the wrong one at architecture time creates compounding problems at scale.
Why the confusion exists
Function calling and structured outputs both produce JSON. Both accept a JSON schema to describe the output shape. The API surface looks similar: you pass a schema, you get back structured data. The difference is what happens inside the model.
Structured outputs work through constrained decoding. The model's sampling process is constrained at the token level so it can only produce tokens that are valid continuations of the target schema. Invalid tokens are masked out before sampling occurs. This is enforced at inference time, not learned through training.
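The masking step is simple enough to sketch. The toy below is illustrative only, not any provider's implementation: assume a schema-driven state machine (not shown) supplies the set of tokens that keep the partial output valid, and everything else is zeroed out before sampling.

```python
import math
import random

def constrained_next_token(logits: list[float], vocab: list[str],
                           allowed: set[str]) -> str:
    # `allowed` comes from a schema-driven state machine that knows which
    # tokens keep the partial output parseable. Everything else gets a
    # logit of -inf, i.e. zero probability after softmax.
    # (Toy code: assumes at least one token in `allowed` is in `vocab`.)
    masked = [l if tok in allowed else -math.inf
              for l, tok in zip(logits, vocab)]
    peak = max(masked)
    weights = [math.exp(l - peak) for l in masked]
    return random.choices(vocab, weights=weights, k=1)[0]

# With a field constrained to ["pending", "approved", "rejected"], only
# tokens continuing one of those strings are ever in `allowed`, so a
# value like "Approved" is unreachable rather than merely unlikely.
```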
Function calling works differently. It is a trained behaviour: the model learned during fine-tuning to produce structured output when given a function schema. There is no token-level enforcement. The model is more likely to produce valid JSON, but it can still hallucinate field names, produce values outside a specified enum, or omit required fields. The schema is a strong prior, not a guarantee.
What function calling actually does
Function calling has one semantic purpose: it lets the model signal that it wants to hand control to external code. The model decides whether to call a function, which function to call, and what arguments to pass. Your application runs the function and optionally returns the result. The model reads that result and decides what to do next.
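A minimal sketch of that round trip against OpenAI's chat completions API. The get_account_tier tool, its schema, and the model name are placeholders for illustration; the tools array and tool_calls shapes follow the documented API.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical tool: in a real system this would query your own backend.
def get_account_tier(customer_id: str) -> str:
    return "enterprise"

tools = [{
    "type": "function",
    "function": {
        "name": "get_account_tier",
        "description": "Look up a customer's account tier.",
        "parameters": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
        },
    },
}]

messages = [{"role": "user", "content": "Should ticket 4812 be escalated?"}]
response = client.chat.completions.create(
    model="gpt-4o",  # substitute a current model ID
    messages=messages,
    tools=tools,
)
msg = response.choices[0].message

# The model may decide to call the tool, or it may answer directly.
if msg.tool_calls:
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)  # not guaranteed valid JSON
    result = get_account_tier(**args)
    messages.append(msg)
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": result,
    })
    # Second round trip: the model reads the result and decides what's next.
    followup = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    )
```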
This is agency. The model is deciding what to do, not just formatting an answer you already know the shape of. That is appropriate when you need a model to choose between multiple tools, decide whether to ask a clarifying question, or sequence a multi-step operation where each step depends on the result of the last.
It is not appropriate when you already know exactly what you want the model to produce and are using function calling purely to get reliable JSON. In that case, you are paying for the agency semantics without getting anything back for the overhead.
Structured outputs: constrained decoding, not a training artefact
OpenAI shipped structured outputs as a distinct mechanism from function calling in mid-2024. Anthropic added native support in early 2026. The mechanism: you pass a JSON schema via a response_format parameter, and the API guarantees the response matches that schema. Not 'usually matches.' Guarantees.
The guarantee holds because constrained decoding is enforced mechanically. At each token, the sampling process considers only tokens that keep the response parseable against the schema. If a field is an enum of ["pending", "approved", "rejected"], the model cannot produce "Approved" or "approved by finance" — those token sequences are masked before sampling occurs. The model has no path to a schema-invalid output.
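In code, the difference from function calling is a single parameter. A minimal sketch using OpenAI's response_format, with a hypothetical invoice schema; the strict json_schema shape follows the documented API.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical extraction schema: the enum is enforced at the token level,
# so "Approved" or "approved by finance" cannot be produced.
schema = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string"},
        "status": {"type": "string",
                   "enum": ["pending", "approved", "rejected"]},
    },
    "required": ["invoice_id", "status"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o",  # substitute a current model ID
    messages=[{"role": "user",
               "content": "Extract: 'INV-2209 was approved on 3 May.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "schema": schema, "strict": True},
    },
)
record = json.loads(response.choices[0].message.content)  # matches schema
```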
For extraction tasks (pulling structured information from a document, classifying input into a fixed category set, converting unstructured notes into a typed record), structured outputs are the correct choice. The model reads input and produces output in a format you specified. There is no external function call, no tool invocation, no agency decision.
The reliability improvement over function calling for pure extraction is measurable. Teams running high-volume extraction pipelines consistently report a 3 to 7 per cent reduction in output validation failures after switching from function calling to structured outputs. At any meaningful scale, that is thousands fewer error-handling cycles per week.
Tool use: the agentic primitive
'Tool use' in Anthropic's API and 'function calling' in OpenAI's are functionally the same concept. The model signals which tool to call and with what arguments, your code executes the tool, and you optionally return the result. The name difference is historical: 'function calling' was OpenAI's original term and leaked into the API surface. OpenAI has moved largely to the 'tools' terminology in recent versions.
Tool use is appropriate when the model needs to make decisions, not just format something it already knows. Specifically, when:
- The model must choose between multiple possible actions based on context
- The action involves an external system: a database read, an API call, a file operation
- The result of the action changes what the model does next
- The correct response might be to ask a clarifying question rather than take immediate action
An agent that looks up a customer record, checks their account tier, and decides whether to escalate a support ticket is using tool use. The model is deciding what to do and in what sequence.
A pipeline that reads a contract and extracts the effective date, parties, and governing law into a typed record is using structured outputs. The model is not deciding what to do; it is doing what you told it to, in a format you specified.
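The customer-record agent above, sketched against Anthropic's tool_use content blocks. The tool and model ID are placeholders; the request and response shapes follow Anthropic's documented API.

```python
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_account_tier",
    "description": "Look up a customer's account tier.",
    "input_schema": {
        "type": "object",
        "properties": {"customer_id": {"type": "string"}},
        "required": ["customer_id"],
    },
}]

messages = [{"role": "user", "content": "Should ticket 4812 be escalated?"}]
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # substitute a current model ID
    max_tokens=1024,
    tools=tools,
    messages=messages,
)

if response.stop_reason == "tool_use":
    block = next(b for b in response.content if b.type == "tool_use")
    result = "enterprise"  # your code runs the actual lookup here
    messages.append({"role": "assistant", "content": response.content})
    messages.append({
        "role": "user",
        "content": [{"type": "tool_result",
                     "tool_use_id": block.id,
                     "content": result}],
    })
    # The model reads the tier and decides whether to escalate.
    followup = client.messages.create(
        model="claude-sonnet-4-20250514", max_tokens=1024,
        tools=tools, messages=messages,
    )
```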
| Mechanism | How it works | Use when | Avoid when |
|---|---|---|---|
| Structured outputs | Constrained decoding; schema enforced at inference time, not by training | Extraction, classification, formatting with a known output shape | The model needs to decide what to do next |
| Tool use / Function calling | Trained behaviour; model signals which external function to invoke | Agentic flows, multi-step decisions, external system interaction | You just need valid JSON and there is no real function to call |
| JSON mode (legacy) | Instruction-based; model told to output JSON, no schema enforcement | Legacy integrations only; not recommended in production | Any extraction or agentic use case at scale |
The production cost of choosing wrong
The most common mistake over the past two years was using function calling for extraction at scale. Teams that built this way typically did not notice in development, because development datasets are clean and unrepresentative of production traffic. The failure mode is distribution shift: documents that look different from training data, unusual formatting, non-English input, tables instead of prose.
Function calling's reliability is a function of what the model saw during fine-tuning. When your input drifts from that distribution, output quality degrades. Structured outputs do not have this property. The constraint is mechanical and applies equally to unusual input.
The second common mistake runs the other direction: using structured outputs for a task where the model genuinely needs agency. The symptom is a model that produces correctly shaped JSON filled with hallucinated values, because it has no way to express uncertainty or request more information. The schema forces a complete, valid response even when the right answer is 'I do not have enough information.'
If your extraction results are valid JSON but the values are wrong, the model is guessing at fields it cannot determine from the input. You likely need tool use with an explicit 'I cannot determine this' path in the schema, or a redesigned flow that allows a clarifying question before committing to a structured response.
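One way to build that path in, sketched as a hypothetical per-field wrapper: each extracted value carries a status the model can set when the input does not support an answer, so a forced-valid response no longer has to guess.

```python
# Hypothetical schema fragment for a single extracted field. The value is
# nullable and the status enum gives the model an explicit way to say
# "not determinable from the input" while still satisfying the schema.
governing_law_schema = {
    "type": "object",
    "properties": {
        "value": {"type": ["string", "null"]},
        "status": {"type": "string",
                   "enum": ["found", "not_in_document", "ambiguous"]},
    },
    "required": ["value", "status"],
    "additionalProperties": False,
}
```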
What the 2026 provider landscape looks like
| Provider | Structured outputs | Tool use | Notes |
|---|---|---|---|
| OpenAI | response_format: json_schema; mature, stable | tools array; generally available, function_call param deprecated | Most ecosystem tooling (LangChain, LlamaIndex) built against OpenAI's API shape |
| Anthropic | Native structured output added early 2026 (via betas parameter) | tool_use content blocks; full multi-turn support | Pre-2026 extraction pipelines often used a single extraction tool as a workaround; migration path now exists |
| Google Gemini | response_schema field; generally available | functionDeclarations; generally available | Ecosystem tooling less tested against Gemini than OpenAI; API surface evolving faster |
If you built extraction pipelines against Anthropic's API before early 2026 using a single extraction tool as a structured-output workaround, the migration to native structured outputs is worth benchmarking on your specific data. Constrained decoding changes error characteristics, particularly for fields with strict enum constraints, in ways that matter for downstream validation.
One question to answer before picking a mechanism
Is the model deciding what to do, or are you deciding and the model is executing?
If the model is deciding what to call, what action to take, or whether to ask a follow-up question, use tool use. The model needs agency: the ability to invoke external code, receive results, and decide what happens next.
If you are deciding what the output should look like (extract these fields, classify this input, convert this text to a typed record), use structured outputs. You know the output shape; the model fills it reliably.
The two compose naturally. An agentic workflow can use tool use at the outer level (the model decides to call an extraction subtask) and structured outputs at the inner level (the extraction itself produces guaranteed-valid JSON). You do not have to choose one mechanism for the whole system.
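A minimal sketch of that composition, with hypothetical names throughout: the outer loop exposes extraction as an ordinary tool, and the tool's implementation is an inner structured-outputs call.

```python
import json
from openai import OpenAI

client = OpenAI()

CONTRACT_SCHEMA = {
    "type": "object",
    "properties": {
        "effective_date": {"type": ["string", "null"]},
        "governing_law": {"type": ["string", "null"]},
    },
    "required": ["effective_date", "governing_law"],
    "additionalProperties": False,
}

def extract_contract(text: str) -> dict:
    # Inner level: structured outputs. No agency here; the shape is fixed
    # and the response is guaranteed to match CONTRACT_SCHEMA.
    response = client.chat.completions.create(
        model="gpt-4o",  # substitute a current model ID
        messages=[{"role": "user", "content": f"Extract terms:\n{text}"}],
        response_format={"type": "json_schema",
                         "json_schema": {"name": "contract",
                                         "schema": CONTRACT_SCHEMA,
                                         "strict": True}},
    )
    return json.loads(response.choices[0].message.content)

# Outer level: the agent decides *whether* extraction is the right step.
# The structured-outputs call above is exposed to it as an ordinary tool.
tools = [{
    "type": "function",
    "function": {
        "name": "extract_contract",
        "description": "Extract structured terms from a contract.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}]
```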
Function calling was the first mechanism that let models interact with external systems in a structured way, and it got used for extraction by teams who needed reliable JSON and had no better option. Structured outputs closed that gap in 2024. In 2026, there is no good reason to use function calling when you just need valid JSON from a model that is not making any decisions. The mechanism that gives you a schema guarantee is available on all major providers. Use it where it fits.