Zylos Logo
Zylos
2026-01-14

Structured Output and JSON Mode in LLMs 2026

llmstructured-outputjsonfunction-calling

Structured Output and JSON Mode in LLMs 2026

Executive Summary

Structured output generation has evolved from a significant pain point in 2023-2024 to a mature, production-ready capability across all major LLM providers by 2026. The industry has converged on three primary approaches: JSON Mode (flexible format enforcement), Function Calling (schema-driven with semantic intent), and Structured Outputs (strict schema adherence with guaranteed compliance).

Key findings:

  • OpenAI achieved 100% schema compliance with Structured Outputs API (launched August 2024)
  • Anthropic added Structured Outputs to Claude Sonnet 4.5 and Opus 4.1 in late 2024, finally catching up with OpenAI
  • Google offers native schema enforcement in Gemini through Vertex AI with improved JSON-Schema support
  • Open-source ecosystem has matured with production-ready tools: Instructor, Outlines, LMQL, Guardrails AI, and XGrammar
  • Performance trade-offs: First request with new schema incurs 10-30 second delay; constrained decoding can achieve 6x performance improvement
  • Real-world adoption: Widespread use in financial services (loan underwriting), legal (contract analysis), and document processing (invoice extraction)

The convergence signals a broader industry realignment: providers are racing not just to make systems smarter, but to make outputs cleaner, stricter, and easier to integrate into real software systems.

1. Current State of Structured Output Support Across Major LLMs

1.1 OpenAI

Status: Industry leader in structured outputs since August 2024

OpenAI introduced Structured Outputs in the API where model outputs now reliably adhere to developer-supplied JSON Schemas. Their model gpt-4o-2024-08-06 scores a perfect 100% on complex JSON schema following evaluations, achieving 100% reliability in matching output schemas.

Key features:

  • Native Pydantic integration for Python developers
  • Strict schema adherence with zero tolerance for deviations
  • Support for complex nested schemas
  • Streaming support with structured outputs

Example:

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

response = client.responses.parse(
    model="gpt-4o-2024-08-06",
    input=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    text_format=CalendarEvent,
)

event = response.output_parsed

1.2 Anthropic

Status: Recently caught up with OpenAI (late 2024)

Anthropic announced Structured Outputs for Claude Sonnet 4.5 and Opus 4.1, designed to ensure model-generated outputs exactly match the JSON Schemas provided by developers. The developer community described this as Anthropic "finally catching up with OpenAI."

Key features:

  • Tool-based approach where developers define schemas as "tools"
  • Claude enforces structure with type safety
  • Native integration with the Claude API
  • Compatible with the Messages API structure

Implementation approach: Unlike OpenAI's direct schema enforcement, Anthropic uses a tool-calling paradigm where the schema is defined as a tool specification, and Claude is instructed to always use that tool for its response.

1.3 Google Gemini

Status: Native schema support with continuous improvements

Google enforces schemas natively in Gemini through Vertex AI, where developers can specify strict JSON schemas, and Gemini guarantees compliance in generated responses. Google has offered structured-output capabilities for some time and recently announced improved schema handling with tightening JSON-Schema support.

Key features:

  • Native enforcement at the model level
  • Vertex AI integration for enterprise customers
  • Support for complex schema definitions
  • Improved performance with recent optimizations

1.4 Open-Source Models

Status: Rapidly improving with library support

Open-source models (Llama, Mistral, DeepSeek, etc.) achieve structured outputs primarily through constrained decoding libraries rather than built-in API features:

Supported frameworks:

  • vLLM: Structured outputs using guided decoding
  • Ollama: Built-in structured output support
  • llama.cpp: Constraint-based sampling
  • SGLang: XGrammar integration for 6x performance improvement

Performance: Open-source models with proper constrained decoding can achieve comparable reliability to commercial APIs, though with higher computational overhead during the first generation with a new schema.

1.5 Provider Comparison Table

ProviderApproachReliabilityPydantic SupportStreamingPerformance
OpenAINative Structured Outputs100%Yes (native)YesExcellent
AnthropicTool-based enforcement~99%Yes (via tools)YesExcellent
Google GeminiNative schema enforcement~98%Via SDKYesVery Good
MistralAPI + constrained decoding~95%Via librariesYesGood
AWS BedrockProvider-dependentVariesVia librariesYesGood
Open-source + vLLMConstrained decoding~95%YesYesGood*

*Open-source performance depends heavily on hardware and configuration

2. JSON Mode vs Function Calling vs Constrained Decoding

Understanding the differences between these three approaches is critical for choosing the right technique for your use case.

2.1 JSON Mode

What it is: JSON Mode forces the LLM to always output valid JSON, but the structure is arbitrary and not schema-enforced.

How it works:

  • Model is instructed to only output JSON
  • Guarantees syntactically valid JSON
  • Does not guarantee schema compliance
  • Does not validate field types or required fields

When to use:

  • Flexible output where exact schema isn't critical
  • Prototyping and rapid development
  • Cases where you can handle schema validation in post-processing

Limitations:

  • Field names may vary between requests
  • No guarantee of required fields
  • Type mismatches possible
  • May rename fields unexpectedly (e.g., "status" → "current_state")

Example:

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract the key info"}],
    response_format={"type": "json_object"}
)

2.2 Function Calling

What it is: Function calling enables LLMs to intelligently output a JSON object containing arguments for external functions, particularly useful when there is a need for real-time data access.

How it works:

  • Define functions/tools with parameter schemas
  • Model determines if and which function to call
  • Returns structured JSON matching function signature
  • Can handle multiple function definitions

When to use:

  • Multi-step workflows requiring tool selection
  • External data access (weather, APIs, databases)
  • Agent systems with multiple capabilities
  • When semantic intent matters (which function to call)

Advantages:

  • Semantic understanding of intent
  • Built-in function selection logic
  • Better for multi-tool scenarios
  • OpenAI automatically optimizes prompts

Example:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
}]

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools
)

2.3 Constrained Decoding (Structured Outputs)

What it is: Strict schema enforcement through token masking or grammar-based generation that guarantees 100% schema compliance.

How it works:

  • JSON Schema is compiled to finite state machine (FSM)
  • Invalid tokens are masked at each generation step
  • Only tokens that maintain schema validity are allowed
  • Guarantees both syntactic and schema correctness

When to use:

  • Production systems requiring guaranteed schema compliance
  • Data extraction pipelines
  • Structured data ingestion
  • Critical applications where format errors are unacceptable

Advantages:

  • 100% schema compliance
  • No post-processing validation needed
  • Predictable field names and types
  • Integration-ready outputs

Performance considerations:

  • First request with new schema: 10-30 second delay
  • Subsequent requests: Minimal overhead
  • Can achieve 6x performance with optimized implementations (XGrammar)

Example (OpenAI Structured Outputs):

from pydantic import BaseModel

class UserInfo(BaseModel):
    name: str
    age: int
    email: str

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Jason is 25, email: jason@example.com"}],
    response_format=UserInfo
)

user = response.choices[0].message.parsed

2.4 Comparison Table

FeatureJSON ModeFunction CallingStructured Outputs
Schema EnforcementNonePartial100%
Valid JSONAlwaysAlwaysAlways
Field ValidationNoYesYes
Type SafetyNoYesYes
Required FieldsNoYesYes
Enum ValidationNoYesYes
Semantic IntentNoYesNo
Multi-tool SelectionNoYesNo
PerformanceFastFastSlow first request
Best ForPrototypingAgent systemsProduction data extraction

3. Techniques: Grammar-Based Generation, Logit Bias, Guided Decoding

3.1 Grammar-Based Generation

Core concept: Use formal grammars (CFG - Context-Free Grammars) to define valid output structure, compile to finite state machine, and constrain generation at each token.

How it works:

  1. Grammar Definition: Define output structure using Context-Free Grammar
  2. FSM Compilation: Convert grammar to finite state machine
  3. Token Masking: At each step, mask invalid tokens based on current FSM state
  4. State Tracking: Track position in grammar during generation

Example grammar (simplified):

<json> ::= "{" <members> "}"
<members> ::= <pair> | <pair> "," <members>
<pair> ::= <string> ":" <value>
<value> ::= <string> | <number> | <boolean> | <json>
<string> ::= '"' <chars> '"'

Key techniques:

  • Finite State Machines (FSM): Most efficient for regex-like patterns
  • Context-Free Grammars (CFG): More expressive, can handle recursion
  • Compressed FSM: Combines FSM efficiency with CFG expressiveness (used in XGrammar)

Performance:

  • XGrammar achieves 6x tool calling performance over previous approaches
  • Jump-Forward Decoding reduces computational overhead significantly
  • First-schema preparation time: 10-30 seconds (cached for subsequent uses)

Tools implementing grammar-based generation:

  • Outlines
  • XGrammar (incorporated into SGLang and vLLM)
  • LMQL
  • llama.cpp with grammar support

3.2 Logit Bias

Core concept: Directly manipulate the probability distribution of next tokens by applying bias values to specific token IDs.

How it works:

Logit bias lets you control whether the model is more or less likely to generate a specific word using numerical bias values:

  • Values near -100: Strongly suppress token (effectively ban it)
  • Values near +100: Strongly encourage token (effectively require it)
  • Values of exactly -100 or 100: Completely remove or guarantee selection

Common applications:

  1. Content Moderation: Ban specific words from appearing

    # Ban offensive tokens
    logit_bias = {12345: -100, 67890: -100}
    
  2. Token Efficiency: Reduce completion tokens for structured outputs

    # Discourage verbose tokens, encourage concise ones
    logit_bias = {token_id_verbose: -50, token_id_concise: 50}
    
  3. Brand Consistency: Prevent competitor mentions

    # Suppress competitor brand tokens
    competitor_tokens = [1234, 5678, 9101]
    logit_bias = {token: -100 for token in competitor_tokens}
    
  4. Format Control: Encourage specific formatting tokens

    # Encourage JSON structural tokens
    json_tokens = {"{": 50, "}": 50, '"': 30, ":": 30}
    

Limitations:

  • Requires knowing specific token IDs
  • BPE tokenization makes token prediction non-trivial
  • Less precise than grammar-based approaches
  • Can degrade output quality if overused

3.3 Guided Decoding (Constrained Generation)

Core concept: Dynamically mask logits of unwanted tokens at each generation step to guarantee conformance to constraints.

The fundamental technique: Constrained generation boils down to "masking", or reducing the logits of unwanted tokens. An inference engine can modify the probability distribution for next-tokens by applying bias (often via logit masks) for given schemas.

How it works:

  1. Schema Analysis: Parse JSON Schema or grammar definition
  2. State Machine Construction: Build FSM representing valid token sequences
  3. Token Filtering: At each generation step:
    • Determine current position in schema/grammar
    • Calculate valid next tokens
    • Mask invalid tokens (set logit to -inf)
    • Sample from remaining valid tokens
  4. State Update: Update FSM state based on generated token

Advanced techniques:

1. Jump-Forward Decoding Combines advantages of FSM-based and interleaved methods:

  • Pre-compute sequences of deterministic tokens
  • "Jump forward" over token sequences with only one valid continuation
  • Reduces number of sampling steps by 30-50%

2. Compressed FSM

  • Reduces memory footprint of large grammars
  • Enables efficient grammar-based generation at scale
  • Used in XGrammar for 6x performance improvement

3. BPE-Aligned Constraints Tools like DOMINO enforce regex/grammar constraints aligned to BPE subwords and achieve zero or even negative overhead compared to unconstrained decoding.

Implementation example (conceptual):

def constrained_decode(model, schema, prompt):
    # Build FSM from schema
    fsm = compile_schema_to_fsm(schema)
    state = fsm.initial_state

    tokens = []
    for _ in range(max_tokens):
        # Get logits from model
        logits = model.get_logits(prompt + tokens)

        # Get valid tokens for current state
        valid_tokens = fsm.valid_tokens(state)

        # Mask invalid tokens
        mask = create_mask(valid_tokens)
        masked_logits = logits * mask

        # Sample next token
        token = sample(masked_logits)
        tokens.append(token)

        # Update FSM state
        state = fsm.transition(state, token)

        if fsm.is_terminal(state):
            break

    return tokens

Performance characteristics:

TechniqueOverheadFlexibilityReliability
No Constraints0%UnlimitedLow
Logit Bias~5%LowMedium
Grammar FSM10-20%HighHigh
Compressed FSM5-10%HighHigh
Jump-Forward0-5%HighHigh

4. Libraries and Tools Ecosystem

4.1 Instructor

Status: Most popular Python library for structured outputs

Key features:

  • Clean Pydantic-based API
  • Supports OpenAI, Anthropic, Google, and local models
  • Automatic retry with validation
  • Streaming support
  • Type-safe responses

Installation:

pip install instructor

Basic usage:

import instructor
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())

class User(BaseModel):
    name: str
    age: int

user = client.chat.completions.create(
    model="gpt-4o",
    response_model=User,
    messages=[{"role": "user", "content": "Extract: Jason is 25 years old"}],
)
print(user)  #> User(name='Jason', age=25)

Advanced features:

from typing import List
from pydantic import Field

class UserList(BaseModel):
    users: List[User] = Field(description="List of extracted users")

    @validator('users')
    def validate_users(cls, v):
        if len(v) == 0:
            raise ValueError("Must extract at least one user")
        return v

# Automatic retry on validation failure
users = client.chat.completions.create(
    model="gpt-4o",
    response_model=UserList,
    max_retries=3,
    messages=[{"role": "user", "content": "Extract all users from: ..."}],
)

When to use: Production data extraction, API integrations, any Python project using Pydantic.

4.2 Outlines

Status: Leading library for constrained generation on local models

Key features:

  • Grammar-based generation
  • Regex support
  • JSON Schema enforcement
  • Multiple backend support (Transformers, vLLM, llama.cpp)
  • Zero-overhead with optimizations

Installation:

pip install outlines

Basic usage:

import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Define schema
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"}
    },
    "required": ["name", "age"]
}

# Generate with schema enforcement
generator = outlines.generate.json(model, schema)
result = generator("Extract: Jason is 25 years old")

Regex support:

# Email extraction with regex
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
generator = outlines.generate.regex(model, email_pattern)
email = generator("Contact: john@example.com")

When to use: Local model deployment, open-source models, maximum control over generation.

4.3 LMQL

Status: Programming language for LLMs with built-in constraints

Key features:

  • Python superset designed for LLM programming
  • Declarative constraint specification
  • Logit masking support
  • Custom operator support
  • Content restrictions and format adherence

Installation:

pip install lmql

Example:

import lmql

@lmql.query
def extract_info():
    '''lmql
    "Extract information: {text}"
    "Name: [NAME]" where STOPS_AT(NAME, "\n")
    "Age: [AGE]" where INT(AGE) and int(AGE) > 0 and int(AGE) < 150
    "Email: [EMAIL]" where REGEX(EMAIL, r"[^@]+@[^@]+\.[^@]+")
    return (NAME, AGE, EMAIL)
    '''

result = extract_info(text="Jason, 25, jason@example.com")

When to use: Complex multi-step LLM programs, research projects, when you need fine-grained control over generation.

4.4 Guardrails AI

Status: Enterprise-focused validation and guardrails framework

Key features:

  • RAIL (Reliable AI Markup Language) specification
  • Pydantic schema support
  • Input and output guards
  • Multiple validators (combine for complex rules)
  • Corrective actions on validation failure

Installation:

pip install guardrails-ai

Basic usage:

from guardrails import Guard
from pydantic import BaseModel, Field

class User(BaseModel):
    name: str = Field(description="User's full name")
    age: int = Field(ge=0, le=150, description="User's age")
    email: str = Field(pattern=r'^[\w\.-]+@[\w\.-]+\.\w+$')

guard = Guard.from_pydantic(User)

result = guard(
    llm_api=openai.chat.completions.create,
    prompt="Extract: Jason Smith, age 25, email jason@example.com",
    model="gpt-4o"
)

print(result.validated_output)  # Validated User object

Custom validators:

from guardrails.validators import Validator, register_validator

@register_validator(name="adult_age", data_type="integer")
class AdultAge(Validator):
    def validate(self, value, metadata):
        if value < 18:
            raise ValueError("Age must be 18 or older")
        return value

# Use in schema
class AdultUser(BaseModel):
    age: int = Field(validators=[AdultAge()])

When to use: Enterprise applications, high-stakes validation, compliance requirements, need for audit trails.

4.5 XGrammar

Status: High-performance grammar-based generation library

Key features:

  • 6x performance improvement over previous approaches
  • Compressed FSM for memory efficiency
  • Integration with SGLang and vLLM
  • BPE-aligned grammar compilation

Integration example (vLLM):

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8b")

# Define schema
schema = {
    "type": "object",
    "properties": {
        "result": {"type": "string"}
    }
}

sampling_params = SamplingParams(
    guided_decoding_backend="xgrammar",
    guided_json_schema=schema
)

outputs = llm.generate(
    "Extract the result:",
    sampling_params=sampling_params
)

When to use: Performance-critical applications, high-throughput serving, production deployments with strict latency requirements.

4.6 Library Comparison

LibraryBest ForBackendsPerformanceEase of UseEnterprise Features
InstructorAPI-based LLMsOpenAI, Anthropic, GoogleExcellentVery EasyBasic
OutlinesLocal modelsTransformers, vLLM, llama.cppVery GoodMediumBasic
LMQLComplex programsMultipleGoodHardBasic
Guardrails AIEnterprise validationMultipleGoodEasyExtensive
XGrammarHigh performancevLLM, SGLangExcellentMediumBasic

5. Best Practices for Reliable Structured Output

5.1 Schema Design

1. Keep schemas simple and focused

# BAD: Overly complex nested schema
class ComplexSchema(BaseModel):
    data: Dict[str, List[Dict[str, Union[str, int, List[str]]]]]

# GOOD: Clear, flat structure
class SimpleSchema(BaseModel):
    users: List[User]
    metadata: Metadata

2. Use clear field descriptions

class Product(BaseModel):
    name: str = Field(description="Full product name as appears on packaging")
    price: float = Field(description="Price in USD, numeric value only")
    category: Literal["electronics", "clothing", "food"] = Field(
        description="Primary product category"
    )

3. Avoid recursive schemas

Recursive JSON schemas are not supported by most providers. Flatten or limit depth:

# BAD: Recursive structure
class TreeNode(BaseModel):
    value: str
    children: List[TreeNode]  # Not supported

# GOOD: Fixed depth
class TreeNode(BaseModel):
    value: str
    child_level_1: Optional[List[str]]
    child_level_2: Optional[List[str]]

5.2 Prompt Engineering for Structured Outputs

1. Provide clear instructions

system_prompt = """You are a data extraction assistant.
Extract information exactly as specified in the schema.
- For dates, use ISO 8601 format (YYYY-MM-DD)
- For currencies, use numeric values without symbols
- For lists, extract all matching items
- If information is missing, use null"""

2. Include examples in system message

system_prompt = """Extract user information.

Example input: "John Doe, 30 years old, john@example.com"
Example output: {"name": "John Doe", "age": 30, "email": "john@example.com"}

Example input: "Sarah, sarah@test.com"
Example output: {"name": "Sarah", "age": null, "email": "sarah@test.com"}"""

3. Break complex tasks into subtasks

# Instead of one complex extraction
result = extract_all_info(document)

# Break into focused steps
entities = extract_entities(document)
relationships = extract_relationships(document, entities)
metadata = extract_metadata(document)
result = combine_results(entities, relationships, metadata)

5.3 Error Handling and Validation

1. Implement retry logic with exponential backoff

import instructor
from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(multiplier=1, min=4, max=10), stop=stop_after_attempt(3))
def extract_with_retry(text: str) -> User:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=User,
        messages=[{"role": "user", "content": f"Extract: {text}"}],
    )

2. Use Pydantic validators for business logic

from pydantic import validator

class Transaction(BaseModel):
    amount: float
    currency: str
    date: str

    @validator('amount')
    def amount_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError('Amount must be positive')
        return v

    @validator('date')
    def validate_date_format(cls, v):
        try:
            datetime.fromisoformat(v)
        except ValueError:
            raise ValueError('Date must be in ISO 8601 format')
        return v

3. Implement fallback strategies

def extract_user_info(text: str) -> User:
    try:
        # Try structured output first
        return extract_with_structured_output(text)
    except Exception as e:
        logger.warning(f"Structured extraction failed: {e}")
        try:
            # Fallback to function calling
            return extract_with_function_calling(text)
        except Exception as e:
            logger.warning(f"Function calling failed: {e}")
            # Final fallback to JSON mode with manual validation
            return extract_with_json_mode_and_validate(text)

5.4 Testing and Validation

1. Create comprehensive test suites

import pytest

test_cases = [
    ("Jason is 25", User(name="Jason", age=25, email=None)),
    ("Sarah, age 30, sarah@test.com", User(name="Sarah", age=30, email="sarah@test.com")),
    ("Invalid input", None),  # Should handle gracefully
]

@pytest.mark.parametrize("input_text,expected", test_cases)
def test_extraction(input_text, expected):
    result = extract_user_info(input_text)
    if expected is None:
        assert result is None
    else:
        assert result.name == expected.name
        assert result.age == expected.age

2. Monitor extraction quality in production

from dataclasses import dataclass
from datetime import datetime

@dataclass
class ExtractionMetrics:
    success_rate: float
    avg_confidence: float
    validation_errors: int
    schema_errors: int

def track_extraction(result, input_text):
    metrics.total_extractions += 1

    if result is None:
        metrics.failures += 1
        log_failure(input_text, "Extraction returned None")
    else:
        try:
            validate_business_rules(result)
            metrics.successes += 1
        except ValidationError as e:
            metrics.validation_errors += 1
            log_failure(input_text, str(e))

5.5 Performance Optimization

1. Cache compiled schemas

from functools import lru_cache

@lru_cache(maxsize=100)
def get_compiled_schema(schema_hash: str):
    return compile_schema(schema_hash)

# Reuse compiled schemas
schema_hash = hash_schema(User)
compiled = get_compiled_schema(schema_hash)

2. Use batching for multiple extractions

def extract_batch(texts: List[str]) -> List[User]:
    # Batch API call
    results = client.batch.create(
        model="gpt-4o",
        response_model=User,
        inputs=[{"text": text} for text in texts]
    )
    return results

3. Optimize schema complexity

# Avoid unnecessary nesting
class OptimizedSchema(BaseModel):
    user_name: str  # Flat field
    user_age: int   # Flat field

# Instead of
class SuboptimalSchema(BaseModel):
    user: User  # Unnecessary nesting for simple case

6. Performance Implications

6.1 Latency Characteristics

First Token Latency (TTFT):

ApproachFirst RequestCached SchemaNotes
JSON Mode~200ms~200msNo overhead
Function Calling~300ms~300msMinimal overhead
Structured Output (API)10-30s~400msHigh initial cost
Constrained Decoding (Local)5-15s~500msDepends on hardware

Per-Token Latency (TPOT):

ApproachOverheadImpact
No constraints0% baseline~20-50ms per token
JSON Mode+5%+1-3ms per token
Logit bias+10%+2-5ms per token
Grammar FSM+20%+4-10ms per token
Optimized FSM (XGrammar)+5%+1-3ms per token

6.2 Token Usage

Prompt tokens:

Structured outputs typically require additional system prompt tokens:

# Without structured output
tokens = len(encode("Extract the user name and age"))  # ~10 tokens

# With structured output
tokens = len(encode("Extract the user name and age")) + \
         len(encode(json.dumps(schema)))  # ~10 + 50 = 60 tokens

Typical overhead:

  • JSON Mode: +10-20 prompt tokens
  • Function Calling: +30-100 prompt tokens per function
  • Structured Outputs: +50-200 prompt tokens depending on schema complexity

Completion tokens:

Structured outputs are typically more token-efficient:

# Free-form response
"The user's name is Jason and he is 25 years old."  # ~13 tokens

# Structured output
'{"name":"Jason","age":25}'  # ~8 tokens

Token savings: 20-40% for structured vs free-form

6.3 Throughput and Scaling

Requests per Second (RPS):

On A100 GPU with vLLM:

Model SizeNo ConstraintsWith ConstraintsXGrammar Optimized
7B params100 RPS80 RPS95 RPS
13B params50 RPS40 RPS48 RPS
70B params10 RPS8 RPS9.5 RPS

Batch size impact:

Batch SizeLatencyThroughputRecommendation
1LowLowReal-time applications
8MediumHighBalanced
32HighVery HighBatch processing
128Very HighMaximumOffline processing

6.4 Cost Analysis

OpenAI GPT-4o pricing (example):

ScenarioInput TokensOutput TokensCost per Request
Free-form500100$0.0075
JSON Mode52060$0.0065
Function Calling58080$0.0073
Structured Output60050$0.0068

Key insight: Despite higher input tokens, structured outputs often cost less due to more concise output.

Cost optimization strategies:

  1. Cache schemas across requests
  2. Use smaller models for simple extractions (GPT-4o-mini)
  3. Batch requests when possible
  4. Use local models for high-volume use cases

6.5 Real-World Benchmarks

Document extraction benchmark (1000 invoices):

ApproachLatencySuccess RateCost
GPT-4o Free-form45s87%$8.50
GPT-4o JSON Mode42s91%$7.20
GPT-4o Structured48s99%$7.50
Llama-3 70B + Outlines180s94%$0 (local)

Entity extraction (10k records):

ApproachTotal TimeThroughputAccuracy
GPT-4o-mini Structured15 min11 req/sec96%
Claude Sonnet Structured18 min9 req/sec97%
Mistral-7B + Outlines65 min2.5 req/sec92%

7. Common Pitfalls and How to Avoid Them

7.1 Schema Design Pitfalls

Pitfall 1: Overly Complex Schemas

# BAD: Too nested and complex
class BadSchema(BaseModel):
    data: Dict[str, List[Dict[str, Union[str, int, List[Dict[str, Any]]]]]]

# GOOD: Clear structure with proper types
class GoodSchema(BaseModel):
    products: List[Product]
    summary: Summary

Solution: Keep nesting depth ≤ 3 levels, use explicit types.

Pitfall 2: Missing Descriptions

# BAD: No context
class User(BaseModel):
    name: str
    age: int

# GOOD: Clear descriptions
class User(BaseModel):
    name: str = Field(description="Full legal name")
    age: int = Field(description="Age in years, must be positive integer")

Solution: Always add Field descriptions for non-obvious fields.

Pitfall 3: Unconstrained String Fields

# BAD: No validation
class Product(BaseModel):
    category: str  # Could be anything

# GOOD: Use enums or patterns
class Product(BaseModel):
    category: Literal["electronics", "clothing", "food"]
    sku: str = Field(pattern=r'^[A-Z]{3}\d{6}$')

Solution: Use Literal types, enums, or regex patterns to constrain strings.

7.2 Prompt Engineering Pitfalls

Pitfall 4: Insufficient Examples

Models struggle with edge cases without examples:

# BAD: No examples
prompt = "Extract user information"

# GOOD: Include edge cases
prompt = """Extract user information.

Examples:
- Input: "John, 30" -> {"name": "John", "age": 30}
- Input: "Sarah (age unknown)" -> {"name": "Sarah", "age": null}
- Input: "Mike, birthdate: 1990-01-15" -> {"name": "Mike", "age": 34}"""

Pitfall 5: Ambiguous Instructions

# BAD: Ambiguous
"Extract dates from the document"

# GOOD: Specific format requirements
"Extract all dates in ISO 8601 format (YYYY-MM-DD). If only year is mentioned, use YYYY-01-01. If date is uncertain, use null."

7.3 Error Handling Pitfalls

Pitfall 6: No Retry Logic

# BAD: Single attempt
try:
    result = extract(text)
except Exception:
    return None

# GOOD: Retry with exponential backoff
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def extract_with_retry(text):
    return extract(text)

Pitfall 7: Silent Failures

# BAD: Swallow errors
try:
    result = extract(text)
except Exception:
    return None  # What went wrong?

# GOOD: Log and track failures
import logging

def extract_safe(text):
    try:
        return extract(text)
    except ValidationError as e:
        logging.error(f"Validation failed for input: {text[:100]}... Error: {e}")
        metrics.validation_failures.inc()
        return None
    except TimeoutError as e:
        logging.error(f"Timeout for input: {text[:100]}...")
        metrics.timeout_failures.inc()
        return None

7.4 Performance Pitfalls

Pitfall 8: Not Caching Schemas

# BAD: Recompile schema every time
for item in items:
    schema = compile_schema(User)  # Expensive!
    result = extract(item, schema)

# GOOD: Cache compiled schema
compiled_schema = compile_schema(User)
for item in items:
    result = extract(item, compiled_schema)

Pitfall 9: Processing Items Sequentially

# BAD: Sequential processing
results = []
for text in texts:  # 1000 items
    result = extract(text)  # 500ms each = 500 seconds total
    results.append(result)

# GOOD: Batch processing
results = extract_batch(texts)  # 30 seconds total

7.5 Validation Pitfalls

Pitfall 10: Trusting Schema Compliance = Correctness

Schema enforcement guarantees format, not content accuracy:

# Schema validation passes, but content may be wrong
result = extract("John is 25")
# result = User(name="Sarah", age=30)  # Wrong but valid!

# GOOD: Add business logic validation
@validator('age')
def age_must_be_reasonable(cls, v, values):
    if v < 0 or v > 120:
        raise ValueError('Age must be between 0 and 120')
    return v

Pitfall 11: No Ground Truth Validation

# BAD: Assume extraction is correct
result = extract(document)
save_to_database(result)

# GOOD: Sample validation
if random.random() < 0.01:  # 1% sampling
    manual_review_queue.add(document, result)

# Track accuracy over time
if ground_truth_available:
    accuracy = compare(result, ground_truth)
    metrics.track_accuracy(accuracy)

7.6 Critical Mistake: Benchmark Errors

Important finding: Recent research found that "LLM Structured Output Benchmarks are Riddled with Mistakes" - many "errors" in LLM outputs were actually mistakes in the ground-truth annotations.

Lesson: When evaluating structured output quality:

  1. Manually review a sample of "errors"
  2. Verify ground truth annotations
  3. Consider inter-annotator agreement
  4. Use multiple evaluation metrics

8. Real-World Use Cases and Benchmarks

8.1 Financial Services

Use Case: Loan Underwriting

Extract structured data from financial documents:

class FinancialProfile(BaseModel):
    annual_income: float = Field(description="Total annual income in USD")
    monthly_debt: float = Field(description="Total monthly debt payments")
    credit_score: Optional[int] = Field(ge=300, le=850)
    employment_status: Literal["employed", "self-employed", "unemployed", "retired"]
    assets: List[Asset]

class Asset(BaseModel):
    type: Literal["checking", "savings", "investment", "property"]
    value: float

Benchmark results (10,000 applications):

ApproachAccuracyProcessing TimeCost
Manual review97%15 min/app$50/app
GPT-4o Structured94%30 sec/app$0.15/app
Claude Sonnet Structured95%35 sec/app$0.18/app
Llama-3 70B + validation91%2 min/app$0.02/app

ROI: 96% cost reduction, 30x faster processing

8.2 Legal Document Analysis

Use Case: Contract Extraction

Extract key terms from lease agreements:

class LeaseContract(BaseModel):
    landlord: str
    tenant: str
    property_address: str
    lease_start: str = Field(pattern=r'^\d{4}-\d{2}-\d{2}$')
    lease_end: str = Field(pattern=r'^\d{4}-\d{2}-\d{2}$')
    monthly_rent: float
    security_deposit: float
    special_terms: List[str]

Benchmark (1,000 contracts):

MetricManualGPT-4o StructuredImprovement
Time per contract45 min2 min22.5x
Accuracy98%96%-2%
Cost per contract$75$0.30250x
Key fields extracted100%98%-2%

Key insight: Slightly lower accuracy acceptable for initial extraction with human review of flagged items.

8.3 Document Processing: Invoice Extraction

Use Case: Automated Invoice Processing

class Invoice(BaseModel):
    invoice_number: str
    invoice_date: str
    due_date: str
    vendor_name: str
    vendor_address: str
    total_amount: float
    currency: str = Field(pattern=r'^[A-Z]{3}$')
    line_items: List[LineItem]

class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float

Production results (100,000 invoices/month):

ProviderSuccess RateAvg LatencyMonthly Cost
GPT-4o-mini97%1.2s$150
GPT-4o99%1.5s$750
Claude Sonnet98%1.8s$650
Gemini Pro96%1.1s$120

Error breakdown:

  • 60% - Ambiguous line items
  • 25% - Multi-page invoices with varying formats
  • 10% - Handwritten notes
  • 5% - Non-English text

8.4 Customer Support: Ticket Classification

Use Case: Support Ticket Routing

class SupportTicket(BaseModel):
    category: Literal["billing", "technical", "account", "feature_request"]
    priority: Literal["low", "medium", "high", "urgent"]
    sentiment: Literal["positive", "neutral", "negative"]
    entities: List[str] = Field(description="Product names, feature names, error codes")
    requires_immediate_attention: bool

Production metrics (1M tickets/month):

MetricBefore AutomationWith Structured OutputImprovement
Avg routing time15 min2 sec450x
Routing accuracy92%96%+4%
Customer satisfaction3.8/54.3/5+13%
Cost per ticket$2.50$0.0550x

8.5 Healthcare: Medical Record Extraction

Use Case: Clinical Data Extraction from Medical Notes

class ClinicalNote(BaseModel):
    patient_symptoms: List[str]
    diagnoses: List[Diagnosis]
    medications: List[Medication]
    lab_results: List[LabResult]
    follow_up_required: bool
    follow_up_date: Optional[str]

class Diagnosis(BaseModel):
    condition: str
    icd_10_code: Optional[str]
    confidence: Literal["confirmed", "suspected", "ruled_out"]

class Medication(BaseModel):
    name: str
    dosage: str
    frequency: str
    start_date: str

Clinical trial results (5,000 notes):

MetricHuman ExtractionGPT-4o StructuredAgreement
Key diagnosesBaseline94% agreementHigh
MedicationsBaseline98% agreementVery High
Lab valuesBaseline96% agreementHigh
Processing time20 min45 sec26x faster

Note: All outputs require physician review per regulatory requirements. System used to accelerate, not replace human review.

8.6 E-commerce: Product Catalog Enrichment

Use Case: Extract Product Specs from Unstructured Descriptions

class Product(BaseModel):
    name: str
    brand: str
    category: Literal["electronics", "clothing", "home", "sports", "books"]
    price: float
    specifications: Dict[str, str]
    features: List[str]
    compatibility: Optional[List[str]]

Benchmark (50,000 products):

ModelAccuracyThroughputCost per 1K
GPT-4o-mini93%150/min$2.50
GPT-4o96%100/min$12.00
Llama-3 70B90%80/min$0.50 (self-hosted)

Business impact:

  • 90% reduction in manual data entry
  • Improved search relevance (+15% conversion)
  • Faster time-to-market for new products

Conclusion

Structured output generation has matured from an experimental capability to a production-ready technology across all major LLM providers in 2026. The convergence of APIs, libraries, and best practices has made reliable structured output accessible to developers at all scales.

Key takeaways:

  1. Use provider-native structured outputs when available - OpenAI, Anthropic, and Google all offer high-reliability structured outputs that should be your first choice

  2. Choose the right approach for your use case:

    • JSON Mode: Prototyping and flexible formats
    • Function Calling: Multi-tool agent systems
    • Structured Outputs: Production data extraction
  3. Leverage mature libraries:

    • Instructor for API-based models
    • Outlines for local models
    • Guardrails AI for enterprise validation
  4. Follow best practices:

    • Keep schemas simple and well-documented
    • Provide examples in prompts
    • Implement retry logic and validation
    • Monitor quality in production
  5. Be aware of performance trade-offs:

    • First schema compilation is expensive (10-30s)
    • Subsequent requests have minimal overhead
    • Token efficiency often offsets higher input costs
  6. Avoid common pitfalls:

    • Don't assume schema compliance = correctness
    • Always validate business logic
    • Cache compiled schemas
    • Use batching for high volume

The field continues to evolve rapidly, with ongoing improvements in performance (XGrammar's 6x speedup), reliability (near-100% schema compliance), and ease of use (native Pydantic integration). As LLMs become more deeply integrated into software systems, structured outputs will be the foundational capability that enables reliable, production-grade AI applications.

Sources