2026-01-08
LLM Structured Output & Tool Use Patterns 2025
research
Learned: 2026-01-08 Topic: Tool Use, Function Calling, Structured Output
Key Insights
- Programmatic Tool Calling (Anthropic) - 37% token reduction, 19+ fewer inference passes
- Constrained Decoding is production-ready - Outlines, XGrammar, llguidance
- Hybrid Agentic Patterns - Combine ReAct + Planning + Reflection for production
- Output tokens cost ~4x input tokens - cutting output length is the highest-impact latency lever
Structured Output Methods
| Method | Use Case | Notes |
|---|---|---|
| JSON Mode | Basic | No schema guarantee |
| Structured Outputs | Production | Schema-guaranteed via logit biasing |
| Constrained Decoding | Self-hosted | FSM/grammar-based enforcement |
Constrained Decoding Libraries:
- Outlines: FSM-based constrained generation; reported 97% success rate, 0.4% hallucination rate (sketch after this list)
- XGrammar: pushdown automata for complex, recursive grammars
- llguidance: very fast mask computation (~50 μs of CPU per token)
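A minimal Outlines sketch, assuming its pre-1.0 `generate.json` API; the model name and schema are placeholders:
```python
from pydantic import BaseModel
import outlines

class Invoice(BaseModel):  # illustrative schema
    vendor: str
    total_usd: float

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")  # any HF model
generator = outlines.generate.json(model, Invoice)

# The FSM masks invalid tokens at each decoding step, so the output is guaranteed to parse.
invoice = generator("Extract the invoice: ACME Corp, total $1203.50")
print(invoice.vendor, invoice.total_usd)
```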
Schema Design Best Practices
OpenAI Strict Mode:
`strict: true` sits on the tool/response_format definition alongside the schema; inside the schema, every property must appear in `required` and `additionalProperties` must be `false`:
```json
{
  "name": "my_schema",
  "strict": true,
  "schema": {
    "type": "object",
    "required": ["all", "fields"],
    "additionalProperties": false
  }
}
```
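A hedged example of requesting a schema-guaranteed response through the OpenAI Python SDK (model name and fields are placeholders):
```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any model with Structured Outputs support
    messages=[{"role": "user", "content": "Extract: ACME Corp invoice, total $1203.50"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice",
            "strict": True,  # turns on schema-guaranteed decoding
            "schema": {
                "type": "object",
                "properties": {
                    "vendor": {"type": "string"},
                    "total_usd": {"type": "number"},
                },
                "required": ["vendor", "total_usd"],  # strict mode: every field listed
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)  # JSON string conforming to the schema
```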
General Principles:
- Use Pydantic for schema generation + validation (sketch after this list)
- Clear, descriptive tool names
- Comprehensive type information
- Modular, reusable components
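A minimal Pydantic illustration of schema generation + validation (names are illustrative):
```python
from pydantic import BaseModel, Field

class GetWeather(BaseModel):
    """Get the current weather for a city."""  # doubles as the tool description
    city: str = Field(description="City name, e.g. 'Berlin'")
    unit: str = Field(default="celsius", description="'celsius' or 'fahrenheit'")

# Generation: JSON Schema to register as the tool/function definition.
tool_schema = GetWeather.model_json_schema()

# Validation: parse and type-check the model's raw arguments in one step.
args = GetWeather.model_validate_json('{"city": "Berlin"}')
```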
Multi-Step Tool Use Patterns
Core Orchestration Patterns
| Pattern | Strength | Cost |
|---|---|---|
| ReAct | Fast, flexible | Lower deliberation |
| Planning | Structured multi-step | Upfront planning overhead |
| Reflection | Self-critique quality | Higher deliberation |
Production: combine all three - plan upfront, execute in a ReAct loop, reflect on intermediate results (loop sketch below).
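A schematic of that hybrid, where `llm_step` and the `tools` registry are hypothetical stand-ins for your model call and tool dispatch:
```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str | None = None    # tool to call next, or None when finished
    args: dict | None = None
    answer: str | None = None  # plan text, critique text, or final answer

def hybrid_loop(task: str, tools: dict, llm_step, max_steps: int = 8) -> str:
    plan = llm_step(f"Plan the steps for: {task}")              # Planning upfront
    history = [f"task: {task}", f"plan: {plan.answer}"]
    for _ in range(max_steps):
        step = llm_step("\n".join(history))                     # ReAct: reason, then act
        if step.tool is None:
            return step.answer                                  # done
        result = tools[step.tool](**step.args)
        critique = llm_step(f"Critique this result: {result}")  # Reflection pass
        history += [f"observation: {result}", f"critique: {critique.answer}"]
    return "max steps exceeded"
```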
Parallel Execution
- LLMCompiler: auto-identifies parallel vs sequential tasks (generic fan-out sketch below)
- M1-Parallel: 2.2x speedup with concurrent agents
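A generic fan-out sketch (not LLMCompiler's actual API): once a planner marks tool calls as independent, run them concurrently:
```python
import asyncio

async def call_tool(name: str, args: dict) -> dict:
    # Hypothetical dispatch; replace with real tool I/O (HTTP, DB, search, ...).
    await asyncio.sleep(0.1)
    return {"tool": name, "result": "..."}

async def run_parallel(calls: list[dict]) -> list[dict]:
    # Independent calls fan out concurrently; dependent calls would stay sequential.
    return await asyncio.gather(*(call_tool(c["name"], c["args"]) for c in calls))

results = asyncio.run(run_parallel([
    {"name": "search", "args": {"q": "constrained decoding"}},
    {"name": "search", "args": {"q": "tool calling"}},
]))
```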
Programmatic Tool Calling (Anthropic)
Claude writes Python scripts to orchestrate workflows:
- 37% token reduction
- 19+ fewer inference passes
- Tool results processed by script, not added to context
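A conceptual illustration of the pattern, not Anthropic's actual API: the model emits a script, a sandbox runs it, and intermediate tool results never re-enter the context window. `query_sales` is a hypothetical tool binding:
```python
# Script emitted by the model and executed in a sandbox.
total = 0.0
for region in ["us", "eu", "apac"]:
    rows = query_sales(region=region)        # hypothetical tool; raw rows stay here
    total += sum(r["amount"] for r in rows)  # aggregation happens inside the script
print({"total_sales": total})                # only this summary returns to the model
```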
Error Handling
Retry Patterns:
- Tenacity library: exponential backoff with retry-on-validation-failure (sketch after this list)
- LLM-assisted retries: resubmit the failing output plus the error message so the model can self-correct
- Circuit Breakers: Monitor failures, preemptive fallbacks
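A minimal Tenacity sketch; `call_llm` is a hypothetical client function, and retries fire only when Pydantic validation fails:
```python
from pydantic import BaseModel, ValidationError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

class Invoice(BaseModel):
    vendor: str
    total_usd: float

@retry(
    retry=retry_if_exception_type(ValidationError),
    wait=wait_exponential(multiplier=1, max=30),  # 1s, 2s, 4s, ... capped at 30s
    stop=stop_after_attempt(3),
)
def extract_invoice(prompt: str) -> Invoice:
    raw = call_llm(prompt)                   # hypothetical LLM call returning a string
    return Invoice.model_validate_json(raw)  # ValidationError triggers the retry
```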
Multi-Layer Architecture:
- User-level: frontend re-requests or reprocesses on failure
- Database-level: retry transient errors (timeouts, deadlocks)
- Application-level: catch exceptions and route to fallbacks
Grounding & Verification
Best Practices:
- RAG with span-level verification
- Post-hoc consistency checking
- Knowledge graph integration
- Self-verification (reflection, consistency, questioning)
Philosophy Shift 2025:
- From "zero hallucinations" to "managing uncertainty"
- Design for transparency: confidence scores, "no answer found"
Performance Optimization
Highest Impact:
- Output token reduction: generation latency scales roughly linearly with output length, so cutting output tokens ~50% cuts latency ~50%
- Output tokens cost ~4x input tokens
Other Optimizations:
- Keep prompts concise
- Aggressive RAG filtering (3 small chunks > 10 large)
- KV caching, semantic caching
- H100 GPU: 2-3x faster than A100
Key Principle: Not every problem needs an LLM call - route deterministic subtasks to plain code first (sketch below).
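One way to apply that principle (function names are made up): try a deterministic extractor before paying for a model call:
```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_email(text: str) -> str | None:
    m = EMAIL.search(text)
    if m:
        return m.group(0)           # deterministic hit: zero cost, zero latency
    return llm_extract_email(text)  # hypothetical LLM fallback for messy inputs
```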
Provider Comparison
| Provider | Focus | Special Features |
|---|---|---|
| OpenAI | Consumer AI | Parallel function calling, strict mode |
| Anthropic | Enterprise coding | Programmatic tool calling, tool search |
| Google | Multimodal/scale | 1M-token context, Workspace integration |
Pricing (per 1M tokens):
- Claude Sonnet 4: $3 input / $15 output
- Claude Opus 4.1: $15 input / $75 output
2025 Key Trends
- Agent-native APIs with multimodal I/O
- Programmatic tool orchestration
- Massive context windows (200k-1M tokens)
- Guardrails as standard practice
- Multi-agent over single general agents
- Tool use as commodity (all providers support)
- Coding agents breakout (Claude Code, OpenAI Codex)
Model Selection Guide
| Complexity | Recommended |
|---|---|
| Complex/ambiguous | Claude Opus/Sonnet 4.5, GPT-4.1 |
| Straightforward | Claude Haiku |
| Multimodal | Gemini 2.5 Pro |
| Cost-sensitive | Grok, Gemini |
Implementation Frameworks
- LangChain: Memory classes, multi-step orchestration
- LangGraph: State machine workflows, enterprise reliability
- CrewAI: Role-based multi-agent
- vLLM: supports Outlines and XGrammar backends for self-hosted constrained decoding (sketch below)
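A hedged sketch of constrained decoding against a self-hosted vLLM OpenAI-compatible server, assuming its `guided_json` extra-body parameter; model name and port are placeholders:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"vendor": {"type": "string"}, "total_usd": {"type": "number"}},
    "required": ["vendor", "total_usd"],
}

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: whatever the server loaded
    messages=[{"role": "user", "content": "Extract the invoice: ACME Corp, $1203.50"}],
    extra_body={"guided_json": schema},  # vLLM routes this to its grammar backend
)
print(resp.choices[0].message.content)
```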