Parallel Tool Calling and Execution Optimization in AI Agent Systems
Executive Summary
Tool-using AI agents spend the majority of their wall-clock time waiting for external function calls to return. When these calls are executed sequentially -- the default behavior in most agent loops -- latency compounds linearly with the number of tools invoked. A four-tool sequence where each call takes 300ms results in 1.2 seconds of dead time; running those same calls in parallel collapses it to 300ms. This simple observation has spawned a rich body of research and engineering practice around parallel tool calling: the automatic identification, scheduling, and concurrent execution of independent function calls within agent workflows.
The field has matured rapidly since the LLMCompiler paper (ICML 2024) introduced the idea of treating agent tool-call plans as dependency graphs amenable to compiler-style optimization. In 2025-2026, every major model provider -- OpenAI, Anthropic, and Google -- has shipped native parallel function calling support. Meanwhile, academic research has pushed beyond single-turn parallelism into hierarchical multi-agent parallelization (WideSeek, InfoSeeker), width-depth scaling tradeoffs (W&D), and specialized training regimes that teach models to generate parallelizable tool-call plans (ParaManager, graduated reward RL). Production systems report 3-5x latency reductions and 40-70% cost savings.
This article surveys the landscape of parallel tool calling optimization: the foundational compiler analogy, provider-level API support, scheduling and dependency analysis strategies, hierarchical and multi-agent extensions, training methodologies, benchmarks, and practical patterns for production deployment.
The Compiler Analogy: From Sequential to Parallel Tool Execution
LLMCompiler: The Foundational Architecture
The LLMCompiler framework, published by Kim et al. at ICML 2024, drew an explicit analogy between traditional compilers and agent tool orchestration [1]. Just as a compiler analyzes instruction dependencies and schedules independent operations for parallel execution on multiple CPU pipelines, LLMCompiler analyzes tool-call dependencies and dispatches independent calls concurrently.
The architecture comprises three components:
- Function Calling Planner: The LLM generates an execution plan as a directed acyclic graph (DAG) where nodes are tool calls and edges represent data dependencies. For example, if task B requires the output of task A but task C is independent, the planner expresses this as A -> B, with C as an isolated node.
- Task Fetching Unit: A scheduler that performs topological analysis on the DAG, identifying tasks whose dependencies have been satisfied. These are dispatched to the executor immediately, without waiting for unrelated tasks to complete.
- Executor: A concurrent execution engine that runs dispatched tool calls in parallel, collecting results and feeding them back into the dependency graph.
The results were striking: up to 3.7x latency speedup, 6.7x cost reduction, and ~9% accuracy improvement compared to ReAct-style sequential execution [1]. The accuracy gains came from reduced context pollution -- fewer intermediate reasoning steps meant less opportunity for the model to hallucinate or lose track of its plan.
Why the Compiler Analogy Works
The parallel between traditional compilation and agent tool orchestration is deeper than it first appears:
| Compiler Concept | Agent Tool Analogy |
|---|---|
| Instruction dependency analysis | Tool-call dependency graph construction |
| Register allocation | Context window budget management |
| Instruction-level parallelism | Tool-call-level parallelism |
| Pipeline scheduling | Task fetching and dispatch ordering |
| Dead code elimination | Pruning unnecessary tool calls |
| Loop unrolling | Expanding iterative tool-call patterns |
This analogy has proven productive. The LLMCompiler implementation is now available as a first-class tutorial in LangGraph [2], and the pattern has been adopted by production agent frameworks including CrewAI, AutoGen, and OpenAI's Agents SDK.
Provider-Level Support: The API Landscape
OpenAI
OpenAI ships parallel function calling as a default behavior. When the model determines that multiple functions are needed, it can emit multiple tool-call objects in a single response. The parallel_tool_calls parameter (default true) controls this behavior [3]. The Responses API, introduced in 2025, further streamlined the interface. The Agents SDK provides higher-level abstractions for parallel agent execution, including an "agent as tool" pattern where sub-agents are invoked concurrently through the planner [4].
A notable limitation: the tool_choice parameter only accepts a single function name, meaning developers cannot force a specific combination of parallel calls. The model decides the parallelization strategy autonomously [3].
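As a minimal sketch of the request shape (the model name and tool schema are illustrative, not prescribed by the article), a Chat Completions call with parallel_tool_calls enabled can return several tool-call objects in a single assistant message, each with its own id and arguments:

```python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": "Compare the weather in Paris and Tokyo."}],
    tools=tools,
    parallel_tool_calls=True,  # the default; set False to force at most one call per turn
)

# When the model parallelizes, the single assistant message carries multiple tool calls.
for tool_call in response.choices[0].message.tool_calls or []:
    print(tool_call.id, tool_call.function.name, tool_call.function.arguments)
```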
Anthropic Claude
Claude supports parallel tool use, with Claude 4 models offering built-in token-efficient tool calling. However, Claude's approach differs architecturally. Programmatic Tool Calling (PTC) allows Claude to write Python code that orchestrates multiple tool calls, processes their outputs, and controls what information enters the context window -- all within a single execution container [5]. This is a more expressive approach than emitting multiple tool-call objects: the model can express conditional logic, loops, and data transformations in code rather than through natural language.
PTC reduces latency for multi-tool workflows by eliminating round-trips through the model for each tool invocation and decreases token consumption by allowing Claude to filter or process data before it reaches the context window [5]. A known issue in 2026 is that Claude 4.6 models show reduced parallel tool calling in the Batch API with large tool definitions [6].
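PTC itself executes inside Anthropic's code-execution container and is not reproduced here. The sketch below shows only the baseline parallel tool-use flow: Claude emits several tool_use blocks in one assistant turn, the client executes them concurrently, and all tool_result blocks are returned together in a single user message. The model name and the run_tool dispatcher are illustrative assumptions.

```python
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Compare the weather in Paris and Tokyo."}],
)

# Claude may emit several tool_use blocks in one assistant turn. Execute them
# concurrently on the client, then return every tool_result in one user message.
tool_results = []
for block in response.content:
    if block.type == "tool_use":
        result = run_tool(block.name, block.input)  # run_tool is a hypothetical dispatcher
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": str(result),
        })
```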
Google Gemini
Gemini supports calling multiple functions in a single turn (parallel function calling), in sequence (compositional function calling), and with built-in Gemini tools (multi-tool use) [7]. A distinctive feature is Gemini's ID-based result mapping: when the model initiates multiple function calls in a single turn, results do not need to be returned in the same order. The API maps each result back to its corresponding call using an ID, enabling true asynchronous execution on the client side [7]. Gemini supports up to 128 functions in a single declaration list, from which the model may select any subset for parallel invocation.
Scheduling Strategies: Width, Depth, and the Tradeoff
The W&D Framework
The Wide and Deep (W&D) framework from Salesforce AI Research (February 2026) represents a significant advance in understanding parallel tool calling as a scaling dimension [8]. The key insight is that agent performance can be scaled along two axes:
- Depth: More sequential reasoning steps (traditional approach)
- Width: More parallel tool calls per step (parallel scaling)
W&D demonstrates that jointly scaling both dimensions yields better results than scaling either alone. The optimal configuration uses 3 parallel tools per turn, which significantly reduces the number of turns required, wall-clock time, and LLM API costs [8]. The framework was evaluated on BrowseComp, HLE, and GAIA benchmarks.
Critically, W&D achieves this through intrinsic parallel tool calling -- the model's native ability to emit multiple tool calls -- rather than complex multi-agent orchestration. This makes it a lightweight, drop-in optimization for existing agent loops.
Descending Scheduling
Research on scheduling strategies reveals that a Descending strategy -- prioritizing broad exploration in early stages followed by focused exploitation -- outperforms static or ascending strategies by approximately 6% [9]. This mirrors classical search algorithms: cast a wide net first, then narrow down based on initial results. The practical implication is that agent systems should front-load parallel tool calls at the beginning of a task, when the information space is largest, and converge to sequential execution as the task nears completion.
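The cited work does not prescribe an exact schedule here; the following sketch only illustrates the shape of a descending strategy, decaying linearly from a wide first turn to a single call at the end.

```python
def descending_width_schedule(max_width: int, num_turns: int) -> list[int]:
    """Allocate parallel tool-call width per turn: broad early, narrow late.

    A simple linear decay from max_width down to 1; the cited work's exact
    schedule may differ -- this only illustrates the strategy's shape.
    """
    if num_turns == 1:
        return [max_width]
    step = (max_width - 1) / (num_turns - 1)
    return [max(1, round(max_width - step * turn)) for turn in range(num_turns)]

# e.g. descending_width_schedule(6, 4) -> [6, 4, 3, 1]
```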
Topological Scheduling in Practice
At the implementation level, parallel tool execution relies on topological sorting of the dependency DAG. A planner assigns each task to an execution wave:
```python
# Simplified wave-based parallel execution
def schedule_waves(task_graph):
    waves = []
    remaining = set(task_graph.nodes())
    while remaining:
        # Tasks with all dependencies satisfied
        ready = {t for t in remaining
                 if all(dep not in remaining
                        for dep in task_graph.predecessors(t))}
        waves.append(ready)
        remaining -= ready
    return waves

# Wave 1: independent tasks (parallel)
# Wave 2: tasks depending on Wave 1 (parallel within wave)
# Wave 3: tasks depending on Waves 1-2 (parallel within wave)
```
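As a usage sketch, assuming a networkx DiGraph (whose nodes()/predecessors() interface matches what schedule_waves expects, though the article does not name a specific graph library), the earlier plan -- A -> B with C independent -- schedules into two waves:

```python
import networkx as nx

plan = nx.DiGraph()
plan.add_nodes_from(["A", "B", "C"])
plan.add_edge("A", "B")  # B depends on A's output; C has no dependencies

print(schedule_waves(plan))  # [{'A', 'C'}, {'B'}] (set ordering may vary)
```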
Tasks in the same wave execute concurrently since they have no inter-dependencies. LangGraph implements this through its BSP/Pregel-based algorithm, which provides deterministic concurrency with full support for cycles -- a necessity since agent workflows often involve cyclical patterns like retry loops [2].
Hierarchical and Multi-Agent Parallelization
InfoSeeker: Three-Layer Hierarchy
InfoSeeker (April 2026) introduces a hierarchical framework based on near-decomposability principles [10]. The architecture has three layers:
- Host: Maintains compressed global state and issues high-level directives
- Managers: Domain-specific agents that decompose directives, verify quality, and aggregate results
- Workers: Execute atomic tool interactions via MCP, running simultaneously
The key innovation is strict context isolation between layers. Workers operate in parallel without sharing context, preventing the saturation and error propagation that plagues flat parallel architectures. Managers aggregate results before passing them up, acting as information bottlenecks that filter noise. This achieves 3-5x speedup on information-seeking benchmarks [10].
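The sketch below is not InfoSeeker's implementation; it only illustrates the host/manager/worker fan-out with aggregation at each layer, using hypothetical plan_directives, decompose, call_tool, summarize, and merge helpers to stand in for the paper's components.

```python
import asyncio

async def worker(subtask: str) -> str:
    """Executes one atomic tool interaction; sees only its own subtask."""
    return await call_tool(subtask)  # call_tool is a hypothetical tool dispatcher

async def manager(directive: str, subtasks: list[str]) -> str:
    """Runs workers in parallel, then aggregates before reporting upward."""
    raw = await asyncio.gather(*(worker(s) for s in subtasks))
    return summarize(directive, raw)  # hypothetical aggregation / filtering step

async def host(task: str) -> str:
    """Issues high-level directives and merges manager summaries."""
    directives = plan_directives(task)  # hypothetical task decomposition
    summaries = await asyncio.gather(
        *(manager(d, decompose(d)) for d in directives)
    )
    return merge(summaries)  # hypothetical final synthesis
```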
WideSeek: Dynamic Agent Forking
WideSeek (February 2026) takes a different approach: rather than pre-defining the agent hierarchy, the main agent dynamically forks parallel sub-agents based on task requirements [11]. The system uses end-to-end reinforcement learning to optimize the multi-agent trajectory, learning when and how many agents to spawn.
The companion work WideSeek-R1 further explores width scaling through multi-agent reinforcement learning (MARL), demonstrating that scaling the number of parallel agents is a viable alternative to scaling model size or reasoning depth [12].
AggAgent: Aggregating Parallel Rollouts
AggAgent (April 2026) addresses a fundamental challenge in parallel scaling: how to aggregate results from multiple parallel agent rollouts [13]. Simply concatenating trajectories exceeds context limits, while aggregating only final answers discards valuable intermediate information. AggAgent treats parallel trajectories as an environment, equipped with lightweight tools to inspect candidate solutions and search across trajectories on demand. This yields up to 5.3% absolute improvement on average and 10.3% on deep research tasks across three model families [13].
Training for Parallel Tool Orchestration
ParaManager: Small Model as Orchestrator
ParaManager (April 2026) demonstrates that parallel tool orchestration can be delegated to a small, specialized model while larger models handle the actual reasoning [14]. The approach introduces the Agent-as-Tool paradigm: both agents and tools are abstracted into a standardized, learnable action space with protocol normalization and explicit state feedback.
The training pipeline combines:
- Supervised fine-tuning (SFT) on trajectories with recovery mechanisms
- Reinforcement learning (RL) optimizing for task success, protocol compliance, diversity, and reasoning efficiency
This decomposition -- small model plans, large models execute -- reduces long-context interference and limits error propagation. The orchestrator focuses purely on dependency analysis and dispatch, which turns out to be a much simpler task than end-to-end problem solving [14].
Graduated Reward Training
Cheng et al. (March 2026) propose a training methodology specifically for multi-step tool orchestration [15]. The approach addresses two challenges:
- Data synthesis: A reinforcement learning environment backed by real API response caches generates training data with controllable complexity
- Graduated rewards: Correctness is decomposed into atomic validity (individual function call correctness at increasing granularity) and orchestration (correct tool sequencing with dependency respect)
On ComplexFuncBench, this achieves substantial improvements in turn accuracy. Ablation studies confirm both reward components are essential -- using either alone significantly degrades performance. Single-domain tasks achieve 50-60% turn accuracy, while cross-domain orchestration remains challenging at 22-28% [15].
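The paper's reward function is not reproduced here; the toy sketch below only illustrates the decomposition into an atomic-validity term and an orchestration term, with assumed scoring rules and weights.

```python
def graduated_reward(predicted_calls, reference_calls,
                     atomic_weight: float = 0.5,
                     orchestration_weight: float = 0.5) -> float:
    """Toy decomposition of correctness into atomic validity and orchestration.

    The scoring rules and weights are assumptions; the cited work defines its
    own granularity and weighting, which this sketch does not reproduce.
    """
    # Atomic validity: fraction of predicted calls matching a reference call.
    atomic = sum(
        1 for call in predicted_calls if call in reference_calls
    ) / max(len(predicted_calls), 1)

    # Orchestration: reference sequence appears as an in-order subsequence of
    # the predicted calls, i.e. dependencies are respected.
    it = iter(predicted_calls)
    orchestration = float(all(ref in it for ref in reference_calls))

    return atomic_weight * atomic + orchestration_weight * orchestration
```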
Benchmarks and Evaluation
GTA-2: From Atomic to Workflow
GTA-2 (April 2026) introduces a hierarchical benchmark spanning atomic tool use and open-ended workflows [16]. The findings are sobering: while frontier models already struggle on atomic tasks (below 50% success), they largely fail on workflow-level tasks, with top models achieving only 14.39% success. This reveals a pronounced capability cliff between isolated tool calls and sustained multi-tool orchestration [16].
Notably, checkpoint-guided feedback improves performance, and advanced execution frameworks (Manus, OpenClaw) substantially enhance workflow completion -- highlighting that the execution harness matters as much as the underlying model.
The Evolution Survey
A comprehensive survey by researchers tracking the evolution of tool use in LLM agents (March 2026) organizes the literature around six dimensions [17]:
- Inference-time planning and execution
- Training and trajectory construction
- Safety and control
- Efficiency under resource constraints
- Capability completeness in open environments
- Benchmark design and evaluation
The survey emphasizes that the evaluation paradigm has shifted from measuring isolated API invocation correctness to assessing system-level intelligence in sustaining, adapting, and repairing extended tool-use trajectories. Long-horizon tool orchestration demands topological reasoning, persistent state management, and dynamic adaptation -- not merely the linear accumulation of atomic calls [17].
Production Patterns and Error Handling
Wave-Based Execution with Fallbacks
Production systems implement parallel tool calling with several essential patterns:
```python
import asyncio
from typing import List, Dict

# ToolCall, ToolResult, and ToolError are assumed application-level types:
# a callable tool invocation, its outcome, and a tool-raised exception.

async def execute_wave_with_fallbacks(
    wave: List[ToolCall],
    timeout: float = 10.0,
    max_retries: int = 2,
) -> Dict[str, ToolResult]:
    results = {}

    async def execute_with_retry(call: ToolCall):
        for attempt in range(max_retries + 1):
            try:
                result = await asyncio.wait_for(
                    call.execute(),
                    timeout=timeout,
                )
                return call.id, result
            except asyncio.TimeoutError:
                if attempt == max_retries:
                    return call.id, ToolResult.timeout(call)
            except ToolError as e:
                if attempt == max_retries:
                    return call.id, ToolResult.error(call, e)
            await asyncio.sleep(2 ** attempt)  # Exponential backoff

    tasks = [execute_with_retry(call) for call in wave]
    for completed in asyncio.as_completed(tasks):
        call_id, result = await completed
        results[call_id] = result
    return results
```
Key patterns include:
- Per-call timeouts: Individual tool calls get timeouts independent of the overall wave
- Exponential backoff: Failed calls retry with increasing delay
- Graceful degradation: Partial results are returned even when some calls fail
- Checkpoint-based recovery: Completed waves are checkpointed so that failures in later waves do not require re-executing earlier ones
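To illustrate the checkpoint-based recovery pattern above, the following sketch persists completed waves to a JSON file and skips them on re-run; the checkpoint format and repr-based serialization are assumptions for illustration, not a standard.

```python
import json
from pathlib import Path

async def run_plan_with_checkpoints(waves, checkpoint_path: Path):
    """Re-runs a wave plan, skipping waves already recorded in the checkpoint."""
    done = json.loads(checkpoint_path.read_text()) if checkpoint_path.exists() else {}
    for i, wave in enumerate(waves):
        if str(i) in done:
            continue  # Wave already completed in a prior run
        results = await execute_wave_with_fallbacks(wave)
        done[str(i)] = {call_id: repr(result) for call_id, result in results.items()}
        checkpoint_path.write_text(json.dumps(done))  # Persist before the next wave
    return done
```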
Cost and Latency Optimization
Production deployments report consistent improvements from parallel tool calling [8][9][18]:
| Metric | Sequential | Parallel | Improvement |
|---|---|---|---|
| Latency per task | 30+ seconds | 6 seconds | ~5x |
| Token consumption | Baseline | ~50% fewer | 2x savings |
| API cost | Baseline | ~40% lower | 1.7x savings |
| Turns to completion | Baseline | ~60% fewer | 2.5x reduction |
The cost savings come from two sources: fewer LLM reasoning turns (each turn has fixed overhead in prompt tokens) and reduced context accumulation (parallel results are consolidated before the next reasoning step).
Framework Integration
LangGraph provides the most mature implementation of parallel tool execution through its graph-based execution model. The algorithm automatically selects parallel execution whenever node dependencies allow, executes parallel nodes with isolated state copies, and applies updates deterministically regardless of completion order [2]. CrewAI offers hierarchical delegation where a manager agent orchestrates parallel worker agents [19]. OpenAI's Agents SDK supports both explicit parallel agent patterns and the "agent as tool" route for dynamic parallelization [4].
Practical Implications
When to Parallelize
Not all tool calls benefit from parallelization. The decision framework is:
- Independent data gathering: Multiple API calls, database queries, or web searches with no data dependencies -- always parallelize
- Fan-out/fan-in patterns: A query decomposed into sub-queries that are later aggregated -- parallelize the fan-out, serialize the aggregation (see the sketch after this list)
- Speculative execution: Multiple possible next steps where only one result will be used -- parallelize if the wasted compute is cheaper than the latency of serial exploration
- Dependent chains: Each call requires the previous call's output -- cannot parallelize, but can pipeline (start processing the first result while the second call executes)
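A minimal fan-out/fan-in sketch, assuming hypothetical decompose, search, and aggregate helpers: the independent sub-queries run concurrently, while the aggregation step, which depends on every partial result, stays serial.

```python
import asyncio

async def fan_out_fan_in(question: str) -> str:
    # Fan-out: sub-queries have no data dependencies, so run them concurrently.
    sub_queries = decompose(question)  # hypothetical query decomposition
    partials = await asyncio.gather(*(search(q) for q in sub_queries))
    # Fan-in: aggregation needs every partial result, so it remains sequential.
    return aggregate(question, partials)  # hypothetical aggregation step
```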
Implementation Checklist for Production
For teams deploying parallel tool calling in production:
- Profile your tool latencies: Parallelization helps most when individual tool calls have high latency (API calls, web requests). For sub-millisecond operations, the scheduling overhead may dominate.
- Set per-call timeouts: A single slow tool call should not block the entire wave. Use asyncio.wait_for or equivalent.
- Implement circuit breakers: If a tool consistently fails, stop calling it rather than retrying indefinitely (a minimal sketch appears after this checklist).
- Monitor wave utilization: Track how many calls per wave actually execute in parallel versus being serialized by dependencies. Low utilization suggests the planner is not generating parallelizable plans.
- Budget context carefully: Parallel results arrive simultaneously and can flood the context window. Use summarization or filtering before feeding results back to the model.
- Test with realistic tool latencies: Mock tools with instant responses will not surface parallelization bugs. Inject realistic delays in testing.
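A minimal circuit breaker sketch (the thresholds, cooldown, and method names are illustrative, not a standard library API): after repeated failures the breaker opens and callers skip the tool, then a cooldown allows a trial call before it fully closes again.

```python
import time

class CircuitBreaker:
    """Stop dispatching to a tool after repeated failures; retry after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker opened

    def allow_call(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_seconds:
            self.opened_at = None  # Half-open: permit one trial call
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```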
Future Directions
Learned Scheduling Policies
Current scheduling strategies (constant width, descending, ascending) are static. The next frontier is learned scheduling policies that adapt width dynamically based on task type, intermediate results, and resource constraints. WideSeek-R1's multi-agent RL approach [12] points in this direction, but single-agent dynamic width adjustment remains underexplored.
Cross-Model Orchestration
ParaManager's small-model-as-orchestrator paradigm [14] suggests a future where cheap, fast models handle dependency analysis and scheduling while expensive frontier models only execute the tool calls that require sophisticated reasoning. This is a form of model routing at the tool-call level rather than the query level.
Compiler Passes for Agent Programs
The LLMCompiler analogy can be extended further. Future systems may implement multiple optimization passes: dead-call elimination (removing tool calls whose results are never used), call fusion (combining multiple calls to the same API into a batch request), and speculative prefetching (issuing likely-needed calls before the planner explicitly requests them). These optimizations are standard in traditional compilers but have barely been explored in the agent context.
Standardized Dependency Specification
Currently, dependency analysis relies on the LLM's natural language understanding of which calls depend on which. A more reliable approach would be a structured dependency specification language where tools declare their inputs, outputs, and side effects, enabling compile-time dependency analysis without LLM involvement. The MCP protocol's tool definition schema is a natural starting point for this extension.
Real-Time Adaptive Width
The W&D finding that 3 parallel tools per turn is optimal [8] likely varies by task type, model capability, and tool reliability. Real-time systems will need to adjust width based on observed tool failure rates, latency distributions, and remaining context budget -- essentially implementing a real-time scheduler similar to those in operating systems.
References
- Kim, S., Moon, S., Tabrizi, R., Lee, N., Mahoney, M., Keutzer, K., & Gholami, A. (2024). An LLM Compiler for Parallel Function Calling. ICML 2024. https://arxiv.org/abs/2312.04511
- LangChain. LLMCompiler Tutorial in LangGraph. https://langchain-ai.github.io/langgraph/tutorials/llm-compiler/LLMCompiler/
- OpenAI. Function Calling Documentation. https://platform.openai.com/docs/guides/function-calling
- OpenAI. Parallel Agents with the OpenAI Agents SDK. https://developers.openai.com/cookbook/examples/agents_sdk/parallel_agents
- Anthropic. Programmatic Tool Calling - Claude API Docs. https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling
- Anthropic SDK Issue #956: Claude Opus 4.6 and Sonnet 4.6 fail to make parallel tool calls in Batch API. https://github.com/anthropics/anthropic-sdk-typescript/issues/956
- Google. Function Calling with the Gemini API. https://ai.google.dev/gemini-api/docs/function-calling
- Lin, X., Liew, J. H., Savarese, S., & Li, J. (2026). W&D: Scaling Parallel Tool Calling for Efficient Deep Research Agents. Salesforce AI Research. https://arxiv.org/abs/2602.07359
- CodeAnt. Why Parallel Tool Calling Matters for LLM Agents. https://www.codeant.ai/blogs/parallel-tool-calling
- Lee, K. Y., et al. (2026). InfoSeeker: A Scalable Hierarchical Parallel Agent Framework for Web Information Seeking. https://arxiv.org/abs/2604.02971
- Hao, Z., et al. (2026). WideSeek: Advancing Wide Research via Multi-Agent Scaling. https://arxiv.org/abs/2602.02636
- WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning. https://arxiv.org/abs/2602.04634
- Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks. (2026). https://arxiv.org/abs/2604.11753
- Small Model as Master Orchestrator: Learning Unified Agent-Tool Orchestration with Parallel Subtask Decomposition. (2026). https://arxiv.org/abs/2604.17009
- Cheng, J., et al. (2026). Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards. https://arxiv.org/abs/2603.24709
- GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows. (2026). https://arxiv.org/abs/2604.15715
- The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration. (2026). https://arxiv.org/abs/2603.22862
- Kore.ai. Boost AI Agent Performance with Parallel Execution. https://www.kore.ai/blog/boost-ai-agent-performance-with-parallel-execution
- CrewAI Documentation: Hierarchical Processes. https://docs.crewai.com/en/concepts/processes
- Anthropic. Introducing Advanced Tool Use on the Claude Developer Platform. https://www.anthropic.com/engineering/advanced-tool-use
- ofox.ai. Function Calling and Tool Use: The Complete Guide for GPT, Claude, and Gemini (2026). https://ofox.ai/blog/function-calling-tool-use-complete-guide-2026/
- LangChain. Building LangGraph: Designing an Agent Runtime from First Principles. https://blog.langchain.com/building-langgraph/

