Zylos
2026-02-06

Mixture of Agents: Collaborative LLM Intelligence in 2026

research, llm, mixture-of-agents, ensemble-learning, collaborative-ai, ai-architecture, 2026

Executive Summary

Mixture of Agents (MoA) represents a paradigm shift in how we leverage multiple large language models, introducing a collaborative framework where models iteratively refine responses through layered architecture. Unlike traditional ensemble methods or Mixture of Experts (MoE), MoA operates at the system level, orchestrating complete LLMs through the prompt interface without requiring any weight modifications. This approach achieved 65.1% on AlpacaEval 2.0 using only open-source models, surpassing GPT-4 Omni's 57.5%. However, recent research in early 2025 has challenged this methodology, with Self-MoA demonstrating that aggregating outputs from a single top-performing model can outperform traditional multi-model MoA by 6.6%, questioning whether model diversity truly provides benefits or simply introduces quality degradation.

The story of MoA illuminates a fundamental insight about LLM behavior: models exhibit inherent "collaborativeness," generating better responses when presented with outputs from other models, even if those reference responses are of lower quality. This phenomenon enabled MoA to achieve state-of-the-art results across multiple benchmarks including AlpacaEval 2.0, Arena-Hard, MT-Bench, and FLASK. Yet the emergence of Self-MoA suggests the field may be shifting from diversity-focused approaches toward quality-consolidation strategies, marking an important inflection point in how we think about LLM ensemble methods.

For production deployments, MoA presents significant tradeoffs: it offers cost-optimal performance compared to proprietary models like GPT-4, but introduces latency challenges from dense inter-agent communication and coordination complexity. Recent optimizations achieve up to 90% latency reduction through tree-structured routing and dependency-aware execution, making MoA increasingly viable for real-world applications. As we move through 2026, the debate between multi-model MoA and single-model Self-MoA will likely shape the future of collaborative AI systems.

What is Mixture of Agents?

Mixture of Agents (MoA) is a novel approach that leverages the collective strengths of multiple LLMs to enhance performance through a layered architecture. Unlike Mixture of Experts (MoE), which operates at the activation level within a single model using specialized sub-networks, MoA operates at the model level—orchestrating multiple complete LLMs across different layers through the prompt interface rather than requiring modifications to internal activations or weights.

The core architectural innovation of MoA lies in its iterative refinement process. Each layer comprises multiple LLM agents, and these agents take the outputs from the previous layer as auxiliary information to generate refined responses. Initially, several proposers independently generate responses to a given prompt. These responses are then presented to aggregators in the next layer, who synthesize them into higher-quality responses. This iterative process continues through several layers until a more robust and comprehensive response is achieved.

A typical MoA architecture might feature 4 layers: the first layer has 3 proposers generating initial responses, the second and third layers have 3 aggregators that also serve as proposers for the next layer, and the final layer has one aggregator that produces the final output. This demonstrates that aggregators from one layer can become proposers for subsequent layers, enabling progressive quality improvement through the pipeline.
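The layered flow can be sketched as a small driver that treats each agent as a callable from (prompt, references) to text. The toy agents below stand in for real LLM calls, and the example collapses the pipeline to two layers for brevity:

```python
def run_moa(prompt, layers):
    """Run a layered Mixture-of-Agents pipeline.

    `layers` is a list of agent lists; each agent is a callable
    (prompt, references) -> response string. Each layer receives the
    previous layer's outputs as reference responses, and the final
    layer is expected to contain a single aggregator.
    """
    references = []
    for layer in layers:
        references = [agent(prompt, references) for agent in layer]
    return references[0]

# Toy agents standing in for real LLM calls.
def proposer(name):
    return lambda prompt, refs: f"{name}'s answer to: {prompt}"

def aggregate(prompt, refs):
    return "synthesis of [" + "; ".join(refs) + "]"

layers = [
    [proposer("A"), proposer("B"), proposer("C")],  # layer 1: proposers
    [aggregate],                                    # final layer: aggregator
]
print(run_moa("What is MoA?", layers))
```

A full 4-layer configuration would simply add intermediate layers whose agents both consume the previous layer's references and serve as proposers for the next.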

The key distinction from other approaches is that MoA requires no finetuning at all. It's a purely system-level approach to ensemble learning, operating entirely at the prompt level without modifying model weights. This makes it immediately applicable to any collection of existing LLMs without requiring specialized training infrastructure or access to model internals.

The Collaborativeness Phenomenon

The foundation of MoA's success rests on a surprising empirical observation: LLMs exhibit inherent "collaborativeness." When a model is provided with answers generated by other models, its win rate significantly improves, even when the reference response quality is lower than the model's own performance baseline. This counterintuitive finding challenges conventional assumptions about ensemble learning, where we typically expect quality to degrade when incorporating lower-quality inputs.

Research demonstrates this phenomenon is widespread among LLMs. When models receive outputs from other models as context, they don't simply average or select among the options—they synthesize new responses that incorporate diverse perspectives and additional context, leading to measurably better outcomes. The collaborativeness effect appears even when reference responses come from weaker models, suggesting that exposure to alternative framings, reasoning paths, or information retrieval strategies provides genuine value.

This phenomenon enables two distinct roles within the MoA framework:

Proposers excel at generating useful reference responses for use by other models. A good proposer may not necessarily produce responses with high scores by itself, but it offers more context, diverse perspectives, and alternative approaches that downstream aggregators can leverage.

Aggregators synthesize different responses from proposers into one high-quality output. Rather than simply choosing the best answer among the responses, aggregators perform a complex synthesis process that blends the most optimal elements from multiple outputs.

The collaborativeness phenomenon is what makes MoA fundamentally different from simple majority voting or selection-based ensemble methods. Models actively build upon each other's outputs, creating emergent improvements through iterative refinement that exceed what any single model could achieve in isolation.
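Aggregation in MoA happens entirely in the prompt: proposer outputs are injected as numbered reference responses alongside an instruction to synthesize rather than select. A sketch of such a prompt builder, with wording that is illustrative rather than the paper's exact template:

```python
def build_aggregator_prompt(user_prompt, references):
    """Assemble an aggregation prompt that injects proposer outputs as
    numbered reference responses. The instruction wording here is
    illustrative, not the exact prompt from the MoA paper."""
    refs = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(references))
    return (
        "You have been provided with responses from several models to the "
        "query below. Synthesize them into a single, high-quality answer. "
        "Critically evaluate the responses: some may be biased or wrong.\n\n"
        f"Responses:\n{refs}\n\n"
        f"Query: {user_prompt}"
    )

print(build_aggregator_prompt("Define MoA.", ["Answer one.", "Answer two."]))
```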

Performance Results and Benchmarks

MoA's empirical results represent a significant achievement in LLM capabilities, particularly when using only open-source models. On AlpacaEval 2.0, MoA scored 65.1% using solely open-source LLMs, compared to 57.5% for GPT-4 Omni, an absolute improvement of 7.6 percentage points. This placed MoA at the top of the AlpacaEval 2.0 leaderboard by a substantial margin.

The performance gains extended across multiple evaluation frameworks:

AlpacaEval 2.0: This benchmark measures instruction-following quality through automated pairwise comparisons. MoA's 65.1% represents the win rate against reference responses, demonstrating superior instruction-following capabilities compared to the strongest proprietary models.

Arena-Hard: Focuses on ranking LLMs through head-to-head response battles in challenging reasoning and instruction-following tasks. MoA achieved state-of-the-art results, benefiting from the diversity and depth of reasoning enabled by multi-layer synthesis.

MT-Bench: This multi-turn benchmark evaluates chatbot-like interactions across conversational turns. MoA's layered architecture proved particularly well-suited for multi-turn scenarios where maintaining context and coherence across exchanges is critical.

FLASK: Another evaluation framework where MoA demonstrated state-of-the-art performance, further validating the approach across diverse task types.

From a cost-performance perspective, MoA approaches lie on the Pareto frontier of cost-optimal performance: MoA configurations built from open-source models match the performance of GPT-4 Turbo and GPT-4o at significantly lower cost, placing those proprietary models above the frontier. This economic advantage makes MoA particularly attractive for production deployments where inference costs are a primary constraint.

The diversity of model outputs proved crucial to these results. Responses generated by heterogeneous models contribute significantly more than those produced by the same model, suggesting that model diversity—not just model quality—drives performance gains. This finding supported the original MoA hypothesis that collaborative diversity unlocks capabilities beyond what homogeneous ensembles can achieve.

The Self-MoA Challenge: Rethinking Model Diversity

In February 2025, new research fundamentally challenged the MoA methodology with a provocative question: "Is Mixing Different Large Language Models Beneficial?" The answer, surprisingly, appears to be "not always," and sometimes an outright "no."

Self-MoA is an ensemble method that aggregates outputs from only a single top-performing LLM, sampling multiple outputs from the same model and aggregating them while exploiting in-model diversity rather than inter-model diversity. This approach avoids quality degradation from weaker models that can occur when mixing heterogeneous LLMs.
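Structurally, Self-MoA is the same pipeline with the model list collapsed to one. A minimal sketch, with `sample` and `aggregate` standing in for calls to the same underlying model:

```python
def self_moa(prompt, sample, aggregate, k=4):
    """Self-MoA: draw k samples from one model (in-model diversity via
    temperature sampling), then aggregate them with that same model.
    `sample` and `aggregate` stand in for real model calls."""
    candidates = [sample(prompt) for _ in range(k)]
    return aggregate(prompt, candidates)

# Toy stand-ins for a single model's sampling and aggregation calls.
drafts = iter(range(1, 100))
sample = lambda prompt: f"draft {next(drafts)}"
aggregate = lambda prompt, cs: f"best of {len(cs)} drafts"

print(self_moa("q", sample, aggregate, k=4))  # best of 4 drafts
```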

The performance results are striking. Self-MoA improves the Length-Controlled (LC) win rate by 6.6 points versus standard MoA on AlpacaEval 2.0 (e.g., 65.7% vs. 59.1%), establishing a new state-of-the-art on the leaderboard. More broadly, Self-MoA achieves an average of 3.8% improvement across various benchmarks including MMLU, CRUX, and MATH.

The core finding is that MoA performance is rather sensitive to quality, and mixing different LLMs often lowers the average quality of the models. When you aggregate outputs from heterogeneous models with varying quality levels, the weaker models can introduce noise or lower-quality reasoning that degrades the final synthesis, even though the collaborativeness phenomenon suggests models can benefit from lower-quality references.

Sequential Self-MoA (Self-MoA-Seq) further addresses context length limits by aggregating over sliding windows, enabling the approach to scale to longer contexts without hitting token limits.
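The sliding-window idea can be sketched as a fold that carries the running synthesis into each new window, so the aggregator's context never holds more than a fixed number of responses. The exact windowing scheme in the paper may differ:

```python
def self_moa_seq(prompt, candidates, aggregate, window=3):
    """Sequential Self-MoA sketch: fold the candidate list through
    fixed-size windows so the aggregator never sees more than `window`
    responses at once; the running synthesis is carried forward."""
    current = candidates[0]
    i = 1
    while i < len(candidates):
        chunk = [current] + candidates[i:i + window - 1]
        current = aggregate(prompt, chunk)
        i += window - 1
    return current

# Toy aggregator that just concatenates, to make the folding visible.
join = lambda prompt, chunk: "+".join(chunk)
print(self_moa_seq("q", ["a", "b", "c", "d", "e"], join))  # a+b+c+d+e
```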

This research raises profound questions about the original MoA hypothesis. If Self-MoA—using only a single model—outperforms multi-model MoA, what does that say about the value of diversity? Several interpretations emerge:

  1. Quality Thresholds Matter: The collaborativeness phenomenon may have quality thresholds. Models benefit from diverse perspectives only when those perspectives meet a minimum quality bar. Below that threshold, diversity introduces more harm than help.

  2. In-Model Diversity is Sufficient: A single high-quality model can generate diverse responses through temperature sampling and stochastic decoding. This in-model diversity may capture most of the benefits attributed to inter-model diversity without the quality degradation.

  3. Aggregator Quality Dominates: The final aggregator's ability to synthesize responses may be the dominant factor. If the aggregator is high-quality, it can work with responses from the same model sampled multiple times. If the aggregator struggles with heterogeneous inputs, multi-model MoA suffers.

  4. Task Dependency: The benefits of multi-model diversity may be task-dependent. For general instruction-following (AlpacaEval), Self-MoA wins. For specialized reasoning tasks or domain-specific problems, multi-model MoA might still provide advantages through complementary expertise.

The emergence of Self-MoA doesn't invalidate the original MoA research—it refines our understanding of when and why collaborative approaches work. The field is moving from a blanket assumption that "more models equals better" toward a more nuanced view that quality-controlled collaboration is what actually drives performance.

MoA vs. Mixture of Experts: Architectural Differences

While both MoA and Mixture of Experts (MoE) involve "mixing" multiple components, they operate at fundamentally different levels of abstraction and serve different architectural purposes.

Mixture of Experts (MoE) is a model-centric approach that applies ensemble learning within a single model. It operates at the activation level, using specialized sub-networks (experts) within the model architecture. A gating network decides which experts to activate for each input, routing computation to the most relevant specialists. MoE is a property of the LLM itself—it's built into the model architecture during training.

Mixture of Agents (MoA) is a system-centric approach that applies ensemble learning across multiple complete models. It operates at the model level, orchestrating full-fledged LLMs through the prompt interface. There's no internal routing mechanism or specialized sub-networks—just multiple independent models collaborating through successive layers. MoA is a property of the LLM system, not the individual LLMs.

The response processing differs fundamentally. In MoE, the gating network picks a set of experts to complete a job, and their outputs are typically combined through weighted averaging at the activation level. In MoA, teams build on the work of previous teams, improving the outcome at each stage through prompt-level synthesis. The MoA aggregator doesn't just select the best answer—it performs complex aggregation that seamlessly blends the most optimal outputs from all the LLMs within a layer.

From a practical deployment perspective, these differences have major implications:

  • MoE requires training infrastructure to build and fine-tune the gating mechanism and expert sub-networks. You can't easily add or remove experts post-training. However, at inference time, MoE is often more efficient because only a subset of experts activate for each input.

  • MoA requires no training or fine-tuning at all—you can assemble it from any collection of existing models. You can easily swap models in and out, experiment with different layer configurations, or adjust the architecture based on task requirements. However, inference involves running multiple complete models sequentially, which can be more resource-intensive and introduce latency.

The choice between MoE and MoA depends on your constraints and goals. If you're building a model from scratch and want to optimize for inference efficiency, MoE provides elegant internal routing. If you want to leverage existing pre-trained models without fine-tuning, and you're willing to accept higher inference costs for better performance, MoA offers immediate applicability and flexibility.

Production Deployment: Challenges and Optimizations

Despite impressive benchmark results, deploying MoA in production environments presents significant engineering challenges around reliability, latency, coordination, and cost.

Reliability and Unpredictability

Reliability—achieving consistent correct behavior over time—remains the top development challenge for agentic systems. LLM agents are notoriously unpredictable; small perturbations in input can lead to wildly divergent outputs despite careful prompt engineering and constitutional guidelines. In MoA, this unpredictability compounds across layers. If one proposer generates an unexpected output, it can cascade through subsequent layers, potentially derailing the entire synthesis process.

The compound nature of errors in agentic systems means that minor issues for traditional software can derail agents entirely. One step failing can cause agents to explore entirely different trajectories, leading to unpredictable outcomes. With MoA's layered architecture, error propagation becomes a critical concern—a failure in Layer 2 can invalidate all subsequent layers' work.

Latency and Synchronous Execution Bottlenecks

Synchronous execution creates bottlenecks: the lead agent dispatches subagents and waits for each batch to complete before proceeding. This simplifies coordination but constrains information flow, since the lead agent can't steer subagents mid-execution, subagents can't coordinate with each other in real time, and the entire system can be blocked while waiting for a single slow agent to finish.

MoA inference can suffer from dense inter-agent communication and low hardware utilization, which jointly inflate serving latency. In a naive implementation, each layer must fully complete before the next layer begins, creating a sequential pipeline that multiplies latency across the layer depth.

Recent research addresses this through several optimizations:

  1. Tree-Structured Routing: Replacing dense agent interaction graphs with hierarchical tree topology reduces communication overhead and enables more efficient parallelization.

  2. Adaptive Early Exit Mechanisms: Introducing mechanisms that allow the system to stop early if confidence thresholds are met, avoiding unnecessary computation in deeper layers when earlier layers have already converged on a high-quality response.

  3. Dependency-Aware Prefill-Decode Overlap: Pipelining agent execution by overlapping the prefill phase (processing the prompt and context) with the decode phase (generating tokens) across dependency-related agents.

These optimizations substantially reduce end-to-end latency—up to 90% in some configurations—while maintaining comparable accuracy to the full MoA pipeline. This makes MoA increasingly viable for latency-sensitive production applications.
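Two of these ideas, intra-layer parallelism and adaptive early exit, are straightforward to sketch with standard thread pools; real deployments would parallelize network-bound API calls the same way:

```python
from concurrent.futures import ThreadPoolExecutor

def run_layer_parallel(prompt, agents, references):
    """Run one layer's agents concurrently, so layer latency is bounded
    by the slowest agent instead of the sum of all agents."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = [pool.submit(a, prompt, references) for a in agents]
        return [f.result() for f in futures]

def run_moa_early_exit(prompt, layers, converged):
    """Adaptive early exit: stop iterating once `converged(responses)`
    judges the current layer's outputs good enough. Here a simple
    agreement check stands in for a real confidence metric."""
    references = []
    for layer in layers:
        references = run_layer_parallel(prompt, layer, references)
        if converged(references):
            break
    return references[0]

layers = [
    [lambda p, r: "v1", lambda p, r: "v1"],  # proposers already agree...
    [lambda p, r: "deep refinement"],        # ...so this layer is skipped
]
print(run_moa_early_exit("q", layers, lambda rs: len(set(rs)) == 1))  # v1
```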

Coordination Complexity and State Management

Systems with multiple agents introduce new challenges in agent coordination, evaluation, and reliability. Multi-Agent Systems can become messy with unreliable outputs, token budgets lost to coordination chatter, and performance that drifts or sometimes worsens instead of improving.

For MoA specifically, managing the flow of outputs between layers requires careful orchestration. You need mechanisms to:

  • Collect all outputs from Layer N before proceeding to Layer N+1
  • Format and inject those outputs as context for the next layer's prompts
  • Handle timeouts or failures gracefully without blocking the entire pipeline
  • Track which models generated which outputs for debugging and evaluation
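A sketch of the collection and failure-handling mechanics, assuming each agent is a plain callable; agents that time out or crash are dropped and logged rather than allowed to block the layer:

```python
from concurrent.futures import ThreadPoolExecutor

def collect_layer(prompt, agents, references, timeout_s=30.0):
    """Collect one layer's outputs without letting a slow or crashing
    agent block the pipeline: any agent that times out or raises is
    dropped, and (agent index, error) is recorded for debugging."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = [(i, pool.submit(a, prompt, references))
                   for i, a in enumerate(agents)]
        for i, fut in futures:
            try:
                results.append((i, fut.result(timeout=timeout_s)))
            except Exception as exc:  # covers timeouts and agent errors
                failures.append((i, repr(exc)))
    return results, failures
```

The index kept alongside each result is what lets you trace which model produced which output downstream.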

Cost Tradeoffs

Agentic systems often trade latency and cost for better task performance. MoA amplifies this tradeoff by running multiple complete models across multiple layers. A 4-layer MoA with 3 proposers per layer could involve 12+ model invocations per query, multiplying inference costs compared to a single model call.

However, MoA's cost-optimal position on the Pareto frontier means that for a given performance target, MoA using open-source models can be cheaper than proprietary models. The key is optimizing the layer configuration and model selection to avoid redundant computation while maintaining performance gains.

Splitting tasks into smaller, simpler, and more narrowly scoped LLM calls has reliably improved latency, cost, and reliability in production multi-agent systems. This principle applies to MoA: carefully scoping what each layer's proposers and aggregators are responsible for reduces the complexity and cost of each individual model call.

Evaluation and Observability

Due to limited available benchmarks and challenges in creating them, 75% of teams forgo formal benchmarking of agent systems, relying instead on A/B testing or expert feedback. Agents make dynamic decisions and are non-deterministic between runs, even with identical prompts, which makes debugging harder.

For MoA, observability becomes critical. You need visibility into:

  • What each proposer in each layer generated
  • How the aggregator synthesized those inputs
  • Which layers contributed most to the final output quality
  • Where errors or quality degradation occurred in the pipeline

Implementing comprehensive logging, tracing, and evaluation frameworks is essential for production MoA deployments. Tools like LangSmith, Langfuse, or OpenTelemetry-based observability stacks can provide the necessary visibility into multi-layer agent interactions.
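Even without a full tracing stack, a thin wrapper can capture the per-agent visibility listed above. This sketch records each call's layer, inputs, output, and latency into a plain list:

```python
import time

def traced(agent, layer, idx, trace):
    """Wrap an agent callable so every invocation appends its layer,
    input size, output, and latency to `trace` -- a minimal stand-in
    for a tracing stack like LangSmith or OpenTelemetry."""
    def wrapped(prompt, refs):
        start = time.perf_counter()
        out = agent(prompt, refs)
        trace.append({
            "layer": layer,
            "agent": idx,
            "n_refs": len(refs),
            "output": out,
            "latency_s": time.perf_counter() - start,
        })
        return out
    return wrapped

trace = []
agent = traced(lambda p, r: f"answer to {p}", layer=1, idx=0, trace=trace)
agent("q", [])
print(trace[0]["layer"], trace[0]["output"])  # 1 answer to q
```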

Together.ai Implementation and Open-Source Ecosystem

Together.ai released the original Mixture-of-Agents implementation as open-source, making it accessible for practitioners to experiment with and deploy. The implementation is available on GitHub at the togethercomputer/MoA repository, providing a reference architecture and integration with Together.ai's inference API.

How the Together MoA Works

Given a prompt, Together MoA sends it to 4 different open-source LLMs in the first layer. These proposers independently generate responses based on the prompt. The system then collects all 4 responses and combines them as input to a final aggregator LLM, which synthesizes all 4 responses into an ideal response.

To get started with Together MoA, you need to:

  1. Install the Together Python library
  2. Obtain a Together API key
  3. Configure the models for each layer (proposers and aggregators)
  4. Make API calls following the layered architecture pattern

The Together.ai platform provides optimized inference infrastructure for running multiple models efficiently, reducing some of the latency overhead inherent in MoA architectures.
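The steps above translate into a short script. The sketch below leaves model identifiers as parameters, and the `together` client calls mirror the library's OpenAI-style chat interface as we understand it; verify names and response shapes against the current Together documentation before relying on this:

```python
AGGREGATOR_INSTRUCTIONS = (
    "You have been provided with responses from several models. "
    "Synthesize them into one high-quality answer."
)

def format_aggregator_messages(user_prompt, proposals):
    """Build the aggregator's chat messages, injecting proposer
    outputs as numbered references in the system prompt."""
    refs = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(proposals))
    return [
        {"role": "system", "content": AGGREGATOR_INSTRUCTIONS + "\n\n" + refs},
        {"role": "user", "content": user_prompt},
    ]

def together_moa(user_prompt, proposer_models, aggregator_model, api_key):
    """Two-layer MoA over the Together chat API (interface assumed to
    match the public `together` Python library; pip install together)."""
    from together import Together
    client = Together(api_key=api_key)
    proposals = [
        client.chat.completions.create(
            model=m,
            messages=[{"role": "user", "content": user_prompt}],
            temperature=0.7,
        ).choices[0].message.content
        for m in proposer_models
    ]
    resp = client.chat.completions.create(
        model=aggregator_model,
        messages=format_aggregator_messages(user_prompt, proposals),
        temperature=0.2,
    )
    return resp.choices[0].message.content
```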

Community Adaptations and Extensions

Beyond the official implementation, the open-source community has created several adaptations:

Multi-Platform Support: The original implementation only supported Together.ai endpoints, but developers have extended it to enable use of other LLM endpoints including OpenAI, Anthropic, and local models through Ollama. This makes MoA accessible regardless of your model hosting preference.

Local Deployment: Adaptations of the original work by Together.ai have been tailored for local model usage with user-friendly Gradio interfaces, enabling experimentation without API costs or network latency.

Groq Integration: Some implementations integrate with Groq's ultra-low-latency inference infrastructure, addressing one of MoA's key production challenges by dramatically reducing per-model inference time.

Self-MoA Implementations: Following the recent research, implementations of Self-MoA (sampling from a single model) have emerged on GitHub, allowing practitioners to compare multi-model vs. single-model approaches for their specific use cases.

The open-source ecosystem around MoA demonstrates strong community interest in collaborative LLM architectures. As the technology matures, we're likely to see more sophisticated tooling for layer configuration, model selection, and production optimization.

Practical Considerations for Model Selection

Effective MoA deployment requires thoughtful model selection for proposers and aggregators. Not all models perform equally well in each role, and the composition of your MoA layers significantly impacts performance.

Proposer Selection

Good proposers should excel at generating diverse, contextually rich reference responses rather than necessarily producing the highest-quality standalone outputs. Metrics for evaluating proposer quality include:

  • Diversity of responses: When sampled multiple times, does the model generate genuinely different approaches or just surface-level variations?
  • Information density: Do the responses provide additional context, alternative framings, or unique perspectives that aggregators can leverage?
  • Complementary strengths: If using multiple proposers in a layer, do they bring different capabilities (e.g., one strong at reasoning, another at factual recall)?

For open-source deployments, combinations like Qwen, WizardLM, and Llama-based models in the proposer layer have shown strong results, each bringing different training data distributions and architectural optimizations.

Aggregator Selection

Aggregators require strong synthesis and reasoning capabilities to blend multiple inputs coherently. Key criteria include:

  • Multi-document reasoning: Can the model effectively process and integrate information from multiple sources?
  • Coherence maintenance: Does it produce unified responses rather than disjointed summaries?
  • Quality filtering: Can it identify and emphasize higher-quality elements from proposer outputs while discarding lower-quality content?

Stronger models like Qwen-72B, DeepSeek-V2, or Llama-3-70B typically perform better as final aggregators, where synthesis quality directly determines the output users receive.

Temperature and Sampling Settings

In the context of MoA, responses can be generated by the same LLM with a temperature of 0.7 to introduce diversity, or by different LLMs as part of a multiple-proposer approach. Temperature-based sampling is a common strategy to increase diversity, but for tasks requiring high precision (e.g., mathematical reasoning), uncontrolled high temperature sampling can degrade reasoning quality.

Recent research proposes selective sampling methods that dynamically switch between greedy and high-temperature sampling based on a sampling risk metric, optimizing the quality-diversity tradeoff. For MoA implementations, this suggests:

  • Use moderate temperature (0.6-0.8) for proposers to encourage diversity without sacrificing too much quality
  • Use lower temperature (0.1-0.3) for final aggregators to ensure coherent, high-quality synthesis
  • Consider task-specific tuning: creative tasks benefit from higher proposer temperature, while reasoning tasks require tighter control
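These guidelines can be encoded as a small per-role policy. The specific values below are illustrative starting points within the ranges above, not tuned recommendations:

```python
def pick_temperature(role, task="general"):
    """Illustrative temperature policy for MoA roles; the values are
    assumptions within the ranges discussed, not tuned settings."""
    if role == "aggregator":
        return 0.2   # low temperature: coherent, high-quality synthesis
    if task == "reasoning":
        return 0.3   # tighter control where precision matters
    if task == "creative":
        return 0.9   # push proposer diversity for open-ended tasks
    return 0.7       # general-purpose proposer default

print(pick_temperature("aggregator"), pick_temperature("proposer", "creative"))
```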

The Self-MoA Configuration Question

Following the Self-MoA research, practitioners now face a configuration decision:

  • Multi-Model MoA: Use heterogeneous models across layers, betting on complementary strengths and diverse perspectives
  • Self-MoA: Use a single top-performing model with multiple samples, betting on quality consolidation and in-model diversity

The choice depends on your specific use case:

  • For general instruction-following and task performance, Self-MoA appears to offer better results with lower operational complexity (only one model to deploy and manage)
  • For specialized domains where complementary expertise matters (e.g., combining a code-specialized model with a reasoning-specialized model), multi-model MoA may still provide advantages
  • For cost-constrained scenarios, Self-MoA with a smaller high-quality model may outperform multi-model MoA with multiple larger models

Experimentation is key. The benchmarks provide general guidance, but your specific task distribution and quality requirements will determine which approach works best.

Future Directions and Open Questions

The emergence of Self-MoA has opened new research directions while raising fundamental questions about collaborative LLM architectures.

Quality Thresholds and Collaborative Benefits

We need a better theoretical understanding of when and why the collaborativeness phenomenon occurs. What are the quality thresholds below which diversity introduces more harm than help? Can we develop metrics to predict whether a given model will be a good proposer or aggregator before running expensive evaluations? Understanding these dynamics could enable more principled MoA layer design.

Adaptive Layer Depth

Current MoA implementations use fixed layer depths, but different queries likely benefit from different amounts of iteration. Simple queries might achieve optimal quality in 2 layers, while complex multi-faceted questions might benefit from 5+ layers. Research into adaptive layer depth—where the system decides when to stop iterating based on convergence metrics—could optimize the latency-quality tradeoff on a per-query basis.

Hybrid Approaches

Rather than choosing purely between multi-model MoA and Self-MoA, hybrid approaches might combine the strengths of both. For example, use heterogeneous proposers in Layer 1 to capture diverse perspectives, then use Self-MoA with the top-performing model in subsequent layers to refine quality without degradation. Or use multi-model MoA for specialized tasks while defaulting to Self-MoA for general instruction-following.

Integration with Retrieval and Tools

MoA has been evaluated primarily on language-only tasks, but many production applications require integration with retrieval systems (RAG) and tool use. How does MoA interact with these capabilities? Should retrieval happen once before Layer 1, or should each layer have independent retrieval? When tool calls are involved, which layer should execute them?

Long-Context Optimization

Sequential Self-MoA addresses context length limits through sliding windows, but more sophisticated approaches might enable better long-context handling. Hierarchical aggregation, where early layers handle local context and later layers handle global context, could scale MoA to document-length inputs more effectively.

Real-Time Learning and Adaptation

Current MoA implementations are static—the layer configuration and model selection don't change based on observed performance. Future systems might adapt in real-time, learning which models to use as proposers or aggregators for different query types, or adjusting layer depth based on online performance metrics.

Conclusion

Mixture of Agents represents a significant milestone in collaborative AI systems, demonstrating that model-level orchestration can achieve state-of-the-art performance without requiring fine-tuning or architectural modifications. The original MoA research validated the counterintuitive finding that LLMs exhibit inherent collaborativeness, generating better responses when exposed to outputs from other models, even lower-quality ones.

However, the emergence of Self-MoA in early 2025 has fundamentally challenged the assumptions underlying multi-model MoA. By demonstrating that a single high-quality model sampled multiple times can outperform heterogeneous model ensembles, Self-MoA suggests that quality consolidation may matter more than diversity—at least for general instruction-following tasks. This doesn't invalidate MoA, but it refines our understanding of when and why collaborative approaches work.

For practitioners in 2026, the key takeaway is that MoA and Self-MoA are both valuable tools, each suited to different scenarios:

  • Self-MoA offers simplicity, operational efficiency, and superior performance for general tasks where a single strong model can capture sufficient diversity through sampling
  • Multi-Model MoA provides complementary expertise and diverse perspectives that may still be valuable for specialized domains or complex reasoning tasks

Production deployment requires careful attention to latency optimization, coordination complexity, error handling, and observability. Recent advances in tree-structured routing and dependency-aware execution have made MoA more viable for latency-sensitive applications, achieving up to 90% latency reduction while maintaining quality.

As we move through 2026, the debate between diversity-focused and quality-focused collaborative approaches will continue to shape the field. The fundamental insight—that LLMs can benefit from iterative refinement and synthesis of multiple perspectives—remains valid. The question is how to best harness that insight while navigating the tradeoffs between model diversity, quality thresholds, and operational complexity.

Mixture of Agents, in whatever form it ultimately takes, represents an important piece of the puzzle in building more capable, reliable, and cost-effective AI systems. The journey from the original MoA to Self-MoA and beyond will continue to reveal deeper insights about how language models collaborate, synthesize information, and achieve emergent capabilities through system-level design.

