Mixture of Agents (MoA) is a multi-model collaboration framework that combines multiple large language models (LLMs) in a layered architecture, where models in each layer take the outputs of models from the previous layer as additional context to produce improved responses. First introduced in the paper "Mixture-of-Agents Enhances Large Language Model Capabilities" by Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou from Together AI and affiliated universities, MoA demonstrated that combining several open-source LLMs could surpass the performance of frontier proprietary models like GPT-4 on major benchmarks. The paper was published as a Spotlight presentation at ICLR 2025.
MoA draws on the long tradition of ensemble methods in machine learning, where aggregating the predictions of multiple models yields better results than any single model alone. However, rather than combining numerical predictions or class labels, MoA operates at the level of natural language: one model reads and synthesizes the full text outputs of several other models to produce a single, higher-quality response.
The performance of individual LLMs, no matter how large, is bounded by the data, architecture, and training procedures used to build them. Different models develop different strengths. Some excel at creative writing; others perform better on mathematical reasoning or factual recall. Researchers have long observed that presenting an LLM with reference outputs from other models tends to improve the quality of its own response, even when those reference outputs come from weaker models. Wang et al. (2024) formalized this observation as the collaborativeness property of LLMs and built the MoA framework around it.
The motivation behind MoA also stems from practical considerations in the open-source LLM ecosystem. While proprietary models such as GPT-4o and Claude consistently ranked at the top of public leaderboards in 2024, a growing collection of capable open-source models, including LLaMA 3, Mixtral, Qwen, WizardLM, and DBRX, offered competitive performance at lower cost. MoA provided a principled way to combine these models so that their collective output could match or exceed that of any single proprietary model.
The MoA architecture arranges multiple LLMs into sequential layers. In the default configuration described by Wang et al., the system uses three layers with six models per layer. Each model in a given layer receives two inputs: the original user query and the concatenated outputs from all models in the preceding layer. The model in the final layer produces the system's output.
MoA assigns two conceptual roles to the participating models:
| Role | Description | Example Models |
|---|---|---|
| Proposer | Generates diverse candidate responses that serve as reference material for subsequent layers. Proposers are chosen for their ability to provide useful, varied perspectives on a given query. | WizardLM-8x22B, Mixtral-8x22B, DBRX-Instruct |
| Aggregator | Synthesizes the outputs of proposers into a single, high-quality response. Aggregators are selected for their ability to evaluate, compare, and merge information from multiple sources. | Qwen1.5-110B-Chat, LLaMA 3-70B-Instruct |
In the first layer, each proposer model independently generates a response to the user query. In subsequent layers, each model receives both the original query and all outputs from the previous layer, using them as auxiliary context. The final layer typically contains a single strong aggregator model that produces the definitive output.
The aggregator receives a system prompt instructing it to "synthesize these responses into a single, high-quality response" and to "critically evaluate the information provided" rather than simply replicating any of the given answers.
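As an illustration, such an aggregation prompt might be assembled along the following lines. This is a minimal sketch: `build_aggregator_prompt` is a hypothetical helper, and the wording paraphrases the instructions quoted above rather than reproducing the paper's verbatim system prompt.

```python
def build_aggregator_prompt(responses):
    """Assemble an aggregation prompt from proposer outputs.

    Illustrative sketch only; the wording paraphrases the quoted
    instructions and is not the paper's exact system prompt.
    """
    # Number the proposer outputs so the aggregator can refer to them.
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(responses))
    return (
        "You have been provided with responses from various models to the "
        "latest user query. Synthesize these responses into a single, "
        "high-quality response. Critically evaluate the information provided, "
        "recognizing that some of it may be biased or incorrect, rather than "
        "simply replicating any of the given answers.\n\n"
        "Responses from models:\n" + numbered
    )

prompt = build_aggregator_prompt(["Answer from model A.", "Answer from model B."])
```

The user's original query is then sent alongside this system prompt, so the aggregator sees both the question and the candidate answers.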
The processing flow can be summarized as follows:

1. Each proposer in the first layer independently generates a response to the user query.
2. Each model in an intermediate layer receives the original query together with the concatenated outputs of the previous layer and produces a refined response.
3. The aggregator in the final layer synthesizes the previous layer's outputs into the single response returned to the user.
Models within the same layer can run in parallel since they do not depend on each other, only on the outputs of the previous layer. This parallelism helps control latency, though the sequential nature of the layers still introduces overhead compared to querying a single model.
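The layered flow, with intra-layer parallelism, can be sketched as follows. This is an illustrative sketch: `query_model` is a stub standing in for a real inference call, and the model names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def query_model(model, query, references):
    """Stub for an inference call; a real implementation would send the
    query (plus the reference outputs) to the model's API endpoint."""
    return f"[{model}] response to {query!r} given {len(references)} references"

def run_layer(models, query, references):
    # Models within a layer are independent, so they can run concurrently;
    # only the layer boundaries are sequential.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(query_model, m, query, references) for m in models]
        return [f.result() for f in futures]

def mixture_of_agents(layers, query):
    references = []  # the first layer sees only the user query
    for models in layers:
        references = run_layer(models, query, references)
    return references[-1]  # output of the final (aggregator) layer

layers = [
    ["model-a", "model-b", "model-c"],  # proposer layer 1
    ["model-a", "model-b", "model-c"],  # proposer layer 2
    ["aggregator-model"],               # single final aggregator
]
answer = mixture_of_agents(layers, "What is MoA?")
```

Because each layer blocks on the slowest model in the layer below it, the wall-clock time of this loop is governed by the layer boundaries, not by the total number of model calls.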
A central finding of the MoA paper is that LLMs exhibit an inherent collaborativeness: they tend to generate better responses when presented with outputs from other models, even when those other models are individually less capable. This property holds across a variety of model families and sizes. The researchers demonstrated this empirically by showing that presenting a model with reference outputs from any combination of other models improved its performance compared to generating a response from scratch.
This finding is significant because it suggests that the quality gains from MoA are not simply due to picking the best response from a pool of candidates. Instead, the aggregator model actively integrates and improves upon the information provided by proposers, producing outputs that are qualitatively better than any individual input.
The MoA framework achieved state-of-the-art results on several prominent LLM benchmarks at the time of publication. All results below are from the original Wang et al. (2024) paper.
AlpacaEval 2.0 measures the length-controlled (LC) win rate of a model's responses compared to a reference set, as judged by GPT-4.
| Configuration | LC Win Rate |
|---|---|
| MoA with GPT-4o as aggregator | 65.7% (+/- 0.7%) |
| MoA (open-source models only) | 65.1% (+/- 0.6%) |
| MoA-Lite (2 layers, open-source) | 59.3% (+/- 0.2%) |
| GPT-4o (single model) | 57.5% |
The open-source-only MoA configuration achieved a 7.6 percentage point improvement over GPT-4o, a striking result given that none of the individual open-source models involved could match GPT-4o on their own.
MT-Bench evaluates multi-turn conversational ability on a scale of 1 to 10, with GPT-4 serving as the judge.
| Configuration | Score |
|---|---|
| MoA with GPT-4o | 9.40 (+/- 0.06) |
| MoA (open-source only) | 9.25 (+/- 0.10) |
| MoA-Lite | 9.18 (+/- 0.09) |
| GPT-4o | 9.19 |
| GPT-4 Turbo | 9.31 |
FLASK provides fine-grained evaluation across multiple dimensions of response quality. The MoA configuration with Qwen1.5-110B-Chat as the aggregator outperformed GPT-4o in five specific categories: correctness, factuality, insightfulness, completeness, and metacognition. MoA also showed substantial improvements in robustness and commonsense reasoning.
The default MoA setup described in the paper used six open-source models across all proposer layers, with Qwen1.5-110B-Chat serving as the final aggregator:
| Model | Role | Organization |
|---|---|---|
| Qwen1.5-110B-Chat | Proposer and Final Aggregator | Alibaba |
| Qwen1.5-72B-Chat | Proposer | Alibaba |
| WizardLM-8x22B | Proposer | Microsoft |
| LLaMA-3-70B-Instruct | Proposer | Meta |
| Mixtral-8x22B-v0.1 | Proposer | Mistral AI |
| DBRX-Instruct | Proposer | Databricks |
The researchers found that not all models perform equally well in both roles. For example, WizardLM performed significantly better as a proposer (63.8% on AlpacaEval 2.0) than as an aggregator (52.9%), while Qwen and LLaMA-3 excelled in both roles. This distinction suggests that choosing the right model for each role is important for maximizing MoA performance.
Mixture of Agents is frequently confused with Mixture of Experts (MoE) due to the similar naming, but the two approaches operate at fundamentally different levels of abstraction.
| Aspect | Mixture of Agents (MoA) | Mixture of Experts (MoE) |
|---|---|---|
| Level of operation | System level: coordinates multiple complete, independent models | Architecture level: operates within a single model's internal layers |
| Components | Full LLMs (e.g., LLaMA, Mixtral, Qwen), each with their own weights and training | Specialized sub-networks ("experts") within one model that share an embedding layer |
| Routing mechanism | All proposers process every query; no selective routing | A learned gating network selects a sparse subset of experts per input token |
| Communication | Natural language: models read each other's text outputs | Numerical: activations are routed through selected expert sub-networks |
| Training | No joint training required; uses pre-trained models as-is | End-to-end training of experts and gating network together |
| Goal | Improve output quality through iterative refinement across models | Increase model capacity while keeping compute cost manageable via sparse activation |
In short, MoE is an architectural pattern used inside a single neural network (as seen in models like Mixtral-8x7B and GPT-4), while MoA is an orchestration pattern that sits on top of multiple independently trained models.
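To make the routing contrast concrete, here is a minimal, pure-Python sketch of the sparse top-k gating used inside MoE layers. It is illustrative only: real gates operate on learned, per-token logits, and the numbers below are made up.

```python
import math

def top_k_gate(gate_logits, k=2):
    """Softmax over the gate logits, keep only the k highest-weight
    experts, and renormalize so the kept weights sum to one. Experts
    outside the top k are never evaluated for this token."""
    exps = [math.exp(x) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    kept = sum(probs[i] for i in top)
    return {i: probs[i] / kept for i in top}

# Eight experts with k=2, as in Mixtral-8x7B: only two sub-networks
# run for this token; the other six are skipped entirely.
weights = top_k_gate([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
```

Nothing analogous exists in MoA: every proposer processes every query in full, and the "routing" happens in natural language inside the aggregator's prompt.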
MoA can be understood as an extension of classical ensemble learning techniques to the domain of generative language models. Traditional ensemble methods in machine learning include:

- **Bagging**: training multiple models on bootstrapped samples of the data and averaging their predictions, as in random forests
- **Boosting**: training models sequentially so that each new model focuses on the examples its predecessors got wrong
- **Stacking**: training a meta-learner to combine the predictions of several base learners
- **Voting**: selecting the majority or plurality prediction among several models
MoA most closely resembles stacking: the proposer models serve as base learners, and the aggregator serves as the meta-learner that combines their outputs. However, there are important differences. In traditional stacking, the meta-learner typically works with numerical predictions or probability distributions. In MoA, the aggregator works with full natural language responses, giving it far more information to work with but also requiring a much more sophisticated combination strategy.
Another key difference is that traditional ensemble methods usually require retraining or fine-tuning the meta-learner. MoA, by contrast, uses off-the-shelf pre-trained LLMs and relies entirely on prompting to instruct the aggregator on how to combine the proposer outputs. This makes MoA easy to implement but also means the combination strategy is limited by the aggregator model's ability to follow instructions.
MoA is part of a broader family of multi-agent LLM collaboration patterns that emerged in 2023 and 2024. Several related approaches share the core idea of using multiple LLM instances to improve output quality.
In the influential paper "Improving Factuality and Reasoning in Language Models through Multiagent Debate" (Du et al., 2023), researchers from MIT and elsewhere showed that having multiple LLM instances debate their answers over several rounds improved both factual accuracy and mathematical reasoning. In their approach, each model proposes an initial answer, reads the other models' answers, and then revises its own response. This process repeats for multiple rounds until the models converge on a common answer. The paper was published at ICML 2024.
While multiagent debate uses a peer-to-peer structure where all models participate equally in each round, MoA uses a more hierarchical structure with distinct proposer and aggregator roles. Both approaches leverage the finding that models improve when exposed to other models' outputs.
The LLM-as-a-Judge paradigm, formalized by Zheng et al. (2023) in their MT-Bench paper, uses a strong LLM (typically GPT-4) to evaluate the quality of outputs from other models. While LLM-as-a-Judge focuses on evaluation rather than generation, it shares with MoA the principle that one model can meaningfully assess and rank the outputs of other models. MoA's aggregator role can be seen as an extension of the judge concept: rather than merely selecting the best response, the aggregator synthesizes a new, improved response drawing on all inputs.
Model routing directs each query to the single most appropriate model based on the query's characteristics, while model cascading tries progressively more powerful (and expensive) models until a satisfactory answer is produced. Unlike MoA, both routing and cascading select one model's output as the final answer. MoA, by contrast, always uses all proposer models and combines their outputs through the aggregator.
| Pattern | How Models Interact | Number of Models Used per Query | Output Selection |
|---|---|---|---|
| MoA | Sequential layers; aggregator synthesizes all outputs | All models in every layer | Synthesized new response |
| Multiagent Debate | Peer-to-peer; models revise based on others' outputs | All models in every round | Converged consensus |
| Model Routing | Router selects one model per query | One | Selected model's output |
| Model Cascading | Sequential; stops when confidence is high enough | One to many (escalating) | First sufficient output |
| LLM-as-a-Judge | Judge evaluates candidate outputs | One judge plus candidates | Best candidate selected |
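The cascading row above amounts to a stop-on-confidence loop, sketched below. This is an illustrative sketch: the `(answer, confidence)` interface and `FakeModel` are assumptions, since real systems typically derive confidence from log-probabilities or a separate verifier.

```python
def cascade(models, query, threshold=0.8):
    """Try models from cheapest to most capable, returning the first
    answer whose confidence clears the threshold; if none does, fall
    back to the last (strongest) model's answer."""
    answer = None
    for model in models:
        answer, confidence = model.answer(query)
        if confidence >= threshold:
            break  # escalation stops as soon as the answer is good enough
    return answer

class FakeModel:
    """Stand-in for a real model client, used only for illustration."""
    def __init__(self, name, confidence):
        self.name, self.confidence = name, confidence
    def answer(self, query):
        return f"{self.name}: answer to {query!r}", self.confidence

result = cascade([FakeModel("small", 0.6), FakeModel("large", 0.95)], "q")
```

MoA has no such early exit: every proposer always runs, and the aggregator always synthesizes.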
The primary drawback of MoA is its increased computational cost and latency. Running six models across three layers means the system processes roughly 18 model invocations per query (six per layer times three layers), compared to a single invocation for a standard LLM call. This translates directly into higher API costs and longer response times.
Despite the higher absolute number of model calls, MoA can be cost-competitive when compared to frontier proprietary models. The original paper showed that MoA configurations lie on a Pareto frontier of cost versus quality: for any given quality level, MoA offers the cheapest way to achieve that level using open-source models.
Latency is a more significant concern than cost for many applications. While proposer models within a layer can run in parallel, the sequential dependency between layers means the total latency is at least the sum, across layers, of the slowest model's response time in each layer. The Time to First Token (TTFT) is particularly affected, making MoA less suitable for real-time, interactive applications where users expect immediate streaming responses.
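A back-of-the-envelope model of this overhead, using the default three-layers-of-six configuration; the per-model latencies and per-call cost below are made-up illustrative numbers, not measurements.

```python
def moa_overhead(layer_latencies, cost_per_call):
    """Estimate total invocations, total cost, and critical-path latency.

    Each inner list holds per-model latencies (seconds) for one layer.
    Models within a layer run in parallel, so each layer contributes
    only its slowest model to the critical path, while cost scales
    with the total number of invocations.
    """
    calls = sum(len(layer) for layer in layer_latencies)
    total_cost = calls * cost_per_call
    critical_path = sum(max(layer) for layer in layer_latencies)
    return calls, total_cost, critical_path

# Default setup from the text: three layers of six models each.
layers = [[2.0, 3.5, 2.5, 4.0, 3.0, 2.8]] * 3
calls, cost, latency = moa_overhead(layers, cost_per_call=0.01)
# 18 invocations; the critical path is gated by the 4.0 s model in
# each layer, so latency is 3 * 4.0 = 12.0 seconds.
```

Note how the critical path depends only on the slowest model per layer, which is why swapping one slow proposer for a faster one can matter more than removing several fast ones.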
For this reason, the Together AI team recommended MoA primarily for offline processing tasks such as synthetic data generation, batch evaluation, and quality-sensitive document creation, rather than for latency-sensitive chatbot interfaces.
Several follow-up studies have examined the assumptions and limitations of the MoA approach.
In the paper "Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?" (Li et al., 2025), researchers challenged the assumption that model diversity is essential to MoA's success. They introduced Self-MoA, a variant that aggregates multiple outputs sampled from a single top-performing model rather than outputs from several different models. Notably, they found that Self-MoA matched or outperformed the standard mixed-model MoA in many of the settings they tested.
The authors argued that mixing different LLMs often lowers the average quality of inputs to the aggregator, and that MoA performance is more sensitive to input quality than to input diversity. This finding suggests that for many practical use cases, running the same strong model multiple times with different random seeds may be more effective than coordinating several different models.
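The Self-MoA idea can be sketched as follows. The `sample` and `aggregate` helpers are hypothetical and stubbed here; real versions would call an inference API and use an aggregation prompt, respectively.

```python
import random

def sample(model, query, temperature, seed):
    """Stub for one sampled completion from the model."""
    noise = random.Random(seed).random()
    return f"{model} draft {seed} (t={temperature}, noise={noise:.2f})"

def aggregate(model, query, drafts):
    """Stub aggregator; a real one would prompt the model to synthesize."""
    return f"{model} synthesis of {len(drafts)} drafts for {query!r}"

def self_moa(model, query, n_samples=6, temperature=0.7):
    # Diversity comes from repeatedly sampling the same strong model
    # at nonzero temperature, not from mixing different models.
    drafts = [sample(model, query, temperature, seed=i) for i in range(n_samples)]
    return aggregate(model, query, drafts)

out = self_moa("strong-model", "Explain MoA")
```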
Several open-source implementations of MoA are available for researchers and practitioners.
The reference implementation is available on GitHub at togethercomputer/MoA, released under the Apache 2.0 license. It includes:
- A basic implementation (`moa.py`) that runs a two-layer MoA with four language models
- An advanced implementation (`advanced-moa.py`) supporting three or more layers
- An interactive chatbot (`bot.py`) with multi-turn conversation support

Configuration is done through command-line arguments including `--aggregator` (the final response model), `--reference_models` (the proposer models), `--rounds` (number of processing layers minus one), and `--temperature`.
The implementation requires a Together AI API key, as the models are accessed through Together's inference API. Installation is straightforward:
```shell
pip install together
export TOGETHER_API_KEY={your_key}
python moa.py
```
Several community projects have adapted the MoA pattern for different model providers. These include implementations that use Claude, Gemini, and GPT-4o as the participating models, allowing users to combine proprietary frontier models rather than being limited to Together AI's supported open-source models.
The MoA concept has continued to evolve since its initial publication, with several notable extensions.
Attention-MoA (2026) enhances the standard MoA framework by introducing an inter-agent semantic attention mechanism that allows models in each layer to selectively weight the outputs from the previous layer based on relevance to the current query. The framework also adds inter-layer residual connections and an adaptive early stopping mechanism to prevent information degradation in deep configurations. According to its authors, Attention-MoA achieved a 91.15% length-controlled win rate on AlpacaEval 2.0 and outperformed proprietary models including Claude 4.5 Sonnet and GPT-4.1 using only open-source models.
Pyramid MoA (2026) addresses the cost inefficiency of standard MoA by introducing a hierarchical architecture with a lightweight router that dynamically decides whether a query needs full multi-model processing or can be answered adequately by a single model. On the GSM8K benchmark, Pyramid MoA achieved 93.0% accuracy while reducing compute costs by 61% compared to a standard MoA configuration, by halting computation early for queries that did not require the full multi-layer pipeline.
The Iterative Consensus Ensemble (ICE) approach extends MoA-style collaboration by having three LLMs critique each other in iterative rounds until consensus emerges. ICE has been shown to improve accuracy by 7 to 15 percentage points over the best single model, with up to 27% improvement in final overall accuracy in certain domains. This approach has seen particular adoption in medical and multi-domain question-answering tasks.
MoA and related multi-model collaboration patterns are best suited for scenarios where output quality matters more than latency or cost.
One of the most natural applications of MoA is generating high-quality training data for fine-tuning smaller models. By using MoA to produce better instruction-response pairs, researchers can create synthetic datasets that capture the collective knowledge of multiple models. The resulting fine-tuned models can then serve as faster, cheaper single-model alternatives for production deployment.
Tasks like report writing, legal document drafting, and technical documentation benefit from MoA because these tasks reward thoroughness and accuracy over speed. Running multiple models and aggregating their outputs helps catch errors, fill gaps, and produce more comprehensive documents.
MoA can serve as a high-quality evaluation pipeline for AI-generated content. By having multiple models assess a piece of content and aggregating their evaluations, organizations can build more robust quality checks than relying on a single model's judgment.
Researchers use MoA configurations to establish upper bounds on what current open-source models can achieve when combined. This helps contextualize the performance of individual models and identify where the greatest gains from collaboration occur.
MoA sits within a broader movement toward multi-agent AI systems that has accelerated through 2025 and into 2026. According to Gartner, inquiries about multi-agent systems surged by over 1,400% between Q1 2024 and Q2 2025. While MoA specifically addresses the problem of improving single-query response quality through model collaboration, the underlying insight that multiple AI models working together can outperform individual models applies across many domains.
The rise of AI agents that use tools, plan multi-step tasks, and coordinate with other agents has created a rich ecosystem of multi-model patterns. MoA can be seen as one of the simplest and most well-studied instances of this broader trend: a clean demonstration that collaboration between models produces measurable quality gains, even without complex tool use or planning mechanisms.
Mixture of Agents represents a significant contribution to the field of LLM orchestration. By formalizing the observation that language models improve when exposed to other models' outputs and building a practical layered architecture around this property, Wang et al. showed that open-source models can collectively surpass frontier proprietary models on major benchmarks. While subsequent work has questioned whether model diversity is strictly necessary and highlighted the cost and latency trade-offs involved, MoA remains an important reference point for anyone designing systems that combine multiple language models. Its influence can be seen in the growing ecosystem of multi-agent frameworks, ensemble approaches, and collaborative AI systems that continue to develop in 2025 and 2026.