MetaGPT
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,041 words
MetaGPT is an open-source multi-agent framework that organizes large language model agents into a simulated software-development company, with role-specialized agents (Product Manager, Architect, Project Manager, Engineer, QA Engineer) collaborating through codified Standardized Operating Procedures (SOPs) to turn a single-line natural-language requirement into a complete software artifact set, including a product requirements document (PRD), system design, task list, source code, and tests.[1] The framework was introduced in the paper MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework by Sirui Hong, Mingchen Zhuge, and colleagues at DeepWisdom and collaborating institutions, first posted to arXiv on 1 August 2023 (2308.00352) and later accepted as an oral presentation at the International Conference on Learning Representations 2024.[2][3] MetaGPT's central thesis is that "Code = SOP(Team)": by encoding the standardized operating procedures of human software organizations into prompt sequences and structured artifacts, cascading hallucinations and miscoordination among chained LLM agents can be substantially reduced.[1][4] The reference implementation is released on GitHub under the MIT License and, by November 2025, had surpassed 59,000 stars, making it one of the most popular multi-agent frameworks for software automation alongside AutoGen and CrewAI.[5][6] DeepWisdom subsequently commercialized the framework as MGX (later rebranded Atoms) in 2025.[7][8]
| Attribute | Value |
|---|---|
| Original name | MetaGPT |
| Type | Open-source LLM multi-agent framework |
| Domain | Automated software development, data science |
| Creator | DeepWisdom (Sirui Hong et al.) |
| First arXiv preprint | 1 August 2023 (arXiv:2308.00352)[2] |
| ICLR 2024 status | Oral presentation[3] |
| Repository | github.com/FoundationAgents/MetaGPT (formerly geekan/MetaGPT)[5] |
| License | MIT[5] |
| Primary language | Python (approximately 97.5% of codebase)[5] |
| Latest tagged release at writing | v0.8.1 (22 April 2024)[5] |
| Commercial product | MGX (Feb 2025), rebranded Atoms (Jan 2026)[7][8] |
| Core philosophy | "Code = SOP(Team)"[5] |
MetaGPT originated in the research group led by Sirui Hong at DeepWisdom (also referred to in some Chinese sources as Fuzhi), a Shenzhen-based startup focused on autonomous agents.[9] Hong is listed as co-founder of DeepWisdom and as a researcher who leads the MetaGPT team, with related work on agent platforms such as AgentStore.[9] The project was bootstrapped in mid-2023 in response to a wave of single-agent autonomous systems, including Auto-GPT and BabyAGI, that were compelling as demos but exhibited brittle behavior on multi-step software tasks and frequently produced logically inconsistent output because each step had no enforced contract with its successor.[4]
The initial public release of the repository under the GitHub handle geekan/MetaGPT occurred at the end of August 2023, immediately following the first arXiv submission of the paper on 1 August 2023.[2][5] Within weeks the project trended on GitHub's monthly charts and within months it was included in the Open100 list of top open-source achievements for 2023.[10]
The paper went through several revisions on arXiv between August 2023 and November 2024, with the seventh version (v7) representing the camera-ready ICLR 2024 manuscript.[2] OpenReview records confirm acceptance as an oral presentation at ICLR 2024, with publication dated 16 January 2024 under submission number 5488.[3] The author list expanded between preprint versions but the camera-ready paper credits fifteen authors: Sirui Hong, Mingchen Zhuge, Jonathan (Jiaqi) Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber, with collaborating affiliations including The Chinese University of Hong Kong (Shenzhen) and KAUST (where Schmidhuber holds an appointment).[11]
After the foundational paper, the MetaGPT team published a series of extensions, including Data Interpreter (arXiv:2402.18679) and AFlow.[12][13]
The repository was also moved from the geekan account to the dedicated organization FoundationAgents in 2024 while keeping the original URL as a redirect.[5]
By November 2025, the MetaGPT team reported on social media that the repository had crossed 59,400 stars; reporting by Korean and Chinese tech press described the project's commercial arm as having raised approximately RMB 220 million (about USD 30.8 million) across two 2025 rounds backed by Ant Group, Cathay Capital, Jinqiu Capital, MindWorks, Baidu Ventures, and Concept Capital.[6][14]
The MetaGPT paper opens with an explicit diagnosis of where prior LLM agent stacks failed. Naively chaining LLM calls (each agent only sees the previous agent's free-form text output) produces cascading hallucinations: a small fabrication in step k propagates into step k+1, where the next agent treats it as fact, and the error compounds.[1][4] Existing chat-style frameworks such as the early versions of Auto-GPT, BabyAGI, AgentVerse, and conversational systems built on AutoGen passed natural-language messages but did not constrain their structure, which meant ambiguity in the messages led to logic inconsistencies in code, file structure, or API design.[1][11]
The team's insight was that real software organizations solve the same problem with Standard Operating Procedures: a Product Requirements Document has a fixed schema, an architectural design produces specific named artifacts (file lists, API definitions, sequence diagrams), and code review follows a checklist. Encoding those SOPs into prompt sequences and demanding that each role produce structured output that conforms to a schema gives the next agent in the chain less room to misinterpret, and gives reviewers (human or another LLM agent) a stable surface to check.[1][11] The paper formalizes this as a meta-programming approach: the framework programs the team that programs the code, with the SOP itself being the meta-program.[11]
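The schema-constrained handoff described above can be illustrated with a minimal Python sketch. It does not use the MetaGPT package: the class names PRD and SystemDesign, their fields, and the architect_step function are hypothetical stand-ins chosen for illustration, and the LLM call is replaced by a fixed return value.

```python
from dataclasses import dataclass


@dataclass
class PRD:
    """Hypothetical schema for the Product Manager's artifact."""
    goal: str
    user_stories: list[str]
    requirements: list[str]

    def validate(self) -> None:
        # Reject artifacts with missing sections before they reach the next role.
        if not self.goal.strip():
            raise ValueError("PRD.goal must be non-empty")
        if not self.user_stories:
            raise ValueError("PRD.user_stories must be non-empty")
        if not self.requirements:
            raise ValueError("PRD.requirements must be non-empty")


@dataclass
class SystemDesign:
    """Hypothetical schema for the Architect's artifact."""
    file_list: list[str]
    interfaces: dict[str, str]  # interface name -> signature

    def validate(self) -> None:
        if not self.file_list:
            raise ValueError("SystemDesign.file_list must be non-empty")


def architect_step(prd: PRD) -> SystemDesign:
    """Stand-in for an LLM-backed Architect: accepts only a valid PRD."""
    prd.validate()
    # A real system would prompt a model here; this sketch returns a fixed design.
    design = SystemDesign(
        file_list=["main.py", "game.py"],
        interfaces={"Game.run": "def run(self) -> None"},
    )
    design.validate()
    return design


good_prd = PRD(
    goal="Write a snake game",
    user_stories=["As a player, I can steer the snake"],
    requirements=["Keyboard input", "Score display"],
)
design = architect_step(good_prd)

# An incomplete artifact is rejected at the boundary instead of propagating.
try:
    architect_step(PRD(goal="", user_stories=[], requirements=[]))
    rejected = False
except ValueError:
    rejected = True
```

The point of the sketch is the placement of the checks: each role validates what it consumes and what it emits, so a malformed artifact fails at the handoff rather than being treated as fact by the next role.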
MetaGPT models a software project as an assembly line of roles, each consuming the structured artifact produced by the previous role and emitting a new structured artifact for the next role.[1][15] In the canonical pipeline, a one-line user requirement passes through five role-specialized agents in order: Product Manager, Architect, Project Manager, Engineer, and QA Engineer.
The roles are not chosen arbitrarily: they map closely to titles in human software organizations, and the prompts that define them include role description, goals, constraints, and a list of actions that the role is allowed to take. The framework also allows custom roles to be defined by subclassing a Role class with its own action set.[5]
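As a rough illustration of how a role definition bundles a description, goal, constraints, and an allowed action list, the following toy sketch defines its own Role and Action classes. These are simplified stand-ins written for this example, not the classes shipped in the metagpt package; the TechnicalWriter role and its WriteOutline action are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    """Toy action: a name plus a function from input text to output text."""
    name: str
    run: Callable[[str], str]


@dataclass
class Role:
    """Toy base role holding the four ingredients named in the text."""
    profile: str
    goal: str
    constraints: str
    actions: list[Action] = field(default_factory=list)

    def system_prompt(self) -> str:
        # The role definition is rendered into the prompt that frames the agent.
        allowed = ", ".join(a.name for a in self.actions)
        return (
            f"You are a {self.profile}. Goal: {self.goal}. "
            f"Constraints: {self.constraints}. Allowed actions: {allowed}."
        )

    def act(self, message: str) -> str:
        # Apply the role's actions in order; only listed actions can run.
        output = message
        for action in self.actions:
            output = action.run(output)
        return output


class TechnicalWriter(Role):
    """Hypothetical custom role created by subclassing the toy Role."""

    def __init__(self) -> None:
        super().__init__(
            profile="Technical Writer",
            goal="Produce a README outline from the system design",
            constraints="Use only information present in the design",
            actions=[
                Action("WriteOutline", lambda design: f"# README\n\n## Overview\n{design}")
            ],
        )


writer = TechnicalWriter()
outline = writer.act("A snake game with main.py and game.py")
```

In the actual framework the action would wrap an LLM call and the role would be asynchronous; the sketch only shows how a subclass fixes its own profile and action set.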
Rather than have agents pass messages point-to-point (which would scale quadratically with the number of roles and create coupling between agent prompts), MetaGPT uses a shared message pool with a publish/subscribe protocol.[1][16] Every agent publishes its structured output to a central pool. Other agents declare a subscription profile describing which message types they care about (for example, the Engineer subscribes to messages of type "Task" produced by the Project Manager). When a relevant message arrives, the framework routes it to the subscribing agent's input queue.[16]
Two properties follow from this design. First, the messages exchanged are structured artifacts (PRDs, designs, task lists, code, test reports) rather than chat dialogue, which keeps the surface area auditable.[1][16] Second, an agent can transparently access historical messages from the pool, removing the need to query other agents directly and avoiding circular conversations that plague chat-only systems.[1][16]
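A shared pool with subscriptions by message type can be sketched in a few lines of Python. This is an illustrative model under simplifying assumptions (synchronous delivery, one string tag per message), not MetaGPT's implementation; the names Message, MessagePool, subscribe, publish, and lookup are hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    kind: str      # artifact type, e.g. "PRD", "Design", "Task"
    sender: str    # role that produced the artifact
    content: str


class MessagePool:
    """Toy shared pool: every message is stored, then routed by type."""

    def __init__(self) -> None:
        self.history: list[Message] = []
        self._subscribers: dict[str, list[deque]] = defaultdict(list)

    def subscribe(self, kind: str) -> deque:
        # Each subscriber gets its own input queue for one message type.
        inbox: deque = deque()
        self._subscribers[kind].append(inbox)
        return inbox

    def publish(self, message: Message) -> None:
        # Store first so any agent can later read history from the pool.
        self.history.append(message)
        for inbox in self._subscribers[message.kind]:
            inbox.append(message)

    def lookup(self, kind: str) -> list[Message]:
        # Historical access without querying the producing agent directly.
        return [m for m in self.history if m.kind == kind]


pool = MessagePool()
engineer_inbox = pool.subscribe("Task")

pool.publish(Message("PRD", "ProductManager", "Snake game requirements"))
pool.publish(Message("Task", "ProjectManager", "Implement game.py"))
```

After these two publishes the Engineer's queue holds only the Task message, while the PRD remains retrievable from the pool's history. Adding a role means adding one subscription, not a new link to every existing role.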
A practical contribution of MetaGPT, particularly emphasized in the paper's ablation, is the executable feedback mechanism. When code produced by the Engineer fails to execute, the QA Engineer (or a code-review sub-action) captures the runtime trace and re-publishes it as a structured error message. The Engineer agent then receives the trace and an instruction to fix the offending file, iterating until tests pass or a budget is exhausted.[11] The paper attributes a 5.4 percentage point absolute improvement on the MBPP benchmark to this loop alone.[11]
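The run, capture, fix, retry cycle can be sketched as follows. This is a minimal sketch under stated assumptions: the function names run_candidate and debug_loop are hypothetical, the toy_fix function stands in for the Engineer's LLM call, and the code is executed in-process with exec, which a real system would replace with a sandbox.

```python
import traceback
from typing import Callable, Optional


def run_candidate(source: str, test: Callable[[dict], None]) -> Optional[str]:
    """Execute the code and its test; return None on success, else the trace."""
    namespace: dict = {}
    try:
        exec(source, namespace)   # illustration only; not sandboxed
        test(namespace)
        return None
    except Exception:
        return traceback.format_exc()


def debug_loop(
    source: str,
    test: Callable[[dict], None],
    fix: Callable[[str, str], str],
    budget: int = 3,
) -> tuple[str, int]:
    """Retry until the test passes or the run budget is exhausted."""
    trace = None
    for attempt in range(1, budget + 1):
        trace = run_candidate(source, test)
        if trace is None:
            return source, attempt
        if attempt < budget:
            # Feed the runtime trace back to the code producer.
            source = fix(source, trace)
    raise RuntimeError(f"budget of {budget} runs exhausted; last trace:\n{trace}")


buggy = "def add(a, b):\n    return a - b\n"


def check_add(namespace: dict) -> None:
    assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"


def toy_fix(source: str, trace: str) -> str:
    # Stand-in for the Engineer agent: a real system would prompt a model
    # with the source and the trace. Here the repair is hard-coded.
    return source.replace("a - b", "a + b")


fixed, runs = debug_loop(buggy, check_add, toy_fix)
```

The explicit budget matters: without it, a fix function that never converges would loop indefinitely, which is one of the failure modes practitioners report for feedback cycles.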
The paper publishes detailed token-cost tables. On its SoftwareDev benchmark, a full MetaGPT run uses approximately 31,255 tokens per project compared with ChatDev's 19,292; in absolute terms MetaGPT is more expensive per project.[11] However, MetaGPT generates substantially more code per project, so the per-line cost is lower: roughly 124.3 tokens per generated code line versus ChatDev's 248.9 tokens per line.[11] Independent reproductions have reported overall per-task dollar costs frequently above ten USD on HumanEval when running with GPT-4, with token-duplication rates above 70% in some runs because the same artifacts are quoted into multiple agent prompts.[17]
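The two cost views are consistent once the amount of generated code is taken into account. The short calculation below derives the implied lines of code per project from the figures quoted above; the line counts are computed here by division and are approximate.

```python
# Figures quoted in the text for the SoftwareDev benchmark.
metagpt_tokens, metagpt_tokens_per_line = 31_255, 124.3
chatdev_tokens, chatdev_tokens_per_line = 19_292, 248.9

# Implied code lines per project = total tokens / tokens per line.
metagpt_lines = metagpt_tokens / metagpt_tokens_per_line   # about 251 lines
chatdev_lines = chatdev_tokens / chatdev_tokens_per_line   # about 77.5 lines

# MetaGPT spends about 1.6x the tokens but emits about 3.2x the code.
token_ratio = metagpt_tokens / chatdev_tokens
line_ratio = metagpt_lines / chatdev_lines

print(f"implied lines: MetaGPT {metagpt_lines:.1f}, ChatDev {chatdev_lines:.1f}")
print(f"token ratio {token_ratio:.2f}x, line ratio {line_ratio:.2f}x")
```

Because the line ratio is roughly twice the token ratio, the per-line cost for MetaGPT comes out at about half of ChatDev's, which matches the 124.3 versus 248.9 figures.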
The following table summarizes the canonical roles defined in the original paper and the production framework.
| Role | Inputs subscribed to | Primary output | Key responsibility |
|---|---|---|---|
| Product Manager | User requirement | Product Requirements Document, user stories, requirements pool | Translate one-line ask into structured requirements[15] |
| Architect | PRD | System design, file list, data structures, interfaces, sequence diagrams | Decide module decomposition and signatures[15] |
| Project Manager | System design | Task list, dependency graph | Sequence and assign engineering tasks[15] |
| Engineer | Task list, file list, interfaces | Implementation code | Implement classes and functions consistent with the design[15] |
| QA Engineer | Code, task list | Test cases, test results, error traces | Validate code; trigger debug loop on failure[11][15] |
The framework supports custom roles, including a Researcher role used in later work on AFlow, and the Data Interpreter role used in the eponymous follow-up agent.[5][12]
On standard code-generation benchmarks, the paper reports that MetaGPT with GPT-4 reaches a Pass@1 of 85.9% on HumanEval and 87.7% on MBPP, state-of-the-art numbers among multi-agent systems at the time of submission.[11] The improvements over GPT-4 alone are not exclusively attributable to the multi-agent structure: the ablation study credits a meaningful portion to the executable feedback debug loop.[11]
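For readers unfamiliar with the metric, Pass@1 is the k = 1 case of pass@k, the probability that at least one of k sampled solutions passes the unit tests. The sketch below uses the standard unbiased estimator commonly applied to HumanEval-style benchmarks; whether the MetaGPT evaluation sampled multiple solutions per problem is not stated here, so the sample counts in the example are purely illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong


def benchmark_score(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, each given as (n, c)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative data: three problems, ten samples each.
example = [(10, 10), (10, 3), (10, 0)]
score = benchmark_score(example, k=1)
```

For k = 1 the estimator reduces to c / n per problem, so a reported Pass@1 of 85.9% can be read as the average fraction of first attempts that pass.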
The paper introduces a custom benchmark called SoftwareDev, which evaluates whether a framework can produce a complete, executable software project from a single sentence. Projects are scored on a one-to-four executability scale, with four indicating a flawless run. The reported results are:[11]
| System | Executability (1-4) | Runtime (s) | Human revision cost | Tokens / code line |
|---|---|---|---|---|
| MetaGPT | 3.75 | 541 | 0.83 corrections | 124.3 |
| ChatDev | 2.25 | 762 | 2.5 corrections | 248.9 |
| Auto-GPT | 1.00 | not applicable | not applicable | not applicable |
| LangChain agent | 1.00 | not applicable | not applicable | not applicable |
| AgentVerse | 1.00 | not applicable | not applicable | not applicable |
The paper interprets the 1.00 scores for Auto-GPT, LangChain agents, and AgentVerse as evidence that those frameworks generally fail to produce code that runs at all on this benchmark, whereas the SOP-driven approach of MetaGPT and (to a lesser extent) ChatDev does.[11]
The Data Interpreter follow-up paper (arXiv:2402.18679) reports separate numbers on data-science workflows:[12]
These results are not directly comparable to HumanEval/MBPP, but they illustrate how the SOP-driven approach extends beyond software engineering into data science when paired with hierarchical graph planning.[12]
The AFlow paper, building on MetaGPT, reports a 5.7% average improvement over state-of-the-art baselines across six benchmarks (HumanEval, MBPP, GSM8K, MATH, HotpotQA, DROP), and shows that smaller models orchestrated by AFlow can match or exceed GPT-4o on specific tasks at roughly 4.55% of GPT-4o's inference cost in dollars.[13]
The reference implementation is the Python package metagpt, installable via pip install metagpt or directly from the FoundationAgents/MetaGPT GitHub repository.[5] Python accounts for roughly 97.5% of the codebase, with shell and other auxiliary languages making up the remainder.[5] The library exposes a Team abstraction that bundles a set of Role objects with a shared message pool and a run loop. Roles are defined as classes that declare a profile, a list of Action objects, and a _act coroutine. Backends are configured through a YAML file that supports OpenAI, Azure OpenAI, Ollama, Groq, and other endpoints by adjusting api_type and base_url.[5]
Data Interpreter is the most prominent role-level extension. It implements three pivotal techniques described in the 2024 paper: dynamic planning with hierarchical graph structures, dynamic tool integration during execution, and logical-inconsistency identification through feedback combined with experience recording.[12] The Data Interpreter agent is part of the same repository under examples/di/ and is documented as a flagship demo of the framework's flexibility outside of pure software engineering.[5]
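Planning over a task graph can be illustrated with the standard-library graphlib module (Python 3.9 or later). The sketch below is an assumption-laden simplification, not Data Interpreter's algorithm: the task names are hypothetical, the graph is flat, and the downstream_of helper models one plausible reading of dynamic planning, namely that a failed task and everything depending on it are candidates for re-planning.

```python
from graphlib import TopologicalSorter

# Hypothetical data-science plan: task -> set of tasks it depends on.
plan = {
    "load_data": set(),
    "clean_data": {"load_data"},
    "explore": {"clean_data"},
    "train_model": {"clean_data"},
    "visualize": {"explore", "train_model"},
}


def execution_order(graph: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def downstream_of(graph: dict[str, set[str]], task: str) -> set[str]:
    """Return the task plus all tasks that transitively depend on it."""
    affected = {task}
    changed = True
    while changed:
        changed = False
        for node, deps in graph.items():
            if node not in affected and deps & affected:
                affected.add(node)
                changed = True
    return affected


order = execution_order(plan)
to_replan = downstream_of(plan, "train_model")
```

If train_model fails, only it and visualize need to be revisited; load_data, clean_data, and explore keep their results, which is the practical benefit of tracking dependencies explicitly.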
MGX (MetaGPT X) was launched by DeepWisdom on 19 February 2025 as a hosted commercial product that wraps a refined MetaGPT-derived multi-agent team for natural-language application development. The hosted team is named with anthropomorphic roles: Mike (team leader), Emma (product manager), Bob (architect), Alex (engineer), and David (data analyst).[7] In January 2026 the product was rebranded as Atoms, with DeepWisdom citing pronounceability and consumer adoption as reasons for the change; the previous mgx.dev URL now redirects to atoms.dev.[8] Atoms adds commercial infrastructure including a visual editor, managed authentication and databases, payments integration, an SEO specialist agent, and a Race Mode that runs multiple model backends in parallel and selects the best output.[18] Reporting by KrAsia in early 2026 placed Atoms inside the broader Chinese "vibe coding" wave and noted that Atoms incorporates open-weight models such as DeepSeek and Qwen alongside closed APIs.[14]
Several community forks exist, including gofullthrottle/MetaGPT-Data-Interpreter and mskj-apaas/MetaGPT-2025, which track the upstream repository with localized changes.[5] Documentation lives in the separate geekan/MetaGPT-docs repository, which hosts the official tutorial site for the framework.[19]
Documented uses of MetaGPT cluster into four categories. First, end-to-end software prototyping, where the framework is given a single-sentence brief such as "write a snake game" or "build a CLI to-do tracker" and produces a runnable project with tests; this is the use case showcased in the original paper and in the SoftwareDev benchmark.[11] Second, requirements-document automation, where teams use only the Product Manager role to draft PRDs and competitive analyses from informal notes, often as a starting point for human refinement.[15] Third, data-science pipelines, via Data Interpreter, including exploratory analysis, machine-learning model training, and visualization.[12] Fourth, meta-workflow generation through AFlow, in which MetaGPT itself is used as the substrate for searching better agent topologies for downstream tasks like MBPP, HumanEval, and MATH.[13]
The commercial Atoms product targets a fifth use case: turning natural-language requirements into deployed, monetizable web applications complete with login, database, and payment integration. By September 2025, DeepWisdom reported that MGX/Atoms was processing approximately 1.2 million monthly visits and roughly 10,000 application launches per day.[14]
MetaGPT crystallized a design pattern that became influential across the agentic workflow ecosystem. Three contributions stand out.
First, the explicit framing of SOPs as prompt sequences gave researchers a vocabulary for what role-specialized multi-agent systems are doing: encoding organizational knowledge into the structure of communication. This framing has been picked up by subsequent multi-agent research, including Magentic-One and other orchestration frameworks that explicitly cite MetaGPT as motivation.[20]
Second, the shared message pool with publish/subscribe semantics offered an alternative to chat-style coordination that has informed the design of newer agentic systems and protocols.[16] By making structured artifacts the unit of communication, MetaGPT made it easier to log, inspect, and replay agent decisions, which is crucial for agent evaluation.
Third, the SoftwareDev benchmark, while small, normalized the practice of evaluating multi-agent code-generation systems on whether they can produce projects that run, rather than just snippets that pass unit tests. Subsequent work, including evaluations of multi-agent frameworks on SWE-bench, has built on this idea of end-to-end executability.[21]
The project's ICLR 2024 Oral status and the follow-up ICLR 2025 Oral acceptance of AFlow indicate that the academic community has consistently regarded the MetaGPT line of work as a serious research contribution rather than an engineering demo, despite some skepticism (discussed below) about the cost and reproducibility of multi-agent pipelines.[3][13]
Several independent reviews note that MetaGPT's quality comes at substantial cost. Multi-agent communication overhead has been measured at over USD 10 per task on HumanEval in published reproductions, driven primarily by serial messages that re-quote the same context across agents.[17] One published review reported token-duplication rates of roughly 72% for MetaGPT, meaning a large fraction of tokens billed by the LLM provider are repeated context rather than new content.[17] The paper itself acknowledges higher absolute token usage than ChatDev, while arguing that per-line cost is lower because more code is produced.[11]
MetaGPT-generated code occasionally references non-existent resource files (images, audio) or invokes undefined classes when synthesizing complex projects, because the LLM lacks live access to the project's actual file system or package indices.[17] LLMs without browsing also tend to produce code against outdated package versions; initial tests frequently fail due to version-compatibility issues, and the framework relies on its executable feedback loop to recover.[17][11]
When something breaks inside a multi-agent pipeline, tracing the cause across multiple agent prompts and intermediate documents is harder than debugging a single-agent chain. Reviews aimed at practitioners describe this as a recurring complaint, particularly when QA tests pass on the happy path but break on edge cases.[17] Galileo's analysis of multi-agent coordination identified several failure modes (cascading miscommunication, role drift, infinite loops in feedback cycles) that affect MetaGPT alongside other frameworks.[17]
A broader critique applies to the entire genre of role-based multi-agent frameworks (including MetaGPT, ChatDev, AutoGen, and CrewAI): assigning roles and hierarchies up front may simply re-encode human organizational structure into a place where it is not necessarily optimal for LLMs. Some recent empirical work argues that the assumption that role-specialization improves outcomes does not always hold when the cost of communication is taken into account.[22] The MetaGPT authors' own AFlow follow-up partially addresses this by searching for the workflow rather than fixing it in advance.[13]
The table below summarizes the practical differences among the most-cited LLM multi-agent frameworks at the time of writing.
| Framework | Origin | Communication style | Specialization | License |
|---|---|---|---|---|
| MetaGPT[11] | DeepWisdom, 2023 | Shared message pool, pub/sub of structured artifacts | Software dev SOP roles (PM, Architect, PgM, Eng, QA) | MIT[5] |
| ChatDev[23] | Tsinghua / OpenBMB, 2023 | Chat-chain along a phase-based pipeline; "communicative dehallucination" | Software phases (design, code, test) | Apache 2.0 |
| AutoGen[24] | Microsoft Research, 2023 | Conversable agents, free-form multi-agent chat | General; agents customized per app | CC-BY-4.0 / MIT |
| CrewAI[25] | crewAIInc, 2024 | Role + backstory + goal; sequential or hierarchical task flows | General teams with custom roles | MIT |
| Magentic-One[20] | Microsoft Research, 2024 | Orchestrator with planner and ledger; sub-agents for browsing, coding, file I/O | General computer-use tasks | MIT |
The signature difference between MetaGPT and the others is the insistence on structured artifacts (PRD, design, task list, code) rather than free-form natural-language chat as the unit of inter-agent communication. ChatDev is the closest peer because it also encodes software-development phases, but it does so via a conversation chain rather than a publish/subscribe message pool, and the original paper reports lower executability and higher revision costs for ChatDev on SoftwareDev.[11][23] AutoGen and CrewAI are more general-purpose; they leave it to the developer to define which artifacts are passed between agents and impose less structure on the messages themselves.[24][25]
The MetaGPT bibliography draws on three streams of prior work. The first is autonomous-agent demos such as Auto-GPT and BabyAGI that demonstrated chain-of-thought driven autonomy but suffered from drift; MetaGPT explicitly positions itself as a structured alternative.[1] The second is single-agent reasoning techniques such as Reflexion and ReAct that use self-reflection or interleaved reasoning and action to improve single-agent performance; MetaGPT's executable-feedback loop is conceptually related but operates across role-specialized agents rather than within one agent.[4] The third is multi-agent communication research, especially the contemporary AutoGen paper and ChatDev, which together with MetaGPT defined the 2023-2024 multi-agent landscape.[23][24]