ChatDev
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,134 words
ChatDev is an open-source multi-agent software development framework in which large language model (LLM) agents play role-specialized parts (Chief Executive Officer, Chief Technology Officer, programmer, reviewer, tester, and other functions) and collaborate through a sequence of structured pair conversations called the Chat Chain. The framework was introduced in the paper "ChatDev: Communicative Agents for Software Development" by Chen Qian and collaborators at Tsinghua University's Natural Language Processing Lab (THUNLP) together with the OpenBMB community, first released on arXiv as 2307.07924 in July 2023 and accepted to the Annual Meeting of the Association for Computational Linguistics (ACL) 2024.[^1][^2] ChatDev casts traditional waterfall software engineering as a sequence of agent-pair dialogues that step through designing, coding, and testing, and it introduces a technique called communicative dehallucination in which the responding agent first asks clarifying questions before producing code.[^1] The system is distributed on GitHub under the Apache License 2.0 by the OpenBMB organization, and a no-code rewrite known as ChatDev 2.0 (or DevAll) was released in January 2026.[^3][^4]
| Field | Value |
|---|---|
| Project name | ChatDev |
| Type | Multi-agent LLM software development framework |
| Lead author | Chen Qian (Tsinghua University, THUNLP) |
| Originating organization | OpenBMB, Tsinghua University, ModelBest |
| First arXiv preprint | 16 July 2023 (arXiv:2307.07924) |
| Latest paper revision | 5 June 2024 (v5) |
| Conference publication | ACL 2024 (long paper, pp. 15174 to 15186) |
| Repository | github.com/OpenBMB/ChatDev |
| License | Apache License 2.0 |
| Default model in paper | ChatGPT-3.5 at temperature 0.2 |
| Evaluation suite | Software Requirement Description Dataset (SRDD), 1,200 tasks |
ChatDev sits at the intersection of two research threads that gained momentum in 2023: the use of LLMs for automatic code generation, and the use of LLMs as role-playing agents that interact with one another to solve open-ended tasks. By mid-2023 the first thread had produced tools such as GitHub Copilot and the open-source GPT-Engineer project, both of which used a single LLM call (or a thin chain of calls) to translate a natural-language description into code. The second thread was driven by research on conversational agents such as CAMEL, which paired two LLM personas as an "AI user" and an "AI assistant" to cooperatively complete a task, and by hobbyist projects such as Auto-GPT and BabyAGI that gave an LLM a loop, a memory, and a toolset. ChatDev synthesizes the two threads by treating software construction as a problem amenable to a structured multi-agent conversation.[^1]
The paper was led by Chen Qian, a postdoctoral researcher in the Shuimu Tsinghua Scholar Program at the Department of Computer Science and Technology of Tsinghua University and a member of THUNLP, jointly with Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Most co-authors were affiliated with Tsinghua, with additional contributions from the University of Sydney, Beijing University of Posts and Telecommunications, and ModelBest Inc., the commercial spin-off behind the OpenBMB ecosystem.[^1][^5]
OpenBMB (Open Lab for Big Model Base) is an open-source community originally created around Tsinghua's foundation-model research, and it hosts the ChatDev repository alongside related projects such as the CPM and MiniCPM model families and the Experiential Co-Learning and Iterative Experience Refinement follow-up papers.[^3][^6][^7] The framework was first published on GitHub in mid-2023 and quickly accumulated tens of thousands of stars, becoming one of the most prominent examples of LLM-driven multi-agent software automation in that period.[^3]
| Date | Event |
|---|---|
| 16 July 2023 | First arXiv preprint of "ChatDev: Communicative Agents for Software Development" posted as 2307.07924 |
| Summer 2023 | Initial public release of the OpenBMB/ChatDev GitHub repository under Apache 2.0 |
| December 2023 | "Experiential Co-Learning of Software-Developing Agents" preprint (arXiv:2312.17025) extends ChatDev with shortcut-oriented experience reuse |
| May 2024 | "Iterative Experience Refinement of Software-Developing Agents" preprint (arXiv:2405.04219) adds dynamic experience refinement |
| 5 June 2024 | Fifth and final arXiv revision (v5) of the main ChatDev paper |
| June 2024 | "Scaling Large Language Model-based Multi-Agent Collaboration" preprint (arXiv:2406.07155) introduces MacNet, a directed-acyclic-graph topology for thousands of agents, in the same code-base |
| August 2024 | ChatDev paper presented at ACL 2024 in Bangkok, Thailand |
| January 2026 | Official release of ChatDev 2.0 (also called DevAll), a no-code multi-agent orchestration platform |
ChatDev frames software construction as a virtual software company that walks through a simplified waterfall. The published paper organizes the pipeline into three sequential phases: designing, coding, and testing. Each phase decomposes into smaller subtasks: designing covers high-level choices about programming language and modality; coding covers code writing and code completion; testing covers static code review and dynamic system testing.[^1] (Repository documentation and several secondary write-ups describe an optional fourth documenting phase that produces README files and requirements lists at the end of a run.[^3])
Inside each subtask, exactly two agents talk to one another: an instructor that proposes the work and an assistant that produces the artifact. Both roles are drawn from a roster of role profiles that mirror a traditional engineering organization. The CEO receives the human user's request and works with the CTO during planning, the CTO drives architecture decisions, the programmer writes code, the reviewer performs static code review, and the tester runs the resulting program and reports execution errors. The paper emphasizes that explicit role assignment, conveyed through system prompts, is the single most important component: removing role profiles caused the largest drop of any ablation studied.[^1]
The Chat Chain (denoted C in the paper) is the data structure that ties phases and subtasks together. Formally, C is a sequence of phases P, and each phase is a sequence of subtasks T. Each subtask is executed as a multi-turn conversation between an instructor agent and an assistant agent, with the conversation terminating when a shared consensus is reached or a turn budget is exhausted. The output of one subtask (a high-level decision, a code file, or a bug report) becomes part of the input context for the next subtask, but only as a compact solution summary rather than the full transcript.[^1]
Two memory mechanisms support this design. Short-term memory holds the full multi-turn dialogue of the current subtask so that the two agents can refer to each other's latest utterances. Long-term memory holds only the consensus outputs of previous subtasks and is what flows down the chain. The paper argues that this split is essential: passing the entire transcript of every previous subtask would saturate the context window and produce what the authors call "context overload," whereas passing only the conclusions keeps each agent focused on its current decision.[^1]
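The chain and its two memories can be illustrated with a small sketch. This is not the ChatDev code-base: the class names, the `DONE` consensus marker, the turn budget of 3, and the stub agents are all assumptions made for illustration, and each "agent" is a plain function standing in for an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A reply function stands in for an LLM call (hypothetical interface):
# it receives long-term memory and the current dialogue, returns an utterance.
ReplyFn = Callable[[List[str], List[str]], str]

DONE = "<CONSENSUS>"  # hypothetical marker an assistant emits to end a subtask


@dataclass
class Subtask:
    name: str
    instructor: ReplyFn
    assistant: ReplyFn
    max_turns: int = 3  # illustrative turn budget


@dataclass
class ChatChain:
    phases: List[List[Subtask]]  # a chain is a sequence of phases of subtasks
    long_term: List[str] = field(default_factory=list)

    def run_subtask(self, task: Subtask) -> str:
        short_term: List[str] = []  # full dialogue, dropped when the subtask ends
        solution = ""
        for _ in range(task.max_turns):
            short_term.append(task.instructor(self.long_term, short_term))
            solution = task.assistant(self.long_term, short_term)
            short_term.append(solution)
            if DONE in solution:  # consensus reached before the budget ran out
                break
        return solution.replace(DONE, "").strip()

    def run(self, request: str) -> List[str]:
        self.long_term = [request]
        for phase in self.phases:
            for task in phase:
                # Only the compact solution flows down the chain.
                self.long_term.append(self.run_subtask(task))
        return self.long_term


def say(text: str) -> ReplyFn:
    return lambda long_term, short_term: text


chain = ChatChain(phases=[
    [Subtask("language", say("Pick a language."), say(f"Python {DONE}"))],
    [Subtask("coding", say("Write main.py."), say(f"print('hi') {DONE}"))],
])
memory = chain.run("Build a greeter")
print(memory)
```

In this sketch the instructor's utterances never reach `long_term`; only the assistant's final answer for each subtask does, which mirrors the split between short-term and long-term memory described above.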
The Chat Chain replaces the open-ended group chats used in some contemporary frameworks with a pre-defined topology. The trade-off is that ChatDev cannot dynamically invent new phases at runtime, but every conversation has a clear scope, a clear pair of speakers, and a clear termination criterion. The authors describe this as guiding agents in "what to communicate."[^1]
The second pillar, communicative dehallucination, addresses a failure mode that emerges when an assistant LLM tries to satisfy an under-specified instruction and silently fabricates details such as undefined function signatures, nonexistent library imports, or imagined file paths. The paper observes that this kind of hallucination is especially common during the coding phase and is hard to eliminate with single-turn prompting alone.[^1]
The fix is a deliberate role reversal. Instead of immediately producing code in response to an instruction, the assistant first asks the instructor for the specific information it needs (for instance, the exact name of an external dependency, the data shape returned by a sibling function, or a clarification about which file should be modified). Only after the clarification round does the assistant commit to a formal response. The technique transforms a one-shot instruction-response into a short clarification dialogue, and the ablation study in the paper shows that removing it lowers the composite Quality score from 0.3953 to 0.3094 on the SRDD benchmark.[^1][^8]
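A minimal sketch of the role reversal, assuming three hypothetical callables in place of LLM calls: `ask` lets the assistant request a missing detail (returning an empty string when it has none), `answer` is the instructor's reply, and `respond` is the assistant's formal response. The names and the question limit are illustrative and are not taken from the ChatDev repository.

```python
from typing import Callable, List, Tuple

Clarifications = List[Tuple[str, str]]  # (question, answer) pairs


def dehallucinated_reply(
    instruction: str,
    ask: Callable[[str, Clarifications], str],
    answer: Callable[[str], str],
    respond: Callable[[str, Clarifications], str],
    max_questions: int = 3,  # illustrative limit on clarification rounds
) -> Tuple[str, Clarifications]:
    clarifications: Clarifications = []
    for _ in range(max_questions):
        question = ask(instruction, clarifications)
        if not question:  # the assistant has nothing left to clarify
            break
        clarifications.append((question, answer(question)))
    # The formal response is produced only after the clarification rounds.
    return respond(instruction, clarifications), clarifications


# Stub agents for demonstration only.
def stub_ask(instruction: str, known: Clarifications) -> str:
    asked = {question for question, _ in known}
    pending = "Which GUI library should I import?"
    return "" if pending in asked else pending


def stub_answer(question: str) -> str:
    return "Use tkinter from the standard library."


def stub_respond(instruction: str, known: Clarifications) -> str:
    facts = "; ".join(reply for _, reply in known)
    return f"# task: {instruction}\n# confirmed: {facts}"


reply, log = dehallucinated_reply("Build a timer app", stub_ask, stub_answer, stub_respond)
print(reply)
```

The point of the structure is that the assistant's code is conditioned on answers it asked for rather than on details it would otherwise have to guess.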
The mechanism is conceptually similar to active learning: rather than producing an answer with low confidence and many degrees of freedom, the assistant moves uncertainty back to the instructor, who has more context about user intent. The authors describe this as guiding agents in "how to communicate."[^1]
Inside the coding phase, ChatDev iteratively passes generated source files between the programmer and reviewer. The reviewer reads code statically and points out missing implementations, naming inconsistencies, and structural bugs; the programmer rewrites the affected files. Inside the testing phase, the tester invokes the Python interpreter (the published runs use Python 3.11.4) on the generated program, captures runtime errors such as ModuleNotFoundError, and feeds the stack trace back into a new round of programmer revisions. The system also offers incremental visualization of progress: as files are produced and edited they are streamed to a console view so a human operator can watch the virtual team build the software in real time.[^1]
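The testing loop can be sketched as follows. This is an illustration under stated assumptions rather than ChatDev's implementation: `run_program`, `repair_until_runs`, and the stub programmer are hypothetical names, the round limit is arbitrary, and the programmer agent is replaced by a string substitution. The script runs under whatever interpreter executes the sketch, not specifically Python 3.11.4.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def run_program(source: str, timeout: float = 10.0) -> Optional[str]:
    """Run source in a fresh interpreter; return stderr text, or None on success."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "main.py"
        script.write_text(source, encoding="utf-8")
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True, text=True, timeout=timeout, cwd=workdir,
        )
    return None if result.returncode == 0 else result.stderr


def repair_until_runs(
    source: str,
    programmer: Callable[[str, str], str],  # (source, traceback) -> revised source
    max_rounds: int = 3,                    # illustrative limit
) -> Tuple[str, bool]:
    for _ in range(max_rounds):
        error = run_program(source)
        if error is None:
            return source, True
        source = programmer(source, error)  # feed the stack trace back
    return source, run_program(source) is None


# Stub programmer: reacts only to the missing-module case.
def stub_programmer(source: str, error: str) -> str:
    if "ModuleNotFoundError" in error:
        return source.replace("import not_a_real_module_xyz", "import math")
    return source


broken = "import not_a_real_module_xyz\nprint(math.sqrt(16))\n"
fixed, ok = repair_until_runs(broken, stub_programmer)
print(ok)
```

A real tester would also need to handle programs that block on input or open a window; here a run that exceeds the timeout simply raises `subprocess.TimeoutExpired`.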
To evaluate the framework end-to-end, the authors constructed the Software Requirement Description Dataset (SRDD), a corpus of 1,200 natural-language software requests organized into 5 broad categories (Education, Work, Life, Game, Creation) and 40 subcategories. The prompts were seeded from popular app-store and software-portal descriptions on platforms including Ubuntu, Google Play, Microsoft Store, and the Apple App Store, then expanded and cleaned through a mix of LLM generation and human post-processing.[^1] SRDD has since been used as a benchmark in several follow-up multi-agent software papers.
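The paper reports four metrics computed over the 1,200 SRDD requests: completeness, executability, consistency, and quality.[^1] Completeness is the share of generated software free of placeholder code (pass, TODO, or unfilled function bodies). Executability is the share that runs successfully. Consistency measures how closely the embedding of the generated code matches the embedding of the original request. Quality is a composite score, the product of the other three.
The main numerical comparison in Table 1 of the paper places ChatDev alongside three contemporary baselines: GPT-Engineer (a single-prompt code generator), CAMEL (a generic two-agent role-play framework), and MetaGPT (a competing multi-agent framework that emphasizes structured artifacts and standard operating procedures).[^1][^8]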
| System | Completeness | Executability | Consistency | Quality |
|---|---|---|---|---|
| GPT-Engineer | 0.5022 | 0.3583 | 0.7887 | 0.1419 |
| CAMEL | (reported as substantially lower) | (lower) | (lower) | (lower than ChatDev) |
| MetaGPT | 0.4834 | 0.4145 | 0.7601 | 0.1523 |
| ChatDev | 0.5600 | 0.8800 | 0.8021 | 0.3953 |
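The Quality column in the table matches, to four decimal places, the product of the other three columns, which is consistent with Quality being a composite of completeness, executability, and consistency. The short check below recomputes it from the rows that carry numbers; the CAMEL row is omitted because the table gives no figures for it.

```python
# (completeness, executability, consistency, reported quality) from the table.
rows = {
    "GPT-Engineer": (0.5022, 0.3583, 0.7887, 0.1419),
    "MetaGPT": (0.4834, 0.4145, 0.7601, 0.1523),
    "ChatDev": (0.5600, 0.8800, 0.8021, 0.3953),
}


def quality(completeness: float, executability: float, consistency: float) -> float:
    """Composite score: the product of the three component metrics."""
    return completeness * executability * consistency


computed = {name: quality(c, e, k) for name, (c, e, k, _) in rows.items()}
for name, (_, _, _, reported) in rows.items():
    print(f"{name}: computed {computed[name]:.4f}, reported {reported:.4f}")
```

Because the score is multiplicative, a weak result on any single component pulls the composite down sharply, which is why ChatDev's executability advantage translates into a large Quality gap.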
ChatDev's most striking margin is in executability, where its 0.88 figure is more than double the rates reported for GPT-Engineer (0.3583) and MetaGPT (0.4145). In pairwise human evaluations the paper also reports that ChatDev defeats GPT-Engineer in 77.08 percent of head-to-head comparisons and MetaGPT in 88.00 percent.[^1]
The authors also report aggregate cost: a typical SRDD task is solved in roughly 148 seconds and consumes around 22,949 LLM tokens, producing on average 4.39 source files and approximately 144 lines of code. Secondary coverage often summarizes this as "an end-to-end software run for under one dollar in under seven minutes" using the GPT-3.5 family.[^1][^5]
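The reported averages can be turned into a few derived figures. In the sketch below the per-token price is an assumption chosen purely for illustration, not a number from the paper, so the dollar estimate indicates order of magnitude only.

```python
# Averages reported for one SRDD task.
SECONDS_PER_TASK = 148
TOKENS_PER_TASK = 22_949
FILES_PER_TASK = 4.39
LINES_PER_TASK = 144

# ASSUMPTION: flat illustrative price per 1,000 tokens, not taken from the paper.
ASSUMED_USD_PER_1K_TOKENS = 0.002

tokens_per_line = TOKENS_PER_TASK / LINES_PER_TASK
lines_per_file = LINES_PER_TASK / FILES_PER_TASK
estimated_cost_usd = TOKENS_PER_TASK / 1000 * ASSUMED_USD_PER_1K_TOKENS

print(f"tokens per generated line: {tokens_per_line:.1f}")
print(f"lines per source file:     {lines_per_file:.1f}")
print(f"estimated cost (USD):      {estimated_cost_usd:.4f}")
```

Under that assumed price the estimate stays far below one dollar, and 148 seconds is well inside seven minutes, so the summary quoted in secondary coverage is a loose upper bound rather than a typical figure.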
The paper's ablation study (also Table 1 of the v5 manuscript) isolates the contributions of role assignment, multi-agent communication, and communicative dehallucination. Removing role assignment from the system prompts lowered the Quality score from 0.3953 to 0.2212, by far the largest effect. Removing the communicative dehallucination loop lowered Quality to 0.3094. Collapsing the multi-agent pipeline into a single-agent baseline produced an intermediate decline. The authors take these results as evidence that the framework is more than the sum of its parts, and that the role specialization plus the clarifying-question protocol jointly drive most of the headline gains.[^1]
The paper additionally analyzes communication patterns over all 1,200 tasks. Roughly 57.2 percent of inter-agent messages are predominantly natural-language (concentrated in the designing phase), while the remainder mix natural language with programming language. In the testing phase, about 45.76 percent of detected errors are ModuleNotFoundError exceptions, and within static code review about 34.85 percent of issues fall into the "method not implemented" category. These statistics motivate later work on tool use and dependency resolution.[^1]
In December 2023 the same group published "Experiential Co-Learning of Software-Developing Agents" (arXiv:2312.17025), accepted to ACL 2024. The paper proposes that instructor and assistant agents should not start each new task from scratch but should gather shortcut-oriented experiences from historical trajectories and reuse them on later tasks. The work positions ChatDev as the substrate on which the experience-collection and experience-reuse mechanisms are evaluated, and reports performance improvements on unseen software requests.[^6]
In May 2024 the group followed up with "Iterative Experience Refinement of Software-Developing Agents" (arXiv:2405.04219). Whereas Experiential Co-Learning treats the experience pool as static after collection, IER refines it during execution through a successive pattern (refining based on the nearest experiences in the current batch) and a cumulative pattern (acquiring experiences across all prior batches). An experience-elimination mechanism prunes the pool by keeping only high-quality, frequently used examples. The reported result is that 11.54 percent of the highest-quality experiences match the performance of the full pool, indicating substantial redundancy in raw collected experiences.[^7]
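The elimination step can be pictured as a filter over the experience pool. The sketch below is only an illustration of "keep high-quality, frequently used examples": the `Experience` fields, the scoring scale, and both thresholds are assumptions, and the paper's actual heuristic may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Experience:
    summary: str
    quality: float  # hypothetical score in [0, 1]; higher is better
    uses: int       # hypothetical count of retrievals on later tasks


def eliminate(
    pool: List[Experience],
    min_quality: float = 0.8,  # illustrative threshold
    min_uses: int = 2,         # illustrative threshold
) -> List[Experience]:
    """Keep only experiences that are both high-quality and frequently used."""
    return [e for e in pool if e.quality >= min_quality and e.uses >= min_uses]


pool = [
    Experience("add missing import before first use", quality=0.92, uses=7),
    Experience("rename variable for style", quality=0.35, uses=5),
    Experience("implement stubbed method body", quality=0.88, uses=1),
    Experience("wire button callback to handler", quality=0.81, uses=3),
]
kept = eliminate(pool)
retained_fraction = len(kept) / len(pool)
print(f"kept {len(kept)} of {len(pool)} experiences")
```

The retained fraction here is an artifact of the toy data and has no relation to the 11.54 percent figure reported in the paper.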
In June 2024 Chen Qian and collaborators published "Scaling Large Language Model-based Multi-Agent Collaboration" (arXiv:2406.07155), introducing MacNet (Multi-Agent Collaboration Networks). MacNet generalizes the strictly sequential Chat Chain to arbitrary directed acyclic graphs of agents and studies whether multi-agent performance follows a scaling law analogous to the neural scaling laws observed for single models. The paper reports a logistic-growth pattern in collaborative performance as the agent count rises, with irregular topologies outperforming regular ones and the system scaling to more than one thousand agents. MacNet is shipped as a branch of the OpenBMB/ChatDev repository.[^9][^3]
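The step from a chain to a directed acyclic graph can be sketched with the standard-library `graphlib` module. This is not the MacNet implementation: `run_dag`, the node names, and the lambda "agents" are hypothetical, and each agent simply transforms the outputs of its predecessors.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

Agent = Callable[[List[str]], str]  # stand-in for an LLM-backed agent


def run_dag(predecessors: Dict[str, List[str]], agents: Dict[str, Agent], task: str) -> Dict[str, str]:
    """Run each agent once, after all of its predecessors have produced output."""
    outputs: Dict[str, str] = {}
    for node in TopologicalSorter(predecessors).static_order():
        upstream = [outputs[p] for p in predecessors.get(node, [])]
        # Source nodes have no predecessors, so they receive the task itself.
        outputs[node] = agents[node](upstream or [task])
    return outputs


# A diamond topology: one drafter, two parallel reviewers, one merger.
predecessors = {
    "draft": [],
    "review_a": ["draft"],
    "review_b": ["draft"],
    "merge": ["review_a", "review_b"],
}
agents: Dict[str, Agent] = {
    "draft": lambda inputs: f"draft({inputs[0]})",
    "review_a": lambda inputs: f"A[{inputs[0]}]",
    "review_b": lambda inputs: f"B[{inputs[0]}]",
    "merge": lambda inputs: " + ".join(inputs),
}
outputs = run_dag(predecessors, agents, "todo app")
print(outputs["merge"])
```

A strictly sequential Chat Chain corresponds to the special case in which every node has exactly one predecessor, so the same traversal covers both designs.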
In January 2026 OpenBMB and ModelBest released ChatDev 2.0, also called DevAll ("Develop Everything"), as a no-code multi-agent orchestration platform.[^4] The rewrite generalizes the original software-only pipeline into a YAML-defined workflow runtime with a Vue 3 drag-and-drop canvas for editing agent topologies, a Python backend built on FastAPI and managed with the uv package manager, a Python SDK for batch execution, and Docker-based sandboxing for tool execution. Example workflows shipped with the platform include data visualization, deep research synthesis, 3D generation via Blender, and educational video generation, in addition to the original software construction scenario. ChatDev 2.0 remains Apache-2.0 licensed.[^3][^4]
The reference implementation lives in the github.com/OpenBMB/ChatDev repository. The original 1.0 line is preserved in the chatdev1.0 branch, and the main branch tracks the 2.0 rewrite.[^3] The original code-base is a Python application that wires together LLM API calls (originally targeting the OpenAI Chat Completions API for the GPT-3.5 and GPT-4 families) with the role profiles, a Chat Chain configuration file, and a console visualizer.[^3]
IBM's developer site has published a tutorial titled "Use ChatDev ChatChain for agent communication on watsonx.ai," which walks through running the framework on top of the watsonx.ai model gateway and substituting IBM-hosted models for OpenAI ones.[^10] This is one of several signals that ChatDev has become a reference implementation that researchers and platform vendors use to demonstrate multi-agent orchestration.
The Apache 2.0 license has, however, drawn scrutiny in the repository's issue tracker. Issue #470 raised a question about ChatDev shipping code that references an AGPL-licensed dependency in the original 1.0 release, illustrating that license-compatibility analysis remains the responsibility of downstream users.[^11]
ChatDev is widely cited in the multi-agent LLM literature: it is one of the standard baselines (alongside MetaGPT, CAMEL, and AutoGen) in surveys of LLM-based multi-agent software automation, and SRDD has been adopted as a benchmark in follow-up work.[^8][^12]
Several contemporaries explore the same design space but make different trade-offs. The brief comparison below draws on the original ChatDev paper, the MetaGPT and AutoGen papers, and survey coverage.
| Framework | Coordination model | Agent topology | Primary domain | Key distinguishing feature |
|---|---|---|---|---|
| ChatDev[^1] | Pairwise instructor-assistant chats organized into a Chat Chain | Sequential, fixed pipeline of phases and subtasks | Software development | Communicative dehallucination via clarifying questions |
| MetaGPT[^8] | Structured artifacts (PRD, design doc, task list) handed between role agents | Sequential, SOP-style pipeline with executable test feedback | Software development | Encodes standard operating procedures and tests as first-class artifacts |
| AutoGen[^12] | Composable conversation patterns including nested chat and group chat | Arbitrary, user-defined; not restricted to software | General multi-agent applications | Framework rather than fixed pipeline; flexible topologies including human-in-the-loop |
| CAMEL[^1] | Single instructor-assistant role-play between two LLM personas | Two-agent fixed pair | General task-solving | Original inspiration for the role-play pattern reused inside ChatDev |
| Auto-GPT / BabyAGI | Single agent with a self-managed task list and tool loop | Single agent | General autonomy demos | Popular hobbyist demonstration of autonomous LLM loops |
ChatDev's distinctive contribution is the explicit guidance layer: the Chat Chain decides what is discussed and the dehallucination protocol decides how. MetaGPT places more emphasis on encoding the artifacts produced by each role (product requirements document, system design document, task list), while AutoGen is closer to a programming framework that lets a developer assemble any topology, including the Chat Chain pattern itself.[^1][^8][^12] CAMEL pre-dates ChatDev and is essentially the two-agent pattern that ChatDev nests inside each Chat Chain subtask.[^1]
ChatDev is also frequently contrasted with single-LLM coding tools and autonomous coding agents. Tools such as GitHub Copilot focus on in-editor completions for a human developer, while end-to-end systems such as Devin and benchmarks such as SWE-bench target the harder problem of editing existing repositories. ChatDev, in contrast, is designed to construct small new programs from scratch and is most often evaluated on tasks closer in spirit to HumanEval and MBPP than to repository-scale benchmarks.[^1][^13]
ChatDev has been deployed in three broad ways. First, it serves as a research artifact: a substantial body of multi-agent LLM research now compares against ChatDev or uses SRDD, and the framework is one of the canonical examples in surveys of role-playing LLM agents.[^8][^12]
Second, it functions as a teaching and demonstration tool. The end-to-end visualization of a "virtual company" producing code in a few minutes makes ChatDev one of the most intuitive demonstrations of LLM agent collaboration available to non-specialists, and the IBM watsonx.ai tutorial is one example of vendors using it for that purpose.[^10]
Third, with the ChatDev 2.0 rewrite the project is being repositioned as a general orchestration platform: the no-code canvas and YAML workflow runtime allow non-programmers to wire up multi-agent systems for tasks beyond software construction, including data visualization, deep research, and content generation.[^4][^3]
The broader significance of ChatDev within the multi-agent system literature lies in three concrete claims that the paper sustains with measurements: that decomposing an LLM task into pair-wise structured conversations beats monolithic single-call prompting on end-to-end software requests; that explicit role assignment is the dominant lever in multi-agent code generation; and that allowing an assistant to ask clarifying questions before acting reduces a measurable category of hallucination.[^1]
The most common critique of ChatDev concerns the realism of its outputs. The reported averages of roughly 144 lines of code and 4.39 files per task place the generated artifacts in the range of beginner-course homework rather than production-grade software, and independent reviewers have argued that this disqualifies ChatDev as a stand-in for autonomous engineering of real applications, which typically span tens of thousands of lines.[^14] The repository's own documentation positions the framework as a foundation for experimentation rather than a production solution.[^3]
A second limitation is that ChatDev (in its 1.0 form) cannot edit existing repositories. Each run generates a fresh program from scratch; the framework offers no mechanism to import an existing code-base, modify it, and produce a delta. This is the problem that benchmarks such as SWE-bench target and that systems such as SWE-agent and Devin attempt to solve.[^14][^13]
A third limitation is methodological. The four SRDD metrics (completeness, executability, consistency, quality) all measure surface properties such as whether the program runs and how closely its code embedding matches the request embedding. They do not directly measure functional correctness or whether the program actually fulfills the request, and the paper's authors acknowledge that human evaluation is needed to capture the qualitative differences. ChatDev relies on a separate human-preference study for this purpose, which is harder for outside researchers to reproduce than automatic metrics.[^1]
Critics have also pointed out that the original ChatDev contribution is largely an engineering synthesis rather than a fundamentally new algorithmic idea: role-playing LLM agents (CAMEL) and chain-of-thought-style decomposition were already established before the paper, and ChatDev's novelty rests on the specific combination, the dehallucination protocol, and the SRDD evaluation rather than on new model architectures.[^14] The authors do not contest this framing; the paper presents itself as an empirical and systems contribution.[^1]
Finally, the original 1.0 release shipped code that referenced an AGPL-licensed dependency despite the project itself being Apache 2.0, which prompted an issue raising license-compatibility concerns. Downstream users should check current dependencies before redistribution.[^11]
ChatDev is one node in a dense graph of LLM-driven multi-agent systems and AI code generation tools. Closely related frameworks include MetaGPT, AutoGen (Microsoft Research), the CAMEL role-play paradigm, and hobbyist autonomous loops such as Auto-GPT and BabyAGI.[^1][^8][^12] On the single-developer assistant side it relates to GitHub Copilot and the OpenAI Codex family. On the repository-editing side it is contrasted with SWE-agent, Devin, and the SWE-bench evaluation suite. The communicative-dehallucination idea fits inside the broader research program on reducing model hallucination through dialogue rather than through retrieval or fine-tuning, and agent memory discussions intersect with ChatDev's short-term and long-term memory split.
ChatDev also sits inside the OpenBMB stack alongside the CPM and MiniCPM language models, AgentVerse and XAgent (sibling multi-agent projects from the same community), and the more recent MacNet topology generalization.[^3][^9]