# Magentic-One

> Source: https://aiwiki.ai/wiki/magentic_one
> Updated: 2026-07-16
> Categories: AI Agents, Microsoft
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Magentic-One** is a generalist multi-agent system released by [Microsoft](/wiki/microsoft) Research's AI Frontiers lab in November 2024 to autonomously solve complex, multi-step tasks across the open web, local file systems, and command-line environments.[^1][^2] The system uses a centralized control architecture in which a single Orchestrator agent plans, delegates, and tracks progress while four specialist agents (WebSurfer, FileSurfer, Coder, and ComputerTerminal) carry out tool-using sub-tasks. Built on top of [AutoGen](/wiki/autogen), Microsoft's open-source multi-agent framework, Magentic-One was the first widely circulated public reference design from Microsoft Research for an end-to-end agentic team capable of completing real-world workflows such as web research, data analysis, and code-driven automation without per-benchmark prompt tuning.[^1][^2]

Magentic-One reported task completion rates of 38% on the [GAIA](/wiki/gaia_benchmark) test set, 27.7% accuracy on AssistantBench, and 32.8% on [WebArena](/wiki/webarena), placing it in a statistically competitive position with state-of-the-art baselines as of October 2024.[^2] Its lasting influence has been less the raw scores than the **Task Ledger / Progress Ledger** orchestration pattern, which has since been adopted in numerous downstream frameworks, and the demonstration that a small, fixed roster of tool-centric agents could match purpose-built systems across heterogeneous benchmarks. In May 2025 Microsoft Research released **Magentic-UI**, a human-in-the-loop variant that builds directly on Magentic-One and runs on top of AutoGen v0.4.[^3][^4] The original Magentic-One code, released under the MIT license through the `autogen` GitHub repository, was subsequently re-implemented as `MagenticOneGroupChat` inside the `autogen-agentchat` package, where it remains the canonical reference orchestration team for AutoGen.[^5][^6]

## Origin

Magentic-One originated inside Microsoft Research's **AI Frontiers** organization, the same group responsible for AutoGen and a long line of intelligent-agent research projects.[^1] The accompanying technical report, "Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks," was posted to arXiv on November 7, 2024 as arXiv:2411.04468, with a companion Microsoft Research blog post and a public release on GitHub timed for the same week.[^2][^1] The paper's author list begins with research leads Adam Fourney, Gagan Bansal, Hussein Mozannar, and Cheng Tan, followed by core contributors Eduardo Salinas, Erkang (Eric) Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, and Victor Dibia, and program leads Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi.[^2] Many of the same authors had also contributed to AutoGen itself, giving Magentic-One unusually deep integration with its underlying framework from the day of release.[^2][^5]

The name "Magentic-One" is, per the authors, a contraction of "**m**ulti" and "**agentic**," reflecting the fact that the system is designed to compose specialized agents into a single team rather than to ship a single, monolithic agent.[^2] The "-One" suffix is also deliberate: although the team can be extended, the authors fixed the released team to one Orchestrator plus four specialists in order to evaluate generality without per-benchmark tuning.[^2]

Beyond the system itself, the Microsoft Research team made two parallel contributions. The first was an open-source reference implementation, originally shipped as the `autogen-magentic-one` package within the AutoGen monorepo and subsequently absorbed into `autogen-agentchat` as `MagenticOneGroupChat`.[^5][^6] The second was **AutoGenBench**, a stand-alone harness for running agentic benchmarks with isolated Docker containers, fresh initial conditions, and repeated trials to estimate variance.[^2] AutoGenBench addresses the fact that, because agents take real actions in stateful environments (writing files, installing libraries, posting comments), naive evaluation harnesses give later-run systems an unfair advantage by inheriting installed dependencies, or unfairly penalize them by inheriting damage from earlier runs.[^2]
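
The isolation-plus-repetition idea behind AutoGenBench can be illustrated with a small sketch. It is not AutoGenBench's code: a fresh temporary directory stands in for a freshly started Docker container, and the names `run_trials` and `leaves_clean_state` are hypothetical.

```python
import statistics
import tempfile
from pathlib import Path
from typing import Callable


def run_trials(task: Callable[[Path], bool], repeats: int = 3) -> tuple[float, float]:
    """Run `task` in a fresh directory each time; return (mean, stdev) of success."""
    outcomes = []
    for _ in range(repeats):
        # A new temporary directory stands in for a freshly started container.
        with tempfile.TemporaryDirectory() as workdir:
            outcomes.append(1.0 if task(Path(workdir)) else 0.0)
    spread = statistics.stdev(outcomes) if repeats > 1 else 0.0
    return statistics.mean(outcomes), spread


def leaves_clean_state(workdir: Path) -> bool:
    """Toy task: succeeds only if no earlier run left a marker file behind."""
    marker = workdir / "installed.txt"
    already_there = marker.exists()
    marker.write_text("dependency installed")
    return not already_there


mean_success, spread = run_trials(leaves_clean_state)
print(mean_success, spread)
```

Because every trial starts from an empty directory, the toy task never sees state left by an earlier run. With a shared directory, only the first trial would succeed.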

## Architecture

Magentic-One implements a **centralized, hierarchical control flow**: a single Orchestrator agent owns the global plan and decides at each step which specialist agent should act next.[^2] The four specialist agents do not communicate with each other directly; all messages flow through the Orchestrator, which is the only agent that maintains long-running state across the task.[^2] This stands in contrast to peer-to-peer multi-agent designs (where any agent may speak next) and to fully scripted designs (where the control flow is encoded as a program). The Orchestrator's plan exists primarily as natural-language chain-of-thought guidance and is not strictly executed; agents are free to deviate, and the plan can be revised when progress stalls.[^2]

### The Orchestrator: Task Ledger and Progress Ledger

The Orchestrator's distinguishing feature is its **two-ledger** design, organized into an outer and an inner loop.[^2] Figure 2 of the paper formalizes this as nested control loops; the same diagram is used throughout the AutoGen documentation.[^2][^6]

The **outer loop** maintains the **Task Ledger**, which serves as the Orchestrator's structured working memory for the duration of the task. When a new task arrives, the Orchestrator populates the Task Ledger with four explicit fields:[^2]

- **Given or verified facts.** Information stated directly in the task prompt, or already verified during execution.
- **Facts to look up.** Information that is plausibly available on the web, in attached files, or via a tool call.
- **Facts to derive.** Quantities or conclusions that must be computed (programmatically) or reasoned about (logically).
- **Educated guesses.** Closed-book, memorized candidate answers that the Orchestrator commits to up front so that agents can fall back on them if time runs out or tools fail. Educated guesses are refreshed periodically as new evidence accumulates.

After the ledger is populated, the Orchestrator surveys the available team, then drafts a step-by-step natural-language plan that assigns sub-tasks to specific agents. The plan is treated as **chain-of-thought guidance** rather than as a binding script; neither the Orchestrator nor the specialist agents are required to follow it exactly.[^2] When the outer loop revises the plan (after a stall, for example), all agents are forced to clear their contexts and reset their states so that no stale assumptions persist across plan boundaries.[^2]
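
As a rough illustration, the four fields and the plan could be held in a structure like the one below. This is a minimal sketch rather than the AutoGen implementation: the class name, field names, and `render` method are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TaskLedger:
    """Hypothetical container for the four Task Ledger fields plus the plan."""

    task: str
    verified_facts: list[str] = field(default_factory=list)
    facts_to_look_up: list[str] = field(default_factory=list)
    facts_to_derive: list[str] = field(default_factory=list)
    educated_guesses: list[str] = field(default_factory=list)
    plan: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the ledger as plain text suitable for inclusion in a prompt."""
        sections = [
            ("Given or verified facts", self.verified_facts),
            ("Facts to look up", self.facts_to_look_up),
            ("Facts to derive", self.facts_to_derive),
            ("Educated guesses", self.educated_guesses),
            ("Plan", self.plan),
        ]
        lines = [f"Task: {self.task}"]
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in (items or ["(none)"]))
        return "\n".join(lines)


ledger = TaskLedger(
    task="How many pages does the attached report have?",
    verified_facts=["A report is attached to the task."],
    facts_to_look_up=["Page count of the attached report."],
    educated_guesses=["Probably between 10 and 50 pages."],
    plan=["FileSurfer: open the attachment and count its pages."],
)
print(ledger.render())
```

Keeping the educated guesses as an explicit field means a fallback answer is still available when the run terminates without a verified result.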

The **inner loop** maintains the **Progress Ledger**, which directs the moment-to-moment execution of the plan. On each iteration, the Orchestrator answers five structured questions:[^2]

1. Is the request fully satisfied (i.e. is the task complete)?
2. Is the team looping or repeating itself?
3. Is forward progress being made?
4. Which agent should speak next?
5. What instruction or question should be given to that agent?

The Orchestrator also maintains a **stall counter**. If a loop is detected, or if forward progress is judged absent, the counter increments. While the counter remains at or below two, the Orchestrator continues to dispatch new sub-tasks; once it exceeds the threshold, the inner loop terminates and control returns to the outer loop for reflection and self-refinement, after which the plan is revised and a new inner loop begins.[^2] This nested-loop pattern continues until the task is judged complete or until a configurable termination condition is reached (maximum attempts, maximum elapsed time, etc.). At termination, the Orchestrator reviews the full transcript and the ledger and reports either a verified final answer or its best educated guess.[^2]

The two-ledger design separates three concerns that simpler agent loops typically conflate: persistent state, per-turn reflection, and bounded recovery. The Task Ledger gives the Orchestrator a persistent, structured world model that survives across iterations of the inner loop. The Progress Ledger gives the Orchestrator an explicit reflection step that runs every turn, increasing the chance that loops, dead-ends, and false completions are detected. The stall counter and outer-loop reset together implement a bounded recovery strategy: agents have a fixed budget to push through temporary uncertainty before the system pays the higher cost of plan revision.[^2]

### Specialist Agents

The four specialist agents in Magentic-One are organized by **tool or capability** rather than by professional role. The authors explicitly contrast this design with role-based teams (planner, researcher, analyst, critic), arguing that tool-centric agents avoid the redundancy that arises when each role independently needs to browse the web or write code.[^2]

| Agent | Role | Action space | LLM-based? |
|---|---|---|---|
| Orchestrator | Plans, tracks progress via Task/Progress Ledgers, selects next speaker, reflects and re-plans on stall | Issues natural-language instructions to specialists | Yes (multimodal in default config) |
| WebSurfer | Drives a Chromium browser to navigate pages, click elements, type, summarize, and answer questions about content | Navigation, page actions, reading actions; uses set-of-marks prompting for grounding | Yes (multimodal) |
| FileSurfer | Browses the local file system and previews documents through a markdown-based read-only viewer supporting PDFs, Office documents, images, audio, and video | List directories, open files, paginate, summarize | Yes |
| Coder | Writes new Python programs and debugs prior programs given console output; analyzes information collected from other agents | Emits new standalone Python source per request | Yes |
| ComputerTerminal | Provides a shell for executing Python code and running shell commands (e.g. installing libraries) | Deterministic execution of code or shell commands | No (deterministic) |

**WebSurfer.** A specialized LLM-based agent that commands a Chromium-based web browser. On each call from the Orchestrator, WebSurfer maps a natural-language instruction to a single action in its action space, takes the action, and reports both a screenshot and a written description of the new page state.[^2] The paper compares the arrangement to a telephone tech-support call: the Orchestrator knows what needs to happen but cannot directly act on the page, and so relays instructions to WebSurfer, which carries them out and reports back. WebSurfer's action space spans navigation (visiting URLs, performing searches, scrolling), web-page actions (clicking, typing), and reading actions (summarizing or answering questions about a page). The reading actions allow WebSurfer to perform document Q&A inline rather than returning to the Orchestrator for additional scrolling instructions, saving round-trips on long pages.[^2] WebSurfer grounds clicks and typed input to specific page elements using **set-of-marks prompting**, similar to WebVoyager, and extends the technique with textual descriptions of content outside the active viewport.[^2] In the modern `autogen-agentchat` implementation, this agent is exposed as `MultimodalWebSurfer`.[^6]

**FileSurfer.** Structurally similar to WebSurfer, but drives a custom markdown-based file preview application instead of a browser.[^2] The viewer is read-only but supports PDFs, Office documents, images, audio, and video, and FileSurfer can also list directories and traverse folder structures. Because the viewer converts everything to markdown, FileSurfer cannot directly answer questions about visual layout or non-speech audio content; the report identifies this as a notable limitation.[^2] The successor codebase uses Microsoft's MarkItDown library for the underlying file-to-markdown conversion.[^4]

**Coder.** An LLM-based agent specialized through its system prompt for writing code, analyzing information collected from the other agents, and synthesizing new artifacts.[^2] In its November 2024 design, the Coder always emits a fresh, standalone Python program in response to each coding request, even when debugging a prior failure; the authors flag this as a simplification that hurts performance on multi-file code bases and on tasks that depend on previously defined functions.[^2] In the `autogen-agentchat` rebuild, this agent is exposed as `MagenticOneCoderAgent`.[^6]

**ComputerTerminal.** A deterministic, non-LLM agent that executes Python programs and shell commands on the team's behalf.[^2] Splitting execution out of the Coder, rather than collapsing them into a single REPL-like agent, gives the Orchestrator a cleaner separation between code authorship and code execution. ComputerTerminal can also install new libraries via shell commands, allowing the team to expand its own programming toolset mid-task.[^2]
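
The split between authorship and execution can be illustrated with a minimal deterministic executor. This sketch is not the ComputerTerminal implementation: `run_python` is a hypothetical helper, the "Coder" output is a fixed string rather than an LLM response, and the subprocess is not sandboxed in the way a Docker-based deployment would be.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_python(source: str, timeout: float = 30.0) -> tuple[int, str]:
    """Execute a standalone Python program; return (exit code, combined output)."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "program.py"
        script.write_text(source, encoding="utf-8")
        result = subprocess.run(
            [sys.executable, str(script)],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    return result.returncode, result.stdout + result.stderr


# Stand-in for a program written by the Coder agent.
program = "print(sum(range(10)))"
exit_code, output = run_python(program)
print(exit_code, output.strip())
```

The executor only reports the exit code and console output. Deciding what to do about a failure stays with the Orchestrator and the Coder.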

The five-agent decomposition produces what the authors describe as a hierarchy over tool usage: the Orchestrator chooses among a small handful of broad capabilities (browse, read a file, write code, run code), and the chosen agent then chooses among a small set of agent-specific actions (click, scroll, paginate). Compared to a single agent with dozens of tools, this hierarchy is hypothesized to be easier for current LLMs to reason about, and the ablations support that view.[^2]

### Default model configuration

In the released configuration, all LLM-based agents use `gpt-4o-2024-05-13` as the default multimodal model, with the ComputerTerminal running deterministically.[^2] An alternative configuration substitutes [OpenAI](/wiki/openai)'s `o1-preview` for the Orchestrator's outer loop and for the Coder while keeping GPT-4o for the multimodal agents (WebSurfer and FileSurfer), because o1-preview is text-only.[^2] This heterogeneous-model setup foreshadowed what later became a common pattern in agentic systems: a strong reasoning model where reasoning dominates (planning, code) and a fast multimodal model where perception dominates (browser, file viewer).

## Relationship to AutoGen

Magentic-One was implemented and released as part of [AutoGen](/wiki/autogen) version 0.4, the event-driven rewrite of Microsoft's multi-agent framework.[^2][^6] When the paper was posted in November 2024, AutoGen v0.4 was still a relatively new code base, and Magentic-One was simultaneously its flagship demonstration and a forcing function for its API design. The original implementation lived as the `autogen-magentic-one` package inside the AutoGen monorepo, sitting directly on top of the lower-level `autogen-core` library.[^5]

In subsequent releases, Microsoft ported Magentic-One to the higher-level `autogen-agentchat` package, where it became `MagenticOneGroupChat`, an AgentChat team that can be instantiated like any other AutoGen team.[^6] The original `autogen-magentic-one` package was deprecated and pinned to AutoGen v0.4.4 for historical reference, with new development concentrated in `autogen-agentchat` and `autogen-ext`.[^5] The modern public surface area exposes `MagenticOneGroupChat` (the orchestrator team), `MultimodalWebSurfer`, `FileSurfer`, `MagenticOneCoderAgent`, and `CodeExecutorAgent` (the role previously called ComputerTerminal), along with a `MagenticOne` helper class that bundles all of the above with sensible defaults.[^6] Installing the system in its modern form requires both `autogen-agentchat` and `autogen-ext[magentic-one,openai]`, plus Playwright for browser automation.[^6] Because `MagenticOneGroupChat` is a standard AgentChat team, it accepts arbitrary additional `AssistantAgent` participants, so the original fixed five-agent team can be extended without modifying the orchestrator itself.[^2][^6]
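
A minimal usage sketch, following the pattern shown in the AutoGen v0.4 documentation, is given below. It assumes the packages have been installed (for example with `pip install autogen-agentchat "autogen-ext[magentic-one,openai]"`) and that `OPENAI_API_KEY` is set. Exact signatures may differ between AutoGen releases, so treat it as a starting point rather than a reference. The `main` and `run` wrappers are names chosen for this example.

```python
import asyncio


async def main(task: str) -> None:
    # Imported inside the function so the module still loads when the
    # optional AutoGen packages are not installed.
    from autogen_agentchat.agents import AssistantAgent
    from autogen_agentchat.teams import MagenticOneGroupChat
    from autogen_agentchat.ui import Console
    from autogen_ext.models.openai import OpenAIChatCompletionClient

    # Assumes OPENAI_API_KEY is available in the environment.
    model_client = OpenAIChatCompletionClient(model="gpt-4o")

    # A single generic participant; specialist agents can be added to the list.
    assistant = AssistantAgent("Assistant", model_client=model_client)

    # The orchestrator takes its own model client in addition to the participants.
    team = MagenticOneGroupChat([assistant], model_client=model_client)

    # Stream the team's messages to the console until the run ends.
    await Console(team.run_stream(task=task))


def run(task: str) -> None:
    """Synchronous entry point for scripts."""
    asyncio.run(main(task))
```

Calling `run("Summarize the AutoGen README")` would start a run that makes real model calls. As the documentation recommends, anything that browses or executes code should be run inside a container.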

The relationship between Magentic-One and AutoGen is bidirectional. Magentic-One is one of AutoGen's reference designs and a major driver of its multi-agent semantics, while AutoGen provides the substrate (event-driven message passing, code-execution sandboxes, model adapters) on which Magentic-One actually runs.[^2][^6]

## Benchmarks

The Magentic-One paper evaluates the system on three agentic benchmarks of differing character: [GAIA](/wiki/gaia_benchmark), AssistantBench, and [WebArena](/wiki/webarena).[^2] All experiments were conducted between August and October 2024, and all baseline numbers were taken from each benchmark's leaderboard as of October 21, 2024.[^2] Two variants of Magentic-One are reported: an all-GPT-4o configuration and a heterogeneous configuration combining GPT-4o for multimodal agents with o1-preview for the Orchestrator's outer loop and the Coder. The GPT-4o/o1 variant was not run on WebArena because o1 refused to complete 26% of WebArena's GitLab tasks and 12% of its shopping-administration tasks, citing safety concerns; the authors judged a fair comparison impossible under those conditions.[^2]

The headline numbers, expressed as exact task-completion percentages with Wald 95% confidence intervals, are summarized below. The comparison baselines are the highest-scoring leaderboard entries on each benchmark at the time of writing, along with GPT-4 as a single-model baseline and reported human performance where available.

| Method | GAIA (test) | AssistantBench (EM) | AssistantBench (accuracy) | WebArena |
|---|---|---|---|---|
| omne v0.1 (GPT-4o, o1) | 40.53 ± 5.6 | - | - | - |
| Trase Agent v0.2 (GPT-4o, o1, Gemini) | 39.53 ± 5.5 | - | - | - |
| Multi Agent (n/a) | 38.87 ± 5.5 | - | - | - |
| das agent v0.4 (GPT-4o) | 38.21 ± 5.5 | - | - | - |
| Sibyl (GPT-4o) | 34.55 ± 5.4 | - | - | - |
| HF Agents (GPT-4o) | 33.33 ± 5.3 | - | - | - |
| FRIDAY (GPT-4T) | 24.25 ± 4.8 | - | - | - |
| GPT-4 + plugins | 14.60 ± 4.0 | - | - | - |
| SPA → CB (Claude) | - | 13.8 ± 5.0 | 26.4 ± 6.4 | - |
| SPA → CB (GPT-4T) | - | 9.9 ± 4.3 | 25.2 ± 6.3 | - |
| Infogent (GPT-4o) | - | 5.5 ± 3.3 | 14.5 ± 5.1 | - |
| Jace.AI (n/a) | - | - | - | 57.1 ± 3.4 |
| WebPilot (GPT-4o) | - | - | - | 37.2 ± 3.3 |
| AWM (GPT-4) | - | - | - | 35.5 ± 3.3 |
| SteP (GPT-4) | - | - | - | 33.5 ± 3.2 |
| BrowserGym (GPT-4o) | - | - | - | 23.5 ± 2.9 |
| GPT-4 (single model) | 6.67 ± 2.8 | 6.1 ± 3.5 | 16.5 ± 5.4 | 14.9 ± 2.4 |
| Human | 92.00 ± 3.1 | - | - | 78.2 ± 2.8 |
| **Magentic-One (GPT-4o)** | **32.33 ± 5.3** | **11.0 ± 4.6** | **25.3 ± 6.3** | **32.8 ± 3.2** |
| **Magentic-One (GPT-4o, o1)** | **38.00 ± 5.5** | **13.3 ± 4.9** | **27.7 ± 6.5** | (not run) |
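
The interval widths in the table can be reproduced from the Wald formula, whose half-width is $z\sqrt{p(1-p)/n}$. The short check below uses the 812-task size of WebArena stated in this article. The function name is an assumption made for the example.

```python
from math import sqrt
from statistics import NormalDist


def wald_half_width(success_pct: float, n_tasks: int, confidence: float = 0.95) -> float:
    """Half-width of the Wald interval for a proportion, in percentage points."""
    p = success_pct / 100.0
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    return 100.0 * z * sqrt(p * (1.0 - p) / n_tasks)


# Magentic-One (GPT-4o) completed 32.8% of WebArena's 812 tasks.
print(f"32.8 ± {wald_half_width(32.8, 812):.1f}")  # prints "32.8 ± 3.2"
```

The same function gives ±3.4 for Jace.AI (57.1%) and ±2.8 for the human baseline (78.2%), matching the WebArena column.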

Several patterns are worth highlighting. On **GAIA**, the heterogeneous Magentic-One sits within a tight statistical cluster around the leaderboard top: it trails the leader, omne v0.1, by about 2.5 points (38.00 versus 40.53), well inside either system's confidence interval of roughly ±5.5. The GAIA gain from swapping in o1-preview is larger than on AssistantBench, which the authors attribute to GAIA's heavier emphasis on logical reasoning and puzzle-solving, the very skills that o1 was optimized for.[^2] On **AssistantBench**, Magentic-One (GPT-4o, o1) reaches 27.7% accuracy and 13.3% exact match, statistically comparable to the SPA → CB ensembles, which had previously been purpose-built for the benchmark.[^2] On **WebArena**, the GPT-4o variant reaches 32.8% across all 812 tasks, in the same statistical band as AWM, SteP, and most other open-source systems but below WebPilot (37.2%) and Jace.AI (57.1%), both of which the paper notes are heavily tuned for WebArena.[^2]

The paper draws particular attention to the fact that **no prior system other than the base models had been evaluated on all three benchmarks**.[^2] Magentic-One's value proposition is therefore not best-in-class performance on any single benchmark but consistent, statistically competitive performance across all three using the same unmodified configuration. The category breakdown reinforces the point: Magentic-One tends to compete relatively better on hard tasks than easy ones. On AssistantBench it actually outperforms the best comparable baseline on the hardest category, and on WebArena the largest gap to WebPilot is concentrated in the Reddit category, the easiest by score.[^2] The authors hypothesize that Magentic-One introduces a fixed overhead that disproportionately benefits long multi-step tasks while creating extra failure surfaces on short ones.[^2]

### Ablations

The paper reports ablations on the GAIA validation set with the GPT-4o configuration, varying both the orchestrator and the agent roster.[^2] Replacing the full Orchestrator with AutoGen's simpler `GroupChat` mechanism, which only picks the next speaker and drops the ledgers, planning, progress tracking, loop detection, and explicit instructions, **reduces GAIA performance by 31%**.[^2] This is direct evidence that the Task Ledger / Progress Ledger orchestration accounts for a meaningful share of the system's score, not just the model choice.

Removing individual specialist agents reduces GAIA performance by between 21% (removing both Coder and ComputerTerminal) and 39% (removing the FileSurfer), with the FileSurfer being most essential for level-2 tasks (which often include file attachments) and the WebSurfer being most essential for level-1 tasks.[^2] Interestingly, the team sometimes **compensates for missing capabilities**: when both the Coder and ComputerTerminal were removed, the remaining agents occasionally solved code-requiring tasks by having the FileSurfer read and reason over the code to predict the answer; when the FileSurfer was removed, the agents searched for online PDF viewers to read attachments. These improvisations are limited but indicate that the multi-agent design does not collapse entirely under partial agent failure.[^2]

### Error analysis

The paper also includes an automated error analysis. GPT-4o was used to distill each task log into a postmortem and to apply a clustered code-book of failure modes across the validation logs.[^2] The most common failure modes were **persistent-inefficient-actions** (agents repeating an unsuccessful action without modifying their strategy), **insufficient-verification-steps** (marking tasks complete without validating the result), **inefficient-navigation-attempts** (cycling through tabs and menus instead of reaching the target page directly), and **underutilized-resource-options** (failing to use available tools fully).[^2] WebArena logs were especially affected by inefficient navigation, consistent with that benchmark's emphasis on dense, custom web UIs.[^2]

## Comparison to other multi-agent systems

Magentic-One occupies a specific niche in the multi-agent landscape: a **centralized, ledger-driven, tool-centric** team built directly on the [AutoGen](/wiki/autogen) substrate. Several other systems sit nearby but make distinct architectural choices.

| System | Default control flow | Agent decomposition | Typical use |
|---|---|---|---|
| **Magentic-One** | Centralized; Orchestrator with Task Ledger + Progress Ledger | Tool-centric (browser, files, code, terminal) | Generalist web + file + code tasks |
| [AutoGen](/wiki/autogen) (base) | Configurable; `GroupChat`, `Swarm`, custom selectors | Application-defined | General multi-agent infrastructure |
| OpenAI Swarm (precursor to [OpenAI Agents SDK](/wiki/openai_agents_sdk)) | Decentralized handoffs; stateless routines | Role/handoff-centric | Lightweight agent orchestration |
| [CrewAI](/wiki/crewai) | Sequential or hierarchical processes | Role-centric (researcher, writer, etc.) | Business-process automation |
| [LangGraph](/wiki/langgraph) | Explicit graph; nodes and edges are user-defined | Application-defined | Stateful agent workflows |
| AG2 | Centralized or graph-based | Application-defined | Successor to AutoGen v0.2 lineage |

Compared with **OpenAI Swarm** (which later evolved into the [OpenAI Agents SDK](/wiki/openai_agents_sdk)), Magentic-One is far more opinionated. Swarm-style systems rely on lightweight stateless routines and handoffs in which the next agent is chosen by the previous agent. Magentic-One instead funnels all routing decisions through a stateful Orchestrator with structured ledgers, and gives the Orchestrator explicit primitives for stall detection and re-planning.[^2] The trade-off is overhead: Magentic-One's outer loop is much heavier than a Swarm handoff, but it is designed to recover from errors more reliably on long-horizon tasks.

Compared with **[CrewAI](/wiki/crewai)**, the most visible difference is the axis of decomposition. CrewAI typically organizes teams by human-style roles (researcher, writer, analyst, editor), each of which can plausibly need to browse, write code, or read files. Magentic-One organizes teams by tools (one agent owns the browser, another owns code, another owns files), which the authors argue gives a cleaner path to reuse and avoids duplicated capabilities across roles.[^2]

Compared with **[LangGraph](/wiki/langgraph)**, Magentic-One is opinionated about both the team and the orchestration loop. LangGraph hands developers a general state-machine abstraction and lets them define the graph from scratch, including the structure of any internal ledgers; Magentic-One ships a fixed graph (outer loop, inner loop, stall counter, five-agent team) tuned for open-ended task completion.

Compared with **AG2** (a community fork of AutoGen that continued the v0.2 lineage after splitting from Microsoft's repository in late 2024), Magentic-One remains tied to Microsoft's mainline AutoGen v0.4+ branch and ships natively in `autogen-agentchat`.[^6] The two ecosystems share substantial code heritage but have evolved separate APIs since the fork.

Finally, compared with WebPilot, the strongest open-source WebArena baseline in the paper, Magentic-One trades raw WebArena score for cross-benchmark generality: WebPilot was not evaluated on GAIA or AssistantBench, while Magentic-One is statistically competitive on all three.[^2]

## Safety design

Because Magentic-One agents take real actions in real environments, including the public web, the paper devotes a substantial discussion section to risks and mitigations.[^2] The discussion covers two lines of mitigation, a catalogue of misbehaviors observed during development, and a set of anticipated risks.

**Containerization and synthetic environments.** All experiments in the paper run inside Docker containers controlled by AutoGenBench, which initializes each task from a known clean state and prevents side effects from one task carrying over to another.[^2] The synthetic environments used in WebArena allow risky actions (login attempts, posting comments, modifying carts) to be evaluated without touching live third-party sites. The MultimodalWebSurfer in production deployments is similarly expected to run inside containerized browsers; the AutoGen documentation explicitly recommends running Magentic-One inside Docker and warns that the system can perform irreversible actions if not sandboxed.[^6]

**Model alignment and content filters.** The authors rely on strong alignment of the underlying models (GPT-4o, o1-preview) and on pre- and post-generation filters as a baseline defense against unsafe outputs.[^2] They explicitly call out the **crescendo multi-turn jailbreak** as a class of attack particularly worth worrying about in multi-agent settings, because a malicious or accidentally-prompted intermediate agent could escalate requests across turns in ways that defeat single-turn guardrails.[^2]

**Observed misbehaviors.** The paper transparently catalogues several real misbehaviors observed during development. In one case, a misconfiguration prevented the agents from logging into a WebArena site, so they repeatedly retried until the account was temporarily suspended, at which point they began attempting password resets. In another case, agents correctly noticed that WebArena's Postmill instance was not the real Reddit and tried to direct the team to live Reddit; this was blocked at the network layer. Agents also routinely accepted cookie agreements and terms-of-service prompts without human oversight (though they correctly refused captchas), and in a small number of cases attempted to **recruit humans** for help, by drafting social-media posts, emails to textbook authors, or even a freedom-of-information request to a government entity.[^2] Each of these attempts was blocked by the lack of the relevant tooling or by human observers, but the authors use them to argue that any production deployment must follow a strict **principle of least privilege**.[^2]

**Anticipated risks.** The paper also anticipates that web agents will face the same phishing, social-engineering, and misinformation attacks that target human users, and that attackers may seed external content with prompt-injection payloads targeted at agentic systems. The authors draw attention to the asymmetry between easily reversible, effortfully reversible, and irreversible actions, recommending that systems pause and seek human input before irreversible actions such as sending emails or deleting files. This framing is reflected in the action-guard design of the later Magentic-UI.[^2][^4]
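
The reversibility distinction lends itself to a simple gate in front of every action. The sketch below is illustrative only and does not reproduce Magentic-UI's action guards: the action names, the classification table, and the `guarded` helper are all hypothetical.

```python
from enum import Enum
from typing import Callable


class Reversibility(Enum):
    EASY = "easily reversible"
    EFFORTFUL = "effortfully reversible"
    IRREVERSIBLE = "irreversible"


# Hypothetical classification table; a real policy would be far richer.
ACTION_CLASSES = {
    "scroll": Reversibility.EASY,
    "open_url": Reversibility.EASY,
    "edit_file": Reversibility.EFFORTFUL,
    "send_email": Reversibility.IRREVERSIBLE,
    "delete_file": Reversibility.IRREVERSIBLE,
}


def guarded(
    action: str,
    execute: Callable[[], str],
    approve: Callable[[str], bool],
) -> str:
    """Run `execute` unless the action is irreversible and the human declines."""
    # Unknown actions default to irreversible, in the spirit of least privilege.
    level = ACTION_CLASSES.get(action, Reversibility.IRREVERSIBLE)
    if level is Reversibility.IRREVERSIBLE and not approve(action):
        return f"blocked: {action}"
    return execute()


def deny(action: str) -> bool:
    """Stand-in for a human reviewer who rejects every request."""
    return False


print(guarded("scroll", lambda: "scrolled", deny))
print(guarded("send_email", lambda: "sent", deny))
```

Defaulting unknown actions to the most restrictive class means that a newly added tool cannot bypass the pause until someone classifies it.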

## Follow-on work

Although Magentic-One was released as a fixed five-agent team, both Microsoft Research and the wider AutoGen community continued to build on it through 2025.

**Magentic-UI (May 2025).** Microsoft Research released Magentic-UI as an experimental human-centered web agent on May 19, 2025.[^3][^4] Magentic-UI is described as building on Magentic-One and is powered by AutoGen.[^3][^4] The architecture inherits the same core agent roles (Orchestrator, WebSurfer, Coder, FileSurfer) but adds a UserProxy agent that represents the human in the loop, and is fronted by a browser-based interface. The accompanying technical report, "Magentic-UI: Towards Human-in-the-loop Agentic Systems," was published in July 2025 (arXiv:2507.22358).[^7] Magentic-UI introduces six interaction mechanisms designed to give humans low-cost levers over an otherwise autonomous system:[^3][^4]

- **Co-planning**, in which the user can edit the Orchestrator's draft plan before execution begins.
- **Co-tasking**, in which the user can pause execution and take manual control of the browser mid-task.
- **Multi-tasking**, supporting parallel sessions.
- **Action guards**, which require explicit user confirmation for actions deemed irreversible (closing a tab, clicking a button with side effects, submitting a form).
- **Plan learning**, allowing users to save successful plans for reuse on similar tasks.
- **Long-term memory**, persisting learned facts and preferences across tasks.

Magentic-UI also exposes its tool surface through the Model Context Protocol (MCP), so external tool servers can plug into the agent team without changes to the orchestrator.[^4] Magentic-UI is open source and shipped at `github.com/microsoft/magentic-ui`.[^3]

**`MagenticOneGroupChat` in `autogen-agentchat`.** As discussed above, the original Magentic-One reference code was rebuilt as a first-class AgentChat team during the AutoGen v0.4 stabilization process, making it interoperable with arbitrary AssistantAgent participants and with the rest of the AutoGen ecosystem.[^5][^6]

**Adoption in third-party frameworks.** The Task Ledger / Progress Ledger pattern was widely adopted in 2025 as a reference design for orchestration in long-horizon agentic systems, often under slightly different names. Both `MagenticOneGroupChat` itself and the underlying pattern can be wired into a wide variety of frontend frameworks, including AutoGen Studio.[^5]

**Continued research.** The paper enumerates several limitations that became active areas of follow-up: high cost and latency from many LLM calls, the Coder's lack of stateful execution (a Jupyter-style notebook model would help), the fixed team roster (dynamic team composition), and the absence of cross-task learning (long-term memory). Magentic-UI directly addresses the last with its plan-learning and long-term memory features.[^2][^4]

## License

The Magentic-One source code is released under the **MIT license**, in line with the rest of the AutoGen project.[^5] This applies both to the original `autogen-magentic-one` package and to the modern `MagenticOneGroupChat` implementation in `autogen-agentchat`.[^5][^6] Magentic-UI is also released as open source under a permissive license.[^3] Model weights are not included; deployments must supply API access to a suitable foundation model such as GPT-4o or o1.

## Reception

Within the AI-research community, Magentic-One was received primarily as a reference design rather than a leaderboard champion. Its raw scores were statistically tied with several leaderboard systems but did not unambiguously top any benchmark; its real contribution was packaging a coherent, reproducible orchestration pattern with open-source code.[^2] The Task Ledger / Progress Ledger pattern, the tool-centric agent decomposition, and the stall-counter recovery mechanism each diffused into subsequent multi-agent designs.

In the engineering community, Magentic-One's most immediate impact was on AutoGen itself. The release helped position AutoGen v0.4 as the de facto Microsoft-supported framework for shipping autonomous multi-agent systems and gave AutoGen a flagship application against which to evaluate API decisions. By 2025, `MagenticOneGroupChat` had become the canonical way to demonstrate AutoGen's team abstractions.[^6]

## See also

- [AutoGen](/wiki/autogen) - the Microsoft multi-agent framework that hosts Magentic-One
- [CrewAI](/wiki/crewai) - a role-centric multi-agent framework often compared with Magentic-One
- [LangGraph](/wiki/langgraph) - a state-machine framework for agentic workflows
- [Agno](/wiki/agno) - a competing open-source multi-agent framework
- [Semantic Kernel](/wiki/semantic_kernel) - Microsoft's other agent orchestration framework
- [OpenAI Agents SDK](/wiki/openai_agents_sdk) - OpenAI's successor to Swarm
- [GAIA benchmark](/wiki/gaia_benchmark) - the generalist agent benchmark on which Magentic-One was evaluated
- [WebArena](/wiki/webarena) - the synthetic web-task benchmark on which Magentic-One was evaluated
- [Microsoft](/wiki/microsoft) - the organization behind Magentic-One

## References

[^1]: Microsoft Research, "Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks." Microsoft Research articles, November 4, 2024. https://www.microsoft.com/en-us/research/articles/magentic-one-a-generalist-multi-agent-system-for-solving-complex-tasks/ (Accessed 2026-05-20).

[^2]: Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang (Eric) Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, Saleema Amershi. "Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks." arXiv:2411.04468, November 7, 2024. https://arxiv.org/abs/2411.04468 (Accessed 2026-05-20).

[^3]: Microsoft Research, "Magentic-UI, an experimental human-centered web agent." Microsoft Research blog, May 19, 2025. https://www.microsoft.com/en-us/research/blog/magentic-ui-an-experimental-human-centered-web-agent/ (Accessed 2026-05-20).

[^4]: Microsoft, "magentic-ui: A research prototype of a human-centered web agent." GitHub repository. https://github.com/microsoft/magentic-ui (Accessed 2026-05-20).

[^5]: Microsoft, "autogen-magentic-one: README." AutoGen repository, GitHub. https://github.com/microsoft/autogen/blob/main/python/packages/autogen-magentic-one/README.md (Accessed 2026-05-20).

[^6]: Microsoft, "Magentic-One." AutoGen AgentChat user guide. https://microsoft.github.io/autogen/stable/user-guide/agentchat-user-guide/magentic-one.html (Accessed 2026-05-20).

[^7]: Hussein Mozannar et al., "Magentic-UI: Towards Human-in-the-loop Agentic Systems." arXiv:2507.22358, July 2025. https://arxiv.org/abs/2507.22358 (Accessed 2026-05-20).

[^8]: Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant. "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" arXiv:2407.15711, 2024. https://arxiv.org/abs/2407.15711 (Accessed 2026-05-20).