An AI coding agent is an artificial intelligence system that can autonomously plan, write, test, debug, and deploy software code with minimal human oversight. Unlike earlier AI coding tools that offered line-by-line code completion or responded to individual prompts in a chat interface, coding agents take a high-level goal, break it into steps, edit files across an entire codebase, run terminal commands, execute tests, and iterate on their own output until the task is complete. By 2026, AI coding agents have become the defining paradigm shift in software engineering, with products from Anthropic, OpenAI, Google, GitHub, and several startups competing for developer adoption.
The progression from basic code suggestions to fully autonomous coding agents unfolded over roughly five years, through a series of distinct phases.
The modern era of AI-assisted coding began with the release of GitHub Copilot as a technical preview on June 29, 2021. Built on OpenAI's original Codex model (a fine-tuned version of GPT-3), Copilot offered real-time code suggestions directly inside the editor. It reached general availability on June 21, 2022, priced at $10 per month for individual developers. During its preview period, 1.2 million developers signed up, and GitHub reported that Copilot was generating roughly 40% of newly written code in supported repositories. Other early tools in this category included Tabnine and Codeium (later renamed Windsurf), which launched its first beta product as an IDE extension in October 2022.
These early tools operated within strict boundaries. They could suggest the next few lines of code based on the current file and cursor position, but they had no ability to reason about the broader project, run commands, or verify their own output. The developer remained fully responsible for evaluating, accepting, and testing every suggestion.
The launch of ChatGPT in late 2022 made conversational interaction with large language models mainstream. GitHub responded in March 2023 with Copilot X, which introduced a chat interface, pull request assistance, documentation help, and GPT-4 integration directly into the development environment. This shifted the developer-AI relationship from passive suggestion acceptance to active dialogue: developers could explain what they wanted, request refactors, ask for test generation, or have the model explain unfamiliar code. Cursor, an AI-native code editor built on the VS Code foundation, also gained significant traction during this period by offering a deeply integrated chat and editing experience.
Conversational assistants represented a significant improvement, but they still operated in a request-response pattern. The developer had to manually apply suggested changes, run tests themselves, and decide what to do next. The AI could advise, but it could not act.
Once tools began indexing entire repositories, AI coding assistants could answer questions about a project's architecture, track how changes propagate across files, and understand the relationships between modules. This was the point at which coding AI stopped behaving like a knowledgeable stranger and started behaving more like a team member who had actually read the codebase. Repository indexing created vector embeddings for code snippets, enabling semantic search across large projects and allowing the AI to ground its responses in the actual code rather than relying solely on its training data.
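The retrieval flow behind repository indexing can be made concrete with a toy example. The sketch below stands in a learned neural embedding with a simple bag-of-words vector — the file paths, snippets, and `embed` function are all invented for illustration — but the pipeline (embed each snippet, embed the query, rank by cosine similarity) mirrors what production indexers do:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words vector. Real indexers use learned
    neural embeddings, but the retrieval logic is the same."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Index: one vector per code snippet, keyed by file location.
snippets = {
    "auth/login.py": "def verify_password(user, password): check the hash",
    "db/session.py": "def open_session(url): connect to the database",
    "auth/token.py": "def issue_token(user): create a signed jwt token",
}
index = {path: embed(src) for path, src in snippets.items()}

def search(query: str, k: int = 2) -> list[str]:
    """Return the k snippet locations most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda p: cosine(q, index[p]), reverse=True)
    return ranked[:k]

print(search("how are jwt tokens issued"))
```

Grounding answers in retrieved snippets like these, rather than in training data alone, is what lets the assistant answer questions about code it has never been trained on.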
The transition to full autonomy began in 2024. On March 12, 2024, Cognition introduced Devin, which it marketed as the "world's first AI software engineer." Devin demonstrated the ability to plan and execute complex engineering tasks, use a shell, code editor, and browser within a sandboxed environment, and even complete freelance jobs on Upwork. While its initial SWE-bench score of 13.86% was modest by later standards, it represented a massive leap from the previous state of the art of 1.96% and catalyzed the entire industry to pursue agentic approaches. Cognition raised $21 million in funding at a $350 million valuation.
By mid-2025, every major AI company had launched or announced a coding agent product. Claude Code entered research preview in February 2025 and reached general availability in May 2025. OpenAI Codex launched as a cloud-based agent on May 16, 2025. GitHub announced its Copilot coding agent at Microsoft Build on May 19, 2025. Google released Jules into public preview in May 2025 and made it generally available in August 2025. Amazon Web Services shipped major agentic updates to Amazon Q Developer in April and May 2025.
AI coding agents share a common architectural pattern that distinguishes them from simpler code completion tools. The core execution loop can be broken into several stages.
When an agent begins a task, it first explores the repository to build an internal representation of the project. This typically involves reading key files, searching for relevant code patterns, listing directory structures, and analyzing import graphs. Some agents create vector embeddings of code snippets for semantic search, while others rely on the large language model's context window to hold relevant file contents directly. Modern agents can process hundreds of thousands of tokens of context, allowing them to reason about large, multi-file projects.
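As a minimal illustration of the import-graph part of this exploration step, the sketch below maps a toy two-file project (the file contents are invented) to the modules each file depends on, using Python's `ast` module:

```python
import ast

def imports_of(source: str) -> set[str]:
    """Extract the top-level module names imported by a Python source file."""
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found

# A toy two-file "repository", mapped from path to source text.
repo = {
    "app.py": "import db\nfrom utils import helpers\n",
    "db.py": "import sqlite3\n",
}
graph = {path: imports_of(src) for path, src in repo.items()}
print(graph)  # each file mapped to the modules it depends on
```

An agent that has built such a graph knows, for example, that a change to `db.py` may affect `app.py`, which is exactly the propagation reasoning described above.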
The quality of codebase understanding directly affects the quality of the agent's output. An agent that misunderstands the project's architecture may propose changes that conflict with existing patterns, break dependencies, or duplicate functionality that already exists elsewhere in the codebase.
After gathering context, the agent formulates a structured plan. This plan typically includes the goal, relevant files, implementation steps, verification criteria, and potential risks. More advanced agents use a multi-agent architecture where a planner agent creates the strategy and an executor agent carries it out. Some products, such as Devin 2.0, expose the plan to the user for review and collaborative refinement before execution begins. Devin 2.0's Interactive Planning feature allows developers to start with broad or incomplete ideas and collaborate with the agent to scope out a detailed task plan.
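A plan containing those elements might be represented as a simple data structure. The field names and the sample task below are hypothetical, not any product's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TaskPlan:
    """One possible shape for an agent's structured plan; the fields
    mirror the elements described above and are illustrative."""
    goal: str
    relevant_files: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)
    verification: list[str] = field(default_factory=list)
    risks: list[str] = field(default_factory=list)

plan = TaskPlan(
    goal="Add rate limiting to the login endpoint",
    relevant_files=["auth/login.py", "tests/test_login.py"],
    steps=[
        "Add a token-bucket limiter to auth/login.py",
        "Wire the limiter into the login handler",
        "Add tests covering the over-limit path",
    ],
    verification=["pytest tests/test_login.py passes"],
    risks=["Limiter state must be shared across worker processes"],
)
```

Exposing an object like this to the user before execution, as Devin 2.0's Interactive Planning does, turns the plan itself into a reviewable artifact.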
The agent modifies source code by creating, editing, or deleting files across the project. Edits are applied programmatically through tool calls that the underlying language model makes during its reasoning process. Agents track which files have been changed and how those changes affect other parts of the codebase. Modern agents can perform complex multi-file refactors, such as renaming a function and updating all of its call sites across dozens of files simultaneously.
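A multi-file rename of the kind described can be sketched as a single pass over an in-memory file map. Production agents use syntax-aware edits rather than regular expressions, and the `rename_symbol` helper and toy repo here are invented for illustration:

```python
import re

def rename_symbol(files: dict[str, str], old: str, new: str) -> dict[str, str]:
    """Rename a function across every file, touching only whole-word
    matches so that e.g. 'area' does not match inside 'surface_area'."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    return {path: pattern.sub(new, src) for path, src in files.items()}

repo = {
    "geometry.py": "def area(w, h):\n    return w * h\n",
    "main.py": "from geometry import area\nprint(area(3, 4))\n",
}
updated = rename_symbol(repo, "area", "rect_area")
print(updated["main.py"])  # call site updated along with the definition
```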
A defining capability of coding agents (as opposed to chat assistants) is the ability to run shell commands. Agents execute build tools, package managers, linters, type checkers, database migrations, and other command-line utilities. This allows them to install dependencies, compile code, and interact with the development environment just as a human developer would. For cloud-based agents like OpenAI Codex, each task runs in an isolated sandbox environment where the agent has full access to a terminal but internet access is restricted to prevent unintended side effects.
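The command-execution tool at the heart of this capability can be approximated in a few lines. This sketch captures combined output for the model to read, enforces a timeout, and never raises on a nonzero exit (the agent reasons about failures instead); real sandboxes add filesystem and network isolation on top:

```python
import subprocess
import sys

def run_command(cmd: list[str], timeout: int = 60) -> tuple[int, str]:
    """Run one command the way an agent tool call might: capture stdout
    and stderr together, enforce a timeout, return (exit_code, output)."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
        return result.returncode, result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return -1, f"timed out after {timeout}s"

code, output = run_command([sys.executable, "--version"])
print(code, output.strip())
```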
After making changes, agents run the project's test suite to verify correctness. If tests fail, the agent reads the error output, diagnoses the problem, applies a fix, and re-runs the tests. This feedback loop continues until all tests pass or the agent determines it cannot resolve the issue. OpenAI's codex-1 model was specifically optimized for this iterative pattern, trained to keep running tests until achieving passing results. This ability to self-correct through test feedback is what separates agents from earlier code generation tools, which had no mechanism to verify or improve their own output.
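The test-and-fix loop can be expressed as a small control structure. In this sketch, `run_tests` and `propose_fix` are stubs standing in for a real test runner and a real model call:

```python
def fix_until_green(run_tests, propose_fix, max_iters=5):
    """The agent feedback loop in miniature: run tests, feed the failure
    log to the model, apply its patch, repeat until green or give up."""
    for attempt in range(1, max_iters + 1):
        passed, log = run_tests()
        if passed:
            return attempt          # number of iterations it took
        propose_fix(log)
    return None                     # surface the failure to a human

# Stub harness standing in for a real test suite and model call.
state = {"bug": True}
def run_tests():
    return (not state["bug"], "AssertionError in test_total" if state["bug"] else "")
def propose_fix(log):
    state["bug"] = False            # pretend the model's patch fixed it

attempts = fix_until_green(run_tests, propose_fix)
print(attempts)  # → 2
```

The `max_iters` cap matters in practice: without it, an agent can burn compute looping on a failure it cannot actually fix.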
Some agents perform additional quality checks before presenting their work. GitHub Copilot's coding agent, for example, runs its own code review, code scanning, secret scanning, and dependency vulnerability checks before opening a pull request. If it detects an issue (such as a committed API key or a dependency with a known vulnerability), it flags the problem and attempts a fix before presenting the result to the developer. Cursor 2.0 also includes an AI-powered code review feature that identifies bugs and issues in changes made by the agent.
The following table summarizes the major AI coding agent products as of early 2026.
| Product | Developer | Launch Date | Interface | Underlying Model | Key Capability |
|---|---|---|---|---|---|
| Claude Code | Anthropic | Feb 2025 (preview), May 2025 (GA) | Terminal (CLI), Web, IDE extensions | Claude (Sonnet, Opus) | Terminal-native agent with file editing, test execution, and git workflow integration |
| Codex (agent) | OpenAI | May 16, 2025 (preview) | ChatGPT web app, Codex CLI | codex-1 (o3 variant), GPT-5.2-Codex | Cloud sandbox per task; parallel task execution |
| Copilot coding agent | GitHub / Microsoft | May 19, 2025 (Build announcement) | GitHub.com, VS Code, CLI | Multiple (model picker) | Issue-to-PR automation; self-review with security scanning |
| Cursor Agent | Anysphere | 2024 (agent mode) | Cursor IDE (VS Code fork) | Composer (proprietary), plus third-party models | Up to 8 parallel agents; browser testing tool |
| Devin | Cognition | Mar 12, 2024 (announcement), Dec 2024 (GA) | Web-based cloud IDE | Proprietary | Full cloud sandbox with shell, editor, and browser; interactive planning |
| Windsurf Cascade | Windsurf (Cognition) | Nov 2024 (editor launch) | Windsurf IDE | Multiple models | Agentic flows with checkpoints, autonomous memory, live preview |
| Amazon Q Developer Agent | AWS | Apr-May 2025 (agent updates) | IDE plugins, CLI | Proprietary | Code transformation (Java 8 to 17, .NET upgrades); AWS integration |
| Jules | Google | May 2025 (preview), Aug 2025 (GA) | Web, Jules Tools CLI | Gemini 2.5 Pro | Async cloud execution; GitHub integration; audio changelogs |
Claude Code is Anthropic's terminal-native coding agent. It runs directly in the developer's terminal and can read code, edit files, run tests, execute shell commands, and manage git workflows including committing and pushing code. Anthropic launched Claude Code on the web in October 2025, making it accessible through browsers without requiring terminal access. The product supports a "Skills" system that lets users customize agent behaviors for specific tasks. Claude Code reached a $1 billion annualized run rate within six months of its May 2025 general availability launch, overtaking both GitHub Copilot and Cursor as the most-used AI coding tool by early 2026 according to Anthropic's survey data.
OpenAI Codex is a cloud-based software engineering agent that operates within the ChatGPT interface and through its open-source CLI. Powered by codex-1, a version of o3 optimized for software engineering, each task runs in its own isolated cloud sandbox preloaded with the repository. Codex can write features, fix bugs, answer codebase questions, and propose pull requests. Task completion typically takes between one and thirty minutes. By late 2025, OpenAI consolidated coding capabilities into the GPT-5 family, with GPT-5.2-Codex becoming the default model for code generation and repo-scale reasoning.
GitHub's coding agent is embedded directly into the GitHub platform and VS Code. Developers assign a GitHub issue to Copilot, and the agent autonomously creates a draft pull request, pushing commits as it works. The agent includes a model picker that lets users choose between different models for different tasks. It performs self-review using Copilot code review before opening pull requests, and runs code scanning, secret scanning, and dependency vulnerability checks. GitHub also introduced custom agents through files in the .github/agents/ directory, allowing teams to codify specific approaches such as performance optimization workflows.
Devin, developed by Cognition, was the product that popularized the concept of an AI software engineer. With Devin 2.0, released in April 2025, the starting price dropped from $500 to $20 per month. The new version introduced Interactive Planning, a cloud-based IDE for spinning up multiple parallel Devin instances, and an 83% improvement in task completion per Agent Compute Unit (ACU) compared to the original version. Each ACU corresponds to approximately 15 minutes of active Devin work. In July 2025, Cognition acquired Windsurf for approximately $250 million and subsequently raised $400 million at a $10.2 billion valuation.
Pricing models vary significantly across products. Some are bundled with existing subscriptions while others use credit-based systems.
| Product | Free Tier | Individual Paid Plan | Team/Business Plan | Enterprise |
|---|---|---|---|---|
| Claude Code | N/A | $20/mo (Pro) or $100-$200/mo (Max) | $150/user/mo (Teams) | Custom |
| OpenAI Codex | Included (limited time) | $20/mo (ChatGPT Plus) | $25/user/mo (Business) | Custom (Enterprise) |
| GitHub Copilot | Free tier (limited) | $10/mo (Pro), $39/mo (Pro+) | $19/user/mo (Business) | $39/user/mo |
| Cursor | Hobby (free, limited) | $20/mo (Pro), $60/mo (Pro+), $200/mo (Ultra) | $40/user/mo (Teams) | Custom |
| Devin | N/A | $20/mo (Core, 9 ACUs) | $500/mo (Team, 250 ACUs) | Custom |
| Windsurf Cascade | 25 credits/mo | $15/mo (Pro, 500 credits) | $30/user/mo (Teams) | $60/user/mo |
| Amazon Q Developer | Free tier (50 agent requests/mo) | $19/user/mo (Pro) | $19/user/mo (same as Pro) | Custom |
| Jules | 15 tasks/day | $19.99/mo (AI Pro) | $124.99/mo (AI Ultra) | Custom |
Note: Pricing as of early 2026. Plans and rates change frequently. For API-based usage, costs vary by model. For example, Claude API pricing ranges from $1 per million input tokens (Haiku) to $5 per million input tokens (Opus 4.5), while OpenAI's codex-mini-latest is priced at $1.50 per million input tokens and $6 per million output tokens.
Several benchmarks have been developed to measure the capabilities of AI coding agents on realistic software engineering tasks.
SWE-bench is the most widely cited benchmark for evaluating AI coding agents. Created by Carlos E. Jimenez and colleagues at Princeton University, SWE-bench was published as a conference paper at ICLR 2024. The benchmark consists of 2,294 task instances drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Each task requires the agent to resolve a genuine open-source issue by producing a code patch that makes failing tests pass.
The benchmark uses a "fail-to-pass" evaluation methodology: each task includes tests that fail without the fix and pass after the correct patch is applied. This grounds the evaluation in real-world software engineering practice rather than synthetic coding exercises.
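The methodology reduces to a simple predicate: the designated tests must fail on the unpatched repository and pass once the candidate patch is applied. In the sketch below, the "repository" is a dict and the buggy function, test, and patch are all invented for illustration:

```python
def fail_to_pass(run_tests, apply_patch):
    """A patch resolves a SWE-bench-style task only if the tests fail
    before the fix and pass after it."""
    if run_tests():              # tests must fail on the unpatched repo...
        return False             # ...otherwise they never exercised the bug
    apply_patch()
    return run_tests()           # ...and must pass after the fix

# Toy task: the "repo" is a dict and the patch corrects one function.
repo = {"add": lambda a, b: a - b}          # buggy implementation
run_tests = lambda: repo["add"](2, 3) == 5  # the fail-to-pass test
apply_patch = lambda: repo.update(add=lambda a, b: a + b)

resolved = fail_to_pass(run_tests, apply_patch)
print(resolved)  # → True
```

The pre-check on the unpatched repo is what distinguishes fail-to-pass grading from simply running a test suite: a test that already passed proves nothing about the patch.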
SWE-bench Verified is a curated subset of 500 problems from the original benchmark, released on August 13, 2024 through a collaboration between OpenAI and the original SWE-bench authors. The subset was created after the OpenAI team discovered that some problems in the original test set were ambiguous or unsolvable, leading SWE-bench to systematically underestimate model capabilities. 93 experienced Python developers manually reviewed and validated the 500 selected samples.
As of early 2026, top scores on SWE-bench Verified have exceeded 80%, with Claude Opus 4.5 achieving approximately 80.9%. However, OpenAI has raised concerns about training data contamination, finding that multiple frontier models could reproduce verbatim gold patches for certain SWE-bench Verified tasks. OpenAI has stopped reporting Verified scores and recommends SWE-bench Pro instead.
SWE-bench Pro was introduced to address contamination concerns and to test agents on longer-horizon, more complex tasks. Performance on SWE-bench Pro is substantially lower than on the Verified subset, which better reflects the difficulty of real-world software engineering. Top-performing models such as Claude Sonnet 4.5 and OpenAI GPT-5 achieve resolve rates in the 41-44% range on the public set, while scores on the private (previously unseen) subset drop further, with Claude Opus 4.1 falling from 22.7% to 17.8% and GPT-5 from 23.1% to 14.9%. This gap between public and private scores highlights the challenge of building agents that generalize to truly novel codebases.
Terminal-Bench is a benchmark developed jointly by Stanford University and the Laude Institute, first released in May 2025. Unlike SWE-bench, which focuses on resolving GitHub issues in Python repositories, Terminal-Bench evaluates whether AI agents can operate effectively inside real, sandboxed command-line environments across a broader range of tasks. The original release contained 80 human-verified, Dockerized tasks spanning coding, build and test management, data and ML workflows, systems and networking, security, and core CLI operations.
Terminal-Bench 2.0, released later in 2025, expanded to 89 carefully curated tasks designed to push the boundaries of agent capabilities. Each task comes with a dedicated Docker environment, a human-verified solution, and test cases to validate the agent's work. As of early 2026, Factory.ai's Droid agent holds the top score at 58.75%, followed by OpenAI's Codex CLI at 49.6%.
One of the most significant developments in 2025 was the shift toward background (asynchronous) execution. Rather than requiring a developer to watch an agent work in real time, background agents run tasks in cloud environments while the developer continues other work.
OpenAI's Codex exemplifies this pattern: each task runs in its own isolated cloud sandbox preloaded with the repository. The developer assigns a task, and the agent works independently for one to thirty minutes. When finished, it presents a pull request for review. The developer does not need to keep their editor open, watch the agent work, or stay online.
Google's Jules takes a similar approach, running tasks inside cloud VMs and supporting concurrent execution of multiple requests. Jules also introduced features like reusing previous setup configurations so that new tasks start faster, and audio changelogs that summarize what the agent changed. GitHub Copilot's coding agent works asynchronously too: a developer assigns a GitHub issue to Copilot, and the agent creates a draft pull request as it works, pushing commits along the way.
Devin operates in a fully autonomous cloud sandbox with its own shell, code editor, and browser. With Devin 2.0, users can spin up multiple Devin instances in parallel, each handling a different task simultaneously.
This asynchronous model changes the developer workflow fundamentally. Instead of writing code directly, a developer can queue up multiple tasks at the start of a workday and spend their time reviewing pull requests, providing feedback, and handling higher-level architectural decisions. The developer's role shifts from code writer to code reviewer and technical architect.
The Model Context Protocol (MCP) is an open standard introduced by Anthropic in November 2024 to standardize how AI systems integrate with external tools, data sources, and services. MCP addresses a fundamental challenge for coding agents: connecting to the diverse ecosystem of developer tools without building custom integrations for each one.
Before MCP, integrating an AI agent with N tools and M models required N × M individual integrations. MCP reduces this to N + M: each tool implements the MCP server protocol once, and each AI client implements the MCP client protocol once. As of 2026, tens of thousands of MCP servers are available for different tools and services.
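The integration arithmetic is easy to make concrete; the tool and client counts below are arbitrary examples:

```python
def integrations_needed(n_tools: int, m_clients: int, with_mcp: bool) -> int:
    """Point-to-point wiring needs one adapter per (tool, client) pair;
    with a shared protocol, each side implements it once."""
    return n_tools + m_clients if with_mcp else n_tools * m_clients

# e.g. 50 tools and 6 agent clients:
print(integrations_needed(50, 6, with_mcp=False))  # → 300
print(integrations_needed(50, 6, with_mcp=True))   # → 56
```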
In March 2025, OpenAI officially adopted MCP across its products, including the ChatGPT desktop app. Google DeepMind followed with its own adoption. In December 2025, Anthropic donated MCP to the Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation co-founded by Anthropic, Block, and OpenAI.
MCP has become particularly important for coding agents because software development involves interacting with many external systems: version control platforms, CI/CD pipelines, project management tools, databases, cloud infrastructure, and monitoring services. Through MCP, a coding agent can read from a Notion document, check a Jira ticket, query a database, and push to GitHub without requiring bespoke integrations for each service.
However, security researchers have identified concerns with MCP, including prompt injection vulnerabilities, overly broad tool permissions that could enable data exfiltration, and the possibility of lookalike tools silently replacing trusted ones. These security challenges are an active area of work within the MCP community and the broader Agentic AI Foundation.
Vibe coding is a closely related concept coined by computer scientist Andrej Karpathy, co-founder of OpenAI and former AI leader at Tesla, in February 2025. Karpathy described vibe coding as a practice where you "fully give in to the vibes, embrace exponentials, and forget that the code even exists." In this workflow, the programmer shifts from manually writing code to guiding, testing, and giving feedback on AI-generated output, often using voice transcription tools like SuperWhisper to minimize keyboard use.
Vibe coding represented an early articulation of the cultural shift that coding agents would accelerate. However, by early 2026, Karpathy himself declared vibe coding "passé," noting that LLMs had improved to the point where professional developers were using agent-based workflows with more oversight and scrutiny. He proposed the term "agentic engineering" instead: "agentic" because developers are not writing code directly 99% of the time, and "engineering" because the practice still requires systematic oversight, architectural thinking, and quality control.
The distinction between vibe coding and professional agentic engineering reflects an important tension in the field. Vibe coding works well for prototypes and personal projects where correctness is not critical. Production software, by contrast, requires the developer to review agent output carefully, verify correctness, and maintain architectural coherence across a codebase.
Enterprise adoption of AI coding agents accelerated dramatically through 2025 and into 2026.
According to the Stack Overflow 2025 Developer Survey, 84% of respondents were using or planning to use AI tools, with 51% of professional developers using AI daily. The JetBrains State of Developer Ecosystem 2025 found approximately 85% regular AI usage among developers, with 62% relying on at least one coding assistant or agent.
Code review agent adoption rose from 14.8% in January 2025 to 51.4% by October 2025, following the release of enterprise features by major vendors. Developers using AI coding tools reported saving an average of 3.6 hours per week, with daily AI users merging roughly 60% more pull requests. Early studies from GitHub, Google, and Microsoft found developers completing tasks 20% to 55% faster when using AI coding tools.
Gartner predicted that 40% of enterprise applications will feature task-specific AI agents by 2026, up from less than 5% in 2025. The global AI agent market was valued at $7.38 billion in 2025, nearly doubling from $3.7 billion in 2023, and is projected to grow to $103.6 billion by 2032.
However, adoption has not been without challenges. Research indicates that AI-assisted code can increase issue counts by approximately 1.7 times and introduce security findings if not paired with proper governance. Organizations deploying coding agents at scale have found it necessary to establish review processes, security scanning pipelines, and guidelines for when agent-generated code requires human verification.
Anthropic's 2026 Agentic Coding Trends Report found that 95% of professional developers use AI coding tools at least weekly, and 75% rely on AI for at least half their engineering work. Claude Code reached a $1 billion annualized run rate within six months of its May 2025 launch.
The AI coding agent market saw significant consolidation in 2025. The most notable transaction involved Windsurf, the AI-native IDE formerly known as Codeium. OpenAI had been in talks to acquire Windsurf for approximately $3 billion, but the deal collapsed after Microsoft reportedly blocked the bid due to concerns over exclusivity clauses in its partnership agreement with OpenAI.
Google DeepMind then hired Windsurf's co-founder and CEO Varun Mohan along with key research leaders. Days later, in July 2025, Cognition (maker of Devin) signed a definitive agreement to acquire what remained of Windsurf for approximately $250 million. Following the acquisition, Cognition closed a $400 million funding round in September 2025, valuing the company at $10.2 billion.
This rapid consolidation reflects the strategic importance of AI coding tools. With Windsurf reaching $82 million in annual recurring revenue and enterprise ARR doubling quarter-over-quarter before its split, the financial stakes in developer tooling have grown substantially.
The following table compares specific technical capabilities across the major coding agent products.
| Capability | Claude Code | OpenAI Codex | GitHub Copilot Agent | Cursor Agent | Devin | Jules |
|---|---|---|---|---|---|---|
| Autonomous file editing | Yes | Yes | Yes | Yes | Yes | Yes |
| Terminal command execution | Yes | Yes (sandboxed) | Yes | Yes | Yes | Yes (cloud VM) |
| Test execution and iteration | Yes | Yes | Yes | Yes | Yes | Yes |
| Background/async execution | Yes (via web) | Yes | Yes | No (local) | Yes | Yes |
| Parallel tasks | Yes | Yes | Yes | Up to 8 agents | Multiple Devins | Concurrent tasks |
| Git integration | Yes (commit, push, PR) | Yes (PR creation) | Yes (issue-to-PR) | Yes | Yes (PR creation) | Yes (GitHub) |
| Browser interaction | No | No (sandboxed) | No | Yes (built-in) | Yes (built-in) | No |
| Self-review/security scan | No | No | Yes | Yes (AI code review) | No | No |
| MCP support | Yes | Yes | Yes | Yes | Limited | Limited |
| Custom agent definitions | Yes (CLAUDE.md) | Yes | Yes (.github/agents/) | Yes (rules) | Yes | No |
Despite rapid progress, AI coding agents face several persistent technical challenges.
Context window limitations. Even with context windows exceeding 100,000 tokens, agents can struggle with very large codebases where understanding the full dependency graph requires more context than the model can hold at once. Techniques like retrieval-augmented generation and intelligent file selection help mitigate this, but agents can still miss relevant context in sprawling repositories.
Hallucination and incorrect code. Large language models can generate plausible-looking code that contains subtle bugs, references nonexistent APIs, or misuses libraries. The test-iterate loop helps catch many of these errors, but agents can sometimes produce code that passes tests while introducing latent issues that only surface in production.
Security vulnerabilities. Agent-generated code may introduce security flaws that are not caught by existing test suites. This is why tools like GitHub Copilot's coding agent now include built-in security scanning. Organizations deploying coding agents need to maintain robust security review processes.
Reproducibility and determinism. Because agents rely on probabilistic language models, the same prompt can produce different results on different runs. This makes debugging agent behavior difficult and can lead to inconsistent code quality.
Cost management. Running large language models for extended agent sessions can be expensive. A complex task that requires many iterations of editing, testing, and debugging may consume significant compute resources. Products use various approaches to manage costs, from credit-based pricing (Devin's ACUs, Windsurf's credits) to subscription tiers with usage limits. One tracked example showed 10 billion tokens consumed across eight months of Claude Code usage, which would have cost approximately $15,000 at API rates but cost $800 on a Max subscription.
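The arithmetic behind that example is straightforward. Note that the $1.50 blended per-million-token rate used below is inferred from the figures quoted, not a published price:

```python
def api_cost(tokens: int, rate_per_million: float) -> float:
    """Cost of a token volume at a flat per-million-token rate."""
    return tokens / 1_000_000 * rate_per_million

tokens = 10_000_000_000               # 10 billion tokens over eight months
print(api_cost(tokens, 1.50))         # → 15000.0 (approx. API cost)
print(8 * 100)                        # → 800 (eight months of a $100/mo plan)
```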
Architectural drift. When multiple agents or multiple sessions work on the same codebase without coordination, the resulting code can become inconsistent in style, patterns, and architectural decisions. Maintaining coherence across agent-generated changes requires clear project conventions and human oversight.
Several trends are shaping the next phase of AI coding agent development.
Multi-agent architectures are becoming more common, with coordinated teams of specialized agents dividing complex projects into parallel workstreams. Rather than a single agent handling an entire feature, one agent might handle the backend logic, another the frontend, and a third the test suite, with a coordinator agent managing the overall workflow.
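The fan-out-and-collect pattern a coordinator uses can be sketched with ordinary concurrency primitives; each "agent" below is just a placeholder function returning a summary string:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def run_team(subtasks: dict[str, Callable[[], str]]) -> dict[str, str]:
    """A coordinator dispatching specialized agents in parallel and
    collecting each one's result under its workstream name."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in subtasks.items()}
        return {name: f.result() for name, f in futures.items()}

results = run_team({
    "backend": lambda: "API endpoint implemented",
    "frontend": lambda: "form component added",
    "tests": lambda: "integration tests written",
})
print(results)
```

In a real system, each subtask would itself be a full agent loop running in its own sandbox, and the coordinator would also resolve conflicts between the resulting changes.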
The integration of coding agents into the full software development lifecycle is deepening, moving beyond code generation into code review, deployment, monitoring, and incident response. Amazon Q Developer's code transformation agents, which can perform full language upgrades such as migrating Java 8 applications to Java 17, point toward agents that handle large-scale maintenance and modernization tasks.
Improved reasoning capabilities in underlying models are enabling agents to tackle more complex, longer-horizon tasks. The gap between SWE-bench Verified scores (over 80%) and SWE-bench Pro scores (around 40%) suggests significant room for improvement on difficult, real-world engineering problems.
Finally, the ongoing development of standards like MCP and the establishment of the Agentic AI Foundation suggest that the ecosystem is moving toward greater interoperability, which would allow coding agents to integrate with an expanding universe of developer tools and services.