Programming
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v4 · 3,936 words
AI for programming refers to the use of artificial intelligence, and especially large language models, to write, edit, refactor, test, review, and explain source code. Tools in this category range from inline autocomplete plugins that suggest the next few tokens, to fully autonomous agents that read an entire repository, plan a change, run tests, and open a pull request without further human input. Since the first public release of GitHub Copilot in 2021, AI assisted coding has gone from a niche experiment to a default part of many professional software workflows.
This article is a gateway to the major categories of AI coding tools, the underlying LLMs trained on code, the benchmarks used to compare them, and the empirical research on whether they actually make developers faster.
The modern wave of AI coding tools is rooted in two developments. First, the 2017 transformer architecture made it practical to train very large language models on the kind of long sequences that source code requires. Second, public code hosts such as GitHub provided a corpus of billions of lines of code in many languages that could be used as training data, alongside natural language documentation.
The first widely used product of this combination was Tabnine, which launched the first deep learning code completion tool for Java in June 2018 and now reports about one million monthly active users. It was followed in 2021 by GitHub Copilot, built on a Codex model derived from GPT-3 and trained on public GitHub code. Copilot moved beyond single line completion and could generate multi line functions from a comment or a function signature.
From 2022 onward the field expanded along several fronts at once. General chat assistants such as ChatGPT and Claude became popular for code questions even though they were not built specifically for the editor. New AI native editors like Cursor and Windsurf reimagined the IDE around an LLM as the central abstraction. Open source models trained heavily on code, including Code Llama, DeepSeek-Coder, Qwen Coder, and StarCoder, gave developers and vendors local alternatives to the closed APIs from OpenAI, Anthropic, and Google. By 2024 and 2025 the dominant frontier was agentic coding: tools like Claude Code, Devin, and Aider that take a task description and run a multi step loop of read, plan, edit, test, and retry.
Inline assistants live inside an existing IDE such as VS Code, JetBrains, Vim, Emacs, or Xcode. They watch the file you are editing, plus some surrounding context from the project, and offer suggestions that you accept with a key press. Most of them also expose a chat panel for longer questions and refactoring requests.
| Tool | Vendor | Released | Underlying models | Notes |
|---|---|---|---|---|
| GitHub Copilot | GitHub / Microsoft | 2021 | OpenAI GPT, Anthropic Claude, Google Gemini | First mainstream AI coding tool. Multi model since 2024. |
| Tabnine | Tabnine | 2018 | In house and partner models | Originated AI code completion. Offers fully on premises and air gapped deployments. |
| Codeium | Exafunction (now Cognition) | 2022 | In house models, third party APIs | Free tier popular with individual developers. The company later launched Windsurf. |
| Sourcegraph Cody | Sourcegraph | 2023 | OpenAI, Anthropic, others | Tightly coupled to Sourcegraph code search and graph indexing for large codebases. |
| Supermaven | Supermaven | 2024 | In house Babble model | Focused on sub 200 ms latency and a 1 million token context window before being acquired by Cursor. |
| Continue | Continue Dev | 2023 | Any provider via config | Open source extension for VS Code and JetBrains, BYOK across many providers. |
These tools share a common pattern. They send a prompt that contains the current file, recently edited files, and sometimes a code map of the wider repository. The model returns a completion, which is rendered as ghost text. Suggestion acceptance rates are commonly cited around 30 percent for Copilot, though acceptance is a weak proxy for value because it does not measure whether the accepted code was correct or kept after review.
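The prompt assembly described above can be sketched roughly as follows. This is a minimal illustration, not any vendor's actual implementation: the function name `build_prompt`, the `# File:` headers, the character budget, and the ordering of sections are all assumptions made for the example.

```python
def build_prompt(current_file, recent_files, repo_map=None, budget_chars=4000):
    """Assemble a completion prompt from editor context (hypothetical layout).

    current_file: (path, text_before_cursor) for the file being edited.
    recent_files: list of (path, text) for recently edited files, newest first.
    repo_map:     optional short summary of the wider repository.
    """
    path, prefix = current_file
    # The text before the cursor always goes last so the model continues it.
    tail = f"# File: {path}\n{prefix}"
    remaining = budget_chars - len(tail)

    sections = []
    # Optional context is added only while it still fits in the budget.
    if repo_map and len(repo_map) + 1 <= remaining:
        sections.append(repo_map)
        remaining -= len(repo_map) + 1
    for other_path, text in recent_files:
        block = f"# File: {other_path}\n{text}"
        if len(block) + 1 > remaining:
            break  # out of budget: drop this and all older files
        sections.append(block)
        remaining -= len(block) + 1

    return "\n".join(sections + [tail])


prompt = build_prompt(
    ("a.py", "def f(x):\n    return "),
    [("b.py", "X = 1\n"), ("c.py", "y" * 5000)],
    repo_map="repo: a.py b.py c.py",
)
```

In this sketch the oversized `c.py` is dropped while the repository summary and `b.py` are kept. Real tools make the same kind of tradeoff with token counts and relevance ranking rather than a simple character budget.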
A newer category rebuilt the editor itself with an AI agent at the center, rather than bolting one onto VS Code as an extension. These products typically fork VS Code or build a new shell entirely.
Cursor is the most prominent. It was built by Anysphere, a startup founded in 2022 by four MIT graduates, and launched in March 2023. It is a fork of VS Code with native multi file editing, an agent mode that can run terminal commands, indexed search across the repository, and integration with frontier models from OpenAI, Anthropic, Google, and others. By 2026 Anysphere had reached more than two billion dollars in annual recurring revenue.
Windsurf launched in November 2024 from the team behind Codeium, originally branded as the first agentic IDE. Its main differentiator was Cascade, an agent that could plan, edit across multiple files, and execute terminal commands inside the editor. Windsurf had an unusual corporate history. In mid 2025 OpenAI agreed to acquire it for around three billion dollars, but the deal collapsed in July 2025. Google then licensed the technology and hired the leadership team for 2.4 billion dollars. Cognition acquired the remainder of the company shortly afterward, in July 2025, and now owns the Windsurf brand.
Zed is a newer editor written in Rust that emphasizes performance and collaborative editing, with built in AI features that talk to multiple LLM providers. Replit's web based environment offers Replit Agent, launched in September 2024, which lets users describe an app in natural language and have it scaffolded, deployed, and iterated on inside the browser. Replit Agent 4, released in 2026, runs independent subtasks in parallel and uses an internal browser based testing system to check its own work.
Coding agents are different from inline assistants in scope. Instead of suggesting the next few lines, they take an issue or a goal and execute a long running loop: read the codebase, plan, propose edits across many files, run tests, fix failures, and report back. Some run in the terminal, some run in the cloud, and some run as GitHub bots.
| Agent | Vendor | First release | Interface |
|---|---|---|---|
| Claude Code | Anthropic | February 2025 (research preview) | Terminal, IDE extensions, GitHub @claude |
| Codex CLI | OpenAI | April 16, 2025 | Terminal, ChatGPT cloud agent |
| Devin | Cognition AI | March 12, 2024 | Web app, Slack |
| Aider | Aider AI | 2023 | Terminal, any LLM |
| OpenHands | All Hands AI | 2024 (as OpenDevin) | Local, cloud |
| Cline | Cline | 2024 | VS Code extension |
| Roo Code | Roo Veterinary | 2024 (forked from Cline) | VS Code extension |
| Augment | Augment Code | 2024 | IDE, CLI, GitHub, Slack |
| Jules | Google | 2025 | Web, GitHub |
| Copilot coding agent | GitHub | May 17, 2025 | GitHub Issues |
Claude Code is Anthropic's terminal first agent. It reads the whole repository, edits across files, runs tests, commits, and uses MCP servers to talk to external systems. The harness was originally called the Claude Code SDK and was later renamed to the Claude Agent SDK because the same loop powers many other kinds of agents.
Devin, from Cognition, was the first coding agent marketed as a fully autonomous AI software engineer. It launched in March 2024 with a viral demo and a SWE Bench score of 13.86 percent, well above the previous state of the art of 1.96 percent at that time. Cognition went on to absorb Windsurf in July 2025 and was reportedly raising at a 25 billion dollar valuation in 2026.
Aider is a popular open source pair programming tool that runs in the terminal, builds a code map of the repo, edits files in place, and commits each change with a generated message. It works with most LLMs through standard APIs, including local models. As of 2026 it has more than 41 thousand GitHub stars and has been installed about 5.3 million times from PyPI.
OpenHands, formerly OpenDevin, is an open source platform for building general purpose software agents. Cline is a VS Code extension that runs an agent loop with explicit human approval for each step, which makes it easier to debug and supervise. Roo Code forked Cline in late 2024 to add a multi mode workflow with separate Code, Architect, Ask, and Debug modes. Augment Code targets large enterprise codebases with a Context Engine that maintains a live semantic index of the repo, dependencies, and architecture.
Underneath every tool is a model. Most leading coding tools rely on general purpose frontier LLMs, but a healthy ecosystem of code specialized open weight models exists as well.
| Model | Developer | Notes |
|---|---|---|
| GPT-4, GPT-4o, GPT-5 series | OpenAI | General models with strong code performance. Powers ChatGPT, Codex, Copilot. |
| Claude Sonnet, Opus, Haiku | Anthropic | Strong on long context refactors and agentic coding. Powers Claude Code and many Cursor subscriptions. |
| Gemini Pro, Flash | Google DeepMind | Long context window. Powers Gemini Code Assist and Jules. |
| Code Llama | Meta | Released August 2023 in 7B, 13B, and 34B sizes, with a 70B version added in January 2024. Built on Llama 2. |
| DeepSeek-Coder, DeepSeek-Coder-V2 | DeepSeek | Open weight Chinese model. V2 was the first open model to claim parity with GPT-4 Turbo on many code benchmarks. |
| Qwen Coder (Qwen2.5-Coder, Qwen3-Coder) | Alibaba | Open weight, sizes from 0.5B to 32B and beyond. Qwen2.5-Coder-32B-Instruct matched GPT-4o on coding benchmarks. |
| StarCoder, StarCoder2 | BigCode (Hugging Face, ServiceNow) | Trained on The Stack, an open source code dataset. StarCoder2 covers 619 programming languages. |
| Codestral | Mistral | French open weight model focused on code. |
These models are usually built by pre training a base model on a mixture of code and natural language, then post training it with instruction tuning and reinforcement learning from human feedback so that the model behaves well in a chat or agent loop. DeepSeek-Coder, for example, was pre trained on a 2 trillion token corpus that was 87 percent code and 13 percent natural language across 87 programming languages.
The gap between open weight code models and closed frontier models has narrowed sharply since 2023. Qwen2.5-Coder-32B-Instruct, released in 2024, matched GPT-4o on several benchmarks. DeepSeek-Coder-V2 was the first open weight model to claim parity with GPT-4 Turbo on a broad set of code tasks.
A second wave of products focuses on the pull request rather than the editor. They run in CI or as a GitHub or GitLab app, read the diff and surrounding code, and post review comments.
| Tool | Vendor | Notes |
|---|---|---|
| CodeRabbit | CodeRabbit AI | The most installed AI code review app on GitHub and GitLab as of 2026, with more than two million repositories connected and over 13 million pull requests processed. Integrates 40 plus linters and SAST scanners. |
| Greptile | Greptile | Indexes the entire repository and builds a code graph that lets it follow dependencies and git history when reviewing a PR. |
| Augment Reviews | Augment Code | Uses the same Context Engine that powers the Augment IDE assistant. |
| Copilot code review | GitHub | Built into the GitHub Copilot suite. |
| Qodo (formerly CodiumAI) PR Agent | Qodo | Open source GitHub Action that posts PR descriptions, suggestions, and tests. |
Review tools tend to be judged less on raw code generation skill and more on signal to noise: whether their comments are correct, useful, and not redundant with what a human reviewer would already say. In one comparative test cited by Panto AI, Greptile flagged 12 critical bugs to CodeRabbit's 10, while CodeRabbit produced more refactoring and validation suggestions.
Not every AI for programming project is a product. Several research efforts are aimed at competitive programming and formal mathematics rather than day to day software work.
AlphaCode, from Google DeepMind, was unveiled in February 2022. It generated millions of candidate programs per problem and filtered them down to a small set using execution against the public test cases. On a sample of recent Codeforces contests it placed in roughly the top 54 percent of human participants. AlphaCode 2, announced in late 2023, reached the top 15 percent on the same kind of contests.
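The generate and filter step described above can be illustrated with a toy sketch. It is not DeepMind's implementation: the candidates here are plain Python callables standing in for sampled programs, and `filter_candidates` and its `keep` parameter are hypothetical names chosen for the example.

```python
def filter_candidates(candidates, public_tests, keep=10):
    """Keep candidates that pass every public test, up to `keep` of them."""
    survivors = []
    for program in candidates:
        try:
            ok = all(program(inp) == expected for inp, expected in public_tests)
        except Exception:
            ok = False  # a crashing candidate is simply discarded
        if ok:
            survivors.append(program)
            if len(survivors) == keep:
                break
    return survivors


# Toy problem: "double the input". Each lambda stands in for one sampled program.
candidates = [
    lambda x: x + 2,   # passes the first public test only
    lambda x: x * 2,   # correct
    lambda x: x << 1,  # correct for integers
    lambda x: x / 0,   # raises ZeroDivisionError
]
public_tests = [(2, 4), (5, 10)]
survivors = filter_candidates(candidates, public_tests)
```

Passing the public tests does not guarantee correctness on hidden tests, which is why the filtered set is kept small and then submitted as a handful of attempts rather than a single answer.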
AlphaProof, also from DeepMind, pairs a fine tuned Gemini language model with a reinforcement learning loop in the Lean theorem proving language. At the 2024 International Mathematical Olympiad, AlphaProof together with AlphaGeometry 2 solved problems at the level of a silver medalist, the first time an AI system reached that bar at the IMO.
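For working with real codebases, several supporting techniques matter, among them retrieval over the codebase, agentic edit and test loops, and the Model Context Protocol.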
Benchmarks are how the field measures progress, and they are also the focus of intense debate about contamination and gaming.
| Benchmark | Origin | What it measures |
|---|---|---|
| HumanEval | OpenAI, 2021 (Codex paper) | 164 hand written Python programming problems with unit tests. The standard metric is pass@1. |
| MBPP | Google, 2021 | Mostly Basic Python Problems. About 1000 short Python tasks with three test cases each. |
| SWE Bench, SWE Bench Verified, SWE Bench Pro | Princeton, 2023 onward | Real GitHub issues from large open source Python projects. The model must produce a patch that passes the original tests. SWE Bench Verified is a 500 task subset hand validated by humans. |
| LiveCodeBench | UC Berkeley and others, 2024 | Continuously refreshed competitive programming problems from LeetCode, AtCoder, and Codeforces, designed to limit training set contamination. |
| BigCodeBench | BigCode collaboration, 2024 | Practical Python tasks that require calling many real libraries with complex argument structures. |
| Aider Polyglot | Aider, 2024 | 225 Exercism exercises across C++, Go, Java, JavaScript, Python, and Rust. Models get two attempts and may use test feedback. |
| HumanEval Pro, MBPP Pro | Yu, Zhao et al., 2024 | Self invoking variants where models must use their own solution to a base problem to solve a harder one. |
The leaderboards have changed quickly. Top frontier models routinely score above 90 percent on HumanEval, but performance falls sharply on harder benchmarks. On SWE Bench Verified the leaders in 2026 are Anthropic's Claude family, with scores in the high 80s to low 90s, well above the single digit numbers that were typical in early 2024. On the Aider Polyglot leaderboard GPT-5 leads with a score of around 88 percent. On HumanEval Pro, o1-mini drops from 96.2 percent on plain HumanEval to 76.2 percent on the self invoking version, a useful illustration of how brittle saturated benchmarks can be.
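For readers unfamiliar with the pass@k metric used by HumanEval, the sketch below shows the unbiased estimator commonly attributed to the paper that introduced the benchmark: draw n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k randomly chosen samples passes. The function name `pass_at_k` is only a label for this example.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c passed the unit tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no passing sample.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples and 3 passing, pass@1 is simply the pass fraction.
p1 = pass_at_k(10, 3, 1)
# Allowing 5 attempts raises the estimate considerably.
p5 = pass_at_k(10, 3, 5)
```

A benchmark score is then the mean of this value over all problems. For k = 1 the estimator reduces to c / n, so pass@1 is the average fraction of samples that pass.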
Using an LLM to help write code is its own emerging discipline, with vocabulary and patterns that did not exist a few years ago.
Retrieval over codebases. Most serious tools combine retrieval augmented generation with a code aware index. Embedding models are trained on code so that semantically similar functions cluster together, and a search step pulls the most relevant files and symbols into the prompt. Sourcegraph Cody, Augment, Cursor, and Aider all build some version of this index.
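The retrieval step above can be sketched in a few lines. The `embed` function here is a deliberately crude stand-in (a bag of word tokens) for the learned code embedding models that real tools use, and the snippet names and contents are invented for illustration. Only the ranking logic, cosine similarity between a query vector and each snippet vector, reflects the general technique.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a code embedding model: counts of lowercase word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, snippets, top_k=2):
    """Return the names of the top_k snippets most similar to the query."""
    q = embed(query)
    ranked = sorted(snippets, key=lambda name: cosine(q, embed(snippets[name])), reverse=True)
    return ranked[:top_k]


# Hypothetical repository contents, one short snippet per file.
snippets = {
    "auth.py": "def verify_password(user, password): return check_hash(user.hash, password)",
    "billing.py": "def charge_card(customer, amount): return gateway.charge(customer.card, amount)",
    "utils.py": "def slugify(title): return title.lower().replace(' ', '-')",
}
result = retrieve("where is the user password verified", snippets, top_k=1)
```

The retrieved files or symbols are then placed into the prompt ahead of the user's request, which is what lets a model answer questions about code it was never trained on.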
Agentic coding. Modern coding agents do not generate one big patch. They iterate. A typical loop reads files, writes a plan, applies a small edit, runs tests or a linter, observes the output, and decides whether to continue, revise, or ask the human. The Claude Code CLI, Codex CLI, Aider, OpenHands, Cline, and Devin all implement variants of this loop, with different tradeoffs around how much approval they require from the user at each step.
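The loop described above can be reduced to a small skeleton. This is a sketch under stated assumptions rather than how any named tool is implemented: `propose_edit` stands in for the model call, `apply_edit` and `run_tests` stand in for the tool layer, and the return values are invented for the example.

```python
def run_agent(task, propose_edit, apply_edit, run_tests, max_steps=5):
    """Iterate edit -> test -> observe until tests pass or steps run out.

    All three callables are hypothetical hooks supplied by the caller.
    """
    feedback = None
    for step in range(1, max_steps + 1):
        edit = propose_edit(task, feedback)  # plan and propose a small change
        apply_edit(edit)                     # write it to the workspace
        passed, output = run_tests()         # observe the result
        if passed:
            return {"status": "done", "steps": step}
        feedback = output                    # feed the failure into the next turn
    return {"status": "needs_human", "steps": max_steps}


# Toy workspace: the "model" is scripted to get it wrong once, then right.
workspace = {"impl": None}
scripted = iter([lambda a, b: a - b, lambda a, b: a + b])


def propose_edit(task, feedback):
    return next(scripted)


def apply_edit(edit):
    workspace["impl"] = edit


def run_tests():
    got = workspace["impl"](2, 3)
    return (got == 5, f"add(2, 3) returned {got}, expected 5")


result = run_agent("implement add", propose_edit, apply_edit, run_tests)
```

The step limit and the `needs_human` outcome are where tools differ most in practice: some stop and ask for approval at every step, others only hand back control when the budget is exhausted.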
Model Context Protocol. MCP, introduced by Anthropic in November 2024, is an open standard for connecting LLMs to external tools and data sources. Its messages are JSON-RPC 2.0 requests, responses, and notifications, in a design inspired by the Language Server Protocol. Coding agents use MCP servers to talk to GitHub, Postgres, Slack, browsers, the file system, and many other systems. OpenAI announced full MCP support across its products in March 2025, and in December 2025 Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation.
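To make the wire format concrete, the sketch below builds two JSON-RPC 2.0 request objects of the kind an MCP client sends: one to list a server's tools and one to call a tool. The method names `tools/list` and `tools/call` come from the MCP specification. The tool name `search_issues` and its arguments are hypothetical, and the helper `jsonrpc_request` is just a convenience for this example, not part of any SDK.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry a unique id so responses can be matched


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request object as a plain dict."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


# Ask the server which tools it offers.
list_tools = jsonrpc_request("tools/list")

# Invoke one tool by name; "search_issues" is a made-up tool for illustration.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "search_issues", "arguments": {"query": "flaky test"}},
)

wire = json.dumps(call_tool)  # the serialized form sent over the transport
```

The server answers each request with a response object carrying the same `id`, which is how a client matches results to calls when several are in flight.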
Vibe coding. Vibe coding is a casual term, popularized in 2025, for writing software primarily by describing what you want and accepting model generated patches without reading them line by line. It works for prototypes and small projects but has obvious risks for production code, and is one of the practices that most divides experienced engineers.
Whether AI coding tools actually make developers faster, and at what cost to code quality, is still being argued out in research papers. The picture is more mixed than the marketing suggests.
A Microsoft Research controlled experiment published in 2023, in which developers were asked to implement an HTTP server in JavaScript, found the GitHub Copilot group finished about 55.8 percent faster than the control group. Less experienced developers benefited more than senior ones. A 2024 GitHub study with Accenture reported that developers with Copilot access were 53.2 percent more likely to pass all unit tests in a controlled task and saw small improvements in readability, reliability, maintainability, and conciseness.
A 2024 study from Uplevel Data Labs, using telemetry from a sample of working engineers, told a different story. It found that Copilot access did not significantly change pull request cycle time, throughput, or complexity, and that developers with Copilot showed a higher bug rate. A separate report covered by Visual Studio Magazine in early 2024 described downward pressure on code quality across a large GitHub corpus, with more code churn and more reverted commits. Microsoft's own three week internal study, summarized by getDX, found that adoption mattered more than access: developers who used Copilot heavily reported larger gains than those who barely used it, even though both had a license.
A reasonable read of the literature is that AI coding tools clearly help with boilerplate, scaffolding, and simple functions, give larger gains to less experienced developers and less familiar codebases, and have weaker or no measurable effect on architectural decisions, debugging, and complex refactors. They also introduce new failure modes around plausible looking but wrong code, especially when developers stop reviewing suggestions carefully.
As of 2026 most professional developers either use a coding assistant daily or work alongside colleagues who do. The frontier has moved from completion to delegation: agents that take an issue and return a pull request, sometimes overnight. Anthropic, OpenAI, and Google all ship first party coding agents, the open source ecosystem around Aider, OpenHands, Cline, Roo Code, and Continue is growing, and the AI native editor market is consolidating around Cursor and Cognition.
The open questions are large. How do you keep a human in the loop when an agent edits hundreds of files? How do you measure whether AI generated code makes a system better or worse over years, not weeks? How do tools handle codebases the model has never seen, where retrieval and grounding matter more than parametric knowledge? And how do organizations think about training data, licensing, and the long tail of bugs that pass tests but break in production? None of these are settled.
See also: Programming Custom GPTs
| Custom GPT | Description | Knowledge | Actions | Link |
|---|---|---|---|---|
| WebGPT | ChatGPT with unbiased access to the Web, can build products using No-Code playgrounds, and use APIs. Powered by Web Requests. | Yes | | https://chat.openai.com/g/g-9MFRcOPwQ-webgpt |
| Code Copilot (GPT) | Code Smarter, Build Faster, With the Expertise of a 10x Programmer by Your Side. | Yes | Yes | https://chat.openai.com/g/g-2DQzU5UZl-code-copilot |
See also: Programming ChatGPT Plugins