Software Development

Updated Jun 28, 2026

AI in software development is the use of large language models and other machine learning systems to write, complete, review, test, document, and refactor source code that programmers have traditionally produced by hand. The field spans single-line autocomplete inside an editor, conversational chat assistants that explain or refactor code, autonomous agents that plan and execute multi-step changes, and specialised tools for code review, testing, documentation, and security analysis. Between mid-2021, when GitHub Copilot entered private preview, and the end of 2025, AI-assisted coding moved from a curiosity to a default part of many professional engineering workflows: GitHub reported on 30 July 2025 that Copilot alone had crossed 20 million all-time users, and the 2025 Stack Overflow Developer Survey found 84% of respondents using or planning to use AI tools, up from 76% a year earlier ^[1]^[2]^[66]^[67].

The shift has been driven by transformer models trained on very large corpora of public source code and natural language, the wide deployment of inline completion in editors such as Visual Studio Code and the JetBrains family, and a wave of agentic systems including Devin, Claude Code, and OpenAI Codex. Productivity studies have produced mixed and sometimes contradictory findings, ranging from a 55.8% speed-up on a controlled HTTP server task to a randomised trial of senior open-source maintainers who were 19% slower when allowed to use AI tools ^[3]^[4]. Concerns about hallucinated APIs, supply-chain attacks via fabricated package names, licensing of training data, and the long-term effect on code quality have prompted academic study, lawsuits, and changes in enterprise policy.

What is AI in software development?

AI coding tools fall into a small number of categories that are useful to distinguish because they have different cost structures, failure modes, and review requirements.

Category	Typical interaction	Examples
Inline completion	Ghost text inside the editor, accepted by Tab	Early Copilot, Tabnine, Codeium
Chat assistants	Side-panel or pop-up chat with file context	Copilot Chat, Sourcegraph Cody, JetBrains AI Assistant
Edit-mode tools	Model rewrites a selection in place	Cursor Composer, Aider, Continue
Coding agents	Multi-step planning, shell access, file I/O	Devin, Claude Code, OpenHands, Replit Agent
Domain specialists	Reviews, tests, docs, security, observability	CodeRabbit, Qodo, Diffblue, Snyk DeepCode

The boundary between these categories is fuzzy because most flagship products now bundle several modes. GitHub Copilot, for example, started as inline completion, added chat in 2023, added the Copilot Workspace agent in 2024, and shipped a fully agentic mode in 2025 ^[5]^[6]^[7]. Cursor began as a fork of Visual Studio Code with chat and edit features and now ships a background agent that runs on remote machines ^[8].

How do AI coding assistants work?

Underneath the products, the technology stack is fairly consistent. The base model is almost always a transformer-based LLM, either a general-purpose frontier model such as GPT-4o, Claude 3.5 Sonnet and its successors, or Gemini, or a code-specialised model such as DeepSeek-Coder, Qwen 2.5 Coder, StarCoder, or Code Llama ^[9]^[10]^[11]. Retrieval over the local repository is added on top, using embeddings, language servers, or filesystem traversal. For agents, a planning loop, a tool harness, and a sandboxed shell are layered over the model. The Model Context Protocol released by Anthropic in November 2024 has become a common way to plug external tools into these agents ^[12].

History

From statistical autocomplete to neural code models

Early editor assistance was rule-based or grammar-driven. Microsoft's IntelliSense, which shipped in Visual Basic 5.0 in 1996 and then in Visual Studio, used static type information and parse trees to suggest method names. Eclipse's Java Development Tools and the C/C++ Development Tooling shipped similar features for open source. These tools knew nothing about semantics beyond what compilers exposed, and they did not generate new code.

Statistical approaches arrived in the 2010s. Kite, founded by Adam Smith in 2014, used cloud-trained models to suggest Python completions inside editors. It briefly ran a free copilot product before shutting down in 2022, with Smith publishing a post-mortem that described the difficulty of building a viable business around inline suggestions before code models were capable enough ^[13]. Tabnine, founded as Codota in 2013 by Dror Weiss and Eran Yahav, used statistical and neural models to provide multi-language autocomplete and is one of the few pre-Copilot vendors that survived the transition.

Neural language models for code became prominent with OpenAI's Codex paper in July 2021, which described a fine-tuning of GPT-3 on public GitHub code and introduced the HumanEval benchmark of 164 hand-written Python problems ^[14]. The model behind Codex powered the first version of Copilot, which Microsoft and GitHub launched as a private preview on 29 June 2021 ^[15]. General availability followed on 21 June 2022 at $10 per month for individuals, with free access for verified students and maintainers of popular open-source projects ^[16].

What was the Copilot era?

GitHub Copilot is the product that turned AI autocomplete into a mainstream developer tool. Microsoft and GitHub leadership cited it heavily on earnings calls: on the fiscal year 2024 fourth-quarter call in July 2024, CEO Satya Nadella reported that paid Copilot users across the individual and business plans had exceeded 1.3 million, growing more than 180% year over year ^[17]. A year later, on the 30 July 2025 earnings call, Nadella said Copilot had crossed 20 million all-time users and was deployed across 90% of the Fortune 100; GitHub does not publish active-user counts, and the figure represents cumulative sign-ups rather than monthly actives ^[66]. Copilot Chat reached general availability for business users in December 2023 after a public beta announced at GitHub Universe in November of that year ^[5]. Copilot Workspace, a task-oriented interface that lets the model plan, propose code edits, and run tests against a repository, entered technical preview in April 2024 ^[6]. Copilot agent mode, which lets the model run shell commands and iterate on a task without per-step approval, shipped in 2025 ^[7].

The success of Copilot triggered a wave of competitors. Amazon previewed CodeWhisperer in June 2022 and reached general availability in April 2023; it announced Amazon Q at AWS re:Invent in November 2023 and folded CodeWhisperer into Amazon Q Developer in April 2024 ^[18]. Google launched Duet AI for Developers at Google Cloud Next in August 2023 and renamed it Gemini Code Assist in February 2024 ^[19]. JetBrains, which makes the IntelliJ family of IDEs, shipped JetBrains AI Assistant in December 2023 and added the agentic Junie in 2025 ^[20]. Sourcegraph released Cody, which uses repository graph indexes to ground completions in large codebases, in 2023.

Why did coding go agentic in 2024?

The step from "autocomplete" to "agent" came in 2024. Cognition AI announced Devin, billed as the "first AI software engineer," on 12 March 2024, with a demo video showing the system completing freelance jobs end-to-end inside a sandboxed Linux environment ^[21]. The Devin demo, and the company's claim of a 13.86% unassisted score on the SWE-bench benchmark, drew both attention and scepticism. Several engineers, including Carl Brown of Internet of Bugs, published video critiques arguing that the demo had been edited and that the actual capability was lower than advertised ^[22]. The episode marked the start of a more public debate about how to measure agentic coders.

Academic work moved quickly. The SWE-Agent system, developed at Princeton NLP by John Yang, Carlos Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press, was released in April 2024 ^[23]. It introduced the agent-computer interface, a deliberate set of file-editing and shell commands designed for an LLM, and reached 12.5% on the full SWE-bench test set using GPT-4 Turbo. A month earlier, in March 2024, a team from the University of Illinois Urbana-Champaign and the startup All Hands AI had released OpenDevin, later renamed OpenHands, as an open-source alternative ^[24].

Anthropic announced Claude Code on 24 February 2025 as a terminal-based agent with file editing, shell execution, and git tools ^[25]. OpenAI re-released the Codex brand on 16 May 2025, this time as a cloud-hosted coding agent that runs in parallel containers, accepts tasks from chat or pull requests, and ships a CLI for local use ^[26]. Replit's Agent shipped in September 2024 with a focus on building and deploying full applications from natural language, and Replit Agent 2 followed in February 2025 ^[27].

Code completion assistants

The table below lists widely used tools, with first-release dates referring to the public availability of the AI-coding product, not to the parent company.

Tool	First release	Organisation	Key features
GitHub Copilot	June 2021 preview, June 2022 GA	GitHub (Microsoft)	Inline completion, Copilot Chat, Copilot Workspace, agent mode
Cursor	March 2023	Anysphere	Forked VS Code editor, Composer multi-file edits, background agent
Windsurf	November 2024	Codeium, then Cognition AI	Cascade agent, Codeium roots, acquired by Cognition AI in July 2025
Codeium	2022	Codeium (later Windsurf)	Free tier completion across many editors
Tabnine	2018 (neural), 2023 (chat)	Tabnine	Self-hosted option, enterprise compliance focus
Aider	2023	Paul Gauthier (open source)	Terminal-based pair programmer that writes commits
Cline	2024 (as Claude Dev)	Open source	VS Code extension, BYO key, plan and act modes
Continue	2023	Continue.dev	Open-source autopilot for VS Code and JetBrains
Sourcegraph Cody	2023	Sourcegraph	Repo-graph search context, enterprise focus
Replit Agent	September 2024	Replit	Build and deploy apps from prompts, Replit Agent 2 in 2025
JetBrains AI Assistant	December 2023	JetBrains	Multi-language chat and completion across IntelliJ family
JetBrains Junie	2025	JetBrains	Agentic coder for the JetBrains IDEs
Amazon Q Developer	April 2024 (rebrand)	Amazon	Successor to CodeWhisperer, AWS service integration
Gemini Code Assist	February 2024 (rebrand)	Google	Successor to Duet AI for Developers, GCP integration
Gemini CLI	2025	Google	Open-source terminal agent powered by Gemini
Phind	2023	Phind	Search-based answer engine for developers
Mintlify	2022	Mintlify	Documentation generator and authoring platform
Stack Overflow OverflowAI	Announced July 2023	Stack Overflow	AI search, enterprise knowledge ingestion

Several of these products have histories worth noting.

Cursor is built and operated by Anysphere, a company founded in 2022 in San Francisco by Michael Truell, Sualeh Asif, Arvid Lunnemark, and Aman Sanger, all then in their early twenties. Cursor's editor is a fork of VS Code with deeper LLM integration, including the Composer feature for multi-file edits and an inline edit mode triggered by Cmd-K. The product pulled ahead of standalone Copilot use among early adopters in 2024 and crossed $100 million in annualised revenue within twelve months. By June 2025, Anysphere had reached a $9.9 billion valuation on more than $500 million in ARR ^[28].

Codeium and Windsurf had a turbulent 2025. Codeium, the consumer brand, evolved into Windsurf with the launch of the Cascade agent in November 2024. In July 2025, Google reportedly paid about $2.4 billion in a licensing and hiring deal that pulled CEO Varun Mohan and other engineers to DeepMind, leaving the Windsurf product behind. Cognition AI then acquired what remained of Windsurf later in July 2025 ^[29]^[30].

Aider, written and maintained by Paul Gauthier, is an open-source command-line tool that pairs an LLM with a Git workflow. It applies edits as commits, with diffs the user can review, and has been used as a reference implementation for many later agentic coders ^[31].

Cline, originally released as Claude Dev, is an open-source VS Code extension that uses the user's own API key, runs in a plan-then-act loop, and supports multiple model providers ^[32].

Agentic coders

A "coding agent" in this article means a system that, given a high-level task, can plan steps, edit files, run commands such as tests, observe results, and iterate without per-step human approval. The line between an agent and a chat tool is blurry, since most chat tools now ship some form of tool use.
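The plan, act, observe, and iterate cycle described above can be sketched in a few lines. This is an illustrative toy, not the design of any product named in this article: `run_agent`, `Action`, `scripted_policy`, and the two tool functions are hypothetical names, and the scripted policy stands in for a real model call.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    tool: str       # hypothetical tool names: "edit_file", "run_tests", "finish"
    argument: str


@dataclass
class AgentRun:
    history: list = field(default_factory=list)  # (action, observation) pairs
    finished: bool = False


def run_agent(task, policy, tools, max_steps=10):
    """Loop: ask the policy for an action, execute it, feed back the result."""
    run = AgentRun()
    observation = task
    for _ in range(max_steps):
        action = policy(observation, run.history)
        if action.tool == "finish":
            run.finished = True
            run.history.append((action, "done"))
            break
        # No per-step human approval: the tool runs as soon as it is chosen.
        observation = tools[action.tool](action.argument)
        run.history.append((action, observation))
    return run


# Toy environment: the tests pass only after an edit has been applied.
state = {"patched": False}


def edit_file(argument):
    state["patched"] = True
    return "edit applied"


def run_tests(argument):
    return "tests passed" if state["patched"] else "tests failed"


def scripted_policy(observation, history):
    """Stand-in for a model: edit once, then run tests, stop when they pass."""
    if not history:
        return Action("edit_file", "fix off-by-one in parser.py")
    if observation == "tests passed":
        return Action("finish", "")
    return Action("run_tests", "")


run = run_agent(
    "Fix the failing parser test",
    scripted_policy,
    {"edit_file": edit_file, "run_tests": run_tests},
)
print([action.tool for action, _ in run.history])
```

The `max_steps` cap is one simple way to bound a loop that otherwise has no human checkpoint; real agents typically add sandboxing and permission rules on top.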

Devin

Devin is the agent that arguably defined the category. Cognition AI, founded in 2023 by Scott Wu, Steven Hao, and Walden Yan, announced Devin on 12 March 2024 with a launch video, a benchmark claim, and a private waitlist ^[21]. The system runs in a sandbox with a Linux shell, code editor, and browser, and presents a chat interface where users can hand off tasks. Devin reached general availability in December 2024 at $500 per month, then dropped to a $20 starting price with Devin 2.0 in April 2025 ^[33]. Cognition was valued at over $10 billion by September 2025 ^[34].

Claude Code

Claude Code is Anthropic's terminal-based coding agent. It launched in research preview with Claude 3.7 Sonnet on 24 February 2025 ^[25]. Claude Code runs as a CLI, with permission to read and edit files in a working directory, execute shell commands, and call MCP servers for external tools. Anthropic has rolled out follow-on features including subagents, hooks, a VS Code extension, GitHub integration, and a one-million-token context window for Sonnet variants. Anthropic has reported that Claude Code reached significant adoption among engineering teams during 2025, contributing to a sharp rise in API revenue ^[35].

OpenAI Codex (2025)

OpenAI Codex is a reuse of the Codex brand. The first Codex, retired in March 2023, was the model behind early Copilot. The new Codex, released on 16 May 2025, is a cloud-hosted coding agent that runs each task in its own container, with support for parallel runs and pull-request style review of changes. It is powered by codex-1, a fine-tuned variant of the o3 reasoning model, and ships a CLI for local use ^[26].

OpenHands and SWE-Agent

OpenHands is the leading open-source agent. It supports many model providers, runs in a Docker sandbox, and includes a browser and shell tool. The project was created at All Hands AI by the team that built OpenDevin ^[24]. SWE-Agent, also open source and developed by the Princeton NLP group, is more focused on the SWE-bench task format and contributed the influential agent-computer interface design ^[23].

Replit Agent

Replit's Agent, launched in September 2024 and followed by Agent 2 in February 2025, is positioned for non-professional developers who want to build and deploy an entire web application from a natural-language description, with hosting and a database provisioned automatically inside Replit's platform ^[27].

AutoGen and Magentic-One

Microsoft Research published AutoGen in late 2023 as a Python framework for multi-agent LLM systems, with built-in patterns for code-generation agents that call a Python interpreter. Magentic-One, released by the same team in November 2024, builds a generalist multi-agent orchestrator on top of AutoGen ^[36]. Both are research artefacts rather than products, but they have been used as starting points for downstream agents.

How is AI coding ability measured?

Progress in AI for code is tracked through a small set of benchmarks. The two that drive most headline claims are HumanEval, for function-level synthesis, and SWE-bench, for repository-level bug fixing.

HumanEval and MBPP

HumanEval was released by Chen, Tworek, Jun et al. of OpenAI in July 2021 alongside the Codex paper. It contains 164 hand-written Python programming problems, each with a function signature, docstring, body, and unit tests. Models are scored with pass@k, the probability that at least one of k sampled completions passes the tests ^[14]. By 2024, top models were essentially saturating HumanEval, with reported pass@1 above 90% for many frontier and code-specialised models, which made the benchmark a much weaker signal for new systems.
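In practice pass@k is estimated from n ≥ k samples per problem, of which c pass the tests, using the unbiased estimator 1 − C(n−c, k)/C(n, k) given in the Codex paper. The sketch below shows that calculation; the function names and the sample counts are illustrative, not taken from any published evaluation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    This is 1 minus the probability that a random size-k subset of the n
    samples contains no correct completion.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(per_problem_counts, k):
    """Average the per-problem estimate over (n, c) pairs, one per problem."""
    scores = [pass_at_k(n, c, k) for n, c in per_problem_counts]
    return sum(scores) / len(scores)


# Illustrative numbers only: three problems, 10 samples each.
counts = [(10, 3), (10, 0), (10, 10)]
print(benchmark_pass_at_k(counts, k=1))
```

With k = 1 the estimate reduces to c/n, the plain fraction of samples that pass, which is why pass@1 is the figure most often quoted.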

MBPP, the Mostly Basic Python Programming benchmark, was released by Austin et al. of Google in August 2021 with 974 short, crowdsourced Python tasks designed to be solvable by entry-level programmers ^[37]. MBPP plays a similar role to HumanEval and has been similarly saturated.

APPS and CRUXEval

APPS was released by Hendrycks, Basart, Kadavath et al. in May 2021 with 10,000 Python coding problems collected from competitive-programming sites and split by difficulty into introductory, interview, and competition tiers ^[38].

CRUXEval, released by Gu, Rozière, Leather, Solar-Lezama, Synnaeve, and Wang in January 2024, is a benchmark for code reasoning, understanding, and execution. Models are asked to predict program inputs given outputs or vice versa, which separates execution understanding from synthesis ^[39].
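The two task directions can be illustrated with a toy function. The sketch assumes that a prediction is checked by executing the function, so that any input producing the target output counts as correct; `f` and the helper names are made up for illustration and are not drawn from the benchmark.

```python
def f(items):
    """Toy function under test: the two smallest distinct values, ascending."""
    return sorted(set(items))[:2]


def check_output_prediction(func, given_input, predicted_output):
    """Output prediction: the model sees the input and must state the result."""
    return func(given_input) == predicted_output


def check_input_prediction(func, predicted_input, given_output):
    """Input prediction: the model sees the result and must supply an input.

    Several inputs can map to the same output, so the check runs the
    function rather than comparing against one reference input.
    """
    return func(predicted_input) == given_output


print(check_output_prediction(f, [3, 1, 3, 2], [1, 2]))
print(check_input_prediction(f, [2, 1], [1, 2]))
print(check_input_prediction(f, [5, 1], [1, 2]))
```

Neither direction asks the model to write new code, which is what separates this kind of evaluation from synthesis benchmarks such as HumanEval.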

SWE-bench

SWE-bench, released by Carlos Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan of Princeton in October 2023, was among the first benchmarks to evaluate models on real-world software engineering tasks. It consists of 2,294 GitHub issues from 12 popular Python repositories, each paired with the merged pull request that resolved it; the tests from that pull request serve as the hidden evaluation. A model is given the repo and the issue text, must produce a patch, and is scored on whether the patch passes the project's own tests ^[40]. SWE-bench Lite is a 300-task subset designed to be cheaper to run.
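The scoring rule can be sketched as follows. The sketch assumes the convention used in the public dataset, where each task lists tests that must go from failing to passing and tests that must keep passing; the dictionary layout, function names, and test names here are illustrative rather than the official harness.

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """A task counts as resolved only if every required test passes.

    fail_to_pass: tests that failed before the fix and must now pass.
    pass_to_pass: tests that passed before and must not regress.
    A test missing from the results is treated as not passing.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name) == "passed" for name in required)


def resolution_rate(instances):
    """Fraction of task instances whose patch resolves the issue."""
    if not instances:
        return 0.0
    resolved = sum(
        is_resolved(i["results"], i["fail_to_pass"], i["pass_to_pass"])
        for i in instances
    )
    return resolved / len(instances)


# Hypothetical results for two patched repositories.
instances = [
    {
        "results": {"test_bug": "passed", "test_existing": "passed"},
        "fail_to_pass": ["test_bug"],
        "pass_to_pass": ["test_existing"],
    },
    {
        # Fixes the bug but breaks an existing test, so it does not count.
        "results": {"test_bug": "passed", "test_existing": "failed"},
        "fail_to_pass": ["test_bug"],
        "pass_to_pass": ["test_existing"],
    },
]
print(resolution_rate(instances))
```

The all-or-nothing rule is what makes the headline percentages strict: a patch that fixes the reported bug but introduces a regression scores the same as no patch at all.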

SWE-bench Verified, released by OpenAI in August 2024 in collaboration with the SWE-bench authors, is a 500-task human-validated subset of SWE-bench. Professional software developers manually screened each task to remove ambiguous problem statements and broken tests, since the original benchmark had a known false-negative rate ^[41]. SWE-bench Verified has become the default leaderboard for coding agents, with scores from the major model and tool vendors widely cited in launch materials. By late 2025, top agents combined with strong models were reaching the 70% to 80% range on SWE-bench Verified, compared with low single digits for the best models on the original SWE-bench when it was released in late 2023.

SWE-bench Multimodal was released in October 2024 and includes JavaScript repositories where the GitHub issue references images, such as broken UI screenshots. SWE-bench Live extends the benchmark with newer issues to mitigate training-data contamination.

LiveCodeBench and RepoBench

LiveCodeBench, introduced by Jain, Han, Gu, Li, Yan, Zhang, Wang, Solar-Lezama, Sen, and Stoica in March 2024, scrapes problems from competitive-programming sites such as LeetCode, AtCoder, and Codeforces, with explicit dating so evaluators can filter to problems released after a model's training cutoff and reduce contamination ^[42]. The benchmark also adds tasks beyond synthesis such as self-repair, test output prediction, and execution prediction.
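The date filter is simple to express in code. The sketch below assumes each problem carries a release date; the `Problem` record, the problem titles, and the cutoff date are all hypothetical and do not reflect the benchmark's actual data format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    title: str
    source: str
    released: date


def after_cutoff(problems, training_cutoff):
    """Keep only problems published strictly after the model's cutoff date."""
    return [p for p in problems if p.released > training_cutoff]


# Hypothetical problems and cutoff, for illustration only.
problems = [
    Problem("two-pointer sum", "LeetCode", date(2023, 6, 1)),
    Problem("grid paths", "AtCoder", date(2024, 2, 10)),
    Problem("interval merge", "Codeforces", date(2024, 5, 3)),
]
cutoff = date(2023, 12, 31)

eligible = after_cutoff(problems, cutoff)
print([p.title for p in eligible])
```

Because the filter depends on the cutoff, two models with different training dates are evaluated on different problem sets, which is the trade-off for reduced contamination.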

RepoBench, released by Liu, Xu, and McAuley in June 2023 and updated in 2024, evaluates repository-level autocomplete by feeding models a long context drawn from across the repo and scoring the next-line prediction ^[43].

Does AI actually make developers more productive?

The answer depends heavily on the task, the developer's experience, and the maturity of the codebase. The most cited productivity studies are the GitHub field experiment and the 2025 METR randomised trial, and they reach almost opposite headline conclusions, which is part of why the topic remains contested. As a result, AI Wiki treats no single productivity figure as settled fact.

GitHub Copilot field experiment (2023)

Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer published a controlled experiment in 2023 in which 95 freelance programmers were randomly assigned to write an HTTP server in JavaScript with or without GitHub Copilot. Developers using Copilot completed the task 55.8% faster on average, with stronger speed-ups for less-experienced developers ^[3]. The study is widely cited but is limited by the simplicity of the task and the freelancer population. Critics have noted that a tutorial-style coding task does not generalise to maintenance work on a complex codebase, and that the population was not representative of long-tenured engineers.

Microsoft 365 Copilot studies

Microsoft Research has published several internal studies of Microsoft 365 Copilot and GitHub Copilot in production use at Microsoft and Accenture. The 2024 paper by Cui, Demirer et al., "The Effects of Generative AI on High-Skilled Work," reported productivity gains in pull-request count and time-to-merge but cautioned that the effect varied across teams and tasks ^[44].

Dell'Acqua "Centaur and Cyborg" study (2023)

Fabrizio Dell'Acqua of Harvard Business School, together with co-authors from BCG and other institutions, ran an experiment with 758 BCG consultants on knowledge-work tasks. Consultants using GPT-4 outperformed controls on tasks inside the "jagged frontier" of GPT-4's capabilities but did worse on tasks outside it, a result the authors framed in terms of Centaur and Cyborg work styles ^[45]. The study was not strictly software engineering, but it shaped the way researchers discuss developer productivity with AI.

METR randomised trial (2025)

The METR (Model Evaluation and Threat Research) study, published in July 2025 by Becker, Rush, Barnes, and Rein, ran a randomised controlled trial with 16 experienced open-source developers contributing to large, mature repositories of their own. Developers were randomly assigned tasks to do with or without permission to use AI tools, primarily Cursor Pro paired with Claude 3.5 Sonnet and Claude 3.7 Sonnet. Developers expected the AI tools to make them 24% faster. As METR summarised the result, "When developers are allowed to use AI tools, they take 19% longer to complete issues, a significant slowdown that goes against developer beliefs and expert forecasts" ^[4]. Strikingly, even after experiencing the slowdown, the participants still believed AI had sped them up by 20%. The authors offer several hypotheses for the gap between expectation and result, including the cost of prompting and review and the maturity of the developers' own context for their codebases. The study became one of the most discussed pieces of AI productivity research in 2025 and has been used to argue that productivity effects depend heavily on the developer's prior expertise and the repository's complexity.
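The two headline figures in this section are both changes in completion time, so they are easier to compare once converted to the same scale. The sketch below assumes that "55.8% faster" in the Copilot experiment means 55.8% less time on task and that "19% longer" in the METR trial means 19% more time; the function name is illustrative.

```python
def throughput_multiplier(time_change_pct: float) -> float:
    """Convert a percentage change in task time into a throughput multiplier.

    -55.8 means the task took 55.8% less time; +19 means 19% more time.
    """
    time_ratio = 1 + time_change_pct / 100
    if time_ratio <= 0:
        raise ValueError("time cannot shrink by 100% or more")
    return 1 / time_ratio


copilot_experiment = throughput_multiplier(-55.8)  # Copilot HTTP server task
metr_trial = throughput_multiplier(19)             # METR open-source trial

print(f"Copilot experiment: {copilot_experiment:.2f}x tasks per unit time")
print(f"METR trial: {metr_trial:.2f}x tasks per unit time")
```

Read this way, the Copilot result corresponds to more than doubling throughput on a small greenfield task, while the METR result corresponds to roughly a sixth less throughput on mature repositories, which shows how far apart the two findings are.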

Why do developers distrust AI even as adoption rises?

Adoption and confidence have moved in opposite directions. The 2025 Stack Overflow Developer Survey found that 84% of respondents were using or planning to use AI tools, up from 76% in 2024, yet only 32.7% said they trust the accuracy of AI tools while 45.7% actively distrust it, and just 3.1% said they highly trust the output ^[67]. The most common frustration, cited by a plurality of respondents, was AI solutions that are "almost right, but not quite," which shifts effort from writing code to reviewing and debugging it. The survey's authors noted that 75% of developers would still ask another person for help when they did not trust an AI answer ^[67].

Industry claims

Outside the academic literature, large enterprises have made strong claims about AI's effect on engineering output. Salesforce CEO Marc Benioff said in 2025 that the company had stopped hiring new engineers because Agentforce and other internal AI systems had increased per-engineer productivity by around 30% ^[46]. Klarna told investors in 2024 that its OpenAI-powered customer service assistant was doing the work of 700 contractors, and the company paused engineering hiring in 2023 with AI cited as a factor ^[47]. Klarna later reversed course in 2025, resuming engineering hiring and bringing back human customer-service staff after reporting service-quality issues. These announcements have been controversial in the trade press.

Code review AI

Automated code review combines LLM analysis with static checks and Git platform integration. The market grew quickly in 2024 and 2025 alongside agentic coders, because reviewing AI-written code at scale is itself a workload AI vendors want to handle.

CodeRabbit, launched in 2023 by Harjot Gill and Gur Singh, posts inline comments on pull requests across GitHub, GitLab, and Azure DevOps using an LLM grounded in repo-wide indices ^[48].

Greptile, founded in 2024, focuses on grounding review on codebase-wide context with a code graph and embeddings store, with a free tier for open-source repositories ^[49].

Qodo, formerly Codium AI, ships PR-Agent, an open-source pull-request reviewer, alongside a paid platform that generates tests and identifies regressions. Qodo rebranded from Codium AI in September 2024, partly to avoid confusion with the unrelated Codeium product ^[50].

Sourcery offers an LLM-powered review tool layered on top of its earlier rule-based Python refactoring engine.

GitHub Copilot Code Review, which began rolling out in late 2024 and reached general availability in 2025, lets reviewers request a Copilot review on a pull request, with comments posted as if from a bot reviewer ^[51].

Testing and documentation AI

Automated testing has had AI-augmented vendors for longer than chat assistants. Diffblue, founded in 2017 as a spinout from the University of Oxford by Daniel Kroening and others, ships Diffblue Cover, which generates JUnit tests for Java codebases using reinforcement learning and symbolic search ^[52]. Newer entrants such as Functionize, Mabl, and KaneAI focus on end-to-end and UI test generation. Microsoft Playwright added LLM features for selector healing and test generation in 2024 and 2025 ^[53].

In documentation, Mintlify Writer generates inline docstrings from existing code; the broader Mintlify platform produces hosted developer documentation. ReadMe added AI-powered authoring features for API documentation in 2024. Swimm combines documentation authoring with code-aware diff tracking so that snippets stay in sync with source code as it changes.

Open-source models for code

Closed-source frontier models such as those from OpenAI, Anthropic, and Google dominate leaderboards, but open-weights code models are widely deployed for self-hosted use, on-device assistants, and fine-tuned domain-specific tools.

Model family	Releaser	First release	Notes
StarCoder, StarCoder 2	BigCode (Hugging Face, ServiceNow)	May 2023, February 2024	Open weights, trained on permissively licensed code from The Stack
Code Llama	Meta	August 2023	Fine-tunes of LLaMA 2, deprecated in favour of LLaMA 3 instruct variants
DeepSeek-Coder	DeepSeek	November 2023	Released as 1.3B, 6.7B, and 33B; trained on a code-heavy corpus
DeepSeek-Coder-V2	DeepSeek	June 2024	Mixture-of-experts model with 236B total parameters, strong on HumanEval and MBPP
DeepSeek-V3	DeepSeek	December 2024	General-purpose MoE that posted leading open-weights coding scores
Qwen 2.5 Coder	Alibaba Cloud (Qwen team)	November 2024	Family from 0.5B to 32B with strong SWE-bench-style results
Qwen3	Alibaba (Qwen team)	2025	Reasoning and coding focused models with public weights

The BigCode Project, a collaboration between Hugging Face and ServiceNow Research, also published The Stack, a permissively licensed code dataset, and the SantaCoder series. The Stack v2, released alongside StarCoder 2, includes over four trillion tokens of code from 600 programming languages ^[54].

DeepSeek's models have been particularly important because they have repeatedly closed the gap with closed-weight frontier coders at much lower training costs, and because they ship with permissive licences that allow commercial use ^[9].

What are the risks of AI-generated code?

Hallucinated libraries and slopsquatting

LLMs frequently invent package names that look plausible but do not exist. Joseph Spracklen, Raveen Wijewickrama, Nazmus Sakib, Anindya Maiti, Bimal Viswanath, and Murtuza Jadliwala studied this in 2024 and found that across 16 popular LLMs, 19.7% of code samples that imported a Python or JavaScript package referenced a non-existent name ^[55]. The risk, sometimes called "slopsquatting" by analogy with typosquatting, is that an attacker registers the fabricated name on PyPI or npm and ships malicious code that AI-coded projects then download. Security researcher Bar Lanyado of Lasso Security demonstrated the attack in practice in March 2024: he registered a package under a name that ChatGPT had repeatedly hallucinated and reported that thousands of downloads followed within weeks ^[56].
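One defensive habit is to compare the imports in generated code against a list of dependencies the project has already vetted before anything is installed. The sketch below does this for Python source using the standard `ast` module. It is a minimal illustration, not a complete defence: the approved set and the package name `fastjsonx` are hypothetical, and import names do not always match distribution names on PyPI.

```python
import ast
import sys


def imported_top_level_names(source: str) -> set:
    """Collect the top-level module names imported by a piece of Python code."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as "from . import x".
            names.add(node.module.split(".")[0])
    return names


def unvetted_imports(source: str, approved: set) -> list:
    """Return imported names that are neither approved nor standard library."""
    # sys.stdlib_module_names exists on Python 3.10+; older versions get an
    # empty set, so standard-library modules would need to be approved by hand.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return sorted(imported_top_level_names(source) - set(approved) - set(stdlib))


# Hypothetical model output: "fastjsonx" is an invented name for illustration.
generated = """
import requests
from fastjsonx.parser import loads
from . import helpers
"""

print(unvetted_imports(generated, approved={"requests"}))
```

A flagged name is a prompt for a human to check the registry and the package's provenance, not proof that the package is malicious or missing.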

Security of AI-written code

NYU researchers Pearce, Ahmad, Tan, Dolan-Gavitt, and Karri published a 2021 study of Copilot output across 89 security-relevant scenarios in C and Python, finding that 40% of completions contained known vulnerabilities mapped to MITRE CWEs ^[57]. Follow-up work has shown mixed results, with some studies arguing that AI-written code has roughly the same vulnerability rate as human-written code when controlled for the population of developers using each.

Vendors have built specialised security review tools. Snyk DeepCode AI uses LLM analysis combined with symbolic reasoning to flag vulnerabilities and propose fixes. GitHub Advanced Security added AI-based autofix for Copilot users in 2024. Veracode AI Fix, released in 2024, generates patches for vulnerabilities found by Veracode's static analysis. Microsoft Security Copilot ships extensions for code reviewers in Azure DevOps. Each of these tools requires careful evaluation, since LLM-generated patches can introduce regressions.

Code quality and reuse

GitClear, a Git analytics company, published a 2024 report that compared four years of commit data and reported that the share of code that was copy-pasted within a project rose alongside Copilot adoption, while the share of refactored or moved code fell. The report concluded that AI assistance correlated with what the authors called "code churn" and reduced reuse ^[58]. The methodology has been disputed by GitHub and by other researchers, and the result has not been independently replicated, but the report has been widely cited in the debate about whether AI tools improve maintainability.

Long-context limits

Long-context performance is a recurring weakness. The Needle in a Haystack and RULER evaluations have shown that retrieval accuracy in very long contexts degrades sharply for many models, which limits how well a coding agent can reason over a large monorepo without retrieval ^[59]. The release of one-million-token windows for Gemini 1.5 Pro, Claude Sonnet, and GPT-4.1 has changed what is feasible, but agents still typically rely on chunking, embeddings, and language-server queries rather than naive context stuffing.
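The chunk-then-retrieve pattern can be shown with a toy example. The sketch below uses bag-of-words cosine similarity purely as a stand-in for learned embeddings, and fixed-size line chunks as a stand-in for syntax-aware splitting; all function names and the sample source are illustrative.

```python
import math
import re
from collections import Counter


def chunk_lines(text: str, size: int = 4) -> list:
    """Split source text into consecutive chunks of at most `size` lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def tokens(text: str) -> Counter:
    """Count identifier-like tokens; a crude substitute for an embedding."""
    return Counter(re.findall(r"[a-z_][a-z0-9_]*", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(query: str, chunks: list, k: int = 1) -> list:
    """Return the k chunks most similar to the query."""
    q = tokens(query)
    return sorted(chunks, key=lambda c: cosine(q, tokens(c)), reverse=True)[:k]


source = """def parse_config(path):
    return open(path).read()

def send_email(to, body):
    smtp_send(to, body)
"""

chunks = chunk_lines(source, size=3)
best = top_chunks("where is send_email defined", chunks, k=1)[0]
print(best)
```

Only the selected chunk would be placed in the model's context, which keeps the prompt small regardless of how large the repository is.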

Licensing of training data

Most large code LLMs are trained on public GitHub repositories, including code with copyleft licences such as GPL. Critics including Matthew Butterick, Joseph Saveri, and Tim Davidson filed a class-action suit in November 2022 in the US District Court for the Northern District of California alleging that GitHub Copilot violated the licences of the open-source code it was trained on by emitting verbatim or near-verbatim snippets without the required attribution ^[60]. The case (Doe v. GitHub, Inc., No. 4:22-cv-06823) has been narrowed through several rulings since filing. A separate suit, Doe v. OpenAI, makes similar arguments against the underlying Codex model. As of late 2025, the litigation is ongoing and has not produced a definitive ruling that resolves the question.

Job and skill effects

The broader debate about whether AI coding tools will displace or augment software engineers has surfaced in public statements from CEOs (see Industry claims above) and in surveys. The Stack Overflow Developer Survey for 2024 reported that 72% of professional respondents used AI tools in their workflow, up from 44% in 2023, but only 43% trusted the accuracy of AI tools, down from 49% ^[2]. The 2025 edition recorded usage climbing again while trust fell further, to under a third of respondents ^[67]. Junior-developer hiring trends and changes in computer-science enrolment have been linked to AI by industry commentators including Gergely Orosz of The Pragmatic Engineer, though the causality is contested.

Notable lawsuits and policy

Doe v. GitHub, Inc. (2022 to present): Class-action complaint filed 3 November 2022 in California against GitHub, Microsoft, and OpenAI, alleging that Copilot reproduces copyrighted code without attribution. Several claims have been dismissed but the core DMCA and breach-of-contract claims have survived motions to dismiss ^[60].
The Stack opt-out: Hugging Face introduced a tool that lets developers check whether their code was included in The Stack training set and opt out, in response to criticism that opt-out was not adequately offered.
EU AI Act provisions on transparency: The European Union's AI Act, which entered into force in August 2024, requires general-purpose AI providers to publish summaries of training data, with implications for how code models disclose their training corpora ^[61].
US Executive Order 14110: Issued in October 2023, this executive order required reporting on AI used in critical software and prompted the National Institute of Standards and Technology to publish guidance on AI-assisted software development. The order was revoked by President Donald Trump in January 2025, though some agency-level guidance remains ^[62].

References
