arXiv
Last reviewed
Sources
20 citations
Review status
Source-backed
Revision
v2 · 1,967 words
arXiv is the free, open-access repository where nearly every consequential artificial intelligence paper of the deep learning era first appeared in public, often months before, or instead of, formal peer review. Founded in August 1991 by physicist Paul Ginsparg at Los Alamos National Laboratory and operated by Cornell University since 2001, it grew from an email list for theoretical high-energy physics into the central distribution channel for machine learning and AI research. By mid-June 2026 the service had received 3,077,861 submissions in total, and computer science had become its single largest subject area [1][6]. arXiv performs no peer review of its own: its founder describes it as "a way of communicating science" rather than a journal [3]. Its separation from Cornell into an independent nonprofit organization takes effect on July 1, 2026, after 25 years at the university [2].
(arXiv is pronounced "archive": the name spells the word with the Greek letter chi in place of "ch.")
Ginsparg started the service on August 14, 1991 as an automated email server that distributed preprints in theoretical high-energy physics, reachable at the address xxx.lanl.gov [3][4]. Physicists had long circulated paper preprints by mail months ahead of journal publication; Ginsparg's server made that informal system instant, complete, and free to anyone with an internet connection. A web interface followed in 1993, and coverage expanded through the decade into other areas of physics and into mathematics [4].
Computer science arrived in September 1998, when the Computing Research Repository (CoRR) launched as a cooperation between the ACM, the Los Alamos e-print archive, and the NCSTRL digital library network, giving CS researchers a dedicated section with its own classification scheme [5]. The service was renamed arXiv.org in late 1998 [4]. In 2001 Ginsparg left Los Alamos for a faculty position at Cornell University and the repository moved with him; Cornell ran it first through the university library and in recent years through Cornell Tech [2][3]. In 2021 Ginsparg received the Einstein Foundation's inaugural Individual Award for his role in transforming scientific communication [3].
On April 2, 2026, arXiv announced that it would separate from Cornell and become a standalone nonprofit on July 1, 2026, with Cornell and the Simons Foundation, its largest philanthropic backer, jointly supporting the transition. A search for arXiv's first chief executive began at the same time [2].
| Year | Milestone |
|---|---|
| 1991 | Launched at Los Alamos as an email preprint server (xxx.lanl.gov) [3] |
| 1993 | Web interface added [4] |
| 1998 | Computer science section launched via CoRR; service renamed arXiv.org [4][5] |
| 2001 | Moved to Cornell University with Ginsparg [3] |
| 2008 | 500,000th article posted (October) [4] |
| 2014 | Cumulative articles pass 1 million [4] |
| 2021 | Cumulative articles pass 2 million [4] |
| 2024 | Monthly record of 24,226 submissions (October); computer science is the largest subject area [6] |
| 2026 | Cumulative submissions reach 3,077,861 (mid-June); independence from Cornell takes effect July 1 [1][2] |
arXiv accepts papers in eight subject areas: physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics [7]. Reading and submitting are both free. Two gatekeeping layers stand between an upload and public announcement, and neither is peer review.
The first is endorsement, introduced in 2004: a first-time submitter to a category must be endorsed by an established arXiv author in that area, though researchers with recognized institutional email addresses or prior co-authored arXiv papers are often endorsed automatically [4][8]. The second is moderation. Volunteer subject-matter experts holding terminal degrees, approved by arXiv's advisory committees and staff, screen submissions for topical relevance and basic scholarly standards. Moderators can reclassify a paper into a different category, place it on hold, decline it, or withdraw it after announcement, but arXiv is explicit that "the arXiv moderation process is not a peer-review process" and that moderators neither give feedback nor certify correctness [9].
Accepted submissions, typically uploaded as LaTeX source, are announced on a rolling daily cycle and assigned a permanent identifier such as arXiv:1706.03762. Authors can post revised versions, but every announced version remains publicly accessible: papers cannot be deleted, only marked as withdrawn [10]. Since January 2023, arXiv has barred generative AI tools from being listed as authors and requires disclosure of any significant use of large language models in preparing a paper, with human authors taking full responsibility for the contents [11].
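The identifier and versioning scheme described above can be made concrete with a short parser. This is a minimal sketch, not arXiv's own code. It assumes the scheme in use since 2007: a two-digit year and two-digit month, a dot, a four- or five-digit sequence number, and an optional version suffix such as v2. The names `ArxivId` and `parse_arxiv_id` are hypothetical, and older identifiers of the form hep-th/9901001 are deliberately not handled.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed pattern: optional "arXiv:" prefix, YYMM, a dot, a 4- or 5-digit
# sequence number, and an optional version suffix (v1, v2, ...).
_ARXIV_ID = re.compile(
    r"^(?:arXiv:)?(?P<yymm>\d{4})\.(?P<seq>\d{4,5})(?:v(?P<version>\d+))?$",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class ArxivId:
    year: int                # four-digit year, assuming the 2000s
    month: int               # 1 through 12
    sequence: int            # running number within that month
    version: Optional[int]   # None when no version suffix was given


def parse_arxiv_id(text: str) -> ArxivId:
    """Split an identifier such as 'arXiv:1706.03762v2' into its parts."""
    match = _ARXIV_ID.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognized arXiv identifier: {text!r}")
    yymm = match.group("yymm")
    month = int(yymm[2:])
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in identifier: {text!r}")
    version = match.group("version")
    return ArxivId(
        year=2000 + int(yymm[:2]),
        month=month,
        sequence=int(match.group("seq")),
        version=int(version) if version is not None else None,
    )


if __name__ == "__main__":
    print(parse_arxiv_id("arXiv:1706.03762"))
```

Parsed this way, arXiv:1706.03762 yields a year of 2017 and a month of 6. Leaving `version` as `None` when no suffix is present keeps a reference to "the paper" distinct from a reference to one specific announced version.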
Growth has been relentless and is accelerating. arXiv first exceeded 20,000 submissions in a month in May 2023; October 2024 set a record of 24,226 new papers, and cumulative submissions reached 2,597,322 by the end of that month [6]. arXiv's public statistics show monthly totals approaching 28,000 by late 2025, with the all-time counter at 3,077,861 in mid-June 2026 [1].
Computer science, a relative latecomer, is now the largest of the eight subject areas. In October 2024 the three most active categories on the entire site were cs.LG (machine learning), cs.CV (computer vision), and cs.CL (natural language processing); together they accounted for more than 6,000 new papers in that month alone [6]. The artificial intelligence category, cs.AI, nearly doubled year over year, from 1,742 papers in one November 2023 sample to 3,242 a year later, an increase of about 86 percent [6]. The shift tracks the deep learning boom: a field that once relied on journals and gated proceedings moved its primary record onto a preprint server.
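The growth figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text, with one labeled assumption: the span from the end of October 2024 to mid-June 2026 is treated as 19.5 months. The helper name `percent_change` is hypothetical.

```python
def percent_change(old: int, new: int) -> float:
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0


# Figures quoted in the text.
cs_ai_nov_2023 = 1_742
cs_ai_nov_2024 = 3_242
cumulative_end_oct_2024 = 2_597_322
cumulative_mid_jun_2026 = 3_077_861

# Assumption: end of October 2024 to mid-June 2026 is about 19.5 months.
months_elapsed = 19.5

cs_ai_growth = percent_change(cs_ai_nov_2023, cs_ai_nov_2024)
added = cumulative_mid_jun_2026 - cumulative_end_oct_2024
average_per_month = added / months_elapsed

if __name__ == "__main__":
    print(f"cs.AI year-over-year change: {cs_ai_growth:.1f}%")
    print(f"submissions added since October 2024: {added:,}")
    print(f"implied average per month: {average_per_month:,.0f}")
```

Under that assumption the cs.AI sample grew by about 86 percent, and the 480,539 submissions added since October 2024 imply an average of roughly 24,600 per month, in line with the October 2024 record and below the late-2025 peaks.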
For AI researchers, arXiv is not a supplement to publication; for much of the field it is the publication venue of record. The norm since the mid-2010s has been to post work the moment it is ready and, sometimes, submit it to NeurIPS, ICML, or ICLR afterward. "Attention Is All You Need," the paper that introduced the Transformer, appeared on arXiv in June 2017, half a year before its NeurIPS publication, and was already shaping follow-up work in the interim [12]. OpenAI's GPT-4 technical report went straight to arXiv in March 2023 and never passed through a peer-reviewed venue at all [13]. Citing papers by arXiv identifier is routine, and scanning the day's new cs.LG and cs.CL listings is a professional habit.
A tooling ecosystem grew on top of the firehose: Andrej Karpathy's arXiv Sanity Preserver for sorting machine learning preprints, Hugging Face's Daily Papers feed, Papers with Code's linking of arXiv IDs to open-source implementations, and alphaXiv's discussion layer built directly on arXiv pages.
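Tools of this kind typically start from a machine-readable listing of recent papers. As an illustration only, the sketch below builds a query for the newest submissions in one category against arXiv's public query API, which this article does not otherwise describe, and parses an Atom feed of the kind that API returns. The endpoint and parameter names reflect the API as publicly documented. The function names `listing_url` and `parse_feed` are hypothetical, and no network request is made here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace


def listing_url(category: str, max_results: int = 10) -> str:
    """Build a query URL for the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def parse_feed(atom_xml: str) -> list[tuple[str, str]]:
    """Return (identifier, title) pairs from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        abs_url = entry.findtext(f"{ATOM}id", default="")
        # Entry ids are abstract-page URLs; keep the part after "/abs/".
        identifier = abs_url.split("/abs/", 1)[-1]
        # Titles may be wrapped across lines; collapse the whitespace.
        title = " ".join(entry.findtext(f"{ATOM}title", default="").split())
        papers.append((identifier, title))
    return papers


if __name__ == "__main__":
    print(listing_url("cs.LG", max_results=5))
```

Separating URL construction from parsing lets the parser be exercised on a saved feed, and leaves the actual fetch, with whatever rate limiting a caller needs, to the surrounding program.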
The preprint-first culture has repeatedly collided with double-blind conference review, because a posted preprint can reveal author identities to reviewers. The Association for Computational Linguistics long enforced an "anonymity period" that barred posting to arXiv in the month before submission deadlines and during review; it abandoned the rule on January 12, 2024, permitting non-anonymous preprints at any time while keeping the submissions themselves anonymized [14].
On October 31, 2025, arXiv announced an updated practice for the computer science section: review articles and position papers would no longer be accepted unless they had already been accepted by a peer-reviewed journal or conference, with workshop acceptance explicitly insufficient, and authors would have to supply the journal reference and DOI metadata with their submission [15]. The trigger was volume. arXiv said it was receiving "hundreds of review articles every month," most of them "little more than annotated bibliographies, with no substantial discussion of open research issues," a surge it attributed to large language models making such papers fast and cheap to generate [15]. Press coverage framed the change as arXiv being "spammed with AI-generated 'research' papers" [16]; arXiv itself described it as a stricter application of long-standing editorial standards rather than a new policy, and noted that other sections could adopt the same practice if they experience similar surges [15].
The underlying problem has empirical support. A January 2026 analysis estimated that 21.4 percent of the content of recent computer science review papers on arXiv was LLM-generated, against 14.0 percent for non-review papers [17]. The new practice also drew criticism: some researchers argued it blocks legitimate survey and position work by independent or junior authors who lack conference access, and simply shifts gatekeeping onto already overloaded conference review systems [18]. The episode crystallized a broader tension, as the flood of low-effort machine-written text sometimes called AI slop reached the very platform on which AI research itself is published.
arXiv's annual budget is roughly $6 million [19]. Cornell has provided a cash subsidy plus in-kind coverage of indirect costs, with the remainder coming from the Simons Foundation, grants, individual donors, and a membership program [19][20]. Under the current model, member universities, libraries, and research institutes contribute from $1,000 per year; affiliate professional societies and government agencies are asked for $5,000 to $100,000; and corporate sponsors for $10,000 to $200,000 [20]. The Simons Foundation and Schmidt Sciences are separately funding a multiyear modernization of arXiv's aging codebase and its migration to cloud infrastructure [19]. The independent nonprofit taking over on July 1, 2026 retains the same mission, "to advance scientific discovery by supporting researchers with a free, fast, and reliable open service," with Cornell and the Simons Foundation backing the transition [2].
arXiv demonstrated, years before "open access" became a movement, that an entire discipline could move its communication system onto a free public server, and it became the template for later preprint services such as bioRxiv, medRxiv, and ChemRxiv [4]. For AI specifically it serves as the field's timestamped public record: priority claims, model announcements, and benchmark results are dated by arXiv identifiers and version histories rather than by journal issues.
The same openness draws persistent criticism. Because arXiv performs no peer review, errors and unsupported claims circulate with the same ease as solid results, and readers must judge quality themselves. The endorsement and moderation systems have at times been criticized as opaque or as restricting legitimate inquiry [4]. And the volunteer moderation model is under visible strain from AI-era volume, the very pressure that produced the October 2025 computer science practice change [15][16]. The qualities that made arXiv indispensable to artificial intelligence, speed and openness, are now the ones that generative AI tests most severely.