Semantic Scholar
Last reviewed
Jun 4, 2026
Sources
25 citations
Review status
Source-backed
Revision
v1 · 2,788 words
Semantic Scholar is a free, AI-powered academic search engine and open data platform built by the Allen Institute for AI (AI2, also styled Ai2), the nonprofit research institute founded by Microsoft co-founder Paul Allen. It was publicly released on November 2, 2015, initially as a tool for searching computer science literature, and has since grown into one of the largest free indexes of scholarly work, covering more than 200 million papers across every field of science. The service applies natural language processing, machine learning, and computer vision to extract meaning from papers, identify the most influential citations and elements within a document, and surface relevant research that keyword search alone would miss. Beyond the consumer-facing website, Semantic Scholar publishes a large family of open datasets, embeddings, and APIs (collectively the Semantic Scholar Open Data Platform) that have become core infrastructure for AI and NLP research.
Semantic Scholar was conceived at the Allen Institute for AI, the Seattle-based 501(c)(3) nonprofit that Paul Allen established in 2014. Computer scientist Oren Etzioni, whom Allen appointed in September 2013, directed the institute from its earliest days. The motivation, in Etzioni's framing, was information overload: scientists could no longer keep up with the explosive growth of published literature, and conventional search engines did a poor job of surfacing the papers that actually mattered. AI2 set out to build a search engine that read papers the way a knowledgeable colleague would, drawing on the institute's expertise in data mining, natural language understanding, and computer vision.
The service launched publicly on November 2, 2015. At launch it indexed roughly three million computer science papers and offered features that were unusual for academic search at the time: it identified which references a paper relied on most heavily ("influential" citations), extracted figures and tables, pulled out key phrases and topics, and let users filter and jump between cited works. Etzioni said at launch that medicine was a particular priority for expansion within the following year.
Semantic Scholar broadened its coverage rapidly. In November 2016 it added neuroscience, and a high-profile demonstration ranked the most influential brain scientists of the modern era using its citation analysis. In 2017 the team moved into biomedical literature, aiming eventually to cover the more than 20 million papers in that domain; by January 2018 the corpus exceeded 40 million papers spanning computer science and biomedicine. In March 2018, Doug Raymond, who had previously worked on machine learning for Amazon's Alexa, was hired to lead the project.
In October 2019, Semantic Scholar expanded to cover essentially every branch of science, jumping to roughly 175 million papers. Much of this scale came from integrating large metadata sources, and the engine increasingly relied on SciBERT, a science-tuned variant of the BERT language model, to index articles and find patterns across and within disciplines. Raymond described the hardest engineering problem as moving from batch processing to a real-time, continuously updated data pipeline. By 2020 the platform reported roughly seven million users per month and had indexed about 190 million papers; by September 2022 it covered more than 200 million publications across all scientific fields.
The discontinuation of the Microsoft Academic Graph at the end of 2021 made Semantic Scholar's open data offerings more important to the wider research community, as developers who had depended on that resource looked for free, comprehensive alternatives. The Semantic Scholar Academic Graph (described below) was positioned as one such replacement.
Semantic Scholar is a project of AI2 rather than a separate company, so its trajectory tracks the institute's leadership. Etzioni led AI2 for roughly nine years and stepped down as CEO on September 30, 2022. Peter Clark served as interim CEO until Ali Farhadi was named CEO on June 20, 2023. Farhadi stepped down in early 2026, with Clark again taking interim leadership. Throughout these transitions Semantic Scholar has continued to operate and expand as a flagship AI2 product.
Semantic Scholar aggregates metadata and, where licensing allows, full text from many sources rather than crawling the open web alone. It integrates records from Crossref, PubMed, arXiv, Unpaywall, and numerous publishers and content providers, and it sources full text through publisher partnerships and crawling of open-access PDFs. The institute says it works with more than 50 publishers and scholarly societies; the University of Chicago Press is among the partners it has named publicly.
The underlying knowledge graph has grown steadily. As of May 2025, the Semantic Scholar Academic Graph contained roughly 225 million papers, 105 million authors, about 195,000 publication venues, and 2.8 billion citation edges, alongside hundreds of millions of paper-authorship and paper-venue links. (Earlier snapshots reported figures such as 205 million papers and about 2.5 billion citations in 2022, and around 214 million papers in subsequent updates, reflecting continuous growth and periodic deduplication.)
One of Semantic Scholar's signature contributions is the idea that not all citations carry equal weight. The feature grew out of the 2015 research paper "Identifying Meaningful Citations" by Marco Valenzuela, Vu Ha, and Oren Etzioni, presented at a workshop at the AAAI conference. That work framed the problem as supervised classification: deciding whether a citation indicates the cited work was genuinely used or extended versus merely mentioned in passing, using features such as how often and where a reference appears and the surrounding language.
In production, Semantic Scholar flags "Highly Influential Citations" by analyzing the full text of citing papers with a machine-learning model, looking for signals such as a reference appearing multiple times, language like "build upon" or "inspired by," and references tied to tables or figures. The platform also classifies citation intent, labeling whether a citation supports background, methods, or results, so that researchers can see not just who cites a paper but how. Because the feature depends on access to the citing paper's full text, coverage is uneven, and independent studies have cautioned that the influential-citation counts can be inconsistent and should be used carefully rather than treated as a definitive impact metric.
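The signals listed above can be illustrated with a toy feature extractor. This is a minimal sketch, not Semantic Scholar's production model: the function name `citation_features`, the cue-phrase list (limited to the two phrases quoted above), the bracketed `[n]` citation markers, and the sentence-list input are all assumptions made for illustration. A real system would feed features like these into a trained classifier.

```python
import re

# Hypothetical cue-phrase list: only the two phrases quoted in the text.
CUE_PHRASES = ("build upon", "inspired by")


def citation_features(sentences, ref_marker):
    """Compute toy features for one reference within one citing paper.

    sentences  -- sentences from the citing paper's full text
    ref_marker -- inline marker of the reference, e.g. "[7]" (assumed format)
    """
    citing = [s for s in sentences if ref_marker in s]
    lowered = [s.lower() for s in citing]
    return {
        # Number of sentences that mention the reference.
        "mention_count": len(citing),
        # True if any citing sentence contains a cue phrase.
        "has_cue_phrase": any(p in s for s in lowered for p in CUE_PHRASES),
        # True if any citing sentence also mentions a table or figure.
        "near_table_or_figure": any(
            re.search(r"\b(table|figure|fig\.)", s) for s in lowered
        ),
    }


paper = [
    "We build upon the parser of [7] and extend it to new domains.",
    "Table 2 compares our results with [7].",
    "Earlier work includes [3].",
]
features_7 = citation_features(paper, "[7]")
features_3 = citation_features(paper, "[3]")
```

In this toy paper, reference `[7]` is mentioned twice, with a cue phrase and next to a table, while `[3]` is mentioned once in passing. That is the kind of contrast the influential-citation classifier is meant to capture.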
In November 2020, Semantic Scholar introduced TLDRs (short for "too long; didn't read"): single-sentence, automatically generated summaries of a paper's main objective and results, shown directly on search results and author pages. They are produced by an abstractive summarization model based on BART and trained on the SciTLDR dataset, a multi-target collection of roughly 5,400 summaries over about 3,200 papers. The dataset combines author-written summaries (drawn in part from OpenReview submissions) with expert-derived summaries condensed from peer reviews, and the training method (called CATTS) uses paper titles as an auxiliary signal to compensate for the small amount of summary data. Because scientific papers average around 5,000 words and TLDRs average about 21, the feature compresses a paper by a factor of roughly 238 (5,000 / 21 ≈ 238). AI2 researcher Daniel Weld described a longer-term goal of producing personalized research briefings that summarize several recent advances in a sub-area at once. TLDRs launched in beta for tens of millions of papers in computer science, biology, and medicine.
Semantic Reader is an augmented, AI-powered reading interface for research papers, developed by the Semantic Scholar team together with researchers at the University of California, Berkeley and the University of Washington, with support from the Alfred P. Sloan Foundation. Rather than presenting a static PDF, it overlays interactive elements: inline citation cards that let a reader preview a referenced paper without leaving the page, "skimming" highlights that mark a paper's key sentences with category labels (Goal, Method, and Result), tooltips that surface position-sensitive definitions of terms and symbols, automatically generated glossaries, and personalized citation highlighting. The project, described in the paper "The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces," reported improved reading experiences in user studies and noted that the reader is used by tens of thousands of scholars each week. Coinciding with the UIST 2023 conference, the team released the Semantic Reader Open Research Platform, including open-source toolkits such as PaperMage (for processing and analyzing scholarly PDFs) and PaperCraft (a React component for building interactive reading interfaces).
Research Feeds is an adaptive recommender that learns which papers a user cares about and surfaces recent, relevant research to keep them up to date. Users save papers into Library folders, and the system treats saved papers as positive examples while "not relevant" ratings act as negative signals, refining recommendations over time. The recommendations are powered by a paper-embedding model trained with contrastive learning, which finds papers semantically similar to those a user has collected; feeds draw from work published in roughly the previous three months. The Library itself lets users organize papers into folders, bulk-export citations, and share collections publicly or privately.
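The positive/negative feedback loop can be sketched as a similarity ranking over paper embeddings. The scoring rule below (mean cosine similarity to saved papers minus mean cosine similarity to papers rated "not relevant") is an assumption chosen for clarity, not the documented Research Feeds algorithm. The names `feed_score`, `saved`, and `not_relevant` are hypothetical, and the three-dimensional vectors stand in for real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def feed_score(candidate, positives, negatives):
    """Mean similarity to saved papers minus mean similarity to rejected ones."""
    pos = sum(cosine(candidate, p) for p in positives) / len(positives)
    neg = (
        sum(cosine(candidate, n) for n in negatives) / len(negatives)
        if negatives
        else 0.0
    )
    return pos - neg


# Toy embeddings; real paper embeddings have hundreds of dimensions.
saved = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]   # papers in a Library folder
not_relevant = [[0.0, 1.0, 0.0]]             # papers rated "not relevant"
candidates = {
    "paper_a": [1.0, 0.1, 0.0],
    "paper_b": [0.1, 1.0, 0.0],
}

# Rank candidate papers from most to least recommended.
ranked = sorted(
    candidates,
    key=lambda pid: feed_score(candidates[pid], saved, not_relevant),
    reverse=True,
)
```

Here `paper_a` sits close to the saved papers and far from the rejected one, so it ranks first. Each new rating shifts the scores, which is how the feed refines over time.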
Semantic Scholar has added generative AI features that let users query a paper in natural language. The "Ask This Paper" tool (in beta) lets a reader select a suggested question or type their own about a specific paper and receive an answer along with supporting statements pulled from the text, an application of the retrieval-augmented generation pattern to a single document.
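The retrieval half of that pattern can be sketched in a few lines. This is an illustration under stated assumptions rather than a description of how "Ask This Paper" is built: simple word overlap stands in for whatever retriever the tool uses, the generation step is omitted, and `supporting_statements` and the sample sentences are hypothetical.

```python
import re


def tokenize(text):
    """Return the set of lowercase word tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def supporting_statements(question, sentences, k=2):
    """Return up to k sentences that share the most words with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(s)), i, s) for i, s in enumerate(sentences)]
    # Highest overlap first; ties are broken by position in the paper.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [s for score, _, s in scored[:k] if score > 0]


# Made-up sentences standing in for one paper's text.
paper_sentences = [
    "We train the model on 500 papers.",
    "The dataset contains author-written summaries.",
    "Evaluation uses ROUGE scores.",
]
evidence = supporting_statements(
    "How many papers were used to train the model?", paper_sentences, k=1
)
```

In a full pipeline, the retrieved sentences would be passed to a language model together with the question, and shown to the reader as the supporting statements behind the answer.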
Semantic Scholar is unusual among search engines in that its research outputs are themselves widely used building blocks for the machine learning community. Several of its datasets and models are landmarks in scientific NLP.
| Resource | What it is | Notes |
|---|---|---|
| SciBERT | A BERT language model pretrained on scientific text | Released 2019 by Iz Beltagy, Kyle Lo, and Arman Cohan; widely reused for scientific NLP tasks |
| S2ORC | Semantic Scholar Open Research Corpus | Introduced 2020 (ACL) by Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld; a large corpus of academic papers with structured full text |
| SPECTER | Citation-informed document embeddings | Introduced 2020; learns paper-level representations from the citation graph, built on SciBERT |
| SciDocs | Benchmark for scientific document embeddings | Released alongside SPECTER; seven evaluation tasks |
| SciTLDR | Dataset for extreme summarization of papers | Powers the TLDR feature; ~5.4K summaries over ~3.2K papers |
| SPECTER2 | Successor embedding model | Trained on 6M triplets (about 10x SPECTER) across 23 fields; described in the SciRepEval paper (EMNLP 2023) |
S2ORC (pronounced "stork") aggregates papers from hundreds of publishers and archives into a unified, machine-readable corpus. The 2020 release described roughly 81 million paper records with metadata and abstracts and about 8.1 million open-access papers with structured full text, in which inline mentions of citations, figures, and tables are automatically detected and linked. By the platform's May 2025 reporting, the open-access full-text portion had grown to more than 12 million papers, with Medicine, Biology, and Physics the most represented fields. S2ORC has been used to train and evaluate many scientific language models.
SPECTER (Scientific Paper Embeddings using Citation-informed TransformERs) generates a vector for each paper by training a Transformer on a signal of document relatedness drawn from the citation graph: a citing paper and a paper it cites are treated as a positive pair in a contrastive objective. These embeddings power internal Semantic Scholar features including Research Feeds, author name disambiguation, and paper clustering, and they are also exposed through the public API. SPECTER2, released in 2023, extended the approach to multiple fields and task formats (classification, regression, retrieval, and search) using task-specific adapters, and was trained on roughly six million triplets, about ten times the data used for the original.
The Semantic Scholar Open Data Platform exposes its corpus through several free APIs and downloadable datasets, which are documented in the paper "The Semantic Scholar Open Data Platform" by Rodney Kinney, Iz Beltagy, Doug Downey, Oren Etzioni, and many additional AI2 contributors.
| Service | Purpose |
|---|---|
| Academic Graph API (S2AG) | REST endpoints to look up papers, authors, citations, venues, and SPECTER2 embeddings by Semantic Scholar ID, DOI, arXiv ID, PubMed ID, and other identifiers; supports relevance, title, bulk, and snippet search |
| Datasets API | Downloadable monthly snapshots of papers, abstracts, authors, citations, embeddings, TLDRs, venues, and S2ORC |
| Recommendations API | Suggests recent papers and preprints from positive (and optional negative) example papers; released June 29, 2022 |
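As a rough illustration of how the services in the table are addressed, the sketch below builds a paper-lookup URL for the Academic Graph API and a JSON body for the Recommendations API without sending any request. The base URLs, the `fields` query parameter, and the `positivePaperIds` / `negativePaperIds` keys follow the public API documentation as best understood here and should be checked against the current reference. The helper names and the identifiers `ARXIV:0000.00000`, `PAPER_ID_1`, and `PAPER_ID_2` are placeholders.

```python
import json
from urllib.parse import quote, urlencode

GRAPH_BASE = "https://api.semanticscholar.org/graph/v1"
RECS_URL = "https://api.semanticscholar.org/recommendations/v1/papers"


def paper_lookup_url(identifier, fields=("title", "year")):
    """Build a lookup URL for a prefixed ID such as 'DOI:...' or 'ARXIV:...'."""
    # Leave ':' and '/' unescaped so prefixed identifiers and DOIs stay readable.
    path = quote(identifier, safe=":/")
    query = urlencode({"fields": ",".join(fields)})
    return f"{GRAPH_BASE}/paper/{path}?{query}"


def recommendations_body(positive_ids, negative_ids=()):
    """Build the JSON body of a recommendations request from example papers."""
    return json.dumps(
        {
            "positivePaperIds": list(positive_ids),
            "negativePaperIds": list(negative_ids),
        }
    )


# Placeholder identifiers, not real papers.
url = paper_lookup_url("ARXIV:0000.00000", fields=("title", "citationCount"))
body = recommendations_body(["PAPER_ID_1"], ["PAPER_ID_2"])
```

The URL would be fetched with an HTTP GET, and the body would be sent with an HTTP POST to `RECS_URL`. Both steps are left out so that the example runs without network access.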
S2AG (pronounced "stag") is free and open, distributed both as a live REST API and as monthly downloadable snapshots in JSON. An API key is optional but recommended; the introductory authenticated rate limit is modest (on the order of one request per second per endpoint), with higher limits available on request. The platform reports very large usage: it served over 1.8 billion API requests in 2024 and works with thousands of authenticated partners. Real-world integrations include bioRxiv (operated by Cold Spring Harbor Laboratory), which uses the Recommendations API to help life-sciences researchers filter preprints.
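A client working under a limit of about one request per second can space its calls on its own side. The sketch below is one possible approach, not an official client: the `Throttle` class and `request_headers` helper are hypothetical, the one-second default mirrors the introductory limit mentioned above, and `x-api-key` is the header name used in the public API documentation (worth confirming against the current reference).

```python
import time


class Throttle:
    """Space successive calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._last = None        # time at which the previous call was released

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay:
            self._sleep(delay)
        self._last = now + delay
        return delay


def request_headers(api_key=None):
    """Return the API-key header when a key is available, else no headers."""
    return {"x-api-key": api_key} if api_key else {}


throttle = Throttle(min_interval=1.0)
first_delay = throttle.wait()   # the first call never needs to wait
```

Calling `throttle.wait()` before each request keeps a single-threaded client within the limit. Partners with higher limits would only need to lower `min_interval`.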
Semantic Scholar is frequently compared with Google Scholar, the dominant free academic search engine. The two have different design philosophies. Google Scholar prioritizes exhaustive coverage and indexes the widest range of scholarly material; comparative studies consistently find it has the broadest reach across disciplines, and for some queries it returns dramatically more results than Semantic Scholar. Semantic Scholar instead emphasizes understanding and discovery: identifying the most important and influential elements of a paper, classifying citations by how they are used, generating summaries, and recommending semantically related work that a keyword search might never surface. Its coverage and AI features have historically been strongest in computer science and biomedicine, the fields it entered first. In short, Google Scholar tends to win on raw breadth, while Semantic Scholar differentiates on AI-assisted analysis, structured citation data, and an open data platform that Google Scholar does not offer.
Semantic Scholar occupies a distinctive position because it is both a product used by millions of researchers and a supplier of open infrastructure to the AI field whose methods it is built on. Its corpora (S2ORC), embeddings (SPECTER, SPECTER2), pretrained models (SciBERT), summaries (TLDR), and graph (S2AG) are reused far beyond the website itself: they train and benchmark scientific language models, power third-party literature tools, and feed large language model systems that reason over scientific text. By keeping these resources free and openly licensed, AI2 has made Semantic Scholar a foundational dataset and toolkit for scholarly NLP, even as the consumer search engine competes with Google Scholar and a wave of newer AI research assistants in the style of Perplexity.