Semantic Scholar


Updated Jun 27, 2026


Semantic Scholar is a free, AI-powered academic search engine and open research-data platform built by the Allen Institute for AI (AI2, also styled Ai2), the Seattle nonprofit founded by Microsoft co-founder Paul Allen. Publicly released on November 2, 2015, it now indexes more than 200 million scholarly papers across every field of science and uses natural language processing, machine learning, and computer vision to extract meaning from papers, rank the most influential citations, generate one-sentence TLDR summaries, and surface relevant work that keyword search alone would miss ^[3]^[5]^[6]. Beyond the consumer website, AI2 publishes a large family of open datasets, embeddings, and free APIs (collectively the Semantic Scholar Open Data Platform) that have become core infrastructure for AI and NLP research, serving over 1.8 billion API requests in 2024 alone ^[10].

What is Semantic Scholar?

Semantic Scholar is a scholarly search engine that analyzes the full text and citation structure of papers rather than only matching query keywords, as a conventional index does. At launch it identified which references a paper relied on most heavily ("influential" citations), extracted figures and tables, pulled out key phrases and topics, and let users filter and jump between cited works. "No one can keep up with the explosive growth of scientific literature," said AI2 CEO Oren Etzioni at the 2015 launch, framing the project as "part of our mission of AI for the common good" ^[1]^[2]. The service is free and requires no login to search, and its underlying corpus and tools are openly licensed, which distinguishes it from most other large academic indexes.

Who built Semantic Scholar and when?

Semantic Scholar was conceived at the Allen Institute for AI, the 501(c)(3) nonprofit that Paul Allen established in 2014. Computer scientist Oren Etzioni, whom Allen appointed in September 2013, directed the institute from its earliest days. The motivation, in Etzioni's framing, was information overload: scientists could no longer keep pace with the volume of published literature, and conventional search engines did a poor job of surfacing the papers that actually mattered. AI2 set out to build a search engine that drew on the institute's expertise in data mining, natural language understanding, and computer vision.

The service launched publicly on November 2, 2015. At launch it indexed roughly three million computer science papers and offered features that were unusual for academic search at the time: highly influential citation detection, figure and table extraction, and key-phrase and topic surfacing ^[1]^[2]. Etzioni said at launch that medicine was a particular priority for expansion within the following year.

Expansion across disciplines

Semantic Scholar broadened its coverage rapidly. In November 2016 it added neuroscience, and a high-profile demonstration ranked the most influential brain scientists of the modern era using its citation analysis. In 2017 the team moved into biomedical literature, aiming eventually to cover the more than 20 million papers in that domain; by January 2018 the corpus exceeded 40 million papers spanning computer science and biomedicine. In March 2018, Doug Raymond, who had previously worked on machine learning for Amazon's Alexa, was hired to lead the project.

In October 2019, Semantic Scholar expanded to cover essentially every branch of science, jumping to roughly 175 million papers ^[23]^[24]. Much of this scale came from integrating large metadata sources, and the engine increasingly relied on SciBERT, a science-tuned variant of the BERT language model, to index articles and find patterns across and within disciplines. Raymond described the hardest engineering problem as moving from batch processing to a real-time, continuously updated data pipeline. By 2020 the platform reported roughly seven million users per month and had indexed about 190 million papers; by September 2022 it covered more than 200 million publications across all scientific fields ^[5].

The discontinuation of the Microsoft Academic Graph at the end of 2021 made Semantic Scholar's open data offerings more important to the wider research community, as developers who had depended on that resource looked for free, comprehensive alternatives ^[25]. The Semantic Scholar Academic Graph (described below) was positioned as one such replacement.

Leadership context

Semantic Scholar is a project of AI2 rather than a separate company, so its trajectory tracks the institute's leadership. Etzioni led AI2 for roughly nine years and stepped down as CEO on September 30, 2022. Peter Clark served as interim CEO until Ali Farhadi was named CEO on June 20, 2023. Farhadi stepped down in early 2026, with Clark again taking interim leadership. Throughout these transitions Semantic Scholar has continued to operate and expand as a flagship AI2 product.

How big is Semantic Scholar, and where does its data come from?

Semantic Scholar aggregates metadata and, where licensing allows, full text from many sources rather than crawling the open web alone. It integrates records from Crossref, PubMed, arXiv, Unpaywall, and numerous publishers and content providers, and it sources full text through publisher partnerships and crawling of open-access PDFs. The institute says it works with more than 50 publishers and scholarly societies; the University of Chicago Press is among the partners it has named publicly ^[6].

The underlying knowledge graph has grown steadily. As of May 2025, the Semantic Scholar Academic Graph contained roughly 225 million papers, 105 million authors, about 195,000 publication venues, and 2.8 billion citation edges, alongside hundreds of millions of paper-authorship and paper-venue links ^[10]. (Earlier snapshots reported figures such as 205 million papers and about 2.5 billion citations in 2022, and around 214 million papers in subsequent updates, reflecting continuous growth and periodic deduplication.) The headline figure AI2 markets to researchers is "200M+" papers ^[6].

Metric	Figure	As of
Papers indexed (marketed)	200M+	2022 onward ^[5]^[6]
Papers in Academic Graph	~225 million	May 2025 ^[10]
Authors	~105 million	May 2025 ^[10]
Citation edges	~2.8 billion	May 2025 ^[10]
Publication venues	~195,000	May 2025 ^[10]
Open-access full text (S2ORC)	~12 million papers	May 2025 ^[10]
API requests served	over 1.8 billion	2024 ^[10]

What are Semantic Scholar's main features?

Influential citations and citation context

One of Semantic Scholar's signature contributions is the idea that not all citations carry equal weight. The feature grew out of the 2015 research paper "Identifying Meaningful Citations" by Marco Valenzuela, Vu Ha, and Oren Etzioni, presented at a workshop at the AAAI conference ^[21]. That work framed the problem as supervised classification: deciding whether a citation indicates the cited work was genuinely used or extended versus merely mentioned in passing, using features such as how often and where a reference appears and the surrounding language.

In production, Semantic Scholar flags "Highly Influential Citations" by analyzing the full text of citing papers with a machine-learning model, looking for signals such as a reference appearing multiple times, language like "build upon" or "inspired by," and references tied to tables or figures ^[22]. The platform also classifies citation intent, labeling whether a citation supports background, methods, or results, so that researchers can see not just who cites a paper but how. Because the feature depends on access to the citing paper's full text, coverage is uneven, and independent studies have cautioned that the influential-citation counts can be inconsistent and should be used carefully rather than treated as a definitive impact metric.
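The signals described above can be illustrated with a toy feature extractor. This is a minimal sketch, not the production model: the function name `citation_features`, the naive sentence splitter, and the cue-phrase list (beyond the two phrases quoted above) are assumptions made for illustration. A real system would feed features like these into a trained classifier rather than read them off directly.

```python
import re

# "build upon" and "inspired by" come from the description above;
# "builds upon" is an added variant, assumed for illustration.
CUE_PHRASES = ("build upon", "builds upon", "inspired by")


def citation_features(full_text: str, marker: str) -> dict:
    """Collect simple signals for one reference marker (e.g. "[7]") in a citing paper."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", full_text)
    citing = [s for s in sentences if marker in s]
    return {
        # How many times the reference is mentioned overall.
        "mention_count": sum(s.count(marker) for s in citing),
        # Whether any citing sentence uses a cue phrase.
        "has_cue_phrase": any(
            cue in s.lower() for s in citing for cue in CUE_PHRASES
        ),
        # Whether the reference is mentioned alongside a table or figure.
        "near_table_or_figure": any(
            re.search(r"\b(table|figure)\b", s, re.IGNORECASE) for s in citing
        ),
    }


text = (
    "We build upon the parser of [7]. "
    "Table 2 compares our results with [7]. "
    "Other work exists [3]."
)
features_7 = citation_features(text, "[7]")
features_3 = citation_features(text, "[3]")
```

In this toy text, reference `[7]` is mentioned twice, with a cue phrase and next to a table, while `[3]` is mentioned once in passing. That is the kind of contrast the classifier is meant to capture.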

What is TLDR (the auto-generated summary)?

In November 2020, Semantic Scholar introduced TLDRs (short for "too long; didn't read"): single-sentence, automatically generated summaries of a paper's main objective and results, shown directly on search results and author pages ^[15]^[16]. TLDRs are produced by an abstractive summarization model trained on the SciTLDR dataset, a multi-target collection of roughly 5,400 summaries over about 3,200 papers ^[14]. The dataset combined author-written summaries (drawn in part from OpenReview submissions) with expert-derived summaries condensed from peer reviews, and the training method (called CATTS) used paper titles as an auxiliary signal to compensate for the small amount of summary data. The underlying model was based on BART. Because scientific papers average around 5,000 words and TLDRs average about 21, the feature compresses a paper by a factor of roughly 238 (5,000 / 21 ≈ 238).

The feature was pitched as an antidote to information overload. "Information overload is a top problem facing scientists," said AI2 researcher Isabel Cachola, the SciTLDR lead author; her colleague Daniel Weld noted that "since TLDRs are 20 words instead of 200, they are much faster to skim" than a full abstract ^[15]. Weld also described a longer-term goal of producing personalized research briefings that summarize several recent advances in a sub-area at once. TLDRs launched in beta for nearly ten million computer science papers and later expanded to biology and medicine ^[15].

What is the Semantic Reader?

Semantic Reader is an augmented, AI-powered reading interface for research papers, developed by the Semantic Scholar team together with researchers at the University of California, Berkeley and the University of Washington, with support from the Alfred P. Sloan Foundation ^[17]^[18]. Rather than presenting a static PDF, it overlays interactive elements: inline citation cards that let a reader preview a referenced paper without leaving the page, "skimming" highlights that mark a paper's key sentences with category labels (Goal, Method, and Result), tooltips that surface position-sensitive definitions of terms and symbols, automatically generated glossaries, and personalized citation highlighting. The project, described in the paper "The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces," reported improved reading experiences in user studies and noted that the reader is used by tens of thousands of scholars each week ^[17]. Coinciding with the UIST 2023 conference, the team released the Semantic Reader Open Research Platform, including open-source toolkits such as PaperMage (for processing and analyzing scholarly PDFs) and PaperCraft (a React component for building interactive reading interfaces).

Research Feeds and Library

Research Feeds is an adaptive recommender that learns which papers a user cares about and surfaces recent, relevant research to keep them up to date ^[19]. Users save papers into Library folders, and the system treats saved papers as positive examples while "not relevant" ratings act as negative signals, refining recommendations over time. The recommendations are powered by a paper-embedding model trained with contrastive learning, which finds papers semantically similar to those a user has collected; feeds draw from work published in roughly the previous three months. The Library itself lets users organize papers into folders, bulk-export citations, and share collections publicly or privately.
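One simple way to picture how saved papers and "not relevant" ratings could steer a feed is to score each candidate by its embedding similarity to the positive examples minus its similarity to the negative ones. The sketch below is an assumption made for illustration, not the ranking method Semantic Scholar actually uses. The function names are hypothetical, and the 2-dimensional vectors stand in for real paper embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def feed_score(candidate, positives, negatives=()):
    """Mean similarity to saved papers minus mean similarity to rejected ones."""
    if not positives:
        raise ValueError("at least one positive example is required")
    pos = sum(cosine(candidate, p) for p in positives) / len(positives)
    neg = (
        sum(cosine(candidate, n) for n in negatives) / len(negatives)
        if negatives
        else 0.0
    )
    return pos - neg


def rank_feed(candidates, positives, negatives=()):
    """Return candidate paper ids sorted from most to least relevant."""
    return sorted(
        candidates,
        key=lambda pid: feed_score(candidates[pid], positives, negatives),
        reverse=True,
    )


saved = [[1.0, 0.0]]            # papers in the user's Library (positive)
rejected = [[0.0, 1.0]]         # papers rated "not relevant" (negative)
candidates = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [1.0, 1.0]}
ranking = rank_feed(candidates, saved, rejected)
```

Candidate `a` points in the same direction as the saved paper and ranks first. Candidate `b` resembles the rejected paper and ranks last. Candidate `c` is equally similar to both, so the two signals cancel and it lands in between.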

Ask This Paper

Semantic Scholar has added generative AI features that let users query a paper in natural language. The "Ask This Paper" tool (in beta) lets a reader select a suggested question or type their own about a specific paper and receive an answer along with supporting statements pulled from the text, an application of the retrieval-augmented generation pattern to a single document.
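The retrieval half of that pattern can be sketched in a few lines: pick the sentences of the paper that best match the question, then hand them to a generator as supporting context. The example below covers only the retrieval step and uses plain word overlap as a stand-in for whatever retriever the product uses. The function names and scoring rule are assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def supporting_sentences(question, paper_text, k=2):
    """Return up to k sentences sharing the most words with the question."""
    question_tokens = tokenize(question)
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", paper_text)
        if s.strip()
    ]
    # Score each sentence by word overlap; keep the index for stable ordering.
    scored = [
        (len(question_tokens & tokenize(s)), i, s)
        for i, s in enumerate(sentences)
    ]
    matching = [item for item in scored if item[0] > 0]
    top = sorted(matching, key=lambda item: (-item[0], item[1]))[:k]
    return [sentence for _, _, sentence in top]


paper = (
    "We propose a new summarization model. "
    "The model is trained on SciTLDR. "
    "Experiments were run in Seattle."
)
evidence = supporting_sentences("What dataset is the model trained on?", paper, k=1)
```

A production system would typically replace word overlap with embedding-based retrieval and pass the selected sentences to a language model, which writes the answer and cites them as supporting statements.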

Technology and open research

Semantic Scholar is unusual among search engines in that its research outputs are themselves widely used building blocks for the machine learning community. Several of its datasets and models are landmarks in scientific NLP.

Resource	What it is	Notes
SciBERT	A BERT language model pretrained on scientific text	Released 2019 by Iz Beltagy, Kyle Lo, and Arman Cohan; widely reused for scientific NLP tasks
S2ORC	Semantic Scholar Open Research Corpus	Introduced 2020 (ACL) by Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld; a large corpus of academic papers with structured full text ^[11]
SPECTER	Citation-informed document embeddings	Introduced 2020; learns paper-level representations from the citation graph, built on SciBERT ^[12]
SciDocs	Benchmark for scientific document embeddings	Released alongside SPECTER; seven evaluation tasks
SciTLDR	Dataset for extreme summarization of papers	Powers the TLDR feature; ~5.4K summaries over ~3.2K papers ^[14]
SPECTER2	Successor embedding model	Trained on 6M triplets (about 10x SPECTER) across 23 fields; described in the SciRepEval paper (EMNLP 2023) ^[13]

S2ORC (pronounced "stork") aggregates papers from hundreds of publishers and archives into a unified, machine-readable corpus ^[11]. The 2020 release described roughly 81 million paper records with metadata and abstracts and about 8.1 million open-access papers with structured full text, in which inline mentions of citations, figures, and tables are automatically detected and linked. By the platform's May 2025 reporting, the open-access full-text portion had grown to more than 12 million papers, with Medicine, Biology, and Physics the most represented fields ^[10]. S2ORC has been used to train and evaluate many scientific language models.

SPECTER (Scientific Paper Embeddings using Citation-informed TransformERs) generates a vector for each paper. It trains a Transformer using the citation graph as a signal of document relatedness: a paper and a paper it cites form a positive pair in a contrastive objective ^[12]. These embeddings power internal Semantic Scholar features including Research Feeds, author name disambiguation, and paper clustering, and they are also exposed through the public API. SPECTER2, released in 2023, extended the approach to multiple fields and task formats (classification, regression, retrieval, and search) using task-specific adapters, and was trained on roughly six million triplets, about ten times the data used for the original ^[13].
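The "triplets" mentioned above each pair a query paper with a related (cited) paper and an unrelated one. A triplet margin loss of the kind the SPECTER paper describes pushes the query embedding closer to the cited paper than to the unrelated one by at least a margin. The sketch below is illustrative only: the toy 2-dimensional vectors stand in for Transformer outputs, and the function names and default margin are assumptions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(query, positive, negative, margin=1.0):
    """Zero once the positive is closer than the negative by at least `margin`."""
    return max(
        l2_distance(query, positive) - l2_distance(query, negative) + margin,
        0.0,
    )


query = [0.0, 0.0]
cited = [0.0, 1.0]       # distance 1 from the query
unrelated = [3.0, 4.0]   # distance 5 from the query

# Cited paper is already much closer: 1 - 5 + 1 < 0, so no penalty.
well_separated = triplet_margin_loss(query, cited, unrelated)
# Roles swapped: 5 - 1 + 1 = 5, so the model is penalized.
badly_separated = triplet_margin_loss(query, unrelated, cited)
```

Minimizing this quantity over many triplets is what makes papers connected by citations end up near each other in the embedding space.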

What are the Semantic Scholar API and S2ORC used for?

The Semantic Scholar Open Data Platform exposes its corpus through several free APIs and downloadable datasets, which are documented in the paper "The Semantic Scholar Open Data Platform" by Rodney Kinney, Iz Beltagy, Doug Downey, Oren Etzioni, and many additional AI2 contributors ^[10].

Service	Purpose
Academic Graph API (S2AG)	REST endpoints to look up papers, authors, citations, venues, and SPECTER2 embeddings by Semantic Scholar ID, DOI, arXiv ID, PubMed ID, and other identifiers; supports relevance, title, bulk, and snippet search
Datasets API	Downloadable monthly snapshots of papers, abstracts, authors, citations, embeddings, TLDRs, venues, and S2ORC
Recommendations API	Suggests recent papers and preprints from positive (and optional negative) example papers; released June 29, 2022 ^[20]

S2AG (pronounced "stag") is free and open, distributed both as a live REST API and as monthly downloadable snapshots in JSON ^[7]^[9]. An API key is optional but recommended; the introductory authenticated rate limit is modest (on the order of one request per second per endpoint), with higher limits available on request. The platform reports very large usage: it served over 1.8 billion API requests in 2024 and works with thousands of authenticated partners, with paper metadata lookup by ID (about 60 percent of traffic) and keyword search (about 16 percent) the most common request types ^[10]. Real-world integrations include bioRxiv (operated by Cold Spring Harbor Laboratory), which uses the Recommendations API to help life-sciences researchers filter preprints.
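A minimal client sketch shows how a paper lookup by identifier and the modest rate limit fit together. The base URL, the `ARXIV:` identifier prefix, the field names, and the `x-api-key` header follow the public API documentation as best understood here and should be checked against the current docs before use. The `Throttle` class and the function names are hypothetical helpers, not part of any official client.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed base URL for the Academic Graph REST API; confirm against the API docs.
BASE_URL = "https://api.semanticscholar.org/graph/v1"


def paper_url(paper_id, fields=("title", "year", "citationCount")):
    """Build a paper-lookup URL for an ID such as "ARXIV:2301.10140"."""
    quoted_id = urllib.parse.quote(paper_id, safe=":/")
    query = urllib.parse.urlencode({"fields": ",".join(fields)}, safe=",")
    return f"{BASE_URL}/paper/{quoted_id}?{query}"


class Throttle:
    """Enforce a minimum interval between calls (about 1 request per second)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_paper(paper_id, throttle, api_key=None):
    """Fetch one paper record as a dict (performs a network request)."""
    throttle.wait()
    headers = {"x-api-key": api_key} if api_key else {}
    request = urllib.request.Request(paper_url(paper_id), headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


example_url = paper_url("ARXIV:2301.10140")
```

A caller would create one `Throttle()` and pass it to every `fetch_paper` call so that requests stay spaced out. For large jobs, the monthly dataset snapshots are usually a better fit than many individual lookups.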

How does Semantic Scholar compare with Google Scholar?

Semantic Scholar is frequently compared with Google Scholar, the dominant free academic search engine. The two have different design philosophies. Google Scholar prioritizes exhaustive coverage and indexes the widest range of scholarly material; comparative studies consistently find it has the broadest reach across disciplines, and for some queries it returns dramatically more results than Semantic Scholar. Semantic Scholar instead emphasizes understanding and discovery: identifying the most important and influential elements of a paper, classifying citations by how they are used, generating summaries, and recommending semantically related work that a keyword search might never surface. Its coverage and AI features have historically been strongest in computer science and biomedicine, the fields it entered first. In short, Google Scholar tends to win on raw breadth, while Semantic Scholar differentiates on AI-assisted analysis, structured citation data, and an open data platform that Google Scholar does not offer.

Role in AI and NLP research infrastructure

Semantic Scholar occupies a distinctive position because it is both a product used by millions of researchers and a supplier of open infrastructure to the AI field that builds it. Its corpora (S2ORC), embeddings (SPECTER, SPECTER2), pretrained models (SciBERT), summaries (TLDR), and graph (S2AG) are reused far beyond the website itself: they train and benchmark scientific language models, power third-party literature tools, and feed large language model systems that reason over scientific text. By keeping these resources free and openly licensed, AI2 has made Semantic Scholar a foundational dataset and toolkit for scholarly NLP, even as the consumer search engine competes with Google Scholar and a wave of newer AI research assistants, including tools in the style of Perplexity.

ELI5: What is Semantic Scholar?

Imagine a giant free library that holds more than 200 million science papers, plus a very smart robot librarian. Instead of just looking for the words you typed, the robot actually reads the papers. It can tell you which papers a study really leaned on (not just name-dropped), write a one-sentence summary of any paper so you do not have to read the whole thing, and point you to other papers you would probably find interesting. A nonprofit called the Allen Institute for AI made it, and they give the whole collection away for free so other people can build new science tools with it.

References

1. "Paul Allen's AI2 launches search engine designed specifically for scientists." GeekWire, November 2, 2015. https://www.geekwire.com/2015/paul-allens-ai2-launches-search-engine-designed-specifically-for-scientists/
2. "Semantic Scholar engine for scientists gets (leaps) to the points." TechXplore, November 2015. https://techxplore.com/news/2015-11-semantic-scholar-scientists.html
3. "Semantic Scholar." Wikipedia. https://en.wikipedia.org/wiki/Semantic_Scholar
4. "Allen Institute for AI." Wikipedia. https://en.wikipedia.org/wiki/Allen_Institute_for_AI
5. "Frequently Asked Questions." Semantic Scholar. https://www.semanticscholar.org/faq
6. "About Semantic Scholar." Semantic Scholar / Ai2. https://www.semanticscholar.org/about
7. "Semantic Scholar Academic Graph API." Semantic Scholar. https://www.semanticscholar.org/product/api
8. "Semantic Scholar Academic Graph for Developers." Ai2 Blog (Medium), January 19, 2022. https://medium.com/ai2-blog/semantic-scholar-academic-graph-for-developers-6188cfec84d4
9. Wade, A. D. "The Semantic Scholar Academic Graph (S2AG)." Companion Proceedings of the Web Conference 2022. https://dl.acm.org/doi/10.1145/3487553.3527147
10. Kinney, R., Beltagy, I., Downey, D., Etzioni, O., et al. "The Semantic Scholar Open Data Platform." arXiv:2301.10140, 2023 (v2 April 25, 2025). https://arxiv.org/html/2301.10140v2
11. Lo, K., Wang, L. L., Neumann, M., Kinney, R., Weld, D. "S2ORC: The Semantic Scholar Open Research Corpus." Proceedings of ACL 2020, pp. 4969-4983. https://aclanthology.org/2020.acl-main.447/
12. Cohan, A., Feldman, S., Beltagy, I., Downey, D., Weld, D. "SPECTER: Document-level Representation Learning using Citation-informed Transformers." 2020. https://www.semanticscholar.org/paper/a3e4ceb42cbcd2c807d53aff90a8cb1f5ee3f031
13. "SPECTER2: Adapting scientific document embeddings to multiple fields and task formats." Ai2 Blog. https://allenai.org/blog/specter2-adapting-scientific-document-embeddings-to-multiple-fields-and-task-formats-c95686c06567
14. Cachola, I., Lo, K., Cohan, A., Weld, D. "TLDR: Extreme Summarization of Scientific Documents." 2020. https://arxiv.org/pdf/2004.15011
15. "Introducing TLDRs on Semantic Scholar." Ai2 Blog (Medium). https://medium.com/ai2-blog/introducing-tldrs-on-semantic-scholar-f8310c51c1fb
16. "An AI helps you summarize the latest in AI." MIT Technology Review, November 18, 2020. https://www.technologyreview.com/2020/11/18/1012259/ai-summarizes-science-papers-ai2-semantic-scholar/
17. "The Semantic Reader Project." Communications of the ACM, 2024. https://cacm.acm.org/research/the-semantic-reader-project/
18. "Semantic Reader." Semantic Scholar. https://www.semanticscholar.org/product/semantic-reader
19. "What are Research Feeds?" Semantic Scholar FAQ. https://www.semanticscholar.org/faq/what-are-research-feeds
20. "Semantic Scholar Releases New Recommendations API." Ai2 Blog (Medium), June 29, 2022. https://medium.com/ai2-blog/semantic-scholar-releases-new-recommendations-api-ca01ef2d80d4
21. Valenzuela, M., Ha, V., Etzioni, O. "Identifying Meaningful Citations." AAAI Workshop on Scholarly Big Data, 2015. https://www.semanticscholar.org/paper/1c7be3fc28296a97607d426f9168ad4836407e4b
22. "What are Highly Influential Citations?" Semantic Scholar FAQ. https://www.semanticscholar.org/faq/influential-citations
23. "AI2's Semantic Scholar expands to cover 175 million papers in all scientific disciplines." TechCrunch, October 23, 2019. https://techcrunch.com/2019/10/23/ai2s-semantic-scholar-expands-to-cover-178-million-papers-in-all-scientific-disciplines/
24. "Allen Institute's Semantic Scholar now searches across 175 million academic papers." VentureBeat, 2019. https://venturebeat.com/ai/allen-institutes-semantic-scholar-now-searches-across-175-million-academic-papers
25. "Microsoft Academic Graph is being discontinued. What's next?" Nature Index, 2021. https://www.nature.com/nature-index/news/microsoft-academic-graph-discontinued-whats-next
