SciBERT

SciBERT is a BERT-based language model developed by the Allen Institute for Artificial Intelligence (AI2) and pretrained from scratch on a large corpus of scientific papers. It was introduced in the paper "SciBERT: A Pretrained Language Model for Scientific Text" by Iz Beltagy, Kyle Lo, and Arman Cohan, presented at EMNLP 2019. The model was designed to improve performance on natural language processing tasks that involve scientific and biomedical text, where the vocabulary, writing conventions, and terminology differ substantially from the general-domain web and book text on which the original BERT was trained. SciBERT keeps the same neural architecture as BERT but pairs it with a domain-specific subword vocabulary called scivocab and a training corpus drawn entirely from research literature. The most widely used checkpoint is distributed on Hugging Face as allenai/scibert_scivocab_uncased.

Overview

SciBERT belongs to a family of domain-adapted variants of BERT, the bidirectional transformer encoder released by Google in 2018. General-purpose language models trained on Wikipedia and book text often underperform on scientific documents because research writing has a distinct word distribution: it is dense with technical terms, chemical names, gene symbols, mathematical notation, and citation conventions that rarely appear in everyday prose. SciBERT addresses this mismatch in two ways. First, it is pretrained on full-text scientific articles rather than general web text. Second, it uses a WordPiece vocabulary constructed directly from that scientific corpus, so that domain-specific tokens are represented more efficiently than they would be under BERT's original vocabulary.

The practical payoff is that, after fine-tuning on a labeled downstream dataset, SciBERT consistently beats the original BERT on scientific NLP benchmarks and, on several biomedical tasks, matches or exceeds BioBERT, an earlier domain-specific model, despite being trained on a smaller volume of biomedical text. SciBERT was released as open source under the Apache 2.0 license, with checkpoints provided for TensorFlow, PyTorch (through the Hugging Face library), and AllenNLP. Because the weights are freely available and easy to load, SciBERT became a common starting point for academic systems that process scholarly text, including those for information extraction over papers, literature mining, and citation analysis.

SciBERT is an encoder-only model. Like BERT, it produces contextual representations of input tokens and is typically used by attaching a small task-specific head, for example a linear classifier or a sequence-labeling layer, and training the whole stack on a target dataset. It is not a generative model and does not produce free-form text the way decoder-based systems such as the GPT family do.

Training corpus and vocabulary

SciBERT was pretrained on a corpus of 1.14 million papers sampled from Semantic Scholar, the academic search engine operated by AI2. The corpus totals roughly 3.17 billion tokens, which is comparable in scale to the data used to train the original BERT. The authors used the full text of each paper, not just the abstract, so the model saw complete documents including methods, results, and discussion sections. The average paper in the corpus contains about 154 sentences, or roughly 2,769 tokens.

The corpus is multi-domain but skewed toward the life sciences. About 82 percent of the papers come from the broad biomedical domain and about 18 percent from computer science. The authors chose these two fields as representative samples of scientific literature, which means SciBERT is strongest on biomedical and computing text and less specialized for fields such as physics, materials science, or the social sciences.
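As a quick sanity check, the corpus figures quoted above can be multiplied out. The sketch below is plain arithmetic on the numbers in the text; the variable and function names are illustrative, and because the inputs are rounded the product only approximates the reported 3.17 billion tokens.

```python
# Figures quoted in the text; the names are illustrative, not from the paper.
PAPERS = 1_140_000          # papers in the pretraining corpus
TOKENS_PER_PAPER = 2_769    # average tokens per paper
SENTENCES_PER_PAPER = 154   # average sentences per paper
BIOMED_SHARE = 0.82         # share of papers from the biomedical domain


def corpus_summary(papers, tokens_per_paper, sentences_per_paper, biomed_share):
    """Derive rough corpus-level totals from the per-paper averages."""
    return {
        "total_tokens": papers * tokens_per_paper,
        "tokens_per_sentence": tokens_per_paper / sentences_per_paper,
        "biomed_papers": round(papers * biomed_share),
        "cs_papers": round(papers * (1 - biomed_share)),
    }


summary = corpus_summary(PAPERS, TOKENS_PER_PAPER, SENTENCES_PER_PAPER, BIOMED_SHARE)
print(f"approx. total tokens: {summary['total_tokens'] / 1e9:.2f} billion")
print(f"approx. tokens per sentence: {summary['tokens_per_sentence']:.1f}")
print(f"approx. biomedical papers: {summary['biomed_papers']:,}")
print(f"approx. computer science papers: {summary['cs_papers']:,}")
```

The product of the rounded averages lands within about one percent of the reported total, which suggests the quoted figures are mutually consistent.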

A central contribution of the work is the scivocab vocabulary. Rather than reuse BERT's original WordPiece vocabulary, which the authors call basevocab and which was derived from general-domain text, they built a new 30,000-token WordPiece vocabulary from the scientific corpus using the SentencePiece library. This matters because BERT tokenizes unfamiliar words by splitting them into smaller subword pieces; a vocabulary tuned to scientific text breaks technical terms into fewer, more meaningful pieces, which gives the model cleaner inputs to learn from. The overlap between basevocab and scivocab is only about 42 percent, which illustrates how different the frequently used words of scientific text are from those of general-domain text. The remaining 58 percent reflects tokens that are common in research writing, such as gene names, chemical fragments, and discipline-specific jargon, but uncommon on the open web.
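The effect of a domain vocabulary on tokenization can be illustrated with a toy example. The sketch below implements greedy longest-match-first subword splitting, the inference procedure WordPiece tokenizers use, over two tiny hypothetical vocabularies. These are not the real basevocab or scivocab contents, and the overlap function's choice of denominator is an assumption, since the counting convention is not stated here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def overlap(vocab_a, vocab_b):
    """Fraction of vocab_a that also appears in vocab_b (assumed convention)."""
    return len(vocab_a & vocab_b) / len(vocab_a)


# Hypothetical toy vocabularies, for illustration only.
general_vocab = {"the", "cell", "ph", "##os", "##ph", "##ory", "##lation"}
science_vocab = {"the", "cell", "kinase", "phosphorylation"}

general_pieces = wordpiece("phosphorylation", general_vocab)
science_pieces = wordpiece("phosphorylation", science_vocab)
print(general_pieces)
print(science_pieces)
print(f"toy overlap: {overlap(general_vocab, science_vocab):.0%}")
```

With the general-style toy vocabulary the term fragments into five pieces, while the science-style one keeps it whole, which is the kind of difference the scivocab design is meant to exploit.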

Architecture

SciBERT uses the same architecture and model size as BERT-base. That means a transformer encoder with 12 layers, a hidden size of 768, and 12 self-attention heads, for roughly 110 million parameters. There is no architectural innovation in SciBERT relative to BERT; the contribution lies in the training data and vocabulary, not the network design. This deliberate choice makes SciBERT a controlled demonstration that domain-specific pretraining alone, holding the architecture fixed, yields meaningful gains.
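The "roughly 110 million parameters" figure can be reproduced approximately by counting weights in a BERT-base-shaped encoder. The sketch below is a back-of-the-envelope count. It assumes the standard BERT-base settings that are not spelled out in this section (a feed-forward size of 3,072, 512 position embeddings, and 2 segment types), counts the pooler but not the pretraining heads, and uses the 30,000-token vocabulary size quoted earlier. The function name is illustrative.

```python
def bert_param_count(vocab_size, hidden=768, layers=12, intermediate=3072,
                     max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT-style encoder (no pretraining heads)."""
    layer_norm = 2 * hidden  # scale and shift vectors
    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, each with a bias, plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (expand, then project back), plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


# The number of attention heads does not change the count, because the heads
# split the hidden size rather than adding to it.
scibert_estimate = bert_param_count(vocab_size=30_000)
print(f"estimated parameters: {scibert_estimate / 1e6:.1f} million")
```

Most of the total sits in the 12 transformer layers; the vocabulary size only moves the embedding table, so small differences in vocabulary size shift the count by well under one percent.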

Like BERT, SciBERT was pretrained with two self-supervised objectives: masked language modeling, in which a fraction of input tokens are hidden and the model learns to predict them from surrounding context, and next sentence prediction. Pretraining was done on a single Cloud TPU v3 with 8 cores and took about one week. Following the original BERT recipe, training proceeded in two phases: roughly five days with a maximum sequence length of 128 tokens, followed by about two days at the longer maximum length of 512 tokens. The scivocab models were trained from scratch on the scientific corpus, while the basevocab models were initialized from the original BERT weights and then continued pretraining on scientific text.
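The masked language modeling objective can be sketched in a few lines. The example below follows the corruption scheme described in the original BERT paper: about 15 percent of tokens are selected, and of those 80 percent become [MASK], 10 percent become a random token, and 10 percent stay unchanged. It is assumed here that SciBERT follows the same scheme. The per-token random draw is a simplification, and the function name and toy vocabulary are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted inputs, labels); labels are None where no prediction is needed."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")           # usual case: hide the token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # sometimes swap in a random token
            else:
                inputs.append(token)              # sometimes leave it visible
        else:
            inputs.append(token)
            labels.append(None)  # position is ignored by the loss
    return inputs, labels


# Hypothetical toy sentence and vocabulary.
sentence = "the kinase phosphorylates the substrate in vitro".split()
toy_vocab = ["the", "kinase", "substrate", "cell", "protein"]
corrupted, targets = mask_tokens(sentence, toy_vocab, mask_prob=0.15, seed=1)
print(corrupted)
print(targets)
```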

Variants

The authors released four pretrained checkpoints, formed by crossing two vocabulary choices with two casing options. The vocabulary axis is scivocab, the new scientific vocabulary, versus basevocab, BERT's original general-domain vocabulary. The casing axis is uncased, where text is lowercased before tokenization, versus cased, where the original capitalization is preserved. The authors report that the scivocab-uncased variant generally gives the best results, and it is the one most users adopt.

Variant	Vocabulary	Casing	Hugging Face identifier	Notes
scibert-scivocab-uncased	scivocab	uncased	`allenai/scibert_scivocab_uncased`	Recommended default; most widely used
scibert-scivocab-cased	scivocab	cased	`allenai/scibert_scivocab_cased`	Preserves capitalization, useful when case is informative
scibert-basevocab-uncased	basevocab	uncased	`allenai/scibert_basevocab_uncased`	Continues from original BERT vocabulary
scibert-basevocab-cased	basevocab	cased	`allenai/scibert_basevocab_cased`	Continues from original BERT, case preserved

The difference between scivocab and basevocab variants isolates the effect of the domain vocabulary. The authors found that using the scientific vocabulary contributes a measurable share of SciBERT's overall improvement, confirming that both the corpus and the vocabulary matter. All four checkpoints were distributed in multiple formats so they could be loaded by the TensorFlow code from Google Research, by the Hugging Face Transformers library in PyTorch, or within the AllenNLP framework.
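The four identifiers in the table follow one naming pattern, so they can be generated rather than typed by hand. The helper below is a small illustrative sketch; the function name is hypothetical, and the pattern is inferred only from the identifiers listed above.

```python
VOCABS = ("scivocab", "basevocab")
CASINGS = ("uncased", "cased")


def scibert_id(vocab="scivocab", casing="uncased"):
    """Build the Hugging Face identifier for one of the four SciBERT variants."""
    if vocab not in VOCABS:
        raise ValueError(f"vocab must be one of {VOCABS}")
    if casing not in CASINGS:
        raise ValueError(f"casing must be one of {CASINGS}")
    return f"allenai/scibert_{vocab}_{casing}"


# Cross the two axes to enumerate all released variants.
ALL_VARIANTS = [scibert_id(v, c) for v in VOCABS for c in CASINGS]
for name in ALL_VARIANTS:
    print(name)
```

The defaults correspond to the scivocab-uncased checkpoint that the authors recommend.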

Performance on scientific NLP tasks

The SciBERT paper evaluated the model across five task types spanning the biomedical and computer science domains: named entity recognition (NER), PICO extraction (identifying Population, Intervention, Comparison, and Outcome spans in clinical trial reports), text classification (CLS), relation classification (REL), and dependency parsing (DEP). The benchmark datasets included BC5CDR, NCBI-disease, JNLPBA, and EBM-NLP for biomedical entity and PICO tasks, GENIA for dependency parsing, ChemProt for chemical-protein relation extraction, SciERC for entity and relation extraction in computer science abstracts, and SciCite, ACL-ARC, and the Paper Field dataset for citation-intent and paper-topic classification.

The authors compared SciBERT against the original BERT-base under two regimes: fine-tuning the entire model on each task, and using frozen BERT embeddings as features fed into a task-specific BiLSTM with a sequence-labeling or classification layer on top. Fine-tuning generally produced the stronger results. Averaged across all datasets, the fine-tuned SciBERT improved over fine-tuned BERT-base by about 2.11 F1 points overall, with a larger gain of roughly 3.55 F1 in computer science and about 1.92 F1 in the biomedical domain. The paper reported that SciBERT achieved new state-of-the-art results on several datasets at the time, including BC5CDR and ChemProt in biomedicine, the NER portion of SciERC and ACL-ARC in computer science, and SciCite and EBM-NLP.
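Because most of the comparisons are reported in F1, it may help to recall how the metric is computed. The sketch below scores predicted entity spans against gold spans by exact match, with F1 as the harmonic mean of precision and recall. The span format (start, end, label) and the function name are assumptions for illustration, and the paper's per-dataset averaging conventions are not reproduced here.

```python
def span_f1(gold, predicted):
    """Exact-match F1 between two collections of (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one span is correct, one predicted span is misplaced.
gold_spans = [(0, 2, "Chemical"), (5, 6, "Disease")]
predicted_spans = [(0, 2, "Chemical"), (7, 8, "Disease")]
score = span_f1(gold_spans, predicted_spans)
print(f"F1 = {score:.2f}")
```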

The table below shows representative test-set scores reported in the paper for fine-tuned SciBERT (using scivocab) against fine-tuned BERT-base. Metrics are F1 unless otherwise noted. These are illustrative figures from the original evaluation, and exact numbers depend on the task setup.

Dataset	Domain	Task	BERT-base (F1)	SciBERT (F1)
BC5CDR	Biomedical	NER	86.72	90.01
NCBI-disease	Biomedical	NER	86.88	88.57
JNLPBA	Biomedical	NER	76.09	77.28
EBM-NLP	Biomedical	PICO	71.53	72.28
ChemProt	Biomedical	Relation	79.14	83.64
SciERC	Computer science	NER	65.24	67.57
SciERC	Computer science	Relation	78.71	79.97
ACL-ARC	Computer science	Classification	63.91	70.98
SciCite	Multi-domain	Classification	84.85	85.49
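The per-dataset gains implied by the table can be computed directly. The sketch below copies the rows above and derives the difference for each dataset and the mean difference per domain. Because the table is only a subset of the paper's evaluation, these means need not match the overall averages quoted earlier. The function names are illustrative.

```python
from collections import defaultdict

# (dataset, domain, task, BERT-base F1, SciBERT F1), copied from the table.
ROWS = [
    ("BC5CDR", "Biomedical", "NER", 86.72, 90.01),
    ("NCBI-disease", "Biomedical", "NER", 86.88, 88.57),
    ("JNLPBA", "Biomedical", "NER", 76.09, 77.28),
    ("EBM-NLP", "Biomedical", "PICO", 71.53, 72.28),
    ("ChemProt", "Biomedical", "Relation", 79.14, 83.64),
    ("SciERC", "Computer science", "NER", 65.24, 67.57),
    ("SciERC", "Computer science", "Relation", 78.71, 79.97),
    ("ACL-ARC", "Computer science", "Classification", 63.91, 70.98),
    ("SciCite", "Multi-domain", "Classification", 84.85, 85.49),
]


def gains(rows):
    """F1 difference (SciBERT minus BERT-base) keyed by (dataset, task)."""
    return {(name, task): round(sci - base, 2)
            for name, _domain, task, base, sci in rows}


def mean_gain_by_domain(rows):
    """Average F1 difference over the rows of each domain."""
    buckets = defaultdict(list)
    for _name, domain, _task, base, sci in rows:
        buckets[domain].append(sci - base)
    return {domain: sum(values) / len(values) for domain, values in buckets.items()}


per_dataset = gains(ROWS)
per_domain = mean_gain_by_domain(ROWS)
for key, delta in sorted(per_dataset.items(), key=lambda item: -item[1]):
    print(f"{key[0]:<14}{key[1]:<16}+{delta:.2f}")
for domain, delta in per_domain.items():
    print(f"{domain}: mean gain +{delta:.2f}")
```

Among these rows, the largest improvement is on ACL-ARC citation-intent classification and the smallest on SciCite.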

Compared with BioBERT, which was the leading biomedical model at the time, SciBERT was competitive and sometimes better even though its biomedical corpus was substantially smaller. For example, on BC5CDR and ChemProt the SciBERT scores reported in the paper exceeded the corresponding BioBERT results, while on tasks such as NCBI-disease and JNLPBA the two models were close. The authors attributed SciBERT's strength partly to the scivocab vocabulary, which BioBERT did not have because BioBERT reused BERT's original general-domain vocabulary.

Use cases

SciBERT is most useful as a pretrained backbone for downstream models that operate on scientific documents. Common applications include:

Scientific named entity recognition: extracting mentions of genes, proteins, chemicals, diseases, methods, datasets, and tasks from research text, which underpins literature curation and knowledge-base construction.
Relation and information extraction: identifying relationships such as chemical-protein interactions (as in ChemProt) or the links between scientific entities (as in SciERC), used to populate structured databases from unstructured papers.
Citation analysis: classifying the intent or function of a citation, for example whether a reference is background, a method, or a comparison, which is the task behind SciCite and ACL-ARC.
Paper and abstract classification: assigning topics, fields of study, or relevance labels to articles, useful for search, recommendation, and systematic literature review.
Clinical and evidence-based medicine NLP: PICO extraction and related tasks that help summarize and structure the content of clinical trial reports.
Embeddings and retrieval: producing dense vector representations of scientific sentences or documents for similarity search and clustering, although models trained specifically for retrieval often perform better at that narrow task.
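For the embeddings-and-retrieval use case in the list above, one common recipe (not something the SciBERT paper prescribes) is to average the token vectors of a sentence, ignoring padding, and compare the results with cosine similarity. The sketch below shows that pooling and comparison step on plain Python lists; the tiny vectors are made up, and in practice they would come from the encoder's final hidden states.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up 2-dimensional token vectors; the last token of each is padding.
sentence_a = mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
sentence_b = mean_pool([[2.0, 3.0], [2.0, 3.0], [0.0, 0.0]], [1, 1, 0])
print(sentence_a, sentence_b, cosine(sentence_a, sentence_b))
```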

Because SciBERT integrates directly with the Hugging Face Transformers library, practitioners can load it with a single call and fine-tune it using the same workflow as any other BERT model. Its popularity in the research community is reflected in the large number of fine-tuned derivatives and downstream projects built on top of it.
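A minimal sketch of that loading step, assuming the `transformers` package and PyTorch are installed and the machine can download the checkpoint. `AutoTokenizer` and `AutoModelForSequenceClassification` are standard Transformers classes; the helper names `load_for_classification` and `encode` are hypothetical, and the classification head attached here is freshly initialized and still needs fine-tuning.

```python
MODEL_ID = "allenai/scibert_scivocab_uncased"


def load_for_classification(num_labels, model_id=MODEL_ID):
    """Load the tokenizer and the encoder with an untrained classification head."""
    # Imported lazily so that defining this helper needs neither the
    # transformers package nor network access.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    return tokenizer, model


def encode(tokenizer, texts, max_length=512):
    """Tokenize a batch, padding to the longest item and truncating at max_length."""
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",  # PyTorch tensors
    )
```

A typical call would be `tokenizer, model = load_for_classification(num_labels=3)` followed by `model(**encode(tokenizer, ["We propose a new method."]))`, after which the model can be trained with any standard BERT fine-tuning loop.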

Limitations

SciBERT has several limitations that follow from its design and the era in which it was built. Its training corpus is heavily weighted toward biomedicine and computer science, so it is likely to be less effective in disciplines that are underrepresented in the data, such as physics, chemistry beyond the biomedical interface, economics, or the humanities. Both the vocabulary and the learned representations reflect the conventions of those two fields.

As an encoder-only BERT-base model, SciBERT inherits the structural constraints of that architecture. It has a maximum input length of 512 tokens, which is short relative to a full research paper, so long documents must be truncated or split into chunks. It is not a text generator and cannot summarize, answer open-ended questions, or write prose on its own; it produces representations that a task head consumes. With about 110 million parameters, it is small by the standards of later large language models, which limits the depth of reasoning and world knowledge it can encode.
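One common workaround for the 512-token limit is to split a long document into overlapping windows and run the model on each window. The sketch below chunks a list of token ids. It assumes two positions per window are reserved for special tokens such as [CLS] and [SEP], and the overlap of 128 tokens is an arbitrary illustrative choice. The function name is hypothetical.

```python
def chunk_token_ids(token_ids, max_length=512, stride=128, num_special=2):
    """Split token ids into overlapping windows that fit the model's input limit."""
    window = max_length - num_special  # room left after the special tokens
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than the window")
    step = window - stride  # how far each new window advances
    chunks = []
    for start in range(0, max(len(token_ids), 1), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the document
    return chunks


# Stand-in for a tokenized document of 1,200 subword ids.
document = list(range(1200))
pieces = chunk_token_ids(document)
print(len(pieces), [len(piece) for piece in pieces])
```

Predictions from the individual windows then have to be merged, for example by averaging classification scores or by reconciling labels in the overlapping regions; which strategy works best depends on the task.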

SciBERT was trained on literature available through 2019, so it has no knowledge of more recent research, terminology, or findings, and it does not update on its own. Like all models trained on real-world text, it can reproduce biases present in its source corpus. Finally, although it is a strong general backbone for scientific text, it has been surpassed on specific biomedical benchmarks by later models such as PubMedBERT, and on long-document and broad-coverage tasks by newer architectures and much larger models.

History and related models

SciBERT arrived during a wave of domain-specific adaptations of BERT in 2019, shortly after BERT itself was released in late 2018. The central question these efforts explored was how best to specialize a general-purpose language model for a technical domain, and SciBERT, BioBERT, and PubMedBERT represent three points on that spectrum.

BioBERT was introduced by Jinhyuk Lee and colleagues in the paper "BioBERT: a pre-trained biomedical language representation model for biomedical text mining," published in the journal Bioinformatics in 2019 (with an earlier arXiv preprint, 1901.08746). BioBERT used continued pretraining: it started from the original BERT weights and kept training on biomedical text, specifically PubMed abstracts and PubMed Central full-text articles, with checkpoints offered for PubMed only, PMC only, and the combination. A key design difference from SciBERT is that BioBERT retained BERT's original general-domain vocabulary rather than building a new one. The authors reported gains over BERT on biomedical NER, relation extraction, and question answering. BioBERT and SciBERT appeared at roughly the same time and are frequently compared; SciBERT's distinct contribution was the from-scratch scientific vocabulary and a corpus that spanned computer science as well as biomedicine.

PubMedBERT followed in 2020, introduced by Yu Gu and colleagues at Microsoft Research in "Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing" (arXiv 2007.15779; later published in ACM Transactions on Computing for Healthcare, and now also distributed under the name BiomedBERT). PubMedBERT pushed the domain-specialization idea further by pretraining entirely from scratch on biomedical text, roughly 14 million PubMed abstracts, with a vocabulary also generated from that biomedical corpus. The paper argued that for domains with abundant unlabeled text, such as biomedicine, training from scratch yields larger gains than continued pretraining of a general-domain model. The same work introduced the BLURB benchmark (Biomedical Language Understanding and Reasoning Benchmark), a suite of biomedical NLP datasets used to compare these models on common footing, on which PubMedBERT set state-of-the-art results across multiple tasks.

Together, these three models clarified a now-standard lesson: when a domain has enough text, pretraining a model and its vocabulary from scratch on in-domain data tends to outperform simply continuing to train a general model. SciBERT was an early and influential demonstration of this principle for scientific text, and it remains a widely cited baseline and a practical tool for processing scholarly literature.

References

Beltagy, Iz; Lo, Kyle; Cohan, Arman. "SciBERT: A Pretrained Language Model for Scientific Text." Proceedings of EMNLP 2019, Association for Computational Linguistics, 2019. ACL Anthology D19-1371. https://aclanthology.org/D19-1371/ (preprint: https://arxiv.org/abs/1903.10676)
Allen Institute for AI. "allenai/scibert_scivocab_uncased." Hugging Face model card, 2019. https://huggingface.co/allenai/scibert_scivocab_uncased
Allen Institute for AI. "scibert: A BERT model for scientific text." GitHub repository, 2019. https://github.com/allenai/scibert
Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of NAACL-HLT 2019, 2019 (arXiv preprint, 2018). https://arxiv.org/abs/1810.04805
Lee, Jinhyuk; Yoon, Wonjin; Kim, Sungdong; Kim, Donghyeon; Kim, Sunkyu; So, Chan Ho; Kang, Jaewoo. "BioBERT: a pre-trained biomedical language representation model for biomedical text mining." Bioinformatics, vol. 36, no. 4, 2020. https://arxiv.org/abs/1901.08746
Gu, Yu; Tinn, Robert; Cheng, Hao; Lucas, Michael; Usuyama, Naoto; Liu, Xiaodong; Naumann, Tristan; Gao, Jianfeng; Poon, Hoifung. "Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing." ACM Transactions on Computing for Healthcare, 2021. https://arxiv.org/abs/2007.15779
Lo, Kyle; Wang, Lucy Lu; Neumann, Mark; Kinney, Rodney; Weld, Daniel S. "S2ORC: The Semantic Scholar Open Research Corpus." Proceedings of ACL 2020, 2020. https://arxiv.org/abs/1911.02782
