BioBERT
Last reviewed
May 1, 2026
Sources
19 citations
Review status
Source-backed
Revision
v1 · 3,554 words
BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is a domain-specific language representation model based on BERT, pre-trained on a large corpus of biomedical literature drawn from PubMed abstracts and PubMed Central (PMC) full-text articles in addition to BERT's original general-domain pre-training corpus. Introduced in January 2019 and published in Bioinformatics in February 2020, BioBERT was the first widely adopted biomedical-domain transformer language model and rapidly became the de facto baseline for biomedical natural language processing (BioNLP). It was developed by the Data Mining and Information Systems (DMIS) Lab at Korea University, led by Professor Jaewoo Kang, and released as open source via the dmis-lab/biobert repository. The model is distributed under several configurations (BioBERT-Base v1.0, v1.1, v1.2 and BioBERT-Large v1.1), with the most popular checkpoint being dmis-lab/biobert-v1.1 on Hugging Face.
BioBERT demonstrated that taking a strong general-purpose encoder and continuing pre-training on biomedical text could substantially improve performance on biomedical named entity recognition, relation extraction, and question answering, often without changing the underlying architecture or vocabulary. The result was a clear empirical case for domain-specific pre-training in technical fields where vocabulary, syntax, and semantics differ sharply from general English. The paper has been cited more than 8,000 times and spawned an entire family of biomedical and clinical BERT derivatives, including BlueBERT, SciBERT, ClinicalBERT, PubMedBERT, SapBERT, BioMegatron, GatorTron, and the generative BioGPT and BioMedLM.
By late 2018, BERT had set new state-of-the-art results on the GLUE benchmark and on a wide range of general-domain natural language understanding tasks. The original BERT, released by Google AI in October 2018, was pre-trained on the BooksCorpus (about 0.8 billion words) and English Wikipedia (about 2.5 billion words) using two self-supervised objectives: masked language modeling (MLM) and next sentence prediction (NSP). These corpora are dominated by everyday vocabulary and narrative prose.
Biomedical text is different. Words such as transcriptional, idiopathic, adenocarcinoma, BRCA1, acetylcholinesterase, and NF-κB appear constantly in biomedical literature but rarely in Wikipedia or novels. Entity names are dense and ambiguous (gene names overlap with everyday words, drug names are highly variable), sentences are long and syntactically complex, and the underlying semantic relationships often involve specialized scientific concepts. Out-of-the-box BERT performed poorly on biomedical tasks compared with task-specific BiLSTM-CRF systems trained on labeled biomedical corpora, despite BERT's general advantage on most other NLP tasks.
The DMIS Lab team hypothesized that the BERT framework could be adapted by simply continuing pre-training on biomedical text rather than re-training from scratch. This approach is far cheaper than from-scratch pre-training and inherits BERT's general linguistic knowledge as a starting point. The result was BioBERT.
BioBERT was developed at the DMIS (Data Mining and Information Systems) Lab in the Department of Computer Science and Engineering at Korea University in Seoul, South Korea, under the supervision of Professor Jaewoo Kang. The paper authors are Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang.
Key dates and venues:
| Event | Date |
|---|---|
| First arXiv preprint (v1) | 25 January 2019 (arXiv:1901.08746) |
| Final arXiv revision (v4) | 10 September 2019 |
| Published in Bioinformatics | 10 September 2019 (online), February 2020 (print, 36(4):1234–1240) |
| Open source release on GitHub | 2019 (dmis-lab/biobert) |
| Pre-trained weights mirrored | naver/biobert-pretrained |
The code base is built on top of Google's reference TensorFlow implementation of BERT, with PyTorch ports added later via Hugging Face Transformers. Pre-training compute was contributed by Naver, the Korean search and AI company, which also hosts the pre-trained weights mirror.
BioBERT extends BERT by performing additional pre-training (also called continual or domain-incremental pre-training) on biomedical text. The corpora used are:
| Corpus | Size (words) | Domain |
|---|---|---|
| English Wikipedia | 2.5 billion | General (inherited from BERT) |
| BooksCorpus | 0.8 billion | General (inherited from BERT) |
| PubMed abstracts | 4.5 billion | Biomedical literature abstracts |
| PMC full-text articles | 13.5 billion | Biomedical literature full text |
PubMed is the United States National Library of Medicine's bibliographic database of biomedical citations, and PubMed Central (PMC) is the corresponding open-access repository of full-text journal articles. The combined biomedical corpus is roughly 18 billion words (4.5 billion from PubMed plus 13.5 billion from PMC), more than five times the size of the original 3.3-billion-word BERT corpus, and is exclusively scientific.
A distinctive design choice of BioBERT is that pre-training is initialized from the released BERT weights, not from scratch. The model then continues training on the biomedical corpora using the same MLM and NSP objectives that BERT used. Critically, the BioBERT-Base checkpoints keep the original BERT WordPiece vocabulary (cased, 28,996 tokens) rather than building a new biomedical vocabulary. This was a pragmatic decision: reusing BERT's weights means reusing BERT's tokenizer, because the embedding table is indexed by that vocabulary and swapping it would invalidate the learned embeddings. The downside, addressed later by PubMedBERT, is that many biomedical terms get split into long sequences of subword pieces because they are absent from BERT's general-domain vocabulary.
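The fragmentation effect can be illustrated with a toy greedy longest-match-first WordPiece splitter. This is a minimal sketch, not the tokenizer shipped with BERT or BioBERT: the two vocabularies below are tiny hypothetical sets invented for the example, and the function name `wordpiece_tokenize` is likewise an assumption.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies, NOT the real BERT or PubMedBERT vocab files.
general_vocab = {"ad", "##eno", "##car", "##cin", "##oma", "the", "cell"}
biomedical_vocab = general_vocab | {"adenocarcinoma"}

general_pieces = wordpiece_tokenize("adenocarcinoma", general_vocab)
biomedical_pieces = wordpiece_tokenize("adenocarcinoma", biomedical_vocab)
print(general_pieces)     # five subword pieces
print(biomedical_pieces)  # one whole-word token
```

With the general-domain toy vocabulary the term costs five positions of the input budget; with a vocabulary that contains the term it costs one.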
Training used a maximum sequence length of 512 tokens and a batch size of 192 sequences. The hardware was eight NVIDIA V100 GPUs, and the longest run (BioBERT v1.1) took roughly 23 days. Several versioned checkpoints were released, differing in which combination of corpora was used and how many pre-training steps were taken.
| Version | Initialized from | Additional corpus | Steps | Vocabulary |
|---|---|---|---|---|
| BioBERT-Base v1.0 (+PubMed) | BERT-Base Cased | PubMed abstracts | 200K | BERT-Base Cased |
| BioBERT-Base v1.0 (+PMC) | BERT-Base Cased | PMC full text | 270K | BERT-Base Cased |
| BioBERT-Base v1.0 (+PubMed +PMC) | BERT-Base Cased | PubMed + PMC | 470K | BERT-Base Cased |
| BioBERT-Base v1.1 (+PubMed) | BERT-Base Cased | PubMed abstracts | 1M | BERT-Base Cased |
| BioBERT-Base v1.2 (+PubMed) | BERT-Base Cased | PubMed abstracts | 1M | BERT-Base Cased (with LM head) |
| BioBERT-Large v1.1 (+PubMed) | BERT-Large Cased | PubMed abstracts | 1M | Custom 30K vocab |
Version v1.1 (Base, PubMed only, 1M steps) became the most widely cited and used checkpoint, and is the recommended default in most subsequent papers. Version v1.2 is functionally similar but ships with the language modeling head intact, which is convenient for users who want to do further pre-training or use the model for fill-mask inference.
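As a rough worked example of the training scale, the batch shape and step counts quoted above can be multiplied out. This is back-of-the-envelope arithmetic under the assumption that every sequence is filled to the 512-token maximum, so the totals are upper bounds on token positions processed, not measured figures.

```python
MAX_SEQ_LEN = 512  # tokens per sequence (from the text)
BATCH_SIZE = 192   # sequences per step (from the text)

# Upper bound on token positions per optimizer step (padding counts too).
tokens_per_step = MAX_SEQ_LEN * BATCH_SIZE

# Step counts copied from the version table above.
steps_by_version = {
    "v1.0 (+PubMed)": 200_000,
    "v1.0 (+PMC)": 270_000,
    "v1.0 (+PubMed +PMC)": 470_000,
    "v1.1 (+PubMed)": 1_000_000,
}


def max_tokens_processed(steps):
    """Upper bound on token positions seen after `steps` updates."""
    return steps * tokens_per_step


for version, steps in steps_by_version.items():
    total = max_tokens_processed(steps)
    print(f"{version:22s} {steps:>9,} steps  <= {total / 1e9:5.1f}B token positions")
```

Each step covers at most 98,304 token positions, so the 1M-step v1.1 run processes at most about 98 billion positions.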
BioBERT inherits BERT's transformer encoder architecture without modification. The base configuration has 12 transformer encoder layers, hidden size 768, 12 self-attention heads, and roughly 110 million parameters. The large configuration has 24 layers, hidden size 1024, 16 attention heads, and around 340 million parameters. Each input is WordPiece-tokenized text with a [CLS] classification token prepended and a [SEP] separator token appended. The positional embeddings cap input length at 512 tokens, which is a real constraint for biomedical NLP: long abstracts can approach or exceed the limit once rare terms are split into subwords, and full-text articles certainly do.
Nothing in BioBERT's encoder distinguishes it from BERT in terms of structure. The differences are entirely in the data and the resulting weights, which is why BioBERT can be loaded into any standard BERT-compatible code with a different checkpoint path. This made adoption trivial in practice.
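The "roughly 110 million" figure for the base configuration can be checked by counting weights. The sketch below assumes the standard BERT layout, which the article does not spell out: a feed-forward width of 3072, two segment types, a pooler layer, and biases plus LayerNorm in the usual places. The vocabulary size of 28,996 is the cased BERT vocabulary mentioned earlier, and the helper name `bert_encoder_params` is hypothetical.

```python
def bert_encoder_params(vocab_size, hidden, layers, intermediate,
                        max_positions=512, type_vocab=2):
    """Count trainable parameters of a BERT-style encoder (no task head)."""
    layer_norm = 2 * hidden  # scale and shift vectors
    # Token, position, and segment embeddings plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections (weights + biases) plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (up- and down-projection) plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    pooler = hidden * hidden + hidden  # dense layer over the [CLS] vector
    return embeddings + layers * (attention + feed_forward) + pooler


base_params = bert_encoder_params(
    vocab_size=28_996, hidden=768, layers=12, intermediate=3072
)
print(f"{base_params:,}")  # 108,310,272
```

The number of attention heads does not appear in the count because the heads partition the same 768-wide projections. About a fifth of the total sits in the embedding table, which is why vocabulary choices matter for model size.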
The original BioBERT paper evaluated three categories of biomedical tasks: named entity recognition (NER), relation extraction (RE), and question answering (QA). The fine-tuning recipe is standard BERT-style: a linear classification head on top of the final layer, trained with cross-entropy loss on a small task-specific dataset.
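The fine-tuning head described above is small enough to write out by hand. The following is a minimal pure-Python sketch of a linear layer followed by softmax cross-entropy; the 4-dimensional vector stands in for a 768-dimensional encoder output, and all numbers and names are made up for illustration.

```python
import math


def linear_head(hidden, weights, bias):
    """Map one encoder output vector to one logit per class."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Toy stand-in for a final-layer [CLS] vector (real BioBERT-Base uses 768 dims).
cls_vector = [1.0, -2.0, 0.5, 0.0]
# Hypothetical head weights for a two-class task: one row per class.
weights = [[0.1, 0.0, 0.2, 0.0],
           [0.0, 0.1, 0.0, 0.3]]
bias = [0.0, 0.0]

logits = linear_head(cls_vector, weights, bias)
loss = cross_entropy(logits, target=0)
print(logits, loss)
```

For token-level tasks such as NER, the same head would be applied to every token's final-layer vector and the losses averaged; during fine-tuning the gradient of this loss updates both the head and the encoder weights.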
BioBERT was evaluated on nine biomedical NER datasets covering four entity types (disease, drug/chemical, gene/protein, species). Reported entity-level F1 scores for BioBERT-Base v1.1 (+PubMed) on eight of them are:
| Dataset | Entity type | BioBERT v1.1 F1 |
|---|---|---|
| NCBI Disease | Disease | 89.71 |
| BC5CDR-Disease | Disease | 87.15 |
| BC5CDR-Chemical | Chemical | 93.47 |
| BC4CHEMD | Chemical | 92.36 |
| BC2GM | Gene/Protein | 84.72 |
| JNLPBA | Gene/Protein | 77.49 |
| LINNAEUS | Species | 88.24 |
| Species-800 | Species | 74.06 |
The paper reports an average F1 improvement of 0.62 points over the previous state of the art across these NER datasets. The absolute gain is modest and BioBERT does not top every individual dataset, but it achieved this with a single uniform model and minimal task-specific engineering.
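Entity-level F1 counts a prediction as correct only when the whole span and its type match the gold annotation. The sketch below shows one way to compute it from BIO tag sequences. It is an illustration, not the evaluation script used in the paper, and it makes a simplifying assumption: an `I-` tag that does not continue an open entity of the same type is ignored.

```python
def extract_entities(tags):
    """Return a set of (start, end, type) spans from a BIO tag sequence."""
    entities = set()
    start, ent_type = None, None
    # A trailing "O" sentinel flushes an entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if continues:
            continue
        if ent_type is not None:
            entities.add((start, i, ent_type))  # end index is exclusive
        start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
        # Stray I- tags fall through and are treated like "O" (assumption).
    return entities


def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1 for one tag sequence."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-Disease", "I-Disease", "O", "B-Disease", "O"]
pred = ["B-Disease", "I-Disease", "O", "O", "O"]
score = entity_f1(gold, pred)
print(round(score, 4))  # 0.6667: precision 1.0, recall 0.5
```

A corpus-level score would pool true positives and span counts over all sentences before computing precision and recall, rather than averaging per-sentence F1.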
Relation extraction was evaluated on three datasets (ChemProt for chemical-protein interactions, and GAD and EU-ADR for gene-disease associations). BioBERT v1.1 (+PubMed) reports:
| Dataset | Task | BioBERT v1.1 F1 |
|---|---|---|
| ChemProt | Chemical-protein | 76.46 |
| GAD | Gene-disease | 79.83 |
| EU-ADR | Gene-disease | 79.74 |
The paper reports an average F1 improvement of 2.80 over the previous state of the art across the three RE datasets.
QA was evaluated on the BioASQ factoid task across three challenge years (4b, 5b, 6b). BioBERT v1.1 (+PubMed) reports:
| Dataset | Strict Accuracy | Lenient Accuracy | MRR |
|---|---|---|---|
| BioASQ 4b | 27.95 | 44.10 | 34.72 |
| BioASQ 5b | 46.00 | 60.00 | 51.64 |
| BioASQ 6b | 42.86 | 57.77 | 48.43 |
The paper reports an average MRR improvement of 12.24 over the previous state of the art across BioASQ. QA showed the largest absolute gain because earlier biomedical QA systems relied on hand-engineered features and were a poor fit for free-text answer extraction.
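The three QA columns can be computed from ranked answer lists. The sketch below assumes the usual BioASQ factoid convention, which the article does not state: each system returns up to five candidates, strict accuracy checks only the top candidate, lenient accuracy checks the whole list, and MRR averages the reciprocal rank of the first correct candidate. The questions, answers, and function names are invented for illustration.

```python
def first_correct_rank(ranked_answers, gold_answers, k=5):
    """1-based rank of the first correct candidate in the top k, else None."""
    gold = {g.lower() for g in gold_answers}
    for rank, answer in enumerate(ranked_answers[:k], start=1):
        if answer.lower() in gold:
            return rank
    return None


def factoid_metrics(predictions, gold_answers, k=5):
    """Return (strict accuracy, lenient accuracy, MRR) over all questions."""
    ranks = [first_correct_rank(p, g, k)
             for p, g in zip(predictions, gold_answers)]
    n = len(ranks)
    strict = sum(r == 1 for r in ranks) / n
    lenient = sum(r is not None for r in ranks) / n
    mrr = sum(1.0 / r for r in ranks if r is not None) / n
    return strict, lenient, mrr


# Made-up example: correct at rank 1, correct at rank 2, never correct.
predictions = [
    ["BRCA1", "TP53"],
    ["aspirin", "ibuprofen"],
    ["insulin", "glucagon"],
]
gold_answers = [["BRCA1"], ["Ibuprofen"], ["leptin"]]

strict, lenient, mrr = factoid_metrics(predictions, gold_answers)
print(strict, lenient, mrr)
```

By construction strict accuracy can never exceed MRR, and MRR can never exceed lenient accuracy, which matches the ordering of the three columns in every row of the table.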
BioBERT triggered a wave of follow-up work that explored alternative pre-training corpora, vocabularies, architectures, and scales. The table below summarizes the most influential biomedical and clinical language models that came after BioBERT.
| Model | Year | Authors / org | Pre-training data | Vocabulary | Notes |
|---|---|---|---|---|---|
| BERT | 2018 | Devlin et al., Google AI | Wikipedia + BooksCorpus | BERT-Base Cased/Uncased | General domain baseline |
| BioBERT | 2019/2020 | Lee et al., Korea University DMIS | BERT corpus + PubMed + PMC | BERT-Base Cased | Continued pre-training from BERT |
| BlueBERT | 2019 | Peng et al., NIH | PubMed + MIMIC-III | BERT-Base | Mixed biomedical + clinical |
| SciBERT | 2019 | Beltagy et al., AI2 (Allen Institute) | 1.14M Semantic Scholar papers (82% biomed, 18% CS) | Custom scivocab | From-scratch with new vocab |
| ClinicalBERT (Bio_ClinicalBERT) | 2019 | Alsentzer et al., MIT/Harvard | MIMIC-III clinical notes (~880M words) | BERT-Base Cased | Initialized from BioBERT, further pre-trained on clinical text |
| ClinicalBERT (Huang) | 2019 | Huang et al., NYU | MIMIC-III discharge summaries | BERT-Base | Hospital readmission prediction |
| BioMed-RoBERTa | 2020 | Gururangan et al., AI2 | 2.68M biomed/CS papers | RoBERTa | Domain-adaptive pre-training (DAPT) on top of RoBERTa |
| BioMegatron | 2020 | Shin et al., NVIDIA | PubMed | Custom | 345M–1.2B parameter biomedical model based on Megatron-LM |
| PubMedBERT | 2021 | Gu et al., Microsoft Research | PubMed abstracts (and full text variant) | Custom biomedical vocab | From-scratch, outperforms BioBERT on most tasks |
| SapBERT | 2021 | Liu et al., Cambridge LTL | UMLS synonyms | PubMedBERT vocab | Synonym-aware, used for entity linking |
| GatorTron | 2022 | Yang et al., University of Florida | 90B words (82B clinical + PubMed + Wikipedia) | Custom | 345M to 8.9B parameter clinical model |
| BioGPT | 2022 | Luo et al., Microsoft Research | PubMed (15M abstracts) | Custom | Generative GPT-2-style biomedical model |
| BioMedLM (PubMedGPT) | 2022/2023 | Stanford CRFM + MosaicML | PubMed via The Pile | Custom | 2.7B parameter GPT model; 50.3% on MedQA |
| PMC-LLaMA | 2023 | Wu et al. | PubMed Central + medical books | LLaMA tokenizer | Continued pre-training of LLaMA on biomedical text |
| Med-PaLM / Med-PaLM 2 | 2023 | Singhal et al., Google | PaLM with medical instruction tuning | PaLM | Generative; passed USMLE-style questions |
| Clinical ModernBERT | 2025 | Recent | Biomedical + clinical text | ModernBERT | Long-context biomedical encoder |
A few comparisons are worth flagging. PubMedBERT showed that for a domain like biomedicine, where billions of words of in-domain text are freely available, training a vocabulary and weights from scratch on biomedical text alone outperforms continual pre-training from a general-domain checkpoint. This was a direct critique of BioBERT's design and is now the conventional wisdom for high-resource technical domains. SciBERT made a similar argument with a smaller scientific corpus and a custom vocabulary, and reported better results than BioBERT on BC5CDR and ChemProt despite a smaller training corpus. ClinicalBERT by Alsentzer et al. starts from BioBERT's weights and continues pre-training on MIMIC-III clinical notes, producing a model that outperforms BioBERT on clinical (as opposed to literature) tasks. Later models such as GatorTron and BioMedLM scaled up to billions of parameters, and the very latest generation of biomedical foundation models is generative (BioGPT, Med-PaLM, PMC-LLaMA), aligning with the broader shift in NLP from encoder-only models to decoder-only LLMs.
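BioBERT and its derivatives have been applied across the entire biomedical text-mining stack. Common applications include named entity recognition, relation extraction, question answering, biomedical search, semantic indexing, clinical NLP pipelines, and systematic-review tooling.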
Fine-tuning BioBERT follows the standard BERT recipe: load the pre-trained checkpoint, attach a task-specific head, train for a few epochs with a small learning rate (commonly 2e-5 to 5e-5) on labeled biomedical data. The original repository provides reference fine-tuning scripts for NER, RE, and QA. The Hugging Face Transformers ecosystem hosts the canonical checkpoints (dmis-lab/biobert-v1.1, dmis-lab/biobert-base-cased-v1.2, and the large variant), and they slot into existing pipelines with no modification.
For parameter-efficient fine-tuning, BioBERT works with adapters and LoRA, which is helpful when GPU memory is tight or when many task-specific heads need to be maintained. BioBERT is also frequently used as a frozen feature extractor: the contextual embeddings from the final or penultimate layer are fed into downstream classifiers, retrieval systems, or clustering algorithms. This is especially common in production biomedical NLP systems where robustness and predictability matter more than the marginal gain from full fine-tuning.
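When the encoder is used as a frozen feature extractor, the per-token vectors usually have to be reduced to one vector per text before they reach a classifier or retrieval index. Masked mean pooling followed by cosine similarity is one common choice, shown here as a sketch. The pooling strategy is an assumption and not something prescribed by BioBERT, and the 2-dimensional vectors are toy stand-ins for real 768-dimensional outputs.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy encoder outputs: the third position of the first text is padding.
text_a_tokens = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
text_a_mask = [1, 1, 0]
text_b_tokens = [[0.0, 2.0], [0.0, 4.0]]
text_b_mask = [1, 1]

embedding_a = mean_pool(text_a_tokens, text_a_mask)
embedding_b = mean_pool(text_b_tokens, text_b_mask)
print(embedding_a, embedding_b, cosine_similarity(embedding_a, embedding_b))
```

Because the encoder weights never change in this setup, the pooled vectors can be computed once and cached, which is part of why the approach suits high-volume production systems.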
The original tokenizer is BERT's WordPiece, so domain-specific terms get fragmented into multiple subwords. Users should be aware that token counts on biomedical text are higher than on general text for the same character length, which interacts with the 512-token sequence limit. For long documents, common workarounds include sliding-window inference, hierarchical models, or switching to long-context biomedical encoders such as Clinical-Longformer or Clinical ModernBERT.
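Sliding-window inference, the simplest of these workarounds, can be sketched in a few lines. The function below splits a long list of token ids into overlapping chunks that each leave room for [CLS] and [SEP]. The default overlap of 64 tokens and the function name are illustrative assumptions, and merging the per-window predictions afterwards is left out.

```python
def sliding_windows(token_ids, max_len=512, overlap=64, num_special=2):
    """Split token ids into overlapping chunks that fit the model's limit."""
    body_len = max_len - num_special  # room left after [CLS] and [SEP]
    if body_len <= 0 or not 0 <= overlap < body_len:
        raise ValueError("need 0 <= overlap < max_len - num_special")
    step = body_len - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + body_len])
        if start + body_len >= len(token_ids):
            break  # this window already reaches the end of the document
        start += step
    return windows


# A hypothetical 1,200-token document, represented by dummy ids.
document = list(range(1200))
windows = sliding_windows(document)
print([len(w) for w in windows])  # [510, 510, 308]
```

Tokens in an overlap region receive predictions from two windows; a common design choice is to keep the prediction from the window in which the token has more surrounding context.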
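BioBERT has well-known limitations that have shaped subsequent research: its general-domain WordPiece vocabulary fragments biomedical terms, its input is capped at 512 tokens, from-scratch models such as PubMedBERT outperform it on most tasks, and as an encoder-only model it lacks the in-context learning of generative LLMs.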
BioBERT has been one of the most cited papers in biomedical NLP, with citation counts on Google Scholar exceeding 8,000 by 2024. Its influence shows up in three forms.
First, it established the domain-specific BERT paradigm. After BioBERT, many other domains and languages saw analogous models (LegalBERT, FinBERT, ScholarBERT, MathBERT, AraBERT, JuriBERT, and more). The recipe of "take BERT, continue pre-training on your domain's text" became standard practice, even after PubMedBERT showed the from-scratch alternative often does better.
Second, it became the default baseline. Almost every biomedical NLP paper published since 2019 reports BioBERT numbers as a comparison point, including the papers that propose alternative architectures. The Microsoft BLURB benchmark, introduced with PubMedBERT in 2021, formalized this by including BioBERT alongside several other models in a standardized leaderboard for biomedical NLP.
Third, it remains in active production use. Despite the rise of larger and generative models, BioBERT (and its close derivatives) is widely deployed in biomedical search, semantic indexing, clinical NLP pipelines, and systematic-review tooling. Its small size, predictable behavior, and strong performance on classification and tagging make it a reliable workhorse.
The biomedical NLP landscape has shifted considerably since BioBERT's release. Generative LLMs, including GPT-4, GPT-4o, Claude 3.5 and 3.7 Sonnet, Gemini 1.5 and 2.0, and the dedicated medical models Med-PaLM 2 and Med-Gemini, have become competitive or superior on many biomedical tasks, especially question answering, summarization, and reasoning. Large generative biomedical models such as BioGPT, BioMedLM, PMC-LLaMA, and Meditron offer in-context learning capabilities that BioBERT lacks.
At the same time, BioBERT is still widely used in 2024 and 2025 for several reasons. Supervised fine-tuning of a small encoder remains cheaper and more reliable than zero-shot prompting of a frontier LLM for high-volume tasks like NER over millions of PubMed abstracts. Encoder embeddings remain better suited to dense retrieval than autoregressive models. Regulatory and reproducibility constraints in clinical settings often favor smaller open models with stable behavior over closed frontier APIs. And many existing biomedical NLP systems, especially in pharmaceutical companies and clinical informatics groups, are built on BioBERT or close relatives and would not be replaced lightly.
The broader trajectory is clearly toward larger generative biomedical models trained on combinations of PubMed, PMC, clinical notes, medical textbooks, and curated reasoning data. BioBERT's specific architecture is increasingly historical, but the principle it established (that domain-specific pre-training matters in biomedicine, and that PubMed and PMC are the right places to start) continues to shape every new biomedical foundation model.