# BioBERT

> Source: https://aiwiki.ai/wiki/biobert
> Updated: 2026-06-28
> Categories: Healthcare AI, Large Language Models, Transformer Models
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**BioBERT** (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is a domain-specific [language model](/wiki/large_language_model) that adapts [BERT](/wiki/bert) to biomedicine by continuing its pre-training on large biomedical corpora, namely PubMed abstracts (about 4.5 billion words) and PubMed Central (PMC) full-text articles (about 13.5 billion words), on top of BERT's original general-domain corpus.[1] Introduced in January 2019 and published in *Bioinformatics* in 2020, it was the first widely adopted biomedical-domain [transformer](/wiki/transformer) language model and rapidly became the de facto baseline for biomedical [natural language processing](/wiki/natural_language_processing) (BioNLP). The paper reports that BioBERT improved on the previous state of the art by 0.62% F1 on biomedical [named entity recognition](/wiki/named_entity_recognition), 2.80% F1 on relation extraction, and 12.24% MRR on biomedical [question answering](/wiki/question_answering), and concludes that "BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora."[1]

BioBERT was developed by the Data Mining and Information Systems (DMIS) Lab at Korea University, led by Professor Jaewoo Kang, and released as open source via the [`dmis-lab/biobert`](https://github.com/dmis-lab/biobert) repository.[17] The model is distributed in several configurations (BioBERT-Base v1.0, v1.1, v1.2 and BioBERT-Large v1.1); the most popular checkpoint is `dmis-lab/biobert-v1.1` on Hugging Face.[19]

BioBERT demonstrated that taking a strong general-purpose encoder and continuing pre-training on biomedical text could substantially improve performance on biomedical named entity recognition, relation extraction, and question answering, often without changing the underlying architecture or vocabulary.[1] The result was a clear empirical case for domain-specific pre-training in technical fields where vocabulary, syntax, and semantics differ sharply from general English. The paper has been cited more than 8,000 times and spawned an entire family of biomedical and clinical BERT derivatives, including BlueBERT, [SciBERT](/wiki/scibert), ClinicalBERT, PubMedBERT, SapBERT, BioMegatron, GatorTron, and the generative BioGPT and BioMedLM.

## What is BioBERT?

BioBERT is a [BERT](/wiki/bert)-based encoder whose weights have been specialized for biomedical text. Its authors describe it as "a domain-specific language representation model pre-trained on large-scale biomedical corpora," and note that "directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora."[1] In other words, BioBERT exists because the everyday English that BERT learned from Wikipedia and books does not match the dense, specialized vocabulary of biomedical literature, so the model is re-exposed to millions of biomedical documents to close that gap.

Architecturally BioBERT is identical to BERT (see [How was BioBERT trained?](#how-was-biobert-trained)); the differences are entirely in the training data and the resulting weights. That design makes BioBERT a drop-in replacement for BERT in any standard [fine-tuning](/wiki/fine_tuning) pipeline, which is a large part of why it was adopted so quickly.

## Background and motivation

By late 2018, [BERT](/wiki/bert) had set new state-of-the-art results on the GLUE benchmark and on a wide range of general-domain natural language understanding tasks. The original BERT, released by Google AI in October 2018, was pre-trained on the BooksCorpus (about 0.8 billion words) and English Wikipedia (about 2.5 billion words) using two self-supervised objectives: masked language modeling (MLM) and next sentence prediction (NSP).[3] These corpora are dominated by everyday vocabulary and narrative prose.

Biomedical text is different. Words such as *transcriptional*, *idiopathic*, *adenocarcinoma*, *BRCA1*, *acetylcholinesterase*, and *NF-κB* appear constantly in biomedical literature but rarely in Wikipedia or novels. Entity names are dense and ambiguous (gene names overlap with everyday words, drug names are highly variable), sentences are long and syntactically complex, and the underlying semantic relationships often involve specialized scientific concepts. Out-of-the-box BERT often trailed task-specific BiLSTM-CRF systems trained on labeled biomedical corpora, despite its general advantage on most other NLP tasks.[1]

The DMIS Lab team hypothesized that the BERT framework could be adapted by simply continuing pre-training on biomedical text rather than re-training from scratch. This approach is far cheaper than from-scratch pre-training and inherits BERT's general linguistic knowledge as a starting point. The result was BioBERT.

## Who created BioBERT and when was it released?

BioBERT was developed at the **DMIS (Data Mining and Information Systems) Lab** in the Department of Computer Science and Engineering at **Korea University** in Seoul, South Korea, under the supervision of Professor Jaewoo Kang. The paper authors are Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang.[1]

Key dates and venues:

| Event | Date |
|---|---|
| First arXiv preprint (v1) | 25 January 2019 (arXiv:1901.08746) |
| Final arXiv revision (v4) | 10 September 2019 |
| Published in *Bioinformatics* | 10 September 2019 (online), 2020 (print, 36(4):1234-1240) |
| Open source release on GitHub | 2019 (`dmis-lab/biobert`) |
| Pre-trained weights mirrored | `naver/biobert-pretrained` |

The code base is built on top of Google's reference TensorFlow implementation of BERT, with PyTorch ports added later via Hugging Face Transformers.[17] Pre-training compute was contributed by Naver, the Korean search and AI company, which also hosts the pre-trained weights mirror.[18]

## What data was BioBERT pre-trained on?

BioBERT extends BERT by performing additional pre-training (also called *continual* or *domain-incremental* pre-training) on biomedical text. The corpora used are:[1]

| Corpus | Size (words) | Domain |
|---|---|---|
| English Wikipedia | 2.5 billion | General (inherited from BERT) |
| BooksCorpus | 0.8 billion | General (inherited from BERT) |
| PubMed abstracts | 4.5 billion | Biomedical literature abstracts |
| PMC full-text articles | 13.5 billion | Biomedical literature full text |

PubMed is the United States National Library of Medicine's bibliographic database of biomedical citations, and PubMed Central (PMC) is the corresponding open-access repository of full-text journal articles. The combined biomedical corpus is roughly 18 billion words (4.5 billion + 13.5 billion), about five times the size of the 3.3-billion-word original BERT corpus, and is exclusively scientific.[1]

## How was BioBERT trained?

A distinctive design choice of BioBERT is that pre-training is **initialized from the released BERT weights**, not from scratch. The model then continues training on the biomedical corpora using the same MLM and NSP objectives that BERT used. Critically, the BioBERT-Base checkpoints keep the original BERT WordPiece vocabulary (cased, 28,996 tokens) rather than building a new biomedical vocabulary; the later BioBERT-Large v1.1 release, listed below with a custom 30K vocabulary, is the exception.[1] This was a pragmatic decision: starting from BERT weights requires using BERT's tokenizer, and changing the vocabulary mid-training would invalidate the embedding table. The downside, addressed later by PubMedBERT, is that many biomedical terms get split into long sequences of subword pieces because they are absent from BERT's general-domain vocabulary.[7]
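The fragmentation effect can be illustrated with a minimal sketch of greedy longest-match-first WordPiece segmentation. The function name `wordpiece_tokenize` and the tiny `TOY_VOCAB` are hypothetical and chosen only for illustration; the real BERT vocabulary has 28,996 entries and may split the same word differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matches: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT BERT's real 28,996-entry vocabulary.
TOY_VOCAB = {"the", "patient", "aden", "##oc", "##ar", "##cin", "##oma"}

print(wordpiece_tokenize("patient", TOY_VOCAB))         # ['patient']
print(wordpiece_tokenize("adenocarcinoma", TOY_VOCAB))  # ['aden', '##oc', '##ar', '##cin', '##oma']
```

In this toy setting an everyday word maps to one token while the biomedical term needs five, which is the kind of sequence-length inflation that an in-domain vocabulary is meant to reduce.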

Training used a maximum sequence length of 512 tokens and a batch size of 192 sequences. The hardware was eight NVIDIA V100 GPUs, and the longest run (BioBERT v1.1) took roughly 23 days.[1] Several versioned checkpoints were released, differing in which combination of corpora was used and how many pre-training steps were taken.
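As a rough sanity check on scale, the batch geometry above fixes how many token positions each update can cover. The sketch below is plain arithmetic on the two figures quoted in the text; the helper name `max_token_positions` is hypothetical, and the result is an upper bound because padded positions are counted and WordPiece tokens are not the same unit as words.

```python
MAX_SEQ_LEN = 512   # maximum sequence length, from the text above
BATCH_SIZE = 192    # sequences per mini-batch, from the text above

tokens_per_step = MAX_SEQ_LEN * BATCH_SIZE


def max_token_positions(steps):
    """Upper bound on token positions processed in `steps` updates (counts padding)."""
    return steps * tokens_per_step


print(tokens_per_step)                  # 98304
print(max_token_positions(1_000_000))   # 98304000000
```

For a 1M-step run this is on the order of 98 billion token positions, which suggests the 4.5-billion-word PubMed corpus is traversed many times during pre-training.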

### Released versions

| Version | Initialized from | Additional corpus | Steps | Vocabulary |
|---|---|---|---|---|
| BioBERT-Base v1.0 (+PubMed) | BERT-Base Cased | PubMed abstracts | 200K | BERT-Base Cased |
| BioBERT-Base v1.0 (+PMC) | BERT-Base Cased | PMC full text | 270K | BERT-Base Cased |
| BioBERT-Base v1.0 (+PubMed +PMC) | BERT-Base Cased | PubMed + PMC | 470K | BERT-Base Cased |
| BioBERT-Base v1.1 (+PubMed) | BERT-Base Cased | PubMed abstracts | 1M | BERT-Base Cased |
| BioBERT-Base v1.2 (+PubMed) | BERT-Base Cased | PubMed abstracts | 1M | BERT-Base Cased (with LM head) |
| BioBERT-Large v1.1 (+PubMed) | BERT-Large Cased | PubMed abstracts | 1M | Custom 30K vocab |

Version v1.1 (Base, PubMed only, 1M steps) became the most widely cited and used checkpoint, and is the recommended default in most subsequent papers. Version v1.2 is functionally similar but ships with the language modeling head intact, which is convenient for users who want to do further pre-training or use the model for fill-mask inference.

## Architecture

BioBERT inherits BERT's transformer encoder architecture without modification. The base configuration has 12 transformer encoder layers, hidden size 768, 12 self-attention heads, and roughly 110 million parameters. The large configuration has 24 layers, hidden size 1024, 16 attention heads, and around 340 million parameters. Each input is WordPiece-tokenized text with a `[CLS]` classification token prepended and a `[SEP]` separator token closing each segment. The positional embeddings cap input length at 512 tokens, which is a real constraint for biomedical NLP: long PubMed abstracts can approach or exceed this length once fragmented into subwords, and full-text articles certainly do.
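The "roughly 110 million" figure can be reproduced approximately by counting weights and biases in a BERT-style encoder. This is a sketch under stated assumptions: the feed-forward width of 3072 (four times the hidden size), the two segment-type embeddings, and the pooler layer come from the standard BERT configuration rather than from this article, and the function name `bert_param_count` is hypothetical. The number of attention heads does not change the count, because the heads partition the same projection matrices.

```python
def bert_param_count(vocab_size, hidden, layers, intermediate,
                     max_positions=512, type_vocab=2):
    """Count weights and biases in a BERT-style encoder with a pooler."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position and segment-type embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


# Base configuration with the 28,996-token cased vocabulary mentioned earlier.
base = bert_param_count(vocab_size=28_996, hidden=768, layers=12, intermediate=3072)
print(f"{base:,}")  # 108,310,272

# Large configuration; 30,000 is a placeholder for the "custom 30K" vocabulary (assumption).
large = bert_param_count(vocab_size=30_000, hidden=1024, layers=24, intermediate=4096)
print(f"{large:,}")
```

Under these assumptions the base encoder comes to about 108 million parameters, with roughly a fifth of them in the embedding table.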

Nothing in BioBERT's encoder distinguishes it from BERT in terms of structure. The differences are entirely in the data and the resulting weights, which is why BioBERT can be loaded into any standard BERT-compatible code with a different checkpoint path. This made adoption trivial in practice.

## How well does BioBERT perform on biomedical NLP tasks?

The original BioBERT paper evaluated three categories of biomedical tasks: named entity recognition (NER), relation extraction (RE), and question answering (QA). The fine-tuning recipe is standard BERT-style: a linear classification head on top of the final layer, trained with cross-entropy loss on a small task-specific dataset. Across these three task families the paper reports gains over the prior state of the art of 0.62% F1 (NER), 2.80% F1 (RE), and 12.24% MRR (QA).[1]

### Named entity recognition results

BioBERT was evaluated on nine biomedical NER datasets covering four entity types (disease, drug/chemical, gene/protein, species); the table below lists eight of them. Reported entity-level F1 scores for BioBERT-Base v1.1 (+PubMed) are:[1]

| Dataset | Entity type | BioBERT v1.1 F1 |
|---|---|---|
| NCBI Disease | Disease | 89.71 |
| BC5CDR-Disease | Disease | 87.15 |
| BC5CDR-Chemical | Chemical | 93.47 |
| BC4CHEMD | Chemical | 92.36 |
| BC2GM | Gene/Protein | 84.72 |
| JNLPBA | Gene/Protein | 77.49 |
| LINNAEUS | Species | 88.24 |
| Species-800 | Species | 74.06 |

The paper reports a micro-averaged F1 improvement of 0.62 over the previous state of the art across these NER datasets.[1] The absolute gain is modest and not uniform: BioBERT scored higher than BERT on every dataset but beat the prior state of the art on six of the nine. It achieved this with a single uniform model and minimal task-specific engineering.
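Entity-level F1 counts a prediction as correct only when both the span boundaries and the entity type match the gold annotation exactly. The following is a minimal sketch of that scoring for BIO-tagged sequences; the function names and the toy tags are illustrative assumptions, not the evaluation script used in the paper.

```python
def extract_entities(tags):
    """Return a set of (start, end, type) spans from a BIO tag sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, kind))  # current span ends before position i
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1 for one tagged sequence."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-Disease", "I-Disease", "O", "B-Disease", "O"]
pred = ["B-Disease", "I-Disease", "O", "O", "O"]
print(round(entity_f1(gold, pred), 4))  # 0.6667
```

Here the prediction finds one of two gold entities with no false positives, so precision is 1.0, recall is 0.5, and F1 is 2/3. A prediction that gets only part of a multi-token span earns no credit under this scheme.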

### Relation extraction results

Relation extraction was evaluated on three datasets: ChemProt for chemical-protein interactions, and GAD and EU-ADR for gene-disease relations. BioBERT v1.1 (+PubMed) reports:[1]

| Dataset | Task | BioBERT v1.1 F1 |
|---|---|---|
| ChemProt | Chemical-protein | 76.46 |
| GAD | Gene-disease | 79.83 |
| EU-ADR | Gene-disease | 79.74 |

The paper reports an average F1 improvement of 2.80 over the previous state of the art across the three RE datasets.[1]

### Question answering results

QA was evaluated on the BioASQ factoid task across three challenge years (4b, 5b, 6b). BioBERT v1.1 (+PubMed) reports:[1]

| Dataset | Strict Accuracy | Lenient Accuracy | MRR |
|---|---|---|---|
| BioASQ 4b | 27.95 | 44.10 | 34.72 |
| BioASQ 5b | 46.00 | 60.00 | 51.64 |
| BioASQ 6b | 42.86 | 57.77 | 48.43 |

The paper reports an average MRR improvement of 12.24 over the previous state of the art across BioASQ.[1] QA showed the largest absolute gain because earlier biomedical QA systems relied on hand-engineered features and were a poor fit for free-text answer extraction.
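The three columns in the QA table can be computed from ranked answer lists. The sketch below assumes strict accuracy checks only the top-ranked answer, lenient accuracy accepts a correct answer anywhere in the top five, and MRR averages the reciprocal rank of the first correct answer; case-insensitive exact matching and the function name `factoid_metrics` are simplifying assumptions, not the official BioASQ evaluation code.

```python
def factoid_metrics(ranked_predictions, gold_answers, top_k=5):
    """Strict accuracy, lenient accuracy and MRR over a list of questions."""
    strict = lenient = reciprocal_sum = 0.0
    for candidates, gold in zip(ranked_predictions, gold_answers):
        gold_set = {g.lower() for g in gold}
        # 1-based rank of the first correct candidate within the top_k, if any.
        rank = next((i + 1 for i, c in enumerate(candidates[:top_k])
                     if c.lower() in gold_set), None)
        if rank is not None:
            strict += rank == 1
            lenient += 1
            reciprocal_sum += 1.0 / rank
    n = len(gold_answers)
    return strict / n, lenient / n, reciprocal_sum / n


# Toy questions with made-up candidate lists (illustrative only).
predictions = [["BRCA1", "TP53"], ["aspirin", "ibuprofen", "warfarin"], ["insulin"]]
gold = [["BRCA1"], ["warfarin"], ["glucagon"]]
sacc, lacc, mrr = factoid_metrics(predictions, gold)
print(round(sacc, 3), round(lacc, 3), round(mrr, 3))  # 0.333 0.667 0.444
```

The first question is answered at rank 1, the second at rank 3, and the third not at all, which gives strict accuracy 1/3, lenient accuracy 2/3, and MRR (1 + 1/3) / 3 = 4/9.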

## How does BioBERT compare to BERT and SciBERT?

BioBERT, [BERT](/wiki/bert), and [SciBERT](/wiki/scibert) are all built on the same transformer encoder, but they differ in what they were trained on and in whose vocabulary they use. BERT is the general-domain baseline. BioBERT and SciBERT are the two best-known early attempts to specialize that baseline for science, and they took opposite approaches.

BioBERT was *initialized from BERT's weights* and then continued pre-training on PubMed and PMC, reusing BERT's original WordPiece vocabulary unchanged.[1] [SciBERT](/wiki/scibert) (Beltagy, Lo, and Cohan, 2019) instead trained *from scratch* on a broader scientific corpus and built its own in-domain vocabulary. The SciBERT authors state that their model is "trained from scratch" and that they "construct SciVocab, a new WordPiece vocabulary on our scientific corpus using the SentencePiece library."[4] SciBERT's corpus is 1.14 million papers from Semantic Scholar, 82% biomedical and 18% computer science, totaling about 3.17 billion tokens.[4] The new vocabulary matters: the SciBERT paper reports that "the resulting token overlap between BaseVocab and SciVocab is 42%, illustrating a substantial difference in frequently used words between scientific and general domain texts."[4]

| Model | Year | Initialization | Pre-training corpus | Vocabulary |
|---|---|---|---|---|
| [BERT](/wiki/bert) | 2018 | From scratch | Wikipedia (2.5B) + BooksCorpus (0.8B) | BERT WordPiece (general) |
| BioBERT | 2019 | Continued from BERT | PubMed (4.5B) + PMC (13.5B) | BERT WordPiece (reused) |
| [SciBERT](/wiki/scibert) | 2019 | From scratch | 1.14M Semantic Scholar papers, ~3.17B tokens (82% biomed, 18% CS) | SciVocab (custom, 42% overlap with BERT) |

The practical upshot: BioBERT is cheaper to produce and stays close to BERT's general knowledge, but it inherits a general-domain vocabulary that over-fragments biomedical terms. SciBERT spends more to train a custom vocabulary and from-scratch weights, and reported stronger results than BioBERT on some shared benchmarks such as BC5CDR and ChemProt. The later PubMedBERT model pushed the from-scratch, in-domain-vocabulary idea further and showed it generally beats continual pre-training when billions of words of in-domain text are available, which is now the conventional wisdom for high-resource technical domains.[7]

## Variants and successors

BioBERT triggered a wave of follow-up work that explored alternative pre-training corpora, vocabularies, architectures, and scales. The table below summarizes the most influential biomedical and clinical language models that came after BioBERT.

| Model | Year | Authors / org | Pre-training data | Vocabulary | Notes |
|---|---|---|---|---|---|
| [BERT](/wiki/bert) | 2018 | Devlin et al., Google AI | Wikipedia + BooksCorpus | BERT-Base Cased/Uncased | General domain baseline |
| BioBERT | 2019/2020 | Lee et al., Korea University DMIS | BERT corpus + PubMed + PMC | BERT-Base Cased | Continued pre-training from BERT |
| BlueBERT | 2019 | Peng et al., NIH | PubMed + MIMIC-III | BERT-Base | Mixed biomedical + clinical |
| [SciBERT](/wiki/scibert) | 2019 | Beltagy et al., AI2 (Allen Institute) | 1.14M Semantic Scholar papers (82% biomed, 18% CS) | Custom *scivocab* | From-scratch with new vocab |
| ClinicalBERT (Bio_ClinicalBERT) | 2019 | Alsentzer et al., MIT/Harvard | MIMIC-III clinical notes (~880M words) | BERT-Base Cased | Initialized from BioBERT, further pre-trained on clinical text |
| ClinicalBERT (Huang) | 2019 | Huang et al., NYU | MIMIC-III discharge summaries | BERT-Base | Hospital readmission prediction |
| BioMed-RoBERTa | 2020 | Gururangan et al., AI2 | 2.68M biomed/CS papers | RoBERTa | Domain-adaptive pre-training (DAPT) on top of [RoBERTa](/wiki/roberta) |
| BioMegatron | 2020 | Shin et al., NVIDIA | PubMed | Custom | 345M-1.2B parameter biomedical model based on Megatron-LM |
| PubMedBERT | 2021 | Gu et al., Microsoft Research | PubMed abstracts (and full text variant) | Custom biomedical vocab | From-scratch, outperforms BioBERT on most tasks |
| SapBERT | 2021 | Liu et al., Cambridge LTL | UMLS synonyms | PubMedBERT vocab | Synonym-aware, used for entity linking |
| GatorTron | 2022 | Yang et al., University of Florida | 90B words (82B clinical + PubMed + Wikipedia) | Custom | 345M to 8.9B parameter clinical model |
| BioGPT | 2022 | Luo et al., Microsoft Research | PubMed (15M abstracts) | Custom | Generative GPT-2-style biomedical model |
| BioMedLM (PubMedGPT) | 2022/2023 | Stanford CRFM + MosaicML | PubMed via The Pile | Custom | 2.7B parameter GPT model; 50.3% on MedQA |
| PMC-LLaMA | 2023 | Wu et al. | PubMed Central + medical books | LLaMA tokenizer | Continued pre-training of LLaMA on biomedical text |
| Med-PaLM / Med-PaLM 2 | 2023 | Singhal et al., Google | PaLM with medical instruction tuning | PaLM | Generative; passed USMLE-style questions |
| Clinical ModernBERT | 2025 | Recent | Biomedical + clinical text | ModernBERT | Long-context biomedical encoder |

A few comparisons are worth flagging. **PubMedBERT** showed that for a domain like biomedicine, where billions of words of in-domain text are freely available, training a vocabulary and weights from scratch on biomedical text alone outperforms continual pre-training from a general-domain checkpoint.[7] This was a direct critique of BioBERT's design and is now the conventional wisdom for high-resource technical domains. **SciBERT** made a similar argument with a smaller scientific corpus and a custom vocabulary, and reported better results than BioBERT on BC5CDR and ChemProt despite a smaller training corpus.[4] **ClinicalBERT** by Alsentzer et al. starts from BioBERT's weights and continues pre-training on MIMIC-III clinical notes, producing a model that outperforms BioBERT on clinical (as opposed to literature) tasks.[5] Later models such as **GatorTron** and **BioMedLM** scaled up to billions of parameters, and the very latest generation of biomedical foundation models is generative (BioGPT, Med-PaLM, PMC-LLaMA), aligning with the broader shift in NLP from encoder-only models to decoder-only LLMs.

## What is BioBERT used for?

BioBERT and its derivatives have been applied across the entire biomedical text-mining stack. Common applications include:

- Biomedical literature mining and information extraction over PubMed.
- Biomedical named entity recognition for diseases, chemicals, drugs, genes, proteins, and species.
- Drug-drug interaction (DDI) extraction from clinical and pharmacology literature.
- Adverse drug event (ADE) detection from medical text and social media for pharmacovigilance.
- Protein-protein interaction (PPI) extraction from biomedical abstracts.
- Gene-disease association mining for biomarker discovery.
- Biomedical question answering, including BioASQ challenges.
- Clinical decision support, with appropriate caution regarding patient safety.
- Medical entity linking (mapping mentions to UMLS, MeSH, or SNOMED-CT concept identifiers), often combined with SapBERT.
- Drug repurposing pipelines that mine literature for unexpected drug-target relationships.
- Semantic search over PubMed using BioBERT embeddings as a feature extractor for retrieval.
- Document classification for systematic review screening, where BioBERT can triage candidate abstracts.

## Practical considerations

Fine-tuning BioBERT follows the standard BERT recipe: load the pre-trained checkpoint, attach a task-specific head, and train for a few epochs with a small learning rate (commonly 2e-5 to 5e-5) on labeled biomedical data. The original repository provides reference fine-tuning scripts for NER, RE, and QA.[17] The [Hugging Face](/wiki/hugging_face) Transformers ecosystem hosts the canonical checkpoints (`dmis-lab/biobert-v1.1`, `dmis-lab/biobert-base-cased-v1.2`, and the large variant), and they slot into existing BERT pipelines without code changes.[19]
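A minimal sketch of the "load the checkpoint, attach a head" step is shown here, assuming the Hugging Face `transformers` library (with a PyTorch backend) is installed. The label set `NER_LABELS` and the helper names are illustrative assumptions; only the checkpoint name comes from the text above.

```python
NER_LABELS = ["O", "B-Disease", "I-Disease"]  # illustrative label set (assumption)


def label_maps(labels):
    """Build the id<->label dictionaries that a token-classification head expects."""
    id2label = dict(enumerate(labels))
    label2id = {label: i for i, label in id2label.items()}
    return id2label, label2id


def load_biobert_for_ner(checkpoint="dmis-lab/biobert-v1.1", labels=NER_LABELS):
    """Load the tokenizer and the encoder with a freshly initialised NER head."""
    # Imported lazily so label_maps stays usable without the library installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label, label2id = label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=len(labels), id2label=id2label, label2id=label2id
    )
    return tokenizer, model
```

Calling `load_biobert_for_ner()` downloads the weights on first use and returns a model whose classification layer is randomly initialised, so it still has to be trained on labeled data before its predictions mean anything. Swapping the `checkpoint` argument for a general BERT checkpoint leaves the rest of the code unchanged, which is the drop-in property the article describes.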

For parameter-efficient fine-tuning, BioBERT works with adapters and LoRA, which is helpful when GPU memory is tight or when many task-specific heads need to be maintained. BioBERT is also frequently used as a frozen feature extractor: the contextual embeddings from the final or penultimate layer are fed into downstream classifiers, retrieval systems, or clustering algorithms. This is especially common in production biomedical NLP systems where robustness and predictability matter more than the marginal gain from full fine-tuning.
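When the encoder is used as a frozen feature extractor, per-token vectors are usually collapsed into one vector per text before being passed downstream. The sketch below shows one common choice, mask-aware mean pooling followed by cosine similarity, on toy two-dimensional vectors. Mean pooling is an assumption here (using the `[CLS]` vector is another option), and the function names are hypothetical.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask value 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(column) / len(kept) for column in zip(*kept)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy 2-dimensional "hidden states"; real BioBERT-Base vectors have 768 dimensions.
# The third token is padding and is excluded by the mask.
doc_vector = mean_pool([[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]], [1, 1, 0])
query_vector = [1.0, 0.5]
print(doc_vector)  # [2.0, 1.0]
print(cosine(doc_vector, query_vector))
```

Because the pooled vectors are computed once and cached, this setup keeps inference cost low for retrieval or clustering over large collections.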

The original tokenizer is BERT's WordPiece, so domain-specific terms get fragmented into multiple subwords. Users should be aware that token counts on biomedical text are higher than on general text for the same character length, which interacts with the 512-token sequence limit. For long documents, common workarounds include sliding-window inference, hierarchical models, or switching to long-context biomedical encoders such as Clinical-Longformer or Clinical ModernBERT.
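The sliding-window workaround can be sketched as follows: a long token sequence is cut into overlapping windows that each fit within the 512-token limit, leaving room for `[CLS]` and `[SEP]`. The function name, the default overlap of 128 tokens, and the reservation of exactly two special tokens are illustrative assumptions.

```python
def sliding_windows(token_ids, max_len=512, overlap=128, special_tokens=2):
    """Split a long token sequence into overlapping windows that fit the model."""
    body = max_len - special_tokens  # leave room for [CLS] and [SEP]
    if not 0 <= overlap < body:
        raise ValueError("overlap must be non-negative and smaller than the window body")
    step = body - overlap
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + body])
        if start + body >= len(token_ids):
            break  # this window reached the end of the sequence
        start += step
    return windows


tokens = list(range(1000))  # stand-in for 1,000 WordPiece ids
windows = sliding_windows(tokens)
print([len(w) for w in windows])  # [510, 510, 236]
```

Predictions from overlapping regions then have to be merged, for example by keeping the label from the window where the token sits furthest from an edge; that merging policy is a design choice this sketch leaves open.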

## Limitations

BioBERT has well-known limitations that have shaped subsequent research:

- *Vocabulary mismatch*: because the tokenizer is BERT's general vocabulary, biomedical terms are over-fragmented. PubMedBERT showed that a from-scratch biomedical vocabulary materially helps performance on many tasks.[7]
- *Domain-incremental rather than from-scratch*: the choice to initialize from BERT carries some general-domain bias forward and was later shown to be suboptimal for high-resource domains.[7]
- *Sequence length cap*: 512 tokens is short for many biomedical applications, especially full-text articles and clinical narratives.
- *Encoder-only*: BioBERT cannot generate text. Tasks framed as generation (summarization, question answering with free-form answers, dialog) require a different architecture.
- *Older transformer scale*: 110M to 340M parameters is small by modern standards. GatorTron, BioMedLM, and Med-PaLM operate at billions to hundreds of billions of parameters and substantially outperform BioBERT on harder tasks.
- *English-only*: PubMed and PMC are predominantly English, so BioBERT is poorly suited to non-English or multilingual biomedical text without further adaptation.
- *Static knowledge*: like all pre-trained encoders, BioBERT's knowledge is frozen at training time. New literature requires re-pre-training to be incorporated.
- *Not a clinical-text expert*: BioBERT is trained on biomedical literature, not on clinical notes (which are stylistically very different, with abbreviations and incomplete sentences). Models such as ClinicalBERT, BlueBERT, and GatorTron explicitly target the clinical setting and tend to perform better there.[5]
- *No safety guarantees for clinical use*: BioBERT outputs should not be used for direct patient care without expert validation, regulatory clearance, and appropriate monitoring.

## Influence and adoption

BioBERT has been one of the most cited papers in biomedical NLP, with citation counts on Google Scholar exceeding 8,000 by 2024. Its influence shows up in three forms.

First, it established the *domain-specific BERT* paradigm. After BioBERT, many well-resourced technical domains saw an analogous model (LegalBERT, FinBERT, ScholarBERT, MathBERT, JuriBERT, and more). The recipe of "take BERT, continue pre-training on your domain's text" became standard practice, even after PubMedBERT showed the from-scratch alternative often does better.[7]

Second, it became the default baseline. Almost every biomedical NLP paper published since 2019 reports BioBERT numbers as a comparison point, including the papers that propose alternative architectures. The Microsoft BLURB benchmark, introduced with PubMedBERT in 2021, formalized this by including BioBERT alongside several other models in a standardized leaderboard for biomedical NLP.[7]

Third, it remains in active production use. Despite the rise of larger and generative models, BioBERT (and its close derivatives) is widely deployed in biomedical search, semantic indexing, clinical NLP pipelines, and systematic-review tooling. Its small size, predictable behavior, and strong performance on classification and tagging make it a reliable workhorse.

## Recent context (2024-2026)

The biomedical NLP landscape has shifted considerably since BioBERT's release. Generative LLMs, including GPT-4, GPT-4o, Claude 3.5 and 3.7 Sonnet, Gemini 1.5 and 2.0, and the dedicated medical models Med-PaLM 2 and Med-Gemini, have become competitive or superior on many biomedical tasks, especially question answering, summarization, and reasoning. Decoder-only biomedical models such as BioGPT, BioMedLM, PMC-LLaMA, and Meditron can generate text, and the larger ones support in-context learning, neither of which BioBERT offers.

At the same time, BioBERT is still widely used in 2024 and 2025 for several reasons. Supervised fine-tuning of a small encoder remains cheaper and more reliable than zero-shot prompting of a frontier LLM for high-volume tasks like NER over millions of PubMed abstracts. Encoder embeddings remain better suited to dense retrieval than autoregressive models. Regulatory and reproducibility constraints in clinical settings often favor smaller open models with stable behavior over closed frontier APIs. And many existing biomedical NLP systems, especially in pharmaceutical companies and clinical informatics groups, are built on BioBERT or close relatives and would not be replaced lightly.

The broader trajectory is clearly toward larger generative biomedical models trained on combinations of PubMed, PMC, clinical notes, medical textbooks, and curated reasoning data. BioBERT's specific architecture is increasingly historical, but the principle it established (that domain-specific pre-training matters in biomedicine, and that PubMed and PMC are the right places to start) continues to shape every new biomedical foundation model.

## See also

- [BERT](/wiki/bert)
- [DistilBERT](/wiki/distilbert)
- [RoBERTa](/wiki/roberta)
- [SciBERT](/wiki/scibert)
- [PubMedQA](/wiki/pubmedqa)
- [Question answering](/wiki/question_answering)
- [Named entity recognition](/wiki/named_entity_recognition)

## References

1. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). [BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://academic.oup.com/bioinformatics/article/36/4/1234/5566506). *Bioinformatics*, 36(4), 1234-1240.
2. Lee, J., et al. (2019). [BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746). arXiv:1901.08746.
3. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). NAACL-HLT.
4. Beltagy, I., Lo, K., & Cohan, A. (2019). [SciBERT: A Pretrained Language Model for Scientific Text](https://aclanthology.org/D19-1371/). EMNLP.
5. Alsentzer, E., et al. (2019). [Publicly Available Clinical BERT Embeddings](https://arxiv.org/abs/1904.03323). Clinical NLP Workshop.
6. Peng, Y., Yan, S., & Lu, Z. (2019). [Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets (BlueBERT)](https://arxiv.org/abs/1906.05474). BioNLP.
7. Gu, Y., et al. (2021). [Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing (PubMedBERT)](https://dl.acm.org/doi/10.1145/3458754). *ACM Transactions on Computing for Healthcare*.
8. Liu, Y., et al. (2019). [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692). arXiv:1907.11692.
9. Liu, F., et al. (2021). [Self-Alignment Pretraining for Biomedical Entity Representations (SapBERT)](https://aclanthology.org/2021.naacl-main.334/). NAACL.
10. Gururangan, S., et al. (2020). [Don't Stop Pretraining: Adapt Language Models to Domains and Tasks](https://aclanthology.org/2020.acl-main.740/). ACL.
11. Shin, H.-C., et al. (2020). [BioMegatron: Larger Biomedical Domain Language Model](https://aclanthology.org/2020.emnlp-main.379/). EMNLP.
12. Yang, X., et al. (2022). [GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records](https://arxiv.org/abs/2203.03540). arXiv:2203.03540 / *npj Digital Medicine*.
13. Luo, R., et al. (2022). [BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining](https://arxiv.org/abs/2210.10341). *Briefings in Bioinformatics*.
14. Bolton, E., et al. (2024). [BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text](https://arxiv.org/abs/2403.18421). arXiv:2403.18421.
15. Wu, C., et al. (2023). [PMC-LLaMA: Towards Building Open-source Language Models for Medicine](https://arxiv.org/abs/2304.14454). arXiv:2304.14454.
16. Singhal, K., et al. (2023). [Large Language Models Encode Clinical Knowledge (Med-PaLM)](https://www.nature.com/articles/s41586-023-06291-2). *Nature*.
17. DMIS Lab (Korea University). [`dmis-lab/biobert` GitHub repository](https://github.com/dmis-lab/biobert).
18. Naver. [`naver/biobert-pretrained` GitHub repository](https://github.com/naver/biobert-pretrained).
19. Hugging Face. [`dmis-lab/biobert-v1.1` model card](https://huggingface.co/dmis-lab/biobert-v1.1).

