BioGPT
Last reviewed
May 1, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 · 3,811 words
BioGPT is a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining, developed by Microsoft Research. Introduced by Renqian Luo and colleagues in October 2022 and published in Briefings in Bioinformatics in November 2022, BioGPT was the first widely adopted decoder-only generative large language model trained specifically on biomedical literature. Earlier biomedical NLP models such as BioBERT and PubMedBERT were encoder-only architectures suited to discriminative tasks like classification and named entity recognition. BioGPT extended the field by demonstrating that a GPT-2 style decoder, trained from scratch on PubMed abstracts, could generate fluent biomedical text and outperform prior systems on relation extraction, biomedical question answering, and document classification benchmarks.
The model is released under the MIT license, with weights distributed through the Hugging Face Hub at microsoft/biogpt and microsoft/BioGPT-Large, and source code at microsoft/BioGPT on GitHub. Two pre-trained checkpoints are available: a 347M-parameter base model and a 1.5B-parameter BioGPT-Large variant. A task-specific checkpoint, BioGPT-Large-PubMedQA, was fine-tuned on the PubMedQA benchmark and reported 81.0% accuracy, surpassing the prior state of the art at the time of release. Despite being smaller than frontier general-purpose models, BioGPT remains a widely cited baseline in biomedical NLP and inspired a generation of follow-up work on generative medical language models including BioMedLM, Meditron, BioMistral, and PMC-LLaMA.
By 2022, the dominant approach to biomedical natural language processing relied on bidirectional encoder models derived from BERT. BioBERT, introduced by Lee et al. in 2020, continued the pre-training of general-domain BERT on PubMed abstracts and PMC full-text articles, improving named entity recognition by 0.62 F1 points, relation extraction by 2.80 F1 points, and question-answering MRR by 12.24 points over the base BERT. PubMedBERT, introduced by Yu Gu and colleagues in 2021, took a different route by pre-training BERT from scratch with a vocabulary built directly from PubMed, demonstrating that domain-specific tokenization gives consistent gains over continued pre-training. Other contemporaneous encoder-only models included SciBERT (Beltagy et al. 2019), trained on 1.14 million scientific papers from Semantic Scholar; BlueBERT (Peng et al. 2019), trained on PubMed plus MIMIC-III clinical notes; and ClinicalBERT (Alsentzer et al. 2019), focused on de-identified clinical text from MIMIC-III.
These encoder-only systems excelled at discriminative classification tasks but could not generate text. Researchers wanting free-form output, such as a fluent description of a gene, a drug-target relation expressed as a natural-language sentence, or an answer to an open-ended biomedical question, had to fall back on general-domain generative systems like the original GPT-2. General-domain generators often handled biomedical entities poorly, producing irrelevant or hallucinated continuations because their tokenizers fragmented technical vocabulary and their training data underrepresented biomedical literature. BioGPT was designed to close this gap by porting the generative pre-training paradigm of the GPT family into the biomedical domain.
BioGPT was developed by a team primarily based at Microsoft Research Asia in Beijing, with collaborators from Microsoft Health Futures and Microsoft Research AI for Science. The author list, in the order it appears on the Briefings in Bioinformatics paper, is:
| Author | Affiliation | Role |
|---|---|---|
| Renqian Luo | Microsoft Research AI4Science | First and corresponding author |
| Liai Sun | Microsoft Research | Co-author |
| Yingce Xia | Microsoft Research AI4Science | Corresponding author |
| Tao Qin | Microsoft Research AI4Science | Corresponding author |
| Sheng Zhang | Microsoft Research, Health Futures | Co-author |
| Hoifung Poon | Microsoft Research, Health Futures | Senior author |
| Tie-Yan Liu | Microsoft Research | Senior author |
The initial preprint went up on arXiv on 19 October 2022 under identifier 2210.10341. The peer-reviewed version was published in volume 23, issue 6 of Briefings in Bioinformatics in November 2022, with DOI 10.1093/bib/bbac409. The accompanying source code was released to the public on GitHub on 9 August 2022 under the microsoft/BioGPT repository, ahead of the formal arXiv posting. By 2025 the paper had accumulated several thousand citations on Google Scholar, making it one of the most cited biomedical language model papers of the 2020s.
BioGPT uses the standard GPT-2 decoder-only Transformer as its backbone. The base model has 24 Transformer layers, a hidden dimension of 1024, 16 attention heads per layer, and a feed-forward inner dimension of 4096. With the in-domain vocabulary it has roughly 347 million trainable parameters, slightly fewer than the 355M of GPT-2 medium because the embedding matrix is smaller. Positional information uses learned absolute embeddings up to a context length of 1024 tokens. Activations use GELU non-linearities and the model is trained with the standard causal language modelling objective: at each step the network predicts the next subword token given all preceding tokens in the sequence, with a softmax over the vocabulary.
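The parameter figures above can be checked with simple arithmetic. The sketch below is a back-of-the-envelope count, not code from the BioGPT repository. It assumes tied input and output embeddings, bias terms on every projection, and two LayerNorms per block plus a final one. The GPT-2 vocabulary size of 50,257 and the function name `decoder_param_count` are supplied here for illustration.

```python
def decoder_param_count(vocab_size, d_model=1024, n_layers=24,
                        d_ff=4096, n_positions=1024):
    """Approximate parameter count of a GPT-2 style decoder.

    Assumes tied input/output embeddings and biases on all projections.
    The number of attention heads does not change the count.
    """
    token_emb = vocab_size * d_model
    pos_emb = n_positions * d_model            # learned absolute positions
    attn = 4 * (d_model * d_model + d_model)   # Q, K, V and output projections
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)            # two LayerNorms, weight + bias
    per_layer = attn + ffn + layer_norms
    final_norm = 2 * d_model
    return token_emb + pos_emb + n_layers * per_layer + final_norm


biogpt_base = decoder_param_count(vocab_size=42_384)
gpt2_medium = decoder_param_count(vocab_size=50_257)

print(f"BioGPT base : {biogpt_base:,}")
print(f"GPT-2 medium: {gpt2_medium:,}")
print(f"difference  : {gpt2_medium - biogpt_base:,}")
```

Under these assumptions the two counts come to 346,761,216 and 354,823,168, which round to the 347M and 355M quoted above. The entire gap of about 8 million parameters comes from the 7,873 fewer rows in the embedding matrix.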
A critical design choice was to train BioGPT from scratch rather than continuing the pre-training of an existing GPT-2 checkpoint. This is in contrast to BioBERT, which continued the pre-training of general-domain BERT, and aligns with the from-scratch approach taken by PubMedBERT. The team argued that initialising from a general-domain checkpoint would lock in a tokenizer and parameter prior that was poorly suited to biomedical text. By starting from random weights with a custom vocabulary, BioGPT could specialise more aggressively on biomedical surface forms.
BioGPT learns a byte-pair encoding (BPE) vocabulary directly from the biomedical pre-training corpus using the fastBPE library. The resulting vocabulary contains 42,384 subword tokens, with biomedical terms like naloxone, acetyltransferase, glioblastoma, and carcinoma often represented as single tokens rather than fragmented sequences of two or three subwords. Input text is first segmented into words using the Moses tokenizer, then split into subwords by fastBPE, then mapped to integer ids. This pipeline matches the practice in fairseq, the framework used to train the model.
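The three-stage pipeline (word segmentation, BPE merging, id lookup) can be illustrated with a toy example. This is a minimal sketch only: the regular expression is a crude stand-in for the Moses tokenizer, the four merge rules and five-entry vocabulary are invented for illustration, and the `</w>` end-of-word marker is a common BPE convention rather than the exact on-disk format used by fastBPE.

```python
import re

# Hypothetical merge table: lower rank = learned earlier = applied first.
TOY_MERGES = {("t", "u"): 0, ("o", "r</w>"): 1,
              ("tu", "m"): 2, ("tum", "or</w>"): 3}
# Hypothetical subword vocabulary.
TOY_VOCAB = {"<unk>": 0, "tumor</w>": 1, "tu": 2, "t": 3, "or</w>": 4}


def word_split(text):
    """Stand-in for Moses: split on whitespace and detach punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)


def apply_bpe(word, merge_ranks):
    """Greedily apply the highest-priority merge until none applies."""
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no learned merge matches any adjacent pair
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def encode(text, merge_ranks=TOY_MERGES, vocab=TOY_VOCAB, unk_id=0):
    """Words -> subwords -> integer ids."""
    return [vocab.get(sym, unk_id)
            for word in word_split(text)
            for sym in apply_bpe(word, merge_ranks)]


print(apply_bpe("tumor", TOY_MERGES))
print(apply_bpe("tutor", TOY_MERGES))
print(encode("tumor tutor"))
```

With this toy table, "tumor" collapses into a single token because every merge it needs was learned, while "tutor" stays fragmented into three pieces. The same effect, at the scale of a 42,384-entry vocabulary learned from PubMed, is what keeps common biomedical terms whole.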
BioGPT-Large is a scaled-up version with the GPT-2 XL architecture: 48 Transformer layers, hidden dimension 1600, 25 attention heads, and roughly 1.5 billion parameters. Its vocabulary is the same 42,384-token biomedical BPE used by the base model, and it is trained on the same PubMed abstract corpus. BioGPT-Large is the most widely used variant for downstream applications because the additional capacity translates into noticeably better performance on PubMedQA and on free-form text generation. The Hugging Face checkpoint microsoft/BioGPT-Large exposes the same BioGptForCausalLM interface as the base model.
BioGPT is pre-trained on roughly 15 million PubMed records collected before 2021. Each record consists of a paper title and an abstract, concatenated into a single document for the language modelling objective. Records lacking abstracts are filtered out. The corpus uses no full-text articles, no clinical notes, and no general-domain web data; this is a strict design choice intended to produce a model that mirrors the linguistic style and vocabulary of the published biomedical literature. Compared to BioBERT, which was pre-trained on a similar PubMed-based corpus with a BERT-style objective, BioGPT has a recency advantage from training on documents up to 2020 and a context advantage from seeing each title together with its abstract.
The base model was trained on eight NVIDIA V100 GPUs for 200,000 optimisation steps. The effective batch size is 524,288 tokens, achieved through 1024 tokens per GPU multiplied by 8 GPUs and 64 gradient accumulation steps. The optimiser is Adam with a peak learning rate of 2 × 10⁻⁴, an inverse square-root decay schedule, and 20,000 warmup steps. Training uses fairseq 0.12.0 on PyTorch 1.12.0 in mixed precision. The total compute budget for the base model is modest by 2022 standards, reflecting the academic-style scope of the project, and the public release on Hugging Face means anyone with a single high-memory consumer GPU can run inference.
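The batch-size arithmetic and the learning-rate schedule described above can be written out as a short sketch. The constants are taken from the paragraph; the schedule is a generic inverse square-root schedule with linear warmup from zero, which is an assumption about the warmup starting point rather than a setting confirmed by the source. The function names are illustrative.

```python
import math

TOKENS_PER_GPU = 1024
NUM_GPUS = 8
GRAD_ACCUM_STEPS = 64
PEAK_LR = 2e-4
WARMUP_STEPS = 20_000
TOTAL_STEPS = 200_000


def tokens_per_update():
    """Effective batch size in tokens for one optimiser step."""
    return TOKENS_PER_GPU * NUM_GPUS * GRAD_ACCUM_STEPS


def inverse_sqrt_lr(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step).

    Assumes warmup starts from a learning rate of zero.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)


print(f"tokens per update: {tokens_per_update():,}")
print(f"upper bound on tokens seen: {tokens_per_update() * TOTAL_STEPS:,}")
for step in (10_000, 20_000, 80_000, TOTAL_STEPS):
    print(f"step {step:>7,}: lr = {inverse_sqrt_lr(step):.2e}")
```

The product 1024 × 8 × 64 reproduces the 524,288-token batch. Multiplying by 200,000 steps gives an upper bound of roughly 105 billion tokens processed; this figure is derived here from the stated settings and is not reported in the source.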
| Variant | Parameters | Hidden dim | Layers | Heads | Notes |
|---|---|---|---|---|---|
| BioGPT (base) | 347M | 1024 | 24 | 16 | GPT-2 medium scale, in-domain BPE |
| BioGPT-Large | 1.5B | 1600 | 48 | 25 | GPT-2 XL scale |
| BioGPT-Large-PubMedQA | 1.5B | 1600 | 48 | 25 | Fine-tuned on PubMedQA, 81.0% accuracy |
Fine-tuned task-specific checkpoints for BC5CDR, KD-DTI, DDI, and HoC were also released in the original repository, packaged as fairseq checkpoints rather than Hugging Face shards.
The BioGPT paper evaluates the model on six biomedical NLP tasks: five standard benchmarks spanning end-to-end relation extraction, question answering, and document classification, plus free-form text generation. A consistent pattern in the paper is that BioGPT casts every downstream task as a text generation problem, predicting a target string conditioned on a prompt rather than emitting class probabilities. This re-formulation is what makes a generative architecture competitive on tasks that historically belonged to BERT-style classifiers.
| Task | Type | Metric | BioGPT score | Prior best |
|---|---|---|---|---|
| BC5CDR | End-to-end chemical-disease relation extraction | F1 | 44.98% | REBEL: 36.70% |
| KD-DTI | End-to-end drug-target interaction extraction | F1 | 38.42% | REBEL_pt: 33.32% |
| DDI | End-to-end drug-drug interaction extraction | F1 | 40.76% | REBEL_pt: 40.56% |
| PubMedQA | Yes / no / maybe biomedical QA | Accuracy | 78.2% (BioGPT) / 81.0% (BioGPT-Large) | BioLinkBERT-Large: 72.2% |
| HoC | Hallmarks of Cancer document classification | F1 | 85.12% | PubMedBERT: 82.32% |
On the relation extraction benchmarks the model uses prefix-tuning style soft prompts, with continuous embeddings inserted between source and target sequences. The tuning approach converts categorical labels into natural-language fragments such as "the relation between X and Y is" before generation, which lets the model leverage its language modelling ability rather than learning a separate classification head from scratch. PubMedQA is treated as a yes / no / maybe answer-token generation problem given the abstract and the long-form question, and HoC document classification is reformulated as multi-label generation of cancer hallmark phrases.
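The label-verbalisation idea can be sketched as a pair of functions that turn a structured triple into a target sentence and parse a generated sentence back into a triple. The template below is modelled on the example fragment quoted above; the exact wording used for each dataset in the paper may differ, the soft-prompt embeddings are not modelled, and the names `Relation`, `to_target`, and `parse_target` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Relation(NamedTuple):
    head: str
    tail: str
    label: str


# Assumed template, based on the fragment "the relation between X and Y is".
_TEMPLATE = re.compile(r"the relation between (.+?) and (.+?) is (.+)")


def to_target(rel: Relation) -> str:
    """Verbalise a triple as the sentence the decoder is trained to emit."""
    return f"the relation between {rel.head} and {rel.tail} is {rel.label}"


def parse_target(text: str) -> Optional[Relation]:
    """Recover a triple from generated text; None if it is off-template."""
    match = _TEMPLATE.fullmatch(text.strip().rstrip("."))
    if match is None:
        return None
    return Relation(*(part.strip() for part in match.groups()))


example = Relation("aspirin", "warfarin", "increased bleeding risk")
sentence = to_target(example)
print(sentence)
print(parse_target(sentence + "."))
print(parse_target("aspirin is a drug"))
```

Because the output is free text, a parser of this kind has to tolerate malformed generations. Returning `None` for anything that does not match the template lets an evaluation script count such outputs as misses instead of crashing. Entity names that themselves contain " and " or " is " would need a less naive template.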
The 78.2% PubMedQA accuracy reported in the original paper, and the 81.0% accuracy of the BioGPT-Large-PubMedQA fine-tune released later, were both state-of-the-art at the time. Subsequent larger generative models such as BioMedLM and Meditron-70B have since pushed the PubMedQA frontier higher, but BioGPT remains a common baseline in benchmark comparisons.
A qualitative section of the paper compares BioGPT generation to general-domain GPT-2 generation given identical biomedical prompts. For prompts naming specific genes, drugs, or diseases, GPT-2 frequently produces irrelevant continuations, repeats the prompt, or invents non-medical context, while BioGPT produces fluent paragraphs that resemble the abstract sections of PubMed papers. The paper demonstrates this with prompts such as "COVID-19 is", for which BioGPT produces specific, professionally toned descriptions referring to SARS-CoV-2, transmission routes, and clinical symptoms. This generation quality is the main behavioural difference between BioGPT and earlier encoder-only biomedical models.
BioGPT sits within a family of language models tailored for biomedicine that grew rapidly between 2019 and 2024. The table below summarises the most notable members in roughly chronological order.
| Model | Year | Architecture | Size | Training data | Key paper |
|---|---|---|---|---|---|
| BioBERT | 2020 | Encoder (BERT) | 110M | PubMed abstracts + PMC full text, continued from BERT | Lee et al., Bioinformatics |
| SciBERT | 2019 | Encoder (BERT) | 110M | 1.14M Semantic Scholar papers, from scratch with SciVocab | Beltagy et al., EMNLP |
| BlueBERT | 2019 | Encoder (BERT) | 110M / 340M | PubMed + MIMIC-III, continued from BERT | Peng et al., BioNLP |
| ClinicalBERT | 2019 | Encoder (BERT) | 110M | MIMIC-III clinical notes, continued from BioBERT | Alsentzer et al., Clinical NLP |
| PubMedBERT | 2021 | Encoder (BERT) | 110M | PubMed abstracts, from scratch with biomedical vocab | Gu et al., ACM Trans. Computing for Healthcare |
| BioGPT | 2022 | Decoder (GPT-2) | 347M / 1.5B | 15M PubMed abstracts, from scratch | Luo et al., Briefings in Bioinformatics |
| BioMedLM (PubMedGPT) | 2022 | Decoder (GPT-2) | 2.7B | PubMed abstracts + papers from The Pile | Stanford CRFM and MosaicML |
| GatorTron | 2022 | Encoder (BERT-style) | 345M / 3.9B / 8.9B | >90B words clinical notes from UF Health, PubMed, Wikipedia | Yang et al., npj Digital Medicine |
| Med-PaLM | 2022 | Decoder (PaLM) | 540B | PaLM with instruction tuning on medical data | Singhal et al., Nature |
| Med-PaLM 2 | 2023 | Decoder (PaLM 2) | undisclosed | PaLM 2 with medical fine-tuning and ensemble refinement | Singhal et al., Nature Medicine |
| PMC-LLaMA | 2023 | Decoder (LLaMA) | 7B / 13B | LLaMA fine-tuned on 4.8M PubMed Central papers + 30K textbooks | Wu et al., JAMIA |
| Meditron | 2023 | Decoder (LLaMA-2) | 7B / 70B | LLaMA-2 with 48.1B-token GAP-Replay medical corpus | Chen et al., EPFL |
| BioMistral | 2024 | Decoder (Mistral) | 7B | Mistral fine-tuned on PubMed Central | Labrak et al., ACL Findings |
A few patterns are visible in this lineage. The encoder-only family from 2019 to 2021 was dominated by BERT-derived models that matured rapidly, with each generation refining domain-specific tokenization and pre-training data. The shift to decoder-only generative models in 2022 was driven simultaneously by BioGPT in the abstract-only setting and by BioMedLM in the abstract-plus-full-text setting; both used a GPT-2 style backbone but differed in parameter count. By 2023 the field moved towards adapting open general-purpose models such as LLaMA, LLaMA-2, and Mistral via continued pre-training on biomedical corpora, which proved a more compute-efficient route to large medical models than training from scratch. Frontier closed models like Med-PaLM 2 followed a separate trajectory inside Google Research, leveraging much larger base models and instruction tuning to reach human-expert-level USMLE performance.
BioGPT has been applied to a range of biomedical informatics problems where a fluent generative model with PubMed-style language is useful.
Because BioGPT performs strongly on the KD-DTI and DDI benchmarks, it has been used to scan large literature corpora for candidate drug-target interactions and drug-drug interactions, presenting hits as natural-language relation triples that can be reviewed by domain experts. A 2023 paper in PMC used a fine-tuned BioGPT variant for age-related disease target discovery, demonstrating the practical viability of generative biomedical models for hypothesis generation in pharmaceutical R&D pipelines.
BioGPT-Large can produce coherent multi-sentence summaries of PubMed abstracts when prompted appropriately, and several open-source projects have wrapped it as a literature-review assistant. Because the pre-training data ends in 2020, the model is most useful as a baseline in summarisation pipelines where retrieval over up-to-date corpora supplies the actual content and the language model only handles synthesis.
The BioGPT-Large-PubMedQA checkpoint is the canonical option for PubMedQA-style biomedical question answering. It accepts a question together with an associated abstract as context and generates a categorical yes / no / maybe answer as text. Open-source biomedical QA stacks such as BioASQ baselines often include BioGPT for benchmarking.
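Since the answer arrives as generated text and not as a class index, a thin wrapper is needed on both sides of the model. The sketch below shows one way to do that. The prompt layout (`question: ... context: ...`) is an assumption made for illustration and should be checked against the fine-tuning data format in the repository; the helper names are hypothetical, and no model is called.

```python
import re
from typing import Optional

VALID_ANSWERS = ("yes", "no", "maybe")


def build_prompt(question: str, abstract: str) -> str:
    """Join question and abstract into one source string (assumed layout)."""
    return f"question: {question.strip()} context: {abstract.strip()}"


def extract_answer(generated: str) -> Optional[str]:
    """Return the last yes/no/maybe word in the generation, if any."""
    words = re.findall(r"[a-z]+", generated.lower())
    for word in reversed(words):
        if word in VALID_ANSWERS:
            return word
    return None


prompt = build_prompt(
    "Does drug X reduce symptom Y?",
    "We conducted a randomised trial of drug X in 120 patients ...",
)
print(prompt)
print(extract_answer("the answer to the question given the context is yes."))
print(extract_answer("the context describes a randomised trial"))
```

Scanning from the end of the generation is a deliberate choice in this sketch: the answer token is expected to close the sentence, and earlier occurrences of "no" or "yes" inside echoed context should not be mistaken for the prediction.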
Researchers use BioGPT for entity-relation extraction, named entity recognition reformulated as generation, and structured triple extraction from unstructured biomedical text. The prefix-tuning approach used in the original paper transfers cleanly to other relation schemas, and the small parameter count compared to frontier models makes large-scale extraction over millions of documents economically feasible on commodity GPUs.
A niche but interesting use is prompted hypothesis generation, where a researcher provides a biomedical premise such as "Genes associated with Alzheimer's disease that interact with APOE include" and asks the model to continue. The output is not authoritative, but the suggestions can serve as a starting point for literature review or experimental planning, particularly when grounded by retrieval against PubMed.
BioGPT's strengths come with several genuine limitations that users should weigh before deploying it in production.
Its scale is small by 2024 standards. At 1.5 billion parameters in the largest variant, BioGPT is about 360 times smaller than the 540B-parameter Med-PaLM and very likely orders of magnitude smaller than frontier models such as GPT-4 or Med-PaLM 2, whose sizes are undisclosed. It also lacks the broad world knowledge, multilingual coverage, and instruction-following ability that those models gained from much larger and more diverse training sets.
Its knowledge cutoff is 2020. The pre-training corpus excludes papers published after 2020, so the model has no knowledge of COVID-19 vaccine development beyond what had been published by the end of 2020, of later mRNA platform refinements, or of newer drug approvals. Retrieval-augmented generation can mitigate this, but the model itself is frozen at a 2020 view of biomedicine.
It is English-only. The PubMed corpus used for pre-training is overwhelmingly English-language, and the model cannot read or generate clinical text in other languages without significant additional training.
It is trained on abstracts, not full text. Compared to PMC-LLaMA, BioMedLM, or Meditron, BioGPT has not seen the methods, results, or discussion sections of biomedical papers, only their titles and abstracts. This limits the depth of its knowledge of specific experimental protocols and statistical results.
Hallucination remains a serious concern in the biomedical domain. BioGPT can produce plausible-sounding but incorrect medical claims, including invented drug-target interactions and fabricated citations. The model card and the Microsoft Research repository explicitly warn against using BioGPT in clinical decision making without human review.
It has been surpassed on most benchmarks by larger generative biomedical models, including Meditron-70B, Med-PaLM 2, and BioMistral, and on broad medical question-answering tasks by general-purpose frontier LLMs like GPT-4 and Claude. BioGPT remains relevant primarily as a small, accessible, open-weight baseline rather than as a state-of-the-art model.
BioGPT is released under the MIT license. The weights for microsoft/biogpt, microsoft/BioGPT-Large, and microsoft/BioGPT-Large-PubMedQA are hosted on the Hugging Face Hub and integrated into the transformers library through the BioGptForCausalLM and BioGptTokenizer classes, alongside the standard pipeline interfaces for text generation and feature extraction. Source code, fairseq configurations, and the original task-specific fine-tuned checkpoints are available at github.com/microsoft/BioGPT. The repository pins to PyTorch 1.12.0, fairseq 0.12.0, and Python 3.10, although the Hugging Face port works with current versions of transformers and any modern PyTorch release.
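A minimal sketch of loading the Hugging Face port for text generation follows. It uses the `BioGptTokenizer`, `BioGptForCausalLM`, `pipeline`, and `set_seed` names from the transformers library; the wrapper function `generate_continuations`, its default arguments, and the choice of sampling settings are illustrative assumptions. Imports are placed inside the function so that the module can be loaded on a machine without transformers installed. Calling the function downloads the checkpoint from the Hub.

```python
def generate_continuations(prompt, model_name="microsoft/biogpt",
                           max_length=40, num_return_sequences=3, seed=42):
    """Sample a few continuations of `prompt` from a BioGPT checkpoint."""
    # Lazy import: transformers (and its tokenizer dependencies) is only
    # needed when the function is actually called.
    from transformers import (BioGptForCausalLM, BioGptTokenizer,
                              pipeline, set_seed)

    tokenizer = BioGptTokenizer.from_pretrained(model_name)
    model = BioGptForCausalLM.from_pretrained(model_name)
    generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

    set_seed(seed)  # make the sampled outputs repeatable
    outputs = generator(prompt, max_length=max_length,
                        num_return_sequences=num_return_sequences,
                        do_sample=True)
    return [item["generated_text"] for item in outputs]
```

A call such as `generate_continuations("COVID-19 is")` returns a list of sampled strings. The BioGPT tokenizer depends on the `sacremoses` package for its Moses word segmentation step, so that package needs to be installed alongside transformers and PyTorch. Passing `model_name="microsoft/BioGPT-Large"` selects the larger checkpoint at a correspondingly higher memory cost.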
Monthly download statistics on Hugging Face show sustained usage in the hundreds of thousands of downloads per month for the base model and several thousand per month for the BioGPT-Large-PubMedQA fine-tune, with dozens of community fine-tunes and adapter modules listed as derived from the official checkpoints.
BioGPT has been cited several thousand times since its 2022 publication, making it among the most influential biomedical language model papers of its era. Its main intellectual contribution was the demonstration that decoder-only generative pre-training, the architecture popularised by the GPT family, was a viable and competitive substitute for encoder-only pre-training in the biomedical domain. Before BioGPT, the conventional wisdom favoured BERT-style models for biomedical NLP because they were already known to handle question answering, classification, and named entity recognition well. BioGPT showed that the same tasks could be reformulated as text generation and solved as well or better with a decoder, while opening up entirely new capabilities such as fluent free-form biomedical text production.
The paper directly inspired follow-up generative biomedical work including Stanford CRFM and MosaicML's BioMedLM, EPFL's Meditron suite, the BioMistral collection from Avignon Université and Nantes Université, and PMC-LLaMA from Shanghai Jiao Tong University. It also helped legitimise the use of generative models in clinical informatics conferences and journals where the prior dominance of BERT had made decoder-only architectures uncommon.
By 2024 and into 2025, biomedical LLM research had largely shifted to fine-tuning open general-purpose foundation models like LLaMA, Mistral, and Phi on biomedical corpora rather than training from scratch. BioGPT remains a frequent baseline in benchmark comparisons because it is small, fast, well documented, MIT-licensed, and runs comfortably on a single consumer GPU. For specific generation tasks where a domain-specific generative model is appropriate and where 1.5B parameters are sufficient, BioGPT is still a reasonable default. For broad medical reasoning, general-purpose frontier LLMs and dedicated medical fine-tunes of larger open models tend to outperform it.