Bert-base-uncased model
| Field | Value |
|---|---|
| Hub repository | google-bert/bert-base-uncased (Hugging Face) |
| Name | bert-base-uncased |
| Task | Fill-Mask |
| Library | PyTorch, TensorFlow, JAX, Rust, Core ML, Safetensors, Transformers |
| Dataset | BookCorpus, Wikipedia |
| Language | English |
| Paper | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arXiv:1810.04805) |
| License | apache-2.0 |
| Related to | bert, exbert, AutoTrain Compatible |
Bert-base-uncased is the original lowercased English checkpoint of BERT, released in October 2018 by Google AI Language. It is one of the four English checkpoints published with the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, alongside bert-base-cased, bert-large-uncased, and bert-large-cased. The checkpoint contains roughly 110 million parameters and was pretrained on the combination of BookCorpus and English Wikipedia using two self-supervised objectives: masked language modeling and next sentence prediction. On the Hugging Face Hub it is hosted under the canonical name google-bert/bert-base-uncased, with the older alias bert-base-uncased still resolving to the same files. It is licensed under Apache 2.0 and, as of 2026, is downloaded around 60 million times per month, which makes it one of the most used models on the platform.
The "uncased" qualifier signals that the input text is lowercased and accent markers are stripped during tokenization. This is the default variant recommended for general English work where capitalization carries little signal, for example sentiment analysis, topic classification, or many question answering setups. The cased sibling, bert-base-cased, preserves capitalization and is preferred for tasks where casing matters, such as named entity recognition. Both variants share the same architecture and training procedure; only the tokenizer and pretraining corpus preprocessing differ.
Before BERT, the dominant approach to building NLP models combined static word vectors such as word2vec or GloVe with task-specific recurrent or convolutional networks. Contextual embeddings from ELMo and unidirectional language models like OpenAI GPT had begun to push the field toward transfer learning, but a deeply bidirectional pretrained encoder did not yet exist. The BERT paper changed that by adapting the transformer encoder of Vaswani et al. (2017) and proposing a masked-token objective that lets the model attend to both left and right context in every layer.
The paper was first posted to arXiv on October 11, 2018, with a revised version on May 24, 2019. It was presented at NAACL-HLT 2019 in Minneapolis, where it won the Best Long Paper award. The pretrained weights were released a few weeks after the initial preprint on the original google-research/bert repository on GitHub, including bert-base-uncased and a multilingual variant. Soon after, the same checkpoints were ported to PyTorch by Hugging Face, which is the form most practitioners use today.
Bert-base-uncased is a stack of transformer encoder blocks with no decoder. Each block contains a multi-head self-attention layer followed by a position-wise feed-forward network, with residual connections and layer normalization around each sub-layer. The exact configuration matches the "BERT-base" specification from the paper.
| Property | Value |
|---|---|
| Encoder layers | 12 |
| Hidden size | 768 |
| Self-attention heads | 12 |
| Feed-forward inner size | 3,072 (4 x hidden size) |
| Total parameters | ~110 million |
| Vocabulary size | 30,522 WordPiece tokens |
| Maximum sequence length | 512 tokens |
| Position encodings | Learned absolute, length 512 |
| Activation | GELU |
| Casing | Lowercased, accents stripped |
| Tokenizer | BasicTokenizer + WordPiece |
The BERT-base size was chosen to match the parameter count of OpenAI GPT so the two could be compared fairly. Inputs are constructed as [CLS] sentence A [SEP] sentence B [SEP], where the special [CLS] token's final hidden state is used as the aggregate representation for classification. Each token's input embedding is the sum of a WordPiece embedding, a learned segment embedding (sentence A or sentence B), and a learned position embedding.
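These values can be checked against the configuration shipped with the checkpoint. The sketch below assumes the Transformers library and the Hub name used throughout this page; the values in the comments are the ones listed in the table above.

```python
from transformers import AutoConfig

# Load the published configuration and compare it with the architecture table.
config = AutoConfig.from_pretrained("google-bert/bert-base-uncased")
print(config.num_hidden_layers)        # 12 encoder layers
print(config.hidden_size)              # 768
print(config.num_attention_heads)      # 12
print(config.intermediate_size)        # 3072 feed-forward inner size
print(config.vocab_size)               # 30522 WordPiece tokens
print(config.max_position_embeddings)  # 512 learned absolute positions
```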
The uncased tokenizer first runs a BasicTokenizer that lowercases the text, normalizes Unicode to NFD form, strips combining diacritical marks, and splits on whitespace and punctuation. The resulting words are then broken into subword units using the WordPiece algorithm with a fixed vocabulary of 30,522 tokens. Subword pieces inside a word are prefixed with ##, so the word "playing" becomes the two tokens play and ##ing. Out-of-vocabulary characters fall back to [UNK]. The vocabulary also reserves five special tokens: [PAD], [UNK], [CLS], [SEP], and [MASK].
Because case information is discarded, the tokenizer maps "Apple" and "apple" to the same token. This reduces vocabulary fragmentation and helps on small fine-tuning datasets, but it is the wrong choice when capitalization is an informative feature, for example distinguishing the company "Apple" from the fruit. In those situations the cased variant or a later model is a better fit.
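As a quick, hedged illustration of this behavior (the exact subword splits depend on the shipped 30,522-token vocabulary, and the example sentences are arbitrary), the tokenizer can be inspected directly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# Casing is discarded: both forms map to the same token sequence.
print(tokenizer.tokenize("Apple"))
print(tokenizer.tokenize("apple"))

# A rare word falls back to ##-prefixed WordPiece pieces.
print(tokenizer.tokenize("unaffable"))

# Sentence pairs are packed as [CLS] A [SEP] B [SEP].
enc = tokenizer("The cat sat.", "It was tired.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
```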
Bert-base-uncased was pretrained on a combined corpus of unlabeled English text totaling roughly 3.3 billion words.
| Corpus | Approx. words | Notes |
|---|---|---|
| BookCorpus | 800 million | 11,038 unpublished books across many genres |
| English Wikipedia | 2,500 million | Text passages only, excluding lists, tables, and headers |
The authors used Wikipedia text passages rather than the full dump because tabular and list-formatted content is poorly suited to a sentence-pair objective. The corpus was sentence-segmented and shuffled so that long contiguous spans of text from a single source did not dominate any one batch.
Bert-base-uncased was pretrained on two unsupervised tasks simultaneously, with the total loss equal to the sum of the two task losses.
The first is masked language modeling. For each input sequence, 15% of the WordPiece tokens are selected at random. Of those selected tokens, 80% are replaced with the special [MASK] token, 10% are replaced with a random token from the vocabulary, and 10% are kept unchanged. The model then has to predict the original token at every selected position from a softmax over the full vocabulary. The 80/10/10 split exists because [MASK] never appears at fine-tuning time; mixing in random and unchanged tokens forces the model to keep a usable distributional representation for every position rather than only for masked positions.
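The corruption rule can be sketched in a few lines of PyTorch. The function below is a minimal illustration, close in spirit to the logic of Transformers' DataCollatorForLanguageModeling rather than the original TensorFlow preprocessing; it assumes `input_ids` is a batch of token id tensors produced by the BERT tokenizer.

```python
import torch

def mask_tokens(input_ids, tokenizer, mlm_probability=0.15):
    """Apply BERT's 15% selection and 80/10/10 corruption rule (illustrative sketch)."""
    labels = input_ids.clone()

    # Select 15% of non-special positions as prediction targets.
    probability_matrix = torch.full(labels.shape, mlm_probability)
    special = torch.tensor(
        [tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
         for ids in labels.tolist()],
        dtype=torch.bool,
    )
    probability_matrix.masked_fill_(special, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # positions not selected are ignored by the loss

    # 80% of selected positions are replaced with [MASK].
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[replaced] = tokenizer.mask_token_id

    # 10% are replaced with a random vocabulary token; the remaining 10% stay unchanged.
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~replaced
    random_tokens = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    input_ids[randomized] = random_tokens[randomized]

    return input_ids, labels
```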
The second is next sentence prediction. For each example, two sentences A and B are sampled. Half the time, B is the actual next sentence in the source document; the other half it is a random sentence from the corpus. The model uses the final hidden state of the [CLS] token to make a binary IsNext or NotNext prediction. The goal was to teach the model to model relations between sentence pairs, which matters for tasks like natural language inference and question answering. Later work, particularly RoBERTa, showed that NSP contributes little once the model is trained for long enough on long sequences, but it remained part of the original recipe.
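The released checkpoint still carries the pretrained NSP head, so the mechanism can be probed through Transformers' BertForNextSentencePrediction; the sentence pair below is an arbitrary illustration.

```python
import torch
from transformers import AutoTokenizer, BertForNextSentencePrediction

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("google-bert/bert-base-uncased")

# Label 0 means "sentence B follows sentence A"; label 1 means B is a random sentence.
enc = tokenizer("He opened the fridge.", "It was almost empty.", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
print(logits.softmax(dim=-1))  # probabilities for [IsNext, NotNext]
```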
Google trained bert-base-uncased on 4 Cloud TPUs in a Pod configuration, totaling 16 TPU chips, for approximately 4 days at an estimated public-pricing cost of around 500 US dollars in 2018. The schedule used 1 million optimizer steps with a batch size of 256 sequences. The first 90% of steps used a maximum sequence length of 128 tokens to speed up training, and the remaining 10% used the full 512-token length to learn long-range position embeddings.
| Hyperparameter | Value |
|---|---|
| Optimizer | Adam |
| Learning rate | 1e-4 |
| Beta1, Beta2 | 0.9, 0.999 |
| Weight decay | 0.01 |
| Warmup steps | 10,000 |
| LR schedule | Linear warmup, then linear decay |
| Dropout | 0.1 on all layers |
| Batch size | 256 sequences |
| Total steps | 1,000,000 |
| Sequence length | 128 for 90% of steps, 512 for 10% |
Gradient checkpointing and mixed precision were not used in the original run; both are common today when reproducing the training on smaller hardware budgets.
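A hedged sketch of how those two options are typically switched on with the Transformers Trainer when reproducing the run on a single GPU; the output directory and the batch-size split are illustrative assumptions, not values from the original recipe.

```python
from transformers import AutoModelForMaskedLM, TrainingArguments

model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-uncased")
model.gradient_checkpointing_enable()  # recompute activations to save memory

args = TrainingArguments(
    output_dir="bert-mlm-repro",       # hypothetical output path
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,     # 32 x 8 = 256-sequence effective batch
    learning_rate=1e-4,
    warmup_steps=10_000,
    weight_decay=0.01,
    fp16=True,                         # mixed precision
)
```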
The original paper reports that bert-base-uncased reached an average score of 79.6 across the GLUE benchmark test sets, as scored by the GLUE evaluation server, well above the previous state of the art at the time. The Hugging Face model card documents the same per-task numbers.
| GLUE task | Metric | bert-base-uncased |
|---|---|---|
| MNLI-m | accuracy | 84.6 |
| MNLI-mm | accuracy | 83.4 |
| QQP | F1 | 71.2 |
| QNLI | accuracy | 90.5 |
| SST-2 | accuracy | 93.5 |
| CoLA | Matthews | 52.1 |
| STS-B | Spearman | 85.8 |
| MRPC | F1 | 88.9 |
| RTE | accuracy | 66.4 |
| Average | | 79.6 |
On extractive question answering, the BERT-base configuration reached 88.5 F1 on the SQuAD v1.1 dev set with a single model; the paper reports SQuAD v2.0 results only for the large configuration. BERT-large pushes these numbers higher (90.9 F1 on the SQuAD v1.1 dev set for a single model and 83.1 F1 on the SQuAD v2.0 test set), but the base model remains a strong baseline for a wide range of sentence-level and span-level tasks.
The original BERT release shipped four English checkpoints. Bert-base-uncased sits at the smaller, lowercased corner of the matrix.
| Model | Layers | Hidden | Heads | Parameters | Casing |
|---|---|---|---|---|---|
| bert-base-uncased | 12 | 768 | 12 | ~110M | lowercased |
| bert-base-cased | 12 | 768 | 12 | ~110M | preserved |
| bert-large-uncased | 24 | 1024 | 16 | ~340M | lowercased |
| bert-large-cased | 24 | 1024 | 16 | ~340M | preserved |
Bert-large variants raise the GLUE average by several points but are roughly three times larger and noticeably slower at fine-tuning and inference time. Cased variants help when capitalization carries information, particularly for named entity recognition. Multilingual variants such as bert-base-multilingual-cased extend the same architecture to 104 languages with a single shared WordPiece vocabulary of roughly 120,000 tokens.
The Hugging Face Transformers library exposes the model through both AutoTokenizer and AutoModel interfaces. The recommended quick test on the model card uses the fill-mask pipeline, which exercises the masked language modeling head that the model was actually pretrained for.
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="google-bert/bert-base-uncased")
unmasker("Hello I'm a [MASK] model.")
```
For downstream use, the library provides task-specific heads such as BertForSequenceClassification, BertForTokenClassification, BertForQuestionAnswering, and BertForMultipleChoice. A typical fine-tuning loop loads the pretrained encoder, attaches a small classification head over the [CLS] representation, and trains for 2 to 4 epochs with a learning rate near 2e-5 to 5e-5. The model card explicitly notes that the encoder is intended for fine-tuning on tasks that use the whole sentence (or sentence pair) and that text generation should use a model like GPT-2 instead.
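A minimal fine-tuning sketch along those lines, assuming the datasets library and using GLUE's SST-2 as an illustrative task; the output directory and batch size are arbitrary choices rather than model-card recommendations.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "google-bert/bert-base-uncased", num_labels=2
)

# Tokenize the SST-2 sentences; the [CLS] representation feeds the classification head.
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="bert-base-uncased-sst2",  # hypothetical output path
    num_train_epochs=3,                   # within the 2-4 epoch range noted above
    learning_rate=2e-5,
    per_device_train_batch_size=32,
)
Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
).train()
```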
The Hugging Face model card includes a candid note on bias. Even though the underlying corpora (BookCorpus and Wikipedia) are described as "fairly neutral," the trained model produces clearly stereotyped predictions for occupational fill-mask prompts. Asked to fill in "The man worked as a [MASK].", the top predictions include carpenter, waiter, barber, mechanic, and salesman. For "The woman worked as a [MASK].", the top predictions include nurse, waitress, maid, prostitute, and cook. These biases carry over to any downstream task fine-tuned from the same checkpoint, which is why deployments should evaluate fairness explicitly rather than assume the encoder is neutral.
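The prompts from the model card can be replayed with the same fill-mask pipeline shown earlier; exact predictions may vary slightly across library versions.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="google-bert/bert-base-uncased")
for prompt in ["The man worked as a [MASK].", "The woman worked as a [MASK]."]:
    # Print the top five predicted occupations for each template.
    print(prompt, [p["token_str"] for p in unmasker(prompt, top_k=5)])
```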
Other practical limits include the hard 512-token cap, which makes the model awkward for long documents without chunking, and the English-only vocabulary, which performs poorly on text with significant non-English content.
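One common workaround for the 512-token cap is overlapping windows, which the fast tokenizer supports directly; the stride of 128 and the toy document below are illustrative choices, not recommendations from the model card.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
long_text = "BERT was released in October 2018 by Google AI Language. " * 400  # toy long document

# Split the document into overlapping 512-token windows.
enc = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,
    return_overflowing_tokens=True,
)
print(len(enc["input_ids"]))  # number of chunks produced
```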
Bert-base-uncased anchored a wave of encoder models that improved on its training recipe rather than its architecture. RoBERTa (2019) dropped next sentence prediction, trained for much longer with bigger batches on more data, and showed substantial gains. ALBERT shared parameters across layers to shrink the model. DistilBERT distilled bert-base-uncased into a smaller model with about 40% fewer parameters and 60% faster inference while retaining roughly 97% of the GLUE performance. ELECTRA replaced the masked language modeling objective with a more sample-efficient replaced-token detection task. DeBERTa added disentangled attention and a different position encoding scheme.
The model also seeded a long line of domain-specific releases including BioBERT for biomedical text, SciBERT for scientific papers, FinBERT for financial filings, and ClinicalBERT for clinical notes, all of which started from one of the original BERT checkpoints and continued pretraining on a target domain corpus. As of early 2026, the Hugging Face Hub lists more than 6,000 fine-tuned descendants of bert-base-uncased, which is a reasonable proxy for how widely the checkpoint has been used as a starting point.
Several practical factors made bert-base-uncased the default English encoder for years. The 110M parameter count fit comfortably on a single consumer GPU for fine-tuning, which most contemporary alternatives did not. The Apache 2.0 license allowed commercial use without friction. The Hugging Face port arrived quickly and exposed a clean API. The architecture was simple enough that researchers could write task-specific heads in a few dozen lines. And the GLUE numbers, while no longer state of the art, were and remain a strong baseline for a model of its size. Even after the rise of much larger generative models, bert-base-uncased is still a common pick for text classification, retrieval reranking, and feature extraction when latency and cost matter more than the last few points of accuracy.
```bash
# Download the checkpoint files with git (be sure to have git-lfs installed, https://git-lfs.com)
git lfs install
git clone https://huggingface.co/google-bert/bert-base-uncased

# To clone the repo without large files (just their pointers),
# prepend the clone command with the following env var:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/google-bert/bert-base-uncased
```
```python
# Load the tokenizer and the masked language modeling head in Transformers.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-uncased")
```
```python
import requests

# Query the hosted Inference API; API_TOKEN must hold a valid Hugging Face access token.
API_URL = "https://api-inference.huggingface.co/models/google-bert/bert-base-uncased"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "The answer to the universe is [MASK].",
})
```