Bert-base-uncased model
| Field | Value |
|---|---|
| Hub repository | google-bert/bert-base-uncased (Hugging Face) |
| Name | bert-base-uncased |
| Task | Fill-Mask |
| Library | PyTorch, TensorFlow, JAX, Rust, Core ML, Safetensors, Transformers |
| Dataset | BookCorpus, Wikipedia |
| Language | English |
| Paper | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arXiv:1810.04805) |
| License | apache-2.0 |
| Related to | bert, exbert, AutoTrain Compatible |
Bert-base-uncased is the original lowercased English checkpoint of BERT, released in October 2018 by Google AI Language. It is one of the four English checkpoints published with the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, alongside bert-base-cased, bert-large-uncased, and bert-large-cased. The checkpoint contains roughly 110 million parameters and was pretrained on the combination of BookCorpus and English Wikipedia using two self-supervised objectives: masked language modeling and next sentence prediction. On the Hugging Face Hub it is hosted under the canonical name google-bert/bert-base-uncased, with the older alias bert-base-uncased still resolving to the same files. It is licensed under Apache 2.0 and, as of 2026, is downloaded around 60 million times per month, which makes it one of the most used models on the platform.
The "uncased" qualifier signals that the input text is lowercased and accent markers are stripped during tokenization. This is the default variant recommended for general English work where capitalization carries little signal, for example sentiment analysis, topic classification, or many question answering setups. The cased sibling, bert-base-cased, preserves capitalization and is preferred for tasks where casing matters, such as named entity recognition. Both variants share the same architecture and training procedure; only the tokenizer and pretraining corpus preprocessing differ.
Before BERT, the dominant approach to building NLP models combined static word vectors such as word2vec or GloVe with task-specific recurrent or convolutional networks. Contextual embeddings from ELMo and unidirectional language models like OpenAI GPT had begun to push the field toward transfer learning, but a deeply bidirectional pretrained encoder did not yet exist. The BERT paper changed that by adapting the transformer encoder of Vaswani et al. (2017) and proposing a masked-token objective that lets the model attend to both left and right context in every layer.
The paper was first posted to arXiv on October 11, 2018, with a revised version on May 24, 2019. It was presented at NAACL-HLT 2019 in Minneapolis, where it won the Best Long Paper award. The pretrained weights were released a few weeks after the initial preprint on the original google-research/bert repository on GitHub, including bert-base-uncased and a multilingual variant. Soon after, the same checkpoints were ported to PyTorch by Hugging Face, which is the form most practitioners use today.
Bert-base-uncased is a stack of transformer encoder blocks with no decoder. Each block contains a multi-head self-attention layer followed by a position-wise feed-forward network, with residual connections and layer normalization around each sub-layer. The exact configuration matches the "BERT-base" specification from the paper.
| Property | Value |
|---|---|
| Encoder layers | 12 |
| Hidden size | 768 |
| Self-attention heads | 12 |
| Feed-forward inner size | 3,072 (4 x hidden size) |
| Total parameters | ~110 million |
| Vocabulary size | 30,522 WordPiece tokens |
| Maximum sequence length | 512 tokens |
| Position encodings | Learned absolute, length 512 |
| Activation | GELU |
| Casing | Lowercased, accents stripped |
| Tokenizer | BasicTokenizer + WordPiece |
The BERT-base size was chosen to match the parameter count of OpenAI GPT so the two could be compared fairly. Inputs are constructed as [CLS] sentence A [SEP] sentence B [SEP], where the special [CLS] token's final hidden state is used as the aggregate representation for classification. Each token's input embedding is the sum of a WordPiece embedding, a learned segment embedding (sentence A or sentence B), and a learned position embedding.
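These values can be checked against the configuration shipped with the checkpoint. The sketch below assumes the Transformers library and the Hub name used throughout this page; the values in the comments are the ones listed in the table above.

```python
from transformers import AutoConfig

# Load the published configuration and compare it with the architecture table.
config = AutoConfig.from_pretrained("google-bert/bert-base-uncased")
print(config.num_hidden_layers)        # 12 encoder layers
print(config.hidden_size)              # 768
print(config.num_attention_heads)      # 12
print(config.intermediate_size)        # 3072 feed-forward inner size
print(config.vocab_size)               # 30522 WordPiece tokens
print(config.max_position_embeddings)  # 512 learned absolute positions
```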
The uncased tokenizer first runs a BasicTokenizer that lowercases the text, normalizes Unicode to NFD form, strips combining diacritical marks, and splits on whitespace and punctuation. The resulting words are then broken into subword units using the WordPiece algorithm with a fixed vocabulary of 30,522 tokens. Subword pieces inside a word are prefixed with ##, so the word "playing" becomes the two tokens play and ##ing. Out-of-vocabulary characters fall back to [UNK]. The vocabulary also reserves five special tokens: [PAD], [UNK], [CLS], [SEP], and [MASK].
Because case information is discarded, the tokenizer maps "Apple" and "apple" to the same token. This reduces vocabulary fragmentation and helps on small fine-tuning datasets, but it is the wrong choice when capitalization is an informative feature, for example distinguishing the company "Apple" from the fruit. In those situations the cased variant or a later model is a better fit.
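As a quick, hedged illustration of this behavior (the exact subword splits depend on the shipped 30,522-token vocabulary, and the example sentences are arbitrary), the tokenizer can be inspected directly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# Casing is discarded: both forms map to the same token sequence.
print(tokenizer.tokenize("Apple"))
print(tokenizer.tokenize("apple"))

# A rare word falls back to ##-prefixed WordPiece pieces.
print(tokenizer.tokenize("unaffable"))

# Sentence pairs are packed as [CLS] A [SEP] B [SEP].
enc = tokenizer("The cat sat.", "It was tired.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
```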
Bert-base-uncased was pretrained on a combined corpus of unlabeled English text totaling roughly 3.3 billion words.
| Corpus | Approx. words | Notes |
|---|---|---|
| BookCorpus | 800 million | 11,038 unpublished books across many genres |
| English Wikipedia | 2,500 million | Text passages only, excluding lists, tables, and headers |
The authors used Wikipedia text passages rather than the full dump because tabular and list-formatted content is poorly suited to a sentence-pair objective. The corpus was sentence-segmented and shuffled so that long contiguous spans of text from a single source did not dominate any one batch.
Bert-base-uncased was pretrained on two unsupervised tasks simultaneously, with the total loss equal to the sum of the two task losses.
The first is masked language modeling. For each input sequence, 15% of the WordPiece tokens are selected at random. Of those selected tokens, 80% are replaced with the special [MASK] token, 10% are replaced with a random token from the vocabulary, and 10% are kept unchanged. The model then has to predict the original token at every selected position from a softmax over the full vocabulary. The 80/10/10 split exists because [MASK] never appears at fine-tuning time; mixing in random and unchanged tokens forces the model to keep a usable distributional representation for every position rather than only for masked positions.
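The corruption rule can be sketched in a few lines of PyTorch. The function below is a minimal illustration, close in spirit to the logic of Transformers' DataCollatorForLanguageModeling rather than the original TensorFlow preprocessing; it assumes `input_ids` is a batch of token id tensors produced by the BERT tokenizer.

```python
import torch

def mask_tokens(input_ids, tokenizer, mlm_probability=0.15):
    """Apply BERT's 15% selection and 80/10/10 corruption rule (illustrative sketch)."""
    labels = input_ids.clone()

    # Select 15% of non-special positions as prediction targets.
    probability_matrix = torch.full(labels.shape, mlm_probability)
    special = torch.tensor(
        [tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
         for ids in labels.tolist()],
        dtype=torch.bool,
    )
    probability_matrix.masked_fill_(special, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # positions not selected are ignored by the loss

    # 80% of selected positions are replaced with [MASK].
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[replaced] = tokenizer.mask_token_id

    # 10% are replaced with a random vocabulary token; the remaining 10% stay unchanged.
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~replaced
    random_tokens = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    input_ids[randomized] = random_tokens[randomized]

    return input_ids, labels
```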
The second is next sentence prediction. For each example, two sentences A and B are sampled. Half the time, B is the actual next sentence in the source document; the other half it is a random sentence from the corpus. The model uses the final hidden state of the [CLS] token to make a binary IsNext or NotNext prediction. The goal was to teach the model to model relations between sentence pairs, which matters for tasks like natural language inference and question answering. Later work, particularly RoBERTa, showed that NSP contributes little once the model is trained for long enough on long sequences, but it remained part of the original recipe.
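The released checkpoint still carries the pretrained NSP head, so the mechanism can be probed through Transformers' BertForNextSentencePrediction; the sentence pair below is an arbitrary illustration.

```python
import torch
from transformers import AutoTokenizer, BertForNextSentencePrediction

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("google-bert/bert-base-uncased")

# Label 0 means "sentence B follows sentence A"; label 1 means B is a random sentence.
enc = tokenizer("He opened the fridge.", "It was almost empty.", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
print(logits.softmax(dim=-1))  # probabilities for [IsNext, NotNext]
```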
Google trained bert-base-uncased on 4 Cloud TPUs in a Pod configuration, totaling 16 TPU chips, for approximately 4 days at an estimated public-pricing cost of around 500 US dollars in 2018. The schedule used 1 million optimizer steps with a batch size of 256 sequences. The first 90% of steps used a maximum sequence length of 128 tokens to speed up training, and the remaining 10% used the full 512-token length to learn long-range position embeddings.
| Hyperparameter | Value |
|---|---|
| Optimizer | Adam |
| Learning rate | 1e-4 |
| Beta1, Beta2 | 0.9, 0.999 |
| Weight decay | 0.01 |
| Warmup steps | 10,000 |
| LR schedule | Linear warmup, then linear decay |
| Dropout | 0.1 on all layers |
| Batch size | 256 sequences |
| Total steps | 1,000,000 |
| Sequence length | 128 for 90% of steps, 512 for 10% |
Gradient checkpointing and mixed precision were not used in the original run; both are common today when reproducing the training on smaller hardware budgets.
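A hedged sketch of how those two options are typically switched on with the Transformers Trainer when reproducing the run on a single GPU; the output directory and the batch-size split are illustrative assumptions, not values from the original recipe.

```python
from transformers import AutoModelForMaskedLM, TrainingArguments

model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-uncased")
model.gradient_checkpointing_enable()  # recompute activations to save memory

args = TrainingArguments(
    output_dir="bert-mlm-repro",       # hypothetical output path
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,     # 32 x 8 = 256-sequence effective batch
    learning_rate=1e-4,
    warmup_steps=10_000,
    weight_decay=0.01,
    fp16=True,                         # mixed precision
)
```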
The original paper reports that bert-base-uncased reached an average score of 79.6 across the GLUE benchmark test sets, as scored by the GLUE evaluation server, well above the previous state of the art at the time. The Hugging Face model card documents the same per-task numbers.
| GLUE task | Metric | bert-base-uncased |
|---|---|---|
| MNLI-m | accuracy | 84.6 |
| MNLI-mm | accuracy | 83.4 |
| QQP | F1 | 71.2 |
| QNLI | accuracy | 90.5 |
| SST-2 | accuracy | 93.5 |
| CoLA | Matthews | 52.1 |
| STS-B | Spearman | 85.8 |
| MRPC | F1 | 88.9 |
| RTE | accuracy | 66.4 |
| Average | | 79.6 |
On extractive question answering, the BERT-base configuration reached 88.5 F1 on the SQuAD v1.1 dev set with a single model; the paper reports SQuAD v2.0 results only for the large configuration. BERT-large pushes these numbers higher (90.9 F1 on the SQuAD v1.1 dev set for a single model and 83.1 F1 on the SQuAD v2.0 test set), but the base model remains a strong baseline for a wide range of sentence-level and span-level tasks.
The original BERT release shipped four English checkpoints. Bert-base-uncased sits at the smaller, lowercased corner of the matrix.
| Model | Layers | Hidden | Heads | Parameters | Casing |
|---|---|---|---|---|---|
| bert-base-uncased | 12 | 768 | 12 | ~110M | lowercased |
| bert-base-cased | 12 | 768 | 12 | ~110M | preserved |
| bert-large-uncased | 24 | 1024 | 16 | ~340M | lowercased |
| bert-large-cased | 24 | 1024 | 16 | ~340M | preserved |
Bert-large variants raise the GLUE average by several points but are roughly three times larger and noticeably slower at fine-tuning and inference time. Cased variants help when capitalization carries information, particularly for named entity recognition. Multilingual variants such as bert-base-multilingual-cased extend the same architecture to 104 languages with a single shared WordPiece vocabulary of roughly 120,000 tokens.
The Hugging Face Transformers library exposes the model through both AutoTokenizer and AutoModel interfaces. The recommended quick test on the model card uses the fill-mask pipeline, which exercises the masked language modeling head that the model was actually pretrained for.
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="google-bert/bert-base-uncased")
unmasker("Hello I'm a [MASK] model.")
```
For downstream use, the library provides task-specific heads such as BertForSequenceClassification, BertForTokenClassification, BertForQuestionAnswering, and BertForMultipleChoice. A typical fine-tuning loop loads the pretrained encoder, attaches a small classification head over the [CLS] representation, and trains for 2 to 4 epochs with a learning rate near 2e-5 to 5e-5. The model card explicitly notes that the encoder is intended for fine-tuning on tasks that use the whole sentence (or sentence pair) and that text generation should use a model like GPT-2 instead.
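A minimal fine-tuning sketch along those lines, assuming the datasets library and using GLUE's SST-2 as an illustrative task; the output directory and batch size are arbitrary choices rather than model-card recommendations.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "google-bert/bert-base-uncased", num_labels=2
)

# Tokenize the SST-2 sentences; the [CLS] representation feeds the classification head.
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="bert-base-uncased-sst2",  # hypothetical output path
    num_train_epochs=3,                   # within the 2-4 epoch range noted above
    learning_rate=2e-5,
    per_device_train_batch_size=32,
)
Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
).train()
```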
The Hugging Face model card includes a candid note on bias. Even though the underlying corpora (BookCorpus and Wikipedia) are described as "fairly neutral," the trained model produces clearly stereotyped predictions for occupational fill-mask prompts. Asked to fill in "The man worked as a [MASK].", the top predictions include carpenter, waiter, barber, mechanic, and salesman. For "The woman worked as a [MASK].", the top predictions include nurse, waitress, maid, prostitute, and cook. These biases carry over to any downstream task fine-tuned from the same checkpoint, which is why deployments should evaluate fairness explicitly rather than assume the encoder is neutral.
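The prompts from the model card can be replayed with the same fill-mask pipeline shown earlier; exact predictions may vary slightly across library versions.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="google-bert/bert-base-uncased")
for prompt in ["The man worked as a [MASK].", "The woman worked as a [MASK]."]:
    # Print the top five predicted occupations for each template.
    print(prompt, [p["token_str"] for p in unmasker(prompt, top_k=5)])
```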
Other practical limits include the hard 512-token cap, which makes the model awkward for long documents without chunking, and the English-only vocabulary, which performs poorly on text with significant non-English content.
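One common workaround for the 512-token cap is overlapping windows, which the fast tokenizer supports directly; the stride of 128 and the toy document below are illustrative choices, not recommendations from the model card.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
long_text = "BERT was released in October 2018 by Google AI Language. " * 400  # toy long document

# Split the document into overlapping 512-token windows.
enc = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,
    return_overflowing_tokens=True,
)
print(len(enc["input_ids"]))  # number of chunks produced
```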
Bert-base-uncased anchored a wave of encoder models that improved on its training recipe rather than its architecture. RoBERTa (2019) dropped next sentence prediction, trained for much longer with bigger batches on more data, and showed substantial gains. ALBERT shared parameters across layers to shrink the model. DistilBERT distilled bert-base-uncased into a smaller model with about 40% fewer parameters and 60% faster inference while retaining roughly 97% of the GLUE performance. ELECTRA replaced the masked language modeling objective with a more sample-efficient replaced-token detection task. DeBERTa added disentangled attention and a different position encoding scheme.
The model also seeded a long line of domain-specific releases including BioBERT for biomedical text, SciBERT for scientific papers, FinBERT for financial filings, and ClinicalBERT for clinical notes, all of which started from one of the original BERT checkpoints and continued pretraining on a target domain corpus. As of early 2026, the Hugging Face Hub lists more than 6,000 fine-tuned descendants of bert-base-uncased, which is a reasonable proxy for how widely the checkpoint has been used as a starting point.
Several practical factors made bert-base-uncased the default English encoder for years. The 110M parameter count fit comfortably on a single consumer GPU for fine-tuning, which most contemporary alternatives did not. The Apache 2.0 license allowed commercial use without friction. The Hugging Face port arrived quickly and exposed a clean API. The architecture was simple enough that researchers could write task-specific heads in a few dozen lines. And the GLUE numbers, while no longer state of the art, were and remain a strong baseline for a model of its size. Even after the rise of much larger generative models, bert-base-uncased is still a common pick for text classification, retrieval reranking, and feature extraction when latency and cost matter more than the last few points of accuracy.
```bash
# Download the checkpoint files with git (be sure to have git-lfs installed, https://git-lfs.com)
git lfs install
git clone https://huggingface.co/google-bert/bert-base-uncased

# To clone the repo without large files (just their pointers),
# prepend the clone command with the following env var:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/google-bert/bert-base-uncased
```
```python
# Load the tokenizer and the masked language modeling head in Transformers.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-uncased")
```
```python
import requests

# Query the hosted Inference API; API_TOKEN must hold a valid Hugging Face access token.
API_URL = "https://api-inference.huggingface.co/models/google-bert/bert-base-uncased"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "The answer to the universe is [MASK].",
})
```