Token Classification Models
Last reviewed
May 11, 2026
Sources
16 citations
Review status
Source-backed
Revision
v2 · 2,500 words
Token classification models are natural language processing systems that assign a discrete label to every token in an input sequence, where a token is typically a word, subword piece, or character. The family covers named entity recognition (NER), part-of-speech (POS) tagging, syntactic chunking, slot filling for dialogue systems, morphological tagging, and (loosely) semantic role labeling. Although the task is simple to state (n tokens in, n labels out), token classification underlies most information extraction pipelines, including search indexing, biomedical literature mining, clinical decision support, voice assistants, and content moderation.
The family is distinct from text classification, which labels whole documents, and from sequence-to-sequence generation. Token classification preserves a one-to-one alignment between input positions and output labels, which suits encoder architectures and structured prediction. Modern systems combine a pretrained Transformer encoder with a lightweight classification head and use tag schemes (BIO, BIOES, BILOU) to encode multi-token spans.
Given an input sentence X = (x_1, ..., x_n), a token classification model predicts a sequence of labels Y = (y_1, ..., y_n) drawn from a fixed label set L. For tasks that mark spans rather than individual words, the label set is extended with prefixes that encode segment boundaries. In the BIO scheme, the entity "New York City" of type LOC is tagged (B-LOC, I-LOC, I-LOC), where B marks the beginning of a span, I marks an inside position, and O marks tokens outside any span. The BIOES scheme (also known as BILOU) adds explicit End and Single tags so the model can distinguish single-token from multi-token spans more cleanly.
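The BIO encoding described above can be illustrated with a minimal sketch. The function name `spans_to_bio` and the `(start, end, type)` span convention with an exclusive end index are assumptions made for this example, not part of any particular library.

```python
from typing import List, Tuple

# Hypothetical span convention: (start token index, exclusive end index, entity type).
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping token spans as one BIO tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if not (0 <= start < end <= n_tokens):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError("a single BIO sequence cannot encode overlapping spans")
        tags[start] = f"B-{label}"          # first token of the span
        for i in range(start + 1, end):     # remaining tokens of the span
            tags[i] = f"I-{label}"
    return tags


tokens = ["She", "moved", "to", "New", "York", "City", "."]
bio_tags = spans_to_bio(len(tokens), [(3, 6, "LOC")])
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

The output keeps the one-to-one alignment between tokens and labels: seven tokens in, seven tags out.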
Subword tokenizers used by Transformers complicate alignment between input pieces and word-level labels. A common convention assigns the gold label to the first subword of each word and an ignore index to the remaining subwords during training; at inference, the first subword prediction propagates to the whole word.
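The first-subword convention can be sketched without a real tokenizer. In the example below the subword segmentation is supplied by hand and is purely hypothetical, and the ignore value of -100 is an assumption (a value that many training loops mask out of the loss); the helper names are illustrative only.

```python
from typing import Dict, List, Tuple

IGNORE_INDEX = -100  # assumption: positions with this value are skipped by the loss


def align_labels(
    word_labels: List[str],
    pieces_per_word: List[List[str]],
    label2id: Dict[str, int],
) -> Tuple[List[str], List[int], List[int]]:
    """Give each word's label to its first subword and IGNORE_INDEX to the rest."""
    pieces, piece_labels, word_ids = [], [], []
    for w_idx, (label, word_pieces) in enumerate(zip(word_labels, pieces_per_word)):
        for p_idx, piece in enumerate(word_pieces):
            pieces.append(piece)
            word_ids.append(w_idx)
            piece_labels.append(label2id[label] if p_idx == 0 else IGNORE_INDEX)
    return pieces, piece_labels, word_ids


def word_level_predictions(
    piece_preds: List[int], word_ids: List[int], id2label: Dict[int, str]
) -> List[str]:
    """At inference, keep only the prediction made at each word's first subword."""
    out, previous = [], None
    for pred, w_idx in zip(piece_preds, word_ids):
        if w_idx != previous:
            out.append(id2label[pred])
            previous = w_idx
    return out


label2id = {"O": 0, "B-LOC": 1, "I-LOC": 2}
id2label = {i: label for label, i in label2id.items()}

word_labels = ["O", "B-LOC"]                             # labels for "Visit", "Heidelberg"
pieces_per_word = [["Visit"], ["Heid", "##el", "##berg"]]  # hypothetical segmentation

pieces, piece_labels, word_ids = align_labels(word_labels, pieces_per_word, label2id)
print(pieces, piece_labels, word_ids)

# Pretend the model predicted these ids per subword; only first pieces are read.
recovered = word_level_predictions([0, 1, 2, 0], word_ids, id2label)
print(recovered)
```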
The earliest taggers were hand-crafted rule systems. Eric Brill introduced transformation-based learning for POS tagging in 1992, a method that learns an ordered list of correction rules from a labeled corpus and applies them iteratively. The Brill tagger reached accuracies comparable to stochastic taggers with only a few hundred rules and remained an interpretable baseline.
Hidden Markov models dominated the second half of the 1990s for POS tagging. John Lafferty, Andrew McCallum, and Fernando Pereira proposed conditional random fields (CRFs) in 2001, a family of probabilistic models that overcame the label bias problem of maximum entropy Markov models and allowed richer feature templates without the strong independence assumptions of HMMs. CRFs became the standard backbone for sequence labeling and powered systems such as Stanford NER (Finkel, Grenager, and Manning 2005), which augmented a linear-chain CRF with non-local constraints through Gibbs sampling.
Deep learning displaced feature-engineered CRFs in the mid-2010s. Zhiheng Huang, Wei Xu, and Kai Yu published the bidirectional LSTM-CRF in 2015, combining a BiLSTM sentence encoder with a CRF output layer; the model achieved near state-of-the-art accuracy on POS tagging (Penn Treebank), chunking (CoNLL-2000), and NER (CoNLL-2003) with less reliance on hand-engineered features than earlier CRF systems. Later variants added character-level convolutions to capture morphology and handle out-of-vocabulary words. ELMo (Peters et al. 2018) raised CoNLL-2003 English NER F1 above 92 when fed into a BiLSTM-CRF.
The modern era began with BERT (Devlin et al. 2018), which moved token classification onto bidirectional Transformer encoders with a linear classification head. Domain-specific encoders followed: BioBERT (Lee et al. 2019) for biomedical text, SciBERT (Beltagy et al. 2019) for scientific papers, and Clinical BERT (Alsentzer et al. 2019) for hospital notes. XLM-RoBERTa (Conneau et al. 2020) extended the recipe to 100 languages. The 2023 wave of instruction-tuned large language models added a second paradigm in which entity extraction is framed as generation guided by a schema, with systems such as GoLLIE and UniNER narrowing the gap to fine-tuned encoders in zero-shot settings.
| Task | Output | Typical label set |
|---|---|---|
| Named entity recognition | Spans of mentions | PERSON, ORG, LOC, MISC, DATE, MONEY |
| Part-of-speech tagging | One tag per token | NOUN, VERB, ADJ, ADV, Penn Treebank tags |
| Chunking | Shallow phrase spans | NP, VP, PP, ADJP, ADVP |
| Slot filling | Spans aligned with intent | departure_city, time, product |
| Morphological tagging | Morphological features | Case, Tense, Number, Gender |
| Semantic role labeling | Spans tied to predicates | ARG0, ARG1, ARGM-TMP |
NER typically uses a few coarse types (four in CoNLL-2003: PER, ORG, LOC, MISC; eighteen in OntoNotes 5.0). POS tagging draws from the Penn Treebank tagset (about 45 tags for English) or the Universal Dependencies tagset (17 coarse tags shared across more than 150 languages). Chunking divides a sentence into non-overlapping syntactic phrases; slot filling is the analogous task inside dialogue systems and is normally trained jointly with intent classification.
Four span-encoding schemes are common.
| Scheme | Tags used | Notes |
|---|---|---|
| IO | I, O | Cannot mark consecutive spans of the same type |
| BIO (IOB2) | B, I, O | Standard CoNLL format; B marks every span start |
| BIOES | B, I, O, E, S | Adds End and Single; reduces decoding ambiguity |
| BILOU | B, I, L, O, U | Renaming of BIOES; Last and Unit replace End and Single |
BIO is the default for most modern Transformer pipelines. BIOES and BILOU mark span ends and single-token spans explicitly, which makes more ill-formed sequences detectable (for example, a B-LOC that is never closed by an E-LOC or L-LOC) and tends to help CRF and span-based decoders. Research on chemical and biomedical NER has reported small but consistent F1 gains from BIOES at the cost of a larger label space.
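The relationship between the schemes can be made concrete with a small conversion sketch. The function name `bio_to_bioes` is an assumption for this example, and the code assumes well-formed BIO (IOB2) input in which every span starts with a B tag.

```python
from typing import List


def bio_to_bioes(tags: List[str]) -> List[str]:
    """Convert well-formed BIO (IOB2) tags to BIOES.

    A tag becomes E (end) or S (single) when the next tag does not
    continue the same span with a matching I- tag.
    """
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        span_continues = next_tag == f"I-{label}"
        if prefix == "B":
            out.append(f"B-{label}" if span_continues else f"S-{label}")
        else:  # prefix == "I"
            out.append(f"I-{label}" if span_continues else f"E-{label}")
    return out


bio = ["B-LOC", "I-LOC", "I-LOC", "O", "B-PER"]
bioes = bio_to_bioes(bio)
print(bioes)
```

Renaming E to L and S to U in the output yields the BILOU variant; the span boundaries are unchanged.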
The dominant 2020s architecture is a Transformer encoder with a linear layer and softmax applied to each token's contextual representation. Fine-tuning BERT-base or RoBERTa on BIO tags reaches CoNLL-2003 test F1 in the low 90s. A linear-chain CRF layer atop the encoder can still provide a small boost when the label set is dense or training data is limited.
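The per-token classification head can be sketched in plain Python. The hidden states, weights, and biases below are toy values invented for illustration (a real encoder would produce vectors with hundreds of dimensions and the head would be trained); only the shape of the computation is the point.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_head(
    hidden_states: List[List[float]],  # n tokens x d dimensions
    weight: List[List[float]],         # k labels x d dimensions
    bias: List[float],                 # k labels
) -> List[List[float]]:
    """Apply the same linear layer and softmax independently at every position."""
    probs = []
    for h in hidden_states:
        logits = [
            sum(w_i * h_i for w_i, h_i in zip(row, h)) + b
            for row, b in zip(weight, bias)
        ]
        probs.append(softmax(logits))
    return probs


labels = ["O", "B-LOC", "I-LOC"]
hidden_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]   # toy encoder outputs
weight = [[2.0, -1.0], [-1.0, 2.0], [-2.0, 1.0]]        # toy head parameters
bias = [0.0, 0.0, 0.0]

probs = linear_head(hidden_states, weight, bias)
predicted = [labels[max(range(len(p)), key=p.__getitem__)] for p in probs]
print(predicted)
```

Because each position is classified independently, nothing in this head prevents an ill-formed tag sequence; that is the gap a CRF layer or constrained decoding fills.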
The pre-Transformer BiLSTM-CRF remains a strong baseline. It concatenates word embeddings (often GloVe) with character embeddings from a smaller BiLSTM or CNN, feeds the sequence through a deeper BiLSTM, and decodes under a CRF transition matrix. Parts of spaCy and Flair still use BiLSTM components because they run faster than large Transformers on CPU.
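Decoding under a transition matrix is usually done with the Viterbi algorithm. The following is a minimal sketch that assumes per-token emission scores and a label-to-label transition matrix are already available; the scores are hand-set for illustration (a trained CRF would learn them), and start and end transitions are omitted for brevity.

```python
from typing import List


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label path.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            # Best previous label for arriving at label j.
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final label.
    best = max(range(n_labels), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


labels = ["O", "B-LOC", "I-LOC"]
FORBIDDEN = -1e4  # large penalty standing in for a disallowed transition
transitions = [
    [0.0, 0.0, FORBIDDEN],  # from O: O -> I-LOC is ill-formed in BIO
    [0.0, 0.0, 0.0],        # from B-LOC
    [0.0, 0.0, 0.0],        # from I-LOC
]
emissions = [
    [2.0, 0.5, 0.0],
    [0.0, 1.0, 1.5],
    [0.1, 0.0, 2.0],
]

greedy = [max(range(3), key=row.__getitem__) for row in emissions]
decoded = viterbi(emissions, transitions)
print([labels[i] for i in greedy])
print([labels[i] for i in decoded])
```

In this toy case the greedy per-token argmax yields O, I-LOC, I-LOC, which is ill-formed, while the transition-aware path is O, B-LOC, I-LOC.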
Span-based models predict labels for candidate spans directly instead of tag by tag. This formulation, popularized by Lee et al. (2017) for coreference, was adapted for NER soon after. Span models handle nested entities because overlapping spans are scored independently, which is impossible under BIO. Retrieval-augmented NER conditions the encoder on retrieved examples or a knowledge base, helping with rare entities and domain shift.
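A span-based decoder can be sketched as two steps: enumerate candidate spans up to a maximum width, then keep every span whose best non-O label clears a threshold. The scores below are hand-set stand-ins for what a trained span classifier would output, and the function names and threshold are assumptions for this example.

```python
from typing import Dict, List, Tuple


def enumerate_spans(n_tokens: int, max_width: int) -> List[Tuple[int, int]]:
    """All (start, exclusive end) candidates of width 1..max_width."""
    return [
        (start, end)
        for start in range(n_tokens)
        for end in range(start + 1, min(start + max_width, n_tokens) + 1)
    ]


def decode_spans(
    span_scores: Dict[Tuple[int, int], Dict[str, float]], threshold: float = 0.5
) -> List[Tuple[int, int, str]]:
    """Score each span independently, so overlapping spans can both survive."""
    kept = []
    for (start, end), label_scores in span_scores.items():
        label, score = max(label_scores.items(), key=lambda item: item[1])
        if label != "O" and score >= threshold:
            kept.append((start, end, label))
    return sorted(kept)


tokens = ["University", "of", "California"]
candidates = enumerate_spans(len(tokens), max_width=3)

# Hypothetical classifier output: most candidates are non-entities.
span_scores = {span: {"O": 0.9} for span in candidates}
span_scores[(0, 3)] = {"O": 0.1, "ORG": 0.9}   # "University of California"
span_scores[(2, 3)] = {"O": 0.2, "LOC": 0.8}   # "California", nested inside the ORG

entities = decode_spans(span_scores)
print(entities)
```

The nested LOC span sits inside the ORG span, a configuration that a single BIO tag sequence cannot represent. The cost is that the number of candidates grows with sentence length times maximum width.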
LLM-based extraction frames the task as generation. The model receives a schema and returns structured output, typically JSON. GoLLIE (Sainz et al. 2023) is fine-tuned on guidelines expressed as Python class docstrings; UniNER (Zhou et al. 2023) distills ChatGPT outputs into a smaller targeted model. Both perform well in zero-shot regimes, though fine-tuning a domain-specific encoder still dominates when ample labeled data is available.
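Generated output has to be validated before it can stand in for token-level labels. The sketch below makes no model call: `reply` is a hand-written stand-in for a model's JSON response, and the JSON layout (a list of objects with `text` and `type` keys), the type names, and the function name are all assumptions made for illustration.

```python
import json
from typing import Dict, List, Set


def parse_entities(reply: str, passage: str, allowed_types: Set[str]) -> List[Dict]:
    """Keep only well-formed entities whose type is in the schema and whose
    mention occurs verbatim in the passage; attach character offsets."""
    try:
        items = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    entities = []
    for item in items:
        if not isinstance(item, dict):
            continue
        mention, etype = item.get("text"), item.get("type")
        if not isinstance(mention, str) or not mention:
            continue
        if not isinstance(etype, str) or etype not in allowed_types:
            continue  # type not in the requested schema
        start = passage.find(mention)  # first occurrence only, a simplification
        if start == -1:
            continue  # mention not present in the passage; likely hallucinated
        entities.append(
            {"text": mention, "type": etype, "start": start, "end": start + len(mention)}
        )
    return entities


passage = "Aspirin reduced fever in patients at Mercy Hospital."
allowed_types = {"DRUG", "FACILITY", "SYMPTOM"}  # hypothetical schema
reply = json.dumps([
    {"text": "Aspirin", "type": "DRUG"},
    {"text": "Mercy Hospital", "type": "FACILITY"},
    {"text": "ibuprofen", "type": "DRUG"},   # not in the passage
    {"text": "fever", "type": "MOOD"},       # type outside the schema
])

entities = parse_entities(reply, passage, allowed_types)
for entity in entities:
    print(entity)
```

Unlike a tagger, a generative extractor gives no guarantee of alignment with the input, so offsets must be recovered after the fact; repeated mentions would need a more careful matching strategy than the first-occurrence lookup used here.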
| Model | Year | Approach | Notes |
|---|---|---|---|
| Brill tagger | 1992 | Transformation rules | Interpretable POS baseline |
| Stanford NER | 2005 | Linear-chain CRF | Gibbs sampling for non-local features |
| BiLSTM-CRF | 2015 | BiLSTM encoder + CRF | First strong neural sequence tagger |
| Flair | 2018 | Character LM + BiLSTM-CRF | Akbik et al., Zalando Research |
| spaCy | 2015+ | CNN tok2vec + transition NER | Production library, en_core_web_lg |
| BERT | 2018 | Transformer encoder + linear head | Devlin et al., near 92 F1 on CoNLL-2003 |
| BioBERT | 2019 | BERT pretrained on PubMed | Lee et al., gain on BC5CDR, NCBI Disease |
| SciBERT | 2019 | BERT pretrained on scientific papers | Beltagy et al., 1.14M Semantic Scholar papers |
| Clinical BERT | 2019 | BERT pretrained on MIMIC III | Alsentzer et al., clinical NER |
| XLM-RoBERTa | 2020 | Multilingual Transformer | Conneau et al., 100 languages, +2.4 F1 on NER over mBERT |
| GoLLIE | 2023 | Code-LLaMA fine-tuned on guidelines | Strong zero-shot IE |
| UniNER | 2023 | Instruction tuning + ChatGPT distillation | Targeted NER LLM |
Stanford NER, spaCy, and Flair remain in active production use. Hugging Face hosts thousands of fine-tuned BERT, RoBERTa, and XLM-R checkpoints for NER across languages and domains.
NER and POS benchmarks fall into general-domain, social media, biomedical, and multilingual categories.
| Benchmark | Year | Task | Notes |
|---|---|---|---|
| CoNLL-2003 | 2003 | NER (English, German) | Tjong Kim Sang and De Meulder; PER, ORG, LOC, MISC |
| OntoNotes 5.0 | 2013 | NER, coref, SRL | 18 entity types; English, Chinese, Arabic |
| WNUT-2017 | 2017 | Emerging entities | Twitter, Reddit, YouTube, StackExchange |
| Few-NERD | 2021 | Few-shot NER | 8 coarse, 66 fine types, 188K sentences |
| ACE 2005 | 2005 | NER, relation, event | Multi-genre English |
| MultiCoNER | 2022 | Multilingual complex NER | 11 languages, 2.3M instances |
| BC5CDR | 2016 | Biomedical NER | Chemicals and diseases |
| NCBI Disease | 2014 | Biomedical NER | Disease mentions in PubMed abstracts |
| JNLPBA | 2004 | Biomedical NER | Genes and proteins |
| GENIA | 2003 | Biomedical NER | Molecular biology corpus |
| GermEval 2014 | 2014 | NER (German) | Nested entities |
| Penn Treebank | 1993 | POS tagging | English newswire, 45 tags |
| Universal Dependencies | 2014+ | POS, morphology, parsing | 200+ treebanks, 150+ languages |
The canonical NER metric is span-level F1, computed on exact match of entity boundaries and types (the conlleval script). Biomedical work often reports both exact and relaxed matching. POS tagging is reported as token accuracy, with strong models exceeding 97 percent on English Penn Treebank.
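Exact-match span F1 can be sketched as follows. This is an illustrative reimplementation, not the conlleval script itself; the function names are assumptions, and the choice to let a stray I- tag open a new span is a design decision of this sketch.

```python
from typing import List, Set, Tuple


def bio_to_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Extract (start, exclusive end, type) spans from BIO tags."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = label is not None and tag == f"I-{label}"
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
            start, label = None, None
        if tag != "O":
            # B- tags, and I- tags that do not continue a span, open a new one.
            start, label = i, tag[2:]
    return spans


def span_f1(gold_seqs: List[List[str]], pred_seqs: List[List[str]]):
    """Micro-averaged precision, recall, and F1 over exactly matching spans."""
    true_pos = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
        true_pos += len(gold_spans & pred_spans)  # boundaries and type must match
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-LOC", "O"]]

precision, recall, f1 = span_f1(gold, pred)
token_accuracy = sum(g == p for g, p in zip(gold[0], pred[0])) / len(gold[0])
print(precision, recall, f1, token_accuracy)
```

In this example four of five tokens are correct, yet the truncated LOC span counts as both a false positive and a false negative, so span F1 is 0.5 while token accuracy is 0.8. This gap is why NER is scored on spans and POS tagging on tokens.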
From around 2022, large language models such as GPT-4 and Claude shifted part of the field toward prompt-based extraction. With structured output features such as JSON mode and tool use, an LLM can read a passage and return a list of entities, even with custom types described only in the prompt. Zero-shot performance on novel domains can match or exceed fine-tuned encoders when the target schema is unusual or labeled data is scarce. Open-weight systems that accept entity types at inference time include the instruction-tuned LLMs GoLLIE and UniNER (2023) and GLiNER (Zaratiana et al. 2024), a compact bidirectional-encoder model that matches entity-type prompts against candidate spans.
For well-resourced domains with stable schemas, fine-tuned encoder models retain advantages in throughput, latency, and cost. A BERT-base tagger processes thousands of sentences per second on a single GPU, while an LLM extractor producing JSON runs orders of magnitude slower. Production stacks often mix the two: an LLM bootstraps labels on a new domain, and a distilled encoder serves traffic.
Domain shift is the largest practical challenge. A CoNLL-2003 newswire model often drops more than 10 F1 points on tweets or clinical notes; in-domain pretraining recovers most of the loss. Nested entities (one entity inside another, common in biomedical text) cannot be expressed in BIO and require span-based decoders. Very long entities (multi-clause titles) are hard for left-to-right decoders. Low-resource languages have few labeled examples; cross-lingual transfer through XLM-R helps but does not close the gap.
Label noise is common in crowdsourced corpora, and audits of CoNLL-2003 have identified annotation errors that limit further F1 gains. Corrected test sets such as CoNLL++ have been proposed. Generalization across schemas remains weak: a model trained on OntoNotes does not produce CoNLL or biomedical types without retraining, although schema-aware LLMs reduce that friction.