Token Classification Models
Last reviewed
May 31, 2026
Sources
22 citations
Review status
Source-backed
Revision
v3 · 4,469 words
Token classification models are natural language processing systems that assign a discrete label to every token in an input sequence, where a token is typically a word, subword piece, or character. The family covers named entity recognition (NER), part-of-speech tagging (POS), syntactic chunking, slot filling for dialogue systems, morphological tagging, and (loosely) semantic role labeling. Although the task is simple to state (input n tokens, output n labels), token classification underlies most information extraction pipelines: search indexing, biomedical literature mining, clinical decision support, voice assistants, and content moderation.
The family is distinct from text classification, which labels whole documents, and from sequence-to-sequence generation. Token classification preserves a one-to-one alignment between input positions and output labels, which suits encoder architectures and structured prediction. Modern systems combine a pretrained Transformer encoder with a lightweight classification head and use tag schemes (BIO, BIOES, BILOU) to encode multi-token spans.
Given an input sentence X = (x_1, ..., x_n), a token classification model predicts a sequence of labels Y = (y_1, ..., y_n) drawn from a fixed label set L. For tasks that mark spans rather than individual words, the label set is extended with prefixes that encode segment boundaries. The entity "New York City" of type LOC tags in the BIO scheme as (B-LOC, I-LOC, I-LOC), where B marks the beginning of a span, I marks an inside position, and O marks tokens outside any span. The BIOES (or BILOU) scheme adds explicit End and Single tags so the model can distinguish single-token from multi-token spans more cleanly.
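The span encoding described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `spans_to_tags`, the `(start, end_exclusive, type)` span format, and the example sentence are assumptions made for this sketch.

```python
def spans_to_tags(n_tokens, spans, scheme="BIO"):
    """Encode (start, end_exclusive, type) spans as one tag per token.

    scheme is "BIO" or "BIOES"; spans are assumed not to overlap.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        length = end - start
        for i in range(start, end):
            if scheme == "BIOES" and length == 1:
                prefix = "S"          # single-token span
            elif i == start:
                prefix = "B"          # first token of the span
            elif scheme == "BIOES" and i == end - 1:
                prefix = "E"          # last token of a multi-token span
            else:
                prefix = "I"          # inside position
            tags[i] = f"{prefix}-{label}"
    return tags


tokens = ["She", "moved", "to", "New", "York", "City", "from", "Paris"]
spans = [(3, 6, "LOC"), (7, 8, "LOC")]
print(spans_to_tags(len(tokens), spans))
print(spans_to_tags(len(tokens), spans, scheme="BIOES"))
```

Under BIO the single-token span "Paris" is tagged B-LOC, while under BIOES it receives the dedicated S-LOC tag and "City" receives E-LOC.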
Subword tokenizers used by Transformers complicate alignment between input pieces and word-level labels. A common convention assigns the gold label to the first subword of each word and an ignore index to the remaining subwords during training; at inference, the first subword prediction propagates to the whole word.
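The first-subword convention can be illustrated with a toy example. The sketch below assumes a hypothetical splitter (`toy_split`, which simply cuts words into four-character pieces) in place of a real subword tokenizer, and uses -100 as the ignore value, which is the value commonly paired with PyTorch cross-entropy; both are assumptions for illustration.

```python
IGNORE_INDEX = -100  # assumption: ignore value commonly used with PyTorch cross-entropy


def toy_split(word):
    """Hypothetical stand-in for a subword tokenizer: 4-character pieces.

    Assumes a non-empty word; continuation pieces get a "##" prefix.
    """
    chunks = [word[i:i + 4] for i in range(0, len(word), 4)]
    return [chunks[0]] + ["##" + c for c in chunks[1:]]


def align_labels(words, word_labels, split_word):
    """Give the gold label to the first piece of each word, IGNORE_INDEX to the rest."""
    pieces, piece_labels, word_ids = [], [], []
    for idx, (word, label) in enumerate(zip(words, word_labels)):
        for j, piece in enumerate(split_word(word)):
            pieces.append(piece)
            piece_labels.append(label if j == 0 else IGNORE_INDEX)
            word_ids.append(idx)  # remember which word each piece came from
    return pieces, piece_labels, word_ids


def word_level_predictions(piece_preds, word_ids):
    """At inference, keep only the prediction made on each word's first piece."""
    preds, seen = [], set()
    for pred, wid in zip(piece_preds, word_ids):
        if wid not in seen:
            seen.add(wid)
            preds.append(pred)
    return preds


words = ["Johanson", "visited", "Oslo"]
labels = [1, 0, 2]  # hypothetical label ids, e.g. 1 = B-PER, 0 = O, 2 = B-LOC
pieces, piece_labels, word_ids = align_labels(words, labels, toy_split)
print(pieces)
print(piece_labels)
print(word_level_predictions([1, 3, 0, 0, 2], word_ids))
```

The stray prediction `3` on the continuation piece `##nson` is discarded, so the word "Johanson" keeps the label predicted on its first piece.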
The earliest taggers were hand-crafted rule systems. Eric Brill introduced transformation-based learning for POS tagging in 1992, a method that learns an ordered list of correction rules from a labeled corpus and applies them iteratively. The Brill tagger reached accuracies comparable to stochastic taggers with only a few hundred rules and remained an interpretable baseline.
Hidden Markov models dominated the second half of the 1990s for POS tagging. John Lafferty, Andrew McCallum, and Fernando Pereira proposed conditional random fields (CRFs) in 2001, a family of probabilistic models that overcame the label bias problem of maximum entropy Markov models and allowed richer feature templates without the strong independence assumptions of HMMs. CRFs became the standard backbone for sequence labeling and powered systems such as Stanford NER (Finkel, Grenager, and Manning 2005), which augmented a linear-chain CRF with non-local constraints through Gibbs sampling.
Deep learning displaced feature-engineered CRFs in the mid-2010s. Zhiheng Huang, Wei Xu, and Kai Yu published the bidirectional LSTM-CRF in 2015, combining a BiLSTM sentence encoder with a CRF output layer; the model achieved near state-of-the-art accuracy on POS, chunking, and NER on CoNLL-2003 data without hand-engineered features. Character-level convolutions captured morphology and out-of-vocabulary words. ELMo (Peters et al. 2018) raised CoNLL-2003 English NER F1 above 92 when fed into a BiLSTM-CRF.
The modern era began with BERT (Devlin et al. 2018), which pushed token classification onto bidirectional Transformer encoders with a linear classification head. Domain-specific encoders followed: BioBERT (Lee et al. 2019) for biomedical text, SciBERT (Beltagy et al. 2019) for scientific papers, and Clinical BERT (Alsentzer et al. 2019) for hospital notes. XLM-RoBERTa (Conneau et al. 2020) extended the recipe to over 100 languages. The 2023 wave of instruction-tuned large language models added a second paradigm where entity extraction is framed as generation guided by a schema, with systems such as GoLLIE and UniNER closing the gap to fine-tuned encoders in zero-shot settings.
| Task | Output | Typical label set |
|---|---|---|
| Named entity recognition | Spans of mentions | PERSON, ORG, LOC, MISC, DATE, MONEY |
| Part-of-speech tagging | One tag per token | NOUN, VERB, ADJ, ADV, Penn Treebank tags |
| Chunking | Shallow phrase spans | NP, VP, PP, ADJP, ADVP |
| Slot filling | Spans aligned with intent | departure_city, time, product |
| Morphological tagging | Morphological features | Case, Tense, Number, Gender |
| Semantic role labeling | Spans tied to predicates | ARG0, ARG1, ARGM-TMP |
NER typically uses a few coarse types (four in CoNLL-2003: PER, ORG, LOC, MISC; eighteen in OntoNotes 5.0). POS tagging draws from the Penn Treebank tagset (about 45 tags for English) or the Universal Dependencies tagset (17 coarse tags shared across more than 160 languages). Chunking divides a sentence into non-overlapping syntactic phrases; slot filling is the analogous task inside dialogue systems and is normally trained jointly with intent classification.
Named entity recognition is the most studied token classification task. Its goal is to identify and categorize spans in free text that refer to real-world entities: people, organizations, locations, products, dates, monetary amounts, and similar types. Modern models are trained on annotated corpora in which human experts have marked entity boundaries and types. The task is evaluated at the entity level rather than the token level: a prediction counts as correct only if both the boundary and the type match the gold annotation exactly.
NER performance depends heavily on the domain. Newswire models trained on CoNLL-2003 or OntoNotes often drop 10 to 20 F1 points when applied to social media text, clinical notes, or scientific abstracts. Domain-specific pretraining and targeted annotation corpora address this gap.
Part-of-speech tagging assigns a syntactic category to each token. For English, the Penn Treebank tagset distinguishes 45 fine-grained classes (NN, NNS, VBD, JJ, etc.). The Universal Dependencies project defines 17 coarse categories (NOUN, VERB, ADJ, etc.) that apply across the 200-plus treebanks covering more than 160 languages.
POS tagging is relatively saturated for high-resource languages: transformer-based taggers exceed 98 percent token accuracy on standard English benchmarks. The remaining errors concentrate on rare words, punctuation ambiguities, and domain-shifted text such as spoken transcripts or social media.
Chunking (also called shallow parsing) groups consecutive tokens into non-overlapping syntactic phrases without requiring a full parse tree. Output spans are typed as noun phrases (NP), verb phrases (VP), prepositional phrases (PP), and similar constituents. Chunking is evaluated on CoNLL-2000 data using token-level and span-level F1. Because it is simpler than full dependency or constituency parsing, it remains useful as a fast preprocessing step in production pipelines.
Slot filling is the dialogue-oriented variant of NER. Given an utterance such as "Book a table for two at eight on Friday," the system must identify and tag slot values: party_size=two, time=eight, day=Friday. Slots are defined by an intent schema (a reservation intent has specific slot types that a weather intent does not). BERT-based joint models that solve intent classification and slot filling simultaneously became standard after Chen, Zhuo, and Wang (2019) demonstrated that shared encoder representations improve both tasks. Slot filling benchmarks include ATIS (Airline Travel Information Systems) and SNIPS.
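Turning a tagger's output into slot values is a small decoding step. The following sketch assumes BIO-tagged output and a hypothetical helper name `decode_slots`; the tag sequence for the reservation utterance is written by hand for illustration rather than produced by a model.

```python
def decode_slots(tokens, tags):
    """Collect (slot_type, value) pairs from BIO tags, one pair per span."""
    slots, cur_type, cur_tokens = [], None, []

    def flush():
        nonlocal cur_type, cur_tokens
        if cur_type is not None:
            slots.append((cur_type, " ".join(cur_tokens)))
        cur_type, cur_tokens = None, []

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == cur_type:
            cur_tokens.append(token)   # continue the open span
            continue
        flush()                        # O, B, or a stray I closes the open span
        if prefix in ("B", "I"):
            # a stray I (no matching open span) is leniently treated as a span start
            cur_type, cur_tokens = label, [token]
    flush()
    return slots


tokens = "Book a table for two at eight on Friday".split()
tags = ["O", "O", "O", "O", "B-party_size", "O", "B-time", "O", "B-day"]
print(decode_slots(tokens, tags))
```

Treating a stray I tag as a span start is one lenient choice among several; a stricter decoder could drop such tokens instead.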
Four span-encoding schemes are common.
| Scheme | Tags used | Notes |
|---|---|---|
| IO | I, O | Cannot mark consecutive spans of the same type |
| BIO (IOB2) | B, I, O | Standard CoNLL format; B marks every span start |
| BIOES | B, I, O, E, S | Adds End and Single; reduces decoding ambiguity |
| BILOU | B, I, L, O, U | Renaming of BIOES; Last and Unit replace End and Single |
BIO is the default for most modern Transformer pipelines. BIOES and BILOU mark span ends and single-token spans explicitly, which makes more ill-formed sequences detectable from adjacent tag pairs and tends to help CRF and span-based decoders. Research on chemical and biomedical NER has reported small but consistent F1 gains from BIOES at the cost of a larger label space.
The IO scheme is the simplest but has an inherent limitation: two consecutive entities of the same type cannot be distinguished because there is no Beginning marker to separate them. BIO resolves this by requiring a B tag at the start of every entity span, regardless of whether the previous token had the same type. BIOES and BILOU additionally mark the final token of each span (E and L respectively) and treat single-token entities with a dedicated S or U tag, which gives the CRF transition model clearer structural cues.
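The IO limitation is easy to reproduce. The sketch below uses a hypothetical decoder `spans_from_tags` that reads either IO or BIO tags; the four-token sentence with two adjacent person names is an invented example.

```python
def spans_from_tags(tags):
    """Return (start, end_exclusive, type) spans from IO or BIO tags."""
    spans, start, cur = [], None, None
    for i, tag in enumerate(tags + ["O"]):     # trailing sentinel closes an open span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == cur
        if not continues:
            if cur is not None:
                spans.append((start, i, cur))  # close the span that was open
            start, cur = (i, label) if prefix in ("B", "I") else (None, None)
    return spans


# "Invite Ann Lee Bob": two adjacent PER entities, "Ann Lee" and "Bob".
bio_tags = ["O", "B-PER", "I-PER", "B-PER"]
io_tags = ["O", "I-PER", "I-PER", "I-PER"]
print(spans_from_tags(bio_tags))  # two spans
print(spans_from_tags(io_tags))   # one merged span
```

With BIO the second B-PER splits the names into two spans; with IO the same tokens collapse into a single three-token entity.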
The pre-Transformer BiLSTM-CRF remains a well-understood baseline. It concatenates word embeddings (often GloVe) with character embeddings from a smaller BiLSTM or CNN, feeds the sequence through a deeper BiLSTM, and decodes under a CRF transition matrix. The CRF layer learns a matrix of transition scores between label pairs (how likely is a B-LOC to follow an I-ORG, for example) and uses the Viterbi algorithm at inference to find the globally most probable tag sequence given both emission scores and transition scores. This global normalization is the key advantage over greedy per-token softmax decoders: a trained CRF rarely produces illegal label sequences (such as I-LOC directly after O) because such transitions learn strongly negative scores, and they can be excluded outright by fixing those scores to a large negative constant.
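Viterbi decoding over emission and transition scores can be sketched without any library. The scores below are invented for illustration, the label order `["O", "B-LOC", "I-LOC"]` is an assumption, and start and end transitions are omitted for brevity.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label index sequence.

    emissions[t][j]: score of label j at position t.
    transitions[i][j]: score of moving from label i to label j.
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_labels):
            # best previous label for arriving at label j
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    best = max(range(n_labels), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(backptr):      # follow back-pointers from the last position
        path.append(ptr[path[-1]])
    return path[::-1]


labels = ["O", "B-LOC", "I-LOC"]
emissions = [[2.0, 0.1, 0.0], [0.0, 1.0, 1.2], [0.0, 0.2, 1.5]]
transitions = [
    [0.0, 0.0, -1e4],  # from O: moving to I-LOC is heavily penalized
    [0.0, 0.0, 0.0],   # from B-LOC
    [0.0, 0.0, 0.0],   # from I-LOC
]
greedy = [max(range(len(row)), key=row.__getitem__) for row in emissions]
print([labels[i] for i in greedy])
print([labels[i] for i in viterbi(emissions, transitions)])
```

Greedy per-token decoding picks I-LOC at the second position because its emission score is slightly higher, producing the ill-formed O, I-LOC, I-LOC; the transition penalty steers Viterbi to O, B-LOC, I-LOC instead.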
Parts of spaCy and Flair still use BiLSTM components because they run faster than large Transformers on CPU. Character-level embeddings from a CNN or a smaller BiLSTM give morphological signal for out-of-vocabulary words, which is especially helpful for languages with complex inflection.
The dominant 2020s architecture is an encoder Transformer with a linear softmax head over each token's contextual representation. The encoder (BERT-base, RoBERTa, DeBERTa-v3, or XLM-RoBERTa) converts the input tokens into dense contextual vectors; the classification head maps each vector to a logit for each label in the tag vocabulary. At fine-tuning, the cross-entropy loss over labeled tokens is backpropagated through both the head and the encoder. On CoNLL-2003 English, BERT-base reaches entity-level F1 in the low 90s; larger and more recent encoders such as DeBERTa-v3-large push past 93.
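The classification head and its loss are simple enough to write out in plain Python. This is a sketch under stated assumptions: `linear_head` and `masked_cross_entropy` are hypothetical names, the weights are toy values rather than trained parameters, and the -100 ignore value follows the convention commonly used with PyTorch rather than anything specified in this article.

```python
import math


def linear_head(hidden, weight, bias):
    """Map one contextual vector to one logit per label: logits = W h + b."""
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weight, bias)]


def masked_cross_entropy(logits_per_token, gold, ignore_index=-100):
    """Mean cross-entropy over tokens whose gold label is not ignore_index."""
    total, count = 0.0, 0
    for logits, y in zip(logits_per_token, gold):
        if y == ignore_index:
            continue                                   # e.g. non-first subword pieces
        m = max(logits)                                # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_z - logits[y]                     # -log softmax(logits)[y]
        count += 1
    return total / count if count else 0.0


# Toy setup: hidden size 2, three labels.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
hidden_states = [[1.0, 2.0], [0.5, -1.0]]
logits = [linear_head(h, weight, bias) for h in hidden_states]
print(logits)
print(masked_cross_entropy(logits, [2, -100]))
```

In a real fine-tuning run the gradient of this loss would flow into both the head weights and the encoder that produced the hidden states; the sketch only computes the forward value.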
Adding a linear-chain CRF layer atop the Transformer still provides a small but consistent boost when the label set is dense, training data is limited, or the schema includes rare tag transitions. The additional parameters are few and the computational overhead is negligible, so many practitioners include the CRF by default.
Span-based models predict labels for candidate spans directly instead of tagging token by token. Every possible span (up to some maximum length) is enumerated, a span representation is constructed (typically by concatenating the contextual representations of the start and end tokens and sometimes adding a width embedding), and a classifier scores each span against the entity types plus a null class. This formulation, popularized by Lee et al. (2017) for coreference, was adapted for NER and handles nested entities naturally: two spans with overlapping boundaries simply receive independent scores. Span models are evaluated on GENIA, ACE 2004, and ACE 2005, achieving F1 in the upper 80s on those nested-entity corpora.
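The enumerate-then-score formulation can be sketched as follows. The function names, the `NONE` null label, and the scores are assumptions for illustration; in a real model the per-span scores would come from a classifier over span representations rather than a hand-written dictionary.

```python
def enumerate_spans(n_tokens, max_width):
    """List every (start, end_exclusive) span of at most max_width tokens."""
    return [
        (s, e)
        for s in range(n_tokens)
        for e in range(s + 1, min(s + max_width, n_tokens) + 1)
    ]


def decode_spans(span_scores, null_label="NONE"):
    """Keep each span whose best-scoring label is not the null class.

    span_scores: {(start, end): {label: score}}. Spans are scored
    independently, so nested or overlapping spans can both be kept.
    """
    kept = []
    for (start, end), scores in sorted(span_scores.items()):
        best = max(scores, key=scores.get)
        if best != null_label:
            kept.append((start, end, best))
    return kept


tokens = ["human", "IL-2", "gene", "promoter"]
print(len(enumerate_spans(len(tokens), max_width=3)))

# Hand-written scores for three of the candidate spans (illustrative only).
scores = {
    (0, 1): {"NONE": 1.0, "DNA": 0.2},
    (1, 2): {"PROTEIN": 2.0, "NONE": 0.1},
    (1, 4): {"DNA": 1.5, "NONE": 0.3},
}
print(decode_spans(scores))
```

The span (1, 2) sits inside (1, 4), and both survive decoding with different types, which a single layer of per-token tags could not represent. The number of candidates grows with sentence length times the maximum width, which is why the width is usually capped.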
LLM-based extraction frames the task as generation. The model receives a schema and returns structured output, typically JSON. GoLLIE (Sainz et al. 2023) is fine-tuned on Code-LLaMA with annotation guidelines expressed as Python class docstrings, allowing it to follow novel extraction schemas described only at inference time. UniNER (Zhou et al. 2023) distills GPT-4 outputs into a smaller targeted model. Both perform well in zero-shot regimes, though fine-tuning a domain-specific encoder still dominates when ample labeled data is available.
GLiNER (Zaratiana et al. 2023) takes a middle path: it is an encoder-based generalist model that accepts entity type labels as input alongside the text, encodes them in the same latent space as the document tokens, and then scores candidate spans against each provided label in parallel. This architecture is much smaller and faster than an LLM while generalizing to entity types seen only at inference time. Its authors report that GLiNER outperforms both ChatGPT and larger instruction-tuned models in zero-shot NER evaluations on held-out benchmarks.
Retrieval-augmented NER conditions the encoder on retrieved examples or a knowledge base, helping with rare entities and domain shift. Gazetteer features (lookup tables of known entity names) have been used since the CRF era; modern versions embed gazetteer entries and inject them as soft features into Transformer attention. Retrieval-augmented methods that pull relevant training examples at inference time show gains on low-resource and emerging-entity settings.
| Model | Year | Approach | Notes |
|---|---|---|---|
| Brill tagger | 1992 | Transformation rules | Interpretable POS baseline |
| Stanford NER | 2005 | Linear-chain CRF | Gibbs sampling for non-local features |
| BiLSTM-CRF | 2015 | BiLSTM encoder + CRF | First strong neural sequence tagger |
| Flair | 2018 | Character LM + BiLSTM-CRF | Akbik et al., Zalando Research |
| spaCy | 2015+ | CNN tok2vec + transition NER | Production library, en_core_web_lg |
| BERT | 2018 | Transformer encoder + linear head | Devlin et al., near 92 F1 on CoNLL-2003 |
| BioBERT | 2019 | BERT pretrained on PubMed | Lee et al., gain on BC5CDR, NCBI Disease |
| SciBERT | 2019 | BERT pretrained on scientific papers | Beltagy et al., 1.14M Semantic Scholar papers |
| Clinical BERT | 2019 | BERT pretrained on MIMIC III | Alsentzer et al., clinical NER |
| XLM-RoBERTa | 2020 | Multilingual Transformer | Conneau et al., 100 languages, +2.4 F1 on NER over mBERT |
| DeBERTa-v3 | 2021 | ELECTRA-style pretraining | He et al., strong encoder for token classification |
| PubMedBERT | 2021 | BERT trained from scratch on PubMed | Gu et al., BLURB SOTA across 13 biomedical NLP tasks |
| GoLLIE | 2023 | Code-LLaMA fine-tuned on guidelines | Sainz et al., strong zero-shot IE |
| UniNER | 2023 | Instruction tuning + GPT-4 distillation | Zhou et al., targeted NER LLM |
| GLiNER | 2023 | Bidirectional encoder + label encoding | Zaratiana et al., generalist zero-shot NER |
| NuNER | 2024 | RoBERTa fine-tuned on LLM-annotated data | Bogdanov et al., LLM-bootstrapped pretraining |
| ModernBERT | 2024 | Modernized BERT with RoPE and long context | Warner et al., 8192-token context, 2-4x faster |
Stanford NER, spaCy, and Flair remain in active production use. Hugging Face hosts thousands of fine-tuned BERT, RoBERTa, and XLM-R checkpoints for NER across languages and domains.
NER and POS benchmarks fall into general-domain, social media, biomedical, and multilingual categories.
| Benchmark | Year | Task | Notes |
|---|---|---|---|
| CoNLL-2003 | 2003 | NER (English, German) | Tjong Kim Sang and De Meulder; PER, ORG, LOC, MISC; ~93 F1 near ceiling |
| OntoNotes 5.0 | 2013 | NER, coref, SRL | 18 entity types; English, Chinese, Arabic |
| WNUT-2017 | 2017 | Emerging entities | Twitter, Reddit, Stack Overflow; top F1 around 50 |
| Few-NERD | 2021 | Few-shot NER | 8 coarse, 66 fine types, 188K sentences |
| ACE 2005 | 2005 | NER, relation, event | Multi-genre English; nested entity evaluation |
| MultiCoNER | 2022 | Multilingual complex NER | 11 languages, 2.3M instances |
| BC5CDR | 2016 | Biomedical NER | Chemicals and diseases |
| NCBI Disease | 2014 | Biomedical NER | Disease mentions in PubMed abstracts |
| JNLPBA | 2004 | Biomedical NER | Genes and proteins |
| GENIA | 2003 | Biomedical NER | Molecular biology corpus; nested entities |
| GermEval 2014 | 2014 | NER (German) | Nested entities |
| Penn Treebank | 1993 | POS tagging | English newswire, 45 tags; >97% token accuracy |
| Universal Dependencies | 2014+ | POS, morphology, parsing | 200+ treebanks, 160+ languages |
| CoNLL-2000 | 2000 | Chunking | English Wall Street Journal, NP/VP/PP |
| ATIS | 1990s | Slot filling | Airline travel utterances |
| SNIPS | 2018 | Slot filling | 7 intent domains, diverse slot types |
| CleanCoNLL | 2023 | Corrected NER | Re-annotation of CoNLL-2003; SOTA reaches 97.1 F1 |
The canonical NER metric is span-level F1, computed on exact match of entity boundaries and types. A predicted span is a true positive only if its start index, its end index, and its type label all match the gold annotation. The conlleval script, distributed with the CoNLL shared tasks, implements this evaluation and remains the standard tool.
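Exact-match span F1 reduces to set intersection once entities are represented as tuples. The sketch below is not the conlleval script; `entity_f1` is a hypothetical helper, and the `(start, end, type)` tuples are invented. For a multi-sentence corpus, a sentence identifier would need to be part of each tuple.

```python
def entity_f1(gold, pred):
    """Exact-match precision, recall, and F1 over (start, end, type) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # boundaries and type must all match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [(0, 2, "PER"), (5, 6, "LOC"), (8, 10, "ORG")]
pred = [(0, 2, "PER"), (5, 6, "ORG"), (8, 9, "ORG")]
print(entity_f1(gold, pred))
```

Only the PER span counts as correct here: the second prediction has the right boundaries but the wrong type, and the third has the right type but a truncated boundary.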
Partial matching schemes (relaxed boundaries, type-only, boundary-only) are also used in clinical and legal NER where exact boundary agreement is harder to achieve. Biomedical work often reports both exact and relaxed matching.
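One possible relaxed scheme counts a prediction as matched when it overlaps a gold span. Definitions of relaxed matching vary between papers, so the sketch below is only one reasonable variant; `overlaps` and `relaxed_precision_recall` are hypothetical names.

```python
def overlaps(a, b):
    """True if half-open spans a and b share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def relaxed_precision_recall(gold, pred, require_type=True):
    """Overlap-based matching; set require_type=False for boundary-only scoring."""
    def match(g, p):
        return overlaps(g, p) and (g[2] == p[2] or not require_type)

    matched_pred = sum(any(match(g, p) for g in gold) for p in pred)
    matched_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall


gold = [(8, 10, "ORG"), (5, 6, "LOC")]
pred = [(8, 9, "ORG"), (5, 6, "ORG")]
print(relaxed_precision_recall(gold, pred))
print(relaxed_precision_recall(gold, pred, require_type=False))
```

Under exact matching neither prediction would count; with relaxed boundaries the truncated ORG span is accepted, and with boundary-only scoring the mistyped span is accepted as well.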
POS tagging is reported as token accuracy: the fraction of tokens whose predicted tag matches the gold tag. Strong models exceed 97 percent on English Penn Treebank and 98 percent on Universal Dependencies with transformer encoders.
Slot filling is measured by slot-level F1 and intent accuracy. Joint models are evaluated on both simultaneously.
Because an estimated 5 to 7 percent of CoNLL-2003 labels contain annotation errors, the test F1 ceiling near 93 that became apparent by the early 2020s reflects noise in the reference labels rather than a model limitation. CoNLL++ (Wang et al. 2019) corrected 309 labels in the test set; CleanCoNLL (Rücker and Akbik 2023) performed a comprehensive relabeling, allowing SOTA models to reach 97.1 F1.
From around 2023, large language models such as GPT-4 and Claude shifted part of the field toward prompt-based extraction. With structured output features such as JSON mode and tool use, an LLM can read a passage and return a list of entities, even with custom types described only in the prompt. Performance in zero-shot settings on novel domains can match or exceed fine-tuned encoders when the target schema is unusual or labeled data is scarce. Open-weight systems built for this setting include GoLLIE and UniNER (both 2023) and GLiNER (Zaratiana et al. 2023; published at NAACL 2024).
For well-resourced domains with stable schemas, fine-tuned encoder models retain advantages in throughput, latency, and cost. A BERT-base tagger processes thousands of sentences per second on a single GPU, while an LLM extractor producing JSON runs orders of magnitude slower. Production stacks often mix the two: an LLM bootstraps labels on a new domain, and a distilled encoder serves traffic.
Encoder models continued to advance in this period. DeBERTa-v3, released by Microsoft in 2021, uses ELECTRA-style replaced-token-detection pretraining with gradient-disentangled embedding sharing, and is reported to score consistently near the top of NER benchmarks with only 60 to 70 percent of the pretraining data required by earlier models. ModernBERT (Warner et al. 2024) refreshed the BERT architecture with rotary positional embeddings, alternating local-global attention, and training on 2 trillion tokens, delivering 2 to 4 times the inference throughput of older encoders while extending the effective context window to 8192 tokens.
NuNER (Bogdanov et al. 2024) demonstrated that pretraining a RoBERTa model on a large general corpus automatically annotated by GPT-3.5 yields strong few-shot NER performance, providing a practical path to bootstrapping new domains without manual annotation.
| Domain | Use case | Representative tools |
|---|---|---|
| Search and information retrieval | NER-based query understanding and vertical routing | spaCy, custom BERT taggers |
| Biomedical text mining | Extraction of genes, proteins, chemicals, diseases from PubMed | BioBERT, PubMedBERT, BLURB models |
| Clinical informatics | Problem list population, medication extraction from discharge summaries | Clinical BERT, i2b2 models |
| Legal and financial analysis | Tagging parties, dates, monetary amounts in contracts and filings | Legal-BERT, FinBERT fine-tunes |
| Voice assistants | Slot filling for flight booking, restaurant reservation, calendar events | BERT joint intent and slot models |
| Content moderation | Detection of PII, addresses, sensitive terms | Presidio (Microsoft), custom NER |
| Machine translation | Named entity handling, transliteration decisions | XLM-R based NER in MT pipelines |
| Knowledge base construction | Relation extraction bootstrapped by entity detection | Stanford NER, spaCy pipelines |
Search engines and enterprise indexes use NER to identify queries about people, places, and products, then route to specialized verticals or populate knowledge panels. Biomedical tools extract genes, proteins, chemicals, diseases, and adverse events from PubMed and clinical trial registries at scale: PubMed alone has more than 36 million citations, making automated extraction essential. Hospital systems run NER over discharge summaries to populate problem lists, medication histories, and quality metrics, reducing manual coding burden. Voice assistants rely on slot filling as the backbone of their natural language understanding stack; a single slot filling model typically handles dozens of intent domains with hundreds of slot types.
Domain shift is the largest practical challenge. A CoNLL-2003 newswire model often drops more than 10 F1 points on tweets or clinical notes; in-domain pretraining recovers most of the loss. The WNUT-2017 benchmark, built from Twitter, Reddit, and Stack Overflow text, captures this: the best systems reach only around 50 F1 on emerging entities in that noisy, informal domain. Domain-adaptive pretraining and targeted annotation remain the primary mitigations.
Nested entities (one entity inside another, common in biomedical text) cannot be expressed in a single layer of BIO tags and are usually handled with span-based decoders. In GENIA and clinical corpora, a protein mention may be nested inside a gene family mention, or a disease name may be nested inside a treatment phrase. Span-based models handle this naturally; BIO-based taggers require post-processing heuristics or separate passes over the text. The prevalence of nesting in specialized domains means that span models are preferred for biomedical and legal NER despite their higher computational cost.
Label noise is common in crowdsourced corpora, and audits of CoNLL-2003 have identified annotation errors that limit further F1 gains. Corrected test sets such as CoNLL++ and CleanCoNLL have been proposed and show that a significant fraction of the apparent performance ceiling is due to label noise rather than model limitations. Generalization across schemas remains weak: a model trained on OntoNotes does not produce CoNLL or biomedical types without retraining, although schema-aware LLMs reduce that friction.
Low-resource languages have few labeled examples; cross-lingual transfer through XLM-R helps but does not close the gap to high-resource languages. Universal Dependencies treebanks provide POS data for more than 160 languages, enabling cross-lingual POS transfer; but for NER, most languages outside of English, Chinese, German, Spanish, and Dutch have sparse corpora. Zero-shot transfer from multilingual encoders, combined with a small number of in-language labeled examples, currently gives the best results for low-resource NER.
Standard token classifiers process sequences up to 512 tokens (BERT) or 8192 tokens (ModernBERT). Long legal or scientific documents must therefore be split into windows, often overlapping, that fit the encoder's limit. Very long entities that span multiple clauses (such as long medication dosage descriptions or complex legal clauses) are hard for left-to-right or windowed decoders. Hierarchical encoders and sparse attention help but remain research-stage solutions.
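Splitting a long document to fit the encoder's length limit is often done with overlapping windows. The sketch below assumes hypothetical helpers `sliding_windows` and `merge_predictions`; the merge rule (prefer the window in which a token sits farthest from an edge) is one heuristic among several, not a standard prescribed by any library.

```python
def sliding_windows(n_tokens, max_len, stride):
    """Return (start, end_exclusive) windows of at most max_len tokens."""
    if not 0 < stride <= max_len:
        raise ValueError("stride must be in 1..max_len so no token is skipped")
    windows, start = [], 0
    while True:
        end = min(start + max_len, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return windows


def merge_predictions(n_tokens, windows, window_preds):
    """For tokens covered twice, keep the label from the window with more context."""
    merged = [None] * n_tokens
    best_margin = [-1] * n_tokens
    for (start, end), preds in zip(windows, window_preds):
        for offset, label in enumerate(preds):
            pos = start + offset
            margin = min(pos - start, end - 1 - pos)   # distance to nearest window edge
            if margin > best_margin[pos]:
                merged[pos], best_margin[pos] = label, margin
    return merged


windows = sliding_windows(n_tokens=10, max_len=6, stride=4)
print(windows)
# Stand-in predictions: every token in window 0 is labeled "a", in window 1 "b".
print(merge_predictions(10, windows, [["a"] * 6, ["b"] * 6]))
```

Tokens 4 and 5 fall in both windows; each takes its label from the window where it has more surrounding context. An entity longer than the overlap can still be cut by a window boundary.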
LLM-based extraction, while flexible, introduces new failure modes: format errors (invalid JSON, missing fields), hallucinated entity boundaries, and sensitivity to prompt wording. Empirical studies show GPT-4 has a non-trivial invalid-response rate for complex extraction schemas, and performance degrades sharply on nested or overlapping entity types. Combining LLM flexibility with encoder-based reliability through distillation or label smoothing is an active research area.
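Some of these failure modes can be caught with a validation pass before the output is used. The sketch below assumes a hypothetical response format (a JSON list of objects with `text` and `type` keys) and a hypothetical helper `parse_entities`; real systems define their own schemas, and a substring check is only a coarse guard against hallucinated mentions.

```python
import json


def parse_entities(raw, text, allowed_types):
    """Return (entities, errors) from a response expected to be a JSON list."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [], [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, list):
        return [], ["top-level value is not a list"]

    entities, errors = [], []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            errors.append(f"item {i}: not an object")
            continue
        span, label = item.get("text"), item.get("type")
        if not isinstance(span, str) or not isinstance(label, str):
            errors.append(f"item {i}: 'text' and 'type' must be strings")
        elif label not in allowed_types:
            errors.append(f"item {i}: unknown type {label!r}")
        elif not span or span not in text:
            errors.append(f"item {i}: text not found in source")  # possible hallucination
        else:
            entities.append((span, label))
    return entities, errors


source = "Marie Curie worked in Paris."
raw = (
    '[{"text": "Marie Curie", "type": "PER"},'
    ' {"text": "London", "type": "LOC"},'
    ' {"text": "Paris", "type": "CITY"}]'
)
entities, errors = parse_entities(raw, source, {"PER", "LOC"})
print(entities)
print(errors)
```

Rejected items can be logged, retried with a corrected prompt, or dropped, depending on how much recall the application can afford to lose.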