Named entity recognition (NER), also known as entity identification or entity extraction, is a subtask of information extraction in natural language processing that seeks to locate and classify named entities in unstructured text into predefined categories such as person names, organizations, locations, dates, monetary values, and other entity types. NER serves as a foundational step in many NLP pipelines, providing structured information that downstream tasks like relation extraction, question answering, and knowledge graph construction depend on.
The task can be formally stated as follows: given a sequence of tokens (words or subwords), assign each token a label indicating whether it is part of a named entity and, if so, which category that entity belongs to. For example, in the sentence "Barack Obama was born in Honolulu on August 4, 1961," a NER system should identify "Barack Obama" as a person (PER), "Honolulu" as a location (LOC), and "August 4, 1961" as a date (DATE).
NER has evolved from rule-based systems in the early 1990s through statistical methods like hidden Markov models and conditional random fields, to modern deep learning approaches based on recurrent neural networks and Transformers. The field continues to advance with few-shot and zero-shot approaches powered by large language models.
The origins of NER as a formal research task trace back to the Message Understanding Conferences (MUC), a series of evaluations funded by the U.S. Defense Advanced Research Projects Agency (DARPA) during the late 1980s and 1990s. These conferences focused on information extraction, aiming to pull structured data from unstructured text sources such as newspaper articles and military dispatches.
The NER task was formally introduced at the Sixth Message Understanding Conference (MUC-6) in November 1995, organized by Beth Sundheim of the U.S. Navy's NRaD research center. Participants were required to identify and classify entities in English newswire text into three SGML-tagged categories:
| MUC Tag | Entity Category | Examples |
|---|---|---|
| ENAMEX | Person, Organization, Location | "John Smith" (PER), "IBM" (ORG), "Paris" (LOC) |
| TIMEX | Date, Time | "January 5, 1999", "3:00 PM" |
| NUMEX | Money, Percent | "$500", "25%" |
The term "named entity" itself was coined in 1996 by Ralph Grishman and Beth Sundheim in their paper describing the MUC-6 evaluation. MUC-7, held in 1998, continued and refined the task. These early evaluations established the person, organization, and location categories that remain central to NER today.
Early systems participating in MUC relied heavily on hand-crafted rules, gazetteers (lists of known entity names), and pattern-matching heuristics. While effective in narrow domains, these rule-based approaches were labor-intensive to build and did not generalize well to new text genres or languages.
The limitations of rule-based systems motivated researchers to explore statistical and machine learning methods. Daniel Bikel, Scott Miller, Richard Schwartz, and Ralph Weischedel introduced "Nymble" in 1997, a hidden Markov model (HMM) system for NER that learned to identify entities from annotated training data rather than relying on manually written rules. Their follow-up paper "An Algorithm that Learns What's in a Name" (1999) further refined this approach.
A major theoretical advance came in 2001 when John Lafferty, Andrew McCallum, and Fernando Pereira published "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data" at ICML. Conditional random fields (CRFs) addressed key limitations of HMMs by modeling the conditional probability of label sequences given observation sequences, avoiding the independence assumptions required by generative models. CRFs also avoided the label bias problem that affected maximum entropy Markov models (MEMMs). CRFs quickly became the dominant framework for NER and other sequence labeling tasks.
The CoNLL-2003 shared task on language-independent NER, organized by Erik Tjong Kim Sang and Fien De Meulder, became the most widely used benchmark for evaluating NER systems. The winning system by Florian et al. (2003) achieved an F1 score of 88.76% on English by combining multiple classifiers with extensive feature engineering, including word features, part-of-speech tags, chunk tags, prefix and suffix features, and large gazetteers. Chieu and Ng (2003) demonstrated the value of document-level global features in a maximum entropy framework, achieving strong results as well.
Throughout this period, most competitive NER systems relied on careful feature engineering combined with statistical classifiers such as maximum entropy models, support vector machines (SVMs), or CRFs. Features typically included word shapes, capitalization patterns, surrounding context words, gazetteers, part-of-speech tags, and character n-grams.
The first significant neural approach to NER came from Ronan Collobert, Jason Weston, and colleagues, who published "Natural Language Processing (Almost) from Scratch" in the Journal of Machine Learning Research in 2011. Their SENNA system used a convolutional neural network (CNN) trained with word embeddings to perform NER, POS tagging, chunking, and semantic role labeling in a unified architecture, achieving an F1 of 89.59 on CoNLL-2003 without task-specific feature engineering.
The real breakthrough for neural NER came in 2016 with two influential papers: "Neural Architectures for Named Entity Recognition" by Lample et al. and "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" by Ma and Hovy.
Both papers demonstrated that neural models could surpass traditional feature-engineered systems without requiring gazetteers or hand-crafted features. The BiLSTM-CRF architecture became the standard neural approach for NER.
Alan Akbik, Duncan Blythe, and Roland Vollgraf introduced Flair embeddings in 2018 with their paper "Contextual String Embeddings for Sequence Labeling," which generated contextualized character-level embeddings using a character-level language model. This approach achieved 93.09 F1 on CoNLL-2003, setting a new state of the art at the time. Their follow-up work on pooled contextualized embeddings (Akbik, Bergmann, and Vollgraf, 2019) further improved results.
The introduction of BERT (Bidirectional Encoder Representations from Transformers) by Jacob Devlin and colleagues in 2018 transformed NER along with most other NLP tasks. BERT-Large achieved 92.8 F1 on CoNLL-2003 by fine-tuning a pre-trained Transformer encoder for token classification, using no task-specific architecture beyond a simple linear classification layer on top of BERT's token representations.
SpanBERT, introduced by Mandar Joshi and colleagues in 2019, improved on BERT by masking contiguous spans rather than random tokens during pre-training and training span boundary representations to predict masked content. This span-centric approach proved particularly effective for tasks involving multi-token entities.
Subsequent pre-trained models like RoBERTa, ALBERT, XLNet, and DeBERTa continued to push NER performance higher. By the early 2020s, the best systems on CoNLL-2003 surpassed 94 F1, with some models approaching 95-96 F1. However, researchers observed a plateau in benchmark performance, partly attributed to annotation noise in the original CoNLL-2003 dataset; studies found significant annotation errors and inconsistencies that set an effective ceiling on measurable progress.
Different NER benchmarks and annotation schemes define different sets of entity categories. The most common schemes are:
| Scheme | Entity Types | Number of Types | Source |
|---|---|---|---|
| CoNLL-2003 | PER, ORG, LOC, MISC | 4 | Reuters newswire (Tjong Kim Sang and De Meulder, 2003) |
| MUC-6/7 | PERSON, ORGANIZATION, LOCATION, DATE, TIME, MONEY, PERCENT | 7 | Newswire (Grishman and Sundheim, 1996) |
| OntoNotes 5.0 | PERSON, ORG, GPE, LOC, FAC, EVENT, PRODUCT, WORK_OF_ART, LAW, LANGUAGE, NORP, DATE, TIME, MONEY, QUANTITY, ORDINAL, CARDINAL, PERCENT | 18 | Multiple genres (Weischedel et al., 2013) |
| ACE 2005 | PERSON, ORG, GPE, LOC, FAC, VEHICLE, WEAPON | 7 | Newswire, broadcast (Walker et al., 2006) |
The CoNLL-2003 scheme is the simplest and most widely benchmarked. The four types are PER (persons), ORG (organizations), LOC (locations), and MISC (miscellaneous names such as nationalities and events that do not fit the first three categories).
The OntoNotes 5.0 annotation scheme is considerably richer, with 18 entity types that include 11 named entity types and 7 value types (numerical and temporal expressions):
| Type | Description | Example |
|---|---|---|
| PERSON | People, including fictional | "Albert Einstein" |
| ORG | Companies, agencies, institutions | "Microsoft" |
| GPE | Countries, cities, states | "France", "Tokyo" |
| LOC | Non-GPE locations: mountains, bodies of water | "the Nile River" |
| FAC | Facilities: buildings, airports, highways | "the Golden Gate Bridge" |
| EVENT | Named events: wars, sports events | "World War II" |
| PRODUCT | Objects, vehicles, foods (not services) | "iPhone" |
| WORK_OF_ART | Titles of books, songs, etc. | "Hamlet" |
| LAW | Named documents made into laws | "the Constitution" |
| LANGUAGE | Any named language | "French" |
| NORP | Nationalities, religious or political groups | "Republican", "Buddhist" |
| DATE | Absolute or relative dates | "June 2024", "yesterday" |
| TIME | Times smaller than a day | "3:00 PM" |
| MONEY | Monetary values | "$500 million" |
| QUANTITY | Measurements | "100 kilometers" |
| ORDINAL | Ordinal numbers | "first", "third" |
| CARDINAL | Cardinal numbers not covered by other types | "three", "1,000" |
| PERCENT | Percentage values | "25%" |
Many specialized domains define their own entity categories beyond the standard types. Biomedical NER, for example, targets genes, proteins, diseases, and chemicals; legal NER targets case names, statutes, and courts; financial NER targets companies, ticker symbols, and financial instruments.
NER is typically formulated as a sequence labeling task where each token in a sentence receives a tag indicating its role relative to entity boundaries. Several tagging schemes have been developed:
| Scheme | Tags per Entity Type | Description |
|---|---|---|
| IO | I, O | Only marks tokens inside (I) or outside (O) entities; cannot distinguish adjacent entities of the same type |
| IOB1 (IOB) | I, O, B | B tag used only when two entities of the same type are adjacent |
| IOB2 (BIO) | B, I, O | Every entity begins with B, continues with I; the most widely used scheme |
| BIOES (BILOU) | B, I, O, E, S | Adds End (E) and Single (S) tags for richer boundary information |
In the BIO scheme (the most common), "B-PER" indicates the beginning of a person entity, "I-PER" indicates a continuation token within that entity, and "O" indicates a token that is not part of any entity. For example:
Barack B-PER
Obama I-PER
was O
born O
in O
Honolulu B-LOC
The BIOES/BILOU scheme provides additional supervision signals and has been shown to yield small F1 improvements on some benchmarks, though BIO remains the default in most implementations.
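Decoding a BIO tag sequence back into entity spans is a common utility in NER pipelines. A minimal sketch (spans are (start, end, type) triples with exclusive end; the function name and span convention are illustrative choices):

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (start, end, type) entity spans.

    `end` is exclusive. An I- tag that starts an entity (common in noisy
    predictions) is tolerated by treating it like B-.
    """
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if etype is not None:          # close the previous entity
                spans.append((start, i, etype))
            start, etype = i, tag[2:]
        elif tag == "O":
            if etype is not None:
                spans.append((start, i, etype))
            start, etype = None, None
        # an I- tag continuing the current entity needs no action
    if etype is not None:                  # close an entity at sentence end
        spans.append((start, len(tags), etype))
    return spans

# The example sentence from above:
spans = bio_to_spans(["B-PER", "I-PER", "O", "O", "O", "B-LOC"])
# spans == [(0, 2, "PER"), (5, 6, "LOC")]
```

The inverse direction (spans to tags) is equally mechanical, which is why libraries such as seqeval accept tag sequences directly.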
Hidden Markov models (HMMs) were among the earliest statistical approaches to NER. An HMM models the joint probability of observation sequences (words) and label sequences (entity tags) by assuming that each label depends only on the previous label (the Markov property) and each observation depends only on its corresponding label. The Viterbi algorithm is used to find the most likely label sequence for a given input.
Bikel et al.'s Nymble system (1997, 1999) was one of the first HMM-based NER systems. It demonstrated that statistical models trained on annotated data could achieve competitive performance without hand-crafted rules. However, HMMs have limitations for NER: their independence assumptions prevent them from using rich, overlapping features of the input, and they model the joint probability P(X, Y) rather than the conditional probability P(Y|X) that is more directly relevant to the labeling task.
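The Viterbi decoding step mentioned above can be sketched for a toy two-state HMM; the transition and emission probabilities here are invented purely for illustration:

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence under an HMM, computed in log space."""
    V = [{s: log_start[s] + log_emit[s].get(obs[0], -math.inf) for s in states}]
    back = []
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # best predecessor state for s at position t
            prev = max(states, key=lambda p: V[t - 1][p] + log_trans[p][s])
            back[-1][s] = prev
            V[t][s] = (V[t - 1][prev] + log_trans[prev][s]
                       + log_emit[s].get(obs[t], -math.inf))
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for bp in reversed(back):              # follow backpointers
        path.append(bp[path[-1]])
    return path[::-1]

# Toy parameters (invented): one entity state, one outside state.
lg = math.log
states = ["PER", "O"]
log_start = {"PER": lg(0.3), "O": lg(0.7)}
log_trans = {"PER": {"PER": lg(0.6), "O": lg(0.4)},
             "O": {"PER": lg(0.2), "O": lg(0.8)}}
log_emit = {"PER": {"Obama": lg(0.8), "spoke": lg(0.05)},
            "O": {"Obama": lg(0.05), "spoke": lg(0.8)}}
path = viterbi(["Obama", "spoke"], states, log_start, log_trans, log_emit)
# path == ["PER", "O"]
```

Real HMM NER systems estimate these probabilities from annotated data; the decoding algorithm is the same.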
Maximum entropy (MaxEnt) classifiers, also known as multinomial logistic regression, address some limitations of HMMs by allowing arbitrary, overlapping features. Andrew McCallum, Dayne Freitag, and Fernando Pereira introduced maximum entropy Markov models (MEMMs) for sequence labeling in 2000, which combined the feature flexibility of MaxEnt with sequential modeling. However, MEMMs suffer from the label bias problem, where states with fewer outgoing transitions effectively ignore their observations.
Conditional random fields (CRFs), introduced by Lafferty, McCallum, and Pereira in 2001, became the dominant model for NER for over a decade. A linear-chain CRF models the conditional probability of a label sequence Y given an observation sequence X:
P(Y|X) = (1/Z(X)) * exp( Σ_t Σ_k λ_k f_k(y_{t-1}, y_t, X, t) )

where the f_k are feature functions over adjacent labels and the input, the λ_k are learned weights, and Z(X) is the partition function that sums the same exponentiated score over all possible label sequences.
CRFs combine the advantages of MaxEnt models (arbitrary, overlapping features) with global normalization over the entire sequence, which avoids the label bias problem. They also allow researchers to incorporate a wide range of features: word identities, prefixes, suffixes, capitalization patterns, part-of-speech tags, gazetteer membership, word shapes, and context window features.
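To make global normalization concrete, the toy linear-chain CRF below computes P(Y|X) by brute-force enumeration of all label sequences. The feature functions and weights are invented for illustration; real systems compute Z(X) efficiently with the forward algorithm rather than by enumeration:

```python
import itertools
import math

def crf_score(y, x, weights, feats):
    """Unnormalized score: sum of weighted feature functions over (y, x)."""
    return sum(w * f(y, x) for w, f in zip(weights, feats))

def crf_prob(y, x, labels, weights, feats):
    """P(y | x) under a CRF, with Z(x) computed by brute-force enumeration."""
    z = sum(math.exp(crf_score(yp, x, weights, feats))
            for yp in itertools.product(labels, repeat=len(x)))
    return math.exp(crf_score(y, x, weights, feats)) / z

# Toy feature set (invented for illustration):
labels = ("O", "PER")
feats = [
    # capitalized token labeled PER
    lambda y, x: sum(1 for t, tok in zip(y, x) if tok[0].isupper() and t == "PER"),
    # adjacent PER-PER transition (rewards multi-token names)
    lambda y, x: sum(1 for a, b in zip(y, y[1:]) if a == b == "PER"),
    # lowercase token labeled PER (penalized by its negative weight)
    lambda y, x: sum(1 for t, tok in zip(y, x) if not tok[0].isupper() and t == "PER"),
]
weights = [2.0, 1.0, -3.0]
x = ["Barack", "Obama", "spoke"]
best = max(itertools.product(labels, repeat=len(x)),
           key=lambda yp: crf_prob(yp, x, labels, weights, feats))
# best == ("PER", "PER", "O")
```

Because Z(X) normalizes over whole sequences rather than per position, no state can "ignore" its observations, which is exactly how CRFs escape the label bias problem.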
For NER specifically, CRF-based systems achieved the best results throughout the 2000s and early 2010s. Key features used in CRF NER systems included word identities and lowercased forms, prefixes and suffixes, capitalization patterns and word shapes, part-of-speech and chunk tags, gazetteer membership, and context-window words.
Support vector machines (SVMs) were also applied to NER, typically in a token-by-token classification setup with features similar to those used in CRFs. While SVMs performed well on individual token classification, they lacked the ability to model label dependencies across the sequence natively, which CRFs handled through their graphical model structure.
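A word-shape feature of the kind used in these feature-engineered systems can be computed by collapsing character classes, so that unseen names generalize to seen ones with the same shape:

```python
import re

def word_shape(token):
    """Collapse character classes: 'Obama' -> 'Xx', 'IBM-7' -> 'X-d'."""
    shape = re.sub(r"[A-Z]+", "X", token)   # runs of uppercase -> X
    shape = re.sub(r"[a-z]+", "x", shape)   # runs of lowercase -> x
    return re.sub(r"[0-9]+", "d", shape)    # runs of digits    -> d
```

With this feature, a classifier that has seen "Obama" (shape "Xx") labeled PER gets useful evidence about a never-before-seen "Zuckerberg", which shares the same shape.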
The application of word embeddings to NER began with Collobert and Weston's work (2008, 2011), which showed that pre-trained word vectors could replace hand-crafted features. Their SENNA system used a CNN with a CRF-like sentence-level objective, achieving 89.59 F1 on CoNLL-2003 without gazetteers or feature engineering.
The development of Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) provided high-quality pre-trained word embeddings that significantly improved neural NER systems.
The BiLSTM-CRF architecture, introduced independently by Lample et al. (2016) and Ma and Hovy (2016), became the standard neural approach for NER. The architecture consists of three main components: a character-level encoder (a BiLSTM or CNN) that builds word representations from characters, a word-level BiLSTM that processes the concatenation of character-derived and pre-trained word embeddings, and a CRF output layer that models dependencies between adjacent labels.
The key differences between the two 2016 papers:
| Component | Lample et al. (2016) | Ma and Hovy (2016) |
|---|---|---|
| Character encoder | BiLSTM | CNN |
| Word embeddings | Skip-n-gram | GloVe |
| CoNLL-2003 F1 | 90.94 | 91.21 |
| Paper title | "Neural Architectures for Named Entity Recognition" | "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" |
Later work showed no significant difference in performance between the two character representation approaches.
The introduction of contextualized word representations marked the next major advance. ELMo (Peters et al., 2018) used a bidirectional language model to produce word representations that varied based on context, improving NER performance when used as additional input features alongside traditional word embeddings.
Akbik et al.'s Flair embeddings (2018) took a different approach by training character-level language models and extracting word-level representations from the hidden states at word boundaries. This method achieved 93.09 F1 on CoNLL-2003, a substantial improvement over previous methods.
Transformer-based pre-trained models fundamentally changed NER by providing rich, deeply contextualized token representations. The standard approach is to treat NER as a token classification task: the input is split into subword tokens, encoded by the pre-trained Transformer, and a linear classification layer predicts a BIO label for each token (typically using only the first subword of each word), with the whole model fine-tuned end to end.
BERT (Devlin et al., 2019) achieved 92.8 F1 on CoNLL-2003 with this simple approach. Subsequent models have continued to improve:
| Model | Year | CoNLL-2003 F1 | Key Innovation |
|---|---|---|---|
| Florian et al. (best at CoNLL shared task) | 2003 | 88.76 | Classifier combination with gazetteers |
| Collobert et al. (SENNA) | 2011 | 89.59 | First neural approach; CNN with word embeddings |
| Lample et al. (BiLSTM-CRF) | 2016 | 90.94 | BiLSTM-CRF with character-level BiLSTM |
| Ma and Hovy (BiLSTM-CNN-CRF) | 2016 | 91.21 | BiLSTM-CRF with character-level CNN |
| Peters et al. (ELMo) | 2018 | 92.22 | Contextualized word representations |
| Akbik et al. (Flair) | 2018 | 93.09 | Contextual string embeddings |
| Devlin et al. (BERT-Large) | 2019 | 92.80 | Pre-trained Transformer encoder |
| Baevski et al. (cloze-driven pre-training) | 2019 | 93.50 | Bi-directional cloze-style pre-training |
| Yamada et al. (LUKE) | 2020 | 94.30 | Entity-aware pre-training |
| Wang et al. (ACE + BERT) | 2021 | 94.60 | Automated concatenation of embeddings |
Note: F1 scores above 93-94 should be interpreted cautiously, as annotation noise in the original CoNLL-2003 test set introduces an effective ceiling. Studies by Reiss et al. (2020) identified numerous labeling errors in the dataset.
An alternative to token-level sequence labeling is to directly classify text spans as entities. Instead of assigning a BIO tag to each token, span-based methods enumerate candidate spans and classify each one. This approach has several advantages: it handles nested and overlapping entities naturally (each span is classified independently), it allows span-level features such as length and boundary words, and it cannot produce invalid tag sequences (such as an I- tag with no preceding B-).
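Span enumeration, the first step of such methods, can be sketched as follows (the function name and width limit are illustrative):

```python
def enumerate_spans(tokens, max_width=4):
    """All candidate (start, end) spans up to max_width tokens, end exclusive."""
    return [(i, j)
            for i in range(len(tokens))
            for j in range(i + 1, min(i + max_width, len(tokens)) + 1)]
```

A sentence of n tokens yields on the order of n * max_width candidates, each of which a classifier then scores as one of the entity types or as "not an entity".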
SpanBERT (Joshi et al., 2019) improved pre-training for span-centric tasks by masking contiguous spans rather than random tokens and training boundary representations to predict masked span content. While originally evaluated on question answering and coreference resolution, its span-oriented design is well suited for NER.
The rise of large language models has opened new possibilities for NER in low-resource settings where labeled training data is scarce or unavailable.
In zero-shot NER, the model must identify entities without any task-specific training examples. This is typically accomplished by prompting an LLM with a natural language description of the entity types to extract. For instance, a prompt might instruct the model: "Identify all person names, organization names, and locations in the following text."
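A minimal sketch of such a prompting workflow is shown below. The prompt wording, the JSON output format, and the function names are illustrative assumptions, and the actual LLM call is omitted; real pipelines must also cope with replies that are not valid JSON:

```python
import json

def build_ner_prompt(text, entity_types):
    """Compose a zero-shot NER prompt that requests JSON output."""
    return (
        "Identify all entities of the following types: "
        + ", ".join(entity_types)
        + '.\nAnswer with a JSON list of {"text": ..., "type": ...} objects.\n\n'
        + "Text: " + text
    )

def parse_llm_response(raw):
    """Parse the model's JSON reply into (text, type) pairs; [] if malformed."""
    try:
        return [(e["text"], e["type"]) for e in json.loads(raw)]
    except (json.JSONDecodeError, KeyError, TypeError):
        return []

prompt = build_ner_prompt("Barack Obama was born in Honolulu.", ["PER", "LOC"])
# A well-behaved model might return:
reply = '[{"text": "Barack Obama", "type": "PER"}, {"text": "Honolulu", "type": "LOC"}]'
entities = parse_llm_response(reply)
# entities == [("Barack Obama", "PER"), ("Honolulu", "LOC")]
```

Note that the model returns surface strings rather than token offsets, so a separate alignment step is needed before computing standard span-level metrics.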
Studies evaluating ChatGPT and GPT-4 for zero-shot NER have found that these models can achieve reasonable performance but typically fall short of fine-tuned specialist models. An empirical study by Xie et al. (2023) showed that ChatGPT achieved around 55-65 F1 on CoNLL-2003 in a zero-shot setting, compared to over 90 F1 for fine-tuned BERT models.
Few-shot NER provides the model with a small number of annotated examples (typically 1 to 50) to guide entity extraction. Techniques include in-context learning, where annotated examples are placed directly in the prompt; metric-based methods such as prototypical networks, which label tokens by comparing their representations to class prototypes computed from the support examples; and template- or prompt-based fine-tuning of smaller pre-trained models.
While LLM-based approaches do not yet match fine-tuned models on standard benchmarks, they offer significant practical advantages: they require no model training, can handle arbitrary entity types defined at inference time, and can be adapted to new domains with minimal effort.
The CoNLL-2003 dataset, created for the Conference on Computational Natural Language Learning shared task, is the most widely used NER benchmark. It consists of Reuters newswire articles annotated with four entity types (PER, ORG, LOC, MISC) using the IOB tagging scheme. The dataset includes training, development, and test splits. Annotation was performed by researchers at the University of Antwerp.
| Split | Sentences | Tokens |
|---|---|---|
| Train | 14,041 | 203,621 |
| Development | 3,250 | 51,362 |
| Test | 3,453 | 46,435 |
Despite its ubiquity, CoNLL-2003 has known limitations. Studies have identified significant annotation errors, with estimates suggesting 5-6% of entity annotations contain mistakes. The CleanCoNLL dataset (Rücker and Akbik, 2023) provides corrected annotations, on which state-of-the-art models achieve substantially higher F1 scores (above 97).
OntoNotes 5.0, released by the Linguistic Data Consortium (LDC), is a larger and more diverse corpus spanning multiple genres (newswire, broadcast news, broadcast conversation, web text, telephone conversation, and magazine text) and languages (English, Chinese, Arabic). It uses 18 entity types and is commonly used as a benchmark for more fine-grained NER evaluation. The English portion contains approximately 1.7 million words.
WikiNER is a multilingual NER dataset automatically created from Wikipedia using a combination of the internal structure of Wikipedia (links, categories) and a small amount of manual annotation. It covers multiple languages and uses the standard PER, ORG, LOC, MISC categories, making it useful for cross-lingual NER research.
Few-NERD, introduced by Ding et al. at ACL 2021, is the first large-scale dataset designed specifically for few-shot NER. It contains 188,200 sentences from Wikipedia with 491,711 entities annotated across 8 coarse-grained and 66 fine-grained entity types. The dataset was manually annotated by 70 annotators, with extensive quality assurance. Few-NERD includes three benchmark settings: supervised (SUP), few-shot with inter-class transfer (INTER), and few-shot with intra-class transfer (INTRA).
| Dataset | Year | Language(s) | Entity Types | Domain | Size |
|---|---|---|---|---|---|
| MUC-6 | 1995 | English | 7 (ENAMEX, TIMEX, NUMEX) | Newswire | ~30,000 words |
| CoNLL-2002 | 2002 | Spanish, Dutch | 4 (PER, ORG, LOC, MISC) | Newswire | ~300,000 tokens |
| ACE 2005 | 2005 | English, Chinese, Arabic | 7 | Multiple genres | ~300,000 words |
| WNUT-17 | 2017 | English | 6 | Social media | 5,690 sentences |
| CrossNER | 2021 | English | Domain-specific | 5 specialized domains | 5 × ~1,000 sentences |
NER systems are evaluated using precision, recall, and F1 score, but the exact definition of a "correct" prediction depends on the evaluation granularity.
Entity-level evaluation (also called span-level or strict evaluation) is the standard for NER benchmarks. A predicted entity is counted as correct only if both its span boundaries and its type exactly match a gold-standard entity. Any mismatch in either the boundaries or the type counts as both a false positive (for the predicted entity) and a false negative (for the gold entity).
This strict evaluation can be harsh: a prediction that identifies most of a multi-token entity but misses one boundary token is penalized twice (once as a false positive, once as a false negative).
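Strict entity-level scoring can be sketched in a few lines. Spans are (start, end, type) triples with exclusive end; this is a simplified illustration of what tools like conlleval compute, not their implementation:

```python
def entity_prf(gold, pred):
    """Strict entity-level precision, recall, and F1 over (start, end, type) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # only exact boundary + type matches count
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# A single boundary error is punished twice: the gold PER span is missed
# (hurting recall) and the shorter predicted PER is a false positive
# (hurting precision).
gold = [(0, 2, "PER"), (5, 6, "LOC")]
pred = [(0, 1, "PER"), (5, 6, "LOC")]
# entity_prf(gold, pred) == (0.5, 0.5, 0.5)
```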
Token-level evaluation assigns credit for each individually correct token label. This is more lenient and can overestimate system quality because a partially correct entity span still receives partial credit. Token-level evaluation is less commonly reported for NER but is sometimes used for debugging or analysis.
The conlleval script, originally developed for the CoNLL shared tasks, computes entity-level precision, recall, and F1. The seqeval Python library provides a modern implementation that supports multiple tagging schemes (IOB1, IOB2, BIOES) and computes micro-averaged, macro-averaged, and per-entity-type metrics. The Hugging Face evaluate library includes seqeval as a built-in metric for NER evaluation.
Micro-averaged F1 counts all entities equally regardless of type and is the standard metric reported on CoNLL-2003. Macro-averaged F1 computes F1 for each entity type separately and then averages across types, giving equal weight to each type regardless of its frequency. Macro averaging is more informative when entity types have very different frequencies, as in OntoNotes.
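The difference between the two averages can be illustrated with a small self-contained sketch, using (start, end, type) span triples:

```python
def span_f1(gold, pred):
    """F1 between two sets of (start, end, type) entity spans."""
    tp = len(gold & pred)
    if not gold or not pred or tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def micro_macro_f1(gold, pred):
    """Micro-F1 pools all entities; macro-F1 averages per-type F1 scores."""
    micro = span_f1(gold, pred)
    types = sorted({t for _, _, t in gold | pred})
    macro = sum(span_f1({e for e in gold if e[2] == t},
                        {e for e in pred if e[2] == t})
                for t in types) / len(types)
    return micro, macro

# The system finds both PER entities but misses the only LOC: micro-F1
# stays high (0.8) while macro-F1 drops to 0.5 because LOC scores 0.
mm_gold = {(0, 1, "PER"), (2, 3, "PER"), (4, 5, "LOC")}
mm_pred = {(0, 1, "PER"), (2, 3, "PER")}
micro, macro = micro_macro_f1(mm_gold, mm_pred)
```

This is why macro-averaged numbers are worth reporting on long-tailed type inventories like OntoNotes, where frequent types can mask failures on rare ones.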
Standard NER assumes that entities do not overlap, but in many real-world texts, entities can be nested within each other. For example, in "the Bank of England," both "Bank of England" (ORG) and "England" (LOC) are valid entities. Nested NER is common in biomedical text, where a gene name may be part of a longer protein complex name.
Approaches to nested NER include layered sequence labeling, which runs one tagging pass per nesting level; hypergraph-based models that compactly represent overlapping structures; span enumeration and classification, which scores every candidate span independently; and sequence-to-sequence models that generate entity spans directly.
The ACE 2004 and ACE 2005 datasets are commonly used benchmarks for nested NER.
Fine-grained NER extends the standard categories to include dozens or hundreds of more specific types, organized in a hierarchy. For example, instead of simply labeling an entity as LOC, a fine-grained system might classify it as LOC/body_of_water, LOC/mountain, or LOC/city. The FIGER type system (Ling and Weld, 2012) defines 112 types, while the TypeNet system defines over 1,000 types.
Fine-grained NER faces several challenges: entity types become increasingly difficult to distinguish, training data becomes sparse for rare types, and annotation consistency is harder to maintain.
Cross-lingual NER aims to transfer NER capabilities from resource-rich languages (typically English) to resource-poor languages with little or no annotated training data. Approaches include annotation projection, which transfers labels across parallel corpora via word alignment; machine translation of training data; and zero-shot transfer with multilingual pre-trained models such as mBERT and XLM-R, which are fine-tuned on English and applied directly to the target language.
Cross-lingual NER performance varies significantly depending on the linguistic similarity between source and target languages, with closely related languages showing stronger transfer.
Most NER systems operate at the sentence level, but entities often span multiple sentences or their interpretation depends on document-level context. Document-level NER addresses this by incorporating broader context, using techniques such as passing entity predictions from earlier sentences as features for later sentences, or applying Transformer models with longer context windows.
NER is the first step in most information extraction pipelines. Once entities are identified, subsequent systems can extract relations between them (relation extraction), resolve coreferences (determining which entity mentions refer to the same real-world entity), and identify events in which entities participate.
NER is essential for building knowledge graphs from text. The process typically involves identifying entities with NER, linking them to existing knowledge base entries (entity linking), extracting relationships between entities, and storing the resulting triples (subject, relation, object) in a graph database. Projects like Google's Knowledge Graph, Wikidata, and DBpedia rely on NER as a core component of their construction pipelines.
Search engines use NER to understand queries and documents. Identifying that a query contains a person name, location, or organization helps the search engine route the query to appropriate results (e.g., showing a knowledge panel for a person). NER also powers entity-based search filters and faceted navigation.
Clinical NER extracts medical entities from electronic health records, clinical notes, and biomedical literature. Entity types include diseases, symptoms, medications, dosages, procedures, anatomical sites, and laboratory test results. Clinical NER supports tasks like adverse drug event detection, clinical trial matching, and automated medical coding. Specialized models from systems like John Snow Labs' Spark NLP for Healthcare can detect over 50 clinical entity types.
Financial NER extracts entities such as company names, ticker symbols, stock exchanges, monetary amounts, dates, and financial instrument names from news articles, regulatory filings, earnings reports, and analyst notes. These extracted entities feed into trading signals, risk assessment, compliance monitoring, and financial knowledge graphs.
Legal NER identifies entities specific to the legal domain: case names, statute references, court names, parties to legal proceedings, judge names, dates, and jurisdiction identifiers. Legal NER supports contract analysis, case law research, regulatory compliance, and automated legal document processing.
NER on social media text (tweets, posts, comments) presents unique challenges due to informal language, abbreviations, misspellings, hashtags, and rapidly emerging entities. The WNUT (Workshop on Noisy User-generated Text) shared tasks have specifically addressed NER in these challenging settings.
Several open-source tools and libraries provide production-ready NER capabilities:
| Tool | Developer | Architecture | Languages | Speed (tokens/sec, CPU) | Key Strength |
|---|---|---|---|---|---|
| spaCy | Explosion AI | Transformer or efficiency pipelines | 25+ | ~10,000 | Production speed, easy integration |
| Flair | Zalando Research / Humboldt University | BiLSTM with stacked embeddings | 15+ | ~300 | High accuracy, flexible embedding stacking |
| Stanza | Stanford NLP Group | BiLSTM with character and word features | 70+ | ~900 | Broad multilingual support, linguistic annotations |
| Hugging Face Transformers | Hugging Face | Any Transformer model (BERT, RoBERTa, etc.) | 100+ | Varies | Access to thousands of pre-trained NER models |
| NLTK | Steven Bird et al. | Rule-based and MaxEnt | English primarily | Fast | Educational use, simple API |
| Stanford NER | Stanford NLP Group | CRF | English, German, Chinese, Spanish | ~1,500 | Well-established, CRF-based |
spaCy is an industrial-strength NLP library that offers both efficient non-Transformer pipelines and Transformer-based models. Its NER component supports custom entity types and can be fine-tuned on domain-specific data. spaCy's non-Transformer models are among the fastest available, processing over 10,000 tokens per second on CPU.
Flair, developed initially at Zalando Research, is known for its contextual string embeddings and the ability to stack multiple embedding types (Flair embeddings, BERT, GloVe, etc.). Flair models consistently achieve high accuracy on NER benchmarks, though at the cost of slower inference compared to spaCy.
Stanza, developed by the Stanford NLP Group, provides pre-trained NER models for over 70 languages. It uses a BiLSTM architecture with character and word-level features and integrates tightly with other linguistic analysis tools (tokenization, POS tagging, dependency parsing, lemmatization).
The Hugging Face Transformers library provides a straightforward pipeline for NER using any pre-trained Transformer model. The TokenClassificationPipeline handles tokenization, subword alignment, and entity aggregation. The Hugging Face Model Hub hosts thousands of NER models fine-tuned on various datasets and languages, including popular models like dslim/bert-base-NER and Jean-Baptiste/camembert-ner (French).
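The subword alignment step that such pipelines perform can be sketched without any library dependency. The sketch below mirrors the word_ids() convention of Hugging Face fast tokenizers (one entry per subword giving its source-word index, None for special tokens); the function name and the -100 ignore index follow common PyTorch practice but are assumptions here:

```python
def align_labels(word_labels, word_ids, ignore_index=-100):
    """Spread word-level BIO labels onto subword tokens.

    Only the first subword of each word keeps the label; continuation
    subwords and special tokens get ignore_index so the training loss
    skips them.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[wid])
        prev = wid
    return aligned

# "Honolulu" splits into two subwords; only the first keeps B-LOC.
word_labels = ["O", "B-LOC"]
word_ids = [None, 0, 1, 1, None]  # e.g. [CLS] in Hono ##lulu [SEP]
aligned = align_labels(word_labels, word_ids)
# aligned == [-100, "O", "B-LOC", -100, -100]
```

At inference time the pipeline inverts this mapping, aggregating subword predictions back into word- and entity-level output.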
Many entity names are ambiguous. "Washington" could refer to George Washington (PER), Washington, D.C. (LOC), or the Washington Nationals (ORG). Disambiguating such cases requires understanding the surrounding context, which is one reason why contextualized models like BERT significantly outperform earlier approaches.
NER models trained on one domain (e.g., newswire) often perform poorly on text from a different domain (e.g., biomedical literature, social media, legal documents). Domain adaptation techniques, including fine-tuning on small amounts of in-domain data or using domain-adaptive pre-training, can mitigate this issue.
New entities constantly appear (new companies, people, products), and NER models must generalize to entities not seen during training. Character-level features, subword tokenization, and pre-trained language models all help with unseen entities, but rare and novel entities remain a challenge.
While NER for English and a few other resource-rich languages has reached high accuracy, performance on low-resource languages remains significantly lower. Cross-lingual transfer, multilingual pre-training, and data augmentation techniques are active areas of research aimed at closing this gap.
NER benchmark performance is fundamentally limited by the quality of annotation. As models have improved, annotation noise in standard benchmarks has become a more significant factor. Inter-annotator agreement for NER is typically around 95-97% F1, setting an approximate upper bound on benchmark scores.
Imagine you are reading a storybook and someone asks you to find all the names of people, places, and companies in the story, then circle each one with a different color. People get a red circle, places get a blue circle, and companies get a green circle. Named entity recognition is like a computer program that does exactly this: it reads through text and highlights all the important names, grouping them by type. This helps computers understand what a piece of text is actually talking about.