Token Classification Models

Updated Jul 16, 2026

Token classification models are natural language processing systems that assign a discrete label to every token in an input sequence, where a token is typically a word, subword piece, or character. The family covers named entity recognition (NER), part-of-speech tagging (POS), syntactic chunking, slot filling for dialogue systems, morphological tagging, and (loosely) semantic role labeling. Although the task is simple to state (input n tokens, output n labels), token classification underlies most information extraction pipelines: search indexing, biomedical literature mining, clinical decision support, voice assistants, and content moderation.

The family is distinct from text classification, which labels whole documents, and from sequence-to-sequence generation. Token classification preserves a one-to-one alignment between input positions and output labels, which suits encoder architectures and structured prediction. Modern systems combine a pretrained Transformer encoder with a lightweight classification head and use tag schemes (BIO, BIOES, BILOU) to encode multi-token spans.

Definition and output structure

Given an input sentence X = (x_1, ..., x_n), a token classification model predicts a sequence of labels Y = (y_1, ..., y_n) drawn from a fixed label set L. For tasks that mark spans rather than individual words, the label set is extended with prefixes that encode segment boundaries. In the BIO scheme, the entity "New York City" of type LOC is tagged (B-LOC, I-LOC, I-LOC), where B marks the beginning of a span, I marks an inside position, and O marks tokens outside any span.^[3] The BIOES (or BILOU) scheme adds explicit End and Single tags so the model can distinguish single-token from multi-token spans more cleanly.
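As a minimal sketch of the BIO encoding described above, the following turns token-level spans into one tag per token. The function name, the (start, end, type) span format with an exclusive end index, and the example sentence are illustrative assumptions, not part of any particular library.

```python
def spans_to_bio(tokens, spans):
    """Encode (start, end, type) spans as BIO tags.

    `start` is inclusive and `end` is exclusive, both token indices.
    Spans are assumed not to overlap (flat, non-nested annotation).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the span
    return tags


tokens = ["She", "moved", "to", "New", "York", "City", "."]
tags = spans_to_bio(tokens, [(3, 6, "LOC")])
print(list(zip(tokens, tags)))
```

The output keeps the one-to-one alignment between input positions and labels: seven tokens in, seven tags out.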

Subword tokenizers used by Transformers complicate alignment between input pieces and word-level labels. A common convention assigns the gold label to the first subword of each word and an ignore index to the remaining subwords during training; at inference, the first subword prediction propagates to the whole word.^[8]
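The first-subword convention can be sketched in a few lines. This example assumes the tokenizer exposes, for each subword piece, the index of the word it came from (with None for special tokens); the helper names, the piece split, and the label ids are hypothetical. The value -100 is used as the ignore index because it is the default that PyTorch's cross-entropy loss skips.

```python
IGNORE_INDEX = -100  # assumed sentinel; PyTorch's cross-entropy ignores it by default


def align_labels(word_labels, word_ids):
    """Training side: give the gold label to the first piece of each word."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)       # special token such as [CLS]
        elif wid != previous:
            aligned.append(word_labels[wid])   # first piece of a word
        else:
            aligned.append(IGNORE_INDEX)       # continuation piece
        previous = wid
    return aligned


def word_predictions(piece_preds, word_ids):
    """Inference side: the first piece's prediction stands for the whole word."""
    first = {}
    for pred, wid in zip(piece_preds, word_ids):
        if wid is not None and wid not in first:
            first[wid] = pred
    return [first[w] for w in sorted(first)]


# Hypothetical split of three words into pieces:
# [CLS] Johan ##son visited Rey ##k ##javik [SEP]
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
word_labels = [1, 0, 2]                        # e.g. 1 = B-PER, 0 = O, 2 = B-LOC
aligned = align_labels(word_labels, word_ids)
recovered = word_predictions([0, 1, 0, 0, 2, 2, 0, 0], word_ids)
print(aligned, recovered)
```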

History

The earliest taggers were hand-crafted rule systems. Eric Brill introduced transformation-based learning for POS tagging in 1992, a method that learns an ordered list of correction rules from a labeled corpus and applies them iteratively.^[1] The Brill tagger reached accuracies comparable to stochastic taggers with only a few hundred rules and remained an interpretable baseline.^[1]

Hidden Markov models dominated the second half of the 1990s for POS tagging. John Lafferty, Andrew McCallum, and Fernando Pereira proposed conditional random fields (CRFs) in 2001, a family of probabilistic models that overcame the label bias problem of maximum entropy Markov models and allowed richer feature templates without the strong independence assumptions of HMMs.^[2] CRFs became the standard backbone for sequence labeling and powered systems such as Stanford NER (Finkel, Grenager, and Manning 2005), which augmented a linear-chain CRF with non-local constraints through Gibbs sampling.^[4]

Deep learning displaced feature-engineered CRFs in the mid-2010s. Zhiheng Huang, Wei Xu, and Kai Yu published the bidirectional LSTM-CRF in 2015, combining a BiLSTM sentence encoder with a CRF output layer; the model achieved state-of-the-art or near state-of-the-art accuracy on POS tagging, chunking, and CoNLL-2003 NER with reduced reliance on hand-engineered features.^[5] Later variants added character-level convolutions to capture morphology and out-of-vocabulary words. ELMo (Peters et al. 2018) raised CoNLL-2003 English NER F1 above 92 when fed into a BiLSTM-CRF.^[6]

The modern era began with BERT (Devlin et al. 2018), which pushed token classification onto bidirectional Transformer encoders with a linear classification head.^[8] Domain-specific encoders followed: BioBERT (Lee et al. 2019) for biomedical text,^[9] SciBERT (Beltagy et al. 2019) for scientific papers,^[10] and Clinical BERT (Alsentzer et al. 2019) for hospital notes.^[11] XLM-RoBERTa (Conneau et al. 2020) extended the recipe to over 100 languages.^[12] The 2023 wave of instruction-tuned large language models added a second paradigm where entity extraction is framed as generation guided by a schema, with systems such as GoLLIE and UniNER closing the gap to fine-tuned encoders in zero-shot settings.^[19]

Key tasks

Task	Output	Typical label set
Named entity recognition	Spans of mentions	PERSON, ORG, LOC, MISC, DATE, MONEY
Part-of-speech tagging	One tag per token	NOUN, VERB, ADJ, ADV, Penn Treebank tags
Chunking	Shallow phrase spans	NP, VP, PP, ADJP, ADVP
Slot filling	Spans aligned with intent	departure_city, time, product
Morphological tagging	Morphological features	Case, Tense, Number, Gender
Semantic role labeling	Spans tied to predicates	ARG0, ARG1, ARGM-TMP

NER typically uses a few coarse types (four in CoNLL-2003: PER, ORG, LOC, MISC; eighteen in OntoNotes 5.0). POS tagging draws from the Penn Treebank tagset (about 45 tags for English) or the Universal Dependencies tagset (17 coarse tags shared across more than 160 languages). Chunking divides a sentence into non-overlapping syntactic phrases; slot filling is the analogous task inside dialogue systems and is normally trained jointly with intent classification.

Named entity recognition

Named entity recognition is the most studied token classification task.^[21] Its goal is to identify and categorize spans in free text that refer to real-world entities: people, organizations, locations, products, dates, monetary amounts, and similar types.^[21] Modern models are trained on annotated corpora where human experts have marked entity boundaries and types. The task is evaluated at the entity level rather than the token level: a prediction counts as correct only if both the boundary and the type match the gold annotation exactly.^[3]

NER performance depends heavily on the domain. Newswire models trained on CoNLL-2003 or OntoNotes often drop 10 to 20 F1 points when applied to social media text, clinical notes, or scientific abstracts. Domain-specific pretraining and targeted annotation corpora address this gap.^[9]

Part-of-speech tagging

Part-of-speech tagging assigns a syntactic category to each token. For English, the Penn Treebank tagset distinguishes 45 fine-grained classes (NN, NNS, VBD, JJ, etc.). The Universal Dependencies project defines 17 coarse categories (NOUN, VERB, ADJ, etc.) that apply across the 200-plus treebanks covering more than 160 languages.

POS tagging is relatively saturated for high-resource languages: transformer-based taggers exceed 98 percent token accuracy on standard English benchmarks. The remaining errors concentrate on rare words, punctuation ambiguities, and domain-shifted text such as spoken transcripts or social media.

Chunking

Chunking (also called shallow parsing) groups consecutive tokens into non-overlapping syntactic phrases without requiring a full parse tree. Output spans are typed as noun phrases (NP), verb phrases (VP), prepositional phrases (PP), and similar constituents. Chunking is evaluated on CoNLL-2000 data using token-level and span-level F1. Because it is simpler than full dependency or constituency parsing, it remains useful as a fast preprocessing step in production pipelines.

Slot filling

Slot filling is the dialogue-oriented variant of NER. Given an utterance such as "Book a table for two at eight on Friday," the system must identify and tag slot values: party_size=two, time=eight, day=Friday. Slots are defined by an intent schema (a reservation intent has specific slot types that a weather intent does not). BERT-based joint models that solve intent classification and slot filling simultaneously became standard after Chen, Zhuo, and Wang (2019) demonstrated that shared encoder representations improve both tasks. Slot filling benchmarks include ATIS (Airline Travel Information Systems) and SNIPS.
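To make the reservation example concrete, here is a small sketch that turns per-token BIO tags into a slot dictionary. The tag sequence is written by hand rather than produced by a model, and the helper name is hypothetical. If a slot name occurs twice, this simple version keeps the last value.

```python
def slots_from_tags(tokens, tags):
    """Collect slot values from BIO tags into a {slot_name: text} dict."""
    slots, name, words = {}, None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and tag[2:] == name:
            words.append(token)                 # continue the open slot
            continue
        if name is not None:
            slots[name] = " ".join(words)       # close the open slot
        # Open a new slot on B-, otherwise reset.
        name, words = (tag[2:], [token]) if tag.startswith("B-") else (None, [])
    if name is not None:
        slots[name] = " ".join(words)           # slot that ends the utterance
    return slots


utterance = ["Book", "a", "table", "for", "two", "at", "eight", "on", "Friday"]
slot_tags = ["O", "O", "O", "O", "B-party_size", "O", "B-time", "O", "B-day"]
slots = slots_from_tags(utterance, slot_tags)
print(slots)
```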

Tagging schemes

Four span-encoding schemes are common.

Scheme	Tags used	Notes
IO	I, O	Cannot mark consecutive spans of the same type
BIO (IOB2)	B, I, O	Standard CoNLL format; B marks every span start
BIOES	B, I, O, E, S	Adds End and Single; reduces decoding ambiguity
BILOU	B, I, L, O, U	Renaming of BIOES; Last and Unit replace End and Single

BIO is the default for most modern Transformer pipelines. BIOES and BILOU mark the last token of a span and single-token spans explicitly, which rules out more tag sequences as illegal (for example, B-LOC followed directly by O) and tends to help CRF and span-based decoders. Research on chemical and biomedical NER has reported small but consistent F1 gains from BIOES at the cost of a larger label space.

The IO scheme is the simplest but has an inherent limitation: two consecutive entities of the same type cannot be distinguished because there is no Beginning marker to separate them. BIO resolves this by requiring a B tag at the start of every entity span, regardless of whether the previous token had the same type.^[3] BIOES and BILOU additionally mark the final token of each span (E and L respectively) and treat single-token entities with a dedicated S or U tag, which gives the CRF transition model clearer structural cues.
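The relationship between the schemes can be illustrated with a short conversion routine. This is a sketch that assumes a well-formed BIO input (every I tag follows a B or I tag of the same type); the function name is illustrative.

```python
def bio_to_bioes(tags):
    """Convert a well-formed BIO sequence to BIOES."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{label}"         # does the span go on?
        if prefix == "B":
            out.append(f"B-{label}" if continues else f"S-{label}")
        else:                                   # prefix == "I"
            out.append(f"I-{label}" if continues else f"E-{label}")
    return out


bio = ["B-PER", "O", "B-LOC", "I-LOC", "I-LOC", "B-LOC"]
bioes = bio_to_bioes(bio)
io = [t.replace("B-", "I-") for t in bio]       # IO drops the B marker
print(bioes)
print(io)
```

In the IO version the last four tags are all I-LOC, so the two adjacent LOC entities can no longer be told apart, which is the limitation noted above.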

Architectures

BiLSTM-CRF

The pre-Transformer BiLSTM-CRF remains a well-understood baseline. It concatenates word embeddings (often GloVe) with character embeddings from a smaller BiLSTM or CNN, feeds the sequence through a deeper BiLSTM, and decodes under a CRF transition matrix.^[5] The CRF layer learns a matrix of transition scores between label pairs (how likely is a B-LOC to follow an I-ORG, for example) and uses the Viterbi algorithm at inference to find the globally most probable tag sequence given both emission scores and transition constraints.^[2]^[5] This global normalization and joint decoding is the key advantage over greedy per-token softmax decoders: the CRF rarely produces illegal label sequences (such as I-LOC directly after O) because such transitions learn a strongly negative score, and they can be ruled out entirely by masking them in the transition matrix.^[2]
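Viterbi decoding over emission and transition scores can be sketched in plain Python. The three-tag label set and all scores below are made up for illustration, and the O to I-LOC transition is given a large negative score by hand to stand in for what a trained CRF would learn or what a hard mask would impose.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index path.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for landing on tag j at this position.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["O", "B-LOC", "I-LOC"]
transitions = [[0.0, 0.0, -1e4],    # O -> I-LOC is effectively forbidden
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
emissions = [[2.0, 0.0, 0.0],
             [0.1, 0.5, 1.0],       # greedy decoding would pick I-LOC here
             [0.0, 0.0, 1.5]]

greedy = [TAGS[max(range(3), key=lambda j: e[j])] for e in emissions]
decoded = [TAGS[j] for j in viterbi(emissions, transitions)]
print(greedy)
print(decoded)
```

Greedy per-token decoding yields O, I-LOC, I-LOC, while the Viterbi path trades a slightly lower emission score at the second position for a legal sequence.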

Flair still uses a BiLSTM-CRF tagger, and spaCy's non-Transformer pipelines use a CNN encoder, in part because these run faster than large Transformers on CPU. Character-level embeddings from a CNN or a smaller BiLSTM give morphological signal for out-of-vocabulary words, which is especially helpful for languages with complex inflection.^[7]

Transformer encoder with classification head

The dominant 2020s architecture is an encoder Transformer with a linear softmax head over each token's contextual representation.^[8] The encoder (BERT-base, RoBERTa, DeBERTa-v3, or XLM-RoBERTa) converts the input tokens into dense contextual vectors; the classification head maps each vector to a logit for each label in the tag vocabulary. At fine-tuning, the cross-entropy loss over labeled tokens is backpropagated through both the head and the encoder.^[8] On CoNLL-2003 English, BERT-base reaches entity-level F1 in the low 90s;^[8] larger and more recent encoders such as DeBERTa-v3-large push past 93.
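The head itself is small enough to write out by hand. The sketch below uses plain Python lists in place of a tensor library, with toy two-dimensional "contextual vectors" and a hand-picked weight matrix; all names and numbers are illustrative assumptions. It shows the per-token linear map, the softmax cross-entropy, and how positions marked with an ignore index are left out of the loss.

```python
import math

IGNORE_INDEX = -100  # assumed marker for positions excluded from the loss


def linear_head(hidden, weight, bias):
    """Map one contextual vector to one logit per label."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]


def log_softmax(logits):
    m = max(logits)                                   # subtract max for stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_cross_entropy(all_logits, labels):
    """Mean negative log-likelihood over tokens that carry a label."""
    losses = [-log_softmax(logits)[y]
              for logits, y in zip(all_logits, labels) if y != IGNORE_INDEX]
    return sum(losses) / len(losses)


weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]       # 3 labels x 2 hidden dims
bias = [0.0, 0.0, 0.0]
hidden_states = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]  # one vector per token
labels = [0, 1, IGNORE_INDEX]                         # third token is not scored

logits = [linear_head(h, weight, bias) for h in hidden_states]
loss = token_cross_entropy(logits, labels)
print(logits, round(loss, 4))
```

In a real fine-tuning run the gradient of this loss would flow through both the head and the encoder; here only the forward computation is shown.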

Adding a linear-chain CRF layer atop the Transformer still provides a small but consistent boost when the label set is dense, training data is limited, or the schema includes rare tag transitions. The additional parameters are few and the computational overhead is negligible, so many practitioners include the CRF by default.

Span-based models

Span-based models predict labels for candidate spans directly instead of tagging token by token. Every possible span (up to some maximum length) is enumerated, a span representation is constructed (typically by concatenating the contextual representations of the start and end tokens and sometimes adding a width embedding), and a classifier scores each span against the entity types plus a null class. This formulation, popularized by Lee et al. (2017) for coreference, was adapted for NER and handles nested entities naturally: two spans with overlapping boundaries simply receive independent scores. Span models are evaluated on GENIA, ACE 2004, and ACE 2005, achieving F1 in the upper 80s on those nested-entity corpora.^[21]
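Span enumeration and the span representation can be sketched as follows. The function names and toy vectors are hypothetical, and the raw width integer stands in for the learned width embedding a real model would use.

```python
def enumerate_spans(n_tokens, max_width):
    """All (start, end) pairs with end exclusive and width <= max_width."""
    return [(s, e)
            for s in range(n_tokens)
            for e in range(s + 1, min(s + max_width, n_tokens) + 1)]


def span_features(token_vectors, start, end):
    """Concatenate first-token vector, last-token vector, and span width."""
    return token_vectors[start] + token_vectors[end - 1] + [end - start]


token_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # toy encoder output
candidates = enumerate_spans(len(token_vectors), max_width=2)
features = span_features(token_vectors, 1, 3)
print(candidates)
print(features)
```

Because (0, 1) and (0, 2) are both candidates and are scored independently, an entity and a shorter entity nested inside it can both be predicted. The number of candidates grows roughly as n_tokens times max_width, which is the source of the extra cost compared with per-token tagging.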

LLM-based extraction

LLM-based extraction frames the task as generation. The model receives a schema and returns structured output, typically JSON. GoLLIE (Sainz et al. 2023) fine-tunes Code-LLaMA with annotation guidelines expressed as Python class docstrings, allowing it to follow novel extraction schemas described only at inference time.^[19] UniNER (Zhou et al. 2023) distills ChatGPT outputs into a smaller targeted model. Both perform well in zero-shot regimes, though fine-tuning a domain-specific encoder still dominates when ample labeled data is available.

GLiNER (Zaratiana et al. 2023) takes a middle path: it is an encoder-based generalist model that accepts entity type labels as input alongside the text, encoding them in the same latent space as the document tokens, then scores each candidate span against each provided label in parallel.^[20] This architecture is much smaller and faster than an LLM while generalizing to entity types seen only at inference time. GLiNER outperforms both ChatGPT and larger instruction-tuned models in zero-shot NER evaluations on held-out benchmarks.^[20]

Retrieval-augmented and knowledge-grounded NER

Retrieval-augmented NER conditions the encoder on retrieved examples or a knowledge base, helping with rare entities and domain shift. Gazetteer features (lookup tables of known entity names) have been used since the CRF era; modern versions embed gazetteer entries and inject them as soft features into Transformer attention. Retrieval-augmented methods that pull relevant training examples at inference time show gains on low-resource and emerging-entity settings.
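A classic gazetteer feature can be sketched as a longest-match lookup that emits one feature per token. The dictionary contents, the lowercasing, and the maximum entry length are assumptions made for this example; modern systems would embed such matches rather than use them as hard labels.

```python
def gazetteer_features(tokens, gazetteer, max_len=4):
    """Tag tokens with the longest gazetteer entry starting at each position."""
    feats = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest window first so longer entries win.
        for width in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + width])
            label = gazetteer.get(key)
            if label is not None:
                feats[i] = f"B-{label}"
                for k in range(i + 1, i + width):
                    feats[k] = f"I-{label}"
                i += width
                break
        else:
            i += 1                                # no entry starts here
    return feats


gazetteer = {("new", "york"): "LOC", ("new", "york", "times"): "ORG"}
sentence = ["The", "New", "York", "Times", "reported"]
feats = gazetteer_features(sentence, gazetteer)
print(feats)
```

These features are inputs to the tagger, not predictions: the model can still override a lookup hit when the context disagrees.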

Notable models

Model	Year	Approach	Notes
Brill tagger	1992	Transformation rules	Interpretable POS baseline^[1]
Stanford NER	2005	Linear-chain CRF	Gibbs sampling for non-local features^[4]
BiLSTM-CRF	2015	BiLSTM encoder + CRF	First strong neural sequence tagger^[5]
Flair	2018	Character LM + BiLSTM-CRF	Akbik et al., Zalando Research^[7]
spaCy	2015+	CNN tok2vec + transition NER	Production library, en_core_web_lg
BERT	2018	Transformer encoder + linear head	Devlin et al., near 92 F1 on CoNLL-2003^[8]
BioBERT	2019	BERT pretrained on PubMed	Lee et al., gain on BC5CDR, NCBI Disease^[9]
SciBERT	2019	BERT pretrained on scientific papers	Beltagy et al., 1.14M Semantic Scholar papers^[10]
Clinical BERT	2019	BERT pretrained on MIMIC III	Alsentzer et al., clinical NER^[11]
XLM-RoBERTa	2020	Multilingual Transformer	Conneau et al., 100 languages, +2.4 F1 on NER over mBERT^[12]
DeBERTa-v3	2021	ELECTRA-style pretraining	He et al., strong encoder for token classification^[13]
PubMedBERT	2021	BERT trained from scratch on PubMed	Gu et al., BLURB SOTA across 13 biomedical NLP tasks^[14]
GoLLIE	2023	Code-LLaMA fine-tuned on guidelines	Sainz et al., strong zero-shot IE^[19]
UniNER	2023	Instruction tuning + ChatGPT distillation	Zhou et al., targeted NER LLM
GLiNER	2023	Bidirectional encoder + label encoding	Zaratiana et al., generalist zero-shot NER^[20]
NuNER	2024	RoBERTa fine-tuned on LLM-annotated data	Bogdanov et al., LLM-bootstrapped pretraining
ModernBERT	2024	Modernized BERT with RoPE and long context	Warner et al., 8192-token context, 2-4x faster^[22]

Stanford NER, spaCy, and Flair remain in active production use. Hugging Face hosts thousands of fine-tuned BERT, RoBERTa, and XLM-R checkpoints for NER across languages and domains.

Benchmarks and datasets

NER and POS benchmarks fall into general-domain, social media, biomedical, and multilingual categories.

Benchmark	Year	Task	Notes
CoNLL-2003	2003	NER (English, German)	Tjong Kim Sang and De Meulder; PER, ORG, LOC, MISC; ~93 F1 near ceiling^[3]
OntoNotes 5.0	2013	NER, coref, SRL	18 entity types; English, Chinese, Arabic
WNUT-2017	2017	Emerging entities	Twitter, Reddit, Stack Overflow; top F1 around 50^[17]
Few-NERD	2021	Few-shot NER	8 coarse, 66 fine types, 188K sentences^[15]
ACE 2005	2005	NER, relation, event	Multi-genre English; nested entity evaluation
MultiCoNER	2022	Multilingual complex NER	11 languages, 2.3M instances^[16]
BC5CDR	2016	Biomedical NER	Chemicals and diseases
NCBI Disease	2014	Biomedical NER	Disease mentions in PubMed abstracts
JNLPBA	2004	Biomedical NER	Genes and proteins
GENIA	2003	Biomedical NER	Molecular biology corpus; nested entities
GermEval 2014	2014	NER (German)	Nested entities
Penn Treebank	1993	POS tagging	English newswire, 45 tags; >97% token accuracy
Universal Dependencies	2014+	POS, morphology, parsing	200+ treebanks, 160+ languages
CoNLL-2000	2000	Chunking	English Wall Street Journal, NP/VP/PP
ATIS	1990s	Slot filling	Airline travel utterances
SNIPS	2018	Slot filling	7 intent domains, diverse slot types
CleanCoNLL	2023	Corrected NER	Re-annotation of CoNLL-2003; SOTA reaches 97.1 F1^[18]

Evaluation metrics

The canonical NER metric is span-level F1, computed on exact match of entity boundaries and types. A predicted span is a true positive only if its start index, end index, and type label all match the gold annotation. The conlleval script, distributed with the CoNLL shared tasks, implements this evaluation and remains the standard tool.^[3]
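Exact-match span F1 can be sketched in a few lines. This is an illustration of the metric, not a reimplementation of conlleval: in particular, it simply ignores an I tag that does not continue an open span of the same type, and evaluation scripts may treat such tags differently.

```python
def extract_spans(tags):
    """Collect (start, end, type) spans from BIO tags; end is exclusive."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a final span
        inside = tag.startswith("I-") and tag[2:] == label
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """Precision, recall, and F1 over exactly matching spans."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
scores = span_f1(gold, pred)
print(token_accuracy, scores)
```

Five of the six tokens are tagged correctly, yet the truncated LOC span counts as both a false positive and a false negative, so span-level F1 is much lower than token accuracy.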

Partial matching schemes (relaxed boundaries, type-only, boundary-only) are also used in clinical and legal NER where exact boundary agreement is harder to achieve. Biomedical work often reports both exact and relaxed matching.

POS tagging is reported as token accuracy: the fraction of tokens whose predicted tag matches the gold tag. Strong models exceed 97 percent on English Penn Treebank and 98 percent on Universal Dependencies with transformer encoders.

Slot filling is measured by slot-level F1 and intent accuracy. Joint models are evaluated on both simultaneously.

CoNLL-2003 has annotation errors estimated at around 5 to 7 percent of labels, so by the early 2020s the test F1 ceiling near 93 reflected noise in the reference labels rather than a model limitation.^[18] CoNLL++ (Wang et al. 2019) corrected 309 labels in the test set; CleanCoNLL (Rücker and Akbik 2023) performed a comprehensive relabeling, allowing SOTA models to reach 97.1 F1.^[18]

Modern era and LLM extraction

From around 2022, large language models such as GPT-4 and Claude shifted part of the field toward prompt-based extraction. With structured output features such as JSON mode and tool use, an LLM can read a passage and return a list of entities, even with custom types described only in the prompt. Performance in zero-shot settings on novel domains can match or exceed fine-tuned encoders when the target schema is unusual or labeled data is scarce. Open-weight systems built for this setting include the instruction-tuned GoLLIE and UniNER (both 2023) and the encoder-based GLiNER (Zaratiana et al. 2023, published at NAACL 2024).^[19]^[20]

For well-resourced domains with stable schemas, fine-tuned encoder models retain advantages in throughput, latency, and cost. A BERT-base tagger processes thousands of sentences per second on a single GPU, while an LLM extractor producing JSON runs orders of magnitude slower. Production stacks often mix the two: an LLM bootstraps labels on a new domain, and a distilled encoder serves traffic.

Encoder models continued to advance in this period. DeBERTa-v3, released by Microsoft in 2021, uses ELECTRA-style replaced-token-detection pretraining with gradient-disentangled embedding sharing, achieving consistently top scores on NER benchmarks with only 60 to 70 percent of the pretraining data required by earlier models.^[13] ModernBERT (Warner et al. 2024) introduced a refresh of the BERT architecture with rotary positional embeddings, alternating local-global attention, and training on 2 trillion tokens, delivering 2 to 4 times the inference throughput of older encoders while extending the effective context window to 8192 tokens.^[22]

NuNER (Bogdanov et al. 2024) demonstrated that pretraining a RoBERTa model on data automatically annotated by GPT-3.5 on a large general corpus yields strong few-shot NER performance, providing a practical path to bootstrapping new domains without manual annotation.

Applications

Domain	Use case	Representative tools
Search and information retrieval	NER-based query understanding and vertical routing	spaCy, custom BERT taggers
Biomedical text mining	Extraction of genes, proteins, chemicals, diseases from PubMed	BioBERT, PubMedBERT, BLURB models
Clinical informatics	Problem list population, medication extraction from discharge summaries	Clinical BERT, i2b2 models
Legal and financial analysis	Tagging parties, dates, monetary amounts in contracts and filings	Legal-BERT, FinBERT fine-tunes
Voice assistants	Slot filling for flight booking, restaurant reservation, calendar events	BERT joint intent and slot models
Content moderation	Detection of PII, addresses, sensitive terms	Presidio (Microsoft), custom NER
Machine translation	Named entity handling, transliteration decisions	XLM-R based NER in MT pipelines
Knowledge base construction	Relation extraction bootstrapped by entity detection	Stanford NER, spaCy pipelines

Search engines and enterprise indexes use NER to identify queries about people, places, and products, then route to specialized verticals or populate knowledge panels. Biomedical tools extract genes, proteins, chemicals, diseases, and adverse events from PubMed and clinical trial registries at scale: PubMed alone has more than 36 million citations, making automated extraction essential. Hospital systems run NER over discharge summaries to populate problem lists, medication histories, and quality metrics, reducing manual coding burden. Voice assistants rely on slot filling as the backbone of their natural language understanding stack; a single slot filling model typically handles dozens of intent domains with hundreds of slot types.

Limitations and open challenges

Domain shift

Domain shift is the largest practical challenge. A CoNLL-2003 newswire model often drops more than 10 F1 points on tweets or clinical notes; in-domain pretraining recovers most of the loss. The WNUT-2017 benchmark, built from Twitter, Reddit, and Stack Overflow text, captures this: the best systems reach only around 50 F1 on emerging entities in that noisy, informal domain.^[17] Domain-adaptive pretraining and targeted annotation remain the primary mitigations.^[14]

Nested entities

Nested entities (one entity inside another, common in biomedical text) cannot be expressed in a single flat BIO sequence and are usually handled with span-based decoders.^[21] In GENIA and clinical corpora, a protein mention may be nested inside a gene family mention, or a disease name may be nested inside a treatment phrase. Span-based models handle this naturally; BIO-based taggers require post-processing heuristics or separate passes over the text. The prevalence of nesting in specialized domains means that span models are preferred for biomedical and legal NER despite their higher computational cost.

Annotation quality

Label noise is common in crowdsourced corpora, and audits of CoNLL-2003 have identified annotation errors that limit further F1 gains.^[18] Corrected test sets such as CoNLL++ and CleanCoNLL have been proposed and show that a significant fraction of the apparent performance ceiling is due to label noise rather than model limitations.^[18] Generalization across schemas remains weak: a model trained on OntoNotes does not produce CoNLL or biomedical types without retraining, although schema-aware LLMs reduce that friction.

Low-resource languages and cross-lingual transfer

Low-resource languages have few labeled examples; cross-lingual transfer through XLM-R helps but does not close the gap to high-resource languages.^[12] Universal Dependencies treebanks provide POS data for more than 160 languages, enabling cross-lingual POS transfer, but for NER most languages outside English, Chinese, German, Spanish, and Dutch have sparse corpora. Zero-shot transfer from multilingual encoders, combined with a small number of in-language labeled examples, currently gives the best results for low-resource NER.

Long documents and long entities

Standard token classifiers process sequences up to 512 tokens (BERT) or 8192 tokens (ModernBERT).^[8]^[22] Long legal or scientific documents require chunking strategies. Very long entities that span multiple clauses (such as long medication dosage descriptions or complex legal clauses) are hard for left-to-right or windowed decoders. Hierarchical encoders and sparse attention help but remain research-stage solutions.
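One common chunking strategy is an overlapping sliding window. The sketch below assumes a fixed window length and stride and merges overlapping predictions by keeping, for each token, the prediction from the window in which the token sits farthest from a window edge. That merge rule is one heuristic among several, and all names here are hypothetical.

```python
def sliding_windows(n_tokens, max_len, stride):
    """Yield (start, end) windows of at most max_len tokens, stepping by stride."""
    if not 0 < stride <= max_len:
        raise ValueError("stride must be between 1 and max_len")
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        yield start, end
        if end == n_tokens:
            break
        start += stride


def merge_window_predictions(n_tokens, windows, window_preds):
    """Keep each token's prediction from the window where it has the most context."""
    merged = [None] * n_tokens
    best_margin = [-1] * n_tokens
    for (start, end), preds in zip(windows, window_preds):
        for offset, pred in enumerate(preds):
            pos = start + offset
            margin = min(offset, end - start - 1 - offset)  # distance to nearer edge
            if margin > best_margin[pos]:
                merged[pos], best_margin[pos] = pred, margin
    return merged


long_doc_windows = list(sliding_windows(1200, max_len=512, stride=384))
print(long_doc_windows)

windows = list(sliding_windows(10, max_len=6, stride=4))
merged = merge_window_predictions(10, windows, [["a"] * 6, ["b"] * 6])
print(windows, merged)
```

Entities longer than the overlap can still be cut by a window boundary, which is why very long spans remain difficult for windowed decoding.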

LLM reliability for structured extraction

LLM-based extraction, while flexible, introduces new failure modes: format errors (invalid JSON, missing fields), hallucinated entity boundaries, and sensitivity to prompt wording. Empirical studies show GPT-4 has a non-trivial invalid-response rate for complex extraction schemas, and performance degrades sharply on nested or overlapping entity types. Combining LLM flexibility with encoder-based reliability through distillation or label smoothing is an active research area.
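A defensive post-processing step can catch the first two failure modes. The sketch below assumes the model was asked to return a JSON list of objects with "text" and "type" fields; that response format, the function name, and the example strings are assumptions for illustration. It rejects unparseable output, drops entries with missing fields or unknown types, and drops entity strings that do not occur in the source text. Only the first occurrence of each string is located, a simplification that a production system would need to refine.

```python
import json


def validate_entities(raw_output, text, allowed_types):
    """Parse an LLM response; keep only well-formed entities found in the text."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return [], ["invalid JSON"]
    if not isinstance(data, list):
        return [], ["top-level value is not a list"]
    kept, problems = [], []
    for item in data:
        well_formed = (isinstance(item, dict)
                       and isinstance(item.get("text"), str) and item["text"]
                       and isinstance(item.get("type"), str))
        if not well_formed:
            problems.append(f"missing or malformed fields: {item!r}")
        elif item["type"] not in allowed_types:
            problems.append(f"unknown type: {item['type']!r}")
        elif item["text"] not in text:
            problems.append(f"span not found in source: {item['text']!r}")
        else:
            start = text.index(item["text"])       # first occurrence only
            kept.append((start, start + len(item["text"]), item["type"]))
    return kept, problems


source = "Marie Curie worked in Paris."
raw = ('[{"text": "Marie Curie", "type": "PER"},'
       ' {"text": "Sorbonne", "type": "ORG"},'
       ' {"text": "Paris", "type": "CITY"}]')
kept, problems = validate_entities(raw, source, {"PER", "ORG", "LOC"})
print(kept)
print(problems)
```

Returning character offsets rather than strings makes the validated output directly comparable with span annotations from an encoder-based tagger.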

References

1. Brill, E. (1992). A simple rule-based part-of-speech tagger. Proceedings of the third conference on Applied natural language processing. https://aclanthology.org/A92-1021/
2. Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001. https://repository.upenn.edu/entities/publication/c9aea099-b5c8-4fdd-901c-15b6f889e4a7
3. Tjong Kim Sang, E. F., and De Meulder, F. (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. CoNLL 2003. https://aclanthology.org/W03-0419/
4. Finkel, J. R., Grenager, T., and Manning, C. (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. ACL 2005. https://nlp.stanford.edu/pubs/finkel2005gibbs.pdf
5. Huang, Z., Xu, W., and Yu, K. (2015). Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv:1508.01991. https://arxiv.org/abs/1508.01991
6. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep Contextualized Word Representations. NAACL 2018. https://aclanthology.org/N18-1202/
7. Akbik, A., Blythe, D., and Vollgraf, R. (2018). Contextual String Embeddings for Sequence Labeling. COLING 2018. https://aclanthology.org/C18-1139/
8. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805. https://arxiv.org/abs/1810.04805
9. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. (2019). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. arXiv:1901.08746. https://arxiv.org/abs/1901.08746
10. Beltagy, I., Lo, K., and Cohan, A. (2019). SciBERT: A Pretrained Language Model for Scientific Text. EMNLP 2019. https://aclanthology.org/D19-1371/
11. Alsentzer, E., Murphy, J. R., Boag, W., Weng, W.-H., Jin, D., Naumann, T., and McDermott, M. (2019). Publicly Available Clinical BERT Embeddings. NAACL Clinical NLP Workshop. https://aclanthology.org/W19-1909/
12. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzman, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. (2020). Unsupervised Cross-lingual Representation Learning at Scale. ACL 2020. https://aclanthology.org/2020.acl-main.747/
13. He, P., Gao, J., and Chen, W. (2021). DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. arXiv:2111.09543. https://arxiv.org/abs/2111.09543
14. Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., and Poon, H. (2021). Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM CHIL 2021. https://arxiv.org/abs/2007.15779
15. Ding, N., Xu, G., Chen, Y., Wang, X., Han, X., Xie, P., Zheng, H.-T., and Liu, Z. (2021). Few-NERD: A Few-shot Named Entity Recognition Dataset. ACL 2021. https://aclanthology.org/2021.acl-long.248/
16. Malmasi, S., Fang, A., Fetahu, B., Kar, S., and Rokhlenko, O. (2022). SemEval-2022 Task 11: Multilingual Complex Named Entity Recognition (MultiCoNER). SemEval 2022. https://aclanthology.org/2022.semeval-1.196/
17. Derczynski, L., Nichols, E., van Erp, M., and Limsopatham, N. (2017). Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition. W-NUT 2017. https://aclanthology.org/W17-4418/
18. Rücker, S., and Akbik, A. (2023). CleanCoNLL: A Nearly Noise-Free Named Entity Recognition Dataset. EMNLP 2023. https://arxiv.org/abs/2310.16225
19. Sainz, O., Garcia-Ferrero, I., Agerri, R., Lopez de Lacalle, O., Rigau, G., and Agirre, E. (2023). GoLLIE: Annotation Guidelines improve Zero-Shot Information-Extraction. arXiv:2310.03668. https://arxiv.org/abs/2310.03668
20. Zaratiana, U., Tomeh, N., Holat, P., and Charnois, T. (2023). GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer. NAACL 2024. https://arxiv.org/abs/2311.08526
21. Keraghel, I., Morbieu, S., and Nadif, M. (2024). A survey on recent advances in Named Entity Recognition. arXiv:2401.10825. https://arxiv.org/abs/2401.10825
22. Warner, B., Chaffin, A., Clavie, B., et al. (2024). Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference (ModernBERT). arXiv:2412.13663. https://arxiv.org/abs/2412.13663
