ELMo, short for Embeddings from Language Models, is a deep contextualized word embedding method introduced in early 2018 by researchers at the Allen Institute for AI (AI2) and the University of Washington. Presented in the paper "Deep contextualized word representations" by Matthew E. Peters and colleagues, ELMo won the Best Paper Award at the 2018 NAACL conference and rapidly became one of the most influential publications in modern natural language processing (NLP), accumulating well over 10,000 citations within a few years of release. [1] [2]
ELMo represents a turning point in the history of language modeling. Where earlier approaches such as word2vec and GloVe assigned a single static vector to each word in the vocabulary, ELMo produces representations that change based on the surrounding sentence. The word "bank" in "river bank" receives a different embedding than the same word in "bank account," because the contextual signal of nearby tokens flows through a deep bidirectional recurrent neural network before the final vector is produced. This shift, from static lookup tables to context-sensitive functions of the entire input sequence, opened the door for the transfer learning era that culminated in BERT, GPT-2, and modern transformer-based large language models. [3]
The ELMo system trains a large bidirectional language model (often abbreviated biLM) on a massive corpus of text, and then exposes the internal hidden states of that model as a feature extractor for downstream tasks. Practitioners simply concatenate ELMo vectors to the input of an existing task-specific architecture, a recipe that at the time of release immediately delivered state-of-the-art results on six different NLP benchmarks. The approach popularized the broader pattern of using a pre-trained language model as a feature extractor, a paradigm that within a year was extended and refined by BERT into the more aggressive fine-tuning regime that defines current practice. [1] [4]
For most of the deep learning era of NLP, word representations were learned in isolation from any particular task and then frozen for use as input features. The dominant techniques were Mikolov's word2vec (2013), Pennington's GloVe (2014), and FastText (2016). Each of these methods returns a single vector per surface form, regardless of the syntactic or semantic role the word is playing in a given sentence. A polysemous word like "play" receives the same vector whether it appears in "play the violin," "play tennis," or "the children's play." Static embeddings cannot represent this kind of context-dependent variation, and downstream models had to learn it themselves from limited labeled data. [3] [5]
Researchers explored several intermediate steps before ELMo. McCann and colleagues introduced CoVe (Contextual Word Vectors) in mid-2017. CoVe extracted hidden states from the encoder of a machine translation system and supplied them as additional features to downstream classifiers. The approach demonstrated that contextual signals could improve performance, but it relied on parallel translation data and was therefore limited by the size and language coverage of available bilingual corpora. [6]
A closer predecessor came from the same Allen Institute group that would later publish ELMo. In 2017, Peters, Ammar, and colleagues released TagLM, a model that combined fixed word embeddings with hidden states from a bidirectional language model trained on the One Billion Word Benchmark. TagLM achieved state-of-the-art results on named entity recognition (NER) and chunking, and it established the core idea that informed ELMo: a language model trained on a large unlabeled corpus carries general linguistic knowledge that can be transferred into supervised systems. ELMo expanded this idea by producing a richer, deeper, and more flexible representation. [7]
A parallel line of work that arrived around the same time was Howard and Ruder's ULMFiT (Universal Language Model Fine-tuning), released in early 2018. ULMFiT also pre-trained a recurrent language model on a large corpus, but it took a different transfer approach: instead of using the model as a feature extractor, ULMFiT fine-tuned the entire language model on the target task, with techniques such as discriminative learning rates and gradual unfreezing. The intellectual race between feature-based ELMo and fine-tuning ULMFiT helped motivate later transformer-based models, which adopted fine-tuning as the dominant paradigm. [8]
ELMo's architecture has three main components stacked on top of each other: a character-level convolutional layer that produces context-independent token embeddings, a deep bidirectional LSTM language model that processes the sequence of token embeddings, and a learned linear combination of the LSTM's hidden states that yields the final task-specific contextual vector. Each piece is described below. [1]
ELMo does not start from a fixed vocabulary of word indices. Instead, every input token is first decomposed into its sequence of UTF-8 characters, and a stack of one-dimensional convolutional filters is applied across those characters. The standard configuration uses kernel sizes of 1, 2, 3, 4, 5, 6, and 7 characters with corresponding numbers of filters (32, 32, 64, 128, 256, 512, and 1024 channels respectively). The outputs of these filters are max-pooled to produce a 2,048-dimensional vector that captures the morphological structure of the token. The vector is then passed through two highway layers and a linear projection to a final 512-dimensional context-insensitive representation. [9]
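The sketch below illustrates the shape of this encoder in PyTorch. The filter widths and channel counts follow the configuration just described; the class name, character-vocabulary size, and activation choices are illustrative assumptions rather than details copied from the official implementation.

```python
import torch
import torch.nn as nn

class CharCNNTokenEncoder(nn.Module):
    """Sketch of an ELMo-style character-CNN token encoder (illustrative)."""

    def __init__(self, n_chars=262, char_dim=16, out_dim=512):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # (kernel width, filter count) pairs from the standard configuration;
        # the filter counts sum to 2,048.
        specs = [(1, 32), (2, 32), (3, 64), (4, 128), (5, 256), (6, 512), (7, 1024)]
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, n_filters, width) for width, n_filters in specs
        )
        n_total = sum(n for _, n in specs)  # 2,048
        # Two highway layers, then a linear projection down to 512 dimensions.
        self.highway = nn.ModuleList(nn.Linear(n_total, 2 * n_total) for _ in range(2))
        self.proj = nn.Linear(n_total, out_dim)

    def forward(self, char_ids):
        # char_ids: (n_tokens, max_token_length) character indices, one row per
        # token; max_token_length must cover the widest (width-7) kernel.
        x = self.char_emb(char_ids).transpose(1, 2)   # (n_tokens, char_dim, length)
        # Convolve at each width, then max-pool over character positions.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        h = torch.cat(pooled, dim=1)                   # (n_tokens, 2048)
        for layer in self.highway:
            transform, gate = layer(h).chunk(2, dim=1)
            g = torch.sigmoid(gate)
            h = g * torch.relu(transform) + (1 - g) * h
        return self.proj(h)                            # (n_tokens, 512)
```

In the released model, each token is padded or truncated to a fixed 50-character window with begin- and end-of-word markers, which also guarantees that the widest filter always fits.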
This design has two practical advantages. First, the model can produce representations for any token, including out-of-vocabulary words and misspellings, because it never depends on a closed word list. Second, the character-level encoder captures useful morphological cues such as prefixes, suffixes, and capitalization patterns, which are particularly helpful for tasks like named entity recognition. [9]
The context-insensitive token vectors are fed into a deep bidirectional language model. Importantly, the biLM is really two language models trained jointly, with shared input and softmax weights but separate LSTM stacks: a forward language model that reads left-to-right and predicts the next token given the previous tokens, and a backward language model that reads right-to-left and predicts the previous token given the following tokens. [1]
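Concretely, for a sequence of N tokens, pretraining jointly maximizes the log-likelihood of the forward and backward directions; in the paper's notation, the token-representation parameters $\Theta_x$ and softmax parameters $\Theta_s$ are shared, while the two LSTM stacks $\overrightarrow{\Theta}_{LSTM}$ and $\overleftarrow{\Theta}_{LSTM}$ are separate:

$$
\sum_{k=1}^{N} \Big( \log p\big(t_k \mid t_1, \ldots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s\big) \;+\; \log p\big(t_k \mid t_{k+1}, \ldots, t_N;\, \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s\big) \Big)
$$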
The forward and backward LSTMs each have L = 2 layers with 4,096 hidden units and a 512-dimensional projection per layer, with a residual connection from the first to the second layer. After training, every input token at position k is associated with 2L + 1 = 5 vectors:

- the context-insensitive token representation x_k produced by the character CNN (shared by both directions);
- the forward LSTM's hidden states at layers 1 and 2;
- the backward LSTM's hidden states at layers 1 and 2.
These five vectors form the raw representation set R_k that ELMo exposes to downstream tasks. The forward and backward LSTMs are trained as separate language models because the standard left-to-right next-token objective cannot use future tokens without leaking information. ELMo's bidirectionality is therefore a concatenation of two independent unidirectional models, a structural choice that BERT later replaced with masked language modeling to achieve genuine joint conditioning on both directions. [1] [10]
The central innovation of ELMo over earlier biLM-feature approaches such as TagLM is that the downstream task does not just use the top LSTM layer. Instead, ELMo collapses the 2L + 1 vectors into a single representation by computing a task-specific linear combination:
$$
\mathrm{ELMo}_k^{task} \;=\; \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}
$$
Here $s_j^{task}$ is a softmax-normalized weight assigned to layer $j$, and $\gamma^{task}$ is a single scaling factor. Both sets of parameters are learned per downstream task, and they are the only ELMo parameters that move during training of the supervised system; the biLM weights themselves remain frozen. The number of trainable ELMo parameters per task is therefore tiny ($L + 2$ scalars), which keeps the computational cost of integrating ELMo into existing models very low. [1] [11]
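A minimal PyTorch sketch of this mixing scheme is shown below; the class name and defaults are illustrative (AllenNLP shipped its own implementation of the same idea).

```python
import torch
import torch.nn as nn

class ElmoScalarMix(nn.Module):
    """Task-specific linear combination of biLM layers (illustrative sketch)."""

    def __init__(self, num_layers=3):   # L + 1 = 3 layers when L = 2
        super().__init__()
        # The only ELMo parameters trained on the downstream task:
        # L + 1 mixing logits plus one scale, i.e. L + 2 scalars in total.
        self.s = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(()))

    def forward(self, layer_states):
        # layer_states: list of (batch, seq_len, dim) tensors, one per layer.
        weights = torch.softmax(self.s, dim=0)        # softmax-normalized s_j
        mixed = sum(w * h for w, h in zip(weights, layer_states))
        return self.gamma * mixed                     # scaled by gamma
```

In the released implementation, the 2L + 1 vectors are grouped into L + 1 = 3 layers of width 1,024 by concatenating each layer's forward and backward states (the 512-dimensional token embedding is duplicated to match), so the mix runs over three tensors.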
This weighted combination matters because the different layers of the biLM encode different kinds of linguistic information. Probing studies showed that lower LSTM layers tend to capture syntactic information such as part-of-speech and local phrase structure, while upper layers carry more abstract semantic information such as word sense and discourse-level cues. By learning the mixing weights from data, the downstream model can emphasize the layer most relevant to its task: a syntactic chunker may upweight layer 1, while a sentiment classifier or coreference resolver may upweight layer 2. [12]
The baseline English ELMo model is trained on the One Billion Word Benchmark introduced by Chelba and colleagues in 2014, a corpus of approximately 30 million sentences and 1 billion tokens drawn from news text. After ten epochs of training, the average forward and backward perplexities reach about 39.7. AI2 also released a larger ELMo variant trained on a 5.5-billion-token corpus combining English Wikipedia (1.9 billion tokens) and monolingual news crawl data from WMT 2008 to 2012 (3.6 billion tokens). The 5.5B model achieves modestly better downstream performance than the 1B model and was recommended as the default by the time the open-source release matured. [13] [14]
Training the biLM is computationally demanding for its era. The Allen Institute reported that training the original 1B-token model on a single GPU server took about two weeks. The model size is roughly 93.6 million parameters when the character CNN, both LSTM stacks, and the softmax projection are counted. Although small by 2026 standards, this scale was substantial for a 2018 NLP system and reflected a new willingness in the research community to invest serious compute in unsupervised pretraining. [9]
ELMo embeddings have also been trained for many languages outside English. The HIT-SCIR group released ELMoForManyLangs, which provides pretrained ELMo models for dozens of languages drawn from the CoNLL 2018 shared task data. Independent groups produced Portuguese, German, Slovenian, Croatian, Estonian, Finnish, Swedish, and Russian ELMo models, often using language-specific Wikipedia and CommonCrawl corpora. These multilingual variants extended ELMo's reach into NLP communities that lacked the labeled data required to train competitive models from scratch. [15] [16]
The ELMo paper presents a remarkably simple integration recipe. To add ELMo to an existing supervised model, the practitioner concatenates the ELMo vector to the input embeddings of each token (and optionally to the output of the model's encoder as well). The supervised loss then propagates gradients back through the ELMo mixing weights gamma and s while leaving the biLM frozen. No architectural changes to the downstream model are required, and the additional training time is modest because the biLM forward pass is expensive but happens only once per minibatch. [1]
This plug-and-play property is one reason ELMo became so widely adopted so quickly. Almost any sequence labeling, classification, or pairwise comparison model could absorb ELMo with a few lines of code, immediately closing a substantial gap to the state of the art. AllenNLP shipped reference implementations and a Python interface that lowered the barrier even further. [4]
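A minimal sketch of that recipe with AllenNLP's Python interface is shown below. The `Elmo` module and `batch_to_ids` helper are real parts of the (now-archived) library; the checkpoint file paths are placeholders for files downloaded from the AllenNLP site, and the GloVe tensor is a stand-in for whatever input embeddings the downstream model already uses.

```python
import torch
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "elmo_options.json"   # placeholder path to downloaded options
weight_file = "elmo_weights.hdf5"    # placeholder path to downloaded weights

# num_output_representations=1 yields one task-specific scalar mix;
# its weights (s_j and gamma) are the only ELMo parameters that train.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["The", "river", "bank", "flooded", "."],
             ["She", "opened", "a", "bank", "account", "."]]
character_ids = batch_to_ids(sentences)            # (batch, seq_len, 50) char ids
output = elmo(character_ids)
elmo_vectors = output["elmo_representations"][0]   # (batch, seq_len, 1024)

# Downstream integration: concatenate to the model's existing word
# embeddings before its encoder. Zeros stand in for real GloVe inputs.
glove_vectors = torch.zeros(2, 6, 300)
encoder_inputs = torch.cat([glove_vectors, elmo_vectors], dim=-1)
```

Note that the two occurrences of "bank" above receive different 1,024-dimensional vectors, which is exactly the contextual behavior described earlier.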
The ELMo paper reported state-of-the-art results on six diverse NLP tasks at the time of its release, in each case by adding ELMo to a strong task-specific baseline model. Relative error reductions over those baselines ranged from approximately 6% to 25%, demonstrating that the gains were not specific to a single benchmark but broadly applicable. [1]
| Task | Dataset | Baseline without ELMo | With ELMo | Absolute improvement |
|---|---|---|---|---|
| Question Answering | SQuAD 1.1 (F1) | 81.1 | 85.8 | +4.7 |
| Textual Entailment | SNLI (accuracy) | 88.0 | 88.7 | +0.7 |
| Semantic Role Labeling | OntoNotes (F1) | 81.4 | 84.6 | +3.2 |
| Coreference Resolution | OntoNotes (F1) | 67.2 | 70.4 | +3.2 |
| Named Entity Recognition | CoNLL 2003 (F1) | 90.15 | 92.22 | +2.06 |
| Sentiment Analysis | SST-5 (accuracy) | 51.4 | 54.7 | +3.3 |
The gains on semantic role labeling and coreference resolution were especially striking; semantic role labeling alone saw a relative error reduction above 17%. These tasks rely heavily on long-range syntactic and semantic relationships, which the deep biLM appears particularly good at capturing through its layered representations. [1] [17]
ELMo's contemporaries used different architectures and pretraining strategies. The table below summarizes the most important contrasts among ELMo, OpenAI's first GPT (June 2018), and Google's BERT (October 2018), the three systems that defined the early transfer learning wave. [10] [18]
| Property | ELMo (Feb 2018) | GPT-1 (Jun 2018) | BERT (Oct 2018) |
|---|---|---|---|
| Backbone | Bidirectional LSTM | Transformer decoder | Transformer encoder |
| Direction | Two independent forward / backward LMs concatenated | Strict left-to-right | Joint bidirectional via masked LM |
| Pretraining objective | Standard next-token language modeling (forward and backward separately) | Standard left-to-right language modeling | Masked language modeling + Next-sentence prediction |
| Use in downstream models | Feature-based (frozen biLM, plug-in vectors) | Fine-tuning | Fine-tuning |
| Trainable downstream params (per task) | L + 2 mixing scalars | All transformer params | All transformer params |
| Tokenization | Character CNN over UTF-8 | Byte-pair encoding (BPE) | WordPiece |
| Pretraining corpus | 1B Word Benchmark (or 5.5B variant) | BookCorpus (~800M words) | BookCorpus + English Wikipedia (~3.3B words) |
| Best Paper Award | NAACL 2018 | None | NAACL 2019 |
The contrasts highlight a few key engineering shifts. Replacing the LSTM with a transformer dramatically improved the model's ability to model long-range dependencies and made training more parallelizable. Switching from feature-based use to full fine-tuning gave downstream tasks more flexibility but required more memory and more careful regularization. The masked language modeling objective in BERT removed the need for two separate unidirectional language models and let every layer attend jointly to context on both sides of a token, which is what "truly bidirectional" means in the BERT literature. [10] [18]
ELMo sits inside a broader sequence of innovations that transformed NLP between 2017 and 2020. The table below traces the key contextual embedding and language model releases of that period. [3] [10] [18]
| Date | Model | Affiliation | Key Innovation |
|---|---|---|---|
| 2013 | word2vec | Google (Mikolov et al.) | Static word embeddings via skip-gram and CBOW |
| 2014 | GloVe | Stanford (Pennington et al.) | Static embeddings from co-occurrence matrix factorization |
| 2016 | FastText | Facebook AI Research | Static embeddings with subword n-grams |
| 2017 (Apr) | TagLM | AI2 (Peters et al.) | First demonstration that biLM hidden states improve sequence tagging |
| 2017 (Aug) | CoVe | Salesforce (McCann et al.) | Contextual vectors from a translation encoder |
| 2018 (Jan / May) | ULMFiT | fast.ai (Howard & Ruder) | Fine-tuning of pretrained AWD-LSTM language model with discriminative learning rates |
| 2018 (Feb) | ELMo | AI2 + UW (Peters et al.) | Deep biLSTM language model with task-specific layer mixing |
| 2018 (Jun) | GPT-1 | OpenAI (Radford et al.) | Transformer decoder pretrained as left-to-right LM, fine-tuned on tasks |
| 2018 (Oct) | BERT | Google (Devlin et al.) | Transformer encoder with masked language modeling and next-sentence prediction |
| 2019 (Feb) | GPT-2 | OpenAI (Radford et al.) | 1.5B-parameter transformer LM with strong zero-shot abilities |
| 2019 (Jun) | XLNet | Google + CMU | Permutation language modeling combining autoregressive and bidirectional benefits |
| 2019 (Jul) | RoBERTa | Facebook AI Research | Re-trained BERT with more data, longer training, and removed NSP |
| 2019 (Sep) | ALBERT | Google Research | Factorized embeddings and cross-layer parameter sharing |
| 2019 (Oct) | T5 | Google (Raffel et al.) | Text-to-text framing of all NLP tasks; encoder-decoder transformer |
| 2020 (May) | GPT-3 | OpenAI | 175B-parameter transformer with in-context learning |
ELMo occupies a bridging position in this timeline: it arrived after the static embedding era and just before the transformer revolution. Its biLSTM architecture was already being phased out within months of its release, but the conceptual framework it introduced (pretrain a language model on a large unlabeled corpus, then transfer the resulting representations to supervised tasks) became the template for everything that followed. [3]
One of the more interesting follow-up findings about ELMo concerns what the different layers of its biLM actually learn. Probing studies, in which a small classifier is trained on top of frozen ELMo vectors to predict a particular linguistic property, revealed a clean hierarchy: [12] [19]

- Layer 0, the character-CNN output, encodes morphology and word identity but carries no contextual information.
- Layer 1, the first biLSTM layer, is most informative for syntactic properties such as part-of-speech tags and local phrase structure.
- Layer 2, the second biLSTM layer, is most informative for semantic properties such as word sense and the long-range cues used in coreference.
This hierarchy explains why the task-specific mixing weights matter so much. A POS tagger should learn to upweight layer 1, while a coreference system should upweight layer 2, and the gamma scaling factor allows the model to balance the resulting linear combination against other features. Studies of BERT and GPT-2 conducted in 2019 reported similar but more nuanced layer-wise hierarchies, suggesting that the property of "different layers encode different abstractions" is general to deep contextual models rather than specific to ELMo. [19]
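The sketch below shows the general shape of such a probe, assuming per-token ELMo states have already been extracted layer by layer; the array names and the 80/20 split are illustrative, not taken from any particular study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_layer(states_by_layer, labels, layer):
    """Fit a linear probe on one frozen biLM layer (0 = char-CNN output,
    1 and 2 = LSTM layers) and return held-out accuracy."""
    X = states_by_layer[layer]          # (n_tokens, dim) frozen ELMo vectors
    y = np.asarray(labels)              # (n_tokens,) e.g. POS tag ids
    n_train = int(0.8 * len(y))
    clf = LogisticRegression(max_iter=1000).fit(X[:n_train], y[:n_train])
    return clf.score(X[n_train:], y[n_train:])

# Comparing accuracies across layers reproduces the hierarchy above:
# syntactic probes tend to peak at layer 1, semantic probes at layer 2.
```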
ELMo's wide adoption was accelerated by its tight integration with AllenNLP, an open-source PyTorch-based NLP research framework released by AI2 in 2017. AllenNLP shipped reference implementations of common NLP architectures (semantic role labelers, coreference resolvers, NER taggers) along with a clean abstraction for plugging in token embedders. ELMo was supported as a first-class embedding option, which allowed researchers to switch from GloVe or word2vec inputs to ELMo with a single configuration change. [4]
AI2 also released the underlying TensorFlow implementation as `bilm-tf` and a separate command-line tool, `allennlp elmo`, that produces pre-computed ELMo vectors as HDF5 files for use in arbitrary frameworks. The trained 1B and 5.5B model checkpoints, plus a smaller 256-hidden-unit variant for users with constrained compute, were made publicly downloadable from the AllenNLP website. The ecosystem encouraged adoption across academia and industry well beyond the original AllenNLP user base. [4] [14]
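Vectors dumped this way can be consumed from any framework. The snippet below reads such a file with `h5py`, assuming a layout in which each input line becomes a dataset keyed by its line index with one row per biLM layer; this assumption should be checked against the tool version in use.

```python
import h5py

# Read pre-computed ELMo vectors for the first input sentence.
with h5py.File("elmo_vectors.hdf5", "r") as f:
    states = f["0"][...]        # assumed key: line index as a string
    print(states.shape)         # expected (3, num_tokens, 1024): one row per layer
```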
AllenNLP itself was deprecated in late 2022 after its core team disbanded, but the ELMo model files remain available and the conceptual influence of the framework persisted in successor libraries such as Hugging Face's transformers. [20]
ELMo's influence on the trajectory of NLP can be summarized in three threads. First, it normalized the practice of treating a pretrained language model as the central feature extractor for downstream tasks, a paradigm now called transfer learning for NLP. Within months, virtually every leaderboard for English NLP tasks was dominated by systems built on either ELMo or one of its rapid successors. [10]
Second, ELMo demonstrated empirically that scaling pretraining (more data, deeper models, longer training) produces broad downstream gains. The lesson was not lost on subsequent research. Transformer-based successors quickly adopted larger models, larger corpora, and longer training schedules, and the trend toward ever-larger language models that culminated in GPT-3 and modern frontier systems is in many ways an extrapolation of the curve that ELMo first traced. [21]
Third, ELMo set the standard for engineering deliverables expected of new methods papers. The release combined a clear paper, multiple model checkpoints, an open-source training and inference codebase, integration into a production-quality framework (AllenNLP), and detailed benchmark numbers across a diverse set of tasks. This combination became the implicit norm for follow-up releases, including BERT, GPT-2, RoBERTa, and XLNet. [4]
Despite its impact, ELMo has several well-known limitations that became increasingly visible as transformer-based successors arrived. [10]

- The sequential LSTM backbone cannot be parallelized across time steps and handles long-range dependencies less well than transformer self-attention.
- Its bidirectionality is shallow: the forward and backward language models never condition on each other's context, unlike BERT's jointly bidirectional masked language modeling.
- The frozen, feature-based transfer regime gives downstream tasks less flexibility than full fine-tuning of the pretrained network.
- Its pretraining scale, roughly 94 million parameters trained on at most 5.5 billion tokens, was quickly dwarfed by transformer successors.
These limitations did not erase ELMo's contributions; they motivated the next generation of systems. By late 2018 and through 2019, BERT, GPT, XLNet, and RoBERTa progressively replaced ELMo on most leaderboards. By 2020, ELMo had been largely retired from frontline production use, although it persists as a robust baseline for low-resource languages and continues to be cited in pedagogy as the canonical introduction to contextual word representations. [10] [22]
ELMo occupies a special place in NLP history as the model that crystallized the transfer learning paradigm and proved the value of deep, contextualized representations. Its measurable impact on benchmark scores, the elegance of its task-specific mixing scheme, and the open-source rigor of its release together made it a textbook example of how to do impactful applied research in the field. [3]
The paper's most direct intellectual descendant is arguably BERT, which can be read as ELMo with the LSTM replaced by a transformer encoder, the concatenated forward and backward LMs replaced by joint masked language modeling, and the feature-based use replaced by fine-tuning. A popular 2018 explainer captured this lineage in its very title: Jay Alammar's widely read "The Illustrated BERT, ELMo, and co.: How NLP Cracked Transfer Learning." Both models are usually introduced together in coursework and survey papers because they tell two halves of the same story. [10]
Beyond its specific architectural choices, ELMo's broader legacy is a methodological one. It established that the right way to make progress in NLP is to pretrain a large language model on a vast unlabeled corpus, then transfer the resulting representations to downstream tasks. Every modern frontier system, from BERT through GPT-2 and the transformer-based large language models that followed, builds on that template. ELMo's particular combination of biLSTM and feature-based use has been superseded, but the strategic insight that drove it has only grown more central to the field. [3] [21]