# ELMo (Embeddings from Language Models)

> Source: https://aiwiki.ai/wiki/elmo
> Updated: 2026-06-21
> Categories: AI History, AI Models, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

## Introduction

**ELMo** (Embeddings from Language Models) is a deep contextualized [word embedding](/wiki/word_embedding) method, introduced in 2018 by the [Allen Institute for AI](/wiki/allen_institute_for_ai) (AI2) and the University of Washington, that produces a different vector for each word depending on the sentence it appears in. Unlike static embeddings such as [word2vec](/wiki/word2vec) and GloVe, which assign one fixed vector per word, ELMo computes representations from the internal states of a pretrained deep bidirectional [language model](/wiki/language_model) (biLM), and these vectors improved the state of the art on six different [natural language processing](/wiki/natural_language_processing) (NLP) tasks when added to existing models. [1]

Presented in the paper "Deep contextualized word representations" by Matthew E. Peters and colleagues, ELMo won the Best Paper Award at the 2018 NAACL conference and rapidly became one of the most influential publications in modern NLP, accumulating well over 10,000 citations within a few years of release. [1] [2] The paper's abstract states that ELMo word vectors "are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus," and that they "can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis." [1]

ELMo represents a turning point in the history of [language modeling](/wiki/language_model). Where earlier approaches such as [word2vec](/wiki/word2vec) and GloVe assigned a single static vector to each word in the vocabulary, ELMo produces representations that change based on the surrounding sentence. The word "bank" in "river bank" receives a different embedding than the same word in "bank account," because the contextual signal of nearby tokens flows through a deep bidirectional [recurrent neural network](/wiki/recurrent_neural_network) before the final vector is produced. This shift, from static lookup tables to context-sensitive functions of the entire input sequence, opened the door for the transfer learning era that culminated in [BERT](/wiki/bert), [GPT-2](/wiki/gpt-2), and modern transformer-based [large language models](/wiki/large_language_model). [3]

The ELMo system trains a large bidirectional language model (often abbreviated biLM) on a massive corpus of text, and then exposes the internal hidden states of that model as features for downstream tasks. Practitioners concatenate ELMo vectors to the input of an existing task-specific architecture; at the time of release, doing so delivered state-of-the-art results on six different NLP benchmarks. The approach popularized the broader pattern of using a pre-trained language model as a feature extractor, a paradigm that BERT extended within a year into the fine-tuning regime that defines current practice. [1] [4]

## What does ELMo stand for and when was it released?

ELMo is an acronym for Embeddings from Language Models. It was first posted as an arXiv preprint in February 2018 and formally published in June 2018 at the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT) in New Orleans, where it received the Best Paper Award. The work was a collaboration between the Allen Institute for AI and the Paul G. Allen School of Computer Science and Engineering at the University of Washington, with Matthew E. Peters as lead author. [1] [2]

## Background and Motivation

For most of the deep learning era of NLP, word representations were learned in isolation from any particular task and then frozen for use as input features. The dominant techniques were Mikolov's [word2vec](/wiki/word2vec) (2013), Pennington's GloVe (2014), and FastText (2016). Each of these methods returns a single vector per surface form, regardless of the syntactic or semantic role the word is playing in a given sentence. A polysemous word like "play" receives the same vector whether it appears in "play the violin," "play tennis," or "the children's play." Static embeddings cannot represent this kind of context-dependent variation, and downstream models had to learn it themselves from limited labeled data. [3] [5]

Researchers explored several intermediate steps before ELMo. McCann and colleagues introduced **CoVe** (Contextual Word Vectors) in mid-2017. CoVe extracted hidden states from the encoder of a machine translation system and supplied them as additional features to downstream classifiers. The approach demonstrated that contextual signals could improve performance, but it relied on parallel translation data and was therefore limited by the size and language coverage of available bilingual corpora. [6]

A closer predecessor came from the same Allen Institute group that would later publish ELMo. In 2017, Peters, Ammar, and colleagues released **TagLM**, a model that combined fixed word embeddings with hidden states from a bidirectional language model trained on the One Billion Word Benchmark. TagLM achieved state-of-the-art results on named entity recognition (NER) and chunking, and it established the core idea that informed ELMo: a [language model](/wiki/language_model) trained on a large unlabeled corpus carries general linguistic knowledge that can be transferred into supervised systems. ELMo expanded this idea by producing a richer, deeper, and more flexible representation. [7]

A parallel line of work that arrived around the same time was Howard and Ruder's **ULMFiT** (Universal Language Model Fine-tuning), released in early 2018. ULMFiT also pre-trained a recurrent language model on a large corpus, but it took a different transfer approach: instead of using the model as a feature extractor, ULMFiT fine-tuned the entire language model on the target task, with techniques such as discriminative learning rates and gradual unfreezing. The intellectual race between feature-based ELMo and fine-tuning ULMFiT helped motivate later transformer-based models, which adopted fine-tuning as the dominant paradigm. [8]

## How does ELMo work?

ELMo's architecture has three main components stacked on top of each other: a character-level convolutional layer that produces context-independent token embeddings, a deep bidirectional [LSTM](/wiki/long_short-term_memory_lstm) language model that processes the sequence of token embeddings, and a learned linear combination of the LSTM's hidden states that yields the final task-specific contextual vector. Each piece is described below. [1]

### Character-Level Input Representation

ELMo does not start from a fixed vocabulary of word indices. Instead, every input token is first decomposed into its sequence of UTF-8 characters, and a stack of one-dimensional convolutional filters is applied across those characters. The standard configuration uses kernel sizes of 1, 2, 3, 4, 5, 6, and 7 characters with corresponding numbers of filters (32, 32, 64, 128, 256, 512, and 1024 channels respectively). The outputs of these filters are max-pooled to produce a 2,048-dimensional vector that captures the morphological structure of the token. The vector is then passed through two highway layers and a linear projection to a final 512-dimensional context-insensitive representation. [9]
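The filter arithmetic above can be checked with a short sketch. It is illustrative only: `FILTERS` mirrors the configuration listed in the text, while `max_over_time_conv`, the toy embedding size, the zero-padding rule, and the random weights are hypothetical stand-ins rather than the released implementation.

```python
import random

# (kernel width in characters, number of filters), as listed in the text.
FILTERS = [(1, 32), (2, 32), (3, 64), (4, 128), (5, 256), (6, 512), (7, 1024)]

def pooled_dim(filters):
    """Each filter contributes one max-pooled value, so the counts add up."""
    return sum(n for _, n in filters)

def max_over_time_conv(char_vecs, width, weights):
    """Slide one filter of `width` characters over a token and max-pool.

    char_vecs: per-character embedding vectors (lists of floats).
    weights:   `width` weight vectors, one per character slot in the window.
    Tokens shorter than the window are zero-padded (an assumption of this sketch).
    """
    dim = len(char_vecs[0])
    padded = char_vecs + [[0.0] * dim] * max(0, width - len(char_vecs))
    best = float("-inf")
    for start in range(len(padded) - width + 1):
        window = padded[start:start + width]
        activation = sum(w * x
                         for w_vec, x_vec in zip(weights, window)
                         for w, x in zip(w_vec, x_vec))
        best = max(best, activation)
    return best

rng = random.Random(0)
CHAR_DIM = 4  # toy character-embedding size chosen for this sketch
token = "unbelievably"
chars = [[rng.uniform(-1, 1) for _ in range(CHAR_DIM)] for _ in token]
filt = [[rng.uniform(-1, 1) for _ in range(CHAR_DIM)] for _ in range(3)]

total = pooled_dim(FILTERS)
feature = max_over_time_conv(chars, 3, filt)  # one scalar per filter
print(total)  # 2048
```

Because every filter yields a single pooled value regardless of token length, any string of characters maps to a fixed-size vector, which is why unseen words and misspellings still receive a representation.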

This design has two practical advantages. First, the model can produce representations for any token, including out-of-vocabulary words and misspellings, because it never depends on a closed word list. Second, the character-level encoder captures useful morphological cues such as prefixes, suffixes, and capitalization patterns, which are particularly helpful for tasks like named entity recognition. [9]

### Bidirectional Language Model (biLM)

The context-insensitive token vectors are fed into a deep bidirectional language model. The biLM is really two language models that are trained jointly and share the input (token representation) and softmax weights but keep separate LSTM stacks: a forward language model that reads left-to-right and predicts the next token given the previous tokens, and a backward language model that reads right-to-left and predicts the previous token given the following tokens. [1]
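To make the two prediction directions concrete, the sketch below lists which context each model may condition on for every target token. The helper name `lm_training_pairs` is hypothetical, and sentence-boundary markers are left out for brevity.

```python
def lm_training_pairs(tokens):
    """Return (forward, backward) lists of (context, target) pairs.

    The forward model predicts token k from tokens[:k]; the backward model
    predicts the same token from tokens[k + 1:].
    """
    forward = [(tokens[:k], tokens[k]) for k in range(len(tokens))]
    backward = [(tokens[k + 1:], tokens[k]) for k in range(len(tokens))]
    return forward, backward

sentence = ["the", "river", "bank", "flooded"]
fwd, bwd = lm_training_pairs(sentence)
for (f_ctx, target), (b_ctx, _) in zip(fwd, bwd):
    print(f"{target!r}: forward sees {f_ctx}, backward sees {b_ctx}")
```

Neither model ever sees the target token or context from both sides at once; the two views only meet when their hidden states are concatenated.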

The forward and backward LSTMs each have L = 2 layers with 4,096 hidden units and a 512-dimensional projection per layer, with a residual connection from the first to the second layer. After training, every input token at position k is associated with 2L + 1 = 5 vectors, which group into L + 1 = 3 layer representations once the two directions at each LSTM layer are concatenated:

1. Layer 0: the context-insensitive character-CNN embedding (one vector).
2. Layer 1: the forward LSTM hidden state and the backward LSTM hidden state at layer 1 (two vectors, concatenated into one).
3. Layer 2: the forward LSTM hidden state and the backward LSTM hidden state at layer 2 (two vectors, concatenated into one).

The five underlying vectors form the raw representation set R_k that ELMo exposes to downstream tasks. The forward and backward LSTMs are trained as separate language models because the standard left-to-right next-token objective cannot use future tokens without leaking information. ELMo's bidirectionality is therefore a concatenation of two independent unidirectional models, a structural choice that BERT later replaced with masked language modeling to achieve genuine joint conditioning on both directions. [1] [10]
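The bookkeeping for the raw vectors can be sketched as follows. The vector contents are dummy values, `build_layers` is a hypothetical helper, and duplicating the token embedding so that layer 0 matches the width of the concatenated LSTM layers is an assumption of this sketch rather than something stated above.

```python
L = 2           # biLSTM layers per direction, as in the text
PROJ_DIM = 512  # projection size per direction, as in the text

def build_layers(token_emb, fwd_states, bwd_states):
    """Group the 2L + 1 raw vectors into L + 1 layer representations."""
    # Layer 0: duplicate the token embedding to reach 2 * PROJ_DIM (assumption).
    layers = [token_emb + token_emb]
    for fwd, bwd in zip(fwd_states, bwd_states):
        layers.append(fwd + bwd)  # list concatenation: [forward; backward]
    return layers

token_emb = [0.0] * PROJ_DIM
fwd_states = [[0.1] * PROJ_DIM for _ in range(L)]
bwd_states = [[0.2] * PROJ_DIM for _ in range(L)]

raw_vectors = 1 + len(fwd_states) + len(bwd_states)
layers = build_layers(token_emb, fwd_states, bwd_states)
print(raw_vectors, len(layers), len(layers[0]))  # 5 3 1024
```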

### Task-Specific Linear Combination

The central innovation of ELMo over earlier biLM-feature approaches such as TagLM is that the downstream task does not just use the top LSTM layer. Instead, ELMo collapses the 2L + 1 vectors into a single representation by computing a task-specific linear combination:

```
ELMo_k^task = gamma^task * sum_{j=0}^{L} s_j^task * h_{k,j}^LM
```

Here s_j^task is a softmax-normalized weight assigned to layer j, and gamma^task is a single scaling factor. The layer weights and the scaling factor are learned per downstream task, and they are the only ELMo parameters that are updated while the supervised system is trained; the biLM weights themselves remain frozen. The number of trainable ELMo parameters per task is therefore tiny (L + 2 scalars, or four when L = 2), which keeps the cost of integrating ELMo into existing models very low. [1] [11]
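A minimal sketch of this weighted sum, using plain Python lists and toy two-dimensional layer vectors. The names `softmax` and `elmo_mix` are hypothetical; in a real system the raw weights and gamma would be trained with the downstream model rather than set by hand.

```python
import math

def softmax(raw):
    """Normalize unconstrained weights so they are positive and sum to 1."""
    m = max(raw)
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_mix(layers, raw_weights, gamma):
    """Compute gamma * sum_j s_j * h_{k,j} with s = softmax(raw_weights)."""
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [gamma * sum(s_j * layer[i] for s_j, layer in zip(s, layers))
            for i in range(dim)]

layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy h_{k,0}, h_{k,1}, h_{k,2}
raw_weights = [0.0, 0.0, 0.0]                  # equal weights after softmax
gamma = 1.0

mixed = elmo_mix(layers, raw_weights, gamma)
trainable_scalars = len(raw_weights) + 1       # L + 1 weights plus gamma
print(trainable_scalars)  # 4
```

With equal raw weights the result is the plain average of the three layers; pushing one raw weight far above the others makes the mix approach that single layer.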

This weighted combination matters because the different layers of the biLM encode different kinds of linguistic information. Probing studies showed that lower LSTM layers tend to capture syntactic information such as part-of-speech and local phrase structure, while upper layers carry more abstract semantic information such as word sense and discourse-level cues. By learning the mixing weights from data, the downstream model can emphasize the layer most relevant to its task: a syntactic chunker may upweight layer 1, while a sentiment classifier or coreference resolver may upweight layer 2. [12]

## How was ELMo trained?

The baseline English ELMo model is trained on the **One Billion Word Benchmark** introduced by Chelba and colleagues in 2014, a corpus of approximately 30 million sentences and 1 billion tokens drawn from news text. After ten epochs of training, the average forward and backward perplexities reach about 39.7. AI2 also released a larger ELMo variant trained on a 5.5-billion-token corpus combining English Wikipedia (1.9 billion tokens) and monolingual news crawl data from WMT 2008 to 2012 (3.6 billion tokens). The 5.5B model achieves modestly better downstream performance than the 1B model and was recommended as the default by the time the open-source release matured. [13] [14]
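For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability a model assigns to each token. The sketch below uses made-up per-token probabilities purely to show the computation; taking the mean of the two directional perplexities is one reading of "average forward and backward perplexity" and is an assumption here.

```python
import math

def perplexity(log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical per-token probabilities, not taken from any real model.
forward_lp = [math.log(p) for p in (0.05, 0.02, 0.04)]
backward_lp = [math.log(p) for p in (0.03, 0.025, 0.02)]

avg_ppl = (perplexity(forward_lp) + perplexity(backward_lp)) / 2
print(round(avg_ppl, 1))
```

A perplexity of about 39.7 means the model is, on average, roughly as uncertain as if it were choosing uniformly among about 40 candidate tokens at each position.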

Training the biLM was computationally demanding for its era. The Allen Institute reported that training the original 1B-token model on a single GPU server took about two weeks. The model has roughly 93.6 million parameters when the character CNN, both LSTM stacks, and the softmax projection are counted. Although small by 2026 standards, this scale was substantial for a 2018 NLP system and reflected a new willingness in the research community to invest serious compute in unsupervised pretraining. [9]

ELMo embeddings have also been trained for many languages outside English. The HIT-SCIR group released ELMoForManyLangs, which provides pretrained ELMo models for dozens of languages drawn from the CoNLL 2018 shared task data. Independent groups produced Portuguese, German, Slovenian, Croatian, Estonian, Finnish, Swedish, and Russian ELMo models, often using language-specific Wikipedia and CommonCrawl corpora. These multilingual variants extended ELMo's reach into NLP communities that lacked the labeled data required to train competitive models from scratch. [15] [16]

## How is ELMo used in downstream models?

The ELMo paper presents a remarkably simple integration recipe. To add ELMo to an existing supervised model, the practitioner concatenates the ELMo vector to the input embedding of each token (and optionally to the output of the model's encoder as well). The supervised loss then propagates gradients back to the ELMo mixing weights gamma and s while leaving the biLM frozen. No architectural changes to the downstream model are required, and the extra training cost stays limited: the biLM forward pass is expensive, but it runs only once per minibatch and no gradients are computed for the biLM weights. [1]
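The input-side concatenation can be sketched in a few lines. The 300-dimensional static vectors and the helper `enhance_inputs` are hypothetical; the 1,024-dimensional ELMo vector follows from concatenating two 512-dimensional directional projections.

```python
def enhance_inputs(token_embs, elmo_vecs):
    """Return [x_k ; ELMo_k] for every token position k."""
    if len(token_embs) != len(elmo_vecs):
        raise ValueError("need exactly one ELMo vector per token")
    return [x + e for x, e in zip(token_embs, elmo_vecs)]

num_tokens = 4
static_embs = [[0.1] * 300 for _ in range(num_tokens)]  # hypothetical static vectors
elmo_vecs = [[0.2] * 1024 for _ in range(num_tokens)]   # mixed ELMo vectors

inputs = enhance_inputs(static_embs, elmo_vecs)
print(len(inputs), len(inputs[0]))  # 4 1324
```

The downstream model only needs its first layer widened to accept the longer input vectors; everything after that layer is unchanged.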

This plug-and-play property is one reason ELMo became so widely adopted so quickly. Almost any sequence labeling, classification, or pairwise comparison model could absorb ELMo with a few lines of code, immediately closing a substantial gap to the state of the art. AllenNLP shipped reference implementations and a Python interface that lowered the barrier even further. [4]

## How well did ELMo perform on benchmarks?

The ELMo paper reported state-of-the-art results on six diverse NLP tasks at the time of its release. Adding ELMo to the authors' baseline models produced relative error reductions ranging from approximately 6% to 24.9%, demonstrating that the gains were not specific to a single benchmark but broadly applicable. [1]

| Task | Dataset | Baseline (without ELMo) | Baseline + ELMo | Absolute improvement |
|------|---------|-------------------------|-----------------|----------------------|
| Question Answering | SQuAD 1.1 (F1) | 81.1 | 85.8 | +4.7 |
| Textual Entailment | SNLI (accuracy) | 88.0 | 88.7 | +0.7 |
| Semantic Role Labeling | OntoNotes (F1) | 81.4 | 84.6 | +3.2 |
| Coreference Resolution | OntoNotes (F1) | 67.2 | 70.4 | +3.2 |
| Named Entity Recognition | CoNLL 2003 (F1) | 90.15 | 92.22 | +2.07 |
| Sentiment Analysis | SST-5 (accuracy) | 51.4 | 54.7 | +3.3 |

The gain on semantic role labeling was especially striking, with a relative error reduction above 17%, and coreference resolution improved by 3.2 F1, a relative error reduction of roughly 10%. These tasks rely heavily on long-range syntactic and semantic relationships, which the deep biLM appears particularly good at capturing through its layered representations. On named entity recognition, the ELMo-enhanced tagger's 92.22 F1 also exceeded the previous best published result of 91.93. [1] [17]
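Relative error reduction measures how much of the remaining error (100 minus the score) a new system removes. The sketch below applies that definition to five rows of the table, treating the first score column as the reference system; the function name is hypothetical and the NER row is omitted for brevity.

```python
def relative_error_reduction(before, after):
    """Fraction of the remaining error (100 - score) removed by the new system."""
    return (after - before) / (100.0 - before)

# (score without ELMo, score with ELMo), copied from the table above.
rows = {
    "SQuAD 1.1": (81.1, 85.8),
    "SNLI": (88.0, 88.7),
    "SRL": (81.4, 84.6),
    "Coref": (67.2, 70.4),
    "SST-5": (51.4, 54.7),
}

reductions = {task: 100 * relative_error_reduction(b, a)
              for task, (b, a) in rows.items()}
for task, pct in reductions.items():
    print(f"{task}: {pct:.1f}%")
```

The SQuAD row comes out at about 24.9%, which matches the upper end of the range quoted above. Equal absolute gains do not translate into equal relative reductions, because the same number of points removes a larger share of the error when the starting score is higher.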

## How does ELMo differ from BERT and GPT?

ELMo's contemporaries used different architectures and pretraining strategies. The table below summarizes the most important contrasts among ELMo, OpenAI's first GPT (June 2018), and Google's BERT (October 2018), the three systems that defined the early transfer learning wave. [10] [18]

| Property | ELMo (Feb 2018) | GPT-1 (Jun 2018) | BERT (Oct 2018) |
|----------|-----------------|-------------------|------------------|
| Backbone | Bidirectional [LSTM](/wiki/long_short-term_memory_lstm) | [Transformer](/wiki/transformer) decoder | Transformer encoder |
| Direction | Two independent forward / backward LMs concatenated | Strict left-to-right | Joint bidirectional via masked LM |
| Pretraining objective | Standard next-token language modeling (forward and backward separately) | Standard left-to-right language modeling | Masked language modeling + Next-sentence prediction |
| Use in downstream models | Feature-based (frozen biLM, plug-in vectors) | Fine-tuning | Fine-tuning |
| Trainable downstream params (per task) | L + 2 mixing scalars | All transformer params | All transformer params |
| Tokenization | Character CNN over UTF-8 | Byte-pair encoding (BPE) | WordPiece |
| Pretraining corpus | 1B Word Benchmark (or 5.5B variant) | BookCorpus (~800M words) | BookCorpus + English Wikipedia (~3.3B words) |
| Best Paper Award | NAACL 2018 | None | NAACL 2019 |

The contrasts highlight a few key engineering shifts. Replacing the LSTM with a transformer dramatically improved the ability to capture long-range dependencies and made training more parallelizable. Switching from feature-based use to full fine-tuning gave downstream tasks more flexibility but required more memory and more careful regularization. The masked language modeling objective in BERT removed the need for two separate unidirectional language models and let every layer attend jointly to context on both sides of a token, which is what "truly bidirectional" means in the BERT literature. [10] [18]

### Timeline of Contextual Embedding Models

ELMo sits inside a broader sequence of innovations that transformed NLP between 2017 and 2020. The table below traces the key contextual embedding and language model releases of that period, preceded by the static embedding methods they displaced. [3] [10] [18]

| Date | Model | Affiliation | Key Innovation |
|------|-------|------------|-----------------|
| 2013 | [word2vec](/wiki/word2vec) | Google (Mikolov et al.) | Static word embeddings via skip-gram and CBOW |
| 2014 | GloVe | Stanford (Pennington et al.) | Static embeddings from co-occurrence matrix factorization |
| 2016 | FastText | Facebook AI Research | Static embeddings with subword n-grams |
| 2017 (Apr) | TagLM | AI2 (Peters et al.) | First demonstration that biLM hidden states improve sequence tagging |
| 2017 (Aug) | CoVe | Salesforce (McCann et al.) | Contextual vectors from a translation encoder |
| 2018 (Jan / May) | ULMFiT | fast.ai (Howard & Ruder) | Fine-tuning of pretrained AWD-LSTM language model with discriminative learning rates |
| 2018 (Feb) | ELMo | AI2 + UW (Peters et al.) | Deep biLSTM language model with task-specific layer mixing |
| 2018 (Jun) | GPT-1 | OpenAI (Radford et al.) | Transformer decoder pretrained as left-to-right LM, fine-tuned on tasks |
| 2018 (Oct) | BERT | Google (Devlin et al.) | Transformer encoder with masked language modeling and next-sentence prediction |
| 2019 (Feb) | [GPT-2](/wiki/gpt-2) | OpenAI (Radford et al.) | 1.5B-parameter transformer LM with strong zero-shot abilities |
| 2019 (Jun) | XLNet | Google + CMU | Permutation language modeling combining autoregressive and bidirectional benefits |
| 2019 (Jul) | RoBERTa | Facebook AI Research | Re-trained BERT with more data, longer training, and removed NSP |
| 2019 (Sep) | ALBERT | Google | Factorized embeddings and cross-layer parameter sharing |
| 2019 (Oct) | T5 | Google | Text-to-text framing of all NLP tasks; encoder-decoder transformer |
| 2020 (May) | GPT-3 | OpenAI | 175B-parameter transformer with in-context learning |

ELMo's position in this timeline is bridging. It came after the static embedding era and just before the transformer revolution. Its biLSTM architecture was already being phased out within months of its release, but the conceptual framework it introduced (pretrain a language model on a large unlabeled corpus, then transfer the resulting representations to supervised tasks) became the template for everything that followed. [3]

## Layer-Wise Linguistic Analysis

One of the more interesting follow-up findings about ELMo concerns what the different layers of its biLM actually learn. Probing studies, in which a small classifier is trained on top of frozen ELMo vectors to predict a particular linguistic property, revealed a clean hierarchy. [12] [19]

- The character CNN layer (layer 0) primarily encodes morphology and surface form. It performs well on tasks that depend on suffixes, prefixes, and orthography.
- The first biLSTM layer captures local syntax. It scores highest on probing tasks such as part-of-speech tagging and constituency parsing, and it preserves more of the literal input than the upper layer.
- The second biLSTM layer captures more abstract semantics. It performs best on word sense disambiguation, coreference, and other tasks that require integrating information across longer distances.

This hierarchy explains why the task-specific mixing weights matter so much. A POS tagger should learn to upweight layer 1, while a coreference system should upweight layer 2, and the gamma scaling factor allows the model to balance the resulting linear combination against other features. Studies of BERT and GPT-2 conducted in 2019 reported similar but more nuanced layer-wise hierarchies, suggesting that the property of "different layers encode different abstractions" is general to deep contextual models rather than specific to ELMo. [19]

## AllenNLP and the Open-Source Release

ELMo's wide adoption was accelerated by its tight integration with **AllenNLP**, an open-source PyTorch-based NLP research framework released by AI2 in 2017. AllenNLP shipped reference implementations of common NLP architectures (semantic role labelers, coreference resolvers, NER taggers) along with a clean abstraction for plugging in token embedders. ELMo was supported as a first-class embedding option, which allowed researchers to switch from GloVe or word2vec inputs to ELMo with a single configuration change. [4]

AI2 also released the underlying TensorFlow implementation as `bilm-tf` and a separate command-line tool `allennlp elmo` that produces pre-computed ELMo vectors as HDF5 files for use in arbitrary frameworks. The trained 1B and 5.5B model checkpoints, plus a smaller 256-hidden-unit variant for users with constrained compute, were made publicly downloadable from the AllenNLP website. The ecosystem encouraged adoption across academia and industry well beyond the original AllenNLP user base. [4] [14]

AllenNLP itself was placed in maintenance mode in 2022 and is no longer actively developed, but the ELMo model files remain available and the conceptual influence of the framework persisted in successor libraries such as Hugging Face's `transformers`. [20]

## Influence on Subsequent NLP Research

ELMo's influence on the trajectory of NLP can be summarized in three threads. First, it normalized the practice of treating a pretrained language model as the central feature extractor for downstream tasks, a paradigm now called transfer learning for NLP. Within months, virtually every leaderboard for English NLP tasks was dominated by systems built on either ELMo or one of its rapid successors. [10]

Second, ELMo demonstrated empirically that scaling pretraining (more data, deeper models, longer training) produces broad downstream gains. The lesson was not lost on subsequent research. Transformer-based successors quickly adopted larger models, larger corpora, and longer training schedules, and the trend toward ever-larger language models that culminated in GPT-3 and modern frontier systems is in many ways an extrapolation of the curve that ELMo first traced. [21]

Third, ELMo set the standard for engineering deliverables expected of new methods papers. The release combined a clear paper, multiple model checkpoints, an open-source training and inference codebase, integration into a production-quality framework (AllenNLP), and detailed benchmark numbers across a diverse set of tasks. This combination became the implicit norm for follow-up releases, including BERT, GPT-2, RoBERTa, and XLNet. [4]

## What are the limitations of ELMo?

Despite its impact, ELMo has several well-known limitations that became increasingly visible as transformer-based successors arrived. [10]

- **Concatenated bidirectionality.** ELMo's two LSTMs are trained as independent forward and backward language models, then concatenated. The model never sees both directions of context jointly inside a single layer. BERT's masked language modeling objective fixes this by allowing every layer to attend to both left and right context simultaneously.
- **LSTM bottleneck.** Recurrent computation processes tokens sequentially, which limits both the practical sequence length and the wall-clock training throughput. Transformers replaced this with self-attention, which scales much better on modern GPU hardware and captures long-range dependencies more effectively.
- **Feature-based use limits flexibility.** ELMo treats the biLM as a frozen feature extractor. Fine-tuning the entire pretrained model on the target task, as introduced by ULMFiT and standardized by BERT and GPT, allows downstream supervision to reshape the internal representations and consistently produces stronger results.
- **Coarse contextualization.** Stanford research published in 2019 showed that ELMo's contextualized representations are less context-specific than those produced by BERT or GPT-2. In layers above the first, the same word in different contexts ends up with vectors that are still relatively close in cosine similarity, which limits the model's ability to fully resolve polysemy. [19]
- **Token-level rather than discourse-level.** ELMo emits one vector per token. While this is fine for many sequence labeling and classification tasks, it does not natively model sentence-pair tasks the way BERT's [CLS] token and segment embeddings do.
- **Domain sensitivity.** Because the biLM was trained on relatively domain-specific corpora (news in the 1B benchmark), ELMo can underperform on highly specialized text such as biomedical or legal documents unless retrained on domain-matched data.
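The coarse-contextualization point in the list above rests on comparing a word's vectors across contexts by cosine similarity. The sketch below computes a simple "self-similarity" score, the mean pairwise cosine similarity of one word's vectors; the two-dimensional vectors are invented for illustration and the function names are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def self_similarity(vectors):
    """Mean pairwise cosine similarity of one word's vectors across contexts."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy vectors for the same word in three sentences (invented values).
static_like = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # identical in every context
contextual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # varies with context

print(round(self_similarity(static_like), 3))
print(round(self_similarity(contextual), 3))
```

A score of 1 means the representation ignores context entirely, as a static embedding does; lower scores indicate more context-specific vectors.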

These limitations did not erase ELMo's contributions; they motivated the next generation of systems. By late 2018 and through 2019, BERT, GPT, XLNet, and RoBERTa progressively replaced ELMo on most leaderboards. By 2020, ELMo had been largely retired from frontline production use, although it persists as a robust baseline for low-resource languages and continues to be cited in pedagogy as the canonical introduction to contextual word representations. [10] [22]

## Legacy

ELMo occupies a special place in NLP history as the model that crystallized the transfer learning paradigm and proved the value of deep, contextualized representations. Its measurable impact on benchmark scores, the elegance of its task-specific mixing scheme, and the open-source rigor of its release together made it a textbook example of how to do impactful applied research in the field. [3]

The paper's most direct intellectual descendant is arguably BERT, which can be read as ELMo with the LSTM replaced by a transformer encoder, the concatenated forward and backward LMs replaced by joint masked language modeling, and the feature-based use replaced by fine-tuning. Jay Alammar's widely read 2018 explainer captured this lineage in its title, "The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)." Both models are usually introduced together in coursework and survey papers because they tell two halves of the same story. [3] [10]

Beyond its specific architectural choices, ELMo's broader legacy is a methodological one. It helped establish that an effective way to make progress in NLP is to pretrain a large language model on a vast unlabeled corpus, then transfer the resulting representations to downstream tasks. Every modern frontier system, from BERT through [GPT-2](/wiki/gpt-2) and the transformer-based [large language models](/wiki/large_language_model) that followed, builds on that template. ELMo's particular combination of biLSTM and feature-based use has been superseded, but the strategic insight that drove it has only grown more central to the field. [3] [21]

## See Also

- [Table Question Answering Models](/wiki/table_question_answering_models)
- [BERT](/wiki/bert)
- [GPT-2](/wiki/gpt-2)
- [Transformer](/wiki/transformer)
- [Word2vec](/wiki/word2vec)
- [Word embedding](/wiki/word_embedding)
- [Large language model](/wiki/large_language_model)
- [Long short-term memory (LSTM)](/wiki/long_short-term_memory_lstm)
- [Recurrent neural network](/wiki/recurrent_neural_network)
- [Natural language processing](/wiki/natural_language_processing)
- [Allen Institute for AI](/wiki/allen_institute_for_ai)
- [Language model](/wiki/language_model)

## References

[1] Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. "Deep contextualized word representations." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 2227-2237. New Orleans, June 2018. https://aclanthology.org/N18-1202/

[2] "Best Paper Awards." NAACL-HLT 2018, https://naacl2018.wordpress.com/2018/04/11/2018-best-papers/

[3] Alammar, Jay. "The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)." 2018. https://jalammar.github.io/illustrated-bert/

[4] "AllenNLP: ELMo." Allen Institute for AI. https://allenai.org/allennlp/software/elmo

[5] Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781

[6] McCann, Bryan, James Bradbury, Caiming Xiong, and Richard Socher. "Learned in Translation: Contextualized Word Vectors." Advances in Neural Information Processing Systems (NeurIPS) 30, 2017. https://arxiv.org/abs/1708.00107

[7] Peters, Matthew E., Waleed Ammar, Chandra Bhagavatula, and Russell Power. "Semi-supervised sequence tagging with bidirectional language models." ACL 2017. https://arxiv.org/abs/1705.00108

[8] Howard, Jeremy, and Sebastian Ruder. "Universal Language Model Fine-tuning for Text Classification." ACL 2018. https://arxiv.org/abs/1801.06146

[9] Peters et al., "Deep contextualized word representations," arXiv preprint, March 2018. https://arxiv.org/pdf/1802.05365

[10] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019. https://arxiv.org/abs/1810.04805

[11] "Linear Combination of Embeddings." allenai/bilm-tf issue 95. https://github.com/allenai/bilm-tf/issues/95

[12] Peters, Matthew E., Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. "Dissecting Contextual Word Embeddings: Architecture and Representation." EMNLP 2018. https://arxiv.org/abs/1808.08949

[13] Chelba, Ciprian, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Philipp Koehn, and Tony Robinson. "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling." arXiv:1312.3005, 2013. https://arxiv.org/abs/1312.3005

[14] "Pre-trained ELMo Models." AllenNLP. https://allenai.org/allennlp/software/elmo

[15] Che, Wanxiang, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. "Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation." CoNLL 2018 Shared Task. https://github.com/HIT-SCIR/ELMoForManyLangs

[16] Ulcar, Matej, and Marko Robnik-Sikonja. "High Quality ELMo Embeddings for Seven Less-Resourced Languages." LREC 2020. https://aclanthology.org/2020.lrec-1.582/

[17] Tsang, Sik-Ho. "Review: ELMo: Deep Contextualized Word Representations." Medium. https://sh-tsang.medium.com/review-elmo-deep-contextualized-word-representations-8eb1e58cd25c

[18] Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. "Improving Language Understanding by Generative Pre-Training." OpenAI Technical Report, 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

[19] Ethayarajh, Kawin. "How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings." EMNLP 2019. https://arxiv.org/abs/1909.00512

[20] "AllenNLP is in maintenance mode." allenai/allennlp GitHub repository, 2022. https://github.com/allenai/allennlp

[21] Brown, Tom B., et al. "Language Models are Few-Shot Learners." NeurIPS 2020. https://arxiv.org/abs/2005.14165

[22] Liu, Yinhan, et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv:1907.11692, 2019. https://arxiv.org/abs/1907.11692

