Mike Lewis


Updated Jul 16, 2026


Mike Lewis is a British natural language processing researcher based in Seattle who serves as a research scientist at Meta AI (Facebook AI Research, FAIR) and as the pre-training research lead on the Llama team.^[1]^[2] He is the first author of the 2019 paper introducing BART, one of the foundational denoising sequence-to-sequence pre-trained models, and a co-author of RoBERTa, the Cicero Diplomacy agent, and the 2024 technical report describing Llama 3.^[1]^[3]^[4]^[5] He led pre-training for Llama 3 and has continued in that role on subsequent Meta foundation-model efforts.^[1]^[2] Lewis holds a PhD from the University of Edinburgh, where he was advised by Mark Steedman, and did a postdoc at the University of Washington with Luke Zettlemoyer; relative to public figures such as Yann LeCun he maintains a comparatively low public profile.^[1]^[6]

Education and early career

Lewis took a master's degree at the University of Oxford before moving to the University of Edinburgh's School of Informatics, where he joined the Institute for Language, Cognition and Computation as a PhD student under Mark Steedman.^[1] His doctoral work, completed in 2015, examined wide-coverage natural language semantics built on Combinatory Categorial Grammar (CCG), aiming to derive logical semantic representations that capture paraphrase and entailment from machine reading of large unlabelled corpora.^[7]^[8] An early flagship paper from this line of work, "Combined Distributional and Logical Semantics" with Steedman, appeared in the inaugural volume of the Transactions of the Association for Computational Linguistics in 2013. It introduced an approach that maps language to logical-form representations whose relational constants are induced offline by distributional clustering at the level of predicate-argument structure, with the aim of unifying the strengths of model-theoretic and distributional semantics.^[8]

After his PhD, Lewis moved to the University of Washington as a postdoctoral researcher in Luke Zettlemoyer's group, focusing on search-based structured prediction and neural CCG parsing.^[1] During this period he co-authored, with Kenton Lee and Zettlemoyer, "Global Neural CCG Parsing with Optimality Guarantees," which received a Best Paper Award at the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP).^[9] He also received a Best Resource Paper award at the Association for Computational Linguistics (ACL) 2017 and a Best Paper Honourable Mention at ACL 2018.^[6]

Lewis joined Facebook AI Research (now Meta AI) shortly thereafter, taking up a research scientist position at the Seattle FAIR site.^[1]

Selected publications and research

BART (2019, ACL 2020)

Lewis is the first author of "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension," posted to arXiv on 29 October 2019 (arXiv:1910.13461) and published in the Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics in July 2020, pages 7871-7880.^[3]^[10] The co-authors are Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer.^[3]^[10]

BART is a denoising autoencoder built on a standard Transformer encoder-decoder. Training corrupts source text with a noising function and asks the decoder to reconstruct the clean original sequence.^[3]^[10] The model has a bidirectional encoder (in the style of BERT) and a left-to-right autoregressive decoder (in the style of GPT), with ReLU activations replaced by GeLUs and parameters initialised from N(0, 0.02); base BART uses six layers per side and BART-large uses twelve.^[3]^[19] The authors evaluated several corruption strategies, including token masking, token deletion, document rotation, sentence permutation, and text infilling with span lengths drawn from a Poisson(λ=3) distribution, and reported that combining sentence shuffling with span-based in-filling (replacing contiguous spans with a single mask token) worked best on downstream tasks.^[3]^[19] BART matched RoBERTa on the GLUE benchmark and SQuAD while delivering up to roughly six ROUGE-point gains on summarization and dialogue benchmarks and a 1.1 BLEU improvement over a back-translation baseline for machine translation.^[3]^[10] The paper has become a standard reference for encoder-decoder pretraining alongside T5 and is one of Lewis's most cited works, with more than sixteen thousand citations on Google Scholar.^[4] The model is distributed widely via Hugging Face in checkpoints such as facebook/bart-large, facebook/bart-large-cnn (fine-tuned for CNN/DailyMail summarization), and facebook/bart-large-mnli (fine-tuned for NLI and zero-shot classification).
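The text-infilling corruption described above can be illustrated with a small sketch. This is not the authors' implementation: the `mask_ratio` value, the `<mask>` string, and the function names are assumptions chosen for illustration, and the Poisson sampler is a plain standard-library stand-in.

```python
import math
import random

MASK = "<mask>"  # placeholder mask string, assumed for this sketch


def sample_poisson(lam, rng):
    """Knuth's method: count uniform draws until their product falls below exp(-lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def text_infill(tokens, mask_ratio=0.3, lam=3.0, seed=0):
    """Replace randomly chosen spans with a single MASK token each.

    Span lengths are drawn from Poisson(lam). A zero-length span inserts
    a MASK without removing any token.
    """
    rng = random.Random(seed)
    budget = int(round(len(tokens) * mask_ratio))  # max tokens to hide
    out, i = [], 0
    while i < len(tokens):
        if budget > 0 and rng.random() < mask_ratio:
            span = min(sample_poisson(lam, rng), budget, len(tokens) - i)
            out.append(MASK)
            if span == 0:
                # Pure insertion: keep the current token so the loop advances.
                out.append(tokens[i])
                i += 1
            else:
                i += span
                budget -= span
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog again".split()
    print(" ".join(text_infill(words, seed=1)))
```

Because each span collapses to one mask token regardless of its length, a model trained to undo this corruption has to predict how many tokens are missing as well as what they are.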

RoBERTa (2019)

Lewis is one of ten co-authors of "RoBERTa: A Robustly Optimized BERT Pretraining Approach" (arXiv:1907.11692), posted on 26 July 2019.^[11] The paper is a replication study of BERT showing that the original model was undertrained, and that with longer training, larger batches, more data, and removal of the next-sentence prediction objective, the same architecture could substantially outperform published BERT results.^[11] RoBERTa is the single most cited paper on Lewis's Google Scholar profile, with over forty thousand citations.^[4]

Nearest-neighbour language models (ICLR 2020)

Lewis was also a co-author of "Generalization through Memorization: Nearest Neighbor Language Models" (Khandelwal, Levy, Jurafsky, Zettlemoyer, Lewis; ICLR 2020), which introduced kNN-LM. The method augments a pre-trained Transformer LM by linearly interpolating its next-token distribution with a non-parametric k-nearest-neighbour distribution over training-data continuations. The paper reported state-of-the-art perplexity on WikiText-103 (15.79, a 2.9-point improvement) without additional training, and it is widely cited as an early example of explicit memory retrieval as a complement to parametric LMs.^[25]
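The interpolation at the heart of kNN-LM can be sketched as follows. This is a toy illustration only: the neighbour distances are supplied by hand rather than retrieved from a datastore, and the `lam` and `temperature` values and function names are assumptions, not settings taken from the paper.

```python
import math
from collections import defaultdict


def knn_distribution(neighbours, temperature=1.0):
    """Turn retrieved (distance, next_token) pairs into a distribution.

    Each neighbour gets weight exp(-distance / temperature); weights of
    neighbours that share a next token are summed.
    """
    weights = [math.exp(-dist / temperature) for dist, _ in neighbours]
    total = sum(weights)
    probs = defaultdict(float)
    for weight, (_, token) in zip(weights, neighbours):
        probs[token] += weight / total
    return dict(probs)


def interpolate(p_lm, p_knn, lam=0.25):
    """Linear interpolation: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {
        tok: lam * p_knn.get(tok, 0.0) + (1.0 - lam) * p_lm.get(tok, 0.0)
        for tok in vocab
    }


if __name__ == "__main__":
    p_lm = {"Paris": 0.5, "London": 0.5}
    p_knn = knn_distribution([(0.0, "Paris"), (0.0, "Paris")])
    print(interpolate(p_lm, p_knn))
```

In this toy case the retrieved neighbours all continue with "Paris", so the interpolated distribution shifts probability toward that token without any change to the parametric model.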

QuIP (Findings of ACL 2022)

Lewis also co-authored "Question Answering Infused Pre-training of General-Purpose Contextualized Representations" (Jia, Lewis, Zettlemoyer; arXiv:2106.08190, Findings of ACL 2022), which uses 80 million synthesised QA pairs to train a bi-encoder QA model to match a more accurate cross-encoder, and shows the resulting representations transfer to QA, paraphrase detection, named entity recognition, and sentiment analysis.^[26]

Negotiation and story generation

Earlier first-author and co-author work at FAIR included "Deal or No Deal? End-to-End Learning of Negotiation Dialogues" at EMNLP 2017 (with Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra), which trained end-to-end dialogue agents on a multi-issue bargaining task and introduced a dialogue-rollouts planning technique in which the model simulates possible complete continuations of a conversation to choose actions that improve expected outcome.^[12] The accompanying dataset of human-human negotiation dialogues was publicly released, and the paper is often cited as one of the earliest end-to-end neural negotiation agents.^[12] In 2018 he was a co-author on "Hierarchical Neural Story Generation" (Fan, Lewis, Dauphin, ACL 2018), introducing a 300K-story WritingPrompts dataset and a fusion model with gated multi-scale self-attention; the paper's dataset and code are part of Meta's fairseq examples and remain a standard reference for long-form open-ended text generation.^[13]
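The dialogue-rollouts idea from the negotiation paper can be sketched in a few lines. The `simulate` callable below is hypothetical: it stands in for sampling a complete continuation of the conversation from a dialogue model and scoring the final deal, and the toy payoffs are invented for illustration.

```python
import random
from statistics import mean


def choose_by_rollouts(candidates, simulate, num_rollouts=20, seed=0):
    """Pick the candidate utterance with the highest average simulated reward.

    simulate(candidate, rng) is assumed to play the dialogue to completion
    after `candidate` and return the reward of the resulting outcome.
    """
    rng = random.Random(seed)
    scores = {
        cand: mean(simulate(cand, rng) for _ in range(num_rollouts))
        for cand in candidates
    }
    best = max(scores, key=scores.get)
    return best, scores


def toy_simulate(utterance, rng):
    """Hypothetical environment: firmer offers pay more but are accepted less often."""
    accept_prob, payoff = {
        "I take both books": (0.3, 8),
        "I take one book": (0.9, 4),
    }[utterance]
    return payoff if rng.random() < accept_prob else 0


if __name__ == "__main__":
    best, scores = choose_by_rollouts(
        ["I take both books", "I take one book"], toy_simulate, num_rollouts=200
    )
    print(best, scores)
```

The design choice is that the agent scores what it might say by the expected end result of the whole conversation, rather than by how likely the utterance is under the language model.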

Cicero

In 2022 Lewis was a co-author on the Meta Fundamental AI Research Diplomacy Team's Science paper "Human-Level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning," which introduced the Cicero agent.^[14] Cicero combined a controllable dialogue model with planning and reinforcement-learning components and, playing anonymously against humans in an online Diplomacy league, achieved more than double the average human score across 40 speed games and placed in the top 10% of participants who played more than one game.^[14] The dialogue model was trained to ground generated messages in an explicit plan of in-game actions, and the planning component combined a no-press equilibrium-search algorithm with a per-turn action policy conditioned on partner-modelling beliefs inferred from chat.^[14] In a public X (Twitter) reply, Lewis emphasised that the agent was designed so that "all its messages correspond to actions it currently plans to take" rather than being trained to deceive partners; Cicero is a rare modern example of a strategic dialogue agent whose code and dataset were openly released.^[14]^[17]

Llama 3 (2024)

Lewis is a core contributor and listed in the leading author group of "The Llama 3 Herd of Models" (arXiv:2407.21783), the technical report describing Meta's Llama 3 family, with version 1 posted on 31 July 2024 and a revised version 3 dated 23 November 2024.^[5] The paper introduces a family of dense Transformer language models up to 405B parameters with a 128K context window, natively multilingual, with coding, reasoning, and tool-use capabilities, and reports performance comparable on many benchmarks to leading proprietary models such as GPT-4.^[5] The Llama 3 flagship was pre-trained on roughly fifteen trillion tokens drawn from publicly available sources, roughly seven times more data than Llama 2, with extensive data filtering, semantic deduplication, and quality classifiers, and with a markedly higher share of code than earlier Llama generations.^[20] Architecturally the 405B model is a decoder-only Transformer with 126 layers, 128 attention heads, Grouped-Query Attention, RMSNorm, and Rotary Position Embeddings, and a vocabulary increased from 32K to 128K tokens relative to Llama 2.^[5]^[27]
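Two of the architectural components named above, Grouped-Query Attention and RMSNorm, can be illustrated with a minimal sketch. The figure of 128 query heads comes from the text; the count of 8 key/value heads, the `eps` value, and the function names are assumptions made for this example.

```python
import math


def kv_head_for_query(q_head, num_q_heads=128, num_kv_heads=8):
    """Return the index of the key/value head shared by query head `q_head`.

    In grouped-query attention, consecutive query heads share one KV head.
    num_kv_heads=8 is an assumption for this sketch.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    if not 0 <= q_head < num_q_heads:
        raise IndexError("query head index out of range")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def rms_norm(x, gain, eps=1e-6):
    """Scale a vector by the reciprocal of its root mean square, then apply a gain."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * scale for v, g in zip(x, gain)]


if __name__ == "__main__":
    print([kv_head_for_query(h) for h in (0, 15, 16, 127)])
    print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```

Under these assumed head counts, sixteen query heads share each key/value head, which shrinks the key/value cache by the same factor compared with giving every query head its own keys and values.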

Lewis is publicly identified by Meta and by external venues as the pre-training research lead on the Llama team and as having led pre-training for Llama 3.^[1]^[2] In 2024 he described the work in public talks, including an invited talk at ICLR 2024 titled "Bridging the Gap Between Pre-training Data and Alignment" and a seminar at the Johns Hopkins Center for Language and Speech Processing titled "Science and Scaling: How (really) to Pre-train a Llama."^[2]^[15]

His public talks on Llama 3 emphasised two themes for improving base models without relying purely on a separate alignment phase: in-context pre-training, which constructs training sequences from semantically related documents to strengthen long-context and few-shot behaviour, and instruction backtranslation, which automatically infers candidate instructions for unlabelled web documents.^[2]^[15] Both ideas appear as named arXiv preprints with Lewis as an author: "In-Context Pretraining: Language Modeling Beyond Document Boundaries" (Shi et al., arXiv:2310.10638) describes the document-relatedness pre-training scheme, and "Self-Alignment with Instruction Backtranslation" (Li et al., arXiv:2308.06259) describes the Humpback model that uses an iterative self-augment and self-curate procedure starting from a seed-finetuned LM on a web corpus.^[21]^[22]
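One round of the self-augment and self-curate procedure can be sketched as below. The two callables are hypothetical stand-ins for calls to the seed-finetuned LM, and the `min_score` threshold is an assumption for illustration rather than a value taken from the text.

```python
def instruction_backtranslation(documents, propose_instruction, rate_pair, min_score=4):
    """One round of self-augmentation followed by self-curation.

    propose_instruction(doc) -> candidate instruction the document could answer
    rate_pair(instruction, doc) -> quality score for the (instruction, doc) pair
    Both callables are placeholders for model calls in this sketch.
    """
    # Self-augment: treat each web document as a response and infer an instruction.
    candidates = [(propose_instruction(doc), doc) for doc in documents]
    # Self-curate: keep only pairs that the model itself rates highly.
    return [(ins, doc) for ins, doc in candidates if rate_pair(ins, doc) >= min_score]


def toy_propose(doc):
    """Hypothetical proposer: build an instruction from the first word."""
    return "Write a passage about " + doc.split()[0]


def toy_rate(instruction, doc):
    """Hypothetical rater: treat very short documents as low quality."""
    return 5 if len(doc.split()) >= 5 else 1


if __name__ == "__main__":
    docs = ["Sourdough needs a long slow fermentation", "Click here"]
    print(instruction_backtranslation(docs, toy_propose, toy_rate))
```

In an iterative setting, the curated pairs would be used to fine-tune the model again before the next round, so that both the proposals and the ratings improve over time.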

Llama 4

Llama 4 was announced by Meta on 5 April 2025 as a natively multimodal family of mixture-of-experts models; the initial public release included Llama 4 Scout and Maverick, and a larger Behemoth model was previewed but not released as open weights.^[16] Lewis is publicly described as having continued as pre-training research lead on the Llama team during the development of Llama 4.^[1]^[2] Meta's Llama 4 announcement does not name an individual pre-training lead; it instead describes a new MetaP technique for reliably setting per-layer hyperparameters, such as learning rates and initialisation scales, across model sizes.^[16] No standalone Llama 4 paper with an author list had appeared on arXiv at the time of the model launch, so attributions of personal leadership rest on his standing job description rather than a paper byline.^[1]^[2]^[16]

Other later work

Lewis is a co-author of several other FAIR papers relevant to the modern open-weights pre-training stack:

"Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models" (Li, Gururangan, Dettmers, Lewis, Althoff, Smith, Zettlemoyer; arXiv:2208.03306, August 2022) proposes training a collection of domain-expert language models independently in an embarrassingly parallel way and then combining them by ensembling or weight averaging, avoiding the multi-node synchronisation cost of monolithic LM training.^[23]
"LIMA: Less Is More for Alignment" (Zhou et al., arXiv:2305.11206, May 2023; NeurIPS 2023) fine-tunes a 65B LLaMA on only one thousand carefully curated prompt and response pairs, without RLHF, and shows competitive instruction-following quality, supporting the hypothesis that most of the knowledge in instruction-tuned LMs comes from pre-training.^[24]
"Self-Alignment with Instruction Backtranslation" (Li et al., arXiv:2308.06259, August 2023) introduces the Humpback model, which uses an instruction-backtranslation loop on web text to bootstrap an aligned LM.^[22]
"In-Context Pretraining: Language Modeling Beyond Document Boundaries" (Shi et al., arXiv:2310.10638, October 2023) reorders pre-training data so that consecutive documents in a context are semantically related, improving long-context behaviour and in-context learning.^[21]

Research themes

Lewis's published work spans several themes:

Theme	Representative works
Wide-coverage CCG semantics and parsing	"Combined Distributional and Logical Semantics" (TACL 2013); "A* CCG Parsing with a Supertag-factored Model" (NAACL 2014); "Global Neural CCG Parsing with Optimality Guarantees" (EMNLP 2016, Best Paper)^[8]^[9]
Dialogue and grounded agents	"Deal or No Deal? End-to-End Learning of Negotiation Dialogues" (EMNLP 2017); "Hierarchical Neural Story Generation" (ACL 2018); Cicero (Science 2022)^[12]^[13]^[14]
Pre-trained encoders	RoBERTa (2019)^[11]
Pre-trained encoder-decoders	BART (ACL 2020)^[3]^[10]
Modular and parallel LM training	Branch-Train-Merge (arXiv 2022)^[23]
Data-efficient alignment	LIMA (NeurIPS 2023); Self-Alignment with Instruction Backtranslation / Humpback (arXiv 2023); In-Context Pretraining (arXiv 2023)^[21]^[22]^[24]
Llama foundation models	"The Llama 3 Herd of Models" (arXiv 2024); pre-training lead for Llama 3 and continuing in that role on Llama 4^[1]^[2]^[5]

Across these projects a recurring thread is moving from explicit symbolic structure (CCG, logical forms) toward large-scale neural pre-training where structure is induced from corrupted-text reconstruction or from agent-environment interaction. A second recurring thread is a preference for pushing capability into the pre-training stage itself (data selection, document ordering, infilling, instruction backtranslation, modular experts) rather than relying solely on a separate post-training alignment phase.^[2]^[15]^[21]^[22]

Technical context

Three technical threads in Lewis's work are particularly load-bearing for the rest of the field:

Denoising as a unifying objective

BART (2019/2020) frames a wide range of prior pre-training schemes as special cases of a denoising autoencoder over text. Token masking with no shuffling recovers a BERT-style objective; token deletion is similar to masking but forces the model to also decide which positions are missing; sentence permutation captures the kinds of ordering corruption used in some earlier work; and span infilling with span lengths sampled from Poisson(λ=3) generalises masked-span objectives such as those of SpanBERT and T5 while preserving variable-length outputs.^[3]^[19] Because the decoder generates the target autoregressively, the same pre-trained checkpoint can be fine-tuned for both comprehension and generation without architectural changes; this is the property the title's "Natural Language Generation, Translation, and Comprehension" phrase refers to.^[3]
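To make the contrast between these corruptions concrete, the simpler ones can be written as small functions over a token list. This is an illustrative sketch with assumed function names and an assumed `<mask>` string; positions are passed in explicitly rather than sampled.

```python
import random

MASK = "<mask>"  # placeholder mask string, assumed for this sketch


def token_mask(tokens, positions):
    """BERT-style masking: the output keeps its length, so positions stay visible."""
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def token_delete(tokens, positions):
    """Deletion: the output is shorter, so the model must infer where the gaps are."""
    return [tok for i, tok in enumerate(tokens) if i not in positions]


def rotate_document(tokens, start):
    """Document rotation: begin the sequence at `start` and wrap around."""
    return tokens[start:] + tokens[:start]


def permute_sentences(sentences, seed=0):
    """Sentence permutation: shuffle sentence order, leaving each sentence intact."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    toks = ["a", "b", "c", "d"]
    print(token_mask(toks, {1}))
    print(token_delete(toks, {1}))
    print(rotate_document(toks, 2))
    print(permute_sentences(["First.", "Second.", "Third."]))
```

In every case the training target is the original, uncorrupted sequence, which is what lets one decoder objective cover all of these schemes.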

Pre-training data as the primary lever

In his Llama-era public talks Lewis has consistently emphasised data composition, ordering, and quality over architectural novelty as the dominant driver of base-model capability.^[2]^[15] This view is supported by the Llama 3 paper's heavy emphasis on data filtering, semantic deduplication, code share, and multilingual coverage, and by the In-Context Pretraining and Instruction Backtranslation lines of work that explicitly modify pre-training corpora and example construction rather than the model itself.^[5]^[20]^[21]^[22]
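As an illustration of what modifying example construction can mean in practice, the sketch below orders documents so that each one is followed by its most similar unused neighbour. It is a greedy simplification under stated assumptions: the embeddings are hand-written toy vectors (assumed non-zero), and the function names are hypothetical rather than taken from the In-Context Pretraining paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def order_by_similarity(embeddings, start=0):
    """Greedy chain: repeatedly append the unused document closest to the last one."""
    remaining = set(range(len(embeddings))) - {start}
    order = [start]
    while remaining:
        last = embeddings[order[-1]]
        # sorted() makes tie-breaking deterministic (lowest index wins).
        nxt = max(sorted(remaining), key=lambda j: cosine(last, embeddings[j]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


if __name__ == "__main__":
    toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
    print(order_by_similarity(toy_embeddings))
```

Concatenating documents in the resulting order places related texts in the same context window, so the model can benefit from attending across document boundaries during pre-training.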

Modular and memory-augmented models

The Branch-Train-Merge and kNN-LM lines of work argue from different angles that LM capability can be decomposed: BTM into a set of domain experts that can be averaged or ensembled, and kNN-LM into a parametric LM plus a non-parametric memory of training data continuations.^[23]^[25] These two papers, together with QuIP, sit alongside BART as evidence that Lewis's research style frequently mixes large pre-trained backbones with explicit structural devices.^[26]
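The weight-averaging variant of merging domain experts can be sketched as follows. Parameters are represented as plain lists of floats keyed by name, which is an assumption for illustration; a real model would use tensors, and the mixing weights here are hypothetical.

```python
def average_parameters(experts, weights=None):
    """Merge expert models by a weighted average of their parameters.

    experts: list of dicts mapping parameter name -> list of floats,
             all with identical names and shapes (assumed, not checked in depth).
    weights: mixing weights that sum to 1; defaults to a uniform average.
    """
    if not experts:
        raise ValueError("need at least one expert")
    n = len(experts)
    weights = list(weights) if weights is not None else [1.0 / n] * n
    if len(weights) != n or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must match the experts and sum to 1")
    merged = {}
    for name in experts[0]:
        columns = zip(*(expert[name] for expert in experts))
        merged[name] = [sum(w * v for w, v in zip(weights, col)) for col in columns]
    return merged


if __name__ == "__main__":
    legal_expert = {"w": [1.0, 2.0]}
    medical_expert = {"w": [3.0, 4.0]}
    print(average_parameters([legal_expert, medical_expert]))
    print(average_parameters([legal_expert, medical_expert], weights=[0.25, 0.75]))
```

Because each expert is trained independently, this merge is the only step that touches all experts at once, which is where the savings in synchronisation cost come from.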

Recognition and metrics

Public recognition for Lewis includes:^[6]

Best Paper Award, EMNLP 2016 (Global Neural CCG Parsing with Optimality Guarantees, with Lee and Zettlemoyer)
Best Resource Paper, ACL 2017
Best Paper Honourable Mention, ACL 2018

His Google Scholar profile lists more than 136,000 citations, an h-index above 60, and an i10-index above 90; the four most-cited papers are RoBERTa, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (whose first author, Patrick Lewis, is a distinct researcher with whom Mike Lewis is sometimes confused), "The Llama 3 Herd of Models," and BART.^[4]

Significance

The combination of BART, RoBERTa, Cicero, and Llama 3 places Lewis among a small group of researchers who have authored foundational papers in three distinct generations of NLP: the late-2010s era of large pre-trained encoders and encoder-decoders, the early-2020s era of dialogue and game-playing agents, and the mid-2020s era of open-weights frontier foundation models. BART in particular is one of the canonical references for encoder-decoder pre-training; Hugging Face hosts the original facebook/bart-large checkpoint and several widely used fine-tuned variants, including facebook/bart-large-cnn for abstractive summarization on the CNN/DailyMail dataset and facebook/bart-large-mnli for zero-shot text classification via natural-language inference.^[3]^[28]

The Llama 3 paper is, by citation count, already one of the most cited foundation-model technical reports ever published; with over eighteen thousand citations on Google Scholar within roughly two years of release, it has become the de facto reference for understanding how to train a frontier-scale open-weights LM.^[4]^[5] Lewis's specific imprint on the Llama 3 program, visible in his Llama 3 talks, is the emphasis that aligned model quality is dominated by pre-training corpus design and data-side techniques such as in-context pretraining and instruction backtranslation, rather than by post-hoc preference tuning alone.^[2]^[15]^[21]^[22]

Public profile

Despite being credited with leading pre-training for Llama 3 and with continuing in that role on later Llama work, Lewis maintains a relatively low public profile compared with senior Meta AI figures such as Yann LeCun.^[1]^[2] Most of his public communication consists of conference and seminar talks (for example at ICLR 2024, the Johns Hopkins Center for Language and Speech Processing, and the Georgia Tech ML@GT seminar series), arXiv papers, and an X (Twitter) account at the handle @ml_perception, rather than mainstream-media interviews or executive-style commentary.^[2]^[15]^[17]

Limitations and disputes

Several caveats apply to public statements about Lewis's role:

Llama paper authorship lists are very large. "The Llama 3 Herd of Models" lists hundreds of authors as Meta's Llama Team; the paper itself does not attribute pre-training leadership to a named individual in the main text. Public attribution of Lewis as pre-training lead comes from Meta's people page, his speaker bios at ICLR 2024 and CLSP, and his own talks, rather than from the paper's authorship metadata.^[1]^[2]^[5]^[15]
Llama 4 leadership. Meta's Llama 4 launch blog credits a team and does not name a specific pre-training lead.^[16] Public characterisations of Lewis as continuing in the pre-training research lead role on the Llama team rest on his persistent Meta job title and his speaker-bio descriptions in 2024 and after, not on a Llama 4 technical report.^[1]^[2]
Name disambiguation. "Mike Lewis" and "Patrick Lewis" are both NLP researchers who worked at FAIR; the highly cited RAG paper (Lewis et al., 2020) is led by Patrick Lewis. The paper appears on Mike Lewis's Google Scholar profile because he is a co-author, not its lead author, and the two researchers should not be conflated.^[4]
Speech and codec models. The 2021 paper "On Generative Spoken Language Modeling from Raw Audio" (Lakhotia et al., TACL 2021) and related FAIR speech work were produced by FAIR's speech group; Lewis is not in the author list for that paper despite working in the same broader FAIR organisation, so claims tying him personally to Generative Spoken Language Modeling are not supported by that paper's authorship.^[18]

Comparison with adjacent figures

Lewis's role in the modern open-weights LM ecosystem invites comparison with a handful of other researchers who have led pre-training at frontier labs. Unlike Yann LeCun, who as Meta's Chief AI Scientist is the most public face of Meta AI and frequently comments in mainstream press, Lewis communicates mostly through papers, conference talks, and his X account, and his job title (research scientist / pre-training research lead on the Llama team) is intermediate in seniority.^[1]^[2] Compared with peers at other labs such as Alec Radford at OpenAI or Quoc Le at Google, who are similarly associated with multiple landmark pre-training papers, Lewis's published trajectory is somewhat broader, spanning logical semantics and game-playing dialogue in addition to text pre-training and foundation models.^[4]^[8]^[14]

Within FAIR he has worked repeatedly with a recognisable group of collaborators that includes Luke Zettlemoyer (his postdoc advisor and a Meta affiliate), Omer Levy, Naman Goyal, Yinhan Liu, and Marjan Ghazvininejad; many of these names co-author both the BART and RoBERTa papers and the more recent alignment-and-data papers.^[3]^[11]^[21]^[22]^[23]

See also

BART (language model): the denoising sequence-to-sequence pre-trained model that Lewis first-authored.^[3]
RoBERTa: the optimised BERT replication that Lewis co-authored.^[11]
Llama 3: the model family for which Lewis led pre-training.^[1]^[5]
Llama 4: Meta's first natively multimodal mixture-of-experts Llama generation, released April 2025.^[16]
BERT and T5: contemporary encoder-only and encoder-decoder pre-trained models that BART is benchmarked against.^[3]
ELMo (Embeddings from Language Models): earlier contextual representation work in the broader pre-training lineage that BART builds on conceptually.

References

[1] Meta AI, "Mike Lewis", AI at Meta people directory. https://ai.meta.com/people/209431298931133/mike-lewis/. Accessed 2026-05-21.
[2] ICLR, "ICLR Invited Talk #1 - Bridging the Gap Between Pre-training Data and Alignment [Speaker: Mike Lewis (Meta AI)]", ICLR 2024 virtual program, 2024. https://iclr.cc/virtual/2024/23167. Accessed 2026-05-21.
[3] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, Luke Zettlemoyer, "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension", arXiv:1910.13461, 2019-10-29. https://arxiv.org/abs/1910.13461. Accessed 2026-05-21.
[4] Google Scholar, "Mike Lewis - Google Scholar citations profile", scholar.google.com. https://scholar.google.com/citations?user=SnQnQicAAAAJ&hl=en. Accessed 2026-05-21.
[5] Llama Team, AI @ Meta, "The Llama 3 Herd of Models", arXiv:2407.21783, 2024-07-31 (v1), 2024-11-23 (v3). https://arxiv.org/abs/2407.21783. Accessed 2026-05-21.
[6] Center for Language and Speech Processing, Johns Hopkins University, "Mike Lewis (Meta) - 'Science and Scaling: How (really) to Pre-train a Llama'", CLSP events page, 2024. https://www.clsp.jhu.edu/events/mike-lewis-meta-science-and-scaling-how-really-to-pre-train-a-llama/. Accessed 2026-05-21.
[7] University of Edinburgh ILCC, "Natural Language Processing and Computational Linguistics - possible PhD topics", School of Informatics. https://informatics.ed.ac.uk/ilcc/study-with-us/possible-phd-topics-in-ilcc/natural-language-processing-and-computational. Accessed 2026-05-21.
[8] Mike Lewis and Mark Steedman, "Combined Distributional and Logical Semantics", Transactions of the Association for Computational Linguistics, vol. 1, pp. 179-192, 2013. https://aclanthology.org/Q13-1015/. Accessed 2026-05-21.
[9] Allen School News (University of Washington), "UW CSE researchers win Best Paper Award at EMNLP 2016", 2016-10-14. https://news.cs.washington.edu/2016/10/14/uw-cse-researchers-win-best-paper-award-at-emnlp-2016/. Accessed 2026-05-21.
[10] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, Luke Zettlemoyer, "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension", Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL Anthology, 2020-07. https://aclanthology.org/2020.acl-main.703/. Accessed 2026-05-21.
[11] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, "RoBERTa: A Robustly Optimized BERT Pretraining Approach", arXiv:1907.11692, 2019-07-26. https://arxiv.org/abs/1907.11692. Accessed 2026-05-21.
[12] Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, Dhruv Batra, "Deal or No Deal? End-to-End Learning of Negotiation Dialogues", EMNLP 2017, ACL Anthology. https://aclanthology.org/D17-1259/. Accessed 2026-05-21.
[13] Angela Fan, Mike Lewis, Yann Dauphin, "Hierarchical Neural Story Generation", arXiv:1805.04833, ACL 2018. https://arxiv.org/abs/1805.04833. Accessed 2026-05-21.
[14] Meta Fundamental AI Research Diplomacy Team (FAIR), "Human-level play in the game of Diplomacy by combining language models with strategic reasoning", Science, 2022-11-22. https://www.science.org/doi/10.1126/science.ade9097. Accessed 2026-05-21.
[15] Center for Language and Speech Processing, Johns Hopkins University, "Mike Lewis (Meta) - 'Science and Scaling: How (really) to Pre-train a Llama'", CLSP events page, 2024. https://www.clsp.jhu.edu/events/mike-lewis-meta-science-and-scaling-how-really-to-pre-train-a-llama/. Accessed 2026-05-21.
[16] Meta AI, "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation", AI at Meta blog, 2025-04-05. https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Accessed 2026-05-21.
[17] Mike Lewis (@ml_perception), public X (Twitter) profile. https://x.com/ml_perception. Accessed 2026-05-21.
[18] Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, Emmanuel Dupoux, "On Generative Spoken Language Modeling from Raw Audio", arXiv:2102.01192, 2021-02-01. https://arxiv.org/abs/2102.01192. Accessed 2026-05-21.
[19] Michael Brenndoerfer, "BART Pre-training: Denoising Strategies & Text Infilling", mbrenndoerfer.com. https://mbrenndoerfer.com/writing/bart-pretraining-denoising-text-infilling-strategies. Accessed 2026-05-21.
[20] Meta AI, "Introducing Meta Llama 3: The most capable openly available LLM to date", AI at Meta blog, 2024-04-18. https://ai.meta.com/blog/meta-llama-3/. Accessed 2026-05-21.
[21] Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Gergely Szilvasy, Rich James, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, Mike Lewis, "In-Context Pretraining: Language Modeling Beyond Document Boundaries", arXiv:2310.10638, 2023-10-16. https://arxiv.org/abs/2310.10638. Accessed 2026-05-21.
[22] Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, Mike Lewis, "Self-Alignment with Instruction Backtranslation", arXiv:2308.06259, 2023-08-11. https://arxiv.org/abs/2308.06259. Accessed 2026-05-21.
[23] Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, Luke Zettlemoyer, "Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models", arXiv:2208.03306, 2022-08-05. https://arxiv.org/abs/2208.03306. Accessed 2026-05-21.
[24] Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, Omer Levy, "LIMA: Less Is More for Alignment", arXiv:2305.11206, 2023-05-18. https://arxiv.org/abs/2305.11206. Accessed 2026-05-21.
[25] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis, "Generalization through Memorization: Nearest Neighbor Language Models", ICLR 2020. https://arxiv.org/abs/1911.00172. Accessed 2026-05-21.
[26] Robin Jia, Mike Lewis, Luke Zettlemoyer, "Question Answering Infused Pre-training of General-Purpose Contextualized Representations", Findings of the Association for Computational Linguistics: ACL 2022, arXiv:2106.08190. https://arxiv.org/abs/2106.08190. Accessed 2026-05-21.
[27] Hugging Face, "Llama 3.1 - 405B, 70B & 8B with multilinguality and long context", Hugging Face blog, 2024-07-23. https://huggingface.co/blog/llama31. Accessed 2026-05-21.
[28] Meta AI, "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension", Facebook AI Research publication page. https://ai.meta.com/research/publications/bart-denoising-sequence-to-sequence-pre-training-for-natural-language-generation-translation-and-comprehension/. Accessed 2026-05-21.
