Mike Lewis
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,074 words
Mike Lewis is a British natural language processing researcher based in Seattle who serves as a research scientist at Meta AI (Facebook AI Research, FAIR) and as the pre-training research lead on the Llama team.[^1][^2] He is the first author of the 2019 paper introducing BART, one of the foundational denoising sequence-to-sequence pre-trained models, and a co-author of RoBERTa, the Cicero Diplomacy agent, and the 2024 technical report describing Llama 3.[^1][^3][^4][^5] He led pre-training for Llama 3 and has continued in that role on subsequent Meta foundation-model efforts.[^1][^2] Lewis holds a PhD from the University of Edinburgh, where he was advised by Mark Steedman, and was a postdoctoral researcher at the University of Washington with Luke Zettlemoyer.[^1][^6] He maintains a comparatively low public profile relative to figures such as Yann LeCun.[^1][^6]
Lewis took a master's degree at the University of Oxford before moving to the University of Edinburgh's School of Informatics, where he joined the Institute for Language, Cognition and Computation as a PhD student under Mark Steedman.[^1] His doctoral work, completed in 2015, examined wide-coverage semantics for natural language processing built on Combinatory Categorial Grammar (CCG), aiming to derive logical semantic representations that capture paraphrase and entailment from machine reading of large unlabelled corpora.[^7][^8] An early flagship paper from this line of work, "Combined Distributional and Logical Semantics" with Steedman, appeared in the inaugural volume of the Transactions of the Association for Computational Linguistics in 2013. It introduced an approach that maps language to logical-form representations whose relational constants are induced offline by distributional clustering at the level of predicate-argument structure, with the aim of unifying the strengths of model-theoretic and distributional semantics.[^8]
After his PhD, Lewis moved to the University of Washington as a postdoctoral researcher in Luke Zettlemoyer's group, focusing on search-based structured prediction and neural CCG parsing.[^1] During this period he co-authored, with Kenton Lee and Zettlemoyer, "Global Neural CCG Parsing with Optimality Guarantees," which received a Best Paper Award at the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP).[^9] He also received a Best Resource Paper award at the Association for Computational Linguistics (ACL) 2017 and a Best Paper Honourable Mention at ACL 2018.[^6]
Lewis joined Facebook AI Research (now Meta AI) shortly thereafter, taking up a research scientist position at the Seattle FAIR site.[^1]
Lewis is the first author of "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension," posted to arXiv on 29 October 2019 (arXiv:1910.13461) and published in the Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics in July 2020, pages 7871-7880.[^3][^10] The co-authors are Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer.[^3][^10]
BART is a denoising autoencoder built on a standard Transformer encoder-decoder. Training corrupts source text with a noising function and asks the decoder to reconstruct the clean original sequence.[^3][^10] The model has a bidirectional encoder (in the style of BERT) and a left-to-right autoregressive decoder (in the style of GPT), with ReLU activations replaced by GeLUs and parameters initialised from N(0, 0.02); base BART uses six layers per side and BART-large uses twelve.[^3][^19] The authors evaluated several corruption strategies, including token masking, token deletion, document rotation, sentence permutation, and text infilling with span lengths drawn from a Poisson(λ=3) distribution, and reported that combining sentence shuffling with span-based in-filling (replacing contiguous spans with a single mask token) worked best on downstream tasks.[^3][^19] BART matched RoBERTa on the GLUE benchmark and SQuAD while delivering up to roughly six ROUGE-point gains on summarization and dialogue benchmarks and a 1.1 BLEU improvement over a back-translation baseline for machine translation.[^3][^10] The paper has become a standard reference for encoder-decoder pretraining alongside T5 and is one of Lewis's most cited works, with more than sixteen thousand citations on Google Scholar.[^4] The model is distributed widely via Hugging Face in checkpoints such as facebook/bart-large, facebook/bart-large-cnn (fine-tuned for CNN/DailyMail summarization), and facebook/bart-large-mnli (fine-tuned for NLI and zero-shot classification).
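The two noising functions that the authors found most effective, sentence permutation and text infilling, can be sketched in a few lines. This is a minimal illustration, not the fairseq implementation: the function names, the `<mask>` string, and the `start_prob` parameter (how often a span begins) are assumptions made for the example. Only the Poisson(λ=3) span lengths and the replacement of each span with a single mask token come from the description above.

```python
import math
import random

MASK = "<mask>"  # placeholder string; real tokenizers use their own mask id


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def text_infill(tokens, start_prob=0.1, lam=3.0, seed=0):
    """Replace randomly chosen spans with a single MASK token each.

    start_prob is a hypothetical knob for this sketch; span lengths are
    drawn from Poisson(lam). A 0-length span inserts a MASK without
    removing any token.
    """
    rng = random.Random(seed)
    corrupted = []
    i = 0
    while i < len(tokens):
        if rng.random() < start_prob:
            span_len = sample_poisson(lam, rng)
            corrupted.append(MASK)  # one mask regardless of span length
            i += span_len           # skip the tokens hidden by the span
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted


def permute_sentences(sentences, seed=0):
    """Return the sentences of a document in a random order."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog today".split()
    print(text_infill(original, start_prob=0.3, seed=1))
```

A training pair is then `(corrupted, original)`: the encoder reads the corrupted sequence and the decoder is trained to emit the original. Because a span of any length collapses to one mask, the model also has to infer how many tokens are missing.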
Lewis is one of ten co-authors of "RoBERTa: A Robustly Optimized BERT Pretraining Approach" (arXiv:1907.11692), posted on 26 July 2019.[^11] The paper is a replication study of BERT showing that the original model was undertrained, and that with longer training, larger batches, more data, and removal of the next-sentence prediction objective, the same architecture could substantially outperform published BERT results.[^11] RoBERTa is the single most cited paper on Lewis's Google Scholar profile, with over forty thousand citations.[^4]
In 2020 Lewis was also a co-author on "Generalization through Memorization: Nearest Neighbor Language Models" (Khandelwal, Levy, Jurafsky, Zettlemoyer, Lewis; ICLR 2020), which introduced kNN-LM. The method augments a pre-trained Transformer LM by linearly interpolating its next-token distribution with a non-parametric k-nearest-neighbour distribution over training-data continuations. It reported a then state-of-the-art perplexity of 15.79 on WikiText-103 (a 2.9-point improvement) without additional training and is widely cited as an early example of explicit memory retrieval as a complement to parametric LMs.[^25]
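The interpolation at the heart of kNN-LM can be illustrated with a small sketch. It assumes the nearest neighbours have already been retrieved as `(distance, next_token)` pairs. The function names and the value of `lam` are illustrative assumptions, and the datastore and retrieval step are omitted entirely. Neighbours are weighted by the exponential of their negative distance, following the paper's formulation.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors):
    """Turn retrieved (distance, next_token) pairs into a distribution.

    Each neighbour contributes exp(-distance) to the token that followed
    its stored context; the weights are then normalised to sum to 1.
    """
    weights = defaultdict(float)
    for distance, token in neighbors:
        weights[token] += math.exp(-distance)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_lm, p_knn, lam=0.25):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {
        token: lam * p_knn.get(token, 0.0) + (1.0 - lam) * p_lm.get(token, 0.0)
        for token in vocab
    }


if __name__ == "__main__":
    # Toy next-token distribution from the parametric LM (made-up numbers).
    p_lm = {"Paris": 0.5, "London": 0.3, "Rome": 0.2}
    # Toy retrieval result: distances to stored contexts and their next tokens.
    neighbors = [(0.1, "Paris"), (0.4, "Paris"), (1.2, "Rome")]
    mixed = interpolate(p_lm, knn_distribution(neighbors))
    print(sorted(mixed.items(), key=lambda kv: -kv[1]))
```

Since both inputs are probability distributions and the mixing weights sum to one, the result is also a valid distribution. Setting `lam=0` recovers the unmodified LM, which is why the method needs no additional training.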
Lewis also co-authored "Question Answering Infused Pre-training of General-Purpose Contextualized Representations" (QuIP; Jia, Lewis, Zettlemoyer; arXiv:2106.08190, Findings of ACL 2022), which uses 80 million synthesised QA pairs to train a bi-encoder QA model to match a more accurate cross-encoder, and shows that the resulting representations transfer to QA, paraphrase detection, named entity recognition, and sentiment analysis.[^26]
Earlier first-author and co-author work at FAIR included "Deal or No Deal? End-to-End Learning of Negotiation Dialogues" at EMNLP 2017 (with Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra), which trained end-to-end dialogue agents on a multi-issue bargaining task and introduced a dialogue-rollouts planning technique in which the model simulates possible complete continuations of a conversation to choose actions that improve expected outcome.[^12] The accompanying dataset of human-human negotiation dialogues was publicly released, and the paper is often cited as one of the earliest end-to-end neural negotiation agents.[^12] In 2018 he was a co-author on "Hierarchical Neural Story Generation" (Fan, Lewis, Dauphin, ACL 2018), introducing a 300K-story WritingPrompts dataset and a fusion model with gated multi-scale self-attention; the paper's dataset and code are part of Meta's fairseq examples and remain a standard reference for long-form open-ended text generation.[^13]
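The dialogue-rollouts planning idea from the negotiation paper can be sketched as a simple selection loop: for each candidate utterance, simulate several complete continuations of the conversation, average the resulting rewards, and pick the candidate with the highest estimate. Everything named below is hypothetical. In particular, `toy_simulate` stands in for sampling the rest of a dialogue from a learned model and scoring the final deal, and its reward values are invented for the example.

```python
import random


def choose_by_rollouts(candidates, simulate, n_rollouts=10, seed=0):
    """Pick the candidate utterance with the best average simulated reward.

    simulate(utterance, rng) must play out one complete continuation of
    the dialogue after `utterance` and return the final reward.
    """
    rng = random.Random(seed)

    def expected_reward(utterance):
        total = sum(simulate(utterance, rng) for _ in range(n_rollouts))
        return total / n_rollouts

    return max(candidates, key=expected_reward)


def toy_simulate(utterance, rng):
    """Hypothetical stand-in for a model-based rollout: a fixed base
    reward per utterance plus bounded noise."""
    base = {
        "I want the book": 6.0,
        "I want both hats": 4.0,
        "You take everything": 1.0,
    }[utterance]
    return base + rng.uniform(-1.0, 1.0)


if __name__ == "__main__":
    options = ["I want the book", "I want both hats", "You take everything"]
    print(choose_by_rollouts(options, toy_simulate, n_rollouts=20))
```

The design trades extra inference-time computation for better decisions: the policy that generates candidates is unchanged, and the rollouts only re-rank its proposals by estimated outcome.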
In 2022 Lewis was a co-author on the Meta Fundamental AI Research Diplomacy Team's Science paper "Human-Level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning," which introduced the Cicero agent.[^14] Cicero combined a controllable dialogue model with planning and reinforcement-learning components and, playing anonymously against humans in an online Diplomacy league, achieved more than double the average human score across 40 speed games and placed in the top 10% of participants who played more than one game.[^14] The dialogue model was trained to ground generated messages in an explicit plan of in-game actions, and the planning component combined a no-press equilibrium-search algorithm with a per-turn action policy conditioned on partner-modelling beliefs inferred from chat.[^14] In a public X (Twitter) reply, Lewis emphasised that the agent was designed so that "all its messages correspond to actions it currently plans to take" rather than being trained to deceive partners; Cicero is a rare modern example of a strategic dialogue agent whose code and dataset were openly released.[^14][^17]
Lewis is a core contributor to, and is listed in the leading author group of, "The Llama 3 Herd of Models" (arXiv:2407.21783), the technical report describing Meta's Llama 3 family, with version 1 posted on 31 July 2024 and a revised version 3 dated 23 November 2024.[^5] The paper introduces a family of dense Transformer language models of up to 405B parameters with a 128K-token context window. The models are natively multilingual and support coding, reasoning, and tool use, and the paper reports performance comparable to leading proprietary models such as GPT-4 on many benchmarks.[^5] The Llama 3 flagship was pre-trained on roughly fifteen trillion tokens drawn from publicly available sources, roughly seven times more data than Llama 2, with extensive data filtering, semantic deduplication, and quality classifiers, and with a markedly higher share of code than earlier Llama generations.[^20] Architecturally the 405B model is a decoder-only Transformer with 126 layers, 128 attention heads, Grouped-Query Attention, RMSNorm, and Rotary Position Embeddings, and a vocabulary increased from 32K to 128K tokens relative to Llama 2.[^5][^27]
Lewis is publicly identified by Meta and by external venues as the pre-training research lead on the Llama team and as having led pre-training for Llama 3.[^1][^2] In 2024 he gave invited talks describing the work, including an invited ICLR 2024 talk titled "Bridging the Gap Between Pre-training Data and Alignment" and a seminar at the Johns Hopkins Center for Language and Speech Processing titled "Science and Scaling: How (really) to Pre-train a Llama."[^2][^15]
His public talks on Llama 3 emphasised two themes for improving base models without relying purely on a separate alignment phase: in-context pre-training, which constructs training sequences from semantically related documents to strengthen long-context and few-shot behaviour, and instruction backtranslation, which automatically infers candidate instructions for unlabelled web documents.[^2][^15] Both ideas appear as named arXiv preprints with Lewis as an author: "In-Context Pretraining: Language Modeling Beyond Document Boundaries" (Shi et al., arXiv:2310.10638) describes the document-relatedness pre-training scheme, and "Self-Alignment with Instruction Backtranslation" (Li et al., arXiv:2308.06259) describes the Humpback model that uses an iterative self-augment and self-curate procedure starting from a seed-finetuned LM on a web corpus.[^21][^22]
Llama 4 was announced by Meta on 5 April 2025 as a mixture-of-experts-based, natively multimodal family of models, with the initial public release including Llama 4 Scout and Maverick and a larger Behemoth model previewed but not released as open weights.[^16] Lewis, who is reported to have led pre-training for Llama 3, is publicly described as having continued as pre-training research lead on the Llama team during the development of Llama 4.[^1][^2] Meta's Llama 4 announcement does not name an individual pre-training lead; it instead describes a new MetaP technique for reliably setting per-layer hyperparameters such as learning rates and initialisation scales across model sizes.[^16] No standalone arXiv paper with an author list for Llama 4 had been published at the time of the model launch, so attributions of personal leadership rest on his standing job description rather than a paper byline.[^1][^2][^16]
Lewis is a co-author of several other FAIR papers relevant to the modern open-weights pre-training stack:
Lewis's published work spans several themes:
| Theme | Representative works |
|---|---|
| Wide-coverage CCG semantics and parsing | "Combined Distributional and Logical Semantics" (TACL 2013); "A* CCG Parsing with a Supertag-factored Model" (NAACL 2014); "Global Neural CCG Parsing with Optimality Guarantees" (EMNLP 2016, Best Paper)[^8][^9] |
| Dialogue and grounded agents | "Deal or No Deal? End-to-End Learning of Negotiation Dialogues" (EMNLP 2017); "Hierarchical Neural Story Generation" (ACL 2018); Cicero (Science 2022)[^12][^13][^14] |
| Pre-trained encoders | RoBERTa (2019)[^11] |
| Pre-trained encoder-decoders | BART (ACL 2020)[^3][^10] |
| Modular and parallel LM training | Branch-Train-Merge (arXiv 2022)[^23] |
| Data-efficient alignment | LIMA (NeurIPS 2023); Self-Alignment with Instruction Backtranslation / Humpback (arXiv 2023); In-Context Pretraining (arXiv 2023)[^21][^22][^24] |
| Llama foundation models | "The Llama 3 Herd of Models" (arXiv 2024); pre-training lead for Llama 3 and continuing in that role on Llama 4[^1][^2][^5] |
Across these projects a recurring thread is moving from explicit symbolic structure (CCG, logical forms) toward large-scale neural pre-training where structure is induced from corrupted-text reconstruction or from agent-environment interaction. A second recurring thread is a preference for pushing capability into the pre-training stage itself (data selection, document ordering, infilling, instruction backtranslation, modular experts) rather than relying solely on a separate post-training alignment phase.[^2][^15][^21][^22]
Three technical threads in Lewis's work are particularly load-bearing for the rest of the field:
BART (2019/2020) frames a wide range of prior pre-training schemes as special cases of a denoising autoencoder over text. Token masking with no shuffling recovers a BERT-style objective; deletion is similar to masking but forces the model to also decide which positions are missing; permutation captures the kinds of word-order corruption used in some earlier work; and span infilling with span lengths sampled from Poisson(λ=3) generalises masked-span objectives such as those of SpanBERT and T5 while preserving variable-length outputs.[^3][^19] Because the decoder generates the target autoregressively, the same pre-trained checkpoint can be fine-tuned for both comprehension and generation without architectural changes; this is the property that the "Generation, Translation, and Comprehension" part of the title refers to.[^3]
In his Llama-era public talks Lewis has consistently emphasised data composition, ordering, and quality over architectural novelty as the dominant driver of base-model capability.[^2][^15] This view is supported by the Llama 3 paper's heavy emphasis on data filtering, semantic deduplication, code share, and multilingual coverage, and by the In-Context Pretraining and Instruction Backtranslation lines of work that explicitly modify pre-training corpora and example construction rather than the model itself.[^5][^20][^21][^22]
The Branch-Train-Merge and kNN-LM lines of work argue from different angles that LM capability can be decomposed: BTM into a set of domain experts that can be averaged or ensembled, and kNN-LM into a parametric LM plus a non-parametric memory of training data continuations.[^23][^25] These two papers, together with QuIP, sit alongside BART as evidence that Lewis's research style frequently mixes large pre-trained backbones with explicit structural devices.[^26]
Public recognition for Lewis includes:[^6]
His Google Scholar profile lists more than 136,000 citations, an h-index above 60, and an i10-index above 90; the four most-cited papers are RoBERTa, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (on which Patrick Lewis is the first author, a distinct researcher with whom Mike Lewis is sometimes confused), "The Llama 3 Herd of Models," and BART.[^4]
The combination of BART, RoBERTa, Cicero, and Llama 3 places Lewis among a small group of researchers who have authored foundational papers in three distinct generations of NLP: the late-2010s era of large pre-trained encoders and encoder-decoders, the early-2020s era of dialogue and game-playing agents, and the mid-2020s era of open-weights frontier foundation models. BART in particular is one of the canonical references for encoder-decoder pre-training; Hugging Face hosts the original facebook/bart-large checkpoint and several widely used fine-tuned variants, including facebook/bart-large-cnn for abstractive summarization on the CNN/DailyMail dataset and facebook/bart-large-mnli for zero-shot text classification via natural-language inference.[^3][^28]
The Llama 3 paper is, by citation count, already one of the most cited foundation-model technical reports ever published; with over eighteen thousand citations on Google Scholar within roughly two years of release, it has become the de facto reference for understanding how to train a frontier-scale open-weights LM.[^4][^5] Lewis's specific imprint on the Llama 3 program, visible in his Llama 3 talks, is the emphasis that aligned model quality is dominated by pre-training corpus design and data-side techniques such as in-context pretraining and instruction backtranslation, rather than by post-hoc preference tuning alone.[^2][^15][^21][^22]
Despite leading pre-training for two generations of one of the most widely deployed open-weights model families, Lewis maintains a relatively low public profile compared with senior Meta AI figures such as Yann LeCun.[^1][^2] Most of his public communication consists of conference and seminar talks (for example at ICLR 2024, the Johns Hopkins Center for Language and Speech Processing, and the Georgia Tech ML@GT seminar series), arXiv papers, and an X (Twitter) account at the handle @ml_perception, rather than mainstream-media interviews or executive-style commentary.[^2][^15][^17]
Several caveats apply to public statements about Lewis's role:
Lewis's role in the modern open-weights LM ecosystem invites comparison with a handful of other researchers who have led pre-training at frontier labs. Unlike Yann LeCun, who as Meta's Chief AI Scientist is the most public face of Meta AI and frequently comments in mainstream press, Lewis communicates mostly through papers, conference talks, and his X account, and his job title (research scientist / pre-training research lead on the Llama team) is intermediate in seniority.[^1][^2] Compared with peers at other labs such as Alec Radford at OpenAI or Quoc Le at Google, who are similarly associated with multiple landmark pre-training papers, Lewis's published trajectory is somewhat broader, spanning logical semantics and game-playing dialogue in addition to text pre-training and foundation models.[^4][^8][^14]
Within FAIR he has worked repeatedly with a recognisable group of collaborators that includes Luke Zettlemoyer (his postdoc advisor and a Meta affiliate), Omer Levy, Naman Goyal, Yinhan Liu, and Marjan Ghazvininejad; many of these names co-author both the BART and RoBERTa papers and the more recent alignment-and-data papers.[^3][^11][^21][^22][^23]