Jacob Devlin
Last reviewed
Jun 5, 2026
Sources
20 citations
Review status
Source-backed
Revision
v2 · 2,136 words
Jacob Devlin is an American research scientist in natural language processing and machine learning, best known as the lead author of BERT, the bidirectional language representation model that Google introduced in 2018 and that became a standard building block for modern NLP [1][2]. Before BERT he won a best long paper award at the Association for Computational Linguistics (ACL) for machine translation work at BBN Technologies, and at Microsoft Research he led the shift of Microsoft's production translation system to neural methods [3][4]. He moved briefly to OpenAI in early 2023 and then returned to Google, where he has been a contributor to the Gemini family of models [5][6].
Devlin's research centers on fast, scalable deep learning models for language understanding, machine translation, question answering, and information retrieval [1]. His most influential contribution, BERT, demonstrated that a single pretrained Transformer encoder, fine-tuned on downstream tasks, could set state-of-the-art results across a wide range of NLP benchmarks, and it helped popularize the pretrain-then-fine-tune paradigm that underpins later large language models [2][7]. He has worked across several of the organizations at the center of the recent wave of language model development, including Microsoft Research, Google, and OpenAI [3][5]. Beyond BERT, he has been a co-author on later Google language-model efforts including the PaLM family and instruction-tuning research, and earlier he contributed to neural program synthesis [11][12][13].
Devlin earned a Master of Science in computer science from the University of Maryland in 2009, where he was advised by Bonnie Dorr [1][4]. Dorr is a professor of computer science at the University of Maryland Institute for Advanced Computer Studies whose research spans multilingual processing, including machine translation, summarization, and cross-language information retrieval [14]. Devlin's graduate and early-career research focused on statistical machine translation and other core NLP problems, the area in which he would make his first major mark [4].
Early in his career Devlin worked as a natural language processing research scientist at BBN Technologies, with a stint as a visiting researcher at Johns Hopkins University [1][4]. At BBN he co-authored "Fast and Robust Neural Network Joint Models for Statistical Machine Translation," presented at ACL 2014 with Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul [8]. The paper introduced a neural network joint model that combined source-context and target-history information for translation and ran fast enough to be used directly in decoding, and it received the conference's best long paper award [3][4][8]. Devlin had earlier been recognized with a best short paper award at the 2012 conference of the North American Chapter of the ACL (NAACL) for "Trait-Based Hypothesis Selection For Machine Translation," co-authored with Spyros Matsoukas [1][15].
From 2014 to 2017 Devlin was a principal research scientist at Microsoft Research [1]. There he led the transition of Microsoft Translator from phrase-based statistical translation to neural machine translation, helping move a large production system onto neural sequence models [1][3]. This work placed him among the researchers applying deep learning to translation at industrial scale during the period when neural methods displaced earlier statistical pipelines, a shift that reached production systems in the latter part of 2016 [3][16].
While at Microsoft, Devlin also worked on neural program synthesis, the task of having a model generate a program from examples of its intended input and output behavior. He was the lead author of "RobustFill: Neural Program Learning under Noisy I/O," presented at the 34th International Conference on Machine Learning (ICML 2017) with Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli [12]. The paper compared neural program synthesis and neural program induction on a string-transformation domain and reported that its best synthesis model reached 92 percent accuracy on a real-world test set while remaining robust to noisy inputs such as typos [12].
Devlin joined Google Research in 2017 as a staff research scientist, working on deep learning models for language understanding [1][3]. In October 2018, with Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, he published "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," on which he is listed as first author [2][7]. The acronym BERT stands for Bidirectional Encoder Representations from Transformers, and the model is built on an encoder-only Transformer architecture [2]. The work was carried out within the Google AI Language group [2].
The central idea of BERT was to pretrain a deeply bidirectional encoder on large unlabeled text, so that each token's representation could attend to context on both its left and right, rather than in a single direction as in earlier left-to-right models [7]. This was achieved through two self-supervised objectives: a masked language model that predicts randomly hidden tokens from their surrounding context, and a next-sentence-prediction task [7]. The resulting pretrained model could then be fine-tuned with a single additional output layer to handle tasks such as question answering and language inference without heavy task-specific architecture changes [7].
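The masked-language-model objective described above can be sketched in a few lines of Python. This is a simplified illustration, not the BERT authors' implementation: the function name `mask_tokens`, the whitespace tokenization, and the `seed` argument are assumptions made for the example, and every selected token is simply replaced by `[MASK]`. The 15 percent default follows the masking rate described in the BERT paper, which additionally leaves some selected tokens unchanged or swaps them for random tokens; that refinement is omitted here.

```python
import random

MASK = "[MASK]"  # placeholder symbol shown to the model in place of a hidden token


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Build one masked-language-model training example (simplified sketch).

    Returns (masked_tokens, labels). labels[i] holds the original token
    wherever position i was hidden, and None everywhere else.
    """
    rng = random.Random(seed)  # local RNG so the example is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model must predict this token
        else:
            masked.append(tok)    # token stays visible as context
            labels.append(None)   # no prediction target at this position
    return masked, labels


# Hypothetical usage with a whitespace-tokenized sentence.
tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
```

A model trained on this objective receives `masked` as input and is scored only at positions where `labels` is not `None`, so it has to use context on both sides of each hidden token.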
The original release came in two sizes. BERT-Base had about 110 million parameters and BERT-Large had about 340 million parameters, and both were pretrained on the Toronto BookCorpus and English Wikipedia [2]. On publication the paper reported state-of-the-art results on eleven NLP tasks, including pushing the GLUE benchmark score to 80.4 percent and improving results on the SQuAD question-answering datasets (versions 1.1 and 2.0) and the SWAG commonsense-inference dataset [2][7]. BERT quickly became a ubiquitous baseline in NLP research [2][7].
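As a rough check on the quoted model sizes, the parameter counts can be estimated from the architecture. The sketch below is a back-of-the-envelope calculation under stated assumptions, none of which appear in the text above: hidden sizes of 768 and 1024, depths of 12 and 24 layers, a feed-forward width of four times the hidden size, a 30,522-entry vocabulary, 512 positions, and a pooling layer on top, as in the released English uncased checkpoints. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, type_vocab=2):
    """Estimate parameters in a BERT-style encoder (assumed layout, see lead-in)."""
    ffn = 4 * hidden  # assumed feed-forward width

    # Token, position, and segment embedding tables plus one LayerNorm (scale + bias).
    embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden

    # Self-attention: query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)

    # Feed-forward block: two dense layers with biases.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)

    # Two LayerNorms per encoder layer, each with scale and bias vectors.
    layer_norms = 2 * (2 * hidden)

    per_layer = attention + feed_forward + layer_norms

    # Pooling layer applied to the first token's representation.
    pooler = hidden * hidden + hidden

    return embeddings + layers * per_layer + pooler


base_params = bert_param_count(layers=12, hidden=768)
large_params = bert_param_count(layers=24, hidden=1024)
```

Under these assumptions the estimate is about 109.5 million parameters for the smaller configuration and about 335 million for the larger one, in line with the rounded figures of roughly 110 million and 340 million.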
The model moved into production at Google as well. On October 25, 2019, Google announced that it had begun applying BERT to English-language queries on Google Search in the United States, and by December 2019 the company said BERT had been adopted in Search for more than 70 languages [2]. The model also spawned a large family of successors and variants such as RoBERTa, ALBERT, DistilBERT, ELECTRA, and DeBERTa [2]. The BERT paper became one of the most cited works in the field, accumulating tens of thousands of citations in scholarly indexes within a few years of publication [2][17]. Among its four authors, Devlin's co-authors Ming-Wei Chang and Kenton Lee later continued at Google DeepMind, and Kristina Toutanova remained a research scientist at Google [18][19].
The following table summarizes key facts about the original BERT work.
| Item | Detail |
|---|---|
| Paper | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [7] |
| First author | Jacob Devlin [2][7] |
| Co-authors | Ming-Wei Chang, Kenton Lee, Kristina Toutanova [2][7] |
| Affiliation | Google AI Language [2] |
| Year | 2018 (arXiv); presented at NAACL 2019 [2][7] |
| Architecture | Encoder-only Transformer [2] |
| Sizes | BERT-Base ~110M parameters; BERT-Large ~340M parameters [2] |
| Pretraining data | Toronto BookCorpus and English Wikipedia [2] |
| Pretraining objectives | Masked language modeling and next-sentence prediction [7] |
| Award | NAACL 2019 Best Long Paper [2][20] |
| Search deployment | English in US from October 25, 2019; 70+ languages by December 2019 [2] |
After BERT, Devlin continued to contribute to Google's large-scale language-model research as a co-author on several major papers. He is listed among the authors of "PaLM: Scaling Language Modeling with Pathways" (2022), which described a 540-billion-parameter densely activated Transformer trained on Google's Pathways system and reported strong few-shot results across hundreds of benchmarks [11]. He was also a co-author of "Scaling Instruction-Finetuned Language Models" (2022), the Flan work that showed instruction fine-tuning across many tasks improved performance for model families including PaLM and T5 and that released the Flan-T5 checkpoints [13]. These projects placed him among the contributors to the line of Google research that led toward the Gemini program.
In January 2023 Devlin left Google to join OpenAI, a move reported by The Information and covered by Business Insider and other outlets [5][9]. According to that reporting, he had warned senior Google leaders, including chief executive Sundar Pichai and AI lead Jeff Dean, that the team building the Bard chatbot appeared to be training on data from ShareGPT, a site where users post their ChatGPT conversations, which he believed risked violating OpenAI's terms of service [5][9]. The reporting added that Google stopped using the data after Devlin raised the concern, while Google publicly denied that Bard was trained on ShareGPT or ChatGPT data [5][9].
His tenure at OpenAI was brief. In June 2023 The Information reported that Devlin had returned to Google, where he was said to be working closely with Slav Petrov, a researcher central to the company's Bard and Gemini efforts [6]. By then Google had merged its Google Brain team with DeepMind to form Google DeepMind, the unit responsible for the Gemini program [6].
Since returning to Google in 2023, Devlin has continued to work on the company's frontier models [6]. He is credited among the contributors to the Gemini 1.5 technical report, published by the Gemini Team at Google in 2024 [10]. Detailed, independently verified information about his exact title is limited in public reporting; sources consistently describe him as a research scientist contributing to Google's large language model work [1][6][10].
| Year | Paper | Role | Venue |
|---|---|---|---|
| 2014 | Fast and Robust Neural Network Joint Models for Statistical Machine Translation | First author | ACL 2014 (Best Long Paper) [8] |
| 2017 | RobustFill: Neural Program Learning under Noisy I/O | First author | ICML 2017 [12] |
| 2018/2019 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | First author | NAACL 2019 (Best Long Paper) [7][20] |
| 2022 | PaLM: Scaling Language Modeling with Pathways | Co-author | arXiv / JMLR [11] |
| 2022 | Scaling Instruction-Finetuned Language Models (Flan) | Co-author | arXiv / JMLR [13] |
| 2024 | Gemini 1.5 technical report | Contributor | arXiv [10] |
Devlin's research has been recognized with multiple best-paper awards at the field's leading venues: the NAACL 2012 Best Short Paper award, the ACL 2014 Best Long Paper award for his statistical machine translation work, and the NAACL 2019 Best Long Paper award for BERT [1][4][20]. BERT itself is widely regarded as one of the most influential papers in modern NLP, both for its benchmark results and for helping to establish the pretrain-then-fine-tune approach used by later large language models [2][7].