Jacob Devlin

Updated Jun 28, 2026

Jacob Devlin is an American research scientist in natural language processing and machine learning, best known as the first author of BERT, the bidirectional language representation model that Google introduced in 2018 and that became a standard building block for modern NLP ^[1]^[2]. Before BERT he won a best long paper award at the 2014 conference of the Association for Computational Linguistics (ACL) for neural translation modeling done at BBN Technologies, and he then led the shift of Microsoft's production translation system to neural methods at Microsoft Research ^[3]^[4]. He left Google for OpenAI in January 2023, returned to Google within months, and has since contributed to the Gemini family of models ^[5]^[6]^[10].

Who is Jacob Devlin?

Devlin's research centers on fast, scalable deep learning models for language understanding, machine translation, question answering, and information retrieval ^[1]. His most influential contribution, BERT, demonstrated that a single pretrained Transformer encoder, fine-tuned on downstream tasks, could set state-of-the-art results across a wide range of NLP benchmarks, and it helped popularize the pretrain-then-fine-tune paradigm that underpins later large language models ^[2]^[7]. He has worked across several of the organizations at the center of the recent wave of language model development, including Microsoft Research, Google, and OpenAI ^[3]^[5]. Beyond BERT, he has been a co-author on later Google language-model efforts including the PaLM family and instruction-tuning research, and earlier he contributed to neural program synthesis ^[11]^[12]^[13].

Where did Jacob Devlin study?

Devlin earned a Master of Science in computer science from the University of Maryland in 2009, where he was advised by Bonnie Dorr ^[1]^[4]. Dorr is a professor of computer science at the University of Maryland Institute for Advanced Computer Studies whose research spans multilingual processing, including machine translation, summarization, and cross-language information retrieval ^[14]. Devlin's graduate and early-career research focused on statistical machine translation and other core NLP problems, the area in which he would make his first major mark ^[4].

BBN Technologies and machine translation

Early in his career Devlin worked as a natural language processing research scientist at BBN Technologies, with a stint as a visiting researcher at Johns Hopkins University ^[1]^[4]. At BBN he co-authored "Fast and Robust Neural Network Joint Models for Statistical Machine Translation," presented at ACL 2014 with Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul ^[8]. The paper introduced a neural network joint model that combined source-context and target-history information for translation and ran fast enough to be used directly in decoding, and it received the conference's best long paper award ^[3]^[4]^[8]. Devlin had earlier been recognized with a best short paper award at the 2012 conference of the North American Chapter of the ACL (NAACL) for "Trait-Based Hypothesis Selection For Machine Translation," co-authored with Spyros Matsoukas ^[1]^[15].

What did Jacob Devlin do at Microsoft Research?

From 2014 to 2017 Devlin was a principal research scientist at Microsoft Research ^[1]. There he led the transition of Microsoft Translator from phrase-based statistical translation to neural machine translation, helping move a large production system onto neural sequence models ^[1]^[3]. This work placed him among the researchers applying deep learning to translation at industrial scale during the period when neural methods displaced earlier statistical pipelines, a shift that reached production systems in the latter part of 2016 ^[3]^[16].

While at Microsoft, Devlin also worked on neural program synthesis, the task of having a model generate a program from examples of its intended input and output behavior. He was the lead author of "RobustFill: Neural Program Learning under Noisy I/O," presented at the 34th International Conference on Machine Learning (ICML 2017) with Jonathon Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli ^[12]. The paper compared neural program synthesis and neural program induction on a string-transformation domain and reported that its best synthesis model reached 92 percent accuracy on a real-world test set while remaining robust to noisy inputs such as typos ^[12].

What is BERT and what was Jacob Devlin's role?

Devlin joined Google Research in 2017 as a staff research scientist, working on deep learning models for language understanding ^[1]^[3]. In October 2018 he published "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" as first author, with co-authors Ming-Wei Chang, Kenton Lee, and Kristina Toutanova ^[2]^[7]. The work was carried out within the Google AI Language group ^[2]. The acronym BERT stands for Bidirectional Encoder Representations from Transformers, and the model is built on an encoder-only Transformer architecture ^[2]. The paper's abstract explains that "unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers" ^[7].

How does BERT work, and why did it matter?

The central idea of BERT was to pretrain a deeply bidirectional encoder on large unlabeled text, so that each token's representation could attend to context on both its left and right, rather than in a single direction as in earlier left-to-right models ^[7]. This was achieved through two self-supervised objectives: a masked language model that predicts randomly hidden tokens from their surrounding context, and a next-sentence-prediction task ^[7]. The resulting pretrained model could then be fine-tuned with a single additional output layer to handle tasks such as question answering and language inference without heavy task-specific architecture changes ^[7].
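The masked-language-model objective described above can be sketched in a few lines of Python. This is a minimal illustration, not the original training code. The 15 percent selection rate and the 80/10/10 replacement split are taken from the BERT paper rather than from this article. The whitespace tokenization, the toy vocabulary, and the function name `mask_tokens` are assumptions made for the example, and selecting each position independently is a simplification.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one masked-LM training example.

    labels[i] holds the original token at every selected position and None
    elsewhere, so a loss would be computed only on the selected positions.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


if __name__ == "__main__":
    sentence = "the model predicts hidden words from both directions".split()
    toy_vocab = sorted(set(sentence))  # hypothetical vocabulary for the demo
    corrupted, labels = mask_tokens(sentence, toy_vocab, mask_prob=0.3, seed=1)
    print(corrupted)
    print(labels)
```

Because the model sees the corrupted sequence in full, it can use tokens on both sides of each selected position when predicting the original token, which is what makes the objective bidirectional.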

The original release came in two sizes. BERT-Base had about 110 million parameters and BERT-Large had about 340 million parameters, and both were pretrained on the Toronto BookCorpus and English Wikipedia ^[2]. On publication the paper reported state-of-the-art results on eleven NLP tasks, including pushing the GLUE benchmark score to 80.4 percent and improving results on the SQuAD question-answering datasets (versions 1.1 and 2.0) and the SWAG commonsense-inference dataset ^[2]^[7]. BERT quickly became a ubiquitous baseline in NLP research ^[2]^[7].
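The two parameter counts can be roughly reproduced with simple arithmetic. The sketch below assumes the configurations published with the original release, which are not stated in this article: 12 layers with hidden size 768 for BERT-Base, 24 layers with hidden size 1024 for BERT-Large, a 30,522-entry WordPiece vocabulary, 512 positions, and a feed-forward width of four times the hidden size. The class and function names are hypothetical. The count covers the embeddings, the encoder layers, and the pooling layer, and leaves out the pretraining output heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertConfig:
    layers: int
    hidden: int
    vocab_size: int = 30522   # assumed: uncased English WordPiece vocabulary
    max_positions: int = 512  # assumed maximum sequence length
    segment_types: int = 2    # sentence A / sentence B embeddings

    @property
    def feed_forward(self) -> int:
        return 4 * self.hidden


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(width: int) -> int:
    """Scale and shift vectors of a layer normalization."""
    return 2 * width


def count_parameters(cfg: BertConfig) -> int:
    h = cfg.hidden
    embeddings = (cfg.vocab_size + cfg.max_positions + cfg.segment_types) * h
    embeddings += layer_norm(h)
    # Query, key, value, and output projections, followed by a layer norm.
    attention = 4 * linear(h, h) + layer_norm(h)
    # Two-layer feed-forward block, followed by a layer norm.
    ffn = linear(h, cfg.feed_forward) + linear(cfg.feed_forward, h) + layer_norm(h)
    pooler = linear(h, h)
    return embeddings + cfg.layers * (attention + ffn) + pooler


if __name__ == "__main__":
    for name, cfg in [("BERT-Base", BertConfig(layers=12, hidden=768)),
                      ("BERT-Large", BertConfig(layers=24, hidden=1024))]:
        print(f"{name}: {count_parameters(cfg) / 1e6:.1f}M parameters")
```

Under these assumptions the totals come to roughly 109 million and 335 million, in line with the rounded 110 million and 340 million figures quoted for the two models. The number of attention heads does not appear in the calculation because splitting the projections into heads does not change their total size.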

The model moved into production at Google as well. On October 25, 2019, Google announced that it had begun applying BERT to English-language queries on Google Search in the United States, and by December 2019 the company said BERT had been adopted in Search for more than 70 languages ^[2]. The model also spawned a large family of successors and variants such as RoBERTa, ALBERT, DistilBERT, ELECTRA, and DeBERTa ^[2]. The BERT paper became one of the most cited works in the field, accumulating tens of thousands of citations in scholarly indexes within a few years of publication ^[2]^[17]. Of Devlin's three co-authors, Ming-Wei Chang and Kenton Lee later continued at Google DeepMind, and Kristina Toutanova remained a research scientist at Google ^[18]^[19].

The following table summarizes key facts about the original BERT work.

Item	Detail
Paper	BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding ^[7]
First author	Jacob Devlin ^[2]^[7]
Co-authors	Ming-Wei Chang, Kenton Lee, Kristina Toutanova ^[2]^[7]
Affiliation	Google AI Language ^[2]
Year	2018 (arXiv); presented at NAACL 2019 ^[2]^[7]
Architecture	Encoder-only Transformer ^[2]
Sizes	BERT-Base ~110M parameters; BERT-Large ~340M parameters ^[2]
Pretraining data	Toronto BookCorpus and English Wikipedia ^[2]
Pretraining objectives	Masked language modeling and next-sentence prediction ^[7]
Award	NAACL 2019 Best Long Paper ^[2]^[20]
Search deployment	English in US from October 25, 2019; 70+ languages by December 2019 ^[2]

What did Jacob Devlin work on at Google after BERT?

After BERT, Devlin continued to contribute to Google's large-scale language-model research as a co-author on several major papers. He is listed among the authors of "PaLM: Scaling Language Modeling with Pathways" (2022), which described a 540-billion-parameter densely activated Transformer trained on Google's Pathways system and reported strong few-shot results across hundreds of benchmarks ^[11]. He was also a co-author of "Scaling Instruction-Finetuned Language Models" (2022), the Flan work that showed instruction fine-tuning across many tasks improved performance for model families including PaLM and T5 and that released the Flan-T5 checkpoints ^[13]. These projects placed him among the contributors to the line of Google research that led toward the Gemini program.

Why did Jacob Devlin leave Google for OpenAI, and why did he return?

In January 2023 Devlin left Google to join OpenAI, a move reported by The Information and covered by Business Insider and other outlets ^[5]^[9]. According to that reporting, he had warned senior Google leaders, including chief executive Sundar Pichai and AI lead Jeff Dean, that the team building the Bard chatbot appeared to be training on data from ShareGPT, a site where users post their ChatGPT conversations, which he believed risked violating OpenAI's terms of service ^[5]^[9]. The reporting added that Google stopped using the data after Devlin raised the concern, while Google publicly denied that Bard was trained on ShareGPT or ChatGPT data ^[5]^[9].

His tenure at OpenAI was brief. In June 2023 The Information reported that Devlin had returned to Google, where he was said to be working closely with Slav Petrov, a researcher central to the company's Bard and Gemini efforts ^[6]. By then Google had merged its Google Brain team with DeepMind to form Google DeepMind, the unit responsible for the Gemini program ^[6].

Where does Jacob Devlin work now?

Devlin returned to Google in 2023 and has continued to work on the company's frontier models ^[6]. He is credited among the contributors to the Gemini 1.5 technical report, published by the Gemini Team at Google in 2024 ^[10]. As of 2026 his public profiles continue to list Google as his employer and Seattle as his location ^[21]. Public reporting offers little independently verified detail about his exact title, and there is no verified report of a later move to another organization; sources consistently describe him as a research scientist contributing to Google's large language model work ^[1]^[6]^[10]^[21].

Selected publications

Year	Paper	Role	Venue
2014	Fast and Robust Neural Network Joint Models for Statistical Machine Translation	First author	ACL 2014 (Best Long Paper) ^[8]
2017	RobustFill: Neural Program Learning under Noisy I/O	First author	ICML 2017 ^[12]
2018/2019	BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding	First author	NAACL 2019 (Best Long Paper) ^[7]^[20]
2022	PaLM: Scaling Language Modeling with Pathways	Co-author	arXiv / JMLR ^[11]
2022	Scaling Instruction-Finetuned Language Models (Flan)	Co-author	arXiv / JMLR ^[13]
2024	Gemini 1.5 technical report	Contributor	arXiv ^[10]

What awards has Jacob Devlin received?

Devlin's research has been recognized with multiple best-paper awards at the field's leading venues: the NAACL 2012 Best Short Paper award, the ACL 2014 Best Long Paper award for his statistical machine translation work, and the NAACL 2019 Best Long Paper award for BERT ^[1]^[4]^[20]. BERT itself is widely regarded as one of the most influential papers in modern NLP, both for its benchmark results and for helping to establish the pretrain-then-fine-tune approach used by later large language models ^[2]^[7].

References

[1] "Jacob Devlin," Stanford NLP Seminar speaker page, Stanford Natural Language Processing Group. https://nlp.stanford.edu/seminar/details/jdevlin.shtml
[2] "BERT (language model)," Wikipedia. https://en.wikipedia.org/wiki/BERT_(language_model)
[3] "Jacob Devlin," speaker biography, 2nd Workshop on Neural Machine Translation and Generation (WNMT 2018). https://sites.google.com/site/wnmt18/speakers
[4] "Jacob Devlin wins ACL Best Paper Award," University of Maryland Department of Computer Science, June 2014. https://www.cs.umd.edu/article/2014/06/jacob-devlin-wins-acl-best-paper-award
[5] "A top AI researcher reportedly left Google for OpenAI after sharing concerns the company was training Bard on ChatGPT data," Business Insider via Yahoo Finance, March 2023. https://finance.yahoo.com/news/top-ai-researcher-reportedly-left-190003999.html
[6] "Source: Jacob Devlin, who left Google to join OpenAI in January 2023 after complaining internally Bard was being trained on ChatGPT data, has returned to Google," Jon Victor / The Information, via Techmeme, June 23, 2023. https://www.techmeme.com/230623/p19
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv:1810.04805 (NAACL 2019). https://arxiv.org/abs/1810.04805
[8] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul, "Fast and Robust Neural Network Joint Models for Statistical Machine Translation," Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014). https://aclanthology.org/P14-1129/
[9] "Google denies Bard was trained with ChatGPT data," The Verge, March 29, 2023. https://www.theverge.com/2023/3/29/23662621/google-bard-chatgpt-sharegpt-training-denies
[10] Gemini Team Google, "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context," arXiv:2403.05530 (2024). https://arxiv.org/abs/2403.05530
[11] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al., "PaLM: Scaling Language Modeling with Pathways," arXiv:2204.02311 (2022); Journal of Machine Learning Research, vol. 24 (2023). https://arxiv.org/abs/2204.02311
[12] Jacob Devlin, Jonathon Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli, "RobustFill: Neural Program Learning under Noisy I/O," Proceedings of the 34th International Conference on Machine Learning (ICML 2017); arXiv:1703.07469. https://arxiv.org/abs/1703.07469
[13] Hyung Won Chung, Le Hou, Shayne Longpre, et al. (including Jacob Devlin), "Scaling Instruction-Finetuned Language Models," arXiv:2210.11416 (2022); Journal of Machine Learning Research, vol. 25 (2024). https://arxiv.org/abs/2210.11416
[14] "Bonnie Dorr," University of Maryland Institute for Advanced Computer Studies. https://www.umiacs.umd.edu/our-experts/faculty/bonnie-dorr
[15] Jacob Devlin and Spyros Matsoukas, "Trait-Based Hypothesis Selection For Machine Translation," Proceedings of NAACL-HLT 2012. https://aclanthology.org/N12-1067/
[16] "Neural Machine Translation enabling human parity innovations in the cloud," Microsoft Translator / Microsoft Research, 2016. https://www.microsoft.com/en-us/translator/business/machine-translation/
[17] "Jacob Devlin," dblp computer science bibliography. https://dblp.org/pid/116/0575.html
[18] "Kenton Lee," Google Research / Google DeepMind. https://research.google/people/kentonlee/
[19] "Kristina N. Toutanova," Google Research. https://research.google/people/kristinantoutanova/
[20] "NAACL 2019: Google BERT Wins Best Long Paper," Synced, April 11, 2019. https://syncedreview.com/2019/04/11/naacl-2019-google-bert-wins-best-long-paper/
[21] "Jacob Devlin, Software Engineer at Google," LinkedIn professional profile (Seattle, WA). https://www.linkedin.com/in/jacob-devlin-135ab048/
