Percy Liang
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,649 words
Percy Liang is an American computer scientist and an associate professor of computer science at Stanford University, where he holds a courtesy appointment in the Department of Statistics. He is the founding director of the Stanford Center for Research on Foundation Models (CRFM), launched in 2021 as part of the Stanford Institute for Human-Centered Artificial Intelligence (HAI). [1][2] Liang is widely recognized for his contributions to natural language processing, machine learning, and the empirical study of large-scale foundation models, including semantic parsing, robustness, weakly supervised learning, and rigorous evaluation. [3]
Liang led the team that produced the HELM benchmark (Holistic Evaluation of Language Models), released in 2022 as one of the first comprehensive, multi-metric, multi-scenario evaluations of contemporary language models, and he has continued to expand the framework through successor releases that target safety, multimodality, and frontier capabilities. [4][5] He also co-authored the influential 2021 report "On the Opportunities and Risks of Foundation Models", which introduced the term "foundation model" to the research literature and helped frame discussion of the new generation of broadly applicable AI systems. [6][7]
In June 2022, Liang co-founded the AI infrastructure company Together AI alongside Vipul Ved Prakash, Ce Zhang, and Christopher Ré. [8][9] He has received the Presidential Early Career Award for Scientists and Engineers (PECASE), the IJCAI Computers and Thought Award, an NSF CAREER Award, a Sloan Research Fellowship, and a Microsoft Research Faculty Fellowship. [10][11][12]
| Attribute | Details |
| --- | --- |
| Full name | Percy Shuo Liang [13] |
| Field | Computer science, natural language processing, machine learning |
| Institution | Stanford University |
| Position | Associate Professor of Computer Science (with courtesy appointment in Statistics); Director, Center for Research on Foundation Models (CRFM) [1][13] |
| Education | B.S., MIT (2004); M.Eng., MIT (2005); Ph.D., UC Berkeley (2011) [3] |
| Doctoral advisors | Michael I. Jordan and Dan Klein [3] |
| Notable projects | HELM, CRFM, Foundation Models report, SQuAD (co-author), CodaLab Worksheets, Marin [4][6][14][15][16] |
| Company | Co-founder of Together AI (2022) [8][9] |
| Major awards | PECASE (2019); IJCAI Computers and Thought Award (2016); NSF CAREER (2016); Sloan Research Fellowship (2015); Microsoft Research Faculty Fellowship (2014) [10][11] |
Specific details of Liang's birth date, place of birth, and family background are not documented in widely available reputable sources, and as a result this article omits speculative biographical claims. As a high school student, Liang represented the United States at the International Olympiad in Informatics (IOI), where he earned bronze and silver medals in successive years. The IOI is an annual computer-programming competition for secondary-school students, and medalling at the international final is considered a strong predictor of subsequent research success in algorithms and theoretical computer science. [3]
Liang attended MIT for his undergraduate and master's studies, earning a Bachelor of Science in 2004 and a Master of Engineering in 2005, both in electrical engineering and computer science. His master's thesis was advised by the statistical NLP researcher Michael Collins, who was then at MIT (later of Columbia University and Google). The MIT NLP group during Liang's time was an active center for probabilistic and structured-prediction approaches to language understanding, and Liang's exposure to these methods would shape his subsequent doctoral work. [11][3]
He went on to UC Berkeley for doctoral study in computer science, where he was jointly advised by Michael I. Jordan (statistical machine learning) and Dan Klein (natural language processing). Both are widely recognized figures in their respective fields, and the joint advising arrangement gave Liang training that explicitly bridged statistical theory and applied NLP. He completed his Ph.D. in 2011 with a dissertation titled "Learning Dependency-Based Compositional Semantics", which developed a new semantic formalism (DCS) for learning semantic parsers from question-answer pairs rather than from expensive annotated logical forms. [17][18] Closely related work appeared at the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011) in Portland, Oregon, and was subsequently published in the journal Computational Linguistics. [18] After receiving his Ph.D., Liang held a short postdoctoral position at Google Research before joining the Stanford faculty. [3]
Liang joined the Stanford Computer Science Department as an assistant professor in fall 2012. [11] In 2019 he was promoted to associate professor, the same year he received the PECASE. [12] He holds a courtesy appointment in the Department of Statistics. [1]
Liang's research has spanned several interrelated themes in machine learning and natural language processing, including: (1) semantic parsing and question answering; (2) weak and indirect supervision; (3) robustness, generalization, and uncertainty quantification; (4) the empirical study of foundation models and their evaluation; and (5) tools and infrastructure for reproducible research. [3][11] At Stanford he has supervised a large group of graduate students who have produced influential work in NLP, including contributions to widely used datasets, benchmarks, and frameworks.
Liang teaches several flagship courses at Stanford, including CS221 (Artificial Intelligence: Principles and Techniques), the statistical learning theory course CS229T/STATS231, CS324 (Advances in Foundation Models), and CS336 (Language Models from Scratch). [1]
Liang's doctoral work and much of his early Stanford research focused on semantic parsing: mapping natural language utterances to logical forms that can be executed against a knowledge base or database to produce an answer. The motivating problem is conceptually old, often associated with question-answering systems of the 1960s and 1970s, but Liang's contribution was to recast it as a statistical learning problem trainable from indirect supervision. His paper "Semantic Parsing on Freebase from Question-Answer Pairs" (Berant, Chou, Frostig, and Liang, EMNLP 2013) established a popular paradigm in which the logical form is treated as a latent variable trained from question-answer pairs alone, sidestepping the need for expensive logical-form annotation. The accompanying WebQuestions dataset became a widely used benchmark for open-domain factoid question answering against the Freebase knowledge graph. [19] Liang summarized this body of work in the 2016 Communications of the ACM article "Learning Executable Semantic Parsers for Natural Language Understanding", which has served as a frequently cited introduction to the area. [20] He also developed and released SEMPRE, an open-source toolkit for training semantic parsers that map natural language utterances to denotations via intermediate logical forms; SEMPRE supports several formalisms, including lambda calculus, lambda DCS, and Java expressions, and is agnostic to whether logical-form construction proceeds via combinatory categorial grammar or simpler chart-based approaches. [21]
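The weak-supervision idea described above can be illustrated with a toy sketch. It is not SEMPRE or the method of the cited papers: the knowledge base, the `(entity, relation)` logical-form representation, and all function names are hypothetical, and a real system would score candidates with a learned model and marginalize over them rather than simply filter.

```python
# Toy knowledge base: (entity, relation) -> set of answers. All entries are illustrative.
KB = {
    ("france", "capital"): {"paris"},
    ("france", "currency"): {"euro"},
    ("japan", "capital"): {"tokyo"},
}

def candidate_logical_forms(question):
    """Enumerate (entity, relation) pairs whose entity is mentioned in the question."""
    tokens = set(question.lower().replace("?", "").split())
    return [(entity, relation) for (entity, relation) in KB if entity in tokens]

def execute(logical_form):
    """Return the denotation (answer set) of a logical form against the toy KB."""
    return KB.get(logical_form, set())

def consistent_forms(question, answer):
    """Keep candidate logical forms whose denotation equals the observed answer.

    The logical form is never annotated; only the answer supervises the choice.
    """
    return [lf for lf in candidate_logical_forms(question) if execute(lf) == answer]

if __name__ == "__main__":
    print(consistent_forms("What is the capital of France?", {"paris"}))
```

Here the question-answer pair alone is enough to prefer the `capital` reading over the `currency` reading, which is the supervision signal a latent-variable parser would learn from.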
Liang's work on question answering culminated in the co-authorship of the SQuAD dataset. With his student Pranav Rajpurkar and collaborators Jian Zhang and Konstantin Lopyrev, he introduced the Stanford Question Answering Dataset at EMNLP 2016: a collection of more than 100,000 crowd-sourced reading-comprehension questions over Wikipedia passages, paired with span-based answers. The paper analysed the kinds of reasoning required to answer SQuAD questions using dependency and constituency parses, reported a strong logistic-regression baseline achieving an F1 of 51.0 (compared with a simple baseline at roughly 20.0 and human performance at 86.8), and made the dataset and an associated leaderboard publicly available. [14] SQuAD became one of the most influential benchmarks of the pre-foundation-model era and was used to train and evaluate a long succession of architectures, including bi-directional attention-flow networks, QA-specific variants of ELMo, and BERT and its successors. A follow-on version, SQuAD 2.0 (Rajpurkar, Jia, and Liang, ACL 2018), added unanswerable questions to make the task significantly harder.
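The F1 figures quoted above measure overlap between a predicted answer span and a reference span. The following is a simplified sketch of such a token-overlap F1, assuming whitespace tokenization and lowercasing only; it is not the official SQuAD evaluation script, which applies additional answer normalization and takes the maximum over multiple reference answers. The function name `token_f1` is illustrative.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted span and a reference span."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # Precision 2/3 and recall 1 give an F1 of about 0.8.
    print(token_f1("the Eiffel Tower", "Eiffel Tower"))
```

A partial-credit metric of this kind rewards predictions that overlap the reference span without matching it exactly, which is why it is usually reported alongside exact match.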
Liang has been a leading voice on the question of when machine-learning models actually generalize, and what their failure modes reveal about their internal representations. The 2017 paper "Adversarial Examples for Evaluating Reading Comprehension Systems" with his student Robin Jia showed that state-of-the-art SQuAD models could be fooled by appending semantically irrelevant distractor sentences to passages: the inserted sentence resembled the question syntactically but did not change the correct answer, yet caused systems to flip to the distractor's content. The paper exposed brittle pattern-matching behavior in ostensibly strong reading-comprehension systems and was awarded the Best Long Paper Award at EMNLP 2017, helping catalyze a broader literature on adversarial NLP. [22]
Subsequent work in Liang's group developed methods for certified robustness to adversarial word substitutions using interval-bound propagation through deep models, distributionally robust optimization for protecting performance on underrepresented subpopulations, and analyses of how training data and pretraining objectives shape downstream model behavior. [23] More broadly, Liang's group has contributed to the study of calibration in neural networks, the analysis of memorization versus generalization in large models, the construction of test suites designed to detect spurious correlations, and the use of conformal prediction and selective classification for uncertainty quantification. This body of work has been highly influential in shaping the discussion of "reliable" machine learning that accompanied the deployment of language models into safety-sensitive applications.
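Interval-bound propagation, mentioned above, can be sketched for a single affine layer followed by a ReLU. This is a generic, minimal illustration in plain Python, not the implementation from the cited work; the function names and the example weights are hypothetical. Given elementwise lower and upper bounds on the input, it computes bounds that every output is guaranteed to satisfy.

```python
def affine_bounds(lower, upper, weights, bias):
    """Propagate elementwise input bounds through y = W x + b."""
    out_lower, out_upper = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            # A non-negative weight preserves the ordering of the bounds;
            # a negative weight swaps which input bound gives the extreme.
            lo += w * l if w >= 0 else w * u
            hi += w * u if w >= 0 else w * l
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper

def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound separately."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]

if __name__ == "__main__":
    lo, hi = affine_bounds([-1.0, 0.0], [1.0, 2.0],
                           [[1.0, -1.0], [2.0, 0.0]], [0.0, -1.0])
    print(relu_bounds(lo, hi))
```

If the lower bound on the correct class's score stays above the upper bounds on all other classes, no perturbation inside the input box can change the prediction, which is the sense in which such bounds yield a certificate.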
Beginning around 2020, Liang's research has increasingly focused on the empirical and methodological study of large-scale pretrained models, the class of systems that he and his collaborators would soon dub "foundation models". His group at Stanford has contributed to work on instruction tuning, in-context learning, retrieval augmentation, calibration, faithfulness of generated text, and the scaling behavior of large language models, as well as on transparent benchmarking and reporting. Representative examples include work on Alpaca and self-instruct-style data generation (in collaboration with other Stanford NLP faculty) and studies of how pretraining data composition affects downstream behavior. Liang has also collaborated on foundational papers in retrieval-augmented language modeling and on the DSPy line of programmatic LM composition work led by his Stanford colleagues. [30][31]
Liang has been a vocal advocate for open and reproducible AI research, arguing that "open weight" releases such as Llama or Gemma are not enough if the training data, training code, and developmental decisions are kept proprietary. To put that view into practice he leads Marin, an open-development laboratory for building foundation models in which experiments, including failed ones, are pre-registered on GitHub and the full training pipeline (code, data, weights, and logs) is released publicly. Marin's announcement post characterized this approach as "a radically new way of doing model development, inspired by true open-source software, where every experiment is done in the open and anyone can suggest ideas, review, and even run experiments through GitHub". Stanford and collaborators announced Marin's initial 8-billion-parameter model in May 2025, accompanied by a detailed retrospective ("Marin 8B retrospective") that documented the training process in the spirit of Meta's earlier OPT logbook, and the project has subsequently released larger 32-billion-parameter models trained from scratch. [15][24]
Throughout his career Liang has argued that progress in AI requires careful, standardized measurement, and that ad hoc reporting practices have systematically distorted the public understanding of what models can and cannot do. This conviction motivated CodaLab Worksheets, an open-source platform for managing reproducible computational experiments, which Liang began in 2013 with support from Microsoft Research and Evelyne Viegas. CodaLab Worksheets allows researchers to capture the full provenance of an experiment, from raw data through preprocessing and model training to final results, and to share that provenance with collaborators. The platform has been adopted as a submission target for software resources at several Association for Computational Linguistics conferences, including ACL 2016. [16] More recently, the same conviction motivates the HELM benchmark and its successors at CRFM, the Foundation Model Transparency Index project, and the Ecosystem Graphs initiative tracking which foundation models depend on which datasets, assets, and providers.
In August 2021, Stanford's HAI launched the Center for Research on Foundation Models with Liang as its founding director. CRFM was framed as an interdisciplinary effort spanning more than ten Stanford departments and dedicated to making "fundamental advances in the study, development, and deployment of foundation models". [2][25] Its mission areas, as articulated on the center's website, include technical research on data, systems, architecture, training, adaptation, inference, interpretability, and evaluation of foundation models; specialized applications in domains such as law, music, robotics, and biomedicine; analysis of societal considerations such as transparency, supply chains, openness, copyright, privacy, and systemic risks; and engagement with policymakers on evidence-based AI policy across multiple jurisdictions. [2]
CRFM's launch in 2021 was accompanied by the release of "On the Opportunities and Risks of Foundation Models", a 200-plus-page collaborative report authored by Rishi Bommasani, Liang, and more than 100 additional researchers at Stanford and other institutions. The report introduced the term "foundation model" as a label for "any model that is trained on broad data... and can be adapted (e.g., fine-tuned) to a wide range of downstream tasks", and provided a comprehensive survey of their capabilities (language, vision, robotics, reasoning, human interaction), technical underpinnings (model architectures, training procedures, data, systems, security, evaluation, theory), applications (law, healthcare, education), and societal implications (inequity, misuse, economic and environmental impact, legal and ethical considerations). Liang initiated and conceptualized the overall framing of the report and, together with Bommasani, led the decentralized writing effort while providing guidance on individual sections. [6][7] The report has been widely cited in academic and policy contexts. Its central terminological proposal, that the new class of broadly applicable pretrained models be called "foundation models", was deliberately chosen to avoid the narrower implications of "pretrained model" or the more loaded "large language model"; the term has since entered standard use across the AI research community and in regulatory documents in the United States and Europe.
Under Liang's leadership, CRFM has produced a steady stream of research outputs, ranging from open-weight model releases and infrastructure projects (such as Levanter, the JAX-based training framework, and Mistral, an earlier CRFM training framework not to be confused with the company of the same name) to influential reports on the foundation-model ecosystem, including the Foundation Model Transparency Index (FMTI) and the Ecosystem Graphs project that catalogues dependencies among models, datasets, and providers. [2] CRFM has also organized workshops, working groups, and conferences such as the Stanford Foundation Models workshops, bringing together academic researchers, industrial labs, and policy actors.
In November 2022, CRFM released "Holistic Evaluation of Language Models" (HELM), a benchmark and accompanying paper authored by Liang, Bommasani, Tony Lee, and many co-authors. [4][5] The paper argued that prior evaluation practice was fragmented and incomplete: different models were typically reported on different subsets of tasks, and important properties such as calibration, robustness, fairness, and toxicity were rarely measured at all.
HELM's central contribution was a taxonomy of scenarios (use cases such as question answering, summarization, sentiment analysis, and so on) paired with seven metrics evaluated when applicable: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. The first release benchmarked 30 prominent language models, spanning open, limited-access, and fully closed systems, across 42 scenarios. According to the authors, prior to HELM models were on average evaluated on just 17.9 percent of the core HELM scenarios, with many pairs of widely cited systems having no scenario in common; HELM raised that coverage to 96.0 percent under uniform conditions. [4]
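The coverage figures above can be read as an average, over models, of the fraction of scenarios on which each model has been evaluated. The sketch below illustrates that reading with made-up data; the exact definition used by the HELM authors may differ, and the model names, scenario names, and the function `mean_coverage` are hypothetical.

```python
def mean_coverage(evaluated, scenarios):
    """Average, over models, of the fraction of scenarios each model was evaluated on.

    `evaluated` maps a model name to the scenarios it has results for.
    """
    scenario_set = set(scenarios)
    fractions = [
        len(set(done) & scenario_set) / len(scenario_set)
        for done in evaluated.values()
    ]
    return sum(fractions) / len(fractions)

if __name__ == "__main__":
    scenarios = ["question_answering", "summarization", "sentiment", "toxicity"]
    evaluated = {
        "model_a": ["question_answering"],
        "model_b": ["question_answering", "summarization", "sentiment"],
    }
    # model_a covers 1 of 4 scenarios and model_b covers 3 of 4.
    print(mean_coverage(evaluated, scenarios))
```

Under this reading, evaluating every model on every scenario drives the statistic toward 100 percent, which is the kind of uniform comparison the benchmark was designed to provide.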
HELM was designed as a "living benchmark" that would be continuously updated as new models, scenarios, and metrics emerged. Subsequent releases have introduced or extended specialized variants for particular evaluation concerns, including HELM Lite (a smaller, frequently updated leaderboard), HELM Capabilities, HELM Instruct, HELM Safety, HELM MMLU, HELM AIR-Bench, and multimodal extensions such as VHELM and HEIM (for image generation). The HELM source code is maintained as an open-source framework on GitHub. [5][26]
HELM has been widely adopted by researchers, model developers, and policymakers as a reference point for comparing model behavior. Liang has frequently cited the project in talks and articles as an example of how transparent, holistic measurement can shape both research and the public conversation around frontier AI. [25][27]
In June 2022, Liang co-founded Together AI, a company building a cloud platform for training, fine-tuning, and running open-source foundation models. The other co-founders were Vipul Ved Prakash, founder of the social-search company Topsy, which was acquired by Apple in 2013; the systems-and-ML researcher Ce Zhang, then at ETH Zurich; and Stanford computer scientist Christopher Ré. [8][9] Together AI's stated goal has been to make open-weight model training and inference broadly accessible at substantially lower cost than the dominant hyperscale cloud vendors.
Together AI announced a $20 million seed financing round led by Lux Capital in May 2023, followed by an additional $102.5 million Series A round later that year. [28][9] Researcher Tri Dao, known for the FlashAttention work and other systems-level ML contributions, subsequently joined as chief scientist; the company lists Prakash (CEO), Zhang (CTO), Ré, Dao, and Liang as founders. [29] Liang remains a founder of the company while continuing in his primary capacity as a Stanford faculty member.
The company has released or partnered on a number of widely used open-source models, including the RedPajama datasets and language models (a community reproduction of the LLaMA training data and model series), as well as serving and fine-tuning infrastructure used by a wide range of customers. [9]
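Liang has received numerous recognitions for his research, including the Presidential Early Career Award for Scientists and Engineers (2019), the IJCAI Computers and Thought Award (2016), an NSF CAREER Award (2016), a Sloan Research Fellowship (2015), and a Microsoft Research Faculty Fellowship (2014). [10][11][12]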
He has also been recognized through Best Paper awards at leading NLP and machine-learning venues. The 2017 EMNLP paper "Adversarial Examples for Evaluating Reading Comprehension Systems" (Jia and Liang) received the Best Long Paper Award. [22] Other works from his group have received outstanding-paper or best-paper recognition at conferences including ACL, EMNLP, and NeurIPS.
The following list highlights a selection of Liang's most widely cited and influential publications. Many are co-authored with students and collaborators.