Christopher Olah
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,619 words
Christopher Olah (commonly Chris Olah) is a Canadian machine learning researcher and a co-founder of Anthropic, where he leads research on the interpretability of large language models.[1][2] He is widely identified as one of the pioneers of mechanistic interpretability, the subfield that attempts to reverse-engineer the internal computations of trained neural networks.[3][4] Olah previously worked at Google Brain (roughly 2014 to 2018) and led the interpretability team at OpenAI (2018 to late 2021), where he initiated the Circuits research thread.[5][6] He is also a co-founder and former editor-in-chief of the open-science journal Distill, which launched in March 2017 and went on indefinite hiatus in July 2021.[7][8] Olah does not hold an undergraduate degree; he received a Thiel Fellowship in 2012 and entered research through a combination of self-study, blogging, and mentorship.[9][10]
| Field | Details |
| --- | --- |
| Nationality | Canadian[1] |
| Education | The Abelard School, Toronto (AP National Scholar, 2010); brief enrollment at University of Toronto; no university degree[11][9] |
| Known for | Mechanistic interpretability; Distill journal; DeepDream; Circuits thread; sparse-autoencoder feature extraction[3][5][12] |
| Notable employers | Google Brain (2014 to 2018); OpenAI (2018 to 2021); Anthropic (2021 to present)[5][2] |
| Notable awards | Thiel Fellowship (2012); TIME100 AI (2024)[9][4] |
| Personal site | colah.github.io[1] |
Olah grew up in Toronto, Canada, and attended The Abelard School, a small private high school. He graduated in July 2010 and was named an Advanced Placement National Scholar, having completed six AP-level subjects.[11] As a teenager he became involved in the local technology community, joining the hacklab.to hackerspace in 2009; the space gave him early exposure to programming, electronics, and self-directed projects.[11]
In a 2020 blog post titled "Do I Need to Go to University?", Olah recounts that he attended the University of Toronto for a single year. He audited courses there as a high school student and took some advanced coursework during his year of formal enrollment, but did not complete a degree.[9] He has since written publicly that "I've been somewhat successful as a researcher without an undergraduate degree or PhD," while cautioning that the path is risky for most people and that, for almost everyone who writes to him for advice, the right choice is to attend university.[9] In a podcast interview with the career-advice organization 80,000 Hours, Olah described leaving school in part to support an acquaintance who faced what he described as false terrorism charges related to hobby chemistry equipment; he spent roughly two years on that case, which he later judged "altruistic, but not especially effective."[10]
After this period, Olah's career was redirected by his selection as a Thiel Fellow in 2012. The Thiel Fellowship, run by venture investor Peter Thiel's foundation, awards a $100,000 grant to young people under 20 who agree to skip or leave university in order to pursue research or entrepreneurship.[9][10] His initial Thiel project focused on three-dimensional printing and open-source CAD software, but he pivoted to machine learning around 2013 after attending a seminar series on neural networks led by physicist and science writer Michael Nielsen.[10] Nielsen subsequently became an informal mentor and collaborator. Olah has also credited Yoshua Bengio, who invited him to visit Mila in Montreal in 2013, with helping arrange a path into academic research; Bengio extended a PhD offer that Olah ultimately declined.[10][9]
Olah joined Google Brain in Mountain View as a research intern around 2013 to 2014, with senior researchers Jeff Dean and Greg Corrado supporting his hiring despite his lack of a formal degree.[10] He converted to a full-time researcher and spent roughly four years on the team. His Google Brain work centered on visualization and interpretation of trained convolutional networks, with a particular focus on the InceptionV1 image classifier.[5][13]
In June 2015, Google researchers Alexander Mordvintsev, Olah, and Mike Tyka published "Inceptionism: Going Deeper into Neural Networks" on the Google Research Blog, describing a procedure that iteratively modifies an input image to amplify whatever features a chosen layer of a classifier responds to.[13] The resulting hallucinatory images, including dogs and eyes appearing in clouds and landscapes, were released as the open-source DeepDream code on 1 July 2015 and became one of the first viral demonstrations of feature visualization.[13][14] Olah was the second-named author on the original blog post.[13]
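The procedure described above amounts to gradient ascent on the input rather than on the weights. The following is a minimal sketch in plain Python, not the released DeepDream code. It assumes a toy stand-in for a network layer (a single ReLU layer with a hand-picked weight matrix `W`), an objective equal to half the squared norm of that layer's activations, and a mean-absolute-value gradient normalisation. None of these details come from the article, and all function names are hypothetical.

```python
def layer(W, x):
    """Toy stand-in for a chosen layer: ReLU(W x)."""
    return [max(0.0, sum(w * xj for w, xj in zip(row, x))) for row in W]

def objective(W, x):
    """Half the squared L2 norm of the layer's activations."""
    return 0.5 * sum(a * a for a in layer(W, x))

def grad_wrt_input(W, x):
    """Gradient of the objective with respect to the input x.
    For this toy layer it equals W^T ReLU(W x)."""
    acts = layer(W, x)
    return [sum(W[i][j] * acts[i] for i in range(len(W)))
            for j in range(len(x))]

def dream(W, x, steps=20, lr=0.1):
    """Repeatedly nudge the input in the direction that amplifies the layer."""
    x = list(x)
    for _ in range(steps):
        g = grad_wrt_input(W, x)
        scale = sum(abs(v) for v in g) / len(g)
        if scale == 0.0:  # every unit is inactive, so there is nothing to amplify
            break
        x = [xi + lr * gi / scale for xi, gi in zip(x, g)]
    return x

# Hand-picked toy values (assumptions for illustration only).
W = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
x0 = [0.1, 0.2, 0.0]
x1 = dream(W, x0)
before, after = objective(W, x0), objective(W, x1)
```

Here `after` is larger than `before`: the input has been changed so that the layer responds more strongly, which is the effect that, applied to images and a trained classifier, produces the amplified patterns.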
Building on this line of work, Olah and collaborators produced a series of long-form articles on neural-network visualization, including "Feature Visualization" (Distill, 7 November 2017) and "The Building Blocks of Interpretability" (Distill, 6 March 2018, with Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev).[15][16] The latter introduced techniques for combining feature visualization with attribution methods, and was accompanied by the open-source Lucid library.[15] In March 2019, the same group, together with Zan Armstrong and others at OpenAI and Google, published "Activation Atlas" in Distill, which used feature inversion across millions of activations to produce a navigable map of features a vision network has learned.[17]
Across these years Olah also wrote prolifically on his personal site, colah.github.io. The post "Neural Networks, Manifolds, and Topology" (April 2014) framed deep classifiers in terms of homeomorphisms of input space.[18] "Visualizing MNIST: An Exploration of Dimensionality Reduction" (October 2014) introduced a wide audience to t-SNE and related techniques.[19] "Understanding LSTM Networks" (27 August 2015) became one of the most cited explanatory blog posts in machine learning; according to Olah's Google Scholar profile, the post has accrued more than 3,600 citations, and it remains assigned reading in many graduate ML courses.[20][21]
On 20 March 2017, Olah and Shan Carter, then both at Google Brain, announced the launch of Distill, "a new open science journal and ecosystem supporting human understanding of machine learning."[7][22] Distill was developed in collaboration with Google, OpenAI, DeepMind, and YC Research, and operated as a peer-reviewed venue dedicated to articles that use interactive visualizations and explorable explanations rather than the standard static PDF format.[8][7] The editors-in-chief were Carter, Olah, and Arvind Satyanarayan of MIT.[8] An accompanying initiative, the Distill Prize for Clarity in Machine Learning, was endowed at $125,000 by Olah, Greg Brockman, Jeff Dean, DeepMind, and the Open Philanthropy Project; Olah personally contributed $25,000.[23]
Distill published several of Olah's most influential articles, including "Research Debt" (2017) with Carter, which argued that the accumulation of un-distilled complexity in technical fields is a major and underappreciated cost to research progress.[24] The journal published 39 peer-reviewed articles between 2016 and 2021, on topics ranging from attention mechanisms to graph neural networks.[8][25]
On 2 July 2021, Olah, Nick Cammarata, Sam Greydanus, and Janelle Tam announced the "Distill Hiatus," explaining that the volunteer editorial team had experienced sustained burnout from the tension between mentoring submitters, running peer review, and authoring their own articles, and that Distill would pause publishing for at least one year; the pause has since become indefinite.[26][8] Olah used the same post to commit to continuing the Circuits research line on a successor venue.[26]
Around 2018 Olah moved from Google Brain to OpenAI in San Francisco, where he founded and led a team initially called Clarity and later renamed Circuits, dedicated to mechanistic interpretability of vision and multimodal networks.[10][27] The flagship output was the Circuits research thread, published as a series of invited Distill articles starting with "Zoom In: An Introduction to Circuits" by Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter on 10 March 2020.[5][27] "Zoom In" laid out three speculative claims that have since defined the field: that neural-network features are meaningful units of analysis, that the connections between features form interpretable circuits, and that important features and circuits are universal across models.[5]
The Circuits thread continued through 2020 and 2021 with articles such as "An Overview of Early Vision in InceptionV1," "Curve Detectors," "Curve Circuits," "Naturally Occurring Equivariance in Neural Networks," and "Understanding RL Vision," each focused on reverse-engineering a specific sub-circuit of trained models.[5] In March 2021, the same group published "Multimodal Neurons in Artificial Neural Networks" in Distill, identifying neurons in OpenAI's CLIP model that respond to a single concept across photographs, drawings, and rendered text.[28]
According to Olah's own account in a 2021 podcast interview, he left OpenAI in December 2021 to help start a new AI-safety-focused lab; that lab was Anthropic.[10]
Olah is one of eight co-founders of Anthropic, an AI safety company that public records and his Anthropic profile list as founded in 2021 by Dario Amodei, Daniela Amodei, Jared Kaplan, Jack Clark, Tom Brown, Sam McCandlish, Ben Mann, and Olah, all formerly at OpenAI.[29][2] At Anthropic he holds the title of co-founder and leads the interpretability team; press coverage also identifies him as the company's head of interpretability research.[4][30]
Olah's work at Anthropic is published primarily on the Transformer Circuits Thread (transformer-circuits.pub), which Anthropic positioned as a continuation of the Circuits agenda from OpenAI. The first major release, "A Mathematical Framework for Transformer Circuits" (22 December 2021) by Nelson Elhage, Neel Nanda, Catherine Olsson, and others, with Olah as a senior author, introduced a tensor-product decomposition of attention and identified two-layer attention-only transformers as the simplest setting in which to study transformer circuits.[31] In a follow-up paper, "In-context Learning and Induction Heads" (8 March 2022), Olsson, Elhage, Nanda, and colleagues, with Olah as a senior author, defined induction heads as attention heads that complete patterns of the form [A][B] ... [A] -> [B] and presented six lines of evidence that such heads are a major mechanism behind in-context learning in transformer models of many sizes.[32]
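The [A][B] ... [A] -> [B] pattern can be stated as a plain lookup rule. The sketch below is only a behavioural illustration of that rule in Python; the function name `induction_predict` is hypothetical, and the code does not model the attention heads that the paper argues implement the behaviour.

```python
def induction_predict(tokens):
    """Predict the next token with the [A][B] ... [A] -> [B] rule.

    Finds the most recent earlier occurrence of the final token and
    returns the token that followed it, or None if there is no match.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

prediction = induction_predict(["A", "B", "C", "A"])
no_match = induction_predict(["x", "y"])
```

In the first call the final token "A" appeared earlier followed by "B", so the rule predicts "B". In the second call the final token has no earlier occurrence, so the rule makes no prediction.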
In September 2022, Olah and a large team of Anthropic researchers and external collaborators (including Martin Wattenberg of Harvard) published "Toy Models of Superposition," which formalized the long-standing observation that individual neurons in trained networks often respond to multiple unrelated features.[33][34] The paper introduces a small ReLU network trained on sparse synthetic data in which superposition (the storage of more features than dimensions) can be exhibited cleanly, identifies a phase transition between non-superposed and superposed regimes, and links the geometry of the resulting feature arrangements to uniform polytopes.[33][12] The paper has since become a standard reference and is the conceptual basis for much of the sparse autoencoder work that followed.[34]
In October 2023, the Anthropic interpretability team published "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning," with Trenton Bricken and Adly Templeton as lead authors and Olah as a senior author.[35][36] The paper showed that training a sparse autoencoder on the MLP activations of a one-layer transformer could decompose a 512-neuron layer into thousands of sparse, more interpretable features such as detectors for DNA sequences, HTTP requests, legal language, and Hebrew text.[35] This established sparse autoencoders, a form of sparse dictionary learning, as a viable method for finding more interpretable directions in language-model activation space, and is the basis for the Towards Monosemanticity thread on the wiki.[35]
The 2024 sequel, "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (published 21 May 2024 on transformer-circuits.pub), applied the same method to the middle-layer residual stream of a production model, Claude 3 Sonnet, and extracted on the order of 34 million features, including features for the Golden Gate Bridge, sycophantic praise, security vulnerabilities, and several abstract concepts that could be used to steer model behaviour.[37][4] The paper, alongside the public "Golden Gate Claude" demonstration that followed, brought interpretability research into mainstream press coverage. TIME magazine subsequently named Olah to its TIME100 AI list in September 2024, citing the Scaling Monosemanticity work as a key reason.[4]
Anthropic's interpretability program continued to expand through 2024 and 2025. In March 2025, Jack Lindsey, Wes Gurnee, and colleagues, with Olah and Joshua Batson among the senior authors, published "Circuit Tracing: Revealing Computational Graphs in Language Models" and the accompanying case study "On the Biology of a Large Language Model," which used cross-layer transcoders and backward Jacobian tracing to construct attribution graphs that follow multi-step reasoning, planning, and hallucination-suppression behaviour through Claude 3.5 Haiku.[38] Anthropic open-sourced an associated circuit-tracing library in May 2025 with an interactive frontend hosted on Neuronpedia.[39]
In March 2025, Anthropic announced a Series E funding round of $3.5 billion at a $61.5 billion post-money valuation, with the company specifically citing deepening its research in mechanistic interpretability and alignment as one of four planned uses of the funds; Olah's team is the locus of that work.[40] By February 2026 the company had been valued at roughly $380 billion by private investors.[41]
On 25 May 2026, Olah was confirmed as one of the lay speakers at the Vatican launch of Pope Leo XIV's first encyclical, "Magnifica Humanitas," on safeguarding the human person in the time of artificial intelligence; the event marks the first time the co-founder of an AI company has spoken at the launch of a papal encyclical.[41][42]
Olah's defining contribution is helping establish mechanistic interpretability as a distinct research subfield with shared concepts (features, circuits, universality), shared methodology (feature visualization, attribution, dictionary learning, circuit tracing), and shared empirical conjectures.[3][5] The Circuits thread (2020 and after) introduced the working hypothesis that trained vision networks can be decomposed into a graph of interpretable features and the connections between them.[5] The Transformer Circuits Thread extended the same framing to attention-only and full transformer language models.[31][32]
The "Zoom In" article in particular laid out three foundational claims that have organised the agenda of the field ever since. The first, the features hypothesis, is that features are the fundamental unit of neural-network computation and can be rigorously studied. The second, the circuits hypothesis, is that features are connected by weights that themselves form interpretable circuits implementing identifiable algorithms. The third, the universality hypothesis, is that analogous features and circuits arise across different architectures and training runs.[5] Subsequent work by Olah and others has produced empirical support for all three claims in vision networks (for instance, curve detectors in InceptionV1) and increasingly in language models (for instance, induction heads in two-layer transformers).[5][32]
In his framing of why interpretability matters, Olah has consistently linked the research to AI safety. In the TIME100 AI profile he stated that if mechanistic interpretability succeeds, "we might be able to go and say when these models are actually safe, or whether they just appear safe."[4] The Transformer Circuits Thread describes Anthropic's interpretability programme as "an effort to reverse engineer the algorithms learned by neural networks into human-understandable algorithms," with the longer-term goal of providing audit-grade evidence about model behaviour.[31]
A second strand of contributions concerns superposition, the phenomenon in which neural networks encode more features than they have neurons by representing those features as overlapping non-orthogonal directions in activation space.[33] "Toy Models of Superposition" (2022) made the phenomenon mathematically tractable in a small setting, and the 2023 to 2024 monosemanticity papers used sparse autoencoders to recover monosemantic feature directions from real models.[33][35][37] Together these papers shaped the dominant empirical methodology used by interpretability groups outside Anthropic, including at Google DeepMind, EleutherAI, and academic labs.[36]
The 2022 "Toy Models" paper was particularly influential because it gave a clean experimental setting in which the trade-off between feature interference and feature count could be controlled by the sparsity of the underlying data. The paper showed a phase transition between a regime where the network simply stores the most important features in orthogonal directions and a regime where it packs many features into overlapping directions, accepting some interference because the inputs are sparse enough that interfering features are rarely active at the same time.[12][33] It also drew a connection to the geometry of uniform polytopes: the optimal arrangements of features in a low-dimensional bottleneck turn out to be configurations such as pentagons, tetrahedra, and other regular shapes.[33]
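The pentagon case can be worked through numerically. The sketch below stores five features as unit vectors at the corners of a regular pentagon in two dimensions and reconstructs them with a ReLU readout of the form ReLU(WᵀWx + b). The weights and the bias are hand-picked for illustration rather than learned, and the function names are hypothetical; it is a sketch of the idea, not the paper's training setup.

```python
import math

def pentagon_directions(n=5):
    """n unit vectors evenly spaced on a circle: one 2-D direction per feature."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]

def reconstruct(x, W, bias):
    """Compress n features into 2 dimensions, then read them back with a ReLU."""
    h = [sum(W[i][d] * x[i] for i in range(len(W))) for d in range(2)]
    return [max(0.0, W[i][0] * h[0] + W[i][1] * h[1] + bias)
            for i in range(len(W))]

W = pentagon_directions()
# Assumption: a negative bias just large enough to cancel the overlap
# between neighbouring directions (cos 72 degrees, about 0.309).
bias = -math.cos(2 * math.pi / 5)

# Sparse input: only feature 0 is active.
out_sparse = reconstruct([1.0, 0.0, 0.0, 0.0, 0.0], W, bias)

# Less sparse input: features 0 and 2 are active at the same time.
out_dense = reconstruct([1.0, 0.0, 1.0, 0.0, 0.0], W, bias)
```

With a single active feature, only that feature's output is non-zero, so five features are recovered from two dimensions without cross-talk. When features 0 and 2 are active together, their contributions interfere: both true features are zeroed out and feature 1, which lies between them, fires instead. This is the cost that sparse inputs make rare.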
"Towards Monosemanticity" (2023) operationalised these ideas by training an over-complete sparse autoencoder on the MLP activations of a one-layer transformer, decomposing a 512-neuron layer into roughly four thousand sparse, more interpretable features such as detectors for DNA sequences, HTTP requests, legal language, and Hebrew text.[35] Anthropic researchers reported that human raters could assign single-concept descriptions to a clear majority of the features, in contrast to the polysemantic and largely uninterpretable individual neurons of the underlying transformer.[35][36] The paper also introduced practical training advice and the notion of "feature splitting," in which scaling the autoencoder dictionary causes a single broad feature to split into several more specific features, supporting an interpretation of features as a hierarchical concept space.[35]
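The forward pass and training objective of an over-complete sparse autoencoder can be sketched in a few lines of plain Python. The form used here (subtract a decoder bias, apply a ReLU encoder, decode linearly, and penalise the L1 norm of the features) is a common formulation assumed for illustration. The function names, the tiny dimensions, and the value of `l1_coeff` are hypothetical and are not taken from the article.

```python
def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (features, reconstruction) for one activation vector x.

    W_enc and W_dec each hold one row per dictionary feature.
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    feats = [max(0.0, sum(w * c for w, c in zip(row, centered)) + b)
             for row, b in zip(W_enc, b_enc)]
    recon = [b_dec[j] + sum(f * W_dec[k][j] for k, f in enumerate(feats))
             for j in range(len(x))]
    return feats, recon

def sae_loss(x, feats, recon, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 sparsity penalty on the features."""
    mse = sum((xi - ri) ** 2 for xi, ri in zip(x, recon))
    return mse + l1_coeff * sum(abs(f) for f in feats)

# Over-complete toy dictionary: 4 features for a 2-dimensional activation.
directions = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
x = [0.5, -2.0]
feats, recon = sae_forward(x, directions, [0.0] * 4, directions, [0.0, 0.0])
loss = sae_loss(x, feats, recon)
active = sum(1 for f in feats if f > 0)
```

In this toy case the input is reconstructed exactly while only two of the four dictionary features are active, so the loss consists entirely of the sparsity penalty. In a trained autoencoder the dictionary is learned, and the penalty is what pushes each input to be explained by a small number of features.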
"Scaling Monosemanticity" (2024) showed that the same method could be applied to a production model, Claude 3 Sonnet, extracting roughly 34 million features from the residual stream of a middle layer.[37] The paper reports features that activate for highly specific stimuli, ranging from the Golden Gate Bridge to bugs in code, and demonstrates that clamping a feature to an unusually high value can cause the model to manifest the associated concept in its outputs.[37] An accompanying live demonstration in May 2024, "Golden Gate Claude," in which Anthropic deployed a version of Claude 3 Sonnet with the Golden Gate Bridge feature pinned on, brought the work into mainstream news coverage and was cited in TIME's 2024 selection of Olah.[4]
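Clamping a feature can be pictured as shifting an activation vector along that feature's decoder direction until the feature reads a chosen value. The sketch below illustrates this in plain Python under that assumption. The article does not describe how Anthropic implemented the intervention, so the function `clamp_feature` and the toy values are hypothetical.

```python
def clamp_feature(activation, feats, decoder_dirs, index, value):
    """Return a copy of `activation` with feature `index` set to `value`.

    Adds (value - current feature activation) times the feature's
    decoder direction; every other feature's contribution is unchanged.
    """
    delta = value - feats[index]
    return [a + delta * d for a, d in zip(activation, decoder_dirs[index])]

# Toy dictionary with one decoder direction per feature (illustrative values).
decoder_dirs = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
activation = [0.5, -2.0]
feats = [0.5, 0.0, 0.0, 2.0]  # feature activations assumed for this vector

# Pin feature 2 to an unusually high value.
steered = clamp_feature(activation, feats, decoder_dirs, index=2, value=10.0)
```

The steered vector moves strongly along feature 2's direction while the original vector is left untouched. In a real model the modified activation would be passed on to the remaining layers, which is how a pinned feature can surface in the model's outputs.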
Olah's earlier work, particularly DeepDream (2015), "Feature Visualization" (2017), "The Building Blocks of Interpretability" (2018), and "Activation Atlas" (2019), established the visual language and tooling that the interpretability field still uses.[13][15][17] These articles are also notable for their unusual format: long, interactive, image-heavy explorable explanations published on Distill rather than in conference proceedings.[7]
Through Distill and "Research Debt" (2017), Olah and his collaborators argued that clearer, better-illustrated technical writing is itself a research contribution rather than a side activity, and built tooling and a prize to encourage that style of work.[24][23] Olah has cited research communication as central to his career and has identified the intersection of machine learning and drawing as the niche in which he aimed to be most effective.[10]
"Research Debt" introduced a vocabulary that has since been adopted by parts of the broader research community: research debt is the accumulation of poorly explained concepts, bad notation, and missing intermediate explanations in a technical field, which makes it slower and harder for newcomers to get up to speed.[24] Distillation, in Olah's framing, is the process of clearing that debt by producing high-quality, often interactive explanations that compress and clarify earlier work.[24] The Distill Prize for Clarity in Machine Learning was created to reward such work directly, with grants of up to $10,000 per recipient drawn from the journal's $125,000 endowment.[23]
Olah's own pedagogical posts on colah.github.io have, in aggregate, served as informal textbooks for cohorts of ML practitioners. "Understanding LSTM Networks" in particular is assigned in graduate machine-learning courses at universities including MIT and Stanford and has been republished or mirrored in many forms.[11][20] Olah has been explicit that he sees this writing as part of his research output rather than ancillary to it.[9][24]
Olah has described his long-term goal as building "a kind of microscope" for neural networks: tools and concepts powerful enough that researchers can routinely look inside a trained model and read off its algorithms.[5][36] He has framed this as an alternative to purely behavioural evaluation of AI systems, arguing that behaviour alone cannot distinguish models that are aligned from models that merely appear aligned under the available test set.[4][6] The framing has propagated through Anthropic's public communications; the company's 2025 Series E announcement names mechanistic interpretability and alignment as one of four investment priorities, and Anthropic CEO Dario Amodei has cited interpretability research as a central reason Anthropic exists.[40][29]
According to his Google Scholar profile, Olah's published work had accumulated more than 110,000 citations and an h-index of 53 as of early 2026, with the "Understanding LSTM Networks" blog post alone cited more than 3,600 times.[20]
The following list emphasises articles in which Olah was a principal author or led the research direction. Dates refer to the public release date.
Although Olah is not primarily an academic author, his personal site colah.github.io has been an unusually influential venue for ML pedagogy. Writing referenced in his own CV and cited in subsequent research includes the posts "Conv Nets: A Modular Perspective" (July 2014), "Visualizing Representations: Deep Learning and Human Beings" (January 2015), "Calculus on Computational Graphs: Backpropagation" (August 2015), and "Do I Need to Go to University?" (May 2020), as well as the Distill article "Attention and Augmented Recurrent Neural Networks" (with Shan Carter, September 2016).[1][9][25][18][19][20]