Goodfire AI
Last reviewed
May 17, 2026
Sources
38 citations
Review status
Source-backed
Revision
v1 · 3,996 words
Goodfire AI is a San Francisco-based artificial intelligence research lab and public benefit corporation focused on mechanistic interpretability, the science of reverse engineering the internal computations of neural networks. Founded in 2024 by Eric Ho, Daniel Balsam, and Tom McGrath, the company builds tools and platforms that let researchers and developers inspect, debug, and steer the internal representations of large foundation models. Its flagship product, Ember, is the first hosted API for mechanistic interpretability, exposing the internal features of frontier models such as Llama 3 so that users can directly manipulate model behavior rather than only prompting it from the outside. By early 2026 Goodfire had raised roughly $209 million across three rounds, reached a $1.25 billion valuation, and become the first external startup to receive a direct investment from Anthropic [1][2].
The company sits at the intersection of pure interpretability research, applied AI safety work, and product-oriented platform engineering. Its founding team combines the leadership of Google DeepMind's former interpretability group with operators from the New York startup world, and its broader staff includes researchers from early circuits-level work at OpenAI and DeepMind. Goodfire markets itself less as an interpretability tools vendor and more as a research lab building toward what it calls intentional design of AI, where models are constructed, audited, and modified through their internal mechanisms rather than only through prompting or fine-tuning [3].
Goodfire was incorporated as a Delaware public benefit corporation in 2024 and is headquartered in San Francisco's Telegraph Hill neighborhood. The company operates a single in-person office with employees working five days a week, a stance that distances it from the remote-first culture common in earlier waves of AI startups. As of early 2026 the company reported roughly 51 employees, with engineering, research, and a small commercial team focused on customer engagements with frontier model labs, enterprise customers, and scientific partners [4].
As a public benefit corporation, Goodfire has a legal obligation to balance shareholder returns with a stated public mission. Cofounder Daniel Balsam has publicly described interpretability as the most important problem in the world, and the company's public materials frame its commercial work as a means to fund and accelerate that broader research agenda [5].
| Attribute | Detail |
|---|---|
| Founded | 2024 |
| Headquarters | San Francisco, California |
| Legal form | Public benefit corporation |
| Founders | Eric Ho, Daniel Balsam, Tom McGrath |
| Flagship product | Ember interpretability platform |
| Total funding | ~$209 million (seed, Series A, Series B) |
| Valuation | $1.25 billion (Series B, early 2026) |
| Employees | ~51 (early 2026) |
| Key investors | Anthropic, Menlo Ventures, Lightspeed, B Capital, Eric Schmidt |
Goodfire's business model combines direct API access to Ember, custom research and integration contracts with foundation model developers, and partnerships with scientific organizations applying interpretability methods to domain-specific models in fields including genomics and materials science. The company has emphasized that it does not see itself as competing with frontier model developers but as offering an interpretability layer on top of their models [6].
Eric Ho and Daniel Balsam first worked together at RippleMatch, an AI-powered recruiting platform that Ho cofounded in 2016 after a frustrating personal experience with the job market as a Yale undergraduate. Ho served as CEO and Balsam as a founding engineer who built the platform's machine learning systems. Over seven years RippleMatch grew past $10 million in annual recurring revenue, raised venture capital from investors including Goldman Sachs and G2 Venture Partners, and earned Ho a place on the 2022 Forbes 30 Under 30 list [7].
In 2023 Ho and Balsam departed RippleMatch, united by what they later described as a conviction that mitigating AI risk had become the central technical and societal challenge of the decade. After surveying the alignment landscape they settled on mechanistic interpretability as the area where they could build the most useful tools, in part because the field had begun producing concrete results that could be turned into commercial infrastructure [8].
The third cofounder, Tom McGrath, joined from Google DeepMind, where he had founded the company's interpretability team in 2022. McGrath had been thinking about long-term AI risk since his PhD work in 2016, when he became convinced that AI would become a larger societal force than was widely appreciated. In March 2024 he left DeepMind to spend several months at South Park Commons, a community for technical founders, where he met Ho and Balsam and began the conversations that led to Goodfire's incorporation later that year [9].
Goodfire was formally launched in 2024 with McGrath as Chief Scientist, Ho as CEO, and Balsam as CTO. The company quickly positioned itself as a rare combination of seasoned operators and senior research talent, a profile that helped it attract early investor interest [10].
Goodfire's funding has followed an unusually compressed schedule. In August 2024 it announced a $7 million seed round led by Lightspeed and Menlo Ventures, the first investment from the Anthology Fund that Menlo co-leads with Anthropic. In April 2025 the company closed a $50 million Series A led by Menlo, with Anthropic participating directly, the first time the Claude developer had invested in another startup outside the Anthology Fund. In December 2025 Goodfire closed a $150 million Series B led by B Capital at a $1.25 billion valuation, formally announced in February 2026 [11][12].
| Round | Date | Amount | Lead investor | Valuation |
|---|---|---|---|---|
| Seed | August 2024 | $7 million | Lightspeed, Menlo Ventures | Undisclosed |
| Series A | April 2025 | $50 million | Menlo Ventures (with Anthropic participating) | Undisclosed |
| Series B | December 2025 (announced February 2026) | $150 million | B Capital | $1.25 billion |
The combined investor list includes Menlo Ventures, Lightspeed, Anthropic, B Capital, Work-Bench, Wing, South Park Commons, DFJ Growth, Salesforce Ventures, and former Google CEO Eric Schmidt. Goodfire's total funding of roughly $209 million in less than two years made it one of the best-capitalized interpretability-focused efforts outside the major frontier labs by early 2026 [13].
Goodfire's leadership combines applied AI experience with senior research credentials. The three cofounders together control much of the company's strategy, with Ho focused on commercial and platform questions, Balsam leading engineering and infrastructure, and McGrath setting the research agenda.
| Person | Role | Background |
|---|---|---|
| Eric Ho | Cofounder and CEO | Cofounded and led RippleMatch for seven years, scaling it past $10M ARR; Yale graduate, Forbes 30 Under 30 in 2022 |
| Daniel Balsam | Cofounder and CTO | Founding engineer at RippleMatch where he led the core platform and machine learning teams |
| Tom McGrath | Cofounder and Chief Scientist | Founded the interpretability team at Google DeepMind; PhD researcher who pivoted to AI safety in the mid-2010s |
| Nick Cammarata | Senior researcher | Core contributor to the original interpretability work at OpenAI on circuits and features |
| Leon Bergen | Researcher | Professor at UC San Diego on leave to work at Goodfire |
Beyond the named leadership, Goodfire has hired aggressively from frontier model labs and academia. The research bench is unusually deep for a company of its size, with staff drawn from Harvard, Stanford, DeepMind, OpenAI, and several smaller alignment-focused research organizations. Hiring criteria emphasize a combination of strong research output and the willingness to ship products, a profile that is rare in the interpretability community where many researchers come from a pure academic background.
The broader culture is shaped by the fact that several senior staff explicitly identify with the AI safety and effective altruism communities, which has helped Goodfire build credibility with researchers who might otherwise be skeptical of a venture-backed startup. The company also emphasizes engineering discipline and platform thinking, reflecting Ho and Balsam's backgrounds, and publishes most of its research as detailed technical posts on its website rather than only as academic papers [14].
Goodfire's stated mission is to build the next generation of safe and powerful AI by understanding the internal mechanisms of large models, not only by scaling them. Its public materials describe a future in which AI systems are constructed, debugged, and modified the way software has historically been: with introspection tools, programmable interfaces, and meaningful unit tests. The company calls this vision intentional design and contrasts it with the current paradigm of training models on ever larger datasets and hoping desired behaviors emerge [15].
This framing has practical consequences for how Goodfire positions itself in the broader AI ecosystem. The company is explicit that it does not aim to compete with frontier model developers such as Anthropic, OpenAI, or Google DeepMind, and it does not plan to release its own frontier base models. Instead, it sees its products as a complementary interpretability layer for the developers and users of those models. Cofounder Eric Ho has argued in interviews that without such a layer, the safety case for very capable models will remain weak, because behavioral testing alone cannot guarantee that a model will behave safely outside the test distribution [16].
Anthropic CEO Dario Amodei, when announcing his company's investment, described mechanistic interpretability as among the best bets for transforming black-box networks into understandable, steerable systems and framed the Goodfire investment as part of Anthropic's broader effort to grow the safety ecosystem outside its own walls [17].
Ember is Goodfire's flagship product. The platform is a hosted API and SDK that lets users inspect and manipulate the internal features of supported foundation models. Rather than treating a model as a black box that takes prompts and produces text, Ember exposes interpretable units of internal computation, called features, that correspond to recognizable concepts learned by the model. Users can search for features, observe how they activate on different inputs, and turn them up or down to change model behavior at inference time [18].
Ember was first announced in 2024 and entered a broader public beta in 2025. By 2026 it supported Llama 3.1 8B and Llama 3.3 70B in its hosted API, with additional models available to selected partners. The platform is delivered as a developer SDK, an HTTP API, and a web console, and is hosted at platform.goodfire.ai. Goodfire has open-sourced parts of the underlying tooling and released several interpreter models on Hugging Face, including sparse autoencoders (SAEs) trained on the LMSYS-Chat-1M dataset [19].
The core abstraction in Ember is the concept of a feature, a direction in a model's internal activation space that corresponds to a roughly interpretable concept. Features are extracted using sparse autoencoders, or SAEs, which are trained to reconstruct the residual stream of a transformer model using a sparse combination of learned components. The intuition is that the polysemantic neurons of a typical transformer can be decomposed into a much larger set of more interpretable directions, each capturing a distinct concept learned during training [20].
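The encode-and-reconstruct idea behind a sparse autoencoder can be illustrated with a toy forward pass. This is a minimal sketch, not Goodfire's implementation: the function names, the two-dimensional "residual stream", and every weight below are invented for illustration, and plain Python lists are used only to keep the example dependency-free.

```python
# Toy SAE forward pass. Every weight here is invented for illustration;
# real interpreter models are trained and have far more features.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def sae_encode(x, w_enc, b_enc, b_dec):
    """Map an activation vector to non-negative feature activations."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    return [max(0.0, p) for p in pre]  # ReLU zeroes weakly matching features

def sae_decode(f, w_dec, b_dec):
    """Rebuild the activation as a weighted sum of feature directions."""
    out = list(b_dec)
    for act, direction in zip(f, w_dec):
        out = [o + act * d for o, d in zip(out, direction)]
    return out

# Four feature directions in a 2-dimensional activation space (overcomplete).
directions = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]]
b_enc = [-0.5, -0.5, -0.5, -0.5]  # negative biases push most features to zero
b_dec = [0.0, 0.0]

x = [2.0, 0.0]
features = sae_encode(x, directions, b_enc, b_dec)
active = [i for i, a in enumerate(features) if a > 0]
x_hat = sae_decode(features, directions, b_dec)
print(active)  # indices of the features that fired on this input
```

Only two of the four features fire on this input, which is the sparsity the method relies on. The reconstruction `x_hat` is imperfect because the toy weights are untrained; in practice the encoder and decoder are trained to minimize reconstruction error while keeping few features active.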
Goodfire trains state-of-the-art interpreter models on supported base models and labels the resulting features using a combination of automated description pipelines and human review. The result is a searchable catalog of human-readable features spanning lexical concepts like specific words and phrases, higher-level semantic patterns such as conciseness or technical explanation, and procedural knowledge such as code reasoning and refusal behavior. Users can query this catalog by description, find features that fire most strongly on a given input, or contrast features that differ between two datasets [21].
Once features are identified, Ember lets users intervene on them at inference time. The platform exposes three main mechanisms. Auto Steer accepts a natural-language description of the desired change in behavior and automatically selects and adjusts feature weights. Feature Search uses semantic search to find features that match a description. Contrastive Search compares two datasets and surfaces features that differentiate them, useful for tasks like discovering features that distinguish jailbreak prompts from benign prompts [22].
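The source does not describe how Ember implements these mechanisms, so the following is only a conceptual sketch of two of them under simple assumptions: contrastive search is modeled as ranking features by the difference in mean activation between two datasets, and steering as adding a scaled feature direction to a hidden activation. The function names and all numbers are hypothetical and do not correspond to the Ember SDK.

```python
def mean_activations(dataset):
    """Average each feature's activation over a list of examples."""
    n = len(dataset)
    return [sum(column) / n for column in zip(*dataset)]

def contrastive_search(dataset_a, dataset_b, top_k=1):
    """Return indices of features that fire more on dataset_a than dataset_b."""
    mean_a = mean_activations(dataset_a)
    mean_b = mean_activations(dataset_b)
    diffs = [(a - b, idx) for idx, (a, b) in enumerate(zip(mean_a, mean_b))]
    diffs.sort(reverse=True)  # largest positive difference first
    return [idx for _, idx in diffs[:top_k]]

def steer(activation, direction, strength):
    """Shift a hidden activation along one feature direction."""
    return [a + strength * d for a, d in zip(activation, direction)]

# Hypothetical feature activations (3 features) on two small prompt sets.
jailbreak_acts = [[0.1, 2.0, 0.0], [0.0, 1.5, 0.2]]
benign_acts = [[0.2, 0.1, 0.1], [0.1, 0.0, 0.3]]

distinguishing = contrastive_search(jailbreak_acts, benign_acts, top_k=1)
steered = steer([1.0, 0.0], direction=[0.0, 1.0], strength=2.0)
print(distinguishing, steered)
```

A positive `strength` turns a feature up and a negative one turns it down; a production system would also need to decide which layer to intervene on and how to scale the shift, which this sketch leaves out.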
Goodfire and its early users have used these mechanisms for several practical applications. The platform has been used to harden models against jailbreaks by detecting jailbreak patterns and boosting the model's refusal feature, with reports of substantial robustness improvements at no measurable cost to latency or general performance. It has also been used to build feature-based classifiers, in some cases using only a handful of features to reach reasonable accuracy on financial filings or medical text classification [23].
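A feature-based classifier of the kind mentioned above can be sketched as a logistic regression over a handful of feature activations. This is an illustrative assumption about the general approach, not a description of what Goodfire or its users built: the two "features", the training rows, and the hyperparameters are all made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.5, steps=500):
    """Fit logistic regression with plain batch gradient descent."""
    n, n_features = len(rows), len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi / n
            grad_b += err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

def predict(x, weights, bias):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0

# Each row holds the activations of two hypothetical features on one document.
rows = [[2.0, 0.1], [1.5, 0.0], [1.8, 0.3], [0.1, 1.0], [0.0, 1.4], [0.2, 0.9]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(rows, labels)
predictions = [predict(x, weights, bias) for x in rows]
print(predictions)
```

Because the inputs are labeled features rather than raw text, the learned weights can be read directly: a large positive weight means that feature pushes a document toward the positive class.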
In parallel with its language-model work, Goodfire has applied similar techniques to image models. Paint With Ember is a research demo that interprets a specific block in Stable Diffusion XL-Turbo, an open-source 3.5 billion parameter latent diffusion model that can generate images in only a few inference steps. By performing non-negative matrix factorization on the activations of the chosen block, Goodfire decomposed image generation into higher-level concepts and built an interface where users paint with those concepts on a 2D canvas rather than only writing text prompts. The demo suggested that interpretability primitives developed for language models can carry over to image generation, and potentially to other modalities [24].
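Non-negative matrix factorization approximates a non-negative matrix V as a product W·H of two smaller non-negative matrices, so each row of H can be read as a "concept" and each row of W as how strongly each position uses it. The sketch below uses the classic multiplicative-update rule on a tiny invented matrix; it assumes the activations are non-negative and says nothing about how Goodfire preprocessed or factorized the SDXL-Turbo activations.

```python
import random

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Sum of squared differences between v and its approximation w @ h."""
    approx = matmul(w, h)
    return sum((x - y) ** 2 for vr, ar in zip(v, approx) for x, y in zip(vr, ar))

def nmf(v, rank, steps=200, seed=0, eps=1e-9):
    """Factor v (n x m) into w (n x rank) and h (rank x m), all non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    initial_error = frobenius_error(v, w, h)
    for _ in range(steps):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(matmul(w, h), ht)
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h, initial_error

# Invented activations: 4 spatial positions x 3 channels, built from 2 patterns.
v = [[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 3.0, 0.0], [0.0, 1.0, 0.0]]
w, h, initial_error = nmf(v, rank=2)
final_error = frobenius_error(v, w, h)
print(initial_error, final_error)
```

The multiplicative updates never change the sign of an entry, so W and H stay non-negative throughout, which is what makes the resulting components additive and easier to interpret as parts of an image.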
Goodfire is unusual among interpretability companies in publishing a substantial volume of detailed technical research alongside its commercial product. By early 2026 the company had released a steady stream of papers, technical posts, and open-source artifacts that document specific advances in interpretability methods and applications.
Much of Goodfire's foundational work has focused on improving the quality and scalability of sparse autoencoders. The company has trained SAEs on a range of base models including Llama 3.1 8B and Llama 3.3 70B, and has experimented with variants such as BatchTopK SAEs that improve efficiency and feature quality. Several of its researchers were directly involved in the original lines of work that introduced sparse autoencoders into interpretability [25].
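The difference between a per-example TopK activation and BatchTopK can be shown in a few lines. The sketch assumes the common formulation in which BatchTopK keeps the k × batch_size largest pre-activations across the whole batch rather than the top k within each example; restricting the kept values to positive ones is an additional assumption made here, and the numbers are invented.

```python
def batch_top_k(pre_activations, k):
    """Keep the k * batch_size largest positive entries in the batch; zero the rest."""
    batch_size = len(pre_activations)
    flat = [
        (value, row, col)
        for row, sample in enumerate(pre_activations)
        for col, value in enumerate(sample)
    ]
    flat.sort(key=lambda item: item[0], reverse=True)
    keep = {(row, col) for value, row, col in flat[: k * batch_size] if value > 0}
    return [
        [value if (row, col) in keep else 0.0 for col, value in enumerate(sample)]
        for row, sample in enumerate(pre_activations)
    ]

# Three examples with four candidate features each (made-up pre-activations).
batch = [
    [5.0, 4.0, 3.0, 0.1],
    [0.2, 0.1, 0.0, 2.0],
    [1.0, 0.3, 0.2, 0.1],
]
sparse = batch_top_k(batch, k=2)
active_per_example = [sum(1 for v in row if v > 0) for row in sparse]
print(active_per_example)
```

With k=2 the batch keeps six activations in total, but they are distributed unevenly: an example with many strong features keeps more of them, and an example with few keeps fewer. The average number of active features per example still equals k.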
The company has also developed techniques to decompose model internals into more granular structures and to relate features across layers and across models. These methods address one of the central challenges of mechanistic interpretability, which is that features extracted at a single layer often correspond to entangled concepts and may not capture the procedural structure of how a model carries out a task [26].
A particularly visible line of research at Goodfire focuses on using interpretability tools not only to understand trained models but to influence the training process itself. In 2025 the company published a paper introducing a technique called Reinforcement Learning from Feature Rewards, or RLFR, in which lightweight probes trained on a model's internal representations are used as reward signals during reinforcement learning. The first application was hallucination reduction in a language model, where Goodfire reported cutting hallucinations by roughly half at a cost per intervention dramatically lower than using an LLM as a judge, and without measurable degradation on standard benchmarks [27].
A key technical trick in RLFR was to run the hallucination detection probe on a frozen copy of the model during training, so that the model being trained could not learn to evade the probe by modifying the representations the probe depended on. Goodfire has described this work as an early example of the broader vision of intentional design, where interpretability tools enable closed-loop control over the behavior of trained models [28].
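The role of the frozen copy can be made concrete with a toy reward function. This is a sketch under stated assumptions rather than the published RLFR method: the probe is assumed to be a simple linear (logistic) probe, the reward is assumed to be one minus the probe's hallucination probability, and the weights, activations, and completion names are all hypothetical.

```python
import math

def probe_probability(activation, weights, bias):
    """Linear probe: estimated probability that a completion is hallucinated."""
    z = sum(w * a for w, a in zip(weights, activation)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def feature_reward(frozen_activation, weights, bias):
    """Reward is high when the probe on the frozen copy sees little hallucination."""
    return 1.0 - probe_probability(frozen_activation, weights, bias)

# Hypothetical probe parameters, fixed before reinforcement learning starts.
PROBE_WEIGHTS = [2.0, -1.0]
PROBE_BIAS = -0.5

# Hypothetical activations produced by the FROZEN copy when it reads two
# completions sampled from the model being trained.
frozen_acts = {
    "grounded answer": [0.0, 1.0],
    "fabricated citation": [2.0, 0.0],
}
rewards = {
    text: feature_reward(act, PROBE_WEIGHTS, PROBE_BIAS)
    for text, act in frozen_acts.items()
}
print(rewards)
```

The reward depends only on the sampled text and on the frozen copy's weights, neither of which the optimizer can alter through the trained model's internal representations. The only way for the trained model to raise its reward is to produce text that the frozen copy represents as less hallucinated.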
Goodfire has also applied its tools to reasoning models, which are large language models trained to produce extended chains of thought before final answers. The 2025 paper Under the Hood of a Reasoning Model interpreted DeepSeek R1 by analyzing how features evolved as the model worked through multi-step problems, providing one of the first detailed mechanistic accounts of a publicly available reasoning system [29].
Goodfire has expanded its research scope beyond language and vision models by partnering with scientific organizations that train large foundation models in other domains. The most prominent collaboration has been with Arc Institute, a nonprofit research organization developing the Evo series of genomic foundation models. Goodfire trained BatchTopK SAEs on layer 26 of Evo 2 and identified features corresponding to biological concepts including exon-intron boundaries, transcription factor binding sites, elements of protein secondary structure, and prophage genomic regions. This work was included in the Evo 2 publication in Nature in 2026 and represented one of the first detailed interpretability analyses of a large biological foundation model [30].
In a related line of work Goodfire applied its tools to an epigenetic model and reported the identification of a novel class of biomarkers associated with Alzheimer's disease, describing it as the first major natural-science finding obtained by reverse engineering a foundation model. In August 2025 the company also announced a partnership with Radical AI, a materials science research organization, extending interpretability work into another scientific domain [31].
Goodfire's relationship with Anthropic has been particularly important to its development. Anthropic was the first frontier AI lab to make a direct investment in the company, contributing to both its seed and Series A rounds. The seed round was the first deployment of capital from Anthropic's Anthology Fund, a $100 million vehicle co-managed with Menlo Ventures and dedicated to backing startups working on safety, interpretability, and the broader Anthropic ecosystem. The Series A in 2025 was both the largest interpretability-focused round to that point and the first time Anthropic had invested directly in another startup outside the Anthology Fund structure [32].
Beyond capital, the relationship has included technical collaboration. Anthropic has provided Goodfire with access to internal models and discussed shared research priorities. The two organizations are sometimes described as complementary in that Anthropic continues to invest heavily in interpretability for its own Claude models, while Goodfire focuses on making interpretability methods available across multiple base models. Dario Amodei has publicly framed mechanistic interpretability as one of the most important areas of safety research and described investments in companies like Goodfire as part of Anthropic's broader strategy of strengthening the safety ecosystem rather than treating safety research as a strictly internal advantage [33].
Goodfire has been received as one of the most prominent commercial efforts in mechanistic interpretability. Trade press coverage in 2025 and 2026 frequently described the company as the leading interpretability startup, citing its combination of senior research staff, rapid funding trajectory, and product scope. MIT Technology Review's 2026 list of breakthrough technologies cited mechanistic interpretability as one of its ten selections and highlighted Goodfire's Ember platform as an example of how the field had begun to produce shippable tools rather than only academic research [34].
The company's positioning is not without skepticism. Critics within the broader AI safety community have argued that even well-developed interpretability tools may not scale to models with significantly higher capability levels, and that sparse autoencoders in particular have limitations that have not yet been fully characterized. Some researchers have also raised concerns about the dual use of interpretability tools, noting that techniques allowing precise control over model internals could also be used to bypass safety training or inject harmful behaviors. Goodfire has addressed these concerns in its public safety statements, describing staged release processes and a commitment to publishing safety-relevant research that includes failure modes as well as successes [35].
Goodfire's growth has tracked a wider shift in attention toward interpretability as both an alignment tool and a feature engineering substrate. Several large model developers have publicly increased their interpretability staffing, and a small ecosystem of academic groups and smaller companies has formed around Goodfire's open-source releases. By early 2026 the company had become both a notable commercial entity and a node in the broader academic and policy conversation about how to make AI systems more transparent and controllable [36].
Goodfire's roadmap, as described in its Series B announcement, focuses on three areas. The first is broadening the set of models supported by Ember, including additional open-weight large language models, multimodal models, and selected domain-specific foundation models. The second is deepening the platform's research capabilities, including better feature discovery and tighter integration with training pipelines for interpretability-informed training. The third is expanding scientific partnerships in biology, chemistry, and materials science, where Goodfire's leadership has argued that mechanistic interpretability can act as a tool for accelerating discovery [37].
The broader question that surrounds Goodfire is whether interpretability tools can keep up with the pace at which frontier models are scaling. The company has argued that the answer depends on continued investment in methods that scale with model size and on the willingness of frontier developers to integrate interpretability into their training and deployment pipelines. Goodfire's own bet is that the resulting layer of understanding will be a necessary part of any future in which very capable AI systems can be safely deployed [38].