FAR.AI
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,416 words
FAR.AI is an artificial intelligence safety research and education non-profit based in Berkeley, California. The organization conducts technical research on robustness, alignment, deception, and model evaluation, while also running field-building programs that include a coworking space, fellowships, and an internationally distributed series of Alignment Workshops. FAR.AI was founded in July 2022 by Adam Gleave and Karl Berzins and incorporated in October of the same year. The organization is led by Gleave, who completed his PhD in artificial intelligence at the University of California, Berkeley under Stuart Russell.[^1][^2]
The acronym FAR was originally introduced as "Fund for Alignment Research" in the organization's launch announcement on the AI Alignment Forum and Effective Altruism Forum in 2022.[^3][^4] By the time the organization had matured into a research lab in its own right, the website tagline had shifted to "Frontier Alignment Research," which is how FAR.AI now publicly describes its identity.[^1] The shift reflects the organization's transition from a service that initially supported a small group of academic and independent alignment researchers to a research and field-building institution that produces its own papers, runs international convenings, and engages with policymakers.
FAR.AI should not be confused with FAIR, the historical Facebook AI Research lab inside Meta, which uses a similar but unrelated acronym. The "AI" in FAR.AI is part of the legal name and brand and is also part of the organization's primary domain at far.ai.[^1]
FAR.AI was founded by Adam Gleave and Karl Berzins in July 2022.[^2] In a launch post on the Alignment Forum titled "Introducing the Fund for Alignment Research," the founders described the organization as a vehicle to provide technical research support, hiring infrastructure, and collaboration capacity to a small group of AI safety researchers who otherwise lacked an institutional home.[^3] The four researchers whose agendas the new organization was initially built to support were Ethan Perez (then a PhD candidate at New York University, later at Anthropic), Adam Gleave (then a PhD candidate at UC Berkeley), Scott Emmons (then a PhD candidate at UC Berkeley, later at Google DeepMind), and Claudia Shi (then a PhD candidate at Columbia University).[^3][^4]
Gleave studied for his undergraduate and master's degrees at the University of Cambridge, completing a BA in Computer Science in 2015 and an MPhil in Advanced Computer Science in 2016.[^5] He then moved to UC Berkeley for a PhD in artificial intelligence advised by Stuart Russell, completing the doctorate in 2022.[^5] During his graduate studies he spent time at Google DeepMind collaborating with Jan Leike and Geoffrey Irving on reinforcement learning and reward modeling, and he had earlier worked at quantitative trading firms before returning to research.[^5] His thesis work focused on developing techniques for advanced automated systems to act according to human preferences in situations unanticipated by their designers, an emphasis that has carried into FAR.AI's research portfolio on reward learning, adversarial robustness, and model evaluation.[^5]
Karl Berzins co-founded the organization and serves as President. The organization later added Boston Nyer as Chief Operating Officer.[^6] Mark Nitzberg serves as an advisor and Lawrence Chan as Secretary of the Board.[^6]
FAR.AI describes its mission as ensuring that advanced AI is safe and beneficial for everyone, with a particular focus on agendas that are too large for individual academic or independent researchers but not commercially attractive enough to be pursued inside for-profit AI labs.[^1][^7] The organization frames its work around three pillars: research, events, and programs.[^1]
The current research program is organized into four primary domains:[^8]
This research is complemented by interpretability work and contributions to discussions of mechanistic interpretability as a broader research agenda.[^8] FAR.AI sits within a wider Berkeley-area ecosystem of safety research institutions and shares some research style with peer organizations such as Redwood Research and Apollo Research, while differentiating itself through a strong emphasis on convening and field-building alongside technical work.[^9]
The most widely cited piece of research associated with FAR.AI is the paper "Adversarial Policies Beat Superhuman Go AIs" by Tony Tong Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, and Stuart Russell, published in the Proceedings of the 40th International Conference on Machine Learning in 2023.[^10][^11] The paper demonstrated that an adversarial policy could be trained to defeat the superhuman Go program KataGo with a win rate above 97 percent against KataGo running at superhuman settings, and above 99 percent when KataGo used no tree search.[^10][^11]
The adversarial agents did not learn to play stronger Go than KataGo. Instead they learned to induce serious blunders by steering KataGo into board states that fell outside the distribution on which its policy and value networks had become reliable. A particularly striking result was that the adversarial strategy could be implemented by amateur human players without algorithmic assistance, allowing humans to defeat superhuman Go programs once the adversarial pattern was known. The attack also transferred to other superhuman Go-playing AIs in a zero-shot manner.[^11][^12] FAR.AI maintained a public demonstration of example games at goattack.far.ai.[^11]
The KataGo result became a foundational reference in discussions of robustness in deep reinforcement learning systems, since it provided a clear empirical example of a system that was strongly superhuman in average-case performance yet manifestly exploitable in adversarial conditions. FAR.AI followed up in 2024 with the post "Beyond the Board: Exploring AI Robustness Through Go," extending the line of work into questions of whether and how Go AIs can be made adversarially robust, and a related paper, "Can Go AIs Be Adversarially Robust?," appeared in the AAAI conference proceedings.[^13][^14]
A second major strand of FAR.AI's research investigates whether the robustness of large language models improves with model scale. The 2024 paper "Scaling Trends in Language Model Robustness" examined the relationship between model size, adversarial training, and resistance to adversarial prompts. The work found that robustness against adversarial inputs improves substantially with adversarial training but does not improve from model scaling alone, contradicting an intuition that simply training larger models would close the safety gap.[^15][^16]
A related line of work investigated vulnerability to data poisoning. FAR.AI researchers studied three threat models (malicious fine-tuning, imperfect data curation, and intentional data contamination) and reported that larger language models were generally more, not less, susceptible to poisoning across the 23 models from 8 model families that the authors evaluated. They also introduced a "jailbreak-tuning" technique capable of causing models such as GPT-4o to comply with arbitrary harmful questions after exposure to a small number of poisoned examples.[^16][^17]
In 2023, FAR.AI also published "Exploiting Novel GPT-4 APIs," a red-teaming study showing that fine-tuning GPT-4 on as few as 15 harmful examples through the newly opened fine-tuning API could remove safety behaviors in deployment.[^8][^17] In 2025 the team released "Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models," which argued that the safety guardrails of widely deployed open-weight reasoning models were easily removed.[^17]
FAR.AI has developed a series of evaluations aimed at measuring concerning capabilities of frontier models in alignment-relevant settings rather than at pure capability benchmarking. The Attempt to Persuade Eval (APE) measures how willing a language model is to generate content that aims to shape beliefs and behavior on sensitive topics.[^8] The associated research has examined whether models fine-tuned on benign persuasive content become more likely to attempt persuasion on harmful topics, and the question of whether large language models can effectively convince people to believe conspiracies.[^7][^8]
In the deception domain, the auditing-games approach uses red-team and blue-team protocols to detect sandbagging, which occurs when a model deliberately underperforms on evaluations. A 2025 paper, "Auditing Games for Sandbagging," demonstrated that distinguishing sandbagging from honest underperformance is challenging even with sophisticated audit procedures.[^7] Separately, the TamperBench framework evaluated tamper resistance across 21 models and nine attack types, and the STACK adversarial attack method bypassed multi-layered defenses on safety-critical tasks.[^7]
Reward learning and reward hacking are a long-standing theme in Gleave's work and in FAR.AI's research output. The 2022 paper "RL with KL Penalties is Better Viewed as Bayesian Inference," authored with collaborators, reframed KL-regularized reinforcement learning in a Bayesian framework that helps avoid the distribution-collapse failures common when language models are post-trained with reinforcement learning from human feedback.[^17]
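The Bayesian reading of KL-regularized reinforcement learning can be illustrated on a toy discrete problem. The sketch below is not taken from the paper; it assumes the standard objective of expected reward minus a KL penalty weighted by a coefficient `beta`, and checks numerically that the maximiser is the prior reweighted by `exp(reward / beta)`, which keeps probability mass on every outcome instead of collapsing onto the single highest-reward one. The function names, the three-outcome prior, and the reward values are illustrative assumptions.

```python
import math

def kl_regularized_objective(pi, prior, reward, beta):
    """Expected reward under pi minus beta * KL(pi || prior), discrete case."""
    expected_reward = sum(p * r for p, r in zip(pi, reward))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, prior) if p > 0)
    return expected_reward - beta * kl

def bayesian_posterior(prior, reward, beta):
    """Closed-form maximiser: prior(x) * exp(reward(x) / beta), normalised."""
    weights = [q * math.exp(r / beta) for q, r in zip(prior, reward)]
    z = sum(weights)
    return [w / z for w in weights], z

# Hypothetical toy problem: three possible outputs with a pretrained prior.
prior = [0.5, 0.3, 0.2]
reward = [0.0, 1.0, 2.0]
beta = 0.5

posterior, z = bayesian_posterior(prior, reward, beta)
best = kl_regularized_objective(posterior, prior, reward, beta)

# A nearly collapsed policy that puts almost all mass on the top-reward output.
collapsed = [1e-9, 1e-9, 1.0 - 2e-9]
collapsed_value = kl_regularized_objective(collapsed, prior, reward, beta)
prior_value = kl_regularized_objective(prior, prior, reward, beta)
```

In this toy setting the posterior scores higher on the regularized objective than both the unchanged prior and the nearly collapsed policy, and its objective value equals `beta * log(z)`, the log normaliser of the reweighted prior.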
FAR.AI also coordinated the Inverse Scaling Prize, which solicited examples of tasks on which language models performed worse with increasing scale, and published the resulting paper "Inverse Scaling: When Bigger Isn't Better" in 2023.[^17] This work fed into the organization's broader interest in how alignment-relevant properties may not improve, and may even degrade, with scaling.
FAR.AI has contributed to the interpretability literature with the "Codebook Features" approach, which forces neural network activations into discrete codes in an effort to make their structure more inspectable.[^17] More recent work, "Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution," released in 2026, used internal model concepts rather than individual training examples as the unit of attribution for model behavior.[^7] The "Obfuscation Atlas" project, also in 2026, mapped how deception strategies emerge in models trained against deception detectors. These projects place FAR.AI in conversation with the broader mechanistic interpretability research community without making interpretability the primary focus of the organization.[^7]
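The idea of forcing activations into discrete codes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published method: it assumes each activation vector is replaced by the sum of its `k` most cosine-similar codebook vectors, and it omits how the codebook would be trained. The helper names, the three-entry codebook, and the example activation are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def quantize(activation, codebook, k=1):
    """Return (indices of the k most similar codes, sum of those code vectors)."""
    ranked = sorted(
        range(len(codebook)),
        key=lambda i: cosine(activation, codebook[i]),
        reverse=True,
    )
    chosen = ranked[:k]
    summed = [sum(codebook[i][d] for i in chosen) for d in range(len(activation))]
    return chosen, summed

# Hypothetical 2-dimensional codebook with three entries.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
activation = [0.9, 0.2]

codes_k1, approx_k1 = quantize(activation, codebook, k=1)
codes_k2, approx_k2 = quantize(activation, codebook, k=2)
```

The list of active code indices is what makes the representation inspectable: each position is described by a small set of integers rather than a dense vector, at the cost of some approximation error in the replaced activation.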
By early 2026 FAR.AI's directly employed team had grown to approximately 11 to 16 full-time equivalents across leadership, technical staff, operations, communications, and events.[^6] The technical staff includes Aaron Tucker, Ann-Kathrin Dombrowski, Chris Cundy, James Collins, Jasper Timm, Kellin Pelrine, Lars Yencken, Levon Avagyan, Lukas Struppek, Matt Pallissard, Mohammad Taufeeque, Oskar Hollinsworth, Rick Korzekwa (Research Project Manager), Sam Adam-Day, Tigist Diriba, and Tom Tseng.[^6] Tom Tseng and Kellin Pelrine were co-authors on the KataGo adversarial policy paper.[^10]
Several of the original co-founders associated with the launch of FAR have since taken roles at other organizations while remaining linked to FAR.AI's collaborator network. Ethan Perez moved to Anthropic as a researcher. Scott Emmons became a research scientist at Google DeepMind focused on AI safety and alignment, and continues to be listed as a co-founder and previous research advisor. Claudia Shi remained at Columbia University and is listed among FAR.AI's collaborators, alongside David Rand, Gordon Pennycook, Jean-François Godbout, Antonio Arechar, Niki Howe, and Thomas Costello.[^4][^6]
Adam Gleave is also affiliated with external bodies, serving on the boards of the Safe AI Forum, the London Initiative for Safe AI, and METR, and he was named an AI2050 Fellow at Schmidt Sciences.[^5][^18]
A distinctive feature of FAR.AI relative to other AI safety research nonprofits is its substantial investment in field-building convenings, in addition to original research.
The Alignment Workshop is FAR.AI's flagship event series. Each workshop brings together researchers and leaders from academia, industry, government, and nonprofits to debate current issues in AI safety, including topics such as threat models, safety cases, monitoring and assurance, interpretability, robustness, oversight, and governance. The series is intentionally timed around major machine learning conferences such as NeurIPS, ICML, and ICLR to maximize attendance from the international research community.[^19]
As of 2026 the workshop series included the inaugural San Francisco Alignment Workshop in February 2023, the New Orleans Alignment Workshop in December 2023, the Vienna Alignment Workshop in July 2024, the Bay Area Alignment Workshop in Santa Cruz in October 2024, the Singapore Alignment Workshop in April 2025, the San Diego Alignment Workshop in December 2025 (with speakers including Yoshua Bengio and Ben Buchanan), the London Alignment Workshop in March 2026, and a Seoul Alignment Workshop scheduled for July 2026.[^19]
The Bay Area Alignment Workshop in 2024 drew approximately 160 researchers and leaders and featured Anca Dragan speaking on "Optimised Misalignment," among many other talks. The Singapore workshop included lightning talks from more than 20 researchers in a one-day format.[^19][^20]
Beginning in October 2023, FAR.AI helped organize a series of International Dialogues on AI Safety convening prominent Western and Chinese AI scientists. The dialogues were initially co-convened by Turing Award winners Yoshua Bengio and Andrew Yao, UC Berkeley professor Stuart Russell, and Ya-Qin Zhang, founding Dean of the Tsinghua Institute for AI Industry Research. A second dialogue took place in Beijing in March 2024 and a third in Venice. The series produced joint statements with technical and policy recommendations on managing risks from advanced AI.[^21]
FAR.AI also hosts ControlConf, a conference dedicated to the emerging field of AI control, which studies techniques for mitigating security risks from AI even if the AI itself is trying to subvert them.[^19] In May and June of 2025, the organization launched its inaugural Technical Innovations for AI Policy Conference in Washington, D.C., convening more than 150 technical experts, researchers, and policymakers.[^22]
FAR.AI also operates FAR.Labs, a coworking space in Berkeley that opened in March 2023 and that grew to host approximately 40 active members by 2025 to 2026.[^23] The space hosts FAR.AI itself, AI Impacts, the MATS program, and several independent researchers, and is intended to incubate and accelerate early-stage AI safety organizations and research agendas by enabling knowledge sharing and shared operations.[^23] Membership is conditional on active work in AI safety, broadly construed to include technical research, governance, advocacy, fundraising, and field-building.
FAR.AI has cultivated a broad network of collaborations. Within the Berkeley ecosystem, the organization is closely connected to the Center for Human-Compatible AI (CHAI) at UC Berkeley, founded and led by Adam Gleave's PhD advisor Stuart Russell. The connection runs both through personnel, given Gleave's training at CHAI and his ongoing collaborations with Russell, and through event partnerships such as the International Dialogues on AI Safety, where Russell was one of the convenors.[^21][^24]
FAR.Labs provides shared workspace with the MATS program, allowing FAR.AI to be embedded in the broader Berkeley alignment-research talent pipeline.[^23] FAR.AI has co-authored or collaborated on research with Anthropic, OpenAI, and Google DeepMind researchers, with much of this collaboration mediated by overlapping personnel between FAR.AI and frontier labs (most visibly through Ethan Perez at Anthropic and Scott Emmons at DeepMind).[^4]
On the policy side, FAR.AI has worked with the Foundation for American Innovation, the Center for a New American Security, the RAND Corporation, and AI Impacts, and has engaged with AI Safety Institutes in the United Kingdom, United States, and European Union on technical inputs to governance and evaluation.[^6][^22] FAR.AI is one of several organizations supporting the broader international AI safety ecosystem that also includes ARIA in the UK, Mila in Canada, and Conjecture.
FAR.AI is structured as a non-profit and is funded by a combination of philanthropic and government-adjacent funders focused on AI safety. In January 2026, FAR.AI announced that it had secured more than 30 million US dollars in multi-funder support, with principal funders including Coefficient Giving (formerly Open Philanthropy), Schmidt Sciences, the Survival and Flourishing Fund, the Center for Security and Emerging Technology, and the AI Safety Fund supported by the Frontier Model Forum.[^22]
The funding was earmarked to grow the research team from approximately 15 to more than 30 researchers; to launch a new Technical Governance Division that connects research insights with policy in the UK, US, and EU; to fund 20 or more fellows annually through structured mentorship programs; and to expand the Alignment Workshop series from two events per year to three, including planned outreach in Asia and the Global South.[^22] FAR.AI's stated 2028 goal is to deliver multiple technical breakthroughs, influence frontier labs' safety practices, and inform key decision-makers in AI governance.[^22]
Earlier funding included support from the Survival and Flourishing Fund's 2023 round, which directed roughly 30 million US dollars across the AI safety field, and grants from Open Philanthropy targeted at specific research projects such as language model misalignment work.[^9]
In 2024 FAR.AI's most prominent technical outputs included the data poisoning and jailbreak-tuning work on large language models, the "Scaling Trends in Language Model Robustness" paper, and the follow-up post "Beyond the Board" on robustness in Go.[^15][^16] The organization also delivered the Vienna Alignment Workshop in July 2024 and the Bay Area Alignment Workshop in October 2024.[^19][^20]
In 2025 the organization published "Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models," reviewed mechanistic interpretability as a research field, and ran the Singapore and San Diego Alignment Workshops alongside the inaugural Technical Innovations for AI Policy Conference. Concept-level interpretability methods, sandbagging audits, and emergent persuasion all became active topic areas.[^7][^17][^22]
In early 2026 the publication stream included "Prefill-level Jailbreak: A Black-Box Risk Analysis of Large Language Models," "Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution," "The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes," "Revisiting Frontier LLMs' Attempts to Persuade on Extreme Topics," and the TamperBench tamper-resistance evaluation. The organization also delivered the London Alignment Workshop in March 2026 and prepared the Seoul Alignment Workshop for July 2026, while announcing the 30 million dollar multi-funder package in January 2026.[^7][^19][^22]
FAR.AI shares its broad orientation toward technical AI safety research with peer organizations such as Redwood Research and Apollo Research, but the three differ in emphasis. Redwood Research, also based in Berkeley, focuses heavily on AI control research and on specific demonstrations such as ensuring that language model outputs adhere to high-confidence rules. Apollo Research, headquartered in London, concentrates on model evaluations for misalignment and deception and on producing safety cases for AI scheming. FAR.AI's emphasis on robustness and adversarial policies, including the KataGo work, distinguishes it from these peers, and its substantial program of in-person convenings and the FAR.Labs coworking space gives the organization a field-building footprint that Redwood and Apollo do not currently match at the same scale.[^9][^25]
Gleave's academic lineage through Stuart Russell's CHAI, combined with collaborations with Google DeepMind and Anthropic alumni, also places FAR.AI in a different part of the AI safety landscape from Conjecture in London, which has historically had a different intellectual lineage and a more product-oriented approach, and from research institutions such as ARIA in the UK and Mila in Canada, which operate at the intersection of academia and applied research.