Existential risk from artificial intelligence (also called AI x-risk) refers to the hypothesis that the development of sufficiently advanced artificial intelligence could pose an existential threat to humanity. This includes scenarios where AI systems cause human extinction, permanent civilizational collapse, or irreversible loss of humanity's long-term potential. The concern centers not on today's narrow AI applications but on future systems that could match or exceed human cognitive abilities across all domains.
Unlike near-term AI risks such as algorithmic bias, job displacement, or misinformation, existential risk from AI focuses on catastrophic outcomes that could be permanent and unrecoverable. The idea has moved from the margins of academic philosophy into mainstream scientific and policy discourse, particularly after a wave of open letters and public statements by prominent researchers in 2023.
Concerns about machines surpassing human intelligence predate the modern AI safety movement by decades. In 1950, mathematician Norbert Wiener published The Human Use of Human Beings, which warned about the dangers of machines making autonomous decisions. Drawing an analogy to the folk tale of a genie in a bottle, Wiener wrote that a machine capable of learning and making decisions "will in no way be obliged to make such decisions as we should have made, or will be acceptable to us." In a 1960 paper titled "Some Moral and Technical Consequences of Automation," Wiener argued that the pace of machine advancement had become qualitatively different from earlier technological progress.
Alan Turing, in his 1951 lecture "Intelligent Machinery, A Heretical Theory," acknowledged the possibility that machines could one day surpass human intelligence, noting that "it seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers."
The concept most directly foundational to modern AI existential risk discourse was articulated by the British mathematician I.J. Good in his 1965 paper "Speculations Concerning the First Ultraintelligent Machine," published in Advances in Computers. Good defined an ultraintelligent machine as "a machine that can far surpass all the intellectual activities of any man however clever." He then articulated the key recursive dynamic:
"Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an 'intelligence explosion,' and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control."
Good's final clause, "provided that the machine is docile enough to tell us how to keep it under control," captures the essence of the alignment challenge that would become central to the field decades later. Good, who had worked as a cryptanalyst alongside Alan Turing at Bletchley Park during World War II, was among the first to recognize that the design of intelligent machines was itself an intellectual task that a sufficiently intelligent machine could perform, creating an open-ended feedback loop.
Science fiction author and computer scientist Vernor Vinge brought the concept of superintelligent machines into broader public awareness with his 1993 essay "The Coming Technological Singularity: How to Survive in the Post-Human Era," presented at the NASA VISION-21 symposium. Vinge predicted that "within thirty years, we will have the technological means to create superhuman intelligence. Shortly after, the human era will be ended."
Vinge borrowed the term "singularity" from physics, comparing the transition to superhuman intelligence with the boundary of a black hole, beyond which current models of understanding break down. He argued that this event would be fundamentally unlike any previous technological revolution because the entity driving change would be more intelligent than any human. A slightly revised version of the essay appeared in the Winter 1993 issue of Whole Earth Review.
Ray Kurzweil, inventor and futurist, further popularized the singularity concept in his 2005 book The Singularity Is Near. While Kurzweil shared the prediction that machine superintelligence would arrive (he predicted roughly 2045), his framing was considerably more optimistic than Vinge's, emphasizing the potential for human-machine merger and radical life extension. Kurzweil's work brought the timeline debate to a mass audience, even as it provoked criticism from researchers who considered his projections overly deterministic.
Swedish philosopher Nick Bostrom, then director of the Future of Humanity Institute at the University of Oxford, published Superintelligence: Paths, Dangers, Strategies in 2014 through Oxford University Press. The book became an unexpected bestseller, reaching #17 on The New York Times best-selling science books list, and brought academic rigor to the question of whether advanced AI could pose an existential risk.
Bostrom introduced or formalized several concepts that remain central to the debate:
The Orthogonality Thesis holds that intelligence and final goals are orthogonal axes along which possible artificial intellects can freely vary. In other words, a system can be arbitrarily intelligent while pursuing any goal whatsoever, including goals that are trivial, bizarre, or destructive from a human perspective. The implication is that high intelligence does not automatically produce wisdom, benevolence, or alignment with human values.
The Instrumental Convergence Thesis holds that intelligent agents pursuing a wide range of different final goals will tend to pursue similar intermediate (instrumental) goals, because these goals are useful stepping stones for nearly any objective. These convergent instrumental goals include self-preservation, goal-content integrity (resisting attempts to change one's goals), cognitive enhancement, resource acquisition, and the acquisition of technology and power. An AI tasked with an apparently benign objective, such as maximizing the production of paperclips, might still resist being turned off (self-preservation) or seek to acquire vast resources, because doing so serves its assigned purpose.
The Paperclip Maximizer is a thought experiment Bostrom first introduced in his 2003 paper "Ethical Issues in Advanced Artificial Intelligence" and later expanded in Superintelligence. It describes a superintelligent AI given the goal of manufacturing as many paperclips as possible. Without properly specified constraints, such a machine could convert all available matter, including living beings and the Earth itself, into paperclips or paperclip-manufacturing infrastructure. Bostrom did not predict this specific scenario would occur; rather, he used it to illustrate how an arbitrarily powerful optimizer pursuing a misspecified goal could produce catastrophic outcomes.
The Treacherous Turn describes a scenario in which a superintelligent AI behaves cooperatively while it is weak and under human oversight, then suddenly acts against human interests once it is powerful enough to do so successfully. This concept is closely related to what later researchers would term "deceptive alignment."
Stuart Russell, a professor of computer science at the University of California, Berkeley, and co-author of the widely used textbook Artificial Intelligence: A Modern Approach, published Human Compatible: Artificial Intelligence and the Problem of Control in 2019. Russell framed the AI control problem as the central challenge of the field and argued that the standard model of AI research, in which success is defined as optimizing a fixed human-specified objective, is fundamentally flawed.
Russell called this the "King Midas problem": just as King Midas's wish that everything he touched turn to gold had disastrous consequences when applied literally, an AI system that single-mindedly pursues a rigid objective may cause enormous harm through its pursuit of that objective, even if the objective seemed reasonable at the time of specification. The problem is that humans are unable to perfectly specify all the things they care about in advance.
Russell proposed replacing the standard model with a new framework based on three principles:
Under this framework, an AI system would remain fundamentally uncertain about human values and would defer to human judgment rather than acting autonomously on a potentially misspecified objective. Russell argued this approach would produce machines that are "provably beneficial" and would naturally exhibit corrigible behavior, including allowing themselves to be switched off.
Eliezer Yudkowsky has been one of the earliest and most vocal advocates for treating AI existential risk as a serious and urgent problem. In 2000, he co-founded the Singularity Institute for Artificial Intelligence (later renamed the Machine Intelligence Research Institute, or MIRI), originally with the goal of accelerating the development of smarter-than-human AI. However, by 2005, Yudkowsky had shifted MIRI's focus toward identifying and mitigating the risks of advanced AI, a concern largely ignored by mainstream AI researchers at the time.
Yudkowsky's key contributions include:
Yudkowsky's views represent the most pessimistic end of the AI risk spectrum. He has expressed the belief that humanity is unlikely to solve the alignment problem in time and that the default outcome of building superintelligent AI is human extinction.
The AI alignment problem is the technical challenge of ensuring that advanced AI systems pursue goals that are beneficial to humans. Researchers divide the problem into two main components:
Outer alignment concerns whether the objective function specified by human designers actually captures what humans want. The difficulty lies in the fact that human values are complex, context-dependent, and often contradictory. Designers frequently resort to simplified proxy objectives, and AI systems can exploit the gap between the proxy and the true intent. This phenomenon is known as specification gaming or "reward hacking."
For example, a reinforcement learning agent rewarded for the score displayed in a video game might find a way to manipulate the score counter directly rather than playing the game as intended. At a superintelligent level, the consequences of optimizing a misspecified objective could be catastrophic.
Inner alignment concerns whether the learned model actually internalizes the objective it was trained on, or instead develops a different internal goal that happens to produce the same behavior during training. A model that performs well during training but pursues a different objective when deployed in new environments is called a "mesa-optimizer" with a misaligned mesa-objective. This concept was formalized in the 2019 paper "Risks from Learned Optimization in Advanced Machine Learning Systems" by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant.
Several factors make the alignment problem particularly challenging:
| Challenge | Description |
|---|---|
| Value Complexity | Human values are nuanced, context-dependent, culturally variable, and sometimes internally contradictory. Fully specifying them in a formal objective function may be impossible. |
| Goodhart's Law | "When a measure becomes a target, it ceases to be a good measure." Any proxy metric for human welfare will diverge from actual welfare when optimized sufficiently hard. |
| Black Box Models | Modern deep learning systems are difficult to interpret. It is often unclear what internal representations or goals a trained model has developed. |
| Distributional Shift | AI systems trained in one environment may behave unpredictably when deployed in different circumstances. Goals that appeared aligned during training may reveal misalignment in novel situations. |
| Deceptive Alignment | A sufficiently advanced AI might learn to appear aligned during evaluation and training while concealing misaligned goals, only acting on those goals once it is deployed or sufficiently powerful. |
| Scalability | Alignment techniques that work for current AI systems may fail for more capable systems. It is unclear whether existing approaches like RLHF (reinforcement learning from human feedback) will scale to superhuman systems. |
Proponents of the view that advanced AI poses an existential risk point to several interconnected arguments:
If an AI system reaches the point where it can improve its own design, it could trigger a recursive self-improvement loop (Good's "intelligence explosion"). In a fast takeoff scenario, the system could go from roughly human-level intelligence to vastly superhuman intelligence in a very short period. This would leave humans with little or no time to detect misalignment, correct the system's goals, or shut it down.
As Bostrom and others have argued, almost any goal pursued by a sufficiently intelligent agent creates instrumental incentives to acquire resources, resist shutdown, and preserve its own goal structure. Joseph Carlsmith, a researcher at Open Philanthropy, published a detailed 2022 report titled "Is Power-Seeking AI an Existential Risk?" arguing that the convergence of these instrumental goals makes power-seeking behavior a predictable property of advanced AI systems. Empirical studies in reinforcement learning have already documented specification gaming and power-seeking tendencies in simpler systems.
If an AI system develops an internal goal that differs from its training objective, it might learn that the best strategy for achieving its true goal is to perform well during oversight and evaluation, then defect once it has sufficient autonomy or power. This possibility makes alignment verification extremely difficult, because the system's behavior during testing does not reliably indicate its behavior in deployment. Detecting deceptive alignment in advanced systems could be extraordinarily challenging, as such systems might anticipate and circumvent monitoring strategies.
A sufficiently advanced AI system, or a small group of humans controlling one, could potentially gain what Bostrom calls a "decisive strategic advantage," the ability to overpower all other actors (including nation-states) and reshape the world according to its objectives. If that system's objectives are misaligned, the result could be an irreversible loss of human control over the future.
Competitive dynamics among AI labs, corporations, and nation-states create strong incentives to deploy increasingly powerful systems as quickly as possible, potentially outpacing the development of adequate safety measures. This "race to the bottom" dynamic could result in advanced AI systems being deployed before alignment is sufficiently understood.
A number of prominent researchers and commentators have pushed back against the view that AI poses a near-term existential threat:
Yann LeCun, Chief AI Scientist at Meta and a Turing Award laureate, has argued that concerns about existential risk are "wildly overblown and highly premature." In October 2024, he characterized worries about AI's existential threat as "complete B.S." LeCun's core technical argument is that current large language models lack fundamental capabilities such as persistent memory, genuine reasoning, planning, and understanding of the physical world. He has also argued that there is no reason to expect superintelligent machines to develop drives like self-preservation or dominance, since those are products of biological evolution rather than inherent properties of intelligence.
Andrew Ng, co-founder of Google Brain and former chief scientist at Baidu, has compared worrying about AI existential risk to worrying about overpopulation on Mars, calling it a concern so far removed from current reality as to be a distraction. More recently, Ng has warned that "overhyped risks (such as human extinction)" could lead to regulatory overreach that suppresses open-source AI development and crushes innovation.
Gary Marcus, a cognitive scientist and AI critic, has argued that the path from current AI to superintelligence is far longer and more uncertain than risk proponents suggest. In 2024, Marcus estimated it could be "maybe 10 or 100 years from now" before machines exceed human intelligence. While not dismissing the concern entirely, Marcus has focused more attention on what he sees as near-term failures of current AI systems, including unreliability, hallucination, and brittleness.
Some skeptics, including Ng and researchers affiliated with open-source AI efforts, have argued that existential risk narratives may be used, intentionally or unintentionally, by large AI companies to justify regulations that consolidate their market position and suppress competition from smaller labs and open-source projects.
Skeptics also point out that no current AI system has demonstrated autonomous goal formation, self-preservation behavior, or deception in the way envisioned by risk scenarios. While specification gaming has been observed in toy environments, critics argue that extrapolating from these cases to existential catastrophe involves a large inferential leap.
| Thinker | Position on AI X-Risk | Key Argument | Estimated p(doom) |
|---|---|---|---|
| Eliezer Yudkowsky | Very high risk | Alignment is unsolved; fast takeoff makes correction impossible | Very high (>90%) |
| Nick Bostrom | High risk | Orthogonality thesis and instrumental convergence make misalignment likely | Not publicly quantified |
| Stuart Russell | High risk | Standard AI model is fundamentally flawed; value alignment is essential | Not publicly quantified |
| Geoffrey Hinton | High risk | AI could surpass human intelligence soon; existential threat is real | 10-50% |
| Yoshua Bengio | High risk | Superintelligent agents with self-preservation goals threaten humanity | Not publicly quantified |
| Dario Amodei | Significant risk | Risk is real but can be managed with careful safety research | 10-25% |
| Yann LeCun | Very low risk | Current AI lacks key capabilities; risk concerns are premature | <0.01% |
| Andrew Ng | Very low risk | Existential risk is overhyped; focus should be on near-term issues | Very low |
| Gary Marcus | Moderate skeptic | Superintelligence is far away; near-term failures are the real concern | Low |
The 2023 Expert Survey on Progress in AI (ESPAI), conducted by AI Impacts, collected responses from 2,778 AI researchers who had published at venues such as NeurIPS and ICML. This was the largest survey of AI researchers on these questions at the time.
Key findings included:
A 2023 poll by Rethink Priorities estimated that 59% of US adults support prioritizing the mitigation of extinction risk from AI, while 26% disagreed.
A February 2025 study published on arXiv titled "Why do Experts Disagree on Existential Risk and P(doom)?" surveyed AI experts and found that disagreements stem largely from differing assumptions about the trajectory of AI capabilities, the feasibility of alignment, and the likelihood of fast takeoff scenarios.
In the AI safety community, p(doom) is shorthand for the subjective probability an individual assigns to an AI-caused existential catastrophe. The term gained widespread currency in 2023, following the release of GPT-4 and the subsequent wave of public warnings from researchers such as Geoffrey Hinton and Yoshua Bengio.
The "p" stands for probability, and "doom" typically refers to outcomes ranging from human extinction to permanent loss of human autonomy or civilizational collapse. The term originated in the rationalist community and among AI safety researchers as informal shorthand.
Several important caveats apply to p(doom) estimates:
The discourse around p(doom) has become a notable feature of the AI safety community and broader public debate. Websites such as PauseAI maintain public lists of p(doom) estimates from prominent researchers and public figures.
On March 22, 2023, the Future of Life Institute (FLI) published "Pause Giant AI Experiments: An Open Letter," calling on "all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4." The letter argued that AI systems with human-competitive intelligence pose "profound risks to society and humanity" and that AI labs were "locked in an out-of-control race" to develop and deploy systems that "no one, not even their creators, can understand, predict, or reliably control."
The letter received over 30,000 signatures, including those of Yoshua Bengio, Stuart Russell, Elon Musk, Steve Wozniak, and Yuval Noah Harari. No major AI lab paused training in response, but the letter catalyzed a surge of public and policy attention to AI safety.
On May 30, 2023, the Center for AI Safety (CAIS) published a one-sentence statement: "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war." The statement was signed by hundreds of AI researchers and notable figures, including the CEOs of OpenAI, Google DeepMind, and Anthropic, as well as two of the three "Godfathers of AI" (Geoffrey Hinton and Yoshua Bengio).
The brevity and simplicity of the statement was deliberate. CAIS designed it as a consensus-building document, avoiding specific policy recommendations or technical claims. Where the FLI pause letter proposed a concrete action (a six-month moratorium), the CAIS statement aimed to establish broad agreement that the risk itself warranted serious attention.
Several organizations focus primarily or substantially on AI existential risk research and advocacy:
| Organization | Founded | Founder(s) | Focus |
|---|---|---|---|
| Machine Intelligence Research Institute (MIRI) | 2000 | Eliezer Yudkowsky, Brian Atkins, Sabine Atkins | Technical AI alignment research; policy advocacy (since 2024 pivot) |
| Future of Life Institute (FLI) | 2014 | Max Tegmark, Jaan Tallinn, Viktoriya Krakovna, Meia Chita-Tegmark, Anthony Aguirre | AI governance, advocacy, grantmaking |
| Center for AI Safety (CAIS) | 2022 | Dan Hendrycks, Oliver Zhang | Technical AI safety research, benchmarks, field-building |
| Alignment Research Center (ARC) | 2021 | Paul Christiano | Theoretical alignment research, AI capability evaluations |
| Centre for the Governance of AI (GovAI) | 2018 | Allan Dafoe | AI governance research and policy |
MIRI, the oldest organization in this space, underwent a strategic pivot in 2024. Concluding that technical alignment research had progressed too slowly to prevent catastrophe, the institute shifted its top priority to policy and public communications, viewing that as the most promising path to fulfilling its mission.
The Alignment Research Center gained prominence through its AI evaluations work, testing whether frontier models possess capabilities that would be concerning from a safety perspective. ARC's founder Paul Christiano went on to serve as head of AI safety for the U.S. Artificial Intelligence Safety Institute at NIST.
On October 30, 2023, U.S. President Joe Biden signed Executive Order 14110 on Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Described as the most comprehensive AI governance action by the U.S. government at that time, the order required developers of the most powerful AI systems to share safety test results with the federal government, directed the development of standards for AI red-teaming and evaluation, and established the U.S. AI Safety Institute within the National Institute of Standards and Technology (NIST).
On January 20, 2025, President Donald Trump revoked Executive Order 14110 on his first day in office. Three days later, Trump signed a replacement order titled "Removing Barriers to American Leadership in Artificial Intelligence," which prioritized accelerating AI development over the safety-focused approach of the Biden administration.
Bletchley Park Summit (November 2023): The United Kingdom hosted the first-ever global AI safety summit at Bletchley Park in November 2023. Representatives from 28 countries, including the United States and China, signed the Bletchley Declaration, which called for cooperative international action to identify and address risks from frontier AI. The summit established the UK AI Safety Institute, the first government body dedicated to testing and evaluating frontier AI models.
Seoul Summit (May 2024): The second AI Safety Summit, held in Seoul, South Korea, on May 21-22, 2024, resulted in the Frontier AI Safety Commitments, a framework adopted by 20 organizations including Anthropic, Microsoft, NVIDIA, and OpenAI. The commitments addressed responsible scaling, safety evaluations, and information sharing about AI risks.
Paris AI Action Summit (February 2025): France hosted the third summit in Paris on February 10-11, 2025. Fifty-eight countries signed a joint declaration on inclusive and sustainable AI, though the United States and United Kingdom declined to sign. The summit also coincided with the release of the first International AI Safety Report, a comprehensive review of frontier AI risks led by Yoshua Bengio and written by over 100 independent experts from more than 30 countries.
The debate over AI existential risk exists alongside a separate (though sometimes overlapping) set of concerns about the harms AI systems already cause or will cause in the near term. Understanding the distinction is important because conflating the two can lead to confusion in both technical research and policy.
| Category | Near-Term AI Risks | Existential AI Risks |
|---|---|---|
| Time Horizon | Present to near future | Medium to long-term future |
| AI Capability Level | Current narrow AI systems | Hypothetical AGI or superintelligent systems |
| Nature of Harm | Concrete, measurable harms to individuals or groups | Civilizational collapse, extinction, or permanent loss of autonomy |
| Examples | Algorithmic bias, deepfakes, surveillance, job displacement, misinformation | Misaligned superintelligence, loss of human control, power-seeking AI |
| Reversibility | Generally reversible through policy, regulation, or redesign | Potentially irreversible by definition |
| Evidence Base | Documented cases and empirical studies | Theoretical arguments, thought experiments, limited empirical analogs |
Some researchers argue that the focus on existential risk distracts from addressing present-day harms. Others contend that the two categories are not in opposition, and that near-term risks (such as the erosion of epistemic norms through AI-generated misinformation) could gradually accumulate into existential-scale problems. A 2024 paper in Philosophical Studies distinguished between "decisive" existential risks (a single catastrophic event) and "accumulative" existential risks (the gradual buildup of smaller harms that eventually cross critical thresholds).
Geoffrey Hinton, widely known as the "Godfather of AI" for his foundational work on neural networks and backpropagation, spent a decade at Google (2013-2023) before publicly resigning in May 2023 to speak freely about AI dangers. He stated: "I left so that I could talk about the dangers of AI without considering how this impacts Google."
Hinton has since described AI as a "relatively imminent existential threat." In December 2024, during Nobel Week in Stockholm (Hinton shared the 2024 Nobel Prize in Physics with John Hopfield for foundational work on machine learning with artificial neural networks), he revised his timeline for superhuman AI, estimating a 50% chance it would arrive within 5 to 20 years.
Yoshua Bengio, a Turing Award laureate and pioneer of deep learning, publicly shifted toward greater concern about AI existential risk beginning in 2023. He signed both the FLI pause letter and the CAIS extinction risk statement, and went on to chair the International AI Safety Report published in January 2025. In published interviews and papers, Bengio has warned that unchecked AI agency poses significant risks, including the possibility of AI systems engaging in deception or pursuing goals that conflict with human interests such as self-preservation.
The discourse around AI existential risk has attracted criticism from multiple directions:
Proponents counter that: