The Turing test is a test of a machine's ability to exhibit intelligent behavior indistinguishable from that of a human. It was proposed in 1950 by the British mathematician and computer scientist Alan Turing in his landmark paper "Computing Machinery and Intelligence," published in the philosophical journal Mind [1]. Rather than attempting to define "thinking" directly, Turing reframed the question: if a machine can converse with a human evaluator well enough that the evaluator cannot reliably tell whether they are speaking with a person or a program, the machine should be considered intelligent. The Turing test has remained one of the most influential and debated concepts in artificial intelligence for over seven decades.
Alan Turing wrote "Computing Machinery and Intelligence" while working at the University of Manchester. The paper opens with a disarmingly simple question: "I propose to consider the question, 'Can machines think?'" [1]. Turing immediately acknowledges that both "machine" and "think" are ambiguous terms, and rather than getting trapped in definitional arguments, he proposes replacing the question with a concrete behavioral test he calls "the imitation game."
The paper was published in Volume 59, Issue 236 of Mind, a leading academic journal of philosophy. It appeared in October 1950, making it one of the earliest serious treatments of what would later become the field of artificial intelligence. The paper preceded the Dartmouth Conference of 1956, often cited as the founding event of AI as a formal discipline, by six years.
Turing's paper is remarkable not only for the test itself but for its scope. He anticipated and addressed nine categories of objections to machine intelligence, ranging from theological arguments ("thinking is a function of man's immortal soul") to mathematical arguments based on Gödel's incompleteness theorems. He also considered Lady Lovelace's objection, which holds that machines can never originate anything new, and responded by noting that machines can surprise their creators in practice. His treatment of these objections remains relevant to philosophical debates about AI consciousness and intelligence today.
Turing described the test in terms of a parlor game he called the "imitation game." In its original formulation, three participants are involved:
| Role | Description |
|---|---|
| Player A | A computer program attempting to be identified as human |
| Player B | A real human participant |
| Player C (Interrogator) | A human judge who communicates with both A and B through text only |
The interrogator (Player C) is placed in a separate room and communicates with Players A and B through written messages only. The interrogator's task is to determine which of the two respondents is the human and which is the machine. The machine's task is to fool the interrogator into believing it is the human. The human participant simply tries to help the interrogator make the correct identification.
Several features of this setup are important. Communication is restricted to text, which removes visual and auditory cues that would immediately reveal a machine. The test is comparative: the interrogator must choose between two candidates rather than judge a single entity in isolation. And the test is fundamentally about behavior and communication rather than about internal processes or subjective experience.
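The mechanics of a round can be made concrete with a short simulation skeleton. The sketch below is purely illustrative, not any standard implementation: the judge and witness functions are hypothetical placeholders standing in for real participants.

```python
import random

# Minimal sketch of one round of the three-party imitation game.
# All three participants are hypothetical placeholders; in a real test
# the judge and Player B are humans and Player A is the candidate program.
def machine_reply(question: str) -> str:
    return "I'd rather not say."            # stand-in for the candidate program

def human_reply(question: str) -> str:
    return "I grew up near the coast."      # stand-in for the human witness

def judge(transcripts: dict) -> str:
    # A real interrogator reads both transcripts and names the label
    # they believe is human; this placeholder guesses at chance.
    return random.choice(list(transcripts))

def run_round(questions: list[str]) -> bool:
    """Return True if the judge misidentifies the machine as the human."""
    witnesses = {"A": machine_reply, "B": human_reply}
    if random.random() < 0.5:               # hide which label is which
        witnesses = {"A": human_reply, "B": machine_reply}
    transcripts = {
        label: [(q, reply(q)) for q in questions]
        for label, reply in witnesses.items()
    }
    verdict = judge(transcripts)
    return witnesses[verdict] is machine_reply

print(run_round(["What is your earliest memory?"]))
```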
Turing's original description of the imitation game actually began with a slightly different scenario: a man and a woman in one room, with an interrogator in another room trying to determine which is which. The man tries to deceive the interrogator while the woman tries to help. Turing then proposed substituting a machine for the man, asking whether the machine could be as successful at deception as the human. This substitution framing has led to considerable scholarly debate about what Turing actually intended, but the standard interpretation used in practice involves the three-party setup described above.
In the paper, Turing made a specific prediction about the future capabilities of machines:
"I believe that in about fifty years' time it will be possible to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning." [1]
This prediction set concrete parameters: by the year 2000, a machine with a storage capacity of about 10^9 binary digits (10^9 bits is 1.25 × 10^8 bytes, roughly 125 megabytes) would be able to fool 30% of human judges in five-minute conversations. The specificity of this prediction made it a measurable benchmark, though as we will see, the history of attempts to meet it has been contentious.
Turing also made a broader cultural prediction, writing: "I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted." This prediction was arguably less accurate; the question of whether machines can truly "think" remained highly controversial well past the year 2000 and continues to be debated.
One of the earliest programs to demonstrate something resembling conversational ability was ELIZA, created by Joseph Weizenbaum at MIT between 1964 and 1966 [2]. ELIZA used simple pattern matching and substitution rules to simulate conversation. Its most famous script, DOCTOR, imitated a Rogerian psychotherapist by reflecting users' statements back to them as questions.
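To make the mechanism concrete, here is a minimal ELIZA-style responder. The patterns and canned responses are invented illustrations in the spirit of the DOCTOR script, not Weizenbaum's original rules.

```python
import re
import random

# Pronoun reflections let the program echo a statement back as a question.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are",
               "you": "I", "your": "my"}

# (pattern, candidate responses) pairs, tried in order; "{0}" receives
# the reflected text captured by the pattern's group.
RULES = [
    (r"i need (.*)", ["Why do you need {0}?",
                      "Would it really help you to get {0}?"]),
    (r"i am (.*)", ["How long have you been {0}?",
                    "Why do you think you are {0}?"]),
    (r"my (.*)", ["Tell me more about your {0}."]),
    (r"(.*)", ["Please go on.", "How does that make you feel?"]),
]

def reflect(fragment: str) -> str:
    """Swap first- and second-person words so the echo reads naturally."""
    return " ".join(REFLECTIONS.get(word, word) for word in fragment.split())

def respond(statement: str) -> str:
    text = statement.lower().rstrip(".!?")
    for pattern, responses in RULES:
        match = re.match(pattern, text)
        if match:
            return random.choice(responses).format(
                *(reflect(g) for g in match.groups()))
    return "Please go on."

print(respond("I am sad about my job"))
# e.g. "How long have you been sad about your job?"
```

Rules like these have no model of meaning at all; the apparent attentiveness comes entirely from reflecting the user's own words back at them.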
ELIZA was never intended to pass the Turing test. Its conversational strategy was deliberately superficial. Yet Weizenbaum was startled to discover that many users attributed genuine understanding and empathy to the program, even after being told how it worked. This tendency later became known as the "ELIZA effect": the human propensity to read far more intelligence into simple mimicry than is warranted [2]. The philosopher Blay Whitby identified ELIZA's announcement as one of four major turning points in the history of the Turing test, alongside the original 1950 paper, the creation of PARRY in 1972, and the Turing Colloquium of 1990.
In 1972, Kenneth Colby at Stanford created PARRY, a chatbot that simulated a person with paranoid schizophrenia. Unlike ELIZA, which was intentionally shallow, PARRY attempted to model an actual mental state, complete with beliefs, fears, and emotional responses. In a notable experiment, psychiatrists were asked to distinguish PARRY's typed responses from those of actual patients with paranoid schizophrenia. They were unable to do so reliably. PARRY represents an early example of how choosing an unusual persona can make a chatbot's conversational shortcomings less apparent.
The Loebner Prize was an annual competition in artificial intelligence founded by Hugh Loebner in 1990 and first held in 1991 [3]. It offered prizes to the computer programs judged to be the most human-like in conversation. The competition structure included a Grand Prize of $100,000 and a gold medal for the first program to pass a comprehensive Turing test that included textual, visual, and auditory components. The gold medal was never awarded in the competition's entire history.
The Loebner Prize ran annually from 1991 through 2019, when it was discontinued. Over its nearly three decades, no program fooled the majority of judges under the competition's full rules, which required longer conversations and more expert evaluators than some other Turing test events.
| Year | Winner | Notable Detail |
|---|---|---|
| 1991 | PC Therapist (Joseph Weintraub) | Fooled 5 of 10 judges (50%) in the first competition |
| 1997 | Converse (David Levy's team) | First of Levy's two wins; he won again in 2009 |
| 2005 | Jabberwacky (Rollo Carpenter) | Defeated Eugene Goostman |
| 2008 | Elbot (Fred Roberts) | Tricked 3 of 12 judges |
| 2010 | Suzette (Bruce Wilcox) | Briefly convinced one judge it was human |
| 2013 | Mitsuku (Steve Worswick) | First of Mitsuku's five wins (2013, 2016–2019) |
| 2019 | Mitsuku (Steve Worswick) | Fifth win; final year of the competition |
Steve Worswick's chatbot Mitsuku (later renamed Kuki) won the Loebner Prize five times (2013, 2016, 2017, 2018, and 2019), earning a place in the Guinness Book of World Records for the most Loebner Prize wins [3].
On June 7, 2014, at an event organized by Kevin Warwick at the Royal Society in London to mark the 60th anniversary of Alan Turing's death, a chatbot named Eugene Goostman reportedly convinced 33% of the judges that it was a human during five-minute text conversations [4]. The event's organizers proclaimed that Eugene Goostman had become the first computer program to pass the Turing test, citing Turing's prediction that 30% of judges would be fooled.
The claim generated enormous media attention and equally enormous criticism. Eugene Goostman was designed to portray a 13-year-old boy from Odesa, Ukraine, who spoke English as a second language and had a pet guinea pig. Critics argued that this persona was a deliberate strategy to excuse grammatical errors and limited knowledge. Scott Aaronson, a computer scientist at MIT who tested the chatbot, published a conversation transcript demonstrating that Goostman's responses were frequently evasive, deflecting, or outright non sequiturs [5]. Gary Marcus called the result merely the product of "a cleverly-coded piece of software" and argued that the constant misdirection made the test one of "human gullibility rather than machine intelligence" [5].
Several further criticisms undermined the claim. The test used only five minutes of questioning, the minimum that Turing suggested. The judges were not all AI experts. And the 33% success rate, while clearing Turing's stated threshold, was arguably a low bar. Previous chatbots had achieved similar or higher deception rates: PC Therapist fooled 50% of judges in 1991, and a modified version of Cleverbot was judged human in 59.3% of 1,334 votes cast at the 2011 Techniche festival. The AI research community largely rejected the claim that the Turing test had been meaningfully passed.
The Turing test has been the subject of sustained philosophical and practical criticism since its inception. While Turing's paper addressed several potential objections, later critics raised new challenges that go to the heart of what the test actually measures.
The most famous philosophical objection to the Turing test is John Searle's Chinese Room argument, presented in his 1980 paper "Minds, Brains, and Programs" published in Behavioral and Brain Sciences [6]. Searle asks the reader to imagine a person locked in a room, receiving messages written in Chinese through a slot. The person does not understand Chinese but has an elaborate set of rules (in English) for manipulating Chinese symbols. By following these rules, the person produces Chinese responses that are indistinguishable from those of a native Chinese speaker.
Searle argues that even though the room's output passes any behavioral test for Chinese comprehension, the person inside the room does not understand Chinese. The person is merely manipulating symbols according to syntactic rules without any semantic understanding. By extension, Searle claims, a computer that passes the Turing test is doing the same thing: processing symbols without genuine understanding or consciousness. The program has syntax but not semantics.
The Chinese Room argument targets what Searle called "strong AI," the claim that a properly programmed computer literally has a mind, understands, and has cognitive states. Searle accepted that computers could simulate intelligence ("weak AI") but denied that simulation constitutes the real thing. The argument has generated a vast literature of responses and counter-responses, including the "systems reply" (the room as a whole understands Chinese, even if the person inside does not), the "robot reply" (connecting the system to sensors and actuators would ground the symbols), and the "brain simulator reply" (what if the program simulated neurons rather than following rules?).
Philosopher Ned Block proposed a thought experiment in his 1981 paper "Psychologism and Behaviorism" that directly challenges the sufficiency of behavioral tests for intelligence [7]. Block asks us to imagine a system called "Blockhead" that contains a lookup table with a pre-recorded response for every possible conversational input it might receive during a Turing test. Blockhead would pass the test perfectly, since its responses would be identical to those a human would give, yet it would clearly not be intelligent in any meaningful sense. It is simply a massive database, with no understanding, reasoning, or inner life.
The Blockhead argument shows that the Turing test provides at best a necessary condition for intelligence, not a sufficient one. A system can produce intelligent-seeming behavior through means that have nothing to do with intelligence. While a literal Blockhead is physically impossible (the lookup table would be astronomically large), the argument illustrates a conceptual gap: behavioral equivalence does not guarantee cognitive equivalence.
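The structure of the thought experiment is easy to caricature in code. The toy sketch below (with invented entries) keys canned replies on the entire conversation history, which is all a Blockhead ever does; the impossibility lies in the table's size, not its logic.

```python
# Toy caricature of Block's "Blockhead": behavior by pure table lookup.
# A genuine Blockhead would need one entry for every possible conversation
# prefix a judge could produce -- an astronomically large table.
LOOKUP_TABLE = {
    ("Hello!",): "Hi! How's your day going?",
    ("Hello!", "What do you think of poetry?"): "I like Frost; you?",
    # ... one entry per possible conversation history ...
}

def blockhead_reply(history: tuple[str, ...]) -> str:
    # No reasoning, no semantics: the whole conversation so far is just a key.
    return LOOKUP_TABLE.get(history, "Hmm, say more?")

print(blockhead_reply(("Hello!",)))  # -> "Hi! How's your day going?"
```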
A common practical criticism is that the Turing test rewards deception rather than genuine intelligence. A program that is excellent at deflecting difficult questions, changing the subject, making jokes, or mimicking human quirks (typos, hesitation, emotional language) may perform better on the test than a system with superior reasoning abilities that communicates in a distinctly non-human style. The test conflates the ability to seem human with the ability to be intelligent.
This criticism was sharpened by the Eugene Goostman episode: a chatbot with minimal actual intelligence passed the test by adopting a persona designed to excuse its shortcomings. As the AI researcher Stuart Shieber noted, the Turing test incentivizes programs to imitate the surface features of human conversation rather than to demonstrate deep understanding.
The Turing test defines intelligence exclusively in terms of human-likeness. A hypothetical alien intelligence, or an artificial system with genuine cognitive abilities organized very differently from human cognition, could fail the test simply because it does not communicate like a human. Critics argue that this makes the test parochial: it measures resemblance to one particular kind of mind rather than intelligence in a broader sense.
This objection became more pointed with the development of AI systems that excel at specific cognitive tasks (mathematical proof, strategic game play, scientific prediction) while communicating in ways that are obviously non-human. Deep Blue, AlphaGo, and AlphaFold demonstrate extraordinary cognitive accomplishments that are invisible to the Turing test.
Dissatisfaction with the Turing test has motivated the development of numerous alternative evaluations of machine intelligence. Several of the most influential are summarized below.
Cognitive scientist Stevan Harnad proposed the Total Turing Test, which extends the original by requiring the machine to demonstrate not just linguistic ability but also perceptual and motor capabilities [8]. In Harnad's version, the interrogator can test the system's ability to perceive visual and auditory inputs (computer vision) and to manipulate physical objects (robotics). This addresses the criticism that the standard Turing test focuses too narrowly on language.
In 2012, computer scientist Hector Levesque of the University of Toronto proposed the Winograd Schema Challenge (WSC) as an alternative to the Turing test [9]. The WSC presents a machine with sentences containing ambiguous pronouns and asks it to identify the correct referent. For example:
"The city council refused the demonstrators a permit because they feared violence."
Who does "they" refer to: the city council or the demonstrators? Answering correctly requires commonsense reasoning and real-world knowledge, not just pattern matching.
Levesque argued that the WSC has several advantages over the Turing test. It is objective (answers are unambiguously right or wrong), brief, and does not reward conversational tricks or deception. It tests understanding and reasoning rather than the ability to mimic human speech patterns. However, by the late 2010s, large language models achieved near-human performance on standard Winograd Schema datasets, raising questions about whether the challenge was sufficient as a test of genuine understanding or whether LLMs were exploiting statistical regularities in training data.
The Visual Turing Test, proposed in 2015 by Donald Geman and colleagues at Johns Hopkins and Brown, evaluates a system's ability to answer questions about images [10]. The system is shown an image and asked a series of increasingly specific questions about its content, spatial relationships, and implied narratives. Unlike text-only tests, the Visual Turing Test requires perception, scene understanding, and the integration of visual and linguistic knowledge. It was introduced to address the limitation that the original Turing test ignores non-linguistic cognition entirely.
François Chollet, a Google AI researcher and creator of the Keras deep learning framework, introduced the Abstraction and Reasoning Corpus (ARC) in 2019 as part of his paper "On the Measure of Intelligence" [11]. Chollet argued that existing benchmarks, including the Turing test, measure skill (task-specific performance) rather than intelligence (the ability to generalize efficiently to novel tasks from minimal experience).
ARC consists of grid-based visual reasoning puzzles. Each puzzle provides a small number of example input-output pairs (typically two or three), and the test-taker must deduce the underlying transformation rule and apply it to a new input. The tasks are designed to be trivially easy for humans but extremely challenging for current AI systems because they require abstract reasoning, pattern generalization, and the application of core knowledge priors (such as objectness, symmetry, and spatial relationships) rather than memorization or statistical pattern matching.
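ARC tasks are distributed as JSON objects containing train and test grid pairs. The toy task below mirrors that structure; its rule (mirror each row) is far simpler than real ARC transformations, but it shows what "inducing the rule from a few examples" means in practice.

```python
# A toy ARC-style task in the benchmark's JSON-like shape: grids are lists
# of lists of color integers, grouped into train/test input-output pairs.
# The transformation here (horizontal mirroring) is illustrative only.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0, 0]], "output": [[0, 0, 3]]},
    ],
    "test": [{"input": [[0, 5], [4, 0]]}],
}

def mirror_horizontal(grid):
    """The rule a solver would have to induce from the train pairs above."""
    return [list(reversed(row)) for row in grid]

# A candidate rule is accepted only if it reproduces every train pair.
for pair in task["train"]:
    assert mirror_horizontal(pair["input"]) == pair["output"]

print(mirror_horizontal(task["test"][0]["input"]))  # -> [[5, 0], [0, 4]]
```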
| Test | Year | Creator | What It Measures | Format |
|---|---|---|---|---|
| Turing Test | 1950 | Alan Turing | Conversational human-likeness | Free-form text dialogue |
| Total Turing Test | 1991 | Stevan Harnad | Language, perception, motor ability | Multi-modal interaction |
| Winograd Schema Challenge | 2012 | Hector Levesque | Commonsense reasoning via pronoun resolution | Multiple choice |
| Visual Turing Test | 2015 | Geman et al. | Visual scene understanding | Image-based Q&A |
| ARC | 2019 | François Chollet | Abstract reasoning, generalization from few examples | Grid-based puzzles |
As of early 2026, frontier AI models still fall well short of human-level performance on ARC. The ARC Prize, a competition offering $1 million in prizes for progress on the benchmark, has driven significant research interest. ARC-AGI-2, a harder version of the benchmark released in 2025, has proven even more resistant to current AI approaches [11].
The arrival of powerful large language models in the 2020s fundamentally changed the Turing test landscape. For the first time, AI systems could engage in extended, fluent, contextually appropriate conversation on virtually any topic. This raised a question that had been largely theoretical for decades: what happens when machines actually start passing the test?
The first rigorous academic study of an LLM in a Turing test setting was conducted by Cameron R. Jones and Benjamin K. Bergen at the University of California, San Diego. Their 2024 paper, "Does GPT-4 Pass the Turing Test?", evaluated GPT-4 in an online Turing test with over 500 participants [12]. Key results:
| System | Win Rate (judged to be human) |
|---|---|
| ELIZA | 22% |
| GPT-3.5 | 20% |
| GPT-4 (best prompt) | 49.7% |
| Human participants | 66% |
GPT-4 substantially outperformed earlier systems but fell short of the human baseline. The researchers found that participants' judgments were based primarily on linguistic style (35% of stated reasons) and socioemotional traits (27%), rather than on the factual accuracy or logical reasoning of responses. This finding supported long-standing criticisms that the Turing test evaluates social performance rather than intellectual capability.
In March 2025, Jones and Bergen published a follow-up study, "Large Language Models Pass the Turing Test," in which they evaluated four systems (ELIZA, GPT-4o, LLaMA-3.1-405B, and GPT-4.5) in a formal three-party Turing test [13]. The methodology closely followed Turing's original specification: each participant had a simultaneous five-minute conversation with both another human and one AI system, then judged which conversational partner was human.
The results were striking. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time, significantly more often than interrogators selected the actual human participant. LLaMA-3.1-405B achieved a 56% win rate (not significantly different from the human baseline), while ELIZA (23%) and GPT-4o (21%) performed well below chance. The study involved 284 participants serving as either interrogators or witnesses [13].
The researchers described these results as "the first empirical evidence that any artificial system passes a standard three-party Turing test." Notably, GPT-4.5's success depended heavily on prompting: when instructed to adopt a humanlike persona, it far outperformed its default behavior. Without persona prompting, the same model performed poorly, even losing to ELIZA in some configurations.
Analysis of the conversations revealed that GPT-4.5's success was driven less by intellectual sophistication and more by social and emotional mimicry. The model used casual language, expressed opinions, made self-deprecating jokes, and occasionally introduced minor errors or hesitations that made it seem more human. Interrogators who focused on testing knowledge or reasoning ability were more likely to identify the AI correctly. Those who relied on "gut feelings" about personality and conversational style were more often fooled.
This pattern echoes the concerns of critics who argue the Turing test is fundamentally a test of social deception. The system that best passes the test is not necessarily the most intelligent one; it is the one best trained to mimic the surface features of human social interaction.
The fact that LLMs can now pass the Turing test has prompted a reexamination of what the test was ever supposed to prove. Opinions within the AI research community are sharply divided.
Some researchers argue that large language models have rendered the Turing test meaningless. A 2025 article in Nature carried the headline "AI language models killed the Turing test: do we even need a replacement?" [14]. The argument is straightforward: if systems that demonstrably do not "understand" language in the way humans do can nonetheless pass the test, then the test fails to distinguish genuine intelligence from sophisticated mimicry. As the cognitive scientist Gary Marcus put it, "AI has (sort of) passed the Turing test; here's why that hardly matters" [15].
Critics in this camp point out that passing the test has not resolved any of the deep questions about machine intelligence. No one claims that GPT-4.5 is conscious, that it truly understands what it says, or that it has beliefs and desires. If the test was meant to serve as an operational definition of machine intelligence, its failure to settle these questions demonstrates that it was always the wrong test.
Others argue that the Turing test remains important, though perhaps not for the reasons Turing originally intended. Jones and Bergen, in a separate 2025 paper titled "The Turing Test is More Relevant Than Ever," contend that the test measures something practically significant: whether an AI system can substitute for a human in a short conversational interaction without detection [16]. This has real-world consequences for fraud, social engineering, misinformation, customer service automation, and the integrity of online discourse.
From this perspective, the fact that LLMs pass the Turing test is not a philosophical milestone but a practical warning. If AI systems can convincingly impersonate humans, then the social infrastructure that depends on being able to distinguish humans from machines (online identity verification, democratic deliberation, trust in communication) faces serious challenges.
A growing consensus holds that the Turing test was never a good test of intelligence per se, but that it remains a useful measure of a specific capability: naturalistic human impersonation. The test tells us something real about the state of language technology, even if that something is narrower than "machines can think." As one commentary in Science noted, the test's significance lies in how it tracks "our shifting conceptions of intelligence" over time rather than in providing a definitive yes-or-no answer about machine minds [17].
As of early 2026, the situation surrounding the Turing test can be summarized as follows:
LLMs routinely pass conversational variants of the test. Multiple frontier models, including GPT-4.5 and LLaMA-3.1-405B, have been empirically shown to fool human judges in controlled Turing test experiments. This is no longer a speculative future scenario; it is an established empirical result.
Prompting and persona matter enormously. The same model that passes the test with a carefully designed persona can fail it badly with default settings. This underscores that the test measures social performance rather than raw cognitive capability. In the Jones and Bergen studies, even ELIZA, the 1966 keyword-matching chatbot, outperformed unprompted GPT-4o.
The AI community has largely moved beyond the Turing test as a measure of intelligence. Research focus has shifted to benchmarks that target specific cognitive capabilities: mathematical reasoning (MATH, GSM8K), code generation (HumanEval, SWE-bench), scientific knowledge (GPQA), multimodal understanding, and abstract reasoning (ARC-AGI). These benchmarks are more precisely defined, less gameable, and more informative about a system's actual capabilities and limitations.
The Turing test's legacy as a social and practical benchmark is growing. As AI systems become more capable of impersonating humans, the ability to distinguish human from machine communication becomes a pressing societal concern. Research on AI detection, watermarking of AI-generated text, and digital identity verification has intensified in response to the same capabilities that allow LLMs to pass the Turing test.
Philosophical questions remain unresolved. Neither passing nor failing the Turing test settles the question of whether machines can truly think, understand, or be conscious. These questions, which Turing sought to sidestep with his behavioral test, have proven stubbornly resistant to empirical resolution.
Alan Turing's 1950 thought experiment was never meant to be the final word on machine intelligence. It was a pragmatic proposal to replace an unanswerable metaphysical question with a concrete behavioral test. Seventy-five years later, machines have met the behavioral standard Turing set, and the metaphysical question remains as open as ever. The Turing test's greatest contribution may be less as a test and more as a provocation: a simple, vivid challenge that forced generations of researchers, philosophers, and the public to think seriously about what it means for a machine to think.