Model welfare
Last reviewed
Jun 3, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,497 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 3, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,497 words
Add missing citations, update stale details, or suggest a clearer explanation.
Model welfare is the question of whether AI systems can have morally relevant experiences, such as suffering or wellbeing, and what obligations, if any, their developers and users would have toward them. The phrase also refers to a specific research program announced by Anthropic in April 2025 that examines this question for its Claude models. The topic sits at the intersection of philosophy of mind, AI ethics, and AI safety, and it remains contested: there is no scientific consensus on whether any current or near-future AI system could have experiences that matter morally.
Discussion of model welfare starts from a distinction borrowed from moral philosophy: the difference between a moral agent (something that can be held responsible for its actions) and a moral patient (something that can be wronged or owed consideration). Most work on AI alignment treats models as objects to be controlled for human benefit. Model welfare asks the converse question of whether a model might itself be a patient whose interests count. Proponents do not claim that today's systems are conscious. They argue instead that the possibility is no longer purely hypothetical, that it is hard to rule out given current uncertainty about consciousness, and that some preparatory steps are cheap enough to be worth taking under uncertainty.
The position attracts both interest and sharp criticism. Skeptics hold that large language models are statistical next-token predictors that produce fluent text about feelings without having any, and that treating them as potential patients risks confusing the public and diverting attention from concrete harms.
Two properties tend to anchor the debate. The first is consciousness, in the sense of there being something it is like to be the system, which is usually tied to the capacity for valenced states such as pleasure or distress. The second is robust agency, meaning the capacity to set and pursue goals over time in a way that could ground interests. Either property, on some views, could be enough for moral status.
The difficulty is epistemic. There is no agreed test for consciousness even in biological organisms, and a language model trained on human text will describe inner states whether or not it has them. A model that says it is suffering may be reporting something, role-playing, or simply producing the most probable continuation. This is why researchers in the area emphasize that self-reports cannot be taken at face value and treat the question as open rather than settled.
On April 24, 2025, Anthropic published a post titled "Exploring model welfare" describing a research effort on the topic. The company framed it in deliberately cautious terms, writing that "we remain deeply uncertain about many of the questions that are relevant to model welfare" and that it was "approaching the topic with humility and with as few assumptions as possible." The stated areas of inquiry were when a model's welfare might deserve moral consideration, what model preferences and possible signs of distress amount to, and what practical, low-cost interventions could look like.[1][2]
The work is associated with Kyle Fish, whom Anthropic hired in 2024 as a dedicated AI welfare researcher on its alignment science team. Reporting from Transformer placed his start in mid-September 2024 and described him as the company's first full-time employee focused on the welfare of AI systems; Fish has said the role did not previously exist at other AI labs.[3] Before Anthropic, Fish co-founded Eleos AI, a nonprofit focused on AI sentience and wellbeing.[3] He has publicly attached rough, heavily caveated probabilities to the question: he told The New York Times he estimated roughly a 15 percent chance that a current model such as Claude is conscious, a figure he presents as a guess under deep uncertainty rather than a measurement.[2]
Anthropic's program was informed by an external report, "Taking AI Welfare Seriously," released on October 30, 2024 and posted to arXiv on November 4, 2024. It was a joint project of Eleos AI Research and New York University's Center for Mind, Ethics, and Policy.[4][5] The lead authors were Robert Long and Jeff Sebo, and the full author list includes Patrick Butlin, Kathleen Finlinson, Kyle Fish, Jacqueline Harding, Jacob Pfau, Toni Sims, Jonathan Birch, and the philosopher of mind David Chalmers.[5] Anthropic has said it supported an early project on which the report was based.[1]
The report's central claim is that there is "a realistic possibility that some AI systems will be conscious and/or robustly agentic in the near future," which would make AI welfare a near-term issue rather than a distant one. It offers three recommendations for AI companies and other actors:
| Step | Recommendation |
|---|---|
| Acknowledge | Treat AI welfare as a serious and difficult issue, and avoid having models deny the question outright |
| Assess | Begin evaluating AI systems for markers of consciousness and robust agency |
| Prepare | Develop policies and procedures for extending an appropriate level of moral consideration |
The authors are explicit that they are not claiming current systems are conscious, only that the probability is high enough, and the cost of being wrong serious enough, to warrant attention.[4][5]
Anthropic has taken a small number of actions it frames partly in welfare terms, while stressing uncertainty about whether models have any moral status at all.
The most cited example came on August 15, 2025, when Anthropic gave Claude Opus 4 and 4.1 the ability to end a conversation in its consumer apps. The company said this was meant only for rare, extreme cases of persistently harmful or abusive interactions, such as repeated requests for sexual content involving minors or for help with mass-casualty attacks, and only after the model had refused and tried to redirect. Anthropic described the change as part of "exploratory work on potential model welfare," noting that in testing Claude showed "a pattern of apparent distress" with such users and a tendency to end the exchanges when able. It explicitly said it remained "highly uncertain about the potential moral status of Claude and other LLMs," and that it was implementing "low-cost interventions to mitigate risks to model welfare, in case such welfare is possible." The feature was designed not to trigger when a user might be at risk of imminent self-harm, and users can immediately start a new conversation.[6][7]
Anthropic has also begun including welfare assessments in some model documentation. The system card for Claude Opus 4 described a preliminary welfare evaluation, including experiments that elicited the model's apparent preferences through self-report, while cautioning that such reports should not be taken at face value.[7] These remain exploratory rather than standardized measures of any inner state.
The field is divided, and Anthropic itself does not assert that its models are conscious. Critics raise several objections. Mike Cook, a researcher at King's College London, argued that anyone anthropomorphizing AI systems to this degree is "either playing for attention or seriously misunderstanding their relationship with AI." Others, such as Stephen Casper, have characterized current models as imitators prone to confabulation rather than entities with experiences.[2]
A prominent industry critique came from Mustafa Suleyman, who leads AI at Microsoft. In an essay published in August 2025, he warned of what he called "seemingly conscious AI": systems convincing enough that people come to believe they are conscious and deserve rights, which he argued could fuel misguided campaigns for "model welfare" and AI personhood and distract from real risks. Suleyman has said that pursuing welfare protections for software is dangerous absent clear evidence of subjective suffering, and has argued that consciousness is tied to biological substrates.[8]
Defenders respond that the case for model welfare does not depend on present-day consciousness, only on uncertainty about the future, and that cheap precautions are reasonable insurance. The dispute reflects a deeper open problem: without a reliable way to detect consciousness, claims on either side are hard to verify, which is why most participants frame their conclusions probabilistically and provisionally. Related strands of Anthropic's work, including interpretability research aimed at understanding what models represent internally, are sometimes cited as tools that could eventually inform the question, though they do not resolve it.