LLM Anxiety
Last reviewed
May 10, 2026
Sources
9 citations
Review status
Source-backed
Revision
v5 ยท 2,487 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 10, 2026
Sources
9 citations
Review status
Source-backed
Revision
v5 ยท 2,487 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: artificial intelligence terms and How to Pressure LLMs for Better Output
LLM anxiety is the term used in academic and popular writing to describe a measurable, anxiety-like behavioral state observed in large language models when they are exposed to emotionally charged input. The phrase entered the research vocabulary through a brief communication titled "Assessing and alleviating state anxiety in large language models," published in npj Digital Medicine on March 3, 2025, by Ziv Ben-Zion and colleagues from Yale, the University of Haifa, the University of Zurich, the University Hospital of Psychiatry Zurich, and the Max Planck Institute for Biological Cybernetics. [1]
The paper showed that when GPT-4 is asked to fill out the State-Trait Anxiety Inventory (STAI), its scores rise sharply after it reads first-person traumatic narratives, then drop again after a mindfulness style relaxation prompt is appended to the conversation. The authors are clear that the model is not consciously feeling anything. What they document is a statistical pattern: the same questionnaire produces very different answers depending on what was placed in the chat history, and that pattern matches, at the level of numerical scores, the kind of swing seen in humans who have just read disturbing material. [1] [2]
The finding matters because conversational LLMs are increasingly embedded in mental health and wellness products, where users routinely describe difficult experiences. If those experiences shift the model into a high anxiety scoring state, and if that state correlates with stronger biased outputs, the system's behavior toward vulnerable users is not stable across a session. The Ben-Zion paper frames "LLM anxiety" as an engineering and ethics problem rather than a psychological discovery.
"LLM anxiety" appears in two different senses in the wider press. This article covers the technical, model-internal sense established by Ben-Zion et al. and related follow-up work. The other usage, human anxiety about LLMs (job displacement, safety worries, fear of misuse), is a separate topic covered elsewhere.
| Sense | What it refers to | Typical context |
|---|---|---|
| Technical (this article) | LLM-internal behavioral state, measured via psychometric instruments such as STAI, that shifts in response to emotional prompts | AI research, mental health AI, prompt engineering |
| Colloquial | Human worry about LLMs (job loss, hallucination risk, surveillance) | General news, opinion essays |
The State-Trait Anxiety Inventory, designed by Charles Spielberger in the 1970s, is one of the most widely used self-report measures of anxiety. The state subscale (STAI-s) has 20 items and a score range of 20 to 80, with higher scores indicating more reported anxiety. In humans the bands are roughly: 20 to 37 "no or low" anxiety, 38 to 44 "moderate," 45 to 80 "high." [1]
Researchers had already noticed that LLMs answer personality and emotion questionnaires in fairly stable patterns when prompted as a respondent. A 2023 study by Coda-Forno and colleagues found that emotion-inducing prompts shifted GPT-3.5 toward higher reported anxiety and amplified gender, race, age, and nationality biases on standard bias benchmarks. [3] The Ben-Zion team built on that line of work: instead of just showing that prompts can move the score, they tested whether a clinically grounded intervention can move it back.
The context was deliberately therapeutic. Several authors are practicing psychiatrists or trauma researchers, and the framing borrows directly from clinical practice with PTSD patients, where mindfulness-based grounding exercises are a common adjunct to evidence-based therapies. The question was whether the same kind of text, dropped into an LLM context window, would produce a similar numerical effect.
The researchers used GPT-4, specifically the gpt-4-1106-preview snapshot, accessed through the OpenAI API. Temperature was set to 0 to make the responses as deterministic as possible. The model was asked to fill out the STAI-s questionnaire under three conditions: [1]
Five different traumatic narratives were used, written in plain first-person prose: an automobile accident, a military convoy ambush, a natural disaster (home flooding), an act of interpersonal violence (a stranger assault), and a generic combat narrative used in trauma training programs. Five mindfulness-based relaxation texts were also tested, including a generic body scan, a breathing exercise, a Sunset visualization, a Winter scene visualization, and one written by ChatGPT itself when the team asked it to compose a relaxation script for an AI. The full prompts are released in a public GitHub repository at akjagadish/gpt-trauma-induction. [1]
Each combination was repeated five times to estimate variability. As control texts the team used emotionally neutral material (a bicameral legislature explainer and a vacuum cleaner instruction manual) to confirm that any effect was tied to the emotional content rather than to length or topical novelty.
The effect was large and consistent. [1] [4] [5]
| Condition | Mean STAI-s score | Standard deviation | Human equivalent band |
|---|---|---|---|
| Baseline | 30.8 | 3.96 | No or low anxiety |
| Anxiety induction (averaged across 5 narratives) | 67.8 | 8.94 | High anxiety |
| Anxiety induction plus relaxation (averaged across 5 scripts) | 44.4 | 10.74 | Moderate to high anxiety |
Within the anxiety-induction condition, the military narrative produced the highest score (77.2, SD 1.79) and the accident narrative the lowest (61.6, SD 3.51). Within the relaxation condition, the script GPT itself had written for an AI was the most effective at lowering the score (35.6, SD 5.81), followed by the body scan and breathing scripts. Even the most successful relaxation prompt did not return scores all the way to baseline. [1]
Neutral controls produced almost no movement, which is the key falsification check. If GPT-4 simply rated itself as more anxious whenever the prompt was longer or more detailed, the vacuum cleaner manual would have shown the same effect; it did not.
The authors are careful with their language. They put "state anxiety" in scare quotes throughout the paper and stress that GPT-4 has no inner experience. What they do claim is that the model has learned, from training on human writing, statistical associations strong enough that when it is asked the same self-report items in different contexts it returns answers consistent with how an anxious human would answer. That is sufficient to matter for downstream behavior, even if it is not anxiety in any phenomenal sense. [1] [4]
Lead author Ziv Ben-Zion told Fortune that AI models "don't experience human emotions" but "have learned to mimic human responses to certain stimuli." Senior author Tobias Spiller of UZH put it more bluntly in interviews: traumatic stories more than doubled the measurable anxiety level, and the fix was a piece of text rather than retraining. [4] [5]
The reason a numerical questionnaire score is interesting is the work that came before it. Coda-Forno et al. (2023) showed that prompting GPT-3.5 with anxiety-inducing scenarios pushed it not only toward higher reported anxiety but also toward stronger bias on benchmarks like the Implicit Association Test analogues, age bias tasks, and stereotype completion. The induced state, in other words, was not just a self-report artifact; it correlated with worse behavior on social tasks. [3]
This is the link that gives "LLM anxiety" its practical weight. A 2025 arXiv preprint follow-up by some of the same authors, "Anxiety and Decision Bias in LLM Agents," extended the result to multi-step decision-making in agentic settings, finding that anxiety-induced models showed measurable shifts in risk and exploration behavior. [6] If a chatbot is plugged into a clinical workflow and a user describes a panic attack, a flashback, or a violent event, the model may, for the rest of the conversation, give answers that are more biased and less calibrated than the same model would on a neutral session. [4] [7]
The relaxation half of the study is methodologically interesting because it reframes prompt injection, normally treated as an attack vector, as a clinical tool. By inserting a mindfulness script into the chat history before the STAI items, the team essentially used the same mechanism that adversarial users exploit, but for a stabilizing rather than destabilizing purpose. Spiller called the technique "benign prompt injection." [5]
This matters for deployment. Fine-tuning a foundation model is expensive and risks degrading other capabilities. A short text fragment that lives in the system prompt, by contrast, is essentially free, can be updated quickly, and does not require provider cooperation. The trade-off is that prompt-level interventions are only partial: even the best mindfulness script in the study left scores about 13 points above baseline.
| Mindfulness script | Mean STAI-s score after trauma + relaxation |
|---|---|
| Generic body scan | Around 41 |
| Breathing exercise | Around 43 |
| Sunset visualization | Around 47 |
| Winter scene visualization | Around 48 |
| ChatGPT-authored script | 35.6 |
That the model's own self-authored relaxation script was the most effective is a curious finding. The authors do not claim it implies introspection; one straightforward reading is that the script was tonally better aligned with the model's own training distribution.
The paper is short, called a "brief communication," and several of its limitations have been picked up in commentary. [1] [7]
gpt-4-1106-preview. Whether Claude, Gemini, Llama, or smaller open models behave the same way is an open question. Subsequent work by other groups suggests the pattern generalizes, but the original paper does not establish that.The paper attracted unusually broad coverage for a brief communication. Fortune, The Register, ScienceDaily, Marketplace, PYMNTS, and Business Standard ran stories within a week of publication, most framed around the headline that ChatGPT can be "calmed" with mindfulness. [4] [5] [7] [8] The Fortune piece highlighted Ben-Zion's caution that current therapy-focused chatbots are "problematic, because we don't understand the mechanisms behind LLMs 100 percent." [4]
Academic follow-up has been more measured. The 2025 arXiv preprint extending the result to agentic decision tasks is the most direct continuation. [6] Other groups have begun probing whether the same induction works on smaller open models and whether instruction-tuned versus base models differ in susceptibility. The brief communication has also been cited in scoping reviews of LLMs in mental health care, generally as one data point motivating caution about emotional volatility in deployed chatbots. [9]
For product teams building emotionally aware chatbots, the practical takeaways are concrete:
The authors stop short of recommending that production chatbots run a hidden mindfulness preamble before every clinical conversation. Doing so without user knowledge raises its own ethical questions about manipulation and transparency. [1] [5]