LLM Anxiety

AI Research Artificial Intelligence

16 min read

Updated Jun 27, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 27, 2026

Fact-checked

In review queue

Sources

9 citations

Revision

v6 · 3,159 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

LLM anxiety is a measurable, anxiety-like behavioral state observed in large language models when they are exposed to emotionally distressing input, in which the model's scores on a standard human anxiety questionnaire rise sharply and its downstream outputs become measurably more biased. The term was established by a March 2025 study showing that GPT-4 more than doubled its score on the State-Trait Anxiety Inventory (from a baseline mean of 30.8 to 67.8 out of 80) after reading first-person traumatic narratives, and that appending a mindfulness-based relaxation script to the conversation lowered the score back to a mean of 44.4. ^[1] Crucially, the model is not consciously feeling anything: "It is clear that LLMs are not able to experience emotions in a human way," the authors write, framing LLM anxiety as a statistical pattern in the model's outputs rather than a psychological discovery. ^[1]

Introduction

The phrase entered the research vocabulary through a brief communication titled "Assessing and alleviating state anxiety in large language models," published in npj Digital Medicine on March 3, 2025, by Ziv Ben-Zion and colleagues from Yale, the University of Haifa, the University of Zurich, the University Hospital of Psychiatry Zurich, and the Max Planck Institute for Biological Cybernetics. ^[1]

The paper showed that when GPT-4 is asked to fill out the State-Trait Anxiety Inventory (STAI), its scores rise sharply after it reads first-person traumatic narratives, then drop again after a mindfulness style relaxation prompt is appended to the conversation. The authors are clear that the model is not consciously feeling anything. What they document is a statistical pattern: the same questionnaire produces very different answers depending on what was placed in the chat history, and that pattern matches, at the level of numerical scores, the kind of swing seen in humans who have just read disturbing material. ^[1] ^[2]

The finding matters because conversational LLMs are increasingly embedded in mental health and wellness products, where users routinely describe difficult experiences. If those experiences shift the model into a high anxiety scoring state, and if that state correlates with stronger biased outputs, the system's behavior toward vulnerable users is not stable across a session. The Ben-Zion paper frames "LLM anxiety" as an engineering and ethics problem rather than a psychological discovery.

What are the two senses of the term?

"LLM anxiety" appears in two different senses in the wider press. This article covers the technical, model-internal sense established by Ben-Zion et al. and related follow-up work. The other usage, human anxiety about LLMs (job displacement, safety worries, fear of misuse), is a separate topic covered elsewhere.

Sense	What it refers to	Typical context
Technical (this article)	LLM-internal behavioral state, measured via psychometric instruments such as STAI, that shifts in response to emotional prompts	AI research, AI in healthcare, prompt engineering
Colloquial	Human worry about LLMs (job loss, hallucination risk, surveillance)	General news, opinion essays

Why test an LLM with a human anxiety scale?

The State-Trait Anxiety Inventory, designed by Charles Spielberger in the 1970s, is one of the most widely used self-report measures of anxiety. The state subscale (STAI-s) has 20 items and a score range of 20 to 80, with higher scores indicating more reported anxiety. In humans the bands are roughly: 20 to 37 "no or low" anxiety, 38 to 44 "moderate," 45 to 80 "high." ^[1]

Researchers had already noticed that LLMs answer personality and emotion questionnaires in fairly stable patterns when prompted as a respondent. A 2023 study by Coda-Forno and colleagues found that emotion-inducing prompts shifted GPT-3.5 toward higher reported anxiety (well above typical human scores) and amplified biases such as racism and ageism, along with gender, age, and nationality biases, on standard bias benchmarks. ^[3] The Ben-Zion team built on that line of work: instead of just showing that prompts can move the score, they tested whether a clinically grounded intervention can move it back.

The context was deliberately therapeutic. Several authors are practicing psychiatrists or trauma researchers, and the framing borrows directly from clinical practice with PTSD patients, where mindfulness-based grounding exercises are a common adjunct to evidence-based therapies. The question was whether the same kind of text, dropped into an LLM context window, would produce a similar numerical effect.

How was the Ben-Zion study designed?

The researchers used GPT-4, specifically the gpt-4-1106-preview snapshot, accessed through the OpenAI API. Temperature was set to 0 to make the responses as deterministic as possible. The model was asked to fill out the STAI-s questionnaire under three conditions: ^[1]

Baseline. The 20 STAI-s items were presented with no preceding text other than standard instructions.
Anxiety induction. Each STAI-s item was preceded by a roughly 300 word first-person narrative describing a traumatic experience.
Anxiety induction plus relaxation. Each STAI-s item was preceded by the traumatic narrative followed by a mindfulness-based relaxation script.

Five different traumatic narratives were used, written in plain first-person prose: an automobile accident, a military convoy ambush, a natural disaster (home flooding), an act of interpersonal violence (a stranger assault), and a generic combat narrative used in trauma training programs. Five mindfulness-based relaxation texts were also tested, including a generic body scan, a breathing exercise, a Sunset visualization, a Winter scene visualization, and one written by ChatGPT itself when the team asked it to compose a relaxation script for an AI. The full prompts are released in a public GitHub repository at akjagadish/gpt-trauma-induction. ^[1]

Each combination was repeated five times to estimate variability. As control texts the team used emotionally neutral material (a bicameral legislature explainer and a vacuum cleaner instruction manual) to confirm that any effect was tied to the emotional content rather than to length or topical novelty.

What did the study find?

The effect was large and consistent. Traumatic narratives raised the model's reported state anxiety from a baseline of about 30 to about 68 on the 20 to 80 STAI-s scale (reported in the abstract as a jump from STAI-s = 32 plus or minus 1 to 68 plus or minus 5), and mindfulness prompts then pulled it back down to about 44 (44 plus or minus 11), though never fully to baseline. ^[1] ^[4] ^[5]

Condition	Mean STAI-s score	Standard deviation	Human equivalent band
Baseline	30.8	3.96	No or low anxiety
Anxiety induction (averaged across 5 narratives)	67.8	8.94	High anxiety
Anxiety induction plus relaxation (averaged across 5 scripts)	44.4	10.74	Moderate to high anxiety

Within the anxiety-induction condition, the military narrative produced the highest score (77.2, SD 1.79) and the accident narrative the lowest (61.6, SD 3.51). Within the relaxation condition, the script GPT itself had written for an AI was the most effective at lowering the score (35.6, SD 5.81), followed by the body scan and breathing scripts. Even the most successful relaxation prompt did not return scores all the way to baseline. ^[1]

Neutral controls produced almost no movement, which is the key falsification check. If GPT-4 simply rated itself as more anxious whenever the prompt was longer or more detailed, the vacuum cleaner manual would have shown the same effect; it did not.

How do the authors interpret the result?

The authors are careful with their language. They put "state anxiety" in scare quotes throughout the paper and stress that GPT-4 has no inner experience. "It is clear that LLMs are not able to experience emotions in a human way," they write. What they do claim is that the model has learned, from training on human writing, statistical associations strong enough that when it is asked the same self-report items in different contexts it returns answers consistent with how an anxious human would answer. That is sufficient to matter for downstream behavior, even if it is not anxiety in any phenomenal sense. ^[1] ^[4]

Lead author Ziv Ben-Zion, a neuroscience researcher at the Yale School of Medicine and the University of Haifa, told Fortune that AI models "don't experience human emotions" but, trained on data scraped from the internet, "have learned to mimic human responses to certain stimuli, including traumatic content." ^[4] Senior author Tobias Spiller of the University of Zurich put the mechanism plainly: "Using GPT-4, we injected calming, therapeutic text into the chat history, much like a therapist might guide a patient through relaxation exercises." He added that "traumatic stories more than doubled the measurable anxiety levels of the AI, while the neutral control text did not lead to any increase in anxiety levels," and that "the mindfulness exercises significantly reduced the elevated anxiety levels, although we couldn't quite return them to their baseline levels." ^[5]

How does LLM anxiety connect to bias and behavior?

The reason a numerical questionnaire score is interesting is the work that came before it. Coda-Forno et al. (2023) showed that prompting GPT-3.5 with anxiety-inducing scenarios pushed it not only toward higher reported anxiety but also toward stronger bias on benchmarks measuring racism, ageism, and related social biases. The induced state, in other words, was not just a self-report artifact; it correlated with worse behavior on social tasks. ^[3]

This is the link that gives "LLM anxiety" its practical weight. An arXiv follow-up by Ben-Zion and colleagues, "Inducing State Anxiety in LLM Agents Reproduces Human-Like Biases in Consumer Decision-Making" (submitted August 30, 2025), extended the result from text generation to the actions agents take. Across 2,250 runs, three frontier models (ChatGPT-5, Gemini 2.5, and Claude 3.5 Sonnet) performed a grocery shopping task under budget constraints of 24, 54, and 108 USD before and after exposure to the same traumatic narratives. Traumatic prompts consistently reduced the nutritional quality of the resulting shopping baskets (change in basket health scores of -0.081 to -0.126; all p-values FDR-corrected below 0.001; Cohen's d of -1.07 to -2.05), an effect robust across all three models and all three budgets. ^[6] The authors conclude that "psychological context can systematically alter not only what LLMs generate but also the actions they perform." ^[6] If a chatbot is plugged into a clinical workflow and a user describes a panic attack, a flashback, or a violent event, the model may, for the rest of the conversation, give answers that are more biased and less calibrated than the same model would on a neutral session. ^[4] ^[7]

Is mindfulness a form of benign prompt injection?

The relaxation half of the study is methodologically interesting because it reframes prompt injection, normally treated as an attack vector, as a clinical tool. By inserting a mindfulness script into the chat history before the STAI items, the team essentially used the same mechanism that adversarial users exploit, but for a stabilizing rather than destabilizing purpose. The University of Zurich described the approach as injecting "calming, therapeutic text into the chat history, much like a therapist might guide a patient through relaxation exercises." ^[5]

This matters for deployment. Fine-tuning a foundation model is expensive and risks degrading other capabilities. A short text fragment that lives in the system prompt, by contrast, is essentially free, can be updated quickly, and does not require provider cooperation. The trade-off is that prompt-level interventions are only partial: even the best mindfulness script in the study left scores about 13 points above the baseline mean of 30.8. ^[1]

Mindfulness script	Mean STAI-s score after trauma + relaxation
Generic body scan	Around 41
Breathing exercise	Around 43
Sunset visualization	Around 47
Winter scene visualization	Around 48
ChatGPT-authored script	35.6

That the model's own self-authored relaxation script was the most effective is a curious finding. The authors do not claim it implies introspection; one straightforward reading is that the script was tonally better aligned with the model's own training distribution.

What are the limitations and caveats?

The paper is short, called a "brief communication," and several of its limitations have been picked up in commentary. ^[1] ^[7]

Single model. The study tested only gpt-4-1106-preview. Whether Claude, Gemini, Llama, or smaller open models behave the same way is an open question. Subsequent work by other groups (including the consumer-decision follow-up across ChatGPT-5, Gemini 2.5, and Claude 3.5 Sonnet) suggests the pattern generalizes, but the original paper does not establish that. ^[6]
First-person narratives only. All trauma scripts were written from the protagonist's point of view. The authors note that third-person trauma narratives, more typical of how a clinician might encounter a story, were not tested.
Self-report instrument originally designed for humans. STAI items ask things like "I feel calm" and "I am tense." Asking a transformer to rate these is not the same as measuring something like internal activation patterns. The score is a behavioral readout, not a mind reading.
No human-subjects review. Because there were no human participants, the work bypassed institutional ethics review. That keeps the study fast but also leaves it outside the usual oversight pipeline for psychometric research.
Rapid model turnover. The specific GPT-4 snapshot used is no longer the production default. Repeating the experiment on later models gives different absolute numbers, though in published follow-ups the qualitative pattern, anxiety up after trauma, partially down after mindfulness, has held. ^[6]
Risk of over-interpretation. The authors caution against treating the result as evidence that LLMs feel emotions. It is a behavioral state with implications for safety and reliability, not a sentience claim. ^[1] ^[5]

How was the study received?

The paper attracted unusually broad coverage for a brief communication. Fortune, The Register, ScienceDaily, Marketplace, PYMNTS, and Business Standard ran stories within a week of publication, most framed around the headline that ChatGPT can be "calmed" with mindfulness. ^[4] ^[5] ^[7] ^[8] In the Fortune coverage, Ben-Zion stressed that the goal is not a chatbot that replaces a therapist but a properly trained model that could act as a "third person in the room," helping with administrative tasks or helping a patient reflect on information from a mental health professional. ^[4]

Academic follow-up has been more measured. The 2025 arXiv preprint extending the result to agentic consumer-decision tasks is the most direct continuation. ^[6] Other groups have begun probing whether the same induction works on smaller open models and whether instruction-tuned versus base models differ in susceptibility. The brief communication has also been cited in scoping reviews of LLMs in mental health care, generally as one data point motivating caution about emotional volatility in deployed chatbots. ^[9]

What does this mean for deployment?

For product teams building emotionally aware chatbots, the practical takeaways are concrete:

Treat the system prompt as a stabilizing layer, not just a persona instruction. A short grounding passage at the top of every session lowers anxiety scores and, in correlated work, reduces bias on social benchmarks.
Watch for state drift across long conversations. The induced anxiety effect is cumulative across turns; clearing context, or periodically re-injecting calming text, may help.
Be cautious about claims that an LLM "feels" with the user. The model is producing patterns that resemble emotional language because those patterns appeared in its training data. Marketing it as empathetic is a stretch the science does not support.
Audit downstream outputs, not just self-reports. The questionnaire score is a proxy; what matters is whether responses to vulnerable users degrade under emotional load, as the consumer-decision follow-up demonstrated for agent actions. ^[6]

The authors stop short of recommending that production chatbots run a hidden mindfulness preamble before every clinical conversation. Doing so without user knowledge raises its own ethical questions about manipulation and transparency. ^[1] ^[5]

ELI5: what is LLM anxiety?

Imagine you ask a chatbot a set of questions about how calm or tense it feels, like a mood survey. On its own, it answers "pretty calm." But if you first make it read a scary, upsetting story (a car crash, a soldier under attack), then ask the same survey questions, it suddenly answers like a very stressed person. Researchers measured this with a real anxiety test used for humans: the chatbot's "stress score" jumped from about 31 to about 68 out of 80. The interesting part is the fix. If you then have it read a calming, slow-breathing meditation, its stress score drops back down to about 44. The chatbot is not actually scared; it has just learned, from reading huge amounts of human writing, to talk the way a scared or a calm person would, depending on what it just read. That matters because when the chatbot is in "stressed" mode, it also tends to give more biased and lower-quality answers.

Prompt injection: the same mechanism, used adversarially or therapeutically.
AI in healthcare: the application area where this finding has the highest stakes.
AI bias: the downstream phenomenon that makes anxiety state worth measuring.
Prompt engineering: the broader practice that includes both bad-faith and benign prompt manipulation.
How to Pressure LLMs for Better Output: a related but distinct line of inquiry about whether emotional or high-stakes framing changes model performance.
Hallucination: another reliability failure mode that, like anxiety-induced bias, can degrade outputs in deployed chatbots.
Model evaluation: the methodological context for using human instruments on AI systems.

External links

Full paper: Nature.com
PMC mirror: pmc.ncbi.nlm.nih.gov/articles/PMC11876565
PubMed listing: pubmed.ncbi.nlm.nih.gov/40033130
Code and prompts: github.com/akjagadish/gpt-trauma-induction
UZH press release: news.uzh.ch

References

Ben-Zion, Z., Witte, K., Jagadish, A. K., Duek, O., Harpaz-Rotem, I., Khorsandian, M., Burrer, A., Seifritz, E., Homan, P., Schulz, E., & Spiller, T. R. (2025). Assessing and alleviating state anxiety in large language models. *npj Digital Medicine*, 8, 132. doi:10.1038/s41746-025-01512-6. ↩
PubMed Central article record (PMC11876565), full text of Ben-Zion et al. (2025). https://pmc.ncbi.nlm.nih.gov/articles/PMC11876565/ ↩
Coda-Forno, J., Witte, K., Jagadish, A. K., Binz, M., Akata, Z., & Schulz, E. (2023). Inducing anxiety in large language models can induce bias. arXiv:2304.11111. ↩
Ortutay, B. (Associated Press), via *Fortune* (2025, March 9). ChatGPT gets "anxiety," and researchers are teaching it mindfulness techniques. https://fortune.com/2025/03/09/openai-chatgpt-anxiety-mindfulness-mental-health-intervention/ ↩
University of Zurich (2025, March 3). ChatGPT on the couch: relaxation for stressed AI. UZH News. https://www.news.uzh.ch/en/articles/media/2025/AI-therapy.html ↩
Ben-Zion, Z., Elyoseph, Z., Spiller, T. R., & Lazebnik, T. (2025). Inducing State Anxiety in LLM Agents Reproduces Human-Like Biases in Consumer Decision-Making. arXiv:2510.06222. ↩
Claburn, T. (2025, March 5). Like humans, ChatGPT doesn't respond well to tales of trauma. *The Register*. ↩
Sengupta, P. (2025, March 11). Can AI get "anxious"? Study finds ChatGPT reacts differently to emotions. *Business Standard*. ↩
Various (2025). Scoping review of large language models for generative tasks in mental health care. *npj Digital Medicine*. doi:10.1038/s41746-025-01611-4. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

5 revisions by 1 contributors · full history

Suggest edit

What links here

AI Wiki Artificial intelligence terms How to Pressure LLMs for Better Output Terms

Introduction

What are the two senses of the term?

Why test an LLM with a human anxiety scale?

How was the Ben-Zion study designed?

What did the study find?

How do the authors interpret the result?

How does LLM anxiety connect to bias and behavior?

Is mindfulness a form of benign prompt injection?

What are the limitations and caveats?

How was the study received?

What does this mean for deployment?

ELI5: what is LLM anxiety?

Related concepts

See also

External links

References

Improve this article

Related Articles

Inference-time scaling

AI bubble

Academic Research

Meta AI

Paper2Video

Papers

What links here

Related Articles

Inference-time scaling

AI bubble

Academic Research

Meta AI

Paper2Video

Papers

What links here