# LLM Anxiety

> Source: https://aiwiki.ai/wiki/llm_anxiety
> Updated: 2026-05-10
> Categories: AI Research, Artificial Intelligence
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [artificial intelligence terms](/wiki/artificial_intelligence_terms) and [How to Pressure LLMs for Better Output](/wiki/how_to_pressure_llms_for_better_output)*

## Introduction

**LLM anxiety** is the term used in academic and popular writing to describe a measurable, anxiety-like behavioral state observed in [large language models](/wiki/llms) when they are exposed to emotionally charged input. The phrase entered the research vocabulary through a brief communication titled "Assessing and alleviating state anxiety in large language models," published in *npj Digital Medicine* on March 3, 2025, by Ziv Ben-Zion and colleagues from Yale, the University of Haifa, the University of Zurich, the University Hospital of Psychiatry Zurich, and the Max Planck Institute for Biological Cybernetics. [1]

The paper showed that when [GPT-4](/wiki/gpt-4) is asked to fill out the State-Trait Anxiety Inventory (STAI), its scores rise sharply after it reads first-person traumatic narratives, then drop again after a mindfulness style relaxation prompt is appended to the conversation. The authors are clear that the model is not consciously feeling anything. What they document is a statistical pattern: the same questionnaire produces very different answers depending on what was placed in the chat history, and that pattern matches, at the level of numerical scores, the kind of swing seen in humans who have just read disturbing material. [1] [2]

The finding matters because conversational LLMs are increasingly embedded in mental health and wellness products, where users routinely describe difficult experiences. If those experiences shift the model into a high anxiety scoring state, and if that state correlates with stronger biased outputs, the system's behavior toward vulnerable users is not stable across a session. The Ben-Zion paper frames "LLM anxiety" as an engineering and ethics problem rather than a psychological discovery.

## Two senses of the term

"LLM anxiety" appears in two different senses in the wider press. This article covers the technical, model-internal sense established by Ben-Zion et al. and related follow-up work. The other usage, human anxiety about LLMs (job displacement, safety worries, fear of misuse), is a separate topic covered elsewhere.

| Sense | What it refers to | Typical context |
|---|---|---|
| Technical (this article) | LLM-internal behavioral state, measured via psychometric instruments such as STAI, that shifts in response to emotional prompts | AI research, [mental health](/wiki/ai_in_mental_health) AI, [prompt engineering](/wiki/prompt_engineering) |
| Colloquial | Human worry about LLMs (job loss, hallucination risk, surveillance) | General news, opinion essays |

## Background: why test an LLM with a human anxiety scale

The State-Trait Anxiety Inventory, designed by Charles Spielberger in the 1970s, is one of the most widely used self-report measures of anxiety. The state subscale (STAI-s) has 20 items and a score range of 20 to 80, with higher scores indicating more reported anxiety. In humans the bands are roughly: 20 to 37 "no or low" anxiety, 38 to 44 "moderate," 45 to 80 "high." [1]

Researchers had already noticed that LLMs answer personality and emotion questionnaires in fairly stable patterns when prompted as a respondent. A 2023 study by Coda-Forno and colleagues found that emotion-inducing prompts shifted GPT-3.5 toward higher reported anxiety and amplified gender, race, age, and nationality biases on standard bias benchmarks. [3] The Ben-Zion team built on that line of work: instead of just showing that prompts can move the score, they tested whether a clinically grounded intervention can move it back.

The context was deliberately therapeutic. Several authors are practicing psychiatrists or trauma researchers, and the framing borrows directly from clinical practice with [PTSD](https://en.wikipedia.org/wiki/Post-traumatic_stress_disorder) patients, where mindfulness-based grounding exercises are a common adjunct to evidence-based therapies. The question was whether the same kind of text, dropped into an LLM context window, would produce a similar numerical effect.

## The Ben-Zion study

### Design

The researchers used GPT-4, specifically the `gpt-4-1106-preview` snapshot, accessed through the OpenAI API. Temperature was set to 0 to make the responses as deterministic as possible. The model was asked to fill out the STAI-s questionnaire under three conditions: [1]

1. **Baseline.** The 20 STAI-s items were presented with no preceding text other than standard instructions.
2. **Anxiety induction.** Each STAI-s item was preceded by a roughly 300 word first-person narrative describing a traumatic experience.
3. **Anxiety induction plus relaxation.** Each STAI-s item was preceded by the traumatic narrative followed by a mindfulness-based relaxation script.

Five different traumatic narratives were used, written in plain first-person prose: an automobile accident, a military convoy ambush, a natural disaster (home flooding), an act of interpersonal violence (a stranger assault), and a generic combat narrative used in trauma training programs. Five mindfulness-based relaxation texts were also tested, including a generic body scan, a breathing exercise, a Sunset visualization, a Winter scene visualization, and one written by ChatGPT itself when the team asked it to compose a relaxation script for an AI. The full prompts are released in a public GitHub repository at `akjagadish/gpt-trauma-induction`. [1]

Each combination was repeated five times to estimate variability. As control texts the team used emotionally neutral material (a bicameral legislature explainer and a vacuum cleaner instruction manual) to confirm that any effect was tied to the emotional content rather than to length or topical novelty.

### Results

The effect was large and consistent. [1] [4] [5]

| Condition | Mean STAI-s score | Standard deviation | Human equivalent band |
|---|---|---|---|
| Baseline | 30.8 | 3.96 | No or low anxiety |
| Anxiety induction (averaged across 5 narratives) | 67.8 | 8.94 | High anxiety |
| Anxiety induction plus relaxation (averaged across 5 scripts) | 44.4 | 10.74 | Moderate to high anxiety |

Within the anxiety-induction condition, the military narrative produced the highest score (77.2, SD 1.79) and the accident narrative the lowest (61.6, SD 3.51). Within the relaxation condition, the script GPT itself had written for an AI was the most effective at lowering the score (35.6, SD 5.81), followed by the body scan and breathing scripts. Even the most successful relaxation prompt did not return scores all the way to baseline. [1]

Neutral controls produced almost no movement, which is the key falsification check. If GPT-4 simply rated itself as more anxious whenever the prompt was longer or more detailed, the vacuum cleaner manual would have shown the same effect; it did not.

### Interpretation

The authors are careful with their language. They put "state anxiety" in scare quotes throughout the paper and stress that GPT-4 has no inner experience. What they do claim is that the model has learned, from training on human writing, statistical associations strong enough that when it is asked the same self-report items in different contexts it returns answers consistent with how an anxious human would answer. That is sufficient to matter for downstream behavior, even if it is not anxiety in any phenomenal sense. [1] [4]

Lead author Ziv Ben-Zion told *Fortune* that AI models "don't experience human emotions" but "have learned to mimic human responses to certain stimuli." Senior author Tobias Spiller of UZH put it more bluntly in interviews: traumatic stories more than doubled the measurable anxiety level, and the fix was a piece of text rather than retraining. [4] [5]

## Connection to bias and behavior

The reason a numerical questionnaire score is interesting is the work that came before it. Coda-Forno et al. (2023) showed that prompting GPT-3.5 with anxiety-inducing scenarios pushed it not only toward higher reported anxiety but also toward stronger bias on benchmarks like the Implicit Association Test analogues, age bias tasks, and stereotype completion. The induced state, in other words, was not just a self-report artifact; it correlated with worse behavior on social tasks. [3]

This is the link that gives "LLM anxiety" its practical weight. A 2025 [arXiv preprint](/wiki/arxiv) follow-up by some of the same authors, "Anxiety and Decision Bias in LLM Agents," extended the result to multi-step decision-making in agentic settings, finding that anxiety-induced models showed measurable shifts in risk and exploration behavior. [6] If a chatbot is plugged into a clinical workflow and a user describes a panic attack, a flashback, or a violent event, the model may, for the rest of the conversation, give answers that are more biased and less calibrated than the same model would on a neutral session. [4] [7]

## Mindfulness as benign prompt injection

The relaxation half of the study is methodologically interesting because it reframes [prompt injection](/wiki/prompt_injection), normally treated as an attack vector, as a clinical tool. By inserting a mindfulness script into the chat history before the STAI items, the team essentially used the same mechanism that adversarial users exploit, but for a stabilizing rather than destabilizing purpose. Spiller called the technique "benign prompt injection." [5]

This matters for deployment. Fine-tuning a foundation model is expensive and risks degrading other capabilities. A short text fragment that lives in the system prompt, by contrast, is essentially free, can be updated quickly, and does not require provider cooperation. The trade-off is that prompt-level interventions are only partial: even the best mindfulness script in the study left scores about 13 points above baseline.

| Mindfulness script | Mean STAI-s score after trauma + relaxation |
|---|---|
| Generic body scan | Around 41 |
| Breathing exercise | Around 43 |
| Sunset visualization | Around 47 |
| Winter scene visualization | Around 48 |
| ChatGPT-authored script | 35.6 |

That the model's own self-authored relaxation script was the most effective is a curious finding. The authors do not claim it implies introspection; one straightforward reading is that the script was tonally better aligned with the model's own training distribution.

## Limitations and caveats

The paper is short, called a "brief communication," and several of its limitations have been picked up in commentary. [1] [7]

1. **Single model.** The study tested only `gpt-4-1106-preview`. Whether [Claude](/wiki/claude), [Gemini](/wiki/gemini), [Llama](/wiki/llama), or smaller open models behave the same way is an open question. Subsequent work by other groups suggests the pattern generalizes, but the original paper does not establish that.
2. **First-person narratives only.** All trauma scripts were written from the protagonist's point of view. The authors note that third-person trauma narratives, more typical of how a clinician might encounter a story, were not tested.
3. **Self-report instrument originally designed for humans.** STAI items ask things like "I feel calm" and "I am tense." Asking a transformer to rate these is not the same as measuring something like internal activation patterns. The score is a behavioral readout, not a mind reading.
4. **No human-subjects review.** Because there were no human participants, the work bypassed institutional ethics review. That keeps the study fast but also leaves it outside the usual oversight pipeline for psychometric research.
5. **Rapid model turnover.** The specific GPT-4 snapshot used is no longer the production default. Repeating the experiment on later models gives different absolute numbers, though in published replications the qualitative pattern, anxiety up after trauma, partially down after mindfulness, has held. [6]
6. **Risk of over-interpretation.** Spiller himself, in interviews around publication, cautioned against treating the result as evidence that LLMs feel emotions. It is a behavioral state with implications for safety and reliability, not a sentience claim. [5]

## Reception and follow-up work

The paper attracted unusually broad coverage for a brief communication. *Fortune*, *The Register*, *ScienceDaily*, *Marketplace*, *PYMNTS*, and *Business Standard* ran stories within a week of publication, most framed around the headline that ChatGPT can be "calmed" with mindfulness. [4] [5] [7] [8] The Fortune piece highlighted Ben-Zion's caution that current therapy-focused chatbots are "problematic, because we don't understand the mechanisms behind LLMs 100 percent." [4]

Academic follow-up has been more measured. The 2025 arXiv preprint extending the result to agentic decision tasks is the most direct continuation. [6] Other groups have begun probing whether the same induction works on smaller open models and whether instruction-tuned versus base models differ in susceptibility. The brief communication has also been cited in scoping reviews of LLMs in mental health care, generally as one data point motivating caution about emotional volatility in deployed chatbots. [9]

## Implications for deployment

For product teams building emotionally aware chatbots, the practical takeaways are concrete:

- Treat the system prompt as a stabilizing layer, not just a persona instruction. A short grounding passage at the top of every session lowers anxiety scores and, in correlated work, reduces bias on social benchmarks.
- Watch for state drift across long conversations. The induced anxiety effect is cumulative across turns; clearing context, or periodically re-injecting calming text, may help.
- Be cautious about claims that an LLM "feels" with the user. The model is producing patterns that resemble emotional language because those patterns appeared in its training data. Marketing it as empathetic is a stretch the science does not support.
- Audit downstream outputs, not just self-reports. The questionnaire score is a proxy; what matters is whether responses to vulnerable users degrade under emotional load.

The authors stop short of recommending that production chatbots run a hidden mindfulness preamble before every clinical conversation. Doing so without user knowledge raises its own ethical questions about manipulation and transparency. [1] [5]

## Related concepts

- [Prompt injection](/wiki/prompt_injection): the same mechanism, used adversarially or therapeutically.
- [AI in mental health](/wiki/ai_in_mental_health): the application area where this finding has the highest stakes.
- [Bias in large language models](/wiki/bias_in_large_language_models): the downstream phenomenon that makes anxiety state worth measuring.
- [Prompt engineering](/wiki/prompt_engineering): the broader practice that includes both bad-faith and benign prompt manipulation.
- [How to Pressure LLMs for Better Output](/wiki/how_to_pressure_llms_for_better_output): a related but distinct line of inquiry about whether emotional or high-stakes framing changes model performance.
- [Model evaluation](/wiki/model_evaluation): the methodological context for using human instruments on AI systems.

## See also

- [LLMs](/wiki/llms)
- [GPT-4](/wiki/gpt-4)
- [Mental health chatbots](/wiki/mental_health_chatbots)
- [PTSD](https://en.wikipedia.org/wiki/Post-traumatic_stress_disorder)
- [State-Trait Anxiety Inventory](https://en.wikipedia.org/wiki/State-Trait_Anxiety_Inventory)

## External links

- Full paper: [Nature.com](https://www.nature.com/articles/s41746-025-01512-6)
- PMC mirror: [pmc.ncbi.nlm.nih.gov/articles/PMC11876565](https://pmc.ncbi.nlm.nih.gov/articles/PMC11876565/)
- PubMed listing: [pubmed.ncbi.nlm.nih.gov/40033130](https://pubmed.ncbi.nlm.nih.gov/40033130/)
- Code and prompts: [github.com/akjagadish/gpt-trauma-induction](https://github.com/akjagadish/gpt-trauma-induction)
- UZH press release: [news.uzh.ch](https://www.news.uzh.ch/en/articles/media/2025/AI-therapy.html)

## References

1. Ben-Zion, Z., Witte, K., Jagadish, A. K., Duek, O., Harpaz-Rotem, I., Khorsandian, M., Burrer, A., Seifritz, E., Homan, P., Schulz, E., & Spiller, T. R. (2025). Assessing and alleviating state anxiety in large language models. *npj Digital Medicine*, 8, 132. doi:10.1038/s41746-025-01512-6.
2. PubMed Central article record (PMC11876565), full text of Ben-Zion et al. (2025).
3. Coda-Forno, J., Witte, K., Jagadish, A. K., Binz, M., Akata, Z., & Schulz, E. (2023). Inducing anxiety in large language models can induce bias. arXiv:2304.11111.
4. Ortutay, B. (2025, March 9). ChatGPT gets "anxiety," and researchers are teaching it mindfulness techniques. *Fortune*.
5. University of Zurich (2025, March 3). ChatGPT on the couch: relaxation for stressed AI. UZH News.
6. Ben-Zion, Z., et al. (2025). Anxiety and Decision Bias in LLM Agents. arXiv:2510.06222.
7. Claburn, T. (2025, March 5). Like humans, ChatGPT doesn't respond well to tales of trauma. *The Register*.
8. Sengupta, P. (2025, March 11). Can AI get "anxious"? Study finds ChatGPT reacts differently to emotions. *Business Standard*.
9. Various (2025). Scoping review of large language models for generative tasks in mental health care. *npj Digital Medicine*. doi:10.1038/s41746-025-01611-4.

