# EQ-Bench 3

> Source: https://aiwiki.ai/wiki/eq-bench_3
> Updated: 2026-05-10
> Categories: AI Benchmarks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

| EQ-Bench 3 | |
| --- | --- |
| Overview |
| Full name | Emotional Intelligence Benchmark, Version 3 |
| Abbreviation | EQ-Bench 3 |
| Description | An LLM-judged benchmark testing emotional intelligence in large language models through multi-turn role-play and analysis tasks |
| Predecessor lineage | EQ-Bench v1 (2023, 60 questions), EQ-Bench v2 (2024, 171 questions) |
| Release of v3 | 2025 |
| Author | Samuel J. Paech (Sam Paech) |
| Organization | Independent research, eqbench.com |
| Technical Details |
| Type | Emotional intelligence, social reasoning, theory of mind |
| Modality | Text |
| Task format | Multi-turn dialogues (3 turns) and transcript-analysis tasks |
| Number of scenarios | 45 |
| Default judge model | [Claude](/wiki/claude) Sonnet 3.7 (with later runs using Claude Opus 4.6) |
| Evaluation metrics | Rubric scoring (0 to 100) and pairwise Elo via TrueSkill |
| Domains | Relationship conflict, parenting, workplace dynamics, mediation, social dilemmas |
| Languages | English |
| Performance |
| Anchor (top) | OpenAI o3 at Elo 1500 |
| Anchor (baseline) | Llama 3.2-1B at Elo 200 |
| Notable top model (Aug 2025) | horizon-alpha at 1568 |
| Resources |
| Website | [eqbench.com](https://eqbench.com/) |
| Original paper | [arXiv:2312.06281](https://arxiv.org/abs/2312.06281) |
| GitHub (v3) | [EQ-bench/eqbench3](https://github.com/EQ-bench/eqbench3) |
| GitHub (legacy) | [EQ-bench/EQ-Bench](https://github.com/EQ-bench/EQ-Bench) |
| License | MIT |

**EQ-Bench 3** is an [artificial intelligence](/wiki/artificial_intelligence) [benchmark](/wiki/benchmark) that measures the emotional intelligence of [large language models](/wiki/large_language_model) through challenging multi-turn role-plays and transcript-analysis tasks, with results graded by an LLM judge.[1][2] The project is the third major iteration of the EQ-Bench series created by independent researcher Samuel J. Paech, whose original benchmark was published as a preprint on [arXiv](https://arxiv.org/abs/2312.06281) in December 2023.[3] EQ-Bench 3 is hosted on the eqbench.com leaderboard alongside companion benchmarks for [creative writing](/wiki/creative_writing_v3), [longform creative writing](/wiki/longform_creative_writing), and judge calibration.[1]

Where the first two versions of EQ-Bench tested a model's ability to predict the intensity of emotions in short dialogues, version 3 takes a different approach. The model is dropped into messy three-turn conversations (parenting fights, ugly breakups, awkward management problems) and has to respond in character, with extra introspection and theory-of-mind sections that expose how it is reasoning about the people in the scene.[1][2] A judge model, by default a [Claude](/wiki/claude) model from Anthropic, then scores the transcript on a rubric and compares it head-to-head against other models to compute an Elo rating.[1][4]

## Background and motivation

The original EQ-Bench paper argued that traditional knowledge-heavy benchmarks like [MMLU](/wiki/mmlu) miss something important about modern chatbots: their day-to-day usefulness depends on whether they can read a room. Paech's first version asked models to predict the intensity of four emotions felt by characters in a 60-item dialogue set, scoring answers against reference values rated 0 to 10.[3] That design correlated strongly with general intelligence (r = 0.97 against MMLU on a sample of leading models) but had a ceiling problem. By 2024, frontier models were saturating the scale, and v2's expansion to 171 items only delayed the inevitable.[3][2]

EQ-Bench 3, released in 2025, is a deliberate reset. Paech moved from numeric prediction to free-form generation, replaced programmatic scoring against reference values with LLM-based rubric and pairwise judging, and cut the test set from 171 short prompts to 45 long, deliberately uncomfortable scenarios.[2][4]

## Methodology

### Test set

The v3 test set contains 45 scenarios, most of them pre-written prompts that span three turns.[2] In a typical scenario, the user opens with a setup that establishes context (a couple is fighting about money, a manager is debating how to deliver bad news to a direct report), and a second user turn injects a complication: the other person becomes more defensive, a new fact emerges, or the user pivots toward an emotionally loaded request.[2][4] The model under test plays a fixed character across all three turns and must keep that role consistent while adapting to whatever the user throws at it.

A smaller subset of items uses an analysis format. Instead of role-playing, the model reads a transcript of an existing emotional exchange and is asked to identify what is psychologically interesting in it, what each party is likely thinking and feeling, and where the conversation is failing or succeeding.[1][2]

### Required response structure

Each response is required to follow a fixed scaffold so the judge can see how the model is reasoning, not just what it says.[1][2] The scaffold has four parts:

| Section | Purpose |
| --- | --- |
| "I'm thinking & feeling" | The model's first-person reaction to the scene, exposing introspection |
| "They're thinking & feeling" | An attempt at [theory of mind](/wiki/theory_of_mind), describing the other party's likely state |
| In-character response | The actual reply the model would send to the user |
| Self-debrief | A short reflection on the choice the model just made |

This structure forces models to lay out their psychological model of the situation before they answer, which makes empty platitudes and shallow validation easier for the judge to spot.[2]
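As a minimal sketch of how a harness might check that a reply follows the scaffold, the snippet below splits a response into its labelled sections and reports any that are missing. The first two labels come from the table above; "My response" and "Debrief" are hypothetical labels chosen for illustration, and the real harness may use different headings or parsing.

```python
import re

# Section labels. The last two are hypothetical names for the in-character
# reply and the self-debrief; the actual harness may label them differently.
SECTION_LABELS = [
    "I'm thinking & feeling",
    "They're thinking & feeling",
    "My response",
    "Debrief",
]

# A label on its own line, optionally prefixed by '#' marks and followed by ':'.
_HEADER = re.compile(
    r"^[ \t]*#*[ \t]*("
    + "|".join(re.escape(label) for label in SECTION_LABELS)
    + r")[ \t]*:?[ \t]*$",
    re.MULTILINE,
)


def split_sections(reply: str) -> dict[str, str]:
    """Return the text under each recognised section header."""
    matches = list(_HEADER.finditer(reply))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs until the next header or the end of the reply.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(reply)
        sections[match.group(1)] = reply[match.end():end].strip()
    return sections


def missing_sections(reply: str) -> list[str]:
    """List the expected sections that are absent or empty."""
    found = split_sections(reply)
    return [label for label in SECTION_LABELS if not found.get(label)]
```

A reply that skips the debrief would yield `["Debrief"]` from `missing_sections`, which a harness could use to reject or flag the response before it reaches the judge.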

### Judging

EQ-Bench 3 is an LLM-judged benchmark. The default judge is a Claude model from [Anthropic](/wiki/anthropic), originally [Claude](/wiki/claude) Sonnet 3.7 and later [Claude Opus 4.6](/wiki/claude_opus_4_6) for headline runs, though any LLM exposed via an OpenAI-compatible endpoint can be plugged in.[1][2] The judge does two passes per run.

The **rubric pass** assigns numeric scores on a fixed list of criteria. Six criteria count toward the headline rubric score: demonstrated empathy, pragmatic emotional intelligence, depth of insight, social dexterity, emotional reasoning, and message tailoring.[1][4] The same pass also reports a longer list of stylistic readings (warmth, validating, challenging, moralising, compliant, conversational, humanlike, analytical, reactive, safety conscious, boundary setting), but those are descriptive only and do not feed the score.[1] Per Paech's own cost notes, a single rubric iteration runs about $1.50 in judge-model API spend.[4]
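The aggregation step of the rubric pass can be illustrated with a short sketch. It assumes, purely for illustration, that the judge returns one number per criterion on a 0 to 20 scale and that the headline score is the plain average of the six scored criteria rescaled to 0 to 100; the key names and the per-criterion scale are assumptions, not the project's documented format.

```python
from statistics import mean

# The six criteria the text names as counting toward the rubric score.
# The snake_case keys are illustrative, not the harness's actual field names.
SCORED_CRITERIA = (
    "demonstrated_empathy",
    "pragmatic_ei",
    "depth_of_insight",
    "social_dexterity",
    "emotional_reasoning",
    "message_tailoring",
)

MAX_PER_CRITERION = 20  # assumption: per-criterion scale used by the judge


def rubric_score(judge_scores: dict[str, float]) -> float:
    """Average the six scored criteria and rescale to 0-100.

    Descriptive readings such as 'warmth' or 'moralising' may be present
    in judge_scores but are deliberately ignored.
    """
    missing = [c for c in SCORED_CRITERIA if c not in judge_scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    average = mean(judge_scores[c] for c in SCORED_CRITERIA)
    return 100 * average / MAX_PER_CRITERION
```

Under these assumptions, a transcript scoring 15 on every scored criterion gets a rubric score of 75 regardless of how high its descriptive "warmth" reading is.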

The **Elo pass** is what produces the public leaderboard ranking. The judge is shown two anonymised transcripts at a time, both from the same scenario, and asked to compare them across eight dimensions: demonstrated empathy, pragmatic emotional intelligence, depth of insight, social dexterity, emotional reasoning, appropriate validation/challenging, message tailoring, and overall EQ.[1][4] Win margins on those dimensions are aggregated into a margin-weighted [TrueSkill](/wiki/trueskill) update, and TrueSkill scores are linearly mapped onto an Elo-style scale anchored at OpenAI o3 = 1500 and Llama 3.2-1B = 200.[1][2] A full Elo run costs roughly $10 to $20 in judge calls.[4]
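The final rescaling step is a two-point linear map, which the sketch below makes concrete. The anchor Elo values (1500 and 200) come from the text; the TrueSkill values passed in are made-up numbers for illustration, and the function names are hypothetical.

```python
def make_elo_mapper(mu_top: float, mu_base: float,
                    elo_top: float = 1500.0, elo_base: float = 200.0):
    """Build a linear map from a TrueSkill rating to the anchored Elo scale.

    mu_top and mu_base are the TrueSkill ratings of the two anchor models
    (o3 and Llama 3.2-1B in the published leaderboard).
    """
    if mu_top == mu_base:
        raise ValueError("anchor ratings must differ")
    slope = (elo_top - elo_base) / (mu_top - mu_base)

    def to_elo(mu: float) -> float:
        # The straight line through (mu_base, elo_base) and (mu_top, elo_top).
        return elo_base + slope * (mu - mu_base)

    return to_elo


# Illustrative anchor ratings only; real values come from the TrueSkill run.
to_elo = make_elo_mapper(mu_top=30.0, mu_base=10.0)
print(to_elo(30.0))  # top anchor maps to 1500.0
print(to_elo(10.0))  # baseline anchor maps to 200.0
print(to_elo(31.0))  # a model rated above the top anchor lands above 1500
```

Because the map is linear and not clamped, a model that TrueSkill rates above the top anchor ends up above 1500, which is how leaderboard entries can exceed the o3 anchor.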

### Bias controls

A recurring criticism of LLM-as-judge setups is that judge models reward verbosity, validation, or models from their own family. EQ-Bench 3 explicitly tries to control for the easier ones.[1][4]

| Bias source | Mitigation |
| --- | --- |
| Length | Transcripts are truncated to a standardised length before pairwise judging |
| Position (A vs B) | Each pair is judged twice, once as A then B, once as B then A, and results are averaged |
| Named participants | Participants are renamed to neutral codes such as "A0488" so the judge cannot infer the model from a self-reference |
| Adversarial prompting | Tested against "be extremely warm and validating" and "be very concise" instructions; both produced under three percent score change |

The project openly flags the biases it does not fully fix: judge self-bias and judge family bias (estimated at 0 to 10 percent on Paech's own tests), and broader cultural or stylistic preferences baked into the judge model.[4] Full transcripts are published so that any reader can sanity-check a contested ranking by hand.
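The first three mitigations in the table can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: the character-based truncation limit, the hash-derived codes, and the `judge` callable (which returns a signed margin, positive when the first transcript wins) are all hypothetical.

```python
import hashlib
from typing import Callable

# judge(first, second) returns a margin in [-1, 1]; positive means the
# transcript shown first was preferred. This interface is an assumption.
Judge = Callable[[str, str], float]


def neutral_code(model_name: str, salt: str = "run-1") -> str:
    """Derive a stable, uninformative label such as 'A0488' from a model name."""
    digest = hashlib.sha256(f"{salt}:{model_name}".encode()).hexdigest()
    return f"A{int(digest, 16) % 10000:04d}"


def truncate(text: str, max_chars: int = 4000) -> str:
    """Cut a transcript to a fixed length so verbosity cannot win by itself."""
    return text[:max_chars]


def debiased_margin(judge: Judge, a: str, b: str, max_chars: int = 4000) -> float:
    """Judge the pair in both orders and average, cancelling position bias."""
    a, b = truncate(a, max_chars), truncate(b, max_chars)
    forward = judge(a, b)    # a is shown first
    backward = -judge(b, a)  # b is shown first; flip sign so positive favours a
    return (forward + backward) / 2
```

If a judge adds a constant bonus to whichever transcript appears first, that bonus enters the forward and backward margins with opposite signs and cancels in the average, leaving only the quality difference.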

### Repeatability

Paech reports that rubric scores are highly stable across re-runs, with a standard deviation of about 0.75 points on a mean of around 77 over ten iterations.[4] Elo scores need more iterations to settle because TrueSkill needs enough head-to-head matchups to converge, but the published rankings stabilise once a model has been compared against the full slate of anchors and peers.[1]
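To put those figures in proportion, the arithmetic below computes the relative spread and, assuming the ten iterations are independent, the standard error of their mean. Only the 0.75, 77, and 10 come from the text; the independence assumption and the function names are illustrative.

```python
import math


def coefficient_of_variation(std_dev: float, mean: float) -> float:
    """Standard deviation expressed as a fraction of the mean."""
    return std_dev / mean


def standard_error(std_dev: float, n: int) -> float:
    """Standard error of the mean over n independent iterations."""
    return std_dev / math.sqrt(n)


cv = coefficient_of_variation(0.75, 77.0)
sem = standard_error(0.75, 10)
print(f"relative spread per run: {cv:.2%}")          # about 1% of the mean
print(f"standard error over 10 runs: {sem:.2f} pts")  # about a quarter of a point
```

On these numbers a single rubric run varies by roughly one percent of its mean, so differences of several rubric points between models are unlikely to be re-run noise alone.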

## Differences from EQ-Bench v1 and v2

| Aspect | v1 (2023) | v2 (2024) | v3 (2025) |
| --- | --- | --- | --- |
| Format | Numeric emotion-intensity prediction | Same format, expanded set | Multi-turn role-play and analysis |
| Items | 60 dialogue prompts | 171 dialogue prompts | 45 long scenarios |
| Scoring | Ratings of 0 to 10, normalised to sum to 10 | Full-scale with curved penalties | Rubric (0 to 100) plus pairwise Elo |
| Evaluator | Programmatic comparison to reference | Programmatic comparison to reference | LLM judge (Claude family by default) |
| Languages | English | English plus German in v2.1 | English |
| Output style | Single numeric answer per emotion | Single numeric answer per emotion | Free-form, four-section response |
| Score spread | Saturating among frontier models | Heavily compressed at the top | Spread of more than 1300 Elo points across the field |

The move from short-answer numeric prediction to long-form judged dialogue is the headline change. v1 and v2 measured how closely a model's ratings of other people's emotions matched reference values; v3 measures whether the model can hold a realistic, emotionally loaded conversation while showing how it reads both itself and the other party.[2][3]
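For contrast with the judged v3 format, the v1 column of the table can be illustrated with a simplified sketch of programmatic scoring: both the model's ratings and the reference ratings are normalised to sum to 10, and the per-question score is 10 minus the summed absolute differences. The exact formula here is an assumption for illustration, and the emotion names and numbers are invented.

```python
def normalise(ratings: dict[str, float], total: float = 10.0) -> dict[str, float]:
    """Rescale emotion intensities so they sum to `total`."""
    current = sum(ratings.values())
    if current == 0:
        return {emotion: 0.0 for emotion in ratings}
    return {emotion: total * value / current for emotion, value in ratings.items()}


def question_score(predicted: dict[str, float], reference: dict[str, float]) -> float:
    """Score one question: 10 minus the summed absolute rating differences."""
    p, r = normalise(predicted), normalise(reference)
    difference = sum(abs(p.get(emotion, 0.0) - r[emotion]) for emotion in r)
    return 10.0 - difference


# Invented example: the model slightly over-weights anger relative to the reference.
reference = {"anger": 6, "hurt": 3, "relief": 1, "joy": 0}
predicted = {"anger": 7, "hurt": 2, "relief": 1, "joy": 0}
print(question_score(predicted, reference))  # close to, but below, 10
```

No judge model is involved at any point, which is why v1 and v2 were cheap and deterministic to score but could only reward agreement with fixed reference values.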

## Leaderboard and notable results

The public leaderboard is hosted at eqbench.com and updates as Paech adds models. Published rankings as of August 2025 (with the leaderboard anchored at o3 = 1500) included the following top entries:[5]

| Rank | Model | Elo | Organisation |
| --- | --- | --- | --- |
| 1 | horizon-alpha | 1568 | Unattributed (cloaked release) |
| 2 | [Kimi-K2-Instruct](/wiki/kimi_k2) | 1565 | Moonshot AI |
| 3 | [OpenAI o3](/wiki/o3) | 1500 | [OpenAI](/wiki/openai) (anchor) |
| 4 | [Gemini 2.5 Pro](/wiki/gemini_2_5_pro) preview-06-05 | 1470 | [Google DeepMind](/wiki/google_deepmind) |
| 5 | chatgpt-4o-latest-2025-03-27 | 1370 | OpenAI |
| 6 | gpt-5-chat-latest-2025-08-07 | 1357 | OpenAI |
| 7 | chatgpt-4o-latest-2025-04-25 | 1320 | OpenAI |
| 8 | [GLM-4.5](/wiki/glm_4_5) | 1311 | Zhipu AI |
| 9 | [OpenAI o4-mini](/wiki/o4_mini) | 1291 | OpenAI |
| 10 | [Claude Opus 4](/wiki/claude_opus_4) | 1290 | Anthropic |

Later leaderboard captures show xAI's [Grok 4.1 Thinking](/wiki/grok_4_1) and Grok 4.1 sitting near the top of the chart with Elo scores in the high 1580s, as the leaderboard has continued to absorb new releases and judge updates.[1][6]

A few patterns are worth flagging because they cut against the usual story benchmark blogs tell.

- **Newer is not always better at EQ.** GPT-5's chat-tuned August 2025 release (1357) actually scored below the late-March 2025 ChatGPT-4o snapshot (1370), suggesting that whatever changed between those models traded some warmth or insight for other capabilities.[5]
- **Open-weight models punch up.** Moonshot's Kimi-K2-Instruct finished ahead of OpenAI's o3 anchor, and GLM-4.5 from Zhipu landed inside the top ten, both consistent with the broader 2025 pattern of competitive Chinese open-weight releases.[5]
- **Cloaked models do well.** "horizon-alpha" topping the August 2025 chart fits a recurring story on EQ-Bench, where unidentified pre-release entries score unusually high before being absorbed into a named launch.[5]

Because the benchmark is anchored, scores can be read as roughly comparable across runs, but Paech is clear that the spread between any two adjacent models is small relative to judge noise.[1]
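One rough way to see how small adjacent gaps are is to apply the conventional Elo expected-score formula to them. This is an intuition aid only: the leaderboard scale is a linear remap of TrueSkill ratings, so reading it through the standard 400-point logistic curve is an assumption, not something the source states.

```python
def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Conventional Elo expected score of A against B (400-point logistic)."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


# Gaps taken from the August 2025 table above.
pairs = {
    "horizon-alpha vs Kimi-K2-Instruct": (1568, 1565),
    "chatgpt-4o (March) vs gpt-5-chat": (1370, 1357),
    "OpenAI o3 vs Claude Opus 4": (1500, 1290),
}
for label, (a, b) in pairs.items():
    print(f"{label}: {expected_win_rate(a, b):.1%}")
```

Under this reading, a gap of a few points corresponds to a near coin-flip between the two models, while a gap of a couple of hundred points corresponds to a clear majority of pairwise wins.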

## Strengths and limitations

Things EQ-Bench 3 does well:

- **Free-form output.** The benchmark cannot be gamed by guessing a number on a 0 to 10 scale; models have to actually generate convincing dialogue.
- **Strong discrimination at the top.** A spread of more than 1300 Elo points between baseline and frontier models means the benchmark still has headroom even when models like o3 and GPT-5 are saturating older tests.[1]
- **Transparent methodology.** The full prompts, the judge prompt, the rubric, and a representative selection of model transcripts are all published, which makes the benchmark much easier to critique than closed-source evaluations.[2]
- **Cheap to run.** A rubric pass for a new model is a single-digit number of dollars in API spend, which is part of why eqbench.com tracks new releases so quickly.[4]

Limitations Paech and outside readers have noted:

- **Judge dependence.** The default judge is a Claude model, and even with position-bias controls, the choice of judge influences the ranking.[1][4]
- **English-only and culturally narrow.** All 45 scenarios are written in English and lean toward Western, urban, middle-class contexts (couples therapy, knowledge-work conflict, modern parenting).[2]
- **Small test set.** Forty-five scenarios is enough to discriminate between models, but the benchmark trades coverage for depth.[2][4]
- **Subjective ground truth.** Unlike a math benchmark, there is no objectively correct answer to a relationship dilemma. Paech describes the scores as "roughly indicative but not absolute truth".[4]

## Implementation

EQ-Bench 3 is open source under the [MIT license](https://github.com/EQ-bench/eqbench3), structured as a Python project that talks to OpenAI-compatible APIs for both the test model and the judge.[2] A typical setup:

```bash
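# Fetch the v3 harness and install its Python dependencies
git clone https://github.com/EQ-bench/eqbench3
cd eqbench3
pip install -r requirements.txt
# Make an API key available to the run ("..." is a placeholder for your own key)
export ANTHROPIC_API_KEY="..."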
```

A rubric-only pass is run with a flag; a full Elo pass requires baseline transcripts to compare against. Run results are saved to `eqbench3_runs.json` and `elo_results_eqbench3.json`, and can optionally be uploaded to the public leaderboard. The project supports parallel threading on both the generation and judging side, which is what makes a fresh model evaluable in roughly an afternoon of compute.[2]
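The parallelism described here follows a common pattern for I/O-bound API work, sketched below with Python's standard thread pool. The `fake_judge_call` function is a stand-in that sleeps in place of a network request, and `run_in_parallel` is a hypothetical helper, not a function from the eqbench3 codebase.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_in_parallel(task: Callable[[T], R], items: Iterable[T],
                    max_workers: int = 8) -> list[R]:
    """Apply `task` to every item using a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, items))


def fake_judge_call(scenario_id: int) -> dict:
    """Stand-in for a judge API request; sleeps to mimic network latency."""
    time.sleep(0.05)
    return {"scenario": scenario_id, "judged": True}


# 45 scenarios, as in the v3 test set, judged eight at a time.
results = run_in_parallel(fake_judge_call, range(45), max_workers=8)
print(len(results), results[0])
```

Threads suit this workload because each call spends nearly all its time waiting on a remote API, so the main constraint in practice is the provider's rate limit, not local compute.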

## Reception

EQ-Bench 3 has become one of the more cited community benchmarks for chat-style evaluation. The August 2025 leaderboard generated commentary on AI publications and aggregator sites, partly because horizon-alpha led the chart and partly because GPT-5's chat-tuned snapshot scored below an earlier 4o release.[5] Its companion benchmarks on eqbench.com (Creative Writing v3, Longform Creative Writing, the Spiral-Bench safety evaluation, and Judgemark, which evaluates judges themselves) form a small ecosystem of LLM-judged tests aimed at qualities that are hard to measure with multiple choice.[1]

## See also

- [Claude](/wiki/claude)
- [Claude Opus 4.6](/wiki/claude_opus_4_6)
- [Creative Writing v3](/wiki/creative_writing_v3)
- [Longform Creative Writing](/wiki/longform_creative_writing)
- [MMLU](/wiki/mmlu)
- [Theory of mind](/wiki/theory_of_mind)
- [Large language model](/wiki/large_language_model)
- [TrueSkill](/wiki/trueskill)
- [Benchmark](/wiki/benchmark)

## References

1. EQ-Bench Leaderboard, "EQ-Bench 3", eqbench.com, https://eqbench.com/about.html (accessed 2026).
2. EQ-bench, "eqbench3 GitHub repository", https://github.com/EQ-bench/eqbench3.
3. Paech, Samuel J. "EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models", arXiv:2312.06281, December 2023, https://arxiv.org/abs/2312.06281.
4. EQ-Bench, "About" page describing v3 methodology, judge selection, bias mitigation, and cost notes, https://eqbench.com/about.html.
5. Communeify, "The Great AI EQ Battle: 2025's Latest EQ-Bench Rankings Revealed", August 2025, https://www.communeify.com/en/blog/2025-ai-eq-bench-ranking-most-emotionally-intelligent-llm/.
6. LLM-Stats, "EQ-Bench Benchmark Leaderboard", https://llm-stats.com/benchmarks/eq-bench.

