EQ-Bench 3
Last reviewed
May 10, 2026
Sources
6 citations
Review status
Source-backed
Revision
v2 · 2,400 words
| EQ-Bench 3 | |
|---|---|
| Overview | |
| Full name | Emotional Intelligence Benchmark, Version 3 |
| Abbreviation | EQ-Bench 3 |
| Description | An LLM-judged benchmark testing emotional intelligence in large language models through multi-turn role-play and analysis tasks |
| Predecessor lineage | EQ-Bench v1 (2023, 60 questions), EQ-Bench v2 (2024, 171 questions) |
| Release of v3 | 2025 |
| Author | Samuel J. Paech (Sam Paech) |
| Organization | Independent research, eqbench.com |
| Technical Details | |
| Type | Emotional intelligence, social reasoning, theory of mind |
| Modality | Text |
| Task format | Multi-turn dialogues (3 turns) and transcript-analysis tasks |
| Number of scenarios | 45 |
| Default judge model | Claude Sonnet 3.7 (with later runs using Claude Opus 4.6) |
| Evaluation metrics | Rubric scoring (0 to 100) and pairwise Elo via TrueSkill |
| Domains | Relationship conflict, parenting, workplace dynamics, mediation, social dilemmas |
| Languages | English |
| Performance | |
| Anchor (top) | OpenAI o3 at Elo 1500 |
| Anchor (baseline) | Llama 3.2-1B at Elo 200 |
| Notable top model (Aug 2025) | horizon-alpha at 1568 |
| Resources | |
| Website | eqbench.com |
| Original paper | arXiv:2312.06281 |
| GitHub (v3) | EQ-bench/eqbench3 |
| GitHub (legacy) | EQ-bench/EQ-Bench |
| License | MIT |
EQ-Bench 3 is an artificial intelligence benchmark that measures the emotional intelligence of large language models through challenging multi-turn role-plays and transcript-analysis tasks, with results graded by an LLM judge.[1][2] The project is the third major iteration of the EQ-Bench series created by independent researcher Samuel J. Paech, whose original benchmark was published as a preprint on arXiv in December 2023.[3] EQ-Bench 3 is hosted on the eqbench.com leaderboard alongside companion benchmarks for creative writing, longform creative writing, and judge calibration.[1]
Where the first two versions of EQ-Bench tested a model's ability to predict the intensity of emotions in short dialogues, version 3 takes a different approach. The model is dropped into messy three-turn conversations (parenting fights, ugly breakups, awkward management problems) and has to respond in character, with extra introspection and theory-of-mind sections that expose how it is reasoning about the people in the scene.[1][2] A judge model, by default a Claude model from Anthropic, then scores the transcript on a rubric and compares it head-to-head against other models to compute an Elo rating.[1][4]
The original EQ-Bench paper argued that traditional knowledge-heavy benchmarks like MMLU miss something important about modern chatbots: their day-to-day usefulness depends on whether they can read a room. Paech's first version asked models to predict the intensity of four emotions felt by characters in a 60-item dialogue set, scoring answers against reference values rated 0 to 10.[3] That design correlated strongly with general intelligence (r = 0.97 against MMLU on a sample of leading models) but had a ceiling problem. By 2024, frontier models were saturating the scale, and v2's expansion to 171 items only delayed the inevitable.[3][2]
EQ-Bench 3, released in 2025, is a deliberate reset. Paech moved from numeric prediction to free-form generation, replaced programmatic scoring against reference values with rubric and pairwise judging by an LLM, and cut the test set from 171 short prompts to 45 long, intentionally uncomfortable scenarios.[2][4]
The v3 test set contains 45 scenarios, most of them pre-written prompts that span three turns.[2] In a typical scenario, the user opens with a setup that establishes context (a couple is fighting about money, a manager is debating how to deliver bad news to a direct report), and a second user turn injects a complication: the other person becomes more defensive, a new fact emerges, or the user pivots toward an emotionally loaded request.[2][4] The model under test plays a fixed character across all three turns and must keep that role consistent while adapting to whatever the user throws at it.
A smaller subset of items uses an analysis format. Instead of role-playing, the model reads a transcript of an existing emotional exchange and is asked to identify what is psychologically interesting in it, what each party is likely thinking and feeling, and where the conversation is failing or succeeding.[1][2]
Each response is required to follow a fixed scaffold so the judge can see how the model is reasoning, not just what it says.[1][2] The scaffold has four parts:
| Section | Purpose |
|---|---|
| "I'm thinking & feeling" | The model's first-person reaction to the scene, exposing introspection |
| "They're thinking & feeling" | An attempt at theory of mind, describing the other party's likely state |
| In-character response | The actual reply the model would send to the user |
| Self-debrief | A short reflection on the choice the model just made |
This structure forces models to lay out their psychological model of the situation before they answer, which makes empty platitudes and shallow validation easier for the judge to spot.[2]
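As a rough illustration of the scaffold in the table above, the four sections can be modelled as a small record with a completeness check. This is a minimal sketch: the class name, field names, and the `missing_sections` helper are hypothetical and are not taken from the EQ-Bench 3 codebase, which may represent responses differently.

```python
from dataclasses import dataclass, fields


# Hypothetical field names for illustration only; the benchmark's real
# response format and section markers are not specified in this article.
@dataclass
class ScaffoldedResponse:
    my_thinking_feeling: str     # "I'm thinking & feeling"
    their_thinking_feeling: str  # "They're thinking & feeling"
    in_character_reply: str      # the reply the model would send to the user
    self_debrief: str            # short reflection on the choice just made


def missing_sections(resp: ScaffoldedResponse) -> list[str]:
    """Return the names of any sections that are empty or whitespace-only."""
    return [f.name for f in fields(resp) if not getattr(resp, f.name).strip()]
```

A harness built along these lines could reject or re-request a response whose `missing_sections` list is non-empty before spending judge calls on it.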
EQ-Bench 3 is an LLM-judged benchmark. The default judge is a Claude model from Anthropic, originally Claude Sonnet 3.7 and later Claude Opus 4.6 for headline runs, though any LLM exposed via an OpenAI-compatible endpoint can be plugged in.[1][2] The judge does two passes per run.
The rubric pass assigns numeric scores on a fixed list of criteria. Six criteria count toward the headline rubric score: demonstrated empathy, pragmatic emotional intelligence, depth of insight, social dexterity, emotional reasoning, and message tailoring.[1][4] The same pass also reports a longer list of stylistic readings (warmth, validating, challenging, moralising, compliant, conversational, humanlike, analytical, reactive, safety conscious, boundary setting), but those are descriptive only and do not feed the score.[1] Per Paech's own cost notes, a single rubric iteration runs about $1.50 in judge-model API spend.[4]
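The split between scored and descriptive criteria can be sketched as follows. The sketch assumes each criterion arrives already on a 0 to 100 scale and that the six scored criteria are weighted equally; neither assumption is confirmed by the sources, and the dictionary keys are hypothetical names chosen for illustration.

```python
from statistics import mean

# The six criteria the text names as counting toward the headline score.
# Key spellings are hypothetical.
SCORED_CRITERIA = (
    "demonstrated_empathy",
    "pragmatic_ei",
    "depth_of_insight",
    "social_dexterity",
    "emotional_reasoning",
    "message_tailoring",
)


def headline_rubric_score(judge_scores: dict[str, float]) -> float:
    """Unweighted mean of the six scored criteria.

    Assumption: every criterion is on a 0-100 scale and equally weighted.
    Descriptive readings such as "warmth" are ignored if present.
    """
    missing = [c for c in SCORED_CRITERIA if c not in judge_scores]
    if missing:
        raise KeyError(f"missing criteria: {missing}")
    return mean(judge_scores[c] for c in SCORED_CRITERIA)
```

Passing a dictionary that also contains stylistic readings leaves the result unchanged, which mirrors the statement that those readings do not feed the score.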
The Elo pass is what produces the public leaderboard ranking. The judge is shown two anonymised transcripts at a time, both from the same scenario, and asked to compare them across eight dimensions: demonstrated empathy, pragmatic emotional intelligence, depth of insight, social dexterity, emotional reasoning, appropriate validation/challenging, message tailoring, and overall EQ.[1][4] Win margins on those dimensions are aggregated into a margin-weighted TrueSkill update, and TrueSkill scores are linearly mapped onto an Elo-style scale anchored at OpenAI o3 = 1500 and Llama 3.2-1B = 200.[1][2] A full Elo run costs roughly $10 to $20 in judge calls.[4]
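The linear mapping onto the anchored scale can be written out directly. The following is a minimal sketch, assuming the mapping is applied to each model's TrueSkill mean and that the two anchor models' means are known; the function and parameter names are hypothetical.

```python
ANCHOR_HIGH_ELO = 1500.0  # OpenAI o3, per the text
ANCHOR_LOW_ELO = 200.0    # Llama 3.2-1B, per the text


def trueskill_to_elo(mu: float, mu_high: float, mu_low: float) -> float:
    """Linearly rescale a TrueSkill mean so both anchors land on fixed values.

    mu_high and mu_low are the TrueSkill means of the top and baseline
    anchor models (assumed inputs).
    """
    if mu_high == mu_low:
        raise ValueError("anchor ratings must differ")
    slope = (ANCHOR_HIGH_ELO - ANCHOR_LOW_ELO) / (mu_high - mu_low)
    return ANCHOR_LOW_ELO + slope * (mu - mu_low)
```

Because the mapping is linear rather than clamped, a model whose TrueSkill mean exceeds the top anchor's maps above 1500, which is consistent with leaderboard entries scoring higher than o3.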
A recurring criticism of LLM-as-judge setups is that judge models reward verbosity, validation, or models from their own family. EQ-Bench 3 explicitly tries to control for the easier ones.[1][4]
| Bias source | Mitigation |
|---|---|
| Length | Transcripts are truncated to a standardised length before pairwise judging |
| Position (A vs B) | Each pair is judged twice, once as A then B, once as B then A, and results are averaged |
| Named participants | Participants are renamed to neutral codes such as "A0488" so the judge cannot infer the model from a self-reference |
| Adversarial prompting | Tested against "be extremely warm and validating" and "be very concise" instructions; both produced under three percent score change |
Things the project does not fully fix and openly flags: judge self-bias and judge family bias (estimated at 0 to 10 percent on Paech's own tests), and broader cultural or stylistic preferences baked into the judge model.[4] Full transcripts are published so that any reader can sanity-check a contested ranking by hand.
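Two of the mitigations in the table above, neutral renaming and position swapping, are simple enough to sketch. The code below is illustrative only: the `Judge` signature, the helper names, and the exact code format are assumptions, and a real judge call would go to an LLM rather than a local function.

```python
import random
from typing import Callable

# Hypothetical judge signature: takes (first, second) transcripts and returns
# a signed margin that is positive when the FIRST transcript is judged better.
Judge = Callable[[str, str], float]


def neutral_code(rng: random.Random) -> str:
    """Make an opaque label such as 'A0488' (format assumed from the example)."""
    return f"A{rng.randrange(10000):04d}"


def anonymise(transcript: str, model_name: str, code: str) -> str:
    """Replace the model's name with a neutral code before judging."""
    return transcript.replace(model_name, code)


def position_balanced_margin(judge: Judge, a: str, b: str) -> float:
    """Judge the pair in both orders and average, from transcript a's view."""
    a_first = judge(a, b)  # positive favours a
    b_first = judge(b, a)  # positive favours b, so it is negated below
    return (a_first - b_first) / 2.0
```

Under this scheme a judge that always favours whichever transcript it sees first yields a balanced margin of zero, so pure position bias cancels out.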
Paech reports that rubric scores are highly stable across re-runs, with a standard deviation of about 0.75 points on a mean of around 77 over ten iterations.[4] Elo scores need more iterations to settle because TrueSkill needs enough head-to-head matchups to converge, but the published rankings stabilise once a model has been compared against the full slate of anchors and peers.[1]
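To put the reported stability figure in proportion, the standard deviation can be expressed as a fraction of the mean. The sketch below shows how such a figure would be computed over repeated runs; the helper name is hypothetical, and only the 0.75 and 77 values come from the text.

```python
from statistics import mean, stdev


def run_stability(scores: list[float]) -> tuple[float, float, float]:
    """Return (mean, sample standard deviation, std as a fraction of the mean)."""
    m = mean(scores)
    s = stdev(scores)  # sample standard deviation (n - 1 denominator)
    return m, s, s / m


# Using the figures quoted in the text directly:
# 0.75 points of spread on a mean of about 77.
relative_spread = 0.75 / 77  # just under 1 percent of the mean
```

A relative spread of this size suggests that rubric differences of a few points between models are larger than run-to-run noise, although this says nothing about systematic judge bias.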
| Aspect | v1 (2023) | v2 (2024) | v3 (2025) |
|---|---|---|---|
| Format | Numeric emotion-intensity prediction | Same format, expanded set | Multi-turn role-play and analysis |
| Items | 60 dialogue prompts | 171 dialogue prompts | 45 long scenarios |
| Scoring | Ratings of 0 to 10, normalised to sum to 10 | Full-scale with curved penalties | Rubric (0 to 100) plus pairwise Elo |
| Evaluator | Programmatic comparison to reference | Programmatic comparison to reference | LLM judge (Claude family by default) |
| Languages | English | English plus German in v2.1 | English |
| Output style | Single numeric answer per emotion | Single numeric answer per emotion | Free-form, four-section response |
| Score distribution | Saturating among frontier models | Heavily compressed at the top | Spread of more than 1300 Elo points across the field |
The move from short-answer numeric prediction to long-form judged dialogue is the headline change. v1 and v2 measured how well a model calibrated its estimates of someone else's emotions; v3 measures whether the model can hold a realistic conversation while laying out its own reactions and its read on the other party.[2][3]
The public leaderboard is hosted at eqbench.com and updates as Paech adds models. Published rankings as of August 2025 (with the leaderboard anchored at o3 = 1500) included the following top entries:[5]
| Rank | Model | Elo | Organisation |
|---|---|---|---|
| 1 | horizon-alpha | 1568 | Unattributed (cloaked release) |
| 2 | Kimi-K2-Instruct | 1565 | Moonshot AI |
| 3 | OpenAI o3 | 1500 | OpenAI (anchor) |
| 4 | Gemini 2.5 Pro preview-06-05 | 1470 | Google DeepMind |
| 5 | chatgpt-4o-latest-2025-03-27 | 1370 | OpenAI |
| 6 | gpt-5-chat-latest-2025-08-07 | 1357 | OpenAI |
| 7 | chatgpt-4o-latest-2025-04-25 | 1320 | OpenAI |
| 8 | GLM-4.5 | 1311 | Zhipu AI |
| 9 | OpenAI o4-mini | 1291 | OpenAI |
| 10 | Claude Opus 4 | 1290 | Anthropic |
Later leaderboard captures show xAI's Grok 4.1 Thinking and Grok 4.1 sitting near the top of the chart with Elo scores in the high 1580s, as the leaderboard has continued to absorb new releases and judge updates.[1][6]
A few patterns are worth flagging because they cut against the usual story benchmark blogs tell.
Because the benchmark is anchored, scores can be read as roughly comparable across runs, but Paech is clear that the spread between any two adjacent models is small relative to judge noise.[1]
Things EQ-Bench 3 does well:
Limitations Paech and outside readers have noted:
EQ-Bench 3 is open source under the MIT license, structured as a Python project that talks to OpenAI-compatible APIs for both the test model and the judge.[2] A typical setup:
```bash
git clone https://github.com/EQ-bench/eqbench3
cd eqbench3
pip install -r requirements.txt
export ANTHROPIC_API_KEY="..."
```
A rubric-only pass is run with a flag; a full Elo pass requires baseline transcripts to compare against. Run results are saved to eqbench3_runs.json and elo_results_eqbench3.json, and can optionally be uploaded to the public leaderboard. The project supports parallel threading on both the generation and judging side, which is what makes a fresh model evaluable in roughly an afternoon of compute.[2]
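The internal layout of the result files is not described here, so a cautious first step after a run is to inspect them without assuming a schema. The helper below is a hypothetical sketch that only assumes the file holds a JSON object at the top level; that assumption is not confirmed by the sources.

```python
import json
from pathlib import Path


def top_level_keys(path: str) -> list[str]:
    """List the top-level keys of a results file without assuming its schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        # Assumption violated: the file is not a JSON object.
        raise TypeError(
            f"expected a JSON object at the top level, got {type(data).__name__}"
        )
    return sorted(data)
```

Calling `top_level_keys("eqbench3_runs.json")` in the project directory after a run would show how runs are keyed before any further parsing is written.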
EQ-Bench 3 has become one of the more widely cited community benchmarks for chat-style evaluation. The August 2025 leaderboard drew commentary in AI publications and on aggregator sites, partly because horizon-alpha led the chart and partly because GPT-5's chat-tuned snapshot scored below an earlier 4o release.[5] Its companion benchmarks on eqbench.com (Creative Writing v3, Longform Creative Writing, the Spiral-Bench safety evaluation, and Judgemark, which evaluates judges themselves) form a small ecosystem of LLM-judged tests aimed at qualities that are hard to measure with multiple choice.[1]