# Creative Writing v3

> Source: https://aiwiki.ai/wiki/creative_writing_v3
> Updated: 2026-05-10
> Categories: AI Benchmarks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

| Creative Writing v3 | |
| --- | --- |
| **Overview** | |
| Full name | EQ-Bench Creative Writing Benchmark Version 3 |
| Abbreviation | CW v3 |
| Description | An LLM-judged creative writing benchmark using a hybrid rubric and Elo scoring system for enhanced discrimination between top models |
| Release date | 2025 |
| Latest version | 3.0 |
| Authors | Samuel J. Paech |
| Organization | EQ-Bench (independent research) |
| **Technical Details** | |
| Type | Creative Writing, Text Generation |
| Modality | Text |
| Task format | Generative writing prompts with rubric scoring and pairwise Elo |
| Number of tasks | 32 prompts (3 iterations each, 96 outputs total) |
| Total examples | 96 |
| Evaluation metric | Normalized Elo, rubric score, repetition, slop score |
| Domains | Fiction writing, humor, romance, spatial awareness, unusual first-person perspectives |
| Languages | English |
| **Performance** | |
| Human performance | Not reported |
| Baseline | DeepSeek R1 anchored to Elo 1500, ministral-3b anchored to Elo 200 |
| SOTA score | 1721.9 (normalized Elo, Grok 4.1 Thinking) |
| SOTA model | [Grok 4.1 Thinking](/wiki/grok_4_1) |
| SOTA date | 2025-2026 |
| Saturated | No |
| **Resources** | |
| Website | [Official leaderboard](https://eqbench.com/creative_writing.html) |
| GitHub | [Repository](https://github.com/EQ-bench/creative-writing-bench) |
| License | Open source |
| Predecessor | [Creative Writing v2](/wiki/creative_writing_v2) |

**Creative Writing v3** is an [artificial intelligence](/wiki/artificial_intelligence) benchmark that evaluates creative writing in [large language models](/wiki/large_language_models) (LLMs) using a hybrid framework combining isolated rubric scoring with pairwise Elo comparisons. Released in 2025 by [Samuel J. Paech](/wiki/samuel_paech) under the [EQ-Bench](/wiki/eq-bench) project, the benchmark uses a strong judge model (currently [Claude Sonnet 4](/wiki/claude_sonnet_4)) to score outputs across thirty-two prompts that target known weak spots of language models: humor, romance, spatial reasoning, and unusual first-person perspectives.[1][2]

The benchmark was built to fix the saturation problem of [Creative Writing v2](/wiki/creative_writing_v2), where the judge could no longer separate top models. Every aspect of v3 is tuned to make discrimination easier, from harder prompts to head-to-head matchups using a [Glicko-2](/wiki/glicko-2) rating system. The leaderboard at eqbench.com is one of the most-cited public references for ranking creative writing ability in modern LLMs.[1][3]

## overview

Creative Writing v3 splits assessment into two parts. The judge model first scores each generated piece against a rubric covering coherence, originality, voice, and craft. The judge then compares pairs of outputs from different models in head-to-head matchups, producing an Elo rating that reflects relative quality. The two layers combine into a final ranking, which the leaderboard normalizes against fixed anchor models so that scores stay comparable across runs.[1][2]

Paech chose prompts that push models into known weak spots. Humor is a long-standing weakness because most models default to puns rather than genuine comedy. Romance prompts test whether a model can produce emotional depth without sliding into [cliche](/wiki/ai_slop). Spatial awareness prompts expose how often LLMs lose track of who is standing where. First-person prompts request voices that are not the typical helpful narrator, like an unreliable witness or a non-human point of view.[1]

### motivation

The development of v3 was driven by several connected problems with earlier evaluations:

- The need for better discrimination between high-performing models, since v2 had saturated at the top.
- Limitations of pure rubric scoring once strong models cluster near the ceiling.
- Known judge biases such as length, position, and stylistic preference.
- The goal of exposing specific weaknesses rather than rewarding generic competence.

The leaderboard is intentionally pessimistic: a model that writes adequately on every prompt will land in the middle, not at the top.[1][2]

## technical architecture

### core components

| Component | Description |
| --- | --- |
| Prompt dataset | 32 prompts in `creative_writing_prompts_v3.json` |
| Generation system | Temperature 0.7, min_p 0.1 |
| Judge model | [Claude Sonnet 4](/wiki/claude_sonnet_4), recommended for leaderboard parity |
| Scoring framework | Hybrid rubric plus Elo using Glicko-2 |
| Anchor models | [DeepSeek R1](/wiki/deepseek_r1) (1500), ministral-3b (200) |

### evaluation methodology

A full evaluation runs in four stages. The model under test generates three completions for each of the 32 prompts, producing 96 outputs. The judge scores every output against a rubric. Sparse pairwise matchups are then run between the new model and a small set of leaderboard neighbors, giving an initial Elo estimate. Finally, broader pairwise comparisons are performed and the Glicko-2 update is applied, with the resulting Elo score normalized so the anchor models keep their reference values.[1][2]

The normalized Elo score (`elo_norm`) is the primary leaderboard metric. The rubric score is shown alongside it and is more directly interpretable, but less discriminative at the top end.[1][3]
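As a rough illustration of anchoring, the sketch below assumes the normalization is a simple linear rescaling that pins the two anchor models to 1500 and 200. The function name `normalize_elo` and the raw ratings are hypothetical, and the repository's actual procedure may differ.

```python
def normalize_elo(raw, raw_upper_anchor, raw_lower_anchor,
                  upper_target=1500.0, lower_target=200.0):
    """Linearly rescale a raw rating so both anchors land on their targets.

    Assumption: a plain two-point linear map. This is an illustration,
    not the benchmark's confirmed formula.
    """
    if raw_upper_anchor == raw_lower_anchor:
        raise ValueError("anchor ratings must differ")
    scale = (upper_target - lower_target) / (raw_upper_anchor - raw_lower_anchor)
    return lower_target + (raw - raw_lower_anchor) * scale


# Hypothetical raw ratings, for illustration only.
raw_ratings = {"deepseek-r1": 1580.0, "ministral-3b": 930.0, "new-model": 1645.0}

normalized = {
    name: normalize_elo(rating, raw_ratings["deepseek-r1"], raw_ratings["ministral-3b"])
    for name, rating in raw_ratings.items()
}
```

Under this assumption, a model rated above the upper anchor before rescaling stays above 1500 afterwards, and the ordering of all models is preserved.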

#### key metrics

| Metric | Description |
| --- | --- |
| Rubric score | Aggregate across rubric criteria |
| Elo score (normalized) | Relative ranking from pairwise comparisons |
| Repetition | Frequency of repeated top words, bigrams, and trigrams |
| Slop score | Match against curated list of overused LLM phrases |
| Length | Average output length in characters |
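The repetition metric can be pictured with a small sketch. It assumes repetition is measured as the share of all n-grams taken up by the most frequent repeated ones; the function names, the tokenizer, and the `top_k` cutoff are hypothetical and are not taken from the benchmark's code.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return all consecutive n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repetition_rate(text, n, top_k=5):
    """Share of n-grams covered by the top_k most frequent repeated n-grams.

    Assumption: an illustrative definition, not the benchmark's exact one.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    # Only n-grams that occur more than once count as repetition.
    repeated = [count for _, count in counts.most_common(top_k) if count > 1]
    return sum(repeated) / len(grams)


sample = "the cat sat the cat ran the cat"
word_rate = repetition_rate(sample, 1)
bigram_rate = repetition_rate(sample, 2)
trigram_rate = repetition_rate(sample, 3)
```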

## test structure

### prompt categories

| Category | Example challenge |
| --- | --- |
| Humor | Writing genuinely funny content rather than puns |
| Romance | Authentic emotional connection without cliche |
| Spatial awareness | Accurate spatial reasoning across a scene |
| Unique perspectives | Non-standard or non-human narrator voices |
| Character development | Multi-dimensional personalities under pressure |
| Plot construction | Coherent story progression in short word counts |

The full prompt list is published in the [GitHub repository](https://github.com/EQ-bench/creative-writing-bench).[2]

### generation parameters

The sampling configuration is fixed for leaderboard parity: 3 generations per prompt (96 outputs total), temperature 0.7, min_p 0.1, and output truncation to 4000 characters before scoring. Using the same sampling settings across models matters because creative output is unusually sensitive to temperature and decoding strategy.[2]
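The fixed settings can be captured in a small configuration object. This is a sketch only: the class and function names are hypothetical, and just the numeric values come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    """Leaderboard-parity settings as described in the text."""
    iterations: int = 3        # generations per prompt
    temperature: float = 0.7
    min_p: float = 0.1
    max_chars: int = 4000      # truncation applied before scoring


def planned_generations(num_prompts, config):
    """Total outputs produced for one model under test."""
    return num_prompts * config.iterations


def truncate_for_judging(text, config):
    """Cut an output to the character limit before it reaches the judge."""
    return text[:config.max_chars]


config = SamplingConfig()
total_outputs = planned_generations(32, config)
clipped = truncate_for_judging("x" * 5000, config)
```

Freezing the dataclass reflects the point that these values should not vary between models on the same leaderboard.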

## evaluation criteria

### rubric dimensions

| Category | Criteria examples |
| --- | --- |
| Coherence | Logical flow, internal consistency, clarity |
| Creativity | Originality, unexpected elements, imagination |
| Style | Voice, tone, prose quality |
| Technical | Grammar, punctuation, structure |
| Engagement | Hook, pacing, reader interest |
| Character | Depth, believability, development |
| Dialogue | Natural speech, distinct voices |
| Description | Vivid imagery, sensory details |

### pairwise judging

In the pairwise stage the judge compares two outputs and picks a winner. The comparison prompt directs the judge to weigh character authenticity, originality, writing quality, plot coherence, instruction adherence, [worldbuilding](/wiki/worldbuilding), cliche avoidance, verbosity control, and metaphor appropriateness. Each matchup feeds into the Glicko-2 update, with margin of victory factored into the rating change.[3]
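To show how a margin-aware matchup can move ratings, the sketch below substitutes the classic Elo update for Glicko-2, which additionally tracks rating uncertainty and is too long to reproduce here. The K-factor, the `max_margin` scale, and all function names are assumptions made for illustration.

```python
def expected_score(rating_a, rating_b):
    """Classic Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def outcome_from_margin(margin, max_margin=5.0):
    """Map a signed judge margin to a fractional outcome in [0, 1].

    Assumption: positive margins favor A, and a margin of max_margin
    counts as a decisive win.
    """
    clipped = max(-max_margin, min(max_margin, margin))
    return 0.5 + 0.5 * clipped / max_margin


def elo_update(rating_a, rating_b, outcome_a, k=32.0):
    """Return updated (rating_a, rating_b) after one matchup."""
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# A narrow win moves the ratings less than a decisive one.
narrow = elo_update(1500.0, 1500.0, outcome_from_margin(2.5))
decisive = elo_update(1500.0, 1500.0, outcome_from_margin(5.0))
```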

### bias mitigation

#### controlled biases

| Bias type | Mitigation |
| --- | --- |
| Length bias | Output truncation to 4000 characters |
| Position bias | A vs B and B vs A averaged |
| Verbosity bias | Judge prompted against padding |
| Forced metaphor | Rubric criteria penalize incoherent imagery |
| Model identity bias | Model names hidden from the judge during pairwise comparison |
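
The position-bias row can be illustrated with a minimal sketch: the same pair is judged in both orders and the two results are averaged. The `Judge` signature and the toy judge are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable

# Hypothetical judge interface: returns a signed margin where a positive
# value means the text shown first is preferred.
Judge = Callable[[str, str], float]


def debiased_margin(judge: Judge, text_a: str, text_b: str) -> float:
    """Average the A-first and B-first judgments, expressed in favor of A."""
    forward = judge(text_a, text_b)    # A shown first; positive favors A
    backward = judge(text_b, text_a)   # B shown first; positive favors B
    return (forward - backward) / 2.0


# Toy judge with a built-in preference for whichever text is shown first.
HIDDEN_QUALITY = {"draft A": 7.0, "draft B": 5.0}
FIRST_POSITION_BONUS = 0.5


def toy_judge(first: str, second: str) -> float:
    return HIDDEN_QUALITY[first] - HIDDEN_QUALITY[second] + FIRST_POSITION_BONUS


biased = toy_judge("draft A", "draft B")
fair = debiased_margin(toy_judge, "draft A", "draft B")
```

In this toy setup the constant first-position bonus cancels out, leaving only the underlying quality gap. A real judge's bias is unlikely to be a clean constant, so averaging reduces the effect rather than removing it.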

#### uncontrolled biases

The documentation acknowledges remaining biases: judge self-bias (preferring prose in a similar style), positivity or negativity preference, NSFW content aversion ("smut bias"), stylistic preferences inherited from the judge's training, and slop bias rewarding familiar tropes. Paech has run cross-validation experiments using GPT-4.1 as an alternative judge; discrepancies are small but real, which is why the leaderboard recommends running with the same judge for parity.[3][5]

## version 3 improvements

### key enhancements from v2

| Improvement | Impact |
| --- | --- |
| Judge upgrade to Claude Sonnet 4 | Better discrimination at the top |
| Metaphor detection in rubric | Catches forced or incoherent imagery |
| Paragraph scoring scaled for one-sentence paragraphs | Style normalization |
| Elo integration on top of rubric | Sharpens top-tier differences |
| Glicko-2 ratings with uncertainty | Robust rankings as new models join |
| Anchored Elo normalization | Scores comparable across leaderboard updates |

### slop detection

Creative Writing v3 includes a slop detection layer. Outputs are checked against a master list of phrases that appear unnaturally often in LLM-generated text, maintained in the slop-forensics toolkit derived from analysis of outputs from ten language models against human baselines. The scoring formula weights three components: roughly 60 percent slop words, 25 percent "not X but Y" patterns, and 15 percent slop trigrams. A high slop score does not directly lower a model's Elo, but is published alongside the leaderboard so readers can see which models lean on cliche even when they are technically competent.[4]
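The weighting can be sketched as follows. Only the approximate 60/25/15 weights come from the description above; the miniature word lists, the regular expression, and the per-token normalization are assumptions made for illustration and do not reproduce the slop-forensics implementation.

```python
import re

# Hypothetical miniature lists; the real lists are far larger.
SLOP_WORDS = {"tapestry", "testament", "shimmering"}
SLOP_TRIGRAMS = {("shivers", "down", "her"), ("a", "testament", "to")}
# Rough pattern for "not X but Y" with one to four words between.
NOT_X_BUT_Y = re.compile(r"\bnot\s+(?:\w+\s+){1,4}?but\b", re.IGNORECASE)

# Approximate weights as stated in the text.
WEIGHTS = {"words": 0.60, "not_x_but_y": 0.25, "trigrams": 0.15}


def component_rates(text):
    """Per-token hit rate for each component (assumed normalization)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = max(len(tokens), 1)
    trigrams = zip(tokens, tokens[1:], tokens[2:])
    return {
        "words": sum(token in SLOP_WORDS for token in tokens) / total,
        "not_x_but_y": len(NOT_X_BUT_Y.findall(text)) / total,
        "trigrams": sum(gram in SLOP_TRIGRAMS for gram in trigrams) / total,
    }


def slop_score(text):
    """Weighted sum of the three component rates."""
    rates = component_rates(text)
    return sum(WEIGHTS[name] * rates[name] for name in WEIGHTS)


sloppy = "It was not fear but awe, a testament to the shimmering tapestry."
clean = "The dog barked at the mail carrier."
sloppy_score = slop_score(sloppy)
clean_score = slop_score(clean)
```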

## performance analysis

### top models on the leaderboard

As of late 2025 and early 2026, the top of the leaderboard is dominated by [Grok](/wiki/grok) variants from xAI and large open-weight models from the [Qwen 3](/wiki/qwen3) family at Alibaba Cloud. Rankings shift as new models are added, so the table below is a snapshot.[1][3]

| Rank | Model | Normalized Elo | Notes |
| --- | --- | --- | --- |
| 1 | [Grok 4.1 Thinking](/wiki/grok_4_1) | 1721.9 | Strong on humor and unusual perspectives |
| 2 | [Grok 4.1](/wiki/grok_4_1) | 1708.6 | Non-thinking variant, still in the top tier |
| anchor | [DeepSeek R1](/wiki/deepseek_r1) | 1500 | Anchor model used to fix the scale |
| anchor | ministral-3b | 200 | Lower anchor for the Elo scale |

The anchors are fixed pegs that hold the scale steady; when new models are evaluated, their Elo scores are normalized against these anchors so 1500 always means roughly the same thing. The llm-stats mirror surfaces additional [Qwen 3](/wiki/qwen3) entries (Qwen3-235B-A22B-Instruct-2507, Qwen3-VL-235B-A22B variants, the Qwen3-Next-80B-A3B family) that score as strong open-weight competitors without quite reaching the Grok 4.1 line.[2][3]

### performance insights

- Wide spread between top and bottom of the leaderboard, which was the goal of v3.
- Distinct writing personalities; models with similar rubric scores still feel different in pairwise comparisons.
- Consistent struggles with humor and spatial reasoning, even at the top.
- Reasoning-augmented Thinking variants tend to outperform their non-thinking siblings on creative tasks, which is mildly surprising given that creative writing is not a math problem.

The [DeepSeek](/wiki/deepseek) family was the standout in early 2025 runs and remains the reference point; DeepSeek R1 is the chosen anchor at 1500 because of its consistent performance across both rubric and Elo stages.[1][2]

## implementation

The benchmark is run from the [creative-writing-bench](https://github.com/EQ-bench/creative-writing-bench) repository. After cloning, install dependencies with `pip install -r requirements.txt` and download the required NLTK data (`punkt`, `cmudict`). Dependencies include `requests`, `python-dotenv`, `numpy`, `scipy`, `tqdm`, `glicko2`, `nltk`, and `joblib`. API keys for the test model and the judge are configured in a `.env` file.[2]

A standard evaluation run uses Claude Sonnet 4 as the judge for parity with the public leaderboard:

```bash
python3 creative_writing_bench.py \
    --test-model "provider/model-name" \
    --judge-model "anthropic/claude-sonnet-4" \
    --runs-file "creative_bench_runs.json" \
    --iterations 3 \
    --threads 500
```

The `--runs-file` argument controls where intermediate results and matchups are stored; the file shipped in the repository should be used to compare against the public leaderboard, because it accumulates pairwise results reused across model evaluations. A full evaluation typically costs around ten US dollars in API spend per model.[2]

## applications

| Application | Use case |
| --- | --- |
| Model development | Tracking creative ability across training runs |
| Architecture comparison | Evaluating design choices across model families |
| Prompt engineering | Probing how prompt phrasing changes creative output |
| Bias studies | Surfacing AI writing patterns and slop tendencies |
| Judge meta-evaluation | Cross-validating judge models, feeding [Judgemark](/wiki/judgemark) |

The leaderboard is used to assess model suitability for fiction or copywriting, vet AI writing assistants, test story generation in games and interactive fiction, and screen AI co-writing tools for slop. Writers and hobbyists use it to pick models for [creative AI](/wiki/creative_ai) projects, since scores correlate well with subjective impressions.[1]

## challenges and failure modes

Common failure modes the rubric and slop detector are designed to catch include formulaic structure, cliche overuse ("shivers down the spine," "breath she didn't know she was holding"), emotional shallowness, forced creativity that substitutes odd word choices for real originality, and inconsistent tone that drifts mid-piece. Even top models struggle with genuine humor, emotional depth in romance scenes, spatial consistency across a scene, original narrator voice, and sustained complex metaphors.[1][4]

## limitations and future directions

| Limitation | Description |
| --- | --- |
| Subjective nature | Creative quality is inherently subjective |
| Judge dependency | Relies on a single judge in the canonical run |
| English only | Prompts and judging are in English |
| Genre constraints | Limited coverage of poetry and screenwriting |
| Length limits | The 4000-character truncation may penalize pieces that build slowly |
| Cost | Around ten US dollars per full run |

Future directions discussed by Paech include multi-judge systems with several models voting on each comparison, human baselines from paid writers, genre expansion (already partially handled by [Longform Creative Writing](/wiki/longform_creative_writing)), multilingual support, and tests of multi-turn collaboration between writer and model.[1][2]

## related benchmarks

Creative Writing v3 sits inside a wider [EQ-Bench](/wiki/eq-bench) family:

- [EQ-Bench 3](/wiki/eq-bench_3): emotional intelligence in role-play scenarios.
- [Longform Creative Writing](/wiki/longform_creative_writing): extended narrative generation.
- [Spiral-Bench](/wiki/spiral-bench): a related benchmark by Paech.
- [BuzzBench](/wiki/buzzbench): humor using British comedy transcripts.
- [DiploBench](/wiki/diplobench): strategic writing in Diplomacy.
- [Judgemark](/wiki/judgemark): meta-evaluation of LLM judges.
- [WritingBench](/wiki/writingbench): comprehensive writing evaluation.
- [MAGI-Hard](/wiki/magi-hard): discriminative subset of MMLU and AGIEval.

## significance

Creative Writing v3 has become one of the standard public references for ranking creative writing ability in LLMs, alongside [Chatbot Arena](/wiki/chatbot_arena). Its hybrid scoring is more discriminative than pure rubric or pure preference voting, and its slop and repetition metrics give a useful diagnostic layer beyond a single number. For a benchmark one person built and maintains, it has had outsized influence on how AI labs talk about writing ability.[1][3]

## see also

- [Creative writing](/wiki/creative_writing)
- [Natural language generation](/wiki/natural_language_generation)
- [AI-generated content](/wiki/ai-generated_content)
- [Large language models](/wiki/large_language_models)
- [Computational creativity](/wiki/computational_creativity)
- [AI slop](/wiki/ai_slop)
- [LLM-as-a-judge](/wiki/llm_as_a_judge)

## references

1. Paech, Samuel J. "EQ-Bench Creative Writing v3 Leaderboard." EQ-Bench, 2025. https://eqbench.com/creative_writing.html
2. Paech, Samuel J. "Creative Writing Bench v3 (GitHub Repository)." EQ-Bench organization on GitHub, 2025. https://github.com/EQ-bench/creative-writing-bench
3. "Creative Writing v3 Leaderboard." llm-stats.com benchmarks index, 2025-2026. https://llm-stats.com/benchmarks/creative-writing-v3
4. Paech, Samuel J. "Slop Score." EQ-Bench, 2025. https://eqbench.com/slop-score.html
5. Paech, Samuel J. "About EQ-Bench." EQ-Bench, 2025. https://eqbench.com/about.html