Creative Writing v3
Last reviewed
May 10, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,419 words
| Creative Writing v3 | |
|---|---|
| Overview | |
| Full name | EQ-Bench Creative Writing Benchmark Version 3 |
| Abbreviation | CW v3 |
| Description | An LLM-judged creative writing benchmark using a hybrid rubric and Elo scoring system for enhanced discrimination between top models |
| Release date | 2025 |
| Latest version | 3.0 |
| Authors | Samuel J. Paech |
| Organization | EQ-Bench (independent research) |
| Technical Details | |
| Type | Creative Writing, Text Generation |
| Modality | Text |
| Task format | Generative writing prompts with rubric scoring and pairwise Elo |
| Number of tasks | 32 prompts (3 iterations each, 96 outputs total) |
| Total examples | 96 |
| Evaluation metric | Normalized Elo, rubric score, repetition, slop score |
| Domains | Fiction writing, humor, romance, spatial awareness, unusual first-person perspectives |
| Languages | English |
| Performance | |
| Human performance | Not reported |
| Baseline | DeepSeek R1 anchored to Elo 1500, ministral-3b anchored to 200 |
| SOTA score | ~1721 (normalized Elo, Grok 4.1 Thinking) |
| SOTA model | Grok 4.1 Thinking |
| SOTA date | 2025-2026 |
| Saturated | No |
| Resources | |
| Website | Official leaderboard |
| GitHub | Repository |
| License | Open source |
| Predecessor | Creative Writing v2 |
Creative Writing v3 is an artificial intelligence benchmark that evaluates creative writing in large language models (LLMs) using a hybrid framework combining isolated rubric scoring with pairwise Elo comparisons. Released in 2025 by Samuel J. Paech under the EQ-Bench project, the benchmark uses a strong judge model (currently Claude Sonnet 4) to score outputs across thirty-two prompts that target known weak spots of language models: humor, romance, spatial reasoning, and unusual first-person perspectives.[1][2]
The benchmark was built to fix the saturation problem of Creative Writing v2, where the judge could no longer separate top models. Every aspect of v3 is tuned to make discrimination easier, from harder prompts to head-to-head matchups using a Glicko-2 rating system. The leaderboard at eqbench.com is one of the most-cited public references for ranking creative writing ability in modern LLMs.[1][3]
Creative Writing v3 splits assessment into two parts. The judge model first scores each generated piece against a rubric covering coherence, originality, voice, and craft. The judge then compares pairs of outputs from different models in head-to-head matchups, producing an Elo rating that reflects relative quality. The two layers combine into a final ranking, which the leaderboard normalizes against fixed anchor models so that scores stay stable across runs.[1][2]
Paech chose prompts that push models into known weak spots. Humor is a long-standing weakness because most models default to puns rather than genuine comedy. Romance prompts test whether a model can produce emotional depth without sliding into cliche. Spatial awareness prompts expose how often LLMs lose track of who is standing where. First-person prompts request voices that are not the typical helpful narrator, like an unreliable witness or a non-human point of view.[1]
The development of v3 was driven by several connected problems with earlier evaluations:
The leaderboard is intentionally pessimistic: a model that writes adequately on every prompt will land in the middle, not at the top.[1][2]
| Component | Description |
|---|---|
| Prompt dataset | 32 prompts in creative_writing_prompts_v3.json |
| Generation system | Temperature 0.7, min_p 0.1 |
| Judge model | Claude Sonnet 4, recommended for leaderboard parity |
| Scoring framework | Hybrid rubric plus Elo using Glicko-2 |
| Anchor models | DeepSeek R1 (1500), ministral-3b (200) |
A full evaluation runs in four stages. The model under test generates three completions for each of the 32 prompts, producing 96 outputs. The judge scores every output against a rubric. Sparse pairwise matchups are then run between the new model and a small set of leaderboard neighbors, giving an initial Elo estimate. Finally, broader pairwise comparisons are performed and the Glicko-2 update is applied, with the resulting Elo score normalized so the anchor models keep their reference values.[1][2]
The normalized Elo score (elo_norm) is the primary leaderboard metric. The rubric score is shown alongside it and is more directly interpretable, but less discriminative at the top end.[1][3]
| Metric | Description |
|---|---|
| Rubric score | Aggregate across rubric criteria |
| Elo score (normalized) | Relative ranking from pairwise comparisons |
| Repetition | Frequency of repeated top words, bigrams, and trigrams |
| Slop score | Match against curated list of overused LLM phrases |
| Length | Average output length in characters |
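The repetition metric in the table above can be illustrated with a small n-gram counter. This is a minimal sketch under stated assumptions, not the benchmark's own implementation: the function names, the `top_k` cutoff, and the choice to report repeated n-grams as a share of all n-gram occurrences are hypothetical.

```python
from collections import Counter
import re


def ngram_counts(text: str, n: int) -> Counter:
    """Count word n-grams in lowercased text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def repetition_rate(text: str, n: int, top_k: int = 10) -> float:
    """Share of all n-gram occurrences taken up by the top_k repeated n-grams.

    Assumption: an n-gram counts as "repeated" when it occurs more than once.
    """
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = [c for _, c in counts.most_common(top_k) if c > 1]
    return sum(repeated) / total


sample = "the rain fell and the rain fell and the night went on"
for n, label in ((1, "words"), (2, "bigrams"), (3, "trigrams")):
    print(label, round(repetition_rate(sample, n), 3))
```

A higher value means more of the text is made up of the same few words or phrases.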
| Category | Example challenge |
|---|---|
| Humor | Writing genuinely funny content rather than puns |
| Romance | Authentic emotional connection without cliche |
| Spatial awareness | Accurate spatial reasoning across a scene |
| Unique perspectives | Non-standard or non-human narrator voices |
| Character development | Multi-dimensional personalities under pressure |
| Plot construction | Coherent story progression in short word counts |
The full prompt list is published in the GitHub repository.[2]
The sampling configuration is fixed for leaderboard parity: 3 generations per prompt (96 outputs total), temperature 0.7, min_p 0.1, and output truncation to 4000 characters before scoring. Using the same sampling settings across models matters because creative output is unusually sensitive to temperature and decoding strategy.[2]
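The fixed settings described above can be captured in a small configuration object. The following is an illustrative sketch only: `SamplingConfig` and `truncate_for_judging` are hypothetical names, and the values are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    """Leaderboard-parity settings as described in the article (names are illustrative)."""
    num_prompts: int = 32
    iterations: int = 3        # generations per prompt
    temperature: float = 0.7
    min_p: float = 0.1
    max_chars: int = 4000      # truncation applied before scoring

    @property
    def total_outputs(self) -> int:
        return self.num_prompts * self.iterations


def truncate_for_judging(text: str, max_chars: int = 4000) -> str:
    """Cut an output to the character limit so length cannot sway the judge."""
    return text[:max_chars]


config = SamplingConfig()
print(config.total_outputs)  # 32 prompts x 3 iterations
```

Freezing the dataclass reflects the point made above: the settings are meant to stay identical across every model evaluated.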
| Category | Criteria examples |
|---|---|
| Coherence | Logical flow, internal consistency, clarity |
| Creativity | Originality, unexpected elements, imagination |
| Style | Voice, tone, prose quality |
| Technical | Grammar, punctuation, structure |
| Engagement | Hook, pacing, reader interest |
| Character | Depth, believability, development |
| Dialogue | Natural speech, distinct voices |
| Description | Vivid imagery, sensory details |
In the pairwise stage the judge compares two outputs and picks a winner. The comparison prompt directs the judge to weigh character authenticity, originality, writing quality, plot coherence, instruction adherence, worldbuilding, cliche avoidance, verbosity control, and metaphor appropriateness. Each matchup feeds into the Glicko-2 update, with margin of victory factored into the rating change.[3]
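To show how a matchup result moves two ratings, here is a deliberately simplified sketch that uses a plain Elo update rather than Glicko-2. It omits the rating deviation and volatility terms that Glicko-2 tracks, and the way margin of victory is encoded (a fractional outcome between 0.5 and 1.0) is an assumption made for illustration, not the benchmark's actual formula.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b).

    outcome_a lies in [0, 1]: 1.0 is a decisive win for A, 0.5 a tie.
    Assumption: a narrow win is encoded as a value between 0.5 and 1.0,
    so a larger margin produces a larger rating change.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


print(elo_update(1500.0, 1500.0, 1.0))   # decisive win for A
print(elo_update(1500.0, 1500.0, 0.75))  # narrow win for A, smaller shift
```

Glicko-2 additionally scales each change by how uncertain the two ratings are, which is what makes it suited to a leaderboard where new models keep arriving.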
| Bias type | Mitigation |
|---|---|
| Length bias | Output truncation to 4000 characters |
| Position bias | A vs B and B vs A averaged |
| Verbosity bias | Judge prompted against padding |
| Forced metaphor | Rubric criteria penalize incoherent imagery |
| Model identity bias | Models are anonymized during pairwise judging |
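The position-bias mitigation in the table above (judging A vs B and B vs A, then averaging) can be sketched as follows. The `Judge` callable and the toy judge are hypothetical stand-ins for a real judge model call; only the averaging step reflects what the text describes.

```python
from typing import Callable

# Hypothetical judge interface: returns a preference in [0, 1]
# for the text shown FIRST over the text shown second.
Judge = Callable[[str, str], float]


def debiased_preference(judge: Judge, text_a: str, text_b: str) -> float:
    """Average both presentation orders so a first-position bonus cancels out."""
    forward = judge(text_a, text_b)        # A shown first
    reverse = 1.0 - judge(text_b, text_a)  # B shown first, flipped back to A's view
    return (forward + reverse) / 2.0


def biased_toy_judge(first: str, second: str) -> float:
    """Toy judge: prefers the longer text and adds a fixed bonus to the first slot."""
    if len(first) == len(second):
        base = 0.5
    else:
        base = 0.7 if len(first) > len(second) else 0.3
    return min(1.0, base + 0.2)


longer, shorter = "a longer piece of text", "short"
print(biased_toy_judge(longer, shorter))                      # inflated by position
print(debiased_preference(biased_toy_judge, longer, shorter)) # bonus averaged away
```

With the toy judge, the single-order score overstates the preference for the first-shown text, while the averaged score recovers the underlying preference.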
The documentation acknowledges remaining biases: judge self-bias (preferring prose in a similar style), positivity or negativity preference, NSFW content aversion ("smut bias"), stylistic preferences inherited from the judge's training, and slop bias rewarding familiar tropes. Paech has run cross-validation experiments using GPT-4.1 as an alternative judge; discrepancies are small but real, which is why the leaderboard recommends running with the same judge for parity.[3][5]
| Improvement | Impact |
|---|---|
| Judge upgrade to Claude Sonnet 4 | Better discrimination at the top |
| Metaphor detection in rubric | Catches forced or incoherent imagery |
| Paragraph scoring scaled for one-sentence paragraphs | Style normalization |
| Elo integration on top of rubric | Sharpens top-tier differences |
| Glicko-2 ratings with uncertainty | Robust rankings as new models join |
| Anchored Elo normalization | Scores comparable across leaderboard updates |
Creative Writing v3 includes a slop detection layer. Outputs are checked against a master list of phrases that appear unnaturally often in LLM-generated text, maintained in the slop-forensics toolkit derived from analysis of outputs from ten language models against human baselines. The scoring formula weights three components: roughly 60 percent slop words, 25 percent "not X but Y" patterns, and 15 percent slop trigrams. A high slop score does not directly lower a model's Elo, but is published alongside the leaderboard so readers can see which models lean on cliche even when they are technically competent.[4]
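The weighting described above amounts to a weighted sum of three component rates. The sketch below assumes each component has already been measured and scaled to a comparable range; the dictionary keys and function name are hypothetical, and the weights are the approximate figures given in the text.

```python
# Approximate weights from the description above (they sum to 1.0).
SLOP_WEIGHTS = {
    "slop_words": 0.60,
    "not_x_but_y": 0.25,
    "slop_trigrams": 0.15,
}


def combined_slop_score(components: dict[str, float]) -> float:
    """Weighted sum of the three slop components.

    Assumption: each component is a pre-computed rate on a comparable scale.
    """
    return sum(SLOP_WEIGHTS[name] * components[name] for name in SLOP_WEIGHTS)


example = {"slop_words": 10.0, "not_x_but_y": 4.0, "slop_trigrams": 2.0}
print(round(combined_slop_score(example), 2))
```

Because slop words carry most of the weight, a model that avoids flagged vocabulary but still leans on "not X but Y" constructions would score lower than one with the opposite habit at the same raw rates.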
As of late 2025 and early 2026, the top of the leaderboard is dominated by Grok variants from xAI and large open-weight models from the Qwen 3 family at Alibaba Cloud. Rankings shift as new models are added, so the table below is a snapshot.[1][3]
| Rank | Model | Normalized Elo | Notes |
|---|---|---|---|
| 1 | Grok 4.1 Thinking | 1721.9 | Strong on humor and unusual perspectives |
| 2 | Grok 4.1 | 1708.6 | Non-thinking variant, still in the top tier |
| anchor | DeepSeek R1 | 1500 | Anchor model used to fix the scale |
| anchor | ministral-3b | 200 | Lower anchor for the Elo scale |
The anchors are fixed pegs that hold the scale steady; when new models are evaluated, their Elo scores are normalized against these anchors so 1500 always means roughly the same thing. The llm-stats mirror surfaces additional Qwen 3 entries (Qwen3-235B-A22B-Instruct-2507, Qwen3-VL-235B-A22B variants, the Qwen3-Next-80B-A3B family) that score as strong open-weight competitors without quite reaching the Grok 4.1 line.[2][3]
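One simple way to pin a scale to two anchors is a linear rescaling that maps the anchors' raw ratings onto their reference values. The sketch below assumes that approach; the source only states that the anchors keep their values of 1500 and 200, so the linear form, the function name, and the raw ratings in the example are all hypothetical.

```python
def normalize_elo(
    raw: float,
    raw_upper_anchor: float,
    raw_lower_anchor: float,
    upper_target: float = 1500.0,  # DeepSeek R1 reference value
    lower_target: float = 200.0,   # ministral-3b reference value
) -> float:
    """Linearly rescale a raw rating so the two anchors land on their targets."""
    if raw_upper_anchor == raw_lower_anchor:
        raise ValueError("anchor ratings must differ")
    scale = (upper_target - lower_target) / (raw_upper_anchor - raw_lower_anchor)
    return lower_target + (raw - raw_lower_anchor) * scale


# Hypothetical raw ratings from one rating run.
raw_r1, raw_ministral = 1620.0, 580.0
print(normalize_elo(raw_r1, raw_r1, raw_ministral))         # upper anchor
print(normalize_elo(raw_ministral, raw_r1, raw_ministral))  # lower anchor
print(normalize_elo(1100.0, raw_r1, raw_ministral))         # a model in between
```

Whatever the raw ratings drift to between runs, the anchors always map back to the same two points, which is what keeps scores comparable across leaderboard updates.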
The DeepSeek family was the standout in early 2025 runs and remains the reference point; DeepSeek R1 is the chosen anchor at 1500 because of its consistent performance across both rubric and Elo stages.[1][2]
The benchmark is run from the creative-writing-bench repository. After cloning, install dependencies with pip install -r requirements.txt and download the required NLTK data (punkt, cmudict). Dependencies include requests, python-dotenv, numpy, scipy, tqdm, glicko2, nltk, and joblib. API keys for the test model and the judge are configured in a .env file.[2]
A standard evaluation run uses Claude Sonnet 4 as the judge for parity with the public leaderboard:
```shell
python3 creative_writing_bench.py \
  --test-model "provider/model-name" \
  --judge-model "anthropic/claude-sonnet-4" \
  --runs-file "creative_bench_runs.json" \
  --iterations 3 \
  --threads 500
```
The --runs-file argument controls where intermediate results and matchups are stored; the file shipped in the repository should be used to compare against the public leaderboard, because it accumulates pairwise results reused across model evaluations. A full evaluation typically costs around ten US dollars in API spend per model.[2]
| Application | Use case |
|---|---|
| Model development | Tracking creative ability across training runs |
| Architecture comparison | Evaluating design choices across model families |
| Prompt engineering | Probing how prompt phrasing changes creative output |
| Bias studies | Surfacing AI writing patterns and slop tendencies |
| Judge meta-evaluation | Cross-validating judge models, feeding Judgemark |
The leaderboard is used to assess model suitability for fiction or copywriting, vet AI writing assistants, test story generation in games and interactive fiction, and screen AI co-writing tools for slop. Writers and hobbyists use it to pick models for creative AI projects, since scores correlate well with subjective impressions.[1]
Common failure modes the rubric and slop detector are designed to catch include formulaic structure, cliche overuse ("shivers down the spine," "breath she didn't know she was holding"), emotional shallowness, forced creativity that substitutes odd word choices for real originality, and inconsistent tone that drifts mid-piece. Even top models struggle with genuine humor, emotional depth in romance scenes, spatial consistency across a scene, original narrator voice, and sustained complex metaphors.[1][4]
| Limitation | Description |
|---|---|
| Subjective nature | Creative quality is inherently subjective |
| Judge dependency | Relies on a single judge in the canonical run |
| English only | Prompts and judging are in English |
| Genre constraints | Limited coverage of poetry and screenwriting |
| Length limits | 4000 character truncation may penalize slow builders |
| Cost | Around ten US dollars per full run |
Future directions discussed by Paech include multi-judge systems with several models voting on each comparison, human baselines from paid writers, genre expansion (already partially handled by Longform Creative Writing), multilingual support, and tests of multi-turn collaboration between writer and model.[1][2]
Creative Writing v3 sits inside a wider EQ-Bench family:
Creative Writing v3 has become one of the standard public references for ranking creative writing ability in LLMs, alongside Chatbot Arena. Its hybrid scoring is more discriminative than pure rubric or pure preference voting, and its slop and repetition metrics give a useful diagnostic layer beyond a single number. For a benchmark one person built and maintains, it has had outsized influence on how AI labs talk about writing ability.[1][3]