| Longform Creative Writing | |
|---|---|
| Overview | |
| Full name | Longform Creative Writing Benchmark |
| Abbreviation | LCW |
| Description | An LLM-judged benchmark evaluating extended narrative generation across 8 chapters |
| Release date | 2024 |
| Latest version | 3.0 |
| Benchmark updated | 2025-08-08 |
| Authors | Samuel J. Paech |
| Organization | EQ-Bench |
| Technical Details | |
| Type | Creative Writing, Extended Narrative |
| Modality | Text |
| Task format | Multi-turn story generation |
| Number of tasks | 1 story (8 chapters) |
| Total examples | 8 chapters per evaluation |
| Evaluation metric | 0-100 score, Degradation metric, Slop score, Repetition |
| Domains | Fiction writing, Narrative consistency, Character development |
| Languages | English |
| Performance | |
| Human performance | Not reported |
| Baseline | Variable by model |
| SOTA score | ~85-90 |
| SOTA model | Claude 3.7 Sonnet |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | Official website |
| GitHub | Repository |
| License | Open source |
| Predecessor | Longform Creative Writing v2 |
Longform Creative Writing is an artificial intelligence benchmark designed to evaluate large language models' ability to generate coherent, engaging extended narratives across multiple chapters. Part of the EQ-Bench suite created by Samuel J. Paech, this benchmark challenges models to write an 8-chapter story or novella, with each chapter approximately 1000 words, while maintaining narrative consistency, character development, and writing quality throughout the extended format.
Longform Creative Writing addresses a critical challenge in AI evaluation: assessing whether models can maintain quality, coherence, and engagement across extended narrative generation. Unlike short-form creative tasks, this benchmark tests models' ability to develop complex plots, maintain character consistency, avoid repetition, and prevent quality degradation over approximately 8000 words of continuous storytelling.
The development of the Longform Creative Writing benchmark was motivated by a key observation: short-form creative tasks do not reveal how model output quality decays over long generations. The benchmark therefore targets sustained creative performance, testing whether AI systems can match human writers' ability to maintain engagement throughout a complete story arc.
| Component | Description | Function |
|---|---|---|
| Story Planning System | Initial concept and chapter outline generation | Establishes narrative structure |
| Chapter Generation | 8 sequential ~1000-word chapters | Produces extended narrative |
| Judge Model | Claude Sonnet 4 (as of 2025) | Evaluates quality and coherence |
| Degradation Analysis | Per-chapter quality tracking | Identifies performance decline |
The benchmark follows a structured evaluation approach:
| Stage | Description | Output |
|---|---|---|
| Planning | Model creates story concept and detailed outline | Story framework |
| Reflection | Model reviews and revises initial plan | Refined structure |
| Generation | Sequential production of 8 chapters | Complete narrative |
| Evaluation | Judge assesses each chapter and overall work | Quality scores |
The benchmark evaluates across multiple quality dimensions:
| Dimension | Description | Weight | Impact on Score |
|---|---|---|---|
| Compelling Plot | Engaging narrative with strong pacing | High | Major component |
| Coherence | Logical consistency throughout | High | Major component |
| Character Consistency | Maintaining character profiles | High | Major component |
| Chapter Plan Adherence | Following outlined structure | Medium | Moderate component |
| Emotional Engagement | Reader connection and investment | High | Major component |
| Nuanced Characterization | Complex, multi-dimensional characters | Medium | Moderate component |
| Tonal Consistency | Maintaining appropriate tone | Medium | Moderate component |
1. **Concept Development**: Model receives minimal prompt and develops story concept 2. **Chapter Outline**: Creates detailed plan for 8 chapters 3. **Reflection**: Reviews and refines initial plan 4. **Commitment**: Finalizes structure before generation begins
Each chapter follows specific requirements:
| Requirement | Specification | Purpose |
|---|---|---|
| Word Count | ~1000 words per chapter | Consistency and substance |
| Continuity | Direct continuation from previous | Narrative flow |
| Development | Advance plot and characters | Story progression |
| Quality | Maintain initial standards | Prevent degradation |
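The word-count requirement from the table above can be checked mechanically. The ±20% tolerance here is an assumption for illustration; the benchmark's actual tolerance is not stated.

```python
def word_count_ok(chapter: str, target: int = 1000, tol: float = 0.2) -> bool:
    """Return True if the chapter is within the target length band.

    The 20% tolerance is an assumed value, not the benchmark's own threshold.
    """
    n = len(chapter.split())
    return abs(n - target) <= tol * target
```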
| Metric | Range | Description |
|---|---|---|
| Overall Score | 0-100 | Comprehensive quality assessment |
| Chapter Scores | 0-100 each | Individual chapter quality |
| Average Score | 0-100 | Mean across all chapters |
| Degradation Score | Variable | Quality change over chapters |
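The average and degradation metrics above can be illustrated with a simple aggregation. The exact degradation formula is not documented here, so this sketch uses an assumed first-half vs. second-half comparison.

```python
def aggregate_scores(chapter_scores: list[float]) -> dict:
    """Mean chapter score plus an assumed degradation measure:
    average of the first half minus average of the second half."""
    n = len(chapter_scores)
    avg = sum(chapter_scores) / n
    first = sum(chapter_scores[: n // 2]) / (n // 2)
    last = sum(chapter_scores[n // 2:]) / (n - n // 2)
    return {"average": avg, "degradation": first - last}
```

A positive degradation value means quality declined over the story; a value near zero indicates sustained quality.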
The benchmark includes unique degradation tracking:
| Indicator | Description | Ideal Value |
|---|---|---|
| Length (chars) | Average chapter character count | ~5000-6000 |
| Slop Score | Frequency of overused AI phrases | Low (<5%) |
| Repetition | N-gram repetition across chapters | Low (<10%) |
| Degradation | Quality drop from start to end | Minimal (<5 points) |
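The cross-chapter repetition indicator can be approximated by counting how many n-grams in later chapters already appeared earlier. This is a generic sketch of the technique, not the benchmark's own implementation.

```python
def ngram_repetition(chapters: list[str], n: int = 3) -> float:
    """Fraction of n-grams in each chapter already seen in earlier chapters."""
    seen: set[tuple[str, ...]] = set()
    repeated = total = 0
    for text in chapters:
        words = text.lower().split()
        grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        for g in grams:
            total += 1
            if g in seen:
                repeated += 1
        seen.update(grams)
    return repeated / total if total else 0.0
```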
The benchmark specifically tracks common writing failures:
| Failure Mode | Description | Frequency | Impact |
|---|---|---|---|
| Weak Dialogue | Unnatural or stilted conversations | High (~60%) | Major quality loss |
| Tell-Don't-Show | Excessive exposition over demonstration | High (~70%) | Engagement loss |
| Purple Prose | Overly ornate language | Medium (~40%) | Style issues |
| Predictability | Formulaic plot development | High (~65%) | Reader interest loss |
| Metaphor Abuse | Forced or incoherent metaphors | Medium (~45%) | Clarity issues |
| Character Drift | Inconsistent characterization | Medium (~50%) | Coherence loss |
Models commonly exhibit several degradation patterns:
1. **Quality Cliff**: Sharp decline after chapter 3-4
2. **Gradual Decay**: Steady quality reduction throughout
3. **Oscillation**: Alternating quality between chapters
4. **Final Chapter Collapse**: Rushed or weak endings
5. **Middle Sag**: Quality dip in chapters 4-6
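Given a per-chapter score series, these patterns can be labeled with a crude heuristic. The function and its thresholds are illustrative assumptions, not part of the benchmark.

```python
def degradation_pattern(scores: list[float]) -> str:
    """Toy classifier for the five patterns above; all thresholds are assumed."""
    diffs = [b - a for a, b in zip(scores, scores[1:])]
    if diffs[-1] < -10:                       # steep drop only at the end
        return "final chapter collapse"
    if min(diffs[:-1]) < -10:                 # one steep drop mid-story
        return "quality cliff"
    # Count sign flips between consecutive chapter-to-chapter changes.
    flips = sum(1 for a, b in zip(diffs, diffs[1:]) if a * b < 0)
    if flips >= len(diffs) - 2:
        return "oscillation"
    third = len(scores) // 3
    mid = scores[third:-third]
    if mid and min(mid) < min(scores[0], scores[-1]) - 5:
        return "middle sag"
    return "gradual decay" if scores[-1] < scores[0] else "stable"
```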
| Improvement | Description | Impact |
|---|---|---|
| Judge Upgrade | Claude Sonnet 4 implementation | Better discrimination |
| Metaphor Detection | Enhanced incoherent metaphor penalties | Quality improvement |
| Paragraph Scoring | Penalties for single-sentence paragraphs | Style normalization |
| Structural Safeguards | Reliability improvements for longform | Consistency enhancement |
| Degradation Tracking | Enhanced quality trajectory analysis | Better diagnostics |
| Model Category | Typical Score Range | Degradation | Strengths |
|---|---|---|---|
| Top Tier | 85-90 | <5 points | Consistent quality, strong narrative |
| High Performance | 75-85 | 5-10 points | Good plotting, some degradation |
| Mid-Range | 65-75 | 10-15 points | Decent start, notable decline |
| Lower Performance | 50-65 | >15 points | Weak consistency, high degradation |
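The tiers in the table above amount to a simple lookup on score and degradation. A minimal sketch, directly transcribing the table's bands:

```python
def performance_tier(score: float, degradation: float) -> str:
    """Map an overall score and degradation (points lost) to the table's tiers."""
    if score >= 85 and degradation < 5:
        return "Top Tier"
    if score >= 75:
        return "High Performance"
    if score >= 65:
        return "Mid-Range"
    return "Lower Performance"
```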
```bash
git clone https://github.com/EQ-bench/creative-writing-bench
cd creative-writing-bench
export ANTHROPIC_API_KEY="your-key"   # For judge model
export OPENROUTER_API_KEY="your-key"  # For test models
```
```bash
# Basic run with sampling parameters
python longform_creative.py --model "your-model" \
    --temperature 0.7 --min-p 0.1

# Explicit chapter and length settings
python longform_creative.py --model "your-model" \
    --chapters 8 --words-per-chapter 1000

# Full analysis with degradation tracking
python longform_creative.py --model "your-model" \
    --full-analysis --track-degradation
```
Results include the overall and per-chapter scores, degradation trajectory, and slop and repetition statistics described above.
| Application | Purpose | Research Value |
|---|---|---|
| Architecture Testing | Evaluating memory and coherence systems | Technical insights |
| Training Optimization | Improving long-context performance | Model development |
| Degradation Studies | Understanding quality decline patterns | Theoretical understanding |
| Planning Systems | Testing narrative structure capabilities | Cognitive modeling |
| Limitation | Description | Impact |
|---|---|---|
| Single Story Format | One extended narrative per test | Limited diversity |
| Genre Constraints | General fiction focus | Narrow scope |
| Judge Subjectivity | Single AI judge preference | Potential bias |
| English Only | Limited to English narratives | Reduced applicability |
| Fixed Length | 8 chapters of ~1000 words | Format rigidity |
1. **Multi-Genre Testing**: Specialized prompts for different genres
2. **Variable Length**: Flexible chapter and story lengths
3. **Interactive Elements**: Reader choice integration
4. **Multi-Judge Consensus**: Multiple AI judges for robustness
5. **Human Baseline**: Professional writer performance benchmarks
6. **Multilingual Support**: Extension to other languages
Longform Creative Writing represents a crucial advancement in evaluating AI systems' sustained creative capabilities. Its focus on degradation patterns and narrative consistency provides unique insights into model limitations that shorter benchmarks miss. The benchmark's ability to identify when and how models fail in extended generation makes it valuable for architecture testing, training optimization, degradation studies, and research into narrative planning systems.