Longform Creative Writing

From AI Wiki


Longform Creative Writing
Overview
Full name Longform Creative Writing Benchmark
Abbreviation LCW
Description An LLM-judged benchmark evaluating extended narrative generation across 8 chapters
Release date 2024
Latest version 3.0
Benchmark updated 2025-08-08
Authors Samuel J. Paech
Organization EQ-Bench
Technical Details
Type Creative Writing, Extended Narrative
Modality Text
Task format Multi-turn story generation
Number of tasks 1 story (8 chapters)
Total examples 8 chapters per evaluation
Evaluation metric 0-100 score, Degradation metric, Slop score, Repetition
Domains Fiction writing, Narrative consistency, Character development
Languages English
Performance
Human performance Not reported
Baseline Variable by model
SOTA score ~85-90
SOTA model Claude 3.7 Sonnet
SOTA date 2025
Saturated No
Resources
Website Official website
GitHub Repository
License Open source
Predecessor Longform Creative Writing v2


Longform Creative Writing is an artificial intelligence benchmark designed to evaluate large language models' ability to generate coherent, engaging extended narratives across multiple chapters. Part of the EQ-Bench suite created by Samuel J. Paech, this benchmark challenges models to write an 8-chapter story or novella, with each chapter approximately 1000 words, while maintaining narrative consistency, character development, and writing quality throughout the extended format.

Overview

Longform Creative Writing addresses a critical challenge in AI evaluation: assessing whether models can maintain quality, coherence, and engagement across extended narrative generation. Unlike short-form creative tasks, this benchmark tests models' ability to develop complex plots, maintain character consistency, avoid repetition, and prevent quality degradation over approximately 8000 words of continuous storytelling.

Motivation

The development of the Longform Creative Writing benchmark was motivated by several key observations:

  • Short-form benchmarks fail to capture degradation patterns in extended generation
  • Real-world creative applications often require sustained narrative quality
  • Models frequently exhibit quality decline in longer outputs
  • The need to evaluate narrative planning and structural coherence
  • Importance of character consistency across multiple chapters

The benchmark specifically targets the evaluation of sustained creative performance, testing whether AI systems can match human writers' ability to maintain engagement throughout a complete story arc.

Technical Architecture

Core Components

Component Description Function
Story Planning System Initial concept and chapter outline generation Establishes narrative structure
Chapter Generation 8 sequential ~1000-word chapters Produces extended narrative
Judge Model Claude Sonnet 4 (as of 2025) Evaluates quality and coherence
Degradation Analysis Per-chapter quality tracking Identifies performance decline

Evaluation Methodology

Multi-Stage Process

The benchmark follows a structured evaluation approach:

Stage Description Output
Planning Model creates story concept and detailed outline Story framework
Reflection Model reviews and revises initial plan Refined structure
Generation Sequential production of 8 chapters Complete narrative
Evaluation Judge assesses each chapter and overall work Quality scores
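The four-stage loop in the table can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: `model` and `judge` stand in for chat-completion calls, and the prompt strings and field names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class StoryRun:
    """Artifacts from one evaluation run (field names are illustrative)."""
    plan: str = ""
    chapters: list = field(default_factory=list)
    scores: list = field(default_factory=list)

def run_pipeline(model, judge, n_chapters=8):
    """Planning -> reflection -> generation -> evaluation, per the table above."""
    run = StoryRun()
    # Planning: concept and chapter outline from a minimal prompt
    run.plan = model("Draft a concept and an 8-chapter outline for a novella.")
    # Reflection: the model revises its own outline before committing
    run.plan = model(f"Review and revise this outline:\n{run.plan}")
    for i in range(n_chapters):
        # Generation: each chapter continues directly from the story so far
        story_so_far = "\n\n".join(run.chapters)
        chapter = model(
            f"Outline:\n{run.plan}\n\nStory so far:\n{story_so_far}\n\n"
            f"Write chapter {i + 1} (~1000 words)."
        )
        run.chapters.append(chapter)
        # Evaluation: the judge scores each chapter as it is produced
        run.scores.append(judge(chapter))
    return run
```

Keeping per-chapter judge scores, rather than a single end-of-story score, is what makes the degradation analysis described later possible.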

Scoring Dimensions

The benchmark evaluates across multiple quality dimensions:

Dimension Description Weight Impact on Score
Compelling Plot Engaging narrative with strong pacing High Major component
Coherence Logical consistency throughout High Major component
Character Consistency Maintaining character profiles High Major component
Chapter Plan Adherence Following outlined structure Medium Moderate component
Emotional Engagement Reader connection and investment High Major component
Nuanced Characterization Complex, multi-dimensional characters Medium Moderate component
Tonal Consistency Maintaining appropriate tone Medium Moderate component

Test Format

Story Generation Process

Initial Planning Phase

1. **Concept Development**: Model receives minimal prompt and develops story concept
2. **Chapter Outline**: Creates detailed plan for 8 chapters
3. **Reflection**: Reviews and refines initial plan
4. **Commitment**: Finalizes structure before generation begins

Chapter Production

Each chapter follows specific requirements:

Requirement Specification Purpose
Word Count ~1000 words per chapter Consistency and substance
Continuity Direct continuation from previous Narrative flow
Development Advance plot and characters Story progression
Quality Maintain initial standards Prevent degradation

Generation Parameters

  • **Temperature**: 0.7 (balanced creativity)
  • **Min_p**: 0.1 (quality threshold)
  • **Output Format**: Plain text narrative
  • **Total Length**: ~8000 words across 8 chapters
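The `min_p` parameter gates sampling to tokens whose probability is at least `min_p` times that of the most likely token, after temperature scaling. A generic sketch of the sampling rule (not the benchmark's own code):

```python
import math
import random

def min_p_sample(logits, temperature=0.7, min_p=0.1, rng=random):
    """Sample a token index with temperature scaling plus min-p filtering:
    keep only tokens whose probability is >= min_p * (top token's probability)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    keep = [(i, p) for i, p in enumerate(probs) if p >= threshold]
    z = sum(p for _, p in keep)               # renormalize the survivors
    r = rng.random() * z
    for i, p in keep:
        r -= p
        if r <= 0:
            return i
    return keep[-1][0]
```

With the settings above (temperature 0.7, min_p 0.1), very unlikely continuations are pruned while plausible mid-probability alternatives survive, which is why the combination reads as "balanced creativity" with a "quality threshold".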

Quality Metrics

Primary Scoring System

Metric Range Description
Overall Score 0-100 Comprehensive quality assessment
Chapter Scores 0-100 each Individual chapter quality
Average Score 0-100 Mean across all chapters
Degradation Score Variable Quality change over chapters

Specialized Metrics

Degradation Analysis

The benchmark includes unique degradation tracking:

  • **Visual Sparkline**: Shows quality trajectory across 8 chapters
  • **Degradation Score**: Quantifies quality decline
  • **Consistency Rating**: Measures stability of output quality
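A quality-trajectory sparkline of the kind described above can be rendered directly from per-chapter scores with Unicode block characters; a minimal sketch:

```python
BARS = "▁▂▃▄▅▆▇█"  # eight block heights, lowest to highest

def sparkline(scores):
    """Render a list of chapter scores as a one-character-per-chapter sparkline."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1  # avoid division by zero for flat trajectories
    return "".join(BARS[int((s - lo) / span * (len(BARS) - 1))] for s in scores)
```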

Writing Quality Indicators

Indicator Description Ideal Value
Length (chars) Average chapter character count ~5000-6000
Slop Score Frequency of overused AI phrases Low (<5%)
Repetition N-gram repetition across chapters Low (<10%)
Degradation Quality drop from start to end Minimal (<5 points)
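The repetition and slop indicators can be approximated with simple text statistics. In the sketch below, `SLOP_PHRASES` is a hypothetical stand-in list; the benchmark's actual phrase inventory and formulas are not reproduced here:

```python
def ngram_repetition(text, n=3):
    """Fraction of word n-grams that are repeats; a rough proxy for the
    repetition metric (the benchmark's exact computation may differ)."""
    words = text.lower().split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return 1 - len(set(grams)) / len(grams)

# Illustrative stand-ins for overused "AI-isms"; not the benchmark's real list.
SLOP_PHRASES = {"tapestry", "testament to", "palpable"}

def slop_score(text):
    """Occurrences of listed phrases per 1000 words."""
    lowered = text.lower()
    hits = sum(lowered.count(phrase) for phrase in SLOP_PHRASES)
    n_words = len(lowered.split()) or 1
    return 1000 * hits / n_words
```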

Common Failure Modes

Identified Weaknesses

The benchmark specifically tracks common writing failures:

Failure Mode Description Frequency Impact
Weak Dialogue Unnatural or stilted conversations High (~60%) Major quality loss
Tell-Don't-Show Excessive exposition over demonstration High (~70%) Engagement loss
Purple Prose Overly ornate language Medium (~40%) Style issues
Predictability Formulaic plot development High (~65%) Reader interest loss
Metaphor Abuse Forced or incoherent metaphors Medium (~45%) Clarity issues
Character Drift Inconsistent characterization Medium (~50%) Coherence loss

Degradation Patterns

Models commonly exhibit several degradation patterns:

1. **Quality Cliff**: Sharp decline after chapter 3-4
2. **Gradual Decay**: Steady quality reduction throughout
3. **Oscillation**: Alternating quality between chapters
4. **Final Chapter Collapse**: Rushed or weak endings
5. **Middle Sag**: Quality dip in chapters 4-6
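These trajectory patterns can be detected heuristically from the per-chapter scores. The thresholds below are illustrative assumptions, not the benchmark's own classifier:

```python
def classify_degradation(scores):
    """Label a per-chapter score trajectory with one of the patterns above.
    All thresholds are illustrative."""
    n = len(scores)
    # Final chapter collapse: last chapter far below the mean of the rest
    mean_rest = sum(scores[:-1]) / (n - 1)
    if mean_rest - scores[-1] > 10:
        return "final chapter collapse"
    # Quality cliff: one large drop around chapters 3-4
    drops = [scores[i] - scores[i + 1] for i in range(n - 1)]
    if max(drops) > 10 and drops.index(max(drops)) in (2, 3):
        return "quality cliff"
    # Middle sag: the middle chapters dip below the endpoints
    mid = scores[n // 2 - 1 : n // 2 + 2]
    endpoints = (scores[0] + scores[-1]) / 2
    if endpoints - sum(mid) / len(mid) > 5:
        return "middle sag"
    # Gradual decay: monotone non-increasing with a meaningful overall drop
    if all(d >= 0 for d in drops) and scores[0] - scores[-1] > 5:
        return "gradual decay"
    return "stable / oscillating"
```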

Version 3 Improvements (2025)

Key Enhancements

Improvement Description Impact
Judge Upgrade Claude Sonnet 4 implementation Better discrimination
Metaphor Detection Enhanced incoherent metaphor penalties Quality improvement
Paragraph Scoring Penalties for single-sentence paragraphs Style normalization
Structural Safeguards Reliability improvements for longform Consistency enhancement
Degradation Tracking Enhanced quality trajectory analysis Better diagnostics

Scoring Refinements

  • **Weighted Scoring**: Extra emphasis on metaphor quality
  • **Automatic Penalties**: Structural writing degradation detection
  • **Targeted Prompting**: Improved judge instructions for specific issues

Performance Analysis

Current Performance Trends (2025)

Model Category Typical Score Range Degradation Strengths
Top Tier 85-90 <5 points Consistent quality, strong narrative
High Performance 75-85 5-10 points Good plotting, some degradation
Mid-Range 65-75 10-15 points Decent start, notable decline
Lower Performance 50-65 >15 points Weak consistency, high degradation

Key Insights

  • **Degradation Universal**: All models show some quality decline
  • **Chapter 4 Barrier**: Many models struggle to maintain quality past the midpoint
  • **Dialogue Challenge**: Consistent weakness across all models
  • **Planning Impact**: Better initial planning correlates with less degradation

Implementation

Setup and Configuration

```bash
# Access via EQ-Bench suite
git clone https://github.com/EQ-bench/creative-writing-bench
cd creative-writing-bench

# Configure API access
export ANTHROPIC_API_KEY="your-key"   # For judge model
export OPENROUTER_API_KEY="your-key"  # For test models
```

Running Evaluations

```bash
# Basic longform evaluation
python longform_creative.py --model "your-model" \
    --temperature 0.7 --min-p 0.1

# With custom chapter count
python longform_creative.py --model "your-model" \
    --chapters 8 --words-per-chapter 1000

# Full analysis with degradation tracking
python longform_creative.py --model "your-model" \
    --full-analysis --track-degradation
```

Output Structure

Results include:

  • **Story File**: Complete 8-chapter narrative
  • **Score Report**: Chapter-by-chapter and overall scores
  • **Degradation Analysis**: Quality trajectory visualization
  • **Metric Summary**: Slop, repetition, and length statistics

Applications and Impact

Research Applications

Application Purpose Research Value
Architecture Testing Evaluating memory and coherence systems Technical insights
Training Optimization Improving long-context performance Model development
Degradation Studies Understanding quality decline patterns Theoretical understanding
Planning Systems Testing narrative structure capabilities Cognitive modeling

Practical Applications

  • **Publishing**: Assessing AI co-writing capabilities
  • **Content Creation**: Evaluating long-form content generation
  • **Educational Tools**: Testing story-writing assistants
  • **Entertainment**: Developing AI storytelling systems
  • **Game Development**: Narrative generation for games

Challenges and Limitations

Current Limitations

Limitation Description Impact
Single Story Format One extended narrative per test Limited diversity
Genre Constraints General fiction focus Narrow scope
Judge Subjectivity Single AI judge preference Potential bias
English Only Limited to English narratives Reduced applicability
Fixed Length 8 chapters of ~1000 words Format rigidity

Technical Challenges

  • **Memory Management**: Maintaining context across 8000 words
  • **Coherence Maintenance**: Tracking plot threads and character arcs
  • **Style Consistency**: Avoiding drift in narrative voice
  • **Pacing Control**: Managing story rhythm across chapters
  • **Ending Quality**: Delivering satisfying conclusions

Future Directions

Planned Improvements

1. **Multi-Genre Testing**: Specialized prompts for different genres
2. **Variable Length**: Flexible chapter and story lengths
3. **Interactive Elements**: Reader choice integration
4. **Multi-Judge Consensus**: Multiple AI judges for robustness
5. **Human Baseline**: Professional writer performance benchmarks
6. **Multilingual Support**: Extension to other languages

Research Opportunities

  • **Degradation Mitigation**: Techniques to maintain quality
  • **Planning Optimization**: Better story structure systems
  • **Memory Architectures**: Improved long-context handling
  • **Style Transfer**: Maintaining consistent voice
  • **Adaptive Generation**: Dynamic quality adjustment

Related Benchmarks

Significance

Longform Creative Writing represents a crucial advancement in evaluating AI systems' sustained creative capabilities. Its focus on degradation patterns and narrative consistency provides unique insights into model limitations that shorter benchmarks miss. The benchmark's ability to identify when and how models fail in extended generation makes it valuable for:

  • Understanding long-context performance limits
  • Developing more robust narrative generation systems
  • Identifying architectural improvements for sustained quality
  • Establishing realistic expectations for AI creative writing
  • Guiding development of professional writing tools

See Also

References
