# Paper2Video

> Source: https://aiwiki.ai/wiki/paper2video
> Updated: 2026-05-10
> Categories: AI Benchmarks, AI Research, Multimodal AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Paper2Video** (full title: *Paper2Video: Automatic Video Generation from Scientific Papers*) is a research project from Show Lab at the [National University of Singapore](/wiki/national_university_of_singapore) that formalizes and evaluates automatic generation of academic presentation videos directly from scientific papers. It comprises (1) the **Paper2Video Benchmark**, 101 paper-video pairs with slides and speaker metadata, and (2) **PaperTalker**, a multi-agent framework that turns a paper (plus a reference image and short voice sample) into a narrated presentation video with slides, subtitles, cursor highlights, and an optional talking-head presenter.[1][2] The code is open sourced under the [MIT license](/wiki/mit_license) on [GitHub](/wiki/github), and the benchmark dataset is hosted on [Hugging Face](/wiki/hugging_face) under the terms stated on its dataset card.[3][4]

The authors are Zeyu Zhu, Kevin Qinghong Lin, and Mike Zheng Shou (corresponding author), with Zhu and Lin contributing equally. The paper was first posted to [arXiv](/wiki/arxiv) on 6 October 2025 (preprint 2510.05096) and was accepted to the Scaling Environments for Agents (SEA) Workshop at [NeurIPS](/wiki/neurips) 2025.[1][5] In mid-October 2025 the project briefly trended on Hacker News, and the GitHub repository accumulated thousands of stars within days.[3]

## Terminology and scope

The name **Paper2Video** refers to both the benchmark and the overall project; the video-generation agent is called **PaperTalker**. The work targets long-context, multimodal inputs (text, figures, tables) and coordinated outputs across slides, subtitles, speech, cursor motion, and an optional talking head, with evaluation focused on faithfulness, audience comprehension, and author visibility rather than purely natural-video realism.[1][2]

The authors describe academic presentation video generation as a "superproblem" of related document-to-media tasks such as slide and poster generation. Producing a 2 to 10 minute conference talk video manually typically takes several hours of slide design, narration recording, and editing, motivating the case for automation.[1]

## History

| Date | Event |
| --- | --- |
| 28 September 2025 | Project accepted as a poster at the SEA (Scaling Environments for Agents) Workshop at [NeurIPS](/wiki/neurips) 2025.[5] |
| 6 October 2025 | arXiv v1 posted; code and dataset released on [GitHub](/wiki/github) and [Hugging Face](/wiki/hugging_face).[1][3] |
| 9 October 2025 | arXiv v2 released with minor revisions. The four metrics (Meta Similarity, PresentArena, PresentQuiz, IP Memory) were already part of v1.[1] |
| 11 October 2025 | Project featured on the Hacker News front page, driving a spike in repository stars.[3] |
| 15 October 2025 | A "fast" variant without the talking head added to the repository for quicker generation.[3] |

## Authors and affiliation

The paper was produced by Show Lab, a research group at the [National University of Singapore](/wiki/national_university_of_singapore) led by Mike Zheng Shou. The same group has produced related work on GUI agents (ShowUI) and multimodal generation, several of which are cited as building blocks of PaperTalker.[1][6]

| Author | Role |
| --- | --- |
| Zeyu Zhu | Co-first author, Show Lab, [NUS](/wiki/national_university_of_singapore) |
| Kevin Qinghong Lin | Co-first author, Show Lab, NUS (also lead author of ShowUI) |
| Mike Zheng Shou | Corresponding author, principal investigator of Show Lab |

## Paper2Video Benchmark

The benchmark pairs recent conference papers with the authors' presentation videos, original slide decks (when available), and presenter identity metadata. Public sources include [YouTube](/wiki/youtube) and SlidesLive, supplemented with portrait images sourced from authors' personal websites. Papers without sufficient metadata were excluded during curation.[1]

### Composition and statistics

| Item | Value (aggregate) |
| --- | --- |
| Number of paper-video pairs | 101 |
| Average words per paper | ~13.3K (~3.3K tokens) |
| Average figures per paper | ~44.7 |
| Average pages per paper | ~28.7 |
| Average slides per presentation | ~16 |
| Average talk duration | ~6 minutes 15 seconds (range: 2 to 14 minutes) |
| Slides per video range | 4 to 28 |
| Original slide PDFs available | for ~40% of entries |

Sources: project page, dataset card, and paper.[1][2][4]

A domain breakdown reported in the paper is:

| Area | Count (papers) | Example venues |
| --- | --- | --- |
| Machine learning | 41 | [NeurIPS](/wiki/neurips), [ICLR](/wiki/iclr), [ICML](/wiki/icml) |
| Computer vision | 40 | [CVPR](/wiki/cvpr), ICCV, ECCV |
| Natural language processing | 20 | ACL, EMNLP, NAACL |

Each instance includes the paper's full [LaTeX](/wiki/latex) project, an author-recorded presentation video (slide and talking-head streams), and speaker identity (portrait and short voice sample). For roughly 40% of entries, original slide PDFs are also collected, enabling reference-based slide evaluation.[1]

### Curation rationale

The authors chose AI conference papers because the field's open-sharing culture provides polished author-recorded presentations on YouTube and SlidesLive, and because such papers offer diverse content with rich text, figures, and tables. They explicitly framed the benchmark as evaluating long-horizon agentic tasks rather than generic video generation, distinguishing it from natural-video benchmarks such as VBench.[1]

## Evaluation metrics

Paper2Video proposes four tailored metrics for academic presentation videos, using vision-language models (VLMs) and VideoLLMs as automated judges where appropriate.[1] The authors argue that conventional metrics from natural video synthesis (such as FVD, IS, or CLIP similarity) miss the central purpose of an academic talk: communicating scholarship to an audience and amplifying author visibility.

| Metric | What it measures | How it is operationalized |
| --- | --- | --- |
| **Meta Similarity** | Alignment of generated assets with human-authored ones (slides, subtitles, speech timbre) | A VLM compares generated slide-subtitle pairs to the human versions on a five-point scale; speech similarity uses embedding cosine similarity on uniformly sampled 10-second clips via SpeechBrain.[1] |
| **PresentArena** | Overall preference and quality in head-to-head comparisons | A VideoLLM performs double-order pairwise comparisons between generated and human-made presentation videos; winning rate is the metric (order flipping reduces position bias).[1] |
| **PresentQuiz** | Information coverage and comprehension | Multiple-choice questions are generated from the source paper covering both fine-grained details and higher-level understanding; a VideoLLM watches the video and answers; overall accuracy is reported.[1] |
| **IP Memory** | Memorability and the audience's ability to associate authors with their work | A recall task asks a VideoLLM to match brief 5-second video clips to a relevant question given a speaker image; accuracy reflects retention and associative memory.[1] |
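
The speech-timbre part of Meta Similarity in the table above reduces to a cosine similarity between speaker embeddings of short clips. The following is a minimal sketch of that arithmetic only, not the authors' implementation: the embeddings are assumed to come from an external speaker-embedding model (SpeechBrain in the paper), and the helper names, the default of three clips, and the averaging over clip pairs are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def clip_starts(duration_s: float, clip_s: float = 10.0, n_clips: int = 3) -> List[float]:
    """Uniformly spaced start times (seconds) for n_clips clips of clip_s seconds."""
    if duration_s <= clip_s or n_clips <= 1:
        return [0.0]  # too short to sample several clips: use one clip from the start
    last_start = duration_s - clip_s
    return [i * last_start / (n_clips - 1) for i in range(n_clips)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_clip_similarity(
    generated: Sequence[Sequence[float]], reference: Sequence[Sequence[float]]
) -> float:
    """Average cosine similarity over paired clip embeddings (pairing is an assumption)."""
    pairs = list(zip(generated, reference))
    if not pairs:
        raise ValueError("need at least one pair of clip embeddings")
    return sum(cosine_similarity(g, r) for g, r in pairs) / len(pairs)


print(clip_starts(60.0))  # [0.0, 25.0, 50.0]
```

A score near 1 means the synthesized voice embedding points in nearly the same direction as the human speaker's; the exact number of clips and how they are paired are not specified here and may differ in the paper.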

The authors note that IP Memory is the most novel of the four and was inspired by real-conference interactions where attendees who recall a presentation are more likely to approach the author with relevant questions later.[1]
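
The double-order comparison behind PresentArena can be illustrated with a short sketch. It assumes a hypothetical `judge(first, second)` callable standing in for the VideoLLM, which returns `"first"` or `"second"`. Counting each ordering as a separate comparison is also an assumption; the paper may aggregate the two orders differently.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface: given two videos in viewing order,
# return "first" or "second" for the preferred one.
Judge = Callable[[str, str], str]


def double_order_win_rate(pairs: Iterable[Tuple[str, str]], judge: Judge) -> float:
    """Win rate of generated videos when each (generated, human) pair is judged in both orders."""
    wins = 0
    total = 0
    for generated, human in pairs:
        # Order 1: the generated video is shown first.
        if judge(generated, human) == "first":
            wins += 1
        # Order 2: the generated video is shown second.
        if judge(human, generated) == "second":
            wins += 1
        total += 2
    return wins / total if total else 0.0


# A judge that always prefers whatever it sees first is pure position bias;
# flipping the order makes that bias cancel out to a 50% win rate.
always_first = lambda a, b: "first"
print(double_order_win_rate([("generated.mp4", "human.mp4")], always_first))  # 0.5
```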

## System: PaperTalker

PaperTalker is a multi-agent pipeline that converts a paper into a narrated presentation video. The pipeline is designed to scale slide-wise in parallel for efficiency.[1] It comprises four "builders" with clearly decoupled responsibilities:

### 1. Slide builder

The slide builder synthesizes slides with [LaTeX](/wiki/latex) [Beamer](/wiki/beamer) code rather than the PowerPoint XML or PPTX templates favored by prior systems such as PPTAgent. The authors give three reasons: LaTeX arranges content from declarative parameters, Beamer is more compact than XML, and Beamer offers academically appropriate styles. Generation proceeds in three steps:

1. A coder LLM produces draft Beamer code from the paper's LaTeX source.
2. The code is compiled with the Tectonic engine; warnings and errors trigger a focused debugging routine that narrows the relevant lines and asks the model to repair them.
3. Slides flagged for layout overflow are sent through *Tree Search Visual Choice* refinement.[1]
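
The compile-and-repair cycle in step 2 can be sketched as a bounded loop. This is an illustrative outline under stated assumptions, not the repository's code: `compile_fn` stands in for a Tectonic invocation and `repair_fn` for the LLM repair call, and both names, the `CompileResult` type, and the three-round limit are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CompileResult:
    ok: bool          # True when the compiler reported no errors or warnings
    log: List[str]    # compiler messages, one per line


def build_slides(
    draft: str,
    compile_fn: Callable[[str], CompileResult],
    repair_fn: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Compile Beamer source, asking repair_fn to fix it until it compiles cleanly."""
    source = draft
    for _ in range(max_rounds):
        result = compile_fn(source)
        if result.ok:
            return source
        # Narrow the log to the lines that mention a problem; fall back to the full log.
        issues = [
            line for line in result.log
            if "error" in line.lower() or "warning" in line.lower()
        ] or result.log
        source = repair_fn(source, issues)
    # The last repair has not been compiled yet, so check it once more.
    if compile_fn(source).ok:
        return source
    raise RuntimeError("Beamer source still fails to compile after repair attempts")
```

Passing the compiler and the repair model in as callables keeps the loop testable without a LaTeX installation or an API key.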

#### Tree Search Visual Choice

Prompting LLMs and VLMs to directly tune numeric layout parameters (font size, figure scale, margins) is unreliable because the models are largely insensitive to small numeric edits. Tree Search Visual Choice instead constructs a neighborhood of candidate parameter values (for example figure scaling factors of 1.25, 0.75, 0.5, 0.25), renders each variant to an image, then asks a VLM to score the candidates on a single composite figure and pick the best. This decouples discrete layout search from semantic reasoning and resolves overflow issues with minimal token cost.[1]
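
A single level of this search can be sketched as "render every candidate, then let a judge choose". The sketch below is a simplified illustration, assuming hypothetical `render` and `pick_best` callables in place of the Beamer renderer and the VLM. It does not reproduce the full tree search over several parameters.

```python
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")


def visual_choice(
    scales: Sequence[float],
    render: Callable[[float], Image],
    pick_best: Callable[[List[Image]], int],
) -> float:
    """Render one candidate per scale and return the scale whose rendering the judge picks."""
    if not scales:
        raise ValueError("need at least one candidate scale")
    # The model never edits numbers directly; it only compares rendered results.
    candidates = [render(scale) for scale in scales]
    index = pick_best(candidates)  # judge sees all candidates together and returns an index
    if not 0 <= index < len(scales):
        raise ValueError("judge returned an out-of-range index")
    return scales[index]
```

With the candidate factors from the text, a call would look like `visual_choice((1.25, 0.75, 0.5, 0.25), render, pick_best)`. The numeric search stays in ordinary code, while the model is asked only for a visual judgment.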

### 2. Subtitle builder

The subtitle builder rasterizes each slide and feeds it to a VLM, which produces sentence-level subtitles paired with a *visual-focus prompt* describing where on the slide attention should be directed during that sentence. These prompts bridge speech and cursor motion.[1]

### 3. Cursor builder

The cursor builder grounds the visual-focus prompts into screen coordinates. Spatial alignment uses ShowUI or UI-TARS to predict an (x, y) location per sentence from the slide screenshot. Temporal alignment uses [WhisperX](/wiki/whisperx) to extract word-level timestamps, yielding (t_start, t_end) per sentence. The authors simplify by assuming the cursor stays still within a sentence and only moves between sentences.[1][6][7]
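
The two alignments combine into a simple cursor track: one fixed point per sentence, active from the sentence's first word to its last. The following sketch assumes word-level timestamps and per-sentence (x, y) predictions are already available. The `Word` and `CursorKeyframe` types and all function names are hypothetical, and holding the previous position during pauses between sentences is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class CursorKeyframe:
    t_start: float
    t_end: float
    x: int
    y: int


def sentence_spans(words: List[Word], sentence_lengths: List[int]) -> List[Tuple[float, float]]:
    """Group consecutive words into sentences and return (t_start, t_end) for each."""
    if any(n <= 0 for n in sentence_lengths) or sum(sentence_lengths) != len(words):
        raise ValueError("sentence lengths must be positive and cover all words exactly")
    spans, i = [], 0
    for n in sentence_lengths:
        chunk = words[i:i + n]
        spans.append((chunk[0].start, chunk[-1].end))
        i += n
    return spans


def cursor_track(
    spans: List[Tuple[float, float]], points: List[Tuple[int, int]]
) -> List[CursorKeyframe]:
    """Pair each sentence's time span with its predicted cursor location."""
    if len(spans) != len(points):
        raise ValueError("need exactly one cursor point per sentence")
    return [CursorKeyframe(s, e, x, y) for (s, e), (x, y) in zip(spans, points)]


def cursor_at(track: List[CursorKeyframe], t: float) -> Tuple[int, int]:
    """Cursor position at time t; the cursor holds still until the next sentence starts."""
    if not track:
        raise ValueError("empty cursor track")
    current = track[0]
    for keyframe in track:  # track is assumed sorted by t_start
        if keyframe.t_start <= t:
            current = keyframe
        else:
            break
    return (current.x, current.y)
```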

### 4. Talker builder

The talker builder produces a personalized presenter clip per slide. Speech synthesis uses F5-TTS, a flow-matching TTS model, conditioned on a short voice sample. Talking-head rendering uses Hallo2 by default; FantasyTalking is supported for upper-body articulation. The authors generate one clip per slide in parallel, relying on the natural scene cut between slides to mask the lack of temporal continuity, and report more than a 6 times speedup over sequential generation.[1][8]
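
Because each slide's clip is independent, the per-slide work maps naturally onto a worker pool. The sketch below shows only that pattern, with threads standing in for whatever GPU scheduling the real pipeline uses. `render_clip` is a hypothetical callable representing speech synthesis plus talking-head rendering for one slide, and the default of eight workers merely mirrors the eight GPUs mentioned elsewhere in this article.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")  # slide description
C = TypeVar("C")  # rendered clip (for example, a file path)


def render_clips_in_parallel(
    slides: Sequence[S],
    render_clip: Callable[[S], C],
    max_workers: int = 8,
) -> List[C]:
    """Render one clip per slide concurrently; results come back in slide order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so clips can be concatenated directly.
        return list(pool.map(render_clip, slides))


clips = render_clips_in_parallel(["slide1", "slide2"], lambda s: s + ".mp4", max_workers=2)
print(clips)  # ['slide1.mp4', 'slide2.mp4']
```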

### Implementation notes

The repository uses the Tectonic LaTeX engine and supports both commercial and local VLM back ends. Recommended choices are [GPT-4.1](/wiki/gpt-4.1) for slide and subtitle generation and [Gemini](/wiki/gemini) 2.5 Flash or Pro for the VideoLLM judge, with local [Qwen](/wiki/qwen) variants supported. The minimum recommended GPU is an [NVIDIA](/wiki/nvidia) A6000 with 48 GB; the authors ran inference on eight A6000s. A "light" mode without the talking head, the same "fast" variant noted under History, is provided for quicker generation.[1][3]

## Results and findings

On automated metrics, PaperTalker reports the strongest performance among automatic baselines on the Paper2Video Benchmark. Notably, on PresentQuiz it surpasses human-recorded presentations by roughly 10 percentage points, which the authors attribute to PaperTalker producing more concentrated information per minute of video and to the cursor providing explicit attentional cues that help the VideoLLM judge.[1]

### Main result snapshot

Approximate values reported in Table 2 of the paper for selected metrics:

| Method | PresentArena win vs. human (%) | PresentQuiz accuracy (%) | IP Memory (%) |
| --- | --- | --- | --- |
| Human-made | 50.0 | ~85 to 90 | reference |
| PaperTalker (full) | ~17 | ~95 | ~50 |
| PaperTalker without talker and cursor | ~15 | lower | lower |
| [PresentAgent](/wiki/presentagent) | ~2 | ~65 | ~12 |
| Veo 3 (end-to-end) | ~1 | ~58 | ~31 |

PresentArena values are pairwise win rates against the human reference, so 50% indicates a tie. PaperTalker's lower number reflects the gap on perceived overall quality, despite its higher quiz accuracy.[1]

### Cursor ablation

A localization QA shows a large gain from explicit cursor guidance:

| Method or variant | Localization accuracy |
| --- | --- |
| PaperTalker without cursor | 0.084 |
| PaperTalker with cursor | 0.633 |

Source: paper (Table 4).[1]

### Tree Search Visual Choice ablation

Slides were scored on a 1 to 5 scale across content, design, and coherence, following the PPTAgent rubric. Removing the layout refinement module lowered the design score from roughly 2.85 to 2.53, consistent with its role in resolving overflow.[1][10]

### Runtime snapshot (per paper, representative setting)

| Method | Time (min) | Notes |
| --- | --- | --- |
| PaperTalker (full) | 48.1 | Includes talking head, parallelized over 8 A6000s |
| PaperTalker without talker | 15.6 | "Fast" variant |
| PaperTalker without parallelization | 287.2 | For comparison |

Slide-wise parallelization cuts the full pipeline from 287.2 to 48.1 minutes per paper, roughly a 6 times speedup (287.2 / 48.1 ≈ 5.97) over sequential generation in the agentic pipeline. Token cost per presentation was reported at roughly $0.001, compared with about $0.003 for [PresentAgent](/wiki/presentagent) at the time of evaluation.[1]

### Human evaluation

Ten participants rated videos generated by different methods on a scale from 1 (worst) to 5 (best) across ten randomly sampled papers. Human-made videos scored highest, with PaperTalker ranking second and clearly ahead of all automatic baselines. The authors interpret this as evidence that the gap with humans has narrowed but not closed.[1]

## Availability and licensing

| Resource | Location | License |
| --- | --- | --- |
| Code | [github.com/showlab/Paper2Video](/wiki/github) | [MIT](/wiki/mit_license) |
| Dataset | [Hugging Face](/wiki/hugging_face) (ZaynZhu/Paper2Video) | research use, see card |
| Project page | [showlab.github.io/Paper2Video](/wiki/show_lab) | not applicable |
| Paper | [arXiv](/wiki/arxiv) 2510.05096 | arXiv default |

The repository ships configuration for both commercial APIs (OpenAI, Google) and local model serving, plus a separate Hallo2 environment for the talking-head module.[3]

## Relation to prior work

Paper2Video sits within a small but growing cluster of "AI for Research" agents. The closest comparisons are document-to-media systems and slide-generation tools.

| System | Year | Input | Slides | Subtitles | Cursor | Face | Voice |
| --- | --- | --- | --- | --- | --- | --- | --- |
| D2S (Sun et al.) | 2021 | Document | yes | no | no | no | no |
| [PPTAgent](/wiki/pptagent) | 2025 | Doc + template | yes | no | no | no | no |
| [Paper2Poster](/wiki/paper2poster) | 2025 | Paper | poster only | no | no | no | no |
| [PresentAgent](/wiki/presentagent) | 2025 | Doc + template | yes | yes | no | no | no |
| [Paper2Agent](/wiki/paper2agent) | 2025 | Paper | n/a (interactive) | n/a | n/a | n/a | n/a |
| **Paper2Video / PaperTalker** | 2025 | Paper + image + audio | yes | yes | yes | yes | yes |

Key contrasts:

- **[Paper2Poster](/wiki/paper2poster)** generates posters and contributes the PaperQuiz metric that inspired PresentQuiz, but produces no video or audio.[9]
- **[PPTAgent](/wiki/pptagent)** targets document-to-slides workflows and introduces PPTEval (content, design, coherence); PaperTalker reuses the rubric but writes Beamer rather than editing PPTX.[10]
- **[PresentAgent](/wiki/presentagent)** generates presentation videos from documents but lacks personalization and academic-style structure such as opening or outline slides.[11]
- **[Paper2Code](/wiki/paper2code)** automates code generation from ML papers, a sibling system targeting a different output modality.[12]
- **[Veo 3](/wiki/veo_3)** and similar end-to-end video diffusion models can render a high-quality presenter shot but are constrained to roughly 8 second durations with blurred on-screen text, a key limitation as an academic-talk baseline.[13]

More distantly, the project sits within the wider "AI4Research" agenda surveyed by Chen et al. (2025), spanning literature surveying, idea generation, and replication benchmarks such as PaperBench and SciReplicate-Bench.[1]

## Reception and uptake

Within a week of release the GitHub repository accumulated thousands of stars and reached the front page of Hacker News, with discussion focused on whether automated talks could meaningfully replace recorded conference presentations. Commentary by AlphaXiv and Emergent Mind highlighted the IP Memory metric as a novel attempt to quantify the "author visibility" purpose of conference talks.[3][14][15]

The authors framed the work as a practical step rather than a finished product. They acknowledge that talking-head fidelity, gesture realism, and handling of equation-heavy slides remain open, and that the benchmark covers only AI conference papers from the past three years.[1]

## See also

- [PaperTalker](/wiki/papertalker)
- [Paper2Poster](/wiki/paper2poster)
- [PPTAgent](/wiki/pptagent)
- [PresentAgent](/wiki/presentagent)
- [Paper2Agent](/wiki/paper2agent)
- [Paper2Code](/wiki/paper2code)
- [Vision-language model](/wiki/vision_language_model)
- [Video large language model](/wiki/video_large_language_model)
- [LaTeX](/wiki/latex), [Beamer](/wiki/beamer)
- [Hallo2](/wiki/hallo2), [F5-TTS](/wiki/f5_tts), [WhisperX](/wiki/whisperx)
- [UI-TARS](/wiki/ui_tars), [ShowUI](/wiki/showui)
- [NeurIPS](/wiki/neurips), [arXiv](/wiki/arxiv)

## References

1. Zhu, Z., Lin, K. Q., Shou, M. Z. "Paper2Video: Automatic Video Generation from Scientific Papers". arXiv preprint 2510.05096, October 2025. https://arxiv.org/abs/2510.05096
2. Show Lab, NUS. "Paper2Video project page". https://showlab.github.io/Paper2Video/
3. Show Lab, NUS. "showlab/Paper2Video GitHub repository". https://github.com/showlab/Paper2Video
4. Zhu, Z. "ZaynZhu/Paper2Video dataset card on Hugging Face". https://huggingface.co/datasets/ZaynZhu/Paper2Video
5. NeurIPS 2025 Scaling Environments for Agents (SEA) Workshop. https://sea-workshop.github.io/
6. Lin, K. Q. et al. "ShowUI: One Vision-Language-Action Model for GUI Visual Agent". CVPR 2025. https://arxiv.org/abs/2411.17465
7. Bain, M., Huh, J., Han, T., Zisserman, A. "WhisperX: Time-Accurate Speech Transcription of Long-Form Audio". arXiv 2303.00747, 2023. https://arxiv.org/abs/2303.00747
8. Cui, J. et al. "Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation". arXiv 2410.07718, 2024. https://arxiv.org/abs/2410.07718
9. Pang, W. et al. "Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers". arXiv 2505.21497, 2025. https://arxiv.org/abs/2505.21497
10. Zheng, H. et al. "PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides". arXiv 2501.03936, 2025. https://arxiv.org/abs/2501.03936
11. Shi, J. et al. "PresentAgent: Multimodal Agent for Presentation Video Generation". arXiv 2507.04036, 2025. https://arxiv.org/abs/2507.04036
12. Seo, M. et al. "Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning". arXiv 2504.17192, 2025. https://arxiv.org/abs/2504.17192
13. DeepMind. "Veo 3 Technical Report". May 2025. https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf
14. Hugging Face Papers. "Paper2Video paper page". https://huggingface.co/papers/2510.05096
15. AlphaXiv. "Paper2Video: Automatic Video Generation from Scientific Papers". https://www.alphaxiv.org/overview/2510.05096v2

