# Paper2Video

> Source: https://aiwiki.ai/wiki/paper2video
> Updated: 2026-07-07
> Categories: AI Benchmarks, AI Research, Multimodal AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Paper2Video** (full title: *Paper2Video: Automatic Video Generation from Scientific Papers*) is a research project from [Show Lab](/wiki/show_lab) at the [National University of Singapore](/wiki/national_university_of_singapore) that formalizes and evaluates automatic generation of academic presentation videos directly from scientific papers. It comprises (1) the **Paper2Video Benchmark**, 101 paper-video pairs with slides and speaker metadata, and (2) **PaperTalker**, a multi-agent framework that turns a paper (plus a reference image and short voice sample) into a narrated presentation video with slides, subtitles, cursor highlights, and an optional talking-head presenter.[1][2] According to the paper, PaperTalker "outperforms human-made presentations by 10% in PresentQuiz accuracy and achieves comparable ratings in user studies," a result the authors read as automated quality approaching that of human-created content.[1] The code is open-sourced under the [MIT license](/wiki/mit_license) on [GitHub](/wiki/github), and the benchmark dataset is hosted on [Hugging Face](/wiki/hugging_face).[3][4]

The authors are Zeyu Zhu, Kevin Qinghong Lin, and Mike Zheng Shou (corresponding author), with Zhu and Lin contributing equally. The paper was first posted to [arXiv](/wiki/arxiv) on 6 October 2025 (preprint 2510.05096) and accepted to the Scaling Environments for Agents (SEA) Workshop at [NeurIPS](/wiki/neurips) 2025, where it was presented as a poster in San Diego on 7 December 2025.[1][5][16] In mid-October 2025 the project briefly trended on Hacker News, and the GitHub repository accumulated thousands of stars within days.[3]

## What is the difference between Paper2Video and PaperTalker?

The name **Paper2Video** refers to both the benchmark and the overall project; the video-generation agent is called **PaperTalker**. Unlike natural video generation, the authors argue, presentation video generation "involves distinctive challenges: long-context inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and human talker."[1] Evaluation is therefore focused on faithfulness, audience comprehension, and author visibility rather than purely natural-video realism.[1][2]

The authors describe academic presentation video generation as "a superproblem" of related document-to-media tasks such as slide and poster generation, calling it "a practical yet more challenging direction."[1] Producing such a talk manually, the paper notes, "remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2 to 10 minutes video," which motivates the case for automation.[1]

## When was Paper2Video released?

| Date | Event |
| --- | --- |
| 28 September 2025 | Project accepted as a poster at the SEA (Scaling Environments for Agents) Workshop at [NeurIPS](/wiki/neurips) 2025.[5] |
| 6 October 2025 | arXiv v1 posted; code and dataset released on [GitHub](/wiki/github) and [Hugging Face](/wiki/hugging_face).[1][3] |
| 9 October 2025 | arXiv v2 released with minor revisions. The four metrics (Meta Similarity, PresentArena, PresentQuiz, IP Memory) were already part of v1.[1] |
| 11 October 2025 | Project featured on the front page of Hacker News (run by Y Combinator), driving a spike in repository stars.[3] |
| 15 October 2025 | A "fast" variant without the talking head added to the repository for quicker generation.[3] |
| 7 December 2025 | Presented as a poster at the NeurIPS 2025 SEA Workshop in San Diego.[16] |

## Who created Paper2Video?

The paper was produced by [Show Lab](/wiki/show_lab), a research group at the [National University of Singapore](/wiki/national_university_of_singapore) led by Mike Zheng Shou. The same group has produced related work on GUI agents (ShowUI) and multimodal generation, several of which are cited as building blocks of PaperTalker.[1][6]

| Author | Role |
| --- | --- |
| Zeyu Zhu | Co-first author, Show Lab, [NUS](/wiki/national_university_of_singapore) |
| Kevin Qinghong Lin | Co-first author, Show Lab, NUS (also lead author of ShowUI) |
| Mike Zheng Shou | Corresponding author, principal investigator of Show Lab |

## What is the Paper2Video benchmark?

The benchmark pairs recent conference papers with the authors' presentation videos, original slide decks (when available), and presenter identity metadata. Public sources include [YouTube](/wiki/youtube) and SlidesLive, supplemented with portrait images sourced from authors' personal websites. Papers without sufficient metadata were excluded during curation.[1]

### Composition and statistics

| Item | Value (aggregate) |
| --- | --- |
| Number of paper-video pairs | 101 |
| Average words per paper | ~13.3K (~3.3K tokens) |
| Average figures per paper | ~44.7 |
| Average pages per paper | ~28.7 |
| Average slides per presentation | ~16 |
| Average talk duration | ~6 minutes 15 seconds (range: 2 to 14 minutes) |
| Slides per video range | 4 to 28 |
| Original slide PDFs available | for ~40% of entries |

Sources: project page, dataset card, and paper.[1][2][4]

A domain breakdown reported in the paper is:

| Area | Count (papers) | Example venues |
| --- | --- | --- |
| Machine learning | 41 | [NeurIPS](/wiki/neurips), [ICLR](/wiki/iclr), [ICML](/wiki/icml) |
| Computer vision | 40 | [CVPR](/wiki/cvpr), ICCV, ECCV |
| Natural language processing | 20 | ACL, EMNLP, NAACL |

Each instance includes the paper's full [LaTeX](/wiki/latex) project, an author-recorded presentation video (slide and talking-head streams), and speaker identity (portrait and short voice sample). For roughly 40% of entries, original slide PDFs are also collected, enabling reference-based slide evaluation.[1]

### Curation rationale

The authors chose AI conference papers because the field's open-sharing culture provides polished author-recorded presentations on YouTube and SlidesLive, and because such papers offer diverse content with rich text, figures, and tables. They explicitly framed the benchmark as evaluating long-horizon agentic tasks rather than generic video generation, distinguishing it from natural-video benchmarks such as VBench.[1]

## How is Paper2Video evaluated?

Paper2Video proposes four tailored metrics for academic presentation videos, using vision-language models (VLMs) and VideoLLMs as automated judges where appropriate.[1] The authors argue that conventional metrics from natural video synthesis (such as FVD, IS, or CLIP similarity) miss the central purpose of an academic talk: communicating scholarship to an audience and amplifying author visibility.

| Metric | What it measures | How it is operationalized |
| --- | --- | --- |
| **Meta Similarity** | Alignment of generated assets with human-authored ones (slides, subtitles, speech timbre) | A VLM compares generated slide-subtitle pairs to the human versions on a five-point scale; speech similarity uses embedding cosine similarity on uniformly sampled 10-second clips via SpeechBrain.[1] |
| **PresentArena** | Overall preference and quality in head-to-head comparisons | A VideoLLM performs double-order pairwise comparisons between generated and human-made presentation videos; winning rate is the metric (order flipping reduces position bias).[1] |
| **PresentQuiz** | Information coverage and comprehension | Multiple-choice questions are generated from the source paper covering both fine-grained details and higher-level understanding; a VideoLLM watches the video and answers; overall accuracy is reported.[1] |
| **IP Memory** | Memorability and the audience's ability to associate authors with their work | A recall task asks a VideoLLM to match brief 5-second video clips to a relevant question given a speaker image; accuracy reflects retention and associative memory.[1] |

The authors note that IP Memory is the most novel of the four and was inspired by real-conference interactions where attendees who recall a presentation are more likely to approach the author with relevant questions later.[1]
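
The double-order comparison behind PresentArena can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the `judge` callable stands in for a VideoLLM call, the names `Judge` and `double_order_win_rate` are hypothetical, and counting each ordering as a separate comparison is an assumption about how the win rate is aggregated.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface: judge(first, second) returns "first" or
# "second" to name the preferred video. A real judge would call a VideoLLM.
Judge = Callable[[str, str], str]


def double_order_win_rate(pairs: Iterable[Tuple[str, str]], judge: Judge) -> float:
    """Fraction of comparisons won by the generated video over both orderings."""
    wins = 0
    total = 0
    for generated, human in pairs:
        # Ordering 1: the generated video is shown first.
        wins += judge(generated, human) == "first"
        # Ordering 2: the generated video is shown second.
        wins += judge(human, generated) == "second"
        total += 2
    if total == 0:
        raise ValueError("no pairs to compare")
    return wins / total
```

Under this convention, a judge that always prefers whichever video appears first ends up at exactly 0.5, which illustrates why flipping the order reduces position bias.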

## How does PaperTalker work?

PaperTalker is a multi-agent pipeline that converts a paper into a narrated presentation video. The pipeline is designed to scale slide-wise in parallel for efficiency.[1] It comprises four "builders" with clearly decoupled responsibilities:

### 1. Slide builder

The slide builder synthesizes slides with [LaTeX](/wiki/latex) [Beamer](/wiki/beamer) code rather than the PowerPoint XML or PPTX templates favored by prior systems such as PPTAgent. The authors give three reasons: LaTeX arranges content from declarative parameters, Beamer is more compact than XML, and Beamer offers academically appropriate styles. Generation proceeds in three steps:

1. A coder LLM produces draft Beamer code from the paper's LaTeX source.
2. The code is compiled with the Tectonic engine; warnings and errors trigger a focused debugging routine that narrows the relevant lines and asks the model to repair them.
3. Slides flagged for layout overflow are sent through *Tree Search Visual Choice* refinement.[1]
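
The draft, compile, and repair steps above can be sketched as a small loop. This is an illustration under stated assumptions rather than the repository's implementation: `build_with_repair` and `repair_fn` are hypothetical names, the repair callable stands in for an LLM call, and the sketch reacts only to a failed compile, whereas the source says warnings also trigger debugging. The helper `compile_with_tectonic` assumes the `tectonic` executable is on the PATH and discards the PDF, since it only checks that the source compiles.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Tuple

# Result of one compile attempt: (succeeded, combined log text).
CompileResult = Tuple[bool, str]


def compile_with_tectonic(source: str) -> CompileResult:
    """Compile Beamer source in a temporary directory with the tectonic CLI."""
    with tempfile.TemporaryDirectory() as tmp:
        tex_path = Path(tmp) / "slides.tex"
        tex_path.write_text(source, encoding="utf-8")
        proc = subprocess.run(
            ["tectonic", str(tex_path)], capture_output=True, text=True
        )
        return proc.returncode == 0, proc.stdout + proc.stderr


def build_with_repair(
    draft: str,
    compile_fn: Callable[[str], CompileResult],
    repair_fn: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Compile the draft, allowing up to max_rounds repair attempts."""
    source = draft
    for _ in range(max_rounds):
        ok, log = compile_fn(source)
        if ok:
            return source, True
        # Hypothetical LLM call: return repaired source given the compiler log.
        source = repair_fn(source, log)
    ok, _ = compile_fn(source)  # check the result of the last repair
    return source, ok
```

Passing the compiler and the repair step in as callables keeps the loop testable without a LaTeX installation or a model endpoint.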

#### Tree Search Visual Choice

Prompting LLMs and VLMs to directly tune numeric layout parameters (font size, figure scale, margins) is unreliable because the models are largely insensitive to small numeric edits. Tree Search Visual Choice instead constructs a neighborhood of candidate parameter values (for example figure scaling factors of 1.25, 0.75, 0.5, 0.25), renders each variant to an image, then asks a VLM to score the candidates on a single composite figure and pick the best. This decouples discrete layout search from semantic reasoning and resolves overflow issues with minimal token cost.[1]
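
One level of this candidate-and-choose step might look like the sketch below. It is an assumption-laden illustration: `visual_choice`, `render`, and `choose` are hypothetical names, `render` stands in for compiling a slide variant to an image, and `choose` stands in for the VLM that inspects the composite figure and returns the index of the preferred candidate. The full tree search over several parameters is not shown.

```python
from typing import Callable, Sequence, TypeVar

Image = TypeVar("Image")


def visual_choice(
    base_scale: float,
    factors: Sequence[float],
    render: Callable[[float], Image],
    choose: Callable[[Sequence[Image]], int],
) -> float:
    """Render one slide variant per candidate scale and let a judge pick one."""
    # Build the neighborhood of candidate values around the current setting.
    candidates = [base_scale * factor for factor in factors]
    # Render each candidate; a real pipeline would compile and rasterize here.
    images = [render(scale) for scale in candidates]
    # Hypothetical VLM call: returns the index of the best-looking variant.
    best_index = choose(images)
    return candidates[best_index]
```

The model never has to propose a number in this design; it only compares rendered outcomes, which is the decoupling the paragraph above describes.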

### 2. Subtitle builder

The subtitle builder rasterizes each slide and feeds it to a VLM, which produces sentence-level subtitles paired with a *visual-focus prompt* describing where on the slide attention should be directed during that sentence. These prompts bridge speech and cursor motion.[1]

### 3. Cursor builder

The cursor builder grounds the visual-focus prompts into screen coordinates. Spatial alignment uses [ShowUI](/wiki/showui) or [UI-TARS](/wiki/ui_tars) to predict an (x, y) location per sentence from the slide screenshot. Temporal alignment uses [WhisperX](/wiki/whisperx) to extract word-level timestamps, yielding (t_start, t_end) per sentence. The authors simplify by assuming the cursor stays still within a sentence and only moves between sentences.[1][6][7]
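
Combining the two alignments gives a piecewise-constant cursor track. The sketch below is illustrative only: `CursorSegment`, `build_track`, and `cursor_at` are hypothetical names, the inputs are assumed to be already ordered by start time, and holding the previous position during the gap between sentences is an assumption, since the source says only that the cursor moves between sentences.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class CursorSegment:
    t_start: float  # sentence start time in seconds
    t_end: float    # sentence end time in seconds
    x: int          # grounded cursor x coordinate in pixels
    y: int          # grounded cursor y coordinate in pixels


def build_track(
    points: Sequence[Tuple[int, int]],
    spans: Sequence[Tuple[float, float]],
) -> List[CursorSegment]:
    """Pair each sentence's (x, y) with its (t_start, t_end)."""
    if len(points) != len(spans):
        raise ValueError("need one point and one time span per sentence")
    return [CursorSegment(s, e, x, y) for (x, y), (s, e) in zip(points, spans)]


def cursor_at(track: Sequence[CursorSegment], t: float) -> Optional[Tuple[int, int]]:
    """Cursor position at time t, or None before the first sentence starts."""
    position = None
    for segment in track:  # assumes segments are sorted by t_start
        if t < segment.t_start:
            break
        position = (segment.x, segment.y)
    return position
```

For example, with sentences at 0.0 to 2.5 s and 3.0 to 5.0 s, a query at 2.8 s returns the first sentence's position because the cursor has not yet moved.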

### 4. Talker builder

The talker builder produces a personalized presenter clip per slide. Speech synthesis uses [F5-TTS](/wiki/f5_tts), a flow-matching TTS model, conditioned on a short voice sample. Talking-head rendering uses [Hallo2](/wiki/hallo2) by default; FantasyTalking is supported for upper-body articulation. The authors generate one clip per slide in parallel, relying on the natural scene cut between slides to mask the lack of temporal continuity, and report roughly a 6 times speedup over sequential generation.[1][8]
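
Because each slide's clip is independent, the per-slide fan-out can be sketched with a standard worker pool. This is a minimal sketch, not the repository's scheduler: `render_clips_in_parallel` and `render_clip` are hypothetical names, `render_clip` stands in for speech synthesis plus talking-head rendering, and a thread pool is an assumption (a real deployment would more likely dispatch work across GPU processes).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def render_clips_in_parallel(
    slide_scripts: Sequence[str],
    render_clip: Callable[[str], str],
    max_workers: int = 8,
) -> List[str]:
    """Render one presenter clip per slide concurrently, keeping slide order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so clip i matches slide i
        # even when later slides finish rendering first.
        return list(pool.map(render_clip, slide_scripts))
```

Preserving input order matters here because the clips are concatenated slide by slide afterwards.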

### Implementation notes

The repository uses the Tectonic LaTeX engine and supports both commercial and local VLM back ends. Recommended choices are [GPT-4.1](/wiki/gpt-4.1) for slide and subtitle generation and [Gemini](/wiki/gemini) 2.5 Flash or Pro for the VideoLLM judge, with local [Qwen](/wiki/qwen) variants supported. The minimum recommended GPU is an [NVIDIA](/wiki/nvidia) A6000 with 48 GB; the authors ran inference on eight A6000s. A "light" mode that omits the talking head (the "fast" variant) is provided for quicker generation.[1][3]

## How well does PaperTalker perform?

On automated metrics, the paper reports that PaperTalker outperforms every automatic baseline on the Paper2Video Benchmark. The headline finding is that on PresentQuiz, a multiple-choice comprehension test answered by a VideoLLM after watching the talk, PaperTalker surpasses the human-recorded presentations. The paper summarizes this as the system outperforming "human-made presentations by 10% in PresentQuiz accuracy" while achieving "comparable ratings in user studies."[1] The authors attribute the comprehension gain to PaperTalker producing more concentrated information per minute of video and to the cursor providing explicit attentional cues that help the VideoLLM judge.[1]

### PresentQuiz: comprehension versus human talks

| PresentQuiz subset | Human-made (%) | PaperTalker (%) |
| --- | --- | --- |
| Understanding | 90.8 | 95.1 |
| Detail | 73.8 | 84.2 |

The gap is widest on the fine-grained Detail questions (84.2% versus 73.8%, roughly 10 points), which is the source of the paper's headline "10%" claim.[1]
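
The distinction between an absolute gap in percentage points and a relative gain can be checked from the table values. The snippet below is a small arithmetic sketch; the function name `quiz_gaps` is hypothetical and the numbers are copied from the table above.

```python
# PresentQuiz accuracies (%) copied from the table above.
HUMAN = {"Understanding": 90.8, "Detail": 73.8}
PAPERTALKER = {"Understanding": 95.1, "Detail": 84.2}


def quiz_gaps(human, generated):
    """Return (absolute gap in points, relative gain in %) per subset."""
    gaps = {}
    for subset, human_acc in human.items():
        points = generated[subset] - human_acc
        relative = 100.0 * points / human_acc
        gaps[subset] = (round(points, 1), round(relative, 1))
    return gaps


gaps = quiz_gaps(HUMAN, PAPERTALKER)
```

On these figures the Detail gap is 10.4 percentage points, which corresponds to a relative gain of about 14.1%, so the "10%" reads most naturally as an absolute difference.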

### Main result snapshot

Values reported in the paper for the head-to-head preference and memorability metrics:

| Method | PresentArena win vs. human (%) | IP Memory (%) |
| --- | --- | --- |
| Human-made | 50.0 (tie reference) | reference |
| PaperTalker (full) | 17.0 | 50.0 |
| PaperTalker without talker and cursor | 15.2 | not applicable |
| [PresentAgent](/wiki/presentagent) | 2.0 | 12.5 |
| Veo3 (end-to-end) | 1.2 | 31.3 |

PresentArena values are pairwise win rates against the human reference, so 50% indicates a tie. PaperTalker's 17.0% reflects the remaining gap on perceived overall quality, even though it leads every automatic baseline by a wide margin and beats humans on comprehension. Two end-to-end video diffusion baselines, Veo3 and Wan2.2, won only 1.2% and 1.1% of PresentArena comparisons against human talks. On IP Memory, a recall task that tests whether a viewer can associate a talk with its author, PaperTalker (50.0%) scores about 1.6 times the best end-to-end video model (Veo3, 31.3%) and four times the slide-based [PresentAgent](/wiki/presentagent) (12.5%).[1]

### Cursor ablation

A localization question-answering (QA) test shows a large gain from explicit cursor guidance:

| Method or variant | Localization accuracy |
| --- | --- |
| PaperTalker without cursor | 0.084 |
| PaperTalker with cursor | 0.633 |

Source: paper (Table 4).[1]

### Tree Search Visual Choice ablation

Slides were scored on a 1 to 5 scale across content, design, and coherence, following the PPTAgent rubric. Removing the layout refinement module produced a pronounced drop in design quality (from roughly 2.85 to 2.53), confirming its role in resolving overflow.[1][10]

### Runtime snapshot (per paper, representative setting)

| Method | Time (min) | Notes |
| --- | --- | --- |
| PaperTalker (full) | 48.1 | Includes talking head, parallelized over 8 A6000s |
| PaperTalker without talker | 15.6 | "Fast" variant |
| PaperTalker without parallelization | 287.2 | For comparison |

Slide-wise parallelization yields roughly a 6 times speedup over sequential generation in the agentic pipeline (287.2 versus 48.1 minutes). Token usage per presentation was reported at roughly 62K tokens (about $0.001), well below the roughly 241K tokens (about $0.003) of [PresentAgent](/wiki/presentagent) at the time of evaluation, making PaperTalker about 3 times cheaper in cost and nearly 4 times lower in token usage.[1]
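
The ratios behind these runtime and cost figures can be recomputed from the reported numbers. This is an arithmetic sketch only; the helper name `ratio` is hypothetical and the token counts are the rounded values quoted in the text.

```python
def ratio(larger: float, smaller: float) -> float:
    """Return larger / smaller rounded to two decimals."""
    return round(larger / smaller, 2)


# Minutes per paper, from the runtime table above.
sequential_min, parallel_min, fast_min = 287.2, 48.1, 15.6
# Rounded token counts and costs per presentation, from the text.
presentagent_tokens, papertalker_tokens = 241_000, 62_000
presentagent_usd, papertalker_usd = 0.003, 0.001

speedup = ratio(sequential_min, parallel_min)              # parallel vs. sequential
fast_gain = ratio(parallel_min, fast_min)                  # full vs. no talker
token_ratio = ratio(presentagent_tokens, papertalker_tokens)
cost_ratio = ratio(presentagent_usd, papertalker_usd)
```

With the tabulated times the parallel speedup comes out just under 6 (about 5.97), while the token ratio (about 3.89) is larger than the cost ratio (3.0) because the dollar amounts are quoted to one significant figure.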

### Human evaluation

Ten participants rated videos generated by different methods on a scale from 1 (worst) to 5 (best) across ten randomly sampled papers. Human-made videos scored highest, with PaperTalker ranking second and clearly ahead of all automatic baselines. The authors interpret this as evidence that the gap with humans has narrowed but not closed.[1]

## Is Paper2Video open source?

Yes. The PaperTalker code is released under the [MIT license](/wiki/mit_license), and the 101-pair benchmark dataset is public on [Hugging Face](/wiki/hugging_face) under the terms stated on its dataset card.

| Resource | Location | License |
| --- | --- | --- |
| Code | [github.com/showlab/Paper2Video](/wiki/github) | [MIT](/wiki/mit_license) |
| Dataset | [Hugging Face](/wiki/hugging_face) (ZaynZhu/Paper2Video) | research use, see card |
| Project page | [showlab.github.io/Paper2Video](/wiki/show_lab) | not applicable |
| Paper | [arXiv](/wiki/arxiv) 2510.05096 | arXiv default |

The repository ships configuration for both commercial APIs (OpenAI, Google) and local model serving, plus a separate Hallo2 environment for the talking-head module.[3]

## How does Paper2Video compare to prior work?

Paper2Video sits within a small but growing cluster of "AI for Research" agents. The closest comparisons are document-to-media systems and slide-generation tools.

| System | Year | Input | Slides | Subtitles | Cursor | Face | Voice |
| --- | --- | --- | --- | --- | --- | --- | --- |
| D2S (Sun et al.) | 2021 | Document | yes | no | no | no | no |
| [PPTAgent](/wiki/pptagent) | 2025 | Doc + template | yes | no | no | no | no |
| [Paper2Poster](/wiki/paper2poster) | 2025 | Paper | poster only | no | no | no | no |
| [PresentAgent](/wiki/presentagent) | 2025 | Doc + template | yes | yes | no | no | no |
| [Paper2Agent](/wiki/paper2agent) | 2025 | Paper | n/a (interactive) | n/a | n/a | n/a | n/a |
| **Paper2Video / PaperTalker** | 2025 | Paper + image + audio | yes | yes | yes | yes | yes |

Key contrasts:

- **[Paper2Poster](/wiki/paper2poster)** generates posters and contributes the PaperQuiz metric that inspired PresentQuiz, but produces no video or audio.[9]
- **[PPTAgent](/wiki/pptagent)** targets document-to-slides workflows and introduces PPTEval (content, design, coherence); PaperTalker reuses the rubric but writes Beamer rather than editing PPTX.[10]
- **[PresentAgent](/wiki/presentagent)** generates presentation videos from documents but lacks personalization and academic-style structure such as opening or outline slides.[11]
- **[Paper2Code](/wiki/paper2code)** automates code generation from ML papers, a sibling system targeting a different output modality.[12]
- **[Veo 3](/wiki/veo_3)** and similar end-to-end video diffusion models can render a high-quality presenter shot but are constrained to roughly 8-second durations and produce blurred on-screen text, both key limitations for an academic-talk baseline.[13]

More distantly, the project sits within the wider "AI4Research" agenda surveyed by Chen et al. (2025), spanning literature surveying, idea generation, and replication benchmarks such as PaperBench and SciReplicate-Bench.[1]

## How was Paper2Video received?

Within a week of release the GitHub repository accumulated thousands of stars and reached the front page of Hacker News, with discussion focused on whether automated talks could meaningfully replace recorded conference presentations. Commentary by AlphaXiv and Emergent Mind highlighted the IP Memory metric as a novel attempt to quantify the "author visibility" purpose of conference talks.[3][14][15]

The authors framed the work as a practical step rather than a finished product. They acknowledge that talking-head fidelity, gesture realism, and handling of equation-heavy slides remain open, and that the benchmark covers only AI conference papers from the past three years.[1]

## See also

- [PaperTalker](/wiki/papertalker)
- [Paper2Poster](/wiki/paper2poster)
- [PPTAgent](/wiki/pptagent)
- [PresentAgent](/wiki/presentagent)
- [Paper2Agent](/wiki/paper2agent)
- [Paper2Code](/wiki/paper2code)
- [Vision-language model](/wiki/vision_language_model)
- [Video large language model](/wiki/video_large_language_model)
- [LaTeX](/wiki/latex), [Beamer](/wiki/beamer)
- [Hallo2](/wiki/hallo2), [F5-TTS](/wiki/f5_tts), [WhisperX](/wiki/whisperx)
- [UI-TARS](/wiki/ui_tars), [ShowUI](/wiki/showui)
- [NeurIPS](/wiki/neurips), [arXiv](/wiki/arxiv)

## References

1. Zhu, Z., Lin, K. Q., Shou, M. Z. "Paper2Video: Automatic Video Generation from Scientific Papers". arXiv preprint 2510.05096, October 2025. https://arxiv.org/abs/2510.05096
2. Show Lab, NUS. "Paper2Video project page". https://showlab.github.io/Paper2Video/
3. Show Lab, NUS. "showlab/Paper2Video GitHub repository". https://github.com/showlab/Paper2Video
4. Zhu, Z. "ZaynZhu/Paper2Video dataset card on Hugging Face". https://huggingface.co/datasets/ZaynZhu/Paper2Video
5. NeurIPS 2025 Scaling Environments for Agents (SEA) Workshop. https://sea-workshop.github.io/
6. Lin, K. Q. et al. "ShowUI: One Vision-Language-Action Model for GUI Visual Agent". CVPR 2025. https://arxiv.org/abs/2411.17465
7. Bain, M., Huh, J., Han, T., Zisserman, A. "WhisperX: Time-Accurate Speech Transcription of Long-Form Audio". arXiv 2303.00747, 2023.
8. Cui, J. et al. "Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation". arXiv 2410.07718, 2024.
9. Pang, W. et al. "Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers". arXiv 2505.21497, 2025.
10. Zheng, H. et al. "PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides". arXiv 2501.03936, 2025.
11. Shi, J. et al. "PresentAgent: Multimodal Agent for Presentation Video Generation". arXiv 2507.04036, 2025.
12. Seo, M. et al. "Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning". arXiv 2504.17192, 2025.
13. DeepMind. "Veo 3 Technical Report". May 2025. https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf
14. Hugging Face Papers. "Paper2Video paper page". https://huggingface.co/papers/2510.05096
15. AlphaXiv. "Paper2Video: Automatic Video Generation from Scientific Papers". https://www.alphaxiv.org/overview/2510.05096v2
16. NeurIPS 2025. "Paper2Video: Automatic Video Generation from Scientific Papers (SEA Workshop poster)". https://neurips.cc/virtual/2025/loc/san-diego/124558 ; OpenReview: https://openreview.net/forum?id=LvRHonr4gv