Paper2Video
Last reviewed
May 10, 2026
Sources
15 citations
Review status
Source-backed
Revision
v4 · 2,713 words
Paper2Video (full title: Paper2Video: Automatic Video Generation from Scientific Papers) is a research project from Show Lab at the National University of Singapore that formalizes and evaluates automatic generation of academic presentation videos directly from scientific papers. It comprises (1) the Paper2Video Benchmark, 101 paper-video pairs with slides and speaker metadata, and (2) PaperTalker, a multi-agent framework that turns a paper (plus a reference image and short voice sample) into a narrated presentation video with slides, subtitles, cursor highlights, and an optional talking-head presenter.[1][2] Code and data are open sourced under the MIT license on GitHub, and the benchmark is hosted on Hugging Face.[3][4]
The authors are Zeyu Zhu, Kevin Qinghong Lin, and Mike Zheng Shou (corresponding author), with Zhu and Lin contributing equally. The paper was first posted to arXiv on 6 October 2025 (preprint 2510.05096) and was accepted to the Scaling Environments for Agents (SEA) Workshop at NeurIPS 2025.[1][5] In mid-October 2025 the project briefly trended on Hacker News, and the GitHub repository accumulated thousands of stars within days.[3]
The name Paper2Video refers to both the benchmark and the overall project; the video-generation agent is called PaperTalker. The work targets long-context, multimodal inputs (text, figures, tables) and coordinated outputs across slides, subtitles, speech, cursor motion, and an optional talking head, with evaluation focused on faithfulness, audience comprehension, and author visibility rather than purely natural-video realism.[1][2]
The authors describe academic presentation video generation as a "superproblem" of related document-to-media tasks such as slide and poster generation. Producing a 2- to 10-minute conference talk video by hand typically takes several hours of slide design, narration recording, and editing, which is the authors' motivation for automating the process.[1]
| Date | Event |
|---|---|
| 28 September 2025 | Project accepted as a poster at the SEA (Scaling Environments for Agents) Workshop at NeurIPS 2025.[5] |
| 6 October 2025 | arXiv v1 posted; code and dataset released on GitHub and Hugging Face.[1][3] |
| 9 October 2025 | arXiv v2 released with minor revisions. The four metrics (Meta Similarity, PresentArena, PresentQuiz, IP Memory) were already part of v1.[1] |
| 11 October 2025 | Project featured on the front page of Hacker News, driving a spike in repository stars.[3] |
| 15 October 2025 | A "fast" variant without the talking head added to the repository for quicker generation.[3] |
The paper was produced by Show Lab, a research group at the National University of Singapore led by Mike Zheng Shou. The same group has produced related work on GUI agents (ShowUI) and multimodal generation, several of which are cited as building blocks of PaperTalker.[1][6]
| Author | Role |
|---|---|
| Zeyu Zhu | Co-first author, Show Lab, NUS |
| Kevin Qinghong Lin | Co-first author, Show Lab, NUS (also lead author of ShowUI) |
| Mike Zheng Shou | Corresponding author, principal investigator of Show Lab |
The benchmark pairs recent conference papers with the authors' presentation videos, original slide decks (when available), and presenter identity metadata. Public sources include YouTube and SlidesLive, supplemented with portrait images sourced from authors' personal websites. Papers without sufficient metadata were excluded during curation.[1]
| Item | Value (aggregate) |
|---|---|
| Number of paper-video pairs | 101 |
| Average words per paper | ~13.3K (~3.3K tokens) |
| Average figures per paper | ~44.7 |
| Average pages per paper | ~28.7 |
| Average slides per presentation | ~16 |
| Average talk duration | ~6 minutes 15 seconds (range: 2 to 14 minutes) |
| Slides per video range | 4 to 28 |
| Entries with original slide PDFs | ~40% |
Sources: project page, dataset card, and paper.[1][2][4]
A domain breakdown reported in the paper is:
| Area | Count (papers) | Example venues |
|---|---|---|
| Machine learning | 41 | NeurIPS, ICLR, ICML |
| Computer vision | 40 | CVPR, ICCV, ECCV |
| Natural language processing | 20 | ACL, EMNLP, NAACL |
Each instance includes the paper's full LaTeX project, an author-recorded presentation video (slide and talking-head streams), and speaker identity (portrait and short voice sample). For roughly 40% of entries, original slide PDFs are also collected, enabling reference-based slide evaluation.[1]
The authors chose AI conference papers because the field's open-sharing culture provides polished author-recorded presentations on YouTube and SlidesLive, and because such papers offer diverse content with rich text, figures, and tables. They explicitly framed the benchmark as evaluating long-horizon agentic tasks rather than generic video generation, distinguishing it from natural-video benchmarks such as VBench.[1]
Paper2Video proposes four tailored metrics for academic presentation videos, using vision-language models (VLMs) and VideoLLMs as automated judges where appropriate.[1] The authors argue that conventional metrics from natural video synthesis (such as FVD, IS, or CLIP similarity) miss the central purpose of an academic talk: communicating scholarship to an audience and amplifying author visibility.
| Metric | What it measures | How it is operationalized |
|---|---|---|
| Meta Similarity | Alignment of generated assets with human-authored ones (slides, subtitles, speech timbre) | A VLM compares generated slide-subtitle pairs to the human versions on a five-point scale; speech similarity uses embedding cosine similarity on uniformly sampled 10-second clips via SpeechBrain.[1] |
| PresentArena | Overall preference and quality in head-to-head comparisons | A VideoLLM performs double-order pairwise comparisons between generated and human-made presentation videos; winning rate is the metric (order flipping reduces position bias).[1] |
| PresentQuiz | Information coverage and comprehension | Multiple-choice questions are generated from the source paper covering both fine-grained details and higher-level understanding; a VideoLLM watches the video and answers; overall accuracy is reported.[1] |
| IP Memory | Memorability and the audience's ability to associate authors with their work | A recall task asks a VideoLLM to match brief 5-second video clips to a relevant question given a speaker image; accuracy reflects retention and associative memory.[1] |
The authors note that IP Memory is the most novel of the four and was inspired by real-conference interactions where attendees who recall a presentation are more likely to approach the author with relevant questions later.[1]
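The speech part of Meta Similarity described in the table above (embedding cosine similarity on uniformly sampled 10-second clips) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the number of clips, and the way generated and reference clips are paired are assumptions made for illustration, and the speaker embeddings are taken as plain lists of floats rather than being computed with SpeechBrain.

```python
import math
from typing import List, Sequence


def uniform_clip_starts(duration_s: float, clip_s: float = 10.0, n_clips: int = 3) -> List[float]:
    """Start times (seconds) of n_clips windows spread evenly over a recording.

    n_clips is an assumed parameter; the source only says clips are
    uniformly sampled and 10 seconds long.
    """
    if duration_s <= clip_s:
        return [0.0]
    last_start = duration_s - clip_s
    if n_clips == 1:
        return [last_start / 2]
    step = last_start / (n_clips - 1)
    return [i * step for i in range(n_clips)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def speech_similarity(gen_embs: Sequence[Sequence[float]],
                      ref_embs: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity over clip embeddings paired by position (assumed pairing)."""
    pairs = list(zip(gen_embs, ref_embs))
    if not pairs:
        raise ValueError("need at least one pair of clip embeddings")
    return sum(cosine_similarity(g, r) for g, r in pairs) / len(pairs)
```

In a real evaluation the embeddings would come from a speaker-encoder model applied to each sampled clip; only the sampling and the similarity arithmetic are shown here.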
PaperTalker is a multi-agent pipeline that converts a paper into a narrated presentation video. The pipeline is designed to scale slide-wise in parallel for efficiency.[1] It comprises four "builders" with clearly decoupled responsibilities: the slide builder, the subtitle builder, the cursor builder, and the talker builder.
The slide builder synthesizes slides with LaTeX Beamer code rather than the PowerPoint XML or PPTX templates favored by prior systems such as PPTAgent. The authors give three reasons: LaTeX arranges content from declarative parameters, Beamer is more compact than XML, and Beamer offers academically appropriate styles. Generation proceeds in three steps:
Prompting LLMs and VLMs to directly tune numeric layout parameters (font size, figure scale, margins) is unreliable because the models are largely insensitive to small numeric edits. Tree Search Visual Choice instead constructs a neighborhood of candidate parameter values (for example figure scaling factors of 1.25, 0.75, 0.5, 0.25), renders each variant to an image, then asks a VLM to score the candidates on a single composite figure and pick the best. This decouples discrete layout search from semantic reasoning and resolves overflow issues with minimal token cost.[1]
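The selection step described above can be sketched in a few lines. This is a minimal, single-level illustration, not the authors' code: `visual_choice`, `fake_render`, and `fake_vlm_pick` are hypothetical names, the point sizes are made-up numbers, and the stand-in functions replace the real LaTeX compilation and VLM call so that the example runs on its own.

```python
from typing import Callable, List, Sequence, TypeVar

Rendered = TypeVar("Rendered")


def visual_choice(
    candidates: Sequence[float],
    render: Callable[[float], Rendered],
    pick_best: Callable[[List[Rendered]], int],
) -> float:
    """Render every candidate value, then let a judge choose one by index."""
    if not candidates:
        raise ValueError("need at least one candidate value")
    renders = [render(value) for value in candidates]
    best = pick_best(renders)
    if not 0 <= best < len(candidates):
        raise ValueError(f"judge returned out-of-range index {best}")
    return candidates[best]


# Stand-ins so the sketch runs without a LaTeX engine or a VLM (all hypothetical).
FIGURE_HEIGHT_PT = 400.0     # natural height of the figure (made-up number)
AVAILABLE_HEIGHT_PT = 250.0  # free vertical space on the slide (made-up number)


def fake_render(scale: float) -> dict:
    """Pretend to compile the slide; report whether the scaled figure overflows."""
    return {"scale": scale, "overflow": FIGURE_HEIGHT_PT * scale > AVAILABLE_HEIGHT_PT}


def fake_vlm_pick(renders: List[dict]) -> int:
    """Pretend VLM: prefer the largest figure that does not overflow."""
    fitting = [i for i, r in enumerate(renders) if not r["overflow"]]
    if not fitting:
        return len(renders) - 1  # fall back to the last candidate in the list
    return max(fitting, key=lambda i: renders[i]["scale"])


# Candidate scaling factors taken from the example in the text.
chosen_scale = visual_choice([1.25, 0.75, 0.5, 0.25], fake_render, fake_vlm_pick)
```

The design point the sketch tries to capture is that the model never edits a number directly: it only compares rendered alternatives, so the numeric search stays in ordinary code.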
The subtitle builder rasterizes each slide and feeds it to a VLM, which produces sentence-level subtitles paired with a visual-focus prompt describing where on the slide attention should be directed during that sentence. These prompts bridge speech and cursor motion.[1]
The cursor builder grounds the visual-focus prompts into screen coordinates. Spatial alignment uses ShowUI or UI-TARS to predict an (x, y) location per sentence from the slide screenshot. Temporal alignment uses WhisperX to extract word-level timestamps, yielding (t_start, t_end) per sentence. The authors simplify by assuming the cursor stays still within a sentence and only moves between sentences.[1][6][7]
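One way to picture how the spatial and temporal alignments combine is a lookup from video time to cursor position. The sketch below assumes each sentence already has its (x, y) point and its (t_start, t_end) interval; the `SentenceCue` type and `cursor_position` function are hypothetical names, and the linear glide in the gap between two sentences is an assumption, since the source only states that the cursor is still within a sentence and moves between sentences.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SentenceCue:
    """One subtitle sentence with its grounded point and time span."""
    text: str
    t_start: float  # seconds, e.g. from word-level timestamps
    t_end: float
    x: float        # pixel coordinates predicted from the slide screenshot
    y: float


def cursor_position(cues: List[SentenceCue], t: float) -> Optional[Tuple[float, float]]:
    """Cursor location at time t: fixed during a sentence, gliding in the gaps."""
    if not cues:
        return None
    ordered = sorted(cues, key=lambda c: c.t_start)
    if t <= ordered[0].t_start:
        return (ordered[0].x, ordered[0].y)
    for current, nxt in zip(ordered, ordered[1:]):
        if current.t_start <= t <= current.t_end:
            return (current.x, current.y)  # still within the sentence
        if current.t_end < t < nxt.t_start:
            # Assumed behaviour: move linearly to the next sentence's point.
            frac = (t - current.t_end) / (nxt.t_start - current.t_end)
            return (current.x + frac * (nxt.x - current.x),
                    current.y + frac * (nxt.y - current.y))
    last = ordered[-1]
    return (last.x, last.y)
```

For example, with one cue at (100, 100) spanning 0 to 2 seconds and another at (300, 200) spanning 3 to 5 seconds, the cursor holds the first point until 2 seconds, travels during the one-second gap, and then holds the second point.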
The talker builder produces a personalized presenter clip per slide. Speech synthesis uses F5-TTS, a flow-matching TTS model, conditioned on a short voice sample. Talking-head rendering uses Hallo2 by default; FantasyTalking is supported for upper-body articulation. The authors generate one clip per slide in parallel, relying on the natural scene cut between slides to mask the lack of temporal continuity, and report more than a 6 times speedup over sequential generation.[1][8]
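The slide-wise parallelism described above follows a common fan-out pattern, sketched here with Python's standard library. The names `build_clips_in_parallel` and `render_clip` are hypothetical, and a thread pool is only a stand-in: the actual pipeline runs GPU models, and how it distributes work across devices is not described in this article.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def build_clips_in_parallel(
    slide_scripts: Sequence[str],
    render_clip: Callable[[str], str],
    max_workers: int = 8,
) -> List[str]:
    """Render one presenter clip per slide concurrently, keeping slide order.

    Each slide is independent, so no clip has to wait for the previous one;
    executor.map returns results in the order of the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(render_clip, slide_scripts))


def fake_render_clip(script: str) -> str:
    """Stand-in for speech synthesis plus talking-head rendering."""
    return f"clip<{script}>"


clips = build_clips_in_parallel(["intro", "method", "results"], fake_render_clip)
```

Because the clips are later concatenated at slide boundaries, preserving input order matters more than finishing order, which is why the sketch uses `map` rather than collecting futures as they complete.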
The repository uses the Tectonic LaTeX engine and supports both commercial and local VLM back ends. Recommended choices are GPT-4.1 for slide and subtitle generation and Gemini 2.5 Flash or Pro for the VideoLLM judge, with local Qwen variants supported. The minimum recommended GPU is an NVIDIA A6000 with 48 GB; the authors ran inference on eight A6000s. A "light" mode without the talking head is provided for fast generation.[1][3]
On automated metrics, PaperTalker reports the strongest performance among automatic baselines on the Paper2Video Benchmark. Notably, on PresentQuiz it surpasses human-recorded presentations by roughly 10 percentage points, which the authors attribute to PaperTalker producing more concentrated information per minute of video and to the cursor providing explicit attentional cues that help the VideoLLM judge.[1]
Approximate values reported in Table 2 of the paper for selected metrics:
| Method | PresentArena win vs. human (%) | PresentQuiz accuracy (%) | IP Memory (%) |
|---|---|---|---|
| Human-made | 50.0 | ~85 to 90 | reference |
| PaperTalker (full) | ~17 | ~95 | ~50 |
| PaperTalker without talker and cursor | ~15 | lower | lower |
| PresentAgent | ~2 | ~65 | ~12 |
| Veo3 (end-to-end) | ~1 | ~58 | ~31 |
PresentArena values are pairwise win rates against the human reference, so 50% indicates a tie. PaperTalker's lower number reflects the gap on perceived overall quality, despite its higher quiz accuracy.[1]
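The double-order comparison behind PresentArena can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact protocol: the `Judge` signature and the function name are hypothetical, and counting each presentation order as a separate trial is one plausible way to aggregate the two orders.

```python
from typing import Callable, Iterable, Tuple

# A judge sees two videos in a given order and answers "first" or "second".
Judge = Callable[[str, str], str]


def double_order_win_rate(pairs: Iterable[Tuple[str, str]], judge: Judge) -> float:
    """Win rate of generated videos over human ones, judged in both orders."""
    wins = 0
    trials = 0
    for generated, human in pairs:
        # Order 1: the generated video is shown first.
        if judge(generated, human) == "first":
            wins += 1
        # Order 2: the generated video is shown second.
        if judge(human, generated) == "second":
            wins += 1
        trials += 2
    return wins / trials if trials else 0.0


def always_first(video_a: str, video_b: str) -> str:
    """A purely position-biased judge that ignores the content."""
    return "first"


biased_rate = double_order_win_rate([("gen_1", "human_1"), ("gen_2", "human_2")], always_first)
```

Under this aggregation, a judge that always prefers whichever video it sees first awards the generated video exactly one win out of two per pair, so its position bias alone cannot move the score away from 50%.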
A localization QA shows a large gain from explicit cursor guidance:
| Method or variant | Localization accuracy |
|---|---|
| PaperTalker without cursor | 0.084 |
| PaperTalker with cursor | 0.633 |
Source: paper (Table 4).[1]
Slides were scored on a 1 to 5 scale across content, design, and coherence, following the PPTAgent rubric. Removing the layout refinement module produced a pronounced drop in design quality (from roughly 2.85 to 2.53), confirming its role in resolving overflow.[1][10]
| Method | Time (min) | Notes |
|---|---|---|
| PaperTalker (full) | 48.1 | Includes talking head, parallelized over 8 A6000s |
| PaperTalker without talker | 15.6 | "Fast" variant |
| PaperTalker without parallelization | 287.2 | For comparison |
Slide-wise parallelization yields roughly a 6 times speedup versus sequential generation in the agentic pipeline (287.2 versus 48.1 minutes in the table above). Token cost per presentation was reported at roughly $0.001, well below the $0.003 of PresentAgent at the time of evaluation.[1]
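The ratios implied by the timing table can be checked with a few lines of arithmetic. The variable names are illustrative, and reading the gap between the full and no-talker runs as the talking head's share of runtime is an inference from the table, not a figure stated by the authors.

```python
# Minutes per presentation, copied from the timing table.
full_parallel_min = 48.1    # PaperTalker (full)
without_talker_min = 15.6   # PaperTalker without talker
sequential_min = 287.2      # PaperTalker without parallelization

# Speedup of the parallel pipeline over the sequential one.
speedup = sequential_min / full_parallel_min

# Fraction of the full run that disappears when the talker is dropped
# (an inference from the table, assuming the two runs are otherwise comparable).
talker_share = (full_parallel_min - without_talker_min) / full_parallel_min

print(f"speedup: {speedup:.2f}x, talker share of runtime: {talker_share:.0%}")
```

On these numbers the speedup comes out just under 6, and about two thirds of the full runtime is attributable to the talking-head stage.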
Ten participants rated videos generated by different methods on a scale from 1 (worst) to 5 (best) across ten randomly sampled papers. Human-made videos scored highest, with PaperTalker ranking second and clearly ahead of all automatic baselines. The authors interpret this as evidence that the gap with humans has narrowed but not closed.[1]
| Resource | Location | License |
|---|---|---|
| Code | github.com/showlab/Paper2Video | MIT |
| Dataset | Hugging Face (ZaynZhu/Paper2Video) | research use, see card |
| Project page | showlab.github.io/Paper2Video | not applicable |
| Paper | arXiv 2510.05096 | arXiv default |
The repository ships configuration for both commercial APIs (OpenAI, Google) and local model serving, plus a separate Hallo2 environment for the talking-head module.[3]
Paper2Video sits within a small but growing cluster of "AI for Research" agents. The closest comparisons are document-to-media systems and slide-generation tools.
| System | Year | Input | Slides | Subtitles | Cursor | Face | Voice |
|---|---|---|---|---|---|---|---|
| D2S (Sun et al.) | 2021 | Document | yes | no | no | no | no |
| PPTAgent | 2025 | Doc + template | yes | no | no | no | no |
| Paper2Poster | 2025 | Paper | poster only | no | no | no | no |
| PresentAgent | 2025 | Doc + template | yes | yes | no | no | no |
| Paper2Agent | 2025 | Paper | n/a (interactive) | n/a | n/a | n/a | n/a |
| Paper2Video / PaperTalker | 2025 | Paper + image + audio | yes | yes | yes | yes | yes |
Key contrasts:
More distantly, the project sits within the wider "AI4Research" agenda surveyed by Chen et al. (2025), spanning literature surveying, idea generation, and replication benchmarks such as PaperBench and SciReplicate-Bench.[1]
Within a week of release the GitHub repository accumulated thousands of stars and reached the front page of Hacker News, with discussion focused on whether automated talks could meaningfully replace recorded conference presentations. Commentary by AlphaXiv and Emergent Mind highlighted the IP Memory metric as a novel attempt to quantify the "author visibility" purpose of conference talks.[3][14][15]
The authors framed the work as a practical step rather than a finished product. They acknowledge that talking-head fidelity, gesture realism, and handling of equation-heavy slides remain open, and that the benchmark covers only AI conference papers from the past three years.[1]