Paper2Video

Updated Jul 7, 2026

Paper2Video (full title: Paper2Video: Automatic Video Generation from Scientific Papers) is a research project from Show Lab at the National University of Singapore that formalizes and evaluates automatic generation of academic presentation videos directly from scientific papers. It comprises (1) the Paper2Video Benchmark, 101 paper-video pairs with slides and speaker metadata, and (2) PaperTalker, a multi-agent framework that turns a paper (plus a reference image and short voice sample) into a narrated presentation video with slides, subtitles, cursor highlights, and an optional talking-head presenter.^[1]^[2] According to the paper, PaperTalker "outperforms human-made presentations by 10% in PresentQuiz accuracy and achieves comparable ratings in user studies," a result the authors read as automated quality approaching that of human-created content.^[1] Code and data are open sourced under the MIT license on GitHub, and the benchmark is hosted on Hugging Face.^[3]^[4]

The authors are Zeyu Zhu, Kevin Qinghong Lin, and Mike Zheng Shou (corresponding author), with Zhu and Lin contributing equally. The paper was first posted to arXiv on 6 October 2025 (preprint 2510.05096) and accepted to the Scaling Environments for Agents (SEA) Workshop at NeurIPS 2025, where it was presented as a poster in San Diego on 7 December 2025.^[1]^[5]^[16] In mid-October 2025 the project briefly trended on Hacker News, and the GitHub repository accumulated thousands of stars within days.^[3]

What is the difference between Paper2Video and PaperTalker?

The name Paper2Video refers to both the benchmark and the overall project; the video-generation agent is called PaperTalker. Unlike natural video generation, the authors argue, presentation video generation "involves distinctive challenges: long-context inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and human talker."^[1] Evaluation is therefore focused on faithfulness, audience comprehension, and author visibility rather than purely natural-video realism.^[1]^[2]

The authors describe academic presentation video generation as "a superproblem" of related document-to-media tasks such as slide and poster generation, calling it "a practical yet more challenging direction."^[1] Producing such a talk manually, the paper notes, "remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2 to 10 minutes video," which motivates the case for automation.^[1]

When was Paper2Video released?

Date	Event
28 September 2025	Project accepted as a poster at the SEA (Scaling Environments for Agents) Workshop at NeurIPS 2025.^[5]
6 October 2025	arXiv v1 posted; code and dataset released on GitHub and Hugging Face.^[1]^[3]
9 October 2025	arXiv v2 released with minor revisions. The four metrics (Meta Similarity, PresentArena, PresentQuiz, IP Memory) were already part of v1.^[1]
11 October 2025	Project featured on the Hacker News front page, driving a spike in repository stars.^[3]
15 October 2025	A "fast" variant without the talking head added to the repository for quicker generation.^[3]
7 December 2025	Presented as a poster at the NeurIPS 2025 SEA Workshop in San Diego.^[16]

Who created Paper2Video?

The paper was produced by Show Lab, a research group at the National University of Singapore led by Mike Zheng Shou. The same group has produced related work on GUI agents (ShowUI) and multimodal generation, several of which are cited as building blocks of PaperTalker.^[1]^[6]

Author	Role
Zeyu Zhu	Co-first author, Show Lab, NUS
Kevin Qinghong Lin	Co-first author, Show Lab, NUS (also lead author of ShowUI)
Mike Zheng Shou	Corresponding author, principal investigator of Show Lab

What is the Paper2Video benchmark?

The benchmark pairs recent conference papers with the authors' presentation videos, original slide decks (when available), and presenter identity metadata. Videos were collected from public sources including YouTube and SlidesLive, and portrait images were taken from authors' personal websites. Papers without sufficient metadata were excluded during curation.^[1]

Composition and statistics

Item	Value (aggregate)
Number of paper-video pairs	101
Average words per paper	~13.3K (~3.3K tokens)
Average figures per paper	~44.7
Average pages per paper	~28.7
Average slides per presentation	~16
Average talk duration	~6 minutes 15 seconds (range: 2 to 14 minutes)
Slides per video range	4 to 28
Original slide PDFs available	for ~40% of entries

Sources: project page, dataset card, and paper.^[1]^[2]^[4]

A domain breakdown reported in the paper is:

Area	Count (papers)	Example venues
Machine learning	41	NeurIPS, ICLR, ICML
Computer vision	40	CVPR, ICCV, ECCV
Natural language processing	20	ACL, EMNLP, NAACL

Each instance includes the paper's full LaTeX project, an author-recorded presentation video (slide and talking-head streams), and speaker identity (portrait and short voice sample). For roughly 40% of entries, original slide PDFs are also collected, enabling reference-based slide evaluation.^[1]

Curation rationale

The authors chose AI conference papers because the field's open-sharing culture provides polished author-recorded presentations on YouTube and SlidesLive, and because such papers offer diverse content with rich text, figures, and tables. They explicitly framed the benchmark as evaluating long-horizon agentic tasks rather than generic video generation, distinguishing it from natural-video benchmarks such as VBench.^[1]

How is Paper2Video evaluated?

Paper2Video proposes four tailored metrics for academic presentation videos, using vision-language models (VLMs) and VideoLLMs as automated judges where appropriate.^[1] The authors argue that conventional metrics from natural video synthesis (such as FVD, IS, or CLIP similarity) miss the central purpose of an academic talk: communicating scholarship to an audience and amplifying author visibility.

Metric	What it measures	How it is operationalized
Meta Similarity	Alignment of generated assets with human-authored ones (slides, subtitles, speech timbre)	A VLM compares generated slide-subtitle pairs to the human versions on a five-point scale; speech similarity uses embedding cosine similarity on uniformly sampled 10-second clips via SpeechBrain.^[1]
PresentArena	Overall preference and quality in head-to-head comparisons	A VideoLLM performs double-order pairwise comparisons between generated and human-made presentation videos; winning rate is the metric (order flipping reduces position bias).^[1]
PresentQuiz	Information coverage and comprehension	Multiple-choice questions are generated from the source paper covering both fine-grained details and higher-level understanding; a VideoLLM watches the video and answers; overall accuracy is reported.^[1]
IP Memory	Memorability and the audience's ability to associate authors with their work	A recall task asks a VideoLLM to match brief 5-second video clips to a relevant question given a speaker image; accuracy reflects retention and associative memory.^[1]
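
The speech part of Meta Similarity can be illustrated with a minimal sketch. It assumes that speaker embeddings for matching 10-second clips have already been extracted (the article says SpeechBrain is used for this; the extraction step is not shown here). The function names and the averaging over clips are illustrative assumptions, not the authors' code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_clip_similarity(
    generated: Sequence[Sequence[float]],
    human: Sequence[Sequence[float]],
) -> float:
    """Average similarity over paired clip embeddings (assumed aggregation)."""
    if not generated or len(generated) != len(human):
        raise ValueError("need the same non-zero number of clips on both sides")
    scores = [cosine_similarity(g, h) for g, h in zip(generated, human)]
    return sum(scores) / len(scores)
```

A score near 1 means the synthesized voice sits close to the human speaker in embedding space; a score near 0 means the two are unrelated.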

The authors note that IP Memory is the most novel of the four and was inspired by real-conference interactions where attendees who recall a presentation are more likely to approach the author with relevant questions later.^[1]
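
The double-order comparison behind PresentArena can be sketched as follows. The `judge` callable stands in for the VideoLLM and is a hypothetical interface: it receives two videos and returns "first" or "second". Counting each ordering as a separate comparison is an assumption made for this sketch.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: judge(first_video, second_video) -> "first" | "second".
Judge = Callable[[str, str], str]


def double_order_win_rate(
    generated: Sequence[str],
    human: Sequence[str],
    judge: Judge,
) -> float:
    """Share of comparisons won by the generated video, judging every pair in both orders."""
    wins = 0
    total = 0
    for gen, ref in zip(generated, human):
        # Order 1: the generated video is shown first.
        wins += judge(gen, ref) == "first"
        # Order 2: the generated video is shown second.
        wins += judge(ref, gen) == "second"
        total += 2
    return wins / total if total else 0.0
```

A judge that always prefers whichever video it sees first scores exactly 0.5 under this scheme, which is why flipping the order neutralizes position bias.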

How does PaperTalker work?

PaperTalker is a multi-agent pipeline that converts a paper into a narrated presentation video. The pipeline is designed to scale slide-wise in parallel for efficiency.^[1] It comprises four "builders" with clearly decoupled responsibilities:

1. Slide builder

The slide builder synthesizes slides with LaTeX Beamer code rather than the PowerPoint XML or PPTX templates favored by prior systems such as PPTAgent. The authors give three reasons: LaTeX arranges content from declarative parameters, Beamer is more compact than XML, and Beamer offers academically appropriate styles. Generation proceeds in three steps:

A coder LLM produces draft Beamer code from the paper's LaTeX source.
The code is compiled with the Tectonic engine; warnings and errors trigger a focused debugging routine that narrows the relevant lines and asks the model to repair them.
Slides flagged for layout overflow are sent through Tree Search Visual Choice refinement.^[1]
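
The compile-and-repair step above can be sketched as a small loop. This is an illustration under stated assumptions, not the project's implementation: `compile_fn` stands in for a wrapper around the Tectonic compiler that parses its log into line numbers, `repair_fn` stands in for the LLM call that rewrites a flagged line, and `CompileResult` is a hypothetical container.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CompileResult:
    ok: bool
    # 1-indexed source lines implicated by the compiler log (hypothetical format).
    error_lines: List[int] = field(default_factory=list)


def compile_and_repair(
    code: str,
    compile_fn: Callable[[str], CompileResult],
    repair_fn: Callable[[str], str],
    max_rounds: int = 3,
) -> str:
    """Compile Beamer code, repairing only the flagged lines between attempts."""
    for _ in range(max_rounds):
        result = compile_fn(code)
        if result.ok:
            return code
        lines = code.splitlines()
        for n in result.error_lines:
            # Narrow the repair to the lines the log points at.
            lines[n - 1] = repair_fn(lines[n - 1])
        code = "\n".join(lines)
    # One final check after the last repair round.
    if compile_fn(code).ok:
        return code
    raise RuntimeError("Beamer code still fails to compile after repair attempts")
```

Restricting each repair to the flagged lines keeps the prompt short and avoids regenerating slides that already compile.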

Tree Search Visual Choice

Prompting LLMs and VLMs to directly tune numeric layout parameters (font size, figure scale, margins) is unreliable because the models are largely insensitive to small numeric edits. Tree Search Visual Choice instead constructs a neighborhood of candidate parameter values (for example figure scaling factors of 1.25, 0.75, 0.5, 0.25), renders each variant to an image, then asks a VLM to score the candidates on a single composite figure and pick the best. This decouples discrete layout search from semantic reasoning and resolves overflow issues with minimal token cost.^[1]
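
A single level of this search can be sketched as below. Both callables are hypothetical stand-ins: `render` would compile the slide with a given scale factor and return an image, and `pick_best` would show all candidates to a VLM and return the index of its choice. The sketch does not model deeper levels of the tree.

```python
from typing import Any, Callable, List, Sequence


def visual_choice(
    scales: Sequence[float],
    render: Callable[[float], Any],
    pick_best: Callable[[List[Any]], int],
) -> float:
    """Render one variant per candidate scale and let a visual judge choose among them."""
    if not scales:
        raise ValueError("need at least one candidate scale")
    # Render every candidate up front so the judge can compare them side by side.
    images = [render(scale) for scale in scales]
    index = pick_best(images)
    if not 0 <= index < len(scales):
        raise ValueError("judge returned an index outside the candidate list")
    return scales[index]
```

The model never has to produce a number itself; it only selects among rendered options, which is the decoupling the paragraph above describes.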

2. Subtitle builder

The subtitle builder rasterizes each slide and feeds it to a VLM, which produces sentence-level subtitles paired with a visual-focus prompt describing where on the slide attention should be directed during that sentence. These prompts bridge speech and cursor motion.^[1]

3. Cursor builder

The cursor builder grounds the visual-focus prompts into screen coordinates. Spatial alignment uses ShowUI or UI-TARS to predict an (x, y) location per sentence from the slide screenshot. Temporal alignment uses WhisperX to extract word-level timestamps, yielding (t_start, t_end) per sentence. The authors simplify by assuming the cursor stays still within a sentence and only moves between sentences.^[1]^[6]^[7]
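
Combining the two alignments amounts to attaching one (x, y) point to each sentence's time span. The sketch below assumes the points and spans have already been produced by the grounding and transcription models; `CursorKeyframe`, `build_track`, and `cursor_at` are illustrative names, and holding the last position during pauses is an assumption of this sketch.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class CursorKeyframe:
    t_start: float  # sentence start, in seconds
    t_end: float    # sentence end, in seconds
    x: int
    y: int


def build_track(
    points: Sequence[Tuple[int, int]],
    spans: Sequence[Tuple[float, float]],
) -> List[CursorKeyframe]:
    """Pair each sentence's predicted point with its time span (inputs in spoken order)."""
    if len(points) != len(spans):
        raise ValueError("need exactly one point per sentence span")
    return [CursorKeyframe(t0, t1, x, y) for (x, y), (t0, t1) in zip(points, spans)]


def cursor_at(track: Sequence[CursorKeyframe], t: float) -> Optional[Tuple[int, int]]:
    """Cursor position at time t: fixed within a sentence, held until the next one starts."""
    starts = [k.t_start for k in track]
    i = bisect_right(starts, t) - 1
    if i < 0:
        return None  # before the first sentence, no cursor is shown
    return (track[i].x, track[i].y)
```

Because the cursor only jumps at sentence boundaries, one lookup per video frame is enough to render the overlay.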

4. Talker builder

The talker builder produces a personalized presenter clip per slide. Speech synthesis uses F5-TTS, a flow-matching TTS model, conditioned on a short voice sample. Talking-head rendering uses Hallo2 by default; FantasyTalking is supported for upper-body articulation. The authors generate one clip per slide in parallel, relying on the natural scene cut between slides to mask the lack of temporal continuity, and report more than a 6 times speedup over sequential generation.^[1]^[8]
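
The slide-wise parallelism can be sketched with the standard library. `render_clip` is a hypothetical stand-in for the per-slide speech and talking-head step; a real deployment would more likely dispatch work to separate GPU processes, so the thread pool here is only an assumption that keeps the example short.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")
C = TypeVar("C")


def render_clips_in_parallel(
    slides: Sequence[S],
    render_clip: Callable[[S], C],
    max_workers: int = 8,
) -> List[C]:
    """Render one presenter clip per slide concurrently, returning clips in slide order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so clips line up with slides.
        return list(pool.map(render_clip, slides))
```

Each clip depends only on its own slide, so no state is shared between workers and the clips can be concatenated in slide order afterwards.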

Implementation notes

The repository uses the Tectonic LaTeX engine and supports both commercial and local VLM back ends. Recommended choices are GPT-4.1 for slide and subtitle generation and Gemini 2.5 Flash or Pro for the VideoLLM judge, with local Qwen variants supported. The minimum recommended GPU is an NVIDIA A6000 with 48 GB; the authors ran inference on eight A6000s. A "light" mode without the talking head is provided for fast generation.^[1]^[3]

How well does PaperTalker perform?

On automated metrics, PaperTalker reports the strongest performance among automatic baselines on the Paper2Video Benchmark. The headline finding is that on PresentQuiz, a multiple-choice comprehension test answered by a VideoLLM after watching the talk, PaperTalker surpasses the human-recorded presentations. The paper summarizes this as the system outperforming "human-made presentations by 10% in PresentQuiz accuracy" while achieving "comparable ratings in user studies."^[1] The authors attribute the comprehension gain to PaperTalker producing more concentrated information per minute of video and to the cursor providing explicit attentional cues that help the VideoLLM judge.^[1]

PresentQuiz: comprehension versus human talks

PresentQuiz subset	Human-made (%)	PaperTalker (%)
Understanding	90.8	95.1
Detail	73.8	84.2

The gap is widest on the fine-grained Detail questions (84.2% versus 73.8%, roughly 10 points), which is the source of the paper's headline "10%" claim.^[1]

Main result snapshot

Values reported in the paper for the head-to-head preference and memorability metrics:

Method	PresentArena win vs. human (%)	IP Memory (%)
Human-made	50.0 (tie reference)	reference
PaperTalker (full)	17.0	50.0
PaperTalker without talker and cursor	15.2	not applicable
PresentAgent	2.0	12.5
Veo3 (end-to-end)	1.2	31.3

PresentArena values are pairwise win rates against the human reference, so 50% indicates a tie. PaperTalker's 17.0% reflects the remaining gap on perceived overall quality, even though it leads every automatic baseline by a wide margin and beats humans on comprehension. Two end-to-end video diffusion baselines, Veo3 and Wan2.2, won only 1.2% and 1.1% of PresentArena comparisons against human talks. On IP Memory, a recall task that tests whether a viewer can associate a talk with its author, PaperTalker (50.0%) roughly doubles the best end-to-end video model (Veo3, 31.3%) and far exceeds the slide-based PresentAgent (12.5%).^[1]

Cursor ablation

A localization QA shows a large gain from explicit cursor guidance:

Method or variant	Localization accuracy
PaperTalker without cursor	0.084
PaperTalker with cursor	0.633

Source: paper (Table 4).^[1]

Tree Search Visual Choice ablation

Slides were scored on a 1 to 5 scale across content, design, and coherence, following the PPTAgent rubric. Removing the layout refinement module produced a pronounced drop in design quality (from roughly 2.85 to 2.53), confirming its role in resolving overflow.^[1]^[10]

Runtime snapshot (per paper, representative setting)

Method	Time (min)	Notes
PaperTalker (full)	48.1	Includes talking head, parallelized over 8 A6000s
PaperTalker without talker	15.6	"Fast" variant
PaperTalker without parallelization	287.2	For comparison

Slide-wise parallelization yields roughly a 6 times speedup over sequential generation in the agentic pipeline (287.2 versus 48.1 minutes in the table above). Token usage per presentation was reported at roughly 62K tokens (about $0.001), well below the roughly 241K tokens (about $0.003) of PresentAgent at the time of evaluation, which makes PaperTalker about 3 times cheaper in dollar terms and nearly 4 times lighter in token count.^[1]
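
The ratios implied by these figures can be checked directly. The snippet below only reuses the numbers from the runtime table and the token counts quoted above; reading the difference between the full and no-talker runs as the talker's share of runtime is an interpretation, not a figure from the paper.

```python
# Figures copied from the runtime table and the token-usage sentence above.
FULL_MIN = 48.1               # PaperTalker (full), parallelized
SEQUENTIAL_MIN = 287.2        # PaperTalker without parallelization
NO_TALKER_MIN = 15.6          # PaperTalker without talker
PAPERTALKER_TOKENS = 62_000
PRESENTAGENT_TOKENS = 241_000

speedup = SEQUENTIAL_MIN / FULL_MIN
token_ratio = PRESENTAGENT_TOKENS / PAPERTALKER_TOKENS
talker_share = 1 - NO_TALKER_MIN / FULL_MIN

print(f"parallel speedup: {speedup:.2f}x")
print(f"token ratio vs PresentAgent: {token_ratio:.2f}x")
print(f"runtime share of the talker stage: {talker_share:.0%}")
```

With the table values, the speedup comes out at about 5.97 and the token ratio at about 3.89.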

Human evaluation

Ten participants rated videos generated by different methods on a scale from 1 (worst) to 5 (best) across ten randomly sampled papers. Human-made videos scored highest, with PaperTalker ranking second and clearly ahead of all automatic baselines. The authors interpret this as evidence that the gap with humans has narrowed but not closed.^[1]

Is Paper2Video open source?

Yes. The PaperTalker code is released under the MIT license, and the 101-pair benchmark dataset is publicly available on Hugging Face under the terms stated on its dataset card.

Resource	Location	License
Code	github.com/showlab/Paper2Video	MIT
Dataset	Hugging Face (ZaynZhu/Paper2Video)	research use, see card
Project page	showlab.github.io/Paper2Video	not applicable
Paper	arXiv 2510.05096	arXiv default

The repository ships configuration for both commercial APIs (OpenAI, Google) and local model serving, plus a separate Hallo2 environment for the talking-head module.^[3]

How does Paper2Video compare to prior work?

Paper2Video sits within a small but growing cluster of "AI for Research" agents. The closest comparisons are document-to-media systems and slide-generation tools.

System	Year	Input	Slides	Subtitles	Cursor	Face	Voice
D2S (Sun et al.)	2021	Document	yes	no	no	no	no
PPTAgent	2025	Doc + template	yes	no	no	no	no
Paper2Poster	2025	Paper	poster only	no	no	no	no
PresentAgent	2025	Doc + template	yes	yes	no	no	no
Paper2Agent	2025	Paper	n/a (interactive)	n/a	n/a	n/a	n/a
Paper2Video / PaperTalker	2025	Paper + image + audio	yes	yes	yes	yes	yes

Key contrasts:

Paper2Poster generates posters and contributes the PaperQuiz metric that inspired PresentQuiz, but produces no video or audio.^[9]
PPTAgent targets document-to-slides workflows and introduces PPTEval (content, design, coherence); PaperTalker reuses the rubric but writes Beamer rather than editing PPTX.^[10]
PresentAgent generates presentation videos from documents but lacks personalization and academic-style structure such as opening or outline slides.^[11]
Paper2Code automates code generation from ML papers, a sibling system targeting a different output modality.^[12]
Veo 3 and similar end-to-end video diffusion models can render a high-quality presenter shot but are constrained to roughly 8 second durations with blurred on-screen text, a key limitation as an academic-talk baseline.^[13]

More distantly, the project sits within the wider "AI4Research" agenda surveyed by Chen et al. (2025), spanning literature surveying, idea generation, and replication benchmarks such as PaperBench and SciReplicate-Bench.^[1]

How was Paper2Video received?

Within a week of release the GitHub repository accumulated thousands of stars and reached the front page of Hacker News, with discussion focused on whether automated talks could meaningfully replace recorded conference presentations. Commentary by AlphaXiv and Emergent Mind highlighted the IP Memory metric as a novel attempt to quantify the "author visibility" purpose of conference talks.^[3]^[14]^[15]

The authors framed the work as a practical step rather than a finished product. They acknowledge that talking-head fidelity, gesture realism, and handling of equation-heavy slides remain open, and that the benchmark covers only AI conference papers from the past three years.^[1]

References

1. Zhu, Z., Lin, K. Q., Shou, M. Z. "Paper2Video: Automatic Video Generation from Scientific Papers". arXiv preprint 2510.05096, October 2025. https://arxiv.org/abs/2510.05096
2. Show Lab, NUS. "Paper2Video project page". https://showlab.github.io/Paper2Video/
3. Show Lab, NUS. "showlab/Paper2Video GitHub repository". https://github.com/showlab/Paper2Video
4. Zhu, Z. "ZaynZhu/Paper2Video dataset card on Hugging Face". https://huggingface.co/datasets/ZaynZhu/Paper2Video
5. NeurIPS 2025 Scaling Environments for Agents (SEA) Workshop. https://sea-workshop.github.io/
6. Lin, K. Q. et al. "ShowUI: One Vision-Language-Action Model for GUI Visual Agent". CVPR 2025. https://arxiv.org/abs/2411.17465
7. Bain, M., Huh, J., Han, T., Zisserman, A. "WhisperX: Time-Accurate Speech Transcription of Long-Form Audio". arXiv 2303.00747, 2023.
8. Cui, J. et al. "Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation". arXiv 2410.07718, 2024.
9. Pang, W. et al. "Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers". arXiv 2505.21497, 2025.
10. Zheng, H. et al. "PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides". arXiv 2501.03936, 2025.
11. Shi, J. et al. "PresentAgent: Multimodal Agent for Presentation Video Generation". arXiv 2507.04036, 2025.
12. Seo, M. et al. "Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning". arXiv 2504.17192, 2025.
13. DeepMind. "Veo 3 Technical Report". May 2025. https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf
14. Hugging Face Papers. "Paper2Video paper page". https://huggingface.co/papers/2510.05096
15. AlphaXiv. "Paper2Video: Automatic Video Generation from Scientific Papers". https://www.alphaxiv.org/overview/2510.05096v2
16. NeurIPS 2025. "Paper2Video: Automatic Video Generation from Scientific Papers (SEA Workshop poster)". https://neurips.cc/virtual/2025/loc/san-diego/124558 ; OpenReview: https://openreview.net/forum?id=LvRHonr4gv
