Lumiere
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,375 words
Lumiere is a text-to-video diffusion model developed by Google Research in collaboration with researchers from the Weizmann Institute of Science, Tel Aviv University, and the Technion, introduced in January 2024 through the preprint "Lumiere: A Space-Time Diffusion Model for Video Generation" (arXiv:2401.12945)[^1]. The system is best known for its Space-Time U-Net (STUNet) architecture, which generates the entire temporal extent of a clip in a single forward pass instead of relying on the keyframes-plus-temporal-super-resolution cascade that earlier systems such as Imagen Video and Make-A-Video had used[^1][^2]. The base Lumiere model outputs 80-frame clips at 16 frames per second, or five seconds of motion, at a low spatial resolution of 128×128 pixels; a spatial super-resolution (SSR) network, applied with a MultiDiffusion-style overlapping-window scheme, then upsamples the clip to 1024×1024[^1][^3]. The paper, demos, and supplementary applications, including image-to-video, stylized generation, video editing, cinemagraphs, and inpainting, were presented at SIGGRAPH Asia 2024 in Tokyo[^4]. Google did not release Lumiere as a public product, API, or open-weights checkpoint; the project remains research-only, and its design ideas have flowed instead into later Google AI video generation efforts such as the Veo family[^5].
| Field | Value |
|---|---|
| Full name | Lumiere: A Space-Time Diffusion Model for Video Generation |
| Type | Text-to-video diffusion model |
| Developer | Google Research; Weizmann Institute of Science; Tel Aviv University; Technion |
| arXiv ID | 2401.12945 (v1: 2024-01-23; v2: 2024-02-05) |
| Initial announcement | January 25, 2024 |
| Conference | SIGGRAPH Asia 2024, Tokyo, December 3 to 6, 2024 |
| DOI | 10.1145/3680528.3687614 |
| Output length | 80 frames at 16 fps (5 seconds) |
| Base resolution | 128 × 128 pixels |
| Final resolution | 1024 × 1024 pixels (after SSR) |
| Base text-to-image model | Imagen (frozen weights, extended with temporal layers) |
| Training data | 30 million video and text caption pairs |
| Release status | Research only; no public weights, API, or demo |
| Project page | lumiere-video.github.io |
Sources for the values above: the arXiv paper[^1], the project page[^3], the SIGGRAPH Asia 2024 program[^4], the deeplearning.ai coverage[^6], and the VentureBeat announcement[^7].
Video generation models that predate Lumiere converged on a cascade design. A first network produces a short, low-frame-rate sequence of keyframes at low spatial resolution, and a chain of temporal and spatial super-resolution modules then fill in intermediate frames and upscale to a target resolution. Google's own Imagen Video, the closest prior reference for the Lumiere team, was built on seven cascaded diffusion networks: a base text-to-video model produced 16 frames at 24×40 resolution and 3 fps, then alternating temporal and spatial super-resolution stages stretched the output to 128 frames at 1280×768 resolution and 24 fps[^2]. Meta's Make-A-Video adopted a similar staged approach, and Google's Phenaki used a different decomposition (token-based prediction plus a video super-resolution decoder) but still produced long sequences by stitching shorter ones together[^1].
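The frame counts and frame rates quoted above imply that Imagen Video's cascade keeps the clip duration fixed and only raises temporal density. The following quick check uses nothing but the figures in this paragraph, plus Lumiere's 80 frames at 16 fps for comparison; the helper name `duration_seconds` is made up for the illustration.

```python
from fractions import Fraction

def duration_seconds(frames: int, fps: int) -> Fraction:
    """Clip duration implied by a frame count and a frame rate."""
    return Fraction(frames, fps)

# Imagen Video, using the figures quoted in the text.
base = duration_seconds(16, 3)       # keyframes from the base model
final = duration_seconds(128, 24)    # after temporal super-resolution

# Lumiere's base model, for comparison.
lumiere = duration_seconds(80, 16)

# Temporal super-resolution multiplies the frame count by this factor.
tsr_factor = 128 // 16

print(base, final, lumiere, tsr_factor)
```

Both Imagen Video stages span the same 16/3 seconds (about 5.3 s), so the eightfold increase in frame count comes entirely from interpolation between keyframes, whereas Lumiere's base model produces all 80 frames of its five-second clip directly.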
Cascaded designs win on compute efficiency: each stage operates on a manageable tensor, and temporal super-resolution networks are reused across windows of frames. Their weakness is consistency. Each temporal super-resolution stage has to interpolate between keyframes it sees in a sliding window, and the interpolation cannot draw on full global context across the whole clip. The Lumiere paper points out that this design "inherently makes global temporal consistency difficult to achieve": keyframe synthesis happens before any high-frame-rate signal is available, so the keyframes can be temporally aliased (sampled too sparsely to pin down fast motion), and the interpolation network, which sees only a short window, cannot always recover a single plausible motion that holds across multiple windows[^1]. For repetitive or fast motion, such as walking, hand gestures, or rotating objects, this manifests as jitter, jumpy transitions, or content that subtly mutates between consecutive sub-clips[^1][^6].
Lumiere was written by 17 authors. The first author is Omer Bar-Tal, with Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, and Inbar Mosseri[^4]. The collaboration draws together Google Research video and graphics groups (which had previously published Imagen and Imagen Video) with academic labs in Israel, including Tali Dekel's group at the Weizmann Institute, Tomer Michaeli's group at the Technion, and Hila Chefer's affiliation with Tel Aviv University[^1][^3]. The paper was first posted to arXiv on January 23, 2024, with version 2 appearing on February 5, 2024[^1]. Google separately publicized the project through a research blog and demo videos on January 25, 2024[^5][^7].
Coverage in January 2024 was unusually intense for a research preprint because the released demos visibly outperformed contemporary commercial systems on motion quality. VentureBeat called the results "remarkably realistic" and emphasized the contrast with Runway Gen-2 and Pika clips, which often showed object distortion or warping in motion[^7]. The Decoder summarized the architecture as a single-pass alternative to keyframe interpolation, and noted that Google was not releasing the model for public use[^8]. The deeplearning.ai newsletter ran a technical summary noting the 30-million-video training set, the 1024×1024 final resolution, and a user-study win rate where Lumiere beat Runway Gen-2 on video quality by 61% to 39%[^6]. The CineD report stressed the lack of any demo or download, framing the model as a research preview, not a product[^9].
A few weeks later, on February 15, 2024, OpenAI unveiled Sora, which rapidly absorbed media attention with longer, higher-resolution clips. At the time of its release, however, Lumiere's STUNet was a sharp public articulation of a research direction that other labs were independently pursuing: processing the whole clip jointly rather than in a cascade[^7][^10].
Lumiere's design goal is to produce a coherent five-second clip from a text prompt without ever invoking a temporal super-resolution stage. The model treats the clip as a single space-time tensor and denoises it across multiple scales in both space and time. The complete pipeline has two networks: a base STUNet that generates all 80 frames at 128×128, and a spatial super-resolution (SSR) network that upsamples them to 1024×1024[^1][^3].
The base STUNet model is the conceptual novelty; the SSR network reuses techniques familiar from prior image and video diffusion work, adapted to the temporal axis.
The STUNet extends a pre-trained text-to-image U-Net from Imagen into a video model by interleaving temporal modules and adding both spatial and temporal down- and up-sampling stages[^1][^6]. The architecture preserves the basic U-shape of an image U-Net (encoder, bottleneck, decoder with skip connections) but at each spatial resolution it also carries a time axis, which is downsampled in the encoder and upsampled in the decoder by dedicated temporal pooling and unpooling modules. Spatial weights from the pre-trained text-to-image backbone are frozen; only the new temporal modules are trained, which makes the optimization tractable on a moderate compute budget compared to training a video model from scratch[^1][^6].
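Two block types are used in the U-Net: convolution-based blocks built on factorized space-time convolutions, used at the shallower levels, and attention-based blocks with temporal attention, which are restricted to the coarsest level (the bottleneck) to keep the cost of full temporal attention manageable.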
Spatial down/up-sampling layers come from the pre-trained text-to-image backbone. Adjacent to each of those, the authors insert temporal down/up-sampling modules that pool and unpool the frame axis. To make the model behave like the original image model at initialization (an important inductive bias when reusing frozen spatial weights), the temporal sampling modules are initialized to perform nearest-neighbor temporal down- and up-sampling, so that the freshly initialized STUNet, before training, treats a video as a stack of independently denoised images[^1][^11]. Ablations in the paper show that nearest-neighbor initialization produces meaningfully better motion than random initialization of the temporal modules[^6][^11].
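As a rough illustration of the operations the temporal modules imitate at initialization, the sketch below implements nearest-neighbor pooling and unpooling along the frame axis. It is not the paper's code: the real modules are learned layers, and the factor of 2, the function names, and the use of string labels in place of image tensors are assumptions made for the example.

```python
def temporal_downsample_nn(frames, factor=2):
    """Nearest-neighbor temporal pooling: keep every `factor`-th frame."""
    return frames[::factor]

def temporal_upsample_nn(frames, factor=2):
    """Nearest-neighbor temporal unpooling: repeat each frame `factor` times."""
    return [frame for frame in frames for _ in range(factor)]

# Toy "video": 8 frames, each represented by a label rather than pixels.
video = [f"frame{i}" for i in range(8)]
coarse = temporal_downsample_nn(video)     # 4 frames
restored = temporal_upsample_nn(coarse)    # back to 8 frames

# A video made of one repeated image survives the round trip unchanged.
static = ["image"] * 8
static_roundtrip = temporal_upsample_nn(temporal_downsample_nn(static))

print(coarse)
print(restored)
```

Neither operation mixes information between different frames, which is why a network initialized this way starts out behaving like the per-frame image model and only learns motion through training.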
Text conditioning enters through the same cross-attention layers as in the underlying Imagen model. A T5-style text encoder produces a sequence of embeddings, and each spatial attention block in the U-Net cross-attends to this sequence; the temporal modules do not introduce additional text conditioning paths[^1][^2].
After STUNet produces an 80×128×128 RGB volume, a separate spatial super-resolution diffusion model upsamples it to 80×1024×1024[^3][^6]. The SSR model is also a diffusion U-Net with temporal modules, but unlike the base model, which handles all 80 frames at 128×128, it cannot economically process all 80 frames jointly at 1024×1024. Instead, the paper applies SSR to overlapping temporal windows of frames (the windows overlap by two frames in the description on the project page and in third-party summaries[^3][^12]).
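To make the windowing concrete, the following sketch computes start indices for overlapping temporal windows over an 80-frame clip. Only the 80-frame length and the two-frame overlap come from the text; the window length of 16 frames, the function name, and the rule of shifting the last window back to the end of the clip are assumptions chosen for illustration.

```python
def window_starts(num_frames, window, overlap):
    """Start indices of temporal windows that together cover every frame.

    Consecutive windows share `overlap` frames. If the regular stride would
    leave frames uncovered at the end, one extra window is placed so that it
    ends exactly at the final frame.
    """
    if not 0 <= overlap < window <= num_frames:
        raise ValueError("need 0 <= overlap < window <= num_frames")
    stride = window - overlap
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return starts

# 80 frames and overlap 2 are from the text; window=16 is hypothetical.
starts = window_starts(num_frames=80, window=16, overlap=2)
covered = {s + k for s in starts for k in range(16)}

print(starts)
```

With these hypothetical settings the final window overlaps its predecessor by more than two frames, which is harmless as long as the per-window outputs are later reconciled wherever they overlap.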
Run independently, the per-window SSR outputs would not match at the window boundaries. To eliminate these mismatches, the authors adopt MultiDiffusion-style aggregation, originally proposed by some of the same Weizmann authors for combining multiple spatial diffusion passes into a single coherent image[^1]. Concretely, at each denoising step of the SSR diffusion, the network's predictions for each window are reconciled by solving a small per-step least-squares problem that minimizes the squared difference between the global per-frame prediction and each window's local prediction in the overlap regions; its closed-form solution is a (weighted) average of the window predictions that cover each frame. The resulting consensus prediction is used as that step's denoised estimate before sampling continues. The effect is that the SSR network behaves as a single global denoiser even though the underlying network only ever sees a window of frames at a time. Ablations in the paper show that without MultiDiffusion aggregation, visible temporal seams appear at window boundaries[^1][^11].
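The per-step reconciliation can be sketched as follows. Minimizing the sum of squared differences between a single global value for a frame and every window's prediction for that frame gives the mean of those predictions, so the aggregation reduces to averaging wherever windows overlap. This is a simplified sketch, not the authors' implementation: each frame is reduced to one number instead of an image tensor, all windows are weighted equally, and the function and variable names are invented for the example.

```python
def aggregate_windows(num_frames, starts, window_preds):
    """Per-frame least-squares consensus of overlapping window predictions.

    Minimizing sum_w (x[j] - pred_w[j]) ** 2 over x[j] yields the mean of
    all window predictions that cover frame j.
    """
    total = [0.0] * num_frames
    count = [0] * num_frames
    for start, pred in zip(starts, window_preds):
        for offset, value in enumerate(pred):
            total[start + offset] += value
            count[start + offset] += 1
    if 0 in count:
        raise ValueError("every frame must be covered by at least one window")
    return [t / c for t, c in zip(total, count)]

# Toy case: 6 frames, two 4-frame windows that overlap on frames 2 and 3.
toy_starts = [0, 2]
toy_preds = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]
consensus = aggregate_windows(6, toy_starts, toy_preds)

print(consensus)
```

Frames seen by only one window keep that window's prediction, while frames in the overlap receive the average, which is what removes the seam between adjacent windows.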
The STUNet base is trained on a dataset of 30 million video and text caption pairs, the size of which is reported both in the paper and in the deeplearning.ai summary[^1][^6]. The dataset's exact provenance is not described in detail in the public paper; Google's video projects of the era typically combined licensed and web-sourced clips. The base spatial network is taken from Imagen and frozen during training; only the inserted temporal modules (factorized temporal convolutions, temporal attention, and temporal pool/unpool layers) are optimized. The training objective is the standard noise-prediction loss of denoising diffusion models, applied to the joint space-time tensor[^1][^6].
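For readers unfamiliar with the noise-prediction objective, here is a minimal sketch in the usual DDPM-style form: a clean sample is mixed with Gaussian noise, and the network is scored on how well it predicts that noise. The flattened toy tensor, the single `alpha_bar` value, and the function names are assumptions for illustration; the paper's actual noise schedule and loss weighting are not reproduced here.

```python
import math
import random

def add_noise(x0, eps, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

rng = random.Random(0)
# Flattened toy "space-time tensor": 4 frames x 2 x 2 pixels x 3 channels.
x0 = [rng.uniform(-1.0, 1.0) for _ in range(4 * 2 * 2 * 3)]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = add_noise(x0, eps, alpha_bar=0.5)

# A perfect denoiser would return eps exactly; a trivial one returns zeros.
perfect_loss = noise_prediction_loss(eps, eps)
zero_loss = noise_prediction_loss([0.0] * len(eps), eps)

print(perfect_loss, zero_loss)
```

The point of applying the loss to the joint tensor is that every frame of the clip is noised and denoised together, so errors in motion show up in the same objective as errors in appearance.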
The SSR model is trained similarly but operates on shorter windows of higher-resolution video.
At inference time, the user supplies a text prompt and an optional seed. The pipeline draws a random Gaussian noise tensor of shape 80×128×128×3, runs the STUNet denoising loop conditioned on the text embedding, then applies the MultiDiffusion-aggregated SSR network to produce the 80×1024×1024 final clip[^1][^3]. Classifier-free guidance is used during sampling, as in other Imagen-derived diffusion systems, although the precise guidance scale and sampler schedule are not the paper's focus[^2].
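Classifier-free guidance combines two noise predictions at every sampling step, one made with the text prompt and one without. The sketch below shows the standard combination rule on toy vectors; the guidance scales of 1.0 and 3.0 are arbitrary illustrative values, since the source does not state the scale Lumiere uses.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions for three values of the tensor.
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

# Scale 1.0 reproduces the conditional prediction; larger scales push further
# in the direction the prompt suggests. Both scales are hypothetical.
unguided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=1.0)
guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=3.0)

print(unguided, guided)
```

Values where the two predictions already agree are left unchanged, so guidance only amplifies the parts of the prediction that depend on the prompt.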
The Lumiere paper and project page demonstrate six concrete applications built on the same base model, each requiring only inference-time modifications or light fine-tuning:
The default mode: given a text prompt, the STUNet generates a five-second clip[^1][^3]. The demos on the project page show animals, vehicles, plants, water, fire, and people in motion, with the kinds of repetitive deformations (walking, swimming, hair, flame turbulence) that prior cascaded models had handled poorly[^3][^6].
To condition the model on a static image as the first frame, the authors treat image-to-video as a temporal inpainting problem: the first frame slot is "observed" (the user-supplied image is repeated and noised at the appropriate scale), the remaining 79 frames are masked, and the STUNet's reverse diffusion inpaints the future. No re-training is required; the model already knows how to fill masked regions because it was implicitly trained on a similar masking objective for video inpainting[^1][^11].
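The temporal-inpainting framing can be pictured as a conditioning video plus a per-frame mask. The sketch below builds both for first-frame conditioning; it uses a frame-level mask and `None` placeholders for simplicity, and the function name and data layout are assumptions, not the paper's interface.

```python
def first_frame_conditioning(image, num_frames=80):
    """Build (conditioning video, mask) for first-frame conditioning.

    mask[t] is 1 where frame t is observed and 0 where it must be generated.
    Unobserved frames are left as None placeholders.
    """
    cond = [image] + [None] * (num_frames - 1)
    mask = [1] + [0] * (num_frames - 1)
    return cond, mask

# "user_image" stands in for the actual image tensor.
cond, mask = first_frame_conditioning(image="user_image")
frames_to_generate = mask.count(0)

print(frames_to_generate)
```

Conditioning on a different set of frames, or on part of each frame, only changes which mask entries are set to 1.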
For stylized text-to-video, the authors fine-tune the spatial layers of the underlying text-to-image model on a small set of style reference images, then perform a linear interpolation between the original spatial weights and the fine-tuned spatial weights to control the strength of stylization. The frozen temporal modules continue to provide motion. This is a video adaptation of the LoRA-style weight-interpolation tricks common in static image generation[^1][^11][^13].
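The interpolation step amounts to a per-parameter weighted sum of two checkpoints. Here is a minimal sketch, assuming each checkpoint is a dictionary from parameter name to a flat list of numbers; the parameter names, the values, and the name `alpha` for the stylization strength are all hypothetical.

```python
def interpolate_weights(original, finetuned, alpha):
    """Return (1 - alpha) * original + alpha * finetuned for every parameter."""
    if original.keys() != finetuned.keys():
        raise ValueError("both checkpoints must have the same parameter names")
    return {
        name: [(1.0 - alpha) * o + alpha * f
               for o, f in zip(original[name], finetuned[name])]
        for name in original
    }

# Hypothetical spatial-layer parameters, flattened to plain lists.
original = {"spatial.conv1": [0.0, 2.0], "spatial.attn1": [1.0]}
finetuned = {"spatial.conv1": [1.0, 4.0], "spatial.attn1": [3.0]}

half_styled = interpolate_weights(original, finetuned, alpha=0.5)

print(half_styled)
```

Setting `alpha` to 0 recovers the original spatial weights and setting it to 1 gives the fully fine-tuned style, with intermediate values trading stylization strength against fidelity to the base model.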
By coupling Lumiere with SDEdit, the model can edit an input clip according to a new text prompt: the source video is noised to an intermediate diffusion timestep, then the STUNet denoises it conditioned on the target prompt, producing a clip that retains the input's spatial layout and motion but applies a new appearance. The advantage over editing each frame independently with a static image editor is that the STUNet's temporal layers enforce coherence across the edited frames[^1][^11][^13].
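The first half of an SDEdit-style edit, noising the source to an intermediate level, can be sketched as below. The `strength` parameter, the simple variance-preserving mix used in place of a real noise schedule, and the flattened toy clip are assumptions for illustration; the text-conditioned denoising loop that would follow is omitted.

```python
import math
import random

def sdedit_start(source, strength, rng):
    """Noise `source` to an intermediate level set by `strength` in [0, 1].

    strength = 0 keeps the source untouched; strength = 1 replaces it with
    pure Gaussian noise. The square-root mix is a stand-in for a real
    diffusion noise schedule.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    keep, noise = math.sqrt(1.0 - strength), math.sqrt(strength)
    return [keep * x + noise * rng.gauss(0.0, 1.0) for x in source]

rng = random.Random(0)
source_clip = [0.5, -0.25, 1.0, 0.0]   # flattened toy video

untouched = sdedit_start(source_clip, strength=0.0, rng=rng)
partly_noised = sdedit_start(source_clip, strength=0.6, rng=rng)
# A real pipeline would now run the denoiser, conditioned on the target
# prompt, from this intermediate state down to zero noise.

print(partly_noised)
```

Lower strengths preserve more of the source layout and motion, while higher strengths give the target prompt more freedom to change the content.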
A cinemagraph animates a localized region of an otherwise static image (a classic example is a clip in which a person's hair moves while everything else is frozen). Lumiere realizes this by conditioning the model on the static image as the first frame, then masking only the user-specified region for the remaining frames. The unmasked regions are held fixed across all 80 time slots so the still parts of the image stay still, while STUNet denoises plausible motion inside the masked region[^1][^3].
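The invariant described here, that pixels outside the chosen region are identical in every frame, can be illustrated with a simple compositing function. This is a deliberate simplification: in Lumiere the constraint is imposed through conditioning during denoising, whereas the sketch applies it after the fact to stand-in frames. The function name and the tiny 2×2 data are hypothetical.

```python
def composite_cinemagraph(image, generated_frames, region):
    """Keep `image` outside `region`; take generated pixels inside it.

    `image` and each generated frame are 2D lists (rows of pixel values);
    `region` is a 2D list of 0/1 flags marking where motion is allowed.
    """
    return [
        [
            [g if m else s for g, s, m in zip(g_row, s_row, m_row)]
            for g_row, s_row, m_row in zip(frame, image, region)
        ]
        for frame in generated_frames
    ]

image = [[10, 10], [10, 10]]                       # static 2x2 picture
region = [[0, 1], [0, 0]]                          # only the top-right pixel may move
generated = [[[t, t], [t, t]] for t in range(3)]   # stand-in for model output

clip = composite_cinemagraph(image, generated, region)

print(clip)
```

Only the top-right pixel changes from frame to frame in the result, which is the defining property of a cinemagraph.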
Lumiere can fill missing video regions: the user supplies a video with a masked spatial or spatiotemporal region and a text prompt describing the desired content, and the STUNet generates content that is both temporally coherent with the surrounding video and consistent with the prompt. Outpainting (extending a video outside its original frame) is a special case where the mask covers the new spatial padding region across all time steps[^1][^3].
The paper reports automated metrics (Fréchet Video Distance, Inception Score) and human-preference results from a two-alternative forced-choice (2AFC) study comparing Lumiere with Imagen Video, Pika, Stable Video Diffusion, Runway Gen-2, AnimateDiff, and ZeroScope[^1][^6]. The numbers reported in third-party coverage of the paper include Lumiere achieving a Fréchet Video Distance of 332.49 (lower is better) and an Inception Score of 37.54 (higher is better), versus 367.23 and 33.00 respectively for Imagen Video[^11]. The human study, summarized by the deeplearning.ai newsletter, reports that on video quality Lumiere was preferred over its strongest baseline (Runway Gen-2) by 61% to 39%, and on text-video alignment over its strongest baseline (Imagen Video) by 55% to 45%[^6]. The paper notes that 16-frame baselines such as AnimateDiff and 36-frame baselines such as ZeroScope produce shorter clips than Lumiere's 80 frames, which complicates direct FVD comparison[^11].
The authors evaluate text alignment using CLIP-based metrics including CLIP Score. Like all video-generation evaluations of the period, the results are heavily dependent on the prompt set used, and the paper acknowledges that automated metrics correlate imperfectly with human judgments of motion realism[^1][^11].
The table below summarizes Lumiere alongside other publicly described text-to-video systems available around the time of its release. Specifications are taken from each system's own documentation or paper.
| System | Developer | Approach | Output | Public access |
|---|---|---|---|---|
| Lumiere | Google Research, Weizmann, Tel Aviv U., Technion | Single-pass STUNet plus MultiDiffusion SSR | 80 frames at 16 fps (5 s), 1024×1024[^1][^3] | Research only, no public weights[^7][^9] |
| Imagen Video | Google Research | Cascade of 7 diffusion models with temporal super-resolution | 128 frames at 24 fps, 1280×768[^2] | Research only |
| Sora (original) | OpenAI | Spacetime patch transformer over latent video tokens | Up to 60 s clips, up to 1080p[^10] | Initially research preview; later product (Sora 2)[^10] |
| Veo | Google DeepMind | Latent video diffusion with Transformer backbones | "Over a minute" at 1080p (Veo 1)[^14] | API/product (later Veo 2, Veo 3)[^14] |
| Runway Gen-2 / Gen-3 | Runway | Proprietary latent video diffusion | Up to ~16 s; up to 1280×768[^15] | Commercial product[^15] |
| Pika 1.0 | Pika Labs | Proprietary diffusion | Short clips around 3 s | Commercial product |
| Stable Video Diffusion | Stability AI | Image-to-video latent diffusion | 14 or 25 frames at 1024×576 | Open weights |
| AnimateDiff | Open source community | Plug-in temporal module for image diffusion checkpoints | 16 frames typical | Open source |
| ZeroScope | Open source community | Open text-to-video diffusion | 36 frames typical | Open source |
The distinguishing architectural feature of Lumiere relative to Imagen Video, AnimateDiff, ZeroScope, and Stable Video Diffusion is the unified space-time U-Net with explicit temporal down/up-sampling. Sora, announced shortly after Lumiere, similarly avoids a cascaded keyframe-plus-interpolation pipeline by operating on a joint latent representation of patches in space-time; the two systems are spiritual cousins in this respect even though they differ in backbone (U-Net vs. Diffusion Transformer)[^1][^10]. Veo, Google DeepMind's later product line, integrated lessons from both Lumiere and from internal video research, although Google has not publicly stated which specific Lumiere components are reused in Veo[^14].
The paper itself acknowledges several limitations. First, Lumiere is "not designed to generate videos that consist of multiple shots, or that involve transitions between scenes"[^1][^7]. The model is trained on continuous 80-frame clips, so a single forward pass produces what looks like one camera shot. Multi-shot films would require either extending the architecture (longer clips, or scene-graph conditioning) or stitching multiple Lumiere clips together with a separate transition model.
Second, the five-second clip length is fixed by the training distribution. The 80-frame design was chosen to match a typical shot length in media production and to fit within a reasonable computational budget given full temporal attention at the coarsest U-Net level. Generating longer clips is not directly supported by the architecture as published, although the SSR's overlapping-window approach gives hints about how a temporal extrapolation pipeline could be built[^1].
Third, the lowest spatial resolution at which STUNet operates (128×128) constrains the level of fine detail it can plan for. The SSR network adds plausible high-frequency detail, but it does not have the global temporal context that the base STUNet has, so very fine textures can flicker frame-to-frame in cases where the base model has not committed to a specific high-frequency pattern[^1][^11].
Fourth, the compute cost of training on 30 million videos and running inference with full temporal attention at the coarsest level remains substantial. Independent reproductions of Lumiere on smaller datasets have not, to date, matched the demo quality of the original, although the paper's detailed description of the architecture has fed downstream open-source work in space-time U-Net design[^11][^13].
Fifth, like all large diffusion systems trained on web-scale video data, Lumiere inherits the dataset's biases. The project page and paper acknowledge this in their ethics and impact statements[^1][^3].
In the paper's "Societal Impact" section, the authors write that the main goal of work like Lumiere is to "enable novice users to generate visual content in a creative and flexible way," but they note an attendant "risk of misuse for creating fake or harmful content with our technology," and they say it is "crucial to develop and apply tools for detecting biases and malicious use cases"[^1][^3][^11]. The project page reiterates this language and stops short of providing weights, an API, or a hosted demo[^3].
The decision not to release model weights or a public demo was widely reported at launch. Footage Secrets, CineD, and TechRadar all flagged the absence of any access path[^9][^16]. Google as a company has, since Lumiere, channeled its public-facing video generation through the Veo product line (delivered through Google DeepMind's consumer surfaces), with content provenance controls including SynthID-style invisible AI watermarking applied to generated media[^14]. The Lumiere paper does not itself adopt a particular watermarking method, but the broader Google product strategy of marking AI-generated video for deepfake mitigation is consistent with the ethics statement's call for detection tools[^3].
Lumiere's significance is largely conceptual. The paper publicly articulated, with strong demo evidence, that the dominant cascaded design for text-to-video generation was a workable but suboptimal compromise, and that a model designed from the start to reason jointly over space and time could produce more coherent motion at five-second clip lengths[^1][^6][^7]. This argument prefigures the design philosophy of Sora and of a number of academic follow-ups, which similarly process whole clips as joint space-time tensors rather than in stages[^10].
The architectural innovations of Lumiere, in particular the factorized space-time convolutions in shallow U-Net levels, temporal attention restricted to the bottleneck, nearest-neighbor initialization of temporal pooling modules, and MultiDiffusion-style aggregation of overlapping windows for SSR, have informed open-source video diffusion work, even though the Lumiere weights themselves were never released[^13]. The decision to reuse a frozen pretrained text-to-image U-Net and train only the new temporal layers also reinforces a pattern (already familiar from AnimateDiff and Stable Video Diffusion) of treating video models as temporal adapters on top of image foundation models, which is now standard practice for academic-budget video research[^6][^13].
For Google specifically, Lumiere sits between the earlier cascaded Imagen Video (which never shipped as a product) and the later, more product-focused Veo line (Veo, Veo 2, Veo 3) that powers consumer-facing surfaces[^14]. The company has not publicly broken down the architectural lineage from Lumiere to Veo, so any direct reuse is unconfirmed; the broader research direction (single-pass joint space-time generation, with separate spatial super-resolution) is nonetheless consistent with the path the Veo generations have taken toward higher resolutions, longer clips, and audio[^14].
Lumiere stands in a lineage of AI video generation systems based on denoising diffusion. Earlier text-conditional video diffusion work that the paper directly compares against, and that influenced its design, includes:
After Lumiere, the field moved to longer clips, higher resolutions, and (for some systems) joint video-and-audio generation. Notable contemporaneous or later systems include: