# Lumiere

> Source: https://aiwiki.ai/wiki/lumiere
> Updated: 2026-07-16
> Categories: Diffusion Models, Google, Video Generation
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Lumiere** is a text-to-video [diffusion model](/wiki/diffusion_model) developed by Google Research in collaboration with researchers from the Weizmann Institute of Science, Tel Aviv University, and the Technion, introduced in January 2024 through the preprint "Lumiere: A Space-Time Diffusion Model for Video Generation" (arXiv:2401.12945)[^1]. The system is best known for its **Space-Time U-Net** (STUNet) architecture, which generates the entire temporal extent of a clip in a single forward pass instead of relying on the cascade-of-keyframes pipeline plus temporal super-resolution that earlier systems such as Imagen Video and Phenaki had used[^1][^2]. A base Lumiere model outputs 80-frame clips at 16 frames per second, or roughly five seconds of motion, at a low spatial resolution of 128×128 pixels that is subsequently upsampled to 1024×1024 by a spatial super-resolution (SSR) network applied with a MultiDiffusion-style overlapping-window scheme[^1][^3]. The paper, demos, and supplementary applications, including image-to-video, stylized generation, video editing, cinemagraphs, and inpainting, were presented at SIGGRAPH Asia 2024 in Tokyo[^4]. Google did not release Lumiere as a public product, API, or open-weights checkpoint; the project remains research-only, and its design ideas have flowed instead into later Google [AI video generation](/wiki/ai_video_generation) efforts such as the [Veo](/wiki/veo) family[^5].

## Infobox

| Field | Value |
|---|---|
| Full name | Lumiere: A Space-Time Diffusion Model for Video Generation |
| Type | Text-to-video [diffusion](/wiki/diffusion_model) model |
| Developer | Google Research; Weizmann Institute of Science; Tel Aviv University; Technion |
| arXiv ID | 2401.12945 (v1: 2024-01-23; v2: 2024-02-05) |
| Initial announcement | January 25, 2024 |
| Conference | SIGGRAPH Asia 2024, Tokyo, December 3 to 6, 2024 |
| DOI | 10.1145/3680528.3687614 |
| Output length | 80 frames at 16 fps (5 seconds) |
| Base resolution | 128 × 128 pixels |
| Final resolution | 1024 × 1024 pixels (after SSR) |
| Base text-to-image model | Imagen (frozen weights, extended with temporal layers) |
| Training data | 30 million video and text caption pairs |
| Release status | Research only; no public weights, API, or demo |
| Project page | lumiere-video.github.io |

Sources for the values above: the arXiv paper[^1], the project page[^3], the SIGGRAPH Asia 2024 program[^4], the deeplearning.ai coverage[^6], and the VentureBeat announcement[^7].

## Background

### The video diffusion landscape before Lumiere

Video generation models that predate Lumiere converged on a cascade design. A first network produces a short, low-frame-rate sequence of keyframes at low spatial resolution, and a chain of temporal and spatial super-resolution modules then fill in intermediate frames and upscale to a target resolution. Google's own Imagen Video, the closest prior reference for the Lumiere team, was built on seven cascaded [diffusion](/wiki/diffusion_model) networks: a base text-to-video model produced 16 frames at 24×40 resolution and 3 fps, then alternating temporal and spatial super-resolution stages stretched the output to 128 frames at 1280×768 resolution and 24 fps[^2]. Meta's Make-A-Video adopted a similar staged approach, and Google's Phenaki used a different decomposition (token-based prediction plus a video super-resolution decoder) but still produced long sequences by stitching shorter ones together[^1].
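The frame counts quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only (the helper name `clip_seconds` is not from either paper); it shows that the cascade's temporal super-resolution stages add frames without changing the clip's duration, so seven of every eight output frames are interpolated.

```python
from fractions import Fraction

def clip_seconds(frames: int, fps: int) -> Fraction:
    """Duration of a clip in seconds, kept exact as a fraction."""
    return Fraction(frames, fps)

# Imagen Video figures quoted above: base stage vs. final cascade output.
base = clip_seconds(16, 3)       # 16 keyframes at 3 fps
final = clip_seconds(128, 24)    # 128 frames at 24 fps

# Lumiere's base model for comparison: 80 frames at 16 fps.
lumiere = clip_seconds(80, 16)

# Upsampling factors applied by the cascade after the base stage.
temporal_upsampling = 128 // 16                 # 8x more frames
spatial_upsampling = (1280 // 40, 768 // 24)    # 32x along each spatial axis

print(base, final, lumiere, temporal_upsampling, spatial_upsampling)
```

Both Imagen Video stages cover the same 16/3 seconds, so at 3 fps the keyframes sit a third of a second apart, and everything in between has to be inferred by the interpolation stages.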

Cascaded designs win on compute efficiency: each stage operates on a manageable tensor, and temporal super-resolution networks are reused across windows of frames. Their weakness is consistency. Each temporal super-resolution stage interpolates between the keyframes it sees in a sliding window and cannot draw on global context across the whole clip. The Lumiere paper argues that this approach "inherently makes global temporal consistency difficult to achieve": the keyframes are synthesized at an aggressively sub-sampled frame rate, so fast motion is temporally aliased and ambiguous before any high-frame-rate signal exists, and a windowed interpolation network cannot always recover a single plausible motion that holds across multiple windows[^1]. For repetitive or fast motion, such as walking, hand gestures, or rotating objects, this manifests as jitter, jumpy transitions, or content that subtly mutates between consecutive sub-clips[^1][^6].

### Origins and authorship

Lumiere was written by 17 authors. The first author is Omer Bar-Tal, with Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, and Inbar Mosseri[^4]. The collaboration draws together Google Research video and graphics groups (which had previously shipped Imagen and Imagen Video) with academic labs in Israel, including Tali Dekel's group at the Weizmann Institute, Tomer Michaeli's group at the Technion, and Tel Aviv University, with which Hila Chefer is affiliated[^1][^3]. The paper was first posted to arXiv on January 23, 2024, with version 2 appearing on February 5, 2024[^1]. [Google](/wiki/google) separately publicized the project through a research blog and demo videos on January 25, 2024[^5][^7].

### Reception of the announcement

Coverage in January 2024 was unusually intense for a research preprint because the released demos visibly outperformed contemporary commercial systems on motion quality. VentureBeat called the results "remarkably realistic" and emphasized the contrast with Runway Gen-2 and Pika clips, which often showed object distortion or warping in motion[^7]. The Decoder summarized the architecture as a single-pass alternative to keyframe interpolation, and noted that Google was not releasing the model for public use[^8]. The deeplearning.ai newsletter ran a technical summary noting the 30-million-video training set, the 1024×1024 final resolution, and a user study in which raters preferred Lumiere over Runway Gen-2 on video quality by 61% to 39%[^6]. The CineD report stressed the lack of any demo or download, framing the model as a research preview, not a product[^9].

A few weeks later, on February 15, 2024, OpenAI unveiled [Sora](/wiki/sora), which rapidly absorbed media attention with longer, higher-resolution clips. At the time of the Lumiere release, however, the underlying STUNet idea (process the whole clip jointly rather than in a cascade) was a sharp public articulation of a research direction that other labs were independently pursuing[^7][^10].

## Technical details

### Goal and high-level pipeline

Lumiere's design goal is to produce a coherent five-second clip from a text prompt without ever invoking a temporal super-resolution stage. The model treats the clip as a single space-time tensor and denoises it across multiple scales in both space and time. The complete pipeline has two networks:

1. A **Space-Time U-Net** (STUNet) base model that synthesizes the full 80-frame clip at 128×128 resolution in one pass, conditioned on a text embedding[^1][^3].
2. A **spatial super-resolution** model that upsamples each frame to 1024×1024, applied with a MultiDiffusion-style aggregation over overlapping temporal windows so that boundaries between segments do not produce visible artifacts[^1][^3].

The base STUNet model is the conceptual novelty; the SSR network reuses techniques familiar from prior image and video [diffusion](/wiki/diffusion_model) work, adapted to the temporal axis.

### Space-Time U-Net architecture

The STUNet extends a pre-trained text-to-image [U-Net](/wiki/unet) from [Imagen](/wiki/imagen) into a video model by interleaving temporal modules and adding both spatial and temporal down- and up-sampling stages[^1][^6]. The architecture preserves the basic U-shape of an image [U-Net](/wiki/unet) (encoder, bottleneck, decoder with skip connections) but at each spatial resolution it also carries a time axis, which is downsampled in the encoder and upsampled in the decoder by dedicated temporal pooling and unpooling modules. Spatial weights from the pre-trained text-to-image backbone are frozen; only the new temporal modules are trained, which makes the optimization tractable on a moderate compute budget compared to training a video model from scratch[^1][^6].

Within each level of the U-Net, two block types are used:

- **Convolutional blocks at most levels.** Each pre-trained 2D spatial residual block is followed by a **factorized space-time [convolution](/wiki/convolutional_layer)**. The pre-trained 3×3 spatial filter is reinterpreted as a 1×3×3 space-only 3D filter (so the spatial weights port over unchanged), and the new temporal axis is processed by an additional 3×1×1 1D temporal convolution. This factorization keeps compute close to the original image model while adding learnable temporal dependencies, and the paper notes it provides more non-linearities than a pure 3D convolution while remaining cheaper[^1][^11].
- **Attention-based blocks at the coarsest level only.** The spatial [attention](/wiki/attention) block from the image [U-Net](/wiki/unet) is followed by a **temporal attention block** that attends along the time axis, with the spatial dimensions folded into the batch dimension. Because the cost of full temporal [attention](/wiki/attention) scales quadratically with the number of frames, the authors restrict temporal attention to the deepest, most compressed level of the U-Net, where the space-time tensor is small enough for attention over the full temporal extent of the clip to be affordable[^1][^11].
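The two block types can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: it uses a single channel, zero padding, a ReLU between the spatial and temporal filters, and attention without learned query, key, or value projections, all of which are simplifying assumptions. The function names are hypothetical.

```python
import numpy as np

def spatial_conv(x, w2d):
    """Apply a 3x3 filter to every frame independently (a 1x3x3 filter).

    x: (T, H, W) single-channel float video; w2d: (3, 3). Zero padding keeps the shape.
    """
    T, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += w2d[dy, dx] * xp[:, dy:dy + H, dx:dx + W]
    return out

def temporal_conv(x, w1d):
    """Apply a length-3 filter along the time axis only (a 3x1x1 filter)."""
    T = x.shape[0]
    xp = np.pad(x, ((1, 1), (0, 0), (0, 0)))
    return sum(w1d[dt] * xp[dt:dt + T] for dt in range(3))

def factorized_block(x, w2d, w1d):
    """Spatial filter, a non-linearity, then the temporal filter."""
    return temporal_conv(np.maximum(spatial_conv(x, w2d), 0.0), w1d)

def temporal_attention(x):
    """Self-attention over time with spatial positions folded into the batch.

    x: (T, H, W, C). Each pixel location attends across its own T frames.
    """
    T, H, W, C = x.shape
    seq = x.reshape(T, H * W, C).transpose(1, 0, 2)       # (H*W, T, C)
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(C)    # (H*W, T, T)
    scores -= scores.max(axis=-1, keepdims=True)          # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ seq                                   # (H*W, T, C)
    return out.transpose(1, 0, 2).reshape(T, H, W, C)
```

With the temporal filter set to `[0, 1, 0]`, `factorized_block` reduces to the per-frame spatial filter, which mirrors how the pre-trained image weights carry over. The `scores` tensor in `temporal_attention` has `H * W * T * T` entries, which is why the quadratic cost in `T` is only tolerable where `H * W` is small.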

Spatial down/up-sampling layers come from the pre-trained text-to-image backbone. Adjacent to each of those, the authors insert **temporal down/up-sampling modules** that pool and unpool the frame axis. To make the model behave like the original image model at initialization (an important inductive bias when reusing frozen spatial weights), the temporal sampling modules are initialized to perform nearest-neighbor down- and up-sampling, so that the freshly initialized STUNet, before training, treats a video as a stack of independently denoised images[^1][^11]. Ablations in the paper show that nearest-neighbor initialization produces meaningfully better motion than random initialization of the temporal modules[^6][^11].
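What nearest-neighbor behavior means along the frame axis can be shown with a small sketch. The factor of 2 and the function names are assumptions for illustration; in the actual model these are learned modules that merely start out behaving this way.

```python
import numpy as np

def temporal_downsample_nn(x, factor=2):
    """Nearest-neighbour temporal downsampling: keep every `factor`-th frame."""
    return x[::factor]

def temporal_upsample_nn(x, factor=2):
    """Nearest-neighbour temporal upsampling: repeat each frame `factor` times."""
    return np.repeat(x, factor, axis=0)

video = np.arange(8 * 2 * 2, dtype=float).reshape(8, 2, 2)   # (T, H, W)
coarse = temporal_downsample_nn(video)       # shape (4, 2, 2)
restored = temporal_upsample_nn(coarse)      # shape (8, 2, 2)
```

Neither operation mixes information between different frames: every output frame is an exact copy of one input frame, so at initialization the temporal path adds no cross-frame computation on top of the image model.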

Text conditioning enters through the same [cross-attention](/wiki/cross_attention) layers as in the underlying Imagen model. A T5-style text encoder produces a sequence of embeddings, and each spatial attention block in the U-Net cross-attends to this sequence; the temporal modules do not introduce additional text conditioning paths[^1][^2].

### MultiDiffusion-based spatial super-resolution

After STUNet produces an 80×128×128 RGB volume, a separate spatial super-resolution diffusion model upsamples it to 80×1024×1024[^3][^6]. The SSR model is also a [diffusion](/wiki/diffusion_model) U-Net with temporal modules, but unlike the base model it cannot economically process all 80 frames jointly, because the memory cost at 1024×1024 is too high. Instead, the paper applies SSR to overlapping temporal windows of frames. According to the project page and third-party summaries, adjacent windows overlap by two frames[^3][^12].

Run independently, the per-window SSR outputs would not match at the window boundaries. To eliminate this, the authors adopt **MultiDiffusion** style aggregation, originally proposed by some of the same Weizmann authors for combining multiple spatial diffusion passes into a single coherent image[^1]. Concretely, at each [denoising](/wiki/denoising) step of the SSR diffusion, the network's predictions for each window are reconciled by solving a small per-step least-squares problem that minimizes the squared difference between the global per-frame prediction and each window's local prediction in the overlap regions. The resulting consensus prediction is fed back as the next step's denoising target. The effect is that the SSR network behaves as a single global denoiser despite the underlying network only ever seeing a window of frames at a time. Ablations in the paper show that without MultiDiffusion aggregation, visible temporal seams appear at window boundaries[^1][^11].
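The reconciliation step can be sketched as follows. With equal weights, the least-squares problem described above has a closed-form solution: each frame's consensus value is the mean of the predictions from every window that covers it. The window length of 16 in the usage note and both function names are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def window_starts(num_frames, window, overlap):
    """Start indices of overlapping temporal windows that cover all frames."""
    if num_frames < window:
        raise ValueError("clip is shorter than one window")
    stride = window - overlap
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:      # make sure the tail is covered
        starts.append(num_frames - window)
    return starts

def aggregate_windows(preds, starts, num_frames):
    """Combine per-window predictions into one per-frame prediction.

    Minimising sum_w ||x[window w] - pred_w||^2 over x gives, for each frame,
    the mean of the window predictions that cover that frame.
    """
    window = preds[0].shape[0]
    total = np.zeros((num_frames,) + preds[0].shape[1:])
    count = np.zeros(num_frames)
    for pred, s in zip(preds, starts):
        total[s:s + window] += pred
        count[s:s + window] += 1
    return total / count.reshape((-1,) + (1,) * (total.ndim - 1))
```

For example, `window_starts(80, 16, 2)` yields `[0, 14, 28, 42, 56, 64]`; the final window is shifted back so that it ends exactly at frame 79. In a sampler this aggregation would run once per denoising step, so disagreements between windows are averaged out before they can accumulate.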

### Training data and procedure

The STUNet base is trained on a dataset of **30 million video and text caption pairs**, the size of which is reported both in the paper and in the deeplearning.ai summary[^1][^6]. The dataset's exact provenance is not described in detail in the public paper; Google's video projects of the era typically combined licensed and web-sourced clips. The base spatial network is taken from Imagen and frozen during training; only the inserted temporal modules (factorized temporal convolutions, temporal attention, and temporal pool/unpool layers) are optimized. The training objective is the standard noise-prediction loss of denoising [diffusion](/wiki/diffusion_model) models, applied to the joint space-time tensor[^1][^6].
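The noise-prediction objective can be written down in a few lines. The sketch below uses the standard DDPM parameterization with a cumulative signal level `alpha_bar`; the `model` callable stands in for the STUNet, text conditioning is omitted, and the paper's actual noise schedule and loss weighting are not assumed.

```python
import numpy as np

def add_noise(x0, eps, alpha_bar):
    """Forward diffusion: mix a clean video x0 with Gaussian noise eps."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def noise_prediction_loss(model, x0, alpha_bar, rng):
    """Mean squared error between the true and the predicted noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = add_noise(x0, eps, alpha_bar)
    return float(np.mean((model(x_t, alpha_bar) - eps) ** 2))

# Toy usage: a (T, H, W, C) clip and a placeholder model that predicts zero noise.
rng = np.random.default_rng(0)
clip = rng.standard_normal((4, 8, 8, 3))
zero_model = lambda x_t, alpha_bar: np.zeros_like(x_t)
loss = noise_prediction_loss(zero_model, clip, alpha_bar=0.5, rng=rng)
```

The loss is computed over the whole space-time tensor at once, so errors in any frame contribute to the same scalar.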

The SSR model is trained similarly but operates on shorter windows of higher-resolution video.

### Inference

At inference time, the user supplies a text prompt and an optional seed. The pipeline draws a random Gaussian noise tensor of shape 80×128×128×3, runs the STUNet denoising loop conditioned on the text embedding, then applies the MultiDiffusion-aggregated SSR network to produce the 80×1024×1024 final clip[^1][^3]. Classifier-free guidance is used during sampling, as in other Imagen-derived [diffusion](/wiki/diffusion_model) systems, although the precise guidance scale and sampler schedule are not the paper's focus[^2].
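Classifier-free guidance combines two noise predictions at every sampling step, one made with the text embedding and one made without it. The sketch below shows the usual combination rule; the guidance scale of 3.0 is an arbitrary illustrative value, not one reported for Lumiere, and the small arrays stand in for the 80×128×128×3 tensor.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Extrapolate from the unconditional towards the text-conditioned prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((2, 4, 4, 3))   # prediction with an empty prompt
eps_cond = rng.standard_normal((2, 4, 4, 3))     # prediction with the text prompt
guided = classifier_free_guidance(eps_uncond, eps_cond, scale=3.0)
```

A scale of 1 reproduces the conditional prediction, and larger scales push the sample further toward the prompt.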

## Applications

The Lumiere paper and project page demonstrate six concrete applications built on the same base model, each requiring only inference-time modifications or light fine-tuning:

### Text-to-video

The default mode: given a text prompt, the STUNet generates a five-second clip[^1][^3]. The demos on the project page show animals, vehicles, plants, water, fire, and people in motion, with the kinds of repetitive deformations (walking, swimming, hair, flame turbulence) that prior cascaded models had handled poorly[^3][^6].

### Image-to-video

To condition the model on a static image as the first frame, the authors treat image-to-video as a temporal **inpainting** problem: the first frame slot is "observed" (it holds the user-supplied image), the remaining 79 frames are masked, and the STUNet's reverse [diffusion](/wiki/diffusion_model) inpaints the future. This is not a purely inference-time trick: the paper fine-tunes the base model to accept the masked conditioning video and its binary mask as extra input channels alongside the noisy video, and the same conditioned model also serves video inpainting[^1][^11].
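The first-frame conditioning can be sketched as a masked video plus a binary mask. The layout below (mask value 1 for observed pixels, conditioning stacked with the noisy video along the channel axis) is an assumption for illustration, and both function names are hypothetical.

```python
import numpy as np

def image_to_video_condition(image, num_frames=80):
    """Build a conditioning video and mask with only the first frame observed.

    image: (H, W, 3). Returns cond (T, H, W, 3) and mask (T, H, W, 1),
    where mask == 1 marks observed pixels.
    """
    h, w, _ = image.shape
    cond = np.zeros((num_frames, h, w, 3), dtype=image.dtype)
    mask = np.zeros((num_frames, h, w, 1), dtype=image.dtype)
    cond[0] = image     # the user-supplied image fills the first frame slot
    mask[0] = 1         # every later frame stays masked
    return cond, mask

def stack_model_input(noisy_video, cond, mask):
    """Concatenate noisy video, conditioning video and mask along channels."""
    return np.concatenate([noisy_video, cond, mask], axis=-1)
```

With RGB video this gives a 7-channel input (3 noisy, 3 conditioning, 1 mask), so the denoiser sees at every step which pixels are given and which it must synthesize.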

### Stylized generation

For stylized text-to-video, the authors fine-tune the **spatial** layers of the underlying text-to-image model on a small set of style reference images, then perform a linear interpolation between the original spatial weights and the fine-tuned spatial weights to control the strength of stylization. The temporal modules are left unchanged and continue to provide motion. This is a video adaptation of the LoRA-style weight-interpolation tricks common in static image generation[^1][^11][^13].
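The interpolation itself is a per-parameter blend. In the sketch below the weights are held in plain dictionaries of arrays and the layer name `conv` is made up; `alpha` is the stylization strength.

```python
import numpy as np

def interpolate_weights(w_orig, w_style, alpha):
    """Blend two sets of spatial weights.

    alpha = 0 returns the original weights, alpha = 1 the fully stylized ones.
    """
    return {name: (1.0 - alpha) * w_orig[name] + alpha * w_style[name]
            for name in w_orig}

w_orig = {"conv": np.zeros((3, 3))}
w_style = {"conv": np.ones((3, 3))}
blended = interpolate_weights(w_orig, w_style, alpha=0.25)
```

Only the spatial weights are blended, so the motion learned by the temporal layers is shared across every stylization strength.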

### Consistent video editing (video-to-video)

By coupling Lumiere with **SDEdit**, the model can edit an input clip according to a new text prompt: the source video is noised to an intermediate diffusion timestep, then the STUNet denoises it conditioned on the target prompt, producing a clip that retains the input's spatial layout and motion but applies a new appearance. The advantage over editing each frame independently with a static image editor is that the STUNet's temporal layers enforce coherence across the edited frames[^1][^11][^13].
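The SDEdit starting point is the source clip pushed partway through the forward diffusion. The sketch below assumes the standard DDPM noising formula with a cumulative signal level `alpha_bar_t`; the function name is hypothetical, and the denoising loop that follows is omitted.

```python
import numpy as np

def sdedit_start(source_video, alpha_bar_t, rng):
    """Noise the source clip to an intermediate timestep (the SDEdit starting point)."""
    eps = rng.standard_normal(source_video.shape)
    return np.sqrt(alpha_bar_t) * source_video + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(0)
source = np.ones((4, 2, 2, 3))                         # stand-in for an input clip
start = sdedit_start(source, alpha_bar_t=0.5, rng=rng)
```

The choice of `alpha_bar_t` sets the trade-off: values near 1 keep the source almost intact and allow only small edits, while values near 0 discard most of the source and leave the prompt in control.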

### Cinemagraphs

A cinemagraph animates a localized region of an otherwise static image (a classic example is a clip in which a person's hair moves while everything else is frozen). Lumiere realizes this by conditioning the model on the static image as the first frame, then masking only the user-specified region for the remaining frames. The unmasked regions are held fixed across all 80 time slots so the still parts of the image stay still, while STUNet denoises plausible motion inside the masked region[^1][^3].
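The cinemagraph conditioning can be sketched as a mask that is fully observed on the first frame and hides only the user-specified region afterwards. The convention (mask value 1 for observed pixels, hidden pixels zeroed in the conditioning video) and the function name are assumptions for illustration.

```python
import numpy as np

def cinemagraph_condition(image, animate_region, num_frames=80):
    """Conditioning for a cinemagraph.

    image: (H, W, 3) float array; animate_region: (H, W) bool array that is
    True where motion is wanted. Returns cond (T, H, W, 3) and mask (T, H, W, 1).
    """
    cond = np.repeat(image[None], num_frames, axis=0)    # still image on every frame
    mask = np.ones((num_frames,) + animate_region.shape + (1,), dtype=image.dtype)
    mask[1:, animate_region] = 0      # hide the region on all frames after the first
    return cond * mask, mask          # zero out the hidden pixels
```

Outside the region every frame carries the same observed pixels, which is what keeps the still parts of the image still.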

### Inpainting and outpainting

Lumiere can fill missing video regions: the user supplies a video with a masked spatial or spatiotemporal region and a text prompt describing the desired content, and the STUNet generates content that is both temporally coherent with the surrounding video and consistent with the prompt. Outpainting (extending a video outside its original frame) is a special case where the mask covers the new spatial padding region across all time steps[^1][^3].

## Quantitative results

The paper reports automated metrics (Fréchet Video Distance, Inception Score) and human-preference results from a two-alternative forced-choice (2AFC) study comparing Lumiere with [Imagen](/wiki/imagen) Video, Pika, Stable Video Diffusion, Runway Gen-2, AnimateDiff, and ZeroScope[^1][^6]. The numbers reported in third-party coverage of the paper include Lumiere achieving a [Fréchet Video Distance](/wiki/frechet_inception_distance) of 332.49 and an Inception Score of 37.54, versus 367.23 and 33.00 respectively for Make-A-Video[^11]. The human study, summarized in DeepLearning.AI's newsletter The Batch, reports that on video quality Lumiere was preferred over its strongest baseline ([Runway](/wiki/runway_gen_3) Gen-2) by 61% to 39%, and on text-video alignment over its strongest baseline ([Imagen](/wiki/imagen) Video) by 55% to 45%[^6]. The paper notes that 16-frame baselines such as AnimateDiff and 36-frame baselines such as ZeroScope produce shorter clips than Lumiere's 80 frames, which complicates direct [FVD](/wiki/frechet_inception_distance) comparison[^11].
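Fréchet-style metrics such as FVD compare two sets of feature vectors, one from real and one from generated videos, by fitting a Gaussian to each and measuring the Fréchet distance between the fits. The sketch below implements only that final distance from precomputed means and covariances; the feature extractor (a pretrained video network in practice) is outside its scope, and the function name is illustrative.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between the Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # tr((sigma1 sigma2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma1 @ sigma2, which are real and non-negative for
    # positive semi-definite inputs (clipping guards against round-off).
    eigvals = np.linalg.eigvals(sigma1 @ sigma2).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

identity = np.eye(3)
same = frechet_distance(np.zeros(3), identity, np.zeros(3), identity)
shifted = frechet_distance(np.zeros(3), identity, np.array([1.0, 2.0, 2.0]), identity)
```

Identical distributions score zero, and with equal covariances the distance reduces to the squared distance between the means. Lower is better, which is why a smaller FVD indicates generated videos whose feature statistics sit closer to those of real videos.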

The authors evaluate text alignment using [CLIP](/wiki/clip)-based metrics including [CLIP Score](/wiki/clip_score). Like all video-generation evaluations of the period, the results are heavily dependent on the prompt set used, and the paper acknowledges that automated metrics correlate imperfectly with human judgments of motion realism[^1][^11].

## Comparison with contemporaneous systems

The table below summarizes Lumiere alongside other publicly described text-to-video systems available around the time of its release. Specifications are taken from each system's own documentation or paper.

| System | Developer | Approach | Output | Public access |
|---|---|---|---|---|
| Lumiere | Google Research, Weizmann, Tel Aviv U., Technion | Single-pass STUNet plus MultiDiffusion SSR | 80 frames at 16 fps (5 s), 1024×1024[^1][^3] | Research only, no public weights[^7][^9] |
| Imagen Video | Google Research | Cascade of 7 [diffusion](/wiki/diffusion_model) models with temporal super-resolution | 128 frames at 24 fps, 1280×768[^2] | Research only |
| Sora (original) | OpenAI | Spacetime patch transformer over latent video tokens | Up to 60 s clips, up to 1080p[^10] | Initially research preview; later product (Sora 2)[^10] |
| Veo | [Google DeepMind](/wiki/google_deepmind) | Latent video [diffusion](/wiki/diffusion_model) with [Transformer](/wiki/transformer) backbones | "Over a minute" at 1080p (Veo 1)[^14] | API/product (later [Veo 2](/wiki/veo_2), [Veo 3](/wiki/veo_3))[^14] |
| Runway Gen-2 / Gen-3 | Runway | Proprietary latent video [diffusion](/wiki/diffusion_model) | Up to ~16 s; up to 1280×768[^15] | Commercial product[^15] |
| Pika 1.0 | Pika Labs | Proprietary [diffusion](/wiki/diffusion_model) | Short clips around 3 s | Commercial product |
| Stable Video Diffusion | Stability AI | Image-to-video latent [diffusion](/wiki/diffusion_model) | 14 or 25 frames at 1024×576 | Open weights |
| AnimateDiff | Open source community | Plug-in temporal module for image [diffusion](/wiki/diffusion_model) checkpoints | 16 frames typical | Open source |
| ZeroScope | Open source community | Open text-to-video [diffusion](/wiki/diffusion_model) | 36 frames typical | Open source |

The distinguishing architectural feature of Lumiere relative to Imagen Video, AnimateDiff, ZeroScope, and Stable Video Diffusion is the unified space-time U-Net with explicit temporal down/up-sampling. Sora, announced shortly after Lumiere, similarly avoids a cascaded keyframe-plus-interpolation pipeline by operating on a joint latent representation of patches in space-time; the two systems are spiritual cousins in this respect even though they differ in backbone (U-Net vs. Diffusion Transformer)[^1][^10]. Veo, Google DeepMind's later product line, integrated lessons from both Lumiere and from internal video research, although Google has not publicly stated which specific Lumiere components are reused in Veo[^14].

## Limitations

The paper itself acknowledges several limitations. First, Lumiere is "not designed to generate videos that consist of multiple shots, or that involve transitions between scenes"[^1][^7]. The model is trained on continuous 80-frame clips, so a single forward pass produces what looks like one camera shot. Multi-shot films would require either extending the architecture (longer clips, or scene-graph conditioning) or stitching multiple Lumiere clips together with a separate transition model.

Second, the five-second clip length is fixed by the training distribution. The 80-frame design was chosen to match a typical shot length in media production and to fit within a reasonable computational budget given full temporal attention at the coarsest U-Net level. Generating longer clips is not directly supported by the published architecture, although the SSR's overlapping-window approach hints at how a temporal extrapolation pipeline could be built[^1].

Third, the lowest spatial resolution at which STUNet operates (128×128) constrains the level of fine detail it can plan for. The SSR network adds plausible high-frequency detail, but it does not have the global temporal context that the base STUNet has, so very fine textures can flicker frame-to-frame in cases where the base model has not committed to a specific high-frequency pattern[^1][^11].

Fourth, the compute cost of training on 30 million videos and running inference with full temporal attention at the coarsest level remains substantial. Independent reproductions of Lumiere on smaller datasets have not, to date, matched the demo quality of the original, although the paper's detailed description of the architecture has fed downstream open-source work on space-time U-Net design[^11][^13].

Fifth, like all large [diffusion](/wiki/diffusion_model) systems trained on web-scale video data, Lumiere inherits the dataset's biases. The project page and paper acknowledge this in their ethics and impact statements[^1][^3].

## Ethics, safety, and release decision

In the paper's "Societal Impact" section, the authors write that the main goal of work like Lumiere is to "enable novice users to generate visual content in a creative and flexible way," but they note an attendant "risk of misuse for creating fake or harmful content with our technology," and they say it is "crucial to develop and apply tools for detecting biases and malicious use cases"[^1][^3][^11]. The project page reiterates this language and stops short of providing weights, an API, or a hosted demo[^3].

The decision not to release model weights or a public demo was widely reported at launch. Footage Secrets, CineD, and TechRadar all flagged the absence of any access path[^9][^16]. Since Lumiere, Google has channeled its public-facing video generation through the [Veo](/wiki/veo) product line (delivered through [Google DeepMind](/wiki/google_deepmind)'s consumer surfaces), with content provenance controls including [SynthID](/wiki/synthid)-style invisible [AI watermarking](/wiki/ai_watermarking) applied to generated media[^14]. The Lumiere paper does not itself adopt a particular watermarking method, but the broader Google product strategy of marking AI-generated video for [deepfake](/wiki/deepfake) mitigation is consistent with the ethics statement's call for detection tools[^3].

## Significance and influence

Lumiere's significance is largely conceptual. The paper publicly articulated, with strong demo evidence, that the dominant cascaded design for text-to-video generation was a workable but suboptimal compromise, and that a model designed from the start to reason jointly over space and time could produce more coherent motion at five-second clip lengths[^1][^6][^7]. This argument prefigures the design philosophy of [Sora](/wiki/sora) and a number of academic follow-ups, which similarly process whole clips as joint space-time tensors instead of in stages[^10].

The architectural innovations of Lumiere, in particular the factorized space-time convolutions in shallow U-Net levels, temporal attention restricted to the bottleneck, nearest-neighbor initialization of temporal pooling modules, and MultiDiffusion-style aggregation of overlapping windows for SSR, have informed open-source video [diffusion](/wiki/diffusion_model) work, even though the Lumiere weights themselves were never released[^13]. The decision to reuse a frozen pretrained text-to-image [U-Net](/wiki/unet) and train only the new temporal layers also reinforces a pattern (already familiar from AnimateDiff and Stable Video Diffusion) of treating video models as temporal adapters on top of image foundation models, which is now standard practice for academic-budget video research[^6][^13].

For Google specifically, Lumiere sits between the earlier cascaded Imagen Video (which never shipped as a product) and the later, more product-focused Veo line ([Veo](/wiki/veo), [Veo 2](/wiki/veo_2), [Veo 3](/wiki/veo_3)) that powers consumer-facing surfaces[^14]. The company has not publicly broken down the architectural lineage from Lumiere to Veo, so any direct reuse of Lumiere components is unconfirmed; what can be said is that the Veo generations continue the same push toward coherent motion while moving to higher resolutions, longer clips, and audio[^14].

## Related work

Lumiere stands in a lineage of [AI video generation](/wiki/ai_video_generation) systems based on denoising [diffusion](/wiki/diffusion_model). Earlier text-conditional video [diffusion](/wiki/diffusion_model) work that the paper directly compares against, and that influenced its design, includes:

- [Imagen](/wiki/imagen) Video, Google Research, 2022: the closest direct ancestor; introduced cascaded video [diffusion](/wiki/diffusion_model) from Imagen and motivated the cascaded-versus-unified comparison[^2].
- Make-A-Video, Meta, 2022: text-to-video using pretrained text-to-image priors and pseudo-3D temporal layers.
- Phenaki, Google, 2022: longer-form video generation via token-level autoregression.
- Stable Video Diffusion, Stability AI, 2023: open weights image-to-video [latent diffusion model](/wiki/latent_diffusion).
- AnimateDiff, 2023: plug-in temporal module that retrofits image [diffusion](/wiki/diffusion_model) checkpoints into short video generators.
- ZeroScope, 2023: open text-to-video [diffusion](/wiki/diffusion_model) used as a baseline in the Lumiere user study.

After Lumiere, the field moved to longer clips, higher resolutions, and (for some systems) joint video-and-audio generation. Notable contemporaneous or later systems include:

- [Sora](/wiki/sora) (OpenAI, February 2024): joint space-time patch transformer on latents; longer clips and higher resolutions, but similar in its rejection of cascaded keyframe interpolation[^10].
- [Veo](/wiki/veo) and its successors [Veo 2](/wiki/veo_2) and [Veo 3](/wiki/veo_3) (Google DeepMind, 2024 onward): production-grade text-to-video including audio[^14].
- [Runway Gen-3 Alpha](/wiki/runway_gen_3) and Runway Gen-4: commercial latent video [diffusion](/wiki/diffusion_model) systems[^15].
- [Pika](/wiki/pika) and Pika 2.5: commercial text-to-video with image-to-video and editing features.
- [Kling](/wiki/kling) and Kling 2.1: Chinese-developed long-form video generation product.
- [CogVideoX](/wiki/cogvideo) (Tsinghua University and Zhipu AI): open-weights latent video [diffusion](/wiki/diffusion_model) with [DiT](/wiki/diffusion_transformer)-style backbones.
- [HunyuanVideo](/wiki/hunyuan_video) (Tencent): open-weights text-to-video model with [Diffusion Transformer](/wiki/diffusion_transformer) backbone.

## See also

- [AI video generation](/wiki/ai_video_generation)
- [Text-to-video generation](/wiki/text_to_video)
- [Diffusion model](/wiki/diffusion_model)
- [Diffusion models](/wiki/diffusion_models)
- [Latent diffusion model](/wiki/latent_diffusion)
- [U-Net](/wiki/unet)
- [DDPM](/wiki/ddpm)
- [DDIM](/wiki/ddim)
- [Imagen](/wiki/imagen)
- [Sora](/wiki/sora)
- [Veo](/wiki/veo)
- [Veo 2](/wiki/veo_2)
- [Veo 3](/wiki/veo_3)
- [Runway Gen-3 Alpha](/wiki/runway_gen_3)
- [Pika](/wiki/pika)
- [Kling](/wiki/kling)
- [CogVideoX](/wiki/cogvideo)
- [HunyuanVideo](/wiki/hunyuan_video)
- [Text-to-Image Models](/wiki/text-to-image_models)
- [Diffusion Transformer (DiT)](/wiki/diffusion_transformer)
- [Google DeepMind](/wiki/google_deepmind)
- [Google Brain](/wiki/google_brain)
- [Denoising](/wiki/denoising)
- [CLIP](/wiki/clip)
- [CLIP Score](/wiki/clip_score)
- [Fréchet Inception Distance](/wiki/frechet_inception_distance)
- [Deepfake](/wiki/deepfake)
- [SynthID](/wiki/synthid)
- [AI watermarking](/wiki/ai_watermarking)

## References

[^1]: Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, and Inbar Mosseri, "Lumiere: A Space-Time Diffusion Model for Video Generation," arXiv preprint 2401.12945, 2024-01-23 (v1) and 2024-02-05 (v2). https://arxiv.org/abs/2401.12945. Accessed 2026-05-20.

[^2]: Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans, "Imagen Video: High Definition Video Generation with Diffusion Models," arXiv preprint 2210.02303, 2022-10-05. https://arxiv.org/abs/2210.02303. Accessed 2026-05-20.

[^3]: Lumiere project page, "Lumiere: A Space-Time Diffusion Model for Video Generation," Google Research / Weizmann Institute / Tel Aviv University / Technion, 2024. https://lumiere-video.github.io/. Accessed 2026-05-20.

[^4]: ACM Digital Library, "Lumiere: A Space-Time Diffusion Model for Video Generation," SIGGRAPH Asia 2024 Conference Papers, DOI 10.1145/3680528.3687614, December 2024. https://dl.acm.org/doi/10.1145/3680528.3687614. Accessed 2026-05-20.

[^5]: Google Research, "Lumiere: Realistic AI video generation with diffusion models," Google Research blog, 2024-01-25. https://research.google/blog/. Accessed 2026-05-20.

[^6]: The Batch, "Lumiere: A System That Achieves Unprecedented Motion Realism in Video," DeepLearning.AI, 2024-01-31. https://www.deeplearning.ai/the-batch/lumiere-a-system-that-achieves-unprecedented-motion-realism-in-video/. Accessed 2026-05-20.

[^7]: Shubham Sharma, "Google shows off Lumiere, a space-time diffusion model for realistic AI videos," VentureBeat, 2024-01-25. https://venturebeat.com/ai/google-shows-off-lumiere-a-space-time-diffusion-model-for-realistic-ai-videos/. Accessed 2026-05-20.

[^8]: Maximilian Schreiner, "Lumiere: Google shows new generative AI for realistic videos," The Decoder, 2024-01-25. https://the-decoder.com/lumiere-google-shows-new-generative-ai-for-realistic-videos/. Accessed 2026-05-20.

[^9]: Mascha Deikova, "Google Lumiere Introduced - Another Attempt at AI Video Generation," CineD, 2024-01-26. https://www.cined.com/google-lumiere-introduced-another-attempt-at-ai-video-generation/. Accessed 2026-05-20.

[^10]: OpenAI, "Video generation models as world simulators," OpenAI Research, 2024-02-15. https://openai.com/research/video-generation-models-as-world-simulators. Accessed 2026-05-20.

[^11]: Bar-Tal et al., "Lumiere: A Space-Time Diffusion Model for Video Generation," arXiv HTML version, sections on Space-Time U-Net, MultiDiffusion SSR, applications, and ablations, 2024-02-05. https://arxiv.org/html/2401.12945v2. Accessed 2026-05-20.

[^12]: ACM Digital Library full HTML, "Lumiere: A Space-Time Diffusion Model for Video Generation," SIGGRAPH Asia 2024. https://dl.acm.org/doi/fullHtml/10.1145/3680528.3687614. Accessed 2026-05-20.

[^13]: S. Sankar (AI Bites), "Lumiere - The most promising Text-to-Video model yet from Google," ai-bites.net, 2024. https://www.ai-bites.net/lumiere-the-most-promising-text-to-video-model-yet-from-google/. Accessed 2026-05-20.

[^14]: Google DeepMind, "Veo: Our most capable generative video model," Google DeepMind blog, 2024-05-14. https://deepmind.google/technologies/veo/. Accessed 2026-05-20.

[^15]: Runway, "Introducing Gen-3 Alpha," Runway research blog, 2024-06-17. https://runwayml.com/research/introducing-gen-3-alpha. Accessed 2026-05-20.

[^16]: Lance Ulanoff, "Google's impressive Lumiere shows us the future of making short-form AI videos," TechRadar, 2024-01-25. https://www.techradar.com/computing/artificial-intelligence/googles-impressive-lumiere-shows-us-the-future-of-making-short-form-ai-videos. Accessed 2026-05-20.