Stable Video Diffusion
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,075 words
Stable Video Diffusion (SVD) is a latent video diffusion model released by Stability AI on 21 November 2023, accompanied by a technical report titled "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets" by Andreas Blattmann, Tim Dockhorn, Sumith Kulal, and collaborators (arXiv:2311.15127).[^1][^2] The first iteration shipped two image-to-video checkpoints, SVD (14 frames) and SVD-XT (25 frames), each producing 576x1024 clips at user-selectable frame rates between 3 and 30 frames per second.[^1][^3] Architecturally, SVD reuses the spatial backbone of Stable Diffusion 2.1 and inserts temporal convolution and attention layers after every spatial layer, then trains the resulting video U-Net through a three-stage recipe of image pretraining, large-scale video pretraining on a curated dataset called LVD-F, and a high-quality finetuning stage on a small premium subset.[^2][^3] SVD was Stability AI's first open-weights video model and became the foundation for follow-up systems including SVD-XT 1.1 (February 2024), Stable Video 3D (March 2024), Stable Video 4D (July 2024), and the commercial Stable Video API.[^4][^5][^6][^7]
| Attribute | Value |
|---|---|
| Developer | Stability AI |
| Initial release | 21 November 2023 (research preview, weights on Hugging Face) |
| Paper | Blattmann et al., arXiv:2311.15127, 25 November 2023 |
| Modality | Image-to-video latent diffusion |
| Resolution | 576x1024 (base/XT); 576x576 (multi-view fine-tune) |
| Frame count | 14 (SVD), 25 (SVD-XT, SVD-XT 1.1) |
| Frame-rate range | 3 to 30 fps (configurable via fps_id) |
| Spatial backbone | Stable Diffusion 2.1 U-Net with added temporal layers |
| Training compute | ~200,000 A100 80GB GPU-hours |
| License (research preview) | Stable Video Diffusion Research / Community License |
| Commercial access | Stability AI Membership and Developer Platform API |
| Repository | github.com/Stability-AI/generative-models |
By late 2023 the open-weights image generation ecosystem had matured around Stable Diffusion 1.5 and SDXL, while video generation remained dominated by closed commercial systems. Runway had launched Gen-2 in March 2023, the first publicly accessible text-to-video and image-to-video generator from a major lab, and Pika Labs followed with Pika 1.0 on 28 November 2023.[^8][^9] Plug-in motion modules from the open community, notably AnimateDiff (Guo et al., arXiv:2307.04725, July 2023), had shown that personalized image models could be coaxed into producing short looping clips, but they did not match the temporal coherence or resolution of closed video systems.[^10] Earlier work on Video LDM (Blattmann et al., CVPR 2023, "Align your Latents"), which shares authors with the SVD report, had already demonstrated that a frozen image diffusion U-Net could be temporally extended, but no large-scale video model with open weights and a published training recipe existed prior to Stable Video Diffusion.[^2]
Stability AI announced Stable Video Diffusion on 21 November 2023, framing the release as a research preview intended to study generative video models rather than as a consumer product.[^1] The accompanying preprint, posted to arXiv on 25 November 2023, identifies three distinct training stages for video latent diffusion models and provides what the authors call a "systematic curation process" for assembling and filtering a large video pretraining corpus, addressing a gap in prior literature that had described training procedures only at a high level.[^2] The release included weights for two image-to-video checkpoints on Hugging Face (stabilityai/stable-video-diffusion-img2vid and stabilityai/stable-video-diffusion-img2vid-xt), reference inference code in the Stability-AI/generative-models GitHub repository, and a Streamlit demo.[^3][^11]
The November 2023 release was governed by a non-commercial research license, so commercial deployment was not permitted at launch. Stability AI later introduced a Stable Video Diffusion Community License, which allows free commercial use below a revenue threshold and requires an enterprise membership above it, and made a hosted API available on the Stability AI Developer Platform.[^4][^7][^12]
The Stable Video Diffusion U-Net is constructed by taking the spatial backbone of Stable Diffusion 2.1, a latent image diffusion model trained at 768x768 with an OpenCLIP-ViT/H text encoder, and inserting temporal convolution and temporal attention layers after every existing spatial convolution and spatial attention block.[^2][^3] During the temporal extension the spatial weights are initialized from the pretrained image model, and the new temporal layers are initialized so that the augmented network reproduces the image model exactly at the start of training. The U-Net therefore operates on five-dimensional latents shaped (batch, time, channel, height, width), with spatial layers processing each frame independently and temporal layers mixing information across time.[^2]
The model uses the same Stable Diffusion 2.1 first-stage autoencoder for the spatial latent representation (8x spatial downsampling, four-channel latents). Stability AI additionally finetuned the decoder of this autoencoder, the widely used f8-decoder, for temporal consistency, which reduces flicker during the latent-to-pixel decoding step.[^3] The standard frame-wise decoder is also released for use cases where temporal consistency in the decoder is not critical.
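The shape bookkeeping implied by the five-dimensional latents and the 8x, four-channel autoencoder can be sketched in a few lines. This is a minimal illustration, not code from the SVD repository: the helper names are hypothetical, and the way frames are folded into the batch axis for spatial layers (and spatial positions into the batch axis for temporal attention) is a common implementation pattern assumed here. Shapes are shown at the U-Net input; channel counts and spatial sizes change inside the network.

```python
from typing import NamedTuple, Tuple


class LatentShape(NamedTuple):
    batch: int
    time: int
    channels: int
    height: int
    width: int


def video_latent_shape(batch: int, frames: int, height_px: int, width_px: int,
                       downsample: int = 8, latent_channels: int = 4) -> LatentShape:
    """Latent shape for a clip, given the autoencoder's downsampling factor."""
    if height_px % downsample or width_px % downsample:
        raise ValueError("pixel resolution must be divisible by the downsampling factor")
    return LatentShape(batch, frames, latent_channels,
                       height_px // downsample, width_px // downsample)


def spatial_layer_view(s: LatentShape) -> Tuple[int, int, int, int]:
    """Spatial layers see each frame as an independent image: (B*T, C, H, W)."""
    return (s.batch * s.time, s.channels, s.height, s.width)


def temporal_attention_view(s: LatentShape) -> Tuple[int, int, int]:
    """Temporal attention mixes frames at each spatial location: (B*H*W, T, C)."""
    return (s.batch * s.height * s.width, s.time, s.channels)


svd = video_latent_shape(batch=1, frames=14, height_px=576, width_px=1024)
svd_xt = video_latent_shape(batch=1, frames=25, height_px=576, width_px=1024)

print(svd)                              # 14 frames of 4 x 72 x 128 latents
print(spatial_layer_view(svd))          # frames folded into the batch axis
print(temporal_attention_view(svd_xt))  # one length-25 sequence per latent pixel
```

At 576x1024 each frame becomes a 4 x 72 x 128 latent, so a 25-frame SVD-XT clip gives temporal attention 9,216 sequences of length 25 at the input resolution.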
For conditioning, the image input is encoded with a CLIP image encoder and supplied through the cross-attention layers, replacing the text-conditioning pathway of the original image model. SVD does not accept text prompts; it is a pure image-to-video model.[^3][^11]
A distinctive feature of SVD is its use of two scalar micro-conditioning signals supplied alongside the noisy latent at every denoising step. The first, fps_id, encodes the target frame rate of the output clip and is bucketed during training so that the model learns the relationship between motion magnitude and inter-frame interval.[^2][^11] The second, motion_bucket_id, encodes how much motion to synthesize: at inference time higher values produce more vigorous motion, while lower values produce slower or near-static clips.[^11][^13] Both signals are exposed to users at inference and allow rough control over the dynamics of the generated clip without retraining.
A third inference-time parameter, noise_aug_strength, controls how much noise is added to the conditioning image before it is encoded; higher values reduce the model's tendency to copy the input frame verbatim and tend to increase motion, at the cost of input fidelity.[^11]
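One way such scalar signals can be turned into network inputs is sketched below. It assumes each scalar is passed through a sinusoidal embedding, as is done for diffusion timesteps, and that the results are concatenated. The function names and the embedding width of 8 are illustrative assumptions, not values taken from the SVD code.

```python
import math
from typing import List


def sinusoidal_embedding(value: float, dim: int = 8, max_period: float = 10000.0) -> List[float]:
    """Map one scalar to `dim` features using sin/cos at geometrically spaced frequencies."""
    if dim % 2:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


def micro_conditioning_vector(fps_id: float, motion_bucket_id: float,
                              noise_aug_strength: float, dim: int = 8) -> List[float]:
    """Embed each conditioning scalar separately and concatenate the results."""
    vec: List[float] = []
    for scalar in (fps_id, motion_bucket_id, noise_aug_strength):
        vec.extend(sinusoidal_embedding(float(scalar), dim))
    return vec


cond = micro_conditioning_vector(fps_id=6, motion_bucket_id=127, noise_aug_strength=0.02)
print(len(cond))  # 3 scalars x 8 features
```

In a full model a vector like this would typically be projected by a small MLP and combined with the timestep embedding, which is how the same conditioning reaches every denoising step.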
Training of the temporal model adopts the EDM framework of Karras et al. (2022), which parameterizes the diffusion process in continuous noise levels and applies preconditioning to the network inputs and outputs.[^2] The paper notes that for high-resolution video the noise schedule must be shifted toward higher noise levels relative to the schedule used for image training, because a larger spatial extent leaves more recoverable signal at any given noise level, so more noise is needed to corrupt the input to the same degree. This shifted schedule is applied during the high-resolution finetuning stage and is reported to be important for stable convergence at 576x1024.[^2] See EDM (Elucidating Diffusion Models) for the underlying framework.
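The two EDM ingredients mentioned here, preconditioning and a log-normal distribution over training noise levels, can be sketched as follows. The preconditioning coefficients and the defaults `sigma_data=0.5`, `p_mean=-1.2`, `p_std=1.2` follow Karras et al. (2022). The "shifted" values are illustrative placeholders for a schedule moved toward higher noise and should not be read as the exact settings used for SVD.

```python
import math
import random
import statistics
from typing import Dict, List


def edm_preconditioning(sigma: float, sigma_data: float = 0.5) -> Dict[str, float]:
    """Input/output scaling coefficients of the EDM denoiser parameterization."""
    denom = sigma ** 2 + sigma_data ** 2
    return {
        "c_skip": sigma_data ** 2 / denom,
        "c_out": sigma * sigma_data / math.sqrt(denom),
        "c_in": 1.0 / math.sqrt(denom),
        "c_noise": 0.25 * math.log(sigma),
    }


def sample_sigmas(p_mean: float, p_std: float, n: int, seed: int = 0) -> List[float]:
    """Draw training noise levels with log(sigma) ~ Normal(p_mean, p_std^2)."""
    rng = random.Random(seed)
    return [math.exp(rng.gauss(p_mean, p_std)) for _ in range(n)]


image_like = sample_sigmas(p_mean=-1.2, p_std=1.2, n=10_000)  # EDM image defaults
shifted = sample_sigmas(p_mean=0.7, p_std=1.6, n=10_000)      # hypothetical shifted schedule

print("median sigma, image-style schedule:", round(statistics.median(image_like), 3))
print("median sigma, shifted schedule:    ", round(statistics.median(shifted), 3))
print(edm_preconditioning(sigma=0.5))
```

Raising `p_mean` moves the bulk of sampled noise levels upward, so the network spends more of its training on heavily corrupted inputs.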
Stable Video Diffusion is trained in three sequential stages, each consuming a different data mixture and resolution.[^2]
Stage I, image pretraining. The model is initialized from the released Stable Diffusion 2.1 checkpoint, which provides a strong visual prior for natural images and serves as the spatial backbone. The paper notes that initializing video training from a well-tuned image model is a substantially stronger starting point than training from scratch.[^2]
Stage II, video pretraining. Temporal layers are inserted, and the model is trained on the curated LVD-F corpus at lower spatial resolution (256x384) and 14 frames per clip. The pretraining set starts from approximately 580 million annotated video clips produced by cut detection and automatic captioning, and is then filtered down to roughly 152 million examples that pass thresholds on optical flow magnitude (to remove static clips), aesthetic and CLIP-score quality, and OCR text density (to remove clips dominated by overlaid text).[^2][^14] Captions are generated automatically: the central frame is captioned with CoCa, the whole clip is captioned with V-BLIP, and a large language model fuses the two captions into a single summary description.[^2][^14]
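The filtering step can be pictured as a conjunction of per-clip threshold checks. The sketch below is a simplified illustration: the field names, the `Thresholds` values, and the sample clips are all hypothetical, and the actual LVD-F pipeline and cut-offs are described in the paper rather than reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ClipStats:
    clip_id: str
    mean_optical_flow: float  # average flow magnitude; near zero means a static clip
    aesthetic_score: float    # predicted visual quality
    clip_score: float         # caption-to-frame similarity
    ocr_text_area: float      # fraction of frame area covered by detected text


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical placeholder values, not the thresholds used to build LVD-F.
    min_flow: float = 2.0
    min_aesthetic: float = 4.5
    min_clip_score: float = 0.25
    max_text_area: float = 0.07


def keep_clip(c: ClipStats, t: Thresholds = Thresholds()) -> bool:
    """A clip survives only if it passes every filter."""
    return (
        c.mean_optical_flow >= t.min_flow
        and c.aesthetic_score >= t.min_aesthetic
        and c.clip_score >= t.min_clip_score
        and c.ocr_text_area <= t.max_text_area
    )


def curate(clips: Iterable[ClipStats], t: Thresholds = Thresholds()) -> List[ClipStats]:
    return [c for c in clips if keep_clip(c, t)]


clips = [
    ClipStats("pan_over_city", 6.1, 5.2, 0.31, 0.01),
    ClipStats("static_slide", 0.2, 5.0, 0.30, 0.02),  # fails the motion filter
    ClipStats("news_ticker", 4.0, 4.8, 0.28, 0.35),   # fails the text filter
    ClipStats("blurry_phone", 3.5, 3.1, 0.27, 0.00),  # fails the aesthetic filter
]
kept = curate(clips)
print([c.clip_id for c in kept])  # ['pan_over_city']
```

Going from about 580 million clips to about 152 million corresponds to keeping roughly 26% of the raw corpus.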
Stage III, high-quality video finetuning. The pretrained video model is finetuned on a much smaller, manually curated set of high-resolution and high-fidelity clips at 576x1024 with 14 frames, producing the released SVD checkpoint. SVD-XT is then produced by further finetuning the SVD weights to generate 25 frames at the same resolution.[^2][^3]
Stability AI reports that total training compute for the released checkpoints was approximately 200,000 A100 80GB GPU-hours, on clusters typically configured with 48 x 8 = 384 A100 GPUs, with energy consumption of roughly 64,000 kWh and estimated CO2 emissions of about 19,000 kg CO2 equivalent.[^3]
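A quick back-of-the-envelope check puts these figures in perspective. The calculation assumes, purely for illustration, that the whole budget ran continuously on a single 384-GPU cluster. The derived quantities are arithmetic consequences of the quoted numbers, not values reported by Stability AI.

```python
GPU_HOURS = 200_000   # quoted A100 80GB GPU-hours
ENERGY_KWH = 64_000   # quoted energy consumption
CO2_KG = 19_000       # quoted emissions, kg CO2 equivalent
GPUS = 48 * 8         # assumption: one cluster of 384 GPUs used throughout

kwh_per_gpu_hour = ENERGY_KWH / GPU_HOURS  # implied average draw per GPU, in kW
kg_co2_per_kwh = CO2_KG / ENERGY_KWH       # implied carbon intensity of the power used
wall_clock_days = GPU_HOURS / GPUS / 24    # continuous days on the assumed cluster

print(f"{kwh_per_gpu_hour:.2f} kWh per GPU-hour")
print(f"{kg_co2_per_kwh:.3f} kg CO2e per kWh")
print(f"{wall_clock_days:.1f} days of wall-clock time on {GPUS} GPUs")
```

Under these assumptions the figures imply an average draw of about 320 W per GPU and roughly three weeks of continuous cluster time.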
Beyond core image-to-video generation, the paper shows that the SVD weights provide a useful initialization for several downstream video and 3D tasks. Stability AI trained a family of camera-motion LoRA adapters on subsets of LVD-F labeled by motion type (horizontal pan, zoom, static, etc.), enabling post-hoc camera-trajectory control with only a few million trainable parameters.[^2] The same backbone, finetuned on Objaverse and MVImgNet multi-view object data, produces SVD-MV, a multi-view diffusion model the paper reports outperforms image-based novel-view-synthesis baselines (Zero123XL, SyncDreamer) at lower compute cost.[^2]
The Stable Video Diffusion family comprises several checkpoints released between November 2023 and mid-2024. SVD and SVD-XT appeared with the original research preview, SVD-XT 1.1 followed in February 2024 as a finetune tuned for a fixed operating point, and the SV3D and SV4D models extended the backbone to multi-view and 4D generation.
| Checkpoint | Release date | Resolution | Frames | Notable feature |
|---|---|---|---|---|
| SVD Image-to-Video | 21 Nov 2023 | 576x1024 | 14 | Base latent video diffusion model |
| SVD-XT Image-to-Video | 21 Nov 2023 | 576x1024 | 25 | Finetuned from SVD for longer clips |
| SVD-XT 1.1 | Feb 2024 | 576x1024 | 25 | Fixed conditioning at 6 fps, motion_bucket_id 127 |
| SV3D_u | 18 Mar 2024 | 576x576 | 21 | Orbital novel-view synthesis |
| SV3D_p | 18 Mar 2024 | 576x576 | 21 | Camera-path conditioned multi-view |
| SV4D | 24 Jul 2024 | 576x576 | 5 frames x 8 views | Video-to-multi-view-video |
The two checkpoints in the original release share the same architecture and training pipeline but differ in clip length. The base SVD model generates 14 frames at 576x1024, while SVD-XT is a continuation finetune of SVD that produces 25 frames at the same resolution.[^3][^11] On an NVIDIA A100 80GB GPU, Stability AI quotes inference times of approximately 100 seconds per clip for SVD and 180 seconds for SVD-XT.[^3] Both models were distributed under a non-commercial research-preview license at first release and later moved to the Stable Video Diffusion Community License framework.[^1][^3]
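Because each checkpoint emits a fixed number of frames, playback length depends only on the chosen frame rate, and the quoted inference times can be normalized per frame. The short sketch below works through that arithmetic using the figures above. The helper names are illustrative.

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Playback length of a fixed-length clip at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


def seconds_per_generated_frame(inference_seconds: float, num_frames: int) -> float:
    """Quoted inference time divided by the number of frames produced."""
    return inference_seconds / num_frames


# checkpoint name -> (frames per clip, quoted A100 80GB inference seconds)
CHECKPOINTS = {"SVD": (14, 100.0), "SVD-XT": (25, 180.0)}

for name, (frames, infer_s) in CHECKPOINTS.items():
    durations = {fps: round(clip_duration_seconds(frames, fps), 2) for fps in (3, 6, 30)}
    per_frame = round(seconds_per_generated_frame(infer_s, frames), 2)
    print(name, "durations by fps:", durations, "| inference s/frame:", per_frame)
```

At 6 fps the 25-frame SVD-XT clip lasts a little over four seconds, and both checkpoints cost roughly seven seconds of A100 time per generated frame, which suggests inference time scales about linearly with clip length.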
On 1 February 2024, Stability AI CTO Tom Mason announced SVD 1.1 (officially stable-video-diffusion-img2vid-xt-1-1), a finetune of SVD-XT explicitly optimized for consistency rather than flexibility.[^4][^5] The 1.1 checkpoint was finetuned with fixed micro-conditioning at 6 fps and motion_bucket_id 127, which Stability AI states improves output consistency at the cost of being tuned for a narrower operating point. The values remain user-adjustable at inference, but quality is best at the trained defaults.[^5] The same release introduced the Stability AI Membership commercial-licensing scheme, which lets users with annual revenue below a threshold (initially $1 million) deploy SVD 1.1 commercially under the community license while requiring an enterprise agreement for larger organizations.[^4][^5]
The diffusers library exposes the SVD checkpoints under the StableVideoDiffusionPipeline interface, which accepts a PIL image plus the micro-conditioning scalars and returns a list of decoded frames.[^11] Both the temporally consistent f8-decoder and a frame-wise decoder are shipped as separate weights; users can swap them at inference time depending on whether temporal stability of fine texture or per-frame fidelity is preferred.[^3]
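A minimal usage sketch for this pipeline is shown below. It assumes a CUDA-capable machine with `torch` and `diffusers` installed and access to the `stabilityai/stable-video-diffusion-img2vid-xt` weights. Note that the pipeline's frame-rate argument is named `fps` rather than `fps_id`. The helper `svd_call_kwargs`, its range checks, and the file paths are illustrative additions, not part of the library.

```python
from typing import Any, Dict


def svd_call_kwargs(fps: int = 7, motion_bucket_id: int = 127,
                    noise_aug_strength: float = 0.02,
                    decode_chunk_size: int = 8) -> Dict[str, Any]:
    """Collect the conditioning arguments, with basic sanity checks (our own, not the library's)."""
    if not 3 <= fps <= 30:
        raise ValueError("fps is outside the 3-30 range quoted for SVD")
    if motion_bucket_id < 0:
        raise ValueError("motion_bucket_id must be non-negative")
    if noise_aug_strength < 0:
        raise ValueError("noise_aug_strength must be non-negative")
    return {
        "fps": fps,
        "motion_bucket_id": motion_bucket_id,
        "noise_aug_strength": noise_aug_strength,
        "decode_chunk_size": decode_chunk_size,  # frames decoded at once; lower saves memory
    }


def generate(image_path: str, out_path: str = "generated.mp4", **overrides: Any) -> str:
    """Run SVD-XT on one image and write an MP4. Needs a GPU and downloaded weights."""
    # Heavy imports stay local so the helper above works without a GPU stack.
    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import export_to_video, load_image

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        torch_dtype=torch.float16,
        variant="fp16",
    )
    pipe.enable_model_cpu_offload()  # trades speed for lower GPU memory use

    image = load_image(image_path).resize((1024, 576))  # PIL size is (width, height)
    kwargs = svd_call_kwargs(**overrides)
    frames = pipe(image, generator=torch.manual_seed(42), **kwargs).frames[0]
    export_to_video(frames, out_path, fps=kwargs["fps"])
    return out_path


print(svd_call_kwargs(motion_bucket_id=180, noise_aug_strength=0.1))
```

Calling `generate("input.png", motion_bucket_id=180)` would request more vigorous motion than the default. The path `input.png` is a placeholder.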
On 18 March 2024 Stability AI released Stable Video 3D (SV3D), a model that adapts the SVD image-to-video formulation to multi-view 3D generation by treating the "video" axis as a camera trajectory around a single object.[^6] SV3D ships in two variants: SV3D_u generates orbital videos around an object from a single image without explicit camera input, while SV3D_p additionally accepts a target camera trajectory.[^6] Both variants generate 21 frames at 576x576 from a single context image and use the SVD backbone as initialization, retraining the temporal layers on multi-view object data. SV3D is licensed under the Stability AI Community License for non-commercial use and requires a Stability AI Membership for commercial deployment.[^6]
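Treating the frame axis as a camera path can be illustrated with a simple orbit. The sketch assumes a fixed elevation and evenly spaced azimuths over a full turn, one pose per output frame. The function names, the default elevation, and the pose format are hypothetical and do not reflect the actual SV3D conditioning interface.

```python
from typing import List, Tuple


def orbit_azimuths(num_views: int = 21, start_deg: float = 0.0) -> List[float]:
    """Evenly spaced azimuth angles covering one full turn around the object."""
    if num_views <= 0:
        raise ValueError("num_views must be positive")
    step = 360.0 / num_views
    return [(start_deg + i * step) % 360.0 for i in range(num_views)]


def orbit_poses(num_views: int = 21, elevation_deg: float = 10.0) -> List[Tuple[float, float]]:
    """(elevation, azimuth) pairs for a static-elevation orbit, one per frame."""
    return [(elevation_deg, az) for az in orbit_azimuths(num_views)]


poses = orbit_poses()
print(len(poses), "poses, azimuth step of", round(360.0 / 21, 2), "degrees")
print(poses[:3])
```

A fixed orbit of this kind corresponds to the unconditioned setting, while a variant that accepts a target trajectory would take a user-supplied list of poses instead.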
Stable Video 4D (SV4D), announced on 24 July 2024, extends SV3D to dynamic objects by accepting an input video and producing dynamic novel-view videos at eight viewpoints simultaneously.[^15] The model emits a spatio-temporal grid of five frames across eight novel camera views at 576x576 resolution, for a total of 40 frames per generation, and is described in the SV4D technical report (arXiv:2407.17470).[^15] Stability AI states SV4D combines the strengths of SVD (temporal modeling) and SV3D (multi-view modeling) and is finetuned on a curated dynamic 3D dataset. A successor, SV4D 2.0, was released subsequently with improved robustness to occlusion and large motion.[^16]
In addition to weights, Stability AI made Stable Video Diffusion available as a hosted API on the Stability AI Developer Platform. The API generates about 2 seconds of MP4 video at 24 fps, comprising the 25 model-generated frames and 24 additional FILM-interpolated frames, supports motion-strength controls and multiple aspect ratios, and includes safety filters and watermarking.[^7][^17] The hosted service uses SVD-XT and its successors and is targeted at commercial use cases in advertising, marketing, film, and gaming.[^7]
The combination of openly distributed weights, a published training recipe, and a license that allowed free (if non-commercial) use led to broad adoption of SVD in the open-source video-generation community in late 2023 and 2024. Within days of release the checkpoints were integrated into the ComfyUI node graph framework and the Hugging Face diffusers library, enabling local image-to-video generation on consumer GPUs with appropriate quantization.[^11][^18] The model became a common starting point for academic work on video diffusion, video super-resolution, and 3D generation, and for downstream finetunes targeting specific domains such as human-image animation, character motion, and product visualization. SVD-XT 1.1 in particular became a workhorse for short looping marketing content because its fixed conditioning settings make output quality more predictable.[^5]
By mid-2024 the open video-model landscape had begun to expand around SVD with models that were either trained from scratch or built on top of SVD weights. Open Sora, Open-Sora-Plan, Mochi 1, HunyuanVideo, Wan 2.1, and NVIDIA Cosmos all entered the open-weights video space in 2024 and 2025, while commercial competitors Runway Gen-3, Runway Gen-4, Kling, Veo 3, and Sora 2 continued to push closed-model quality. Stable Video Diffusion's role within this ecosystem shifted from leading open model to a well-understood baseline and a source of pretrained weights for subsequent finetunes.
The Stable Video Diffusion paper reports user-preference evaluations against the strongest closed image-to-video systems available at the time of release, Runway Gen-2 and the Pika Labs generator of that period. Raters were shown clips generated by each system from the same conditioning image and asked to pick the more visually appealing result. The paper states that SVD was preferred over both closed baselines in those head-to-head comparisons.[^1][^2] The result was widely reported by secondary sources at launch.[^19]
| System | Developer | Release | Image-to-video | Text-to-video | Open weights |
|---|---|---|---|---|---|
| Stable Video Diffusion (SVD/SVD-XT) | Stability AI | Nov 2023 | Yes (576x1024, 14 or 25 frames) | No (no text input) | Yes (research/community license) |
| Runway Gen-2 | Runway | Mar 2023 | Yes | Yes | No |
| Pika 1.0 | Pika Labs | Nov 2023 | Yes | Yes | No |
| AnimateDiff | Guo et al. (academic) | Jul 2023 | Indirect (via personalized SD checkpoints) | Yes (via SD text prompts) | Yes (open code, motion module weights) |
Several differences are worth flagging. SVD is strictly image-to-video, while Runway Gen-2 and Pika 1.0 also accept text prompts; Pika and Runway provide hosted services with no open weights, while SVD's weights are downloadable.[^1][^8][^9] AnimateDiff is conceptually different: it is not a standalone video model but a motion module that attaches to existing personalized Stable Diffusion checkpoints and reuses their spatial weights to animate generated frames, exposing text conditioning through the underlying SD model.[^10] In terms of raw output specifications at the time of SVD's release, SVD's 14- or 25-frame clips at 576x1024 were comparable to Gen-2 and Pika 1.0 (both also produced short clips in the 1024-pixel range), but SVD's per-clip motion control via motion_bucket_id and its open weights gave researchers an extensible substrate that the closed competitors did not provide.
The released SVD checkpoints do not take text as input. Subsequent open video models including HunyuanVideo and Wan 2.1 reintroduced text conditioning at much higher quality, and by 2025 these models, along with closed systems such as Sora 2 and Veo 3, substantially exceeded SVD's quality on direct preference evaluations. Stable Video Diffusion's principal historical role is as the first open-weights latent video diffusion model with a documented data-curation and training recipe rather than as a current state-of-the-art system.
Several elements of the Stable Video Diffusion release have outlasted the model itself.
First, the three-stage training recipe (image pretraining, large-scale low-resolution video pretraining on curated data, high-resolution video finetuning on a small premium set) became a de facto template for subsequent open and closed video diffusion models. The paper's detailed account of cut detection, dual-captioning, optical-flow and aesthetic filtering, and OCR-based text removal gave practitioners a concrete pipeline that they could replicate or adapt to other corpora.[^2]
Second, the practice of feeding scalar micro-conditioning signals (frame rate, motion magnitude) directly into the U-Net at every denoising step provided a simple alternative to text-only conditioning for video. The motion_bucket_id knob in particular was adopted widely in downstream video diffusion systems as a low-cost way to expose motion intensity to users.[^11][^13]
Third, the EDM-framework noise schedule, shifted toward higher noise levels for high-resolution video, surfaced a practical observation that has since been re-derived in several lines of follow-up work on noise scheduling at high spatial resolution.[^2]
Fourth, SVD demonstrated that an image diffusion model's pretrained spatial weights can be used as the basis for high-quality video, multi-view, and (via SV4D) 4D generation. The same backbone supports image-to-video, novel-view synthesis, and dynamic 3D asset generation with comparatively small additional training, suggesting that learned visual priors transfer well across these tasks. This insight underpins subsequent video-3D systems including SV3D, SV4D, and external research on video diffusion as a general visual prior.[^6][^15]
Stability AI's own model cards enumerate substantial limitations of the released SVD checkpoints.[^3][^11] Generated clips are short (no more than approximately four seconds at 25 frames and 6 fps). The model is not perfectly photorealistic and can produce visible artifacts, especially on faces, hands, and small text. SVD has no text conditioning, so it cannot follow textual instructions and cannot render legible text in the output. It sometimes generates videos with very little motion (effectively static images) or only slow camera pans, particularly at low motion_bucket_id settings. The latent autoencoder is lossy, so very fine high-frequency detail in the input image can be smoothed out in the output. Stability AI also notes that the model is not intended to generate factual depictions of real people or events, and the released checkpoints include a default watermarking step via the imWatermark library.[^3][^11]
Researchers and external evaluators have additionally pointed out that the human-preference numbers in the paper compare SVD against the versions of closed systems available in mid-to-late 2023; later versions of Runway and Pika, along with new commercial entrants, substantially outperform the original SVD checkpoints on most subjective measures.[^19] Open follow-up models including HunyuanVideo, Wan, and Mochi 1 have also exceeded SVD on length, resolution, and text-conditionability.
Finally, the non-commercial nature of the research preview, and the revenue-threshold structure of the later Stable Video Diffusion Community License, place constraints on commercial use that differ from fully permissive open-source licenses such as Apache 2.0 or MIT. Commercial deployers above the revenue threshold must obtain a Stability AI Enterprise Agreement or use the hosted API; this is a more restrictive arrangement than that of some subsequent open video models, although less restrictive than the closed-only release model of Runway, Pika, Sora, and Veo.[^4][^7]