LTX-Video
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,210 words
LTX-Video is an open-source, transformer-based latent video diffusion model developed by the Israeli company Lightricks and first released to the public in November 2024.[1][2] The system is designed around a single architectural goal: integrating a high-compression 3D causal Video-VAE with a Diffusion Transformer (DiT) backbone so tightly that generation runs faster than real-time playback on a single NVIDIA H100 GPU.[3] The initial release is a 1.9-billion-parameter base model capable of producing five seconds of 24 fps video at 768×512 resolution in roughly two seconds on an H100, a speed unmatched by other public DiT-based text-to-video systems at the time.[3] Lightricks has continued to expand the family through a sequence of revisions (0.9, 0.9.1, 0.9.5, 0.9.6, and 0.9.7, the last of which introduced the 13-billion-parameter LTXV-13B in May 2025), and the model has become a common backbone in community pipelines distributed through Hugging Face and ComfyUI.[4][5][6]
The model belongs to a small cohort of fully open video foundation models that emerged in late 2024, alongside Genmo's Mochi 1, Tsinghua's CogVideoX, Tencent's HunyuanVideo, and the Open-Sora research effort.[7] Within that group LTX-Video is distinguished by its emphasis on inference latency rather than parameter count: the technical report attributes its speed advantage to relocating patchification from the transformer input into the Video-VAE encoder, producing an aggressive 1:192 compression ratio (32×32×8 pixels per token) and reducing the number of latent tokens the transformer must denoise.[3] The codebase is released on GitHub, and weights are distributed through the Lightricks Hugging Face organization under a custom OpenRAIL-style "LTX-Video Open Weights License", with the broader repository carrying an Apache 2.0 marker for the inference and training scaffolding code.[4][5]
| Attribute | Value |
|---|---|
| Developer | Lightricks (Jerusalem, Israel) |
| First public release | November 2024 (version 0.9) |
| Architecture | DiT-based latent video diffusion with 3D causal Video-VAE |
| Base parameters (initial) | approximately 1.9 billion |
| Later scales | 13B model (LTXV-13B), released May 6, 2025 |
| Text encoder | T5-XXL |
| VAE compression | 1:192 compression ratio (32×32 spatial and 8× temporal downsampling, 128 latent channels) |
| Reference resolution | 768×512, 24 fps (also supports 1216×704, 30 fps) |
| Reference speed | 5-second clip generated in roughly 2 seconds on a single NVIDIA H100 |
| Distribution | github.com/Lightricks/LTX-Video; huggingface.co/Lightricks/LTX-Video |
| License (weights) | Custom "LTX-Video Open Weights License" (OpenRAIL-derived); free for personal use; commercial use permitted with later updates |
| Code license | Apache 2.0 (inference scaffolding) |
| Integrations | Diffusers, ComfyUI-LTXVideo, fal.ai, Replicate, LTX Studio |
| Paper | "LTX-Video: Realtime Video Latent Diffusion", arXiv:2501.00103, December 30, 2024 |
Lightricks was founded in 2013 in Jerusalem and is best known for consumer creative-software products such as Facetune and Videoleap. The company moved into generative AI tooling in early 2024 with the announcement of LTX Studio, a web-based AI filmmaking environment intended to let users go from text prompts to storyboards, characters, and short scenes through a single interface. LTX Studio was unveiled on February 28, 2024 as an invite-only beta and later opened to general availability on August 20, 2024.[8][9] At launch, LTX Studio relied primarily on third-party generative components for video synthesis; the company described its long-term goal as building its own video foundation model that could be tuned for storyboard-style filmmaking workflows.[9]
On November 21, 2024, Lightricks publicly released the first version of its in-house video model under the name LTX Video (often stylized LTXV).[1][2] The model was published as version 0.9 on Hugging Face under the repository Lightricks/LTX-Video, with the inference code mirrored at github.com/Lightricks/LTX-Video.[4][5] Lightricks characterized the launch as the first real-time DiT-based video model: a 1.9-billion-parameter base capable of producing 24 fps clips at 768×512 resolution "faster than they can be watched", running on consumer-class hardware such as a single NVIDIA RTX 4090 in addition to data center GPUs.[1][2]
The first release supported both text-to-video and image-to-video generation, with the two modes trained jointly inside the same backbone rather than as separate fine-tunes.[3] Coverage at the time positioned LTX-Video as a direct open-source counterweight to OpenAI's then-unreleased Sora and to closed-source commercial models such as Runway Gen-3 Alpha, Kling, and Pika.[2][7]
On December 13, 2024, Lightricks and Shutterstock announced a long-term licensing agreement in which Lightricks would train future versions of LTXV on Shutterstock's HD and 4K stock-footage library.[10] The arrangement was the first deployment of what Shutterstock called a "research license", a tier intended to let model developers train on premium licensed footage at lower cost before committing to a full commercial license; Shutterstock's contributor revenue-share program covered the licensed data, and contributors retained an opt-out from AI training use.[10] The deal was specifically described as providing data for "future model iterations", which positioned LTX-Video 0.9 as a release that pre-dated the Shutterstock corpus and subsequent versions as the first generation trained against it.[10]
The accompanying technical paper, "LTX-Video: Realtime Video Latent Diffusion", was submitted to arXiv on December 30, 2024 as preprint 2501.00103.[3] The paper lists sixteen authors from Lightricks: Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi.[3] It is the canonical reference for the system's design and benchmark claims, including the 1:192 compression ratio, the relocation of patchification into the VAE, and head-to-head human preference numbers against CogVideoX, Open-Sora Plan, and PyramidFlow baselines.[3]
After the November 2024 launch, Lightricks published point releases on Hugging Face every one to three months.
| Version | Approximate release | Notable changes |
|---|---|---|
| 0.9 | November 2024 | Initial 2B base model, text-to-video and image-to-video |
| 0.9.1 | Late 2024 | Quality improvements, integration with spatiotemporal guidance and perturbed attention guidance (STG/PAG) workflows |
| 0.9.5 | March 2025 | Multi-keyframe conditioning, longer videos, commercial-use updates to the weights license |
| 0.9.6 | April 2025 | New ltxv-2b-0.9.6-dev checkpoint and ltxv-2b-0.9.6-distilled real-time variant requiring as few as 8 sampling steps; first-class video extension and keyframe animation |
| 0.9.7 | May 2025 onward | 13B parameter ltxv-13b-0.9.7 model and FP8 quantized variant for higher quality; introduces multi-scale rendering pipeline |
The version metadata above is drawn from the Hugging Face model card and the GitHub repository's README.[4][5][6] The 0.9.6 distilled checkpoint in particular is the one most commonly referenced for "real-time on a single 4090" claims because it accepts very short sampler schedules (around 8 steps) while preserving the multi-task capabilities of the base model.[11]
On May 6, 2025, Lightricks announced LTXV-13B, a 13-billion-parameter scaling of the LTX-Video architecture.[12][13] The headline claim was that LTXV-13B generated video roughly 30 times faster than other open video models of comparable quality, while still running on consumer GPUs rather than data-center hardware.[12] The release introduced multi-scale rendering, a pipeline that first drafts a video at lower spatial resolution to capture coarse motion and then refines it through dedicated spatial and temporal upsamplers, a technique enabled by the model's tightly compressed latent space.[12][13] Lightricks also announced commercial-use terms for the 13B model: companies with annual revenue under USD 10 million could use it without a license fee, with paid commercial licensing required above that threshold.[12]
The 13B release also formalized the multi-tier weight distribution that has since become standard for LTX-Video: a high-quality ltxv-13b development weight, a ltxv-13b-distilled weight (roughly fifteen times faster than the base 13B for short sampler schedules), a smaller ltxv-2b-distilled weight retained for low-VRAM use, and FP8-quantized variants of each that further reduce memory footprint.[4][12]
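As a rough illustration of why the FP8 variants matter for memory, the sketch below estimates the storage needed for the transformer weights alone. It assumes the unquantized weights are stored at 2 bytes per parameter (for example bfloat16) and the FP8 weights at 1 byte per parameter; the function name and labels are illustrative, and activations, the text encoder, and the VAE are deliberately left out.

```python
# Bytes needed to store one parameter at each precision (assumed storage formats).
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in decimal gigabytes, weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter counts quoted in this article: 13B and roughly 1.9B.
for label, params in [("ltxv-13b", 13e9), ("ltxv-2b", 1.9e9)]:
    full = weight_memory_gb(params, "bf16")
    quantized = weight_memory_gb(params, "fp8")
    print(f"{label}: {full:.1f} GB -> {quantized:.1f} GB")
```

Under these assumptions the 13B weights shrink from about 26 GB to about 13 GB. Actual VRAM use during inference will be higher because of activations and the other model components.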
LTX-Video is structured as a latent diffusion model in the lineage of Stable Diffusion and the DiT family, adapted to video.[3] The three principal components are a 3D causal Video-VAE, a Diffusion Transformer denoiser, and a T5-XXL text encoder.
What sets LTX-Video apart from earlier video diffusion systems is not the existence of these components but the way they are coupled: the system relocates the patchifying operation from the input of the denoising transformer into the encoder of the Video-VAE itself, so that the transformer operates directly on a 1:1 token-to-latent grid rather than re-tokenizing the VAE output.[3] This decision pushes more of the compression work onto the VAE and shrinks the number of tokens the transformer must process.
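To make the token savings concrete, the following sketch counts the latent tokens the transformer would see for one clip, using the strides quoted in this article (32×32 spatial, 8 temporal) and one token per latent position. It assumes the causal VAE encodes the first frame on its own, so a clip of 8n+1 frames yields n+1 latent frames; the function name is illustrative and not taken from the LTX-Video code.

```python
def latent_token_count(width, height, num_frames, spatial_stride=32, temporal_stride=8):
    """Count transformer tokens for one clip, one token per latent position."""
    if width % spatial_stride or height % spatial_stride:
        raise ValueError("width and height must be multiples of the spatial stride")
    if (num_frames - 1) % temporal_stride:
        raise ValueError("num_frames must be one more than a multiple of the temporal stride")
    latent_w = width // spatial_stride
    latent_h = height // spatial_stride
    # Assumption: the first frame gets its own latent frame (causal encoding).
    latent_t = (num_frames - 1) // temporal_stride + 1
    return latent_w * latent_h * latent_t


# A 768x512 clip of 121 frames (about five seconds at 24 fps).
print(latent_token_count(768, 512, 121))  # 24 * 16 * 16 = 6144
```

Because self-attention cost grows quadratically with sequence length, keeping a five-second clip to a few thousand tokens is a plausible reading of where the reported speed advantage comes from.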
The Video-VAE is the most heavily engineered component of LTX-Video. It is built from 3D causal convolutions and applies spatial downsampling of 32×32 in the image plane together with temporal downsampling of 8 frames per latent step, with 128 output channels per latent token.[3] The combined effect is a 1:192 ratio of input pixel values to latent values, several times more aggressive than the compression ratios used by comparable open video models such as PyramidFlow (1:96), CogVideoX (1:48), or HunyuanVideo (1:48).[3]
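The 1:192 figure follows from the strides and channel count given above. The short calculation below reproduces it, assuming three input values per pixel (RGB); the second call uses a hypothetical 8×8 spatial, 4× temporal, 16-channel configuration only to show how a 1:48 ratio can arise, and is not a claim about any specific model's settings.

```python
def compression_ratio(spatial_stride, temporal_stride, latent_channels, pixel_channels=3):
    """Number of input values represented by each latent value."""
    pixels_per_token = spatial_stride * spatial_stride * temporal_stride
    return pixels_per_token * pixel_channels / latent_channels


# LTX-Video: 32x32 spatial, 8 temporal, 128 latent channels.
print(f"1:{compression_ratio(32, 8, 128):g}")  # 1:192

# Hypothetical lower-compression configuration, for comparison only.
print(f"1:{compression_ratio(8, 4, 16):g}")    # 1:48
```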
Because such aggressive compression risks losing high-frequency detail, the VAE decoder is trained as more than a simple inverse mapping. In the LTX-Video design the decoder simultaneously performs latent-to-pixel conversion and the final denoising step in pixel space, allowing the model to recover fine textures without paying for a separate pixel-space refinement network.[3] The decoder is supervised by a stack of objectives that includes:
These design choices are described in the technical report and are the principal reason LTX-Video can ship without a separate cascade or super-resolution stage despite its small parameter count.[3]
The denoising network is a standard Diffusion Transformer (DiT) adapted for spatiotemporal latents. In the base release it has approximately 1.9 billion parameters, built from 28 transformer blocks with a hidden dimension of 2048.[3] Each block contains both self-attention and cross-attention layers; self-attention operates on the full 3D latent grid, while cross-attention pulls in features from the T5-XXL text encoder.[3]
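As a sanity check on the quoted size, a back-of-the-envelope count of the main weight matrices lands close to 1.9 billion. The sketch assumes four d×d projections each for self-attention and cross-attention and a feed-forward layer with a 4× expansion; those layout details, and the decision to ignore biases, normalization, embeddings, and timestep conditioning, are assumptions made for illustration rather than figures from the technical report.

```python
def estimate_block_params(hidden_dim, ffn_multiplier=4):
    """Rough weight count for one transformer block (matrices only)."""
    self_attn = 4 * hidden_dim * hidden_dim    # query, key, value, output projections
    cross_attn = 4 * hidden_dim * hidden_dim   # assumes text features are projected at the same width
    feed_forward = 2 * ffn_multiplier * hidden_dim * hidden_dim  # up and down projections
    return self_attn + cross_attn + feed_forward


# 28 blocks at hidden dimension 2048, as stated in the article.
total = 28 * estimate_block_params(2048)
print(f"{total / 1e9:.2f}B")  # 1.88B
```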
Several adjustments distinguish the LTX-Video transformer from earlier image DiTs:
The transformer is trained with the AdamW optimizer on a mixture of video clips and still images, where a still image is treated as a single-frame video within the same training pipeline.[3] During training, between 0% and 20% of tokens are randomly dropped to encourage robustness across different aspect ratios and durations.[3]
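One way to picture the token-dropping step is the sketch below. It assumes the drop fraction is sampled uniformly between 0 and 0.2 for each training example and that the dropped positions are chosen uniformly at random; the source gives only the 0%-20% range, so both choices, along with the function name, are illustrative.

```python
import random


def drop_tokens(tokens, max_drop=0.2, rng=random):
    """Return a copy of `tokens` with a random fraction (up to max_drop) removed."""
    drop_fraction = rng.uniform(0.0, max_drop)   # assumed uniform sampling of the rate
    keep_count = len(tokens) - int(len(tokens) * drop_fraction)
    # Choose which positions survive, then restore the original order.
    keep_idx = sorted(rng.sample(range(len(tokens)), keep_count))
    return [tokens[i] for i in keep_idx]


rng = random.Random(0)
kept = drop_tokens(list(range(100)), rng=rng)
print(len(kept))  # somewhere between 80 and 100
```

Passing a seeded `random.Random` instance keeps the behaviour reproducible in experiments.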
LTX-Video uses the T5-XXL encoder (about 4.7B parameters in its standard configuration) as its text conditioning module.[3][5] This is the same encoder family used by other Lightricks releases and is shared with several other large open generative systems. The model card recommends elaborate, descriptive English prompts and explicitly notes that other languages and short prompts are not officially supported.[5]
The 0.9 release was trained on a large, diverse video corpus that the technical report describes only in general terms (large-scale internet-sourced video supplemented with images).[3] Subsequent versions add HD and 4K footage licensed from Shutterstock under the research-license arrangement described above; that data became available to Lightricks under the December 13, 2024 agreement and was incorporated into later iterations of the model.[10] Lightricks does not publish exact dataset sizes or composition tables in the public model card.[5]
The model is trained for multi-resolution generation, with both height and width required to be divisible by 32 (the VAE's spatial stride) and frame count required to be of the form 8n+1 (one more than a multiple of the VAE's temporal stride of 8).[5] Reference operating points include 768×512 at 24 fps and 1216×704 at 30 fps.
The 0.9.5 update added explicit support for multi-keyframe conditioning (allowing the user to pin the first frame, end frame, or arbitrary intermediate frames), while 0.9.6 and later improved video extension (continuing a generated clip beyond its original duration).[11]
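The size constraints described above (height and width divisible by 32, frame count of the form 8n+1) are easy to get wrong when starting from a conventional resolution. The helper below is a minimal sketch that adjusts a requested size to the nearest valid one, rounding spatial dimensions down and the frame count to the nearest valid value. Whether a given pipeline rounds, pads, or rejects invalid sizes is not specified here, so treat the rounding policy and the function name as assumptions.

```python
def snap_dimensions(width, height, num_frames, spatial_stride=32, temporal_stride=8):
    """Adjust a requested clip size to satisfy the stride constraints."""
    # Round spatial sizes down to a multiple of the stride, keeping at least one latent cell.
    snapped_w = max(spatial_stride, (width // spatial_stride) * spatial_stride)
    snapped_h = max(spatial_stride, (height // spatial_stride) * spatial_stride)
    # Round the frame count to the nearest value of the form temporal_stride * n + 1.
    n = max(0, (num_frames - 1 + temporal_stride // 2) // temporal_stride)
    snapped_frames = temporal_stride * n + 1
    return snapped_w, snapped_h, snapped_frames


# A 720p request for five seconds at 24 fps (120 frames).
print(snap_dimensions(1280, 720, 120))  # (1280, 704, 121)
```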
Out of the box, LTX-Video supports several conditioning modes through a shared LTXVideoCondition interface in its inference code and through dedicated nodes in the ComfyUI integration:
For LTXV-13B, the inference pipeline additionally supports multi-scale rendering, in which the same model first generates at a low spatial resolution and is then re-applied at higher resolution using paired spatial and temporal upsampler models distributed alongside the main weights.[12]
The most widely cited LTX-Video benchmark is the headline number from the technical report: 5 seconds of 24 fps video at 768×512 resolution generated in roughly 2 seconds on a single NVIDIA H100 GPU, which is faster than the resulting clip can be played back in real time.[3] The Hugging Face model card extends this claim to a 30 fps, 1216×704 reference resolution that also generates faster than playback on the same hardware tier.[5]
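The "faster than playback" claim can be restated as a real-time factor, the ratio of clip duration to generation time. The arithmetic below uses only the headline figures quoted above (5 seconds of 24 fps video in roughly 2 seconds); the function name is illustrative, and the result inherits the approximation in the "roughly 2 seconds" figure.

```python
def realtime_factor(clip_seconds, generation_seconds):
    """Values above 1.0 mean the clip is generated faster than it plays back."""
    return clip_seconds / generation_seconds


clip_seconds, fps, generation_seconds = 5, 24, 2
print(realtime_factor(clip_seconds, generation_seconds))  # 2.5
print(clip_seconds * fps / generation_seconds)            # 60.0 frames generated per second
```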
On consumer GPUs the picture is less extreme but still very favorable. Third-party comparisons running the LTX-Video 0.9.x line on an RTX 4090 typically report short-clip generation in roughly 5-10 seconds for the base 2B model and around 90 seconds for higher-quality settings, versus several minutes for HunyuanVideo, CogVideoX, or Mochi 1 at comparable resolutions on the same hardware.[7][14] These numbers are not part of the official technical report and should be read as community benchmarks rather than vendor claims.
The technical report includes a human-preference study with 1,000 prompts and 20 participants, judging "visual quality, motion fidelity, and prompt adherence" across LTX-Video and several baselines.[3] In that survey, LTX-Video was preferred over each of CogVideoX, Open-Sora Plan, and PyramidFlow in roughly 85% of text-to-video comparisons, and over comparable image-to-video baselines in 91% of pairings.[3] The same paper compares VAE compression ratios across these systems, with LTX-Video occupying the most-compressed corner of the table at 1:192.[3]
Third-party rankings have given a more mixed read. Independent reviewers writing after the release of HunyuanVideo and the LTXV-13B update typically characterize LTX-Video as the fastest open video model and HunyuanVideo or Wan 2.1 as the strongest on motion-fidelity benchmarks such as VBench, while Mochi 1 is often singled out for prompt adherence in pure text-to-video.[7] LTX-Video's per-token compute advantage is generally credited as the main reason its quality-to-speed ratio remains competitive even as larger systems pull ahead on absolute fidelity.[7][14]
The reference code lives at github.com/Lightricks/LTX-Video and consists of PyTorch inference and pipeline-configuration code, alongside YAML configurations for each weight variant (for example configs/ltxv-13b-0.9.8-distilled.yaml).[4] The repository requires CUDA 12.2 or later and PyTorch 2.1.2 or later, with macOS MPS support added against PyTorch 2.3.[4]
LTX-Video is integrated into the Hugging Face Diffusers library as LTXConditionPipeline (and earlier wrappers), enabling use via the standard from_pretrained interface.[5] Diffusers usage is the path recommended in the model card for application developers, and weights for each version can be loaded by name (Lightricks/LTX-Video, Lightricks/LTX-Video-0.9.5, Lightricks/LTX-Video-0.9.7-dev, and so on).[4][5]
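A minimal text-to-video call through Diffusers might look like the sketch below. It uses the `LTXPipeline` class and `export_to_video` helper from Diffusers rather than `LTXConditionPipeline`, assumes a CUDA GPU with enough memory and a recent Diffusers release, and leaves the sampler settings at the library defaults; the wrapper function, its defaults, and the output path are illustrative rather than taken from the model card.

```python
def generate_clip(prompt, output_path="output.mp4", width=768, height=512,
                  num_frames=121, fps=24):
    """Generate one clip with the Diffusers LTX-Video text-to-video pipeline."""
    # Check the size constraints first so mistakes fail before any model download.
    if width % 32 or height % 32 or (num_frames - 1) % 8:
        raise ValueError("width/height must be multiples of 32 and num_frames must be 8n+1")

    # Heavy imports are local so this module can be loaded without them installed.
    import torch
    from diffusers import LTXPipeline
    from diffusers.utils import export_to_video

    pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
    pipe.to("cuda")

    result = pipe(prompt=prompt, width=width, height=height, num_frames=num_frames)
    export_to_video(result.frames[0], output_path, fps=fps)
    return output_path


if __name__ == "__main__":
    generate_clip("A slow aerial shot over a foggy pine forest at sunrise, soft golden light")
```

Long, descriptive English prompts are used here because the model card recommends them.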
The model has a first-party node pack for ComfyUI distributed at github.com/Lightricks/ComfyUI-LTXVideo.[15] The nodes appear under the LTXVideo category in the ComfyUI menu and include a Gemma- or T5-backed text encoding node, an LTXVTextToVideoSampler sampler node, a VAE-decode node, and a video-combine output node.[15] The pack ships example workflows for text-to-video, image-to-video, multi-keyframe animation, and video extension.[15] Community-maintained GGUF quantizations (for example city96/LTX-Video-0.9.5-gguf) further reduce memory usage for desktop and laptop GPUs.[16]
Hosted inference is available through several third-party providers, including fal.ai, Replicate (the lightricks/ltx-video model), and Lightricks' own LTX Studio platform, which directly integrates the LTXV models into a longer-form storyboard workflow.[1][5][17] LTX Studio also exposes other third-party models alongside LTX-Video, including (per Lightricks' own announcements) Veo 3 for generations with synchronized audio.[9]
LTX-Video is positioned primarily as a building block for short-form generative video. The base release explicitly targets two use patterns:
Because the model is fully open-weights and runs locally on consumer GPUs, it has also seen substantial adoption in research and hobbyist pipelines: LoRA-style finetunes (often called "IC-LoRAs" in the LTX-Video community) attach to the 0.9.6 distilled base for stylistic and subject-specific generation,[11] and the model is commonly used as a fast iteration target for prompt engineering before a more expensive run on a larger system.[7][14]
Lightricks markets the model as suitable for commercial use under the LTX-Video Open Weights License, with the additional condition, introduced at the 13B launch, that organizations with more than USD 10 million in annual revenue must obtain a separate commercial license.[12]
The table below summarizes key public design parameters for LTX-Video and the four open video models with which it is most often compared. Numbers and dates are drawn directly from each model's primary source.
| Model | Developer | Initial release | Parameters | Architecture | VAE compression | Notes |
|---|---|---|---|---|---|---|
| LTX-Video 0.9 | Lightricks | November 2024 | approx. 1.9B | DiT with 3D causal VAE, T5-XXL | 1:192 (32×32×8) | "Faster than playback" on H100 at 768×512[3] |
| Mochi 1 | Genmo | October 2024 | 10B | Asymmetric Diffusion Transformer | 8×8 spatial, 6× temporal | Apache 2.0 license, strong text-to-video quality[7] |
| CogVideoX (CogVideoX-5B) | Tsinghua KEG | August 2024 | 5B (and 2B variant) | DiT with 3D VAE, expert transformer | 1:48 | Strong image-to-video; older baseline by late 2024[3][7] |
| HunyuanVideo | Tencent | December 2024 | 13B | DiT-based, dual-stream to single-stream design | 1:48 | Largest open video model at release; high motion quality[7] |
| Open-Sora Plan | community / PKU-YuanGroup | 2024 onward | varied | DiT-based | 1:48 | Open reimplementation of Sora-style designs[3] |
This comparison is meant only to give a sense of the relative position of LTX-Video at its release; differences in evaluation procedures, training data, and release timing make any direct ranking sensitive to the exact benchmark used. LTX-Video's distinguishing claim is that none of the comparable systems generated faster than playback on a single GPU at the time of its release.[3] Later open systems such as Wan 2.1 have closed part of the speed gap on consumer hardware while pushing motion fidelity higher.[7]
Lightricks documents several limitations in the public model card and technical report.
In addition to these vendor-documented constraints, independent reviewers consistently report that LTX-Video at its initial 2B scale produces slightly less prompt-faithful and motion-stable output than larger contemporaries such as HunyuanVideo or Mochi 1, even when it is much faster to sample.[7][14] The LTXV-13B release and the multi-scale rendering pipeline are explicitly responses to that gap.[12]
LTX-Video sits within a broader set of architectures and systems that the wiki covers separately: