LTX-Video

Updated Jul 16, 2026

LTX-Video is an open-source, transformer-based latent video diffusion model developed by the Israeli company Lightricks and first released to the public in November 2024.^[1]^[2] The system is designed around a single architectural goal: integrating a high-compression 3D causal Video-VAE with a Diffusion Transformer (DiT) backbone so tightly that generation runs faster than real-time playback on a single NVIDIA H100 GPU.^[3] The initial release is a 1.9-billion-parameter base model capable of producing five seconds of 24 fps video at 768×512 resolution in roughly two seconds on an H100, a generation speed unmatched by other public DiT-based text-to-video generation systems at the time.^[3] Lightricks has continued to expand the family through a sequence of revisions (0.9, 0.9.1, 0.9.5, 0.9.6, 0.9.7, and the LTXV-13B release in May 2025), and the model has become a common backbone in community pipelines distributed through Hugging Face and ComfyUI.^[4]^[5]^[6]

The model belongs to a small cohort of fully open video foundation models that emerged in late 2024, alongside Genmo's Mochi 1, Tsinghua's CogVideoX, Tencent's HunyuanVideo, and the research effort Open-Sora.^[7] Within that group LTX-Video is distinguished by its emphasis on inference latency rather than parameter count: the technical report attributes its speed advantage to relocating patchification from the transformer input into the Video-VAE encoder, which yields an aggressive 1:192 compression ratio (each latent token covers 32×32×8 pixels) and reduces the number of latent tokens the transformer must denoise.^[3] The codebase is released on GitHub, and weights are distributed through the Lightricks Hugging Face organization under a custom OpenRAIL-style "LTX-Video Open Weights License", with the broader repository carrying an Apache 2.0 marker for the inference and training scaffolding code.^[4]^[5]

Infobox

Attribute	Value
Developer	Lightricks (Jerusalem, Israel)
First public release	November 2024 (version 0.9)
Architecture	DiT-based latent video diffusion with 3D causal Video-VAE
Base parameters (initial)	approximately 1.9 billion
Later scales	13B model (LTXV-13B), released May 6, 2025
Text encoder	T5-XXL
VAE compression	1:192 ratio of pixel values to latent values (32×32 spatial and 8× temporal downsampling, 128 latent channels)
Reference resolution	768×512, 24 fps (also supports 1216×704, 30 fps)
Reference speed	5-second clip generated in roughly 2 seconds on a single NVIDIA H100
Distribution	github.com/Lightricks/LTX-Video; huggingface.co/Lightricks/LTX-Video
License (weights)	Custom "LTX-Video Open Weights License" (OpenRAIL-derived); free for personal use; commercial-use terms updated from version 0.9.5 onward
Code license	Apache 2.0 (inference scaffolding)
Integrations	Diffusers, ComfyUI-LTXVideo, fal.ai, Replicate, LTX Studio
Paper	"LTX-Video: Realtime Video Latent Diffusion", arXiv:2501.00103, December 30, 2024

History

Lightricks before LTX-Video

Lightricks was founded in 2013 in Jerusalem and is best known for consumer creative-software products such as Facetune and Videoleap. The company moved into generative AI tooling in early 2024 with the announcement of LTX Studio, a web-based AI filmmaking environment intended to let users go from text prompts to storyboards, characters, and short scenes through a single interface. LTX Studio was unveiled on February 28, 2024 as an invite-only beta and later opened to general availability on August 20, 2024.^[8]^[9] At launch, LTX Studio relied primarily on third-party generative components for video synthesis; the company described its long-term goal as building its own video foundation model that could be tuned for storyboard-style filmmaking workflows.^[9]

Open-source release of LTX-Video 0.9

On November 21, 2024, Lightricks publicly released the first version of its in-house video model under the name LTX Video (often stylized LTXV).^[1]^[2] The model was published as version 0.9 on Hugging Face under the repository Lightricks/LTX-Video, with the inference code mirrored at github.com/Lightricks/LTX-Video.^[4]^[5] Lightricks characterized the launch as the first real-time DiT-based video model: a 1.9-billion-parameter base capable of producing 24 fps clips at 768×512 resolution "faster than they can be watched", running on consumer-class hardware such as a single NVIDIA RTX 4090 in addition to data center GPUs.^[1]^[2]

The first release supported text-to-video and image-to-video conditioning simultaneously, both trained jointly inside the same backbone rather than as separate fine-tunes.^[3] Coverage at the time positioned LTX-Video as a direct open-source counterweight to OpenAI's then-unreleased Sora and to closed-source commercial models such as Runway Gen-3 Alpha, Kling, and Pika.^[2]^[7]

Shutterstock training-data partnership

On December 13, 2024, Lightricks and Shutterstock announced a long-term licensing agreement in which Lightricks would train future versions of LTXV on Shutterstock's HD and 4K stock-footage library.^[10] The arrangement was the first deployment of what Shutterstock called a "research license", a tier intended to let model developers train on premium licensed footage at lower cost before committing to a full commercial license; Shutterstock's contributor revenue-share program covered the licensed data, and contributors retained an opt-out from AI training use.^[10] The deal was specifically described as providing data for "future model iterations", which positioned LTX-Video 0.9 as a release that pre-dated the Shutterstock corpus and subsequent versions as the first generation trained against it.^[10]

The arXiv technical report

The accompanying technical paper, "LTX-Video: Realtime Video Latent Diffusion", was submitted to arXiv on December 30, 2024 as preprint 2501.00103.^[3] The paper lists sixteen authors from Lightricks: Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi.^[3] It is the canonical reference for the system's design and benchmark claims, including the 1:192 compression ratio, the relocation of patchification into the VAE, and head-to-head human preference numbers against CogVideoX, Open-Sora Plan, and PyramidFlow baselines.^[3]

Version timeline (0.9 through 0.9.7)

After the November 2024 launch, Lightricks published point releases on Hugging Face at intervals ranging from a few weeks to a few months.

Version	Approximate release	Notable changes
0.9	November 2024	Initial 2B base model, text-to-video and image-to-video
0.9.1	Late 2024	Quality improvements, integration with spatiotemporal guidance and perturbed attention guidance (STG/PAG) workflows
0.9.5	March 2025	Multi-keyframe conditioning, longer videos, commercial-use updates to the weights license
0.9.6	April 2025	New `ltxv-2b-0.9.6-dev` checkpoint and `ltxv-2b-0.9.6-distilled` real-time variant requiring as few as 8 sampling steps; first-class video extension and keyframe animation
0.9.7	May 2025 onward	13B parameter `ltxv-13b-0.9.7` model and FP8 quantized variant for higher quality; introduces multi-scale rendering pipeline

The version metadata above is drawn from the Hugging Face model card and the GitHub repository's README.^[4]^[5]^[6] The 0.9.6 distilled checkpoint in particular is the one most commonly referenced for "real-time on a single 4090" claims because it accepts very short sampler schedules (around 8 steps) while preserving the multi-task capabilities of the base model.^[11]

LTXV-13B and multi-scale rendering

On May 6, 2025, Lightricks announced LTXV-13B, a 13-billion-parameter scaling of the LTX-Video architecture.^[12]^[13] The headline claim was that LTXV-13B generated comparable-quality video roughly 30 times faster than other open video models of similar quality, while still running on consumer GPUs rather than data-center hardware.^[12] The release introduced multi-scale rendering, a pipeline that first drafts a video at lower spatial resolution to capture coarse motion and then refines it through dedicated spatial and temporal upsamplers, a technique enabled by the model's tightly compressed latent space.^[12]^[13] Lightricks also announced commercial-use terms for the 13B model: companies with annual revenue under USD 10 million could use it without a license fee, with paid commercial licensing required above that threshold.^[12]

The 13B release also formalized the multi-tier weight distribution that has since become standard for LTX-Video: a high-quality ltxv-13b development weight, an ltxv-13b-distilled weight (roughly fifteen times faster than the base 13B for short sampler schedules), a smaller ltxv-2b-distilled weight retained for low-VRAM use, and FP8-quantized variants of each that further reduce memory footprint.^[4]^[12]

Technical details

High-level architecture

LTX-Video is structured as a latent diffusion model in the lineage of Stable Diffusion and the DiT family, adapted to video.^[3] The three principal components are:

A 3D causal Video-VAE that encodes raw video into a compact latent grid and decodes denoised latents back to pixels.
A transformer denoiser that performs full spatiotemporal self-attention on the latent grid.
A T5-XXL text encoder (one of the larger T5 variants) that provides cross-attention conditioning to the denoiser.

What sets LTX-Video apart from earlier video diffusion systems is not the existence of these components but the way they are coupled: the system relocates the patchifying operation from the input of the denoising transformer into the encoder of the Video-VAE itself, so that the transformer operates directly on a 1:1 token-to-latent grid rather than re-tokenizing the VAE output.^[3] This decision pushes more of the compression work onto the VAE and shrinks the number of tokens the transformer must process.

Video-VAE: 1:192 compression

The Video-VAE is the most heavily engineered component of LTX-Video. It is built from 3D causal convolutions and applies spatial downsampling of 32×32 in the image plane together with temporal downsampling of 8 frames per latent step, with 128 output channels per latent token.^[3] The combined effect is a 1:192 ratio of input pixel values to latent values, several times more aggressive than the compression ratios used by comparable open video models such as PyramidFlow (1:96), CogVideoX (1:48), or HunyuanVideo (1:48).^[3]
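The 1:192 figure can be reproduced from the strides and channel count given above. The following is a back-of-the-envelope sketch rather than code from the LTX-Video repository: it assumes three-channel RGB input, the function names are illustrative, and the latent-frame formula assumes that the causal VAE encodes the first frame on its own and each following group of 8 frames into one latent step.

```python
def compression_ratio(spatial_stride=32, temporal_stride=8,
                      latent_channels=128, pixel_channels=3):
    """Input pixel values represented by one latent value."""
    pixel_values_per_token = (spatial_stride * spatial_stride
                              * temporal_stride * pixel_channels)
    return pixel_values_per_token / latent_channels


def latent_token_count(width, height, num_frames,
                       spatial_stride=32, temporal_stride=8):
    """Tokens the transformer denoises for one clip (assumed causal layout)."""
    latent_frames = 1 + (num_frames - 1) // temporal_stride
    return (width // spatial_stride) * (height // spatial_stride) * latent_frames


print(compression_ratio())                 # 192.0
print(latent_token_count(768, 512, 121))   # 6144
```

Under these assumptions, a 121-frame clip at 768×512 reduces to a few thousand tokens, which is the quantity that drives the cost of full spatiotemporal self-attention.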

Because such aggressive compression risks losing high-frequency detail, the VAE decoder is trained as more than a simple inverse mapping. In the LTX-Video design the decoder simultaneously performs latent-to-pixel conversion and the final denoising step in pixel space, allowing the model to recover fine textures without paying for a separate pixel-space refinement network.^[3] The decoder is supervised by a stack of objectives that includes:

A reconstruction-style GAN discriminator that compares real and reconstructed samples (rather than the more typical real-vs-fake setup), borrowed conceptually from the image-domain VAE literature.^[3]
Multi-layer noise injection in the style of StyleGAN, used to produce diverse high-frequency content on top of the deterministic decoder output.^[3]
A uniform log-variance term applied across latent channels, simplifying the prior.
A "video DWT loss" computed via 3D Discrete Wavelet Transforms with an L1 distance, used to supervise spatiotemporal frequency content.^[3]

These design choices are described in the technical report and are the principal reason LTX-Video can ship without a separate cascade or super-resolution stage despite its small parameter count.^[3]

Transformer denoiser

The denoising network is a standard Diffusion Transformer (DiT) adapted for spatiotemporal latents. In the base release it has approximately 1.9 billion parameters, with a hidden dimension of 2048 distributed across 28 transformer blocks.^[3] Each block contains both self-attention and cross-attention layers; self-attention operates on the full 3D latent grid, while cross-attention pulls in features from the T5-XXL text encoder.^[3]
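As a rough consistency check on these figures, the weight count of the transformer blocks can be estimated from the hidden dimension and block count alone. The sketch below assumes each block has four d×d projections in self-attention, four in cross-attention, and a feed-forward layer with 4× expansion. The expansion factor, and the omission of biases, normalization weights, timestep modulation, and embedding layers, are assumptions and are not taken from the report.

```python
def estimate_dit_parameters(hidden_dim=2048, num_blocks=28, ffn_expansion=4):
    """Rough weight count for the transformer blocks only (assumed layout)."""
    d = hidden_dim
    self_attention = 4 * d * d                  # Q, K, V and output projections
    cross_attention = 4 * d * d                 # same four projections, text as K/V source
    feed_forward = 2 * ffn_expansion * d * d    # up- and down-projection
    per_block = self_attention + cross_attention + feed_forward
    return num_blocks * per_block


total = estimate_dit_parameters()
print(f"{total / 1e9:.2f}B")   # 1.88B
```

Under these assumptions the estimate lands close to the reported figure of approximately 1.9 billion parameters.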

Several adjustments distinguish the LTX-Video transformer from earlier image DiTs:

Rotary positional embeddings: The model uses rotary position embeddings (RoPE), with an exponential frequency spacing across the time, height, and width axes, and with normalized fractional coordinates so that the same model can be reused at different output resolutions and durations.^[3]
QK normalization with RMSNorm: Query and key tensors are normalized using RMSNorm before the attention dot product, a stabilization technique that has become standard in large transformers.^[3]
Log-normal noise schedule shifted toward higher noise: The training-time noise level is sampled from a log-normal distribution biased toward higher-noise regions of the diffusion process, with the shift scaling with the number of tokens being denoised; the report attributes this to the practical observation that longer videos benefit from spending more capacity on the coarsest noise regime.^[3]
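The effect of QK normalization can be illustrated with a minimal sketch for a single query/key pair. This is not the LTX-Video implementation: it uses plain Python lists, omits the learnable gain that RMSNorm layers normally carry, and the function names are illustrative.

```python
import math


def rms_norm(vector, eps=1e-6):
    """Scale a vector to unit root-mean-square (learnable gain omitted)."""
    mean_square = sum(x * x for x in vector) / len(vector)
    return [x / math.sqrt(mean_square + eps) for x in vector]


def attention_logit(query, key, normalize=True):
    """Scaled dot-product logit for one query/key pair."""
    if normalize:
        query, key = rms_norm(query), rms_norm(key)
    dot = sum(q * k for q, k in zip(query, key))
    return dot / math.sqrt(len(query))


query = [0.5, -1.0, 2.0, 0.25]
key = [1.0, 0.5, -0.5, 2.0]
large_query = [100.0 * q for q in query]

# Without normalization the logit grows with the query magnitude.
print(attention_logit(query, key, normalize=False),
      attention_logit(large_query, key, normalize=False))
# With QK normalization the two logits are almost identical.
print(attention_logit(query, key), attention_logit(large_query, key))
```

Because the logits no longer depend on the scale of the activations, the softmax is less likely to saturate as activations grow during training, which is the stabilizing effect the technique is used for.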

The transformer is trained with the AdamW optimizer on a mixture of video clips and still images, where images are simply treated as a degenerate video-duration configuration within the same training pipeline.^[3] The training mixture stochastically drops 0%-20% of tokens to encourage robustness across different aspect ratios and durations.^[3]

Text encoder

LTX-Video uses the T5-XXL encoder (about 4.7B parameters in its standard configuration) as its text conditioning module.^[3]^[5] This is the same encoder family used by other Lightricks releases and is shared with several other large open generative systems. The model card recommends elaborate, descriptive English prompts and explicitly notes that other languages and short prompts are not officially supported.^[5]

Training data

The 0.9 release was trained on a large, diverse video corpus that the technical report describes only in general terms (large-scale internet-sourced video supplemented with images).^[3] Later versions add HD and 4K footage licensed from Shutterstock under the research-license agreement of December 13, 2024, described above.^[10] Lightricks does not publish exact dataset sizes or composition tables in the public model card.^[5]

Resolution support

The model is trained for multi-resolution generation, with both height and width required to be divisible by 32 (the VAE's spatial stride) and frame count required to be of the form 8n+1 (one more than a multiple of the VAE's temporal stride of 8).^[5] Reference operating points include:

768×512 at 24 fps for 5-second clips, the configuration cited for the "faster than playback" claim.^[3]
1216×704 at 30 fps, advertised on the Hugging Face model card as a primary supported resolution.^[5]
Resolutions below 720×1280 are recommended in the model card for best quality.^[5]
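The divisibility rules above are easy to check before launching a generation. The helpers below are a hypothetical sketch (neither function is part of the LTX-Video codebase) that validates a requested shape and rounds an invalid one down to the nearest valid shape.

```python
SPATIAL_STRIDE = 32   # VAE spatial downsampling factor
TEMPORAL_STRIDE = 8   # VAE temporal downsampling factor


def is_valid_shape(width, height, num_frames):
    """Check the divisibility rules quoted from the model card."""
    return (width % SPATIAL_STRIDE == 0
            and height % SPATIAL_STRIDE == 0
            and num_frames >= 1
            and (num_frames - 1) % TEMPORAL_STRIDE == 0)


def snap_shape(width, height, num_frames):
    """Round a requested shape down to the nearest valid one (hypothetical helper)."""
    snapped_width = max(SPATIAL_STRIDE, width // SPATIAL_STRIDE * SPATIAL_STRIDE)
    snapped_height = max(SPATIAL_STRIDE, height // SPATIAL_STRIDE * SPATIAL_STRIDE)
    snapped_frames = max(1, (num_frames - 1) // TEMPORAL_STRIDE * TEMPORAL_STRIDE + 1)
    return snapped_width, snapped_height, snapped_frames


print(is_valid_shape(768, 512, 121))    # True
print(is_valid_shape(1216, 704, 120))   # False: 120 is not of the form 8n+1
print(snap_shape(1280, 720, 120))       # (1280, 704, 113)
```

Rounding down rather than up is a design choice made here so that the snapped shape never exceeds the memory budget implied by the request.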

The 0.9.5 update added explicit support for multi-keyframe conditioning (allowing the user to pin the first frame, end frame, or arbitrary intermediate frames), while 0.9.6 and later improved video extension (continuing a generated clip beyond its original duration).^[11]

Inference modes

Out of the box, LTX-Video supports several conditioning modes through a shared LTXVideoCondition interface in its inference code and through dedicated nodes in the ComfyUI integration:

Text-to-video: a pure prompt-conditioned generation.
Image-to-video: an input still image is treated as the first frame (or a designated keyframe), and the model generates the rest of the clip from a text prompt.^[5]
Video-to-video and video extension: a short input clip is used as a conditioning signal, with the model continuing or modifying it.^[11]
Multi-keyframe interpolation: introduced in 0.9.5, multiple input frames can be assigned to specific frame indices, with the model interpolating motion between them.^[11]

For LTXV-13B, the inference pipeline additionally supports multi-scale rendering, in which the same model first generates at a low spatial resolution and is then re-applied at higher resolution using paired spatial and temporal upsampler models distributed alongside the main weights.^[12]

Performance

Inference speed

The most widely cited LTX-Video benchmark is the headline number from the technical report: 5 seconds of 24 fps video at 768×512 resolution generated in roughly 2 seconds on a single NVIDIA H100 GPU, which is faster than the resulting clip can be played back in real time.^[3] The Hugging Face model card extends this claim to a 30 fps, 1216×704 reference resolution that also generates faster than playback on the same hardware tier.^[5]
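The "faster than playback" claim reduces to simple arithmetic on the reported figures. The short sketch below uses illustrative function names, and because the 2-second generation time is itself quoted as approximate, the results should be read as approximate as well.

```python
def realtime_factor(clip_seconds, generation_seconds):
    """Seconds of video produced per second of compute; above 1 is faster than playback."""
    return clip_seconds / generation_seconds


def generated_frames_per_second(clip_seconds, fps, generation_seconds):
    """Frame throughput of the generator, to compare with the playback frame rate."""
    return clip_seconds * fps / generation_seconds


print(realtime_factor(5, 2))                  # 2.5
print(generated_frames_per_second(5, 24, 2))  # 60.0
```

A factor of about 2.5 means the model produces frames at roughly 60 per second against a playback rate of 24 per second.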

On consumer GPUs the picture is less extreme but still very favorable. Third-party comparisons running the LTX-Video 0.9.x line on an RTX 4090 typically report short-clip generation in roughly 5-10 seconds for the base 2B model and around 90 seconds for higher-quality settings, versus several minutes for HunyuanVideo, CogVideoX, or Mochi 1 at comparable resolutions on the same hardware.^[7]^[14] These numbers are not part of the official technical report and should be read as community benchmarks rather than vendor claims.

Quality comparisons

The technical report includes a human-preference study with 1,000 prompts and 20 participants, judging "visual quality, motion fidelity, and prompt adherence" across LTX-Video and several baselines.^[3] In that survey, LTX-Video was preferred over each of CogVideoX, Open-Sora Plan, and PyramidFlow in roughly 85% of text-to-video comparisons, and over comparable image-to-video baselines in 91% of pairings.^[3] The same paper compares VAE compression ratios across these systems, with LTX-Video occupying the most-compressed corner of the table at 1:192.^[3]

Third-party rankings have given a more mixed read. Independent reviewers writing after the release of HunyuanVideo and the LTXV-13B update typically characterize LTX-Video as the fastest open video model and HunyuanVideo or Wan 2.1 as the strongest on motion-fidelity benchmarks such as VBench, while Mochi 1 is often singled out for prompt adherence in pure text-to-video.^[7] LTX-Video's per-token compute advantage is generally credited as the main reason its quality-to-speed ratio remains competitive even as larger systems pull ahead on absolute fidelity.^[7]^[14]

Implementations

Reference implementations

The reference code lives at github.com/Lightricks/LTX-Video and consists of PyTorch inference and pipeline-configuration code, alongside YAML configurations for each weight variant (for example configs/ltxv-13b-0.9.8-distilled.yaml).^[4] The repository requires CUDA 12.2 or later and PyTorch 2.1.2 or later, with macOS MPS support added against PyTorch 2.3.^[4]

Diffusers

LTX-Video is integrated into the Hugging Face Diffusers library as LTXConditionPipeline (and earlier wrappers), enabling use via the standard from_pretrained interface.^[5] Diffusers usage is the path recommended in the model card for application developers, and weights for each version can be loaded by name (Lightricks/LTX-Video, Lightricks/LTX-Video-0.9.5, Lightricks/LTX-Video-0.9.7-dev, and so on).^[4]^[5]
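A minimal text-to-video sketch through Diffusers might look as follows. It uses the plain `LTXPipeline` class rather than `LTXConditionPipeline`, and the argument names follow the Diffusers call signature as commonly documented, so they should be checked against the installed library version. The helper `build_generation_kwargs`, the default of 50 sampling steps, and the output path are illustrative assumptions. Running `generate` requires a CUDA GPU and downloads the model weights.

```python
def build_generation_kwargs(prompt, width=768, height=512,
                            seconds=5, fps=24, steps=50):
    """Assemble pipeline arguments that respect the model's shape rules."""
    if width % 32 or height % 32:
        raise ValueError("width and height must be divisible by 32")
    # Nearest frame count of the form 8n+1 for the requested duration.
    num_frames = 8 * round(seconds * fps / 8) + 1
    return {
        "prompt": prompt,
        "width": width,
        "height": height,
        "num_frames": num_frames,
        "frame_rate": fps,
        "num_inference_steps": steps,
    }


def generate(prompt, output_path="output.mp4", fps=24):
    """Run text-to-video generation; needs a CUDA GPU and downloads the weights."""
    # Heavy imports are kept local so the helper above works without them.
    import torch
    from diffusers import LTXPipeline
    from diffusers.utils import export_to_video

    pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video",
                                       torch_dtype=torch.bfloat16)
    pipe.to("cuda")
    frames = pipe(**build_generation_kwargs(prompt, fps=fps)).frames[0]
    export_to_video(frames, output_path, fps=fps)
    return output_path
```

Calling `generate("A slow aerial shot over a foggy pine forest at sunrise, soft golden light")` would write a five-second clip to `output.mp4`. A long, descriptive prompt is used because the model card recommends that style.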

ComfyUI-LTXVideo

The model has a first-party node pack for ComfyUI distributed at github.com/Lightricks/ComfyUI-LTXVideo.^[15] The nodes appear under the LTXVideo category in the ComfyUI menu and include a Gemma- or T5-backed text encoding node, an LTXVTextToVideoSampler sampler node, a VAE-decode node, and a video-combine output node.^[15] The pack ships example workflows for text-to-video, image-to-video, multi-keyframe animation, and video extension.^[15] Community-maintained GGUF quantizations (for example city96/LTX-Video-0.9.5-gguf) further reduce memory usage for desktop and laptop GPUs.^[16]

Cloud and managed offerings

Hosted inference is available through third-party providers, including fal.ai and Replicate (the lightricks/ltx-video model), and through Lightricks' own LTX Studio platform, which integrates the LTXV models directly into a longer-form storyboard workflow.^[1]^[5]^[17] LTX Studio also exposes third-party models alongside LTX-Video, including (per Lightricks' own announcements) Veo 3 for generations with synchronized audio.^[9]

Applications

LTX-Video is positioned primarily as a building block for short-form generative video. The base release explicitly targets two use patterns:

Image animation: a single input image, often produced by a separate text-to-image model, is animated for a few seconds in response to a motion prompt.^[5]
Pre-visualization and storyboard fill: short clips are generated to fill out a sequence in a longer storyboard, which is the core LTX Studio workflow.^[9]

Because the model is fully open-weights and runs locally on consumer GPUs, it has also seen substantial adoption in research and hobbyist pipelines: LoRA-style finetunes (often called "IC-LoRAs" in the LTX-Video community) attach to the 0.9.6 distilled base for stylistic and subject-specific generation,^[11] and the model is commonly used as a fast iteration target for prompt engineering before a more expensive run on a larger system.^[7]^[14]

Lightricks markets the model as suitable for commercial use under the LTX-Video Open Weights License, with the additional condition, introduced at the 13B launch, that organizations with more than USD 10 million in annual revenue must obtain a separate commercial license.^[12]

Comparison with contemporary open video models

The table below summarizes key public design parameters for LTX-Video and the four open video models with which it is most often compared. Numbers and dates are drawn directly from each model's primary source.

Model	Developer	Initial release	Parameters	Architecture	VAE compression	Notes
LTX-Video 0.9	Lightricks	November 2024	approx. 1.9B	DiT with 3D causal VAE, T5-XXL	1:192 (32×32×8)	"Faster than playback" on H100 at 768×512^[3]
Mochi 1	Genmo	October 2024	10B	Asymmetric Diffusion Transformer	8×8 spatial, 6× temporal	Apache 2.0 license, strong text-to-video quality^[7]
CogVideoX (CogVideoX-5B)	Tsinghua KEG	August 2024	5B (and 2B variant)	DiT with 3D VAE, expert transformer	1:48	Strong image-to-video; older baseline by late 2024^[3]^[7]
HunyuanVideo	Tencent	December 2024	13B	DiT-based, dual-stream to single-stream design	1:48	Largest open video model at release; high motion quality^[7]
Open-Sora Plan	community / PKU-YGroup	2024 onward	varied	DiT-based	1:48	Open reimplementation of Sora-style designs^[3]

This comparison is meant only to give a sense of the relative position of LTX-Video at its release; differences in evaluation procedures, training data, and release timing make any direct ranking sensitive to the exact benchmark used. LTX-Video's distinguishing claim is that none of the comparable systems generated faster than playback on a single GPU at the time of its release.^[3] Later open systems such as Wan 2.1 have closed part of the speed gap on consumer hardware while pushing motion fidelity higher.^[7]

Limitations

Lightricks documents several limitations in the public model card and technical report.

Short clips: The base release is designed for short clips (typically up to about 10 seconds), with longer durations added only progressively across later releases.^[3]^[5]
Prompt sensitivity: The model card recommends "elaborate, descriptive" English prompts and explicitly notes that prompt-following degrades for short or terse prompts; non-English prompts are out of scope.^[5]
Factual content: The model is not intended to generate accurate depictions of real people, places, or events, and the model card warns against using it for factual reporting.^[5]
Bias: As with any large generative model trained on internet video and stock footage, the model may reproduce and amplify societal biases in its training data; Lightricks acknowledges this in the model card's limitations section.^[5]
Detail at high compression: The 1:192 VAE compression ratio is what makes the model fast, but the technical report also acknowledges that this aggressive compression places real limits on the representation of fine detail, which is part of why the VAE decoder is also tasked with the final denoising step in pixel space.^[3]

In addition to these vendor-documented constraints, independent reviewers consistently report that LTX-Video at its initial 2B scale produces slightly less prompt-faithful and motion-stable output than larger contemporaries such as HunyuanVideo or Mochi 1, even when it is much faster to sample.^[7]^[14] The LTXV-13B release and the multi-scale rendering pipeline are explicitly responses to that gap.^[12]

Related work

LTX-Video sits within a broader set of architectures and systems that the wiki covers separately:

The DiT backbone formalism is treated in Diffusion Transformer (DiT) and its multimodal generalization in MMDiT (Multimodal Diffusion Transformer).
The general framework of operating diffusion in a compressed latent space is described in Latent diffusion model and exemplified by the Stable Diffusion family.
The encoder used for text conditioning is documented in T5 (language model); the rotary position scheme used in attention is documented in Rotary position embedding (RoPE).
LTX-Video's VAE is a Variational Autoencoder specialized for video; its decoder regime borrows ideas from StyleGAN and is supervised in part by GAN-style objectives.
Adjacent open video models include Mochi 1, CogVideoX, and HunyuanVideo; closed-source contemporaries include Sora, Sora 2, Veo and Veo 3, Runway Gen-3 Alpha and Runway Gen-4, Kling (including Kling 2.1), and Pika (including Pika 2.5).
LTX-Video is integrated into ComfyUI and hosted on Hugging Face and Replicate.

References

[1] ComfyUI Wiki, "Lightricks Releases Real-Time Video Generation Model LTX-Video", ComfyUI Wiki, 2024-11-23. https://comfyui-wiki.com/en/news/2024-11-23-ltx-video-release. Accessed 2026-05-20.
[2] Sharon Goldman, "Exclusive: Lightricks bets on open-source AI video to challenge Big Tech", VentureBeat, 2024-11-21. https://venturebeat.com/ai/exclusive-lightricks-bets-on-open-source-ai-video-to-challenge-big-tech. Accessed 2026-05-20.
[3] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian and Ofir Bibi, "LTX-Video: Realtime Video Latent Diffusion", arXiv:2501.00103, 2024-12-30. https://arxiv.org/abs/2501.00103. Accessed 2026-05-20.
[4] Lightricks, "Lightricks/LTX-Video (official repository)", GitHub, 2024-11-21. https://github.com/Lightricks/LTX-Video. Accessed 2026-05-20.
[5] Lightricks, "Lightricks/LTX-Video (model card)", Hugging Face, 2024-11-21. https://huggingface.co/Lightricks/LTX-Video. Accessed 2026-05-20.
[6] Lightricks, "Lightricks/LTX-Video-0.9.5 (model card)", Hugging Face, 2025-03. https://huggingface.co/Lightricks/LTX-Video-0.9.5. Accessed 2026-05-20.
[7] ComfyOnline, "Open source video generation models comparison (CogVideoX, Mochi, LTX-Video, HunyuanVideo)", ComfyOnline Blog, 2025. https://www.comfyonline.app/blog/open-source-video-generation-models-comparisons. Accessed 2026-05-20.
[8] Lauren Forristal, "Lightricks announces AI-powered filmmaking studio to help creators visualize stories", TechCrunch, 2024-02-28. https://techcrunch.com/2024/02/28/lightricks-announces-ai-powered-filmmaking-studio-to-help-creators-visualize-stories/. Accessed 2026-05-20.
[9] SiliconANGLE, "Lightricks launches LTX Studio to advance realism in text-to-video generation", SiliconANGLE, 2024-02-28. https://siliconangle.com/2024/02/28/lightricks-launches-ltx-studio-advance-realism-text-video-generation/. Accessed 2026-05-20.
[10] Lightricks and Shutterstock, "Lightricks Partners With Shutterstock for Video Training Data to Advance Open Source LTXV Video AI Generative Video Model", PR Newswire, 2024-12-13. https://www.prnewswire.com/news-releases/lightricks-partners-with-shutterstock-for-video-training-data-to-advance-open-source-ltxv-video-ai-generative-video-model-302331526.html. Accessed 2026-05-20.
[11] The Local Lab AI, "LTX Video v0.9.6 Update: Faster with Better Coherence and Quality", Patreon, 2025-04. https://www.patreon.com/posts/ltx-video-v0-9-6-127046557. Accessed 2026-05-20.
[12] Lightricks, "Lightricks Launches 13B Parameters LTX Video Model, Breakthrough Rendering Approach Generates High-Quality, Efficient AI Video 30X Faster Than Comparable Models", PR Newswire, 2025-05-06. https://www.prnewswire.com/news-releases/lightricks-launches-13b-parameters-ltx-video-model-breakthrough-rendering-approach-generates-high-quality-efficient-ai-video-30x-faster-than-comparable-models-302447660.html. Accessed 2026-05-20.
[13] Michael Nuñez, "Lightricks just made AI video generation 30x faster and you won't need a $10,000 GPU", VentureBeat, 2025-05-06. https://venturebeat.com/ai/lightricks-just-made-ai-video-generation-30x-faster-and-you-wont-need-a-10000-gpu. Accessed 2026-05-20.
[14] Clore.ai, "Video Generation Comparison", Clore.ai Documentation, 2025. https://docs.clore.ai/guides/comparisons/video-gen-comparison. Accessed 2026-05-20.
[15] Lightricks, "Lightricks/ComfyUI-LTXVideo (official ComfyUI integration)", GitHub, 2024-11. https://github.com/Lightricks/ComfyUI-LTXVideo. Accessed 2026-05-20.
[16] city96, "city96/LTX-Video-0.9.5-gguf (community GGUF quantization)", Hugging Face, 2025. https://huggingface.co/city96/LTX-Video-0.9.5-gguf. Accessed 2026-05-20.
[17] Replicate, "LTX-Video by Lightricks (Replicate model page)", Replicate, 2024-11. https://replicate.com/lightricks/ltx-video. Accessed 2026-05-20.
