Mochi 1
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,627 words
Mochi 1 is an open-weights text-to-video diffusion model released by Genmo Inc. on October 22, 2024 under the Apache 2.0 license.[^1][^2] The preview model is a 10-billion-parameter Asymmetric Diffusion Transformer (AsymmDiT) paired with a 362-million-parameter video VAE that compresses input clips by a factor of 8x8 in space and 6 in time into a 12-channel latent representation.[^3][^4] At launch it generated 480p clips at 30 frames per second with a maximum length of roughly 5.4 seconds, and the company positioned it as the largest openly released video generative model at that time.[^1][^2] The weights and inference code were published on Hugging Face under genmo/mochi-1-preview and on GitHub at genmoai/models (later moved to genmoai/mochi), and a hosted playground was made available at genmo.com/play alongside the open release.[^2][^4][^5]
| Field | Value |
|---|---|
| Developer | Genmo Inc. (San Francisco) |
| Initial release | October 22, 2024 (preview) |
| License | Apache 2.0 |
| Model class | Latent text-to-video diffusion |
| Parameter count | 10 billion (denoiser); 362 million (VAE) |
| Architecture | Asymmetric Diffusion Transformer (AsymmDiT) |
| Text encoder | Single T5-XXL |
| Latent compression | 8x8 spatial, 6x temporal, 12 channels |
| Default resolution | 480p (848x480) |
| Frame rate | 30 fps |
| Maximum length | ~5.4 seconds (163 frames) |
| Repository (weights) | huggingface.co/genmo/mochi-1-preview |
| Repository (code) | github.com/genmoai/mochi |
| Hosted playground | genmo.com/play |
Genmo Inc. was founded in 2022 by brothers Paras Jain and Ajay Jain, both of whom completed PhDs at the University of California, Berkeley.[^6] Ajay Jain co-authored the 2020 Denoising Diffusion Probabilistic Models paper that underpins modern image and video diffusion systems, and later co-developed DreamFusion for text-to-3D generation; Paras Jain worked on machine-learning systems and self-driving software at Google and DeepScale before co-founding Genmo.[^6] The company's earlier products focused on hosted generative image and video tools accessed through a web playground.
Mochi 1 was announced and released as an open-weights preview on October 22, 2024.[^1][^2] The release was accompanied by Genmo's announcement of a $28.4 million Series A funding round led by NEA, with participation from The House Fund, Gold House Ventures, WndrCo, Eastlink Capital Partners, and Essence VC.[^2] At launch, Genmo described the preview as the first stage in a two-part release, with a higher-resolution "Mochi 1 HD" 720p variant promised "before the end of the year".[^1][^2]
Within roughly two weeks of the initial preview, downstream tooling began appearing. A community wrapper for ComfyUI, ComfyUI-MochiWrapper by Kijai, exposed Mochi to ComfyUI users; native ComfyUI nodes followed shortly afterwards.[^7][^8] Hugging Face Diffusers added a MochiPipeline and a MochiTransformer3DModel implementation that reproduced Genmo's reference inference at lower memory cost and supported bitsandbytes 8-bit quantization.[^9] The reference repository at github.com/genmoai/mochi (formerly genmoai/models) added LoRA fine-tuning support on November 26, 2024, enabling adaptation on a single H100 GPU.[^10][^11]
The promised Mochi 1 HD release was delayed past the original end-of-2024 window. As of early 2026, Genmo's open-weights repositories continued to host only the 480p preview; an open issue in the official repository (#132, "HD 720p Release timeline") tracked community questions about the delay, and Genmo had not published the 720p weights publicly through that point.[^12]
Mochi 1 follows the latent-diffusion recipe established by image systems such as Stable Diffusion: an autoencoder compresses pixel-space data into a smaller latent representation, and a denoising network operates entirely in that latent space, with the autoencoder's decoder mapping the result back to pixels.[^1][^3] Mochi extends this template to video by introducing a 3D VAE that compresses jointly across the spatial and temporal axes, and by training a diffusion transformer that attends across all latent positions in a clip simultaneously, rather than relying on per-frame image generation with a separate temporal model.[^1][^4]
The forward model is a continuous diffusion process trained with a flow-matching objective; the public Diffusers integration uses a FlowMatchEulerDiscreteScheduler as the default sampler, consistent with the flow-matching framing reported by Genmo.[^9] See Flow Matching and Diffusion model for the underlying mathematical background.
The choice of latent diffusion is consequential for video: the alternative of running diffusion in pixel space at 480p with 163 frames would require denoising a tensor of roughly 848 x 480 x 163 x 3 elements, which is on the order of 200 million values per clip and prohibitively expensive both in attention cost and memory. By compressing to a latent volume that is roughly 128 times smaller than RGB, AsymmVAE reduces this to a tractable token count of 44,520 spatiotemporal positions while preserving enough visual signal for the decoder to reconstruct a coherent clip.[^1][^4] The choice of 12 latent channels (versus the 4 channels typical of earlier image VAEs such as the original Stable Diffusion 1.x AE) is consistent with the larger capacity-per-position needed to represent both spatial detail and temporal continuity in a single tensor element.[^4][^9]
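The arithmetic behind these figures can be checked with a short sketch. Two values in it are assumptions rather than statements from the text above: that the causal encoder keeps the first frame and then emits one latent frame per six input frames, and that the denoiser groups latent positions into 2x2 spatial patches. Both are chosen because together they reproduce the 44,520-token figure.

```python
# Back-of-the-envelope sizes for one 848x480, 163-frame clip.
WIDTH, HEIGHT, FRAMES, RGB = 848, 480, 163, 3
SPATIAL_STRIDE, TEMPORAL_STRIDE, LATENT_CHANNELS = 8, 6, 12
PATCH = 2  # assumption: 2x2 spatial patches per token

# Values a pixel-space denoiser would have to process.
pixel_values = WIDTH * HEIGHT * FRAMES * RGB

latent_w = WIDTH // SPATIAL_STRIDE
latent_h = HEIGHT // SPATIAL_STRIDE
# Assumption: the first frame is kept, then one latent frame per 6 inputs.
latent_t = (FRAMES - 1) // TEMPORAL_STRIDE + 1

latent_positions = latent_w * latent_h * latent_t
latent_values = latent_positions * LATENT_CHANNELS
visual_tokens = latent_positions // (PATCH * PATCH)

print(f"pixel values:  {pixel_values:,}")
print(f"latent grid:   {latent_w} x {latent_h} x {latent_t}")
print(f"latent values: {latent_values:,}")
print(f"visual tokens: {visual_tokens:,}")
```

Under these assumptions the latent grid is 106 x 60 x 28, and the token count falls out of the patching step rather than from the VAE alone.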
The latent space is produced by a custom 3D variational autoencoder that Genmo calls AsymmVAE. The encoder compresses each input clip causally by 8x in each spatial dimension and 6x in time, and emits a 12-channel latent volume; the overall compression ratio is 8 x 8 x 6 = 384x in voxel count and roughly 96x to 128x in storage depending on accounting, with Genmo reporting "128x smaller" relative to RGB pixels.[^1][^4] The encoder uses 64 base channels while the decoder uses 128 base channels, producing an asymmetric VAE in which decoding has roughly twice the per-stage capacity of encoding.[^4]
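The lower end of the storage range follows from counting values instead of positions: each RGB pixel carries 3 values, while each latent position carries 12. A worked version of that accounting, using only the strides and channel counts given above:

$$
\underbrace{8 \times 8 \times 6}_{\text{positions per latent position}} = 384,
\qquad
384 \times \frac{3}{12} = 96
$$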
The Diffusers port exposes the VAE as AutoencoderKLMochi with 12 latent channels, and reports persisted latents_mean and latents_std statistics that the pipeline uses to rescale latents before decoding, in the same style as the Stable Diffusion 3 family.[^9] Total VAE parameter count is 362 million.[^4]
The causal nature of the temporal compression matters for two reasons. First, it means that frames at the beginning of a clip can be encoded without seeing future frames, which preserves a natural autoregressive ordering at the latent level even though the denoiser itself operates on the full latent volume at once. Second, it allows the same VAE encoder to be reused for variable-length clips during training, since the temporal stride is fixed and the per-latent receptive field is bounded.[^4] The decoder's higher base-channel count (128 versus the encoder's 64) reflects an intentional trade-off: the encoder only needs to produce a compact summary suitable for diffusion, while the decoder must reconstruct fine spatiotemporal detail from a heavily compressed signal, and is therefore allocated more capacity.[^4]
The "asymmetric" framing of both the VAE and the denoiser is one of the most distinctive engineering choices of the Mochi 1 release: rather than allocate equal compute to encoder and decoder (VAE) or to text and visual streams (DiT), Genmo concentrates parameters on the side where they are reported to contribute most to perceptual quality, which is decoding and visual reasoning respectively.[^1][^4] See Variational Autoencoder and Autoencoder for general background.
VAE tiling, exposed by Diffusers via pipe.enable_vae_tiling(), partitions the latent volume into overlapping tiles for decoding, which substantially reduces the peak memory required to convert latents back to pixels. The technique trades a small amount of seam-handling complexity for the ability to decode full-length 163-frame clips on hardware that would otherwise run out of VRAM during the final decode step.[^9]
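The idea of overlapping tiles can be illustrated along a single axis. The sketch below is not the Diffusers implementation; the helper name tile_spans and the tile and overlap sizes are hypothetical, chosen only to show how a decoder could cover a 106-wide latent axis in bounded-size pieces that share a margin for blending.

```python
def tile_spans(length, tile, overlap):
    """Return (start, stop) spans of at most `tile` elements covering [0, length)."""
    if tile >= length:
        return [(0, length)]
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than tile")
    spans = []
    start = 0
    while start + tile < length:
        spans.append((start, start + tile))
        start += stride
    # The final tile is shifted back so that it ends exactly at the boundary.
    spans.append((length - tile, length))
    return spans


# Hypothetical sizes: 106 latent columns, 32-wide tiles, 8 shared columns.
spans = tile_spans(106, tile=32, overlap=8)
print(spans)
```

Each tile is decoded independently, so peak memory scales with the tile size rather than with the full latent volume; the shared columns give the decoder room to blend seams.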
The denoising network is a 10-billion-parameter Asymmetric Diffusion Transformer.[^1][^3] Architecturally, AsymmDiT belongs to the family of multimodal diffusion transformers that jointly attend over text and visual tokens in the same self-attention operation, similar in spirit to the MMDiT design used in Stable Diffusion 3 and discussed in MMDiT (Multimodal Diffusion Transformer) and Diffusion Transformer (DiT).[^1][^9]
Key reported architectural parameters of AsymmDiT are:[^4]
- Text tokens: 256 (max_sequence_length=256 in the Diffusers pipeline)

Visual and text tokens are concatenated for joint attention but processed by separate MLP layers per modality, so the network learns modality-specific feed-forward transformations while sharing a unified attention operator.[^1] Genmo explicitly compares this scheme to the Stable Diffusion 3 architecture and frames AsymmDiT's distinctive feature as the asymmetric allocation of capacity between the two streams, with the visual stream carrying roughly four times the parameters of the text stream via the larger hidden dimension.[^1]
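A toy, single-head version of joint attention over two streams of different width may make the scheme concrete. This is a sketch under stated assumptions, not Genmo's code: the widths (8 for visual, 4 for text), the token counts, and the helper names are hypothetical, and the per-modality MLP layers that follow attention are omitted. Each modality gets its own non-square query, key, and value projections into a shared attention width, after which every token attends to every other token.

```python
import math
import random


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(visual, text, proj_visual, proj_text):
    """Project each modality separately, then attend over the concatenation."""
    q, k, v = [], [], []
    for tokens, proj in ((visual, proj_visual), (text, proj_text)):
        for tok in tokens:
            q.append(matvec(proj["q"], tok))
            k.append(matvec(proj["k"], tok))
            v.append(matvec(proj["v"], tok))
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs, weights = [], []
    for qi in q:
        row = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        weights.append(row)
        outputs.append(
            [sum(w * vj[d] for w, vj in zip(row, v)) for d in range(len(v[0]))]
        )
    return outputs, weights


rng = random.Random(0)
VISUAL_DIM, TEXT_DIM, ATTN_DIM = 8, 4, 8  # hypothetical toy widths
visual = [[rng.gauss(0, 1) for _ in range(VISUAL_DIM)] for _ in range(6)]
text = [[rng.gauss(0, 1) for _ in range(TEXT_DIM)] for _ in range(3)]
proj_visual = {name: random_matrix(ATTN_DIM, VISUAL_DIM, rng) for name in "qkv"}
proj_text = {name: random_matrix(ATTN_DIM, TEXT_DIM, rng) for name in "qkv"}

outputs, weights = joint_attention(visual, text, proj_visual, proj_text)
print(len(outputs), "tokens out, each attending over", len(weights[0]), "tokens")
```

The text projections here are 8 x 4 matrices while the visual ones are 8 x 8, which is the sense in which the narrower stream needs non-square projections to join the shared attention operator.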
A second consequence of the asymmetric design is that, unlike many earlier video diffusion systems that combined CLIP and T5 prompt encoders, Mochi 1 uses a single T5-XXL encoder (google/t5-v1_1-xxl) for prompts.[^1][^9] This simplification, paired with the larger visual stream, is reported by Genmo as one of the ways the model concentrates compute on visual reasoning rather than on duplicated text processing.[^1] See T5 (language model) for background on the encoder.
A central technical claim of the Mochi 1 release is that the denoiser performs full 3D self-attention across the entire latent volume of a clip, jointly mixing information across all space and time positions in every block.[^1] At 44,520 visual tokens per clip plus 256 text tokens, this is a substantial context window for a video diffusion model, and the joint formulation contrasts with earlier approaches that factored space and time attention into separate operators for efficiency.[^1]
The practical implication of full 3D attention is that every visual token can attend directly to every other visual token in every layer, without first passing information through an intermediate frame summary or a temporally-local attention window. This in principle allows the model to capture long-range temporal dependencies, such as the persistence of a moving object across the entire 5.4-second clip, without the staged information transfer required by space-then-time factorizations. The cost is that the attention matrix grows as the square of the token count, which is why Mochi's reference implementation prefers PyTorch's EFFICIENT_ATTENTION SDPA backend and why community deployments lean heavily on Flash Attention kernels to keep memory and runtime tractable.[^7][^9]
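A quick calculation shows why materializing the attention scores is unattractive at this sequence length. The sketch uses only the token counts quoted above; storing scores in bfloat16 at 2 bytes each is an illustrative assumption, since fused kernels avoid writing the full matrix at all.

```python
VISUAL_TOKENS = 44_520
TEXT_TOKENS = 256
BYTES_PER_BF16 = 2

seq_len = VISUAL_TOKENS + TEXT_TOKENS
score_entries = seq_len ** 2                   # one score per (query, key) pair
score_bytes = score_entries * BYTES_PER_BF16   # if one head's scores were stored

print(f"sequence length: {seq_len:,}")
print(f"scores per head: {score_entries:,}")
print(f"bfloat16 storage per head: {score_bytes / 1e9:.1f} GB")
```

Roughly two billion scores per head, per block, is the cost that memory-efficient and Flash Attention kernels sidestep by computing the softmax in tiles.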
To position the resulting tokens, Mochi 1 uses learnable 3D rotary positional embeddings, extending the 1D and 2D rotary scheme of Rotary position embedding (RoPE) to three axes. Genmo reports that the network "end-to-end learns mixing frequencies for space and time axes" inside the rotary embedding, rather than relying on hand-tuned base frequencies.[^1] See Rotary Position Embedding for background on the underlying mechanism. Learning the rotary frequencies rather than fixing them is reported by Genmo as a way of letting the model discover the appropriate trade-off between spatial and temporal locality, given the very different sampling rates along the height, width, and time axes of a 30 fps 480p latent.[^1]
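One way to picture a rotary embedding whose frequencies mix the three axes is sketched below. It is an illustration under assumptions, not Genmo's implementation: the function names, the two-pair head size, and the frequency values are hypothetical, and in a trained model the frequency triples would be learnable parameters rather than constants. Each channel pair is rotated by an angle that is a weighted sum of the time, height, and width coordinates.

```python
import math


def rotate_pairs(vec, position, freqs):
    """Rotate consecutive channel pairs of `vec` by a position-dependent angle.

    `freqs[i]` is an (f_t, f_h, f_w) triple for the i-th channel pair.
    """
    t, h, w = position
    out = []
    for i, (f_t, f_h, f_w) in enumerate(freqs):
        angle = t * f_t + h * f_h + w * f_w
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical frequencies for a 4-channel head (two channel pairs).
freqs = [(0.30, 0.05, 0.02), (0.01, 0.20, 0.15)]
query = [0.5, -1.0, 2.0, 0.25]
key = [1.5, 0.5, -0.75, 1.0]

# The same relative offset (1, 2, 3) at two different absolute positions.
score_a = dot(rotate_pairs(query, (4, 5, 6), freqs), rotate_pairs(key, (3, 3, 3), freqs))
score_b = dot(rotate_pairs(query, (10, 9, 8), freqs), rotate_pairs(key, (9, 7, 5), freqs))
print(score_a, score_b)
```

The two scores agree because the rotation makes the query-key product depend only on the offset between positions, which is the property that carries over from 1D rotary embeddings to the three-axis case.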
In addition to the attention design, Genmo's release notes call out several modern stabilization and efficiency choices in the transformer blocks:[^1]
These are individually well-known techniques in modern transformer training, and their combination is consistent with the design patterns of contemporaneous large diffusion transformers.
Genmo's reference pipeline runs the T5 encoder and VAE in torch.float32, while running the DiT in torch.bfloat16 with PyTorch's efficient-attention SDPA kernel.[^9] The Diffusers documentation reproduces this configuration via an explicit sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION) context and notes that decoding a full 163-frame clip at full precision requires roughly 70 GB of VRAM, while a bfloat16 variant runs at 22 GB.[^9]
Default inference parameters in the official Diffusers MochiPipeline are 64 denoising steps with guidance_scale=4.5 and a default resolution of 848x480 pixels, matching the 480p preview specification.[^9] Flash Attention is supported as an optional backend in the reference repository.[^5]
The reference repository's MochiSingleGPUPipeline exposes a linear_quadratic_schedule helper for constructing custom sampling timesteps, which allows users to bias the denoising trajectory either toward early-step exploration or late-step refinement without rewriting the scheduler.[^4] In practice, most third-party deployments use either the default 64-step schedule or, for previewing, a reduced 28-step schedule with guidance_scale=3.5 that the Diffusers documentation surfaces as an example.[^9] The 256-token prompt limit is enforced by truncation of the T5 encoder input, which means very long prose prompts will lose later tokens; Genmo's published demo prompts tend to be 1 to 3 sentences long and to emphasize visual nouns and adjectives over discursive description.[^1][^4]
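The shape of a linear-then-quadratic schedule can be sketched as follows. This is not the reference repository's linear_quadratic_schedule; the function name sketch_linear_quadratic, its signature, and the 0.025 threshold are hypothetical choices made for illustration. The schedule removes a small fraction of the noise in equal steps over the first half, then covers the remainder with steps that grow quadratically.

```python
def sketch_linear_quadratic(num_steps, threshold, linear_steps=None):
    """Return num_steps + 1 noise levels falling from 1.0 to 0.0."""
    if linear_steps is None:
        linear_steps = num_steps // 2
    if not 0 < linear_steps < num_steps:
        raise ValueError("linear_steps must lie strictly between 0 and num_steps")
    quad_steps = num_steps - linear_steps
    slope = threshold / linear_steps
    # Quadratic coefficient chosen so the removed-noise fraction ends at 1.0.
    curve = (1.0 - threshold - slope * quad_steps) / quad_steps ** 2
    removed = [slope * i for i in range(linear_steps)]
    removed += [threshold + slope * j + curve * j * j for j in range(quad_steps + 1)]
    return [1.0 - r for r in removed]


sigmas = sketch_linear_quadratic(64, threshold=0.025)
print(sigmas[:3], sigmas[-3:])
```

Raising the threshold or shortening the linear segment shifts more of the step budget toward the later, lower-noise part of the trajectory.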
For multi-GPU deployments, Diffusers supports splitting the Mochi transformer across devices with the device_map="auto" and max_memory={0: "24GB", 1: "24GB"} arguments on MochiTransformer3DModel.from_pretrained. This enables Mochi inference on two 24 GB consumer GPUs by sharding the transformer parameters, an option that would otherwise be foreclosed by the model's size at bfloat16.[^9] The reference repository documents an analogous multi-GPU split via its own pipeline factory.[^5]
The reference code released alongside the weights is published at github.com/genmoai/mochi (originally genmoai/models) and provides a MochiSingleGPUPipeline and a multi-GPU pipeline that splits the DiT across devices. It exposes factories such as DitModelFactory, DecoderModelFactory, T5ModelFactory, and a linear_quadratic_schedule for sampling timesteps.[^4][^5] Installation uses the uv Python package manager and depends on FFmpeg for video encoding.[^5]
Hugging Face Diffusers ships a MochiPipeline for text-to-video generation that loads genmo/mochi-1-preview directly. The pipeline pairs MochiTransformer3DModel, AutoencoderKLMochi, a T5 encoder, and a FlowMatchEulerDiscreteScheduler, and supports both a full-precision run (around 42 GB VRAM) and a variant="bf16" configuration that fits in around 22 GB.[^9] Memory-saving features such as enable_model_cpu_offload() and enable_vae_tiling() are available, as is multi-GPU sharding via the device_map and max_memory arguments on from_pretrained.[^9]
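A minimal usage sketch of the bf16 configuration with both memory-saving features enabled might look like the following. The wrapper function generate_clip, the SAMPLING_DEFAULTS dictionary, the output path, and the 85-frame default are illustrative choices rather than part of the library; the heavy imports sit inside the function so that the module can be loaded on a machine without a GPU stack.

```python
# Sampling settings matching the defaults described for the 480p preview.
SAMPLING_DEFAULTS = {
    "num_inference_steps": 64,
    "guidance_scale": 4.5,
    "height": 480,
    "width": 848,
}


def generate_clip(prompt, out_path="mochi.mp4", num_frames=85):
    """Generate one clip with the bf16 weights and write it to `out_path`."""
    # Imported lazily: these require torch and diffusers to be installed.
    import torch
    from diffusers import MochiPipeline
    from diffusers.utils import export_to_video

    pipe = MochiPipeline.from_pretrained(
        "genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()  # keep idle sub-models on the CPU
    pipe.enable_vae_tiling()         # decode the latent volume tile by tile

    frames = pipe(prompt, **SAMPLING_DEFAULTS, num_frames=num_frames).frames[0]
    export_to_video(frames, out_path, fps=30)
    return out_path
```

Calling generate_clip("A chameleon's eye in close-up, photorealistic") downloads the weights on first use, so the first run is dominated by the download rather than by sampling.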
Diffusers also documents a quantized configuration using bitsandbytes 8-bit weights for both the T5 encoder and the Mochi transformer, allowing a single Mochi pipeline to run with significantly reduced VRAM while accepting some quality trade-off.[^9] A from_single_file loader can ingest the repackaged Comfy-Org bf16 checkpoint (Comfy-Org/mochi_preview_repackaged) directly.[^9]
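A weights-only estimate helps explain why precision matters so much for a model of this size. The sketch below uses the parameter counts given in this article and nominal bytes per parameter; it deliberately ignores the T5 encoder, activations, and any bookkeeping that a quantization library adds, so the results are lower bounds rather than measured VRAM figures.

```python
DENOISER_PARAMS = 10e9
VAE_PARAMS = 362e6
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_gigabytes(num_params, dtype):
    """Weights-only footprint in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    denoiser_gb = weight_gigabytes(DENOISER_PARAMS, dtype)
    vae_gb = weight_gigabytes(VAE_PARAMS, dtype)
    print(f"{dtype:>8}: denoiser {denoiser_gb:.1f} GB, VAE {vae_gb:.2f} GB")
```

Halving the bytes per parameter halves the denoiser's weight footprint, which is the main lever behind both the bf16 variant and the 8-bit configuration.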
Mochi 1 was added to ComfyUI in two waves. Kijai's community wrapper ComfyUI-MochiWrapper provided the earliest end-to-end workflow, and native Mochi nodes followed in early November 2024 (announced November 4, 2024).[^7][^8] The native integration supports multiple attention backends (including the efficient SDPA kernel and Flash Attention), allowing Mochi to run on a single 24 GB consumer GPU such as the NVIDIA RTX 4090.[^7][^8]
The Comfy-Org organization on Hugging Face hosts repackaged weights for use with ComfyUI in two precision tiers:[^7][^8][^9]
- mochi_preview_bf16.safetensors (full precision, higher VRAM)
- mochi_preview_fp8_scaled.safetensors (FP8-scaled weights for memory-constrained machines)

Text encoders are similarly available in t5xxl_fp16 and t5xxl_fp8_e4m3fn_scaled variants. The diffusion model is placed in ComfyUI/models/diffusion_models/, the encoder in ComfyUI/models/clip/, and the VAE in ComfyUI/models/vae/; the workflow then chains a checkpoint loader, prompt encoder, K sampler, empty Mochi latent video, and the VAE decoder.[^7][^8]
On November 26, 2024, Genmo released a LoRA fine-tuning workflow as part of the reference repository, allowing users to adapt Mochi 1 on a single H100 GPU using small custom video datasets.[^10][^11] Documentation in the demos/fine_tuner directory and downstream tutorials describe training LoRA (Low-Rank Adaptation) adapters that target the query, key, value, and output projection matrices of the DiT with learning rates in the 1e-4 to 2e-4 range.[^10][^11] Compute platforms including Modal and Lambda published walkthroughs covering training on H100 and Grace Hopper GH200 hardware respectively.[^11]
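The general LoRA formulation behind such adapters can be written in a few lines. This is the standard low-rank update, not code from Genmo's fine-tuner; the toy 3x3 projection, the rank, and the alpha value are hypothetical. A frozen weight matrix W is augmented with a trainable product B A scaled by alpha / r, and B starts at zero so the adapted layer initially matches the frozen one.

```python
def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def lora_forward(x, base, lora_a, lora_b, alpha):
    """Compute W x + (alpha / r) * B (A x), with only A and B trainable."""
    rank = len(lora_a)
    low_rank = matvec(lora_b, matvec(lora_a, x))
    return [y + (alpha / rank) * d for y, d in zip(matvec(base, x), low_rank)]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added to one projection."""
    return rank * (d_in + d_out)


base = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # frozen W
lora_a = [[0.1, -0.2, 0.3]]      # shape (r, d_in), r = 1
lora_b = [[0.0], [0.0], [0.0]]   # shape (d_out, r), zero-initialized
x = [1.0, 2.0, 3.0]

print(lora_forward(x, base, lora_a, lora_b, alpha=8))
print(lora_param_count(4096, 4096, rank=16))
```

For a hypothetical 4096 x 4096 projection, a rank-16 adapter trains about 131 thousand parameters against roughly 16.8 million frozen ones, which is why adapting only the attention projections fits on a single GPU.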
Beyond Genmo's own playground at genmo.com/play, third-party inference providers exposed Mochi 1 through hosted APIs. Replicate published the model as genmoai/mochi-1 and a separate genmoai/mochi-1-lora endpoint for fine-tuned variants.[^11] Tensorfuse and other deployment platforms documented end-to-end recipes for hosting Mochi via ComfyUI on cloud GPUs.[^7]
Mochi 1 was one of four roughly contemporaneous large open-weights video diffusion releases in late 2024 and early 2025. Its most direct comparators are CogVideoX from Zhipu AI / Tsinghua KEG, HunyuanVideo from Tencent, and the Open-Sora line from HPC-AI Tech. Public summaries and community comparisons drew the following high-level distinctions at the time of Mochi's release; absolute quality rankings are subjective and shift as later versions ship.[^1][^13]
| Model | Approx. parameters | License | Reported resolution / length at first release | Notable points |
|---|---|---|---|---|
| Mochi 1 preview | 10B (DiT) + 362M (VAE) | Apache 2.0 | 480p (848x480), up to ~5.4 s at 30 fps | Single T5-XXL encoder; AsymmDiT with full 3D attention and 3D RoPE; LoRA fine-tuning kit released November 26, 2024 |
| CogVideoX-5B | 5B | Custom open license | 720x480, 6 s at 8 fps | 3D causal VAE; expert-transformer for joint text-video attention; image-to-video variant |
| HunyuanVideo | 13B | Tencent Community License | 720p, ~5 s | Largest contemporary open video DiT; dual-stream then single-stream design; restrictive non-Apache license |
| Open-Sora (early 2025) | scaled to ~11B by Open-Sora 2.0 | Apache-style open | Multi-resolution / multi-length | Open replication line inspired by Sora; staged training across resolutions |
Sources: Genmo blog,[^1] community comparison writeups,[^13] HunyuanVideo and CogVideoX release notes.[^13]
At launch, third-party commentary and crowd-sourced comparisons reported that Mochi 1 ranked near HunyuanVideo on text-to-video preference benchmarks, that CogVideoX was preferred for image-to-video tasks, and that LTX Video offered the fastest generation among the comparable open models.[^13] Mochi 1 was singled out for motion realism in several community evaluations, consistent with Genmo's own emphasis on motion fidelity and prompt adherence.[^1][^13] Among the four, Mochi was the only model released under Apache 2.0 with full training-style code at launch, which made it especially attractive for downstream experimentation and commercial fine-tuning compared to HunyuanVideo's non-permissive Tencent Community License.[^1][^13]
Genmo also positioned Mochi 1 explicitly against closed commercial systems: the launch coverage framed it as a permissively licensed alternative to OpenAI's Sora, to Runway Gen-3 Alpha, and to Kling (video generation), emphasizing open weights and Apache 2.0 use rather than feature parity.[^2]
Genmo's launch communications repeatedly emphasized prompt adherence as a primary objective of the Mochi 1 design, alongside motion quality.[^1][^9] The architectural levers reported in the release notes that bear on prompt following include the use of a single high-capacity T5-XXL text encoder, the joint attention formulation that allows text tokens to interact with visual tokens at every layer, and the relatively large max_sequence_length=256 text context.[^1][^9] The trade-off is also acknowledged: the model is described as optimized for photorealistic content and as performing poorly on animated, illustrated, or stylized prompts in the preview release.[^4]
Genmo did not publish a detailed training-data datasheet, dataset list, or training-compute disclosure for the Mochi 1 preview at release; the publicly available material is limited to the architectural description in the launch blog, the Hugging Face model card, and the GitHub README, and downstream summaries reiterate the same information.[^1][^4][^5] The model is described only as "trained entirely from scratch" on video data, with no further breakdown released.[^1] This stands in contrast to some peer releases that published partial dataset descriptions; the absence of a training datasheet has been noted in third-party reviews and is consistent with the preview status of the release.
Genmo's emphasis on prompt adherence is reflected in the company's choice of internal evaluation. The launch blog reports that Genmo's preliminary evaluation prioritized two axes: motion quality (whether motion looks physically plausible, whether objects retain identity through time, and whether camera motion behaves as expected) and prompt adherence (whether the model produces the requested subject, action, scene, and style).[^1] These two axes correspond to known failure modes of earlier video diffusion systems, in which short clips often look acceptable per frame but exhibit flickering, identity drift, or generic content that ignores the specific prompt.
The architectural rationale for prioritizing prompt adherence is concentrated in the asymmetric capacity allocation: by giving the visual stream roughly four times the parameter count of the text stream while still requiring all attention layers to read text tokens, the model is structurally biased toward grounding its visual output in the textual conditioning rather than producing a generic photorealistic clip independent of the prompt. The single high-capacity T5-XXL encoder, instead of a CLIP-style dual-encoder ensemble, is also reported by Genmo as a deliberate simplification that improves alignment between caption semantics and generated content.[^1]
The intended applications cited in Genmo's launch materials and downstream coverage include:
Because Mochi 1 is released under Apache 2.0, both personal and commercial use of the weights is permitted, including derivative works and commercial inference services, with no royalty or use-case restriction beyond the standard Apache 2.0 terms.[^1][^2][^4]
Mochi 1's release was significant for several intersecting reasons that go beyond the model's raw capability.
First, the combination of permissive Apache 2.0 licensing with a 10-billion-parameter video diffusion model marked a step change in the size of openly distributed video models. At the time of the October 2024 release, the largest openly available video diffusion model carried noticeably fewer parameters, and most other competitive systems were closed.[^1][^2] Genmo's framing of Mochi 1 as "the largest video generative model ever openly released" reflected this gap, and the Apache 2.0 license made the weights immediately usable in commercial products without the conditions attached to Tencent's HunyuanVideo license or to Sora and Gen-3 API access.[^1][^2]
Second, the asymmetric design choices popularized by AsymmDiT, in particular the joint multimodal attention with non-square QKV projections and the asymmetric encoder-decoder VAE, joined a small but growing set of architectural templates for billion-scale multimodal diffusion. The architecture's relationship to the MMDiT (Multimodal Diffusion Transformer) design used by Stable Diffusion 3 illustrates how the video diffusion community converged on joint text-visual attention as the default formulation by late 2024.[^1][^9]
Third, the rapid arrival of the ComfyUI native nodes and the bf16/fp8 repackaged weights demonstrated how quickly an open-weights release could be operationalized for consumer-grade hardware. Within two to three weeks of launch, end users with a single 24 GB RTX 4090 could run Mochi locally through ComfyUI, an outcome that depended on the community's quantization and attention-backend work.[^7][^8] The downstream LoRA fine-tuner published by Genmo on November 26, 2024 closed a second important loop, enabling small-data adaptation of the base model on a single H100 GPU.[^10][^11]
Fourth, Mochi 1's release expanded the empirical evidence base for what is and is not currently achievable by open video diffusion. The model's strengths in motion realism and prompt adherence, paired with its acknowledged weaknesses on animation and on resolutions above 480p, helped frame subsequent open-weights releases such as HunyuanVideo, the Wan family, and Open-Sora 2.0, all of which targeted Mochi as a baseline.[^13]
Genmo and downstream evaluators have documented several limitations of the Mochi 1 preview release:
- The 720p variant remained unreleased past the original end-of-2024 timeline; open issue #132 in the genmoai/mochi repository documents the delay.[^12]

Mochi 1 sits within a rapidly expanding open video diffusion ecosystem. The most directly comparable open-weights video diffusion models include HunyuanVideo from Tencent, CogVideoX from Zhipu AI, and the Wan family represented in this wiki by Wan 2.1 and Wan 2.5 from Alibaba. The closed commercial comparators most often cited in launch coverage are Sora and Sora 2 from OpenAI, the Veo family (Veo, Veo 2, Veo 3) from Google DeepMind, Runway Gen-3 Alpha and Runway Gen-4 from Runway, and Kling (video generation) from Kuaishou.[^1][^2][^13]
Architecturally, AsymmDiT inherits from the broader Diffusion Transformer (DiT) line and bears family resemblance to the MMDiT (Multimodal Diffusion Transformer) design used in Stable Diffusion 3 and Stable Diffusion 3.5, with the principal differences being the asymmetric text-vs-visual capacity allocation and the extension to full 3D attention over a video latent volume.[^1][^9] The flow-matching training objective links Mochi to the broader Flow Matching literature, and the use of Rotary position embedding (RoPE) generalized to three dimensions echoes positional-encoding choices in other modern transformers.[^1][^9]
The Mochi 1 release also fits into a wider pattern of open-weights releases in Text-to-video generation and broader AI Video Generation, a space whose ecosystem and history are summarized in those overview entries.