Mochi 1

Updated Jul 23, 2026

Mochi 1 is an open-weights text-to-video diffusion model released by Genmo Inc. on October 22, 2024 under the Apache 2.0 license.^[1]^[2] The preview model is a 10-billion-parameter Asymmetric Diffusion Transformer (AsymmDiT) paired with a 362-million-parameter video VAE that compresses input clips by a factor of 8x8 in space and 6 in time into a 12-channel latent representation.^[3]^[4] Genmo described it as "the largest video generative model ever openly released" and as setting "a new best-in-class standard for open-source video generation".^[1] At launch it generated 480p clips at 30 frames per second with a maximum length of roughly 5.4 seconds.^[1]^[2] The weights and inference code were published on Hugging Face under genmo/mochi-1-preview and on GitHub at genmoai/models (later moved to genmoai/mochi), and a hosted playground was made available at genmo.com/play alongside the open release.^[2]^[4]^[5]

Infobox

Field	Value
Developer	Genmo Inc. (San Francisco)
Initial release	October 22, 2024 (preview)
License	Apache 2.0
Model class	Latent text-to-video diffusion
Parameter count	10 billion (denoiser); 362 million (VAE)
Architecture	Asymmetric Diffusion Transformer (AsymmDiT)
Text encoder	Single T5-XXL
Latent compression	8x8 spatial, 6x temporal, 12 channels
Default resolution	480p (848x480)
Frame rate	30 fps
Maximum length	~5.4 seconds (163 frames)
Repository (weights)	huggingface.co/genmo/mochi-1-preview
Repository (code)	github.com/genmoai/mochi
Hosted playground	genmo.com/play

What is Mochi 1?

Mochi 1 is an open-source, text-to-video generation model built and released by the San Francisco company Genmo. Given a written prompt, it produces a short video clip rather than a still image. The Hugging Face model card describes it as "an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence in preliminary evaluation".^[4] Genmo framed the release as a step change for open models, stating that "open-source models drive progress and democratize access to state-of-the-art AI capabilities" and releasing the weights under "a permissive Apache 2.0 license".^[1]^[4] Because of that license, both personal and commercial use of the weights, including derivative works and commercial inference services, is permitted with no royalty or use-case restriction.^[1]^[2]^[4]

Who made Mochi 1, and when was it released?

Genmo Inc. was founded in 2022 by brothers Paras Jain and Ajay Jain, both of whom completed PhDs at the University of California, Berkeley.^[6] Ajay Jain co-authored the 2020 Denoising Diffusion Probabilistic Models paper that underpins modern image and video diffusion systems, and later co-developed DreamFusion for text-to-3D generation; Paras Jain worked on machine-learning systems and self-driving software at Google and DeepScale before co-founding Genmo.^[6] The company's earlier products focused on hosted generative image and video tools accessed through a web playground.

Mochi 1 was announced and released as an open-weights preview on October 22, 2024.^[1]^[2] The release was accompanied by Genmo's announcement of a $28.4 million Series A funding round led by NEA, with participation from The House Fund, Gold House Ventures, WndrCo, Eastlink Capital Partners, and Essence VC.^[2]^[14] The NEA investment was led by partner Rick Yang, and the round also drew angel investors including Abhay Parasnis (CEO of Typeface) and Amjad Masad (CEO of Replit).^[14] At launch, Genmo described the preview as the first stage in a two-part release, with a higher-resolution "Mochi 1 HD" 720p variant promised "before the end of the year".^[1]^[2]

What downstream tools and integrations shipped after launch?

Within roughly two weeks of the initial preview, downstream tooling began appearing. A community wrapper for ComfyUI, ComfyUI-MochiWrapper by Kijai, exposed Mochi to ComfyUI users; native ComfyUI nodes followed shortly afterwards.^[7]^[8] Hugging Face Diffusers added a MochiPipeline and a MochiTransformer3DModel implementation that reproduced Genmo's reference inference at lower memory cost and supported bitsandbytes 8-bit quantization.^[9] The reference repository at github.com/genmoai/mochi (formerly genmoai/models) added LoRA fine-tuning support on November 26, 2024, enabling adaptation on a single H100 GPU.^[10]^[11]

The promised Mochi 1 HD release was delayed past the original end-of-2024 window. As of early 2026, Genmo's open-weights repositories continued to host only the 480p preview; an open issue in the official repository (#132, "HD 720p Release timeline") tracked community questions about the delay, and Genmo had not published the 720p weights publicly through that point.^[12]

How does Mochi 1 work?

How does latent diffusion scale to video?

Mochi 1 follows the latent-diffusion recipe established by image systems such as Stable Diffusion: an autoencoder compresses pixel-space data into a smaller latent representation, and a denoising network operates entirely in that latent space, with the autoencoder's decoder mapping the result back to pixels.^[1]^[3] Mochi extends this template to video by introducing a 3D VAE that compresses jointly across the spatial and temporal axes, and by training a diffusion transformer that attends across all latent positions in a clip simultaneously, rather than relying on per-frame image generation with a separate temporal model.^[1]^[4]

The forward model is a continuous diffusion process trained with a flow-matching objective; the public Diffusers integration uses a FlowMatchEulerDiscreteScheduler as the default sampler, consistent with the flow-matching framing reported by Genmo.^[9] See Flow Matching and Diffusion model for the underlying mathematical background.

The choice of latent diffusion is consequential for video: the alternative of running diffusion in pixel space at 480p with 163 frames would require denoising a tensor of roughly 848 x 480 x 163 x 3 elements, which is on the order of 200 million values per clip and prohibitively expensive both in attention cost and memory. By compressing to a latent volume that is roughly 128 times smaller than RGB, AsymmVAE reduces this to a tractable token count of 44,520 spatiotemporal positions while preserving enough visual signal for the decoder to reconstruct a coherent clip.^[1]^[4] The choice of 12 latent channels (versus the 4 channels typical of earlier image VAEs such as the original Stable Diffusion 1.x AE) is consistent with the larger capacity-per-position needed to represent both spatial detail and temporal continuity in a single tensor element.^[4]^[9]
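The figures in this paragraph can be reproduced with a few lines of arithmetic. The sketch below relies on two details that are not stated above and are assumptions: that the causal VAE maps F input frames to (F - 1) // 6 + 1 latent frames, and that the transformer groups latents into 2x2 spatial patches. They are used here because together they reproduce the reported 44,520 tokens.

```python
# Back-of-envelope sizes for one 848x480, 163-frame clip.
WIDTH, HEIGHT, FRAMES = 848, 480, 163
SPATIAL, TEMPORAL = 8, 6   # VAE compression factors quoted in the article
PATCH = 2                  # ASSUMPTION: 2x2 spatial patchification in the DiT


def pixel_values(width=WIDTH, height=HEIGHT, frames=FRAMES, rgb=3):
    """Number of scalar values in the raw RGB clip."""
    return width * height * frames * rgb


def latent_grid(width=WIDTH, height=HEIGHT, frames=FRAMES):
    """(latent_frames, latent_height, latent_width) after VAE compression."""
    # ASSUMPTION: causal compression keeps the first frame on its own,
    # then emits one latent frame per TEMPORAL input frames.
    latent_frames = (frames - 1) // TEMPORAL + 1
    return latent_frames, height // SPATIAL, width // SPATIAL


def visual_tokens(width=WIDTH, height=HEIGHT, frames=FRAMES):
    """Token count seen by the transformer after patchification."""
    t, h, w = latent_grid(width, height, frames)
    return t * (h // PATCH) * (w // PATCH)


if __name__ == "__main__":
    print(pixel_values())    # 199042560, i.e. about 200 million
    print(latent_grid())     # (28, 60, 106)
    print(visual_tokens())   # 44520
```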

What is the Asymmetric Video VAE (AsymmVAE)?

The latent space is produced by a custom 3D variational autoencoder that Genmo calls AsymmVAE. The encoder compresses each input clip causally by 8x in each spatial dimension and 6x in time, and emits a 12-channel latent volume; the overall compression ratio is 8 x 8 x 6 = 384x in voxel count and roughly 96x to 128x in storage depending on accounting, with Genmo reporting "128x smaller" relative to RGB pixels.^[1]^[4] The encoder uses 64 base channels while the decoder uses 128 base channels, producing an asymmetric VAE in which decoding has roughly twice the per-stage capacity of encoding.^[4]
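The two ways of counting compression mentioned above can be made explicit. This is only a sketch of the accounting: the 384x and 96x figures follow directly from the stated factors, while the source does not say how Genmo arrives at its reported 128x, so that number is not derived here.

```python
from fractions import Fraction


def voxel_ratio(spatial=8, temporal=6):
    """How many input pixels (per channel) map onto one latent position."""
    return spatial * spatial * temporal


def value_ratio(spatial=8, temporal=6, rgb_channels=3, latent_channels=12):
    """Ratio of scalar values: RGB input versus 12-channel latent."""
    return Fraction(voxel_ratio(spatial, temporal) * rgb_channels, latent_channels)


if __name__ == "__main__":
    print(voxel_ratio())   # 384
    print(value_ratio())   # 96
```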

The Diffusers port exposes the VAE as AutoencoderKLMochi with 12 latent channels, and reports persisted latents_mean and latents_std statistics that the pipeline uses to rescale latents before decoding, in the same style as the Stable Diffusion 3 family.^[9] Total VAE parameter count is 362 million.^[4]

The causal nature of the temporal compression matters for two reasons. First, it means that frames at the beginning of a clip can be encoded without seeing future frames, which preserves a natural autoregressive ordering at the latent level even though the denoiser itself operates on the full latent volume at once. Second, it allows the same VAE encoder to be reused for variable-length clips during training, since the temporal stride is fixed and the per-latent receptive field is bounded.^[4] The decoder's higher base-channel count (128 versus the encoder's 64) reflects an intentional trade-off: the encoder only needs to produce a compact summary suitable for diffusion, while the decoder must reconstruct fine spatiotemporal detail from a heavily compressed signal, and is therefore allocated more capacity.^[4]

The "asymmetric" framing of both the VAE and the denoiser is one of the most distinctive engineering choices of the Mochi 1 release: rather than allocate equal compute to encoder and decoder (VAE) or to text and visual streams (DiT), Genmo concentrates parameters on the side where they are reported to contribute most to perceptual quality, which is decoding and visual reasoning respectively.^[1]^[4] See Variational Autoencoder and Autoencoder for general background.

VAE tiling, exposed by Diffusers via pipe.enable_vae_tiling(), partitions the latent volume into overlapping tiles for decoding, which substantially reduces the peak memory required to convert latents back to pixels. The technique trades a small amount of seam-handling complexity for the ability to decode full-length 163-frame clips on hardware that would otherwise run out of VRAM during the final decode step.^[9]
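The idea of overlapping tiles can be illustrated along a single axis. The helper below is a generic sketch and not the Diffusers implementation; the function name tile_ranges and the tile and overlap sizes are hypothetical, chosen only to show how windows are placed so that neighbours share a band that can be blended to hide seams.

```python
def tile_ranges(length, tile, overlap):
    """Return (start, end) windows of size `tile` that cover range(length).

    Neighbouring windows share at least `overlap` positions; the last
    window is placed flush with the end of the axis.
    """
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    if length <= tile:
        return [(0, length)]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)
    return [(s, s + tile) for s in starts]


if __name__ == "__main__":
    # Hypothetical sizes: a 106-wide latent axis decoded in 32-wide tiles.
    for window in tile_ranges(106, tile=32, overlap=8):
        print(window)
```

Each window would be decoded independently, so peak memory scales with the tile size rather than with the full latent volume.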

What is the Asymmetric Diffusion Transformer (AsymmDiT)?

The denoising network is a 10-billion-parameter Asymmetric Diffusion Transformer.^[1]^[3] Architecturally, AsymmDiT belongs to the family of multimodal diffusion transformers that jointly attend over text and visual tokens in the same self-attention operation, similar in spirit to the MMDiT design used in Stable Diffusion 3 and discussed in MMDiT (Multimodal Diffusion Transformer) and Diffusion Transformer (DiT).^[1]^[9]

Key reported architectural parameters of AsymmDiT are:^[4]

48 transformer blocks
24 attention heads
3,072-dimensional visual hidden state
1,536-dimensional text hidden state (half the visual dimension)
44,520 visual tokens per clip
256 text tokens (the maximum T5 sequence length used at inference, exposed as max_sequence_length=256 in the Diffusers pipeline)
Non-square QKV and output projections that bridge the asymmetric visual and text dimensions in a single joint attention layer

Visual and text tokens are concatenated for joint attention but processed by separate MLP layers per modality, so the network learns modality-specific feed-forward transformations while sharing a unified attention operator.^[1] Genmo explicitly compares this scheme to the Stable Diffusion 3 architecture and frames AsymmDiT's distinctive feature as the asymmetric allocation of capacity between the two streams, reporting that the model "streamlines text processing and focuses neural network capacity on reasoning about visual tokens", with the visual stream carrying nearly four times the parameters of the text stream via the larger hidden dimension.^[1]
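A rough sense of the asymmetry can be had from the reported dimensions alone. The sketch below is not Genmo's accounting: it assumes that both streams project into an attention width equal to the visual hidden size (which is what makes the text-side projection non-square) and that each stream uses a plain two-matrix MLP with a hypothetical expansion factor of 4.

```python
VISUAL_DIM, TEXT_DIM, HEADS = 3072, 1536, 24   # values reported in the list above


def head_dim(attn_dim=VISUAL_DIM, heads=HEADS):
    """Per-head width of the shared attention operator."""
    return attn_dim // heads


def qkv_shape(in_dim, attn_dim=VISUAL_DIM):
    """(in_features, out_features) of a fused Q, K, V projection.

    ASSUMPTION: both modalities project into the visual attention width.
    """
    return (in_dim, 3 * attn_dim)


def mlp_params(dim, expansion=4):
    """Weights in a two-matrix MLP; the expansion factor is hypothetical."""
    return 2 * dim * (expansion * dim)


if __name__ == "__main__":
    print(head_dim())                                       # 128
    print(qkv_shape(VISUAL_DIM))                            # (3072, 9216)
    print(qkv_shape(TEXT_DIM))                              # (1536, 9216)
    print(mlp_params(VISUAL_DIM) / mlp_params(TEXT_DIM))    # 4.0
```

Under these assumptions the feed-forward weights differ by a factor of four and the projection weights by a factor of two, which is at least consistent with the "nearly four times" figure quoted above.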

A second consequence of the asymmetric design is that, unlike many earlier video diffusion systems that combined CLIP and T5 prompt encoders, Mochi 1 uses a single T5-XXL encoder (google/t5-v1_1-xxl) for prompts.^[1]^[9] This simplification, paired with the larger visual stream, is reported by Genmo as one of the ways the model concentrates compute on visual reasoning rather than on duplicated text processing.^[1] See T5 (language model) for background on the encoder.

What are full 3D attention and 3D RoPE?

A central technical claim of the Mochi 1 release is that the denoiser performs full 3D self-attention across the entire latent volume of a clip, jointly mixing information across all space and time positions in every block.^[1] At 44,520 visual tokens per clip plus 256 text tokens, this is a substantial context window for a video diffusion model, and the joint formulation contrasts with earlier approaches that factored space and time attention into separate operators for efficiency.^[1]

The practical implication of full 3D attention is that every visual token can attend directly to every other visual token in every layer, without first passing information through an intermediate frame summary or a temporally-local attention window. This in principle allows the model to capture long-range temporal dependencies, such as the persistence of a moving object across the entire 5.4-second clip, without the staged information transfer required by space-then-time factorizations. The cost is that the attention matrix grows as the square of the token count, which is why Mochi's reference implementation prefers PyTorch's EFFICIENT_ATTENTION SDPA backend and why community deployments lean heavily on Flash Attention kernels to keep memory and runtime tractable.^[7]^[9]
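The quadratic cost can be quantified from the token counts given above. The following sketch estimates how large the attention score matrix would be if a naive implementation stored it in full for all heads of one layer; the 2-bytes-per-value figure assumes bfloat16 scores, and fused kernels of the kind mentioned above avoid materializing this matrix at all.

```python
VISUAL_TOKENS, TEXT_TOKENS, HEADS = 44_520, 256, 24


def attention_entries(tokens):
    """Entries in one head's full (tokens x tokens) score matrix."""
    return tokens * tokens


def naive_score_bytes(tokens, heads=HEADS, bytes_per_value=2):
    """Memory to hold every head's score matrix for a single layer."""
    return attention_entries(tokens) * heads * bytes_per_value


if __name__ == "__main__":
    joint = VISUAL_TOKENS + TEXT_TOKENS
    print(joint)                             # 44776
    print(attention_entries(joint))          # 2004890176
    print(naive_score_bytes(joint) / 1e9)    # about 96.2 (GB)
```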

To position the resulting tokens, Mochi 1 uses learnable 3D rotary positional embeddings, extending the 1D and 2D rotary scheme of Rotary position embedding (RoPE) to three axes. Genmo reports that the network "end-to-end learns mixing frequencies for space and time axes" inside the rotary embedding, rather than relying on hand-tuned base frequencies.^[1] See Rotary Position Embedding for background on the underlying mechanism. Learning the rotary frequencies rather than fixing them is reported by Genmo as a way of letting the model discover the appropriate trade-off between spatial and temporal locality, given the very different sampling rates along the height, width, and time axes of a 30 fps 480p latent.^[1]
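One way to picture a rotary embedding with mixed space and time frequencies is sketched below. The exact parameterization used in Mochi 1 is not described in the sources cited here, so this is an illustrative assumption: each pair of channels is rotated by an angle that is a weighted sum of the token's (t, y, x) coordinates, and the weights (the freqs triples) stand in for parameters that would be learned during training.

```python
import math


def rope_3d(vec, pos, freqs):
    """Rotate consecutive channel pairs of `vec` by position-dependent angles.

    vec:   flat list of length 2 * len(freqs)
    pos:   (t, y, x) coordinates of the token in the latent grid
    freqs: one (f_t, f_y, f_x) triple per channel pair; hypothetical
           stand-ins for learned mixing frequencies
    """
    t, y, x = pos
    out = []
    for i, (f_t, f_y, f_x) in enumerate(freqs):
        angle = f_t * t + f_y * y + f_x * x
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[2 * i], vec[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    freqs = [(0.3, 0.05, 0.07), (0.11, 0.02, 0.013)]
    q = [1.0, 2.0, -0.5, 0.3]
    k = [0.2, -1.0, 0.7, 1.5]
    # Same relative offset (1, 2, 3) at two different absolute positions.
    near = dot(rope_3d(q, (1, 2, 3), freqs), rope_3d(k, (0, 0, 0), freqs))
    far = dot(rope_3d(q, (5, 7, 9), freqs), rope_3d(k, (4, 5, 6), freqs))
    print(abs(near - far) < 1e-9)
```

The property checked at the end is the reason rotary schemes are attractive: the query-key score depends only on the relative offset between two tokens, not on their absolute positions.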

What training-stack techniques does Mochi 1 use?

In addition to the attention design, Genmo's release notes call out several modern stabilization and efficiency choices in the transformer blocks:^[1]

SwiGLU feed-forward layers
Query-key normalization on attention projections
"Sandwich" normalization (post-attention and post-MLP normalization in addition to pre-norm), used to stabilize training of the large model

These are individually well-known techniques in modern transformer training, and their combination is consistent with the design patterns of contemporaneous large diffusion transformers.

How is Mochi 1 run at inference time?

Genmo's reference pipeline runs the T5 encoder and VAE in torch.float32, while running the DiT in torch.bfloat16 with PyTorch's efficient-attention SDPA kernel.^[9] The Diffusers documentation reproduces this configuration via an explicit sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION) context and notes that decoding the latents of a 163-frame clip at full precision requires roughly 70 GB of VRAM, while a bfloat16 variant runs at 22 GB.^[9]

Default inference parameters in the official Diffusers MochiPipeline are 64 denoising steps with guidance_scale=4.5 and a default resolution of 848x480 pixels, matching the 480p preview specification.^[9] Flash Attention is supported as an optional backend in the reference repository.^[5]

The reference repository's MochiSingleGPUPipeline exposes a linear_quadratic_schedule helper for constructing custom sampling timesteps, which allows users to bias the denoising trajectory either toward early-step exploration or late-step refinement without rewriting the scheduler.^[4] In practice, most third-party deployments use either the default 64-step schedule or, for previewing, a reduced 28-step schedule with guidance_scale=3.5 that the Diffusers documentation surfaces as an example.^[9] The 256-token prompt limit is enforced by truncation of the T5 encoder input, which means very long prose prompts will lose later tokens; Genmo's published demo prompts tend to be 1 to 3 sentences long and to emphasize visual nouns and adjectives over discursive description.^[1]^[4]
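The settings described in this section can be collected into a short Diffusers sketch. It is illustrative rather than canonical: the helper names sampling_kwargs and generate are hypothetical, the 85-frame length is an arbitrary choice below the 163-frame maximum, and running generate requires the diffusers and torch packages, a suitable GPU, and a download of the weights.

```python
def sampling_kwargs(preview=False):
    """Step count and guidance scale for a full or a quick preview run."""
    if preview:
        return {"num_inference_steps": 28, "guidance_scale": 3.5}
    return {"num_inference_steps": 64, "guidance_scale": 4.5}


def generate(prompt, output_path="mochi.mp4", num_frames=85, preview=False):
    """Generate one clip with the bf16 weights and write it as a video file."""
    # Imported lazily so sampling_kwargs works without a GPU stack installed.
    import torch
    from diffusers import MochiPipeline
    from diffusers.utils import export_to_video

    pipe = MochiPipeline.from_pretrained(
        "genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()   # keep idle submodules off the GPU
    pipe.enable_vae_tiling()          # decode latents tile by tile

    frames = pipe(
        prompt,
        height=480,
        width=848,
        num_frames=num_frames,
        max_sequence_length=256,      # longer prompts are truncated
        **sampling_kwargs(preview),
    ).frames[0]
    return export_to_video(frames, output_path, fps=30)


if __name__ == "__main__":
    print(sampling_kwargs())
    print(sampling_kwargs(preview=True))
```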

For multi-GPU deployments, Diffusers supports splitting the Mochi transformer across devices with the device_map="auto" and max_memory={0: "24GB", 1: "24GB"} arguments on MochiTransformer3DModel.from_pretrained. This enables Mochi inference on two 24 GB consumer GPUs by sharding the transformer parameters, an option that would otherwise be foreclosed by the model's size at bfloat16.^[9] The reference repository documents an analogous multi-GPU split via its own pipeline factory.^[5]
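A minimal sketch of the two-GPU configuration follows. The helper names are hypothetical, and the subfolder="transformer" argument assumes the usual Diffusers repository layout for genmo/mochi-1-preview; loading requires the diffusers package and enough devices to hold the shards.

```python
def max_memory_map(num_gpus=2, per_gpu="24GB"):
    """Build the max_memory argument: one budget entry per GPU index."""
    return {index: per_gpu for index in range(num_gpus)}


def load_sharded_pipeline(model_id="genmo/mochi-1-preview", num_gpus=2):
    """Load the transformer sharded across GPUs and wrap it in a pipeline."""
    # Imported lazily so max_memory_map can be used on its own.
    from diffusers import MochiPipeline, MochiTransformer3DModel

    transformer = MochiTransformer3DModel.from_pretrained(
        model_id,
        subfolder="transformer",      # ASSUMPTION: standard Diffusers layout
        device_map="auto",
        max_memory=max_memory_map(num_gpus),
    )
    pipe = MochiPipeline.from_pretrained(model_id, transformer=transformer)
    pipe.enable_model_cpu_offload()
    pipe.enable_vae_tiling()
    return pipe


if __name__ == "__main__":
    print(max_memory_map())   # {0: '24GB', 1: '24GB'}
```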

What variants and integrations are available?

Genmo reference implementation

The reference code released alongside the weights is published at github.com/genmoai/mochi (originally genmoai/models) and provides a MochiSingleGPUPipeline and a multi-GPU pipeline that splits the DiT across devices. It exposes factories such as DitModelFactory, DecoderModelFactory, T5ModelFactory, and a linear_quadratic_schedule for sampling timesteps.^[4]^[5] Installation uses the uv Python package manager and depends on FFmpeg for video encoding.^[5]

Diffusers (MochiPipeline)

Hugging Face Diffusers ships a MochiPipeline for text-to-video generation that loads genmo/mochi-1-preview directly. The pipeline pairs MochiTransformer3DModel, AutoencoderKLMochi, a T5 encoder, and a FlowMatchEulerDiscreteScheduler, and supports both a full-precision run (around 42 GB VRAM) and a variant="bf16" configuration that fits in around 22 GB.^[9] Memory-saving features such as enable_model_cpu_offload() and enable_vae_tiling() are available, as is multi-GPU sharding via the device_map and max_memory arguments on from_pretrained.^[9]

Diffusers also documents a quantized configuration using bitsandbytes 8-bit weights for both the T5 encoder and the Mochi transformer, allowing a single Mochi pipeline to run with significantly reduced VRAM while accepting some quality trade-off.^[9] A from_single_file loader can ingest the repackaged Comfy-Org bf16 checkpoint (Comfy-Org/mochi_preview_repackaged) directly.^[9]

ComfyUI (native nodes and Kijai wrapper)

Mochi 1 was added to ComfyUI in two waves. Kijai's community wrapper ComfyUI-MochiWrapper provided the earliest end-to-end workflow, and native Mochi nodes followed in early November 2024 (announced November 4, 2024).^[7]^[8] The native integration supports multiple attention backends (including the efficient SDPA kernel and Flash Attention), allowing Mochi to run on a single 24 GB consumer GPU such as the NVIDIA RTX 4090.^[7]^[8]

The Comfy-Org organization on Hugging Face hosts repackaged weights for use with ComfyUI in two precision tiers:^[7]^[8]^[9]

mochi_preview_bf16.safetensors (bf16 weights, higher VRAM)
mochi_preview_fp8_scaled.safetensors (FP8-scaled weights for memory-constrained machines)

Text encoders are similarly available in t5xxl_fp16 and t5xxl_fp8_e4m3fn_scaled variants. The diffusion model is placed in ComfyUI/models/diffusion_models/, the encoder in ComfyUI/models/clip/, and the VAE in ComfyUI/models/vae/; the workflow then chains a checkpoint loader, prompt encoder, K sampler, empty Mochi latent video, and the VAE decoder.^[7]^[8]

How is Mochi 1 fine-tuned with LoRA?

On November 26, 2024, Genmo released a LoRA fine-tuning workflow as part of the reference repository, allowing users to adapt Mochi 1 on a single H100 GPU using small custom video datasets.^[10]^[11] Documentation in the demos/fine_tuner directory and downstream tutorials describe training LoRA (Low-Rank Adaptation) adapters that target the query, key, value, and output projection matrices of the DiT with learning rates in the 1e-4 to 2e-4 range.^[10]^[11] Compute platforms including Modal and Lambda published walkthroughs covering training on H100 and Grace Hopper GH200 hardware respectively.^[11]
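To give a sense of why LoRA adaptation fits on a single GPU, the sketch below counts the trainable weights added by low-rank adapters on four projections per block. Several inputs are assumptions rather than documented values: the rank of 16 is hypothetical, each targeted projection is treated as a square 3,072 x 3,072 visual-stream matrix, and text-stream projections are ignored.

```python
def lora_params_per_matrix(d_in, d_out, rank):
    """A rank-r adapter adds two factors: (d_in x r) and (r x d_out)."""
    return rank * (d_in + d_out)


def lora_trainable_params(rank, blocks=48, dim=3072, targets=4):
    """Adapter weights for `targets` projections in each of `blocks` blocks."""
    return blocks * targets * lora_params_per_matrix(dim, dim, rank)


if __name__ == "__main__":
    total = lora_trainable_params(rank=16)   # rank 16 is a hypothetical choice
    print(total)                             # 18874368
    print(total / 10e9)                      # fraction of a 10B-parameter base
```

Under these assumptions the adapters amount to under 0.2% of the base model's parameter count.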

Hosted endpoints

Beyond Genmo's own playground at genmo.com/play, third-party inference providers exposed Mochi 1 through hosted APIs. Replicate published the model as genmoai/mochi-1 and a separate genmoai/mochi-1-lora endpoint for fine-tuned variants.^[11] Tensorfuse and other deployment platforms documented end-to-end recipes for hosting Mochi via ComfyUI on cloud GPUs.^[7]

How does Mochi 1 compare to other video models?

Mochi 1 was one of four roughly contemporaneous large open-weights video diffusion releases in late 2024 and early 2025. The most direct comparators are CogVideoX from Zhipu AI / Tsinghua KEG, HunyuanVideo from Tencent, and the Open-Sora line from HPC-AI Tech. Public summaries and community comparisons drew the following high-level distinctions at the time of Mochi's release; absolute quality rankings are subjective and shift as later versions ship.^[1]^[13]

Model	Approx. parameters	License	Reported resolution / length at first release	Notable points
Mochi 1 preview	10B (DiT) + 362M (VAE)	Apache 2.0	480p (848x480), up to ~5.4 s at 30 fps	Single T5-XXL encoder; AsymmDiT with full 3D attention and 3D RoPE; LoRA fine-tuning kit released November 26, 2024
CogVideoX-5B	5B	Custom open license	720x480, 6 s at 8 fps	3D causal VAE; expert-transformer for joint text-video attention; image-to-video variant
HunyuanVideo	13B	Tencent Community License	720p, ~5 s	Largest contemporary open video DiT; dual-stream then single-stream design; restrictive non-Apache license
Open-Sora (early 2025)	scaled to ~11B by Open-Sora 2.0	Apache-style open	Multi-resolution / multi-length	Open replication line inspired by Sora; staged training across resolutions

Sources: Genmo blog,^[1] community comparison writeups,^[13] HunyuanVideo and CogVideoX release notes.^[13]

At launch, third-party commentary and crowd-sourced comparisons reported that Mochi 1 ranked near HunyuanVideo on text-to-video preference benchmarks, that CogVideoX was preferred for image-to-video tasks, and that LTX Video offered the fastest generation among the comparable open models.^[13] Mochi 1 was singled out for motion realism in several community evaluations, consistent with Genmo's own emphasis on motion fidelity and prompt adherence.^[1]^[13] Among the four, Mochi was the only model released under Apache 2.0 with full inference code at launch, which made it especially attractive for downstream experimentation and commercial fine-tuning compared to HunyuanVideo's non-permissive Tencent Community License.^[1]^[13]

Genmo also positioned Mochi 1 explicitly against closed commercial systems: the launch coverage framed it as a permissively licensed alternative to OpenAI's Sora, to Runway Gen-3 Alpha, to Luma AI's Dream Machine, to MiniMax's Hailuo, and to Kling (video generation), emphasizing open weights and Apache 2.0 use rather than feature parity.^[2]^[14]

What is Mochi 1 trained for, and what does it emphasize?

Genmo's launch communications repeatedly emphasized prompt adherence as a primary objective of the Mochi 1 design, alongside motion quality.^[1]^[9] In Genmo's words, the model "demonstrates exceptional alignment with textual prompts" and "generates smooth videos at 30 frames per second for durations up to 5.4 seconds".^[1] The architectural levers reported in the release notes that bear on prompt following include the use of a single high-capacity T5-XXL text encoder, the joint attention formulation that allows text tokens to interact with visual tokens at every layer, and the relatively large max_sequence_length=256 text context.^[1]^[9] The trade-off is also acknowledged: Genmo states that "Mochi 1 is also optimized for photorealistic styles so does not perform well with animated content".^[4]

Genmo did not publish a detailed training-data datasheet, dataset list, or training-compute disclosure for the Mochi 1 preview at release; the publicly available material is limited to the architectural description in the launch blog, the Hugging Face model card, and the GitHub README, and downstream summaries reiterate the same information.^[1]^[4]^[5] The model is described only as "trained entirely from scratch" on video data, with no further breakdown released.^[1] This stands in contrast to some peer releases that published partial dataset descriptions; the absence of a training datasheet has been noted in third-party reviews and is consistent with the preview status of the release.

Genmo's emphasis on prompt adherence is reflected in the company's choice of internal evaluation. The launch blog reports that Genmo's preliminary evaluation prioritized two axes: motion quality (whether motion looks physically plausible, whether objects retain identity through time, and whether camera motion behaves as expected) and prompt adherence (whether the model produces the requested subject, action, scene, and style).^[1] These two axes correspond to known failure modes of earlier video diffusion systems, in which short clips often look acceptable per frame but exhibit flickering, identity drift, or generic content that ignores the specific prompt.

The architectural rationale for prioritizing prompt adherence is concentrated in the asymmetric capacity allocation: by giving the visual stream nearly four times the parameter count of the text stream while still requiring all attention layers to read text tokens, the model is structurally biased toward grounding its visual output in the textual conditioning rather than producing a generic photorealistic clip independent of the prompt. The single high-capacity T5-XXL encoder, instead of a CLIP-style dual-encoder ensemble, is also reported by Genmo as a deliberate simplification that improves alignment between caption semantics and generated content.^[1]

What is Mochi 1 used for?

The intended applications cited in Genmo's launch materials and downstream coverage include:

Short-form video generation for creative tools, particularly clips of up to 5.4 seconds at 30 fps with realistic motion and camera dynamics.^[1]^[2]
Research and experimentation in open video diffusion, supported by the permissive Apache 2.0 license, full inference code, and LoRA fine-tuning tooling.^[1]^[5]^[10]
Downstream fine-tuning for character or style adaptation, exposed by the November 2024 LoRA fine-tuner and documented in tutorials from Lambda, Modal, and others.^[10]^[11]
Integration into desktop AI video workflows via the native and wrapper ComfyUI nodes, including pipelines that combine Mochi text-to-video with separate upscaling, interpolation, and editing nodes.^[7]^[8]
Hosted API access through Genmo's own playground and third-party platforms such as Replicate.^[2]^[11]

Is Mochi 1 open source?

Yes. Mochi 1 is released under the Apache 2.0 license, which Genmo describes as "a permissive Apache 2.0 license".^[1]^[4] Both personal and commercial use of the weights is permitted, including derivative works and commercial inference services, with no royalty or use-case restriction beyond the standard Apache 2.0 terms.^[1]^[2]^[4] The weights are distributed on Hugging Face at genmo/mochi-1-preview and the inference code on GitHub at genmoai/mochi.^[4]^[5] This permissive licensing was a central part of the model's positioning: among the major late-2024 open video models, Mochi was the only one released under Apache 2.0 with full inference code at launch, in contrast to HunyuanVideo's non-permissive Tencent Community License and the closed-API access of Sora, Runway Gen-3, and Kling.^[1]^[2]^[13]

Why was the release of Mochi 1 significant?

Mochi 1's release was significant for several intersecting reasons that go beyond the model's raw capability.

First, the combination of permissive Apache 2.0 licensing with a 10-billion-parameter video diffusion model marked a step change in the size of openly distributed video models. At the time of the October 2024 release, the largest openly available video diffusion model carried noticeably fewer parameters, and most other competitive systems were closed.^[1]^[2] Genmo's framing of Mochi 1 as "the largest video generative model ever openly released" reflected this gap, and the Apache 2.0 license made the weights immediately usable in commercial products without the conditions attached to Tencent's HunyuanVideo license or to Sora and Gen-3 API access.^[1]^[2]

Second, the asymmetric design choices popularized by AsymmDiT, in particular the joint multimodal attention with non-square QKV projections and the asymmetric encoder-decoder VAE, joined a small but growing set of architectural templates for billion-scale multimodal diffusion. The architecture's relationship to the MMDiT (Multimodal Diffusion Transformer) design used by Stable Diffusion 3 illustrates how the video diffusion community converged on joint text-visual attention as the default formulation by late 2024.^[1]^[9]

Third, the rapid arrival of the ComfyUI native nodes and the bf16/fp8 repackaged weights demonstrated how quickly an open-weights release could be operationalized for consumer-grade hardware. Within two to three weeks of launch, end users with a single 24 GB RTX 4090 could run Mochi locally through ComfyUI, an outcome that would have been technically impossible at launch without the community's quantization and attention-backend work.^[7]^[8] The downstream LoRA fine-tuner published by Genmo on November 26, 2024 closed a second important loop, enabling small-data adaptation of the base model on a single H100 GPU.^[10]^[11]

Fourth, Mochi 1's release expanded the empirical evidence base for what is and is not currently achievable by open video diffusion. The model's strengths in motion realism and prompt adherence, paired with its acknowledged weaknesses on animation and on resolutions above 480p, helped frame subsequent open-weights releases such as HunyuanVideo, the Wan family, and Open-Sora 2.0, all of which targeted Mochi as a baseline.^[13]

What are the limitations of Mochi 1?

Genmo and downstream evaluators have documented several limitations of the Mochi 1 preview release:

Resolution. The released model generates only 480p (848x480) video; the 720p Mochi 1 HD variant, promised in October 2024, had not been released as open weights by early 2026.^[1]^[4]^[12]
Clip length. Maximum length is bounded by the 163-frame context (about 5.4 seconds at 30 fps), with the Diffusers default at 85 frames and the reference repository typically demonstrating 31 to 163 frames.^[4]^[9]
VRAM footprint. Even with bfloat16 weights, single-GPU inference requires roughly 22 GB of VRAM in Diffusers and roughly 60 GB at full precision in Genmo's reference pipeline; the model is comfortable on H100 / A100 class hardware but requires aggressive offloading, FP8 quantization, or wrapper-level optimizations to run on a 24 GB consumer GPU such as an RTX 4090.^[4]^[7]^[8]^[9]
Style coverage. Genmo explicitly states that the model is optimized for photorealistic content and "does not perform well with animated content".^[4]
Motion artifacts under extreme dynamics. Community reviews and Genmo's own materials note minor warping in scenes with very large or rapid motion, an issue the unreleased HD model is intended to address.^[1]
No native image-to-video or video-to-video. The preview is text-to-video only; image-to-video has been a recurring community request on the Hugging Face discussion board.
Sparse training transparency. No training-data datasheet, total training-compute figure, or detailed evaluation protocol was published for the preview release, limiting independent auditing.^[1]^[4]
Delayed HD release. As tracked in the official genmoai/mochi repository, the 720p variant remained unreleased past the original end-of-2024 timeline; the open issue #132 on the repository documents the delay.^[12]

Mochi 1 sits within a rapidly expanding open video diffusion ecosystem. The most directly comparable open-weights video diffusion models include HunyuanVideo from Tencent, CogVideoX from Zhipu AI, and the Wan family represented in this wiki by Wan 2.1 and Wan 2.5 from Alibaba. The closed commercial comparators most often cited in coverage of the model are Sora and Sora 2 from OpenAI, the Veo family (Veo, Veo 2, Veo 3) from Google DeepMind, Runway Gen-3 Alpha and Runway Gen-4 from Runway, and Kling (video generation) from Kuaishou.^[1]^[2]^[13]

Architecturally, AsymmDiT inherits from the broader Diffusion Transformer (DiT) line and bears a family resemblance to the MMDiT (Multimodal Diffusion Transformer) design used in Stable Diffusion 3 and Stable Diffusion 3.5. The principal differences are the asymmetric allocation of capacity between the text and visual streams and the extension to full 3D attention over a video latent volume.^[1]^[9] The flow-matching training objective links Mochi to the broader flow matching literature, and its rotary position embedding (RoPE), generalized to three dimensions, echoes positional-encoding choices in other modern transformers.^[1]^[9]

The Mochi 1 release also fits into a wider pattern of open-weights releases in Text-to-video generation and broader AI Video Generation, a space whose ecosystem and history are summarized in those overview entries.

References

Genmo, "Mochi 1: A new SOTA in open text-to-video", Genmo Blog, 2024-10-22. https://www.genmo.ai/blog/mochi-1-a-new-sota-in-open-text-to-video. Accessed 2026-05-20.
Maria Deutscher, "Genmo introduces Mochi 1, an open-source text-to-video generation model", SiliconANGLE, 2024-10-22. https://siliconangle.com/2024/10/22/genmo-introduces-mochi-1-open-source-text-video-generation-model/. Accessed 2026-05-20.
Hugging Face, "Mochi 1 Preview (Diffusers documentation)", Hugging Face Diffusers, 2024. https://huggingface.co/docs/diffusers/api/pipelines/mochi. Accessed 2026-05-20.
Genmo, "genmo/mochi-1-preview model card", Hugging Face, 2024. https://huggingface.co/genmo/mochi-1-preview. Accessed 2026-05-20.
Genmo, "genmoai/mochi (README)", GitHub, 2024. https://github.com/genmoai/mochi. Accessed 2026-05-20.
CanvasBusinessModel, "Brief History of Genmo Company", CanvasBusinessModel, 2024. https://canvasbusinessmodel.com/blogs/brief-history/genmo-brief-history. Accessed 2026-05-20.
Jo Zhang, "Run Mochi in ComfyUI with consumer GPU", ComfyUI Blog, 2024-11-04. https://blog.comfy.org/p/mochi-1. Accessed 2026-05-20.
comfyanonymous, "Mochi Video Model (ComfyUI examples)", ComfyUI_examples, 2024. https://comfyanonymous.github.io/ComfyUI_examples/mochi/. Accessed 2026-05-20.
Hugging Face, "MochiPipeline API reference", Hugging Face Diffusers v0.38.0, 2025. https://huggingface.co/docs/diffusers/api/pipelines/mochi. Accessed 2026-05-20.
Genmo, "Mochi 1 LoRA Fine-tuner (README)", GitHub, 2024-11-26. https://github.com/genmoai/mochi/blob/main/demos/fine_tuner/README.md. Accessed 2026-05-20.
Modal Labs, "Create a custom video generator by fine-tuning a Mochi LoRA on Modal", Modal Blog, 2024. https://modal.com/blog/fine-tuning-mochi-video. Accessed 2026-05-20.
Genmo / community, "HD 720p Release timeline (Issue #132)", GitHub genmoai/mochi, 2025-02-18. https://github.com/genmoai/mochi/issues/132. Accessed 2026-05-20.
ComfyOnline, "Open source video generation models comparisons (CogVideoX, Mochi, LTX-Video, HunyuanVideo)", ComfyOnline Blog, 2025. https://www.comfyonline.app/blog/open-source-video-generation-models-comparisons. Accessed 2026-05-20.
Carl Franzen, "AI video startup Genmo launches Mochi 1, an open source rival to Runway, Kling, and others", VentureBeat, 2024-10-22. https://venturebeat.com/ai/video-ai-startup-genmo-launches-mochi-1-an-open-source-model-to-rival-runway-kling-and-others. Accessed 2026-06-28.
