Genmo
Last reviewed
Jun 4, 2026
Sources
18 citations
Review status
Source-backed
Revision
v1 · 1,995 words
Genmo is a San Francisco artificial intelligence company that builds open video generation models, best known for Mochi 1, a 10-billion-parameter open-weights text-to-video diffusion model released under the Apache 2.0 license on October 22, 2024. Founded in 2022 by brothers Paras Jain (CEO) and Ajay Jain (CTO), both of whom completed AI PhDs at the University of California, Berkeley, the company positions itself as a research lab pursuing video models as "world simulators." At launch Mochi 1 was described as the largest video generation model ever released openly, and Genmo presented it as an open-source counterweight to closed systems such as OpenAI's Sora, Runway's Gen-3, and Kuaishou's Kling. Genmo has raised a $28.4 million Series A round led by New Enterprise Associates (NEA), announced alongside the Mochi 1 preview.
Genmo operates at the intersection of generative AI and video generation, focusing on text-to-video synthesis. Unlike most leading video models, which ship as closed, hosted services, Genmo releases its flagship model's weights, architecture, and training-adjacent code under a permissive license, a strategy it borrows from the broader open source AI movement that produced models like Stable Diffusion and Meta's Llama. The company frames its long-term mission as putting "a tiny filmmaker in the pockets of a billion people" and, more abstractly, as "unlocking the right brain of artificial general intelligence" through models that can simulate physical and imagined scenes.
| Attribute | Detail |
|---|---|
| Company | Genmo, Inc. |
| Website | genmo.ai |
| Founded | 2022 |
| Headquarters | San Francisco, California |
| Founders | Paras Jain (CEO), Ajay Jain (CTO) |
| Sector | Generative AI, text-to-video |
| Flagship model | Mochi 1 (open weights, Apache 2.0) |
| Total funding | $28.4 million (Series A) |
| Lead investor | New Enterprise Associates (NEA) |
Genmo was started in 2022 by Paras Jain and Ajay Jain, brothers who had both pursued doctoral research in artificial intelligence at UC Berkeley. According to company accounts, the first line of code was committed around Christmas 2022, and the company shipped an early image-animation product in January 2023. The company's advisors include several well-known Berkeley researchers and entrepreneurs under whom the founders trained: Ion Stoica (co-founder of Databricks and Anyscale), Pieter Abbeel (robotics and reinforcement learning), and Joseph Gonzalez. Genmo says its team's research has accumulated more than 50,000 academic citations.
Paras Jain did his PhD at UC Berkeley advised by Ion Stoica and Joseph Gonzalez, working in the RISELab, BAIR, and Berkeley DeepDrive labs on machine learning systems, including work on memory-efficient neural network training ("Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization"). Before his PhD he was one of the founding engineers at DeepScale, an autonomous-driving perception startup acquired by Tesla in October 2019; he is named as a co-inventor on a Tesla machine-learning patent filed shortly after that acquisition.
Ajay Jain also earned his PhD at UC Berkeley and is a co-author of two foundational generative-modeling papers. He co-wrote "Denoising Diffusion Probabilistic Models" (DDPM) with Jonathan Ho and Pieter Abbeel, published at NeurIPS 2020, which became one of the seminal works underpinning modern image, audio, and 3D diffusion systems. He also co-authored "DreamFusion: Text-to-3D using 2D Diffusion" (with Ben Poole, Jonathan T. Barron, and Ben Mildenhall), presented at ICLR 2023, an early and influential text-to-3D method. This diffusion-modeling lineage is central to Genmo's identity and to Mochi 1's design.
Before Mochi 1, Genmo operated a consumer-facing creative platform for generating images, video, and 3D content from natural-language prompts. The company describes launching an image-to-video capability ("Genmo Alpha") in 2023, then a dedicated text-to-video model branded Replay. Genmo introduced Replay v0.1 as a text-to-video generator, added "Replay Plugins" with camera controls in September 2023, and shipped Replay v0.2 on October 26, 2023, advertising image-to-video support, roughly 3x longer clips, and about 2.7x higher resolution than the prior version. By the time of the Mochi 1 launch, Genmo stated that its existing closed image and video products already had more than 2 million users.
On October 22, 2024, Genmo released a research preview of Mochi 1 and simultaneously announced a $28.4 million Series A financing led by NEA. Coverage in VentureBeat and SiliconANGLE described the release as a direct open-source challenge to closed video models from Runway, Luma AI, Kuaishou (Kling), MiniMax (Hailuo), and OpenAI (Sora). The preview shipped the 480p base model, with a higher-resolution "Mochi 1 HD" promised for later. In early November 2024, the model gained optimized ComfyUI support that let it run on consumer GPUs (under roughly 20 GB of VRAM), and it was integrated into Hugging Face's Diffusers library, broadening access well beyond the data-center hardware the original release required.
Genmo's disclosed funding consists of a single Series A round. Public databases (Crunchbase, Tracxn, PitchBook) report a total of $28.4 million raised. NEA's own announcement rounded the figure to roughly $30 million. The round closed concurrently with the Mochi 1 preview on October 22, 2024.
| Round | Date | Amount | Lead | Other investors |
|---|---|---|---|---|
| Series A | October 22, 2024 | $28.4 million | New Enterprise Associates (NEA) | The House Fund, Gold House Ventures, WndrCo, Eastlink Capital Partners, Essence VC |
Note: the lead investor's blog described the round as "$30M," while financial databases and the company's launch materials cite $28.4 million; the two figures refer to the same Series A.
Mochi 1 is Genmo's open-weights text-to-video model and the company's most significant public release. It generates short video clips from text prompts and, at 10 billion parameters, was billed at launch as the largest openly released video generation model. The weights and code are distributed under the Apache 2.0 license for both personal and commercial use, via Hugging Face (genmo/mochi-1-preview), the GitHub repository (genmoai/mochi, originally genmoai/models), and a magnet link. Genmo also runs a free hosted playground at genmo.ai/play.
Mochi 1 is built on a transformer-based diffusion backbone that Genmo calls the Asymmetric Diffusion Transformer (AsymmDiT). The "asymmetric" name refers to how the network divides its capacity between the visual and text modalities: it allocates nearly four times as many parameters to the visual stream as to the text stream by giving video a larger hidden dimension (3,072 versus 1,536). Because the weight count of a dense layer grows roughly with the square of its width, doubling the width is consistent with a gap of about that size. AsymmDiT jointly attends over text and visual tokens with multi-modal self-attention while learning separate MLP layers for each modality, an approach the team likens to Stable Diffusion 3. Because the two streams use different widths, the model relies on non-square QKV and output projection layers to bring them into a shared attention space.
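The role of the non-square projections can be illustrated with a toy sketch. This is not Genmo's implementation: it uses a single attention head, tiny widths (4 for text, 8 for video) that only mirror the 1:2 ratio of the real model, random weights, and the assumption that the shared attention width equals the visual width. All function and parameter names are hypothetical.

```python
import math
import random

TEXT_DIM, VISUAL_DIM, ATTN_DIM = 4, 8, 8  # toy sizes; ATTN_DIM == VISUAL_DIM is an assumption


def rand_matrix(d_in, d_out, rng):
    """Random (d_in x d_out) weight matrix stored as a list of rows."""
    return [[rng.gauss(0.0, d_in ** -0.5) for _ in range(d_out)] for _ in range(d_in)]


def linear(rows, weight):
    """Apply a (d_in x d_out) weight matrix to each row vector in `rows`."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns] for row in rows]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(text, visual, p):
    """Single-head self-attention over the concatenated text and visual tokens."""
    # Each stream has its own projections; the text ones are non-square (4 -> 8).
    q = linear(text, p["text_q"]) + linear(visual, p["vis_q"])
    k = linear(text, p["text_k"]) + linear(visual, p["vis_k"])
    v = linear(text, p["text_v"]) + linear(visual, p["vis_v"])
    scale = ATTN_DIM ** -0.5
    mixed = []
    for q_row in q:
        weights = softmax([scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k])
        mixed.append([sum(w * v_row[c] for w, v_row in zip(weights, v)) for c in range(ATTN_DIM)])
    # Separate output projections return each stream to its own width.
    n_text = len(text)
    return linear(mixed[:n_text], p["text_out"]), linear(mixed[n_text:], p["vis_out"])


rng = random.Random(0)
params = {}
for name, width in (("text", TEXT_DIM), ("vis", VISUAL_DIM)):
    for proj in ("q", "k", "v"):
        params[f"{name}_{proj}"] = rand_matrix(width, ATTN_DIM, rng)
    params[f"{name}_out"] = rand_matrix(ATTN_DIM, width, rng)

text_in = [[rng.gauss(0, 1) for _ in range(TEXT_DIM)] for _ in range(2)]
visual_in = [[rng.gauss(0, 1) for _ in range(VISUAL_DIM)] for _ in range(3)]
text_out, visual_out = joint_attention(text_in, visual_in, params)
print(len(text_out), len(text_out[0]), len(visual_out), len(visual_out[0]))  # 2 4 3 8
```

Every token attends to every other token regardless of modality, yet each stream leaves the layer at its original width, which is what lets the text side stay narrow.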
| AsymmDiT specification | Value |
|---|---|
| Total parameters | 10 billion |
| Layers | 48 |
| Attention heads | 24 |
| Visual hidden dimension | 3,072 |
| Text hidden dimension | 1,536 |
| Visual tokens | 44,520 |
| Text tokens | 256 |
| Text encoder | single T5-XXL |
Prompts are encoded with a single T5-XXL language model rather than a large language model of the kind used for chat. This is a deliberately lean choice on the text side, with the bulk of the model's parameters devoted to modeling pixels and motion.
Mochi 1 ships with an open-source video variational autoencoder, AsymmVAE, which compresses video into a compact latent space that the diffusion transformer operates on. It uses an asymmetric encoder/decoder design (the encoder is lighter than the decoder) and achieves an aggressive compression ratio so that long, high-frame-rate clips remain tractable.
| AsymmVAE specification | Value |
|---|---|
| Parameters | 362 million |
| Encoder base channels | 64 |
| Decoder base channels | 128 |
| Latent dimension | 12 channels |
| Spatial compression | 8x8 |
| Temporal compression | 6x |
| Overall compression | 128x |
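The compression factors above can be connected to the visual token count listed in the AsymmDiT table with a short calculation. The sketch below rests on several assumptions that are not stated in this article: a clip of 163 frames (about 5.4 seconds at 30 frames per second) at 848 by 480 pixels, a causal encoder that keeps the first frame and then emits one latent frame per six input frames, and a 2x2 spatial patch size when latents are turned into transformer tokens. The function names are hypothetical.

```python
def latent_grid(frames, height, width, spatial=8, temporal=6):
    """Latent (frames, height, width) after 8x8 spatial and 6x temporal compression.

    Assumes causal temporal compression: the first frame is kept on its own.
    """
    latent_frames = 1 + (frames - 1) // temporal
    return latent_frames, height // spatial, width // spatial


def visual_token_count(frames, height, width, patch=2):
    """Number of transformer tokens, assuming non-overlapping patch x patch latent patches."""
    t, h, w = latent_grid(frames, height, width)
    return t * (h // patch) * (w // patch)


print(latent_grid(163, 480, 848))         # (28, 60, 106)
print(visual_token_count(163, 480, 848))  # 44520
```

Under these assumptions the result matches the 44,520 visual tokens in the specification table, which suggests how a roughly five-second clip stays within a single attention window.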
The Mochi 1 preview generates 480p video at 30 frames per second for durations up to 5.4 seconds, with Genmo emphasizing smooth, high-frame-rate motion and strong prompt adherence. The company reported that the model performs well on photorealistic content but is not tuned for animated or stylized output, and it documented that extreme motion can occasionally produce warping or distortion. Genmo evaluated Mochi 1 on two axes: prompt adherence, scored by a vision-language-model judge following an evaluation protocol similar to the one used for DALL-E 3 (using Gemini 1.5 Pro as the judge), and motion quality, scored with human-preference Elo ratings in the style of LMSYS arena comparisons.
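For readers unfamiliar with Elo-style scoring, the following is a minimal sketch of the standard Elo update applied to pairwise human preferences. It is not Genmo's evaluation code: the K-factor of 32, the starting rating of 1000, and the model names are assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, a_preferred, k=32.0):
    """Return both ratings after one human comparison (k=32 is an assumed K-factor)."""
    delta = k * ((1.0 if a_preferred else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models, both starting at an assumed rating of 1000.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = update_elo(
    ratings["model_a"], ratings["model_b"], a_preferred=True
)
print(ratings)  # {'model_a': 1016.0, 'model_b': 984.0}
```

Repeating this update over many side-by-side comparisons yields a ranking in which a larger rating gap corresponds to a higher predicted preference rate.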
The original reference implementation is demanding: running Mochi 1 on a single GPU requires roughly 60 GB of VRAM, and Genmo recommended at least one NVIDIA H100. The repository also supports multi-GPU inference. Community and ecosystem work quickly lowered the barrier: the ComfyUI integration runs the model in under 20 GB of VRAM (fitting cards like the RTX 4090), and the Diffusers integration offers a bfloat16 variant that fits in roughly 22 GB, versus about 42 GB at full precision.
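The gap between the bfloat16 and full-precision figures is roughly what weight storage alone would predict. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes "full precision" means 32-bit floats, counts only the 10 billion transformer parameters, uses decimal gigabytes, and ignores the text encoder, the VAE, activations, and any offloading.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(num_params, dtype):
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "bfloat16"):
    print(dtype, weight_memory_gb(10_000_000_000, dtype))
# float32 40.0
# bfloat16 20.0
```

The estimates of 40 GB and 20 GB sit just under the quoted 42 GB and 22 GB, leaving a couple of gigabytes for everything other than the transformer weights.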
At launch Genmo described Mochi 1 HD as a forthcoming higher-fidelity release targeting 720p output with improved motion smoothness, to follow the 480p preview. The company indicated it intended to complete the full model after the initial preview, and it has continued to treat Mochi 1 as a "living checkpoint" expected to benefit from community fine-tuning, quantization, and adapters.
Mochi 1 arrived at a moment when the strongest text-to-video systems (OpenAI's Sora, Runway Gen-3 Alpha, Luma AI's Dream Machine, Kuaishou's Kling, and MiniMax's Hailuo) were all closed and access-gated. By open-sourcing a 10-billion-parameter model under Apache 2.0, Genmo gave researchers and developers direct access to weights they could inspect, fine-tune, and run locally, which is uncommon at the frontier of video generation. Press coverage framed Mochi 1 as the open-source community's most credible answer to those closed models, comparable to the role Stable Diffusion played for image generation. The model was rapidly adopted into open tooling such as ComfyUI and Hugging Face Diffusers, spawning fine-tunes, quantizations, and adapters within weeks of release. As research artifacts, AsymmVAE and AsymmDiT also contributed reusable design ideas to the wider field of open video-generation research, particularly the asymmetric allocation of capacity between modalities and the aggressively compressed video latent space.