Genmo

Updated Jul 17, 2026

Genmo is a San Francisco-based artificial intelligence company that builds open video generation models. It is best known for Mochi 1, a 10-billion-parameter open-weights text-to-video diffusion model released under the Apache 2.0 license on October 22, 2024.^[1] Founded in 2022 by brothers Paras Jain (CEO) and Ajay Jain (CTO), both of whom completed AI PhDs at the University of California, Berkeley, the company positions itself as a research lab pursuing video models as "world simulators."^[7] At launch Mochi 1 was described as the largest video generation model ever released openly, and Genmo presented it as an open-source counterweight to closed systems such as OpenAI's Sora, Runway's Gen-3, and Kuaishou's Kling.^[2] Genmo raised a $28.4 million Series A round led by New Enterprise Associates (NEA), announced alongside the Mochi 1 preview.^[8]

Overview

Genmo operates at the intersection of generative AI and video generation, focusing on text-to-video synthesis. Unlike most leading video models, which ship as closed, hosted services, Genmo releases its flagship model's weights, architecture, and training-adjacent code under a permissive license, a strategy it borrows from the broader open source AI movement that produced models like Stable Diffusion and Meta's Llama. The company frames its long-term mission as putting "a tiny filmmaker in the pockets of a billion people" and, more abstractly, as "unlocking the right brain of artificial general intelligence" through models that can simulate physical and imagined scenes.^[7]

Attribute	Detail
Company	Genmo, Inc.
Website	genmo.ai
Founded	2022
Headquarters	San Francisco, California
Founders	Paras Jain (CEO), Ajay Jain (CTO)
Sector	Generative AI, text-to-video
Flagship model	Mochi 1 (open weights, Apache 2.0)^[1]
Total funding	$28.4 million (Series A)^[8]
Lead investor	New Enterprise Associates (NEA)^[8]

History

Founding

Genmo was started in 2022 by Paras Jain and Ajay Jain, brothers who had both pursued doctoral research in artificial intelligence at UC Berkeley.^[10]^[11] According to company accounts, the first line of code was committed around Christmas 2022, and the company shipped an early image-animation product in January 2023.^[7] Several well-known Berkeley researchers and entrepreneurs who trained the founders now advise the company, including Ion Stoica (co-founder of Databricks and Anyscale), Pieter Abbeel (robotics and reinforcement learning), and Joseph Gonzalez.^[7] Genmo says its team's research has accumulated more than 50,000 academic citations.^[7]

Paras Jain did his PhD at UC Berkeley advised by Ion Stoica and Joseph Gonzalez, working in the RISELab, BAIR, and Berkeley DeepDrive labs on machine learning systems, including work on memory-efficient neural network training ("Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization").^[10] Before his PhD he was one of the founding engineers at DeepScale, an autonomous-driving perception startup acquired by Tesla in October 2019;^[14] he is named as a co-inventor on a Tesla machine-learning patent filed shortly after that acquisition.^[10]

Ajay Jain also earned his PhD at UC Berkeley and is a co-author of two foundational generative-modeling papers.^[11] He co-wrote "Denoising Diffusion Probabilistic Models" (DDPM) with Jonathan Ho and Pieter Abbeel, published at NeurIPS 2020, which became one of the seminal works underpinning modern image, audio, and 3D diffusion systems.^[12] He also co-authored "DreamFusion: Text-to-3D using 2D Diffusion" (with Ben Poole, Jonathan T. Barron, and Ben Mildenhall), presented at ICLR 2023, an early and influential text-to-3D method.^[13] This diffusion-modeling lineage is central to Genmo's identity and to Mochi 1's design.

Early products (2023)

Before Mochi 1, Genmo operated a consumer-facing creative platform for generating images, video, and 3D content from natural-language prompts. The company describes launching an image-to-video capability ("Genmo Alpha") in 2023, then a dedicated text-to-video model branded Replay.^[16] Genmo introduced Replay v0.1 as a text-to-video generator, added "Replay Plugins" with camera controls in September 2023, and shipped Replay v0.2 on October 26, 2023, advertising image-to-video support, roughly 3x longer clips, and about 2.7x higher resolution than the prior version.^[16] By the time of the Mochi 1 launch, Genmo stated that its existing closed image and video products already had more than 2 million users.^[2]

Mochi 1 launch and Series A (2024)

On October 22, 2024, Genmo released a research preview of Mochi 1 and simultaneously announced a $28.4 million Series A financing led by NEA.^[1]^[8] Coverage in VentureBeat and SiliconANGLE described the release as a direct open-source challenge to closed video models from Runway, Luma AI, Kuaishou (Kling), MiniMax (Hailuo), and OpenAI (Sora).^[2]^[3] The preview shipped the 480p base model, with a higher-resolution "Mochi 1 HD" promised for later.^[1] In early November 2024, the model gained optimized ComfyUI support that let it run on consumer GPUs (under roughly 20 GB of VRAM),^[17] and it was integrated into Hugging Face's Diffusers library, broadening access well beyond the data-center hardware the original release required.

Funding

Genmo's disclosed funding consists of a single Series A round. Public databases (Crunchbase, Tracxn, PitchBook) report a total of $28.4 million raised.^[8]^[9] NEA's own announcement rounded the figure to roughly $30 million.^[6] The round was announced concurrently with the Mochi 1 preview on October 22, 2024.^[8]

Round	Date	Amount	Lead	Other investors
Series A	October 22, 2024	$28.4 million^[8]	New Enterprise Associates (NEA)	The House Fund, Gold House Ventures, WndrCo, Eastlink Capital Partners, Essence VC

Note: the lead investor's blog described the round as "$30M,"^[6] while financial databases and the company's launch materials cite $28.4 million;^[8] the two figures refer to the same Series A.

Mochi 1

Mochi 1 is Genmo's open-weights text-to-video model and the company's most significant public release. It generates short video clips from text prompts and, at 10 billion parameters, was billed at launch as the largest openly released video generation model.^[2] The weights and code are distributed under the Apache 2.0 license for both personal and commercial use, via Hugging Face (genmo/mochi-1-preview), the GitHub repository (genmoai/mochi, originally genmoai/models), and a magnet link.^[4]^[5] Genmo also runs a free hosted playground at genmo.ai/play.^[1]

Architecture: AsymmDiT

Mochi 1 is built on a transformer-based diffusion backbone that Genmo calls the Asymmetric Diffusion Transformer (AsymmDiT).^[1] The "asymmetric" name refers to how the network divides its capacity between the visual and text modalities: it allocates nearly four times as many parameters to the visual stream as to the text stream, using a larger hidden dimension for video. AsymmDiT jointly attends over text and visual tokens with multi-modal self-attention while learning separate MLP layers for each modality, an approach the team likens to Stable Diffusion 3.^[1] Because the two streams use different widths, the model relies on non-square QKV and output projection layers to bring them into a shared attention space.

AsymmDiT specification	Value
Total parameters	10 billion^[5]
Layers	48
Attention heads	24
Visual hidden dimension	3,072
Text hidden dimension	1,536
Visual tokens	44,520
Text tokens	256
Text encoder	single T5-XXL

Prompts are encoded with a single T5-XXL language model, rather than the combination of several pretrained text encoders that many modern diffusion models use.^[1] This is a deliberately lean choice on the text side, with the bulk of the model's parameters devoted to modeling pixels and motion.
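
The role of the non-square projections described above can be illustrated with a shape-level sketch. The Python below tracks tensor shapes only, using the dimensions from the specification table. The `Linear` stand-in, the function name, and the assumption that the shared attention width equals the visual width (3,072) are illustrative and are not taken from Genmo's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Linear:
    """Shape-only stand-in for a dense layer (no weights are allocated)."""
    in_features: int
    out_features: int

    def __call__(self, shape: tuple) -> tuple:
        tokens, width = shape
        if width != self.in_features:
            raise ValueError(f"expected width {self.in_features}, got {width}")
        return (tokens, self.out_features)

    @property
    def is_square(self) -> bool:
        return self.in_features == self.out_features


# Values from the AsymmDiT specification table.
VISUAL_DIM, TEXT_DIM = 3072, 1536
NUM_HEADS = 24
VISUAL_TOKENS, TEXT_TOKENS = 44_520, 256

# Assumption: the shared attention space is as wide as the visual stream.
ATTN_DIM = VISUAL_DIM
HEAD_DIM = ATTN_DIM // NUM_HEADS

# One projection per modality; Q, K and V would each use a layer of this shape.
proj_visual = Linear(VISUAL_DIM, ATTN_DIM)  # square
proj_text = Linear(TEXT_DIM, ATTN_DIM)      # non-square: widens text tokens
out_visual = Linear(ATTN_DIM, VISUAL_DIM)   # square
out_text = Linear(ATTN_DIM, TEXT_DIM)       # non-square: narrows text tokens


def trace_joint_attention() -> dict:
    """Follow the shapes through one joint self-attention step."""
    v = proj_visual((VISUAL_TOKENS, VISUAL_DIM))
    t = proj_text((TEXT_TOKENS, TEXT_DIM))
    # Both streams now share a width, so they can be concatenated along the
    # token axis and attended over jointly.
    joint = (v[0] + t[0], ATTN_DIM)
    return {
        "joint_sequence": joint,
        "visual_out": out_visual((VISUAL_TOKENS, ATTN_DIM)),
        "text_out": out_text((TEXT_TOKENS, ATTN_DIM)),
    }


print(trace_joint_attention(), HEAD_DIM)
```

Under these assumptions the joint sequence has 44,776 tokens at 128 dimensions per head, and each stream returns to its own width before its modality-specific MLP.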

Video VAE (AsymmVAE)

Mochi 1 ships with an open-source video variational autoencoder, AsymmVAE, which compresses video into a compact latent space that the diffusion transformer operates on.^[1] It uses an asymmetric encoder/decoder design (the encoder is lighter than the decoder) and achieves an aggressive compression ratio so that long, high-frame-rate clips remain tractable.

AsymmVAE specification	Value
Parameters	362 million^[5]
Encoder base channels	64
Decoder base channels	128
Latent dimension	12 channels
Spatial compression	8x8
Temporal compression	6x
Overall compression	128x
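
The spatial and temporal factors in the table can be related to the visual-token count listed in the AsymmDiT specification. The following Python is a back-of-the-envelope sketch, not Genmo's implementation. It assumes an 848x480 frame size, one extra leading frame that is encoded on its own, and 2x2 spatial patching of the latent grid. None of these three values is stated in this article; they are assumptions chosen because they reproduce the 44,520-token figure.

```python
def latent_grid(frames, height, width, temporal=6, spatial=8):
    """Latent (frames, height, width) under the table's compression factors.

    Assumption: the first frame is encoded on its own and every later latent
    frame covers `temporal` input frames.
    """
    if (frames - 1) % temporal or height % spatial or width % spatial:
        raise ValueError("input size is not aligned to the compression factors")
    return ((frames - 1) // temporal + 1, height // spatial, width // spatial)


def visual_tokens(grid, patch=2):
    """Token count after grouping latents into patch x patch spatial blocks."""
    t, h, w = grid
    return t * (h // patch) * (w // patch)


FPS = 30          # from the article: 30 frames per second
DURATION_S = 5.4  # from the article: clips of up to 5.4 seconds

frames = round(FPS * DURATION_S) + 1  # 162 frames plus the assumed leading frame
grid = latent_grid(frames, height=480, width=848)  # assumed 848x480 frames
tokens = visual_tokens(grid)                       # assumed 2x2 patches

print(grid, tokens)  # (28, 60, 106) 44520
```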

Output and capabilities

The Mochi 1 preview generates 480p video at 30 frames per second for durations up to 5.4 seconds, with Genmo emphasizing smooth, high-frame-rate motion and strong prompt adherence.^[1] The company reported that the model performs well on photorealistic content but is not tuned for animated or stylized output, and it documented that extreme motion can occasionally produce warping or distortion.^[4] Genmo evaluated Mochi 1 on two axes. Prompt adherence was scored by a vision-language-model judge (Gemini 1.5 Pro), following a protocol similar to the one used to evaluate DALL-E 3. Motion quality was scored with human-preference Elo ratings in the style of LMSYS arena comparisons.^[1]
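
For readers unfamiliar with Elo ratings, the sketch below shows the standard update applied to pairwise human preferences. It is a generic illustration and not Genmo's evaluation code. The model names, the starting rating of 1000, and the K-factor of 32 are hypothetical choices, and arena-style leaderboards may fit ratings differently.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Shift both ratings after one pairwise preference vote."""
    surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * surprise
    ratings[loser] -= k * surprise


# Hypothetical votes: each tuple is (preferred clip's model, other model).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]

for winner, loser in votes:
    update_elo(ratings, winner, loser)

print(ratings)
```

An upset (a win by the lower-rated model) moves the ratings more than an expected result does, which is why the third vote in this example shifts the ratings further than the second.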

Hardware requirements

The original reference implementation is demanding: running Mochi 1 on a single GPU requires roughly 60 GB of VRAM, and Genmo recommended at least one NVIDIA H100.^[5] The repository also supports multi-GPU inference. Community and ecosystem work quickly lowered the barrier: the ComfyUI integration runs the model in under 20 GB of VRAM (fitting cards like the RTX 4090),^[17] and the Diffusers integration offers a bfloat16 variant that fits in roughly 22 GB, versus about 42 GB at full precision.
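
The gap between the two Diffusers figures is roughly what weight storage alone would predict. The estimate below is a minimal sketch that counts only the 10 billion diffusion-transformer parameters, at 4 bytes each for float32 and 2 bytes each for bfloat16. It ignores the VAE, the text encoder, activations, and any offloading, so it is not a measurement of actual VRAM use.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_gb(num_params, dtype):
    """Storage for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


DIT_PARAMS = 10_000_000_000  # 10 billion, from the AsymmDiT specification

fp32_gb = weight_gb(DIT_PARAMS, "float32")
bf16_gb = weight_gb(DIT_PARAMS, "bfloat16")

print(fp32_gb, bf16_gb, fp32_gb - bf16_gb)  # 40.0 20.0 20.0
```

Halving the precision saves about 20 GB, which matches the difference between the roughly 42 GB and roughly 22 GB figures quoted above.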

Mochi 1 HD

At launch Genmo described Mochi 1 HD as a forthcoming higher-fidelity release targeting 720p output with improved motion smoothness, to follow the 480p preview.^[1] The company indicated it intended to complete the full model after the initial preview, and it has continued to treat Mochi 1 as a "living checkpoint" expected to benefit from community fine-tuning, quantization, and adapters.

Significance

Mochi 1 arrived at a moment when the strongest text-to-video systems (OpenAI's Sora, Runway Gen-3 Alpha, Luma AI's Dream Machine, Kuaishou's Kling, and MiniMax's Hailuo) were all closed and access-gated.^[2] By releasing a 10-billion-parameter model under Apache 2.0, Genmo gave researchers and developers direct access to weights they could inspect, fine-tune, and run locally, which is uncommon at the frontier of video generation. Press coverage framed Mochi 1 as the open-source community's most credible answer to those closed models, comparable to the role Stable Diffusion played for image generation.^[18] The model was rapidly adopted into open tooling such as ComfyUI and Hugging Face Diffusers, spawning fine-tunes, quantizations, and adapters within weeks of release.^[17] As research artifacts, AsymmVAE and AsymmDiT also contributed reusable design ideas to the wider field of open video generation research, particularly the asymmetric allocation of capacity between modalities and the aggressively compressed video latent space.

References

[1] "Mochi 1: A new SOTA in open text-to-video." Genmo Blog, October 22, 2024. https://www.genmo.ai/blog/mochi-1-a-new-sota-in-open-text-to-video
[2] "AI video startup Genmo launches Mochi 1, an open source model to rival Runway, Kling, and others." VentureBeat, October 22, 2024. https://venturebeat.com/ai/video-ai-startup-genmo-launches-mochi-1-an-open-source-model-to-rival-runway-kling-and-others
[3] "Genmo introduces Mochi 1, an open-source text-to-video generation model." SiliconANGLE, October 22, 2024. https://siliconangle.com/2024/10/22/genmo-introduces-mochi-1-open-source-text-video-generation-model/
[4] "genmo/mochi-1-preview." Hugging Face model card, 2024. https://huggingface.co/genmo/mochi-1-preview
[5] "genmoai/mochi: The best OSS video generation models, created by Genmo." GitHub, 2024. https://github.com/genmoai/mochi
[6] "Genmo's Open-Source GenAI Model Aims to Power the Future of Video." NEA Blog, October 2024. https://www.nea.com/blog/genmos-open-source-genai-model-aims-to-power-the-future-of-video
[7] "Genmo. The best open video generation models. (About)." Genmo, 2024. https://www.genmo.ai/about
[8] "Series A - Genmo." Crunchbase Funding Round Profile, October 2024. https://www.crunchbase.com/funding_round/genmo-series-a--957a2a32
[9] "Genmo - 2026 Company Profile, Team, Funding & Competitors." Tracxn, 2026. https://tracxn.com/d/companies/genmo
[10] "Paras Jain, ML Systems researcher." Personal website, 2024. https://www.parasjain.com/
[11] "Ajay Jain." Personal website, 2024. https://www.ajayjain.net/
[12] "Denoising Diffusion Probabilistic Models (DDPM)." Jonathan Ho, Ajay Jain, Pieter Abbeel, NeurIPS 2020. https://arxiv.org/abs/2006.11239
[13] "DreamFusion: Text-to-3D using 2D Diffusion." Ben Poole, Ajay Jain, Jonathan T. Barron, Ben Mildenhall, ICLR 2023. https://arxiv.org/abs/2209.14988
[14] "Tesla acquires computer vision startup DeepScale in push toward autonomy." TechCrunch, October 1, 2019. https://techcrunch.com/2019/10/01/tesla-acquires-computer-vision-startup-deepscale-in-push-towards-autonomy/
[15] "DeepScale." Wikipedia, accessed 2026. https://en.wikipedia.org/wiki/DeepScale
[16] "Meet Replay, the next generation in AI video." Genmo Blog, October 26, 2023. https://blog.genmo.ai/log/replay-ai-video
[17] "Run Mochi in ComfyUI with consumer GPU." Comfy.org Blog, November 2024. https://blog.comfy.org/p/mochi-1
[18] "Meet Mochi-1, the latest free and open-source AI video model." Tom's Guide, October 2024. https://www.tomsguide.com/ai/meet-mochi-1-the-latest-free-and-open-source-ai-video-model
