Open-Sora
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,106 words
Open-Sora is an open-source text-to-video diffusion project initiated in March 2024 by Singapore-based startup HPC-AI Tech (the team behind the Colossal-AI distributed training framework) as a public attempt to approximate the capabilities of OpenAI's Sora using openly released code, training recipes, and pre-trained weights.[1][2] The project is distributed under the Apache 2.0 licence at the GitHub repository hpcaitech/Open-Sora and has progressed through five major releases (1.0, 1.1, 1.2, 1.3, and 2.0) between March 2024 and March 2025, evolving from a 700-million-parameter spatial-temporal diffusion transformer that produced two-second 512x512 clips into an 11-billion-parameter model that the developers report achieves quality close to commercial systems at a single-run training cost of roughly $200,000.[1][3][4] Open-Sora is one of the most widely cited open replication efforts targeting Sora, alongside the unrelated "Open-Sora Plan" maintained by a group at Peking University, and serves both as a research artefact (the technical reports document data curation, architecture, and training schedules in considerable detail) and as a product demo for HPC-AI Tech's Colossal-AI training stack.[1][5][6] The project's distinguishing feature is the deliberate transparency of its training recipe, including dataset composition, GPU-hour budgets, and dollar costs, which is unusual in a field otherwise dominated by closed commercial models.[3][4]
| Field | Value |
|---|---|
| Project name | Open-Sora |
| Developer | HPC-AI Tech (Colossal-AI team) |
| Initial release | 17 March 2024 (v1.0) |
| Latest major release | 12 March 2025 (v2.0) |
| License | Apache 2.0 |
| Repository | github.com/hpcaitech/Open-Sora |
| Architecture | Spatial-Temporal Diffusion Transformer (v1.0 to v1.2); MMDiT-style dual/single stream DiT (v2.0) |
| Parameters (latest) | 11 billion (v2.0) |
| Max resolution | 768x768 (v2.0); up to 720p (v1.2) |
| Max duration | About 5 seconds at 768x768; up to 16 seconds at lower resolutions |
| Training hardware | 64 NVIDIA H800 (v1.0 and v1.1); 96 NVIDIA H100 (v1.2); 224 NVIDIA H200 (v2.0) |
| Reported training cost | About $10,000 (v1.0); about $200,000 (v2.0 single run) |
OpenAI publicly announced Sora on 15 February 2024, presenting a Diffusion Transformer (DiT) trained on spacetime patches of video and image latents that could generate up to a minute of high-fidelity video conditioned on text prompts.[7] OpenAI released a technical report and sample videos but did not publish the model, the training data, or the precise architecture, leaving the open-source community without a reproducible baseline. The release prompted several independent replication efforts; the most visible were the HPC-AI Tech Open-Sora repository and the separately maintained "Open-Sora Plan" hosted by the PKU-YuanGroup at Peking University, which despite a similar name has no organisational overlap with Open-Sora and adopts a distinct architectural lineage.[1][5]
HPC-AI Tech was founded in Beijing in 2021 and operates a research and development centre in Singapore.[8] Its founder, Yang You, holds a faculty position at the National University of Singapore and previously completed a PhD at UC Berkeley under James Demmel; the company's reputation rested initially on Colossal-AI, an open-source system for large-model distributed training that supports tensor, pipeline, and sequence parallelism on top of PyTorch.[8] By the end of 2024 HPC-AI Tech had raised about $50 million in Series A funding to scale both its training framework and its video generation efforts.[9] Open-Sora was launched explicitly as a flagship demonstration of Colossal-AI's parallelism on a problem that had previously been the preserve of large-budget commercial labs.[1][2]
The motivation stated by the developers across the technical reports is twofold. First, to provide an end-to-end open replication that researchers can study and extend, including data preprocessing scripts, VAE training code, the diffusion-transformer implementation, and inference utilities.[1] Second, to interrogate whether the enormous resource budgets associated with closed video models (commonly cited at multiple millions of dollars per training run) are necessary, or whether careful data curation and system optimisation can compress costs by an order of magnitude.[3][4][10] The 2.0 technical report frames this directly, claiming that "the cost of training a top-performing video generation model is highly controllable" given sufficient attention to data curation, architecture, and system design.[3]
Open-Sora releases have followed an uneven cadence: three versions appeared between March and June 2024, followed by an eight-month gap before versions 1.3 and 2.0 arrived in early 2025. Each version is accompanied by a detailed report in the docs/ directory of the repository. The trajectory shows a steady scaling of parameters, data, and hardware budget alongside changes in the underlying diffusion formulation.
Open-Sora 1.0 was published on 17 March 2024, about a month after Sora's announcement.[2] The release included the model architecture, training checkpoints, captioning pipeline, and data preparation scripts. The diffusion network is a Spatial-Temporal Diffusion Transformer (STDiT) derived from the DiT family and the PixArt-alpha text-to-image model, augmented with a one-dimensional temporal attention module stacked on a two-dimensional spatial attention module.[2] The variational autoencoder used to compress pixel video into a tractable latent space was the off-the-shelf Stability AI SD-VAE, the same component used by Stable Diffusion.[2] Text conditioning is supplied by T5-XXL embeddings, projected into the transformer through cross-attention blocks.[2]
Training on 64 NVIDIA H800 GPUs proceeded in three stages: a brief image-pretraining warm-up, a video pre-training stage consuming about 2,808 GPU-hours, and a fine-tuning stage of roughly 1,920 GPU-hours. HPC-AI Tech reported a total cost of approximately $10,000 for training on about 400,000 video clips, more than two orders of magnitude fewer than the 152 million samples used by Stable Video Diffusion.[10] The resulting checkpoint generated 2-second clips at 512x512 resolution; the parameter count of the diffusion network was roughly 700 million.[10] The STDiT decoupling of spatial and temporal attention delivered an approximately fivefold inference speedup against a fully spatio-temporal DiT baseline for long sequences, according to the developers' benchmarks.[2]
Open-Sora 1.1, released on 25 April 2024, scaled the dataset to roughly 9.7 million videos plus 2.6 million images and introduced STDiT-2, an architectural revision that replaced sinusoidal temporal embeddings with rotary position embedding (RoPE), added QK-normalisation using RMSNorm for fp16 stability, and extended T5 tokens from 120 to 200 to accept longer captions.[11] A bucketing strategy allowed training on variable durations between zero and fifteen seconds and at resolutions ranging from 144p to 720p without padding, a design choice the report justifies on the grounds that bucketing is operationally simpler than masking schemes while preserving sample efficiency.[11] A random masking scheme was added during training to enable image-to-video and video-to-video conditioning, where conditioned frames are stamped with timestep zero while unconditioned frames retain their sampled timestep; this opens an autoregressive extension path, though the authors note that drift remains a concern.[11] Nine days of training on the same 64 H800 GPU cluster were reported.
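The masking-based conditioning described above can be illustrated with a short sketch. This is not code from the repository: `per_frame_timesteps` and its arguments are hypothetical names chosen for illustration, and the only behaviour taken from the report is that conditioned frames receive timestep zero while the remaining frames keep the sampled timestep.

```python
import numpy as np

def per_frame_timesteps(num_frames, sampled_t, cond_mask):
    """Return one diffusion timestep per frame.

    cond_mask[i] is True for frames supplied as clean conditioning
    (for example, the first frame in image-to-video generation).
    """
    cond_mask = np.asarray(cond_mask, dtype=bool)
    if cond_mask.shape != (num_frames,):
        raise ValueError("cond_mask must have one entry per frame")
    t = np.full(num_frames, sampled_t, dtype=np.int64)
    t[cond_mask] = 0  # conditioned frames are treated as noise-free
    return t

# Image-to-video: condition on the first frame only (hypothetical values).
mask = [True] + [False] * 7
timesteps = per_frame_timesteps(8, sampled_t=600, cond_mask=mask)
print(timesteps)
```

Changing which entries of the mask are set would cover other conditioning patterns, such as conditioning on the last few frames of a previous clip to extend a video.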
Open-Sora 1.2 was released on 17 June 2024 and introduced three substantial changes documented in the third technical report: a custom three-dimensional Variational Autoencoder, adoption of Rectified Flow in place of discrete diffusion, and the addition of an aesthetic and motion score conditioning channel.[6][12] The 3D VAE stacks a two-dimensional spatial encoder (the SD-VAE from previous versions, with 83 million parameters) and a learned 3D temporal encoder of about 300 million parameters; together they apply an 8x8 spatial downsampling followed by a further 4x temporal compression while keeping the original frame rate during decoding.[12] The compression ratio means that a 720p clip can be processed by the transformer at a sequence length closer to that of high-resolution image generation, narrowing the gap in training cost between image and video transformers.[12]
Rectified flow, drawn from the Stable Diffusion 3 lineage, reduces inference sampling steps from approximately 100 (with traditional DDPM-style schedules) to about 30 while improving sample quality; the developers added logit-normal time-step sampling and a resolution-aware time-step adjustment to compensate for the longer effective sequences of high-resolution video.[12] The full system has roughly 1.1 billion diffusion-network parameters and was trained on a curated multi-stage corpus consisting of WebVid-10M (about 40,000 hours at 240-360p), a filtered subset of Panda-70M (about 20 million clips, 41,000 hours), and a final 2 million high-quality clips totalling about 5,000 hours.[12] Training consumed about 35,000 H100 GPU-hours on a 96-GPU cluster over roughly two weeks.[12] On the public VBench video benchmark, 1.2 raised the total score from 75.91 to 79.23 percent versus the previous version, with the largest gains in semantic understanding.[12]
The 1.2 release also packaged a Gradio application that exposed motion-score, aesthetic-score, and camera-parameter controls, integrated GPT-4o for prompt rewriting, and accepted Chinese-language prompts.[6]
Open-Sora 1.3 was released on 20 February 2025 and is presented in the repository as a transitional update that upgraded the VAE and the diffusion transformer architecture, replacing the stacked spatial-plus-temporal VAE with a unified spatio-temporal VAE and switching to shifted-window attention.[1] The 1.3 release served primarily as a bridge to the much larger Open-Sora 2.0 published three weeks later, and most of its components were superseded in the next release.[1]
The most significant release to date, Open-Sora 2.0, was published on 12 March 2025; an accompanying paper "Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k" was posted to arXiv as 2503.09642 on the same day with 33 listed authors led by Zangwei Zheng and Xiangyu Peng.[3][4] The release scaled the diffusion network to 11 billion parameters, replaced the STDiT spatial-temporal decomposition with a hybrid dual-stream and single-stream block layout inspired by the MMDiT (Multimodal Diffusion Transformer) architecture of Stability AI's Stable Diffusion 3 and Black Forest Labs' Flux text-to-image model, and substituted three-dimensional rotary position embedding (RoPE) for the previous embedding strategy.[3] Text conditioning now uses both T5-XXL and a CLIP-Large encoder.[3]
A new Video DC-AE ("Deep Compression Autoencoder") replaced the 1.2/1.3 VAE family. The DC-AE applies a 4x32x32 spatio-temporal compression with 128 to 256 channels, in contrast to the 4x8x8 compression of HunyuanVideo's VAE. The much higher spatial compression shortens transformer sequence lengths by approximately sixteenfold, which the developers report yielded a 5.2x speedup in training throughput and an order-of-magnitude inference speedup relative to a less-compressed baseline.[3]
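The sixteenfold figure follows directly from the two compression ratios. The sketch below counts latent positions for a 128-frame 768x768 clip under each ratio. The helper `latent_tokens` is a hypothetical name, and the calculation assumes the transformer applies the same patch size in both cases; if the patch sizes differ, the ratio changes accordingly.

```python
def latent_tokens(frames, height, width, t_ratio, s_ratio, patch=1):
    """Number of transformer tokens after autoencoder compression.

    Uses ceiling division so partial blocks still occupy a latent cell.
    `patch` is an optional extra spatial patchify factor (assumed equal
    for both autoencoders in this comparison).
    """
    def ceil_div(a, b):
        return -(-a // b)

    t = ceil_div(frames, t_ratio)
    h = ceil_div(ceil_div(height, s_ratio), patch)
    w = ceil_div(ceil_div(width, s_ratio), patch)
    return t * h * w

low = latent_tokens(128, 768, 768, t_ratio=4, s_ratio=8)    # 4x8x8 baseline
high = latent_tokens(128, 768, 768, t_ratio=4, s_ratio=32)  # 4x32x32 DC-AE
print(low, high, low / high)
```

Because full self-attention cost grows with the square of the token count, a sixteenfold reduction in tokens reduces the attention term by far more than sixteenfold, which is consistent with the throughput gains the developers report.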
Training was conducted on 224 NVIDIA H200 GPUs across three stages, accumulating approximately 4,160 GPU-days at a reported total dollar cost of about $200,000 for a single end-to-end training run.[3] The data schedule consisted of 70 million 256-pixel text-to-video samples for the first stage, 10 million mixed-resolution samples for the second, and 5 million 768-pixel samples for the final fine-tune.[3] The maximum output is 768x768 pixels at five seconds (128 frames), with multiple aspect ratios (16:9, 9:16, 1:1, 2.39:1) supported.[4]
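As a sanity check on the reported budget, the figures above imply a per-GPU-hour price and a wall-clock duration. The arithmetic below uses only the numbers quoted in this section; the assumption that all 224 GPUs ran concurrently for the whole run is a simplification, so the wall-clock estimate is a lower bound.

```python
GPU_DAYS = 4160            # reported for Open-Sora 2.0
NUM_GPUS = 224             # reported cluster size
TOTAL_COST_USD = 200_000   # reported single-run cost

gpu_hours = GPU_DAYS * 24
implied_rate = TOTAL_COST_USD / gpu_hours   # USD per GPU-hour
wall_clock_days = GPU_DAYS / NUM_GPUS       # if every GPU ran the whole time

print(f"{gpu_hours} GPU-hours, about ${implied_rate:.2f} per GPU-hour, "
      f"about {wall_clock_days:.1f} days of wall-clock time")
```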
The release also exposed an explicit text-to-image-to-video (T2I2V) pipeline that first uses Flux to generate a still image from the text prompt and then conditions the video model on that still, on the rationale that high-quality T2I priors are easier to obtain than full T2V coverage.[3][4] On VBench, Open-Sora 2.0 narrowed the gap to OpenAI's Sora from 4.52 percentage points (the gap for version 1.2) to 0.69 percentage points, and in human-preference evaluations the developers report that it matched or exceeded HunyuanVideo and Step-Video on at least two of three axes (visual quality, prompt adherence, motion quality).[3][13]
Through version 1.2, Open-Sora used the STDiT architecture, in which each transformer block consists of two-dimensional spatial multi-head self-attention over the patches of a single frame, followed by one-dimensional temporal attention along the frame axis at each spatial location.[2][11] This decoupling reduces the attention cost from quadratic in the full spatio-temporal token count to the sum of two smaller quadratic terms, which makes long sequences tractable on a modest GPU cluster. The cost is some loss of global spatio-temporal coupling relative to a fully three-dimensional attention, a trade-off that is examined in the technical reports.[11]
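The saving can be made concrete by counting query-key pairs. With T frames and S tokens per frame, full attention scores (T·S)² pairs, while the factorised scheme scores T·S² pairs spatially plus S·T² temporally. The sketch below uses hypothetical values for T and S and ignores heads, projections, and constant factors.

```python
def attention_pairs_full(frames, tokens_per_frame):
    """Query-key pairs for attention over all spatio-temporal tokens."""
    n = frames * tokens_per_frame
    return n * n

def attention_pairs_factorised(frames, tokens_per_frame):
    """Query-key pairs for spatial attention within each frame plus
    temporal attention at each spatial location."""
    spatial = frames * tokens_per_frame ** 2
    temporal = tokens_per_frame * frames ** 2
    return spatial + temporal

T, S = 16, 1024  # hypothetical: 16 latent frames, 32x32 tokens per frame
full = attention_pairs_full(T, S)
fact = attention_pairs_factorised(T, S)
print(full, fact, full / fact)
```

The ratio simplifies to T·S / (T + S), so when each frame holds many more tokens than there are frames, the factorised scheme is cheaper by a factor close to the number of frames.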
Open-Sora 2.0 abandoned this decomposition in favour of an MMDiT-style architecture that interleaves dual-stream blocks (which carry separate but interacting text and image residual streams) with single-stream blocks (which concatenate text and image tokens into a single attention pass).[3] Full spatio-temporal attention is now applied at every layer, with the higher VAE compression and the H200 GPU memory making the increased compute feasible. Three-dimensional RoPE supplies axis-aware relative positional information across spatial-x, spatial-y, and time, replacing the absolute or learned positional embeddings used in earlier versions.[3]
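A common way to build a three-dimensional RoPE is to split each attention head's channels into three groups and rotate each group by the token's position along one axis. The NumPy sketch below follows that pattern. The equal 4/4/4 channel split, the base of 10000, and the function names are assumptions for illustration and are not taken from the Open-Sora code.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x by angles proportional to pos."""
    x = np.asarray(x, dtype=np.float64)
    d = x.shape[-1]
    if d % 2:
        raise ValueError("channel count must be even")
    freqs = base ** (-np.arange(0, d, 2) / d)              # one frequency per pair
    angles = np.asarray(pos, dtype=np.float64)[..., None] * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def rope_3d(x, t, y, xpos, split=(4, 4, 4)):
    """Apply rope_1d to three channel groups: time, height, width."""
    x = np.asarray(x, dtype=np.float64)
    if sum(split) != x.shape[-1]:
        raise ValueError("split must sum to the channel count")
    chunks = np.split(x, np.cumsum(split)[:-1], axis=-1)
    rotated = [rope_1d(c, p) for c, p in zip(chunks, (t, y, xpos))]
    return np.concatenate(rotated, axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=12), rng.normal(size=12)
# Same relative offset (1, 2, 3) at two different absolute positions.
score_a = rope_3d(q, 1, 2, 3) @ rope_3d(k, 0, 0, 0)
score_b = rope_3d(q, 4, 6, 8) @ rope_3d(k, 3, 4, 5)
print(np.isclose(score_a, score_b))
```

The two scores agree because each rotation depends only on the offset between query and key along its axis, which is the relative-position property referred to above.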
Open-Sora 1.0 and 1.1 used the pretrained Stable Diffusion VAE, applied frame by frame, without any temporal compression. The transformer therefore had to model time at the original frame rate, inflating sequence lengths and training cost. Open-Sora 1.2 introduced a custom 3D VAE consisting of a 2D encoder (the SD VAE) stacked on top of a learned 3D temporal encoder of about 300 million parameters; the combination delivers 8x8 spatial and 4x temporal compression and was trained progressively in three stages.[12] The 3D component was trained on mixed-length clips of up to 34 frames to enable robust handling of variable durations.[12]
In Open-Sora 2.0, the VAE was replaced with a Video Deep Compression Autoencoder. The DC-AE was designed to push spatial compression much further (32x32 in the spatial axes) while retaining sufficient reconstruction fidelity through a larger channel count, making the transformer sequence dramatically shorter for the same input video. The developers report that this single change accounts for the majority of the 5.2x training throughput improvement over a HunyuanVideo-style 4x8x8 VAE baseline.[3]
Open-Sora 1.0 and 1.1 trained against a standard DDPM denoising objective. Open-Sora 1.2 switched to continuous-time rectified flow, which formulates training as learning a velocity field that transports a Gaussian prior to the data distribution along straight-line paths.[12] Rectified flow exposes a single hyperparameter (the time-step sampling distribution) that lets practitioners trade off the difficulty of different noise levels; the Open-Sora team adopted a logit-normal sampler and added a resolution-aware shift so that higher-resolution training emphasises higher-noise time-steps, since inputs with more pixels need more noise to destroy their signal.[12] This approach is shared with Stable Diffusion 3, which also uses rectified flow with logit-normal sampling.[12] Open-Sora 2.0 inherits the rectified-flow objective with further engineering refinements documented in the 2.0 paper.[3]
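The pieces of a rectified-flow training step can be sketched in NumPy. This is a minimal illustration under stated assumptions rather than the project's implementation: the convention that t = 0 is data and t = 1 is noise, the shift formula (the form given in the Stable Diffusion 3 paper), the shift value of 3.0, and all function names are choices made for this example.

```python
import numpy as np

def sample_logit_normal(rng, size, mean=0.0, std=1.0):
    """Draw t in (0, 1) by passing Gaussian samples through a sigmoid."""
    return 1.0 / (1.0 + np.exp(-rng.normal(mean, std, size)))

def shift_timesteps(t, shift):
    """Remap t; shift > 1 pushes values toward 1, shift < 1 toward 0."""
    return shift * t / (1.0 + (shift - 1.0) * t)

def training_pair(x0, noise, t):
    """Straight-line interpolant and its constant velocity target.

    Convention assumed here: t = 0 is data, t = 1 is pure noise.
    """
    t = np.reshape(t, (-1,) + (1,) * (x0.ndim - 1))  # broadcast over sample dims
    x_t = (1.0 - t) * x0 + t * noise
    velocity = noise - x0
    return x_t, velocity

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))         # stand-in for a batch of latents
noise = rng.normal(size=x0.shape)
t = shift_timesteps(sample_logit_normal(rng, 4), shift=3.0)
x_t, v = training_pair(x0, noise, t)
# A network v_theta(x_t, t) would be regressed onto v with a squared-error loss.
```

Because the path is a straight line, the target velocity does not depend on t, which is part of why samplers can take comparatively few integration steps at inference.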
The Open-Sora codebase sits on top of Colossal-AI, HPC-AI Tech's distributed-training framework that supplies tensor parallelism, pipeline parallelism, and sequence parallelism implementations.[1][9] The 1.0 report explicitly highlights a 55 percent training-throughput improvement on 64-frame 512x512 video versus a vanilla DiT baseline, attributed to fused attention kernels, an accelerated T5 path, and sequence parallelism on long token sequences.[2] By the 2.0 release the codebase had added multi-resolution bucketing, memory-offloading inference paths that allow single-GPU inference of the 11B model in fp8 quantisation, and torch.compile integration.[4]
The HuggingFace model card for Open-Sora-v2 reports inference times of about 60 seconds for 256x256 video on a single NVIDIA H100 (consuming 52.5 GB of memory), 1,656 seconds for 768x768 on a single GPU, and 276 seconds for 768x768 distributed across eight GPUs with sequence parallelism enabled.[4] These figures support the project's claim that Open-Sora can be served on modest infrastructure, though the longer single-GPU inference times remain a barrier to interactive use.
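The two 768x768 timings give a quick measure of how well sequence parallelism scales. The calculation below uses only the figures quoted above.

```python
single_gpu_s = 1656   # 768x768 on one GPU (reported)
eight_gpu_s = 276     # 768x768 on eight GPUs with sequence parallelism (reported)
num_gpus = 8

speedup = single_gpu_s / eight_gpu_s
efficiency = speedup / num_gpus
print(f"speedup {speedup:.1f}x, parallel efficiency {efficiency:.0%}")
```

A sixfold speedup on eight GPUs corresponds to 75 percent parallel efficiency, so roughly a quarter of the added capacity is lost to communication and other overheads at this configuration.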
A noteworthy contribution of the Open-Sora reports is the explicit description of the data preparation pipeline, which includes scene-cut detection, optical-flow scoring for motion, aesthetic scoring with a CLIP-based aesthetic predictor, and dense captioning using PLLaVA in the 1.2 era.[12] In Open-Sora 2.0 the captioning step was upgraded to a multimodal LLM (the paper cites Qwen-VL among other captioners) and the developers apply explicit deduplication and a filter for static or near-static clips that would otherwise weaken motion priors.[3] The full pipeline is included in the repository, which is unusual for video models: most commercial systems disclose only summary statistics about their datasets.[3][4]
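The filtering stages described above can be pictured as a chain of per-clip checks. The sketch below is a simplified stand-in, not the repository's pipeline: the `Clip` fields, the `filter_clips` function, and both thresholds are hypothetical, and real scores would come from an aesthetic predictor and an optical-flow estimator.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Clip:
    path: str
    content_hash: str   # for example a perceptual hash, used for deduplication
    aesthetic: float    # higher is better
    flow: float         # mean optical-flow magnitude; near 0 means static

def filter_clips(clips, min_aesthetic=4.5, min_flow=0.5):
    """Keep clips that are unique, sufficiently aesthetic, and not static.

    Thresholds are placeholders, not values from the technical reports.
    """
    seen = set()
    kept = []
    for clip in clips:
        if clip.content_hash in seen:
            continue  # duplicate of a clip already processed
        seen.add(clip.content_hash)
        if clip.aesthetic < min_aesthetic or clip.flow < min_flow:
            continue  # low visual quality or near-static
        kept.append(clip)
    return kept

clips = [
    Clip("a.mp4", "h1", aesthetic=5.2, flow=2.0),
    Clip("a_copy.mp4", "h1", aesthetic=5.2, flow=2.0),  # duplicate
    Clip("b.mp4", "h2", aesthetic=5.0, flow=0.1),       # near-static
    Clip("c.mp4", "h3", aesthetic=3.9, flow=3.0),       # low aesthetic score
    Clip("d.mp4", "h4", aesthetic=4.8, flow=1.2),
]
kept = filter_clips(clips)
print([c.path for c in kept])
```

In a multi-stage schedule, the same function could be called with progressively stricter thresholds to produce the smaller, higher-quality subsets used for later training stages.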
The hpcaitech/Open-Sora repository accumulated tens of thousands of GitHub stars within months of the 1.0 release, and HPC-AI Tech has stated that the combined Colossal-AI and Open-Sora repositories crossed 60,000 stars by the Series A announcement in late 2024.[9] HuggingFace hosts the official weights for v1.0, v1.1, v1.2, v1.3, and v2.0 under the hpcai-tech organisation, with the v2.0 model alone receiving hundreds of community fine-tunes within weeks of release.[4]
Open-Sora has been adopted as a baseline or reference implementation in several follow-up works in text-to-video generation, and the rapid iteration on architecture (STDiT to MMDiT-style) tracks broader changes in the open video community. The decision to release dollar-cost figures and training-data sizes alongside the weights has also influenced how other groups document their projects.[3][4]
Open-Sora exists in a crowded landscape of open and closed video generation systems. The following comparison summarises the headline specifications of Open-Sora 2.0 (March 2025) and comparable systems released before or shortly after it.
| System | Developer | Parameters | Max resolution / duration | License | Notable architecture choices |
|---|---|---|---|---|---|
| Open-Sora 2.0 | HPC-AI Tech | 11B | 768x768, 5 s (128 frames) | Apache 2.0 | MMDiT-style dual/single stream DiT, Video DC-AE, 3D RoPE[3][4] |
| Sora | OpenAI | Not disclosed | 1080p, up to 60 s | Closed | DiT on spacetime patches[7] |
| HunyuanVideo | Tencent | 13B | 720p, 5 s | Open (Tencent Community licence) | Dual-stream to single-stream DiT, 3D VAE, flow-matching objective[13] |
| Mochi 1 | Genmo | 10B | 480p, about 5.4 s at 30 fps | Apache 2.0 | Asymmetric DiT (AsymmDiT), full 3D attention[14] |
| CogVideoX | Zhipu AI / Tsinghua THUDM | 5B (largest open variant) | 768x1360 up to 10 s at 16 fps | Open (research licence) | Expert transformer with 3D causal VAE[15] |
| Open-Sora Plan v1.5 | PKU-YuanGroup | Not officially disclosed | 720p | MIT | Sparse 3D DiT, WFVAE, Ascend NPU training[5] |
Open-Sora 2.0's parameter count sits between Mochi 1 and HunyuanVideo. The developers' headline claim, that they reach human-preference parity with Tencent's 13B HunyuanVideo and StepFun's 30B Step-Video using roughly one-tenth of the training budget, has been independently summarised in coverage by The Decoder and MarkTechPost, though no peer-reviewed third-party benchmarks of comparable scope are yet available.[13][16] On VBench, the developers report a 0.69 percentage-point gap to Sora, narrowed from 4.52 percentage points in version 1.2; the absolute VBench numbers should be read cautiously because VBench scores depend strongly on prompt list selection and resolution settings.[3]
The "Open-Sora Plan" maintained by the PKU-YuanGroup is unrelated to Open-Sora despite the similar name. The PKU effort is led by a different organisation, is released under an MIT licence rather than Apache 2.0, and adopted a separate architectural lineage moving from 2+1D in versions 1.0 to 1.1 to dense 3D attention in 1.2 and then to sparse attention in 1.3 and later. Open-Sora Plan version 1.5 (June 2025) was notable for being trained and served on Huawei Ascend 910-series NPUs rather than NVIDIA GPUs.[5] The two projects are sometimes confused in secondary coverage; the HPC-AI Tech README and the PKU GitHub explicitly note the distinction.[1][5]
Open-Sora's significance for the open AI ecosystem lies less in being the highest-quality video model and more in being one of the few projects to disclose a complete recipe: data, code, weights, and explicit budget. Researchers can study the effect of swapping VAEs, schedulers, or text encoders without having to train a base model from scratch. The 1.2 and 2.0 reports have been cited as references for the design of training schedules, multi-stage data curation, and bucketing strategies in academic and industrial video diffusion work.[3][12]
Practical use cases documented by the developers include short-form social-media content generation, story-boarding, advertising mock-ups, and image-to-video animation of still photographs through the I2V pipeline.[1][4] The motion-score and aesthetic-score conditioning channels introduced in version 1.2 make it possible to dial up or down camera motion or visual fidelity at inference, a feature less common in closed commercial systems.[6] The text-to-image-to-video pipeline added in 2.0 allows users to compose a still using Flux or a similar high-quality text-to-image model and then drive motion from that still, bypassing some of the prompt-following weaknesses of pure T2V.[3][4]
Because the weights are Apache 2.0 licensed, Open-Sora can be redistributed and embedded in commercial products without royalties, in contrast to several open video models that ship under non-commercial or community-only licences.[1][4]
Despite the strong VBench numbers, Open-Sora has several documented limitations. Maximum output is restricted to 768x768 pixels and five seconds at 128 frames in version 2.0; this is well short of Sora's claimed up-to-one-minute output and of the 1080p modes offered by some commercial competitors.[4][13] Motion coherence over longer clips, particularly for complex object interactions, remains weaker than in OpenAI's Sora and Google's Veo line in independent qualitative comparisons.[13]
The "$200k" figure cited prominently in the 2.0 marketing and paper title refers to a single end-to-end training run priced at publicly listed cloud rates for H200 capacity. Critics have observed that the figure excludes infrastructure setup, failed runs, ablation experiments, and the cumulative cost of building the underlying datasets and tools.[16] The all-in cost is thus higher than the headline number, although it is still likely an order of magnitude below the multi-million-dollar figures associated with closed commercial training, a comparison made in HPC-AI Tech's blog post and repeated by several secondary outlets.[13][16]
The choice of T5-XXL plus CLIP-Large for text conditioning, rather than a modern multimodal large language model, limits the model's ability to follow long compositional prompts. HunyuanVideo and several PKU Open-Sora Plan releases adopted MLLM-style text encoders for this reason.[5][13] Open-Sora 2.0's text encoder is acknowledged in the paper as an area for future work.[3]
Finally, the training data pipeline relies on publicly scraped video corpora, including derivatives of WebVid-10M and Panda-70M, whose copyright status for AI training is contested in some jurisdictions. The technical reports describe filtering and deduplication but do not attempt a comprehensive provenance analysis.[12]
Open-Sora belongs to a fast-moving cluster of text-to-video generation systems built around the Diffusion Transformer (DiT) paradigm.
Closed commercial systems that compete on quality include OpenAI's Sora (and its successor Sora 2), Runway's Gen-3 Alpha, Kuaishou's Kling family, Google's Veo line, and Pika. Among open systems, the most direct comparisons are HunyuanVideo, Mochi 1, CogVideoX, and Alibaba's Wan 2.1, all of which use related but distinct DiT variants and 3D VAEs.[13][14][15]
Architecturally, Open-Sora's evolution mirrors the trajectory of the wider community: from PixArt-style spatial DiT plus temporal attention, through Stable Diffusion 3's MMDiT (Multimodal Diffusion Transformer), to flow-matching and rectified-flow training objectives. The shift from absolute and learned positional embeddings to three-dimensional rotary position embedding (RoPE) is also visible across multiple open systems.[3][13]