Open-Sora


Updated Jul 11, 2026


Open-Sora is an open-source text-to-video diffusion project initiated in March 2024 by Singapore-based startup HPC-AI Tech (the team behind the Colossal-AI distributed training framework) as a public attempt to replicate and approximate the capabilities of OpenAI's Sora using openly released code, training recipes, and pre-trained weights.^[1]^[2] The project is distributed under the Apache 2.0 licence at the GitHub repository hpcaitech/Open-Sora and has progressed through five major releases (versions 1.0 through 2.0) between March 2024 and March 2025, evolving from a 700-million-parameter spatial-temporal diffusion transformer that produced two-second 512x512 clips into an 11-billion-parameter model that the developers report achieves quality close to commercial systems at a single-run training cost of roughly $200,000.^[1]^[3]^[4] Open-Sora is one of the most widely cited open replication efforts targeting Sora, alongside the unrelated "Open-Sora Plan" maintained by a group at Peking University, and serves both as a research artefact (the technical reports document data curation, architecture, and training schedules in considerable detail) and as a product demo for HPC-AI Tech's Colossal-AI training stack.^[1]^[5]^[6] The project's distinguishing feature is the deliberate transparency of its training recipe, including dataset composition, GPU-hour budgets, and dollar costs, which is unusual for video generation systems that are otherwise dominated by closed commercial models.^[3]^[4] HPC-AI Tech frames the effort under the banner "Democratizing Efficient Video Production for All," and as of mid-2026 version 2.0 (March 2025) remains its latest major release.^[1]^[3]

Infobox

Field	Value
Project name	Open-Sora
Developer	HPC-AI Tech (Colossal-AI team)
Initial release	17 March 2024 (v1.0)
Latest major release	12 March 2025 (v2.0)
License	Apache 2.0
Repository	github.com/hpcaitech/Open-Sora
Architecture	Spatial-Temporal Diffusion Transformer (v1.0 to v1.2); MMDiT-style dual/single stream DiT (v2.0)
Parameters (latest)	11 billion (v2.0)
Max resolution	768x768 (v2.0); up to 720p (v1.2)
Max duration	About 5 seconds at 768x768; up to 16 seconds at lower resolutions
Training hardware	64 NVIDIA H800 (v1.0-v1.1); 96 NVIDIA H100 (v1.2); 224 NVIDIA H200 (v2.0)
Reported training cost	About $10,000 (v1.0); about $200,000 (v2.0 single run)

Who created Open-Sora, and why?

OpenAI publicly announced Sora on 15 February 2024, presenting a Diffusion Transformer (DiT) trained on spacetime patches of video and image latents that could generate up to a minute of high-fidelity video conditioned on text prompts.^[7] OpenAI released a technical report and sample videos but did not publish the model, the training data, or the precise architecture, leaving the open-source community without a reproducible baseline. The release prompted several independent replication efforts; the most visible were the HPC-AI Tech Open-Sora repository and the separately maintained "Open-Sora Plan" hosted by the PKU-YuanGroup at Peking University, which despite a similar name has no organisational overlap with Open-Sora and adopts a distinct architectural lineage.^[1]^[5]

HPC-AI Tech was founded in Beijing in 2021 and operates a research and development centre in Singapore.^[8] Its founder, Yang You, is a Presidential Young Professor at the National University of Singapore and previously completed a PhD at UC Berkeley under James Demmel; the company's reputation rested initially on Colossal-AI, an open-source system for large-model distributed training that supports tensor, pipeline, and sequence parallelism on top of PyTorch.^[8]^[9] The company closed a $6 million seed and angel round led by BlueRun Ventures in September 2022, and by the end of 2024 it had raised about $50 million in Series A funding to scale both its training framework and its video generation efforts.^[8]^[9] Open-Sora was launched explicitly as a flagship demonstration of Colossal-AI's parallelism on a problem that had previously been the preserve of large-budget commercial labs.^[1]^[2]

The motivation stated by the developers across the technical reports is twofold. First, to provide an end-to-end open replication that researchers can study and extend, including data preprocessing scripts, VAE training code, the diffusion-transformer implementation, and inference utilities.^[1] Second, to interrogate whether the enormous resource budgets associated with closed video models (commonly cited at multiple millions of dollars per training run) are necessary, or whether careful data curation and system optimisation can compress costs by an order of magnitude.^[3]^[4]^[10] The 2.0 technical report frames this directly, claiming that "the cost of training a top-performing video generation model is highly controllable" given sufficient attention to data curation, architecture, and system design.^[3]

How has Open-Sora evolved across versions?

Open-Sora releases have followed a roughly quarterly cadence, with each version accompanied by a detailed report in the docs/ directory of the repository. The trajectory shows a steady scaling of parameters, data, and hardware budget alongside changes in the underlying diffusion formulation.

Open-Sora 1.0 (March 2024)

Open-Sora 1.0 was published on 17 March 2024, about a month after Sora's announcement.^[2] The release exposed a model architecture, training checkpoints, captioning pipeline, and data preparation scripts. The diffusion network is a Spatial-Temporal Diffusion Transformer (STDiT) derived from the DiT family and the PixArt-alpha text-to-image model, augmented with a one-dimensional temporal attention module stacked on a two-dimensional spatial attention module.^[2] The variational autoencoder used to compress pixel video into a tractable latent space was the off-the-shelf Stability AI SD-VAE, the same component used by Stable Diffusion.^[2] Text conditioning is supplied by T5-XXL embeddings, projected into the transformer through cross-attention blocks.^[2]

Training on 64 NVIDIA H800 GPUs proceeded in three stages: a brief image-pretraining warm-up, a video pre-training stage consuming about 2,808 GPU-hours, and a fine-tuning stage of roughly 1,920 GPU-hours. HPC-AI Tech reported the total cost as approximately $10,000, training on about 400,000 video clips, roughly 380 times fewer (between two and three orders of magnitude) than the 152 million samples used by Stable Video Diffusion.^[10] The resulting checkpoint generated 2-second clips at 512x512 resolution; the parameter count of the diffusion network was roughly 700 million.^[10] The STDiT decoupling of spatial and temporal attention delivered an approximately fivefold inference speedup against a fully spatio-temporal DiT baseline for long sequences, according to the developers' benchmarks.^[2]

Open-Sora 1.1 (April 2024)

Open-Sora 1.1, released on 25 April 2024, scaled the dataset to roughly 9.7 million videos plus 2.6 million images and introduced ST-DiT-2, an architectural revision that replaced sinusoidal temporal embeddings with rotary position embedding (RoPE), added QK-normalisation using RMSNorm for fp16 stability, and extended T5 tokens from 120 to 200 to accept longer captions.^[11] A bucketing strategy allowed training on variable durations between zero and fifteen seconds and at resolutions ranging from 144p to 720p without padding, a design choice the report justifies on the grounds that bucketing is operationally simpler than masking schemes while preserving sample efficiency.^[11] A random masking scheme was added during training to enable image-to-video and video-to-video conditioning: conditioned frames are stamped with timestep zero while unconditioned frames retain their sampled timestep. This opens an autoregressive extension path, though the authors note that drift remains a concern.^[11] The developers reported nine days of training on the same 64-GPU H800 cluster.
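The timestep-stamping idea behind the masking scheme can be illustrated with a few lines of Python. This is a minimal sketch rather than the project's implementation: the function name `per_frame_timesteps`, the frame count, and the sampled timestep value are all hypothetical and chosen only for illustration.

```python
def per_frame_timesteps(num_frames, sampled_t, conditioned):
    """Return one timestep per frame.

    Frames listed in `conditioned` are treated as clean inputs and get
    timestep 0; every other frame keeps the sampled (noisy) timestep.
    """
    return [0 if i in conditioned else sampled_t for i in range(num_frames)]


# Image-to-video: only the first frame is supplied as a clean condition.
t = 637  # hypothetical sampled diffusion timestep
print(per_frame_timesteps(num_frames=8, sampled_t=t, conditioned={0}))
# [0, 637, 637, 637, 637, 637, 637, 637]
```

Conditioning on the last few frames of a previous clip instead of the first frame gives the autoregressive extension the paragraph mentions.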

Open-Sora 1.2 (June 2024)

Open-Sora 1.2 was released on 17 June 2024 and introduced three substantial changes documented in the third technical report: a custom three-dimensional variational autoencoder (VAE), adoption of rectified flow in place of discrete diffusion, and the addition of an aesthetic and motion score conditioning channel.^[6]^[12] The 3D VAE stacks a two-dimensional spatial encoder (the SD-VAE from previous versions, with 83 million parameters) and a learned 3D temporal encoder of about 300 million parameters; together they apply an 8x8 spatial downsampling followed by a further 4x temporal compression while keeping the original frame rate during decoding.^[12] The compression ratio means that a 720p clip can be processed by the transformer at a sequence length closer to that of high-resolution image generation, narrowing the gap in training cost between image and video transformers.^[12]

Rectified flow, drawn from the Stable Diffusion 3 lineage, reduces inference sampling steps from approximately 100 (with traditional DDPM-style schedules) to about 30 while improving sample quality; the developers added logit-norm time-step sampling and a resolution-aware time-step adjustment to compensate for the longer effective sequences of high-resolution video.^[12] The full system has roughly 1.1 billion diffusion-network parameters and was trained on a curated multi-stage corpus consisting of WebVid-10M (about 40,000 hours at 240-360p), a filtered subset of Panda-70M (about 20 million clips, 41,000 hours), and a final 2 million high-quality clips totalling about 5,000 hours.^[12] Training consumed about 35,000 H100 GPU-hours on a 96-GPU cluster over roughly two weeks.^[12] On the public VBench video benchmark, 1.2 raised the total score from 75.91 to 79.23 percent versus the previous version, with the largest gains in semantic understanding.^[12]

The 1.2 release also packaged a Gradio application that exposed motion-score, aesthetic-score, and camera-parameter controls, integrated GPT-4o for prompt rewriting, and accepted Chinese-language prompts.^[6]

Open-Sora 1.3 (February 2025)

Open-Sora 1.3 was released on 20 February 2025 and is presented in the repository as a transitional update that upgraded the VAE and the diffusion transformer architecture, replacing the stacked spatial-plus-temporal VAE with a unified spatio-temporal VAE and switching to shift-window attention.^[1] The 1.3 release served primarily as a bridge to the much larger Open-Sora 2.0 published three weeks later, and most of its components were superseded in the next release.^[1]

Open-Sora 2.0 (March 2025)

The most significant release to date, Open-Sora 2.0, was published on 12 March 2025; an accompanying paper "Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k" was posted to arXiv as 2503.09642 on the same day with 33 listed authors led by Zangwei Zheng and Xiangyu Peng.^[3]^[4] The release scaled the diffusion network to 11 billion parameters, replaced the STDiT spatial-temporal decomposition with a hybrid dual-stream and single-stream block layout inspired by the MMDiT (Multimodal Diffusion Transformer) architecture of Stability AI's Stable Diffusion 3 and Black Forest Labs' Flux, and substituted three-dimensional rotary position embedding (RoPE) for the previous embedding strategy.^[3] Text conditioning now uses both T5-XXL and a CLIP-Large encoder.^[3]

A new Video DC-AE ("Deep Compression Autoencoder") replaced the 1.2/1.3 VAE family. The DC-AE applies a 4x32x32 spatio-temporal compression with 128 to 256 channels, in contrast to the 4x8x8 compression of HunyuanVideo's VAE. The much higher spatial compression shortens transformer sequence lengths by approximately sixteenfold, which the developers report yielded a 5.2x speedup in training throughput and an order-of-magnitude inference speedup relative to a less-compressed baseline.^[3]
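The sixteenfold figure follows directly from the two compression ratios, as the short calculation below shows. It is a sketch under simplifying assumptions: the helper `latent_tokens` is hypothetical, dimensions are assumed to divide evenly, and each latent position is counted as one transformer token with no further patchify step.

```python
def latent_tokens(frames, height, width, t_ratio, s_ratio):
    """Count latent positions after temporal (t_ratio) and spatial (s_ratio) compression.

    Assumes the dimensions divide evenly and that each latent position
    becomes one transformer token; both are simplifications.
    """
    return (frames // t_ratio) * (height // s_ratio) * (width // s_ratio)


# 128 frames at 768x768, the maximum output quoted for Open-Sora 2.0.
baseline = latent_tokens(128, 768, 768, t_ratio=4, s_ratio=8)   # 4x8x8 VAE
deep = latent_tokens(128, 768, 768, t_ratio=4, s_ratio=32)      # 4x32x32 DC-AE
print(baseline, deep, baseline // deep)  # 294912 18432 16
```

Because self-attention cost grows quadratically with sequence length, a sixteenfold reduction in tokens shrinks the attention term by far more than sixteenfold, which is consistent with the large throughput gains the developers report.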

Training was conducted on 224 NVIDIA H200 GPUs across three stages, accumulating approximately 4,160 GPU-days at a reported total dollar cost of about $200,000 for a single end-to-end training run.^[3] The 2.0 report breaks that budget down by stage: a first stage on 70 million 256-pixel text-to-video samples costing about $107,500, a second stage on 10 million mixed-resolution samples at about $18,400, and a final 768-pixel fine-tune on 5 million samples at about $73,700, with the later two stages running on 192 GPUs.^[3] The maximum output is 768x768 pixels at five seconds (128 frames at 24 frames per second), with multiple aspect ratios (16:9, 9:16, 1:1, 2.39:1) supported.^[4]
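The stage figures can be cross-checked against the headline number with simple arithmetic. The sketch below uses only the values quoted above; the implied price per GPU-hour is a derived quantity computed here, not a rate stated in the report.

```python
# Per-stage costs in US dollars, as quoted from the 2.0 report.
stage_costs = {
    "stage 1 (256px, 70M samples)": 107_500,
    "stage 2 (mixed resolution, 10M samples)": 18_400,
    "stage 3 (768px, 5M samples)": 73_700,
}

total = sum(stage_costs.values())
gpu_hours = 4_160 * 24          # about 4,160 GPU-days in total
implied_rate = total / gpu_hours

print(total)                    # 199600
print(f"{implied_rate:.2f}")    # 2.00 (USD per GPU-hour, derived)
```

The three stages sum to $199,600, matching the rounded $200,000 figure, and the first stage alone accounts for a little over half of the budget.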

The release also exposed an explicit text-to-image-to-video (T2I2V) pipeline that first uses Flux to generate a still image from the text prompt and then conditions the video model on that still, on the rationale that high-quality T2I priors are easier to obtain than full T2V coverage.^[3]^[4] On VBench, Open-Sora 2.0 narrowed the gap to OpenAI Sora from 4.52 percentage points (for version 1.2) to 0.69 percentage points. On human-preference evaluations spanning 100 prompts across visual quality, prompt adherence, and motion quality, the developers report that it matched or exceeded HunyuanVideo and Step-Video on at least two of the three axes.^[3]^[13] The abstract of the 2.0 report summarises the result plainly: "Open-Sora 2.0 is comparable to global leading video generation models including the open-source HunyuanVideo and the closed-source Runway Gen-3 Alpha."^[3]

How does Open-Sora work?

Spatial-temporal diffusion transformers

Through version 1.2, Open-Sora used the STDiT architecture, in which each transformer block consists of a two-dimensional spatial multi-head self-attention over the patches of a single frame, followed by a one-dimensional temporal attention along the frame axis at each spatial location.^[2]^[11] This decoupling reduces the attention cost from quadratic in the full spatio-temporal token count to the sum of two smaller quadratic terms, which makes long sequences tractable on a modest GPU cluster. The cost is some loss of global spatio-temporal coupling relative to a fully three-dimensional attention, a trade-off that is examined in the technical reports.^[11]
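The saving from factorised attention can be made concrete by counting query-key pairs. The sketch below is illustrative only: the function names are introduced here, and the frame count and per-frame token count are hypothetical values, not Open-Sora settings.

```python
def full_attention_pairs(frames, tokens_per_frame):
    """Query-key pairs when every token attends to every token in the clip."""
    n = frames * tokens_per_frame
    return n * n


def factorised_attention_pairs(frames, tokens_per_frame):
    """Pairs for spatial attention within each frame plus temporal
    attention along the frame axis at each spatial location."""
    spatial = frames * tokens_per_frame ** 2
    temporal = tokens_per_frame * frames ** 2
    return spatial + temporal


T, S = 16, 1024  # hypothetical: 16 latent frames, 32x32 tokens per frame
full = full_attention_pairs(T, S)
fact = factorised_attention_pairs(T, S)
print(full, fact)  # 268435456 17039360
```

The ratio simplifies to T·S / (T + S), so for these values the factorised form needs roughly one sixteenth of the pairwise comparisons, and the advantage grows as either axis gets longer.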

Open-Sora 2.0 abandoned this decomposition in favour of an MMDiT-style architecture that interleaves dual-stream blocks (which carry separate but interacting text and image residual streams) with single-stream blocks (which concatenate text and image tokens into a single attention pass).^[3] Full spatio-temporal attention is now applied at every layer, with the higher VAE compression and the H200 GPU memory making the increased compute feasible. Three-dimensional RoPE supplies axis-aware relative positional information across spatial-x, spatial-y, and time, replacing the absolute or learned positional embeddings used in earlier versions.^[3]

Variational autoencoder design

Open-Sora 1.0 and 1.1 used the pretrained Stable Diffusion VAE, applied frame by frame, without any temporal compression. The transformer therefore had to model time at the original frame rate, inflating sequence lengths and training cost. Open-Sora 1.2 introduced a custom 3D VAE in which a 2D spatial encoder (the SD VAE) is followed by a learned 3D temporal encoder of about 300 million parameters; the combination delivers 8x8 spatial and 4x temporal compression and was trained progressively in three stages.^[12] The 3D component was trained on mixed-length clips of up to 34 frames to enable robust handling of variable durations.^[12]

In Open-Sora 2.0, the VAE was replaced with a Video Deep Compression Autoencoder. The DC-AE was designed to push spatial compression much further (32x32 in the spatial axes) while retaining sufficient reconstruction fidelity through a larger channel count, making the transformer sequence dramatically shorter for the same input video. The developers report that this single change accounts for the majority of the 5.2x training throughput improvement over a HunyuanVideo-style 4x8x8 VAE baseline.^[3]

Diffusion formulation

Open-Sora 1.0 and 1.1 trained against a standard DDPM denoising objective. Open-Sora 1.2 switched to continuous-time rectified flow, which formulates training as learning a velocity field that transports a Gaussian prior to the data distribution along straight-line paths.^[12] Rectified flow exposes a single hyperparameter (the time-step sampling distribution) that lets practitioners trade off the difficulty of different noise levels; the Open-Sora team adopted a logit-normal sampler and added a resolution-aware shift so that higher-resolution training emphasises higher-noise time-steps.^[12] This approach is shared with Stable Diffusion 3, which also uses rectified flow with logit-normal sampling.^[12] Open-Sora 2.0 inherits the rectified-flow objective with further engineering refinements documented in the 2.0 paper.^[3]
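A toy version of the rectified-flow training pair helps make the straight-path idea concrete. The following is a minimal sketch using only the Python standard library, not the Open-Sora training code: the function names are hypothetical, the "latent" is a three-element list, and the convention that t = 0 is data and t = 1 is pure noise is an assumption made for this example. The resolution-aware shift is omitted.

```python
import math
import random


def sample_t_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by passing a Gaussian sample through a sigmoid."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(mean, std)))


def rectified_flow_pair(x0, noise, t):
    """Return the interpolated point x_t and the constant velocity target.

    Assumed convention: t = 0 is the data sample, t = 1 is pure noise.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]
    velocity = [b - a for a, b in zip(x0, noise)]
    return x_t, velocity


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                      # toy stand-in for a video latent
noise = [rng.gauss(0.0, 1.0) for _ in x0]
t = sample_t_logit_normal(rng)
x_t, v = rectified_flow_pair(x0, noise, t)

# Stepping back along the velocity by t recovers the data sample exactly,
# which is what "straight-line path" means here.
recovered = [p - t * d for p, d in zip(x_t, v)]
```

A network trained with this target regresses `v` from `x_t` and `t`; because the path is straight, a sampler can take comparatively few integration steps, which is in line with the reduction to about 30 steps described for version 1.2.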

Training infrastructure and system optimisation

The Open-Sora codebase sits on top of Colossal-AI, HPC-AI Tech's distributed-training framework that supplies tensor parallelism, pipeline parallelism, and sequence parallelism implementations.^[1]^[9] The 1.0 report explicitly highlights a 55 percent training-throughput improvement on 64-frame 512x512 videos versus a vanilla DiT baseline, attributed to fused attention kernels, an accelerated T5 path, and sequence parallelism on long token sequences.^[2] By the 2.0 release the codebase had added multi-resolution bucketing, memory-offloading inference paths that allow single-GPU inference of the 11B model in fp8 quantisation, and torch.compile integration.^[4]

The HuggingFace model card for Open-Sora-v2 reports inference times of about 60 seconds for 256x256 video on a single NVIDIA H100 (consuming 52.5 GB of memory), 1,656 seconds for 768x768 on a single GPU, and 276 seconds for 768x768 distributed across eight GPUs with sequence parallelism enabled.^[4] These figures support the project's claim that Open-Sora can be served on modest infrastructure, though the longer single-GPU inference times remain a barrier to interactive use.
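The multi-GPU figures imply a specific speedup and scaling efficiency, worked out below. The calculation uses only the two 768x768 timings quoted above; the variable names are introduced here for illustration.

```python
single_gpu_seconds = 1656   # 768x768 on one GPU
eight_gpu_seconds = 276     # 768x768 on eight GPUs with sequence parallelism
num_gpus = 8

speedup = single_gpu_seconds / eight_gpu_seconds
efficiency = speedup / num_gpus

print(speedup, efficiency)  # 6.0 0.75
```

A sixfold speedup on eight GPUs corresponds to 75 percent parallel efficiency, so the distributed path reduces wall-clock time to under five minutes but at a higher total GPU-time cost per clip.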

Data pipeline

A noteworthy contribution of the Open-Sora reports is the explicit description of the data preparation pipeline, which includes scene-cut detection, optical-flow scoring for motion, aesthetic scoring with a CLIP-based aesthetic predictor, and dense captioning using PLLaVA in the 1.2 era.^[12] In Open-Sora 2.0 the captioning step was upgraded to a multimodal LLM (the paper cites Qwen-VL among other captioners) and the developers apply explicit deduplication and a filter for static or near-static clips that would otherwise weaken motion priors.^[3] The full pipeline is included in the repository, which is unusual for video models: most commercial systems disclose only summary statistics about their datasets.^[3]^[4]
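The score-based filtering stage of such a pipeline might look like the sketch below. It is a hedged illustration, not the repository's code: the `Clip` class, the `keep` function, the file names, and both thresholds are hypothetical, and real scores would come from an aesthetic predictor and an optical-flow estimator rather than being typed in by hand.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    path: str
    aesthetic: float  # score from an aesthetic predictor (higher is better)
    flow: float       # mean optical-flow magnitude, used as a motion score


def keep(clip, min_aesthetic=4.5, min_flow=0.5):
    """Keep clips that look good enough and are not static or near-static.

    Both thresholds are hypothetical placeholders.
    """
    return clip.aesthetic >= min_aesthetic and clip.flow >= min_flow


clips = [
    Clip("a.mp4", aesthetic=5.2, flow=1.8),   # passes both filters
    Clip("b.mp4", aesthetic=5.0, flow=0.1),   # near-static, dropped
    Clip("c.mp4", aesthetic=3.9, flow=2.4),   # low aesthetic score, dropped
]
kept = [c.path for c in clips if keep(c)]
print(kept)  # ['a.mp4']
```

Tightening the thresholds between training stages is one way to move from a large, noisy corpus to a small high-quality subset, which matches the multi-stage curation described in the reports.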

How widely has Open-Sora been adopted?

The hpcaitech/Open-Sora repository accumulated tens of thousands of GitHub stars within months of the 1.0 release, and HPC-AI Tech has stated that the combined Colossal-AI and Open-Sora repositories crossed 60,000 stars by the Series A announcement in late 2024.^[9] The Open-Sora repository on its own held more than 29,000 GitHub stars by mid-2026.^[1] HuggingFace hosts the official weights for v1.0, v1.1, v1.2, v1.3, and v2.0 under the hpcai-tech organisation, with the v2.0 model alone receiving hundreds of community fine-tunes within weeks of release.^[4]

Open-Sora has been adopted as a baseline or reference implementation in several follow-up works in text-to-video generation, and the rapid iteration on architecture (STDiT to MMDiT-style) tracks broader changes in the open video community. The decision to release dollar-cost figures and training-data sizes alongside the weights has also influenced how other groups document their projects.^[3]^[4]

How does Open-Sora compare with other video models?

Open-Sora exists in a crowded landscape of open and closed video generation systems. The following comparison summarises the headline specifications of contemporaneous systems at the time of Open-Sora 2.0's release in March 2025.

System	Developer	Parameters	Max resolution / duration	License	Notable architecture choices
Open-Sora 2.0	HPC-AI Tech	11B	768x768, 5 s (128 frames)	Apache 2.0	MMDiT-style dual/single stream DiT, Video DC-AE, 3D RoPE^[3]^[4]
Sora	OpenAI	Not disclosed	1080p, up to 60 s	Closed	DiT on spacetime patches^[7]
HunyuanVideo	Tencent	13B	720p, 5 s	Open (Tencent Community licence)	Dual-stream to single-stream DiT, 3D VAE, flow-matching objective^[13]
Mochi 1	Genmo	10B	480p, about 5.4 s at 30 fps	Apache 2.0	Asymmetric DiT (AsymmDiT), full 3D attention^[14]
CogVideoX	Zhipu AI / Tsinghua THUDM	5B (largest open variant)	768x1360 up to 10 s at 16 fps	Open (research licence)	Expert transformer with 3D causal VAE^[15]
Open-Sora Plan v1.5	PKU-YuanGroup	About 8B	720p	MIT	Sparse 3D DiT (SUV), WFVAE, Ascend NPU training^[5]

Open-Sora 2.0's parameter count sits between Mochi 1 and HunyuanVideo. The developers' headline claim, that they reach human-preference parity with Tencent's 13B HunyuanVideo and StepFun's 30B Step-Video using a roughly ten-times smaller training budget, has been independently summarised in coverage by The Decoder and MarkTechPost, though no peer-reviewed third-party benchmarks of comparable scope are yet available.^[13]^[16] On VBench, the developers report a 0.69 percentage-point gap to Sora, narrowed from 4.52 percentage points in version 1.2; the absolute VBench numbers should be read cautiously because VBench scores depend strongly on prompt list selection and resolution settings.^[3]

How is Open-Sora different from Open-Sora Plan?

The "Open-Sora Plan" maintained by the PKU-YuanGroup is unrelated to Open-Sora despite the similar name. The PKU effort is led by a different organisation, is released under an MIT licence rather than Apache 2.0, and adopted a separate architectural lineage moving from 2+1D in versions 1.0 to 1.1 to dense 3D attention in 1.2 and then to sparse attention in 1.3 and later.^[5] Open-Sora Plan version 1.5, released on 5 June 2025, is an 8-billion-parameter model trained on about 40 million video samples that introduced a wavelet-based WFVAE with an 8x8x8 downsampling rate and a sparse diffusion transformer the authors call SUV; the developers report quality comparable to HunyuanVideo.^[5] It was also notable for being trained and served on Huawei Ascend 910-series NPUs rather than NVIDIA GPUs, using the mindspeed-mm framework.^[5] The two projects are sometimes confused in secondary coverage; the HPC-AI Tech README and the PKU GitHub explicitly note the distinction.^[1]^[5]

What is Open-Sora used for?

Open-Sora's significance for the open AI ecosystem lies less in being the highest-quality video model and more in being one of the few projects to disclose a complete recipe: data, code, weights, and explicit budget. Researchers can study the effect of swapping VAEs, schedulers, or text encoders without having to train a base model from scratch. The 1.2 and 2.0 reports have been cited as references for the design of training schedules, multi-stage data curation, and bucketing strategies in academic and industrial video diffusion work.^[3]^[12]

Practical use cases documented by the developers include short-form social-media content generation, story-boarding, advertising mock-ups, and image-to-video animation of still photographs through the I2V pipeline.^[1]^[4] The motion-score and aesthetic-score conditioning channels introduced in version 1.2 make it possible to dial up or down camera motion or visual fidelity at inference, a feature less common in closed commercial systems.^[6] The text-to-image-to-video pipeline added in 2.0 allows users to compose a still using Flux or a similar high-quality text-to-image model and then drive motion from that still, bypassing some of the prompt-following weaknesses of pure T2V.^[3]^[4]

Is Open-Sora free and open source?

Because the weights are Apache 2.0 licensed, Open-Sora can be redistributed and embedded in commercial products without royalties, in contrast to several open video models that ship under non-commercial or community-only licences.^[1]^[4] The repository publishes not only the checkpoints but also the training code and the data-preparation pipeline, so the models can be retrained or fine-tuned rather than only run for inference.^[1]^[3]

What are Open-Sora's limitations?

Despite the strong VBench numbers, Open-Sora has several documented limitations. Maximum output is restricted to 768x768 pixels and five seconds at 128 frames in version 2.0; this is well short of Sora's claimed up-to-one-minute output and of the 1080p modes offered by some commercial competitors.^[4]^[13] Motion coherence over longer clips, particularly for complex object interactions, remains weaker than for OpenAI's Sora and Google's Veo line in independent qualitative comparisons.^[13]

The "$200k" figure cited prominently in the 2.0 marketing and paper title refers to a single end-to-end training run priced at publicly listed cloud rates for H200 capacity. Critics have observed that the figure excludes infrastructure setup, failed runs, ablation experiments, and the cumulative cost of building the underlying datasets and tools.^[16] The total all-in cost is thus higher than the headline number, although it is still likely an order of magnitude below the multi-million-dollar figures associated with closed commercial training, according to HPC-AI Tech's blog post and several secondary outlets.^[13]^[16]

The choice of T5-XXL plus CLIP-Large for text conditioning, rather than a modern multimodal large language model, limits the model's ability to follow long compositional prompts. HunyuanVideo and several PKU Open-Sora Plan releases adopted MLLM-style text encoders for this reason.^[5]^[13] Open-Sora 2.0's text encoder is acknowledged in the paper as an area for future work.^[3]

Finally, the training data pipeline relies on publicly scraped video corpora, including derivatives of WebVid-10M and Panda-70M, whose copyright status for AI training is contested in some jurisdictions. The technical reports describe filtering and deduplication but do not attempt a comprehensive provenance analysis.^[12]

Related work

Open-Sora belongs to a fast-moving cluster of text-to-video generation systems built around the Diffusion Transformer (DiT) paradigm.

Closed commercial systems that compete on quality include OpenAI's Sora (and its successor Sora 2), Runway's Gen-3 Alpha, Kuaishou's Kling family, Google's Veo line, and Pika. Among open systems, the most direct comparisons are HunyuanVideo, Mochi 1, CogVideoX, and Alibaba's Wan 2.1, all of which use related but distinct DiT variants and 3D VAEs.^[13]^[14]^[15]

Architecturally, Open-Sora's evolution mirrors the trajectory of the wider community: from PixArt-style spatial DiT plus temporal attention, through Stable Diffusion 3's MMDiT (Multimodal Diffusion Transformer), to flow-matching and rectified-flow training objectives. The shift from absolute and learned positional embeddings to three-dimensional rotary position embedding (RoPE) is also visible across multiple open systems.^[3]^[13]

References

1. HPC-AI Tech, "Open-Sora: Democratizing Efficient Video Production for All (README)", GitHub, 2025-03-12. https://github.com/hpcaitech/Open-Sora. Accessed 2026-05-20.
2. HPC-AI Tech, "Open-Sora: Revealing Complete Model Parameters, Training Details, and Everything for Sora-like Video Generation Models", HPC-AI Tech Blog, 2024-03-17. https://company.hpc-ai.com/blog/open-sora-v1.0. Accessed 2026-05-20.
3. Zangwei Zheng, Xiangyu Peng, Chenhui Shen, Tom Young et al., "Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k", arXiv:2503.09642, 2025-03-12. https://arxiv.org/abs/2503.09642. Accessed 2026-05-20.
4. HPC-AI Tech, "Open-Sora-v2 Model Card", Hugging Face, 2025-03-12. https://huggingface.co/hpcai-tech/Open-Sora-v2. Accessed 2026-05-20.
5. PKU-YuanGroup, "Open-Sora Plan", GitHub, 2026-03-15. https://github.com/PKU-YuanGroup/Open-Sora-Plan. Accessed 2026-05-20.
6. HPC-AI Tech, "Open-Sora from HPC-AI Tech Team Continues Open Source: Generate Any 16-Second 720p HD Video with One Click", HPC-AI Tech Blog, 2024-06-17. https://company.hpc-ai.com/blog/open-sora-from-hpc-ai-tech-team-continues-open-source-generate-any-16-second-720p-hd-video-with-one-click-model-weights-ready-to-use. Accessed 2026-05-20.
7. OpenAI, "Video generation models as world simulators", OpenAI, 2024-02-15. https://openai.com/index/video-generation-models-as-world-simulators/. Accessed 2026-05-20.
8. HPC-AI Tech, "HPC-AI Tech Completes $6 Million Seed and Angel Round Fundraising", Medium / HPC-AI Tech, 2022-09-15. https://medium.com/@hpcaitech/hpc-ai-tech-completes-6-million-seed-and-angel-round-fundraising-led-by-bluerun-ventures-in-the-892468cc2b02. Accessed 2026-05-20.
9. HPC-AI Tech, "Singapore Startup HPC-AI Tech Secures 50 Million USD in Series A Funding to Build the Video Generation AI Model and GPU Platform", HPC-AI Tech Blog, 2024-09-30. https://company.hpc-ai.com/blog/singapore-startup-hpc-ai-tech-secures-50-million-usd-in-series-a-funding-to-build-the-video-generation-ai-model-and-gpu-platform. Accessed 2026-05-20.
10. Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen et al., "Open-Sora: Democratizing Efficient Video Production for All", arXiv:2412.20404, 2024-12-29. https://arxiv.org/abs/2412.20404. Accessed 2026-05-20.
11. HPC-AI Tech, "Open-Sora 1.1 Technical Report (report_02.md)", GitHub, 2024-04-25. https://github.com/hpcaitech/Open-Sora/blob/main/docs/report_02.md. Accessed 2026-05-20.
12. HPC-AI Tech, "Open-Sora 1.2 Technical Report (report_03.md)", GitHub, 2024-06-17. https://github.com/hpcaitech/Open-Sora/blob/main/docs/report_03.md. Accessed 2026-05-20.
13. Maximilian Schreiner, "Open-Sora 2.0 matches competitive AI video models at 90 percent lower training costs", The Decoder, 2025-03-19. https://the-decoder.com/open-sora-2-0-achieves-competitive-ai-video-quality-at-one-tenth-the-training-cost-of-commercial-models/. Accessed 2026-05-20.
14. Genmo, "Mochi 1: A new SOTA in open text-to-video", Genmo Blog, 2024-10-22. https://www.genmo.ai/blog/mochi-1-a-new-sota-in-open-text-to-video. Accessed 2026-05-20.
15. Zhuoyi Yang, Jiayan Teng, Wendi Zheng et al., "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer", arXiv:2408.06072, 2024-08-12. https://arxiv.org/abs/2408.06072. Accessed 2026-05-20.
16. Asif Razzaq, "HPC-AI Tech Releases Open-Sora 2.0: An Open-Source SOTA-Level Video Generation Model Trained for Just $200K", MarkTechPost, 2025-03-14. https://www.marktechpost.com/2025/03/14/hpc-ai-tech-releases-open-sora-2-0-an-open-source-sota-level-video-generation-model-trained-for-just-200k/. Accessed 2026-05-20.
