Open-source AI refers to artificial intelligence systems, particularly large language models (LLMs) and other machine learning models, whose weights, code, or both are made publicly available for inspection, use, modification, and redistribution. The concept borrows from the long tradition of open-source software but introduces new complexities specific to AI, most notably around training data transparency and the distinction between truly "open-source" models and those that merely release model weights under restrictive licenses.
As of 2026, open-source AI has become one of the most consequential forces shaping the AI industry. Models like Meta's LLaMA series, Mistral AI's Mistral and Mixtral models, Alibaba's Qwen, and DeepSeek's V3 and R1 have demonstrated that openly released models can rival proprietary systems in performance, while enabling researchers, startups, and governments to build AI capabilities without depending on a small number of closed providers [1].
The meaning of "open source" in the context of AI has been intensely debated since at least 2023. In traditional software, open source has a well-established definition maintained by the Open Source Initiative (OSI): the source code must be freely available, modifiable, and redistributable. Applying this definition to AI systems is more complicated because an AI model consists of multiple components, including training data, preprocessing code, training code, model architecture, and trained weights (parameters).
In October 2024, the OSI approved the Open Source AI Definition (OSAID) version 1.0 after a two-year drafting process involving thousands of contributors from academia, industry, and civil society [2]. The OSAID specifies that an AI system qualifies as open source only if it grants users the freedom to:

- **Use** the system for any purpose and without having to ask for permission
- **Study** how the system works and inspect its components
- **Modify** the system for any purpose, including to change its output
- **Share** the system for others to use, with or without modifications, for any purpose
To enable these freedoms, the OSAID requires that three categories of information be made available in the "preferred form for making modifications":
| Component | What must be provided | Examples |
|---|---|---|
| Data information | Sufficiently detailed description of training data to allow a skilled person to recreate a substantially equivalent system | Data sources, preprocessing steps, filtering criteria, labeling methods |
| Code | Complete source code used to train and run the system | Training scripts, inference code, evaluation code, supporting libraries |
| Parameters | Model weights and configuration | Weights files, optimizer states, hyperparameter settings |
Notably, the OSAID does not require releasing the actual training data itself; it requires enough information about the data to enable recreation. This was a pragmatic compromise, as many training datasets contain copyrighted material, personal data, or proprietary content that cannot legally be redistributed [2].
Many widely used models, including Meta's LLaMA series and Google's Gemma, release trained weights but not the full training data or complete training code. These are commonly referred to as "open weights" models. While they allow users to run, fine-tune, and deploy the models, they do not meet the OSAID's criteria for open-source AI because users cannot fully reproduce the training process.
Meta has been the most prominent company to use the term "open source" for what are technically open-weights releases. The Llama Community License allows broad use but imposes restrictions on organizations with more than 700 million monthly active users and prohibits using the model outputs to train competing models [3]. Critics argue that calling these models "open source" dilutes the term and misleads developers about their actual rights.
As of early 2026, the OSI recognizes only a handful of models as fully compliant with the OSAID, including Pythia (EleutherAI), OLMo (AI2), Amber and CrystalCoder (LLM360), and T5 (Google) [2].
The history of open-source AI spans roughly a decade, with key milestones in frameworks, model releases, and community-driven efforts.
The modern open-source AI movement began with the release of deep learning frameworks. In November 2015, Google open-sourced TensorFlow, a general-purpose machine learning framework that quickly became the dominant tool for training and deploying neural networks. TensorFlow's release democratized access to deep learning techniques that had previously been confined to well-resourced research labs [4].
In September 2016, Facebook AI Research (now Meta AI) released PyTorch, an alternative framework that prioritized flexibility and ease of use for researchers. PyTorch's dynamic computation graphs and Pythonic design made it popular in academic settings, and by 2020 it had overtaken TensorFlow as the preferred framework for research. OpenAI officially switched from TensorFlow to PyTorch in early 2020, and most major open-source LLMs developed since then have been built on PyTorch [5].
The paradigm shifted from open-sourcing tools to open-sourcing trained models with Google's release of BERT (Bidirectional Encoder Representations from Transformers) in October 2018. BERT was a pre-trained transformer model that achieved state-of-the-art results on 11 natural language processing benchmarks. By releasing the model weights and training code, Google enabled thousands of researchers to fine-tune BERT for their own tasks, sparking an explosion of NLP research [6].
In February 2019, OpenAI announced GPT-2, a 1.5 billion parameter language model, but controversially chose not to release the full model initially, citing concerns about potential misuse for generating disinformation. OpenAI released increasingly larger versions over the following months, eventually publishing the full model in November 2019. The GPT-2 episode marked the beginning of a recurring tension in the field: the desire for openness versus the fear of enabling harmful applications [7].
In July 2022, the BigScience research workshop, coordinated by Hugging Face, released BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), a 176 billion parameter model trained on 46 natural languages and 13 programming languages. BLOOM was notable for its genuinely open development process: over 1,000 researchers from 60 countries contributed, and the training data, code, and model weights were all publicly released under the RAIL (Responsible AI Licenses) framework [8].
Meta released LLaMA (Large Language Model Meta AI) in February 2023 as a collection of models ranging from 7 billion to 65 billion parameters. Access was initially restricted to approved researchers. Within a week, however, the model weights were leaked online via a torrent posted to 4chan [9]. The leak proved to be a watershed moment for open-source AI. Developers and researchers worldwide gained access to a high-quality base model, and a flourishing ecosystem of fine-tuned variants emerged almost immediately. Projects like Alpaca (Stanford), Vicuna, and WizardLM demonstrated that fine-tuning LLaMA on relatively small instruction-following datasets could produce capable chatbots at a fraction of the cost of training from scratch.
Meta subsequently embraced open releases more fully, launching LLaMA 2 in July 2023 with an explicit community license that allowed commercial use for most organizations [3].
The period from mid-2023 through 2025 saw an explosion of open-weight model releases from companies worldwide. Mistral AI, a French startup founded by former Google DeepMind and Meta researchers, released Mistral 7B in September 2023 under the Apache 2.0 license, demonstrating that a small team could build a model competitive with much larger proprietary ones [10]. The Technology Innovation Institute (TII) in Abu Dhabi released the Falcon series, while Chinese companies including Alibaba (Qwen), 01.AI (Yi), and DeepSeek released increasingly powerful open-weight models.
DeepSeek's releases in late 2024 and January 2025 were particularly notable. DeepSeek-V3, a 671 billion parameter mixture-of-experts model, was reportedly trained for approximately $6 million, a fraction of the estimated $100 million cost for GPT-4. DeepSeek-R1, released in January 2025, demonstrated strong reasoning capabilities and was released under the MIT License [11].
The following table summarizes the most significant openly released models as of early 2026.
| Model | Developer | Release date | Parameters | License | Notes |
|---|---|---|---|---|---|
| BLOOM | BigScience / Hugging Face | July 2022 | 176B | RAIL License | 46 languages; first large-scale collaborative open LLM |
| LLaMA | Meta | February 2023 | 7B to 65B | Research-only (leaked) | Weights leaked; catalyzed open LLM ecosystem |
| Falcon | TII (Abu Dhabi) | May 2023 (180B: September 2023) | 7B, 40B, 180B | Apache 2.0 (7B, 40B); separate TII license (180B) | Topped the Hugging Face Open LLM Leaderboard at release |
| MPT | MosaicML (Databricks) | May 2023 | 7B, 30B | Apache 2.0 | Commercially permissive; ALiBi attention |
| LLaMA 2 | Meta | July 2023 | 7B, 13B, 70B | Llama Community License | Commercial use allowed; 700M MAU restriction |
| Mistral 7B | Mistral AI | September 2023 | 7B | Apache 2.0 | Outperformed LLaMA 2 13B on benchmarks |
| Mixtral 8x7B | Mistral AI | December 2023 | 46.7B (12.9B active) | Apache 2.0 | Mixture of experts; competitive with GPT-3.5 |
| Yi | 01.AI | November 2023 | 6B, 34B | Yi License | Strong multilingual performance |
| Qwen 1.5 | Alibaba Cloud | February 2024 | 0.5B to 110B | Various (Apache 2.0 for some) | Strong Chinese and English performance |
| Gemma | Google DeepMind | February 2024 | 2B, 7B | Gemma Terms of Use | Lightweight; derived from Gemini research |
| LLaMA 3 | Meta | April 2024 | 8B, 70B | Llama 3 Community License | Significant performance improvement |
| Phi-3 | Microsoft | April 2024 | 3.8B, 7B, 14B | MIT License | Small model; strong on benchmarks relative to size |
| LLaMA 3.1 | Meta | July 2024 | 8B, 70B, 405B | Llama 3.1 Community License | 128K context; 405B was largest open-weight model |
| Mistral Large 2 | Mistral AI | July 2024 | 123B | Mistral Research License | 128K context window |
| DeepSeek-V3 | DeepSeek | December 2024 | 671B (37B active) | MIT License | MoE; trained for ~$6M; competitive with GPT-4 |
| DeepSeek-R1 | DeepSeek | January 2025 | 671B (37B active) | MIT License | Reasoning model; open-weight chain-of-thought |
| Qwen 3 | Alibaba Cloud | April 2025 | 0.6B to 235B | Apache 2.0 | MoE variants; 29+ languages; up to 1M context |
| LLaMA 4 | Meta | April 2025 | Maverick: 400B (17B active) | Llama 4 Community License | MoE; 1M context window; 128 experts |
| OLMo 2 | AI2 | 2025 | 7B, 13B | Apache 2.0 | Fully open: data, code, weights, training logs |
| Gemma 2 | Google DeepMind | 2024 | 9B, 27B | Gemma Terms of Use | Improved efficiency; knowledge distillation |
The licensing landscape for open AI models is more varied and complex than traditional open-source software licensing.
Apache 2.0 is the most common license for genuinely permissive open-weight models. It allows free use, modification, and distribution for any purpose, including commercial applications, requiring only attribution and a notice of changes. Models released under Apache 2.0 include Mistral 7B, Mixtral, Falcon, and Qwen 3 [10].
MIT License is similarly permissive and is used by DeepSeek for its V3 and R1 models and by Microsoft for the Phi series. The MIT License places minimal restrictions on use and redistribution [11].
Llama Community License is Meta's bespoke license for its LLaMA model family. It permits commercial use but includes two notable restrictions: organizations with more than 700 million monthly active users must request a separate license from Meta, and users may not use LLaMA outputs to improve other language models [3]. This license has been updated with each LLaMA release but retains these core restrictions.
Gemma Terms of Use allows free use and redistribution of Google's Gemma models but prohibits certain applications and requires users to comply with Google's acceptable use policy.
RAIL (Responsible AI Licenses) were developed by the BigScience project for BLOOM and subsequent models. RAIL licenses are "use-based," meaning they are permissive in general but restrict specific harmful applications such as generating disinformation, surveillance, or discrimination [8].
| License | Commercial use | Redistribution | Training data required | Notable restrictions |
|---|---|---|---|---|
| Apache 2.0 | Yes | Yes | No | Attribution required |
| MIT License | Yes | Yes | No | Minimal restrictions |
| Llama Community License | Yes (with limits) | Yes | No | 700M MAU threshold; no competing model training |
| Gemma Terms of Use | Yes | Yes | No | Must follow acceptable use policy |
| RAIL License | Yes | Yes | No | Prohibits specific harmful use cases |
Open models allow researchers at universities and smaller institutions to study, experiment with, and build upon state-of-the-art AI systems without paying for expensive API access or training their own models from scratch. The availability of LLaMA and its derivatives accelerated academic AI research measurably, with thousands of papers published using these models within months of their release [9].
Organizations can fine-tune open models on their own domain-specific data, creating specialized systems for healthcare, legal, financial, or other applications. This level of customization is difficult or impossible with closed API-only models, where fine-tuning options are limited and the underlying model cannot be modified.
Running an open model on one's own hardware or preferred cloud provider can be significantly cheaper than paying per-token API fees, especially at scale. For applications processing millions of tokens per day, self-hosting an open model can reduce costs by an order of magnitude compared to proprietary APIs.
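The order-of-magnitude claim can be made concrete with a back-of-the-envelope comparison. The sketch below is illustrative only: the per-token rate, GPU rental price, and workload are assumptions, not quotes from any actual provider.

```python
# Back-of-the-envelope comparison: metered API pricing vs. self-hosting.
# All prices and the workload size are illustrative assumptions.

def api_cost_per_day(tokens_per_day: int, usd_per_million_tokens: float) -> float:
    """Daily cost of a pay-per-token API at a flat per-million-token rate."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens

def self_host_cost_per_day(gpu_count: int, usd_per_gpu_hour: float) -> float:
    """Daily cost of renting GPUs around the clock to serve an open model."""
    return gpu_count * usd_per_gpu_hour * 24

# Hypothetical workload: 50 million tokens per day.
api = api_cost_per_day(50_000_000, usd_per_million_tokens=10.0)      # $500/day
hosted = self_host_cost_per_day(gpu_count=2, usd_per_gpu_hour=2.0)   # $96/day

print(f"API: ${api:.0f}/day, self-hosted: ${hosted:.0f}/day "
      f"({api / hosted:.1f}x difference)")
```

Under these assumptions self-hosting is roughly 5x cheaper; with larger workloads or cheaper hardware the gap widens toward the order-of-magnitude figure, while at low volumes the always-on GPU cost favors the API instead.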
Open models can be deployed on-premises or in private cloud environments, ensuring that sensitive data never leaves an organization's control. This is critical for industries like healthcare, finance, and government, where regulatory requirements often prohibit sending data to third-party APIs [1].
Open access to model weights (and in some cases training data and code) allows independent security researchers, ethicists, and regulators to audit AI systems for biases, vulnerabilities, and safety issues. This transparency is an important counterbalance to the "trust us" approach of closed model providers.
Once model weights are publicly released, they cannot be taken back. Bad actors can fine-tune open models to remove safety guardrails, generate disinformation, create malware, or produce other harmful content. This concern was central to OpenAI's initial decision to withhold GPT-2 and has remained a major argument against open releases [7].
Closed model providers implement safety measures at the API level, including content filtering, rate limiting, and usage monitoring. Open models bypass all of these controls. While responsible developers include safety training (such as RLHF) in their releases, determined users can remove these safeguards through fine-tuning.
The legal landscape around open-weight AI models remains uncertain. Questions about who is liable when an open model is used to cause harm, whether model creators have a duty of care, and how export controls apply to model weights are still being debated by policymakers in the United States, European Union, and elsewhere [2].
While open models are "free" in terms of licensing, running large models still requires substantial computational resources. A 70 billion parameter model requires multiple high-end GPUs for inference, and training or fine-tuning from scratch demands even more. This creates a de facto barrier to entry that limits who can meaningfully use the largest open models.
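The hardware barrier follows directly from parameter counts: at 16-bit precision each parameter occupies two bytes, so the weights of a 70 billion parameter model alone take about 140 GB, more than any single common accelerator holds. A rough sizing sketch (the 80 GB per-GPU capacity is an assumption modeled on current high-end data-center cards, and real deployments need extra memory for the KV cache and activations):

```python
import math

def min_gpus_for_weights(params_billions: float, bytes_per_param: float,
                         gpu_memory_gb: float = 80.0) -> int:
    """Minimum GPUs needed just to hold the model weights.

    Ignores KV cache, activations, and framework overhead, which add
    substantially more memory in practice."""
    weight_gb = params_billions * bytes_per_param  # 1e9 params * bytes = GB
    return math.ceil(weight_gb / gpu_memory_gb)

# 70B model at fp16 (2 bytes/param): 140 GB of weights -> at least 2 GPUs.
print(min_gpus_for_weights(70, 2))
# 4-bit quantization (0.5 bytes/param): 35 GB -> fits on a single 80 GB GPU.
print(min_gpus_for_weights(70, 0.5))
```

This is also why quantization has become central to the open-model ecosystem: reducing bytes per parameter is often the difference between needing a multi-GPU server and running on a single card.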
Hugging Face has emerged as the central hub for the open-source AI community. Founded in 2016 as a chatbot company, it pivoted to become a platform for sharing models, datasets, and applications. By 2025, Hugging Face had grown to over 11 million users, more than 2 million public models, and over 500,000 public datasets [12].
The platform's Transformers library provides a unified API for loading and running thousands of pre-trained models, making it trivial for developers to experiment with different architectures. Hugging Face also hosts the Open LLM Leaderboard, which benchmarks open models on standardized evaluation suites and has become the de facto scoreboard for the open-source LLM community.
Over 30% of Fortune 500 companies maintain verified accounts on Hugging Face, signaling that open-source AI has moved from an academic curiosity to a core enterprise technology. Download patterns reveal heavy concentration: the top 200 most downloaded models (0.01% of all models) account for approximately 49.6% of all downloads, though long-tail specialized models serve important niche communities [12].
Open-source AI has fundamentally altered the competitive dynamics of the AI industry.
The rapid improvement of open models has put downward pressure on API pricing for proprietary models. When Mistral 7B demonstrated performance comparable to LLaMA 2 13B and competitive with early GPT-3.5, it signaled that small teams with modest budgets could produce commercially viable models. DeepSeek's ability to train frontier-class models for a fraction of typical costs further reinforced this dynamic [11].
Open-source AI has become a geopolitical consideration. Chinese companies like Alibaba, DeepSeek, and 01.AI have released powerful open-weight models, raising questions in Western policy circles about supply chain dependencies and the implications of broadly available AI capabilities. Conversely, open releases have been framed as a tool for AI sovereignty, allowing countries to build domestic AI capabilities without reliance on foreign providers.
Meta's decision to release LLaMA openly was partly strategic. By establishing LLaMA as a widely used base model, Meta benefits from community-driven improvements, ecosystem development, and the normalization of its preferred architectures. The open approach also helps Meta recruit talent and compete with Google and OpenAI without having to build a consumer-facing API business [3].
As of early 2026, the open-source AI ecosystem is thriving but faces ongoing challenges. The gap between the best open-weight models and the best closed models has narrowed substantially, with models like Qwen 3 and DeepSeek-R1 performing competitively on many benchmarks. The OSAID is undergoing its first review cycle, with a planned update by Q4 2026 to address issues identified through monitoring of evolving industry practices [2].
The trend toward mixture-of-experts architectures has made large models more practical to deploy, since only a fraction of parameters are active for any given query. LLaMA 4 Maverick, for example, has 400 billion total parameters but only 17 billion active per forward pass, making it feasible to run on more modest hardware than its total parameter count would suggest [3].
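The efficiency argument can be quantified: per-token inference compute scales roughly with the number of *active* parameters, while memory requirements still scale with *total* parameters. A sketch using the figures cited above (the "compute proportional to active parameters" approximation is a standard rule of thumb, not a measured benchmark):

```python
def moe_compute_ratio(total_b: float, active_b: float) -> float:
    """Approximate fraction of a same-sized dense model's per-token compute
    that a mixture-of-experts model uses, assuming compute ~ active params."""
    return active_b / total_b

# Total and active parameter counts (billions) from the models discussed above.
models = {
    "DeepSeek-V3": (671, 37),
    "Mixtral 8x7B": (46.7, 12.9),
    "LLaMA 4 Maverick": (400, 17),
}
for name, (total, active) in models.items():
    ratio = moe_compute_ratio(total, active)
    print(f"{name}: {active}B of {total}B active -> "
          f"~{ratio:.0%} of dense per-token compute")
```

By this estimate DeepSeek-V3 and LLaMA 4 Maverick each use only around 5% of the per-token compute of a dense model of the same total size, though all 400-671 billion parameters must still fit in memory.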
Meanwhile, the community continues to push the boundaries of what open models can achieve, with active development in areas like multimodal understanding, long-context processing, and reasoning capabilities. The question of what truly counts as "open source" in AI remains unresolved, but the practical impact of openly released models on research, industry, and society is undeniable.