Open-source AI
Last reviewed
Sources
16 citations
Review status
Source-backed
Revision
v5 · 3,832 words
Open-source AI is artificial intelligence released so that its weights, code, or both can be used, inspected, modified, and redistributed by anyone. The term covers everything from fully open systems that publish training data, training code, and weights, to the far more common "open-weight" models like Meta's Llama and DeepSeek that release only the trained parameters under a license. The first formal standard, the Open Source Initiative's Open Source AI Definition (OSAID) version 1.0, was published on October 28, 2024, and requires that a system grant the freedoms to "use, study, modify and share" while making available its data information, code, and parameters [2].
As of 2026, open-source AI has become one of the most consequential forces shaping the AI industry. Models like Meta's LLaMA series, Mistral models, Alibaba's Qwen, and DeepSeek have demonstrated that openly released models can rival proprietary systems in performance, while enabling researchers, startups, and governments to build AI capabilities without depending on a small number of closed providers [1]. The shift is measurable on Hugging Face, the field's main distribution hub: by 2025 the platform hosted more than 2 million public models, and in its Spring 2026 review it reported that Chinese open models had overtaken U.S. ones to account for roughly 41% of all platform downloads [12].
The meaning of "open source" in the context of AI has been intensely debated since at least 2023. In traditional software, open source has a well-established definition maintained by the Open Source Initiative (OSI): the source code must be freely available, modifiable, and redistributable. Applying this definition to AI systems is more complicated because an AI model consists of multiple components, including training data, preprocessing code, training code, model architecture, and trained weights (parameters).
The OSI Board of Directors approved the Open Source AI Definition (OSAID) version 1.0 on October 28, 2024, announcing it at the All Things Open 2024 conference in Raleigh, North Carolina, after a two-year, multi-stakeholder co-design process that drew on developers, data scientists, legal experts, policymakers, and end users worldwide [2][13]. The OSAID defines the category through four freedoms: "An Open Source AI is an AI system made available under terms and in a way that grant the freedoms to: Use the system for any purpose and without having to ask for permission. Study how the system works and inspect its components. Modify the system for any purpose, including to change its output. Share the system for others to use with or without modifications, for any purpose" [13].
Those four freedoms, to use, study, modify, and share, are the requirements an AI system must satisfy.
OSI executive director Stefano Maffulli framed the difficulty of reaching agreement on these terms directly: "Arriving at today's OSAID version 1.0 was a difficult journey, filled with new challenges for the OSI community," he said, describing "a delicate process, filled with differing opinions and uncharted technical frontiers" [13].
To enable these freedoms, the OSAID requires that three categories of information be made available in the "preferred form for making modifications":
| Component | What must be provided | Examples |
|---|---|---|
| Data information | Sufficiently detailed description of training data to allow a skilled person to recreate a substantially equivalent system | Data sources, preprocessing steps, filtering criteria, labeling methods |
| Code | Complete source code used to train and run the system | Training scripts, inference code, evaluation code, supporting libraries |
| Parameters | Model weights and configuration | Weights files, optimizer states, hyperparameter settings |
Notably, the OSAID does not require releasing the actual training data itself; it requires enough information about the data to enable recreation. This was a pragmatic compromise, as many training datasets contain copyrighted material, personal data, or proprietary content that cannot legally be redistributed [2].
Many widely used models, including Meta's LLaMA series and Google's Gemma, release trained weights but not the full training data or complete training code. These are commonly referred to as "open weights" models. While they allow users to run, fine-tune, and deploy the models, they do not meet the OSAID's criteria for open-source AI because users cannot fully reproduce the training process.
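The distinction between open weights and open-source AI can be sketched as a simple check over the three OSAID components and the license terms. This is a rough illustration only: the `Release` record, its field names, and the `classify` labels are hypothetical, and the result is not an official OSI determination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Hypothetical record of what a model release makes available."""
    name: str
    data_information: bool              # detailed description of the training data
    code: bool                          # source code used to train and run the system
    parameters: bool                    # weights and configuration
    license_grants_four_freedoms: bool  # use, study, modify, share


def classify(release: Release) -> str:
    """Return a rough label for a release based on what it provides."""
    has_all_components = (
        release.data_information and release.code and release.parameters
    )
    if has_all_components and release.license_grants_four_freedoms:
        return "candidate open-source AI"
    if release.parameters:
        return "open weights"
    return "closed"


# Illustrative inputs only; the field values are assumptions, not audited facts.
weights_only = Release("example-weights-only", False, False, True, False)
fully_open = Release("example-fully-open", True, True, True, True)

print(classify(weights_only))  # open weights
print(classify(fully_open))    # candidate open-source AI
```

A real assessment also has to read the license text itself, which a boolean flag cannot capture.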
Meta has been the most prominent company to use the term "open source" for what are technically open-weights releases. The Llama Community License allows broad use but imposes restrictions on organizations with more than 700 million monthly active users and prohibits using the model outputs to train competing models [3]. Critics argue that calling these models "open source" dilutes the term and misleads developers about their actual rights.
As of early 2026, the OSI recognizes only a handful of models as fully compliant with the OSAID: Pythia (EleutherAI), OLMo (AI2), Amber and CrystalCoder (LLM360), and T5 (Google) [2][14]. The OSI has separately noted that several prominent models, including Llama 2, Grok, Phi-2, and Mixtral, were analyzed and do not pass because they lack required components or their legal terms are incompatible with open-source principles, while BLOOM, StarCoder2, and Falcon could likely qualify with license changes [14].
The history of open-source AI spans roughly a decade, with key milestones in frameworks, model releases, and community-driven efforts.
The modern open-source AI movement began with the release of deep learning frameworks. In November 2015, Google open-sourced TensorFlow, a general-purpose machine learning framework that quickly became the dominant tool for training and deploying neural networks. TensorFlow's release democratized access to deep learning techniques that had previously been confined to well-resourced research labs [4].
In September 2016, Facebook AI Research (now Meta AI) released PyTorch, an alternative framework that prioritized flexibility and ease of use for researchers. PyTorch's dynamic computation graphs and Pythonic design made it popular in academic settings, and by 2020 it had overtaken TensorFlow as the preferred framework for research. OpenAI officially switched from TensorFlow to PyTorch in early 2020, and most major open-source LLMs developed since then have been built on PyTorch [5].
The paradigm shifted from open-sourcing tools to open-sourcing trained models with Google's release of BERT (Bidirectional Encoder Representations from Transformers) in October 2018. BERT was a pre-trained transformer model that achieved state-of-the-art results on 11 natural language processing benchmarks. By releasing the model weights and training code, Google enabled thousands of researchers to fine-tune BERT for their own tasks, sparking an explosion of NLP research [6].
In February 2019, OpenAI announced GPT-2, a 1.5 billion parameter language model, but controversially chose not to release the full model initially, citing concerns about potential misuse for generating disinformation. OpenAI released progressively larger versions over the following months, eventually publishing the full model in November 2019. The GPT-2 episode marked the beginning of a recurring tension in the field: the desire for openness versus the fear of enabling harmful applications [7].
In July 2022, the BigScience research workshop, coordinated by Hugging Face, released BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), a 176 billion parameter model trained on 46 natural languages and 13 programming languages. BLOOM was notable for its genuinely open development process: over 1,000 researchers from 60 countries contributed, and the training data, code, and model weights were all publicly released under the RAIL (Responsible AI Licenses) framework [8].
Meta released LLaMA (Large Language Model Meta AI) in February 2023 as a collection of models ranging from 7 billion to 65 billion parameters. Access was initially restricted to approved researchers. Within a week, however, the model weights were leaked online via a torrent posted to 4chan [9]. The leak proved to be a watershed moment for open-source AI. Developers and researchers worldwide gained access to a high-quality base model, and a flourishing ecosystem of fine-tuned variants emerged almost immediately. Projects like Alpaca (Stanford), Vicuna, and WizardLM demonstrated that fine-tuning LLaMA on relatively small instruction-following datasets could produce capable chatbots at a fraction of the cost of training from scratch.
Meta subsequently embraced open releases more fully, launching LLaMA 2 in July 2023 with an explicit community license that allowed commercial use for most organizations [3].
The period from mid-2023 through 2025 saw an explosion of open-weight model releases from companies worldwide. Mistral AI, a French startup founded by former Google DeepMind and Meta researchers, released Mistral 7B in September 2023 under the Apache 2.0 license, demonstrating that a small team could build a model competitive with much larger proprietary ones [10]. The Technology Innovation Institute (TII) in Abu Dhabi released the Falcon series, while Chinese companies including Alibaba (Qwen), 01.AI (Yi), and DeepSeek released increasingly powerful open-weight models.
DeepSeek's releases in late 2024 and January 2025 were particularly notable. DeepSeek-V3, a 671 billion parameter mixture-of-experts model (37 billion parameters active per token), was trained on 14.8 trillion tokens over roughly 55 days using a cluster of 2,048 NVIDIA H800 GPUs, at an estimated cost of about $5.6 million, a fraction of the estimated $100 million cost for GPT-4 [11][15]. DeepSeek noted in its technical report that this figure covers only the official training run and excludes prior research and ablation experiments [15]. DeepSeek-R1, released on January 20, 2025, demonstrated strong reasoning capabilities comparable to OpenAI's o1 and was released under the MIT License alongside its weights and a technical report [11][16].
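The training-cost figure can be sanity-checked with simple arithmetic from the numbers quoted above. The sketch below assumes a rental price of $2 per H800 GPU-hour, which is not stated in this article and should be treated as an assumption; the other inputs come from the paragraph above.

```python
# Figures quoted in the text above.
NUM_GPUS = 2_048
TRAINING_DAYS = 55          # "roughly 55 days"
REPORTED_COST_USD = 5.6e6   # "about $5.6 million"
GPT4_ESTIMATE_USD = 100e6   # "estimated $100 million"

# Assumption (not stated in the text): rental price per H800 GPU-hour.
ASSUMED_USD_PER_GPU_HOUR = 2.0

gpu_hours = NUM_GPUS * TRAINING_DAYS * 24
estimated_cost = gpu_hours * ASSUMED_USD_PER_GPU_HOUR
cost_ratio = GPT4_ESTIMATE_USD / REPORTED_COST_USD

print(f"GPU-hours: {gpu_hours:,}")                            # GPU-hours: 2,703,360
print(f"Estimated cost: ${estimated_cost / 1e6:.1f}M")        # Estimated cost: $5.4M
print(f"GPT-4 estimate / reported cost: {cost_ratio:.1f}x")   # 17.9x
```

Under these assumptions the estimate lands within a few percent of the reported figure, which is consistent with "roughly 55 days" being a rounded duration.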
The following table summarizes the most significant openly released models as of early 2026.
| Model | Developer | Release date | Parameters | License | Notes |
|---|---|---|---|---|---|
| BLOOM | BigScience / Hugging Face | July 2022 | 176B | RAIL License | 46 languages; first large-scale collaborative open LLM |
| LLaMA | Meta | February 2023 | 7B to 65B | Research-only (leaked) | Weights leaked; catalyzed open LLM ecosystem |
| Falcon | TII (Abu Dhabi) | May 2023 (180B: September 2023) | 7B, 40B, 180B | Apache 2.0 (7B, 40B); TII license (180B) | Top of Hugging Face leaderboard at release |
| MPT | MosaicML (Databricks) | May 2023 | 7B, 30B | Apache 2.0 | Commercially permissive; ALiBi attention |
| LLaMA 2 | Meta | July 2023 | 7B, 13B, 70B | Llama Community License | Commercial use allowed; 700M MAU restriction |
| Mistral 7B | Mistral AI | September 2023 | 7B | Apache 2.0 | Outperformed LLaMA 2 13B on benchmarks |
| Mixtral 8x7B | Mistral AI | December 2023 | 46.7B (12.9B active) | Apache 2.0 | Mixture of experts; competitive with GPT-3.5 |
| Yi | 01.AI | November 2023 | 6B, 34B | Yi License | Strong multilingual performance |
| Qwen 1.5 | Alibaba Cloud | February 2024 | 0.5B to 110B | Various (Apache 2.0 for some) | Strong Chinese and English performance |
| Gemma | Google DeepMind | February 2024 | 2B, 7B | Gemma Terms of Use | Lightweight; derived from Gemini research |
| LLaMA 3 | Meta | April 2024 | 8B, 70B | Llama 3 Community License | Significant performance improvement |
| Phi-3 | Microsoft | April 2024 | 3.8B, 7B, 14B | MIT License | Small model; strong on benchmarks relative to size |
| LLaMA 3.1 | Meta | July 2024 | 8B, 70B, 405B | Llama 3.1 Community License | 128K context; 405B was largest open-weight model |
| Mistral Large 2 | Mistral AI | July 2024 | 123B | Mistral Research License | 128K context window |
| DeepSeek-V3 | DeepSeek | December 2024 | 671B (37B active) | MIT License | MoE; trained for ~$5.6M on 14.8T tokens; competitive with GPT-4o |
| DeepSeek-R1 | DeepSeek | January 2025 | 671B (37B active) | MIT License | Reasoning model; open-weight chain-of-thought; rivals OpenAI o1 |
| Qwen 3 | Alibaba Cloud | April 2025 | 0.6B to 235B | Apache 2.0 | MoE variants; 29+ languages; up to 1M context |
| LLaMA 4 | Meta | April 2025 | Maverick: 400B (17B active) | Llama 4 Community License | MoE; 1M context window; 128 experts |
| OLMo 2 | AI2 | November 2024 | 7B, 13B | Apache 2.0 | Fully open: data, code, weights, training logs |
| Gemma 2 | Google DeepMind | June 2024 | 9B, 27B | Gemma Terms of Use | Improved efficiency; knowledge distillation |
The licensing landscape for open AI models is more varied and complex than traditional open-source software licensing.
Apache 2.0 is the most common license for genuinely permissive open-weight models. It allows free use, modification, and distribution for any purpose, including commercial applications, requiring only attribution and a notice of changes. Models released under Apache 2.0 include Mistral 7B, Mixtral, Falcon, and Qwen 3 [10].
MIT License is similarly permissive and is used by DeepSeek for its V3 and R1 models and by Microsoft for the Phi series. The MIT License places minimal restrictions on use and redistribution [11].
Llama Community License is Meta's bespoke license for its LLaMA model family. It permits commercial use but includes two notable restrictions: organizations with more than 700 million monthly active users must request a separate license from Meta, and users may not use LLaMA outputs to improve other language models [3]. This license has been updated with each LLaMA release but retains these core restrictions.
Gemma Terms of Use allows free use and redistribution of Google's Gemma models but prohibits certain applications and requires users to comply with Google's acceptable use policy.
The RAIL (Responsible AI Licenses) family was developed by the BigScience project for BLOOM and subsequent models. RAIL licenses are "use-based," meaning they are permissive in general but prohibit specific harmful applications, such as generating disinformation, conducting surveillance, or enabling discrimination [8].
| License | Commercial use | Redistribution | Training data release required | Notable restrictions |
|---|---|---|---|---|
| Apache 2.0 | Yes | Yes | No | Attribution required |
| MIT License | Yes | Yes | No | Minimal restrictions |
| Llama Community License | Yes (with limits) | Yes | No | 700M MAU threshold; no competing model training |
| Gemma Terms of Use | Yes | Yes | No | Must follow acceptable use policy |
| RAIL License | Yes | Yes | No | Prohibits specific harmful use cases |
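The comparison above can also be encoded as data, which is how a compliance script might screen candidate models before a legal review. The sketch below is a simplification of the table: the dictionary layout and both function names are hypothetical, and the 700 million threshold is the one described for the Llama Community License.

```python
# Monthly-active-user threshold described for the Llama Community License.
LLAMA_MAU_THRESHOLD = 700_000_000

# The table above, encoded as data. Keys and field names are illustrative.
LICENSE_TERMS = {
    "Apache 2.0": {"commercial_use": True, "restrictions": ["attribution required"]},
    "MIT License": {"commercial_use": True, "restrictions": []},
    "Llama Community License": {
        "commercial_use": True,
        "restrictions": ["700M MAU threshold", "no competing model training"],
    },
    "Gemma Terms of Use": {"commercial_use": True, "restrictions": ["acceptable use policy"]},
    "RAIL License": {"commercial_use": True, "restrictions": ["prohibited harmful use cases"]},
}


def needs_separate_meta_license(monthly_active_users: int) -> bool:
    """True if an organization exceeds the MAU threshold and must ask Meta for a license."""
    return monthly_active_users > LLAMA_MAU_THRESHOLD


def licenses_without_use_restrictions() -> list[str]:
    """Licenses whose only listed condition, if any, is attribution."""
    return [
        name
        for name, terms in LICENSE_TERMS.items()
        if set(terms["restrictions"]) <= {"attribution required"}
    ]


print(needs_separate_meta_license(50_000_000))  # False
print(licenses_without_use_restrictions())      # ['Apache 2.0', 'MIT License']
```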
Open models allow researchers at universities and smaller institutions to study, experiment with, and build upon state-of-the-art AI systems without paying for expensive API access or training their own models from scratch. The availability of LLaMA and its derivatives accelerated academic AI research measurably, with thousands of papers published using these models within months of their release [9].
Organizations can fine-tune open models on their own domain-specific data, creating specialized systems for healthcare, legal, financial, or other applications. This level of customization is difficult or impossible with closed API-only models, where fine-tuning options are limited and the underlying model cannot be modified.
Running an open model on one's own hardware or preferred cloud provider can be significantly cheaper than paying per-token API fees, especially at scale. For applications processing millions of tokens per day, self-hosting can cut costs substantially, in favorable cases by as much as an order of magnitude, although the savings depend on hardware prices and on how fully the hardware is utilized.
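Whether self-hosting is cheaper depends on volume, and a back-of-the-envelope comparison makes the break-even point explicit. In the sketch below every price and volume is a hypothetical input chosen for illustration, not a quoted rate, and the model ignores engineering time, throughput limits, and idle capacity.

```python
def monthly_api_cost(tokens_per_day: float, usd_per_million_tokens: float, days: int = 30) -> float:
    """Cost of sending all traffic to a per-token API."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens * days


def monthly_self_host_cost(num_gpus: int, usd_per_gpu_hour: float, days: int = 30) -> float:
    """Cost of renting GPUs around the clock, regardless of utilization."""
    return num_gpus * usd_per_gpu_hour * 24 * days


def break_even_tokens_per_day(num_gpus: int, usd_per_gpu_hour: float, usd_per_million_tokens: float) -> float:
    """Daily token volume at which API fees equal the GPU rental cost."""
    daily_gpu_cost = num_gpus * usd_per_gpu_hour * 24
    return daily_gpu_cost / usd_per_million_tokens * 1_000_000


# Hypothetical inputs: 50M tokens/day, $10 per million tokens, 2 GPUs at $2.50/hour.
api = monthly_api_cost(50_000_000, 10.0)
hosted = monthly_self_host_cost(2, 2.50)
print(f"API: ${api:,.0f}/month, self-hosted: ${hosted:,.0f}/month")  # API: $15,000/month, self-hosted: $3,600/month
print(f"Break-even: {break_even_tokens_per_day(2, 2.50, 10.0):,.0f} tokens/day")  # Break-even: 12,000,000 tokens/day
```

Below the break-even volume the fixed GPU cost dominates and the API is the cheaper option.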
Open models can be deployed on-premises or in private cloud environments, ensuring that sensitive data never leaves an organization's control. This is critical for industries like healthcare, finance, and government, where regulatory requirements often prohibit sending data to third-party APIs [1].
Open access to model weights (and in some cases training data and code) allows independent security researchers, ethicists, and regulators to audit AI systems for biases, vulnerabilities, and safety issues. This transparency is an important counterbalance to the "trust us" approach of closed model providers.
Once model weights are publicly released, they cannot be taken back. Bad actors can fine-tune open models to remove safety guardrails, generate disinformation, create malware, or produce other harmful content. This concern was central to OpenAI's initial decision to withhold GPT-2 and has remained a major argument against open releases [7].
Closed model providers implement safety measures at the API level, including content filtering, rate limiting, and usage monitoring. Self-hosted open models run outside all of these controls. While responsible developers include safety training (such as RLHF) in their releases, determined users can remove these safeguards through fine-tuning.
The legal landscape around open-weight AI models remains uncertain. Questions about who is liable when an open model is used to cause harm, whether model creators have a duty of care, and how export controls apply to model weights are still being debated by policymakers in the United States, European Union, and elsewhere [2].
While open models are "free" in terms of licensing, running large models still requires substantial computational resources. A 70 billion parameter model requires multiple high-end GPUs for inference at 16-bit precision, and training from scratch or full fine-tuning demands even more. This creates a de facto barrier to entry that limits who can meaningfully use the largest open models.
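The hardware requirement follows from the size of the weights alone. The sketch below estimates weight memory at different precisions, assuming GPUs with 80 GB of memory (an assumption chosen for illustration); it deliberately ignores activations and the key-value cache, so real deployments need additional headroom.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params: float, bytes_per_param: float, gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on GPU count; ignores activations and KV cache."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


PARAMS_70B = 70e9

print(weight_memory_gb(PARAMS_70B, 2))        # 140.0 (16-bit weights)
print(min_gpus_for_weights(PARAMS_70B, 2))    # 2
print(weight_memory_gb(PARAMS_70B, 0.5))      # 35.0 (4-bit quantized weights)
print(min_gpus_for_weights(PARAMS_70B, 0.5))  # 1
```

Quantization is the usual way to narrow this gap, at some cost in output quality.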
Hugging Face has emerged as the central hub for the open-source AI community. Founded in 2016 as a chatbot company, it pivoted to become a platform for sharing models, datasets, and applications. By 2025, Hugging Face had grown to roughly 13 million users, more than 2 million public models, and over 500,000 public datasets [12].
The platform's Transformers library provides a unified API for loading and running thousands of pre-trained models, making it trivial for developers to experiment with different architectures. Hugging Face also hosts the Open LLM Leaderboard, which benchmarks open models on standardized evaluation suites and has become the de facto scoreboard for the open-source LLM community.
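In practice, the unified API means the same few calls work across many architectures. The following is a minimal sketch, assuming the `transformers` and `torch` packages are installed and that the chosen model ID points to a causal language model on the Hub; the wrapper function `generate_text` is hypothetical.

```python
def generate_text(model_id: str, prompt: str, max_new_tokens: int = 40) -> str:
    """Load a causal language model from the Hub and continue `prompt`."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # The Auto* classes pick the right architecture from the model's config.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Tokenize to PyTorch tensors, generate, and decode the first sequence.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `generate_text("gpt2", "Open-source AI is")` downloads the weights on first use; swapping in another causal model generally requires changing only the model ID, subject to that model's license and hardware needs.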
Over 30% of Fortune 500 companies maintain verified accounts on Hugging Face, signaling that open-source AI has moved from an academic curiosity to a core enterprise technology. Download patterns reveal heavy concentration: the top 200 most downloaded models (0.01% of all models) account for approximately 49.6% of all downloads, though long-tail specialized models serve important niche communities [12]. The platform's Spring 2026 review found two further trends: the mean size of downloaded models grew from about 827 million parameters in 2023 to 20.8 billion in 2025 even though the median barely moved (from 326 million to 406 million), and Chinese-developed models had overtaken U.S. models to capture roughly 41% of all downloads [12].
Open-source AI has fundamentally altered the competitive dynamics of the AI industry.
The rapid improvement of open models has put downward pressure on API pricing for proprietary models. When Mistral 7B demonstrated performance comparable to LLaMA 2 13B and competitive with early GPT-3.5, it signaled that small teams with modest budgets could produce commercially viable models. DeepSeek's ability to train frontier-class models for a fraction of typical costs further reinforced this dynamic [11].
Open-source AI has become a geopolitical consideration. Chinese companies like Alibaba, DeepSeek, and 01.AI have released powerful open-weight models, raising questions in Western policy circles about supply chain dependencies and the implications of broadly available AI capabilities. The trend is now visible in raw distribution data: Hugging Face's Spring 2026 review reported that Chinese models had surpassed U.S. models in monthly and overall downloads, with Alibaba alone accounting for more derivative models than Google and Meta combined [12]. Conversely, open releases have been framed as a tool for AI sovereignty, allowing countries to build domestic AI capabilities without reliance on foreign providers.
Meta's decision to release LLaMA openly was partly strategic. By establishing LLaMA as a widely used base model, Meta benefits from community-driven improvements, ecosystem development, and the normalization of its preferred architectures. The open approach also helps Meta recruit talent and compete with Google and OpenAI without having to build a consumer-facing API business [3].
As of early 2026, the open-source AI ecosystem is thriving but faces ongoing challenges. The gap between the best open-weight models and the best closed models has narrowed substantially, with models like Qwen 3 and DeepSeek-R1 performing competitively on many benchmarks. The OSAID is undergoing its first review cycle, with a planned update by Q4 2026 to address issues identified through monitoring of evolving industry practices [2].
The trend toward mixture-of-experts architectures has made large models cheaper to run, since only a fraction of parameters are active for any given token. LLaMA 4 Maverick, for example, has 400 billion total parameters but only 17 billion active per forward pass, so its per-token compute is closer to that of a 17 billion parameter dense model, although all 400 billion parameters must still be held in memory [3].
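The sparsity of these models can be made concrete by computing the share of parameters active per token. The sketch below uses the total and active counts listed in the model table earlier in this article; the dictionary and helper name are illustrative.

```python
# (total, active) parameter counts in billions, as listed in the model table.
MOE_MODELS = {
    "Mixtral 8x7B": (46.7, 12.9),
    "DeepSeek-V3": (671, 37),
    "LLaMA 4 Maverick": (400, 17),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used for each token in a mixture-of-experts model."""
    return active_b / total_b


for name, (total_b, active_b) in MOE_MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.2%} of parameters active per token")
```

The newer models in the list are markedly sparser than Mixtral, which is what lets total parameter counts grow faster than per-token compute.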
Meanwhile, the community continues to push the boundaries of what open models can achieve, with active development in areas like multimodal understanding, long-context processing, and reasoning capabilities. The question of what truly counts as "open source" in AI remains unresolved, but the practical impact of openly released models on research, industry, and society is undeniable.