DBRX
Last reviewed
May 8, 2026
Sources
18 citations
Review status
Source-backed
Revision
v1 · 4,734 words
DBRX is an open-weight mixture of experts large language model developed by Databricks and its Mosaic AI research team, released on March 27, 2024. The model has 132 billion total parameters, of which 36 billion are active for any given input token, and uses a fine-grained MoE design with 16 experts and top-4 routing. At launch, Databricks marketed DBRX as the strongest available open LLM on standard benchmarks, claiming wins over Llama 2 70B, Mixtral 8x7B, and Grok-1 on language understanding, programming, and math, and surpassing OpenAI's GPT-3.5 on the same suites.
DBRX shipped in two checkpoints, DBRX Base and DBRX Instruct, both released on Hugging Face under a custom Databricks Open Model License (DOML). The license is open enough for most commercial use but restricts deployments above 700 million monthly active users and forbids using DBRX outputs to improve other large language models. The model was trained on 12 trillion tokens of curated text and code on 3,072 NVIDIA H100 GPUs at a reported cost of approximately $10 million.
The release was understood less as a play for ChatGPT-style consumer mindshare and more as a marketing exercise for Databricks' Mosaic AI platform, which the company built on top of MosaicML, the startup it acquired in 2023. DBRX's reign as the best open model was brief: within weeks Mistral released Mixtral 8x22B, Snowflake Arctic followed in April, and DeepSeek V2 in May. Databricks did not ship a direct DBRX successor and instead pivoted its platform to host third-party models from Meta, Mistral, and Anthropic. DBRX Instruct and Mixtral 8x7B Instruct were retired from Databricks Foundation Model APIs pay-per-token endpoints on April 30, 2025.
| Field | Value |
|---|---|
| Developer | Databricks (Mosaic AI research team) |
| Initial release | March 27, 2024 |
| Variants | DBRX Base, DBRX Instruct |
| Architecture | Decoder-only transformer with mixture of experts |
| Total parameters | 132 billion |
| Active parameters | 36 billion per token |
| Experts | 16 (top-4 routing) |
| Context length | 32,768 tokens |
| Training data | 12 trillion tokens of text and code |
| Training hardware | 3,072 NVIDIA H100 GPUs |
| Reported training cost | ~$10 million USD |
| Tokenizer | GPT-4 BPE (via tiktoken) |
| License | Databricks Open Model License |
| Status | Retired from Databricks Foundation Model APIs on April 30, 2025; weights still hosted on Hugging Face |
Databricks was founded in 2013 by the team behind Apache Spark at UC Berkeley, including Ali Ghodsi, Matei Zaharia, and Ion Stoica. For most of its history the company sold a managed analytics and data warehouse stack on top of Spark; the leap into foundation model training came through acquisition. In June 2023 Databricks announced an agreement to buy MosaicML, a generative AI startup co-founded by Naveen Rao and Jonathan Frankle, for roughly $1.3 billion. The deal closed on July 19, 2023, and brought in the team that had previously released the open MPT family of models. MosaicML's core product was a managed training stack that customers could use to pretrain or fine-tune their own transformer models on private data.
The acquisition reshaped Databricks' product roadmap. Within months the combined organization rebranded the legacy ML offerings under the "Mosaic AI" name, with model training, vector search, model serving, and an evaluation framework as the headline components. DBRX was the first foundation model produced under that brand, and it served as a public proof point that Mosaic AI's training stack could produce a state-of-the-art result rather than only being a commodity GPU rental product.
The broader competitive context mattered too. By early 2024 the open weight model space had become a serialized arms race. Meta's Llama 2 had been the headline release of mid-2023, Mistral AI had introduced sparse MoE to the open source world with Mixtral 8x7B in December 2023, and xAI had open-sourced Grok-1 (a 314 billion parameter MoE) on March 17, 2024. Databricks shipped DBRX ten days after Grok-1 with a smaller, more efficient design and considerably stronger benchmark scores, which set the tone for how the company framed the model in press materials.
DBRX is a decoder-only transformer with a mixture of experts feed-forward layer in place of the standard dense MLP. It has 132 billion total parameters; for any given input token only 36 billion are active because the router selects a subset of experts to run. The model uses rotary position embeddings (RoPE), gated linear units (GLU), and grouped query attention, which are now standard choices for a 2024-era LLM.
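The router step described above can be sketched in a few lines. This is a minimal, generic top-k routing sketch, not DBRX's actual implementation: the function name `route_token`, the random logits, and the choice to apply softmax over all experts and then renormalize over the selected four are assumptions made for illustration. Only the expert count (16) and top-k (4) come from the article.

```python
import math
import random

NUM_EXPERTS = 16  # from the article
TOP_K = 4         # from the article


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and return (index, weight) pairs.

    Assumption: softmax over all experts, then renormalize over the chosen
    top_k so their weights sum to 1. Real routers differ in these details.
    """
    # Numerically stable softmax over the router logits.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the top_k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Renormalize so the selected experts' weights sum to 1.
    chosen_mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / chosen_mass) for i in chosen]


# Hypothetical router logits for a single token.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
for expert_id, weight in selected:
    print(f"expert {expert_id:2d} weight {weight:.3f}")
```

Only the four selected expert MLPs run for the token, and their outputs would be combined using the returned weights; the other twelve experts contribute no compute, which is why the active parameter count is far below the total.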
The distinguishing architectural decision is the expert configuration. Mixtral 8x7B and Grok-1 use 8 experts with top-2 routing, meaning two experts run per token. DBRX uses 16 experts with top-4 routing. The combinatorial argument the Databricks team made is that the number of possible expert subsets in a top-k MoE is the binomial coefficient C(N, k), so picking 4 of 16 yields 1,820 combinations versus only 28 for picking 2 of 8, a ratio of roughly 65 to 1. With more granular routing the router can specialize experts more aggressively without inflating the active parameter count. In published ablations the team reported that this fine-grained design improved quality at fixed active parameter budget compared to a top-2 of 8 baseline.
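The combinatorial figures quoted above can be checked directly with the standard library. This is only an arithmetic check of the binomial coefficients; the variable names are illustrative.

```python
import math

# Number of distinct expert subsets a top-k router can choose per token.
dbrx_subsets = math.comb(16, 4)     # 16 experts, top-4
mixtral_subsets = math.comb(8, 2)   # 8 experts, top-2

print(dbrx_subsets, mixtral_subsets, dbrx_subsets / mixtral_subsets)
# prints: 1820 28 65.0
```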
The context window is 32,768 tokens, in line with Mixtral and Llama 2's long context variants. The tokenizer is the GPT-4 BPE tokenizer (the same vocabulary used by GPT-4 and exposed through OpenAI's tiktoken library) rather than the GPT-NeoX or LLaMA tokenizers that were common in earlier open releases. Databricks said the choice was driven both by the GPT-4 tokenizer's stronger compression on natural English and code and by the practical convenience of being able to compare per-token pricing directly with closed competitors. A side effect is that DBRX's per-token compute cost is not directly comparable to Llama-tokenizer models on the same string of text, since the tokenizers segment text differently.
The Mosaic team reported that the data quality used for DBRX was approximately 2x better token-for-token than the data used for the earlier MPT models, judged by held-out evaluation, and that the combination of MoE compute scaling and improved data made the end-to-end training pipeline roughly 4x more compute-efficient than MPT-7B. They also used curriculum learning, adjusting the data mix during pretraining rather than holding it fixed, which is a technique that had become more common in 2023 and 2024 papers from Anthropic, DeepSeek, and others.
| Specification | Value |
|---|---|
| Architecture family | Decoder-only transformer |
| MoE configuration | Fine-grained, 16 experts, top-4 routing |
| Total parameters | 132 billion |
| Active parameters per token | 36 billion |
| Number of layers | 40 |
| Hidden size | 6,144 |
| Attention heads | 48 query heads, 8 key/value heads (grouped query attention) |
| Position encoding | Rotary position embeddings (RoPE) |
| Activation | Gated linear units (GLU) |
| Vocabulary size | ~100,000 (GPT-4 tokenizer) |
| Context length | 32,768 tokens |
| Possible expert combinations | 1,820 (vs 28 for Mixtral) |
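One practical consequence of the attention figures in the table is the size of the key/value cache at inference time. The sketch below is a back-of-the-envelope estimate under stated assumptions: the head dimension is taken to be hidden size divided by query heads, the cache is stored in 16-bit precision, and a single sequence fills the full context. None of these assumptions is stated in the source, and the function name is illustrative.

```python
LAYERS = 40          # from the table
HIDDEN = 6144        # from the table
QUERY_HEADS = 48     # from the table
KV_HEADS = 8         # from the table
CONTEXT = 32_768     # from the table
BYTES_PER_VALUE = 2  # assumption: 16-bit cache entries

# Assumption: head dimension = hidden size / number of query heads.
HEAD_DIM = HIDDEN // QUERY_HEADS


def kv_cache_bytes(num_kv_heads, tokens):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers one key tensor and one value tensor per layer.
    return 2 * LAYERS * num_kv_heads * HEAD_DIM * BYTES_PER_VALUE * tokens


gqa = kv_cache_bytes(KV_HEADS, CONTEXT)     # grouped query attention
mha = kv_cache_bytes(QUERY_HEADS, CONTEXT)  # hypothetical full multi-head cache

print(gqa / 2**30, mha / 2**30)
# prints: 5.0 30.0
```

Under these assumptions, sharing 8 key/value heads across 48 query heads cuts the per-sequence cache at full context from about 30 GiB to about 5 GiB, a sixfold reduction.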
DBRX was pretrained on 12 trillion tokens of curated text and code. By comparison, Llama 2 was trained on 2 trillion tokens; the DBRX training dataset is six times larger and was the largest disclosed open-weight pretraining run at the time of the announcement. Databricks did not publish a complete data card with source breakdowns. The blog post described the corpus as a mix of public web data, licensed datasets, and code, with the team running its own deduplication, quality filtering, and curriculum scheduling on top.
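For a rough sense of scale, the token count can be turned into a compute estimate. The sketch below uses the common rule of thumb that training compute is about 6 FLOPs per parameter per token, applied to the active parameter count. That rule is an approximation brought in for illustration; it does not come from the Databricks announcement and ignores router and attention overheads.

```python
ACTIVE_PARAMS = 36e9    # active parameters per token, from the article
DBRX_TOKENS = 12e12     # pretraining tokens, from the article
LLAMA2_TOKENS = 2e12    # Llama 2 pretraining tokens, from the article

# Assumption: training FLOPs ~= 6 * (active parameters) * (tokens).
approx_train_flops = 6 * ACTIVE_PARAMS * DBRX_TOKENS
token_ratio = DBRX_TOKENS / LLAMA2_TOKENS

print(f"{approx_train_flops:.2e} FLOPs, {token_ratio:.0f}x the tokens of Llama 2")
# prints: 2.59e+24 FLOPs, 6x the tokens of Llama 2
```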
The hardware was 3,072 NVIDIA H100 GPUs connected with 3.2 Tbps of InfiniBand bandwidth, drawn from NVIDIA DGX Cloud. Reports of the training duration range from two to three months of wall-clock time: TechCrunch reported "two months," the Databricks blog reported "three months" for the full development cycle, and Wikipedia and follow-up coverage cite "2.5 months" for the main training run. The discrepancy probably reflects whether the figure covers only the final pretraining run or also the data preparation, evaluation, and instruct fine-tuning stages.
The headline training cost figure of "approximately $10 million" was reported by Naveen Rao to TechCrunch and has been the most-cited number in coverage. Independent commentary from Nathan Lambert at Interconnects estimated the full cost at $10 to $30 million when factoring in salaries, infrastructure, failed runs, and data acquisition. Databricks has not published a full breakdown, so the lower number should be read as the marginal compute cost for the successful run rather than the all-in research cost. The figure also sits in a roughly comparable range to publicly reported training costs for GPT-3 and Llama 2, and well below estimates for frontier closed models like GPT-4.
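The reading of the $10 million figure as marginal compute can be sanity-checked by converting it to an implied price per GPU-hour. This is a rough sketch under assumptions that are not in the source: 30-day months, all 3,072 GPUs busy for the entire run, and the reported cost covering only that compute. The function names are illustrative.

```python
GPUS = 3072                     # from the article
REPORTED_COST_USD = 10_000_000  # from the article
HOURS_PER_MONTH = 30 * 24       # assumption: 30-day months


def gpu_hours(months):
    """Total GPU-hours if every GPU runs for the whole period."""
    return GPUS * months * HOURS_PER_MONTH


def implied_rate(months):
    """Implied USD per GPU-hour if the reported cost were compute only."""
    return REPORTED_COST_USD / gpu_hours(months)


# The article cites durations between two and three months.
for months in (2, 2.5, 3):
    print(f"{months} months: {gpu_hours(months):,.0f} GPU-hours, "
          f"${implied_rate(months):.2f} per GPU-hour")
```

Under these assumptions the implied rate falls between roughly $1.50 and $2.30 per H100-hour, which is consistent with treating the headline number as the compute bill for the successful run rather than the all-in research cost.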
Mosaic AI's training stack was the substrate underneath. Databricks made a point of emphasizing that the entire run was conducted using the same infrastructure (Composer, Streaming, MegaBlocks for sparse MoE kernels, Lilac for data quality, MLflow for experiment tracking) that customers could rent through the Mosaic AI platform. In effect, DBRX functioned as the largest possible advertisement for Mosaic AI's enterprise training product.
| Item | Value |
|---|---|
| Pretraining tokens | 12 trillion |
| Training hardware | 3,072 NVIDIA H100 GPUs |
| Interconnect | 3.2 Tbps InfiniBand on NVIDIA DGX Cloud |
| Training duration | ~2 to 3 months (sources vary) |
| Reported training cost | ~$10 million USD (marginal compute) |
| Estimated all-in cost | $10 to $30 million (Interconnects estimate) |
| Training stack | Mosaic AI (Composer, Streaming, MegaBlocks, Lilac, MLflow) |
| Data quality vs MPT | ~2x better token-for-token (per Databricks) |
| Compute efficiency vs MPT-7B | ~4x more efficient end-to-end (per Databricks) |
Databricks released two checkpoints simultaneously on March 27, 2024.
DBRX Base is the pretrained foundation model with no instruction tuning or alignment work applied. It is intended as a starting point for further training, fine-tuning, or research. Because it has not been preference-tuned, base outputs do not follow chat formatting conventions and can produce unsafe completions; the Hugging Face model card explicitly notes the absence of safety training.
DBRX Instruct is the chat-tuned variant produced by additional fine-tuning. Databricks released few public details about the fine-tuning recipe; the company did not publish whether it used RLHF, DPO, or simply supervised fine-tuning on a curated instruction dataset. The Interconnects writeup at release flagged this opacity as a missing piece compared to contemporary releases from Mistral and Meta, which had described their preference-optimization stages in more detail. DBRX Instruct is the variant that scored on benchmarks in the official launch announcement and was the version most commonly served by hosting providers.
Both checkpoints share the same architecture, parameter count, and tokenizer.
Databricks released DBRX with an extensive benchmark suite, comparing DBRX Instruct against several open weight peers and against the API version of GPT-3.5. The general pattern in the official numbers is that DBRX Instruct beats the open weight peers across the board and beats GPT-3.5 on most academic benchmarks, while remaining behind GPT-4 on most tasks.
The model showed particularly strong programming results, attributed in the official writeup to the higher proportion of curated code in the pretraining corpus. On HumanEval, DBRX Instruct's 70.1% beats CodeLlama 70B Instruct's 67.8%, despite CodeLlama being a code-specialized model. The math reasoning gap over Mixtral and Llama 2 70B is also wide.
Scores are as reported by Databricks in the launch announcement for DBRX Instruct and its peers. Higher is better.
| Benchmark | DBRX Instruct | Mixtral Instruct (8x7B) | Llama 2 70B Chat | Grok-1 | GPT-3.5 |
|---|---|---|---|---|---|
| MMLU (5-shot) | 73.7% | 71.4% | 69.8% | 73.0% | 70.0% |
| HellaSwag (10-shot) | 89.0% | 86.5% | 85.9% | n/a | 85.5% |
| GSM8k (CoT) | 72.8% | 61.1% | 54.1% | 62.9% | 57.1% |
| HumanEval (0-shot) | 70.1% | 54.8% | 32.2% | 63.2% | 48.1% |
| Open LLM Leaderboard (avg) | 74.5% | 72.7% | n/a | n/a | n/a |
On the Hugging Face Open LLM Leaderboard at release, DBRX Instruct was the highest scoring open-weight model. The Databricks writeup also reported wins on the company's internal Mosaic Eval Gauntlet, a set of more than 30 benchmarks. Long-context retrieval-augmented generation results from the same writeup placed DBRX Instruct as competitive with Mixtral 8x7B and GPT-3.5 Turbo on Natural Questions and HotPotQA when paired with a vector store.
Independent third-party evaluation has been less laudatory. Artificial Analysis, which runs its own multi-benchmark Intelligence Index, scored DBRX Instruct at 8 out of 100 on its 2025 scale, well below the 13-point average for that cohort; that scale was rebuilt to include later reasoning-tuned models, so DBRX's relative position fell as the field moved on rather than because the model itself regressed.
DBRX is distributed under the Databricks Open Model License, a custom license written for this release. The license is similar in spirit to Meta's Llama 2 community license: weights are freely downloadable and usable for most commercial purposes, with exceptions and restrictions that disqualify the license as "open source" under the OSI definition.
The most-discussed clause is the monthly active user threshold. Any licensee whose products or services collectively had more than 700 million monthly active users in the preceding calendar month must request a separate license from Databricks rather than relying on the open license. The 700 million MAU number matches the threshold in Meta's Llama 2 license and is widely understood to be aimed at large competitor cloud providers and the largest consumer internet companies.
Separately, the license forbids use of DBRX or its outputs to train or improve any other large language model. This is a standard anti-distillation clause that has become common in semi-open releases (Llama, Qwen, and others have similar terms in some versions). It restricts a category of downstream use that the OSI Open Source AI Definition would consider essential, which is part of why critics have called DOML "semi-open" rather than open source.
Databricks does require attribution and the inclusion of the license file in redistributions. Derivative works must apply the same use restrictions to downstream users.
| Term | DBRX (DOML) | Llama 2 community license | Apache 2.0 (e.g. Mistral 7B) |
|---|---|---|---|
| Commercial use | Allowed | Allowed | Allowed |
| Redistribution | Allowed with attribution | Allowed with attribution | Allowed |
| MAU threshold | 700M MAU triggers separate license | 700M MAU triggers separate license | None |
| Use of outputs to train other LLMs | Forbidden | Forbidden | Allowed |
| Use for fine-tuning | Allowed | Allowed | Allowed |
| Sharing fine-tuned weights | Allowed under same terms | Allowed under same terms | Allowed under any terms |
| OSI-approved open source | No | No | Yes |
At release, the two DBRX checkpoints were posted on Hugging Face at databricks/dbrx-base and databricks/dbrx-instruct, with model code in the github.com/databricks/dbrx repository. Both checkpoints were available for direct download without registration, in contrast to Llama 2's gated request flow at the time.
Databricks customers could call DBRX Instruct through the Mosaic AI Foundation Model APIs as a hosted endpoint, both as a pay-per-token product and as a provisioned-throughput product on Databricks-managed GPUs. NVIDIA listed DBRX in its NVIDIA API Catalog and made it available as an NVIDIA NIM container, and Microsoft added the model to Azure AI Foundry. Independent inference providers including Together AI, Fireworks AI, and Perplexity also hosted DBRX endpoints in the months after launch.
The model was retired from Databricks Foundation Model APIs pay-per-token endpoints on April 30, 2025, alongside Mixtral 8x7B Instruct. The DBRX, Mistral, and Mixtral families were also retired from Databricks Foundation Model Fine-tuning on the same date, with Databricks pointing customers toward Llama 3 and other replacements. The Hugging Face weights remain available for self-hosting.
The initial press treatment was warm. Coverage at VentureBeat called the release "a new state of the art"; The Verge framed it as evidence that the open model frontier was catching up to closed labs; TechCrunch led with the price tag, calling the model one Databricks "spent $10 million on" and noting that it still could not beat GPT-4. Wired ran a piece on the Mosaic AI strategy and the way DBRX functioned as a sales tool for Databricks' enterprise customers. SiliconANGLE quoted Ali Ghodsi calling DBRX "a new standard for open source LLMs." Hackster.io was one of the few outlets to lead with the license terms, calling the release "semi-open source."
Within the AI research community, the release was treated as well-engineered but not field-changing. Nathan Lambert at Interconnects published a contemporaneous analysis arguing that DBRX represented a significant infrastructure achievement (proving that Databricks could ship a frontier-quality MoE) but a relatively conservative research contribution, given that fine-grained MoE was already in the literature and that Databricks had not published novel architectural ideas. The fine-grained MoE design later became conventional wisdom in 2024 and 2025 releases, with Snowflake Arctic, DeepSeek V2, and Llama 4 all adopting many-experts top-k routing.
Most of the actual deployment of DBRX inside organizations was through Databricks itself rather than as a downloaded weight, since 132 billion parameter models are difficult to serve on a single host. The minimum hardware to run DBRX in fp16 was around 320 GB of GPU memory, or four H100s, which put self-hosted DBRX out of reach for most individual developers and pushed users toward managed inference.
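The hardware figure above follows from simple arithmetic on the parameter count. The sketch below estimates weight memory only, under assumptions not stated in the source: decimal gigabytes, the 80 GB variant of the H100, and no allowance for the KV cache or activations. The function name and the lower-precision rows are illustrative.

```python
import math

TOTAL_PARAMS = 132e9  # from the article
H100_GB = 80          # assumption: 80 GB H100 variant


def weight_gb(bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return TOTAL_PARAMS * bytes_per_param / 1e9


# fp16 is the case discussed above; 8-bit and 4-bit are hypothetical
# quantized variants shown for comparison.
for label, bytes_per_param in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gb = weight_gb(bytes_per_param)
    gpus = math.ceil(gb / H100_GB)
    print(f"{label}: {gb:.0f} GB of weights, at least {gpus} x {H100_GB} GB GPU(s)")
```

In fp16 the weights alone take about 264 GB, which rounds up to four 80 GB cards (320 GB total) and leaves the remainder for the KV cache and activations.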
DBRX appeared at the start of a roughly twelve-month wave of open weight MoE releases. The fine-grained, many-expert design with smaller per-expert MLPs that Databricks championed went on to become the dominant pattern, although later models did not all follow DBRX in raising top-k.
| Model | Released | Total params | Active params | Experts | Top-k | Context | License |
|---|---|---|---|---|---|---|---|
| Mixtral 8x7B | Dec 2023 | ~46.7B | ~12.9B | 8 | 2 | 32K | Apache 2.0 |
| Grok-1 | Mar 17, 2024 | 314B | 86B | 8 | 2 | 8K | Apache 2.0 |
| DBRX | Mar 27, 2024 | 132B | 36B | 16 | 4 | 32K | Databricks Open Model License |
| Mixtral 8x22B | Apr 2024 | 141B | 39B | 8 | 2 | 64K | Apache 2.0 |
| Snowflake Arctic | Apr 24, 2024 | 480B | 17B | 128 | 2 | 4K | Apache 2.0 |
| DeepSeek V2 | May 2024 | 236B | 21B | 160 routed + 2 shared | 6 | 128K | DeepSeek License |
| DeepSeek V3 | Dec 2024 | 671B | 37B | 256 routed + 1 shared | 8 | 128K | DeepSeek License |
| Llama 4 Scout | Apr 2025 | ~109B | 17B | 16 | 1 | 10M | Llama 4 license |
| Llama 4 Maverick | Apr 2025 | ~400B | 17B | 128 | 1 | 1M | Llama 4 license |
A few patterns are visible in that table. The total parameter count grew across the year while active parameters mostly held in the same band (17 to 39 billion), reflecting the design pressure to keep inference cost flat while pushing total capacity up. The expert count climbed sharply, with later models choosing dozens or hundreds of experts and routing only a small number of them per token (between one and eight in this table). Context length also expanded dramatically: Llama 4 Scout's 10 million token context is roughly 300 times longer than DBRX's 32K. License terms also drifted toward more restrictive arrangements, with Llama 4 introducing additional commercial-use clauses on top of the Llama 2 framework.
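The sparsity trend can be made concrete by computing, for each row of the table, the fraction of parameters active per token and the number of routed-expert subsets available to the router. This is a small sketch over the table's own figures. It counts routed experts only (shared experts are ignored), and the tuple layout and variable names are illustrative.

```python
import math

# (model, total params in billions, active params in billions,
#  routed experts, top-k), copied from the table above.
MODELS = [
    ("Mixtral 8x7B", 46.7, 12.9, 8, 2),
    ("Grok-1", 314, 86, 8, 2),
    ("DBRX", 132, 36, 16, 4),
    ("Mixtral 8x22B", 141, 39, 8, 2),
    ("Snowflake Arctic", 480, 17, 128, 2),
    ("DeepSeek V2", 236, 21, 160, 6),
    ("DeepSeek V3", 671, 37, 256, 8),
    ("Llama 4 Scout", 109, 17, 16, 1),
    ("Llama 4 Maverick", 400, 17, 128, 1),
]

summary = {
    name: (active / total, math.comb(experts, top_k))
    for name, total, active, experts, top_k in MODELS
}

for name, (active_fraction, subsets) in summary.items():
    print(f"{name:18s} active {active_fraction:6.1%}  routed subsets {subsets:.3g}")
```

On these figures the early 2024 models activate a little over a quarter of their parameters per token, while the later many-expert models activate well under a tenth.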
In that sense DBRX sits at a hinge point. It was one of the first models to push past Mixtral's eight experts and demonstrate at scale that more granular routing was beneficial, but it was quickly outpaced by competitors who took the same idea further.
Databricks has not released a direct DBRX successor. There has been no public DBRX-2, no DBRX-V, and no scaled-up DBRX-Large. The company has instead leaned on the Mosaic AI Foundation Model API as a multi-model serving layer that hosts third-party weights from Meta, Mistral, DeepSeek, and other vendors, plus closed-API access to models from Anthropic and OpenAI through partnership integrations.
The strategic logic for not pushing a successor is straightforward. By 2024 and 2025 the cost of training a frontier-level model had crossed $100 million, and an enterprise data and analytics company like Databricks did not need to bet its product line on continually winning that race. Databricks customers care about training their own private models on private data using Mosaic AI's pipeline, plus serving the best off-the-shelf open and closed models from a single endpoint. Both of those use cases benefit from Databricks running great inference and fine-tuning infrastructure rather than from Databricks owning the best base model.
The company has continued to invest in research that uses Mosaic AI infrastructure, including domain-specific fine tunes and the Mosaic AI Agent Framework launched in 2024. Mosaic AI Pretraining remains a product that allows customers to pretrain their own foundation models, with DBRX cited as the canonical example of what is possible. The Mosaic Eval Gauntlet has been kept up to date and is used by some independent evaluators.
The market position has held: Databricks raised at a $43 billion valuation in late 2023, $62 billion in 2024, and reportedly $134 billion in late 2025. The company's enterprise AI revenue grew sharply over that period, and DBRX is still cited in sales conversations as evidence that the platform can train state-of-the-art models even though the model itself is no longer the focus.
A related strategic question is whether the open release made commercial sense in retrospect. The investment was not large by 2025 standards, the marketing surface was real, and the model never went head-to-head with Databricks customers' own products since most enterprise users were running fine-tunes of smaller open weights or hosted closed APIs anyway. The case against is that the engineering effort, especially the data work, was redirected away from internal customer-facing products for several quarters, and that DBRX's quick obsolescence meant the marketing window was short. Most observers in the post-release period have treated the release as a net positive for Databricks, particularly because it gave the Mosaic AI brand a concrete proof point in a category where most competitors talked about training infrastructure abstractly.
Several specific criticisms of the DBRX release have been documented.
On opacity, the Databricks blog post and Hugging Face model card did not disclose the data sources used for pretraining at the level of detail that would let third parties reproduce or audit the data. Naveen Rao described the data as "a large set of data from a diverse range of sources," mentioning "open data sets that the community knows, loves and uses every day," but did not list specific datasets, their proportions, or any licensed corpora. The TechCrunch piece noted that this matched the industry pattern (no major lab was disclosing detailed pretraining data in 2024) but criticized it as a transparency gap for an explicitly "open" release.
On the instruct recipe, Databricks did not publish whether DBRX Instruct was tuned with RLHF, DPO, supervised fine-tuning only, or some combination. This was unusual at the time of release, since contemporary releases from Meta, Mistral, and Anthropic had described their preference-optimization stacks at least at a high level.
On the license, the DOML's MAU threshold and anti-distillation clause led several commentators to argue that Databricks should not call DBRX "open source" without qualification. The Open Source Initiative's then-draft Open Source AI Definition, finalized later in 2024, would not have admitted DBRX. Hackster.io and Wired both flagged the gap between Databricks' open marketing and the actual license terms.
On benchmark choice, the Mosaic Eval Gauntlet was an internally curated suite, and the comparisons in the launch blog post used Databricks-run evals on competing models rather than the published numbers from those models' own papers. Independent reproductions of some scores produced lower numbers, and later analyses including the Artificial Analysis Intelligence Index put DBRX Instruct toward the lower end of the cohort once reasoning-tuned models entered the field. None of this implies dishonest reporting. It does mean the headline "DBRX beats GPT-3.5" claim depended on the specific benchmarks chosen.
On the tokenizer, the choice of the GPT-4 BPE was practical but had ergonomic costs for the open ecosystem. Tooling that was built around the LLaMA or GPT-NeoX vocabularies needed adaptation to consume DBRX, and direct token-count comparisons against Llama-tokenizer models on the same input string were misleading. Some observers thought the choice was as much marketing (so that price-per-token comparisons against OpenAI looked clean) as engineering, since the GPT-4 vocabulary is not unambiguously better than alternatives on all data.
On deployability, the total size of 132 billion parameters made self-hosting difficult for typical users. Even with 36 billion active parameters per token, the model's weights had to fit somewhere, and the cheapest home GPU configurations could not run it. This made the open release more useful as a research artifact and a Databricks marketing statement than as a model that downstream developers actually ran themselves. Quantized GGUF versions of DBRX appeared on Hugging Face within a few weeks of release, which brought the memory requirement down enough to run on a high-end workstation, but inference speed in those configurations was slow enough that the model remained more of a curiosity than a practical local option for most users. The dynamic was different for enterprise users on Databricks itself, who could rent a managed endpoint and avoid the hardware question entirely.
A softer critique, voiced in some research community discussion, was that DBRX did not contribute much new technical knowledge. The fine-grained MoE design had been described in earlier papers, the curriculum learning approach had appeared in concurrent work, and the data curation methods were not published in any reproducible form. That meant DBRX functioned as a demonstration that a particular set of choices worked at scale rather than as a paper that taught the field something new. Compared to releases like DeepSeek V2's multi-head latent attention or Llama 4's interleaved dense and sparse layers, both of which were accompanied by detailed technical writeups, DBRX's research footprint was relatively light.