DBRX

Updated Jun 26, 2026

DBRX is an open-weight mixture of experts large language model developed by Databricks and its Mosaic AI research team, released on March 27, 2024. It has 132 billion total parameters, of which only 36 billion are active for any given input token, and uses a fine-grained MoE design with 16 experts and top-4 routing (4 of the 16 experts run per token). DBRX was trained on 12 trillion tokens of text and code with a 32,768-token context window, and at launch Databricks said the model "sets a new state-of-the-art for established open LLMs," reporting wins over Llama 2 70B, Mixtral 8x7B, and Grok-1 on language understanding, programming, and math while surpassing OpenAI's GPT-3.5. ^[1]^[2]

DBRX shipped in two checkpoints, DBRX Base and DBRX Instruct, both released on Hugging Face under a custom Databricks Open Model License (DOML). ^[4]^[5] The license is open enough for most commercial use but requires a separate license for any product with more than 700 million monthly active users and forbids using DBRX outputs to improve other large language models. ^[3] The model was trained on 3,072 NVIDIA H100 GPUs at a reported cost of approximately $10 million. ^[1]^[6]

The release was understood less as a play for ChatGPT-style consumer mindshare and more as a marketing exercise for Databricks' Mosaic AI platform, which the company built on top of MosaicML, the startup it acquired in 2023. ^[9]^[10] DBRX's reign as the best open model was brief: within weeks Mistral released Mixtral 8x22B, Snowflake Arctic followed in April, and DeepSeek V2 in May. ^[13]^[14]^[15] Databricks did not ship a direct DBRX successor and instead pivoted its platform to host third-party models from Meta, Mistral, and Anthropic. DBRX Instruct and Mixtral 8x7B Instruct were retired from Databricks Foundation Model APIs pay-per-token endpoints on April 30, 2025. ^[12]

Infobox

Field	Value
Developer	Databricks (Mosaic AI research team)
Initial release	March 27, 2024
Variants	DBRX Base, DBRX Instruct
Architecture	Decoder-only transformer with mixture of experts
Total parameters	132 billion
Active parameters	36 billion per token
Experts	16 (top-4 routing)
Context length	32,768 tokens
Training data	12 trillion tokens of text and code
Training hardware	3,072 NVIDIA H100 GPUs
Reported training cost	~$10 million USD
Tokenizer	GPT-4 BPE (via tiktoken)
License	Databricks Open Model License
Status	Retired from Databricks Foundation Model APIs on April 30, 2025; weights still hosted on Hugging Face

What is DBRX?

DBRX is a general-purpose open-weight large language model built by Databricks to handle text generation, coding, and reasoning tasks. It is a sparse mixture of experts model: although it stores 132 billion parameters, a learned router activates only 36 billion of them per token, which is intended to deliver quality closer to that of a large dense model at an inference compute cost closer to that of a much smaller one. ^[1] In the launch announcement Databricks claimed DBRX Instruct "surpasses GPT-3.5, and it is competitive with Gemini 1.0 Pro," while outperforming every open model it tested on composite benchmarks. ^[1]

The name DBRX is an internal Databricks codename rather than an acronym with an official expansion. The model was released as a proof point for the Mosaic AI training stack, and Databricks CEO Ali Ghodsi described it as "a new standard for open source LLMs" in launch coverage. ^[11] DBRX is most directly comparable to other open MoE models of its era such as Mixtral 8x7B and Grok-1, and it was the highest-scoring open-weight model on the Hugging Face Open LLM Leaderboard at the moment of release. ^[1]

Background

Databricks was founded in 2013 by the team behind Apache Spark at UC Berkeley, including Ali Ghodsi, Matei Zaharia, and Ion Stoica. For most of its history the company sold a managed analytics and data warehouse stack on top of Spark; the leap into foundation model training came through acquisition. In June 2023 Databricks announced an agreement to buy MosaicML, a generative AI startup co-founded by Naveen Rao and Jonathan Frankle, for roughly $1.3 billion. ^[9] The deal closed on July 19, 2023, and brought in the team that had previously released the open MPT family of models. ^[10] MosaicML's core product was a managed training stack that customers could use to pretrain or fine-tune their own transformer models on private data.

The acquisition reshaped Databricks' product roadmap. Within months the combined organization rebranded the legacy ML offerings under the "Mosaic AI" name, with model training, vector search, model serving, and an evaluation framework as the headline components. DBRX was the first foundation model produced under that brand, and it served as a public proof point that Mosaic AI's training stack could produce a state-of-the-art result rather than only being a commodity GPU rental product.

The broader competitive context mattered too. By early 2024 the open-weight model space had become an arms race of back-to-back releases. Meta's Llama 2 had been the headline release of mid-2023, Mistral AI had introduced sparse MoE to the open source world with Mixtral 8x7B in December 2023, and xAI had open-sourced Grok-1 (a 314 billion parameter MoE) on March 17, 2024. Databricks shipped DBRX ten days after Grok-1 with a smaller, more efficient design and considerably stronger benchmark scores, which set the tone for how the company framed the model in press materials. ^[1]

How is DBRX built?

DBRX is a decoder-only transformer with a mixture of experts feed-forward layer in place of the standard dense MLP. It has 132 billion total parameters; for any given input token only 36 billion are active because the router selects a subset of experts to run. ^[1] The model uses rotary position embeddings (RoPE), gated linear units (GLU), and grouped query attention, which are now standard choices for a 2024-era LLM.

The distinguishing architectural decision is the expert configuration. Mixtral 8x7B and Grok-1 use 8 experts with top-2 routing, meaning two experts run per token. DBRX uses 16 experts with top-4 routing. The combinatorial argument the Databricks team made is that the number of possible expert subsets in a top-k MoE is the binomial coefficient C(N, k), so picking 4 of 16 yields 1,820 combinations versus only 28 for picking 2 of 8. Databricks summarized this as "65x more possible combinations of experts" than Mixtral and Grok-1, and argued that the finer granularity let the router specialize experts more aggressively without inflating the active parameter count. ^[1] In published ablations the team reported that this fine-grained design improved quality at a fixed active parameter budget compared to a top-2 of 8 baseline. ^[1]
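The combinatorial figures quoted above can be checked directly. The following is a minimal sketch using only Python's standard library; the helper name `expert_subsets` is made up for illustration and is not part of any Databricks code.

```python
from math import comb


def expert_subsets(num_experts: int, top_k: int) -> int:
    """Number of distinct unordered expert subsets a top-k router can select."""
    return comb(num_experts, top_k)


# Expert counts and top-k values as stated in the text above.
dbrx_subsets = expert_subsets(16, 4)    # DBRX: 16 experts, top-4
mixtral_subsets = expert_subsets(8, 2)  # Mixtral 8x7B / Grok-1: 8 experts, top-2
ratio = dbrx_subsets // mixtral_subsets

print(dbrx_subsets, mixtral_subsets, ratio)  # 1820 28 65
```

The count only says how many expert subsets the router could pick; it does not by itself show that the extra combinations improve quality, which is the separate empirical claim Databricks made from its ablations.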

The context window is 32,768 tokens, in line with Mixtral and Llama 2's long context variants. The tokenizer is the GPT-4 BPE tokenizer (the same vocabulary used by GPT-4 and exposed through OpenAI's tiktoken library) rather than the GPT-NeoX or LLaMA tokenizers that were common in earlier open releases. ^[1] Databricks said the choice was driven both by the GPT-4 tokenizer's stronger compression on natural English and code and by the practical convenience of being able to compare per-token pricing directly with closed competitors. A side effect is that DBRX's per-token compute cost is not directly comparable to Llama-tokenizer models on the same string of text, since the tokenizers segment text differently.

The Mosaic team reported that the data quality used for DBRX was approximately 2x better token-for-token than the data used for the earlier MPT models, judged by held-out evaluation, and that the combination of MoE compute scaling and improved data made the end-to-end training pipeline roughly 4x more compute-efficient than MPT-7B. ^[1] They also used curriculum learning, adjusting the data mix during pretraining rather than holding it fixed, which is a technique that had become more common in 2023 and 2024 papers from Anthropic, DeepSeek, and others.

Architecture specifications

Specification	Value
Architecture family	Decoder-only transformer
MoE configuration	Fine-grained, 16 experts, top-4 routing
Total parameters	132 billion
Active parameters per token	36 billion
Number of layers	40
Hidden size	6,144
Attention heads	48 query heads, 8 key/value heads (grouped query attention)
Position encoding	Rotary position embeddings (RoPE)
Activation	Gated linear units (GLU)
Vocabulary size	~100,000 (GPT-4 tokenizer)
Context length	32,768 tokens
Possible expert combinations	1,820 (vs 28 for Mixtral)
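To make the grouped query attention row concrete, the sketch below estimates the key/value cache for one full-length sequence from the figures in the table. It rests on two assumptions not stated in the source: that the per-head dimension equals hidden size divided by query heads, and that cached values are stored at 16 bits each. The numbers are back-of-the-envelope estimates, not measurements.

```python
# Figures taken from the specification table above.
NUM_LAYERS = 40
HIDDEN_SIZE = 6144
NUM_QUERY_HEADS = 48
NUM_KV_HEADS = 8
CONTEXT_LENGTH = 32_768

# Assumptions (not from the source): head_dim = hidden / query heads,
# and each cached value occupies 2 bytes (16-bit precision).
BYTES_PER_VALUE = 2
head_dim = HIDDEN_SIZE // NUM_QUERY_HEADS              # 128
queries_per_kv_head = NUM_QUERY_HEADS // NUM_KV_HEADS  # 6 query heads share one KV head


def kv_cache_bytes(num_kv_heads: int, num_tokens: int) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 covers keys plus values, stored per layer, per KV head, per token.
    return 2 * NUM_LAYERS * num_kv_heads * head_dim * BYTES_PER_VALUE * num_tokens


gqa_bytes = kv_cache_bytes(NUM_KV_HEADS, CONTEXT_LENGTH)
mha_bytes = kv_cache_bytes(NUM_QUERY_HEADS, CONTEXT_LENGTH)  # hypothetical: no KV sharing

print(f"GQA cache: {gqa_bytes / 2**30:.0f} GiB")  # 5 GiB
print(f"MHA cache: {mha_bytes / 2**30:.0f} GiB")  # 30 GiB
```

Under these assumptions, sharing 8 key/value heads across 48 query heads shrinks the cache for a 32,768-token sequence by a factor of six compared with giving every query head its own key/value head.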

How does DBRX use mixture of experts?

In a dense transformer every token passes through the full feed-forward network, so the active parameter count equals the total parameter count. In a mixture of experts model the single feed-forward block is replaced by many smaller expert networks plus a lightweight router that, for each token, scores the experts and sends the token only to the top few. DBRX has 16 such experts in each MoE layer and routes every token to the 4 highest-scoring experts, so the active compute per token is roughly the size of 4 experts rather than all 16. ^[1]

This is why DBRX has 132 billion total parameters but only 36 billion active parameters: the unused experts still occupy memory but do not contribute floating-point operations for that token. The design choice Databricks emphasized is granularity. Earlier open MoE models such as Mixtral and Grok-1 used 8 larger experts and routed to 2 of them, which yields C(8,2) = 28 possible expert subsets per token. DBRX's 16-expert, top-4 layout yields C(16,4) = 1,820 subsets, the "65x more possible combinations" figure Databricks cited, which the company argued lets each expert specialize more narrowly and improves quality at the same active-parameter cost. ^[1] MoE designs with more and smaller experts went on to become the dominant open-model pattern: Snowflake Arctic, DeepSeek V2, DeepSeek V3, and Llama 4 all use 16 or more experts per layer, although not all of them also raise top-k. ^[13]^[15]
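The routing step described in this section can be sketched in a few lines of plain Python. This is a generic top-k MoE illustration, not DBRX's actual implementation: the function names, the toy experts, and the choice to weight the selected experts with a softmax over only their logits are all assumptions made for the example, since the source does not describe the router's internals.

```python
import math


def route_top_k(router_logits: list[float], top_k: int) -> dict[int, float]:
    """Select the top_k highest-scoring experts and return {expert_index: weight}.

    Assumption: weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    max_logit = max(router_logits[i] for i in chosen)  # subtract max for numerical stability
    exps = {i: math.exp(router_logits[i] - max_logit) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(token, router_logits, experts, top_k):
    """Run only the selected experts and mix their outputs by router weight."""
    weights = route_top_k(router_logits, top_k)
    output = [0.0] * len(token)
    for index, weight in weights.items():
        expert_output = experts[index](token)  # the other experts are never evaluated
        output = [o + weight * e for o, e in zip(output, expert_output)]
    return output, weights


# Toy setup: 16 hypothetical experts, where expert i scales its input by (i + 1).
experts = [lambda x, scale=i + 1: [scale * v for v in x] for i in range(16)]
router_logits = [0.1 * i for i in range(16)]  # made-up scores; a real router learns these

output, weights = moe_layer([1.0, 2.0], router_logits, experts, top_k=4)
used_experts = sorted(weights)
print(used_experts)  # [12, 13, 14, 15]
```

With 16 experts and `top_k=4`, twelve experts are skipped entirely for this token, which is the source of the gap between total and active parameters.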

How was DBRX trained?

DBRX was pretrained on 12 trillion tokens of curated text and code. ^[1] By comparison, Llama 2 was trained on 2 trillion tokens; the DBRX training dataset is six times larger and was the largest disclosed open-weight pretraining run at the time of the announcement. Databricks did not publish a complete data card with source breakdowns. The blog post described the corpus as a mix of public web data, licensed datasets, and code, with the team running its own deduplication, quality filtering, and curriculum scheduling on top. ^[1]

The hardware was 3,072 NVIDIA H100 GPUs connected with 3.2 Tbps of InfiniBand bandwidth, drawn from NVIDIA DGX Cloud. ^[1] Sources give slightly different figures for the wall-clock duration, ranging from two to three months: TechCrunch reported "two months," the Databricks blog reported "three months" for the full development cycle, and Wikipedia and follow-up coverage cite "2.5 months" for the main training. ^[6]^[8] The discrepancy probably reflects whether the time includes only the final pretraining run or also the data preparation, evaluation, and instruct fine-tuning stages.

The headline training cost figure of "approximately $10 million" was reported by Naveen Rao to TechCrunch and has been the most-cited number in coverage. ^[6] Independent commentary from Nathan Lambert at Interconnects estimated the full cost at $10 to $30 million when factoring in salaries, infrastructure, failed runs, and data acquisition. ^[7] Databricks has not published a full breakdown, so the lower number should be read as the marginal compute cost for the successful run rather than the all-in research cost. The figure also sits in a roughly comparable range to publicly reported training costs for GPT-3 and Llama 2, and well below estimates for frontier closed models like GPT-4.
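The reported figures in this section can be cross-checked with rough arithmetic. The sketch below is a sanity check under stated assumptions, not a reconstruction of Databricks' accounting. It assumes the common rule of thumb of about 6 floating-point operations per active parameter per training token, and a 75-day run (about 2.5 months) on all 3,072 GPUs. Neither assumption comes from the source.

```python
# Figures reported in this article.
ACTIVE_PARAMS = 36e9
TRAINING_TOKENS = 12e12
NUM_GPUS = 3072
REPORTED_COST_USD = 10e6

# Assumptions (not from the source).
FLOPS_PER_PARAM_PER_TOKEN = 6  # common rule-of-thumb estimate for transformer training
ASSUMED_TRAINING_DAYS = 75     # roughly 2.5 months of wall-clock time

training_flops = FLOPS_PER_PARAM_PER_TOKEN * ACTIVE_PARAMS * TRAINING_TOKENS
gpu_hours = NUM_GPUS * ASSUMED_TRAINING_DAYS * 24
implied_usd_per_gpu_hour = REPORTED_COST_USD / gpu_hours
sustained_flops_per_gpu = training_flops / (gpu_hours * 3600)

print(f"Estimated training compute: {training_flops:.3e} FLOPs")
print(f"GPU-hours at {ASSUMED_TRAINING_DAYS} days: {gpu_hours:,}")
print(f"Implied price: ${implied_usd_per_gpu_hour:.2f} per GPU-hour")
print(f"Implied sustained throughput: {sustained_flops_per_gpu / 1e12:.0f} TFLOP/s per GPU")
```

Under these assumptions the $10 million figure works out to a little under $2 per GPU-hour, which is consistent with reading it as marginal compute cost for a single run rather than an all-in research budget.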

Mosaic AI's training stack was the substrate underneath. Databricks made a point of emphasizing that the entire run was conducted using the same infrastructure (Composer, Streaming, MegaBlocks for sparse MoE kernels, Lilac for data quality, MLflow for experiment tracking) that customers could rent through the Mosaic AI platform. ^[1] In effect, DBRX functioned as the largest possible advertisement for Mosaic AI's enterprise training product.

Training compute and data

Item	Value
Pretraining tokens	12 trillion
Training hardware	3,072 NVIDIA H100 GPUs
Interconnect	3.2 Tbps InfiniBand on NVIDIA DGX Cloud
Training duration	~2.5 to 3 months (sources vary)
Reported training cost	~$10 million USD (marginal compute)
Estimated all-in cost	$10 to $30 million (Interconnects estimate)
Training stack	Mosaic AI (Composer, Streaming, MegaBlocks, Lilac, MLflow)
Data quality vs MPT	~2x better token-for-token (per Databricks)
Compute efficiency vs MPT-7B	~4x more efficient end-to-end (per Databricks)

What are DBRX Base and DBRX Instruct?

Databricks released two checkpoints simultaneously on March 27, 2024. ^[4]^[5]

DBRX Base is the pretrained foundation model with no instruction tuning or alignment work applied. It is intended as a starting point for further training, fine-tuning, or research. Because it has not been preference-tuned, base outputs do not follow chat formatting conventions and can produce unsafe completions; the Hugging Face model card explicitly notes the absence of safety training. ^[4]

DBRX Instruct is the chat-tuned variant produced by additional fine-tuning. Databricks released few public details about the fine-tuning recipe; the company did not publish whether it used RLHF, DPO, or simply supervised fine-tuning on a curated instruction dataset. The Interconnects writeup at release flagged this opacity as a missing piece compared to contemporary releases from Mistral and Meta, which had described their preference-optimization stages in more detail. ^[7] DBRX Instruct is the variant that scored on benchmarks in the official launch announcement and was the version most commonly served by hosting providers. ^[5]

Both checkpoints share the same architecture, parameter count, and tokenizer.

How does DBRX perform?

Databricks released DBRX with an extensive benchmark suite, comparing DBRX Instruct against several open weight peers and against the API version of GPT-3.5. ^[1] The general pattern in the official numbers is that DBRX Instruct beats the open weight peers across the board and beats GPT-3.5 on most academic benchmarks, while remaining behind GPT-4 on most tasks.

The model showed particularly strong programming results, attributed in the official writeup to the higher proportion of curated code in the pretraining corpus. On HumanEval, DBRX Instruct's 70.1% beats CodeLlama 70B Instruct's 67.8%, despite CodeLlama being a code-specialized model. ^[1] The math reasoning gap over Mixtral and Llama 2 70B is also wide.

Benchmark scores at release

All figures are as reported in the Databricks launch post, which combined Databricks' own evaluation runs with numbers published by other vendors. Higher is better. ^[1]

Benchmark	DBRX Instruct	Mixtral Instruct (8x7B)	LLaMA 2 70B Chat	Grok-1	GPT-3.5
MMLU (5-shot)	73.7%	71.4%	69.8%	73.0%	70.0%
HellaSwag (10-shot)	89.0%	86.5%	85.9%	n/a	85.5%
GSM8k (CoT)	72.8%	61.1%	54.1%	62.9%	57.1%
HumanEval (0-shot)	70.1%	54.8%	32.2%	63.2%	48.1%
Open LLM Leaderboard (avg)	74.5%	72.7%	n/a	n/a	n/a

On the Hugging Face Open LLM Leaderboard at release, DBRX Instruct was the highest scoring open-weight model. ^[1] The Databricks writeup also reported wins on the company's internal Mosaic Eval Gauntlet, a set of more than 30 benchmarks, where DBRX Instruct scored 66.8% against 60.7% for the next-best model, Mixtral Instruct. ^[1] Long-context retrieval-augmented generation results from the same writeup placed DBRX Instruct as competitive with Mixtral 8x7B and GPT-3.5 Turbo on Natural Questions and HotPotQA when paired with a vector store.

Independent third-party evaluation has been less laudatory. Artificial Analysis, which runs its own multi-benchmark Intelligence Index, scored DBRX Instruct at 8 out of 100 on its 2025 scale, well below the 13-point average for that cohort; that scale was rebuilt to include later reasoning-tuned models, so DBRX's relative position fell as the field moved on rather than because the model itself regressed. ^[18]

Is DBRX open source?

DBRX is distributed under the Databricks Open Model License, a custom license written for this release. ^[3] The license is similar in spirit to Meta's Llama 2 community license: weights are freely downloadable and usable for most commercial purposes, with exceptions and restrictions that disqualify the license as "open source" under the OSI definition. Coverage that focused on the license, such as Hackster.io, described the release as "semi-open source" for exactly this reason. ^[11]

The most-discussed clause is the monthly active user threshold. Any licensee whose products or services collectively had more than 700 million monthly active users in the preceding calendar month must request a separate license from Databricks rather than relying on the open license. ^[3] The 700 million MAU number matches the threshold in Meta's Llama 2 license and is widely understood to be aimed at large competitor cloud providers and the largest consumer internet companies.

Separately, the license forbids use of DBRX or its outputs to train or improve any other large language model. ^[3] This is a standard anti-distillation clause that has become common in semi-open releases (Llama, Qwen, and others have similar terms in some versions). It restricts a category of downstream use that the OSI Open Source AI Definition would consider essential, which is part of why critics have called DOML "semi-open" rather than open source.

Databricks does require attribution and the inclusion of the license file in redistributions. Derivative works must apply the same use restrictions to downstream users.

License terms summary

Term	DBRX (DOML)	Llama 2 community license	Apache 2.0 (e.g. Mistral 7B)
Commercial use	Allowed	Allowed	Allowed
Redistribution	Allowed with attribution	Allowed with attribution	Allowed
MAU threshold	700M MAU triggers separate license	700M MAU triggers separate license	None
Use of outputs to train other LLMs	Forbidden	Forbidden	Allowed
Use for fine-tuning	Allowed	Allowed	Allowed
Sharing fine-tuned weights	Allowed under same terms	Allowed under same terms	Allowed under any terms
OSI-approved open source	No	No	Yes

Where can you run DBRX?

At release, the two DBRX checkpoints were posted on Hugging Face at databricks/dbrx-base and databricks/dbrx-instruct, with model code in the github.com/databricks/dbrx repository. ^[4]^[5] Both checkpoints were available for direct download without registration, in contrast to Llama 2's gated request flow at the time.

Databricks customers could call DBRX Instruct through the Mosaic AI Foundation Model APIs as a hosted endpoint, both as a pay-per-token product and as a provisioned-throughput product on Databricks-managed GPUs. NVIDIA listed DBRX in its NVIDIA API Catalog and made it available as an NVIDIA NIM container, and Microsoft added the model to Azure AI Foundry. Independent inference providers including Together AI, Fireworks AI, and Perplexity also hosted DBRX endpoints in the months after launch.

The model was retired from Databricks Foundation Model APIs pay-per-token endpoints on April 30, 2025, alongside Mixtral 8x7B Instruct. ^[12] The DBRX, Mistral, and Mixtral families were also retired from Databricks Foundation Model Fine-tuning on the same date, with Databricks pointing customers toward Llama 3 and other replacements. ^[12] The Hugging Face weights remain available for self-hosting.

How was DBRX received?

The initial press treatment was warm. Coverage at VentureBeat called the release "a new state of the art"; The Verge framed it as evidence that the open model frontier was catching up to closed labs; TechCrunch led with the price tag, calling the model one Databricks "spent $10 million on" and noting that it still could not beat GPT-4. ^[6] Wired ran a piece on the Mosaic AI strategy and the way DBRX functioned as a sales tool for Databricks' enterprise customers. SiliconANGLE quoted Ali Ghodsi calling DBRX "a new standard for open source LLMs." ^[11] Hackster.io was one of the few outlets to lead with the license terms, calling the release "semi-open source." ^[11]

Within the AI research community, the release was treated as well-engineered but not field-changing. Nathan Lambert at Interconnects published a contemporaneous analysis arguing that DBRX represented a significant infrastructure achievement (proving that Databricks could ship a frontier-quality MoE) but a relatively conservative research contribution, given that fine-grained MoE was already in the literature and that Databricks had not published novel architectural ideas. ^[7] The fine-grained MoE design later became conventional wisdom in 2024 and 2025 releases, with Snowflake Arctic, DeepSeek V2, and Llama 4 all adopting many-experts top-k routing. ^[13]^[15]^[16]

Most of the actual deployment of DBRX inside organizations was through Databricks itself rather than as a downloaded weight, since 132 billion parameter models are difficult to serve on a single host. The weights alone occupy about 264 GB at 16 bits per parameter, and the minimum hardware to run DBRX in fp16 was around 320 GB of GPU memory, or four 80 GB H100s, which put self-hosted DBRX out of reach for most individual developers and pushed users toward managed inference.
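The memory figure follows from simple arithmetic on the parameter count. The sketch below estimates weight storage only and ignores activations and the key/value cache. The 80 GB per-GPU capacity and the 4-bit case are illustrative assumptions rather than figures from the source.

```python
TOTAL_PARAMS = 132e9
ACTIVE_PARAMS = 36e9
H100_MEMORY_GB = 80  # assumption: the 80 GB H100 variant
NUM_GPUS = 4


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Gigabytes (10^9 bytes) needed to store the weights alone."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(TOTAL_PARAMS, 16)  # 264.0
int4_gb = weight_memory_gb(TOTAL_PARAMS, 4)   # 66.0, hypothetical 4-bit quantization
cluster_gb = NUM_GPUS * H100_MEMORY_GB        # 320
headroom_gb = cluster_gb - fp16_gb            # left for activations and KV cache
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"fp16 weights: {fp16_gb:.0f} GB, 4-bit weights: {int4_gb:.0f} GB")
print(f"Four GPUs provide {cluster_gb} GB, leaving {headroom_gb:.0f} GB of headroom")
print(f"Active fraction per token: {active_fraction:.1%}")
```

All 132 billion parameters must be resident in memory even though only about 27% of them are used for any one token, which is why sparse activation lowers compute per token but not the memory footprint.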

How does DBRX compare to other open MoE models?

DBRX appeared at the start of a roughly twelve-month wave of open-weight MoE releases. The fine-grained, many-expert design with smaller per-expert MLPs that Databricks championed went on to become the dominant pattern, although later models varied widely in how many experts they routed each token to.

Model	Released	Total params	Active params	Experts	Top-k	Context	License
Mixtral 8x7B	Dec 2023	~46.7B	~12.9B	8	2	32K	Apache 2.0
Grok-1	Mar 17, 2024	314B	86B	8	2	8K	Apache 2.0
DBRX	Mar 27, 2024	132B	36B	16	4	32K	Databricks Open Model License
Mixtral 8x22B	Apr 2024	141B	39B	8	2	64K	Apache 2.0
Snowflake Arctic	Apr 24, 2024	480B	17B	128	2	4K	Apache 2.0
DeepSeek V2	May 2024	236B	21B	160 routed + 2 shared	6	128K	DeepSeek License
DeepSeek V3	Dec 2024	671B	37B	256 routed + 1 shared	8	128K	DeepSeek License
Llama 4 Scout	Apr 2025	~109B	17B	16	1	10M	Llama 4 license
Llama 4 Maverick	Apr 2025	~400B	17B	128	1	1M	Llama 4 license

A few patterns are visible in that table. The total parameter count grew across the year while active parameters mostly held in the same band (17 to 39 billion), reflecting the design pressure to keep inference cost flat while pushing total capacity up. The expert count climbed sharply, with later models choosing dozens or hundreds of experts while routing each token to only a handful of them (between one and eight). Context length expanded dramatically, with Llama 4 Scout's 10 million token context being roughly 300 times longer than DBRX's 32K. License terms also drifted toward more restrictive arrangements, with Llama 4 introducing additional commercial-use clauses on top of the Llama 2 framework.

In that sense DBRX sits at a hinge point. It was one of the first models to push past Mixtral's eight experts and demonstrate at scale that more granular routing was beneficial, but it was quickly outpaced by competitors who took the same idea further.

Why did Databricks not release a DBRX successor?

Databricks has not released a direct DBRX successor. There has been no public DBRX-2, no DBRX-V, and no scaled-up DBRX-Large. The company has instead leaned on the Mosaic AI Foundation Model API as a multi-model serving layer that hosts third-party weights from Meta, Mistral, DeepSeek, and other vendors, plus closed-API access to models from Anthropic and OpenAI through partnership integrations.

The strategic logic for not pushing a successor is straightforward. By 2024 and 2025 the cost of training a frontier-level model had crossed $100 million, and an enterprise data and analytics company like Databricks did not need to bet its product line on continually winning that race. Databricks customers care about training their own private models on private data using Mosaic AI's pipeline, plus serving the best off-the-shelf open and closed models from a single endpoint. Both of those use cases benefit from Databricks running great inference and fine-tuning infrastructure rather than from Databricks owning the best base model.

The company has continued to invest in research that uses Mosaic AI infrastructure, including domain-specific fine-tunes and the Mosaic AI Agent Framework launched in 2024. Mosaic AI Pretraining remains a product that allows customers to pretrain their own foundation models, with DBRX cited as the canonical example of what is possible. The Mosaic Eval Gauntlet has been kept up to date and is used by some independent evaluators.

The market position has held: Databricks raised at a $43 billion valuation in late 2023, $62 billion in 2024, and reportedly $134 billion in late 2025. The company's enterprise AI revenue grew sharply over that period, and DBRX is still cited in sales conversations as evidence that the platform can train state-of-the-art models even though the model itself is no longer the focus.

A related strategic question is whether the open release made commercial sense in retrospect. The investment was not large by 2025 standards, the marketing surface was real, and the model never went head-to-head with Databricks customers' own products since most enterprise users were running fine-tunes of smaller open weights or hosted closed APIs anyway. The case against is that the engineering effort, especially the data work, was redirected away from internal customer-facing products for several quarters, and that DBRX's quick obsolescence meant the marketing window was short. Most observers in the post-release period have treated the release as a net positive for Databricks, particularly because it gave the Mosaic AI brand a concrete proof point in a category where most competitors talked about training infrastructure abstractly.

What are the main criticisms of DBRX?

Several specific criticisms of the DBRX release have been documented.

On opacity, the Databricks blog post and Hugging Face model card did not disclose the data sources used for pretraining at the level of detail that would let third parties reproduce or audit the data. Naveen Rao described the data as "a large set of data from a diverse range of sources," mentioning "open data sets that the community knows, loves and uses every day," but did not list specific datasets, their proportions, or any licensed corpora. ^[6] The TechCrunch piece noted that this matched the industry pattern (no major lab was disclosing detailed pretraining data in 2024) but criticized it as a transparency gap for an explicitly "open" release. ^[6]

On the instruct recipe, Databricks did not publish whether DBRX Instruct was tuned with RLHF, DPO, supervised fine-tuning only, or some combination. This was unusual at the time of release, since contemporary releases from Meta, Mistral, and Anthropic had described their preference-optimization stacks at least at a high level. ^[7]

On the license, the DOML's MAU threshold and anti-distillation clause led several commentators to argue that Databricks should not call DBRX "open source" without qualification. The Open Source Initiative's then-draft Open Source AI Definition, finalized later in 2024, would not have admitted DBRX. Hackster.io and Wired both flagged the gap between Databricks' open marketing and the actual license terms. ^[11]

On benchmark choice, the Mosaic Eval Gauntlet was an internally curated suite, and the comparisons in the launch blog post used Databricks-run evals on competing models rather than the published numbers from those models' own papers. Independent reproductions of some scores produced lower numbers, and later analyses including the Artificial Analysis Intelligence Index put DBRX Instruct toward the lower end of the cohort once reasoning-tuned models entered the field. ^[18] None of this implies dishonest reporting. It does mean the headline "DBRX beats GPT-3.5" claim depended heavily on the specific benchmarks chosen.

On the tokenizer, the choice of the GPT-4 BPE was practical but had ergonomic costs for the open ecosystem. Tooling that was built around the LLaMA or GPT-NeoX vocabularies needed adaptation to consume DBRX, and direct token-count comparisons against Llama-tokenizer models on the same input string were misleading. Some observers thought the choice was as much marketing (so that price-per-token comparisons against OpenAI looked clean) as engineering, since the GPT-4 vocabulary is not unambiguously better than alternatives on all data.

On deployability, the model's 132 billion total parameters made self-hosting difficult for typical users. Even with only 36 billion parameters active per token, all of the weights had to be held in memory, and the cheapest home GPU configurations could not run it. This made the open release more useful as a research artifact and a Databricks marketing statement than as a model that downstream developers actually ran themselves. Quantized GGUF versions of DBRX appeared on Hugging Face within a few weeks of release, which brought the memory requirement down enough to run on a high-end workstation, but inference speed in those configurations was slow enough that the model remained more of a curiosity than a practical local option for most users. The dynamic was different for enterprise users on Databricks itself, who could rent a managed endpoint and avoid the hardware question entirely.

A softer critique, voiced in some research community discussion, was that DBRX did not contribute much new technical knowledge. The fine-grained MoE design had been described in earlier papers, the curriculum learning approach had appeared in concurrent work, and the data curation methods were not published in any reproducible form. That meant DBRX functioned as a demonstration that a particular set of choices worked at scale rather than as a paper that taught the field something new. Compared to releases like DeepSeek V2's multi-head latent attention or Llama 4's interleaved dense and sparse layers, both of which were accompanied by detailed technical writeups, DBRX's research footprint was relatively light. ^[15]^[16]

ELI5: What is DBRX in simple terms?

Imagine a company that needs an expert for every question. Instead of hiring one giant generalist who reads every question slowly, it hires a panel of 16 specialists and a quick receptionist (the "router"). For each question the receptionist picks the 4 specialists most likely to help and only bothers those 4, leaving the other 12 idle. DBRX works the same way: it knows a lot (132 billion parameters total) but only wakes up a small part of itself (36 billion parameters) for any single word it reads, so it is smart without being slow or expensive to run. Databricks built it in 2024, gave away the "brain" for free on the internet, and used it to show off the training tools it rents to businesses.

References

1. Databricks, "Introducing DBRX: A New State-of-the-Art Open LLM," March 27, 2024. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
2. Databricks press release, "Databricks Launches DBRX, A New Standard for Efficient Open Source Models," March 27, 2024. https://www.databricks.com/company/newsroom/press-releases/databricks-launches-dbrx-new-standard-efficient-open-source-models
3. Databricks Open Model License, retrieved from https://www.databricks.com/legal/open-model-license
4. databricks/dbrx-base on Hugging Face. https://huggingface.co/databricks/dbrx-base
5. databricks/dbrx-instruct on Hugging Face. https://huggingface.co/databricks/dbrx-instruct
6. Kyle Wiggers, "Databricks spent $10M on new DBRX generative AI model, but it can't beat GPT-4," TechCrunch, March 27, 2024. https://techcrunch.com/2024/03/27/databricks-spent-10m-on-a-generative-ai-model-that-still-cant-beat-gpt-4/
7. Nathan Lambert, "DBRX: The new best open model and Databricks' ML strategy," Interconnects, March 27, 2024. https://www.interconnects.ai/p/databricks-dbrx-open-llm
8. Wikipedia contributors, "DBRX," Wikipedia. https://en.wikipedia.org/wiki/DBRX
9. TechCrunch, "Databricks picks up MosaicML, an OpenAI competitor, for $1.3B," June 26, 2023. https://techcrunch.com/2023/06/26/databricks-picks-up-mosaicml-an-openai-competitor-for-1-3b/
10. Databricks press release, "Databricks Completes Acquisition of MosaicML," July 19, 2023. https://www.databricks.com/company/newsroom/press-releases/databricks-completes-acquisition-mosaicml
11. Hackster.io, "Databricks Releases DBRX, a State-of-the-Art Generative AI LLM, Under a Semi-Open Source License." https://www.hackster.io/news/databricks-releases-dbrx-a-state-of-the-art-generative-ai-llm-under-a-semi-open-source-license-f56843e8cda3
12. Databricks AWS release notes, "April 2025." https://docs.databricks.com/aws/en/release-notes/product/2025/april
13. Snowflake, "Snowflake Arctic: The Best LLM for Enterprise AI," April 24, 2024. https://www.snowflake.com/en/blog/arctic-open-efficient-foundation-language-models-snowflake/
14. Mistral AI, "Cheaper, Better, Faster, Stronger" (Mixtral 8x22B announcement). https://mistral.ai/news/mixtral-8x22b
15. DeepSeek-AI et al., "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model," arXiv:2405.04434, May 2024. https://arxiv.org/abs/2405.04434
16. Meta, "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation," April 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/
17. Cameron R. Wolfe, "DBRX, Continual Pretraining, RewardBench, Faster Inference, and More," Substack. https://cameronrwolfe.substack.com/p/dbrx-continual-pretraining-rewardbench
18. Artificial Analysis, "DBRX Instruct Intelligence, Performance & Price Analysis." https://artificialanalysis.ai/models/dbrx
