# BitNet b1.58

> Source: https://aiwiki.ai/wiki/bitnet_b1_58
> Updated: 2026-07-23
> Categories: Large Language Models, Microsoft, Model Architecture
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

BitNet b1.58 is a ternary-weight large [language model](/wiki/language_model) architecture from [Microsoft Research](/wiki/microsoft_research) in which every weight is constrained to one of three values, -1, 0, or +1, so each parameter carries about log2(3) ≈ 1.58 bits of information while activations stay at 8 bits.[1] Introduced in the February 2024 paper "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits," its headline result is that at model sizes of roughly 3 billion parameters and above, a BitNet b1.58 model matches a same-size FP16 LLaMA baseline in both [perplexity](/wiki/perplexity) and zero-shot accuracy while using far less compute: at 3B parameters it cuts GPU memory about 3.55x (2.22 GB versus 7.89 GB) and per-token decoding latency about 2.71x, and the paper estimates roughly 71.4 times lower arithmetic energy for matrix multiplication on a 7nm process.[1] Written by researchers at Microsoft Research, the University of [Chinese Academy of Sciences](/wiki/chinese_academy_of_sciences), and Tsinghua University, the paper argues that "the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective."[1] BitNet b1.58 treats ternary [quantization](/wiki/quantization) not as a [post-training](/wiki/post-training) compression trick but as a way of training [large language models](/wiki/large_language_model) from scratch, and it has spawned an active line of follow-up work (BitNet a4.8, BitNet b1.58 2B4T) plus an official inference stack, bitnet.cpp, that [Microsoft](/wiki/microsoft) released in October 2024.[2][3][4]

## When was BitNet b1.58 released?

BitNet b1.58 was introduced on 27 February 2024, building on the original BitNet paper from October 2023. The line has since grown into a family of papers, open-weight models, and an official inference runtime:

| Date | Milestone | Source |
|---|---|---|
| 2023-10-17 | BitNet, the original 1-bit binary architecture that introduced BitLinear | arXiv 2310.11453 [6] |
| 2024-02-27 | BitNet b1.58, "The Era of 1-bit LLMs" (ternary weights) | arXiv 2402.17764 [1] |
| 2024-07-13 | BitNet b1.58 Reloaded (extends 1.58-bit training to small networks) | arXiv 2407.09527 [8] |
| 2024-10-17 | bitnet.cpp released as the official inference framework | GitHub [3] |
| 2024-10-21 | "1-bit AI Infra: Part 1.1," the CPU-kernel paper | arXiv 2410.16144 [10] |
| 2024-11-07 | BitNet a4.8 (adds 4-bit activations) | arXiv 2411.04965 [2] |
| 2025-02-17 | "Bitnet.cpp: Efficient Edge Inference for Ternary LLMs" | arXiv 2502.11880 [11] |
| 2025-04-16 | BitNet b1.58 2B4T (2B parameters, 4T tokens, open weights) | arXiv 2504.12285 [4][5] |
| 2025-05-20 | bitnet.cpp official GPU inference kernel released | GitHub [3] |
| 2026-01-15 | bitnet.cpp CPU parallelization update (additional 1.15x to 2.1x speedup) | GitHub [3] |

## Why was BitNet b1.58 created?

The history of BitNet b1.58 begins with the more general problem of arithmetic precision in [Transformer](/wiki/transformer) models. Mainstream LLM training and inference rely on 16-bit floating-point formats (FP16 or BF16), and most quantization research in the early 2020s targeted 8-bit (INT8) or 4-bit integer representations applied after training. Post-training quantization tools such as [GPTQ](/wiki/gptq) and [AWQ](/wiki/awq) reduced memory footprint and latency without retraining, but each bit removed typically widened the perplexity gap relative to FP16, and going below 4 bits without quality loss had proven difficult on standard transformer architectures.[6]

Microsoft's original BitNet paper, "BitNet: Scaling 1-bit Transformers for Large Language Models," was posted on arXiv in October 2023 by Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei.[6] That paper proposed a 1-bit (binary) variant in which every weight was quantized to +1 or -1 by sign, using a drop-in replacement for `nn.Linear` called **BitLinear** that performed [quantization-aware training (QAT)](/wiki/quantization_aware_training) from scratch with 16-bit shadow weights and a straight-through estimator. The original BitNet showed scaling behavior similar to full-precision transformers and large reductions in memory and energy versus FP16 and INT8 baselines, but its absolute quality lagged FP16 at small to medium scales.[6]

The February 2024 follow-up, BitNet b1.58, kept the BitLinear framework and the QAT-from-scratch training recipe but generalized weights from the binary set {-1, +1} to the ternary set {-1, 0, +1}.[1] Adding zero gave the model an explicit "feature filtering" capability: a weight that lands at zero deactivates the corresponding input channel for that output, which the authors argue closes most of the quality gap to FP16 without giving up the algorithmic gains of representing weights with two bits or less.[1] The name encodes this: each ternary weight carries log2(3) ≈ 1.58 bits of entropy.

The paper's authors are Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei, with affiliations split between Microsoft Research, the University of Chinese Academy of Sciences, and Tsinghua University.[1] It appeared on arXiv as 2402.17764 on 27 February 2024 and was widely shared on [social media](/wiki/social_media) in the weeks that followed, in part because it framed itself as defining a "new scaling law" for high-quality 1-bit LLMs and called for hardware designs that target low-bit matrix multiplication directly.[1]

## How does BitNet b1.58 work?

### What are ternary weights and absmean quantization?

BitNet b1.58 replaces every linear projection in the transformer block with a BitLinear layer. During the forward pass, the 16-bit master weight matrix W is mapped to a ternary matrix W̃ ∈ {-1, 0, +1} using **absmean** quantization. The procedure is to first divide W by its mean absolute value γ (plus a small ε for numerical stability), then round each entry to the nearest integer in {-1, 0, +1}:[1][7]

```
γ = (1/nm) Σ |W_ij|
W̃ = RoundClip(W / (γ + ε), -1, 1)
RoundClip(x, a, b) = max(a, min(b, round(x)))
```
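
As a rough illustration, the three formulas above can be written in a few lines of plain Python. This is a minimal sketch, not the paper's implementation: the function name `absmean_quantize`, the default `eps`, and the example matrix are assumptions made for this example, and Python's built-in `round` uses round-half-to-even, which may differ from other frameworks on exact .5 ties.

```python
def absmean_quantize(weights, eps=1e-5):
    """Map a 2-D list of float weights to ternary values plus one scale (gamma)."""
    flat = [abs(w) for row in weights for w in row]
    gamma = sum(flat) / len(flat)              # mean absolute value over the whole matrix

    def round_clip(x, lo=-1, hi=1):
        # RoundClip(x, a, b) = max(a, min(b, round(x)))
        return max(lo, min(hi, round(x)))

    ternary = [[round_clip(w / (gamma + eps)) for w in row] for row in weights]
    return ternary, gamma


# Hypothetical 2x3 weight matrix, chosen only for illustration.
W = [[0.9, -0.05, 0.4],
     [-1.2, 0.02, -0.3]]
W_t, gamma = absmean_quantize(W)
print(W_t)      # every entry is -1, 0, or +1
print(gamma)    # the scale kept alongside the ternary matrix
```

Entries much smaller than γ collapse to 0, while entries near or above γ saturate at ±1.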

Activations are quantized to 8 bits using per-token absmax scaling to the range [-Qb, Qb], with Qb = 2^7 - 1 = 127; the zero point is dropped to keep activations symmetric.[1] Master weights and scale factors stay in 16-bit precision throughout training, and gradients are estimated via the standard straight-through estimator that copies upstream gradients through the rounding step. At inference time, only the packed ternary weights and per-tensor scales are kept.[1][7]
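
The per-token absmax step can be sketched in the same style. The helper name `absmax_quantize_token`, the `eps` guard against an all-zero token, and the sample activations are assumptions for illustration; the sketch covers only the symmetric scaling and clipping described above.

```python
def absmax_quantize_token(x, bits=8, eps=1e-5):
    """Quantize one token's activation vector to symmetric signed integers."""
    qb = 2 ** (bits - 1) - 1                   # Qb = 127 for 8 bits
    scale = qb / max(max(abs(v) for v in x), eps)
    q = [max(-qb, min(qb, round(v * scale))) for v in x]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float activations from the integer codes."""
    return [v / scale for v in q]


# Hypothetical activation vector for a single token.
x = [0.5, -2.0, 1.1, 0.0]
q, scale = absmax_quantize_token(x)
print(q, scale)
print(dequantize(q, scale))
```

Because the scale is derived from the largest magnitude in each token, the largest entry always maps to ±127 and no zero point is needed.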

The presence of the zero state is the central change from the original 1-bit BitNet. Two analytic observations follow. First, because ternary weights include zero, the linear projection can implement a learned mask in addition to a sign pattern, which the authors argue improves expressivity for an otherwise extremely constrained weight space.[1] Second, multiplying an 8-bit activation by a ternary weight is no longer a true multiplication: the result is either +x, -x, or 0, so a BitLinear matmul reduces to per-channel sign flips and additions accumulated in an integer accumulator. This eliminates the bulk of the floating-point multiplies that dominate transformer FLOPs and is the source of the paper's energy claims.[1]

### What architecture does BitNet b1.58 use?

BitNet b1.58 follows the same "LLaMA-style" recipe used in many open-weight transformers of its generation: pre-norm with [RMSNorm](/wiki/rmsnorm), [SwiGLU](/wiki/swiglu) feed-forward layers, [rotary position embeddings (RoPE)](/wiki/rope), and no bias terms anywhere in the network.[1] The authors justify this design by noting that matching LLaMA conventions means BitNet b1.58 can be "dropped into" downstream tooling such as Hugging Face Transformers without architectural surgery, and that any quality differences with the FP16 baseline can be attributed to quantization rather than to confounding architecture changes.[1]

The 2025 BitNet b1.58 2B4T model, an open-weight version released by Microsoft Research, varies the recipe slightly: it uses SubLN normalization, replaces SwiGLU with squared ReLU (ReLU²) in the feed-forward block, and adopts the [LLaMA 3](/wiki/llama_3) tokenizer with a vocabulary of 128,256 tokens.[4]

### How is BitNet b1.58 trained?

QAT from scratch with extreme quantization changes the optimization landscape, and the paper reports two important deviations from FP16 LLaMA defaults:[1]

- **Higher learning rate.** Because ternary weights change in discrete steps, the optimizer needs larger effective updates than in continuous-precision training. BitNet b1.58 therefore trains with a higher peak learning rate than the comparable FP16 model, and the authors report that 1.58-bit training is more robust to large learning rates.[1]
- **Two-stage weight decay schedule.** Weight decay is applied during the first half of training and then removed for the second half, which the paper reports as helping the model converge in the late stage.[1][7]
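
The two schedule changes can be sketched as plain functions of the training step. The function names, the peak learning rate of 1e-3, and the base weight decay of 0.1 are placeholder assumptions for illustration, not values taken from the paper; warmup is omitted for brevity.

```python
import math


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / max(1, total_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def weight_decay_at(step, total_steps, base_decay=0.1):
    """Two-stage schedule: decay during the first half, none in the second."""
    return base_decay if step < total_steps // 2 else 0.0


TOTAL = 1000  # hypothetical number of training steps
for step in (0, 250, 500, 999):
    print(step, cosine_lr(step, TOTAL, peak_lr=1e-3), weight_decay_at(step, TOTAL))
```

In a real training loop these values would be written into the optimizer's parameter groups before each step.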

Beyond these hyperparameter changes, BitNet b1.58 uses the standard [AdamW](/wiki/adamw) optimizer, cosine learning-rate decay, and the same data mixes as the FP16 baseline. The original paper trained 700M, 1.3B, and 3B parameter models on 100 billion tokens of [RedPajama](/wiki/red_pajama) data for direct apples-to-apples comparison with reproductions of FP16 LLaMA at the same scales, and additionally trained a 3.9B-parameter model and a 2-trillion-token 3B run to study scaling behavior.[1]
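
To make the QAT-from-scratch idea concrete, the following toy sketch runs one gradient step on a single output neuron: the forward pass sees only ternary weights and a scale, while the update is applied to the full-precision master weights via a straight-through estimator. Plain SGD stands in for AdamW here, and the function names, learning rate, inputs, and target are all assumptions for illustration.

```python
def quantize(w, eps=1e-5):
    """Absmean ternary quantization of a weight vector."""
    gamma = sum(abs(v) for v in w) / len(w)
    return [max(-1, min(1, round(v / (gamma + eps)))) for v in w], gamma


def ste_step(master_w, x, target, lr=0.1):
    """One SGD step on a single neuron, using ternary weights in the forward pass."""
    w_t, gamma = quantize(master_w)
    # Forward pass: only the ternary weights and the scale are used.
    y = gamma * sum(wi * xi for wi, xi in zip(w_t, x))
    err = y - target                            # gradient of 0.5 * err**2 w.r.t. y
    # Straight-through estimator: treat quantization as the identity in the
    # backward pass, so the gradient for each master weight is err * x_i.
    grads = [err * xi for xi in x]
    new_w = [wi - lr * gi for wi, gi in zip(master_w, grads)]
    return new_w, 0.5 * err * err


# Hypothetical master weights, input, and target.
w = [0.3, -0.2, 0.05]
for _ in range(3):
    w, loss = ste_step(w, x=[1.0, 2.0, -1.0], target=1.0)
    print(quantize(w)[0], loss)
```

The ternary pattern only changes when a master weight crosses a rounding boundary, which is one intuition for why the recipe tolerates larger learning rates.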

### How does matrix multiplication become addition?

In a standard FP16 transformer, a single forward pass of a linear layer with weight matrix W ∈ R^{n×m} and input X ∈ R^{b×n} requires roughly bnm floating-point multiplies and the same number of adds. In BitNet b1.58, W is ternary and X is INT8, so the multiplication step becomes one of three cases per element: +X, -X, or 0. The energy and silicon-area cost of a dedicated multiplier drops to that of an adder with a controllable sign and a zero predicate, and integer accumulators replace floating-point ones.[1]
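
The three-case rule can be shown directly. The sketch below computes a matrix-vector product with ternary weights using only additions and subtractions on an integer accumulator; it assumes an output-major layout (one row of weights per output), and the function name and sample data are illustrative. Real kernels vectorize this rather than branching per element.

```python
def ternary_matvec(w_ternary, x_int8):
    """Compute y = W x for W with entries in {-1, 0, +1} without any multiplies."""
    out = []
    for row in w_ternary:
        acc = 0                                 # integer accumulator
        for w, x in zip(row, x_int8):
            if w == 1:
                acc += x                        # +X case
            elif w == -1:
                acc -= x                        # -X case
            # w == 0 contributes nothing and is skipped
        out.append(acc)
    return out


# Hypothetical ternary weights (2 outputs, 3 inputs) and INT8 activations.
W_t = [[1, 0, -1],
       [-1, 1, 1]]
x_q = [10, -20, 30]
y = ternary_matvec(W_t, x_q)
print(y)
```

The result matches an ordinary multiply-accumulate, which is why the substitution is lossless with respect to the quantized model.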

The paper estimates that, on a 7nm process, matrix multiplication in BitNet b1.58 consumes roughly 71.4 times less arithmetic energy than in an equivalent FP16 model, based on published per-operation energy costs for 7nm logic.[1] The authors note that current commodity GPUs do not natively expose efficient ternary or sub-INT8 matrix engines, so the full theoretical gain only materializes when models are run through specialized kernels (such as those in bitnet.cpp) on CPUs, or on hypothetical hardware tailored to 1-bit operations.[1][3]

## How well does BitNet b1.58 perform?

### Perplexity and zero-shot accuracy

At fixed training data (100B RedPajama tokens), BitNet b1.58 reproductions of LLaMA-style models reach the following perplexity on the same held-out set:[1]

| Model size | BitNet b1.58 PPL | FP16 LLaMA PPL |
|---|---|---|
| 700M | 12.87 | 12.33 |
| 1.3B | 11.29 | 11.25 |
| 3B | 9.91 | 10.04 |
| 3.9B | 9.62 | (not reported in baseline) |

At 700M parameters the ternary model is roughly half a point worse in perplexity than FP16; at 1.3B the gap shrinks to within 0.05; and at 3B BitNet b1.58 is actually slightly better than the FP16 baseline.[1] This crossover near the 3B mark is the empirical basis for the paper's claim that ternary LLMs become a Pareto improvement over FP16 above roughly three billion parameters.

On a suite of seven zero-shot benchmarks (ARC-Easy, ARC-Challenge, HellaSwag, WinoGrande, PIQA, BoolQ, OpenBookQA) at 3B parameters, BitNet b1.58 averages 50.2% versus 49.7% for FP16 LLaMA reproduced at the same scale.[1] Individual task scores stay within about three points of the FP16 baseline in both directions; for example, ARC-Easy is 61.4% versus 62.1% (0.7 points lower), while ARC-Challenge is 28.3% versus 25.6% (2.7 points higher).[1]

A separate experiment trained a 3B BitNet b1.58 for 2 trillion tokens, far beyond the 100B-token baseline. That run reached an average benchmark score of 74.34%, slightly above the 73.22% reported for StableLM-3B trained on comparable data, consistent with the claim that ternary LLMs preserve standard scaling behavior under increased training compute.[1]

### Memory, latency, and throughput

The efficiency numbers in the paper are reported against an FP16 LLaMA baseline using the same vLLM-style serving infrastructure on NVIDIA A100 GPUs:[1]

- **3B model.** GPU memory drops from 7.89 GB (FP16) to 2.22 GB, a 3.55x reduction. Per-token decoding latency drops from 5.07 ms to 1.87 ms, a 2.71x speedup.[1]
- **70B model.** Decoding latency is 4.1x faster than FP16 LLaMA, throughput is 8.9x higher (2977 versus 333 tokens/sec aggregated across all serving requests), and the maximum batch size that fits on the same GPU memory grows roughly 11x (176 versus 16).[1]

The throughput gain at 70B is larger than the latency gain because the smaller per-parameter footprint enables much larger batches, which then keep tensor cores saturated.
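
The headline ratios follow directly from the raw figures quoted above, as this short arithmetic check shows. The variable names are chosen for this example only.

```python
# Raw figures as quoted in this section: (FP16 LLaMA, BitNet b1.58).
memory_gb = (7.89, 2.22)          # 3B model, GPU memory
latency_ms = (5.07, 1.87)         # 3B model, per-token decoding latency
throughput_tps = (333, 2977)      # 70B model, tokens per second
max_batch = (16, 176)             # 70B model, largest batch that fits

memory_reduction = memory_gb[0] / memory_gb[1]
latency_speedup = latency_ms[0] / latency_ms[1]
throughput_gain = throughput_tps[1] / throughput_tps[0]
batch_gain = max_batch[1] / max_batch[0]

print(f"memory:     {memory_reduction:.2f}x")
print(f"latency:    {latency_speedup:.2f}x")
print(f"throughput: {throughput_gain:.2f}x")
print(f"batch size: {batch_gain:.1f}x")
```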

### Arithmetic energy

The arithmetic energy estimates rely on published per-operation cost tables for digital logic. In the widely cited 45nm figures, for example, a 32-bit floating-point multiply consumes roughly 3.7 pJ while an 8-bit integer add consumes roughly 0.03 pJ; the paper applies the corresponding 7nm estimates. Replacing FP16 multiplies and adds with INT8 add/subtract operations is the dominant source of the 71.4x figure quoted in the paper.[1] These are projections of the matmul subcomponent only; full-system energy gains in practice depend heavily on memory traffic and on whether the deployment hardware can exploit ternary weights, which on commodity GPUs it largely cannot.[1]

## What are the BitNet b1.58 variants?

### BitNet a4.8: 4-bit activations

A follow-up paper from Hongyu Wang, Shuming Ma, and Furu Wei, "BitNet a4.8: 4-bit Activations for 1-bit LLMs," was posted to arXiv on 7 November 2024.[2] The "a4.8" name refers to activations that are mostly 4-bit, with a small fraction of outlier-heavy intermediate states kept at 8 bits, on top of the 1.58-bit ternary weights.[2]

The paper introduces a **hybrid quantization and sparsification** architecture: most inputs to attention and feed-forward layers are quantized to 4 bits, but intermediate states with heavier-tailed distributions (the inputs to the FFN down-projection and to the attention output projection) are sparsified with a top-K mask and quantized to 8 bits.[2] Roughly 84.2% of values in the FFN down-projection input are masked to zero in the 7B model, and the gate-projection outputs are about 67.5% sparse; the abstract summarizes the net effect by noting that BitNet a4.8 "activates only 55% of parameters and supports 3-bit KV cache."[2]
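
The sparsify-then-quantize step for heavy-tailed intermediate states can be sketched as follows. This is an illustrative approximation of the idea only: the function names, the `keep_ratio` parameter, and the sample vector are assumptions, and the paper's exact masking and scaling details may differ.

```python
def topk_sparsify(x, keep_ratio):
    """Keep the largest-magnitude entries of x and zero out the rest."""
    k = max(1, round(len(x) * keep_ratio))
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def absmax_quantize(x, bits=8, eps=1e-5):
    """Symmetric absmax quantization to signed integers of the given width."""
    qb = 2 ** (bits - 1) - 1
    scale = qb / max(max(abs(v) for v in x), eps)
    return [max(-qb, min(qb, round(v * scale))) for v in x], scale


# Hypothetical intermediate state with two dominant entries.
state = [0.1, -3.0, 0.2, 1.2, -0.05, 0.0, 0.7, -0.3]
sparse = topk_sparsify(state, keep_ratio=0.25)
q, scale = absmax_quantize(sparse, bits=8)
print(sparse)
print(q)
```

Zeroed entries need neither storage bandwidth nor arithmetic, so sparsity compounds with the low bit width of the surviving values.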

To avoid retraining from scratch, BitNet a4.8 uses a two-stage recipe that the paper describes as training "from W1.58A8 to W1.58A4": stage 1 trains a standard BitNet b1.58 with 8-bit activations for 95 billion tokens, and stage 2 continues training with 4-bit activations for an additional 5 billion tokens.[2] The reported numbers show parity with BitNet b1.58 at matched scales: at 7B parameters, BitNet a4.8 reaches a perplexity of 9.37 against 9.24 for BitNet b1.58, and an average zero-shot accuracy of 54.74% versus 55.09%.[2] BitNet a4.8 also supports a 3-bit key-value cache with negligible accuracy loss, addressing one of the larger memory pain points in long-context inference.[2]

### BitNet b1.58 2B4T

On 16 April 2025, Microsoft Research released **BitNet b1.58 2B4T**, a 2-billion-parameter ternary model trained on 4 trillion tokens; the model card calls it "the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale."[4][5] The accompanying technical report is arXiv 2504.12285 (Shuming Ma, Hongyu Wang, Shaohan Huang, Xingxing Zhang, Ying Hu, Ting Song, Yan Xia, Furu Wei), which states that the model "achieves performance on par with leading open-weight, full-precision LLMs of similar size, while offering significant advantages in computational efficiency."[4]

The model uses the BitLinear W1.58A8 recipe with SubLN, squared ReLU, RoPE, no bias terms, a 128,256-vocabulary LLaMA 3 tokenizer, and a 4,096-token context. After pre-training on a mix of public text/code and synthetic math, it underwent [supervised fine-tuning](/wiki/supervised_fine-tuning) on instruction-following datasets and was aligned via [Direct Preference Optimization (DPO)](/wiki/direct_preference_optimization_dpo) using UltraFeedback and MagPie.[4] Selected benchmark results from the technical report:[4]

| Benchmark | BitNet b1.58 2B | Qwen2.5 1.5B | LLaMA 3.2 1B |
|---|---|---|---|
| MMLU (5-shot) | 53.17 | 60.25 | 45.58 |
| GSM8K (4-shot) | 58.38 | 56.79 | 38.21 |
| ARC-Challenge | 49.91 | 46.67 | 37.80 |
| HumanEval+ | 38.40 | 50.60 | 31.10 |
| Average | 54.19 | 55.23 | 44.90 |

On the model's own efficiency table, non-embedding memory drops to 0.4 GB versus 2.6 GB for Qwen2.5 1.5B and 2.0 GB for LLaMA 3.2 1B; CPU per-token latency (TPOT) is 29 ms versus 65 and 48 ms respectively; and estimated CPU decoding energy per token is 0.028 J versus 0.347 J and 0.258 J.[4] Three versions of the weights are published on Hugging Face: a packed 1.58-bit format for efficient inference, a BF16 master-weight version for further training, and a GGUF version targeted at the bitnet.cpp runtime.[4]
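
A back-of-the-envelope check suggests the 0.4 GB figure is about what ternary storage would predict. The sketch below assumes roughly 2 billion non-embedding parameters, which is an approximation made for this example rather than a number from the report, and it uses decimal gigabytes.

```python
import math

params = 2e9                                   # assumed non-embedding parameter count
bits_ideal = math.log2(3)                      # about 1.585 bits per ternary weight

ideal_gb = params * bits_ideal / 8 / 1e9       # information-theoretic lower bound
two_bit_gb = params * 2 / 8 / 1e9              # simple 2-bit packing, four weights per byte

# Ratios derived from the efficiency table quoted above.
energy_vs_qwen = 0.347 / 0.028
energy_vs_llama = 0.258 / 0.028
latency_vs_qwen = 65 / 29

print(f"ideal ternary storage: {ideal_gb:.2f} GB")
print(f"2-bit packed storage:  {two_bit_gb:.2f} GB")
print(f"energy per token: {energy_vs_qwen:.1f}x and {energy_vs_llama:.1f}x lower")
print(f"CPU latency vs Qwen2.5 1.5B: {latency_vs_qwen:.2f}x faster")
```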

### BitNet b1.58 Reloaded

Independently, Jacob Nielsen and Peter Schneider-Kamp (University of Southern Denmark) published "BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks" in July 2024.[8] They studied 1.58-bit QAT on small language and vision models in the 100K to 48M parameter range, proposed using the median rather than the mean in the absmean scaling step to make the quantizer more robust to outliers, and found that small ternary networks reach state-of-the-art performance when the hidden size is approximately doubled compared to the FP16 baseline.[8] The paper supports the case that ternary training is useful below the 3B regime that dominated the original BitNet b1.58 evaluation.
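
The effect of swapping the mean for the median can be seen on a small weight vector with one outlier. This is a simplified sketch of the idea rather than the authors' code; the function name `ternarize` and the sample weights are assumptions.

```python
from statistics import mean, median


def ternarize(weights, scale_fn, eps=1e-5):
    """Ternary quantization with a pluggable scale statistic (mean or median of |w|)."""
    scale = scale_fn([abs(w) for w in weights])
    return [max(-1, min(1, round(w / (scale + eps)))) for w in weights], scale


# Hypothetical weights: five small values and one large outlier.
w = [0.02, -0.03, 0.04, -0.05, 0.03, 4.0]

by_mean, mean_scale = ternarize(w, mean)
by_median, median_scale = ternarize(w, median)
print("absmean:  ", by_mean, mean_scale)
print("absmedian:", by_median, median_scale)
```

With the mean, the outlier inflates the scale and every other weight collapses to zero; with the median, the small weights keep their signs.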

### Reproductions and external models

A community reproduction at [Hugging Face under the 1bitLLM organization](https://huggingface.co/1bitLLM) re-trained BitNet b1.58 at 700M, 1.3B, and 3B parameters on 100B tokens of RedPajama using the recipe described in the paper. The 1bitLLM 3B model reports a perplexity of 9.88 and an average zero-shot score of 49.6%, in line with the paper's claims.[9] Several other groups have trained ternary variants of existing architectures using the same recipe, including a Llama3 8B run on 100B tokens listed in the official bitnet.cpp model registry and ternary versions of the Falcon 3 and Falcon-E family.[3]

## What is bitnet.cpp?

Microsoft open-sourced **bitnet.cpp** on 17 October 2024 at the `microsoft/BitNet` GitHub repository, billing it as the "official inference framework for 1-bit LLMs" that offers "a suite of optimized kernels, that support fast and lossless inference of 1.58-bit models on CPU and GPU."[3] The project is forked from [llama.cpp](/wiki/llama_cpp) and reuses much of its GGUF format and tokenizer handling, but adds three new quantization kernel implementations specialized for ternary weights:[3][10]

- **I2_S** (Int2 with Scale): two-bit signed packed storage with a per-block scale factor, deployed on both x86 and ARM CPUs.[10]
- **TL1** (Ternary Lookup Table, ARM variant): packs every two ternary weights into a 4-bit index (since 3^2 = 9 ≤ 16) and replaces the matmul inner loop with lookups into a table of precomputed activation sums, used on ARM.[10]
- **TL2** (Ternary Lookup Table, x86 variant): a related lookup-table layout that packs every three weights into a 5-bit index (since 3^3 = 27 ≤ 32), tuned to AVX-style vector units.[10]
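
The lookup-table idea behind the TL kernels can be sketched generically: for each group of `g` activations, precompute all 3^g possible signed sums once, then let each weight group's base-3 index select the right sum. The group size is left as a parameter, and the index ordering and function names are assumptions for illustration, not bitnet.cpp's actual memory layout.

```python
from itertools import product


def build_lut(acts):
    """All 3**g signed sums for one group of g activations, in base-3 index order."""
    return [sum(w * a for w, a in zip(ws, acts))
            for ws in product((-1, 0, 1), repeat=len(acts))]


def group_index(ws):
    """Base-3 index of a ternary weight group, matching build_lut's ordering."""
    idx = 0
    for w in ws:
        idx = idx * 3 + (w + 1)                 # map -1, 0, +1 to digits 0, 1, 2
    return idx


def lut_matvec(w_rows, acts, g=2):
    """Matrix-vector product where each group of g weights costs one table lookup."""
    # Tables depend only on the activations, so they are built once and
    # reused for every weight row.
    luts = [build_lut(acts[s:s + g]) for s in range(0, len(acts), g)]
    return [sum(lut[group_index(row[i * g:(i + 1) * g])] for i, lut in enumerate(luts))
            for row in w_rows]


# Hypothetical ternary weights (2 outputs, 4 inputs) and integer activations.
W_t = [[1, 0, -1, 1],
       [-1, -1, 0, 1]]
acts = [3, -2, 5, 7]
y = lut_matvec(W_t, acts, g=2)
print(y)
```

The table cost is amortized across all output rows, which is why the approach pays off for the wide matrices found in transformer layers.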

The lookup-table approach borrows from the T-MAC project, which Microsoft Research also developed for low-bit matrix multiplication on CPUs.[3] The companion paper, "1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs" (Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei, arXiv 2410.16144, 21 October 2024), reports x86 speedups of 2.37x to 6.17x over a baseline llama.cpp INT8 path, ARM speedups of 1.37x to 5.07x, and energy reductions of 71.9% to 82.2% on x86 and 55.4% to 70.0% on ARM, all measured across a range of model sizes from 700M to 100B parameters.[10][3] The same paper notes that a hypothetical 100B BitNet b1.58 model can be decoded at 5 to 7 tokens per second on a single CPU socket using these kernels, which is in the same range as human reading speed.[10][3]

A second paper from largely the same authors, "Bitnet.cpp: Efficient Edge Inference for Ternary LLMs" (arXiv 2502.11880, 17 February 2025), generalizes the lookup-table approach to an element-wise lookup table (ELUT) for arbitrary low-bit matrix multiplication and reports up to 6.25x speedup over full-precision baselines and 2.32x over low-bit baselines.[11] The bitnet.cpp repository documents subsequent updates, including an official GPU kernel release in May 2025 and a further round of CPU parallelization in January 2026 that adds another 1.15x to 2.1x on top of earlier numbers.[3]

Inside the runtime, BitNet b1.58 2B4T weights are stored four per int8 byte (since two bits suffice to encode {-1, 0, +1, unused}); the I2_S kernel decodes the packing and multiplies against INT8 activations in tight vectorized loops.[4][3] The repository ships build instructions for Python 3.9+, CMake 3.22+, and Clang 18, and includes pre-converted GGUF weights for BitNet b1.58 2B4T (2.4B), bitnet_b1_58-large (0.7B), bitnet_b1_58-3B, Llama3-8B-1.58-100B-tokens (8.0B), and Falcon 3 / Falcon-E variants from 1B to 10B parameters.[3]
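
A minimal sketch of four-per-byte ternary packing is shown below. The code assignment (-1 to 0, 0 to 1, +1 to 2, with 3 unused), the little-end-first bit order, and the function names are assumptions for illustration; the actual I2_S layout in bitnet.cpp may order bits and blocks differently.

```python
def pack_ternary(weights):
    """Pack ternary weights four per byte using 2-bit codes."""
    codes = [w + 1 for w in weights]            # -1 -> 0, 0 -> 1, +1 -> 2; code 3 unused
    codes += [1] * (-len(codes) % 4)            # pad the tail with zero weights (code 1)
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            byte |= code << (2 * j)             # weight j occupies bits 2j and 2j+1
        out.append(byte)
    return bytes(out)


def unpack_ternary(data, count):
    """Inverse of pack_ternary; count is the original number of weights."""
    weights = []
    for byte in data:
        for j in range(4):
            weights.append(((byte >> (2 * j)) & 0b11) - 1)
    return weights[:count]


# Hypothetical weight vector whose length is not a multiple of four.
w = [1, -1, 0, 0, 1]
packed = pack_ternary(w)
print(list(packed), len(packed))
print(unpack_ternary(packed, len(w)))
```

Two bits per weight wastes about 0.4 bits relative to the log2(3) bound, a trade made for simple shift-and-mask decoding.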

## What is BitNet b1.58 used for?

The motivation behind BitNet b1.58 is that as LLMs grow, the cost bottleneck is increasingly memory bandwidth and arithmetic energy rather than parameter count itself. Three concrete deployment regimes have emerged where ternary LLMs are particularly compelling:

- **CPU-only inference on consumer hardware.** Because BitNet b1.58 turns matmul into add/subtract operations over INT8 activations with integer accumulation, conventional CPU vector units (AVX2, AVX-512, NEON) can run ternary models several times faster than a conventional llama.cpp baseline on the same CPU. Bitnet.cpp benchmarks show meaningful gains on commodity laptops and ARM single-board computers without any specialized accelerator.[3][10] The promise of running large models on a single CPU and on the kinds of devices typically used in [small language model](/wiki/small_language_model) deployments is a major reason the open-source release attracted significant attention in late 2024.[4][3]
- **Edge and on-device deployment.** The 0.4 GB non-embedding memory footprint of BitNet b1.58 2B4T fits comfortably in mobile- and embedded-device memory budgets, while the energy savings translate into longer battery life and lower thermal envelope.[4] The "Bitnet.cpp: Efficient Edge Inference for Ternary LLMs" paper explicitly targets this setting.[11]
- **Long-context and high-throughput serving.** The 70B-model results suggest that ternary representations allow much larger serving batch sizes for the same GPU memory, which improves aggregate throughput in shared inference settings.[1]

Beyond the immediate engineering applications, BitNet b1.58 is widely cited as evidence that LLM training at extremely low precision is viable, that QAT-from-scratch can match post-training quantization in quality while sidestepping a separate calibration step, and that future AI accelerators might profitably trade traditional multipliers for arrays of add/subtract units.[1][3] The paper's authors write that the approach "opens the door for designing specific hardware optimized for 1-bit LLMs," which could deliver energy gains beyond what software kernels extract on floating-point-centric GPUs.[1]

## What are the limitations of BitNet b1.58?

Several caveats apply, some acknowledged by the authors and some surfaced in subsequent work:

- **Hardware support is partial.** Commodity GPUs do not expose efficient ternary matmul natively, so the energy and latency savings reported on paper are most fully realized when running through bitnet.cpp on CPUs, or on custom kernels for select GPUs. Running BitNet b1.58 weights through standard Hugging Face Transformers does not produce the headline speedups. The official model card states, "Please do NOT expect performance efficiency gains ... when using this model with the standard transformers library."[5]
- **Quality crossover with FP16 only happens at scale.** At 700M parameters BitNet b1.58 trails FP16 LLaMA in perplexity; the crossover sits near the 3B mark in the original paper.[1] Subsequent work like BitNet b1.58 Reloaded showed that the gap at smaller scales can be reduced by widening hidden sizes (roughly doubling them) and tweaking the quantizer to use a median rather than mean, suggesting that small ternary models can match FP16 but at compute cost increases that partly offset the bitwidth savings.[8]
- **Pre-training cost is not reduced.** All current BitNet b1.58 variants are trained with 16-bit (BF16) master weights, gradients, and optimizer state; only the deployed inference weights are ternary. Training-time memory and FLOP budgets are therefore comparable to FP16 baselines, even though inference is dramatically cheaper.[1][4]
- **Tooling immaturity.** As of mid-2026 the bitnet.cpp framework supports a limited set of architectures (LLaMA-style decoder-only transformers with specific block layouts) and not the full diversity of modern open-weight models. Most popular [transformer](/wiki/transformer) variants would need bespoke kernel work to take advantage of ternary weights.[3]
- **Benchmark coverage.** The original paper's evaluation is heavy on perplexity and zero-shot classification tasks; more comprehensive evaluations on instruction-following, code, and reasoning benchmarks appeared only later with BitNet b1.58 2B4T, which still shows some gaps relative to similarly sized full-precision baselines such as Qwen2.5 1.5B on tasks like MMLU and HumanEval+.[4]

## How does BitNet b1.58 compare to other quantization methods?

BitNet b1.58 sits at the intersection of three lines of research:

- **Quantization-aware training for LLMs.** Traditional LLM quantization research (see [quantization](/wiki/quantization), [GPTQ](/wiki/gptq), [AWQ](/wiki/awq), [QLoRA](/wiki/qlora)) reduces weight precision after training; BitNet b1.58 instead trains in ternary from scratch. This is a return to the QAT regime that was dominant in the binary and ternary neural network literature of 2016 to 2018, but applied at LLM scale and with explicit attention to scaling laws.[1][6]
- **Architecture co-design.** Like [Mixture of Experts (MoE)](/wiki/mixture_of_experts) architectures and [long-context](/wiki/long_context) sparse attention designs, BitNet b1.58 is a co-design point that trades algorithmic complexity (here: the discreteness of ternary weights) for very large inference-time savings. The b1.58 line is unusual in that it preserves the standard transformer block structure entirely and changes only the linear layer's numerical regime.[1]
- **Efficient inference runtimes.** Bitnet.cpp builds on the broader ecosystem around [llama.cpp](/wiki/llama_cpp), [GGUF](/wiki/gguf), and [Hugging Face Transformers](/wiki/transformers_library), and on Microsoft's earlier T-MAC research into lookup-table matmul kernels.[3][10] In this sense BitNet b1.58 is a model architecture whose practical adoption depends on a co-released inference stack, in the same spirit as [vLLM](/wiki/vllm) for FP16 serving.

A common comparison is to other "small-model" efforts. At the 1B to 3B scale, BitNet b1.58 2B4T is roughly in the same weight class as [LLaMA 3.2](/wiki/llama_3_2) 1B, [Gemma 3](/wiki/gemma_3) 1B, Qwen2.5 1.5B, [Phi](/wiki/phi) models, and [SmolLM](/wiki/smollm) variants, and the technical report explicitly benchmarks against them.[4] BitNet b1.58 2B4T wins decisively on memory, latency, and energy while landing in the middle of the pack on raw accuracy, making it most attractive in resource-constrained deployments rather than where peak quality is paramount.[4]

## See also

- [BitNet](/wiki/bitnet)
- [Quantization](/wiki/quantization)
- [GPTQ](/wiki/gptq)
- [AWQ](/wiki/awq)
- [QLoRA](/wiki/qlora)
- [Knowledge distillation](/wiki/knowledge_distillation)
- [GGUF](/wiki/gguf)
- [llama.cpp](/wiki/llama_cpp)
- [vLLM](/wiki/vllm)
- [Hugging Face Transformers](/wiki/transformers_library)
- [LLaMA](/wiki/llama)
- [LLaMA 3](/wiki/llama_3)
- [Falcon (language model)](/wiki/falcon)
- [Falcon 3](/wiki/falcon_3)
- [Phi (language model)](/wiki/phi)
- [Gemma](/wiki/gemma)
- [SmolLM](/wiki/smollm)
- [RMSNorm](/wiki/rmsnorm)
- [SwiGLU](/wiki/swiglu)
- [Rotary position embedding (RoPE)](/wiki/rope)
- [Direct Preference Optimization (DPO)](/wiki/direct_preference_optimization_dpo)
- [Supervised fine-tuning](/wiki/supervised_fine-tuning)
- [Scaling Laws](/wiki/scaling_laws)
- [Chinchilla scaling laws](/wiki/chinchilla_scaling)
- [Microsoft](/wiki/microsoft)
- [Microsoft Research](/wiki/microsoft_research)
- [Tsinghua University](/wiki/tsinghua_university)
- [Hugging Face](/wiki/hugging_face)
- [RedPajama](/wiki/red_pajama)
- [Small language model](/wiki/small_language_model)
- [Mixture of Experts (MoE)](/wiki/mixture_of_experts)

## References

1. Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei, "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits", arXiv, 2024-02-27. https://arxiv.org/abs/2402.17764. Accessed 2026-05-20.
2. Hongyu Wang, Shuming Ma, Furu Wei, "BitNet a4.8: 4-bit Activations for 1-bit LLMs", arXiv, 2024-11-07. https://arxiv.org/abs/2411.04965. Accessed 2026-05-20.
3. Microsoft, "microsoft/BitNet: Official inference framework for 1-bit LLMs", GitHub, 2024-10-17. https://github.com/microsoft/BitNet. Accessed 2026-07-12.
4. Shuming Ma, Hongyu Wang, Shaohan Huang, Xingxing Zhang, Ying Hu, Ting Song, Yan Xia, Furu Wei, "BitNet b1.58 2B4T Technical Report", arXiv, 2025-04-16. https://arxiv.org/abs/2504.12285. Accessed 2026-05-20.
5. Microsoft Research, "microsoft/bitnet-b1.58-2B-4T model card", Hugging Face, 2025-04-16. https://huggingface.co/microsoft/bitnet-b1.58-2B-4T. Accessed 2026-07-12.
6. Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei, "BitNet: Scaling 1-bit Transformers for Large Language Models", arXiv, 2023-10-17. https://arxiv.org/abs/2310.11453. Accessed 2026-05-20.
7. "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (HTML version)", arXiv, 2024-02-27. https://arxiv.org/html/2402.17764v1. Accessed 2026-05-20.
8. Jacob Nielsen, Peter Schneider-Kamp, "BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks", arXiv, 2024-07-13. https://arxiv.org/abs/2407.09527. Accessed 2026-05-20.
9. 1bitLLM, "bitnet_b1_58-3B model card", Hugging Face, 2024-02-27. https://huggingface.co/1bitLLM/bitnet_b1_58-3B. Accessed 2026-05-20.
10. Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei, "1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs", arXiv, 2024-10-21. https://arxiv.org/abs/2410.16144. Accessed 2026-05-20.
11. Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei, "Bitnet.cpp: Efficient Edge Inference for Ternary LLMs", arXiv, 2025-02-17. https://arxiv.org/abs/2502.11880. Accessed 2026-05-20.