BitNet b1.58
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,172 words
BitNet b1.58 is a ternary-weight large language model architecture introduced by researchers at Microsoft Research, the University of Chinese Academy of Sciences, and Tsinghua University in the February 2024 paper "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits."[^1] Every weight in a BitNet b1.58 model is constrained to one of three values, -1, 0, or +1, so each parameter carries roughly log2(3) ≈ 1.58 bits of information; activations remain quantized to 8 bits.[^1] The paper's headline claim is that, at model scales of about 3 billion parameters and above, a BitNet b1.58 model matches a same-size FP16 LLaMA baseline in both perplexity and zero-shot downstream accuracy while delivering several-fold reductions in GPU memory, decoding latency, and arithmetic energy.[^1] The work positioned ternary quantization not as a post-training compression trick but as a way of training large language models from scratch, and it has since spawned an active line of follow-up papers (BitNet a4.8, BitNet b1.58 2B4T) and an official inference stack, bitnet.cpp, released by Microsoft in October 2024.[^2][^3][^4]
The history of BitNet b1.58 begins with the more general problem of arithmetic precision in Transformer models. Mainstream LLM training and inference rely on 16-bit floating-point formats (FP16 or BF16), and most quantization research in the early 2020s targeted 8-bit (INT8) or 4-bit integer representations applied after training. Post-training quantization tools such as GPTQ and AWQ reduced memory footprint and latency without retraining, but each new bit removed typically introduced perplexity gaps relative to FP16, and going below 4 bits without quality loss had proven difficult on standard transformer architectures.[^5]
Microsoft's original BitNet paper, "BitNet: Scaling 1-bit Transformers for Large Language Models," was posted on arXiv in October 2023 by Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei.[^6] That paper proposed a 1-bit (binary) variant in which every weight was quantized to +1 or -1 by sign, using a drop-in replacement for nn.Linear called BitLinear that performed quantization-aware training (QAT) from scratch with 16-bit shadow weights and a straight-through estimator. The original BitNet showed scaling behavior similar to full-precision transformers and large reductions in memory and energy versus FP16 and INT8 baselines, but its absolute quality lagged FP16 at small to medium scales.[^6]
The February 2024 follow-up, BitNet b1.58, kept the BitLinear framework and the QAT-from-scratch training recipe but generalized weights from the binary set {-1, +1} to the ternary set {-1, 0, +1}.[^1] Adding zero gave the model an explicit "feature filtering" capability: a weight that lands at zero deactivates the corresponding input channel for that output, which the authors argue closes most of the quality gap to FP16 without giving up the algorithmic gains of representing weights with two bits or fewer.[^1] The name reflects the three-valued weight alphabet: each ternary weight carries at most log2(3) ≈ 1.58 bits of information.
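The 1.58 figure can be checked with a few lines of arithmetic. The sketch below also shows, purely as an illustration, how close a simple byte-aligned packing can get to that bound; the five-weights-per-byte scheme is a generic observation about base-3 encoding and is not claimed to be the storage format of any BitNet runtime.

```python
import math

# Information capacity of one symbol drawn from a 3-value alphabet.
bits_per_ternary_weight = math.log2(3)

# Illustration only: 3**5 = 243 combinations fit in one byte (256 values),
# so five ternary weights per byte costs 8 / 5 = 1.6 bits per weight.
weights_per_byte = 5
fits_in_byte = 3 ** weights_per_byte <= 2 ** 8
packed_bits_per_weight = 8 / weights_per_byte

print(f"log2(3) = {bits_per_ternary_weight:.3f} bits per weight")
print(f"5 weights per byte -> {packed_bits_per_weight} bits per weight")
```

A plain two-bit encoding, by comparison, spends 2 bits per weight and leaves one of its four codes unused.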
The paper's authors are Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei, with affiliations split between Microsoft Research, the University of Chinese Academy of Sciences, and Tsinghua University.[^1] It appeared on arXiv as 2402.17764 on 27 February 2024 and was widely shared on social media in the weeks that followed, in part because it framed itself as defining a "new scaling law" for high-quality 1-bit LLMs and called for hardware designs that target low-bit matrix multiplication directly.[^1]
BitNet b1.58 replaces every linear projection in the transformer block with a BitLinear layer. During the forward pass, the 16-bit master weight matrix W is mapped to a ternary matrix W̃ ∈ {-1, 0, +1} using absmean quantization. The procedure is to first divide W by its mean absolute value γ (plus a small ε for numerical stability), then round each entry to the nearest integer in {-1, 0, +1}:[^1][^7]
γ = (1/nm) Σ |W_ij|
W̃ = RoundClip(W / (γ + ε), -1, 1)
RoundClip(x, a, b) = max(a, min(b, round(x)))
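The three formulas above translate almost line for line into NumPy. The following is a minimal sketch rather than the reference implementation; the function name `absmean_ternarize` and the default `eps` value are assumptions made for illustration.

```python
import numpy as np

def absmean_ternarize(w, eps=1e-5):
    """Map a float weight matrix to {-1, 0, +1} plus a per-tensor scale."""
    gamma = np.abs(w).mean()                      # gamma = (1/nm) * sum |W_ij|
    w_scaled = w / (gamma + eps)                  # W / (gamma + eps)
    # RoundClip(x, -1, 1) = max(-1, min(1, round(x)))
    w_ternary = np.clip(np.round(w_scaled), -1, 1).astype(np.int8)
    return w_ternary, gamma

w = np.array([[0.9, -0.05, 0.4],
              [-1.2, 0.1, -0.35]])
w_t, gamma = absmean_ternarize(w)
print(w_t)     # entries are restricted to -1, 0, +1
print(gamma)   # mean absolute value of w
```

In this example the mean absolute value is 0.5, so entries with magnitude well below that (such as -0.05 and 0.1) round to zero while larger ones saturate at ±1.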
Activations are quantized to 8 bits using per-token absmax scaling to the range [-Qb, Qb], with Qb = 2^7 - 1 = 127; the zero point is dropped to keep activations symmetric.[^1] Master weights and scale factors stay in 16-bit precision throughout training, and gradients are estimated via the standard straight-through estimator that copies upstream gradients through the rounding step. At inference time, only the packed ternary weights and per-tensor scales are kept.[^1][^7]
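Per-token absmax quantization can be sketched as follows, treating each row of the input as one token. The helper name `absmax_quantize` and the `eps` guard against all-zero rows are assumptions for illustration, and the straight-through estimator used for training is not shown because it requires an autograd framework.

```python
import numpy as np

def absmax_quantize(x, bits=8, eps=1e-5):
    """Quantize each row (token) of x to signed integers in [-Qb, Qb]."""
    qb = 2 ** (bits - 1) - 1                          # Qb = 127 for 8 bits
    absmax = np.abs(x).max(axis=-1, keepdims=True)    # one scale per token
    scale = qb / np.maximum(absmax, eps)
    # Symmetric quantization: no zero point, just scale, round, and clip.
    x_q = np.clip(np.round(x * scale), -qb, qb).astype(np.int8)
    return x_q, scale

x = np.array([[0.5, -1.0, 0.25],
              [2.0, 0.0, -4.0]])
x_q, scale = absmax_quantize(x)
x_hat = x_q / scale          # dequantized approximation of x
print(x_q)
print(np.abs(x_hat - x).max())
```

Because every token gets its own scale, the largest-magnitude entry in each row always maps to ±127, and the reconstruction error per entry is bounded by half a quantization step for that row.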
The presence of the zero state is the central change from the original 1-bit BitNet. Two analytic observations follow. First, because ternary weights include zero, the linear projection can implement a learned mask in addition to a sign pattern, which the authors argue improves expressivity for an otherwise extremely constrained weight space.[^1] Second, multiplying an 8-bit activation by a ternary weight is no longer a true multiplication: the result is either +x, -x, or 0, so a BitLinear matmul reduces to per-channel sign flips and additions accumulated in an integer accumulator. This eliminates the bulk of the floating-point multiplies that dominate transformer FLOPs and is the source of the paper's energy claims.[^1]
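The claim that a ternary matmul needs no multiplications can be checked directly. The sketch below computes a matrix-vector product using only selection, addition, and subtraction, then compares it with an ordinary integer matmul; the function name and the random test data are hypothetical, and a real kernel would vectorize this rather than loop in Python.

```python
import numpy as np

def ternary_matvec_addsub(w_ternary, x):
    """Compute w_ternary @ x using only additions and subtractions."""
    out = np.zeros(w_ternary.shape[0], dtype=np.int32)
    for i, row in enumerate(w_ternary):
        plus = x[row == 1].sum(dtype=np.int32)    # weight +1: add the input
        minus = x[row == -1].sum(dtype=np.int32)  # weight -1: subtract the input
        out[i] = plus - minus                     # weight 0: input is skipped
    return out

rng = np.random.default_rng(0)
w_t = rng.integers(-1, 2, size=(4, 8)).astype(np.int8)    # values in {-1, 0, 1}
x = rng.integers(-127, 128, size=8).astype(np.int8)       # INT8 activations

reference = w_t.astype(np.int32) @ x.astype(np.int32)     # ordinary matmul
result = ternary_matvec_addsub(w_t, x)
print(np.array_equal(result, reference))
```

The accumulator is a 32-bit integer, which is wide enough that sums of many INT8 values do not overflow at typical hidden sizes.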
BitNet b1.58 follows the same "LLaMA-style" recipe used in many open-weight transformers of its generation: pre-norm with RMSNorm, SwiGLU feed-forward layers, rotary position embeddings (RoPE), and no bias terms anywhere in the network.[^1] The authors justify this design by noting that matching LLaMA conventions means BitNet b1.58 can be "dropped into" downstream tooling such as Hugging Face Transformers without architectural surgery, and that any quality differences with the FP16 baseline can be attributed to quantization rather than to confounding architecture changes.[^1]
The 2025 BitNet b1.58 2B4T model, an open-weight version released by Microsoft Research, varies the recipe slightly: it uses SubLN normalization, replaces SwiGLU with squared ReLU (ReLU²) in the feed-forward block, and adopts the LLaMA 3 tokenizer with a vocabulary of 128,256 tokens.[^4]
QAT from scratch with extreme quantization changes the optimization landscape, and the paper reports two important deviations from FP16 LLaMA defaults:[^1]
Beyond these hyperparameter changes, BitNet b1.58 uses the standard AdamW optimizer, cosine learning-rate decay, and the same data mixes as the FP16 baseline. The original paper trained 700M, 1.3B, and 3B parameter models on 100 billion tokens of RedPajama data for direct apples-to-apples comparison with reproductions of FP16 LLaMA at the same scales, and additionally trained a 3.9B-parameter model and a 2-trillion-token 3B run to study scaling behavior.[^1]
In a standard FP16 transformer, a single forward pass of a linear layer with weight matrix W ∈ R^{n×m} and input X ∈ R^{b×n} requires roughly bnm floating-point multiplies and the same number of adds. In BitNet b1.58, W is ternary and X is INT8, so the multiplication step becomes one of three cases per element: +X, -X, or 0. The energy and silicon-area cost of a dedicated multiplier drops to that of an adder with a controllable sign and a zero predicate, and integer accumulators replace floating-point ones.[^1]
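To make the data flow concrete, the sketch below runs one linear layer with the shapes used above (X is b×n, W is n×m): weights are ternarized, activations are quantized to INT8, the product is accumulated in integers, and the result is rescaled to floating point once at the end. The function name `bitlinear_forward` and the `eps` value are assumptions, and NumPy's integer `@` stands in for what a dedicated kernel would do with add and subtract operations.

```python
import numpy as np

def bitlinear_forward(x, w, eps=1e-5):
    """Integer-accumulated linear layer: x is (b, n) float, w is (n, m) float."""
    # Ternarize weights with the absmean rule.
    gamma = np.abs(w).mean()
    w_t = np.clip(np.round(w / (gamma + eps)), -1, 1).astype(np.int32)

    # Quantize activations per token (per row) to [-127, 127].
    absmax = np.maximum(np.abs(x).max(axis=1, keepdims=True), eps)
    x_q = np.clip(np.round(x * (127.0 / absmax)), -127, 127).astype(np.int32)

    # All b*n*m elementwise products are of the form +x, -x, or 0,
    # so the accumulation stays entirely in integers.
    acc = x_q @ w_t

    # A single floating-point rescale per output element undoes both scales.
    y = acc * (absmax / 127.0) * gamma
    return y, acc

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16))    # b = 2 tokens, n = 16 input features
w = rng.standard_normal((16, 4))    # m = 4 output features
y, acc = bitlinear_forward(x, w)
print(acc.dtype, y.shape)
```

The example only demonstrates the mechanics. Ternarizing random weights that were not trained under this constraint gives a coarse approximation of the full-precision product, which is why the method relies on quantization-aware training.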
The paper estimates that, on a 7nm process, matrix multiplication in BitNet b1.58 consumes roughly 71.4 times less arithmetic energy than in an equivalent FP16 model, based on published per-operation energy costs for 7nm logic.[^1] The authors note that current commodity GPUs do not natively expose efficient ternary or sub-INT8 matrix engines, so the full theoretical gain only materializes when models are run through specialized kernels (such as those in bitnet.cpp) on CPUs, or on hypothetical hardware tailored to 1-bit operations.[^1][^3]
At fixed training data (100B RedPajama tokens), BitNet b1.58 reproductions of LLaMA-style models reach the following perplexity on the same held-out set:[^1]
| Model size | BitNet b1.58 PPL | FP16 LLaMA PPL |
|---|---|---|
| 700M | 12.87 | 12.33 |
| 1.3B | 11.29 | 11.25 |
| 3B | 9.91 | 10.04 |
| 3.9B | 9.62 | (not reported in baseline) |
At 700M parameters the ternary model is roughly half a point worse in perplexity than FP16; at 1.3B the gap shrinks to within 0.05; and at 3B BitNet b1.58 is actually slightly better than the FP16 baseline.[^1] This crossover near the 3B mark is the empirical basis for the paper's claim that ternary LLMs become a Pareto improvement over FP16 above roughly three billion parameters.
On a suite of seven zero-shot benchmarks (ARC-Easy, ARC-Challenge, HellaSwag, WinoGrande, PIQA, BoolQ, OpenBookQA) at 3B parameters, BitNet b1.58 averages 50.2% versus 49.7% for FP16 LLaMA reproduced at the same scale.[^1] Individual task scores hover within one to two points of the FP16 baseline in both directions; for example, ARC-Easy is 61.4% versus 62.1%, while ARC-Challenge is 28.3% versus 25.6%.[^1]
A separate experiment trained a 3B BitNet b1.58 for 2 trillion tokens, far beyond the 100B-token baseline. That run reached an average benchmark score of 74.34%, slightly above the 73.22% reported for StableLM-3B trained on comparable data, consistent with the claim that ternary LLMs preserve standard scaling behavior under increased training compute.[^1]
The efficiency numbers in the paper are reported against an FP16 LLaMA baseline using the same vLLM-style serving infrastructure on NVIDIA A100 GPUs:[^1]
The throughput gain at 70B is larger than the latency gain because the smaller per-parameter footprint enables much larger batches, which then keep tensor cores saturated.
The arithmetic energy estimates rely on published cost tables for 7nm digital logic, where a 32-bit floating-point multiply consumes roughly 3.7 pJ while an 8-bit integer add consumes roughly 0.03 pJ. Replacing FP16 multiplies with INT8 add/subtract operations is the dominant source of the 71.4x figure quoted in the paper.[^1] These are projections of the matmul subcomponent only; full-system energy gains in practice depend heavily on memory traffic and on whether the deployment hardware can exploit ternary weights, which on commodity GPUs it largely cannot.[^1]
A follow-up paper from Hongyu Wang, Shuming Ma, and Furu Wei, "BitNet a4.8: 4-bit Activations for 1-bit LLMs," was posted to arXiv on 7 November 2024.[^2] The "a4.8" label indicates that activations now use 4 bits (with a small fraction at 8 bits for outlier-heavy intermediate states) on top of the 1.58-bit weights, hence the joint precision label "W1.58 A4.8."
The paper introduces a hybrid quantization and sparsification architecture: most inputs to attention and feed-forward layers are quantized to 4 bits, but intermediate states with heavier-tailed distributions (the inputs to the FFN down-projection and the attention output projection) are sparsified with a top-K mask and quantized to 8 bits.[^2] Roughly 84.2% of values in the FFN down-projection input are masked to zero in the 7B model, and the gate-projection outputs are about 67.5% sparse; the overall fraction of active parameters is about 55%.[^2]
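The "sparsify, then quantize to 8 bits" step can be sketched as below. This is an illustration under stated assumptions, not the paper's implementation: it assumes the top-K mask is chosen per token by absolute magnitude, and the function name `topk_sparsify_int8` and the `keep_ratio` parameter are hypothetical.

```python
import numpy as np

def topk_sparsify_int8(x, keep_ratio=0.25, eps=1e-5):
    """Keep the largest-magnitude entries per row, then absmax-quantize to INT8."""
    k = max(1, int(round(keep_ratio * x.shape[-1])))
    # Threshold is the k-th largest magnitude in each row.
    thresh = np.sort(np.abs(x), axis=-1)[:, -k][:, None]
    mask = np.abs(x) >= thresh
    x_sparse = x * mask                               # masked entries become 0

    absmax = np.maximum(np.abs(x_sparse).max(axis=-1, keepdims=True), eps)
    scale = 127.0 / absmax
    x_q = np.clip(np.round(x_sparse * scale), -127, 127).astype(np.int8)
    return x_q, scale, mask

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 16))
x_q, scale, mask = topk_sparsify_int8(x, keep_ratio=0.25)
print(mask.sum(axis=1))          # number of entries kept per token
print(1.0 - mask.mean())         # fraction of entries masked to zero
```

Masking first means the 8-bit range is spent only on the surviving outlier-heavy entries, while the zeros can be skipped entirely by a sparse-aware kernel.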
To avoid retraining from scratch, BitNet a4.8 uses a two-stage recipe: stage 1 trains a standard BitNet b1.58 with 8-bit activations for 95 billion tokens, and stage 2 continues training with 4-bit activations for an additional 5 billion tokens.[^2] The reported numbers show parity with BitNet b1.58 at matched scales: at 7B parameters, BitNet a4.8 reaches a perplexity of 9.37 against 9.24 for BitNet b1.58, and an average zero-shot accuracy of 54.74% versus 55.09%.[^2] BitNet a4.8 also supports a 3-bit key-value cache with negligible accuracy loss, addressing one of the larger memory pain points in long-context inference.[^2]
On 16 April 2025, Microsoft Research released BitNet b1.58 2B4T, a 2-billion-parameter ternary model trained on 4 trillion tokens; the model card calls it "the first open-source, native 1-bit LLM at the 2-billion parameter scale."[^4] The accompanying technical report is arXiv 2504.12285 (Shuming Ma, Hongyu Wang, Shaohan Huang, Xingxing Zhang, Ying Hu, Ting Song, Yan Xia, Furu Wei).[^4]
The model uses the BitLinear/W1.58A8 recipe with SubLN, squared ReLU, RoPE, no bias terms, the LLaMA 3 tokenizer (128,256-token vocabulary), and a 4,096-token context. After pre-training on a mix of public text/code and synthetic math, it underwent supervised fine-tuning on instruction-following datasets and was aligned via Direct Preference Optimization (DPO) using UltraFeedback and MagPie.[^4] Selected benchmark results from the technical report:[^4]
| Benchmark | BitNet b1.58 2B | Qwen2.5 1.5B | LLaMA 3.2 1B |
|---|---|---|---|
| MMLU (5-shot) | 53.17 | 60.25 | 45.58 |
| GSM8K (4-shot) | 58.38 | 56.79 | 38.21 |
| ARC-Challenge | 49.91 | 46.67 | 37.80 |
| HumanEval+ | 38.40 | 50.60 | 31.10 |
| Average | 54.19 | 55.23 | 44.90 |
On the model's own efficiency table, non-embedding memory drops to 0.4 GB versus 2.6 GB for Qwen2.5 1.5B and 2.0 GB for LLaMA 3.2 1B; CPU per-token latency (TPOT) is 29 ms versus 65 and 48 ms respectively; and estimated CPU decoding energy per token is 0.028 J versus 0.347 J and 0.258 J.[^4] Three versions of the weights are published on Hugging Face: a packed 1.58-bit format for efficient inference, a BF16 master-weight version for further training, and a GGUF version targeted at the bitnet.cpp runtime.[^4]
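The relative savings implied by these figures are easier to read as ratios. The short script below simply divides each baseline value by the BitNet value; the numbers are copied from the paragraph above, and the dictionary keys are hypothetical labels chosen for this example.

```python
# Figures quoted above: non-embedding memory (GB), CPU latency per token (ms),
# and estimated CPU decoding energy per token (J).
metrics = {
    "memory_gb":  {"bitnet": 0.4,   "qwen2.5_1.5b": 2.6,   "llama3.2_1b": 2.0},
    "latency_ms": {"bitnet": 29.0,  "qwen2.5_1.5b": 65.0,  "llama3.2_1b": 48.0},
    "energy_j":   {"bitnet": 0.028, "qwen2.5_1.5b": 0.347, "llama3.2_1b": 0.258},
}

def ratios_vs_bitnet(values):
    """Return baseline / BitNet for every non-BitNet entry."""
    base = values["bitnet"]
    return {name: round(v / base, 2) for name, v in values.items() if name != "bitnet"}

ratios = {metric: ratios_vs_bitnet(values) for metric, values in metrics.items()}
for metric, per_model in ratios.items():
    print(metric, per_model)
```

On these numbers the ternary model uses roughly 5 to 6.5 times less weight memory, decodes about 1.7 to 2.2 times faster per token, and uses roughly 9 to 12 times less energy per token than the two baselines.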
Independently, Jacob Nielsen and Peter Schneider-Kamp (University of Southern Denmark) published "BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks" in July 2024.[^8] They studied 1.58-bit QAT on small language and vision models in the 100K to 48M parameter range, proposed using the median rather than the mean in the absmean scaling step to make the quantizer more robust to outliers, and found that small ternary networks reach state-of-the-art performance when the hidden size is approximately doubled compared to the FP16 baseline.[^8] The paper supports the case that ternary training is useful below the 3B regime that dominated the original BitNet b1.58 evaluation.
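The effect of swapping the mean for the median can be seen on a small weight vector with one outlier. This is a sketch of the general idea only; the function name `ternarize` and the example values are hypothetical, and the details of the authors' quantizer may differ.

```python
import numpy as np

def ternarize(w, reduce=np.mean, eps=1e-5):
    """Ternarize w using a scale computed from |w| by the given reduction."""
    gamma = reduce(np.abs(w))
    return np.clip(np.round(w / (gamma + eps)), -1, 1).astype(np.int8)

# Five ordinary weights and one large outlier.
w = np.array([0.1, -0.2, 0.15, -0.1, 0.12, 50.0])

w_mean = ternarize(w, reduce=np.mean)      # scale is dominated by the outlier
w_median = ternarize(w, reduce=np.median)  # scale tracks the typical weight

print(w_mean)
print(w_median)
```

With the mean, the single outlier inflates the scale so much that every other weight collapses to zero; with the median, the ordinary weights keep their signs.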
A community reproduction at Hugging Face under the 1bitLLM organization re-trained BitNet b1.58 at 700M, 1.3B, and 3B parameters on 100B tokens of RedPajama using the recipe described in the paper. The 1bitLLM 3B model reports a perplexity of 9.88 and an average zero-shot score of 49.6%, in line with the paper's claims.[^9] Several other groups have trained ternary variants of existing architectures using the same recipe, including a Llama3 8B run on 100B tokens listed in the official bitnet.cpp model registry and ternary versions of the Falcon 3 and Falcon-E family.[^3]
Microsoft open-sourced bitnet.cpp on 17 October 2024 at the microsoft/BitNet GitHub repository, billing it as the "official inference framework for 1-bit LLMs."[^3] The project is forked from llama.cpp and reuses much of its GGUF format and tokenizer handling, but adds three new quantization kernel implementations specialized for ternary weights:[^3][^10]
The lookup-table approach borrows from the T-MAC project, which Microsoft Research also developed for low-bit matrix multiplication on CPUs.[^3] The companion paper, "1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs" (Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei, arXiv 2410.16144, 21 October 2024), reports x86 speedups of 2.37x to 6.17x over a baseline llama.cpp INT8 path, ARM speedups of 1.37x to 5.07x, and energy reductions of 71.9% to 82.2% on x86 and 55.4% to 70.0% on ARM, all measured across a range of model sizes from 700M to 100B parameters.[^10][^3] The same paper notes that a hypothetical 100B BitNet b1.58 model can be decoded at 5 to 7 tokens per second on a single CPU socket using these kernels, which is in the same range as human reading speed.[^10][^3]
A second paper from largely the same authors, "Bitnet.cpp: Efficient Edge Inference for Ternary LLMs" (arXiv 2502.11880, 17 February 2025), generalizes the lookup-table approach to an element-wise lookup table (ELUT) for arbitrary low-bit matrix multiplication and reports up to 6.25x speedup over full-precision baselines and 2.32x over low-bit baselines.[^11] The bitnet.cpp repository documents subsequent updates, including an official GPU kernel release in May 2025 and a further round of CPU parallelization in January 2026 that adds another 1.15x to 2.1x on top of earlier numbers.[^3]
Inside the runtime, BitNet b1.58 2B4T weights are stored four per int8 byte (since two bits suffice to encode {-1, 0, +1, unused}); the I2_S kernel decodes the packing and multiplies against INT8 activations in tight vectorized loops.[^4][^3] The repository ships build instructions for Python 3.9+, CMake 3.22+, and Clang 18, and includes pre-converted GGUF weights for BitNet b1.58 2B4T (2.4B), bitnet_b1_58-large (0.7B), bitnet_b1_58-3B, Llama3-8B-1.58-100B-tokens (8.0B), and Falcon 3 / Falcon-E variants from 1B to 10B parameters.[^3]
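Packing four ternary weights into one byte can be sketched as follows. The bit layout used here (code = value + 1, lowest-order bits first) is an assumption chosen for clarity and is not claimed to match the I2_S format in bitnet.cpp; the function names are likewise hypothetical.

```python
import numpy as np

def pack_ternary(w_ternary):
    """Pack ternary values four per byte, two bits each (code = value + 1)."""
    if w_ternary.size % 4 != 0:
        raise ValueError("number of weights must be a multiple of 4")
    # Map {-1, 0, +1} to the 2-bit codes {0, 1, 2}; code 3 is left unused.
    codes = (w_ternary.astype(np.int16) + 1).astype(np.uint8).reshape(-1, 4)
    packed = (codes[:, 0]
              | (codes[:, 1] << 2)
              | (codes[:, 2] << 4)
              | (codes[:, 3] << 6))
    return packed.astype(np.uint8)

def unpack_ternary(packed):
    """Recover the ternary values from the packed bytes."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    return codes.astype(np.int8).reshape(-1) - 1

w = np.array([-1, 0, 1, 1, 0, 0, -1, 1], dtype=np.int8)
packed = pack_ternary(w)
restored = unpack_ternary(packed)
print(packed.nbytes, "bytes for", w.size, "weights")
print(np.array_equal(restored, w))
```

Eight weights occupy two bytes here, which is the 2-bits-per-weight storage cost described above, compared with 16 bytes for the same weights in FP16.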
The motivation behind BitNet b1.58 is that as LLMs grow, the cost bottleneck is increasingly memory bandwidth and arithmetic energy rather than parameter count itself. Three concrete deployment regimes have emerged where ternary LLMs are particularly compelling:
Beyond the immediate engineering applications, BitNet b1.58 is widely cited as evidence that LLM training at extremely low precision is viable, that QAT-from-scratch can match post-training quantization in quality while sidestepping a separate calibration step, and that future AI accelerators might profitably trade traditional multipliers for arrays of add/subtract units.[^1][^3] The paper's authors have argued that "1-bit hardware" optimized for ternary or binary matmul would unlock additional energy gains beyond what software kernels can extract on FP-centric GPUs.[^1]
Several caveats apply, some acknowledged by the authors and some surfaced in subsequent work:
BitNet b1.58 sits at the intersection of three lines of research:
A common comparison is to other "small-model" efforts. At the 1B to 3B scale, BitNet b1.58 2B4T is roughly in the same weight class as LLaMA 3.2 1B, Gemma 3 1B, Qwen2.5 1.5B, Phi models, and SmolLM variants, and the technical report explicitly benchmarks against them.[^4] BitNet b1.58 2B4T wins decisively on memory, latency, and energy while landing in the middle of the pack on raw accuracy, making it most attractive in resource-constrained deployments rather than where peak quality is paramount.[^4]