BitNet b1.58

BitNet b1.58 is a ternary-weight large language model architecture introduced by researchers at Microsoft Research, the University of Chinese Academy of Sciences, and Tsinghua University in the February 2024 paper "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits."[^1] Every weight in a BitNet b1.58 model is constrained to one of three values, -1, 0, or +1, so each parameter carries roughly log2(3) ≈ 1.58 bits of information; activations remain quantized to 8 bits.[^1] The paper's headline claim is that, at model scales of about 3 billion parameters and above, a BitNet b1.58 model matches a same-size FP16 LLaMA baseline in both perplexity and zero-shot downstream accuracy while delivering several-fold reductions in GPU memory, decoding latency, and arithmetic energy.[^1] The work positioned ternary quantization not as a post-training compression trick but as a way of training large language models from scratch, and it has since spawned an active line of follow-up papers (BitNet a4.8, BitNet b1.58 2B4T) and an official inference stack, bitnet.cpp, released by Microsoft in October 2024.[^2][^3][^4]

Background and motivation

The history of BitNet b1.58 begins with the more general problem of arithmetic precision in Transformer models. Mainstream LLM training and inference rely on 16-bit floating-point formats (FP16 or BF16), and most quantization research in the early 2020s targeted 8-bit (INT8) or 4-bit integer representations applied after training. Post-training quantization tools such as GPTQ and AWQ reduced memory footprint and latency without retraining, but each new bit removed typically introduced perplexity gaps relative to FP16, and going below 4 bits without quality loss had proven difficult on standard transformer architectures.[^5]

Microsoft's original BitNet paper, "BitNet: Scaling 1-bit Transformers for Large Language Models," was posted on arXiv in October 2023 by Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei.[^6] That paper proposed a 1-bit (binary) variant in which every weight was quantized to +1 or -1 by sign, using a drop-in replacement for nn.Linear called BitLinear that performed quantization-aware training (QAT) from scratch with 16-bit shadow weights and a straight-through estimator. The original BitNet showed scaling behavior similar to full-precision transformers and large reductions in memory and energy versus FP16 and INT8 baselines, but its absolute quality lagged FP16 at small to medium scales.[^6]

The February 2024 follow-up, BitNet b1.58, kept the BitLinear framework and the QAT-from-scratch training recipe but generalized weights from the binary set {-1, +1} to the ternary set {-1, 0, +1}.[^1] Adding zero gave the model an explicit "feature filtering" capability: a weight that lands at zero deactivates the corresponding input channel for that output, which the authors argue closes most of the quality gap to FP16 without giving up the algorithmic gains of representing weights with two bits or less.[^1] The name encodes this: each ternary weight carries log2(3) ≈ 1.58 bits of entropy.
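
The 1.58 figure is an information bound, not a storage format. A minimal sketch of the arithmetic follows; the helper `bits_per_weight` and the group sizes are illustrative assumptions, not something the paper specifies.

```python
import math

# Information content of one ternary weight, in bits (the "1.58" in the name).
BITS_PER_TERNARY_WEIGHT = math.log2(3)

def bits_per_weight(group_size: int) -> float:
    """Bits per weight when `group_size` ternary values share one whole-bit field.

    Hypothetical helper: a group has 3**group_size states, which needs
    ceil(log2(3**group_size)) bits in total.
    """
    states = 3 ** group_size
    total_bits = (states - 1).bit_length()  # exact integer ceil(log2(states))
    return total_bits / group_size

print(f"entropy bound: {BITS_PER_TERNARY_WEIGHT:.3f} bits")
for g in (1, 2, 3, 5):
    print(f"group of {g}: {bits_per_weight(g):.3f} bits per weight")
```

Storing one weight per 2-bit field costs a full 2 bits, while grouping several weights into one index moves the cost closer to the 1.58-bit bound.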

The paper's authors are Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei, with affiliations split between Microsoft Research, the University of Chinese Academy of Sciences, and Tsinghua University.[^1] It appeared on arXiv as 2402.17764 on 27 February 2024 and was widely shared on social media in the weeks that followed, in part because it framed itself as defining a "new scaling law" for high-quality 1-bit LLMs and called for hardware designs that target low-bit matrix multiplication directly.[^1]

Technical details

Ternary weights and the absmean quantization function

BitNet b1.58 replaces every linear projection in the transformer block with a BitLinear layer. During the forward pass, the 16-bit master weight matrix W is mapped to a ternary matrix W̃ ∈ {-1, 0, +1} using absmean quantization. The procedure is to first divide W by its mean absolute value γ (plus a small ε for numerical stability), then round each entry to the nearest integer in {-1, 0, +1}:[^1][^7]

γ = (1/nm) Σ |W_ij|
W̃ = RoundClip(W / (γ + ε), -1, 1)
RoundClip(x, a, b) = max(a, min(b, round(x)))
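
The three formulas above can be sketched in plain Python. This is a minimal illustration on nested lists, assuming a single per-tensor scale and a default `eps` of 1e-5 (the value of ε is an assumption); it is not the authors' implementation.

```python
def round_clip(x, lo=-1, hi=1):
    """RoundClip(x, a, b) = max(a, min(b, round(x))).

    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    return max(lo, min(hi, round(x)))

def absmean_ternary(W, eps=1e-5):
    """Map a 2-D list of floats to {-1, 0, +1} plus its absmean scale gamma."""
    n_entries = sum(len(row) for row in W)
    gamma = sum(abs(w) for row in W for w in row) / n_entries
    W_t = [[round_clip(w / (gamma + eps)) for w in row] for row in W]
    return W_t, gamma

W = [[0.9, -0.05, 0.4],
     [-1.2, 0.02, 0.3]]
W_t, gamma = absmean_ternary(W)
print(W_t)    # small entries collapse to 0, large ones saturate at +/-1
print(gamma)
```

Multiplying `W_t` by `gamma` gives the coarse reconstruction of `W` that the layer effectively uses.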

Activations are quantized to 8 bits using per-token absmax scaling to the range [-Qb, Qb], with Qb = 2^7 - 1 = 127; the zero point is dropped to keep activations symmetric.[^1] Master weights and scale factors stay in 16-bit precision throughout training, and gradients are estimated via the standard straight-through estimator that copies upstream gradients through the rounding step. At inference time, only the packed ternary weights and per-tensor scales are kept.[^1][^7]
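
A minimal sketch of the per-token absmax step described above, assuming one token's activations arrive as a flat list of floats. The function name and the `eps` guard against an all-zero token are illustrative assumptions.

```python
def absmax_quantize_token(x, bits=8, eps=1e-5):
    """Symmetric per-token quantization to integers in [-Qb, Qb], Qb = 2^(bits-1) - 1."""
    q_b = 2 ** (bits - 1) - 1                 # 127 for 8 bits
    absmax = max(max(abs(v) for v in x), eps)  # eps avoids division by zero
    scale = q_b / absmax
    x_q = [max(-q_b, min(q_b, round(v * scale))) for v in x]
    return x_q, scale

token = [0.4, -1.0, 0.26, 0.0]
x_q, scale = absmax_quantize_token(token)
print(x_q)                                    # integers, no zero point
print([q / scale for q in x_q])               # dequantized approximation
```

Because the range is symmetric and has no zero point, dequantization is a single division by `scale`.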

The presence of the zero state is the central change from the original 1-bit BitNet. Two analytic observations follow. First, because ternary weights include zero, the linear projection can implement a learned mask in addition to a sign pattern, which the authors argue improves expressivity for an otherwise extremely constrained weight space.[^1] Second, multiplying an 8-bit activation by a ternary weight is no longer a true multiplication: the result is either +x, -x, or 0, so a BitLinear matmul reduces to sign-conditional additions and subtractions into an integer accumulator. This eliminates the bulk of the floating-point multiplies that dominate transformer FLOPs and is the source of the paper's energy claims.[^1]
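
The training-side mechanics described in this subsection (full-precision master weights, a ternary forward pass, and a straight-through gradient) can be illustrated on a single linear neuron. This is a toy sketch under stated assumptions: plain SGD, a squared-error loss, and a hypothetical function name; it is not the paper's training code.

```python
def ste_sgd_step(w_master, x, target, lr=0.1, eps=1e-5):
    """One SGD step where the forward pass sees ternary weights only."""
    gamma = sum(abs(w) for w in w_master) / len(w_master)
    w_q = [max(-1, min(1, round(w / (gamma + eps)))) for w in w_master]

    # Forward: effective weights are gamma * w_q, a coarse copy of w_master.
    y = gamma * sum(q * xi for q, xi in zip(w_q, x))
    err = y - target                      # gradient of 0.5 * err**2 w.r.t. y

    # Straight-through estimator: rounding has zero gradient almost everywhere,
    # so the gradient for the effective weight is copied to the master weight.
    grads = [err * xi for xi in x]
    w_new = [w - lr * g for w, g in zip(w_master, grads)]
    return w_new, w_q, y

w_new, w_q, y = ste_sgd_step([0.3, -0.8, 0.05], [1.0, 2.0, -1.0], target=0.0)
print(w_q)    # ternary weights used in the forward pass
print(w_new)  # master weights still move in small continuous steps
```

The master weights accumulate small updates until they cross a rounding boundary, at which point the ternary weight flips.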

Architectural choices

BitNet b1.58 follows the same "LLaMA-style" recipe used in many open-weight transformers of its generation: pre-norm with RMSNorm, SwiGLU feed-forward layers, rotary position embeddings (RoPE), and no bias terms anywhere in the network.[^1] The authors justify this design by noting that matching LLaMA conventions means BitNet b1.58 can be "dropped into" downstream tooling such as Hugging Face Transformers without architectural surgery, and that any quality differences with the FP16 baseline can be attributed to quantization rather than to confounding architecture changes.[^1]

The 2025 BitNet b1.58 2B4T model, an open-weight version released by Microsoft Research, varies the recipe slightly: it uses SubLN normalization, replaces SwiGLU with squared ReLU (ReLU²) in the feed-forward block, and adopts the LLaMA 3 tokenizer with a vocabulary of 128,256 tokens.[^4]

Training recipe and quirks

QAT from scratch with extreme quantization changes the optimization landscape, and the paper reports two important deviations from FP16 LLaMA defaults:[^1]

Higher learning rate. Because ternary weights move in discrete steps, the optimizer needs larger effective updates than in continuous-precision training. The authors recommend a learning rate roughly twice that of a comparable FP16 model and show that smaller rates leave the network underfit.[^1]
Two-stage weight decay schedule. Weight decay is applied during the first half of training and then removed entirely for the second half, which the paper reports as accelerating the late-stage loss decrease without changing the learning rate.[^1][^7]
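
The two adjustments above can be written as small schedule helpers. The base weight decay of 0.1, the example learning rate, and the function names are assumptions for illustration; the paper's exact values are not reproduced here.

```python
def weight_decay_at(step: int, total_steps: int, base_wd: float = 0.1) -> float:
    """Two-stage schedule: decay on for the first half of training, off afterwards."""
    return base_wd if step < total_steps // 2 else 0.0

def bitnet_peak_lr(fp16_peak_lr: float, factor: float = 2.0) -> float:
    """Scale an FP16 baseline's peak learning rate by an assumed factor of about 2."""
    return fp16_peak_lr * factor

total_steps = 10
print([weight_decay_at(s, total_steps) for s in range(total_steps)])
print(bitnet_peak_lr(3e-4))
```

In a real training loop these values would be written into the optimizer's parameter groups at each step.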

Beyond these hyperparameter changes, BitNet b1.58 uses the standard AdamW optimizer, cosine learning-rate decay, and the same data mixes as the FP16 baseline. The original paper trained 700M, 1.3B, and 3B parameter models on 100 billion tokens of RedPajama data for direct apples-to-apples comparison with reproductions of FP16 LLaMA at the same scales, and additionally trained a 3.9B-parameter model and a 2-trillion-token 3B run to study scaling behavior.[^1]

Matrix multiplication becomes add/subtract

In a standard FP16 transformer, a single forward pass of a linear layer with weight matrix W ∈ R^{n×m} and input X ∈ R^{b×n} requires roughly bnm floating-point multiplies and the same number of adds. In BitNet b1.58, W is ternary and X is INT8, so the multiplication step becomes one of three cases per element: +X, -X, or 0. The energy and silicon-area cost of a dedicated multiplier drops to that of an adder with a controllable sign and a zero predicate, and integer accumulators replace floating-point ones.[^1]
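
A minimal sketch of the three-case rule, assuming integer activations `X` of shape b×n and a ternary matrix `W_t` of shape n×m held as nested lists. Real kernels vectorize and pack this; the loop below only shows that no multiplication is needed.

```python
def ternary_matmul(X, W_t):
    """Compute X @ W_t using only additions and subtractions."""
    b, n, m = len(X), len(W_t), len(W_t[0])
    Y = [[0] * m for _ in range(b)]       # integer accumulators
    for i in range(b):
        for k in range(n):
            x = X[i][k]
            for j in range(m):
                w = W_t[k][j]
                if w == 1:
                    Y[i][j] += x          # +X case
                elif w == -1:
                    Y[i][j] -= x          # -X case
                # w == 0: the input is skipped entirely
    return Y

X = [[3, -2, 5]]
W_t = [[1, 0],
       [-1, 1],
       [0, -1]]
print(ternary_matmul(X, W_t))
```

The integer result is rescaled afterwards by the weight and activation scale factors to recover the floating-point output.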

The paper estimates that, on a 7nm process, matrix multiplication in BitNet b1.58 consumes roughly 71.4 times less arithmetic energy than in an equivalent FP16 model, based on published per-operation energy costs for 7nm logic.[^1] The authors note that current commodity GPUs do not natively expose efficient ternary or sub-INT8 matrix engines, so the full theoretical gain only materializes when models are run through specialized kernels (such as those in bitnet.cpp) on CPUs, or on hypothetical hardware tailored to 1-bit operations.[^1][^3]

Reported results

Perplexity and zero-shot accuracy

At fixed training data (100B RedPajama tokens), BitNet b1.58 and the reproduced FP16 LLaMA baselines reach the following perplexities on the same held-out set:[^1]

Model size	BitNet b1.58 PPL	FP16 LLaMA PPL
700M	12.87	12.33
1.3B	11.29	11.25
3B	9.91	10.04
3.9B	9.62	(not reported in baseline)

At 700M parameters the ternary model is roughly half a point worse in perplexity than FP16; at 1.3B the gap shrinks to within 0.05; and at 3B BitNet b1.58 is actually slightly better than the FP16 baseline.[^1] This crossover near the 3B mark is the empirical basis for the paper's claim that ternary LLMs become a Pareto improvement over FP16 above roughly three billion parameters.

On a suite of seven zero-shot benchmarks (ARC-Easy, ARC-Challenge, HellaSwag, WinoGrande, PIQA, BoolQ, OpenBookQA) at 3B parameters, BitNet b1.58 averages 50.2% versus 49.7% for FP16 LLaMA reproduced at the same scale.[^1] Individual task scores stay within about three points of the FP16 baseline in both directions; for example, ARC-Easy is 61.4% versus 62.1%, while ARC-Challenge is 28.3% versus 25.6%.[^1]

A separate experiment trained a 3B BitNet b1.58 for 2 trillion tokens, far beyond the 100B-token baseline. Evaluated on a different set of end tasks from the seven-benchmark suite above, that run reached an average score of 74.34%, slightly above the 73.22% reported for StableLM-3B trained on comparable data, consistent with the claim that ternary LLMs preserve standard scaling behavior under increased training compute.[^1]

Memory, latency, throughput

The efficiency numbers in the paper are reported against an FP16 LLaMA baseline using the same vLLM-style serving infrastructure on NVIDIA A100 GPUs:[^1]

3B model. GPU memory drops from 7.89 GB (FP16) to 2.22 GB, a 3.55x reduction. Per-token decoding latency drops from 5.07 ms to 1.87 ms, a 2.71x speedup.[^1]
70B model. Decoding latency is 4.1x faster than FP16 LLaMA, throughput is 8.9x higher (2977 vs 333 tokens/sec aggregated across all serving requests), and the maximum batch size that fits on the same GPU memory grows roughly 11x (176 vs 16).[^1]

The throughput gain at 70B is larger than the latency gain because the smaller per-parameter footprint enables much larger batches, which then keep tensor cores saturated.
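
The ratios quoted above follow directly from the raw measurements. A quick arithmetic check, with illustrative variable names:

```python
# (FP16 LLaMA, BitNet b1.58) pairs as quoted in this section.
measurements = {
    "3B memory (GB)": (7.89, 2.22),
    "3B latency (ms)": (5.07, 1.87),
    "70B throughput (tok/s)": (333, 2977),
    "70B max batch size": (16, 176),
}

def change_factor(fp16: float, bitnet: float) -> float:
    """Size of the change as larger/smaller, regardless of which direction is better."""
    return max(fp16, bitnet) / min(fp16, bitnet)

ratios = {name: change_factor(a, b) for name, (a, b) in measurements.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.2f}x")
```

The batch-size factor (about 11x) exceeds the memory factor at 3B because weights are only part of the footprint; the space freed by smaller weights goes to per-request state.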

Arithmetic energy

The arithmetic energy estimates rely on published per-operation cost tables for digital logic. The frequently quoted figures of roughly 3.7 pJ for a 32-bit floating-point multiply and roughly 0.03 pJ for an 8-bit integer add are 45nm estimates; at 7nm the corresponding costs are smaller, roughly 0.34 pJ for an FP16 multiply, 0.16 pJ for an FP16 add, and 0.007 pJ for an INT8 add. Replacing each FP16 multiply-add pair with a single INT8 add/subtract therefore gives (0.34 + 0.16) / 0.007 ≈ 71.4, the figure quoted in the paper.[^1] These are projections of the matmul subcomponent only; full-system energy gains in practice depend heavily on memory traffic and on whether the deployment hardware can exploit ternary weights, which on commodity GPUs it largely cannot.[^1]

Variants and follow-up papers

BitNet a4.8: 4-bit activations

A follow-up paper from Hongyu Wang, Shuming Ma, and Furu Wei, "BitNet a4.8: 4-bit Activations for 1-bit LLMs," was posted to arXiv on 7 November 2024.[^2] The "a4.8" label indicates that activations now use 4 bits (with a small fraction at 8 bits for outlier-heavy intermediate states) on top of the 1.58-bit weights, hence the joint precision label "W1.58 A4.8."

The paper introduces a hybrid quantization and sparsification architecture: most inputs to attention and feed-forward layers are quantized to 4 bits, but intermediate states with heavier-tailed distributions (the inputs to the FFN down-projection and the attention output projection) are sparsified with a top-K mask and quantized to 8 bits.[^2] Roughly 84.2% of values in the FFN down-projection input are masked to zero in the 7B model, and the gate-projection outputs are about 67.5% sparse; the overall fraction of active parameters is about 55%.[^2]
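
The sparsify-then-quantize path for heavy-tailed intermediate states can be sketched as follows. The function names, the `keep_ratio` parameter, and the example vector are assumptions for illustration; the paper's actual masking and scaling details may differ.

```python
def topk_sparsify(x, keep_ratio):
    """Keep the largest-magnitude entries of x and zero out the rest."""
    k = max(1, round(len(x) * keep_ratio))
    by_magnitude = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(by_magnitude[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]

def absmax_int8(x, eps=1e-5):
    """Symmetric 8-bit quantization of the surviving values."""
    scale = 127 / max(max(abs(v) for v in x), eps)
    return [max(-127, min(127, round(v * scale))) for v in x], scale

hidden = [0.1, -2.0, 0.05, 0.7, -0.02, 1.5, 0.0, -0.3]
sparse = topk_sparsify(hidden, keep_ratio=0.25)   # keep 2 of 8 entries
q, scale = absmax_int8(sparse)
print(sparse)
print(q)
```

Masking first means the few outliers that survive set the quantization scale, while the zeroed entries cost nothing in the following matmul.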

To avoid retraining from scratch, BitNet a4.8 uses a two-stage recipe: stage 1 trains a standard BitNet b1.58 with 8-bit activations for 95 billion tokens, and stage 2 continues training with 4-bit activations for an additional 5 billion tokens.[^2] The reported numbers show parity with BitNet b1.58 at matched scales: at 7B parameters, BitNet a4.8 reaches a perplexity of 9.37 against 9.24 for BitNet b1.58, and an average zero-shot accuracy of 54.74% versus 55.09%.[^2] BitNet a4.8 also supports a 3-bit key-value cache with negligible accuracy loss, addressing one of the larger memory pain points in long-context inference.[^2]

BitNet b1.58 2B4T

On 16 April 2025, Microsoft Research released BitNet b1.58 2B4T, a 2-billion-parameter ternary model trained on 4 trillion tokens; the model card calls it "the first open-source, native 1-bit LLM at the 2-billion parameter scale."[^4] The accompanying technical report is arXiv 2504.12285 (Shuming Ma, Hongyu Wang, Shaohan Huang, Xingxing Zhang, Ying Hu, Ting Song, Yan Xia, Furu Wei).[^4]

The model uses the BitLinear/W1.58A8 recipe with SubLN, squared ReLU, RoPE, no bias terms, a 128,256-vocabulary LLaMA 3 tokenizer, and a 4,096-token context. After pre-training on a mix of public text/code and synthetic math, it underwent supervised fine-tuning (SFT) on instruction-following datasets and was aligned via Direct Preference Optimization (DPO) using UltraFeedback and MagPie.[^4] Selected benchmark results from the technical report:[^4]

Benchmark	BitNet b1.58 2B	Qwen2.5 1.5B	LLaMA 3.2 1B
MMLU (5-shot)	53.17	60.25	45.58
GSM8K (4-shot)	58.38	56.79	38.21
ARC-Challenge	49.91	46.67	37.80
HumanEval+	38.40	50.60	31.10
Average	54.19	55.23	44.90

On the model's own efficiency table, non-embedding memory drops to 0.4 GB versus 2.6 GB for Qwen2.5 1.5B and 2.0 GB for LLaMA 3.2 1B; CPU per-token latency (TPOT) is 29 ms versus 65 and 48 ms respectively; and estimated CPU decoding energy per token is 0.028 J versus 0.347 J and 0.258 J.[^4] Three versions of the weights are published on Hugging Face: a packed 1.58-bit format for efficient inference, a BF16 master-weight version for further training, and a GGUF version targeted at the bitnet.cpp runtime.[^4]

BitNet b1.58 Reloaded

Independently, Jacob Nielsen and Peter Schneider-Kamp (University of Southern Denmark) published "BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks" in July 2024.[^8] They studied 1.58-bit QAT on small language and vision models in the 100K to 48M parameter range, proposed using the median rather than the mean in the absmean scaling step to make the quantizer more robust to outliers, and found that small ternary networks reach state-of-the-art performance when the hidden size is approximately doubled compared to the FP16 baseline.[^8] The paper supports the case that ternary training is useful below the 3B regime that dominated the original BitNet b1.58 evaluation.
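
A small sketch of why a median-based scale is more robust to outliers than the mean. Swapping `statistics.median` in for the mean inside the absmean formula is an assumption about how the variant is applied; the helper name and the example weights are illustrative.

```python
import statistics

def ternary_quantize(weights, scale_fn, eps=1e-5):
    """Ternary quantization with a pluggable scale over the absolute values."""
    gamma = scale_fn([abs(w) for w in weights])
    quantized = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return quantized, gamma

weights = [0.1, -0.1, 0.12, -0.09, 5.0]           # one large outlier
print(ternary_quantize(weights, statistics.mean))
print(ternary_quantize(weights, statistics.median))
```

With the mean, the single outlier inflates the scale and every ordinary weight rounds to zero; with the median, the ordinary weights keep their signs.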

Reproductions and external models

A community reproduction at Hugging Face under the 1bitLLM organization re-trained BitNet b1.58 at 700M, 1.3B, and 3B parameters on 100B tokens of RedPajama using the recipe described in the paper. The 1bitLLM 3B model reports a perplexity of 9.88 and an average zero-shot score of 49.6%, in line with the paper's claims.[^9] Several other groups have trained ternary variants of existing architectures using the same recipe, including a Llama3 8B run on 100B tokens listed in the official bitnet.cpp model registry and ternary versions of the Falcon 3 and Falcon-E family.[^3]

Implementations: bitnet.cpp

Microsoft open-sourced bitnet.cpp on 17 October 2024 at the microsoft/BitNet GitHub repository, billing it as the "official inference framework for 1-bit LLMs."[^3] The project is forked from llama.cpp and reuses much of its GGUF format and tokenizer handling, but adds three new quantization kernel implementations specialized for ternary weights:[^3][^10]

I2_S (Int2 with Scale): two-bit signed packed storage with a per-block scale factor, deployed on both x86 and ARM CPUs.[^10]
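TL1 (Ternary Lookup Table, ARM variant): groups ternary weights in pairs, encodes each pair as a 4-bit index (3^2 = 9 possible patterns), and replaces the matmul inner loop with lookups into a table of precomputed activation sums; used on ARM.[^10]
TL2 (Ternary Lookup Table, x86 variant): a related lookup-table layout that compresses every three weights into a 5-bit index (3^3 = 27 patterns) for a higher compression ratio, tuned to AVX-style vector units.[^10]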
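
The lookup-table idea behind the TL kernels can be sketched in a few lines. This is a conceptual illustration only: the function names, the base-3 index order, and the default `group_size` are assumptions, and the real kernels use packed indices and SIMD table lookups.

```python
from itertools import product

def build_lut(acts):
    """All partial sums sum(w_i * a_i) for ternary patterns over one activation group."""
    return [sum(w * a for w, a in zip(pattern, acts))
            for pattern in product((-1, 0, 1), repeat=len(acts))]

def pattern_index(weights):
    """Base-3 index of a ternary group, in the same order product() enumerates."""
    idx = 0
    for w in weights:
        idx = idx * 3 + (w + 1)
    return idx

def lut_matvec(W_rows, acts, group_size=2):
    """Compute W @ acts for ternary W using one table per activation group."""
    starts = range(0, len(acts), group_size)
    # Tables depend only on the activations, so they are built once
    # and reused by every output row.
    tables = [build_lut(acts[s:s + group_size]) for s in starts]
    out = []
    for row in W_rows:
        total = 0
        for table, s in zip(tables, starts):
            total += table[pattern_index(row[s:s + group_size])]
        out.append(total)
    return out

acts = [3, -2, 5, 7]
W_rows = [[1, 0, -1, 1],
          [-1, -1, 0, 0]]
print(lut_matvec(W_rows, acts))
```

The cost of building each table is amortized over all output rows, so the inner loop per weight group becomes one index computation and one table read.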

The lookup-table approach borrows from the T-MAC project, which Microsoft Research also developed for low-bit matrix multiplication on CPUs.[^3] The companion paper, "1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs" (Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei, arXiv 2410.16144, 21 October 2024), reports x86 speedups of 2.37x to 6.17x over a baseline llama.cpp INT8 path, ARM speedups of 1.37x to 5.07x, and energy reductions of 71.9% to 82.2% on x86 and 55.4% to 70.0% on ARM, all measured across a range of model sizes from 700M to 100B parameters.[^10][^3] The same paper notes that a hypothetical 100B BitNet b1.58 model can be decoded at 5 to 7 tokens per second on a single CPU socket using these kernels, which is in the same range as human reading speed.[^10][^3]

A second paper from largely the same authors, "Bitnet.cpp: Efficient Edge Inference for Ternary LLMs" (arXiv 2502.11880, 17 February 2025), generalizes the lookup-table approach to an element-wise lookup table (ELUT) for arbitrary low-bit matrix multiplication and reports up to 6.25x speedup over full-precision baselines and 2.32x over low-bit baselines.[^11] The bitnet.cpp repository documents subsequent updates, including an official GPU kernel release in May 2025 and a further round of CPU parallelization in January 2026 that adds another 1.15x to 2.1x on top of earlier numbers.[^3]

Inside the runtime, BitNet b1.58 2B4T weights are stored four per int8 byte (since two bits suffice to encode {-1, 0, +1, unused}); the I2_S kernel decodes the packing and multiplies against INT8 activations in tight vectorized loops.[^4][^3] The repository ships build instructions for Python 3.9+, CMake 3.22+, and Clang 18, and includes pre-converted GGUF weights for BitNet b1.58 2B4T (2.4B), bitnet_b1_58-large (0.7B), bitnet_b1_58-3B, Llama3-8B-1.58-100B-tokens (8.0B), and Falcon 3 / Falcon-E variants from 1B to 10B parameters.[^3]
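
A minimal sketch of four-per-byte packing. The code assignment (0, 1, 2 for -1, 0, +1, with 3 unused) and the little-end-first bit order are assumptions for illustration; the actual I2_S layout in bitnet.cpp may differ.

```python
def pack_ternary(weights):
    """Pack ternary values four per byte using 2-bit codes 0, 1, 2 for -1, 0, +1."""
    out = bytearray()
    for start in range(0, len(weights), 4):
        byte = 0
        for pos, w in enumerate(weights[start:start + 4]):
            byte |= (w + 1) << (2 * pos)
        out.append(byte)
    return bytes(out)

def unpack_ternary(packed, count):
    """Inverse of pack_ternary; `count` drops padding in the final byte."""
    values = []
    for byte in packed:
        for pos in range(4):
            values.append(((byte >> (2 * pos)) & 0b11) - 1)
    return values[:count]

weights = [1, 0, -1, 1, -1]
packed = pack_ternary(weights)
print(len(packed), list(packed))
print(unpack_ternary(packed, len(weights)))
```

Five weights occupy two bytes here, versus ten bytes in FP16, which is where the roughly eightfold weight-memory reduction before scale factors comes from.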

Applications and significance

The motivation behind BitNet b1.58 is that as LLMs grow, the cost bottleneck is increasingly memory bandwidth and arithmetic energy rather than parameter count itself. Three concrete deployment regimes have emerged where ternary LLMs are particularly compelling:

CPU-only inference on consumer hardware. Because BitNet b1.58 turns matmul into additions and subtractions of INT8 activations, conventional CPU vector units (AVX2, AVX-512, NEON) can run ternary models at speeds comparable to GPU INT8 baselines. Bitnet.cpp benchmarks show meaningful gains on commodity laptops and ARM single-board computers without any specialized accelerator.[^3][^10] The promise of running large models on a single CPU and on the kinds of devices typically used in small language model deployments is a major reason the open-source release attracted significant attention in late 2024.[^4][^3]
Edge and on-device deployment. The 0.4 GB non-embedding memory footprint of BitNet b1.58 2B4T fits comfortably in mobile- and embedded-device memory budgets, while the energy savings translate into longer battery life and lower thermal envelope.[^4] The "Bitnet.cpp: Efficient Edge Inference for Ternary LLMs" paper explicitly targets this setting.[^11]
Long-context and high-throughput serving. The 70B-model results suggest that ternary representations allow much larger serving batch sizes for the same GPU memory, which improves aggregate throughput in shared inference settings.[^1]

Beyond the immediate engineering applications, BitNet b1.58 is widely cited as evidence that LLM training at extremely low precision is viable, that QAT-from-scratch can match post-training quantization in quality while sidestepping a separate calibration step, and that future AI accelerators might profitably trade traditional multipliers for arrays of add/subtract units.[^1][^3] The paper's authors have argued that "1-bit hardware" optimized for ternary or binary matmul would unlock additional energy gains beyond what software kernels can extract on FP-centric GPUs.[^1]

Limitations and criticisms

Several caveats apply, some acknowledged by the authors and some surfaced in subsequent work:

Hardware support is partial. Commodity GPUs do not expose efficient ternary matmul natively, so the energy and latency savings reported on paper are most fully realized when running through bitnet.cpp on CPUs, or on custom kernels for select GPUs. Running BitNet b1.58 weights through standard Hugging Face Transformers does not produce the headline speedups, a fact the model card states explicitly.[^4]
Quality crossover with FP16 only happens at scale. At 700M parameters BitNet b1.58 trails FP16 LLaMA in perplexity; the crossover sits near the 3B mark in the original paper.[^1] Subsequent work like BitNet b1.58 Reloaded showed that the gap at smaller scales can be reduced by widening hidden sizes (roughly doubling them) and tweaking the quantizer to use a median rather than mean, suggesting that small ternary models can match FP16 but at compute cost increases that partly offset the bitwidth savings.[^8]
Pre-training cost is not reduced. All current BitNet b1.58 variants are trained with 16-bit (BF16) master weights, gradients, and optimizer state; only the deployed inference weights are ternary. Training-time memory and FLOP budgets are therefore comparable to FP16 baselines, even though inference is dramatically cheaper.[^1][^4]
Tooling immaturity. As of mid-2026 the bitnet.cpp framework supports a limited set of architectures (LLaMA-style decoder-only transformers with specific block layouts) and not the full diversity of modern open-weight models. Most popular transformer variants would need bespoke kernel work to take advantage of ternary weights.[^3]
Benchmark coverage. The original paper's evaluation is heavy on perplexity and zero-shot classification tasks; more comprehensive evaluations on instruction-following, code, and reasoning benchmarks appeared only later with BitNet b1.58 2B4T, which still shows some gaps relative to similarly sized full-precision baselines such as Qwen2.5 1.5B on tasks like MMLU and HumanEval+.[^4]

Related work and comparison

BitNet b1.58 sits at the intersection of three lines of research:

Quantization-aware training for LLMs. Traditional LLM quantization research (for example GPTQ, AWQ, and QLoRA) reduces weight precision after training; BitNet b1.58 instead trains in ternary from scratch. This is a return to the QAT regime that was dominant in the binary and ternary neural network literature of 2016 to 2018, but applied at LLM scale and with explicit attention to scaling laws.[^1][^6]
Architecture co-design. Like Mixture of Experts (MoE) architectures and long-context sparse attention designs, BitNet b1.58 is a co-design point that trades algorithmic complexity (here: the discreteness of ternary weights) for very large inference-time savings. The b1.58 line is unusual in that it preserves the standard transformer block structure entirely and changes only the linear layer's numerical regime.[^1]
Efficient inference runtimes. Bitnet.cpp builds on the broader ecosystem around llama.cpp, GGUF, and Hugging Face Transformers, and on Microsoft's earlier T-MAC research into lookup-table matmul kernels.[^3][^10] In this sense BitNet b1.58 is a model architecture whose practical adoption depends on a co-released inference stack, in the same spirit as vLLM for FP16 serving.

A common comparison is to other "small-model" efforts. At the 1B to 3B scale, BitNet b1.58 2B4T is roughly in the same weight class as LLaMA 3.2 1B, Gemma 3 1B, Qwen2.5 1.5B, Phi models, and SmolLM variants, and the technical report explicitly benchmarks against them.[^4] BitNet b1.58 2B4T wins decisively on memory, latency, and energy while landing in the middle of the pack on raw accuracy, making it most attractive in resource-constrained deployments rather than where peak quality is paramount.[^4]

References
