BitNet
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,884 words
BitNet is a family of large language model architectures developed by Microsoft Research Asia that constrain the weights of a transformer to extremely low bit-widths — initially a single bit ({-1, +1}) and later three values ({-1, 0, +1}, encoded in roughly 1.58 bits per weight). The first BitNet paper, "BitNet: Scaling 1-bit Transformers for Large Language Models," was posted to arXiv on October 17, 2023 by Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei.[^1] Subsequent papers in the series — BitNet b1.58 (February 27, 2024), BitNet a4.8 (November 7, 2024), the open-weight BitNet b1.58 2B4T (April 16, 2025), and BitNet v2 (April 25, 2025) — extended the idea to ternary weights, 4-bit activations, and Hadamard-smoothed quantization, while a companion inference framework, bitnet.cpp, was open-sourced in October 2024 to make these models practical on commodity CPUs.[^2][^3][^4][^5][^6]
BitNet's central claim is that a sufficiently large transformer can be trained from scratch using ternary weights and still match the perplexity and downstream-task accuracy of an FP16 baseline of the same size and token budget — while consuming a fraction of the memory and energy. At 3 billion parameters, the b1.58 paper reported parity with a full-precision LLaMA-style model on perplexity and zero-shot accuracy while using 3.55x less GPU memory and running 2.71x faster.[^7] In 2025, Microsoft released BitNet b1.58 2B4T, a 2-billion-parameter model trained natively at 1.58 bits on 4 trillion tokens, under the MIT License on Hugging Face — the first openly distributed, natively-trained 1.58-bit LLM at the 2B scale.[^4][^8]
The work has drawn intense interest because it suggests a path to running tens-of-billions-parameter language models on a single CPU, but it has also attracted skepticism: as of mid-2026, only models up to ~2-3B parameters have been released natively-trained, no frontier lab has publicly committed to a 1-bit production model, and academic studies have begun questioning whether the FP16-parity claim survives at high training-token counts.[^9][^10]
Reducing the precision of neural network weights is one of the oldest tools in deep-learning efficiency. By 2023, the dominant approach for large language models was post-training quantization: train a model in FP16 or BF16, then compress the trained weights to INT8 or INT4 for deployment using methods such as GPTQ, AWQ, or bitsandbytes. QLoRA, released in May 2023, showed that a model with frozen 4-bit base weights could still be fine-tuned through low-rank adapters with minimal quality loss.[^11]
These methods compress an already-trained model. Quantization-aware training (QAT) instead bakes the low-precision constraint into the training loop, so the model learns weights that survive quantization rather than being forced into a quantization grid after the fact. By 2023, QAT was well-established in the computer-vision community at 4 and 8 bits, but pushing transformers to 1-bit weights had a long history of instability — earlier work on binary neural networks (BNNs) such as XNOR-Net (2016) and BinaryConnect (2015) had largely failed to scale to language models, where the loss landscape is more sensitive than in convolutional vision networks.[^1]
Two parallel trends made 2023 a plausible moment to revisit 1-bit LLMs. First, the publication of the original "Attention Is All You Need" architecture in 2017 had been followed by a cluster of architectural refinements — RMSNorm, SwiGLU, rotary position embeddings (RoPE), and reduced-bias formulations such as those used in LLaMA — that produced more numerically stable training dynamics.[^4] Second, the practical pain of FP16/BF16 inference — multi-GPU serving, multi-hundred-GB checkpoint files, and the rapidly rising electricity costs of large-model deployment — made the prospect of a 16-bit-to-1-bit memory cut commercially attractive.
Into this gap stepped the BitNet team at Microsoft Research Asia, led by Furu Wei (Distinguished Scientist and Chief Scientist of Microsoft Research Asia), with the proposal that became BitNet.[^12]
The first BitNet paper, posted to arXiv on October 17, 2023 as arXiv:2310.11453, introduced two ideas that would persist through all subsequent versions.[^1]
BitLinear. The architectural primitive replaces standard nn.Linear layers in the transformer with a "BitLinear" layer whose weights are stored in 1 bit (signed ±1) but whose activations remain in 8 bits. The forward pass is: (1) RMSNorm-style "SubLN" normalization of activations; (2) absmax quantization of activations to INT8 per token; (3) sign-based binarization of the 16-bit shadow weights to ±1; (4) matrix multiplication of the resulting INT8 × INT1 tensors; and (5) rescaling using stored absolute-mean factors.[^1][^13] The result is that the dominant cost in the forward pass — large-matrix GEMM — becomes a sign-multiply-and-accumulate operation that, in principle, can be implemented as integer additions without any multiplication hardware.
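The claim that the matmul reduces to additions can be illustrated with a minimal sketch. It is not Microsoft's implementation: the function names are hypothetical, the normalization step is omitted, and a single per-tensor absolute-mean scale (`beta`) is assumed for the weights.

```python
def absmax_quantize(x, bits=8):
    """Quantize one token's activations to signed integers with a per-token scale."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8
    scale = max(abs(v) for v in x) or 1.0      # guard against an all-zero token
    return [round(v / scale * qmax) for v in x], scale / qmax


def bitlinear_1bit(x, W):
    """Sketch of a 1-bit linear layer; W is a list of output rows of shadow weights."""
    x_q, x_scale = absmax_quantize(x)
    flat = [w for row in W for w in row]
    beta = sum(abs(w) for w in flat) / len(flat)   # assumed per-tensor absmean scale
    out = []
    for row in W:
        acc = 0
        for xi, w in zip(x_q, row):
            # sign(w) is +1 or -1, so the inner loop only adds or subtracts
            acc = acc + xi if w >= 0 else acc - xi
        out.append(acc * x_scale * beta)           # one rescale per output element
    return out
```

The only multiplications left are the two rescales per output element; the inner accumulation loop, which dominates the cost for large matrices, contains none.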
Straight-through quantization-aware training. Like prior QAT work, BitNet keeps a high-precision shadow copy of each weight tensor in optimizer state. The forward pass quantizes the shadow weights to 1 bit; the backward pass uses a "straight-through estimator" to pass gradients through the non-differentiable sign operation, so the shadow weights continue to receive meaningful updates. This is what is meant when the BitNet papers say they are "trained from scratch" at 1-bit precision — the deployed model is 1-bit, but training itself happens in mixed precision with a 16-bit master copy.[^1][^4]
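A toy sketch of the straight-through estimator, assuming a single neuron with a squared-error loss and plain SGD (the BitNet papers use full transformer training; `ste_step` and its arguments are hypothetical names for illustration):

```python
def sign(w):
    return 1.0 if w >= 0 else -1.0


def ste_step(shadow, x, target, lr=0.1):
    """One SGD step on the high-precision shadow weights of a 1-bit neuron."""
    w_q = [sign(w) for w in shadow]                  # forward pass uses quantized weights
    y = sum(wq * xi for wq, xi in zip(w_q, x))
    err = y - target                                 # dL/dy for L = 0.5 * (y - target)**2
    grad_wq = [err * xi for xi in x]                 # gradient w.r.t. the quantized weights
    # Straight-through: treat d(sign(w))/dw as 1, so the shadow weights
    # receive the gradient computed for the quantized weights.
    new_shadow = [w - lr * g for w, g in zip(shadow, grad_wq)]
    return new_shadow, y
```

For example, starting from `shadow = [0.05, -0.3]` with `x = [1.0, 1.0]` and `target = -2.0`, one step moves the first shadow weight below zero, so its quantized value flips from +1 to -1 on the next forward pass even though the sign function itself has zero gradient almost everywhere.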
The October 2023 paper trained models from 125M up to 3.9B parameters and reported that BitNet "exhibits a scaling law akin to full-precision Transformers," with the BitNet loss curve approaching the FP16 baseline as model size grew. Memory savings vs FP16 were reported as substantial but the paper explicitly framed the work as "in progress" and stopped short of claiming full perplexity parity.[^1] The paper noted authors affiliated with both Microsoft Research Asia and academic partners in China (including footnote markers indicating collaborators at the University of the Chinese Academy of Sciences and Tsinghua).[^14]
Reception of the original paper was muted: it was widely circulated in deep-learning Twitter and on Hacker News but did not lead to immediate adoption, in part because the strongest results required training models from scratch — there was no way to convert an existing FP16 LLaMA checkpoint — and because no production-scale 1-bit checkpoint was released.
The follow-up paper, "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" (arXiv:2402.17764, February 27, 2024), is the work that turned BitNet into a phenomenon.[^2] Authored by Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei, it made two consequential changes.
Ternary weights. Rather than restricting weights to ±1, b1.58 allows three values: {-1, 0, +1}. Information-theoretically, three states require log₂(3) ≈ 1.58 bits of storage — giving the variant its name. The "0" state is critical: it gives the model an implicit sparsity mechanism (any weight that should be small enough to ignore becomes literally zero) without changing the simplicity of the matmul kernel, since multiplication by 0, +1, or -1 still reduces to addition and subtraction. Quantization uses an absmean scheme: weights are scaled by their absolute-mean and then rounded to the nearest of {-1, 0, +1}.[^2][^13]
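The absmean scheme described above can be sketched in a few lines. This is an illustrative version operating on a flat list of weights; the `eps` guard and the function name are assumptions, not taken from the paper.

```python
import math


def absmean_ternary(weights, eps=1e-8):
    """Scale by the mean absolute value, round, and clip to {-1, 0, +1}."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    q = [max(-1, min(1, round(w / (alpha + eps)))) for w in weights]
    return q, alpha


# Information content of one ternary symbol, the source of the "1.58" name.
BITS_PER_TERNARY_WEIGHT = math.log2(3)
```

Weights much smaller than the mean magnitude round to 0, which is the implicit sparsity mechanism; weights near or above the mean magnitude saturate at ±1.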
Performance parity claim. The headline result is that BitNet b1.58 matches the performance of an FP16 Transformer LLM "with the same model size and training tokens in terms of both perplexity and end-task performance," with the parity emerging at the 3B-parameter mark.[^2] The paper reports the following numbers against FP16 LLaMA-style baselines of matching size:
| Model size | BitNet b1.58 PPL | FP16 LLaMA PPL | Memory reduction | Latency speedup |
|---|---|---|---|---|
| 700M | 12.87 | 12.33 | 2.60x | 1.23x |
| 1.3B | 11.29 | 11.25 | 2.93x | 1.67x |
| 3B | 9.91 | 10.04 | 3.55x | 2.71x |
At 3B the b1.58 model outperforms the FP16 baseline on perplexity, and the gap continues to widen at 3.9B, where the paper reports 3.32x memory savings and 2.40x latency speedup.[^7] At a hypothetical 70B scale, the paper projected an 8.9x throughput increase and the ability to handle batches 11x larger than a 70B FP16 LLaMA on the same hardware.[^7] The authors also estimated that on a 7nm process node, the elimination of FP16 multiplies in the matmul kernel saves 71.4x in arithmetic energy.[^7]
These numbers reframed extreme quantization as a viable training strategy rather than a deployment-time compromise: if you could train a model in 1.58 bits from scratch and get equal or better quality than its FP16 sibling, then the only reason not to do so was the lack of fast inference kernels — a problem the team would address with bitnet.cpp eight months later.
To translate the theoretical advantages into measurable wall-clock speedups, Microsoft released bitnet.cpp, an open-source C++ inference framework, on October 17, 2024, the one-year anniversary of the original BitNet paper.[^3][^15] The project is hosted at github.com/microsoft/BitNet under the MIT License. It builds on llama.cpp but adds custom kernels specialized for ternary operations.
Kernels. bitnet.cpp ships three quantization kernels selected by hardware target: I2_S (a 2-bit signed packed representation, supported on both x86 and ARM), TL1 (a 2-bit packed representation with lookup tables, ARM-only), and TL2 (an x86-only lookup-table kernel).[^3] The lookup-table approach is borrowed from Microsoft's T-MAC project and exploits the fact that with only three weight values there are very few possible matmul outcomes per small tile — these can be precomputed once and reused, replacing arithmetic with memory lookups.
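The lookup-table idea can be sketched as follows. This is a conceptual illustration only: the group size, the base-3 index encoding, and the function names are assumptions and do not reflect the actual TL1/TL2 memory layouts. The input length is assumed to be a multiple of `GROUP`.

```python
from itertools import product

GROUP = 2  # weights per lookup, so each table has 3**GROUP = 9 entries


def encode_row(row):
    """Offline step: map each group of ternary weights to a base-3 table index."""
    indices = []
    for g in range(0, len(row), GROUP):
        code = 0
        for w in row[g:g + GROUP]:
            code = code * 3 + (w + 1)          # -1, 0, +1 become digits 0, 1, 2
        indices.append(code)
    return indices


def build_tables(x_q):
    """Per input vector: every possible partial sum for each activation group."""
    tables = []
    for g in range(0, len(x_q), GROUP):
        chunk = x_q[g:g + GROUP]
        tables.append([sum(w * a for w, a in zip(pattern, chunk))
                       for pattern in product((-1, 0, 1), repeat=GROUP)])
    return tables


def lut_matvec(encoded_rows, x_q):
    """Each output element is a sum of table lookups, one per weight group."""
    tables = build_tables(x_q)
    return [sum(t[i] for t, i in zip(tables, row)) for row in encoded_rows]
```

The tables are built once per input vector and then reused by every output row, so for a wide weight matrix the table-building cost is amortized and the per-row work is reduced to lookups and additions.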
Reported speedups. On ARM CPUs, bitnet.cpp reports 1.37x to 5.07x speedup with 55.4% to 70.0% energy reduction relative to an FP16 baseline; on x86 CPUs, 2.37x to 6.17x speedup with 71.9% to 82.2% energy reduction.[^3] The most dramatic claim is that a 100-billion-parameter BitNet b1.58 model can run on a single CPU at 5–7 tokens per second. That is slow by frontier-cloud standards but roughly the speed of human reading, and it is a regime in which no FP16 LLM of that size is practical on consumer hardware.[^3]
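A back-of-the-envelope calculation shows why the 100B case is plausible on a single machine at all. The sketch below counts weight storage only; it ignores embeddings, scale factors, activations, and the KV cache, so the results are lower bounds rather than measured figures.

```python
import math


def weight_gigabytes(n_params, bits_per_weight):
    """Weight storage in GB (10**9 bytes), ignoring all other memory."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 100e9                                        # the hypothetical 100B model

fp16_gb = weight_gigabytes(N_PARAMS, 16)                # 16 bits per weight
packed_2bit_gb = weight_gigabytes(N_PARAMS, 2)          # one ternary weight per 2 bits
ideal_gb = weight_gigabytes(N_PARAMS, math.log2(3))     # information-theoretic floor
```

Under these assumptions the FP16 weights alone need 200 GB, while a 2-bit packed ternary encoding needs 25 GB and the theoretical floor is just under 20 GB, which is within reach of a workstation's main memory.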
Caveats. The 100B figure is for a synthetic model — Microsoft has not actually trained or released a 100B BitNet model as of mid-2026 — and the speedup numbers are kernel-level, not end-to-end including tokenization, prompt processing, and memory access patterns. GPU support was officially introduced on May 20, 2025, after substantial community pressure, and NPU support remains roadmapped rather than shipped.[^3] A subsequent January 2026 update to bitnet.cpp added parallel kernel implementations and embedded quantization, claiming an additional 1.15x–2.1x speedup.[^3][^16]
By late 2024 the team had a 1.58-bit weight scheme that closed the FP16 gap and a CPU inference framework that materialized the speedups. The remaining inefficiency was that activations were still stored in 8 bits. Reducing them to 4 bits would, in principle, halve the memory traffic at inference time again — but activations contain "outlier channels" with much larger dynamic range than weights, and naive 4-bit quantization destroys these outliers.
BitNet a4.8 (arXiv:2411.04965, November 7, 2024) by Hongyu Wang, Shuming Ma, and Furu Wei addresses this with a hybrid quantization-and-sparsification strategy.[^5][^17] Activations entering the attention and feed-forward network projections are quantized to INT4 (or FP4), where outlier impact is small. The intermediate activations — gated FFN states and attention scores — are instead sparsified (most channels are zeroed out) and the surviving non-zero channels kept at 8 bits. The result is that ~55% of FFN parameters become inactive per token, and the active 45% benefits from 4-bit activation kernels.[^5][^17]
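The two treatments can be contrasted in a rough sketch. The absmax scaling, the top-k-by-magnitude selection rule, and the 0.45 default keep ratio are illustrative assumptions chosen to mirror the description above; they are not the paper's exact procedure.

```python
def quantize_int4(x):
    """Absmax-quantize a vector to signed 4-bit integers in [-8, 7]."""
    scale = (max(abs(v) for v in x) or 1.0) / 7
    return [max(-8, min(7, round(v / scale))) for v in x], scale


def sparsify_topk_int8(x, keep_ratio=0.45):
    """Zero all but the largest-magnitude entries, then quantize survivors to INT8."""
    k = max(1, round(len(x) * keep_ratio))
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    scale = (max(abs(v) for v in x) or 1.0) / 127
    return [round(v / scale) if i in keep else 0 for i, v in enumerate(x)], scale
```

The INT4 path has only 16 levels, so a single outlier stretches the scale and crushes the remaining values toward zero. The sparsify-then-INT8 path keeps the large entries at finer resolution and skips the rest entirely, which is why it suits the outlier-heavy intermediate states.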
a4.8 also introduced 3-bit KV cache compression, attacking the other major memory bottleneck of LLM inference. Combined, the paper reports approximately 2x speedup over BitNet b1.58 and a roughly 10x reduction in memory and 4x speedup vs full-precision LLaMA models, while maintaining "negligible differences" in downstream-task accuracy.[^17]
a4.8 was developed jointly by Microsoft Research and the University of the Chinese Academy of Sciences (UCAS), a collaboration that has continued through later versions of the BitNet series.[^17]
For nearly 14 months after the b1.58 paper, the headline benchmark numbers came from internal Microsoft training runs whose weights were not public. This changed on April 16, 2025, with the release of the BitNet b1.58 2B4T Technical Report (arXiv:2504.12285) and accompanying open-weight model on Hugging Face at microsoft/bitnet-b1.58-2B-4T.[^4][^8]
Scale. 2 billion parameters and 4 trillion training tokens, or roughly 2,000 tokens per parameter. That is far beyond the roughly 20 tokens per parameter associated with Chinchilla-style compute-optimal scaling, and substantially above the 100B-token regime of the original b1.58 experiments. The model was trained from scratch in 1.58-bit weights, not quantized from a full-precision checkpoint.[^4]
Architecture. Transformer-based with BitLinear layers, RoPE positional embeddings, squared ReLU (ReLU²) activation in the FFN, SubLN normalization, and no bias terms in linear or normalization layers. Tokenizer is borrowed from LLaMA 3 with a 128,256-token vocabulary. Context length is 4,096 tokens, modest by 2025 standards.[^8]
Training recipe. Three stages: large-scale pre-training on 4T tokens (a mixture of DCLM, FineWeb-EDU, and synthetic mathematical data), supervised fine-tuning (SFT) on instruction datasets, and Direct Preference Optimization (DPO) using UltraFeedback and MagPie. Pre-training used a two-stage learning-rate schedule with a "cooldown" abrupt decay roughly midway through, and a similar two-stage weight-decay strategy (peaking at 0.1, then disabled).[^18]
Benchmark results. The release made head-to-head comparisons against Llama 3.2 1B, Gemma-3 1B, Qwen2.5 1.5B, and MiniCPM 2B:
| Benchmark | BitNet b1.58 2B | Llama 3.2 1B | Qwen2.5 1.5B | Gemma-3 1B |
|---|---|---|---|---|
| ARC-Challenge | 49.91 | 37.80 | 46.67 | — |
| GSM8K | 58.38 | 38.21 | 56.79 | — |
| MMLU | 53.17 | 45.58 | 60.25 | — |
| HumanEval+ | 38.40 | 31.10 | 50.60 | — |
| Average | 54.19 | 44.90 | 55.23 | — |
BitNet leads on reasoning benchmarks (ARC-Challenge, GSM8K) and trails on coding (HumanEval+) and broad knowledge (MMLU). Its aggregate average of 54.19 is slightly behind Qwen2.5 1.5B's 55.23, but at roughly one-sixth the non-embedding memory footprint (0.4 GB vs 2.6 GB).[^8]
Efficiency. Reported numbers from the model card:
| Model | Non-embedding memory | CPU decode latency | Estimated energy/token |
|---|---|---|---|
| BitNet b1.58 2B4T | 0.4 GB | 29 ms | 0.028 J |
| Llama 3.2 1B | 2.0 GB | 48 ms | 0.258 J |
| Gemma-3 1B | 1.4 GB | 41 ms | 0.186 J |
| Qwen2.5 1.5B | 2.6 GB | 65 ms | 0.347 J |
| MiniCPM 2B | 4.8 GB | 124 ms | 0.649 J |
The model fits in 0.4 GB and runs at 29 ms/token on CPU. That is 3.5x less memory than the next-smallest competitor (Gemma-3 1B at 1.4 GB) and more than 20x lower estimated energy per token than MiniCPM 2B.[^8] Microsoft published three variants: the packed 1.58-bit deployment weights, a BF16 master-weights variant for further fine-tuning, and a GGUF-format variant for use with bitnet.cpp.[^8]
The model card carries an explicit warning: the efficiency advantages only materialize when running through bitnet.cpp. Running the same checkpoint through stock Hugging Face transformers is slower than running an FP16 model of equivalent quality, because the transformers library lacks the specialized ternary kernels and falls back on unpacking the weights to higher precision.[^8][^19]
Nine days after the 2B4T release, on April 25, 2025, Microsoft Research and UCAS posted BitNet v2 (arXiv:2504.18415, "Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs"), by Hongyu Wang, Shuming Ma, and Furu Wei.[^6][^20]
v2 returns to the activation-outlier problem that motivated a4.8 but takes a different approach. Rather than sparsifying intermediate activations to dodge outliers, v2 smooths the activation distribution before quantization using an online Hadamard transform — a fast, structured orthogonal transform that mixes the entries of an activation tensor so that any outlier channel is averaged across many channels, converting a sharp, heavy-tailed distribution into a more Gaussian-like one that is friendly to uniform 4-bit quantization.[^6]
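The smoothing effect is easy to see on a toy vector. Below is a minimal sketch of an orthonormal fast Walsh-Hadamard transform in pure Python; it illustrates the transform itself and is not the BitNet v2 kernel.

```python
import math


def hadamard(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    v = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = v[i], v[i + h]
                v[i], v[i + h] = a + b, a - b   # butterfly: adds and subtracts only
        h *= 2
    norm = math.sqrt(n)
    return [t / norm for t in v]
```

Applied to a length-16 vector that is zero except for one entry of 16.0, the transform spreads the outlier into sixteen entries of magnitude 4.0. The vector's norm is unchanged, but its peak value drops fourfold, so a uniform 4-bit grid wastes far less of its range. Because the normalized transform is its own inverse, applying it a second time recovers the original vector.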
The architectural primitive, H-BitLinear, is a drop-in replacement for the output projection in attention and the down-projection in FFN layers (the two locations where activation outliers are concentrated). The model is trained first with INT8 activations matching b1.58, then continue-trained with INT4 activations for all linear layers except input/output embeddings. The reported result is that BitNet v2(a4) "maintains comparable performance to the 8-bit version while significantly boosting efficiency in batched inference scenarios" — closing most of the gap between a4.8 and b1.58 without needing the sparsification machinery.[^6]
A v2 revision was posted on June 13, 2025.[^6]
The BitLinear layer is the load-bearing piece of the BitNet family. In the b1.58 formulation, the forward pass for an input activation tensor x and shadow weight matrix W is:
- Activation quantization: x_q = round(x / absmax(x) * 127). This gives a per-token scale factor that floats with the dynamic range of each token.[^13]
- Weight quantization: W_q = round(W / mean(|W|)) clipped to {-1, 0, +1}, with a scalar scale α = mean(|W|). The absmean choice is critical: using max or absmax instead introduces too much rounding error at the bulk of the weight distribution.[^13]
- Rescaling: the integer matmul output is multiplied by α and the per-token activation scale to recover the FP16-equivalent output.

The model carries no bias terms in either the linear layers or the normalization, reducing parameter count and simplifying the kernel design. RoPE provides positional information without learned position embeddings, and squared-ReLU activations in the FFN (used by the 2B4T release) replace SwiGLU; empirically, squared-ReLU is more amenable to low-bit activation than the gated SwiGLU variant.[^4][^8]
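Putting the b1.58 steps together for a single token gives the following sketch. It is an illustration under stated assumptions: plain RMS normalization stands in for SubLN, the `eps` guards are added for safety, and the function names are hypothetical.

```python
def rms_norm(x, eps=1e-6):
    """Plain RMS normalization, used here as a stand-in for SubLN."""
    rms = (sum(v * v for v in x) / len(x) + eps) ** 0.5
    return [v / rms for v in x]


def bitlinear_b158(x, W, eps=1e-8):
    """Forward pass for one token x; W is a list of output rows of shadow weights."""
    x = rms_norm(x)
    # Per-token absmax quantization of activations to INT8.
    x_scale = max(max(abs(v) for v in x), eps) / 127
    x_q = [round(v / x_scale) for v in x]
    # Per-tensor absmean quantization of weights to {-1, 0, +1}.
    flat = [w for row in W for w in row]
    alpha = sum(abs(w) for w in flat) / len(flat)
    W_q = [[max(-1, min(1, round(w / (alpha + eps)))) for w in row] for row in W]
    # Integer accumulation; each product is +x, -x, or 0.
    acc = [sum(wq * xq for wq, xq in zip(row, x_q)) for row in W_q]
    # Rescale back to floating point with both scale factors.
    return [a * alpha * x_scale for a in acc]
```

On a matrix this small the ternary output can differ noticeably from the full-precision product; the parity claims in the papers concern models trained with this constraint from the start, not after-the-fact rounding of arbitrary weights.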
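All BitNet variants are trained with quantization-aware training (QAT) rather than post-training quantization. During training, a high-precision shadow copy of the weights is kept alongside the optimizer state, the forward pass quantizes that copy on the fly, and the backward pass uses the straight-through estimator to route gradients back to the shadow weights.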
This is the source of one of BitNet's most-discussed limitations: training is not cheaper than training an FP16 model of equivalent size, because the optimizer state is still at BF16 precision. The cost savings are exclusively at inference time. BitNet 2B4T's training run on 4T tokens consumed compute roughly equivalent to training a BF16 model of the same size and token budget.[^9]
Hugging Face researchers in 2024 demonstrated an alternative pathway: converting an existing FP16 model to 1.58-bit via fine-tuning with a gradually-ramped quantization "lambda" schedule. They fine-tuned Llama 3 8B to 1.58 bits using a linear scheduler that warmed up over 1000–2000 steps and a 1e-4 learning rate. After 10 billion tokens of fine-tuning, the resulting 1.58-bit Llama 3 8B variant achieved better perplexity than Microsoft's own 7B BitNet trained on 100B tokens.[^21] However, the converted model still lagged the original FP16 Llama 3 8B on aggregate metrics, suggesting QAT-from-scratch and conversion-via-fine-tuning produce qualitatively different trade-offs.[^21]
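One way to read the "lambda" schedule is as an interpolation between the full-precision weight and its quantized value. The sketch below assumes that interpretation; the blending formula, the default warmup length, and the function names are illustrative and not taken from the Hugging Face write-up.

```python
def lambda_at(step, warmup_steps=1000):
    """Linear ramp from 0 (full precision) to 1 (fully quantized)."""
    return min(step / warmup_steps, 1.0)


def blended_weight(w, w_quantized, lam):
    """Interpolate between the original weight and its quantized value."""
    return w + lam * (w_quantized - w)
```

At step 0 the model behaves exactly like the FP16 checkpoint, and by the end of the warmup every weight is fully quantized. The gradual ramp is meant to let the network adapt to quantization error a little at a time instead of absorbing all of it in the first update.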
The strongest claim from the BitNet series, perplexity parity with FP16 at matched parameter count, holds at 3 billion parameters and below, the scales for which Microsoft has published results. The b1.58 paper showed parity emerging at 3B (BitNet 9.91 vs LLaMA-3B FP16 10.04 perplexity), and the 2B4T release shows the 2B model trailing Qwen2.5 1.5B by less than 1.1 points on aggregate benchmarks while using roughly one-sixth the memory.[^2][^7][^8]
The claim becomes more contested at larger scales and higher token budgets:
The community verdict as of mid-2026 is that BitNet has produced a real, replicable efficiency improvement at the 2B scale for tasks that don't require precise long-tail knowledge, but the FP16-parity claim should not yet be extrapolated to frontier-scale models.
The BitNet thesis has a structural implication: if the dominant cost of LLM inference is FP16 matrix multiplication and that cost can be replaced with ternary integer addition, then the GPU is no longer the natural inference substrate. Several lines of follow-on hardware work have emerged:
Specialized FPGA accelerators. Academic groups have produced two notable designs: TerEffic (arXiv:2502.16473, February 2025), an FPGA architecture with custom datapaths for ternary arithmetic, and TeLLMe v2 (arXiv:2510.15926, late 2025), an end-to-end ternary LLM prefill-and-decode accelerator using table-lookup matrix multiplication on edge FPGAs.[^22][^23] Both demonstrate that for ternary models the cost-per-token of FPGA inference can be lower than GPU inference at small batch sizes.
Lookup-table kernels on commodity CPUs. bitnet.cpp's I2_S / TL1 / TL2 kernels demonstrate that the throughput advantage is already accessible without new silicon — ARM and x86 cores with SIMD instructions can run lookup-based ternary GEMM kernels at the reported 2x–6x speedups.[^3][^16]
Native NPU and GPU support. Microsoft has acknowledged that an ASIC or NPU specifically designed for ternary inference could deliver another order-of-magnitude improvement, but as of the May 2025 update, bitnet.cpp's GPU path is a CUDA-based implementation that does not yet exploit GPU tensor-core hardware specifically tuned for ternary arithmetic, and NPU support is roadmapped but unshipped.[^3][^15]
No major commercial chipmaker — TSMC customer or otherwise — has announced a production-grade ternary inference accelerator as of mid-2026, despite the apparent commercial pull. This is part of what has fueled the skeptical Hacker News observation that if 1-bit inference were really 10x cheaper, an NVIDIA competitor would have shipped silicon by now.[^24]
BitNet's reception has been bifurcated. Within the research and open-source community, the b1.58 paper and the 2B4T release have been treated as a significant proof-of-concept and have inspired the Hugging Face Llama 3 1.58-bit conversion experiment, multiple FPGA implementations, and a wave of academic follow-ups extending the technique to specific domains.[^21][^22] BitNet b1.58 2B4T was reported by InfoQ, TechRepublic, VentureBeat, MarkTechPost, and others as a meaningful step toward democratized on-device AI.[^25][^17]
At the same time, frontier-lab adoption has not materialized. Hacker News discussions of the bitnet.cpp release noted that "if ternary weights scaled gracefully to 100B, [you would expect Microsoft] to have proven it by now rather than listing it as a research direction," and that all major commercial LLM serving stacks remained on FP16 / BF16 / FP8 in 2025–2026.[^24] Several plausible explanations have been offered:
Notable additional research from the same team has continued to expand the BitNet program. Sparse-BitNet (2025) explored semi-structured sparsity in 1.58-bit weights; BitDistill explored distilling FP16 teacher models into BitNet students; and the "1-bit AI Infra" series (e.g., arXiv:2410.16144) has detailed the bitnet.cpp inference stack in depth.[^15][^26]
As of mid-2026, the publicly stated BitNet roadmap includes:
The BitNet team — Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Furu Wei, and a rotating set of collaborators at the University of the Chinese Academy of Sciences — remains active at Microsoft Research Asia, with Furu Wei continuing in his role as Distinguished Scientist and Chief Scientist of MSRA. Despite Microsoft layoffs affecting other parts of the research organization, the BitNet group has continued to publish through 2025 and into 2026.[^12]