BitNet

Updated Jul 16, 2026

BitNet is a family of large language model architectures developed by Microsoft Research Asia that constrain the weights of a transformer to extremely low bit-widths: initially a single bit ({-1, +1}) and later three values ({-1, 0, +1}, encoded in roughly 1.58 bits per weight). The first BitNet paper, "BitNet: Scaling 1-bit Transformers for Large Language Models," was posted to arXiv on October 17, 2023 by Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei.^[1] Subsequent papers in the series (BitNet b1.58, February 27, 2024; BitNet a4.8, November 7, 2024; the open-weight BitNet b1.58 2B4T, April 16, 2025; and BitNet v2, April 25, 2025) extended the idea to ternary weights, 4-bit activations, and Hadamard-smoothed quantization, while a companion inference framework, bitnet.cpp, was open-sourced in October 2024 to make these models practical on commodity CPUs.^[2]^[3]^[4]^[5]^[6]

BitNet's central claim is that a sufficiently large transformer can be trained from scratch using ternary weights and still match the perplexity and downstream-task accuracy of an FP16 baseline of the same size and token budget, while consuming a fraction of the memory and energy. At 3 billion parameters, the b1.58 paper reported parity with a full-precision LLaMA-style model on perplexity and zero-shot accuracy while using 3.55x less GPU memory and running 2.71x faster.^[7] In 2025, Microsoft released BitNet b1.58 2B4T, a 2-billion-parameter model trained natively at 1.58 bits on 4 trillion tokens, under the MIT License on Hugging Face, the first openly distributed, natively-trained 1.58-bit LLM at the 2B scale.^[4]^[8]

The work has drawn intense interest because it suggests a path to running tens-of-billions-parameter language models on a single CPU, but it has also attracted skepticism: as of mid-2026, only models up to ~2-3B parameters have been released natively-trained, no frontier lab has publicly committed to a 1-bit production model, and academic studies have begun questioning whether the FP16-parity claim survives at high training-token counts.^[9]^[10]

Background: the quantization landscape before BitNet

Reducing the precision of neural network weights is one of the oldest tools in deep-learning efficiency. By 2023, the dominant approach for large language models was post-training quantization: train a model in FP16 or BF16, then compress the trained weights to INT8 or INT4 for deployment using methods and libraries such as GPTQ, AWQ, or bitsandbytes. QLoRA, released in May 2023, showed that a model with frozen 4-bit base weights could even be fine-tuned through low-rank adapters with minimal quality loss.^[11]

These methods compress an already-trained model. Quantization-aware training (QAT) instead bakes the low-precision constraint into the training loop, so the model learns weights that survive quantization rather than being forced into a quantization grid after the fact. By 2023, QAT was well-established in the computer-vision community at 4 and 8 bits, but pushing transformers to 1-bit weights had a long history of instability; earlier work on binary neural networks (BNNs) such as XNOR-Net (2016) and BinaryConnect (2015) had largely failed to scale to language models, where the loss landscape is more sensitive than in convolutional vision networks.^[1]

Two parallel trends made 2023 a plausible moment to revisit 1-bit LLMs. First, the publication of the original "Attention Is All You Need" architecture in 2017 had been followed by a cluster of architectural refinements (RMSNorm, SwiGLU, rotary position embeddings (RoPE), and reduced-bias formulations such as those used in LLaMA) that produced more numerically stable training dynamics.^[4] Second, the practical pain of FP16/BF16 inference (multi-GPU serving, multi-hundred-GB checkpoint files, and the rapidly rising electricity costs of large-model deployment) made the prospect of a 16-bit-to-1-bit memory cut commercially attractive.

Against this backdrop, the BitNet team at Microsoft Research Asia, led by Furu Wei (Distinguished Scientist and Chief Scientist of Microsoft Research Asia), proposed BitNet.^[12]

Original BitNet (October 2023)

The first BitNet paper, posted to arXiv on October 17, 2023 as arXiv:2310.11453, introduced two ideas that would persist through all subsequent versions.^[1]

BitLinear. The architectural primitive replaces standard nn.Linear layers in the transformer with a "BitLinear" layer whose weights are stored in 1 bit (signed ±1) but whose activations remain in 8 bits. The forward pass is: (1) RMSNorm-style "SubLN" normalization of activations; (2) absmax quantization of activations to INT8 per token; (3) sign-based binarization of the 16-bit shadow weights to ±1; (4) matrix multiplication of the resulting INT8 × INT1 tensors; and (5) rescaling using stored absolute-mean factors.^[1]^[13] The result is that the dominant cost in the forward pass (large-matrix GEMM) becomes a sign-multiply-and-accumulate operation that, in principle, can be implemented as integer additions without any multiplication hardware.

Straight-through quantization-aware training. Like prior QAT work, BitNet keeps a high-precision shadow copy of each weight tensor in optimizer state. The forward pass quantizes the shadow weights to 1 bit; the backward pass uses a "straight-through estimator" to pass gradients through the non-differentiable sign operation, so the shadow weights continue to receive meaningful updates. This is what is meant when the BitNet papers say they are "trained from scratch" at 1-bit precision: the deployed model is 1-bit, but training itself happens in mixed precision with a 16-bit master copy.^[1]^[4]

The October 2023 paper trained models from 125M up to 3.9B parameters and reported that BitNet "exhibits a scaling law akin to full-precision Transformers," with the BitNet loss curve approaching the FP16 baseline as model size grew. Memory savings vs FP16 were reported as substantial but the paper explicitly framed the work as "in progress" and stopped short of claiming full perplexity parity.^[1] The paper noted authors affiliated with both Microsoft Research Asia and academic partners in China (including footnote markers indicating collaborators at the University of the Chinese Academy of Sciences and Tsinghua).^[14]

Reception of the original paper was muted: it was widely circulated in deep-learning Twitter and on Hacker News but did not lead to immediate adoption, in part because the strongest results required training models from scratch (there was no way to convert an existing FP16 LLaMA checkpoint), and because no production-scale 1-bit checkpoint was released.

BitNet b1.58 (February 2024): the ternary breakthrough

The follow-up paper, "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" (arXiv:2402.17764, February 27, 2024), is the work that turned BitNet into a phenomenon.^[2] Authored by Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei, it made two consequential changes.

Ternary weights. Rather than restricting weights to ±1, b1.58 allows three values: {-1, 0, +1}. Information-theoretically, three states require log₂(3) ≈ 1.58 bits of storage, giving the variant its name. The "0" state is critical: it gives the model an implicit sparsity mechanism (any weight that should be small enough to ignore becomes literally zero) without changing the simplicity of the matmul kernel, since multiplication by 0, +1, or -1 still reduces to addition and subtraction. Quantization uses an absmean scheme: weights are scaled by their absolute-mean and then rounded to the nearest of {-1, 0, +1}.^[2]^[13]
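The 1.58-bit figure and a practical packing density can be checked with a few lines of arithmetic. The sketch below is illustrative only: the base-3 packing of five weights per byte (3^5 = 243 fits in 8 bits) is an assumption chosen for clarity, not a description of the storage format used by any BitNet release, and `pack_trits` / `unpack_trits` are hypothetical helper names.

```python
import math

# Information content of one ternary weight, in bits.
BITS_PER_TRIT = math.log2(3)  # about 1.585

def pack_trits(trits):
    """Pack up to 5 ternary weights from {-1, 0, +1} into one byte (base-3 digits)."""
    assert len(trits) <= 5 and all(t in (-1, 0, 1) for t in trits)
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)  # map -1, 0, +1 to digits 0, 1, 2
    return value  # at most 3**5 - 1 = 242, so it fits in 8 bits

def unpack_trits(byte, count=5):
    """Inverse of pack_trits: recover `count` ternary weights from one byte."""
    trits = []
    for _ in range(count):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)
    return trits

weights = [-1, 0, 1, 1, -1]
packed = pack_trits(weights)
restored = unpack_trits(packed)
print(f"{BITS_PER_TRIT:.3f} bits per weight in theory, "
      f"{8 / 5:.1f} bits per weight with 5-per-byte packing")
```

A simpler 2-bit-per-weight layout wastes one of its four codes but is easier to decode with SIMD instructions, which is one reason a practical kernel might prefer it over the denser base-3 form.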

Performance parity claim. The headline result is that BitNet b1.58 matches the performance of an FP16 Transformer LLM "with the same model size and training tokens in terms of both perplexity and end-task performance," with the parity emerging at the 3B-parameter mark.^[2] The paper reports the following numbers against FP16 LLaMA-style baselines of matching size:

Model size	BitNet b1.58 PPL	FP16 LLaMA PPL	Memory reduction	Latency speedup
700M	12.87	12.33	2.60x	1.23x
1.3B	11.29	11.25	2.93x	1.67x
3B	9.91	10.04	3.55x	2.71x

At 3B the b1.58 model outperforms the FP16 baseline on perplexity. The paper also reports a 3.9B b1.58 model that, measured against the same 3B FP16 baseline, widens the perplexity gap further while still using 3.32x less memory and running 2.40x faster.^[7] At a hypothetical 70B scale, the paper projected an 8.9x throughput increase and the ability to handle batches 11x larger than a 70B FP16 LLaMA on the same hardware.^[7] The authors also estimated that on a 7nm process node, the elimination of FP16 multiplies in the matmul kernel saves 71.4x in arithmetic energy.^[7]

These numbers reframed extreme quantization as a viable training strategy rather than a deployment-time compromise: if you could train a model in 1.58 bits from scratch and get equal or better quality than its FP16 sibling, then the only reason not to do so was the lack of fast inference kernels, a problem the team would address with bitnet.cpp eight months later.

bitnet.cpp (October 2024): the inference engine

To translate the theoretical advantages into measurable wall-clock speedups, Microsoft released bitnet.cpp, an open-source C++ inference framework, on October 17, 2024, the one-year anniversary of the original BitNet paper.^[3]^[15] The project is hosted at github.com/microsoft/BitNet under the MIT License. It is built on the llama.cpp framework but adds custom kernels specialized for ternary operations.

Kernels. bitnet.cpp ships three quantization kernels selected by hardware target: I2_S (a 2-bit signed packed representation, supported on both x86 and ARM), TL1 (a 2-bit packed representation with lookup tables, ARM-only), and TL2 (an x86-only lookup-table kernel).^[3] The lookup-table approach is borrowed from Microsoft's T-MAC project and exploits the fact that with only three weight values there are very few possible matmul outcomes per small tile; these can be precomputed once and reused, replacing arithmetic with memory lookups.
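The lookup-table idea can be sketched in a few lines. This is a minimal illustration of the general technique, not the TL1/TL2 implementation: the group size `GROUP` and the function names `build_table` and `lut_matvec` are hypothetical, and a real kernel would index packed weights into flat SIMD-friendly tables rather than Python dictionaries.

```python
from itertools import product

GROUP = 2  # weights per lookup group (hypothetical; kept small for readability)

def build_table(acts):
    """Precompute the partial sum for every ternary pattern over one activation group."""
    return {pattern: sum(w * a for w, a in zip(pattern, acts))
            for pattern in product((-1, 0, 1), repeat=len(acts))}

def lut_matvec(weight_rows, activations):
    """Multiply a ternary matrix by an integer vector using table lookups only."""
    groups = [activations[i:i + GROUP] for i in range(0, len(activations), GROUP)]
    tables = [build_table(g) for g in groups]  # built once, shared by every row
    out = []
    for row in weight_rows:
        total = 0
        for g, table in enumerate(tables):
            pattern = tuple(row[g * GROUP:(g + 1) * GROUP])
            total += table[pattern]  # a lookup replaces the multiply-accumulate
        out.append(total)
    return out

W = [[1, -1, 0, 1],
     [0, 0, -1, -1]]
x = [3, -2, 5, 7]
naive = [sum(w * a for w, a in zip(row, x)) for row in W]
result = lut_matvec(W, x)
```

The tables depend only on the activations, so their cost is amortized across all rows of the weight matrix; with a group of g weights each table has 3^g entries.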

Reported speedups. On ARM CPUs, bitnet.cpp reports 1.37x to 5.07x speedup with 55.4% to 70.0% energy reduction relative to an FP16 baseline; on x86 CPUs, 2.37x to 6.17x speedup with 71.9% to 82.2% energy reduction.^[3] The most dramatic claim is that a 100-billion-parameter BitNet b1.58 model can run on a single CPU at 5-7 tokens per second. That is slow by frontier-cloud standards but roughly the speed of human reading, in a regime where no FP16 LLM of that size is practical at all on consumer hardware.^[3]
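A back-of-the-envelope calculation shows why a 100B ternary model is even a candidate for single-machine inference. The sketch counts weight storage only (no activations, KV cache, embeddings kept at higher precision, or per-tensor scales), and the helper name `weight_memory_gb` is hypothetical.

```python
import math

def weight_memory_gb(params, bits_per_weight):
    """Weights-only footprint in gigabytes (10**9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

PARAMS = 100e9  # the 100B model size quoted above

fp16_gb = weight_memory_gb(PARAMS, 16)
ideal_ternary_gb = weight_memory_gb(PARAMS, math.log2(3))  # information-theoretic floor
two_bit_gb = weight_memory_gb(PARAMS, 2)                   # simple 2-bit packing

print(f"FP16: {fp16_gb:.0f} GB, ideal ternary: {ideal_ternary_gb:.1f} GB, "
      f"2-bit packed: {two_bit_gb:.0f} GB")
```

Under these assumptions the ternary weights fit in the RAM of a well-equipped workstation, while the FP16 weights do not.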

Caveats. The 100B figure is for a synthetic model (Microsoft has not actually trained or released a 100B BitNet model as of mid-2026), and the speedup numbers are kernel-level, not end-to-end including tokenization, prompt processing, and memory access patterns. GPU support was officially introduced on May 20, 2025, after substantial community pressure, and NPU support remains roadmapped rather than shipped.^[3] A subsequent January 2026 update to bitnet.cpp added parallel kernel implementations and embedded quantization, claiming an additional 1.15x-2.1x speedup.^[3]^[16]

BitNet a4.8 (November 2024): sparsifying activations

By late 2024 the team had a 1.58-bit weight scheme that closed the FP16 gap and a CPU inference framework that materialized the speedups. The remaining inefficiency was that activations were still stored in 8 bits. Reducing them to 4 bits would, in principle, halve activation memory traffic at inference time, but activations contain "outlier channels" with much larger dynamic range than weights, and naive 4-bit quantization destroys these outliers.

BitNet a4.8 (arXiv:2411.04965, November 7, 2024) by Hongyu Wang, Shuming Ma, and Furu Wei addresses this with a hybrid quantization-and-sparsification strategy.^[5]^[17] Activations entering the attention and feed-forward network projections are quantized to INT4 (or FP4), where outlier impact is small. The intermediate activations (gated FFN states and attention scores) are instead sparsified (most channels are zeroed out) and the surviving non-zero channels kept at 8 bits. The result is that only about 55% of the model's parameters are active per token, and the projections with 4-bit inputs can use 4-bit activation kernels.^[5]^[17]
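The two treatments can be contrasted in a small sketch. Everything here is an assumption made for illustration: symmetric absmax scaling stands in for whatever quantizer the paper uses, the top-k rule and the `keep_ratio` of 0.5 are hypothetical, and the function names are not taken from any BitNet code.

```python
def quantize_absmax(values, bits):
    """Symmetric absmax quantization to signed integers of the given bit-width."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 127 for INT8
    scale = max(abs(v) for v in values) / qmax or 1.0
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, scale

def sparsify_topk(values, keep_ratio):
    """Zero all but the largest-magnitude entries, keeping roughly keep_ratio of them."""
    k = max(1, round(len(values) * keep_ratio))
    threshold = sorted((abs(v) for v in values), reverse=True)[k - 1]
    return [v if abs(v) >= threshold else 0.0 for v in values]

# Layer inputs: few outliers, so quantize straight to 4 bits.
layer_input = [0.1, -0.4, 0.7, 0.2]
input_q4, input_scale = quantize_absmax(layer_input, bits=4)

# Intermediate states: outlier-heavy, so sparsify first and keep survivors at 8 bits.
hidden = [0.05, -3.0, 0.2, 8.0, -0.1, 0.6]
hidden_sparse = sparsify_topk(hidden, keep_ratio=0.5)
hidden_q8, hidden_scale = quantize_absmax(hidden_sparse, bits=8)
```

In this toy input the outlier value 8.0 would consume almost the whole INT4 range on its own; sparsifying and keeping 8 bits preserves it while the zeroed channels cost nothing in the following matmul.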

a4.8 also introduced 3-bit KV cache compression, attacking the other major memory bottleneck of LLM inference. Combined, the paper reports approximately 2x speedup over BitNet b1.58 and a roughly 10x reduction in memory and 4x speedup vs full-precision LLaMA models, while maintaining "negligible differences" in downstream-task accuracy.^[17]

a4.8 was developed jointly by Microsoft Research and the University of the Chinese Academy of Sciences (UCAS), a collaboration that has continued through later versions of the BitNet series.^[17]

BitNet b1.58 2B4T (April 2025): the open-weight release

For nearly 14 months after the b1.58 paper, the headline benchmark numbers came from internal Microsoft training runs whose weights were not public. This changed on April 16, 2025, with the release of the BitNet b1.58 2B4T Technical Report (arXiv:2504.12285) and accompanying open-weight model on Hugging Face at microsoft/bitnet-b1.58-2B-4T.^[4]^[8]

Scale. The model has 2 billion parameters and was trained on 4 trillion tokens, about 2,000 tokens per parameter. That is roughly 100 times the ~20-tokens-per-parameter Chinchilla-optimal budget for this size and substantially above the 100B-token regime of the original b1.58 experiments. The model was trained from scratch in 1.58-bit weights, not quantized from a full-precision checkpoint.^[4]

Architecture. Transformer-based with BitLinear layers, RoPE positional embeddings, squared ReLU (ReLU²) activation in the FFN, SubLN normalization, and no bias terms in linear or normalization layers. Tokenizer is borrowed from LLaMA 3 with a 128,256-token vocabulary. Context length is 4,096 tokens, modest by 2025 standards.^[8]

Training recipe. Three stages: large-scale pre-training on 4T tokens (a mixture of DCLM, FineWeb-EDU, and synthetic mathematical data), supervised fine-tuning (SFT) on instruction datasets, and Direct Preference Optimization (DPO) using UltraFeedback and MagPie. Pre-training used a two-stage learning-rate schedule with a "cooldown" abrupt decay roughly midway through, and a similar two-stage weight-decay strategy (peaking at 0.1, then disabled).^[18]

Benchmark results. The release made head-to-head comparisons against Llama 3.2 1B, Gemma-3 1B, Qwen2.5 1.5B, and MiniCPM 2B:

Benchmark	BitNet b1.58 2B	Llama 3.2 1B	Qwen2.5 1.5B	Gemma-3 1B
ARC-Challenge	49.91	37.80	46.67	N/A
GSM8K	58.38	38.21	56.79	N/A
MMLU	53.17	45.58	60.25	N/A
HumanEval+	38.40	31.10	50.60	N/A
Average	54.19	44.90	55.23	N/A

BitNet wins on the reasoning benchmarks (ARC-Challenge, GSM8K) and trails on coding (HumanEval+) and broad knowledge (MMLU). Its aggregate average of 54.19 is slightly behind Qwen2.5 1.5B's 55.23, but at less than one-sixth of the non-embedding memory footprint (0.4 GB vs 2.6 GB).^[8]

Efficiency. Reported numbers from the model card:

Model	Non-embedding memory	CPU decode latency	Estimated energy/token
BitNet b1.58 2B4T	0.4 GB	29 ms	0.028 J
Llama 3.2 1B	2.0 GB	48 ms	0.258 J
Gemma-3 1B	1.4 GB	41 ms	0.186 J
Qwen2.5 1.5B	2.6 GB	65 ms	0.347 J
MiniCPM 2B	4.8 GB	124 ms	0.649 J

The model fits in 0.4 GB and runs at 29 ms/token on CPU. That is roughly 3.5x less memory than the next-smallest competitor (Gemma-3 1B at 1.4 GB), 5x less than Llama 3.2 1B, and more than an order of magnitude less estimated energy per token than MiniCPM 2B.^[8] Microsoft published three variants: the packed 1.58-bit deployment weights, a BF16 master-weights variant for further fine-tuning, and a GGUF-format variant for use with bitnet.cpp.^[8]
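The ratios implied by the efficiency table can be recomputed directly. The figures below are copied from the table above; the helper name `ratios_vs_bitnet` is hypothetical.

```python
# (non-embedding memory in GB, CPU decode latency in ms, estimated energy in J/token)
MODELS = {
    "BitNet b1.58 2B4T": (0.4, 29, 0.028),
    "Llama 3.2 1B": (2.0, 48, 0.258),
    "Gemma-3 1B": (1.4, 41, 0.186),
    "Qwen2.5 1.5B": (2.6, 65, 0.347),
    "MiniCPM 2B": (4.8, 124, 0.649),
}

def ratios_vs_bitnet(models, baseline="BitNet b1.58 2B4T"):
    """Return (memory, latency, energy) of each model as a multiple of the baseline."""
    base = models[baseline]
    return {name: tuple(round(v / b, 1) for v, b in zip(vals, base))
            for name, vals in models.items() if name != baseline}

ratios = ratios_vs_bitnet(MODELS)
for name, (mem, lat, energy) in ratios.items():
    print(f"{name}: {mem}x memory, {lat}x latency, {energy}x energy")
```

The memory and energy advantages are much larger than the latency advantage, which is consistent with decoding being limited by more than weight traffic alone.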

The model card carries an explicit warning: the efficiency advantages only materialize when running through bitnet.cpp. Running the same checkpoint through stock Hugging Face transformers is slower than running an FP16 model of equivalent quality, because the transformers library lacks the specialized ternary kernels and falls back on unpacking the weights to higher precision.^[8]^[19]

BitNet v2 (April 2025): Hadamard-smoothed 4-bit activations

Nine days after the 2B4T release, on April 25, 2025, Microsoft Research and UCAS posted BitNet v2 (arXiv:2504.18415, "Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs"), by Hongyu Wang, Shuming Ma, and Furu Wei.^[6]^[20]

v2 returns to the activation-outlier problem that motivated a4.8 but takes a different approach. Rather than sparsifying intermediate activations to dodge outliers, v2 smooths the activation distribution before quantization using an online Hadamard transform. This fast, structured orthogonal transform mixes the entries of an activation tensor so that any outlier channel is spread across many channels, converting a sharp, heavy-tailed distribution into a more Gaussian-like one that suits uniform 4-bit quantization.^[6]
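The smoothing effect is easy to see on a toy vector. The sketch below implements a generic orthonormal fast Walsh-Hadamard transform and compares the peak-to-RMS ratio before and after; it is not the H-BitLinear implementation, and the example vector and the `peak_to_rms` metric are assumptions chosen for illustration.

```python
import math

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    out = list(values)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]

def peak_to_rms(values):
    """Largest magnitude divided by root-mean-square; high values signal outliers."""
    rms = math.sqrt(sum(v * v for v in values) / len(values))
    return max(abs(v) for v in values) / rms

activations = [0.1, -0.2, 0.1, 8.0, 0.0, 0.1, -0.1, 0.2]  # one outlier channel
smoothed = fwht(activations)
print(f"peak/RMS before: {peak_to_rms(activations):.2f}, after: {peak_to_rms(smoothed):.2f}")
```

Because the transform is orthogonal, it preserves the vector's energy and can be undone exactly (applying `fwht` twice returns the input), so the information is redistributed rather than lost.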

The architectural primitive, H-BitLinear, is a drop-in replacement for the output projection in attention and the down-projection in FFN layers (the two locations where activation outliers are concentrated). The model is trained first with INT8 activations matching b1.58, then continue-trained with INT4 activations for all linear layers except input/output embeddings. The reported result is that BitNet v2(a4) "maintains comparable performance to the 8-bit version while significantly boosting efficiency in batched inference scenarios," closing most of the gap between a4.8 and b1.58 without needing the sparsification machinery.^[6]

A v2 revision was posted on June 13, 2025.^[6]

Architecture and BitLinear in detail

The BitLinear layer is the load-bearing piece of the BitNet family. In the b1.58 formulation, the forward pass for an input activation tensor x and shadow weight matrix W is:

1. Normalize activations via SubLN: a variant of LayerNorm placed before the linear transform, which empirically stabilizes low-bit training.^[13]
2. Absmax-quantize activations to INT8 per token: x_q = round(x / absmax(x) * 127). This gives a per-token scale factor that floats with the dynamic range of each token.^[13]
3. Absmean-quantize shadow weights to ternary: W_q = round(W / mean(|W|)) clipped to {-1, 0, +1}, with a scalar scale α = mean(|W|). The absmean choice is critical: using max or absmax instead introduces too much rounding error at the bulk of the weight distribution.^[13]
4. GEMM the INT8 activations against the ternary weights. Because every weight is in {-1, 0, +1}, this reduces to addition (for +1), subtraction (for -1), and skipping (for 0).
5. Rescale the integer result by α and the per-token activation scale to recover the FP16-equivalent output.
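The quantize, multiply, and rescale steps above can be sketched for a single token in plain Python. This is a minimal illustration under stated assumptions: the SubLN normalization step is omitted, the weight matrix and input are arbitrary toy values, and the function names are hypothetical rather than taken from any BitNet codebase.

```python
def absmax_quantize(x, qmax=127):
    """Per-token INT8 quantization; returns the integers and the scale that undoes it."""
    scale = max(abs(v) for v in x) / qmax or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) for v in x], scale

def absmean_ternarize(W):
    """Quantize a weight matrix to {-1, 0, +1} with one scalar scale alpha."""
    flat = [abs(w) for row in W for w in row]
    alpha = sum(flat) / len(flat) or 1.0
    Wq = [[max(-1, min(1, round(w / alpha))) for w in row] for row in W]
    return Wq, alpha

def bitlinear_forward(x, W):
    """Compute an approximation of W @ x using only integer add/subtract in the inner loop."""
    xq, x_scale = absmax_quantize(x)
    Wq, alpha = absmean_ternarize(W)
    out = []
    for row in Wq:
        acc = 0
        for w, a in zip(row, xq):
            if w == 1:
                acc += a   # weight +1: add the activation
            elif w == -1:
                acc -= a   # weight -1: subtract it; weight 0 is skipped
        out.append(acc * alpha * x_scale)  # rescale back to floating point
    return out

W = [[0.9, -0.05, -1.2],
     [0.02, 0.6, 0.4]]
x = [0.5, -2.0, 1.5]
y = bitlinear_forward(x, W)
```

With untrained toy weights the output differs noticeably from the full-precision product; the point of quantization-aware training is that the network learns weights for which this approximation is good.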

The model carries no bias terms in either the linear layers or the normalization, reducing parameter count and simplifying the kernel design. RoPE provides positional information without learned position embeddings, and squared-ReLU activations in the FFN (used by the 2B4T release) replace SwiGLU; empirically, squared-ReLU is more amenable to low-bit activation than the gated SwiGLU variant.^[4]^[8]

Quantization-aware training

All BitNet variants are trained with quantization-aware training (QAT) rather than post-training quantization. During training:

A high-precision shadow copy of each weight tensor (typically BF16) lives in optimizer state.
The forward pass quantizes the shadow weights to ternary on every step.
The backward pass uses a straight-through estimator (STE) to propagate gradients through the non-differentiable quantization function, treating the quantization as the identity during gradient flow.
The optimizer updates the BF16 shadow weights based on these gradients.
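One training step of this loop can be written out by hand for a single linear unit. The sketch is an illustration of the straight-through idea under simplifying assumptions: a squared-error loss, plain gradient descent instead of a real optimizer, and hypothetical names `ternarize` and `ste_step`.

```python
def ternarize(shadow):
    """Absmean quantization of a weight vector to {-1, 0, +1} plus its scale."""
    alpha = sum(abs(w) for w in shadow) / len(shadow) or 1.0
    return [max(-1, min(1, round(w / alpha))) for w in shadow], alpha

def ste_step(shadow, x, target, lr=0.1):
    """One forward/backward/update step; returns new shadow weights and the loss."""
    q, alpha = ternarize(shadow)                      # forward: quantize on every step
    y = alpha * sum(qi * xi for qi, xi in zip(q, x))  # prediction uses ternary weights
    error = y - target
    # Backward: d(loss)/d(effective weight_i) = 2 * error * x_i. The straight-through
    # estimator treats the quantizer as the identity, so this same gradient is
    # applied to the high-precision shadow weight.
    grads = [2 * error * xi for xi in x]
    new_shadow = [w - lr * g for w, g in zip(shadow, grads)]
    return new_shadow, error ** 2

shadow = [0.4, -0.1, -0.7]
x, target = [1.0, 2.0, 1.0], 1.0
shadow_1, loss_0 = ste_step(shadow, x, target)
shadow_2, loss_1 = ste_step(shadow_1, x, target)
print(ternarize(shadow)[0], "->", ternarize(shadow_1)[0], f"loss {loss_0:.3f} -> {loss_1:.4f}")
```

In this toy run the middle shadow weight moves from -0.1 to about 0.3, which is enough to flip its ternary value from 0 to +1. Small updates accumulate in the shadow copy until they cross a rounding boundary, which is why the high-precision copy is needed during training.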

This is the source of one of BitNet's most-discussed limitations: training is not cheaper than training an FP16 model of equivalent size, because the shadow weights, gradients, and optimizer state all remain in high precision. The cost savings are exclusively at inference time. BitNet 2B4T's training run on 4T tokens consumed compute roughly equivalent to training a BF16 model of the same size and token budget.^[9]

Hugging Face researchers in 2024 demonstrated an alternative pathway: converting an existing FP16 model to 1.58-bit via fine-tuning with a gradually-ramped quantization "lambda" schedule. They fine-tuned Llama 3 8B to 1.58 bits using a linear scheduler that warmed up over 1000-2000 steps and a 1e-4 learning rate. After 10 billion tokens of fine-tuning, the resulting 1.58-bit Llama 3 8B variant achieved better (lower) perplexity than Microsoft's own 7B BitNet trained on 100B tokens.^[21] However, the converted model still lagged the original FP16 Llama 3 8B on aggregate metrics, suggesting QAT-from-scratch and conversion-via-fine-tuning produce qualitatively different trade-offs.^[21]
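A gradual ramp of this kind can be sketched as a linear interpolation between the full-precision weights and their quantized counterparts. The interpolation form, the 1000-step default, and the names `lambda_at` and `blended_weights` are assumptions made for illustration; the exact formulation in the Hugging Face experiment may differ.

```python
def lambda_at(step, warmup_steps=1000):
    """Linear ramp from 0 (full precision) to 1 (fully quantized)."""
    return min(step / warmup_steps, 1.0)

def ternarize(weights):
    """Absmean quantization to {-1, 0, +1} plus its scale."""
    alpha = sum(abs(w) for w in weights) / len(weights) or 1.0
    return [max(-1, min(1, round(w / alpha))) for w in weights], alpha

def blended_weights(weights, step, warmup_steps=1000):
    """Effective weights used in the forward pass at a given training step."""
    lam = lambda_at(step, warmup_steps)
    q, alpha = ternarize(weights)
    # lam = 0 returns the original weights; lam = 1 returns the dequantized ternary ones.
    return [w + lam * (alpha * qi - w) for w, qi in zip(weights, q)]

weights = [0.4, -0.1, -0.7]
start = blended_weights(weights, step=0)
halfway = blended_weights(weights, step=500)
end = blended_weights(weights, step=2000)
```

Ramping the constraint in gradually lets a pretrained model adapt to quantization noise instead of being hit with the full ternary constraint at the first step.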

Performance benchmarks: holds up or doesn't?

The strongest claim from the BitNet series, perplexity parity with FP16 at matched parameter count, holds at the roughly 3-billion-parameter scale that Microsoft has published; at 700M and 1.3B the b1.58 models still trail slightly. The b1.58 paper showed parity emerging at 3B (BitNet 9.91 vs LLaMA-3B FP16 10.04 perplexity), and the 2B4T release shows the 2B model trailing Qwen2.5 1.5B by less than 1.1 points on aggregate benchmarks while using less than one-sixth of the memory.^[2]^[7]^[8]

The claim becomes more contested at larger scales and higher token budgets:

Scaling beyond 3B has not been openly demonstrated. No native-trained BitNet model larger than 2B has been released by Microsoft. The 70B numbers in the original b1.58 paper are projections from smaller-model scaling rather than measurements of a trained 70B BitNet.^[2]^[9]
The undertraining hypothesis. Ouyang et al. (2024) and Kumar et al. (2024) have argued that "low-bit weights only favor undertrained models": the perceived parity of BitNet at 100B tokens disappears once a comparable FP16 model is given a chinchilla-optimal or token-saturated training budget. Under this view, the BitNet scaling curve plateaus earlier than the FP16 curve.^[10]
bitnet.cpp is required to realize the speedups. The 0.4 GB / 29 ms latency figures depend entirely on the specialized C++ kernels. Standard PyTorch / Hugging Face inference unpacks the ternary weights back into a higher-precision representation and is not faster than running an FP16 model of equivalent quality.^[8]^[19]
Coding and broad-knowledge benchmarks remain weaker. BitNet 2B4T trails Qwen2.5 1.5B by ~12 points on HumanEval+ and ~7 points on MMLU, suggesting the weight constraint still costs the model accuracy on tasks with limited training signal or high-precision symbolic structure.^[8]

The community verdict as of mid-2026 is that BitNet has produced a real, replicable efficiency improvement at the 2B scale for tasks that don't require precise long-tail knowledge, but the FP16-parity claim should not yet be extrapolated to frontier-scale models.

Hardware implications

The BitNet thesis has a structural implication: if the dominant cost of LLM inference is FP16 matrix multiplication and that cost can be replaced with ternary integer addition, then the GPU is no longer the natural inference substrate. Several lines of follow-on hardware work have emerged:

Specialized FPGA accelerators. Academic groups have produced two notable designs: TerEffic (arXiv:2502.16473, February 2025), an FPGA architecture with custom datapaths for ternary arithmetic, and TeLLMe v2 (arXiv:2510.15926, late 2025), an end-to-end ternary LLM prefill-and-decode accelerator using table-lookup matrix multiplication on edge FPGAs.^[22]^[23] Both demonstrate that for ternary models the cost-per-token of FPGA inference can be lower than GPU inference at small batch sizes.

Lookup-table kernels on commodity CPUs. bitnet.cpp's I2_S / TL1 / TL2 kernels demonstrate that the throughput advantage is already accessible without new silicon; ARM and x86 cores with SIMD instructions can run lookup-based ternary GEMM kernels at the reported 2x-6x speedups.^[3]^[16]

Native NPU and GPU support. Microsoft has acknowledged that an ASIC or NPU specifically designed for ternary inference could deliver another order-of-magnitude improvement, but as of the May 2025 update, bitnet.cpp's GPU path is a CUDA-based implementation that does not yet exploit GPU tensor-core hardware specifically tuned for ternary arithmetic, and NPU support is roadmapped but unshipped.^[3]^[15]

No major commercial chipmaker (TSMC customer or otherwise) has announced a production-grade ternary inference accelerator as of mid-2026, despite the apparent commercial pull. This is part of what has fueled the skeptical Hacker News observation that if 1-bit inference were really 10x cheaper, an NVIDIA competitor would have shipped silicon by now.^[24]

Reception and limitations

BitNet's reception has been bifurcated. Within the research and open-source community, the b1.58 paper and the 2B4T release have been treated as a significant proof-of-concept and have inspired the Hugging Face Llama 3 1.58-bit conversion experiment, multiple FPGA implementations, and a wave of academic follow-ups extending the technique to specific domains.^[21]^[22] BitNet b1.58 2B4T was reported by InfoQ, TechRepublic, VentureBeat, MarkTechPost, and others as a meaningful step toward democratized on-device AI.^[25]^[17]

At the same time, frontier-lab adoption has not materialized. Hacker News discussions of the bitnet.cpp release noted that "if ternary weights scaled gracefully to 100B, [you would expect Microsoft] to have proven it by now rather than listing it as a research direction," and that all major commercial LLM serving stacks remained on FP16 / BF16 / FP8 in 2025-2026.^[24] Several plausible explanations have been offered:

Training cost is unchanged. BitNet's efficiency gains accrue at inference time, but pre-training a 70B+ BitNet costs roughly as much as a 70B+ FP16 model. For labs whose biggest cost is training-side, the incentive is weaker.^[24]
Hidden brittleness. No native-trained BitNet model has been published above 3B parameters, leaving open the possibility that ternary training diverges or degrades at the 30B-70B scale relevant to commercial systems.^[9]
No conversion path. Existing FP16 checkpoints cannot be converted to high-quality BitNet weights without expensive re-training, removing the lowest-friction adoption path.^[21]^[24]
Application brittleness. Microsoft's own model card warns that BitNet b1.58 2B4T is intended for research, not production use, that it has "elevated defect rate on election-critical queries," that multilingual performance is limited, and that performance gains require the bitnet.cpp inference path.^[8]

Notable additional research from the same team has continued to expand the BitNet program. Sparse-BitNet (2025) explored semi-structured sparsity in 1.58-bit weights; BitDistill explored distilling FP16 teacher models into BitNet students; and the "1-bit AI Infra" series (e.g., arXiv:2410.16144) has detailed the bitnet.cpp inference stack in depth.^[15]^[26]

Roadmap

As of mid-2026, the publicly stated BitNet roadmap includes:

NPU support in bitnet.cpp: announced as upcoming but not yet shipped.^[3]
Scaling native-trained BitNet beyond 2B parameters: the 2B4T technical report explicitly lists "investigating the scaling properties of native 1-bit LLMs" as future work.^[4]^[9]
Improved mathematical and code-generation performance: Microsoft has acknowledged the persistent gap on HumanEval and reasoning-heavy benchmarks.^[4]^[25]
Multilingual capability: current models are predominantly English-trained.^[8]
Specialized inference hardware: the BitNet program has been explicit since 2024 that custom silicon for ternary inference is a natural extension, though no Microsoft chip announcement has followed.^[3]^[7]

The BitNet team, including Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Furu Wei, and a rotating set of collaborators at the University of the Chinese Academy of Sciences, remains active at Microsoft Research Asia, with Furu Wei continuing in his role as Distinguished Scientist and Chief Scientist of MSRA. Despite Microsoft layoffs affecting other parts of the research organization, the BitNet group has continued to publish through 2025 and into 2026.^[12]

References

1. Wang, H.; Ma, S.; Dong, L.; Huang, S.; Wang, H.; Ma, L.; Yang, F.; Wang, R.; Wu, Y.; Wei, F. (October 17, 2023). "BitNet: Scaling 1-bit Transformers for Large Language Models." arXiv:2310.11453. https://arxiv.org/abs/2310.11453
2. Ma, S.; Wang, H.; Ma, L.; Wang, L.; Wang, W.; Huang, S.; Dong, L.; Wang, R.; Xue, J.; Wei, F. (February 27, 2024). "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits." arXiv:2402.17764. https://arxiv.org/abs/2402.17764
3. Microsoft. "BitNet: Official inference framework for 1-bit LLMs." GitHub repository. https://github.com/microsoft/BitNet
4. Ma, S.; Wang, H.; Huang, S.; Zhang, X.; Hu, Y.; Song, T.; Xia, Y.; Wei, F. (April 16, 2025). "BitNet b1.58 2B4T Technical Report." arXiv:2504.12285. https://arxiv.org/abs/2504.12285
5. Wang, H.; Ma, S.; Wei, F. (November 7, 2024). "BitNet a4.8: 4-bit Activations for 1-bit LLMs." arXiv:2411.04965. https://arxiv.org/abs/2411.04965
6. Wang, H.; Ma, S.; Wei, F. (April 25, 2025; revised June 13, 2025). "BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs." arXiv:2504.18415. https://arxiv.org/abs/2504.18415
7. BitNet b1.58 paper, Table 1: memory, latency, perplexity numbers at 700M / 1.3B / 3B / 3.9B parameters; 71.4x arithmetic energy savings claim at 7nm; 70B throughput projection. https://arxiv.org/html/2402.17764v1
8. Microsoft. "microsoft/bitnet-b1.58-2B-4T." Hugging Face model card. https://huggingface.co/microsoft/bitnet-b1.58-2B-4T
9. TechRepublic (April 2025). "Microsoft Releases Largest 1-Bit LLM, Letting Powerful AI Run on Some Older Hardware." https://www.techrepublic.com/article/news-microsoft-bitnet-small-ai-model/ ; analysis discussed scale limitations and 2B as current ceiling for native-trained models
10. Wikipedia. "1.58-bit large language model," citing Ouyang et al. (2024) and Kumar et al. (2024) findings on undertrained-model bias of low-bit weights. https://en.wikipedia.org/wiki/1.58-bit_large_language_model
11. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. (May 2023). "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314. https://arxiv.org/abs/2305.14314
12. Wei, F. "Furu Wei's Homepage." https://thegenerality.com/
13. Hugging Face Transformers documentation, BitNet entry: BitLinear architecture, absmean weight quantization, absmax activation quantization, SubLN normalization. https://huggingface.co/docs/transformers/model_doc/bitnet
14. Microsoft Research publication page for original BitNet paper (lab designation: Microsoft Research Lab - Asia). https://www.microsoft.com/en-us/research/publication/bitnet-scaling-1-bit-transformers-for-large-language-models/
15. Wang, J.; et al. (October 2024). "1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs." arXiv:2410.16144. https://www.microsoft.com/en-us/research/publication/1-bit-ai-infra-part-1-1-fast-and-lossless-bitnet-b1-58-inference-on-cpus/
16. bitnet.cpp January 2026 optimization update: additional 1.15x-2.1x speedup via parallel kernels and embedded quantization (per GitHub README and ai-minor.com coverage). https://github.com/microsoft/BitNet
17. MarkTechPost (November 10, 2024). "This AI Paper Introduces BitNet a4.8: A Highly Efficient and Accurate 4-bit LLM." https://www.marktechpost.com/2024/11/10/this-ai-paper-introduces-bitnet-a4-8-a-highly-efficient-and-accurate-4-bit-llm/
18. BitNet b1.58 2B4T Technical Report: two-stage learning rate schedule, weight decay schedule, dataset mixture (DCLM, FineWeb-EDU, synthetic math), SFT and DPO post-training using UltraFeedback and MagPie. https://arxiv.org/html/2504.12285v1
19. Hugging Face Transformers documentation: confirms standard transformers library lacks specialized BitNet kernels and runs slower than dedicated bitnet.cpp implementation. https://huggingface.co/docs/transformers/model_doc/bitnet
20. BitNet v2 paper homepage. https://ustcwhy.github.io/publications/bitnet_v2/
21. Hugging Face Blog (2024). "1.58 Bit LLM: Fine-tuning Llama3 to 1.58 bits via gradual quantization with lambda scheduler." https://huggingface.co/blog/1_58_llm_extreme_quantization
22. TerEffic FPGA accelerator for ternary LLM inference. arXiv:2502.16473. https://arxiv.org/html/2502.16473v2
23. TeLLMe v2: End-to-end ternary LLM accelerator with table-lookup matmul on edge FPGAs. arXiv:2510.15926. https://arxiv.org/pdf/2510.15926
24. Hacker News discussion of Microsoft BitNet inference framework release (October 2024). https://news.ycombinator.com/item?id=41877609
25. InfoQ (April 23, 2025). "Microsoft Native 1-Bit LLM Could Bring Efficient genAI to Everyday CPUs." https://www.infoq.com/news/2025/04/microsoft-bitnet-1bit-llm/
26. Microsoft Research. "Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity." https://www.microsoft.com/en-us/research/publication/sparse-bitnet-1-58-bit-llms-are-naturally-friendly-to-semi-structured-sparsity/
