# Hugging Face Transformers

> Source: https://aiwiki.ai/wiki/transformers_library
> Updated: 2026-06-21
> Categories: Developer Tools, Open Source AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

> **Note:** This article is about the open-source Python library by Hugging Face. For the neural network architecture introduced in the 2017 paper "Attention Is All You Need," see [Transformer (architecture)](/wiki/transformers).

**Hugging Face Transformers** is an open-source Python library that provides general-purpose architectures, a unified API, and pretrained weights for state-of-the-art machine learning models across text, vision, audio, and multimodal tasks, installable with a single `pip install transformers` command.[3] In the words of its own EMNLP 2020 paper, "Transformers is an open-source library with the goal of opening up these advances to the wider machine learning community."[1] It started life as a [PyTorch](/wiki/pytorch) port of Google's [BERT](/wiki/bert) reference code, and has since become the de facto standard interface for natural language understanding, text generation, computer vision, audio processing, and multimodal modeling.[16] The library covers more than 200 model families, including BERT, GPT, T5, BART, ViT, CLIP, Whisper, [LLaMA](/wiki/llama), Mistral, Mixtral, Gemma, Phi, Qwen, and DeepSeek, and (through version 4.x) worked with PyTorch, [TensorFlow](/wiki/tensorflow), and [JAX](/wiki/jax).[3] It is developed and maintained by [Hugging Face](/wiki/hugging_face), an open-source AI company headquartered in New York and Paris.[15]

Transformers is licensed under Apache 2.0 and tightly integrated with the Hugging Face Hub, which hosts more than 2 million public model checkpoints and over 500,000 public datasets as of 2026.[6][11] The repository carries roughly 162,000 GitHub stars and about 33,600 forks as of June 2026, making it one of the most starred machine learning projects on GitHub.[7] The associated EMNLP demo paper, Wolf et al. 2020, has been cited tens of thousands of times and is one of the most cited software papers in modern NLP.[1] The official GitHub tagline now describes it as "the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training."[7]

## What is the Hugging Face Transformers library?

In practical terms, `transformers` is a single Python package (`pip install transformers`) that exposes:

- A unified Python API for loading pretrained models, tokenizers, image processors, and feature extractors from the Hub.
- Standardized model classes that mirror the original research implementations as closely as possible.
- A high-level `pipeline` API for one-line inference on common tasks.
- A `Trainer` class for fine-tuning with mixed precision, gradient accumulation, and distributed training.
- A `generate` API for text generation, supporting greedy decoding, beam search, sampling, contrastive search, and speculative decoding.
- Hooks for quantization, parameter-efficient fine-tuning, hardware acceleration, and serving.

The library is, in the words of its own documentation, the "model-definition framework" for the broader ecosystem: if a model is supported in `transformers`, it tends to be compatible with downstream training frameworks (Axolotl, Unsloth, DeepSpeed, FSDP, [PyTorch Lightning](/wiki/pytorch_lightning)) and inference engines (vLLM, SGLang, TGI, [llama.cpp](/wiki/llama_cpp), MLX) that build on top of those definitions.[3] Hugging Face frames the v5 release the same way: "Transformers, at the core, remains a model architecture toolkit," and "the backbone of hundreds of thousands of projects."[18]

## When was the Transformers library released? History and timeline

The library predates the Hugging Face Hub and even predates Hugging Face's pivot from a chatbot company to an ML platform. It began in late 2018 when Thomas Wolf and a small team ported Google's TensorFlow BERT code to PyTorch.[15] The package went through two renames before settling on `transformers` in late 2019.[15]

| Year | Milestone |
|------|-----------|
| 2016 | Hugging Face founded by [Clément Delangue](/wiki/clement_delangue), Julien Chaumond, and Thomas Wolf in New York City as a chatbot startup for teenagers. |
| Nov 2018 | Hugging Face releases `pytorch-pretrained-bert`, a PyTorch port of Google's BERT, on PyPI. |
| Feb 2019 | Library expands to OpenAI [GPT](/wiki/gpt), [GPT-2](/wiki/gpt-2), and Transformer-XL after PyTorch reimplementations of GPT-2 small. |
| Jul 2019 | Library renamed to `pytorch-transformers` (v1.0) to reflect its broader scope; adds XLNet, XLM, RoBERTa, DistilBERT. |
| Sep 2019 | Renamed again to `transformers` (v2.0); TensorFlow 2.0 support added so models can be loaded interchangeably between frameworks. |
| Oct 2019 | Wolf et al. publish the technical report "HuggingFace's Transformers: State-of-the-art Natural Language Processing" on arXiv (1910.03771).[2] |
| 2020 | Wolf et al. paper accepted to EMNLP 2020 (System Demonstrations), pp. 38-45, formally introducing the library to the research community. Pipelines API gains traction.[1] |
| Sep 2020 | v3.0 release; ONNX export, Trainer class, and improved tokenizers integration. |
| Nov 2020 | v4.0 release; deeper Hub integration, model sharing, and a stable API. |
| 2021 | Acquires Gradio (December 2021) to provide easy demo hosting through Spaces. |
| 2022 | Adds vision models (ViT, DETR, Swin) and audio models (Wav2Vec2, Whisper); BigScience releases BLOOM through the Hub. |
| Feb 2023 | Native support for [LLaMA](/wiki/llama) lands shortly after Meta's release; Mistral, Falcon, MPT, and other open LLMs follow throughout the year. |
| Aug 2023 | Hugging Face raises $235M Series D at a $4.5B valuation from investors including Google, Amazon, NVIDIA, Intel, Salesforce, and AMD.[14] |
| 2024 | Multimodal pipelines, agentic features (smolagents), assisted decoding, and integration with Inference Providers. Acquires Argilla in June 2024 ($10M) and XetHub later in 2024. |
| Apr 2025 | Acquires Pollen Robotics, the maker of the open-source Reachy 2 humanoid robot.[12] |
| Dec 2025 | Transformers v5.0.0rc-0 released (December 1, 2025): PyTorch-only backend, quantization promoted to a first-class feature, and the start of the sunset of TensorFlow and Flax support.[18] |
| Jun 2026 | The v5 line continues at a fast cadence, reaching v5.12.1 on June 15, 2026; the Hub has crossed 2 million public models, 500,000 public datasets, and 13 million users.[11][17] |

The November 2018 release of `pytorch-pretrained-bert` matters because it made BERT, which had only just been published by Google, usable in PyTorch with a few lines of Python.[16][17] That single design choice (mirror the official architecture, ship pretrained weights, and let people fine-tune in their own training loop) is the pattern the library has followed ever since.

## Architecture and core concepts

The library is organized around a small set of abstractions that get reused across every model.

### Three-class pattern

Every model in `transformers` is implemented with three main classes:

1. A **configuration** class (e.g., `BertConfig`) that holds all hyperparameters as plain Python attributes.
2. A **model** class (e.g., `BertModel`, `BertForSequenceClassification`) implementing the forward pass.
3. A **preprocessor** class such as a tokenizer, image processor, or feature extractor.

The philosophy is intentionally light on abstraction. Each model file is meant to be readable on its own without chasing class hierarchies. Hugging Face calls this the "single model file" policy.[3]

### AutoClasses

Most users do not instantiate model classes directly. Instead they use the Auto family, which inspects a checkpoint's config and picks the right class:

- `AutoConfig` loads the configuration.
- `AutoTokenizer` returns the correct tokenizer (fast Rust-backed when available, slow Python otherwise).
- `AutoModel` returns the base model.
- `AutoModelForCausalLM`, `AutoModelForSequenceClassification`, `AutoModelForQuestionAnswering`, `AutoModelForSeq2SeqLM`, etc., return models with the right task head.
- `AutoImageProcessor` and `AutoFeatureExtractor` handle vision and audio inputs.

This is what lets the same five lines of code load BERT, [RoBERTa](/wiki/roberta), DeBERTa, [DistilBERT](/wiki/distilbert), or any compatible checkpoint by changing only the model name.[3]

### Pipelines

The `pipeline` factory wraps preprocessing, the model, and postprocessing into one callable. It is the easiest entry point for prototyping and accounts for many of the library's tutorials.[4] Tasks supported include:

| Modality | Pipeline tasks |
|----------|---------------|
| Text | `text-classification` (alias `sentiment-analysis`), `token-classification` (alias `ner`), `question-answering`, `table-question-answering`, `fill-mask`, `text-generation`, `text2text-generation`, `summarization`, `translation`, `zero-shot-classification`, `feature-extraction` |
| Vision | `image-classification`, `image-segmentation`, `object-detection`, `depth-estimation`, `mask-generation`, `keypoint-matching`, `image-feature-extraction`, `zero-shot-image-classification`, `zero-shot-object-detection`, `video-classification` |
| Audio | `automatic-speech-recognition`, `audio-classification`, `text-to-audio` (alias `text-to-speech`), `zero-shot-audio-classification` |
| Multimodal | `image-text-to-text`, `document-question-answering`, `visual-question-answering` |

A one-liner like `pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")` will download Whisper, set up its feature extractor, and return a callable that transcribes audio files.[4]

### Trainer

`Trainer` is the library's training loop. It handles:

- Mixed precision (fp16, bf16, fp8 where supported).
- Gradient accumulation and gradient checkpointing.
- Distributed training via `accelerate`, including DDP, FSDP, and DeepSpeed ZeRO.
- `torch.compile` integration.
- Hyperparameter search backends (Optuna, Ray Tune, SigOpt, Weights & Biases sweeps).
- Logging to TensorBoard, W&B, MLflow, Comet, Neptune, and Hugging Face Hub.
- Automatic model card generation and Hub upload on save.

For reinforcement learning from human feedback or preference optimization, users typically reach for `trl` (Transformer Reinforcement Learning) on top of `Trainer`.[10]

### generate API

The `model.generate()` method is the main interface for text generation in causal LMs and sequence-to-sequence models. It supports greedy decoding, beam search, sampling with temperature, top-k and top-p (nucleus), contrastive search, diverse beam search, group beam search, and speculative or assisted decoding (where a smaller draft model proposes tokens that a larger model accepts or rejects).[3] Streaming output is supported through `TextStreamer` and `TextIteratorStreamer`.

Generation also includes a configurable KV cache, RoPE scaling for extended context windows, repetition penalties, logit processors, and stopping criteria. Tool calling and structured output, including JSON-grammar constrained decoding, are supported on compatible models.[3]

### Model coverage

As of v5.x the library ships definitions for more than 200 architectures.[3] The roster spans encoder-only models (BERT, RoBERTa, DeBERTa, DistilBERT, [ALBERT](/wiki/albert), ELECTRA), decoder-only models ([GPT-2](/wiki/gpt2), [GPT-J](/wiki/gpt_j), GPT-NeoX, OPT, BLOOM, LLaMA 1-4, Mistral, Mixtral, Gemma, Phi, Qwen, DeepSeek, Falcon, MPT), encoder-decoder models (T5, mT5, BART, mBART, Pegasus, MarianMT, NLLB), vision models (ViT, DeiT, Swin, ConvNeXt, DINOv2, BEiT, MAE, DETR, Mask2Former), audio models (Wav2Vec2, HuBERT, Whisper, MMS, SeamlessM4T, MusicGen), and multimodal models (CLIP, BLIP, BLIP-2, LLaVA, IDEFICS, Pix2Struct, Donut, Kosmos-2, PaliGemma, Qwen-VL, Llama 3.2 Vision).

## The Hugging Face ecosystem

Transformers does not stand alone. The library ships with hooks into roughly a dozen sister packages, most of them maintained by Hugging Face itself:

| Library | Purpose |
|---------|---------|
| `transformers` | Model architectures, tokenizers, training, generation, pipelines. |
| `tokenizers` | Fast Rust-backed tokenizers (BPE, WordPiece, Unigram, byte-level). |
| [`datasets`](/wiki/datasets) | Streaming-capable dataset loading; over 500,000 datasets on the Hub. |
| `accelerate` | Hardware-agnostic distributed training and inference; abstracts CUDA, ROCm, TPU, and Apple Silicon. |
| [`peft`](/wiki/peft) | Parameter-efficient fine-tuning (LoRA, QLoRA, IA3, prefix tuning, prompt tuning, adapters). |
| `trl` | RLHF and preference optimization (SFT, PPO, DPO, KTO, ORPO, GRPO). |
| `diffusers` | Diffusion models (Stable Diffusion, SDXL, SD 3, Flux, video diffusion). |
| `optimum` | Hardware-specific optimization backends: ONNX Runtime, TensorRT, OpenVINO, Habana Gaudi, AWS Neuron, Apple Neural Engine. |
| `evaluate` | Standardized evaluation metrics. |
| `safetensors` | Safe, fast tensor serialization format that has largely replaced `pickle`-based PyTorch checkpoints. |
| `huggingface_hub` | Programmatic Hub access, file downloads, repo management, Inference API client. |
| `gradio` | UI library for building demo apps; powers most Hugging Face Spaces. |
| `smolagents` | Lightweight agent framework for tool-using LLMs. |

The ecosystem is designed so that you can mix and match. For example, fine-tuning Llama 3 with QLoRA in 4-bit precision typically uses `transformers` for the model, `bitsandbytes` for the 4-bit quantization, `peft` for the LoRA adapters, `datasets` for the training data, `accelerate` for distributed training, and `trl` if the recipe involves DPO or PPO.[8][9][10] Every step touches `huggingface_hub` for downloads and uploads.

## Which deep learning frameworks does Transformers support?

Transformers is unusual in that it long tried to keep a single Python interface across multiple deep learning frameworks. In practice the support became uneven, and with the v5 release of December 2025 Hugging Face consolidated on PyTorch: "Finally, we're sunsetting our Flax/TensorFlow support in favor of focusing on PyTorch as the sole backend."[18]

| Framework | Support level | Notes |
|-----------|--------------|-------|
| PyTorch | First-class for essentially every model | The reference implementations live in `modeling_*.py` files. As of v5.x, PyTorch is the sole supported backend. |
| TensorFlow / Keras | Deprecated in v5 | Historically implemented in `modeling_tf_*.py`; the v5 line sunsets TF support in favor of PyTorch. |
| JAX / Flax | Deprecated in v5 | Historically implemented in `modeling_flax_*.py`; Hugging Face is instead working with Jax-ecosystem partners on compatibility rather than maintaining in-tree Flax models. |
| ONNX | Via `optimum` | Most common production export path for CPU and GPU inference. |
| TensorRT-LLM, vLLM, SGLang, TGI | Via separate inference servers | These projects re-implement the hot path for serving but reuse `transformers` model definitions and tokenizers. |

## Hugging Face Hub integration

The Hub is the network effect that has made the library hard to dislodge. Loading a checkpoint with `from_pretrained("meta-llama/Llama-3.1-8B-Instruct")` resolves to a Hub repository, downloads the relevant files (config, tokenizer, weights, often in `safetensors`), caches them locally, and instantiates the model.[6] Uploads are symmetrical: `model.push_to_hub("my-org/my-model")` creates a Git LFS-backed repo with a model card.[6]

Hugging Face's State of Open Source on Hugging Face: Spring 2026 report gives a precise snapshot of the Hub circa early 2026:[11]

- More than 2 million public model repositories and over 500,000 public datasets.
- 13 million users in 2025, with model, dataset, and user counts all close to doubling year over year.
- Heavy concentration: the top 200 most-downloaded models, about 0.01 percent of all models, account for 49.6 percent of downloads, while roughly half of all models have fewer than 200 total downloads.
- A geographic shift: Chinese models reached a plurality at about 41 percent of downloads in 2025, following the viral release of DeepSeek R1.
- A move from corporate to community: industry's share of releases fell from around 70 percent before 2022 to roughly 37 percent in 2025, while independent or unaffiliated developers rose from 17 percent to 39 percent.
- A widening size distribution: the mean uploaded model grew from 827 million parameters in 2023 to 20.8 billion in 2025, while the median rose only marginally from 326 million to 406 million parameters.
- A robotics surge: robotics datasets grew from 1,145 in 2024 to 26,991 in 2025, climbing from the 44th-largest dataset category to the single largest.

The Hub is backed by the Xet storage system (acquired through XetHub in 2024), which deduplicates large files at the chunk level and significantly speeds up uploads and downloads of multi-gigabyte model weights.[11]

## How widely is the Transformers library used?

A few numbers give a sense of the scale.

- **GitHub stars**: about 162,000 on `huggingface/transformers` as of June 2026, with roughly 33,600 forks and tens of thousands of contributors and pull requests, ranking it among the most starred ML projects on GitHub.[7]
- **Releases**: development moves fast, with the v5 line reaching v5.12.1 on June 15, 2026 after several point releases earlier the same month.[17]
- **Downloads**: tens of millions of installs per month from PyPI and conda-forge combined, making it consistently one of the most-downloaded ML packages.[17]
- **Citations**: Wolf et al. 2020 has been cited tens of thousands of times in academic literature, putting it among the most cited NLP software papers in history.[1]
- **Industry use**: Microsoft, Meta, Google DeepMind, Amazon, NVIDIA, Anthropic, Cohere, IBM, Salesforce, Apple, and many startups use the library directly for research or as a baseline. Most public open-weight models from these organizations ship with `transformers`-compatible code.
- **Education**: It is the standard teaching library in graduate NLP courses and bootcamps; the official Hugging Face NLP and LLM courses are widely used.

It is fair to say that releasing a new pretrained model without a `transformers`-compatible implementation is now the exception rather than the rule.

## Major features

A partial inventory of features that have shaped how people use the library.

- **Quantization**: integrations with `bitsandbytes` (8-bit and 4-bit, QLoRA-friendly), AWQ, GPTQ / GPTQModel, AQLM, HQQ, EETQ, FBGEMM FP8, Quanto, torchao, compressed-tensors, GGUF interoperability with `llama.cpp`, and built-in fine-grained FP8. With v5, Hugging Face made quantization a first-class feature: "we move to quantization being a first-class citizen."[5][18]
- **Long context**: rotary position embedding (RoPE) scaling, position interpolation, YARN, NTK-aware scaling; KV cache management for streaming and long generations.
- **Speculative decoding**: assisted generation with a draft model, EAGLE-style and Medusa-style heads where supported.
- **Streaming**: `TextStreamer` and async iterators for token-by-token output.
- **Tool calling and structured output**: chat templates, function-calling formats, JSON-schema constrained decoding via integration with libraries such as Outlines and XGrammar.
- **FlashAttention 2 and 3**: drop-in attention implementations selectable through `attn_implementation="flash_attention_2"` or `"sdpa"`.
- **Mixed-precision and `torch.compile`**: bf16 default on modern hardware; `torch.compile` supported across many architectures.
- **Chat templates**: per-model Jinja templates that turn structured chat messages into the exact prompt format a given instruction-tuned model expects.
- **Multimodal pipelines**: image-text-to-text, document QA, video classification.

## What is the Transformers library used for?

The library shows up in roughly four kinds of work.

- **Research**: it is the standard way to release a paper's code. New architectures usually land first as a research repo and then as a `transformers` PR, after which the rest of the ecosystem (vLLM, TGI, llama.cpp, MLX) adapts.
- **Prototyping**: the `pipeline` API and the Hub make it possible to wire up a working classifier, summarizer, or speech transcriber in under five minutes.
- **Fine-tuning**: `Trainer` plus `peft` is the most common way to adapt an open-weight LLM to a specific domain, instruction format, or task. QLoRA fine-tuning of 7B-70B parameter models on a single GPU is now routine.[8]
- **Production inference**: companies sometimes serve directly from `transformers`, but high-throughput LLM serving usually moves to vLLM, TGI, SGLang, or TensorRT-LLM, all of which import `transformers` model definitions and tokenizers.

## How does Transformers differ from vLLM, TGI, and llama.cpp?

| Tool | Primary focus | Strengths | Trade-offs |
|------|---------------|-----------|------------|
| Hugging Face Transformers | Model definitions and training | Largest model catalog; standard API; tight Hub integration | Inference is slower than dedicated servers; some abstraction overhead |
| vLLM | High-throughput LLM serving | Continuous batching, PagedAttention, very fast | Inference only; smaller model coverage than `transformers` |
| Text Generation Inference (TGI) | Production LLM serving by Hugging Face | Production hardened, container-friendly, integrates with the Hub | Inference only; less flexible than `transformers` for research |
| [llama.cpp](/wiki/llama_cpp) | CPU and GGUF inference | Runs on laptops, phones, and edge devices; very small footprint | C++ codebase, not a Python training library |
| [Ollama](/wiki/ollama) | Local model runner built on llama.cpp | Easiest end-user UX for local LLMs | Inference only, opinionated wrapper |
| TensorFlow Hub | TF model catalog | First-party for TF | Smaller catalog; no PyTorch support; not the standard for LLMs |
| JAX/Flax model libraries | Research and TPU work | Functional style; first-class TPU support | Smaller community; mostly subset of `transformers` |
| spaCy + spacy-transformers | NLP pipelines | Production-grade NLP for traditional tasks | Narrower scope; wraps `transformers` rather than replacing it |

## Strengths

- The single largest catalog of pretrained model code and weights, with new state-of-the-art models often supported within days of release.
- Standardized API across very different architectures, which makes swapping models a one-line change.
- Permissive Apache 2.0 license.
- Active development: typically a release every few weeks, frequent bug fixes, transparent design discussions on GitHub.[7]
- High-quality documentation, course material, and a large community of contributors.[3]
- Tight integration with the rest of the open-source ML stack, from training (Accelerate, DeepSpeed, FSDP, PEFT) to inference (vLLM, TGI, llama.cpp, MLX) to evaluation (`evaluate`, `lm-evaluation-harness`).
- Reproducibility: model definitions track the original papers closely, so loading a checkpoint usually gives results within rounding error of the source.

## Limitations

- For high-throughput LLM serving, `transformers` is slower than purpose-built servers. Production teams generally pair the library with vLLM or TGI rather than serving it directly.
- The genericity that makes the library easy to use also imposes some memory and latency overhead, especially for very small models or very large batch sizes.
- Major version transitions (3.x to 4.x in 2020, 4.x to 5.x in 2025-2026) have introduced breaking changes that downstream code needs to track.
- TensorFlow and Flax coverage was deprecated in v5; new architectures ship PyTorch-only.
- CPU inference works but is much slower than what a quantized `llama.cpp` build can do for the same model.
- Some custom architectures (especially research models with non-standard layers) require `trust_remote_code=True`, which carries a security implication users sometimes overlook.

## Hugging Face the company

Hugging Face was founded in 2016 in New York by Clément Delangue (CEO), Julien Chaumond (CTO), and Thomas Wolf (CSO). All three are French.[15] The original product was a chatbot app aimed at teenagers; the company name and the hugging-face emoji come from that era. After releasing the chatbot's models as open source in 2017, the team noticed that the open-source release was getting more attention than the app itself.[15] The November 2018 release of `pytorch-pretrained-bert` accelerated that pivot.[16] By 2019 the company had effectively repositioned itself as an ML platform built around the library and what would become the Hugging Face Hub.

Key corporate milestones:

- 2019: $15M Series A.
- 2021: $40M Series B; acquired Gradio (December 2021).
- 2022: $100M Series C at a $2B valuation; launched BigScience, the open research collaboration that produced BLOOM.
- August 2023: $235M Series D led by Salesforce, with participation from Google, Amazon, NVIDIA, Intel, AMD, IBM, Qualcomm, and Sound Ventures, at a $4.5B valuation.[14]
- June 2024: acquired Argilla (a data-labeling and dataset-quality startup) for around $10M.[13]
- Late 2024: acquired XetHub, the basis of the Xet storage backend now used for Hub repos.
- April 2025: acquired Pollen Robotics, the French maker of the open-source Reachy 2 humanoid robot, marking the company's first move into hardware.[12]
- 2025-2026: continued growth of the Hub past 2 million models and 13 million users; reported revenue passed $130M in 2024.[11]

The company's strategic position is unusual: it does not train its own frontier models and does not sell a closed model API at the scale of OpenAI or Anthropic. Instead it monetizes the Hub through enterprise plans, hosted inference, training credits, and consulting. The transformers library is the gravitational center that makes everything else possible.

## Recent context (2024 to 2026)

A few threads stand out in the last two years.

- **Agentic features**: the `smolagents` library and the `transformers.agents` module support tool-using LLMs, including code agents that write and execute Python.
- **Edge and on-device**: tighter integration with MLX (Apple Silicon), `llama.cpp` (GGUF interoperability), and ONNX Runtime Web has pushed the library toward laptops, phones, and browsers.
- **Inference Providers**: a Hub-level feature that routes inference requests to third-party serving partners like Together AI, Replicate, Fireworks, and SambaNova, while keeping the same `transformers`-style API.
- **Multimodal everywhere**: vision-language models (LLaVA, Idefics, Qwen-VL, PaliGemma, Llama 3.2 Vision) and audio-language models (Whisper, SeamlessM4T) are now first-class citizens, not bolt-ons.
- **v5 release line**: the 5.x line, whose first release candidate landed on December 1, 2025 and which reached v5.12.1 by June 15, 2026, modernizes the codebase around PyTorch as the sole backend, removes long-deprecated TF and Flax helpers, formalizes attention backends, makes quantization a first-class feature, and reworks the chat-template and tool-calling layer.[17][18]

## References

1. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., Rush, A. M. (2020). Transformers: State-of-the-Art Natural Language Processing. *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations*, pp. 38-45. https://aclanthology.org/2020.emnlp-demos.6/
2. Wolf, T. et al. (2019). HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771. https://arxiv.org/abs/1910.03771
3. Hugging Face. Transformers documentation. https://huggingface.co/docs/transformers
4. Hugging Face. Pipelines documentation. https://huggingface.co/docs/transformers/main_classes/pipelines
5. Hugging Face. Quantization overview. https://huggingface.co/docs/transformers/quantization/overview
6. Hugging Face. Hub documentation. https://huggingface.co/docs/hub/index
7. huggingface/transformers GitHub repository. https://github.com/huggingface/transformers
8. Hugging Face. PEFT documentation. https://huggingface.co/docs/peft/index
9. Hugging Face. Accelerate documentation. https://huggingface.co/docs/accelerate/index
10. Hugging Face. TRL documentation. https://huggingface.co/docs/trl/index
11. Hugging Face. State of Open Source on Hugging Face: Spring 2026. https://huggingface.co/blog/huggingface/state-of-os-hf-spring-2026
12. Hugging Face. Hugging Face to sell open-source robots thanks to Pollen Robotics acquisition (April 2025). https://huggingface.co/blog/hugging-face-pollen-robotics-acquisition
13. Argilla. Argilla is joining Hugging Face (June 2024). https://argilla.io/blog/argilla-joins-hugggingface/
14. TechCrunch. Hugging Face raises $235M from investors, including Salesforce and NVIDIA (August 2023). https://techcrunch.com/2023/08/24/hugging-face-raises-235m-from-investors-including-salesforce-and-nvidia/
15. Wikipedia. Hugging Face. https://en.wikipedia.org/wiki/Hugging_Face
16. Synced. Hugging Face Releases PyTorch BERT Pretrained Models and More (February 2019). https://syncedreview.com/2019/02/20/hugging-face-releases-pytorch-bert-pretrained-models-and-more-2/
17. PyPI. transformers. https://pypi.org/project/transformers/
18. Hugging Face. Transformers v5: Simple model definitions powering the AI ecosystem (December 2025). https://huggingface.co/blog/transformers-v5