Hugging Face Transformers
Last reviewed
Sources
18 citations
Review status
Source-backed
Revision
v6 · 4,049 words
Note: This article is about the open-source Python library by Hugging Face. For the neural network architecture introduced in the 2017 paper "Attention Is All You Need," see Transformer (architecture).
Hugging Face Transformers is an open-source Python library that provides general-purpose architectures, a unified API, and pretrained weights for state-of-the-art machine learning models across text, vision, audio, and multimodal tasks, installable with a single pip install transformers command.[3] In the words of its own EMNLP 2020 paper, "Transformers is an open-source library with the goal of opening up these advances to the wider machine learning community."[1] It started life as a PyTorch port of Google's BERT reference code, and has since become the de facto standard interface for natural language understanding, text generation, computer vision, audio processing, and multimodal modeling.[16] The library covers more than 200 model families, including BERT, GPT, T5, BART, ViT, CLIP, Whisper, LLaMA, Mistral, Mixtral, Gemma, Phi, Qwen, and DeepSeek, and (through version 4.x) worked with PyTorch, TensorFlow, and JAX.[3] It is developed and maintained by Hugging Face, an open-source AI company headquartered in New York and Paris.[15]
Transformers is licensed under Apache 2.0 and tightly integrated with the Hugging Face Hub, which hosts more than 2 million public model checkpoints and over 500,000 public datasets as of 2026.[6][11] The repository carries roughly 162,000 GitHub stars and about 33,600 forks as of June 2026, making it one of the most starred machine learning projects on GitHub.[7] The associated EMNLP demo paper, Wolf et al. 2020, has been cited tens of thousands of times and is one of the most cited software papers in modern NLP.[1] The official GitHub tagline now describes it as "the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training."[7]
In practical terms, transformers is a single Python package (pip install transformers) that exposes:
- pipeline API for one-line inference on common tasks.
- Trainer class for fine-tuning with mixed precision, gradient accumulation, and distributed training.
- generate API for text generation, supporting greedy decoding, beam search, sampling, contrastive search, and speculative decoding.

The library is, in the words of its own documentation, the "model-definition framework" for the broader ecosystem: if a model is supported in transformers, it tends to be compatible with downstream training frameworks (Axolotl, Unsloth, DeepSpeed, FSDP, PyTorch Lightning) and inference engines (vLLM, SGLang, TGI, llama.cpp, MLX) that build on top of those definitions.[3] Hugging Face frames the v5 release the same way: "Transformers, at the core, remains a model architecture toolkit," and "the backbone of hundreds of thousands of projects."[18]
The library predates the Hugging Face Hub and even predates Hugging Face's pivot from a chatbot company to an ML platform. It began in late 2018 when Thomas Wolf and a small team ported Google's TensorFlow BERT code to PyTorch.[15] The package went through two renames before settling on transformers in late 2019.[15]
| Year | Milestone |
|---|---|
| 2016 | Hugging Face founded by Clément Delangue, Julien Chaumond, and Thomas Wolf in New York City as a chatbot startup for teenagers. |
| Nov 2018 | Hugging Face releases pytorch-pretrained-bert, a PyTorch port of Google's BERT, on PyPI. |
| Feb 2019 | Library expands to OpenAI GPT, GPT-2, and Transformer-XL after PyTorch reimplementations of GPT-2 small. |
| Jul 2019 | Library renamed to pytorch-transformers (v1.0) to reflect its broader scope; adds XLNet, XLM, RoBERTa, DistilBERT. |
| Sep 2019 | Renamed again to transformers (v2.0); TensorFlow 2.0 support added so models can be loaded interchangeably between frameworks. |
| Oct 2019 | Wolf et al. publish the technical report "HuggingFace's Transformers: State-of-the-art Natural Language Processing" on arXiv (1910.03771).[2] |
| 2020 | Wolf et al. paper accepted to EMNLP 2020 (System Demonstrations), pp. 38-45, formally introducing the library to the research community. Pipelines API gains traction.[1] |
| Sep 2020 | v3.0 release; ONNX export, Trainer class, and improved tokenizers integration. |
| Nov 2020 | v4.0 release; deeper Hub integration, model sharing, and a stable API. |
| 2021 | Hugging Face acquires Gradio (December 2021) to provide easy demo hosting through Spaces. |
| 2022 | Adds vision models (ViT, DETR, Swin) and audio models (Wav2Vec2, Whisper); BigScience releases BLOOM through the Hub. |
| Feb 2023 | Native support for LLaMA lands shortly after Meta's release; Mistral, Falcon, MPT, and other open LLMs follow throughout the year. |
| Aug 2023 | Hugging Face raises $235M Series D at a $4.5B valuation from investors including Google, Amazon, NVIDIA, Intel, Salesforce, and AMD.[14] |
| 2024 | Multimodal pipelines, agentic features (smolagents), assisted decoding, and integration with Inference Providers. Hugging Face acquires Argilla in June 2024 ($10M) and XetHub later in 2024. |
| Apr 2025 | Hugging Face acquires Pollen Robotics, the maker of the open-source Reachy 2 humanoid robot.[12] |
| Dec 2025 | Transformers v5.0.0rc-0 released (December 1, 2025): PyTorch-only backend, quantization promoted to a first-class feature, and the start of the sunset of TensorFlow and Flax support.[18] |
| Jun 2026 | The v5 line continues at a fast cadence, reaching v5.12.1 on June 15, 2026; the Hub has crossed 2 million public models, 500,000 public datasets, and 13 million users.[11][17] |
The November 2018 release of pytorch-pretrained-bert matters because it made BERT, which had only just been published by Google, usable in PyTorch with a few lines of Python.[16][17] That single design choice (mirror the official architecture, ship pretrained weights, and let people fine-tune in their own training loop) is the pattern the library has followed ever since.
The library is organized around a small set of abstractions that get reused across every model.
Every model in transformers is implemented with three main classes:
- A configuration class (e.g. BertConfig) that holds all hyperparameters as plain Python attributes.
- A model class (e.g. BertModel, BertForSequenceClassification) implementing the forward pass.
- A preprocessing class (a tokenizer for text models) that converts raw input into model inputs.

The philosophy is intentionally light on abstraction. Each model file is meant to be readable on its own without chasing class hierarchies. Hugging Face calls this the "single model file" policy.[3]
Most users do not instantiate model classes directly. Instead they use the Auto family, which inspects a checkpoint's config and picks the right class:
- AutoConfig loads the configuration.
- AutoTokenizer returns the correct tokenizer (fast Rust-backed when available, slow Python otherwise).
- AutoModel returns the base model.
- AutoModelForCausalLM, AutoModelForSequenceClassification, AutoModelForQuestionAnswering, AutoModelForSeq2SeqLM, etc., return models with the right task head.
- AutoImageProcessor and AutoFeatureExtractor handle vision and audio inputs.

This is what lets the same five lines of code load BERT, RoBERTa, DeBERTa, DistilBERT, or any compatible checkpoint by changing only the model name.[3]
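The config-driven dispatch described above can be sketched in plain Python. This is a simplified illustration, not the library's actual implementation: the Toy* classes and MODEL_REGISTRY are hypothetical names, and the real Auto classes handle far more (task heads, remote code, weight loading).

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Stand-in for a checkpoint config; model_type drives the dispatch."""
    model_type: str
    hidden_size: int = 768


class ToyBertModel:
    def __init__(self, config: ToyConfig):
        self.config = config


class ToyRobertaModel:
    def __init__(self, config: ToyConfig):
        self.config = config


# Hypothetical registry mapping a model_type string to a concrete class.
MODEL_REGISTRY = {"bert": ToyBertModel, "roberta": ToyRobertaModel}


class ToyAutoModel:
    @staticmethod
    def from_config(config: ToyConfig):
        """Look up the class registered for config.model_type and build it."""
        try:
            model_cls = MODEL_REGISTRY[config.model_type]
        except KeyError:
            raise ValueError(f"Unrecognized model_type: {config.model_type!r}") from None
        return model_cls(config)


# Only the config changes; the calling code stays the same.
bert = ToyAutoModel.from_config(ToyConfig(model_type="bert"))
roberta = ToyAutoModel.from_config(ToyConfig(model_type="roberta"))
print(type(bert).__name__, type(roberta).__name__)
```

The point of the design is that calling code never names a concrete model class, so swapping checkpoints only changes the data that is loaded.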
The pipeline factory wraps preprocessing, the model, and postprocessing into one callable. It is the easiest entry point for prototyping and accounts for many of the library's tutorials.[4] Tasks supported include:
| Modality | Pipeline tasks |
|---|---|
| Text | text-classification (alias sentiment-analysis), token-classification (alias ner), question-answering, table-question-answering, fill-mask, text-generation, text2text-generation, summarization, translation, zero-shot-classification, feature-extraction |
| Vision | image-classification, image-segmentation, object-detection, depth-estimation, mask-generation, keypoint-matching, image-feature-extraction, zero-shot-image-classification, zero-shot-object-detection, video-classification |
| Audio | automatic-speech-recognition, audio-classification, text-to-audio (alias text-to-speech), zero-shot-audio-classification |
| Multimodal | image-text-to-text, document-question-answering, visual-question-answering |
A one-liner like pipeline("automatic-speech-recognition", model="openai/whisper-large-v3") will download Whisper, set up its feature extractor, and return a callable that transcribes audio files.[4]
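To make the preprocess, model, postprocess wrapping concrete, here is a toy callable with the same three stages. It is a sketch under stated assumptions: ToySentimentPipeline, its word weights, and its labels are invented for illustration and do not correspond to any real checkpoint or to the library's internal pipeline classes.

```python
import math


class ToySentimentPipeline:
    """Toy callable that bundles preprocess -> forward -> postprocess."""

    # Hypothetical word weights standing in for a trained model.
    WEIGHTS = {"great": 2.0, "good": 1.0, "bad": -1.0, "awful": -2.0}
    LABELS = ("NEGATIVE", "POSITIVE")

    def preprocess(self, text):
        # Stand-in for a tokenizer: lowercase and split on whitespace.
        return text.lower().split()

    def forward(self, tokens):
        # Stand-in for the model: produce one logit per label.
        score = sum(self.WEIGHTS.get(tok, 0.0) for tok in tokens)
        return [-score, score]

    def postprocess(self, logits):
        # Softmax over the logits, then report the most probable label.
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        return {"label": self.LABELS[best], "score": probs[best]}

    def __call__(self, text):
        return self.postprocess(self.forward(self.preprocess(text)))


classify = ToySentimentPipeline()
result = classify("A great movie")
print(result["label"])
```

A caller only ever sees text in and a label with a score out, which is the property that makes the real pipeline factory convenient for prototyping.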
Trainer is the library's training loop. It handles:
- Distributed training through accelerate, including DDP, FSDP, and DeepSpeed ZeRO.
- torch.compile integration.

For reinforcement learning from human feedback or preference optimization, users typically reach for trl (Transformer Reinforcement Learning) on top of Trainer.[10]
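Gradient accumulation, one of the Trainer features named earlier, is easy to check by hand. The sketch below uses a one-parameter linear model in plain Python (the data and function names are made up for illustration) and shows that summing micro-batch gradients, each scaled by one over the number of accumulation steps, reproduces the full-batch gradient. The equality assumes equally sized micro-batches and a mean-reduced loss.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Sum micro-batch gradients, each scaled by 1 / number_of_steps.

    Matches the full-batch gradient only when every micro-batch has the
    same size, which is the case in this example.
    """
    chunks = [data[i:i + micro_batch_size]
              for i in range(0, len(data), micro_batch_size)]
    steps = len(chunks)
    total = 0.0
    for chunk in chunks:
        total += mse_grad(w, chunk) / steps
    return total


# Hypothetical (x, y) pairs and starting weight.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

full = mse_grad(w, data)
accum = accumulated_grad(w, data, micro_batch_size=2)
print(abs(full - accum) < 1e-9)
```

This is why accumulation lets a small-memory device emulate a larger effective batch size: the optimizer step sees the same gradient either way.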
The model.generate() method is the main interface for text generation in causal LMs and sequence-to-sequence models. It supports greedy decoding, beam search, sampling with temperature, top-k and top-p (nucleus), contrastive search, diverse beam search, group beam search, and speculative or assisted decoding (where a smaller draft model proposes tokens that a larger model accepts or rejects).[3] Streaming output is supported through TextStreamer and TextIteratorStreamer.
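The draft-and-verify idea behind assisted decoding can be sketched with two toy next-token functions. Everything here is hypothetical: target_next and draft_next stand in for a large and a small model, and the loop implements only the greedy variant, in which the output must match plain greedy decoding with the target model. A real implementation scores all drafted positions in a single batched forward pass of the larger model, which is where the speedup comes from.

```python
def target_next(context):
    """Hypothetical large model: next token is the context sum modulo 7."""
    return sum(context) % 7


def draft_next(context):
    """Hypothetical small model: matches the target unless the last token is 3."""
    return 0 if context[-1] == 3 else sum(context) % 7


def greedy_decode(prompt, num_tokens):
    """Reference: plain greedy decoding with the target model only."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out


def assisted_decode(prompt, num_tokens, lookahead=4):
    """Greedy assisted decoding: the draft proposes, the target verifies."""
    out = list(prompt)
    goal = len(prompt) + num_tokens
    accepted = 0
    while len(out) < goal:
        # 1. The draft model proposes `lookahead` tokens autoregressively.
        proposal = []
        for _ in range(lookahead):
            proposal.append(draft_next(out + proposal))
        # 2. The target model checks each proposed position in order.
        for tok in proposal:
            if len(out) == goal:
                break
            verified = target_next(out)
            out.append(verified)  # equals tok whenever the draft was right
            if verified != tok:
                break  # first mismatch: discard the rest of the draft
            accepted += 1
    return out, accepted


prompt = [1, 2]
tokens, accepted = assisted_decode(prompt, num_tokens=10)
print(tokens == greedy_decode(prompt, 10), accepted)
```

Because every emitted token is the target model's own choice, the draft model affects speed but not the result in this greedy setting.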
Generation also includes a configurable KV cache, RoPE scaling for extended context windows, repetition penalties, logit processors, and stopping criteria. Tool calling and structured output, including JSON-grammar constrained decoding, are supported on compatible models.[3]
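As an illustration of how logit processors compose, the sketch below chains a repetition penalty, temperature scaling, and top-p (nucleus) filtering over a four-token vocabulary. The function names and the logit values are assumptions made for this example; it mirrors the general technique rather than the library's processor classes.

```python
import math


def apply_repetition_penalty(logits, generated_ids, penalty):
    """Make already-generated tokens less likely (penalty > 1)."""
    out = list(logits)
    for tok in set(generated_ids):
        # Shrink positive logits and push negative logits further down.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def apply_temperature(logits, temperature):
    """Values below 1 sharpen the distribution; values above 1 flatten it."""
    return [x / temperature for x in logits]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized candidates


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
step1 = apply_repetition_penalty(logits, generated_ids=[0], penalty=2.0)
step2 = apply_temperature(step1, temperature=1.0)
candidates = top_p_filter(softmax(step2), top_p=0.8)
print(sorted(candidates))  # [0, 1, 2]
```

Each stage takes scores in and returns scores out, which is what lets such steps be stacked in any order before the final sampling or argmax.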
As of v5.x the library ships definitions for more than 200 architectures.[3] The roster spans encoder-only models (BERT, RoBERTa, DeBERTa, DistilBERT, ALBERT, ELECTRA), decoder-only models (GPT-2, GPT-J, GPT-NeoX, OPT, BLOOM, LLaMA 1-4, Mistral, Mixtral, Gemma, Phi, Qwen, DeepSeek, Falcon, MPT), encoder-decoder models (T5, mT5, BART, mBART, Pegasus, MarianMT, NLLB), vision models (ViT, DeiT, Swin, ConvNeXt, DINOv2, BEiT, MAE, DETR, Mask2Former), audio models (Wav2Vec2, HuBERT, Whisper, MMS, SeamlessM4T, MusicGen), and multimodal models (CLIP, BLIP, BLIP-2, LLaVA, IDEFICS, Pix2Struct, Donut, Kosmos-2, PaliGemma, Qwen-VL, Llama 3.2 Vision).
Transformers does not stand alone. The library ships with hooks into roughly a dozen sister packages, most of them maintained by Hugging Face itself:
| Library | Purpose |
|---|---|
| transformers | Model architectures, tokenizers, training, generation, pipelines. |
| tokenizers | Fast Rust-backed tokenizers (BPE, WordPiece, Unigram, byte-level). |
| datasets | Streaming-capable dataset loading; over 500,000 datasets on the Hub. |
| accelerate | Hardware-agnostic distributed training and inference; abstracts CUDA, ROCm, TPU, and Apple Silicon. |
| peft | Parameter-efficient fine-tuning (LoRA, QLoRA, IA3, prefix tuning, prompt tuning, adapters). |
| trl | RLHF and preference optimization (SFT, PPO, DPO, KTO, ORPO, GRPO). |
| diffusers | Diffusion models (Stable Diffusion, SDXL, SD 3, Flux, video diffusion). |
| optimum | Hardware-specific optimization backends: ONNX Runtime, TensorRT, OpenVINO, Habana Gaudi, AWS Neuron, Apple Neural Engine. |
| evaluate | Standardized evaluation metrics. |
| safetensors | Safe, fast tensor serialization format that has largely replaced pickle-based PyTorch checkpoints. |
| huggingface_hub | Programmatic Hub access, file downloads, repo management, Inference API client. |
| gradio | UI library for building demo apps; powers most Hugging Face Spaces. |
| smolagents | Lightweight agent framework for tool-using LLMs. |
The ecosystem is designed so that you can mix and match. For example, fine-tuning Llama 3 with QLoRA in 4-bit precision typically uses transformers for the model, bitsandbytes for the 4-bit quantization, peft for the LoRA adapters, datasets for the training data, accelerate for distributed training, and trl if the recipe involves DPO or PPO.[8][9][10] Every step touches huggingface_hub for downloads and uploads.
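A quick back-of-the-envelope calculation shows why the QLoRA recipe above fits on modest hardware. The sketch is arithmetic only: the 4096 x 4096 layer size, rank 16, and the 8-billion-parameter count are illustrative assumptions, and the memory figures cover raw weights alone (no quantization constants, activations, or optimizer state).

```python
def lora_trainable_params(d_out, d_in, rank):
    """LoRA trains B (d_out x rank) and A (rank x d_in) beside a frozen weight."""
    return rank * (d_out + d_in)


def weight_memory_gib(num_params, bits_per_param):
    """Raw weight storage in GiB at a given precision."""
    return num_params * bits_per_param / 8 / 2**30


d = 4096  # hypothetical square projection layer
full = d * d
lora = lora_trainable_params(d, d, rank=16)
print(full, lora, f"{lora / full:.2%}")  # 16777216 131072 0.78%

# Hypothetical 8-billion-parameter model stored in 16-bit vs 4-bit weights.
print(round(weight_memory_gib(8e9, 16), 1), round(weight_memory_gib(8e9, 4), 1))  # 14.9 3.7
```

The two effects stack: 4-bit storage shrinks the frozen base weights, while the low-rank adapters keep the trainable parameter count to a small fraction of each adapted layer.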
Transformers is unusual in that it long tried to keep a single Python interface across multiple deep learning frameworks. In practice the support became uneven, and with the v5 release of December 2025 Hugging Face consolidated on PyTorch: "Finally, we're sunsetting our Flax/TensorFlow support in favor of focusing on PyTorch as the sole backend."[18]
| Framework | Support level | Notes |
|---|---|---|
| PyTorch | First-class for essentially every model | The reference implementations live in modeling_*.py files. As of v5.x, PyTorch is the sole supported backend. |
| TensorFlow / Keras | Deprecated in v5 | Historically implemented in modeling_tf_*.py; the v5 line sunsets TF support in favor of PyTorch. |
| JAX / Flax | Deprecated in v5 | Historically implemented in modeling_flax_*.py; Hugging Face now works with JAX-ecosystem partners on compatibility rather than maintaining in-tree Flax models. |
| ONNX | Via optimum | Most common production export path for CPU and GPU inference. |
| TensorRT-LLM, vLLM, SGLang, TGI | Via separate inference servers | These projects re-implement the hot path for serving but reuse transformers model definitions and tokenizers. |
The Hub is the network effect that has made the library hard to dislodge. Loading a checkpoint with from_pretrained("meta-llama/Llama-3.1-8B-Instruct") resolves to a Hub repository, downloads the relevant files (config, tokenizer, weights, often in safetensors), caches them locally, and instantiates the model.[6] Uploads are symmetrical: model.push_to_hub("my-org/my-model") creates a Git LFS-backed repo with a model card.[6]
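The local caching step can be pictured as a mapping from a repository id to a folder on disk. The sketch below assumes the commonly seen layout (a hub directory under the Hugging Face home folder, with one models--org--name folder per repository, and the HF_HOME and HF_HUB_CACHE environment variables as overrides). Treat the exact layout as an assumption to verify against the huggingface_hub documentation; the helper names are hypothetical.

```python
import os
from pathlib import Path


def hub_cache_dir(env=None):
    """Resolve the cache root (assumed layout; check huggingface_hub docs)."""
    env = os.environ if env is None else env
    if "HF_HUB_CACHE" in env:
        return Path(env["HF_HUB_CACHE"])
    hf_home = env.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"


def repo_cache_folder(repo_id, repo_type="model", env=None):
    """Map 'org/name' to '<cache>/models--org--name' (assumed naming scheme)."""
    folder = f"{repo_type}s--{repo_id.replace('/', '--')}"
    return hub_cache_dir(env) / folder


# Hypothetical HF_HOME value used only for this demonstration.
path = repo_cache_folder(
    "meta-llama/Llama-3.1-8B-Instruct", env={"HF_HOME": "/data/hf"}
)
print(path.name)
```

Keeping one folder per repository is what allows a second from_pretrained call for the same checkpoint to skip the download.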
Hugging Face's State of Open Source on Hugging Face: Spring 2026 report gives a snapshot of the Hub circa early 2026: more than 2 million public models, over 500,000 public datasets, and 13 million users.[11]
The Hub is backed by the Xet storage system (acquired through XetHub in 2024), which deduplicates large files at the chunk level and significantly speeds up uploads and downloads of multi-gigabyte model weights.[11]
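Chunk-level deduplication can be illustrated with a toy content-addressed store. This is a minimal sketch of the general technique, not a description of Xet's implementation: ChunkStore is a hypothetical class, and it uses tiny fixed-size chunks for readability, whereas production systems typically pick chunk boundaries from the content itself.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store that keeps each unique chunk once."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}  # sha256 hex digest -> chunk bytes

    def upload(self, data):
        """Store unseen chunks only; return (manifest, bytes_transferred)."""
        manifest, sent = [], 0
        for i in range(0, len(data), self.chunk_size):
            piece = data[i:i + self.chunk_size]
            digest = hashlib.sha256(piece).hexdigest()
            if digest not in self.chunks:
                self.chunks[digest] = piece
                sent += len(piece)
            manifest.append(digest)
        return manifest, sent

    def download(self, manifest):
        """Rebuild a file from its ordered list of chunk digests."""
        return b"".join(self.chunks[d] for d in manifest)


store = ChunkStore(chunk_size=4)
v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBXXXXDDDD"  # same file with one chunk changed

manifest1, sent1 = store.upload(v1)
manifest2, sent2 = store.upload(v2)
print(sent1, sent2)  # 16 4
```

Uploading the second version transfers only the changed chunk, which is the effect that matters when a fine-tuned checkpoint shares most of its bytes with an earlier revision.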
A few numbers give a sense of the scale.
- Roughly 162,000 GitHub stars on huggingface/transformers as of June 2026, with about 33,600 forks and tens of thousands of contributors and pull requests, ranking it among the most starred ML projects on GitHub.[7]
- [...] transformers-compatible code.

It is fair to say that releasing a new pretrained model without a transformers-compatible implementation is now the exception rather than the rule.
A partial inventory of features that have shaped how people use the library.
- Quantization backends: bitsandbytes (8-bit and 4-bit, QLoRA-friendly), AWQ, GPTQ / GPTQModel, AQLM, HQQ, EETQ, FBGEMM FP8, Quanto, torchao, compressed-tensors, GGUF interoperability with llama.cpp, and built-in fine-grained FP8. With v5, Hugging Face made quantization a first-class feature: "we move to quantization being a first-class citizen."[5][18]
- Streaming generation: TextStreamer and async iterators for token-by-token output.
- Attention backends: selected with attn_implementation="flash_attention_2" or "sdpa".
- Precision and torch.compile: bf16 default on modern hardware; torch.compile supported across many architectures.

The library shows up in roughly four kinds of work.
- [...] transformers PR, after which the rest of the ecosystem (vLLM, TGI, llama.cpp, MLX) adapts.
- The pipeline API and the Hub make it possible to wire up a working classifier, summarizer, or speech transcriber in under five minutes.
- Trainer plus peft is the most common way to adapt an open-weight LLM to a specific domain, instruction format, or task. QLoRA fine-tuning of 7B-70B parameter models on a single GPU is now routine.[8]
- [...] transformers, but high-throughput LLM serving usually moves to vLLM, TGI, SGLang, or TensorRT-LLM, all of which import transformers model definitions and tokenizers.

| Tool | Primary focus | Strengths | Trade-offs |
|---|---|---|---|
| Hugging Face Transformers | Model definitions and training | Largest model catalog; standard API; tight Hub integration | Inference is slower than dedicated servers; some abstraction overhead |
| vLLM | High-throughput LLM serving | Continuous batching, PagedAttention, very fast | Inference only; smaller model coverage than transformers |
| Text Generation Inference (TGI) | Production LLM serving by Hugging Face | Production hardened, container-friendly, integrates with the Hub | Inference only; less flexible than transformers for research |
| llama.cpp | CPU and GGUF inference | Runs on laptops, phones, and edge devices; very small footprint | C++ codebase, not a Python training library |
| Ollama | Local model runner built on llama.cpp | Easiest end-user UX for local LLMs | Inference only, opinionated wrapper |
| TensorFlow Hub | TF model catalog | First-party for TF | Smaller catalog; no PyTorch support; not the standard for LLMs |
| JAX/Flax model libraries | Research and TPU work | Functional style; first-class TPU support | Smaller community; model coverage is mostly a subset of transformers |
| spaCy + spacy-transformers | NLP pipelines | Production-grade NLP for traditional tasks | Narrower scope; wraps transformers rather than replacing it |
- [...] evaluate, lm-evaluation-harness).
- [...] transformers is slower than purpose-built servers. Production teams generally pair the library with vLLM or TGI rather than serving it directly.
- [...] llama.cpp build can do for the same model.
- [...] trust_remote_code=True, which carries a security implication users sometimes overlook.

Hugging Face was founded in 2016 in New York by Clément Delangue (CEO), Julien Chaumond (CTO), and Thomas Wolf (CSO). All three are French.[15] The original product was a chatbot app aimed at teenagers; the company name and the hugging-face emoji come from that era. After releasing the chatbot's models as open source in 2017, the team noticed that the open-source release was getting more attention than the app itself.[15] The November 2018 release of pytorch-pretrained-bert accelerated that pivot.[16] By 2019 the company had effectively repositioned itself as an ML platform built around the library and what would become the Hugging Face Hub.
Key corporate milestones:
The company's strategic position is unusual: it does not train its own frontier models and does not sell a closed model API at the scale of OpenAI or Anthropic. Instead it monetizes the Hub through enterprise plans, hosted inference, training credits, and consulting. The transformers library is the gravitational center that makes everything else possible.
A few threads stand out in the last two years.
- The smolagents library and the transformers.agents module support tool-using LLMs, including code agents that write and execute Python.
- [...] llama.cpp (GGUF interoperability), and ONNX Runtime Web has pushed the library toward laptops, phones, and browsers.
- [...] transformers-style API.