Hugging Face Transformers
Last reviewed
May 8, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,645 words
Note: This article is about the open-source Python library by Hugging Face. For the neural network architecture introduced in the 2017 paper "Attention Is All You Need," see Transformer (architecture).
Hugging Face Transformers (often written simply as the transformers library or transformers) is an open-source Python library that provides general-purpose architectures and pretrained weights for state-of-the-art machine learning models. It started life as a PyTorch port of Google's BERT reference code, and has since become the de facto standard interface for natural language understanding, text generation, computer vision, audio processing, and multimodal modeling. The library covers more than 200 model families, including BERT, GPT, T5, BART, ViT, CLIP, Whisper, LLaMA, Mistral, Mixtral, Gemma, Phi, Qwen, and DeepSeek, and works with PyTorch, TensorFlow, and JAX. It is developed and maintained by Hugging Face, an open-source AI company headquartered in New York and Paris.
Transformers is licensed under Apache 2.0 and tightly integrated with the Hugging Face Hub, which hosts more than 2 million model checkpoints and over 500,000 datasets as of 2026. The associated EMNLP demo paper, Wolf et al. 2020, has been cited tens of thousands of times and is one of the most cited software papers in modern NLP.
In practical terms, transformers is a single Python package (pip install transformers) that exposes:
- A pipeline API for one-line inference on common tasks.
- A Trainer class for fine-tuning with mixed precision, gradient accumulation, and distributed training.
- A generate API for text generation, supporting greedy decoding, beam search, sampling, contrastive search, and speculative decoding.

The library is, in the words of its own documentation, the "model-definition framework" for the broader ecosystem: if a model is supported in transformers, it tends to be compatible with downstream training frameworks (Axolotl, Unsloth, DeepSpeed, FSDP, PyTorch Lightning) and inference engines (vLLM, SGLang, TGI, llama.cpp, MLX) that build on top of those definitions.
The library predates the Hugging Face Hub and even predates Hugging Face's pivot from a chatbot company to an ML platform. It began in late 2018 when Thomas Wolf and a small team ported Google's TensorFlow BERT code to PyTorch. The package went through two renames before settling on transformers in late 2019.
| Year | Milestone |
|---|---|
| 2016 | Hugging Face founded by Clément Delangue, Julien Chaumond, and Thomas Wolf in New York City as a chatbot startup for teenagers. |
| Nov 2018 | Hugging Face releases pytorch-pretrained-bert, a PyTorch port of Google's BERT, on PyPI. |
| Feb 2019 | Library expands beyond BERT to OpenAI GPT, GPT-2, and Transformer-XL, following a PyTorch reimplementation of GPT-2 small. |
| Jul 2019 | Library renamed to pytorch-transformers (v1.0) to reflect its broader scope; adds XLNet, XLM, RoBERTa, DistilBERT. |
| Sep 2019 | Renamed again to transformers (v2.0); TensorFlow 2.0 support added so models can be loaded interchangeably between frameworks. |
| Oct 2019 | Wolf et al. publish the technical report "HuggingFace's Transformers: State-of-the-art Natural Language Processing" on arXiv (1910.03771). |
| 2020 | Wolf et al. paper accepted to EMNLP 2020 (System Demonstrations), formally introducing the library to the research community. Pipelines API gains traction. |
| Jun 2020 | v3.0 release; ONNX export, Trainer class, and improved tokenizers integration. |
| Nov 2020 | v4.0 release; deeper Hub integration, model sharing, and a stable API. |
| 2021 | Hugging Face acquires Gradio (December 2021) to provide easy demo hosting through Spaces. |
| 2022 | Adds vision models (ViT, DETR, Swin) and audio models (Wav2Vec2, Whisper); BigScience releases BLOOM through the Hub. |
| Feb 2023 | Native support for LLaMA lands shortly after Meta's release; Mistral, Falcon, MPT, and other open LLMs follow throughout the year. |
| Aug 2023 | Hugging Face raises $235M Series D at a $4.5B valuation from investors including Google, Amazon, NVIDIA, Intel, Salesforce, and AMD. |
| 2024 | Multimodal pipelines, agentic features (smolagents), assisted decoding, and integration with Inference Providers. Hugging Face acquires Argilla in June 2024 ($10M) and XetHub later in 2024. |
| Apr 2025 | Hugging Face acquires Pollen Robotics, the maker of the open-source Reachy 2 humanoid robot. |
| 2025–2026 | Continued 4.x releases throughout 2025; v5.0 line rolls out, with v5.7.0 released April 28, 2026. The Hub crosses 2 million public models and 500,000 public datasets. |
The November 2018 release of pytorch-pretrained-bert matters because it made BERT, which had only just been published by Google, usable in PyTorch with a few lines of Python. That single design choice (mirror the official architecture, ship pretrained weights, and let people fine-tune in their own training loop) is the pattern the library has followed ever since.
The library is organized around a small set of abstractions that get reused across every model.
Every model in transformers is implemented with three main classes:
- A configuration class (e.g., BertConfig) that holds all hyperparameters as plain Python attributes.
- A model class (e.g., BertModel, BertForSequenceClassification) implementing the forward pass.
- A preprocessing class (e.g., BertTokenizer) that converts raw inputs into the tensors the model expects.

The philosophy is intentionally light on abstraction. Each model file is meant to be readable on its own without chasing class hierarchies. Hugging Face calls this the "single model file" policy.
Most users do not instantiate model classes directly. Instead they use the Auto family, which inspects a checkpoint's config and picks the right class:
- AutoConfig loads the configuration.
- AutoTokenizer returns the correct tokenizer (fast Rust-backed when available, slow Python otherwise).
- AutoModel returns the base model.
- AutoModelForCausalLM, AutoModelForSequenceClassification, AutoModelForQuestionAnswering, AutoModelForSeq2SeqLM, etc., return models with the right task head.
- AutoImageProcessor and AutoFeatureExtractor handle vision and audio inputs.

This is what lets the same five lines of code load BERT, RoBERTa, DeBERTa, DistilBERT, or any compatible checkpoint by changing only the model name.
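The dispatch step can be illustrated with a toy registry in plain Python. This is only a sketch of the pattern (read the model_type field from a config, then look up a class). The registry, the stand-in classes, and the function name are hypothetical, and the library's real resolution logic is more involved.

```python
import json

# Hypothetical stand-ins for real model classes; they only record their config.
class ToyBertModel:
    def __init__(self, config: dict):
        self.config = config


class ToyRobertaModel:
    def __init__(self, config: dict):
        self.config = config


# Hypothetical registry: maps a config's "model_type" field to a class.
MODEL_REGISTRY = {
    "bert": ToyBertModel,
    "roberta": ToyRobertaModel,
}


def auto_model_from_config(config_json: str):
    """Pick a class from the config's model_type and instantiate it."""
    config = json.loads(config_json)
    model_type = config.get("model_type")
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Unrecognized model_type: {model_type!r}")
    return MODEL_REGISTRY[model_type](config)


# The calling code is identical; only the config contents differ.
bert = auto_model_from_config('{"model_type": "bert", "hidden_size": 768}')
roberta = auto_model_from_config('{"model_type": "roberta", "hidden_size": 768}')
print(type(bert).__name__, type(roberta).__name__)  # ToyBertModel ToyRobertaModel
```

Because the caller never names a concrete class, swapping checkpoints only changes the data that is read, not the code that reads it.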
The pipeline factory wraps preprocessing, the model, and postprocessing into one callable. It is the easiest entry point for prototyping and accounts for many of the library's tutorials. Tasks supported include:
| Modality | Pipeline tasks |
|---|---|
| Text | text-classification (alias sentiment-analysis), token-classification (alias ner), question-answering, table-question-answering, fill-mask, text-generation, text2text-generation, summarization, translation, zero-shot-classification, feature-extraction |
| Vision | image-classification, image-segmentation, object-detection, depth-estimation, mask-generation, keypoint-matching, image-feature-extraction, zero-shot-image-classification, zero-shot-object-detection, video-classification |
| Audio | automatic-speech-recognition, audio-classification, text-to-audio (alias text-to-speech), zero-shot-audio-classification |
| Multimodal | image-text-to-text, document-question-answering, visual-question-answering |
A one-liner like pipeline("automatic-speech-recognition", model="openai/whisper-large-v3") will download Whisper, set up its feature extractor, and return a callable that transcribes audio files.
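A minimal sketch of that one-liner wrapped in a helper, assuming transformers and a backend such as PyTorch are installed. The function name transcribe is a hypothetical helper, and the first call downloads the checkpoint, which is several gigabytes for this model.

```python
def transcribe(audio_path: str, model_id: str = "openai/whisper-large-v3") -> str:
    """Transcribe one audio file with the automatic-speech-recognition pipeline."""
    # Imported inside the function so that defining the helper does not
    # require the library to be installed or trigger any download.
    from transformers import pipeline

    # Builds the feature extractor, tokenizer, and model for the checkpoint;
    # files are downloaded and cached on first use.
    asr = pipeline("automatic-speech-recognition", model=model_id)

    # For this task the pipeline returns a dict with a "text" key.
    result = asr(audio_path)
    return result["text"]
```

A call such as transcribe("meeting.wav") would then return the transcript as a string, where meeting.wav stands for any short local audio clip. Decoding audio files may need an extra system dependency such as ffmpeg.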
Trainer is the library's training loop. It handles:
- Mixed-precision training and gradient accumulation.
- Distributed training through accelerate, including DDP, FSDP, and DeepSpeed ZeRO.
- torch.compile integration.

For reinforcement learning from human feedback or preference optimization, users typically reach for trl (Transformer Reinforcement Learning) on top of Trainer.
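Returning to Trainer itself, a basic run with mixed precision and gradient accumulation might look like the sketch below. The helper name fine_tune and all hyperparameter values are illustrative assumptions. The model is expected to be a transformers model and train_dataset an already tokenized dataset that includes labels.

```python
def fine_tune(model, train_dataset, output_dir: str = "./checkpoints"):
    """Run a short fine-tuning job with Trainer and return the trainer."""
    # Imported lazily so that defining the helper does not require the library.
    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,  # effective batch of 32 per device
        bf16=True,                      # mixed precision; needs bf16-capable hardware
        num_train_epochs=1,
        logging_steps=50,
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()
    return trainer
```

The same script can usually be launched on several GPUs through accelerate or torchrun without changing the training code, which is the main reason to prefer Trainer over a hand-written loop.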
The model.generate() method is the main interface for text generation in causal LMs and sequence-to-sequence models. It supports greedy decoding, beam search, sampling with temperature, top-k and top-p (nucleus), contrastive search, diverse beam search, group beam search, and speculative or assisted decoding (where a smaller draft model proposes tokens that a larger model accepts or rejects). Streaming output is supported through TextStreamer and TextIteratorStreamer.
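The accept-or-reject loop behind assisted decoding can be sketched with toy models in plain Python. This is a simplified greedy variant written for illustration, not the library's implementation: the two "models" are hypothetical functions that map a token prefix to the next token, and a real implementation scores the whole draft proposal in a single forward pass of the large model.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def assisted_greedy_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    num_draft_tokens: int = 4,
) -> Tuple[List[int], int]:
    """Return the generated tokens and the number of verification rounds."""
    if num_draft_tokens < 1:
        raise ValueError("num_draft_tokens must be at least 1")
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    rounds = 0
    while len(tokens) < limit:
        rounds += 1
        # 1. The draft model cheaply proposes a short continuation.
        proposal: List[int] = []
        for _ in range(num_draft_tokens):
            proposal.append(draft(tokens + proposal))
        # 2. The target model checks the proposal position by position.
        for proposed in proposal:
            expected = target(tokens)
            tokens.append(expected)  # the target's choice is always kept
            if expected != proposed or len(tokens) >= limit:
                break  # first disagreement: drop the rest of the proposal
    return tokens, rounds


# Hypothetical toy models: the target counts upward modulo 10,
# and the draft agrees except right after token 4.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


tokens, rounds = assisted_greedy_decode(toy_target, toy_draft, prompt=[0], max_new_tokens=8)
print(tokens, rounds)  # [0, 1, 2, 3, 4, 5, 6, 7, 8] 3
```

The output matches what the target model would produce alone, but it is reached in 3 verification rounds instead of 8 sequential steps, which is where the speedup comes from when one round costs about one forward pass.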
Generation also includes a configurable KV cache, RoPE scaling for extended context windows, repetition penalties, logit processors, and stopping criteria. Tool calling and structured output, including JSON-grammar constrained decoding, are supported on compatible models.
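Temperature, top-k, and top-p are easiest to understand as successive filters on the next-token distribution. The plain-Python sketch below shows the idea on a list of logits. It is an illustration under simplifying assumptions, not the library's code, which applies equivalent logic to tensors through logit processors, and all function names here are hypothetical.

```python
import math
import random
from typing import Dict, List, Optional


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Turn logits into probabilities; temperature must be greater than zero."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs: List[float], top_k: int = 0, top_p: float = 1.0) -> Dict[int, float]:
    """Keep the top_k most likely tokens, then the smallest set whose
    cumulative probability reaches top_p, and renormalize the survivors."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    kept: List[int] = []
    cumulative = 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample(logits: List[float], temperature: float = 1.0, top_k: int = 0,
           top_p: float = 1.0, rng: Optional[random.Random] = None) -> int:
    """Sample one token id from the filtered distribution."""
    rng = rng or random.Random()
    dist = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


# With top_p=0.7 only the two most likely tokens survive.
print(filter_top_k_top_p([0.5, 0.3, 0.15, 0.05], top_p=0.7))
```

Lower temperatures sharpen the distribution before filtering, so temperature and the two cutoffs interact: a very low temperature often leaves a single token inside the nucleus, which makes sampling behave like greedy decoding.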
As of v5.x the library ships definitions for more than 200 architectures. The roster spans encoder-only models (BERT, RoBERTa, DeBERTa, DistilBERT, ALBERT, ELECTRA), decoder-only models (GPT-2, GPT-J, GPT-NeoX, OPT, BLOOM, LLaMA 1–4, Mistral, Mixtral, Gemma, Phi, Qwen, DeepSeek, Falcon, MPT), encoder-decoder models (T5, mT5, BART, mBART, Pegasus, MarianMT, NLLB), vision models (ViT, DeiT, Swin, ConvNeXt, DINOv2, BEiT, MAE, DETR, Mask2Former), audio models (Wav2Vec2, HuBERT, Whisper, MMS, SeamlessM4T, MusicGen), and multimodal models (CLIP, BLIP, BLIP-2, LLaVA, IDEFICS, Pix2Struct, Donut, Kosmos-2, PaliGemma, Qwen-VL, Llama 3.2 Vision).
Transformers does not stand alone. The library ships with hooks into roughly a dozen sister packages, most of them maintained by Hugging Face itself:
| Library | Purpose |
|---|---|
| transformers | Model architectures, tokenizers, training, generation, pipelines. |
| tokenizers | Fast Rust-backed tokenizers (BPE, WordPiece, Unigram, byte-level). |
| datasets | Streaming-capable dataset loading; over 500,000 datasets on the Hub. |
| accelerate | Hardware-agnostic distributed training and inference; abstracts CUDA, ROCm, TPU, and Apple Silicon. |
| peft | Parameter-efficient fine-tuning (LoRA, QLoRA, IA3, prefix tuning, prompt tuning, adapters). |
| trl | RLHF and preference optimization (SFT, PPO, DPO, KTO, ORPO, GRPO). |
| diffusers | Diffusion models (Stable Diffusion, SDXL, SD 3, Flux, video diffusion). |
| optimum | Hardware-specific optimization backends: ONNX Runtime, TensorRT, OpenVINO, Habana Gaudi, AWS Neuron, Apple Neural Engine. |
| evaluate | Standardized evaluation metrics. |
| safetensors | Safe, fast tensor serialization format that has largely replaced pickle-based PyTorch checkpoints. |
| huggingface_hub | Programmatic Hub access, file downloads, repo management, Inference API client. |
| gradio | UI library for building demo apps; powers most Hugging Face Spaces. |
| smolagents | Lightweight agent framework for tool-using LLMs. |
The ecosystem is designed so that you can mix and match. For example, fine-tuning Llama 3 with QLoRA in 4-bit precision typically uses transformers for the model, bitsandbytes for the 4-bit quantization, peft for the LoRA adapters, datasets for the training data, accelerate for distributed training, and trl if the recipe involves DPO or PPO. Every step touches huggingface_hub for downloads and uploads.
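The first part of that recipe, loading a model in 4-bit and attaching LoRA adapters, can be sketched as follows. The helper name load_qlora_model, the hyperparameter values, and the target_modules names (which assume Llama-style layer naming) are illustrative assumptions. Running it requires a CUDA GPU with transformers, peft, and bitsandbytes installed, plus access to the chosen checkpoint.

```python
def load_qlora_model(model_id: str, lora_rank: int = 16):
    """Load a causal LM in 4-bit precision and wrap it with LoRA adapters."""
    # Imported lazily so that defining the helper does not require the libraries.
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # bitsandbytes handles the 4-bit quantization of the frozen base weights.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=quant_config, device_map="auto"
    )

    # peft prepares the quantized model and adds small trainable adapters.
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(
        r=lora_rank,
        lora_alpha=2 * lora_rank,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # assumption: Llama-style layer names
        task_type="CAUSAL_LM",
    )
    return tokenizer, get_peft_model(model, lora_config)
```

The returned model can then be passed to Trainer or to a trl trainer, with datasets supplying the training examples.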
Transformers is unusual in that it tries to keep a single Python interface across multiple deep learning frameworks. In practice the support is uneven, and PyTorch has clearly become the primary backend over time.
| Framework | Support level | Notes |
|---|---|---|
| PyTorch | First-class for essentially every model | The reference implementations live in modeling_*.py files. As of v5.x, PyTorch 2.4+ is required. |
| TensorFlow / Keras | Subset of older models | Implemented in modeling_tf_*.py. New architectures added since 2024 generally do not include TF versions; the project has signaled a winding-down of TF coverage. |
| JAX / Flax | Subset, mostly research-driven | Implemented in modeling_flax_*.py; notable for TPU workflows and a few high-profile models like the original Flax T5X work. |
| ONNX | Via optimum | Most common production export path for CPU and GPU inference. |
| TensorRT-LLM, vLLM, SGLang, TGI | Via separate inference servers | These projects re-implement the hot path for serving but reuse transformers model definitions and tokenizers. |
The Hub is the network effect that has made the library hard to dislodge. Loading a checkpoint with from_pretrained("meta-llama/Llama-3.1-8B-Instruct") resolves to a Hub repository, downloads the relevant files (config, tokenizer, weights, often in safetensors), caches them locally, and instantiates the model. Uploads are symmetrical: model.push_to_hub("my-org/my-model") creates a Git LFS-backed repo with a model card.
Representative Hub statistics circa 2026:
- Checkpoints such as sentence-transformers/all-MiniLM-L6-v2 dominate the long tail.

The Hub is backed by the Xet storage system (acquired through XetHub in 2024), which deduplicates large files at the chunk level and significantly speeds up uploads and downloads of multi-gigabyte model weights.
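Chunk-level deduplication can be illustrated with a small content-addressed store. The sketch below uses fixed-size chunks and SHA-256 digests purely for readability. Both choices, along with the class and variable names, are assumptions made for illustration and do not describe Xet's actual format or chunking scheme.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow


class ChunkStore:
    """Stores each distinct chunk once, keyed by its SHA-256 digest."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}

    def put(self, data: bytes) -> List[str]:
        """Split data into chunks, store unseen ones, and return the file's recipe."""
        recipe = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # no-op if already stored
            recipe.append(digest)
        return recipe

    def get(self, recipe: List[str]) -> bytes:
        """Rebuild a file from its list of chunk digests."""
        return b"".join(self.chunks[digest] for digest in recipe)


store = ChunkStore()
v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBXXXXDDDD"  # a new version in which one chunk changed
recipe_v1 = store.put(v1)
recipe_v2 = store.put(v2)
print(len(store.chunks))  # 5 distinct chunks stored instead of 8
```

Uploading the second version only adds the one chunk that changed, which is why re-uploading a slightly modified multi-gigabyte checkpoint can be much cheaper than storing it as a whole new file.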
A few numbers give a sense of the scale.
- The huggingface/transformers GitHub repository had roughly 33,000 forks as of April 2026, along with tens of thousands of contributors and pull requests.
- Most new pretrained model releases ship with transformers-compatible code.

It is fair to say that releasing a new pretrained model without a transformers-compatible implementation is now the exception rather than the rule.
A partial inventory of features that have shaped how people use the library.
- Quantization: bitsandbytes (8-bit and 4-bit, QLoRA-friendly), AWQ, GPTQ / GPTQModel, AQLM, HQQ, EETQ, FBGEMM FP8, Quanto, torchao, compressed-tensors, GGUF interoperability with llama.cpp, and built-in fine-grained FP8.
- Streaming: TextStreamer and async iterators for token-by-token output.
- Attention backends: selectable with attn_implementation="flash_attention_2" or "sdpa".
- Mixed precision and torch.compile: bf16 default on modern hardware; torch.compile supported across many architectures.

The library shows up in roughly four kinds of work.
- Research: new architectures typically arrive as a transformers PR, after which the rest of the ecosystem (vLLM, TGI, llama.cpp, MLX) adapts.
- Prototyping: the pipeline API and the Hub make it possible to wire up a working classifier, summarizer, or speech transcriber in under five minutes.
- Fine-tuning: Trainer plus peft is the most common way to adapt an open-weight LLM to a specific domain, instruction format, or task. QLoRA fine-tuning of 7B–70B parameter models on a single GPU is now routine.
- Serving: small workloads can run directly on transformers, but high-throughput LLM serving usually moves to vLLM, TGI, SGLang, or TensorRT-LLM, all of which import transformers model definitions and tokenizers.

| Tool | Primary focus | Strengths | Trade-offs |
|---|---|---|---|
| Hugging Face Transformers | Model definitions and training | Largest model catalog; standard API; tight Hub integration | Inference is slower than dedicated servers; some abstraction overhead |
| vLLM | High-throughput LLM serving | Continuous batching, PagedAttention, very fast | Inference only; smaller model coverage than transformers |
| Text Generation Inference (TGI) | Production LLM serving by Hugging Face | Production hardened, container-friendly, integrates with the Hub | Inference only; less flexible than transformers for research |
| llama.cpp | CPU and GGUF inference | Runs on laptops, phones, and edge devices; very small footprint | C++ codebase, not a Python training library |
| Ollama | Local model runner built on llama.cpp | Easiest end-user UX for local LLMs | Inference only, opinionated wrapper |
| TensorFlow Hub | TF model catalog | First-party for TF | Smaller catalog; no PyTorch support; not the standard for LLMs |
| JAX/Flax model libraries | Research and TPU work | Functional style; first-class TPU support | Smaller community; mostly subset of transformers |
| spaCy + spacy-transformers | NLP pipelines | Production-grade NLP for traditional tasks | Narrower scope; wraps transformers rather than replacing it |

The library also has limitations:
- Evaluation: benchmarking relies on separate tools (evaluate, lm-evaluation-harness).
- Inference speed: out of the box, transformers is slower than purpose-built servers. Production teams generally pair the library with vLLM or TGI rather than serving it directly.
- Edge deployment: on laptops, phones, and edge devices the library cannot match what a llama.cpp build can do for the same model.
- Custom code: some Hub models require trust_remote_code=True, which executes Python code from the model repository and carries a security implication users sometimes overlook.

Hugging Face was founded in 2016 in New York by Clément Delangue (CEO), Julien Chaumond (CTO), and Thomas Wolf (CSO). All three are French. The original product was a chatbot app aimed at teenagers; the company name and the hugging-face emoji come from that era. After releasing the chatbot's models as open source in 2017, the team noticed that the open-source release was getting more attention than the app itself. The November 2018 release of pytorch-pretrained-bert accelerated that pivot. By 2019 the company had effectively repositioned itself as an ML platform built around the library and what would become the Hugging Face Hub.
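Key corporate milestones, including the 2023 Series D round and the Gradio, Argilla, XetHub, and Pollen Robotics acquisitions, are listed in the timeline table above.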
The company's strategic position is unusual: it does not train its own frontier models and does not sell a closed model API at the scale of OpenAI or Anthropic. Instead it monetizes the Hub through enterprise plans, hosted inference, training credits, and consulting. The transformers library is the gravitational center that makes everything else possible.
A few threads stand out in the last two years.
- Agents: the smolagents library and the transformers.agents module support tool-using LLMs, including code agents that write and execute Python.
- On-device inference: integration with llama.cpp (GGUF interoperability) and ONNX Runtime Web has pushed the library toward laptops, phones, and browsers.
- … transformers-style API.