AI pricing refers to the cost structures and economic models used by providers of artificial intelligence services, particularly large language model (LLM) APIs. The dominant pricing model in the industry is token-based: customers pay per unit of text processed, with separate rates for input and output tokens. Since OpenAI introduced commercial API access in June 2020, AI pricing has undergone extraordinary deflation: the cost of accessing a capable language model fell by more than 99.7% in just over four years, from $60 per million tokens for GPT-3 Davinci in 2020 to under $0.15 per million input tokens for budget models by late 2024 [1][2].
Understanding AI pricing is essential for developers, product managers, and business leaders evaluating whether to integrate AI capabilities, which models and providers to use, and how to optimize costs as usage scales.
A token is the fundamental unit of text that language models process. Rather than reading text character by character or word by word, models use tokenizers (typically based on Byte Pair Encoding or similar algorithms) to break text into subword units that balance vocabulary size with representation efficiency [3].
As a practical approximation, one token corresponds to about four characters of English text, or roughly three-quarters of a word; 1,000 tokens is on the order of 750 words.
Token counts vary by language and content type. Code, technical text, and non-English languages tend to use more tokens per word due to less common vocabulary. Most API providers offer tokenizer tools or libraries so developers can estimate token counts before making requests.
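The rule of thumb above can be turned into a rough estimator. This is a sketch of the heuristic only; accurate counts require the provider's actual tokenizer (for example, OpenAI's tiktoken library).

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4-characters-per-token rule of thumb
    for English prose; code and non-English text often run higher."""
    return math.ceil(len(text) / chars_per_token)

# 44 characters -> about 11 tokens
print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # 11
```

Because the divisor is a heuristic, treat the result as a budgeting estimate, not a billing prediction.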
Modern AI APIs charge separately for input tokens and output tokens [1][4]. Input tokens are everything sent to the model (the system prompt, conversation history, and any attached documents); output tokens are the text the model generates in response.
Output tokens are priced higher than input tokens, typically 2x to 8x more, reflecting the greater computational cost of generating text (which requires running the model autoregressively, one token at a time) versus processing input (which can be parallelized).
The token-based model aligns costs with actual computational resource consumption. Each token processed requires a forward pass through the model's neural network, consuming GPU compute, memory bandwidth, and energy. Per-token pricing gives customers fine-grained control over costs and allows providers to price different models according to their computational requirements. Larger, more capable models cost more per token because they require more parameters and floating-point operations per inference step.
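The per-token billing arithmetic is straightforward. The function below is just that arithmetic, not any provider's SDK; the example uses illustrative flagship rates of $2.50/$15.00 per million tokens.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Dollar cost of one request, with rates given in $ per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# 2,000 input tokens and 500 output tokens at $2.50/$15.00 per 1M:
cost = request_cost(2_000, 500, input_rate=2.50, output_rate=15.00)
print(f"${cost:.4f}")  # $0.0125
```

Note how the 3x larger input contributes less than half the cost: output rates dominate for generation-heavy workloads.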
The following tables compare pricing for flagship, mid-tier, budget, and reasoning models from major AI API providers as of March 2026 [1][4][5][6][7][8].

Flagship models:

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|---|
| OpenAI | GPT-5.4 | $2.50 | $15.00 | 1,050,000 |
| OpenAI | GPT-5.2 | $1.75 | $14.00 | 400,000 |
| Anthropic | Claude Opus 4.6 | $5.00 | $25.00 | 1,000,000 |
| Google | Gemini 3.1 Pro | $2.00 | $12.00 | 1,000,000 |
| Google | Gemini 2.5 Pro | $1.25 | $10.00 | 1,000,000 |
| xAI | Grok 4.1 | $0.20 | $0.50 | 131,072 |

Mid-tier models:

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|---|
| OpenAI | GPT-4.1 | $2.00 | $8.00 | 1,000,000 |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | 1,000,000 |
| Google | Gemini 2.5 Flash | $0.30 | $2.50 | 1,000,000 |
| Mistral | Mistral Large 3 | $0.50 | $1.50 | 128,000 |

Budget models:

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|---|
| OpenAI | GPT-4.1 Nano | $0.10 | $0.40 | 1,000,000 |
| OpenAI | GPT-4o Mini | $0.15 | $0.60 | 128,000 |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | 200,000 |
| Google | Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1,000,000 |
| DeepSeek | DeepSeek V3.2 | $0.28 | $0.42 | 128,000 |
| DeepSeek | DeepSeek R1 | $0.50 | $2.18 | 128,000 |
| Mistral | Devstral Small 2 | $0.10 | $0.30 | 128,000 |
| Meta (via Groq) | Llama 4 Scout | $0.11 | $0.34 | 1,000,000 |

Reasoning models:

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|---|
| OpenAI | o3 | $0.40 | $1.60 | General reasoning; output includes thinking tokens |
| OpenAI | o4-mini | $1.10 | $4.40 | Cost-efficient reasoning with vision |
| Anthropic | Claude Sonnet 4.6 (extended thinking) | $3.00 | $15.00 | Thinking tokens at standard output rate |
| DeepSeek | R1 | $0.50 | $2.18 | Open-weight reasoning model |
AI API pricing has experienced a period of deflation unmatched by almost any other technology sector. The cost of accessing capable language models has fallen at a pace that dwarfs even Moore's Law, with some analyses suggesting AI inference costs have improved at 50x the speed of semiconductor scaling [9].
The following table tracks the cost trajectory of OpenAI's models, illustrating the trend across the industry [1][2][10].
| Date | Model | Input Price (per 1M tokens) | Relative Cost (vs. GPT-3 Davinci) |
|---|---|---|---|
| June 2020 | GPT-3 Davinci | $60.00 | 1.0x (baseline) |
| November 2022 | GPT-3 Davinci-002 | $20.00 | 0.33x |
| March 2023 | GPT-3.5 Turbo | $2.00 | 0.033x |
| November 2023 | GPT-3.5 Turbo (updated) | $0.50 | 0.008x |
| May 2024 | GPT-4o | $5.00 | 0.083x |
| July 2024 | GPT-4o Mini | $0.15 | 0.0025x |
| April 2025 | GPT-4.1 Nano | $0.10 | 0.0017x |
| August 2025 | GPT-5 | $1.25 | 0.021x |
| March 2026 | GPT-5.4 Nano | $0.20 | 0.003x |
From GPT-3 Davinci at $60 per million tokens to GPT-4o Mini at $0.15 per million, the input cost dropped by a factor of 400 in just four years. Even accounting for the fact that newer models are significantly more capable (making direct comparison imperfect), the trend is dramatic.
Several factors have contributed to this historic cost reduction:
Hardware improvements: Each generation of GPU (from Nvidia A100 to H100 to B200) has delivered substantial improvements in inference throughput per dollar. Nvidia's Blackwell architecture, shipping from late 2024, roughly doubled inference performance per watt compared to Hopper [11].
Algorithmic efficiency: Techniques like Flash Attention, grouped-query attention, speculative decoding, and mixture-of-experts architectures have dramatically reduced the computational cost per token. DeepSeek's V3 model demonstrated that a mixture-of-experts architecture could achieve near-frontier performance at a fraction of the training and inference cost [12].
Quantization: Reducing model weights from 16-bit to 8-bit or 4-bit precision roughly halves or quarters memory requirements and inference cost, often with minimal quality degradation.
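The memory arithmetic behind quantization can be sketched directly: weight storage is parameter count times bytes per weight. Real deployments add activation and KV-cache memory on top, so these figures are a floor, not a total.

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate weight storage: parameters x (bits / 8) bytes."""
    return params_billion * 1e9 * (bits / 8) / 1e9

# A 70B-parameter model at three common precisions:
for bits in (16, 8, 4):
    print(f"70B model at {bits}-bit: {weight_memory_gb(70, bits):.0f} GB")
# 140 GB, 70 GB, 35 GB respectively
```

Halving the bits halves the bytes that must stream through memory each token, which is why quantization cuts inference cost roughly in proportion.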
Scale and competition: As more providers entered the market and usage volumes grew, economies of scale and competitive pressure drove prices down. The emergence of DeepSeek as an ultra-low-cost provider in early 2025 intensified pricing competition across the industry.
Smaller, better models: Model distillation and improved training techniques have produced smaller models (like GPT-4o Mini and Claude Haiku) that match the performance of much larger predecessors at a fraction of the cost.
While budget and mid-tier model prices have fallen dramatically, frontier model pricing tells a more nuanced story. OpenAI's GPT-5.2 Pro is priced at $21/$168 per million tokens, and Anthropic's Claude Opus 4.6 in fast mode costs $30/$150. This creates a "K-shaped" dynamic where the most capable models maintain premium pricing while the broader market races toward zero [13]. For the hardest tasks requiring maximum capability, AI costs per query can still be substantial.
Developers and organizations have several strategies available to reduce AI API costs without sacrificing output quality.
Both OpenAI and Anthropic offer prompt caching that reuses previously processed input prefixes [1][4]. When a request shares the same opening content (system prompt, reference documents) as a recent request, the cached tokens are read at a steep discount.
| Provider | Cache Hit Discount |
|---|---|
| Anthropic | 90% off base input price |
| OpenAI (GPT-5 family) | 90% off |
| OpenAI (GPT-4.1 family) | 75% off |
| OpenAI (GPT-4o / o-series) | 50% off |
| Google Gemini | Up to 90% off |
Prompt caching is most effective for applications with stable, lengthy system prompts, such as customer support bots with extensive knowledge bases, document analysis pipelines processing the same document with different queries, or coding assistants with large codebase context.
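The effect of caching on the blended input rate is a weighted average. The function below is an illustrative sketch of that arithmetic, not a provider API; the 9,000-token cached prefix and $3.00 base rate in the example are assumptions.

```python
def effective_input_rate(base_rate: float, cached_fraction: float,
                         cache_discount: float) -> float:
    """Blended $/1M-token input rate when `cached_fraction` of input tokens
    hit the cache at `cache_discount` (e.g. 0.90 = 90% off)."""
    cached_cost = cached_fraction * base_rate * (1 - cache_discount)
    uncached_cost = (1 - cached_fraction) * base_rate
    return cached_cost + uncached_cost

# A support bot with a 9,000-token cached system prompt plus 1,000 fresh
# user tokens (90% cached), at a $3.00 base rate and 90% cache discount:
print(f"{effective_input_rate(3.00, cached_fraction=0.9, cache_discount=0.9):.2f}")  # 0.57
```

In this example the effective input rate falls from $3.00 to $0.57 per million tokens, an 81% saving on input.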
Batch APIs offered by OpenAI, Anthropic, and Google process requests asynchronously at a 50% discount [1][4]. This is ideal for workloads that do not require real-time responses, such as offline evaluation runs, bulk document summarization, and periodic data-labeling or content-generation jobs.
Batch and caching discounts typically stack, enabling combined savings exceeding 90%.
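Whether the two discounts stack varies by provider, but where they do, the arithmetic behind the combined figure is multiplicative, shown here for a cached input token routed through the batch API:

```python
batch_discount = 0.50   # batch API: 50% off
cache_discount = 0.90   # cache hit: 90% off input

# Each discount multiplies the remaining price, so the fractions compound:
combined = 1 - (1 - batch_discount) * (1 - cache_discount)
print(f"{combined:.0%}")  # 95%
```

Paying 50% of 10% of the base rate leaves 5% of the original price, hence combined savings of 95% on cached input tokens.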
One of the highest-impact optimization strategies is matching the model to the task complexity. A task that GPT-4.1 Nano ($0.10/$0.40) handles adequately should not be routed to GPT-5.4 ($2.50/$15.00). Many production systems use a router that sends simple queries to budget models and escalates complex ones to more capable (and expensive) models.
| Task Complexity | Recommended Tier | Example Models | Typical Cost |
|---|---|---|---|
| Simple classification, extraction | Budget | GPT-4.1 Nano, Haiku 4.5, Flash-Lite | $0.10-$1.00 per 1M input |
| General conversation, summarization | Mid-tier | GPT-4.1, Sonnet 4.6, Gemini 2.5 Flash | $0.30-$3.00 per 1M input |
| Complex reasoning, research, analysis | Flagship | GPT-5.4, Opus 4.6, Gemini 3.1 Pro | $2.00-$5.00 per 1M input |
| Mathematical/scientific reasoning | Reasoning | o3, o4-mini, Sonnet extended thinking, R1 | $0.40-$3.00 per 1M input |
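A tiered router can be sketched in a few lines. The model names and input rates below come from the comparison tables above, but the complexity scores and thresholds are invented for this sketch; production routers typically score queries with a cheap classifier model.

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    input_rate: float  # $ per 1M input tokens

# Illustrative tier table; rates taken from the comparison above.
ROUTES = {
    "budget":   Route("gpt-4.1-nano", 0.10),
    "mid":      Route("gemini-2.5-flash", 0.30),
    "flagship": Route("gpt-5.4", 2.50),
}

def pick_route(complexity: float) -> Route:
    """Send cheap tasks to budget models, escalate hard ones to flagships.
    `complexity` is a 0-1 score; the cutoffs here are arbitrary examples."""
    if complexity < 0.3:
        return ROUTES["budget"]
    if complexity < 0.7:
        return ROUTES["mid"]
    return ROUTES["flagship"]

print(pick_route(0.2).model)  # gpt-4.1-nano
```

If most traffic is simple, routing shifts the bulk of volume onto the $0.10 tier while reserving the 25x-more-expensive flagship for the queries that need it.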
Model distillation involves using a larger, more expensive model to generate training data, then fine-tuning a smaller, cheaper model on that data. For example, a developer might use GPT-5.4 to generate 10,000 high-quality responses for a specific task, then fine-tune GPT-4.1 Mini on those examples. The resulting fine-tuned model can approach the larger model's quality on that narrow task at a fraction of the inference cost.
Running open-source models like Meta's Llama or DeepSeek V3 on self-hosted infrastructure can eliminate per-token API costs entirely. Quantized versions of these models (4-bit or 8-bit precision) run on consumer or mid-range hardware while maintaining much of the original model's capability.
The economics of self-hosting versus API usage depend heavily on volume [14][15].
| Factor | Self-Hosted (e.g., Llama on GPU) | API (e.g., OpenAI, Anthropic) |
|---|---|---|
| Fixed costs | High ($3,000-$5,000/month per GPU) | None |
| Marginal cost per token | Near zero at full utilization | Fixed per-token rate |
| Break-even volume | 50+ million tokens/month (realistic) | N/A |
| Engineering overhead | Significant (setup, maintenance, monitoring) | Minimal |
| Model quality | Open-source frontier (Llama 4, DeepSeek V3) | Proprietary frontier (GPT-5, Claude, Gemini) |
| Latency control | Full control over hardware and batching | Subject to provider's infrastructure |
| Data privacy | Complete control | Data sent to third-party servers |
Self-hosting a 70B-parameter model on cloud A100 GPUs costs approximately $3,000-$5,000 per month but can deliver inference at roughly $0.07 per million tokens at full utilization. The realistic break-even point, accounting for engineering time and operational overhead, is around 50+ million tokens per month for most organizations [15]. Below that volume, API access is almost always more cost-effective.
For reference, Meta estimates inference costs for Llama models at $0.30-$0.49 per million tokens on a single host [14], and third-party hosting providers like Groq offer Llama 4 Scout at $0.11/$0.34 per million tokens, often cheaper than self-hosting for moderate volumes.
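The break-even arithmetic can be made explicit. The $0.07-per-million self-hosted marginal cost and the $4,000 monthly GPU bill come from the figures above; the $80 blended API rate in the example is an illustrative assumption (roughly flagship fast-mode territory), and the pure arithmetic ignores engineering overhead, which in practice raises the break-even further.

```python
def breakeven_tokens_per_month(fixed_monthly_cost: float,
                               api_rate_per_m: float,
                               selfhost_rate_per_m: float = 0.07) -> float:
    """Monthly token volume at which self-hosting's fixed cost is repaid
    by its lower marginal rate (rates in $ per 1M tokens)."""
    savings_per_m = api_rate_per_m - selfhost_rate_per_m
    return fixed_monthly_cost / savings_per_m * 1_000_000

# $4,000/month of GPUs vs an $80-per-1M blended API rate:
print(f"{breakeven_tokens_per_month(4000, 80.0):,.0f} tokens/month")  # ≈ 50 million
```

The displaced API rate dominates the result: against a budget API at $0.15 per million, the same formula gives a break-even near 50 billion tokens per month, which is why low-volume workloads almost never justify self-hosting.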
Several providers offer free access tiers to attract developers [16][17].
| Provider | Free Tier Details | Limitations |
|---|---|---|
| Google Gemini | No credit card required; 5-15 requests/min, 250K tokens/min, 1,000 requests/day | Data may train models; not available in EU |
| OpenAI | $5 credit for new accounts (expires after 3 months) | Limited models and rate limits |
| Anthropic | Small free credit for evaluation | Limited to testing purposes |
| DeepSeek | Free research access with rate limits | Throttled during peak hours |
| Mistral | Free tier for small models (Pixtral, Mistral Nemo) | Very low rate limits |
| Groq | Free tier with rate limits for open-source models | Limited tokens/min |
Google's free tier is the most generous for developers, offering enough capacity for prototyping and small-scale production without any payment requirement. OpenAI's free credit is time-limited and primarily useful for initial evaluation.
Enterprise customers at all major providers can negotiate custom pricing arrangements that differ from the published rates [18].
Common enterprise pricing features include committed-spend and volume discounts, provisioned (reserved) throughput, higher rate limits, custom data-retention and privacy terms, and uptime SLAs.
Enterprise pricing is typically negotiated individually and is not publicly disclosed. Organizations processing billions of tokens monthly may receive discounts of 30-50% or more off published rates.
In 2024, venture capitalist Jeremy Blumenfeld proposed what he called "Blumenfeld's Law": the observation that the cost of AI inference drops by roughly 10x every 12 to 18 months, far outpacing Moore's Law's traditional 2x improvement every 18-24 months [9][19]. This observation, while not a rigorous physical law, captures the empirical trend visible in the pricing data.
The underlying dynamics supporting this rate of deflation are the factors discussed above: successive GPU generations, algorithmic efficiency gains, quantization and distillation, and intensifying competition among providers.
If the trend holds, by 2028 the cost of running a million tokens through a frontier model could drop below $0.10, making AI inference nearly free for most applications. However, as noted in the K-shaped pricing discussion, the most capable models at the frontier may continue to command premium prices even as the broad market commoditizes.
ARK Invest's research supports a similar thesis, finding that AI training costs have improved at 50x the pace of Moore's Law, with the cost to train benchmark models dropping roughly 10x every year between 2017 and 2023 [9].
While text tokens dominate AI API pricing discussions, providers also offer pricing for other modalities.
| Provider | Model | Price per Image | Notes |
|---|---|---|---|
| OpenAI | DALL-E 3 | $0.040-$0.120 | Varies by resolution and quality |
| Stability AI | Stable Image Ultra | $0.08 per generation | Via API |
| Google | Imagen 3 | Included in Gemini API | Billed as tokens |
| Provider | Service | Price | Unit |
|---|---|---|---|
| OpenAI | Whisper (speech-to-text) | $0.006 | Per minute |
| OpenAI | TTS-1 | $15.00 | Per 1M characters |
| OpenAI | TTS-1-HD | $30.00 | Per 1M characters |
| ElevenLabs | Speech synthesis | $0.18-$0.30 | Per 1,000 characters |
| Provider | Model | Price per 1M tokens |
|---|---|---|
| OpenAI | text-embedding-3-small | $0.02 |
| OpenAI | text-embedding-3-large | $0.13 |
| Google | text-embedding-005 | Free (with rate limits) |
| Cohere | embed-v4 | $0.10 |
As of early 2026, the AI pricing landscape is characterized by several key dynamics [5][6][20].
Continued deflation: LLM API prices dropped roughly 80% across the board between 2025 and early 2026. Budget models like GPT-4.1 Nano and DeepSeek V3.2 offer capabilities comparable to GPT-4 at prices two orders of magnitude lower.
Provider proliferation: Beyond the major providers (OpenAI, Anthropic, Google), developers now have access to DeepSeek, Mistral, xAI (Grok), Cohere, and dozens of inference providers hosting open-source models (Groq, Together AI, Fireworks AI). This fragmentation has intensified price competition.
Multi-provider strategies: Sophisticated development teams increasingly use multiple providers, routing different tasks to whichever model offers the best cost-performance ratio. Tools like OpenRouter aggregate models from multiple providers, enabling easy switching.
Open-source pressure: Meta's Llama, DeepSeek's V3, and Mistral's models continue to push commercial providers to reduce prices. When a capable open model becomes available for free download, the ceiling on API pricing for equivalent capability drops immediately.
The enterprise premium: While per-token prices fall, total enterprise AI spending continues to rise as organizations deploy AI across more use cases and process higher volumes. The market is growing both in total revenue and in the number of tokens consumed, even as unit economics improve for buyers.
Emerging cost dimensions: New features add new cost categories. Web search ($10 per 1,000 searches on Anthropic), computer use (token-heavy screenshot processing), and code execution (per-minute billing) introduce costs beyond simple input/output token pricing. Developers must now account for these tool-use costs in their budgets.
The AI pricing landscape will likely continue its deflationary trajectory, driven by hardware improvements, algorithmic innovation, and fierce competition. For developers and organizations, the practical implication is clear: AI capabilities that were prohibitively expensive just two years ago are now accessible at commodity prices, and the cost of integrating AI into products and workflows is lower than ever.