# AI Pricing

> Source: https://aiwiki.ai/wiki/ai_pricing
> Updated: 2026-04-28
> Categories: Artificial Intelligence, Large Language Models
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**AI pricing** refers to the cost structures and economic models used by providers of [artificial intelligence](/wiki/artificial_intelligence) services, particularly [large language model](/wiki/large_language_model) (LLM) APIs. The dominant pricing model in the industry is token-based: customers pay per unit of text processed, with separate rates for input and output tokens. Since [OpenAI](/wiki/openai) introduced commercial API access in June 2020, AI pricing has undergone extraordinary deflation: the cost of accessing a capable language model fell by more than 99.7% in just over four years, from $60 per million tokens for [GPT-3](/wiki/gpt-3) Davinci in 2020 to under $0.15 per million input tokens for budget models by late 2024 [1][2].

Understanding AI pricing is essential for developers, product managers, and business leaders evaluating whether to integrate AI capabilities, which models and providers to use, and how to optimize costs as usage scales.

## Token-Based Pricing Model

### What Is a Token?

A [token](/wiki/tokenization) is the fundamental unit of text that language models process. Rather than reading text character by character or word by word, models use tokenizers (typically based on Byte Pair Encoding or similar algorithms) to break text into subword units that balance vocabulary size with representation efficiency [3].

As a practical approximation:

- 1 token is roughly 4 characters of English text
- 1 token is roughly 0.75 words
- 100 tokens is roughly 75 words
- 1 million tokens is roughly 750,000 words, or about 3,000 pages of text

Token counts vary by language and content type. Code, technical text, and non-English languages tend to use more tokens per word due to less common vocabulary. Most API providers offer tokenizer tools or libraries so developers can estimate token counts before making requests.
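
The rules of thumb above can be turned into a quick estimator. The sketch below is only a heuristic for English prose; the function names are illustrative, and real counts should come from the provider's own tokenizer.

```python
import math

# Rules of thumb from the list above; they hold only roughly, and only for English text.
CHARS_PER_TOKEN = 4.0
WORDS_PER_TOKEN = 0.75


def estimate_tokens_from_chars(text: str) -> int:
    """Rough token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Rough token estimate based on word count."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


if __name__ == "__main__":
    print(estimate_tokens_from_words(75))        # 75 words is about 100 tokens
    print(estimate_tokens_from_chars("a" * 400)) # 400 characters is about 100 tokens
```

Code and non-English text will usually come out higher than this estimate suggests.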

### Input vs. Output Tokens

Modern AI APIs charge separately for input tokens and output tokens [1][4]:

- **Input tokens** include the system prompt, user message, conversation history, tool definitions, and any attached documents or images. These represent the context the model must read and understand.
- **Output tokens** include the model's generated response. For reasoning models (like OpenAI's o-series), output tokens also include internal "thinking" tokens used for [chain-of-thought reasoning](/wiki/chain_of_thought), which are billed but not visible in the API response.

Output tokens are priced higher than input tokens, typically 2x to 8x more, reflecting the greater computational cost of generating text (which requires running the model autoregressively, one token at a time) versus processing input (which can be parallelized).
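
As a minimal sketch of how the two rates combine, the function below computes the cost of a single request. The function and parameter names are illustrative; the example rates ($2.50 input, $15.00 output per million tokens) are the GPT-5.4 figures from the flagship table in this article, and hidden reasoning tokens are assumed to be billed at the output rate.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    reasoning_tokens: int = 0,
) -> float:
    """Cost of one request in USD, given per-million-token prices.

    reasoning_tokens are hidden "thinking" tokens, assumed here to be
    billed at the output rate on top of the visible output.
    """
    billed_output = output_tokens + reasoning_tokens
    total = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000


if __name__ == "__main__":
    # 2,000 input tokens and 500 output tokens at $2.50 / $15.00 per million
    print(request_cost_usd(2_000, 500, 2.50, 15.00))
```

In this example the 500 output tokens cost more ($0.0075) than the 2,000 input tokens ($0.0050), which is why long generations and reasoning-heavy requests dominate many bills.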

### Why Per-Token Pricing?

The token-based model aligns costs with actual computational resource consumption. Each token processed requires a forward pass through the model's neural network, consuming GPU compute, memory bandwidth, and energy. Per-token pricing gives customers fine-grained control over costs and allows providers to price different models according to their computational requirements. Larger, more capable models cost more per token because they require more parameters and floating-point operations per inference step.

## Pricing Comparison of Major Providers

The following tables compare pricing for flagship, mid-tier, budget, and reasoning models from major AI API providers as of March 2026 [1][4][5][6][7][8].

### Flagship Models

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|---|
| [OpenAI](/wiki/openai_api) | GPT-5.4 | $2.50 | $15.00 | 1,050,000 |
| OpenAI | GPT-5.2 | $1.75 | $14.00 | 400,000 |
| [Anthropic](/wiki/anthropic_api) | Claude Opus 4.6 | $5.00 | $25.00 | 1,000,000 |
| [Google](/wiki/google_deepmind) | Gemini 3.1 Pro | $2.00 | $12.00 | 1,000,000 |
| Google | Gemini 2.5 Pro | $1.25 | $10.00 | 1,000,000 |
| [xAI](/wiki/xai) | Grok 4.1 | $0.20 | $0.50 | 131,072 |

### Mid-Tier Models

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|---|
| OpenAI | GPT-4.1 | $2.00 | $8.00 | 1,000,000 |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | 1,000,000 |
| Google | Gemini 2.5 Flash | $0.30 | $2.50 | 1,000,000 |
| [Mistral](/wiki/mistral_ai) | Mistral Large 3 | $0.50 | $1.50 | 128,000 |

### Budget Models

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|---|
| OpenAI | GPT-4.1 Nano | $0.10 | $0.40 | 1,000,000 |
| OpenAI | GPT-4o Mini | $0.15 | $0.60 | 128,000 |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | 200,000 |
| Google | Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1,000,000 |
| [DeepSeek](/wiki/deepseek) | DeepSeek V3.2 | $0.28 | $0.42 | 128,000 |
| DeepSeek | DeepSeek R1 | $0.50 | $2.18 | 128,000 |
| Mistral | Devstral Small 2 | $0.10 | $0.30 | 128,000 |
| [Meta](/wiki/meta_ai) (via Groq) | Llama 4 Scout | $0.11 | $0.34 | 1,000,000 |

### Reasoning Models

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|---|
| OpenAI | o3 | $0.40 | $1.60 | General reasoning; output includes thinking tokens |
| OpenAI | o4-mini | $1.10 | $4.40 | Cost-efficient reasoning with vision |
| Anthropic | Claude Sonnet 4.6 (extended thinking) | $3.00 | $15.00 | Thinking tokens at standard output rate |
| DeepSeek | R1 | $0.50 | $2.18 | Open-weight reasoning model |

## Pricing Trends and Deflation

AI API pricing has experienced a period of deflation unmatched by almost any other technology sector. The cost of accessing capable language models has fallen at a pace that dwarfs even [Moore's Law](/wiki/moores_law), with one analysis finding that AI training costs have improved at 50x the speed of semiconductor scaling [9].

### Historical Price Points

The following table tracks the cost trajectory of OpenAI's models, illustrating the trend across the industry [1][2][10].

| Date | Model | Input Price (per 1M tokens) | Relative Cost (vs. GPT-3 Davinci) |
|---|---|---|---|
| June 2020 | GPT-3 Davinci | $60.00 | 1.0x (baseline) |
| November 2022 | GPT-3 Davinci-002 | $20.00 | 0.33x |
| March 2023 | GPT-3.5 Turbo | $2.00 | 0.033x |
| November 2023 | GPT-3.5 Turbo (updated) | $0.50 | 0.008x |
| May 2024 | GPT-4o | $5.00 | 0.083x |
| July 2024 | GPT-4o Mini | $0.15 | 0.0025x |
| April 2025 | GPT-4.1 Nano | $0.10 | 0.0017x |
| August 2025 | GPT-5 | $1.25 | 0.021x |
| March 2026 | GPT-5.4 Nano | $0.20 | 0.003x |

From GPT-3 Davinci at $60 per million tokens to GPT-4o Mini at $0.15 per million, the input cost dropped by a factor of 400 in just four years. Even accounting for the fact that newer models are significantly more capable (making direct comparison imperfect), the trend is dramatic.
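
The "Relative Cost" column and the factor quoted above follow from simple ratios against the $60 baseline. A short sketch, with illustrative function names, that reproduces them:

```python
BASELINE_USD_PER_M = 60.00  # GPT-3 Davinci, June 2020 (from the table above)


def relative_cost(price_per_m: float) -> float:
    """Price as a fraction of the GPT-3 Davinci baseline."""
    return price_per_m / BASELINE_USD_PER_M


def reduction_factor(price_per_m: float) -> float:
    """How many times cheaper than the baseline a price is."""
    return BASELINE_USD_PER_M / price_per_m


def percent_reduction(price_per_m: float) -> float:
    """Percentage drop relative to the baseline."""
    return (1 - price_per_m / BASELINE_USD_PER_M) * 100


if __name__ == "__main__":
    for model, price in [("GPT-4o", 5.00), ("GPT-4o Mini", 0.15), ("GPT-4.1 Nano", 0.10)]:
        print(f"{model}: {relative_cost(price):.4f}x, "
              f"{reduction_factor(price):.0f}x cheaper, "
              f"{percent_reduction(price):.2f}% lower")
```

For GPT-4o Mini this gives a relative cost of 0.0025x, a reduction factor of 400, and a 99.75% price drop.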

### Drivers of Price Deflation

Several factors have contributed to this historic cost reduction:

1. **Hardware improvements**: Each generation of [GPU](/wiki/gpu) (from [Nvidia](/wiki/nvidia) A100 to H100 to B200) has delivered substantial improvements in inference throughput per dollar. Nvidia's Blackwell architecture, shipping from late 2024, roughly doubled inference performance per watt compared to Hopper [11].

2. **Algorithmic efficiency**: Techniques like [Flash Attention](/wiki/flash_attention), [grouped-query attention](/wiki/grouped_query_attention), [speculative decoding](/wiki/speculative_decoding), and [mixture-of-experts](/wiki/moe) architectures have dramatically reduced the computational cost per token. DeepSeek's V3 model demonstrated that a mixture-of-experts architecture could achieve near-frontier performance at a fraction of the training and inference cost [12].

3. **[Quantization](/wiki/mixed_precision_training)**: Reducing model weights from 16-bit to 8-bit or 4-bit precision roughly halves or quarters memory requirements and inference cost, often with minimal quality degradation.

4. **Scale and competition**: As more providers entered the market and usage volumes grew, economies of scale and competitive pressure drove prices down. The emergence of DeepSeek as an ultra-low-cost provider in early 2025 intensified pricing competition across the industry.

5. **Smaller, better models**: Model distillation and improved training techniques have produced smaller models (like GPT-4o Mini and [Claude](/wiki/claude) Haiku) that match the performance of much larger predecessors at a fraction of the cost.

### The K-Shaped Trend

While budget and mid-tier model prices have fallen dramatically, frontier model pricing tells a more nuanced story. OpenAI's [GPT-5.2](/wiki/gpt-5) Pro is priced at $21/$168 per million tokens, and Anthropic's Claude Opus 4.6 in fast mode costs $30/$150. This creates a "K-shaped" dynamic where the most capable models maintain premium pricing while the broader market races toward zero [13]. For the hardest tasks requiring maximum capability, AI costs per query can still be substantial.

## Cost Optimization Strategies

Developers and organizations have several strategies available to reduce AI API costs without sacrificing output quality.

### Prompt Caching

[OpenAI](/wiki/openai_api), [Anthropic](/wiki/anthropic_api), and Google all offer prompt caching that reuses previously processed input prefixes [1][4][16]. When a request shares the same opening content (system prompt, reference documents) as a recent request, the cached tokens are read at a steep discount.

| Provider | Cache Hit Discount |
|---|---|
| Anthropic | 90% off base input price |
| OpenAI (GPT-5 family) | 90% off |
| OpenAI (GPT-4.1 family) | 75% off |
| OpenAI (GPT-4o / o-series) | 50% off |
| Google Gemini | Up to 90% off |

[Prompt](/wiki/prompt) caching is most effective for applications with stable, lengthy system prompts, such as customer support bots with extensive knowledge bases, document analysis pipelines processing the same document with different queries, or coding assistants with large codebase context.

### Batching

Batch APIs offered by OpenAI, Anthropic, and Google process requests asynchronously at a 50% discount [1][4]. This is ideal for workloads that do not require real-time responses:

- Bulk classification or labeling
- Content generation pipelines
- Evaluation and testing suites
- Data extraction from large document sets

Batch and caching discounts typically stack, enabling combined savings exceeding 90%.
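
To illustrate how the two discounts combine, the sketch below computes an effective input price. It assumes the discounts stack multiplicatively and ignores any cache-write surcharge or minimum cacheable length a provider may impose; the function name and parameters are illustrative, not part of any provider's API.

```python
def effective_input_price(
    base_price_per_m: float,
    cached_fraction: float,
    cache_discount: float,
    batch_discount: float = 0.0,
) -> float:
    """Blended input price per million tokens.

    cached_fraction: share of input tokens served from cache (0.0 to 1.0)
    cache_discount:  discount on cached tokens, e.g. 0.90 for 90% off
    batch_discount:  discount for batch processing, e.g. 0.50 for 50% off
    Assumes the two discounts multiply; check provider terms before relying on this.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    uncached_part = 1.0 - cached_fraction
    cached_part = cached_fraction * (1.0 - cache_discount)
    return base_price_per_m * (uncached_part + cached_part) * (1.0 - batch_discount)


if __name__ == "__main__":
    # $3.00 base input price, 80% of tokens cached at 90% off, batch at 50% off
    print(effective_input_price(3.00, 0.80, 0.90, 0.50))
```

With every input token cached, a 90% cache discount and a 50% batch discount together reduce a $3.00 base rate to $0.15, a 95% saving on input. Output tokens are unaffected by caching and receive only the batch discount.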

### Model Selection and Routing

One of the highest-impact optimization strategies is matching the model to the task complexity. A task that [GPT-4.1](/wiki/gpt-4) Nano ($0.10/$0.40) handles adequately should not be routed to GPT-5.4 ($2.50/$15.00). Many production systems use a router that sends simple queries to budget models and escalates complex ones to more capable (and expensive) models.

| Task Complexity | Recommended Tier | Example Models | Typical Cost |
|---|---|---|---|
| Simple classification, extraction | Budget | GPT-4.1 Nano, Haiku 4.5, Flash-Lite | $0.10-$1.00 per 1M input |
| General conversation, summarization | Mid-tier | GPT-4.1, Sonnet 4.6, Gemini 2.5 Flash | $0.30-$3.00 per 1M input |
| Complex reasoning, research, analysis | Flagship | GPT-5.4, Opus 4.6, Gemini 3.1 Pro | $2.00-$5.00 per 1M input |
| Mathematical/scientific reasoning | Reasoning | o3, o4-mini, Sonnet extended thinking, R1 | $0.40-$3.00 per 1M input |
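
A router of this kind can be sketched in a few lines. The keyword heuristic below is purely hypothetical and stands in for whatever classifier a production system would use (often a small model); the tier names and prices are taken from the tables in this article.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    example_model: str
    input_price_per_m: float


# One example model per tier, using prices from this article's tables.
TIERS = {
    "simple": Tier("budget", "GPT-4.1 Nano", 0.10),
    "general": Tier("mid-tier", "GPT-4.1", 2.00),
    "complex": Tier("flagship", "GPT-5.4", 2.50),
    "math": Tier("reasoning", "o4-mini", 1.10),
}

# Hypothetical keyword sets; a real router would use a learned classifier.
MATH_WORDS = {"prove", "derive", "equation"}
COMPLEX_WORDS = {"analyze", "research", "compare"}
SIMPLE_WORDS = {"classify", "extract", "label"}


def classify(prompt: str) -> str:
    """Map a prompt to a task-complexity bucket using whole-word matches."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    if words & MATH_WORDS:
        return "math"
    if words & COMPLEX_WORDS:
        return "complex"
    if words & SIMPLE_WORDS:
        return "simple"
    return "general"


def route(prompt: str) -> Tier:
    """Pick the tier whose example model should handle the prompt."""
    return TIERS[classify(prompt)]


if __name__ == "__main__":
    print(route("Classify this review as positive or negative"))
    print(route("Prove that the sum of two even numbers is even"))
```

Matching on whole words rather than substrings avoids misrouting prompts such as "improve this sentence", which contains "prove" as a substring.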

### Distillation

Model [distillation](/wiki/knowledge_distillation) involves using a larger, more expensive model to generate training data, then fine-tuning a smaller, cheaper model on that data. For example, a developer might use GPT-5.4 to generate 10,000 high-quality responses for a specific task, then fine-tune GPT-4.1 Mini on those examples. The resulting fine-tuned model can approach the larger model's quality on that narrow task at a fraction of the inference cost.

### Quantized and Open-Source Models

Running [open-source](/wiki/open_source_ai) models like Meta's [Llama](/wiki/llama) or DeepSeek V3 on self-hosted infrastructure can eliminate per-token API costs entirely. Quantized versions of these models (4-bit or 8-bit precision) run on consumer or mid-range hardware while maintaining much of the original model's capability.

### Self-Hosting vs. API: Cost Comparison

The economics of self-hosting versus API usage depend heavily on volume [14][15].

| Factor | Self-Hosted (e.g., Llama on GPU) | API (e.g., OpenAI, Anthropic) |
|---|---|---|
| Fixed costs | High ($3,000-$5,000/month per GPU) | None |
| Marginal cost per token | Near zero at full utilization | Fixed per-token rate |
| Break-even volume | 50+ million tokens/month (realistic) | N/A |
| Engineering overhead | Significant (setup, maintenance, monitoring) | Minimal |
| Model quality | Open-source frontier (Llama 4, DeepSeek V3) | Proprietary frontier (GPT-5, Claude, Gemini) |
| Latency control | Full control over hardware and batching | Subject to provider's infrastructure |
| Data privacy | Complete control | Data sent to third-party servers |

Self-hosting a 70B-parameter model on cloud A100 GPUs costs approximately $3,000-$5,000 per month but can deliver inference at roughly $0.07 per million tokens at full utilization. The realistic break-even point, accounting for engineering time and operational overhead, is around 50+ million tokens per month for most organizations [15]. Below that volume, API access is almost always more cost-effective.

For reference, Meta estimates inference costs for Llama models at $0.30-$0.49 per million tokens on a single host [14], and third-party hosting providers like [Groq](/wiki/groq_hardware) offer Llama 4 Scout at $0.11/$0.34 per million tokens, often cheaper than self-hosting for moderate volumes.

## Free Tiers

Several providers offer free access tiers to attract developers [16][17].

| Provider | Free Tier Details | Limitations |
|---|---|---|
| Google Gemini | No credit card required; 5-15 RPM, 250K TPM, 1,000 RPD | Data may train models; not available in EU |
| OpenAI | $5 credit for new accounts (expires after 3 months) | Limited models and rate limits |
| Anthropic | Small free credit for evaluation | Limited to testing purposes |
| DeepSeek | Free research access with rate limits | Throttled during peak hours |
| Mistral | Free tier for small models (Pixtral, Mistral Nemo) | Very low rate limits |
| Groq | Free tier with rate limits for open-source models | Limited TPM |

Google's free tier is the most generous for developers, offering enough capacity for prototyping and small-scale production without any payment requirement. OpenAI's free credit is time-limited and primarily useful for initial evaluation.

## Enterprise Pricing

Enterprise customers at all major providers can negotiate custom pricing arrangements that differ from the published rates [18].

Common enterprise pricing features include:

- **Volume discounts**: Reduced per-token rates based on committed monthly or annual spend
- **Reserved capacity**: Guaranteed throughput (tokens per minute) at negotiated rates, avoiding rate limit concerns
- **Custom rate limits**: Higher RPM and TPM limits than standard tiers
- **Dedicated infrastructure**: Isolated compute environments for security-sensitive workloads
- **Commitment discounts**: Lower prices in exchange for minimum spend commitments (similar to cloud reserved instances)
- **Microsoft Azure OpenAI**: Offers OpenAI models with Azure's enterprise features, including tiered pricing with provisioned throughput units (PTUs) for predictable costs

Enterprise pricing is typically negotiated individually and is not publicly disclosed. Organizations processing billions of tokens monthly may receive discounts of 30-50% or more off published rates.

## Blumenfeld's Law of AI Pricing

In 2024, venture capitalist Jeremy Blumenfeld proposed what he called "Blumenfeld's Law": the observation that the cost of AI inference drops by roughly 10x every 12 to 18 months, far outpacing Moore's Law's traditional 2x improvement every 18-24 months [9][19]. This observation, while not a rigorous physical law, captures the empirical trend visible in the pricing data.

The underlying dynamics supporting this rate of deflation include:

- Hardware efficiency gains (new GPU architectures)
- Software and algorithmic improvements ([Flash Attention](/wiki/flash_attention), MoE, speculative decoding)
- Competitive market pressure (more providers, open-source alternatives)
- Improved model architectures that achieve more capability per parameter

If the trend holds, by 2028 the cost of running a million tokens through a frontier model could drop below $0.10, making AI inference nearly free for most applications. However, as noted in the K-shaped pricing discussion, the most capable models at the frontier may continue to command premium prices even as the broad market commoditizes.
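
The extrapolation can be made concrete with an exponential decay model, price(t) = price(0) × 10^(−t/T), where T is the number of months per 10x drop. The sketch below is an illustration rather than a forecast; the starting price of $2.50 (GPT-5.4 input, March 2026) is an assumption chosen from the flagship table, and the function name is illustrative.

```python
def projected_price(current_price: float, months_ahead: float, months_per_10x: float) -> float:
    """Price after months_ahead if cost falls 10x every months_per_10x months."""
    return current_price * 10 ** (-months_ahead / months_per_10x)


if __name__ == "__main__":
    start = 2.50  # assumed starting point: USD per 1M input tokens
    for period in (12, 18):
        price = projected_price(start, months_ahead=24, months_per_10x=period)
        print(f"10x every {period} months: ${price:.3f} per 1M tokens after 24 months")
```

Under these assumptions, two years of decline takes $2.50 down to about $0.025 at the fast end of the range (10x every 12 months) and about $0.116 at the slow end (10x every 18 months), so the projection is sensitive to which rate actually holds.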

ARK Invest's research supports a similar thesis, finding that AI training costs have improved at 50x the pace of Moore's Law, with the cost to train benchmark models dropping roughly 10x every year between 2017 and 2023 [9].

## Pricing for Non-Text Modalities

While text tokens dominate AI API pricing discussions, providers also offer pricing for other modalities.

### Image Generation

| Provider | Model | Price per Image | Notes |
|---|---|---|---|
| OpenAI | DALL-E 3 | $0.040-$0.120 | Varies by resolution and quality |
| [Stability AI](/wiki/stability_ai) | Stable Image Ultra | $0.08 per generation | Via API |
| Google | Imagen 3 | Included in Gemini API | Billed as tokens |

### Audio and Speech

| Provider | Service | Price | Unit |
|---|---|---|---|
| OpenAI | Whisper (speech-to-text) | $0.006 | Per minute |
| OpenAI | TTS-1 | $15.00 | Per 1M characters |
| OpenAI | TTS-1-HD | $30.00 | Per 1M characters |
| [ElevenLabs](/wiki/eleven_labs) | Speech synthesis | $0.18-$0.30 | Per 1,000 characters |

### Embeddings

| Provider | Model | Price per 1M tokens |
|---|---|---|
| OpenAI | text-embedding-3-small | $0.02 |
| OpenAI | text-embedding-3-large | $0.13 |
| Google | text-embedding-005 | Free (with rate limits) |
| [Cohere](/wiki/cohere) | embed-v4 | $0.10 |

## Current State (2025-2026)

As of early 2026, the AI pricing landscape is characterized by several key dynamics [5][6][20].

**Continued deflation**: LLM API prices dropped roughly 80% across the board between 2025 and early 2026. Budget models like GPT-4.1 Nano and DeepSeek V3.2 offer capabilities comparable to GPT-4 at prices two orders of magnitude lower.

**Provider proliferation**: Beyond the major providers (OpenAI, Anthropic, Google), developers now have access to DeepSeek, Mistral, xAI ([Grok](/wiki/grok)), Cohere, and dozens of inference providers hosting open-source models (Groq, [Together AI](/wiki/together_ai), [Fireworks AI](/wiki/fireworks_ai)). This fragmentation has intensified price competition.

**Multi-provider strategies**: Sophisticated development teams increasingly use multiple providers, routing different tasks to whichever model offers the best cost-performance ratio. Tools like [OpenRouter](/wiki/openrouter) aggregate models from multiple providers, enabling easy switching.

**Open-source pressure**: Meta's Llama, DeepSeek's V3, and Mistral's models continue to push commercial providers to reduce prices. When a capable open model becomes available for free download, the ceiling on API pricing for equivalent capability drops immediately.

**The enterprise premium**: While per-token prices fall, total enterprise AI spending continues to rise as organizations deploy AI across more use cases and process higher volumes. The market is growing both in total revenue and in the number of tokens consumed, even as unit economics improve for buyers.

**Emerging cost dimensions**: New features add new cost categories. Web search ($10 per 1,000 searches on Anthropic), computer use (token-heavy screenshot processing), and code execution (per-minute billing) introduce costs beyond simple input/output token pricing. Developers must now account for these tool-use costs in their budgets.
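
A budget estimate that includes these tool-use charges might look like the sketch below. The web search rate is the $10 per 1,000 searches quoted above; the code execution rate is left as a parameter because no figure is given here, and all names are illustrative.

```python
WEB_SEARCH_USD_PER_1K = 10.00  # rate quoted above for Anthropic web search


def request_cost_with_tools(
    token_cost_usd: float,
    web_searches: int = 0,
    code_exec_minutes: float = 0.0,
    code_exec_usd_per_minute: float = 0.0,  # hypothetical rate; supply the provider's actual price
) -> float:
    """Total request cost: token charges plus per-use tool charges."""
    search_cost = web_searches * WEB_SEARCH_USD_PER_1K / 1_000
    exec_cost = code_exec_minutes * code_exec_usd_per_minute
    return token_cost_usd + search_cost + exec_cost


if __name__ == "__main__":
    # A request with $0.0125 of token charges that also triggers 3 web searches
    print(request_cost_with_tools(0.0125, web_searches=3))
```

In this example the three searches add $0.03, more than twice the token charges, which shows how tool use can dominate the cost of an otherwise cheap request.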

The AI pricing landscape will likely continue its deflationary trajectory, driven by hardware improvements, algorithmic innovation, and fierce competition. For developers and organizations, the practical implication is clear: AI capabilities that were prohibitively expensive just two years ago are now accessible at commodity prices, and the cost of integrating AI into products and workflows is lower than ever.

## See Also

- [OpenAI API](/wiki/openai_api)
- [Anthropic API](/wiki/anthropic_api)
- [Tokenization](/wiki/tokenization)
- [Inference Optimization](/wiki/inference_optimization)
- [Open-Source AI](/wiki/open_source_ai)
- [GPU](/wiki/gpu)

## References

[1] OpenAI. "Pricing." OpenAI API. https://openai.com/api/pricing/

[2] Medium. "OpenAI Model Pricing Drops by 95%!" BoredGeekSociety. https://medium.com/@boredgeeksociety/openai-model-pricing-drops-by-95-3a31ab0e04e6

[3] OpenAI. "What are tokens and how to count them?" OpenAI Help Center. https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them

[4] Anthropic. "Pricing." Claude API Documentation. https://platform.claude.com/docs/en/about-claude/pricing

[5] IntuitionLabs. "AI API Pricing Comparison (2026)." https://intuitionlabs.ai/articles/ai-api-pricing-comparison-grok-gemini-openai-claude

[6] TLDL. "LLM API Pricing (March 2026)." https://www.tldl.io/resources/llm-api-pricing-2026

[7] DeepSeek. "Models & Pricing." DeepSeek API Docs. https://api-docs.deepseek.com/quick_start/pricing

[8] Mistral AI. "Pricing." https://mistral.ai/pricing

[9] ARK Invest. "AI Training Costs Are Improving at 50x the Speed of Moore's Law." https://www.ark-invest.com/articles/analyst-research/ai-training

[10] Neoteric. "How Much Does It Cost to Use GPT? GPT-3 Pricing Explained." https://neoteric.eu/blog/how-much-does-it-cost-to-use-gpt-models-gpt-3-pricing-explained

[11] Nvidia. "Blackwell Architecture." https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/

[12] IntuitionLabs. "DeepSeek's Low Inference Cost Explained." https://intuitionlabs.ai/articles/deepseek-inference-cost-explained

[13] The Stack. "GenAI costs follow a Moore's Law-style curve, VC claims." https://www.thestack.technology/genai-costs-moores-law/

[14] RevolutionInAI. "Self-Hosting Llama 4 vs GPT-4o API: The Exact Monthly Volume Where It Makes Sense." https://www.revolutioninai.com/2026/03/self-hosting-llama-4-vs-gpt4o-api-cost-breakeven.html

[15] DevTk.AI. "Self-Host LLM vs API: Real Cost Breakdown 2026." https://devtk.ai/en/blog/self-hosting-llm-vs-api-cost-2026/

[16] Google. "Gemini Developer API pricing." https://ai.google.dev/gemini-api/docs/pricing

[17] AI Free API. "Gemini API Free Quota 2025." https://www.aifreeapi.com/en/posts/gemini-api-free-quota

[18] Finout. "OpenAI Pricing in 2026 for Individuals, Orgs & Developers." https://www.finout.io/blog/openai-pricing-in-2026

[19] The Stack. "GenAI costs follow a Moore's Law-style curve, VC claims." https://www.thestack.technology/genai-costs-moores-law/

[20] PricePerToken. "LLM API Pricing 2026 - Compare 300+ AI Model Costs." https://pricepertoken.com/
