AI pricing refers to the cost structures and economic models used by providers of artificial intelligence services, particularly large language model (LLM) APIs. The dominant pricing model in the industry is token-based: customers pay per unit of text processed, with separate rates for input and output tokens. Since OpenAI introduced commercial API access in June 2020, AI pricing has undergone extraordinary deflation: the cost of accessing a capable language model fell by more than 99.7% in just over four years, from $60 per million tokens for GPT-3 Davinci in 2020 to under $0.15 per million input tokens for budget models by late 2024 [1][2].
Understanding AI pricing is essential for developers, product managers, and business leaders evaluating whether to integrate AI capabilities, which models and providers to use, and how to optimize costs as usage scales.
A token is the fundamental unit of text that language models process. Rather than reading text character by character or word by word, models use tokenizers (typically based on Byte Pair Encoding or similar algorithms) to break text into subword units that balance vocabulary size with representation efficiency [3].
As a practical approximation, one token corresponds to about four characters of English text, or roughly three-quarters of a word; 1,000 tokens is on the order of 750 words.
Token counts vary by language and content type. Code, technical text, and non-English languages tend to use more tokens per word due to less common vocabulary. Most API providers offer tokenizer tools or libraries so developers can estimate token counts before making requests.
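The rule of thumb above can be turned into a rough estimator. This is a sketch of the heuristic only; accurate counts require the provider's actual tokenizer (for example, OpenAI's tiktoken library).

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4-characters-per-token rule of thumb
    for English prose; code and non-English text often run higher."""
    return math.ceil(len(text) / chars_per_token)

# 44 characters -> about 11 tokens
print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # 11
```

Because the divisor is a heuristic, treat the result as a budgeting estimate, not a billing prediction.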
Modern AI APIs charge separately for input tokens and output tokens [1][4]. Input tokens are everything sent to the model (the system prompt, conversation history, and any attached documents); output tokens are the text the model generates in response.
Output tokens are priced higher than input tokens, typically 2x to 8x more, reflecting the greater computational cost of generating text (which requires running the model autoregressively, one token at a time) versus processing input (which can be parallelized).
The token-based model aligns costs with actual computational resource consumption. Each token processed requires a forward pass through the model's neural network, consuming GPU compute, memory bandwidth, and energy. Per-token pricing gives customers fine-grained control over costs and allows providers to price different models according to their computational requirements. Larger, more capable models cost more per token because they require more parameters and floating-point operations per inference step.
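The per-token billing arithmetic is straightforward. The function below is just that arithmetic, not any provider's SDK; the example uses illustrative flagship rates of $2.50/$15.00 per million tokens.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Dollar cost of one request, with rates given in $ per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# 2,000 input tokens and 500 output tokens at $2.50/$15.00 per 1M:
cost = request_cost(2_000, 500, input_rate=2.50, output_rate=15.00)
print(f"${cost:.4f}")  # $0.0125
```

Note how the 3x larger input contributes less than half the cost: output rates dominate for generation-heavy workloads.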
The following tables compare pricing for flagship, mid-tier, budget, and reasoning models from major AI API providers as of March 2026 [1][4][5][6][7][8].

Flagship models:

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|---|
| OpenAI | GPT-5.4 | $2.50 | $15.00 | 1,050,000 |
| OpenAI | GPT-5.2 | $1.75 | $14.00 | 400,000 |
| Anthropic | Claude Opus 4.6 | $5.00 | $25.00 | 1,000,000 |
| Google | Gemini 3.1 Pro | $2.00 | $12.00 | 1,000,000 |
| Google | Gemini 2.5 Pro | $1.25 | $10.00 | 1,000,000 |
| xAI | Grok 4.1 | $0.20 | $0.50 | 131,072 |

Mid-tier models:

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|---|
| OpenAI | GPT-4.1 | $2.00 | $8.00 | 1,000,000 |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | 1,000,000 |
| Google | Gemini 2.5 Flash | $0.30 | $2.50 | 1,000,000 |
| Mistral | Mistral Large 3 | $0.50 | $1.50 | 128,000 |

Budget models:

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|---|
| OpenAI | GPT-4.1 Nano | $0.10 | $0.40 | 1,000,000 |
| OpenAI | GPT-4o Mini | $0.15 | $0.60 | 128,000 |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | 200,000 |
| Google | Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1,000,000 |
| DeepSeek | DeepSeek V3.2 | $0.28 | $0.42 | 128,000 |
| DeepSeek | DeepSeek R1 | $0.50 | $2.18 | 128,000 |
| Mistral | Devstral Small 2 | $0.10 | $0.30 | 128,000 |
| Meta (via Groq) | Llama 4 Scout | $0.11 | $0.34 | 1,000,000 |

Reasoning models:

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|---|
| OpenAI | o3 | $0.40 | $1.60 | General reasoning; output includes thinking tokens |
| OpenAI | o4-mini | $1.10 | $4.40 | Cost-efficient reasoning with vision |
| Anthropic | Claude Sonnet 4.6 (extended thinking) | $3.00 | $15.00 | Thinking tokens at standard output rate |
| DeepSeek | R1 | $0.50 | $2.18 | Open-weight reasoning model |
AI API pricing has experienced a period of deflation unmatched by almost any other technology sector. The cost of accessing capable language models has fallen at a pace that dwarfs even Moore's Law, with some analyses suggesting AI inference costs have improved at 50x the speed of semiconductor scaling [9].
The following table tracks the cost trajectory of OpenAI's models, illustrating the trend across the industry [1][2][10].
| Date | Model | Input Price (per 1M tokens) | Relative Cost (vs. GPT-3 Davinci) |
|---|---|---|---|
| June 2020 | GPT-3 Davinci | $60.00 | 1.0x (baseline) |
| November 2022 | GPT-3 Davinci-002 | $20.00 | 0.33x |
| March 2023 | GPT-3.5 Turbo | $2.00 | 0.033x |
| November 2023 | GPT-3.5 Turbo (updated) | $0.50 | 0.008x |
| May 2024 | GPT-4o | $5.00 | 0.083x |
| July 2024 | GPT-4o Mini | $0.15 | 0.0025x |
| April 2025 | GPT-4.1 Nano | $0.10 | 0.0017x |
| August 2025 | GPT-5 | $1.25 | 0.021x |
| March 2026 | GPT-5.4 Nano | $0.20 | 0.003x |
From GPT-3 Davinci at $60 per million tokens to GPT-4o Mini at $0.15 per million, the input cost dropped by a factor of 400 in just four years. Even accounting for the fact that newer models are significantly more capable (making direct comparison imperfect), the trend is dramatic.
Several factors have contributed to this historic cost reduction:
Hardware improvements: Each generation of GPU (from Nvidia A100 to H100 to B200) has delivered substantial improvements in inference throughput per dollar. Nvidia's Blackwell architecture, shipping from late 2024, roughly doubled inference performance per watt compared to Hopper [11].
Algorithmic efficiency: Techniques like Flash Attention, grouped-query attention, speculative decoding, and mixture-of-experts architectures have dramatically reduced the computational cost per token. DeepSeek's V3 model demonstrated that a mixture-of-experts architecture could achieve near-frontier performance at a fraction of the training and inference cost [12].
Quantization: Reducing model weights from 16-bit to 8-bit or 4-bit precision roughly halves or quarters memory requirements and inference cost, often with minimal quality degradation.
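The memory arithmetic behind quantization can be sketched directly: weight storage is parameter count times bytes per weight. Real deployments add activation and KV-cache memory on top, so these figures are a floor, not a total.

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate weight storage: parameters x (bits / 8) bytes."""
    return params_billion * 1e9 * (bits / 8) / 1e9

# A 70B-parameter model at three common precisions:
for bits in (16, 8, 4):
    print(f"70B model at {bits}-bit: {weight_memory_gb(70, bits):.0f} GB")
# 140 GB, 70 GB, 35 GB respectively
```

Halving the bits halves the bytes that must stream through memory each token, which is why quantization cuts inference cost roughly in proportion.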
Scale and competition: As more providers entered the market and usage volumes grew, economies of scale and competitive pressure drove prices down. The emergence of DeepSeek as an ultra-low-cost provider in early 2025 intensified pricing competition across the industry.
Smaller, better models: Model distillation and improved training techniques have produced smaller models (like GPT-4o Mini and Claude Haiku) that match the performance of much larger predecessors at a fraction of the cost.
While budget and mid-tier model prices have fallen dramatically, frontier model pricing tells a more nuanced story. OpenAI's GPT-5.2 Pro is priced at $21/$168 per million tokens, and Anthropic's Claude Opus 4.6 in fast mode costs $30/$150. This creates a "K-shaped" dynamic where the most capable models maintain premium pricing while the broader market races toward zero [13]. For the hardest tasks requiring maximum capability, AI costs per query can still be substantial.
Developers and organizations have several strategies available to reduce AI API costs without sacrificing output quality.
Both OpenAI and Anthropic offer prompt caching that reuses previously processed input prefixes [1][4]. When a request shares the same opening content (system prompt, reference documents) as a recent request, the cached tokens are read at a steep discount.
| Provider | Cache Hit Discount |
|---|---|
| Anthropic | 90% off base input price |
| OpenAI (GPT-5 family) | 90% off |
| OpenAI (GPT-4.1 family) | 75% off |
| OpenAI (GPT-4o / o-series) | 50% off |
| Google Gemini | Up to 90% off |
Prompt caching is most effective for applications with stable, lengthy system prompts, such as customer support bots with extensive knowledge bases, document analysis pipelines processing the same document with different queries, or coding assistants with large codebase context.
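The effect of caching on the blended input rate is a weighted average. The function below is an illustrative sketch of that arithmetic, not a provider API; the 9,000-token cached prefix and $3.00 base rate in the example are assumptions.

```python
def effective_input_rate(base_rate: float, cached_fraction: float,
                         cache_discount: float) -> float:
    """Blended $/1M-token input rate when `cached_fraction` of input tokens
    hit the cache at `cache_discount` (e.g. 0.90 = 90% off)."""
    cached_cost = cached_fraction * base_rate * (1 - cache_discount)
    uncached_cost = (1 - cached_fraction) * base_rate
    return cached_cost + uncached_cost

# A support bot with a 9,000-token cached system prompt plus 1,000 fresh
# user tokens (90% cached), at a $3.00 base rate and 90% cache discount:
print(f"{effective_input_rate(3.00, cached_fraction=0.9, cache_discount=0.9):.2f}")  # 0.57
```

In this example the effective input rate falls from $3.00 to $0.57 per million tokens, an 81% saving on input.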
Batch APIs offered by OpenAI, Anthropic, and Google process requests asynchronously at a 50% discount [1][4]. This is ideal for workloads that do not require real-time responses, such as offline evaluation runs, bulk document summarization, and periodic data-labeling or content-generation jobs.
Batch and caching discounts typically stack, enabling combined savings exceeding 90%.
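Whether the two discounts stack varies by provider, but where they do, the arithmetic behind the combined figure is multiplicative, shown here for a cached input token routed through the batch API:

```python
batch_discount = 0.50   # batch API: 50% off
cache_discount = 0.90   # cache hit: 90% off input

# Each discount multiplies the remaining price, so the fractions compound:
combined = 1 - (1 - batch_discount) * (1 - cache_discount)
print(f"{combined:.0%}")  # 95%
```

Paying 50% of 10% of the base rate leaves 5% of the original price, hence combined savings of 95% on cached input tokens.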
One of the highest-impact optimization strategies is matching the model to the task complexity. A task that GPT-4.1 Nano ($0.10/$0.40) handles adequately should not be routed to GPT-5.4 ($2.50/$15.00). Many production systems use a router that sends simple queries to budget models and escalates complex ones to more capable (and expensive) models.
| Task Complexity | Recommended Tier | Example Models | Typical Cost |
|---|---|---|---|
| Simple classification, extraction | Budget | GPT-4.1 Nano, Haiku 4.5, Flash-Lite | $0.10-$1.00 per 1M input |
| General conversation, summarization | Mid-tier | GPT-4.1, Sonnet 4.6, Gemini 2.5 Flash | $0.30-$3.00 per 1M input |
| Complex reasoning, research, analysis | Flagship | GPT-5.4, Opus 4.6, Gemini 3.1 Pro | $2.00-$5.00 per 1M input |
| Mathematical/scientific reasoning | Reasoning | o3, o4-mini, Sonnet extended thinking, R1 | $0.40-$3.00 per 1M input |
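A tiered router can be sketched in a few lines. The model names and input rates below come from the comparison tables above, but the complexity scores and thresholds are invented for this sketch; production routers typically score queries with a cheap classifier model.

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    input_rate: float  # $ per 1M input tokens

# Illustrative tier table; rates taken from the comparison above.
ROUTES = {
    "budget":   Route("gpt-4.1-nano", 0.10),
    "mid":      Route("gemini-2.5-flash", 0.30),
    "flagship": Route("gpt-5.4", 2.50),
}

def pick_route(complexity: float) -> Route:
    """Send cheap tasks to budget models, escalate hard ones to flagships.
    `complexity` is a 0-1 score; the cutoffs here are arbitrary examples."""
    if complexity < 0.3:
        return ROUTES["budget"]
    if complexity < 0.7:
        return ROUTES["mid"]
    return ROUTES["flagship"]

print(pick_route(0.2).model)  # gpt-4.1-nano
```

If most traffic is simple, routing shifts the bulk of volume onto the $0.10 tier while reserving the 25x-more-expensive flagship for the queries that need it.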
Model distillation involves using a larger, more expensive model to generate training data, then fine-tuning a smaller, cheaper model on that data. For example, a developer might use GPT-5.4 to generate 10,000 high-quality responses for a specific task, then fine-tune GPT-4.1 Mini on those examples. The resulting fine-tuned model can approach the larger model's quality on that narrow task at a fraction of the inference cost.
Running open-source models like Meta's Llama or DeepSeek V3 on self-hosted infrastructure can eliminate per-token API costs entirely. Quantized versions of these models (4-bit or 8-bit precision) run on consumer or mid-range hardware while maintaining much of the original model's capability.
The economics of self-hosting versus API usage depend heavily on volume [14][15].
| Factor | Self-Hosted (e.g., Llama on GPU) | API (e.g., OpenAI, Anthropic) |
|---|---|---|
| Fixed costs | High ($3,000-$5,000/month per GPU) | None |
| Marginal cost per token | Near zero at full utilization | Fixed per-token rate |
| Break-even volume | 50+ million tokens/month (realistic) | N/A |
| Engineering overhead | Significant (setup, maintenance, monitoring) | Minimal |
| Model quality | Open-source frontier (Llama 4, DeepSeek V3) | Proprietary frontier (GPT-5, Claude, Gemini) |
| Latency control | Full control over hardware and batching | Subject to provider's infrastructure |
| Data privacy | Complete control | Data sent to third-party servers |
Self-hosting a 70B-parameter model on cloud A100 GPUs costs approximately $3,000-$5,000 per month but can deliver inference at roughly $0.07 per million tokens at full utilization. The realistic break-even point, accounting for engineering time and operational overhead, is around 50+ million tokens per month for most organizations [15]. Below that volume, API access is almost always more cost-effective.
For reference, Meta estimates inference costs for Llama models at $0.30-$0.49 per million tokens on a single host [14], and third-party hosting providers like Groq offer Llama 4 Scout at $0.11/$0.34 per million tokens, often cheaper than self-hosting for moderate volumes.
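The break-even arithmetic can be made explicit. The $0.07-per-million self-hosted marginal cost and the $4,000 monthly GPU bill come from the figures above; the $80 blended API rate in the example is an illustrative assumption (roughly flagship fast-mode territory), and the pure arithmetic ignores engineering overhead, which in practice raises the break-even further.

```python
def breakeven_tokens_per_month(fixed_monthly_cost: float,
                               api_rate_per_m: float,
                               selfhost_rate_per_m: float = 0.07) -> float:
    """Monthly token volume at which self-hosting's fixed cost is repaid
    by its lower marginal rate (rates in $ per 1M tokens)."""
    savings_per_m = api_rate_per_m - selfhost_rate_per_m
    return fixed_monthly_cost / savings_per_m * 1_000_000

# $4,000/month of GPUs vs an $80-per-1M blended API rate:
print(f"{breakeven_tokens_per_month(4000, 80.0):,.0f} tokens/month")  # ≈ 50 million
```

The displaced API rate dominates the result: against a budget API at $0.15 per million, the same formula gives a break-even near 50 billion tokens per month, which is why low-volume workloads almost never justify self-hosting.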
Several providers offer free access tiers to attract developers [16][17].
| Provider | Free Tier Details | Limitations |
|---|---|---|
| Google Gemini | No credit card required; 5-15 requests/min, 250K tokens/min, 1,000 requests/day | Data may train models; not available in EU |
| OpenAI | $5 credit for new accounts (expires after 3 months) | Limited models and rate limits |
| Anthropic | Small free credit for evaluation | Limited to testing purposes |
| DeepSeek | Free research access with rate limits | Throttled during peak hours |
| Mistral | Free tier for small models (Pixtral, Mistral Nemo) | Very low rate limits |
| Groq | Free tier with rate limits for open-source models | Limited tokens/min |
Google's free tier is the most generous for developers, offering enough capacity for prototyping and small-scale production without any payment requirement. OpenAI's free credit is time-limited and primarily useful for initial evaluation.
Enterprise customers at all major providers can negotiate custom pricing arrangements that differ from the published rates [18].
Common enterprise pricing features include committed-spend and volume discounts, provisioned (reserved) throughput, higher rate limits, custom data-retention and privacy terms, and uptime SLAs.
Enterprise pricing is typically negotiated individually and is not publicly disclosed. Organizations processing billions of tokens monthly may receive discounts of 30-50% or more off published rates.
In 2024, venture capitalist Jeremy Blumenfeld proposed what he called "Blumenfeld's Law": the observation that the cost of AI inference drops by roughly 10x every 12 to 18 months, far outpacing Moore's Law's traditional 2x improvement every 18-24 months [9][19]. This observation, while not a rigorous physical law, captures the empirical trend visible in the pricing data.
The underlying dynamics supporting this rate of deflation are the factors discussed above: successive GPU generations, algorithmic efficiency gains, quantization and distillation, and intensifying competition among providers.
If the trend holds, by 2028 the cost of running a million tokens through a frontier model could drop below $0.10, making AI inference nearly free for most applications. However, as noted in the K-shaped pricing discussion, the most capable models at the frontier may continue to command premium prices even as the broad market commoditizes.
ARK Invest's research supports a similar thesis, finding that AI training costs have improved at 50x the pace of Moore's Law, with the cost to train benchmark models dropping roughly 10x every year between 2017 and 2023 [9].
While text tokens dominate AI API pricing discussions, providers also offer pricing for other modalities.
| Provider | Model | Price per Image | Notes |
|---|---|---|---|
| OpenAI | DALL-E 3 | $0.040-$0.120 | Varies by resolution and quality |
| Stability AI | Stable Image Ultra | $0.08 per generation | Via API |
| Google | Imagen 3 | Included in Gemini API | Billed as tokens |
| Provider | Service | Price | Unit |
|---|---|---|---|
| OpenAI | Whisper (speech-to-text) | $0.006 | Per minute |
| OpenAI | TTS-1 | $15.00 | Per 1M characters |
| OpenAI | TTS-1-HD | $30.00 | Per 1M characters |
| ElevenLabs | Speech synthesis | $0.18-$0.30 | Per 1,000 characters |
| Provider | Model | Price per 1M tokens |
|---|---|---|
| OpenAI | text-embedding-3-small | $0.02 |
| OpenAI | text-embedding-3-large | $0.13 |
| Google | text-embedding-005 | Free (with rate limits) |
| Cohere | embed-v4 | $0.10 |
As of early 2026, the AI pricing landscape is characterized by several key dynamics [5][6][20].
Continued deflation: LLM API prices dropped roughly 80% across the board between 2025 and early 2026. Budget models like GPT-4.1 Nano and DeepSeek V3.2 offer capabilities comparable to GPT-4 at prices two orders of magnitude lower.
Provider proliferation: Beyond the major providers (OpenAI, Anthropic, Google), developers now have access to DeepSeek, Mistral, xAI (Grok), Cohere, and dozens of inference providers hosting open-source models (Groq, Together AI, Fireworks AI). This fragmentation has intensified price competition.
Multi-provider strategies: Sophisticated development teams increasingly use multiple providers, routing different tasks to whichever model offers the best cost-performance ratio. Tools like OpenRouter aggregate models from multiple providers, enabling easy switching.
Open-source pressure: Meta's Llama, DeepSeek's V3, and Mistral's models continue to push commercial providers to reduce prices. When a capable open model becomes available for free download, the ceiling on API pricing for equivalent capability drops immediately.
The enterprise premium: While per-token prices fall, total enterprise AI spending continues to rise as organizations deploy AI across more use cases and process higher volumes. The market is growing both in total revenue and in the number of tokens consumed, even as unit economics improve for buyers.
Emerging cost dimensions: New features add new cost categories. Web search ($10 per 1,000 searches on Anthropic), computer use (token-heavy screenshot processing), and code execution (per-minute billing) introduce costs beyond simple input/output token pricing. Developers must now account for these tool-use costs in their budgets.
The AI pricing landscape will likely continue its deflationary trajectory, driven by hardware improvements, algorithmic innovation, and fierce competition. For developers and organizations, the practical implication is clear: AI capabilities that were prohibitively expensive just two years ago are now accessible at commodity prices, and the cost of integrating AI into products and workflows is lower than ever.