Azure OpenAI Service
Last reviewed
May 8, 2026
Sources
18 citations
Review status
Source-backed
Revision
v7 ยท 7,395 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 8, 2026
Sources
18 citations
Review status
Source-backed
Revision
v7 ยท 7,395 words
Add missing citations, update stale details, or suggest a clearer explanation.
Azure OpenAI Service is Microsoft's enterprise cloud offering that provides access to OpenAI's large language models and generative artificial intelligence capabilities through the Microsoft Azure platform. It combines OpenAI's frontier models, including GPT-4, GPT-4o, GPT-5, the o-series reasoning models, DALL-E, and Whisper, with Azure's enterprise-grade security, compliance, and data privacy guarantees. The service reached general availability on January 16, 2023, and has since become one of the most widely adopted enterprise AI platforms in production [1].
Since late 2024 the service has been progressively folded into Azure AI Foundry, Microsoft's broader AI development platform. The product is sometimes referred to as Microsoft Foundry Models or Azure OpenAI in Azure AI Foundry, but the underlying API surface, billing, and resource type remain the same. Customers continue to provision Azure OpenAI resources from the Azure portal, and the older *.openai.azure.com endpoints still work alongside newer Foundry endpoints. In practice, most documentation now treats Azure OpenAI Service as the OpenAI-branded slice of Foundry rather than a standalone product.
The roots of Azure OpenAI Service trace back to Microsoft's initial partnership with OpenAI, which began in 2019 with a $1 billion investment. Microsoft first debuted Azure OpenAI Service in limited preview in November 2021, allowing select customers to access large-scale generative AI models backed by Azure's cloud infrastructure [2]. At the time, the available models were GPT-3 and the original Codex.
On January 16, 2023, Microsoft announced the general availability of Azure OpenAI Service, expanding access to GPT-3.5, Codex, and DALL-E 2 with enterprise features like content filtering, role-based access control, and regional deployment options [1]. Shortly after, ChatGPT was integrated as a fine-tuned version of GPT-3.5 running on Azure AI infrastructure, and on March 9, 2023 Microsoft made GPT-4 available in preview through Azure OpenAI, only days after OpenAI's own launch.
Microsoft's total investment in OpenAI has reached approximately $13 billion across multiple funding rounds [3]. Following a restructuring of the partnership in late 2025, Microsoft holds an investment in OpenAI Group PBC valued at approximately $135 billion, representing roughly 27 percent on an as-converted diluted basis. Under the terms of the partnership, Azure remains the exclusive cloud provider for stateless OpenAI APIs, and OpenAI committed to purchasing an incremental $250 billion in Azure services over the life of the agreement [3].
In 2024 and 2025, Microsoft rebranded its broader AI development platform as Azure AI Foundry (formerly Azure AI Studio), positioning Azure OpenAI Service as a core component within a larger ecosystem of AI tools and services. The Foundry rollout was gradual: in November 2024 the company introduced "Azure AI Foundry" as the umbrella name at Microsoft Ignite, in early 2025 the Azure AI Studio web portal was redirected to ai.azure.com under the Foundry brand, and through 2025 the OpenAI-specific resource type was renamed inside the portal to "Azure OpenAI in Azure AI Foundry." Older API paths and SDK clients still referencing azure-ai-openai and *.openai.azure.com continue to function, which has caused some confusion about whether the brand has truly changed.
On January 21, 2025, OpenAI, SoftBank, Oracle, and MGX announced the Stargate Project at the White House, a venture committed to deploying $100 billion immediately and up to $500 billion over four years on AI compute capacity for OpenAI inside the United States. Microsoft was named as one of the initial technology partners alongside Nvidia and Arm, but it was not a co-investor in the venture and instead retained a right of first refusal on new OpenAI capacity orders. In the October 2025 partnership rework, that right of first refusal was relaxed, formally allowing OpenAI to contract directly with other cloud providers for training capacity while keeping Azure as the exclusive home for stateless OpenAI APIs [3].
The practical effect is that Azure data centers continue to host most of the inference traffic for hosted OpenAI models, including the workloads served through Azure OpenAI Service. Microsoft has been expanding capacity in step with that demand. Through 2025 and into early 2026, the company opened or expanded data center campuses in Wisconsin, Georgia, Iowa, Texas, Arizona, Wyoming, Spain, Sweden, Italy, and the United Arab Emirates, with several sites built out specifically to host AI accelerators for OpenAI workloads. In its January 2026 earnings, Microsoft reported quarterly capital expenditure above $37 billion, almost all of it on AI-capable infrastructure [13].
Azure OpenAI Service provides access to OpenAI's full range of production models, though new model releases may appear on OpenAI's direct API slightly before they become available on Azure. The lag is typically days to a few weeks for major launches, and Azure availability often starts in a small set of regions before expanding. Some preview models such as GPT-5 Pro and Sora reached Azure within hours of OpenAI's announcement; others such as o1-pro took longer because of capacity constraints.
| Model family | Specific models | Key capabilities |
|---|---|---|
| GPT-5 series | GPT-5, GPT-5 Pro, GPT-5-mini, GPT-5-nano | Frontier reasoning, generation, multimodal |
| GPT-4.5 | GPT-4.5 | High-quality conversational reasoning |
| GPT-4.1 series | GPT-4.1, GPT-4.1 mini, GPT-4.1 nano | 1M token context, general purpose |
| GPT-4o series | GPT-4o, GPT-4o mini | Multimodal (text, vision, audio) |
| GPT-4 series | GPT-4 Turbo, GPT-4 32k (legacy) | Long-context text and vision |
| GPT-3.5 series (legacy) | GPT-3.5 Turbo, GPT-3.5 Turbo Instruct | Cost-efficient chat, slated for retirement |
| o-series (reasoning) | o1, o1-mini, o3, o3-mini, o4-mini | Advanced reasoning, math, science, code |
| Codex | codex-mini, codex-1 | Coding-tuned variants for the Codex agent |
| Image generation | DALL-E 3, gpt-image-1 | Text-to-image, edits, transparent backgrounds |
| Speech | Whisper, gpt-4o-transcribe, gpt-4o-mini-transcribe | Transcription and translation |
| Realtime | gpt-4o-realtime, gpt-realtime | Streaming voice agent stack |
| Video | Sora, Sora 2 | Text-to-video generation, preview |
| Embeddings | text-embedding-3-large, text-embedding-3-small, ada-002 (legacy) | Text embeddings for search and RAG |
GPT-5, which became generally available in Azure AI Foundry in 2025, is described by Microsoft as the most powerful large language model ever released across key benchmarks. It pairs frontier reasoning with high-performance generation and cost efficiency. The GPT-5-mini and GPT-5-nano variants offer progressively lower costs for less demanding workloads [4]. GPT-4.5, marketed by OpenAI as its largest non-reasoning model when it launched in February 2025, was made available on Azure in March 2025 and remains an option for customers who prefer GPT-4.x style instruction following over the o-series reasoning chain.
The o-series reasoning models on Azure inherit OpenAI's tiered pricing for reasoning effort, exposing a reasoning_effort parameter that callers can set to low, medium, or high to trade latency against quality. The o3 and o4-mini models also support tool use and structured outputs, making them practical for agent workloads when paired with Foundry Agents.
Sora and Sora 2 reached Azure in preview in late 2025 in a small number of regions (initially East US 2 and Sweden Central) and remain limited-access. Realtime API support, including the gpt-4o-realtime preview and the gpt-realtime model that reached GA in late 2025, is available in the Sweden Central, East US 2, and West US regions for low-latency voice streaming.
The legacy GPT-3.5 Turbo and ada-002 models continue to be supported, but Microsoft has begun publishing retirement schedules for them on the Azure model lifecycle page; customers running on these models are encouraged to migrate to GPT-4o mini and text-embedding-3-small as direct replacements.
Unlike the direct OpenAI API, where customers do not pick a region, Azure OpenAI requires customers to choose a region when creating an Azure OpenAI resource. Each region is part of an Azure geography (United States, Europe, Asia Pacific, and so on) and supports a subset of models. Microsoft maintains a public model availability matrix that customers consult to find the closest region offering the model and deployment type they want.
The table below lists the major Azure regions where Azure OpenAI is generally available, grouped by geography. Specific model availability varies by region; flagship models like GPT-5 and the o-series typically light up first in the United States and in Sweden Central before propagating to other geographies.
| Geography | Region | Notes |
|---|---|---|
| United States | East US, East US 2 | Frontier model launches, Sora preview |
| United States | West US, West US 3 | GPT-4o, GPT-4.1, GPT-5 capacity |
| United States | North Central US, South Central US | Heavy enterprise demand, broad model coverage |
| United States | Central US, West Central US | Smaller capacity, fine-tuning region for select models |
| Europe | West Europe (Netherlands) | Broad coverage, EU data residency |
| Europe | North Europe (Ireland) | Broad coverage, EU data residency |
| Europe | Sweden Central | Frontier launches in Europe, Sora and Realtime preview |
| Europe | France Central, Switzerland North | Country-level data residency, financial sector demand |
| Europe | UK South | UK data residency, public sector |
| Asia Pacific | Japan East | Most popular APAC region for GPT-4 family |
| Asia Pacific | Australia East | ANZ data residency, government workloads |
| Asia Pacific | Korea Central, India South | Localized capacity, smaller model menus |
| Americas | Canada East | Canadian data residency |
| Americas | Brazil South | South American capacity |
| Middle East | UAE North | Sovereign-style availability for the region |
The practical implication of regional deployment is that an application calling Azure OpenAI in eastus2 is hitting a different physical fleet than one calling swedencentral, with potentially different model versions, different latencies, and different rate limits. Customers running global products often deploy in two or three regions and route traffic with Azure Front Door or a custom load balancer.
Global deployment types, introduced in 2024, hide some of this complexity by routing requests to whichever Azure region has spare capacity for the chosen model. Global Standard and Global Provisioned deployments behave like a single logical endpoint while still associating data residency with the customer's home region for processing.
Azure OpenAI Service inherits the full security stack of the Azure platform. This includes encryption of data at rest using FIPS 140-2 compliant 256-bit AES encryption, encryption of data in transit, virtual network isolation, and integration with Azure Key Vault for secrets management [5]. The service holds certifications for GDPR, HIPAA, ISO 27001, SOC 2, and other major compliance frameworks.
Organizations can deploy models in specific Azure regions to meet data residency requirements. Microsoft also introduced Data Zones in 2025, which give customers more granular control over where their data is processed and stored [5].
For organizations requiring strict network-level security, Azure OpenAI Service supports private endpoints that assign a private IP address within the customer's Azure Virtual Network (VNet). Traffic between applications and Azure OpenAI travels over the Microsoft backbone network and never traverses the public internet. Combined with the option to disable public network access entirely, this creates true network isolation for sensitive workloads [9].
Key network security capabilities include:
| Feature | Description |
|---|---|
| Private endpoints | Assign a private IP within the customer's VNet |
| VNet integration | Keep all traffic on the Microsoft backbone network |
| Public access control | Option to disable public endpoint entirely |
| DNS configuration | Private DNS zones for transparent name resolution |
| Network Security Groups | Granular traffic filtering rules |
| Customer-managed keys | Optional encryption key control via Azure Key Vault |
Microsoft recommends using Microsoft Entra ID (formerly Azure Active Directory) managed identities for authentication instead of API keys. Managed identities eliminate the need to store, manage, and rotate secrets, reducing the attack surface for credential-based compromises. As of April 2025, permissions for managed identities must be assigned manually rather than being auto-granted, giving organizations more explicit control over access [9].
Entra ID authentication is the preferred path for production deployments because it ties API access to the same identity primitives that govern the rest of Azure: user accounts, service principals, role assignments, conditional access policies, and Privileged Identity Management. API keys remain available for backward compatibility and for use cases like quick prototyping, but Microsoft has been making them optional rather than default in Foundry resources.
The full list of certifications that apply to Azure OpenAI Service is broad, in part because the service inherits most of them from the Azure platform itself. The table below summarizes the certifications most often relevant to customers in regulated industries.
| Standard or program | Status | Notes |
|---|---|---|
| SOC 1, SOC 2, SOC 3 | In scope | Annual independent audit reports available via the Service Trust Portal |
| ISO 27001, 27017, 27018, 27701 | In scope | Information security and privacy management systems |
| ISO 22301, 9001, 20000-1 | In scope | Continuity, quality, and IT service management |
| HIPAA Business Associate Agreement | Available | Required for protected health information workloads |
| HITRUST CSF | In scope | Healthcare-focused control framework |
| PCI DSS | In scope | Payment card workloads with documented controls |
| FedRAMP High | In scope (Azure Government) | Federal civilian and DoD IL2 workloads |
| DoD Impact Level 4, IL5, IL6 | Varies by region | DoD environments through Azure Government and Azure Government Secret |
| IRS 1075 | In scope | Federal tax information |
| CJIS | In scope | Criminal justice information |
| GDPR, EU-US Data Privacy Framework | In scope | EU data protection |
| EU sovereign cloud (Microsoft Cloud for Sovereignty) | Available | Customer-controlled keys and operator restrictions |
| C5 (Germany), IRAP (Australia), CCCS (Canada) | Region-specific | Country compliance frameworks |
Customer-managed keys (CMK) are supported through Azure Key Vault, which means a customer can revoke access to its own data by disabling the key. Audit logs are written to Azure Monitor and can be exported to a customer-controlled storage account or SIEM. Microsoft also publishes a quarterly transparency report on abuse monitoring and law enforcement requests for Azure OpenAI.
One of the most important distinctions between Azure OpenAI Service and the direct OpenAI API is data handling. Microsoft makes several explicit guarantees [5]:
These commitments make Azure OpenAI Service attractive to regulated industries, including healthcare, financial services, and government agencies, where data handling policies are strictly governed. It is worth noting that in March 2025, Microsoft updated its terms to disclose that fine-tuning operations may involve temporary data relocation outside the selected geography [5]. Customers who require strict in-region processing should verify the regional support matrix for fine-tuning before relying on it for sensitive data.
Azure OpenAI Service includes a built-in content filtering system that runs alongside model inference. The system evaluates both inputs and outputs for categories including hate speech, violence, sexual content, and self-harm. Filters can be configured to different severity levels, and organizations can request modified content filtering configurations through Microsoft for specific use cases [6].
The broader responsible AI framework includes abuse monitoring, which detects patterns of misuse, and Jailbreak detection, which identifies attempts to circumvent model safety instructions. Abuse monitoring writes a separate stream of metadata that Microsoft retains for up to 30 days; customers approved through the Limited Access Review for sensitive workloads (for example, journalism, research on hate speech, or law enforcement) can opt out of human review of abuse monitoring data.
The content filtering system uses an ensemble of multi-class classification models to detect harmful content across four primary categories, each evaluated at four severity levels: safe, low, medium, and high. The default configuration filters at the medium severity threshold for both prompts and completions [10].
| Filter category | Description | Default threshold |
|---|---|---|
| Hate and fairness | Content promoting discrimination or stereotypes | Medium |
| Sexual | Sexually explicit or suggestive content | Medium |
| Violence | Content depicting or promoting violence | Medium |
| Self-harm | Content related to self-harm or suicide | Medium |
Beyond the four core categories, Azure OpenAI provides additional specialized filters [10]:
| Special filter | Function |
|---|---|
| Prompt attack / jailbreak detection | Detects and blocks attempts to bypass content moderation |
| Protected material (text) | Identifies known copyrighted text (song lyrics, articles, recipes) |
| Protected material (code) | Detects source code matching public repositories |
| Groundedness detection | Verifies responses are grounded in provided source material |
| PII detection | Identifies personally identifiable information in inputs and outputs |
All customers can create custom content filtering policies tailored to their use cases, with settings configurable separately for prompts and completions. Customers approved for modified content filtering through Microsoft's Limited Access Review process can turn filters off entirely for specific scenarios [10].
The filtering layer is one of the more visible behavioral differences between Azure OpenAI Service and the direct OpenAI API. Azure's filters are typically stricter and more configurable, while OpenAI's direct API relies on the moderation endpoint and the model's own refusal training. Developers porting code from OpenAI to Azure sometimes see prompts that worked on api.openai.com get blocked on Azure, particularly for classification or red-teaming use cases that touch policy categories.
Unlike the direct OpenAI API, where models are accessed through a shared endpoint, Azure OpenAI Service requires customers to create explicit deployments of specific model versions in their Azure subscription. This approach gives teams control over which model version their application uses and when to upgrade, avoiding unexpected behavior changes from automatic model updates.
Azure OpenAI Service offers several deployment types, each optimized for different workload characteristics [11]:
| Deployment type | Description | Billing | Best for |
|---|---|---|---|
| Global Standard | Shared infrastructure across Azure global network | Per-token (pay-as-you-go) | Variable workloads, experimentation |
| Standard (Regional) | Shared infrastructure within a specific Azure region | Per-token (pay-as-you-go) | Data residency requirements |
| Data Zone Standard | Pay-as-you-go within EU or US data zones | Per-token | Multi-region pooled traffic with regulated zone |
| Global Provisioned (Provisioned-Managed) | Dedicated compute across Azure global network | Hourly (per PTU) | High-volume, latency-sensitive production |
| Regional Provisioned | Dedicated compute within a specific region | Hourly (per PTU) | Data residency plus guaranteed throughput |
| Data Zone Provisioned | Dedicated compute within EU or US data zones | Hourly (per PTU) | Regulatory compliance with throughput guarantees |
| Global Batch | Async processing across global network | Per-token (50% discount) | Large-scale, non-time-sensitive workloads |
| Provisioned-Managed Spillover | Spillover from PTU to standard during traffic spikes | Hybrid | Reservation-backed plus burst capacity |
The Standard deployment types use shared infrastructure with pay-per-token pricing, making them suitable for variable or experimental workloads. Provisioned deployments use dedicated compute with reserved throughput, offering lower and more predictable latency than shared deployments. Global Batch is a separate path entirely; it bundles requests into a 24-hour processing window for a 50 percent discount and is the Azure equivalent of OpenAI's Batch API.
Provisioned Throughput Units are model-agnostic quota units that organizations allocate to deployments for guaranteed processing capacity. PTUs abstract away the underlying compute, allowing customers to allocate capacity across different models and rebalance as needs change [11].
Key characteristics of PTU deployments include:
The internal scaling factor for a PTU varies by model and version: a single GPT-5 PTU yields a different tokens-per-minute throughput than a single GPT-4o PTU. Microsoft publishes a capacity calculator inside the Azure portal that estimates how many PTUs a workload will need based on observed input and output token rates.
As of early 2026, PTU provisioned throughput starts at approximately $2,448 per month for the smallest valid deployment of a flagship model. Organizations whose pay-as-you-go token costs exceed roughly $1,800 per month typically benefit from committing to a PTU reservation [11]. The break-even point depends heavily on the input-to-output ratio: workloads that produce long completions tend to favor PTUs earlier than workloads dominated by retrieval-augmented short answers.
Provisioned-Managed Spillover, introduced in 2025, lets customers absorb traffic spikes by overflowing from PTU capacity onto pay-as-you-go for the same model. The feature reduces the cost of overprovisioning, because operators can size PTU reservations for steady-state load instead of peak load. Spillover requests are billed at the standard per-token rate.
Azure OpenAI Service supports fine-tuning of GPT-4o, GPT-4o mini, GPT-4.1, GPT-3.5 Turbo, and selected o-series models in a subset of regions. Fine-tuning allows organizations to adapt a model's behavior using their own labeled training data, improving performance on domain-specific tasks while retaining the model's general capabilities. Fine-tuned models are hosted within the customer's Azure subscription and are not shared with other tenants.
The fine-tuning process in Azure follows several steps:
Three fine-tuning techniques are supported. Supervised fine-tuning is the default and uses labeled prompt-completion pairs. Direct Preference Optimization (DPO) trains the model from pairs of preferred and rejected outputs. Reinforcement fine-tuning, in preview for o-series models, lets customers shape the model's reasoning chain with a reward model. The DPO and reinforcement options arrived on Azure several months after the OpenAI direct API.
Fine-tuning is charged based on training compute hours and hosting costs for the resulting model. Hosting costs apply per hour, separately from training, which means a fine-tuned model continues to incur charges as long as it is deployed even if no inference traffic flows. Training data is retained for 30 days after job completion for debugging purposes, then deleted [5].
The Azure OpenAI Assistants API provided capabilities for building conversational AI agents with persistent threads, file search, code interpretation, and function calling. However, Microsoft announced in 2025 that the Assistants API is deprecated and will be retired on August 26, 2026 [12]. Microsoft recommends migration to the Microsoft Foundry Agents service, which provides the same capabilities with deeper integration into the Azure AI Foundry ecosystem.
Foundry Agents offers enhanced features including multi-agent orchestration, improved tool integration, and native support for the evolving agentic AI paradigm that the industry has adopted since 2025. The migration story for callers using the OpenAI assistants beta is straightforward in spirit but mechanical in practice: thread, run, and tool semantics map onto Foundry Agents resources, but identifiers and SDKs change, so most customers schedule a code freeze and a single switchover rather than running both APIs in parallel.
Azure OpenAI Service is now a core component of Azure AI Foundry (the successor to Azure AI Studio), which provides a unified development environment for building AI applications. Through Foundry, developers can combine OpenAI models with other Azure services like Azure AI Search (for retrieval-augmented generation), Azure Cosmos DB, Power BI, and Azure Data Lake for end-to-end AI solutions [4].
Azure AI Foundry extends the platform beyond OpenAI models to include open-weight models from Meta (Llama), Mistral, Cohere, DeepSeek, and others through the Model Catalog. This positions Azure as an increasingly model-agnostic platform, though OpenAI models remain its primary differentiator. Some non-OpenAI models are offered through a Models-as-a-Service (MaaS) billing path that uses the same per-token meter as Azure OpenAI but bills the third party through Microsoft.
Key Foundry capabilities include:
The naming distinction between Azure OpenAI and Azure AI Foundry causes some friction in customer conversations. Azure OpenAI Service is the resource type and the contractual product. Azure AI Foundry is the developer experience layered on top. A customer can use Azure OpenAI without touching Foundry, but most new projects start in the Foundry portal, which provisions Azure OpenAI resources behind the scenes.
Microsoft maintains first-party Azure OpenAI client libraries in .NET, Python, JavaScript and TypeScript, Java, and Go. Each SDK exposes the same chat, completions, embeddings, image generation, transcription, and assistants endpoints, with Azure-specific extensions for managed identity, content filter results, and deployment routing.
| Language | Package | Notes |
|---|---|---|
| .NET | Azure.AI.OpenAI | Built on the Azure SDK conventions, supports Entra ID and DefaultAzureCredential |
| Python | azure-ai-openai (formerly openai with api_type="azure") | The newer azure-ai-openai package is distinct from openai-python |
| JavaScript / TypeScript | @azure/openai | Browser and Node usage, integrates with @azure/identity |
| Java | com.azure:azure-ai-openai | Uses the Azure Java SDK reactor patterns |
| Go | github.com/Azure/azure-sdk-for-go/sdk/ai/azopenai | First-class Go client |
The Python story is the most confusing for newcomers. The openai-python library shipped Azure support natively for years through the api_type="azure" configuration option, and that path still works. Microsoft also publishes its own azure-ai-openai package that wraps the same endpoints with stricter type hints and Azure-style credential plumbing. Both are supported. Many production codebases use openai-python for portability between Azure and OpenAI direct, while azure-ai-openai is preferred when teams want tighter Azure integration.
The REST API uses Azure-specific base URLs of the form https://{resource-name}.openai.azure.com/openai/deployments/{deployment-id}/chat/completions?api-version=2025-04-01-preview. The api-version query parameter is the most important difference from the OpenAI direct API: Azure freezes the API surface at specific versions and lets customers pin to a stable preview or GA version rather than tracking the live OpenAI API. As of May 2026, the recommended GA version is 2024-10-21, and the active preview version is 2025-04-01-preview. Microsoft typically supports each API version for at least 12 months after a successor ships.
The Responses API, OpenAI's agent-friendly successor to Chat Completions, reached Azure in preview in mid-2025 and went GA on most regions in early 2026. Customers building new agent workloads are pointed at Responses, while existing Chat Completions code continues to work and is not slated for retirement.
Azure OpenAI Service pricing mirrors OpenAI's API pricing structure closely, with some differences for provisioned capacity and regional differentials. Per-token pricing is published in US dollars per 1 million tokens and is the same across most Azure geographies, although a small number of regions carry a 5 to 10 percent uplift to reflect local infrastructure costs.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cached input (per 1M tokens) |
|---|---|---|---|
| GPT-5 Global | $1.25 | $10.00 | $0.13 |
| GPT-5 Pro Global | $15.00 | $120.00 | N/A |
| GPT-5-mini | $0.25 | $2.00 | N/A |
| GPT-5-nano | $0.05 | $0.40 | N/A |
| GPT-4.1 | $2.00 | $8.00 | $0.50 |
| GPT-4o | $2.50 | $10.00 | $1.25 |
| GPT-4o mini | $0.15 | $0.60 | $0.075 |
| o3 | $10.00 | $40.00 | $2.50 |
| o4-mini | $1.10 | $4.40 | $0.275 |
| GPT-3.5 Turbo (legacy) | $0.50 | $1.50 | N/A |
Prompt caching, called "context caching" in some Azure docs, applies automatically to most chat models when prompt prefixes match recent traffic. The discount is typically 50 to 90 percent depending on model. Global Batch deployments earn a flat 50 percent discount on top of the per-token rate.
For organizations with predictable, high-volume workloads, Provisioned Throughput Units (PTUs) offer guaranteed capacity with monthly or annual reservations. For example, GPT-5 Global provisioned capacity starts at 15 PTUs minimum, priced at $1 per hour, $260 per month, or $2,652 per year per PTU [7].
The decision between pay-as-you-go and provisioned capacity depends on usage volume and latency requirements. The following table summarizes the trade-offs [7][11]:
| Pricing model | Monthly cost range | Best for | Latency |
|---|---|---|---|
| Pay-as-you-go (Standard) | Variable, based on token usage | Unpredictable or low-volume workloads | Variable, may spike under load |
| Pay-as-you-go (Global) | Variable, based on token usage | Multi-region workloads | Routed to lowest-latency region |
| PTU monthly | From ~$2,448/month | Predictable high-volume workloads | Consistent, low |
| PTU annual | From ~$2,652/year per PTU | Long-term production deployments | Consistent, low |
Organizations should consider PTU reservations when monthly pay-as-you-go costs consistently exceed approximately $1,800, as the dedicated capacity provides both cost savings and performance benefits at that threshold [11]. Annual reservations carry an additional discount of roughly 15 percent over the monthly rate.
Reservations are managed in the Azure Reservations portal and can be exchanged or cancelled with prorated refunds, similar to how Azure Compute Reserved Instances work. This makes commitment less risky for customers who are uncertain about long-term volume.
The choice between Azure OpenAI Service and the direct OpenAI API is one of the most common decisions enterprises face when adopting OpenAI models.
| Factor | Azure OpenAI Service | Direct OpenAI API |
|---|---|---|
| Authentication | API key plus optional Microsoft Entra ID (recommended) | API key only |
| Data privacy | Data stays in Azure, not sent to OpenAI | Data processed by OpenAI, opt-out available |
| SLA | 99.9% guaranteed uptime | 99.82% observed uptime, no formal SLA |
| Model availability | Slight delay after OpenAI releases, days to weeks | Immediate access to new models |
| Setup complexity | Requires Azure account and quota request | Simple email signup and credit card |
| Compliance | HIPAA, SOC 2, ISO 27001, GDPR, FedRAMP High | Limited compliance certifications |
| Pricing | Same per-token rates plus PTU options and Global Batch | Pay-per-token plus standard Batch API |
| Ecosystem | Deep Azure and Microsoft 365 integration | Standalone API and ChatGPT-Enterprise links |
| Version control | Explicit deployment of pinned model versions | Auto-updated model aliases |
| Network isolation | Private endpoints, VNet integration, sovereign cloud | Shared endpoints |
| Content filtering | Configurable, multi-category with custom policies | Standard safety system, less configurable |
| Region selection | Customer chooses deployment region | OpenAI chooses, US-centric |
| API versioning | Pinned to dated api-version query parameter | Live API surface |
| Fine-tuning availability | GPT-4o, GPT-4o mini, GPT-4.1, GPT-3.5, selected o-series in some regions | Broader model coverage, faster after launch |
| Realtime API | gpt-realtime in select regions | Available globally, sometimes ahead of Azure |
| Sora video | Preview in East US 2 and Sweden Central | Available through OpenAI direct earlier |
For prototyping and experimentation, the direct OpenAI API offers a faster path to getting started. For production deployments in enterprise environments, Azure OpenAI Service provides stronger guarantees around uptime, data privacy, and compliance [8]. A common pattern is to develop against the OpenAI direct API for speed, then port to Azure OpenAI before going to production. The Python and Node SDKs are designed to make this swap mostly mechanical, requiring a configuration change rather than rewriting code.
Azure OpenAI Service has been adopted broadly across industries. Microsoft regularly publishes customer stories on the Azure blog and at Ignite. The list below is a representative sample rather than an exhaustive catalog.
| Customer | Industry | Reported use case |
|---|---|---|
| Coca-Cola | Consumer goods | Marketing copy generation, customer support, product innovation |
| BMW Group | Automotive | Vehicle data analysis, internal knowledge agents, technician support |
| ICRC (International Committee of the Red Cross) | Nonprofit | Multilingual translation for humanitarian operations |
| NBA | Sports | Game commentary tools, fan engagement, content production |
| Volvo Cars | Automotive | Software documentation, customer support, internal copilots |
| Mercedes-Benz | Automotive | In-car voice assistant pilots and engineering productivity |
| KPMG | Professional services | Tax research, audit support, internal copilots |
| Bayer | Pharmaceuticals | Document analysis and research summarization |
| AT&T | Telecommunications | Customer service summarization and call analytics |
| Carlsberg Group | Beverages | Commercial planning, marketing, internal copilots |
| Heineken | Beverages | Sales operations, marketing copy |
| H&R Block | Financial services | Tax assistance chatbot for filers |
| Vodafone | Telecommunications | Customer support agent and analytics |
| Air India | Aviation | Customer service chatbot "Maharaja" |
| Mediterranean Shipping Company (MSC) | Logistics | Document processing and shipping operations |
Microsoft itself is the largest single consumer of Azure OpenAI capacity. GitHub Copilot, Microsoft 365 Copilot, Copilot for Sales, Copilot for Service, the Bing search experience, the Microsoft Copilot consumer assistant, and the security copilot all run on top of Azure OpenAI Service. Their throughput volumes are large enough that Microsoft has historically reserved dedicated capacity in Azure regions before broader customer demand catches up.
In government and defense, the Department of Defense, the National Security Agency, NASA JPL, and several US federal civilian agencies have deployed workloads on Azure OpenAI through the FedRAMP High and Azure Government Secret offerings. UK government, Australian government, and several EU member states have run pilots through their respective sovereign clouds.
Running Azure OpenAI in production raises a set of operational questions that do not apply on the OpenAI direct API.
Quotas and capacity. Azure OpenAI quotas are per-region and per-deployment, expressed in tokens per minute (TPM) and requests per minute (RPM). Customers request quota increases through the Azure portal, and the response time can range from minutes for small bumps to several days for capacity that requires fleet provisioning.
Monitoring and observability. Azure Monitor exposes metrics for token consumption, latency percentiles, content filter activations, and rate limit errors. Customers commonly forward these metrics to Application Insights, Log Analytics, or external SIEM systems. The content filter signal is particularly useful for tracking what kinds of prompts users are submitting and how often the filter is firing.
Availability and failover. The service publishes per-region availability in the Azure Service Health dashboard. Customers running multi-region deployments typically pair an eastus2 deployment with a swedencentral or westeurope deployment and use Azure Front Door or a custom router to fail over. Global Standard and Global Provisioned deployments hide some of this complexity but still warrant a backup region for disaster recovery.
Quota management for PTU. PTU reservations sit on top of the Azure quota system and are not transferable across regions. A customer with PTU reservations in eastus cannot fail over to westus3 without a separate reservation. This makes capacity planning a multi-region exercise for PTU customers.
Cost management. Microsoft Cost Management groups Azure OpenAI charges by deployment, model, and resource group, which makes it relatively easy to track spend per application or business unit. Cost anomalies tend to come from runaway agent loops and unbounded max_tokens settings; teams that ship agents on Azure OpenAI typically wire up budget alerts and per-tenant rate limits before going to production.
Beyond the headline differences in privacy and SLA, several smaller behaviors trip up developers porting from openai.com to Azure.
Deployment indirection. On openai.com, callers reference a model by name (gpt-4o). On Azure, callers reference a deployment that points to a specific model and version (gpt-4o-2024-08-06). The deployment indirection is what enables customers to pin model versions, but it means callers must know the deployment name configured by their Azure administrator.
Api-version pinning. Each Azure call carries an api-version query parameter. Calling without one returns an error. Some preview features, such as the Responses API and reasoning effort parameter, are gated behind specific preview versions and are not visible to callers on older versions.
Content filter responses. Azure adds a prompt_filter_results and content_filter_results block to many responses, with severity scores per category. Code that strictly validates against the OpenAI response schema can break on Azure unless it allows extra fields.
Function calling and tool use. Tool use and structured outputs are supported on Azure for GPT-4o, GPT-4.1, the o-series, and GPT-5, but specific tool features (parallel tool calls, strict JSON schema enforcement) sometimes lag the OpenAI direct API by an api-version or two.
Fine-tuning availability. Fine-tuning availability on Azure tracks the OpenAI catalog with a delay. As of May 2026, GPT-5 fine-tuning is not yet available on Azure (it shipped on the OpenAI direct API in late 2025), and reinforcement fine-tuning for o-series models is in preview only on Azure.
Realtime API. The Realtime API on Azure is available in three regions and supports the gpt-realtime model that reached GA in late 2025. WebRTC support, voice activity detection, and built-in safety classifiers are at parity with OpenAI direct.
Microsoft's partnership with OpenAI is one of the defining business relationships in the AI industry. Beginning with a $1 billion investment in 2019, Microsoft has invested a cumulative total of approximately $13 billion in OpenAI [3]. This investment gave Microsoft exclusive rights to serve OpenAI's models through Azure and played a central role in establishing Azure as a leading platform for AI workloads.
In October 2025, the two companies announced a restructured partnership as OpenAI transitioned from a capped-profit to a for-profit public benefit corporation. Under the new terms, Microsoft holds approximately 27 percent of OpenAI on a diluted basis, with a stake valued at around $135 billion. Azure remains the exclusive cloud provider for stateless OpenAI APIs, and OpenAI committed to $250 billion in Azure spending [3].
The partnership has been financially significant for both sides. In Q2 FY2026 (reported January 2026), Microsoft earned $7.6 billion in revenue attributable to OpenAI-related Azure consumption [13]. The investment also carried costs; in Q1 FY2026, Microsoft reported a $3.1 billion drop in net income tied to its OpenAI investment accounting [3].
The restructuring also formally allowed OpenAI to contract with other clouds for training capacity. OpenAI subsequently announced a $50 billion compute partnership with Amazon Web Services and a smaller deal with Google Cloud, while keeping Azure as the host for the API customers consume. For Azure OpenAI Service, the restructuring is mostly invisible: the contract terms with Azure customers are unchanged, and the API endpoints and deployment types continue to behave the same way.
Azure OpenAI Service competes with a small set of cloud-hosted frontier model offerings. The closest analogues are Amazon Bedrock and Google Cloud Vertex AI, which package frontier models from Anthropic, Google, Mistral, Cohere, and others under similar enterprise terms.
| Feature | Azure OpenAI Service | AWS Bedrock | Google Vertex AI |
|---|---|---|---|
| Primary models | OpenAI GPT-5, GPT-4o, o-series | Claude, Llama, Nova, Mistral | Gemini, Claude (selected) |
| Frontier exclusivity | OpenAI hosted-API exclusive on Azure | Anthropic primary cloud | Gemini exclusive |
| Authentication | Microsoft Entra ID, API keys | AWS IAM | Google Cloud IAM |
| Data residency | Region selection plus Data Zones | Region selection | Region selection |
| Provisioned capacity | Provisioned Throughput Units (PTUs) | Provisioned Throughput | Reserved capacity |
| Pricing parity with direct API | Tracks OpenAI rates closely | Tracks Anthropic rates closely | Tracks Google rates closely |
| Compliance | SOC 2, HIPAA, FedRAMP High, ISO 27001, IRS 1075 | SOC, HIPAA, FedRAMP | SOC, HIPAA, FedRAMP |
| Sovereign cloud | Microsoft Cloud for Sovereignty | AWS Top Secret region | Google sovereign clouds (with partners) |
The core trade-off is which set of frontier models a customer needs. Customers committed to OpenAI models for product reasons usually pick Azure OpenAI; customers committed to Anthropic Claude usually pick Bedrock; customers committed to Gemini usually pick Vertex. In practice, larger enterprises run all three, with Azure OpenAI as the default for Microsoft-aligned workloads.
As of May 2026, Azure OpenAI Service is the dominant enterprise channel for accessing OpenAI models. The general availability of GPT-5 in Azure AI Foundry brought the platform's most powerful model yet, alongside the cost-optimized GPT-5-mini and GPT-5-nano variants. The Realtime API for low-latency voice and the Responses API for agent workloads have become the recommended endpoints for new applications, while Chat Completions remains supported for existing code.
Microsoft has expanded the Azure AI Foundry ecosystem to include a broader range of models beyond OpenAI's, including open-weight models from Meta, Mistral, Cohere, DeepSeek, and others, positioning it as an increasingly model-agnostic platform. The integration of AI agents, code interpreters, and tool use capabilities reflects the industry's shift toward agentic AI workflows. The deprecation of the Assistants API in favor of the Foundry Agents service signals Microsoft's commitment to a more integrated, platform-level approach to agent development.
The relationship between Microsoft and OpenAI continues to evolve. Reports in early 2026 suggested tensions over OpenAI's separate cloud deals, including the reported $50 billion agreement with Amazon Web Services, which Microsoft viewed as potentially conflicting with Azure exclusivity provisions [3]. Despite these dynamics, Azure OpenAI Service remains central to both companies' strategies and continues to serve as the primary enterprise on-ramp for OpenAI's technology.
Looking forward, the most likely changes through 2026 are wider regional availability for Sora 2 and the Realtime API, broader fine-tuning support for the GPT-5 family, and the gradual phase-out of the GPT-3.5 Turbo deployments in favor of GPT-4o mini and GPT-5-nano. The naming question (Azure OpenAI versus Azure AI Foundry Models) is unlikely to be resolved in the short term, because the legacy resource type continues to ship and the brand-level Foundry rollout has not displaced it.