DeepInfra is an AI inference cloud company founded in September 2022 and headquartered in Palo Alto, California. The company operates a purpose-built GPU infrastructure platform that hosts open-source and open-weight AI models and delivers them to developers through a low-latency, OpenAI-compatible REST API. Rather than asking customers to provision their own hardware, DeepInfra sells access on a per-token, pay-as-you-go basis, handling all underlying server management, GPU scheduling, and model optimization internally. By May 2026 the platform processed nearly five trillion tokens per week across more than 190 models and operated eight data centers in the United States.
DeepInfra competes directly with Together AI, Fireworks AI, Replicate, Groq, and other inference-as-a-service providers, while also appearing as a provider option within aggregator platforms such as OpenRouter. The company's primary differentiation is cost: it consistently ranks among the cheapest per-token providers for large open-weight models, often 50 to 80 percent below the price of equivalent proprietary APIs.
The company is incorporated as Deep Infra, Inc. and operates under the brand name DeepInfra. Its technical team includes engineers from imo, and its advisors include alumni of WhatsApp, Twitch, and Weights & Biases. The team of roughly 20 to 30 people spans engineering, customer success, sales, and operations, which is small relative to the scale of infrastructure the company manages. This lean headcount is a deliberate operating strategy: by owning hardware and focusing on a narrow product surface, DeepInfra aims to keep costs low enough to sustain prices that larger, more diversified competitors cannot match.
DeepInfra was founded in September 2022 by three co-founders who had previously worked together building backend infrastructure for imo, a mobile messaging application that reached over one billion Play Store downloads and 200 million monthly active users. Processing billions of messages daily at imo gave the founding team first-hand experience managing large GPU and CPU server fleets at scale, and a strong opinion about cloud economics: renting infrastructure from hyperscalers was far more expensive than owning it.
Nikola Borisov, who became CEO, studied at Northwestern University and later worked at Microsoft and HalloApp before co-founding DeepInfra. He and his co-founders Georgios Papoutsis and Yessenzhar Kanapin share a background in competitive programming; investor materials describe all three as having won top honors in international programming Olympiads. Their shared background in systems engineering, rather than in machine learning research, shaped the company's infrastructure-first identity.
When the founders incorporated DeepInfra in late 2022, their original plan was to host a broad catalog of machine learning models. The release of ChatGPT in November 2022 and the subsequent surge in demand for large language model APIs prompted the team to concentrate on LLM inference specifically. The company operated in stealth mode through most of 2023 before publicly announcing its seed round in November of that year.
During 2023, DeepInfra quietly built out its infrastructure and began signing up early customers. The company's early positioning was simple: offer the same popular open-source models as cloud incumbents but at meaningfully lower cost, with minimal friction to get started. Developers could switch from OpenAI to DeepInfra by changing a single base URL in their existing OpenAI SDK configuration, a deliberate design choice that lowered the barrier to adoption.
The open-source model landscape shifted substantially over 2023. Meta released the original LLaMA weights in February and the improved Llama 2 family in July. Mistral AI released Mistral 7B in September. Each release generated developer interest in affordable inference, and DeepInfra was positioned to capture that demand before the company had made any public announcements. The company added new models to its catalog within days of their public release, establishing a pattern of fast model availability that would remain a differentiator.
By the time the seed round was announced in November 2023, DeepInfra had built enough customer interest to attract institutional backing, though the company revealed little about its revenue or exact user counts at that stage. The Felicis investment thesis, summarized in a blog post titled "A Deep Dive on Deep Infra," referenced NVIDIA CEO Jensen Huang's observation that inference compute demand would grow by roughly a billion times, and positioned DeepInfra as a beneficiary of that growth by owning infrastructure rather than renting it.
DeepInfra has raised a total of approximately $133 million across three funding rounds between 2023 and 2026.
In November 2023, DeepInfra raised $8 million in seed funding. The round was led by A.Capital Ventures and Felicis Ventures. SV Angel also participated. The funding announcement confirmed the company's focus on making open-source model inference affordable and revealed that the founding team had scaled imo to 200 million users. Borisov described the round as validation of the team's conviction that open-source models would become the practical choice for production AI workloads.
In April 2025, DeepInfra raised $18 million in a Series A round led by Felicis and angel investor Georges Harik, one of Google's earliest engineers. The stated use of funds was expanding access to NVIDIA Blackwell GPU capacity, reflecting the company's view that the transition to next-generation accelerators would require meaningful capital to execute quickly. The round brought total funding to approximately $26 million.
DeepInfra announced a $107 million Series B on May 4, 2026. The round was co-led by 500 Global and Georges Harik, with participation from A.Capital Ventures, Crescent Cove, Felicis, NVIDIA, Peak6, Samsung Next, Supermicro, and Upper90. NVIDIA's participation was notable both as a financial signal and as a reflection of DeepInfra's close technical collaboration with NVIDIA on the Blackwell and Vera Rubin GPU architectures and the NVIDIA Dynamo distributed inference software platform.
At the time of the Series B announcement, the company reported that its token processing volume had grown 25 times since the Series A, and that revenue had tripled since early 2026. Tony Wang, Managing Partner at 500 Global, said purpose-built inference infrastructure would be as "fundamental to the next phase of AI as compute was to the last."
Borisov framed the raise in terms of a structural shift in how AI is consumed: "Inference is no longer a thin layer. It is the system constraint that will define the majority of workloads." The proceeds were designated for expanding global compute capacity, deepening developer tooling, and accelerating support for agentic and emerging open-source models.
DeepInfra's platform has three main surface areas: a serverless inference API, GPU instances for dedicated compute, and a model catalog browser.
The core product is a serverless API that exposes popular open-source models for text generation, embeddings, image generation, text-to-speech, speech recognition, text-to-video, and multimodal tasks. The API is fully compatible with the OpenAI SDK: developers point the base_url parameter to https://api.deepinfra.com/v1/openai and authenticate with a DeepInfra API key. All other request and response formats remain identical to OpenAI's API.
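For illustration, the following minimal Python sketch shows that switch using the standard OpenAI SDK and the base URL above; the model identifier is illustrative and should be replaced with an ID from DeepInfra's catalog.

```python
from openai import OpenAI

# Point the standard OpenAI client at DeepInfra's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key="<DEEPINFRA_API_KEY>",  # obtained from the DeepInfra dashboard
)

# The model identifier below is illustrative; any text-generation model from the catalog works.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Summarize what an inference API does."}],
)
print(response.choices[0].message.content)
```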
The serverless pricing model means customers pay per token consumed with no idle GPU charges, no seat fees, and no minimum commitments. This contrasts with dedicated endpoint products that charge by the hour regardless of actual usage.
Supported request parameters include temperature, top_p, max_tokens (up to 16,384 for most models), stop sequences, presence and frequency penalties, structured JSON response format, tool calling (function calling), reasoning_effort controls for reasoning models, and streaming via server-sent events. A service_tier parameter allows priority inference at a 20 percent surcharge for customers with latency-sensitive workloads.
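A hedged sketch of how several of these parameters map onto an OpenAI-SDK request is shown below; the model ID and the service_tier value are assumptions for illustration, not confirmed DeepInfra identifiers.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="<DEEPINFRA_API_KEY>")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model ID
    messages=[{"role": "user", "content": "List three uses of embeddings."}],
    temperature=0.7,           # sampling temperature
    top_p=0.9,                 # nucleus sampling cutoff
    max_tokens=512,            # capped at 16,384 for most models
    stop=["\n\n"],             # stop sequences
    presence_penalty=0.0,
    frequency_penalty=0.2,
    service_tier="priority",   # assumed value; priority inference carries a 20% surcharge
)
print(response.choices[0].message.content)
```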
In addition to serverless inference, DeepInfra offers GPU instances for customers who need dedicated compute. GPU instances give users full SSH access to containerized environments pre-configured with NVIDIA drivers and CUDA. Supported GPU types include the A100, H100, H200, B200, and B300. Hourly pricing at time of publication was $0.89/hr for A100s, $1.79/hr for H100s, $2.19/hr for H200s, and $2.79/hr for B200s.
GPU instances are suited to model training, fine-tuning runs, research experiments, and inference workloads that require a persistent, customizable environment rather than the stateless serverless API. Containers launch within minutes and customers are billed only for active runtime. Because the underlying hardware is DeepInfra-owned rather than rented from a hyperscaler, the hourly rates are competitive with or below equivalent instance pricing from AWS, Google Cloud, or Azure.
The GPU instances product sits alongside the serverless API rather than replacing it. For most production inference workloads, the serverless API is more cost-effective because customers pay only for tokens generated rather than for the total hours a GPU is provisioned. GPU instances make sense for workloads where the developer needs direct control over the software environment, where continuous GPU utilization is high enough to justify a persistent instance, or where custom model weights need to be loaded outside of DeepInfra's managed catalog.
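As a rough illustration of that trade-off, the sketch below compares serverless per-token billing with a dedicated H100 at the hourly rate quoted above. The throughput and utilization figures are assumptions chosen only to make the arithmetic concrete, not measured DeepInfra numbers.

```python
# Rough break-even sketch: serverless per-token billing vs. a dedicated GPU instance.
# Throughput and utilization figures are illustrative assumptions, not measured values.

h100_hourly = 1.79             # $/hr, DeepInfra H100 rate
assumed_throughput_tps = 1000  # tokens/sec the instance is assumed to sustain (assumption)
utilization = 0.30             # fraction of each hour the GPU is actually busy (assumption)

tokens_per_hour = assumed_throughput_tps * 3600 * utilization
dedicated_cost_per_m = h100_hourly / (tokens_per_hour / 1_000_000)

serverless_output_rate = 0.32  # $/M output tokens (Llama 3.3 70B Turbo catalog rate)

print(f"Dedicated:  ${dedicated_cost_per_m:.2f} per million tokens at {utilization:.0%} utilization")
print(f"Serverless: ${serverless_output_rate:.2f} per million tokens regardless of utilization")
```

Under these assumed numbers the dedicated instance only becomes cheaper once utilization is sustained at a high level, which matches the guidance above.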
DeepInfra's model catalog, accessible at deepinfra.com/models, organizes available models across categories including text generation, embeddings, text-to-image, text-to-speech, text-to-video, automatic speech recognition, rerankers, OCR, and zero-shot image classification. The catalog is updated frequently as new open-source models are released.
DeepInfra's catalog covers the major families of publicly available open-weight models. The company adds new models quickly after community release, which is one of its competitive advantages over providers with more selective catalogs.
The language model catalog includes models from Meta's LLaMA lineage (Llama 3.1 8B through Llama 3.3 70B and the Llama 4 generation), Alibaba's Qwen series (Qwen3-32B, Qwen3-Max, Qwen3.5-397B and related mixture-of-experts variants), DeepSeek's models (DeepSeek-V3.1, DeepSeek-V3.2, DeepSeek-R1 reasoning variants, and the DeepSeek-V4 family), Mistral AI models (Mistral-Nemo and related), and NVIDIA's Nemotron family. The catalog also includes GLM models from Zhipu AI, Kimi models from Moonshot AI, and various other community-maintained models.
For reference, the Llama 3.1 8B Instruct model is priced at $0.02 per million input tokens and $0.05 per million output tokens, while Llama 3.3 70B Turbo is $0.10 input and $0.32 output. DeepSeek-V3.2, a 671-billion parameter mixture-of-experts model, runs at $0.26 input and $0.38 output. The Qwen3-Max model costs $1.20 input and $6.00 output per million tokens.
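Using the Llama 3.3 70B Turbo rates above, a back-of-the-envelope monthly cost calculation for an illustrative workload looks like this; the token volumes are assumptions.

```python
# Monthly cost sketch using the Llama 3.3 70B Turbo rates quoted above.
input_rate = 0.10   # $ per million input tokens
output_rate = 0.32  # $ per million output tokens

monthly_input_tokens = 50_000_000   # illustrative workload
monthly_output_tokens = 10_000_000

cost = (monthly_input_tokens / 1e6) * input_rate + (monthly_output_tokens / 1e6) * output_rate
print(f"Estimated monthly cost: ${cost:.2f}")  # 50 * 0.10 + 10 * 0.32 = $8.20
```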
DeepInfra also hosts several API-based proprietary models as a reseller, including Anthropic's Claude and Google's Gemini families, at per-token rates that closely track the underlying providers' own API prices. While the majority of DeepInfra's catalog and its competitive identity center on open-weight models, the inclusion of proprietary models allows developers to use a single API key and billing account for their full model mix.
Reasoning models (those designed to generate extended chain-of-thought before producing a final answer) are supported with a reasoning_effort parameter that controls the depth of internal reasoning. DeepSeek-R1 variants and Qwen's thinking models use this parameter to balance response latency against answer quality, which is particularly relevant for agent systems where some steps require heavy reasoning and others do not.
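A sketch of how that control might be passed through the OpenAI SDK follows; the model ID and the accepted effort values ("low", "medium", "high") are assumptions based on common convention rather than confirmed DeepInfra values.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="<DEEPINFRA_API_KEY>")

# Request heavier internal reasoning for a hard planning step; lighter values trade
# answer quality for lower latency on routine steps.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # illustrative reasoning-model ID
    messages=[{"role": "user", "content": "Plan a three-step migration from API v1 to v2."}],
    reasoning_effort="high",          # assumed values: "low" | "medium" | "high"
)
print(response.choices[0].message.content)
```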
Beyond pure text generation, the platform hosts image generation models from the FLUX series (including FLUX-2-klein in 4B and 9B parameter variants), text-to-speech synthesis through Qwen3-TTS and Qwen3-TTS-VoiceDesign (which supports natural language voice design instructions in ten languages), automatic speech recognition through Whisper and Voxtral, and multimodal language models like NVIDIA's Nemotron-3-Nano-Omni. Embedding models include BAAI's bge family, and reranker models support retrieval-augmented generation pipelines.
DeepInfra supports LoRA (Low-Rank Adaptation) adapters for both text and image models, allowing customers to deploy fine-tuned variants of base models through the serverless API. This enables parameter-efficient customization without requiring a dedicated GPU instance, though the range of supported adapter formats and base models is narrower than what specialized fine-tuning platforms offer. LoRA adapters are merged or applied at inference time, and billing follows the same per-token structure as base model inference. Customers who need full supervised fine-tuning workflows that involve dataset upload, training job execution, checkpoint management, and evaluation typically use the GPU instances product or a separate fine-tuning platform before deploying the resulting weights on DeepInfra.
In 2026, DeepInfra became an official Hugging Face Inference Provider. This means models hosted on DeepInfra can be accessed directly from the Hugging Face Hub using either the Hugging Face Python or JavaScript SDKs, the OpenAI SDK pointed at Hugging Face's OpenAI-compatible router, or direct DeepInfra API calls. Developers can filter for DeepInfra-hosted models at huggingface.co/models using the inference_provider=deepinfra parameter. Hugging Face passes through DeepInfra's per-token rates without any added fees, so costs are identical regardless of access path.
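A hedged sketch of the Hugging Face access path is shown below, assuming a recent huggingface_hub release that supports the provider argument; the model ID is illustrative.

```python
from huggingface_hub import InferenceClient

# Route the request through the Hugging Face Hub to DeepInfra as the inference provider.
client = InferenceClient(
    provider="deepinfra",
    api_key="<HF_TOKEN_OR_DEEPINFRA_KEY>",  # billing flows through the corresponding account
)

completion = client.chat_completion(
    model="meta-llama/Llama-3.3-70B-Instruct",  # illustrative Hub model ID
    messages=[{"role": "user", "content": "What is an inference provider?"}],
)
print(completion.choices[0].message.content)
```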
OpenAI compatibility is the central design decision in DeepInfra's API. Because the request schema, authentication pattern, and response format mirror OpenAI's completions and embeddings APIs exactly, a developer using the OpenAI Python SDK can switch providers by changing two lines: the base_url and the api_key. There is no new SDK to install, no new documentation to read for basic usage, and no migration of prompt templates.
This compatibility extends to streaming, tool/function calling, structured output (JSON mode), multi-turn message history, and the system/user/assistant message role schema. Models are referenced by their DeepInfra model identifier rather than OpenAI's model names, but the calling convention is otherwise identical.
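For instance, streaming uses the same stream=True flag and delta-chunk format as the OpenAI SDK; the model ID below is illustrative.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="<DEEPINFRA_API_KEY>")

# Multi-turn history with the standard role schema, streamed token by token.
stream = client.chat.completions.create(
    model="Qwen/Qwen3-32B",  # illustrative model ID
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain tool calling in one paragraph."},
    ],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```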
Access to the API requires an API key obtained from the DeepInfra dashboard. The key is passed as a Bearer token in the HTTP Authorization header. DeepInfra does not currently offer organization-level access control or fine-grained key scoping, which is a gap compared to enterprise-oriented providers.
DeepInfra operates tiered rate limits based on account tier and historical usage. For customers with latency requirements, the service_tier parameter in the request body enables priority queuing at a 20 percent surcharge over standard rates. This feature is relevant for production systems where tail latency matters more than marginal cost savings.
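At the HTTP level, the key travels as a Bearer token. The sketch below assumes the chat-completions path follows the usual OpenAI layout under DeepInfra's base URL, and the model ID and service_tier value are illustrative.

```python
import requests

# Raw HTTP call; the endpoint path assumes the standard OpenAI layout under the base URL.
resp = requests.post(
    "https://api.deepinfra.com/v1/openai/chat/completions",
    headers={"Authorization": "Bearer <DEEPINFRA_API_KEY>"},
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # illustrative model ID
        "messages": [{"role": "user", "content": "ping"}],
        "service_tier": "priority",                   # assumed value; 20% surcharge applies
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```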
By early 2026, approximately 30 percent of DeepInfra's weekly token volume came from autonomous AI agent systems rather than interactive user-facing applications. The company's infrastructure is optimized for continuous high-volume inference, which differs from the bursty interactive pattern of chatbot applications. A typical agentic task might invoke 50 to 100 or more model calls in sequence; DeepInfra's serverless architecture scales horizontally to accommodate this pattern without requiring customers to pre-provision capacity.
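A minimal sketch of that sequential-call pattern appears below; the loop structure and model ID are purely illustrative, not a description of any specific customer workload.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai", api_key="<DEEPINFRA_API_KEY>")

# Illustrative agent loop: each step feeds the previous answer back into the context,
# producing a chain of dependent model calls rather than a single request.
history = [{"role": "user", "content": "Draft a plan to index a document corpus."}]
for _ in range(5):  # real agent runs may chain 50-100+ such calls
    reply = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3.2",  # illustrative model ID
        messages=history,
    )
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    history.append({"role": "user", "content": "Refine the previous step."})
```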
DeepInfra has published guides for using its platform with OpenClaw, a local-first autonomous AI agent framework that connects messaging platforms (WhatsApp, Telegram, Discord, and others) to LLM providers. OpenClaw integrated DeepInfra as a supported provider in 2026, and the use case illustrates the platform's suitability for persistent, high-volume agent deployments.
DeepInfra uses a pure pay-as-you-go pricing model with no setup fees, no minimum spend, and no long-term commitments. Language models are billed per token (separately for input and output), while image generation models are billed per image at rates that scale by output resolution, and audio models are billed per minute of audio processed.
The following table shows indicative pricing for selected models as of mid-2026. Prices are per million tokens unless otherwise noted.
| Model | Input ($/M tokens) | Output ($/M tokens) | Notes |
|---|---|---|---|
| Llama 3.1 8B Instruct | $0.02 | $0.05 | Meta open-weight |
| Mistral-Nemo | $0.02 | $0.04 | Mistral AI open-weight |
| Llama 3.3 70B Turbo | $0.10 | $0.32 | Meta open-weight |
| DeepSeek-V3.2 | $0.26 | $0.38 | 671B MoE, efficient reasoning |
| Qwen3-32B | $0.08 | $0.28 | Alibaba open-weight |
| DeepSeek-R1-0528 | $0.50 | $2.15 | Reasoning model |
| Qwen3-Max | $1.20 | $6.00 | Alibaba flagship MoE |
| claude-3-7-sonnet | $3.30 | $16.50 | Anthropic, via API resale |
| claude-4-sonnet | $3.30 | $16.50 | Anthropic, via API resale |
| claude-4-opus | $16.50 | $82.50 | Anthropic, via API resale |
| gemini-2.5-pro | $1.25 | $10.00 | Google, via API resale |
For GPU instances, pricing is per hour of active runtime:
| GPU | $/hr |
|---|---|
| NVIDIA A100 | $0.89 |
| NVIDIA H100 | $1.79 |
| NVIDIA H200 | $2.19 |
| NVIDIA B200 | $2.79 |
DeepInfra does not advertise a free tier, though new accounts receive usage credits in the dashboard. Enterprise customers can negotiate volume-based discounts for high-token workloads.
The inference-as-a-service market includes several providers with overlapping model catalogs but differing emphases on price, speed, feature breadth, and target customer. The table below compares DeepInfra against its three most direct competitors.
| Dimension | DeepInfra | Together AI | Fireworks AI | Replicate |
|---|---|---|---|---|
| Founded | 2022 | 2022 | 2022 | 2019 |
| Primary focus | Low-cost open-source inference | Inference + fine-tuning + research | Low-latency production inference | Multi-modal model deployment |
| Pricing model | Per token | Per token | Per token | Per-second GPU billing / per prediction |
| Price for Llama 3.1 70B input | $0.10/M | $0.88/M | $0.90/M | Varies |
| OpenAI-compatible API | Yes | Yes | Yes | No (proprietary format) |
| Fine-tuning on platform | Limited (LoRA) | Yes (full fine-tuning) | Yes (full fine-tuning) | No |
| Custom model hosting | Limited | Yes | Yes | Yes |
| Model catalog breadth | 190+ models | 200+ models | Curated, smaller | Thousands (community) |
| Notable performance focus | Cost efficiency | Speed + scale | Low p99 latency | Ease of deployment |
| Inference engine | Proprietary | Together Inference Engine | FireAttention | Generic cloud GPU |
| Enterprise certifications | SOC 2, ISO 27001 | SOC 2 | SOC 2, HIPAA | SOC 2 |
| GPU rental option | Yes | Yes | Yes | No |
| Hugging Face integration | Yes (official provider) | Partial | Partial | Yes |
Together AI and DeepInfra are the closest competitors by business model and model catalog. Both offer OpenAI-compatible APIs, both host the major open-weight model families, and both price per token. The principal difference is that Together AI offers full fine-tuning on the same platform as inference, supports private model deployments at scale, and has invested heavily in research tooling (the Together Research team published work on FlashAttention and related inference optimization techniques). DeepInfra's main advantage over Together AI is price: Llama 4 Maverick on DeepInfra was reported to be approximately 76 percent cheaper on input tokens and 67 percent cheaper on output tokens compared to Together AI pricing in early 2026.
Together AI has raised substantially more capital (over $100 million by 2024) and employs a larger team, which reflects both broader platform ambitions and higher operating costs.
Fireworks AI is built around its proprietary FireAttention inference engine, which the company claims delivers approximately four times lower latency than competing open-source inference engines. The product targets production applications where consistent low p99 latency matters more than raw throughput, and Fireworks has focused on enterprise compliance (SOC 2, HIPAA) and deployment flexibility. DeepInfra's throughput is competitive, but Fireworks AI consistently wins benchmarks measuring first-token latency.
For cost-conscious workloads that can tolerate moderate latency variance, DeepInfra is generally cheaper. For real-time applications with strict SLA requirements, Fireworks AI's performance characteristics may justify the higher per-token cost.
Replicate has a fundamentally different model: it hosts thousands of community-contributed models across text, image, audio, and video, with a per-prediction or per-second GPU billing approach. The emphasis is on accessibility and breadth. Replicate's API is not OpenAI-compatible, so switching from Replicate to a standard LLM provider requires more code changes than switching between DeepInfra and Together AI.
Replicate's strength is its model discovery surface: developers can find and deploy obscure or experimental models quickly without any infrastructure setup. DeepInfra's strength is consistently low pricing on the specific high-demand models that production LLM applications use. The two platforms target overlapping but distinct use cases, with Replicate better suited to prototyping across diverse model types and DeepInfra better suited to production scale with a known model.
DeepInfra's primary customer base is startups integrating AI into their products. The company has described its initial market as early-stage companies that need production-quality inference without the overhead of managing their own hardware. Enterprise adoption has been growing but was slower to materialize in the company's early years, in part because DeepInfra initially lacked compliance certifications that larger organizations require. The addition of SOC 2 and ISO 27001 certifications, along with GDPR and HIPAA alignment, has opened the platform to more enterprise procurement conversations.
Venice AI, a privacy-focused AI assistant platform, is a publicly named customer. Jesse Proudman, President and CTO of Venice AI, described DeepInfra as providing "access to best-in-class models with the reliability and speed we need to ship."
OpenClaw, an autonomous agent framework for messaging platforms, integrated DeepInfra as a provider in a 2026 update, citing reliability improvements as a primary motivation. OpenClaw connects messaging platforms including WhatsApp, Telegram, and Discord to LLM backends, and the integration allows OpenClaw users to route their agent workloads to DeepInfra's infrastructure. The relationship illustrates a broader trend in DeepInfra's customer mix: by early 2026, agent-driven systems generated approximately 30 percent of weekly token volume, up from a negligible share in 2023. Agentic workloads are an economically attractive segment for DeepInfra because their continuous, high-volume token consumption patterns map naturally to serverless per-token pricing and they tend to sustain predictable baseline load rather than generating traffic spikes.
Developers using OpenRouter, the LLM API aggregator, can route requests through DeepInfra as a backend provider. OpenRouter allows users to specify provider preferences or routing rules, and DeepInfra's low prices make it a common default selection for cost-sensitive routing configurations. This gives DeepInfra indirect exposure to OpenRouter's developer base without requiring those users to create a separate DeepInfra account.
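A sketch of such a routing configuration follows. The provider-preference object is based on OpenRouter's documented provider-routing options; treat the exact field names and the model slug as assumptions to be checked against current OpenRouter documentation.

```python
import requests

# Route an OpenRouter request with a stated preference for DeepInfra as the backend provider.
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer <OPENROUTER_API_KEY>"},
    json={
        "model": "meta-llama/llama-3.3-70b-instruct",      # illustrative model slug
        "messages": [{"role": "user", "content": "Hello"}],
        "provider": {"order": ["DeepInfra"]},              # assumed provider-preference shape
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```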
The Hugging Face partnership extends the distribution surface further: when a developer browses models on the Hugging Face Hub and selects DeepInfra as the inference provider, they can run inference directly from the browser's Inference Playground or from their own code using the Hugging Face SDK, with billing flowing through either their DeepInfra account or their Hugging Face account.
Common use cases include:

- Customer-facing chat and assistant products built on open-weight language models
- Autonomous agent systems that issue long sequences of model calls
- Retrieval-augmented generation pipelines that combine the catalog's embedding and reranker models
- Image generation, speech recognition, and text-to-speech features added to existing applications
- Cost-driven migration of existing OpenAI SDK integrations to cheaper open-weight models
DeepInfra operates a zero data retention policy: input prompts, model outputs, and user metadata are not stored beyond the immediate request lifecycle. This policy is the default behavior for all accounts rather than a premium add-on, and it addresses one of the most common concerns that enterprise buyers raise about third-party inference providers. Many organizations in healthcare, finance, and legal services are prohibited by internal policy or regulation from sending data to systems that retain it, making zero-retention a gating requirement rather than a preference.
The company holds SOC 2 Type II and ISO 27001 certifications and states compliance with GDPR and HIPAA requirements. SOC 2 Type II certification indicates that an independent auditor has reviewed not just the existence of security controls but their operating effectiveness over a period of time, which is a more substantive assurance than a Type I audit. ISO 27001 provides an internationally recognized framework for information security management systems.
All inference runs in US-based data centers. As of 2026, the company does not offer private VPC deployment, on-premises installation, or data processing in jurisdictions outside the United States. These are limitations for customers with strict data residency requirements in Europe, Asia, or other regions.
Infrastructure is hosted in Tier 3 data centers under DeepInfra's own hardware procurement and management rather than rented from a hyperscaler. Tier 3 data centers have redundant power and cooling paths with N+1 fault tolerance and a design uptime above 99.9 percent. The company has stated a 99.982 percent uptime SLA for its dedicated infrastructure. Security documentation is available at trust.deepinfra.com.
DeepInfra operates eight US data centers and owns the GPU hardware it uses for inference, rather than renting capacity from AWS, Google Cloud, or Azure. This vertical integration gives the company direct control over cost structure and hardware refresh cycles, which Borisov has described as the foundational insight from the imo experience: cloud rental is expensive compared to owning hardware at scale.
The company is an early adopter of NVIDIA's latest GPU generations, including Blackwell (B100, B200) and the Vera Rubin architecture. NVIDIA's Dynamo distributed inference software platform, which orchestrates inference across clusters of accelerators, is used internally. DeepInfra has claimed up to 20 times improvement in inference cost efficiency through the combination of Blackwell hardware and Dynamo software optimization.
DeepInfra also collaborates with NVIDIA on the Nemotron model family and the NemoClaw agentic framework, positioning itself within NVIDIA's broader open AI ecosystem strategy. NVIDIA's participation as an investor in the Series B round reflects this technical alignment.
The company's stated long-term vision is to become what Felicis described as "the CDN of the LLM Age": a globally distributed inference network with one major infrastructure hub per continent, routing model requests with low latency the way a content delivery network routes static assets. As of 2026, the platform is US-only, with international expansion planned using Series B proceeds.
Despite its cost advantages, DeepInfra has several limitations that developers and enterprises should consider before committing to the platform.
Fine-tuning support is limited relative to competitors. DeepInfra supports LoRA adapter deployment but does not offer the full supervised fine-tuning workflow that Together AI and Fireworks AI provide. Organizations that want to fine-tune models and then serve the fine-tuned version through the same platform will find DeepInfra less capable.
Latency performance varies. DeepInfra is competitive on throughput (sustained tokens per second over a long generation) but its time-to-first-token latency is less consistent than Groq, which uses custom silicon, or Fireworks AI, which has heavily optimized its inference stack for first-token speed. Applications where the user experiences a noticeable pause before streaming begins may perceive this as a product quality issue.
Geographic availability is restricted to US-based infrastructure as of 2026. Customers with data residency requirements outside the United States, or those for whom cross-continental network latency is a concern, cannot currently satisfy those requirements with DeepInfra.
The platform does not offer private VPC deployment or dedicated tenant isolation at the network level, which is a limitation for regulated industries that require hard isolation between customer workloads.
Organization-level access controls, fine-grained API key scoping, and audit logging are less developed than what enterprise-grade cloud providers offer, though the company's certifications indicate a baseline level of security maturity.
Finally, DeepInfra's business model faces structural competitive pressure. The open-source model inference market is commodity-like: if a new entrant or hyperscaler chooses to undercut DeepInfra on price, the switching cost for customers is low. The company's response to this risk is to differentiate on infrastructure ownership (lower cost basis), model availability speed, and the growing complexity of agentic inference workloads, which require more than a simple API wrapper.