Fireworks AI is an artificial intelligence infrastructure company that provides a high-performance inference platform for deploying and serving large language models (LLMs), image generation models, audio models, and embedding models. Founded in 2022 by former members of the PyTorch team at Meta, Fireworks AI focuses on delivering fast, cost-efficient, and production-ready AI inference through optimization technologies including its proprietary FireAttention kernels, speculative decoding, and adaptive serving configurations. The company is headquartered in Redwood City, California.
As of late 2025, the platform processes over 10 trillion tokens per day, serves more than 10,000 companies, and supports hundreds of thousands of developers building AI applications across text, image, audio, and multimodal domains.
Fireworks AI was co-founded in October 2022 by Lin Qiao and six other engineers who had previously worked together at Meta and Google. Lin Qiao served as Senior Director of Engineering at Meta from July 2015 to September 2022, where she led over 300 engineers developing AI frameworks and platforms, most notably Caffe2 and PyTorch. Under her leadership, the team rebuilt Meta's entire inference and training stack on top of PyTorch, eventually supporting more than five trillion inference requests per day.
Before joining Meta, Lin Qiao held engineering roles at LinkedIn and IBM. She earned her bachelor's and master's degrees in computer science from Fudan University and a doctorate in computer science from the University of California, Santa Barbara.
The remaining co-founders each brought deep expertise in large-scale AI systems:
| Co-Founder | Previous Role |
|---|---|
| Lin Qiao (CEO) | Head of PyTorch, Senior Director of Engineering at Meta |
| Benny Chen | Ads infrastructure lead at Meta |
| Chenyu Zhao | Vertex AI lead at Google |
| Dmytro Dzhulgakov | PyTorch core maintainer at Meta |
| Dmytro Ivchenko | PyTorch for ranking lead at Meta |
| James Reed | PyTorch compiler engineer at Meta |
| Pawel Garbacki | Newsfeed core ML lead at Meta |
The founding team's shared experience building and scaling PyTorch at Meta directly informed the company's core thesis: that inference optimization would become a critical bottleneck as AI adoption expanded, and that a purpose-built inference platform could deliver order-of-magnitude improvements in speed and cost over general-purpose solutions.
Fireworks AI has raised over $327 million in venture capital across three major funding rounds, reaching a $4 billion valuation as of October 2025.
| Round | Date | Amount | Lead Investor(s) | Valuation |
|---|---|---|---|---|
| Series A | March 2024 | $25 million | Benchmark | Not disclosed |
| Series B | July 2024 | $52 million | Sequoia Capital | $552 million |
| Series C | October 2025 | $250 million | Lightspeed Venture Partners, Index Ventures, Evantic | $4 billion |
Strategic investors across these rounds include NVIDIA, AMD, MongoDB Ventures, and Databricks Ventures. Angel investors include former Snowflake CEO Frank Slootman, former Meta COO Sheryl Sandberg, Airtable CEO Howie Liu, and Scale AI CEO Alexandr Wang.
At the time of its Series C announcement, Fireworks reported annualized revenue exceeding $280 million and a customer base that had grown 10x since the Series B round.
The core of Fireworks AI's technology stack is a custom-built inference engine designed from the ground up for low-latency, high-throughput model serving. The platform employs several proprietary optimization techniques that distinguish it from generic serving frameworks.
FireAttention is Fireworks' custom attention kernel implementation, purpose-built to accelerate transformer model inference. The technology has evolved through three major versions:
FireAttention V1 focused on quantization-aware inference optimizations, reducing memory bandwidth requirements while preserving output quality.
FireAttention V2 addressed long-context processing challenges. It introduced optimized attention scaling, multi-host deployment strategies, and advanced kernels that deliver up to 12x faster processing for long-context tasks compared to standard implementations.
FireAttention V3 extended the inference stack to AMD MI300 GPUs. Rather than using automated porting tools, the Fireworks team rewrote the attention kernel from scratch to account for fundamental architectural differences between AMD and NVIDIA hardware. The MI300 has a warp size of 64 (compared to 32 on NVIDIA), 304 compute units (compared to 113 on the H100), 192 GB of HBM (compared to 80 GB on the H100), and a smaller 64 KB shared memory. On benchmark tests, FireAttention V3 achieved a 1.4x improvement in average requests per second for LLaMA 8B and a 1.8x improvement for LLaMA 70B compared to competing implementations. In low-latency scenarios, gains reached up to 3x against NVIDIA NIM and 5.5x against AMD vLLM.
Fireworks employs speculative decoding as a core latency reduction technique. In standard autoregressive generation, each token is produced sequentially. Speculative decoding parallelizes this process by using a smaller "draft" model to predict multiple candidate tokens ahead, which the larger target model then verifies in a single forward pass. Tokens that pass verification are accepted immediately, reducing the total number of sequential forward passes needed.
Fireworks takes this further with adaptive speculative execution, a component of the FireOptimizer system. Instead of using a generic draft model trained on public datasets, the platform automatically trains domain-specific or workload-customized draft models using production traffic data. In a documented code generation workload, this approach increased the draft model hit rate from 29% to 76%, delivering a 2x speedup compared to a generic draft model that actually caused a 1.5x slowdown. Overall, adaptive speculative execution can deliver up to 3x latency improvements.
Speculative decoding is enabled by default for latency-sensitive deployments on the Fireworks platform.
The Fireworks inference engine uses continuous batching (also called iteration-level batching) to maximize GPU utilization. Unlike static batching, where all requests in a batch must complete before new ones can begin, continuous batching allows new requests to enter the processing pipeline as soon as individual sequences finish. This significantly improves throughput and reduces queuing delays, especially for workloads with variable sequence lengths.
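The throughput difference between the two batching strategies can be illustrated with a toy step-counting scheduler (the request lengths and step counts below are properties of this simulation, not Fireworks benchmarks):

```python
# Static vs. continuous batching, counted in decode steps. In static
# batching, a batch only retires once its longest sequence finishes; in
# continuous batching, a finished slot is refilled immediately.

def static_batching(lengths, batch_size):
    """Total decode steps when a batch only advances and retires together."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])  # stragglers hold the batch
    return steps

def continuous_batching(lengths, batch_size):
    """Total decode steps when new requests refill freed slots."""
    pending, active, steps = list(lengths), [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))  # admit a waiting request
        steps += 1  # one step advances every active sequence by one token
        active = [r - 1 for r in active if r > 1]  # r == 1 emits its last token
    return steps

lengths = [2, 8, 2, 8]  # output lengths of four requests, in arrival order
print(static_batching(lengths, batch_size=2))      # 16 steps
print(continuous_batching(lengths, batch_size=2))  # 12 steps
```

The short sequences finish early and their slots are reused, which is why the gap widens as sequence lengths become more variable.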
Fireworks AI hosts over 100 pre-deployed open-source models spanning text generation, image generation, audio processing, and embedding tasks. Developers can access these models immediately through a serverless API without managing any infrastructure.
The platform supports a broad range of large language models, including:
| Model Family | Examples | Developer |
|---|---|---|
| LLaMA | LLaMA 3, LLaMA 3.1, LLaMA 4 | Meta |
| Mistral | Mistral 3, Mistral Nemo | Mistral AI |
| Mixtral | Mixtral 8x7B, Mixtral 8x22B | Mistral AI |
| Qwen | Qwen 2, Qwen 2.5, Qwen 3 | Alibaba Cloud |
| DeepSeek | DeepSeek V3, DeepSeek R1 | DeepSeek |
| Gemma | Gemma 3 | Google |
| Phi | Phi 4 | Microsoft |
| Kimi | Kimi K2, Kimi K2.5 | Moonshot AI |
Fireworks supports image generation models including FLUX.1 (dev, schnell, and Kontext variants) from Black Forest Labs and Stable Diffusion 3.5 from Stability AI. Vision-language models for image understanding are also available.
The platform hosts Whisper V3 and Whisper V3 Turbo for speech-to-text transcription, with support for diarization (speaker identification).
Embedding models from Nomic AI and others are available for vector search and retrieval-augmented generation (RAG) applications.
Fireworks AI provides three primary deployment modes to accommodate different usage patterns and scale requirements.
The serverless tier allows developers to call any pre-hosted model through a REST API with pay-per-token pricing. There are no cold starts for popular models, and Fireworks handles all scaling, load balancing, and failover automatically. This option is suited for prototyping, moderate-volume production workloads, and applications that need access to many different models.
For high-volume or latency-sensitive applications, developers can provision dedicated GPU capacity billed per second. On-demand deployments provide isolated compute resources, consistent performance, and the ability to run custom or fine-tuned models. Supported GPU types include NVIDIA A100, H100, H200, and B200, as well as AMD MI300.
Enterprise customers can deploy within their own cloud environments through integrations with AWS (including AWS Marketplace and Amazon SageMaker), Google Cloud Marketplace, and private connectivity options such as AWS PrivateLink and GCP Private Service Connect.
Fireworks provides a managed fine-tuning service that supports supervised fine-tuning (SFT), preference tuning via Direct Preference Optimization (DPO), and reinforcement fine-tuning. The service uses LoRA (Low-Rank Adaptation) to enable efficient fine-tuning without retraining the full model.
A notable feature of the platform is Multi-LoRA serving, which allows hundreds of fine-tuned LoRA adapters to run simultaneously on a single base model deployment. Because LoRA adapters are small and share the same base model weights, this approach can be up to 100x more cost-efficient than deploying a separate instance for each fine-tuned model. Users can deploy LoRA adapters trained on other platforms, and adapters are served at base-model token rates.
Fine-tuning is available for models across the LLaMA, Qwen, Phi, Gemma, and DeepSeek families, as well as Mixture-of-Experts architectures. Fine-tuning costs start at $0.50 per million training tokens for models up to 16 billion parameters.
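The memory-sharing idea behind Multi-LoRA serving can be sketched in a few lines of pure Python (illustration only: the `serve` function, adapter names, and tiny 2x2 matrices are invented for this example, and real adapters attach to many layers, not one):

```python
# One shared base weight matrix W, plus per-adapter low-rank factors B and A.
# Each adapter stores only B (d x r) and A (r x d), far smaller than a full
# copy of W when the rank r << d, so many adapters fit alongside one base.

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Shared base weights, loaded once for all adapters.
W = [[1.0, 0.0],
     [0.0, 1.0]]

# Two rank-1 adapters: each is just (B, A).
adapters = {
    "tenant-a": ([[1.0], [0.0]], [[0.0, 2.0]]),
    "tenant-b": ([[0.0], [1.0]], [[3.0, 0.0]]),
}

def serve(x, adapter_id):
    """Apply y = x (W + B A) using the shared base plus one adapter."""
    B, A = adapters[adapter_id]
    W_eff = add(W, matmul(B, A))
    return matmul(x, W_eff)

x = [[1.0, 1.0]]
print(serve(x, "tenant-a"))  # [[1.0, 3.0]]
print(serve(x, "tenant-b"))  # [[4.0, 1.0]]
```

Because only the small (B, A) pairs differ per request, switching adapters between requests is cheap, which is what makes serving hundreds of them on one deployment practical.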
Fireworks supports function calling (also called tool use) through an OpenAI-compatible API. Developers define functions using JSON Schema, and the model generates structured tool calls with appropriate parameters when a query matches a defined function. Configuration options include automatic tool selection, forced tool calling, and specifying a particular function.
The platform supports parallel function calling on compatible models, streaming of tool call arguments, and integration with the Model Context Protocol (MCP) through the Responses API.
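A request body for such a tool-calling call might look like the following (the model id, function name, and field values are illustrative placeholders; only the overall `tools` / JSON Schema shape follows the OpenAI-compatible convention described above):

```python
# Constructing an OpenAI-compatible function-calling request body as a plain
# dict. The model id and the get_weather function are examples, not actual
# platform defaults; nothing is sent over the network here.
import json

request_body = {
    "model": "accounts/fireworks/models/llama-v3p1-70b-instruct",  # example id
    "messages": [
        {"role": "user", "content": "What's the weather in Redwood City?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {  # JSON Schema describing the arguments
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"},
                        "unit": {"type": "string",
                                 "enum": ["celsius", "fahrenheit"]},
                    },
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",  # "auto" = model decides; a name forces that tool
}

payload = json.dumps(request_body)
```

When the model decides a tool applies, the response contains a structured tool call whose arguments conform to the declared schema, which the application then executes and feeds back as a tool message.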
Fireworks offers two methods for constraining model output to structured formats:
JSON Mode enforces output conformance to a provided JSON schema by restricting token generation at each decoding step to only tokens that would produce valid JSON according to the schema. Fireworks reports that its JSON mode runs at approximately 120 tokens per second, roughly 4x faster than competing platforms.
Grammar Mode uses custom BNF (Backus-Naur Form) grammars to constrain output to arbitrary structured formats beyond JSON, such as classification labels, programming language syntax, or domain-specific formats. According to Fireworks, it is the only inference platform offering grammar-based constrained decoding.
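The common idea behind both modes, masking the vocabulary at each decoding step so the output stays inside a formal language, can be shown with a toy classification "grammar" (this is a teaching sketch, not Fireworks' decoder; the label set stands in for a BNF grammar):

```python
# Constrained greedy decoding over a two-label "grammar": at every step,
# candidate tokens are filtered so the running output remains a valid prefix
# of some accepted string, then the highest-scoring allowed token is taken.

LABELS = ["positive", "negative"]

def allowed_tokens(prefix, vocab):
    """Tokens t such that prefix + t is still a prefix of some label."""
    return [t for t in vocab
            if any(label.startswith(prefix + t) for label in LABELS)]

def constrained_decode(scored_vocab):
    """Greedy decode: best *allowed* token each step until a label is done."""
    out = ""
    vocab = sorted(scored_vocab, key=scored_vocab.get, reverse=True)
    while out not in LABELS:
        for token in vocab:  # model's ranking, best first
            if token in allowed_tokens(out, vocab):
                out += token
                break
        else:
            raise ValueError("no valid continuation")
    return out

# The "model" prefers a token ("xyz") that would break the format; the mask
# discards it and forces a valid label anyway.
scores = {"pos": 0.1, "neg": 0.5, "itive": 0.3, "ative": 0.4, "xyz": 0.9}
print(constrained_decode(scores))  # → "negative"
```

JSON Mode applies the same masking principle with the set of strings valid under a JSON schema, and Grammar Mode with the language generated by a user-supplied BNF grammar.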
Fireworks AI has positioned itself as a platform for building compound AI systems, which combine multiple models, tools, data sources, and processing steps to solve complex tasks. Two proprietary model families support this vision.
FireFunction is Fireworks' series of open-weights models optimized for function calling and tool orchestration.
| Version | Base Model | Key Benchmarks | Speed vs. GPT-4/4o |
|---|---|---|---|
| FireFunction V1 | Mixtral 8x7B | 87.88% accuracy (fewer than 5 functions); within 5% of GPT-4 Turbo on complex selection | 4x faster than GPT-4 Turbo |
| FireFunction V2 | LLaMA 3 70B Instruct | 0.81 combined score (MT Bench + Gorilla + Nexus) vs. 0.80 for GPT-4o | 2.5x faster than GPT-4o at 10% of the cost |
FireFunction V1, released in early 2024, was built on Mixtral 8x7B and optimized for routing decisions and structured information extraction. It achieved 0.4 to 0.6 second response latency compared to 2.3 to 3.0 seconds for GPT-4, representing roughly a 4x speedup.
FireFunction V2, built on LLaMA 3 70B Instruct, matched or exceeded GPT-4o on combined benchmarks while running 2.5x faster and costing approximately 10% as much ($0.90 per million tokens versus $15 per million output tokens for GPT-4o). It supports parallel function calling, handles up to 30 function specifications, and maintains strong multi-turn conversational abilities alongside its tool-calling capabilities. Both versions are available as open-weights models on Hugging Face.
FireOptimizer is Fireworks' automated optimization engine, which tunes inference deployments across multiple layers of the serving stack.
The system explores over 100,000 possible serving configurations to find the optimal combination of quality, throughput, and latency for a given workload. A key insight driving FireOptimizer is that the same model on identical hardware can exhibit dramatically different cost-performance profiles depending on configuration; for example, LLaMA 70B on eight GPUs in a volume-optimized setup can be 4x cheaper per token than the same model on the same GPUs optimized for single-request speed.
The 3D FireOptimizer extension automates multi-dimensional tradeoff searches, allowing enterprises to specify target latency, throughput, and quality constraints and receive an automatically optimized deployment configuration. Adaptive speculative execution is available to enterprise reserved deployment users at no additional cost.
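A configuration search of this kind can be sketched as a constrained grid search (every number below, and the analytical cost/latency model itself, is made up for illustration; FireOptimizer's real search space and objectives are not public in this detail):

```python
# Toy serving-configuration search: enumerate (batch_size, num_gpus) combos,
# estimate latency and cost with a fake analytical model, and pick the
# cheapest configuration that meets a latency target.
from itertools import product

def estimate(batch_size, num_gpus):
    """Fake model: bigger batches cut per-token cost but raise latency."""
    latency_ms = 50 + 10 * batch_size / num_gpus
    cost_per_m_tokens = 8.0 / (batch_size * num_gpus ** 0.5)
    return latency_ms, cost_per_m_tokens

def best_config(max_latency_ms):
    candidates = []
    for batch_size, num_gpus in product([1, 4, 16, 64], [1, 2, 4, 8]):
        latency, cost = estimate(batch_size, num_gpus)
        if latency <= max_latency_ms:  # discard configs that miss the SLA
            candidates.append((cost, batch_size, num_gpus))
    return min(candidates)  # cheapest config that satisfies the constraint

cost, batch_size, num_gpus = best_config(max_latency_ms=100)
print(batch_size, num_gpus)  # the volume-vs-latency tradeoff in miniature
```

Even this 16-point grid shows the effect described above: relaxing or tightening the latency target changes which configuration is cheapest, and a real search over 100,000+ configurations simply does this at much finer granularity across more dimensions.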
Fireworks AI provides an OpenAI-compatible REST API, allowing developers using the OpenAI Python or JavaScript SDK to switch to Fireworks by changing the base URL and API key. The API endpoint is https://api.fireworks.ai/inference/v1.
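A minimal sketch of targeting that endpoint using only the Python standard library follows (the OpenAI SDK works the same way once `base_url` points at Fireworks; the model id and API key are placeholders, and the request is constructed but deliberately not sent):

```python
# Building a chat-completions request against the OpenAI-compatible endpoint.
# API_KEY is a placeholder; real keys come from the Fireworks dashboard.
import json
import urllib.request

API_KEY = "fw-..."  # placeholder, not a real key
BASE_URL = "https://api.fireworks.ai/inference/v1"

body = {
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",  # example id
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}

req = urllib.request.Request(
    url=f"{BASE_URL}/chat/completions",
    data=json.dumps(body).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted to keep this offline.
print(req.full_url)
```

With the official OpenAI SDK, the equivalent change is passing `base_url=BASE_URL` and the Fireworks key when constructing the client; the rest of the application code is unchanged.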
The platform integrates with popular developer frameworks and tools:
| Framework / Tool | Integration |
|---|---|
| OpenAI SDK | Native compatibility (Python, Node.js) |
| LangChain | langchain_fireworks provider |
| Vercel AI SDK | @ai-sdk/fireworks module |
| LiteLLM | Built-in Fireworks provider |
| LlamaIndex | Fireworks embedding and LLM integration |
| Model Context Protocol (MCP) | Responses API with MCP support (beta) |
The API supports chat completions, text completions, embeddings, image generation, audio transcription, and tool calling. Streaming is supported across all text and tool-calling endpoints.
Fireworks AI uses a usage-based pricing model across its serverless, on-demand, and fine-tuning products.
| Model Size | Price per Million Tokens |
|---|---|
| Less than 4B parameters | $0.10 |
| 4B to 16B parameters | $0.20 |
| Over 16B parameters | $0.90 |
| MoE up to 56B (e.g., Mixtral 8x7B) | $0.50 |
| MoE 56B to 176B (e.g., DBRX) | $1.20 |
Cached input tokens are priced at 50% of standard rates. Batch inference is discounted 50% on both input and output tokens.
Some featured models use separate input/output pricing. For example, DeepSeek V3 is priced at $0.56 per million input tokens and $1.68 per million output tokens.
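The table and discounts above translate into straightforward arithmetic; the sketch below treats all tokens in a request uniformly for simplicity (the cached-token discount actually applies to input tokens), and the tier keys are this example's own names:

```python
# Serverless cost calculator based on the per-token table above.
# Prices are USD per million tokens; cached tokens bill at 50% of the rate,
# and batch inference is discounted a further 50%.

PRICE_PER_M = {
    "lt4b": 0.10,            # < 4B parameters
    "4b_to_16b": 0.20,       # 4B - 16B
    "gt16b": 0.90,           # > 16B
    "moe_le56b": 0.50,       # MoE up to 56B (e.g., Mixtral 8x7B)
    "moe_56b_to_176b": 1.20, # MoE 56B - 176B (e.g., DBRX)
}

def serverless_cost(tier, tokens, cached_fraction=0.0, batch=False):
    """Cost in USD for `tokens` tokens on the given pricing tier."""
    rate = PRICE_PER_M[tier] / 1_000_000
    cached = tokens * cached_fraction
    fresh = tokens - cached
    cost = fresh * rate + cached * rate * 0.5  # cached tokens at half rate
    return cost * 0.5 if batch else cost       # batch jobs at half rate

# 10M tokens on a >16B model, half of them cached:
print(round(serverless_cost("gt16b", 10_000_000, cached_fraction=0.5), 2))
# → 6.75
```

At $0.90 per million tokens, 10 million uncached tokens would cost $9.00; halving the rate on the 5 million cached tokens brings the total to $6.75.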
| GPU | Price per Hour |
|---|---|
| NVIDIA A100 80 GB | $2.90 |
| NVIDIA H100 80 GB | $4.00 |
| NVIDIA H200 141 GB | $6.00 |
| NVIDIA B200 180 GB | $9.00 |
All on-demand deployments are billed per second with no startup charges.
| Model Size | SFT (per 1M training tokens) | DPO (per 1M training tokens) |
|---|---|---|
| Up to 16B | $0.50 | $1.00 |
| 16B to 80B | $3.00 | $6.00 |
| 80B to 300B | $6.00 | $12.00 |
| Over 300B | $10.00 | $20.00 |
Image generation pricing ranges from $0.00013 per step for SDXL to $0.08 per image for FLUX.1 Kontext Max. Audio transcription via Whisper V3 costs $0.0015 per audio minute, with the Turbo variant at $0.0009 per audio minute.
Fireworks AI has achieved SOC 2 Type II certification and HIPAA compliance, enabling adoption by enterprises in regulated industries including healthcare and financial services. Data is encrypted in transit using TLS 1.2+ and at rest using AES-256. The platform does not log or store prompt or generation data for open models without explicit user opt-in.
Fireworks maintains a Trust Center at trust.fireworks.ai where customers can access audit reports and compliance documentation. Controls are mapped to GDPR, CCPA, and other international data protection frameworks.
Fireworks AI serves a range of high-profile technology companies and enterprises:
| Customer | Use Case |
|---|---|
| Cursor | Fast Apply and Copilot++ code editing models with speculative decoding |
| Sourcegraph | AI-powered code search and code generation at scale |
| Vercel | v0 code generation tool; achieved 40x end-to-end latency improvement and 93% error-free generation |
| Notion | Fine-tuned models reducing latency from 2 seconds to 350 milliseconds |
| DoorDash | Production AI applications |
| Uber | Enterprise AI operations |
| Shopify | AI-powered commerce features |
| Samsung | Enterprise AI deployment |
| Upwork | Faster, smarter proposal generation for freelancers |
| GitLab | AI-assisted development workflows |
Fireworks has formed partnerships with MongoDB for database-integrated AI applications, NVIDIA through the Inception program, Google Cloud for marketplace distribution, and AWS for SageMaker and Marketplace integrations. The company's Series B and C rounds included strategic investments from NVIDIA, AMD, MongoDB, and Databricks, reflecting deep integration with the broader AI infrastructure ecosystem.
Fireworks AI operates in the increasingly competitive AI inference platform market alongside several notable companies:
| Competitor | Primary Differentiator |
|---|---|
| Together AI | Broad model catalog (200+ models), strong fine-tuning support, and training infrastructure |
| Groq | Custom Language Processing Unit (LPU) hardware for ultra-low-latency inference |
| Anyscale | Ray-based distributed computing platform for scalable AI workloads |
| Replicate | Developer-friendly model deployment with Docker-based packaging; stronger for prototyping than production |
| AWS Bedrock | Managed service with access to proprietary and open models within the AWS ecosystem |
| Google Vertex AI | Integrated ML platform within Google Cloud |
Fireworks differentiates primarily on inference speed and throughput optimization. The company claims up to 40x faster performance and 8x cost reduction compared to other providers, driven by its proprietary FireAttention kernels, adaptive speculative decoding, and workload-specific optimization through FireOptimizer. While Groq competes on raw latency using custom silicon, Fireworks achieves its performance gains through software optimization on standard NVIDIA and AMD GPUs, which provides greater flexibility in model support and deployment options.
Compared to Together AI, which offers a similarly broad model catalog, Fireworks places greater emphasis on production-grade serving optimizations and compound AI system orchestration. Compared to Replicate, which targets rapid prototyping and community model sharing, Fireworks is focused on high-scale production inference with enterprise compliance requirements.
As reported alongside the Series C funding announcement in October 2025:
| Metric | Value |
|---|---|
| Daily token processing | Over 10 trillion tokens |
| Companies served | Over 10,000 |
| Developer reach | Hundreds of thousands |
| Annualized revenue | Over $280 million |
| Total funding | Over $327 million |
| Post-money valuation | $4 billion |
| Employee count | Approximately 150 to 170 |
| Models available | Over 100 (serverless) |
| API uptime | 99.99% |