Fireworks AI
Last reviewed
May 17, 2026
Sources
32 citations
Review status
Source-backed
Revision
v6 ยท 5,921 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 17, 2026
Sources
32 citations
Review status
Source-backed
Revision
v6 ยท 5,921 words
Add missing citations, update stale details, or suggest a clearer explanation.
Fireworks AI is an artificial intelligence infrastructure company that provides a high-performance inference platform for deploying and serving large language models (LLMs), image generation models, audio models, and embedding models. Founded in 2022 by former members of the PyTorch team at Meta, Fireworks AI focuses on delivering fast, cost-efficient, and production-ready AI inference through proprietary optimization technologies including FireAttention, speculative decoding, and adaptive serving configurations. The company is headquartered in Redwood City, California.
As of early 2026, the platform processes more than 15 trillion tokens per day, sustains roughly 180,000 requests per second across its global footprint, serves over 10,000 companies, and supports hundreds of thousands of developers building AI applications across text, image, audio, and multimodal domains. Annualized revenue reached approximately $315 million in February 2026, representing 416 percent year over year growth and positioning Fireworks as one of the fastest scaling infrastructure companies in the generative AI sector.
Fireworks AI was co-founded in October 2022 by Lin Qiao and six other engineers who had previously worked together at Meta and Google. Lin Qiao served as Senior Director of Engineering at Meta from July 2015 to September 2022, where she led over 300 engineers developing AI frameworks and platforms, most notably Caffe2 and PyTorch. Under her leadership, the team rebuilt Meta's entire inference and training stack on top of PyTorch, eventually supporting more than five trillion inference requests per day across Meta's family of applications including Facebook, Instagram, WhatsApp, and Reality Labs hardware.
Before joining Meta, Lin Qiao held engineering roles at LinkedIn and IBM Research, where she worked on database systems and data infrastructure. She earned her bachelor's and master's degrees in computer science from Fudan University in Shanghai and a doctorate in computer science from the University of California, Santa Barbara. Chinese press has profiled her repeatedly as one of the most prominent female founders in the AI infrastructure space, with one widely circulated 36Kr feature describing her trajectory from a top Fudan graduate to a $4 billion unicorn founder in three years.
The remaining co-founders each brought deep expertise in large-scale AI systems:
| Co-Founder | Previous Role |
|---|---|
| Lin Qiao (CEO) | Head of PyTorch, Senior Director of Engineering at Meta |
| Benny Chen | Ads infrastructure lead at Meta |
| Chenyu Zhao | Vertex AI lead at Google |
| Dmytro Dzhulgakov (CTO) | PyTorch core maintainer at Meta |
| Dmytro Ivchenko | PyTorch for ranking lead at Meta |
| James Reed | PyTorch compiler engineer at Meta |
| Pawel Garbacki | Newsfeed core ML lead at Meta |
Dmytro Dzhulgakov, originally from Kharkiv, Ukraine, joined Facebook in 2011 and spent more than a decade as one of the core technical leaders on PyTorch alongside Lin Qiao. As CTO of Fireworks he oversees the inference engine, kernel work, and the hardware abstraction layer that allows the platform to run on both NVIDIA and AMD accelerators. Several other founding engineers, including Dmytro Ivchenko, also came directly from the PyTorch core team, giving Fireworks an unusually deep concentration of framework-level expertise relative to most inference startups.
The founding team's shared experience building and scaling PyTorch at Meta directly informed the company's core thesis: that inference optimization would become a critical bottleneck as AI adoption expanded, and that a purpose-built inference platform could deliver order-of-magnitude improvements in speed and cost over general-purpose solutions. Lin Qiao has described this view publicly in interviews on the Sequoia Capital Training Data podcast and at the Databricks Data + AI Summit, framing inference as "the new runtime" for AI-native software, much as the cloud became the runtime for web applications in the 2010s.
Fireworks AI has raised over $327 million in venture capital across three major funding rounds, reaching a $4 billion valuation as of October 2025. The company's valuation has climbed more than seven times in approximately fifteen months, mirroring the broader investor appetite for AI infrastructure providers that own a proprietary serving stack rather than reselling commodity compute.
| Round | Date | Amount | Lead Investor(s) | Valuation |
|---|---|---|---|---|
| Series A | March 2024 | $25 million | Benchmark | Not disclosed |
| Series B | July 2024 | $52 million | Sequoia Capital | $552 million |
| Series C | October 2025 | $250 million | Lightspeed Venture Partners, Index Ventures, Evantic | $4 billion |
Strategic investors across these rounds include NVIDIA, AMD, MongoDB Ventures, and Databricks Ventures. Angel investors include former Snowflake CEO Frank Slootman, former Meta COO Sheryl Sandberg, Airtable CEO Howie Liu, and Scale AI CEO Alexandr Wang. Several reports filed shortly after the Series C, including in the Wall Street Journal, put the final round size at $254 million when secondary share purchases were included.
At the time of its Series C announcement, Fireworks reported annualized revenue exceeding $280 million and a customer base that had grown 10x since the Series B round. Research firm Sacra estimates that revenue reached approximately $305 million by year-end 2025 and $315 million in February 2026, with blended annualized revenue per active company of around $28,000. Lightspeed's investment memo, published as a perspective piece by partner Arif Janmohamed, framed the deal as a bet that the inference layer is destined to become the dominant value capture point in generative AI applications, comparable to how the database layer became central to web 2.0.
The core of Fireworks AI's technology stack is a custom-built inference engine designed from the ground up for low-latency, high-throughput model serving. The platform employs several proprietary optimization techniques that distinguish it from generic serving frameworks such as vLLM, TensorRT-LLM, or Text Generation Inference.
FireAttention is Fireworks' custom attention kernel implementation, purpose-built to accelerate transformer model inference. The technology has evolved through three major versions:
FireAttention V1 focused on quantization-aware inference optimizations, reducing memory bandwidth requirements while preserving output quality. The kernel applies mixed-precision arithmetic, with weights stored in 8-bit integer format and activations computed in BF16 or FP16 to balance throughput against accuracy. V1 was the foundation for the platform's initial competitive advantage over generic vLLM-based serving on the same hardware.
FireAttention V2 addressed long-context processing challenges. It introduced optimized attention scaling, multi-host deployment strategies, and advanced kernels that deliver up to 12x faster processing for long-context tasks compared to standard implementations. V2 also added support for FP8 inference on Hopper-class GPUs (H100 and H200), unlocking another roughly 2x throughput gain on supported models without measurable quality regression on Fireworks' internal evaluation suite.
FireAttention V3 extended the inference stack to AMD MI300 GPUs. Rather than using automated porting tools, the Fireworks team rewrote the attention kernel from scratch to account for fundamental architectural differences between AMD and NVIDIA hardware. The MI300 has a warp size of 64 (compared to 32 on NVIDIA), 304 compute units (compared to 113 on the H100), 192 GB of HBM (compared to 80 GB on the H100), and a smaller 64 KB shared memory. On benchmark tests, FireAttention V3 achieved a 1.4x improvement in average requests per second for LLaMA 8B and a 1.8x improvement for LLaMA 70B compared to competing implementations. In low-latency scenarios, gains reached up to 3x against NVIDIA NIM and 5.5x against AMD's vLLM port.
FireAttention V3's clean-sheet AMD implementation was significant because it gave Fireworks practical hardware diversity at a moment when NVIDIA H100 supply was constrained throughout 2024 and 2025. Customers running mixed deployments could rebalance traffic toward MI300 capacity when H100 quotas tightened, without changing API surfaces or accepting a quality regression.
Fireworks employs speculative decoding as a core latency reduction technique. In standard autoregressive generation, each token is produced sequentially. Speculative decoding parallelizes this process by using a smaller "draft" model to predict multiple candidate tokens ahead, which the larger target model then verifies in a single forward pass. Tokens that pass verification are accepted immediately, reducing the total number of sequential forward passes needed.
Fireworks takes this further with adaptive speculative execution, a component of the FireOptimizer system. Instead of using a generic draft model trained on public datasets, the platform automatically trains domain-specific or workload-customized draft models using production traffic data. In a documented code generation workload, this approach increased the draft model hit rate from 29 percent to 76 percent, delivering a 2x speedup compared to a generic draft model that actually caused a 1.5x slowdown. Overall, adaptive speculative execution can deliver up to 3x latency improvements.
In the most aggressive published configuration, Cursor reported that its Fast Apply model running on a fine-tuned Llama 3 70B reached roughly 1,000 tokens per second on Fireworks, an order of magnitude faster than the typical 100 to 200 tokens per second range for a 70B model on commodity inference stacks. Speculative decoding is enabled by default for latency-sensitive deployments on the Fireworks platform.
The Fireworks inference engine uses continuous batching (also called iteration-level batching) to maximize GPU utilization. Unlike static batching, where all requests in a batch must complete before new ones can begin, continuous batching allows new requests to enter the processing pipeline as soon as individual sequences finish. This significantly improves throughput and reduces queuing delays, especially for workloads with variable sequence lengths.
The scheduler also performs request-level prioritization to support mixed latency requirements within a single deployment. Interactive chat traffic, for example, can be served with shorter prefill windows and tighter speculative decoding while batch transcription jobs run alongside at higher batch sizes and lower per-request priority. This multi-tenant scheduling is one of the reasons Fireworks can sustain reported figures of approximately 180,000 requests per second across its fleet without partitioning workloads into separate clusters.
In 2025 Fireworks rolled out disaggregated prefill and decode, a serving architecture that places the compute-bound prefill phase (processing the input prompt) on a different pool of accelerators from the memory-bound decode phase (generating output tokens). Because prefill tends to saturate compute while decode is memory-bandwidth bound, splitting the two stages allows each pool to be sized and scheduled independently. The result is a measurable drop in both time-to-first-token and per-token latency for long-context workloads, and it lays the groundwork for the global compute orchestration layer that the company expanded after acquiring Hathora in 2026.
Fireworks AI hosts over 100 pre-deployed open-source models spanning text generation, image generation, audio processing, and embedding tasks. Developers can access these models immediately through a serverless API without managing any infrastructure.
The platform supports a broad range of large language models, including:
| Model Family | Examples | Developer |
|---|---|---|
| LLaMA | LLaMA 3, LLaMA 3.1, LLaMA 4 | Meta |
| Mistral | Mistral 3, Mistral Nemo | Mistral AI |
| Mixtral | Mixtral 8x7B, Mixtral 8x22B | Mistral AI |
| Qwen | Qwen 2, Qwen 2.5, Qwen 3, Qwen 3.6 Max | Alibaba Cloud |
| DeepSeek | DeepSeek V3, DeepSeek R1, DeepSeek V3.2, DeepSeek V4 Pro, DeepSeek V4 Flash | DeepSeek |
| Gemma | Gemma 3 | |
| Phi | Phi 4 | Microsoft |
| Kimi | Kimi K2, Kimi K2.5, Kimi K2.6 | Moonshot AI |
| MiniMax | MiniMax M2, M2.5 | MiniMax |
The platform has a track record of day-zero or near-day-zero support for major open-weight releases. DeepSeek V4 Pro (1.6 trillion total parameters, 49 billion active in its mixture of experts) and DeepSeek V4 Flash (284 billion total, 13 billion active) launched on April 24, 2026 and were available on Fireworks the same week. Kimi K2.6, Moonshot AI's trillion-parameter vision-language model released on April 20, 2026, similarly went live with serverless inference, structured output, and fine-tuning support shortly after publication.
Fireworks supports image generation models including FLUX.1 (dev, schnell, and Kontext variants) from Black Forest Labs and Stable Diffusion 3.5 from Stability AI. Vision-language models for image understanding are also available, including multimodal variants of Qwen and the Kimi K2 family.
The platform hosts Whisper V3 and Whisper V3 Turbo for speech-to-text transcription, with support for diarization (speaker identification). After the Hathora acquisition in early 2026, Fireworks also offers a voice model marketplace inherited from Hathora's real-time AI roadmap, with low-latency speech-to-speech endpoints designed for voice agents and live captioning workloads.
Embedding models from Nomic AI and others are available for vector search and retrieval-augmented generation (RAG) applications. The platform integrates natively with MongoDB Atlas Vector Search, an integration that grew out of MongoDB Ventures' participation in the Series B and that is featured prominently in MongoDB's solutions library as a reference architecture for financial-services RAG systems.
Fireworks AI provides four primary deployment modes to accommodate different usage patterns and scale requirements.
The serverless tier allows developers to call any pre-hosted model through a REST API with pay-per-token pricing. There are no cold starts for popular models, and Fireworks handles all scaling, load balancing, and failover automatically. This option is suited for prototyping, moderate-volume production workloads, and applications that need access to many different models. The serverless tier is the primary entry point for most new developers and accounts for the bulk of the platform's reported developer-count metrics.
For high-volume or latency-sensitive applications, developers can provision dedicated GPU capacity billed per second. On-demand deployments provide isolated compute resources, consistent performance, and the ability to run custom or fine-tuned models. Supported GPU types include NVIDIA A100, H100, H200, and B200, as well as AMD MI300. On-demand pricing carries no startup charge, and the platform can scale a deployment elastically between a minimum and maximum replica count to absorb traffic spikes.
Through the bring-your-own-weights (BYOW) feature, customers can upload model weights trained or fine-tuned elsewhere and serve them on Fireworks' optimized stack. The platform accepts standard Hugging Face artifacts (config.json, .safetensors weights, and tokenizer files) and supports both full-weight uploads and quantized variants. BYOW is the foundation of the Microsoft Foundry integration described below and is widely used by enterprises that want to retain ownership of their training pipelines while taking advantage of Fireworks' kernels.
Enterprise customers can deploy within their own cloud environments through integrations with AWS (including AWS Marketplace and Amazon SageMaker), Google Cloud Marketplace, and private connectivity options such as AWS PrivateLink and GCP Private Service Connect. The March 2026 public preview of Fireworks AI on Microsoft Foundry added Microsoft Azure as a third major cloud distribution channel through a single Azure endpoint, with Azure-grade governance and identity controls layered on top of the Fireworks serving stack. Provisioned Throughput Units (PTUs) provide reserved capacity for steady-state workloads on Foundry, mirroring Microsoft's existing PTU model for OpenAI hosted endpoints.
Fireworks provides a managed fine-tuning service that supports supervised fine-tuning (SFT), preference tuning via Direct Preference Optimization (DPO), and reinforcement fine-tuning. The service uses LoRA (Low-Rank Adaptation) to enable efficient fine-tuning without retraining the full model, and it can also produce full-weight checkpoints for customers who prefer not to use adapters.
A notable feature of the platform is Multi-LoRA serving, which allows hundreds of fine-tuned LoRA adapters to run simultaneously on a single base model deployment. Because LoRA adapters are small and share the same base model weights, this approach provides up to 100x cost efficiency compared to deploying separate fine-tuned model instances. Users can deploy LoRA adapters trained on other platforms, and the adapters are served at base model token rates.
In late 2025 Fireworks launched Reinforcement Fine-Tuning (RFT) as a managed beta. RFT lets developers supply a grader function (either a Python program or an LLM-as-judge prompt) and iteratively trains the model to maximize the grader's score against task-specific rubrics. The service is positioned for agentic reasoning, function calling, coding, and other domains where ground-truth labels are scarce but evaluation rubrics are tractable. The headline claim from the RFT launch is that small open-weights models such as DeepSeek V3 and Kimi K2, fine-tuned with RFT for narrow tasks, can outperform frontier closed-source models on those tasks at a fraction of the inference cost.
RFT integrates with the broader fine-tuning catalog so a model can be trained sequentially through SFT, then DPO, then RFT. Training jobs run on managed Fireworks infrastructure with no GPU provisioning required from the customer, and the resulting checkpoint slots directly into either serverless Multi-LoRA serving or a dedicated deployment.
Fine-tuning is available for models across the LLaMA, Qwen, Phi, Gemma, and DeepSeek families, as well as Mixture-of-Experts architectures including Mixtral and the DeepSeek V3 and V4 series. Fine-tuning costs start at $0.50 per million training tokens for models up to 16 billion parameters and rise to $10 per million for models above 300 billion.
Fireworks supports function calling (also called tool use) through an OpenAI-compatible API. Developers define functions using JSON Schema, and the model generates structured tool calls with appropriate parameters when a query matches a defined function. Configuration options include automatic tool selection, forced tool calling, and specifying a particular function.
The platform supports parallel function calling on compatible models, streaming of tool call arguments, and integration with the Model Context Protocol (MCP) through the Responses API. Independent reviews published in Q1 2026 placed Fireworks at 96.2 percent single-tool and 92.1 percent multi-tool function-calling accuracy on standard agentic benchmarks, ahead of generic open-source serving stacks and comparable to leading proprietary models such as GPT-4o and Claude 3.7 on the same evaluation suites.
Fireworks offers two methods for constraining model output to structured formats:
JSON Mode enforces output conformance to a provided JSON schema by restricting token generation at each decoding step to only tokens that would produce valid JSON according to the schema. Fireworks reports that its JSON mode runs at approximately 120 tokens per second, roughly 4x faster than competing platforms.
Grammar Mode uses custom BNF (Backus-Naur Form) grammars to constrain output to arbitrary structured formats beyond JSON, such as classification labels, programming language syntax, or domain-specific formats. According to Fireworks, it is the only inference platform offering grammar-based constrained decoding at production scale, a capability that customers in regulated industries use to enforce strict output formats for downstream parsers.
Fireworks AI has positioned itself as a platform for building compound AI systems, which combine multiple models, tools, data sources, and processing steps to solve complex tasks. The framing was introduced publicly with the Series B announcement in July 2024, drawing on the Berkeley AI Research definition of compound AI, and it has remained the company's primary product narrative through the Series C and Microsoft Foundry launches. Two proprietary model families and one orchestration system support this vision.
FireFunction is Fireworks' series of open-weights models optimized for function calling and tool orchestration.
| Version | Base Model | Key Benchmarks | Speed vs. GPT-4/4o |
|---|---|---|---|
| FireFunction V1 | Mixtral 8x7B | 87.88 percent accuracy (fewer than 5 functions); within 5 percent of GPT-4 Turbo on complex selection | 4x faster than GPT-4 Turbo |
| FireFunction V2 | LLaMA 3 70B Instruct | 0.81 combined score (MT Bench + Gorilla + Nexus) vs. 0.80 for GPT-4o | 2.5x faster than GPT-4o at 10 percent of the cost |
FireFunction V1, released in early 2024, was built on Mixtral 8x7B and optimized for routing decisions and structured information extraction. It achieved 0.4 to 0.6 second response latency compared to 2.3 to 3.0 seconds for GPT-4, representing roughly a 4x speedup.
FireFunction V2, built on LLaMA 3 70B Instruct, matched or exceeded GPT-4o on combined benchmarks while running 2.5x faster and costing approximately 10 percent as much, $0.90 per million tokens versus $15 per million output tokens for GPT-4o. It supports parallel function calling, handles up to 30 function specifications, and maintains strong multi-turn conversational abilities alongside its tool-calling capabilities. Both versions are available as open-weights models on Hugging Face.
FireOptimizer is Fireworks' automated optimization engine that tunes inference deployments across three layers:
The system explores over 100,000 possible serving configurations to find the optimal combination of quality, throughput, and latency for a given workload. A key insight driving FireOptimizer is that the same model on identical hardware can exhibit dramatically different cost-performance profiles depending on configuration; for example, LLaMA 70B on eight GPUs in a volume-optimized setup can be 4x cheaper per token than the same model on the same GPUs optimized for single-request speed.
The 3D FireOptimizer extension automates multi-dimensional tradeoff searches, allowing enterprises to specify target latency, throughput, and quality constraints and receive an automatically optimized deployment configuration. Adaptive speculative execution is available to enterprise reserved deployment users at no additional cost. Independent benchmarks from third-party providers such as Artificial Analysis place Fireworks at roughly 150 ms P50 time-to-first-token (TTFT) for Llama 70B with sustained throughput of around 145 tokens per second per request, slower than Groq's LPU on raw token-per-second but materially faster than most other GPU-based providers including Together AI and Replicate.
The compound AI thesis goes beyond function calling. Fireworks bundles a Responses API with native MCP support, server-side tool invocation, and shared state across multi-step workflows so that a single API call can orchestrate retrieval, tool calls, model routing, and structured output generation without round-tripping every intermediate step to the client. This pattern is the primary path the company recommends to enterprises building agentic systems, and it is the reason its function-calling and structured-output investments are positioned as system-level features rather than per-model capabilities.
Fireworks AI provides an OpenAI-compatible REST API, allowing developers using the OpenAI Python or JavaScript SDK to switch to Fireworks by changing the base URL and API key. The API endpoint is https://api.fireworks.ai/inference/v1.
The platform integrates with popular developer frameworks and tools:
| Framework / Tool | Integration |
|---|---|
| OpenAI SDK | Native compatibility (Python, Node.js) |
| LangChain | langchain_fireworks provider |
| Vercel AI SDK | @ai-sdk/fireworks module |
| LiteLLM | Built-in Fireworks provider |
| LlamaIndex | Fireworks embedding and LLM integration |
| Model Context Protocol (MCP) | Responses API with MCP support |
| Microsoft Foundry | Direct endpoint via Azure AI Foundry |
| AWS SageMaker | JumpStart / Marketplace integration |
| Google Cloud Marketplace | Direct procurement and private offers |
| MongoDB Atlas | Vector Search reference architectures |
The API supports chat completions, text completions, embeddings, image generation, audio transcription, and tool calling. Streaming is supported across all text and tool-calling endpoints. Fireworks also publishes a public cookbook on GitHub with reference implementations for function calling, RAG, fine-tuning workflows, and structured output, and the documentation site at docs.fireworks.ai is the canonical reference for API parameters and deployment configurations.
Fireworks AI uses a usage-based pricing model across its serverless, on-demand, and fine-tuning products.
| Model Size | Price per Million Tokens |
|---|---|
| Less than 4B parameters | $0.10 |
| 4B to 16B parameters | $0.20 |
| Over 16B parameters | $0.90 |
| MoE up to 56B (e.g., Mixtral 8x7B) | $0.50 |
| MoE 56B to 176B (e.g., DBRX) | $1.20 |
Cached input tokens are priced at 50 percent of standard rates. Batch inference is discounted 50 percent on both input and output tokens. Some featured models use separate input/output pricing. For example, DeepSeek V3 is priced at $0.56 per million input tokens and $1.68 per million output tokens, and Kimi K2 and DeepSeek V4 Pro carry similar split pricing reflecting their mixture-of-experts cost profiles.
| GPU | Price per Hour |
|---|---|
| NVIDIA A100 80 GB | $2.90 |
| NVIDIA H100 80 GB | $4.00 |
| NVIDIA H200 141 GB | $6.00 |
| NVIDIA B200 180 GB | $9.00 |
All on-demand deployments are billed per second with no startup charges, and replica counts can scale elastically with traffic.
| Model Size | SFT (per 1M training tokens) | DPO (per 1M training tokens) |
|---|---|---|
| Up to 16B | $0.50 | $1.00 |
| 16B to 80B | $3.00 | $6.00 |
| 80B to 300B | $6.00 | $12.00 |
| Over 300B | $10.00 | $20.00 |
Image generation pricing ranges from $0.00013 per step for SDXL to $0.08 per image for FLUX.1 Kontext Max. Audio transcription via Whisper V3 costs $0.0015 per audio minute, with the Turbo variant at $0.0009 per audio minute.
Fireworks AI has achieved SOC 2 Type II certification and HIPAA compliance, enabling adoption by enterprises in regulated industries including healthcare and financial services. Data is encrypted in transit using TLS 1.2+ and at rest using AES-256. The platform does not log or store prompt or generation data for open models without explicit user opt-in.
Fireworks maintains a Trust Center at trust.fireworks.ai where customers can access audit reports and compliance documentation. Controls are mapped to GDPR, CCPA, and other international data protection frameworks. The Microsoft Foundry integration extends these controls with Azure-native identity and observability, including support for Microsoft Entra ID, Azure Monitor, and Foundry's content-filtering policies.
Fireworks AI serves a range of high-profile technology companies and enterprises:
| Customer | Use Case |
|---|---|
| Cursor | Fast Apply and Copilot++ code editing models with speculative decoding, ~1,000 tok/sec on Llama 3 70B |
| Sourcegraph | AI-powered code search and code generation at scale |
| Vercel | v0 code generation tool; 40x end-to-end latency improvement and 93 percent error-free generation |
| Notion | Fine-tuned models reducing latency from 2 seconds to 350 milliseconds across 100 million users |
| Perplexity | Search and answer engine model serving |
| DoorDash | Production AI applications |
| Uber | Enterprise AI operations |
| Shopify | AI-powered commerce features |
| Samsung | Enterprise AI deployment |
| Upwork | Faster, smarter proposal generation for freelancers |
| GitLab | AI-assisted development workflows |
The Vercel v0 case study is the most publicized of the customer wins. Vercel migrated its code-fixing model to a Fireworks-hosted fine-tuned open model with speculative decoding and reinforcement fine-tuning, reducing the typical two-pass repair flow to a single pass and cutting end-to-end latency by roughly 40x on an 800-line-of-code reference file. Notion's fine-tuned writing assistants achieved a similar order-of-magnitude latency improvement, dropping perceived response time from around two seconds to roughly 350 milliseconds. Cursor uses Fireworks for its Fast Apply edits, with speculative decoding pushing throughput on a 70B model close to four-figure tokens per second, an output rate that effectively eliminates the perceived wait in the editor.
Fireworks has formed partnerships with MongoDB for database-integrated AI applications, NVIDIA through the Inception program, Google Cloud for marketplace distribution, AWS for SageMaker and Marketplace integrations, and Microsoft for the Foundry distribution channel announced in March 2026. The company's Series B and C rounds included strategic investments from NVIDIA, AMD, MongoDB, and Databricks, reflecting deep integration with the broader AI infrastructure ecosystem.
On March 3, 2026, Fireworks announced the acquisition of Hathora Inc., a real-time compute and server orchestration platform originally focused on multiplayer games and, more recently, on AI inference workloads. Hathora had built a container orchestration platform spanning 14 regions, two bare-metal providers, and four clouds, powering server infrastructure for live gaming titles including Splitgate 2, Stormgate, and Predecessor before pivoting toward real-time AI with a voice model marketplace. Lin Qiao characterized the deal as a talent-and-infrastructure acquisition, not a play for Hathora's gaming customer book; those gaming customers are being offboarded to Nitrado's GameFabric, with continued support through May 5, 2026. The Hathora team has been folded into Fireworks' inference and global orchestration groups to accelerate the build-out of a global compute platform with auto-scaling, disaster recovery, and ultra-low-latency routing for real-time workloads such as voice agents and live video.
In March 2026, Microsoft announced a multi-year strategic partnership making Fireworks AI available in public preview on Microsoft Foundry as a first-party Azure inference provider for open models. The integration lets enterprise teams evaluate, deploy, customize, and operate open models such as DeepSeek V3.2, Kimi K2.5, and MiniMax M2.5 alongside proprietary models like GPT-5, all through a single Azure endpoint and under Foundry's unified governance, identity, and observability tooling. Microsoft cited Fireworks' production metrics, processing 13 trillion or more tokens per day, sustaining approximately 180,000 requests per second, and generating 1,000+ tokens per second on large models, as the rationale for selecting Fireworks as the open-model inference partner.
Fireworks AI operates in the increasingly competitive AI inference platform market alongside several notable companies:
| Competitor | Primary Differentiator |
|---|---|
| Together AI | Broad model catalog (200+ models), strong fine-tuning support, and training infrastructure |
| Groq | Custom Language Processing Unit (LPU) hardware for ultra-low-latency inference, 400 to 800 tok/sec on large models |
| Cerebras | Wafer-scale CS-3 hardware, optimized for sub-second large-model inference |
| SambaNova | Reconfigurable Dataflow Architecture for high-throughput LLM serving |
| Anyscale | Ray-based distributed computing platform for scalable AI workloads |
| Replicate | Developer-friendly model deployment with Docker-based packaging; stronger for prototyping than production |
| AWS Bedrock | Managed service with access to proprietary and open models within the AWS ecosystem |
| Google Vertex AI | Integrated ML platform within Google Cloud |
Fireworks differentiates primarily on inference speed and throughput optimization. The company claims up to 40x faster performance and 8x cost reduction compared to other providers, driven by its proprietary FireAttention kernels, adaptive speculative decoding, and workload-specific optimization through FireOptimizer. While Groq, Cerebras, and SambaNova compete on raw latency using custom silicon, Fireworks achieves its performance gains through software optimization on standard NVIDIA and AMD GPUs, which provides greater flexibility in model support and deployment options.
Compared to Together AI, which offers a similarly broad model catalog, Fireworks places greater emphasis on production-grade serving optimizations and compound AI system orchestration. Together AI tends to be cited more often in research and training contexts and has a larger fine-tuning model menu, while Fireworks is more frequently chosen for latency-critical production workloads such as code assistants and conversational interfaces. Compared to Replicate, which targets rapid prototyping and community model sharing, Fireworks is focused on high-scale production inference with enterprise compliance requirements. Reviews from Q1 2026 note that Fireworks' 99.8 percent uptime is the highest among major inference providers, with Groq trailing at approximately 99.4 percent.
Industry analyses published in 2026 commonly recommend a multi-provider routing pattern: Groq for the absolute lowest-latency interactive calls, Together AI for batch and training workloads, and Fireworks for production agent workflows where function calling, structured output, and sustained throughput matter more than raw token-per-second peak. This routing pattern is partly the result of cross-provider price spreads of up to 6x on the same model, which makes provider selection a meaningful cost lever for high-volume production teams.
As reported across the Series C funding announcement, the Microsoft Foundry preview, and analyst estimates from Sacra in early 2026:
| Metric | Value | Source / Period |
|---|---|---|
| Daily token processing | Over 15 trillion tokens | Q1 2026 |
| Companies served | Over 10,000 | October 2025 |
| Developer reach | Hundreds of thousands | October 2025 |
| Sustained requests per second | Approximately 180,000 | March 2026 (Foundry launch) |
| Annualized revenue | Approximately $315 million | February 2026 (Sacra) |
| Year-over-year revenue growth | 416 percent | February 2026 (Sacra) |
| Total funding | Over $327 million | October 2025 |
| Post-money valuation | $4 billion | October 2025 |
| Employee count | Approximately 150 to 170 | Q4 2025 |
| Models available | Over 100 (serverless) | 2026 |
| API uptime | 99.8 to 99.99 percent | Q1 2026 (Tokenmix review) |
| Maximum sustained tokens per second per request | ~1,000 tok/sec on Llama 3 70B (Cursor) | 2025 |