Fireworks AI is an artificial intelligence infrastructure company that provides a high-performance inference platform for deploying and serving large language models (LLMs), image generation models, audio models, and embedding models. Founded in 2022 by former members of the PyTorch team at Meta, Fireworks AI focuses on delivering fast, cost-efficient, and production-ready AI inference through optimization technologies including its proprietary FireAttention kernels, speculative decoding, and adaptive serving configurations. The company is headquartered in Redwood City, California.
As of late 2025, the platform processes over 10 trillion tokens per day, serves more than 10,000 companies, and supports hundreds of thousands of developers building AI applications across text, image, audio, and multimodal domains.
Fireworks AI was co-founded in October 2022 by Lin Qiao and six other engineers who had previously worked together at Meta and Google. Lin Qiao served as Senior Director of Engineering at Meta from July 2015 to September 2022, where she led over 300 engineers developing AI frameworks and platforms, most notably Caffe2 and PyTorch. Under her leadership, the team rebuilt Meta's entire inference and training stack on top of PyTorch, eventually supporting more than five trillion inference requests per day.
Before joining Meta, Lin Qiao held engineering roles at LinkedIn and IBM. She earned her bachelor's and master's degrees in computer science from Fudan University and a doctorate in computer science from the University of California, Santa Barbara.
The remaining co-founders each brought deep expertise in large-scale AI systems:
| Co-Founder | Previous Role |
|---|---|
| Lin Qiao (CEO) | Head of PyTorch, Senior Director of Engineering at Meta |
| Benny Chen | Ads infrastructure lead at Meta |
| Chenyu Zhao | Vertex AI lead at Google |
| Dmytro Dzhulgakov | PyTorch core maintainer at Meta |
| Dmytro Ivchenko | PyTorch for ranking lead at Meta |
| James Reed | PyTorch compiler engineer at Meta |
| Pawel Garbacki | Newsfeed core ML lead at Meta |
The founding team's shared experience building and scaling PyTorch at Meta directly informed the company's core thesis: that inference optimization would become a critical bottleneck as AI adoption expanded, and that a purpose-built inference platform could deliver order-of-magnitude improvements in speed and cost over general-purpose solutions.
Fireworks AI has raised over $327 million in venture capital across three major funding rounds, reaching a $4 billion valuation as of October 2025.
| Round | Date | Amount | Lead Investor(s) | Valuation |
|---|---|---|---|---|
| Series A | March 2024 | $25 million | Benchmark | Not disclosed |
| Series B | July 2024 | $52 million | Sequoia Capital | $552 million |
| Series C | October 2025 | $250 million | Lightspeed Venture Partners, Index Ventures, Evantic | $4 billion |
Strategic investors across these rounds include NVIDIA, AMD, MongoDB Ventures, and Databricks Ventures. Angel investors include former Snowflake CEO Frank Slootman, former Meta COO Sheryl Sandberg, Airtable CEO Howie Liu, and Scale AI CEO Alexandr Wang.
At the time of its Series C announcement, Fireworks reported annualized revenue exceeding $280 million and a customer base that had grown 10x since the Series B round.
The core of Fireworks AI's technology stack is a custom-built inference engine designed from the ground up for low-latency, high-throughput model serving. The platform employs several proprietary optimization techniques that distinguish it from generic serving frameworks.
FireAttention is Fireworks' custom attention kernel implementation, purpose-built to accelerate transformer model inference. The technology has evolved through three major versions:
FireAttention V1 focused on quantization-aware inference optimizations, reducing memory bandwidth requirements while preserving output quality.
FireAttention V2 addressed long-context processing challenges. It introduced optimized attention scaling, multi-host deployment strategies, and advanced kernels that deliver up to 12x faster processing for long-context tasks compared to standard implementations.
FireAttention V3 extended the inference stack to AMD MI300 GPUs. Rather than using automated porting tools, the Fireworks team rewrote the attention kernel from scratch to account for fundamental architectural differences between AMD and NVIDIA hardware. The MI300 has a warp size of 64 (compared to 32 on NVIDIA), 304 compute units (compared to 113 on the H100), 192 GB of HBM (compared to 80 GB on the H100), and a smaller 64 KB shared memory. On benchmark tests, FireAttention V3 achieved a 1.4x improvement in average requests per second for LLaMA 8B and a 1.8x improvement for LLaMA 70B compared to competing implementations. In low-latency scenarios, gains reached up to 3x against NVIDIA NIM and 5.5x against AMD vLLM.
Fireworks employs speculative decoding as a core latency reduction technique. In standard autoregressive generation, each token is produced sequentially. Speculative decoding parallelizes this process by using a smaller "draft" model to predict multiple candidate tokens ahead, which the larger target model then verifies in a single forward pass. Tokens that pass verification are accepted immediately, reducing the total number of sequential forward passes needed.
Fireworks takes this further with adaptive speculative execution, a component of the FireOptimizer system. Instead of using a generic draft model trained on public datasets, the platform automatically trains domain-specific or workload-customized draft models using production traffic data. In a documented code generation workload, this approach increased the draft model hit rate from 29% to 76%, delivering a 2x speedup compared to a generic draft model that actually caused a 1.5x slowdown. Overall, adaptive speculative execution can deliver up to 3x latency improvements.
Speculative decoding is enabled by default for latency-sensitive deployments on the Fireworks platform.
The Fireworks inference engine uses continuous batching (also called iteration-level batching) to maximize GPU utilization. Unlike static batching, where all requests in a batch must complete before new ones can begin, continuous batching allows new requests to enter the processing pipeline as soon as individual sequences finish. This significantly improves throughput and reduces queuing delays, especially for workloads with variable sequence lengths.
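The throughput difference between the two batching strategies can be illustrated with a toy step-counting scheduler (the request lengths and step counts below are properties of this simulation, not Fireworks benchmarks):

```python
# Static vs. continuous batching, counted in decode steps. In static
# batching, a batch only retires once its longest sequence finishes; in
# continuous batching, a finished slot is refilled immediately.

def static_batching(lengths, batch_size):
    """Total decode steps when a batch only advances and retires together."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])  # stragglers hold the batch
    return steps

def continuous_batching(lengths, batch_size):
    """Total decode steps when new requests refill freed slots."""
    pending, active, steps = list(lengths), [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))  # admit a waiting request
        steps += 1  # one step advances every active sequence by one token
        active = [r - 1 for r in active if r > 1]  # r == 1 emits its last token
    return steps

lengths = [2, 8, 2, 8]  # output lengths of four requests, in arrival order
print(static_batching(lengths, batch_size=2))      # 16 steps
print(continuous_batching(lengths, batch_size=2))  # 12 steps
```

The short sequences finish early and their slots are reused, which is why the gap widens as sequence lengths become more variable.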
Fireworks AI hosts over 100 pre-deployed open-source models spanning text generation, image generation, audio processing, and embedding tasks. Developers can access these models immediately through a serverless API without managing any infrastructure.
The platform supports a broad range of large language models, including:
| Model Family | Examples | Developer |
|---|---|---|
| LLaMA | LLaMA 3, LLaMA 3.1, LLaMA 4 | Meta |
| Mistral | Mistral 3, Mistral Nemo | Mistral AI |
| Mixtral | Mixtral 8x7B, Mixtral 8x22B | Mistral AI |
| Qwen | Qwen 2, Qwen 2.5, Qwen 3 | Alibaba Cloud |
| DeepSeek | DeepSeek V3, DeepSeek R1 | DeepSeek |
| Gemma | Gemma 3 | Google |
| Phi | Phi 4 | Microsoft |
| Kimi | Kimi K2, Kimi K2.5 | Moonshot AI |
Fireworks supports image generation models including FLUX.1 (dev, schnell, and Kontext variants) from Black Forest Labs and Stable Diffusion 3.5 from Stability AI. Vision-language models for image understanding are also available.
The platform hosts Whisper V3 and Whisper V3 Turbo for speech-to-text transcription, with support for diarization (speaker identification).
Embedding models from Nomic AI and others are available for vector search and retrieval-augmented generation (RAG) applications.
Fireworks AI provides three primary deployment modes to accommodate different usage patterns and scale requirements.
The serverless tier allows developers to call any pre-hosted model through a REST API with pay-per-token pricing. There are no cold starts for popular models, and Fireworks handles all scaling, load balancing, and failover automatically. This option is suited for prototyping, moderate-volume production workloads, and applications that need access to many different models.
For high-volume or latency-sensitive applications, developers can provision dedicated GPU capacity billed per second. On-demand deployments provide isolated compute resources, consistent performance, and the ability to run custom or fine-tuned models. Supported GPU types include NVIDIA A100, H100, H200, and B200, as well as AMD MI300.
Enterprise customers can deploy within their own cloud environments through integrations with AWS (including AWS Marketplace and Amazon SageMaker), Google Cloud Marketplace, and private connectivity options such as AWS PrivateLink and GCP Private Service Connect.
Fireworks provides a managed fine-tuning service that supports supervised fine-tuning (SFT), preference tuning via Direct Preference Optimization (DPO), and reinforcement fine-tuning. The service uses LoRA (Low-Rank Adaptation) to enable efficient fine-tuning without retraining the full model.
A notable feature of the platform is Multi-LoRA serving, which allows hundreds of fine-tuned LoRA adapters to run simultaneously on a single base model deployment. Because LoRA adapters are small and share the same base model weights, this approach can be up to 100x more cost-efficient than deploying a separate instance for each fine-tuned model. Users can deploy LoRA adapters trained on other platforms, and adapters are served at base-model token rates.
Fine-tuning is available for models across the LLaMA, Qwen, Phi, Gemma, and DeepSeek families, as well as Mixture-of-Experts architectures. Fine-tuning costs start at $0.50 per million training tokens for models up to 16 billion parameters.
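The memory-sharing idea behind Multi-LoRA serving can be sketched in a few lines of pure Python (illustration only: the `serve` function, adapter names, and tiny 2x2 matrices are invented for this example, and real adapters attach to many layers, not one):

```python
# One shared base weight matrix W, plus per-adapter low-rank factors B and A.
# Each adapter stores only B (d x r) and A (r x d), far smaller than a full
# copy of W when the rank r << d, so many adapters fit alongside one base.

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Shared base weights, loaded once for all adapters.
W = [[1.0, 0.0],
     [0.0, 1.0]]

# Two rank-1 adapters: each is just (B, A).
adapters = {
    "tenant-a": ([[1.0], [0.0]], [[0.0, 2.0]]),
    "tenant-b": ([[0.0], [1.0]], [[3.0, 0.0]]),
}

def serve(x, adapter_id):
    """Apply y = x (W + B A) using the shared base plus one adapter."""
    B, A = adapters[adapter_id]
    W_eff = add(W, matmul(B, A))
    return matmul(x, W_eff)

x = [[1.0, 1.0]]
print(serve(x, "tenant-a"))  # [[1.0, 3.0]]
print(serve(x, "tenant-b"))  # [[4.0, 1.0]]
```

Because only the small (B, A) pairs differ per request, switching adapters between requests is cheap, which is what makes serving hundreds of them on one deployment practical.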
Fireworks supports function calling (also called tool use) through an OpenAI-compatible API. Developers define functions using JSON Schema, and the model generates structured tool calls with appropriate parameters when a query matches a defined function. Configuration options include automatic tool selection, forced tool calling, and specifying a particular function.
The platform supports parallel function calling on compatible models, streaming of tool call arguments, and integration with the Model Context Protocol (MCP) through the Responses API.
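A request body for such a tool-calling call might look like the following (the model id, function name, and field values are illustrative placeholders; only the overall `tools` / JSON Schema shape follows the OpenAI-compatible convention described above):

```python
# Constructing an OpenAI-compatible function-calling request body as a plain
# dict. The model id and the get_weather function are examples, not actual
# platform defaults; nothing is sent over the network here.
import json

request_body = {
    "model": "accounts/fireworks/models/llama-v3p1-70b-instruct",  # example id
    "messages": [
        {"role": "user", "content": "What's the weather in Redwood City?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {  # JSON Schema describing the arguments
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"},
                        "unit": {"type": "string",
                                 "enum": ["celsius", "fahrenheit"]},
                    },
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",  # "auto" = model decides; a name forces that tool
}

payload = json.dumps(request_body)
```

When the model decides a tool applies, the response contains a structured tool call whose arguments conform to the declared schema, which the application then executes and feeds back as a tool message.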
Fireworks offers two methods for constraining model output to structured formats:
JSON Mode enforces output conformance to a provided JSON schema by restricting token generation at each decoding step to only tokens that would produce valid JSON according to the schema. Fireworks reports that its JSON mode runs at approximately 120 tokens per second, roughly 4x faster than competing platforms.
Grammar Mode uses custom BNF (Backus-Naur Form) grammars to constrain output to arbitrary structured formats beyond JSON, such as classification labels, programming language syntax, or domain-specific formats. According to Fireworks, it is the only inference platform offering grammar-based constrained decoding.
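The common idea behind both modes, masking the vocabulary at each decoding step so the output stays inside a formal language, can be shown with a toy classification "grammar" (this is a teaching sketch, not Fireworks' decoder; the label set stands in for a BNF grammar):

```python
# Constrained greedy decoding over a two-label "grammar": at every step,
# candidate tokens are filtered so the running output remains a valid prefix
# of some accepted string, then the highest-scoring allowed token is taken.

LABELS = ["positive", "negative"]

def allowed_tokens(prefix, vocab):
    """Tokens t such that prefix + t is still a prefix of some label."""
    return [t for t in vocab
            if any(label.startswith(prefix + t) for label in LABELS)]

def constrained_decode(scored_vocab):
    """Greedy decode: best *allowed* token each step until a label is done."""
    out = ""
    vocab = sorted(scored_vocab, key=scored_vocab.get, reverse=True)
    while out not in LABELS:
        for token in vocab:  # model's ranking, best first
            if token in allowed_tokens(out, vocab):
                out += token
                break
        else:
            raise ValueError("no valid continuation")
    return out

# The "model" prefers a token ("xyz") that would break the format; the mask
# discards it and forces a valid label anyway.
scores = {"pos": 0.1, "neg": 0.5, "itive": 0.3, "ative": 0.4, "xyz": 0.9}
print(constrained_decode(scores))  # → "negative"
```

JSON Mode applies the same masking principle with the set of strings valid under a JSON schema, and Grammar Mode with the language generated by a user-supplied BNF grammar.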
Fireworks AI has positioned itself as a platform for building compound AI systems, which combine multiple models, tools, data sources, and processing steps to solve complex tasks. Two proprietary model families support this vision.
FireFunction is Fireworks' series of open-weights models optimized for function calling and tool orchestration.
| Version | Base Model | Key Benchmarks | Speed vs. GPT-4/4o |
|---|---|---|---|
| FireFunction V1 | Mixtral 8x7B | 87.88% accuracy (fewer than 5 functions); within 5% of GPT-4 Turbo on complex selection | 4x faster than GPT-4 Turbo |
| FireFunction V2 | LLaMA 3 70B Instruct | 0.81 combined score (MT Bench + Gorilla + Nexus) vs. 0.80 for GPT-4o | 2.5x faster than GPT-4o at 10% of the cost |
FireFunction V1, released in early 2024, was built on Mixtral 8x7B and optimized for routing decisions and structured information extraction. It achieved 0.4 to 0.6 second response latency compared to 2.3 to 3.0 seconds for GPT-4, representing roughly a 4x speedup.
FireFunction V2, built on LLaMA 3 70B Instruct, matched or exceeded GPT-4o on combined benchmarks while running 2.5x faster and costing approximately 10% as much ($0.90 per million tokens versus $15 per million output tokens for GPT-4o). It supports parallel function calling, handles up to 30 function specifications, and maintains strong multi-turn conversational abilities alongside its tool-calling capabilities. Both versions are available as open-weights models on Hugging Face.
FireOptimizer is Fireworks' automated optimization engine, which tunes inference deployments across multiple layers of the serving stack.
The system explores over 100,000 possible serving configurations to find the optimal combination of quality, throughput, and latency for a given workload. A key insight driving FireOptimizer is that the same model on identical hardware can exhibit dramatically different cost-performance profiles depending on configuration; for example, LLaMA 70B on eight GPUs in a volume-optimized setup can be 4x cheaper per token than the same model on the same GPUs optimized for single-request speed.
The 3D FireOptimizer extension automates multi-dimensional tradeoff searches, allowing enterprises to specify target latency, throughput, and quality constraints and receive an automatically optimized deployment configuration. Adaptive speculative execution is available to enterprise reserved deployment users at no additional cost.
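A configuration search of this kind can be sketched as a constrained grid search (every number below, and the analytical cost/latency model itself, is made up for illustration; FireOptimizer's real search space and objectives are not public in this detail):

```python
# Toy serving-configuration search: enumerate (batch_size, num_gpus) combos,
# estimate latency and cost with a fake analytical model, and pick the
# cheapest configuration that meets a latency target.
from itertools import product

def estimate(batch_size, num_gpus):
    """Fake model: bigger batches cut per-token cost but raise latency."""
    latency_ms = 50 + 10 * batch_size / num_gpus
    cost_per_m_tokens = 8.0 / (batch_size * num_gpus ** 0.5)
    return latency_ms, cost_per_m_tokens

def best_config(max_latency_ms):
    candidates = []
    for batch_size, num_gpus in product([1, 4, 16, 64], [1, 2, 4, 8]):
        latency, cost = estimate(batch_size, num_gpus)
        if latency <= max_latency_ms:  # discard configs that miss the SLA
            candidates.append((cost, batch_size, num_gpus))
    return min(candidates)  # cheapest config that satisfies the constraint

cost, batch_size, num_gpus = best_config(max_latency_ms=100)
print(batch_size, num_gpus)  # the volume-vs-latency tradeoff in miniature
```

Even this 16-point grid shows the effect described above: relaxing or tightening the latency target changes which configuration is cheapest, and a real search over 100,000+ configurations simply does this at much finer granularity across more dimensions.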
Fireworks AI provides an OpenAI-compatible REST API, allowing developers using the OpenAI Python or JavaScript SDK to switch to Fireworks by changing the base URL and API key. The API endpoint is https://api.fireworks.ai/inference/v1.
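A minimal sketch of targeting that endpoint using only the Python standard library follows (the OpenAI SDK works the same way once `base_url` points at Fireworks; the model id and API key are placeholders, and the request is constructed but deliberately not sent):

```python
# Building a chat-completions request against the OpenAI-compatible endpoint.
# API_KEY is a placeholder; real keys come from the Fireworks dashboard.
import json
import urllib.request

API_KEY = "fw-..."  # placeholder, not a real key
BASE_URL = "https://api.fireworks.ai/inference/v1"

body = {
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",  # example id
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}

req = urllib.request.Request(
    url=f"{BASE_URL}/chat/completions",
    data=json.dumps(body).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted to keep this offline.
print(req.full_url)
```

With the official OpenAI SDK, the equivalent change is passing `base_url=BASE_URL` and the Fireworks key when constructing the client; the rest of the application code is unchanged.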
The platform integrates with popular developer frameworks and tools:
| Framework / Tool | Integration |
|---|---|
| OpenAI SDK | Native compatibility (Python, Node.js) |
| LangChain | langchain_fireworks provider |
| Vercel AI SDK | @ai-sdk/fireworks module |
| LiteLLM | Built-in Fireworks provider |
| LlamaIndex | Fireworks embedding and LLM integration |
| Model Context Protocol (MCP) | Responses API with MCP support (beta) |
The API supports chat completions, text completions, embeddings, image generation, audio transcription, and tool calling. Streaming is supported across all text and tool-calling endpoints.
Fireworks AI uses a usage-based pricing model across its serverless, on-demand, and fine-tuning products.
| Model Size | Price per Million Tokens |
|---|---|
| Less than 4B parameters | $0.10 |
| 4B to 16B parameters | $0.20 |
| Over 16B parameters | $0.90 |
| MoE up to 56B (e.g., Mixtral 8x7B) | $0.50 |
| MoE 56B to 176B (e.g., DBRX) | $1.20 |
Cached input tokens are priced at 50% of standard rates. Batch inference is discounted 50% on both input and output tokens.
Some featured models use separate input/output pricing. For example, DeepSeek V3 is priced at $0.56 per million input tokens and $1.68 per million output tokens.
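The table and discounts above translate into straightforward arithmetic; the sketch below treats all tokens in a request uniformly for simplicity (the cached-token discount actually applies to input tokens), and the tier keys are this example's own names:

```python
# Serverless cost calculator based on the per-token table above.
# Prices are USD per million tokens; cached tokens bill at 50% of the rate,
# and batch inference is discounted a further 50%.

PRICE_PER_M = {
    "lt4b": 0.10,            # < 4B parameters
    "4b_to_16b": 0.20,       # 4B - 16B
    "gt16b": 0.90,           # > 16B
    "moe_le56b": 0.50,       # MoE up to 56B (e.g., Mixtral 8x7B)
    "moe_56b_to_176b": 1.20, # MoE 56B - 176B (e.g., DBRX)
}

def serverless_cost(tier, tokens, cached_fraction=0.0, batch=False):
    """Cost in USD for `tokens` tokens on the given pricing tier."""
    rate = PRICE_PER_M[tier] / 1_000_000
    cached = tokens * cached_fraction
    fresh = tokens - cached
    cost = fresh * rate + cached * rate * 0.5  # cached tokens at half rate
    return cost * 0.5 if batch else cost       # batch jobs at half rate

# 10M tokens on a >16B model, half of them cached:
print(round(serverless_cost("gt16b", 10_000_000, cached_fraction=0.5), 2))
# → 6.75
```

At $0.90 per million tokens, 10 million uncached tokens would cost $9.00; halving the rate on the 5 million cached tokens brings the total to $6.75.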
| GPU | Price per Hour |
|---|---|
| NVIDIA A100 80 GB | $2.90 |
| NVIDIA H100 80 GB | $4.00 |
| NVIDIA H200 141 GB | $6.00 |
| NVIDIA B200 180 GB | $9.00 |
All on-demand deployments are billed per second with no startup charges.
| Model Size | SFT (per 1M training tokens) | DPO (per 1M training tokens) |
|---|---|---|
| Up to 16B | $0.50 | $1.00 |
| 16B to 80B | $3.00 | $6.00 |
| 80B to 300B | $6.00 | $12.00 |
| Over 300B | $10.00 | $20.00 |
Image generation pricing ranges from $0.00013 per step for SDXL to $0.08 per image for FLUX.1 Kontext Max. Audio transcription via Whisper V3 costs $0.0015 per audio minute, with the Turbo variant at $0.0009 per audio minute.
Fireworks AI has achieved SOC 2 Type II certification and HIPAA compliance, enabling adoption by enterprises in regulated industries including healthcare and financial services. Data is encrypted in transit using TLS 1.2+ and at rest using AES-256. The platform does not log or store prompt or generation data for open models without explicit user opt-in.
Fireworks maintains a Trust Center at trust.fireworks.ai where customers can access audit reports and compliance documentation. Controls are mapped to GDPR, CCPA, and other international data protection frameworks.
Fireworks AI serves a range of high-profile technology companies and enterprises:
| Customer | Use Case |
|---|---|
| Cursor | Fast Apply and Copilot++ code editing models with speculative decoding |
| Sourcegraph | AI-powered code search and code generation at scale |
| Vercel | v0 code generation tool; achieved 40x end-to-end latency improvement and 93% error-free generation |
| Notion | Fine-tuned models reducing latency from 2 seconds to 350 milliseconds |
| DoorDash | Production AI applications |
| Uber | Enterprise AI operations |
| Shopify | AI-powered commerce features |
| Samsung | Enterprise AI deployment |
| Upwork | Faster, smarter proposal generation for freelancers |
| GitLab | AI-assisted development workflows |
Fireworks has formed partnerships with MongoDB for database-integrated AI applications, NVIDIA through the Inception program, Google Cloud for marketplace distribution, and AWS for SageMaker and Marketplace integrations. The company's Series B and C rounds included strategic investments from NVIDIA, AMD, MongoDB, and Databricks, reflecting deep integration with the broader AI infrastructure ecosystem.
Fireworks AI operates in the increasingly competitive AI inference platform market alongside several notable companies:
| Competitor | Primary Differentiator |
|---|---|
| Together AI | Broad model catalog (200+ models), strong fine-tuning support, and training infrastructure |
| Groq | Custom Language Processing Unit (LPU) hardware for ultra-low-latency inference |
| Anyscale | Ray-based distributed computing platform for scalable AI workloads |
| Replicate | Developer-friendly model deployment with Docker-based packaging; stronger for prototyping than production |
| AWS Bedrock | Managed service with access to proprietary and open models within the AWS ecosystem |
| Google Vertex AI | Integrated ML platform within Google Cloud |
Fireworks differentiates primarily on inference speed and throughput optimization. The company claims up to 40x faster performance and 8x cost reduction compared to other providers, driven by its proprietary FireAttention kernels, adaptive speculative decoding, and workload-specific optimization through FireOptimizer. While Groq competes on raw latency using custom silicon, Fireworks achieves its performance gains through software optimization on standard NVIDIA and AMD GPUs, which provides greater flexibility in model support and deployment options.
Compared to Together AI, which offers a similarly broad model catalog, Fireworks places greater emphasis on production-grade serving optimizations and compound AI system orchestration. Compared to Replicate, which targets rapid prototyping and community model sharing, Fireworks is focused on high-scale production inference with enterprise compliance requirements.
As reported alongside the Series C funding announcement in October 2025:
| Metric | Value |
|---|---|
| Daily token processing | Over 10 trillion tokens |
| Companies served | Over 10,000 |
| Developer reach | Hundreds of thousands |
| Annualized revenue | Over $280 million |
| Total funding | Over $327 million |
| Post-money valuation | $4 billion |
| Employee count | Approximately 150 to 170 |
| Models available | Over 100 (serverless) |
| API uptime | 99.99% |