Baseten is an inference platform for deploying, serving, and scaling machine learning models in production. The company provides infrastructure that converts ML models into production-ready APIs with GPU-accelerated hardware, traffic-based autoscaling, and multi-cloud reliability. Baseten is headquartered in San Francisco, California, with a small office in Manhattan and remote employees across the United States.
Founded in 2019 by Tuhin Srivastava, Amir Haghighat, Philip Howes, and Pankaj Gupta, Baseten has raised $585 million in venture capital funding across five rounds and reached a $5 billion valuation with its January 2026 Series E round. The platform serves customers including Cursor, Notion, Quora, Patreon, and Clay, and experienced 100x growth in inference volume during 2025. As of February 2026, the company employed approximately 200 people, and all four original co-founders remained with the company.
Baseten was founded in September 2019 by four co-founders who had worked together at Gumroad, an online marketplace for creators. Tuhin Srivastava (CEO) and Philip Howes (Chief Scientist) grew up together in Australia. They met Amir Haghighat (CTO) around 2012, when all three were among the first employees at Gumroad. At Gumroad, Haghighat served as head of engineering while Srivastava and Howes worked as data scientists who needed to build full-stack applications to use machine learning for fraud detection and content moderation. That experience forced them to learn web development, Kubernetes, and Docker on top of their ML work, a process they found unnecessarily difficult.
Before starting Baseten, Srivastava and Howes co-founded Shape, an HR analytics platform that was acquired by Reflektive in 2018. Haghighat worked as an engineering manager at Clover Health. The fourth co-founder, Pankaj Gupta, came from a software engineering role at Uber.
The founding insight behind Baseten was that while model training had become increasingly accessible through frameworks like PyTorch and TensorFlow, the bottleneck had shifted to deploying trained models for inference. Moving a trained model into production often required months of engineering work to build APIs, manage containers, handle GPU orchestration, and create monitoring infrastructure. Baseten was created so that data scientists would not have to become full-stack engineers to ship ML-powered applications.
Baseten initially focused on building a low-code platform that let data scientists create internal web applications on top of their ML models, similar to Retool but for ML teams. The company raised a $2.5 million seed round led by First Round Capital in 2021 under this original vision.
As the generative AI wave accelerated in 2022 and 2023, the company recognized that the larger opportunity lay in model inference infrastructure rather than application building. Baseten pivoted to focus on serverless model serving, developing Truss, an open-source model packaging framework, and building GPU-optimized infrastructure for running models at scale.
Baseten has raised $585 million in venture capital across five priced rounds (Series A through Series E), in addition to a $2.5 million seed round. The company's valuation grew from roughly $200 million in March 2024 to $5 billion by January 2026.
| Round | Date | Amount | Lead Investor(s) | Post-Money Valuation |
|---|---|---|---|---|
| Seed | 2021 | $2.5 million | First Round Capital | Not disclosed |
| Series A | April 2022 | $20 million | Greylock Partners | Not disclosed |
| Series B | March 2024 | $40 million | IVP, Spark Capital | ~$200 million |
| Series C | February 2025 | $75 million | Conviction | $825 million |
| Series D | September 2025 | $150 million | BOND | $2.15 billion |
| Series E | January 2026 | $300 million | IVP, CapitalG | $5 billion |
Other investors across these rounds include NVIDIA, Greylock, Altimeter, Battery Ventures, BoxGroup, Blackbird Ventures, 01 Advisors, Premji Invest, Scribble Ventures, South Park Commons, and Lachy Groom. NVIDIA contributed $150 million to the Series E round.
Baseten's revenue grew from near zero in 2022 to roughly $16 million by 2024, with significant acceleration as demand for AI inference infrastructure surged.
| Year | Estimated Revenue | Employees | Notes |
|---|---|---|---|
| 2022 | Near zero | ~20 | Pre-pivot phase |
| 2023 | $2.7 million | ~49 | Inference platform gaining traction |
| 2024 | ~$16 million | ~100 | 6x year-over-year growth |
| 2025 | $15.8 million (reported) | ~147 | 100x growth in inference volume |
| 2026 | Not disclosed | ~200 | $5 billion valuation |
By March 2024, the platform served approximately 20 large enterprises and tens of thousands of developers. By February 2025, the customer base had grown to over 100 large organizations and hundreds of smaller businesses, and the company reported near-zero customer churn.
The Baseten Inference Stack is the core technology powering the platform. It combines optimized inference runtimes with production infrastructure for autoscaling, request routing, and multi-cloud capacity management. The stack is designed to achieve 99.99% uptime through active-active deployments across multiple cloud providers and regions.
The inference stack consists of several layers:
Runtime Layer: Supports the largest large language models with low latency and high throughput. Features include KV cache reuse and request prioritization for fast time-to-first-token (TTFT), windowed attention for long-context models, quantization support, and a speculation engine for low inter-token latency.
Infrastructure Layer: Provides cold start optimizations, intelligent request routing for KV cache and LoRA cache hit rates, multi-node inference support, and disaggregated serving. The system holds requests in a queue while new GPUs spin up, then routes those requests across the expanded compute capacity.
Baseten Delivery Network (BDN): A multi-tier caching system for model weights. When a new replica starts, the BDN agent fetches a manifest, downloads weights through an in-cluster cache, and stores them in a node-level cache. The system uses parallelized byte-range downloads and specialized pods to accelerate loading times. As a result, most models cold start in seconds, and even the largest models start in a few minutes.
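The lookup order the BDN description implies can be illustrated with a toy two-tier cache. This is an explanatory sketch only, not Baseten's implementation; the function and cache names are invented for illustration:

```python
def fetch_weights(name, node_cache, cluster_cache, origin):
    """Illustrative multi-tier weight lookup: node cache first, then
    in-cluster cache, then origin storage, warming caches on the way back."""
    if name in node_cache:          # fastest: weights already on this node
        return node_cache[name]
    if name in cluster_cache:       # next: in-cluster cache
        blob = cluster_cache[name]
    else:
        blob = origin[name]         # slowest path: pull from object storage
        cluster_cache[name] = blob  # warm the in-cluster cache
    node_cache[name] = blob         # warm the node-level cache for new replicas
    return blob
```

In the real system, the origin fetch is further accelerated by parallelized byte-range downloads, which this sketch omits.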
Baseten offers three deployment configurations:
| Option | Description | Best For |
|---|---|---|
| Baseten Cloud | Fully managed multi-cloud infrastructure across 10+ providers | Teams that want zero infrastructure management |
| Self-Hosted | Deploy within customer VPCs for compliance and data control | Regulated industries, strict data residency requirements |
| Hybrid | Primary on self-hosted infrastructure with overflow to Baseten Cloud during demand spikes | Enterprises needing both control and elastic scaling |
The self-hosted option supports SOC 2 Type II and HIPAA compliance requirements, gives customers control over data residency, and aligns with standards such as GDPR.
Baseten abstracts differences between cloud providers to ensure its Inference Stack runs identically on any underlying infrastructure. The multi-cloud approach powers high availability through active-active deployments across different clouds. If a region or provider faces a capacity crunch or outage, the system can rapidly reroute and reprovision workloads to maintain service continuity.
In December 2025, Baseten signed a Strategic Collaboration Agreement (SCA) with Amazon Web Services, expanding the availability of Baseten's inference services to customers deploying AI applications on AWS. This partnership gives enterprises a way to use Baseten's inference technology on their own AWS infrastructure while keeping full control of their data.
Baseten Model APIs provide instant access to popular open-source models through OpenAI-compatible endpoints. Developers can point their existing OpenAI SDK at Baseten's inference endpoint and start making calls without any model deployment. Pricing is calculated per million input and output tokens.
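Because the endpoints follow the OpenAI wire format, a request can be sketched with nothing but the standard library. The endpoint URL and model slug below are illustrative assumptions, not documented values:

```python
import json
import os
import urllib.request

# Assumed OpenAI-compatible base URL and model slug for illustration.
BASE_URL = "https://inference.baseten.co/v1"
payload = {
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
}
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {os.environ.get('BASETEN_API_KEY', '')}",
        "Content-Type": "application/json",
    },
)

# Only hit the network when a real API key is configured.
if os.environ.get("BASETEN_API_KEY"):
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
        print(body["choices"][0]["message"]["content"])
```

With the official OpenAI Python SDK, the equivalent is constructing the client with `OpenAI(base_url=..., api_key=...)` pointed at the Baseten endpoint, leaving the rest of the application code unchanged.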
Available models include DeepSeek V3, DeepSeek R1, Llama 4 Maverick, Qwen, GLM-5, and GPT-OSS-120B, among others. Model APIs support structured outputs and tool calling. By leveraging Google Cloud A4 virtual machines based on NVIDIA Blackwell GPUs and the NVIDIA Dynamo inference framework, Baseten serves these models with over 225% better cost-performance for high-throughput inference and 25% better cost-performance for latency-sensitive inference compared to previous-generation hardware.
Dedicated deployments let teams serve custom, fine-tuned, and open-source models on specific GPU hardware. Users package their model using Truss, deploy it to Baseten, and receive a production API endpoint with autoscaling, monitoring, and request routing. Billing is per minute for the specific GPU hardware running the model.
Dedicated deployments support models from any framework: Hugging Face Transformers, diffusers, PyTorch, TensorFlow, vLLM, SGLang, TensorRT-LLM, and custom serving code.
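A Truss package centers on a `model/model.py` file exposing a `Model` class with `load()` and `predict()` methods. The sketch below follows that interface, with a stand-in in place of a real framework model so it stays self-contained:

```python
# model/model.py — the interface Truss expects: a Model class with
# load() (runs once per replica at startup) and predict() (per request).
class Model:
    def __init__(self, **kwargs):
        # Truss passes configuration via kwargs; heavy initialization
        # belongs in load(), not the constructor.
        self._pipeline = None

    def load(self):
        # In a real Truss this would load weights, e.g.:
        #   from transformers import pipeline
        #   self._pipeline = pipeline("sentiment-analysis")
        # A stand-in keeps this sketch runnable without dependencies:
        self._pipeline = lambda text: [{"label": "POSITIVE", "score": 0.99}]

    def predict(self, model_input):
        # model_input is the parsed JSON body of the request.
        return self._pipeline(model_input["text"])
```

Running `truss push` from the package directory deploys the model to Baseten and returns a production API endpoint.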
Launched in late 2025, Baseten Training is an infrastructure platform for fine-tuning open-source AI models. The platform handles GPU cluster management, multi-node orchestration, and cloud capacity planning. A key differentiator is model weight ownership: all production-critical artifacts, including model weights, evaluations, and training scripts, belong entirely to the customer. This stands in contrast to some competing fine-tuning platforms whose terms of service restrict customers from exporting their fine-tuned model weights.
The Training platform supports multiple weight formats including full model fine-tunes and LoRA adapter weights, with seamless promotion from training jobs to inference endpoints on Baseten's serving infrastructure. Baseten also credits 20% of training costs toward inference usage.
Baseten Chains is an SDK for building and deploying compound AI systems, which are multi-model workflows that combine several AI models or processing steps. Built on top of Truss, Chains reached general availability in 2025.
The architecture uses two core concepts: Chains, which define the overall multi-step workflow, and Chainlets, the individual processing steps that each run as independently scalable services.
Chainlets call each other directly without a centralized orchestration executor, which reduces latency by eliminating intermediary result retrieval and transmission between steps. Each Chainlet runs on customized hardware with independent autoscaling. For example, in a transcription pipeline, chunking operations can scale horizontally on CPUs while the transcription model runs on GPUs.
Key features of Chains include output streaming, binary IO with NumPy array support, subclassing for Chainlet reuse, "Chains Watch" for live-patching deployed code, and built-in linting and logging.
Common use cases for Chains include multi-step audio transcription pipelines and voice AI workflows such as text-to-speech, where different steps of a compound system need different hardware.
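A pipeline along these lines might be sketched as follows. The SDK names used (`truss_chains`, `ChainletBase`, `depends`, `mark_entrypoint`, `run_remote`) reflect the open-source truss-chains package; local stand-ins are defined so the sketch runs without it installed:

```python
try:
    import truss_chains as chains
except ImportError:
    # Minimal stand-ins mirroring the assumed SDK surface, so this
    # sketch runs locally without the truss-chains package.
    class chains:
        class ChainletBase:
            pass
        @staticmethod
        def depends(chainlet_cls, **kwargs):
            return chainlet_cls()
        @staticmethod
        def mark_entrypoint(cls):
            return cls


class Chunker(chains.ChainletBase):
    # CPU-bound step: splits text into fixed-size chunks. As its own
    # Chainlet it can scale horizontally, independent of GPU steps.
    def run_remote(self, text: str) -> list:
        return [text[i:i + 20] for i in range(0, len(text), 20)]


@chains.mark_entrypoint
class Pipeline(chains.ChainletBase):
    # Entrypoint Chainlet; calls the Chunker directly rather than
    # going through a central orchestrator.
    def __init__(self, chunker=chains.depends(Chunker)):
        self._chunker = chunker

    def run_remote(self, text: str) -> int:
        return len(self._chunker.run_remote(text))
```

In a real deployment, each Chainlet would declare its own hardware (for example, CPUs for chunking and a GPU for a transcription model) and the Chain would be deployed with the Truss CLI.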
Baseten Embeddings Inference (BEI) is a purpose-built system for high-throughput embedding, reranker, and classifier inference. On NVIDIA B200 GPUs, BEI achieved 3.3x higher throughput than vLLM and 3.6x higher throughput than TEI (Text Embeddings Inference) running on H100s.
Truss is Baseten's open-source framework for packaging and serving ML models. Written in Python, Truss handles containerization, dependency management, and GPU configuration, allowing developers to create containerized model servers without learning Docker or Kubernetes.
Key features of Truss include:
| Feature | Description |
|---|---|
| Framework-agnostic | Supports models from any framework: Transformers, diffusers, PyTorch, TensorFlow, vLLM, SGLang, TensorRT-LLM |
| No Docker required | Creates containerized model servers without writing Dockerfiles |
| Live reload | Iterate on model serving code in a remote development environment that mirrors production |
| GPU support | Built-in configuration for GPU types and counts |
| Secrets management | Secure handling of API keys and credentials |
| Caching | Model weight and dependency caching for fast cold starts |
| Local and remote parity | Equally straightforward to serve a model on localhost and in production |
Truss is maintained by Baseten and has accumulated over 6,000 stars on GitHub. While Truss deploys natively to the Baseten Inference Stack, it can also be used to deploy models to other infrastructure.
Baseten offers a range of NVIDIA GPUs with per-minute billing and no idle charges. Users pay only for the time their model is actively using compute.
| GPU | VRAM | Per Minute | Per Hour (approx.) |
|---|---|---|---|
| NVIDIA T4 | 16 GiB | $0.01052 | $0.63 |
| NVIDIA L4 | 24 GiB | $0.01414 | $0.85 |
| NVIDIA A10G | 24 GiB | $0.02012 | $1.21 |
| NVIDIA H100 MIG | 40 GiB | $0.0625 | $3.75 |
| NVIDIA A100 | 80 GiB | $0.06667 | $4.00 |
| NVIDIA H100 | 80 GiB | $0.10833 | $6.50 |
| NVIDIA B200 | 180 GiB | $0.16633 | $9.98 |
Baseten has also cut prices by 40% across all instance types, a reduction it attributed to falling GPU costs and improved infrastructure efficiency.
Baseten supports fractional H100 GPUs through NVIDIA's Multi-Instance GPU (MIG) technology. An H100 MIG instance provides 40 GiB of VRAM at $3.75 per hour, compared to $6.50 per hour for a full 80 GiB H100. This allows smaller models to run on high-performance hardware at lower cost.
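Using the per-minute rates from the pricing table above, the saving from a fractional GPU is straightforward to work out. The eight-hours-per-day duty cycle here is an arbitrary example, not a Baseten figure:

```python
# Per-minute rates from the pricing table above.
H100_MIG_RATE = 0.0625    # $/min for a 40 GiB H100 MIG slice
H100_FULL_RATE = 0.10833  # $/min for a full 80 GiB H100

# Example workload: model actively serving 8 hours per day.
active_minutes = 8 * 60
daily_mig = H100_MIG_RATE * active_minutes
daily_full = H100_FULL_RATE * active_minutes

print(f"MIG: ${daily_mig:.2f}/day vs full H100: ${daily_full:.2f}/day")
# With per-minute billing and no idle charges, the MIG slice costs
# $30.00/day versus $52.00/day for the full GPU in this scenario.
```

For models that fit in 40 GiB of VRAM, the fractional instance thus cuts the compute bill roughly in half while keeping Hopper-class performance.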
| Feature | Basic | Pro | Enterprise |
|---|---|---|---|
| Monthly cost | $0 (pay as you go) | Volume discounts available | Custom pricing |
| Dedicated deployments | Yes | Yes | Yes |
| Model APIs | Yes | Yes | Yes |
| Training | Yes | Yes | Yes |
| Fast cold starts | Yes | Yes | Yes |
| SOC 2 / HIPAA | Yes | Yes | Yes |
| Priority GPU access | No | Yes | Yes |
| Dedicated compute | No | Yes | Yes |
| Custom SLAs | No | No | Yes |
| Self-hosting | No | No | Yes |
| Custom global regions | No | No | Yes |
Baseten has published several performance benchmarks highlighting its inference optimization work:
TensorRT on H100: Serving Stable Diffusion XL (SDXL) with NVIDIA TensorRT on an H100 GPU improved latency by 40% and throughput by 70% compared to standard serving. For language models, the H100 provides 2x to 3x better inference performance than the A100 while costing only 62% more per hour.
Mistral 7B on H100 MIG: Running Mistral 7B in FP16 on an H100 MIG instance demonstrated 20% lower latency and 6% higher total throughput compared to a full A100 GPU, at a lower hourly cost.
Blackwell GPUs: Using Google Cloud A4 virtual machines with NVIDIA Blackwell architecture, Baseten achieved 225% better cost-performance for high-throughput inference and 25% better cost-performance for latency-sensitive inference.
BEI on B200: Baseten Embeddings Inference on B200 GPUs achieved 3.3x higher throughput than vLLM and 3.6x higher throughput than TEI on H100s.
Voice AI: For text-to-speech applications using Chains, processing times were halved and GPU utilization improved 6x.
NVIDIA made a $150 million investment in Baseten as part of the company's $300 million Series E round in January 2026. The investment reflects NVIDIA's strategic interest in the AI inference software layer. Baseten uses NVIDIA's hardware stack extensively, from T4 and A100 GPUs through the latest Blackwell B200 architecture.
Baseten adopted NVIDIA's Blackwell GPUs on Google Cloud alongside the NVIDIA Dynamo inference framework and TensorRT-LLM. NVIDIA has also published a case study on Baseten's inference infrastructure, highlighting the company as an example of optimized GPU utilization for AI workloads.
In December 2025, Baseten acquired Parsed, a reinforcement learning startup specializing in post-training and continual learning for large models. Parsed was co-founded by CEO Mudith Jayasekara and Chief Scientist Charles O'Neill. Before the acquisition, Parsed had already been working closely with Baseten's ecosystem, running more than 500 training jobs on Baseten's infrastructure.
The acquisition brought production data, fine-tuning, and inference under one roof, enabling companies to shape learning signals from production usage through reinforcement learning that rewards strong outputs and penalizes weak ones. Financial terms were not disclosed.
Baseten serves a wide range of AI companies and enterprises. By January 2026, the customer base included over 100 large organizations and hundreds of smaller businesses.
| Customer | Industry / Use Case |
|---|---|
| Cursor | AI-powered code editor |
| Notion | Productivity and AI features |
| Quora | AI-powered Q&A platform |
| Patreon | Creator platform; uses Baseten for OpenAI Whisper deployment |
| Clay | Sales intelligence and data enrichment |
| Writer | Enterprise AI writing platform |
| Abridge | Medical transcription |
| OpenEvidence | Medical AI |
| HeyGen | AI video generation |
| Mercor | AI recruiting |
| Superhuman | AI-powered email |
| World Labs | Spatial intelligence |
| Hex | Data analytics |
| Decagon | Customer support AI |
| Retool | Internal tool building |
| Wispr | Voice AI |
| Lovable | AI development platform |
| Scaled Cognition | AI agent platform |
Patreon reported saving 440 engineering hours annually, roughly $600,000 in costs, and 70% on GPU spend by deploying OpenAI Whisper on Baseten.
Scaled Cognition, an AI agent platform, deployed Baseten's inference stack on their own AWS GPUs within their VPC and achieved time-to-first-token under 120 milliseconds while reducing overall latency by 40%.
Baseten competes in the AI inference infrastructure market against both specialized startups and cloud provider offerings.
| Competitor | Category | Differentiator |
|---|---|---|
| Replicate | Serverless inference | One-click model deployment; strong for demos and prototyping; simpler developer experience |
| Modal | Serverless compute | Python-native infrastructure; faster cold starts (sub-second to 4 seconds); broader use cases beyond inference |
| Together AI | Full-stack AI cloud | Broad model catalog (200+ models); combined training and inference |
| Fireworks AI | Inference optimization | Custom FireAttention engine; focused on throughput and latency |
| AWS SageMaker | Cloud ML platform | Deep AWS ecosystem integration; comprehensive MLOps tooling |
| Google Vertex AI | Cloud ML platform | Integrated within Google Cloud; managed ML pipelines |
| Azure ML | Cloud ML platform | Microsoft ecosystem integration; enterprise features |
| RunPod | GPU cloud | Raw GPU access; competitive pricing; pod-based and serverless options |
| CoreWeave | GPU cloud | Large-scale GPU clusters; focused on raw infrastructure |
| Lambda | GPU cloud | On-demand GPU instances; research-focused |
Baseten differentiates itself from API-first platforms like Replicate through its support for custom models and enterprise deployment options (self-hosted, hybrid). Compared to general-purpose compute platforms like Modal, Baseten focuses specifically on model inference optimization with features like the Baseten Delivery Network for model weight caching and TensorRT-LLM integration. Against cloud provider offerings like SageMaker, Baseten positions itself as faster to deploy, with better GPU utilization and 40% lower costs compared to in-house infrastructure solutions.
The AI inference market was valued at $106.2 billion in 2025 and is projected to reach $255 billion by 2030, growing at a 19.2% compound annual growth rate.
Baseten maintains several open-source projects on GitHub under the basetenlabs organization:
| Project | Description |
|---|---|
| Truss | Model packaging and serving framework (6,000+ GitHub stars) |
| ML Cookbook | Ready-to-use ML training recipes for building and deploying models on Baseten |
| Model Trusses | Pre-built Truss configurations for popular models (MPT-7B, Stable Diffusion, Whisper, and others) |
Truss is the primary open-source contribution and serves as the on-ramp for developers adopting the Baseten platform. It is also usable independently of Baseten for local model serving and deployment to other infrastructure.
| Name | Role | Background |
|---|---|---|
| Tuhin Srivastava | CEO and Co-Founder | Former data scientist at Gumroad; co-founded Shape (acquired by Reflektive, 2018) |
| Amir Haghighat | CTO and Co-Founder | Former head of engineering at Gumroad; engineering manager at Clover Health |
| Philip Howes | Chief Scientist and Co-Founder | Former data scientist at Gumroad; co-founded Shape |
| Pankaj Gupta | Co-Founder | Former software engineer at Uber |
CEO Tuhin Srivastava has stated: "We think AI applications are just the last great market. We want to be the index for that economic growth."