Together AI is an AI cloud platform specializing in high-performance inference, fine-tuning, and training infrastructure for open-source foundation models. Founded in June 2022 by Vipul Ved Prakash, Ce Zhang, Chris Ré, and Percy Liang, the company has positioned itself as a leading alternative to hyperscaler AI services by focusing on speed, cost efficiency, and deep support for the open-source AI ecosystem. As of early 2026, Together AI supports over 200 open-source models, serves more than 450,000 developers, and operates its own data center infrastructure with 200 MW of secured power capacity [1][2].
Together AI (originally incorporated as Together Computer Inc.) was founded in June 2022 to address what its founders saw as a growing "compute moat" limiting access to the hardware and infrastructure needed to train and deploy large language models. The founding team brought together industry experience and academic AI research [3].
Vipul Ved Prakash, who serves as CEO, previously co-founded and led Topsy (acquired by Apple) and Cloudmark (acquired by Proofpoint), both of which dealt with large-scale data processing. Ce Zhang came from ETH Zurich, where his research focused on data management for machine learning. Chris Ré and Percy Liang are both professors at Stanford University, with Ré's lab having produced foundational work on data-centric AI and Liang leading the Center for Research on Foundation Models (CRFM) [3].
The company initially explored a decentralized cloud model that would aggregate idle compute across data centers and coordinate synchronized training over high-latency networks. Over time, Together AI shifted toward a more conventional but highly optimized cloud infrastructure approach, building and operating its own GPU clusters.
A pivotal moment came in the summer of 2023, when Together AI brought on Tri Dao as Chief Scientist. Dao is the creator of FlashAttention and FlashAttention-2, which are memory-efficient attention algorithms that have become standard components in modern transformer training and inference. His research forms the basis for much of Together AI's performance advantage [3].
Together AI has raised significant venture capital across multiple rounds, reflecting investor confidence in the infrastructure layer of the AI ecosystem.
| Round | Date | Amount | Lead Investors | Valuation |
|---|---|---|---|---|
| Seed | 2022 | Undisclosed | Lux Capital | N/A |
| Series A | November 2023 | $102.5M | Kleiner Perkins | ~$500M |
| Series B | February 2025 | $305M | General Catalyst, Prosperity7 | $3.3B |
The Series A round in November 2023 included participation from NVIDIA, NEA, Prosperity7 Ventures, Greycroft, and 137 Ventures, among others [4]. The $305 million Series B, announced in February 2025, was led by General Catalyst and co-led by Prosperity7 (the venture arm of Saudi Aramco). This round valued Together AI at $3.3 billion and was earmarked for expanding GPU cluster capacity and deploying NVIDIA Blackwell GPUs across multiple data centers in North America [2].
By September 2025, Sacra estimated Together AI's annualized revenue at $300 million, up from $130 million at the end of 2024 [5].
Together AI's inference API provides OpenAI-compatible access to over 200 open-source models. The platform is built on a proprietary inference engine that incorporates FlashAttention-3 kernels and advanced quantization techniques, delivering what the company claims is 2 to 3 times faster inference than hyperscaler solutions [2].
The API supports chat completions, text completions, embeddings, image generation, and audio processing. Its OpenAI-compatible format means that developers can often switch from OpenAI or other providers with minimal code changes.
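The sketch below shows this compatibility in practice, using the official OpenAI Python SDK pointed at Together's OpenAI-compatible endpoint. The model identifier is illustrative; substitute any model from the catalog discussed later in this section.

```python
# Minimal sketch: calling Together AI through the OpenAI Python SDK.
# Only the base URL and API key differ from stock OpenAI usage.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],   # a Together key, not an OpenAI key
    base_url="https://api.together.xyz/v1",   # Together's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative model ID
    messages=[
        {"role": "user", "content": "Summarize FlashAttention in one sentence."}
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```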
Together AI's focus on inference speed has resulted in measurable performance advantages over competing platforms. The company achieves up to 2x faster inference for top open-source models like Qwen, DeepSeek, and Kimi through a combination of GPU-level optimization, advanced speculative decoding, and FP4 quantization [9].
Key performance technologies include:
| Technology | Description | Impact |
|---|---|---|
| FlashAttention-3 kernels | Memory-efficient attention computation optimized for latest GPU architectures | Reduced memory overhead, higher throughput |
| FP4 quantization | 4-bit floating-point model compression | Lower memory usage, faster inference with minimal quality loss |
| ATLAS (AdapTive-LeArning Speculator System) | Learns from production traffic patterns to predict and pre-generate tokens | Further acceleration beyond static optimization |
| Together Kernel Collection | Proprietary CUDA kernels optimized for open-source model architectures | Model-specific performance gains |
| Speculative decoding | Uses a smaller draft model to predict tokens, verified by the larger model | 1.5-2x throughput improvement for autoregressive generation |
On NVIDIA Blackwell architecture, Together AI ranks first in speed benchmarks for top open-source models [9]. The ATLAS system is particularly notable because it continuously adapts to real-world production traffic rather than relying solely on static optimizations, meaning performance improves over time as the system observes usage patterns.
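To make the speculative decoding entry in the table above concrete, here is a minimal toy sketch of the idea. It is not Together's implementation: production systems use probabilistic accept/reject sampling to preserve the target model's output distribution, whereas this sketch uses greedy prefix matching for clarity, and the two "models" are stand-in functions.

```python
# Toy illustration of speculative decoding: a cheap draft model proposes a
# block of tokens; the expensive target model verifies them and keeps the
# longest agreeing prefix, so most tokens cost only a draft-model step.

def draft_next(context: list[str]) -> str:
    """Stand-in for a small, fast draft model (emits a fixed pattern)."""
    pattern = ["the", "quick", "brown", "fox", "jumps"]
    return pattern[len(context) % len(pattern)]

def target_next(context: list[str]) -> str:
    """Stand-in for the large target model (mostly agrees with the draft)."""
    tok = draft_next(context)
    return "lazy" if len(context) == 3 else tok  # disagree once, to show rejection

def speculative_decode(context: list[str], steps: int, k: int = 4) -> list[str]:
    out = list(context)
    for _ in range(steps):
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # 2. Target model verifies the proposals (one batched pass in real
        #    systems; simulated token by token here).
        accepted = 0
        for i, tok in enumerate(proposal):
            if target_next(out + proposal[:i]) == tok:
                accepted += 1
            else:
                break
        out += proposal[:accepted]
        # 3. On the first mismatch, take the target model's own token, so the
        #    output always matches what the target alone would have produced.
        if accepted < k:
            out.append(target_next(out))
    return out

print(speculative_decode([], steps=3))
```

When the draft and target agree often, each target-model pass yields several tokens instead of one, which is where the 1.5-2x throughput gains cited above come from.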
Together AI maintains one of the broadest catalogs of open-source models available through a managed inference service.
| Model Family | Provider | Notable Variants | Use Cases |
|---|---|---|---|
| Llama 3.1 / 3.3 / Llama 4 | Meta | 8B, 70B, 405B, Llama 4 Maverick | General text, code, multilingual |
| Mistral / Mixtral | Mistral AI | Mistral-7B, Mixtral-8x22B, Mistral Small 3, Mistral Large | Text generation, code, RAG |
| Qwen 2.5 / Qwen3 | Alibaba | Qwen 2.5, Qwen3, Qwen3-Coder-Next, Qwen3.5-397B | General text, code, multilingual |
| DeepSeek | DeepSeek | DeepSeek R1, DeepSeek-V3, DeepSeek-V3.1 | Reasoning, code, research |
| Gemma | Google | Gemma 2 9B, Gemma 2 27B, Gemma 3n E4B | Lightweight text generation |
| DBRX | Databricks | DBRX Instruct | Enterprise text generation |
| Stable Diffusion | Stability AI | SDXL, Stable Diffusion 3 | Image generation |
| Whisper | OpenAI | Whisper Large v3 | Speech-to-text |
The platform also supports specialized models for code generation, vision-language tasks, and embeddings. New models are typically added within days of their public release.
Together AI's fine-tuning service allows users to customize open-source models on their own data. The platform supports supervised fine-tuning, reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO). In 2025, Together AI expanded fine-tuning with native support for tool call training, reasoning model fine-tuning, and vision-language model adaptation, along with support for models with over 100 billion parameters [6].
The fine-tuning pipeline includes cost and ETA estimates before job submission, and Together AI reports up to 6 times higher throughput compared to earlier versions of their training infrastructure [6].
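A job submission might look like the following sketch, which assumes the together Python SDK's fine-tuning interface; the file ID, base model, and hyperparameters are illustrative, so consult the current SDK documentation before relying on exact parameter names.

```python
# Sketch of submitting a LoRA fine-tuning job via the together Python SDK
# (interface assumed; values are illustrative placeholders).
import os

from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

job = client.fine_tuning.create(
    training_file="file-abc123",                    # hypothetical ID of an uploaded JSONL dataset
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative base model in the <=16B tier
    n_epochs=3,
    learning_rate=1e-5,
    lora=True,                 # LoRA rather than a full fine-tune (see pricing below)
    suffix="my-support-bot",   # appended to the resulting model name
)
print(job.id, job.status)      # poll the job until it completes
```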
Together AI prices fine-tuning per million tokens, with costs varying by method and model size [7]:
| Method | Model Size | LoRA Price (per 1M tokens) | Full Fine-Tune Price (per 1M tokens) |
|---|---|---|---|
| Supervised Fine-Tuning (SFT) | Up to 16B parameters | $0.48 | $0.54 |
| Direct Preference Optimization (DPO) | Up to 16B parameters | $1.20 | $1.35 |
| SFT | 16B+ parameters | Custom pricing | Custom pricing |
| RLHF | All sizes | Custom pricing | Custom pricing |
LoRA (Low-Rank Adaptation) fine-tuning is more cost-effective than full fine-tuning because it trains small low-rank adapter matrices added to the base model rather than updating every weight. For most use cases, LoRA provides comparable quality at lower cost, while full fine-tuning offers maximum customization for highly specialized applications.
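As a worked example of the rates above, assuming billing on total tokens processed (dataset tokens times epochs) and a hypothetical 10M-token dataset:

```python
# Worked cost example at the listed SFT rates for models up to 16B.
# Dataset size and epoch count are illustrative.
dataset_tokens = 10_000_000              # 10M-token training set (hypothetical)
epochs = 3
billed_tokens = dataset_tokens * epochs  # 30M tokens processed

lora_rate = 0.48 / 1_000_000  # $ per token, LoRA SFT
full_rate = 0.54 / 1_000_000  # $ per token, full-parameter SFT

print(f"LoRA SFT: ${billed_tokens * lora_rate:.2f}")  # $14.40
print(f"Full SFT: ${billed_tokens * full_rate:.2f}")  # $16.20
```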
For organizations that need to train models from scratch or do extensive continued pre-training, Together AI offers custom training infrastructure built on its GPU clusters. This service is targeted at AI-native companies and research labs that need direct access to large-scale compute.
Training performance has improved significantly with the introduction of Blackwell GPUs. Training a 70B-parameter Llama-architecture model in BF16 precision with an optimized TorchTitan + Together Kernel Collection stack reaches 15,264 tokens per second per GPU on NVIDIA HGX B200, up from 8,080 tokens per second on NVIDIA HGX H100, a roughly 89 percent improvement in per-GPU training throughput [9].
Together AI operates its own GPU cluster infrastructure, offering on-demand access to NVIDIA H100, H200, and (beginning in 2025) Blackwell B200 GPUs. The company has secured 200 MW of power capacity across multiple data centers in North America, with a facility in Maryland that went live in July 2025 and additional capacity in Memphis [2].
In September 2025, Together AI launched self-service GPU infrastructure, allowing customers to provision GPU clusters ranging from a single node with eight GPUs to multi-node systems with hundreds of GPUs [10]. The self-service model supports the latest NVIDIA Hopper and Blackwell hardware and is optimized for distributed training and elastic inference workloads.
GPU clusters are available across multiple commitment levels [7]:
| GPU Type | On-Demand (per GPU-hour) | Dedicated Inference (per GPU-hour) | Notes |
|---|---|---|---|
| NVIDIA HGX H100 (80GB) | $3.49 | $3.99 | Most widely available |
| NVIDIA HGX H200 (141GB) | $4.19 | $5.49 | Higher memory for larger models |
| NVIDIA HGX B200 (180GB) | $7.49 | $9.95 | Latest Blackwell architecture |
Reserved capacity with multi-month commitments offers significantly reduced rates. Volume pricing is available for large-scale deployments through direct negotiation with Together AI's enterprise sales team.
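For rough budgeting, a back-of-envelope sketch under the assumption that the rates above are billed per GPU per hour:

```python
# Illustrative monthly cost for one 8-GPU H100 node at the on-demand rate,
# assuming per-GPU-hour billing and ~730 hours in a month.
gpus = 8
rate_per_gpu_hour = 3.49
hours_per_month = 730

monthly = gpus * rate_per_gpu_hour * hours_per_month
print(f"8x H100 on-demand: ${monthly:,.0f}/month")  # ~$20,382
```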
Together AI's performance advantage is closely tied to its integration of FlashAttention, a family of memory-efficient attention algorithms created by Chief Scientist Tri Dao. FlashAttention reduces the memory overhead and computational cost of the attention mechanism in transformers, which is typically the primary bottleneck in both training and inference.
The platform's inference engine incorporates FlashAttention-3 kernels, which are optimized for the latest NVIDIA GPU architectures. Combined with advanced quantization (reducing model precision to lower memory usage and increase throughput) and the proprietary Together Kernel Collection, these optimizations enable the platform to serve large models at significantly lower latency and cost than standard serving frameworks [2].
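For context, the attention mechanism computes

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where a standard implementation materializes the full $N \times N$ score matrix for a sequence of length $N$, costing $O(N^2)$ memory. FlashAttention instead processes $Q$, $K$, and $V$ in blocks sized to fit in on-chip SRAM and maintains a running (online) softmax, so the full score matrix is never written to GPU memory and the extra memory cost drops to $O(N)$.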
Together AI has invested heavily in developer experience, making its platform accessible to developers already familiar with the OpenAI ecosystem.
Together AI's API endpoints for chat, vision, images, embeddings, and speech are fully compatible with OpenAI's API format [11]. This means developers using the OpenAI Python library, LangChain, or other frameworks with OpenAI integrations can point their existing applications at Together AI's servers by changing only the base URL and API key. All parameters available in the OpenAI API work with Together AI, including streaming, function calling, and JSON mode.
This compatibility significantly reduces migration friction and allows developers to experiment with open-source models without rewriting application code.
| Feature | Description |
|---|---|
| OpenAI-compatible API | Drop-in replacement for OpenAI endpoints |
| Code sandbox | Isolated execution environment for code generation tasks ($0.0446/vCPU hour) |
| Evaluations dashboard | Compare model performance across benchmarks and custom datasets |
| Transcription API | Whisper-based speech-to-text ($0.0015/audio minute) |
| Streaming | Real-time token streaming for interactive applications |
| Function calling | Native tool use support compatible with OpenAI function calling format |
| JSON mode | Guaranteed structured JSON output for programmatic consumption |
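A short sketch of two of these features, streaming and JSON mode, again through the OpenAI SDK pointed at Together's endpoint; the model ID is illustrative, and JSON-mode support varies by model.

```python
# Streaming and JSON-mode sketch via the OpenAI-compatible API.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)

# Streaming: tokens arrive incrementally instead of as one final payload.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative
    messages=[{"role": "user", "content": "List three uses of embeddings."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

# JSON mode: constrain output to valid JSON for programmatic consumption.
structured = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative
    messages=[{"role": "user", "content": "Return a JSON object with fields 'name' and 'year' for the FlashAttention paper."}],
    response_format={"type": "json_object"},
)
print(structured.choices[0].message.content)
```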
Together AI's pricing is structured across its core product lines, with serverless inference priced per token and other services priced per GPU-hour.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| Llama 4 Maverick | $0.27 | $0.85 | Latest Meta model |
| DeepSeek-V3.1 | $0.60 | $1.70 | Reasoning-focused |
| Qwen3.5-397B | $0.20 | $0.60 | Large multilingual model |
| Mistral Small 3 | $0.10 | $0.30 | Efficient mid-range |
| Gemma 3n E4B | $0.02 | $0.04 | Ultra-lightweight |
| Llama 3.3 8B | $0.18 (combined) | - | Budget option |
| Qwen3-Coder-Next | $0.50 | $1.20 | Code-specialized |
| Service | Pricing |
|---|---|
| Fine-Tuning (LoRA SFT, up to 16B) | $0.48 per 1M tokens |
| Fine-Tuning (Full SFT, up to 16B) | $0.54 per 1M tokens |
| Dedicated Inference (H100) | $3.99/hour |
| Dedicated Inference (B200) | $9.95/hour |
| GPU Clusters (H100, on-demand) | $3.49/hour |
| GPU Clusters (B200, on-demand) | $7.49/hour |
| Storage | $0.16/GiB/month |
The company positions its pricing as competitive with major cloud providers. For popular open-source models, Together AI's per-token inference costs are often 30 to 60 percent lower than the same models served on Amazon Bedrock or Google Vertex AI, reflecting the efficiency of its custom inference engine [7].
Together AI offers a free tier that lets developers experiment without commitment, and enterprise customers can negotiate volume pricing.
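For a sense of scale, here is an illustrative monthly bill for a hypothetical workload at the Llama 4 Maverick rates listed above ($0.27 input / $0.85 output per 1M tokens); the request volume and token counts are assumptions.

```python
# Illustrative monthly serverless inference cost (hypothetical workload).
requests_per_month = 1_000_000
input_tokens, output_tokens = 500, 200  # per request (hypothetical)

in_cost = requests_per_month * input_tokens / 1e6 * 0.27
out_cost = requests_per_month * output_tokens / 1e6 * 0.85
print(f"Input:  ${in_cost:,.0f}")                    # $135
print(f"Output: ${out_cost:,.0f}")                   # $170
print(f"Total:  ${in_cost + out_cost:,.0f}/month")   # $305
```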
The following table compares pricing for representative models across major inference providers [7]:
| Model | Together AI | Amazon Bedrock | Google Vertex AI | Notes |
|---|---|---|---|---|
| Llama 3.3 70B (input) | ~$0.54/1M | ~$1.95/1M | ~$1.80/1M | Together AI ~70% cheaper |
| DeepSeek R1 (input) | ~$0.60/1M | ~$0.62/1M | N/A | Comparable pricing |
| Mistral Small (input) | $0.10/1M | ~$0.10/1M | ~$0.10/1M | Price parity on small models |
The cost advantage is most pronounced for larger models, where Together AI's custom inference stack can serve more requests per GPU than standard serving frameworks. For small models, pricing differences narrow as compute efficiency matters less relative to overhead costs.
| Feature | Together AI | Amazon Bedrock | Replicate | HuggingFace Inference |
|---|---|---|---|---|
| Focus | Open-source model inference + training | Multi-provider managed AI | Community model hosting | Model hub + inference |
| Model Count | 200+ | ~100 | 50,000+ | 500,000+ (hub) |
| Custom Training | Yes (GPU clusters) | Limited (fine-tuning only) | No | No (AutoTrain for fine-tuning) |
| Inference Speed | 2-3x faster (claimed) | Standard | Standard | Variable |
| Pricing Model | Per-token / per-GPU-hour | Per-token | Per-second compute | Per-token / per-GPU-hour |
| Target Users | AI developers, enterprises | Enterprise cloud users | Indie developers, startups | ML researchers, developers |
| GPU Access | H100, H200, B200 clusters | N/A (managed only) | N/A | A100, A10G via Inference Endpoints |
| Fine-Tuning Methods | SFT, DPO, RLHF, tool call, vision | SFT, continued pre-training | N/A | SFT (AutoTrain) |
Together AI's main differentiator is the combination of speed, cost efficiency, and deep support for the open-source ecosystem. While Bedrock offers a wider range of proprietary models and Replicate has a larger community catalog, Together AI focuses on delivering the fastest and cheapest inference for production-grade open-source models [7].
Together AI has built a customer base that spans AI-native startups and large enterprises. Notable customers include Salesforce, Zoom, SK Telecom, Hedra, Cognition, Zomato, Krea, Cartesia, and The Washington Post [2]. The company also counts Salesforce Ventures among its investors, reflecting a strategic partnership with one of the largest enterprise software companies.
The company's customer base reflects two distinct segments. AI-native startups like Cognition, Hedra, and Krea use Together AI as their primary inference infrastructure, drawn by the performance advantages and cost savings that allow them to scale AI-intensive products without the overhead of managing their own GPU clusters. Larger enterprises like Salesforce, Zoom, and SK Telecom typically use Together AI alongside their existing cloud infrastructure, often for specific workloads where open-source model performance and cost are critical factors.
Together AI's approach to infrastructure represents a deliberate bet on vertical integration. Rather than renting capacity from major cloud providers, the company builds and operates its own GPU clusters, giving it direct control over hardware configuration, networking, and cooling. This approach mirrors the strategies of other AI-focused infrastructure companies but is unusual among inference-focused platforms, which typically operate on top of existing cloud infrastructure [2].
The 200 MW of secured power capacity across multiple data centers is significant. For context, a single NVIDIA HGX B200 system consumes approximately 14 kilowatts, meaning 200 MW could theoretically power roughly 14,000 B200 systems at full load (before accounting for cooling and overhead). This level of capacity positions Together AI to serve large-scale training and inference workloads well into the future as demand for open-source model deployment continues to grow.
The geographic distribution of data centers across North America, including facilities in Maryland and Memphis, provides both redundancy and the ability to serve customers with data residency requirements. The Maryland facility's proximity to the Washington, D.C. area is notable given the growing demand for AI infrastructure from government and defense customers.
As of early 2026, Together AI continues to scale rapidly. At its first AI Native conference in March 2026, the company announced several product and business milestones [8]. The platform now supports over 200 models across all modalities, including chat, image, audio, vision, code, and embeddings.
The competitive landscape for AI inference has intensified, with hyperscalers like AWS, Microsoft Azure, and Google Cloud all expanding their open-source model offerings. Together AI's response has been to double down on performance, investing in custom kernels, Blackwell GPU deployments, and vertically integrated infrastructure that gives it a cost advantage over larger but less specialized competitors.
The company's trajectory from a research-oriented startup to a $3.3 billion infrastructure provider illustrates the growing importance of the inference and training infrastructure layer in the AI ecosystem. With revenue growing rapidly and a strong developer community, Together AI has established itself as a significant force in the market for open-source AI deployment.