Together AI is an AI cloud platform specializing in high-performance inference, fine-tuning, and training infrastructure for open-source foundation models. Founded in June 2022 by Vipul Ved Prakash, Ce Zhang, Chris Ré, and Percy Liang, the company has positioned itself as a leading alternative to hyperscaler AI services by focusing on speed, cost efficiency, and deep support for the open-source AI ecosystem. As of early 2026, Together AI supports over 200 open-source models, serves more than 450,000 developers, and operates its own data center infrastructure with 200 MW of secured power capacity [1][2].
Together AI (originally incorporated as Together Computer Inc.) was founded in June 2022 to address what its founders saw as a growing "compute moat" limiting access to the hardware and infrastructure needed to train and deploy large language models. The founding team brought together industry experience and academic AI research [3].
Vipul Ved Prakash, who serves as CEO, previously co-founded and led Topsy (acquired by Apple) and Cloudmark (acquired by Proofpoint), both of which dealt with large-scale data processing. Ce Zhang came from ETH Zurich, where his research focused on data management for machine learning. Chris Ré and Percy Liang are both professors at Stanford University, with Ré's lab having produced foundational work on data-centric AI and Liang leading the Center for Research on Foundation Models (CRFM) [3].
The company initially explored a decentralized cloud model that would aggregate idle compute across data centers and coordinate synchronized training over high-latency networks. Over time, Together AI shifted toward a more conventional but highly optimized cloud infrastructure approach, building and operating its own GPU clusters.
A pivotal moment came in the summer of 2023, when Together AI brought on Tri Dao as Chief Scientist. Dao is the creator of FlashAttention and FlashAttention-2, which are memory-efficient attention algorithms that have become standard components in modern transformer training and inference. His research forms the basis for much of Together AI's performance advantage [3].
Together AI has raised significant venture capital across multiple rounds, reflecting investor confidence in the infrastructure layer of the AI ecosystem.
| Round | Date | Amount | Lead Investors | Valuation |
|---|---|---|---|---|
| Seed | 2022 | Undisclosed | Lux Capital | N/A |
| Series A | November 2023 | $102.5M | Kleiner Perkins | ~$500M |
| Series B | February 2025 | $305M | General Catalyst, Prosperity7 | $3.3B |
The Series A round in November 2023 included participation from NVIDIA, NEA, Prosperity7 Ventures, Greycroft, and 137 Ventures, among others [4]. The $305 million Series B, announced in February 2025, was led by General Catalyst and co-led by Prosperity7 (the venture arm of Saudi Aramco). This round valued Together AI at $3.3 billion and was earmarked for expanding GPU cluster capacity and deploying NVIDIA Blackwell GPUs across multiple data centers in North America [2].
By September 2025, Sacra estimated Together AI's annualized revenue at $300 million, up from $130 million at the end of 2024 [5].
Together AI's inference API provides OpenAI-compatible access to over 200 open-source models. The platform is built on a proprietary inference engine that incorporates FlashAttention-3 kernels and advanced quantization techniques, delivering what the company claims is 2 to 3 times faster inference than hyperscaler solutions [2].
The API supports chat completions, text completions, embeddings, image generation, and audio processing. Its OpenAI-compatible format means that developers can often switch from OpenAI or other providers with minimal code changes.
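The sketch below shows this compatibility in practice, using the official OpenAI Python SDK pointed at Together's OpenAI-compatible endpoint. The model identifier is illustrative; substitute any model from the catalog discussed later in this section.

```python
# Minimal sketch: calling Together AI through the OpenAI Python SDK.
# Only the base URL and API key differ from stock OpenAI usage.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],   # a Together key, not an OpenAI key
    base_url="https://api.together.xyz/v1",   # Together's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative model ID
    messages=[
        {"role": "user", "content": "Summarize FlashAttention in one sentence."}
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```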
Together AI's focus on inference speed has resulted in measurable performance advantages over competing platforms. The company achieves up to 2x faster inference for top open-source models like Qwen, DeepSeek, and Kimi through a combination of GPU-level optimization, advanced speculative decoding, and FP4 quantization [9].
Key performance technologies include:
| Technology | Description | Impact |
|---|---|---|
| FlashAttention-3 kernels | Memory-efficient attention computation optimized for latest GPU architectures | Reduced memory overhead, higher throughput |
| FP4 quantization | 4-bit floating-point model compression | Lower memory usage, faster inference with minimal quality loss |
| ATLAS (AdapTive-LeArning Speculator System) | Learns from production traffic patterns to predict and pre-generate tokens | Further acceleration beyond static optimization |
| Together Kernel Collection | Proprietary CUDA kernels optimized for open-source model architectures | Model-specific performance gains |
| Speculative decoding | Uses a smaller draft model to predict tokens, verified by the larger model | 1.5-2x throughput improvement for autoregressive generation |
On NVIDIA Blackwell architecture, Together AI ranks first in speed benchmarks for top open-source models [9]. The ATLAS system is particularly notable because it continuously adapts to real-world production traffic rather than relying solely on static optimizations, meaning performance improves over time as the system observes usage patterns.
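To make the speculative decoding entry in the table above concrete, here is a minimal toy sketch of the idea. It is not Together's implementation: production systems use probabilistic accept/reject sampling to preserve the target model's output distribution, whereas this sketch uses greedy prefix matching for clarity, and the two "models" are stand-in functions.

```python
# Toy illustration of speculative decoding: a cheap draft model proposes a
# block of tokens; the expensive target model verifies them and keeps the
# longest agreeing prefix, so most tokens cost only a draft-model step.

def draft_next(context: list[str]) -> str:
    """Stand-in for a small, fast draft model (emits a fixed pattern)."""
    pattern = ["the", "quick", "brown", "fox", "jumps"]
    return pattern[len(context) % len(pattern)]

def target_next(context: list[str]) -> str:
    """Stand-in for the large target model (mostly agrees with the draft)."""
    tok = draft_next(context)
    return "lazy" if len(context) == 3 else tok  # disagree once, to show rejection

def speculative_decode(context: list[str], steps: int, k: int = 4) -> list[str]:
    out = list(context)
    for _ in range(steps):
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # 2. Target model verifies the proposals (one batched pass in real
        #    systems; simulated token by token here).
        accepted = 0
        for i, tok in enumerate(proposal):
            if target_next(out + proposal[:i]) == tok:
                accepted += 1
            else:
                break
        out += proposal[:accepted]
        # 3. On the first mismatch, take the target model's own token, so the
        #    output always matches what the target alone would have produced.
        if accepted < k:
            out.append(target_next(out))
    return out

print(speculative_decode([], steps=3))
```

When the draft and target agree often, each target-model pass yields several tokens instead of one, which is where the 1.5-2x throughput gains cited above come from.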
Together AI maintains one of the broadest catalogs of open-source models available through a managed inference service.
| Model Family | Provider | Notable Variants | Use Cases |
|---|---|---|---|
| Llama 3.1 / 3.3 / Llama 4 | Meta | 8B, 70B, 405B, Llama 4 Maverick | General text, code, multilingual |
| Mistral / Mixtral | Mistral AI | Mistral-7B, Mixtral-8x22B, Mistral Small 3, Mistral Large | Text generation, code, RAG |
| Qwen 2.5 / Qwen3 | Alibaba | Qwen 2.5, Qwen3, Qwen3-Coder-Next, Qwen3.5-397B | General text, code, multilingual |
| DeepSeek | DeepSeek | DeepSeek R1, DeepSeek-V3, DeepSeek-V3.1 | Reasoning, code, research |
| Gemma | Google | Gemma 2 9B, Gemma 2 27B, Gemma 3n E4B | Lightweight text generation |
| DBRX | Databricks | DBRX Instruct | Enterprise text generation |
| Stable Diffusion | Stability AI | SDXL, Stable Diffusion 3 | Image generation |
| Whisper | OpenAI | Whisper Large v3 | Speech-to-text |
The platform also supports specialized models for code generation, vision-language tasks, and embeddings. New models are typically added within days of their public release.
Together AI's fine-tuning service allows users to customize open-source models on their own data. The platform supports supervised fine-tuning, reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO). In 2025, Together AI expanded fine-tuning with native support for tool call training, reasoning model fine-tuning, and vision-language model adaptation, along with support for models with over 100 billion parameters [6].
The fine-tuning pipeline includes cost and ETA estimates before job submission, and Together AI reports up to 6 times higher throughput compared to earlier versions of their training infrastructure [6].
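A job submission might look like the following sketch, which assumes the together Python SDK's fine-tuning interface; the file ID, base model, and hyperparameters are illustrative, so consult the current SDK documentation before relying on exact parameter names.

```python
# Sketch of submitting a LoRA fine-tuning job via the together Python SDK
# (interface assumed; values are illustrative placeholders).
import os

from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

job = client.fine_tuning.create(
    training_file="file-abc123",                    # hypothetical ID of an uploaded JSONL dataset
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative base model in the <=16B tier
    n_epochs=3,
    learning_rate=1e-5,
    lora=True,                 # LoRA rather than a full fine-tune (see pricing below)
    suffix="my-support-bot",   # appended to the resulting model name
)
print(job.id, job.status)      # poll the job until it completes
```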
Together AI prices fine-tuning per million tokens, with costs varying by method and model size [7]:
| Method | Model Size | LoRA Price (per 1M tokens) | Full Fine-Tune Price (per 1M tokens) |
|---|---|---|---|
| Supervised Fine-Tuning (SFT) | Up to 16B parameters | $0.48 | $0.54 |
| Direct Preference Optimization (DPO) | Up to 16B parameters | $1.20 | $1.35 |
| SFT | 16B+ parameters | Custom pricing | Custom pricing |
| RLHF | All sizes | Custom pricing | Custom pricing |
LoRA (Low-Rank Adaptation) fine-tuning is more cost-effective than full fine-tuning because it trains small low-rank adapter matrices added to the base model rather than updating every weight. For most use cases, LoRA provides comparable quality at lower cost, while full fine-tuning offers maximum customization for highly specialized applications.
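As a worked example of the rates above, assuming billing on total tokens processed (dataset tokens times epochs) and a hypothetical 10M-token dataset:

```python
# Worked cost example at the listed SFT rates for models up to 16B.
# Dataset size and epoch count are illustrative.
dataset_tokens = 10_000_000              # 10M-token training set (hypothetical)
epochs = 3
billed_tokens = dataset_tokens * epochs  # 30M tokens processed

lora_rate = 0.48 / 1_000_000  # $ per token, LoRA SFT
full_rate = 0.54 / 1_000_000  # $ per token, full-parameter SFT

print(f"LoRA SFT: ${billed_tokens * lora_rate:.2f}")  # $14.40
print(f"Full SFT: ${billed_tokens * full_rate:.2f}")  # $16.20
```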
For organizations that need to train models from scratch or do extensive continued pre-training, Together AI offers custom training infrastructure built on its GPU clusters. This service is targeted at AI-native companies and research labs that need direct access to large-scale compute.
Training performance has improved significantly with the introduction of Blackwell GPUs. Training a 70B-parameter Llama-architecture model in BF16 precision with an optimized TorchTitan + Together Kernel Collection stack reaches 15,264 tokens per second per GPU on NVIDIA HGX B200, up from 8,080 tokens per second on NVIDIA HGX H100, a roughly 89 percent improvement in per-GPU training throughput [9].
Together AI operates its own GPU cluster infrastructure, offering on-demand access to NVIDIA H100, H200, and (beginning in 2025) Blackwell B200 GPUs. The company has secured 200 MW of power capacity across multiple data centers in North America, with a facility in Maryland that went live in July 2025 and additional capacity in Memphis [2].
In September 2025, Together AI launched self-service GPU infrastructure, allowing customers to provision GPU clusters ranging from a single node with eight GPUs to multi-node systems with hundreds of GPUs [10]. The self-service model supports the latest NVIDIA Hopper and Blackwell hardware and is optimized for distributed training and elastic inference workloads.
GPU clusters are available across multiple commitment levels [7]:
| GPU Type | On-Demand (per GPU-hour) | Dedicated Inference (per GPU-hour) | Notes |
|---|---|---|---|
| NVIDIA HGX H100 (80GB) | $3.49 | $3.99 | Most widely available |
| NVIDIA HGX H200 (141GB) | $4.19 | $5.49 | Higher memory for larger models |
| NVIDIA HGX B200 (180GB) | $7.49 | $9.95 | Latest Blackwell architecture |
Reserved capacity with multi-month commitments offers significantly reduced rates. Volume pricing is available for large-scale deployments through direct negotiation with Together AI's enterprise sales team.
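For rough budgeting, a back-of-envelope sketch under the assumption that the rates above are billed per GPU per hour:

```python
# Illustrative monthly cost for one 8-GPU H100 node at the on-demand rate,
# assuming per-GPU-hour billing and ~730 hours in a month.
gpus = 8
rate_per_gpu_hour = 3.49
hours_per_month = 730

monthly = gpus * rate_per_gpu_hour * hours_per_month
print(f"8x H100 on-demand: ${monthly:,.0f}/month")  # ~$20,382
```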
Together AI's performance advantage is closely tied to its integration of FlashAttention, a family of memory-efficient attention algorithms created by Chief Scientist Tri Dao. FlashAttention reduces the memory overhead and computational cost of the attention mechanism in transformers, which is typically the primary bottleneck in both training and inference.
The platform's inference engine incorporates FlashAttention-3 kernels, which are optimized for the latest NVIDIA GPU architectures. Combined with advanced quantization (reducing model precision to lower memory usage and increase throughput) and the proprietary Together Kernel Collection, these optimizations enable the platform to serve large models at significantly lower latency and cost than standard serving frameworks [2].
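For context, the attention mechanism computes

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where a standard implementation materializes the full $N \times N$ score matrix for a sequence of length $N$, costing $O(N^2)$ memory. FlashAttention instead processes $Q$, $K$, and $V$ in blocks sized to fit in on-chip SRAM and maintains a running (online) softmax, so the full score matrix is never written to GPU memory and the extra memory cost drops to $O(N)$.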
Together AI has invested heavily in developer experience, making its platform accessible to developers already familiar with the OpenAI ecosystem.
Together AI's API endpoints for chat, vision, images, embeddings, and speech are fully compatible with OpenAI's API format [11]. This means developers using the OpenAI Python library, LangChain, or other frameworks with OpenAI integrations can point their existing applications at Together AI's servers by changing only the base URL and API key. All parameters available in the OpenAI API work with Together AI, including streaming, function calling, and JSON mode.
This compatibility significantly reduces migration friction and allows developers to experiment with open-source models without rewriting application code.
| Feature | Description |
|---|---|
| OpenAI-compatible API | Drop-in replacement for OpenAI endpoints |
| Code sandbox | Isolated execution environment for code generation tasks ($0.0446/vCPU hour) |
| Evaluations dashboard | Compare model performance across benchmarks and custom datasets |
| Transcription API | Whisper-based speech-to-text ($0.0015/audio minute) |
| Streaming | Real-time token streaming for interactive applications |
| Function calling | Native tool use support compatible with OpenAI function calling format |
| JSON mode | Guaranteed structured JSON output for programmatic consumption |
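A short sketch of two of these features, streaming and JSON mode, again through the OpenAI SDK pointed at Together's endpoint; the model ID is illustrative, and JSON-mode support varies by model.

```python
# Streaming and JSON-mode sketch via the OpenAI-compatible API.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)

# Streaming: tokens arrive incrementally instead of as one final payload.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative
    messages=[{"role": "user", "content": "List three uses of embeddings."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

# JSON mode: constrain output to valid JSON for programmatic consumption.
structured = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative
    messages=[{"role": "user", "content": "Return a JSON object with fields 'name' and 'year' for the FlashAttention paper."}],
    response_format={"type": "json_object"},
)
print(structured.choices[0].message.content)
```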
Together AI's pricing is structured across its core product lines, with serverless inference priced per token and other services priced per GPU-hour.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| Llama 4 Maverick | $0.27 | $0.85 | Latest Meta model |
| DeepSeek-V3.1 | $0.60 | $1.70 | Reasoning-focused |
| Qwen3.5-397B | $0.20 | $0.60 | Large multilingual model |
| Mistral Small 3 | $0.10 | $0.30 | Efficient mid-range |
| Gemma 3n E4B | $0.02 | $0.04 | Ultra-lightweight |
| Llama 3.3 8B | $0.18 (combined) | - | Budget option |
| Qwen3-Coder-Next | $0.50 | $1.20 | Code-specialized |
| Service | Pricing |
|---|---|
| Fine-Tuning (LoRA SFT, up to 16B) | $0.48 per 1M tokens |
| Fine-Tuning (Full SFT, up to 16B) | $0.54 per 1M tokens |
| Dedicated Inference (H100) | $3.99/hour |
| Dedicated Inference (B200) | $9.95/hour |
| GPU Clusters (H100, on-demand) | $3.49/hour |
| GPU Clusters (B200, on-demand) | $7.49/hour |
| Storage | $0.16/GiB/month |
The company positions its pricing as competitive with major cloud providers. For popular open-source models, Together AI's per-token inference costs are often 30 to 60 percent lower than the same models served on Amazon Bedrock or Google Vertex AI, reflecting the efficiency of its custom inference engine [7].
Together AI offers a free tier that lets developers experiment without commitment, and enterprise customers can negotiate volume pricing.
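For a sense of scale, here is an illustrative monthly bill for a hypothetical workload at the Llama 4 Maverick rates listed above ($0.27 input / $0.85 output per 1M tokens); the request volume and token counts are assumptions.

```python
# Illustrative monthly serverless inference cost (hypothetical workload).
requests_per_month = 1_000_000
input_tokens, output_tokens = 500, 200  # per request (hypothetical)

in_cost = requests_per_month * input_tokens / 1e6 * 0.27
out_cost = requests_per_month * output_tokens / 1e6 * 0.85
print(f"Input:  ${in_cost:,.0f}")                    # $135
print(f"Output: ${out_cost:,.0f}")                   # $170
print(f"Total:  ${in_cost + out_cost:,.0f}/month")   # $305
```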
The following table compares pricing for representative models across major inference providers [7]:
| Model | Together AI | Amazon Bedrock | Google Vertex AI | Notes |
|---|---|---|---|---|
| Llama 3.3 70B (input) | ~$0.54/1M | ~$1.95/1M | ~$1.80/1M | Together AI ~70% cheaper |
| DeepSeek R1 (input) | ~$0.60/1M | ~$0.62/1M | N/A | Comparable pricing |
| Mistral Small (input) | $0.10/1M | ~$0.10/1M | ~$0.10/1M | Price parity on small models |
The cost advantage is most pronounced for larger models, where Together AI's custom inference stack can serve more requests per GPU than standard serving frameworks. For small models, pricing differences narrow as compute efficiency matters less relative to overhead costs.
| Feature | Together AI | Amazon Bedrock | Replicate | HuggingFace Inference |
|---|---|---|---|---|
| Focus | Open-source model inference + training | Multi-provider managed AI | Community model hosting | Model hub + inference |
| Model Count | 200+ | ~100 | 50,000+ | 500,000+ (hub) |
| Custom Training | Yes (GPU clusters) | Limited (fine-tuning only) | No | No (AutoTrain for fine-tuning) |
| Inference Speed | 2-3x faster (claimed) | Standard | Standard | Variable |
| Pricing Model | Per-token / per-GPU-hour | Per-token | Per-second compute | Per-token / per-GPU-hour |
| Target Users | AI developers, enterprises | Enterprise cloud users | Indie developers, startups | ML researchers, developers |
| GPU Access | H100, H200, B200 clusters | N/A (managed only) | N/A | A100, A10G via Inference Endpoints |
| Fine-Tuning Methods | SFT, DPO, RLHF, tool call, vision | SFT, continued pre-training | N/A | SFT (AutoTrain) |
Together AI's main differentiator is the combination of speed, cost efficiency, and deep support for the open-source ecosystem. While Bedrock offers a wider range of proprietary models and Replicate has a larger community catalog, Together AI focuses on delivering the fastest and cheapest inference for production-grade open-source models [7].
Together AI has built a customer base that spans AI-native startups and large enterprises. Notable customers include Salesforce, Zoom, SK Telecom, Hedra, Cognition, Zomato, Krea, Cartesia, and The Washington Post [2]. The company also counts Salesforce Ventures among its investors, reflecting a strategic partnership with one of the largest enterprise software companies.
The company's customer base reflects two distinct segments. AI-native startups like Cognition, Hedra, and Krea use Together AI as their primary inference infrastructure, drawn by the performance advantages and cost savings that allow them to scale AI-intensive products without the overhead of managing their own GPU clusters. Larger enterprises like Salesforce, Zoom, and SK Telecom typically use Together AI alongside their existing cloud infrastructure, often for specific workloads where open-source model performance and cost are critical factors.
Together AI's approach to infrastructure represents a deliberate bet on vertical integration. Rather than renting capacity from major cloud providers, the company builds and operates its own GPU clusters, giving it direct control over hardware configuration, networking, and cooling. This approach mirrors the strategies of other AI-focused infrastructure companies but is unusual among inference-focused platforms, which typically operate on top of existing cloud infrastructure [2].
The 200 MW of secured power capacity across multiple data centers is significant. For context, a single NVIDIA HGX B200 system consumes approximately 14 kilowatts, meaning 200 MW could theoretically power roughly 14,000 B200 systems at full load (before accounting for cooling and overhead). This level of capacity positions Together AI to serve large-scale training and inference workloads well into the future as demand for open-source model deployment continues to grow.
The geographic distribution of data centers across North America, including facilities in Maryland and Memphis, provides both redundancy and the ability to serve customers with data residency requirements. The Maryland facility's proximity to the Washington, D.C. area is notable given the growing demand for AI infrastructure from government and defense customers.
As of early 2026, Together AI continues to scale rapidly. At its first AI Native conference in March 2026, the company announced several product and business milestones [8]. The platform now supports over 200 models across all modalities, including chat, image, audio, vision, code, and embeddings.
The competitive landscape for AI inference has intensified, with hyperscalers like AWS, Microsoft Azure, and Google Cloud all expanding their open-source model offerings. Together AI's response has been to double down on performance, investing in custom kernels, Blackwell GPU deployments, and vertically integrated infrastructure that gives it a cost advantage over larger but less specialized competitors.
The company's trajectory from a research-oriented startup to a $3.3 billion infrastructure provider illustrates the growing importance of the inference and training infrastructure layer in the AI ecosystem. With revenue growing rapidly and a strong developer community, Together AI has established itself as a significant force in the market for open-source AI deployment.