Together AI

AI Tools & Products Artificial Intelligence

31 min read

Updated Jun 21, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 21, 2026

Fact-checked

In review queue

Sources

18 citations

Revision

v6 · 6,099 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Together AI is an AI cloud platform for running, fine-tuning, and training open-source foundation models at high performance and low cost. Founded in June 2022 by Vipul Ved Prakash, Ce Zhang, Chris Re, and Percy Liang, it serves over 200 open-source models to more than 450,000 developers through an OpenAI-compatible API, and operates its own vertically integrated GPU cloud with 200 MW of secured power capacity ^[2]^[17]. By early 2026 the company reached roughly $1 billion in annualized revenue and was reported to have raised about $1 billion at a $7.5 billion valuation, making it one of the largest independent infrastructure providers built around open-weight AI models ^[12].

The company sits at the center of a fast-growing market for managed access to open-weight models like Llama, DeepSeek, Qwen, and Mistral, competing with specialized inference providers such as Fireworks AI and Replicate, and with hyperscaler offerings like Amazon Bedrock and Google Vertex AI. Its strategy combines research-grade optimization, including kernels developed by chief scientist Tri Dao, with a vertically integrated GPU cloud designed to capture both inference traffic and training workloads from AI-native startups and large enterprises. "Our AI Acceleration Cloud uniquely provides organizations with the performance, security, and functionality required to train frontier models and build production-scale AI applications with incredible cost efficiency," CEO Vipul Ved Prakash said when announcing the company's Series B ^[17].

What is Together AI?

Together AI is a full-stack "AI Acceleration Cloud" that lets developers and enterprises access open-source models through a single API and reserve dedicated GPU capacity for training and inference. Where most rivals do either serverless model serving or raw GPU rental, Together AI does both on infrastructure it increasingly owns and operates, and pairs it with research-derived performance optimizations. The platform spans serverless inference, fine-tuning, custom training, GPU clusters, code-execution sandboxes, evaluations, and data preparation, all built around the open-weight model ecosystem rather than a single proprietary model.

History and founding

Together AI (originally incorporated as Together Computer Inc.) was founded in June 2022 to address what its founders saw as a growing compute moat limiting access to the hardware and infrastructure needed to train and deploy large language models. The founding team brought together industry experience and academic AI research ^[3].

Vipul Ved Prakash, who serves as CEO, previously co-founded and led Topsy (acquired by Apple) and Cloudmark (acquired by Proofpoint), both of which dealt with large-scale data processing. Ce Zhang, who serves as CTO, came from ETH Zurich, where his research focused on data management for machine learning and distributed training. Chris Re and Percy Liang are both professors at Stanford University, with Re's lab having produced foundational work on data-centric AI and Liang leading the Center for Research on Foundation Models (CRFM) ^[3]^[13].

The company initially explored a decentralized cloud model that would aggregate idle compute across data centers and coordinate high-latency networks for synchronized training. This research direction yielded several early publications on decentralized training over commodity internet links, but Together AI ultimately shifted toward a more conventional cloud infrastructure approach, building and operating its own high-bandwidth GPU clusters tuned for large model workloads.

A pivotal moment came in mid-2023, when Together AI brought on Tri Dao as chief scientist. Dao is the creator of FlashAttention and FlashAttention-2, memory-efficient attention algorithms that have become standard components in modern transformer training and inference stacks used by OpenAI, Anthropic, Meta, and Mistral. His research forms the basis for much of Together AI's performance advantage and underpins the Together Kernel Collection that ships with the platform's training and inference engines ^[3]^[13].

Founding team

Founder	Role	Background
Vipul Ved Prakash	Co-founder, CEO	Founder of Topsy (acquired by Apple), Cloudmark (acquired by Proofpoint)
Ce Zhang	Co-founder, CTO	Former associate professor at ETH Zurich, distributed ML systems research
Chris Re	Co-founder, advisor	Professor at Stanford, founder of Snorkel AI, MacArthur Fellow
Percy Liang	Co-founder, advisor	Stanford professor, director of CRFM, co-author of HELM benchmark
Tri Dao	Chief scientist (joined 2023)	Creator of FlashAttention and Mamba, assistant professor at Princeton

Together AI is headquartered in San Francisco. The full-time research team has grown to include other prominent open-source contributors and former hyperscaler infrastructure engineers, with a deliberate emphasis on hiring people who can move between systems research and product engineering.

RedPajama and open dataset work

Before Together AI became known primarily as an inference and training cloud, the company spent much of 2023 and 2024 publishing open datasets and pretraining recipes under the RedPajama project. RedPajama is a multi-institution effort involving Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, and Hazy Research that set out to make state-of-the-art language model pretraining fully reproducible without reliance on closed corpora ^[11].

The first release, RedPajama-V1, reconstructed the training mixture described in the original LLaMA paper and contained roughly 1.2 trillion tokens spanning Common Crawl, GitHub, books, ArXiv, Wikipedia, and Stack Exchange. The follow-on dataset, RedPajama-V2, is a web-only corpus containing more than 100 billion documents drawn from 84 Common Crawl snapshots, processed with the CCNet pipeline, and shipped together with 30 billion documents of quality signals and 20 billion deduplicated documents. Across both releases the project distributes more than 100 trillion tokens of openly licensed text ^[11].

RedPajama datasets have been used directly or indirectly in the pretraining of several open models that later went into production, including Snowflake Arctic, Salesforce XGen, and AI2 OLMo. The associated tooling lives on GitHub under the togethercomputer organization, and the data itself is mirrored on Hugging Face under a permissive license. The project doubled as a credibility builder for Together AI's training services, since enterprise customers could see the company contributing to and curating the same kinds of corpora used for foundation model training.

In 2024, Together AI published the RedPajama-V2 paper at NeurIPS, formalizing the data preparation pipeline and analyzing the impact of different quality signals on downstream model quality. The work has since been cited in research from major labs and has informed how Together AI structures its training data services for paying customers.

How much funding has Together AI raised?

Together AI has raised significant venture capital across multiple rounds, reflecting investor confidence in the infrastructure layer of the AI ecosystem. Counting the reported 2026 round, the company has raised well over $1.5 billion in total, with its valuation rising from roughly $500 million in late 2023 to a reported $7.5 billion in 2026 ^[4]^[2]^[12].

Round	Date	Amount	Lead investors	Valuation
Seed	2022	Undisclosed	Lux Capital	N/A
Series A	November 2023	$102.5M	Kleiner Perkins	~$500M
Series A extension	March 2024	$106M	Salesforce Ventures	$1.25B
Series B	February 2025	$305M	General Catalyst, Prosperity7	$3.3B
Series C (reported)	Q2 2026	~$1B (target)	Prosperity7 (reported), Nvidia	$7.5B

The Series A round in November 2023 included participation from NVIDIA, NEA, Prosperity7 Ventures, Greycroft, and 137 Ventures, among others ^[4]. The March 2024 Series A extension led by Salesforce Ventures pushed Together AI's valuation past the unicorn threshold to $1.25 billion and brought Salesforce into the cap table as both an investor and a customer.

The $305 million Series B announced on February 20, 2025 was led by General Catalyst and co-led by Prosperity7, the venture arm of Saudi Aramco. The round valued Together AI at $3.3 billion and brought in additional capital from Salesforce Ventures, DAMAC Capital, Nvidia, Kleiner Perkins, March Capital, Emergence Capital, Lux Capital, SE Ventures, Greycroft, Coatue, Definition, Cadenza Ventures, Long Journey Ventures, Brave Capital, Scott Banister, SK Telecom, and Cisco founder John Chambers. Proceeds were earmarked for expanding GPU cluster capacity and deploying NVIDIA Blackwell GPUs across multiple data centers in North America ^[2]^[17]. General Catalyst managing director Marc Bhargava said at the time that "Vipul and team have built an incredible tech platform and business, emerging as a dominant player in AI infrastructure in less than two years" ^[17].

In March 2026, multiple outlets reported that Together AI was in talks to raise approximately $1 billion at a $7.5 billion pre-money valuation, more than doubling its Series B mark just thirteen months earlier. Prosperity7 was again reported as a potential lead, with significant follow-on participation expected from existing strategic investors including Nvidia. The talks were tied to the company's roughly $1 billion annualized revenue run rate and its expanding Blackwell deployment, and subsequent reporting indicated the round was finalized at the $7.5 billion valuation around April 2026, though Together AI had not published an official announcement of the close ^[12].

Revenue and growth

Together AI has scaled revenue at a pace reminiscent of hyperscaler AI services. Annualized revenue rose from approximately $130 million at the end of 2024 to $300 million in September 2025, and reached approximately $1 billion by February 2026, more than tripling over roughly the preceding year ^[5]^[12].

Period	Annualized revenue	Notes
End of 2024	~$130M	Per Sacra estimates
September 2025	~$300M	Per Sacra and LinkedIn reporting
Early 2026	~$1B	Reported by The Information ahead of $7.5B valuation talks

At the Series B, the company reported 6x year-over-year growth in annual recurring revenue and 20x growth in its customer base, underscoring how quickly demand for open-model infrastructure was compounding ^[17]. Revenue is generated through two primary lines. The first is per-token API usage on the serverless inference platform, which accounts for roughly 30 to 40 percent of revenue. The second and larger share comes from renting GPU capacity, whether as dedicated inference endpoints, reserved GPU clusters, or on-demand training infrastructure. As Blackwell capacity has come online, the GPU rental business has scaled rapidly, with several customers committing to multi-month reservations for thousands of GPUs at a time ^[12].

Key products and services

Together AI organizes its offerings around four pillars: serverless inference, fine-tuning, custom training, and GPU clusters, with additional managed services for code execution, evaluations, and data preparation that wrap the core compute layer.

Inference API

Together AI's inference API provides OpenAI-compatible access to over 200 open-source models. The platform is built on a proprietary inference engine that incorporates FlashAttention-3 kernels and advanced quantization techniques, delivering what the company claims is 2 to 3 times faster inference than hyperscaler solutions ^[2]^[17].

The API supports chat completions, text completions, embeddings, image generation, and audio processing. Its OpenAI API compatible format means that developers can often switch from OpenAI or other providers with minimal code changes, simply swapping the base URL and API key.

How fast is Together AI inference?

Together AI's focus on inference speed has resulted in measurable performance advantages over competing platforms. The company reports up to 2x faster inference for top open-source models like Qwen, DeepSeek, and Kimi through a combination of GPU-level optimization, advanced speculative decoding, and FP4 quantization ^[9].

Key performance technologies include:

Technology	Description	Impact
FlashAttention-3 kernels	Memory-efficient attention optimized for latest GPU architectures	Reduced memory overhead, higher throughput
FP4 quantization	4-bit floating-point model compression	Lower memory usage, faster inference with minimal quality loss
ATLAS (AdapTive-LeArning Speculative System)	Learns from production traffic to predict and pre-generate tokens	Acceleration beyond static optimization
Together Kernel Collection (TKC)	Proprietary CUDA kernels optimized for open-source model architectures	Model-specific performance gains
Speculative decoding	Small draft model predicts tokens, verified by the larger model	1.5-2x throughput improvement for autoregressive generation

On NVIDIA Blackwell architecture, Together AI ranks first in independent speed benchmarks for several top open-source models ^[9]. The ATLAS system is particularly notable because it continuously adapts to real-world production traffic rather than relying solely on static optimizations, meaning performance improves over time as the system observes usage patterns. Together AI also publishes that its inference stack achieves up to four times the tokens-per-second throughput of vanilla vLLM on equivalent hardware for popular Llama and DeepSeek configurations ^[14].

Supported models

Together AI maintains one of the broadest catalogs of open-source models available through a managed inference service. New models are typically added within days of public release, and the platform aims to support every major open-weight model family at parity with the original authors' reference implementations.

Model family	Provider	Notable variants	Use cases
Llama 3.3 / Llama 4	Meta	8B, 70B, 405B, Llama 4 Maverick	General text, code, multilingual
Mistral / Mixtral	Mistral AI	Mistral-7B, Mixtral-8x22B, Mistral Small 3, Mistral Large	Text, code, RAG
Qwen 2.5 / Qwen3	Alibaba	Qwen 2.5, Qwen3, Qwen3-Coder-Next, Qwen3.5-397B	General text, code, multilingual
DeepSeek	DeepSeek	DeepSeek R1, DeepSeek-V3, DeepSeek-V3.1	Reasoning, code, research
Gemma	Google	Gemma 2 9B, Gemma 2 27B, Gemma 3n E4B	Lightweight text generation
DBRX	Databricks	DBRX Instruct	Enterprise text generation
Stable Diffusion	Stability AI	SDXL, Stable Diffusion 3	Image generation
Whisper	OpenAI	Whisper Large v3	Speech-to-text
Refuel LLM-2	Refuel.ai (Together AI)	Refuel-LLM-2	Data labeling and structured extraction

The platform also supports specialized models for code generation, vision-language tasks, and embeddings. Following the May 2025 acquisition of Refuel.ai (covered below), Refuel LLM-2 is offered as a first-party data-task model with both serverless inference and LoRA fine-tuning support ^[10].

Fine-tuning

Together AI's fine-tuning service allows users to customize open-source models on their own data. The platform supports supervised fine-tuning, reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO). In 2025, Together AI expanded fine-tuning with native support for tool call training, reasoning model fine-tuning, and vision-language model adaptation, along with support for models above 100 billion parameters ^[6].

The fine-tuning pipeline includes cost and ETA estimates before job submission, and Together AI reports up to 6 times higher throughput compared to earlier versions of its training infrastructure ^[6]. Customers can run jobs either inside the managed fine-tuning service or by reserving raw GPU capacity and using the open-source Together training scripts directly.

Fine-tuning pricing and methods

Together AI prices fine-tuning per million tokens, with costs varying by method and model size ^[7]:

Method	Model size	LoRA price (per 1M tokens)	Full fine-tune price (per 1M tokens)
Supervised Fine-Tuning (SFT)	Up to 16B parameters	$0.48	$0.54
Direct Preference Optimization (DPO)	Up to 16B parameters	$1.20	$1.35
SFT	16B+ parameters	Custom pricing	Custom pricing
RLHF	All sizes	Custom pricing	Custom pricing

LoRA (Low-Rank Adaptation) fine-tuning is more cost-effective than full fine-tuning because it updates only a small subset of model parameters. For most use cases, LoRA provides comparable quality at lower cost, while full fine-tuning offers maximum customization for highly specialized applications.

Custom training

For organizations that need to train models from scratch or do extensive continued pretraining, Together AI offers custom training infrastructure built on its GPU clusters. This service targets AI-native companies and research labs that need direct access to large-scale compute and bespoke parallelism strategies.

Training performance has improved significantly with the introduction of Blackwell GPUs. Training a 70B-parameter Llama-architecture model in BF16 precision with an optimized TorchTitan plus Together Kernel Collection stack reaches 15,264 tokens per second per GPU on NVIDIA HGX B200, up from 8,080 tokens per second on NVIDIA HGX H100, representing a 90 percent improvement in training speed ^[9]. Together AI's training services are designed to compose with PyTorch FSDP, Megatron-LM, and TorchTitan, with the kernel collection providing drop-in acceleration for the most expensive operators.

Together GPU Clusters

Together AI operates its own GPU cluster infrastructure, offering on-demand access to NVIDIA H100, H200, and (beginning in 2025) Blackwell B200 and GB200 GPUs. The company has secured 200 MW of power capacity across multiple data centers in North America, with a facility in Maryland that went live in July 2025 and additional capacity in Memphis ^[2].

In September 2025, Together AI launched a self-service GPU infrastructure tier that allows customers to provision clusters ranging from a single eight-GPU node to multi-node systems with hundreds of processors ^[8]. The self-service model supports the latest NVIDIA Hopper and Blackwell hardware and is optimized for distributed training and elastic inference workloads.

Hypertec Cloud and the 36,000 GPU GB200 cluster

In November 2024, Together AI announced a partnership with Hypertec Cloud to co-build one of the world's largest optimized GPU clusters, featuring 36,000 NVIDIA GB200 NVL72 GPUs. Deployment began in Q1 2025 and is being layered on top of the thousands of H100 and H200 GPUs already operational across North America. Combined with previously announced commitments, the partnership gives Together AI secured data center capacity for over 100,000 GPUs throughout 2025 and into 2026 ^[14].

The GB200 NVL72 platform pairs Grace CPUs with Blackwell GPUs using fifth-generation NVLink, and Nvidia rates it as delivering up to 30x faster real-time inference for trillion-parameter models and up to 4x accelerated training versus the previous Hopper generation. The Hypertec deployment uses liquid cooling throughout and is integrated with Together AI's networking and orchestration stack, including the Together Kernel Collection ^[14].

GPU cluster pricing

GPU clusters are available across multiple commitment levels ^[7]:

GPU type	On-demand (per hour)	Dedicated inference (per hour)	Notes
NVIDIA HGX H100 (80GB)	$3.49	$3.99	Most widely available
NVIDIA HGX H200 (141GB)	$4.19	$5.49	Higher memory for larger models
NVIDIA HGX B200 (180GB)	$7.49	$9.95	Latest Blackwell architecture
NVIDIA GB200 NVL72	Custom	Custom	Liquid-cooled rack-scale via Hypertec

Reserved capacity with multi-month commitments offers significantly reduced rates. Volume pricing is available for large-scale deployments through direct negotiation with Together AI's enterprise sales team.

Together Code Sandbox and Code Interpreter

In 2025, Together AI introduced two products targeted at AI agent developers: Together Code Sandbox (TCS) and Together Code Interpreter (TCI) ^[15]. Both run on the same underlying micro-VM infrastructure but expose different developer surfaces.

Together Code Sandbox provides customizable virtual machines that serve as full development environments for AI applications, including persistent state, interactive shells, and real-time previews. Together Code Interpreter is session-based and exposes a simpler API for one-shot or short-lived code execution, designed to be called by LLMs as a tool during agentic workflows.

Capability	Together Code Sandbox	Together Code Interpreter
Use case	Full IDE-like development environments	Stateless tool-call code execution
Persistence	Git-versioned filesystem	Session-scoped
VM sizing	Hot-swappable, 2 to 64 vCPUs, 1 to 128 GB RAM	Same micro-VM substrate
Cold start	500 ms P95 from snapshot	Sub-second
Cloning	Under 1 second	Under 1 second
Typical caller	AI IDEs, SaaS platforms	Agent frameworks, RL training loops

A notable production deployment is Agentica, an open-source project from Berkeley AI Research and the Sky Computing Lab, which used Together Code Interpreter during the training of DeepCoder-14B-Preview. Agentica reported running 1,024 code executions in parallel and scaling to more than 100 concurrent coding sandboxes with thousands of evaluations per minute, illustrating how TCI integrates with reinforcement learning operations for coding agents ^[15]. Together AI has positioned the sandbox products as direct alternatives to E2B and Modal for teams that want to keep both the model and the execution environment on the same platform.

Pricing for sandbox and other developer services

In addition to inference, fine-tuning, and GPU cluster pricing, Together AI publishes per-resource rates for ancillary services that are commonly bundled into agentic applications.

Service	Pricing
Together Code Sandbox vCPU	$0.0446 per vCPU hour
Transcription (Whisper Large v3)	$0.0015 per audio minute
Storage	$0.16 per GiB per month
Evaluations dashboard	Included with platform

Refuel.ai acquisition

On May 15, 2025, Together AI acquired Refuel.ai, a San Francisco startup founded in 2021 by Stanford alumni Rishabh Bhargava and Nihit Desai that used LLMs to clean, structure, and label enterprise data ^[10]^[18]. The acquisition price was not disclosed. Refuel.ai's flagship products included Refuel LLM-2, a family of models purpose-built for data labeling and structured extraction tasks that the company said achieved 50 percent fewer errors than state-of-the-art alternatives, and Refuel Cloud, a workflow platform for building multi-step data pipelines ^[18]. At the time of acquisition, Refuel was processing tens of millions of records and billions of tokens per week for customers including major financial institutions ^[10].

"Joining Together AI accelerates our mission to solve the data bottleneck that every AI team faces today," said Rishabh Bhargava, CEO of Refuel.ai ^[18]. The strategic rationale is straightforward: high-quality fine-tuning and post-training depend on high-quality data, and many Together AI customers were stitching together their own labeling pipelines on top of Together's inference and fine-tuning services. By bringing Refuel in-house, Together AI was able to offer end-to-end data preparation and customization inside a single platform.

What the acquisition added

Capability	Pre-acquisition	Post-acquisition
First-party data labeling models	None	Refuel LLM-2 family with serverless inference and LoRA fine-tuning
Workflow orchestration	Notebook examples and custom scripts	Refuel Cloud workflows integrated with Together inference
Customer base	Developers and AI-native startups	Adds financial services and data-heavy enterprise accounts
Roadmap	Inference and training	Data preparation, labeling, evaluation, and agentic tooling

The Refuel team joined Together AI's data and applied research groups, and Refuel LLM-2 was integrated into the Together model catalog so that existing customers could call it from the same OpenAI-compatible API used for Llama, DeepSeek, or Mistral models ^[10]. The acquisition was viewed by analysts as filling a notable gap relative to competitors like Databricks and Scale AI, both of which had moved aggressively to combine data infrastructure with model training services.

FlashAttention integration

Together AI's performance advantage is closely tied to its integration of FlashAttention, a family of memory-efficient attention algorithms created by chief scientist Tri Dao. FlashAttention reduces the memory overhead and computational cost of the attention mechanism in transformers, which is typically the primary bottleneck in both training and inference. FlashAttention is now used in production by OpenAI, Anthropic, Meta, and Mistral, but Together AI is the only commercial cloud whose product roadmap is shaped directly by its author ^[3]^[13].

The platform's inference engine incorporates FlashAttention-3 kernels, which are optimized for the latest NVIDIA GPU architectures including Hopper and Blackwell. Combined with advanced quantization (reducing model precision to lower memory usage and increase throughput) and the proprietary Together Kernel Collection, these optimizations enable the platform to serve large models at significantly lower latency and cost than standard serving frameworks ^[2]. Dao's work on Mamba and follow-on state space models has also informed Together AI's roadmap for serving non-transformer architectures efficiently.

Developer experience

Together AI has invested heavily in developer experience, making its platform accessible to developers already familiar with the OpenAI ecosystem.

Is Together AI compatible with the OpenAI API?

Yes. Together AI's API endpoints for chat, vision, images, embeddings, and speech are fully compatible with OpenAI's API format ^[11]. Developers using the OpenAI Python library, LangChain, or other frameworks with OpenAI integrations can point their existing applications at Together AI's servers by changing only the base URL and API key. All major parameters available in the OpenAI API work with Together AI, including streaming, function calling, and JSON mode.

This compatibility significantly reduces migration friction and allows developers to experiment with open-source models without rewriting application code. It also makes Together AI a common second provider in multi-provider routing setups, where applications fail over from a primary inference vendor to Together AI under load or for specific model families.

Developer tools and features

Feature	Description
OpenAI-compatible API	Drop-in replacement for OpenAI endpoints
Code Sandbox	Isolated VM execution environment for code generation tasks
Code Interpreter	Session-based tool-call execution for agents
Evaluations dashboard	Compare model performance across benchmarks and custom datasets
Transcription API	Whisper-based speech-to-text
Streaming	Real-time token streaming for interactive applications
Function calling	Native tool use compatible with OpenAI function calling format
JSON mode	Guaranteed structured JSON output for programmatic consumption
Refuel data workflows	Multi-step labeling and extraction pipelines

Pricing

Together AI's pricing is structured across its core product lines, with serverless inference priced per token and other services priced per GPU-hour or per resource-unit.

Serverless inference pricing

Model	Input (per 1M tokens)	Output (per 1M tokens)	Notes
Llama 4 Maverick	$0.27	$0.85	Latest Meta model
DeepSeek-V3.1	$0.60	$1.70	Reasoning-focused
Qwen3.5-397B	$0.20	$0.60	Large multilingual model
Mistral Small 3	$0.10	$0.30	Efficient mid-range
Gemma 3n E4B	$0.02	$0.04	Ultra-lightweight
Llama 3.3 8B	$0.18 (combined)	-	Budget option
Qwen3-Coder-Next	$0.50	$1.20	Code-specialized
Refuel LLM-2	Custom	Custom	Data labeling and extraction

Other service pricing

Service	Pricing
Fine-tuning (LoRA SFT, up to 16B)	$0.48 per 1M tokens
Fine-tuning (Full SFT, up to 16B)	$0.54 per 1M tokens
Dedicated inference (H100)	$3.99/hour
Dedicated inference (B200)	$9.95/hour
GPU clusters (H100, on-demand)	$3.49/hour
GPU clusters (B200, on-demand)	$7.49/hour
Together Code Sandbox	$0.0446 per vCPU hour
Storage	$0.16 per GiB per month

The company positions its pricing as competitive with major cloud providers. For popular open-source models, Together AI's per-token inference costs are often 30 to 60 percent lower than the same models served on Amazon Bedrock or Google Vertex AI, reflecting the efficiency of its custom inference engine ^[7]. Together AI offers a free tier that lets developers experiment without commitment, and enterprise customers can negotiate volume pricing.

Pricing comparison with competitors

The following table compares pricing for representative models across major inference providers ^[7]:

Model	Together AI	Amazon Bedrock	Google Vertex AI	Notes
Llama 3.3 70B (input)	~$0.54/1M	~$1.95/1M	~$1.80/1M	Together AI 60-70% cheaper
DeepSeek R1 (input)	~$0.60/1M	~$0.62/1M	N/A	Comparable pricing
Mistral Small (input)	$0.10/1M	~$0.10/1M	~$0.10/1M	Price parity on small models

The cost advantage is most pronounced for larger models, where Together AI's custom inference stack can serve more requests per GPU than standard serving frameworks. For small models, pricing differences narrow as compute efficiency matters less relative to overhead costs.

How does Together AI compare with other inference providers?

Together AI competes against a mix of pure-play inference startups and hyperscaler offerings. The closest direct comparable is Fireworks AI, which also targets fast, low-cost serving of open-source models with its own proprietary inference engine (FireAttention). Replicate is broader in modality coverage but less optimized for production LLM throughput, while Hugging Face Inference Endpoints sit closer to the model hub itself. Hyperscaler offerings like Amazon Bedrock bundle access to a curated set of proprietary and open-source models inside the AWS account boundary ^[16].

Feature	Together AI	Fireworks AI	Replicate	Hugging Face Inference	Amazon Bedrock
Focus	Open-source inference, training, GPU cloud	Open-source inference	Community model hosting	Model hub plus inference	Multi-provider managed AI
Catalog size	200+ open-source models	100+ open-source models	50,000+ community models	500,000+ hub, smaller served set	~100 curated models
Custom training	Yes (GPU clusters)	Fine-tuning only	No	AutoTrain for fine-tuning	Limited (fine-tuning only)
Inference engine	Together Kernel Collection, FlashAttention-3	FireAttention	Standard serving	Standard serving	AWS-managed
Pricing model	Per-token plus per-GPU-hour	Per-token	Per-second compute	Per-token plus per-GPU-hour	Per-token
Code sandbox	Yes (TCS, TCI)	No	No	No	No
Data labeling	Yes (Refuel.ai)	No	No	No	Limited
Target users	AI developers, enterprises	AI developers, startups	Indie developers, image/video apps	ML researchers, developers	Enterprise cloud users

Together AI and Fireworks AI sit at the head of the speed-and-cost optimized inference category, with each company publishing benchmarks claiming a lead on different model families. Independent comparisons generally find that the two providers trade leadership depending on the model, batch size, and quantization regime, with Together AI more strongly differentiated on the training and GPU cluster side and on tooling like code sandboxes and Refuel data workflows. Replicate is most useful for image, video, and audio workloads where its long tail of community models and per-second billing fit a more experimental use case ^[16].

Enterprise customers and partnerships

Together AI has built a customer base that spans AI-native startups and large enterprises. Notable customers include Salesforce, Zoom, SK Telecom, Hedra, Cognition, Zomato, Krea, Cartesia, Pika Labs, and The Washington Post ^[2]. The company also counts Salesforce Ventures among its investors, reflecting a strategic partnership with one of the largest enterprise software companies.

The customer base reflects two distinct segments. AI-native startups like Cognition (maker of the Devin agent), Hedra (avatar and video generation), Krea (image and video generation), Pika Labs (video generation), and Cartesia (real-time voice with the Sonic model) use Together AI as their primary inference infrastructure, drawn by the performance advantages and cost savings that allow them to scale AI-intensive products without managing their own GPU clusters. Larger enterprises like Salesforce, Zoom, and SK Telecom typically use Together AI alongside their existing cloud infrastructure, often for specific workloads where open-source model performance and cost are critical factors.

Publicly reported customer outcomes include 24 percent faster training operations for Pika Labs and ultra-low latency voice AI through Cartesia's Sonic model integration. Krea has highlighted Together AI's ability to scale through traffic surges while maintaining performance during product launches.

Strategic partnerships

Partner	Nature of relationship
Nvidia	Strategic investor, GPU supplier, joint go-to-market on Blackwell
Hypertec Cloud	Co-build partner on 36,000-GPU GB200 NVL72 cluster
Salesforce	Investor and enterprise customer
SK Telecom	Investor and Asia-Pacific customer
General Catalyst	Series B lead investor
Prosperity7 (Saudi Aramco)	Series B co-lead, reported lead on next round

Infrastructure strategy

Together AI's approach to infrastructure represents a deliberate bet on vertical integration. Rather than renting capacity from major cloud providers, the company builds and operates its own GPU clusters, giving it direct control over hardware configuration, networking, and cooling. This approach mirrors the strategies of other AI-focused infrastructure companies like CoreWeave but is unusual among inference-focused platforms, which typically operate on top of existing cloud infrastructure ^[2].

The 200 MW of secured power capacity across multiple data centers is significant. For context, a single NVIDIA HGX B200 system consumes approximately 14 kilowatts, meaning 200 MW could theoretically power roughly 14,000 B200 systems at full load (before accounting for cooling and overhead). Combined with the Hypertec Cloud partnership and additional reserved capacity, Together AI has access to data center footprints sufficient for over 100,000 GPUs through 2025 and 2026 ^[14].

The geographic distribution of data centers across North America, including facilities in Maryland and Memphis, provides redundancy and the ability to serve customers with data residency requirements. The Maryland facility's proximity to the Washington, D.C. area is notable given the growing demand for AI infrastructure from government and defense customers, while the Memphis site is positioned to take advantage of relatively low-cost power.

Together AI also runs its inference stack on hyperscaler hardware where customers require it, including private deployments inside AWS, Microsoft Azure, and Google Cloud accounts. This hybrid posture lets the company chase wholesale GPU economics on its own metal while still meeting enterprise data residency and procurement constraints.

Open-source contributions

In parallel with its commercial products, Together AI has remained an active contributor to the open-source AI ecosystem. The company maintains or has contributed to several widely used projects:

Project	Role	Description
RedPajama	Co-lead	100 trillion token open dataset for LLM training
FlashAttention	Author (Tri Dao)	Memory-efficient attention used across the field
Mamba	Co-author (Tri Dao)	Selective state space architecture for sequence modeling
TorchTitan recipes	Contributor	Reference training recipes for large Llama-style models
Together Kernel Collection	Maintainer	CUDA kernels for transformer and SSM workloads
Refuel LLM-2 (post-acquisition)	Maintainer	Open-weight model for data labeling and extraction

This posture matters commercially because most of Together AI's customers run open-weight models themselves, and a willingness to publish data, recipes, and kernels acts as both a recruiting tool and a credibility signal in the open-source community.

Current state (2025-2026)

As of May 2026, Together AI continues to scale rapidly. At its first AI Native conference in March 2026, the company announced several product and business milestones ^[8]. The platform now supports over 200 models across all modalities, including chat, image, audio, vision, code, and embeddings, and serves more than 450,000 developers ^[17]. Refuel.ai integration is fully shipped, the Hypertec GB200 cluster is in production, and the company's annualized revenue is roughly $1 billion ^[12].

The competitive landscape for AI inference has intensified, with hyperscalers like AWS, Microsoft Azure, and Google Cloud all expanding their open-source model offerings, and specialized rivals like Fireworks AI, Anyscale, and Modal pushing on the same buyer base. Together AI's response has been to double down on performance and verticalization, investing in custom kernels, Blackwell GPU deployments, Refuel data workflows, code sandboxes, and vertically integrated infrastructure that gives it a cost advantage over larger but less specialized competitors.

The company's trajectory from a research-oriented startup to a multi-billion-dollar infrastructure provider illustrates the growing importance of the inference and training infrastructure layer in the AI ecosystem. With revenue growing rapidly, an expanding customer base, and a strong developer community, Together AI has established itself as one of the most prominent independent infrastructure companies serving the open-weight model market.

References

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

5 revisions by 1 contributors · full history

Suggest edit

Together AI

What is Together AI?

History and founding

Founding team

RedPajama and open dataset work

How much funding has Together AI raised?

Revenue and growth

Key products and services

Inference API

How fast is Together AI inference?

Supported models

Fine-tuning

Fine-tuning pricing and methods

Custom training

Together GPU Clusters

Hypertec Cloud and the 36,000 GPU GB200 cluster

GPU cluster pricing

Together Code Sandbox and Code Interpreter

Pricing for sandbox and other developer services

Refuel.ai acquisition

What the acquisition added

FlashAttention integration

Developer experience

Is Together AI compatible with the OpenAI API?

Developer tools and features

Pricing

Serverless inference pricing

Other service pricing

Pricing comparison with competitors

How does Together AI compare with other inference providers?

Enterprise customers and partnerships

Strategic partnerships

Infrastructure strategy

Open-source contributions

Current state (2025-2026)

References

Improve this article

What links here (24 of 59)

What links here (24 of 59)

What is Together AI?

History and founding

Founding team

RedPajama and open dataset work

How much funding has Together AI raised?

Revenue and growth

Key products and services

Inference API

How fast is Together AI inference?

Supported models

Fine-tuning

Fine-tuning pricing and methods

Custom training

Together GPU Clusters

Hypertec Cloud and the 36,000 GPU GB200 cluster

GPU cluster pricing

Together Code Sandbox and Code Interpreter

Pricing for sandbox and other developer services

Refuel.ai acquisition

What the acquisition added

FlashAttention integration

Developer experience

Is Together AI compatible with the OpenAI API?

Developer tools and features

Pricing

Serverless inference pricing

Other service pricing

Pricing comparison with competitors

How does Together AI compare with other inference providers?

Enterprise customers and partnerships

Strategic partnerships

Infrastructure strategy

Open-source contributions

Current state (2025-2026)

References

Improve this article

Related Articles

Claude Sonnet 4.5

Microsoft 365 Copilot

Model Context Protocol

Apple Intelligence

AI in Healthcare

AI Drug Discovery

What links here (24 of 59)

Related Articles

Claude Sonnet 4.5

Microsoft 365 Copilot

Model Context Protocol

Apple Intelligence

AI in Healthcare

AI Drug Discovery

What links here (24 of 59)