Together AI
Last reviewed
May 17, 2026
Sources
16 citations
Review status
Source-backed
Revision
v4 ยท 5,743 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 17, 2026
Sources
16 citations
Review status
Source-backed
Revision
v4 ยท 5,743 words
Add missing citations, update stale details, or suggest a clearer explanation.
Together AI is an AI cloud platform specializing in high-performance inference, fine-tuning, and training infrastructure for open-source foundation models. Founded in June 2022 by Vipul Ved Prakash, Ce Zhang, Chris Re, and Percy Liang, the company has positioned itself as a leading alternative to hyperscaler AI services by focusing on speed, cost efficiency, and deep support for the open-source AI ecosystem. As of May 2026, Together AI supports over 200 open-source models, serves more than 450,000 developers, operates its own data center infrastructure with 200 MW of secured power capacity, and is reportedly in talks to raise approximately $1 billion at a $7.5 billion valuation on the back of roughly $1 billion in annualized revenue [1][2][12].
The company sits at the center of a fast-growing market for managed access to open-weight models like Llama, DeepSeek, Qwen, and Mistral, competing with specialized inference providers such as Fireworks AI and Replicate, and with hyperscaler offerings like Amazon Bedrock and Google Vertex AI. Its strategy combines research-grade optimization, including kernels developed by chief scientist Tri Dao, with a vertically integrated GPU cloud designed to capture both inference traffic and training workloads from AI-native startups and large enterprises.
Together AI (originally incorporated as Together Computer Inc.) was founded in June 2022 to address what its founders saw as a growing compute moat limiting access to the hardware and infrastructure needed to train and deploy large language models. The founding team brought together industry experience and academic AI research [3].
Vipul Ved Prakash, who serves as CEO, previously co-founded and led Topsy (acquired by Apple) and Cloudmark (acquired by Proofpoint), both of which dealt with large-scale data processing. Ce Zhang, who serves as CTO, came from ETH Zurich, where his research focused on data management for machine learning and distributed training. Chris Re and Percy Liang are both professors at Stanford University, with Re's lab having produced foundational work on data-centric AI and Liang leading the Center for Research on Foundation Models (CRFM) [3][13].
The company initially explored a decentralized cloud model that would aggregate idle compute across data centers and coordinate high-latency networks for synchronized training. This research direction yielded several early publications on decentralized training over commodity internet links, but Together AI ultimately shifted toward a more conventional cloud infrastructure approach, building and operating its own high-bandwidth GPU clusters tuned for large model workloads.
A pivotal moment came in mid-2023, when Together AI brought on Tri Dao as chief scientist. Dao is the creator of FlashAttention and FlashAttention-2, memory-efficient attention algorithms that have become standard components in modern transformer training and inference stacks used by OpenAI, Anthropic, Meta, and Mistral. His research forms the basis for much of Together AI's performance advantage and underpins the Together Kernel Collection that ships with the platform's training and inference engines [3][13].
| Founder | Role | Background |
|---|---|---|
| Vipul Ved Prakash | Co-founder, CEO | Founder of Topsy (acquired by Apple), Cloudmark (acquired by Proofpoint) |
| Ce Zhang | Co-founder, CTO | Former associate professor at ETH Zurich, distributed ML systems research |
| Chris Re | Co-founder, advisor | Professor at Stanford, founder of Snorkel AI, MacArthur Fellow |
| Percy Liang | Co-founder, advisor | Stanford professor, director of CRFM, co-author of HELM benchmark |
| Tri Dao | Chief scientist (joined 2023) | Creator of FlashAttention and Mamba, assistant professor at Princeton |
Together AI is headquartered in San Francisco. The full-time research team has grown to include other prominent open-source contributors and former hyperscaler infrastructure engineers, with a deliberate emphasis on hiring people who can move between systems research and product engineering.
Before Together AI became known primarily as an inference and training cloud, the company spent much of 2023 and 2024 publishing open datasets and pretraining recipes under the RedPajama project. RedPajama is a multi-institution effort involving Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, and Hazy Research that set out to make state-of-the-art language model pretraining fully reproducible without reliance on closed corpora [11].
The first release, RedPajama-V1, reconstructed the training mixture described in the original LLaMA paper and contained roughly 1.2 trillion tokens spanning Common Crawl, GitHub, books, ArXiv, Wikipedia, and Stack Exchange. The follow-on dataset, RedPajama-V2, is a web-only corpus containing more than 100 billion documents drawn from 84 Common Crawl snapshots, processed with the CCNet pipeline, and shipped together with 30 billion documents of quality signals and 20 billion deduplicated documents. Across both releases the project distributes more than 100 trillion tokens of openly licensed text [11].
RedPajama datasets have been used directly or indirectly in the pretraining of several open models that later went into production, including Snowflake Arctic, Salesforce XGen, and AI2 OLMo. The associated tooling lives on GitHub under the togethercomputer organization, and the data itself is mirrored on Hugging Face under a permissive license. The project doubled as a credibility builder for Together AI's training services, since enterprise customers could see the company contributing to and curating the same kinds of corpora used for foundation model training.
In 2024, Together AI published the RedPajama-V2 paper at NeurIPS, formalizing the data preparation pipeline and analyzing the impact of different quality signals on downstream model quality. The work has since been cited in research from major labs and has informed how Together AI structures its training data services for paying customers.
Together AI has raised significant venture capital across multiple rounds, reflecting investor confidence in the infrastructure layer of the AI ecosystem.
| Round | Date | Amount | Lead investors | Valuation |
|---|---|---|---|---|
| Seed | 2022 | Undisclosed | Lux Capital | N/A |
| Series A | November 2023 | $102.5M | Kleiner Perkins | ~$500M |
| Series A extension | March 2024 | $106M | Salesforce Ventures | $1.25B |
| Series B | February 2025 | $305M | General Catalyst, Prosperity7 | $3.3B |
| Series C (reported, in progress) | Q2 2026 | ~$1B (target) | Prosperity7 (reported), Nvidia | $7.5B (pre-money) |
The Series A round in November 2023 included participation from NVIDIA, NEA, Prosperity7 Ventures, Greycroft, and 137 Ventures, among others [4]. The March 2024 Series A extension led by Salesforce Ventures pushed Together AI's valuation past the unicorn threshold and brought Salesforce into the cap table as both an investor and a customer.
The $305 million Series B announced on February 20, 2025 was led by General Catalyst and co-led by Prosperity7, the venture arm of Saudi Aramco. The round valued Together AI at $3.3 billion and brought in additional capital from Salesforce Ventures, DAMAC Capital, Nvidia, Kleiner Perkins, March Capital, Emergence Capital, Lux Capital, SE Ventures, Greycroft, Coatue, Definition, Cadenza Ventures, Long Journey Ventures, Brave Capital, Scott Banister, SK Telecom, and Cisco founder John Chambers. Proceeds were earmarked for expanding GPU cluster capacity and deploying NVIDIA Blackwell GPUs across multiple data centers in North America [2].
In March 2026, multiple outlets reported that Together AI was in talks to raise approximately $1 billion at a $7.5 billion pre-money valuation, more than doubling its Series B mark just thirteen months earlier. Prosperity7 was again reported as a potential lead, with significant follow-on participation expected from existing strategic investors including Nvidia. The talks were tied to the company's roughly $1 billion annualized revenue run rate and its expanding Blackwell deployment, although a final close had not been publicly announced as of May 2026 [12].
Together AI has scaled revenue at a pace reminiscent of hyperscaler AI services. Annualized revenue rose from approximately $130 million at the end of 2024 to $300 million in September 2025, then more than tripled to approximately $1 billion by early 2026 [5][12].
| Period | Annualized revenue | Notes |
|---|---|---|
| End of 2024 | ~$130M | Per Sacra estimates |
| September 2025 | ~$300M | Per Sacra and LinkedIn reporting |
| Early 2026 | ~$1B | Reported by The Information ahead of $7.5B valuation talks |
Revenue is generated through two primary lines. The first is per-token API usage on the serverless inference platform, which accounts for roughly 30 to 40 percent of revenue. The second and larger share comes from renting GPU capacity, whether as dedicated inference endpoints, reserved GPU clusters, or on-demand training infrastructure. As Blackwell capacity has come online, the GPU rental business has scaled rapidly, with several customers committing to multi-month reservations for thousands of GPUs at a time [12].
Together AI organizes its offerings around four pillars: serverless inference, fine-tuning, custom training, and GPU clusters, with additional managed services for code execution, evaluations, and data preparation that wrap the core compute layer.
Together AI's inference API provides OpenAI-compatible access to over 200 open-source models. The platform is built on a proprietary inference engine that incorporates FlashAttention-3 kernels and advanced quantization techniques, delivering what the company claims is 2 to 3 times faster inference than hyperscaler solutions [2].
The API supports chat completions, text completions, embeddings, image generation, and audio processing. Its OpenAI API compatible format means that developers can often switch from OpenAI or other providers with minimal code changes, simply swapping the base URL and API key.
Together AI's focus on inference speed has resulted in measurable performance advantages over competing platforms. The company reports up to 2x faster inference for top open-source models like Qwen, DeepSeek, and Kimi through a combination of GPU-level optimization, advanced speculative decoding, and FP4 quantization [9].
Key performance technologies include:
| Technology | Description | Impact |
|---|---|---|
| FlashAttention-3 kernels | Memory-efficient attention optimized for latest GPU architectures | Reduced memory overhead, higher throughput |
| FP4 quantization | 4-bit floating-point model compression | Lower memory usage, faster inference with minimal quality loss |
| ATLAS (AdapTive-LeArning Speculative System) | Learns from production traffic to predict and pre-generate tokens | Acceleration beyond static optimization |
| Together Kernel Collection (TKC) | Proprietary CUDA kernels optimized for open-source model architectures | Model-specific performance gains |
| Speculative decoding | Small draft model predicts tokens, verified by the larger model | 1.5-2x throughput improvement for autoregressive generation |
On NVIDIA Blackwell architecture, Together AI ranks first in independent speed benchmarks for several top open-source models [9]. The ATLAS system is particularly notable because it continuously adapts to real-world production traffic rather than relying solely on static optimizations, meaning performance improves over time as the system observes usage patterns. Together AI also publishes that its inference stack achieves up to four times the tokens-per-second throughput of vanilla vLLM on equivalent hardware for popular Llama and DeepSeek configurations [14].
Together AI maintains one of the broadest catalogs of open-source models available through a managed inference service. New models are typically added within days of public release, and the platform aims to support every major open-weight model family at parity with the original authors' reference implementations.
| Model family | Provider | Notable variants | Use cases |
|---|---|---|---|
| Llama 3.3 / Llama 4 | Meta | 8B, 70B, 405B, Llama 4 Maverick | General text, code, multilingual |
| Mistral / Mixtral | Mistral AI | Mistral-7B, Mixtral-8x22B, Mistral Small 3, Mistral Large | Text, code, RAG |
| Qwen 2.5 / Qwen3 | Alibaba | Qwen 2.5, Qwen3, Qwen3-Coder-Next, Qwen3.5-397B | General text, code, multilingual |
| DeepSeek | DeepSeek | DeepSeek R1, DeepSeek-V3, DeepSeek-V3.1 | Reasoning, code, research |
| Gemma | Gemma 2 9B, Gemma 2 27B, Gemma 3n E4B | Lightweight text generation | |
| DBRX | Databricks | DBRX Instruct | Enterprise text generation |
| Stable Diffusion | Stability AI | SDXL, Stable Diffusion 3 | Image generation |
| Whisper | OpenAI | Whisper Large v3 | Speech-to-text |
| Refuel LLM-2 | Refuel.ai (Together AI) | Refuel-LLM-2 | Data labeling and structured extraction |
The platform also supports specialized models for code generation, vision-language tasks, and embeddings. Following the May 2025 acquisition of Refuel.ai (covered below), Refuel LLM-2 is offered as a first-party data-task model with both serverless inference and LoRA fine-tuning support [10].
Together AI's fine-tuning service allows users to customize open-source models on their own data. The platform supports supervised fine-tuning, reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO). In 2025, Together AI expanded fine-tuning with native support for tool call training, reasoning model fine-tuning, and vision-language model adaptation, along with support for models above 100 billion parameters [6].
The fine-tuning pipeline includes cost and ETA estimates before job submission, and Together AI reports up to 6 times higher throughput compared to earlier versions of its training infrastructure [6]. Customers can run jobs either inside the managed fine-tuning service or by reserving raw GPU capacity and using the open-source Together training scripts directly.
Together AI prices fine-tuning per million tokens, with costs varying by method and model size [7]:
| Method | Model size | LoRA price (per 1M tokens) | Full fine-tune price (per 1M tokens) |
|---|---|---|---|
| Supervised Fine-Tuning (SFT) | Up to 16B parameters | $0.48 | $0.54 |
| Direct Preference Optimization (DPO) | Up to 16B parameters | $1.20 | $1.35 |
| SFT | 16B+ parameters | Custom pricing | Custom pricing |
| RLHF | All sizes | Custom pricing | Custom pricing |
LoRA (Low-Rank Adaptation) fine-tuning is more cost-effective than full fine-tuning because it updates only a small subset of model parameters. For most use cases, LoRA provides comparable quality at lower cost, while full fine-tuning offers maximum customization for highly specialized applications.
For organizations that need to train models from scratch or do extensive continued pretraining, Together AI offers custom training infrastructure built on its GPU clusters. This service targets AI-native companies and research labs that need direct access to large-scale compute and bespoke parallelism strategies.
Training performance has improved significantly with the introduction of Blackwell GPUs. Training a 70B-parameter Llama-architecture model in BF16 precision with an optimized TorchTitan plus Together Kernel Collection stack reaches 15,264 tokens per second per GPU on NVIDIA HGX B200, up from 8,080 tokens per second on NVIDIA HGX H100, representing a 90 percent improvement in training speed [9]. Together AI's training services are designed to compose with PyTorch FSDP, Megatron-LM, and TorchTitan, with the kernel collection providing drop-in acceleration for the most expensive operators.
Together AI operates its own GPU cluster infrastructure, offering on-demand access to NVIDIA H100, H200, and (beginning in 2025) Blackwell B200 and GB200 GPUs. The company has secured 200 MW of power capacity across multiple data centers in North America, with a facility in Maryland that went live in July 2025 and additional capacity in Memphis [2].
In September 2025, Together AI launched a self-service GPU infrastructure tier that allows customers to provision clusters ranging from a single eight-GPU node to multi-node systems with hundreds of processors [8]. The self-service model supports the latest NVIDIA Hopper and Blackwell hardware and is optimized for distributed training and elastic inference workloads.
In November 2024, Together AI announced a partnership with Hypertec Cloud to co-build one of the world's largest optimized GPU clusters, featuring 36,000 NVIDIA GB200 NVL72 GPUs. Deployment began in Q1 2025 and is being layered on top of the thousands of H100 and H200 GPUs already operational across North America. Combined with previously announced commitments, the partnership gives Together AI secured data center capacity for over 100,000 GPUs throughout 2025 and into 2026 [14].
The GB200 NVL72 platform pairs Grace CPUs with Blackwell GPUs using fifth-generation NVLink, and Nvidia rates it as delivering up to 30x faster real-time inference for trillion-parameter models and up to 4x accelerated training versus the previous Hopper generation. The Hypertec deployment uses liquid cooling throughout and is integrated with Together AI's networking and orchestration stack, including the Together Kernel Collection [14].
GPU clusters are available across multiple commitment levels [7]:
| GPU type | On-demand (per hour) | Dedicated inference (per hour) | Notes |
|---|---|---|---|
| NVIDIA HGX H100 (80GB) | $3.49 | $3.99 | Most widely available |
| NVIDIA HGX H200 (141GB) | $4.19 | $5.49 | Higher memory for larger models |
| NVIDIA HGX B200 (180GB) | $7.49 | $9.95 | Latest Blackwell architecture |
| NVIDIA GB200 NVL72 | Custom | Custom | Liquid-cooled rack-scale via Hypertec |
Reserved capacity with multi-month commitments offers significantly reduced rates. Volume pricing is available for large-scale deployments through direct negotiation with Together AI's enterprise sales team.
In 2025, Together AI introduced two products targeted at AI agent developers: Together Code Sandbox (TCS) and Together Code Interpreter (TCI) [15]. Both run on the same underlying micro-VM infrastructure but expose different developer surfaces.
Together Code Sandbox provides customizable virtual machines that serve as full development environments for AI applications, including persistent state, interactive shells, and real-time previews. Together Code Interpreter is session-based and exposes a simpler API for one-shot or short-lived code execution, designed to be called by LLMs as a tool during agentic workflows.
| Capability | Together Code Sandbox | Together Code Interpreter |
|---|---|---|
| Use case | Full IDE-like development environments | Stateless tool-call code execution |
| Persistence | Git-versioned filesystem | Session-scoped |
| VM sizing | Hot-swappable, 2 to 64 vCPUs, 1 to 128 GB RAM | Same micro-VM substrate |
| Cold start | 500 ms P95 from snapshot | Sub-second |
| Cloning | Under 1 second | Under 1 second |
| Typical caller | AI IDEs, SaaS platforms | Agent frameworks, RL training loops |
A notable production deployment is Agentica, an open-source project from Berkeley AI Research and the Sky Computing Lab, which used Together Code Interpreter during the training of DeepCoder-14B-Preview. Agentica reported running 1,024 code executions in parallel and scaling to more than 100 concurrent coding sandboxes with thousands of evaluations per minute, illustrating how TCI integrates with reinforcement learning operations for coding agents [15]. Together AI has positioned the sandbox products as direct alternatives to E2B and Modal for teams that want to keep both the model and the execution environment on the same platform.
In addition to inference, fine-tuning, and GPU cluster pricing, Together AI publishes per-resource rates for ancillary services that are commonly bundled into agentic applications.
| Service | Pricing |
|---|---|
| Together Code Sandbox vCPU | $0.0446 per vCPU hour |
| Transcription (Whisper Large v3) | $0.0015 per audio minute |
| Storage | $0.16 per GiB per month |
| Evaluations dashboard | Included with platform |
On May 14, 2025, Together AI acquired Refuel.ai, a four-year-old San Francisco startup focused on using LLMs to clean, structure, and label enterprise data [10]. The acquisition price was not disclosed. Refuel.ai's flagship products included Refuel LLM-2, a family of models purpose-built for data labeling and structured extraction tasks, and Refuel Cloud, a workflow platform for building multi-step data pipelines. At the time of acquisition, Refuel was processing tens of millions of records and billions of tokens per week for customers including major financial institutions [10].
The strategic rationale is straightforward: high-quality fine-tuning and post-training depend on high-quality data, and many Together AI customers were stitching together their own labeling pipelines on top of Together's inference and fine-tuning services. By bringing Refuel in-house, Together AI was able to offer end-to-end data preparation and customization inside a single platform.
| Capability | Pre-acquisition | Post-acquisition |
|---|---|---|
| First-party data labeling models | None | Refuel LLM-2 family with serverless inference and LoRA fine-tuning |
| Workflow orchestration | Notebook examples and custom scripts | Refuel Cloud workflows integrated with Together inference |
| Customer base | Developers and AI-native startups | Adds financial services and data-heavy enterprise accounts |
| Roadmap | Inference and training | Data preparation, labeling, evaluation, and agentic tooling |
The Refuel team joined Together AI's data and applied research groups, and Refuel LLM-2 was integrated into the Together model catalog so that existing customers could call it from the same OpenAI-compatible API used for Llama, DeepSeek, or Mistral models [10]. The acquisition was viewed by analysts as filling a notable gap relative to competitors like Databricks and Scale AI, both of which had moved aggressively to combine data infrastructure with model training services.
Together AI's performance advantage is closely tied to its integration of FlashAttention, a family of memory-efficient attention algorithms created by chief scientist Tri Dao. FlashAttention reduces the memory overhead and computational cost of the attention mechanism in transformers, which is typically the primary bottleneck in both training and inference. FlashAttention is now used in production by OpenAI, Anthropic, Meta, and Mistral, but Together AI is the only commercial cloud whose product roadmap is shaped directly by its author [3][13].
The platform's inference engine incorporates FlashAttention-3 kernels, which are optimized for the latest NVIDIA GPU architectures including Hopper and Blackwell. Combined with advanced quantization (reducing model precision to lower memory usage and increase throughput) and the proprietary Together Kernel Collection, these optimizations enable the platform to serve large models at significantly lower latency and cost than standard serving frameworks [2]. Dao's work on Mamba and follow-on state space models has also informed Together AI's roadmap for serving non-transformer architectures efficiently.
Together AI has invested heavily in developer experience, making its platform accessible to developers already familiar with the OpenAI ecosystem.
Together AI's API endpoints for chat, vision, images, embeddings, and speech are fully compatible with OpenAI's API format [11]. Developers using the OpenAI Python library, LangChain, or other frameworks with OpenAI integrations can point their existing applications at Together AI's servers by changing only the base URL and API key. All major parameters available in the OpenAI API work with Together AI, including streaming, function calling, and JSON mode.
This compatibility significantly reduces migration friction and allows developers to experiment with open-source models without rewriting application code. It also makes Together AI a common second provider in multi-provider routing setups, where applications fail over from a primary inference vendor to Together AI under load or for specific model families.
| Feature | Description |
|---|---|
| OpenAI-compatible API | Drop-in replacement for OpenAI endpoints |
| Code Sandbox | Isolated VM execution environment for code generation tasks |
| Code Interpreter | Session-based tool-call execution for agents |
| Evaluations dashboard | Compare model performance across benchmarks and custom datasets |
| Transcription API | Whisper-based speech-to-text |
| Streaming | Real-time token streaming for interactive applications |
| Function calling | Native tool use compatible with OpenAI function calling format |
| JSON mode | Guaranteed structured JSON output for programmatic consumption |
| Refuel data workflows | Multi-step labeling and extraction pipelines |
Together AI's pricing is structured across its core product lines, with serverless inference priced per token and other services priced per GPU-hour or per resource-unit.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| Llama 4 Maverick | $0.27 | $0.85 | Latest Meta model |
| DeepSeek-V3.1 | $0.60 | $1.70 | Reasoning-focused |
| Qwen3.5-397B | $0.20 | $0.60 | Large multilingual model |
| Mistral Small 3 | $0.10 | $0.30 | Efficient mid-range |
| Gemma 3n E4B | $0.02 | $0.04 | Ultra-lightweight |
| Llama 3.3 8B | $0.18 (combined) | - | Budget option |
| Qwen3-Coder-Next | $0.50 | $1.20 | Code-specialized |
| Refuel LLM-2 | Custom | Custom | Data labeling and extraction |
| Service | Pricing |
|---|---|
| Fine-tuning (LoRA SFT, up to 16B) | $0.48 per 1M tokens |
| Fine-tuning (Full SFT, up to 16B) | $0.54 per 1M tokens |
| Dedicated inference (H100) | $3.99/hour |
| Dedicated inference (B200) | $9.95/hour |
| GPU clusters (H100, on-demand) | $3.49/hour |
| GPU clusters (B200, on-demand) | $7.49/hour |
| Together Code Sandbox | $0.0446 per vCPU hour |
| Storage | $0.16 per GiB per month |
The company positions its pricing as competitive with major cloud providers. For popular open-source models, Together AI's per-token inference costs are often 30 to 60 percent lower than the same models served on Amazon Bedrock or Google Vertex AI, reflecting the efficiency of its custom inference engine [7]. Together AI offers a free tier that lets developers experiment without commitment, and enterprise customers can negotiate volume pricing.
The following table compares pricing for representative models across major inference providers [7]:
| Model | Together AI | Amazon Bedrock | Google Vertex AI | Notes |
|---|---|---|---|---|
| Llama 3.3 70B (input) | ~$0.54/1M | ~$1.95/1M | ~$1.80/1M | Together AI 60-70% cheaper |
| DeepSeek R1 (input) | ~$0.60/1M | ~$0.62/1M | N/A | Comparable pricing |
| Mistral Small (input) | $0.10/1M | ~$0.10/1M | ~$0.10/1M | Price parity on small models |
The cost advantage is most pronounced for larger models, where Together AI's custom inference stack can serve more requests per GPU than standard serving frameworks. For small models, pricing differences narrow as compute efficiency matters less relative to overhead costs.
Together AI competes against a mix of pure-play inference startups and hyperscaler offerings. The closest direct comparable is Fireworks AI, which also targets fast, low-cost serving of open-source models with its own proprietary inference engine (FireAttention). Replicate is broader in modality coverage but less optimized for production LLM throughput, while Hugging Face Inference Endpoints sit closer to the model hub itself. Hyperscaler offerings like Amazon Bedrock bundle access to a curated set of proprietary and open-source models inside the AWS account boundary [16].
| Feature | Together AI | Fireworks AI | Replicate | Hugging Face Inference | Amazon Bedrock |
|---|---|---|---|---|---|
| Focus | Open-source inference, training, GPU cloud | Open-source inference | Community model hosting | Model hub plus inference | Multi-provider managed AI |
| Catalog size | 200+ open-source models | 100+ open-source models | 50,000+ community models | 500,000+ hub, smaller served set | ~100 curated models |
| Custom training | Yes (GPU clusters) | Fine-tuning only | No | AutoTrain for fine-tuning | Limited (fine-tuning only) |
| Inference engine | Together Kernel Collection, FlashAttention-3 | FireAttention | Standard serving | Standard serving | AWS-managed |
| Pricing model | Per-token plus per-GPU-hour | Per-token | Per-second compute | Per-token plus per-GPU-hour | Per-token |
| Code sandbox | Yes (TCS, TCI) | No | No | No | No |
| Data labeling | Yes (Refuel.ai) | No | No | No | Limited |
| Target users | AI developers, enterprises | AI developers, startups | Indie developers, image/video apps | ML researchers, developers | Enterprise cloud users |
Together AI and Fireworks AI sit at the head of the speed-and-cost optimized inference category, with each company publishing benchmarks claiming a lead on different model families. Independent comparisons generally find that the two providers trade leadership depending on the model, batch size, and quantization regime, with Together AI more strongly differentiated on the training and GPU cluster side and on tooling like code sandboxes and Refuel data workflows. Replicate is most useful for image, video, and audio workloads where its long tail of community models and per-second billing fit a more experimental use case [16].
Together AI has built a customer base that spans AI-native startups and large enterprises. Notable customers include Salesforce, Zoom, SK Telecom, Hedra, Cognition, Zomato, Krea, Cartesia, Pika Labs, and The Washington Post [2]. The company also counts Salesforce Ventures among its investors, reflecting a strategic partnership with one of the largest enterprise software companies.
The customer base reflects two distinct segments. AI-native startups like Cognition (maker of the Devin agent), Hedra (avatar and video generation), Krea (image and video generation), Pika Labs (video generation), and Cartesia (real-time voice with the Sonic model) use Together AI as their primary inference infrastructure, drawn by the performance advantages and cost savings that allow them to scale AI-intensive products without managing their own GPU clusters. Larger enterprises like Salesforce, Zoom, and SK Telecom typically use Together AI alongside their existing cloud infrastructure, often for specific workloads where open-source model performance and cost are critical factors.
Publicly reported customer outcomes include 24 percent faster training operations for Pika Labs and ultra-low latency voice AI through Cartesia's Sonic model integration. Krea has highlighted Together AI's ability to scale through traffic surges while maintaining performance during product launches.
| Partner | Nature of relationship |
|---|---|
| Nvidia | Strategic investor, GPU supplier, joint go-to-market on Blackwell |
| Hypertec Cloud | Co-build partner on 36,000-GPU GB200 NVL72 cluster |
| Salesforce | Investor and enterprise customer |
| SK Telecom | Investor and Asia-Pacific customer |
| General Catalyst | Series B lead investor |
| Prosperity7 (Saudi Aramco) | Series B co-lead, reported lead on next round |
Together AI's approach to infrastructure represents a deliberate bet on vertical integration. Rather than renting capacity from major cloud providers, the company builds and operates its own GPU clusters, giving it direct control over hardware configuration, networking, and cooling. This approach mirrors the strategies of other AI-focused infrastructure companies like CoreWeave but is unusual among inference-focused platforms, which typically operate on top of existing cloud infrastructure [2].
The 200 MW of secured power capacity across multiple data centers is significant. For context, a single NVIDIA HGX B200 system consumes approximately 14 kilowatts, meaning 200 MW could theoretically power roughly 14,000 B200 systems at full load (before accounting for cooling and overhead). Combined with the Hypertec Cloud partnership and additional reserved capacity, Together AI has access to data center footprints sufficient for over 100,000 GPUs through 2025 and 2026 [14].
The geographic distribution of data centers across North America, including facilities in Maryland and Memphis, provides redundancy and the ability to serve customers with data residency requirements. The Maryland facility's proximity to the Washington, D.C. area is notable given the growing demand for AI infrastructure from government and defense customers, while the Memphis site is positioned to take advantage of relatively low-cost power.
Together AI also runs its inference stack on hyperscaler hardware where customers require it, including private deployments inside AWS, Microsoft Azure, and Google Cloud accounts. This hybrid posture lets the company chase wholesale GPU economics on its own metal while still meeting enterprise data residency and procurement constraints.
In parallel with its commercial products, Together AI has remained an active contributor to the open-source AI ecosystem. The company maintains or has contributed to several widely used projects:
| Project | Role | Description |
|---|---|---|
| RedPajama | Co-lead | 100 trillion token open dataset for LLM training |
| FlashAttention | Author (Tri Dao) | Memory-efficient attention used across the field |
| Mamba | Co-author (Tri Dao) | Selective state space architecture for sequence modeling |
| TorchTitan recipes | Contributor | Reference training recipes for large Llama-style models |
| Together Kernel Collection | Maintainer | CUDA kernels for transformer and SSM workloads |
| Refuel LLM-2 (post-acquisition) | Maintainer | Open-weight model for data labeling and extraction |
This posture matters commercially because most of Together AI's customers run open-weight models themselves, and a willingness to publish data, recipes, and kernels acts as both a recruiting tool and a credibility signal in the open-source community.
As of May 2026, Together AI continues to scale rapidly. At its first AI Native conference in March 2026, the company announced several product and business milestones [8]. The platform now supports over 200 models across all modalities, including chat, image, audio, vision, code, and embeddings. Refuel.ai integration is fully shipped, the Hypertec GB200 cluster is in production, and the company's annualized revenue is roughly $1 billion [12].
The competitive landscape for AI inference has intensified, with hyperscalers like AWS, Microsoft Azure, and Google Cloud all expanding their open-source model offerings, and specialized rivals like Fireworks AI, Anyscale, and Modal pushing on the same buyer base. Together AI's response has been to double down on performance and verticalization, investing in custom kernels, Blackwell GPU deployments, Refuel data workflows, code sandboxes, and vertically integrated infrastructure that gives it a cost advantage over larger but less specialized competitors.
The company's trajectory from a research-oriented startup to a multi-billion-dollar infrastructure provider illustrates the growing importance of the inference and training infrastructure layer in the AI ecosystem. With revenue growing rapidly, an expanding customer base, and a strong developer community, Together AI has established itself as one of the most prominent independent infrastructure companies serving the open-weight model market.