AssemblyAI
Last reviewed
May 7, 2026
Sources
30 citations
Review status
Source-backed
Revision
v1 · 4,653 words
AssemblyAI is an American artificial intelligence company that builds and operates Speech AI models for developers and enterprises. Founded in 2017 by Dylan Fox in San Francisco and backed by Y Combinator's Winter 2017 batch, the company provides a cloud-based API platform for automatic speech recognition, audio intelligence, and voice AI applications. Its flagship product line has progressed through the Conformer, Universal, and Slam model families, culminating in the Universal-3 Pro and Slam-1 models. As of early 2026, AssemblyAI has raised approximately $158 million in total funding, counts over 200,000 developers building on its platform, and processes hundreds of millions of hours of audio annually for customers including Spotify, NASA, the Wall Street Journal, and NBCUniversal.
The company occupies a niche in the AI-as-a-service market distinct from pure infrastructure cloud providers: it trains and hosts its own proprietary speech recognition models, offers them through a unified REST API, and layers higher-level audio intelligence features (sentiment analysis, speaker identification, summarization, and content moderation) on top of the base transcription. Its LeMUR framework and LLM Gateway extend this approach by allowing developers to apply large language models directly to audio content through a single API call.
Dylan Fox founded AssemblyAI in 2017 after leaving Cisco, where he had worked as an engineer. Fox applied as a solo founder to Y Combinator's Winter 2017 cohort, submitting a video demonstration of early speech recognition technology. He was accepted into the program, where he met Daniel Gross, a former Apple engineer who had studied speech recognition problems and became one of AssemblyAI's earliest investors.
The founding thesis was that developers needed access to accurate, production-ready speech AI without the need to train or host their own models. At the time, the dominant options were cloud provider APIs from Google, Amazon, and Microsoft (which traded accuracy for breadth), or open-source models such as Mozilla DeepSpeech (which required significant infrastructure investment to deploy). Fox positioned AssemblyAI as a model-as-a-service company that would handle the research and infrastructure complexity while exposing a simple HTTP API.
The company grew slowly through its first several years, using the period before the 2022 AI funding boom to build model infrastructure and accumulate paying customers. Fox later described those years as five years spent waiting for the market to catch up to the technology. By early 2022, the company was processing one million audio streams per day and had seen revenue triple year over year.
Before the Universal model family, AssemblyAI released two generations of Conformer-architecture models.
Conformer-1, released in March 2023, was built on Google Brain's 2020 Conformer architecture (a hybrid of Transformer and convolutional neural network designs) as modified by the Efficient Conformer paper. AssemblyAI trained it on 650,000 hours of audio using noisy student-teacher training, a technique where a teacher model generates pseudo-labels on unlabeled audio and a student model trains on both labeled and pseudo-labeled data. Conformer-1 was claimed to make up to 43% fewer errors than comparable models on noisy audio.
Conformer-2, released later in 2023, scaled the training dataset to 1.1 million hours of English audio and increased model parameters to 450 million. It used an ensemble of teacher models during training, a variance-reduction technique that produced more robust pseudo-labels. Compared to Conformer-1, it improved alphanumeric recognition by 31.7%, proper noun accuracy by 6.8%, and noise robustness by 12.0%.
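The teacher-student pseudo-labeling loop described for the Conformer models can be illustrated with a toy classifier. This is a minimal sketch and not AssemblyAI's training code: the one-dimensional threshold "models", the data, and every function name are hypothetical, and the noise or augmentation that noisy-student training applies to the student's inputs is omitted.

```python
from collections import Counter
from statistics import mean

def fit_threshold(samples):
    """Toy 'model': midpoint between the class means of (value, label) pairs."""
    low = mean(v for v, y in samples if y == 0)
    high = mean(v for v, y in samples if y == 1)
    return (low + high) / 2

def predict(threshold, value):
    return 1 if value > threshold else 0

def ensemble_pseudo_label(teachers, value):
    """Majority vote across teachers, standing in for the teacher ensemble."""
    votes = Counter(predict(t, value) for t in teachers)
    return votes.most_common(1)[0][0]

# Hypothetical data: a small labeled set and a pool of unlabeled values.
labeled = [(0.1, 0), (0.3, 0), (0.8, 1), (0.9, 1)]
unlabeled = [0.05, 0.2, 0.7, 0.95]

# Three teachers (odd count, so votes cannot tie), each fit on a different
# subset of the labeled data.
teachers = [
    fit_threshold(labeled[:1] + labeled[2:]),
    fit_threshold(labeled[:3]),
    fit_threshold(labeled),
]

# Teachers label the unlabeled pool; the student trains on both sets.
pseudo = [(v, ensemble_pseudo_label(teachers, v)) for v in unlabeled]
student = fit_threshold(labeled + pseudo)
```

Voting across several teachers means a single teacher's mistake on an unlabeled example is less likely to end up in the student's training set, which is the variance-reduction effect attributed to the Conformer-2 ensemble.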
AssemblyAI has raised approximately $158 million in total. Three institutional rounds since 2022 are documented below.
In March 2022, AssemblyAI closed a $28 million Series A led by Accel. Co-investors included Y Combinator, John and Patrick Collison (founders of Stripe), Nat Friedman (former CEO of GitHub), and Daniel Gross (founder of Pioneer and one of the company's original backers). The round was reported in TechCrunch on March 4, 2022.
At the time of the announcement, Fox described the company as processing one million audio streams per day and having hundreds of paying customers. Stated uses of the funding included hiring, GPU infrastructure expansion (including over $1 million in Nvidia A100 servers), and product development.
Four months after the Series A, AssemblyAI raised a $30 million Series B led by Insight Partners in July 2022, with participation from Accel and Y Combinator. The rapid back-to-back rounds reflected the accelerating interest in AI developer tools during 2022 and the company's intent to aggressively scale model training infrastructure. CEO Fox described the company's ambition as building the "Stripe for AI models," offering developers access to frontier AI capabilities through simple APIs the way Stripe had simplified payment processing.
AssemblyAI closed a $50 million Series C on December 3, 2023, again led by Accel. Co-investors included Insight Partners, Y Combinator, Nat Friedman, Daniel Gross, and Keith Block and Smith Point Capital. Block is the former co-CEO of Salesforce. TechCrunch covered the round on December 4, 2023.
At the time of the Series C, AssemblyAI reported:
The company stated its intent to train a universal speech model on more than a petabyte of voice data, which subsequently became Universal-1.
The three rounds above sum to $108 million. Subsequent funding brought total capitalization to approximately $158 million by 2024, implying roughly $50 million in additional capital beyond those rounds. Tracxn and related databases cited a valuation of approximately $300 million as of 2024.
AssemblyAI released Universal-1 in April 2024 as its most capable model to date. The model was trained on over 12.5 million hours of multilingual audio, encompassing non-native speakers, heavy background noise, multi-speaker conversations, and diverse recording conditions. Universal-1 launched with support for English and Spanish, with German and French added in subsequent weeks.
Key technical characteristics of Universal-1:
Universal-1 was built on research published in the paper "Anatomy of Industrial Scale Multilingual ASR" (arXiv:2404.09841), which described the training methodology and architecture decisions in detail.
AssemblyAI released Universal-2 in October 2024, with accompanying research published at assemblyai.com/research/universal-2. The model built on Universal-1 and targeted three specific weaknesses identified in production deployments: proper noun recognition, alphanumeric formatting, and general text formatting.
Key improvements over Universal-1:
Universal-2 also incorporated Universal-2-TF, a two-stage neural text formatting model described in a separate research paper. The system combines a token classification approach (for punctuation and capitalization) with a sequence-to-sequence approach (for complex normalization) to handle text formatting as a learned neural task rather than a rule-based post-processing step. This architecture allowed formatting to be handled at the speed of real-time transcription rather than as a separate offline step.
Universal-2 launched at $0.15 per hour for batch transcription, supporting 99 languages with automatic language detection and automatic code-switching between English and other languages. By October 2025, AssemblyAI had reduced speaker counting errors on mid-to-long-duration audio files by 64% and expanded keyterm support to 200 terms.
AssemblyAI introduced Universal-3 Pro on February 3, 2026, positioning it as the first production-quality speech model to accept natural language prompts for controlling transcription behavior. The model was designed to reduce or eliminate post-processing pipelines that developers previously needed to build on top of raw transcripts.
Universal-3 Pro features:
Universal-3 Pro is priced at $0.21 per hour, described by AssemblyAI as 35 to 50% lower cost than competing solutions. It achieved the lowest word error rate on real-world data on AssemblyAI's internal benchmarks across call center, medical, and multi-speaker recordings.
AssemblyAI announced Slam-1 in March 2025 and released it to public beta on April 23, 2025. Slam stands for Speech Language Model. The model represents a different architectural approach from the Universal family: rather than a dedicated ASR model that converts audio to text, Slam-1 combines an audio encoder with a large language model decoder, allowing the system to apply genuine language understanding to transcription rather than pattern-matching against training distributions.
Slam-1 is described as the most powerful prompt-based Speech Language Model available at its launch. Key characteristics:
Subsequent updates in October 2025 improved Slam's accuracy by up to 57% on critical terms and expanded context-aware key term prompting to 1,000 words, with pricing adjusted to $0.27 per hour. The same October 2025 update introduced intelligent model fallback, allowing developers to specify Slam-1 as the primary model with automatic fallback to Universal-2 for audio in languages Slam-1 does not support.
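The fallback rule described above amounts to a language check before model selection. The sketch below is illustrative only: AssemblyAI performs this selection on its own side, and the function name, the model identifier strings, and the language set are assumptions made for the example rather than confirmed API values.

```python
# Assumption: language coverage as stated in the text (Slam-1 is English-only).
SLAM1_LANGUAGES = frozenset({"en"})

def choose_model(language_code, primary="slam-1", fallback="universal-2",
                 primary_languages=SLAM1_LANGUAGES):
    """Return the primary model if it supports the language, else the fallback."""
    # Reduce regional tags such as "en-US" or "en_GB" to the base language "en".
    base = language_code.lower().split("-")[0].split("_")[0]
    return primary if base in primary_languages else fallback
```

Under these assumptions, `choose_model("en-US")` selects the primary model and `choose_model("es")` selects the fallback.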
AssemblyAI introduced LeMUR (Leveraging Large Language Models to Understand Recognized Speech) as an early-access product and later rebranded and expanded it into the LLM Gateway. The framework solves a specific integration problem: applying large language models to audio content requires first transcribing the audio, managing long transcripts that may exceed LLM context windows, and constructing effective prompts.
LeMUR and its successor LLM Gateway handle all of this plumbing. Developers submit an audio file URL or transcript ID along with an LLM prompt; the system transcribes the audio if needed, chunks and manages context, calls the specified LLM, and returns a structured response. This allows a developer to, for example, ask "What were the three main action items from this one-hour meeting?" as a single API call without manually managing transcription, chunking, or LLM invocation.
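The chunk-and-combine step can be sketched generically. This is not how LeMUR or the LLM Gateway is implemented: the function names are hypothetical, the `llm` argument is any caller-supplied function that maps a prompt string to a response string, and word count is used as a rough stand-in for a token-based context limit.

```python
from typing import Callable, List

def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def ask_over_transcript(transcript: str, question: str,
                        llm: Callable[[str], str],
                        max_words: int = 2000) -> str:
    """Answer a question over a transcript that may exceed one context window."""
    chunks = chunk_words(transcript, max_words)
    if not chunks:
        return ""
    # Map step: ask the question against each chunk independently.
    partials = [llm(f"{question}\n\nTranscript excerpt:\n{chunk}")
                for chunk in chunks]
    if len(partials) == 1:
        return partials[0]
    # Reduce step: ask the model to merge the per-chunk answers.
    merged = "\n".join(partials)
    return llm(f"{question}\n\nCombine these partial answers:\n{merged}")
```

A transcript that fits in one chunk costs a single model call; longer transcripts cost one call per chunk plus one call to combine the partial answers.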
The LLM Gateway, released in October 2025, expanded LeMUR into a unified API providing access to over 20 large language models from Anthropic (Claude), OpenAI (GPT), and Google (Gemini) through a single interface and billing relationship. Key capabilities:
Built on top of transcription and LLM capabilities, AssemblyAI offers a set of pre-built audio intelligence features billed as add-ons per audio hour:
| Feature | Description | Price |
|---|---|---|
| Speaker Identification | Labels speakers by name using audio context | $0.02/hr |
| Sentiment Analysis | Per-sentence sentiment detection | $0.02/hr |
| Auto Chapters | Automatic segmentation and chapter titles | $0.08/hr |
| Summarization | LLM-powered summary of audio content | $0.03/hr |
| Entity Detection | Names, dates, organizations, locations | $0.08/hr |
| Key Phrases | Automatic extraction of key terms | $0.01/hr |
| Topic Detection | Classification into topic categories | $0.15/hr |
| Translation | Translation of transcript to another language | $0.06/hr |
| Custom Formatting | User-specified output format rules | $0.03/hr |
AssemblyAI offers two primary transcription modes that suit different application architectures and latency requirements.
Batch transcription accepts pre-recorded audio files through an HTTP POST request or URL submission. The API is asynchronous: the developer submits an audio file, receives a job ID, and polls for completion or registers a webhook to be notified when transcription is ready. Processing time varies with file duration and queue depth.
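The submit-then-poll flow can be sketched with the Python standard library. The endpoint path, header name, and the `status`, `text`, and `error` fields below reflect AssemblyAI's public v2 REST API as commonly documented, but they are assumptions here and should be checked against the current API reference. The helper names are hypothetical.

```python
import json
import time
import urllib.request

API_BASE = "https://api.assemblyai.com/v2"  # assumption: verify against current docs

def api_request(method, path, api_key, payload=None):
    """Send a JSON request and return the decoded JSON response."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        f"{API_BASE}{path}", data=data, method=method,
        headers={"authorization": api_key, "content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def submit(api_key, audio_url):
    """Submit an audio URL for transcription and return the job ID."""
    return api_request("POST", "/transcript", api_key, {"audio_url": audio_url})["id"]

def poll_until_done(fetch_job, interval=3.0, timeout=600.0, sleep=time.sleep):
    """Call fetch_job() until the job completes, fails, or the timeout passes."""
    waited = 0.0
    while True:
        job = fetch_job()
        if job["status"] == "completed":
            return job["text"]
        if job["status"] == "error":
            raise RuntimeError(job.get("error", "transcription failed"))
        if waited >= timeout:
            raise TimeoutError("transcript not ready before timeout")
        sleep(interval)
        waited += interval
```

A caller would pair the two steps as `job_id = submit(key, url)` followed by `poll_until_done(lambda: api_request("GET", f"/transcript/{job_id}", key))`. Registering a webhook avoids the polling loop entirely, at the cost of hosting an endpoint to receive the notification.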
Batch transcription supports parallel processing of large file volumes. The Python SDK includes a built-in batch processing method that submits files concurrently and collects results. Developers can process entire audio libraries simultaneously, with total elapsed time determined by the longest single file rather than the sum of all file durations.
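The effect of concurrent submission on elapsed time can be shown with a thread pool. This sketch does not use the SDK's built-in batch method; `transcribe_all` and `fake_transcribe` are hypothetical names, and the sleep calls stand in for blocking API requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def transcribe_all(files, transcribe_one, max_workers=8):
    """Run transcribe_one over every file concurrently; returns {file: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(transcribe_one, files))  # map preserves input order
    return dict(zip(files, results))

def fake_transcribe(name_and_seconds):
    """Stand-in for a blocking API call; sleeps to mimic processing time."""
    name, seconds = name_and_seconds
    time.sleep(seconds)
    return f"transcript of {name}"

# Hypothetical jobs: (file name, simulated processing time in seconds).
jobs = [("a.mp3", 0.05), ("b.mp3", 0.10), ("c.mp3", 0.15)]
start = time.perf_counter()
results = transcribe_all(jobs, fake_transcribe)
# With enough workers, elapsed time tracks the longest job, not the sum of all jobs.
elapsed = time.perf_counter() - start
```

The "longest single file" bound holds only while the number of workers is at least the number of files and no rate limit intervenes; with fewer workers, jobs queue and elapsed time grows.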
Batch mode is appropriate for post-call analysis, podcast transcription, media monitoring, video subtitle generation, and similar workflows where a latency of seconds to minutes is acceptable.
AssemblyAI's streaming API returns partial transcripts within approximately 300 milliseconds (P50) over a persistent WebSocket connection. The Universal-Streaming model supports:
Universal-3 Pro Streaming, released alongside the batch model in February 2026, adds natural language prompting to streaming transcription at a price of $0.45 per hour.
Streaming is appropriate for voice agents, real-time captions, live customer service call analysis, and interactive voice interfaces where sub-second latency matters.
AssemblyAI uses a consumption-based pricing model. All tiers include $50 in free credits (approximately 333 hours of Universal-2 transcription at $0.15 per hour, or about 185 hours at Slam-1's $0.27 per hour) with no credit card required at signup. Enterprise customers can negotiate volume discounts and receive access to dedicated support.
| Model | Price | Notes |
|---|---|---|
| Universal-3 Pro | $0.21/hr | 6 languages natively; prompting included |
| Universal-2 | $0.15/hr | 99 languages; standard batch transcription |
| Slam-1 (Beta) | $0.27/hr | English only; prompt-based customization |
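How far the $50 free credit stretches at each batch price in the table above is a single division. The sketch below only restates figures from this article; the variable and function names are illustrative.

```python
FREE_CREDIT_USD = 50.00

# Batch prices in USD per audio hour, copied from the table above.
BATCH_PRICE_PER_HOUR = {
    "Universal-3 Pro": 0.21,
    "Universal-2": 0.15,
    "Slam-1 (Beta)": 0.27,
}

def free_hours(price_per_hour, credit=FREE_CREDIT_USD):
    """Hours of audio covered by the credit at a given hourly price."""
    return credit / price_per_hour

hours = {model: round(free_hours(price), 1)
         for model, price in BATCH_PRICE_PER_HOUR.items()}
```

This works out to roughly 238 hours on Universal-3 Pro, 333 hours on Universal-2, and 185 hours on Slam-1, before any add-on features are billed.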
| Model | Price | Notes |
|---|---|---|
| Universal-3 Pro Streaming | $0.45/hr | Promptable; 6 languages |
| Universal-Streaming | $0.15/hr | English; immutable transcripts |
| Universal-Streaming Multilingual | $0.15/hr | English + Spanish, French, German, Italian, Portuguese |
| Whisper-Streaming | $0.30/hr | OpenAI Whisper via AssemblyAI infrastructure |
| Feature | Price |
|---|---|
| Keyterms Prompting | $0.04-0.05/hr |
| Speaker Diarization (batch) | $0.02/hr |
| Speaker Diarization (streaming) | $0.12/hr |
| Medical Mode | $0.15/hr |
| PII Audio Redaction | $0.05/hr |
| PII Text Redaction | $0.08/hr |
| Content Moderation | $0.15/hr |
| Profanity Filtering | $0.01/hr |
| Product | Price |
|---|---|
| Voice Agent API | $4.50/hr ($0.075/min) |
| Model | Input | Output |
|---|---|---|
| GPT-5.5 | $5.00 | $30.00 |
| GPT-5.2 | $1.75 | $14.00 |
| Claude 4.7 Opus | $5.50 | $27.50 |
| Claude 4.6 Sonnet | $3.00 | $15.00 |
| Gemini 3 Flash | $0.50 | $3.00 |
| Gemini 2.5 Flash | $0.30 | $2.50 |
AssemblyAI competes primarily with Deepgram (Nova-3 model), OpenAI's Whisper and GPT-4o-Transcribe, Azure Cognitive Services Speech, Google Cloud Speech-to-Text, and AWS Transcribe. The following comparison reflects published benchmarks and documented pricing as of early 2026.
| Provider | Model | English WER | Notes |
|---|---|---|---|
| AssemblyAI | Universal-3 Pro | ~5.9% | Best on AssemblyAI's 80,000+ file benchmark |
| OpenAI | GPT-4o-Transcribe | ~6.5% | AssemblyAI's benchmark; third parties report lower WER in some tests |
| ElevenLabs | Scribe | ~6.5% | Per AssemblyAI's benchmark |
| Microsoft | Azure Speech | ~7.5% | Per AssemblyAI's benchmark |
| Amazon | Transcribe | ~7.6% | Per AssemblyAI's benchmark |
| Deepgram | Nova-3 | ~8.1% | Per AssemblyAI's benchmark; Deepgram's own data shows sub-7% on batch |
Note: All accuracy figures above are from AssemblyAI's own benchmarks using 250+ hours of audio across 80,000+ files from 26 datasets. Independent third-party benchmarks and each provider's own benchmarks yield different figures. Deepgram's published benchmarks report Nova-3 achieving a median WER of approximately 5.26% on batch audio across 2,703 production audio files. OpenAI's internal data shows GPT-4o-Transcribe achieving approximately 2.46% WER on its test sets. Benchmark methodology, dataset selection, and evaluation conditions materially affect published WER numbers, and results on production audio with noise, accents, or domain jargon typically differ significantly from clean benchmark results.
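Word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of words in the reference. A minimal sketch using edit distance over whitespace-separated words follows. The text normalization that benchmarks apply first (casing, punctuation, number formatting) varies between providers and is not shown, which is one reason published figures differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, one wrong word in a four-word reference gives a WER of 0.25, or 25%. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.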
| Dimension | AssemblyAI | Deepgram | Whisper (OpenAI) |
|---|---|---|---|
| Base price (batch) | $0.15-0.21/hr | ~$0.21/hr (Nova-3) | $0.006/min (~$0.36/hr) |
| Streaming | Yes ($0.15-0.45/hr) | Yes | Limited (gpt-4o-realtime) |
| Speaker diarization | Add-on | Add-on | No (base model) |
| LLM integration | Yes (LLM Gateway) | Limited | Via API (GPT-4o) |
| Audio intelligence | Extensive add-on suite | Moderate | None built-in |
| On-premise deployment | No (cloud only) | Yes (on-premise option) | Yes (self-hosted) |
| Language support | 99 (Universal-2) | 36+ | 100+ (Whisper Large-v3) |
| Hallucination rate | Lower than Whisper | Comparable | Higher (Whisper Large-v3) |
| Prompting/customization | Yes (Slam-1, U3 Pro) | Yes (Nova-3) | Limited |
| SOC 2 Type 2 | Yes | Yes | Via OpenAI |
Key differentiators by provider:
AssemblyAI offers the most complete audio intelligence stack of the API providers, combining transcription with an extensive set of pre-built LLM-powered features (sentiment, summarization, entity detection, topic classification) and the LLM Gateway for custom analysis. Universal-3 Pro's natural language prompting reduces the post-processing engineering required for specialized use cases. The primary tradeoffs are cloud-only deployment, add-on pricing that can accumulate, and English-only support for the Slam-1 model.
Deepgram Nova-3 emphasizes low latency and high throughput in streaming scenarios. Deepgram publishes strong benchmark results of its own and is frequently cited by developers of high-volume voice agent applications for its speed and reliability at scale. Deepgram also supports on-premise deployment for customers with data residency requirements.
Whisper (OpenAI) is available as both a cloud API and a self-hosted open-source model. The open-source availability makes it the default choice for teams requiring on-premise deployment or needing to avoid per-call API costs at very high volume. Whisper's hallucination rate on longer audio is a widely cited limitation. OpenAI's GPT-4o-Transcribe model improves accuracy significantly over Whisper Large-v3 but is priced at $0.006 per minute ($0.36/hour), roughly 2.4x the cost of AssemblyAI's Universal-2 at $0.15/hour.
AssemblyAI's customer base spans media, technology, healthcare, finance, and enterprise software sectors. Documented customers include Spotify, NASA, the Wall Street Journal, NBCUniversal, CallRail, Loop Media, and Fireflies. As of the Series C in December 2023, the company reported 4,000 paying brands.
Conversation intelligence: Sales and support call centers use AssemblyAI to transcribe and analyze recorded calls. Post-call analytics platforms layer sentiment analysis, entity detection, and automatic summarization to surface coaching signals and compliance flags. Siro, a sales coaching platform, reported a 90% reduction in support tickets after deploying AssemblyAI.
Meeting transcription and summaries: Video conferencing and productivity tools use the batch API to generate searchable transcripts and LLM-powered summaries of recorded meetings. Fireflies, a meeting intelligence platform, is a documented customer.
Podcast and media production: Media companies and podcast platforms use AssemblyAI to automatically generate subtitles, transcripts for search indexing, and chapter markers. NBCUniversal and the Wall Street Journal use AssemblyAI for broadcast media processing.
Voice agents: Real-time transcription APIs enable voice-driven AI agents to convert user speech to text with sub-second latency. The October 2025 Voice Agent API product packaged these capabilities with billing optimized for interactive voice applications.
Healthcare documentation: The Medical Mode add-on ($0.15/hr on top of base transcription) improves recognition of clinical terminology, drug names, and medical codes. Healthcare platforms use AssemblyAI for ambient clinical documentation, reducing the time physicians spend on documentation after patient visits.
Qualitative research: Market research and UX research platforms use AssemblyAI to transcribe user interviews at scale. One documented qualitative data-analysis platform reported a 60% reduction in time spent analyzing data after integrating AssemblyAI.
Hiring and talent: Hiring intelligence platforms use speech AI to transcribe and analyze recorded candidate interviews. One such platform reported a 90% reduction in time spent on manual interview review tasks.
AssemblyAI holds SOC 2 Type 2 certification, audited in 2022-2023 and maintained since. A Type 2 report verifies that AssemblyAI's security controls operated effectively over a sustained audit period, rather than at a single point in time, against AICPA criteria for availability, confidentiality, and processing integrity.
Data-in-transit is encrypted with TLS 1.3 by default. Data at rest is encrypted with AES-256. AssemblyAI offers EU data residency, allowing customers in regulated industries to store and process data entirely within the European Union rather than the United States. The company does not store audio files or transcripts beyond the processing period unless customers explicitly enable storage.
Enterprise customers can purchase Premier Support, which provides access to dedicated AI specialists and engineers, faster response times, and proactive guidance on model selection and integration patterns.
Cloud-only deployment: AssemblyAI provides no on-premise or private-cloud deployment option. All audio is processed on AssemblyAI's infrastructure. This is a blocking constraint for organizations with strict data residency requirements outside the US/EU, classified data environments, or air-gapped deployments.
Slam-1 language coverage: The Slam-1 model, which offers the most powerful customization and highest accuracy for English, supports only English as of its public beta. Multilingual workloads must use Universal-2 or Universal-3 Pro.
Streaming latency at enterprise scale: While Universal-Streaming achieves approximately 300 ms P50 latency for typical use cases, network latency and load variability can produce perceptible delays that affect low-latency voice agent applications at enterprise scale.
Benchmark methodology: AssemblyAI's published accuracy benchmarks use internally selected test sets. Independent evaluations have produced different rankings depending on dataset composition. Real-world production audio with heavy accents, overlapping speakers, or specialized jargon consistently produces higher WER than benchmark results for all providers.
Add-on pricing accumulation: The base transcription price of $0.15-0.21/hr is competitive, but production deployments typically require speaker diarization, summarization, PII redaction, and other add-ons that can bring the effective hourly cost to $0.35-0.50/hr or more for feature-rich applications.
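The accumulation is easy to check against the price tables in this article. The sketch below sums one plausible feature stack. The choice of add-ons and the 10,000-hour monthly volume are hypothetical, and `Decimal` is used so that cent-level sums stay exact.

```python
from decimal import Decimal

# USD per audio hour, copied from the pricing tables in this article.
BASE = {
    "Universal-2": Decimal("0.15"),
    "Universal-3 Pro": Decimal("0.21"),
}
ADDONS = {
    "Speaker Diarization (batch)": Decimal("0.02"),
    "Summarization": Decimal("0.03"),
    "PII Text Redaction": Decimal("0.08"),
    "Sentiment Analysis": Decimal("0.02"),
}

def effective_rate(model, addons):
    """Base transcription price plus the selected add-ons, per audio hour."""
    return BASE[model] + sum((ADDONS[name] for name in addons), Decimal("0"))

# Hypothetical workload: Universal-3 Pro with all four add-ons above.
rate = effective_rate("Universal-3 Pro", ADDONS)
monthly_cost = rate * 10_000  # assumed volume of 10,000 audio hours per month
```

With this stack the effective rate is $0.36 per hour, about 71% above the $0.21 base price, or $3,600 per month at the assumed volume.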
Support responsiveness at scale: Developer community feedback has noted that support ticket response times can be slow for teams running high-volume production workloads on the standard pricing tier. AssemblyAI's Premier Support tier addresses this with dedicated support personnel, but it requires a separate enterprise agreement.
No HIPAA Business Associate Agreement disclosed: As of the time of writing, AssemblyAI has not publicly documented a HIPAA Business Associate Agreement (BAA) offering. Healthcare customers with HIPAA obligations should confirm compliance requirements directly with AssemblyAI before deploying.