# AssemblyAI

> Source: https://aiwiki.ai/wiki/assemblyai
> Updated: 2026-06-23
> Categories: AI Companies, Speech & Audio AI, Voice AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**AssemblyAI** is an American [Speech AI](/wiki/voice_ai) company, founded in 2017 by Dylan Fox in San Francisco, that trains its own [automatic speech recognition](/wiki/automatic_speech_recognition) models and sells them to developers and enterprises through a single cloud API. An alumnus of [Y Combinator's](/wiki/y_combinator) Winter 2017 batch, it offers proprietary [speech-to-text](/wiki/speech_recognition) models (the Conformer, Universal, and Slam families, culminating in Universal-3 Pro and Slam-1), layered audio intelligence features such as speaker diarization, sentiment analysis and summarization, and the LeMUR / LLM Gateway framework for running large language models over audio. The company has raised approximately $115 million across four funding rounds and, at the time of its December 2023 Series C, reported over 200,000 developers building on its platform and more than 10 terabytes of voice data processed per day; its customers include Spotify, NASA, the Wall Street Journal, and NBCUniversal.[3][28][31] CEO Dylan Fox has framed the company's goal as building "superhuman Voice AI models that would unlock an entirely new class of AI applications to be built leveraging voice data."[4]

The company occupies a niche in the AI-as-a-service market distinct from pure infrastructure cloud providers: it trains and hosts its own proprietary speech recognition models, offers them through a unified REST API, and layers higher-level audio intelligence features (sentiment analysis, speaker identification, summarization, and content moderation) on top of the base transcription. Its LeMUR framework and LLM Gateway extend this approach by allowing developers to apply [large language models](/wiki/large_language_models) directly to audio content through a single API call.[16]

## History

### When was AssemblyAI founded?

Dylan Fox founded AssemblyAI in 2017 after leaving Cisco, where he had worked as an engineer. Fox applied as a solo founder to [Y Combinator's](/wiki/y_combinator) Winter 2017 cohort, submitting a video demonstration of early speech recognition technology. He was accepted into the program, where he met [Daniel Gross](/wiki/daniel_gross), a former Apple engineer who had studied speech recognition problems and became one of AssemblyAI's earliest investors.[26] The company closed an initial seed round of approximately $1.2 million in September 2017.[31]

The founding thesis was that developers needed access to accurate, production-ready speech AI without the need to train or host their own models. At the time, the dominant options were cloud provider APIs from Google, Amazon, and Microsoft (which traded accuracy for breadth), or open-source models such as Mozilla DeepSpeech (which required significant infrastructure investment to deploy). Fox positioned AssemblyAI as a model-as-a-service company that would handle the research and infrastructure complexity while exposing a simple HTTP API.

The company grew slowly through its first several years, spending the period before the 2022 AI funding boom building model infrastructure and accumulating paying customers. Fox later described this stretch as five years spent waiting for the market to catch up to the technology. By early 2022, the company was processing one million audio streams per day and had seen revenue triple year over year.[1]

### Model history: the Conformer era

Before the Universal model family, AssemblyAI released two generations of Conformer-architecture models.

**Conformer-1**, released in March 2023, was built on Google Brain's 2020 Conformer architecture (a hybrid of [Transformer](/wiki/transformer) and convolutional neural network designs) as modified by the Efficient Conformer paper. AssemblyAI trained it on 650,000 hours of audio using noisy student-teacher training, a technique in which a teacher model generates pseudo-labels on unlabeled audio and a student model trains on both labeled and pseudo-labeled data. AssemblyAI claimed that Conformer-1 made up to 43% fewer errors than comparable models on noisy audio.[24]
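
The teacher-student loop described above can be illustrated with a toy sketch. This is not AssemblyAI's training pipeline: a 1-D nearest-centroid classifier stands in for the speech model, and the function names, noise level, and data are all assumptions made for illustration.

```python
import random

def fit_centroids(points, labels):
    """Train a 1-D nearest-centroid classifier: one mean per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))

def noisy_student_round(labeled, unlabeled, noise=0.1, seed=0):
    """One teacher-to-student round of pseudo-label training (toy version)."""
    rng = random.Random(seed)
    xs, ys = zip(*labeled)
    teacher = fit_centroids(xs, ys)
    # The teacher labels the unlabeled pool; these are the pseudo-labels.
    pseudo = [(x, predict(teacher, x)) for x in unlabeled]
    # The student trains on labeled + pseudo-labeled data with noised inputs.
    noised = [(x + rng.gauss(0.0, noise), y) for x, y in list(labeled) + pseudo]
    sx, sy = zip(*noised)
    student = fit_centroids(sx, sy)
    return teacher, student

labeled = [(0.0, "a"), (0.2, "a"), (1.0, "b"), (1.2, "b")]
unlabeled = [0.1, 0.3, 0.9, 1.1, 1.3, -0.1]
teacher, student = noisy_student_round(labeled, unlabeled)
```

In a real system the student would typically become the teacher for a further round, and the noise would be audio augmentation rather than Gaussian jitter on a scalar.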

**Conformer-2**, released later in 2023, scaled the training dataset to 1.1 million hours of English audio and increased model parameters to 450 million. It used an ensemble of teacher models during training, a variance-reduction technique that produced more robust pseudo-labels. Compared to Conformer-1, it improved alphanumeric recognition by 31.7%, proper noun accuracy by 6.8%, and noise robustness by 12.0%.[25]
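
The source does not say how the teacher outputs were combined. As a minimal sketch of why an ensemble can yield steadier pseudo-labels, the example below uses plain majority voting, which is an assumed stand-in; the teacher callables and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter

def ensemble_pseudo_label(teachers, sample, min_agreement=0.5):
    """Majority-vote pseudo-label from several teacher callables.

    Returns (label, agreement). The label is None when the winning
    label's vote share is not above min_agreement.
    """
    votes = Counter(teacher(sample) for teacher in teachers)
    label, count = votes.most_common(1)[0]
    agreement = count / len(teachers)
    return (label if agreement > min_agreement else None), agreement

# Three hypothetical teachers transcribing the same clip.
teachers = [
    lambda clip: "call me at five",
    lambda clip: "call me at five",
    lambda clip: "call me at nine",
]
label, agreement = ensemble_pseudo_label(teachers, "clip-001")
```

Dropping samples on which the teachers disagree is one simple way to keep a single teacher's mistake from reaching the student.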

## Funding history

### How much funding has AssemblyAI raised?

AssemblyAI has raised approximately $115 million across four funding rounds: a 2017 seed round of about $1.2 million followed by three institutional rounds between March 2022 and December 2023.[31] The $50 million Series C, announced in December 2023, brought the company's cumulative funding to roughly $115 million, of which the company said about 90% had been raised in the preceding 22 months.[4][3]

| Round | Date | Amount | Lead investor |
|---|---|---|---|
| Seed | Sept 2017 | ~$1.2M | Y Combinator |
| Series A | March 2022 | $28M | Accel |
| Series B | July 2022 | $30M | Insight Partners |
| Series C | December 2023 | $50M | Accel |

### Series A (March 2022): $28 million

In March 2022, AssemblyAI closed a $28 million Series A led by [Accel](/wiki/accel). Co-investors included [Y Combinator](/wiki/y_combinator), John and Patrick Collison (founders of Stripe), [Nat Friedman](/wiki/nat_friedman) (former CEO of GitHub), and [Daniel Gross](/wiki/daniel_gross) (founder of Pioneer and one of the company's original backers). The round was reported in TechCrunch on March 4, 2022.[1]

At the time of the announcement, Fox described the company as processing one million audio streams per day and having hundreds of paying customers. Stated uses of the funding included hiring, GPU infrastructure expansion (including over $1 million in Nvidia A100 servers), and product development.[1]

### Series B (July 2022): $30 million

Four months after the Series A, AssemblyAI raised a $30 million Series B led by [Insight Partners](/wiki/insight_partners) in July 2022, with participation from Accel and Y Combinator.[2] The rapid back-to-back rounds reflected the accelerating interest in AI developer tools during 2022 and the company's intent to aggressively scale model training infrastructure. CEO Fox described the company's ambition as building the "Stripe for AI models," offering developers access to frontier AI capabilities through simple APIs the way Stripe had simplified payment processing.[2]

### Series C (December 2023): $50 million

AssemblyAI closed a $50 million Series C on December 3, 2023, again led by Accel.[4] Co-investors included Insight Partners, Y Combinator, Nat Friedman, Daniel Gross, Smith Point Capital, and Keith Block, the former co-CEO of Salesforce. TechCrunch covered the round on December 4, 2023.[3]

At the time of the Series C, AssemblyAI reported:
- 4,000 paying brands, up 200% year over year[3]
- Approximately 25 million inference API calls served per day[4]
- Over 200,000 developers on the platform[3]
- More than 10 terabytes of voice data processed daily[4]
- Over 10,000 new organizations signing up each month[4]
- 115 employees, with plans to grow headcount by 50 to 75 percent[27]

The company stated its intent to train a universal speech model on more than a petabyte of voice data, which subsequently became Universal-1.[4]

### What is AssemblyAI's valuation?

The $50 million Series C in December 2023 brought total disclosed funding to approximately $115 million.[31][3] Third-party private-market databases including Tracxn and Get Latka have reported a valuation of approximately $300 million and annual recurring revenue of roughly $10.4 million as of 2024; AssemblyAI has not publicly confirmed these figures.[31]

## Model family

### Universal-1 (April 2024)

AssemblyAI released Universal-1 on April 12, 2024 as its most capable model to date.[5] The model was trained on over 12.5 million hours of multilingual audio, encompassing non-native speakers, heavy background noise, multi-speaker conversations, and diverse recording conditions. Universal-1 launched with support for English and Spanish, with German and French added in subsequent weeks.[6]

Key technical characteristics of Universal-1:

- **Accuracy**: Claimed 10% or greater improvement in English, Spanish, and German word error rate compared to the next-best commercial speech-to-text system tested at launch. Claimed more than 22% accuracy improvement over speech-to-text APIs from Azure, AWS, and Google on internal benchmarks.[5]
- **Hallucinations**: 30% reduction in hallucination rate over [Whisper](/wiki/whisper) Large-v3.[21]
- **Speed**: Processes one hour of audio in approximately 38 seconds on AssemblyAI's infrastructure, representing approximately a 5x speed improvement over Whisper Large-v3 on equivalent hardware.[5]
- **Timestamps**: 13% improvement in word-level timestamp accuracy over Conformer-2.[5]
- **Speaker diarization**: 14% improvement in concatenated minimum-permutation WER for speaker diarization.[5]
- **Code-switching**: Ability to transcribe multiple languages within a single audio file without requiring the user to specify language changes.
- **Human preference**: Human evaluators preferred Universal-1 outputs over Conformer-2 outputs 71% of the time when they expressed a preference.[5]
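
Several of the figures above are stated in terms of word error rate (WER). As a reference point, the sketch below computes WER with a word-level edit distance, plus a helper for relative reduction. Reading the percentages as relative reductions (for example, 10.0% WER falling to 9.0%) is an assumption here; the function names are illustrative and this is not AssemblyAI's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional drop in an error rate from baseline to improved."""
    return (baseline - improved) / baseline

# One substitution (fox -> box) and one deletion (jumps) over 5 reference words.
wer = word_error_rate("the quick brown fox jumps", "the quick brown box")
```

Production benchmarks usually normalize casing, punctuation, and number formats before scoring, which this sketch omits.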

Universal-1 was built on research published in the paper "Anatomy of Industrial Scale Multilingual ASR" (arXiv:2404.09841), which described the training methodology and architecture decisions in detail.[7]

### Universal-2 (October 2024)

AssemblyAI released Universal-2 in October 2024, with accompanying research published at `assemblyai.com/research/universal-2`.[8] The model built on Universal-1 and targeted three specific weaknesses identified in production deployments: proper noun recognition, alphanumeric formatting, and general text formatting.

Key improvements over Universal-1:

- **Overall accuracy**: 3% improvement in word accuracy over Universal-1; 15% improvement over the next-best system tested (combining commercial providers and open-source models) as of October 2024.[10]
- **Proper nouns**: 24% improvement in recognition of names, brand names, and locations, measured with AssemblyAI's proper noun error rate (PNER) metric.[10]
- **Text formatting**: 15% improvement in formatting accuracy, producing more immediately readable and actionable output.[10]
- **Alphanumerics**: 21% improvement in accuracy on phone numbers, zip codes, and other numerical identifiers.[10]

Universal-2 also incorporated Universal-2-TF, a two-stage neural text formatting model described in a separate research paper.[9] The system combines a token classification approach (for punctuation and capitalization) with a sequence-to-sequence approach (for complex normalization) to handle text formatting as a learned neural task rather than a rule-based post-processing step. This architecture allowed formatting to be handled at the speed of real-time transcription rather than as a separate offline step.
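
To make the first stage concrete, the sketch below shows how per-token labels of the kind a token-classification model emits could be applied to raw lowercase tokens. The label scheme (`CAP`, `UPPER`, `NONE`, plus a trailing punctuation string) is hypothetical and hand-supplied here; the source does not give the model's actual label set, and the sequence-to-sequence normalization stage is not shown.

```python
def apply_token_labels(tokens, labels):
    """Apply (capitalization, trailing punctuation) labels to lowercase tokens."""
    out = []
    for token, (cap, punct) in zip(tokens, labels):
        if cap == "UPPER":
            token = token.upper()
        elif cap == "CAP":
            token = token.capitalize()
        out.append(token + punct)
    return " ".join(out)

# Labels a classifier might predict for this token sequence (assumed values).
tokens = ["hello", "dylan", "welcome", "to", "nasa"]
labels = [("CAP", ","), ("CAP", "."), ("CAP", ""), ("NONE", ""), ("UPPER", ".")]
formatted = apply_token_labels(tokens, labels)
```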

Universal-2 launched at $0.15 per hour for batch transcription, supporting 99 languages with automatic language detection and automatic code-switching between English and other languages.[11] By October 2025, AssemblyAI had cut speaker-counting errors by 64% on mid-to-long-duration audio files and expanded keyterm support to 200 terms.[15]

### Universal-3 Pro (February 2026)

AssemblyAI introduced Universal-3 Pro on February 3, 2026, describing it as "a first of its kind promptable speech language model" and the first production-quality speech model to accept natural language prompts for controlling transcription behavior.[14] The model was designed to reduce or eliminate post-processing pipelines that developers previously needed to build on top of raw transcripts.

Universal-3 Pro features:

- **Natural language prompting**: Developers can supply a text prompt describing desired transcription behavior, such as formatting conventions, domain vocabulary, or output style, without writing custom post-processing code.[14]
- **Keyterm prompting**: Up to 45% accuracy improvement on domain-specific vocabulary terms; up to 1,000 custom terms per request.[14]
- **Promptable speaker diarization**: Speaker identification and labeling can be controlled through prompts, improving accuracy in use cases where speaker roles or names are known in advance.[14]
- **Audio event tagging**: Detection and labeling of non-speech audio events.[14]
- **Disfluency control**: Configurable output between verbatim transcription (including filler words) and cleaned transcription.[14]
- **Code-switching**: Native support for bilingual conversations in English, Spanish, German, French, Portuguese, and Italian.[14]
- **Language coverage**: 6 native high-accuracy languages, with 99-language fallback routing via Universal-2.[14]

Universal-3 Pro is priced at $0.21 per hour, described by AssemblyAI as 35 to 50% lower cost than competing solutions.[14] It achieved the lowest word error rate on real-world data on AssemblyAI's internal benchmarks across call center, medical, and multi-speaker recordings.[14]
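
The native-versus-fallback language split listed above amounts to a routing decision. A minimal sketch of that decision follows; the language codes and model labels are illustrative placeholders rather than documented API identifiers, and the real routing happens on AssemblyAI's side.

```python
# Six languages the article lists as native to Universal-3 Pro (ISO 639-1 codes assumed).
NATIVE_U3_PRO = {"en", "es", "de", "fr", "pt", "it"}

def pick_model(language_code: str) -> str:
    """Route to Universal-3 Pro for a native language, otherwise to Universal-2."""
    code = language_code.strip().lower().split("-")[0]  # "pt-BR" -> "pt"
    return "universal-3-pro" if code in NATIVE_U3_PRO else "universal-2"
```

For example, `pick_model("pt-BR")` selects the Universal-3 Pro label, while `pick_model("ja")` falls back to Universal-2.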

### Slam-1 (April 2025)

AssemblyAI announced Slam-1 in March 2025 and released it to public beta on April 23, 2025.[12] Slam stands for Speech Language Model. The model represents a different architectural approach from the Universal family: rather than a dedicated ASR model that converts audio to text, Slam-1 combines an audio encoder with a [large language model](/wiki/large_language_model) decoder, allowing the system to apply genuine language understanding to transcription rather than pattern-matching against training distributions.[13]

AssemblyAI described Slam-1 as the most powerful prompt-based Speech Language Model available at its launch. Key characteristics:

- **Architecture**: Multi-modal, processing audio and language simultaneously through a combined neural architecture rather than a pipeline of separate components.[29]
- **Customization**: Users can supply up to 1,000 domain-specific terms through natural language prompts, improving recognition of specialized vocabulary without custom model fine-tuning.[12]
- **Contextual understanding**: The model comprehends the semantic meaning of provided terminology and applies that understanding to recognize related variations and edge cases, not just exact term matches.[29]
- **Accuracy claims**: In side-by-side blind tests, two-thirds of human evaluators preferred Slam-1 transcripts over Universal model transcripts for accuracy, readability, and formatting. AssemblyAI claimed a 72% human preference rating over [Deepgram's](/wiki/deepgram) Nova-3 at launch.[12]
- **Error reductions**: Compared to Universal-2, Slam-1 reduced errors on alphanumerics by 12%, addresses by 41%, email addresses by 37%, numerical values by 25%, and formatting by 27%.[12]
- **Word error rate**: Approximately 7% WER on diverse test datasets.[12]
- **Language**: English only at public beta launch.[12]
- **Integration**: Supports speaker diarization, word-level timestamps, and multichannel transcription.[29]
- **Pricing**: $0.37 per hour at public beta launch.[12]

Subsequent updates in October 2025 improved Slam's accuracy by up to 57% on critical terms and expanded context-aware key term prompting to 1,000 words, with pricing adjusted to $0.27 per hour.[15] The same October 2025 update introduced intelligent model fallback, allowing developers to specify Slam-1 as the primary model with automatic fallback to Universal-2 for audio in languages Slam-1 does not support.[15]

### How do AssemblyAI's models compare?

| Model | Released | Training data / basis | Languages | Launch price | Headline claim |
|---|---|---|---|---|---|
| Conformer-1 | March 2023 | 650K hours | English | n/a | Up to 43% fewer errors on noisy audio [24] |
| Conformer-2 | 2023 | 1.1M hours | English | n/a | 450M params; +31.7% alphanumeric accuracy [25] |
| Universal-1 | April 2024 | 12.5M+ hours | EN/ES/DE/FR | n/a | 30% fewer hallucinations vs Whisper Large-v3 [5][21] |
| Universal-2 | October 2024 | builds on U-1 | 99 | $0.15/hr | +24% proper-noun accuracy over U-1 [10][11] |
| Slam-1 | April 2025 | LLM-decoder | English | $0.37/hr | ~7% WER; 72% preference vs Nova-3 [12] |
| Universal-3 Pro | February 2026 | speech-only | 6 native, 99 fallback | $0.21/hr | Promptable; lowest real-world WER on internal benchmark [14] |

## What is LeMUR? The LLM framework

AssemblyAI introduced LeMUR (Leveraging Large Language Models to Understand Recognized Speech) as an early-access product and later rebranded and expanded it into the LLM Gateway.[16] The framework solves a specific integration problem: applying [large language models](/wiki/large_language_models) to audio content requires first transcribing the audio, managing long transcripts that may exceed LLM context windows, and constructing effective prompts.

LeMUR and its successor LLM Gateway handle all of this plumbing. Developers submit an audio file URL or transcript ID along with an LLM prompt; the system transcribes the audio if needed, chunks and manages context, calls the specified LLM, and returns a structured response.[17] This allows a developer to, for example, ask "What were the three main action items from this one-hour meeting?" as a single API call without manually managing transcription, chunking, or LLM invocation.
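
The source does not describe how the chunking step is implemented. As a rough sketch of the idea, the function below greedily packs transcript sentences into chunks that fit a token budget. The whitespace word count is a stand-in for a real LLM tokenizer, and every name here is an assumption.

```python
def chunk_transcript(sentences, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily pack sentences into chunks that stay within max_tokens.

    count_tokens is a stand-in (whitespace word count), not a real tokenizer.
    A single sentence longer than max_tokens becomes its own oversized chunk.
    """
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks

sentences = [
    "Alice will send the budget by Friday.",  # 7 words
    "Bob owns the vendor call.",              # 5 words
    "Carol drafts the launch plan.",          # 5 words
]
chunks = chunk_transcript(sentences, max_tokens=12)
```

Each chunk could then be sent to the LLM separately and the partial answers merged, which is one common pattern for transcripts that exceed a context window.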

### LLM Gateway capabilities

The LLM Gateway, released in October 2025, expanded LeMUR into a unified API providing access to over 20 large language models from Anthropic ([Claude](/wiki/claude)), OpenAI (GPT), and Google (Gemini) through a single interface and billing relationship.[15] Key capabilities:

- **Scale**: Process over 200 hours of audio in a single API call; handles over 1 million tokens as input.[17]
- **Multi-file processing**: Analyze multiple audio files simultaneously in a single request.[17]
- **Long-form audio**: Supports transcripts up to 10 hours in duration, translating to approximately 150,000 LLM context tokens.[17]
- **Structured output**: Returns JSON-structured responses that can be directly consumed by downstream application logic.[17]
- **Model routing**: A single API endpoint routes requests to GPT, Claude, or Gemini models based on developer preference, with unified billing.[17]
- **Claude 4 integration**: As of May 2026, the LLM Gateway supports Claude 4.x models including Opus, Sonnet, and Haiku variants.[18]

### Audio Intelligence features

Built on top of transcription and LLM capabilities, AssemblyAI offers a set of pre-built audio intelligence features billed as add-ons per audio hour:

| Feature | Description | Price |
|---|---|---|
| Speaker Identification | Labels speakers by name using audio context | $0.02/hr |
| Sentiment Analysis | Per-sentence sentiment detection | $0.02/hr |
| Auto Chapters | Automatic segmentation and chapter titles | $0.08/hr |
| Summarization | LLM-powered summary of audio content | $0.03/hr |
| Entity Detection | Names, dates, organizations, locations | $0.08/hr |
| Key Phrases | Automatic extraction of key terms | $0.01/hr |
| Topic Detection | Classification into topic categories | $0.15/hr |
| Translation | Translation of transcript to another language | $0.06/hr |
| Custom Formatting | User-specified output format rules | $0.03/hr |

## Real-time and batch APIs

AssemblyAI offers two primary transcription modes that suit different application architectures and latency requirements.

### Batch transcription

Batch transcription accepts pre-recorded audio files through an HTTP POST request or URL submission. The API is asynchronous: the developer submits an audio file, receives a job ID, and polls for completion or registers a webhook to be notified when transcription is ready. Processing time varies with file duration and queue depth.
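
The submit-then-poll flow can be sketched as a small helper. To keep the example self-contained it takes a caller-supplied `fetch_job` function instead of making HTTP requests; the status strings and the `error` field are assumptions for illustration and should be checked against the API reference.

```python
import time

def wait_for_transcript(fetch_job, job_id, interval=3.0, timeout=600.0, sleep=time.sleep):
    """Poll fetch_job(job_id) until the job completes, fails, or times out.

    fetch_job is a caller-supplied function returning a dict with a "status"
    key; the status values checked here are assumed, not taken from the source.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        if job["status"] == "completed":
            return job
        if job["status"] == "error":
            raise RuntimeError(job.get("error", "transcription failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout}s")
        sleep(interval)

# Fake backend for demonstration: the job completes on the third poll.
_responses = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "completed", "text": "hello world"},
])
result = wait_for_transcript(lambda _id: next(_responses), "job-123", sleep=lambda s: None)
```

Registering a webhook avoids the polling loop entirely, at the cost of exposing an endpoint that can receive the completion callback.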

Batch transcription supports parallel processing of large file volumes. The Python SDK includes a built-in batch processing method that submits files concurrently and collects results. Developers can process entire audio libraries simultaneously, with total elapsed time determined by the longest single file rather than the sum of all file durations.
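
The elapsed-time claim can be illustrated without the SDK. The sketch below runs simulated jobs on a thread pool; `fake_transcribe` is a hypothetical stand-in that only sleeps, and the sketch assumes the service imposes no queueing delay of its own.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def transcribe_all(files, transcribe_one, max_workers=8):
    """Run transcribe_one over files concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transcribe_one, files))

def fake_transcribe(item):
    """Stand-in for a real API call: sleeps for the job's simulated duration."""
    name, seconds = item
    time.sleep(seconds)
    return f"{name}: done"

jobs = [("a.mp3", 0.1), ("b.mp3", 0.3), ("c.mp3", 0.2)]
start = time.perf_counter()
results = transcribe_all(jobs, fake_transcribe)
# Elapsed time tracks the slowest job (about 0.3 s), not the 0.6 s sum.
elapsed = time.perf_counter() - start
```

Threads suit this workload because each job spends its time waiting on the network rather than using the CPU.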

Batch mode is appropriate for post-call analysis, podcast transcription, media monitoring, video subtitle generation, and similar workflows where a latency of seconds to minutes is acceptable.

### Streaming transcription

AssemblyAI's streaming API returns partial transcripts within approximately 300 milliseconds (P50) over a persistent WebSocket connection. The Universal-Streaming model supports:

- **Immutable transcripts**: Unlike some streaming systems that continuously revise previous words, Universal-Streaming emits final, non-revisable word sequences, simplifying downstream processing.[13]
- **Intelligent endpointing**: Uses both acoustic cues (pauses, voice activity) and semantic cues (sentence completion) to determine phrase boundaries.[13]
- **Word-level timestamps and confidence scores**: Each word includes a start time, end time, and confidence value.
- **Keyterms Prompting** (English only): Boosts recognition probability for a list of domain-specific terms provided at session initialization.
- **Unlimited concurrent streams**: No hard cap on the number of simultaneous WebSocket connections.
- **Multilingual streaming**: As of October 2025, the Universal-Streaming-Multilingual model supports English, Spanish, French, German, Italian, and Portuguese in streaming mode with automatic language detection.[15]

Universal-3 Pro Streaming, released alongside the batch model in February 2026, adds natural language prompting to streaming transcription at a price of $0.45 per hour.[14]

Streaming is appropriate for voice agents, real-time captions, live customer service call analysis, and interactive voice interfaces where sub-second latency matters.

## Pricing

AssemblyAI uses a consumption-based pricing model. All tiers include $50 in free credits with no credit card required at signup; at the rates listed below, that covers roughly 333 hours of Universal-2 transcription ($0.15 per hour) or about 185 hours of Slam-1 ($0.27 per hour). Enterprise customers can negotiate volume discounts and receive access to dedicated support.[20]

### Speech-to-Text

| Model | Price | Notes |
|---|---|---|
| Universal-3 Pro | $0.21/hr | 6 languages natively; prompting included |
| Universal-2 | $0.15/hr | 99 languages; standard batch transcription |
| Slam-1 (Beta) | $0.27/hr | English only; prompt-based customization |

### Streaming Speech-to-Text

| Model | Price | Notes |
|---|---|---|
| Universal-3 Pro Streaming | $0.45/hr | Promptable; 6 languages |
| Universal-Streaming | $0.15/hr | English; immutable transcripts |
| Universal-Streaming Multilingual | $0.15/hr | English + Spanish, French, German, Italian, Portuguese |
| Whisper-Streaming | $0.30/hr | OpenAI Whisper via AssemblyAI infrastructure |

### Add-ons (batch and streaming)

| Feature | Price |
|---|---|
| Keyterms Prompting | $0.04-0.05/hr |
| Speaker Diarization (batch) | $0.02/hr |
| Speaker Diarization (streaming) | $0.12/hr |
| Medical Mode | $0.15/hr |
| PII Audio Redaction | $0.05/hr |
| PII Text Redaction | $0.08/hr |
| Content Moderation | $0.15/hr |
| Profanity Filtering | $0.01/hr |

### Voice Agent API

| Product | Price |
|---|---|
| Voice Agent API | $4.50/hr ($0.075/min) |

### LLM Gateway (per million tokens)

| Model | Input | Output |
|---|---|---|
| GPT-5.5 | $5.00 | $30.00 |
| GPT-5.2 | $1.75 | $14.00 |
| Claude 4.7 Opus | $5.50 | $27.50 |
| Claude 4.6 Sonnet | $3.00 | $15.00 |
| Gemini 3 Flash | $0.50 | $3.00 |
| Gemini 2.5 Flash | $0.30 | $2.50 |

## How does AssemblyAI compare with competitors?

AssemblyAI competes primarily with [Deepgram](/wiki/deepgram) (Nova-3 model), OpenAI's [Whisper](/wiki/whisper) and GPT-4o-Transcribe, Azure Cognitive Services Speech, Google Cloud Speech-to-Text, and AWS Transcribe. The following comparison reflects published benchmarks and documented pricing as of early 2026.

### Accuracy comparison

| Provider | Model | English WER | Notes |
|---|---|---|---|
| AssemblyAI | Universal-3 Pro | ~5.9% | Best on AssemblyAI's 80,000+ file benchmark [19] |
| OpenAI | GPT-4o-Transcribe | ~6.5% | AssemblyAI's benchmark; third parties report lower WER in some tests [19] |
| ElevenLabs | Scribe | ~6.5% | Per AssemblyAI's benchmark [19] |
| Microsoft | Azure Speech | ~7.5% | Per AssemblyAI's benchmark [19] |
| Amazon | Transcribe | ~7.6% | Per AssemblyAI's benchmark [19] |
| [Deepgram](/wiki/deepgram) | Nova-3 | ~8.1% | Per AssemblyAI's benchmark; Deepgram's own data shows sub-7% on batch [19] |

Note: All accuracy figures above are from AssemblyAI's own benchmarks using 250+ hours of audio across 80,000+ files from 26 datasets.[19] Independent third-party benchmarks and each provider's own benchmarks yield different figures. Deepgram's published benchmarks report Nova-3 achieving a median WER of approximately 5.26% on batch audio across 2,703 production audio files. OpenAI's internal data shows GPT-4o-Transcribe achieving approximately 2.46% WER on its test sets. Benchmark methodology, dataset selection, and evaluation conditions materially affect published WER numbers, and results on production audio with noise, accents, or domain jargon typically differ significantly from clean benchmark results.

### Feature and pricing comparison

| Dimension | AssemblyAI | [Deepgram](/wiki/deepgram) | [Whisper](/wiki/whisper) (OpenAI) |
|---|---|---|---|
| Base price (batch) | $0.15-0.21/hr | ~$0.21/hr (Nova-3) | $0.006/min (~$0.36/hr) |
| Streaming | Yes ($0.15-0.45/hr) | Yes | Limited (gpt-4o-realtime) |
| Speaker diarization | Add-on | Add-on | No (base model) |
| LLM integration | Yes (LLM Gateway) | Limited | Via API (GPT-4o) |
| Audio intelligence | Extensive add-on suite | Moderate | None built-in |
| On-premise deployment | No (cloud only) | Yes (self-hosted option) | Yes (self-hosted) |
| Language support | 99 (Universal-2) | 36+ | 100+ (Whisper Large-v3) |
| Hallucination rate | Lower than Whisper | Comparable | Higher (Whisper Large-v3) |
| Prompting/customization | Yes (Slam-1, U3 Pro) | Yes (Nova-3) | Limited |
| SOC 2 Type 2 | Yes | Yes | Via OpenAI |

**Key differentiators by provider:**

**AssemblyAI** offers the most complete audio intelligence stack of the API providers, combining transcription with an extensive set of pre-built LLM-powered features (sentiment, summarization, entity detection, topic classification) and the LLM Gateway for custom analysis. Universal-3 Pro's natural language prompting reduces the post-processing engineering required for specialized use cases. The primary tradeoffs are cloud-only deployment, add-on pricing that can accumulate, and English-only support for the Slam-1 model.

**[Deepgram Nova-3](/wiki/deepgram_nova_3)** emphasizes low latency and high throughput in streaming scenarios. Deepgram publishes strong benchmark results of its own and is frequently cited by developers in high-volume voice agent applications for its speed and reliability at scale. Deepgram also supports on-premise deployment for customers with data residency requirements.

**[Whisper](/wiki/whisper)** (OpenAI) is available as both a cloud API and a self-hosted open-source model. The open-source availability makes it the default choice for teams requiring on-premise deployment or needing to avoid per-call API costs at very high volume. Whisper's hallucination rate on longer audio is a widely cited limitation. OpenAI's GPT-4o-Transcribe model improves accuracy significantly over Whisper Large-v3 but is priced at $0.006 per minute ($0.36/hour), roughly 2.4 times the cost of AssemblyAI's Universal-2 at $0.15 per hour.

## Customers and use cases

AssemblyAI's customer base spans media, technology, healthcare, finance, and enterprise software sectors. Documented customers include Spotify, NASA, the Wall Street Journal, NBCUniversal, CallRail, Loop Media, and Fireflies.[28] As of the Series C in December 2023, the company reported 4,000 paying brands.[3]

### Documented use cases

**Conversation intelligence**: Sales and support call centers use AssemblyAI to transcribe and analyze recorded calls. Post-call analytics platforms layer sentiment analysis, entity detection, and automatic summarization to surface coaching signals and compliance flags. Siro, a sales coaching platform, reported a 90% reduction in support tickets after deploying AssemblyAI.

**Meeting transcription and summaries**: Video conferencing and productivity tools use the batch API to generate searchable transcripts and LLM-powered summaries of recorded meetings. Fireflies, a meeting intelligence platform, is a documented customer.

**Podcast and media production**: Media companies and podcast platforms use AssemblyAI to automatically generate subtitles, transcripts for search indexing, and chapter markers. NBCUniversal and Wall Street Journal use AssemblyAI for broadcast media processing.

**Voice agents**: Real-time transcription APIs enable voice-driven AI agents to convert user speech to text with sub-second latency. The October 2025 Voice Agent API product packaged these capabilities with billing optimized for interactive voice applications.[15]

**Healthcare documentation**: The Medical Mode add-on ($0.15/hr on top of base transcription) improves recognition of clinical terminology, drug names, and medical codes. Healthcare platforms use AssemblyAI for ambient clinical documentation, reducing the time physicians spend on documentation after patient visits.

**Qualitative research**: Market research and UX research platforms use AssemblyAI to transcribe user interviews at scale. One documented qualitative data-analysis platform reported a 60% reduction in time spent analyzing data after integrating AssemblyAI.

**Hiring and talent**: Hiring intelligence platforms use speech AI to transcribe and analyze recorded candidate interviews. One such platform reported a 90% reduction in time spent on manual interview review tasks.

## Security and compliance

### Is AssemblyAI SOC 2 compliant?

AssemblyAI holds SOC 2 Type 2 certification, audited in 2022-2023 and maintained since.[23] The certification verifies that AssemblyAI's security controls meet AICPA standards for availability, confidentiality, and processing integrity on a continuous basis.

Data-in-transit is encrypted with TLS 1.3 by default. Data at rest is encrypted with AES-256. AssemblyAI offers EU data residency, allowing customers in regulated industries to store and process data entirely within the European Union rather than the United States. The company does not store audio files or transcripts beyond the processing period unless customers explicitly enable storage.

Enterprise customers can purchase Premier Support, which provides access to dedicated AI specialists and engineers, faster response times, and proactive guidance on model selection and integration patterns.[22]

## Limitations

**Cloud-only deployment**: AssemblyAI provides no on-premise or private-cloud deployment option. All audio is processed on AssemblyAI's infrastructure. This is a blocking constraint for organizations with strict data residency requirements outside the US/EU, classified data environments, or air-gapped deployments.

**Slam-1 language coverage**: The Slam-1 model, which offers the most powerful customization and highest accuracy for English, supports only English as of its public beta. Multilingual workloads must use Universal-2 or Universal-3 Pro.

**Streaming latency at enterprise scale**: While Universal-Streaming achieves approximately 300 ms P50 latency for typical use cases, network latency and load variability can produce perceptible delays that affect low-latency voice agent applications at enterprise scale.

**Benchmark methodology**: AssemblyAI's published accuracy benchmarks use internally selected test sets, and independent evaluations have produced different rankings depending on dataset composition. For every provider, real-world production audio with heavy accents, overlapping speakers, or specialized jargon consistently yields a higher word error rate (WER) than published benchmark results.
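
Because benchmark rankings depend on the test set, teams often measure WER on their own audio. The following is a minimal sketch of the standard definition, the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name and the simple lowercase-and-split normalization are assumptions for illustration; published benchmarks usually apply heavier text normalization, so results will not be directly comparable.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Running the same function over a human-verified sample of production recordings for each candidate provider gives a like-for-like comparison that vendor benchmarks cannot.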

**Add-on pricing accumulation**: The base transcription price of $0.15-0.21/hr is competitive, but production deployments typically require speaker diarization, summarization, PII redaction, and other add-ons that can bring the effective hourly cost to $0.35-0.50/hr or more for feature-rich applications.
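
The accumulation is simple addition, sketched below so that a team can plug in current figures from the pricing page. Only the $0.15/hr base rate comes from the range quoted above; the per-feature add-on rates are hypothetical placeholders, not AssemblyAI's price list, and the function names are illustrative.

```python
from decimal import Decimal


def effective_hourly_rate(base: Decimal, addons: dict[str, Decimal]) -> Decimal:
    """Base transcription rate plus every enabled add-on, in USD per audio hour."""
    return base + sum(addons.values(), Decimal("0"))


def monthly_cost(audio_hours: int, base: Decimal, addons: dict[str, Decimal]) -> Decimal:
    """Total cost for a given number of audio hours at the effective rate."""
    return Decimal(audio_hours) * effective_hourly_rate(base, addons)


BASE_RATE = Decimal("0.15")  # low end of the base range quoted in the text

# HYPOTHETICAL placeholder rates for illustration only.
ADDONS = {
    "speaker_diarization": Decimal("0.05"),
    "summarization": Decimal("0.10"),
    "pii_redaction": Decimal("0.10"),
}

print(effective_hourly_rate(BASE_RATE, ADDONS))
print(monthly_cost(1000, BASE_RATE, ADDONS))
```

`Decimal` is used instead of floats so that summed currency amounts stay exact.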

**Support responsiveness at scale**: Developer community feedback has noted that support ticket response times can be slow for teams running high-volume production workloads on the standard pricing tier. AssemblyAI's Premier Support tier addresses this with dedicated support personnel, but it requires a separate enterprise agreement.

**No HIPAA Business Associate Agreement disclosed**: As of the time of writing, AssemblyAI has not publicly documented a HIPAA Business Associate Agreement (BAA) offering. Healthcare customers with HIPAA obligations should confirm compliance requirements directly with AssemblyAI before deploying.

## See also

- [Deepgram Nova-3](/wiki/deepgram_nova_3)
- [Whisper (speech recognition)](/wiki/whisper)
- [Voice AI](/wiki/voice_ai)
- [Automatic speech recognition](/wiki/automatic_speech_recognition)
- [Speech recognition](/wiki/speech_recognition)
- [Large language models](/wiki/large_language_models)
- [Y Combinator](/wiki/y_combinator)

## References

1. TechCrunch. "Assembly AI snags $28M for all-in-one API to transcribe, summarize and moderate audio." March 4, 2022. https://techcrunch.com/2022/03/04/assembly-ai-snags-28m-for-all-in-one-api-to-transcribe-summarize-and-moderate-audio/
2. TechCrunch. "With new cash, AssemblyAI looks to grow its AI as a service." July 14, 2022. https://techcrunch.com/2022/07/14/flush-with-new-cash-assemblyai-looks-to-grow-its-ai-as-a-service-business/
3. TechCrunch. "AssemblyAI lands $50M to build and serve AI speech models." December 4, 2023. https://techcrunch.com/2023/12/04/assemblyai-nabs-50m-to-build-and-serve-ai-speech-models/
4. AssemblyAI Blog. "Announcing our $50M Series C to build superhuman Voice AI models." December 3, 2023. https://www.assemblyai.com/blog/announcing-our-50m-series-c-to-build-superhuman-speech-ai-models
5. AssemblyAI Blog. "Introducing Universal-1." April 2024. https://www.assemblyai.com/blog/announcing-universal-1-speech-recognition-model
6. AssemblyAI Research. "Universal-1: Robust and accurate multilingual speech-to-text." https://www.assemblyai.com/research/universal-1
7. arXiv. "Anatomy of Industrial Scale Multilingual ASR." 2024. https://arxiv.org/html/2404.09841v1
8. AssemblyAI. "Introducing Universal-2." https://www.assemblyai.com/universal-2
9. AssemblyAI Research. "Universal-2-TF: Robust All-Neural Text Formatting for ASR." https://www.assemblyai.com/research/universal-2
10. AssemblyAI Blog. "Beyond Word Error Rate: Universal-2 Delivers Accuracy Where It Matters." https://www.assemblyai.com/blog/universal-2-delivers-accuracy-where-it-matters
11. MarkTechPost. "Assembly AI Introduces Universal-2: The Next Leap in Speech-to-Text Technology." November 9, 2024. https://www.marktechpost.com/2024/11/09/assembly-ai-introduces-universal-2-the-next-leap-in-speech-to-text-technology/
12. AssemblyAI Blog. "Slam-1 now in public beta." April 23, 2025. https://www.assemblyai.com/blog/slam-1-public-beta
13. AssemblyAI Blog. "Raising the Bar for Speech AI: Introducing Slam-1 & a New Streaming Model." https://www.assemblyai.com/blog/speech-language-model-and-improved-streaming-model
14. AssemblyAI Blog. "Introducing Universal-3 Pro: A new class of speech language model optimized for Voice AI." February 3, 2026. https://www.assemblyai.com/blog/introducing-universal-3-pro
15. AssemblyAI Blog. "AssemblyAI's October 2025 releases: Multilingual streaming, guardrails, and LLM gateway." https://www.assemblyai.com/blog/assemblyai-october-2025-releases
16. AssemblyAI Blog. "LeMUR: Now Available for Early Access." https://www.assemblyai.com/blog/lemur
17. AssemblyAI Documentation. "Apply LLM Gateway to Audio Transcripts." https://www.assemblyai.com/docs/lemur/apply-llms-to-audio-files
18. AssemblyAI Blog. "Claude 4 models now available through our LeMUR API." https://www.assemblyai.com/blog/claude-4-models-now-available-through-our-lemur-api
19. AssemblyAI. "Benchmarks." https://www.assemblyai.com/benchmarks
20. AssemblyAI. "Pricing." https://www.assemblyai.com/pricing
21. VentureBeat. "Assembly AI claims its new Universal-1 model has 30% fewer hallucinations than Whisper." https://venturebeat.com/ai/assembly-ai-claims-its-new-universal-1-model-has-30-fewer-hallucinations-than-whisper
22. AssemblyAI Blog. "New for Enterprise: Improved Accuracy, Always-on Support, and SOC 2 Type 2." https://www.assemblyai.com/blog/new-for-enterprise-improved-accuracy-always-on-support-soc2-type2
23. AssemblyAI Blog. "AssemblyAI obtains SOC 2 Type 2 compliance for 2022/2023." https://www.assemblyai.com/blog/assemblyai-obtains-soc2-type-2-compliance-for-2022-2023
24. AssemblyAI Blog. "Conformer-1: A robust speech recognition model trained on 650K hours of data." https://www.assemblyai.com/blog/conformer-1
25. AssemblyAI Blog. "Conformer-2: a state-of-the-art speech recognition model trained on 1.1M hours of data." https://www.assemblyai.com/blog/conformer-2
26. Y Combinator on X. "In 2017, Dylan Fox started AssemblyAI." https://x.com/ycombinator/status/2029610582975136183
27. SiliconANGLE. "AssemblyAI raises $50M for its cloud-based AI speech models." December 4, 2023. https://siliconangle.com/2023/12/04/assemblyai-raises-50m-cloud-based-ai-speech-models/
28. Accel. "AssemblyAI." https://www.accel.com/relationships/assemblyai
29. AssemblyAI Documentation. "Introducing Slam-1." https://www.assemblyai.com/docs/getting-started/slam-1
30. AssemblyAI Blog. "Introducing new products and model updates." https://www.assemblyai.com/blog/introducing-new-products-and-model-updates
31. Tracxn. "AssemblyAI: 2026 Company Profile, Team, Funding & Competitors." https://tracxn.com/d/companies/assemblyai