Speechmatics
Last reviewed
May 16, 2026
Sources
28 citations
Review status
Source-backed
Revision
v1 · 3,491 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 16, 2026
Sources
28 citations
Review status
Source-backed
Revision
v1 · 3,491 words
Add missing citations, update stale details, or suggest a clearer explanation.
Speechmatics is a British artificial intelligence company that develops automatic speech recognition (ASR) and voice AI technology for enterprise customers. It was founded in 2006 in Cambridge, United Kingdom, by Dr. Tony Robinson, a pioneer of recurrent neural network research applied to speech recognition. Speechmatics is legally incorporated as Cantab Research Ltd., trading under the Speechmatics brand, and remains headquartered in Cambridge with offices in London, Brno (Czech Republic), and Chennai (India). The company is known for its accent-agnostic speech-to-text engines, its Ursa family of self-supervised models, the Flow voice agent API launched in 2024, and a benchmark in which its system outperformed Google, Amazon, Microsoft, IBM, and Apple at transcribing African American speakers. As of 2026, Speechmatics serves more than 170 enterprise customers and has raised approximately $90.6 million in total funding, including a $62 million Series B led by Susquehanna Growth Equity in June 2022.
Speechmatics was incorporated in 2006 under the name Cantab Research Ltd. The founder, Dr. Tony Robinson, completed an MPhil in Computer Speech and Language Processing at the University of Cambridge in 1985, followed by a PhD in the same field at Cambridge in 1989. During the late 1980s and 1990s, Robinson became one of the earliest researchers to apply recurrent neural networks (RNNs) to large vocabulary continuous speech recognition, publishing more than 100 papers across a long academic career. His doctoral and postdoctoral work at Cambridge helped seed a broader Cambridge speech research ecosystem, which also produced companies and projects associated with the HTK Toolkit and the Cambridge Engineering Department's speech group.
Robinson initially launched Cantab Research as a small commercial vehicle for licensing speech recognition technology to specialized customers in media monitoring, parliamentary archives, and broadcast transcription. For the first decade of its existence, the company operated as a self-funded, profitable engineering business, with the Speechmatics brand emerging publicly in the mid-2010s as a packaged speech-to-text product.
In 2019, after more than a decade of bootstrapped operation, Speechmatics raised a £6.35 million Series A round led by Albion Venture Capital (now AlbionVC), with participation from IQ Capital and Amadeus Capital Partners. The round was timed alongside the rapid commercialization of deep-learning ASR and the rising demand for accurate, language-diverse transcription in media, contact centers, and accessibility products. Around this period, the company received the Queen's Award for Enterprise in Innovation (2019) and was listed in the Financial Times FT 1000 ranking of Europe's fastest-growing companies for four consecutive years (2019 through 2022).
Katy Wigdahl, who joined Speechmatics in 2018, was appointed Chief Executive Officer in the run-up to the Series A. Tony Robinson moved into a Chief Scientist role and remained closely involved in research direction, while the operational executive team expanded to include a CFO, CRO, and CMO. By 2022, the company employed roughly 150 to 250 people across its global offices.
Speechmatics has raised approximately $90.6 million across five disclosed funding rounds, after operating as a self-funded business for its first 13 years. The 2022 Series B, led by Susquehanna Growth Equity, was structured to fund language expansion, on-premises product investment, and an expanded sales presence in North America.
| Round | Date | Amount | Lead investor | Notable participants |
|---|---|---|---|---|
| Seed / early | 2006 to 2018 | Self-funded / undisclosed | None (bootstrapped) | Customer revenue |
| Series A | June 2019 | £6.35 million (approx. $8 million) | AlbionVC | IQ Capital, Amadeus Capital Partners |
| Series B | June 2022 | $62 million (approx. £50 million) | Susquehanna Growth Equity | AlbionVC, IQ Capital |
| Subsequent (extension and venture debt) | 2023 to 2025 | Undisclosed top-ups | Various | Existing backers |
The Series B announcement on 28 June 2022 emphasized the company's mission of "understanding every voice" and its record of reducing accuracy gaps for non-mainstream accents. Susquehanna Growth Equity, the growth arm of Susquehanna International Group, had previously backed language-services companies including thebigword.
Since approximately 2019, Speechmatics has built its production systems on self-supervised learning (SSL). The company trains a large acoustic encoder on more than one million hours of unlabeled, multilingual audio drawn from podcasts, video, social media, and broadcast feeds. The SSL pretraining stage produces dense speech representations that capture phonetic, prosodic, and speaker-related information without requiring transcripts. A smaller volume of labeled audio is then used to fine-tune the encoder for each target language.
Speechmatics argues this approach yields two benefits over transcript-labeled training. It reduces dependence on costly labeled corpora, allowing high-accuracy models in low-resource languages. It also exposes the model to a far wider distribution of voices, including accents, dialects, and non-studio recording conditions, which improves accuracy for traditionally underserved speaker groups. The company has described scaling its SSL encoder to roughly 2 billion parameters and increasing the language model size by approximately 30x relative to earlier generations.
The Ursa engine, announced in March 2023, was Speechmatics' first production model at the multi-billion parameter scale. At launch, the company described Ursa as "the world's most accurate" speech-to-text engine and reported large reductions in word error rate (WER) on noisy benchmarks, children's voices, and speakers of African American Vernacular English. Ursa served both real-time and batch workloads from a single underlying model, with distilled variants for low-latency streaming.
Ursa 2, released in October 2024, extended the architecture to 50-plus languages. Speechmatics reported an average 18% reduction in WER versus the original Ursa, sub-one-second latency in real-time mode, and the addition of Irish and Maltese, completing coverage of the 24 official EU languages. Ursa 2 also upgraded the Global Arabic pack covering Modern Standard Arabic alongside Egyptian, Levantine, and Gulf dialects. By 2025, Ursa 2 powered all Speechmatics speech services, including the Flow voice agent API.
In July 2024, Speechmatics launched Flow, a voice agent API that combines the company's STT with pluggable large language models and third-party text-to-speech to enable real-time conversational voice applications. Flow provides turn detection, interruption handling, and barge-in functionality, and integrates with orchestration frameworks such as LiveKit, Pipecat, Vapi, and LangGraph. Customers can mix Speechmatics STT with their preferred LLM (for example, models from OpenAI, Anthropic, or self-hosted open-weights systems) and a TTS provider of their choice, including ElevenLabs and the company's own synthetic voices.
On top of raw transcription, Speechmatics offers a set of "speech intelligence" features that operate on the model's hypotheses, including:
The company emphasizes "hallucination-free" behavior for enterprise and medical use cases, noting that its models output only words derived from the audio signal rather than generating fluent guesses when the acoustic evidence is weak, a known failure mode of fully generative ASR systems such as Whisper.
Speechmatics offers a unified API surface with multiple deployment modes, which is a defining commercial differentiator relative to cloud-only competitors. Customers in regulated industries and those with strict data-residency requirements can run the same models inside their own infrastructure.
| Product | Description | Typical use cases |
|---|---|---|
| Real-Time STT (cloud) | Streaming transcription with sub-second latency, partial hypotheses, and live diarization | Live captioning, contact-center listening, conversational AI |
| Batch STT (cloud) | Asynchronous transcription of recorded audio with maximum accuracy settings | Media archives, compliance review, podcast publishing |
| On-premises Container | Same Ursa models packaged as containers for customer-managed Kubernetes or VM hosts | Defense, healthcare, finance, government, broadcasters |
| On-device / Edge | Optimized smaller models for embedded deployment | Automotive, in-car voice, broadcast appliances |
| Flow Voice Agent API | Composable STT plus LLM plus TTS pipeline for real-time agents | Voice bots, IVR replacement, in-app voice assistants |
| Translation | Real-time and batch speech translation across 69 language pairs | International media, multilingual contact centers |
| Speech Intelligence | Summaries, sentiment, topics, entities | Compliance, analytics, knowledge management |
Deployment is supported in any of the major public clouds, on private clouds via Azure Kubernetes Service and equivalent managed environments, on customer-owned hardware, and at the edge. The Speechmatics SaaS service is GDPR, SOC 2, and HIPAA aligned, with encryption in transit and at rest. The company's pricing combines per-hour usage tiers with enterprise contracts.
Speechmatics' multilingual coverage has grown alongside its model generations. As of 2026, the company markets support for more than 55 languages for transcription and 69 language pairs for translation.
| Language family | Examples | Notes |
|---|---|---|
| Global English | US, UK, Australian, Indian, Caribbean, African accents | Single "accent-agnostic" model rather than per-accent models |
| Global Spanish | Castilian, Mexican, Argentine, Caribbean | Unified model across major dialect groups |
| Major European | French, German, Italian, Portuguese, Dutch, Polish, Czech | All 24 official EU languages with Ursa 2 |
| Nordics and Baltics | Swedish, Norwegian, Danish, Finnish, Estonian, Latvian, Lithuanian | |
| Arabic | Modern Standard Arabic plus Egyptian, Levantine, Gulf | Global Arabic pack with Ursa 2 |
| East Asian | Mandarin Chinese, Japanese, Korean | |
| South and Southeast Asian | Hindi, Tamil, Bengali, Thai, Indonesian, Vietnamese | |
| Other | Turkish, Russian, Hebrew, Greek, Irish, Maltese, Ukrainian | Irish and Maltese added in 2024 |
A defining strand of Speechmatics' technical identity has been its emphasis on reducing demographic bias in speech recognition. In March 2020, researchers at Stanford published a paper in the Proceedings of the National Academy of Sciences (PNAS) titled "Racial disparities in automated speech recognition." The study tested commercial ASR systems from Amazon, Apple, Google, IBM, and Microsoft against the Corpus of Regional African American Language (CORAAL), finding that all five systems made roughly twice as many word errors on Black speakers as on white speakers.
Speechmatics was not part of the original Stanford evaluation, but in 2021 the company ran the same test data through its own engine and reported that it correctly transcribed 82.8% of words spoken by African American voices, compared with 68.6% for Google and Amazon, 73% for Microsoft, 62% for IBM, and 55% for Apple. The result, equivalent to a roughly 45% reduction in word errors relative to the best of the major cloud providers, was widely covered by CNBC, the Financial Times, and TechCrunch and became a central marketing message for the company through the 2022 Series B fundraise.
Speechmatics has attributed the gap to two design choices. First, the company curates training audio that intentionally includes a wide variety of accents, dialects, ages, and recording conditions rather than relying on broadcast-quality "reference" English. Second, its self-supervised pretraining pipeline ingests large quantities of unlabeled audio from podcasts, video sharing platforms, and social media, which together expose the model to a far broader speaker population than the small, professionally produced corpora historically used in ASR research.
The company has continued to publish updates on inclusion benchmarks, including a reported 91.8% accuracy on children's voices with Ursa 2 and further gains on African American Vernacular English with the Flow voice agent stack.
Because Speechmatics operates as a business-to-business platform, most of its end users encounter the technology indirectly via embedded products. Publicly disclosed direct customers include:
The Series B announcement in 2022 referenced approximately 170 customers globally, with subsequent updates indicating continued growth in voice-agent platform customers as Flow rolled out.
The global speech-to-text market in the mid-2020s is competitive and increasingly segmented between cloud hyperscaler APIs (AWS, Google, Microsoft), specialized voice-AI startups, open-source models, and consumer-facing transcription brands. Speechmatics is consistently grouped with Deepgram, AssemblyAI, and the OpenAI Whisper family as one of the leading independent, enterprise-grade ASR providers.
| Provider | HQ | Founded | Flagship STT engine | Notable strengths | Notable trade-offs |
|---|---|---|---|---|---|
| Speechmatics | Cambridge, UK | 2006 | Ursa 2 | Accent-agnostic English, on-prem and edge, 55+ languages, inclusion focus | Smaller developer brand than US peers |
| Deepgram | San Francisco, US | 2015 | Nova-3 | Very low latency, large dev ecosystem, drive-thru and call-center traction | Cloud-first; on-prem is enterprise-only |
| AssemblyAI | San Francisco, US | 2017 | Universal-2 | Strong speech-intelligence stack, hallucination-reduction focus | Cloud-only API |
| OpenAI Whisper | San Francisco, US | 2022 (model) | Whisper / gpt-4o-transcribe | Open weights for Whisper, multilingual breadth | Hallucination risk in noisy audio; generative behavior |
| ElevenLabs Scribe | London / NY | 2022 | Scribe | High accuracy on long-form audio; strong TTS pairing | Newer to STT; cloud-only |
| Otter.ai | Mountain View, US | 2016 | Proprietary | Consumer and meeting brand, integrations with Zoom | Less attractive as a developer API |
| Google Cloud STT | Mountain View, US | n/a | Chirp 2 | Global scale, broad language list | Higher accent error rates in CORAAL benchmarks |
| AWS Transcribe | Seattle, US | n/a | Proprietary | Deep AWS integration | Historically weaker on accents in independent tests |
| Microsoft Azure Speech | Redmond, US | n/a | Proprietary | Tight Office and Teams integration | Mixed performance on non-broadcast English |
Word error rate (WER) is the dominant accuracy metric in commercial ASR, measured as the percentage of substitutions, insertions, and deletions required to align a model hypothesis with a human reference transcript. WER varies dramatically with audio domain, microphone quality, speaker demographics, and reference-transcription conventions, so absolute numbers across vendors are only meaningful when measured on identical evaluation sets.
| Benchmark / source | Test set | Speechmatics result | Comparator results |
|---|---|---|---|
| Stanford CORAAL (re-run by Speechmatics, 2021) | African American English conversational speech | 82.8% accuracy (approx. 17% WER) | Google 68.6%, Amazon 68.6%, Microsoft 73%, IBM 62%, Apple 55% |
| Speechmatics internal multi-dataset (2024) covering Common Voice, CORAAL, AVICAR, Switchboard, Rev16, Casual Conversations and two noisy proprietary sets | Mixed real-world audio across 8 datasets | Roughly 32% fewer errors than OpenAI Whisper | Whisper used as comparator |
| Speechmatics internal multilingual (2024) | 50-plus languages | Most-accurate provider on 93.73% of language-test combinations versus Amazon, AssemblyAI, and Deepgram | Per Speechmatics' marketing data |
| Ursa 2 release (October 2024) | 50-plus languages | 18% average WER reduction over original Ursa | Vendor self-comparison |
| Artificial Analysis WER index (2025 to 2026) | Independent third-party | Consistently in the top tier alongside Deepgram, AssemblyAI, and gpt-4o-transcribe | Public ranking on artificialanalysis.ai |
Note that several of these figures derive from Speechmatics' own technical posts and should be read alongside independent third-party benchmarks such as Artificial Analysis and the MLCommons multilingual ASR working group. Different vendors lead on different domains, languages, and audio conditions, and no single ASR product is universally most accurate.
Speechmatics technology powers a range of voice-driven applications:
Speechmatics is consistently described in industry coverage as one of the most technically credible independent ASR providers, with TechCrunch in 2022 highlighting its "inclusive approach" and Slator emphasizing enterprise traction across language services. The 2021 CNBC report on the Stanford CORAAL re-run was a particularly visible inflection point and is frequently cited in academic work on algorithmic bias.
Public honors include the Queen's Award for Enterprise: Innovation (2019), the SME National Business Award for High Growth (2018), inclusion in the FT 1000 Europe's Fastest Growing Companies list from 2019 to 2022, and G2 High Performer in Speech Recognition in 2025. Customer reviews on G2 in 2025 averaged 4.8 out of 5, with strong scores in ease of use and support quality.
Speechmatics is not without trade-offs. Industry analysts have observed: