{{see also|Papers}}
==Universal Speech Model==
The [[Universal Speech Model]] (USM) is a state-of-the-art family of speech models with 2 billion parameters, designed to perform automatic speech recognition (ASR) in more than 300 languages. USM was trained on 12 million hours of speech and 28 billion text sentences. Developed for applications such as captions on [[YouTube]], the system supports widely spoken languages like English and Mandarin as well as under-resourced languages, including Punjabi, Assamese, Santhali, Balinese, Shona, Malagasy, Luganda, Luo, Bambara, Soga, Maninka, Xhosa, Akan, Lingala, Chichewa, Nkore, and Nzema.


==Model Development and Efficacy==


==Proficiency Across Numerous Languages==
[[File:usm1.png|right|400px]]
With more than 300 languages covered during pre-training, USM displays exceptional performance when fine-tuned on YouTube Captions' multilingual speech data. This supervised dataset comprises 73 languages, with an average of fewer than three thousand hours of data per language. Despite the limited supervised data, the model achieves an average word error rate (WER) below 30% across the 73 languages, an unprecedented result.
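The WER cited above is the standard edit-distance metric: insertions, deletions, and substitutions divided by the number of reference words. As a rough illustration only (the function and example sentences below are made up, not from the USM paper), a minimal Python sketch:

<syntaxhighlight lang="python">
# Minimal WER sketch: WER = (substitutions + insertions + deletions) / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Illustrative sentences: one substitution out of six words -> WER ~0.167.
print(wer("the cat sat on the mat", "the cat sat at the mat"))
</syntaxhighlight>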

==Adaptation to Downstream ASR Tasks==
[[File:usm2.png|right|400px]]
USM surpasses the Whisper model on various publicly available datasets, including CORAAL (African American Vernacular English), SpeechStew (English), and FLEURS (102 languages), both with and without training on in-domain data. For the 62 FLEURS languages that overlap with those supported by Whisper, USM achieves a 65.8% relative reduction in WER without in-domain data and a 67.8% relative reduction with in-domain data.
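A "relative reduction" expresses how much the error rate drops as a fraction of the baseline's. A minimal sketch with made-up numbers (the per-language WERs behind the reported figures are not given here):

<syntaxhighlight lang="python">
# Relative WER reduction: (baseline - model) / baseline.
def relative_wer_reduction(wer_baseline: float, wer_model: float) -> float:
    return (wer_baseline - wer_model) / wer_baseline

# Illustrative values only: a 65.8% relative reduction means the model's
# WER is roughly a third of the baseline's on the same test set.
print(relative_wer_reduction(0.30, 0.1026))  # -> 0.658
</syntaxhighlight>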


==Superior Automated Speech Translation (AST)==
[[File:usm3.png|right|400px]]
USM is also effective for speech translation tasks, as evidenced by its performance on the CoVoST dataset. For evaluation, languages are grouped into high-, medium-, and low-resource categories, and the BLEU score is computed for each group. USM consistently surpasses the Whisper model across all resource categories, highlighting its adaptability and effectiveness in speech translation.
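This kind of per-group comparison can be reproduced in outline with the sacrebleu library; the tier assignments and sentence pairs below are placeholders, not the actual CoVoST splits used in the evaluation:

<syntaxhighlight lang="python">
# Sketch: corpus BLEU per resource tier, using the sacrebleu library.
# Tier assignments and sentences are placeholders, not CoVoST data.
import sacrebleu

translations = {
    "high":   (["the weather is nice today"], ["the weather is nice today"]),
    "medium": (["he goes to school by bus"],  ["he takes the bus to school"]),
    "low":    (["rain falls in the village"], ["it is raining in the village"]),
}

for tier, (hypotheses, references) in translations.items():
    # corpus_bleu takes a list of hypotheses and a list of reference streams.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"{tier}-resource BLEU: {bleu.score:.1f}")
</syntaxhighlight>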


[[Category:Papers]]