Universal Speech Model



==Proficiency Across Numerous Languages==
[[File:usm1.png|right|300px]]
Incorporating more than 300 languages in the pre-training phase, USM displays exceptional performance when fine-tuned on YouTube Captions' multilingual speech data. This supervised dataset comprises 73 languages, with an average of fewer than three thousand hours of data per language. Despite the limited supervised data, the model achieves an average word error rate (WER) of under 30% across the 73 languages, an unprecedented accomplishment.
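WER is the minimum number of word-level substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch (not USM's own evaluation code, which applies additional text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

A WER under 30% therefore means fewer than three word errors per ten reference words, averaged over the 73 languages.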




==Adaptation to Downstream ASR Tasks==
[[File:usm2.png|right|300px]]
USM surpasses the Whisper model on several publicly available datasets, including CORAAL (African American Vernacular English), SpeechStew (English), and FLEURS (102 languages). These results hold both with and without training on in-domain data. For the 62 FLEURS languages that overlap with those supported by Whisper, USM achieves a 65.8% relative WER reduction without in-domain data and a 67.8% relative reduction with it.
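A relative WER reduction expresses the improvement as a fraction of the baseline's error rate, not as an absolute difference in percentage points. A small arithmetic sketch (the 20% baseline below is a hypothetical number chosen for illustration; only the 65.8% relative figure comes from the comparison above):

```python
def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """How much lower the model's WER is, as a fraction of the baseline WER."""
    return (baseline_wer - model_wer) / baseline_wer

# Hypothetical illustration: a baseline WER of 20% reduced to 6.84%
# corresponds to a 65.8% relative reduction.
print(round(relative_wer_reduction(0.20, 0.0684), 3))  # 0.658
```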


==Superior Automated Speech Translation (AST)==
[[File:usm3.png|right|300px]]
USM is also effective for speech translation tasks, as shown by its performance on the CoVoST dataset. Languages are grouped into high-, medium-, and low-resource segments, and the BLEU score is computed for each segment. USM consistently surpasses Whisper across all resource groups, highlighting its adaptability and effectiveness in speech translation.
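BLEU scores a translation by the geometric mean of its clipped n-gram precisions against a reference, scaled by a brevity penalty. A simplified single-reference, unsmoothed sketch (published BLEU numbers use smoothed, tokenization-aware implementations such as sacreBLEU, so scores will differ):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty. Single reference, no smoothing."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed: any empty precision zeroes the score
        log_precisions += math.log(overlap / total) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # penalize short hypotheses
    return bp * math.exp(log_precisions)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0 for an exact match
```

Grouping languages by resource level before averaging BLEU, as above, keeps abundant high-resource languages from masking performance on low-resource ones.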


[[Category:Papers]]