Factored and MLCommons deliver the world’s largest multilingual keyword spotting dataset, enhancing speech recognition in 50+ languages.

The Multilingual Spoken Words Corpus (MSWC) introduces a monumental leap in speech recognition, offering over 23.4 million audio clips covering 340,000 keywords across 50 languages spoken by 5 billion people. Designed for academic and commercial use, the dataset supports applications such as voice assistants, call center automation, and speech technology for low-resource languages. It uses forced alignment to extract high-quality keyword data from Mozilla's Common Voice project, making it the largest open-source resource for keyword spotting.
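As a rough illustration of how word-level forced alignments can be turned into keyword clips, the sketch below computes a fixed-length window around each aligned word. It is not the MSWC pipeline itself: the `WordAlignment` and `ClipWindow` types, the `keyword_windows` function, the one-second clip length, and the three-character minimum are all assumptions made for this example.

```python
from typing import List, NamedTuple


class WordAlignment(NamedTuple):
    """One word with start/end times, as a forced aligner might report (hypothetical type)."""
    word: str
    start_s: float
    end_s: float


class ClipWindow(NamedTuple):
    """Time span to cut out of the source utterance for one keyword (hypothetical type)."""
    keyword: str
    start_s: float
    end_s: float


def keyword_windows(
    alignments: List[WordAlignment],
    utterance_len_s: float,
    clip_len_s: float = 1.0,  # assumed clip length, for illustration only
    min_chars: int = 3,       # assumed minimum keyword length, for illustration only
) -> List[ClipWindow]:
    """Return one fixed-length window centered on each sufficiently long word."""
    windows = []
    for a in alignments:
        # Skip very short words and words that would not fit in one clip.
        if len(a.word) < min_chars or (a.end_s - a.start_s) > clip_len_s:
            continue
        center = (a.start_s + a.end_s) / 2
        # Center the window on the word, then clamp it to the utterance bounds.
        start = max(0.0, min(center - clip_len_s / 2, utterance_len_s - clip_len_s))
        end = min(start + clip_len_s, utterance_len_s)
        windows.append(ClipWindow(a.word, start, end))
    return windows


example = [
    WordAlignment("turn", 0.10, 0.40),
    WordAlignment("on", 0.45, 0.60),
    WordAlignment("lights", 1.80, 2.30),
]
for window in keyword_windows(example, utterance_len_s=2.5):
    print(window)
```

In this sketch "on" is dropped for being too short, and the windows for "turn" and "lights" are shifted to stay inside the 2.5-second utterance. Cutting the audio itself is left out so the example stays free of audio dependencies.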

Delivering Scalable Innovation

Factored played an integral role in this groundbreaking initiative. Collaborating with MLCommons, Factored contributed to data engineering, scalability, and quality assurance of the MSWC pipeline. Our engineers optimized data processing frameworks, ensuring efficient alignment and extraction processes for keywords across diverse languages.

Key Features and Benchmarks

Scale and Diversity: High-resource languages like English and Spanish complement low-resource languages like Oriya and Dhivehi.
Enhanced Accessibility: Includes gender diversity and background noise for real-world robustness.
Performance Metrics: Keyword spotting models trained on MSWC achieve accuracy competitive with models trained on standard datasets such as Google Speech Commands.

Factored also contributed to the outlier detection metrics, enabling users to choose between larger, noisier datasets and smaller, cleaner datasets tailored to their needs.

Shaping the Future of Multilingual Speech Recognition

The MSWC has already set a new global benchmark in inclusivity and scalability. Its versatility has empowered low-resource languages, enabling researchers to develop speech models with unprecedented accuracy using limited data.
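The trade-off between a larger, noisier subset and a smaller, cleaner one, enabled by the outlier detection metrics mentioned earlier, can be sketched as a simple ranking step. This is a minimal example under stated assumptions, not the MSWC implementation: it assumes each clip already has a numeric outlier score where higher means more likely to be an outlier, and the `filter_by_outlier_score` function and its `keep_fraction` parameter are hypothetical names.

```python
from typing import List, Tuple


def filter_by_outlier_score(
    clips: List[Tuple[str, float]],  # (clip_id, outlier_score); higher = more suspect (assumed)
    keep_fraction: float,
) -> List[str]:
    """Keep the given fraction of clips with the lowest outlier scores."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Rank clips from most typical to most outlying.
    ranked = sorted(clips, key=lambda clip: clip[1])
    # Always keep at least one clip when any are available.
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return [clip_id for clip_id, _ in ranked[:n_keep]]


scored = [("a", 0.1), ("b", 0.9), ("c", 0.3), ("d", 0.5)]
clean_subset = filter_by_outlier_score(scored, keep_fraction=0.5)  # smaller, cleaner
full_subset = filter_by_outlier_score(scored, keep_fraction=1.0)   # larger, noisier
print(clean_subset, full_subset)
```

A lower `keep_fraction` yields fewer clips that sit closer to the typical pronunciation of a keyword, while a value of 1.0 keeps everything, including likely misalignments.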

To see the full paper click here.