Publications - Factored and MLCommons deliver scalable, ethically sourced audio for robust speech recognition, transforming AI accessibility globally.

The People's Speech is a groundbreaking supervised conversational English speech recognition dataset. With over 30,000 hours of transcribed audio sourced ethically and licensed under CC-BY-SA and CC-BY, this dataset transforms the speech recognition landscape. Factored played a pivotal role in ensuring its scalability and adherence to rigorous ethical standards, solidifying its position as a cornerstone for academic and commercial usage.

Summary of The People’s Speech dataset versus prior work. We focus on commercial use, speaker diversity, quantity, and incorporating natural background noise conditions into the dataset.

Setting New Standards in Accessibility and Scale

This dataset draws on Creative Commons-licensed audio from diverse sources, including movies, lectures, historical recordings, and podcasts. Unlike traditional benchmarks such as LibriSpeech, The People's Speech incorporates real-world environmental noise and a broad spectrum of accents, making it well suited to training robust automatic speech recognition (ASR) systems. Its collection pipeline, which relies on forced alignment and open-source tools like Apache Spark, reduces the prohibitive costs typically associated with large-scale dataset curation, bringing the estimated cost from $5 million to just $3,000.
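To put the cost figures in perspective, the short sketch below works out the per-hour cost and the overall reduction. It uses only the numbers quoted in this article, and it assumes exactly 30,000 hours of audio, which is a lower bound since the dataset contains "over 30,000 hours". The variable and function names are illustrative only.

```python
# Figures quoted in the article. HOURS is an assumption: the article says
# "over 30,000 hours", so 30,000 is used here as a lower bound.
HOURS = 30_000
TRADITIONAL_COST_USD = 5_000_000
PIPELINE_COST_USD = 3_000


def cost_per_hour(total_cost_usd: float, hours: float) -> float:
    """Return the average cost in USD of one hour of transcribed audio."""
    return total_cost_usd / hours


traditional_per_hour = cost_per_hour(TRADITIONAL_COST_USD, HOURS)
pipeline_per_hour = cost_per_hour(PIPELINE_COST_USD, HOURS)

# How many times cheaper the pipeline estimate is, and the saving as a percentage.
reduction_factor = TRADITIONAL_COST_USD / PIPELINE_COST_USD
savings_percent = 100 * (1 - PIPELINE_COST_USD / TRADITIONAL_COST_USD)

print(f"Traditional estimate: ${traditional_per_hour:,.2f} per hour")
print(f"Pipeline estimate:    ${pipeline_per_hour:,.2f} per hour")
print(f"Reduction factor:     about {reduction_factor:,.0f}x")
print(f"Savings:              {savings_percent:.2f}%")
```

Under these assumptions, the pipeline brings the cost to roughly ten cents per transcribed hour, compared with well over one hundred dollars per hour for the traditional estimate.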

Shaping the Future of Ethical and Inclusive AI

The People's Speech is both a technical achievement and a testament to responsible AI development. Factored's contribution, led by Juan Cerón, ensured the dataset met the highest standards of data integrity while respecting privacy and intellectual property rights. Downloaded from Hugging Face more than 11,000 times a month, the dataset enables researchers and developers to train models that better understand diverse accents and contexts, reducing bias and enhancing accessibility in speech recognition technologies globally.

A Global Milestone in Open AI Collaboration

The People's Speech is a collaborative triumph, sponsored by MLCommons and supported by institutions like Harvard and NVIDIA. Factored led the charge in ensuring ethical standards, scalability, and technical excellence, making the dataset a transformative resource across industries.

To see the full paper, click here.
