Speech Wikimedia marks a milestone in multilingual AI: a dataset of 1,780 hours of transcribed speech across 77 languages. Developed collaboratively by researchers from Factored, NVIDIA, and MLCommons, the project addresses the scarcity of high-quality multilingual datasets whose licensing permits both academic and commercial use. The resource supports training automatic speech recognition (ASR), speech translation, and machine translation (MT) models.

Ensuring Ethical and Diverse Data Collection
By sourcing data from Wikimedia Commons, an archive of openly licensed audio, the dataset avoids the common pitfalls of copyright issues and unethical sourcing. Each audio file has one or more transcriptions, with 69% of the dataset supporting ASR tasks and 31% enabling speech translation. The collection includes rare language pairings such as Welsh-English and Dutch-Russian, broadening the representation of the world's languages.

Setting a New Standard for Multilingual AI
Speech Wikimedia draws on a broader range of audio sources than Mozilla Common Voice and covers more languages than Multilingual LibriSpeech, which makes it considerably more diverse. Its topics range from current events to history and science, so it adapts to more scenarios than traditional "read speech" datasets. Factored's contributions were pivotal in optimizing the dataset for practical applications while maintaining scalability and ethical standards.

Pioneering Multilingual Speech Research
With Factored's contributions, the dataset opens new possibilities for AI research:
- ASR: Covers languages like Basque, Korean, and Bengali, alongside widely spoken languages such as English and Spanish.
- Speech Translation: Facilitates cross-lingual transcription, with English-Spanish and Latin-English as notable examples.
- Machine Translation: Unlocks multitask learning opportunities with 929 language pairings and transcripts in at least three languages for 10% of the dataset.
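
To make the relationship between the three tasks concrete, here is a minimal sketch of how a single audio file with several transcripts could yield ASR, speech translation, and MT examples. The `Record` layout, the `derive_tasks` helper, and the sample text are hypothetical and used only for illustration; the dataset's actual schema may differ.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Dict, List, Tuple


@dataclass
class Record:
    """Hypothetical record layout (an assumption, not the real schema)."""
    audio_lang: str              # language spoken in the audio
    transcripts: Dict[str, str]  # language code -> transcript text


def derive_tasks(record: Record) -> List[Tuple[str, str, str]]:
    """Return (task, source_language, target_language) triples for one record."""
    tasks = []
    for lang in record.transcripts:
        if lang == record.audio_lang:
            # Transcript in the spoken language: an ASR example.
            tasks.append(("asr", record.audio_lang, lang))
        else:
            # Transcript in another language: a speech translation example.
            tasks.append(("speech_translation", record.audio_lang, lang))
    # Every ordered pair of distinct transcript languages is a text-to-text MT example.
    for src, tgt in permutations(record.transcripts, 2):
        tasks.append(("mt", src, tgt))
    return tasks


# Illustrative record: Welsh audio with transcripts in three languages.
example = Record(
    audio_lang="cy",
    transcripts={"cy": "Bore da", "en": "Good morning", "es": "Buenos días"},
)
tasks = derive_tasks(example)
```

Under these assumptions, one file with transcripts in three languages contributes one ASR example, two speech translation examples, and six directed MT pairs, which is why records with three or more transcripts are useful for multitask learning.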

Empowering AI Innovation
Speech Wikimedia, hosted on Hugging Face, is downloaded more than 800 times each month and is actively shaping AI research. The next phase includes processing the raw data so it is ready for model training, and exploring multimodal tasks by reintegrating the video content that was removed.
To see the full paper, click here.