Factored, NVIDIA, and MLCommons drive multilingual AI innovation with a 77-language speech dataset, promoting ethically sourced data and global inclusivity.

Factored, NVIDIA, and MLCommons Lead Multilingual AI with Speech Wikimedia

Speech Wikimedia is a landmark in the evolution of multilingual AI: a dataset of 1,780 hours of transcribed speech across 77 languages. Developed collaboratively by researchers from Factored, NVIDIA, and MLCommons, the project addresses the scarcity of high-quality multilingual datasets whose licensing permits both academic and commercial use. The resource is designed for training automatic speech recognition (ASR), speech translation, and machine translation (MT) models.

Ensuring Ethical and Diverse Data Collection

By sourcing data from Wikimedia Commons, an archive of openly licensed audio, the dataset avoids the common pitfalls of copyright issues and unethical sourcing. Each audio file has one or more transcriptions: 69% of the dataset supports ASR tasks, where the transcript is in the spoken language, and 31% enables speech translation, where the transcript is in a different language. The collection includes rare language pairings such as Welsh-English and Dutch-Russian, broadening the representation of global languages.

Setting a New Standard for Multilingual AI

Speech Wikimedia draws on a broader range of audio sources than Mozilla Common Voice and covers more languages than Multilingual LibriSpeech, which makes it more diverse than either. Its topics range from current events to history and science, so it adapts to more scenarios than traditional "read speech" datasets. Factored's contributions were pivotal in optimizing the dataset for practical applications while maintaining scalability and ethical standards.

Pioneering Multilingual Speech Research

Built with contributions from Factored, the dataset opens new possibilities for AI research:

  • ASR: Covers languages like Basque, Korean, and Bengali, alongside widely spoken languages such as English and Spanish.
  • Speech Translation: Facilitates cross-lingual transcription, with English-Spanish and Latin-English as notable examples.
  • Machine Translation: Unlocks multitask learning opportunities with 929 language pairings and transcripts in at least three languages for 10% of the dataset.
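
The three task types above follow from how the language spoken in an audio file relates to the languages of its transcripts. The following is a minimal Python sketch of that relationship, assuming a hypothetical record described by one audio language code and a list of transcript language codes. The function name, the dictionary keys, and the language codes are illustrative assumptions and do not reflect the dataset's actual schema.

```python
from itertools import combinations


def derive_tasks(audio_lang, transcript_langs):
    """Return the tasks one audio file could support (illustrative only).

    audio_lang: language code of the speech, e.g. "cy" (hypothetical field).
    transcript_langs: language codes of the available transcripts.
    """
    langs = sorted(set(transcript_langs))
    return {
        # ASR needs a transcript in the same language as the speech.
        "asr": audio_lang in langs,
        # Speech translation pairs the audio with a transcript in another language.
        "speech_translation": [(audio_lang, t) for t in langs if t != audio_lang],
        # Any two transcripts of the same audio form a text-to-text pairing.
        "machine_translation": list(combinations(langs, 2)),
    }


# A Welsh recording with Welsh, English, and Spanish transcripts.
record = derive_tasks("cy", ["cy", "en", "es"])
print(record["asr"])                  # True
print(record["speech_translation"])   # [('cy', 'en'), ('cy', 'es')]
print(record["machine_translation"])  # [('cy', 'en'), ('cy', 'es'), ('en', 'es')]
```

Under these assumptions, a file with transcripts in three or more languages contributes to all three tasks at once, which is what makes such files useful for multitask learning.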

Empowering AI Innovation

Speech Wikimedia, hosted on Hugging Face, is downloaded more than 800 times each month. The next phase includes processing the raw data so it is ready for model training and exploring multimodal tasks by reintegrating the video content that was removed.

To see the full paper click here
