The rapid advancement of AI has been driven primarily by model improvements, yet datasets, the foundational element of machine learning (ML), often remain overlooked. To address this gap, DataPerf emerged as a community-driven initiative to revolutionize dataset evaluation and foster innovation in data-centric AI. Supported by MLCommons, this benchmark suite invites the global ML community to improve data quality, comparability, and reproducibility.

Transforming Dataset Benchmarking Across Modalities
DataPerf introduces five innovative benchmarks—spanning vision, speech, data acquisition, debugging, and diffusion prompting—to tackle longstanding challenges in dataset creation and optimization. Hosted on the open-source Dynabench platform, it ensures accessibility for academia and industry while fostering dataset iteration and refinement. Factored played a pivotal role in developing scalable pipelines for all benchmarks.
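To make the shared evaluation loop behind these benchmarks concrete, the sketch below shows the general data-centric pattern: the model and test set stay fixed, and only the submitted training data varies, so the score reflects data quality rather than model architecture. This is an illustrative example only, not DataPerf's actual pipeline; the synthetic dataset, the scikit-learn model, and the `evaluate_submission` helper are all assumptions made for the sketch.

```python
# Minimal sketch of a data-centric evaluation loop: the model and the held-out
# test set are fixed, and only the submitted training subset changes.
# Hypothetical example, not DataPerf's actual evaluation pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a benchmark's candidate pool and hidden test set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

def evaluate_submission(selected_idx):
    """Score a submission: train the fixed model on the selected subset only."""
    model = LogisticRegression(max_iter=1000)  # same model for every submission
    model.fit(X_pool[selected_idx], y_pool[selected_idx])
    return accuracy_score(y_test, model.predict(X_test))

# Baseline submission: a random subset of a fixed budget (here, 200 examples).
rng = np.random.default_rng(0)
random_subset = rng.choice(len(X_pool), size=200, replace=False)
print(f"random-subset accuracy: {evaluate_submission(random_subset):.3f}")
```

Because the model never changes, any improvement over the random baseline comes entirely from choosing better training examples, which is exactly what the benchmarks are designed to measure.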
Innovative Contributions to AI Development
DataPerf redefines benchmarking by prioritizing dataset innovation over model architecture, driving advancements in data cleaning, coreset selection, and debugging to enhance ML model performance. Factored's key contributions to the speech and vision benchmarks illustrate how the initiative is setting new standards for data-centric AI development.
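As one illustration of what coreset selection involves, the sketch below uses a simple uncertainty-based heuristic: a cheap proxy model is trained on a small seed set, and the pool examples it is least confident about are kept. This is just one common heuristic under assumed synthetic data and a fixed selection budget, not the method used by DataPerf or by any particular submission.

```python
# Sketch of a simple coreset-selection heuristic: train a cheap proxy model on a
# small seed set, then keep the pool examples it is least certain about.
# Illustrative only; real submissions use a wide variety of selection strategies.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

budget = 200                      # how many training examples a submission may select
rng = np.random.default_rng(0)
seed_idx = rng.choice(len(X_pool), size=50, replace=False)  # small labeled seed set

# Proxy model trained on the seed set scores every candidate in the pool.
proxy = LogisticRegression(max_iter=1000).fit(X_pool[seed_idx], y_pool[seed_idx])
probs = proxy.predict_proba(X_pool)
uncertainty = 1.0 - probs.max(axis=1)              # higher = less confident
selected_idx = np.argsort(uncertainty)[-budget:]   # keep the most uncertain examples

# Evaluate the selected subset the same way the benchmark would: fixed model,
# fixed held-out test set.
model = LogisticRegression(max_iter=1000).fit(X_pool[selected_idx], y_pool[selected_idx])
print(f"uncertainty-coreset accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```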
A Future of Collaborative Progress
Already hosting submissions on Dynabench, DataPerf stands as a testament to the power of collaboration, with Factored contributing as a key partner in shaping its success. The platform’s scalable framework encourages continuous innovation through competitions and open-source contributions. Future expansions include multimodal challenges and a closed division for evaluating generalization on unseen datasets, solidifying Factored’s role in shaping the next era of data-centric AI.
To see the full paper, click here.