Factored, MLCommons, and NVIDIA tackle dataset redundancy with scalable LSH techniques, presented at NeurIPS 2021.

Factored, MLCommons, and NVIDIA Optimize Data Integrity with LSH

Addressing the Challenge of Nearly Redundant Data

In the era of large-scale data driving advances in natural language processing (NLP) and artificial intelligence (AI), the quality and integrity of datasets have become crucial to reliable model performance. Duplicate and near-duplicate data pose significant challenges, including overfitting and biased model behavior. Factored, collaborating with MLCommons, NVIDIA, and other institutions, tackled these issues with Locality-Sensitive Hashing (LSH), a technique for identifying and removing nearly redundant data from datasets at scale.

To validate the approach, the team constructed an artificial benchmark derived from 30,000 English Wikipedia articles, creating duplicates with varying levels of similarity (87%-94%). On this benchmark, LSH-based deduplication achieved Area-Under-the-Curve (AUC) scores exceeding 0.9 in most configurations, demonstrating both effectiveness and scalability for large datasets.
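The exact construction procedure isn't detailed in this summary, but a minimal sketch conveys the idea: starting from a source text, replace a small fraction of words to hit a target token-set (Jaccard) similarity. The helpers below (perturb_document, jaccard) are illustrative, not the paper's code; replacing roughly 3%-7% of distinct words lands in the 87%-94% similarity band.

```python
import random

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b)

def perturb_document(text: str, replace_frac: float, seed: int = 0) -> str:
    """Make a near-duplicate by replacing a fraction of words with fresh tokens.

    Replacing a fraction f of n distinct words gives Jaccard (1-f)/(1+f),
    so f in [0.03, 0.07] yields similarities of roughly 0.87-0.94.
    """
    rng = random.Random(seed)
    words = text.split()
    for i in rng.sample(range(len(words)), int(len(words) * replace_frac)):
        words[i] = f"tok{rng.randrange(10**6)}"  # synthetic replacement word
    return " ".join(words)

source = " ".join(f"word{i}" for i in range(500))   # stand-in for an article
dup = perturb_document(source, replace_frac=0.05)
print(jaccard(set(source.split()), set(dup.split())))  # ~0.90
```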

Factored’s Contributions to Scalable Data Deduplication

Factored played a pivotal role in implementing and fine-tuning the LSH models for this project. The methodology enabled efficient deduplication by grouping near-duplicate articles into shared hash buckets, drastically reducing computational cost compared with exhaustive pairwise comparison. The result was a scalable framework capable of processing datasets in O(N) time, significantly speeding up deduplication while maintaining high accuracy.
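The project's actual implementation isn't reproduced here, but the core mechanism is standard: hash each article's token set into a short MinHash signature, split the signature into bands, and treat any two articles sharing a band as a candidate duplicate pair, to be verified by exact similarity. Below is a minimal from-scratch sketch under those assumptions; NUM_PERM, BANDS, and the helper names are illustrative choices, not the paper's parameters.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 128  # signature length (number of seeded hash functions); assumed
BANDS = 32      # 32 bands x 4 rows: pairs above ~(1/32)**(1/4) = 0.42 Jaccard
                # are likely to collide; survivors are verified exactly

def minhash_signature(tokens, num_perm=NUM_PERM):
    """MinHash signature: per seed, keep the minimum hash over the token set.

    Two sets agree on each signature slot with probability equal to their
    Jaccard similarity.
    """
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens)
        for seed in range(num_perm)
    ]

def candidate_pairs(docs):
    """One pass over the corpus: bucket each doc by each band of its signature.

    Only docs sharing a bucket are compared afterwards, so the scan is O(N)
    in corpus size rather than O(N^2) exhaustive pairwise comparison.
    """
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(set(text.split()))
        for b in range(BANDS):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    return {tuple(sorted(p)) for ids in buckets.values()
            for p in combinations(ids, 2)}

docs = {"a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy cat",
        "c": "entirely different text about benchmark construction methods"}
print(candidate_pairs(docs))  # expected: {('a', 'b')}
```

The band/row split trades recall against candidate volume: more bands with fewer rows lowers the collision threshold, catching more borderline pairs at the cost of more verification work.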

This work also introduced an artificial benchmark dataset for evaluating data deduplication techniques, ensuring reproducibility and setting a new standard for validating data quality in machine learning pipelines. The approach demonstrated that maintaining dataset integrity enhances the reliability of AI models across training, validation, and testing phases.

Innovation and Relevance: Advancing Data-Centric AI

MinHash-based LSH techniques for data deduplication represent a paradigm shift in data-centric AI. By preserving data quality without unnecessary exclusions, the framework achieved an AUC of up to 0.96, underscoring its precision and practicality. The scalability of LSH ensures its applicability to datasets spanning millions of entries, a significant edge over conventional k-nearest-neighbor methods, whose exhaustive similarity comparisons scale quadratically with corpus size.
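The write-up doesn't name specific tooling, but as an illustration of how the same MinHash-plus-LSH idea looks in practice, here is a short example using the open-source datasketch library; the minhash helper, the toy corpus, and the 0.9 threshold are our assumptions, not the project's settings.

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash of a document's word set (illustrative tokenizer)."""
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf8"))
    return m

# Index configured so pairs with estimated Jaccard >= ~0.9 become candidates.
lsh = MinHashLSH(threshold=0.9, num_perm=128)
corpus = {
    "doc1": "clean deduplicated data improves language model training",
    "doc2": "clean deduplicated data improves language model evaluation",
}
for doc_id, text in corpus.items():
    lsh.insert(doc_id, minhash(text))

# Returns the keys of indexed documents likely above the threshold.
print(lsh.query(minhash("clean deduplicated data improves language model training")))
```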

Factored’s contribution exemplifies its expertise in developing and deploying scalable machine learning solutions, pushing the boundaries of AI integrity and performance.

Showcased at NeurIPS 2021

The work was presented at the Data-Centric AI Workshop during NeurIPS 2021, one of the world’s premier AI conferences, bringing global attention to the importance of scalable and accurate data deduplication in AI research and applications.

To see the full paper, click here.
