Erwin L. Rimban, Data Sovereignty and the Myth of the Universal Dataset: A Critical Review of Benchmarking in Machine Learning, International Journal of Soft Computing, Volume 19, Issue 1, 2024, Pages 1-8, ISSN 1816-9503, makijsc.2024.1.8, (https://makhillpublications.co/view-article.php?doi=makijsc.2024.1.8)

Abstract: This paper presents a critical review of the concept of universal benchmark datasets in machine learning through the lens of data sovereignty and decolonial theory. While benchmark datasets like ImageNet, COCO and GLUE have become standard tools for evaluating model performance, they often reflect Western cultural norms, linguistic biases, and geopolitical priorities. Drawing on theoretical frameworks from Walter Mignolo's epistemic disobedience, Boaventura de Sousa Santos's epistemologies of the South, Miranda Fricker's epistemic injustice and Philip Alston's digital colonialism, this paper critically examines the historical development, construction politics and universality claims of benchmark datasets. The analysis reveals how these datasets marginalize non-Western knowledge systems and perpetuate colonial power dynamics in data practices. As alternatives, this paper proposes data pluriverses, co-design frameworks for localized benchmarking, decentralized dataset stewardship and integration of Indigenous data governance principles like CARE (Collective Benefit, Authority to Control, Responsibility, Ethics). The paper concludes by emphasizing the urgent need to dismantle universalist assumptions in AI development and calls for more ethical and pluralistic data practices in machine learning research.

Keywords: Artificial intelligence; benchmark datasets; data sovereignty; decolonial theory; machine learning; epistemic justice; Indigenous data governance; data pluriverses