Abstract
Online data markets have emerged as a valuable source of diverse datasets for training machine learning (ML) models. However, datasets from different data providers may exhibit varying levels of bias with respect to certain sensitive attributes in the population (such as race, sex, age, and marital status). Recent dataset-acquisition research has focused on maximizing accuracy improvements for downstream model training, ignoring the negative impact of biases in the acquired datasets, which can lead to an unfair model. Can a consumer obtain an unbiased dataset from datasets with diverse biases? In this work, we propose a fairness-aware data acquisition framework (FAIRDA) that acquires high-quality datasets to maximize both accuracy and fairness for the consumer's local classifier training while remaining within a limited budget. Because the biases of data commodities remain opaque to consumers, data acquisition in FAIRDA employs explore-exploit strategies. Based on whether exploration and exploitation are conducted sequentially or alternately, we introduce two algorithms: knowledge-based offline data acquisition (KDA) and reward-based online data acquisition (RDA). Each algorithm is tailored to specific consumer needs: the former has an advantage in computational efficiency, the latter in robustness. We conduct experiments demonstrating the effectiveness of the proposed framework in steering users toward fairer model training than existing baselines under varying market settings.
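To make the explore-exploit idea concrete, the sketch below shows a generic epsilon-greedy acquisition loop over opaque data providers, where each purchase yields a reward that a consumer could define as a blend of accuracy and fairness gains. This is an illustrative sketch only, not the paper's KDA or RDA algorithms: the function name, the epsilon-greedy strategy, and the reward abstraction are all assumptions introduced here for exposition.

```python
import random

def fairness_aware_acquisition(providers, budget, epsilon=0.2, seed=0):
    """Epsilon-greedy sketch of fairness-aware data acquisition.

    providers -- dict mapping a provider name to a callable that, when a
                 batch is purchased, returns an observed reward (e.g., a
                 weighted sum of accuracy gain and fairness gain on the
                 consumer's local classifier).
    budget    -- number of purchases the consumer can afford.
    """
    rng = random.Random(seed)
    estimates = {p: 0.0 for p in providers}  # running mean reward per provider
    counts = {p: 0 for p in providers}
    purchases = []
    for _ in range(budget):
        if rng.random() < epsilon:                    # explore: random provider
            choice = rng.choice(list(providers))
        else:                                         # exploit: best estimate so far
            choice = max(estimates, key=estimates.get)
        reward = providers[choice]()                  # purchase and observe reward
        counts[choice] += 1
        # incremental update of the running mean reward for this provider
        estimates[choice] += (reward - estimates[choice]) / counts[choice]
        purchases.append(choice)
    return purchases, estimates
```

Under this framing, a purely sequential scheme (explore first, then exploit, as in an offline approach) trades robustness for speed, while interleaving exploration and exploitation (as in an online approach) adapts as reward estimates sharpen.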
Original language | English |
---|---|
Title of host publication | Proceedings of the Seventh AAAI/ACM Conference on AI, Ethics, and Society (AIES2024) |
Editors | Sanmay DAS, Brian Patrick GREEN, Kush VARSHNEY, Marianna GANAPINI, Andrea RENDA |
Publisher | AAAI Press |
Pages | 451-462 |
Volume | 7 |
ISBN (Electronic) | 9781577358923 |
Publication status | Published - 16 Oct 2024 |
Funding
This work was supported by Key Programs of Guangdong Province under Grant 2021QN02X166. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding parties.