Sanitized clustering against confounding bias

Yinghua YAO, Yuangang PAN, Jing LI, Ivor W. TSANG*, Xin YAO*

*Corresponding author for this work

Research output: Journal Publications › Journal Article (refereed) › peer-review

1 Citation (Scopus)

Abstract

Real-world datasets inevitably contain biases that arise from different sources or conditions during data collection. Such inconsistency acts as a confounding factor that disturbs cluster analysis. Existing methods eliminate the biases by projecting the data onto the orthogonal complement of the subspace spanned by the confounding factor before clustering. In these methods, the clustering factor of interest and the confounding factor are treated coarsely in the raw feature space, and the correlation between the data and the confounding factor is assumed to be linear so that a convenient solution exists. These approaches are thus limited in scope, as data in real applications is usually complex and non-linearly correlated with the confounding factor. This paper presents a new clustering framework named Sanitized Clustering Against confounding Bias (SCAB), which removes the confounding factor in the semantic latent space of complex data through a non-linear dependence measure. Specifically, we eliminate the bias information in the latent space by minimizing the mutual information between the confounding factor and the latent representation produced by a variational auto-encoder. Meanwhile, a clustering module is introduced to cluster over the purified latent representations. Extensive experiments on complex datasets demonstrate that SCAB achieves a significant gain in clustering performance by removing the confounding bias.
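The linear projection baseline that the abstract contrasts against can be illustrated with a short NumPy sketch. This is not the SCAB method itself and is not taken from the paper: the function name `project_out_confounder`, the synthetic data, and the choice to mean-centre both inputs are assumptions made only for illustration. It shows the idea of removing whatever part of the features a confounder explains linearly, and by construction it cannot remove non-linear dependence.

```python
import numpy as np


def project_out_confounder(X, C):
    """Project X onto the orthogonal complement of the span of C.

    X: (n_samples, n_features) data matrix.
    C: (n_samples,) or (n_samples, n_confounders) confounder values.
    Hypothetical helper for illustration; assumes a linear relation.
    """
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    if C.ndim == 1:
        C = C[:, None]
    # Centre both matrices so the feature means are not treated as bias.
    Xc = X - X.mean(axis=0)
    Cc = C - C.mean(axis=0)
    # Least-squares coefficients B minimising ||Xc - Cc @ B||.
    B, *_ = np.linalg.lstsq(Cc, Xc, rcond=None)
    # The residual is the part of Xc that C cannot explain linearly.
    return Xc - Cc @ B


# Synthetic example: a binary "data source" label shifts every feature.
rng = np.random.default_rng(0)
n = 200
c = rng.integers(0, 2, size=n).astype(float)  # assumed confounder
signal = rng.normal(size=(n, 3))               # structure of interest
X = signal + 2.0 * c[:, None]                  # linear bias added
X_clean = project_out_confounder(X, c)
```

After this step a standard clustering algorithm would be run on `X_clean`. When the bias enters non-linearly, the residual still carries confounder information, which is the limitation the abstract attributes to this family of methods.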

Original language: English
Pages (from-to): 3711-3730
Number of pages: 20
Journal: Machine Learning
Volume: 113
Issue number: 6
Early online date: 27 Dec 2023
Publication status: Published - Jun 2024
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2023, The Author(s).

Funding

This work was supported in part by the A*STAR Centre for Frontier AI Research; in part by the AISG Grand Challenge in AI for Materials Discovery (Grant No. AISG2-GC-2023-010); in part by the A*STAR C222812019; in part by the A*STAR Pitchfest for ECR 232D800027; in part by the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (Grant No. 2017ZT07X386); and in part by the Program for Guangdong Provincial Key Laboratory (Grant No. 2020B121201001).

Keywords

  • Confounding bias
  • Deep clustering
  • Mutual information
  • Non-linear dependence
