Robust twin boosting for feature selection from high-dimensional omics data with label noise

Shan HE, Huanhuan CHEN, Zexuan ZHU, Douglas G. WARD, Helen J. COOPER, Mark R. VIANT, John K. HEATH, Xin YAO

Research output: Journal Publications > Journal Article (refereed) > peer-review

31 Citations (Scopus)

Abstract

Omics data such as microarray transcriptomic and mass spectrometry proteomic data are typically characterized by high dimensionality and relatively small sample sizes. To discover biomarkers for diagnosis and prognosis from omics data, feature selection has become an indispensable step for finding a parsimonious set of informative features. However, many previous studies report considerable label noise in omics data, which can lead to unreliable inferences and the selection of uninformative features. Yet, to the best of our knowledge, very few feature selection methods have been proposed to address this problem. This paper proposes a novel ensemble feature selection algorithm, robust twin boosting feature selection (RTBFS), which is robust to label noise in omics data. The algorithm has been validated on an omics feature selection test bed and seven real-world heterogeneous omics datasets, some of which are known to have label noise. Compared with several state-of-the-art ensemble feature selection methods, RTBFS selects more informative features despite label noise and obtains better classification results. RTBFS is a general feature selection method and can be applied to other data with label noise. A MATLAB implementation of RTBFS and sample datasets are available at: http://www.cs.bham.ac.uk/szh/TReBFSMatlab.zip. © 2014 Elsevier Inc. All rights reserved.
Original language: English
Pages (from-to): 1-18
Number of pages: 18
Journal: Information Sciences
Volume: 291
Issue number: C
Early online date: 30 Aug 2014
Publication status: Published - Jan 2015
Externally published: Yes

Bibliographical note

This work is supported by the Leverhulme Trust Early Career Fellowship (ECF/2007/0433), the Royal Society International Exchanges 2011 NSFC cost share scheme (IE111069), National Natural Science Foundation of China (61471246 and 61205092), the NSFC-RS joint project (61211130120), the Guangdong Foundation of Outstanding Young Teachers in Higher Education Institutions (Yq2013141), the Shenzhen Scientific Research and Development Funding Program (JCYJ20130329115450637, KQC201108300045A, and ZYC201105170243A), and the Guangdong Natural Science Foundation (S2012010009545).

Keywords

  • Boosting
  • Ensemble learning
  • Feature selection
