Voting-based instance selection from large data sets with MapReduce and random weight networks

Junhai ZHAI*, Xizhao WANG, Xiaohe PANG

*Corresponding author for this work

Research output: Journal Publications > Journal Article (refereed) > peer-review

44 Citations (Scopus)

Abstract

Instance selection is an important preprocessing step in machine learning. By choosing a subset of a data set, it allows a machine learning algorithm to achieve the same performance as if the whole data set were used, and it makes machine learning algorithms feasible and effective on large data sets. Based on a voting mechanism, this paper proposes an instance selection algorithm for large data sets that uses MapReduce and random weight networks (RWNs). First, the proposed algorithm employs the Map phase of MapReduce to partition the large data set into small subsets and deploys them to different cloud computing nodes. Second, the informative instances are selected in parallel with an instance selection algorithm. Third, the Reduce phase of MapReduce collects the selected instances from the different cloud computing nodes, yielding a selected instance subset. These three steps are repeated p times (p is a user-defined parameter), so that p instance subsets are obtained. Finally, a voting method is used to select the most informative instances from the p subsets. A random weight network classifier is trained on the selected instance subset, and its accuracy is evaluated on the testing set. The proposed algorithm is experimentally compared with three state-of-the-art approaches: CNN, ENN and RNN (the condensed, edited and reduced nearest neighbor rules). The experimental results show that the proposed algorithm is effective and efficient.
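
The partition, select, collect and vote steps described in the abstract can be sketched in a few lines. This is a minimal single-process illustration, not the authors' implementation: the Map and Reduce phases are simulated with ordinary Python lists rather than a real MapReduce cluster, the per-partition selector `condense` is a simple 1-NN condensing pass used only as a stand-in, and the names `partition`, `condense`, `voting_selection`, `k` (number of partitions) and `min_votes` (vote threshold) are all hypothetical. The abstract does not state how votes are thresholded, so keeping instances chosen in at least `min_votes` of the p rounds is an assumption.

```python
import random
from collections import Counter


def partition(indices, k, rng):
    """Simulated Map step: shuffle the indices and split them into k disjoint subsets."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def condense(subset, X, y):
    """Stand-in selector (assumption, not the paper's selector).

    One 1-NN condensing pass: an instance is kept only if the instances
    kept so far would misclassify it.
    """
    kept = []
    for i in subset:
        if not kept:
            kept.append(i)
            continue
        nearest = min(
            kept,
            key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])),
        )
        if y[nearest] != y[i]:
            kept.append(i)
    return kept


def voting_selection(X, y, p=5, k=4, min_votes=3, seed=0):
    """Repeat partition -> select -> collect p times, then vote.

    Returns the sorted indices of instances selected in at least
    min_votes of the p rounds (the threshold rule is an assumption).
    """
    rng = random.Random(seed)
    indices = list(range(len(X)))
    votes = Counter()
    for _ in range(p):
        subsets = partition(indices, k, rng)              # Map: split the data
        selected = [condense(s, X, y) for s in subsets]   # per-node selection
        round_subset = {i for part in selected for i in part}  # Reduce: collect
        votes.update(round_subset)                        # one vote per round
    return sorted(i for i, v in votes.items() if v >= min_votes)
```

Calling `voting_selection(X, y, p=5, k=4, min_votes=3)` on a list of feature tuples `X` and labels `y` returns the indices of the retained instances; a classifier such as a random weight network would then be trained on those instances only. Because each round uses a different random partition, an instance that is selected repeatedly is informative regardless of which neighbors it happens to share a partition with, which is the intuition behind the vote.
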
Original language: English
Pages (from-to): 1066-1077
Number of pages: 12
Journal: Information Sciences
Volume: 367-368
Early online date: 7 Jul 2016
Publication status: Published - 1 Nov 2016
Externally published: Yes

Bibliographical note

This research is supported by the Basic Research Project of the Knowledge Innovation Program in Shenzhen (JCYJ20150324140036825), by the National Natural Science Foundation of China (71371063), by the Key Scientific Research Foundation of the Education Department of Hebei Province (ZD20131028), and by the Opening Fund of the Zhejiang Provincial Top Key Discipline of Computer Science and Technology at Zhejiang Normal University, China.

Keywords

  • Instance selection
  • Large data sets
  • MapReduce
  • Random weight networks
