Abstract
Instance selection is an important preprocessing step in machine learning. By choosing a subset of a data set, it achieves the same performance from a machine learning algorithm as if the whole data set were used, and it makes machine learning algorithms feasible and effective on large data sets. Based on a voting mechanism, this paper proposes an instance selection algorithm for large data sets that uses MapReduce and random weight networks (RWNs). First, the proposed algorithm employs the Map phase of MapReduce to partition the large data set into small subsets and deploys them to different cloud computing nodes. Second, the informative instances are selected in parallel with an instance selection algorithm. Third, the Reduce phase of MapReduce collects the selected instances from the different cloud computing nodes, yielding a selected instance subset. These three steps are repeated p times (p is a user-defined parameter), so that p instance subsets are obtained. Finally, the voting method is used to select the most informative instances from the p subsets. A random weight network classifier is trained on the selected instance subset, and its accuracy is verified on the testing set. The proposed algorithm is experimentally compared with three state-of-the-art approaches: CNN, ENN and RNN. The experimental results show that the proposed algorithm is effective and efficient.
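The partition, select, collect, repeat and vote steps described in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation: the abstract does not name the per-node selector, the partitioning scheme, or the voting threshold, so the condensed-nearest-neighbour-style selector, the random round-robin split, and the names `select_in_partition`, `one_round`, `voting_instance_selection`, `n_parts` and `min_votes` are all assumptions made for illustration. The Map and Reduce phases are simulated with plain Python lists rather than a real MapReduce cluster.

```python
import random
from collections import Counter


def select_in_partition(partition, data):
    """Hypothetical per-node selector (assumption, not from the paper).

    Condensed-nearest-neighbour-style pass: an instance is kept only if
    its nearest already-kept instance carries a different label.
    """
    kept = []
    for idx in partition:
        x, y = data[idx]
        if not kept:
            kept.append(idx)
            continue
        nearest = min(
            kept,
            key=lambda j: sum((a - b) ** 2 for a, b in zip(data[j][0], x)),
        )
        if data[nearest][1] != y:
            kept.append(idx)
    return kept


def one_round(data, n_parts, rng):
    """One partition / select / collect pass over the whole data set."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    # "Map": split the shuffled indices into n_parts disjoint subsets.
    partitions = [indices[k::n_parts] for k in range(n_parts)]
    # Selection per subset (sequential here; parallel on a real cluster).
    selected = [select_in_partition(part, data) for part in partitions]
    # "Reduce": merge the selected indices from all subsets.
    return {i for part in selected for i in part}


def voting_instance_selection(data, p=5, n_parts=4, min_votes=3, seed=0):
    """Repeat the round p times and keep instances with enough votes."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(p):
        votes.update(one_round(data, n_parts, rng))
    # Assumed voting rule: keep an instance selected in >= min_votes rounds.
    return sorted(i for i, v in votes.items() if v >= min_votes)
```

Here `data` is assumed to be a list of `(feature_tuple, label)` pairs, and the function returns the indices of the retained instances. Raising `min_votes` toward `p` keeps only instances that were selected in nearly every round, which is one plausible reading of "most informative"; the paper itself may define the vote differently.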
Original language | English |
---|---|
Pages (from-to) | 1066-1077 |
Number of pages | 12 |
Journal | Information Sciences |
Volume | 367-368 |
Early online date | 7 Jul 2016 |
Publication status | Published - 1 Nov 2016 |
Externally published | Yes |
Bibliographical note
This research is supported by the Basic Research Project of Knowledge Innovation Program in Shenzhen (JCYJ20150324140036825), by the National Natural Science Foundation of China (71371063), by the Key Scientific Research Foundation of Education Department of Hebei Province (ZD20131028) and by the Opening Fund of Zhejiang Provincial Top Key Discipline of Computer Science and Technology at Zhejiang Normal University, China.
Keywords
- Instance selection
- Large data sets
- MapReduce
- Random weight networks