Top K representative: a method to select representative samples based on K nearest neighbors

Kai YANG, Yi CAI*, Zhiwei CAI, Haoran XIE, Tak Lam WONG, Wai Hong CHAN

*Corresponding author for this work

Research output: Journal Publications > Journal Article (refereed) > peer-review

7 Citations (Scopus)

Abstract

Short text categorization relies on supervised learning, which requires a large amount of labeled training data and therefore consumes considerable human labor. Active learning is a way to reduce the number of manually labeled samples in traditional supervised learning problems. It does so by selecting the most representative samples to stand in for the entire training set. Uncertainty sampling is one active learning strategy, but it is easily affected by outliers. In this paper, a new sampling method called Top K representative (TKR) is proposed to solve the problem caused by outliers. However, TKR optimization is NP-hard (nondeterministic polynomial-time hard), which makes exact solutions challenging to obtain. To tackle this problem, we propose a new approach based on the greedy algorithm, which obtains approximate solutions and thereby achieves high performance. Experiments show that our proposed sampling method outperforms the existing methods in terms of efficiency.
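
The abstract does not give the TKR objective or the exact greedy procedure, so the following is only an illustrative sketch of the general idea: greedily picking samples whose K-nearest-neighbor neighborhoods cover many not-yet-covered samples. The coverage objective, the Euclidean distance, and the names `knn_sets` and `greedy_representatives` are assumptions made for illustration, not the authors' method.

```python
import math


def knn_sets(points, k):
    """For each point index, return the set holding that index and its k nearest neighbors."""
    neighborhoods = []
    for i, p in enumerate(points):
        # Sort the other indices by Euclidean distance; ties break on the lower index.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        neighborhoods.append({i, *others[:k]})
    return neighborhoods


def greedy_representatives(points, k, budget):
    """Greedily pick up to `budget` indices, each maximizing newly covered points.

    ASSUMPTION: "representative" is modeled as neighborhood coverage; the
    paper's actual TKR objective may differ.
    """
    neighborhoods = knn_sets(points, k)
    covered = set()
    chosen = []
    for _ in range(min(budget, len(points))):
        # Pick the unchosen point with the largest coverage gain (lowest index on ties).
        best = max(
            (i for i in range(len(points)) if i not in chosen),
            key=lambda i: (len(neighborhoods[i] - covered), -i),
        )
        chosen.append(best)
        covered |= neighborhoods[best]
    return chosen


# Two tight clusters plus one far-away outlier at index 6.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(greedy_representatives(points, k=2, budget=2))  # [0, 3]
```

In this toy input, one sample is chosen from each cluster and the outlier is skipped, because once the clusters are covered the outlier's neighborhood adds little. That is the kind of outlier robustness the abstract attributes to representative-based selection, shown here only under the stated assumptions.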

Original language: English
Pages (from-to): 2119-2129
Number of pages: 11
Journal: International Journal of Machine Learning and Cybernetics
Volume: 10
Issue number: 8
Early online date: 12 Dec 2017
Publication status: Published - 1 Aug 2019
Externally published: Yes

Funding

This work is supported by the Fundamental Research Funds for the Central Universities, SCUT (No. 2017ZD048), Tiptop Scientific and Technical Innovative Youth Talents of Guangdong special support program (No. 2015TQ01X633), Science and Technology Planning Project of Guangdong Province, China (No. 2017B050506004), Science and Technology Program of Guangzhou (International Science & Technology Cooperation Program No. 201704030076), and the Internal Research Grant (RG 66/2016-2017) and the Funding Support to ECS Proposal (RG 23/2017-2018R) of The Education University of Hong Kong.

Keywords

  • Active learning
  • Text categorization
