TY - JOUR
T1 - Top K representative: a method to select representative samples based on K nearest neighbors
AU - YANG, Kai
AU - CAI, Yi
AU - CAI, Zhiwei
AU - XIE, Haoran
AU - WONG, Tak Lam
AU - CHAN, Wai Hong
PY - 2019/8/1
Y1 - 2019/8/1
N2 - Short text categorization involves the use of a supervised learning process that requires a large amount of labeled data for training and therefore consumes considerable human labor. Active learning is a way to reduce the number of manually labeled samples in traditional supervised learning problems. In active learning, the number of samples is reduced by selecting the most representative samples to represent an entire training set. Uncertainty sampling is one active learning strategy, but it is easily affected by outliers. In this paper, a new sampling method called Top K representative (TKR) is proposed to solve the problem caused by outliers. However, TKR optimization is an NP-hard (nondeterministic polynomial-time hard) problem, making it challenging to obtain exact solutions. To tackle this problem, we propose a new approach based on the greedy algorithm, which can obtain approximate solutions and thereby achieve high performance. Experiments show that our proposed sampling method outperforms the existing methods in terms of efficiency.
AB - Short text categorization involves the use of a supervised learning process that requires a large amount of labeled data for training and therefore consumes considerable human labor. Active learning is a way to reduce the number of manually labeled samples in traditional supervised learning problems. In active learning, the number of samples is reduced by selecting the most representative samples to represent an entire training set. Uncertainty sampling is one active learning strategy, but it is easily affected by outliers. In this paper, a new sampling method called Top K representative (TKR) is proposed to solve the problem caused by outliers. However, TKR optimization is an NP-hard (nondeterministic polynomial-time hard) problem, making it challenging to obtain exact solutions. To tackle this problem, we propose a new approach based on the greedy algorithm, which can obtain approximate solutions and thereby achieve high performance. Experiments show that our proposed sampling method outperforms the existing methods in terms of efficiency.
KW - Active learning
KW - Text categorization
UR - http://www.scopus.com/inward/record.url?scp=85069435292&partnerID=8YFLogxK
U2 - 10.1007/s13042-017-0755-8
DO - 10.1007/s13042-017-0755-8
M3 - Journal Article (refereed)
AN - SCOPUS:85069435292
SN - 1868-8071
VL - 10
SP - 2119
EP - 2129
JO - International Journal of Machine Learning and Cybernetics
JF - International Journal of Machine Learning and Cybernetics
IS - 8
ER -