Abstract
The proximity weighted synthetic oversampling technique (ProWSyn) has two shortcomings: it does not remove noise samples before synthesizing new samples, and when its smoothing factor is restricted to the interval from 0 to 1, the resulting weight ratios cannot cover the entire search space. To address these issues, an improved proximity weighted synthetic oversampling technique (IProWSyn) is proposed. The weight calculation strategy is changed by introducing a general exponential function whose base lies in the interval from 0 to 1; dynamically varying the base allows the weights to cover a larger portion of the search space and thus to reach better values. IProWSyn, ASN-SMOTE, and ProWSyn are applied to six imbalanced datasets: ada, ecoli1, glass1, haberman, Pima, and yeast1, and the effectiveness of the methods is evaluated with a k-nearest neighbors (kNN) classifier and a neural network classifier. The experimental results show that on most datasets the F1, geometric mean (G-mean), and area under the curve (AUC) scores of IProWSyn are higher than those of the other oversampling methods, indicating that IProWSyn achieves better overall classification performance and better generalization on these datasets.
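As a rough illustration of the weighting idea described in the abstract, the sketch below contrasts ProWSyn-style proximity-level weights, commonly given as exp(-θ·(l-1)) for proximity level l and smoothing factor θ, with an IProWSyn-style variant that instead raises a base b in the open interval (0, 1) to the power (l-1). The function names, the exact formulas, and the normalization step are assumptions made for this sketch and may differ from the paper's actual definitions.

```python
import numpy as np

def prowsyn_weights(num_levels: int, theta: float) -> np.ndarray:
    """ProWSyn-style weights (assumed form): exp(-theta * (l - 1)) for
    proximity levels l = 1..num_levels, normalized to sum to 1."""
    levels = np.arange(1, num_levels + 1)
    raw = np.exp(-theta * (levels - 1))
    return raw / raw.sum()

def iprowsyn_weights(num_levels: int, base: float) -> np.ndarray:
    """IProWSyn-style weights (assumed form): a general exponential with
    base in (0, 1), i.e. base ** (l - 1), normalized to sum to 1.
    Varying `base` over (0, 1) spreads the ratio between consecutive
    levels over a wider range than exp(-theta) with theta limited to (0, 1)."""
    assert 0.0 < base < 1.0, "base must lie strictly between 0 and 1"
    levels = np.arange(1, num_levels + 1)
    raw = base ** (levels - 1)
    return raw / raw.sum()

if __name__ == "__main__":
    # With theta in (0, 1), the per-level decay factor exp(-theta) is confined
    # to roughly (0.37, 1); a base chosen freely in (0, 1) is not.
    print(prowsyn_weights(4, theta=0.5))  # approx. [0.46 0.28 0.17 0.10]
    print(iprowsyn_weights(4, base=0.1))  # much steeper decay, approx. [0.9001 0.0900 0.0090 0.0009]
```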
| Translated title of the contribution | Improved proximity weighted synthetic oversampling |
| --- | --- |
| Original language | Chinese (Simplified) |
| Journal | Shenzhen Daxue Xuebao (Ligong Ban)/Journal of Shenzhen University Science and Engineering |
| Publication status | E-pub ahead of print - 23 Aug 2024 |