Standard classification algorithms assume that the class distribution of the data is roughly balanced. The class imbalance problem, in which the number of examples belonging to one class is significantly higher than that of the others, commonly occurs in real-life applications such as direct marketing, fraud detection and churn prediction. When a standard classifier is trained on class-imbalanced data, it is usually biased toward the majority class. In this work, we propose two novel cost-sensitive methods to address the class imbalance problem, namely the Cost-Sensitive Deep Neural Network (CSDNN) and the Cost-Sensitive Deep Neural Network Ensemble (CSDE). CSDNN is a cost-sensitive version of Stacked Denoising Autoencoders. CSDE is an ensemble learning version of CSDNN. Random undersampling and layer-wise feature extraction from the hidden layers of the deep neural network are applied in CSDE to improve the generalization performance over CSDNN. Various methods for handling the class imbalance problem have been proposed in the literature. However, the experiments reported in those studies were usually conducted on relatively small data sets or on artificial data, so the performance of those methods on modern real-life data sets, which are more complicated, is unclear. In our experiments, we examine the performance of our proposed methods and of the other methods on six large real-life data sets from different business domains, ranging from direct marketing, churn prediction and default payment to firm fraud detection. The results show that the proposed methods are promising in handling the class imbalance problem and outperform all the other compared methods.
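As a rough illustration of two ingredients named in the abstract, random undersampling and cost-sensitive training, the following minimal NumPy sketch may help. It is not the authors' CSDNN or CSDE implementation: the function names, the inverse-frequency cost assignment and the cost-weighted log loss are assumptions chosen for illustration only, and no autoencoder or ensemble is built here.

```python
import numpy as np

def random_undersample(X, y, rng):
    """Keep all examples of the smallest class and an equally sized
    random sample (without replacement) of every other class."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original example order
    return X[keep], y[keep]

def inverse_frequency_costs(y):
    """Hypothetical cost assignment (an assumption, not from the paper):
    the misclassification cost of a class is inversely proportional
    to its frequency in y."""
    classes, counts = np.unique(y, return_counts=True)
    costs = counts.sum() / (len(classes) * counts)
    return dict(zip(classes.tolist(), costs.tolist()))

def cost_weighted_log_loss(y_true, p_pos, costs):
    """Binary cross-entropy in which each example is weighted by the
    cost attached to its true class (labels are 0 and 1)."""
    eps = 1e-12
    p = np.clip(p_pos, eps, 1 - eps)
    w = np.where(y_true == 1, costs[1], costs[0])
    loss = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return float(np.sum(w * loss) / np.sum(w))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = np.array([0] * 90 + [1] * 10)      # 9:1 imbalance
    X = np.arange(100).reshape(-1, 1)
    Xb, yb = random_undersample(X, y, rng)
    costs = inverse_frequency_costs(y)
    print(len(yb), int(yb.sum()), costs)
```

In an ensemble setting, one plausible use is to draw several different undersampled subsets with different random seeds and train one base learner per subset, while the per-class costs weight the training loss so that errors on the minority class are penalized more heavily.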
Bibliographical note
This research is supported by LEO Dr David P. Chan Institute of Data Science.
- Stacked denoising autoencoders
- Class imbalance