In data mining, classification is the task of building a model that assigns data to a given set of categories. Most classification algorithms assume that the class distribution of the data is roughly balanced. In real-life applications such as direct marketing, fraud detection and churn prediction, however, the class imbalance problem commonly occurs. The class imbalance problem refers to the situation in which the number of examples belonging to one class is significantly greater than the number belonging to the others. When a standard classifier is trained on class-imbalanced data, it is usually biased toward the majority class. However, the minority class is typically the class of interest and is more significant than the majority class. In the literature, data-level, algorithm-level and cost-sensitive learning methods have been proposed to address this problem. The experiments reported in these studies were usually conducted on relatively small data sets or even on artificial data, so the performance of the methods on modern real-life data sets, which are more complicated, is unclear. In this research, we study the background of the class imbalance problem and some of the state-of-the-art approaches to handling it. We also propose two cost-sensitive methods to address the problem, namely Cost-Sensitive Deep Neural Network (CSDNN) and Cost-Sensitive Deep Neural Network Ensemble (CSDE). CSDNN is a deep neural network based on Stacked Denoising Autoencoders (SDAE). We obtain CSDNN by incorporating the cost information of the majority and minority classes into the cost function of SDAE to make it cost-sensitive. The second method, CSDE, is an ensemble learning version of CSDNN designed to improve generalization performance on the class imbalance problem. In the first step, a deep neural network based on SDAE is created for layer-wise feature extraction. Next, we perform Bagging’s resampling procedure with undersampling to split the training data into a number of bootstrap samples.
In the third step, we apply a layer-wise feature extraction method to extract new feature samples from each hidden layer of the SDAE. Lastly, ensemble learning is performed by using each of the new feature samples to train a CSDNN classifier with a random cost vector. Experiments are conducted to compare the proposed methods with existing methods on real-life data sets from business domains. The results show that the proposed methods are promising in handling the class imbalance problem and outperform all the other compared methods. This work makes three major contributions. First, we propose the CSDNN method, in which misclassification costs are considered in the training process. Second, we incorporate random undersampling with layer-wise feature extraction to perform ensemble learning. Third, this is the first work that conducts experiments on the class imbalance problem using large real-life data sets in different business domains, ranging from direct marketing, churn prediction, credit scoring and fraud detection to fake review detection.
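The two ingredients described above, a cost-weighted loss and undersampled bootstrap resampling with a random cost vector, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the choice of cross-entropy as the base loss, the balanced per-class sampling scheme, and the cost range `[1, max_minority_cost]` are all assumptions made for illustration, and the SDAE feature extraction itself is omitted.

```python
import math
import random


def cost_sensitive_cross_entropy(probs, labels, costs):
    """Mean cross-entropy with each example weighted by the cost of its true class.

    probs  : list of rows, one predicted probability per class
    labels : list of true class indices
    costs  : costs[c] is the assumed penalty for misclassifying class c
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for row, y in zip(probs, labels):
        total += -costs[y] * math.log(max(row[y], eps))
    return total / len(labels)


def undersampled_bootstrap(labels, rng):
    """Return example indices for one class-balanced bootstrap sample.

    Assumed scheme: every class is sampled with replacement n times,
    where n is the size of the smallest class, so the majority class
    is undersampled down to the minority class size.
    """
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n = min(len(idx) for idx in by_class.values())
    sample = []
    for idx in by_class.values():
        sample.extend(rng.choice(idx) for _ in range(n))
    rng.shuffle(sample)
    return sample


def random_cost_vector(rng, max_minority_cost=10.0):
    """Cost 1 for the majority class (index 0) and a random cost in
    [1, max_minority_cost] for the minority class (index 1).
    The range is a hypothetical choice, not taken from the thesis."""
    return [1.0, rng.uniform(1.0, max_minority_cost)]
```

In an ensemble along these lines, each base classifier would call `undersampled_bootstrap` and `random_cost_vector` once with a shared `random.Random` instance, then minimize `cost_sensitive_cross_entropy` on its own sample. Setting every cost to 1 recovers the ordinary cross-entropy, which makes the effect of the cost vector easy to check.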
Department of Computing and Decision Sciences
Man Leung WONG (Supervisor)