1 Introduction
2 Methodology
3 Sampling techniques
3.1 Random oversampling (ROS)
Algorithm 1: ROS
Step 1: Let D be the original dataset, where Dmin and Dmaj denote the minority and majority subsets, respectively.
Step 2: Build a new set E by appending examples selected at random, with replacement, from the minority class Dmin.
Step 3: The balanced dataset is D = Dmaj + Dmin + E.
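The steps above can be sketched in a few lines of NumPy (the function name, label encoding, and seeding are our own assumptions; the paper does not prescribe an implementation):

```python
import numpy as np

def random_oversample(X, y, minority_label, rng=None):
    """Balance (X, y) by appending randomly re-drawn minority examples (the set E)."""
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    n_extra = len(maj_idx) - len(min_idx)                    # |E|
    extra = rng.choice(min_idx, size=n_extra, replace=True)  # sample with replacement
    keep = np.concatenate([maj_idx, min_idx, extra])         # D = Dmaj + Dmin + E
    return X[keep], y[keep]
```

After resampling, both classes contain the same number of examples; only existing minority rows are duplicated, so no new feature values are introduced.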
3.2 Random under sampling (RUS)
Algorithm 2: RUS
Step 1: Let D be the original dataset, where Dmin and Dmaj denote the minority and majority subsets, respectively.
Step 2: Build a new set E as a random subset of the majority class Dmaj, sampled with or without replacement.
Step 3: The balanced dataset is D = (Dmaj − E) + Dmin.
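A minimal sketch of the procedure above, again with our own function name and conventions (here E is drawn without replacement, one of the two options the algorithm allows):

```python
import numpy as np

def random_undersample(X, y, minority_label, rng=None):
    """Balance (X, y) by discarding a random subset E of the majority class."""
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    # keep only |Dmin| majority examples, i.e. Dmaj - E
    keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([keep_maj, min_idx])
    return X[keep], y[keep]
```

Unlike ROS, this shrinks the training set, which can discard useful majority-class information when the imbalance is severe.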
3.3 Synthetic minority oversampling technique (SMOTE)
Algorithm 3: SMOTE
Step 1: Let D be the original dataset.
Step 2: Create the set I ⊂ D containing the minority-class observations.
Step 3: Choose K, the number of nearest neighbours to consider for each minority example.
Step 4: Choose N, the number of synthetic examples to be created.
Step 5: Draw a random sample of size N from I; for each drawn example x, select one of its K nearest minority neighbours xk and create the synthetic example x' = x + rand(0,1) × (xk − x).
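As an illustration, the interpolation step can be sketched with brute-force distances over the minority set I (a toy sketch under our own naming; production implementations such as imbalanced-learn's `SMOTE` use efficient neighbour searches and more safeguards):

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, rng=None):
    """Create n_synthetic points by interpolating each sampled minority
    example towards one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    # pairwise Euclidean distances within the minority set I
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]          # K nearest minority neighbours
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for j in range(n_synthetic):
        i = rng.integers(len(X_min))           # random minority example x
        xk = X_min[nn[i, rng.integers(k)]]     # random neighbour xk of x
        synthetic[j] = X_min[i] + rng.random() * (xk - X_min[i])
    return synthetic
```

Each synthetic point lies on the line segment between a minority example and one of its neighbours, so SMOTE never extrapolates outside the convex hull of the minority class.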
3.4 Adaptive synthetic sampling (ADASYN)
Algorithm 4: ADASYN
Step 1: Let D = {(x1,y1), (x2,y2), …, (xn,yn)} be the training set of n examples.
Step 2: Calculate the imbalance ratio IR = |Dmin| / |Dmaj|. If IR is below a preset threshold, the data are considered imbalanced.
Step 3: Identify G, the number of synthetic examples to generate for the minority class: G = (|Dmaj| − |Dmin|) × β, where β ∈ [0,1] specifies the desired balance level.
Step 4: For each minority instance xi, find its K nearest neighbours by Euclidean distance; let Δi be the number of those neighbours belonging to the majority class and calculate the ratio Ri = Δi / K.
Step 5: Normalize the ratios so they sum to one: r̂i = Ri / Σi Ri.
Step 6: Identify the number of synthetic examples (gi) for each minority instance: gi = r̂i × G.
Step 7: For each minority instance xi, generate gi synthetic examples by choosing a random minority neighbour xk and setting si = xi + rand(0,1) × (xk − xi).
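A compact sketch of these steps (function name, defaults, and the uniform fallback when no minority point has majority neighbours are our assumptions):

```python
import numpy as np

def adasyn(X, y, minority_label, beta=1.0, k=3, rng=None):
    """Generate synthetic minority points, weighting harder examples
    (those with more majority-class neighbours) more heavily."""
    rng = np.random.default_rng(rng)
    X_min = X[y == minority_label]
    n_maj = int((y != minority_label).sum())
    G = int((n_maj - len(X_min)) * beta)            # total synthetics (Step 3)
    # K nearest neighbours of each minority point in the full dataset (Step 4)
    d_all = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=-1)
    r = np.empty(len(X_min))
    for i in range(len(X_min)):
        nn = np.argsort(d_all[i])[1:k + 1]          # skip the point itself
        r[i] = (y[nn] != minority_label).mean()     # Ri = Δi / K
    # normalize (Step 5); fall back to uniform weights if every Ri is zero
    r_hat = r / r.sum() if r.sum() > 0 else np.full_like(r, 1 / len(r))
    g = np.rint(r_hat * G).astype(int)              # gi per minority point (Step 6)
    # minority-only neighbours, used for interpolation (Step 7)
    d_min = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d_min, np.inf)
    nn_min = np.argsort(d_min, axis=1)[:, :min(k, len(X_min) - 1)]
    out = []
    for i, gi in enumerate(g):
        for _ in range(gi):
            xk = X_min[nn_min[i, rng.integers(nn_min.shape[1])]]
            out.append(X_min[i] + rng.random() * (xk - X_min[i]))
    return np.array(out)
```

The key difference from SMOTE is the per-instance budget gi: minority points surrounded by majority examples receive more synthetic neighbours, shifting the decision boundary towards the harder regions.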
3.5 Edited nearest neighbour (ENN)
Algorithm 5: ENN
Step 1: Let D be the original training set and initialize the edited set T = D.
Step 2: For each xi in D, remove xi from T if it is misclassified by the k-NN rule applied to its k nearest neighbours in D.
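The editing rule above can be sketched as follows (our own naming; each point is judged by a majority vote of its k nearest neighbours, excluding itself):

```python
import numpy as np

def edited_nearest_neighbour(X, y, k=3):
    """Drop every example misclassified by the k-NN rule on the rest of the data."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point may not vote for itself
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        votes = y[np.argsort(d[i])[:k]]         # labels of the k nearest neighbours
        labels, counts = np.unique(votes, return_counts=True)
        if labels[np.argmax(counts)] != y[i]:   # majority vote disagrees
            keep[i] = False
    return X[keep], y[keep]
```

In effect ENN removes noisy and borderline examples, which smooths the class boundary rather than enforcing an exact class balance.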
3.6 Condensed nearest neighbour (CNN)
Algorithm 6: CNN
Step 1: Let D = {x1, x2, …, xn} be the original dataset.
Step 2: Move one randomly chosen instance from D into a new subset T.
Step 3: Scan the members of D; whenever an instance x is misclassified by the 1-NN rule using T as the reference set, add x to T.
Step 4: Repeat Step 3 until a complete pass over D adds no further instances.
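A sketch of the condensation loop (our own naming; the scan order is randomized, so the resulting subset T depends on the seed, as it does in the original algorithm):

```python
import numpy as np

def condensed_nearest_neighbour(X, y, rng=None):
    """Grow a subset T that classifies every example in D correctly with 1-NN."""
    rng = np.random.default_rng(rng)
    T = [int(rng.integers(len(X)))]             # start from one random instance
    changed = True
    while changed:                              # repeat until a full pass adds nothing
        changed = False
        for i in rng.permutation(len(X)):
            d = np.linalg.norm(X[T] - X[i], axis=1)
            if y[T[int(np.argmin(d))]] != y[i]: # misclassified by 1-NN on T
                T.append(int(i))
                changed = True
    return X[T], y[T]
```

The loop is guaranteed to terminate because T only grows and is bounded by |D|; the final T is a consistent subset, i.e. 1-NN on T reproduces the labels of all of D.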
4 Experimental setup
Figure 1. Rebalancing framework for imbalanced data classification.
Figure 2. Sample records of medical appointment no-show dataset.
5 Results and discussion
Table 1 Random under sampling performance with decision tree.
Sampling method | Precision | Recall | AUCROC | F1 Score |
---|---|---|---|---|
RUS_TRIAL1 | 0.81934847 | 0.666132 | 0.574591 | 0.734838 |
RUS_TRIAL2 | 0.28668942 | 0.501493 | 0.585529 | 0.364821 |
RUS_TRIAL3 | 0.295886076 | 0.586207 | 0.619411 | 0.39327 |
RUS_TRIAL4 | 0.851084813 | 0.677394 | 0.607102 | 0.754371 |
RUS_TRIAL5 | 0.810810811 | 0.655226 | 0.519437 | 0.724763 |
RUS_TRIAL6 | 0.826679649 | 0.673275 | 0.574101 | 0.742133 |
RUS_TRIAL7 | 0.851485149 | 0.667702 | 0.593466 | 0.748477 |
RUS_TRIAL8 | 0.866866867 | 0.679749 | 0.635887 | 0.761989 |
RUS_TRIAL9 | 0.830491474 | 0.669361 | 0.601898 | 0.741271 |
RUS_TRIAL10 | 0.841451767 | 0.684006 | 0.575977 | 0.754604 |
RUS_TRIAL11 | 0.295492487 | 0.55836 | 0.614722 | 0.386463 |
RUS_TRIAL12 | 0.273703041 | 0.496753 | 0.591256 | 0.352941 |
RUS_TRIAL13 | 0.827420901 | 0.693173 | 0.593065 | 0.754371 |
RUS_TRIAL14 | 0.83960396 | 0.667191 | 0.587395 | 0.743534 |
RUS_TRIAL15 | 0.840352595 | 0.682578 | 0.603679 | 0.753292 |
RUS_TRIAL16 | 0.281350482 | 0.589226 | 0.623086 | 0.380849 |
RUS_TRIAL17 | 0.859683794 | 0.671815 | 0.60312 | 0.754226 |
RUS_TRIAL18 | 0.843062201 | 0.691523 | 0.594228 | 0.75981 |
RUS_TRIAL19 | 0.832358674 | 0.697143 | 0.619238 | 0.758774 |
RUS_TRIAL20 | 0.81237525 | 0.650679 | 0.555999 | 0.722592 |
RUS TRIAL MEAN | 0.699309894 | 0.642949 | 0.593659 | 0.669946 |
Table 2 Random oversampling performance with decision tree.
Sampling method | Precision | Recall | AUCROC | F1 Score |
---|---|---|---|---|
ROS_TRIAL1 | 0.853307766 | 0.696401 | 0.610623 | 0.766911 |
ROS_TRIAL2 | 0.298611111 | 0.508876 | 0.594374 | 0.376368 |
ROS_TRIAL3 | 0.854684512 | 0.696804 | 0.608655 | 0.767711 |
ROS_TRIAL4 | 0.83109405 | 0.687847 | 0.585859 | 0.752716 |
ROS_TRIAL5 | 0.847152847 | 0.669826 | 0.605871 | 0.748125 |
ROS_TRIAL6 | 0.838461538 | 0.689873 | 0.594937 | 0.756944 |
ROS_TRIAL7 | 0.854917235 | 0.682737 | 0.604107 | 0.759187 |
ROS_TRIAL8 | 0.840770791 | 0.657937 | 0.598086 | 0.738201 |
ROS_TRIAL9 | 0.853472883 | 0.700781 | 0.609766 | 0.769627 |
ROS_TRIAL10 | 0.304878049 | 0.511696 | 0.597263 | 0.382096 |
ROS_TRIAL11 | 0.293165468 | 0.506211 | 0.59935 | 0.371298 |
ROS_TRIAL12 | 0.83030303 | 0.662903 | 0.598118 | 0.73722 |
ROS_TRIAL13 | 0.850943396 | 0.70304 | 0.602309 | 0.769953 |
ROS_TRIAL14 | 0.313559322 | 0.534682 | 0.605858 | 0.395299 |
ROS_TRIAL15 | 0.861248761 | 0.67208 | 0.608027 | 0.754996 |
ROS_TRIAL16 | 0.298181818 | 0.514107 | 0.60639 | 0.377445 |
ROS_TRIAL17 | 0.844054581 | 0.687847 | 0.60932 | 0.757987 |
ROS_TRIAL18 | 0.850551655 | 0.663537 | 0.600402 | 0.745495 |
ROS_TRIAL19 | 0.843444227 | 0.67874 | 0.596946 | 0.752182 |
ROS_TRIAL20 | 0.838150289 | 0.690476 | 0.598179 | 0.75718 |
ROS TRIAL MEAN | 0.710047667 | 0.64082 | 0.601722 | 0.67366 |
Table 3 Adaptive synthetic sampling performance with decision tree.
Sampling method | Precision | Recall | AUCROC | F1 Score |
---|---|---|---|---|
ADASYN_TRIAL1 | 0.790373654 | 0.991263 | 0.510294 | 0.879493 |
ADASYN_TRIAL2 | 0.806615776 | 0.98677 | 0.510846 | 0.887644 |
ADASYN_TRIAL3 | 0.800127714 | 0.989731 | 0.526303 | 0.884887 |
ADASYN_TRIAL4 | 0.804692454 | 0.987549 | 0.504885 | 0.886792 |
ADASYN_TRIAL5 | 0.788973384 | 0.993615 | 0.516981 | 0.879548 |
ADASYN_TRIAL6 | 0.789574062 | 0.987281 | 0.509723 | 0.877428 |
ADASYN_TRIAL7 | 0.815974441 | 0.98534 | 0.518986 | 0.892695 |
ADASYN_TRIAL8 | 0.805590851 | 0.987539 | 0.509592 | 0.887334 |
ADASYN_TRIAL9 | 0.799492386 | 0.990566 | 0.513576 | 0.884831 |
ADASYN_TRIAL10 | 0.778693722 | 0.991922 | 0.513917 | 0.872469 |
ADASYN_TRIAL11 | 0.80393401 | 0.989071 | 0.51021 | 0.886944 |
ADASYN_TRIAL12 | 0.790343075 | 0.992817 | 0.520904 | 0.880085 |
ADASYN_TRIAL13 | 0.806883365 | 0.984448 | 0.50974 | 0.886865 |
ADASYN_TRIAL14 | 0.78643853 | 0.989633 | 0.507822 | 0.876412 |
ADASYN_TRIAL15 | 0.79949077 | 0.985871 | 0.509807 | 0.882953 |
ADASYN_TRIAL16 | 0.792141952 | 0.993641 | 0.517288 | 0.881523 |
ADASYN_TRIAL17 | 0.813095995 | 0.986122 | 0.507912 | 0.891289 |
ADASYN_TRIAL18 | 0.812903226 | 0.971473 | 0.507188 | 0.885142 |
ADASYN_TRIAL19 | 0.778059607 | 0.991115 | 0.512132 | 0.871758 |
ADASYN_TRIAL20 | 0.789240506 | 0.991256 | 0.508786 | 0.878788 |
ADASYN TRIAL MEAN | 0.793672332 | 0.970829 | 0.516991 | 0.873358 |
Table 4 Synthetic minority oversampling performance with decision tree.
Sampling method | Precision | Recall | AUCROC | F1 Score |
---|---|---|---|---|
SMOTE_TRIAL1 | 0.298013245 | 0.555556 | 0.611634 | 0.387931 |
SMOTE_TRIAL2 | 0.857976654 | 0.694488 | 0.626032 | 0.767624 |
SMOTE_TRIAL3 | 0.273462783 | 0.53481 | 0.592561 | 0.361884 |
SMOTE_TRIAL4 | 0.8382643 | 0.669291 | 0.586161 | 0.744308 |
SMOTE_TRIAL5 | 0.827884615 | 0.689904 | 0.590691 | 0.752622 |
SMOTE_TRIAL6 | 0.290552585 | 0.512579 | 0.601063 | 0.370876 |
SMOTE_TRIAL7 | 0.851669941 | 0.674708 | 0.597672 | 0.752931 |
SMOTE_TRIAL8 | 0.842105263 | 0.666667 | 0.611742 | 0.744186 |
SMOTE_TRIAL9 | 0.841359773 | 0.699372 | 0.592017 | 0.763823 |
SMOTE_TRIAL10 | 0.841112214 | 0.672756 | 0.601774 | 0.747573 |
SMOTE_TRIAL11 | 0.85257032 | 0.688332 | 0.608872 | 0.761698 |
SMOTE_TRIAL12 | 0.310344828 | 0.49422 | 0.595595 | 0.381271 |
SMOTE_TRIAL13 | 0.855327468 | 0.680934 | 0.605546 | 0.758232 |
SMOTE_TRIAL14 | 0.834834835 | 0.661905 | 0.588305 | 0.73838 |
SMOTE_TRIAL15 | 0.848393574 | 0.664308 | 0.601971 | 0.74515 |
SMOTE_TRIAL16 | 0.303664921 | 0.494318 | 0.587303 | 0.376216 |
SMOTE_TRIAL17 | 0.290718039 | 0.533762 | 0.609783 | 0.376417 |
SMOTE_TRIAL18 | 0.844984802 | 0.664013 | 0.609623 | 0.743647 |
SMOTE_TRIAL19 | 0.851887706 | 0.684825 | 0.599555 | 0.759275 |
SMOTE_TRIAL20 | 0.846989141 | 0.675591 | 0.602947 | 0.751643 |
SMOTE TRIAL MEAN | 0.68010585 | 0.630617 | 0.601042 | 0.654427 |
Table 5 Edited nearest neighbour performance with decision tree.
Sampling method | Precision | Recall | AUCROC | F1 Score |
---|---|---|---|---|
ENN_TRIAL1 | 0.791353383 | 1 | 0.505935 | 0.883526 |
ENN_TRIAL2 | 0.805886036 | 0.999224 | 0.502817 | 0.892201 |
ENN_TRIAL3 | 0.796493425 | 1 | 0.504573 | 0.88672 |
ENN_TRIAL4 | 0.799249531 | 1 | 0.501553 | 0.888425 |
ENN_TRIAL5 | 0.780839073 | 1 | 0.504249 | 0.876934 |
ENN_TRIAL6 | 0.793621013 | 1 | 0.501511 | 0.884937 |
ENN_TRIAL7 | 0.787593985 | 0.998411 | 0.502138 | 0.88056 |
ENN_TRIAL8 | 0.811166876 | 0.999227 | 0.507784 | 0.895429 |
ENN_TRIAL9 | 0.79111945 | 1 | 0.501493 | 0.88338 |
ENN_TRIAL10 | 0.790100251 | 1 | 0.5059 | 0.882744 |
ENN_TRIAL11 | 0.793988729 | 1 | 0.504518 | 0.885166 |
ENN_TRIAL12 | 0.808630394 | 1 | 0.501629 | 0.894191 |
ENN_TRIAL13 | 0.78334377 | 1 | 0.504298 | 0.878511 |
ENN_TRIAL14 | 0.794855709 | 0.998424 | 0.505254 | 0.885086 |
ENN_TRIAL15 | 0.7922403 | 0.999211 | 0.501107 | 0.88377 |
ENN_TRIAL16 | 0.805764411 | 1 | 0.506369 | 0.892436 |
ENN_TRIAL17 | 0.778473091 | 1 | 0.502809 | 0.87544 |
ENN_TRIAL18 | 0.80075188 | 0.998438 | 0.502344 | 0.888734 |
ENN_TRIAL19 | 0.796875 | 1 | 0.5 | 0.886957 |
ENN_TRIAL20 | 0.790362954 | 1 | 0.502967 | 0.882908 |
ENN TRIAL MEAN | 0.794635463 | 0.999647 | 0.503462 | 0.885429 |
Table 6 Condensed nearest neighbour performance with decision tree.
Sampling method | Precision | Recall | AUCROC | F1 Score |
---|---|---|---|---|
CNN_TRIAL1 | 0.803154574 | 0.995309 | 0.511673 | 0.888966 |
CNN_TRIAL2 | 0.784770296 | 1 | 0.515581 | 0.879408 |
CNN_TRIAL3 | 0.802639849 | 0.996877 | 0.506276 | 0.889276 |
CNN_TRIAL4 | 0.802639849 | 0.998436 | 0.510122 | 0.889895 |
CNN_TRIAL5 | 0.791457286 | 0.998415 | 0.508083 | 0.882971 |
CNN_TRIAL6 | 0.798865069 | 0.995287 | 0.509876 | 0.886324 |
CNN_TRIAL7 | 0.803650094 | 0.999218 | 0.515137 | 0.890827 |
CNN_TRIAL8 | 0.797604035 | 0.998421 | 0.517229 | 0.886786 |
CNN_TRIAL9 | 0.792085427 | 1 | 0.511799 | 0.883982 |
CNN_TRIAL10 | 0.797356828 | 0.998424 | 0.512807 | 0.886634 |
CNN_TRIAL11 | 0.79886148 | 0.995272 | 0.517273 | 0.886316 |
CNN_TRIAL12 | 0.797986155 | 0.998425 | 0.512849 | 0.887023 |
CNN_TRIAL13 | 0.804770873 | 0.999221 | 0.509074 | 0.891516 |
CNN_TRIAL14 | 0.789077213 | 0.998411 | 0.506537 | 0.881487 |
CNN_TRIAL15 | 0.795725959 | 0.999211 | 0.511617 | 0.885934 |
CNN_TRIAL16 | 0.805782527 | 1 | 0.514151 | 0.892447 |
CNN_TRIAL17 | 0.784722222 | 0.995994 | 0.513622 | 0.877825 |
CNN_TRIAL18 | 0.795725959 | 0.998423 | 0.509754 | 0.885624 |
CNN_TRIAL19 | 0.811439346 | 1 | 0.514563 | 0.895906 |
CNN_TRIAL20 | 0.790302267 | 0.996823 | 0.510142 | 0.88163 |
CNN TRIAL MEAN | 0.797430865 | 0.998108 | 0.511908 | 0.886555 |
Table 7 Comparative study of different sampling performances with decision tree.
Sampling method | Precision | Recall | AUCROC | F1 Score |
---|---|---|---|---|
RUS | 0.69931 | 0.642949 | 0.593659 | 0.669946 |
ROS | 0.710048 | 0.64082 | 0.601722 | 0.67366 |
ADASYN | 0.789241 | 0.991256 | 0.508786 | 0.873358 |
SMOTE | 0.680106 | 0.630617 | 0.601042 | 0.654427 |
ENN | 0.794635 | 0.999647 | 0.503462 | 0.885429 |
CNN | 0.797431 | 0.998108 | 0.511908 | 0.886555 |
6 Conclusion
Figure 3. Comparative study of different sampling performances.