Hybrid classification method for imbalanced datasets
Tianxiang Gao
April 17, 2015
Contents
- Motivation
- Introduction
- Methodology
- Experimental results
- Conclusion and future research
Motivation
“Learning from Imbalanced Data Sets,” Proc. Am. Assoc. for Artificial Intelligence (AAAI) Workshop, N. Japkowicz, ed., 2000, (Technical Report WS-00-05).
“Workshop Learning from Imbalanced Data Sets II,” Proc. Int'l Conf. Machine Learning, N.V. Chawla, N. Japkowicz, and A. Kolcz, eds., 2003.
N.V. Chawla, N. Japkowicz, and A. Kolcz, “Editorial: Special Issue on Learning from Imbalanced Data Sets,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 1-6, 2004.
- Fraud/intrusion detection
- Medical diagnosis/monitoring
Contents
- Motivation
- Introduction
- Methodology
- Experimental results
- Conclusion and future research
Contents
- Motivation
- Introduction
- Classifier - Decision tree
- Evaluate performance
- Nature of problem
- Methodology
- Experimental results
- Conclusion and future research
Classifier
The classification problem is to correctly classify a previously unseen testing dataset based on a given training dataset. We deal with binary cases: a positive class and a negative class.
An algorithm that implements classification is known as a classifier.
A decision tree is a tree-like classifier.
Dataset of Playing Tennis
| Outlook | Temp. | Humidity | Windy | Play |
| --- | --- | --- | --- | --- |
| Sunny | Hot | High | False | No |
| Sunny | Hot | High | True | No |
| Overcast | Hot | High | False | Yes |
| ... | ... | ... | ... | ... |
| Rainy | Mild | High | True | No |
Decision Tree
-Matt Tanner
Contents
- Motivation
- Introduction
- Classifier - Decision tree
- Evaluate performance
- Nature of problem
- Methodology
- Experimental results
- Conclusion and future research
Why?
- Multiple classifiers are available
- For each classifier, multiple choices are available for settings
- To choose best classifier
Cutoff value
Most algorithms classify via a two-step process:
1. Compute the probability of belonging to the $positive$ class.
2. Compare it to the cutoff value, and classify accordingly.
The default cutoff value is 0.5:
- If >= 0.5, classify as $positive$.
- If < 0.5, classify as $negative$.
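A minimal sketch of this two-step process, assuming a scikit-learn-style classifier with `predict_proba` (the slides do not name a library, and `classify_with_cutoff` is a hypothetical helper):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def classify_with_cutoff(clf, X, cutoff=0.5):
    """Step 1: probability of the positive class; step 2: compare to cutoff."""
    p_pos = clf.predict_proba(X)[:, 1]    # assumes classes are coded {0, 1}
    return (p_pos >= cutoff).astype(int)  # 1 = positive, 0 = negative

# usage: clf = DecisionTreeClassifier().fit(X_train, y_train)
#        y_hat = classify_with_cutoff(clf, X_test, cutoff=0.5)
```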
Confusion Matrix
|  | Predicted + | Predicted - |
| --- | --- | --- |
| Actual + | True Positive (TP) | False Negative (FN) |
| Actual - | False Positive (FP) | True Negative (TN) |
Accuracy
$ ACC = \frac{TP+TN}{TP + TN + FN + FP}$
Accuracy may not be very useful for imbalanced datasets.
TPR & FPR
True positive rate (TPR) $= \frac{TP}{TP + FN} $
False positive rate (FPR) $= \frac{FP}{FP + TN} $
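All three quantities follow directly from the confusion-matrix cells; a small sketch, assuming binary labels coded {0, 1}:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """ACC, TPR, and FPR from binary labels coded {0, 1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    acc = (tp + tn) / (tp + tn + fn + fp)
    tpr = tp / (tp + fn)  # true positive rate (sensitivity/recall)
    fpr = fp / (fp + tn)  # false positive rate
    return acc, tpr, fpr
```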
ROC curve
In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate against the false positive rate at various threshold settings.
-Wikipedia
ROC & AUC
The Area Under the ROC Curve (AUC) quantifies the overall performance of a classifier.
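For example, with scikit-learn (an assumed tool, not named in the slides), AUC can be computed from the positive-class probabilities without fixing any single cutoff:

```python
from sklearn.metrics import roc_auc_score

# scores are the positive-class probabilities from step 1 of the classifier
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```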
Class Balance Accuracy
$CBA = \frac{\sum_{i=1}^{k} \frac{C_{ii}}{\max(C_{i.},\, C_{.i})}}{k}$
where $C_{i.} = \sum_{j=1}^{k} C_{ij}$ and $C_{.i} = \sum_{j=1}^{k} C_{ji}$.
-L. Mosley and S. Olafsson 2013.
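A direct translation of the formula; a sketch assuming `C` is the $k \times k$ confusion matrix with rows as actual classes and columns as predicted classes:

```python
import numpy as np

def class_balance_accuracy(C):
    """CBA from a k x k confusion matrix C (rows = actual, cols = predicted)."""
    C = np.asarray(C, dtype=float)
    row_sums = C.sum(axis=1)  # C_i.
    col_sums = C.sum(axis=0)  # C_.i
    return np.mean(np.diag(C) / np.maximum(row_sums, col_sums))
```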
Confusion Matrix
|  | Predicted + | Predicted - |
| --- | --- | --- |
| Actual + | True Positive (TP) | False Negative (FN) |
| Actual - | False Positive (FP) | True Negative (TN) |
Contents
- Motivation
- Introduction
- Classifier - Decision tree
- Evaluate performance
- Nature of problem
- Methodology
- Experimental results
- Conclusion and future research
Nature of problem
Class overlap, within-class imbalance, small disjuncts, and the authenticity of the data.
Contents
- Motivation
- Introduction
- Methodology
- Experimental results
- Conclusion and future research
Contents
- Motivation
- Introduction
- Methodology
- Sampling
- Instance Selection
- Hybrid method
- Experimental results
- Conclusion and future research
Sampling
- Undersampling
- Oversampling (with replacement)
- SMOTE
Under-sampling the majority class enables better classifiers to be built than over-sampling the minority class.
If we replicate the minority class, the decision region for the minority class becomes very specific and will cause new splits in the decision tree...in essence, overfitting.
Replication of the minority class does not cause its decision boundary to spread into the majority class region.
- Chawla, Nitesh V., et al 2002.
SMOTE
SMOTE stands for Synthetic Minority Over-sampling Technique
- Chawla, Nitesh V., et al 2002.
SMOTE
- He, Haibo, Learning from imbalanced datasets 2009
Continuous - $x_{new} = x_i + (\hat{x}_i - x_i) \times \delta$, where $\hat{x}_i$ is one of the $k$ nearest minority neighbors of $x_i$ and $\delta \sim U(0,1)$
Categorical - $x_{new} = \mathrm{majorityVote}(\text{nearest neighbors of } x_i)$
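A minimal sketch of the continuous-feature case above (the categorical majority-vote branch is omitted), assuming numpy and scikit-learn; `smote_continuous` is a hypothetical helper name, not the authors' code:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_continuous(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority instances by interpolating between
    each seed x_i and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: self is nearest
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]  # skip self at position 0
        delta = rng.random()                # delta ~ U(0, 1)
        synthetic.append(X_min[i] + (X_min[j] - X_min[i]) * delta)
    return np.vstack(synthetic)
```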
SMOTE
tuning parameters
- number of nearest neighbors
- percentage of oversampling minority class
- percentage of undersampling majority class
Contents
- Motivation
- Introduction
- Methodology
- Sampling
- Instance Selection
- Hybrid method
- Experimental results
- Conclusion and future research
Instance Selection
Instance selection chooses a subset of the training dataset that removes superfluous instances while maintaining performance.
Wrapper and Filter
A wrapper selects a subset of instances based on the accuracy of a predefined classifier.
A filter ranks instances through a non-classifier-based function and then selects instances from the ranking.
Greedy Selection
Greedy selection is a two-step wrapper method: generate a number of candidate subsets, then start with one candidate subset and repeatedly merge in the other subsets whenever the merge improves the classifier's performance.
- W. Bennette and S. Olafsson 2013.
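A sketch of this wrapper under assumptions the slides leave open (validation AUC as the performance measure, a decision tree as the predefined classifier); `greedy_selection` is a hypothetical helper name:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def greedy_selection(subsets, X, y, X_val, y_val):
    """Start from the best candidate subset and keep merging the others
    whenever the merge improves validation AUC.  Assumes every candidate
    subset contains both classes."""
    def score(rows):
        rows = np.asarray(rows)
        clf = DecisionTreeClassifier().fit(X[rows], y[rows])
        return roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
    ranked = sorted(subsets, key=score, reverse=True)
    selected, best = list(ranked[0]), score(ranked[0])
    for cand in ranked[1:]:
        merged = selected + list(cand)
        s = score(merged)
        if s > best:  # keep the merge only if it helps
            selected, best = merged, s
    return np.asarray(selected)
```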
Greedy Selection
tuning parameters
- number of candidate subsets
- size of each subset
Contents
- Motivation
- Introduction
-
Methodology
- Sampling
- Instance Selection
- Hybrid method
- Experimental results
- Conclusion and future research
Hybrid Method
- The hybrid method is a combination of SMOTE and greedy selection.
- Generate synthetic instances for the minority class, and combine those synthetic instances with the majority instances.
- Select the ideal subset from the SMOTEd instances, using greedy selection, as the final training dataset.
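Putting the pieces together, a sketch of the pipeline built from the `smote_continuous` and `greedy_selection` sketches above (hypothetical helpers, not the author's code); pooling the original minority instances together with the synthetic ones is an assumption:

```python
import numpy as np

def hybrid_training_set(X, y, X_val, y_val,
                        oversample_pct=200, k=5, n_subsets=100, seed=0):
    """1) SMOTE the minority class, 2) pool the synthetic instances with the
    original data, 3) greedily select the final training subset."""
    rng = np.random.default_rng(seed)
    X_min = X[y == 1]                         # minority class coded 1
    n_new = int(len(X_min) * oversample_pct / 100)
    X_syn = smote_continuous(X_min, n_new, k=k, rng=seed)
    X_all = np.vstack([X, X_syn])
    y_all = np.concatenate([y, np.ones(n_new, dtype=y.dtype)])
    idx = rng.permutation(len(X_all))
    subsets = np.array_split(idx, n_subsets)  # candidate subsets
    rows = greedy_selection(subsets, X_all, y_all, X_val, y_val)
    return X_all[rows], y_all[rows]
```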
Analysis of Hybrid Method
Contents
- Motivation
- Introduction
- Methodology
- Experimental results
- Characteristics of Datasets
- SMOTEd Training Datasets
- Greedy-Selected Training Datasets
- Hybrid-Selected Training Datasets
- Results
- Conclusion and future research
Characteristics of Datasets
Four well-known imbalanced datasets from the UCI Machine Learning Repository, and one medical dataset.
- Chawla, Nitesh V. 2002.
- Gang Wu 2003.
- Nathalie Japkowicz 2004.
- Haibo He 2009.
Contents
- Motivation
- Introduction
- Methodology
- Experimental results
- Characteristics of Datasets
- SMOTEd Training Datasets
- Greedy-Selected Training Datasets
- Hybrid-Selected Training Datasets
- Conclusion and future research
SMOTEd Training Datasets
tuning parameters
- number of nearest neighbors
- percentage of oversampling
- percentage of undersampling
Number of nearest neighbors
Default value is 5.
percentage of undersampling
The percentage of undersampling is not the percentage by which the entire majority class is undersampled.
Instead, it sets the size of the retained majority class as a percentage of the SMOTEd minority class; in other words, it depends on the percentage of oversampling of the minority class.
We arbitrarily set it to 200%.
Note: if the number of majority instances required exceeds the number originally available, the majority class is oversampled with replacement.
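One reading of this rule as code; a sketch, assuming the percentage is taken relative to the SMOTEd minority-class size (`sample_majority` is a hypothetical helper name):

```python
import numpy as np

def sample_majority(X_maj, n_smoted_min, under_pct=200, seed=0):
    """Keep under_pct% of the SMOTEd minority-class size from the majority
    class; oversample with replacement if that exceeds what is available."""
    rng = np.random.default_rng(seed)
    n_keep = int(n_smoted_min * under_pct / 100)
    replace = n_keep > len(X_maj)  # oversample with replacement if needed
    rows = rng.choice(len(X_maj), size=n_keep, replace=replace)
    return X_maj[rows]
```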
percentage of oversampling
We implemented 10-fold cross-validation to select the ideal percentage of oversampling from the range 100% to 1000%.
We used both ACC and AUC as assessment metrics to evaluate performance.
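A sketch of that selection loop, reusing the `smote_continuous` sketch from earlier (AUC shown; ACC is analogous):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def cv_auc(X, y, pct, n_splits=10):
    """Mean 10-fold AUC when the minority class is SMOTEd by pct%."""
    aucs = []
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for tr, te in cv.split(X, y):
        X_min = X[tr][y[tr] == 1]
        X_syn = smote_continuous(X_min, int(len(X_min) * pct / 100))
        X_aug = np.vstack([X[tr], X_syn])
        y_aug = np.concatenate([y[tr], np.ones(len(X_syn), dtype=int)])
        clf = DecisionTreeClassifier().fit(X_aug, y_aug)
        aucs.append(roc_auc_score(y[te], clf.predict_proba(X[te])[:, 1]))
    return float(np.mean(aucs))

# evaluate 100% .. 1000% in steps of 100% and compare
scores = {pct: cv_auc(X_train, y_train, pct) for pct in range(100, 1100, 100)}
```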
percentage of oversampling
(Figures: 10-fold cross-validation ACC and AUC across oversampling percentages.)
From the previous figures and this table, 200% was selected for the yeast5 dataset since it has the highest AUC, competitive accuracy, and is more stable and simpler.
percentage of oversampling
Based on this strategy, the selected percentages for all datasets are listed in the table.
Contents
- Motivation
- Introduction
- Methodology
- Experimental results
- Characteristics of Datasets
- SMOTEd Training Datasets
- Greedy-Selected Training Datasets
- Hybrid-Selected Training Datasets
- Results
- Conclusion and future research
Greedy Selection
tuning parameters
- number of candidate subsets
- size of each candidate subset
tuning parameters
Usually, over 90 percent of the time consumed goes to finding appropriate tuning parameters, which is not very efficient.
Instead, we simply divided the dataset into 100 parts; the size of each subset is then fixed automatically, as sketched below.
The key is to choose an appropriate assessment metric.
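For instance, with numpy (`X_train` standing in for the preprocessed training data):

```python
import numpy as np

# 100 candidate subsets; each subset's size then follows automatically
idx = np.random.default_rng(0).permutation(len(X_train))
candidate_subsets = np.array_split(idx, 100)
```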
Contents
- Motivation
- Introduction
- Methodology
- Experimental results
- Characteristics of Datasets
- SMOTEd Training Datasets
- Greedy-Selected Training Datasets
- Hybrid-Selected Training Datasets
- Conclusion and future research
Hybrid-Selected Training Datasets
tuning parameters
- number of nearest neighbors
- percentage of oversampling
- percentage of undersampling
- number of candidate subsets
- size of each candidate subset
tuning parameters
We usually assign 5 to the number of nearest neighbors.
We also use 10-fold cross-validation to choose good sampling percentages.
We separate the dataset into 100 parts, so the size of each candidate subset is fixed.
tuning parameters
Based on this strategy, the selected percentages for each dataset are listed in the table.
Contents
- Motivation
- Introduction
- Methodology
- Experimental results
- Characteristics of Datasets
- SMOTEd Training Datasets
- Greedy-Selected Training Datasets
- Hybrid-Selected Training Datasets
- Results
- Conclusion and future research
Results
Randomly select 4/5 of a dataset as the original training dataset; the rest is the testing dataset.
Implement the strategies above to preprocess the data, yielding four different training datasets: Control, Greedy Selection, SMOTE, and Hybrid.
Fit a regular decision tree to each of the four training datasets.
Predict the test dataset.
Repeat these steps 100 times for each dataset, then evaluate the predictions by computing AUC, CBA, and ACC.
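A sketch of one replication of this protocol, reusing the `class_balance_accuracy` sketch from earlier; the `preprocess` argument stands in for each of the four strategies:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix

def one_replication(X, y, preprocess, seed):
    """One replication: 4/5 vs 1/5 split, preprocess the training data
    (identity for Control; SMOTE, greedy, or hybrid otherwise), fit a
    regular decision tree, and score the held-out fifth."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    X_tr, y_tr = preprocess(X_tr, y_tr)
    clf = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    p = clf.predict_proba(X_te)[:, 1]
    y_hat = (p >= 0.5).astype(int)
    auc = roc_auc_score(y_te, p)
    cba = class_balance_accuracy(confusion_matrix(y_te, y_hat))
    acc = float(np.mean(y_hat == y_te))
    return auc, cba, acc

# e.g. the Control arm: no preprocessing, repeated 100 times
control = [one_replication(X, y, lambda a, b: (a, b), s) for s in range(100)]
```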
Contents
- Motivation
- Introduction
- Methodology
- Experimental results
- Conclusion and future research
Conclusion
- The hybrid classification method makes decision trees work better.
- It is robust.
- Selecting an appropriate assessment metric is essential for wrapper-based methods.
- A comprehensive assessment metric works better than a non-comprehensive one.
Future research
- Choices of different assessment metrics
- Choices of various selection techniques
Hybrid classification method for imbalanced datasets
Tianxiang Gao
April 17, 2015