Classifier Evaluation and Selection: Review and Overview of Methods
Things to Consider
- Interpretation vs. prediction
- Model parsimony vs. model error
- Type of prediction task:
  - Decisions: interested only in the resulting classification
  - Rankings: interested in ranking individuals by their true likelihood of an outcome
  - Estimates: interested in accurately predicting probabilities or a continuous outcome
Model Fit Statistics Summary

Prediction Type | Model Fit Statistics
Decisions       | Accuracy/misclassification, profit/loss, KS statistic
Rankings        | ROC index (concordance statistic), Gini coefficient
Estimates       | Average squared error, SBC/likelihood, MAPE, R-squared
Confusion Matrix
Metrics from the confusion matrix:
1. Accuracy: proportion of total predictions that were correct
2. Precision (positive predictive value): proportion of predicted positives that were actually positive
3. Negative predictive value: proportion of predicted negatives that were actually negative
4. Sensitivity (recall): proportion of actual positive cases correctly identified
5. Specificity: proportion of actual negative cases correctly identified
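As a quick sanity check, all five metrics follow directly from the four cells of the confusion matrix. A minimal sketch (the counts below are made up for illustration):

```python
# Confusion-matrix metrics from scratch; tp/fp/tn/fn counts are hypothetical.
tp, fp, tn, fn = 40, 10, 35, 15

accuracy    = (tp + tn) / (tp + fp + tn + fn)   # correct / total
precision   = tp / (tp + fp)                    # positive predictive value
npv         = tn / (tn + fn)                    # negative predictive value
sensitivity = tp / (tp + fn)                    # recall / true positive rate
specificity = tn / (tn + fp)                    # true negative rate

print(accuracy, precision, npv, sensitivity, specificity)
```

Note that precision and NPV condition on the *prediction*, while sensitivity and specificity condition on the *actual* class.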
Kolmogorov-Smirnov (KS) Statistic
[Chart: cumulative % of negative observations and cumulative % of positive observations vs. predicted probability from the model. At a predicted probability of 48%, 80% of negative observations fall below the cutoff but only 25% of positive observations do.]
The KS statistic is the maximum vertical distance between the two cumulative distribution curves.
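The maximum-distance definition translates directly into code. A minimal sketch (the function name `ks_statistic` is my own):

```python
import numpy as np

def ks_statistic(scores, labels):
    """KS = max vertical distance between the empirical CDFs of the
    predicted probabilities for negatives vs. positives."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    thresholds = np.unique(scores)
    # fraction of each class with predicted probability <= threshold
    cdf_pos = np.array([(pos <= t).mean() for t in thresholds])
    cdf_neg = np.array([(neg <= t).mean() for t in thresholds])
    return float(np.max(np.abs(cdf_neg - cdf_pos)))
```

A perfectly separating model gives KS = 1; a model no better than chance gives KS near 0.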
ROC Charts
Each point on the ROC curve corresponds to a fraction of cases, ordered by decreasing predicted value; its (x, y) coordinates assume we classify that fraction of cases as positive.
For example, one point might represent the 40% of cases with the highest predicted probabilities. If that group captures 70% of the actual positive cases, the true positive rate is 0.7; if it captures ~10% of the actual negative cases, the false positive rate is 0.1.
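The "classify the top fraction as positive" construction can be coded directly by sorting on predicted value and accumulating class counts. A sketch under that construction (function names are my own; ties in scores are ignored for simplicity):

```python
import numpy as np

def roc_points(scores, labels):
    """One (FPR, TPR) point per 'classify the top k as positive' rule,
    with cases ordered by decreasing predicted value."""
    order = np.argsort(-np.asarray(scores, float))
    y = np.asarray(labels)[order]
    tpr = np.concatenate(([0.0], np.cumsum(y) / y.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / (1 - y).sum()))
    return fpr, tpr

def auc_and_gini(scores, labels):
    fpr, tpr = roc_points(scores, labels)
    # trapezoidal area under the ROC curve
    auc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))
    return auc, 2 * (auc - 0.5)   # Gini = 2 * (AUC - 0.5)
```

A perfect ranking gives AUC = 1 (Gini = 1); a random ranking gives AUC = 0.5 (Gini = 0).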
Gini Coefficient
Gini = 2 × (shaded area between the ROC curve and the diagonal) = 2 × (AUC − 0.5)
ROC Charts for Decision Trees
A decision tree with leaf probabilities p = 3/4 and p = 1/3 yields only a few distinct points on the ROC curve, one per cutoff region:
- p_cutoff = 1: TPR = 0, FPR = 0
- 1/3 < p_cutoff < 3/4: TPR = 0.6, FPR = 0.2
- p_cutoff < 1/3: TPR = 1, FPR = 1
Response/Gain Charts
[Chart: cumulative % responders vs. percentile of modeled values, with observations sorted by predicted probability.]
In the example chart, of the top 18% of observations by predicted probability, 90% are responders (positive outcomes); the overall population response rate is ~27%.
Lift Chart
While it's great to know what percent of responders you should capture using the top p% of observations scored by the model, it's even better to know how this compares to random selection.

Lift = (% responders from model) / (% responders from random selection)
Cumulative Lift
[Chart: cumulative lift vs. depth.] At a depth of ~20%, the lift is almost 3.5: if we target the top 20% of customers as scored by our model, we'll get 3.5 times as many responders as we would by targeting customers at random.
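Cumulative lift at a given depth is just the response rate in the top slice divided by the overall response rate. A minimal sketch (the function name `cumulative_lift` is my own):

```python
import numpy as np

def cumulative_lift(scores, labels, depth):
    """Lift at `depth`: response rate among the top `depth` fraction of
    observations (ranked by predicted probability) over the overall rate."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels)
    k = max(1, int(round(depth * len(scores))))
    top = labels[np.argsort(-scores)][:k]   # labels of the top-scored slice
    return top.mean() / labels.mean()
```

With a 30% base rate, a model that puts all responders at the top has lift 1/0.3 ≈ 3.33 at 20% depth, i.e. the theoretical maximum there.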
Average Squared Error (ASE)

ASE = (1 / (nL)) × Σ_{i=1}^{n} Σ_{j=1}^{L} (ŷ_ij − y_ij)²

- For class targets, let L be the number of levels in the target.
- This objective function sets y_ij = 1 if observation i takes level j of the target and 0 otherwise.
- Computes the sum of squared errors using the predicted probabilities.

Example:
Name     | P(red) | P(blue) | P(none) | Actual
JimBob   | 0.3    | 0.4     | 0.3     | BLUE
BillyBob | 0.1    | 0.5     | 0.4     | NONE

ASE = [(0 − 0.3)² + (1 − 0.4)² + (0 − 0.3)² + (0 − 0.1)² + (0 − 0.5)² + (1 − 0.4)²] / (2 × 3) ≈ 0.1933
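The JimBob/BillyBob example above can be checked in a few lines by one-hot encoding the actual levels and averaging the squared differences over all n × L cells:

```python
import numpy as np

# Predicted probabilities for levels (red, blue, none) and one-hot actuals,
# using the two-observation example from the text.
probs = np.array([[0.3, 0.4, 0.3],    # JimBob
                  [0.1, 0.5, 0.4]])   # BillyBob
actual = np.array([[0, 1, 0],         # BLUE
                   [0, 0, 1]])        # NONE

n, L = probs.shape
ase = np.sum((probs - actual) ** 2) / (n * L)
print(round(ase, 4))  # 0.1933
```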
Decisions: Accounting for Profit/Loss (or other external evaluation metrics)
Decisions in SAS EM
- Enter information about profit/loss in the "decisions on a dataset" panel.
- Enterprise Miner calculates the most profitable or least costly decision for each observation.
- Click Build when first opening the prompt, then open the Decisions tab.
Decisions in SAS EM
- Decision and cost matrices do NOT affect:
  - Estimating parameters in the regression node
  - Learning weights in the neural network node
  - Growing decision trees
  - Fit statistics (residuals, error functions, misclassification rate)
- Decision and cost matrices DO affect:
  - Choice of models in the regression node
  - Pruning trees in the decision tree node
Undersampling/Oversampling and Prior Probabilities
These can be accounted for automatically in SAS EM.
Undersampling and Prior Probabilities
- Say you have a rare event as the target (<10% of the data):
  - Fraud
  - Catastrophic failure
  - 10%+ single-day change in the value of a stock market index
- Modeling may be difficult because a model that classifies everything as a nonevent is already highly accurate!
- Potential solution: create a biased sample.
  - Under-represent the common events in the training data.
  - Keep all rare events and only an equal number of common events.
Undersampling and Prior Probabilities
- Models provide posterior probabilities for events.
- The accuracy of the posterior probabilities relies on a representative sample.
- If we bias our sample, we must adjust the posterior probabilities to account for this.
Undersampling and Prior Probabilities
- Let l_1, l_2, ..., l_L be the levels of the target variable.
- Let i = 1, 2, ..., n index the observations in the data.
- Let OldPost(i, l) be the posterior probability from the model on the oversampled data.
- Let OldPrior(l) be the proportion of the target level in the oversampled data.
- Let Prior(l) be the correct proportion of the target level in the true population.

NewPost(i, l) = [Prior(l) / OldPrior(l)] × OldPost(i, l) / Σ_{j=1}^{L} [Prior(l_j) / OldPrior(l_j)] × OldPost(i, l_j)
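The adjustment is a re-weighting by Prior/OldPrior followed by renormalization so the posteriors for each observation sum to 1. A sketch (the function name `adjust_posteriors` is my own):

```python
import numpy as np

def adjust_posteriors(old_post, old_prior, prior):
    """Correct posteriors from a model trained on a biased sample.
    old_post: (n, L) posterior probabilities from the biased-sample model;
    old_prior, prior: length-L level proportions in the sample / population."""
    w = np.asarray(prior, float) / np.asarray(old_prior, float)
    adjusted = np.asarray(old_post, float) * w   # Prior(l)/OldPrior(l) * OldPost(i, l)
    # renormalize each row so probabilities sum to 1 (the denominator sum)
    return adjusted / adjusted.sum(axis=1, keepdims=True)
```

For example, a 50/50 posterior from a balanced training sample maps back to the population prior when the true event rate is 10%.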
Entering Priors into SAS EM
- Priors are also adjusted in the "decisions on a dataset" panel.
- Click Build when first opening the prompt, then click the Priors tab.
Undersampling and Prior Probabilities
- In SAS EM, accounting for priors has no effect on:
  - Estimating parameters in logistic regression
  - Learning weights in a neural network
  - Fit statistics like misclassification rate and average squared error
  - Growing decision trees
- Priors do affect:
  - Pruning decision trees
- Net effects:
  - Increasing a prior probability increases the posterior probability; decreasing a prior decreases the posterior probability.
  - Changing a prior has a more noticeable effect if the original posterior is near 0.5 than if it is near 0 or 1.
Oversampling
- Instead of undersampling the common events, we can replicate the rare events in our data.
- We have to be careful to do this after the training/validation split so that we don't have the same observation in both the training and validation sets.
- Or, use a hybrid technique like SMOTE (Chawla et al., 2002), which creates new data points similar to the rare events (not exact replicates) and also undersamples the common events.
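The core SMOTE idea, stripped of the details in Chawla et al. (2002), is to interpolate between a minority-class point and a nearby minority-class point rather than copy rows verbatim. A simplified sketch of that idea (not the full published algorithm; the function name is my own):

```python
import numpy as np

def smote_like(minority, n_new, rng=None):
    """SMOTE-style synthetic minority points (simplified): each new point
    lies on the segment between a random minority point and its nearest
    minority-class neighbour, instead of being an exact replicate."""
    rng = np.random.default_rng(rng)
    X = np.asarray(minority, float)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)   # distances to other minority points
        d[i] = np.inf                          # exclude the point itself
        j = np.argmin(d)                       # nearest minority neighbour
        lam = rng.random()                     # random position along the segment
        new_points.append(X[i] + lam * (X[j] - X[i]))
    return np.array(new_points)
```

The full algorithm samples among the k nearest neighbours; this sketch uses only the single nearest one to keep the logic short.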
Using the Model Comparison Node in SAS EM
Cutoff Node
- The Cutoff node is used to specify a cutoff probability other than 0.5 when you have decision factors.
- Currently, the Model Comparison node does not use the cutoff probability from the Cutoff node.
- Most of the assessment statistics are unaffected anyway, aside from the misclassification rate.
Self Study: Using Enterprise Miner to Determine a Custom Probability Cutoff with Profit/Loss or Other Decisions
Average Profit on Pred_Yes
- EM can use a decision matrix to compute the average profit per observation.
- This calculation assumes that you have some level of profit/loss for every person in the data and want to average over every person in the data.
- What if you only stand to profit or lose from the observations you predict positive? I.e., nothing ventured, nothing gained (or lost).
- Then you'd want to take the profit from the model and average it only over those who were predicted positive.
- EM cannot use a decision matrix to compute an average profit per positive prediction.
- But we can do it quite easily with the program editor and dataset explorer!
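To make the distinction concrete, here is a Python stand-in for the calculation the program editor performs (the function name and the profit/loss values are illustrative, not from the source): profit is averaged only over the predicted-positive observations, not the whole dataset.

```python
import numpy as np

def avg_profit_per_positive(probs, labels, cutoff, profit_tp, loss_fp):
    """Average profit over *predicted-positive* observations only.
    Predicted negatives earn/lose nothing, so they are excluded from
    the denominator (unlike EM's average over every observation)."""
    probs = np.asarray(probs, float)
    labels = np.asarray(labels)
    pred_pos = probs >= cutoff
    if not pred_pos.any():
        return 0.0
    # profit_tp for true positives, loss_fp for false positives
    profit = np.where(labels[pred_pos] == 1, profit_tp, loss_fp)
    return float(profit.mean())
```

Sweeping `cutoff` over a grid and picking the value that maximizes this quantity on the validation data mirrors the "sort by average profit" step described below.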
Open Results from Cutoff Node
Open Model Diagnostics Table
Save Model Diagnostics Table
Save Model Diagnostics Table
Open Program Editor
Write Program to Calculate Avg. Profit
Run Program
Check Log
Open Explorer
Navigate to Dataset and Open
Sort by Average Profit: find the largest value for the validation data.