Performance Analysis: Metrics to Analyze: Numeric Prediction Models

This article describes the machine learning performance metrics provided for numeric prediction models. In LityxIQ, model types such as Numeric Prediction, Customer Value, and Number of Visits fall into this category.
 
Key Terms:
  • Absolute value is the magnitude of a number without regard to its sign.  So -6 and 6 have the same absolute value.
  • Correlation is a number between −1 and +1 representing the linear dependence of two variables, such as the correlation of age and income.  A negative correlation means that as one goes up the other goes down, while a positive correlation means that as one goes up the other also goes up.  Correlations below 0.30 in absolute value are generally considered weak, while those above 0.70 are considered strong.
  • Error refers to the difference between an actual outcome and a predicted outcome.
  • Lift is how well a model is able to rank order the variable the model was built on.  For example, if the model is a revenue model, lift measures how well it separates the customers with the highest expected revenue from those with the lowest.
  • Mean is the same as average: it is calculated by summing the field being averaged and then dividing that sum by the number of records summed.  This is an unweighted mean, or average.
  • Median is the middle value after ranking all of the values.  (A short computational sketch of these terms follows this list.)
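
To make these key terms concrete, here is a minimal Python sketch.  The sample values and variable names are made up for illustration only and are not taken from LityxIQ.

    import statistics

    actual    = [10.0, 12.0, 8.0, 15.0, 11.0]   # hypothetical actual outcomes
    predicted = [ 9.0, 13.5, 7.0, 14.0, 12.5]   # hypothetical model predictions

    errors = [a - p for a, p in zip(actual, predicted)]   # error = actual - predicted
    abs_errors = [abs(e) for e in errors]                 # absolute value drops the sign

    mean_value = statistics.mean(actual)      # unweighted average
    median_value = statistics.median(actual)  # middle value after ranking

    # Pearson correlation between actual and predicted, between -1 and +1
    # (statistics.correlation requires Python 3.10+)
    correlation = statistics.correlation(actual, predicted)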
 
Performance Metrics:
  • Median Absolute Error - median is middle, absolute is without regard to positive or negative, and error is difference.  Comparing the predicted values to the actual values, some predictions are higher than actual and some are lower.  Think of an absolute difference as making all differences positive.  The median is the difference value right in the middle of all the difference values.  Medians are useful to compare to means because, unlike means, they are not influenced by outliers (very high or very low values).  Consider this metric in relation to what you are trying to predict: how large is it compared to the average of that variable?
  • Mean Absolute Error - mean is average, absolute is without regard to positive or negative, and error is difference.  Comparing the predicted values to the actual values, some predictions are higher than actual and some are lower.  Think of an absolute difference as making all differences positive.  The mean is the sum of all the absolute difference values divided by the number of difference values.  So Mean Absolute Error is the average difference between actual and predicted values after making all negative differences positive.  It is good to compare to the Median Absolute Error; if they are not very similar, then some very large or very small differences are causing them to differ.  Consider this metric in relation to what you are trying to predict: how large is it compared to the average of that variable?  (The first sketch after this list shows how the error-based metrics can be computed.)
  • Median Error – like Median Absolute Error, but without first taking the absolute value of the differences, so negative differences remain in the list of all differences from which the middle value is selected.
  • Mean Error – like Mean Absolute Error, but without first taking the absolute value; the sum of all differences (positive and negative) is divided by the number of difference values, so negative differences offset positive ones and impact the mean error.
  • Median Percent “Absolute” Error – the middle value among all absolute percent differences between actual and predicted values.
  • SSE – stands for sum of squared errors.  For each actual/predicted pair, take the actual value minus the predicted value and square it, then add all of these squared differences together.  The result can be compared across versions of the model as one way to measure goodness of fit; the lower the SSE, the better.
  • MSE – stands for mean squared error and is the SSE divided by the number of observations used to calculate the SSE.  Think of it as an average squared error: the average of the squared differences between actual and predicted values.
  • R-squared – a measure of goodness of fit for numeric prediction models.  It can be interpreted as a multivariate correlation that describes how well the full set of predictor variables predicts the target variable.  The value ranges from 0 to 1, with values closer to 1 representing a more accurate model.  Technically, it is the percentage of overall variation in the target variable that is explained by the predictors.
  • R-squared adjusted – an adjusted form of the R-squared metric.  The adjustment factor relates to how many independent terms are being used to make the prediction: the more terms (i.e., the more complex the model), the more Adjusted R-squared is reduced.  It is generally an improvement over evaluating R-squared alone because it helps account for the tradeoff between accuracy and model complexity.
  • Predicted Value Correlation – correlation between the actual values and their associated predicted values.
  • Ranked Value Correlation - correlation between the actual value ranks and their associated predicted value ranks.
  • Lift - using the variable the model is built on, the top 25% of records are categorized as "1's" and the remainder as "0's"; the metric then measures the model's ability, through its score rank, to separate the 1's from the 0's.  The area under the ROC curve from these 1's and 0's is the Lift.  Lift has a value from 0 to 100, with 0 being no better than random.  If the model were perfect, all of the targets would be assigned higher scores than all of the non-targets and the Lift would be 100.  (A sketch of the decile-based lift indexes follows this list.)
  • Lift 1 vs. 2 – using the variable the model is built on, the top 25% of records are categorized as "1's" and the remainder as "0's"; the metric then measures the model's ability, through its score, to separate the 1's from the 0's.  It is the percent of 1's in decile 1 compared to the percent of 1's in decile 2, expressed in the form of an index.  For example, a value of 1.2586 indicates decile 1 has nearly 26% more 1's in it than decile 2, that is, nearly 26% more of the top 25% of records than decile 2.
  • Lift 1 and 2 vs. Rest – using the variable the model is built on, the top 25% of records are categorized as "1's" and the remainder as "0's"; the metric then measures the model's ability, through its score, to separate the 1's from the 0's.  It is the percent of 1's in deciles 1 and 2 combined compared to the percent of 1's in deciles 3-10, expressed in the form of an index.  For example, a value of 2.2249 indicates the combined percent of 1's in deciles 1 and 2 is about 2.2 times that of the bottom 8 deciles.
  • Lift 1 Over Random – using the variable the model is built on, the top 25% of records are categorized as "1's" and the remainder as "0's"; the metric then measures the model's ability, through its score, to separate the 1's from the 0's.  It is the percent of 1's in decile 1 compared to the average percent of 1's across all deciles, expressed in the form of an index.  For example, a value of 4.2249 indicates the percent of 1's in decile 1 is about 4.2 times the average percent of 1's across all deciles.
  • Numeric Lift - uses the model score to rank order the records and then measures, by rank, the percent of total for the variable the model was built on in order to calculate the area under the ROC curve.  For example, with a revenue model it would be the percent of total actual revenue captured as the curve moves from the highest score to the lowest score.  Like Lift, the value runs from 0 to 100, with 0 being no better than random.  Numeric Lift is not the same as Lift, so their values should not be expected to match and may not even be terribly close to one another.  Lift tells you more about how the model is doing in the top deciles, while Numeric Lift tells you how the model is doing overall through all the deciles.  (A sketch of the Numeric Lift indexes follows this list.)
  • Numeric Lift 1 vs. 2 – uses the model score to rank order the records and then compares the average value of the variable the model was built on in decile 1 to that in decile 2, in the form of an index.  For example, a revenue model with a Numeric Lift 1 vs. 2 of 1.2586 indicates decile 1 has an average revenue per person nearly 26% higher than decile 2.
  • Numeric Lift 1 and 2 vs. Rest – uses the model score to rank order the records and then compares the average value of the variable the model was built on in deciles 1 and 2 combined to that in deciles 3-10, in the form of an index.  For example, a revenue model with a Numeric Lift 1 and 2 vs. Rest value of 2.2249 indicates deciles 1 and 2 combined have an average revenue per person about 2.2 times that of the bottom 8 deciles.
  • Numeric Lift 1 Over Random - uses the model score to rank order the records and then compares the average value of the variable the model was built on in decile 1 to the average across all deciles, in the form of an index.  For example, a revenue model with a Numeric Lift 1 Over Random of 4.2249 indicates decile 1 has an average revenue per person about 4.2 times the average revenue per person across all deciles.
  • Percent Correct - computed by creating a confusion matrix of actual vs. predicted, where the rows and columns are “Top 25%” and “Bottom 75%”, and determining the percent on the diagonal.  It works just like Percent Correct for a binary response, but the “response” is created by calling the top 25% of actual values “responders” and the top 25% of predictions “predicted responders”.
  • Percent Within 10 Pct - percent of records for which the predicted value is within 10 percent of the actual value.
  • Percent Within 15 Pct - percent of records for which the predicted value is within 15 percent of the actual value.
  • Percent Within 25 Pct - percent of records for which the predicted value is within 25 percent of the actual value.
  • Percent Within 40 Pct - percent of records for which the predicted value is within 40 percent of the actual value.
  • Mean Actual Value - across all observations the average actual value of the variable the model was built on.
  • Mean Predicted Value - across all observations the average prediction value of the variable the model was built on.
  • Median Actual Value - across all observations the middle actual value of the variable the model was built on.
  • Median Predicted Value - across all observations the middle prediction value of the variable the model was built on.
  • Minimum Score - across all observations the lowest prediction value.
  • Maximum Score - across all observations the highest prediction value.
  • Overfit Potential - Overfitting of a model is a concept that relates to how predictive a model will be on new data it has never seen.  An overfit model may look strong on the dataset on which it was built, but doesn't translate well to new data.  The Overfit Potential metric in LityxIQ ranges from 0 to 100.  Lower values represent a model that is not overfit, while larger values signify more likelihood that the model is overfit.  Note that even for models in which there is a higher overfit potential, the performance metrics reported in LityxIQ (such as the ones mentioned above) will be a good representation of how it will perform on new data as long as you used validation techniques such as Holdout or Cross-Validation.  This is because LityxIQ computes performance metrics against new data, and so attempts to still report unbiased performance metrics even in the face of overfitting.
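
The error-based metrics above follow directly from their definitions.  The sketch below shows one way they could be computed in Python; the function and variable names are our own, the formulas are the standard textbook ones, and this is not the LityxIQ implementation.  The percent-error convention (relative to the actual value) and the tie handling in the rank correlation are simplifying assumptions.

    import statistics

    def error_metrics(actual, predicted, n_terms=1):
        """Sketch of the error-based metrics; n_terms is the number of
        predictor terms, used only for adjusted R-squared."""
        n = len(actual)
        errors = [a - p for a, p in zip(actual, predicted)]
        abs_errors = [abs(e) for e in errors]
        # Percent errors assume actual values are nonzero
        pct_errors = [abs(e / a) * 100 for e, a in zip(errors, actual)]

        sse = sum(e ** 2 for e in errors)                  # sum of squared errors
        mse = sse / n                                      # mean squared error
        mean_actual = statistics.mean(actual)
        sst = sum((a - mean_actual) ** 2 for a in actual)  # total variation in the target
        r2 = 1 - sse / sst                                 # share of variation explained
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_terms - 1)  # penalizes extra terms

        def ranks(values):
            # Rank positions (1 = smallest); ties are ignored for simplicity
            order = sorted(range(len(values)), key=lambda i: values[i])
            r = [0] * len(values)
            for rank, i in enumerate(order, start=1):
                r[i] = rank
            return r

        def pct_within(pct):
            hits = sum(abs(a - p) <= pct / 100 * abs(a)
                       for a, p in zip(actual, predicted))
            return 100 * hits / n

        return {
            "Median Absolute Error": statistics.median(abs_errors),
            "Mean Absolute Error": statistics.mean(abs_errors),
            "Median Error": statistics.median(errors),
            "Mean Error": statistics.mean(errors),
            "Median Percent Absolute Error": statistics.median(pct_errors),
            "SSE": sse,
            "MSE": mse,
            "R-squared": r2,
            "R-squared adjusted": adj_r2,
            # statistics.correlation requires Python 3.10+
            "Predicted Value Correlation": statistics.correlation(actual, predicted),
            "Ranked Value Correlation": statistics.correlation(ranks(actual),
                                                               ranks(predicted)),
            "Percent Within 10 Pct": pct_within(10),
            "Percent Within 25 Pct": pct_within(25),
        }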
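
The decile-based lift indexes (Lift 1 vs. 2, Lift 1 and 2 vs. Rest, Lift 1 Over Random, and Percent Correct) can be illustrated with a sketch like the following.  It flags the top 25% of actual values as "1's", ranks records into ten deciles by model score, and forms the index ratios described above.  Tie handling, uneven decile sizes, and the rescaling of the ROC-based Lift onto a 0-to-100 scale are not shown; this is illustration only, not the LityxIQ implementation.

    def decile_lift_metrics(actual, scores):
        """Sketch of the decile-based lift indexes.  Assumes enough records
        for ten non-empty deciles; ties are handled naively."""
        n = len(actual)
        n_top = int(0.25 * n)

        # Flag the top 25% of actual values as targets ("1's")
        actual_cutoff = sorted(actual, reverse=True)[n_top - 1]
        is_target = [1 if a >= actual_cutoff else 0 for a in actual]

        # Rank records by model score, highest first, and split into 10 deciles
        order = sorted(range(n), key=lambda i: scores[i], reverse=True)
        size = n // 10
        deciles = [order[d * size:(d + 1) * size] for d in range(10)]

        def pct_targets(idx):
            return sum(is_target[i] for i in idx) / len(idx)

        overall = sum(is_target) / n
        rest = [i for d in deciles[2:] for i in d]

        # Percent Correct: agreement between top-25% actuals and top-25% predictions
        pred_cutoff = sorted(scores, reverse=True)[n_top - 1]
        pred_target = [1 if s >= pred_cutoff else 0 for s in scores]
        percent_correct = 100 * sum(t == p for t, p in zip(is_target, pred_target)) / n

        return {
            "Lift 1 vs. 2": pct_targets(deciles[0]) / pct_targets(deciles[1]),
            "Lift 1 and 2 vs. Rest": pct_targets(deciles[0] + deciles[1]) / pct_targets(rest),
            "Lift 1 Over Random": pct_targets(deciles[0]) / overall,
            "Percent Correct": percent_correct,
        }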
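
The Numeric Lift indexes compare average actual performance across score-ranked deciles rather than counts of top-25% records.  A minimal sketch, using the same naive decile split as above:

    def numeric_lift_metrics(actual, scores):
        """Sketch of the Numeric Lift indexes: compare the average actual value
        (e.g., revenue per person) across score-ranked deciles."""
        n = len(actual)
        order = sorted(range(n), key=lambda i: scores[i], reverse=True)
        size = n // 10
        deciles = [order[d * size:(d + 1) * size] for d in range(10)]

        def avg(idx):
            return sum(actual[i] for i in idx) / len(idx)

        rest = [i for d in deciles[2:] for i in d]
        return {
            "Numeric Lift 1 vs. 2": avg(deciles[0]) / avg(deciles[1]),
            "Numeric Lift 1 and 2 vs. Rest": avg(deciles[0] + deciles[1]) / avg(rest),
            "Numeric Lift 1 Over Random": avg(deciles[0]) / (sum(actual) / n),
        }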