The Model Analyzer lets you explore how the variables in a predictive machine learning model relate to one another, and how they jointly affect model predictions. This makes even the most complex machine learning algorithms, such as XGBoost and DeepNets, easier to interpret.
NOTE: The Model Analyzer is currently in Beta.
To begin exploring a model, select the model from the Available Models list in Predict and click the menu item Model Analyzer.
When you open the Model Analyzer, the interface will look similar to the following. We will explain the different parts of the Analyzer window and how to interact with it.
1) The Version dropdown box allows you to select which version of the model you wish to explore. When you first open the Analyzer, it defaults to the production version and iteration of the model if it is in production, or otherwise the latest version and iteration of the model. When you change the selected model version, the chart automatically refreshes. However, if the new model version does not include the variables that had previously been selected as the Simulate or Compare variables, you will need to re-select those first.
2) The remaining space on the right side of the Analyzer shows all variables in the model, with a specific value for each. For each variable, you can select the value to be used in the analysis. The initial setting for each numeric variable is its average value in the modeling dataset; for each categorical variable, it is the first listed value.
- For numeric variables, you can change the value using the provided slider.
- For categorical variables, you can change the value using the dropdown.
3) The Simulate variable is the one whose values will be shown along the horizontal axis of the resulting chart. The variable selected here will be evaluated along its full range of possible values.
4) The Compare variable is the one whose values will determine the different lines shown on the resulting chart. For numeric variables, the variable is split into five groups, with one line shown for each group. For categorical variables (or numeric variables with a small number of unique values), each unique value of the variable has a line to represent it.
- Note: Variables that have a large number of unique values are not available for choosing as either the Simulate or Compare variables.
- Note: The variables selected as the Simulate and Compare variables will not be available on the right side of the dialog.
5) The Calculate button will become active (white) when any value on the right side is changed. Click the button when you are done changing values to see the results charted. Note that changing the Simulate or Compare variables automatically updates the chart.
6) The main chart area of the Model Analyzer shows the resulting model scores on the vertical axis. The scores are computed over a matrix of possible values of the Simulate and Compare variables, with every other variable in the model held at the fixed value set on the right side of the Analyzer window. The chart lets you see how the Simulate and Compare variables interact with each other to produce the resulting model score.
7) The icon on the upper right of the window can be used to see the raw data behind the displayed chart.
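The computation described in items 2, 4 and 6 can be sketched in a few lines of Python. This is only an illustration of the idea, not the Model Analyzer's actual implementation: the function names, the toy scoring function, the equal-width grouping, and the use of each group's midpoint as its representative value are all assumptions made for the example.

```python
from statistics import mean


def default_settings(rows, numeric_vars, categorical_levels):
    """Initial fixed values: the mean for numeric variables,
    the first listed value for categorical variables."""
    defaults = {var: mean(row[var] for row in rows) for var in numeric_vars}
    for var, levels in categorical_levels.items():
        defaults[var] = levels[0]
    return defaults


def five_groups(low, high, n_groups=5):
    """Split [low, high] into equal-width groups (an assumption about
    how groups are formed); returns a list of (start, end) pairs."""
    width = (high - low) / n_groups
    return [(low + i * width, low + (i + 1) * width) for i in range(n_groups)]


def score_grid(score_fn, fixed, simulate_var, simulate_values,
               compare_var, compare_values):
    """Score every Simulate x Compare combination while all other
    variables stay at their fixed values. Returns one line per Compare
    value, each line being a list of (simulate_value, score) points."""
    grid = {}
    for c in compare_values:
        line = []
        for s in simulate_values:
            record = dict(fixed)          # copy the fixed settings
            record[simulate_var] = s      # vary the Simulate variable
            record[compare_var] = c       # vary the Compare variable
            line.append((s, score_fn(record)))
        grid[c] = line
    return grid


def toy_score(record):
    """Hypothetical stand-in for a trained model's scoring function."""
    base = 0.01 + 0.0005 * record["age"] + 0.0002 * record["length_of_residence"]
    return base * (1.5 if record["gender"] == "FEMALE" else 1.0)


# Hypothetical modeling data and variable names, for illustration only.
fixed = default_settings(
    rows=[{"income": 10000}, {"income": 30000}],
    numeric_vars=["income"],
    categorical_levels={"gender": ["FEMALE", "MALE"]},
)
# Represent each Compare group by its midpoint (an assumption).
compare_values = [(lo + hi) / 2 for lo, hi in five_groups(0, 55)]
grid = score_grid(toy_score, fixed, "age", range(20, 81, 10),
                  "length_of_residence", compare_values)
for compare_value, line in grid.items():
    print(compare_value, [round(score, 4) for _, score in line])
```

Each key of `grid` corresponds to one line on the chart, and each point in a line corresponds to one position along the horizontal axis.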
Using the example pictured above, we can interpret the result in the following way:
The model suggests that response rate (the target variable, shown on the vertical axis) generally increases with Age (the Simulate variable), since the lines rise as we move to the right. In addition, a longer Length of Residence (the different lines) also increases response rate: the lines for longer lengths of residence (e.g., Orange, representing 44-55 years) are higher than the lines for shorter lengths of residence (e.g., Red, representing 11-22 years). This interpretation holds specifically for those with Customer Status = F, Gender = FEMALE, Income in the 1,000 - 14,999 range, and the other settings seen on the right side. But there are some nuances:
- The lowest length of residence group (Dark Blue line representing 0-11 years, overlapping somewhat with the Yellow line) actually has a higher response rate than the 11-22 year group.
- The increase in response rate as Age increases seems to accelerate as one gets older. This can be seen by the fact that (especially for the Orange and Blue lines) the lines have a higher slope on the right side of the chart than on the left side.
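The "higher slope on the right than on the left" observation can also be checked numerically from the chart's raw data. The sketch below is a minimal example of that check; the function names and the sample points are hypothetical and are not taken from the chart above.

```python
def segment_slopes(points):
    """Slope between each pair of consecutive (x, y) points."""
    return [(y2 - y1) / (x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]


def is_accelerating(points):
    """True if the average slope over the right half of the line
    is greater than the average slope over the left half."""
    if len(points) < 3:
        raise ValueError("need at least three points to compare slopes")
    slopes = segment_slopes(points)
    half = len(slopes) // 2
    left, right = slopes[:half], slopes[-half:]
    return sum(right) / len(right) > sum(left) / len(left)


# Hypothetical (Age, response rate) points for a single line.
line = [(20, 0.010), (40, 0.012), (60, 0.020)]
print(is_accelerating(line))
```

A line that rises at a constant rate would return False here, since its left and right slopes are equal.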
If we were to change our fixed settings on the right side (but keep the Simulate and Compare variables the same), we may get a similar or potentially very different result. For example, if we wondered what response rate looks like for Customer Status = N, Gender = MALE, Income in the 200k-249k (much higher income) range, we can make those changes and click the Calculate button. The result might be something like below:
We can see that the overall relationships Age and Length of Residence have with response rate do not change much (and their relationship with each other does not change much), but the absolute response rates themselves have changed to be higher (seen by looking at the scale of the axis on the left). These fixed variable settings have characteristics that lead to higher response rates across the board. This may lead us to explore further some of those variables whose values we changed, such as Income, by selecting them as either a Simulate or Compare variable, as shown below.