= explainer_rfc.model_profile(variables="Age")
profile_rfc
=False) profile_rfc.plot(show
Step 7. Partial Dependence and Accumulated Local Effects
Once we know which variables are important, it is usually interesting to determine the relationship between a particular variable and the model prediction. Popular techniques for this type of Explanatory Model Analysis are Partial Dependence (PD) and Accumulated Local Effects (ALE).
PD profiles were initially proposed in 2001 for gradient boosting models but can be used in a model agnostic fashion. This method is based on analysis of average model response after replacing variable \(i\) with the value of \(t\).
Both methods are described in detail in Chapter 17 of the Explanatory Model Analysis
More formally, Partial Dependence profile for variable \(i\) is a function of \(t\) defined as
\[ PD(i, t) = E\left[ f(x_1, ..., x_{i-1}, t, x_{i+1}, ..., x_p) \right], \]
where the expected value is calculated over the data distribution. The straightforward estimator is
\[ \widehat{PD}(i, t) = \frac 1n \sum_{j=1}^n f(x^j_1, ..., x^j_{i-1}, t, x^j_{i+1}, ..., x^j_p). \]
In the data set, the variable \(i\) is replaced by the value \(t\), then an average model response is calculated.
When variables are correlated, then changes of \(i\)-th variable independently from others may lead to very untypical observations, so called off-manifold observations. One solution to this is the method called Accumulated Local Effects explained in the EMA book.
Analysis of the Partial Dependence profile for each variable carries a lot of useful information. However, keep in mind that in complex models, you should expect complex interactions. Thus, one global profile for a variable may be an oversimplification. An extension of PD profiles is to calculate them in subgroups defined by some other variables or based on segments of observations found from model responses.
Python snippets
We use the model_profile
method from the dalex
package to calculate the variable profile. The only required argument is the model to be analyzed. It is a good idea to specify names of variables for profile estimation as a second argument; otherwise, profiles are calculated for all variables, which can take some time. One can also specify the exact grid of values for calculations of profiles.
The average is calculated for the distribution specified in the data
argument in the explainer. Here we calculate the PD profiles for the Age variable for covid_summer
data.
Since we have four models it is worth comparing how they differ in terms of the model’s response to the Age
variable.
= explainer_cdc.model_profile(variables="Age")
profile_cdc = explainer_dtc.model_profile(variables="Age")
profile_dtc = explainer_rfc_tuned.model_profile(
profile_rfc_tuned ="Age")
variables
profile_cdc.plot([profile_rfc, profile_dtc, =False) profile_rfc_tuned], show
Grouped Partial Dependence profiles
By default, the average is calculated for all observations. But with the argument groups
one can force calculation of average within groups defined by a grouping variable.
= explainer_rfc_tuned.model_profile(
grouped_profile_rfc_tuned ="Age", groups="Diabetes")
variables=False) grouped_profile_rfc_tuned.plot(show
R snippets
We use the model_profile
function from the DALEX
package to calculate the variable profile. The only required argument is the model to be analyzed. It is a good idea to specify names of variables for profile estimation as a second argument; otherwise, profiles are calculated for all variables, which can take some time. One can also specify the exact grid of values for calculations of profiles.
The average is calculated for the distribution specified in the data
argument in the explainer. Here we calculate the PD profiles for the Age variable for covid_summer
data.
<- model_profile(model_ranger, "Age")
mp_ranger
plot(mp_ranger)
Since we have four models it is worth comparing how they differ in terms of the model’s response to the Age
variable.
<- model_profile(model_cdc, "Age")
mp_cdc <- model_profile(model_tree, "Age")
mp_tree <- model_profile(model_tuned, "Age")
mp_tuned
plot(model_cdc, model_tree, mp_ranger, model_tuned)
Grouped Partial Dependence profiles
By default, the average is calculated for all observations. But with the argument groups
one can specify a grouping variable. Then PD profiles are calculated independently taking into account only observations with a selected level of this variable.
<- model_profile(model_ranger, "Age",
mgroup_ranger groups = "Diabetes")
plot(mgroup_ranger)
Clustered Partial Dependence profiles
When the model is additive, then individual profiles (see the next Section related to Ceteris Paribus profiles) are parallel. But if the model has interactions, individual profiles may have different shapes for different values of variables in each interaction. To see if there are such interactions we can cluster the individual profiles.
If we specify the argument k
, then the function model_profile
performs a hierarchical clustering of the profiles, determines the group of k most different profiles and then calculates the Partial Dependence for each of these groups separately.
<- model_profile(model_ranger, "Age",
mclust_ranger k = 3, center = TRUE)
plot(mclust_ranger)