Houjun Liu

explainability

Explainability is the study of understanding why a system behaves the way it does (especially when stuff breaks).

Here is a set of explainability techniques!

policy visualization

Roll your system out and look at it

Some common strategies that people use to do this:

  • plot the policy: look at what the agent says to do at each state (if you have too many dimensions, just plot slices; see the sketch after this list)
  • slicing: one way to deal with history-dependent trajectories is to count, at each step, how often each action is taken across rollouts, and plot the argmax
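A minimal sketch of the first bullet, assuming a policy(state) function that returns a discrete action index (the name and the 2-D slice setup are illustrative assumptions, not any particular library's API):

  # sweep two state dimensions on a grid, holding the rest fixed,
  # and color each cell by the action the policy picks there
  import numpy as np
  import matplotlib.pyplot as plt

  def plot_policy_slice(policy, x_range, y_range, fixed_state, dims=(0, 1), n=50):
      xs = np.linspace(*x_range, n)
      ys = np.linspace(*y_range, n)
      actions = np.zeros((n, n))
      for i, x in enumerate(xs):
          for j, y in enumerate(ys):
              state = np.array(fixed_state, dtype=float)
              state[dims[0]], state[dims[1]] = x, y
              actions[j, i] = policy(state)        # action chosen at this state
      plt.imshow(actions, origin="lower", aspect="auto",
                 extent=[x_range[0], x_range[1], y_range[0], y_range[1]])
      plt.xlabel(f"state dim {dims[0]}")
      plt.ylabel(f"state dim {dims[1]}")
      plt.colorbar(label="action")
      plt.show()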

feature importance

Our goal is still to understand the contribution of various features to the overall behavior of a system.

sensitivity analysis

sensitivity analysis allows us to understand how a particular output changes when a single feature is changed

  • take a feature
  • screw with it
  • how does it contribute to the variance of the outcomes?
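A rough sketch of this one-at-a-time procedure, assuming a scalar-valued model(x) and a baseline feature vector x (both names are placeholders for illustration):

  # perturb one feature at a time with noise and record how much
  # the output varies in response to each feature
  import numpy as np

  def sensitivity(model, x, noise=0.1, n_samples=100, seed=0):
      rng = np.random.default_rng(seed)
      scores = np.zeros(len(x))
      for i in range(len(x)):
          outputs = []
          for _ in range(n_samples):
              x_pert = np.array(x, dtype=float)
              x_pert[i] += rng.normal(0.0, noise)   # screw with feature i only
              outputs.append(model(x_pert))
          scores[i] = np.var(outputs)               # variance of the outcomes
      return scores                                 # one importance score per feature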

this is really slow

…because exhaustively perturbing the inputs means searching a space that grows exponentially with the number of features.

So instead, we could consider something like a gradient-based approach:

take the gradient of the output with respect to the input, and use its magnitude as a measure of that feature's influence

this doesn't really handle saturated gradients (i.e., the overall change was big, but once the input gets large enough the function stops changing, so the local gradient is near zero). So instead, we could consider integrated gradients:

For function \(f\) under test, and feature perturbation \(x \in [x_0, x_1]\), we compute:

\begin{equation} \frac{1}{x_1 - x_0} \int_{x_0}^{x_1} \dv{f}{x} \dd{x} \end{equation}
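A small numerical sketch of this averaged-gradient idea over a single feature; the function and interval below are made up purely to show the saturation problem:

  import numpy as np

  def integrated_gradient(f, x0, x1, steps=50, eps=1e-4):
      # average df/dx along the path from x0 to x1 (central finite differences);
      # this approximates (1/(x1-x0)) * integral of df/dx over [x0, x1]
      xs = np.linspace(x0, x1, steps)
      grads = np.array([(f(x + eps) - f(x - eps)) / (2 * eps) for x in xs])
      return grads.mean()

  f = np.tanh                                       # a function that saturates
  print(integrated_gradient(f, 0.0, 5.0))           # ~0.2: captures the full change over [0, 5]
  print((f(5.0 + 1e-4) - f(5.0 - 1e-4)) / 2e-4)     # ~0: a single local gradient is saturated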

shapley values

One problem with sensitivity analysis is that competing feature effects can neutralize each other: for example, if \(z = x \vee y\) and both \(x\) and \(y\) are true, perturbing \(x\) or \(y\) alone will not have any influence on the value of \(z\). Shapley values help us account for contributions across subsets of features.

The Shapley value of a feature is the expected change in the output from adding that feature, taken across all possible subsets of the other features.

  1. randomly fix a subset of features and randomly sample values for them (every other feature keeps its true value)
  2. compute the target value
  3. repeat 1-2 with the feature under test also included in the randomly sampled subset
  4. compute the difference between the case where the target feature kept its true value and the case where it was randomized
  5. compute the expectation of 4 over many repetitions
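A Monte Carlo sketch of these steps for a single feature. The names model, x, and background (a dataset to draw the randomized feature values from) are assumptions made for illustration, not any particular library's API:

  import numpy as np

  def shapley_value(model, x, background, feature, n_samples=200, seed=0):
      rng = np.random.default_rng(seed)
      d = len(x)
      diffs = []
      for _ in range(n_samples):
          # steps 1 & 3: build a random coalition of features that keep their true
          # values; everything else is filled in from a random background row
          order = rng.permutation(d)
          keep = np.zeros(d, dtype=bool)
          keep[order[:int(np.where(order == feature)[0][0])]] = True
          z = background[rng.integers(len(background))]
          x_without = np.where(keep, x, z)           # feature under test randomized
          x_with = x_without.copy()
          x_with[feature] = x[feature]               # feature under test at its true value
          # steps 2 & 4: difference between including and excluding the feature
          diffs.append(model(x_with) - model(x_without))
      return float(np.mean(diffs))                   # step 5: the expectation

Drawing the coalition as a prefix of a random permutation is one standard way to weight the subsets so that this average converges to the Shapley value.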

surrogate models

see