Hello,
I am trying to understand the auxiliary-function trick for importance scores that are linear with respect to the black-box model.
In the paper it boils down to this:
$$b_i(f,x) \equiv a_i(g_x,x)$$
$$g_x(\tilde{x}) = \sum_{j=1}^{d_H} f_j(x) \cdot f_j(\tilde{x})$$
From the text I don't quite understand what $\tilde{x}$ stands for.
What is effectively done when one wants to use a feature importance attribution method on an embedding network $f$ for a specific sample/image $x$?
Do we calculate the dot product of the vector $f(x)$ with itself and apply the attribution method to that scalar output (i.e. set $\tilde{x} = x$)? Or does $\tilde{x}$ come from somewhere else, so that we attribute the similarity $f(x) \cdot f(\tilde{x})$ between two different inputs?
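To make my current reading concrete, here is a minimal numpy sketch. Everything in it is my own toy choice, not from the paper: a fixed linear map standing in for the embedding network $f$, and gradient-times-input as the attribution method $a_i$, evaluated at $\tilde{x} = x$.

```python
import numpy as np

# Toy embedding "network" f: R^5 -> R^3 -- just a fixed linear map,
# standing in for the black-box embedding model (my assumption).
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))  # d_H = 3, input dimension = 5

def f(x):
    return W @ x  # embedding f(x) in R^{d_H}

def g_x(x_tilde, x):
    # Auxiliary scalar function: g_x(x_tilde) = sum_j f_j(x) * f_j(x_tilde)
    return f(x) @ f(x_tilde)

def attribution(x):
    # Gradient-times-input as a stand-in attribution a_i, applied to the
    # scalar g_x with x treated as fixed.  For this linear f, the gradient
    # of g_x w.r.t. x_tilde is W^T f(x), evaluated here at x_tilde = x.
    grad = W.T @ f(x)
    return x * grad  # per-feature importance scores

x = rng.standard_normal(5)
scores = attribution(x)
print(scores)
```

Under this reading, applying the method "for a specific sample $x$" means $g_x(x) = f(x) \cdot f(x) = \lVert f(x) \rVert^2$, and the attribution is taken on that scalar. Is that the intended procedure, or is $\tilde{x}$ a genuinely separate second input?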
Also, could you please point me to literature on, or elaborate on, what it means for a feature-importance score to be "linear with respect to the black-box"?
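My working guess is that it means the attribution operator $a_i$ is linear in its first (function) argument, i.e. for models $f, g$ and scalars $\alpha, \beta$:

$$a_i(\alpha f + \beta g,\, x) = \alpha\, a_i(f, x) + \beta\, a_i(g, x),$$

which would be what lets the sum over $j$ in the auxiliary function be attributed term by term. But I am not sure this is the property the paper intends.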
I would really appreciate an answer!
Kind regards