Waris Radji

ENSEIRB-MATMECA, UB

Evaluating Interpretable Reinforcement Learning by Distilling Policies into Programs

Mar 11, 2025

Hector Kohler, Quentin Delfosse, Waris Radji, Riad Akrour, Philippe Preux

Abstract: Some applications of reinforcement learning, such as medicine, require policies that are "interpretable" by humans. User studies have shown that some policy classes might be more interpretable than others. However, human studies of policy interpretability are costly to conduct. Furthermore, there is no clear definition of policy interpretability, i.e., no clear metrics for interpretability, and thus claims depend on the chosen definition. We tackle the problem of empirically evaluating policy interpretability without humans. Despite this lack of a clear definition, researchers agree on the notion of "simulatability": policy interpretability should relate to how humans understand policy actions given states. To advance research in interpretable reinforcement learning, we contribute a new methodology to evaluate policy interpretability. This methodology relies on proxies for simulatability, which we use to conduct a large-scale empirical evaluation of policy interpretability. We use imitation learning to compute baseline policies by distilling expert neural networks into small programs. We then show that using our methodology to evaluate the interpretability of these baselines leads to conclusions similar to those of user studies. We show that increasing interpretability does not necessarily reduce performance and can sometimes increase it. We also show that no policy class better trades off interpretability and performance across tasks, making it necessary for researchers to have methodologies for comparing policy interpretability.
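As a rough illustration of what distilling an expert into a small program by imitation learning can look like, the toy sketch below fits a one-line threshold rule to the actions of a stand-in expert. It is not the authors' method: `expert_policy`, the two-dimensional state, and the single-threshold program class are all assumptions made for this example, and a real expert would be a trained neural network.

```python
import random

def expert_policy(state):
    # Hypothetical stand-in for an expert neural network (assumption):
    # a fixed rule that mixes both state features.
    position, velocity = state
    return 1 if position + 0.5 * velocity > 0.0 else 0

def collect_dataset(n_states, seed=0):
    # Sample states uniformly and label them with the expert's actions.
    rng = random.Random(seed)
    states = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_states)]
    actions = [expert_policy(s) for s in states]
    return states, actions

def fit_threshold_program(states, actions):
    # Search all programs of the form `action = 1 if state[f] > t else 0`
    # and keep the one that agrees most often with the expert.
    best = None
    for feature in range(len(states[0])):
        for threshold in sorted({s[feature] for s in states}):
            agreement = sum(
                (1 if s[feature] > threshold else 0) == a
                for s, a in zip(states, actions)
            ) / len(states)
            if best is None or agreement > best[0]:
                best = (agreement, feature, threshold)
    return best

states, actions = collect_dataset(200)
agreement, feature, threshold = fit_threshold_program(states, actions)
print(f"if state[{feature}] > {threshold:.3f}: action = 1  (agreement {agreement:.2f})")
```

The distilled program is much easier to read than the expert but only approximates it, which is the kind of interpretability and performance trade-off the abstract discusses.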

* 12 pages of main text, under review

Universal Adversarial Perturbations: Efficiency on a small image dataset

Oct 10, 2022

Waris Radji

Abstract: Although neural networks perform very well on image classification, they remain vulnerable to adversarial perturbations that can fool a network without visibly changing the input image. A previous paper has shown the existence of Universal Adversarial Perturbations which, when added to any image, fool the neural network with very high probability. In this paper we try to reproduce the experiments of the Universal Adversarial Perturbations paper, but on a smaller neural network architecture and training set, in order to study the efficiency of the computed perturbation.
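To make the notion of a universal perturbation concrete, the sketch below adds one fixed perturbation to every input and measures the fraction of inputs whose predicted class changes (the fooling rate). It is a toy under stated assumptions: the linear classifier, the weights, the inputs, and the perturbation are hypothetical values chosen for illustration, and the code does not compute a perturbation or use the paper's network.

```python
def predict(weights, x):
    # Toy linear binary classifier (assumption): class 1 if w.x > 0, else 0.
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else 0

def fooling_rate(weights, inputs, perturbation):
    # Add the SAME perturbation to every input and count prediction changes.
    fooled = 0
    for x in inputs:
        x_adv = [xi + pi for xi, pi in zip(x, perturbation)]
        if predict(weights, x_adv) != predict(weights, x):
            fooled += 1
    return fooled / len(inputs)

# Hypothetical data: four 2-D "images" and one shared perturbation.
weights = [1.0, -1.0]
inputs = [[0.2, 0.1], [0.5, 0.1], [0.1, 0.3], [0.9, 0.2]]
perturbation = [-0.3, 0.3]

rate = fooling_rate(weights, inputs, perturbation)
print(f"fooling rate: {rate}")
```

Here the perturbation flips the prediction for two of the four inputs. With a real image classifier the same measurement applies, with `predict` replaced by the network's forward pass.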
