Title: Interpret the Internal States of Recommendation Model with Sparse Autoencoder

Nov 09, 2024

Jiayin Wang, Xiaoyu Zhang, Weizhi Ma, Min Zhang

Abstract: Explainable recommendation systems are important for enhancing transparency, accuracy, and fairness. Beyond result-level explanations, model-level interpretations can provide valuable insights that allow developers to optimize system designs and implement targeted improvements. However, most current approaches depend on specialized model designs, which often lack generalization capabilities. Given the various kinds of recommendation models, existing methods have limited ability to effectively interpret them. To address this issue, we propose RecSAE, an automatic, generalizable probing method for interpreting the internal states of Recommendation models with Sparse AutoEncoder. RecSAE serves as a plug-in module that does not affect original models during interpretations, while also enabling predictable modifications to their behaviors based on interpretation results. Firstly, we train an autoencoder with sparsity constraints to reconstruct internal activations of recommendation models, making the RecSAE latents more interpretable and monosemantic than the original neuron activations. Secondly, we automate the construction of concept dictionaries based on the relationship between latent activations and input item sequences. Thirdly, RecSAE validates these interpretations by predicting latent activations on new item sequences using the concept dictionary and deriving interpretation confidence scores from precision and recall. We demonstrate RecSAE's effectiveness on two datasets, identifying hundreds of highly interpretable concepts from pure ID-based models. Latent ablation studies further confirm that manipulating latent concepts produces corresponding changes in model output behavior, underscoring RecSAE's utility for both understanding and targeted tuning of recommendation models. Code and data are publicly available at https://github.com/Alice1998/RecSAE.
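
To make the first and third steps of the abstract concrete, here is a minimal NumPy sketch. It is not taken from the RecSAE repository, and every name in it is hypothetical. It assumes the sparsity constraint is enforced by keeping only the top-k latents per input (one common choice; the abstract does not say which constraint RecSAE uses), and it assumes the confidence score is the F1 combination of precision and recall (the abstract only says the score is derived from the two).

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Sparse autoencoder forward pass (illustrative, top-k sparsity assumed).

    x: (n, d) internal activations taken from a recommendation model.
    Returns latents of shape (n, m) with at most k non-zeros per row,
    and the reconstruction of shape (n, d).
    """
    # ReLU pre-latents; subtracting b_dec first is a common SAE convention.
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    # Keep the k largest entries in each row and zero out the rest.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    latents = np.zeros_like(pre)
    np.put_along_axis(latents, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    recon = latents @ W_dec + b_dec
    return latents, recon

def interpretation_confidence(predicted, actual):
    """Score one latent's concept on held-out item sequences.

    predicted: whether the concept dictionary predicts the latent fires.
    actual:    whether the latent actually fires.
    Returns (precision, recall, f1); using F1 as the score is an assumption.
    """
    predicted = np.asarray(predicted, dtype=bool)
    actual = np.asarray(actual, dtype=bool)
    tp = np.sum(predicted & actual)
    precision = tp / predicted.sum() if predicted.sum() else 0.0
    recall = tp / actual.sum() if actual.sum() else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return float(precision), float(recall), float(f1)

# Toy usage with random, untrained weights (shapes only, no real data).
rng = np.random.default_rng(0)
d, m, k = 8, 16, 3
x = rng.normal(size=(4, d))
W_enc, W_dec = rng.normal(size=(d, m)), rng.normal(size=(m, d))
latents, recon = sae_forward(x, W_enc, np.zeros(m), W_dec, np.zeros(d), k)
scores = interpretation_confidence([1, 1, 0, 0], [1, 0, 1, 0])
```

In practice the encoder and decoder weights would be trained to minimize the reconstruction error between `recon` and `x`. The sketch only shows how a sparse latent code and a precision/recall-based score could be computed once those weights exist.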
