Feature attribution methods such as SHAP [1] or LIME [2] are widely applied to explain the predictions of black-box machine learning models without making any assumption about the model class. However, single-feature attributions cannot quantify the contribution that multiple features achieve jointly. In language models, for instance, explaining a prediction by assigning contributions to individual words is not meaningful, as multiple words must be considered together. Quantifying such joint contributions is known as feature interaction [3], and existing attribution methods have been extended to capture it. The goal of this thesis is to discover feature interactions in existing deep learning architectures in order to explain their predictions more comprehensively.
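The following sketch is purely illustrative and not the method developed in this thesis: it computes exact Shapley values and the pairwise Shapley interaction index by brute-force enumeration of coalitions for a hypothetical toy model (all function and variable names are assumptions, and the baseline-masking value function is one common choice). It shows how an explicit interaction between two features is invisible to single-feature attributions but is recovered by the pairwise interaction index; exact enumeration is only feasible for a handful of features, which is why approximation schemes such as SHAP-IQ [3] exist.

# Illustrative sketch (assumed toy setup, not the thesis method):
# exact Shapley values vs. pairwise Shapley interaction index.
from itertools import combinations
from math import factorial

import numpy as np


def toy_model(x):
    # Additive terms plus an explicit interaction between features 0 and 1.
    return x[0] + x[2] + 2.0 * x[0] * x[1]


def coalition_value(model, x, baseline, subset):
    # "Remove" absent features by replacing them with baseline values.
    z = baseline.copy()
    z[list(subset)] = x[list(subset)]
    return model(z)


def shapley_value(model, x, baseline, i):
    # Classic Shapley value: weighted average of marginal contributions of i.
    n = len(x)
    rest = [k for k in range(n) if k != i]
    phi = 0.0
    for size in range(n):
        for S in combinations(rest, size):
            w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += w * (coalition_value(model, x, baseline, S + (i,))
                        - coalition_value(model, x, baseline, S))
    return phi


def shapley_interaction(model, x, baseline, i, j):
    # Pairwise Shapley interaction index: weighted average of the discrete
    # second-order difference of the value function with respect to i and j.
    n = len(x)
    rest = [k for k in range(n) if k not in (i, j)]
    inter = 0.0
    for size in range(n - 1):
        for S in combinations(rest, size):
            w = factorial(len(S)) * factorial(n - len(S) - 2) / factorial(n - 1)
            delta = (coalition_value(model, x, baseline, S + (i, j))
                     - coalition_value(model, x, baseline, S + (i,))
                     - coalition_value(model, x, baseline, S + (j,))
                     + coalition_value(model, x, baseline, S))
            inter += w * delta
    return inter


if __name__ == "__main__":
    x = np.array([1.0, 1.0, 1.0])
    baseline = np.zeros(3)
    # Attributions: [2.0, 1.0, 1.0]; the joint effect of features 0 and 1
    # is only exposed by the interaction index, which equals 2.0 here.
    print([round(shapley_value(toy_model, x, baseline, i), 3) for i in range(3)])
    print(round(shapley_interaction(toy_model, x, baseline, 0, 1), 3))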
Keywords: Explainable AI, Deep Learning, Feature Interactions
Literature
[1] Lundberg, S. M. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html
[2] Ribeiro, M. T., Singh, S. & Guestrin, C. ‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier. https://dl.acm.org/doi/abs/10.1145/2939672.2939778
[3] Fumagalli, F. et al. SHAP-IQ: Unified Approximation of Any-Order Shapley Interactions. https://proceedings.neurips.cc/paper_files/paper/2023/hash/264f2e10479c9370972847e96107db7f-Abstract-Conference.html