EF1 – Extracting Dynamical Laws from Complex Data

Project

EF1-24

Expanding Merlin-Arthur Classifiers: Interpretable Neural Networks through Interactive Proof Systems

Project Heads

Sebastian Pokutta, Stephan Wäldchen

Project Members

Berkant Turan

Project Duration

15.01.2024 − 31.03.2026

Located at

ZIB

Description

Existing approaches for interpreting neural network classifiers highlight the features relevant to a decision, but they are based solely on heuristics. We introduce a theory that bounds the quality of these features without any assumptions on the classifier model by relating classification to interactive proof systems.
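As a rough illustration of the interactive-proof view, the sketch below treats a classifier (Arthur, the verifier) that sees only a single feature chosen by a prover. A cooperative prover (Merlin) tries to pick a feature that leads to the correct class, while an adversarial prover tries to pick one that leads to a wrong class. The two standard quantities of an interactive proof system, completeness and soundness, can then be measured empirically. The toy dataset, the lookup-table verifiers, the brute-force provers that are given the true label, and all function names are assumptions made for illustration; this is not the project's implementation.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Example = Tuple[FrozenSet[str], int]   # (set of features, class label)
Verifier = Dict[str, int]              # feature -> class; missing means "don't know"

# Hypothetical toy data: features "a"/"b" determine the class, "c"/"d" do not.
DATA: List[Example] = [
    (frozenset({"a", "c"}), 0),
    (frozenset({"a", "d"}), 0),
    (frozenset({"b", "c"}), 1),
    (frozenset({"b", "d"}), 1),
]

def arthur(verifier: Verifier, feature: str) -> Optional[int]:
    """Verifier: classify from one feature, or abstain with None."""
    return verifier.get(feature)

def merlin(verifier: Verifier, x: FrozenSet[str], label: int) -> str:
    """Cooperative prover: pick a feature that yields the true label if one exists."""
    for f in sorted(x):
        if arthur(verifier, f) == label:
            return f
    return sorted(x)[0]

def adversarial_prover(verifier: Verifier, x: FrozenSet[str], label: int) -> str:
    """Adversarial prover: pick a feature that yields a wrong label if one exists."""
    for f in sorted(x):
        out = arthur(verifier, f)
        if out is not None and out != label:
            return f
    return sorted(x)[0]

def completeness(data: List[Example], verifier: Verifier) -> float:
    """Fraction of inputs the verifier classifies correctly with the cooperative prover."""
    hits = sum(arthur(verifier, merlin(verifier, x, y)) == y for x, y in data)
    return hits / len(data)

def soundness(data: List[Example], verifier: Verifier) -> float:
    """Fraction of inputs on which the adversarial prover cannot force a wrong class."""
    safe = sum(
        arthur(verifier, adversarial_prover(verifier, x, y)) in (None, y)
        for x, y in data
    )
    return safe / len(data)

# A verifier relying only on informative features, and one that also trusts "c".
GOOD: Verifier = {"a": 0, "b": 1}
BAD: Verifier = {"a": 0, "b": 1, "c": 0}

if __name__ == "__main__":
    for name, v in (("GOOD", GOOD), ("BAD", BAD)):
        print(name, completeness(DATA, v), soundness(DATA, v))
```

In this toy setting both verifiers are fully complete, but the second one can be fooled on the input containing "b" and "c", because it accepts the uninformative feature "c" as evidence for class 0. High completeness together with high soundness is what indicates that the exchanged features actually carry class information.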

External Website

Interactive Optimization and Learning Lab

Related Publications

Głuch, G., Turan, B.*, Nagarajan, S.*, and Pokutta, S. (2024). The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses. arXiv preprint arXiv:2410.08864.
[Preprint]
Wäldchen, S., Sharma, K., Turan, B., Zimmer, M., and Pokutta, S. (2024). Interpretability Guarantees with Merlin-Arthur Classifiers. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:1963-1971.
[Paper] [Poster] [Code]
Głuch, G., Nagarajan, S.*, and Turan, B.* (2024). Unified Taxonomy in AI Safety: Watermarks, Adversarial Defenses, and Transferable Attacks. In ICML 2024 Workshop on Theoretical Foundations of Foundation Models.
[Paper] [Poster]
Turan, B. (2023). Extending Merlin-Arthur Classifiers for Improved Interpretability. In Joint Proceedings of the xAI-2023 Late-breaking Work, Demos and Doctoral Consortium, co-located with the 1st World Conference on eXplainable Artificial Intelligence (xAI-2023), Jul.

(* Equal Contribution)

Related Software

GitHub Repository (Merlin-Arthur-Classifiers)