In the rapidly evolving landscape of artificial intelligence, multi-modal large language models are emerging as a significant area of interest. These models, which combine various forms of data input, are becoming increasingly popular. However, understanding their internal mechanisms remains a complex task. Numerous advancements have been made in explainability tools and mechanisms, yet there is still much to explore. In this work, we present a novel interactive application aimed at understanding the internal mechanisms of large vision-language models. Our interface is designed to enhance the interpretability of the image patches that are instrumental in generating an answer, and to assess the efficacy of the language model in grounding its output in the image. With our application, a user can systematically investigate the model and uncover system limitations, paving the way for enhancements in system capabilities. Finally, we present a case study of how our application can aid in understanding a failure mechanism in a popular large multi-modal model: LLaVA.
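To make the idea of "image patches instrumental in generating an answer" concrete, here is a minimal sketch of extracting the raw attention a LLaVA-style model pays to its image-patch tokens while answering a question, using the Hugging Face `transformers` LLaVA integration. This is not the LVLM-Intrepret implementation, only an illustration of the kind of signal such an interface surfaces; the checkpoint name, prompt, input image path, 24x24 patch grid, and the assumption that the processor expands the `<image>` placeholder into one token per patch (true in recent `transformers` releases) are all assumptions made for this example.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed checkpoint; "eager" attention so that attention weights are returned.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", attn_implementation="eager"
)

image = Image.open("example.jpg")  # hypothetical input image
prompt = "USER: <image>\nWhat is the person holding? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Generate an answer while keeping the attention weights of every decoding step.
out = model.generate(
    **inputs,
    max_new_tokens=20,
    output_attentions=True,
    return_dict_in_generate=True,
)
print(processor.decode(out.sequences[0], skip_special_tokens=True))

# Positions of image-patch tokens in the prompt (assumes a transformers version in
# which the processor expands the <image> placeholder into one token per patch).
image_mask = inputs.input_ids[0] == model.config.image_token_index

# out.attentions[0] holds, for the first generated token, one tensor per layer of
# shape (batch, heads, prompt_len, prompt_len). Take the last query position (the
# token that predicts the first answer token) and average over layers and heads.
step0 = torch.stack([layer[0, :, -1, :].float().cpu() for layer in out.attentions[0]])
attn_over_keys = step0.mean(dim=(0, 1))             # (prompt_len,)
attn_to_patches = attn_over_keys[image_mask.cpu()]  # one weight per image patch

# Fold the per-patch weights back into the vision encoder's patch grid
# (LLaVA-1.5 uses a 24x24 grid of 576 patches) to get a coarse relevance heatmap.
side = int(attn_to_patches.numel() ** 0.5)
heatmap = attn_to_patches.reshape(side, side)
print(heatmap)
```

Raw attention averaged in this way is only one of several signals an interpretability interface can expose; an interactive tool lets a user probe such maps per generated token and per layer rather than relying on a single aggregate.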
@misc{stan2024lvlmintrepret,
      title={LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models},
      author={Gabriela Ben Melech Stan and Raanan Yehezkel Rohekar and Yaniv Gurwicz and Matthew Lyle Olson and Anahita Bhiwandiwalla and Estelle Aflalo and Chenfei Wu and Nan Duan and Shao-Yen Tseng and Vasudev Lal},
      year={2024},
      eprint={2404.03118},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}