CLIP-InterpreT: An Interpretability Tool for CLIP-like Models

Avinash Madasu¹, Yossi Gandelsman², Vasudev Lal¹, Phillip Howard¹

Intel Labs¹, UC Berkeley²

Introduction

CLIP-InterpreT is an interpretability tool for exploring the inner workings of CLIP-like foundation models. CLIP is one of the most popular vision-language foundation models and is widely used as a base model when developing new models for tasks such as video retrieval, image generation, and visual navigation, so it is critical to understand how it works internally. For details on the algorithms used in this demo, refer to the ICLR 2024 paper. The tool supports a wide range of CLIP-like models and provides five types of interpretability analyses:

Property-based nearest neighbors search

In this analysis, we show that CLIP layers and heads can be characterized by specific properties such as colors, locations, animals, etc. For a selected property, we retrieve the top-4 most similar images from the ImageNet validation dataset. The properties were labelled with ChatGPT using in-context learning: we provide TextSpan outputs and manually labelled properties as in-context examples, and ask ChatGPT to identify the properties for all the other layers and heads of each model. The properties that are common across layers and heads of each model are then combined and presented. Below, we show examples for each of the properties. Here, the blue color corresponds to the description provided in the text input.

Top-4 nearest neighbors for "location" property. The model used is ViT-B-32 (OpenAI).
The input image is a picture of the "Eiffel Tower" in Paris, France. The top-4 images all show popular landmarks.

Top-4 nearest neighbors for "location" property. The model used is ViT-L-14 (LAION).

Top-4 nearest neighbors for "animals" property. The model used is ViT-B-16 (OpenAI).

Top-4 nearest neighbors for "colors" property. The model used is ViT-B-32 (DataComp). In this example, both the input and the retrieved images share orange, black, and green colors.

Top-4 nearest neighbors for "pattern" property. The model used is ViT-B-32 (OpenAI). In this example, the input image shows an animal lying on the grass. The top retrieved images show the same pattern of an animal lying on the grass.
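The step of combining properties that are common across layers and heads can be sketched as a simple grouping. This is a minimal illustration, not the tool's actual code: the `head_labels` dictionary, its (layer, head) keys, and the property strings are hypothetical stand-ins for the labels returned by the labelling step.

```python
from collections import defaultdict

# Hypothetical labels: (layer, head) -> properties assigned to that head.
head_labels = {
    (11, 3): ["colors"],
    (11, 7): ["location", "colors"],
    (10, 2): ["animals"],
    (9, 5): ["Location "],  # labels may need light normalization
}

def group_heads_by_property(labels):
    """Invert {(layer, head): [property, ...]} into {property: [(layer, head), ...]}."""
    grouped = defaultdict(list)
    for (layer, head), props in sorted(labels.items()):
        for prop in props:
            # Normalize case and whitespace so "Location " and "location" merge.
            grouped[prop.strip().lower()].append((layer, head))
    return dict(grouped)

property_to_heads = group_heads_by_property(head_labels)
```

Selecting a property then amounts to looking up `property_to_heads["location"]` and using the listed heads for retrieval.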

Topic Segmentation

In this analysis, we project a segmentation map corresponding to an input text onto the image. The segmentation map is computed using individual heads to illustrate the properties each head characterizes. The heatmap is shown in blue and highlights the image regions that match the input text description.

Topic Segmentation results for Layer 22, Head 13 (a "geolocation" head). The model used is ViT-L-14 (LAION-2B). The blue heatmap is focused on the "Eiffel Tower", "Christ", "Statue of Liberty", and "Taj Mahal", which are in France, Brazil, New York, and India respectively, as provided in the text input. Interestingly, no explicit information is provided, such as the fact that the Eiffel Tower is in Paris, France; Layer 22, Head 13 has geolocation properties that identify this implicitly.

Topic Segmentation results for Layer 11, Head 3 (an "environment/weather" head). The model used is ViT-B-16 (LAION-2B). In the first image (left), the heatmap (blue) is focused on the "flowers", which matches the text description. In the second image (middle), the heatmap is concentrated on the "tornado", matching the text description. In the last image, the heatmap is focused on the "sun", matching the description "Hot Summer".

Topic Segmentation results for Layer 10, Head 6 (an "emotion" head). The model used is ViT-B-32 (OpenAI-400M). In the first image (left), the heatmap (blue) is most pronounced on the "smile" on a child's face, which suits the text description. In the middle image, the heatmap is focused on the "fear" emotion in a scene from the Conjuring movie. Interestingly, no explicit information is provided that the picture corresponds to the "fear" emotion. In the last image, the heatmap is concentrated on the sad expression of "Thanos" from Marvel.
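One way to picture how such a heatmap can be formed is to score each image patch by the dot product between a head's per-patch representation and the text embedding, then reshape the scores to the patch grid. The sketch below uses random NumPy arrays in place of real CLIP outputs; the function name `topic_heatmap`, the 7x7 grid, and the 16-dimensional embeddings are assumptions made for illustration, not the tool's API.

```python
import numpy as np

def topic_heatmap(patch_contribs, text_emb, grid):
    """Score patches against a text embedding and return a [0, 1] heatmap.

    patch_contribs: (num_patches, dim) per-patch contributions of one head
    text_emb:       (dim,) text embedding in the same joint space
    grid:           (rows, cols) with rows * cols == num_patches
    """
    text_emb = text_emb / np.linalg.norm(text_emb)
    scores = patch_contribs @ text_emb           # one score per patch
    heat = scores.reshape(grid)
    lo, hi = heat.min(), heat.max()
    # Min-max normalize so the map can be overlaid as a color intensity.
    return (heat - lo) / (hi - lo) if hi > lo else np.zeros_like(heat)

# Synthetic stand-ins: patch 10 is made to align with the text direction.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=16)
patch_contribs = 0.01 * rng.normal(size=(49, 16))
patch_contribs[10] += text_emb
heat = topic_heatmap(patch_contribs, text_emb, grid=(7, 7))
```

In this toy setup the brightest cell is the one whose patch was aligned with the text; in practice the heatmap would be upsampled to the image resolution before being overlaid.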

Contrastive Segmentation

In this analysis, we contrast two different text inputs using a single image. The segmentation maps for each input text are projected onto the original image to highlight how the model visually comprehends the differences between the input texts.

Image shows the contrastive segmentation between portions of the image containing "tower" and "wine". The model used is ViT-L-16 (LAION-2B).

Image shows the contrastive segmentation between portions of the image containing "tornado" and "thunderstorm". The model used is ViT-L-14 (LAION-2B).

Image shows the contrastive segmentation between portions of the image containing "hammer" and "hair" of the Marvel action figure Thor. The model used is ViT-L-14 (LAION-2B).
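A plausible way to contrast two text inputs on one image is to score every patch against both texts and keep the signed difference, so that positive values mark regions closer to the first text and negative values mark regions closer to the second. This is only a sketch under that assumption; the source does not state how the tool combines the two maps, and `contrastive_map` and the toy arrays are hypothetical.

```python
import numpy as np

def contrastive_map(patch_reps, text_a, text_b, grid):
    """Signed per-patch map in [-1, 1]: > 0 favors text_a, < 0 favors text_b."""
    a = patch_reps @ (text_a / np.linalg.norm(text_a))
    b = patch_reps @ (text_b / np.linalg.norm(text_b))
    diff = (a - b).reshape(grid)
    scale = np.abs(diff).max()
    return diff / scale if scale > 0 else diff

# Toy example on a 2x2 patch grid with 3-dimensional embeddings.
text_a = np.array([2.0, 0.0, 0.0])   # stand-in embedding for the first text
text_b = np.array([0.0, 3.0, 0.0])   # stand-in embedding for the second text
patch_reps = np.zeros((4, 3))
patch_reps[0] = [1.0, 0.0, 0.0]      # top-left patch matches text_a
patch_reps[3] = [0.0, 1.0, 0.0]      # bottom-right patch matches text_b
cmap = contrastive_map(patch_reps, text_a, text_b, grid=(2, 2))
```

Rendering the positive and negative parts in two different colors gives a single overlay that separates the two concepts.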

Nearest neighbors of an image

In this analysis, we show the nearest neighbors retrieved for an input image according to similarity scores computed using a single attention head. Since some heads capture specific image properties (e.g., colors or objects), their intermediate representations yield a property-specific similarity metric. We retrieve the images most similar to an input image by comparing the direct contributions of an individual head, so the retrieved images are most similar with respect to the aspect that head captures:

Top-8 nearest neighbors per head and image. The input image is provided on the left, with the head-specific nearest neighbors shown on the right. The model used in these examples is ViT-B-16 pretrained on OpenAI-400M.

Top-8 nearest neighbors per head and image. The model used is ViT-L-14 pretrained on LAION-2B.

Top-8 nearest neighbors per head and image. The model used is ViT-L-14 pretrained on OpenAI-400M.
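The head-specific retrieval described in this section can be sketched as a cosine-similarity search over one head's contributions. The arrays below are random placeholders for the direct contributions of a single head; `head_nearest_neighbors` is a hypothetical helper, and cosine similarity is an assumption about the similarity measure.

```python
import numpy as np

def head_nearest_neighbors(query, gallery, k=8):
    """Return indices and cosine similarities of the k gallery items closest to query.

    query:   (dim,) contribution of one head for the input image
    gallery: (num_images, dim) contributions of the same head for candidate images
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    top = np.argsort(-sims)[:k]      # highest similarity first
    return top, sims[top]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 32))  # placeholder head contributions
query = gallery[42].copy()            # query identical to gallery item 42
top, sims = head_nearest_neighbors(query, gallery, k=8)
```

Swapping in the contributions of a different head changes which aspect of the image drives the ranking, while the search itself stays the same.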

Nearest neighbors for a text input

In this analysis, we retrieve the nearest neighbors for a given input text using different attention heads. We use the top TextSpan outputs identified for each head in these examples.

Nearest neighbors retrieved for the top TextSpan outputs of a given layer and head. The model used is ViT-B-16 pretrained on OpenAI-400M.

Nearest neighbors retrieved for the top TextSpan outputs of a given layer and head. The model used is ViT-L-14 pretrained on LAION-2B.
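Text-driven retrieval can be pictured in a similar way: each image is scored by projecting a head's contribution onto the embedding of a text description, and the highest-scoring images are returned for each description. The following is a sketch under that assumption; the function name, the placeholder description strings, and the toy vectors are hypothetical.

```python
import numpy as np

def retrieve_for_text_spans(head_contribs, span_embs, k=2):
    """For each text, rank images by the projection of a head's contribution onto it.

    head_contribs: (num_images, dim) contributions of one head per image
    span_embs:     {text: (dim,) embedding} for the descriptions to query
    """
    results = {}
    for text, emb in span_embs.items():
        scores = head_contribs @ (emb / np.linalg.norm(emb))
        results[text] = np.argsort(-scores)[:k].tolist()
    return results

# Toy data: four images, three-dimensional joint space.
head_contribs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
])
span_embs = {
    "description A": np.array([2.0, 0.0, 0.0]),
    "description B": np.array([0.0, 3.0, 0.0]),
}
results = retrieve_for_text_spans(head_contribs, span_embs, k=2)
```

Here image 3 appears in both result lists because its contribution has a component along both text directions.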