While vision-language models (VLMs) have achieved remarkable performance improvements in recent years, there is growing evidence that these models also possess harmful biases with respect to social attributes such as gender and race. Prior studies have primarily probed such bias attributes individually while ignoring biases associated with intersections of social attributes. This may be due to the difficulty of collecting an exhaustive set of image-text pairs for various combinations of social attributes.
To address this challenge, we employ text-to-image diffusion models to produce counterfactual examples for probing intersectional social biases at scale. Our approach utilizes Stable Diffusion with cross attention control to produce sets of counterfactual image-text pairs that are highly similar in their depiction of a subject (e.g., a given occupation) while differing only in their depiction of intersectional social attributes (e.g., race & gender).
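A rough sketch of this generation step is shown below. It uses the Hugging Face diffusers library and simply reuses the same initial latent noise for every counterfactual caption, which is a simplified stand-in for the prompt-to-prompt style cross-attention control used in our pipeline; the checkpoint name, captions, and seed are illustrative assumptions rather than the exact settings we use.

```python
# Minimal sketch (not our full pipeline): generate one counterfactual image set
# with Stable Diffusion, reusing the same initial latents for every caption so
# that only the attribute words differ. Full cross-attention control
# (prompt-to-prompt editing) additionally shares attention maps across prompts.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",  # illustrative checkpoint choice
    torch_dtype=torch.float16,
).to("cuda")

captions = [
    "A photo of a Black female construction worker",
    "A photo of a Black male construction worker",
    "A photo of a White female construction worker",
    "A photo of a White male construction worker",
]

# Re-seeding the generator before each prompt keeps the initial noise identical,
# so the scenes stay closely aligned apart from the edited social attributes.
images = []
for caption in captions:
    generator = torch.Generator(device="cuda").manual_seed(1234)
    image = pipe(caption, generator=generator, num_inference_steps=50).images[0]
    images.append(image)
```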
Through our over-generate-then-filter methodology, we produce SocialCounterfactuals, a high-quality dataset containing over 171k image-text pairs for probing intersectional biases related to gender, race, and physical characteristics. We conduct extensive experiments to demonstrate the usefulness of our generated dataset for probing and mitigating intersectional social biases in state-of-the-art VLMs.
Our approach to creating counterfactual image-text examples for intersectional social biases consists of three steps. (1) First, we construct sets of image captions describing a subject with counterfactual changes to intersecting social attributes. (2) We then utilize a text-to-image diffusion model with cross attention control to over-generate sets of images corresponding to the counterfactual captions, where differences among images are isolated to the induced counterfactual change (i.e., the social attributes). (3) Finally, we apply stringent filtering to identify only the highest-quality generations.
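The following minimal sketch illustrates step (1) for a single subject; the attribute lists and prompt template are illustrative examples, not the full vocabularies used to build SocialCounterfactuals.

```python
# Sketch of step (1): build a counterfactual caption set for one subject by
# taking the Cartesian product of two intersecting attribute sets.
from itertools import product

races = ["Asian", "Black", "Latino", "White"]   # illustrative attribute set A1
genders = ["female", "male"]                    # illustrative attribute set A2
subject = "construction worker"

counterfactual_set = [
    f"A photo of a {race} {gender} {subject}"
    for race, gender in product(races, genders)
]
# One set contains |A1| x |A2| captions (here 4 x 2 = 8) that differ only in
# the intersectional attributes; images are then over-generated for each caption
# (step 2) and stringently filtered (step 3) to keep the highest-quality sets.
```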
We group counterfactual sets into three dataset segments based on the pair of attribute types used to construct the captions, which are detailed in the rows of Table 1. In total, our dataset consists of 13,824 counterfactual sets with 170,832 image-text pairs, which represents the largest paired image-text dataset for investigating social biases to date.
For example, given the counterfactual caption "A {race} {gender} construction worker", we form its corresponding attribute-neutral prompt "A construction worker". We construct neutral prompts in this manner for each unique combination of prefixes and subjects, averaging their text representations across different prefixes to obtain a single text embedding for each subject. MaxSkew@K is then calculated by retrieving the top-K images for the computed text embedding from the set of all images generated for the subject which met our filtering and selection criteria. We set K = |A1| × |A2|, where A1 and A2 are the investigated attribute sets.
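A minimal sketch of how MaxSkew@K can be computed from the attribute labels of the top-K retrieved images is given below, assuming a uniform desired distribution over the |A1| × |A2| intersectional attribute combinations; the helper name and input format are our own illustrative choices, not an official API.

```python
# Sketch of MaxSkew@K under a uniform desired distribution over the
# |A1| x |A2| intersectional attribute combinations.
import math
from collections import Counter

def max_skew_at_k(retrieved_attrs, all_attrs):
    """retrieved_attrs: attribute combo of each of the top-K retrieved images.
    all_attrs: every possible attribute combo (K = len(all_attrs))."""
    k = len(retrieved_attrs)
    desired = 1.0 / len(all_attrs)          # uniform target proportion
    counts = Counter(retrieved_attrs)
    skews = []
    for attr in all_attrs:
        observed = counts.get(attr, 0) / k  # observed proportion in top-K
        # A small floor avoids log(0) when an attribute combo is entirely absent.
        skews.append(math.log(max(observed, 1e-12) / desired))
    return max(skews)

# Example: with 8 intersectional combos and K = 8, a perfectly balanced
# retrieval yields MaxSkew@K = 0, while any over-representation yields > 0.
```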
We additionally fine-tune VLMs on SocialCounterfactuals to produce debiased variants of these models. To estimate the magnitude of debiasing, we evaluate each model's MaxSkew@K for intersectional bias using a withheld test set containing 20% of the occupations.
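The evaluation loop can be organized roughly as follows, reusing the max_skew_at_k helper sketched above; retrieve_top_k is a hypothetical function standing in for embedding the attribute-neutral prompt with the model under evaluation and retrieving its top-K images.

```python
# Sketch of the debiasing evaluation over the withheld occupations; both
# `retrieve_top_k` and the reporting choice (mean over occupations) are
# illustrative assumptions, not the exact protocol.
def mean_max_skew(model, test_occupations, all_attrs, retrieve_top_k):
    scores = []
    for occupation in test_occupations:
        retrieved = retrieve_top_k(model, occupation, k=len(all_attrs))
        scores.append(max_skew_at_k(retrieved, all_attrs))
    # Lower MaxSkew@K on held-out occupations indicates stronger debiasing.
    return sum(scores) / len(scores)
```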
@inproceedings{howard2023probing,
  title={Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples},
  author={Howard, Phillip and Madasu, Avinash and Le, Tiep and Moreno, Gustavo Lujan and Bhiwandiwalla, Anahita and Lal, Vasudev},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024},
}