While vision-language models (VLMs) have achieved remarkable performance improvements in recent years, there is growing evidence that these models also possess harmful biases with respect to social attributes such as gender and race. Prior studies have primarily probed such bias attributes individually while ignoring biases associated with intersections of social attributes. This may be due to the difficulty of collecting an exhaustive set of image-text pairs for various combinations of social attributes.
To address this challenge, we employ text-to-image diffusion models to produce counterfactual examples for probing intersectional social biases at scale. Our approach utilizes Stable Diffusion with cross attention control to produce sets of counterfactual image-text pairs that are highly similar in their depiction of a subject (e.g., a given occupation) while differing only in their depiction of intersectional social attributes (e.g., race & gender).
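A rough sketch of this generation step is shown below. It uses the Hugging Face diffusers library and simply reuses the same initial latent noise for every counterfactual caption, which is a simplified stand-in for the prompt-to-prompt style cross-attention control used in our pipeline; the checkpoint name, captions, and seed are illustrative assumptions rather than the exact settings we use.

```python
# Minimal sketch (not our full pipeline): generate one counterfactual image set
# with Stable Diffusion, reusing the same initial latents for every caption so
# that only the attribute words differ. Full cross-attention control
# (prompt-to-prompt editing) additionally shares attention maps across prompts.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",  # illustrative checkpoint choice
    torch_dtype=torch.float16,
).to("cuda")

captions = [
    "A photo of a Black female construction worker",
    "A photo of a Black male construction worker",
    "A photo of a White female construction worker",
    "A photo of a White male construction worker",
]

# Re-seeding the generator before each prompt keeps the initial noise identical,
# so the scenes stay closely aligned apart from the edited social attributes.
images = []
for caption in captions:
    generator = torch.Generator(device="cuda").manual_seed(1234)
    image = pipe(caption, generator=generator, num_inference_steps=50).images[0]
    images.append(image)
```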
Through our over-generate-then-filter methodology, we produce SocialCounterfactuals, a high-quality dataset containing over 171k image-text pairs for probing intersectional biases related to gender, race, and physical characteristics. We conduct extensive experiments to demonstrate the usefulness of our generated dataset for probing and mitigating intersectional social biases in state-of-the-art VLMs.
Our approach to creating counterfactual image-text examples for intersectional social biases consists of three steps. (1) First, we construct sets of image captions describing a subject with counterfactual changes to intersecting social attributes. (2) We then utilize a text-to-image diffusion model with cross attention control to over-generate sets of images corresponding to the counterfactual captions, where differences among images are isolated to the induced counterfactual change (i.e., the social attributes). (3) Finally, we apply stringent filtering to identify only the highest-quality generations.
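The following minimal sketch illustrates step (1) for a single subject; the attribute lists and prompt template are illustrative examples, not the full vocabularies used to build SocialCounterfactuals.

```python
# Sketch of step (1): build a counterfactual caption set for one subject by
# taking the Cartesian product of two intersecting attribute sets.
from itertools import product

races = ["Asian", "Black", "Latino", "White"]   # illustrative attribute set A1
genders = ["female", "male"]                    # illustrative attribute set A2
subject = "construction worker"

counterfactual_set = [
    f"A photo of a {race} {gender} {subject}"
    for race, gender in product(races, genders)
]
# One set contains |A1| x |A2| captions (here 4 x 2 = 8) that differ only in
# the intersectional attributes; images are then over-generated for each caption
# (step 2) and stringently filtered (step 3) to keep the highest-quality sets.
```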
We group counterfactual sets into three dataset segments based on the pair of attribute types used to construct the captions, which are detailed in the rows of Table 1. In total, our dataset consists of 13,824 counterfactual sets with 170,832 image-text pairs, which represents the largest paired image-text dataset for investigating social biases to date.
For example, given the counterfactual caption "A {race} {gender} construction worker", we form its corresponding attribute-neutral prompt "A construction worker". We construct neutral prompts in this manner for each unique combination of prefixes and subjects, averaging their text representations across different prefixes to obtain a single text embedding for each subject. MaxSkew@K is then calculated by retrieving the top-K images for the computed text embedding from the set of all images generated for the subject which met our filtering and selection criteria. We set K = |A1| × |A2|, where A1 and A2 are the investigated attribute sets.
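A minimal sketch of how MaxSkew@K can be computed from the attribute labels of the top-K retrieved images is given below, assuming a uniform desired distribution over the |A1| × |A2| intersectional attribute combinations; the helper name and input format are our own illustrative choices, not an official API.

```python
# Sketch of MaxSkew@K under a uniform desired distribution over the
# |A1| x |A2| intersectional attribute combinations.
import math
from collections import Counter

def max_skew_at_k(retrieved_attrs, all_attrs):
    """retrieved_attrs: attribute combo of each of the top-K retrieved images.
    all_attrs: every possible attribute combo (K = len(all_attrs))."""
    k = len(retrieved_attrs)
    desired = 1.0 / len(all_attrs)          # uniform target proportion
    counts = Counter(retrieved_attrs)
    skews = []
    for attr in all_attrs:
        observed = counts.get(attr, 0) / k  # observed proportion in top-K
        # A small floor avoids log(0) when an attribute combo is entirely absent.
        skews.append(math.log(max(observed, 1e-12) / desired))
    return max(skews)

# Example: with 8 intersectional combos and K = 8, a perfectly balanced
# retrieval yields MaxSkew@K = 0, while any over-representation yields > 0.
```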
We additionally fine-tune VLMs on SocialCounterfactuals to produce debiased variants of these models. To estimate the magnitude of debiasing, we evaluate each model's MaxSkew@K for intersectional bias using a withheld test set containing 20% of the occupations.
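The evaluation loop can be organized roughly as follows, reusing the max_skew_at_k helper sketched above; retrieve_top_k is a hypothetical function standing in for embedding the attribute-neutral prompt with the model under evaluation and retrieving its top-K images.

```python
# Sketch of the debiasing evaluation over the withheld occupations; both
# `retrieve_top_k` and the reporting choice (mean over occupations) are
# illustrative assumptions, not the exact protocol.
def mean_max_skew(model, test_occupations, all_attrs, retrieve_top_k):
    scores = []
    for occupation in test_occupations:
        retrieved = retrieve_top_k(model, occupation, k=len(all_attrs))
        scores.append(max_skew_at_k(retrieved, all_attrs))
    # Lower MaxSkew@K on held-out occupations indicates stronger debiasing.
    return sum(scores) / len(scores)
```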
@inproceedings{howard2023probing,
  title={Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples},
  author={Howard, Phillip and Madasu, Avinash and Le, Tiep and Moreno, Gustavo Lujan and Bhiwandiwalla, Anahita and Lal, Vasudev},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024},
}