LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Intel Labs

Introduction

We train and release a suite of multimodal foundation models (MMFMs, also known as Large Multimodal Models) built on the popular LLaVA framework with the Gemma family of Large Language Models (LLMs) recently released by Google. Of particular interest is the 2B-parameter Gemma model, which offers opportunities to efficiently prototype and test hypotheses about the design space of LLaVA-style models. In line with findings from other work on LLaVA-style models, we test the effect of ablating three design features: pretraining the connector, using a more powerful image backbone, and increasing the size of the language backbone. The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations but do not surpass current comparably sized SOTA models. Closer analysis shows mixed effects: skipping pretraining tends to reduce performance, larger vision models sometimes improve performance, and increasing language model size has inconsistent results. We publicly release training recipes, code, and weights for our models, trained on Intel’s Gaudi 2 AI Accelerators with DeepSpeed.


Try it out!

The base versions of the llava-gemma models are available on HuggingFace (HF) at Intel/llava-gemma-2b. While these checkpoints have been converted to the HF version of LLaVA, using them currently requires a modified processor, processing_llavagemma.py, available in the model repository.

With processing_llavagemma.py copied to the appropriate location (e.g. the directory you are running your script from), you can try out the llava-gemma checkpoint using the following code snippet:

import requests
from PIL import Image
from transformers import (
  LlavaForConditionalGeneration,
  AutoTokenizer,
  CLIPImageProcessor
)
from processing_llavagemma import LlavaGemmaProcessor # This is in this repo

checkpoint = "Intel/llava-gemma-2b"

# Load model and move it to the GPU (drop the .to('cuda') calls to run on CPU)
model = LlavaForConditionalGeneration.from_pretrained(checkpoint)
model.to('cuda')
processor = LlavaGemmaProcessor(
    tokenizer=AutoTokenizer.from_pretrained(checkpoint),
    image_processor=CLIPImageProcessor.from_pretrained(checkpoint)
)

# Prepare inputs
# Use gemma chat template
prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "What's the content of the image?<image>"}],
    tokenize=False,
    add_generation_prompt=True
)
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = {k: v.to('cuda') for k, v in inputs.items()}
      
# Generate
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)

LLaVA

LLaVA (Large Language and Vision Assistant) is a lightweight and powerful framework for combining pretrained language and vision models into a multimodal chatbot capable of handling combined vision and language inputs.

The "recipe" consists of three components and two training steps. The three components are a pretrained language model, a pretrained vision encoder, and a connector. In the v1.5 recipe, the language model is Vicuna, the vision encoder CLIP ViT-L/14, and the connector a 2-layer perceptron. The first stage pretrains the MLP connector by training on a dataset of 595k vision-language samples filtered from CC3M. The second stage jointly finetunes the language modela nd connector using a mixture of 665 multimodal instruction tuning examples.


Gemma

Gemma is “[a] family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models” released by Google in February 2024.

We use the instruction-tuned checkpoints available on HuggingFace (gemma-2b-it and gemma-7b-it).
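If you want to examine the language backbone on its own, the instruction-tuned Gemma checkpoints can be loaded with the standard transformers APIs. The snippet below is an illustrative sketch, independent of the LLaVA wrapper; it assumes the HF model IDs google/gemma-2b-it and google/gemma-7b-it, which are access-gated on HuggingFace.

from transformers import AutoModelForCausalLM, AutoTokenizer

backbone_id = "google/gemma-2b-it"  # or "google/gemma-7b-it"
tokenizer = AutoTokenizer.from_pretrained(backbone_id)
model = AutoModelForCausalLM.from_pretrained(backbone_id)

# Gemma's chat template is the same one applied to the llava-gemma prompt above
prompt = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "Describe a stop sign in one sentence."}],
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))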


Performance

Using the LLaVA-v1.5 recipe, the models achieve reasonable but unspectacular performance on a range of multimodal benchmarks, failing to surpass the 7B-parameter LLaVA-v1.5 model. We are currently experimenting with techniques to improve the performance of the Gemma-based LLaVA models and will update this page as we publish results.

Language Model   Vision Model   GQA     MME Cog.   MME Per.   MM-Vet   POPE Acc.   POPE F1   VQAv2   MMVP    ScienceQA Image
gemma-2b-it      CLIP           0.531   236        1130       17.7     0.850       0.839     70.7    0.287   0.564
gemma-7b-it      CLIP           0.472   254        895        18.2     0.848       0.829     68.7    0.327   0.625
Phi-2b           CLIP           -       -          1335       28.9     -           0.850     71.4    -       0.684
Llama-2-7b       CLIP           0.620   348        1511       30.6     0.850       0.859     78.5    46.1    0.704