LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Intel Labs

Introduction

We train and release a suite of multimodal foundation models (MMFMs, also known as Large Multimodal Models) built on the popular LLaVA framework with the Gemma family of Large Language Models (LLMs) recently released by Google. Of particular interest is the 2B-parameter Gemma model, which offers opportunities to efficiently prototype and test hypotheses about the design space of LLaVA-style models. In line with findings from other work on LLaVA-style models, we test the effect of ablating three design features: pretraining the connector, using a more powerful image backbone, and increasing the size of the language backbone. The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations but do not surpass current comparably sized SOTA models. Closer analysis shows mixed effects: skipping pretraining tends to reduce performance, larger vision models sometimes improve performance, and increasing language model size has inconsistent results. We publicly release training recipes, code, and weights for our models, trained on Intel’s Gaudi 2 AI Accelerators with DeepSpeed.


Try it out!

The base versions of the llava-gemma models are available on HuggingFace (HF) at Intel/llava-gemma-2b. While these checkpoints have been converted to the HF version of LLaVA, using them currently requires a modified processor, processing_llavagemma.py, available in the model repository.

With processing_llavagemma.py copied to the appropriate location (e.g. the directory you are running your script from), you can try out the llava-gemma checkpoint using the following code snippet:

import requests
from PIL import Image
from transformers import (
  LlavaForConditionalGeneration,
  AutoTokenizer,
  CLIPImageProcessor
)
from processing_llavagemma import LlavaGemmaProcessor # This is in this repo

checkpoint = "Intel/llava-gemma-2b"

# Load model and move it to the GPU (drop the .to('cuda') calls to run on CPU)
model = LlavaForConditionalGeneration.from_pretrained(checkpoint)
model.to('cuda')
processor = LlavaGemmaProcessor(
    tokenizer=AutoTokenizer.from_pretrained(checkpoint),
    image_processor=CLIPImageProcessor.from_pretrained(checkpoint)
)

# Prepare inputs
# Use gemma chat template
prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "What's the content of the image?<image>"}],
    tokenize=False,
    add_generation_prompt=True
)
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = {k: v.to('cuda') for k, v in inputs.items()}
      
# Generate
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)

LLaVA

LLaVA (Large Language and Vision Assistant) is a lightweight and powerful framework for combining pretrained language and vision models into a multimodal chatbot capable of handling combined vision and language inputs.

The "recipe" consists of three components and two training steps. The three components are a pretrained language model, a pretrained vision encoder, and a connector. In the v1.5 recipe, the language model is Vicuna, the vision encoder CLIP ViT-L/14, and the connector a 2-layer perceptron. The first stage pretrains the MLP connector by training on a dataset of 595k vision-language samples filtered from CC3M. The second stage jointly finetunes the language modela nd connector using a mixture of 665 multimodal instruction tuning examples.


Gemma

Gemma is “[a] family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models” released by Google in February 2024.

We use the instruction-tuned checkpoints available on HuggingFace (gemma-2b-it and gemma-7b-it).
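If you want to examine the language backbone on its own, the instruction-tuned Gemma checkpoints can be loaded with the standard transformers APIs. The snippet below is an illustrative sketch, independent of the LLaVA wrapper; it assumes the HF model IDs google/gemma-2b-it and google/gemma-7b-it, which are access-gated on HuggingFace.

from transformers import AutoModelForCausalLM, AutoTokenizer

backbone_id = "google/gemma-2b-it"  # or "google/gemma-7b-it"
tokenizer = AutoTokenizer.from_pretrained(backbone_id)
model = AutoModelForCausalLM.from_pretrained(backbone_id)

# Gemma's chat template is the same one applied to the llava-gemma prompt above
prompt = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "Describe a stop sign in one sentence."}],
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))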


Performance

Using the LLaVA-v1.5 recipe, the models achieve reasonable but unspectacular performance on a range of multimodal benchmarks, failing to surpass the 7B-parameter LLaVA-v1.5 model. We are currently experimenting with techniques to improve the performance of the Gemma-based LLaVA models and will update this page as we publish results.

Language Model   Vision Model   GQA     MME Cog.   MME Per.   MM-Vet   POPE Acc.   POPE F1   VQAv2   MMVP    ScienceQA Image
gemma-2b-it      CLIP           0.531   236        1130       17.7     0.850       0.839     70.7    0.287   0.564
gemma-7b-it      CLIP           0.472   254        895        18.2     0.848       0.829     68.7    0.327   0.625
Phi-2b           CLIP           -       -          1335       28.9     -           0.850     71.4    -       0.684
Llama-2-7b       CLIP           0.620   348        1511       30.6     0.850       0.859     78.5    46.1    0.704