Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers

1Carnegie Mellon University, 2University of California, Berkeley
3IBM Research, 4MIT-IBM Watson AI Lab
*Indicates Equal Contribution

Abstract

Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks such as image captioning or visual question answering. Despite strong performance, LMMs are not directly suited for foundational discriminative vision-language tasks (i.e., tasks requiring discrete label predictions) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for discriminative tasks is the extraction of useful features from generative models. To overcome this issue, we propose an approach for finding features in the model's latent space to more effectively leverage LMMs for discriminative tasks. Toward this end, we present Sparse Attention Vectors (SAVs) -- a finetuning-free method that leverages sparse attention head activations (fewer than 1% of the heads) in LMMs as strong features for VL tasks. With only few-shot examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of discriminative tasks. Our experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.

Image giving an overview of the Sparse Attention Vectors method.

What are SAVs?

Sparse Attention Vectors (SAVs) are task-specific features extracted from the attention heads of large multimodal models (LMMs), offering a lightweight yet powerful solution for vision-language classification tasks. The method involves three primary steps:

  1. Extracting Attention Vectors: Attention vectors from all heads and layers of the transformer are computed for a given set of few-shot examples. These vectors represent the model's latent understanding of the input sequence.
  2. Identifying Relevant Vectors: A scoring process evaluates the ability of each attention head to classify examples correctly. This step involves computing the cosine similarity between attention vectors and class centroids, selecting only the top-performing heads.
  3. Classification with Sparse Heads: The sparse set of attention heads is then used to classify new queries. Each head votes for a class label based on its similarity to precomputed centroids, and a majority vote determines the final prediction.

This approach reveals a remarkable property of LMMs: even a small subset of attention heads can serve as highly effective features for downstream classification, dramatically reducing computational overhead while maintaining strong performance.
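To make these steps concrete, here is a minimal PyTorch sketch of the pipeline (not the authors' released implementation). It assumes the per-head attention outputs at the final token position have already been extracted into a tensor acts of shape (N, num_layers, num_heads, head_dim) for the N few-shot examples, e.g., via forward hooks on the LMM's attention modules.

import torch
import torch.nn.functional as F

def compute_centroids(acts, labels, num_classes):
    # acts: (N, L, H, D) attention vectors; labels: (N,) class indices.
    # Mean attention vector per (layer, head) for each class, L2-normalized.
    cents = torch.stack([acts[labels == c].mean(dim=0) for c in range(num_classes)])
    return F.normalize(cents, dim=-1)                               # (C, L, H, D)

def score_heads(acts, labels, cents):
    # Few-shot accuracy of each head when classifying by nearest class centroid.
    sims = torch.einsum("nlhd,clhd->nlhc", F.normalize(acts, dim=-1), cents)
    preds = sims.argmax(dim=-1)                                     # (N, L, H)
    return (preds == labels.view(-1, 1, 1)).float().mean(dim=0)     # (L, H)

def select_sparse_heads(scores, k=20):
    # Keep only the top-k scoring heads (fewer than 1% of all heads).
    top = torch.topk(scores.flatten(), k).indices
    return [(int(i) // scores.shape[1], int(i) % scores.shape[1]) for i in top]

def classify(query_acts, cents, heads):
    # query_acts: (L, H, D) for one query; each selected head votes for a class.
    votes = []
    for l, h in heads:
        sims = cents[:, l, h] @ F.normalize(query_acts[l, h], dim=-1)  # (C,)
        votes.append(int(sims.argmax()))
    return max(set(votes), key=votes.count)                         # majority vote

At inference time, the only cost beyond a single forward pass of the LMM is computing cosine similarities between roughly 20 head activations and their stored class centroids.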

Image illustrating Sparse Attention Vectors

Benchmark Results

Our evaluation highlights the strength and versatility of Sparse Attention Vectors (SAVs) across various vision-language tasks. Below are the key insights from our results:

  1. Outperforming Zero-Shot Baselines: SAVs consistently outperform leading zero-shot LMMs like LLaVA and Qwen2-VL on safety benchmarks and on tasks such as visual question answering (VQA) and image classification.
  2. Closing the Gap with Discriminative Models: SAVs drastically reduce the performance gap with discriminative vision-language models like SigLIP and CLIP, even on tasks where these models traditionally excel.
  3. Superiority to Few-Shot and Finetuning Methods: SAVs outperform advanced finetuning methods like LoRA, achieving top results on datasets such as EuroSAT and Pets and showcasing their efficiency in extracting task-relevant features without extensive resources.
  4. Excelling in Complex Tasks: SAVs demonstrate remarkable adaptability on perception benchmarks like BLINK and NaturalBench, which require compositional reasoning and multimodal understanding, outperforming existing methods in these challenging scenarios.
Numerical Results Table

These findings emphasize SAVs as lightweight yet powerful features for a wide array of vision-language tasks, effectively bridging the gap between general-purpose generative models and task-specific discriminative ones.

Analysis

SAVs are effective feature representations with several desirable properties, which we describe below.

First, we demonstrate the sparsity and interpretability of SAVs by visualizing the selected heads for hallucination detection (MHaluBench), relative depth (BLINK), and image classification (EuroSAT) tasks. We emphasize the importance of being able to identify the exact attention heads used for a specific task, which is an especially useful property for use cases requiring additional explainability. Furthermore, we visualize the effectiveness of the top selected attention head in separating the task data according to class label via a t-SNE plot.

Attention Head Visualizations
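For reference, a plot of this kind can be produced with an off-the-shelf t-SNE over the vectors of the single top-scoring head. The sketch below assumes the acts and labels tensors and a top head index (l, h) taken from the pipeline sketch above.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

head_vecs = acts[:, l, h].cpu().numpy()        # (N, D) vectors from one head
emb = TSNE(n_components=2, perplexity=min(30, len(head_vecs) - 1)).fit_transform(head_vecs)
plt.scatter(emb[:, 0], emb[:, 1], c=labels.cpu().numpy(), cmap="tab10")
plt.title("t-SNE of the top SAV head, colored by class label")
plt.show()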

To evaluate the scaling properties of SAVs, we experiment with varying the number of examples per label and the number of attention vectors used. Excitingly, we find that SAV performance scales with additional examples and that a sparse set of only 20 heads suffices for optimal or near-optimal performance.

Varying number of examples and attention vectors for SAVs
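A sweep of this kind can be sketched with the helpers defined earlier. The pool_acts/pool_labels (few-shot pool) and eval_acts/eval_labels (held-out set) tensors and the simple per-class sampler below are illustrative placeholders, not the paper's exact protocol.

def sample_k_per_class(labels, k):
    # Indices of the first k examples of each class (illustrative sampler).
    return torch.cat([torch.nonzero(labels == c).flatten()[:k] for c in labels.unique()])

def evaluate(acts, labels, cents, heads):
    # Accuracy of majority-vote classification over the selected heads.
    preds = torch.tensor([classify(acts[i], cents, heads) for i in range(len(labels))])
    return (preds == labels.cpu()).float().mean().item()

for shots in [1, 5, 10, 20]:                              # examples per label
    idx = sample_k_per_class(pool_labels, shots)
    cents = compute_centroids(pool_acts[idx], pool_labels[idx], num_classes)
    scores = score_heads(pool_acts[idx], pool_labels[idx], cents)
    for k in [1, 5, 10, 20, 50]:                          # number of attention vectors
        acc = evaluate(eval_acts, eval_labels, cents, select_sparse_heads(scores, k))
        print(f"shots={shots:>2}  heads={k:>3}  acc={acc:.3f}")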

In our paper, we perform several other experiments to better understand SAVs, the most notable of which examines how well the heads found during SAV extraction generalize. We find that SAV heads extracted on MHaluBench yield strong performance on VLGuard and vice versa. LoRA weights, on the other hand, do not generalize in this way and, as expected, largely overfit to the dataset they were finetuned on.
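One plausible way to run this transfer check with the helpers above is to keep the head indices selected on the source task while recomputing the voting centroids from the target task's few-shot examples; the tensor names below are illustrative, and the exact protocol may differ from the paper's.

# Select heads on task A (e.g., MHaluBench).
cents_a = compute_centroids(acts_a, labels_a, num_classes_a)
heads_a = select_sparse_heads(score_heads(acts_a, labels_a, cents_a), k=20)

# Reuse task A's heads on task B (e.g., VLGuard): only the centroids are
# recomputed from B's few-shot examples; the head indices transfer as-is.
cents_b = compute_centroids(acts_b, labels_b, num_classes_b)
acc_transfer = evaluate(eval_acts_b, eval_labels_b, cents_b, heads_a)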

Another crucial benefit of SAVs is that they are truly multimodal features that can be flexibly applied to both unimodal and multimodal tasks. This is in contrast to CLIP and SigLIP, which produce separate features for each modality. To demonstrate this benefit, we compare SAVs to CLIP and SigLIP on two tasks with interleaved multimodal inputs (MHaluBench and NaturalBench). For CLIP and SigLIP, we concatenate the image and text features. We find that both vastly underperform SAVs on these tasks, highlighting the value of SAVs' genuinely multimodal representations.

Generalization and Interleaved Tasks Experiments
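As a point of reference, the concatenation baseline for CLIP can be sketched with the Hugging Face transformers API (the checkpoint name below is an example); the resulting concatenated vectors can then be classified with the same nearest-centroid voting used for SAVs, and SigLIP is handled analogously.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_concat_feature(image: Image.Image, text: str) -> torch.Tensor:
    # Encode each modality separately, then concatenate into one feature vector.
    inputs = processor(text=[text], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    return torch.cat([img, txt], dim=-1).squeeze(0)   # e.g., 512 + 512 = 1024 dims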

BibTeX

@article{mitra2024sparse,
        title={Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers},
        author={Mitra, Chancharik and Huang, Brandon and Chai, Tianning and Lin, Zhiqiu and Arbelle, Assaf and Feris, Rogerio and Karlinsky, Leonid and Darrell, Trevor and Ramanan, Deva and Herzig, Roei},
        journal={arXiv preprint arXiv:2412.00142},
        year={2024}
      }