Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks such as image captioning or visual question answering. Despite strong performance, LMMs are not directly suited for foundational discriminative vision-language tasks (i.e., tasks requiring discrete label predictions) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for discriminative tasks is the extraction of useful features from generative models. To overcome this issue, we propose an approach for finding features in the model's latent space to more effectively leverage LMMs for discriminative tasks. Toward this end, we present Sparse Attention Vectors (SAVs) -- a finetuning-free method that leverages sparse attention head activations (fewer than 1% of the heads) in LMMs as strong features for VL tasks. With only few-shot examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of discriminative tasks. Our experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.
Sparse Attention Vectors (SAVs) are task-specific features extracted from the attention heads of large multimodal models (LMMs), offering a lightweight yet powerful solution for vision-language classification tasks. The method involves three primary steps (a minimal code sketch follows below):
1. Extract attention vectors: run the LMM on a small set of labeled few-shot examples and record the activations of each attention head.
2. Select a sparse set of heads: identify the small subset of heads (fewer than 1% of all heads) whose activations best discriminate the task labels on the few-shot examples.
3. Classify with the selected heads: for a new input, extract the same head activations and assign the label whose few-shot examples they most closely match.
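The sketch below illustrates this pipeline in plain numpy, assuming the per-head activations have already been extracted from the LMM (an array of shape [num_examples, num_heads, head_dim]). The cosine nearest-class-mean scoring, top-k head selection, and majority vote are simplifications chosen for illustration rather than the exact procedure from the paper.

```python
import numpy as np

def class_means(feats, labels):
    """Per-class mean vectors for one head. feats: [N, D], labels: length-N list."""
    classes = sorted(set(labels))
    return classes, np.stack([feats[np.array(labels) == c].mean(0) for c in classes])

def nearest_mean_predict(feats, means):
    """Cosine-similarity nearest-class-mean prediction. feats: [N, D], means: [C, D]."""
    f = feats / np.linalg.norm(feats, axis=-1, keepdims=True)
    m = means / np.linalg.norm(means, axis=-1, keepdims=True)
    return (f @ m.T).argmax(-1)

def select_heads(shot_acts, shot_labels, k=20):
    """Rank heads by how well their few-shot activations classify the shots.
    shot_acts: [N, H, D] attention-head activations of the few-shot examples."""
    classes = sorted(set(shot_labels))
    y = np.array([classes.index(l) for l in shot_labels])
    accs = []
    for h in range(shot_acts.shape[1]):
        _, means = class_means(shot_acts[:, h], shot_labels)
        accs.append((nearest_mean_predict(shot_acts[:, h], means) == y).mean())
    return np.argsort(accs)[::-1][:k]  # indices of the top-k most discriminative heads

def sav_classify(query_acts, shot_acts, shot_labels, heads):
    """Majority vote over the selected heads' nearest-class-mean predictions."""
    classes = sorted(set(shot_labels))
    votes = np.zeros((query_acts.shape[0], len(classes)), dtype=int)
    for h in heads:
        _, means = class_means(shot_acts[:, h], shot_labels)
        for i, p in enumerate(nearest_mean_predict(query_acts[:, h], means)):
            votes[i, p] += 1
    return [classes[i] for i in votes.argmax(-1)]
```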
This approach reveals a remarkable property of LMMs: even a small subset of attention heads can serve as highly effective features for downstream classification, dramatically reducing computational overhead while maintaining strong performance.
Our evaluation highlights the strength and versatility of Sparse Attention Vectors (SAVs) across various vision-language tasks. Below are the key insights from our results:
- With only few-shot examples, SAVs achieve state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of discriminative tasks.
- Performance scales with additional examples per label, while a sparse set of roughly 20 heads suffices for optimal or near-optimal accuracy.
- SAV heads selected on one task transfer to similar tasks (e.g., between MHaluBench and VLGuard), unlike LoRA weights, which largely overfit to the finetuned dataset.
- As truly multimodal features, SAVs substantially outperform CLIP and SigLIP on tasks with interleaved image-text inputs such as MHaluBench and NaturalBench.
These findings emphasize SAVs as lightweight yet powerful features for a wide array of vision-language tasks, effectively bridging the gap between general-purpose generative models and task-specific discriminative ones.
SAVs are effective feature representations with a variety of desirable properties, which we describe below:
First, we demonstrate the sparsity and interpretability of SAVs by visualizing the selected heads for hallucination detection (MHaluBench), relative depth (BLINK), and image classification (EuroSAT) tasks. We emphasize the importance of being able to identify the exact attention heads used for a specific task, an especially useful property for use cases requiring additional explainability. Furthermore, we visualize how effectively the top selected attention head separates the task data by class label via a t-SNE plot.
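A t-SNE visualization of this kind can be produced with a short helper like the hypothetical one below, which projects a single selected head's activations to 2-D and colors points by class label; the scikit-learn perplexity and initialization settings are illustrative choices, not the exact ones behind our figures.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_head_tsne(head_acts, labels, title="Top SAV head"):
    """2-D t-SNE of one attention head's activations, colored by class label.
    head_acts: [N, D] activations of a single selected head; labels: length-N list.
    Note: perplexity must be smaller than the number of samples N."""
    emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(head_acts)
    for c in sorted(set(labels)):
        mask = np.array(labels) == c
        plt.scatter(emb[mask, 0], emb[mask, 1], s=8, label=str(c))
    plt.legend()
    plt.title(title)
    plt.show()
```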
To evaluate the scaling properties of SAVs, we experiment with varying the number of examples per label and the number of attention vectors used. Excitingly, we find that SAVs scale with additional examples and require a sparse set of only 20 heads to achieve optimal or near-optimal performance.
In our paper, we perform several other experiments to better understand SAVs. The most notable of these concerns the generalization of the heads found during SAV extraction. We find that SAV heads extracted on MHaluBench yield strong performance on VLGuard and vice versa. LoRA weights, by contrast and as expected, show no such generalization and largely overfit to the finetuned dataset.
Another crucial benefit of SAVs is that they are truly multimodal features that can be flexibly applied to both unimodal and multimodal tasks. This is in contrast to CLIP and SigLIP, which produce separate features for each modality. To demonstrate this benefit, we compare SAVs to CLIP and SigLIP on two tasks with interleaved multimodal inputs (MHaluBench and NaturalBench). For CLIP and SigLIP, we concatenate the image and text features together (a sketch of this baseline is shown below). We find that CLIP and SigLIP vastly underperform SAVs on these tasks, highlighting the value of the truly multimodal nature of SAV features.
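The concatenation baseline can be sketched roughly as follows with the Hugging Face transformers CLIP API; the checkpoint name and normalization choices here are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP (or SigLIP) checkpoint could be substituted.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_concat_features(image, text):
    """Return a single concatenated [image_emb; text_emb] feature for one
    interleaved (image, text) example. image: PIL.Image, text: str."""
    inputs = processor(text=[text], images=[image], return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # L2-normalize each modality before concatenation so neither dominates.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return torch.cat([img_emb, txt_emb], dim=-1)  # shape [1, 2 * embed_dim]
```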
@article{mitra2024sparse,
title={Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers},
author={Mitra, Chancharik and Huang, Brandon and Chai, Tianning and Lin, Zhiqiu and Arbelle, Assaf and Feris, Rogerio and Karlinsky, Leonid and Darrell, Trevor and Ramanan, Deva and Herzig, Roei},
journal={arXiv preprint arXiv:2412.00142},
year={2024}
}