Mechanistic Finetuning of Vision-Language-Action Models
via Few-Shot Demonstrations
Robotic Steering
Overview
Introduction Video
Abstract
Summary
Vision-Language-Action (VLA) models promise to extend the remarkable success of vision-language models (VLMs) to robotics. Yet, unlike VLMs in the vision-language domain, VLAs must be finetuned to contend with physical factors that vary across settings, such as robot embodiment, environment characteristics, and the spatial relationships involved in each task. Existing finetuning methods lack specificity, adapting the same set of parameters regardless of a task's visual, linguistic, and physical characteristics.
Inspired by functional specificity in neuroscience, we hypothesize that it is more effective to finetune sparse model representations specific to a given task. In this work, we introduce Robotic Steering, a finetuning approach grounded in mechanistic interpretability that leverages few-shot demonstrations to identify and selectively finetune task-specific attention heads aligned with the physical, visual, and linguistic requirements of robotic tasks.
Through comprehensive on-robot evaluations with a Franka Emika robot arm, we demonstrate that Robotic Steering outperforms LoRA while achieving superior robustness under task variation, reduced computational cost, and enhanced interpretability for adapting VLAs to diverse robotic tasks.
Experiments
Video Demonstrations
Few-Shot Adaptation Performance
Zero-Shot
No finetuning
LoRA
Standard adaptation
Robotic Steering
Task-specific heads
Few-shot adaptation comparison: Given only a handful of demonstrations, Robotic Steering identifies and finetunes task-relevant attention heads, achieving superior performance compared to zero-shot inference and standard LoRA adaptation. Note the improved precision in object manipulation and grasp stability.
Performance on Unseen Task
LoRA
On unseen task
Robotic Steering
On unseen task
Transfer to unseen tasks: By preserving general pretraining knowledge through selective finetuning, Robotic Steering maintains strong performance on novel tasks not seen during adaptation. Standard LoRA suffers from catastrophic forgetting, degrading performance on tasks outside the finetuning distribution.
Robustness to Environmental Variations
Lighting Variation
Robotic Steering
Form Variation
Robotic Steering
Distractor Objects
Robotic Steering
Environmental robustness: Robotic Steering demonstrates robust performance across environmental variations including lighting changes, object form differences, and distractor objects. The selective attention head finetuning preserves the model's ability to focus on task-relevant features.
Method
Our Approach
Robotic Steering
Our method leverages few-shot demonstrations to identify task-specific attention heads, then selectively finetunes only those heads rather than broadly updating all parameters.
Head Identification: Use k-NN regression on few-shot demonstrations to identify the attention heads most aligned with the task's physical, visual, and linguistic requirements (see the first sketch below).
Selective LoRA: Apply low-rank adaptation only to the identified task-specific heads, preserving general pretraining knowledge in the rest of the model (see the second sketch below).
Standard Inference: Deploy the adapted model with enhanced task performance while maintaining robustness on unseen variations.
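The head-identification step admits a compact sketch. Assuming each attention head is summarized by one pooled feature vector per demonstration (the pooling itself is not shown here), one plausible reading of the k-NN regression criterion is to score each head by how well a k-NN regressor over its features predicts the demonstrated actions under leave-one-out cross-validation, then keep the top-scoring heads. The function names, the scoring metric, and the leave-one-out protocol are illustrative assumptions, not the released implementation.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

def score_heads(head_features, actions, k=3):
    """Score each attention head by how well its per-demonstration features
    predict the demonstrated actions with a k-NN regressor.

    head_features: dict mapping (layer, head) -> array of shape (n_demos, d)
    actions:       array of shape (n_demos, action_dim)
    Returns a dict mapping (layer, head) -> mean negative MSE (higher is better).
    """
    scores = {}
    for key, feats in head_features.items():
        # k must not exceed n_demos - 1: each leave-one-out fold fits on n_demos - 1 samples.
        knn = KNeighborsRegressor(n_neighbors=k)
        fold_scores = cross_val_score(
            knn, feats, actions,
            cv=LeaveOneOut(),
            scoring="neg_mean_squared_error",
        )
        scores[key] = fold_scores.mean()
    return scores

def select_top_heads(scores, n_heads=16):
    """Return the n_heads (layer, head) pairs whose features best predict actions."""
    return sorted(scores, key=scores.get, reverse=True)[:n_heads]
```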
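Once the heads are selected, the low-rank update is attached only to them. The sketch below assumes that each head's contribution enters the attention output projection through a contiguous slice of that projection's input, and restricts a standard LoRA update to that slice; `HeadLoRA` and its arguments are hypothetical names, and the actual adapter parameterization may differ.

```python
import torch
import torch.nn as nn

class HeadLoRA(nn.Module):
    """LoRA update restricted to a single attention head's slice of the
    (frozen) attention output projection. Minimal sketch, not the released code."""

    def __init__(self, base_out_proj: nn.Linear, head_idx: int,
                 head_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_out_proj
        for p in self.base.parameters():
            p.requires_grad_(False)          # keep pretrained weights frozen
        self.start = head_idx * head_dim     # this head's slice of the concatenated heads
        self.end = self.start + head_dim
        self.A = nn.Parameter(torch.randn(rank, head_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(base_out_proj.out_features, rank))
        self.scale = alpha / rank            # B starts at zero, so the adapter is a no-op at init

    def forward(self, x):                    # x: (..., n_heads * head_dim)
        delta = (x[..., self.start:self.end] @ self.A.T) @ self.B.T
        return self.base(x) + self.scale * delta
```

In use, the attention output projection of each layer containing a selected head would be wrapped once per selected head, and only the A/B factors would be trained; layers with no selected head are left untouched, which is how the general pretraining knowledge mentioned above is preserved.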
Evaluation
Results
Task Performance Comparison
Success rates across manipulation tasks comparing LoRA and Robotic Steering approaches.
Computational Efficiency
Training time vs trainable parameters. Lower-left is better.
Generalization & Robustness
Performance on unseen tasks and under environmental variations.
Reference
Citation
@article{roboticsteering2025,
  title   = {Mechanistic Finetuning of Vision-Language-Action Models via Few-Shot Demonstrations},
  author  = {Mitra, Chancharik and Luo, Yusen and Saravanan, Raj and Niu, Dantong and Pai, Anirudh and Thomason, Jesse and Darrell, Trevor and Anwar, Abrar and Ramanan, Deva and Herzig, Roei},
  journal = {arXiv preprint arXiv:2511.22697},
  year    = {2025}
}