Mechanistic Finetuning of Vision-Language-Action Models
via Few-Shot Demonstrations

Robotic Steering

Chancharik Mitra Carnegie Mellon University
Yusen Luo University of Southern California
Raj Saravanan UC Berkeley
Dantong Niu UC Berkeley
Anirudh Pai UC Berkeley
Jesse Thomason University of Southern California
Trevor Darrell UC Berkeley
Abrar Anwar University of Southern California
Deva Ramanan Carnegie Mellon University
Roei Herzig UC Berkeley, MIT-IBM Watson AI Lab

Introduction Video

Summary

Vision-Language-Action (VLA) models promise to extend the remarkable success of vision-language models (VLMs) to robotics. Yet, unlike VLMs in the vision-language domain, VLAs require finetuning to contend with the physical factors that vary across robotic tasks, such as robot embodiment, environment characteristics, and spatial relationships. Existing finetuning methods lack specificity, adapting the same set of parameters regardless of a task's visual, linguistic, and physical characteristics.


Inspired by functional specificity in neuroscience, we hypothesize that it is more effective to finetune sparse model representations specific to a given task. In this work, we introduce Robotic Steering, a finetuning approach grounded in mechanistic interpretability that leverages few-shot demonstrations to identify and selectively finetune task-specific attention heads aligned with the physical, visual, and linguistic requirements of robotic tasks.


Through comprehensive on-robot evaluations with a Franka Emika robot arm, we demonstrate that Robotic Steering outperforms LoRA while achieving superior robustness under task variation, reduced computational cost, and enhanced interpretability for adapting VLAs to diverse robotic tasks.

Video Demonstrations

1. Few-Shot Adaptation Performance

[Comparison videos: Zero-Shot (no finetuning) · LoRA (standard adaptation)]

Few-shot adaptation comparison: Given only a handful of demonstrations, Robotic Steering identifies and finetunes task-relevant attention heads, achieving superior performance compared to zero-shot inference and standard LoRA adaptation. Note the improved precision in object manipulation and grasp stability.

2. Performance on Unseen Task

[Video: LoRA on an unseen task]

Transfer to unseen tasks: By preserving general pretraining knowledge through selective finetuning, Robotic Steering maintains strong performance on novel tasks not seen during adaptation. Standard LoRA suffers from catastrophic forgetting, degrading performance on tasks outside the finetuning distribution.

3. Robustness to Environmental Variations

Environmental robustness: Robotic Steering demonstrates robust performance across environmental variations including lighting changes, object form differences, and distractor objects. The selective attention head finetuning preserves the model's ability to focus on task-relevant features.

Our Approach

[Figure: Robotic Steering approach diagram showing task specification and finetuning comparison]

Robotic Steering

Our method leverages few-shot demonstrations to identify task-specific attention heads, then selectively finetunes only those heads rather than broadly updating all parameters.

1. Head Identification: Use k-NN regression on few-shot demonstrations to identify the attention heads most aligned with the task's physical, visual, and linguistic requirements (an illustrative sketch follows this list).

2. Selective LoRA: Apply low-rank adaptation only to the identified task-specific heads, preserving general pretraining knowledge in the other model components.

3. Standard Inference: Deploy the adapted model with enhanced task performance while maintaining robustness on unseen variations.
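
To make steps 1 and 2 concrete, below is a minimal, self-contained Python sketch of the select-then-adapt idea. It is not the released implementation: the pooled per-head activations, the cross-validated R² scoring, the number of selected heads, and the way selected heads are handed to LoRA are illustrative assumptions built around the k-NN regression described above, and synthetic arrays stand in for real demonstration data.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Stand-ins for few-shot demonstration data (synthetic; in practice these would come
# from the VLA's forward passes over the demonstrations).
# head_acts[l, h] : (n_steps, d_head) pooled activations of head h in layer l
# actions         : (n_steps, action_dim) demonstrated end-effector actions
n_layers, n_heads, d_head, action_dim, n_steps = 4, 8, 64, 7, 120
head_acts = rng.normal(size=(n_layers, n_heads, n_steps, d_head))
actions = rng.normal(size=(n_steps, action_dim))

def head_score(x, y, k=5):
    # Cross-validated R^2 of a k-NN regressor predicting actions from one head's activations.
    knn = KNeighborsRegressor(n_neighbors=k)
    return cross_val_score(knn, x, y, cv=3, scoring="r2").mean()

# Step 1: Head Identification -- score every (layer, head) pair and keep the top ones.
scores = {
    (layer, head): head_score(head_acts[layer, head], actions)
    for layer in range(n_layers)
    for head in range(n_heads)
}
top_k = 6  # number of heads to finetune (illustrative hyperparameter)
selected = sorted(scores, key=scores.get, reverse=True)[:top_k]
print("Selected (layer, head) pairs:", selected)

# Step 2: Selective LoRA -- adapt only the projection slices belonging to the selected
# heads (here we simply report which weight columns a per-head adapter would target).
for layer, head in selected:
    lo, hi = head * d_head, (head + 1) * d_head
    print(f"Layer {layer}: attach LoRA to attention projection columns [{lo}:{hi}] (head {head})")

In this sketch, heads whose activations best predict the demonstrated actions under k-NN regression are treated as task-relevant, and only their attention projections receive low-rank updates; all other parameters stay frozen, which is what preserves the pretrained behavior on unseen tasks.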

Results

Task Performance Comparison

[Chart: success rates across manipulation tasks comparing LoRA and Robotic Steering (Ours).]

Computational Efficiency

[Chart: training time (min) vs. trainable parameters (M) for LoRA and Ours; lower-left is better. Robotic Steering uses 22× fewer parameters and trains 21% faster.]

Generalization & Robustness

Performance (success rate) on unseen tasks and under environmental variations:

Condition                 LoRA     Ours
Unseen Task (Pick Mug)    42.5%    65%
Lighting Variation        25%      47.5%
Form Variation            22.5%    37.5%
Distractor Objects        30%      40%

Key results: +12% average task improvement, 22× fewer trainable parameters, +53% gain on the unseen task, and 21% training time saved.

Citation

@article{roboticsteering2025,
  title   = {Mechanistic Finetuning of Vision-Language-Action Models 
             via Few-Shot Demonstrations},
  author  = {Mitra, Chancharik and Luo, Yusen and Saravanan, Raj
             and Niu, Dantong and Pai, Anirudh and Thomason, Jesse
             and Darrell, Trevor and Anwar, Abrar and Ramanan, Deva
             and Herzig, Roei},
  journal = {arXiv preprint arXiv:2511.22697},
  year    = {2025}
}