Mechanistic Finetuning of Vision-Language-Action Models
via Few-Shot Demonstrations

Robotic Steering

Chancharik Mitra Carnegie Mellon University

Yusen Luo University of Southern California

Raj Saravanan UC Berkeley

Dantong Niu UC Berkeley

Anirudh Pai UC Berkeley

Jesse Thomason University of Southern California

Trevor Darrell UC Berkeley

Abrar Anwar University of Southern California

Deva Ramanan Carnegie Mellon University

Roei Herzig UC Berkeley, MIT-IBM Watson AI Lab

Paper Code

📢 Code coming early 2026 — follow on X for updates!

Overview

Introduction Video

Abstract

Summary

Vision-Language Action (VLAs) models promise to extend the remarkable success of vision-language models (VLMs) to robotics. Yet, unlike VLMs in the vision-language domain, VLAs for robotics require finetuning to contend with varying physical factors like robot embodiment, environment characteristics, and spatial relationships of each task. Existing fine-tuning methods lack specificity, adapting the same set of parameters regardless of a task's visual, linguistic, and physical characteristics.

Inspired by functional specificity in neuroscience, we hypothesize that it is more effective to finetune sparse model representations specific to a given task. In this work, we introduce Robotic Steering, a finetuning approach grounded in mechanistic interpretability that leverages few-shot demonstrations to identify and selectively finetune task-specific attention heads aligned with the physical, visual, and linguistic requirements of robotic tasks.

Through comprehensive on-robot evaluations with a Franka Emika robot arm, we demonstrate that Robotic Steering outperforms LoRA while achieving superior robustness under task variation, reduced computational cost, and enhanced interpretability for adapting VLAs to diverse robotic tasks.

Experiments

Video Demonstrations

Few-Shot Adaptation Performance

Zero-Shot

No finetuning

LoRA

Standard adaptation

Robotic Steering

Task-specific heads

Few-shot adaptation comparison: Given only a handful of demonstrations, Robotic Steering identifies and finetunes task-relevant attention heads, achieving superior performance compared to zero-shot inference and standard LoRA adaptation. Note the improved precision in object manipulation and grasp stability.

Performance on Unseen Task

LoRA

On unseen task

Robotic Steering

On unseen task

Transfer to unseen tasks: By preserving general pretraining knowledge through selective finetuning, Robotic Steering maintains strong performance on novel tasks not seen during adaptation. Standard LoRA suffers from catastrophic forgetting, degrading performance on tasks outside the finetuning distribution.

Robustness to Environmental Variations

Lighting Variation

Robotic Steering

Form Variation

Robotic Steering

Distractor Objects

Robotic Steering

Environmental robustness: Robotic Steering demonstrates robust performance across environmental variations including lighting changes, object form differences, and distractor objects. The selective attention head finetuning preserves the model's ability to focus on task-relevant features.

Method

Our Approach

Robotic Steering

Our method leverages few-shot demonstrations to identify task-specific attention heads, then selectively finetunes only those heads rather than broadly updating all parameters.

Head Identification: Use k-NN regression on few-shot demonstrations to identify attention heads most aligned with task-specific physical, visual, and linguistic requirements.

Selective LoRA: Apply low-rank adaptation only to the identified task-specific heads, preserving general pretraining knowledge in other model components.

Standard Inference: Deploy the adapted model with enhanced task performance while maintaining robustness on unseen variations.

Evaluation

Results

LoRA

Robotic Steering (Ours)

Task Performance Comparison

Success rates across manipulation tasks comparing LoRA and Robotic Steering approaches.

22×

Fewer Parameters

21%

Faster Training

Training Time (min) Params (M)

LoRA

Robotic Steering (Ours)

Computational Efficiency

Training time vs trainable parameters. Lower-left is better.

Unseen Task (Pick Mug)

LoRA

42.5%

Ours

65%

Lighting Variation

LoRA

25%

Ours

47.5%

Form Variation

LoRA

22.5%

Ours

37.5%

Distractor Objects

LoRA

30%

Ours

40%

Generalization & Robustness

Performance on unseen tasks and under environmental variations.

+12%

Avg. Task Improvement

22×

Parameter Reduction

+53%

Unseen Task Gain

21%

Training Time Saved

Reference

Citation

@article{roboticsteering2025,
  title   = {Mechanistic Finetuning of Vision-Language-Action Models 
             via Few-Shot Demonstrations},
  author  = {Mitra, Chancharik and Luo, Yusen and Saravanan, Raj
             and Niu, Dantong and Pai, Anirudh and Thomason, Jesse
             and Darrell, Trevor and Anwar, Abrar and Ramanan, Deva
             and Herzig, Roei},
  journal = {arXiv preprint arXiv:2511.22697},
  year    = {2025}
}

Mechanistic Finetuning of Vision-Language-Action Models via Few-Shot Demonstrations

Introduction Video

Summary

Video Demonstrations

Few-Shot Adaptation Performance

Performance on Unseen Task

Robustness to Environmental Variations

Our Approach

Robotic Steering

Results

Task Performance Comparison

Computational Efficiency

Generalization & Robustness

Citation

Mechanistic Finetuning of Vision-Language-Action Models
via Few-Shot Demonstrations