Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies.
We introduce R&B-EnCoRe (Refine and Bootstrap Embodiment-specific Chain-of-Thought Reasoning), which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating the reasoning trace as a latent variable within importance-weighted variational inference, a model can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation.
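Concretely, treating the reasoning trace r as a latent variable yields an importance-weighted lower bound on the action log-likelihood. A sketch of the standard K-sample objective, with notation assumed for illustration (o the observation, a the action, q a proposal over reasoning traces), not taken verbatim from the paper:

```latex
\log p_\theta(a \mid o)
  \;\ge\; \mathcal{L}_K
  \;=\; \mathbb{E}_{r_{1:K} \sim q_\phi(r \mid o, a)}
        \left[ \log \frac{1}{K} \sum_{k=1}^{K}
          \frac{p_\theta(a, r_k \mid o)}{q_\phi(r_k \mid o, a)} \right]
```

The bound tightens monotonically as K grows, which is the standard property behind refining reasoning samples by their importance weights.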
We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments, using VLA architectures spanning 1B, 4B, 7B, and 30B parameters. Our approach achieves a 28% gain in manipulation success rate, a 101% improvement in navigation score, and a 21% reduction in collision rate over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.
+28%
Manipulation Success Rate
+101%
Legged Navigation Score
−21%
Autonomous Driving Collision Rate
R&B-EnCoRe autonomously discovers distinct, interpretable reasoning distributions tailored to each embodiment — without any external supervision, reward signals, or human annotation. Rather than applying a single "one-size-fits-all" template, the framework identifies which reasoning primitives are genuinely predictive of successful control for each embodiment and task type.
Reasoning primitive distributions across domains. (a) Manipulation: differences between Franka Panda (simulation) and WidowX (hardware) — notably in Visible Objects, Move Explain, and Subtask Explain. Concise Move and Gripper Position primitives dominate, while verbose explanations and distracting details are pruned. (b) Legged navigation: structural affordances are consistently critical across all four embodiments (bipedal, wheeled, bicycle, quadruped); counterfactual reasoning is selectively used. (c) Autonomous driving: reasoning concentrates on mission goals and constraints in the form of notable and collidable objects.
We evaluate R&B-EnCoRe on a WidowX robot arm using the Bridge v2 dataset with a 7B-parameter OpenVLA. Experiments span three task categories: in-distribution, OOD target objects (novel grasp targets), and OOD scenes with distractions (cluttered environments). R&B-EnCoRe's refined reasoning is robust to out-of-distribution shifts under which baselines that reason over all primitives degrade significantly.
Action Forcing: Test-Time Reasoning Suppressed for Low Latency
We prompt each model with an Action Forcing prompt that makes it emit action tokens immediately, skipping intermediate reasoning and thereby reducing per-step latency (reasoning then serves purely as a form of representation learning during training). Under this setting, the model trained with R&B-EnCoRe achieves more robust performance on the OOD tasks than the other reasoning and non-reasoning models, and notably improves over the baseline trained to reason over all primitives.
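The decoding constraint behind action forcing can be sketched as masking the next-token distribution to the action-token vocabulary so no reasoning tokens are ever emitted. Everything here is a hypothetical placeholder (token-ID ranges, scoring), not the actual R&B-EnCoRe implementation:

```python
# Minimal sketch of "action forcing": restrict decoding to action tokens,
# skipping intermediate reasoning tokens entirely. The action-token ID range
# below is a made-up example, not the real tokenizer layout.
ACTION_TOKEN_IDS = set(range(1000, 1256))  # hypothetical action-token vocab slice

def force_action_token(logits: dict) -> int:
    """Pick the highest-scoring token after masking out non-action tokens."""
    allowed = {tok: score for tok, score in logits.items() if tok in ACTION_TOKEN_IDS}
    if not allowed:
        raise ValueError("no action tokens among candidates")
    return max(allowed, key=allowed.get)

# Toy decoding step: reasoning tokens (ids < 1000) score higher but are masked.
step_logits = {42: 3.5, 999: 2.9, 1007: 1.2, 1100: 0.8}
print(force_action_token(step_logits))  # -> 1007
```

In practice this would be implemented as a logits mask inside the model's sampling loop, so the per-step cost of generating a reasoning trace is avoided entirely.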
Put red pepper on black stove
In-distribution task
Put blue peacock in sink
OOD target object — novel grasp target not seen in training
Test-Time Reasoning: Explicit Chain-of-Thought at Inference
When generating full CoT traces at inference, R&B-EnCoRe produces concise, action-predictive reasoning (~3 sec/step) versus >5 sec/step for the all-primitives baseline. Pruning verbose reasoning reduces control lag — preventing grasped objects from slipping during long trajectories — while improving OOD performance.
Put red pepper in yellow basket
OOD scene with distractions — cluttered environment with irrelevant objects
Exhaustive object enumeration introduces noise: a model listing all visible objects almost never produces traces where every listed object is task-relevant (0.03% criticality rate). R&B-EnCoRe autonomously identifies and retains task-salient objects — achieving a >25% object criticality rate and an 80.3% success rate versus 76.1% for the full-list baseline — without any external guidance on which objects matter.
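One simple way to operationalize such an object criticality rate is the fraction of objects mentioned across reasoning traces that belong to the task-relevant set. The function below is a hypothetical illustration of a metric with this shape; the paper's exact definition and counting convention may differ:

```python
def object_criticality_rate(traces, relevant):
    """Fraction of objects listed across CoT traces that are task-relevant.

    traces:   list of per-trace lists of object names mentioned in reasoning
    relevant: set of task-critical object names (ground truth)
    Hypothetical metric shape, not the paper's exact definition.
    """
    listed = [obj for trace in traces for obj in trace]
    if not listed:
        return 0.0
    return sum(obj in relevant for obj in listed) / len(listed)

# Toy example: the first trace lists two distractors (plate, bowl).
traces = [["red pepper", "plate", "bowl"], ["red pepper", "stove"]]
print(object_criticality_rate(traces, {"red pepper", "stove"}))  # -> 0.6
```

Under this definition, a model that enumerates every visible object dilutes the rate with distractors, while refined traces that keep only task-salient objects score near 1.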
Visible Objects in LIBERO-90 across episode steps. The all-objects baseline (bottom) attends to task-irrelevant items such as the plate and bowl throughout execution. R&B-EnCoRe's model (top) generates reasoning focused on task-critical objects most of the time, demonstrating effective self-supervised filtering of distracting perceptual information.
Qualitative comparison of chain-of-thought reasoning traces generated by different models on the WidowX hardware platform with test-time reasoning enabled. R&B-EnCoRe generates concise, action-relevant traces prioritizing Move and Subtask primitives, while the all-primitives baseline produces verbose reasoning that attends to distracting scene elements — especially in cluttered OOD environments.
Reasoning traces on WidowX hardware across episode steps. R&B-EnCoRe produces shorter, task-focused traces with reduced test-time latency (~3 sec/step vs. >5 sec/step), while maintaining higher success rates in cluttered scenes with distracting objects.
We evaluate R&B-EnCoRe on the NaviTrace dataset across four embodiments: bipedal, wheeled robot, bicycle, and quadruped. We use a 30B-parameter Qwen3-VL MoE model to generate and refine its own synthetic reasoning trace data, enabling fully self-supervised bootstrapping without any external annotation. R&B-EnCoRe achieves a 101% improvement in cumulative navigation score over models reasoning with all primitives (39.4 vs. 19.6).
A key insight: the model learns that structural affordances — what actions a region of the environment supports given the robot's kinematics — are the most critical signal for legged navigation, while counterfactual reasoning is selectively reserved for specific decision moments. R&B-EnCoRe also correctly identifies that subjective weather descriptions are largely irrelevant and prunes them aggressively.
Quadruped navigation waypoint trajectories on a holdout task. The robot must follow a trail (with the implicit constraint of avoiding slippery ice). No Reasoning: ignores terrain hazards and traverses ice directly. All Primitives: confounded by irrelevant signals; reduces ice contact slightly but fails to follow the path. Random Primitives: tracks some of the path but traverses ice due to missing affordance reasoning. R&B-EnCoRe: identifies the effective affordance-based strategy — minimal ice contact while maintaining the correct path, matching the ground truth.
We extend R&B-EnCoRe to the nuScenes autonomous driving dataset, finetuning a 4B-parameter Qwen3-VL model to predict ego-vehicle planning trajectories from front-camera images. Reasoning traces are refined from human-crafted annotations originally designed for agent-based planners.
R&B-EnCoRe reduces the collision rate by 21% and improves trajectory accuracy (L2 path error) compared to using all reasoning primitives. The model learns to focus on mission goals, driving plan, and collidable objects — while pruning less informative components like exhaustive scene perception. Performance also scales predictably with the number of posterior samples K, saturating around K=16, validating the theoretical guarantees of importance-weighted variational inference.
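For reference, the L2 path error used in open-loop nuScenes planning evaluations is typically an average pointwise Euclidean distance between predicted and ground-truth waypoints. A minimal sketch, assuming 2D (x, y) waypoints and simple averaging over the horizon (the exact horizon and averaging convention vary between evaluation protocols):

```python
import math

def l2_path_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth (x, y)
    waypoints. Horizon length and averaging convention are assumptions,
    not necessarily the protocol used in the paper."""
    assert len(pred) == len(gt), "trajectories must share a horizon"
    dists = [math.hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists)

# Toy 3-step horizon: small lateral/longitudinal deviations.
pred = [(0.0, 1.0), (0.0, 2.0), (0.1, 3.0)]
gt   = [(0.0, 1.0), (0.0, 2.1), (0.0, 3.0)]
print(round(l2_path_error(pred, gt), 3))  # -> 0.067
```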
Planned trajectories on nuScenes validation scenes. R&B-EnCoRe generates concise, focused reasoning traces that improve trajectory accuracy and reduce collisions compared to no-reasoning and full-reasoning baselines. Reasoning types are color-coded for visualization. The refined traces concentrate on mission goals, constraints, and driving plans — filtering out verbose perception enumerations that distract from accurate trajectory prediction.
@article{GanaiLuoEtAl2026,
title={Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning},
author={Ganai, Milan and Luo, Katie and Frey, Jonas and Barrett, Clark and Pavone, Marco},
journal={arXiv preprint arXiv:2602.08167},
year={2026}
}