ERVLA

Abstract

Embodied chain-of-thought (CoT) aims to connect reasoning with robot action, yet its effective form and integration remain underexplored. We revisit embodied CoT as reasoning pre-training for action control and show that useful CoT must link task understanding to concrete action guidance: high-level reasoning alone gives limited gains, while action-oriented signals such as movement descriptions and end-effector trajectories in 2D image space are more effective. We further identify CoT contamination, where dense but noisy grounding labels, such as unstable bounding boxes or drifting gripper coordinates, overwrite useful representations and harm control. Reasoning dropout mitigates this issue by making explicit CoT optional and encouraging reasoning to be internalized into action-relevant states. Based on these insights, we build the largest embodied CoT corpus to date, with 978,743 trajectories, 226.3M samples, and 2,592.5 hours, and present ERVLA, a reasoning-capable VLA whose reasoning mode can be switched on or off. Unlike knowledge-insulated designs that block action gradients, ERVLA internalizes embodied reasoning as semantic guidance while allowing action losses to shape the VLM backbone, enabling fully end-to-end training. ERVLA achieves state-of-the-art results on LIBERO-Plus and VLABench, especially in out-of-distribution settings, with 86.9% average success on LIBERO-Plus and 53.2% average success on VLABench.

Contribution 1 A Large-Scale Embodied CoT Corpus

We construct a large-scale embodied chain-of-thought corpus with 978,743 trajectories, 226.3M samples, and 2,592.5 hours of robot manipulation data. The corpus covers goal planning, subtask decomposition, object grounding, gripper position, motion description, single-arm and dual-arm manipulation, and multi-view observations.

Contribution 2 Understanding What Makes Embodied CoT Useful

We systematically analyze different embodied CoT fields and find that high-level reasoning alone provides limited benefit. Action-oriented signals, such as movement descriptions and point trajectories, are more effective because they directly connect semantic understanding to executable robot motion.

Contribution 3 ERVLA: Reasoning as Representation Shaping

We present ERVLA, an embodied reasoning VLA that treats explicit CoT as a training signal to reshape action-relevant VLM representations rather than a mandatory autoregressive prefix at inference. ERVLA couples a VLM with a diffusion action head, using a choice branch for action-level supervision, knowledge-truncated KV conditioning for cleaner semantic memory, and CoT dropout to learn from rich reasoning traces during training while predicting continuous actions without mandatory test-time CoT. This end-to-end design enables reliable scaling with embodied CoT pre-training and achieves state-of-the-art results with 86.9% on LIBERO-Plus and 53.2% on VLABench, with stronger real-world robustness under semantic ambiguity and long-horizon tasks.
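As a rough illustration of CoT dropout, the sketch below randomly removes the explicit reasoning target from a training sample, so that action supervision is always present while CoT supervision is optional. The class, the function names, and the dropout rate are assumptions made for illustration; they are not taken from the ERVLA implementation.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class TrainingSample:
    instruction: str
    cot: Optional[str]           # structured reasoning text; None once dropped
    actions: List[List[float]]   # continuous action chunk for the action head


def apply_cot_dropout(sample: TrainingSample, p_drop: float,
                      rng: random.Random) -> TrainingSample:
    """Drop the explicit CoT target with probability p_drop (hypothetical rate)."""
    if rng.random() < p_drop:
        return replace(sample, cot=None)
    return sample


def active_losses(sample: TrainingSample) -> List[str]:
    """Action supervision is always on; CoT supervision only when CoT is kept."""
    return ["action", "cot"] if sample.cot is not None else ["action"]


rng = random.Random(0)
sample = TrainingSample(
    instruction="Pick up the clothes from the table and put them in the box.",
    cot="Current subtask: contact and lift first cloth",
    actions=[[0.0, 0.0, 0.0]],   # placeholder action chunk
)
kept = apply_cot_dropout(sample, p_drop=0.0, rng=rng)
dropped = apply_cot_dropout(sample, p_drop=1.0, rng=rng)
print(active_losses(kept), active_losses(dropped))
```

Because some samples carry no reasoning target, a model trained this way has to predict actions without relying on a CoT prefix, which is the behavior the switchable reasoning mode depends on.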

We transform raw manipulation trajectories into structured embodied reasoning supervision. Instead of using free-form explanations, each trajectory is decomposed into hierarchical and grounded reasoning fields, including goal, visible objects, planning, subtask reasoning, movement description, gripper pose, and point trajectory.

Figure 1: Embodied CoT dataset and reasoning supervision overview.

Structured CoT Example

t = 16, DROID, Subtask 2 / 9, Frames 12–71

Instruction

Pick up the clothes from the table and put them in the box.

Understanding

Goal: Pick up the clothes from the table and put them in the box.

Grounding

Visible objects (cam0):

box — [553, 173, 830, 534]
table — [131, 373, 966, 989]
tripod — [21, 192, 168, 453]

Planning

Current subtask: contact and lift first cloth

Subtask reasoning: The robot arm is positioned above the table, approaching the first cloth located to the left of the orange box. No objects have been moved yet. The gripper is open and preparing to make contact with the cloth. The next action is to lower the arm and close the gripper to grasp the cloth.

Full plan (9 subtasks)

Approach first cloth
contact and lift first cloth
Transport first cloth to box
Release first cloth into box
Retract and reposition after first drop
Approach second cloth
contact and lift second cloth
Transport second cloth to box
Release second cloth into box

Acting

Movement: move forward 3 cm, move left 5 cm, move down 3 cm, tilt forward 10 degrees, rotate clockwise 10 degrees, keep gripper open

Gripper pose (cam0): [533, 388]

Point trajectory (10 waypoints, cam0)

[527, 385] → [521, 383] → [513, 382] → [504, 380] → [495, 379] → [484, 379] → [474, 382] → [464, 387] → [454, 394] → [443, 402]
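The Movement field above reads like a discretized end-effector delta. A minimal sketch of how such a phrase could be generated from a translation delta is given below. The axis convention (+x forward, +y left, +z up), the centimeter rounding, and the function name are assumptions for illustration, and the rotation terms (tilt, rotate) are left out for brevity.

```python
def describe_movement(dx_cm: float, dy_cm: float, dz_cm: float,
                      gripper_open: bool) -> str:
    """Turn a translation delta in centimeters into a movement phrase.

    Assumed convention: +x is forward, +y is left, +z is up.
    """
    parts = []
    for value, positive, negative in (
        (dx_cm, "forward", "backward"),
        (dy_cm, "left", "right"),
        (dz_cm, "up", "down"),
    ):
        magnitude = round(abs(value))
        if magnitude == 0:
            continue  # skip axes with negligible motion
        direction = positive if value > 0 else negative
        parts.append(f"move {direction} {magnitude} cm")
    parts.append("keep gripper open" if gripper_open else "close gripper")
    return ", ".join(parts)


print(describe_movement(3, 5, -3, gripper_open=True))
```

With the translation values from the example, this reproduces the translational part of the Movement string.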

The dataset is designed to make embodied CoT a scalable training signal. High-level fields provide semantic intent, grounded fields align language with visual entities, planning fields expose task progress, and action fields connect reasoning to executable robot motion. The corpus is built from several open VLA datasets, including Bridge, Fractal, DROID, MolmoAct, and AgiBot.
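One way to picture a single annotated sample is as a typed record with one attribute per reasoning field. The sketch below is hypothetical: the class name, attribute names, and text serialization are assumptions, and only the values are copied from the structured example above. Boxes are stored as the four integers printed in the example, without assuming a particular coordinate convention.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]
Point = Tuple[int, int]


@dataclass
class EmbodiedCoT:
    goal: str                        # understanding
    visible_objects: Dict[str, Box]  # grounding
    plan: List[str]                  # planning
    current_subtask: str
    movement: str                    # acting
    gripper_pose: Point
    point_trajectory: List[Point]

    def to_text(self) -> str:
        """Serialize the fields into one supervision string (format is assumed)."""
        objects = "; ".join(f"{name} {list(box)}"
                            for name, box in self.visible_objects.items())
        waypoints = " -> ".join(str(list(p)) for p in self.point_trajectory)
        return "\n".join([
            f"Goal: {self.goal}",
            f"Visible objects: {objects}",
            f"Current subtask: {self.current_subtask}",
            f"Movement: {self.movement}",
            f"Gripper pose: {list(self.gripper_pose)}",
            f"Point trajectory: {waypoints}",
        ])


sample = EmbodiedCoT(
    goal="Pick up the clothes from the table and put them in the box.",
    visible_objects={"box": (553, 173, 830, 534),
                     "table": (131, 373, 966, 989),
                     "tripod": (21, 192, 168, 453)},
    plan=["Approach first cloth", "contact and lift first cloth",
          "Transport first cloth to box", "Release first cloth into box",
          "Retract and reposition after first drop", "Approach second cloth",
          "contact and lift second cloth", "Transport second cloth to box",
          "Release second cloth into box"],
    current_subtask="contact and lift first cloth",
    movement=("move forward 3 cm, move left 5 cm, move down 3 cm, "
              "tilt forward 10 degrees, rotate clockwise 10 degrees, "
              "keep gripper open"),
    gripper_pose=(533, 388),
    point_trajectory=[(527, 385), (521, 383), (513, 382), (504, 380), (495, 379),
                      (484, 379), (474, 382), (464, 387), (454, 394), (443, 402)],
)
print(sample.to_text())
```

Keeping each field separate makes it straightforward to include or omit individual fields when comparing which parts of the CoT help control.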

ERVLA uses a VLM backbone for embodied reasoning and a diffusion transformer for continuous action generation. A choice branch injects action-level supervision into the VLM, while knowledge truncation allows the action model to attend to clean semantic memory instead of shortcut control tokens.
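One possible reading of knowledge truncation is that the action model receives only the prefix of the VLM key-value memory that precedes the control tokens. The sketch below illustrates that reading on plain Python lists. The segment names, the choice of which segments count as semantic, and the prefix-cut rule are all assumptions; the actual ERVLA mechanism may differ.

```python
from typing import List, Sequence, Tuple

# Assumed labels for token segments that count as "semantic memory".
SEMANTIC = {"image", "instruction", "cot"}


def truncate_kv(keys: Sequence, values: Sequence,
                segments: Sequence[str]) -> Tuple[List, List]:
    """Keep the KV prefix that precedes the first non-semantic token."""
    cut = len(segments)
    for i, segment in enumerate(segments):
        if segment not in SEMANTIC:
            cut = i
            break
    return list(keys[:cut]), list(values[:cut])


# Toy sequence: the trailing "choice" tokens stand in for control tokens.
segments = ["image", "image", "instruction", "cot", "cot", "choice", "choice"]
keys = [f"k{i}" for i in range(len(segments))]
values = [f"v{i}" for i in range(len(segments))]

k_mem, v_mem = truncate_kv(keys, values, segments)
print(k_mem)
```

Under this reading, the action head conditions on image, instruction, and reasoning tokens only, so it cannot copy its output from tokens that already encode the action choice.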

Figure 2: ERVLA architecture and training design.

LIBERO-Plus Zero-Shot Transfer

We evaluate ERVLA on LIBERO-Plus, where models are post-trained on LIBERO and evaluated by zero-shot direct transfer under task-suite and perturbation shifts. ERVLA reaches an 86.9% total success rate, outperforming strong VLA baselines including OpenVLA-OFT, π₀, π₀-FAST, PokeVLA, and π_0.5. The gains are especially clear under out-of-distribution conditions such as camera, robot, language, lighting, background, noise, and layout perturbations.

Method	Task Suite				Perturbation Type							Total Succ. (%) ↑
Method	Spatial	Object	Goal	Long	Camera	Robot	Language	Light	Background	Noise	Layout	Total Succ. (%) ↑
ECoT (Zawalski et al., 2024)	31.8	27.9	30.6	8.6	0.3	26.8	40.2	42.6	16.4	10.2	36.9	24.3
OpenVLA-OFT (Kim et al., 2024)	84.0	66.5	63.0	66.4	56.4	31.9	79.5	88.7	93.3	75.8	74.2	69.6
π₀-FAST (Black et al., 2025)	74.4	72.7	57.6	43.4	65.1	21.6	61.0	73.2	73.2	74.4	68.8	61.6
PokeVLA (Zhu et al., 2025)	85.4	81.8	77.6	72.7	84.7	46.1	84.8	94.6	82.6	89.8	77.2	79.3
π_0.5 (Black et al., 2025)	90.4	89.9	81.0	80.8	71.7	75.5	85.9	96.1	95.7	86.4	87.5	85.5
No Choice + Knowledge Isolation	83.8	78.6	71.4	69.0	80.4	43.8	81.0	91.6	79.2	85.4	74.6	76.5
Choice + No Knowledge Truncation	89.2	88.6	79.4	79.8	70.8	73.4	84.6	94.2	94.4	85.6	86.2	84.7
ERVLA (Ours)	96.2	89.6	79.6	82.1	77.2	75.3	87.1	95.1	94.7	92.3	86.4	86.9

Table 1: Quantitative comparison on LIBERO-Plus. Models are post-trained on LIBERO and evaluated by zero-shot direct transfer across task suites and perturbation types. ERVLA obtains the best total success rate.
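The two ablation rows in Table 1 can be compared against the full model column by column. The short script below copies the three rows from the table and prints the per-column difference in percentage points; the variable and function names are illustrative only.

```python
COLUMNS = ["Spatial", "Object", "Goal", "Long", "Camera", "Robot",
           "Language", "Light", "Background", "Noise", "Layout", "Total"]

# Rows copied from Table 1 (success rates in percent).
ERVLA = [96.2, 89.6, 79.6, 82.1, 77.2, 75.3, 87.1, 95.1, 94.7, 92.3, 86.4, 86.9]
ABLATIONS = {
    "No Choice + Knowledge Isolation":
        [83.8, 78.6, 71.4, 69.0, 80.4, 43.8, 81.0, 91.6, 79.2, 85.4, 74.6, 76.5],
    "Choice + No Knowledge Truncation":
        [89.2, 88.6, 79.4, 79.8, 70.8, 73.4, 84.6, 94.2, 94.4, 85.6, 86.2, 84.7],
}


def deltas(full, ablated):
    """Per-column difference (full model minus ablation), in points."""
    return {col: round(f - a, 1) for col, f, a in zip(COLUMNS, full, ablated)}


for name, row in ABLATIONS.items():
    print(name, deltas(ERVLA, row))
```

On the Total column, the full model is 10.4 points above the first ablation and 2.2 points above the second. The Camera column is an exception for the first ablation, which scores 3.2 points higher than the full model there.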

Perturbation examples: Background Textures, Sensor Noise, Object Layout, Camera Viewpoints, Robot Initial States, Light Condition.

VLABench Generalization

VLABench evaluates semantic and distribution-shift generalization across in-distribution, cross-category, commonsense, instruction-following, and texture-variation tracks. ERVLA achieves the best average success rate, progress score, and intention score among compared methods, showing that embodied CoT is most useful when it becomes an action-relevant representation rather than a mandatory autoregressive explanation.

Method	Success Rate by Track (%) ↑					Average ↑
Method	In-dist.	Category	Common	Instruction	Texture	SR	PS	IS
π₀ (Black et al., 2024)	47.0	21.2	29.1	17.3	32.2	29.4	44.1	55.0
π₀-FAST (Black et al., 2025)	56.2	31.0	38.0	35.0	39.0	39.8	49.5	58.6
ACoT-VLA (Zhang et al., 2025)	-	-	-	-	-	47.4	63.5	-
π_0.5 (Black et al., 2025)	65.4	38.2	43.9	48.2	44.9	48.1	62.3	64.9
Choice + No Knowledge Truncation	62.0	42.4	43.0	53.6	35.0	47.2	59.8	63.4
ERVLA (Ours)	69.7	47.0	44.0	58.0	47.4	53.2	65.9	70.4

Table 2: Comparison on VLABench. SR, PS, and IS denote success rate, progress score, and intention score. ERVLA obtains the strongest average performance across semantic and distribution-shift tracks.
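The average SR column in Table 2 is consistent with an unweighted mean of the five track scores. The check below recomputes that mean from the rows that report per-track results (ACoT-VLA is omitted because its track scores are not listed); the dictionary layout and function name are illustrative.

```python
# (per-track success rates, reported average SR), copied from Table 2.
ROWS = {
    "π₀": ([47.0, 21.2, 29.1, 17.3, 32.2], 29.4),
    "π₀-FAST": ([56.2, 31.0, 38.0, 35.0, 39.0], 39.8),
    "π_0.5": ([65.4, 38.2, 43.9, 48.2, 44.9], 48.1),
    "Choice + No Knowledge Truncation": ([62.0, 42.4, 43.0, 53.6, 35.0], 47.2),
    "ERVLA (Ours)": ([69.7, 47.0, 44.0, 58.0, 47.4], 53.2),
}


def mean_sr(track_scores):
    """Unweighted mean over the five VLABench tracks."""
    return sum(track_scores) / len(track_scores)


for method, (tracks, reported) in ROWS.items():
    computed = mean_sr(tracks)
    # Reported values carry one decimal, so allow half a unit of rounding.
    matches = abs(computed - reported) <= 0.05 + 1e-9
    print(f"{method}: computed {computed:.2f}, reported {reported}, match={matches}")
```

Each track therefore contributes equally to the headline number, regardless of how many episodes it contains.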

Track 1: In-Distribution

Add bbq-sauce to the dish

Insert the rose into the vase_seen.

Track 2: Cross-Category

Add hotsauce to the dish

Insert the daisy_flower into the vase_unseen.

Track 3: Commonsense Reasoning

The flavors in this dish are quite mild, it would benefit from something that adds a bold and tangy character.

Find the exquisite flower associated with admiration and place it gracefully into the vase.

Track 4: Semantic Instruction Following

While packing the picnic basket, don't forget to include something for the sandwiches that adds a zesty twist.

It's my mother's birthday today. She loves chrysanthemums. Please take the freshest chrysanthemums you can find and place them in her favorite vase as a surprise.

Track 5: Unseen Texture Variation

Add bbq-sauce to the dish

Insert the rose into the vase_seen.

VLM-to-VLA Transfer and CoT Scaling

We study whether embodied CoT can turn stronger vision-language models into stronger action policies, and whether this supervision scales with more pre-training data. On diverse VLM backbones evaluated under a unified interface, performance without explicit CoT shows only weak or even negative correlation with VLM capability on LIBERO and VLABench. With embodied CoT supervision, stronger VLMs transfer more reliably: the best results are dominated by the Qwen3-VL family, indicating that CoT acts as a transfer interface that converts semantic priors into action-relevant representations rather than merely adding extra text.

Naive autoregressive CoT plus action-token decoding does not scale reliably as more robot CoT data is added. ERVLA instead uses CoT as representation-shaping supervision together with a choice branch, knowledge truncation, and flow-based action learning. As embodied CoT pre-training grows from 35M to 226.3M samples, ERVLA improves steadily on both LIBERO-Plus and VLABench, whereas autoregressive CoT+FAST and isolated VLM+DiT baselines show weaker or saturated scaling.

Figure 3: VLM-to-VLA transfer and embodied CoT scaling. Left: embodied CoT supervision better aligns VLM capability with action performance. Right: ERVLA scales steadily with more CoT data on both LIBERO-Plus and VLABench, whereas AR CoT+FAST and isolated VLM+DiT show weaker or saturated scaling.

Real-World Evaluation

We also deploy ERVLA on physical robots using one third-person camera and one wrist camera. The real-world task suite includes placing items into drawers and clearing tabletop waste across four tiers: Basic, Distractors, Semantic, and Long-horizon. The main gap appears under semantic ambiguity, distractors, and long-horizon dependencies, where ERVLA benefits from internalized embodied reasoning and cleaner diffusion-policy conditioning.

Basic

Put the toy car into the drawer.

Distractors

Put the toy into the drawer.

Semantic

Put the toy into the lower drawer.

Long-horizon

Put the non-toy items into the upper drawer, and put the toys into the second drawer.