ERVLA

Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation

Nan Sun1,2, Yuan Zhang2,†,★, Yongkun Yang3, Wentao Zhao1, Peiyan Li2,4
Jun Guo1,2, Wenxuan Song2,5, Pengxiang Ding2,6, Runze Suo2,7,9
Yifei Su2, Xin Xiao8, Xinghang Li2, Huaping Liu1,★
1 Tsinghua University 2 Xiaomi Robotics 3 Peking University 4 CASIA
5 HKUST(GZ) 6 Zhejiang University 7 Fudan University 8 Wuhan University
9 Shanghai Innovation Institute
† Project Leader ★ Corresponding Author

Abstract

Embodied chain-of-thought (CoT) aims to connect reasoning with robot action, yet its effective form and integration remain underexplored. We revisit embodied CoT as reasoning pre-training for action control and show that useful CoT must link task understanding to concrete action guidance: high-level reasoning alone gives limited gains, while action-oriented signals such as movement and end-effector trajectories in 2D image space are more effective. We further identify CoT contamination, where dense but noisy grounding labels, such as unstable bounding boxes or drifting gripper coordinates, overwrite useful representations and harm control. Reasoning dropout mitigates this issue by making explicit CoT optional and encouraging reasoning to be internalized into action-relevant states. Based on these insights, we build the largest embodied CoT corpus to date, with 978,743 trajectories, 226.3M samples, and 2,592.5 hours, and present ERVLA, a reasoning-capable VLA whose reasoning mode can be switched on or off. Unlike knowledge-insulated designs that block action gradients, ERVLA internalizes embodied reasoning as semantic guidance while allowing action losses to shape the VLM backbone, enabling fully end-to-end training. ERVLA achieves state-of-the-art results on LIBERO-Plus and VLABench, especially in out-of-distribution settings, with 86.9% average success on LIBERO-Plus and 53.2% average success on VLABench.

Core Contributions

Contribution 1: A Large-Scale Embodied CoT Corpus

We construct a large-scale embodied chain-of-thought corpus with 978,743 trajectories, 226.3M samples, and 2,592.5 hours of robot manipulation data. The corpus covers goal planning, subtask decomposition, object grounding, gripper position, motion description, single-arm and dual-arm manipulation, and multi-view observations.

Contribution 2: Understanding What Makes Embodied CoT Useful

We systematically analyze different embodied CoT fields and find that high-level reasoning alone provides limited benefit. Action-oriented signals, such as movement descriptions and point trajectories, are more effective because they directly connect semantic understanding to executable robot motion.

Contribution 3: ERVLA: Reasoning as Representation Shaping

We present ERVLA, an embodied reasoning VLA that treats explicit CoT as a training signal to reshape action-relevant VLM representations rather than a mandatory autoregressive prefix at inference. ERVLA couples a VLM with a diffusion action head, using a choice branch for action-level supervision, knowledge-truncated KV conditioning for cleaner semantic memory, and CoT dropout to learn from rich reasoning traces during training while predicting continuous actions without mandatory test-time CoT. This end-to-end design enables reliable scaling with embodied CoT pre-training and achieves state-of-the-art results with 86.9% on LIBERO-Plus and 53.2% on VLABench, with stronger real-world robustness under semantic ambiguity and long-horizon tasks.

Dataset

We transform raw manipulation trajectories into structured embodied reasoning supervision. Instead of using free-form explanations, each trajectory is decomposed into hierarchical and grounded reasoning fields, including goal, visible objects, planning, subtask reasoning, movement description, gripper pose, and point trajectory.

Figure 1: Embodied CoT dataset and reasoning supervision overview.

Structured CoT Example

t = 16 · DROID · Subtask 2 / 9 · Frames 12–71

Instruction

Pick up the clothes from the table and put them in the box.

Understanding

Goal: Pick up the clothes from the table and put them in the box.

Grounding

Visible objects (cam0):

  • box — [553, 173, 830, 534]
  • table — [131, 373, 966, 989]
  • tripod — [21, 192, 168, 453]

Planning

Current subtask: contact and lift first cloth

Subtask reasoning: The robot arm is positioned above the table, approaching the first cloth located to the left of the orange box. No objects have been moved yet. The gripper is open and preparing to make contact with the cloth. The next action is to lower the arm and close the gripper to grasp the cloth.

Full plan (9 subtasks):

  • Approach first cloth
  • contact and lift first cloth
  • Transport first cloth to box
  • Release first cloth into box
  • Retract and reposition after first drop
  • Approach second cloth
  • contact and lift second cloth
  • Transport second cloth to box
  • Release second cloth into box

Acting

Movement: move forward 3 cm, move left 5 cm, move down 3 cm, tilt forward 10 degrees, rotate clockwise 10 degrees, keep gripper open

Gripper pose (cam0): [533, 388]

Point trajectory (10 waypoints, cam0)

[527, 385] → [521, 383] → [513, 382] → [504, 380] → [495, 379] → [484, 379] → [474, 382] → [464, 387] → [454, 394] → [443, 402]

The dataset is designed to make embodied CoT a scalable training signal. High-level fields provide semantic intent, grounded fields align language with visual entities, planning fields expose task progress, and action fields connect reasoning to executable robot motion. The corpus is built from several open VLA datasets, including Bridge, Fractal, DROID, MolmoAct, and AgiBot.
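To make the field hierarchy concrete, the structured example above can be held in a small record type. This is only an illustrative sketch: the class name, field names, the corner-format reading of the bounding boxes, and the `to_text` serialization are assumptions, not the corpus's actual schema. Values are copied from the example, with the subtask reasoning shortened to one sentence.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # assumed corner format: [x_min, y_min, x_max, y_max]
Point = Tuple[int, int]          # assumed [x, y] in the same image coordinates


@dataclass
class EmbodiedCoTSample:
    """One annotated timestep. Field names are hypothetical, not the corpus schema."""
    instruction: str
    goal: str
    visible_objects: Dict[str, Box]
    current_subtask: str
    subtask_reasoning: str
    full_plan: List[str]
    movement: str
    gripper_pose: Point
    point_trajectory: List[Point]

    def to_text(self) -> str:
        """Serialize fields in Understanding, Grounding, Planning, Acting order."""
        objects = "; ".join(f"{name} {list(box)}" for name, box in self.visible_objects.items())
        waypoints = " -> ".join(str(list(p)) for p in self.point_trajectory)
        return "\n".join([
            f"Goal: {self.goal}",
            f"Visible objects: {objects}",
            f"Current subtask: {self.current_subtask}",
            f"Subtask reasoning: {self.subtask_reasoning}",
            f"Movement: {self.movement}",
            f"Gripper pose: {list(self.gripper_pose)}",
            f"Point trajectory: {waypoints}",
        ])


sample = EmbodiedCoTSample(
    instruction="Pick up the clothes from the table and put them in the box.",
    goal="Pick up the clothes from the table and put them in the box.",
    visible_objects={
        "box": (553, 173, 830, 534),
        "table": (131, 373, 966, 989),
        "tripod": (21, 192, 168, 453),
    },
    current_subtask="contact and lift first cloth",
    # One sentence of the original reasoning, kept short for the sketch.
    subtask_reasoning="The gripper is open and preparing to make contact with the cloth.",
    full_plan=[
        "Approach first cloth",
        "contact and lift first cloth",
        "Transport first cloth to box",
        "Release first cloth into box",
        "Retract and reposition after first drop",
        "Approach second cloth",
        "contact and lift second cloth",
        "Transport second cloth to box",
        "Release second cloth into box",
    ],
    movement=("move forward 3 cm, move left 5 cm, move down 3 cm, tilt forward 10 degrees, "
              "rotate clockwise 10 degrees, keep gripper open"),
    gripper_pose=(533, 388),
    point_trajectory=[(527, 385), (521, 383), (513, 382), (504, 380), (495, 379),
                      (484, 379), (474, 382), (464, 387), (454, 394), (443, 402)],
)
print(sample.to_text())
```

Keeping the high-level fields (goal, plan) separate from the action fields (movement, gripper pose, point trajectory) makes it straightforward to include or omit individual fields when studying which ones help control.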

Method

ERVLA uses a VLM backbone for embodied reasoning and a diffusion transformer for continuous action generation. A choice branch injects action-level supervision into the VLM, while knowledge truncation allows the action model to attend to clean semantic memory instead of shortcut control tokens.

Figure 2: ERVLA architecture and training design.
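The reasoning dropout idea mentioned in the abstract and contributions (explicit CoT is optional during training, so the model can act without it at test time) can be sketched at the sample level as follows. This is a minimal illustration under stated assumptions, not ERVLA's implementation: the function name, the `p_drop` parameter, and the `Instruction:` / `Reasoning:` text layout are all hypothetical, and the page does not state the dropout rate or the granularity at which CoT is dropped.

```python
import random
from typing import Tuple


def apply_cot_dropout(instruction: str, cot_text: str, p_drop: float,
                      rng: random.Random) -> Tuple[str, bool]:
    """Build the text context for one training sample and report whether CoT was kept.

    With probability p_drop the explicit CoT is omitted, so the action loss for
    that sample has to be satisfied without an explicit reasoning prefix.
    """
    if not 0.0 <= p_drop <= 1.0:
        raise ValueError("p_drop must lie in [0, 1]")
    keep_cot = rng.random() >= p_drop  # rng.random() is in [0, 1)
    if keep_cot:
        return f"Instruction: {instruction}\nReasoning: {cot_text}", True
    return f"Instruction: {instruction}", False


rng = random.Random(0)
instruction = "Put the toy car into the drawer."
cot = "Current subtask: approach the toy car."  # hypothetical CoT string
kept = sum(apply_cot_dropout(instruction, cot, 0.5, rng)[1] for _ in range(1000))
print(f"CoT kept in {kept} of 1000 sampled training contexts")
```

Under this sketch, the same policy sees both CoT-conditioned and CoT-free contexts during training, which is one way a reasoning mode could later be switched on or off.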

Experiments

LIBERO-Plus Zero-Shot Transfer

We evaluate ERVLA on LIBERO-Plus, where models are post-trained on LIBERO and evaluated by zero-shot direct transfer under task-suite and perturbation shifts. ERVLA reaches an 86.9% total success rate, outperforming strong VLA baselines including OpenVLA-OFT, π0, π0-FAST, PokeVLA, and π0.5. The gains are especially clear under out-of-distribution conditions such as camera, robot, language, lighting, background, noise, and layout perturbations.

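Success rate (%) ↑. Spatial, Object, Goal, and Long are task suites; Camera through Layout are perturbation types.

| Method | Spatial | Object | Goal | Long | Camera | Robot | Language | Light | Background | Noise | Layout | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ECoT (Zawalski et al., 2024) | 31.8 | 27.9 | 30.6 | 8.6 | 0.3 | 26.8 | 40.2 | 42.6 | 16.4 | 10.2 | 36.9 | 24.3 |
| OpenVLA-OFT (Kim et al., 2024) | 84.0 | 66.5 | 63.0 | 66.4 | 56.4 | 31.9 | 79.5 | 88.7 | 93.3 | 75.8 | 74.2 | 69.6 |
| π0-FAST (Black et al., 2025) | 74.4 | 72.7 | 57.6 | 43.4 | 65.1 | 21.6 | 61.0 | 73.2 | 73.2 | 74.4 | 68.8 | 61.6 |
| PokeVLA (Zhu et al., 2025) | 85.4 | 81.8 | 77.6 | 72.7 | 84.7 | 46.1 | 84.8 | 94.6 | 82.6 | 89.8 | 77.2 | 79.3 |
| π0.5 (Black et al., 2025) | 90.4 | 89.9 | 81.0 | 80.8 | 71.7 | 75.5 | 85.9 | 96.1 | 95.7 | 86.4 | 87.5 | 85.5 |
| No Choice + Knowledge Isolation | 83.8 | 78.6 | 71.4 | 69.0 | 80.4 | 43.8 | 81.0 | 91.6 | 79.2 | 85.4 | 74.6 | 76.5 |
| Choice + No Knowledge Truncation | 89.2 | 88.6 | 79.4 | 79.8 | 70.8 | 73.4 | 84.6 | 94.2 | 94.4 | 85.6 | 86.2 | 84.7 |
| ERVLA (Ours) | 96.2 | 89.6 | 79.6 | 82.1 | 77.2 | 75.3 | 87.1 | 95.1 | 94.7 | 92.3 | 86.4 | 86.9 |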

Table 1: Quantitative comparison on LIBERO-Plus. Models are post-trained on LIBERO and evaluated by zero-shot direct transfer across task suites and perturbation types. ERVLA obtains the best total success rate.

Perturbation examples: Background Textures, Sensor Noise, Object Layout, Camera Viewpoints, Robot Initial States, Light Condition.

VLABench Generalization

VLABench evaluates semantic and distribution-shift generalization across in-distribution, cross-category, commonsense, instruction-following, and texture-variation tracks. ERVLA achieves the best average success rate, progress score, and intention score among compared methods, showing that embodied CoT is most useful when it becomes an action-relevant representation rather than a mandatory autoregressive explanation.

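Success rate by track (%) ↑ for In-dist. through Texture; SR, PS, and IS are averages (↑).

| Method | In-dist. | Category | Common | Instruction | Texture | SR | PS | IS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| π0 (Black et al., 2024) | 47.0 | 21.2 | 29.1 | 17.3 | 32.2 | 29.4 | 44.1 | 55.0 |
| π0-FAST (Black et al., 2025) | 56.2 | 31.0 | 38.0 | 35.0 | 39.0 | 39.8 | 49.5 | 58.6 |
| ACoT-VLA (Zhang et al., 2025) | - | - | - | - | - | 47.4 | 63.5 | - |
| π0.5 (Black et al., 2025) | 65.4 | 38.2 | 43.9 | 48.2 | 44.9 | 48.1 | 62.3 | 64.9 |
| Choice + No Knowledge Truncation | 62.0 | 42.4 | 43.0 | 53.6 | 35.0 | 47.2 | 59.8 | 63.4 |
| ERVLA (Ours) | 69.7 | 47.0 | 44.0 | 58.0 | 47.4 | 53.2 | 65.9 | 70.4 |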

Table 2: Comparison on VLABench. SR, PS, and IS denote success rate, progress score, and intention score. ERVLA obtains the strongest average performance across semantic and distribution-shift tracks.
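The average SR values in Table 2 appear consistent with an unweighted mean of the five per-track success rates. That is an observation from the reported numbers rather than a definition given on this page, and the short check below only illustrates it. The dictionary name and helper function are hypothetical; ACoT-VLA is left out because its per-track values are not reported.

```python
from statistics import mean

# (per-track success rates: In-dist., Category, Common, Instruction, Texture), reported average SR
TABLE2 = {
    "π0": ([47.0, 21.2, 29.1, 17.3, 32.2], 29.4),
    "π0-FAST": ([56.2, 31.0, 38.0, 35.0, 39.0], 39.8),
    "π0.5": ([65.4, 38.2, 43.9, 48.2, 44.9], 48.1),
    "Choice + No Knowledge Truncation": ([62.0, 42.4, 43.0, 53.6, 35.0], 47.2),
    "ERVLA (Ours)": ([69.7, 47.0, 44.0, 58.0, 47.4], 53.2),
}


def average_sr(track_rates):
    """Unweighted mean over tracks, rounded to one decimal like the table."""
    return round(mean(track_rates), 1)


for method, (tracks, reported) in TABLE2.items():
    print(f"{method}: computed {average_sr(tracks)}, reported {reported}")
```

The same check does not apply to the LIBERO-Plus total in Table 1, whose aggregation is not specified here.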

Track 1: In-distribution

Add bbq-sauce to the dish

Insert the rose into the vase_seen.

Track 2: Cross Category

Add hotsauce to the dish

Insert the daisy_flower into the vase_unseen.

Track 3: Commonsense Reasoning

The flavors in this dish are quite mild, it would benefit from something that adds a bold and tangy character.

Find the exquisite flower associated with admiration and place it gracefully into the vase.

Track 4: Semantic Instruction Following

While packing the picnic basket, don't forget to include something for the sandwiches that adds a zesty twist.

It's my mother's birthday today. She loves chrysanthemums. Please take the freshest chrysanthemums you can find and place them in her favorite vase as a surprise.

Track 5: Unseen Texture Variation

Add bbq-sauce to the dish

Insert the rose into the vase_seen.

VLM-to-VLA Transfer and CoT Scaling

We study whether embodied CoT can turn stronger vision-language models into stronger action policies, and whether this supervision scales with more pre-training data. On diverse VLM backbones evaluated under a unified interface, performance without explicit CoT shows only weak or even negative correlation with VLM capability on LIBERO and VLABench. With embodied CoT supervision, stronger VLMs transfer more reliably: the best results are dominated by the Qwen3-VL family, indicating that CoT acts as a transfer interface that converts semantic priors into action-relevant representations rather than merely adding extra text.

Naive autoregressive CoT plus action-token decoding does not scale reliably as more robot CoT data is added. ERVLA instead uses CoT as representation-shaping supervision together with a choice branch, knowledge truncation, and flow-based action learning. As embodied CoT pre-training grows from 35M to 226.3M samples, ERVLA improves steadily on both LIBERO-Plus and VLABench, whereas autoregressive CoT+FAST and isolated VLM+DiT baselines show weaker or saturated scaling.

Figure 3: VLM-to-VLA transfer and embodied CoT scaling. Left: ECoT better aligns VLM capability with action. Right: ERVLA scales steadily with more CoT data on both LIBERO-Plus and VLABench, whereas AR CoT+FAST and isolated VLM+DiT show weaker or saturated scaling.

Real-World Evaluation

We also deploy ERVLA on physical robots using one third-person camera and one wrist camera. The real-world task suite includes placing items into drawers and clearing tabletop waste across four tiers: Basic, Distractors, Semantic, and Long-horizon. The main gap appears under semantic ambiguity, distractors, and long-horizon dependencies, where ERVLA benefits from internalized embodied reasoning and cleaner diffusion-policy conditioning.

Basic

Put the toy car into the drawer.

Distractors

Put the toy into the drawer.

Semantic

Put the toy into the lower drawer.

Long-horizon

Put the non-toy items into the upper drawer, and put the toys into the second drawer.