VLK — Learning Humanoid Loco-Manipulation from Synthetic Interactions in Reconstructed Scenes


Yen-Jen Wang^*,1,2, Jiaman Li^*,‡,1, Sirui Chen^§,1,3, Takara E. Truong^§,1,3, Pei Xu^§,1, Pieter Abbeel^†,1,2, Rocky Duan^†,1, Koushil Sreenath^†,1,2, Angjoo Kanazawa^†,1,2, Carmelo Sferrazza^†,1, Guanya Shi^†,1,4, C. Karen Liu^†,1,3

¹Amazon FAR · ²UC Berkeley · ³Stanford University · ⁴Carnegie Mellon University

^*Co-first authors. ^‡Project lead. ^§Equal contribution. ^†Amazon FAR Team Co-Lead.

Abstract

Perception-based humanoid loco-manipulation requires connecting egocentric observations and task instructions to whole-body motion. Learning this mapping requires synchronized egocentric images, language commands, and robot-compatible kinematic trajectories, yet no existing data source provides this complete tuple at scale.

We address this bottleneck by generating vision-language-kinematics (VLK) supervision synthetically in reconstructed scenes. Our pipeline leverages 3D Gaussian Splatting to reconstruct metric-scale indoor environments, synthesizes navigation and object-interaction trajectories using privileged scene information, and renders paired egocentric observations after the fact. We produce 48,000 paired trajectories with no human intervention and train a VLK policy that predicts short-horizon whole-body kinematic trajectories. A whole-body tracker converts these predictions into actions on the physical humanoid.

We evaluate on the physical Unitree G1 performing navigation and single-object transport, demonstrating that synthesized interactions in reconstructed scenes provide effective supervision for sim-to-real perception-based humanoid loco-manipulation.

Real-world highlights

VLK on the real Unitree G1 — trained on purely synthetic data, deployed autonomously, zero-shot.

48k paired trajectories per scene · 5 real-world task families · 0 real-world fine-tuning

Real world in. Real world out. Everything in between is synthesized.

Stage 01 · Synthetic Data Generation

Generate paired data inside a reconstructed scene.

1. Scene Reconstruction. iPhone → 3D Gaussian Splatting at metric scale.
2. Waypoint Generation. Sample feasible walk-to / pick-from / place-onto spots from scene geometry.
3. Motion Synthesis. Conditional diffusion produces G1 whole-body motions following the waypoints.
4. Egocentric Rendering. Replay motions in the 3DGS scene; randomize lighting, color, and camera for transfer.
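
The randomization in step 4 can be pictured as drawing one set of rendering perturbations per replayed trajectory. The sketch below is illustrative only: the class `RenderParams`, its field names, and every numeric range are assumptions made for this example, not values taken from the pipeline.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderParams:
    """Hypothetical per-trajectory rendering perturbations."""
    brightness: float      # multiplicative lighting gain
    hue_shift_deg: float   # color shift applied to the rendered image
    fov_deg: float         # camera field of view
    cam_pitch_deg: float   # offset added to the camera pitch


def sample_render_params(rng: random.Random) -> RenderParams:
    # All ranges below are placeholders chosen for illustration.
    return RenderParams(
        brightness=rng.uniform(0.6, 1.4),
        hue_shift_deg=rng.uniform(-15.0, 15.0),
        fov_deg=rng.uniform(80.0, 100.0),
        cam_pitch_deg=rng.uniform(-5.0, 5.0),
    )


rng = random.Random(0)  # seeded so the same perturbations can be regenerated
params = [sample_render_params(rng) for _ in range(3)]
```

Seeding the generator per trajectory is one way to keep a rendered frame reproducible from its trajectory ID alone.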

Stage 02 · VLK Policy

Train a vision-language-kinematics policy on the synthetic data.

1. Inputs. Egocentric image + language instruction + current G1 state (incl. wrist contact).
2. Backbone. Initialize from pretrained π-0.5 VLA; fine-tune on synthetic data only.
3. Output. 1-second future whole-body kinematic trajectory at 30 Hz.
4. Tracker. We use SceneBot, a sim-trained whole-body tracker, to convert kinematic predictions into joint commands at 50 Hz.
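
The policy emits frames at 30 Hz while the tracker issues commands at 50 Hz, so the two rates have to be bridged somewhere. The page does not say how this is done; the sketch below assumes plain linear interpolation of per-frame joint values, and the function name `resample` and the frame layout (a list of float lists, with the first frame at t = 0) are hypothetical.

```python
from typing import List, Sequence

POLICY_HZ = 30   # from the text: 1-second horizon at 30 Hz
TRACKER_HZ = 50  # from the text: joint commands at 50 Hz


def resample(traj: Sequence[Sequence[float]],
             src_hz: float, dst_hz: float) -> List[List[float]]:
    """Linearly interpolate frames sampled at src_hz onto a dst_hz time grid."""
    if len(traj) < 2:
        return [list(frame) for frame in traj]
    duration = (len(traj) - 1) / src_hz           # time spanned by the frames
    n_out = int(duration * dst_hz + 1e-9) + 1     # output samples within that span
    out = []
    for k in range(n_out):
        s = k / dst_hz * src_hz                   # position in source-frame units
        i = min(int(s), len(traj) - 2)            # left neighbour index
        a = s - i                                 # blend weight toward frame i + 1
        out.append([(1 - a) * x0 + a * x1
                    for x0, x1 in zip(traj[i], traj[i + 1])])
    return out


# Toy trajectory: one "joint" whose value equals the frame timestamp.
traj_30 = [[t / POLICY_HZ] for t in range(30)]
traj_50 = resample(traj_30, POLICY_HZ, TRACKER_HZ)
```

Linear blending is only reasonable for joint angles and positions; root orientation would need a rotation-aware interpolation such as slerp.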

After training, we deploy the model in simulation and on the real Unitree G1.

Data modes.

Six trajectory modes are synthesized in both an Apartment and a Lab 3DGS replica.

Walk to. Target object navigation.

Turn around. Left / right pivots in place.

Pick (floor). Grasp from ground plane.

Place (floor). Deposit onto ground plane.

Pick (surface). Grasp from elevated surface.

Place (surface). Deposit onto elevated surface.
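
One way to organize the six modes when pairing trajectories with language is an enumeration plus an instruction template per mode. This is a sketch under assumptions: the mode names come from the list above, but the template wording, the slot names (`target`, `direction`, `surface`), and the helper `make_instruction` are hypothetical and not taken from the dataset.

```python
from enum import Enum


class Mode(Enum):
    WALK_TO = "walk to"
    TURN_AROUND = "turn around"
    PICK_FLOOR = "pick (floor)"
    PLACE_FLOOR = "place (floor)"
    PICK_SURFACE = "pick (surface)"
    PLACE_SURFACE = "place (surface)"


# Hypothetical instruction templates; the real phrasing is not given in the text.
TEMPLATES = {
    Mode.WALK_TO: "walk to the {target}",
    Mode.TURN_AROUND: "turn around to the {direction}",
    Mode.PICK_FLOOR: "pick up the {target} from the floor",
    Mode.PLACE_FLOOR: "place the {target} on the floor",
    Mode.PICK_SURFACE: "pick up the {target} from the {surface}",
    Mode.PLACE_SURFACE: "place the {target} on the {surface}",
}


def make_instruction(mode: Mode, **slots: str) -> str:
    """Fill the template for `mode`; raises KeyError if a slot is missing."""
    return TEMPLATES[mode].format(**slots)


example = make_instruction(Mode.PICK_SURFACE, target="box", surface="desk")
```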

Deployment in simulation.

Separate VLK policies are trained for the Lab and Apartment scenes; each is evaluated inside its own 3DGS replica under streamed multi-step language instructions.

Multi-step language streaming inside the Lab 3DGS replica.

A separately-trained VLK policy executing in the Apartment 3DGS replica.

Deployment on the real Unitree G1.

We evaluate the VLK policy on five real-world task families with no real-world fine-tuning.

Task 01 · Navigation

From random initial poses, the policy walks to the language-specified target across unseen layouts.

4-grid: walk to black chair · stool · white chair · wooden desk.

Random pose / unseen instruction. Both zero-shot.

Task 02 · Box Lifting

Small, medium, and large boxes are picked from the floor and placed back, with no per-size tuning.

Three boxes, picked & placed in sync.

Task 03 · Multi-Stage Tasks

Walk, pick, carry, and place are chained at runtime via streamed language instructions.

Three multi-stage chains, different starting conditions.
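
Chaining by streamed language can be read as a control loop in which the policy is always conditioned on the most recent instruction. The sketch below shows only that idea: `run_streamed`, the tick-based stream, and the stand-in `policy_step` callable are assumptions for illustration, and the page does not describe how or when instructions are actually issued.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def run_streamed(
    stream: Iterable[Optional[str]],
    policy_step: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Call the policy once per tick with the latest instruction seen so far.

    `stream` yields a new instruction string, or None when nothing changed.
    """
    current: Optional[str] = None
    log: List[Tuple[str, str]] = []
    for update in stream:
        if update is not None:
            current = update          # a newer instruction replaces the old one
        if current is None:
            continue                  # idle until the first instruction arrives
        log.append((current, policy_step(current)))
    return log


# Toy stream: three instructions arriving at different ticks.
stream = [
    "walk to the box", None, None,
    "pick up the box", None,
    "place the box on the desk",
]
log = run_streamed(stream, policy_step=lambda text: f"trajectory for '{text}'")
```

Because the instruction is re-read on every tick, a stage switch needs no separate state machine in this sketch; the new text simply takes effect on the next prediction.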

Task 04 · Robustness to Disturbance

We test the policy under mid-action scene-layout changes and under flickering disco-light visual perturbations.

Layout altered while the robot is acting — dynamic scene change.

Flickering disco-light visual perturbation.

Task 05 · Long-Horizon Tasks

Navigation and pick-and-place are chained under streamed language for minutes-long trials.

Long-horizon trial 1.

Long-horizon trial 2.

@misc{wang2026vlk,
  title         = {VLK: Learning Humanoid Loco-Manipulation from Synthetic Interactions in Reconstructed Scenes},
  author        = {Wang, Yen-Jen and Li, Jiaman and Chen, Sirui and Truong, Takara E. and Xu, Pei and Abbeel, Pieter and Duan, Rocky and Sreenath, Koushil and Kanazawa, Angjoo and Sferrazza, Carmelo and Shi, Guanya and Liu, C. Karen},
  year          = {2026},
  eprint        = {2606.30645},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2606.30645},
}