VLK: Learning Humanoid Loco-Manipulation
from Synthetic Interactions in Reconstructed Scenes

Yen-Jen Wang*,1,2, Jiaman Li*,‡,1, Sirui Chen§,1,3, Takara E. Truong§,1,3, Pei Xu§,1,

Pieter Abbeel†,1,2, Rocky Duan†,1, Koushil Sreenath†,1,2, Angjoo Kanazawa†,1,2,

Carmelo Sferrazza†,1, Guanya Shi†,1,4, C. Karen Liu†,1,3

1Amazon FAR · 2UC Berkeley · 3Stanford University · 4Carnegie Mellon University

*Co-first authors. ‡Project lead. §Equal contribution. †Amazon FAR Team Co-Lead.

abstract

Perception-based humanoid loco-manipulation requires connecting egocentric observations and task instructions to whole-body motion. Learning this mapping requires synchronized egocentric images, language commands, and robot-compatible kinematic trajectories, yet no existing data source provides this complete tuple at scale.

We address this bottleneck by generating vision-language-kinematics (VLK) supervision synthetically in reconstructed scenes. Our pipeline leverages 3D Gaussian Splatting to reconstruct metric-scale indoor environments, synthesizes navigation and object-interaction trajectories using privileged scene information, and renders paired egocentric observations after the fact. We produce 48,000 paired trajectories with no human intervention and train a VLK policy that predicts short-horizon whole-body kinematic trajectories. A whole-body tracker converts these predictions into actions on the physical humanoid.

We evaluate on the physical Unitree G1 performing navigation and single-object transport, demonstrating that synthesized interactions in reconstructed scenes provide effective supervision for sim-to-real perception-based humanoid loco-manipulation.

real-world highlights

VLK on the real Unitree G1 — trained on purely synthetic data, deployed autonomously, zero-shot.

  • 48k paired trajectories per scene
  • 5 real-world task families
  • 0 real-world fine-tuning

method

Real world in. Real world out. Everything in between is synthesized.

Stage 01 · Synthetic Data Generation

Generate paired data inside a reconstructed scene.

  1. Scene Reconstruction. iPhone → 3D Gaussian Splatting at metric scale.
  2. Waypoint Generation. Sample feasible walk-to / pick-from / place-onto spots from scene geometry.
  3. Motion Synthesis. Conditional diffusion produces G1 whole-body motions following the waypoints.
  4. Egocentric Rendering. Replay motions in the 3DGS scene; randomize lighting, color, and camera for transfer.

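The waypoint-generation step can be illustrated with a minimal sketch. It assumes the scene geometry has already been reduced to a 2D floor occupancy grid, which this page does not describe; the function name `sample_walk_waypoints`, the grid representation, and the `clearance` parameter are all hypothetical and stand in for whatever feasibility checks the actual pipeline uses.

```python
import random


def sample_walk_waypoints(occupancy, clearance, n, seed=0):
    """Sample up to n free grid cells that keep a square clearance from obstacles.

    occupancy: list of rows, True marks an occupied cell (hypothetical floor grid).
    clearance: robot footprint radius in cells (assumed value, not from the source).
    """
    rows, cols = len(occupancy), len(occupancy[0])

    def is_clear(r, c):
        # Check every cell in the (2*clearance+1)^2 window around (r, c).
        for dr in range(-clearance, clearance + 1):
            for dc in range(-clearance, clearance + 1):
                rr, cc = r + dr, c + dc
                # Cells outside the grid are treated as occupied (walls).
                if not (0 <= rr < rows and 0 <= cc < cols) or occupancy[rr][cc]:
                    return False
        return True

    feasible = [(r, c) for r in range(rows) for c in range(cols) if is_clear(r, c)]
    rng = random.Random(seed)
    # Sample without replacement; return fewer than n if the scene is tight.
    return rng.sample(feasible, min(n, len(feasible)))
```

On an empty 5×5 grid with `clearance=1`, only the nine interior cells are feasible. Pick-from and place-onto spots would need additional checks (for example, surface height and reachability) that this sketch does not attempt.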
Stage 02 · VLK Policy

Train a vision-language-kinematics policy on the synthetic data.

  1. Inputs. Egocentric image + language instruction + current G1 state (incl. wrist contact).
  2. Backbone. Initialize from pretrained π-0.5 VLA; fine-tune on synthetic data only.
  3. Output. 1-second future whole-body kinematic trajectory at 30 Hz.
  4. Tracker. We use SceneBot, a sim-trained whole-body tracker, to convert kinematic predictions into joint commands at 50 Hz.

After training, we deploy the model in simulation and on the real Unitree G1.
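The policy outputs a 1-second trajectory at 30 Hz (30 frames), while the tracker issues joint commands at 50 Hz, so the two rates have to be bridged somewhere. This page does not say how. One simple possibility, sketched below as an assumption, is to linearly interpolate the predicted joint-position frames onto the tracker's time grid. The function name `resample_trajectory` is hypothetical, and linear interpolation is only appropriate for joint positions, not for orientations or contact flags.

```python
def resample_trajectory(frames, src_hz=30.0, dst_hz=50.0):
    """Linearly interpolate joint-position frames from src_hz to dst_hz.

    frames: list of equal-length lists of joint positions, sampled at src_hz.
    Returns frames sampled at dst_hz over the same time span.
    """
    if len(frames) < 2:
        return [list(f) for f in frames]
    duration = (len(frames) - 1) / src_hz          # time of the last source frame
    n_out = int(duration * dst_hz + 1e-9) + 1      # output samples within that span
    out = []
    for k in range(n_out):
        pos = (k / dst_hz) * src_hz                # fractional source index
        i = min(int(pos), len(frames) - 2)         # left neighbour, clamped
        alpha = pos - i                            # blend weight toward the right neighbour
        a, b = frames[i], frames[i + 1]
        out.append([(1 - alpha) * x + alpha * y for x, y in zip(a, b)])
    return out
```

For a 30-frame prediction, the last source frame sits at 29/30 s, so the resampled trajectory has 49 frames at 50 Hz.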

synthetic data

Data modes.

Six trajectory modes are synthesized in both an Apartment and a Lab 3DGS replica.

01 · Walk to: target object navigation.
02 · Turn around: left / right pivots in place.
03 · Pick (floor): grasp from ground plane.
04 · Place (floor): deposit onto ground plane.
05 · Pick (surface): grasp from elevated surface.
06 · Place (surface): deposit onto elevated surface.

sim deployment

Deployment in simulation.

Separate VLK policies are trained for the Lab and Apartment scenes; each is evaluated inside its own 3DGS replica under streamed multi-step language instructions.
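Streamed instructions can be pictured as a receding-horizon loop: the policy predicts a short trajectory, part of it is executed, and a newly arrived instruction triggers an immediate replan. The sketch below is an illustration under assumptions, not the deployed control loop. The policy signature, the `schedule` dictionary, the scalar state, and the `execute_per_plan` parameter are all hypothetical, and `toy_policy` is a stand-in for the learned model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical policy signature: (instruction, state) -> list of future states.
Policy = Callable[[str, float], List[float]]


def run_with_streamed_instructions(
    policy: Policy,
    schedule: Dict[int, str],
    initial_state: float,
    total_steps: int,
    execute_per_plan: int = 10,
) -> List[Tuple[int, str, float]]:
    """Roll out a policy, replanning whenever a new instruction arrives.

    schedule maps a step index to the instruction that arrives at that step.
    Returns a log of (step, active instruction, state after the step).
    """
    if 0 not in schedule:
        raise ValueError("an instruction must be available at step 0")
    instruction, state, step = schedule[0], initial_state, 0
    log = []
    while step < total_steps:
        plan = policy(instruction, state)
        if not plan:
            raise ValueError("policy returned an empty plan")
        for target in plan[:execute_per_plan]:
            if step >= total_steps:
                break
            # A new instruction interrupts the current chunk and forces a replan.
            if step in schedule and schedule[step] != instruction:
                instruction = schedule[step]
                break
            state = target  # assume perfect tracking of the predicted frame
            log.append((step, instruction, state))
            step += 1
    return log


def toy_policy(instruction: str, state: float) -> List[float]:
    # Stand-in for the learned policy: +1 per frame for "forward", -1 otherwise.
    delta = 1.0 if instruction == "forward" else -1.0
    return [state + delta * (i + 1) for i in range(30)]
```

With `schedule = {0: "forward", 5: "back"}` and eight steps, the toy rollout moves forward for five steps, then replans at step 5 and moves back for three.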

Multi-step language streaming inside the Lab 3DGS replica.
A separately-trained VLK policy executing in the Apartment 3DGS replica.

real deployment

Deployment on the real Unitree G1.

We evaluate the VLK policy on five real-world task families with no real-world fine-tuning.

Task 01 · Navigation

Navigation.

From random initial poses, the policy walks to the language-specified target across unseen layouts.

4-grid: walk to black chair · stool · white chair · wooden desk.
Random pose / unseen instruction. Both zero-shot.

Task 02 · Box Lifting

Box lifting.

Small, medium, and large boxes are picked from the floor and placed back, with no per-size tuning.

Three boxes, picked & placed in sync.

Task 03 · Multi-Stage

Multi-stage tasks.

Walk, pick, carry, and place are chained at runtime via streamed language instructions.

Three multi-stage chains, different starting conditions.

Task 04 · Robustness

Robustness to disturbance.

We test the policy under mid-action scene-layout changes and under flickering disco-light visual perturbations.

Layout altered while the robot is acting — dynamic scene change.
Flickering disco-light visual perturbation.

Task 05 · Long-Horizon

Long-horizon tasks.

Navigation and pick-and-place are chained under streamed language for minutes-long trials.

Long-horizon trial 1.
Long-horizon trial 2.

paper

Preprint

VLK: Learning Humanoid Loco-Manipulation from Synthetic Interactions in Reconstructed Scenes

Wang, Li, Chen, Truong, Xu, Abbeel, Duan, Sreenath, Kanazawa, Sferrazza, Shi, Liu  ·  2026

BibTeX
@misc{wang2026vlk,
  title   = {VLK: Learning Humanoid Loco-Manipulation
             from Synthetic Interactions in
             Reconstructed Scenes},
  author  = {Wang, Yen-Jen and Li, Jiaman and
             Chen, Sirui and Truong, Takara E. and
             Xu, Pei and Abbeel, Pieter and
             Duan, Rocky and Sreenath, Koushil and
             Kanazawa, Angjoo and Sferrazza, Carmelo and
             Shi, Guanya and Liu, C. Karen},
  year          = {2026},
  eprint        = {2606.30645},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2606.30645},
}

acknowledgements

We thank Xiaoyu Huang for preparing a clean retargeted OMOMO dataset, Alejandro Escontrela for guidance on 3DGS training, Jenny Zhang and Hunter Liu for providing realistic box assets for rendering, and Eric Lalumiere for providing the camera mount. We also thank Zhen Wu, Shibo Zhao, and Yunsheng Tian for valuable discussions on motion synthesis and Real2Sim pipelines. We are grateful to Haozhi Qi, Linda Shih, Youjian Huang, and Roberto Ceja for their support with the experimental hardware, and to Shuying Deng, Zihan Wang, and Charlie Cheng for their valuable feedback on the design and visual presentation of the paper figures and video.