Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

Anonymous Submission

Friendly Reminder: If the videos are loading slowly, you can download this page from our supplementary materials and view it locally by double-clicking the "index.html" file.



RoboMaster synthesizes realistic robotic manipulation videos given an initial frame, a prompt, a user-defined object mask, and a collaborative trajectory describing the motion of both the robotic arm and the manipulated object across decomposed interaction phases. It supports diverse manipulation skills and generalizes to in-the-wild scenarios.
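To make these inputs concrete, here is a hypothetical usage sketch; the model handle, the generate() signature, and the CollaborativeTrajectory container are assumptions for illustration, not the released API.

```python
# Hypothetical usage sketch: the model handle, generate() signature, and
# CollaborativeTrajectory container are assumptions for illustration,
# not the released RoboMaster API.
from dataclasses import dataclass

import numpy as np


@dataclass
class CollaborativeTrajectory:
    """One trajectory covering the whole manipulation, plus the frame
    indices that split it into decomposed interaction phases."""
    points: np.ndarray      # (T, 2) per-frame 2D control points
    interaction_start: int  # first frame of joint arm-object motion
    interaction_end: int    # last frame of joint arm-object motion


def run_demo(model, first_frame, prompt, object_mask, trajectory):
    # Inputs mirror the description above: an initial frame, a text
    # prompt, a user-defined object mask, and a collaborative trajectory.
    return model.generate(
        image=first_frame,        # (H, W, 3) uint8 initial frame
        prompt=prompt,            # e.g. "pick up the apple"
        object_mask=object_mask,  # (H, W) bool brush mask
        trajectory=trajectory,
    )
```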


Core: Decompose Interaction (Ours) vs. Decompose Objects (Prior Work, e.g., Tora)


Unlike Tora, which decomposes objects and uses separate trajectories to model the motions of the robot arm and the manipulated object, we decompose the interaction phase and unify their joint motion into a single collaborative trajectory with fine-grained object awareness. This integration alleviates the feature-fusion issue in overlapping regions (note the missing apple in Tora's result) and improves visual quality.
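As a minimal sketch of this idea, assuming a trajectory is a (T, 2) array of per-frame control points, the decomposition amounts to splitting one shared trajectory at the interaction boundaries; function and variable names are illustrative, not the paper's code.

```python
# Minimal sketch of interaction-phase decomposition; names and phase
# boundaries are illustrative assumptions, not the paper's code.
import numpy as np


def split_into_phases(trajectory: np.ndarray, start: int, end: int) -> dict:
    """Split one collaborative trajectory of shape (T, 2) into the three
    sub-interaction phases, instead of keeping one trajectory per object."""
    return {
        "pre_interaction": trajectory[:start],  # arm approaches alone
        "interaction": trajectory[start:end],   # arm and object move jointly
        "post_interaction": trajectory[end:],   # arm withdraws alone
    }
```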


Robotic Manipulation on Diverse Out-of-Domain Objects


Robotic Manipulation with Diverse Skills


Long Video Generation in an Auto-Regressive Manner



Response to Reviewer ydrE



Response to Reviewer 1NRP

(1) Annotated Transition Cases



(2) Comparison with TesserAct


Comparison with Baselines (Tora, DragAnything, and IRASim)





Embodied Action Planning



Ablation Study



Method

Given an input image and a prompt, RoboMaster generates the desired robotic manipulation video under the collaborative trajectory design. Specifically, it first encodes the object masks of the robotic arm and the manipulated object (acquired either from 1) Grounded-SAM or 2) a user-defined brush mask) with awareness of appearance and shape, yielding object latents that maintain identity consistency throughout the video. To precisely model the manipulation process, the control trajectory is decomposed into sub-interaction phases: pre-interaction, interaction, and post-interaction, and each phase is associated with its object-specific latent (the robotic arm latent in the pre-/post-interaction phases and the manipulated object latent in the interaction phase). The collaborative trajectory latent is then injected through plug-and-play motion injectors, enabling the model to reason about video dynamics during generation.
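Below is a minimal sketch of this phase-wise latent association, assuming per-phase trajectory features of shape (T_phase, C) and precomputed arm/object identity latents of shape (C,); all names are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of phase-wise latent association, assuming per-phase
# trajectory features of shape (T_phase, C) and identity latents of
# shape (C,); all names are illustrative, not the released code.
import torch


def build_collaborative_latent(
    phase_feats: dict,          # {"pre_interaction": (T0, C), ...}
    arm_latent: torch.Tensor,   # (C,) encodes arm appearance and shape
    obj_latent: torch.Tensor,   # (C,) encodes the manipulated object
) -> torch.Tensor:
    """Attach the object-specific latent to each sub-interaction phase
    (arm latent in pre-/post-interaction, object latent in interaction)
    and concatenate along time into one control signal."""
    owner = {
        "pre_interaction": arm_latent,
        "interaction": obj_latent,
        "post_interaction": arm_latent,
    }
    chunks = []
    for phase in ("pre_interaction", "interaction", "post_interaction"):
        feats = phase_feats[phase]                        # (T_phase, C)
        ident = owner[phase].expand(feats.shape[0], -1)   # (T_phase, C)
        chunks.append(torch.cat([feats, ident], dim=-1))  # (T_phase, 2C)
    # The concatenated latent is what the plug-and-play motion injectors
    # consume inside the video backbone (injection itself not sketched).
    return torch.cat(chunks, dim=0)                       # (T, 2C)
```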