Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

Anonymous Submission

Friendly Reminder: If the videos are loading slowly, you can download this page from our supplementary materials and view it locally by double-clicking the "index.html" file.



RoboMaster synthesizes realistic robotic manipulation videos given an initial frame, a prompt, a user-defined object mask, and a collaborative trajectory describing the motion of both the robotic arm and the manipulated object across decomposed interaction phases. It supports diverse manipulation skills and generalizes to in-the-wild scenarios.
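To make these inputs concrete, here is a hypothetical usage sketch; the model handle, the generate() signature, and the CollaborativeTrajectory container are assumptions for illustration, not the released API.

```python
# Hypothetical usage sketch: the model handle, generate() signature, and
# CollaborativeTrajectory container are assumptions for illustration,
# not the released RoboMaster API.
from dataclasses import dataclass

import numpy as np


@dataclass
class CollaborativeTrajectory:
    """One trajectory covering the whole manipulation, plus the frame
    indices that split it into decomposed interaction phases."""
    points: np.ndarray      # (T, 2) per-frame 2D control points
    interaction_start: int  # first frame of joint arm-object motion
    interaction_end: int    # last frame of joint arm-object motion


def run_demo(model, first_frame, prompt, object_mask, trajectory):
    # Inputs mirror the description above: an initial frame, a text
    # prompt, a user-defined object mask, and a collaborative trajectory.
    return model.generate(
        image=first_frame,        # (H, W, 3) uint8 initial frame
        prompt=prompt,            # e.g. "pick up the apple"
        object_mask=object_mask,  # (H, W) bool brush mask
        trajectory=trajectory,
    )
```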


Core: Decompose Interaction (Ours) vs. Decompose Objects (Prior Work, e.g., Tora)


Unlike Tora, which decomposes objects and uses separate trajectories to model the motions of the robot arm and the manipulated object, we decompose the interaction phase and unify their joint motion into a single collaborative trajectory with fine-grained object awareness. This integration alleviates the feature-fusion issue in overlapping regions (note the missing apple in Tora's result) and improves visual quality.
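As a minimal sketch of this idea, assuming a trajectory is a (T, 2) array of per-frame control points, the decomposition amounts to splitting one shared trajectory at the interaction boundaries; function and variable names are illustrative, not the paper's code.

```python
# Minimal sketch of interaction-phase decomposition; names and phase
# boundaries are illustrative assumptions, not the paper's code.
import numpy as np


def split_into_phases(trajectory: np.ndarray, start: int, end: int) -> dict:
    """Split one collaborative trajectory of shape (T, 2) into the three
    sub-interaction phases, instead of keeping one trajectory per object."""
    return {
        "pre_interaction": trajectory[:start],  # arm approaches alone
        "interaction": trajectory[start:end],   # arm and object move jointly
        "post_interaction": trajectory[end:],   # arm withdraws alone
    }
```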


Robotic Manipulation on Diverse Out-of-Domain Objects


Robotic Manipulation with Diverse Skills


Long Video Generation in an Auto-Regressive Manner



Response to Reviewer ydrE



Response to Reviewer 1NRP

(1) Annotated Transition Cases



(2) Comparison with TesserAct


Comparison with Baselines (Tora, DragAnything, and IRASim)





Embodied Action Planning



Ablation Study



Method

Given an input image and a prompt, RoboMaster generates the desired robotic manipulation video under the collaborative trajectory design. Specifically, it first encodes the object masks of the robotic arm and the manipulated object (acquired either from 1) Grounded-SAM or 2) a user-defined brush mask) with awareness of appearance and shape, yielding object latents that maintain identity consistency throughout the video. To precisely model the manipulation process, the control trajectory is decomposed into sub-interaction phases: pre-interaction, interaction, and post-interaction, and each phase is associated with its object-specific latent (the robotic arm latent in the pre-/post-interaction phases and the manipulated object latent in the interaction phase). The collaborative trajectory latent is then injected through plug-and-play motion injectors, enabling the model to reason about video dynamics during generation.
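Below is a minimal sketch of this phase-wise latent association, assuming per-phase trajectory features of shape (T_phase, C) and precomputed arm/object identity latents of shape (C,); all names are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of phase-wise latent association, assuming per-phase
# trajectory features of shape (T_phase, C) and identity latents of
# shape (C,); all names are illustrative, not the released code.
import torch


def build_collaborative_latent(
    phase_feats: dict,          # {"pre_interaction": (T0, C), ...}
    arm_latent: torch.Tensor,   # (C,) encodes arm appearance and shape
    obj_latent: torch.Tensor,   # (C,) encodes the manipulated object
) -> torch.Tensor:
    """Attach the object-specific latent to each sub-interaction phase
    (arm latent in pre-/post-interaction, object latent in interaction)
    and concatenate along time into one control signal."""
    owner = {
        "pre_interaction": arm_latent,
        "interaction": obj_latent,
        "post_interaction": arm_latent,
    }
    chunks = []
    for phase in ("pre_interaction", "interaction", "post_interaction"):
        feats = phase_feats[phase]                        # (T_phase, C)
        ident = owner[phase].expand(feats.shape[0], -1)   # (T_phase, C)
        chunks.append(torch.cat([feats, ident], dim=-1))  # (T_phase, 2C)
    # The concatenated latent is what the plug-and-play motion injectors
    # consume inside the video backbone (injection itself not sketched).
    return torch.cat(chunks, dim=0)                       # (T, 2C)
```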