Dear reviewers, to provide context for the rebuttal, please check out the evaluation on RLBench, the updated network architecture figures, the comparison against single-task models, and the real-world evaluation videos, especially scenes 5, 6, and 7 for long-horizon tasks.
M2T2 is a unified transformer model for picking and placing. From a raw 3D point cloud, M2T2 predicts 6-DoF grasps for each object on the table and orientation-aware placements for the object held by the robot.
M2T2 achieves zero-shot Sim2Real transfer for picking and placing unknown objects, outperforming a baseline system consisting of state-of-the-art task-specific methods by 19% in success rate.
M2T2 uses cross-attention between learned embeddings and multi-scale point cloud features to produce per-point contact masks, indicating where to make contact for picking and placing actions. Our general pick-and-place network produces G object-specific grasping masks, one for each graspable object in the scene, and P orientation-specific placement masks, one for each discretized planar rotation. 6-DoF gripper poses are then reconstructed from the contact masks and the point cloud.
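For concreteness, below is a minimal PyTorch sketch of the mask-prediction step: learned query embeddings cross-attend to point cloud features and produce per-point contact mask logits. It simplifies the multi-scale features to a single scale, and the names (`ContactMaskHead`, `num_grasp_queries`, `feat_dim`) are illustrative assumptions rather than the released implementation.

```python
# Sketch: per-point contact masks via cross-attention between learned queries
# and point cloud features (MaskFormer-style decoding); not the actual M2T2 code.
import torch
import torch.nn as nn

class ContactMaskHead(nn.Module):
    def __init__(self, feat_dim=256, num_grasp_queries=32, num_place_rotations=16,
                 num_heads=8, num_layers=4):
        super().__init__()
        # Learned embeddings: one set for object-specific grasp masks,
        # one set for orientation-specific placement masks.
        self.grasp_queries = nn.Embedding(num_grasp_queries, feat_dim)
        self.place_queries = nn.Embedding(num_place_rotations, feat_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mask_mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))

    def forward(self, point_feats):
        # point_feats: (B, N, feat_dim) per-point features from the point cloud backbone.
        B = point_feats.shape[0]
        queries = torch.cat(
            [self.grasp_queries.weight, self.place_queries.weight], dim=0)
        queries = queries.unsqueeze(0).repeat(B, 1, 1)            # (B, G+P, feat_dim)
        # Cross-attention: the queries attend to the per-point features.
        queries = self.decoder(tgt=queries, memory=point_feats)   # (B, G+P, feat_dim)
        # Each query yields a per-point mask logit via a dot product with the
        # point features; sigmoid gives a contact probability per point.
        mask_embed = self.mask_mlp(queries)                       # (B, G+P, feat_dim)
        mask_logits = torch.einsum('bqc,bnc->bqn', mask_embed, point_feats)
        return mask_logits                                        # (B, G+P, N)
```

Each of the G+P predicted masks selects contact points in the cloud, from which the corresponding 6-DoF gripper poses are reconstructed.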
M2T2 can also take other conditioning inputs (e.g., a language goal) to predict task-specific grasps and placements. Below is the architecture of M2T2 trained on RLBench, which is conditioned on language tokens embedded by a pretrained CLIP model.
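As an illustration of the language conditioning, the sketch below encodes the goal with a frozen CLIP text encoder (here via the HuggingFace `CLIPTokenizer`/`CLIPTextModel`, an assumed choice) and appends the projected language tokens to the point features so the mask queries can cross-attend to both geometry and language; the exact wiring in M2T2 may differ.

```python
# Hedged sketch of language conditioning with a frozen CLIP text encoder;
# the conditioning details in the actual model may differ.
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel

class LanguageConditioning(nn.Module):
    def __init__(self, feat_dim=256, clip_name="openai/clip-vit-base-patch32"):
        super().__init__()
        self.tokenizer = CLIPTokenizer.from_pretrained(clip_name)
        self.text_encoder = CLIPTextModel.from_pretrained(clip_name)
        self.text_encoder.requires_grad_(False)    # keep CLIP frozen
        self.proj = nn.Linear(self.text_encoder.config.hidden_size, feat_dim)

    @torch.no_grad()
    def encode(self, goals):
        # goals: list of strings, e.g. ["open the top drawer"]
        tokens = self.tokenizer(goals, padding=True, return_tensors="pt")
        return self.text_encoder(**tokens).last_hidden_state      # (B, L, hidden)

    def forward(self, point_feats, goals):
        # Project language tokens to the model feature dim and concatenate them
        # with the point features so queries attend to both modalities.
        lang_tokens = self.proj(self.encode(goals))                # (B, L, feat_dim)
        return torch.cat([point_feats, lang_tokens], dim=1)        # (B, N+L, feat_dim)
```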
M2T2 can perform various complex tasks with a single model by decomposing each task into a sequence of picks and places. Below are some examples from the evaluation on RLBench. M2T2 achieves success rates of 89.3%, 88.0%, and 86.7% on open drawer, turn tap, and meat off grill, respectively, whereas PerAct, a state-of-the-art multi-task model, achieves 80%, 80%, and 84%.
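The sketch below illustrates how such a decomposition could be executed at inference time; the API names (`predict_grasps`, `predict_placements`, `robot.execute`) are hypothetical placeholders, not the actual interface.

```python
# Hypothetical driver loop: run a long-horizon task as a sequence of
# pick-and-place calls to a single M2T2 model.
def run_task(m2t2, robot, subgoals):
    """subgoals: list of (action, language_goal) pairs, e.g.
    [("pick", "grab the steak"), ("place", "put it on the plate")]."""
    for action, goal in subgoals:
        cloud = robot.capture_point_cloud()               # raw scene point cloud
        if action == "pick":
            pose = m2t2.predict_grasps(cloud, goal)       # best-scoring 6-DoF grasp
        else:
            pose = m2t2.predict_placements(cloud, goal)   # best-scoring placement
        robot.execute(pose)                               # motion planning + execution
```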
We also trained variants of M2T2 that perform only a single task: grasping or placing. Although these task-specialized models outperform the baselines, they still fall short of our multi-task model. This shows that it is important to formulate both picking and placing under the same framework.