M2T2: Multi-Task Masked-Transformer for Object-centric Pick and Place

Dear reviewers, to provide context for the rebuttal, please check out the evaluation on RLBench, the updated network architecture figures, the comparison against single-task models, and the real-world evaluation videos, especially scenes 5, 6, and 7 for long-horizon tasks.

M2T2 is a unified transformer model for picking and placing. From a raw 3D point cloud, M2T2 predicts 6-DoF grasps for each object on the table and orientation-aware placements for the object held by the robot.

Real-world Pick-and-place

M2T2 achieves zero-shot Sim2Real transfer for picking and placing unknown objects, outperforming a baseline system consisting of state-of-the-art task-specific methods by 19% in success rate.

Network Architecture

M2T2 uses cross-attention between learned embeddings and multi-scale point cloud features to produce per-point contact masks, which indicate where to make contact for picking and placing actions. Our general pick-and-place network produces G object-specific grasping masks, one for each graspable object in the scene, and P orientation-specific placement masks, one for each discretized planar rotation. 6-DoF gripper poses are then reconstructed from the contact masks and the point cloud.
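To make the mask outputs concrete, here is a minimal NumPy sketch of how G grasping masks and P placement masks could be read out from a set of embeddings and per-point features. It is an illustration under stated assumptions, not the M2T2 implementation: the dot-product-plus-sigmoid readout, the 0.5 threshold, the evenly spaced rotation bins, and the names `predict_contact_masks` and `placement_rotations` are all hypothetical, and random arrays stand in for the learned embeddings and point cloud features.

```python
import numpy as np


def predict_contact_masks(queries, point_features, threshold=0.5):
    """Score every point against every query embedding.

    queries:        (Q, D) array, one embedding per predicted mask.
    point_features: (N, D) array, one feature vector per point.
    Returns boolean masks of shape (Q, N) and the per-point probabilities.
    Assumption: masks come from a dot product followed by a sigmoid.
    """
    logits = queries @ point_features.T      # (Q, N) similarity scores
    probs = 1.0 / (1.0 + np.exp(-logits))    # per-point contact probability
    return probs > threshold, probs


def placement_rotations(num_bins):
    """Planar rotation angle (radians) for each of the P placement masks.

    Assumption: the bins are evenly spaced over a full turn.
    """
    return np.arange(num_bins) * (2.0 * np.pi / num_bins)


# Toy sizes: G graspable objects, P rotation bins, N points, D feature dims.
G, P, N, D = 3, 8, 1024, 64
rng = np.random.default_rng(0)
queries = rng.standard_normal((G + P, D))        # stand-in for learned embeddings
point_features = rng.standard_normal((N, D))     # stand-in for point cloud features

masks, probs = predict_contact_masks(queries, point_features)
grasp_masks = masks[:G]    # one mask per graspable object
place_masks = masks[G:]    # one mask per discretized planar rotation
angles = placement_rotations(P)
```

In this sketch, row `i` of `place_masks` marks candidate contact points for a placement rotated by `angles[i]` about the vertical axis. Reconstructing full 6-DoF gripper poses from the masked points is a separate step that is not shown here.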

M2T2 can also take other conditional inputs (e.g., a language goal) to predict task-specific grasps and placements. Below is the architecture for M2T2 trained on RLBench, which is conditioned on language tokens embedded by a pretrained CLIP model.


M2T2 can perform various complex tasks with a single model by decomposing each task into a sequence of pick-and-place actions. Below are some examples from the evaluation on RLBench. M2T2 achieves success rates of 89.3%, 88.0%, and 86.7% on open drawer, turn tap, and meat off grill respectively, whereas PerAct, a state-of-the-art multi-task model, achieves 80%, 80%, and 84%.

Comparison Against Single-task Models

We also trained M2T2 to perform only a single task, either grasping or placing. Although these task-specialized models outperform the baselines, they still perform worse than our multi-task model. This shows that it is important to formulate both picking and placing under the same framework.