Stacking our way to more general robots

Introducing RGB-Stacking as a new benchmark for vision-based robotic manipulation.

Picking up a stick and balancing it atop a log or stacking a pebble on a stone may seem like simple — and quite similar — actions for a person. However, most robots struggle with handling more than one such task at a time. Manipulating a stick requires a different set of behaviours than stacking stones, never mind piling various dishes on top of one another or assembling furniture. Before we can teach robots how to perform these kinds of tasks, they first need to learn how to interact with a far greater range of objects. As part of DeepMind’s mission and as a step toward making more generalisable and useful robots, we’re exploring how to enable robots to better understand the interactions of objects with diverse geometries.

Diverse Stacking Behaviours

In a paper to be presented at CoRL 2021 (Conference on Robot Learning) and available now as a preprint on OpenReview, we introduce RGB-Stacking as a new benchmark for vision-based robotic manipulation. In this benchmark, a robot has to learn how to grasp different objects and balance them on top of one another. What sets our research apart from prior work is the diversity of objects used and the large number of empirical evaluations performed to validate our findings. Our results demonstrate that a combination of simulation and real-world data can be used to learn complex multi-object manipulation and suggest a strong baseline for the open problem of generalising to novel objects. To support other researchers, we’re open-sourcing a version of our simulated environment, and releasing the designs for building our real-robot RGB-stacking environment, along with the RGB-object models and information for 3D printing them. We are also open-sourcing a collection of libraries and tools used in our robotics research more broadly.

RGB-Stacking benchmark.

With RGB-Stacking, our goal is to train a robotic arm via reinforcement learning to stack objects of different shapes. We place a parallel gripper attached to a robot arm above a basket, and three objects in the basket: one red, one green, and one blue, hence the name RGB. The task is simple: stack the red object on top of the blue object within 20 seconds, while the green object serves as an obstacle and distraction. The learning process is designed so that the agent acquires generalised skills by training on multiple object sets. We intentionally vary the grasp and stack affordances, the qualities that define how the agent can grasp and stack each object. This design principle forces the agent to exhibit behaviours that go beyond a simple pick-and-place strategy.
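
To make the task definition concrete, here is a minimal sketch of what a success check for one episode could look like. Only the 20-second limit and the red-on-blue goal come from the description above. The `Pose` type, the function names, and the tolerances `xy_tol` and `min_lift` are hypothetical and chosen purely for illustration; the benchmark's actual success criterion is not specified here.

```python
import math
from dataclasses import dataclass

TIME_LIMIT_S = 20.0  # time limit stated in the task description


@dataclass(frozen=True)
class Pose:
    """Object centre position in metres (hypothetical representation)."""
    x: float
    y: float
    z: float


def is_stacked(red: Pose, blue: Pose,
               xy_tol: float = 0.02, min_lift: float = 0.01) -> bool:
    """Return True if red sits roughly above blue.

    xy_tol and min_lift are illustrative thresholds, not benchmark values.
    """
    horizontal_offset = math.hypot(red.x - blue.x, red.y - blue.y)
    height_difference = red.z - blue.z
    return horizontal_offset <= xy_tol and height_difference >= min_lift


def episode_success(red: Pose, blue: Pose, elapsed_s: float) -> bool:
    """An episode succeeds if red is stacked on blue within the time limit."""
    return elapsed_s <= TIME_LIMIT_S and is_stacked(red, blue)
```

The green object does not appear in the check at all: in this sketch it affects the outcome only indirectly, by getting in the way.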

Each of the five test triplets poses its own challenges to the agent: Triplet 1 requires a precise grasp of the top object; Triplet 2 often requires the top object to be used as a tool to flip the bottom object before stacking; Triplet 3 requires balancing; Triplet 4 requires precision stacking (i.e., the object centroids need to align); and the top object of Triplet 5 can easily roll off if not stacked gently. In assessing the difficulty of this task, we found that our hand-coded scripted baseline had a 51% success rate at stacking.

Our RGB-Stacking benchmark includes two task versions with different levels of difficulty. In “Skill Mastery,” our goal is to train a single agent that’s skilled in stacking a predefined set of five triplets. In “Skill Generalisation,” we use the same triplets for evaluation, but train the agent on a large set of training objects — totalling more than a million possible triplets. To test for generalisation, these training objects exclude the family of objects from which the test triplets were chosen. In both versions, we decouple our learning pipeline into three stages:

  • First, we train in simulation using an off-the-shelf RL algorithm: Maximum a Posteriori Policy Optimisation (MPO). At this stage, we use the simulator’s state, allowing for fast training since the object positions are given directly to the agent instead of the agent needing to learn to find the objects in images. The resulting policy is not directly transferable to the real robot since this information is not available in the real world.
  • Next, we train a new policy in simulation that uses only realistic observations: images and the robot’s proprioceptive state. We use a domain-randomised simulation to improve transfer to real-world images and dynamics. The state policy serves as a teacher, providing the learning agent with corrections to its behaviours, and those corrections are distilled into the new policy.
  • Lastly, we collect data using this policy on real robots and train an improved policy from this data offline by weighting up good transitions based on a learned Q function, as done in Critic Regularised Regression (CRR). This allows us to use the data that’s passively collected during the project instead of running a time-consuming online training algorithm on the real robots.
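
The reweighting in the last stage can be sketched as follows. This is a minimal illustration of the two weighting schemes described in the CRR paper (binary and exponential), not the code used in this work. The function names, the temperature `beta`, and the clipping value `max_weight` are assumptions made for the example.

```python
import math
from typing import List, Sequence


def crr_weights(q_logged: Sequence[float],
                q_policy_samples: Sequence[Sequence[float]],
                beta: float = 1.0,
                max_weight: float = 20.0,
                binary: bool = False) -> List[float]:
    """Weight each logged transition by its estimated advantage.

    q_logged[i]         : Q(s_i, a_i) for the action stored in the dataset.
    q_policy_samples[i] : Q(s_i, a_j) for actions a_j sampled from the
                          current policy; their mean is the baseline.
    beta and max_weight are illustrative defaults.
    """
    weights = []
    for q_a, samples in zip(q_logged, q_policy_samples):
        advantage = q_a - sum(samples) / len(samples)
        if binary:
            # Keep only transitions that look better than the policy.
            weights.append(1.0 if advantage > 0 else 0.0)
        else:
            # Exponential weighting, clipped in log space to avoid overflow.
            exponent = min(advantage / beta, math.log(max_weight))
            weights.append(math.exp(exponent))
    return weights


def weighted_bc_loss(log_probs: Sequence[float],
                     weights: Sequence[float]) -> float:
    """Negative weighted log-likelihood of the logged actions."""
    total = sum(w * lp for w, lp in zip(weights, log_probs))
    return -total / len(log_probs)
```

Transitions whose logged action scores above the policy's own average get a larger weight, so the new policy imitates mostly the good parts of the collected data. No further interaction with the robot is needed during this step.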

Decoupling our learning pipeline in such a way proves crucial for two main reasons. Firstly, it allows us to solve the problem at all, since it would simply take too long if we were to start from scratch on the robots directly. Secondly, it increases our research velocity, since different people in our team can work on different parts of the pipeline before we combine these changes for an overall improvement.

Learning pipeline.

Our agent shows novel behaviours for stacking the five triplets. The strongest result with Skill Mastery was a vision-based agent that achieved 79% average success in simulation (Stage 2), 68% zero-shot success on real robots (Stage 2), and 82% after the one-step policy improvement from real data (Stage 3). The same pipeline for Skill Generalisation resulted in a final agent that achieved 54% success on real robots (Stage 3). Closing this gap between Skill Mastery and Skill Generalisation remains an open challenge.
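
The differences between these figures can be worked out directly. The short snippet below uses only the success rates quoted above; the variable names are ours, and the differences are in percentage points.

```python
# Success rates in percent, as reported in the text above.
mastery_sim = 79              # Skill Mastery, simulation (Stage 2)
mastery_real_zero_shot = 68   # Skill Mastery, real robots, zero-shot (Stage 2)
mastery_real_final = 82       # Skill Mastery, real robots (Stage 3)
generalisation_real_final = 54  # Skill Generalisation, real robots (Stage 3)

# Drop when moving the Stage 2 policy from simulation to real robots.
sim_to_real_drop = mastery_sim - mastery_real_zero_shot

# Gain from the offline policy improvement step on real data.
offline_improvement_gain = mastery_real_final - mastery_real_zero_shot

# Remaining gap between the two task versions on real robots.
mastery_generalisation_gap = mastery_real_final - generalisation_real_final

print(f"sim-to-real drop: {sim_to_real_drop} points")
print(f"gain from Stage 3: {offline_improvement_gain} points")
print(f"mastery vs generalisation gap: {mastery_generalisation_gap} points")
```

The offline improvement step more than recovers the loss from sim-to-real transfer, while the gap to Skill Generalisation is twice the size of that gain.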

In recent years, there has been much work on applying learning algorithms to solving difficult real-robot manipulation problems at scale, but the focus of such work has largely been on tasks such as grasping, pushing, or other forms of manipulating single objects. The approach to RGB-Stacking we describe in our paper, accompanied by our robotics resources now available on GitHub, results in surprising stacking strategies and mastery of stacking a subset of these objects. Still, this step only scratches the surface of what’s possible, and the generalisation challenge is not yet fully solved. As researchers keep working to solve the open challenge of true generalisation in robotics, we hope this new benchmark, along with the environment, designs, and tools we have released, contributes to new ideas and methods that can make manipulation even easier and robots more capable.
