Robotic Manipulation

Perception, Planning, and Control

Russ Tedrake

© Russ Tedrake, 2020-2022

Note: These are working notes used for a course being taught at MIT. They will be updated throughout the Fall 2022 semester.


Deep Perception for Manipulation

In the previous chapter, we discussed deep-learning approaches to object detection and (instance-level) segmentation; these are general-purpose tasks for processing RGB images that are used broadly in computer vision. Detection and segmentation alone can be combined with geometric perception to, for instance, estimate the pose of a known object in just the segmented point cloud instead of the entire scene, or to run our point-cloud grasp selection algorithm only on the segmented point cloud in order to pick up an object of interest.
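
To make the "segment first, then run geometric perception" idea concrete, here is a minimal sketch of cropping a point cloud with an instance mask. It assumes an organized point cloud (one 3D point per pixel, NaN where the depth camera has no return) and a boolean mask from the segmentation network; the function name `crop_cloud_with_mask` and the toy data are hypothetical and are not part of the course software.

```python
import numpy as np


def crop_cloud_with_mask(points_hw3, mask_hw):
    """Return the (N, 3) points whose pixels lie inside the instance mask.

    points_hw3: (H, W, 3) organized point cloud, one 3D point per pixel
                (assumed NaN where the depth camera had no return).
    mask_hw:    (H, W) boolean instance mask from a segmentation network.
    """
    points_hw3 = np.asarray(points_hw3, dtype=float)
    mask_hw = np.asarray(mask_hw, dtype=bool)
    if points_hw3.shape[:2] != mask_hw.shape:
        raise ValueError("mask and point cloud must share the image size")
    # Keep only pixels that are in the mask and have a valid depth return.
    valid = mask_hw & np.all(np.isfinite(points_hw3), axis=-1)
    return points_hw3[valid]


# Toy 2x2 "image": one of the masked pixels has no depth return.
cloud = np.array([[[0.0, 0.0, 1.0], [0.1, 0.0, 1.0]],
                  [[0.0, 0.1, np.nan], [0.1, 0.1, 1.2]]])
mask = np.array([[True, False], [True, True]])
object_points = crop_cloud_with_mask(cloud, mask)
print(object_points.shape)
```

The cropped points could then be handed to a pose estimator or to the grasp selection algorithm in place of the full scene cloud.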

One of the most amazing features of deep learning for perception is that we can pre-train on a different dataset (like ImageNet or COCO) or even a different task and then fine-tune on our domain-specific dataset or task. But what are the right perception tasks for manipulation? Object detection and segmentation are a great start, but often we want to know more about the scene/objects to manipulate them. That is the topic of this chapter.

There is a potential answer to this question that we will defer to a later chapter: learning end-to-end "visuomotor" policies, sometimes affectionately referred to as "pixels to torques". Here I want us to think first about how we can combine a deep-learning-based perception system with the powerful existing (model-based) tools that we have been building up for planning and control.

Maybe a system diagram that includes perception, planning, and control? So far we've had two versions: grasp candidates and/or object pose…

I'll start with deep-learning versions of two perception tasks we've already considered: object pose estimation and grasp selection.

Pose estimation

Grasp selection

(Semantic) Keypoints

Example: corner keypoints for boxes. (also pose+shape estimation from keypoints?)

Dense Descriptors

Task-level state

Other perceptual tasks / representations

My coverage above is necessarily incomplete and the field is moving fast. Here is a quick "shout out" to a few other very relevant ideas.

More coming soon...


Dense Object Nets and Contrastive Loss

In this problem you will further explore Dense Object Nets, which were introduced in lecture. Dense Object Nets are able to quickly learn consistent pixel-level representations for visual understanding and manipulation. Dense Object Nets are powerful because the representations they predict are applicable to both rigid and non-rigid objects. They can also generalize to new objects in the same class and can be trained with self-supervised learning. For this problem you will first implement the loss function used to train Dense Object Nets, and then predict correspondences between images using a trained Dense Object Net.
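
As a rough guide to what the two parts involve, here is a minimal NumPy sketch of a pixelwise contrastive loss and a nearest-descriptor correspondence lookup. It is an illustration under stated assumptions, not the exercise solution: the function names, the `margin` value, and the plain averaging over matches and non-matches are my own choices, and the exercise may normalize or sample non-matches differently.

```python
import numpy as np


def pixelwise_contrastive_loss(desc_a, desc_b, matches_a, matches_b,
                               non_matches_a, non_matches_b, margin=0.5):
    """Contrastive loss on descriptor images (hypothetical helper).

    desc_a, desc_b: (H, W, D) descriptor images for the two views.
    *_a, *_b:       (N, 2) integer (row, col) pixels, paired row by row.
    """
    def gather(desc, pixels):
        pixels = np.asarray(pixels)
        return desc[pixels[:, 0], pixels[:, 1]]

    # Matches: pull descriptors of corresponding pixels together.
    d_match = np.linalg.norm(
        gather(desc_a, matches_a) - gather(desc_b, matches_b), axis=1)
    match_loss = np.mean(d_match ** 2)

    # Non-matches: push descriptors apart until they are `margin` away.
    d_non = np.linalg.norm(
        gather(desc_a, non_matches_a) - gather(desc_b, non_matches_b), axis=1)
    non_match_loss = np.mean(np.maximum(0.0, margin - d_non) ** 2)
    return match_loss + non_match_loss


def find_correspondence(desc_a, desc_b, pixel_a):
    """Pixel in image b whose descriptor is nearest to desc_a at pixel_a."""
    query = desc_a[pixel_a[0], pixel_a[1]]
    dist = np.linalg.norm(desc_b - query, axis=-1)
    return np.unravel_index(np.argmin(dist), dist.shape)


# Toy data: image b is image a shifted one column to the right,
# so pixel (r, c) in a corresponds to pixel (r, c + 1) in b.
rng = np.random.default_rng(0)
desc_a = rng.normal(size=(4, 4, 3))
desc_b = np.roll(desc_a, shift=1, axis=1)
matches_a = np.array([[0, 0], [1, 2]])
matches_b = np.array([[0, 1], [1, 3]])
non_matches_a = np.array([[0, 0], [2, 1]])
non_matches_b = np.array([[3, 3], [0, 0]])

loss = pixelwise_contrastive_loss(desc_a, desc_b, matches_a, matches_b,
                                  non_matches_a, non_matches_b)
best_match = find_correspondence(desc_a, desc_b, (1, 2))
print(loss, best_match)
```

In the toy data the matched descriptors are identical, so only the non-match term can contribute to the loss; with a trained network, the lookup would be run on the descriptor images predicted for two real camera views.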
