Perception, Planning, and Control

© Russ Tedrake, 2020-2022



**Note:** These are working notes used for a course being taught
at MIT. They will be updated throughout the Fall 2022 semester.


These days, there is a lot of excitement around reinforcement learning
(RL), and a lot of literature available. The scope of what one might
consider to be a reinforcement learning algorithm has also broadened
significantly. The classic (and now updated) book by Sutton and Barto is
still the best introduction to RL.

My goal for this chapter is to provide enough of the fundamentals to get us all on the same page, but to focus primarily on the RL ideas (and examples) that are particularly relevant for manipulation. And manipulation is a great playground for RL, due to the need for rich perception, for operating in diverse environments, and potentially with rich contact mechanics. Many of the core applied research results have been motivated by and demonstrated on manipulation examples!

There is now a huge variety of RL toolboxes available online, with widely varying levels of quality and sophistication. But there is one standard that has clearly won out as the default interface with which one should wrap up their simulator in order to connect to a variety of RL algorithms: the OpenAI Gym.

It's worth taking a minute to appreciate the difference between the OpenAI
Gym Environments (`gym.Env`) interface and the Drake System interface; I think it
is very telling. My goal (in Drake) is to present you with a rich and
beautiful interface to express and optimize dynamical systems, to expose
and exploit all possible structure in the governing equations. The goal in
Gym is to expose the absolute minimal details, so that it's possible to
easily wrap every possible system under the same common interface (it doesn't matter if it's a robot, an Atari game, or
even a compiler). Almost
by definition, you can wrap any Drake system as a Gym Environment.

An OpenAI Gym Environment is an incredibly simple wrapper around
simulators which offers a very
basic interface, most notably consisting of `reset()`, `step()`, and `render()`. The
`step()` method returns the current observations and the
one-step reward (as well as some additional termination conditions).
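
To make the shape of that interface concrete, here is a minimal sketch of an environment exposing the same three method names. It is an assumption-laden toy: `PointMassEnv`, its dynamics, and its reward are hypothetical, it does not import or subclass `gym.Env`, and it follows the classic four-tuple `step()` return convention (newer Gym releases split the termination flag in two).

```python
class PointMassEnv:
    """Hypothetical 1D point mass; mimics the reset/step/render names only."""

    def __init__(self, time_step=0.1, max_steps=50):
        self.time_step = time_step
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        # Start every rollout from the same initial state.
        self.position = 1.0
        self.velocity = 0.0
        self.steps = 0
        return (self.position, self.velocity)

    def step(self, action):
        # Semi-implicit Euler integration of a unit mass driven by a force.
        self.velocity += self.time_step * action
        self.position += self.time_step * self.velocity
        self.steps += 1
        observation = (self.position, self.velocity)
        # One-step reward: penalize distance from the origin and effort.
        reward = -(self.position ** 2) - 0.01 * action ** 2
        done = self.steps >= self.max_steps
        return observation, reward, done, {}

    def render(self):
        return f"x={self.position:.3f}, v={self.velocity:.3f}"


env = PointMassEnv()
observation = env.reset()
total_reward = 0.0
done = False
while not done:
    # A hand-written PD policy stands in for a learned one.
    action = -2.0 * observation[0] - 1.0 * observation[1]
    observation, reward, done, info = env.step(action)
    total_reward += reward
print(env.render(), "return:", total_reward)
```

Nothing in the rollout loop depends on what is inside the environment, which is the point of keeping the interface this small.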

You can wrap any Drake simulation in an OpenAI gym environment, using

`from manipulation.drake_gym import DrakeGymEnv`

The `DrakeGymEnv` constructor takes a `Simulator` as well as an input port
to associate with the actions, an output port to associate with the
observations, etc. For the reward, you can implement it as a simple function of the `Simulator` `Context`, or as another output port.
`DrakeGymEnv` is built around a `Simulator` (not just a `System`), or a function that produces a random
`Simulator`, because you might want to control the integrator
parameters or have each rollout contain the same robot in a different
environment, with potentially different numbers of objects. This would
mean that the underlying `System` might have a different
number of states / ports. The notion of a function that can produce
simulators, referred to as a `SimulatorFactory`, is core to
the stochastic
system modeling framework in Drake.
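
The "simulator or function that produces a simulator" idea can be sketched in plain Python. Everything below is hypothetical and is not Drake's API: `ToySimulator`, `toy_simulator_factory`, and `ToyEnv` are stand-ins that only illustrate why calling a factory on each reset lets the number of states change from rollout to rollout.

```python
import random
from dataclasses import dataclass


@dataclass
class ToySimulator:
    """Stand-in for a simulator; NOT a Drake class."""
    num_objects: int

    def num_states(self):
        # Assume one position and one velocity per object.
        return 2 * self.num_objects


def toy_simulator_factory(rng):
    # Each call builds a scene with a random number of objects.
    return ToySimulator(num_objects=rng.randint(1, 5))


class ToyEnv:
    """Accepts either a fixed simulator or a factory that produces one."""

    def __init__(self, simulator_or_factory, seed=0):
        self.rng = random.Random(seed)
        if callable(simulator_or_factory):
            self.factory = simulator_or_factory
            self.simulator = None
        else:
            self.factory = None
            self.simulator = simulator_or_factory

    def reset(self):
        # With a factory, every rollout gets a freshly sampled simulator.
        if self.factory is not None:
            self.simulator = self.factory(self.rng)
        return [0.0] * self.simulator.num_states()


fixed_env = ToyEnv(ToySimulator(num_objects=3))
random_env = ToyEnv(toy_simulator_factory, seed=1)
state_sizes = [len(random_env.reset()) for _ in range(20)]
print(len(fixed_env.reset()), state_sizes)
```

With the fixed simulator the state size never changes; with the factory it varies across resets, so anything downstream has to tolerate that.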

You can also use any Gym environment in the Drake ecosystem; you just won't be able to apply some of the more advanced algorithms that Drake provides. Of course, I think that you should use Drake for your work in RL, too (many people do), because it provides a rich library of dynamical systems that are rigorously authored and tested, including a great physics engine for dealing with contact, and leaves open the option to put RL approaches head-to-head against more model-based alternatives. I admit I might be a little biased. At any rate, that's the approach we will take in these notes.

Some people might
argue that the more thoughtfully you model your system, the more
assumptions you have baked in, making yourself susceptible to "sim2real"
gaps; but I think that's simply not the case. Thoughtful modeling includes
making uncertainty models that can account for as narrow or broad a
class of systems as we aim to explore; good things happen when we can make
the uncertainty models themselves *structured*. I think one of the
most fundamental challenges waiting for us at the intersection of
reinforcement learning and control is a deeper understanding of the class
of models that is rich enough to describe the diversity and richness of
problems we are exploring in manipulation (e.g. with RGB cameras as inputs)
while providing somewhat more structure that we can exploit with stronger
algorithms. Importantly, these models should continually expand and
improve with data.

The OpenAI Gym provides an interface for RL environments, but doesn't provide the implementation of the actual RL algorithms. There are a large number of popular repositories for the algorithms, too. As of this writing, I would recommend Stable Baselines 3: it provides a very nice and thoughtfully documented set of implementations in PyTorch.

One other class of algorithms that is very relevant to RL but not specifically designed for RL is algorithms for black-box optimization. I quite like Nevergrad, and will also use that here.
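
As a rough illustration of what "black-box" means here, the sketch below minimizes a cost using only function evaluations, with no gradients. It is a simple accept-if-better random search written from scratch, not Nevergrad and not any particular algorithm from these notes; `random_search`, `quadratic_cost`, and all parameter values are assumptions chosen for the example.

```python
import random


def random_search(cost, x0, sigma=0.5, iterations=2000, seed=0):
    """Keep a Gaussian perturbation of the best point only if it lowers the cost."""
    rng = random.Random(seed)
    best_x = list(x0)
    best_cost = cost(best_x)
    for _ in range(iterations):
        candidate = [xi + rng.gauss(0.0, sigma) for xi in best_x]
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:
            best_x, best_cost = candidate, candidate_cost
    return best_x, best_cost


def quadratic_cost(x):
    # Hypothetical stand-in for "run a rollout and return its total cost".
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


best_x, best_cost = random_search(quadratic_cost, [0.0, 0.0])
print(best_x, best_cost)
```

In an RL setting, `cost` would wrap a simulation rollout with the policy parameters as its argument; the optimizer never needs to look inside it.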

`rl/black_box.ipynb`.

You can find more details on the derivation and some basic analysis of these algorithms here.

`rl/box_flipup.ipynb`.

This is a great time for theoretical RL + controls, with experts from controls embracing new techniques and insights from machine learning, and vice versa. As a simple example, we've increasingly come to understand that, even though the cost landscape for many classical control problems (like the linear quadratic regulator) is not convex in the typical policy parameters, gradient descent still works for these problems (there are no spurious local minima), and the class of problems/parameterizations for which we can make statements like this is growing rapidly.
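
A scalar sketch can make the LQR claim tangible. Assume the discrete-time system $x_{n+1} = a x_n + b u_n$ with policy $u_n = -k x_n$ and cost $\sum_n (q x_n^2 + r u_n^2)$, so that $J(k) = (q + r k^2) x_0^2 / (1 - (a - bk)^2)$ for stabilizing gains. The code runs gradient descent directly on $k$ and compares the result with the gain from the Riccati recursion. The system parameters, step size, and function names are all assumptions for illustration, and the scalar landscape is benign, so this shows only that descent on the policy parameter recovers the optimal gain, not the nonconvexity that appears in higher dimensions.

```python
import math

A, B, Q, R = 1.0, 1.0, 1.0, 1.0  # assumed scalar system and cost weights


def lqr_cost(k, x0=1.0):
    """Infinite-horizon cost of u = -k x, finite only for stabilizing gains."""
    closed_loop = A - B * k
    if abs(closed_loop) >= 1.0:
        return math.inf
    return (Q + R * k ** 2) * x0 ** 2 / (1.0 - closed_loop ** 2)


def riccati_gain(iterations=200):
    """Optimal gain from iterating the scalar discrete-time Riccati equation."""
    p = Q
    for _ in range(iterations):
        p = Q + A ** 2 * p - (A * B * p) ** 2 / (R + B ** 2 * p)
    return A * B * p / (R + B ** 2 * p)


def gradient_descent(k0, step_size=0.05, iterations=500, eps=1e-6):
    """Descend J(k) using a central finite-difference gradient."""
    k = k0
    for _ in range(iterations):
        grad = (lqr_cost(k + eps) - lqr_cost(k - eps)) / (2.0 * eps)
        k -= step_size * grad
    return k


k_gd = gradient_descent(k0=1.5)
k_star = riccati_gain()
print(k_gd, k_star)
```

The initial gain must be stabilizing, since the cost is infinite otherwise; that requirement carries over to the general policy-gradient setting for LQR.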

For this exercise, you will implement a stochastic optimization scheme that does not require exact analytical gradients. You will work exclusively in . You will be asked to complete the following steps:

- Implement gradient descent with exact analytical gradients.
- Implement stochastic gradient descent with approximated gradients.
- Prove that the expected value of the stochastic update does not change with baselines.
- Implement stochastic gradient descent with baselines.

For this exercise, you will implement the vanilla REINFORCE algorithm on a box pushing task. You will work exclusively in . You will be asked to complete the following steps:

- Implement the policy loss function.
- Implement the value loss function.
- Implement the advantage function.

- "Reinforcement Learning: An Introduction", MIT Press, 2018.
- "Reinforcement Learning: Theory and Algorithms", Online Draft, 2020.
- "Algorithms for Reinforcement Learning", Morgan and Claypool Publishers, 2010.
