
midlevel-reps's Introduction

Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies

What happens when robots leverage visual priors during learning? They learn faster, generalize better, and achieve higher final performance.

An agent with mid-level perception navigating inside a building.


Summary: How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. delivering a package)? We study this question by integrating a generic perceptual skill set (mid-level vision) within a reinforcement learning framework. This skill set provides a policy with a more processed state of the world compared to raw images, conferring significant advantages over training from scratch (i.e. not leveraging priors) in navigation-oriented tasks. Agents are able to generalize to situations where the from-scratch approach fails and training becomes significantly more sample efficient. Realizing these gains requires careful selection of the mid-level perceptual skills, and we provide an efficient and generic max-coverage feature set that can be adopted in lieu of raw images.

This repository includes code from the paper, ready-made dockers containing pre-built environments, and commands to run our experiments. We also include instructions to install the lightweight visualpriors package, which allows you to use mid-level perception in your own code as a drop-in replacement for pixels.

Please see the website (http://perceptual.actor/) for more technical details. This repository is intended for distribution of the code, environments, and installation/running instructions.

See more mid-level perception results, and then try mid-level perception out for yourself:

Online demos · Run our examples (dockers) · Try it yourself (using the visualpriors package)

Overview Video (6 min)

Papers

Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies,
arXiv 2018.
Alexander Sax, Bradley Emi, Amir Zamir, Silvio Savarese, Leonidas Guibas, Jitendra Malik.

Learning to Navigate Using Mid-Level Visual Priors,
CoRL 2019.
Alexander Sax, Jeffrey O. Zhang, Bradley Emi, Amir Zamir, Silvio Savarese, Leonidas Guibas, Jitendra Malik.


Contents

Quickstart
Running our experiments
Using mid-level perception in your code
Embodied Vision Toolkit (Under Construction!)
Citation

Quickstart [^]

Quickly transform an image into surface normals features and then visualize the result.

Step 1) Run pip install visualpriors to install the visualpriors package. You'll need PyTorch!

Step 2) Using Python, download an image to test.png and visualize the readout in test_normals_readout.png:

from PIL import Image
import torchvision.transforms.functional as TF
import visualpriors
import subprocess

# Download a test image
subprocess.call("curl -O https://raw.githubusercontent.com/StanfordVL/taskonomy/master/taskbank/assets/test.png", shell=True)

# Load image and rescale/resize to [-1,1] and 3x256x256
image = Image.open('test.png')
x = TF.to_tensor(TF.resize(image, 256)) * 2 - 1
x = x.unsqueeze_(0)

# Transform to normals feature
representation = visualpriors.representation_transform(x, 'normal', device='cpu')

# Transform to normals feature and then visualize the readout
pred = visualpriors.feature_readout(x, 'normal', device='cpu')

# Save it
TF.to_pil_image(pred[0] / 2. + 0.5).save('test_normals_readout.png')
Left to right: input image, representation (3 of 8 channels), pred (after readout).

In addition to normals, you can use any of the following features in your transform:

autoencoding          depth_euclidean          jigsaw                  reshading          
colorization          edge_occlusion           keypoints2d             room_layout      
curvature             edge_texture             keypoints3d             segment_unsup2d        
class_object          egomotion                nonfixated_pose         segment_unsup25d
class_scene           fixated_pose             normal                  segment_semantic      
denoising             inpainting               point_matching          vanishing_point

A description of each of the features is contained in the supplementary of Taskonomy.
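For example, you can loop over several of these feature names and grab the corresponding representations in one go. Below is a minimal sketch reusing the quickstart snippet; the three names are arbitrary picks from the table, and readout visualization works exactly as in the quickstart.

from PIL import Image
import torchvision.transforms.functional as TF
import visualpriors

# Reuse the quickstart test image (assumes test.png has already been downloaded)
image = Image.open('test.png')
x = TF.to_tensor(TF.resize(image, 256)) * 2 - 1
x = x.unsqueeze(0)

for feature_type in ['normal', 'depth_euclidean', 'reshading']:
    # Each representation is a compact feature map you could feed to a policy
    representation = visualpriors.representation_transform(x, feature_type, device='cpu')
    print(feature_type, tuple(representation.shape))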


Running our experiments [^]

Using mid-level vision, it is possible to train an agent in only a single room and then have it generalize to novel spaces in different buildings. The feature-based agents learn faster and perform significantly better than their trained-from-scratch counterparts. For a more extensive discussion of the benefits of visual priors, and of mid-level vision in particular, please see the paper. This repository focuses on delivering easy-to-use experiments and code.

We provide dockers to reproduce and extend our results. Setting up these simulation environments can be a pain, and Docker provides a containerized setup with the environments already built. If not already installed, install Docker and Nvidia-Docker.



Experiments in Habitat

In the main paper we studied how mid-level perception affects learning on various tasks. In the local planning task:

The agent must direct itself to a given nonvisual target destination (specified using coordinates) using visual inputs, avoiding obstacles and walls as it navigates to the target. This task is useful for the practical skill of local planning, where an agent must traverse sparse waypoints along a desired path. The agent receives dense positive reward proportional to the progress it makes (in Euclidean distance) toward the goal. Further details are contained in the paper.
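For intuition, the dense progress reward described above boils down to something like the sketch below. The scale coefficient is a placeholder; the actual shaping (and any success bonus) is defined in the paper and the experiment configs.

import math

def progress_reward(prev_pos, curr_pos, goal_pos, scale=1.0):
    # Dense reward: Euclidean distance to the goal closed during this step
    prev_dist = math.dist(prev_pos, goal_pos)
    curr_dist = math.dist(curr_pos, goal_pos)
    return scale * (prev_dist - curr_dist)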

The following steps will guide you through training an agent on the local planning task in the Habitat environment. These agents were submitted to the Habitat Challenge.

Habitat experiment

An agent navigating to the goal. The goal is shown in the middle panel, in green. The agent sees only the left and right panels.

Step 1) Install the docker (22 GB)

In a shell, pull the docker to your local machine

docker pull activeperception/habitat:1.0

Step 2) Start up a docker container

Once the docker is installed you can start a new container. The following command will start a new container that can use ports on the host (so that visdom can be run from within the container).

docker run --runtime=nvidia -ti --rm \
    --network host --ipc=host \
    activeperception/habitat:1.0 bash

Step 3) Start visualization services

Inside the docker container we can start a visdom server (to view videos) and a tensorboard instance (for better charts).

mkdir /tmp/midlevel_logs/
screen -dmS visdom_server
screen -S visdom_server -p 0 -X stuff "visdom^M"
screen -dmS tensorboard_server
screen -S tensorboard_server -p 0 -X stuff "tensorboard --logdir .^M"

Step 4) Run the experiment

Lastly, we just need to start the experiment. Let's try training an agent that uses predicted surface normals as inputs. We'll use only 1 training and 1 val process since we're just trying to visualize the results.

python -m scripts.train_rl /tmp/midlevel_logs/normals_agent run_training with uuid=normals cfg_habitat taskonomy_decoding  cfg.saving.log_interval=10 cfg.env.num_processes=2 cfg.env.num_val_processes=1

If you want to compare this to an agent trained from scratch, you can swap this easily with:

python -m scripts.train_rl /tmp/midlevel_logs/scratch run_training with uuid=scratch cfg_habitat scratch  cfg.saving.log_interval=10 cfg.env.num_processes=2 cfg.env.num_val_processes=1

Or a blinded agent (no visual input)

python -m scripts.train_rl /tmp/midlevel_logs/blind run_training with uuid=blind cfg_habitat blind  cfg.saving.log_interval=10 cfg.env.num_processes=2 cfg.env.num_val_processes=1

Or using the Max-Coverage Min-Distance Featureset

python -m scripts.train_rl /tmp/midlevel_logs/max_coverage run_training with uuid=max_coverage cfg_habitat max_coverage_perception  cfg.saving.log_interval=10 cfg.env.num_processes=2 cfg.env.num_val_processes=1

Note: You might see some NaNs in the first iteration. Not to worry! This is probably because the first logging occurs before any episodes have finished.

You can explore more configuration options in configs/habitat.py! We used SACRED for managing experiments, so any of these experiments can be easily modified from the command line.


Experiments in Gibson and VizDoom (Under Construction!)

In addition to local_planning in Habitat, we implemented this and other tasks in Gibson and VizDoom, again finding the same phenomena (better generalization and sample efficiency). The new tasks are defined as follows:

Navigation to a Visual Target: In this scenario the agent must locate a specific target object (Gibson: a wooden crate, Doom: a green torch) as fast as possible with only sparse rewards. Upon touching the target there is a large one-time positive reward and the episode ends. Otherwise there is a small penalty for living. The target looks the same between episodes although the location and orientation of both the agent and target are randomized. The agent must learn to identify the target during the course of training.

Visual Exploration: The agent must visit as many new parts of the space as quickly as possible. The environment is partitioned into small occupancy cells which the agent "unlocks" by scanning with a myopic laser range scanner. This scanner reveals the area directly in front of the agent for up to 1.5 meters. The reward at each timestep is proportional to the number of newly revealed cells.
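Schematically, the two reward structures look like the sketch below. The bonus, living penalty, cell bookkeeping, and per-cell reward values are illustrative placeholders, not the constants used in our experiments.

def target_reward(touched_target, living_penalty=-0.01, success_bonus=10.0):
    # Navigation to a Visual Target: sparse one-time bonus on contact, small penalty for living
    return success_bonus if touched_target else living_penalty

def exploration_reward(visited_cells, newly_scanned_cells, per_cell=0.1):
    # Visual Exploration: reward proportional to occupancy cells newly revealed by the range scanner
    new_cells = set(newly_scanned_cells) - visited_cells
    visited_cells |= new_cells
    return per_cell * len(new_cells)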

Full details are contained in the main paper. The following section will guide you through training agents to use either mid-level vision or raw pixels to perform these tasks in Gibson and VizDoom.

Gibson experiment

Local planning using surface normal features in Gibson. We also implemented other tasks; Visual-Target Navigation and Visual Exploration are included in the docker.

Doom experiment

Visual navigation in Doom. The agent must navigate to the green torch. The docker includes implementations of Visual-Target Navigation and Visual Exploration in VizDoom.

Note: Our original results (in a code dump form) are currently public via the docker activeperception/midlevel-training:0.3. We are currently working on a cleaner and more portable release.



Using mid-level perception in your code [^]

In addition to our dockers, we offer a simple way to use mid-level vision in your code: the lightweight visualpriors package, which contains functions to upgrade your agent's state from pixels to mid-level features. The package is designed as a drop-in replacement for raw pixels. The remainder of this section covers installation and usage.


Installing visualpriors

The simplest way to install the visualpriors package is via pip:

pip install visualpriors

If you would prefer to have the source code, then you can clone this repo and install locally via:

git clone --single-branch --branch visualpriors git@github.com:alexsax/midlevel-reps.git
cd midlevel-reps
pip install -e .

Using visualpriors

Once you've installed visualpriors you can immediately begin using mid-level vision. The transform is as easy as

representation = visualpriors.representation_transform(x, 'normal', device='cpu')

1) A complete script for surface normals transform

from PIL import Image
import torchvision.transforms.functional as TF
import visualpriors
import subprocess

feature_type = 'normal'

# Download a test image
subprocess.call("curl -O https://raw.githubusercontent.com/StanfordVL/taskonomy/master/taskbank/assets/test.png", shell=True)

# Load image and rescale/resize to [-1,1] and 3x256x256
image = Image.open('test.png')
o_t = TF.to_tensor(TF.resize(image, 256)) * 2 - 1
o_t = o_t.unsqueeze_(0)

# Transform to normals feature
representation = visualpriors.representation_transform(o_t, feature_type, device='cpu') # phi(o_t) in the diagram below

# Transform to normals feature and then visualize the readout
pred = visualpriors.feature_readout(o_t, feature_type, device='cpu')

# Save it
TF.to_pil_image(pred[0] / 2. + 0.5).save('test_{}_readout.png'.format(feature_type))

This produces the following results:

Left to right: input image (o_t), representation (3 of 8 channels), prediction after readout (pred).
Diagram of the above setup in an active framework: the input image o_t is encoded into representation = phi(o_t), which is then decoded into the prediction pred. In this example, we use a ResNet-50 as the encoder phi.

2) Now let's try transforming the image into object classification (ImageNet) features, instead of surface normals:

midlevel_feats = visualpriors.representation_transform(pre_transform_img, 'class_object')  # So easy!
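Since the mid-level networks are used here as fixed feature extractors, it is common to skip gradient tracking when computing the representation. This is a standard PyTorch pattern rather than anything specific to visualpriors:

import torch

# Skip autograd bookkeeping if the features only serve as frozen inputs to a policy
with torch.no_grad():
    midlevel_feats = visualpriors.representation_transform(pre_transform_img, 'class_object')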

3) In addition to normals and class_object, you can use any of the following features in your transform:

autoencoding          depth_euclidean          jigsaw                  reshading          
colorization          edge_occlusion           keypoints2d             room_layout      
curvature             edge_texture             keypoints3d             segment_unsup2d        
class_object          egomotion                nonfixated_pose         segment_unsup25d
class_scene           fixated_pose             normal                  segment_semantic      
denoising             inpainting               point_matching          vanishing_point

A description of each of the features is contained in the supplementary of Taskonomy.

4) You can even use multiple features at once:

from visualpriors import multi_representation_transform

midlevel_feats = multi_representation_transform(pre_transform_img,  # should be (1, 3, 256, 256)
                                                ['normal', 'depth_euclidean', 'class_object'])
action = policy(midlevel_feats)  # midlevel_feats will be (batch_size, len(features)*8, 16, 16)

5) The obvious next question is: what's a good general-purpose choice of features? I'm glad that you asked! Our Max-Coverage Min-Distance Featureset proposes an answer, and those solver-found sets are implemented in the function max_coverage_featureset_transform. For example, if you can afford to use three features:

from visualpriors import max_coverage_featureset_transform

midlevel_feats = max_coverage_featureset_transform(pre_transform_img, k=3)
action = policy(midlevel_feats)
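Here, policy can be any network that accepts the stacked (batch, 8*k, 16, 16) feature map. Below is a minimal PyTorch sketch of such a policy head; the architecture and action space are illustrative placeholders, not the actor-critic used in our experiments.

import torch.nn as nn

class TinyPolicy(nn.Module):
    # Toy policy head over k stacked mid-level features (8*k channels, 16x16 spatial)
    def __init__(self, k=3, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(8 * k, 32, kernel_size=3, stride=2, padding=1),  # 16x16 -> 8x8
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, n_actions),
        )

    def forward(self, feats):
        return self.net(feats)  # action logits

policy = TinyPolicy(k=3)
action_logits = policy(midlevel_feats)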



Embodied Vision Toolkit (Under Construction!) [^]

In addition to providing the lightweight visualpriors package, we provide code for our full research platform, evkit. This platform includes utilities for handling visual transforms, flexibility in the choice of RL algorithm (including our off-policy variant of PPO with a replay buffer), and tools for logging and visualization.

This section will contain an overview of evkit, which is currently available in the evkit/ folder of this repository.




Citation

If you find this repository or toolkit useful, then please cite:

@inproceedings{midLevelReps2018,
 title={Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies.},
 author={Alexander Sax and Bradley Emi and Amir R. Zamir and Leonidas J. Guibas and Silvio Savarese and Jitendra Malik},
 year={2018},
}

midlevel-reps's People

Contributors

alexsax, amir32002, jozhang97


midlevel-reps's Issues

Error while importing environment

Hi,
Thanks for providing this tool. I have installed the latest habitat-lab and habitat-sim. Then I tried to run the experiment, but I got an error when importing the environment:

jin@jin-MS-7B94:~/RL-code/midlevel-reps$ python -m scripts.train_rl midlevel_logs run_training with uuid=normals cfg_habitat taskonomy_decoding  cfg.saving.log_interval=10 cfg.env.num_processes=2 cfg.env.num_val_processes=1
Traceback (most recent call last):
  File "/home/jin/anaconda3/envs/habitat7/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/jin/anaconda3/envs/habitat7/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jin/RL-code/midlevel-reps/scripts/train_rl.py", line 30, in <module>
    from evkit.env.wrappers import ProcessObservationWrapper
  File "/home/jin/RL-code/midlevel-reps/evkit/env/__init__.py", line 3, in <module>
    from .distributed_factory import DistributedEnv
  File "/home/jin/RL-code/midlevel-reps/evkit/env/distributed_factory.py", line 1, in <module>
    from evkit.env.util.vec_env.subproc_vec_embodied_env import SubprocVecEmbodiedEnv
  File "/home/jin/RL-code/midlevel-reps/evkit/env/util/vec_env/subproc_vec_embodied_env.py", line 3, in <module>
    from baselines.common.vec_env import VecEnv, CloudpickleWrapper
ModuleNotFoundError: No module named 'baselines'

Could you give me some help? Thank you very much!

Colorization features require grayscale inputs

Thanks for this very clear and useful package.

When calling visualpriors.representation_transform(image_data, "colorization") I get an error (dimension mismatch) when I pass in an RGB color image.

I know I can get a grayscale image along the lines of 0.3 * red + 0.6 * green + 0.1 * blue, but I wanted to check what you used during training so it could be matched as closely as possible. Perhaps representation_transform could take care of this preprocessing step for the user, to take out the guesswork?
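For now I am working around it by converting to grayscale myself before calling the transform. The snippet below uses PIL's standard luma conversion, which is just my guess at the preprocessing, since the training-time conversion is exactly what I'm asking about:

from PIL import Image
import torchvision.transforms.functional as TF
import visualpriors

image = Image.open('test.png').convert('L')  # guess: standard luma conversion, may not match training
x = TF.to_tensor(TF.resize(image, 256)) * 2 - 1
x = x.unsqueeze(0)  # shape (1, 1, 256, 256)
representation = visualpriors.representation_transform(x, 'colorization', device='cpu')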

visualpriors source code

Hi,
Thanks for providing this tool. I want to use visualpriors in my project, and I am considering extending it with new features. However, I can't find the source code for the package. The only repository I found is https://github.com/alexsax/visual-prior, but it contains only checkpoint files. Is the source hosted somewhere else?
Thanks

screen command failed "No screen session found."

Hi,
In Step 3 (Start visualization services) of Experiments in Habitat, running screen -S visdom_server -p 0 -X stuff "visdom^M"
returns the error message "No screen session found."
May I ask how to work around this problem?
Thank you

size mismatch for decoder_output.0.weight and decoder_output.0.bias

Thanks for this great piece of work! When I change the template code from 'normal' to 'segment_unsup2d' like this:

subprocess.call("curl -O https://raw.githubusercontent.com/StanfordVL/taskonomy/master/taskbank/assets/test.png", shell=True)
image = Image.open('test.png')
x = TF.to_tensor(TF.resize(image, 256)) * 2 - 1
x = x.unsqueeze_(0)
representation = visualpriors.representation_transform(x, 'segment_unsup2d', device='cpu')
pred = visualpriors.feature_readout(x, 'segment_unsup2d', device='cpu')
a = TF.to_pil_image(pred[0] / 2. + 0.5)
plt.imshow(a)

I get the following error, which is caused by the feature_readout function:

RuntimeError: Error(s) in loading state_dict for TaskonomyDecoder:
  size mismatch for decoder_output.0.weight: copying a param with shape torch.Size([64, 16, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 16, 3, 3]).
  size mismatch for decoder_output.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([128]).

Let me know if I'm doing something wrong.

Pretrained checkpoint

Hi,

The paper mentions that pre-trained models are available.

I'm wondering where we can find the pre-trained checkpoints for the agents, such as the one using the Max-Coverage Min-Distance Featureset?

Feature readout does not work with semantic networks

I get the following images when running feature_readout

[Images omitted: the input image, followed by feature_readout outputs for normal, segment_unsup2d, segment_semantic, and segment_unsup25d.]

Does anything special need to be done to render the semantic representations?

Variable not defined

Hello, I was trying out and inspecting the code of visualpriors, and I think that in the max_coverage_featureset_transform function you forgot to change the name of the second argument passed to the function at line 45. feature_tasks is not defined, and I think you meant to pass k instead.

def max_coverage_featureset_transform(img, k=4, device=default_device):
    '''
    Transforms an RGB image into a features driven by some vision tasks.
    The tasks are chosen according to the Max-Coverage Min-Distance Featureset
    From the paper:
        Mid-Level Visual Representations Improve Generalization and Sample Efficiency
        for Learning Visuomotor Policies.
        Alexander Sax, Bradley Emi, Amir R. Zamir, Silvio Savarese, Leonidas Guibas, Jitendra Malik.
        Arxiv preprint 2018.
    This function expects inputs:
        shape (batch_size, 3, 256, 256)
        values [-1,1]
    Outputs:
        shape (batch_size, 8*k, 16, 16)
    '''
    return VisualPrior.max_coverage_transform(img, feature_tasks, device)
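In other words, I believe the last line was meant to read something like:

    return VisualPrior.max_coverage_transform(img, k, device)  # assuming the classmethod accepts the featureset size k here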
