
D3Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation


Yixuan Wang1*, Zhuoran Li2, 3*, Mingtong Zhang1, Katherine Driggs-Campbell1, Jiajun Wu2, Li Fei-Fei2, Yunzhu Li1, 2

1University of Illinois Urbana-Champaign, 2Stanford University, 3National University of Singapore

[Teaser video: teaser_capcut.mp4]

Try it in Colab!

In this notebook, we show how to build D3Fields and visualize the reconstructed mesh, mask fields, and descriptor fields. We also demonstrate how to track keypoints in a video.

Installation

We recommend Mambaforge instead of the standard Anaconda distribution for faster installation:

# create conda environment
mamba env create -f env.yaml
conda activate d3fields

# download pretrained models
bash scripts/download_ckpts.sh
bash scripts/download_data.sh

Visualization

python vis_repr.py # visualize the representation
python vis_tracking.py # visualize the tracking

Code Explanation

Fusion is the core class of D3Fields. It contains the following key functions (a usage sketch follows the list):

  • update: takes in an observation and updates the internal state.
  • text_queries_for_inst_mask: queries the instance mask according to the text query and thresholds.
  • text_queries_for_inst_mask_no_track: similar to text_queries_for_inst_mask, but does not invoke the underlying XMem tracking module.
  • eval: evaluates the associated features for arbitrary 3D points.
  • batch_eval: evaluates a large set of points batch by batch to avoid out-of-memory errors.

The important attributes of Fusion are:

  • curr_obs_torch: a dictionary containing the following keys:
    • color: multiview color images as np.uint8 BGR numpy arrays
    • color_tensor: multiview color images as float32 BGR torch tensors
    • depth: multiview depth images as float32 torch tensors, in meters
    • mask: multiview instance masks as uint8 torch tensors of shape (V, H, W, num_inst)
    • consensus_mask_label: mask labels aggregated from all views, as a list of strings
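
A minimal usage sketch of this workflow is below. The constructor arguments, observation keys, and method signatures here are assumptions for illustration, not the repo's exact API; see vis_repr.py for the real calling code.

# Hypothetical Fusion workflow; all argument names below are assumptions
# modeled on the attribute list above -- consult vis_repr.py for the real API.
import numpy as np
import torch
from fusion import Fusion  # assumed import path

V, H, W = 4, 480, 640
fusion = Fusion(num_cam=V, device='cuda')  # hypothetical constructor

# hypothetical observation dict shaped after the curr_obs_torch keys above
obs = {
    'color': np.zeros((V, H, W, 3), dtype=np.uint8),  # multiview BGR images
    'depth': np.zeros((V, H, W), dtype=np.float32),   # depth in meters
    'pose': np.tile(np.eye(4), (V, 1, 1)),            # world-to-camera extrinsics
    'K': np.tile(np.eye(3), (V, 1, 1)),               # pinhole intrinsics
}
fusion.update(obs)

# query instance masks for a text prompt (signature is an assumption)
fusion.text_queries_for_inst_mask(['mug'], [0.3])

# evaluate descriptor features at arbitrary world-frame 3D points,
# batched internally to avoid running out of GPU memory
pts = torch.rand(100000, 3, device='cuda')
out = fusion.batch_eval(pts)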

Customized Dataset

To run D3Fields on your own dataset, follow these steps:

  1. Prepare the dataset in the following structure:
dataset_name
├── camera_0
│   ├── color
│   │   ├── 0.png
│   │   ├── 1.png
│   │   └── ...
│   ├── depth
│   │   ├── 0.png
│   │   ├── 1.png
│   │   └── ...
│   ├── camera_extrinsics.npy
│   └── camera_params.npy
├── camera_1
└── ...

camera_extrinsics.npy and camera_params.npy are defined as follows:

  • camera_extrinsics.npy: (4, 4) numpy array, the camera extrinsics, which transform a point from world coordinates to camera coordinates
  • camera_params.npy: (4,) numpy array, the camera intrinsics in the order fx, fy, cx, cy
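
A minimal sketch of consuming these files under the conventions above; the paths follow the directory structure, while the query point itself is illustrative:

# Sketch: project a world-frame point into camera_0 using the pinhole model
# and the world-to-camera convention defined above.
import numpy as np

extrinsics = np.load('dataset_name/camera_0/camera_extrinsics.npy')  # (4, 4)
fx, fy, cx, cy = np.load('dataset_name/camera_0/camera_params.npy')  # (4,)

p_world = np.array([0.1, 0.0, 0.3, 1.0])  # homogeneous point in world coordinates
p_cam = extrinsics @ p_world              # world -> camera
u = fx * p_cam[0] / p_cam[2] + cx         # pinhole projection to pixels
v = fy * p_cam[1] / p_cam[2] + cy
print(f'pixel: ({u:.1f}, {v:.1f}), depth: {p_cam[2]:.3f} m')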
  2. Prepare the PCA pickle file for the query texts. Find four images of the query texts (e.g. mug) with clean backgrounds and centered objects, change obj_type within scripts/prepare_pca.py, and run it (a sketch of the PCA fitting appears after this list).
  3. Specify the workspace boundary as x_lower, x_upper, y_lower, y_upper, z_lower, z_upper.
  4. Run python vis_repr_custom.py, e.g. python vis_repr_custom.py --data_path data/2023-09-15-13-21-56-171587 --pca_path pca_model/mug.pkl --query_texts mug --query_thresholds 0.3 --x_lower -0.4 --x_upper 0.4 --y_upper 0.3 --y_lower -0.4 --z_upper 0.02 --z_lower -0.2
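
For step 2, the sketch below shows the kind of PCA fitting scripts/prepare_pca.py performs; the DINOv2 loading and the placeholder inputs are assumptions, and the script itself is authoritative:

# Hypothetical sketch of fitting a PCA over DINOv2 patch features for a query
# object and saving it; scripts/prepare_pca.py is the real version.
import pickle
import torch
from sklearn.decomposition import PCA

dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2.eval()

# placeholder inputs: replace with four real images of the query object,
# normalized and resized so H and W are multiples of the 14-pixel patch size
imgs = [torch.rand(1, 3, 224, 224) for _ in range(4)]

feats = []
with torch.no_grad():
    for img in imgs:
        patch_tokens = dinov2.forward_features(img)['x_norm_patchtokens']  # (1, N, C)
        feats.append(patch_tokens.reshape(-1, patch_tokens.shape[-1]))

pca = PCA(n_components=3)                   # 3 components map to RGB for display
pca.fit(torch.cat(feats).numpy())

with open('pca_model/mug.pkl', 'wb') as f:  # path matches the example command above
    pickle.dump(pca, f)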

Tips for debugging:

  • Make sure the transformation is correct by visualizing the point cloud (pcd) within vis_repr_custom.py using Open3D (see the sketch below).
  • If the GPU runs out of memory, run vis_repr_custom.py with a smaller step, which generates a sparser voxel grid.
  • Make sure Grounded SAM outputs reasonable results by checking curr_obs_torch['mask'] and curr_obs_torch['consensus_mask_label'] of the Fusion class.
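
For the first tip, here is a minimal Open3D sketch; the points and colors are placeholders for the fused arrays produced inside vis_repr_custom.py:

# Sketch: render the fused point cloud together with the world axes; if the
# extrinsics are correct, the cloud should sit inside the workspace boundary.
import numpy as np
import open3d as o3d

pts = np.random.rand(1000, 3)     # placeholder: the fused (N, 3) points
colors = np.random.rand(1000, 3)  # placeholder: per-point RGB in [0, 1]

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(pts)
pcd.colors = o3d.utility.Vector3dVector(colors)

frame = o3d.geometry.TriangleMesh.create_coordinate_frame(size=0.1)  # world frame
o3d.visualization.draw_geometries([pcd, frame])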

Citation

If you find this repo useful for your research, please consider citing the paper:

@article{wang2023d3fields,
    title={D$^3$Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation},
    author={Wang, Yixuan and Li, Zhuoran and Zhang, Mingtong and Driggs-Campbell, Katherine and Wu, Jiajun and Fei-Fei, Li and Li, Yunzhu},
    journal={arXiv preprint arXiv:2309.16118},
    year={2023}
}

Contributors

wangyixuan12, ywang-bdai


Issues

Question about the shape of the semantic feature map

Hello! The DINOv2 feature map has size (patch_h, patch_w), but the mask image has size (H, W). In the interpolation section of the paper they are written with the same size (both (H, W)). How is this handled in the code?
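
(One common way to reconcile the two resolutions, sketched below, is to bilinearly upsample the patch features to the image resolution before combining them with the mask; this is an illustration, not necessarily the repo's exact code.)

# Sketch: upsample a (patch_h, patch_w) DINOv2 feature map to the (H, W)
# mask resolution with bilinear interpolation; all shapes are illustrative.
import torch
import torch.nn.functional as F

H, W = 480, 640
patch_h, patch_w = H // 14, W // 14            # ViT-14 patch grid (assumption)
feats = torch.rand(1, 1024, patch_h, patch_w)  # (B, C, patch_h, patch_w)

feats_hw = F.interpolate(feats, size=(H, W), mode='bilinear', align_corners=False)
print(feats_hw.shape)  # torch.Size([1, 1024, 480, 640])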

About Truncation of Distance

Hello, I found that Eq. 5 of your paper says $d$ should be truncated to $[-\mu, \mu]$. But it seems this would cause $w_i$ in Eq. 6 to always be one, and in your code I don't see any truncation in eval_dist. It confuses me a little.

Confusion about optimization

Thank you for your excellent work! I am confused about one part: is it feasible to optimize without a dynamics model? And can the cost function be interpreted as the pixel difference between keypoints?

Confusion about keypoint tracking

Thank you for your excellent work! I am confused about one part: why do you need to track keypoints from time t to t+1? Isn't the representation of the scene directly obtainable from the observation at time t+1? Looking forward to your answer.

Info about visualization tool

Hi! Really nice work; I particularly like the idea of driving manipulation with AI-generated images.

As described in the DINOv2 repo (facebookresearch/dinov2#23), when we have a set of images of the same class and want to visualize similar parts with the same color, we need to put them in the same batch and perform PCA on the flattened (N x f) tensor, where f is the feature size.

So to visualize the goal image with the same color pattern as the point cloud, did you need to perform PCA on the goal image together with the other images, or did you manage to achieve this directly (and how)?

Thanks in advance for your time!
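
(The joint-PCA approach described in this issue can be sketched as follows, with illustrative shapes: fit one PCA on the concatenated features of all images, then project each image with the same model so the colors are comparable.)

# Sketch: fit a single PCA over features from several images so PCA colors
# are comparable across them; the feature arrays here are placeholders.
import numpy as np
from sklearn.decomposition import PCA

feat_maps = [np.random.rand(34 * 45, 1024) for _ in range(4)]  # per-image (N_i, f)

pca = PCA(n_components=3)
pca.fit(np.concatenate(feat_maps, axis=0))  # one shared basis for all images

for fm in feat_maps:
    rgb = pca.transform(fm)                                      # (N_i, 3)
    rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0) + 1e-8)  # scale to [0, 1]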
