
realfusion's Introduction

RealFusion: 360° Reconstruction of Any Object from a Single Image

CVPR 2023

arXiv | Conference

Overview

Code Overview

This repository is based on the wonderful stable-dreamfusion repo from ashawkey and the diffusers library from HuggingFace.

I substantially refactored the repository for the public release, so some parts of it are different from the original code used for the paper. These new changes should be improvements rather than degradations. If anything is broken due to the refactor, let me know and I will fix it.

If you have any questions or contributions, feel free to open an issue or submit a pull request.

Abstract

We consider the problem of reconstructing a full 360° photographic model of an object from a single image of it. We do so by fitting a neural radiance field to the image, but find this problem to be severely ill-posed. We thus take an off-the-shelf conditional image generator based on diffusion and engineer a prompt that encourages it to "dream up" novel views of the object. Using an approach inspired by DreamFields and DreamFusion, we fuse the given input view, the conditional prior, and other regularizers in a final, consistent reconstruction. We demonstrate state-of-the-art reconstruction results on benchmark images when compared to prior methods for monocular 3D reconstruction of objects. Qualitatively, our reconstructions provide a faithful match of the input view and a plausible extrapolation of its appearance and 3D shape, including to the side of the object not visible in the image.

Examples

(Figure: example 360° reconstructions)

Method

(Figure: method overview diagram)

A Quick Note

The method works well on some scenes and not on others. You cannot expect it to work well on every image, especially without tuning the hyperparameters (of which there are many). The failure modes are described and shown in the paper (Figure 11). Some scenes simply do not yield solid objects, while others have strange geometries. The Janus problem (i.e. multiple faces) is common when dealing with images that contain faces. We are working on future versions of the method to improve its robustness, and this repository will be updated as we find and implement these improvements!

Running the code

Dependencies

Dependencies may be installed with pip:

# We try to keep requirements light
pip install -r requirements.txt

# (Recommended) Build cuda extensions -- alternatively, they can be built on-the-fly
pip install ./raymarching
pip install ./shencoder
pip install ./freqencoder
pip install ./gridencoder

# (Optional) Install nvdiffrast for exporting a textured mesh (if using --save_mesh)
pip install git+https://github.com/NVlabs/nvdiffrast/

PyTorch and torchvision are not included in requirements.txt because including them sometimes breaks conda installations by re-installing PyTorch via pip. I assume you have already installed these yourself.
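If you still need them, one common way to install a CUDA build is shown below. The exact command depends on your CUDA version and platform, so check pytorch.org for the one matching your setup:

# Example only: installs PyTorch + torchvision built against CUDA 11.8.
# Adjust (or drop) the index URL to match your CUDA version.
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118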

Data

The method takes as input a square image and an object mask. These are compactly represented as an RGBA image.

If you have an image without a mask and you would like to extract the salient object, you can use the provided helper script:

python scripts/extract-mask.py --image_path ${IMAGE_PATH} --output_dir ${OUTPUT_DIR}

You can also use the script to easily combine an existing image and mask into an RGBA image:

python scripts/extract-mask.py --image_path ${IMAGE_PATH} --mask_path ${MASK_PATH} --output_dir ${OUTPUT_DIR}
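For reference, the expected RGBA format is simply the RGB image with the object mask stored in the alpha channel. Below is a minimal sketch of that combination using PIL; it only illustrates the format (the file names are placeholders) and is not a substitute for the helper script above:

from PIL import Image

image = Image.open("image.png").convert("RGB")  # square RGB photo
mask = Image.open("mask.png").convert("L")      # white = object, black = background
assert image.size == mask.size, "image and mask must have the same resolution"

rgba = image.copy()
rgba.putalpha(mask)  # the mask becomes the alpha channel
rgba.save("rgba.png")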

There are examples in the examples folder:

examples
└── natural-images
    ├── banana_1
    │   ├── (there are other files here but they are not necessary)
    │   ├── learned_embeds.bin (from textual inversion; explained below)
    │   └── rgba.png
    └── cactus_1
        └── rgba.png

Textual Inversion

The first step of our method is running single-image textual inversion to obtain an embedding <e> which represents your object.

For this, you can use our code in the textual-inversion subdirectory. This is based on the code in diffusers here, except that it also adds heavy data augmentation. I'm planning on upstreaming this change to the diffusers library in the near future, so this script won't be necessary for much longer.

For example:

# Run this from within the textual-inversion directory

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATA_DIR="path-to-dir-containing-your-image"
export OUTPUT_DIR="path-to-desired-output-dir"

python textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="_cat_statue_" \
  --initializer_token="cat" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir=$OUTPUT_DIR \
  --use_augmentations

The script will save a model checkpoint and a learned_embeds.bin file to your output directory. For the next step (reconstruction), you can either use the model checkpoint directly or you can use the learned_embeds.bin with your original $MODEL_NAME.

Note that learned_embeds.bin is a file containing a dictionary with the learned embedding. For example, it looks like this when loaded in a Python session:

>>> import torch
>>> x = torch.load("learned_embeds.bin")
>>> print(type(x))
<class 'dict'>
>>> print(x)
{'_my_placeholder_token_': <tensor of size [512] or [768], depending on your Stable Diffusion model>}
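For reference, this is roughly how such an embedding gets injected into the Stable Diffusion text encoder. The repository does this for you when you pass --learned_embeds_path (see sd/utils.py), so the snippet below is only an illustrative sketch; the variable names and file paths are assumptions:

import torch
from transformers import CLIPTextModel, CLIPTokenizer

model_name = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder")

learned_embeds = torch.load("learned_embeds.bin", map_location="cpu")
for token, embedding in learned_embeds.items():
    # Register the placeholder token and grow the embedding matrix by one row.
    tokenizer.add_tokens(token)
    text_encoder.resize_token_embeddings(len(tokenizer))
    # Copy the learned vector into the newly added row.
    token_id = tokenizer.convert_tokens_to_ids(token)
    text_encoder.get_input_embeddings().weight.data[token_id] = embedding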

The only annoying thing about this textual inversion procedure is that it is slow: a full training run takes around 1 hour on a V100 GPU. In the future, I would like to replace it with something like ELITE or encoder-based tuning, which should reduce the time this step takes down to seconds.

Side note: Textual Inversion Initialization

Regarding initialization, if you would like to use CLIP to automatically find a good initializer token for textual inversion, you can use our provided script. For this you'll need to install python-fire and nltk with pip install fire nltk. Then you can run:

# Run this from within the textual-inversion directory

# First, compute and save embeddings for all noun tokens in the CLIP tokenizer vocabulary. This only has
# to be done once and it should take about 2 minutes on a V100 GPU. It saves a file (around 30MB) with 
# the embeddings, which is loaded when you call get_initialization.
python autoinit.py save_embeddings

# This will print the top 5 tokens to the terminal and save the top token to a file 
# named token_autoinit.txt in the same directory as your image.
python autoinit.py get_initialization /path/to/your/image.jpg

In most cases it is easy to come up with an initialization token yourself, but we include this script because it makes the process fully automatic.
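For intuition, the retrieval underneath boils down to comparing a CLIP image embedding of your photo against CLIP text embeddings of candidate nouns and keeping the closest matches. The snippet below is a rough sketch of that idea with the Hugging Face CLIP model; it is not the repository's autoinit.py, and the model name and noun list are only illustrative:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

candidate_nouns = ["cat", "statue", "teapot", "cake"]  # in practice: all nouns in the tokenizer vocabulary
image = Image.open("image.jpg")

inputs = processor(text=candidate_nouns, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-to-text similarity scores; take the best-matching noun.
scores = outputs.logits_per_image[0]
best = candidate_nouns[scores.argmax().item()]
print("suggested initializer token:", best)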

Reconstruction

TL;DR: You can run it with python main.py --O

Here we optimize our NeRF to match the provided image using a combination of a standard reconstruction objective and a score distillation sampling objective.

There are a large number of arguments that can be tuned for your specific image. These options are contained in nerf/options.py. The most important options are those related to (1) the camera pose of the input image, (2) the losses, and (3) GPU/CUDA acceleration.

For the camera pose, the first key parameter is the pose angle:

pose_angle: float = 75  # camera angle below the vertical

This corresponds to the angle down from the vertical. For example, see the following diagram.

The default value of 75 degrees corresponds to 15 degrees above the horizontal plane. If your object is viewed from a more top-down ("vertical") angle, you should lower this value. If your image is viewed from straight-on, you can make it 90. If your image is viewed from below (which is quite unusual), then you can make it negative.

The second key parameters are the camera radii, which control how far the cameras are from the origin.

radius_range: tuple[float, float] = [1.0, 1.5]  # training camera radius range
radius_rot: Optional[float] = 1.8  # None  # circle radius for vis

It usually works well to set radius_rot a little larger than the maximum of radius_range. If your object appears small/large in your input image, it might make sense to make radius_rot bigger/smaller, respectively.
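For instance, if your photo was taken roughly straight-on and you want a slightly wider visualization orbit, a run might look like the sketch below. This assumes (as with the flags used elsewhere in this README) that the fields in nerf/options.py are exposed as command-line arguments of the same name, and it uses the <token> shortcut described under Extra tips below; adjust the values for your image:

# Hypothetical example: straight-on viewpoint, slightly larger visualization radius.
python main.py --O \
    --image_path examples/natural-images/banana_1/rgba.png \
    --learned_embeds_path examples/natural-images/banana_1/learned_embeds.bin \
    --text "A high-resolution DSLR image of a <token>" \
    --pose_angle 90 \
    --radius_rot 2.0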

Apart from this, there are some training options you might want to tweak. Notable options include:

iters: int = 5000  # training iters
albedo_iters: int = 1000  # training iters that only use albedo shading
HW_synthetic: int = 96  # render size for synthetic images
grid_levels_mask: int = 8  # the number of levels in the feature grid to mask (to disable use 0)
grid_levels_mask_iters: int = 3_000  # the number of iterations for feature grid masking (to disable use 1_000_000)
optim: Literal['adan', 'adam', 'adamw'] = 'adamw'  # optimizer

For the losses, we have the options below; a small sketch of the AnnealedValue scheduling semantics follows the block:

# An AnnealedValue is a helper type for a value that may be annealed over the course of 
# training. It can either be a single value fixed for all of training, or a list of 
# [start_value, end_value] which is annealed linearly over all training iterations, or a 
# list of [start_value, end_value, end_iters, fn_name] which reaches its end_value at 
# end_iters and may use either linear or log annealing.
AnnealedValue = list[float]

lambda_prior: AnnealedValue = [1.0]  # loss scale for diffusion model
lambda_image: AnnealedValue = [5.0]  # loss scale for real image
lambda_mask: AnnealedValue = [0.5]  # loss scale for real mask
lambda_entropy: AnnealedValue = [1e-4]  # loss scale for alpha entropy
lambda_opacity: AnnealedValue = [0]  # loss scale for alpha value
lambda_orient: AnnealedValue = [1e-2]  # loss scale for orientation
lambda_smooth: AnnealedValue = [0]  # loss scale for surface smoothness
lambda_smooth_2d: AnnealedValue = [0.5]  # loss scale for surface smoothness (2d version)
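The AnnealedValue comment above specifies three forms: a constant, a [start, end] pair annealed linearly over all of training, and a [start, end, end_iters, fn_name] schedule that reaches its end value at end_iters. The snippet below is a minimal sketch of how such a value could be interpreted; it is not the repository's actual implementation, and the function name is made up:

import math

def annealed_value(spec, step, total_steps):
    """spec is [v], [start, end], or [start, end, end_iters, fn_name]."""
    if len(spec) == 1:                      # constant for all of training
        return spec[0]
    start, end = spec[0], spec[1]
    end_iters = spec[2] if len(spec) > 2 else total_steps
    fn_name = spec[3] if len(spec) > 3 else "linear"
    t = min(step / end_iters, 1.0)          # progress in [0, 1]
    if fn_name == "log":                    # interpolate in log space
        return math.exp(math.log(start) + t * (math.log(end) - math.log(start)))
    return start + t * (end - start)        # linear interpolation

# Example: a weight annealed from 5.0 down to 1.0 over the first 2000 iterations.
print(annealed_value([5.0, 1.0, 2000, "linear"], step=1000, total_steps=5000))  # -> 3.0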

For CUDA acceleration, there are options including --fp16 and --cuda_ray. It is recommended to use --O, which sets several of these options automatically.

fp16: bool = False  # use fp16
cuda_ray: bool = True  # use CUDA raymarching instead of pytorch

During development and testing, I always used --O. Currently, I have not implemented raymarching without --O. If you need this feature, leave an issue and I can try to address it.

Now that you have a sense of the parameters, here are some example commands.

Examples

export TOKEN="_cake_2_"  # set this according to your textual inversion placeholder_token or use the trick below
export DATA_DIR=$PWD/examples/natural-images/cake_2

python main.py --O \
    --image_path $DATA_DIR/rgba.png \
    --learned_embeds_path $DATA_DIR/learned_embeds.bin \
    --text "A high-resolution DSLR image of a $TOKEN" \
    --pretrained_model_name_or_path "runwayml/stable-diffusion-v1-5"

export TOKEN="_cat_statue_"  # set this according to your textual inversion placeholder_token
export DATA_DIR=$PWD/examples/natural-images/cat_statue

python main.py --O \
    --image_path $DATA_DIR/rgba.png \
    --learned_embeds_path $DATA_DIR/learned_embeds.bin \
    --text "A high-resolution DSLR image of a $TOKEN" \
    --pretrained_model_name_or_path "runwayml/stable-diffusion-v1-5"

export TOKEN="_colorful_teapot_"  # set this according to your textual inversion placeholder_token
export DATA_DIR=$PWD/examples/natural-images/colorful_teapot_1  # adjust if your teapot example directory is named differently

python main.py --O \
    --image_path $DATA_DIR/rgba.png \
    --learned_embeds_path $DATA_DIR/learned_embeds.bin \
    --text "A high-resolution DSLR image of a $TOKEN" \
    --pretrained_model_name_or_path "runwayml/stable-diffusion-v1-5"

Extra tips

If you are using --learned_embeds_path, then you can use <token> in your prompt and this will automatically be replaced by your learned token. For example:

export DATA_DIR=$PWD/examples/natural-images/cake_2

python main.py --O \
    --image_path $DATA_DIR/rgba.png \
    --learned_embeds_path $DATA_DIR/learned_embeds.bin \
    --text "A high-resolution DSLR image of a <token>"

To run multiple jobs in parallel on a SLURM cluster, you can use a script such as:

python scripts/example-slurm.py

Pretrained checkpoints

You can download our full checkpoints and logs for the example images using

bash ./scripts/download-example-logs-and-checkpoints.sh

Note that these are all run with the same (default) parameters; to be precise, they are produced by simply running the example SLURM script above. Some are better than others, and they can absolutely be improved if you (1) tweak parameters on a per-example basis, and (2) run multiple random seeds and choose one of the better generations. You can see an example of a common failure case (two-headed generations; the Janus problem) in the teddy_bear_1 example.

Further Improvements

I intend to continue supporting and improving this repository while working toward a second version of the method. Planned improvements include:

  • Normals predicted with an MLP as in Magic3D
  • Use a feedforward inversion method (e.g. ELITE or encoder-based tuning) or possibly even an unCLIP-style model (example).
  • Add a vanilla (pure-PyTorch) backend for people without CUDA-enabled GPUs
  • Create a Colab notebook

Tips for Researchers

I have some tips in mind. I'll add them as soon as I get time.

Contribution

Pull requests are welcome!

Acknowledgement

  • The wonderful work by ashawkey:

    @misc{stable-dreamfusion,
        Author = {Jiaxiang Tang},
        Year = {2022},
        Note = {https://github.com/ashawkey/stable-dreamfusion},
        Title = {Stable-dreamfusion: Text-to-3D with Stable-diffusion}
    }
    
  • The original DreamFusion paper: DreamFusion: Text-to-3D using 2D Diffusion.

    @article{poole2022dreamfusion,
        author = {Poole, Ben and Jain, Ajay and Barron, Jonathan T. and Mildenhall, Ben},
        title = {DreamFusion: Text-to-3D using 2D Diffusion},
        journal = {arXiv},
        year = {2022},
    }
    
  • The Stable Diffusion model

    @misc{rombach2021highresolution,
        title={High-Resolution Image Synthesis with Latent Diffusion Models}, 
        author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
        year={2021},
        eprint={2112.10752},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
    }
    
  • The diffusers library.

    @misc{von-platen-etal-2022-diffusers,
        author = {Patrick von Platen and Suraj Patil and Anton Lozhkov and Pedro Cuenca and Nathan Lambert and Kashif Rasul and Mishig Davaadorj and Thomas Wolf},
        title = {Diffusers: State-of-the-art diffusion models},
        year = {2022},
        publisher = {GitHub},
        journal = {GitHub repository},
        howpublished = {\url{https://github.com/huggingface/diffusers}}
    }
    
  • Our funding: Luke Melas-Kyriazi is supported by the Rhodes Trust. Andrea Vedaldi, Iro Laina, and Christian Rupprecht are supported by ERC-UNION-CoG-101001212. Christian Rupprecht is also supported by VisualAI EP/T028572/1.

Citation

@inproceedings{melaskyriazi2023realfusion,
  author = {Melas-Kyriazi, Luke and Rupprecht, Christian and Laina, Iro and Vedaldi, Andrea},
  title = {RealFusion: 360 Reconstruction of Any Object from a Single Image},
  booktitle = {CVPR},
  year = {2023},
  url = {https://arxiv.org/abs/2302.10663},
}


realfusion's Issues

No modules named mcubes

Description

When we run python main.py --0, we get the error No module named 'mcubes'. We then try pip install mcubes to install this package, but there is no matching distribution found for mcubes. We also searched for the package on PyPI, but we cannot find it. How can we install this package?

Steps to Reproduce

Run python main.py --0 and get the error No module named 'mcubes'.

Expected Behavior

Expected to be able to install the mcubes package.

Environment

Ubuntu 18.04, CUDA 10.2

Package version

Hi @lukemelas, thanks for releasing your great work!

Could you please release the versions of the packages you are using as well (e.g. a direct export of your Python environment)? I am trying out your code, but there are some random issues. For example, stable_diffusion_model.text_encoder now gives a tuple of strings instead of the CLIP text model (looks like a version issue).

janus problem

Thank you for your work!
I found that there is a Janus problem in the result of the "teddy bear" example: the textual inversion over-fits to the front view and fails to produce a correct rear view, which leads to the Janus problem.
Will this phenomenon also appear in the official results?

df_ep0100_rgb.mp4

My command is as follows:

export MODEL_NAME="/home/litaiqing/.cache/huggingface/hub/models--runwayml--stable-diffusion-v1-5/snapshots/aa9ba505e1973ae5cd05f5aedd345178f52f8e6a"
export DATA_DIR="/media/ssd_1/litaiqing/realfusion-main/examples/natural-images/teddy_bear_1"
export OUTPUT_DIR="/media/ssd_1/litaiqing/realfusion-main/examples/natural-images/teddy_bear_1"

CUDA_VISIBLE_DEVICES=7 python textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="_teddy_bear_" \
  --initializer_token="teddy " \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir=$OUTPUT_DIR \
  --use_augmentations

export DATA_DIR=/media/ssd_1/litaiqing/realfusion-main/examples/natural-images/teddy_bear_1

CUDA_VISIBLE_DEVICES=7 python main.py --O \
    --image_path $DATA_DIR/rgba.png \
    --learned_embeds_path $DATA_DIR/learned_embeds.bin \
    --text "a  _teddy_bear_" \
    --pretrained_model_name_or_path "/home/litaiqing/.cache/huggingface/hub/models--runwayml--stable-diffusion-v1-5/snapshots/aa9ba505e1973ae5cd05f5aedd345178f52f8e6a"

TypeError AnnealedValue = list[float]

Description

Running python main.py --0 gives a TypeError:

  File "main.py", line 9, in <module>
    from nerf.provider_image import NeRFDataset as ImageOnlyNeRFDataset
  File "/data/ruihan/projects/realfusion/nerf/provider_image.py", line 13, in <module>
    from .options import Options
  File "/data/ruihan/projects/realfusion/nerf/options.py", line 13, in <module>
    AnnealedValue = list[float]
TypeError: 'type' object is not subscriptable

Steps to Reproduce

python main.py --0

Expected Behavior

Run the code.

Environment

Ubuntu 20.04, cudatoolkit 11.3.1, pytorch 1.11.0, transformers 4.28.1, diffusers 0.15.1

How to train on a custom dataset?

I have a .vtp 3D mesh model and 2D rendered images from different views. Is it possible to train RealFusion on such data?
Thank you :-)

lovely_tensors

Could you please help me out here? What is lovely_tensors? I have never heard of it before, and what is it for?

Saving mesh not working

Description

Hi @lukemelas, thanks for open-sourcing your great work! While reproducing your examples (cat_statue specifically), I noticed that the --save_mesh option does not work as expected at test time. Here's the output:

[INFO] Trainer: df | 2023-04-10_14-25-15 | cuda | fp16 | outputs/default/2023-04-10--13-49-56--seed-101/
[INFO] num parameters: 1_806_983
[INFO] num parameters w/ grad: 1_806_983
[INFO] Loading latest checkpoint ...
[INFO] Latest checkpoint is outputs/default/2023-04-10--13-49-56--seed-101/checkpoints/df.pth
[INFO] loaded model.
[INFO] load at epoch 50, global step 5000
==> Start Test, save results to outputs/default/2023-04-10--13-49-56--seed-101/results
100% 100/100 [00:05<00:00, 18.63it/s]rgb
opacity
depth
/home/tongwang/workspace/realfusion/nerf/trainer.py:590: RuntimeWarning: invalid value encountered in cast
preds_np = (preds_tensor.detach().cpu().numpy() * 255).astype(np.uint8)
normals
textureless
grid
==> Finished Test.
100% 100/100 [00:06<00:00, 15.85it/s]
==> Saving mesh to outputs/default/2023-04-10--13-49-56--seed-101/mesh
==> Finished saving mesh.

The log says "==> Saving mesh", but it did not actually save the mesh. Could you please look into this issue? Thanks in advance.

Steps to Reproduce

python main.py --workspace $model_path --O --test --save_mesh

Expected Behavior

save a textured mesh

Environment

ubuntu 20.04, torch1.13+cu116

Code release

Hi authors, thank you very much for your great work! It is very appealing to me. When will you release your code?

Textual Inversion code giving error

Description

Hi,
@lukemelas, great work. I have wanted something like this for a while. Your model's accuracy is better than that of earlier 2D-to-3D models.

I am running all my code on Google Colab (free version). I am following the README; however, I encountered the following error at the textual inversion step. I had to edit a few lines to try to make it run, but to no avail. @lukemelas or anyone, could you kindly help me set up the code?

I am uploading 2 screenshots for reference.

Thank you
Screenshot (32)
Screenshot (33)

Steps to Reproduce

.

Expected Behavior

I expected the given code to run as per readme document.

Environment

Google Colab, Python 3.10

Import Error: cannot import name 'narrow_tensor_by_index' from 'torch.distributed._shard._utils'

Description

Thanks for your excellent work, but I get an ImportError when I run main.py. Details of the error are as follows:

Traceback (most recent call last):
  File "/root/autodl-tmp/realfusion-main/main.py", line 12, in <module>
    from nerf.trainer import Trainer
  File "/root/autodl-tmp/realfusion-main/nerf/trainer.py", line 26, in <module>
    from sd.sd import StableDiffusion
  File "/root/autodl-tmp/realfusion-main/sd/__init__.py", line 1, in <module>
    from .sd import StableDiffusion
  File "/root/autodl-tmp/realfusion-main/sd/sd.py", line 4, in <module>
    from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
  File "/root/miniconda3/envs/realfusion/lib/python3.9/site-packages/diffusers/__init__.py", line 3, in <module>
    from .configuration_utils import ConfigMixin
  File "/root/miniconda3/envs/realfusion/lib/python3.9/site-packages/diffusers/configuration_utils.py", line 34, in <module>
    from .utils import (
  File "/root/miniconda3/envs/realfusion/lib/python3.9/site-packages/diffusers/utils/__init__.py", line 21, in <module>
    from .accelerate_utils import apply_forward_hook
  File "/root/miniconda3/envs/realfusion/lib/python3.9/site-packages/diffusers/utils/accelerate_utils.py", line 24, in <module>
    import accelerate
  File "/root/miniconda3/envs/realfusion/lib/python3.9/site-packages/accelerate/__init__.py", line 3, in <module>
    from .accelerator import Accelerator
  File "/root/miniconda3/envs/realfusion/lib/python3.9/site-packages/accelerate/accelerator.py", line 35, in <module>
    from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
  File "/root/miniconda3/envs/realfusion/lib/python3.9/site-packages/accelerate/checkpointing.py", line 24, in <module>
    from .utils import (
  File "/root/miniconda3/envs/realfusion/lib/python3.9/site-packages/accelerate/utils/__init__.py", line 152, in <module>
    from .fsdp_utils import load_fsdp_model, load_fsdp_optimizer, save_fsdp_model, save_fsdp_optimizer
  File "/root/miniconda3/envs/realfusion/lib/python3.9/site-packages/accelerate/utils/fsdp_utils.py", line 25, in <module>
    import torch.distributed.checkpoint as dist_cp
  File "/root/miniconda3/envs/realfusion/lib/python3.9/site-packages/torch/distributed/checkpoint/__init__.py", line 7, in <module>
    from .state_dict_loader import load_state_dict
  File "/root/miniconda3/envs/realfusion/lib/python3.9/site-packages/torch/distributed/checkpoint/state_dict_loader.py", line 10, in <module>
    from .default_planner import DefaultLoadPlanner
  File "/root/miniconda3/envs/realfusion/lib/python3.9/site-packages/torch/distributed/checkpoint/default_planner.py", line 13, in <module>
    from torch.distributed._shard._utils import narrow_tensor_by_index
ImportError: cannot import name 'narrow_tensor_by_index' from 'torch.distributed._shard._utils' (/root/miniconda3/envs/realfusion/lib/python3.9/site-packages/torch/distributed/_shard/_utils.py)

I find that _utils.py does not have a function named narrow_tensor_by_index, but it does have def narrow_tensor(tensor: torch.Tensor, metadata: ShardMetadata).

Steps to Reproduce

I run the main.py as python main.py --O --image_path examples/natural-images/bird_2/rgba.png --learned_embeds_path examples/natural-images/bird_2/learned_embeds.bin --text "A high-resolution DSLR image of a bird" --pretrained_model_name_or_path "runwayml/stable-diffusion-v1-5"

Expected Behavior

I want to know if I have installed the wrong version of pytorch or something else.

Environment

Ubuntu 18.04, PyTorch 1.12.1, CUDA 11.3

Inquiry about evaluation and dataset

I am very interested in your research and have some questions about your paper.

1. It seems that the evaluation code is missing from this repository. Are you planning to release it?
2. Were the quantitative results reported in the paper measured on images that include the background? Also, could you please provide information on the shading method used in the experiments?
3. Could you provide information on which 21 images were used for performance measurement in each of the 7 categories mentioned in the paper? If I have missed that part in the code, it would be very helpful if you could point me to it.

Thanks in advance

'tuple' object has no attribute 'get_input_embeddings'

Description

When I run the command from the examples:
python3 main.py --O --image_path $DATA_DIR/rgba.png --learned_embeds_path $DATA_DIR/learned_embeds.bin --text "A high-resolution DSLR image of a $TOKEN" --pretrained_model_name_or_path "runwayml/stable-diffusion-v1-5"
it always fails with:
'lr': 0.001,
'lr_warmup': False,
'max_ray_batch': 4096,
'max_steps': 512,
'min_lr': 1e-06,
'min_near': 0.1,
'negative': '',
'noise_real_camera': 0.001,
'noise_real_camera_annealing': True,
'num_rays': 4096,
'num_steps': 64,
'optim': 'adamw',
'pose_angle': 75,
'pretrained_model_image_size': 512,
'pretrained_model_name_or_path': 'runwayml/stable-diffusion-v1-5',
'radius_range': (1.0, 1.5),
'radius_rot': 1.8,
'real_every': 1,
'real_iters': 0,
'replace_synthetic_camera_every': 10,
'replace_synthetic_camera_noise': 0.02,
'run_name': 'default',
'save_mesh': False,
'save_test_name': 'df_test',
'seed': 101,
'suppress_face': None,
'test': False,
'test_on_real_data': False,
'text': 'A high-resolution DSLR image of a cake_2',
'uniform_sphere_rate': 0.5,
'update_extra_interval': 16,
'upsample_steps': 32,
'wandb': False,
'warm_iters': 2000,
'workspace': 'outputs/default/2023-05-16--12-57-00--seed-101'}
Grid encoder level 0 has resolution 16 and params 4920
Grid encoder level 1 has resolution 22 and params 12168
Grid encoder level 2 has resolution 30 and params 29792
Grid encoder level 3 has resolution 40 and params 65536
Grid encoder level 4 has resolution 55 and params 65536
Grid encoder level 5 has resolution 74 and params 65536
Grid encoder level 6 has resolution 100 and params 65536
Grid encoder level 7 has resolution 135 and params 65536
Grid encoder level 8 has resolution 183 and params 65536
Grid encoder level 9 has resolution 248 and params 65536
Grid encoder level 10 has resolution 336 and params 65536
Grid encoder level 11 has resolution 455 and params 65536
Grid encoder level 12 has resolution 617 and params 65536
Grid encoder level 13 has resolution 836 and params 65536
Grid encoder level 14 has resolution 1134 and params 65536
Grid encoder level 15 has resolution 1536 and params 65536
NeRFNetwork(
(encoder): GridEncoder: input_dim=3 num_levels=16 level_dim=2 resolution=16 -> 1536 per_level_scale=1.3557 params=(898848, 2) gridtype=tiled align_corners=False interpolation=linear
(sigma_net): MLP(
(net): ModuleList(
(0): Linear(in_features=32, out_features=64, bias=True)
(1): Linear(in_features=64, out_features=64, bias=True)
(2): Linear(in_features=64, out_features=4, bias=True)
)
)
(encoder_bg): FreqEncoder: input_dim=3 degree=6 output_dim=39
(bg_net): MLP(
(net): ModuleList(
(0): Linear(in_features=39, out_features=64, bias=True)
(1): Linear(in_features=64, out_features=3, bias=True)
)
)
)
/home/hhn/.local/lib/python3.8/site-packages/diffusers/configuration_utils.py:135: FutureWarning: Accessing config attribute unet directly via 'StableDiffusionModel' object attribute is deprecated. Please access 'unet' over 'StableDiffusionModel's config object instead, e.g. 'scheduler.config.unet'.
deprecate("direct config name access", "1.0.0", deprecation_message, standard_warn=False)
/home/hhn/.local/lib/python3.8/site-packages/diffusers/configuration_utils.py:135: FutureWarning: Accessing config attribute text_encoder directly via 'StableDiffusionModel' object attribute is deprecated. Please access 'text_encoder' over 'StableDiffusionModel's config object instead, e.g. 'scheduler.config.text_encoder'.
deprecate("direct config name access", "1.0.0", deprecation_message, standard_warn=False)
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/hhn/realfusion/main.py:164 in <module> │
│ │
│ 161 │
│ 162 │
│ 163 if __name__ == '__main__': │
│ ❱ 164 │ main() │
│ 165 │
│ │
│ /home/hhn/realfusion/main.py:103 in main │
│ │
│ 100 │ │ stable_diffusion_model = StableDiffusionModel.from_pretrained(opt.pretrained_mod │
│ 101 │ │ # import pdb;pdb.set_trace() │
│ 102 │ │ if opt.learned_embeds_path is not None: # add textual inversion tokens to model │
│ ❱ 103 │ │ │ add_tokens_to_model_from_path( │
│ 104 │ │ │ │ opt.learned_embeds_path, stable_diffusion_model.text_encoder, stable_dif │
│ 105 │ │ │ ) │
│ 106 │ │ guidance = StableDiffusion(stable_diffusion_model=stable_diffusion_model, device │
│ │
│ /home/hhn/realfusion/sd/utils.py:40 in add_tokens_to_model_from_path │
│ │
│ 37 │ │ tokenizer: CLIPTokenizer, override_token: Optional[Union[str, dict]] = None) -> │
│ 38 │ r"""Loads tokens from a file and adds them to the tokenizer and text encoder of a mo │
│ 39 │ learned_embeds: Mapping[str, Tensor] = torch.load(learned_embeds_path, map_location= │
│ ❱ 40 │ add_tokens_to_model(learned_embeds, text_encoder, tokenizer, override_token) │
│ 41 │
│ │
│ /home/hhn/realfusion/sd/utils.py:15 in add_tokens_to_model │
│ │
│ 12 │ # Loop over learned embeddings │
│ 13 │ new_tokens = [] │
│ 14 │ for token, embedding in learned_embeds.items(): │
│ ❱ 15 │ │ embedding = embedding.to(text_encoder.get_input_embeddings().weight.dtype) │
│ 16 │ │ if override_token is not None: │
│ 17 │ │ │ token = override_token if isinstance(override_token, str) else override_toke │
│ 18 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'tuple' object has no attribute 'get_input_embeddings'

Steps to Reproduce

As the examples show, the command is:

export TOKEN="_cake_2_"  # set this according to your textual inversion placeholder_token or use the trick below
export DATA_DIR=$PWD/examples/natural-images/cake_2

python main.py --O \
    --image_path $DATA_DIR/rgba.png \
    --learned_embeds_path $DATA_DIR/learned_embeds.bin \
    --text "A high-resolution DSLR image of a $TOKEN" \
    --pretrained_model_name_or_path "runwayml/stable-diffusion-v1-5"

Expected Behavior

Maybe I missed some key operation?

Environment

Ubuntu 18.04, torch 2.0.0, CUDA 12.0

Loss description

Thanks for your great effort.

I have some questions about the losses.

  1. Is the image loss on page 6 the reconstruction loss in the reference view (the input image)?

  2. Is loss(rec, mask) different from loss(mask)?

  3. loss(rec, mask) is the L2 distance between O and M, where O is computed from the neural field and is real-valued; is it then compared directly against the binary 0/1 mask M?

Thank you.
