
neuralangelo's Introduction

Neuralangelo

This is the official implementation of Neuralangelo: High-Fidelity Neural Surface Reconstruction.

Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H. Taylor, Mathias Unberath, Ming-Yu Liu, Chen-Hsuan Lin
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

The code is built upon the Imaginaire library from the Deep Imagination Research Group at NVIDIA.
For business inquiries, please submit the NVIDIA research licensing form.


Installation

We offer two ways to set up the environment:

  1. We provide prebuilt Docker images, where

    • docker.io/chenhsuanlin/colmap:3.8 is for running COLMAP and the data preprocessing scripts. This includes the prebuilt COLMAP library (CUDA-supported).
    • docker.io/chenhsuanlin/neuralangelo:23.04-py3 is for running the main Neuralangelo pipeline.

    The corresponding Dockerfiles can be found in the docker directory.

  2. The conda environment for Neuralangelo: install the dependencies and activate the environment neuralangelo with

    conda env create --file neuralangelo.yaml
    conda activate neuralangelo

For COLMAP, alternative installation options are also available on the COLMAP website.
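
As a reference only, pulling and entering the prebuilt Neuralangelo image might look like the following sketch (the repository mount path is illustrative; the --ipc and --ulimit flags mirror the recommendation printed by the container at startup):

docker pull docker.io/chenhsuanlin/neuralangelo:23.04-py3
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /path/to/neuralangelo:/workspace/neuralangelo \
    -it docker.io/chenhsuanlin/neuralangelo:23.04-py3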


Data preparation

Please refer to Data Preparation for step-by-step instructions.
We assume known camera poses for each extracted frame from the video. The code uses the same json format as Instant NGP.
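
As a quick sanity check (assuming the preprocessing scripts have written an Instant NGP-style transforms.json under the dense folder, as in the toy example; the path below is illustrative), something like this can confirm that the pose file exists and lists all frames:

python3 -c "
import json
meta = json.load(open('datasets/toy_example_skip24/dense/transforms.json'))  # path assumed
print('number of frames:', len(meta['frames']))
print('keys of first frame:', sorted(meta['frames'][0].keys()))
"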


Run Neuralangelo!

EXPERIMENT=toy_example
GROUP=example_group
NAME=example_name
CONFIG=projects/neuralangelo/configs/custom/${EXPERIMENT}.yaml
GPUS=1  # use >1 for multi-GPU training!
torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar

Some useful notes (a combined example follows this list):

  • This codebase supports logging with Weights & Biases. You will need a W&B account for this.
    • Add --wandb to the command line to enable W&B logging.
    • Add --wandb_name to specify the W&B project name.
    • More detailed control can be found in the init_wandb() function in imaginaire/trainers/base.py.
  • Configs can be overridden through the command line (e.g. --optim.params.lr=1e-2).
  • Set --checkpoint={CHECKPOINT_PATH} to initialize with a certain checkpoint; set --resume to resume training.
  • If appearance embeddings are enabled, make sure data.num_images is set to the number of training images.
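
For instance, a run that enables W&B logging, overrides the learning rate, and resumes from the latest checkpoint could combine the flags above as follows (the W&B project name and learning-rate value are illustrative):

torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar \
    --wandb \
    --wandb_name=neuralangelo-example \
    --resume \
    --optim.params.lr=1e-2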

Isosurface extraction

Use the following command to run isosurface mesh extraction:

CHECKPOINT=logs/${GROUP}/${NAME}/xxx.pt
OUTPUT_MESH=xxx.ply
CONFIG=logs/${GROUP}/${NAME}/config.yaml
RESOLUTION=2048
BLOCK_RES=128
GPUS=1  # use >1 for multi-GPU mesh extraction
torchrun --nproc_per_node=${GPUS} projects/neuralangelo/scripts/extract_mesh.py \
    --config=${CONFIG} \
    --checkpoint=${CHECKPOINT} \
    --output_file=${OUTPUT_MESH} \
    --resolution=${RESOLUTION} \
    --block_res=${BLOCK_RES}

Some useful notes (a quick check of the extracted mesh follows this list):

  • Add --textured to extract meshes with textures.
  • Add --keep_lcc to keep only the largest connected component; this removes noise but may also remove thin structures.
  • Lower BLOCK_RES to reduce GPU memory usage.
  • Lower RESOLUTION to reduce mesh size.
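
As a quick sanity check of the extracted mesh, something like the following should work if trimesh is available in your environment (a hedged sketch; 'xxx.ply' is the same file as OUTPUT_MESH above):

python3 -c "
import trimesh  # assumed to be installed in your environment
mesh = trimesh.load('xxx.ply')  # same file as OUTPUT_MESH above
print('vertices:', len(mesh.vertices), 'faces:', len(mesh.faces))
"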

Frequently asked questions (FAQ)

  1. Q: CUDA out of memory. How do I decrease the memory footprint?
    A: Neuralangelo requires at least 24GB GPU memory with our default configuration. If you run out of memory, consider adjusting the following hyperparameters under model.object.sdf.encoding.hashgrid (with suggested values):

    GPU VRAM   Suggested hyperparameters
    8GB        dict_size=20, dim=4
    12GB       dict_size=21, dim=4
    16GB       dict_size=21, dim=8

    Please note that the above hyperparameter adjustments may sacrifice reconstruction quality. A command-line sketch applying these settings is given after this FAQ.

    If Neuralangelo trains fine but runs out of CUDA memory during evaluation, consider adjusting the evaluation parameters under data.val: set a smaller image_size (e.g., a maximum resolution of 200x200) and set batch_size=1, subset=1.

  2. Q: The reconstruction of my custom dataset is bad. What can I do?
    A: It is worth looking into the following:

    • The camera poses recovered by COLMAP may be off. We have implemented tools (using Blender or Jupyter notebook) to inspect the COLMAP results.
    • The computed bounding regions may be off and/or too small/large. Please refer to data preprocessing on how to adjust the bounding regions manually.
    • The video capture may contain significant motion blur or out-of-focus frames. A higher shutter speed (reducing motion blur) and a smaller aperture (increasing the depth of field) are very helpful.
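
As a hedged sketch, the memory-related adjustments from FAQ 1 can also be passed as command-line overrides instead of editing the config (the dot paths follow the config dump printed at startup; the values below are the 12GB suggestion, and the evaluation image_size can be lowered in the config file itself):

torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar \
    --model.object.sdf.encoding.hashgrid.dict_size=21 \
    --model.object.sdf.encoding.hashgrid.dim=4 \
    --data.val.batch_size=1 \
    --data.val.subset=1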

Citation

If you find our code useful for your research, please cite

@inproceedings{li2023neuralangelo,
  title={Neuralangelo: High-Fidelity Neural Surface Reconstruction},
  author={Li, Zhaoshuo and M\"uller, Thomas and Evans, Alex and Taylor, Russell H and Unberath, Mathias and Liu, Ming-Yu and Lin, Chen-Hsuan},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition ({CVPR})},
  year={2023}
}

neuralangelo's People

Contributors

chenhsuanlin, mli0603


neuralangelo's Issues

Exporting mesh with texture

Hi, I was able to run the training and export the .ply file. I wanted to ask whether it is possible to export a 3D model with texture using Neuralangelo.
Thanks

Isosurface extraction error (ubuntu18.04)

Here are the errors I ran into; how can I solve them?

torchrun --nproc_per_node=2 projects/neuralangelo/scripts/extract_mesh.py --logdir=logs/example_group_dtu24/dtu_24 --config=projects/neuralangelo/configs/dtu.yaml --checkpoint=logs/example_group_dtu24/dtu_24/epoch_01600_iteration_000040000_checkpoint.pt --output_file=dtu_24_40000.ply --resolution=2048 --block_res=128
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Traceback (most recent call last):
  File "projects/neuralangelo/scripts/extract_mesh.py", line 25, in <module>
    from projects.neuralangelo.utils.mesh import extract_mesh  # noqa: E402
  File "/home/ubuntu/Desktop/modi/neuralangelo-main/projects/neuralangelo/utils/mesh.py", line 15, in <module>
    import mcubes
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/mcubes/__init__.py", line 4, in <module>
    from .smoothing import smooth, smooth_constrained, smooth_gaussian
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/mcubes/smoothing.py", line 11, in <module>
    from scipy import sparse
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/scipy/__init__.py", line 200, in __getattr__
    return _importlib.import_module(f'scipy.{name}')
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/scipy/sparse/__init__.py", line 283, in <module>
    from . import csgraph
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/scipy/sparse/csgraph/__init__.py", line 185, in <module>
    from ._laplacian import laplacian
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/scipy/sparse/csgraph/_laplacian.py", line 7, in <module>
    from scipy.sparse.linalg import LinearOperator
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/scipy/sparse/linalg/__init__.py", line 120, in <module>
    from ._isolve import *
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/scipy/sparse/linalg/_isolve/__init__.py", line 6, in <module>
    from .lgmres import lgmres
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/scipy/sparse/linalg/_isolve/lgmres.py", line 7, in <module>
    from scipy.linalg import get_blas_funcs
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/scipy/linalg/__init__.py", line 209, in <module>
    from ._matfuncs import *
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/scipy/linalg/_matfuncs.py", line 19, in <module>
    from ._matfuncs_sqrtm import sqrtm
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/scipy/linalg/_matfuncs_sqrtm.py", line 24, in <module>
    from ._matfuncs_sqrtm_triu import within_block_loop
ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/scipy/linalg/_matfuncs_sqrtm_triu.cpython-38-x86_64-linux-gnu.so)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5037) of binary: /home/ubuntu/anaconda3/envs/nelo/bin/python3.8
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/nelo/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/nelo/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

projects/neuralangelo/scripts/extract_mesh.py FAILED

Failures:
[1]:
time : 2023-08-17_11:15:59
host : ubuntu
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 5038)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-08-17_11:15:59
host : ubuntu
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5037)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

`RuntimeError: cannot reshape tensor of 0 elements into shape [2, 512, 0, -1] because the unspecified dimension size -1 can be any value and is ambiguous`

root@beast2:/opt/data# torchrun --nproc_per_node=${GPUS} train.py     --logdir=logs/${GROUP}/${NAME}     --config=${CONFIG}     --show_pbar
Training with 1 GPUs.
Using random seed 0
Make folder logs/example_group/example_name
* checkpoint:
   * save_epoch: 9999999999
   * save_iter: 20000
   * save_latest_iter: 9999999999
   * save_period: 9999999999
   * strict_resume: True
* cudnn:
   * benchmark: True
   * deterministic: False
* data:
   * name: dummy
   * num_images: 117
   * num_workers: 4
   * preload: True
   * readjust:
      * center: [0.0, 0.0, 0.0]
      * scale: 1.0
   * root: IMG_6088_skip12/dense
   * train:
      * batch_size: 2
      * image_size: [1079, 1923]
      * subset: None
   * type: projects.neuralangelo.data
   * use_multi_epoch_loader: True
   * val:
      * batch_size: 2
      * image_size: [300, 534]
      * max_viz_samples: 16
      * subset: 4
* image_save_iter: 9999999999
* inference_args:
* local_rank: 0
* logdir: logs/example_group/example_name
* logging_iter: 9999999999999
* max_epoch: 9999999999
* max_iter: 500000
* metrics_epoch: None
* metrics_iter: None
* model:
   * appear_embed:
      * dim: 8
      * enabled: True
   * background:
      * encoding:
         * levels: 10
         * type: fourier
      * encoding_view:
         * levels: 3
         * type: spherical
      * mlp:
         * activ: relu
         * activ_density: softplus
         * activ_density_params:
         * activ_params:
         * hidden_dim: 256
         * hidden_dim_rgb: 128
         * num_layers: 8
         * num_layers_rgb: 2
         * skip: [4]
         * skip_rgb: []
      * view_dep: True
      * white: False
   * object:
      * rgb:
         * encoding_view:
            * levels: 3
            * type: spherical
         * mlp:
            * activ: relu_
            * activ_params:
            * hidden_dim: 256
            * num_layers: 4
            * skip: []
            * weight_norm: True
         * mode: idr
      * s_var:
         * anneal_end: 0.1
         * init_val: 3.0
      * sdf:
         * encoding:
            * coarse2fine:
               * enabled: True
               * init_active_level: 8
               * step: 5000
            * hashgrid:
               * dict_size: 22
               * dim: 8
               * max_logres: 11
               * min_logres: 5
               * range: [-2, 2]
            * levels: 16
            * type: hashgrid
         * gradient:
            * mode: numerical
            * taps: 4
         * mlp:
            * activ: softplus
            * activ_params:
               * beta: 100
            * geometric_init: True
            * hidden_dim: 256
            * inside_out: True
            * num_layers: 2
            * out_bias: 0.5
            * skip: []
            * weight_norm: True
   * render:
      * num_sample_hierarchy: 4
      * num_samples:
         * background: 0
         * coarse: 64
         * fine: 16
      * rand_rays: 512
      * stratified: True
   * type: projects.neuralangelo.model
* nvtx_profile: False
* optim:
   * fused_opt: False
   * params:
      * lr: 0.001
      * weight_decay: 0.001
   * sched:
      * gamma: 10.0
      * iteration_mode: True
      * step_size: 9999999999
      * two_steps: [300000, 400000]
      * type: two_steps_with_warmup
      * warm_up_end: 5000
   * type: AdamW
* pretrained_weight: None
* source_filename: projects/neuralangelo/configs/custom/toy_example.yaml
* speed_benchmark: False
* test_data:
   * name: dummy
   * num_workers: 0
   * test:
      * batch_size: 1
      * is_lmdb: False
      * roots: None
   * type: imaginaire.datasets.images
* timeout_period: 9999999
* trainer:
   * amp_config:
      * backoff_factor: 0.5
      * enabled: False
      * growth_factor: 2.0
      * growth_interval: 2000
      * init_scale: 65536.0
   * ddp_config:
      * find_unused_parameters: False
      * static_graph: True
   * depth_vis_scale: 0.5
   * ema_config:
      * beta: 0.9999
      * enabled: False
      * load_ema_checkpoint: False
      * start_iteration: 0
   * grad_accum_iter: 1
   * image_to_tensorboard: False
   * init:
      * gain: None
      * type: none
   * loss_weight:
      * curvature: 0.0005
      * eikonal: 0.1
      * render: 1.0
   * type: projects.neuralangelo.trainer
* validation_iter: 5000
* wandb_image_iter: 10000
* wandb_scalar_iter: 100
cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
model parameter count: 366,707,676
Initialize model weights using type: none, gain: None
Using random seed 0
Allow TensorFloat32 operations on supported devices
Train dataset length: 117                                                                                                       
Val dataset length: 4                                                                                                           
Training from scratch.
Initialize wandb
Traceback (most recent call last):                                                                                              
  File "train.py", line 104, in <module>
    main()
  File "train.py", line 93, in main
    trainer.train(cfg,
  File "/opt/data/projects/neuralangelo/trainer.py", line 106, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
  File "/opt/data/projects/nerf/trainers/base.py", line 109, in train
    data_all = self.test(self.eval_data_loader, mode="val", show_pbar=show_pbar)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/data/projects/nerf/trainers/base.py", line 138, in test
    output = model.inference(data)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/data/projects/neuralangelo/model.py", line 70, in inference
    output = self.render_image(data["pose"], data["intr"], image_size=self.image_size_val,
  File "/opt/data/projects/neuralangelo/model.py", line 96, in render_image
    output_batch = self.render_rays(center, ray_unit, sample_idx=sample_idx, stratified=stratified)
  File "/opt/data/projects/neuralangelo/model.py", line 122, in render_rays
    output_background = self.render_rays_background(center, ray_unit, far, app_outside, stratified=stratified)
  File "/opt/data/projects/neuralangelo/model.py", line 185, in render_rays_background
    rgbs, densities = self.background_nerf.forward(points, rays_unit, app_outside)  # [B,R,N,3]
  File "/opt/data/projects/neuralangelo/utils/modules.py", line 271, in forward
    points_enc = self.encode(points_3D)  # [...,4+LD]
  File "/opt/data/projects/neuralangelo/utils/modules.py", line 297, in encode
    points_enc = nerf_util.positional_encoding(points, num_freq_bases=self.cfg_background.encoding.levels)
  File "/opt/data/projects/nerf/utils/nerf_util.py", line 146, in positional_encoding
    input_enc = input_enc.view(*input.shape[:-1], -1)  # [B,...,2NL].
RuntimeError: cannot reshape tensor of 0 elements into shape [2, 512, 0, -1] because the unspecified dimension size -1 can be any value and is ambiguous
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 796) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-12_08:09:34
  host      : beast2
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 796)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@beast2:/opt/data# 

Number of epochs

Hi everyone, I'm training with the following command from the README file:

EXPERIMENT=toy_example
GROUP=example_group
NAME=example_name
CONFIG=projects/neuralangelo/configs/custom/${EXPERIMENT}.yaml
GPUS=1  # use >1 for multi-GPU training!
torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar

It's currently running and is at epoch 1200. I read the config files and there's a max_epoch parameter, which is set to 9999999999. I wanted to ask what the normal and optimal number of epochs is, and whether I should change anything in the code before running the command?
Thanks

Wrong mesh shape after extraction

Hi, when I try to extract the mesh from the checkpoint, the mesh shape looks strange, even though the logs from wandb look fine; the result is as follows. I wonder how to deal with this. Thanks.

(neuralangelo) vrlab@k8s-master-38:~/wangph1/workspace/neuralangelo$ CUDA_VISIBLE_DEVICES=1 torchrun --master_port 29501 --nproc_per_node=${GPUS} projects/neuralangelo/scripts/extract_mesh.py --logdir=logs/${GROUP}/${NAME} --config=${CONFIG} --checkpoint=${CHECKPOINT} --output_file=${OUTPUT_MESH} --resolution=${RESOLUTION} --block_res=${BLOCK_RES}
Running mesh extraction with 1 GPUs.
Make folder logs/custom/museum_new
Setup trainer.
Using random seed 0
model parameter count: 366,708,076
Initialize model weights using type: none, gain: None
Using random seed 0
Allow TensorFloat32 operations on supported devices
Loading checkpoint (local): logs/custom/museum_new/epoch_02253_iteration_000160000_checkpoint.pt
- Loading the model...
Done with loading the checkpoint.
Extracting surface at resolution 2048 863 1955
vertices: 17603135                                                                                                                                                                                                            
faces: 34649378

(Screenshots attached: the extracted mesh, plus the wandb validation visualizations of normals and rendered RGB.)

ModuleNotFoundError: No module named 'projects.neuralangelo'

I have an error with 'projects.neuralangelo'. I tried to run the toy_example: I prepared the data and created the yaml file, but I receive the error ModuleNotFoundError: No module named 'projects.neuralangelo'.

I use this structure

EXPERIMENT=toy_example
GROUP=toy_group
NAME=toy
CONFIG=projects/neuralangelo/configs/custom/${EXPERIMENT}.yaml
GPUS=1 # use >1 for multi-GPU training!
torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar

ERROR -------

ModuleNotFoundError: No module named 'projects.neuralangelo'

(Screenshots attached: the YAML config, the toy_example data, and the full error.)

Thanks!!

How do you use the Docker image under Windows?

When starting the docker image, I get

2023-08-19 22:07:25 
2023-08-19 22:07:25 =============
2023-08-19 22:07:25 == PyTorch ==
2023-08-19 22:07:25 =============
2023-08-19 22:07:25 
2023-08-19 22:07:25 NVIDIA Release 23.04 (build 58180998)
2023-08-19 22:07:25 PyTorch Version 2.1.0a0+fe05266
2023-08-19 22:07:25 
2023-08-19 22:07:25 Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2023-08-19 22:07:25 
2023-08-19 22:07:25 Copyright (c) 2014-2023 Facebook Inc.
2023-08-19 22:07:25 Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
2023-08-19 22:07:25 Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
2023-08-19 22:07:25 Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
2023-08-19 22:07:25 Copyright (c) 2011-2013 NYU                      (Clement Farabet)
2023-08-19 22:07:25 Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
2023-08-19 22:07:25 Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
2023-08-19 22:07:25 Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
2023-08-19 22:07:25 Copyright (c) 2015      Google Inc.
2023-08-19 22:07:25 Copyright (c) 2015      Yangqing Jia
2023-08-19 22:07:25 Copyright (c) 2013-2016 The Caffe contributors
2023-08-19 22:07:25 All rights reserved.
2023-08-19 22:07:25 
2023-08-19 22:07:25 Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
2023-08-19 22:07:25 
2023-08-19 22:07:25 This container image and its contents are governed by the NVIDIA Deep Learning Container License.
2023-08-19 22:07:25 By pulling and using the container, you accept the terms and conditions of this license:
2023-08-19 22:07:25 https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
2023-08-19 22:07:25 
2023-08-19 22:07:25 WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
2023-08-19 22:07:25    Use the NVIDIA Container Toolkit to start this container with GPU support; see
2023-08-19 22:07:25    https://docs.nvidia.com/datacenter/cloud-native/ .
2023-08-19 22:07:25 
2023-08-19 22:07:25 NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
2023-08-19 22:07:25    insufficient for PyTorch.  NVIDIA recommends the use of the following flags:
2023-08-19 22:07:25    docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
2023-08-19 22:07:25

And that's it, end of execution.

How do you use it?

Data preparation stuck at the json creation

After going through the data preparation with the example .mov and the example guide, I correctly generated the following folder structure:
PATH_TO_IMAGES
|__ database.db (COLMAP database)
|__ raw_images (raw input images)
|__ dense
|____ images (undistorted images)
|____ sparse (COLMAP correspondences, intrinsics and sparse point cloud)
|____ stereo (COLMAP files for MVS)

The images and sparse folders contain the expected image files and .bin files respectively, but every folder inside stereo is empty. I did not receive any error during the process, so I tried to continue anyway.

When I then tried to run:
"PATH_TO_IMAGES=toy_example_skip30
SCENE_TYPE=object # {outdoor,indoor,object}
python3 projects/neuralangelo/scripts/convert_data_to_json.py --data_dir ${PATH_TO_IMAGES}/dense --scene_type ${SCENE_TYPE}"

it runs without any errors or log output; it just never stops and never logs anything out.

Any help in understanding what may be causing this? I ran everything else normally as described in the repo guide:
https://github.com/NVlabs/neuralangelo/blob/main/DATA_PROCESSING.md

How to use checkpoint correctly

After training, I got the xxx.pt checkpoint at 160000 iterations. When I use the script to extract the mesh, the mesh is extracted correctly.
But when I use
"torchrun --nproc_per_node=4 train.py --logdir=logs/object_new --config configs/object.yaml --show_pbar --wandb --checkpoint xxx.pt"
to continue training, the vis result before the first epoch is totally different from the previous training at 160000 iterations (much worse than before). What should I do?

Hyperparameters for High Quality object reconstruction

Hello, thank you so much for releasing the source code. I was trying out the toy dataset mentioned in the repository. I trained it for 100K iterations and extracted the isosurface mesh, but the results don't look great. I have the following questions:

  1. Could you tell me what hyperparameters would give me a very good reconstruction of it?
  2. How long am I supposed to train it for?
  3. Is there a way to speed up the training process?

Multiple GPU training is slow.

Dear author, thanks for looking at this question.
When I trained toy_example on a server with eight 4090 GPUs, I found that the training speed was not very fast, similar to single-card training. It takes more than 100 hours to complete 500,000 iterations with a single card. I understand that eight-card distributed training should take only a bit more than ten hours to complete, but the actual speed is not much faster. What is the reason for this?

Request for config of Analytical Gradient variant

Hi @chenhsuanlin, @mli0603!

Congratulations on the amazing work and results! Could you please point me to the config for the Analytical Gradient variant of the method? I am trying to reproduce the results in Table 1 of the paper but cannot find any config for the method variant labeled "AG" and "AG + P".

When I switched the gradient mode here, I found that the curvature loss exploded and I started getting NaN loss values after about 81k steps.

Thanks in advance!

data preparation error

I use WSL2 with Ubuntu 22.04 and CUDA 11.8, and I created the environment with conda. When I run

EXPERIMENT_NAME=toy_example
PATH_TO_VIDEO=toy_example.MOV
SKIP_FRAME_RATE=24
SCENE_TYPE=object  # {outdoor,indoor,object}
bash projects/neuralangelo/scripts/preprocess.sh ${EXPERIMENT_NAME} ${PATH_TO_VIDEO} ${SKIP_FRAME_RATE} ${SCENE_TYPE}

it reports these errors:

[swscaler @ 0x5628e034efc0] deprecated pixel format used, make sure you did set range correctly
Output #0, image2, to 'datasets/toy_example_skip24/raw_images/%06d.jpg':
Metadata:
major_brand : qt
minor_version : 512
compatible_brands: qt
encoder : Lavf58.76.100
Stream #0:0: Video: mjpeg, yuvj420p(pc, progressive), 800x800 [SAR 1:1 DAR 1:1], q=2-31, 200 kb/s, 25 fps, 25 tbn (default)
Metadata:
handler_name : VideoHandler
vendor_id : FFMP
encoder : Lavc58.134.100 mjpeg
Side data:
cpb: bitrate max/min/avg: 0/0/200000 buffer size: 0 vbv_delay: N/A
frame= 9 fps=0.0 q=2.0 Lsize=N/A time=00:00:07.72 bitrate=N/A speed=42.5x
video:649kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
qt.qpa.xcb: could not connect to display
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: eglfs, linuxfb, minimal, minimalegl, offscreen, vnc, xcb.

*** Aborted at 1692614036 (unix time) try "date -d @1692614036" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGABRT (@0x272f) received by PID 10031 (TID 0x7f0d05dbd080) from PID 10031; stack trace: ***
@ 0x7f0d0b968046 (unknown)
@ 0x7f0d09c10520 (unknown)
@ 0x7f0d09c64a7c pthread_kill
@ 0x7f0d09c10476 raise
@ 0x7f0d09bf67f3 abort
@ 0x7f0d0a205ba3 QMessageLogger::fatal()
@ 0x7f0d0a80c713 QGuiApplicationPrivate::createPlatformIntegration()
@ 0x7f0d0a80cc08 QGuiApplicationPrivate::createEventDispatcher()
@ 0x7f0d0a435b17 QCoreApplicationPrivate::init()
@ 0x7f0d0a80fb70 QGuiApplicationPrivate::init()
@ 0x7f0d0af23ced QApplicationPrivate::init()
@ 0x55f19813a3dd colmap::RunFeatureExtractor()
@ 0x55f19812c499 main
@ 0x7f0d09bf7d90 (unknown)
@ 0x7f0d09bf7e40 __libc_start_main
@ 0x55f19812f3e5 _start
projects/neuralangelo/scripts/run_colmap.sh: line 22: 10031 Aborted colmap feature_extractor --database_path ${1}/database.db --image_path ${1}/raw_images --ImageReader.camera_model=RADIAL --SiftExtraction.use_gpu=true --SiftExtraction.num_threads=32 --ImageReader.single_camera=true
qt.qpa.xcb: could not connect to display
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: eglfs, linuxfb, minimal, minimalegl, offscreen, vnc, xcb.

*** Aborted at 1692614036 (unix time) try "date -d @1692614036" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGABRT (@0x2730) received by PID 10032 (TID 0x7f243bbcc080) from PID 10032; stack trace: ***
@ 0x7f2441777046 (unknown)
@ 0x7f243fa1f520 (unknown)
@ 0x7f243fa73a7c pthread_kill
@ 0x7f243fa1f476 raise
@ 0x7f243fa057f3 abort
@ 0x7f2440014ba3 QMessageLogger::fatal()
@ 0x7f244061b713 QGuiApplicationPrivate::createPlatformIntegration()
@ 0x7f244061bc08 QGuiApplicationPrivate::createEventDispatcher()
@ 0x7f2440244b17 QCoreApplicationPrivate::init()
@ 0x7f244061eb70 QGuiApplicationPrivate::init()
@ 0x7f2440d32ced QApplicationPrivate::init()
@ 0x560626178e76 colmap::RunSequentialMatcher()
@ 0x56062616d499 main
@ 0x7f243fa06d90 (unknown)
@ 0x7f243fa06e40 __libc_start_main
@ 0x5606261703e5 _start
projects/neuralangelo/scripts/run_colmap.sh: line 26: 10032 Aborted colmap sequential_matcher --database_path ${1}/database.db --SiftMatching.use_gpu=true

==============================================================================
Loading database

Loading cameras... 0 in 0.001s
Loading matches... 0 in 0.000s
Loading images... 0 in 0.001s (connected 0)
Building correspondence graph... in 0.000s (ignored 0)

Elapsed time: 0.000 [minutes]

WARNING: No images with matches found in the database.

ERROR: failed to create sparse model

==============================================================================
Reading reconstruction

F0821 18:33:57.171896 10035 reconstruction.cc:745] cameras, images, points3D files do not exist at datasets/toy_example_skip24/sparse/0
*** Check failure stack trace: ***
@ 0x7fc87b025b03 google::LogMessage::Fail()
@ 0x7fc87b02d9d1 google::LogMessage::SendToLog()
@ 0x7fc87b0257c2 google::LogMessage::Flush()
@ 0x7fc87b02778f google::LogMessageFatal::~LogMessageFatal()
@ 0x55843ac4e6f2 colmap::Reconstruction::Read()
@ 0x55843aba787a colmap::RunImageUndistorter()
@ 0x55843ab91499 main
@ 0x7fc8792c1d90 (unknown)
@ 0x7fc8792c1e40 __libc_start_main
@ 0x55843ab943e5 _start
projects/neuralangelo/scripts/run_colmap.sh: line 38: 10035 Aborted colmap image_undistorter --image_path ${1}/raw_images --input_path ${1}/sparse/0 --output_path ${1}/dense --output_type COLMAP --max_image_size 2000
Traceback (most recent call last):
  File "projects/neuralangelo/scripts/convert_data_to_json.py", line 199, in <module>
    auto_bound(args)
  File "projects/neuralangelo/scripts/convert_data_to_json.py", line 161, in auto_bound
    cameras, images, points3D = read_model(os.path.join(args.data_dir, "sparse"), ext=".bin")
  File "/mnt/f/xbruanProject/neuralangelo/third_party/colmap/scripts/python/read_write_model.py", line 435, in read_model
    cameras = read_cameras_binary(os.path.join(path, "cameras" + ext))
  File "/mnt/f/xbruanProject/neuralangelo/third_party/colmap/scripts/python/read_write_model.py", line 134, in read_cameras_binary
    with open(path_to_model_file, "rb") as fid:
FileNotFoundError: [Errno 2] No such file or directory: 'datasets/toy_example_skip24/dense/sparse/cameras.bin'
Traceback (most recent call last):
  File "projects/neuralangelo/scripts/generate_config.py", line 81, in <module>
    generate_config(args)
  File "projects/neuralangelo/scripts/generate_config.py", line 30, in generate_config
    num_images = len(os.listdir(os.path.join(args.data_dir, "images")))
FileNotFoundError: [Errno 2] No such file or directory: 'datasets/toy_example_skip24/dense/images'

neuralangelo docker run issue - WSL2 + Ubuntu 20.04 LTS

After completing the steps in #10,

I'm trying with WSL2 + Ubuntu 20.04 LTS + docker.
The log is below.

(neuralangelo) root@altava-farer:~/neuralangelo# nvidia-smi
Thu Aug 17 10:18:03 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.226.00   Driver Version: 536.67       CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  Off |
|  0%   36C    P8    32W / 450W |   2974MiB / 24564MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0        20      G   /Xwayland                                  N/A      |
|    0        33      G   /Xwayland                                  N/A      |
+-----------------------------------------------------------------------------+
(neuralangelo) root@altava-farer:~/neuralangelo#
(neuralangelo) root@altava-farer:~/neuralangelo# docker run --gpus all -it docker.io/chenhsuanlin/neuralangelo:23.04-py3
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/de790850947733812be2cb67e6dd791f79c546dfa8d87cd115ac2d82e2f352eb/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown.
ERRO[0000] error waiting for container: context canceled
(neuralangelo) root@altava-farer:~/neuralangelo#

But it works without "--gpus all".

(neuralangelo) root@altava-farer:~/neuralangelo# docker run -it docker.io/chenhsuanlin/neuralangelo:23.04-py3

=============
== PyTorch ==
=============

NVIDIA Release 23.04 (build 58180998)
PyTorch Version 2.1.0a0+fe05266

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for PyTorch.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

root@bbc348e95135:/workspace#

And I ran torchrun as shown below.

root@bbc348e95135:/workspace/neuralangelo# ll
total 92
drwxr-xr-x 9 root root 4096 Aug 17 01:23 ./
drwxrwxrwx 1 root root 4096 Aug 17 01:21 ../
drwxr-xr-x 8 root root 4096 Aug 17 01:21 .git/
-rw-r--r-- 1 root root 3497 Aug 17 01:21 .gitignore
-rw-r--r-- 1 root root  104 Aug 17 01:21 .gitmodules
-rw-r--r-- 1 root root  143 Aug 17 01:21 .pre-commit-config.yaml
-rw-r--r-- 1 root root 5246 Aug 17 01:21 DATA_PROCESSING.md
-rw-r--r-- 1 root root 4454 Aug 17 01:21 LICENSE.md
-rw-r--r-- 1 root root 4158 Aug 17 01:21 README.md
drwxr-xr-x 2 root root 4096 Aug 17 01:21 assets/
drwxr-xr-x 2 root root 4096 Aug 17 01:21 docker/
drwxr-xr-x 6 root root 4096 Aug 17 01:21 imaginaire/
-rw-r--r-- 1 root root  378 Aug 17 01:21 neuralangelo.yaml
drwxr-xr-x 4 root root 4096 Aug 17 01:21 projects/
-rw-r--r-- 1 root root  368 Aug 17 01:21 requirements.txt
drwxr-xr-x 3 root root 4096 Aug 17 01:21 third_party/
-rwxr-xr-x 1 root root  584 Aug 16 02:38 toy_example.yaml*
drwxr-xr-x 4 root root 4096 Aug 16 02:38 toy_example_skip24/
-rw-r--r-- 1 root root 4130 Aug 17 01:21 train.py
root@bbc348e95135:/workspace/neuralangelo#
root@bbc348e95135:/workspace/neuralangelo#
root@bbc348e95135:/workspace/neuralangelo# EXPERIMENT=toy_example
root@bbc348e95135:/workspace/neuralangelo# GROUP=example_group
root@bbc348e95135:/workspace/neuralangelo# NAME=example_name
root@bbc348e95135:/workspace/neuralangelo#
root@bbc348e95135:/workspace/neuralangelo# CONFIG=./toy_example.yaml
root@bbc348e95135:/workspace/neuralangelo# GPUS=1
root@bbc348e95135:/workspace/neuralangelo#
root@bbc348e95135:/workspace/neuralangelo# torchrun --nproc_per_node=${GPUS} train.py \
>     --logdir=logs/${GROUP}/${NAME} \
>     --config=${CONFIG} \
>     --show_pbar
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 1478, in _LoadNvmlLibrary
    nvmlLib = CDLL("libnvidia-ml.so.1")
  File "/usr/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file too short

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 20, in <module>
    from imaginaire.utils.gpu_affinity import set_affinity
  File "/workspace/neuralangelo/imaginaire/utils/gpu_affinity.py", line 22, in <module>
    pynvml.nvmlInit()
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 1450, in nvmlInit
    nvmlInitWithFlags(0)
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 1433, in nvmlInitWithFlags
    _LoadNvmlLibrary()
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 1480, in _LoadNvmlLibrary
    _nvmlCheckReturn(NVML_ERROR_LIBRARY_NOT_FOUND)
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_LibraryNotFound: NVML Shared Library Not Found
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 414) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-17_01:26:31
  host      : bbc348e95135
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 414)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@bbc348e95135:/workspace/neuralangelo#

Is there anything missing?

CUDA error: device-side assert triggered

When I do

EXPERIMENT=toy_example
GROUP=example_group
NAME=example_name
CONFIG=projects/neuralangelo/configs/custom/${EXPERIMENT}.yaml
GPUS=1  # use >1 for multi-GPU training!
torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar

I got this ERROR

terminate called after throwing an instance of 'c10::Error'                                                                                                                                                         
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

triggered by

Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fa378c62efc in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fa378c26486 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3cc (0x7fa378cf270c in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xe07fe9 (0x7fa379b1dfe9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x50a9ca (0x7fa3bc7249ca in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3db61 (0x7fa378c46b61 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1ce (0x7fa378c3e27e in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0xd (0x7fa378c3e3ad in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7a5188 (0x7fa3bc9bf188 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x335 (0x7fa3bc9bf545 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x5cf323]
frame #11: /usr/bin/python() [0x5d221c]
frame #12: /usr/bin/python() [0x5af53a]
frame #13: /usr/bin/python() [0x613bd5]
frame #14: _PyEval_EvalFrameDefault + 0x8d9b (0x57406b in /usr/bin/python)
frame #15: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #16: _PyFunction_Vectorcall + 0x393 (0x5f6a13 in /usr/bin/python)
frame #17: PyObject_Call + 0x62 (0x5f5c02 in /usr/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x1f2c (0x56d1fc in /usr/bin/python)
frame #19: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #20: /usr/bin/python() [0x50b2b0]
frame #21: _PyEval_EvalFrameDefault + 0x1901 (0x56cbd1 in /usr/bin/python)
frame #22: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #23: /usr/bin/python() [0x50b2b0]
frame #24: _PyEval_EvalFrameDefault + 0x57f2 (0x570ac2 in /usr/bin/python)
frame #25: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #26: /usr/bin/python() [0x50b2b0]
frame #27: _PyEval_EvalFrameDefault + 0x1901 (0x56cbd1 in /usr/bin/python)
frame #28: _PyFunction_Vectorcall + 0x1b6 (0x5f6836 in /usr/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x72d (0x56b9fd in /usr/bin/python)
frame #30: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #31: PyEval_EvalCode + 0x27 (0x68e7b7 in /usr/bin/python)
frame #32: /usr/bin/python() [0x680001]
frame #33: /usr/bin/python() [0x68007f]
frame #34: /usr/bin/python() [0x680121]
frame #35: PyRun_SimpleFileExFlags + 0x197 (0x680db7 in /usr/bin/python)
frame #36: Py_RunMain + 0x212 (0x6b8122 in /usr/bin/python)
frame #37: Py_BytesMain + 0x2d (0x6b84ad in /usr/bin/python)
frame #38: __libc_start_main + 0xf3 (0x7fa405578083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #39: _start + 0x2e (0x5fb39e in /usr/bin/python)

I don't know how to solve it.
I think it may be caused by my gcc version:

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

where is the neuralangelo.datasets.json_data

Hi, thanks for your excellent work!

When I tried to run the Neuralangelo code using the Instant NGP fox dataset with the default template config, I encountered an error stating No module named 'projects.neuralangelo.datasets'.

I would appreciate any help!

commit 3b1b95f still get torch.distributed.elastic.multiprocessing.errors.ChildFailedError

          Fixed in 3b1b95f! Please feel free to reopen if the issue persists.

Originally posted by @chenhsuanlin in #15 (comment)

Hi!
Using commit 3b1b95f I still get the error torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

# torchrun --nproc_per_node=${GPUS} train.py     --logdir=logs/${GROUP}/${NAME}     --config=${CONFIG}     --show_pbar

(Setting affinity with NVML failed, skipping...)
Training with 1 GPUs.
Using random seed 0
Make folder logs/example_group/example_name
* checkpoint:
   * save_epoch: 9999999999
   * save_iter: 20000
   * save_latest_iter: 9999999999
   * save_period: 9999999999
   * strict_resume: True
* cudnn:
   * benchmark: True
   * deterministic: False
* data:
   * name: dummy
   * num_images: 2
   * num_workers: 4
   * preload: True
   * readjust:
      * center: [0.0, 0.0, 0.0]
      * scale: 1.0
   * root: toy_example_skip24/dense
   * train:
      * batch_size: 2
      * image_size: [687, 1230]
      * subset: None
   * type: projects.neuralangelo.data
   * use_multi_epoch_loader: True
   * val:
      * batch_size: 2
      * image_size: [300, 537]
      * max_viz_samples: 16
      * subset: 4
* image_save_iter: 9999999999
* inference_args:
* local_rank: 0
* logdir: logs/example_group/example_name
* logging_iter: 9999999999999
* max_epoch: 9999999999
* max_iter: 500000
* metrics_epoch: None
* metrics_iter: None
* model:
   * appear_embed:
      * dim: 8
      * enabled: True
   * background:
      * enabled: True
      * encoding:
         * levels: 10
         * type: fourier
      * encoding_view:
         * levels: 3
         * type: spherical
      * mlp:
         * activ: relu
         * activ_density: softplus
         * activ_density_params:
         * activ_params:
         * hidden_dim: 256
         * hidden_dim_rgb: 128
         * num_layers: 8
         * num_layers_rgb: 2
         * skip: [4]
         * skip_rgb: []
      * view_dep: True
      * white: False
   * object:
      * rgb:
         * encoding_view:
            * levels: 3
            * type: spherical
         * mlp:
            * activ: relu_
            * activ_params:
            * hidden_dim: 256
            * num_layers: 4
            * skip: []
            * weight_norm: True
         * mode: idr
      * s_var:
         * anneal_end: 0.1
         * init_val: 3.0
      * sdf:
         * encoding:
            * coarse2fine:
               * enabled: True
               * init_active_level: 4
               * step: 5000
            * hashgrid:
               * dict_size: 22
               * dim: 8
               * max_logres: 11
               * min_logres: 5
               * range: [-2, 2]
            * levels: 16
            * type: hashgrid
         * gradient:
            * mode: numerical
            * taps: 4
         * mlp:
            * activ: softplus
            * activ_params:
               * beta: 100
            * geometric_init: True
            * hidden_dim: 256
            * inside_out: False
            * num_layers: 1
            * out_bias: 0.5
            * skip: []
            * weight_norm: True
   * render:
      * num_sample_hierarchy: 4
      * num_samples:
         * background: 32
         * coarse: 64
         * fine: 16
      * rand_rays: 512
      * stratified: True
   * type: projects.neuralangelo.model
* nvtx_profile: False
* optim:
   * fused_opt: False
   * params:
      * lr: 0.001
      * weight_decay: 0.01
   * sched:
      * gamma: 10.0
      * iteration_mode: True
      * step_size: 9999999999
      * two_steps: [300000, 400000]
      * type: two_steps_with_warmup
      * warm_up_end: 5000
   * type: AdamW
* pretrained_weight: None
* source_filename: projects/neuralangelo/configs/custom/toy_example.yaml
* speed_benchmark: False
* test_data:
   * name: dummy
   * num_workers: 0
   * test:
      * batch_size: 1
      * is_lmdb: False
      * roots: None
   * type: imaginaire.datasets.images
* timeout_period: 9999999
* trainer:
   * amp_config:
      * backoff_factor: 0.5
      * enabled: False
      * growth_factor: 2.0
      * growth_interval: 2000
      * init_scale: 65536.0
   * ddp_config:
      * find_unused_parameters: False
      * static_graph: True
   * depth_vis_scale: 0.5
   * ema_config:
      * beta: 0.9999
      * enabled: False
      * load_ema_checkpoint: False
      * start_iteration: 0
   * grad_accum_iter: 1
   * image_to_tensorboard: False
   * init:
      * gain: None
      * type: none
   * loss_weight:
      * curvature: 0.0005
      * eikonal: 0.1
      * render: 1.0
   * type: projects.neuralangelo.trainer
* validation_iter: 5000
* wandb_image_iter: 10000
* wandb_scalar_iter: 100
cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
Traceback (most recent call last):
  File "train.py", line 104, in <module>
    main()
  File "train.py", line 79, in main
    trainer = get_trainer(cfg, is_inference=False, seed=args.seed)
  File "/workspace/neuralangelo/imaginaire/trainers/utils/get_trainer.py", line 32, in get_trainer
    trainer = trainer_lib.Trainer(cfg, is_inference=is_inference, seed=seed)
  File "/workspace/neuralangelo/projects/neuralangelo/trainer.py", line 26, in __init__
    super().__init__(cfg, is_inference=is_inference, seed=seed)
  File "/workspace/neuralangelo/projects/nerf/trainers/base.py", line 28, in __init__
    super().__init__(cfg, is_inference=is_inference, seed=seed)
  File "/workspace/neuralangelo/imaginaire/trainers/base.py", line 50, in __init__
    self.model = self.setup_model(cfg, seed=seed)
  File "/workspace/neuralangelo/imaginaire/trainers/base.py", line 116, in setup_model
    lib_model = importlib.import_module(cfg.model.type)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/workspace/neuralangelo/projects/neuralangelo/model.py", line 21, in <module>
    from projects.neuralangelo.utils.modules import NeuralSDF, NeuralRGB, BackgroundNeRF
  File "/workspace/neuralangelo/projects/neuralangelo/utils/modules.py", line 16, in <module>
    import tinycudann as tcnn
  File "/usr/local/lib/python3.8/dist-packages/tinycudann/__init__.py", line 9, in <module>
    from tinycudann.modules import free_temporary_memory, NetworkWithInputEncoding, Network, Encoding
  File "/usr/local/lib/python3.8/dist-packages/tinycudann/modules.py", line 59, in <module>
    raise EnvironmentError(f"Could not find compatible tinycudann extension for compute capability {system_compute_capability}.")
OSError: Could not find compatible tinycudann extension for compute capability 61.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 467) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-16_09:11:57
  host      : 0014c90f7923
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 467)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Running on Windows 10, GTX 1050 Ti, from WSL Ubuntu 18.04.6 with --gpus all flag
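
The exception above means the prebuilt tinycudann extension shipped in the image does not cover this GPU's compute capability (61, i.e. a Pascal-class GTX 1050 Ti). A minimal sketch, assuming a working PyTorch install, to confirm what the driver reports; if the capability is not covered, rebuilding tiny-cuda-nn from source (for example with the TCNN_CUDA_ARCHITECTURES environment variable set to 61) is one possible workaround.

import torch

# Query the compute capability of the first visible GPU, e.g. (6, 1) -> "61".
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}{minor}")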

Inference time on test set

Hi. Is there info on how long it takes to run Neuralangelo on videos from your test set or other in-the-wild videos?
I see the inference speed reported in seconds (screenshot of the table attached).

I read through your paper and am still trying to understand certain components, so I am not sure whether this table refers to a particular step of Neuralangelo or to the entire pipeline as a whole.

The loss becomes 'nan' when resuming training from checkpoints.

Hi,
Thank you for open-sourcing such great work. I was training my own outdoor scene, and when I resumed training from the last epoch, the training loss turned into 'nan'. Here are some of my terminal outputs, as well as the training results from the last epoch.

cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
model parameter count: 366,729,596
Initialize model weights using type: none, gain: None
Using random seed 0
Allow TensorFloat32 operations on supported devices
Train dataset length: 1487                                                                                                                                                               
Val dataset length: 4                                                                                                                                                                    
Loading checkpoint (local): logs/cambridge/StMarysChurch/epoch_00121_iteration_000090000_checkpoint.pt
- Loading the model...
- Loading the optimizer...
- Loading the scheduler...
Done with loading the checkpoint (epoch 121, iter 90000).
Initialize wandb
wandb: Currently logged in as: bmzhao (bmzhao99). Use `wandb login --relogin` to force relogin
cat: /sys/module/amdgpu/initstate: No such file or directory
ERROR:root:Driver not initialized (amdgpu not found in modules)
wandb: Tracking run with wandb version 0.15.8
wandb: Run data is saved locally in logs/cambridge/StMarysChurch/wandb/run-20230815_182850-7tfevxuo
wandb: Run `wandb offline` to turn off syncing.
wandb: Resuming run StMarysChurch
wandb: ⭐️ View project at https://wandb.ai/bmzhao99/StMarysChurch
wandb: 🚀 View run at https://wandb.ai/bmzhao99/StMarysChurch/runs/7tfevxuo
Evaluating with 4 samples.                                                                                                                                                               
Training epoch 122:  13%|███████████████▎                                                                                                   | 99/743 [00:20<02:14,  4.78it/s, iter=90100]wandb: Waiting for W&B process to finish... (success).
wandb: - 2.931 MB of 2.931 MB uploaded (0.000 MB deduped)
wandb: Run summary:
wandb:                  epoch 134
wandb:              iteration 99600
wandb:               optim/lr 0.001
wandb:             time/epoch 221.29674
wandb:         time/iteration 0.07249
wandb:             train/PSNR 22.44942
wandb:    train/active_levels 16
wandb: train/curvature_weight 5e-05
wandb:   train/eikonal_weight 0.1
wandb:          train/epsilon 0.00049
wandb:   train/loss/curvature 103.02345
wandb:     train/loss/eikonal 0.02801
wandb:      train/loss/render 0.12252
wandb:       train/loss/total 0.13093
wandb:            train/s-var 5.56595
wandb:               val/PSNR 19.31314
wandb:      val/active_levels 16
wandb:   val/curvature_weight 5e-05
wandb:     val/eikonal_weight 0.1
wandb:     val/loss/curvature 97.7878
wandb:       val/loss/eikonal 0.03254
wandb:        val/loss/render 0.07198
wandb:         val/loss/total 0.08056
wandb:              val/s-var 5.54685
wandb: 
wandb: 🚀 View run StMarysChurch at: https://wandb.ai/bmzhao99/StMarysChurch/runs/7tfevxuo
wandb: Synced 3 W&B file(s), 6 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: logs/cambridge/StMarysChurch/wandb/run-20230815_182850-7tfevxuo/logs
/home/zhaoboming/anaconda3/envs/neuralangelo/lib/python3.8/site-packages/wandb/sdk/wandb_run.py:2089: UserWarning: Run (7tfevxuo) is finished. The call to `_console_raw_callback` will be ignored. Please make sure that you are using an active run.
  lambda data: self._console_raw_callback("stderr", data),
Traceback (most recent call last):                                                                                                                                                       
  File "train.py", line 104, in <module>
    main()
  File "train.py", line 93, in main
    trainer.train(cfg,
  File "/mnt/data1/zhaoboming/neuralangelo/projects/neuralangelo/trainer.py", line 106, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
  File "/mnt/data1/zhaoboming/neuralangelo/projects/nerf/trainers/base.py", line 115, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
  File "/mnt/data1/zhaoboming/neuralangelo/imaginaire/trainers/base.py", line 511, in train
    self.end_of_iteration(data, current_epoch, current_iteration)
  File "/mnt/data1/zhaoboming/neuralangelo/imaginaire/trainers/base.py", line 319, in end_of_iteration
    self._end_of_iteration(data, current_epoch, current_iteration)
  File "/mnt/data1/zhaoboming/neuralangelo/projects/nerf/trainers/base.py", line 51, in _end_of_iteration
    raise ValueError("Training loss has gone to NaN!!!")
ValueError: Training loss has gone to NaN!!!

Here are my training results (screenshots attached).
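
A minimal debugging sketch for this kind of failure, assuming the trainer keeps the individual loss terms in a dict (the helper and dict layout below are hypothetical, not repo code): checking each term for non-finite values can reveal whether the curvature, eikonal, or render loss is the one that blows up after resuming.

import torch

# Hypothetical helper: name the loss term that becomes non-finite before the
# trainer aborts with "Training loss has gone to NaN!!!".
def assert_finite_losses(losses: dict):
    for name, value in losses.items():
        if not torch.isfinite(value).all():
            raise ValueError(f"Loss term '{name}' is non-finite: {value}")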

Removing redundant part

Hi,

Thanks for your excellent work. I am trying to test on the nerf_synthetic dataset, following the data preparation guide to convert the format and using the base.yaml config. I find the quality is nearly perfect, except for the geometry.

There is a redundant part at the bottom of the model. Is it reasonable to remove it with some post-processing, or is there something I missed?

Thanks

(screenshots of the reconstruction attached)
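
A possible post-processing sketch for floaters like this, assuming the extracted mesh loads as a single trimesh.Trimesh (the file names are hypothetical): keeping only the largest connected component often removes detached geometry, though a part that is attached to the main surface would still need manual cropping.

import trimesh

# Load the extracted mesh and keep only its largest connected component.
mesh = trimesh.load("mesh.ply")
components = mesh.split(only_watertight=False)
largest = max(components, key=lambda m: len(m.faces))
largest.export("mesh_largest_component.ply")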

I got an error while trying to process toy_example.MOV.

Thanks for your awesome project !

I'm using the Docker image "docker.io/chenhsuanlin/neuralangelo:23.04-py3".
After downloading "toy_example.MOV", I ran the command below.
Can I get some help?

root@7981961409cf:/workspace/neuralangelo# bash projects/neuralangelo/scripts/preprocess.sh ${EXPERIMENT_NAME} ${PATH_TO_VIDEO} ${SKIP_FRAME_RATE} ${SCENE_TYPE}
ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 31.100 / 56. 31.100
  libavcodec     58. 54.100 / 58. 54.100
  libavformat    58. 29.100 / 58. 29.100
  libavdevice    58.  8.100 / 58.  8.100
  libavfilter     7. 57.100 /  7. 57.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  5.100 /  5.  5.100
  libswresample   3.  5.100 /  3.  5.100
  libpostproc    55.  5.100 / 55.  5.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'toy_example.MOV':
  Metadata:
    major_brand     : qt
    minor_version   : 512
    compatible_brands: qt
    encoder         : Lavf59.27.100
  Duration: 00:00:31.93, start: 0.000000, bitrate: 3488 kb/s
    Stream #0:0: Video: h264 (High) (avc1 / 0x31637661), yuv420p(tv, smpte170m/smpte170m/bt709), 1280x720, 3485 kb/s, 29.97 fps, 30 tbr, 19200 tbn, 38400 tbc (default)
    Metadata:
      handler_name    : Core Media Video
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (native) -> mjpeg (native))
Press [q] to stop, [?] for help
[swscaler @ 0x55b694126000] deprecated pixel format used, make sure you did set range correctly
Output #0, image2, to 'toy_example_skip24/raw_images/%06d.jpg':
  Metadata:
    major_brand     : qt
    minor_version   : 512
    compatible_brands: qt
    encoder         : Lavf58.29.100
    Stream #0:0: Video: mjpeg, yuvj420p(pc), 1280x720, q=2-31, 200 kb/s, 30 fps, 30 tbn, 30 tbc (default)
    Metadata:
      handler_name    : Core Media Video
      encoder         : Lavc58.54.100 mjpeg
    Side data:
      cpb: bitrate max/min/avg: 0/0/200000 buffer size: 0 vbv_delay: -1
frame=   40 fps=0.0 q=2.0 Lsize=N/A time=00:00:31.26 bitrate=N/A speed=82.8x
video:2526kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
qt.qpa.xcb: could not connect to display
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: eglfs, linuxfb, minimal, minimalegl, offscreen, vnc, xcb.

*** Aborted at 1691980089 (unix time) try "date -d @1691980089" if you are using GNU date ***
PC: @     0x7feb7b99700b gsignal
*** SIGABRT (@0xb7c) received by PID 2940 (TID 0x7feb77193900) from PID 2940; stack trace: ***
    @     0x7feb7d585631 (unknown)
    @     0x7feb7cac8420 (unknown)
    @     0x7feb7b99700b gsignal
    @     0x7feb7b976859 abort
    @     0x7feb7bf5baad QMessageLogger::fatal()
    @     0x7feb7c53d7ae QGuiApplicationPrivate::createPlatformIntegration()
    @     0x7feb7c53e708 QGuiApplicationPrivate::createEventDispatcher()
    @     0x7feb7c162f55 QCoreApplicationPrivate::init()
    @     0x7feb7c540543 QGuiApplicationPrivate::init()
    @     0x7feb7cc4a3bd QApplicationPrivate::init()
    @     0x5650366f7602 RunFeatureExtractor()
    @     0x5650366e3eaf main
    @     0x7feb7b978083 __libc_start_main
    @     0x5650366e7f6e _start
projects/neuralangelo/scripts/run_colmap.sh: line 22:  2940 Aborted                 colmap feature_extractor --database_path ${1}/database.db --image_path ${1}/raw_images --ImageReader.camera_model=RADIAL --SiftExtraction.use_gpu=true --SiftExtraction.num_threads=32 --ImageReader.single_camera=true
qt.qpa.xcb: could not connect to display
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: eglfs, linuxfb, minimal, minimalegl, offscreen, vnc, xcb.

*** Aborted at 1691980089 (unix time) try "date -d @1691980089" if you are using GNU date ***
PC: @     0x7f9706d0500b gsignal
*** SIGABRT (@0xb7d) received by PID 2941 (TID 0x7f9702501900) from PID 2941; stack trace: ***
    @     0x7f97088f3631 (unknown)
    @     0x7f9707e36420 (unknown)
    @     0x7f9706d0500b gsignal
    @     0x7f9706ce4859 abort
    @     0x7f97072c9aad QMessageLogger::fatal()
    @     0x7f97078ab7ae QGuiApplicationPrivate::createPlatformIntegration()
    @     0x7f97078ac708 QGuiApplicationPrivate::createEventDispatcher()
    @     0x7f97074d0f55 QCoreApplicationPrivate::init()
    @     0x7f97078ae543 QGuiApplicationPrivate::init()
    @     0x7f9707fb83bd QApplicationPrivate::init()
    @     0x5586e8f4fb4d RunSequentialMatcher()
    @     0x5586e8f3aeaf main
    @     0x7f9706ce6083 __libc_start_main
    @     0x5586e8f3ef6e _start
projects/neuralangelo/scripts/run_colmap.sh: line 26:  2941 Aborted                 colmap sequential_matcher --database_path ${1}/database.db --SiftMatching.use_gpu=true

==============================================================================
Loading database
==============================================================================

Loading cameras... 0 in 0.000s
Loading matches... 0 in 0.000s
Loading images... 0 in 0.000s (connected 0)
Building correspondence graph... in 0.000s (ignored 0)

Elapsed time: 0.000 [minutes]

WARNING: No images with matches found in the database.

F0814 02:28:09.444535  2944 reconstruction.cc:806] cameras, images, points3D files do not exist at toy_example_skip24/sparse/0
*** Check failure stack trace: ***
    @     0x7f20af09c1c3  google::LogMessage::Fail()
    @     0x7f20af0a125b  google::LogMessage::SendToLog()
    @     0x7f20af09bebf  google::LogMessage::Flush()
    @     0x7f20af09c6ef  google::LogMessageFatal::~LogMessageFatal()
    @     0x5578a9167617  colmap::Reconstruction::Read()
    @     0x5578a90ab4d1  RunImageUndistorter()
    @     0x5578a90a0eaf  main
    @     0x7f20ad49c083  __libc_start_main
    @     0x5578a90a4f6e  _start
projects/neuralangelo/scripts/run_colmap.sh: line 38:  2944 Aborted                 colmap image_undistorter --image_path ${1}/raw_images --input_path ${1}/sparse/0 --output_path ${1}/dense --output_type COLMAP --max_image_size 2000
Traceback (most recent call last):
  File "projects/neuralangelo/scripts/convert_data_to_json.py", line 199, in <module>
    auto_bound(args)
  File "projects/neuralangelo/scripts/convert_data_to_json.py", line 161, in auto_bound
    cameras, images, points3D = read_model(os.path.join(args.data_dir, "sparse"), ext=".bin")
  File "/workspace/neuralangelo/third_party/colmap/scripts/python/read_write_model.py", line 435, in read_model
    cameras = read_cameras_binary(os.path.join(path, "cameras" + ext))
  File "/workspace/neuralangelo/third_party/colmap/scripts/python/read_write_model.py", line 134, in read_cameras_binary
    with open(path_to_model_file, "rb") as fid:
FileNotFoundError: [Errno 2] No such file or directory: 'toy_example_skip24/dense/sparse/cameras.bin'
Traceback (most recent call last):
  File "projects/neuralangelo/scripts/generate_config.py", line 81, in <module>
    generate_config(args)
  File "projects/neuralangelo/scripts/generate_config.py", line 30, in generate_config
    num_images = len(os.listdir(os.path.join(args.data_dir, "images")))
FileNotFoundError: [Errno 2] No such file or directory: 'toy_example_skip24/dense/images'
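
The aborts above start with Qt failing to find an X display ("could not connect to display"), and the missing sparse/dense outputs are downstream of that. A common workaround for headless Qt applications is to force the offscreen platform plugin; below is a minimal sketch of that idea (the paths mirror the log above, but this wrapper is hypothetical and not part of the repo's scripts).

import os
import subprocess

# Force Qt's offscreen platform so COLMAP does not abort without an X display.
env = dict(os.environ, QT_QPA_PLATFORM="offscreen")
subprocess.run(
    [
        "colmap", "feature_extractor",
        "--database_path", "toy_example_skip24/database.db",
        "--image_path", "toy_example_skip24/raw_images",
        "--ImageReader.camera_model=RADIAL",
        "--ImageReader.single_camera=true",
        "--SiftExtraction.use_gpu=true",
    ],
    env=env,
    check=True,
)

The same environment variable would also need to apply to the subsequent sequential_matcher and image_undistorter invocations in run_colmap.sh.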

Other methods for installation

Something goes wrong when I install Docker on my server. Would you provide some other ways to install and run this project? Thanks!

Camera poses & coordinate system used

Hey! Great work. I had a couple of questions about the surface reconstruction when the camera in the input video moves a lot.

Does Neuralangelo optimize the implicit surface in a canonical object-centric coordinate system or directly in world coordinates?

For example, say I take videos of the NVIDIA HQ from two different angles (but still show the whole HQ in both videos): will the reconstructions be the same? Will they be in the same coordinates?

CUDA out of memory, google colab T4, 15G VRAM on toy example

2023-08-13 07:32:27.550330: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-13 07:32:28.516815: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Training with 1 GPUs.
Using random seed 0
Make folder logs/2023_0813_0732_30_toy_example

  • checkpoint:
    • save_epoch: 9999999999
    • save_iter: 20000
    • save_latest_iter: 9999999999
    • save_period: 9999999999
    • strict_resume: True
  • cudnn:
    • benchmark: True
    • deterministic: False
  • data:
    • name: dummy
    • num_images: 29
    • num_workers: 4
    • preload: True
    • readjust:
      • center: [0.0, 0.0, 0.0]
      • scale: 1.0
    • root: /content/toy/toy_example_skip24/dense
    • train:
      • batch_size: 2
      • image_size: [707, 1258]
      • subset: None
    • type: projects.neuralangelo.data
    • use_multi_epoch_loader: True
    • val:
      • batch_size: 2
      • image_size: [300, 533]
      • max_viz_samples: 16
      • subset: 4
  • image_save_iter: 9999999999
  • inference_args:
  • local_rank: 0
  • logdir: logs/2023_0813_0732_30_toy_example
  • logging_iter: 9999999999999
  • max_epoch: 9999999999
  • max_iter: 500000
  • metrics_epoch: None
  • metrics_iter: None
  • model:
    • appear_embed:
      • dim: 8
      • enabled: True
    • background:
      • enabled: True
      • encoding:
        • levels: 10
        • type: fourier
      • encoding_view:
        • levels: 3
        • type: spherical
      • mlp:
        • activ: relu
        • activ_density: softplus
        • activ_density_params:
        • activ_params:
        • hidden_dim: 256
        • hidden_dim_rgb: 128
        • num_layers: 8
        • num_layers_rgb: 2
        • skip: [4]
        • skip_rgb: []
      • view_dep: True
      • white: False
    • object:
      • rgb:
        • encoding_view:
          • levels: 3
          • type: spherical
        • mlp:
          • activ: relu_
          • activ_params:
          • hidden_dim: 256
          • num_layers: 4
          • skip: []
          • weight_norm: True
        • mode: idr
      • s_var:
        • anneal_end: 0.1
        • init_val: 3.0
      • sdf:
        • encoding:
          • coarse2fine:
            • enabled: True
            • init_active_level: 4
            • step: 5000
          • hashgrid:
            • dict_size: 22
            • dim: 8
            • max_logres: 11
            • min_logres: 5
            • range: [-2, 2]
          • levels: 16
          • type: hashgrid
        • gradient:
          • mode: numerical
          • taps: 4
        • mlp:
          • activ: softplus
          • activ_params:
            • beta: 100
          • geometric_init: True
          • hidden_dim: 256
          • inside_out: False
          • num_layers: 2
          • out_bias: 0.5
          • skip: []
          • weight_norm: True
    • render:
      • num_sample_hierarchy: 4
      • num_samples:
        • background: 32
        • coarse: 64
        • fine: 16
      • rand_rays: 512
      • stratified: True
    • type: projects.neuralangelo.model
  • nvtx_profile: False
  • optim:
    • fused_opt: False
    • params:
      • lr: 0.001
      • weight_decay: 0.001
    • sched:
      • gamma: 10.0
      • iteration_mode: True
      • step_size: 9999999999
      • two_steps: [300000, 400000]
      • type: two_steps_with_warmup
      • warm_up_end: 5000
    • type: AdamW
  • pretrained_weight: None
  • source_filename: /content/neuralangelo/projects/neuralangelo/configs/custom/toy_example.yaml
  • speed_benchmark: False
  • test_data:
    • name: dummy
    • num_workers: 0
    • test:
      • batch_size: 1
      • is_lmdb: False
      • roots: None
    • type: imaginaire.datasets.images
  • timeout_period: 9999999
  • trainer:
    • amp_config:
      • backoff_factor: 0.5
      • enabled: False
      • growth_factor: 2.0
      • growth_interval: 2000
      • init_scale: 65536.0
    • ddp_config:
      • find_unused_parameters: False
      • static_graph: True
    • depth_vis_scale: 0.5
    • ema_config:
      • beta: 0.9999
      • enabled: False
      • load_ema_checkpoint: False
      • start_iteration: 0
    • grad_accum_iter: 1
    • image_to_tensorboard: False
    • init:
      • gain: None
      • type: none
    • loss_weight:
      • curvature: 0.0005
      • eikonal: 0.1
      • render: 1.0
    • type: projects.neuralangelo.trainer
  • validation_iter: 5000
  • wandb_image_iter: 10000
  • wandb_scalar_iter: 100
    cudnn benchmark: True
    cudnn deterministic: False
    Setup trainer.
    Using random seed 0
    model parameter count: 366,706,268
    Initialize model weights using type: none, gain: None
    Using random seed 0
    Allow TensorFloat32 operations on supported devices
    Train dataset length: 29
    /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
    warnings.warn(_create_warning_msg(
    Val dataset length: 4
    Training from scratch.
    Initialize wandb
    Evaluating: 0% 0/2 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
    warnings.warn(_create_warning_msg(
    Evaluating with 4 samples.
    Traceback (most recent call last):
    File "/content/neuralangelo/train.py", line 104, in
    main()
    File "/content/neuralangelo/train.py", line 93, in main
    trainer.train(cfg,
    File "/content/neuralangelo/projects/neuralangelo/trainer.py", line 106, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
    File "/content/neuralangelo/projects/nerf/trainers/base.py", line 115, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
    File "/content/neuralangelo/imaginaire/trainers/base.py", line 503, in train
    self.train_step(data, last_iter_in_epoch=(it == len(data_loader) - 1))
    File "/content/neuralangelo/imaginaire/trainers/base.py", line 446, in train_step
    self.scaler.scale(total_loss).backward()
    File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
    File "/usr/local/lib/python3.10/dist-packages/torch/autograd/init.py", line 200, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
    File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
    File "/usr/local/lib/python3.10/dist-packages/tinycudann/modules.py", line 116, in backward
    input_grad, params_grad = _module_function_backward.apply(ctx, doutput, input, params, output)
    File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
    File "/usr/local/lib/python3.10/dist-packages/tinycudann/modules.py", line 129, in forward
    params_grad = null_tensor_like(params) if params_grad is None else (params_grad / ctx_fwd.loss_scale)
    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 698.00 MiB (GPU 0; 14.75 GiB total capacity; 13.21 GiB already allocated; 244.81 MiB free; 14.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 33279) of binary: /usr/bin/python3
    Traceback (most recent call last):
    File "/usr/local/bin/torchrun", line 8, in
    sys.exit(main())
    File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
    return f(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
    File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
    File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in call
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    ============================================================
    train.py FAILED

------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-13_07:34:34
  host      : 5310e9ddb173
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 33279)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
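
For context on why this configuration is heavy for a 15 GB T4, here is a rough back-of-the-envelope estimate of the hash-grid footprint implied by the settings above (dict_size=22, dim=8, 16 levels between 2^5 and 2^11). This is a sketch under the assumption that each level holds at most 2^dict_size entries and that resolutions grow geometrically; it is not repo code.

import math

def hashgrid_params(levels=16, min_logres=5, max_logres=11, dict_size=22, dim=8):
    # Geometric growth of the per-level resolution, Instant-NGP style.
    growth = math.exp((max_logres - min_logres) * math.log(2) / (levels - 1))
    total_entries = 0
    for level in range(levels):
        res = int(2 ** min_logres * growth ** level)
        total_entries += min((res + 1) ** 3, 2 ** dict_size)  # cap at the hash table size
    return total_entries * dim

params = hashgrid_params()
print(f"~{params / 1e6:.0f}M grid parameters")   # close to the 366M reported above
print(f"~{params * 4 / 2 ** 30:.1f} GiB at fp32, before optimizer states and activations")

Under these assumptions, lowering dict_size and dim is the most direct way to shrink that footprint, at some cost in reconstruction detail.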

Poor results on Barn

Dear authors,

Thanks for releasing the code. I ran the code with the given TNT config on Barn (processed following the guide), but the result is bad and far from what is shown in the paper. Could you please give me some advice or share the config that you used?

Thanks a lot,
Best.
Mulin

Docker Container Fail

Hello, I've been attempting to build the provided COLMAP Docker container, and it keeps failing with various errors. The farthest I got was a successful run of the command "docker build -f docker/Dockerfile-colmap -t chenhsuanlin/colmap:3.9 .". However, when I run "docker start chenhsuanlin/colmap:3.9", I get the following error: "Error response from daemon: No such container: chenhsuanlin/colmap:3.9 Error: failed to start containers: chenhsuanlin/colmap:3.9". I was wondering what I can do to fix this? I was very excited to try out your repo but have been stuck on this for hours. Thank you very much in advance! Additionally, I would suggest removing the "&& git checkout dev" portion on line 34, as this errors out since the COLMAP repo seems to have closed that branch.

Thank you in advance, and awaiting your response

Data preparation for the 360 degree data

Dear Author,

Thanks for your great work.
I want to test your method on the NeRF-synthetic data. Could you please explain how to convert the NeRF-synthetic data to the current data format, and how to modify the config given that there is no background?

Looking forward to your reply!

Bests,
Runsong

Poor Performance on ScanNet

Thanks for your excellent work!
I have tried it on the ScanNet datasets; however, I got totally bad results. This is the reconstruction result after 30k iterations:

whereas the GT mesh looks like this:

I have adjusted the dataset format and config files, but I don't know whether the result is due to the method or to some hyperparameters.
The adjusted config file is below:

model:
    object:
        sdf:
            mlp:
                inside_out: True
                out_bias: 0.6
            encoding:
                coarse2fine:
                    init_active_level: 8
    appear_embed:
        enabled: True
        dim: 8
    background:
        enabled: False
    render:
        num_samples:
            background: 0

Additionally, I normalized the ScanNet scene into the cube [-0.5, 0.5], and I changed the near/far sampling range to 0 and 1.5.

Do you have any idea why the performance on ScanNet is so poor?
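
For reference, a minimal sketch of the kind of normalization described above (a hypothetical helper, not the repo's preprocessing): recenter the camera centers and rescale them so the scene fits inside the cube [-0.5, 0.5]^3.

import numpy as np

def normalize_poses(c2w: np.ndarray, half_extent: float = 0.5):
    """c2w: (N, 4, 4) camera-to-world matrices; returns normalized copies."""
    centers = c2w[:, :3, 3]
    offset = (centers.min(axis=0) + centers.max(axis=0)) / 2.0   # scene center
    scale = half_extent / np.abs(centers - offset).max()         # fit into the cube
    c2w = c2w.copy()
    c2w[:, :3, 3] = (centers - offset) * scale
    return c2w, offset, scale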

Problem with Docker Container

Hi! Your project is so amazing that even a person who knows nothing about coding (yes, that's me) decided to try it :) As expected, I had some problems getting everything to work. I use WSL2 on Windows 11, and when I run the script:

docker run -it chenhsuanlin/colmap:3.8 /bin/bash

I get the following warning about nvidia driver:

==========
== CUDA ==

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .

But when I run the same with --gpus, like:
docker run --gpus all -it chenhsuanlin/colmap:3.8 /bin/bash

I get this:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/1140653493db5ca0f2b71b42c2194624b1e9e50bd0f9f72121bf836058a77900/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown.
ERRO[0000] error waiting for container:

What am I doing wrong?

ValueError when extracting mesh

Hi, when I was extracting mesh from the checkpoint, I got a value error where the terminal output is as follows:

(neuralangelo) vrlab@k8s-master-38:~/wangph1/workspace/neuralangelo$ CUDA_VISIBLE_DEVICES=2 torchrun --nproc_per_node=${GPUS} projects/neuralangelo/scripts/extract_mesh.py \
>     --logdir=logs/${GROUP}/${NAME} \
>     --config=${CONFIG} \
>     --checkpoint=${CHECKPOINT} \
>     --output_file=${OUTPUT_MESH} \
>     --resolution=${RESOLUTION} \
>     --block_res=${BLOCK_RES}
Running mesh extraction with 1 GPUs.
Make folder logs/example_group/example_name
Setup trainer.
Using random seed 0
model parameter count: 366,706,268
Initialize model weights using type: none, gain: None
Using random seed 0
Allow TensorFloat32 operations on supported devices
Loading checkpoint (local): logs/example_group/example_name/epoch_02857_iteration_000040000_checkpoint.pt
- Loading the model...
Done with loading the checkpoint.
Extracting surface at resolution 2048 2048 2048
vertices: 0
faces: 0
Traceback (most recent call last):
  File "projects/neuralangelo/scripts/extract_mesh.py", line 95, in <module>
    main()
  File "projects/neuralangelo/scripts/extract_mesh.py", line 90, in main
    mesh.vertices = mesh.vertices * meta["sphere_radius"] + np.array(meta["sphere_center"])
ValueError: operands could not be broadcast together with shapes (0,) (3,)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 502080) of binary: /home/vrlab/anaconda3/envs/neuralangelo/bin/python
Traceback (most recent call last):
  File "/home/vrlab/anaconda3/envs/neuralangelo/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/vrlab/anaconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/vrlab/anaconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/vrlab/anaconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/vrlab/anaconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/vrlab/anaconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
projects/neuralangelo/scripts/extract_mesh.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-14_19:58:51
  host      : k8s-master-38
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 502080)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

The server GPU used for extracting is a single RTX3090 with 24GB VRAM.
I wonder if this is a bug or a mistake on my part. Thanks.
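
The broadcasting error above is a symptom of marching cubes returning an empty mesh (vertices: 0, faces: 0). Below is a hypothetical guard (not repo code) that fails with a clearer message at that point, assuming a trimesh-like mesh object.

import numpy as np

def rescale_mesh(mesh, sphere_center, sphere_radius):
    # Fail early with an actionable message instead of a shape-(0,) broadcast error.
    if len(mesh.vertices) == 0:
        raise RuntimeError(
            "Extracted mesh is empty; the SDF may not have converged, or the "
            "extraction bounds/resolution may not cover the surface."
        )
    mesh.vertices = mesh.vertices * sphere_radius + np.array(sphere_center)
    return mesh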

Unable to run Neuralangelo; NVML not supported

Getting the below error

torchrun --nproc_per_node=1 train.py --logdir=logs/sample/toy_example --config=projects/neuralangelo/configs/custom/toy_example.yaml --show_pbar
Traceback (most recent call last):
  File "train.py", line 104, in <module>
    main()
  File "train.py", line 46, in main
    set_affinity(args.local_rank)
  File "/data/imaginaire/utils/gpu_affinity.py", line 74, in set_affinity
    os.sched_setaffinity(0, dev.get_cpu_affinity())
  File "/data/imaginaire/utils/gpu_affinity.py", line 50, in get_cpu_affinity
    for j in pynvml.nvmlDeviceGetCpuAffinity(self.handle, Device._nvml_affinity_elements):
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 1745, in nvmlDeviceGetCpuAffinity
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 442) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-14_16:29:36
  host      : c7c816135a1c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 442)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Running on Windows 11, RTX 4090, from WSL Ubuntu 22.04.02 with --gpus all flag
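
The crash above originates in an NVML CPU-affinity query that is not supported on this system (a common situation under WSL). A hypothetical fallback sketch, not repo code, that degrades gracefully instead of aborting:

import pynvml

def get_cpu_affinity_safe(handle, num_elements):
    # Return None (keep default OS scheduling) when NVML cannot answer the query.
    try:
        return pynvml.nvmlDeviceGetCpuAffinity(handle, num_elements)
    except pynvml.NVMLError:
        return None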

Error 403 (Forbidden) when accessing data_DTU.zip

The preprocess_dtu.sh seems to be encountering an issue, as there is an "access denied" error while attempting to access the data_DTU.zip through the provided Google Drive link 1zgD-uTLjO8hXcjLqelU444rwS9s9-Syg. Would it be possible to consider adjusting the permission to "Anyone with the link", assuming that the intention is to make it accessible to the public? Thank you, @chenhsuanlin .

$ bash projects/neuralangelo/scripts/preprocess_dtu.sh ${PATH_TO_DTU}
Download DTU data
Access denied with the following error:

 	Cannot retrieve the public link of the file. You may need to change
	the permission to 'Anyone with the link', or have had many accesses. 

You may still be able to access the file from the browser:

	 https://drive.google.com/uc?id=1zgD-uTLjO8hXcjLqelU444rwS9s9-Syg 

unzip:  cannot find or open data_DTU.zip, data_DTU.zip.zip or data_DTU.zip.ZIP.
rm: cannot remove 'data_DTU.zip': No such file or directory
Generate json files

How can I reproduce TNT-Meetingroom result?

Hi there, thanks for publishing the code.

I'm trying to reproduce the result of Meetingroom in the Tanks and Temples dataset. I ran projects/neuralangelo/scripts/convert_tnt_to_json.py to process the data and use the following config:

_parent_: projects/neuralangelo/configs/base.yaml

model:
    object:
        sdf:
            mlp:
                inside_out: True   # True for Meetingroom.
            encoding:
                coarse2fine:
                    init_active_level: 8
    appear_embed:
        enabled: True
        dim: 8

data:
    type: projects.neuralangelo.data
    root: .../data/TNT/Meetingroom
    num_images: 371  # The number of training images.
    train:
        image_size: [835,1500]
        batch_size: 1
        subset:
    val:
        image_size: [300,540]
        batch_size: 1
        subset: 1
        max_viz_samples: 16

However, I only get a PSNR around 22 at iteration 300k. Do you have any suggestions to improve it?

Thank you!

Help

Could some kind soul get the code to run on Google Colab, Hugging Face, or some similar platform?

Question about finite difference taps

Hi, thanks for the great work and the provided code!
Is there any difference in the accuracy of the finite differences when taps equals 6 versus 4? Will taps=6 be more accurate? The paper uses 6, but the default in the code is 4.
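
For intuition, an illustrative sketch (not the repo's implementation) of the two stencils for numerical SDF gradients at a single 3D point: 6-tap central differences are second-order accurate but cost six SDF evaluations, while a 4-tap tetrahedron stencil costs four evaluations and is generally first-order accurate, so it is cheaper but slightly less precise.

import torch

def grad_6tap(sdf, x, eps):
    # x: tensor of shape (3,); sdf(x) returns a scalar tensor.
    # Central differences along each axis: 2 evaluations per axis = 6 taps.
    offsets = eps * torch.eye(3, device=x.device)
    return torch.stack(
        [(sdf(x + offsets[i]) - sdf(x - offsets[i])) / (2 * eps) for i in range(3)],
        dim=-1,
    )

def grad_4tap(sdf, x, eps):
    # Tetrahedron stencil: 4 evaluations total.
    dirs = torch.tensor(
        [[1., -1., -1.], [-1., -1., 1.], [-1., 1., -1.], [1., 1., 1.]],
        device=x.device,
    )
    return sum(d * sdf(x + eps * d) for d in dirs) / (4 * eps)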

automatically save last epoch

Currently, checkpoints are only saved at the configured checkpoint interval.

When running for 50000 iterations, with checkpoints set to save every 20000 iterations, I will get a final checkpoint of iteration 40000 even though my training ran to 50000.

When the total number of iterations is smaller than the checkpoint interval, training runs but doesn't save any checkpoints at all, for the same reason.
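
A minimal sketch of the requested behaviour (hypothetical names, not repo code): treat the final iteration as a save point regardless of the interval.

def should_save(iteration, save_iter, max_iter):
    # Save on the usual interval, and always once more at the very end.
    return iteration % save_iter == 0 or iteration == max_iter

print(should_save(50000, 20000, 50000))  # True, even though 50000 % 20000 != 0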

Any Recommendations for a COLMAP alternative?

Using very simple turntable captures (with no background) that I have used with COLMAP in the past, COLMAP failed to create working camera intrinsics & extrinsics in the matching step for use with nvidia-ngp.

I was hoping there might be an alternative method, or set of methods, that users could try in case of the above failure with COLMAP?
