
Comments (8)

hzhshok commented on August 19, 2024

Hello,
I have now found another sample (a building) to try this feature, and labeled the images to remove the background. The first train/mesh pass seems to succeed, but the second train/mesh pass fails, so please give a suggestion. Thanks!

The questions:
a. Why did it break off during the run? See the console output:
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function texture2d_mipBackward returned an invalid gradient at index 0 - got [1, 4, 4, 3] but expected shape compatible with [1, 5, 5, 3]
Maybe the MSE/PSNR are not good; if so, could you suggest how to improve the training images, or tell me whether the training images do not meet the requirements?
b. Why is the result so badly blurred? See the intermediate pictures and the two original image samples.
c. How can the final result be improved? By changing the input training images, or something else?

img_dmtet_pass1_000000
img_dmtet_pass1_000041
img_dmtet_pass1_000045
img_dmtet_pass1_000079
img_dmtet_pass1_000080

IMG_4287
IMG_4332

Hardware: 3080 (24 GB), Windows 11

Samples: (see the two attached pictures for examples)
a. 50 images.
b. Resolution: 2456 (width) x 1638 (height)
c. Transform parameters generated with --aabb_scale 2 for COLMAP.

{
"ref_mesh": "data/nerf_synthetic/building",
"random_textures": true,
"iter": 8000,
"save_interval": 100,
"texture_res": [5120,5120],
"train_res": [1638, 2456],
"batch": 1,
"learning_rate": [0.03, 0.0001],
"ks_min" : [0, 0.08, 0.0],
"dmtet_grid" : 128,
"mesh_scale" : 5,
"laplace_scale" : 3000,
"display": [{"latlong" : true}, {"bsdf" : "kd"}, {"bsdf" : "ks"}, {"bsdf" : "normal"}],
"layers" : 4,
"background" : "white",
"out_dir": "nerf_building"
}

The console log:
iter= 8000, img_loss=0.061592, reg_loss=0.016066, lr=0.00075, time=499.0 ms, rem=0.00 s
Running validation
MSE, PSNR
0.02140407, 17.038
Base mesh has 214359 triangles and 105082 vertices.
Writing mesh: out/nerf_building\dmtet_mesh/mesh.obj
writing 105082 vertices
writing 224020 texcoords
writing 105082 normals
writing 214359 faces
Writing material: out/nerf_building\dmtet_mesh/mesh.mtl
Done exporting mesh
Traceback (most recent call last):
File "D:\zhansheng\proj\windows\nvdiffrec\train.py", line 620, in
geometry, mat = optimize_mesh(glctx, geometry, base_mesh.material, lgt, dataset_train, dataset_validate, FLAGS,
File "D:\zhansheng\proj\windows\nvdiffrec\train.py", line 428, in optimize_mesh
total_loss.backward()
File "C:\Users\jinshui\anaconda3\envs\dmodel\lib\site-packages\torch_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "C:\Users\jinshui\anaconda3\envs\dmodel\lib\site-packages\torch\autograd_init_.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function texture2d_mipBackward returned an invalid gradient at index 0 - got [1, 4, 4, 3] but expected shape compatible with [1, 5, 5, 3]

Regards

from nvdiffrec.

jmunkberg commented on August 19, 2024

That is an error from nvdiffrast. I would try using power-of-two resolutions for the textures and training, e.g.,

    "texture_res": [1024, 1024],
    "train_res": [1024, 1024],

in case texture2d_mipBackward is not stable for all (non-power-of-two) resolutions.
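The shape mismatch in the traceback ([1, 4, 4, 3] vs. [1, 5, 5, 3]) is consistent with how a mip chain behaves at non-power-of-two sizes: halving an odd dimension with integer floor and doubling it back does not round-trip. A minimal illustration in plain Python (my own sketch, not nvdiffrast code):

```python
def mip_chain(size):
    """List the sizes of one texture dimension across mip levels,
    halving with integer floor as mip generation does."""
    sizes = [size]
    while sizes[-1] > 1:
        sizes.append(sizes[-1] // 2)
    return sizes

# Power-of-two sizes halve cleanly at every level.
print(mip_chain(1024))  # [1024, 512, 256, 128, 64, 32, 16, 8, 4, 2, 1]

# An odd size like 5 halves to 2; doubling 2 back up gives 4, not 5 --
# the same 4-vs-5 mismatch as in the error message above.
print(mip_chain(5))  # [5, 2, 1]
```

With power-of-two resolutions every level round-trips exactly, which is why the suggested config change sidesteps the problem.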


jmunkberg commented on August 19, 2024

After the first pass, we run xatlas to create a UV parameterization on the triangle mesh. If the first pass failed to create a reasonable mesh, this step can take quite some time or even fail. How does the mesh look in your case at the end of the first pass?

For memory consumption, you can log the usage with nvidia-smi --query-gpu=memory.used --format=csv -lms 100 while training runs, to get a feel for the usage. Memory is a function of image resolution, batch size, and whether depth peeling is enabled. We ran the results in the paper on GPUs with 32+ GB of memory, but it should run on lower-spec GPUs if you decrease the rendering resolution and/or batch size.
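As a rough back-of-envelope illustration of that scaling (my own assumption about buffer layout, not numbers from the paper or the code):

```python
def framebuffer_bytes(width, height, batch=1, layers=1,
                      channels=4, bytes_per_channel=4):
    """Rough size of one set of float32 framebuffers; actual peak usage is
    several times this once gradients and intermediate buffers are counted."""
    return width * height * batch * layers * channels * bytes_per_channel

# 2456x1638 at batch 1 with 4 depth-peeling layers: ~257 MB per buffer set.
print(framebuffer_bytes(2456, 1638, batch=1, layers=4) / 1e6)
```

Halving the rendering resolution cuts this by 4x, which is why lowering train_res is the first lever to pull on a smaller GPU.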


hzhshok commented on August 19, 2024

Thanks @jmunkberg!

I raised this issue because plain 'nvidia-smi' (without parameters) showed only 7.4 GB in use in total, and that GPU has 11 GB of memory, so I suspected it was not a memory issue.

Sorry, I did once check the related output, including the mesh, but I forgot what it looked like, and that host has since been destroyed, so I can only continue to track this after setting up a new environment.

In addition, the GPU hung (nvidia-smi was blocked, with no response at all) every time train.py failed.
I understand why this feature uses such an aggressive ("starvation") strategy for GPU memory allocation, but:

Does the team have a plan to optimize this feature's memory allocation strategy?

Regards


jmunkberg commented on August 19, 2024

Hello @hzhshok ,

Looking at the error metrics:

MSE, PSNR
0.25911513, 6.480

Those are extremely large errors, so I assume the first pass did not train properly. What do the images look like in the out/nerf_handong folder (or whatever your current experiment is named)? If the reconstruction succeeded, I would expect a PSNR of 25 dB or higher. If the reconstruction fails, it is very hard to create a UV parameterization (it is hard to UV-map a triangle soup), and xatlas would fail/hang.

I suspect something else is wrong already in the first pass. A few things that can affect quality:

  • Is the lighting setup constant in all training data?
  • Do you have high quality foreground segmentation masks?
  • Are you sure that the poses are correct and that the pose indices and corresponding images match?
  • Do the training images contain substantial motion or defocus blur?

Also, just to verify, is the example from the readme python train.py --config configs/bob.json working without issues on your setup?


hzhshok commented on August 19, 2024

Hello, thanks @jmunkberg!
Yes, something must be wrong in the first pass, and I will check the training result of the labeled images, using the current images as the baseline.

Is the lighting setup constant in all training data?
-- No, some of the images have strong light, because the images were taken outdoors.
Do you have high quality foreground segmentation masks?
-- Do you mean labeling the images onto a pure background, as in the examples (chair, etc.)?
This time I just used wild outdoor images, without labels, to check whether this can work with non-constant lighting and relatively pure colors in the images; I will check the result with high quality foreground masks. :-)
In addition, I just wanted to check the effect of texturing a mesh produced by a traditional SfM tool.
Are you sure that the poses are correct and that the pose indices and corresponding images match?
-- Yes, they should be; COLMAP produced a result, although not a very good one.
Do the training images contain substantial motion or defocus blur?
-- Not much blur; the images were resized from a high resolution of 3003x4000 down to 1264x1264, which I think may have an impact.

Regards


ZirongChan commented on August 19, 2024

@hzhshok I ran into the same "texture2d_mipBackward returned an invalid gradient" error; in my case it is "got [1, 2, 2, 3] but expected [1, 3, 3, 3]".
Did you solve this issue, or did you find its cause?

@jmunkberg thanks for your great work, by the way. What would you suggest could cause this issue? Bad segmentation or something else?


hzhshok commented on August 19, 2024

@ZirongChan, I am not sure whether it is caused by memory. I used images near 2k x 2k, which cost a lot of memory on my single 24 GB GPU, so I used the strategy of splitting the image into small blocks, as @jmunkberg described in another issue; at least that error no longer happened.
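For reference, that block-splitting idea can be sketched as a random crop per iteration, so the optimizer only ever sees a window that fits in memory (a hypothetical helper for illustration, not nvdiffrec's actual API):

```python
import numpy as np

def random_crop(img, crop_h, crop_w, rng=None):
    """Return a random crop_h x crop_w window of a large training image."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    y = int(rng.integers(0, h - crop_h + 1))
    x = int(rng.integers(0, w - crop_w + 1))
    return img[y:y + crop_h, x:x + crop_w]

img = np.zeros((2048, 2048, 3), dtype=np.float32)
print(random_crop(img, 512, 512).shape)  # (512, 512, 3)
```

Each iteration then renders only a 512x512 window instead of the full 2k image, trading per-step coverage for a much smaller memory footprint.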

Regards

