criterion error is NaN about fluidnet HOT 11 CLOSED

google commented on May 3, 2024

criterion error is NaN

from fluidnet.

Comments (11)

jonathantompson commented on May 3, 2024

Sorry for the slow reply. I was OOO for two weeks on vacation and not checking email.

The first thing to try would be to run the test suite. It's not particularly robust but might catch an issue if you get lucky.

In the README.md file, scroll down to "UNIT TESTING" and follow the instructions there.

The second thing to try is to visualize the raw torch training data. In fluid_net_train.lua there's a commented-out line for this.

When does the error occur? Is it on the first batch? During the first epoch? Which learning rate and config parameters are you using? (I'm assuming the defaults.)
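
One way to answer the "when" question is to record the index of the first batch whose criterion error stops being finite. The following is a minimal Python sketch of that idea, not code from the fluidnet repository (which is Lua/Torch). The function name `first_bad_batch`, the sample error values, and the `1e3` threshold are all assumptions made for illustration.

```python
import math


def first_bad_batch(batch_errors, max_err=1e3):
    """Return (index, value) of the first error that is NaN, infinite,
    or larger than max_err; return None if every error is acceptable."""
    for i, err in enumerate(batch_errors):
        if not math.isfinite(err) or err > max_err:
            return i, err
    return None


# Hypothetical per-batch errors; the last one mimics a NaN criterion error.
errors = [9.1e-3, 9.0e-3, float("nan")]
hit = first_bad_batch(errors)
print(hit[0] if hit else "all batches finite")
```

Knowing whether the first bad batch is batch 0 or appears late in the epoch helps separate a data problem from an optimizer stability problem.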

RiLights commented on May 3, 2024

Here is what my commands look like:
GENERATING TRAINING DATA
./manta ../scenes/_trainingData.py --dim 2 --numTest 20 --numTraining 20 --numFrames 10 --frameStride 1 --addModelGeometry True --addSphereGeometry True

RUNNING TRAINING
qlua fluid_net_train.lua -gpu 1 -dataset output_current_model_sphere -modelFilename myModel2D

LEARNING RATE AND EPOCH

      -criterion = fluid
      -epoch # 1 [bSize = 16] [learnRate = 0.0025] [optim = adam]
      [=========================================>....]  192/208 err=9.0393e-03
      WARNING: criterion error (nan) is NaN or > 1000000000
      qlua: lib/run_epoch.lua:221: criterion error is NaN or > 1e3.

PASSING TESTS
lib/modules/test_ALL_MODULES.lua -- OK
./manta ../scenes/_testData.py ---------- No errors, but I'm worried about the following output lines:

     FluidSolver::solvePressure iterations:51000, res:-nan
     FluidSolver::solvePressure skipping CorrectVelocity since res is nan!
     FluidSolver::solvePressure iterations:85, res:6.013132e-06

Is that OK?

qlua -ltfluids -e "tfluids.test()" -------------- ERROR

      Running 17 tests
      Completed 442 asserts in 17 tests with 0 failures and 4 errors
      Function call failed
      ...ta/distro/install/share/lua/5.1/tfluids/test_tfluids.lua:192: Hard-coded just in case something stupid happens
      stack traceback:
              [C]: in function 'assert'
              ...ta/distro/install/share/lua/5.1/tfluids/test_tfluids.lua:192: in function 'loadMantaBatch'
              ...ta/distro/install/share/lua/5.1/tfluids/test_tfluids.lua:651: in function <...ta/distro/install/share/lua/5.1/tfluids/test_tfluids.lua:640>
              [C]: in function 'xpcall'
              ...o/big_data/distro/install/share/lua/5.1/torch/Tester.lua:477: in function '_pcall'
              ...o/big_data/distro/install/share/lua/5.1/torch/Tester.lua:436: in function '_run'
              ...o/big_data/distro/install/share/lua/5.1/torch/Tester.lua:355: in function 'run'
              ...ta/distro/install/share/lua/5.1/tfluids/test_tfluids.lua:1261: in function 'test'
              [string "tfluids.test()"]:1: in main chunk

The same error occurs 4 times.

The visualized raw torch training data looks fine. Please correct me if I'm wrong: do the grayscale images show the density?

jonathantompson commented on May 3, 2024

Interesting, Manta has trouble solving one of the linear systems for the test data. I haven't seen this before. Sorry for the hassle.

You could try changing the seed:

https://github.com/kristofe/manta/blob/master/scenes/_testData.py#L24

I'm almost positive that would fix that particular issue. However, I also don't think this is the root cause of your training instability (since that portion of the test suite is testing functions not involved in training the CNN). But let's make sure the full suite of unit tests runs, just to rule out any issues.

If changing the seed doesn't work (and you should try a few just to make sure), can you please figure out which line in _testData.py is causing the solvePressure call to fail? Then I can try debugging it.

RiLights commented on May 3, 2024

You're right, changing the seed helped with '_testData.py':
./manta ../scenes/_testData.py --seed 55
I tried the same seed number with "_trainingData.py":
./manta ../scenes/_trainingData.py --dim 2 --numTest 20 --numTraining 20 --numFrames 10 --frameStride 1 --seed 55 --addModelGeometry True --addSphereGeometry True
But it didn't help training get through (I had already tried different seeds a couple of days ago).

The training data looks OK to me, judging by your commented-out "Visualize a Training Batch" code, where I can see the density of the training data.

When I changed batchSize to 8, training got through the first epoch and is still running.
I will let you know the result.
...

jonathantompson commented on May 3, 2024

Ahh interesting. So it's a training stability issue... I was worried this might happen if users generate new data, because I couldn't see an easy way to ensure the python + manta random generators would be seeded consistently across platforms.

Yeah, so now the standard techniques for tuning SGD / ADAM hyperparams would all be relevant. Try playing with BatchSize, the L2 Norm gradient clipping value (I forget the exact config parameter), Learning Rate and Momentum.

Actually, I would turn down the gradient clipping magnitude first and see if that works. By default it's 1, but I would try as low as 0.2.
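
To make the clipping suggestion concrete, here is a minimal Python sketch of one common form of L2-norm gradient clipping: if the gradient's L2 norm exceeds a limit, the whole gradient is rescaled so its norm equals that limit. This is an illustration only. Whether fluid_net_train.lua clips in exactly this way is not confirmed in this thread, and the function name `clip_grad_l2` and the sample gradient are hypothetical.

```python
import math


def clip_grad_l2(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns (clipped_grads, original_norm). Gradients already within the
    limit are returned unchanged.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        grads = [g * scale for g in grads]
    return grads, norm


grads = [3.0, 4.0]  # hypothetical gradient with L2 norm 5.0
for max_norm in (1.0, 0.2):  # the default and the lower value suggested above
    clipped, norm = clip_grad_l2(grads, max_norm)
    print(max_norm, clipped)
```

A lower limit shrinks every oversized update proportionally, which keeps the update direction but caps the step size. Note that in this sketch a NaN gradient passes through unchanged, because the comparison `norm > max_norm` is false for NaN, so clipping helps prevent divergence but cannot repair it afterwards.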

RiLights commented on May 3, 2024

Works great!

But let's come back to the issue of generating training data and seeds:
FluidSolver::solvePressure skipping CorrectVelocity since res is nan!
Changing the seed only helps for some of the simulations; the others still fail with ".. res is nan" (and the more simulations we run, the higher the probability of hitting the error).

What does 'res' mean? Is it resolution?
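
In the solvePressure log lines quoted earlier in this thread, 'res' is printed right next to the solver's iteration count, which suggests it is the residual of the iterative pressure solve rather than the grid resolution. That reading is an inference and is not confirmed in this thread. To see how many solves are affected, a small script can scan the Manta output for NaN values of 'res'. The Python sketch below assumes the log format shown in this thread; the function name `find_nan_solves` is hypothetical.

```python
import re

# Matches lines such as:
#   FluidSolver::solvePressure iterations:85, res:6.013132e-06
LINE_RE = re.compile(r"solvePressure iterations:(\d+), res:(\S+)")


def find_nan_solves(log_text):
    """Return (solve_index, iterations) for every solve whose 'res' is NaN."""
    bad = []
    for idx, match in enumerate(LINE_RE.finditer(log_text)):
        iterations, res = int(match.group(1)), match.group(2)
        if "nan" in res.lower():
            bad.append((idx, iterations))
    return bad


# Sample lines copied from the output reported in this thread.
sample = """FluidSolver::solvePressure iterations:51000, res:-nan
FluidSolver::solvePressure skipping CorrectVelocity since res is nan!
FluidSolver::solvePressure iterations:85, res:6.013132e-06
"""
print(find_nan_solves(sample))
```

The solve index counts solvePressure calls in order of appearance, so it can be mapped back to a simulation and frame if the number of solves per frame is known.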

joepareti54 commented on May 3, 2024

I also get the same error when running:
./manta ../scenes/_trainingData.py --dim 3 --addModelGeometry True --addSphereGeometry True

FluidSolver::solvePressure skipping CorrectVelocity since res is nan

What was the fix?

RiLights commented on May 3, 2024

Hi Joepareti,

What happens if you go further and try to train your fluid_net anyway? For example, with '-batchSize' set to 10 or even less.

Cheers
Ostap

cshouu commented on May 3, 2024

Hi @RiLights, I have the same 'solvePressure' problem.
Following your discussion with Jonathan, I tried changing the seed, but it didn't work. It often seems to break down at simulation 4 or 5 (of 640 in total).
Do you now have a solution, or any ideas, for the "solvePressure" problem when generating training data?

RiLights commented on May 3, 2024

Hi @cdibona,

To be honest I don't really remember what the problem was.
Take a look at your simulated data (the data for training). Is all of it correct?

From what I remember, one of the biggest problems was related to my GPU. When I switched to an Nvidia 1080 Ti, most of the issues disappeared.

jonathantompson commented on May 3, 2024

Closing this out because it seems like a GPU change fixed this issue. Otherwise feel free to reopen and let me know if you still run into problems (I reran training today and didn't have any issues).
