Comments (11)

jonathantompson commented on May 3, 2024

Sorry for the slow reply. I was OOO for two weeks on vacation and not checking email.

The first thing to try would be to run the test suite. It's not particularly robust but might catch an issue if you get lucky.

In the README.md file, scroll down to "UNIT TESTING" and follow the instructions there.

The second thing to try is to visualize the raw torch training data. In fluid_net_train.lua there's a commented-out line for this.

When does the error occur? Is it on the first batch? During the first epoch? What learning rate and config parameters are you using (I'm assuming the defaults)?

from fluidnet.

RiLights commented on May 3, 2024

Here are my commands:
GENERATING TRAINING DATA
./manta ../scenes/_trainingData.py --dim 2 --numTest 20 --numTraining 20 --numFrames 10 --frameStride 1 --addModelGeometry True --addSphereGeometry True

RUNNING TRAINING
qlua fluid_net_train.lua -gpu 1 -dataset output_current_model_sphere -modelFilename myModel2D

LEARNING RATE AND EPOCH

      -criterion = fluid
      -epoch # 1 [bSize = 16] [learnRate = 0.0025] [optim = adam]
      [=========================================>....]  192/208 err=9.0393e-03
      WARNING: criterion error (nan) is NaN or > 1000000000
      qlua: lib/run_epoch.lua:221: criterion error is NaN or > 1e3.

PASSING TESTS
lib/modules/test_ALL_MODULES.lua -- OK
./manta ../scenes/_testData.py -- runs without errors, but I'm worried about the following output lines:

     FluidSolver::solvePressure iterations:51000, res:-nan
     FluidSolver::solvePressure skipping CorrectVelocity since res is nan!
     FluidSolver::solvePressure iterations:85, res:6.013132e-06

Is that OK?

qlua -ltfluids -e "tfluids.test()" -- ERROR

      Running 17 tests
      Completed 442 asserts in 17 tests with 0 failures and 4 errors
      Function call failed
      ...ta/distro/install/share/lua/5.1/tfluids/test_tfluids.lua:192: Hard-coded just in case something stupid happens
      stack traceback:
        [C]: in function 'assert'
        ...ta/distro/install/share/lua/5.1/tfluids/test_tfluids.lua:192: in function 'loadMantaBatch'
        ...ta/distro/install/share/lua/5.1/tfluids/test_tfluids.lua:651: in function <...ta/distro/install/share/lua/5.1/tfluids/test_tfluids.lua:640>
        [C]: in function 'xpcall'
        ...o/big_data/distro/install/share/lua/5.1/torch/Tester.lua:477: in function '_pcall'
        ...o/big_data/distro/install/share/lua/5.1/torch/Tester.lua:436: in function '_run'
        ...o/big_data/distro/install/share/lua/5.1/torch/Tester.lua:355: in function 'run'
        ...ta/distro/install/share/lua/5.1/tfluids/test_tfluids.lua:1261: in function 'test'
        [string "tfluids.test()"]:1: in main chunk

The same error occurs 4 times.

Visualizing the raw torch training data looks fine. Please correct me if I'm wrong: do the grayscale images show density?

jonathantompson commented on May 3, 2024

Interesting, Manta has trouble solving one of the linear systems for the test data. I haven't seen this before. Sorry for the hassle.

You could try changing the seed:

https://github.com/kristofe/manta/blob/master/scenes/_testData.py#L24

I'm almost positive that would fix that particular issue. However, I also don't think this is the root cause of your training instability (since that portion of the test suite tests functions that are not involved in training the CNN). But let's make sure the full suite of unit tests runs, just to rule out any issues.

If changing the seed doesn't work (and you should try a few just to make sure), can you please figure out which line in _testData.py is causing the solvePressure call to fail? Then I can try debugging it.
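
One way to narrow this down is to capture Manta's output and scan it for `solvePressure` lines whose residual is not a finite number. The following is a minimal sketch, assuming the log format matches the lines quoted earlier in this thread; `find_failed_solves` and `sample_log` are hypothetical names, not part of Manta or FluidNet.

```python
import math
import re

# Matches lines such as:
#   FluidSolver::solvePressure iterations:51000, res:-nan
# The format is assumed from the output quoted in this thread.
SOLVE_RE = re.compile(r"solvePressure iterations:(\d+), res:(\S+)")


def find_failed_solves(log_text):
    """Return (line_number, iterations) for each solve with a non-finite residual."""
    failures = []
    for line_number, line in enumerate(log_text.splitlines(), start=1):
        match = SOLVE_RE.search(line)
        if match is None:
            continue  # not a solver summary line
        iterations = int(match.group(1))
        try:
            residual = float(match.group(2))  # float() accepts "nan" and "-nan"
        except ValueError:
            residual = math.nan  # treat unparseable values as failures
        if not math.isfinite(residual):
            failures.append((line_number, iterations))
    return failures


sample_log = """FluidSolver::solvePressure iterations:51000, res:-nan
FluidSolver::solvePressure skipping CorrectVelocity since res is nan!
FluidSolver::solvePressure iterations:85, res:6.013132e-06"""

print(find_failed_solves(sample_log))
```

Redirecting the output of `./manta ../scenes/_testData.py` to a file and passing its contents to this function gives the log line numbers of the failing solves, which can then be matched against the surrounding output to find the responsible step.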

RiLights commented on May 3, 2024

You're right, changing the seed helped for '_testData.py':
./manta ../scenes/_testData.py --seed 55
I tried the same seed with "_trainingData.py":
./manta ../scenes/_trainingData.py --dim 2 --numTest 20 --numTraining 20 --numFrames 10 --frameStride 1 --seed 55 --addModelGeometry True --addSphereGeometry True
But it didn't help training get through (I had already tried different seeds a couple of days ago).

The training data looks OK, judging by your commented-out "Visualize a Training Batch" line, where I can see the density of the training data.

When I changed batchSize to 8, training got through the first epoch and is still running.
I will let you know the result.
...

jonathantompson commented on May 3, 2024

Ahh interesting. So it's a training stability issue... I was worried this might happen if users generate new data, because I couldn't see an easy way to ensure the python + manta random generators would be seeded consistently across platforms.

Yeah, so now the standard techniques for tuning SGD / ADAM hyperparams would all be relevant. Try playing with BatchSize, the L2 Norm gradient clipping value (I forget the exact config parameter), Learning Rate and Momentum.

Actually, I would turn down the gradient clipping magnitude first and see if that works. By default it's 1, but I would try as low as 0.2.
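
To illustrate what lowering the clipping magnitude does, here is a minimal sketch of L2-norm gradient clipping in plain Python. It is not FluidNet's Lua implementation, and `clip_by_l2_norm` is a hypothetical helper; the values 1 and 0.2 are the ones mentioned above.

```python
import math


def clip_by_l2_norm(gradients, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm or norm == 0.0:
        return list(gradients)  # already within the limit, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in gradients]


grads = [3.0, 4.0]  # L2 norm is 5.0

clipped_default = clip_by_l2_norm(grads, 1.0)  # clipping value of 1
clipped_low = clip_by_l2_norm(grads, 0.2)      # lower clipping value of 0.2

print(clipped_default, clipped_low)
```

The direction of the gradient is preserved and only its length is capped, so a smaller clipping value limits how far a single bad batch can move the weights.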

RiLights commented on May 3, 2024

Works great!

But let's come back to the issue with generating training data and seeds:
FluidSolver::solvePressure skipping CorrectVelocity since res is nan!
Changing the seed only helps for some of the simulations; others still fail with ".. res is nan" (the more simulations we run, the more likely the error becomes).

What does 'res' mean? Is it resolution?
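
On the meaning of `res`: in log lines from an iterative linear solver, `res` usually abbreviates the residual (how far the current solution is from satisfying the system) rather than resolution. The thread does not confirm this, so treat it as an assumption. The sketch below is a generic conjugate gradient solver in plain Python, not Manta's pressure solver, and only shows how an iteration count and a residual are typically produced together.

```python
import math


def conjugate_gradient(A, b, tol=1e-6, max_iter=100):
    """Solve A x = b for a symmetric positive-definite A.

    Returns (x, iterations, residual_norm), where residual_norm = ||b - A x||.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)  # residual b - A x for the initial guess x = 0
    p = list(r)  # first search direction
    rs_old = sum(ri * ri for ri in r)
    if math.sqrt(rs_old) < tol:
        return x, 0, math.sqrt(rs_old)
    for iteration in range(1, max_iter + 1):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if math.sqrt(rs_new) < tol:
            return x, iteration, math.sqrt(rs_new)
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    # Did not converge within max_iter iterations
    return x, max_iter, math.sqrt(rs_old)


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, iterations, res = conjugate_gradient(A, b)
print("iterations:%d, res:%e" % (iterations, res))
```

Under this reading, a line such as `iterations:85, res:6.013132e-06` reports a solve that converged, while a very large iteration count with `res:-nan` reports one that never did.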

joepareti54 commented on May 3, 2024

I also get the same error when running:
./manta ../scenes/_trainingData.py --dim 3 --addModelGeometry True --addSphereGeometry True

FluidSolver::solvePressure skipping CorrectVelocity since res is nan

What was the fix?

RiLights commented on May 3, 2024

Hi Joepareti,

What happens if you go ahead and try to train your fluid_net anyway, for example with '-batchSize' set to 10 or even less?

Cheers
Ostap

cshouu commented on May 3, 2024

Hi @RiLights, I have the same problem with 'solvePressure'.
Following your discussion with Jonathan, I tried changing the seed, but it didn't work. It often seems to break down in simulation 4 or 5 (of 640 total).
Do you now have a solution, or any idea about the "solvePressure" problem when generating training data?

RiLights commented on May 3, 2024

Hi @cdibona,

To be honest I don't really remember what the problem was.
Take a look at your simulated data (the data for training). Is all of the data correct?

From what I remember, one of the biggest problems was related to my GPU. When I switched to an Nvidia 1080 Ti, most of the issues disappeared.

jonathantompson commented on May 3, 2024

Closing this out because it seems like a GPU change fixed this issue. Otherwise feel free to reopen and let me know if you still run into problems (I reran training today and didn't have any issues).
