Comments (11)
Sorry for the slow reply. I was OOO for two weeks on vacation and not checking email.
The first thing to try would be to run the test suite. It's not particularly robust but might catch an issue if you get lucky.
In the README.md file, scroll down to "UNIT TESTING" and follow the instructions there.
The second thing to try is to visualize the raw torch training data. In fluid_net_train.lua there's a commented out line.
When does the error occur? Is it on the first batch? During the first epoch? What learning rate and config parameters are you using (I'm assuming the defaults)?
from fluidnet.
Here is what my commands look like:
GENERATING TRAINING DATA
./manta ../scenes/_trainingData.py --dim 2 --numTest 20 --numTraining 20 --numFrames 10 --frameStride 1 --addModelGeometry True --addSphereGeometry True
RUNNING TRAINING
qlua fluid_net_train.lua -gpu 1 -dataset output_current_model_sphere -modelFilename myModel2D
LEARNING RATE AND EPOCH
-criterion = fluid
-epoch # 1 [bSize = 16] [learnRate = 0.0025] [optim = adam]
[=========================================>....] 192/208 err=9.0393e-03
WARNING: criterion error (nan) is NaN or > 1000000000
qlua: lib/run_epoch.lua:221: criterion error is NaN or > 1e3.
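For anyone reading along, the kind of guard that produces the message above can be sketched in a few lines. This is only an illustration in Python, not the actual code in lib/run_epoch.lua; the function name `check_criterion_error` and the `max_err` default are assumptions made for the example.

```python
import math

def check_criterion_error(err, max_err=1e3):
    """Return err unchanged, or raise if it is NaN or larger than max_err.

    Hypothetical helper: the name and the default threshold are assumptions
    for illustration, not taken from the FluidNet sources.
    """
    if math.isnan(err) or err > max_err:
        raise FloatingPointError(
            f"criterion error ({err}) is NaN or > {max_err:g}"
        )
    return err

# A healthy batch error passes straight through.
ok = check_criterion_error(9.0393e-03)
```

Calling such a check after every batch stops training at the first bad batch, which makes it easier to see which batch size or learning rate triggered the blow-up.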
PASSING TESTS
lib/modules/test_ALL_MODULES.lua -- OK
./manta ../scenes/_testData.py ---------- No errors, but I'm worried about the following output lines:
FluidSolver::solvePressure iterations:51000, res:-nan
FluidSolver::solvePressure skipping CorrectVelocity since res is nan!
FluidSolver::solvePressure iterations:85, res:6.013132e-06
Is this OK?
qlua -ltfluids -e "tfluids.test()" -------------- ERROR
Running 17 tests
Completed 442 asserts in 17 tests with 0 failures and 4 errors
Function call failed
...ta/distro/install/share/lua/5.1/tfluids/test_tfluids.lua:192: Hard-coded just in case something stupid happens
stack traceback:
[C]: in function 'assert'
...ta/distro/install/share/lua/5.1/tfluids/test_tfluids.lua:192: in function 'loadMantaBatch'
...ta/distro/install/share/lua/5.1/tfluids/test_tfluids.lua:651: in function <...ta/distro/install/share/lua/5.1/tfluids/test_tfluids.lua:640>
[C]: in function 'xpcall'
...o/big_data/distro/install/share/lua/5.1/torch/Tester.lua:477: in function '_pcall'
...o/big_data/distro/install/share/lua/5.1/torch/Tester.lua:436: in function '_run'
...o/big_data/distro/install/share/lua/5.1/torch/Tester.lua:355: in function 'run'
...ta/distro/install/share/lua/5.1/tfluids/test_tfluids.lua:1261: in function 'test'
[string "tfluids.test()"]:1: in main chunk
The same error occurs 4 times.
The visualized raw torch training data looks fine. Please correct me if I'm wrong: do the grayscale images show the density?
from fluidnet.
Interesting, Manta has trouble solving one of the Linear systems for test data. I haven't seen this before. Sorry for the hassle.
You could try changing the seed:
https://github.com/kristofe/manta/blob/master/scenes/_testData.py#L24
I'm almost positive that would fix that particular issue. However, I don't think this is the root cause of your training instability, since that portion of the test suite exercises functions that aren't involved in training the CNN. Still, let's make sure the full suite of unit tests runs, just to rule out any issues.
If changing the seed doesn't work (and you should try a few just to make sure), can you please figure out which line in _testData.py is causing the solvePressure call to fail? Then I can try debugging it.
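One way to narrow this down is to capture the Manta output in a file and scan it for solves that ended with a NaN residual. The sketch below is a rough Python example that assumes the log lines have exactly the format quoted earlier in this thread; `find_failed_solves` is a made-up helper name, and mapping a failing solve back to a line in _testData.py would still need extra print statements in the scene script.

```python
import re

# Matches lines such as:
#   FluidSolver::solvePressure iterations:51000, res:-nan
# The format is assumed from the output pasted in this thread.
SOLVE_RE = re.compile(r"solvePressure iterations:(\d+), res:(\S+)")

def find_failed_solves(log_text):
    """Return (line_number, iterations) for each solve whose residual is NaN."""
    failures = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        match = SOLVE_RE.search(line)
        if match and "nan" in match.group(2).lower():
            failures.append((lineno, int(match.group(1))))
    return failures

sample_log = """\
FluidSolver::solvePressure iterations:51000, res:-nan
FluidSolver::solvePressure skipping CorrectVelocity since res is nan!
FluidSolver::solvePressure iterations:85, res:6.013132e-06
"""

failed = find_failed_solves(sample_log)
print(failed)
```

The line numbers refer to the captured log, so counting the solves before the first failure gives a rough idea of which simulation and frame broke.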
from fluidnet.
You're right, changing the seed helped with '_testData.py':
./manta ../scenes/_testData.py --seed 55
I tried the same seed number with "_trainingData.py":
./manta ../scenes/_trainingData.py --dim 2 --numTest 20 --numTraining 20 --numFrames 10 --frameStride 1 --seed 55 --addModelGeometry True --addSphereGeometry True
But it didn't help training get through (I had already tried different seeds a couple of days ago).
The training data looks OK to me, based on your commented-out "Visualize a Training Batch" code, where I can see the density of the training data.
When I changed batchSize to 8, training got through the first epoch and is still running.
I will let you know the result.
...
from fluidnet.
Ahh interesting. So it's a training stability issue... I was worried this might happen if users generate new data, because I couldn't see an easy way to ensure the python + manta random generators would be seeded consistently across platforms.
Yeah, so now the standard techniques for tuning SGD / ADAM hyperparams would all be relevant. Try playing with BatchSize, the L2 Norm gradient clipping value (I forget the exact config parameter), Learning Rate and Momentum.
Actually, I would turn down the gradient clipping magnitude first and see if that works. By default it's 1, but I would try as low as 0.2.
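For readers unfamiliar with the technique, L2-norm gradient clipping rescales the whole gradient vector whenever its norm exceeds a threshold. The following is a minimal Python sketch of the general idea, not the FluidNet implementation; `clip_grad_norm` and its arguments are illustrative names.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Return a copy of grads rescaled so its L2 norm is at most max_norm.

    Illustrative helper only; the actual FluidNet config parameter and
    code path are not shown here.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return list(grads)

# A gradient with norm 5 is scaled down to norm 1 (the default mentioned above)...
clipped_default = clip_grad_norm([3.0, 4.0], max_norm=1.0)
# ...and to norm 0.2 with the tighter setting suggested above.
clipped_tight = clip_grad_norm([3.0, 4.0], max_norm=0.2)
```

Lowering the threshold keeps the direction of each update but caps its size, which is why it can tame a run that diverges to NaN part way through an epoch.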
from fluidnet.
Works great!
But let's come back to the issue of generating training data and seeds:
FluidSolver::solvePressure skipping CorrectVelocity since res is nan!
Changing the seed only helps for part of the simulations; others still fail with the ".. res is nan" error (and the more simulations we run, the more likely the error becomes).
What does 'res' mean? Is it resolution?
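A note on terminology: in iterative linear solvers, "res" in a message like this most likely stands for the residual (how far the current solution is from satisfying the linear system), not the resolution. That reading is an assumption based on how such solvers usually report progress, not something confirmed from the Manta source. The Python sketch below shows a plain conjugate gradient loop that stops only when the residual drops below a tolerance; once the residual is NaN that test can never succeed, so the loop runs to its iteration cap, which would be consistent with the very large iteration count in the failing log line.

```python
import math

def conjugate_gradient(A, b, tol=1e-6, max_iter=1000):
    """Solve A x = b for a small symmetric positive definite matrix A.

    Returns (x, iterations, residual_norm). Illustrative sketch only.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # residual r = b - A x, with x = 0
    p = list(r)                      # initial search direction
    rs_old = sum(ri * ri for ri in r)
    res = math.sqrt(rs_old)
    if res < tol:
        return x, 0, res
    for it in range(1, max_iter + 1):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        res = math.sqrt(rs_new)
        if res < tol:                # a NaN residual never passes this test
            return x, it, res
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x, max_iter, res

A = [[4.0, 1.0], [1.0, 3.0]]
x_ok, it_ok, res_ok = conjugate_gradient(A, [1.0, 2.0])

# A single NaN in the right-hand side poisons every later quantity.
x_bad, it_bad, res_bad = conjugate_gradient(A, [float("nan"), 2.0], max_iter=50)
```

In the well-posed case the residual falls below the tolerance within a couple of iterations; in the poisoned case the solver reports the full 50 iterations and a NaN residual.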
from fluidnet.
I also get the same error when running:
./manta ../scenes/_trainingData.py --dim 3 --addModelGeometry True --addSphereGeometry True
FluidSolver::solvePressure skipping CorrectVelocity since res is nan
What was the fix?
from fluidnet.
Hi Joepareti,
What happens if you go further and try to train your fluid_net? For example, with '-batchSize' set to 10 or even less.
Cheers
Ostap
from fluidnet.
Works great!
But let's come back to the issue of generating training data and seeds:
FluidSolver::solvePressure skipping CorrectVelocity since res is nan!
Changing the seed only helps for part of the simulations; others still fail with the ".. res is nan" error (and the more simulations we run, the more likely the error becomes).
What does 'res' mean? Is it resolution?
Hi @RiLights, I have the same 'solvePressure' problem.
Following your conversation with Jonathan, I tried changing the seed, but it didn't work. It often seems to break down in simulation 4 or 5 (of 640 total).
Do you now have a solution, or any ideas, for the "solvePressure" problem when generating training data?
from fluidnet.
Hi @cdibona,
To be honest, I don't really remember what the problem was.
Take a look at your simulated data (the data used for training). Is all of it correct?
From what I remember, one of the biggest problems was related to my GPU. When I switched to an Nvidia 1080 Ti, most of the issues disappeared.
from fluidnet.
Closing this out because it seems like a GPU change fixed this issue. Otherwise feel free to reopen and let me know if you still run into problems (I reran training today and didn't have any issues).
from fluidnet.