Comments (11)
It is always the exact same stack trace; only the number of epochs before the error occurs varies.
Thank you for the workaround, but since restarting works, I think I'll keep the flexibility of the trained model file. For now I stop most runs before 800 epochs anyhow.
from nequip.
OK, I have submitted a PR to e3nn that should implement a workaround to resolve this issue: e3nn/e3nn#297
If this is a problem for you, please try to install my branch from the linked PR.
from nequip.
I found that when training on a set of 10 configurations, the error occurs after 780 epochs (around 18 min).
My GPU is an NVIDIA GeForce GTX 1050 Ti.
The zip contains the data and the config file used to reproduce the error.
from nequip.
Hi @keano130 ,
Thanks for reaching out!
This is very strange. We've seen this in our group only once before and I assumed it was some kind of corruption, but if you've seen it in a different computing environment it's definitely not.
@Nicola89 could you post your version information from when you saw this so we can compare to @keano130's?
At this point I don't really have any suspicions about the source of this, but I will look into it and let you know if I find anything or have questions.
Are you able to successfully restart the training session using nequip-restart? Does the bug reoccur after restarting?
Thanks!
from nequip.
Sure! When I encountered this error I was using the following env:
python=3.8.11
cudatoolkit=11.1.74
pytorch=1.9.0 (cu111)
pytorch-geometric=1.7.2
e3nn=0.3.3
nequip=0.3.3
I can also export and attach here the conda env if that is helpful. I concur with @keano130 about when the bug happens, i.e., deep in the training. I am attaching the error file of the test aspirin run where this happened.
recursion_depth.err.zip
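For comparing environments across reports like the one above, the version list can also be collected programmatically. The following is a minimal sketch using only the standard library. The pip distribution names `torch` and `torch-geometric` are assumptions and may differ from the names conda shows in your environment.

```python
from importlib import metadata
import platform

# Distribution names are assumptions based on the environment listed above;
# adjust them to match how the packages were installed.
PACKAGES = ["torch", "torch-geometric", "e3nn", "nequip"]


def collect_versions(packages):
    """Return a dict mapping each package name to its version string,
    or to 'not installed' if no metadata is found for it."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    print(f"python={platform.python_version()}")
    for name, version in collect_versions(PACKAGES).items():
        print(f"{name}={version}")
```

Pasting this output into a report makes it easier to spot which package versions two affected setups have in common.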
from nequip.
Were either of you running under wandb, and if you were, could you check the GPU & system memory consumption?
One possibility is that this is a memory leak leading to an eventual OOM error that just isn't very informative, since new_zeros is an alloc... although in that circumstance you wouldn't expect it to consistently fail on this new_zeros...
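If wandb is not in use, one rough way to test the memory-leak hypothesis is to log GPU memory once per epoch and watch for steady growth. The sketch below is an illustration only: `gpu_memory_mib` and `log_memory` are hypothetical helper names, not part of nequip, and the numbers cover only memory managed by PyTorch's own allocator.

```python
try:
    import torch
except ImportError:  # lets the sketch be imported on machines without PyTorch
    torch = None


def gpu_memory_mib(device=0):
    """Return current and peak GPU memory allocated by PyTorch, in MiB.

    Returns None if PyTorch or a CUDA device is not available.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return {
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,
        "peak_mib": torch.cuda.max_memory_allocated(device) / mib,
    }


def log_memory(epoch, device=0):
    """Print one line of memory statistics for the given epoch (hypothetical helper)."""
    stats = gpu_memory_mib(device)
    if stats is None:
        print(f"epoch {epoch}: no CUDA device available")
    else:
        print(
            f"epoch {epoch}: allocated={stats['allocated_mib']:.1f} MiB, "
            f"peak={stats['peak_mib']:.1f} MiB"
        )
```

Calling `log_memory(epoch)` at the end of each epoch should show a roughly flat allocated value if there is no leak on the PyTorch side.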
from nequip.
I was running wandb; both GPU and system memory consumption were far from their maximum in my case.
from nequip.
After the error, it is possible to just restart the training, and it trains normally until after around 800 more epochs, where it fails again.
from nequip.
Interesting, thanks for the info @keano130. Does it fail in the exact same way, with the exact same stack trace?
A workaround appears to be enabling compile_model: True in your config. This compiles the model down to TorchScript for training. (Please note that if you do this the trained model file is somewhat less flexible / useful, since you can't go poking around in the Python module tree later, although I think the parameters can be loaded from it into a Python model if you really need to.)
from nequip.
Update on my side: restart gives the same problem after a similar number of epochs (2923 vs 2964).
from nequip.
e3nn has made a new release incorporating the bugfix: https://github.com/e3nn/e3nn/releases/tag/0.3.5
So you can now get around this just by installing e3nn==0.3.5.
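To confirm that an environment already has the fixed release, the installed e3nn version can be compared against 0.3.5. This is a minimal sketch, assuming plain numeric version strings such as `0.3.5` (no pre-release suffixes) and assuming that later releases retain the fix. `parse_version` and `has_fix` are illustrative names, not part of e3nn or nequip.

```python
from importlib import metadata

# e3nn release named in this thread as containing the bugfix.
FIXED_IN = (0, 3, 5)


def parse_version(text):
    """Convert a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def has_fix(version_text, fixed_in=FIXED_IN):
    """Return True if the given version is at least the fixed release."""
    return parse_version(version_text) >= fixed_in


if __name__ == "__main__":
    try:
        installed = metadata.version("e3nn")
    except metadata.PackageNotFoundError:
        print("e3nn is not installed")
    else:
        print(f"e3nn {installed}: fix included = {has_fix(installed)}")
```

With the environment reported earlier in the thread (e3nn 0.3.3), `has_fix` returns `False`, which indicates that an upgrade is needed.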
from nequip.