Comments (7)
Follow-up: I updated to PyTorch 1.9.1, same error.
from nequip.
Follow-up 2:
I tried with the LAMMPS module instead of the ASE module. Same error. It works without a GPU, but fails on the GPU. The GPU is an NVIDIA RTX 3090 with compute capability 8.6.
I really don't know what I am doing, but I tried to google for similar problems. Other software packages produce this error when there is a dot in the name of a variable passed to a function. That fits the error in the attached error message. The first error is at line 18, and that line reads
extern "C" __global__
void fused_mul_div_sin_div_mul_mul(float* t_, float* t__, float* aten_mul, float* aten_mul_1, float* aten_sin, float* aten_div, float* aten_mul_2, float* const_self.model.func.radial_basis.basis.bessel_weights) {
The last parameter contains numerous dots in the variable name, which certainly looks wrong.
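A minimal sketch of how one could check this mechanically, assuming the generated signature has been copied out of the error message as a string. The helper `invalid_parameter_names` is hypothetical (it is not part of NequIP or PyTorch); it only tests each parameter name against the C/CUDA identifier rule, which does not allow dots.

```python
import re
from typing import List

# A C/CUDA identifier: a letter or underscore, then letters, digits, or underscores.
C_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def invalid_parameter_names(signature: str) -> List[str]:
    """Return parameter names in a C-style signature that are not valid identifiers.

    Hypothetical helper for illustration; it handles simple "type name" parameters only.
    """
    # Take the text between the first "(" and the last ")".
    params = signature[signature.index("(") + 1 : signature.rindex(")")]
    bad = []
    for param in params.split(","):
        if not param.strip():
            continue  # empty parameter list, e.g. "void f()"
        # The name is the last whitespace-separated token: "float* t_" -> "t_".
        name = param.strip().split()[-1].lstrip("*&")
        if not C_IDENTIFIER.match(name):
            bad.append(name)
    return bad


# Signature copied from line 18 of the generated kernel in the error message.
signature = (
    "void fused_mul_div_sin_div_mul_mul(float* t_, float* t__, float* aten_mul, "
    "float* aten_mul_1, float* aten_sin, float* aten_div, float* aten_mul_2, "
    "float* const_self.model.func.radial_basis.basis.bessel_weights) {"
)
print(invalid_parameter_names(signature))
```

Only the last parameter is reported, which matches the observation that the dotted name is what the compiler rejects.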
Follow-up 3:
This is neither related to our GPUs, nor to having PyTorch installed with EasyBuild.
I tried it out on another cluster with NVIDIA TITAN Xp GPUs, and with Python and PyTorch installed with conda. Same result.
EDIT: This install used the main branch of NequIP, not the developer branch.
PROBLEM SOLVED:
I think it is a documentation problem. According to the main page (README.md), "NequIP is also not currently compatible with PyTorch 1.10; PyTorch 1.9 can be specified with pytorch==1.9 in the install command."
It looks like PyTorch 1.9 is the problem; upgrading to 1.10.0 solves the issue.
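A minimal sketch for checking whether an environment is on a PyTorch release older than 1.10, the version that resolved the error in this thread. The helper names `parse_major_minor` and `predates_fix` are hypothetical, and the `(1, 10)` threshold is taken from this thread only. The function takes a version string so it can be tried without PyTorch installed; in a real environment the string would come from `torch.__version__`.

```python
import re


def parse_major_minor(version: str):
    """Extract (major, minor) from a version string such as '1.9.1+cu111'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def predates_fix(version: str, fixed_in=(1, 10)) -> bool:
    """Return True if `version` is older than `fixed_in`.

    The default threshold (1, 10) is an assumption based on this thread.
    """
    return parse_major_minor(version) < fixed_in


for candidate in ("1.9.0", "1.9.1+cu111", "1.10.0"):
    status = "older than 1.10" if predates_fix(candidate) else "1.10 or newer"
    print(f"{candidate}: {status}")
```

Comparing integer tuples rather than raw strings matters here, because a plain string comparison would order "1.10.0" before "1.9.1".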
Hi @schiotz
Wow! You were busy working on a solution before I even got a chance to respond, really appreciate it!
This is something we've seen before and it's the result of a PyTorch bug (one that frankly I'm amazed didn't get surfaced and fixed sooner). The bug is fixed in 1.10.
Unfortunately, the version stuff in the README is not actually entirely a typo... this bug does not happen consistently, and so far from my testing PyTorch 1.10 seems to introduce entirely new TorchScript bugs that are difficult to debug and reproduce. That was my reasoning so far for keeping the allowed version down.
For the moment, given that we seem to have success on some systems with 1.9 and mysterious failure on others with 1.10, I'm gonna keep 1.9 as the current max on main and point those like you who have this issue to install 1.10. In the meantime I will try to get our develop branch fully working on 1.10 so we can leave this confusion behind.
Please let me know if 1.10 is/is not working for you! That will help me understand what exactly is going on.
Thanks!
Thanks for your comments, it makes a lot of sense.
Depending on large third-party packages saves a lot of work, but occasionally gives a bit of trouble when bugs and incompatibilities are introduced - we run into the same kind of issues with ASE, so I fully understand the situation. :-)
Closing this, thanks @schiotz!