
Comments (7)

schiotz commented on July 23, 2024

Follow-up: I updated to PyTorch 1.9.1, same error.

from nequip.

schiotz commented on July 23, 2024

Follow-up 2:

I tried with the LAMMPS module instead of the ASE module. Same error. It works without a GPU, but fails on the GPU. The GPU is an NVIDIA RTX 3090 with compute capability 8.6.

I really don't know what I am doing, but I tried to google for similar problems. Other software packages produce this error when there is a dot in the name of a variable passed to a function. That fits the error in the attached error message. The first error is at line 18, and that line reads

extern "C" __global__
void fused_mul_div_sin_div_mul_mul(float* t_, float* t__, float* aten_mul, float* aten_mul_1, float* aten_sin, float* aten_div, float* aten_mul_2, float* const_self.model.func.radial_basis.basis.bessel_weights) {

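The last parameter contains numerous dots in the variable name, which certainly looks wrong: a dot is not a legal character in a C/CUDA identifier, so the compiler cannot parse the parameter list.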
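
A minimal sketch of that check, not part of NequIP or PyTorch: it pulls the parameter names out of a C-style signature and reports any that are not valid identifiers. The helper name `invalid_parameter_names` is hypothetical, and the parsing assumes a simple comma-separated parameter list like the one quoted above.

```python
import re
from typing import List

# A valid C/CUDA identifier: a letter or underscore, then letters, digits, or underscores.
C_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def invalid_parameter_names(signature: str) -> List[str]:
    """Return the parameter names in a C-style signature that are not valid identifiers.

    Hypothetical helper for illustration; assumes a flat, comma-separated parameter list.
    """
    # Take the text between the first '(' and the last ')'.
    params = signature[signature.index("(") + 1 : signature.rindex(")")]
    bad = []
    for param in params.split(","):
        param = param.strip()
        if not param:
            continue
        # The name is the last whitespace-separated token, minus any leading '*'.
        name = param.split()[-1].lstrip("*")
        if not C_IDENTIFIER.fullmatch(name):
            bad.append(name)
    return bad


# The signature quoted from the error message in this comment.
signature = (
    "void fused_mul_div_sin_div_mul_mul(float* t_, float* t__, float* aten_mul, "
    "float* aten_mul_1, float* aten_sin, float* aten_div, float* aten_mul_2, "
    "float* const_self.model.func.radial_basis.basis.bessel_weights)"
)
print(invalid_parameter_names(signature))
```

Run on the signature above, only the last parameter is flagged, which matches the location of the first compiler error.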


schiotz commented on July 23, 2024

Follow-up 3:

This is neither related to our GPUs, nor to having PyTorch installed with EasyBuild.

I tried it out on another cluster with NVIDIA TITAN Xp GPUs, and with Python and PyTorch installed with conda. Same result.

EDIT: This install used the main branch of NequIP, not the developer branch.


schiotz commented on July 23, 2024

PROBLEM SOLVED:

I think it is a documentation problem. According to the main page (README.md), "NequIP is also not currently compatible with PyTorch 1.10; PyTorch 1.9 can be specified with pytorch==1.9 in the install command."

It looks like PyTorch 1.9 is the problem; upgrading to 1.10.0 solves the issue.


Linux-cpp-lisp commented on July 23, 2024

Hi @schiotz,

Wow! You were busy working on a solution before I even got a chance to respond, really appreciate it!

This is something we've seen before and it's the result of a PyTorch bug (one that frankly I'm amazed didn't get surfaced and fixed sooner). The bug is fixed in 1.10.
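
For anyone who wants to check which side of that fix an install is on, here is a minimal sketch. The function names are hypothetical, and the 1.10 cutoff is only what is reported in this thread, not an official PyTorch statement. It compares parsed integers rather than strings, because a plain string comparison would sort "1.10" before "1.9".

```python
import re
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Parse the leading 'major.minor' from a version string such as '1.9.1+cu111'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def predates_fix(version: str) -> bool:
    """True if the version is older than 1.10 (assumed cutoff, taken from this thread)."""
    return major_minor(version) < (1, 10)


for v in ("1.9.1", "1.10.0"):
    print(v, predates_fix(v))
```

In a real environment the argument would be `torch.__version__`; literal strings are used here so the sketch runs without PyTorch installed.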

Unfortunately, the version stuff in the README is not actually entirely a typo... this bug does not happen consistently, and so far in my testing PyTorch 1.10 seems to introduce entirely new, difficult-to-debug/reproduce TorchScript bugs. That was my reasoning so far for keeping the allowed version down.

For the moment, given that we seem to have success on some systems with 1.9 and mysterious failure on others with 1.10, I'm gonna keep 1.9 as the current max on main and point those like you who have this issue to install 1.10. In the meantime I will try to get our develop branch fully working on 1.10 so we can leave this confusion behind.

Please let me know if 1.10 is/is not working for you! That will help me understand what exactly is going on.

Thanks!


schiotz commented on July 23, 2024

Thanks for your comments, it makes a lot of sense.

Depending on large third-party packages saves a lot of work, but occasionally gives a bit of trouble when bugs and incompatibilities are introduced - we run into the same kind of issues with ASE, so I fully understand the situation. :-)


simonbatzner commented on July 23, 2024

Closing this, thanks @schiotz!

