
Comments (10)

chaeunl commented on June 14, 2024

Thank you, @diego-plan9.
I am also trying the above code in another environment (one with different CUDA and PyTorch versions from this case).

from aihwkit.

maljoras commented on June 14, 2024

Hi @chaeunl,
Indeed, I have now found an issue that produced a segfault for the CUDA vector device (the base class of the transfer compound). I am not completely sure whether it is the same issue that you observed, but it occurred when only one of multiple sub-devices of the vector device had diffusion or reset defined. It is fixed in the upcoming release. Hopefully, that resolves your issue as well.


maljoras commented on June 14, 2024

@chaeunl now it is merged. You need to recompile. Let us know if you still see the segfault. Thanks!


diego-plan9 commented on June 14, 2024

Thanks for the detailed report @chaeunl - I'm trying to replicate the issue, as we recently had a segmentation fault appearing in #112 that might have some relation.

Can you confirm that the issue can be reproduced consistently (i.e., does it always happen at the first iteration of the loop)?

Edit: so far I have not been able to reproduce it with:

from torch import nn
from aihwkit.nn import AnalogSequential, AnalogLinear

def create_model(k):
    model = AnalogSequential(
        AnalogLinear(784, k),
        nn.Sigmoid(),
        AnalogLinear(k, 10)).cuda()
    return model

x = [100, 200, 300]
for i in x:
    model = create_model(i)

under PyTorch 1.7.1, Python 3.6 / 3.8 / 3.9.


chaeunl commented on June 14, 2024

@diego-plan9, when I simplified my code, I didn't notice that the code above runs without any problem.
I have since found where the error comes from: I think it is related to declaring the TransferCompound device.

Here is a simplified version that I confirmed raises the error:

from torch import nn
from aihwkit.nn import AnalogSequential, AnalogLinear
from aihwkit.simulator.configs import UnitCellRPUConfig
from aihwkit.simulator.configs.devices import LinearStepDevice, TransferCompound

def create_rpu(which_device="LinearStepDevice"):
    if which_device == "LinearStepDevice":
        rpu = UnitCellRPUConfig(device=LinearStepDevice())
    elif which_device == "TransferCompound":
        rpu = UnitCellRPUConfig(
            device=TransferCompound(
                unit_cell_devices=[LinearStepDevice(), LinearStepDevice()]))
    else:
        raise ValueError("Undefined Device")
    return rpu

def create_model(k, which_device="LinearStepDevice"):
    model = AnalogSequential(
        AnalogLinear(784, k, rpu_config=create_rpu(which_device=which_device)),
        nn.Sigmoid(),
        AnalogLinear(k, 10, rpu_config=create_rpu(which_device=which_device))).cuda()
    return model

x = [100, 200, 300]
device = "TransferCompound"
for i in x:
    model = create_model(i, which_device=device)
    print("model built")

If I change device to "LinearStepDevice", the segfault does not occur. Also, even with TransferCompound, everything works if I build the network without cuda().
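When comparing variants like these (TransferCompound vs. LinearStepDevice, with and without cuda()), it can help to run each one in its own interpreter so that a segfault in one does not take down the whole comparison. The following is only a minimal sketch using the standard library; `run_isolated` and `describe` are hypothetical helper names and not part of aihwkit. It relies on the POSIX behaviour that `subprocess` reports a child killed by a signal as a negative return code.

```python
import subprocess
import sys


def run_isolated(source: str, timeout: float = 300.0) -> int:
    """Run Python source in a fresh interpreter and return its exit code.

    On POSIX, a negative value -N means the child was killed by signal N
    (a segmentation fault is usually reported as -11).
    """
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode


def describe(returncode: int) -> str:
    """Turn an exit code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        return f"killed by signal {-returncode}"
    return f"exited with code {returncode}"


if __name__ == "__main__":
    # Placeholder snippets: replace each string with the script variant to test.
    cases = {
        "variant A": "print('model built')",
        "variant B": "import sys; sys.exit(1)",
    }
    for name, source in cases.items():
        print(name, "->", describe(run_isolated(source)))
```

Each entry in `cases` would hold one version of the reproduction script as a string, so the per-variant status can be pasted into the issue directly.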

P.S. The code above ran without any problem when I first installed aihwkit through git clone in November of last year (that was my first installation, back when we were in contact by e-mail). At that time, I used PyTorch 1.8.0 and CUDA 11.2 (because of the GPU model, an RTX 3090). I am now trying on another server, which has an RTX 6000 with PyTorch 1.7.1 and CUDA 11.0 installed. So the differences are that aihwkit was installed at different times and with different PyTorch versions. Accordingly, we also passed a different value to -DRPU_CUDA_ARCHITECTURES during installation to specify the CUDA architecture. I hope this information is helpful to you :)
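Since the two setups differ mainly in versions, a small script that prints them side by side may make the comparison easier. This is a sketch, not an official aihwkit tool: `collect_versions` is a hypothetical helper, and it assumes only that a package may expose `__version__` (falling back to "unknown" otherwise). The torch calls used (`torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.get_device_name()`) are only reached when torch imports successfully.

```python
import importlib
import platform


def collect_versions(packages=("torch", "aihwkit")):
    """Collect version information for a bug report as a plain dict."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    modules = {}
    for name in packages:
        try:
            modules[name] = importlib.import_module(name)
        except Exception:
            # Covers both a missing package and one that fails while importing.
            report[name] = "not installed or failed to import"
        else:
            report[name] = getattr(modules[name], "__version__", "unknown")

    torch = modules.get("torch")
    if torch is not None:
        # CUDA version PyTorch was built against (None for CPU-only builds).
        report["torch.version.cuda"] = torch.version.cuda
        available = torch.cuda.is_available()
        report["cuda_available"] = available
        if available:
            report["gpu"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

The value passed to -DRPU_CUDA_ARCHITECTURES at build time is not recorded by this script and would still need to be reported by hand.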


diego-plan9 commented on June 14, 2024

Thanks @chaeunl, as usual, for all the detailed info. I still could not reproduce it, but I have some more options and configurations to try (and having the exact code and all the pointers helps). I will update the issue after exhausting them.


chaeunl commented on June 14, 2024

Thank you so much, @maljoras.
Has it been merged into the GitHub source yet? Since I clone the code directly from GitHub, I can test it right away :)


maljoras commented on June 14, 2024

Hi @chaeunl,
I found another issue with the CUDA TransferCompound that occurs when transfer_every is smaller than the batch size. In this case the update was not done correctly, and for some combinations of network size and batch size a mismatch in the size of the random states could crash the training. Please check out the newest version once the fix is merged.


chaeunl commented on June 14, 2024

@maljoras, I recompiled the up-to-date version of the simulator, but the same issue still occurs. I think it might be related to the version of CUDA, PyTorch, or the NVIDIA driver. I am trying to reinstall the simulator in other environments (a few months ago, I did not have this problem in a different environment). As soon as we have installed the simulator in the other environments, I will report back. Thank you.


maljoras commented on June 14, 2024

This should be fixed. Reopen if not.
