Comments (10)
Thank you, @diego-plan9 .
I am also trying the above code in another environment (with CUDA and PyTorch versions that differ from this one).
from aihwkit.
Hi @chaeunl,
Indeed, I have now found an issue that produced a segfault for the CUDA vector device (the base class of the transfer compound). I am not completely sure whether it is the same issue that you observed, but it occurred when only one of multiple sub-devices of the vector device had diffusion or reset defined. It is fixed in the upcoming release. Hopefully, that resolves your issue as well.
from aihwkit.
@chaeunl now it is merged. You need to recompile. Let us know if you still see the segfault. Thanks!
from aihwkit.
Thanks for the detailed report @chaeunl - I'm trying to replicate the issue, as we recently had a segmentation fault appearing in #112 that might have some relation.
Can you confirm the issue can be reproduced consistently (i.e., does it always happen at the first iteration of the loop)?
Edit: so far I have not been able to reproduce it with:
from torch import nn
from aihwkit.nn import AnalogSequential, AnalogLinear

def create_model(k):
    model = AnalogSequential(
        AnalogLinear(784, k),
        nn.Sigmoid(),
        AnalogLinear(k, 10)).cuda()
    return model

x = [100, 200, 300]
for i in x:
    model = create_model(i)
under PyTorch 1.7.1, Python 3.6 / 3.8 / 3.9.
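One way to check whether a crash like this happens consistently is to run each attempt in a child interpreter, so that a segfault shows up as an exit code instead of taking down the session. The following is only a minimal sketch: `run_isolated` and `describe` are hypothetical helper names, and the snippet string is a placeholder for the model-building code under test.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=300):
    """Run a Python snippet in a child interpreter and return its exit code."""
    # -X faulthandler makes the child dump a Python traceback on a fatal signal.
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        timeout=timeout,
    )
    return result.returncode


def describe(returncode):
    """Translate an exit code into a readable label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX a negative code means the child was killed by that signal.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with code %d" % returncode


if __name__ == "__main__":
    # Placeholder snippet (assumption); substitute the code under test.
    snippet = "print('model built')"
    for attempt in range(3):
        print(attempt, describe(run_isolated(snippet)))
```

On POSIX systems a segmentation fault in the child would be reported as "killed by SIGSEGV", and the faulthandler output indicates which Python line was executing at that moment.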
from aihwkit.
@diego-plan9, while simplifying my code I did not notice that the code above actually runs without problems.
Instead, I have now found where the error comes from. I think it is related to declaring the TransferCompound device.
This is a simplified example, and I confirmed that it raises the error:
from torch import nn
from aihwkit.nn import AnalogSequential, AnalogLinear
from aihwkit.simulator.configs import UnitCellRPUConfig
from aihwkit.simulator.configs.devices import LinearStepDevice, TransferCompound

def create_rpu(which_device="LinearStepDevice"):
    if which_device == "LinearStepDevice":
        rpu = UnitCellRPUConfig(device=LinearStepDevice())
    elif which_device == "TransferCompound":
        rpu = UnitCellRPUConfig(device=TransferCompound(
            unit_cell_devices=[LinearStepDevice(), LinearStepDevice()]))
    else:
        raise ValueError("Undefined Device")
    return rpu

def create_model(k, which_device="LinearStepDevice"):
    model = AnalogSequential(
        AnalogLinear(784, k, rpu_config=create_rpu(which_device=which_device)),
        nn.Sigmoid(),
        AnalogLinear(k, 10, rpu_config=create_rpu(which_device=which_device))).cuda()
    return model

x = [100, 200, 300]
device = "TransferCompound"
for i in x:
    model = create_model(i, which_device=device)
    print("model built")
If I change device to "LinearStepDevice", it does not raise a seg fault. Also, even when using TransferCompound, if I build the network without cuda(), it works.
*ps: the above code did not cause any problem when I installed aihwkit via git clone in November last year (that was my first installation, while we were in contact by e-mail). At that time, I used PyTorch 1.8.0 and CUDA 11.2 (because of the GPU model, an RTX 3090). I am now trying on another server, which has an RTX 6000 with PyTorch 1.7.1 under CUDA 11.0. So the differences are that aihwkit was installed at a different time and with a different PyTorch version. Consequently, during installation we also used a different value for the flag that specifies the CUDA architecture, -DRPU_CUDA_ARCHITECTURES. I hope this information is helpful to you :)
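Since the two environments differ in PyTorch version, CUDA version and GPU, it may help to collect those details with a small script on each machine. This is a minimal sketch, not part of aihwkit: `arch_flag` and `report` are hypothetical helpers, and the assumption that the flag takes the compute capability written without a dot (for example 75) should be checked against the installation instructions.

```python
def arch_flag(capability):
    """Format a (major, minor) compute capability as a CMake flag string."""
    # Assumption: the flag value is the capability without the dot.
    major, minor = capability
    return "-DRPU_CUDA_ARCHITECTURES=%d%d" % (major, minor)


def report():
    """Collect version details worth attaching to a bug report."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    info = {"torch": torch.__version__, "cuda": torch.version.cuda}
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
        info["flag"] = arch_flag(torch.cuda.get_device_capability(0))
    return info


if __name__ == "__main__":
    for key, value in report().items():
        print(key, value)
```

Comparing the output from both machines shows at a glance which of the versions and which architecture value differ between the working and the failing installation.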
from aihwkit.
Thanks @chaeunl, as usual, for all the detailed info. I still could not reproduce it, but I have some more options and configurations to try (and having the exact code and all the pointers helps). I will update the issue after exhausting them.
from aihwkit.
Thank you so much, @maljoras.
Has it been merged into the GitHub source yet? Since I clone the code directly from GitHub, I can check it right away :)
from aihwkit.
Hi @chaeunl,
I found another issue with the CUDA TransferCompound in the case of transfer_every being smaller than the batch size. In this case the update was not done correctly, and for some combinations of network size and batch size a mismatch in the size of the random states could crash the training. Please check out the newest version once the fix is merged.
from aihwkit.
@maljoras, I recompiled the up-to-date version of the simulator, but the same issue still exists. I think it might be related to the version of CUDA, PyTorch, or the NVIDIA driver. I am trying to reinstall the simulator in other environments (a few months ago, I did not have this issue under another environment). As soon as the simulator is installed in the other environments, I will report back. Thank you.
from aihwkit.
This should be fixed. Reopen if not.
from aihwkit.