Comments (6)
I've seen a similar issue when libcuda.so is not in the LD_LIBRARY_PATH (on
one of my systems, only libcuda.so.1 was there, but the usual libcuda.so
symlink was absent). Can you please check that? If it turns out that's
your issue, you can either simply run ln -s libcuda.so.1 libcuda.so in the
relevant directory or you can modify NCCL to dlopen libcuda.so.1 instead of
libcuda.so.
Thanks,
Cliff
On Thu, Jan 14, 2016 at 10:28 PM, Jerry Lin wrote:
I built nccl with cuda-7.5:
make CUDA_HOME=/usr/local/cuda-7.5 test
Running the test with the following commands:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./build/lib
./build/test/all_reduce_test
causes a segmentation fault:
Using devices
Segmentation fault
But all tests run smoothly if I build nccl with cuda-7.0.
Is the current version of nccl not compatible with cuda-7.5?
from nccl.
Oh, and yes, NCCL is normally compatible with CUDA 7.5. It actually is a
bit more complete on CUDA 7.5 than on 7.0, since 7.0 lacked some of the
necessary support for the fp16 'half' datatype.
Side note: assuming this is the same issue with libcuda.so that I'm referring to, we should fix the tests to fail more gracefully when the communicator cannot be created. The segfault happens when we pass a NULL communicator to some subsequent routine.
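The failure mode described above (a NULL communicator reaching a later routine) can be avoided by checking the result of communicator creation before using it. The following is only a minimal sketch of that pattern, not the fix that went into NCCL: `comm_t`, `comm_init_fn`, `init_comm_checked`, and the two `fake_init_*` stubs are hypothetical stand-ins introduced for illustration and are not part of the NCCL API.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration only: these are NOT the NCCL API. */
struct comm { int rank; };
typedef struct comm comm_t;
typedef int (*comm_init_fn)(comm_t **out);   /* returns 0 on success */

/* Stub that simulates a failed creation (for example, libcuda.so not found). */
int fake_init_fail(comm_t **out) {
    *out = NULL;
    return 1;
}

/* Stub that simulates a successful creation. */
int fake_init_ok(comm_t **out) {
    static comm_t instance = { 0 };
    *out = &instance;
    return 0;
}

/* Run the initializer and refuse to hand back an unusable communicator.
 * Returns 0 on success and -1 on failure; *out is NULL on failure. */
int init_comm_checked(comm_init_fn init, comm_t **out) {
    *out = NULL;
    int rc = init(out);
    if (rc != 0 || *out == NULL) {
        fprintf(stderr, "communicator creation failed (rc=%d)\n", rc);
        *out = NULL;
        return -1;
    }
    return 0;
}
```

A test program following this pattern would print the error and exit with a non-zero status when `init_comm_checked` returns -1, instead of passing the NULL pointer on to the next call.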
@cliffwoolley Thanks for the explanation.
I created a symlink to libcuda.so.1 and now it works!
So it's the same issue with libcuda.so.
Great! Glad to hear it. We'll leave this issue open to deal with the libcuda.so[.1] loading (perhaps we could try both variants before giving up) as well as to detect communicator creation failure in the test apps without segfaulting. I believe @nluehr already has fixes pending for one or both of these issues.
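The "try both variants before giving up" idea can be sketched with plain dlopen calls. This is an illustration under stated assumptions, not the change that was made in NCCL: the helper names `open_first_available` and `open_cuda_driver` are hypothetical, and the order in which the two library names are tried is an arbitrary choice here.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Try each candidate name in order and return the first handle that loads.
 * Returns NULL if none of the names can be opened. */
void *open_first_available(const char *const names[], size_t count) {
    for (size_t i = 0; i < count; ++i) {
        void *handle = dlopen(names[i], RTLD_NOW);
        if (handle != NULL) {
            return handle;
        }
        const char *err = dlerror();
        fprintf(stderr, "dlopen(%s) failed: %s\n",
                names[i], err != NULL ? err : "unknown error");
    }
    return NULL;
}

/* Hypothetical wrapper: look for the unversioned name first, then the
 * versioned name that is present even when the symlink is missing. */
void *open_cuda_driver(void) {
    static const char *const names[] = { "libcuda.so", "libcuda.so.1" };
    return open_first_available(names, sizeof names / sizeof names[0]);
}
```

A caller would treat a NULL return from `open_cuda_driver` as a hard error and report it, rather than continuing with an uninitialized driver handle. On older glibc versions the program may need to be linked with -ldl.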
These issues are resolved in change sets caa40b8 and 2758353.