Comments (9)
Hi MannyKayy,
Thank you for trying ChainerMN.
It seems that gcc cannot find nccl.h in your include path.
Could you check the following?
(1) Does the file nccl.h exist somewhere in your environment? It may be in /usr/local/include, but I can't tell from here.
(2) If (1) is yes: check your CPATH environment variable as described in http://chainermn.readthedocs.io/en/latest/installation/guide.html#nvidia-nccl
(3) If (1) is no: go to https://github.com/NVIDIA/nccl , follow its README, and make sure it works. Then go back to http://chainermn.readthedocs.io/en/latest/installation/guide.html#nvidia-nccl .
BTW, do you mind if we add your case to the troubleshooting guide?
So far we haven't tried nvidia-docker, so your experience really helps.
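The header check described above can be sketched as a short shell snippet. This is only a rough illustration: `header_in_cpath` is a hypothetical helper (not part of ChainerMN or NCCL), and the directories passed to `find` are guesses about where the header might live.

```shell
# Hypothetical helper: print the full path of a header if one of the
# directories listed in CPATH contains it; return non-zero otherwise.
header_in_cpath() {
    header="$1"
    old_ifs="$IFS"
    IFS=':'                       # CPATH is a colon-separated list
    for dir in ${CPATH:-}; do
        if [ -n "$dir" ] && [ -f "$dir/$header" ]; then
            IFS="$old_ifs"
            printf '%s\n' "$dir/$header"
            return 0
        fi
    done
    IFS="$old_ifs"
    return 1
}

# Does nccl.h exist at all? (The search locations are assumptions.)
find /usr/local/include /usr/include /usr/local/cuda/include \
    -name nccl.h 2>/dev/null || true

# If it exists, is its directory actually listed in CPATH?
if header_in_cpath nccl.h; then
    echo "nccl.h is visible through CPATH"
else
    echo "nccl.h is not reachable through CPATH"
fi
```

If `find` prints a path but the second check fails, adding that directory to CPATH is the likely fix; if `find` prints nothing, NCCL probably still needs to be built and installed.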
Hope it helps.
Thanks!
Keisuke
from chainermn.
Hey all,
I tried to write a Dockerfile for ChainerMN, shown below.
There were some warnings, but the build succeeded.
I ran my ChainerMN Docker image and started a ChainerMN container.
Then I tested train_mnist.py for ChainerMN in the container.
However, I ran into the two issues below.
Could you give me any advice?
I'd really appreciate it if you could tell me how to solve these issues.
★Issues
1. mpiexec run-as-root issue
mpiexec has detected an attempt to run as root.
Running at root is strongly discouraged as any mistake (e.g., in
defining TMPDIR) or bug can result in catastrophic damage to the OS
file system, leaving your system in an unusable state.
You can override this protection by adding the --allow-run-as-root
option to your cmd line. However, we reiterate our strong advice
against doing so - please do so at your own risk.
2. Open MPI settings issue
root@1f3a59529a69:/usr/local/lib/python2.7/dist-packages/chainermn# mpiexec -allow-run-as-root -n 4 pyhton train_mnist.py
The value of the MCA parameter "plm_rsh_agent" was set to a path
that could not be found:
plm_rsh_agent: ssh : rsh
Please either unset the parameter, or check that the path is correct
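For reference, the two messages above can be probed from inside the container with a small shell sketch. It assumes the second error means that neither ssh nor rsh is installed in the image, which the output suggests but the thread does not confirm. The helper names `mpi_root_flag` and `first_available` are hypothetical, and the final line only prints a command instead of running it.

```shell
# Issue 1: print the override flag only when the given uid
# (default: the current user) is root.
mpi_root_flag() {
    uid="${1:-$(id -u)}"
    if [ "$uid" -eq 0 ]; then
        printf '%s\n' "--allow-run-as-root"
    fi
}

# Issue 2: print the first command from the argument list found on PATH.
first_available() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            return 0
        fi
    done
    return 1
}

if agent=$(first_available ssh rsh); then
    echo "launcher found: $agent"
else
    # Assumption: installing an ssh client (for example the
    # openssh-client package on Ubuntu) would provide one.
    echo "neither ssh nor rsh is on PATH"
fi

# Show, without running it, the command line that would be used.
echo "mpiexec $(mpi_root_flag) -n 4 python train_mnist.py"
```

Running as a non-root user inside the container would avoid the first warning altogether; the flag is only the override that the message itself mentions.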
★My Dockerfile for ChainerMN
FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu14.04
ENV http_proxy my.company.com
ENV https_proxy my.company.com
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    git \
    wget \
    make \
    nano \
    file \
    python-dev \
    python-pip \
    cython && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /home/
RUN wget http://www.open-mpi.org/software/ompi/v2.1/downloads/openmpi-2.1.1.tar.gz
#RUN file -z openmpi-2.1.1.tar.gz
RUN tar xzvf openmpi-2.1.1.tar.gz
WORKDIR openmpi-2.1.1
RUN ./configure --with-cuda
RUN make -j4
RUN make install
RUN ldconfig
RUN which mpicc
RUN mpicc -show
RUN which mpiexec
RUN mpiexec --version
WORKDIR /home/
RUN git clone https://github.com/NVIDIA/nccl.git
WORKDIR /home/nccl/
RUN make CUDA_HOME=/usr/local/cuda test
RUN make install
ENV PATH /usr/local/bin:/usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH /usr/local/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
ENV LIBRARY_PATH /usr/local/lib:$LIBRARY_PATH
ENV CPATH /usr/local/cuda/include:/usr/local/include:$CPATH
RUN pip install --upgrade urllib3
RUN pip install --upgrade pip
RUN pip install --upgrade cython
RUN pip install chainermn
WORKDIR /usr/local/lib/python2.7/dist-packages/chainermn
# Fetch the raw file; the /tree/ URL returns an HTML page instead of the script.
RUN wget https://raw.githubusercontent.com/pfnet/chainermn/master/examples/mnist/train_mnist.py
from chainermn.
Hi MannyKayy,
I faced the same issue: gcc could not find cuda_runtime_api.h and other headers.
Even if you set the right path, this issue may not go away.
I copied all files under /usr/local/cuda/include to /usr/local/include.
(I installed the NCCL headers in /usr/local/include.)
After I did that, the issue was resolved.
from chainermn.
Sorry, I think I misread your question. Your compiler could not find cuda_runtime_api.h, not nccl.h.
But anyway, as yoshihingis implied, the problem should be solved by setting CPATH to point to the directory that contains cuda_runtime_api.h, or by copying the file to the system's default include path.
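As a rough sketch of the CPATH option, the snippet below prepends a directory to CPATH only if it is not already listed. `prepend_cpath` is a hypothetical helper, and /usr/local/cuda/include is assumed to be where cuda_runtime_api.h lives (it is the directory mentioned earlier in this thread).

```shell
# Hypothetical helper: prepend a directory to CPATH unless it is already there.
prepend_cpath() {
    dir="$1"
    case ":${CPATH:-}:" in
        *":$dir:"*) ;;                          # already listed, nothing to do
        *) CPATH="$dir${CPATH:+:$CPATH}" ;;     # avoid a trailing colon when empty
    esac
    export CPATH
}

# Assumed location of cuda_runtime_api.h.
prepend_cpath /usr/local/cuda/include
echo "CPATH=$CPATH"
```

gcc treats each CPATH entry as an extra include directory, so this has to be in effect in the shell (or Dockerfile ENV) that runs the pip install.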
Thanks.
from chainermn.
@yoshihingis,
Thank you for your comment and question.
Could you create a separate issue for your docker-related question, so that we can track it more easily?
Thanks!
from chainermn.
Dear Keisuke,
Thank you for your comments.
I think you will be able to track my Docker issue easily.
However, the ChainerMN framework is quite complex, so I made a new, simpler Dockerfile that uses only Open MPI and a simple program.
A container started from this new Dockerfile reproduces the same issues as my ChainerMN Dockerfile.
New Dockerfile
FROM ubuntu:14.04
ENV http_proxy my company proxy
ENV https_proxy my company proxy
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    git \
    wget \
    make \
    nano \
    file \
    python-dev \
    python-pip \
    cython && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /home/
#RUN file -z openmpi-2.1.1.tar.gz
RUN wget http://www.open-mpi.org/software/ompi/v2.1/downloads/openmpi-2.1.1.tar.gz
RUN tar xzvf openmpi-2.1.1.tar.gz
WORKDIR openmpi-2.1.1
RUN ./configure
RUN make -j4
RUN make install
RUN ldconfig
RUN which mpicc
RUN mpicc -show
RUN which mpiexec
RUN mpiexec --version
WORKDIR /home/
RUN mkdir OpenMPI
WORKDIR /home/OpenMPI
RUN wget http://www.open-mpi.org/papers/workshop-2006/hello.c
RUN mpicc hello.c -o hello
from chainermn.
Thanks, @keisukefukuda and @yoshihingis .
I now have a working Dockerfile (tested on AWS g2.8x instances).
If you want me to send a pull request, let me know.
from chainermn.
Hi, @MannyKayy
It'd be really helpful if you could contribute your Dockerfile via a PR.
Could you create a directory named docker and put the Dockerfile in it?
Thanks!
from chainermn.
@keisukefukuda #71 done
from chainermn.