Comments (2)
Thank you for the info, @yoshihingis.
I will investigate it, but I currently don't have access to a GPU-equipped, Docker-enabled machine, so it may take time. I will first try on non-GPU Docker and see what happens.
from chainermn.
Dear keisuke,
I modified my Dockerfile and got ChainerMN running on Docker, but only CPU mode works; GPU mode fails with errors.
I added "apt-get install openssh-server" and removed the wget of train_mnist.py for ChainerMN.
If you try to test my ChainerMN setup in a Docker container, please be careful about two points.
1. How to run train_mnist.py:
mpiexec --allow-run-as-root -n X python train_mnist.py
2. After you build the Dockerfile, run the image, and log in to the container with bash, please create train_mnist.py yourself,
because my Dockerfile does not wget train_mnist.py.
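For anyone scripting the test run, the command in point 1 can be assembled programmatically. The following is a minimal sketch, not part of ChainerMN: `build_mpiexec_command` is a hypothetical helper name, and the only flags it uses are the ones that appear in this thread (`--allow-run-as-root`, `-n`, and the example script's `-g` switch for GPU mode).

```python
def build_mpiexec_command(n_procs, script="train_mnist.py", use_gpu=False):
    """Build the mpiexec argument list used in this thread.

    Hypothetical helper for illustration; it only assembles the command
    and does not launch anything.
    """
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    cmd = ["mpiexec", "--allow-run-as-root", "-n", str(n_procs), "python", script]
    if use_gpu:
        # "-g" is the GPU switch passed to train_mnist.py in the failing run.
        cmd.append("-g")
    return cmd


if __name__ == "__main__":
    # Print the command line instead of running it.
    print(" ".join(build_mpiexec_command(2, use_gpu=True)))
```

The returned list can be passed to `subprocess.check_call` inside the container if you want the script to launch the job as well.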
Below are the GPU mode errors and my ChainerMN Dockerfile.
I'd really appreciate any advice about the GPU mode errors.
Regards,
==GPU mode Error==
root@2db4d2db4676:/usr/local/lib/python2.7/dist-packages/chainermn# mpiexec --allow-run-as-root -n 2 python train_mnist.py -g
Using hierarchical communicator
Using hierarchical communicator
Traceback (most recent call last):
File "train_mnist.py", line 119, in <module>
Traceback (most recent call last):
File "train_mnist.py", line 119, in <module>
main()
File "train_mnist.py", line 56, in main
comm = chainermn.create_communicator(args.communicator)
File "/usr/local/lib/python2.7/dist-packages/chainermn/communicators/__init__.py", line 49, in create_communicator
return HierarchicalCommunicator(mpi_comm=mpi_comm)
File "/usr/local/lib/python2.7/dist-packages/chainermn/communicators/hierarchical_communicator.py", line 14, in __init__
main()
self.gpu_buffer_a = _memory_utility.DeviceMemory()
File "/usr/local/lib/python2.7/dist-packages/chainermn/communicators/_memory_utility.py", line 48, in __init__
File "train_mnist.py", line 56, in main
"Cupy is not available.")
comm = chainermn.create_communicator(args.communicator)
File "/usr/local/lib/python2.7/dist-packages/chainermn/communicators/__init__.py", line 49, in create_communicator
RuntimeError: DeviceMemory cannot be used: Cupy is not available.
return HierarchicalCommunicator(mpi_comm=mpi_comm)
File "/usr/local/lib/python2.7/dist-packages/chainermn/communicators/hierarchical_communicator.py", line 14, in __init__
self.gpu_buffer_a = _memory_utility.DeviceMemory()
File "/usr/local/lib/python2.7/dist-packages/chainermn/communicators/_memory_utility.py", line 48, in __init__
"Cupy is not available.")
RuntimeError: DeviceMemory cannot be used: Cupy is not available.
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[63255,1],1]
Exit code: 1
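One way to narrow down the "Cupy is not available" error is to check, inside the container, which of the relevant Python packages can actually be imported. The sketch below is only a diagnostic aid and not part of ChainerMN; `check_module` is a hypothetical helper name, and it assumes the error comes from `cupy` failing to import in this environment, which the log suggests but does not prove.

```python
import importlib


def check_module(name):
    """Return (ok, detail) describing whether `name` can be imported.

    Hypothetical diagnostic helper; `detail` is the module version when the
    import works, or the exception text when it does not.
    """
    try:
        module = importlib.import_module(name)
    except Exception as exc:  # ImportError, or a failure while loading CUDA libraries
        return False, "{}: {}".format(type(exc).__name__, exc)
    return True, getattr(module, "__version__", "unknown version")


if __name__ == "__main__":
    # Packages involved in the failing GPU run, per the traceback above.
    for name in ("chainer", "cupy", "mpi4py", "chainermn"):
        ok, detail = check_module(name)
        print("{:10s} {} ({})".format(name, "OK" if ok else "MISSING", detail))
```

If `cupy` is reported as missing while `chainer` and `chainermn` import fine, the GPU failure is likely an installation problem in the image and not a problem in train_mnist.py.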
==My ChainerMN Dockerfile==
FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu14.04
ENV http_proxy my_company_http_proxy
ENV https_proxy my_company_https_proxy
ENV LANG en_US.UTF-8
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
openssh-server \
git \
wget \
make \
nano \
file \
python-dev \
python-pip \
cython && \
rm -rf /var/lib/apt/lists/*
WORKDIR /home/
RUN wget http://www.open-mpi.org/software/ompi/v2.1/downloads/openmpi-2.1.1.tar.gz
#RUN file -z openmpi-2.1.1.tar.gz
RUN tar xzvf openmpi-2.1.1.tar.gz
WORKDIR openmpi-2.1.1
RUN ./configure --with-cuda
RUN make -j4
RUN make install
RUN ldconfig
RUN which mpicc
RUN mpicc -show
RUN which mpiexec
RUN mpiexec --version
WORKDIR /home/
RUN git clone https://github.com/NVIDIA/nccl.git
WORKDIR /home/nccl/
RUN make CUDA_HOME=/usr/local/cuda test
RUN make install
ENV PATH /usr/local/bin:/usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH /usr/local/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
ENV LIBRARY_PATH /usr/local/lib:$LIBRARY_PATH
ENV CPATH /usr/local/cuda/include:/usr/local/include:$CPATH
RUN pip install --upgrade urllib3
RUN pip install --upgrade pip
RUN pip install --upgrade cython
RUN pip install chainermn
WORKDIR /usr/local/lib/python2.7/dist-packages/chainermn