chainer / chainermn
ChainerMN: Scalable distributed deep learning with Chainer
Home Page: https://chainer.org
License: MIT License
Today, HDFS is the most popular distributed file system across multiple machines, and big-data workloads need a distributed file system to store and read training data.
It would be helpful if Chainer supported the HDFS API for reading training data.
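For reference, one shape such support could take: a dataset that reads each example through an injected opener, so HDFS and local files share one code path. This is only a sketch; the class name is hypothetical, and for HDFS the opener could wrap something like pyarrow's HadoopFileSystem (an assumption, not an existing Chainer API):

```python
import io


class RemoteFileDataset(object):
    """Hypothetical Chainer-style dataset that reads each example through an
    injected `open_fn`, so local files and HDFS share one code path.
    For HDFS, `open_fn` could wrap pyarrow's HadoopFileSystem (assumed)."""

    def __init__(self, paths, open_fn):
        self.paths = paths
        self.open_fn = open_fn  # path -> binary file-like object

    def __len__(self):
        return len(self.paths)

    def get_example(self, i):
        # Stream one record; a real dataset would decode it here.
        with self.open_fn(self.paths[i]) as f:
            return f.read()
```

Because the opener is injected, the same class can be exercised with in-memory blobs or local files before wiring in an actual HDFS client.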
$ nosetests -v -a '!test_gpu'
test_get_epoch_trigger (test_dataset.TestDataset) ... ok
test_get_n_iterations_for_one_epoch (test_dataset.TestDataset) ... ok
test_scatter_dataset (test_dataset.TestDataset) ... ok
test_inter_rank_and_size (test_node_aware_communicator.TestNodeAwareCommunicator) ... ok
test_intra_rank_and_size (test_node_aware_communicator.TestNodeAwareCommunicator) ... ok
test_intra_rank_with_env (test_node_aware_communicator.TestNodeAwareCommunicator) ... SKIP
test_intra_size_with_env (test_node_aware_communicator.TestNodeAwareCommunicator) ... SKIP
test_inter_rank_and_size (test_node_aware_communicator_base.TestNodeAwareCommunicatorBase) ... ok
test_intra_rank_and_size (test_node_aware_communicator_base.TestNodeAwareCommunicatorBase) ... ok
test_intra_rank_with_env (test_node_aware_communicator_base.TestNodeAwareCommunicatorBase) ... SKIP
test_intra_size_with_env (test_node_aware_communicator_base.TestNodeAwareCommunicatorBase) ... SKIP
Tests for NaiveCommunicator should also be executed.
After Chainer v2 was released, we need to run tests against both Chainer v1 and v2, so the number of test cases has doubled.
It requires more than an hour, which is not acceptable.
We need to reduce the size of the test matrix by carefully selecting MPI version to be tested.
NOTE: As of now we test 2 MPICH versions and 2 Open MPI versions.
Hi,
I am trying to use chainercv with chainermn.
I used chainercv in some of my new projects, and when I attempt to distribute training using chainermn, I receive the following error from the scatter_dataset method. All I am doing is applying a random_flip transform to the training data. I get the error in all my projects that use chainercv, and have replicated it with the chainermn MNIST example file.
I'm not sure as to where to raise this issue so I have raised it in both the chainercv and chainermn repos.
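A likely cause: scatter_dataset sends the dataset to the other ranks via mpi4py, which pickles it, and a lambda-based transform (a common pattern with chainercv's TransformDataset) cannot be pickled. Moving the transform to a module-level function usually resolves this. A sketch demonstrating the difference — the stub body below only stands in for a real chainercv.transforms.random_flip call:

```python
import pickle


def flip_transform(img):
    """Module-level transform: picklable, so scatter_dataset can send it
    to the other ranks. The body is a stub standing in for a real call
    such as chainercv.transforms.random_flip(img)."""
    return img[::-1]


# An equivalent lambda cannot be pickled by mpi4py's send():
lambda_transform = lambda img: img[::-1]

# A module-level function round-trips through pickle by reference:
restored = pickle.loads(pickle.dumps(flip_transform))
```

So wrapping the data as TransformDataset(train, flip_transform) rather than TransformDataset(train, lambda img: random_flip(img)) should let scatter_dataset succeed.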
Hi, I have installed ChainerMN in my Docker environment. In that environment I tried the normal chainer/examples/mnist/train_mnist.py and chainermn/examples/mnist/train_mnist.py,
but training with normal Chainer was faster than with ChainerMN.
(1)train_mnist.py on Nvidia-docker container with GTX TITAN GPU
root@c620cda4aae6:~/chainer/examples/mnist# python train_mnist.py -g 0
/usr/local/lib/python2.7/dist-packages/cupy/core/fusion.py:659: FutureWarning: cupy.core.fusion is experimental. The interface can change in the future.
  util.experimental('cupy.core.fusion')
GPU: 0
# unit: 1000
# Minibatch-size: 100
# epoch: 20
/usr/local/lib/python2.7/dist-packages/chainer/training/extensions/plot_report.py:25: UserWarning: matplotlib is not installed on your environment, so nothing will be plotted at this time. Please install matplotlib to plot figures. $ pip install matplotlib
  warnings.warn('matplotlib is not installed on your environment, '
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.190802 0.0801559 0.942267 0.9749 3.11936
2 0.0740039 0.086214 0.976666 0.9739 5.58505
3 0.0459295 0.0743716 0.985182 0.9765 8.04036
4 0.0369725 0.0622468 0.987615 0.9813 10.4912
5 0.0286685 0.109555 0.990565 0.9687 12.9728
6 0.0232461 0.070821 0.992848 0.9812 15.8562
7 0.0197429 0.0849811 0.993581 0.9777 18.415
8 0.0180477 0.091499 0.993699 0.9778 20.9872
9 0.0158452 0.0779782 0.994565 0.9824 23.4289
10 0.0166323 0.100067 0.994432 0.9773 26.0959
11 0.014609 0.0975057 0.995432 0.9813 28.6492
12 0.0103739 0.109653 0.997016 0.9785 31.4284
13 0.0134773 0.0936768 0.996082 0.983 33.8616
14 0.0140495 0.0937355 0.995732 0.9821 36.5562
15 0.0102782 0.0950652 0.997066 0.9812 39.0016
16 0.0116389 0.106709 0.996849 0.9802 41.5919
17 0.00899479 0.0889611 0.997449 0.982 44.014
18 0.00671547 0.107222 0.998033 0.9808 46.7314
19 0.0103872 0.115333 0.996899 0.9805 49.1938
20 0.0115492 0.0843908 0.996966 0.9837 51.805
(2) Docker container with GTX TITAN and GTX 1080 two GPUs
root@c620cda4aae6:~/chainermn/examples/mnist# mpiexec --allow-run-as-root -n 2 python train_mnist.py -g
/usr/local/lib/python2.7/dist-packages/cupy/core/fusion.py:659: FutureWarning: cupy.core.fusion is experimental. The interface can change in the future.
  util.experimental('cupy.core.fusion')
/usr/local/lib/python2.7/dist-packages/cupy/core/fusion.py:659: FutureWarning: cupy.core.fusion is experimental. The interface can change in the future.
  util.experimental('cupy.core.fusion')
==========================================
Num process (COMM_WORLD): 2
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
==========================================
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.228555 0.0938969 0.930967 0.9702 5.05315
2 0.0781616 0.0802494 0.975968 0.9731 8.25713
3 0.0495357 0.0608414 0.984067 0.981 11.5505
4 0.0311331 0.0665458 0.99 0.9787 14.7125
5 0.0218156 0.0729297 0.992267 0.9799 17.9524
6 0.0225498 0.0836327 0.992167 0.9762 20.999
7 0.017645 0.0783703 0.994533 0.9793 24.3923
8 0.0150505 0.0763708 0.994833 0.9805 27.2897
9 0.0126401 0.0751651 0.995334 0.983 30.2712
10 0.014152 0.0800896 0.995633 0.9816 33.1372
11 0.00949437 0.118996 0.996534 0.9738 36.1006
12 0.00907453 0.0781274 0.997134 0.984 38.9748
13 0.0125371 0.0901105 0.996 0.9798 41.9752
14 0.0106789 0.0867079 0.996434 0.9817 44.8528
15 0.0108346 0.0836941 0.996433 0.9813 47.8183
16 0.00800829 0.100341 0.997233 0.9807 51.5278
17 0.00982617 0.0995677 0.996633 0.9799 55.3525
18 0.0044051 0.0925263 0.9988 0.983 58.166
19 0.00590992 0.0931387 0.998367 0.9823 61.0527
20 0.0067997 0.0864983 0.998233 0.983 64.0285
Why is training faster with normal Chainer on one GPU than with ChainerMN on two GPUs?
Is this normal?
I am trying to set up a Dockerfile for chainermn with the latest nvidia-docker container. When installing chainermn, I get a gcc error about the CUDA include files.
I have already gone through the troubleshooting guide, and the current setup passes all the checks for the single-node environment.
gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.5m -c chainermn/nccl/nccl.c -o build/temp.linux-x86_64-3.5/chainermn/nccl/nccl.o
chainermn/nccl/nccl.c:432:30: fatal error: cuda_runtime_api.h: No such file or directory
 #include "cuda_runtime_api.h"
                              ^
compilation terminated.
error: command 'gcc' failed with exit status 1
mpirun hostname
mpirun python -c 'from mpi4py import MPI; print(MPI.COMM_WORLD.rank)'
mpirun nosetests
I made a new Dockerfile for ChainerMN; I added a CuPy install step to my old Dockerfile.
ChainerMN ran without any error messages, but multi-GPU mode did not work correctly in the Docker container, as shown below.
I set n=2, but train_mnist.py seemed to use only one GPU; the machine has two Pascal TITAN X GPUs.
On normal Ubuntu (non-Docker), two-GPU mode worked correctly.
I use the same Ubuntu PC for the ChainerMN tests in Docker mode and in non-Docker mode (normal Ubuntu 14.04).
Could you give me any advice about this issue?
(1) train_mnist.py on Docker container
root@0b2d3d4c3fea:/usr/local/lib/python2.7/dist-packages/chainermn# mpiexec --allow-run-as-root -n 2 python train_mnist.py -g
Using hierarchical communicator
Using hierarchical communicator
GPU: 0
# unit: 1000
# Minibatch-size: 100
# epoch: 20
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.226259 0.0917777 0.9321 0.9706 2.17506
2 0.0727778 0.0802592 0.978434 0.9733 3.69603
3 0.0461018 0.0792906 0.985501 0.9757 5.2229
4 0.032463 0.062461 0.989467 0.9792 6.7875
5 0.0232544 0.0649005 0.992534 0.9818 8.31209
6 0.0194039 0.0723934 0.9937 0.9796 9.85948
7 0.0159692 0.0708339 0.9946 0.9806 11.4088
8 0.0112379 0.0851562 0.996367 0.9797 12.935
9 0.0185411 0.0813743 0.9939 0.9797 14.4731
10 0.00952662 0.0778052 0.996834 0.9809 15.9948
11 0.0139726 0.0689402 0.995467 0.983 17.5234
12 0.0100885 0.0850812 0.996267 0.982 19.0561
13 0.0089669 0.082975 0.997 0.9817 20.5937
14 0.0102099 0.0774641 0.997134 0.9833 22.1206
15 0.00610311 0.0989188 0.998033 0.98 23.6562
16 0.0108334 0.098169 0.9965 0.9799 25.1871
17 0.0110047 0.0897099 0.996433 0.9822 26.7332
18 0.00507773 0.0902613 0.998033 0.9831 28.268
19 0.00678471 0.0961368 0.9979 0.98 29.7909
20 0.00619084 0.0877243 0.998 0.984 31.3245
root@0b2d3d4c3fea:/usr/local/lib/python2.7/dist-packages/chainermn#
(2)train_mnist.py on non-docker (normal Ubuntu)
yoshiki@arisax:~/chainermn$ mpiexec -n 2 python train_mnist.py -g
GPU: 1
# unit: 1000
# Minibatch-size: 100
# epoch: 20
GPU: 0
# unit: 1000
# Minibatch-size: 100
# epoch: 20
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.222195 0.0888905 0.9336 0.9724 2.19636
2 0.0742974 0.0748331 0.976834 0.977 3.83245
3 0.0433798 0.0833005 0.9862 0.9754 5.4308
4 0.0333733 0.0789987 0.989333 0.9767 7.06066
5 0.023105 0.0829177 0.991667 0.9771 8.66969
6 0.0185574 0.085721 0.993467 0.9773 10.3025
7 0.0167888 0.0824494 0.994467 0.9791 11.9232
8 0.0143293 0.0700943 0.9952 0.9816 13.5375
9 0.0134493 0.0807009 0.995667 0.9807 15.1565
10 0.0135038 0.0819335 0.995167 0.9809 16.7755
11 0.0104189 0.0711287 0.997033 0.9837 18.3991
12 0.00860689 0.0993782 0.997234 0.9789 20.0254
13 0.0136934 0.0894296 0.995467 0.9808 21.6475
14 0.011271 0.0865426 0.996133 0.9808 23.2727
15 0.010519 0.0885877 0.9972 0.9798 24.9043
16 0.00662932 0.0930554 0.997967 0.981 26.5198
17 0.00875542 0.0905427 0.997 0.9826 28.1383
18 0.0074622 0.0930135 0.9979 0.9819 29.7506
19 0.00749945 0.0876489 0.997533 0.981 31.3879
20 0.00758309 0.100907 0.997467 0.9802 33.0163
yoshiki@arisax:~/chainermn$
(3)my Docker file
Note: you have to fetch the train_mnist.py example yourself after logging into the Docker container with bash.
FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu14.04
ENV http_proxy my company proxy
ENV https_proxy my company proxy
ENV LANG en_US.UTF-8
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    openssh-server \
    git \
    wget \
    make \
    nano \
    file \
    python-dev \
    python-pip \
    cython && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /home/
RUN wget http://www.open-mpi.org/software/ompi/v2.1/downloads/openmpi-2.1.1.tar.gz
RUN tar xzvf openmpi-2.1.1.tar.gz
WORKDIR openmpi-2.1.1
RUN ./configure --with-cuda
RUN make -j4
RUN make install
RUN ldconfig
#RUN which mpicc
#RUN mpicc -show
#RUN which mpiexec
#RUN mpiexec --version
WORKDIR /home/
RUN git clone https://github.com/NVIDIA/nccl.git
WORKDIR /home/nccl/
RUN make CUDA_HOME=/usr/local/cuda test
RUN make install
ENV PATH /usr/local/bin:/usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH /usr/local/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
ENV LIBRARY_PATH /usr/local/lib:$LIBRARY_PATH
ENV CPATH /usr/local/cuda/include:/usr/local/include:$CPATH
RUN pip install --upgrade cupy
RUN pip install --upgrade urllib3
RUN pip install --upgrade pip
RUN pip install --upgrade cython
RUN pip install chainermn
WORKDIR /usr/local/lib/python2.7/dist-packages/chainermn
A segmentation fault occurs in the DCGAN example under the following conditions:
When I executed it under other conditions, it finished normally.
For example,
As of now, we use Travis CI for continuous integration. However, it does not support GPUs. We need to build a CI ecosystem similar to Chainer's.
The sudo command does not inherit the user's environment variables, so the following does not work, because sudo cannot see the CPATH environment variable the user set:
$ export CPATH=...
$ sudo pip install chainermn
Instead, the user needs to set the variable inside the sudo command, like this:
$ sudo CPATH=... pip install chainermn
This is a common problem with Python (pip?) and is not a problem with chainermn itself, but many users may run into it. I think it is better to add this tip to the documentation.
Cf. chainer.datasets.split_dataset_n_random
Currently ChainerMN ignores BN parameters (var, mean), which are treated as 'persistents' in Chainer. They should be averaged for better test accuracy. As they are not used in training at all, it is enough to reduce them just before validation. Therefore, I propose to implement an extension for that.
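A sketch of such an extension, with the communicator's averaging injected so it can be exercised without MPI. In ChainerMN it would be built on the communicator's allreduce; the class and argument names here are hypothetical, not existing API:

```python
import numpy as np


class AllreducePersistents(object):
    """Hypothetical trainer-extension-style callable that replaces each
    persistent BatchNorm array (avg_mean, avg_var) with its mean over
    workers just before validation.  `allreduce_mean` is injected so the
    sketch runs without MPI; with ChainerMN it would be built on the
    communicator's allreduce."""

    def __init__(self, persistents, allreduce_mean):
        self.persistents = persistents        # list of ndarrays to average
        self.allreduce_mean = allreduce_mean  # arr -> element-wise mean

    def __call__(self, trainer=None):
        for arr in self.persistents:
            arr[...] = self.allreduce_mean(arr)  # in-place update
```

Registered with a validation-aligned trigger, this would reduce avg_mean/avg_var once per evaluation rather than every iteration, matching the "reduce just before validation" proposal.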
Does chainermn currently support some method of creating and resuming from a snapshot object?
I can see the --resume argument for the parser in the example files, but chainermn is unable to create a snapshot object when the snapshot extension is called.
Thanks
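While waiting for an official answer, a common workaround is to register the snapshot extension on rank 0 only, so a single process writes the file. A hedged sketch, with the rank gating factored into a helper (the helper name is hypothetical):

```python
def extend_on_root(trainer, extension, rank, root=0):
    """Hypothetical helper: register a trainer extension (e.g.
    extensions.snapshot()) on the root rank only, so a single process
    writes the snapshot file instead of all workers racing on it."""
    if rank == root:
        trainer.extend(extension)
        return True
    return False
```

With ChainerMN this would be called as extend_on_root(trainer, extensions.snapshot(), comm.rank); every rank can still deserialize the file for --resume.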
I am trying to modify the ImageNet example to run on CPUs. After a couple of modifications, the code starts but immediately exits with the following error.
File "train_imagenet.py", line 80, in get_example
image -= self.mean[:, top:bottom, left:right]
ValueError: non-broadcastable output operand with shape (1,227,227) doesn't match the broadcast shape (3,227,227)
Could you please let me know how to address this!?
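The in-place subtraction fails because the image loaded in the CPU path has one channel while self.mean has three; NumPy cannot broadcast a (3, 227, 227) result into a (1, 227, 227) output operand. One fix, assuming the failing images really are grayscale, is to repeat the single channel to three before subtracting (a sketch; function names are hypothetical):

```python
import numpy as np


def to_three_channels(image):
    """Repeat a (1, H, W) grayscale image to (3, H, W) so it matches a
    3-channel dataset mean.  Assumes the failing images are grayscale;
    adapt if your data differs."""
    if image.shape[0] == 1:
        image = np.repeat(image, 3, axis=0)
    return image


def subtract_mean(image, mean, top, bottom, left, right):
    # Mirrors the failing line from get_example, with the channel fix.
    image = to_three_channels(image).astype(np.float32)
    image -= mean[:, top:bottom, left:right]
    return image
```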
I modified the MNIST example to use MultiprocessIterator, and found a performance degradation.
I added a -i option to switch between the two iterators.
$ mpiexec -n 1 python train_mnist.py -e 10 -g -i s
==========================================
Num process (COMM_WORLD): 1
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 10
==========================================
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.195262 0.0969903 0.94065 0.9679 3.51585
2 0.0735881 0.0793202 0.977016 0.9757 6.25958
3 0.0498798 0.0703787 0.984082 0.9797 9.0313
4 0.0363364 0.0805852 0.988499 0.9766 11.7959
5 0.0289838 0.067941 0.990466 0.9822 14.5519
6 0.0215275 0.0720275 0.993182 0.983 17.3163
7 0.0218637 0.0721944 0.992598 0.9815 20.0835
8 0.019435 0.0728776 0.993716 0.9831 22.8549
9 0.0174047 0.0845325 0.994565 0.9788 25.6255
10 0.0137951 0.0884498 0.995716 0.9814 28.3982
$ mpiexec -n 1 python train_mnist.py -e 10 -g -i m
==========================================
Num process (COMM_WORLD): 1
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 10
==========================================
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.197972 0.101519 0.939567 0.9667 52.1017
2 0.0744714 0.07385 0.976449 0.9765 77.5037
3 0.0482926 0.0734199 0.984132 0.9772 103.2
4 0.0359041 0.0606402 0.988332 0.9831 128.861
5 0.0276897 0.0864552 0.991231 0.9756 154.151
6 0.0257647 0.080412 0.991582 0.9791 179.945
7 0.0175599 0.0833933 0.994382 0.9771 205.538
8 0.0189049 0.0810598 0.994165 0.9812 231.028
9 0.0193482 0.0807478 0.993865 0.9808 256.623
10 0.0141072 0.0772984 0.995548 0.9821 282.182
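The ~10x slowdown is plausible for MNIST-sized examples, where per-batch work is tiny and MultiprocessIterator's inter-process communication dominates. For reference, the -i switch can be a small selection helper; the iterator classes are injected here so the sketch runs standalone, while in the real script they would be chainer.iterators.SerialIterator and MultiprocessIterator:

```python
def make_iterator(kind, dataset, batch_size, serial_cls, multi_cls):
    """Mirror of the -i option: 's' selects the serial iterator,
    'm' the multiprocess one.  Classes are injected for testability."""
    if kind == 's':
        return serial_cls(dataset, batch_size)
    if kind == 'm':
        return multi_cls(dataset, batch_size)
    raise ValueError('unknown iterator kind: %r' % (kind,))
```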
I've evaluated the distribution efficiency of ChainerMN on AWS GPU instances.
The article is here: https://qiita.com/sonots/private/22384bbc61284f2fdf94 (Japanese).
My evaluation showed that:
At first I thought this result was as expected, because ChainerMN recommends InfiniBand while p2.16xlarge has only 20 Gbps of network bandwidth. However, ChainerMN used only 6.0 Gbps during my experiment, so the network was not the bottleneck.
I investigated further, and it seems that sys CPU usage increases in the multi-node experiment.
%Cpu0 : 40.4 us, 59.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 31.4 us, 0.0 sy, 0.0 ni, 67.9 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu3 : 36.3 us, 61.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 2.6 si, 0.0 st
%Cpu4 : 96.0 us, 0.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 3.6 si, 0.0 st
%Cpu9 : 99.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu10 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 27.5 us, 72.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu14 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu15 : 34.7 us, 65.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu18 : 32.5 us, 66.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu19 : 93.7 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 6.3 si, 0.0 st
%Cpu23 : 34.3 us, 65.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu26 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu27 : 30.8 us, 69.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu28 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu29 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu30 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu31 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu34 : 68.9 us, 0.0 sy, 0.0 ni, 31.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu40 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
So the reason may be that kernel syscalls, probably handling network traffic, steal CPU from the Chainer process, so it cannot make progress on its main task. But this is just my guess.
Do you have any explanation for this, or any ideas for improving performance on AWS?
In the current version, users need the Cython module to install ChainerMN. This is an install-time-only dependency, so we want to remove it by generating the C++ code at sdist time.
I noticed that it is not only MPI4py.
File "train.py", line 132, in main
train_dataset = chainermn.scatter_dataset(train_dataset, comm)
File "/home/aixile/anaconda3/lib/python3.6/site-packages/chainermn/datasets/scatter_dataset.py", line 91, in scatter_dataset
comm.send(subds, dest=i)
File "MPI/Comm.pyx", line 1175, in mpi4py.MPI.Comm.send (src/mpi4py.MPI.c:106424)
File "MPI/msgpickle.pxi", line 210, in mpi4py.MPI.PyMPI_send (src/mpi4py.MPI.c:42085)
File "MPI/msgpickle.pxi", line 112, in mpi4py.MPI.Pickle.dump (src/mpi4py.MPI.c:40704)
TypeError: can't pickle Transaction objects
^C^Z[warn] Epoll ADD(4) on fd 28 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
This happens when my dataset loader tries to load images from a lmdb file.
class lsun_bedroom_train(datasets_base):
    def __init__(self, path, img_size=256):
        self.all_keys = self.read_image_key_file_json(path + '/key_bedroom.json')
        self.db = lmdb.open(path + "/bedroom_train_lmdb", readonly=True).begin(write=False)
        super(lsun_bedroom_train, self).__init__(flip=1, resize_to=img_size, crop_to=0)

    def __len__(self):
        return len(self.all_keys)

    def get_example(self, i):
        id = self.all_keys[i]
        val = self.db.get(id.encode())
        img = cv2.imdecode(np.fromstring(val, dtype=np.uint8), 1)
        img = self.do_augmentation(img)
        img = self.preprocess_image(img)
        return img
This issue is the central place to discuss the future plans. Any suggestion and contribution are appreciated. We only discuss relatively large tasks here, and smaller tasks are managed in separate issues as usual. We continuously update the following task list.
https://chainermn.readthedocs.io/en/latest/tutorial/step1_communicators_optimizers.html#run
Should we add a --gpu option? Otherwise every user will see warnings due to #69.
When searching for a word on readthedocs.io, the result links are broken.
Example:
The result of "communicator":
http://chainermn.readthedocs.io/en/latest/search.html?q=communicator&check_keywords=yes&area=default
The first result is:
http://chainermn.readthedocs.io/en/latest/reference/index.rst.html?highlight=communicator
This is because the extension (.rst.html) is wrong; simply ".html" is correct.
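Until the generated search index is fixed, the broken result links can be repaired mechanically by dropping the spurious .rst; a one-line sketch:

```python
def fix_search_link(url):
    """Repair a search-result link whose page extension was generated as
    '.rst.html' instead of '.html'."""
    return url.replace('.rst.html', '.html')
```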
It can be checked by the following test:
mpiexec -n 2 nosetests -v tests/test_communicator.py
This test sequence would stop at test_allreduct_grad_gpu. It seems NcclCommunicator fails to initialize itself.
MPI: OpenMPI-2.1.0 (do not reproduce on MVAPICH2-2.2)
When a user is using cleargrads and grad is not produced for some Variables, users encounter the following kind of exception:
File "/home/*****/.pyenv/versions/anaconda3-4.2.0/lib/python3.5/site-packages/chainermn/communicators/hierarchical_communicator.py", line 26, in <genexpr>
n_elems_total = sum(param.grad.size for param in params)
AttributeError: 'NoneType' object has no attribute 'size'
We would like to make it more user-friendly.
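One user-friendly fix is to skip parameters whose grad was never produced when sizing the allreduce buffer. A sketch; note that in a real multi-worker setting the set of skipped parameters must be identical on every rank, or the buffer layouts diverge:

```python
def total_grad_size(params):
    """Robust version of `sum(param.grad.size for param in params)`:
    parameters whose grad was never produced (grad is None after
    cleargrads with an unused branch) are skipped instead of raising
    AttributeError on NoneType."""
    return sum(p.grad.size for p in params if p.grad is not None)
```

Alternatively, the communicator could raise a descriptive error naming the offending parameter instead of the bare NoneType AttributeError.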
When the ImageNet example is run without the "--gpu" flag, it causes an error.
This is because the code is hard-coded to use HierarchicalCommunicator.
Traceback (most recent call last):
File "examples/imagenet/train_imagenet.py", line 194, in <module>
main()
File "examples/imagenet/train_imagenet.py", line 190, in main
trainer.run()
File "/home/kfukuda/chainer/chainer/training/trainer.py", line 295, in run
update()
File "/home/kfukuda/chainer/chainer/training/updater.py", line 175, in update
self.update_core()
File "/home/kfukuda/chainer/chainer/training/updater.py", line 186, in update_core
optimizer.update(loss_func, *in_arrays)
File "/home/kfukuda/chainermn/chainermn/multi_node_optimizer.py", line 28, in update
self.communicator.allreduce_grad(target)
File "/home/kfukuda/chainermn/chainermn/communicators/hierarchical_communicator.py", line 34, in allreduce_grad
params, itemsize, 'grad', self.gpu_buffer_a)
File "/home/kfukuda/chainermn/chainermn/communicators/_memory_utility.py", line 82, in pack_params
buffer.from_device(grad, size, offset)
File "/home/kfukuda/chainermn/chainermn/communicators/_memory_utility.py", line 61, in from_device
dst.copy_from_device(src.data, size)
TypeError: Argument 'src' has incorrect type (expected cupy.cuda.memory.MemoryPointer, got memoryview)
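One way to fix the example is to choose a CPU-capable communicator when --gpu is absent. A sketch of the selection logic; 'hierarchical' and 'naive' are communicator names accepted by chainermn.create_communicator:

```python
def pick_communicator_name(use_gpu):
    """Select a ChainerMN communicator name: the hierarchical communicator
    packs gradients into GPU buffers and so needs GPUs; fall back to the
    CPU-capable 'naive' communicator when no GPU is requested."""
    return 'hierarchical' if use_gpu else 'naive'
```

In the example this would become something like comm = chainermn.create_communicator(pick_communicator_name(args.gpu)) (a sketch, not the example's current code).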
I have a custom extension that changes the learning rate after a certain number of epochs.
This extension is called via a lambda function in the trainer's observe_value extension, as shown below:
trainer.extend(extensions.observe_value('lr', lambda _: lr_shift()), trigger=(150, 'epoch'))
I am wondering whether this extension should be called by a single worker or by all the workers?
e.g. should it be:
if comm.rank == 0: trainer.extend(extensions.observe_value('lr', lambda _: lr_shift()), trigger=(150, 'epoch'))
Thanks
We found that ChainerMN crashes under some conditions with Open MPI. We are investigating this issue, but we recommend MVAPICH at this moment for those who encounter this problem.
We found that it was the problem of environment. We now confirm that tests are passing with both Open MPI and MVAPICH.
This is a dummy issue to check Slack integration.
v3 RC is out now, so we need to work on it shortly.
I am creating this as a new issue again.
Hey all,
I tried to make a Dockerfile for ChainerMN, as below.
There were some warnings, but the build succeeded.
I ran my ChainerMN Docker image, started up a ChainerMN container, and tested train_mnist.py for ChainerMN in the container.
But I faced the two issues below.
Could you give me any advice?
I'd really appreciate it if you could tell me how to solve these issues.
★Issues
1. mpiexec run-as-root issue
mpiexec has detected an attempt to run as root.
Running at root is strongly discouraged as any mistake (e.g., in
defining TMPDIR) or bug can result in catastrophic damage to the OS
file system, leaving your system in an unusable state.
You can override this protection by adding the --allow-run-as-root
option to your cmd line. However, we reiterate our strong advice
against doing so - please do so at your own risk.
2. Open MPI setting issue
root@1f3a59529a69:/usr/local/lib/python2.7/dist-packages/chainermn# mpiexec -allow-run-as-root -n 4 pyhton train_mnist.py
The value of the MCA parameter "plm_rsh_agent" was set to a path
that could not be found:
plm_rsh_agent: ssh : rsh
Please either unset the parameter, or check that the path is correct
★My Dockerfile for ChainerMN
FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu14.04
ENV http_proxy my.company.com
ENV https_proxy my.company.con
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    git \
    wget \
    make \
    nano \
    file \
    python-dev \
    python-pip \
    cython && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /home/
RUN wget http://www.open-mpi.org/software/ompi/v2.1/downloads/openmpi-2.1.1.tar.gz
#RUN file -z openmpi-2.1.1.tar.gz
RUN tar xzvf openmpi-2.1.1.tar.gz
WORKDIR openmpi-2.1.1
RUN ./configure --with-cuda
RUN make -j4
RUN make install
RUN ldconfig
RUN which mpicc
RUN mpicc -show
RUN which mpiexec
RUN mpiexec --version
WORKDIR /home/
RUN git clone https://github.com/NVIDIA/nccl.git
WORKDIR /home/nccl/
RUN make CUDA_HOME=/usr/local/cuda test
RUN make install
ENV PATH /usr/local/bin:/usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH /usr/local/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
ENV LIBRARY_PATH /usr/local/lib:$LIBRARY_PATH
ENV CPATH /usr/local/cuda/include:/usr/local/include:$CPATH
RUN pip install --upgrade urllib3
RUN pip install --upgrade pip
RUN pip install --upgrade cython
RUN pip install chainermn
WORKDIR /usr/local/lib/python2.7/dist-packages/chainermn
RUN wget https://github.com/pfnet/chainermn/tree/master/examples/mnist/train_mnist.py
I am sorry to put the question directly here; I also posted it on Stack Overflow:
https://stackoverflow.com/questions/46901992/how-to-use-mix-link-multi-cpu-of-parallel-computing-in-chainer-v2-1-0
My paper submission deadline is November 8, which is urgent, so please forgive me for asking here.
In my research I wrote a two-layer neural network. The bottom (first) layer is an RNN that runs on the GPU; the top (second) layer runs on the CPU (the nature of the algorithm is better suited to the CPU), implemented as a self-defined Chainer Link.
But the CPU layer is slow, and I cannot wait given my paper deadline, so I want to parallelize this layer.
What is the best practice, and the fastest way, to parallelize this link?
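Absent a ChainerMN-specific answer, one pragmatic way to speed up a CPU-bound link is to split the minibatch across worker threads; NumPy releases the GIL inside most array operations, so threads can give real speedups for NumPy-heavy code. A hedged sketch (the function names are hypothetical; for pure-Python bottlenecks a process pool would be needed instead):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def parallel_forward(batch, forward_one, n_workers=4):
    """Apply a per-sample forward function over a minibatch in parallel.
    `forward_one` stands in for the custom link's per-example computation;
    threads help when it is NumPy-heavy, since NumPy releases the GIL."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        outputs = list(pool.map(forward_one, batch))  # order is preserved
    return np.stack(outputs)
```

Note that this only parallelizes the forward pass; wiring it into a Chainer Link's backward would take extra care with the computational graph.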
Hi there,
I was trying to test ChainerMN, and after installing it and running the MNIST example I got this:
[nelson-lab0:04968] *** Process received signal ***
[nelson-lab0:04968] Signal: Segmentation fault (11)
[nelson-lab0:04968] Signal code: Invalid permissions (2)
[nelson-lab0:04968] Failing at address: 0x2c0d820000
[nelson-lab0:04968] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f60496ad390]
[nelson-lab0:04968] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x14d566)[0x7f6049420566]
[nelson-lab0:04968] [ 2] /usr/lib/libopen-pal.so.13(+0x2fcb7)[0x7f600f328cb7]
[nelson-lab0:04968] [ 3] /usr/lib/libmpi.so.12(ompi_datatype_sndrcv+0x54c)[0x7f600fa6d6bc]
[nelson-lab0:04968] [ 4] /usr/lib/libmpi.so.12(MPI_Alltoall+0x16c)[0x7f600fa6f67c]
[nelson-lab0:04968] [ 5] /usr/local/lib/python2.7/dist-packages/mpi4py/MPI.so(+0x8dd99)[0x7f600fd82d99]
[nelson-lab0:04968] [ 6] python(PyEval_EvalFrameEx+0x68a)[0x4c468a]
[nelson-lab0:04968] [ 7] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04968] [ 8] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04968] [ 9] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04968] [10] python[0x4de6fe]
[nelson-lab0:04968] [11] python(PyObject_Call+0x43)[0x4b0cb3]
[nelson-lab0:04968] [12] python(PyEval_EvalFrameEx+0x2ad1)[0x4c6ad1]
[nelson-lab0:04968] [13] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04968] [14] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04968] [15] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04968] [16] python(PyEval_EvalFrameEx+0x68d1)[0x4ca8d1]
[nelson-lab0:04968] [17] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04968] [18] python(PyEval_EvalFrameEx+0x68d1)[0x4ca8d1]
[nelson-lab0:04968] [19] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04968] [20] python(PyEval_EvalCode+0x19)[0x4c2509]
[nelson-lab0:04968] [21] python[0x4f1def]
[nelson-lab0:04968] [22] python(PyRun_FileExFlags+0x82)[0x4ec652]
[nelson-lab0:04968] [23] python(PyRun_SimpleFileExFlags+0x191)[0x4eae31]
[nelson-lab0:04968] [24] python(Py_Main+0x68a)[0x49e14a]
[nelson-lab0:04968] [25] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f60492f3830]
[nelson-lab0:04968] [26] python(_start+0x29)[0x49d9d9]
[nelson-lab0:04968] *** End of error message ***
[nelson-lab0:04969] *** Process received signal ***
[nelson-lab0:04969] Signal: Segmentation fault (11)
[nelson-lab0:04969] Signal code: Invalid permissions (2)
[nelson-lab0:04969] Failing at address: 0x2c0d820000
[nelson-lab0:04969] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f4a9d3a1390]
[nelson-lab0:04969] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x14d566)[0x7f4a9d114566]
[nelson-lab0:04969] [ 2] /usr/lib/libopen-pal.so.13(+0x2fcb7)[0x7f4a6301ccb7]
[nelson-lab0:04969] [ 3] /usr/lib/libmpi.so.12(ompi_datatype_sndrcv+0x54c)[0x7f4a637616bc]
[nelson-lab0:04969] [ 4] /usr/lib/libmpi.so.12(MPI_Alltoall+0x16c)[0x7f4a6376367c]
[nelson-lab0:04969] [ 5] /usr/local/lib/python2.7/dist-packages/mpi4py/MPI.so(+0x8dd99)[0x7f4a63a76d99]
[nelson-lab0:04969] [ 6] python(PyEval_EvalFrameEx+0x68a)[0x4c468a]
[nelson-lab0:04969] [ 7] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04969] [ 8] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04969] [ 9] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04969] [10] python[0x4de6fe]
[nelson-lab0:04969] [11] python(PyObject_Call+0x43)[0x4b0cb3]
[nelson-lab0:04969] [12] python(PyEval_EvalFrameEx+0x2ad1)[0x4c6ad1]
[nelson-lab0:04969] [13] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04969] [14] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04969] [15] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04969] [16] python(PyEval_EvalFrameEx+0x68d1)[0x4ca8d1]
[nelson-lab0:04969] [17] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04969] [18] python(PyEval_EvalFrameEx+0x68d1)[0x4ca8d1]
[nelson-lab0:04969] [19] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04969] [20] python(PyEval_EvalCode+0x19)[0x4c2509]
[nelson-lab0:04969] [21] python[0x4f1def]
[nelson-lab0:04969] [22] python(PyRun_FileExFlags+0x82)[0x4ec652]
[nelson-lab0:04969] [23] python(PyRun_SimpleFileExFlags+0x191)[0x4eae31]
[nelson-lab0:04969] [24] python(Py_Main+0x68a)[0x49e14a]
[nelson-lab0:04969] [25] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f4a9cfe7830]
[nelson-lab0:04969] [26] python(_start+0x29)[0x49d9d9]
[nelson-lab0:04969] *** End of error message ***
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 4968 RUNNING AT nelson-lab0
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
The first message is about a segmentation fault, and the second about invalid permissions (?). I am wondering whether this comes from a mistake during installation, or from the GPUs?
My computer has 3 GPUs (2 Titan X and 1 960).
By the way, I do not have any problem when testing without a GPU.
Regards