chainer / chainermn
ChainerMN: Scalable distributed deep learning with Chainer
Home Page: https://chainer.org
License: MIT License
Today, HDFS is the most popular distributed file system across multiple machines, and big-data workloads need a distributed file system to store and read training data.
It would be helpful if Chainer supported the HDFS API for reading training data.
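For reference, one shape such support could take: a dataset that reads each example through an injected opener, so HDFS and local files share one code path. This is only a sketch; the class name is hypothetical, and for HDFS the opener could wrap something like pyarrow's HadoopFileSystem (an assumption, not an existing Chainer API):

```python
import io


class RemoteFileDataset(object):
    """Hypothetical Chainer-style dataset that reads each example through an
    injected `open_fn`, so local files and HDFS share one code path.
    For HDFS, `open_fn` could wrap pyarrow's HadoopFileSystem (assumed)."""

    def __init__(self, paths, open_fn):
        self.paths = paths
        self.open_fn = open_fn  # path -> binary file-like object

    def __len__(self):
        return len(self.paths)

    def get_example(self, i):
        # Stream one record; a real dataset would decode it here.
        with self.open_fn(self.paths[i]) as f:
            return f.read()
```

Because the opener is injected, the same class can be exercised with in-memory blobs or local files before wiring in an actual HDFS client.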
$ nosetests -v -a '!test_gpu'
test_get_epoch_trigger (test_dataset.TestDataset) ... ok
test_get_n_iterations_for_one_epoch (test_dataset.TestDataset) ... ok
test_scatter_dataset (test_dataset.TestDataset) ... ok
test_inter_rank_and_size (test_node_aware_communicator.TestNodeAwareCommunicator) ... ok
test_intra_rank_and_size (test_node_aware_communicator.TestNodeAwareCommunicator) ... ok
test_intra_rank_with_env (test_node_aware_communicator.TestNodeAwareCommunicator) ... SKIP
test_intra_size_with_env (test_node_aware_communicator.TestNodeAwareCommunicator) ... SKIP
test_inter_rank_and_size (test_node_aware_communicator_base.TestNodeAwareCommunicatorBase) ... ok
test_intra_rank_and_size (test_node_aware_communicator_base.TestNodeAwareCommunicatorBase) ... ok
test_intra_rank_with_env (test_node_aware_communicator_base.TestNodeAwareCommunicatorBase) ... SKIP
test_intra_size_with_env (test_node_aware_communicator_base.TestNodeAwareCommunicatorBase) ... SKIP
Tests for NaiveCommunicator should also be executed.
After Chainer v2 was released, we need to run tests against both Chainer v1 and v2, so the number of test cases has doubled.
It requires more than an hour, which is not acceptable.
We need to reduce the size of the test matrix by carefully selecting MPI version to be tested.
NOTE: As of now we test 2 MPICH versions and 2 Open MPI versions.
Hi,
I am trying to use chainercv with chainermn.
I used chainercv in some of my new projects, and when I attempt to distribute training using chainermn, I receive the following error from the scatter_dataset method. All I am doing is applying a random_flip transform to the training data. I get the error in all my projects that use chainercv, and have replicated it with the chainermn MNIST example file.
I'm not sure as to where to raise this issue so I have raised it in both the chainercv and chainermn repos.
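A likely cause: scatter_dataset sends the dataset to the other ranks via mpi4py, which pickles it, and a lambda-based transform (a common pattern with chainercv's TransformDataset) cannot be pickled. Moving the transform to a module-level function usually resolves this. A sketch demonstrating the difference — the stub body below only stands in for a real chainercv.transforms.random_flip call:

```python
import pickle


def flip_transform(img):
    """Module-level transform: picklable, so scatter_dataset can send it
    to the other ranks. The body is a stub standing in for a real call
    such as chainercv.transforms.random_flip(img)."""
    return img[::-1]


# An equivalent lambda cannot be pickled by mpi4py's send():
lambda_transform = lambda img: img[::-1]

# A module-level function round-trips through pickle by reference:
restored = pickle.loads(pickle.dumps(flip_transform))
```

So wrapping the data as TransformDataset(train, flip_transform) rather than TransformDataset(train, lambda img: random_flip(img)) should let scatter_dataset succeed.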
Hi, I have installed ChainerMN in my Docker environment. In that environment I tried the normal chainer/examples/mnist/train_mnist.py and chainermn/examples/mnist/train_mnist.py,
but training with normal Chainer was faster than with ChainerMN.
(1)train_mnist.py on Nvidia-docker container with GTX TITAN GPU
root@c620cda4aae6:~/chainer/examples/mnist# python train_mnist.py -g 0
/usr/local/lib/python2.7/dist-packages/cupy/core/fusion.py:659: FutureWarning: cupy.core.fusion is experimental. The interface can change in the future.
  util.experimental('cupy.core.fusion')
GPU: 0
# unit: 1000
# Minibatch-size: 100
# epoch: 20
/usr/local/lib/python2.7/dist-packages/chainer/training/extensions/plot_report.py:25: UserWarning: matplotlib is not installed on your environment, so nothing will be plotted at this time. Please install matplotlib to plot figures. $ pip install matplotlib
  warnings.warn('matplotlib is not installed on your environment, '
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.190802 0.0801559 0.942267 0.9749 3.11936
2 0.0740039 0.086214 0.976666 0.9739 5.58505
3 0.0459295 0.0743716 0.985182 0.9765 8.04036
4 0.0369725 0.0622468 0.987615 0.9813 10.4912
5 0.0286685 0.109555 0.990565 0.9687 12.9728
6 0.0232461 0.070821 0.992848 0.9812 15.8562
7 0.0197429 0.0849811 0.993581 0.9777 18.415
8 0.0180477 0.091499 0.993699 0.9778 20.9872
9 0.0158452 0.0779782 0.994565 0.9824 23.4289
10 0.0166323 0.100067 0.994432 0.9773 26.0959
11 0.014609 0.0975057 0.995432 0.9813 28.6492
12 0.0103739 0.109653 0.997016 0.9785 31.4284
13 0.0134773 0.0936768 0.996082 0.983 33.8616
14 0.0140495 0.0937355 0.995732 0.9821 36.5562
15 0.0102782 0.0950652 0.997066 0.9812 39.0016
16 0.0116389 0.106709 0.996849 0.9802 41.5919
17 0.00899479 0.0889611 0.997449 0.982 44.014
18 0.00671547 0.107222 0.998033 0.9808 46.7314
19 0.0103872 0.115333 0.996899 0.9805 49.1938
20 0.0115492 0.0843908 0.996966 0.9837 51.805
(2) Docker container with GTX TITAN and GTX 1080 two GPUs
root@c620cda4aae6:~/chainermn/examples/mnist# mpiexec --allow-run-as-root -n 2 python train_mnist.py -g
/usr/local/lib/python2.7/dist-packages/cupy/core/fusion.py:659: FutureWarning: cupy.core.fusion is experimental. The interface can change in the future.
  util.experimental('cupy.core.fusion')
/usr/local/lib/python2.7/dist-packages/cupy/core/fusion.py:659: FutureWarning: cupy.core.fusion is experimental. The interface can change in the future.
  util.experimental('cupy.core.fusion')
==========================================
Num process (COMM_WORLD): 2
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
==========================================
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.228555 0.0938969 0.930967 0.9702 5.05315
2 0.0781616 0.0802494 0.975968 0.9731 8.25713
3 0.0495357 0.0608414 0.984067 0.981 11.5505
4 0.0311331 0.0665458 0.99 0.9787 14.7125
5 0.0218156 0.0729297 0.992267 0.9799 17.9524
6 0.0225498 0.0836327 0.992167 0.9762 20.999
7 0.017645 0.0783703 0.994533 0.9793 24.3923
8 0.0150505 0.0763708 0.994833 0.9805 27.2897
9 0.0126401 0.0751651 0.995334 0.983 30.2712
10 0.014152 0.0800896 0.995633 0.9816 33.1372
11 0.00949437 0.118996 0.996534 0.9738 36.1006
12 0.00907453 0.0781274 0.997134 0.984 38.9748
13 0.0125371 0.0901105 0.996 0.9798 41.9752
14 0.0106789 0.0867079 0.996434 0.9817 44.8528
15 0.0108346 0.0836941 0.996433 0.9813 47.8183
16 0.00800829 0.100341 0.997233 0.9807 51.5278
17 0.00982617 0.0995677 0.996633 0.9799 55.3525
18 0.0044051 0.0925263 0.9988 0.983 58.166
19 0.00590992 0.0931387 0.998367 0.9823 61.0527
20 0.0067997 0.0864983 0.998233 0.983 64.0285
Why is training faster with normal Chainer on one GPU than with ChainerMN on two GPUs?
Is this normal?
I am trying to set up a Dockerfile for chainermn with the latest nvidia-docker container. When installing chainermn, I get a gcc error about the CUDA include files.
I have already gone through the troubleshooting guide, and the current setup passes all the checks for the single-node environment.
gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.5m -c chainermn/nccl/nccl.c -o build/temp.linux-x86_64-3.5/chainermn/nccl/nccl.o
chainermn/nccl/nccl.c:432:30: fatal error: cuda_runtime_api.h: No such file or directory
 #include "cuda_runtime_api.h"
                              ^
compilation terminated.
error: command 'gcc' failed with exit status 1
mpirun hostname
mpirun python -c 'from mpi4py import MPI; print(MPI.COMM_WORLD.rank)'
mpirun nosetests
I made a new Dockerfile for ChainerMN; I added a CuPy install step to my old Dockerfile.
ChainerMN ran without any error messages, but multi-GPU mode did not work correctly in the Docker container, as shown below.
I set n=2, but train_mnist.py seemed to use only one GPU; the machine has two Pascal TITAN X GPUs.
On normal Ubuntu (non-Docker), two-GPU mode worked correctly.
I use the same Ubuntu PC for the ChainerMN tests in Docker mode and in non-Docker mode (normal Ubuntu 14.04).
Could you give me any advice about this issue?
(1) train_mnist.py on Docker container
root@0b2d3d4c3fea:/usr/local/lib/python2.7/dist-packages/chainermn# mpiexec --allow-run-as-root -n 2 python train_mnist.py -g
Using hierarchical communicator
Using hierarchical communicator
GPU: 0
# unit: 1000
# Minibatch-size: 100
# epoch: 20
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.226259 0.0917777 0.9321 0.9706 2.17506
2 0.0727778 0.0802592 0.978434 0.9733 3.69603
3 0.0461018 0.0792906 0.985501 0.9757 5.2229
4 0.032463 0.062461 0.989467 0.9792 6.7875
5 0.0232544 0.0649005 0.992534 0.9818 8.31209
6 0.0194039 0.0723934 0.9937 0.9796 9.85948
7 0.0159692 0.0708339 0.9946 0.9806 11.4088
8 0.0112379 0.0851562 0.996367 0.9797 12.935
9 0.0185411 0.0813743 0.9939 0.9797 14.4731
10 0.00952662 0.0778052 0.996834 0.9809 15.9948
11 0.0139726 0.0689402 0.995467 0.983 17.5234
12 0.0100885 0.0850812 0.996267 0.982 19.0561
13 0.0089669 0.082975 0.997 0.9817 20.5937
14 0.0102099 0.0774641 0.997134 0.9833 22.1206
15 0.00610311 0.0989188 0.998033 0.98 23.6562
16 0.0108334 0.098169 0.9965 0.9799 25.1871
17 0.0110047 0.0897099 0.996433 0.9822 26.7332
18 0.00507773 0.0902613 0.998033 0.9831 28.268
19 0.00678471 0.0961368 0.9979 0.98 29.7909
20 0.00619084 0.0877243 0.998 0.984 31.3245
root@0b2d3d4c3fea:/usr/local/lib/python2.7/dist-packages/chainermn#
(2)train_mnist.py on non-docker (normal Ubuntu)
yoshiki@arisax:~/chainermn$ mpiexec -n 2 python train_mnist.py -g
GPU: 1
# unit: 1000
# Minibatch-size: 100
# epoch: 20
GPU: 0
# unit: 1000
# Minibatch-size: 100
# epoch: 20
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.222195 0.0888905 0.9336 0.9724 2.19636
2 0.0742974 0.0748331 0.976834 0.977 3.83245
3 0.0433798 0.0833005 0.9862 0.9754 5.4308
4 0.0333733 0.0789987 0.989333 0.9767 7.06066
5 0.023105 0.0829177 0.991667 0.9771 8.66969
6 0.0185574 0.085721 0.993467 0.9773 10.3025
7 0.0167888 0.0824494 0.994467 0.9791 11.9232
8 0.0143293 0.0700943 0.9952 0.9816 13.5375
9 0.0134493 0.0807009 0.995667 0.9807 15.1565
10 0.0135038 0.0819335 0.995167 0.9809 16.7755
11 0.0104189 0.0711287 0.997033 0.9837 18.3991
12 0.00860689 0.0993782 0.997234 0.9789 20.0254
13 0.0136934 0.0894296 0.995467 0.9808 21.6475
14 0.011271 0.0865426 0.996133 0.9808 23.2727
15 0.010519 0.0885877 0.9972 0.9798 24.9043
16 0.00662932 0.0930554 0.997967 0.981 26.5198
17 0.00875542 0.0905427 0.997 0.9826 28.1383
18 0.0074622 0.0930135 0.9979 0.9819 29.7506
19 0.00749945 0.0876489 0.997533 0.981 31.3879
20 0.00758309 0.100907 0.997467 0.9802 33.0163
yoshiki@arisax:~/chainermn$
(3)my Docker file
Note: you have to fetch the train_mnist.py example yourself after logging into the Docker container with bash.
FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu14.04
ENV http_proxy my company proxy
ENV https_proxy my company proxy
ENV LANG en_US.UTF-8
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    openssh-server \
    git \
    wget \
    make \
    nano \
    file \
    python-dev \
    python-pip \
    cython && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /home/
RUN wget http://www.open-mpi.org/software/ompi/v2.1/downloads/openmpi-2.1.1.tar.gz
RUN tar xzvf openmpi-2.1.1.tar.gz
WORKDIR openmpi-2.1.1
RUN ./configure --with-cuda
RUN make -j4
RUN make install
RUN ldconfig
#RUN which mpicc
#RUN mpicc -show
#RUN which mpiexec
#RUN mpiexec --version
WORKDIR /home/
RUN git clone https://github.com/NVIDIA/nccl.git
WORKDIR /home/nccl/
RUN make CUDA_HOME=/usr/local/cuda test
RUN make install
ENV PATH /usr/local/bin:/usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH /usr/local/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
ENV LIBRARY_PATH /usr/local/lib:$LIBRARY_PATH
ENV CPATH /usr/local/cuda/include:/usr/local/include:$CPATH
RUN pip install --upgrade cupy
RUN pip install --upgrade urllib3
RUN pip install --upgrade pip
RUN pip install --upgrade cython
RUN pip install chainermn
WORKDIR /usr/local/lib/python2.7/dist-packages/chainermn
A segmentation fault occurs in the DCGAN example under the following conditions:
When I executed it under other conditions, it finished normally.
For example,
As of now, we use Travis CI for continuous integration. However, it does not support GPUs. We need to build a CI ecosystem similar to Chainer's.
The sudo command does not inherit the user's environment variables, so the following does not work, because sudo cannot see the CPATH environment variable the user set:
$ export CPATH=...
$ sudo pip install chainermn
Instead, the user needs to set the variable inside the sudo command, like this:
$ sudo CPATH=... pip install chainermn
This is a common problem with Python (pip?) and is not a problem with chainermn itself, but many users may run into it. I think it is better to add this tip to the documentation.
Cf. chainer.datasets.split_dataset_n_random
Currently ChainerMN ignores BN parameters (var, mean), which are treated as 'persistents' in Chainer. They should be averaged for better test accuracy. As they are not used in training at all, it is enough to reduce them just before validation. Therefore, I propose to implement an extension for that.
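A sketch of such an extension, with the communicator's averaging injected so it can be exercised without MPI. In ChainerMN it would be built on the communicator's allreduce; the class and argument names here are hypothetical, not existing API:

```python
import numpy as np


class AllreducePersistents(object):
    """Hypothetical trainer-extension-style callable that replaces each
    persistent BatchNorm array (avg_mean, avg_var) with its mean over
    workers just before validation.  `allreduce_mean` is injected so the
    sketch runs without MPI; with ChainerMN it would be built on the
    communicator's allreduce."""

    def __init__(self, persistents, allreduce_mean):
        self.persistents = persistents        # list of ndarrays to average
        self.allreduce_mean = allreduce_mean  # arr -> element-wise mean

    def __call__(self, trainer=None):
        for arr in self.persistents:
            arr[...] = self.allreduce_mean(arr)  # in-place update
```

Registered with a validation-aligned trigger, this would reduce avg_mean/avg_var once per evaluation rather than every iteration, matching the "reduce just before validation" proposal.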
Does chainermn currently support some method of creating and resuming from a snapshot object?
I can see the --resume argument for the parser in the example files, but chainermn is unable to create a snapshot object when the snapshot extension is called.
Thanks
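While waiting for an official answer, a common workaround is to register the snapshot extension on rank 0 only, so a single process writes the file. A hedged sketch, with the rank gating factored into a helper (the helper name is hypothetical):

```python
def extend_on_root(trainer, extension, rank, root=0):
    """Hypothetical helper: register a trainer extension (e.g.
    extensions.snapshot()) on the root rank only, so a single process
    writes the snapshot file instead of all workers racing on it."""
    if rank == root:
        trainer.extend(extension)
        return True
    return False
```

With ChainerMN this would be called as extend_on_root(trainer, extensions.snapshot(), comm.rank); every rank can still deserialize the file for --resume.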
I am trying to modify the ImageNet example to run on CPUs. After a couple of modifications, the code starts but immediately exits with the following error.
File "train_imagenet.py", line 80, in get_example
image -= self.mean[:, top:bottom, left:right]
ValueError: non-broadcastable output operand with shape (1,227,227) doesn't match the broadcast shape (3,227,227)
Could you please let me know how to address this!?
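The in-place subtraction fails because the image loaded in the CPU path has one channel while self.mean has three; NumPy cannot broadcast a (3, 227, 227) result into a (1, 227, 227) output operand. One fix, assuming the failing images really are grayscale, is to repeat the single channel to three before subtracting (a sketch; function names are hypothetical):

```python
import numpy as np


def to_three_channels(image):
    """Repeat a (1, H, W) grayscale image to (3, H, W) so it matches a
    3-channel dataset mean.  Assumes the failing images are grayscale;
    adapt if your data differs."""
    if image.shape[0] == 1:
        image = np.repeat(image, 3, axis=0)
    return image


def subtract_mean(image, mean, top, bottom, left, right):
    # Mirrors the failing line from get_example, with the channel fix.
    image = to_three_channels(image).astype(np.float32)
    image -= mean[:, top:bottom, left:right]
    return image
```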
I modified the MNIST example to use MultiprocessIterator, and found a performance degradation.
I added a -i option to switch between the two iterators.
$ mpiexec -n 1 python train_mnist.py -e 10 -g -i s
==========================================
Num process (COMM_WORLD): 1
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 10
==========================================
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.195262 0.0969903 0.94065 0.9679 3.51585
2 0.0735881 0.0793202 0.977016 0.9757 6.25958
3 0.0498798 0.0703787 0.984082 0.9797 9.0313
4 0.0363364 0.0805852 0.988499 0.9766 11.7959
5 0.0289838 0.067941 0.990466 0.9822 14.5519
6 0.0215275 0.0720275 0.993182 0.983 17.3163
7 0.0218637 0.0721944 0.992598 0.9815 20.0835
8 0.019435 0.0728776 0.993716 0.9831 22.8549
9 0.0174047 0.0845325 0.994565 0.9788 25.6255
10 0.0137951 0.0884498 0.995716 0.9814 28.3982
$ mpiexec -n 1 python train_mnist.py -e 10 -g -i m
==========================================
Num process (COMM_WORLD): 1
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 10
==========================================
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.197972 0.101519 0.939567 0.9667 52.1017
2 0.0744714 0.07385 0.976449 0.9765 77.5037
3 0.0482926 0.0734199 0.984132 0.9772 103.2
4 0.0359041 0.0606402 0.988332 0.9831 128.861
5 0.0276897 0.0864552 0.991231 0.9756 154.151
6 0.0257647 0.080412 0.991582 0.9791 179.945
7 0.0175599 0.0833933 0.994382 0.9771 205.538
8 0.0189049 0.0810598 0.994165 0.9812 231.028
9 0.0193482 0.0807478 0.993865 0.9808 256.623
10 0.0141072 0.0772984 0.995548 0.9821 282.182
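The ~10x slowdown is plausible for MNIST-sized examples, where per-batch work is tiny and MultiprocessIterator's inter-process communication dominates. For reference, the -i switch can be a small selection helper; the iterator classes are injected here so the sketch runs standalone, while in the real script they would be chainer.iterators.SerialIterator and MultiprocessIterator:

```python
def make_iterator(kind, dataset, batch_size, serial_cls, multi_cls):
    """Mirror of the -i option: 's' selects the serial iterator,
    'm' the multiprocess one.  Classes are injected for testability."""
    if kind == 's':
        return serial_cls(dataset, batch_size)
    if kind == 'm':
        return multi_cls(dataset, batch_size)
    raise ValueError('unknown iterator kind: %r' % (kind,))
```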
I've evaluated the distribution efficiency of ChainerMN on AWS GPU instances.
The article is here: https://qiita.com/sonots/private/22384bbc61284f2fdf94 (Japanese).
My evaluation showed that:
At first I thought this result was as expected, because ChainerMN recommends InfiniBand while p2.16xlarge has only 20 Gbps of network bandwidth. However, ChainerMN used only 6.0 Gbps during my experiment, so the network was not the bottleneck.
I investigated further, and it seems that sys CPU usage increases in the multi-node experiment.
%Cpu0 : 40.4 us, 59.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 31.4 us, 0.0 sy, 0.0 ni, 67.9 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu3 : 36.3 us, 61.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 2.6 si, 0.0 st
%Cpu4 : 96.0 us, 0.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 3.6 si, 0.0 st
%Cpu9 : 99.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu10 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 27.5 us, 72.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu14 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu15 : 34.7 us, 65.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu18 : 32.5 us, 66.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu19 : 93.7 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 6.3 si, 0.0 st
%Cpu23 : 34.3 us, 65.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu26 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu27 : 30.8 us, 69.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu28 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu29 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu30 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu31 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu34 : 68.9 us, 0.0 sy, 0.0 ni, 31.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu40 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
So the reason may be that kernel syscalls, probably handling network traffic, steal CPU from the Chainer process, so it cannot make progress on its main task. But this is just my guess.
Do you have any explanation for this, or any ideas for improving performance on AWS?
In the current version, users need the Cython module to install ChainerMN. This is an install-time-only dependency, so we want to remove it by generating the C++ code at sdist time.
I noticed that it is not only MPI4py.
File "train.py", line 132, in main
train_dataset = chainermn.scatter_dataset(train_dataset, comm)
File "/home/aixile/anaconda3/lib/python3.6/site-packages/chainermn/datasets/scatter_dataset.py", line 91, in scatter_dataset
comm.send(subds, dest=i)
File "MPI/Comm.pyx", line 1175, in mpi4py.MPI.Comm.send (src/mpi4py.MPI.c:106424)
File "MPI/msgpickle.pxi", line 210, in mpi4py.MPI.PyMPI_send (src/mpi4py.MPI.c:42085)
File "MPI/msgpickle.pxi", line 112, in mpi4py.MPI.Pickle.dump (src/mpi4py.MPI.c:40704)
TypeError: can't pickle Transaction objects
^C^Z[warn] Epoll ADD(4) on fd 28 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
This happens when my dataset loader tries to load images from a lmdb file.
class lsun_bedroom_train(datasets_base):
    def __init__(self, path, img_size=256):
        self.all_keys = self.read_image_key_file_json(path + '/key_bedroom.json')
        self.db = lmdb.open(path + "/bedroom_train_lmdb", readonly=True).begin(write=False)
        super(lsun_bedroom_train, self).__init__(flip=1, resize_to=img_size, crop_to=0)

    def __len__(self):
        return len(self.all_keys)

    def get_example(self, i):
        id = self.all_keys[i]
        val = self.db.get(id.encode())
        img = cv2.imdecode(np.fromstring(val, dtype=np.uint8), 1)
        img = self.do_augmentation(img)
        img = self.preprocess_image(img)
        return img
This issue is the central place to discuss the future plans. Any suggestion and contribution are appreciated. We only discuss relatively large tasks here, and smaller tasks are managed in separate issues as usual. We continuously update the following task list.
https://chainermn.readthedocs.io/en/latest/tutorial/step1_communicators_optimizers.html#run
Should we add a --gpu option? Otherwise every user will see warnings due to #69.
When searching for a word on readthedocs.io, the result links are broken.
Example:
The result of "communicator":
http://chainermn.readthedocs.io/en/latest/search.html?q=communicator&check_keywords=yes&area=default
The first result is:
http://chainermn.readthedocs.io/en/latest/reference/index.rst.html?highlight=communicator
This is because the extension (.rst.html) is wrong; simply ".html" is correct.
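Until the generated search index is fixed, the broken result links can be repaired mechanically by dropping the spurious .rst; a one-line sketch:

```python
def fix_search_link(url):
    """Repair a search-result link whose page extension was generated as
    '.rst.html' instead of '.html'."""
    return url.replace('.rst.html', '.html')
```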
It can be checked by the following test:
mpiexec -n 2 nosetests -v tests/test_communicator.py
This test sequence would stop at test_allreduct_grad_gpu. It seems NcclCommunicator fails to initialize itself.
MPI: OpenMPI-2.1.0 (do not reproduce on MVAPICH2-2.2)
When a user is using cleargrads and grad is not produced for some Variables, users encounter the following kind of exception:
File "/home/*****/.pyenv/versions/anaconda3-4.2.0/lib/python3.5/site-packages/chainermn/communicators/hierarchical_communicator.py", line 26, in <genexpr>
n_elems_total = sum(param.grad.size for param in params)
AttributeError: 'NoneType' object has no attribute 'size'
We would like to make it more user-friendly.
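One user-friendly fix is to skip parameters whose grad was never produced when sizing the allreduce buffer. A sketch; note that in a real multi-worker setting the set of skipped parameters must be identical on every rank, or the buffer layouts diverge:

```python
def total_grad_size(params):
    """Robust version of `sum(param.grad.size for param in params)`:
    parameters whose grad was never produced (grad is None after
    cleargrads with an unused branch) are skipped instead of raising
    AttributeError on NoneType."""
    return sum(p.grad.size for p in params if p.grad is not None)
```

Alternatively, the communicator could raise a descriptive error naming the offending parameter instead of the bare NoneType AttributeError.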
When the ImageNet example is run without the "--gpu" flag, it causes an error.
This is because the code is hard-coded to use HierarchicalCommunicator.
Traceback (most recent call last):
File "examples/imagenet/train_imagenet.py", line 194, in <module>
main()
File "examples/imagenet/train_imagenet.py", line 190, in main
trainer.run()
File "/home/kfukuda/chainer/chainer/training/trainer.py", line 295, in run
update()
File "/home/kfukuda/chainer/chainer/training/updater.py", line 175, in update
self.update_core()
File "/home/kfukuda/chainer/chainer/training/updater.py", line 186, in update_core
optimizer.update(loss_func, *in_arrays)
File "/home/kfukuda/chainermn/chainermn/multi_node_optimizer.py", line 28, in update
self.communicator.allreduce_grad(target)
File "/home/kfukuda/chainermn/chainermn/communicators/hierarchical_communicator.py", line 34, in allreduce_grad
params, itemsize, 'grad', self.gpu_buffer_a)
File "/home/kfukuda/chainermn/chainermn/communicators/_memory_utility.py", line 82, in pack_params
buffer.from_device(grad, size, offset)
File "/home/kfukuda/chainermn/chainermn/communicators/_memory_utility.py", line 61, in from_device
dst.copy_from_device(src.data, size)
TypeError: Argument 'src' has incorrect type (expected cupy.cuda.memory.MemoryPointer, got memoryview)
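One way to fix the example is to choose a CPU-capable communicator when --gpu is absent. A sketch of the selection logic; 'hierarchical' and 'naive' are communicator names accepted by chainermn.create_communicator:

```python
def pick_communicator_name(use_gpu):
    """Select a ChainerMN communicator name: the hierarchical communicator
    packs gradients into GPU buffers and so needs GPUs; fall back to the
    CPU-capable 'naive' communicator when no GPU is requested."""
    return 'hierarchical' if use_gpu else 'naive'
```

In the example this would become something like comm = chainermn.create_communicator(pick_communicator_name(args.gpu)) (a sketch, not the example's current code).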
I have a custom extension that changes the learning rate after a certain number of epochs.
This extension is called via a lambda function in the trainer's observe_value extension, as shown below:
trainer.extend(extensions.observe_value('lr', lambda _: lr_shift()), trigger=(150, 'epoch'))
I am wondering whether this extension should be called by a single worker or by all the workers?
e.g. should it be:
if comm.rank == 0: trainer.extend(extensions.observe_value('lr', lambda _: lr_shift()), trigger=(150, 'epoch'))
Thanks
We found that ChainerMN crashes under some conditions with Open MPI. We are investigating this issue, but we recommend MVAPICH at this moment for those who encounter this problem.
We found that it was the problem of environment. We now confirm that tests are passing with both Open MPI and MVAPICH.
This is a dummy issue to check Slack integration.
v3 RC is out now, so we need to work on it shortly.
I am creating this as a new issue again.
Hey all,
I tried to make a Dockerfile for ChainerMN, as below.
There were some warnings, but the build succeeded.
I ran my ChainerMN Docker image, started up a ChainerMN container, and tested train_mnist.py for ChainerMN in the container.
But I faced the two issues below.
Could you give me any advice?
I'd really appreciate it if you could tell me how to solve these issues.
★Issues
1. mpiexec run-as-root issue
mpiexec has detected an attempt to run as root.
Running at root is strongly discouraged as any mistake (e.g., in
defining TMPDIR) or bug can result in catastrophic damage to the OS
file system, leaving your system in an unusable state.
You can override this protection by adding the --allow-run-as-root
option to your cmd line. However, we reiterate our strong advice
against doing so - please do so at your own risk.
2. Open MPI setting issue
root@1f3a59529a69:/usr/local/lib/python2.7/dist-packages/chainermn# mpiexec -allow-run-as-root -n 4 pyhton train_mnist.py
The value of the MCA parameter "plm_rsh_agent" was set to a path
that could not be found:
plm_rsh_agent: ssh : rsh
Please either unset the parameter, or check that the path is correct
★My Dockerfile for ChainerMN
FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu14.04
ENV http_proxy my.company.com
ENV https_proxy my.company.con
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    git \
    wget \
    make \
    nano \
    file \
    python-dev \
    python-pip \
    cython && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /home/
RUN wget http://www.open-mpi.org/software/ompi/v2.1/downloads/openmpi-2.1.1.tar.gz
#RUN file -z openmpi-2.1.1.tar.gz
RUN tar xzvf openmpi-2.1.1.tar.gz
WORKDIR openmpi-2.1.1
RUN ./configure --with-cuda
RUN make -j4
RUN make install
RUN ldconfig
RUN which mpicc
RUN mpicc -show
RUN which mpiexec
RUN mpiexec --version
WORKDIR /home/
RUN git clone https://github.com/NVIDIA/nccl.git
WORKDIR /home/nccl/
RUN make CUDA_HOME=/usr/local/cuda test
RUN make install
ENV PATH /usr/local/bin:/usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH /usr/local/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
ENV LIBRARY_PATH /usr/local/lib:$LIBRARY_PATH
ENV CPATH /usr/local/cuda/include:/usr/local/include:$CPATH
RUN pip install --upgrade urllib3
RUN pip install --upgrade pip
RUN pip install --upgrade cython
RUN pip install chainermn
WORKDIR /usr/local/lib/python2.7/dist-packages/chainermn
RUN wget https://github.com/pfnet/chainermn/tree/master/examples/mnist/train_mnist.py
I am sorry to put the question directly here; I also posted it on Stack Overflow:
https://stackoverflow.com/questions/46901992/how-to-use-mix-link-multi-cpu-of-parallel-computing-in-chainer-v2-1-0
My paper submission deadline is November 8, which is urgent, so please forgive me for asking here.
In my research I wrote a two-layer neural network. The bottom (first) layer is an RNN that runs on the GPU; the top (second) layer runs on the CPU (the nature of the algorithm is better suited to the CPU), implemented as a self-defined Chainer Link.
But the CPU layer is slow, and I cannot wait given my paper deadline, so I want to parallelize this layer.
What is the best practice, and the fastest way, to parallelize this link?
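Absent a ChainerMN-specific answer, one pragmatic way to speed up a CPU-bound link is to split the minibatch across worker threads; NumPy releases the GIL inside most array operations, so threads can give real speedups for NumPy-heavy code. A hedged sketch (the function names are hypothetical; for pure-Python bottlenecks a process pool would be needed instead):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def parallel_forward(batch, forward_one, n_workers=4):
    """Apply a per-sample forward function over a minibatch in parallel.
    `forward_one` stands in for the custom link's per-example computation;
    threads help when it is NumPy-heavy, since NumPy releases the GIL."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        outputs = list(pool.map(forward_one, batch))  # order is preserved
    return np.stack(outputs)
```

Note that this only parallelizes the forward pass; wiring it into a Chainer Link's backward would take extra care with the computational graph.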
Hi there,
I was trying to test ChainerMN, and after installing it and running the MNIST example I got this:
[nelson-lab0:04968] *** Process received signal ***
[nelson-lab0:04968] Signal: Segmentation fault (11)
[nelson-lab0:04968] Signal code: Invalid permissions (2)
[nelson-lab0:04968] Failing at address: 0x2c0d820000
[nelson-lab0:04968] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f60496ad390]
[nelson-lab0:04968] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x14d566)[0x7f6049420566]
[nelson-lab0:04968] [ 2] /usr/lib/libopen-pal.so.13(+0x2fcb7)[0x7f600f328cb7]
[nelson-lab0:04968] [ 3] /usr/lib/libmpi.so.12(ompi_datatype_sndrcv+0x54c)[0x7f600fa6d6bc]
[nelson-lab0:04968] [ 4] /usr/lib/libmpi.so.12(MPI_Alltoall+0x16c)[0x7f600fa6f67c]
[nelson-lab0:04968] [ 5] /usr/local/lib/python2.7/dist-packages/mpi4py/MPI.so(+0x8dd99)[0x7f600fd82d99]
[nelson-lab0:04968] [ 6] python(PyEval_EvalFrameEx+0x68a)[0x4c468a]
[nelson-lab0:04968] [ 7] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04968] [ 8] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04968] [ 9] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04968] [10] python[0x4de6fe]
[nelson-lab0:04968] [11] python(PyObject_Call+0x43)[0x4b0cb3]
[nelson-lab0:04968] [12] python(PyEval_EvalFrameEx+0x2ad1)[0x4c6ad1]
[nelson-lab0:04968] [13] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04968] [14] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04968] [15] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04968] [16] python(PyEval_EvalFrameEx+0x68d1)[0x4ca8d1]
[nelson-lab0:04968] [17] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04968] [18] python(PyEval_EvalFrameEx+0x68d1)[0x4ca8d1]
[nelson-lab0:04968] [19] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04968] [20] python(PyEval_EvalCode+0x19)[0x4c2509]
[nelson-lab0:04968] [21] python[0x4f1def]
[nelson-lab0:04968] [22] python(PyRun_FileExFlags+0x82)[0x4ec652]
[nelson-lab0:04968] [23] python(PyRun_SimpleFileExFlags+0x191)[0x4eae31]
[nelson-lab0:04968] [24] python(Py_Main+0x68a)[0x49e14a]
[nelson-lab0:04968] [25] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f60492f3830]
[nelson-lab0:04968] [26] python(_start+0x29)[0x49d9d9]
[nelson-lab0:04968] *** End of error message ***
[nelson-lab0:04969] *** Process received signal ***
[nelson-lab0:04969] Signal: Segmentation fault (11)
[nelson-lab0:04969] Signal code: Invalid permissions (2)
[nelson-lab0:04969] Failing at address: 0x2c0d820000
[nelson-lab0:04969] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f4a9d3a1390]
[nelson-lab0:04969] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x14d566)[0x7f4a9d114566]
[nelson-lab0:04969] [ 2] /usr/lib/libopen-pal.so.13(+0x2fcb7)[0x7f4a6301ccb7]
[nelson-lab0:04969] [ 3] /usr/lib/libmpi.so.12(ompi_datatype_sndrcv+0x54c)[0x7f4a637616bc]
[nelson-lab0:04969] [ 4] /usr/lib/libmpi.so.12(MPI_Alltoall+0x16c)[0x7f4a6376367c]
[nelson-lab0:04969] [ 5] /usr/local/lib/python2.7/dist-packages/mpi4py/MPI.so(+0x8dd99)[0x7f4a63a76d99]
[nelson-lab0:04969] [ 6] python(PyEval_EvalFrameEx+0x68a)[0x4c468a]
[nelson-lab0:04969] [ 7] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04969] [ 8] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04969] [ 9] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04969] [10] python[0x4de6fe]
[nelson-lab0:04969] [11] python(PyObject_Call+0x43)[0x4b0cb3]
[nelson-lab0:04969] [12] python(PyEval_EvalFrameEx+0x2ad1)[0x4c6ad1]
[nelson-lab0:04969] [13] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04969] [14] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04969] [15] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04969] [16] python(PyEval_EvalFrameEx+0x68d1)[0x4ca8d1]
[nelson-lab0:04969] [17] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04969] [18] python(PyEval_EvalFrameEx+0x68d1)[0x4ca8d1]
[nelson-lab0:04969] [19] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04969] [20] python(PyEval_EvalCode+0x19)[0x4c2509]
[nelson-lab0:04969] [21] python[0x4f1def]
[nelson-lab0:04969] [22] python(PyRun_FileExFlags+0x82)[0x4ec652]
[nelson-lab0:04969] [23] python(PyRun_SimpleFileExFlags+0x191)[0x4eae31]
[nelson-lab0:04969] [24] python(Py_Main+0x68a)[0x49e14a]
[nelson-lab0:04969] [25] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f4a9cfe7830]
[nelson-lab0:04969] [26] python(_start+0x29)[0x49d9d9]
[nelson-lab0:04969] *** End of error message ***
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 4968 RUNNING AT nelson-lab0
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
The first message is about a segmentation fault, and the second about invalid permissions (?). I am wondering whether this comes from a mistake during installation, or from the GPUs?
My computer has 3 GPUs (2 Titan X and 1 960).
By the way, I do not have any problem when testing without a GPU.
Regards