
mpiT's People

Contributors

hughperkins, ljk628, shuzi, sixin-zh

mpiT's Issues

More comments on the asyncsgd example?

Hello,
Thanks a lot for your example.
I want to use MPI across multiple nodes, but I don't know how to set up the configuration.
Could you please add some explanation of the contents of "local conf" and of how to set it up for a multi-node environment? That would be very useful for the many people who want to use this package.
Thanks a lot!

mpicc not detected on Ubuntu

On Ubuntu 14.04, sudo apt-get install libopenmpi-dev installs mpicc to /usr/bin/mpicc, which is not detected by CMake (and is not compatible with the expected location ${POME}/exe/mpi/bin/mpicc either).

Advice on installing?

Hi,

Thanks for releasing this library! I'm on Ubuntu 14.04, and I've tried both sudo apt-get install mpich and sudo apt-get install libopenmpi-dev to install MPI. I can compile and run a basic MPI hello-world C program, but I can't figure out how to build this library (in particular, I don't see liboshmem.so anywhere on my system). Do you have advice on the simplest way to set up MPI?
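For reference, a minimal Ubuntu setup that has worked for similar Torch + MPI builds. This is a sketch under two assumptions: that the build reads the MPI location from an MPI_PREFIX environment variable (as the build logs further down this page suggest), and that an Open MPI rockspec named mpit-openmpi-1.rockspec exists alongside the MVAPICH one; check the repository for the actual rockspec names.

    # Install the Open MPI development packages (Ubuntu 14.04)
    sudo apt-get install libopenmpi-dev openmpi-bin

    # Sanity check: the compiler wrapper should be on the PATH
    which mpicc        # typically /usr/bin/mpicc

    # Point the build at the system-wide MPI install
    export MPI_PREFIX=/usr
    luarocks make mpit-openmpi-1.rockspec   # rockspec name is an assumption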

Invalid device ordinal

mpiT works perfectly in CPU mode.
cunn works perfectly on a single node.
The problem occurs when running mpiT and cunn on multiple nodes, even just two.

[Error message dump]

THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-6648/cutorch/lib/THC/THCTensorRandom.cu line=20 error=10 : invalid device ordinal
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-6648/cutorch/lib/THC/THCTensorRandom.cu line=20 error=10 : invalid device ordinal
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-6648/cutorch/lib/THC/THCTensorRandom.cu line=20 error=10 : invalid device ordinal
/lustre/atlas/proj-shared/med100/torch/install/bin/luajit: ...shared/med100/torch/install/share/lua/5.1/trepl/init.lua:384: ...shared/med100/torch/install/share/lua/5.1/trepl/init.lua:384: cuda runtime error (10) : invalid device ordinal at /tmp/luarocks_cutorch-scm-1-6648/cutorch/lib/THC/THCTensorRandom.cu:20
stack traceback:
[C]: in function 'error'
...shared/med100/torch/install/share/lua/5.1/trepl/init.lua:384: in function 'require'
test.lua:2: in main chunk
[C]: in function 'dofile'
...d100/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405e70
/lustre/atlas/proj-shared/med100/torch/install/bin/luajit/lustre/atlas/proj-shared/med100/torch/install/bin/luajit:: ...shared/med100/torch/install/share/lua/5.1/trepl/init.lua:384: ...shared/med100/torch/install/share/lua/5.1/trepl/init.lua:384: cuda runtime error (10) : invalid device ordinal at /tmp/luarocks_cutorch-scm-1-6648/cutorch/lib/THC/THCTensorRandom.cu:20
stack traceback:
[C]: in function 'error'
...shared/med100/torch/install/share/lua/5.1/trepl/init.lua:384: in function 'require'
test.lua:2: in main chunk
[C]: in function 'dofile'
...d100/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405e70...shared/med100/torch/install/share/lua/5.1/trepl/init.lua:384: ...shared/med100/torch/install/share/lua/5.1/trepl/init.lua:384: cuda runtime error (10) : invalid device ordinal at /tmp/luarocks_cutorch-scm-1-6648/cutorch/lib/THC/THCTensorRandom.cu:20
stack traceback:
[C]: in function 'error'
...shared/med100/torch/install/share/lua/5.1/trepl/init.lua:384: in function 'require'
test.lua:2: in main chunk
[C]: in function 'dofile'
...d100/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405e70
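A common cause of cuda runtime error 10 on multi-node jobs is selecting a GPU by global MPI rank instead of node-local rank, so rank 2 asks for a device that does not exist on its node. A hedged sketch of per-node device selection; the environment variable names are implementation-specific assumptions (MV2_COMM_WORLD_LOCAL_RANK for MVAPICH2, OMPI_COMM_WORLD_LOCAL_RANK for Open MPI), and the script must be launched under mpirun for them to be set:

    -- Pick a GPU from the node-local rank, not the global rank.
    require 'cutorch'
    local localrank = tonumber(os.getenv('MV2_COMM_WORLD_LOCAL_RANK'))
                   or tonumber(os.getenv('OMPI_COMM_WORLD_LOCAL_RANK'))
                   or 0
    -- cutorch devices are 1-indexed; wrap around the GPUs on this node
    cutorch.setDevice(localrank % cutorch.getDeviceCount() + 1)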

Errors with Open MPI bindings

Hi Sixin,

mpiT works well with MPICH; however, I get some errors when installing and testing mpiT with Open MPI 2.0.1.

During installation, the following warnings appear:

[ 50%] Building C object CMakeFiles/mpiT.dir/mpiT.c.o
In file included from /home/hao/tools/mpiT/mpiT.c:18:0:
/home/hao/tools/mpiT/lua-mpi.h: In function ‘_MPI_Op’:
/home/hao/tools/openmpi-2.0.1/install/include/mpi.h:313:46: warning: passing argument 2 of ‘luampi_push_MPI_Op’ from incompatible pointer type
 #define OMPI_PREDEFINED_GLOBAL(type, global) ((type) ((void *) &(global)))
                                              ^
/home/hao/tools/mpiT/lua-mpi.h:19:28: note: in definition of macro ‘MPI_STRUCT_TYPE’
     luampi_push_MPI_##s(L, inival, N);                                  \
                            ^
/home/hao/tools/openmpi-2.0.1/install/include/mpi.h:1055:27: note: in expansion of macro ‘OMPI_PREDEFINED_GLOBAL’
 #define MPI_DATATYPE_NULL OMPI_PREDEFINED_GLOBAL(MPI_Datatype, ompi_mpi_datatype_null)
                           ^
/home/hao/tools/mpiT/lua-mpi.h:51:21: note: in expansion of macro ‘MPI_DATATYPE_NULL’
 MPI_STRUCT_TYPE(Op, MPI_DATATYPE_NULL)
                     ^
/home/hao/tools/mpiT/lua-mpi.h:8:15: note: expected ‘MPI_Op’ but argument is of type ‘struct ompi_datatype_t *’
   static void luampi_push_MPI_##s(lua_State *L, MPI_##s init, int N)    \
               ^
/home/hao/tools/mpiT/lua-mpi.h:51:1: note: in expansion of macro ‘MPI_STRUCT_TYPE’
 MPI_STRUCT_TYPE(Op, MPI_DATATYPE_NULL)
 ^
In file included from /home/hao/tools/mpiT/mpiT.c:18:0:
/home/hao/tools/mpiT/lua-mpi.h: In function ‘register_constants’:
/home/hao/tools/mpiT/lua-mpi.h:150:3: warning: ‘ompi_mpi_ub’ is deprecated (declared at /home/hao/tools/openmpi-2.0.1/install/include/mpi.h:926): MPI_UB is deprecated in MPI-2.0 [-Wdeprecated-declarations]
   luampi_push_MPI_Datatype(L, MPI_UB, 1); lua_setfield(L, -2, "UB");
   ^
/home/hao/tools/mpiT/lua-mpi.h:151:3: warning: ‘ompi_mpi_lb’ is deprecated (declared at /home/hao/tools/openmpi-2.0.1/install/include/mpi.h:925): MPI_LB is deprecated in MPI-2.0 [-Wdeprecated-declarations]
   luampi_push_MPI_Datatype(L, MPI_LB, 1); lua_setfield(L, -2, "LB");
   ^

Though the build continues and reports a successful installation, there are errors when testing with mpirun -n 2 th test.lua:

[max:13085] mca_base_component_repository_open: unable to open mca_shmem_sysv: /home/hao/tools/openmpi-2.0.1/install/lib/openmpi/mca_shmem_sysv.so: undefined symbol: opal_show_help (ignored)
[max:13085] mca_base_component_repository_open: unable to open mca_shmem_mmap: /home/hao/tools/openmpi-2.0.1/install/lib/openmpi/mca_shmem_mmap.so: undefined symbol: opal_show_help (ignored)
[max:13085] mca_base_component_repository_open: unable to open mca_shmem_posix: /home/hao/tools/openmpi-2.0.1/install/lib/openmpi/mca_shmem_posix.so: undefined symbol: opal_shmem_base_framework (ignored)
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_shmem_base_select failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[max:13085] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[15292,1],0]
  Exit code:    1
--------------------------------------------------------------------------

Is there any documentation for mpiT?

Thanks for sharing your code.
Is there any tutorial or documentation on how to use it? I am very eager to make my Torch algorithm run on multiple machines.

Fails on "luarocks make mpit-mvapich-1.rockspec"

I followed the instructions at https://github.com/sixin-zh/mpiT to install mpiT.

When I ran luarocks make mpit-mvapich-1.rockspec, it failed.
The log is below. How can I resolve this problem? Thanks.

root@117:~/mpiT-master# luarocks make mpit-mvapich-1.rockspec
cmake -E make_directory build && cd build && cmake ..
-DCMAKE_C_COMPILER=${MPI_PREFIX}/bin/mpicc -DCMAKE_CXX_COMPILER=${MPI_PREFIX}/bin/mpicxx
-DMPI_ROOT=${MPI_PREFIX}
-DOPENMPI=0
-DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH="/home/slave117/torch/install/bin/.." -DCMAKE_INSTke

-- Found Torch7 in /home/slave117/torch/install
MPI ROOT is
LMPI is /lib/klibc-P2s_k-gf23VtrGgO2_4pGkQgwMY.so
LMCA is
-- Configuring done
You have changed variables that require your cache to be deleted.
Configure will be re-run and you may have to reset some variables.
The following variables have changed:
CMAKE_C_COMPILER= /bin/mpicc
CMAKE_CXX_COMPILER= /bin/mpicxx

-- The C compiler identification is unknown
-- The CXX compiler identification is unknown
CMake Error in :
The CMAKE_C_COMPILER:

/bin/mpicc

is not a full path to an existing compiler tool.

Tell CMake where to find the compiler by setting either the environment
variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
the compiler, or to the compiler name if it is in the PATH.

CMake Error in :
The CMAKE_CXX_COMPILER:

/bin/mpicxx

is not a full path to an existing compiler tool.

Tell CMake where to find the compiler by setting either the environment
variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
to the compiler, or to the compiler name if it is in the PATH.

-- Configuring incomplete, errors occurred!
See also "/home/slave117/mpiT-master/build/CMakeFiles/CMakeOutput.log".
See also "/home/slave117/mpiT-master/build/CMakeFiles/CMakeError.log".

Error: Build error: Failed building.
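The log above points at the likely cause: the line "MPI ROOT is" prints empty, so MPI_PREFIX was never set and CMake falls back to the nonexistent /bin/mpicc. A sketch of a fix; the MVAPICH2 install path below is a placeholder, substitute your own:

    # MPI_PREFIX must point at the MPI installation root so that
    # ${MPI_PREFIX}/bin/mpicc resolves to a real compiler wrapper.
    export MPI_PREFIX=/opt/mvapich2        # placeholder: your MVAPICH2 prefix
    ls ${MPI_PREFIX}/bin/mpicc             # sanity check before building
    luarocks make mpit-mvapich-1.rockspec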

CUDA test failing

Out of the box, I'm seeing ptest.lua in asyncsgd fail when I set:

local usecuda = true

I get the following:

$ mpiexec -np 2 luajit ptest.lua

rank 1 is client.
rank 0 is server.
0   use cpu
1   use gpu 1

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 30740 RUNNING AT code13
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

It looks like the problem is that the storages referenced in the asynchronous sends/receives are actually CudaStorages, which point to GPU memory. Should they in fact be FloatStorages? If I change each CudaTensor to have a corresponding FloatStorage, it seems to work.

Thanks.
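The workaround described above can be sketched as follows. This is illustrative only: the variable names are made up, and the mpiT.Isend argument order is assumed to mirror C's MPI_Isend(buf, count, datatype, dest, tag, comm, request), not confirmed against the mpiT source. Without a CUDA-aware MPI, the library cannot dereference GPU pointers, so GPU tensors must be staged through host memory:

    -- Stage GPU tensors through a host-side FloatTensor before MPI calls.
    local gpuparam = torch.CudaTensor(1000)          -- lives on the GPU
    local hostbuf  = torch.FloatTensor(gpuparam:size())  -- lives in host RAM
    hostbuf:copy(gpuparam)                           -- device -> host
    -- hand MPI the host storage, never the CudaStorage
    mpiT.Isend(hostbuf:storage(), hostbuf:nElement(), mpiT.FLOAT,
               dest, tag, mpiT.COMM_WORLD, req)
    -- ...after the matching receive completes, copy back:
    gpuparam:copy(hostbuf)                           -- host -> device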

What does '--BLK' mean?

Hi Sixin,

Thanks for sharing this package! I am working on improving it. I have a few questions from reading the code:

  1. Is the 'asyncsgd' folder an implementation of your Elastic Averaging SGD paper?
  2. What does the short comment of '--BLK' mean in goot.lua?

There are some minor mistakes that could be fixed to make it work:

  1. goot.lua: confusion:addbatch --> confusion:BatchAdd
  2. plaunch.lua: "require 'mpiT'" should appear before dofile('init.lua')

Thanks,
Hao

Segfault in Allreduce. How to fix?

Hi, I love using MPI, and I'm excited that MPI is available for Torch. I'm trying to do a simple Allreduce, and it segfaults. Thoughts? Basically, I copied test.lua to testreduceall.lua and added the following lines just before the mpiT.Finalise() call:

torch.manualSeed(123)
a = torch.FloatTensor(3,2):uniform()
print('a', a)
mpiT.Allreduce(a:storage(), 6, mpiT.FLOAT, mpiT.SUM, mpiT.COMM_WORLD)
print('a after', a)

I'm running like:

mpirun.mpich -np 4 luajit testreduceall.lua

What I expect to happen:

  • some sort of interleaved printout of the initial a tensors
  • some sort of interleaved printout of the a tensors multiplied by 4

What actually happens:

$ mpirun.mpich -np 4 luajit testreduceall.lua 
init
init
init
initsize
size
size
[0/4]s=0r=0
[2/4]s=2r=2

size
[3/4]s=3r=3[1/4]s=1r=1

[1/4]s=1r=0
[2/4]s=2r=1
[3/4]s=3r=2[0/4]s=0r=3

aaaa     0.6965  0.7130
 0.2861  0.4285
 0.2269  0.6909
[torch.FloatTensor of size 3x2]

     0.6965  0.7130
 0.2861  0.4285
 0.2269  0.6909
[torch.FloatTensor of size 3x2]

     0.6965  0.7130
 0.2861  0.4285
 0.2269  0.6909
[torch.FloatTensor of size 3x2]

     0.6965  0.7130
 0.2861  0.4285
 0.2269  0.6909
[torch.FloatTensor of size 3x2]


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
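One likely culprit, assuming mpiT.Allreduce mirrors the C signature MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm): the call above passes only five arguments, so the wrapper may be interpreting the count 6 as the receive buffer and dereferencing it. A sketch with separate send and receive storages; the argument order is an assumption, not confirmed against the mpiT source:

    torch.manualSeed(123)
    local a = torch.FloatTensor(3,2):uniform()   -- send buffer
    local b = torch.FloatTensor(3,2)             -- distinct receive buffer
    -- C's MPI_Allreduce takes separate send and receive buffers;
    -- the Lua binding is assumed to follow the same order.
    mpiT.Allreduce(a:storage(), b:storage(), 6,
                   mpiT.FLOAT, mpiT.SUM, mpiT.COMM_WORLD)
    print('sum over ranks', b)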
