microsoft / multiverso
Parameter server framework for distributed machine learning
Home Page: http://www.dmtk.io
License: MIT License
The build instructions (README) don't work right now. After some investigation, I found that you have made the dcasgd updater a submodule.
Operating System: Ubuntu 16.04
VirtualBox Manager: Oracle VM Version 5.2.8 r121009
Docker version: 18.03.1-ce, build 9ee9f40
I am trying to build the Dockerfile to test Multiverso on my virtual machine.
docker build -f ./Dockerfile .
Everything works until Java is set up during the execution of the Dockerfile, that is:
RUN mkdir -p /usr/local/java/default && \
curl -Ls 'http://download.oracle.com/otn-pub/java/jdk/8u65-b17/jdk-8u65-linux-x64.tar.gz' -H 'Cookie: oraclelicense=accept-securebackup-cookie' | \
tar --strip-components=1 -xz -C /usr/local/java/default/
The terminal shows the following error:
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
The command '/bin/sh -c mkdir -p /usr/local/java/default &&
curl -Ls 'http://download.oracle.com/otn-pub/java/jdk/8u65-b17/jdk-8u65-linux-x64.tar.gz'
-H 'Cookie: oraclelicense=accept-securebackup-cookie' |
tar --strip-components=1 -xz -C /usr/local/java/default/' returned a non-zero code: 2
I think the problem is that the Oracle file is not fetched correctly.
To better tune my application, I set a name for every thread by calling the pthread_setname_np API on Linux. I found high CPU usage in Multiverso's Communicator (a subclass of Actor) thread. Here is a snapshot provided by htop.
I noticed some code in src/communicator.cpp. According to this code, the communicator always calls non-blocking send and receive. I think this is the cause of the high CPU usage.
MessagePtr msg;
while (mailbox_->Alive()) {
// Try pop and Send
if (mailbox_->TryPop(msg)) {
ProcessMessage(msg);
}
// Probe and Recv
size_t size = net_util_->Recv(&msg);
if (size > 0) LocalForward(msg);
CHECK(msg.get() == nullptr);
net_util_->Send(msg);
}
break;
}
This "spin lock" design is great. But most applications won't call send/recv thousands or millions of times per second, which means a CPU core is wasted. Why not let the communicator sleep for a little while when there are no send/recv calls?
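The idea of sleeping while idle can be sketched as a polling loop with backoff. This is only an illustration, not Multiverso code: `PollWithBackoff`, `spin_limit`, and `idle_sleep` are hypothetical names, and `poll_once` stands in for one round of "try pop and send, probe and recv".

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of Multiverso. Runs `poll_once` until
// `alive` returns false. `poll_once` returns true when it handled a message.
// After `spin_limit` consecutive idle polls the thread sleeps for
// `idle_sleep` instead of spinning. Returns how many times it slept.
inline int PollWithBackoff(const std::function<bool()>& alive,
                           const std::function<bool()>& poll_once,
                           int spin_limit,
                           std::chrono::microseconds idle_sleep) {
  int idle_polls = 0;
  int sleeps = 0;
  while (alive()) {
    if (poll_once()) {
      idle_polls = 0;  // traffic seen: keep polling at full speed
    } else if (++idle_polls >= spin_limit) {
      std::this_thread::sleep_for(idle_sleep);  // give the core back while idle
      ++sleeps;
      idle_polls = 0;
    }
  }
  return sleeps;
}
```

The trade-off is latency: a message that arrives during a sleep waits up to `idle_sleep` before it is handled, so the sleep interval would need tuning against the application's message rate.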
Your work is really awesome. I have studied part of the project, but I might not understand it very well.
I have several questions about the communication between server and worker, as follows.
Thank you so much.
I got the error below while running wordembedding on a cluster.
I am using StarCluster to create the cluster and MPI to run the jobs.
mpirun noticed that process rank 22 with PID 0 on node node002 exited on signal 11 (Segmentation fault).
I would like to point out that identifiers like "_MULTIVERSO_BARRIER_H_" and "_MULTIVERSO_SERVER_H_" do not fit the expected naming convention of the C++ language standard.
Would you like to adjust your selection for unique names?
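For reference, the C++ standard reserves names that begin with an underscore followed by an uppercase letter, and names that contain a double underscore. A minimal sketch of an include guard that avoids both follows. The guard name and the `multiverso_sketch` namespace are illustrative choices, not names taken from the project.

```cpp
// barrier.h (sketch). The guard name has no leading underscore and no
// double underscore, so it stays out of the reserved identifier space.
#ifndef MULTIVERSO_BARRIER_H
#define MULTIVERSO_BARRIER_H

namespace multiverso_sketch {
// Placeholder declaration so the guarded header has some content.
inline int GuardedHeaderMarker() { return 1; }
}  // namespace multiverso_sketch

#endif  // MULTIVERSO_BARRIER_H
```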
Are there any API documents?
When I increase the number of blocks, I get a segmentation fault. The gdb debugging information follows.
Can anyone share any ideas?
Thank you all!
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffd46403700 (LWP 37256)]
0x000000000041e83c in multiverso::Aggregator::StartThread() ()
(gdb) bt
#0 0x000000000041e83c in multiverso::Aggregator::StartThread() ()
#1 0x00007ffff7708970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007ffff7965064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007ffff6e7862d in clone () from /lib/x86_64-linux-gnu/libc.so.6
The sync server occasionally blocks, and this can be reproduced with the Matrix unit test.
The Multiverso documentation says:
BSP (max_delay=0): All worker processes are barriered forcibly at the end of a clock.
What does "clock" actually mean in Multiverso? A number of mini-batches, or a period of time?
The build of the Multiverso/Test files has the "mpicxx" compiler hard-coded instead of respecting the option
cmake -D MPI_CXX_COMPILER=CC, which breaks builds in Cray Linux environments, and probably in any user environment where compiler wrappers are used. This needs to be fixed to support building in environments where MPI compiler wrappers are common (e.g. CC, cc, ftn).
For building on Cray XC systems I'm configuring multiverso with cmake/3.5.2 as follows:
cmake -D MPI_CXX_COMPILER=CC -D MPI_C_COMPILER=cc -D CMAKE_LINKER=CC -D CMAKE_CXX_COMPILER=CC -D CMAKE_C_COMPILER=cc -D CMAKE_SYSTEM_NAME=CrayLinuxEnvironment -D CMAKE_BUILD_TYPE="release" -D BUILD_SHARED_LIBS=ON -D TEST=OFF -D BOOST_ROOT=/path/to/Boost/boost-1.60.0 -D BOOST_INCLUDEDIR=/path/to/Boost/boost-1.60.0/include -D MPI_LIBRARY=${CRAY_MPICH_DIR}/lib/libmpich.so -D MPI_EXTRA_LIBRARY=${CRAY_MPICH_DIR}/lib -D CMAKE_VERBOSE_MAKEFILE=TRUE -D MPI_CXX_INCLUDE_PATH=${CRAY_MPICH_DIR}/include -D MPIEXEC=/path/to/srun -D MPIEXEC_NUMPROC_FLAG="-n" ../
-Jake
Would you like to wrap any pointer data members with the class template “std::unique_ptr”?
I expect that exception handling is usually supported by a C++ program. I wonder why your function "main" does not contain corresponding try and catch instructions so far.
What do you think about the recommendations by Matthew Wilson in an article?
Would you like to adjust the implementation if you consider effects for uncaught/unhandled exceptions like they are described by Danny Kalev?
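A minimal sketch of what such a top-level handler could look like is given below. `RunGuarded` and `RealMain` are hypothetical names used only for illustration. The point is that an exception leaving `main` calls `std::terminate`, whereas a catch at the top level lets the program report the error and return a failure code.

```cpp
#include <cstdlib>
#include <exception>
#include <functional>
#include <iostream>

// Hypothetical wrapper: runs `body` and converts any escaping exception
// into EXIT_FAILURE instead of letting std::terminate be called.
inline int RunGuarded(const std::function<int()>& body) {
  try {
    return body();
  } catch (const std::exception& e) {
    std::cerr << "fatal: " << e.what() << '\n';
  } catch (...) {
    std::cerr << "fatal: unknown exception\n";
  }
  return EXIT_FAILURE;
}

// Intended use (RealMain is a placeholder for the program's actual logic):
//   int main(int argc, char* argv[]) {
//     return RunGuarded([&] { return RealMain(argc, argv); });
//   }
```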
After building as the README says, I
run ./multiverso_server
and get this error:
error while loading shared libraries: libzmq.so.5: cannot open shared object file: No such file or directory
make all -j4
produces the following:
find: /src/multiverso: No such file or directory
find: /src/multiverso_server: No such file or directory
find: /src/multiverso: No such file or directory
find: /src/multiverso_server: No such file or directory
mkdir -p /lib
mkdir: /lib: Permission denied
find: /src/multiverso: No such file or directory
make: *** [/lib] Error 1
I'm not an expert in Makefiles, but why does it try to install into /src/multiverso_server?
# Imports assumed; the first lines of the file were not included in the report.
import numpy as np
import theano
import multiverso as mv
from multiverso.theano_ext import sharedvar


class TestMultiversoSharedVariable:
    def _test_sharedvar(self, row, col):
        W = sharedvar.mv_shared(
            value=np.zeros(
                (row, col),
                dtype=theano.config.floatX
            ),
            name='W',
            borrow=True
        )
        delta = np.array(range(1, row * col + 1),
                         dtype=theano.config.floatX).reshape((row, col))
        train_model = theano.function([], updates=[(W, W + delta)])
        for i in range(10):
            train_model()
            train_model()
            sharedvar.sync_all_mv_shared_vars()  # send to server
            # mv.barrier()
            # to get the newest value, we must sync again
            mv.barrier()
            sharedvar.sync_all_mv_shared_vars()
            for j, actual in enumerate(W.get_value().reshape(-1)):
                # columns: iteration, index, expected value, actual value
                print("[%d] %d %d %d" % (i, j, (j + 1) * (i + 1) * 2 * mv.workers_num(), actual))

    def test_sharedvar(self):
        self._test_sharedvar(10, 10)


if __name__ == '__main__':
    mv.init()
    test_shared = TestMultiversoSharedVariable()
    test_shared.test_sharedvar()
    mv.shutdown()
I ran this test and found that it works when I start one worker per node,
but when I start two workers per node, all workers block.
mpirun -hostfile alg_cluster.txt -npernode 1 python test_multi.py
mpirun -hostfile alg_cluster.txt -npernode 2 python test_multi.py
There are three IPs in my cluster.
Is the default configuration num_server == num_worker?
How do I configure different numbers of server and worker machines, such as one server and two workers? Thanks.
zmq.hpp:No such file or directory
#include "zmq.hpp"
I want to install word-embedding on Linux.
I entered the following lines:
cd multiverso/Applications/WordEmbedding
cmake CMakeLists.txt
and the following error occurred:
CMake Error at CMakeLists.txt:22 (add_executable):
Cannot find source file:
//src
Tried extensions .c .C .c++ .cc .cpp .cxx .m .M .mm .h .hh .h++ .hm .hpp
.hxx .in .txx
Any help would be appreciated.
void Server::Process_EndTrain(std::shared_ptr msg_pack)
{
MsgType msg_type;
MsgArrow arrow;
int src, dst;
msg_pack->GetHeaderInfo(&msg_type, &arrow, &src, &dst);
clocks_[src] = 1 << 31;
Here clocks_ is a vector, so after the bit shift clocks_[src] becomes INT_MIN. As a result, after one worker ends training, the other workers will hang when config.max_delay >= 0. Will this be a problem?
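To make the concern concrete, here is a sketch under a stated assumption: that the server lets a worker advance only while its clock is at most max_delay ahead of the slowest worker's clock. Whether Multiverso uses exactly this rule is not confirmed by the snippet above. `kFinishedClock`, `MinClock`, and `MayAdvance` are hypothetical names. Under that rule a finished worker marked with INT_MIN becomes the permanent minimum and stalls everyone, while INT_MAX as the marker does not.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Assumption: a finished worker is marked with a sentinel clock value.
// INT_MAX sorts above every real clock, so a finished worker never becomes
// the minimum that running workers are compared against.
constexpr int kFinishedClock = std::numeric_limits<int>::max();

// Smallest clock among all workers. Precondition: `clocks` is not empty.
inline int MinClock(const std::vector<int>& clocks) {
  return *std::min_element(clocks.begin(), clocks.end());
}

// Hypothetical staleness rule: a worker may advance only while it is at
// most `max_delay` clocks ahead of the slowest worker.
inline bool MayAdvance(int worker_clock, const std::vector<int>& clocks,
                       int max_delay) {
  // Widen to long long so the subtraction cannot overflow.
  const long long lead =
      static_cast<long long>(worker_clock) - MinClock(clocks);
  return lead <= max_delay;
}
```

With clocks {3, INT_MIN, 3} and max_delay = 0, `MayAdvance(3, ...)` is false for the two running workers. With {3, kFinishedClock, 3} it is true.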
I would like to build Multiverso with mvapich2, and my CMake command looks like this:
cmake -DCMAKE_VERBOSE_MAKEFILE=TRUE -DMPI_CXX_INCLUDE_PATH=/usr/local/mvapich2-2.2/include -DMPI_CXX_LIBRARIES=/usr/local/mvapich2-2.2 -DMPI_LIBRARY=/usr/local/mvapich2-2.2 -...
It passes the source build, but fails the Test build with the following errors:
/cntk/build/gpu/release/lib/libmultiverso.so: undefined reference to `MPI_Barrier'
/cntk/build/gpu/release/lib/libmultiverso.so: undefined reference to `MPI_Iprobe'
/cntk/build/gpu/release/lib/libmultiverso.so: undefined reference to `MPI_Get_count'
/cntk/build/gpu/release/lib/libmultiverso.so: undefined reference to `MPI_Isend'
/cntk/build/gpu/release/lib/libmultiverso.so: undefined reference to `MPI_Initialized'
/cntk/build/gpu/release/lib/libmultiverso.so: undefined reference to `MPI_Allreduce'
/cntk/build/gpu/release/lib/libmultiverso.so: undefined reference to `MPI_Comm_size'
/cntk/build/gpu/release/lib/libmultiverso.so: undefined reference to `MPI_Init_thread'
/cntk/build/gpu/release/lib/libmultiverso.so: undefined reference to `MPI_Query_thread'
/cntk/build/gpu/release/lib/libmultiverso.so: undefined reference to `MPI_Wait'
/cntk/build/gpu/release/lib/libmultiverso.so: undefined reference to `MPI_Recv'
/cntk/build/gpu/release/lib/libmultiverso.so: undefined reference to `MPI_Comm_rank'
/cntk/build/gpu/release/lib/libmultiverso.so: undefined reference to `MPI_Finalize'
/cntk/build/gpu/release/lib/libmultiverso.so: undefined reference to `MPI_Testall'
Any suggestions on this? Thank you.
If I use MPI, how does Multiverso handle the failure of a server? Does it support failover?
I see that Multiverso only supports ASP and BSP. Does it support SSP?
Hi all,
I am interested in the Python version. If I use Python 2.7 rather than Python 3 or 3.5, what trouble will I run into?
Would you like to replace any double quotes by angle brackets around file names for include statements?
andy@Andy-UB1604:~/prj/Multiverso/build$ cmake ..
OpenMP found
/usr/local/lib/libmpi.so
/usr/local/lib/libmpi.so
CMake Warning at /usr/share/cmake-3.5/Modules/FindBoost.cmake:725 (message):
Imported targets not available for Boost version
Call Stack (most recent call first):
/usr/share/cmake-3.5/Modules/FindBoost.cmake:763 (_Boost_COMPONENT_DEPENDENCIES)
/usr/share/cmake-3.5/Modules/FindBoost.cmake:1332 (_Boost_MISSING_DEPENDENCIES)
Test/unittests/CMakeLists.txt:3 (find_package)
CMake Error at /usr/share/cmake-3.5/Modules/FindBoost.cmake:1677 (message):
Unable to find the requested Boost libraries.
Unable to find the Boost header files. Please set BOOST_ROOT to the root
directory containing Boost or BOOST_INCLUDEDIR to the directory containing
Boost's headers.
Call Stack (most recent call first):
Test/unittests/CMakeLists.txt:3 (find_package)
CMake Error at Test/unittests/CMakeLists.txt:9 (MESSAGE):
message called with incorrect number of arguments
Boost_INCLUDE_DIR-NOTFOUND
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
/home/andy/prj/Multiverso/Test/unittests/Boost_INCLUDE_DIR
used as include directory in directory /home/andy/prj/Multiverso/Test/unittests
-- Configuring incomplete, errors occurred!
See also "/home/andy/prj/Multiverso/build/CMakeFiles/CMakeOutput.log".
andy@Andy-UB1604:~/prj/Multiverso/build$ sudo apt-get install libopenmpi-dev openmpi-bin
Reading package lists... Done
Building dependency tree
Reading state information... Done
libopenmpi-dev is already the newest version (1.10.2-8ubuntu1).
openmpi-bin is already the newest version (1.10.2-8ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 354 not upgraded.
andy@Andy-UB1604:~/prj/Multiverso/build$
Hi all,
I have installed Multiverso on three Ubuntu 14.04 VMs (master0, slave1, slave2) in VirtualBox. When I run the LR example on a single VM, the program runs successfully. But when I run the LR example on all three VMs, the program gets stuck at the step "multiverso MPI-Net is initialized under MPI_THREAD_SERIALIZED mode", so I never see further output such as "All nodes registered..."
Besides, I found that LR processes have been started on all three VMs.
Finally, the VM where I started the LR program hangs and never finishes.
Could anyone help me with this problem? Thanks.
Hi all,
Please change the table to support int64_t.
I ran "cmake .." in Multiverso/build/, and the errors below appeared.
I have installed openmpi-1.8-1.8.1-5.el6.x86_64 using yum.
How can I solve this MPI_C problem?
CMake Error at /usr/local/share/cmake-2.8/Modules/FindPackageHandleStandardArgs.cmake:108 (message):
Could NOT find MPI_C (missing: MPI_C_LIBRARIES MPI_C_INCLUDE_PATH)
Call Stack (most recent call first):
/usr/local/share/cmake-2.8/Modules/FindPackageHandleStandardArgs.cmake:315 (_FPHSA_FAILURE_MESSAGE)
/usr/local/share/cmake-2.8/Modules/FindMPI.cmake:587 (find_package_handle_standard_args)
CMakeLists.txt:11 (find_package)
-- Configuring incomplete, errors occurred!
I'm using an array table with size == 1.
Because Multiverso checks that size_ > MV_NumServers(),
I get the errors below.
[INFO] [2016-05-29 14:20:24] multiverso MPI-Net is initialized under MPI_THREAD_SERIALIZED mode.
[INFO] [2016-05-29 14:20:24] All nodes registered. System contains 1 nodes. num_worker = 1, num_server = 1
[INFO] [2016-05-29 14:20:24] Create a async server
[INFO] [2016-05-29 14:20:24] Rank 0: Zoo start sucessfully
[FATAL] [2016-05-29 14:20:25] Check failed: size_ > MV_NumServers() at /home/xiaoyang/repos/multiverso/src/table/array_table.cpp, line 14 .
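The log shows a one-element table on a single server failing the strict size_ > MV_NumServers() check. As a sketch only, and assuming the check exists to keep every server's shard non-empty (the source does not say so), the partitioning below shows that size 1 with one server still yields a valid non-empty range. `PartitionArray` is a hypothetical helper, not Multiverso's actual partitioning code.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical partitioning helper, not Multiverso's actual code. Splits
// `size` elements into `num_servers` contiguous [begin, end) ranges and
// gives the first `size % num_servers` servers one extra element.
inline std::vector<std::pair<std::size_t, std::size_t>> PartitionArray(
    std::size_t size, std::size_t num_servers) {
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  if (num_servers == 0) return ranges;
  const std::size_t base = size / num_servers;
  const std::size_t extra = size % num_servers;
  std::size_t begin = 0;
  for (std::size_t s = 0; s < num_servers; ++s) {
    const std::size_t len = base + (s < extra ? 1 : 0);
    ranges.emplace_back(begin, begin + len);  // empty range when len == 0
    begin += len;
  }
  return ranges;
}
```

Under this scheme a shard is empty only when size < num_servers, so a non-strict comparison would be enough to rule out empty shards. Whether that matches the intent of the original check is for the maintainers to confirm.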
Hello,
Multiverso does not compile on my Ubuntu machine. I got the following error:
/usr/bin/ld: CMakeFiles/LogisticRegression.dir/src/reader.cpp.o: undefined reference to symbol 'pthread_create@@GLIBC_2.2.5'
//lib/x86_64-linux-gnu/libpthread.so.0: error adding symbols: DSO missing from command line
I could fix the issue by adding the missing pthread in https://github.com/Microsoft/Multiverso/blob/master/Applications/LogisticRegression/CMakeLists.txt as follows:
target_link_libraries(LogisticRegression multiverso ${MPI_CXX_LIBRARIES} pthread)
I am working on a linux machine with the following characteristics:
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial
my gcc is gcc (Ubuntu 5.4.0-6ubuntu1~16.04.2) 5.4.0 20160609
I can't seem to train a model using Multiverso on the Criteo dataset (L2-regularized Logistic regression).
I can train a model successfully on the same data using Vowpal Wabbit and some simple Python code. I have tried multiple configurations of Multiverso but without much luck.
My config file is:
input_size=14
output_size=1
objective_type=logistic
regular_type=L2
updater_type=sgd
train_epoch=20
sparse=false
use_ps=false
minibatch_size=20
#train_file=/mnt/efs/criteo_derivatives/day_1.csv_parsed
test_file=/mnt/efs/test_file
learning_rate_coef=0.0001
regular_coef=0.0007
I implemented a new table and created a bunch of client threads to send the GET and ADD operations. I assumed the table could handle multiple GET and ADD ops concurrently. No matter how many threads I create, I can only observe "200% CPU usage" in the Linux top command. I guess "100%" is on the client side, and the other "100%" is on the server side.
I also noticed that the "opm_threads" keyword exists in the API document. But I failed to set this variable and got a fatal error in the FlagRegister::SetFlagIfFound method.
Does Multiverso support concurrent message handling in a server table? If it does, how do I enable concurrent server-side processing?
An extra null pointer check is not needed in functions like the following.