mlsl's Introduction

DISCONTINUATION OF PROJECT

This project will no longer be maintained by Intel.

Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.

Intel no longer accepts patches to this project.

If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.

Contact: [email protected]

Intel(R) Machine Learning Scaling Library for Linux* OS

Intel® MLSL is no longer supported and no new releases are available. Please switch to the new API introduced in the Intel® oneAPI Collective Communications Library (oneCCL).

Introduction

Intel(R) Machine Learning Scaling Library (Intel(R) MLSL) is a library providing an efficient implementation of communication patterns used in deep learning.

- Built on top of MPI; allows for use of other communication libraries
- Optimized to drive scalability of communication patterns
- Works across various interconnects: Intel(R) Omni-Path Architecture,
  InfiniBand*, and Ethernet
- Common API to support Deep Learning frameworks (Caffe*, Theano*,
  Torch*, etc.)

The Intel(R) MLSL package comprises the Intel MLSL Software Development Kit (SDK) and the Intel(R) MPI Library Runtime components.

SOFTWARE SYSTEM REQUIREMENTS

This section describes the required software.

Operating Systems:

- Red Hat* Enterprise Linux* 6 or 7
- SuSE* Linux* Enterprise Server 12
- Ubuntu* 16

Compilers:

- GNU*: C, C++ 4.4.0 or higher
- Intel(R) C++ Compiler for Linux* OS 16.0 through 17.0 or higher

Virtual Environments:

- Docker*
- KVM*

Installing Intel(R) Machine Learning Scaling Library

Installing Intel(R) MLSL by building from source:

    $ make all
    $ [MLSL_INSTALL_PATH=/path] make install

By default, MLSL_INSTALL_PATH=$PWD/_install.
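
For example, a build that installs into a custom prefix might look like this (the path is only illustrative):

    $ MLSL_INSTALL_PATH=/opt/intel/mlsl make install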

Binary releases are available on our release page.

Installing Intel(R) MLSL using RPM Package Manager (root mode):

1. Log in as root

2. Install the package:

    $ rpm -i intel-mlsl-devel-64-<version>.<update>-<package#>.x86_64.rpm

    where <version>.<update>-<package#> is a string, such as: 2017.0-009
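
    Using the example string above, the full command would be (illustrative only; substitute the exact file name of the package you downloaded):

    $ rpm -i intel-mlsl-devel-64-2017.0-009.x86_64.rpm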

To uninstall Intel(R) MLSL using the RPM Package Manager:

    $ rpm -e intel-mlsl-devel-64-<version>.<update>-<package#>.x86_64

Installing Intel(R) MLSL using the tar file (user mode):

    $ tar zxf l_mlsl-devel-64-<version>.<update>.<package#>.tgz
    $ cd l_mlsl_<version>.<update>.<package#>
    $ ./install.sh

There is no uninstall script. To uninstall Intel(R) MLSL, delete the
full directory you have installed the package into.
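
For the tar-file installation this amounts to removing the installation directory, e.g. (with <install_dir> standing for the directory you installed into):

    $ rm -rf <install_dir>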

Launching Sample Application

The sample application requires Python with the numpy package installed. You can use the [Intel Distribution for Python](https://software.intel.com/en-us/distribution-for-python), Anaconda, or the Python and numpy that come with your OS. Before you start using Intel(R) MLSL, make sure to set up the library environment.

Use the commands:

    $ source <install_dir>/intel64/bin/mlslvars.sh
    $ cd <install_dir>/test
    $ make run

If the test fails, look in the log files in the same directory. Here <install_dir> is the Intel MLSL installation directory.
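
A quick sanity check that Python and numpy are actually available in the current environment (not part of the official instructions, just a convenience):

    $ python -c "import numpy; print(numpy.__version__)"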

Migration to oneCCL

Intel® MLSL is no longer supported and no new releases are available. Please switch to the new API introduced in the Intel® oneAPI Collective Communications Library (oneCCL). There are examples that can help you get started with oneCCL; simply run the following:

    $ cd ./mlsl_to_ccl
    $ . ${MLSL_ROOT}/intel64/bin/mlslvars.sh
    $ . ${CCL_ROOT}/env/vars.sh
    $ make run -f Makefile

If you used MLSL before, here is an example that demonstrates the key differences between the libraries' APIs.

#include <iostream>
#include <stdio.h>
- #include "mlsl.hpp"
+ #include "ccl.hpp"

- using namespace MLSL;
+ using namespace ccl;

#define COUNT 128
 
int main(int argc, char** argv)
{
    int i, size, rank;
 
    auto sendbuf = new float[COUNT];
    auto recvbuf = new float[COUNT];
 
-    Environment::GetEnv().Init(&argc, &argv);
-    rank = Environment::GetEnv().GetProcessIdx();
-    size = Environment::GetEnv().GetProcessCount();     
-    auto dist = Environment::GetEnv().CreateDistribution(size, 1);
+    auto stream = environment::instance().create_stream();
+    auto comm = environment::instance().create_communicator();
+    rank = comm->rank();
+    size = comm->size();
 
    /* initialize sendbuf */
    for (i = 0; i < COUNT; i++)
        sendbuf[i] = rank;
 
    /* invoke allreduce */
-    auto req = dist->AllReduce(sendbuf, recvbuf, COUNT,                      
-                               DT_FLOAT, RT_SUM, GT_GLOBAL);
-    Environment::GetEnv().Wait(req);
+    comm->allreduce(sendbuf, recvbuf, COUNT,
+                    reduction::sum,
+                    nullptr /* coll_attr */,
+                    stream)->wait(); 
    /* check correctness of recvbuf */
    float expected = (size - 1) * ((float)size / 2);
    for (i = 0; i < COUNT; i++)
    {
        if (recvbuf[i] != expected)
        {
            std::cout << "idx " << i
                      << ": got " << recvbuf[i]
                      << " but expected " << expected
                      << std::endl;
            break;
        }
    }
 
    if (i == COUNT && rank == 0)
        std::cout << "PASSED" << std::endl;
 
-    Environment::GetEnv().DeleteDistribution(dist);
-    Environment::GetEnv().Finalize();
 
    delete[] sendbuf;
    delete[] recvbuf;
 
    return 0;
}
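
For reference, here is a rough sketch of how the oneCCL version of this example might be built and run. The source file name is hypothetical, and the include/library paths and the -lccl flag are assumptions about a typical oneCCL installation; the Makefile in the mlsl_to_ccl directory is the authoritative build recipe:

    $ . ${CCL_ROOT}/env/vars.sh
    $ g++ -std=c++11 -I${CCL_ROOT}/include allreduce_example.cpp -L${CCL_ROOT}/lib -lccl -o allreduce_example
    $ mpirun -n 2 ./allreduce_example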

License

Intel MLSL is licensed under the Apache License, Version 2.0.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

*Other names and brands may be claimed as the property of others.

mlsl's People

Contributors

jimmycasey, ksenyako, mshiryaev, nikitaxgusev, rdower, rscohn2, shirosankaku, ykiryano

mlsl's Issues

mlsl_test with MLSL_NUM_SERVERS>0 error

Hi,

I'm unable to run the mlsl_test app with MLSL_NUM_SERVERS=1. With MLSL_NUM_SERVERS=0 everything works fine. I'm using PBSPro to submit the job to two KNL nodes. This is the error I get:

/opt/pbs/default/bin/pbs_tmrsh: host "r1i4n33" is not a node in job <27928.lic01>
=>> PBS: job killed: walltime 79 exceeded limit 60
[mpiexec@r1i4n32] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@r1i4n32] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@r1i4n32] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@r1i4n32] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@r1i4n32] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@r1i4n32] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion

I'm using Intel MPI from mlsl, 'which mpirun' returns:
<mlsl root path>/intel64/bin/mpirun
and 'mpirun -version':
Intel(R) MPI Library for Linux* OS, Version 2017 Update 3 Build 20170405 (id: 17193)
Copyright (C) 2003-2017, Intel Corporation. All rights reserved.

There may be an issue with MPI and PBS paths, but then it would probably not work for MLSL_NUM_SERVERS=0 either. Similar reports:
https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/743142
https://software.intel.com/pt-br/forums/intel-clusters-and-hpc-technology/topic/713369
env | grep I_MPI:
I_MPI_HYDRA_DEBUG=1
I_MPI_DEBUG_OUTPUT=debug_output-27928.lic01.txt
I_MPI_HYDRA_BOOTSTRAP_EXEC=/opt/pbs/default/bin/pbs_tmrsh
I_MPI_DEBUG=5
I_MPI_HYDRA_BOOTSTRAP=rsh
I_MPI_ROOT=<mlsl root path>

I am not sure how I should query PBS when using mlsl parameter servers, but I tried many configurations, all resulting in the above error:
#PBS -l select=2:ncpus=64:mpiprocs=2:ompthreads=63
mpirun -n 4 -ppn 2 ./mlsl_test 1

#PBS -l select=2:ncpus=64:mpiprocs=1:ompthreads=63
mpirun -n 2 -ppn 1 ./mlsl_test 1

also with
export MLSL_SERVER_AFFINITY=63
export MLSL_SERVER_CREATION_TYPE=1 (also 0)

mpirun -n 2 -ppn 1 hostname
outputs correctly:
r1i4n32
r1i4n33

The MLSL version used is l_mlsl_2017.1.016.

Do you have any ideas what may be wrong here?

I_MPI_FABRICS: only "tcp" works

I was trying to run

#!/bin/bash
#SBATCH -N 2
#SBATCH --partition=partition1
#SBATCH --ntasks-per-node=4
#SBATCH -vvvv

mpirun -l \
    -bootstrap slurm \
    -genv I_MPI_DEBUG=2 \
    -genv I_MPI_FABRICS=ofi \
    -genv I_MPI_FALLBACK=0 \
$CAFFE_ROOT/build/tools/caffe train --solver=examples/cifar10/solver.prototxt

to use ofi as the network fabric, but when I submit the job it raises this error:

0] [0] MPI startup(): Multi-threaded optimized library
[1] node1.329747hfi_userinit: mmap of rcvhdrq at dabbad00040b0000 failed: Resource temporarily unavailable
[1] node1.329747hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
[1] node1.329747hfi_userinit: assign_context command failed: Invalid argument
[1] node1.329747hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
[1] node1.329747hfi_userinit: assign_context command failed: Invalid argument
[1] node1.329747hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
[1] node1.329747hfi_userinit: assign_context command failed: Invalid argument
[1] node1.329747PSM2 can't open hfi unit: -1 (err=23)
[1] [1] MPI startup(): ofi fabric is not available and fallback fabric is not enabled
[7] node2.353849hfi_userinit: mmap of rcvhdrq at dabbad00040b0000 failed: Resource temporarily unavailable
[7] node2.353849hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
[7] node2.353849hfi_userinit: assign_context command failed: Invalid argument
[7] node2.353849hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
[7] node2.353849hfi_userinit: assign_context command failed: Invalid argument
[7] node2.353849hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
[7] node2.353849hfi_userinit: assign_context command failed: Invalid argument
[1] [1] ofi netmod(): ofi_openep error: ret -22, Invalid argument
...
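
As the issue title says, only the tcp fabric works here. A minimal sketch of the workaround, i.e. the same job script with the fabric switched to tcp (everything else unchanged):

    mpirun -l \
        -bootstrap slurm \
        -genv I_MPI_DEBUG=2 \
        -genv I_MPI_FABRICS=tcp \
    $CAFFE_ROOT/build/tools/caffe train --solver=examples/cifar10/solver.prototxt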

tf_cnn_benchmarks

Has anybody tried MLSL with tf_cnn_benchmarks (TensorFlow benchmarks)? I am thinking of doing it and just want to know if anyone is already working on it.

Cannot run mlsl_test

Hi guys,

I installed MLSL in order to train models across multiple nodes.
I started two Docker containers on my local machine to do some testing.
However, when I run mlsl_test as specified in the Developer Guide, I get errors like this:

[root@c87b5ae7eb7a test]# mpirun -n 2 -ppn 1 ./mlsl_test 1

built with MLSL API version: 1.0, used MLSL API version: 1.0
built with MLSL API version: 1.0, used MLSL API version: 1.0

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 8316 RUNNING AT c87b5ae7eb7a
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 8316 RUNNING AT c87b5ae7eb7a
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764

If I run [root@c87b5ae7eb7a test]# mpirun -n 2 -ppn 1 -hostfile ~/mpd.hosts ./mlsl_test 1
I get exactly the same error.

How can I find out what is wrong? My two containers can connect to each other through ssh.
It has almost taken me a whole day... Thanks a lot if you can give me some advice!

make mlsl_test.cpp with gcc failed

Building mlsl_test.cpp with gcc fails.
CentOS 7.2, gcc 4.8.5.
Also, I didn't find the rt library.

gcc -O3 -fpermissive -I/opt/intel/mlsl_2017.0.005/intel64/include -o mlsl_test mlsl_test.cpp -L/opt/intel/mlsl_2017.0.005/intel64/lib -lmlsl -lrt
/usr/bin/ld: /tmp/ccewTI6d.o: undefined reference to symbol '_ZNSs4_Rep10_M_destroyERKSaIcE@@GLIBCXX_3.4'
/usr/bin/ld: note: '_ZNSs4_Rep10_M_destroyERKSaIcE@@GLIBCXX_3.4' is defined in DSO /lib64/libstdc++.so.6 so try adding it to the linker command line
/lib64/libstdc++.so.6: could not read symbols: Invalid operation
collect2: error: ld returned 1 exit status
make: *** [mlsl_test] Error 1
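
Since mlsl_test.cpp is C++, a likely fix (an assumption based on the unresolved libstdc++ symbol) is to drive the link with g++ instead of gcc, or equivalently to append -lstdc++ to the gcc line:

    $ g++ -O3 -fpermissive -I/opt/intel/mlsl_2017.0.005/intel64/include -o mlsl_test mlsl_test.cpp \
          -L/opt/intel/mlsl_2017.0.005/intel64/lib -lmlsl -lrt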

ep_server build fails

The build fails on my Ubuntu system. Maybe a difference between MPIs?

gcc -O2 -I/home/rscohn1/projects/MLSL/mpirt/include -I/home/rscohn1/local/projects/MLSL/eplib/../quant -Wformat -Wformat-security -D_FORTIFY_SOURCE=2 -fstack-protector -Wall -Werror -Wall -std=gnu99 -I. -DENABLE_FASTPATH -D_GNU_SOURCE -DENABLE_ASYNC_PROGRESS -lm -o ep_server server.c cqueue.o memory.o env.o sig_handler.o quant.o allreduce_pr.o -z noexecstack -z relro -z now -L/home/rscohn1/projects/MLSL/mpirt/lib -lmpi -ldl -lrt -lpthread
allreduce_pr.o: In function `allreduce_pr_make_progress':
allreduce_pr.c:(.text+0x2da): undefined reference to `log2'
allreduce_pr.c:(.text+0x385): undefined reference to `pow'
allreduce_pr.c:(.text+0x3a7): undefined reference to `pow'
allreduce_pr.c:(.text+0x5bf): undefined reference to `pow'
allreduce_pr.c:(.text+0x5e5): undefined reference to `pow'
collect2: error: ld returned 1 exit status
Makefile:80: recipe for target 'ep_server' failed
make[1]: *** [ep_server] Error 1
make[1]: Leaving directory '/home/rscohn1/local/projects/MLSL/eplib'
Makefile:226: recipe for target 'eplib/ep_server' failed
make: *** [eplib/ep_server] Error 2
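
One detail that stands out (an observation, not a confirmed fix): -lm appears before the object files in the link line, and with some linkers the math library is then discarded before allreduce_pr.o asks for log2/pow. A sketch of the reordered link step, with the unrelated flags elided:

    gcc -O2 ... -o ep_server server.c cqueue.o memory.o env.o sig_handler.o quant.o allreduce_pr.o \
        -L/home/rscohn1/projects/MLSL/mpirt/lib -lmpi -ldl -lrt -lpthread -lm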

Run mlsl_test on multi-node

Hi, I tested mlsl_test with the command "mpirun -n 4 -ppn 1 /opt/intel/mlsl_2017.1.016/test/mlsl_test 1" and it ran successfully.
But when I try "mpirun -n 4 -ppn 1 -machinefile ~/mpd.hosts /opt/intel/mlsl_2017.1.016/test/mlsl_test 2 1",
it just prints "built with MLSL API version: 1.0, used MLSL API version: 1.0" and then hangs in that state.
Does this sample code support running on multiple nodes?
Thank you.

Memory corruption error: running MLSL on multiple nodes

@itemko @mshiryaev @argretzi @szollinx @rpulid2 @

tarun@pas-lab-server7:/nfs$ mpirun -f /nfs/host_file -n 2 /nfs/mlsl_test 1
built with MLSL API version: 1.0, used MLSL API version: 1.0
built with MLSL API version: 1.0, used MLSL API version: 1.0
*** Error in `/nfs/common/install/mpich/mpich_3.0.4-install/bin/hydra_pmi_proxy': malloc(): memory corruption (fast): 0x0000000001c7bfa0 ***
[mpiexec@pas-lab-server7] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@pas-lab-server7] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@pas-lab-server7] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@pas-lab-server7] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

The host file contains the IP addresses of the two servers, and ssh is already configured between them.

Multi-node Training

I think Intel Caffe is using MLSL for distributed processing.

Is there any paper or report about how Intel Caffe conducts multi-node processing?

E.g. the way of communication, data partition, communication frequency, strong scaling efficiency.

Is Intel Caffe using the same idea as the Parameter Server (J. Dean, NIPS 2012)?

Multi-node Training Deadlock

With MLSL 2017 Preview, Intel Caffe on the MNIST dataset deadlocks for runs on more than 2 nodes. Two-node executions are occasionally successful and deadlock the other times. The Caffe output where it gets stuck is:
/ ---------------------------------------------
I0501 12343 net.cpp:424] Network initialization done.
I0501 12343 solver.cpp:119] Solver scaffolding done.
I0501 12343 caffe.cpp:318] Configuring multinode setup
I0501 12343 caffe.cpp:320] Starting Multi-node Optimization in MLSL environment
W0501 12343 multi_sync.hpp:124] RUN: PER LAYER TIMINGS ARE ENABLED, SINGLE DB SPLITTING IS DISABLED
W0501 12343 multi_sync.hpp:116] synchronize_params: bcast

/ ---------------------------------------------

I0501 26610 layer_factory.hpp:114] Creating layer mnist
I0501 26610 net.cpp:183] Creating Layer mnist
I0501 26610 net.cpp:838] mnist -> data
I0501 26610 net.cpp:838] mnist -> label
W0501 26610 net.cpp:247] SetMinibatchSize 64
I0501 11726 db_lmdb.cpp:72] Opened lmdb /caffe/data/mnist/mnist_train_lmdb
I0501 22362 db_lmdb.cpp:72] Opened lmdb /caffe/data/mnist/mnist_train_lmdb
I0501 26702 db_lmdb.cpp:72] Opened lmdb /caffe/data/mnist/mnist_train_lmdb
/ ---------------------------------------------

Run Configurations:
MLSL: 2017 Preview
IntelCaffe Commit: f96b759f71b2281835f690af267158b82b150b5c
Intel Compiler: 16.0.1 (gcc: 4.9.2)
Intel MPI Version: Intel(R) MPI Library 5.1.2
Platform: x86 (Xeon)
command: mpirun -n X -ppn Y -machinefile hostfile $CAFFEBINDIR/caffe train --solver=lenet_solver.prototxt

Is there a way to control the number of threads MLSL uses?
Also, may I ask where I can get the following document? https://github.com/01org/MLSL/blob/master/doc/Developer_Guide_and_Reference.pdf

Thanks!
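
Not an official answer, but the knobs that appear elsewhere on this page are the ones typically tried: OMP_NUM_THREADS for the compute threads plus the MLSL server variables (MLSL_NUM_SERVERS, MLSL_SERVER_AFFINITY) for the communication side. A sketch, with the values purely illustrative:

    $ export OMP_NUM_THREADS=16
    $ export MLSL_NUM_SERVERS=2
    $ mpirun -n X -ppn Y -machinefile hostfile $CAFFEBINDIR/caffe train --solver=lenet_solver.prototxt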

Segment Fault when Calling MLSL::Environment::GetProcessIdx()

Hi!
I'm using Intel Caffe-1.0.6 with l_mlsl_2017.1.016.

When I try to run a simple Python program that uses pycaffe to train LeNet on MNIST, it triggers a segmentation fault in MLSL.

The Python program is as follows:

import caffe
import numpy as np
import matplotlib.pyplot as plt
from caffe import layers as L, params as P
from pylab import *

def lenet(lmdb, batch_size):
    n = caffe.NetSpec()
    n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LMDB, source=lmdb,
                             transform_param=dict(scale=1./255), ntop=2)
    n.conv1 = L.Convolution(n.data, kernel_size=5, num_output=20, weight_filler=dict(type='xavier'))
    n.pool1 = L.Pooling(n.conv1, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    n.conv2 = L.Convolution(n.pool1, kernel_size=5, num_output=50, weight_filler=dict(type='xavier'))
    n.pool2 = L.Pooling(n.conv2, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    n.fc1 =   L.InnerProduct(n.pool2, num_output=500, weight_filler=dict(type='xavier'))
    n.relu1 = L.ReLU(n.fc1, in_place=True)
    n.score = L.InnerProduct(n.relu1, num_output=10, weight_filler=dict(type='xavier'))
    n.loss =  L.SoftmaxWithLoss(n.score, n.label)
    return n.to_proto()

def main():
    caffe_root = '/root/zhengsheng/caffe-1.0.6/'
    caffe.set_mode_cpu()
    with open('mnist/lenet_auto_train.prototxt', 'w') as f:
        f.write(str(lenet('mnist/mnist_train_lmdb', 64)))
    with open('mnist/lenet_auto_test.prototxt', 'w') as f:
        f.write(str(lenet('mnist/mnist_test_lmdb', 100)))
    print 'loading solver'
    solver = None
    solver = caffe.SGDSolver('mnist/lenet_auto_solver.prototxt')
    niter = 200
    test_interval = 25
    # losses will also be stored in the log
    train_loss = zeros(niter)
    test_acc = zeros(int(np.ceil(niter / test_interval)))
    output = zeros((niter, 8, 10))
    print 'starting loop'
    for it in range(niter):
        solver.step(1)  # SGD by Caffe
        train_loss[it] = solver.net.blobs['loss'].data
        solver.test_nets[0].forward(start='conv1')
        output[it] = solver.test_nets[0].blobs['score'].data[:8]
        if it % test_interval == 0:
            print 'Iteration', it, 'testing...'
            correct = 0
            for test_it in range(100):
                solver.test_nets[0].forward()
                correct += sum(solver.test_nets[0].blobs['score'].data.argmax(1)
                               == solver.test_nets[0].blobs['label'].data)
            print 'Acc:', correct
            test_acc[it // test_interval] = correct / 1e4

if __name__ == '__main__':
    exit(main())

The solver mentioned in the code, 'mnist/lenet_auto_solver.prototxt', comes from lenet_auto_solver.prototxt; I just appended the line "solver_mode: CPU" at the end of the solver file.

The traceback is as follows:

#0  0x00007fc10af17bca in MLSL::Environment::GetProcessIdx() ()
   from /root/zhengsheng/caffe-1.0.6/external/mlsl/l_mlsl_2017.1.016/intel64/lib/libmlsl.so.1
#1  0x00007fc10b798c23 in caffe::ReplaceMultinodeSolverParams(caffe::SolverParameter*) ()
   from /root/zhengsheng/caffe-1.0.6/distribute/lib/libcaffe.so.1.0.0-rc3
#2  0x00007fc10b6f6a89 in caffe::Solver<float>::Init(caffe::SolverParameter const&) ()
   from /root/zhengsheng/caffe-1.0.6/distribute/lib/libcaffe.so.1.0.0-rc3
#3  0x00007fc10b6f73c3 in caffe::Solver<float>::Solver(std::string const&, caffe::Solver<float> const*) ()
   from /root/zhengsheng/caffe-1.0.6/distribute/lib/libcaffe.so.1.0.0-rc3
#4  0x00007fc10bba8fcd in boost::python::objects::make_holder<1>::apply<boost::python::objects::pointer_holder<boost::shared_ptr<caffe::SGDSolver<float> >, caffe::SGDSolver<float> >, boost::mpl::vector1<std::string> >::execute(_object*, std::string) ()
   from /root/zhengsheng/caffe-1.0.6/distribute/python/caffe/_caffe.so
#5  0x00007fc10bb7f3a3 in boost::python::objects::caller_py_function_impl<boost::python::detail::caller<void (*)(_object*, std::string), boost::python::default_call_policies, boost::mpl::vector3<void, _object*, std::string> > >::operator()(_object*, _object*) ()
   from /root/zhengsheng/caffe-1.0.6/distribute/python/caffe/_caffe.so
#6  0x00007fc1084b629c in boost::python::objects::function::call(_object*, _object*) const () from /lib64/libboost_python.so.1.53.0
#7  0x00007fc1084b65f8 in boost::detail::function::void_function_ref_invoker0<boost::python::objects::(anonymous namespace)::bind_return, void>::invoke(boost::detail::function::function_buffer&) () from /lib64/libboost_python.so.1.53.0
#8  0x00007fc1084beec3 in boost::python::handle_exception_impl(boost::function0<void>) () from /lib64/libboost_python.so.1.53.0
#9  0x00007fc1084b4df9 in function_call () from /lib64/libboost_python.so.1.53.0
#10 0x00007fc14e2d99a3 in PyObject_Call (func=func@entry=<Boost.Python.function at remote 0x19ae950>, arg=arg@entry=
    (<SGDSolver at remote 0x42c8838>, 'mnist/lenet_auto_solver.prototxt'), kw=kw@entry=0x0) at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2529
#11 0x00007fc14e2e8995 in instancemethod_call (func=<Boost.Python.function at remote 0x19ae950>, 
    arg=(<SGDSolver at remote 0x42c8838>, 'mnist/lenet_auto_solver.prototxt'), kw=0x0) at /usr/src/debug/Python-2.7.5/Objects/classobject.c:2602
#12 0x00007fc14e2d99a3 in PyObject_Call (func=func@entry=<instancemethod at remote 0x4012050>, arg=arg@entry=('mnist/lenet_auto_solver.prototxt',), 
    kw=kw@entry=0x0) at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2529
#13 0x00007fc14e330947 in slot_tp_init (self=<optimized out>, args=('mnist/lenet_auto_solver.prototxt',), kwds=0x0)
    at /usr/src/debug/Python-2.7.5/Objects/typeobject.c:5692
#14 0x00007fc14e32f65f in type_call (type=<optimized out>, args=('mnist/lenet_auto_solver.prototxt',), kwds=0x0)
    at /usr/src/debug/Python-2.7.5/Objects/typeobject.c:745
#15 0x00007fc14e2d99a3 in PyObject_Call (func=func@entry=<Boost.Python.class at remote 0x19b01a0>, 
    arg=arg@entry=('mnist/lenet_auto_solver.prototxt',), kw=kw@entry=0x0) at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2529
#16 0x00007fc14e36e0f6 in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7fff132eae00, func=<optimized out>)
    at /usr/src/debug/Python-2.7.5/Python/ceval.c:4626
#17 call_function (oparg=<optimized out>, pp_stack=0x7fff132eae00) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4431
#18 PyEval_EvalFrameEx (f=f@entry=
.... # lots of python interpreter call stacks

It looks like the Python statement
solver = caffe.SGDSolver('mnist/lenet_auto_solver.prototxt')
triggers the segmentation fault.

However, if I use caffe.bin to train the model using the same solver file, it runs successfully. The command is as follows:

cd caffe-1.0.6/examples
../distribute/bin/caffe.bin train -solver=mnist/lenet_auto_solver.prototxt

As regards the physical machine's CPU, lscpu shows:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Stepping:              2
CPU MHz:               2730.843
CPU max MHz:           3200.0000
CPU min MHz:           1200.0000
BogoMIPS:              4799.86
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc

Is this problem solved in the latest release? Maybe I can just try l_mlsl_2017.1.016?

Thank you very much!

Where is install.sh

Hi,

I was trying to follow the installation instructions for non-root installation. Where do I find the install.sh script?

Thanks!
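
Per the tar-file installation section above, install.sh is inside the l_mlsl-devel-64-<version>.<update>.<package#>.tgz archive and becomes visible after extraction:

    $ tar zxf l_mlsl-devel-64-<version>.<update>.<package#>.tgz
    $ cd l_mlsl_<version>.<update>.<package#>
    $ ls install.sh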

Running Intel Caffe multi-node with MLSL on AMD CPUs stops at Iteration 0

When I run Intel Caffe on multiple nodes (four nodes) with MLSL on AMD CPUs, something goes wrong: the training stops at Iteration 0. When run on a single node, it is OK.
[image]
When I run htop on every node:
[image]
My run command is:
./scripts/run_intelcaffe.sh --hostfile /opt/caffe/mpd.hosts --network tcp --netmask enp3s0f0 --caffe_bin /opt/caffe/build/tools/caffe --solver /opt/caffe/models/intel_optimized_models/multinode/alexnet_4nodes/solver.prototxt

I think something is wrong with MLSL. My MLSL version is:
[image]
When I run with my own OpenMPI, it is OK.

assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP & ~POLLERR)) failed: running MLSL on multiple nodes

@argretzi @mshiryaev @itemko @szollinx @rpulid2

tarun@pas-lab-server7:/nfs$ mpirun -f /nfs/host_file -n 2 /nfs/mlsl_test 1
built with MLSL API version: 1.0, used MLSL API version: 1.0
built with MLSL API version: 1.0, used MLSL API version: 1.0
[proxy:1:1@pas-lab-server7] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:71): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP & ~POLLERR)) failed
[proxy:1:1@pas-lab-server7] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
*** Error in `/nfs/common/install/mpich/mpich_3.0.4-install/bin/hydra_pmi_proxy': malloc(): memory corruption (fast): 0x0000000000b73f80 ***
[mpiexec@pas-lab-server7] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@pas-lab-server7] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@pas-lab-server7] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@pas-lab-server7] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

Error while using MLSL with Intel Deep Learning SDK

I see the following error when trying to create a model using the Intel DL Training tool:

I0925 18:53:53.454358 222 caffe.cpp:329] Starting Optimization
I0925 18:53:53.454391 222 solver.cpp:495] Solving OneMoreTry
I0925 18:53:53.454398 222 solver.cpp:496] Learning Rate Policy: inv
(222): /localdisk/jenkins/mlsl-build/src/comms_ep.cpp:CommsAlloc:535: ASSERT 'ptr' FAILED: NULL pointer
Attempting to use an MPI routine after finalizing MPI

There are already some posts about this that suggest setting MLSL_HEAP_SIZE_GB to 64.

I have the following questions related to this:

  1. Where must this env variable be set? Since I am using it via the Intel DL Training tool's browser user interface, I don't have any way to set it.
  2. Is that ASSERT due to an out-of-memory condition? What is the memory requirement for running MLSL?
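
On question 1: for an MPI-launched job the variable is usually passed on the mpirun command line. A sketch of that approach (the training command is a placeholder, and this assumes you can reach the underlying launch command at all, which the DL Training tool UI may not expose):

    $ mpirun -genv MLSL_HEAP_SIZE_GB=64 -n 2 -ppn 1 <your_training_command>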

How to run an MLSL program?

mpirun -n 2 -ppn 1 ./mlsl_test 1

./mlsl_test: error while loading shared libraries: libmlsl.so.1: cannot open shared object file: No such file or directory
./mlsl_test: error while loading shared libraries: libmlsl.so.1: cannot open shared object file: No such file or directory

=======================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 127
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
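
This error usually means the MLSL library directory is not on LD_LIBRARY_PATH. Sourcing mlslvars.sh as described in the README above (or exporting the library path manually) typically resolves it:

    $ source <install_dir>/intel64/bin/mlslvars.sh
    $ mpirun -n 2 -ppn 1 ./mlsl_test 1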
