
dmlc-core's Introduction

Distributed Machine Learning Common Codebase


DMLC-Core is the backbone library that supports all DMLC projects, offering the bricks to build efficient and scalable distributed machine learning libraries.

Developer Channel: Join the chat at https://gitter.im/dmlc/dmlc-core

Known Issues

  • The RecordIO format is not portable across processors with different endianness, so it is not possible to save a RecordIO file on an x86 machine and then load it on a SPARC machine: x86 is little-endian while SPARC is big-endian.
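
For illustration, a minimal runtime check (not part of dmlc-core) that shows which byte order the current machine uses:

    #include <cstdint>
    #include <cstring>

    // Illustrative helper only: returns true on little-endian machines
    // (e.g. x86) and false on big-endian ones (e.g. SPARC).
    inline bool IsLittleEndian() {
      const std::uint32_t probe = 1;
      std::uint8_t first_byte;
      std::memcpy(&first_byte, &probe, sizeof(first_byte));
      return first_byte == 1;
    }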

Contributing

Contributions to dmlc-core are welcome! dmlc-core follows Google's C++ style guide. If you are interested in contributing, take a look at the feature wishlist and open a new issue if you would like to add something.

  • DMLC-Core uses the C++11 standard. Ensure that your C++ compiler supports C++11.
  • Try to introduce the minimum number of dependencies possible.

Checklist before submitting code

  • Type make lint and fix all the style problems.
  • Type make doc and fix all the warnings.

NOTE

Dependencies: libcurl4-openssl-dev


dmlc-core's Issues

Running the example training on YARN fails

Guys:
I am trying to launch the example train_mnist.py on a YARN cluster with 50 nodes. The launch script is below:

${mxnethome}/tools/launch.py -n 2 \
      --launcher yarn \
     python ${mxnethome}/example/image-classification/train_mnist.py --network lenet --kv-store dist_async

It keeps failing, and the YARN container shows the following log:

File "./train_mnist.py", line 1, in <module>
    import find_mxnet
ImportError: No module named find_mxnet

Also, is there no way to pass a -file option through tools/launch.py?

Has anyone run the distributed training example on YARN successfully?

Something wrong in text_parser.h

  const int nthread = omp_get_max_threads();
  data->resize(nthread);
  ...
  #pragma omp parallel num_threads(nthread_)
  {
    // threadid
    int tid = omp_get_thread_num();
    size_t nstep = (chunk.size + nthread - 1) / nthread;
   ...
  }

Why does num_threads use nthread_, but nstep is calculated with nthread?
On my machine, nthread_ = 2 and nthread = 20, so I can only read 1/10 of the data; 18 data blocks are empty...
Why not use just one of them? @tqchen
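
A minimal sketch of the consistent version (assuming the intended thread count is nthread, with the same surrounding context as the snippet above):

  // Query the thread count once and use the same value everywhere.
  const int nthread = omp_get_max_threads();
  data->resize(nthread);
  #pragma omp parallel num_threads(nthread)
  {
    // threadid
    int tid = omp_get_thread_num();
    // each thread handles rows [tid * nstep, min((tid + 1) * nstep, chunk.size))
    size_t nstep = (chunk.size + nthread - 1) / nthread;
    // ...
  }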

tracker.py does not work with Python 3.5

  • sock.recv returns bytes instead of str
  • sendstr needs to encode and recvstr needs to decode
  • remove xrange and replace some uses of range with list(range(...))
  • use integer division // instead of /

Compilation fails on Visual Studio 2015

source code version: e9b5cdb
OS: Windows 10 x64
1>------ Build started: Project: dmlc, Configuration: Release x64 ------
1> data.cc
1>C:\Program Files (x86)\Windows Kits\10\Include\10.0.10150.0\ucrt\stdio.h(1419): warning C4005: 'vsnprintf': macro redefinition
1> C:\work\dmlc-core\include\dmlc/base.h(88): note: see previous definition of 'vsnprintf'
1>C:\Program Files (x86)\Windows Kits\10\Include\10.0.10150.0\ucrt\stdio.h(1421): fatal error C1189: #error: Macro definition of vsnprintf conflicts with Standard Library function declaration
1> io.cc
1>C:\Program Files (x86)\Windows Kits\10\Include\10.0.10150.0\ucrt\stdio.h(1419): warning C4005: 'vsnprintf': macro redefinition
1> C:\work\dmlc-core\include\dmlc/base.h(88): note: see previous definition of 'vsnprintf'
1>C:\Program Files (x86)\Windows Kits\10\Include\10.0.10150.0\ucrt\stdio.h(1421): fatal error C1189: #error: Macro definition of vsnprintf conflicts with Standard Library function declaration
1> input_split_base.cc
1>C:\work\dmlc-core\include\dmlc/logging.h(203): warning C4297: 'dmlc::LogMessageFatal::~LogMessageFatal': function assumed not to throw an exception but does
1> C:\work\dmlc-core\include\dmlc/logging.h(203): note: destructor or deallocator has a (possibly implicit) non-throwing exception specification
1> line_split.cc
1>c:\work\dmlc-core\include\dmlc./logging.h(203): warning C4297: 'dmlc::LogMessageFatal::~LogMessageFatal': function assumed not to throw an exception but does
1> c:\work\dmlc-core\include\dmlc./logging.h(203): note: destructor or deallocator has a (possibly implicit) non-throwing exception specification
1> local_filesys.cc
1>C:\work\dmlc-core\include\dmlc/logging.h(203): warning C4297: 'dmlc::LogMessageFatal::~LogMessageFatal': function assumed not to throw an exception but does
1> C:\work\dmlc-core\include\dmlc/logging.h(203): note: destructor or deallocator has a (possibly implicit) non-throwing exception specification
1> recordio_split.cc
1>c:\work\dmlc-core\include\dmlc./logging.h(203): warning C4297: 'dmlc::LogMessageFatal::~LogMessageFatal': function assumed not to throw an exception but does
1> c:\work\dmlc-core\include\dmlc./logging.h(203): note: destructor or deallocator has a (possibly implicit) non-throwing exception specification
1> recordio.cc
1>C:\Program Files (x86)\Windows Kits\10\Include\10.0.10150.0\ucrt\stdio.h(1419): warning C4005: 'vsnprintf': macro redefinition
1> C:\work\dmlc-core\include\dmlc/base.h(88): note: see previous definition of 'vsnprintf'
1>C:\Program Files (x86)\Windows Kits\10\Include\10.0.10150.0\ucrt\stdio.h(1421): fatal error C1189: #error: Macro definition of vsnprintf conflicts with Standard Library function declaration
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

dmlc-submit does not recognize ssh as a cluster type?


dmlc-core/tracker/dmlc-submit --cluster ssh --num-workers 2 xgboost demo/binary_classification/mushroom.conf data=demo/data/agaricus.txt.train eval=demo/data/agaricus.txt.test 
Traceback (most recent call last):
  File "dmlc-core/tracker/dmlc-submit", line 9, in <module>
    submit.main()
  File "/Users/nanzhu/code/xgboost/dmlc-core/tracker/dmlc_tracker/submit.py", line 50, in main
    raise RuntimeError('Unknown submission cluster type %s' % args.cluster)
RuntimeError: Unknown submission cluster type ssh

Tracker script waits indefinitely even after successful exit on YARN

Submitting with the dmlc-submit script spawns and completes the job on YARN successfully, but the dmlc-submit.py script then waits indefinitely. Killing it at that point produces the following stack trace:

File "./tracker/dmlc-submit", line 9, in
submit.main()
File "/idn/home/vdevabat/workingDir/tracker/dmlc_tracker/submit.py", line 46, in main
yarn.submit(args)
File "/idn/home/vdevabat/workingDir/tracker/dmlc_tracker/yarn.py", line 123, in submit
pscmd=(' '.join([YARN_BOOT_PY] + args.command)))
File "/idn/home/vdevabat/workingDir/tracker/dmlc_tracker/tracker.py", line 413, in submit
rabit.join()
File "/idn/home/vdevabat/workingDir/tracker/dmlc_tracker/tracker.py", line 329, in join
self.thread.join(100)
File "/usr/lib64/python2.6/threading.py", line 655, in join
self.__block.wait(delay)
File "/usr/lib64/python2.6/threading.py", line 258, in wait
_sleep(delay)

Env: the MapR distribution of Hadoop 2.5.1 is being used. The error is reproducible with the basic.cc file provided as part of the documentation. Any help appreciated.

fix libhdfs problem in CDH

CDH differs slightly from native Hadoop. We can either support it directly or write a document telling users what to change.

  • it may not ship hdfs.h
  • it may only provide a static libhdfs.a

Check the version; in my case:

~/hadoop version
Hadoop 2.3.0-cdh5.1.0

The solution:

  1. Get hdfs.h.

According to my version, I downloaded

http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.1.0-src.tar.gz

extracted it, and copied

hadoop-hdfs-project/hadoop-hdfs/src/main/native/libhdfs/hdfs.h

to the folder include/.

  2. Change make/dmlc.mk to use the .a version:
DMLC_LDFLAGS+= $(HADOOP_HDFS_HOME)/lib/native/libhdfs.a -L$(LIBJVM) -ljvm -Wl,-rpath=$(LIBJVM)

It should compile now. Then we need to set the environment properly (I guess this is also necessary for native Hadoop). In my case, I set in .bashrc:

export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF=/etc/hadoop/conf
export CLASSPATH=${HADOOP_CONF}:$(find ${HADOOP_HOME} -name *.jar | tr '\n' ':')

And there is a warning on 64-bit CentOS:

 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...

but it seems OK to ignore it according to this

Starting via run_yarn.sh gives the following error

runtime error log:
src/io/input_split_base.cc:117: FILE size not calculated correctly

Yarn container error:
15/05/12 15:44:28 INFO dmlc.ApplicationMaster: [DMLC] Task 2 exited with status 250 Diagnostics:Exception from container-launch.
Container id: container_1431414776876_0010_01_000004
Exit code: 250
Stack trace: ExitCodeException exitCode=250:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

Container exited with a non-zero exit code 250

JSON: Add function AddIndent()

I've had instances where I would like to first generate a JSON object in one function and then embed it into an array in another JSON object.

In such cases, I do something like:

    dmlc::JSONWriter writer(&os);
    writer.BeginArray();
    for (size_t i = 0; i < dump.size(); ++i) {
      if (i != 0) writer.WriteArraySeperator();
      os << dump[i];  // Dump the previously generated JSON here
    }
    writer.EndArray();

But this means the indenting is no longer right. It would be nice to be able to fix the indenting with:

    dmlc::JSONWriter writer(&os);
    writer.BeginArray();
    for (size_t i = 0; i < dump.size(); ++i) {
      if (i != 0) writer.WriteArraySeperator();
      os << dmlc::JSONWriter::AddIndent(dump[i], 1);  // Dump the previously generated JSON here
    }
    writer.EndArray();

or something similar
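
For concreteness, a minimal sketch of what such a helper could do (a hypothetical free function, not part of the dmlc-core API; it assumes two spaces per indent level):

    #include <string>

    // Hypothetical AddIndent: shift an already-serialized JSON fragment
    // right by `levels` two-space indent levels so it lines up when
    // embedded in an outer writer.
    inline std::string AddIndent(const std::string &json, int levels) {
      std::string pad(levels * 2, ' ');
      std::string out;
      out.reserve(json.size() + pad.size());
      for (char c : json) {
        out.push_back(c);
        if (c == '\n') out += pad;  // indent the line following each newline
      }
      return out;
    }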

Allow ignoring columns in libsvm_parser

For xgboost, it would be nice to be able to ignore certain columns via the config file and build a model on top of that. This would make it simple to build models with or without certain variables using a config, rather than modifying the whole data file, which takes time when the data is large.

Making it a header-only library?

Is there any specific reason why we didn't make it a header-only library? I think a header-only library would be much easier to integrate into other projects.

Report some problems with Client.java and ApplicationMaster.java when using HDFS federation

Recently my colleague changed our HDFS to federation mode, and then xgboost on YARN failed with the following error: Diagnostics: Incomplete HDFS URI, no host: hdfs:///tmp/temp-dmlc-yarn-application_1472437483579_135758/libstdc++.so.6.

I read Client.java and found that line 143, Path dst = new Path("hdfs://" + tmpPath + "/" + path.getName());, passes no host. So I replaced this line with String fsDefaultName = conf.get(FileSystem.FS_DEFAULT_NAME_KEY); Path dst = new Path(fsDefaultName + tmpPath + "/" + path.getName()); and ran again. Then I encountered java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-hadoop/tmp/temp-dmlc-yarn-application_1472437483579_135926/libstdc++.so.6, expected: viewfs://ss-hadoop/. I then read ApplicationMaster.java and found that the FileSystem is obtained from the default configuration. In this case, if the value of defaultFS differs between the client machine and the container the ApplicationMaster runs on, this error is inevitable. So I added FileSystem dfs = e.getValue().getFileSystem(conf); before line 195, which is FileStatus status = dfs.getFileStatus(e.getValue());. Now distributed xgboost runs OK again.

I think there must be a more elegant solution to all these problems. If you have one, don't hesitate to tell me.

HDFS access with IP and port

I use ImageRecordIter to access val.rec on HDFS, like val = mx.io.ImageRecordIter(path_imgrec = "hdfs://address:54310/data_dir/val.rec", ...).
However, the error log is: Exception in thread "main" java.io.IOException: Incomplete HDFS URI, no host: hdfs://address:54310:0
I traced the code, and it seems that whether the address already contains a port is ignored and a port 0 is appended. Is this a bug?
This is the related issue I found:
#81

better support for minibatch SGD

Several suggestions to better support minibatch training:

  1. Don't assume rowblock.offset[0] is always 0. Then we can implement RowBlock.Segment(10, 20), which returns a rowblock between row 10 and row 20 at zero cost (see the sketch after this list); see https://github.com/dmlc/wormhole/blob/master/learn/linear/base/minibatch_iter.h#L69

and similarly RowBlockContainer.GetBlock(10, 20).

But now we need to change the code a little bit for nonzero offset[0]; one example is d2af94b.

  2. Let the Parser support the IndexType template, so if we use unsigned as the index type (true for Criteo tera-scale data), we don't need to do a memcpy from size_t to unsigned.

It's not necessary to change the implementation, namely still use strtoull to parse the string even with an unsigned index type.

  3. (Not so related) Use a static const variable to store the magic number 10UL << 20UL. And sometimes I feel a larger block size is better.
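
A zero-copy Segment could look roughly like this (an illustrative sketch for suggestion 1, not the actual dmlc-core API):

    #include <dmlc/data.h>

    // Illustrative sketch: the returned block aliases the parent's arrays,
    // so seg.offset[0] is generally nonzero; consumers must index into
    // index/value via offset[i] instead of assuming the data starts at 0.
    template <typename IndexType>
    dmlc::RowBlock<IndexType> Segment(const dmlc::RowBlock<IndexType> &b,
                                      size_t begin, size_t end) {
      dmlc::RowBlock<IndexType> seg = b;  // shares offset/label/index/value
      seg.size = end - begin;
      seg.offset = b.offset + begin;      // offsets stay absolute
      seg.label = b.label + begin;
      return seg;
    }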

Runtime check failed

What does the following failed check mean?

in dmlc-core/src/data/row_block.h

CHECK(offset.back() == value.size() || value.size() == 0);

And how can I fix it?

Thanks

Mimic complete environment in all worker nodes

It would be nice to have an argument to the dmlc-submit script that copies all environment variables to the worker nodes.
This would be easier than passing each env variable by hand for simple tasks.
--env-all?

Merge lint.py and lint3.py

I was looking at lint.py and lint3.py, and they are the same up to very minor differences; it makes sense to reduce them to a single lint.py with if conditions for Python 3.

There have been efforts to do this, but it's not yet complete - f1e1033#diff-460cb56a7855c776ca8943d6f4f2c6c4

The differences are:

  • Use iteritems instead of items: here it makes sense to just use items(), as we loop over the whole dict anyway.
  • Use re.sub instead of .find() and splicing: this seems like a minor change that would work in both.
  • os.sep instead of '/': os.sep is available in both and is recommended.

The error in dmlc-core/tracker/dmlc_tracker/tracker.py

@Earthson
I cloned the code today and tested dmlc-submit, but hit this error (set literals such as {98, 48} require Python 2.7 or later, so this is a SyntaxError on older interpreters):

sh run_xgboost.sh
Traceback (most recent call last):
File "/opt/meituan/xgboost/dmlc-core/tracker/dmlc-submit", line 7, in
from dmlc_tracker import submit
File "/opt/meituan/xgboost/dmlc-core/tracker/dmlc_tracker/submit.py", line 6, in
from . import local
File "/opt/meituan/xgboost/dmlc-core/tracker/dmlc_tracker/local.py", line 10, in
from . import tracker
File "/opt/meituan/xgboost/dmlc-core/tracker/dmlc_tracker/tracker.py", line 149
if e.errno in {98, 48}:
^
SyntaxError: invalid syntax

Classification using mxnet in R: factors need to be transformed to indicators in the format 0, 1, 2, ..., p

I have tested your package for mapping soil types (geographical data). I first simply used the 'factor' format for train.y, but then realized that predict also adds a class '0' (basically missing values). After I converted the factors to integers (0, 1, 2, ..., p), the maps were fine.

I installed the R package from source. It is a pity that it is not also available via CRAN. Otherwise, thanks for such a great package!

can't find libdmlc.a

g++ -DMSHADOW_FORCE_STREAM -Wall -O3 -I/Users/yujinke/mxnet/mshadow/ -I/Users/yujinke/mxnet/dmlc-core/include -fPIC -Iinclude -msse3 -funroll-loops -Wno-unused-parameter -Wno-unknown-pragmas -DMSHADOW_USE_CUDA=0 -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_MKL=0 -I/System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/ -DMSHADOW_RABIT_PS=0 -DMSHADOW_DIST_PS=0 -DMSDHADOW_USE_PASCAL=0 -DMXNET_USE_OPENCV=1 pkg-config --cflags opencv -DMXNET_USE_NVRTC=0 -std=c++0x -o bin/im2rec tools/im2rec.cc build/src/resource.o build/src/c_api/c_api.o build/src/c_api/c_api_error.o build/src/c_api/c_predict_api.o build/src/common/mxrtc.o build/src/engine/engine.o build/src/engine/naive_engine.o build/src/engine/threaded_engine.o build/src/engine/threaded_engine_perdevice.o build/src/engine/threaded_engine_pooled.o build/src/io/io.o build/src/io/iter_csv.o build/src/io/iter_image_recordio.o build/src/io/iter_mnist.o build/src/kvstore/kvstore.o build/src/ndarray/ndarray.o build/src/ndarray/ndarray_function.o build/src/operator/activation.o build/src/operator/batch_norm.o build/src/operator/block_grad.o build/src/operator/broadcast_reduce_op.o build/src/operator/cast.o build/src/operator/concat.o build/src/operator/convolution.o build/src/operator/crop.o build/src/operator/cross_device_copy.o build/src/operator/cudnn_batch_norm.o build/src/operator/deconvolution.o build/src/operator/dropout.o build/src/operator/elementwise_binary_op.o build/src/operator/elementwise_binary_scalar_op.o build/src/operator/elementwise_sum.o build/src/operator/elementwise_unary_op.o build/src/operator/embedding.o build/src/operator/fully_connected.o build/src/operator/identity_attach_KL_sparse_reg.o build/src/operator/leaky_relu.o build/src/operator/loss_binary_op.o build/src/operator/lrn.o build/src/operator/native_op.o build/src/operator/ndarray_op.o build/src/operator/operator.o build/src/operator/operator_util.o build/src/operator/pooling.o build/src/operator/regression_output.o build/src/operator/reshape.o build/src/operator/slice_channel.o build/src/operator/softmax_activation.o build/src/operator/softmax_output.o build/src/operator/swapaxis.o build/src/operator/upsampling.o build/src/optimizer/optimizer.o build/src/optimizer/sgd.o build/src/storage/storage.o build/src/symbol/graph_executor.o build/src/symbol/graph_memory_allocator.o build/src/symbol/static_graph.o build/src/symbol/symbol.o /Users/yujinke/mxnet/dmlc-core/libdmlc.a -pthread -lm -framework Accelerate pkg-config --libs opencv
ar: no archive members specified
usage: ar -d [-TLsv] archive file ...
ar -m [-TLsv] archive file ...
ar -m [-abiTLsv] position archive file ...
ar -p [-TLsv] archive [file ...]
ar -q [-cTLsv] archive file ...
ar -r [-cuTLsv] archive file ...
ar -r [-abciuTLsv] position archive file ...
ar -t [-TLsv] archive [file ...]
ar -x [-ouTLsv] archive [file ...]
make: *** [lib/libmxnet.a] Error 1
make: *** Waiting for unfinished jobs....
clang: error: no such file or directory: '/Users/yujinke/mxnet/dmlc-core/libdmlc.a'
clang: error: no such file or directory: '/Users/yujinke/mxnet/dmlc-core/libdmlc.a'
make: *** [lib/libmxnet.so] Error 1
make: *** [bin/im2rec] Error 1

dmlc-submit eats command quotes

I use RedHat 6.8 for my YARN cluster nodes, and it ships Python 2.6.6. To get Python 2.7 on RedHat one uses Software Collections; to activate a software collection one must run scl enable python27 'python -V'

The problem is that when I try to submit my Python job to YARN as follows

dmlc-submit --cluster=yarn scl enable python27 'python -V'

... it seems to eat the quotes and produces this error (instead of the expected Python 2.7.8)

Unable to open /etc/scl/prefixes/python!

That is the same output one gets when running the following at a bash prompt on any machine

scl enable python27 python -V

I'm trying to figure out how to fool argparse into letting this through, but I thought I'd bring it up.

Missing dmlc/base.h problem when installing R package

Hi, I am trying to install the mxnet package for R on Ubuntu 16.04 LTS, but I am running into an issue.

These are the steps I took:

  1. Type in the terminal:

sudo apt-get install -y build-essential git libatlas-base-dev libopencv-dev
git clone --recursive https://github.com/dmlc/mxnet
cd mxnet; make -j$(nproc)

  2. Then build and install the R package:

R CMD BUILD R-package/
R CMD INSTALL mxnet_0.7.tar.gz

But when running the install I get the following error:

* installing to library ‘/home/mich/R/x86_64-pc-linux-gnu-library/3.2’
* installing *source* package ‘mxnet’ ...
** libs
g++ -I/usr/share/R/include -DNDEBUG -I../inst/include  -I"/home/mich/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include"   -fpic  -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c executor.cc -o executor.o
In file included from executor.cc:9:0:
./base.h:12:23: fatal error: dmlc/base.h: No such file or directory
compilation terminated.
/usr/lib/R/etc/Makeconf:141: recipe for target 'executor.o' failed
make: *** [executor.o] Error 1
ERROR: compilation failed for package ‘mxnet’
* removing ‘/home/mich/R/x86_64-pc-linux-gnu-library/3.2/mxnet’

Am I missing something?
Thanks

Using a wildcard directory gives an error

include/dmlc/logging.h:245: [08:12:44] src/io/hdfs_filesys.cc:156: Check failed: files != NULL Error when ListDirectory /directory//train.gz
terminate called after throwing an instance of 'dmlc::Error'
  what(): [08:12:44] src/io/hdfs_filesys.cc:156: Check failed: files != NULL Error when ListDirectory /directory//train.gz

http://doc.mapr.com/pages/viewpage.action?pageId=29658211

Maybe hdfsListDirectory() does not support wildcards in the directory path; could we support this feature?

make lint error

When I run the command "make lint", I get the following errors. Is there something I need to configure?
python scripts/lint.py dmlc "all" include src scripts
Traceback (most recent call last):
File "scripts/lint.py", line 13, in
from pylint import epylint
ImportError: No module named pylint
make: *** [lint] Error 1

Something wrong in test/iostream_test.cc

Here is the suspected code snippet:

int main(int argc, char *argv[]) {
  if (argc < 2) {
    printf("Usage: \n");
    return 0;
  }
  dmlc::Stream *fs = dmlc::Stream::Create(argv[1], "w");
  dmlc::ostream os(fs);
  os << "hello-world" << 1e-10 << std::endl;
  delete fs;
}

fs is a pointer. When os is destructed, it uses its member variable stream_, which equals fs.
So if we delete fs before os is destructed, as the code above does, we get a dangling pointer.
I suggest deleting fs in dmlc::ostream's destructor.
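
Alternatively, without changing the library, a minimal fix is to scope the ostream so it is destroyed before the Stream is deleted; a sketch:

    #include <cstdio>
    #include <dmlc/io.h>

    int main(int argc, char *argv[]) {
      if (argc < 2) {
        printf("Usage: \n");
        return 0;
      }
      dmlc::Stream *fs = dmlc::Stream::Create(argv[1], "w");
      {
        dmlc::ostream os(fs);
        os << "hello-world" << 1e-10 << std::endl;
      }  // os is destroyed and flushed here, while fs is still alive
      delete fs;
      return 0;
    }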

tracker.py OSError: [Errno 48] Address already in use

python tracker/dmlc_tracker/tracker.py --num-workers 2
2016-07-17 18:03:18,522 WARNING gethostbyname(socket.getfqdn()) failed... trying on hostname()
Traceback (most recent call last):
  File "tracker/dmlc_tracker/tracker.py", line 145, in __init__
    sock.bind((hostIP, port))
OSError: [Errno 48] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tracker/dmlc_tracker/tracker.py", line 475, in <module>
    main()
  File "tracker/dmlc_tracker/tracker.py", line 470, in main
    start_rabit_tracker(args)
  File "tracker/dmlc_tracker/tracker.py", line 432, in start_rabit_tracker
    rabit = RabitTracker(hostIP=get_host_ip(args.host_ip), nslave=args.num_workers)
  File "tracker/dmlc_tracker/tracker.py", line 149, in __init__
    if e.errno in 98:
TypeError: argument of type 'int' is not iterable
Exception ignored in: <bound method RabitTracker.__del__ of <__main__.RabitTracker object at 0x101f9b0b8>>
Traceback (most recent call last):
  File "tracker/dmlc_tracker/tracker.py", line 163, in __del__
    self.sock.close()
AttributeError: 'RabitTracker' object has no attribute 'sock'

tracker.py only checks error code 98 when reselecting a new port; I think code 48 (EADDRINUSE on macOS, as above) should also be considered :)

compile error with include/dmlc/json.h in mxnet

I find this file was changed 15 hours ago, and the change may cause an mxnet build error:

In file included from /usr/include/c++/4.8/typeindex:35:0,
from /dl/xlvector/mxnet/dmlc-core/include/dmlc/./json.h:19,
from /dl/xlvector/mxnet/dmlc-core/include/dmlc/parameter.h:22,
from include/mxnet/base.h:12,
from src/ndarray/./ndarray_function.h:11,
from src/ndarray/ndarray_function.cu:9:
/usr/include/c++/4.8/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
#error This file requires compiler and library support for the

support csv format

A typical CSV file:

Year,Make,Model,Description,Price,Sell 
1997,Ford,E350,"ac, abs, moon",3000.00,40000

We assume there are only two types of columns: real numbers, such as Price, and strings, such as Model and Description. A categorical column, such as Year, can be viewed either as real or as string. The parser will use one-hot encoding for string columns.

A user can specify the column format:

my_file.csv&label=col[5]&str=col[0-3]&real=col[4]

where col[0-3] denotes columns 0, 1, 2, and 3. col[0,1,2,3], col[0,1-3], and col[0,1-2,3] are all valid and equivalent.
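
For illustration, a minimal sketch (a hypothetical ExpandColumnSpec helper, not the proposed parser itself) of expanding a spec such as "0,1-3" into explicit column indices:

    #include <cstddef>
    #include <sstream>
    #include <string>
    #include <vector>

    // Hypothetical helper: expand a column spec such as "0,1-3,5"
    // into the explicit list of column indices.
    std::vector<std::size_t> ExpandColumnSpec(const std::string &spec) {
      std::vector<std::size_t> cols;
      std::stringstream ss(spec);
      std::string item;
      while (std::getline(ss, item, ',')) {
        std::size_t dash = item.find('-');
        if (dash == std::string::npos) {
          cols.push_back(std::stoul(item));                   // single column
        } else {
          std::size_t lo = std::stoul(item.substr(0, dash));  // range start
          std::size_t hi = std::stoul(item.substr(dash + 1)); // range end
          for (std::size_t c = lo; c <= hi; ++c) cols.push_back(c);
        }
      }
      return cols;
    }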

More options:

  • noheader: there is no header line
  • sep=\t: treat \t as the separator

dmlc can't work on a hadoop cluster using LCE

dmlc throws an exception on a hadoop cluster using LCE (LinuxContainerExecutor), which uses the nobody user to run tasks:

Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:290)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Shell output: main : command provided 1
main : user is nobody
main : requested yarn user is dmlc

reference:
https://community.cloudera.com/t5/Batch-Processing-and-Workflow/YARN-force-nobody-user-on-all-jobs-and-so-they-fail/td-p/26050

improve libsvm parser performance

Currently libsvm_parser parses each line twice, which can be improved, and the handling of isnumberchars() is awkward. Fix them when you have time.
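
For reference, a single-pass sketch (hypothetical Entry and ParseLibsvmLine names; illustrative only, not the dmlc-core implementation):

    #include <cstdlib>
    #include <vector>

    // Hypothetical names (Entry, ParseLibsvmLine); illustrative only.
    struct Entry { unsigned index; float value; };

    // Walk the line once: read the label, then repeated index:value pairs.
    // Assumes a well-formed libsvm line.
    float ParseLibsvmLine(const char *p, std::vector<Entry> *out) {
      char *end;
      float label = std::strtof(p, &end);
      p = end;
      while (*p != '\0') {
        unsigned idx = static_cast<unsigned>(std::strtoul(p, &end, 10));
        if (end == p) break;   // only trailing whitespace remains
        p = end + 1;           // skip the ':' after the index
        float val = std::strtof(p, &end);
        p = end;
        out->push_back(Entry{idx, val});
      }
      return label;
    }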

Allow additional params in data file parsers

It is not always easy to pack all the required arguments into the file URI spec.
Hence, it would be nice to be able to pass an additional set of arguments to the Parser::Create() function.

Bad Support for HDFS

I made the changes according to #10, but it still fails with the following error:
/usr/bin/ld: /usr/lib/hadoop/lib/native/libhdfs.a(hdfs.c.o): relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC
/usr/lib/hadoop/lib/native/libhdfs.a: could not read symbols: Bad value

support mesos job tracker

Currently there is no support for Mesos trackers; Mesos is another common way, besides YARN, to submit jobs to a cluster. It could be interesting to support a job submission tracker for Mesos.

Compile Error on Windows 7-64bit

Hi all, when compiling mxnet with the VS2013 solution generated by the CMake GUI, I always get compilation errors such as:

"mxnet\include\mxnet/ndarray.h(594): error : has_saveload is not a template"
"D:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\INCLUDE\utility(201): error : incomplete type is not allowed detected during instantiation of class "std::pair<_Ty1, _Ty2> [with _Ty1=const int, _Ty2=mxnet::NDArray]" " 
"mxnet\src\optimizer\./sgd-inl.h(114): error : incomplete type is not allowed"
"mxnet\src\optimizer\./sgd-inl.h(114): error : pointer to incomplete class type is not allowed"

I used the "Microsoft Visual C++ Compiler Nov 2013 CTP" files to overwrite the ones in the VS2013 installation path. The third-party library versions are OpenCV 3.0.0 and CUDA 7.5; cudnn is not used. Can anyone help me point out the problem? Thanks a lot!

fails to call launchDummyTask function

Hi, Tianqi. I ran XGBoost on YARN and tested it with the mushroom example. My script's parameters are as follows:

$WORK_DIR/dmlc-core/tracker/dmlc-submit \
    --cluster=yarn \
    --num-workers=2 \
    --num-servers=1 \
    --worker-cores=2 \
    --log-file="$log_file" \
    --ship-libcxx="/data0/zeus/soft/gcc/lib64" \
    --worker-memory=5g \
    --queue=root.zeus.zeus \
    --jobname=dmlc-xgboost \
    --hdfs-tempdir=ns1/user/zeus/xgboost \
    --log-level='DEBUG' \
    ../xgboost \
    $CONF \
    data="hdfs://ns1/dw_ext/comm_rec/ctr/xgboost/test/train/agaricus.txt.train" \
    eval[test]="hdfs://ns1/dw_ext/comm_rec/ctr/xgboost/test/test/agaricus.txt.test" \
    model_dir="hdfs://ns1/dw_ext//comm_rec/ctr/xgboost/test/model"

When it calls launchDummyTask, it runs launcher.py without passing any parameters. My YARN log is as follows:

 LogType: stdout
 LogLength: 32
 Log Contents:
 Usage: launcher.py your command

Refine support for Intel compiler and more optimization flags

In many cases, simply recompiling code with the Intel compiler (ICC) and Math Kernel Library (MKL) yields a 5%~15% performance improvement over GCC. So I propose giving users the choice to build dmlc-core with either ICC or GCC.

Wider vectorization has been introduced in recent (Intel only?) CPUs: for example, AVX with 256-bit SIMD (for float only) since Sandy Bridge in 2012, and AVX2 with 256-bit SIMD (for both float and integer) since Haswell in 2014. Compiler options such as USE_AVX and USE_AVX2 could be introduced to enable AVX or AVX2.

Even though dmlc-core's build process is straightforward, adding sections like "Build" and "Install" to the README would pack all the necessary things together. This would also make dmlc-core more like a standalone deployable package, which cluster administrators would love to see.

In summary,

  1. Add support for Intel compiler.
  2. Add support for AVX, AVX2.
  3. Add Build and Install sections in README.

I think I can help and run some tests on a standard cluster, but that would be after mid-August.

[Discussion] Unified Tracker

This is a proposal for a unified tracker script that is used to launch all DMLC jobs. I will post all the necessary parameters here; please respond in this discussion if we want to iterate. Previously, the tracker scripts for DMLC job submission were separated into different Python files. While this can be convenient, it also makes it hard for users to find a central command for submitting all kinds of jobs. So I am proposing a new centralized script, dmlc-submit, that will submit the jobs.

This discussion thread will eventually become the documentation as well as the spec for the submission script. I am hoping we can do it in one to two weeks.

Let us also discuss the parameters. I think we should avoid abbreviations and enforce full names, to make the submission script clearer, as in other job submission scripts.

Some of these parameters could also be set from environment variables with the prefix DMLC.

Parameters

  • --num-workers integer, required. Number of workers in the job.
  • --num-servers integer, default 0. Number of servers in the job.
  • --worker-cores integer, default 1. Number of cores to allocate for each worker job.
  • --server-cores integer, default 1. Number of cores to allocate for each server job.
  • --worker-memory string, integer + [g|m], default 1g. Memory needed for each worker job.
  • --server-memory string, integer + [g|m], default 1g. Memory needed for each server job.
  • --mode string, 'mpi', 'yarn', or 'local', default ${DMLC_SUBMIT_MODE}. Submission mode.

[DMLC] Task 0 killed because of exceeding allocated virtual memory

I submit the job via

tracker/dmlc-submit \
    --cluster yarn \
    --num-workers 1 \
    --num-servers 1 \
    --queue my_queue \
    --worker-cores 4 \
    --server-cores 4 \
    --ship-libcxx /opt/gcc-4.8.2/lib64/ \
    bin/linear.dmlc demo/linear/conf.linear.train

But the application is killed for exceeding its allocated virtual memory. The full log is listed below.

16/09/01 14:48:03 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1472556645971_451712_01_000003
16/09/01 14:48:03 INFO impl.ContainerManagementProtocolProxy: Opening proxy : rz-data.rz.xxx.com:8043
16/09/01 14:48:03 INFO dmlc.ApplicationMaster: onContainerStarted Invoked
16/09/01 14:48:03 INFO dmlc.ApplicationMaster: onContainerStarted Invoked
16/09/01 14:48:13 INFO dmlc.ApplicationMaster: [DMLC] Task 0 killed because of exceeding allocated virtual memory
16/09/01 14:48:13 INFO impl.NMClientAsyncImpl: Processing Event EventType: STOP_CONTAINER for Container container_1472556645971_451712_01_000002
16/09/01 14:48:13 INFO impl.NMClientAsyncImpl: Processing Event EventType: STOP_CONTAINER for Container container_1472556645971_451712_01_000003
16/09/01 14:48:13 INFO impl.ContainerManagementProtocolProxy: Opening proxy : rz-data.rz.xxx.com:8043
16/09/01 14:48:13 INFO impl.ContainerManagementProtocolProxy: Opening proxy : rz-data.rz.xxx.com:8043
16/09/01 14:48:13 INFO dmlc.ApplicationMaster: onContainerStopped Invoked
16/09/01 14:48:13 INFO dmlc.ApplicationMaster: onContainerStopped Invoked
16/09/01 14:48:13 INFO dmlc.ApplicationMaster: Application completed. Stopping running containers
16/09/01 14:48:13 INFO dmlc.ApplicationMaster: Diagnostics., num_tasks2, finished=0, failed=2
[DMLC] Task 0 killed because of exceeding allocated virtual memory
16/09/01 14:48:13 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
Exception in thread "main" java.lang.Exception: Application not successful
        at org.apache.hadoop.yarn.dmlc.ApplicationMaster.run(ApplicationMaster.java:290)
        at org.apache.hadoop.yarn.dmlc.ApplicationMaster.main(ApplicationMaster.java:115)
End of LogType:stderr

How can I solve this problem?
