shifuml / shifu
An end-to-end machine learning and data mining framework on Hadoop
Home Page: https://github.com/ShifuML/shifu/wiki
License: Apache License 2.0
In Shifu 0.2.3, eval data are loaded into heap, sorted, and then the eval metrics are computed.
The data should be sorted in Pig and the metrics computed record by record, without loading everything into heap. It would also be better to reduce the number of iterations over the data.
Although the Shifu YARN package works well on our YARN clusters, it bundles a jar with version 0.20.*. We should remove it.
Please format your code in a consistent style before publishing it, so that it is easier to read.
So far, in each step, no matter whether the Hadoop jobs succeed or fail, Shifu reports "XXX step is finished successfully". The messages should make failures clear and tell the user to retry or contact us.
Currently, only the shifu eval command uploads the Eval1score file. If you run shifu eval -score directly after shifu train, the score file outputs only a very limited number of fields and ignores Eval1score.meta.column.names, since that file is not copied from local.
The model set name is required; if it is empty, print the usage message.
This one depends on #79.
In branch develop-0.2.4, we improved the scalability of the eval step, which runs in two stages:
runConfusionMatrix(evalConfig);
runPerformance(evalConfig);
The first stage creates a file that is then read by the second stage. We can merge these two stages into one and avoid creating the intermediate file.
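A minimal sketch of the merged stage, computing the confusion-matrix counts and a performance number in one pass over the scored records instead of writing the matrix to a file and re-reading it. The record shape (label, score) and the class/method names here are illustrative, not Shifu's actual eval API:

```java
import java.util.List;

public class MergedEval {
    public static class Counts { public long tp, fp, tn, fn; }

    // One pass over scored records at a given cutoff; the caller can then
    // derive precision/recall directly instead of re-reading an output file.
    public static Counts evalOnePass(List<double[]> records, double cutoff) {
        Counts c = new Counts();
        for (double[] r : records) {            // r[0] = label (0/1), r[1] = score
            boolean positive = r[1] >= cutoff;
            if (r[0] == 1.0) { if (positive) c.tp++; else c.fn++; }
            else             { if (positive) c.fp++; else c.tn++; }
        }
        return c;
    }

    public static double precision(Counts c) {
        return c.tp + c.fp == 0 ? 0.0 : (double) c.tp / (c.tp + c.fp);
    }
}
```

With this shape, runConfusionMatrix and runPerformance collapse into one method call and the intermediate HDFS file disappears.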
It's OK to follow Pig namespace style.
Missing values are very important for binning, but currently Shifu ignores all missing values.
The program hangs after EncogNNTrainer finishes; currently you need to Ctrl-C to exit.
To reproduce:
$ shifu 6_train.json
For big data, you may wait over an hour for a Shifu step to finish. It would be better to provide a parameter to run the Java process in the Shifu bash script in the background.
If there are no header files, we'd better name the columns $1, $2, $3, ... and then run the whole Shifu process.
[WARNING] The POM for com.paypal.risk.dst:decision-tree:jar:0.1.0 is missing, no dependency information available
[WARNING] The POM for com.paypal:guagua-mapreduce:jar:hadoop1:0.3.0 is missing, no dependency information available
I cloned the project. When I try to build it, the two packages above cannot be found. I found guagua-mapreduce:jar:hadoop1:0.3.0 in the lib folder of the project release, but com.paypal.risk.dst:decision-tree:jar:0.1.0 is still missing.
Shifu should combine the namespace and the column name together to form the column name.
So far, sensitivity is only used in variable selection; we'd better write such a report into HDFS files for analysis.
So far there is no page that lets users know Shifu's performance. We'd better provide such numbers on a doc page.
To stop the d-training iterations, so far only a maximum iteration number is used. It would be better to provide a convergence parameter so that the user can set both a convergence threshold and a maximum iteration number.
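A sketch of the combined stopping rule, assuming the threshold is applied to the change in training error between iterations; the parameter names (convergenceThreshold, maxIterations) are illustrative, not existing Shifu config keys:

```java
public class StopCriterion {
    // Stop when the error improvement drops below the convergence threshold,
    // or when the maximum iteration count is reached, whichever comes first.
    public static boolean shouldStop(double prevError, double currError,
                                     int iteration, int maxIterations,
                                     double convergenceThreshold) {
        if (iteration >= maxIterations) return true;
        return Math.abs(prevError - currError) < convergenceThreshold;
    }
}
```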
Guagua 0.4.2 supports an embedded ZooKeeper server. Change Shifu's zookeeperServers configuration to optional; if it is not set, let Guagua launch the embedded ZooKeeper server.
So far, only KS and IV static variable selection are supported in Shifu.
Another way is to train a model and then compare the score difference when removing one column, computing the final MSE over all records. Sort all variables by MSE value and we can remove a percentage of the variables.
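The idea above can be sketched as follows: compute MSE once per candidate column with that column removed (the per-column scoring itself is left to the caller here), then rank the columns. Columns whose removal barely hurts MSE are the first candidates to drop. The class and method names are illustrative:

```java
import java.util.*;

public class SensitivityRank {
    // Mean squared error between labels and model scores.
    public static double mse(double[] labels, double[] scores) {
        double sum = 0.0;
        for (int i = 0; i < labels.length; i++) {
            double d = labels[i] - scores[i];
            sum += d * d;
        }
        return sum / labels.length;
    }

    // Given the MSE obtained after removing each column, return the n columns
    // whose removal hurts least (smallest MSE) -- the ones to drop first.
    public static List<String> leastImportant(Map<String, Double> msePerColumn, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(msePerColumn.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n && i < entries.size(); i++) out.add(entries.get(i).getKey());
        return out;
    }
}
```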
So far, our learning rate is a fixed value; some papers and MLlib are starting to use an adaptive learning rate based on the current iteration.
This is useful for decreasing the total number of iterations.
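One common iteration-based schedule is inverse decay, sketched below; the formula and parameter names are an illustration of the general idea, not a specific schedule Shifu or MLlib mandates:

```java
public class AdaptiveLearningRate {
    // lr_t = lr_0 / (1 + decay * t): large steps in early iterations,
    // progressively smaller steps as training converges.
    public static double rateAt(double baseRate, double decay, int iteration) {
        return baseRate / (1.0 + decay * iteration);
    }
}
```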
Since we have added a lot of new features, such as sensitivity-analysis variable selection, binning stats improvements, and bug fixes, I suggest releasing a new version, 0.4.0 or 0.2.4.
Before this release:
Any ideas?
We should simplify it and remove one shifu.
When there are commas (,) in categorical variables, shifu stats gets wrong binning results when run in mapred mode.
This is caused by the categorical values being converted into a single string after binning; when the ColumnConfig is updated, the string is converted back into a list again.
Shifu supported categorical variables in the past; check this function to see.
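One way to make the string round trip safe is to escape the delimiter before joining and unescape after splitting. The escape scheme below is an illustration, not Shifu's actual ColumnConfig format:

```java
import java.util.*;

public class CategoryCodec {
    // Join categorical values into one string, escaping backslashes and commas
    // so that values containing "," survive the round trip.
    public static String join(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(values.get(i).replace("\\", "\\\\").replace(",", "\\,"));
        }
        return sb.toString();
    }

    // Split the joined string back into the original list, honoring escapes.
    public static List<String> split(String joined) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < joined.length(); i++) {
            char c = joined.charAt(i);
            if (c == '\\' && i + 1 < joined.length()) { cur.append(joined.charAt(++i)); }
            else if (c == ',') { out.add(cur.toString()); cur.setLength(0); }
            else cur.append(c);
        }
        out.add(cur.toString());
        return out;
    }
}
```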
Normalize all candidate variables, and in the VarSelect and Train steps select only the columns with FinalSelect = true.
This is useful in varselect, helping to select useful variables with a wrapper method or variable sensitivity.
With Guagua 0.5.0, the user doesn't need to specify a ZooKeeper server; Guagua will build an embedded ZooKeeper server for Shifu.
This feature lets users run Shifu without any ZooKeeper setup; we have found several times that users asked how to install and configure ZooKeeper.
For big-data training, an independent ZooKeeper cluster is still strongly recommended.
.classpath
.project
.settings
Another bottleneck we found:
Shifu stats CANNOT handle big data. It keeps failing like this:
2014-08-27 06:50:33: INFO MapReduceLauncher - 84% complete
2014-08-27 06:56:38: INFO MapReduceLauncher - job job_201407230215_861842 has failed! Stop running all dependent jobs
2014-08-27 06:56:38: INFO MapReduceLauncher - 100% complete
2014-08-27 06:56:38: WARN Launcher - There is no log file to write to.
2014-08-27 06:56:38: ERROR Launcher - Backend error message
Task attempt_201407230215_861842_r_000000_0 failed to report status for 1207 seconds. Killing!
2014-08-27 06:56:38: ERROR SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Task attempt_201407230215_861842_r_000000_0 failed to report status for 1207 seconds. Killing!
2014-08-27 06:56:38: ERROR PigStatsUtil - 1 map reduce job(s) failed!
and we have to keep sampling down the data size. Are we improving this feature as well?
So far, binning only considers counts, but sometimes it is good to add weights (such as dollar amounts) to the binning.
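A minimal sketch of the change: accumulate a per-record weight (e.g. a dollar amount) into each bin instead of a plain count of 1. Bin boundaries are assumed to be precomputed and sorted; names are illustrative:

```java
public class WeightedBinning {
    // upperBounds must be sorted ascending; each value is placed in the first
    // bin whose upper bound is >= the value, and contributes its weight
    // (rather than 1) to that bin's total.
    public static double[] weightedCounts(double[] values, double[] weights, double[] upperBounds) {
        double[] bins = new double[upperBounds.length];
        for (int i = 0; i < values.length; i++) {
            int b = java.util.Arrays.binarySearch(upperBounds, values[i]);
            if (b < 0) b = -b - 1;              // insertion point = bin index
            if (b >= bins.length) b = bins.length - 1;
            bins[b] += weights[i];              // weight, not count
        }
        return bins;
    }
}
```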
Currently we have two packages, for Hadoop 1.0 and Hadoop 2.0. It would be better to merge them, check the Hadoop version, and select the right Hadoop jars.
WOE transformation is widely used by the modeling team. It would be better to support WOE transformation in normalization.
In normalization, missing values are replaced by 0.0. There should be an option to replace missing values with the mean value instead.
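A sketch of making the fill value configurable, keeping 0.0 as the default while allowing the column mean; the enum and method names are illustrative, not Shifu's configuration keys:

```java
public class MissingFill {
    public enum Strategy { ZERO, MEAN }

    // Return the raw value when present; otherwise fill with 0.0 or the
    // column mean depending on the chosen strategy.
    public static double fill(Double raw, double columnMean, Strategy strategy) {
        if (raw != null && !raw.isNaN()) return raw;
        return strategy == Strategy.MEAN ? columnMean : 0.0;
    }
}
```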
In the current solution, the master only moves on to the next iteration once every worker in the current iteration succeeds. Although there are two parameters that define the required percentage of successful workers per iteration, at the last iteration all mapper tasks must be done and successful for the job to be successful.
If we have 10 workers and 9 of them finish the last iteration successfully, how about just terminating the job in a successful state?
This should be easy in YARN, but not so easy in the MapReduce implementation.
Auto de-select ID columns
Auto-detect categorical variables (if there are too few distinct values)
Auto de-select columns whose IV is very bad
Auto de-select columns with too many nulls or empties
Suppose there is a high-volume account in the training set. Without capping the number of txns from that customer in the training set, the model can become biased toward that account. By imposing a limit/cap, the model does not become biased toward the high-volume account.
The best Hadoop parallelism should be in the range (1, total-column-number], and it should also take the data size into account.
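A sketch of such a heuristic: scale the parallelism with the input size, then clamp it into (1, total-column-number]. The 256 MB bytes-per-task constant is an assumption for illustration, not a tuned value:

```java
public class ParallelismGuess {
    // Choose a parallelism in (1, totalColumns], scaled by input size.
    public static int choose(long inputBytes, int totalColumns) {
        long bytesPerTask = 256L * 1024 * 1024;          // assumed 256 MB per task
        int bySize = (int) Math.max(2, inputBytes / bytesPerTask);
        return Math.min(bySize, Math.max(2, totalColumns));
    }
}
```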
Shifu drops variables in the normalize step, but this is not very friendly for modelers, because they like to change ColumnConfig at any time and retrain. So I would like to propose new steps:
Once Shifu works like this, a modeler could update ColumnConfig at any time and skip the normalize step.
Percentile is a very helpful indicator for statisticians; it can tell whether the data is skewed or normally distributed.
A good implementation of percentile is DataFu by LinkedIn: http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/stats/Quantile.html
Just wondering, would we add this feature in a future release?
In ml.shifu.core.di.builtin.stats.BinomialUnivariateStatsContCalculator, validValues must be sorted before finding interQuantileRange.
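A sketch of the fix: sort a copy of validValues before taking the inter-quantile range. The nearest-rank quantile rule below is illustrative and not necessarily the one BinomialUnivariateStatsContCalculator uses:

```java
import java.util.Arrays;

public class InterQuantileRange {
    public static double iqr(double[] validValues) {
        double[] sorted = validValues.clone();
        Arrays.sort(sorted);                      // the missing step
        return quantile(sorted, 0.75) - quantile(sorted, 0.25);
    }

    // Nearest-rank quantile over an already-sorted array.
    private static double quantile(double[] sorted, double q) {
        int idx = (int) Math.ceil(q * sorted.length) - 1;
        return sorted[Math.max(0, idx)];
    }
}
```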
When configuring guagua.split.maxCombinedSplitSize in shifuconfig:
Errors:
/x/home/yanzzhou/shifu-0.2.3//conf/shifuconfig: line 25: guagua.split.maxCombinedSplitSize=33554432: command not found
/x/home/yanzzhou/shifu-0.2.3//conf/shifuconfig: line 26: guagua.split.combinable=true: command not found
So far, the model training step is independent.
Allow training models that depend on existing models, starting from an existing model for a new training run.
If I set the bagging num to 3, the Hadoop job names should be:
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:1
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:2
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:3
But all the names are the same, with id 1:
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:1
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:1
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:1
Hi all devs,
Please use FindBugs to check your code and try to fix all default FindBugs errors and warnings ASAP.
Thanks,
Zhang, Pengshan (David)
Template files should be uploaded for other devs.