
shifu's Issues

Shifu Eval Step Scalability Improvement

In Shifu 0.2.3, eval data are loaded into heap, then sorted, and then the eval metrics are computed.

The data should be sorted in Pig and the metrics computed record by record, without loading everything into heap. It would also be better to reduce the number of passes over the data.
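
Here is a minimal sketch of the record-by-record idea, assuming the scored records have already been sorted by score (descending) in Pig; the input format (tag|score per line) and class name are hypothetical, not Shifu's actual code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Records are assumed to be pre-sorted by score (descending) in Pig, so every metric
// point can be computed record by record without keeping the data set in heap.
public class StreamingEvalSketch {
    public static void main(String[] args) throws IOException {
        // total positive/negative counts are assumed known up front (e.g. from a prior Pig COUNT)
        long totalPositive = Long.parseLong(args[1]);
        long totalNegative = Long.parseLong(args[2]);
        long tp = 0, fp = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\\|");          // hypothetical "tag|score" format
                if ("1".equals(fields[0])) { tp++; } else { fp++; }
                double recall = totalPositive == 0 ? 0d : (double) tp / totalPositive;
                double precision = (double) tp / (tp + fp);
                double fpr = totalNegative == 0 ? 0d : (double) fp / totalNegative;
                // emit one ROC/PR point per record; nothing is accumulated in memory
                System.out.println(recall + "\t" + precision + "\t" + fpr);
            }
        }
    }
}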

Status Description For Each Step is NOT Good

So far, in each step, Shifu reports that the step finished successfully regardless of whether the underlying Hadoop jobs succeeded or failed. The status should be clearer: if a job failed, tell the user to retry or to contact us.

Shifu Eval Confusion and Performance Improvement

In branch develop-0.2.4, we improved the scalability of the eval step, which still runs as two sub-steps:
runConfusionMatrix(evalConfig);
runPerformance(evalConfig);

The first step creates an intermediate file that is read back in the second step. We can merge these two steps into one and avoid creating that file.
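
A minimal sketch of the merged pass, using hypothetical names rather than Shifu's actual EvalConfig API: both the confusion-matrix cells and the performance metrics are accumulated while iterating the scored records once, so no intermediate file is written.

public class MergedEvalStepSketch {

    public static class EvalRecord {
        public final boolean positive;
        public final double score;
        public EvalRecord(boolean positive, double score) { this.positive = positive; this.score = score; }
    }

    public void runConfusionMatrixAndPerformance(Iterable<EvalRecord> sortedByScoreDesc,
                                                 long totalPositive, long totalNegative) {
        long tp = 0, fp = 0;
        for (EvalRecord r : sortedByScoreDesc) {
            if (r.positive) { tp++; } else { fp++; }
            long fn = totalPositive - tp;                    // confusion-matrix cells at this score threshold
            long tn = totalNegative - fp;
            double precision = (double) tp / (tp + fp);      // performance metrics computed in the same pass
            double recall = totalPositive == 0 ? 0d : (double) tp / totalPositive;
            // stream each point straight to the final output; no intermediate file is written
            System.out.println(tp + "," + fp + "," + fn + "," + tn + "," + precision + "," + recall);
        }
    }
}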

Missing packages when trying to build Shifu

[WARNING] The POM for com.paypal.risk.dst:decision-tree:jar:0.1.0 is missing, no dependency information available
[WARNING] The POM for com.paypal:guagua-mapreduce:jar:hadoop1:0.3.0 is missing, no dependency information available

I cloned the project. When I try to build it, the two packages above cannot be found. I found guagua-mapreduce:jar:hadoop1:0.3.0 in the lib folder of the project release, but com.paypal.risk.dst:decision-tree:jar:0.1.0 is still missing.

Add convergence parameter to Shifu d-train

So far, d-training stops only when the max iteration number is reached. It would be better to also provide a convergence parameter, so users can set both a convergence threshold and a max iteration number together.
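
A minimal sketch of the proposed stop condition, with a hypothetical trainer API: training stops when either the error improvement falls below the convergence threshold or the max iteration number is reached, whichever comes first.

public class ConvergenceStopSketch {
    public static void train(double convergenceThreshold, int maxIterations) {
        double previousError = Double.MAX_VALUE;
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double currentError = runOneIteration();
            if (Math.abs(previousError - currentError) < convergenceThreshold) {
                System.out.println("Converged at iteration " + iteration);
                break;                                     // stop before hitting maxIterations
            }
            previousError = currentError;
        }
    }

    // placeholder for one d-training iteration returning the current training error
    private static double runOneIteration() {
        return Math.random();
    }
}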

Distributed Sensitivity Analysis Variable Selection

So far in Shifu, only static variable selection by KS and IV is supported.

Another way is to train a model, then for each column compare the score difference when that column is removed and compute the final MSE over all records. Sorting all variables by their MSE values, we can then drop a percentage of the least sensitive ones.
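
A minimal sketch of this sensitivity idea, with hypothetical names: score each record with the trained model, re-score it with one column "removed" (here simply zeroed out after normalization), and use the MSE of the score differences as that column's sensitivity; columns with the smallest MSE become candidates for removal.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SensitivitySelectionSketch {

    interface Model { double score(double[] inputs); }

    // returns [columnIndex, mse] pairs sorted ascending, least sensitive columns first
    public static List<double[]> rankColumns(Model model, List<double[]> records, int columnCount) {
        List<double[]> columnMse = new ArrayList<>();
        for (int col = 0; col < columnCount; col++) {
            double sumSquaredDiff = 0d;
            for (double[] record : records) {
                double original = model.score(record);
                double[] masked = record.clone();
                masked[col] = 0d;                              // "remove" the column
                double diff = original - model.score(masked);
                sumSquaredDiff += diff * diff;
            }
            columnMse.add(new double[] { col, sumSquaredDiff / records.size() });
        }
        columnMse.sort(Comparator.comparingDouble((double[] a) -> a[1]));
        return columnMse;
    }
}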

Adaptive Learning Rate Support

So far, our learning rate is a fixed value; some papers and MLlib are starting to use an adaptive learning rate that depends on the current iteration.

This is useful for decreasing the total number of iterations.
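
A minimal sketch of an iteration-dependent learning rate (the names and decay schedule are illustrative, not taken from any particular paper or from MLlib): a simple inverse-decay schedule so later iterations take smaller steps.

public class AdaptiveLearningRateSketch {
    // simple inverse-decay schedule: later iterations take smaller steps
    public static double learningRate(double initialRate, double decay, int iteration) {
        return initialRate / (1.0 + decay * iteration);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println("iteration " + i + ": rate = " + learningRate(0.1, 0.05, i));
        }
    }
}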

Release New Shifu Version 0.4.0 or 0.2.4

Since we have added a lot of new features, such as sensitivity analysis variable selection, binning stats improvements, and bug fixes, I suggest releasing a new version, either 0.4.0 or 0.2.4.

Before this release:

  1. Integration test
  2. Docs preparation
  3. Shifu-website

Any ideas?

The value of category variables cannot contain a comma (,)

When category variable values contain a comma (,), shifu stats produces wrong binning results when run in mapred mode.
This is because, after binning, categorical values are joined into a comma-separated string; when the ColumnConfig is updated, that string is split back into a list, and embedded commas break the round trip.
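
One possible fix, sketched with a hypothetical helper rather than Shifu's actual code: escape commas before joining categorical values into one string and unescape after splitting, so embedded commas survive the round trip through ColumnConfig.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CategoryValueEscapeSketch {
    private static final String ESCAPED_COMMA = "%2C";        // sentinel assumed absent from raw data

    public static String join(List<String> values) {
        return values.stream()
                .map(v -> v.replace(",", ESCAPED_COMMA))
                .collect(Collectors.joining(","));
    }

    public static List<String> split(String joined) {
        return Arrays.stream(joined.split(","))
                .map(v -> v.replace(ESCAPED_COMMA, ","))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("US,CA", "US,NY", "UK");
        String joined = join(values);
        System.out.println(joined);                        // US%2CCA,US%2CNY,UK
        System.out.println(split(joined));                 // [US,CA, US,NY, UK]
    }
}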

Category Variable Function Checking

Shifu has supported category variables in the past; check this functionality to see:

  1. whether category variables are supported in every Shifu step;
  2. whether category variables are mapped to probabilities in training;
  3. whether one-hot encoding is supported for category variables (a one-hot sketch follows this list).
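
For item 3, a minimal one-hot encoding sketch (hypothetical helper, not Shifu's actual code): each distinct category maps to one position in a 0/1 vector, and unseen values map to an all-zero vector.

import java.util.Arrays;
import java.util.List;

public class OneHotEncodingSketch {
    public static double[] oneHot(String value, List<String> categories) {
        double[] encoded = new double[categories.size()];
        int index = categories.indexOf(value);
        if (index >= 0) {
            encoded[index] = 1d;                           // unseen values stay all-zero
        }
        return encoded;
    }

    public static void main(String[] args) {
        List<String> categories = Arrays.asList("US", "UK", "CN");
        System.out.println(Arrays.toString(oneHot("UK", categories)));   // [0.0, 1.0, 0.0]
    }
}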

Change the order of Normalization and VarSelect

Normalize all candidate variables first; then, in the VarSelect and Train steps, select only the columns with FinalSelect = true.

This is useful in VarSelect because wrapper or variable-sensitivity methods can then work on already-normalized data to select useful variables.

Embedded ZooKeeper Support

With Guagua 0.5.0, the user does not need to specify a ZooKeeper server; Guagua will start an embedded ZooKeeper server for Shifu.

This feature lets users run Shifu without any ZooKeeper setup; we have seen users ask several times how to install and configure ZooKeeper.

For big-data training, an independent ZooKeeper cluster is still strongly recommended.

Shifu stats CAN NOT handle big data

Another bottleneck we found: Shifu stats CANNOT handle big data. It keeps failing like this:

2014-08-27 06:50:33: INFO MapReduceLauncher - 84% complete
2014-08-27 06:56:38: INFO MapReduceLauncher - job job_201407230215_861842 has failed! Stop running all dependent jobs
2014-08-27 06:56:38: INFO MapReduceLauncher - 100% complete
2014-08-27 06:56:38: WARN Launcher - There is no log file to write to.
2014-08-27 06:56:38: ERROR Launcher - Backend error message
Task attempt_201407230215_861842_r_000000_0 failed to report status for 1207 seconds. Killing!
2014-08-27 06:56:38: ERROR SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Task attempt_201407230215_861842_r_000000_0 failed to report status for 1207 seconds. Killing!
2014-08-27 06:56:38: ERROR PigStatsUtil - 1 map reduce job(s) failed!

We have to keep sampling the data down to a smaller size. Are we improving this feature as well?
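
This is not Shifu's actual stats UDF, but one common mitigation for "failed to report status for N seconds" is sketched below as a Pig UDF: call progress() from inside long-running loops so the task keeps sending heartbeats instead of being killed by the task timeout.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class LongRunningStatsSketch extends EvalFunc<Long> {
    @Override
    public Long exec(Tuple input) throws IOException {
        long processed = 0;
        for (int i = 0; i < 10000000; i++) {               // stand-in for heavy per-record binning work
            processed++;
            if (processed % 100000 == 0) {
                progress();                                // EvalFunc heartbeat to the task tracker
            }
        }
        return processed;
    }
}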

Add weights to binning

So far, binning only considers record counts, but sometimes it is useful to weight the bins by another column, such as a dollar amount.
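
A minimal sketch of weighted binning (the structures are hypothetical): instead of adding 1 per record to a bin, add the record's weight, so the bin statistics reflect weighted volume rather than raw counts.

public class WeightedBinningSketch {
    public static double[] weightedBinCounts(double[] values, double[] weights, double[] binBoundaries) {
        double[] binWeights = new double[binBoundaries.length + 1];
        for (int i = 0; i < values.length; i++) {
            int bin = 0;
            while (bin < binBoundaries.length && values[i] > binBoundaries[bin]) {
                bin++;
            }
            binWeights[bin] += weights[i];                 // accumulate by weight, not by 1
        }
        return binWeights;
    }
}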

Partial Completion Support

In the current solution, the master moves on to the next iteration only after every worker succeeds in the current one. Although there are two parameters that define the required percentage of successful workers per iteration, the job as a whole is only marked successful if all mapper tasks finish successfully in the last iteration.

If we have 10 workers and 9 of them finish the last iteration successfully, why not just terminate the job in a successful state?

This should be easy in YARN, but not easy in the MapReduce implementation.

Intelligent Variable Detection or Selection

Auto de-select ID columns
Auto detect category variables (e.g. when the number of distinct values is small)
Auto de-select columns with very bad IV
Auto de-select columns with too many nulls or empty values (a sketch of these heuristics follows)
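
A minimal sketch of these heuristics, with hypothetical thresholds and field names:

public class AutoVariableFilterSketch {

    public static class ColumnStats {
        public String name;
        public long distinctCount;
        public long totalCount;
        public long missingCount;
        public double iv;
    }

    public static boolean shouldDeselect(ColumnStats c) {
        boolean looksLikeId = c.distinctCount >= 0.95 * c.totalCount;          // almost unique -> ID-like
        boolean badIv = c.iv < 0.02;                                           // very weak predictor
        boolean tooManyMissing = (double) c.missingCount / c.totalCount > 0.9; // mostly null/empty
        return looksLikeId || badIv || tooManyMissing;
    }

    public static boolean looksCategorical(ColumnStats c) {
        return c.distinctCount <= 50;                                          // few distinct values
    }
}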

Filter/cap rows by an id

Suppose there is a high-volume account in the training set. Without capping the number of txns from that customer in the training set, the model can become biased toward that account. Imposing a limit/cap prevents that bias.
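
A minimal sketch of capping rows per account id (the cap value and record shape are hypothetical): keep at most `cap` txns per id so a single high-volume account cannot dominate training.

import java.util.HashMap;
import java.util.Map;

public class CapRowsPerIdSketch {
    public static boolean keep(String accountId, int cap, Map<String, Integer> seen) {
        int count = seen.getOrDefault(accountId, 0);
        if (count >= cap) {
            return false;                                  // drop extra rows for this account
        }
        seen.put(accountId, count + 1);
        return true;
    }

    public static void main(String[] args) {
        Map<String, Integer> seen = new HashMap<>();
        for (String id : new String[] { "A", "A", "A", "B" }) {
            System.out.println(id + " -> " + keep(id, 2, seen));   // A true, A true, A false, B true
        }
    }
}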

Set the Hadoop parallel number automatically

  1. Set the default value of Hadoop parallel number to be empty;
  2. If the Hadoop parallel number is empty, calculate the best Hadoop parallel number.

The best Hadoop parallel number should be in the range (1, total-column-number], and it should also take the data size into account; a sketch of such a heuristic follows.
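
A minimal sketch of the heuristic (the constants are hypothetical): when the configured parallel number is empty, derive it from the input size and clamp it into (1, total-column-number].

public class AutoParallelSketch {
    public static int computeParallel(Integer configured, int totalColumns, long inputBytes) {
        if (configured != null && configured > 0) {
            return configured;                              // an explicit user setting wins
        }
        long bytesPerReducer = 1L << 30;                    // assume roughly 1 GB per reducer
        int bySize = (int) Math.max(1, inputBytes / bytesPerReducer);
        int parallel = Math.min(bySize, totalColumns);      // never more than the column count
        return Math.max(parallel, 2);                       // and always more than 1
    }
}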

Drop variables in train step

Shifu drops variables in the normalize step, but that is not very friendly for modelers, because they like to change ColumnConfig at any time and retrain. So I would like to propose new steps:

  1. In normalize, Shifu should normalize all variables and keep them all in the normalized file.
  2. In train, Shifu should check ColumnConfig and drop the variables that are not "final select" (see the sketch after this list).

Once Shifu works like this, modelers can update ColumnConfig at any time and skip the normalize step.
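
A minimal sketch of step 2, with a hypothetical ColumnConfig shape: at train time, read the fully normalized record and keep only the columns marked finalSelect = true.

import java.util.ArrayList;
import java.util.List;

public class TrainColumnFilterSketch {

    public static class ColumnConfig {
        public final String name;
        public final boolean finalSelect;
        public ColumnConfig(String name, boolean finalSelect) { this.name = name; this.finalSelect = finalSelect; }
    }

    // keep only the columns marked finalSelect = true from a fully normalized record
    public static double[] selectFinalColumns(double[] normalizedRecord, List<ColumnConfig> configs) {
        List<Double> selected = new ArrayList<>();
        for (int i = 0; i < configs.size(); i++) {
            if (configs.get(i).finalSelect) {
                selected.add(normalizedRecord[i]);
            }
        }
        double[] result = new double[selected.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = selected.get(i);
        }
        return result;
    }
}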

Cannot Configure Guagua Properties in shifuconfig File

When configuring guagua.split.maxCombinedSplitSize in shifuconfig, the following errors appear. They suggest shifuconfig is being sourced as a shell script: keys containing dots are not valid shell variable names, so each property line is treated as a command.

Errors:

/x/home/yanzzhou/shifu-0.2.3//conf/shifuconfig: line 25: guagua.split.maxCombinedSplitSize=33554432: command not found
/x/home/yanzzhou/shifu-0.2.3//conf/shifuconfig: line 26: guagua.split.combinable=true: command not found

Continuous Model Training

So far, each model training run is independent of previously trained models.

Allow training to start from an existing model and continue training from there.

Distributed Bagging Training Jobs Always Get the Same Job Name

If I set the bagging number to 3, the Hadoop job names should be:
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:1
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:2
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:3

But all the names are the same, with id 1:
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:1
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:1
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:1

FindBugs Issues

Hi All Devs,

Please use FindBugs to check your code and try to fix all default FindBugs errors and warnings ASAP.

Thanks,
Zhang, Pengshan(David)
