shifuml / shifu
An end-to-end machine learning and data mining framework on Hadoop
Home Page: https://github.com/ShifuML/shifu/wiki
License: Apache License 2.0
In Shifu 0.2.3, eval data are loaded into heap, sorted, and then the eval metrics are computed.
The data should be sorted in Pig and the metrics computed record by record, without loading everything into heap. It would also be better to reduce the number of iterations over the data.
Although the Shifu YARN package works well on our YARN clusters, it bundles a jar with version 0.20.*. We should remove it.
Please format your code in a consistent style before publishing it, so that it is easier to read.
So far, in each step, no matter whether the Hadoop jobs succeed or fail, Shifu reports "XXX step is finished successfully". The messages should make failures clear and tell the user to retry or contact us.
Currently, only the shifu eval command uploads the Eval1score file. If you run shifu eval -score directly after shifu train, the score file outputs only a very limited number of fields and ignores Eval1score.meta.column.names, since that file is not copied from local.
The model set name is required; if it is empty, print the usage message.
This one depends on #79.
In branch develop-0.2.4, we improved the scalability of the eval step, which runs in two stages:
runConfusionMatrix(evalConfig);
runPerformance(evalConfig);
The first stage creates a file that is then read by the second stage. We can merge these two stages into one and avoid creating the intermediate file.
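A minimal sketch of the merged stage, computing the confusion-matrix counts and a performance number in one pass over the scored records instead of writing the matrix to a file and re-reading it. The record shape (label, score) and the class/method names here are illustrative, not Shifu's actual eval API:

```java
import java.util.List;

public class MergedEval {
    public static class Counts { public long tp, fp, tn, fn; }

    // One pass over scored records at a given cutoff; the caller can then
    // derive precision/recall directly instead of re-reading an output file.
    public static Counts evalOnePass(List<double[]> records, double cutoff) {
        Counts c = new Counts();
        for (double[] r : records) {            // r[0] = label (0/1), r[1] = score
            boolean positive = r[1] >= cutoff;
            if (r[0] == 1.0) { if (positive) c.tp++; else c.fn++; }
            else             { if (positive) c.fp++; else c.tn++; }
        }
        return c;
    }

    public static double precision(Counts c) {
        return c.tp + c.fp == 0 ? 0.0 : (double) c.tp / (c.tp + c.fp);
    }
}
```

With this shape, runConfusionMatrix and runPerformance collapse into one method call and the intermediate HDFS file disappears.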
It's OK to follow Pig namespace style.
Missing values are very important for binning, but currently Shifu ignores all missing values.
The program hangs after EncogNNTrainer finishes; currently you need to Ctrl-C to exit.
To reproduce:
$ shifu 6_train.json
For big data, you may wait over an hour for a Shifu step to finish. It would be better to provide a parameter to run the Java process in the Shifu bash script in the background.
If there are no header files, we'd better name the columns $1, $2, $3, ... and then run the whole Shifu process.
[WARNING] The POM for com.paypal.risk.dst:decision-tree:jar:0.1.0 is missing, no dependency information available
[WARNING] The POM for com.paypal:guagua-mapreduce:jar:hadoop1:0.3.0 is missing, no dependency information available
I cloned the project. When I try to build it, the two packages above cannot be found. I found guagua-mapreduce:jar:hadoop1:0.3.0 in the lib folder of the project release, but com.paypal.risk.dst:decision-tree:jar:0.1.0 is still missing.
Shifu should combine the namespace and the column name together to form the column name.
So far, sensitivity is only used in variable selection; we'd better write such a report into HDFS files for analysis.
So far there is no page that lets users know Shifu's performance. We'd better provide such numbers on a doc page.
To stop the d-training iterations, so far only a maximum iteration number is used. It would be better to provide a convergence parameter so that the user can set both a convergence threshold and a maximum iteration number.
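A sketch of the combined stopping rule, assuming the threshold is applied to the change in training error between iterations; the parameter names (convergenceThreshold, maxIterations) are illustrative, not existing Shifu config keys:

```java
public class StopCriterion {
    // Stop when the error improvement drops below the convergence threshold,
    // or when the maximum iteration count is reached, whichever comes first.
    public static boolean shouldStop(double prevError, double currError,
                                     int iteration, int maxIterations,
                                     double convergenceThreshold) {
        if (iteration >= maxIterations) return true;
        return Math.abs(prevError - currError) < convergenceThreshold;
    }
}
```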
Guagua 0.4.2 supports an embedded ZooKeeper server. Change Shifu's zookeeperServers configuration to optional; if it is not set, let Guagua launch the embedded ZooKeeper server.
So far, only KS and IV static variable selection are supported in Shifu.
Another way is to train a model and then compare the score difference when removing one column, computing the final MSE over all records. Sort all variables by MSE value and we can remove a percentage of the variables.
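The idea above can be sketched as follows: compute MSE once per candidate column with that column removed (the per-column scoring itself is left to the caller here), then rank the columns. Columns whose removal barely hurts MSE are the first candidates to drop. The class and method names are illustrative:

```java
import java.util.*;

public class SensitivityRank {
    // Mean squared error between labels and model scores.
    public static double mse(double[] labels, double[] scores) {
        double sum = 0.0;
        for (int i = 0; i < labels.length; i++) {
            double d = labels[i] - scores[i];
            sum += d * d;
        }
        return sum / labels.length;
    }

    // Given the MSE obtained after removing each column, return the n columns
    // whose removal hurts least (smallest MSE) -- the ones to drop first.
    public static List<String> leastImportant(Map<String, Double> msePerColumn, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(msePerColumn.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n && i < entries.size(); i++) out.add(entries.get(i).getKey());
        return out;
    }
}
```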
So far, our learning rate is a fixed value; some papers and MLlib are starting to use an adaptive learning rate based on the current iteration.
This is useful for decreasing the total number of iterations.
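One common iteration-based schedule is inverse decay, sketched below; the formula and parameter names are an illustration of the general idea, not a specific schedule Shifu or MLlib mandates:

```java
public class AdaptiveLearningRate {
    // lr_t = lr_0 / (1 + decay * t): large steps in early iterations,
    // progressively smaller steps as training converges.
    public static double rateAt(double baseRate, double decay, int iteration) {
        return baseRate / (1.0 + decay * iteration);
    }
}
```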
Since we have added a lot of new features, such as sensitivity-analysis variable selection, binning stats improvements, and bug fixes, I suggest releasing a new version, 0.4.0 or 0.2.4.
Before this release:
Any ideas?
We should simplify it and remove one shifu.
When there are commas (,) in categorical variables, shifu stats gets wrong binning results when run in mapred mode.
This is caused by the categorical values being converted into a single string after binning; when the ColumnConfig is updated, the string is converted back into a list again.
Shifu supported categorical variables in the past; check this function to see.
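One way to make the string round trip safe is to escape the delimiter before joining and unescape after splitting. The escape scheme below is an illustration, not Shifu's actual ColumnConfig format:

```java
import java.util.*;

public class CategoryCodec {
    // Join categorical values into one string, escaping backslashes and commas
    // so that values containing "," survive the round trip.
    public static String join(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(values.get(i).replace("\\", "\\\\").replace(",", "\\,"));
        }
        return sb.toString();
    }

    // Split the joined string back into the original list, honoring escapes.
    public static List<String> split(String joined) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < joined.length(); i++) {
            char c = joined.charAt(i);
            if (c == '\\' && i + 1 < joined.length()) { cur.append(joined.charAt(++i)); }
            else if (c == ',') { out.add(cur.toString()); cur.setLength(0); }
            else cur.append(c);
        }
        out.add(cur.toString());
        return out;
    }
}
```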
Normalize all candidate variables, and in the VarSelect and Train steps select only the columns with FinalSelect = true.
This is useful in varselect, helping to select useful variables with a wrapper method or variable sensitivity.
With Guagua 0.5.0, the user doesn't need to specify a ZooKeeper server; Guagua will build an embedded ZooKeeper server for Shifu.
This feature lets users run Shifu without any ZooKeeper setup; we have found several times that users asked how to install and configure ZooKeeper.
For big-data training, an independent ZooKeeper cluster is still strongly recommended.
.classpath
.project
.settings
Another bottleneck we found:
Shifu stats CANNOT handle big data. It keeps failing like this:
2014-08-27 06:50:33: INFO MapReduceLauncher - 84% complete
2014-08-27 06:56:38: INFO MapReduceLauncher - job job_201407230215_861842 has failed! Stop running all dependent jobs
2014-08-27 06:56:38: INFO MapReduceLauncher - 100% complete
2014-08-27 06:56:38: WARN Launcher - There is no log file to write to.
2014-08-27 06:56:38: ERROR Launcher - Backend error message
Task attempt_201407230215_861842_r_000000_0 failed to report status for 1207 seconds. Killing!
2014-08-27 06:56:38: ERROR SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Task attempt_201407230215_861842_r_000000_0 failed to report status for 1207 seconds. Killing!
2014-08-27 06:56:38: ERROR PigStatsUtil - 1 map reduce job(s) failed!
and we have to keep sampling down the data size. Are we improving this feature as well?
So far, binning only considers counts, but sometimes it is good to add weights (such as dollar amounts) to the binning.
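A minimal sketch of the change: accumulate a per-record weight (e.g. a dollar amount) into each bin instead of a plain count of 1. Bin boundaries are assumed to be precomputed and sorted; names are illustrative:

```java
public class WeightedBinning {
    // upperBounds must be sorted ascending; each value is placed in the first
    // bin whose upper bound is >= the value, and contributes its weight
    // (rather than 1) to that bin's total.
    public static double[] weightedCounts(double[] values, double[] weights, double[] upperBounds) {
        double[] bins = new double[upperBounds.length];
        for (int i = 0; i < values.length; i++) {
            int b = java.util.Arrays.binarySearch(upperBounds, values[i]);
            if (b < 0) b = -b - 1;              // insertion point = bin index
            if (b >= bins.length) b = bins.length - 1;
            bins[b] += weights[i];              // weight, not count
        }
        return bins;
    }
}
```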
Currently we have two packages, for Hadoop 1.0 and Hadoop 2.0. It would be better to merge them, check the Hadoop version, and select the right Hadoop jars.
WOE transformation is widely used by the modeling team. It would be better to support WOE transformation in normalization.
In normalization, missing values are replaced by 0.0. There should be an option to replace missing values with the mean value instead.
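A sketch of making the fill value configurable, keeping 0.0 as the default while allowing the column mean; the enum and method names are illustrative, not Shifu's configuration keys:

```java
public class MissingFill {
    public enum Strategy { ZERO, MEAN }

    // Return the raw value when present; otherwise fill with 0.0 or the
    // column mean depending on the chosen strategy.
    public static double fill(Double raw, double columnMean, Strategy strategy) {
        if (raw != null && !raw.isNaN()) return raw;
        return strategy == Strategy.MEAN ? columnMean : 0.0;
    }
}
```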
In the current solution, the master only moves on to the next iteration once every worker in the current iteration succeeds. Although there are two parameters that define the required percentage of successful workers per iteration, at the last iteration all mapper tasks must be done and successful for the job to be successful.
If we have 10 workers and 9 of them finish the last iteration successfully, how about just terminating the job in a successful state?
This should be easy in YARN, but not so easy in the MapReduce implementation.
Auto de-select ID columns
Auto-detect categorical variables (if there are too few distinct values)
Auto de-select columns whose IV is very bad
Auto de-select columns with too many nulls or empties
Suppose there is a high-volume account in the training set. Without capping the number of txns from that customer in the training set, the model can become biased toward that account. By imposing a limit/cap, the model does not become biased toward the high-volume account.
The best Hadoop parallelism should be in the range (1, total-column-number], and it should also take the data size into account.
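A sketch of such a heuristic: scale the parallelism with the input size, then clamp it into (1, total-column-number]. The 256 MB bytes-per-task constant is an assumption for illustration, not a tuned value:

```java
public class ParallelismGuess {
    // Choose a parallelism in (1, totalColumns], scaled by input size.
    public static int choose(long inputBytes, int totalColumns) {
        long bytesPerTask = 256L * 1024 * 1024;          // assumed 256 MB per task
        int bySize = (int) Math.max(2, inputBytes / bytesPerTask);
        return Math.min(bySize, Math.max(2, totalColumns));
    }
}
```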
Shifu drops variables in the normalize step, but this is not very friendly for modelers, because they like to change ColumnConfig at any time and retrain. So I would like to propose new steps:
Once Shifu works like this, a modeler could update ColumnConfig at any time and skip the normalize step.
Percentile is a very helpful indicator for statisticians; it can tell whether the data is skewed or normally distributed.
A good implementation of percentile is DataFu by LinkedIn: http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/stats/Quantile.html
Just wondering, would we add this feature in a future release?
In ml.shifu.core.di.builtin.stats.BinomialUnivariateStatsContCalculator, validValues must be sorted before finding interQuantileRange.
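A sketch of the fix: sort a copy of validValues before taking the inter-quantile range. The nearest-rank quantile rule below is illustrative and not necessarily the one BinomialUnivariateStatsContCalculator uses:

```java
import java.util.Arrays;

public class InterQuantileRange {
    public static double iqr(double[] validValues) {
        double[] sorted = validValues.clone();
        Arrays.sort(sorted);                      // the missing step
        return quantile(sorted, 0.75) - quantile(sorted, 0.25);
    }

    // Nearest-rank quantile over an already-sorted array.
    private static double quantile(double[] sorted, double q) {
        int idx = (int) Math.ceil(q * sorted.length) - 1;
        return sorted[Math.max(0, idx)];
    }
}
```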
When configuring guagua.split.maxCombinedSplitSize in shifuconfig:
Errors:
/x/home/yanzzhou/shifu-0.2.3//conf/shifuconfig: line 25: guagua.split.maxCombinedSplitSize=33554432: command not found
/x/home/yanzzhou/shifu-0.2.3//conf/shifuconfig: line 26: guagua.split.combinable=true: command not found
So far, the model training step is independent.
Allow training models that depend on existing models, starting from an existing model for a new training run.
If I set the bagging num to 3, the Hadoop job names should be:
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:1
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:2
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:3
But all the names are the same, with id 1:
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:1
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:1
Shifu Master-Workers NN Iteration: ato_all_shifu_pengzhang id:1
Hi all devs,
Please use FindBugs to check your code and try to fix all default FindBugs errors and warnings ASAP.
Thanks,
Zhang, Pengshan (David)
Template files should be uploaded for other devs.