kaggle_ieee-fraud-detection's People

Contributors

dependabot[bot], yota-p

kaggle_ieee-fraud-detection's Issues

Feature script to be independent

Description

Currently, feature scripts are not executable on their own; they have to be called from feature_factory. This makes them hard to debug, so I want to make each script executable by itself.

Goal

The feature scripts raw.py and altgor.py are executable on their own.
Remove the feature-creation step from the main script and add feature loading instead.

Impact

Feature-creation scripts become easy to debug.

Reference

https://github.com/hakubishin3/kaggle_ieee

Additional

  • Move mylog.py and reduce_mem_usage.py into src/utils.
  • Set ROOTDIR from the script itself (not from the config).

Tasks

  • Run raw, nroman, main without modification
  • Modify & run raw
  • Run main with raw modified
  • Modify & run nroman
  • Run main with raw, nroman modified
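
As a rough sketch of the goal, a feature script could carry its own entry point so it runs without feature_factory. This is a minimal illustration, not the project's actual code; the `create_features` helper, the `--debug` flag, and the column used are assumptions:

```python
# Hypothetical standalone feature script (illustrative names throughout).
import argparse
import logging

import numpy as np
import pandas as pd


def create_features(df: pd.DataFrame) -> pd.DataFrame:
    """Example feature step; the real scripts define their own."""
    out = pd.DataFrame(index=df.index)
    out["TransactionAmt_log"] = np.log1p(df["TransactionAmt"])
    return out


if __name__ == "__main__":
    # Standalone entry point: no dependency on feature_factory.
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true", help="use a small subset")
    args = parser.parse_args()

    logging.basicConfig(level=logging.DEBUG if args.debug else logging.INFO)
    logger = logging.getLogger(__name__)

    nrows = 1000 if args.debug else None
    train = pd.read_csv("data/raw/train_transaction.csv", nrows=nrows)
    features = create_features(train)
    logger.info("created %d feature columns", features.shape[1])
```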

Output training log

Description

Currently, the training process is not logged,
so I can't review how training went after a batch run.

Goal

To output a training log.

Impact

Training results can be saved and checked after a batch experiment.

Tasks

  • Replace print() with logger_train()
  • Get training results (evals_result) from the model
  • Add stop_watch() to all methods called from Pipeline()
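
A hedged sketch of the evals_result task above, using the public xgboost API (`evals_result` is a real `xgb.train` argument; the `logger_train` name follows this issue):

```python
import logging

import xgboost as xgb

logger_train = logging.getLogger("train")


def train_with_log(params, dtrain, dval, num_boost_round=100):
    evals_result = {}  # xgboost fills this with per-iteration metric values
    model = xgb.train(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        evals=[(dtrain, "train"), (dval, "valid")],
        evals_result=evals_result,
        verbose_eval=False,
    )
    # Log the final validation scores instead of print()ing them.
    for metric, scores in evals_result["valid"].items():
        logger_train.info("valid %s: %.6f", metric, scores[-1])
    return model, evals_result
```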

Rebuild architecture

Description

The project currently uses a full-process piped architecture: preprocessing, training, and prediction are connected as a pipeline. However, the Pipeline definition requires a lot of modification whenever pipes are switched. I recently noticed it doesn't have to be a luigi pipeline; it can be replaced by a method that judges whether a task is already done. So I want to rebuild the architecture.

Goal

To implement the full architecture introduced in TIS Kubo 2018.

Impact

The whole project structure is affected.

Reference

TIS Kubo 2018

Tasks

  • Implement empty modules and classes from sequence chart and class diagram
  • Migrate modules from previous version, check consistency
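
A minimal sketch of the "judge whether the task is done" idea that could replace the luigi pipeline; the class shape and output-file check are assumptions for illustration:

```python
from pathlib import Path


class Task:
    """Runs only when its declared output does not exist yet."""

    def __init__(self, name: str, output: str):
        self.name = name
        self.output = Path(output)

    def is_done(self) -> bool:
        return self.output.exists()

    def run(self) -> None:
        raise NotImplementedError

    def __call__(self) -> None:
        if self.is_done():
            print(f"skip {self.name}: {self.output} exists")
        else:
            self.run()
```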

Reproduce gold solution

Description

The project architecture has been refined. Now it's time to reproduce the top Kagglers' scores.

Goal

Reproduce the gold solution, i.e. Public LB > 0.946627.

Impact

Provides evidence that this project can reproduce top solutions.

Reference

Chris Deotte - 1st place solution
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111284
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111308
https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600

Additional

  • Fixed the feature-implementation scripts to create their own logger when executed standalone.
  • Fixed the feature-implementation scripts to accept options, such as force calculation and debug mode.
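
A sketch of the option handling described in the second bullet, assuming argparse flags named `--force` and `--debug` and a hypothetical output path:

```python
import argparse
from pathlib import Path


def get_options() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="standalone feature script")
    parser.add_argument("--force", action="store_true",
                        help="recompute even if the cached feature file exists")
    parser.add_argument("--debug", action="store_true",
                        help="run on a small subset of rows")
    return parser.parse_args()


if __name__ == "__main__":
    opt = get_options()
    out = Path("data/feature/myfeature_train.pkl")  # hypothetical output path
    if out.exists() and not opt.force:
        print(f"{out} exists; use --force to recompute")
    else:
        pass  # compute and save the feature here
```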

Run from Make

Description

Currently we run the training script by typing python src/main.py 0000, which is not self-evident (here 0000 indicates a parameter set).
I want to gather the commands in a Makefile.

Goal

Run training by 'make train 0000'.
Run prediction by 'make pred 0000'.

Impact

Easy to run for beginners.
Easy to type.

Additional

Need to separate training and predicting code.

Tasks

  • Add make run to Makefile

Define filename by run

Description

Currently, output filenames are defined in both the config and the run script.
The directory structure (log/, data/, ...) requires filenames to follow fixed rules, so defining them in the config is unnecessary.
This issue defines the naming rules and modifies the configs so they no longer define filenames or directories.

Goal

Define filenames based on RULE below:

RULE
{filetype}_{VERSION}(.small){.extension}

data/feature/name_test(.small).pkl
data/feature/name_train(.small).pkl
data/raw/*.csv
data/model/model_0000(.small).json
data/submission/submission_0000.csv
log/main_0000(.small).log
log/train_0000(.small).tsv

Impact

Config files become slimmer.
Filenames can be changed dynamically in run_{VERSION}.py (see the path-builder sketch after the task list).

Reference

None

Additional

  • utilize the is_latest method
  • output options to the log
  • remove print() from the source
  • remove TODOs from the source
  • remove unneeded lines (unused modules, commented-out code)
  • remove RUN_TRAIN, RUN_PRED from the config
  • Deprecate 0000-0009

Tasks

  • search data input/output
  • Implement rules
    • Create run_0010.py (copy of 0009)
    • Comment out paths from config
    • set filenames at run_0010.py
    • separate output (*.small) from full data
      • move value c.transformer.USE_SMALL_DATA to c.runtime
      • Modify all file input & output methods
    • run 0010, compare output to 0009
  • Add data reading methods to debugging notebook (based on RULE above)
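
A sketch of a path builder implementing the RULE, assuming it would live in run_{VERSION}.py or a small util; the directory mapping follows the listing above:

```python
from pathlib import Path

# Directory per filetype, following the RULE listing above.
DIRS = {
    "model": "data/model",
    "submission": "data/submission",
    "main": "log",
    "train": "log",
}


def build_path(filetype: str, version: str, ext: str, small: bool = False) -> Path:
    """Builds {filetype}_{VERSION}(.small).{extension} in its conventional dir."""
    suffix = ".small" if small else ""
    return Path(DIRS[filetype]) / f"{filetype}_{version}{suffix}.{ext}"


# build_path("model", "0000", "json", small=True) -> data/model/model_0000.small.json
# build_path("main", "0000", "log")               -> log/main_0000.log
```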

Transformer resampling & train num_boost_round not recognized

Description

If the transformer reads the .small data under a --small run, it resamples data that is already sampled.
xgboost.train() does not recognize num_boost_round inside params, so the default value is used.
Feature importance scores are not logged.
Deprecate train_and_validate; validation data should be optional.

Expected behavior

The transformer should skip the sampling step under --small and read the data from file.
Pass num_boost_round to xgboost.train() as a keyword argument (sketched below).
Log feature importance for the full dataset.
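
For the num_boost_round fix, a minimal sketch (this keyword-argument behavior is standard xgboost; the wrapper function is illustrative):

```python
import xgboost as xgb


def train_xgb(params: dict, dtrain: xgb.DMatrix, dval: xgb.DMatrix) -> xgb.Booster:
    # xgboost.train() ignores num_boost_round inside params; it must be
    # passed as a keyword argument, so pop it out of the params dict.
    num_boost_round = params.pop("num_boost_round", 100)
    return xgb.train(params, dtrain,
                     num_boost_round=num_boost_round,
                     evals=[(dval, "valid")])
```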

To reproduce

python run_gbdt.py --small

Impact

To finish the objectives above.

Reference

None

Additional

None

Tasks

None

Directory missing

Description

I cloned the source and tried to run it, but the run failed because the required directories did not exist.

Expected behavior

Check for the directories data/interim, data/raw, data/processed, and data/external before running.
Create them if they don't exist.

To reproduce

  1. Clone master
  2. python src/main.py 0000

Impact

The run can't proceed past downloading the raw data.

Additional

The directory check doesn't need to run automatically at execution time.
Add a directory-creation step to the Makefile; later, make run (from another issue) will depend on it.

Tasks

Add a directory-creation step to the Makefile.
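
A minimal sketch of the directory check, which the Makefile target could invoke; directory names follow this issue:

```python
from pathlib import Path

REQUIRED_DIRS = ["data/interim", "data/raw", "data/processed", "data/external"]

for d in REQUIRED_DIRS:
    Path(d).mkdir(parents=True, exist_ok=True)  # no-op if it already exists
```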

Divide & Pipe preprocess method

Description

Currently, preprocess.py contains all preprocessing steps in one method.
If that method raises an error partway through, every step has to be re-executed from the beginning.
I want the preprocess method to be divided into steps and piped.

Goal

The preprocess method detects and skips finished steps.

Impact

Reduces re-execution time.

Tasks

  • Divide each step in preprocess() into its own method
  • Implement pipeline
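
A sketch of one way to divide and pipe the steps, skipping any step whose output file already exists; the names and pickle caching are assumptions:

```python
from pathlib import Path

import pandas as pd


def run_step(name: str, output: str, func, *inputs) -> pd.DataFrame:
    """Run `func` only if `output` is missing; otherwise load the cached result."""
    out = Path(output)
    if out.exists():
        print(f"skip {name}: found {out}")
        return pd.read_pickle(out)
    result = func(*inputs)
    result.to_pickle(out)
    return result
```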

Organize directory

Description

The current layout keeps data directories at the top level mixed in with the source tree, which makes navigation awkward.

Goal

Finish the tasks below.

Impact

Better access to each file.
Remove unneeded files.

Reference

None

Additional

This is how the directory tree will look
(move src/* to the top level, drop the plural 's' from directory names, keep data under data/*):

data/model *ignore
data/submission *ignore
data/raw *ignore
data/feature *ignore
log/train_0000.tsv
log/main_0000.log
log/importance_0000.json
log/params_0000.json
util/
feature/
config/
run0000.py
...
run9999.py

Tasks

  • Remove make requirements, dir, run*, test, test_environment from Makefile
  • Change print in reduce_mem_usage into logger
  • Check run 0007
  • move transformer to utils, feature_factory to features
  • Change directories as above
    • move dir & grep for adjust & adjust code & run for each:
      • models->data/model
      • data/submission (no need of modification)
      • data/raw (no need of modification)
      • data/processed->data/feature
      • data/interim->delete
      • data/external->delete
      • log/train/0000.tsv->log/train_0000.tsv
      • log/main/0000.log->log/main_0000.log
      • log/importance_0000.json (no files to rewrite)
      • log/params_0000.json (no files to rewrite)
      • src/utils->util/
      • src/features->feature/
      • src/config->config/
      • src/models->model/
      • src/run->run0000.py
      • notebooks->notebook
  • .gitignore data
  • .gitkeep data/DIR
  • remove unnecessary sys.path.insert()
    • Check run 0007
  • Define paths at config as possible
    • data/feature/EACHFEATURE *this needs to be single executable. need to define dir in script
    • data/feature/MERGED
    • data/model *model_name will be determined from model.TYPE
    • data/raw *this needs to be single executable. need to define dir in script
    • data/submission
    • log/main, train
  • Create a submission shell script & run it from make -> couldn't pass variables properly, so decided to leave a memo and submit via the command directly.

Reproduce 1st place solution

Description

Currently my highest score is Private 0.932896, Public 0.958595.
The 1st place solution is 0.967722 public, 0.945884 private.

Goal

To achieve Private score > 0.945884.

  • Catboost (0.963915 public / 0.940826 private)
  • LGBM (0.961748 / 0.938359)
  • XGB (0.960205 / 0.932369)

Impact

Proof of correct model.

Reference

1st Place Solution - Part 1
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111284

1st Place Solution - Part 2
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111308

Very short summary
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111257

IEEE - Notebooks series (LB 0.9480/no blend)
https://www.kaggle.com/c/ieee-fraud-detection/discussion/104142

How to find UIDs
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111510

How the magic works
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111453

XGB Fraud with Magic - [0.9600]
https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600

Kaggle Ensembling Guide
https://mlwave.com/kaggle-ensembling-guide/

Additional

None

Tasks

  • Check reference
  • How to validate -> GroupKFold by month 'DT_M' (see the sketch after this list)
  • Fix diffs from reference kernels
  • Create features following Cdeotte's notebook
  • Run xgboost by batch
  • Run lightgbm by batch
  • Run catboost by batch
  • Blend submissions
  • Divide scripts by objective
  • Output evals_result to file
  • Deprecate callbacks, train_log
  • Hyper parameter tune
  • Use mlflow
  • Confirm Private score > 0.945884.
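
A sketch of the GroupKFold validation referenced above, grouping by the month feature DT_M so whole months are held out; the fold count is an assumption:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold


def month_folds(X: pd.DataFrame, y: pd.Series, n_splits: int = 6):
    """Yield (train_idx, valid_idx) pairs with whole months held out."""
    gkf = GroupKFold(n_splits=n_splits)
    yield from gkf.split(X, y, groups=X["DT_M"])
```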

Simplify experiment

Description

I'm struggling to reproduce past experiments.
The main cause is that the dependent modules & utils change often.

Goal

Discontinue complex class dependencies.
Consolidate the important methods & parameters for each experiment.

Impact

Reproducibility improves.

Reference

None

Additional

None

Tasks

None

Log validation scores at each steps

Description

The current log file doesn't include evaluation scores at each step (a logging-callback sketch follows the task list).

Goal

Evaluation scores are logged in each log file.

Impact

Able to analyze score variation across steps.

Reference

None

Additional

None

Tasks

  • On lgb
    • Create an lgb-specific method on the trainer for creating & getting the train logger
    • Organize model parameters
    • move unneeded messages to the main logger
    • Specify digits for scores in the train logger
    • Output the fold number in the train logger
  • Apply the same changes to xgb
  • Switch to small datasets in debug mode
  • Consistency check for lgb and xgb
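
A hedged sketch of the train-logger callback for lgb, using LightGBM's public callback interface (`env.iteration`, `env.evaluation_result_list`); the logger name and period are assumptions:

```python
import logging

import lightgbm as lgb

logger_train = logging.getLogger("train")


def make_log_callback(fold: int, period: int = 10):
    """Returns a callback that logs fold, step, and fixed-digit scores."""
    def _callback(env):
        if env.iteration % period == 0:
            scores = " ".join(
                f"{data}-{name}:{value:.6f}"
                for data, name, value, _ in env.evaluation_result_list
            )
            logger_train.info("fold=%d step=%d %s", fold, env.iteration, scores)
    return _callback


# usage:
# lgb.train(params, dtrain, valid_sets=[dval],
#           callbacks=[make_log_callback(fold=0)])
```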

Organize model and trainer

Description

GBDT models (LightGBM, XGBoost, CatBoost) expose similar but slightly different APIs.
Let's use them through one common procedure (a factory sketch follows the task list).

Goal

To switch models from the config,
i.e. to finish the tasks below.

Impact

Easy to switch GBDT models.

Reference

None

Additional

None

Tasks

  • Deprecate configure.py
    • Create utils.easydict
    • Create utils.option
    • Apply option at run.py
  • At run:
    • Create modelfactory
    • Create defaultmodel
    • tune
    • Create tunedmodel
    • train
    • predict
  • At script:
    • Create method train at run
    • Create method tune_params at run
  • Create model.ModelFactory
  • Create model.BaseModel
  • Create model.LightGBM_Model
  • Create model.XGBoost_Model (stub class)
  • run 0007
  • Remove LGBMTrainer
  • Remove BaseTrainer
  • Save model as name: model_0000_TYPE.pkl
  • Save feature importance
  • Create model.XGBoost_Model
  • run 0008
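
A sketch of the factory described above; the class names follow this issue's tasks, while the registry and internals are illustrative:

```python
class BaseModel:
    def train(self, X, y):
        raise NotImplementedError

    def predict(self, X):
        raise NotImplementedError


class LightGBM_Model(BaseModel):
    pass  # lightgbm-specific train/predict go here


class XGBoost_Model(BaseModel):
    pass  # xgboost-specific train/predict go here


class ModelFactory:
    _registry = {"lightgbm": LightGBM_Model, "xgboost": XGBoost_Model}

    @classmethod
    def create(cls, model_type: str) -> BaseModel:
        return cls._registry[model_type]()


# model = ModelFactory.create(c.model.TYPE)  # TYPE comes from the config
```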

Neural network model from TensorFlow

Description

Currently the models are gradient-boosted trees only; we should also try other model families, such as neural networks.

Goal

Implement neural network model

Impact

Builds knowledge for future work.

Reference

None

Additional

None

Tasks

None
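
Since no tasks are listed yet, a hedged starting point for the Keras side; the layer sizes and feature handling are placeholders, not tuned for this competition:

```python
import tensorflow as tf


def build_model(n_features: int) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(isFraud)
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model
```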

Small enhancements

Description

Small enhancements

Goal

  • deprecate c.runtime
  • reduce Slack notification
  • use JSON for configs

Impact

None

Reference

None

Additional

None

Tasks

  • copy 0010 as 0012, 0011 as 0013
  • deprecate c.runtime on config
    • deprecate ROOTDIR (use relative path)
    • RANDOM_SEED to args
    • USE_SMALL_DATA to args
    • VERSION from command line args
    • indicate config by command line args
  • separate config.slackauth
  • reduce Slack notification - by optional args for timer
  • FILE, SLACK, STREAM_HANDLER_LEVEL to lowercase
  • use JSON for config
    • create slackauth.json
    • read slackauth.json from run_gbdt.py
    • deprecate np.nan from missing (magic.py, config_0012.py)
    • create config_0012.json, config_0013.json
    • read config_0012.json, config_0013.json from run_gbdt.py
    • deprecate pathlib.Path for general use; keep it only for is_latest
  • confirm consistency
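
A sketch of the JSON-config direction from the bullets above; file paths and flag names are assumptions:

```python
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--config", default="config/config_0012.json")
parser.add_argument("--seed", type=int, default=42)   # RANDOM_SEED to args
parser.add_argument("--small", action="store_true")   # USE_SMALL_DATA to args
args = parser.parse_args()

with open(args.config) as f:
    config = json.load(f)
with open("config/slackauth.json") as f:  # kept separate from the main config
    slackauth = json.load(f)
```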

Create test code

Description

This project doesn't have test code.
It needs a test suite for the essential modules.

Goal

The test suite can be run with make test.

Impact

Keeps code consistent with existing functionality.
Reduce bugs in production code.
Reduce time for bug fixing.

Tasks

  • Identify modules to test
  • Implement make test
  • Implement test cases
    • module1, 2, ...
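
As one example test case, a hedged pytest sketch for reduce_mem_usage; the import path and signature are assumptions that should be checked against the real module:

```python
import numpy as np
import pandas as pd

from util.reduce_mem_usage import reduce_mem_usage  # hypothetical import path


def test_reduce_mem_usage_is_lossless_enough():
    df = pd.DataFrame({"a": np.arange(100, dtype=np.int64),
                       "b": np.random.rand(100).astype(np.float64)})
    out = reduce_mem_usage(df.copy())
    # Memory should not grow, and values should survive the downcast.
    assert out.memory_usage(deep=True).sum() <= df.memory_usage(deep=True).sum()
    assert np.allclose(out["b"].astype(np.float64), df["b"], atol=1e-3)
```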

Divide model into Model, ModelAPI, Trainer

Description

Currently, the models are called directly from the training method,
so model- and library-specific code is tightly coupled to the common code.
I want to call models through a common interface.

Goal

Model- & library-specific code is not implemented outside src/model.
Create two models implementing Model, Trainer, and ModelAPI.
The model can be changed from the config (without modifying the code!).

Impact

Different models can be handled through the same interface.
Reduced risk of changing the wrong code.

Reference

Use architecture from TIS Kubo.
TIS_Kubo_2016
TIS_Kubo_2018

Additional

Each model consists of Model, Trainer, and ModelAPI.
The Model class implements the model- & library-specific code.
To train: Trainer.train() calls Model.train().
To predict: ModelAPI.predict() calls Model.predict(). A skeleton is sketched below.
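
A skeleton of that structure (interfaces only; bodies are placeholders):

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Holds all model- & library-specific code."""

    @abstractmethod
    def train(self, X, y): ...

    @abstractmethod
    def predict(self, X): ...


class Trainer:
    def __init__(self, model: Model):
        self.model = model

    def train(self, X, y):
        return self.model.train(X, y)


class ModelAPI:
    def __init__(self, model: Model):
        self.model = model

    def predict(self, X):
        return self.model.predict(X)
```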

Tasks

  • Create AbstractClass of ModelAPI, Trainer
  • Implement lightgbm into Model, ModelAPI, Trainer
  • Implement NN into Model, ModelAPI, Trainer
  • Choose model from config file

Small refactoring requests

Description

This is an issue for gathering small refactoring requests to improve productivity.
Add requests here; they will be fixed in one batch.

Note: this issue will accept up to 5 requests.
Each request should be fixable within 20 lines of code.
Unsuitable tasks will not be fixed in this issue.

Goal

To complete all requests in this issue.

Impact

Usability improvement.

Additional

Change output directory of lightgbm oof

Currently, OOF training data is written to the src/ directory, which is not meant for output.
Change the output directory to log/train/.

Gather key path

Key files like .slack_token and .kaggle.json are scattered across the project and system directories.
I want to gather them into one directory to make them manageable.

Remove AbstractTask.py

Currently, class AbstractTask() exists only to force its inheritors to implement output() and run().
luigi.Task already provides this, so AbstractTask is no longer needed.

Tasks

  • Change output directory of lightgbm oof
  • Gather Key path
  • Remove AbstractTask.py

Each script to be executable

Description

Currently, trainer and modelapi are not executable on their own.

Goal

To execute trainer and modelapi on their own.

Impact

Easy to debug
Reduce execution time

Reference

None

Additional

None

Tasks

None

Divide config and slack handler from save_log module

Description

Currently, config is handled by the save_log module, which calls the get_options module.
This structure was inherited from an external repository, but it is odd to handle config files through a log-handler module.
Sample methods that save_log.py contains:

  • get_version() # for reading config
  • get_training_logger() # log
  • send_message() # for slack

I want to remove the config and Slack methods from the save_log module (a sketch of the separation follows the task list).

Goal

Separate the configure and save_log modules into individual files.

Impact

Clearer architecture.

Reference

Current save_log.py

Tasks

  • Divide slack methods
  • Divide config methods
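
A sketch of the separated modules; the method names come from this issue, the bodies are placeholders:

```python
# configure.py -- config only
def get_version():
    """Read VERSION from the config (moved out of save_log)."""


# save_log.py -- logging only
import logging


def get_training_logger(version: str) -> logging.Logger:
    return logging.getLogger(f"train_{version}")


# slack.py -- notification only
def send_message(text: str) -> None:
    """Post to Slack using the token from config (body omitted)."""
```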
