kaggle_ieee-fraud-detection's People

Contributors

dependabot[bot], yota-p

kaggle_ieee-fraud-detection's Issues

Feature script to be independent

Description

Currently, feature scripts are not executable on their own; they have to be called from feature_factory. This makes them hard to debug, so I want to make each script executable by itself.

Goal

The feature scripts raw.py and altgor.py are executable on their own.
Remove the feature-creation step from the main script and add feature loading instead.

Impact

Feature-creation scripts become easy to debug.

Reference

https://github.com/hakubishin3/kaggle_ieee

Additional

  • Move mylog.py and reduce_mem_usage.py into src/utils.
  • Set ROOTDIR from the script itself (not from the config).

Tasks

  • Run raw, nroman, main without modification
  • Modify & run raw
  • Run main with raw modified
  • Modify & run nroman
  • Run main with raw, nroman modified
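
As a rough sketch of the goal, a feature script could carry its own entry point so it runs without feature_factory. This is a minimal illustration, not the project's actual code; the `create_features` helper, the `--debug` flag, and the column used are assumptions:

```python
# Hypothetical standalone feature script (illustrative names throughout).
import argparse
import logging

import numpy as np
import pandas as pd


def create_features(df: pd.DataFrame) -> pd.DataFrame:
    """Example feature step; the real scripts define their own."""
    out = pd.DataFrame(index=df.index)
    out["TransactionAmt_log"] = np.log1p(df["TransactionAmt"])
    return out


if __name__ == "__main__":
    # Standalone entry point: no dependency on feature_factory.
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true", help="use a small subset")
    args = parser.parse_args()

    logging.basicConfig(level=logging.DEBUG if args.debug else logging.INFO)
    logger = logging.getLogger(__name__)

    nrows = 1000 if args.debug else None
    train = pd.read_csv("data/raw/train_transaction.csv", nrows=nrows)
    features = create_features(train)
    logger.info("created %d feature columns", features.shape[1])
```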

Output training log

Description

Currently, the training process is not logged,
so I can't review how training went after a batch run.

Goal

To output a training log.

Impact

Training results can be saved and checked after a batch experiment.

Tasks

  • Replace print() with logger_train()
  • Get training results (evals_result) from the model
  • Add stop_watch() to all methods called from Pipeline()
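
A hedged sketch of the evals_result task above, using the public xgboost API (`evals_result` is a real `xgb.train` argument; the `logger_train` name follows this issue):

```python
import logging

import xgboost as xgb

logger_train = logging.getLogger("train")


def train_with_log(params, dtrain, dval, num_boost_round=100):
    evals_result = {}  # xgboost fills this with per-iteration metric values
    model = xgb.train(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        evals=[(dtrain, "train"), (dval, "valid")],
        evals_result=evals_result,
        verbose_eval=False,
    )
    # Log the final validation scores instead of print()ing them.
    for metric, scores in evals_result["valid"].items():
        logger_train.info("valid %s: %.6f", metric, scores[-1])
    return model, evals_result
```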

Rebuild architecture

Description

The project currently uses a full-process piped architecture: preprocessing, training, and prediction are connected as a pipeline. However, the Pipeline definition requires a lot of modification whenever pipes are switched. I recently noticed it doesn't have to be a luigi pipeline; it can be replaced by a method that judges whether a task is already done. So I want to rebuild the architecture.

Goal

To implement the full architecture introduced in TIS Kubo 2018.

Impact

The whole project structure is affected.

Reference

TIS Kubo 2018

Tasks

  • Implement empty modules and classes from sequence chart and class diagram
  • Migrate modules from previous version, check consistency
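
A minimal sketch of the "judge whether the task is done" idea that could replace the luigi pipeline; the class shape and output-file check are assumptions for illustration:

```python
from pathlib import Path


class Task:
    """Runs only when its declared output does not exist yet."""

    def __init__(self, name: str, output: str):
        self.name = name
        self.output = Path(output)

    def is_done(self) -> bool:
        return self.output.exists()

    def run(self) -> None:
        raise NotImplementedError

    def __call__(self) -> None:
        if self.is_done():
            print(f"skip {self.name}: {self.output} exists")
        else:
            self.run()
```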

Reproduce gold solution

Description

The project architecture has been refined. Now it's time to reproduce the top Kagglers' scores.

Goal

Reproduce the gold solution, i.e. Public LB > 0.946627.

Impact

Provides evidence that this project can reproduce top solutions.

Reference

Chris Deotte - 1st place solution
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111284
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111308
https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600

Additional

  • Fixed the feature-implementation scripts to create their own logger when executed standalone.
  • Fixed the feature-implementation scripts to accept options, such as force calculation and debug mode.
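
A sketch of the option handling described in the second bullet, assuming argparse flags named `--force` and `--debug` and a hypothetical output path:

```python
import argparse
from pathlib import Path


def get_options() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="standalone feature script")
    parser.add_argument("--force", action="store_true",
                        help="recompute even if the cached feature file exists")
    parser.add_argument("--debug", action="store_true",
                        help="run on a small subset of rows")
    return parser.parse_args()


if __name__ == "__main__":
    opt = get_options()
    out = Path("data/feature/myfeature_train.pkl")  # hypothetical output path
    if out.exists() and not opt.force:
        print(f"{out} exists; use --force to recompute")
    else:
        pass  # compute and save the feature here
```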

Run from Make

Description

Currently we run the training script by typing python src/main.py 0000, which is not self-evident (here 0000 indicates a parameter set).
I want to gather the commands in a Makefile.

Goal

Run training by 'make train 0000'.
Run prediction by 'make pred 0000'.

Impact

Easy to run for beginners.
Easy to type.

Additional

Need to separate training and predicting code.

Tasks

  • Add make run to Makefile

Define filename by run

Description

Currently, output filenames are defined in both the config and the run script.
The directory structure (log/, data/, ...) requires filenames to follow fixed rules, so defining them in the config is unnecessary.
This issue defines the naming rules and modifies the configs so they no longer define filenames or directories.

Goal

Define filenames based on RULE below:

RULE
{filetype}_{VERSION}(.small){.extension}

data/feature/name_test(.small).pkl
data/feature/name_train(.small).pkl
data/raw/*.csv
data/model/model_0000(.small).json
data/submission/submission_0000.csv
log/main_0000(.small).log
log/train_0000(.small).tsv

Impact

Config files become slimmer.
Filenames can be changed dynamically in run_{VERSION}.py (see the path-builder sketch after the task list).

Reference

None

Additional

  • utilize the is_latest method
  • output options to the log
  • remove print() from the source
  • remove TODOs from the source
  • remove unneeded lines (unused modules, commented-out code)
  • remove RUN_TRAIN, RUN_PRED from the config
  • Deprecate 0000-0009

Tasks

  • search data input/output
  • Implement rules
    • Create run_0010.py (copy of 0009)
    • Comment out paths from config
    • set filenames at run_0010.py
    • separate output (*.small) from full data
      • move value c.transformer.USE_SMALL_DATA to c.runtime
      • Modify all file input & output methods
    • run 0010, compare output to 0009
  • Add data reading methods to debugging notebook (based on RULE above)
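
A sketch of a path builder implementing the RULE, assuming it would live in run_{VERSION}.py or a small util; the directory mapping follows the listing above:

```python
from pathlib import Path

# Directory per filetype, following the RULE listing above.
DIRS = {
    "model": "data/model",
    "submission": "data/submission",
    "main": "log",
    "train": "log",
}


def build_path(filetype: str, version: str, ext: str, small: bool = False) -> Path:
    """Builds {filetype}_{VERSION}(.small).{extension} in its conventional dir."""
    suffix = ".small" if small else ""
    return Path(DIRS[filetype]) / f"{filetype}_{version}{suffix}.{ext}"


# build_path("model", "0000", "json", small=True) -> data/model/model_0000.small.json
# build_path("main", "0000", "log")               -> log/main_0000.log
```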

Transformer resampling & train num_boost_round not recognized

Description

If the transformer reads the .small data under a --small run, it resamples data that is already sampled.
xgboost.train() does not recognize num_boost_round inside params, so the default value is used.
Feature importance scores are not logged.
Deprecate train_and_validate; validation data should be optional.

Expected behavior

The transformer should skip the sampling step under --small and read the data from file.
Pass num_boost_round to xgboost.train() as a keyword argument (sketched below).
Log feature importance for the full dataset.
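
For the num_boost_round fix, a minimal sketch (this keyword-argument behavior is standard xgboost; the wrapper function is illustrative):

```python
import xgboost as xgb


def train_xgb(params: dict, dtrain: xgb.DMatrix, dval: xgb.DMatrix) -> xgb.Booster:
    # xgboost.train() ignores num_boost_round inside params; it must be
    # passed as a keyword argument, so pop it out of the params dict.
    num_boost_round = params.pop("num_boost_round", 100)
    return xgb.train(params, dtrain,
                     num_boost_round=num_boost_round,
                     evals=[(dval, "valid")])
```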

To reproduce

python run_gbdt.py --small

Impact

To finish the objectives above.

Reference

None

Additional

None

Tasks

None

Directory missing

Description

I cloned the source and tried to run it, but the run failed because the required directories did not exist.

Expected behavior

Check for the directories data/interim, data/raw, data/processed, and data/external before running.
Create them if they don't exist.

To reproduce

  1. Clone master
  2. python src/main.py 0000

Impact

The run can't proceed past downloading the raw data.

Additional

The directory check doesn't need to run automatically at execution time.
Add a directory-creation step to the Makefile; later, make run (from another issue) will depend on it.

Tasks

Add a directory-creation step to the Makefile.
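
A minimal sketch of the directory check, which the Makefile target could invoke; directory names follow this issue:

```python
from pathlib import Path

REQUIRED_DIRS = ["data/interim", "data/raw", "data/processed", "data/external"]

for d in REQUIRED_DIRS:
    Path(d).mkdir(parents=True, exist_ok=True)  # no-op if it already exists
```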

Divide & Pipe preprocess method

Description

Currently, preprocess.py contains all preprocessing steps in one method.
If that method raises an error partway through, every step has to be re-executed from the beginning.
I want the preprocess method to be divided into steps and piped.

Goal

The preprocess method detects and skips finished steps.

Impact

Reduces re-execution time.

Tasks

  • Divide each step in preprocess() into its own method
  • Implement pipeline
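
A sketch of one way to divide and pipe the steps, skipping any step whose output file already exists; the names and pickle caching are assumptions:

```python
from pathlib import Path

import pandas as pd


def run_step(name: str, output: str, func, *inputs) -> pd.DataFrame:
    """Run `func` only if `output` is missing; otherwise load the cached result."""
    out = Path(output)
    if out.exists():
        print(f"skip {name}: found {out}")
        return pd.read_pickle(out)
    result = func(*inputs)
    result.to_pickle(out)
    return result
```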

Organize directory

Description

The current layout keeps data directories at the top level mixed in with the source tree, which makes navigation awkward.

Goal

Finish the tasks below.

Impact

Better access to each file.
Remove unneeded files.

Reference

None

Additional

This is how the directory tree will look
(move src/* to the top level, drop the plural 's' from directory names, keep data under data/*):

data/model *ignore
data/submission *ignore
data/raw *ignore
data/feature *ignore
log/train_0000.tsv
log/main_0000.log
log/importance_0000.json
log/params_0000.json
util/
feature/
config/
run0000.py
...
run9999.py

Tasks

  • Remove make requirements, dir, run*, test, test_environment from Makefile
  • Change print in reduce_mem_usage into logger
  • Check run 0007
  • move transformer to utils, feature_factory to features
  • Change directories as above
    • move dir & grep for adjust & adjust code & run for each:
      • models->data/model
      • data/submission (no need of modification)
      • data/raw (no need of modification)
      • data/processed->data/feature
      • data/interim->delete
      • data/external->delete
      • log/train/0000.tsv->log/train_0000.tsv
      • log/main/0000.log->log/main_0000.log
      • log/importance_0000.json (no files to rewrite)
      • log/params_0000.json (no files to rewrite)
      • src/utils->util/
      • src/features->feature/
      • src/config->config/
      • src/models->model/
      • src/run->run0000.py
      • notebooks->notebook
  • .gitignore data
  • .gitkeep data/DIR
  • remove unnecessary sys.path.insert()
    • Check run 0007
  • Define paths at config as possible
    • data/feature/EACHFEATURE *this needs to be single executable. need to define dir in script
    • data/feature/MERGED
    • data/model *model_name will be determined from model.TYPE
    • data/raw *this needs to be single executable. need to define dir in script
    • data/submission
    • log/main, train
  • Create a submission shell script & run it from make -> couldn't pass variables properly, so decided to leave a memo and submit via the command directly.

Reproduce 1st place solution

Description

Currently my highest score is Private 0.932896, Public 0.958595.
The 1st place solution is 0.967722 public, 0.945884 private.

Goal

To achieve Private score > 0.945884.

  • Catboost (0.963915 public / 0.940826 private)
  • LGBM (0.961748 / 0.938359)
  • XGB (0.960205 / 0.932369)

Impact

Proof of correct model.

Reference

1st Place Solution - Part 1
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111284

1st Place Solution - Part 2
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111308

Very short summary
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111257

IEEE - Notebooks series (LB 0.9480/no blend)
https://www.kaggle.com/c/ieee-fraud-detection/discussion/104142

How to find UIDs
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111510

How the magic works
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111453

XGB Fraud with Magic - [0.9600]
https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600

Kaggle Ensembling Guide
https://mlwave.com/kaggle-ensembling-guide/

Additional

None

Tasks

  • Check reference
  • How to validate -> GroupKFold by month 'DT_M' (see the sketch after this list)
  • Fix diffs from reference kernels
  • Create features following Cdeotte's notebook
  • Run xgboost by batch
  • Run lightgbm by batch
  • Run catboost by batch
  • Blend submissions
  • Divide scripts by objective
  • Output evals_result to file
  • Deprecate callbacks, train_log
  • Hyper parameter tune
  • Use mlflow
  • Confirm Private score > 0.945884.
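
A sketch of the GroupKFold validation referenced above, grouping by the month feature DT_M so whole months are held out; the fold count is an assumption:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold


def month_folds(X: pd.DataFrame, y: pd.Series, n_splits: int = 6):
    """Yield (train_idx, valid_idx) pairs with whole months held out."""
    gkf = GroupKFold(n_splits=n_splits)
    yield from gkf.split(X, y, groups=X["DT_M"])
```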

Simplify experiment

Description

I'm struggling to reproduce past experiments.
The main cause is that the dependent modules & utils change often.

Goal

Discontinue complex class dependencies.
Consolidate the important methods & parameters for each experiment.

Impact

Reproducibility improves.

Reference

None

Additional

None

Tasks

None

Log validation scores at each steps

Description

The current log file doesn't include evaluation scores at each step (a logging-callback sketch follows the task list).

Goal

Evaluation scores are logged in each log file.

Impact

Able to analyze score variation across steps.

Reference

None

Additional

None

Tasks

  • On lgb
    • Create an lgb-specific method on the trainer for creating & getting the train logger
    • Organize model parameters
    • move unneeded messages to the main logger
    • Specify digits for scores in the train logger
    • Output the fold number in the train logger
  • Apply the same changes to xgb
  • Switch to small datasets in debug mode
  • Consistency check for lgb and xgb
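
A hedged sketch of the train-logger callback for lgb, using LightGBM's public callback interface (`env.iteration`, `env.evaluation_result_list`); the logger name and period are assumptions:

```python
import logging

import lightgbm as lgb

logger_train = logging.getLogger("train")


def make_log_callback(fold: int, period: int = 10):
    """Returns a callback that logs fold, step, and fixed-digit scores."""
    def _callback(env):
        if env.iteration % period == 0:
            scores = " ".join(
                f"{data}-{name}:{value:.6f}"
                for data, name, value, _ in env.evaluation_result_list
            )
            logger_train.info("fold=%d step=%d %s", fold, env.iteration, scores)
    return _callback


# usage:
# lgb.train(params, dtrain, valid_sets=[dval],
#           callbacks=[make_log_callback(fold=0)])
```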

Organize model and trainer

Description

GBDT models (LightGBM, XGBoost, CatBoost) expose similar but slightly different APIs.
Let's use them through one common procedure (a factory sketch follows the task list).

Goal

To switch models from the config,
i.e. to finish the tasks below.

Impact

Easy to switch GBDT models.

Reference

None

Additional

None

Tasks

  • Deprecate configure.py
    • Create utils.easydict
    • Create utils.option
    • Apply option at run.py
  • At run:
    • Create modelfactory
    • Create defaultmodel
    • tune
    • Create tunedmodel
    • train
    • predict
  • At script:
    • Create method train at run
    • Create method tune_params at run
  • Create model.ModelFactory
  • Create model.BaseModel
  • Create model.LightGBM_Model
  • Create model.XGBoost_Model (stub class)
  • run 0007
  • Remove LGBMTrainer
  • Remove BaseTrainer
  • Save model as name: model_0000_TYPE.pkl
  • Save feature importance
  • Create model.XGBoost_Model
  • run 0008
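
A sketch of the factory described above; the class names follow this issue's tasks, while the registry and internals are illustrative:

```python
class BaseModel:
    def train(self, X, y):
        raise NotImplementedError

    def predict(self, X):
        raise NotImplementedError


class LightGBM_Model(BaseModel):
    pass  # lightgbm-specific train/predict go here


class XGBoost_Model(BaseModel):
    pass  # xgboost-specific train/predict go here


class ModelFactory:
    _registry = {"lightgbm": LightGBM_Model, "xgboost": XGBoost_Model}

    @classmethod
    def create(cls, model_type: str) -> BaseModel:
        return cls._registry[model_type]()


# model = ModelFactory.create(c.model.TYPE)  # TYPE comes from the config
```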

Neural network model from TensorFlow

Description

Currently the models are gradient-boosted trees only; we should also try other model families, such as neural networks.

Goal

Implement neural network model

Impact

Builds knowledge for future work.

Reference

None

Additional

None

Tasks

None
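
Since no tasks are listed yet, a hedged starting point for the Keras side; the layer sizes and feature handling are placeholders, not tuned for this competition:

```python
import tensorflow as tf


def build_model(n_features: int) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(isFraud)
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model
```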

Small enhancements

Description

Small enhancements

Goal

  • deprecate c.runtime
  • reduce Slack notification
  • use JSON for configs

Impact

None

Reference

None

Additional

None

Tasks

  • copy 0010 as 0012, 0011 as 0013
  • deprecate c.runtime on config
    • deprecate ROOTDIR (use relative path)
    • RANDOM_SEED to args
    • USE_SMALL_DATA to args
    • VERSION from command line args
    • indicate config by command line args
  • separate config.slackauth
  • reduce Slack notification - by optional args for timer
  • FILE, SLACK, STREAM_HANDLER_LEVEL to lowercase
  • use JSON for config
    • create slackauth.json
    • read slackauth.json from run_gbdt.py
    • deprecate np.nan from missing (magic.py, config_0012.py)
    • create config_0012.json, config_0013.json
    • read config_0012.json, config_0013.json from run_gbdt.py
    • deprecate pathlib.Path for general use; keep it only for is_latest
  • confirm consistency
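
A sketch of the JSON-config direction from the bullets above; file paths and flag names are assumptions:

```python
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--config", default="config/config_0012.json")
parser.add_argument("--seed", type=int, default=42)   # RANDOM_SEED to args
parser.add_argument("--small", action="store_true")   # USE_SMALL_DATA to args
args = parser.parse_args()

with open(args.config) as f:
    config = json.load(f)
with open("config/slackauth.json") as f:  # kept separate from the main config
    slackauth = json.load(f)
```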

Create test code

Description

This project doesn't have test code.
It needs a test suite for the essential modules.

Goal

The test suite can be run with make test.

Impact

Keeps code consistent with existing functionality.
Reduce bugs in production code.
Reduce time for bug fixing.

Tasks

  • Identify modules to test
  • Implement make test
  • Implement test cases
    • module1, 2, ...
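
As one example test case, a hedged pytest sketch for reduce_mem_usage; the import path and signature are assumptions that should be checked against the real module:

```python
import numpy as np
import pandas as pd

from util.reduce_mem_usage import reduce_mem_usage  # hypothetical import path


def test_reduce_mem_usage_is_lossless_enough():
    df = pd.DataFrame({"a": np.arange(100, dtype=np.int64),
                       "b": np.random.rand(100).astype(np.float64)})
    out = reduce_mem_usage(df.copy())
    # Memory should not grow, and values should survive the downcast.
    assert out.memory_usage(deep=True).sum() <= df.memory_usage(deep=True).sum()
    assert np.allclose(out["b"].astype(np.float64), df["b"], atol=1e-3)
```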

Divide model into Model, ModelAPI, Trainer

Description

Currently, the models are called directly from the training method,
so model- and library-specific code is tightly coupled to the common code.
I want to call models through a common interface.

Goal

Model- & library-specific code is not implemented outside src/model.
Create two models implementing Model, Trainer, and ModelAPI.
The model can be changed from the config (without modifying the code!).

Impact

Different models can be handled through the same interface.
Reduced risk of changing the wrong code.

Reference

Use architecture from TIS Kubo.
TIS_Kubo_2016
TIS_Kubo_2018

Additional

Each model consists of Model, Trainer, and ModelAPI.
The Model class implements the model- & library-specific code.
To train: Trainer.train() calls Model.train().
To predict: ModelAPI.predict() calls Model.predict(). A skeleton is sketched below.
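
A skeleton of that structure (interfaces only; bodies are placeholders):

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Holds all model- & library-specific code."""

    @abstractmethod
    def train(self, X, y): ...

    @abstractmethod
    def predict(self, X): ...


class Trainer:
    def __init__(self, model: Model):
        self.model = model

    def train(self, X, y):
        return self.model.train(X, y)


class ModelAPI:
    def __init__(self, model: Model):
        self.model = model

    def predict(self, X):
        return self.model.predict(X)
```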

Tasks

  • Create AbstractClass of ModelAPI, Trainer
  • Implement lightgbm into Model, ModelAPI, Trainer
  • Implement NN into Model, ModelAPI, Trainer
  • Choose model from config file

Small refactoring requests

Description

This is an issue for gathering small refactoring requests to improve productivity.
Add requests here; they will be fixed in one batch.

Note: this issue will accept up to 5 requests.
Each request should be fixable within 20 lines of code.
Unsuitable tasks will not be fixed in this issue.

Goal

To complete all requests in this issue.

Impact

Usability improvement.

Additional

Change output directory of lightgbm oof

Currently, OOF training data is written to the src/ directory, which is not meant for output.
Change the output directory to log/train/.

Gather key path

Key files like .slack_token and .kaggle.json are scattered across the project and system directories.
I want to gather them into one directory to make them manageable.

Remove AbstractTask.py

Currently, class AbstractTask() exists only to force its inheritors to implement output() and run().
luigi.Task already provides this, so AbstractTask is no longer needed.

Tasks

  • Change output directory of lightgbm oof
  • Gather Key path
  • Remove AbstractTask.py

Each script to be executable

Description

Currently, trainer and modelapi are not executable on their own.

Goal

To execute trainer and modelapi on their own.

Impact

Easy to debug
Reduce execution time

Reference

None

Additional

None

Tasks

None

Divide config and slack handler from save_log module

Description

Currently, config is handled by the save_log module, which calls the get_options module.
This structure was inherited from an external repository, but it is odd to handle config files through a log-handler module.
Sample methods that save_log.py contains:

  • get_version() # for reading config
  • get_training_logger() # log
  • send_message() # for slack

I want to remove the config and Slack methods from the save_log module (a sketch of the separation follows the task list).

Goal

Separate the configure and save_log modules into individual files.

Impact

Clearer architecture.

Reference

Current save_log.py

Tasks

  • Divide slack methods
  • Divide config methods
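
A sketch of the separated modules; the method names come from this issue, the bodies are placeholders:

```python
# configure.py -- config only
def get_version():
    """Read VERSION from the config (moved out of save_log)."""


# save_log.py -- logging only
import logging


def get_training_logger(version: str) -> logging.Logger:
    return logging.getLogger(f"train_{version}")


# slack.py -- notification only
def send_message(text: str) -> None:
    """Post to Slack using the token from config (body omitted)."""
```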
