- data
- asr: raw ASR CTM files and the processed CSV files derived from them
- fastText: fastText training data and trained model files
- processed:
- Huy: features from Huy
- grammar: features from Chuan
- CSV files prefixed with df_ are consumed by the prep_ml_data_{year}_{task}.ipynb notebooks
- numpy: pickled NumPy arrays for scikit-learn
- refernce_grammar:
- scst1: training and test data sets for the 2017 SCST1 challenge (text task only)
- texttask_trainData: training data sets for the 2018 SCST2 challenge's text task; see the internal README for the purposes of the three files it contains
- result:
- files ending in _tpot.pkl are ML models generated by TPOT
- src: source code
- text_data.py: loads the 2017 and 2018 data sets and generates data/processed/data.pkl
- vecdist_feature_fasttext.py: uses fastText word embeddings to compute cosine-similarity and Word Mover's Distance features
- TODO: map-reduce on pandas rows
- parse_grm_error.py: parses the grammar error counts sent by Chuan
- parse_ctm.py: converts ASR CTM outputs to CSV files containing Id and RecResult columns
- prep_ml_data_{year}_{task}.ipynb: IPython notebooks that prepare data for the ML tasks; output goes to data/processed/numpy
- train_model_v2.py: trains an ML model with default hyperparameters (selected via -t: LR, RF, XGB, or SVC) to make accept/reject predictions
- train_model_hptuned.py: uses hyperopt to tune the ML models' hyperparameters
- TODO: enriching tuning grids
- train_model_tpot.py: uses TPOT's genetic-programming search to train an optimized ML model
- utils.py: utility functions
- train_model.py: uses hyperopt to tune model hyperparameters; currently supports RF, SVC, and XGBoost
- eval_model.py: tunes a probability cutoff for the accept/reject prediction and runs the D-score evaluation
- ml_model.py
- model_sandbox.ipynb
- end_to_end.py
- try_default_SVC.py: trains an SVC with default settings
- try_tuned_SVC.py: the tuned SVC did not improve the D score
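
The CTM-to-CSV conversion done by parse_ctm.py can be sketched as follows. This is a minimal illustration, not the script itself: it assumes the standard CTM layout (`<utt-id> <channel> <start> <dur> <word> [conf]`) and joins each utterance's words in time order into a single RecResult string.

```python
import csv
from collections import defaultdict
from io import StringIO

def ctm_to_rows(ctm_lines):
    """Group CTM word entries into one (Id, RecResult) row per utterance,
    with words joined in start-time order."""
    words = defaultdict(list)
    for line in ctm_lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # skip malformed lines
        utt_id, _channel, start, _dur, word = parts[:5]
        words[utt_id].append((float(start), word))
    rows = []
    for utt_id in sorted(words):
        rec = " ".join(w for _, w in sorted(words[utt_id]))
        rows.append({"Id": utt_id, "RecResult": rec})
    return rows

ctm = [
    "utt1 1 0.00 0.30 hello 0.98",
    "utt1 1 0.35 0.40 world 0.95",
    "utt2 1 0.10 0.25 goodbye 0.90",
]
rows = ctm_to_rows(ctm)

# Write the Id / RecResult columns as CSV (to a string buffer here).
buf = StringIO()
writer = csv.DictWriter(buf, fieldnames=["Id", "RecResult"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```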
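
The cosine-similarity feature computed by vecdist_feature_fasttext.py can be illustrated with averaged word vectors. The toy 3-d embeddings below stand in for a trained fastText model (the real script loads fastText vectors and also computes Word Mover's Distance, which is not shown here):

```python
import numpy as np

# Toy embeddings standing in for fastText vectors; illustrative only.
emb = {
    "good": np.array([1.0, 0.2, 0.0]),
    "great": np.array([0.9, 0.3, 0.1]),
    "bad": np.array([-1.0, 0.1, 0.0]),
}

def sent_vec(tokens):
    """Average the word vectors of in-vocabulary tokens."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cos_sim(a, b):
    """Cosine similarity between two sentence vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

s1 = cos_sim(sent_vec(["good"]), sent_vec(["great"]))
s2 = cos_sim(sent_vec(["good"]), sent_vec(["bad"]))
print(s1, s2)  # the near-synonym pair scores higher than the antonym pair
```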
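
The probability-cutoff tuning in eval_model.py amounts to sweeping candidate thresholds on the model's accept probabilities and keeping the one that maximizes the evaluation metric. A minimal sketch, with plain accuracy standing in for the challenge's D score (the actual metric and grid are assumptions):

```python
import numpy as np

def tune_cutoff(probs, labels, metric, grid=None):
    """Sweep candidate probability cutoffs and return the one maximizing
    metric(labels, preds); metric stands in for the D-score evaluation."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    best_t, best_s = 0.5, -np.inf
    for t in grid:
        preds = (probs >= t).astype(int)  # 1 = accept, 0 = reject
        s = metric(labels, preds)
        if s > best_s:
            best_t, best_s = t, s
    return best_t, best_s

def accuracy(y, p):
    return float(np.mean(y == p))

probs = np.array([0.1, 0.4, 0.6, 0.9])   # model's accept probabilities
labels = np.array([0, 0, 1, 1])          # gold accept/reject labels
cutoff, score = tune_cutoff(probs, labels, accuracy)
```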