pandasang1231 / 522_ramen

Project work for 552

License: MIT License

Python 0.78% Jupyter Notebook 70.71% HTML 28.32% Makefile 0.14% Dockerfile 0.05%

522_ramen's Introduction

Hi there, this is Wenjia Zhu! 👋

Welcome to my GitHub page! My name is Wenjia Zhu, and I'm currently a Master of Data Science student at the University of British Columbia (graduating in June 2022). Before this, I worked as a Data Scientist for 4 years in China at companies such as Alipay and PPdai.

Repositories

Practical projects I have worked on:

Developed data science tools/packages for simplifying routine work; you can access the package on PyPI - Python risk_model_tool
Developed a dashboard to illustrate high-level political status in Canada and generate reports at the BIRS_CIH Hackathon - source code, report

Learning-related projects:

Machine_learning_byHand: UBC CS430M, implementation of traditional ML algorithms from scratch - view
Deep_learning_byHand: implementation of deep learning algorithms from scratch and using PyTorch. Topics include: MLP, CNN, RNN, NLP, Transformer, Attention - view
Leetcode_byHand: Preparation for online coding tests in algorithms and SQL - view

522_ramen's People

Contributors

Watchers

Forkers

anthea98 datallurgy shyan0903

522_ramen's Issues

Rewrite the README

I just rewrote the README. It now contains a lot of text about the report part and the model part, so you may want to refer to it to save time when working on your part of the optimization.

download_data.py does not seem to be functioning

I can't get the script to work in the terminal or through the Makefile with the following inputs:

--url="www.theramenrater.com/wp-content/uploads/2021/09/The-Big-List-All-reviews-up-to-3950.xlsx"
--out_file="data/raw/ramen_ratings.csv"
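
The issue does not include the error message, so the cause is not confirmed. Two things stand out in the inputs, though: the --url value has no scheme (https://), which HTTP clients usually reject, and the source is an .xlsx file while the output is a .csv, so the script has to convert rather than copy. A minimal sketch of handling both follows. The names normalize_url and download_as_csv are hypothetical and not taken from the repo, the https scheme is an assumption about the site, and reading .xlsx with pandas also needs openpyxl installed.

```python
import os
from urllib.parse import urlparse


def normalize_url(url: str) -> str:
    """Prepend https:// when the URL has no scheme (hypothetical helper)."""
    if urlparse(url).scheme:
        return url
    return "https://" + url


def download_as_csv(url: str, out_file: str) -> None:
    """Read a remote Excel sheet and save it as CSV (sketch, not the repo's code)."""
    import pandas as pd  # imported lazily; reading .xlsx also needs openpyxl

    # Create the output folder (e.g. data/raw) if it does not exist yet
    out_dir = os.path.dirname(out_file)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)

    data = pd.read_excel(normalize_url(url))
    data.to_csv(out_file, index=False)


print(normalize_url("www.theramenrater.com/wp-content/uploads/2021/09/The-Big-List-All-reviews-up-to-3950.xlsx"))
```

If the script already does this, the next thing to check would be the exact traceback printed in the terminal.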

EDA plots not showing correctly in the initial_data_exploration file

Looks like the file was uploaded before everything got generated.

Revised the report

In response to the TA's milestone 1 review, added a wrapper feature selector.
Compared four models (logistic regression, SVM, random forest, and CatBoost) and selected CatBoost.
In response to peer review, added a confusion matrix and used precision, recall, and F1 score to compare train and test performance.
In response to peer review, added cross-validation to the code.
In response to peer review, used SHAP to explain the model.
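
To make the train-versus-test comparison concrete, the sketch below computes a binary confusion matrix and the derived precision, recall, and F1 in plain Python. The labels are made up for illustration and are not project results. In the project itself the same numbers would more likely come from scikit-learn's confusion_matrix and precision_recall_fscore_support.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Derive precision, recall and F1 from the confusion counts."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels, only to show the call pattern
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(y_test, y_hat))
print(precision_recall_f1(y_test, y_hat))
```

Running the same two functions on the training predictions and on the test predictions gives the side-by-side comparison the peer review asked for.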

Weekly Update Issue. 2021.11.27

Week1 contribution:

Create a repo (Tuan);
Three files (Irene);
Split data (Anthea);
Conduct EDA (everyone)
Sat 4pm merge our results

Milestone 2 Review

Good job! My concerns are:

Please make sure that the work is distributed evenly across group mates. We cannot judge this, but it is a concern that we have.

"Write a literate code document which presents your analysis and findings": I can't find such a document, can you refer me to it if you have it?

Writing an analysis that uses multiple scripts: Reasoning
This document seems a bit sparse in comparison to the results that you have generated. If you omitted some of the figures, please let us know why in your document.

Writing an analysis that uses multiple scripts: Code Quality
The quality of your scripts could be better. Right now there is only a main method and in that method there are many repeated blocks that could be handled ideally using for loops or inner functions.
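
As an illustration of this suggestion, repeated per-column blocks can be collapsed into one helper called from a loop. The scripts themselves are not shown in this thread, so the summarize_column helper, the column names, and the toy rows below are all hypothetical.

```python
from collections import Counter


def summarize_column(rows, column, top_n=3):
    """Return the top_n most common values of one column (hypothetical helper)."""
    counts = Counter(row[column] for row in rows if row.get(column))
    return counts.most_common(top_n)


def main(rows):
    # One loop replaces a copy-pasted block per column.
    summaries = {}
    for column in ["Brand", "Style", "Country"]:
        summaries[column] = summarize_column(rows, column)
    return summaries


# Toy rows standing in for the ramen data (values are illustrative only)
rows = [
    {"Brand": "A", "Style": "Cup", "Country": "Japan"},
    {"Brand": "A", "Style": "Pack", "Country": "Japan"},
    {"Brand": "B", "Style": "Pack", "Country": "Korea"},
]
print(main(rows))
```

Keeping main short and moving the repeated logic into named functions also makes each step easier to document and test.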

Project organization and documentation expectations: Reasoning
To me, your project's structure seems clear. But I doubt that a person without much machine learning background could infer anything from your project. To fix this, please add more comments on why you are doing different things, like generating plots, doing inference, or anything else. This could be in the docopt docstrings of the files and/or in your README.

Please refer me to the files that I think you're missing and if you have them I'll fix the grade.

train_model.py doesn't function well

It is reported that there is an error while running the code.

python src/train_model.py --train_file="data/process/train_process.csv" --out_file_train="results/best_model.pkl"
/Users/allyson/School/Block-3/522_Ramen/src/train_model.py:69: DtypeWarning: Columns (647) have mixed types.Specify dtype option on import or set low_memory=False.
 main(opt["--train_file"], opt["--out_file_train"])
Traceback (most recent call last):
 File "/Users/allyson/School/Block-3/522_Ramen/src/train_model.py", line 69, in <module>
  main(opt["--train_file"], opt["--out_file_train"])
 File "/Users/allyson/School/Block-3/522_Ramen/src/train_model.py", line 44, in main
  train_data["Stars"] = train_data["Stars"].replace("Unrated", -1).astype(float).apply(handle_target)
 File "/opt/miniconda3/envs/522_Group6/lib/python3.9/site-packages/pandas/core/generic.py", line 5815, in astype
  new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
 File "/opt/miniconda3/envs/522_Group6/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 418, in astype
  return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
 File "/opt/miniconda3/envs/522_Group6/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 327, in apply
  applied = getattr(b, f)(**kwargs)
 File "/opt/miniconda3/envs/522_Group6/lib/python3.9/site-packages/pandas/core/internals/blocks.py", line 591, in astype
  new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
 File "/opt/miniconda3/envs/522_Group6/lib/python3.9/site-packages/pandas/core/dtypes/cast.py", line 1309, in astype_array_safe
  new_values = astype_array(values, dtype, copy=copy)
 File "/opt/miniconda3/envs/522_Group6/lib/python3.9/site-packages/pandas/core/dtypes/cast.py", line 1257, in astype_array
  values = astype_nansafe(values, dtype, copy=copy)
 File "/opt/miniconda3/envs/522_Group6/lib/python3.9/site-packages/pandas/core/dtypes/cast.py", line 1201, in astype_nansafe
  return arr.astype(dtype, copy=True)
ValueError: could not convert string to float: 'NR'
make: *** [results/best_model.pkl] Error 1
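
The last frame shows astype(float) failing on the string 'NR', which the replace("Unrated", -1) step on line 44 does not cover. One possible fix, sketched below on a toy column rather than the real data, is to coerce every non-numeric entry with pandas.to_numeric instead of listing the bad strings one by one. Keeping -1 as the marker for unrated rows is an assumption carried over from the original line; whether handle_target expects that value is not shown in this issue.

```python
import pandas as pd

# Toy column mimicking the mixed values reported in the traceback
stars = pd.Series(["3.5", "Unrated", "NR", "5", 4.25])

# Coerce anything non-numeric ("Unrated", "NR", ...) to NaN instead of raising
numeric = pd.to_numeric(stars, errors="coerce")

# Keep the original convention of marking unrated rows with -1 (assumption)
cleaned = numeric.fillna(-1).astype(float)
print(cleaned.tolist())
```

The DtypeWarning about column 647 is a separate message. As the warning text itself suggests, it can be addressed by passing a dtype for that column or low_memory=False to read_csv.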

Milestone 1 Review

Good job! Here is my feedback for the milestone 1 assessment.

Project proposal: reasoning
You might need to pay more attention to these parts:

"Clearly state the research question and any natural sub-questions you need to address, and their type." In your proposal, have you analyzed different possible situations that might arise when working with textual data? Why do you use logistic regression if you are facing a classification problem? If you are not doing regression and are doing classification, why do you have AUC score? Moreover, details such as the AUC score are not very accessible to a non-technical reader.
What about data visualization? What specifically are you going to do?
For these algorithms, what packages will you use? Have you thought of using wrapper algorithms (boruta algorithm) for feature selection?

Exploratory data analysis in a literate code document: VIZ
Have you looked at the HTML report file that you provided? It does not open on GitHub. First, your report should open on GitHub so that everyone can see it. Second, you don't need to convert it to HTML; that's why it breaks. Please do not convert your notebooks to HTML files again.
Exploratory data analysis in a literate code document: QUALITY

It's nice that you have used the pandas profiling tool, but where is your motivation for the things that you have done? How do you wanna handle the missing values? What did you infer from your analysis? Just plotting the results without any results seems a bit pointless.

Proposal

Make the question clearer: state whether it is classification or regression, and explain the reason for using logistic regression and AUC scores.
What about data visualization? What specifically are you going to do?
-------do not quite understand this point-----------
List the packages we use in the models.
PR: Specify the instructions on how to install and activate the environment needed for doing the whole process.
PR: the explicit command lines to be run in the terminal
------we have a workflow chart and a make file, is it necessary?-----
PR: link to the data source

Report

Delete the HTML report/the EDA report and create an MD file instead.
State the motivation for the EDA
------(original suggestions: where is your motivation for the things that you have done? How do you wanna handle the missing values? What did you infer from your analysis? Just plotting the results without any results seems a bit pointless.)--------
------maybe some explanations about what we found in the profiling reports.
Same writing issues as in Proposal point 1
PR: You identified that your dataset has a class imbalance; how are you handling it?
------we don't conduct a prediction, right?------
PR: Address the size of the test set and your model performance.
PR: grammatical and formatting errors (Fig 3. caption alignment)
PR: Discuss future direction like how you will address the limitations of your research in the future with better data and methods.
PR: Include authors' names.
PR: Elaborate more on the coefficients and state how exactly the important features are impacting the classification.

Models

PR: question about the class imbalance
PR: a confusion matrix for how your model performed on the test data. (mentioned many times)
PR: 5 fold cross validation to assess the model
PR: why drop the top-ten column? (maybe clarify it in EDA)
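
The class-imbalance and 5-fold cross-validation points interact: with an imbalanced target, plain folds can end up with very few positives. A minimal sketch of stratified fold assignment in plain Python follows. The labels are toy values, and the stratified_folds helper is hypothetical. In practice scikit-learn's StratifiedKFold does the same job and also supports shuffling, which this sketch omits.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds=5):
    """Assign each sample index to a fold, spreading every class evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the indices of one class round-robin across the folds
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Imbalanced toy labels: 15 negatives, 5 positives (illustrative only)
labels = [0] * 15 + [1] * 5
for fold in stratified_folds(labels):
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)
```

Each fold serves once as the validation set while the remaining four are used for training, so every fold sees the same class ratio as the full data.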

Scripts

The usage for the EDA script does not match the script itself. It's called generate_EDA_figures.py, but the usage states create_EDA_figures.py
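
A small check like the one below could catch this kind of mismatch for every script. It is only a sketch: the usage string is a hypothetical reconstruction (the option names are invented), and usage_matches is not part of the repo.

```python
import re


def usage_script_names(doc):
    """Extract script file names mentioned in a docopt-style usage string."""
    return set(re.findall(r"([\w./-]+\.py)", doc))


def usage_matches(doc, script_path):
    """True when every .py name in the usage text is the script itself."""
    expected = script_path.replace("\\", "/").rsplit("/", 1)[-1]
    names = {name.rsplit("/", 1)[-1] for name in usage_script_names(doc)}
    return names == {expected}


# Hypothetical docstring reproducing the mismatch described above
doc = "Usage: src/create_EDA_figures.py --train_data=<path> --out_dir=<dir>"
print(usage_matches(doc, "src/generate_EDA_figures.py"))
```

Inside a script, the same function could be called with the module docstring and the script's own path to confirm that the two agree.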