
522_ramen's Introduction

Hi there, this is Wenjia Zhu! 👋


Welcome to my GitHub page! My name is Wenjia Zhu, and I'm currently a Master of Data Science student at the University of British Columbia (graduating in June 2022). Before this, I worked for 4 years in China as a data scientist at companies such as Alipay and PPdai.

GitHub Stats

Wenjia Zhu's GitHub Stats

Repositories

Practical projects I have worked on:

  • Developed data science tools and packages to simplify routine work; the package is available on PyPI - Python risk_model_tool

  • Developed a dashboard to illustrate high-level political status in Canada and generate reports at the BIRS_CIH Hackathon - source code, report


Learning-related projects:

  • Machine_learning_byHand: UBC CS430M, implementation of traditional ML algorithms from scratch - view

  • Deep_learning_byHand: implementation of deep learning algorithms from scratch and using PyTorch. Topics include: MLP, CNN, RNN, NLP, Transformer, Attention - view

  • Leetcode_byHand: Preparation for online coding tests in algorithms and SQL - view

522_ramen's People

Contributors

allysonmstoll, anthea98, datallurgy, pandasang1231, shyan0903


522_ramen's Issues

Rewrite the README

I just rewrote the README. It now contains a lot of text about the report and model parts, so you may want to refer to it to save time when optimizing your own part.

download_data.py does not seem to be functioning

I can't get the script to work in the terminal or in the Makefile with the following inputs:

--url="www.theramenrater.com/wp-content/uploads/2021/09/The-Big-List-All-reviews-up-to-3950.xlsx"
--out_file="data/raw/ramen_ratings.csv"
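
One possible cause, offered only as an assumption because the script itself is not shown in this issue: the URL above has no scheme, and HTTP clients commonly reject scheme-less URLs. Separately, the output path ends in .csv while the source is an .xlsx workbook, so a conversion step may also be needed. A minimal sketch of the scheme check, where `ensure_scheme` is a hypothetical helper name and not part of download_data.py:

```python
from urllib.parse import urlparse


def ensure_scheme(url: str, default: str = "https") -> str:
    """Return url unchanged if it has a scheme, else prepend `default`."""
    # For a scheme-less input such as "www.example.com/x", urlparse leaves
    # .scheme empty and puts everything into .path, so we add the scheme.
    if urlparse(url).scheme:
        return url
    return f"{default}://{url}"


raw = "www.theramenrater.com/wp-content/uploads/2021/09/The-Big-List-All-reviews-up-to-3950.xlsx"
fixed = ensure_scheme(raw)
print(fixed)
```

If this is indeed the cause, passing the URL with an explicit `https://` prefix on the command line would be the simplest workaround.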

Revised the report

  1. Following the TA's milestone 1 review, added a wrapper feature selector.
  2. Tried four models (logistic regression, SVM, random forest, and CatBoost) and picked CatBoost.
  3. Following peer review, added a confusion matrix and used metrics such as precision, recall, and F1 score to compare train and test performance.
  4. Following peer review, added cross-validation (CV) to the code.
  5. Following peer review, used SHAP to explain the model.
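
To make the metrics in point 3 concrete, here is a dependency-free sketch of how precision, recall, and F1 follow from confusion-matrix counts. The function name `binary_metrics` and the labels are made up for illustration; they are not the project's code or results.

```python
from collections import Counter


def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts plus precision, recall, and F1."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    tp, fp, fn, tn = counts["tp"], counts["fp"], counts["fn"], counts["tn"]
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only; not results from the project.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Running the same function on the train and test predictions gives two dictionaries that can be compared side by side. In practice, scikit-learn's `sklearn.metrics` module provides `confusion_matrix`, `precision_score`, `recall_score`, and `f1_score` for the same purpose.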

Weekly Update Issue. 2021.11.27

Week 1 contributions:

  • Create a repo (Tuan);
  • Three files (Irene);
  • Split data (Anthea);
  • Conduct EDA (everyone);
  • Merge our results on Saturday at 4 pm

Milestone 2 Review

Good job! My concerns are:

Please make sure that the work is distributed evenly across group mates. We cannot judge this ourselves, but it is a concern that we have.

"Write a literate code document which presents your analysis and findings": I can't find such a document, can you refer me to it if you have it?

Writing an analysis that uses multiple scripts: Reasoning
This document seems a bit sparse in comparison to the results that you have generated. If you omitted some of the figures, please let us know why in your document.

Writing an analysis that uses multiple scripts: Code Quality
The quality of your scripts could be better. Right now there is only a main method, and in that method there are many repeated blocks that would ideally be handled with for loops or inner functions.
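
As an illustration of this point, repeated per-column blocks can usually be folded into one helper called in a loop. The sketch below is generic: `summarize_column`, the column names, and the values are hypothetical and are not taken from the project's scripts.

```python
def summarize_column(name, values, out_dir="results"):
    """Hypothetical helper: build the output path and summary for one column."""
    return {
        "path": f"{out_dir}/{name}_summary.txt",
        "count": len(values),
        "mean": sum(values) / len(values) if values else None,
    }


# Made-up data for illustration only.
columns = {"stars": [3.5, 4.0, 5.0], "price": [1.0, 2.0]}

# One loop replaces a copy-pasted block per column.
summaries = [summarize_column(name, values) for name, values in columns.items()]
for summary in summaries:
    print(summary)
```

Adding a new column then means adding one dictionary entry rather than another copied block.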

Project organization and documentation expectations: Reasoning
To me, your project's structure seems clear. But I doubt that a person without much machine learning background could infer anything from your project. To fix this, please add more comments on why you are doing different things, like generating plots, doing inference, or anything else. This could be in the docopt docstrings of the files and/or in your README.

Please refer me to the files that I think you're missing and if you have them I'll fix the grade.

train_model.py doesn't function well

It is reported that there is an error while running the code.

python src/train_model.py --train_file="data/process/train_process.csv" --out_file_train="results/best_model.pkl"
/Users/allyson/School/Block-3/522_Ramen/src/train_model.py:69: DtypeWarning: Columns (647) have mixed types.Specify dtype option on import or set low_memory=False.
 main(opt["--train_file"], opt["--out_file_train"])
Traceback (most recent call last):
 File "/Users/allyson/School/Block-3/522_Ramen/src/train_model.py", line 69, in <module>
  main(opt["--train_file"], opt["--out_file_train"])
 File "/Users/allyson/School/Block-3/522_Ramen/src/train_model.py", line 44, in main
  train_data["Stars"] = train_data["Stars"].replace("Unrated", -1).astype(float).apply(handle_target)
 File "/opt/miniconda3/envs/522_Group6/lib/python3.9/site-packages/pandas/core/generic.py", line 5815, in astype
  new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
 File "/opt/miniconda3/envs/522_Group6/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 418, in astype
  return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
 File "/opt/miniconda3/envs/522_Group6/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 327, in apply
  applied = getattr(b, f)(**kwargs)
 File "/opt/miniconda3/envs/522_Group6/lib/python3.9/site-packages/pandas/core/internals/blocks.py", line 591, in astype
  new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
 File "/opt/miniconda3/envs/522_Group6/lib/python3.9/site-packages/pandas/core/dtypes/cast.py", line 1309, in astype_array_safe
  new_values = astype_array(values, dtype, copy=copy)
 File "/opt/miniconda3/envs/522_Group6/lib/python3.9/site-packages/pandas/core/dtypes/cast.py", line 1257, in astype_array
  values = astype_nansafe(values, dtype, copy=copy)
 File "/opt/miniconda3/envs/522_Group6/lib/python3.9/site-packages/pandas/core/dtypes/cast.py", line 1201, in astype_nansafe
  return arr.astype(dtype, copy=True)
ValueError: could not convert string to float: 'NR'
make: *** [results/best_model.pkl] Error 1
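
The traceback shows that line 44 maps "Unrated" to -1 before calling `astype(float)`, but the column also contains the label 'NR', which cannot be converted. A minimal sketch of a more tolerant conversion follows; `parse_stars` is a hypothetical helper, and treating 'NR' the same way as "Unrated" is an assumption about the intended behavior.

```python
def parse_stars(value, unrated=-1.0):
    """Convert a raw Stars cell to float, mapping non-numeric labels to `unrated`."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # Covers labels such as "Unrated" and "NR", as well as None.
        return unrated


# Made-up sample cells for illustration only.
raw_stars = ["3.5", "Unrated", "NR", 5, "4.25"]
cleaned = [parse_stars(v) for v in raw_stars]
print(cleaned)
```

In the script this could be applied with `train_data["Stars"].apply(parse_stars)` before `handle_target`. Alternatively, `pd.to_numeric(train_data["Stars"], errors="coerce")` turns every non-numeric label into NaN, which would then need to be filled or dropped.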

Milestone 1 Review

Good job! Here is my feedback for the milestone 1 assessment.

  1. Project proposal: reasoning
    You might need to pay more attention to these parts:
  • "Clearly state the research question and any natural sub-questions you need to address, and their type." In your proposal, have you analyzed the different situations that might arise when working with textual data? Why do you use logistic regression if you are facing a classification problem? If you are doing classification rather than regression, why do you have an AUC score? Moreover, details such as the AUC score are not very accessible to a not-so-technical person.
  • What about data visualization? What specifically are you going to do?
  • For these algorithms, what packages will you use? Have you thought of using wrapper algorithms (such as the Boruta algorithm) for feature selection?
  2. Exploratory data analysis in a literate code document: VIZ
    Have you looked into the HTML report file that you have provided? It does not open on GitHub. First, your report should open on GitHub so that everyone can see it. Second, you don't need to convert it to HTML. That's why it breaks. Please do not convert your notebooks to HTML files again.

  3. Exploratory data analysis in a literate code document: QUALITY

  • It's nice that you have used the pandas profiling tool, but where is your motivation for the things that you have done? How do you wanna handle the missing values? What did you infer from your analysis? Just plotting the results without any results seems a bit pointless.

Finish the data splitting

Finish the data splitting with the train_test_split function locally and update the files for further EDA

Summary of TA's review and peer review

This is a summary of the places we need to improve!

Proposal

  1. Make the question clearer: state whether it is classification or regression, and explain the reason for using logistic regression and AUC scores.
  2. What about data visualization? What specifically are you going to do?
    -------do not quite understand this point-----------
  3. State the packages we use in the models.
  4. PR: Specify the instructions on how to install and activate the environment needed for doing the whole process.
  5. PR: the explicit command lines to be run in the terminal
    ------we have a workflow chart and a make file, is it necessary?-----
  6. PR: link to the data source

Report

  1. Delete the HTML report/the EDA report and need to create an MD file instead.
  2. State the motivation for the EDA
    ------(original suggestions: where is your motivation for the things that you have done? How do you wanna handle the missing values? What did you infer from your analysis? Just plotting the results without any results seems a bit pointless.)--------
    ------maybe some explanations about what we found in the profiling reports.
  3. Same writing questions as Proposal point 1
  4. PR: The dataset was identified as having a class imbalance; how are you handling it?
    ------we don't make predictions, right?------
  5. PR: Address the size of the test set and your model performance.
  6. PR: grammatical and formatting errors (Fig 3. caption alignment)
  7. PR: Discuss future direction like how you will address the limitations of your research in the future with better data and methods.
  8. PR: Include authors' names.
  9. PR: Elaborate more on the coefficients and state how exactly the important features are impacting the classification.

Models

  1. PR: question about the class imbalance
  2. PR: a confusion matrix for how your model performed on the test data. (mentioned many times)
  3. PR: 5 fold cross validation to assess the model
  4. PR: why drop the top-ten column? (maybe clarify it in EDA)

Scripts

  1. The usage for the EDA script does not match the script itself. It's called generate_EDA_figures.py, but the usage states create_EDA_figures.py

Need tests!
