
stat447-pokemoneggsteps's Introduction

Stat 447 - Team 2 Final Submission

Hello, and welcome to the codebase for Team 2's Stat 447 submission.

If you would like to run the codebase from start to finish, the files are numbered in run order, from 01a through 03b. Our initial processing and exploration of the dataset is in Python (a .py file and some Jupyter notebooks); after that we use R exclusively. If you do not wish to run all of these files, we have included the intermediate data created by each file, so the required inputs for every file are already present even if the preceding files have not been run.

This analysis is based on the dataset found at https://www.kaggle.com/rounakbanik/pokemon. The dataset has several missing values, and we also needed a column to distinguish Pokemon that can hatch from eggs from those that cannot, so we manually entered the missing values and the hatching information using the data at https://www.serebii.net/pokedex-sm/. Because of this preprocessing, the first csv file is NOT the raw dataset from Kaggle, but rather our manually updated version of that dataset, without missing values and with an extra column indicating which Pokemon can hatch: pokemon_no_missing_with_can_hatch.csv

No file outside ArchivedCode takes more than 5 minutes to run, although some take several minutes.

The files, in the order in which they should be run, are:

01a-ProcessTypes.py : Given the manually preprocessed starting dataset pokemon_no_missing_with_can_hatch.csv (which represents type as two columns, each drawing on the same 18 type names), this file makes a new dataset which represents the type(s) of each Pokemon with 19 binary columns (one column for each type, and a 19th column to indicate whether the Pokemon has only one type).

01b-ExploratoryAnalysis-Part1.ipynb : Given the manually processed Pokemon dataset, performs data cleaning, feature conversion, and data splitting into train and validation sets, and writes each of these two sets to a new .csv file.

01c-ExploratoryAnalysis-Part2.ipynb : Given the cleaned Pokemon training dataset from Exploratory Analysis Pt. 1, plots explanatory variables against the response (base_egg_steps) to visualize relationships between them. Also performs a variety of other exploratory analyses.

01d-ValidTrainSets.R : Given cleaned train and validation Pokemon datasets in .csv format, extracts them, further cleans them, and converts some features to ordered factors. Saves a new RData object with these two modified datasets.

02a-Utils.R : Creates a new RData object containing utility functions needed for the rest of the project, such as functions for computing prediction intervals and assigning each prediction interval a loss value.

02b-MultinLogitModel.R : Given cleaned train and holdout Pokemon datasets from Code File 01d, fits multinomial logistic regression models to the data and selects variables that result in the best performance. Saves the best model and its selected variables, along with functions to be used when fitting the model on new data and making predictions. These functions are later used for cross validation.

02c-RandomForestModel.R : Given cleaned train and holdout Pokemon datasets from Code File 01d, fits random forest models to the data and selects variables that result in the best performance. Saves the best model and its selected variables, along with functions to be used when fitting the model on new data and making predictions. These functions are later used for cross validation.

02d-OrdLogitModel.R : Given cleaned train and holdout Pokemon datasets from Code File 01d, fits proportional odds logistic regression to the data and selects variables that result in the best performance. Saves the best model and its selected variables, along with functions to be used when fitting the model on new data and making predictions. These functions are later used for cross validation.

02e-DecisionTreeModel.R : Given cleaned train and holdout Pokemon datasets from Code File 01d, uses a greedy decision tree algorithm to fit a model to the data, selecting hyperparameters using the holdout set so that the model performs reasonably well. We also create functions to be used when fitting the model (with the selected hyperparameters) on new data and making predictions. These functions are later used for cross validation.

03a-CrossValidationTools.R : Creates the functions needed to run cross validation.

03b-ComputeCV.R : Runs cross validation on all models for a variety of metrics.
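
As an illustration of the type encoding described for 01a-ProcessTypes.py, here is a minimal sketch in plain Python. It is not the code in that file: the function name encode_types, the is_ column prefix, and the single_type column name are assumptions made for this example, and only the standard library is used.

```python
# The 18 Pokemon types. The output column names below are illustrative
# assumptions, not necessarily the names used in 01a-ProcessTypes.py.
TYPES = [
    "bug", "dark", "dragon", "electric", "fairy", "fighting", "fire",
    "flying", "ghost", "grass", "ground", "ice", "normal", "poison",
    "psychic", "rock", "steel", "water",
]


def encode_types(type1, type2):
    """Return 19 binary indicators: one per type plus a single-type flag."""
    # Collect the non-empty type strings; a missing second type is skipped.
    present = {t.strip().lower() for t in (type1, type2) if t}
    unknown = present - set(TYPES)
    if unknown:
        raise ValueError(f"unrecognized type(s): {sorted(unknown)}")
    row = {f"is_{t}": int(t in present) for t in TYPES}
    # 19th column: 1 if the Pokemon has exactly one type, else 0.
    row["single_type"] = int(len(present) == 1)
    return row


bulbasaur = encode_types("grass", "poison")  # dual-type example
charmander = encode_types("fire", "")        # single-type example
```

Each call returns one row of the encoded dataset, so applying it to every row of the starting csv would yield the 19 binary columns.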

In ArchivedCode, you can find code which was used as a part of our process when working on the project, but which is not necessary for the final report. This includes 02f-KNNmodel.R, which fits a KNN model (which we found to have unacceptably low performance), and 01e-SummarizeTrain.R, which reports several univariate and bivariate statistics for the training set but takes a long time to run and is not an essential part of our process.
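
To show how the fit and predict functions saved by files 02b through 02e can plug into the cross validation of 03a and 03b, here is a rough k-fold sketch. The project's own implementation is in R; this Python version, including the names k_fold_cv, fit_majority, predict_majority, and accuracy and the toy data, is hypothetical and only illustrates the pattern of passing model-specific functions into a generic cross-validation routine.

```python
import random


def k_fold_cv(rows, fit, predict, metric, k=5, seed=0):
    """Average `metric` over k folds using caller-supplied fit/predict functions."""
    if k > len(rows):
        raise ValueError("k cannot exceed the number of rows")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [rows[i] for i in indices if i not in held]
        test = [rows[i] for i in held_out]
        model = fit(train)                    # refit on the training folds only
        predictions = predict(model, test)    # predict the held-out fold
        scores.append(metric(predictions, test))
    return sum(scores) / k


# Toy stand-ins for a real model: always predict the most common training label.
def fit_majority(train):
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)


def predict_majority(model, test):
    return [model for _ in test]


def accuracy(predictions, test):
    return sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)


# Hypothetical data: (feature, label) pairs, not taken from the Pokemon dataset.
toy_rows = [(x, "short" if x < 7 else "long") for x in range(10)]
score = k_fold_cv(toy_rows, fit_majority, predict_majority, accuracy, k=5)
```

Because the routine only sees fit, predict, and metric as arguments, the same loop can score every model and every metric without model-specific code.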

Thank you!
