
TAS Predictor: Project Overview

An app allowing you to predict the best possible time for a game speedrun.

  • Created a web app that estimates the best possible time of a speedrun (MAE ~ 388 seconds).
  • Extracted 3800+ game runs from Speedrun.com using its API and Python.
  • Scraped 2000+ game runs from TASvideos using Beautiful Soup and Python.
  • Engineered features from each game's world-record time, number of runners, and release year, using Python, LibreOffice Calc/Excel, and Streamlit.
  • Optimized linear, Lasso, Ridge, Random Forest, and Gradient Boosting regressors using GridSearchCV to find the best model.
  • Built a web app using Streamlit.

Run it

Open in Streamlit

Locally

  • Download the project and extract it
  • Open a terminal with Python 3.9 available
  • Navigate into the folder you extracted
  • Install the requirements: pip install -r requirements.txt
  • Then: streamlit run src/app/main.py

Code and resources used

Python version 3.9

Packages: pandas, numpy, sklearn, matplotlib, seaborn, Beautiful Soup, flask, streamlit, joblib

Libraries: Python library for the speedrun.com API

Scraper Repository: https://github.com/kaparker/tutorials/blob/master/pythonscraper/websitescrapefasttrack.py

Scraper Article: https://blog.lesjeudis.com/web-scraping-avec-python (in French)

Flask Productionization: https://www.analyticsvidhya.com/blog/2020/04/how-to-deploy-machine-learning-model-flask/

Data collection

Web scraping

Tweaked the web scraper above to scrape 2000+ games from tasvideos.org, collecting the following for each game (a hedged scraping sketch follows the list):

  • Time of the best TAS
  • Emulator the TAS was made with
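
A minimal sketch of that scraping step, assuming a hypothetical listing URL and table layout (the selectors and column order are assumptions, not tasvideos.org's real markup):

```python
# Hedged sketch: the URL and column order are illustrative assumptions,
# not the exact markup of tasvideos.org.
import requests
from bs4 import BeautifulSoup

LISTING_URL = "https://tasvideos.org/Movies"  # hypothetical listing page

html = requests.get(LISTING_URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

records = []
for row in soup.find_all("tr"):  # assumes one table row per published TAS
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) >= 3:
        records.append({
            "game": cells[0],      # game name (assumed column order)
            "tas_time": cells[1],  # time of the best TAS
            "emulator": cells[2],  # emulator the TAS was made with
        })

print(f"Scraped {len(records)} TAS entries")
```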

Speedrun.com API

Used the Python library for the speedrun.com API (above) to extract 3800+ runs with the following fields (an API sketch follows the list):

  • game name
  • category
  • platform
  • engine the game was made with
  • developers
  • publishers
  • release year
  • number of runners for each game
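
The extraction above goes through the wrapper library; as a hedged alternative, the same fields can be pulled straight from speedrun.com's public REST API (v1 field names, worth double-checking against the docs):

```python
# Hedged sketch: queries the public speedrun.com REST API v1 directly,
# embedding related resources so each game comes back with its metadata.
import requests

API = "https://www.speedrun.com/api/v1"

resp = requests.get(
    f"{API}/games",
    params={"max": 200, "embed": "platforms,developers,publishers,engines"},
    timeout=30,
)
resp.raise_for_status()

games = []
for game in resp.json()["data"]:
    games.append({
        "game_name": game["names"]["international"],
        "release_year": game.get("released"),
        "platforms": [p["name"] for p in game["platforms"]["data"]],
        "engines": [e["name"] for e in game["engines"]["data"]],
        "developers": [d["name"] for d in game["developers"]["data"]],
        "publishers": [p["name"] for p in game["publishers"]["data"]],
    })

print(f"Fetched {len(games)} games")
```

Categories and runner counts need follow-up calls per game (e.g. the /games/{id}/categories endpoint and each category's leaderboard), which is why the full extraction of 3800+ runs takes a while.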

Data cleaning

After scraping the data, I needed to clean it up so that it was usable for the model. I made the following changes and created the following variables (a pandas sketch follows the list):

  • put the game and category in the same column
  • calculated the age of each game from its release year
  • removed runs with a WR better than the collected TAS time
  • created a variable to control outliers: time_difference
    • removed outliers, i.e. runs with time_difference > 20
  • binned age and number of runners for each game into categorical variables
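
A minimal pandas sketch of these steps. Column names, the reference year, and the exact time_difference formula are assumptions based on the description above, not the repo's actual schema:

```python
# Hedged cleaning sketch; all column names are hypothetical.
import pandas as pd

df = pd.read_csv("data/runs.csv")  # hypothetical merged dataset

# Put the game and category in the same column.
df["game_category"] = df["game_name"] + " - " + df["category"]

# Calculate the age of each game from its release year.
df["age"] = 2021 - df["release_year"]

# Remove runs whose world record is already better than the collected
# TAS time (such pairs are almost certainly mismatched).
df = df[df["wr_time"] >= df["tas_time"]]

# Outlier control: assume time_difference is the WR/TAS ratio and drop
# runs where the WR is more than 20x the TAS time.
df["time_difference"] = df["wr_time"] / df["tas_time"]
df = df[df["time_difference"] <= 20]

# Bin age and number of runners into categorical variables.
df["age_bin"] = pd.cut(df["age"], bins=[0, 10, 20, 30, 60])
df["runners_bin"] = pd.cut(df["num_runners"], bins=[0, 10, 100, 1000, 10**6])
```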

EDA

EDA revealed that the number of runners is a key feature for predicting the best possible time (see the graph in the repository).
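
A quick seaborn sketch of that relationship, continuing from the hypothetical dataframe above:

```python
# Hedged EDA sketch: world-record time against number of runners,
# the feature the EDA flagged as most predictive.
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(data=df, x="num_runners", y="wr_time")
plt.xscale("log")  # runner counts are heavy-tailed
plt.xlabel("Number of runners")
plt.ylabel("World record time (s)")
plt.title("World record time vs. number of runners")
plt.show()
```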

Model building

First, I transformed the categorical variables into dummy variables. I also split the data into train and test sets with a test size of 30%, a commonly recommended split for datasets of this size.
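
In code, that preprocessing might look like this (a sketch reusing the hypothetical columns from the cleaning step):

```python
# Sketch: dummy-encode the categoricals, then hold out 30% for testing.
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.get_dummies(df.drop(columns=["tas_time"]))  # dummy variables
y = df["tas_time"]                                 # target: best possible time

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # 30% test set
)
```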

I tried different models (a hedged search sketch follows the list):

  • Simple linear regression
  • Multiple linear regression
  • Multiple linear regression with significant features only (p-values < 0.2 in the linear regressions)
  • Lasso Regression
  • Ridge Regression
  • Random Forest Regressor
  • Gradient Boosting Regressor
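
A hedged sketch of the search over those models; the parameter grids are illustrative, not the values tuned in the repo:

```python
# Loop candidate models through GridSearchCV and keep the one with the
# best cross-validated MAE. Grids are illustrative only.
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV

candidates = [
    (LinearRegression(), {}),
    (Lasso(), {"alpha": [0.01, 0.1, 1.0]}),
    (Ridge(), {"alpha": [0.01, 0.1, 1.0]}),
    (RandomForestRegressor(random_state=42), {"n_estimators": [100, 300]}),
    (GradientBoostingRegressor(random_state=42),
     {"n_estimators": [100, 300], "learning_rate": [0.05, 0.1]}),
]

best = None
for model, grid in candidates:
    search = GridSearchCV(model, grid, scoring="neg_mean_absolute_error", cv=5)
    search.fit(X_train, y_train)
    if best is None or search.best_score_ > best.best_score_:
        best = search

print(best.best_estimator_, "CV MAE:", -best.best_score_)
```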

Model performance

The best model was the Gradient Boosting Regressor, with an MAE of ~388 seconds on the test set.
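
Continuing the sketch above, the test-set MAE is read off like this:

```python
# Evaluate the winning model on the held-out test set.
from sklearn.metrics import mean_absolute_error

pred = best.best_estimator_.predict(X_test)
print(f"Test MAE: {mean_absolute_error(y_test, pred):.0f} seconds")
```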

Productionization

In this step, I built a Flask API web app prototype on a local web server, following the Flask productionization reference above. The prototype lets the user type the name of a game and choose a category, then returns a link to the world record, its time, and the estimated best possible time.
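
A minimal Flask sketch of that prototype; the route, model path, and look_up_run helper are hypothetical stand-ins, not the repo's actual code:

```python
# Hedged Flask sketch: route name, model path, and look_up_run are
# hypothetical; the real prototype follows the reference article above.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("models/model.joblib")  # hypothetical path


def look_up_run(game, category):
    """Hypothetical helper: fetch the WR link/time plus model-ready features."""
    raise NotImplementedError


@app.route("/predict")
def predict():
    game = request.args.get("game")
    category = request.args.get("category")
    wr_link, wr_time, features = look_up_run(game, category)
    return jsonify({
        "world_record_link": wr_link,
        "world_record_time": wr_time,
        "estimated_best_time": float(model.predict(features)[0]),
    })


if __name__ == "__main__":
    app.run(debug=True)  # local web server only
```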

I decided to build the actual app using Streamlit (see the app prototype screenshot in the repository).
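
The Streamlit version boils down to a few widgets. A hedged sketch of the flow in src/app/main.py; the labels, category options, and build_features helper are assumptions:

```python
# Hedged Streamlit sketch; widget labels, category options, and
# build_features are illustrative assumptions.
import joblib
import streamlit as st

st.title("TAS Predictor")

game = st.text_input("Game name")
category = st.selectbox("Category", ["Any%", "100%", "Low%"])  # illustrative


def build_features(game, category):
    """Hypothetical helper: turn user input into a model-ready feature row."""
    raise NotImplementedError


if st.button("Estimate best possible time"):
    model = joblib.load("models/model.joblib")  # hypothetical path
    estimate = float(model.predict(build_features(game, category))[0])
    st.write(f"Estimated best possible time: {estimate:.0f} seconds")
```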

The online version is coming soon.


owrestimator's Issues

'main_platform' can't be used in the model pipeline

Summary

In model_pipeline.py, we try to build an ML pipeline using all the features in our dataset. The dataset contains categorical features such as 'main_platform'. We use pd.get_dummies to transform this feature into dummy variables. However, some values of 'main_platform' present in the training set are not in the test set, resulting in an error.

Steps to reproduce

Download the repo and run model_pipeline.py.

Expected behavior

We should get an MAE score printed in the terminal at the very end. Instead, we get an error.

Actual behavior

We get the following error:

  File "C:\Users\charl\miniconda3\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'main_platform'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\charl\miniconda3\lib\site-packages\sklearn\utils\__init__.py", line 396, in _get_column_indices
    col_idx = all_columns.get_loc(col)
  File "C:\Users\charl\miniconda3\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'main_platform'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\charl\OneDrive\Documents\Projects\OWREstimator\src\model_pipeline.py", line 80, in <module>
    model_pipeline.fit(X_train,y_train)
  File "C:\Users\charl\miniconda3\lib\site-packages\sklearn\pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\charl\miniconda3\lib\site-packages\sklearn\pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\charl\miniconda3\lib\site-packages\joblib\memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\charl\miniconda3\lib\site-packages\sklearn\pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\charl\miniconda3\lib\site-packages\sklearn\compose\_column_transformer.py", line 505, in fit_transform
    self._validate_remainder(X)
  File "C:\Users\charl\miniconda3\lib\site-packages\sklearn\compose\_column_transformer.py", line 332, in _validate_remainder
    cols.extend(_get_column_indices(X, columns))
  File "C:\Users\charl\miniconda3\lib\site-packages\sklearn\utils\__init__.py", line 403, in _get_column_indices
    raise ValueError(
ValueError: A given column is not a column of the dataframe
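
One plausible fix (a hedged sketch, not necessarily how the repo will resolve it): encode the categoricals inside the pipeline with OneHotEncoder(handle_unknown="ignore") instead of pre-applying pd.get_dummies, so 'main_platform' is still a raw column when the ColumnTransformer looks it up and unseen test-set categories are ignored rather than raising:

```python
# Hedged sketch: one-hot encode inside the pipeline so 'main_platform'
# still exists when the ColumnTransformer runs, and unseen categories
# in the test set are ignored instead of raising an error.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ["main_platform"]  # plus any other categorical columns

preprocessor = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    remainder="passthrough",
)

model_pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", GradientBoostingRegressor()),
])

# model_pipeline.fit(X_train, y_train)  # X_train keeps raw categorical columns
```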
