
TAS Predictor: Project Overview

An app allowing you to predict the best possible time for a game speedrun.

  • Created a web app that estimates the best possible time of a speedrun (MAE ~ 388 seconds).
  • Extracted 3800+ game runs from Speedrun.com using its API and Python.
  • Scraped 2000+ game runs from TASvideos using Beautiful Soup and Python.
  • Engineered features from each game's world-record time, number of runners, and release year, using Python, LibreOffice Calc/Excel, and Streamlit.
  • Optimized linear, Lasso, Ridge, Random Forest, and Gradient Boosting regressors using GridSearchCV to find the best model.
  • Built a web app using Streamlit.

Run it

Open in Streamlit

Locally

  • Download the project and extract it
  • Open a terminal with Python 3.9 available
  • Navigate into the folder you extracted
  • Install the requirements: pip install -r requirements.txt
  • Then: streamlit run src/app/main.py

Code and resources used

Python version 3.9

Packages: pandas, numpy, sklearn, matplotlib, seaborn, Beautiful Soup, flask, streamlit, joblib

Libraries: Python library for the speedrun.com API

Scraper Repository: https://github.com/kaparker/tutorials/blob/master/pythonscraper/websitescrapefasttrack.py

Scraper Article: https://blog.lesjeudis.com/web-scraping-avec-python (in French)

Flask Productionization: https://www.analyticsvidhya.com/blog/2020/04/how-to-deploy-machine-learning-model-flask/

Data collection

Web scraping

Tweaked the web scraper above to scrape 2000+ games from tasvideos.org, collecting the following for each game (a hedged scraping sketch follows the list):

  • Time of the best TAS
  • Emulator the TAS was made with
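
A minimal sketch of that scraping step, assuming a hypothetical listing URL and table layout (the selectors and column order are assumptions, not tasvideos.org's real markup):

```python
# Hedged sketch: the URL and column order are illustrative assumptions,
# not the exact markup of tasvideos.org.
import requests
from bs4 import BeautifulSoup

LISTING_URL = "https://tasvideos.org/Movies"  # hypothetical listing page

html = requests.get(LISTING_URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

records = []
for row in soup.find_all("tr"):  # assumes one table row per published TAS
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) >= 3:
        records.append({
            "game": cells[0],      # game name (assumed column order)
            "tas_time": cells[1],  # time of the best TAS
            "emulator": cells[2],  # emulator the TAS was made with
        })

print(f"Scraped {len(records)} TAS entries")
```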

Speedrun.com API

Used the Python library for the speedrun.com API (above) to extract 3800+ runs with the following fields (an API sketch follows the list):

  • game name
  • category
  • platform
  • engine the game was made with
  • developers
  • publishers
  • release year
  • number of runners for each game
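
The extraction above goes through the wrapper library; as a hedged alternative, the same fields can be pulled straight from speedrun.com's public REST API (v1 field names, worth double-checking against the docs):

```python
# Hedged sketch: queries the public speedrun.com REST API v1 directly,
# embedding related resources so each game comes back with its metadata.
import requests

API = "https://www.speedrun.com/api/v1"

resp = requests.get(
    f"{API}/games",
    params={"max": 200, "embed": "platforms,developers,publishers,engines"},
    timeout=30,
)
resp.raise_for_status()

games = []
for game in resp.json()["data"]:
    games.append({
        "game_name": game["names"]["international"],
        "release_year": game.get("released"),
        "platforms": [p["name"] for p in game["platforms"]["data"]],
        "engines": [e["name"] for e in game["engines"]["data"]],
        "developers": [d["name"] for d in game["developers"]["data"]],
        "publishers": [p["name"] for p in game["publishers"]["data"]],
    })

print(f"Fetched {len(games)} games")
```

Categories and runner counts need follow-up calls per game (e.g. the /games/{id}/categories endpoint and each category's leaderboard), which is why the full extraction of 3800+ runs takes a while.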

Data cleaning

After scraping the data, I needed to clean it up so that it was usable for the model. I made the following changes and created the following variables (a pandas sketch follows the list):

  • put the game and category in the same column
  • calculated the age of each game from its release year
  • removed runs with a WR better than the collected TAS time
  • created a variable to control outliers: time_difference
    • removed outliers, i.e. runs with time_difference > 20
  • binned age and number of runners for each game into categorical variables
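
A minimal pandas sketch of these steps. Column names, the reference year, and the exact time_difference formula are assumptions based on the description above, not the repo's actual schema:

```python
# Hedged cleaning sketch; all column names are hypothetical.
import pandas as pd

df = pd.read_csv("data/runs.csv")  # hypothetical merged dataset

# Put the game and category in the same column.
df["game_category"] = df["game_name"] + " - " + df["category"]

# Calculate the age of each game from its release year.
df["age"] = 2021 - df["release_year"]

# Remove runs whose world record is already better than the collected
# TAS time (such pairs are almost certainly mismatched).
df = df[df["wr_time"] >= df["tas_time"]]

# Outlier control: assume time_difference is the WR/TAS ratio and drop
# runs where the WR is more than 20x the TAS time.
df["time_difference"] = df["wr_time"] / df["tas_time"]
df = df[df["time_difference"] <= 20]

# Bin age and number of runners into categorical variables.
df["age_bin"] = pd.cut(df["age"], bins=[0, 10, 20, 30, 60])
df["runners_bin"] = pd.cut(df["num_runners"], bins=[0, 10, 100, 1000, 10**6])
```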

EDA

EDA revealed that the number of runners is a key feature for predicting the best possible time (see the graph in the repository).
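
A quick seaborn sketch of that relationship, continuing from the hypothetical dataframe above:

```python
# Hedged EDA sketch: world-record time against number of runners,
# the feature the EDA flagged as most predictive.
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(data=df, x="num_runners", y="wr_time")
plt.xscale("log")  # runner counts are heavy-tailed
plt.xlabel("Number of runners")
plt.ylabel("World record time (s)")
plt.title("World record time vs. number of runners")
plt.show()
```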

Model building

First, I transformed the categorical variables into dummy variables. I also split the data into train and test sets with a test size of 30%, a commonly recommended split for datasets of this size.
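
In code, that preprocessing might look like this (a sketch reusing the hypothetical columns from the cleaning step):

```python
# Sketch: dummy-encode the categoricals, then hold out 30% for testing.
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.get_dummies(df.drop(columns=["tas_time"]))  # dummy variables
y = df["tas_time"]                                 # target: best possible time

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # 30% test set
)
```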

I tried different models (a hedged search sketch follows the list):

  • Simple linear regression
  • Multiple linear regression
  • Multiple linear regression with significant features only (p-values < 0.2 in the linear regressions)
  • Lasso Regression
  • Ridge Regression
  • Random Forest Regressor
  • Gradient Boosting Regressor
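
A hedged sketch of the search over those models; the parameter grids are illustrative, not the values tuned in the repo:

```python
# Loop candidate models through GridSearchCV and keep the one with the
# best cross-validated MAE. Grids are illustrative only.
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV

candidates = [
    (LinearRegression(), {}),
    (Lasso(), {"alpha": [0.01, 0.1, 1.0]}),
    (Ridge(), {"alpha": [0.01, 0.1, 1.0]}),
    (RandomForestRegressor(random_state=42), {"n_estimators": [100, 300]}),
    (GradientBoostingRegressor(random_state=42),
     {"n_estimators": [100, 300], "learning_rate": [0.05, 0.1]}),
]

best = None
for model, grid in candidates:
    search = GridSearchCV(model, grid, scoring="neg_mean_absolute_error", cv=5)
    search.fit(X_train, y_train)
    if best is None or search.best_score_ > best.best_score_:
        best = search

print(best.best_estimator_, "CV MAE:", -best.best_score_)
```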

Model performance

The best model was the Gradient Boosting Regressor, with an MAE of ~388 seconds on the test set.
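
Continuing the sketch above, the test-set MAE is read off like this:

```python
# Evaluate the winning model on the held-out test set.
from sklearn.metrics import mean_absolute_error

pred = best.best_estimator_.predict(X_test)
print(f"Test MAE: {mean_absolute_error(y_test, pred):.0f} seconds")
```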

Productionization

In this step, I built a Flask API web app prototype on a local web server, following the Flask productionization reference above. The prototype lets the user type the name of a game and choose a category, then returns a link to the world record, its time, and the estimated best possible time.
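
A minimal Flask sketch of that prototype; the route, model path, and look_up_run helper are hypothetical stand-ins, not the repo's actual code:

```python
# Hedged Flask sketch: route name, model path, and look_up_run are
# hypothetical; the real prototype follows the reference article above.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("models/model.joblib")  # hypothetical path


def look_up_run(game, category):
    """Hypothetical helper: fetch the WR link/time plus model-ready features."""
    raise NotImplementedError


@app.route("/predict")
def predict():
    game = request.args.get("game")
    category = request.args.get("category")
    wr_link, wr_time, features = look_up_run(game, category)
    return jsonify({
        "world_record_link": wr_link,
        "world_record_time": wr_time,
        "estimated_best_time": float(model.predict(features)[0]),
    })


if __name__ == "__main__":
    app.run(debug=True)  # local web server only
```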

I decided to build the actual app using Streamlit (see the app prototype screenshot in the repository).
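
The Streamlit version boils down to a few widgets. A hedged sketch of the flow in src/app/main.py; the labels, category options, and build_features helper are assumptions:

```python
# Hedged Streamlit sketch; widget labels, category options, and
# build_features are illustrative assumptions.
import joblib
import streamlit as st

st.title("TAS Predictor")

game = st.text_input("Game name")
category = st.selectbox("Category", ["Any%", "100%", "Low%"])  # illustrative


def build_features(game, category):
    """Hypothetical helper: turn user input into a model-ready feature row."""
    raise NotImplementedError


if st.button("Estimate best possible time"):
    model = joblib.load("models/model.joblib")  # hypothetical path
    estimate = float(model.predict(build_features(game, category))[0])
    st.write(f"Estimated best possible time: {estimate:.0f} seconds")
```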

The online version is coming soon.


owrestimator's Issues

'main_platform' can't be used in the model pipeline

Summary

In model_pipeline.py, we try to build an ML pipeline using all the features in our dataset. The dataset contains categorical features such as 'main_platform'. We use pd.get_dummies to transform this feature into dummy variables. However, some values of 'main_platform' present in the training set are not in the test set, resulting in an error.

Steps to reproduce

Download the repo and run model_pipeline.py.

Expected behavior

We should get an MAE score printed in the terminal at the very end. Instead, we get an error.

Actual behavior

We get the following error:

  File "C:\Users\charl\miniconda3\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'main_platform'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\charl\miniconda3\lib\site-packages\sklearn\utils\__init__.py", line 396, in _get_column_indices
    col_idx = all_columns.get_loc(col)
  File "C:\Users\charl\miniconda3\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'main_platform'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\charl\OneDrive\Documents\Projects\OWREstimator\src\model_pipeline.py", line 80, in <module>
    model_pipeline.fit(X_train,y_train)
  File "C:\Users\charl\miniconda3\lib\site-packages\sklearn\pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\charl\miniconda3\lib\site-packages\sklearn\pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\charl\miniconda3\lib\site-packages\joblib\memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\charl\miniconda3\lib\site-packages\sklearn\pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\charl\miniconda3\lib\site-packages\sklearn\compose\_column_transformer.py", line 505, in fit_transform
    self._validate_remainder(X)
  File "C:\Users\charl\miniconda3\lib\site-packages\sklearn\compose\_column_transformer.py", line 332, in _validate_remainder
    cols.extend(_get_column_indices(X, columns))
  File "C:\Users\charl\miniconda3\lib\site-packages\sklearn\utils\__init__.py", line 403, in _get_column_indices
    raise ValueError(
ValueError: A given column is not a column of the dataframe
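
One plausible fix (a hedged sketch, not necessarily how the repo will resolve it): encode the categoricals inside the pipeline with OneHotEncoder(handle_unknown="ignore") instead of pre-applying pd.get_dummies, so 'main_platform' is still a raw column when the ColumnTransformer looks it up and unseen test-set categories are ignored rather than raising:

```python
# Hedged sketch: one-hot encode inside the pipeline so 'main_platform'
# still exists when the ColumnTransformer runs, and unseen categories
# in the test set are ignored instead of raising an error.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ["main_platform"]  # plus any other categorical columns

preprocessor = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    remainder="passthrough",
)

model_pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", GradientBoostingRegressor()),
])

# model_pipeline.fit(X_train, y_train)  # X_train keeps raw categorical columns
```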
