christopherjenness / nba-prediction

Predict scores of NBA games using regularized matrix completion

Python 78.37% R 21.63%

nba nba-prediction nba-games matrix-completion

nba-prediction's Introduction

NBA-prediction

Predicts scores of NBA games using matrix completion

The Model

For a given NBA game, if you could accurately predict each team's offensive rating (points per 100 possessions) and the pace of the game (possessions per game), you could estimate the final score of the game.
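As a minimal sketch of that arithmetic (the function name and the input numbers are made up for illustration, not taken from the project), a score estimate is just the offensive rating scaled by the number of possessions:

```python
def predict_points(offensive_rating: float, pace: float) -> float:
    """Estimate points scored in a game.

    offensive_rating: points per 100 possessions
    pace: possessions per game
    """
    return offensive_rating * pace / 100.0


# Hypothetical ratings and pace, for illustration only
team_a_points = predict_points(offensive_rating=110.0, pace=98.0)
team_b_points = predict_points(offensive_rating=104.5, pace=98.0)
print(round(team_a_points, 1), round(team_b_points, 1))
```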

Predicting a team's offensive rating against another team is tricky. It depends on how good the offensive team is at scoring and how good the defending team is at defending. Most importantly though, it depends on the specific matchups between the two teams. This is reminiscent of recommendation systems, where the recommendation depends on the type of user, the type of product, and the affinity between the two. Furthermore, for a given season only some offensive ratings between teams are available (those for teams that have already played each other). The strategy in this model is to use matrix completion techniques to estimate the unseen offensive ratings. These are then combined with pace estimates to predict final scores.

Matrix completion

Here, we look at two methods for matrix completion: Maximum Margin Matrix Factorization (MMMF) and Singular Value Decomposition (SVD).

Hastie, Trevor, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC Press, 2015.

Maximum Margin Matrix Factorization (MMMF)

The objective of MMMF is to approximate an m x n matrix Z by a matrix M that factors as

$$M = A B^T$$

where A is an m x r matrix and B is an n x r matrix. Effectively, this puts a rank constraint r on the approximation M.

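This can be estimated by solving the following

$$\min_{A, B} \; \lVert P_\Omega(Z) - P_\Omega(A B^T) \rVert_F^2 + \lambda \left( \lVert A \rVert_F^2 + \lVert B \rVert_F^2 \right)$$

where P_Omega keeps only the entries indexed by Omega, the set of known values in Z, so that only known values are taken into consideration. Any unknown value is treated as zero.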
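One common way to attack a regularized factorization of this kind is alternating least squares: fix B and solve a small ridge regression for each row of A over the known entries, then swap roles. The sketch below is an illustration under that assumption, not the code used in this project; `mmmf_als`, its parameters, and the toy matrix are all hypothetical.

```python
import numpy as np


def mmmf_als(Z, mask, rank=2, lam=1.0, n_iters=50, seed=0):
    """Fit Z ~ A @ B.T on the known entries (mask == True) by alternating
    ridge regressions, penalizing lam * (||A||_F^2 + ||B||_F^2)."""
    rng = np.random.default_rng(seed)
    m, n = Z.shape
    A = rng.normal(scale=0.1, size=(m, rank))
    B = rng.normal(scale=0.1, size=(n, rank))
    ridge = lam * np.eye(rank)
    for _ in range(n_iters):
        # Update each row of A using only the columns observed in that row
        for i in range(m):
            cols = mask[i]
            Bi = B[cols]
            A[i] = np.linalg.solve(Bi.T @ Bi + ridge, Bi.T @ Z[i, cols])
        # Update each row of B using only the rows observed in that column
        for j in range(n):
            rows = mask[:, j]
            Aj = A[rows]
            B[j] = np.linalg.solve(Aj.T @ Aj + ridge, Aj.T @ Z[rows, j])
    return A, B


# Toy rank-1 matrix with two entries hidden (illustrative data only)
Z = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
mask = np.ones(Z.shape, dtype=bool)
mask[0, 2] = False
mask[3, 0] = False
A, B = mmmf_als(Z, mask, rank=1, lam=1e-3, n_iters=200)
M = A @ B.T  # completed matrix, including estimates for the hidden entries
```

Because each half-step is convex but the joint problem is not, the result can depend on the random initialization, which is the kind of behavior the non-convexity concern refers to.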

While intuitive, this approach has two problems. First, it is a two-dimensional family of models indexed by r (the rank of the factorization) and lambda (the magnitude of regularization), which requires a lot of tuning. Second, the optimization problem is non-convex, and in practice it did not find global minima when used to predict NBA offensive ratings. Because of this, we turned to SVD.

Singular Value Decomposition Using Nuclear Norm

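SVD, not explained here, can be used to provide a rank-q approximation of a matrix Z by constraining the rank of the approximation M. This amounts to the following optimization

$$\min_{\operatorname{rank}(M) \le q} \; \lVert Z - M \rVert_F^2$$

If values are missing from Z, you can instead require M to reproduce the known values exactly while imputing the unknown ones

$$\min_{M} \; \operatorname{rank}(M) \quad \text{subject to} \quad m_{ij} = z_{ij} \;\; \text{for all } (i, j) \in \Omega$$

where Omega is the set of known values. However, this problem is NP-hard and also leads to overfitting, since the known values are required to be predicted exactly. Instead, you can simultaneously predict unknown values and approximate known values by solving the following optimization

$$\min_{\operatorname{rank}(M) \le q} \; \sum_{(i, j) \in \Omega} (z_{ij} - m_{ij})^2$$

Like MMMF, this problem is non-convex. However, it can be relaxed to the following convex optimization problem

$$\min_{M} \; \frac{1}{2} \sum_{(i, j) \in \Omega} (z_{ij} - m_{ij})^2 + \lambda \lVert M \rVert_*$$

where ||M||_* is the nuclear norm of M (the sum of its singular values). This algorithm, called soft-impute, is studied extensively in: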

Mazumder, Rahul, Trevor Hastie, and Robert Tibshirani. "Spectral regularization algorithms for learning large incomplete matrices." Journal of machine learning research 11.Aug (2010): 2287-2322.
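For intuition, the core soft-impute iteration can be sketched in a few lines of NumPy: fill the unknown entries with the current estimate, take an SVD, shrink every singular value by lambda (clipping at zero), and repeat. This project calls the R softImpute package rather than anything like the code below; the function name, defaults, and toy data here are assumptions for illustration.

```python
import numpy as np


def soft_impute(Z, mask, lam, n_iters=100, tol=1e-6):
    """Nuclear-norm regularized completion by iterated soft-thresholded SVD.

    Z: matrix whose unknown entries may hold any value (e.g. NaN)
    mask: boolean array, True where Z is known
    lam: amount subtracted from each singular value
    """
    Z_known = np.where(mask, Z, 0.0)
    M = np.zeros(Z.shape, dtype=float)
    for _ in range(n_iters):
        # Known entries come from Z, unknown entries from the current estimate
        filled = np.where(mask, Z_known, M)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s_shrunk = np.maximum(s - lam, 0.0)  # soft-threshold singular values
        M_new = (U * s_shrunk) @ Vt
        converged = np.linalg.norm(M_new - M) <= tol * max(np.linalg.norm(M), 1.0)
        M = M_new
        if converged:
            break
    return M


# Toy matrix with one unknown entry (illustrative data only)
Z_toy = np.array([[100.0, 105.0, np.nan],
                  [98.0, 110.0, 102.0],
                  [95.0, 101.0, 99.0]])
M_toy = soft_impute(Z_toy, ~np.isnan(Z_toy), lam=5.0)
```

Larger values of lambda push more singular values to zero, so lambda controls the effective rank of the completed matrix without a separate rank parameter.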

Example Code

To make predictions, use the following code:

>> model = NBAModel(update=True)
>> model.get_scores('PHO', 'WAS')
PHO WAS
92.9092883132 97.1806398788

which predicts the Suns will lose to the Wizards 93-97.

Note that scraping all the data required to run the algorithm is slow. This only needs to be done the first time. For subsequent models, you can use update=False to use the cached data.

Model Tuning and Test Error

The optimization strategy above is parameterized by lambda, the extent of regularization. Using a validation set (10% of the sample), we determined 25 to be the optimal value of lambda.

Using lambda = 25 on a held-out test set, our model estimates a team's final score with an MSE of 6.7. Not bad.
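A validation search of this kind can be sketched as follows: hide a fraction of the known entries, complete the matrix for each candidate lambda, and keep the lambda with the lowest error on the hidden entries. This is an illustrative outline, not the project's tuning code; `choose_lambda` and the `complete(Z, mask, lam)` callable it expects are hypothetical names.

```python
import numpy as np


def choose_lambda(complete, Z, mask, lambdas, holdout_frac=0.1, seed=0):
    """Pick the lambda with the lowest MSE on held-out known entries.

    complete: callable (Z, train_mask, lam) -> completed matrix (assumed API)
    """
    rng = np.random.default_rng(seed)
    known = np.argwhere(mask)
    n_val = max(1, int(round(holdout_frac * len(known))))
    picked = known[rng.choice(len(known), size=n_val, replace=False)]
    rows, cols = picked[:, 0], picked[:, 1]

    # Hide the validation entries from the completion routine
    train_mask = mask.copy()
    train_mask[rows, cols] = False

    errors = {}
    for lam in lambdas:
        M = complete(Z, train_mask, lam)
        errors[lam] = float(np.mean((Z[rows, cols] - M[rows, cols]) ** 2))
    best = min(errors, key=errors.get)
    return best, errors
```

Any completion routine with that call shape can be passed in as `complete`, so the same search works for different matrix completion methods.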

nba-prediction's Issues

KeyError when Update=True

Hi, I'm testing the Python 3 code and I get this error every time, but only if I set NBAModel's update to True. I hope you can help me:

**** http://www.basketball-reference.com/leagues/NBA_2019_games-october.html
**** http://www.basketball-reference.com/leagues/NBA_2019_games-november.html
**** http://www.basketball-reference.com/leagues/NBA_2019_games-december.html
    Unnamed: 0_level_0 Basic Box Score Stats            ...                    
              Starters                    MP   FG  FGA  ...  TOV   PF  PTS  +/-
0           Al Horford                 15:44    3    4  ...    1    0    7   +8
1         Kyrie Irving                 14:50    0    8  ...    1    0    0   +9
2         Jayson Tatum                 15:35    5   10  ...    0    1   13   +2
3         Jaylen Brown                 15:44    2    9  ...    1    2    5   +8
4       Gordon Hayward                 11:24    2    6  ...    0    0    4   +7
5             Reserves                    MP   FG  FGA  ...  TOV   PF  PTS  +/-
6         Terry Rozier                 12:40    3    5  ...    0    0    6   +3
7         Marcus Smart                 14:29    2    3  ...    1    1    7   -5
8        Marcus Morris                 11:18    2    4  ...    1    3    5   -4
9          Aron Baynes                  8:16    0    3  ...    1    1    0   -3
10        Daniel Theis                   NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN
11        Semi Ojeleye                   NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN
12  Guerschon Yabusele                   NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN
13      Brad Wanamaker                   NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN
14         Team Totals                   240   42   97  ...   14   20  105  NaN

[15 rows x 21 columns]

Traceback (most recent call last):
  File "venv\lib\site-packages\pandas\core\indexes\base.py", line 2889, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Kyrie Irving'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "NBA-prediction/model/model.py", line 228, in <module>
    model = NBAModel(update=True)
  File "NBA-prediction/model/model.py", line 52, in __init__
    self.df_pace, self.df_OR = self.make_matrices()
  File "NBA-prediction/model/model.py", line 179, in make_matrices
    df_pace, df_OR = self.full_update(url, df_pace, df_OR)
  File "NBA-prediction/model/model.py", line 163, in full_update
    df_pace = self.update_df(df_pace, team1, team2, pace)
  File "NBA-prediction/model/model.py", line 114, in update_df
    old_value = df[team2].loc[team1]
  File "venv\lib\site-packages\pandas\core\frame.py", line 2899, in __getitem__
    indexer = self.columns.get_loc(key)
  File "venv\lib\site-packages\pandas\core\indexes\base.py", line 2891, in get_loc
    raise KeyError(key) from err
KeyError: 'Kyrie Irving'

Process finished with exit code 1

Add unit testing

Some errors

Hi, I have some errors and I don't understand them.
I think they may be linked to my softImpute installation; I'm not sure how to install it.

Note: Also requires softImpute R package install.packages('softImpute')

This never generates the predictions.csv file, even though I have set update to True.

Can you help me, please?
Thanks for your work, it's pretty good 💯

('****', 'http://www.basketball-reference.com/leagues/NBA_2017_games-october.html')
/home/lucas/.local/lib/python2.7/site-packages/pandas/core/indexing.py:117: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
Traceback (most recent call last):
  File "model.py", line 223, in <module>
    model = NBAModel(update=True)
  File "model.py", line 51, in __init__
    self.soft_impute()
  File "model.py", line 189, in soft_impute
    subprocess.check_output(['Rscript', './model/predict_soft_impute.R'])
  File "/usr/lib/python2.7/subprocess.py", line 216, in check_output
    process = Popen(stdout=PIPE, *popenargs, **kwargs)
  File "/usr/lib/python2.7/subprocess.py", line 394, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1047, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

I have also tried with Python 3 and the Python3 project branch, and I get this:

**** http://www.basketball-reference.com/leagues/NBA_2017_games-october.html
**** http://www.basketball-reference.com/leagues/NBA_2017_games-november.html
**** http://www.basketball-reference.com/leagues/NBA_2017_games-december.html
  Unnamed: 0_level_0 Unnamed: 1_level_0  ... Four Factors Unnamed: 6_level_0
  Unnamed: 0_level_1               Pace  ...       FT/FGA               ORtg
0                NYK               99.9  ...        0.172               88.1
1                CLE               99.9  ...        0.149              117.1

[2 rows x 7 columns]
Traceback (most recent call last):
  File "/home/lucas/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2656, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "model.py", line 228, in <module>
    model = NBAModel(update=True)
  File "model.py", line 52, in __init__
    self.df_pace, self.df_OR = self.make_matrices()
  File "model.py", line 179, in make_matrices
    df_pace, df_OR = self.full_update(url, df_pace, df_OR)
  File "model.py", line 162, in full_update
    team1, team2, team1_OR, team2_OR, pace = self.extract_data(table)
  File "model.py", line 140, in extract_data
    team1 = table.loc[2][0]
  File "/home/lucas/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1500, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/home/lucas/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 1913, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/home/lucas/.local/lib/python3.6/site-packages/pandas/core/indexing.py", line 141, in _get_label
    return self.obj._xs(label, axis=axis)
  File "/home/lucas/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 3585, in xs
    loc = self.index.get_loc(key)
  File "/home/lucas/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2658, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 2

TypeError: a bytes-like object is required, not 'str'

Hello,

I get this error message when running the code:

stat_html = html.replace('<!--', "")
TypeError: a bytes-like object is required, not 'str'

Can someone help me?

Best regards
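This error is what Python 3 raises when `replace` is called on a `bytes` object with `str` arguments. Assuming `html` holds the raw bytes of a downloaded page (an assumption, since the line that assigns it is not shown here), decoding it first is one way around the problem. The `raw` value below is a stand-in for the downloaded content:

```python
# Stand-in for the bytes returned when the page is downloaded (illustrative)
raw = b"<div><!-- <table id='stats'></table> --></div>"

# Decode to str so that str arguments can be passed to replace()
html = raw.decode("utf-8")
stat_html = html.replace("<!--", "").replace("-->", "")
print(stat_html)
```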

Can't run model without predictions.csv

Hi! I'm trying to use your model - very cool work - but can't initialize the object without a predictions.csv file. Is that something I need to manually pull before running?

pytest - 404 error

...Needs resolving.

https://travis-ci.org/christopherjenness/NBA-prediction/builds/267961639

great!! so interesting..

just for praise, i am learning tensorflow now.

Updates to basketball-reference layout

nevermind

Historical Performance

Hello

This looks like a really cool project and a novel approach to this problem. Do you have any results to report on how accurate this method is for historical data?

FileNotFoundError: [WinError 2]

Traceback (most recent call last):
  File "E:\PyProjects\basketball\NBA-prediction\model\model.py", line 228, in <module>
    model = NBAModel(update=True)
  File "E:\PyProjects\basketball\NBA-prediction\model\model.py", line 54, in __init__
    self.soft_impute()
  File "E:\PyProjects\basketball\NBA-prediction\model\model.py", line 194, in soft_impute
    subprocess.check_output(['Rscript', 'model/predict_soft_impute.R'])
  File "C:\Users\Пользователь\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "C:\Users\Пользователь\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 501, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\Users\Пользователь\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 969, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\Пользователь\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 1438, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2]

I can't figure out what the problem is with subprocess.check_output(['Rscript', 'model/predict_soft_impute.R'])
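One common cause of a "file not found" error raised from inside `subprocess` is that the executable itself (here `Rscript`) cannot be found on the PATH, rather than the script passed to it. That may or may not be the problem in this case. A small check along these lines can tell the two apart; the `run_r_script` wrapper is hypothetical and not part of the project:

```python
import shutil
import subprocess


def run_r_script(script_path, rscript="Rscript"):
    """Run an R script, failing early with a clear message if the
    Rscript executable is not discoverable on the PATH."""
    exe = shutil.which(rscript)
    if exe is None:
        raise RuntimeError(
            f"{rscript!r} was not found on PATH; install R or add its "
            "bin directory to PATH"
        )
    return subprocess.check_output([exe, script_path])
```

If `shutil.which("Rscript")` returns `None`, the fix is on the system side (installing R or adjusting PATH), not in model.py.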

Calculation used to update pace and OR dataframe values

Regarding the following line in update_df:

new_value = (float(old_value) + float(value)) / 2

Am I understanding this correctly that this calculation means the most-recent games are more heavily weighted in the resulting matrices? Team A can have an average offensive rating of 90 vs Team B over 6 games. If Team A then has an offensive rating of 110 against Team B in their 7th meeting, the matrix will now have the value (90+110) / 2 = 100 even though their actual season average vs Team B is ~93.

Is this intentional (or am I missing something)? I'd assume a true average would be more preferable than the above case. I'm not sure if there's a simple way to track the number of updates to each value to achieve this.
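To make the difference concrete, here is a small sketch of a running mean that tracks the number of games alongside the value. `update_mean` is a hypothetical helper, not something in model.py, and it assumes a per-matchup count is stored next to each matrix entry:

```python
def update_mean(old_mean, count, value):
    """Fold one new observation into a running mean.

    Returns the new mean and the new observation count.
    """
    new_count = count + 1
    new_mean = old_mean + (value - old_mean) / new_count
    return new_mean, new_count


# Six games averaging 90, then a seventh game at 110
true_mean, games = update_mean(90.0, 6, 110.0)

# What the pairwise update in update_df produces for the same input
pairwise = (90.0 + 110.0) / 2

print(round(true_mean, 2), games, pairwise)
```

The pairwise update behaves like an exponentially weighted average in which the newest game always gets half of the weight, whereas the running mean weights all games equally.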

predictions.csv is missing

I tried running your code but it seems that predictions.csv is missing.

https://github.com/christopherjenness/NBA-prediction/blob/master/model.py#L187

Would you mind including this file?