
nba_data's Introduction

Dataset of NBA play-by-play data and shot details from the 1996/97 to 2023/24 seasons

Update 2023-07-11

  • Added playoff play-by-play and shotdetail data for all seasons
  • Added folder loading with the data collection scripts
  • Added folder build_dataset with the scripts that build the dataset files
  • Added season 2022/23 to the Kaggle dataset and Google Drive archive

Description

The dataset contains play-by-play data from three sources (stats.nba.com, data.nba.com and pbpstats.com), as well as shot details. Data is available from season 1996/97 for stats.nba.com and shot details, from season 2000/01 for pbpstats.com, and from season 2016/17 for data.nba.com.

The data was collected with the scripts located in the loading folder. For more information about loading data, see the README file in that folder.

A detailed description of the data fields can be found in the file description_fields.md.

Useful links:

Ryan Davis - Analyze the Play by Play Data

Python nba_api package for working with the NBA API - https://github.com/swar/nba_api

R hoopR package for working with the NBA API - https://hoopr.sportsdataverse.org/

Motivation

I made this dataset because I want to simplify and speed up work with play-by-play data, so that researchers spend their time studying the data, not collecting it. Because of the request limits on the NBA website, and because each request returns the play-by-play of only one game, collecting this data is a very slow process.

Using this dataset, you can reduce the time to get information about one season from a few hours to a couple of seconds and spend more time analyzing data or building models.

I also added play-by-play information from other sources: pbpstats.com (which includes possession time and how each possession started) and data.nba.com (which includes coordinates of actions on the court). This data enriches the information about the progress of each game and hopefully opens up opportunities to do interesting things.

Download

You can download the dataset in several ways:

Clone the git repository to your device

git clone https://github.com/shufinskiy/nba_data.git
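After cloning, each archive in the repository is a .tar.xz file containing a single CSV with the same basename (for example, an archive named nbastats_2022.tar.xz holds nbastats_2022.csv). The sketch below shows how to unpack and read one such archive; since it has to run standalone, it first builds a tiny stand-in archive with the same layout rather than using a real file from the repository.

```python
import csv
import io
import tarfile
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

# Build a stand-in archive with the repository's layout:
# nbastats_2022.tar.xz containing a single file nbastats_2022.csv
# (the contents here are illustrative, not real data)
csv_bytes = b"GAME_ID,EVENTNUM,PCTIMESTRING\n22200001,1,12:00\n"
archive = workdir / "nbastats_2022.tar.xz"
with tarfile.open(archive, mode="w:xz") as tar:
    info = tarfile.TarInfo(name="nbastats_2022.csv")
    info.size = len(csv_bytes)
    tar.addfile(info, io.BytesIO(csv_bytes))

# Extract the CSV next to the archive and read it
with tarfile.open(archive) as tar:
    tar.extract("nbastats_2022.csv", workdir)

with (workdir / "nbastats_2022.csv").open(newline="") as f:
    rows = list(csv.DictReader(f))

print(rows[0]["PCTIMESTRING"])  # -> 12:00
```

For real analysis you would more likely read the extracted CSV with pandas or data.table, but the archive layout is the same.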

Download using R or Python

You can write your own loading functions or use the ones I wrote for R and Python.

R:

load_nba_data <- function(path = getwd(),
                          seasons = seq(1996, 2023),
                          data = c("datanba", "nbastats", "pbpstats", "shotdetail", "cdnnba", "nbastatsv3"),
                          seasontype = 'rg',
                          untar = FALSE){

  # Build the archive names to download: "<data>_<season>" for the
  # regular season, "<data>_po_<season>" for the playoffs, or both
  if(seasontype == 'rg'){
    df <- expand.grid(data, seasons)
    need_data <- paste(df$Var1, df$Var2, sep = "_")
  } else if(seasontype == 'po'){
    df <- expand.grid(data, 'po', seasons)
    need_data <- paste(df$Var1, df$Var2, df$Var3, sep = "_")
  } else {
    df_rg <- expand.grid(data, seasons)
    df_po <- expand.grid(data, 'po', seasons)
    need_data <- c(paste(df_rg$Var1, df_rg$Var2, sep = "_"), paste(df_po$Var1, df_po$Var2, df_po$Var3, sep = "_"))
  }

  # list_data.txt maps each archive name to its download URL ("name=url" per line)
  temp <- tempfile()
  download.file("https://raw.githubusercontent.com/shufinskiy/nba_data/main/list_data.txt", temp)
  f <- readLines(temp)
  unlink(temp)
  
  v <- unlist(strsplit(f, "="))
  
  name_v <- v[seq(1, length(v), 2)]
  element_v <- v[seq(2, length(v), 2)]
  
  # Keep only the requested archives
  need_name <- name_v[which(name_v %in% need_data)]
  need_element <- element_v[which(name_v %in% need_data)]

  if(!dir.exists(path)){
    dir.create(path)
  }
  
  # Download each archive and optionally extract the CSV it contains
  for(i in seq_along(need_element)){
    destfile <- paste0(path, '/', need_name[i], ".tar.xz")
    download.file(need_element[i], destfile = destfile)
    if(untar){
      untar(destfile, paste0(need_name[i], ".csv"), exdir = path)
      unlink(destfile)
    }
  }
}

Python:

To download data in Python, you can use the nba-on-court package, or the function below:

from pathlib import Path
from itertools import product
import urllib.request
import tarfile
from typing import Union, Sequence

def load_nba_data(path: Union[Path, str] = Path.cwd(),
                  seasons: Union[Sequence, int] = range(1996, 2024),
                  data: Union[Sequence, str] = ("datanba", "nbastats", "pbpstats",
                                                "shotdetail", "cdnnba", "nbastatsv3"),
                  seasontype: str = 'rg',
                  untar: bool = False) -> None:
    """
    Load the NBA play-by-play dataset from the github repository https://github.com/shufinskiy/nba_data

    Args:
        path (Union[Path, str]): directory where the downloaded files should be saved
        seasons (Union[Sequence, int]): sequence of season start years, or a single year
        data (Union[Sequence, str]): sequence or string of data types to load
        seasontype (str): part of season: 'rg' - Regular Season, 'po' - Playoffs, anything else - both
        untar (bool): whether to extract the CSV from each downloaded archive

    Returns:
        None
    """
    if isinstance(path, str):
        path = Path(path)
    if isinstance(seasons, int):
        seasons = (seasons,)
    if isinstance(data, str):
        data = (data,)

    # Archive names are "<data>_<season>" for the regular season
    # and "<data>_po_<season>" for the playoffs
    if seasontype == 'rg':
        need_data = tuple("_".join((d, str(season))) for (d, season) in product(data, seasons))
    elif seasontype == 'po':
        need_data = tuple("_".join((d, 'po', str(season))) for (d, season) in product(data, seasons))
    else:
        need_data = tuple("_".join((d, str(season))) for (d, season) in product(data, seasons)) \
            + tuple("_".join((d, 'po', str(season))) for (d, season) in product(data, seasons))

    # list_data.txt maps each archive name to its download URL ("name=url" per line)
    with urllib.request.urlopen("https://raw.githubusercontent.com/shufinskiy/nba_data/main/list_data.txt") as f:
        v = f.read().decode('utf-8').strip()

    name_v = [line.split("=")[0] for line in v.split("\n")]
    element_v = [line.split("=")[1] for line in v.split("\n")]

    need_name = [name for name in name_v if name in need_data]
    need_element = [element for (name, element) in zip(name_v, element_v) if name in need_data]

    path.mkdir(parents=True, exist_ok=True)

    # Download each archive and optionally extract the CSV it contains
    for name, url in zip(need_name, need_element):
        archive = path.joinpath(name + ".tar.xz")
        with urllib.request.urlopen(url) as resp, archive.open(mode='wb') as f:
            f.write(resp.read())
        if untar:
            with tarfile.open(archive) as tar:
                tar.extract(name + ".csv", path)
            archive.unlink()

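Both loaders rely on list_data.txt, which holds one archive per line in name=url form. A minimal sketch of parsing that format in isolation (the sample lines and example.com URLs below are illustrative, not real entries from the file):

```python
# Two illustrative lines in the list_data.txt format (name=url per line);
# the URLs are placeholders, not the repository's real download links
sample = """nbastats_1996=https://example.com/nbastats_1996.tar.xz
nbastats_po_1996=https://example.com/nbastats_po_1996.tar.xz"""

# Split each line on the first "=" only, so "=" characters
# inside a URL cannot break the parse
links = dict(line.split("=", 1) for line in sample.strip().split("\n"))

print(links["nbastats_1996"])  # -> https://example.com/nbastats_1996.tar.xz
```

From such a dict, selecting the archives to fetch is a simple key lookup, which is what the filtering against need_data in the functions above amounts to.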
Dataset on Kaggle

Kaggle notebook with examples of working with the dataset (R)

Download from Google Drive

You can also download the full version of the dataset from GoogleDrive.

Contact me:

If you have questions or proposals about the dataset, you can contact me in whatever way is convenient for you.

nba_data's People

Contributors

shufinskiy


nba_data's Issues

incorrect game reference of GAME_ID `42000133`

Within datanba_po_2020.csv, the rows with GAME_ID 42000133 actually contain all the events of the NBA's gid 0041900133.

0042000133 should reference a game between NYK and ATL from 2021-05-28 https://www.nba.com/game/nyk-vs-atl-0042000133/play-by-play but the first actual event recorded for that game is:

Jump Ball Adebayo vs Turner (G Dragic gains possession)

which comes from a game between MIA and IND from 2020-08-20 https://www.nba.com/game/mia-vs-ind-0041900132/play-by-play

`game_id` is a varchar

Flagging that, from the NBA's perspective, game_id is a 10-digit character string, but all the uncompressed csv files in this repository have converted this variable to a numeric type and dropped the first two digits. This does make for more efficient compression, since the first two digits are always "0", but it distorts what a researcher would have received from calling these APIs themselves, as the variable is now of a different type.

And just for reference, the game_id has the form XXXYYZZZZZ:

  • XXX: is the season prefix
    • 001 - preseason
    • 002 - regular season
    • 003 - all-star
    • 004 - playoffs
    • 005 - play-in
  • YY: is two digit start year of the season (2023-2024 would be 23)
  • ZZZZZ: is the sequential game number
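Following that structure, restoring the leading zeros and decoding a numeric game_id can be sketched as below (the helper name and return shape are my own, not part of the repository):

```python
# Season-type prefixes as listed above
SEASON_PREFIX = {
    "001": "preseason",
    "002": "regular season",
    "003": "all-star",
    "004": "playoffs",
    "005": "play-in",
}

def decode_game_id(gid: int) -> dict:
    """Pad a numeric game_id back to 10 characters and split XXXYYZZZZZ."""
    s = str(gid).zfill(10)
    yy = int(s[3:5])
    return {
        "game_id": s,
        "season_type": SEASON_PREFIX.get(s[:3], "unknown"),
        # Two-digit start years 96-99 belong to the 1900s, the rest to the 2000s
        "start_year": 2000 + yy if yy < 96 else 1900 + yy,
        "game_number": int(s[5:]),
    }

print(decode_game_id(42000133))
# -> {'game_id': '0042000133', 'season_type': 'playoffs', 'start_year': 2020, 'game_number': 133}
```

With pandas, the padding alone is df["GAME_ID"].astype(str).str.zfill(10).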

G-league data

Hi there -- thanks SO MUCH for putting this repo together and open sourcing it!

With the new WNBA data in place -- any chance you could add the data for the G-League as well? It's super useful as extra data for padding out datasets used to fit / train models. It's league_id 20 in the NBA data API -- not sure about the other sources.

Playoff data?

Thank you for compiling this dataset and for making it easily accessible!

I was just wondering about the availability of post-season (as well as pre-season and all-star game) data. What was the rationale behind including only regular season data here?

Thank you again!

Use parquet files instead of csvs

Hey there,

Thanks for the great resource. The files are huge, and for good reason, but they're also stored as CSVs. Would it be possible to store the files as Parquet instead? This would reduce bandwidth usage and storage space, and Parquet also defines the schema inside the file format for ease of use. I can submit a PR if you're interested :)

interpretation of 'PCTIMESTRING' in stats.nba.com file

I'm still unsure how game time is denoted in the nbastats_XXXX.csv files. I consulted the data dictionary, but it doesn't seem that PCTIMESTRING represents the time to the end of the quarter. At best, it looks like the cumulative game time elapsed in the game multiplied by 60. However, there are still obvious errors. For instance, there are numerous games where a quarter ends and the next quarter starts with a PCTIMESTRING value that is less than the previous value.

Can you offer clarification on this, and more importantly, how to estimate the current game time for each event in the file?
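For what it's worth, under the usual stats.nba.com convention PCTIMESTRING is the time remaining in the current period as an MM:SS string (regulation periods are 12 minutes, overtimes 5 minutes), which would explain values resetting at the start of each quarter. Assuming that convention holds for these files (an assumption, not something the repository documents), elapsed game time could be estimated like this:

```python
def elapsed_seconds(period: int, pctimestring: str) -> int:
    """Elapsed game seconds, assuming pctimestring is MM:SS remaining in `period`."""
    minutes, seconds = (int(part) for part in pctimestring.split(":"))
    remaining = minutes * 60 + seconds
    # Regulation periods 1-4 last 12 minutes; each overtime period lasts 5 minutes
    length = 12 * 60 if period <= 4 else 5 * 60
    # Total length of all completed periods before this one
    prior = min(period - 1, 4) * 12 * 60 + max(period - 5, 0) * 5 * 60
    return prior + (length - remaining)

print(elapsed_seconds(2, "12:00"))  # -> 720 (start of the second quarter)
```

If the files instead store a different encoding, the period-length bookkeeping stays the same and only the parsing of the raw value changes.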
