meysubb / cfbscrapr-archived

CFB R Package

License: GNU General Public License v3.0

R 100.00%

cfbscrapr-archived's Issues

Question about run locations

Great package! Thanks for putting this together.

I have a question, in case I missed it: is there information on the location/gap of a run? Judging by the text descriptions, probably not, but I was wondering whether it could be derived from the raw scraped data.

Running into issues pulling historical SP+ data

Calling cfb_sp_ranking with only a year parameter is throwing errors for me.

The code I ran to create the error:

library(cfbscrapR)

sp_data <- data.frame()

for (year in 2005:2020){
temp_df <- cfb_sp_ranking(year=year)
sp_data <- rbind(temp_df, sp_data)
}

The trace:
Error in if (!repeated && grepl("%[[:xdigit:]]{2}", URL, useBytes = TRUE)) return(URL) : missing value where TRUE/FALSE needed
2. URLencode(team, reserved = T)
1. cfb_sp_ranking(year = year)

I'm not confident enough in R to be 100% sure this is the issue, but I believe the error means that, because I didn't pass in a team, team is null and can't be encoded.

Potential Solutions:
Check to see if team is null before calling URLencode, if it is null, skip the encoding.
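
A minimal sketch of that guard follows. The helper name `build_sp_url`, the base URL, and the query parameter names are placeholders for illustration and are not taken from the package source. The sketch also skips `NA`, because "missing value where TRUE/FALSE needed" is the message R raises when an `if` condition evaluates to `NA`, which suggests the default for team may be `NA` rather than `NULL`.

```r
# Hypothetical helper: builds a request URL and only encodes `team`
# when one was actually supplied. The base URL is a placeholder.
build_sp_url <- function(year, team = NULL,
                         base_url = "https://example.com/ratings/sp") {
  url <- paste0(base_url, "?year=", year)

  # Skip the encoding step entirely for NULL or NA.
  if (!is.null(team) && !is.na(team)) {
    url <- paste0(url, "&team=", utils::URLencode(team, reserved = TRUE))
  }
  url
}

url_all_teams <- build_sp_url(2019)
url_one_team  <- build_sp_url(2019, team = "Ohio State")
```

With a guard like this, a call that passes only a year never reaches `URLencode`.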

elapsed.seconds object not found error in cfb_pbp_data function

## devtools::install_github("meysubb/cfbscrapR")
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom' when
#> loading 'cfbscrapR'
library(tidyverse)

year = 2001:2019
week = 1:14
df = expand.grid(year, week) %>% setNames(c('year', 'week'))

df %>%
  filter(year == 2019) %>%
  mutate(pbp = purrr::pmap(
    list(x = year,
         y = week),
    .f = function(x, y) {
      cfb_pbp_data(
        year = x,
        week = y,
        team = "Pittsburgh",
        epa_wpa = TRUE
      )
    }
  )) -> pitt
#> Error in is_character(x): object 'elapsed.seconds' not found

Created on 2019-12-29 by the reprex package (v0.3.0)

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.4.0        stringr_1.4.0        dplyr_0.8.3         
 [4] purrr_0.3.3          readr_1.3.1          tidyr_1.0.0         
 [7] tibble_2.1.3         ggplot2_3.2.1        tidyverse_1.3.0     
[10] cfbscrapR_0.0.0.9000

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5 splines_3.6.1    haven_2.2.0      lattice_0.20-38 
 [5] colorspace_1.4-1 vctrs_0.2.0      generics_0.0.2   yaml_2.2.0      
 [9] mgcv_1.8-28      rlang_0.4.2      pillar_1.4.2     withr_2.1.2     
[13] glue_1.3.1       DBI_1.0.0        dbplyr_1.4.2     modelr_0.1.5    
[17] readxl_1.3.1     lifecycle_0.1.0  munsell_0.5.0    gtable_0.3.0    
[21] cellranger_1.1.0 rvest_0.3.5      curl_4.3         broom_0.5.2     
[25] Rcpp_1.0.3       backports_1.1.5  scales_1.1.0     jsonlite_1.6    
[29] fs_1.3.1         hms_0.5.2        stringi_1.4.3    ggrepel_0.8.1   
[33] grid_3.6.1       cli_1.1.0        tools_3.6.1      magrittr_1.5    
[37] lazyeval_0.2.2   crayon_1.3.4     pkgconfig_2.0.3  zeallot_0.1.0   
[41] ellipsis_0.3.0   Matrix_1.2-17    xml2_1.2.2       reprex_0.3.0    
[45] lubridate_1.7.4  assertthat_0.2.1 httr_1.4.1       rstudioapi_0.10 
[49] R6_2.4.1         nnet_7.3-12      nlme_3.1-140     compiler_3.6.1

Getting Names Must be Unique Error when using cfb_pbp_data function

I am getting this error when I am trying to get the pbp data for 2019.

Error: Names must be unique.
x These names are duplicated:

"game_id" at locations 9 and 24.

I am using this code to get the data which worked for me previously.

df <- cfb_pbp_data(year = 2019, season_type = "regular", week = NULL, epa_wpa = TRUE) %>% mutate(year = 2019)

R Version: 3.6.12
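
For anyone debugging this, a small base R sketch of how duplicated column names can be located and made unique. The toy data frame is hypothetical and only mimics the symptom (two `game_id` columns); it says nothing about where the duplicate arises inside cfb_pbp_data.

```r
# Toy data frame with a deliberately duplicated column name (hypothetical data).
df <- data.frame(game_id = 1, week = 14, game_id = 1, check.names = FALSE)

# Which names are duplicated, and at which positions do they occur?
dup_names <- unique(names(df)[duplicated(names(df))])
dup_positions <- lapply(
  setNames(dup_names, dup_names),
  function(nm) which(names(df) == nm)
)

# One possible workaround: make the names unique before further processing.
names(df) <- make.unique(names(df))
```

Dropping one of the duplicated columns instead of renaming it may be the better fix if both hold the same values.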

cfb_play_stats_player leads to "Error: Column Rush is of unsupported type NULL"

Description

If you try to scrape the player info associated by play for all games of the 2019 season, you will get an error message 'Error: Column Rush is of unsupported type NULL'.
It works for the SEC but not for the game IDs of the other conferences.

Reprex

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.6.2
#> Warning: package 'ggplot2' was built under R version 3.6.2
#> Warning: package 'tibble' was built under R version 3.6.2
#> Warning: package 'tidyr' was built under R version 3.6.2
#> Warning: package 'readr' was built under R version 3.6.2
#> Warning: package 'purrr' was built under R version 3.6.2
#> Warning: package 'dplyr' was built under R version 3.6.2
#> Warning: package 'stringr' was built under R version 3.6.2
#> Warning: package 'forcats' was built under R version 3.6.2
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom' when
#> loading 'cfbscrapR'

library(reprex)
#> Warning: package 'reprex' was built under R version 3.6.2

# Loading cfb_game_info Data for the 2019 season 
game_info <- cfb_game_info(year = 2019, season_type = "both")

# Loading cfb_play_stats_player Data for all games loaded before:
player_info <- data.frame()

for(i in factor(game_info$id)){
  data <- cfb_play_stats_player(gameId = i)
  df <- data.frame(data)
  player_info <- bind_rows(player_info, df)
}
#> Warning: Values in `athleteName` are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list(athleteName = list)` to suppress this warning.
#> * Use `values_fn = list(athleteName = length)` to identify where the duplicates arise
#> * Use `values_fn = list(athleteName = summary_fun)` to summarise duplicates
#> Error: Column `Rush` is of unsupported type NULL

Created on 2020-02-18 by the reprex package (v0.3.0)

R Version

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
#> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=German_Germany.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.4.0
#>  [5] yaml_2.2.1      Rcpp_1.0.3      stringi_1.4.4   rmarkdown_2.1  
#>  [9] highr_0.8       knitr_1.28      stringr_1.4.0   xfun_0.12      
#> [13] digest_0.6.23   rlang_0.4.4     evaluate_0.14

Created on 2020-02-18 by the reprex package (v0.3.0)
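
Until the underlying error is fixed, one way to find out which game IDs fail is to wrap each call in `tryCatch` so a single bad game does not abort the whole loop. This is only a sketch: `collect_safely` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for cfb_play_stats_player so the example runs without network access.

```r
# Hypothetical helper: calls fetch_fn for every id, keeps the successful
# results, and records the ids that raised an error.
collect_safely <- function(ids, fetch_fn) {
  results <- list()
  failed <- character(0)
  for (id in ids) {
    out <- tryCatch(fetch_fn(id), error = function(e) NULL)
    if (is.null(out)) {
      failed <- c(failed, as.character(id))
    } else {
      results[[as.character(id)]] <- as.data.frame(out)
    }
  }
  list(data = results, failed = failed)
}

# Stand-in for the real scraper: fails for one id, succeeds otherwise.
fake_fetch <- function(id) {
  if (id == 2) stop("Column `Rush` is of unsupported type NULL")
  data.frame(game_id = id)
}

res <- collect_safely(1:3, fake_fetch)
```

The `failed` vector can then be attached to the bug report, and the successful pieces combined afterwards.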

Create Game Excitement Index

Game Excitement Index for WP plots.

Error in `[<-.data.frame`(`*tmp*`, kickoff_ind, "ep_before", value = c(0, : replacement has 7 rows, data has 1

## devtools::install_github("meysubb/cfbscrapR")
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom' when
#> loading 'cfbscrapR'
library(tidyverse)

year = 2001:2019
week = 1:14
df = expand.grid(year, week) %>% setNames(c('year', 'week'))

df %>%
  mutate(pbp = purrr::pmap(
    list(x = year,
         y = week),
    .f = function(x, y) {
      cfb_pbp_data(
        year = x,
        week = y,
        team = "Pittsburgh",
        epa_wpa = TRUE
      )
    }
  )) -> pitt
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.
#> Warning in log(adj_yd_line): NaNs produced
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.
#> Warning in log(adj_yd_line): NaNs produced
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.
#> Error in `[<-.data.frame`(`*tmp*`, kickoff_ind, "ep_before", value = c(0, : replacement has 7 rows, data has 1

Created on 2019-12-29 by the reprex package (v0.3.0)

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.3    
 [5] readr_1.3.1     tidyr_1.0.0     tibble_2.1.3    ggplot2_3.2.1  
 [9] tidyverse_1.3.0 cfbscrapR_0.0.1

loaded via a namespace (and not attached):
 [1] ggrepel_0.8.1     Rcpp_1.0.3        lubridate_1.7.4   lattice_0.20-38  
 [5] ps_1.3.0          assertthat_0.2.1  zeallot_0.1.0     digest_0.6.23    
 [9] R6_2.4.1          cellranger_1.1.0  backports_1.1.5   reprex_0.3.0     
[13] evaluate_0.14     httr_1.4.1        pillar_1.4.2      rlang_0.4.2      
[17] lazyeval_0.2.2    curl_4.3          readxl_1.3.1      rstudioapi_0.10  
[21] whisker_0.4       callr_3.3.2       Matrix_1.2-17     rmarkdown_1.18   
[25] splines_3.6.1     munsell_0.5.0     broom_0.5.2       compiler_3.6.1   
[29] modelr_0.1.5      xfun_0.11         pkgconfig_2.0.3   clipr_0.7.0      
[33] mgcv_1.8-28       htmltools_0.4.0   nnet_7.3-12       tidyselect_0.2.5 
[37] crayon_1.3.4      dbplyr_1.4.2      withr_2.1.2       grid_3.6.1       
[41] nlme_3.1-140      jsonlite_1.6      gtable_0.3.0      lifecycle_0.1.0  
[45] DBI_1.0.0         magrittr_1.5      scales_1.1.0      cli_1.1.0        
[49] stringi_1.4.3     fs_1.3.1          xml2_1.2.2        ellipsis_0.3.0   
[53] generics_0.0.2    vctrs_0.2.0       tools_3.6.1       glue_1.3.1       
[57] hms_0.5.2         processx_3.4.1    yaml_2.2.0        colorspace_1.4-1 
[61] sessioninfo_1.1.1 rvest_0.3.5       knitr_1.26        haven_2.2.0

ReadBin error when trying to import data

I'm using this code to get pbp data.

cpbp <- data.frame()

for (j in 2012:2019){
  data <- data.frame()
  for(i in 1:15) {
    data2 <-
      cfb_pbp_data(year = j, season_type = "both", week = i, epa_wpa = FALSE) %>%
      mutate(week = i, year = j)
    data2 <- data.frame(data2)
    data <- bind_rows(data, data2)
  }
  cpbp <- bind_rows(cpbp, data)
}

However I'm getting this error:

Error in readBin(3L, raw(0), 32768L) :
transfer closed with 10396534 bytes remaining to read

I had no problem getting the data previously and have re-installed the package to no avail. Thanks
R Version
3.6.2
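
The readBin message points to a transfer that was cut off partway, so retrying the failed request may help if the cause is transient. A rough sketch, assuming that is the case: `with_retry` is a hypothetical helper, not part of cfbscrapR, and the commented call only shows how the loop body above might be wrapped.

```r
# Hypothetical helper: runs fn() up to max_tries times, waiting between
# attempts, and returns the first successful result.
with_retry <- function(fn, max_tries = 3, wait_seconds = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt < max_tries) {
      Sys.sleep(wait_seconds)
    }
  }
  stop("all ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Possible use inside the loop (not run here):
# data2 <- with_retry(function() {
#   cfb_pbp_data(year = j, season_type = "both", week = i, epa_wpa = FALSE)
# })
```

If every attempt fails, the last error message is passed through, so the original problem stays visible.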

EPA lower than expected value in Michigan-Rutgers game

Description:

The 2019 Michigan-Rutgers game says Michigan's first play on offense, 1st & 10 from the 20, was a run that went for 6 yards, but was worth -0.58 EPA. That's much lower than I expected.

Reprex:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom'
#> when loading 'cfbscrapR'
library(reprex)

pbp_2019 <- data.frame()

for(i in 1:15) {
  data <-
    cfb_pbp_data(year = 2019, season_type = "both", week = i, epa_wpa = TRUE) %>%
    mutate(week = i, year = 2019)
  data <- data.frame(data)
  pbp_2019 <- bind_rows(pbp_2019, data)
  
}

pbp_2019 %>%
  filter(offense_play == "Michigan",
         defense_play == "Rutgers") %>%
  select(offense_play,
         defense_play,
         drive_id,
         half,
         clock.minutes,
         clock.seconds,
         offense_score,
         defense_score,
         play_type,
         down,
         distance,
         adj_yd_line,
         yards_gained,
         ep_before,
         ep_after,
         EPA) %>% head()
#>   offense_play defense_play   drive_id half clock.minutes clock.seconds
#> 1     Michigan      Rutgers 4011122251    1            30             0
#> 2     Michigan      Rutgers 4011122251    1            30             0
#> 3     Michigan      Rutgers 4011122251    1            30             0
#> 4     Michigan      Rutgers 4011122251    1            30             0
#> 5     Michigan      Rutgers 4011122251    1            27            52
#> 6     Michigan      Rutgers 4011122252    1            27            52
#>   offense_score defense_score         play_type down distance adj_yd_line
#> 1             0             0              Rush    1       10          80
#> 2             0             0    Pass Reception    2        4          74
#> 3             0             0              Rush    1       10          60
#> 4             0             0              Rush    2        8          58
#> 5             7             0 Passing Touchdown    1       10          48
#> 6             7             0           Kickoff    1        0          78
#>   yards_gained  ep_before   ep_after        EPA
#> 1            6 0.66694425 0.08348048 -0.5834638
#> 2           14 0.08348048 2.05876294  1.9752825
#> 3            2 2.05876294 1.29938456 -0.7593784
#> 4           10 1.29938456 2.93682006  1.6374355
#> 5           48 2.93682006 7.00000000  4.0631799
#> 6           19 1.06557103 0.85929598 -0.2062750

Created on 2020-01-11 by the reprex package (v0.3.0)

R version:

sessionInfo()
#> R version 3.5.3 (2019-03-11)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.5.3  magrittr_1.5    tools_3.5.3     htmltools_0.3.6
#>  [5] yaml_2.2.0      Rcpp_1.0.1      stringi_1.4.3   rmarkdown_1.12 
#>  [9] highr_0.8       knitr_1.22      stringr_1.4.0   xfun_0.6       
#> [13] digest_0.6.18   evaluate_0.13

Created on 2020-01-11 by the reprex package (v0.3.0)
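
As a quick sanity check on the printed rows, the EPA column matches ep_after minus ep_before, so the surprising -0.58 traces back to the ep_before/ep_after estimates (0.67 before the play, 0.08 after it) rather than to the subtraction itself. The values below are copied from the output above.

```r
# Values copied from the first three rows of the printed output.
ep_before    <- c(0.66694425, 0.08348048, 2.05876294)
ep_after     <- c(0.08348048, 2.05876294, 1.29938456)
epa_reported <- c(-0.5834638, 1.9752825, -0.7593784)

# EPA recomputed as the change in expected points across the play.
epa_recomputed <- ep_after - ep_before

# Largest gap between the recomputed and the reported values.
max_abs_diff <- max(abs(epa_recomputed - epa_reported))
```

The gap is within the rounding of the printed values.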

PBP Sequencing if PBP data has EP/WP info

The play-by-play sequencing plot breaks if you pull the PBP data with the epa_wpa = T parameter.

Passer Column Contains Underscores

While trying to get play-by-play data at the player level with cfb_play_stats_player, I noticed that some rows contain both the player and the receiver/rusher/sacking player in one string, separated by an underscore. I would split them on the underscore, but the order of the names is not consistent, and it also varies with whether the play was a sack, reception, etc.

cfb_play_stats_player() not working for all game IDs?

I installed the package, and was trying to pull the player stats using the various game IDs. But some work and some do not:

cfb_play_stats_player(401012246)

#Error in UseMethod("tbl_vars") : 
  no applicable method for 'tbl_vars' applied to an object of class "list"

Punt EPA is backwards

EPA is calculated with respect to the team who has the ball.

So for punts, we need to do -1 * EPA or something along those lines.

Currently, punt return TDs are extremely generous to the team that punted, which doesn't make sense. They should actually be detrimental.
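
A rough sketch of the "-1 * EPA" idea in base R. The toy rows, their values, and the assumption that punt plays can be recognised by "Punt" appearing in play_type are all illustrative; the actual labels and the right adjustment would need checking against the real data.

```r
# Toy play-by-play rows (hypothetical values and play_type labels).
pbp <- data.frame(
  play_type = c("Rush", "Punt", "Punt Return Touchdown"),
  EPA = c(0.5, -0.3, 6.2),
  stringsAsFactors = FALSE
)

# Assumption: every punt-related play has "Punt" in its play_type.
is_punt <- grepl("Punt", pbp$play_type, fixed = TRUE)

# Flip the sign on punt plays only; leave every other play untouched.
pbp$EPA_adj <- ifelse(is_punt, -1 * pbp$EPA, pbp$EPA)
```

Writing the result to a new column keeps the original EPA available for comparison.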

Misclassification of Rushing Touchdown

I noticed that "Rushing Touchdown" plays are classified as rush = 0 when pulling from cfb_pbp_data(). I was able to correct this and then get the correct number of touchdowns to show up.

library(tidyverse)
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom' when
#> loading 'cfbscrapR'

pbp_2019 <- data.frame()
for (i in 1:15) {
  data <- cfb_pbp_data(year = 2019, season_type = "both", week = i, epa_wpa = TRUE) %>%
    mutate(week = i, year = 2019)
  df <- data.frame(data)
  pbp_2019 <- bind_rows(pbp_2019, df)
}

test <- pbp_2019 %>% filter(rush == 1 | pass == 1) %>% filter(down == 3 | down == 4)
test %>% count(play_type)
#> # A tibble: 9 x 2
#>   play_type                      n
#>   <chr>                      <int>
#> 1 Fumble Recovery (Opponent)   175
#> 2 Fumble Recovery (Own)        142
#> 3 Pass Incompletion           6422
#> 4 Pass Interception Return     490
#> 5 Pass Reception              7646
#> 6 Passing Touchdown            835
#> 7 Rush                        9909
#> 8 Sack                        1478
#> 9 Safety                         9

pbp_2019<- pbp_2019 %>% mutate(rush = ifelse(play_type == "Rushing Touchdown", 1, rush))

test <- pbp_2019 %>% filter(rush == 1 | pass == 1) %>% filter(down == 3 | down == 4)
test %>% count(play_type)
#> # A tibble: 10 x 2
#>    play_type                      n
#>    <chr>                      <int>
#>  1 Fumble Recovery (Opponent)   175
#>  2 Fumble Recovery (Own)        142
#>  3 Pass Incompletion           6422
#>  4 Pass Interception Return     490
#>  5 Pass Reception              7646
#>  6 Passing Touchdown            835
#>  7 Rush                        9909
#>  8 Rushing Touchdown            633
#>  9 Sack                        1478
#> 10 Safety                         9

Created on 2020-01-07 by the reprex package (v0.3.0)

EPA by Week seems to be off

Describe the bug
When extracting EPA by week like this

cfb_regular_play_2019 <- cfb_pbp_data(2019, season_type = "regular", week = 14, epa_wpa = TRUE)
cfb_osu = cfb_regular_play_2019 %>% filter(game_id == 401112228)

The top 3 negative EPA plays look off.

However when I pull the same data like this

df = cfb_pbp_data(year=2019,week=14,team='Ohio State',epa_wpa = T)

The EPA calcs look fine. Look into this.
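
To narrow this down, the two pulls could be joined play by play and the rows with differing EPA listed. A sketch under assumptions: `compare_epa` is a hypothetical helper, the key column name `id` is a guess at a per-play identifier, and the toy frames below stand in for the two real pulls.

```r
# Hypothetical helper: joins two play-by-play frames on a key column and
# returns the plays whose EPA values differ by more than tol.
compare_epa <- function(a, b, key = "id", tol = 1e-8) {
  merged <- merge(
    a[, c(key, "EPA")],
    b[, c(key, "EPA")],
    by = key,
    suffixes = c("_a", "_b")
  )
  merged$diff <- merged$EPA_a - merged$EPA_b
  # which() drops NA comparisons, so plays with missing EPA are skipped.
  merged[which(abs(merged$diff) > tol), ]
}

# Toy stand-ins for the week-level pull and the team-level pull.
by_week <- data.frame(id = 1:3, EPA = c(0.1, -2.0, 1.5))
by_team <- data.frame(id = 1:3, EPA = c(0.1, -0.4, 1.5))

mismatches <- compare_epa(by_week, by_team)
```

Seeing which play types dominate the mismatches may hint at where the two code paths diverge.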