
cfbscrapr-archived's People

Contributors

havocanalytics, meysubb, saiemgilani, spfleming


cfbscrapr-archived's Issues

Question about run locations

Great package! Thanks for putting this together.

I have a question, in case I missed something: is there any information on the location/gap of each run? Judging by the text descriptions, probably not, but I was wondering whether it could be derived from the raw scraped data.

Running into issues pulling historical SP+ data

Calling cfb_sp_ranking with only a year parameter is throwing errors for me.

The code I ran to create the error:

library(cfbscrapR)

sp_data <- data.frame()

for (year in 2005:2020) {
  temp_df <- cfb_sp_ranking(year = year)
  sp_data <- rbind(temp_df, sp_data)
}

The trace:
Error in if (!repeated && grepl("%[[:xdigit:]]{2}", URL, useBytes = TRUE)) return(URL) : missing value where TRUE/FALSE needed
2. URLencode(team, reserved = T)
1. cfb_sp_ranking(year = year)

I'm not confident enough in R to be 100% sure this is the issue, but I believe the trace says that, because I didn't pass in a team, team is left as NULL and can't be encoded.

Potential Solutions:
Check whether team is NULL before calling URLencode; if it is, skip the encoding.
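
The suggested check could look roughly like the sketch below. Both `encode_team` and `build_query` are hypothetical helpers written for illustration; they are not functions from cfbscrapR, and the package's real URL-building code may be organised differently.

```r
# Hypothetical helper (not part of cfbscrapR): encode a team name only when
# one was supplied, so a NULL team never reaches URLencode().
encode_team <- function(team = NULL) {
  if (is.null(team)) {
    return(NULL)
  }
  utils::URLencode(team, reserved = TRUE)
}

# Hypothetical query builder: optional parameters left as NULL are dropped
# before the query string is assembled.
build_query <- function(year, team = NULL) {
  params <- list(year = year, team = encode_team(team))
  params <- params[!vapply(params, is.null, logical(1))]
  paste0(names(params), "=", unlist(params), collapse = "&")
}

build_query(2019)                      # year only, no team parameter
build_query(2019, team = "Texas A&M")  # team is percent-encoded
```

With this shape, calling the function with only a year never touches URLencode, which is the case that fails in the trace above.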

elapsed.seconds object not found error in cfb_pbp_data function

## devtools::install_github("meysubb/cfbscrapR")
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom' when
#> loading 'cfbscrapR'
library(tidyverse)

year = 2001:2019
week = 1:14
df = expand.grid(year, week) %>% setNames(c('year', 'week'))

df %>%
  filter(year == 2019) %>%
  mutate(pbp = purrr::pmap(
    list(x = year,
         y = week),
    .f = function(x, y) {
      cfb_pbp_data(
        year = x,
        week = y,
        team = "Pittsburgh",
        epa_wpa = TRUE
      )
    }
  )) -> pitt
#> Error in is_character(x): object 'elapsed.seconds' not found

Created on 2019-12-29 by the reprex package (v0.3.0)

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.4.0        stringr_1.4.0        dplyr_0.8.3         
 [4] purrr_0.3.3          readr_1.3.1          tidyr_1.0.0         
 [7] tibble_2.1.3         ggplot2_3.2.1        tidyverse_1.3.0     
[10] cfbscrapR_0.0.0.9000

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5 splines_3.6.1    haven_2.2.0      lattice_0.20-38 
 [5] colorspace_1.4-1 vctrs_0.2.0      generics_0.0.2   yaml_2.2.0      
 [9] mgcv_1.8-28      rlang_0.4.2      pillar_1.4.2     withr_2.1.2     
[13] glue_1.3.1       DBI_1.0.0        dbplyr_1.4.2     modelr_0.1.5    
[17] readxl_1.3.1     lifecycle_0.1.0  munsell_0.5.0    gtable_0.3.0    
[21] cellranger_1.1.0 rvest_0.3.5      curl_4.3         broom_0.5.2     
[25] Rcpp_1.0.3       backports_1.1.5  scales_1.1.0     jsonlite_1.6    
[29] fs_1.3.1         hms_0.5.2        stringi_1.4.3    ggrepel_0.8.1   
[33] grid_3.6.1       cli_1.1.0        tools_3.6.1      magrittr_1.5    
[37] lazyeval_0.2.2   crayon_1.3.4     pkgconfig_2.0.3  zeallot_0.1.0   
[41] ellipsis_0.3.0   Matrix_1.2-17    xml2_1.2.2       reprex_0.3.0    
[45] lubridate_1.7.4  assertthat_0.2.1 httr_1.4.1       rstudioapi_0.10 
[49] R6_2.4.1         nnet_7.3-12      nlme_3.1-140     compiler_3.6.1 

Getting Names Must be Unique Error when using cfb_pbp_data function

I am getting this error when I am trying to get the pbp data for 2019.

Error: Names must be unique.
x These names are duplicated:

  • "game_id" at locations 9 and 24.

I am using this code, which worked for me previously, to get the data.

df <- cfb_pbp_data(year = 2019, season_type = "regular", week = NULL, epa_wpa = TRUE) %>% mutate(year = 2019)

R Version: 3.6.12

cfb_play_stats_player leads to "Error: Column Rush is of unsupported type NULL"

Description

If you try to scrape the player info associated with each play for all games of the 2019 season, you get the error message 'Error: Column Rush is of unsupported type NULL'.
It works for the SEC but not for the game IDs of the other conferences.

Reprex

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.6.2
#> Warning: package 'ggplot2' was built under R version 3.6.2
#> Warning: package 'tibble' was built under R version 3.6.2
#> Warning: package 'tidyr' was built under R version 3.6.2
#> Warning: package 'readr' was built under R version 3.6.2
#> Warning: package 'purrr' was built under R version 3.6.2
#> Warning: package 'dplyr' was built under R version 3.6.2
#> Warning: package 'stringr' was built under R version 3.6.2
#> Warning: package 'forcats' was built under R version 3.6.2
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom' when
#> loading 'cfbscrapR'

library(reprex)
#> Warning: package 'reprex' was built under R version 3.6.2

# Loading cfb_game_info Data for the 2019 season 
game_info <- cfb_game_info(year = 2019, season_type = "both")

# Loading cfb_play_stats_player Data for all games loaded before:
player_info <- data.frame()

for(i in factor(game_info$id)){
  data <- cfb_play_stats_player(gameId = i)
  df <- data.frame(data)
  player_info <- bind_rows(player_info, df)
}
#> Warning: Values in `athleteName` are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list(athleteName = list)` to suppress this warning.
#> * Use `values_fn = list(athleteName = length)` to identify where the duplicates arise
#> * Use `values_fn = list(athleteName = summary_fun)` to summarise duplicates
#> Error: Column `Rush` is of unsupported type NULL

Created on 2020-02-18 by the reprex package (v0.3.0)

R Version

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
#> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=German_Germany.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.4.0
#>  [5] yaml_2.2.1      Rcpp_1.0.3      stringi_1.4.4   rmarkdown_2.1  
#>  [9] highr_0.8       knitr_1.28      stringr_1.4.0   xfun_0.12      
#> [13] digest_0.6.23   rlang_0.4.4     evaluate_0.14

Created on 2020-02-18 by the reprex package (v0.3.0)
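
Until the underlying error is fixed, one possible workaround is to wrap each call in tryCatch so that a single failing game does not abort the whole loop. The sketch below is only an illustration: `collect_safely` is a hypothetical helper, and `fake_fetch` is a stand-in used to make the example runnable without the API. In practice the `fetch` argument would be something like `function(id) cfb_play_stats_player(gameId = id)`.

```r
# Hypothetical wrapper: call `fetch` for each id, keep the successes and
# record the ids that raised an error instead of stopping the loop.
collect_safely <- function(ids, fetch) {
  results <- list()
  failed <- character(0)
  for (id in ids) {
    out <- tryCatch(fetch(id), error = function(e) NULL)
    if (is.null(out)) {
      failed <- c(failed, as.character(id))
    } else {
      results[[as.character(id)]] <- out
    }
  }
  list(results = results, failed = failed)
}

# Stand-in for the real scraper, used only to make the sketch runnable:
# it fails for one id with the same message as in the report above.
fake_fetch <- function(id) {
  if (id == 2) stop("Column `Rush` is of unsupported type NULL")
  data.frame(game_id = id, plays = id * 10)
}

res <- collect_safely(1:3, fake_fetch)
res$failed          # ids that could not be scraped
names(res$results)  # ids that succeeded
```

The successful pieces can then be combined once at the end, for example with `dplyr::bind_rows(res$results)`, and `res$failed` gives a list of game IDs to include in a bug report.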

Error in `[<-.data.frame`(`*tmp*`, kickoff_ind, "ep_before", value = c(0, : replacement has 7 rows, data has 1

## devtools::install_github("meysubb/cfbscrapR")
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom' when
#> loading 'cfbscrapR'
library(tidyverse)

year = 2001:2019
week = 1:14
df = expand.grid(year, week) %>% setNames(c('year', 'week'))

df %>%
  mutate(pbp = purrr::pmap(
    list(x = year,
         y = week),
    .f = function(x, y) {
      cfb_pbp_data(
        year = x,
        week = y,
        team = "Pittsburgh",
        epa_wpa = TRUE
      )
    }
  )) -> pitt
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.
#> Warning in log(adj_yd_line): NaNs produced
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.
#> Warning in log(adj_yd_line): NaNs produced
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.

#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#>             for this one week or team.
#> Error in `[<-.data.frame`(`*tmp*`, kickoff_ind, "ep_before", value = c(0, : replacement has 7 rows, data has 1

Created on 2019-12-29 by the reprex package (v0.3.0)

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.3    
 [5] readr_1.3.1     tidyr_1.0.0     tibble_2.1.3    ggplot2_3.2.1  
 [9] tidyverse_1.3.0 cfbscrapR_0.0.1

loaded via a namespace (and not attached):
 [1] ggrepel_0.8.1     Rcpp_1.0.3        lubridate_1.7.4   lattice_0.20-38  
 [5] ps_1.3.0          assertthat_0.2.1  zeallot_0.1.0     digest_0.6.23    
 [9] R6_2.4.1          cellranger_1.1.0  backports_1.1.5   reprex_0.3.0     
[13] evaluate_0.14     httr_1.4.1        pillar_1.4.2      rlang_0.4.2      
[17] lazyeval_0.2.2    curl_4.3          readxl_1.3.1      rstudioapi_0.10  
[21] whisker_0.4       callr_3.3.2       Matrix_1.2-17     rmarkdown_1.18   
[25] splines_3.6.1     munsell_0.5.0     broom_0.5.2       compiler_3.6.1   
[29] modelr_0.1.5      xfun_0.11         pkgconfig_2.0.3   clipr_0.7.0      
[33] mgcv_1.8-28       htmltools_0.4.0   nnet_7.3-12       tidyselect_0.2.5 
[37] crayon_1.3.4      dbplyr_1.4.2      withr_2.1.2       grid_3.6.1       
[41] nlme_3.1-140      jsonlite_1.6      gtable_0.3.0      lifecycle_0.1.0  
[45] DBI_1.0.0         magrittr_1.5      scales_1.1.0      cli_1.1.0        
[49] stringi_1.4.3     fs_1.3.1          xml2_1.2.2        ellipsis_0.3.0   
[53] generics_0.0.2    vctrs_0.2.0       tools_3.6.1       glue_1.3.1       
[57] hms_0.5.2         processx_3.4.1    yaml_2.2.0        colorspace_1.4-1 
[61] sessioninfo_1.1.1 rvest_0.3.5       knitr_1.26        haven_2.2.0

ReadBin error when trying to import data

I'm using this code to get pbp data.

cpbp <- data.frame()

for (j in 2012:2019){
  data <- data.frame()
  for(i in 1:15) {
    data2 <-
      cfb_pbp_data(year = j, season_type = "both", week = i, epa_wpa = FALSE) %>%
      mutate(week = i, year = j)
    data2 <- data.frame(data2)
    data <- bind_rows(data, data2)
  }
  cpbp <- bind_rows(cpbp, data)
}

However I'm getting this error:

Error in readBin(3L, raw(0), 32768L) :
transfer closed with 10396534 bytes remaining to read

I had no problem getting the data previously, and re-installing the package did not help. Thanks.

R Version: 3.6.2
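
The "transfer closed" message suggests the download was cut off partway, so retrying the failed request may help, although that is an assumption and not a confirmed fix. A minimal sketch of a retry wrapper follows; `with_retry` and `flaky` are hypothetical names introduced here, not part of cfbscrapR.

```r
# Hypothetical helper: retry a call a few times, pausing between attempts.
with_retry <- function(f, max_tries = 3, wait_seconds = 5) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) Sys.sleep(wait_seconds)
  }
  # All attempts failed: re-signal the last error.
  stop(result)
}

# Stand-in for a flaky download: fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("transfer closed with bytes remaining to read")
  "ok"
}

out <- with_retry(flaky, max_tries = 3, wait_seconds = 0)
```

Inside the loop above, the call would become something like `with_retry(function() cfb_pbp_data(year = j, season_type = "both", week = i, epa_wpa = FALSE))`.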

EPA lower than expected value in Michigan-Rutgers game

Description:

The 2019 Michigan-Rutgers game says Michigan's first play on offense, 1st & 10 from the 20, was a run that went for 6 yards, but was worth -0.58 EPA. That's much lower than I expected.

Reprex:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom'
#> when loading 'cfbscrapR'
library(reprex)

pbp_2019 <- data.frame()

for(i in 1:15) {
  data <-
    cfb_pbp_data(year = 2019, season_type = "both", week = i, epa_wpa = TRUE) %>%
    mutate(week = i, year = 2019)
  data <- data.frame(data)
  pbp_2019 <- bind_rows(pbp_2019, data)
  
}

pbp_2019 %>%
  filter(offense_play == "Michigan",
         defense_play == "Rutgers") %>%
  select(offense_play,
         defense_play,
         drive_id,
         half,
         clock.minutes,
         clock.seconds,
         offense_score,
         defense_score,
         play_type,
         down,
         distance,
         adj_yd_line,
         yards_gained,
         ep_before,
         ep_after,
         EPA) %>% head()
#>   offense_play defense_play   drive_id half clock.minutes clock.seconds
#> 1     Michigan      Rutgers 4011122251    1            30             0
#> 2     Michigan      Rutgers 4011122251    1            30             0
#> 3     Michigan      Rutgers 4011122251    1            30             0
#> 4     Michigan      Rutgers 4011122251    1            30             0
#> 5     Michigan      Rutgers 4011122251    1            27            52
#> 6     Michigan      Rutgers 4011122252    1            27            52
#>   offense_score defense_score         play_type down distance adj_yd_line
#> 1             0             0              Rush    1       10          80
#> 2             0             0    Pass Reception    2        4          74
#> 3             0             0              Rush    1       10          60
#> 4             0             0              Rush    2        8          58
#> 5             7             0 Passing Touchdown    1       10          48
#> 6             7             0           Kickoff    1        0          78
#>   yards_gained  ep_before   ep_after        EPA
#> 1            6 0.66694425 0.08348048 -0.5834638
#> 2           14 0.08348048 2.05876294  1.9752825
#> 3            2 2.05876294 1.29938456 -0.7593784
#> 4           10 1.29938456 2.93682006  1.6374355
#> 5           48 2.93682006 7.00000000  4.0631799
#> 6           19 1.06557103 0.85929598 -0.2062750

Created on 2020-01-11 by the reprex package (v0.3.0)

R version:

sessionInfo()
#> R version 3.5.3 (2019-03-11)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.5.3  magrittr_1.5    tools_3.5.3     htmltools_0.3.6
#>  [5] yaml_2.2.0      Rcpp_1.0.1      stringi_1.4.3   rmarkdown_1.12 
#>  [9] highr_0.8       knitr_1.22      stringr_1.4.0   xfun_0.6       
#> [13] digest_0.6.18   evaluate_0.13

Created on 2020-01-11 by the reprex package (v0.3.0)
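
As a quick sanity check on the output above, the posted rows can be re-entered by hand to see whether EPA is simply ep_after minus ep_before. The values below are copied from the printed table; only the `recomputed` and `matches` columns are new.

```r
# Rows copied from the printed Michigan-Rutgers output above.
plays <- data.frame(
  play_type = c("Rush", "Pass Reception", "Rush", "Rush",
                "Passing Touchdown", "Kickoff"),
  ep_before = c(0.66694425, 0.08348048, 2.05876294,
                1.29938456, 2.93682006, 1.06557103),
  ep_after  = c(0.08348048, 2.05876294, 1.29938456,
                2.93682006, 7.00000000, 0.85929598),
  EPA       = c(-0.5834638, 1.9752825, -0.7593784,
                1.6374355, 4.0631799, -0.2062750)
)

# Recompute EPA as the change in expected points and compare,
# allowing for the rounding in the printed values.
plays$recomputed <- plays$ep_after - plays$ep_before
plays$matches <- abs(plays$recomputed - plays$EPA) < 1e-6
all(plays$matches)
```

For these six rows the subtraction reproduces the reported EPA, so the surprising -0.58 on the first play appears to come from the low ep_after value (0.083 for 2nd & 4) and not from the subtraction itself.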

Passer Column Contains Underscores

I'm trying to get play-by-play data at the player level using cfb_play_stats_player. It seems that some rows contain both the passer and the receiver/rusher/sacking player in one string, separated by an underscore. I would split them on the underscore, but the order of the names is not consistent, and neither is whether the play was a sack, reception, etc.

cfb_play_stats_player() not working for all game IDs?

I installed the package, and was trying to pull the player stats using the various game IDs. But some work and some do not:

cfb_play_stats_player(401012246)

#Error in UseMethod("tbl_vars") : 
  no applicable method for 'tbl_vars' applied to an object of class "list"

Punt EPA is backwards

EPA is calculated with respect to the team that has the ball.

So for punts, we need to do -1 * EPA or something along those lines.

Currently, punt return TDs are extremely generous to the team that punted, which doesn't make sense. They should actually be detrimental.
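
A rough sketch of the "-1 * EPA" idea is shown below. The data frame, its EPA values, and the play-type labels are made up for illustration, and matching on the word "Punt" is an assumption about how punt plays are labelled; a real fix would need to be checked against the play types actually returned by cfb_pbp_data().

```r
# Hypothetical plays with made-up EPA values, for illustration only.
punt_plays <- data.frame(
  play_type = c("Rush", "Punt", "Punt Return Touchdown"),
  EPA       = c(0.4, -0.3, 5.1)
)

# Assumption: any play type containing "Punt" should have its sign flipped.
is_punt <- grepl("Punt", punt_plays$play_type, fixed = TRUE)
punt_plays$EPA_adj <- ifelse(is_punt, -punt_plays$EPA, punt_plays$EPA)

punt_plays
```

After the flip, the punt return touchdown in this example counts against the punting team, which is the direction the issue asks for.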

Misclassification of Rushing Touchdown

I noticed that "Rushing Touchdown" plays are classified as rush = 0 when pulling from cfb_pbp_data(). I was able to correct it, after which the correct number of touchdowns shows up.

library(tidyverse)
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom' when
#> loading 'cfbscrapR'

pbp_2019 <- data.frame()
for (i in 1:15) {
  data <- cfb_pbp_data(year = 2019, season_type = "both", week = i, epa_wpa = TRUE) %>%
    mutate(week = i, year = 2019)
  df <- data.frame(data)
  pbp_2019 <- bind_rows(pbp_2019, df)
}

test <- pbp_2019 %>% filter(rush == 1 | pass == 1) %>% filter(down == 3 | down == 4)
test %>% count(play_type)
#> # A tibble: 9 x 2
#>   play_type                      n
#>   <chr>                      <int>
#> 1 Fumble Recovery (Opponent)   175
#> 2 Fumble Recovery (Own)        142
#> 3 Pass Incompletion           6422
#> 4 Pass Interception Return     490
#> 5 Pass Reception              7646
#> 6 Passing Touchdown            835
#> 7 Rush                        9909
#> 8 Sack                        1478
#> 9 Safety                         9

pbp_2019<- pbp_2019 %>% mutate(rush = ifelse(play_type == "Rushing Touchdown", 1, rush))

test <- pbp_2019 %>% filter(rush == 1 | pass == 1) %>% filter(down == 3 | down == 4)
test %>% count(play_type)
#> # A tibble: 10 x 2
#>    play_type                      n
#>    <chr>                      <int>
#>  1 Fumble Recovery (Opponent)   175
#>  2 Fumble Recovery (Own)        142
#>  3 Pass Incompletion           6422
#>  4 Pass Interception Return     490
#>  5 Pass Reception              7646
#>  6 Passing Touchdown            835
#>  7 Rush                        9909
#>  8 Rushing Touchdown            633
#>  9 Sack                        1478
#> 10 Safety                         9

Created on 2020-01-07 by the reprex package (v0.3.0)

EPA by Week seems to be off

Describe the bug
When extracting EPA by week like this:

cfb_regular_play_2019 <- cfb_pbp_data(2019, season_type = "regular", week = 14, epa_wpa = TRUE)
cfb_osu = cfb_regular_play_2019 %>% filter(game_id == 401112228)

The top 3 negative EPA plays look off.

However, when I pull the same data like this:

df = cfb_pbp_data(year=2019,week=14,team='Ohio State',epa_wpa = T)

The EPA calcs look fine. Look into this.
