meysubb / cfbscrapr-archived
CFB R Package
License: GNU General Public License v3.0
Great package! Thanks for putting this together.
I have a question, in case I missed it: is there information on the location/gap of a run? Judging from the text descriptions, probably not, but I was wondering whether it could be derived from the raw scraped data.
Calling cfb_sp_ranking with only a year parameter is throwing errors for me.
The code I ran to create the error:
library(cfbscrapR)
sp_data <- data.frame()
for (year in 2005:2020) {
  temp_df <- cfb_sp_ranking(year = year)
  sp_data <- rbind(temp_df, sp_data)
}
The trace:
Error in if (!repeated && grepl("%[[:xdigit:]]{2}", URL, useBytes = TRUE)) return(URL) : missing value where TRUE/FALSE needed
2. URLencode(team, reserved = T)
1. cfb_sp_ranking(year = year)
I'm not confident enough in R to be 100% sure this is the issue, but I believe the error means that, because I didn't pass in a team, team is NULL and can't be encoded.
Potential solution:
Check whether team is NULL before calling URLencode; if it is, skip the encoding.
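A minimal sketch of that check, assuming the function builds its request URL from optional parameters. The helper names `encode_param` and `build_query` and the query parameter names are hypothetical, not taken from the cfbscrapR source; only `utils::URLencode` is a real base R function.

```r
# Hypothetical helper, not part of cfbscrapR: encode an optional parameter.
encode_param <- function(x) {
  if (is.null(x)) {
    return(NULL)  # nothing to encode; the caller leaves it out of the URL
  }
  utils::URLencode(x, reserved = TRUE)
}

# Hypothetical query builder: only include `team` when it was supplied.
build_query <- function(year, team = NULL) {
  parts <- c(
    paste0("year=", year),
    if (!is.null(team)) paste0("team=", encode_param(team))
  )
  paste(parts, collapse = "&")
}

build_query(2019)
build_query(2019, team = "Ohio State")
```

With a guard like this, a call that passes only a year never reaches URLencode with a NULL value.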
## devtools::install_github("meysubb/cfbscrapR")
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom' when
#> loading 'cfbscrapR'
library(tidyverse)
year = 2001:2019
week = 1:14
df = expand.grid(year, week) %>% setNames(c('year', 'week'))
df %>%
  filter(year == 2019) %>%
  mutate(pbp = purrr::pmap(
    list(x = year,
         y = week),
    .f = function(x, y) {
      cfb_pbp_data(
        year = x,
        week = y,
        team = "Pittsburgh",
        epa_wpa = TRUE
      )
    }
  )) -> pitt
#> Error in is_character(x): object 'elapsed.seconds' not found
Created on 2019-12-29 by the reprex package (v0.3.0)
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3
[4] purrr_0.3.3 readr_1.3.1 tidyr_1.0.0
[7] tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.3.0
[10] cfbscrapR_0.0.0.9000
loaded via a namespace (and not attached):
[1] tidyselect_0.2.5 splines_3.6.1 haven_2.2.0 lattice_0.20-38
[5] colorspace_1.4-1 vctrs_0.2.0 generics_0.0.2 yaml_2.2.0
[9] mgcv_1.8-28 rlang_0.4.2 pillar_1.4.2 withr_2.1.2
[13] glue_1.3.1 DBI_1.0.0 dbplyr_1.4.2 modelr_0.1.5
[17] readxl_1.3.1 lifecycle_0.1.0 munsell_0.5.0 gtable_0.3.0
[21] cellranger_1.1.0 rvest_0.3.5 curl_4.3 broom_0.5.2
[25] Rcpp_1.0.3 backports_1.1.5 scales_1.1.0 jsonlite_1.6
[29] fs_1.3.1 hms_0.5.2 stringi_1.4.3 ggrepel_0.8.1
[33] grid_3.6.1 cli_1.1.0 tools_3.6.1 magrittr_1.5
[37] lazyeval_0.2.2 crayon_1.3.4 pkgconfig_2.0.3 zeallot_0.1.0
[41] ellipsis_0.3.0 Matrix_1.2-17 xml2_1.2.2 reprex_0.3.0
[45] lubridate_1.7.4 assertthat_0.2.1 httr_1.4.1 rstudioapi_0.10
[49] R6_2.4.1 nnet_7.3-12 nlme_3.1-140 compiler_3.6.1
I am getting this error when I am trying to get the pbp data for 2019.
Error: Names must be unique.
x These names are duplicated:
I am using this code to get the data which worked for me previously.
df <- cfb_pbp_data(year = 2019, season_type = "regular", week = NULL, epa_wpa = TRUE) %>% mutate(year = 2019)
R Version: 3.6.12
Description
If you try to scrape the per-play player info for all games of the 2019 season, you get the error message 'Error: Column `Rush` is of unsupported type NULL'.
It works for the SEC but not for the game IDs of the other conferences.
Reprex
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.6.2
#> Warning: package 'ggplot2' was built under R version 3.6.2
#> Warning: package 'tibble' was built under R version 3.6.2
#> Warning: package 'tidyr' was built under R version 3.6.2
#> Warning: package 'readr' was built under R version 3.6.2
#> Warning: package 'purrr' was built under R version 3.6.2
#> Warning: package 'dplyr' was built under R version 3.6.2
#> Warning: package 'stringr' was built under R version 3.6.2
#> Warning: package 'forcats' was built under R version 3.6.2
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom' when
#> loading 'cfbscrapR'
library(reprex)
#> Warning: package 'reprex' was built under R version 3.6.2
# Loading cfb_game_info Data for the 2019 season
game_info <- cfb_game_info(year = 2019, season_type = "both")
# Loading cfb_play_stats_player Data for all games loaded before:
player_info <- data.frame()
for (i in factor(game_info$id)) {
  data <- cfb_play_stats_player(gameId = i)
  df <- data.frame(data)
  player_info <- bind_rows(player_info, df)
}
#> Warning: Values in `athleteName` are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list(athleteName = list)` to suppress this warning.
#> * Use `values_fn = list(athleteName = length)` to identify where the duplicates arise
#> * Use `values_fn = list(athleteName = summary_fun)` to summarise duplicates
#> Error: Column `Rush` is of unsupported type NULL
Created on 2020-02-18 by the reprex package (v0.3.0)
R Version
sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
#> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
#> [5] LC_TIME=German_Germany.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_3.6.1 magrittr_1.5 tools_3.6.1 htmltools_0.4.0
#> [5] yaml_2.2.1 Rcpp_1.0.3 stringi_1.4.4 rmarkdown_2.1
#> [9] highr_0.8 knitr_1.28 stringr_1.4.0 xfun_0.12
#> [13] digest_0.6.23 rlang_0.4.4 evaluate_0.14
Created on 2020-02-18 by the reprex package (v0.3.0)
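To narrow down which game IDs trigger the failure, one option is to fetch each game separately and record the ones that error instead of stopping the whole loop. The sketch below is only an illustration: `collect_safely` is a hypothetical helper, and it only catches errors raised inside the fetch call. If the error is raised later while combining the data frames, keeping the per-game results in a list still makes it possible to inspect them one at a time.

```r
# Hypothetical helper: call `fetch` for each id, keep successes, record failures.
collect_safely <- function(ids, fetch) {
  results <- list()
  failed <- character(0)
  for (id in ids) {
    out <- tryCatch(fetch(id), error = function(e) NULL)
    if (is.null(out)) {
      failed <- c(failed, as.character(id))
    } else {
      results[[as.character(id)]] <- as.data.frame(out)
    }
  }
  list(results = results, failed = failed)
}

# Usage sketch (assumes cfbscrapR is installed and game_info exists as in the reprex):
# res <- collect_safely(game_info$id, function(id) cfb_play_stats_player(gameId = id))
# res$failed   # game ids that raised an error
# player_info <- dplyr::bind_rows(res$results)
```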
Game Excitement Index for WP plots.
## devtools::install_github("meysubb/cfbscrapR")
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom' when
#> loading 'cfbscrapR'
library(tidyverse)
year = 2001:2019
week = 1:14
df = expand.grid(year, week) %>% setNames(c('year', 'week'))
df %>%
  mutate(pbp = purrr::pmap(
    list(x = year,
         y = week),
    .f = function(x, y) {
      cfb_pbp_data(
        year = x,
        week = y,
        team = "Pittsburgh",
        epa_wpa = TRUE
      )
    }
  )) -> pitt
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#> for this one week or team.
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#> for this one week or team.
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#> for this one week or team.
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#> for this one week or team.
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#> for this one week or team.
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#> for this one week or team.
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#> for this one week or team.
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#> for this one week or team.
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#> for this one week or team.
#> Warning in log(adj_yd_line): NaNs produced
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#> for this one week or team.
#> Warning in log(adj_yd_line): NaNs produced
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#> for this one week or team.
#> Warning in cfb_pbp_data(year = x, week = y, team = "Pittsburgh", epa_wpa = TRUE): Most likely a bye week, the data pulled from the API was empty. Returning nothing
#> for this one week or team.
#> Error in `[<-.data.frame`(`*tmp*`, kickoff_ind, "ep_before", value = c(0, : replacement has 7 rows, data has 1
Created on 2019-12-29 by the reprex package (v0.3.0)
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.3
[5] readr_1.3.1 tidyr_1.0.0 tibble_2.1.3 ggplot2_3.2.1
[9] tidyverse_1.3.0 cfbscrapR_0.0.1
loaded via a namespace (and not attached):
[1] ggrepel_0.8.1 Rcpp_1.0.3 lubridate_1.7.4 lattice_0.20-38
[5] ps_1.3.0 assertthat_0.2.1 zeallot_0.1.0 digest_0.6.23
[9] R6_2.4.1 cellranger_1.1.0 backports_1.1.5 reprex_0.3.0
[13] evaluate_0.14 httr_1.4.1 pillar_1.4.2 rlang_0.4.2
[17] lazyeval_0.2.2 curl_4.3 readxl_1.3.1 rstudioapi_0.10
[21] whisker_0.4 callr_3.3.2 Matrix_1.2-17 rmarkdown_1.18
[25] splines_3.6.1 munsell_0.5.0 broom_0.5.2 compiler_3.6.1
[29] modelr_0.1.5 xfun_0.11 pkgconfig_2.0.3 clipr_0.7.0
[33] mgcv_1.8-28 htmltools_0.4.0 nnet_7.3-12 tidyselect_0.2.5
[37] crayon_1.3.4 dbplyr_1.4.2 withr_2.1.2 grid_3.6.1
[41] nlme_3.1-140 jsonlite_1.6 gtable_0.3.0 lifecycle_0.1.0
[45] DBI_1.0.0 magrittr_1.5 scales_1.1.0 cli_1.1.0
[49] stringi_1.4.3 fs_1.3.1 xml2_1.2.2 ellipsis_0.3.0
[53] generics_0.0.2 vctrs_0.2.0 tools_3.6.1 glue_1.3.1
[57] hms_0.5.2 processx_3.4.1 yaml_2.2.0 colorspace_1.4-1
[61] sessioninfo_1.1.1 rvest_0.3.5 knitr_1.26 haven_2.2.0
I'm using this code to get pbp data.
cpbp <- data.frame()
for (j in 2012:2019) {
  data <- data.frame()
  for (i in 1:15) {
    data2 <-
      cfb_pbp_data(year = j, season_type = "both", week = i, epa_wpa = FALSE) %>%
      mutate(week = i, year = j)
    data2 <- data.frame(data2)
    data <- bind_rows(data, data2)
  }
  cpbp <- bind_rows(cpbp, data)
}
However I'm getting this error:
Error in readBin(3L, raw(0), 32768L) :
transfer closed with 10396534 bytes remaining to read
I had no problem getting the data previously and have re-installed the package to no avail. Thanks
R Version
3.6.2
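A "transfer closed" error usually indicates that the connection dropped partway through a download, so a retry wrapper may help when pulling many weeks in a row. This is a sketch under that assumption, not a confirmed fix: `with_retry` is a hypothetical helper that is not part of cfbscrapR.

```r
# Hypothetical helper: retry a function call a few times before giving up.
with_retry <- function(f, times = 3, wait = 1) {
  for (attempt in seq_len(times)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt < times) {
      Sys.sleep(wait * attempt)  # wait a little longer after each failure
    }
  }
  stop(result)  # re-raise the last error once all attempts are used up
}

# Usage sketch (assumes cfbscrapR is installed):
# data2 <- with_retry(function() {
#   cfb_pbp_data(year = 2019, season_type = "both", week = 1, epa_wpa = FALSE)
# })
```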
Description:
The 2019 Michigan-Rutgers game says Michigan's first play on offense, 1st & 10 from the 20, was a run that went for 6 yards, but was worth -0.58 EPA. That's much lower than I expected.
Reprex:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom'
#> when loading 'cfbscrapR'
library(reprex)
pbp_2019 <- data.frame()
for (i in 1:15) {
  data <-
    cfb_pbp_data(year = 2019, season_type = "both", week = i, epa_wpa = TRUE) %>%
    mutate(week = i, year = 2019)  # label each pull with its own week
  data <- data.frame(data)
  pbp_2019 <- bind_rows(pbp_2019, data)
}
pbp_2019 %>%
filter(offense_play == "Michigan",
defense_play == "Rutgers") %>%
select(offense_play,
defense_play,
drive_id,
half,
clock.minutes,
clock.seconds,
offense_score,
defense_score,
play_type,
down,
distance,
adj_yd_line,
yards_gained,
ep_before,
ep_after,
EPA) %>% head()
#> offense_play defense_play drive_id half clock.minutes clock.seconds
#> 1 Michigan Rutgers 4011122251 1 30 0
#> 2 Michigan Rutgers 4011122251 1 30 0
#> 3 Michigan Rutgers 4011122251 1 30 0
#> 4 Michigan Rutgers 4011122251 1 30 0
#> 5 Michigan Rutgers 4011122251 1 27 52
#> 6 Michigan Rutgers 4011122252 1 27 52
#> offense_score defense_score play_type down distance adj_yd_line
#> 1 0 0 Rush 1 10 80
#> 2 0 0 Pass Reception 2 4 74
#> 3 0 0 Rush 1 10 60
#> 4 0 0 Rush 2 8 58
#> 5 7 0 Passing Touchdown 1 10 48
#> 6 7 0 Kickoff 1 0 78
#> yards_gained ep_before ep_after EPA
#> 1 6 0.66694425 0.08348048 -0.5834638
#> 2 14 0.08348048 2.05876294 1.9752825
#> 3 2 2.05876294 1.29938456 -0.7593784
#> 4 10 1.29938456 2.93682006 1.6374355
#> 5 48 2.93682006 7.00000000 4.0631799
#> 6 19 1.06557103 0.85929598 -0.2062750
Created on 2020-01-11 by the reprex package (v0.3.0)
R version:
sessionInfo()
#> R version 3.5.3 (2019-03-11)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_3.5.3 magrittr_1.5 tools_3.5.3 htmltools_0.3.6
#> [5] yaml_2.2.0 Rcpp_1.0.1 stringi_1.4.3 rmarkdown_1.12
#> [9] highr_0.8 knitr_1.22 stringr_1.4.0 xfun_0.6
#> [13] digest_0.6.18 evaluate_0.13
Created on 2020-01-11 by the reprex package (v0.3.0)
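As a quick check on the numbers in the reprex output, the reported EPA matches ep_after minus ep_before on all six rows, so the surprising -0.58 traces back to the expected-points values themselves (0.67 before the play, 0.08 after it) rather than to the subtraction. The snippet below only re-does that arithmetic on values copied from the output; it says nothing about how the package computes ep_before or ep_after.

```r
# Values copied from the reprex output (first six rows).
plays <- data.frame(
  ep_before = c(0.66694425, 0.08348048, 2.05876294, 1.29938456, 2.93682006, 1.06557103),
  ep_after  = c(0.08348048, 2.05876294, 1.29938456, 2.93682006, 7.00000000, 0.85929598),
  EPA       = c(-0.5834638, 1.9752825, -0.7593784, 1.6374355, 4.0631799, -0.2062750)
)

# Recompute EPA as the change in expected points and compare with the reported column.
plays$recomputed <- plays$ep_after - plays$ep_before
max_diff <- max(abs(plays$recomputed - plays$EPA))
max_diff < 1e-6  # TRUE when the reported EPA is just ep_after - ep_before
```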
The play-by-play sequencing plot breaks if you pull the PBP data with the epa_wpa = T parameter.
I'm trying to get play-by-play data at the player level using cfb_play_stats_player. Some rows seem to contain both the player and the receiver/rusher/sacking player in one string, separated by an underscore. I would split them on the underscore, but it's not consistent which name appears first or whether the play was a sack, reception, etc.
I installed the package, and was trying to pull the player stats using the various game IDs. But some work and some do not:
cfb_play_stats_player(401012246)
#> Error in UseMethod("tbl_vars") :
#>   no applicable method for 'tbl_vars' applied to an object of class "list"
EPA is calculated with respect to the team that has the ball.
So for punts, we need to use -1 * EPA or something along those lines.
Currently, punt return TDs are scored as extremely generous to the team that punted, which doesn't make sense. They should actually be detrimental.
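A rough sketch of the sign flip described above, on made-up numbers. The play_type labels, the EPA values, and the `EPA_adj` column are illustrative assumptions, not package output, and matching on the word "Punt" is only one possible way to pick out the affected plays.

```r
# Toy data with invented values, for illustration only.
plays <- data.frame(
  play_type = c("Rush", "Punt", "Punt Return Touchdown"),
  EPA = c(0.5, -0.3, 6.2),
  stringsAsFactors = FALSE
)

# Flip the sign on punt plays so the value is credited from the punting team's side.
is_punt <- grepl("Punt", plays$play_type, fixed = TRUE)
plays$EPA_adj <- ifelse(is_punt, -1 * plays$EPA, plays$EPA)
plays
```

Whether every punt should be flipped, or only returns that end in a score, is left open in the issue.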
I noticed that "Rushing Touchdown" plays are classified as rush = 0 when pulling from cfb_pbp_data(). I was able to correct it and then get the correct number of touchdowns to show up.
library(tidyverse)
library(cfbscrapR)
#> Warning: replacing previous import 'mgcv::multinom' by 'nnet::multinom' when
#> loading 'cfbscrapR'
pbp_2019 <- data.frame()
for (i in 1:15) {
  data <- cfb_pbp_data(year = 2019, season_type = "both", week = i, epa_wpa = TRUE) %>%
    mutate(week = i, year = 2019)
  df <- data.frame(data)
  pbp_2019 <- bind_rows(pbp_2019, df)
}
test <- pbp_2019 %>% filter(rush == 1 | pass == 1) %>% filter(down == 3 | down == 4)
test %>% count(play_type)
#> # A tibble: 9 x 2
#> play_type n
#> <chr> <int>
#> 1 Fumble Recovery (Opponent) 175
#> 2 Fumble Recovery (Own) 142
#> 3 Pass Incompletion 6422
#> 4 Pass Interception Return 490
#> 5 Pass Reception 7646
#> 6 Passing Touchdown 835
#> 7 Rush 9909
#> 8 Sack 1478
#> 9 Safety 9
pbp_2019<- pbp_2019 %>% mutate(rush = ifelse(play_type == "Rushing Touchdown", 1, rush))
test <- pbp_2019 %>% filter(rush == 1 | pass == 1) %>% filter(down == 3 | down == 4)
test %>% count(play_type)
#> # A tibble: 10 x 2
#> play_type n
#> <chr> <int>
#> 1 Fumble Recovery (Opponent) 175
#> 2 Fumble Recovery (Own) 142
#> 3 Pass Incompletion 6422
#> 4 Pass Interception Return 490
#> 5 Pass Reception 7646
#> 6 Passing Touchdown 835
#> 7 Rush 9909
#> 8 Rushing Touchdown 633
#> 9 Sack 1478
#> 10 Safety 9
Created on 2020-01-07 by the reprex package (v0.3.0)
Describe the bug
When extracting EPA by week like this
cfb_regular_play_2019 <- cfb_pbp_data(2019, season_type = "regular", week = 14, epa_wpa = TRUE)
cfb_osu = cfb_regular_play_2019 %>% filter(game_id == 401112228)
The top 3 negative EPA plays look off.
However when I pull the same data like this
df = cfb_pbp_data(year=2019,week=14,team='Ohio State',epa_wpa = T)
The EPA calcs look fine. Look into this.
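One way to look into this is to join the two pulls on a column that uniquely identifies a play and list the rows where EPA disagrees. The sketch below uses toy data; `compare_epa` is a hypothetical helper, and the key column name `play_id` is an assumption to be replaced with whichever identifier both pulls share.

```r
# Hypothetical helper: list plays whose EPA differs between two pulls.
compare_epa <- function(a, b, key, tol = 1e-8) {
  both <- merge(
    a[, c(key, "EPA")], b[, c(key, "EPA")],
    by = key, suffixes = c("_week", "_team")
  )
  both$diff <- both$EPA_week - both$EPA_team
  both[which(abs(both$diff) > tol), , drop = FALSE]
}

# Toy data with invented values, standing in for the week-level and team-level pulls.
week_pull <- data.frame(play_id = 1:3, EPA = c(0.1, -2.0, 0.5))
team_pull <- data.frame(play_id = 1:3, EPA = c(0.1, 1.2, 0.5))
mismatches <- compare_epa(week_pull, team_pull, key = "play_id")
mismatches
```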