droher / boxball

Prebuilt Docker images with Retrosheet's complete baseball history data for many analytical frameworks. Includes Postgres, cstore_fdw, MySQL, SQLite, Clickhouse, Drill, Parquet, and CSV.

License: Apache License 2.0

Dockerfile 5.63% Shell 0.48% Python 93.89%
retrosheet docker postgres sql baseball sports sqlite mysql column-store postgresql containers baseballdatabank sabermetrics sports-stats sports-data clickhouse play-by-play apache-drill

boxball's Introduction


Update: I have released a new project, baseball.computer, which is designed as the successor to boxball. It is much easier to use (no Docker required; it runs entirely in your browser/program) and includes many more tables, features, and quality controls. The event schema is different, which will be the main migration pain point. I aim to continue Boxball maintenance and updates as long as people are still using it, and I may try to rebase boxball on top of the new project to make maintaining both easier. Please let me know if there are things you can do in Boxball that you can't do yet in baseball.computer by filing an issue on the repo or reaching me at [email protected].

Introduction

Boxball creates prepopulated databases of the two most significant open source baseball datasets: Retrosheet and the Baseball Databank. Retrosheet contains information on every major-league pitch since 2000, every play since 1928, every box score since 1901, and every game since 1871. The Databank (based on the Lahman Database) contains yearly summaries for every player and team in history. In addition to the data and databases themselves, Boxball relies on the following tools:

  • Docker for repeatable builds and easy distribution
  • SQLAlchemy for abstracting away DDL differences between databases
  • Chadwick for translating Retrosheet's complex event files into a relational format

Follow the instructions below to install your distribution of choice. The full set of images is also available on Docker Hub.

The Retrosheet schema is extensively documented in the code; see the source here until I find a prettier solution.

If you find the project useful, please consider donating.

Feel free to contact me with questions or comments!

Requirements

  • Docker (v18.06 or later; earlier versions may not work)
  • 2-20 GB of disk space (depends on distribution choice)
  • 500 MB-8 GB of RAM available to Docker (depends on distribution choice)

Distributions

Column-Oriented Databases

Postgres cstore_fdw (Recommended)

This distribution uses the cstore_fdw extension to turn PostgreSQL into a column-oriented database. This means that you get the rich feature set of Postgres, but with a huge improvement in speed and disk usage. To install and run the database server:

docker run --name postgres-cstore-fdw -d -p 5433:5432 -e POSTGRES_PASSWORD="postgres" -v ~/boxball/postgres-cstore-fdw:/var/lib/postgresql/data doublewick/boxball:postgres-cstore-fdw-latest

Roughly an hour after the image is downloaded, the data will be fully loaded into the database, and you can connect to it as the user postgres with password postgres on port 5433 (either using the psql command line tool or a database client of your choice). The data will be persisted on your machine in ~/boxball/postgres-cstore-fdw (~1.5GB), which means you can stop/remove the container without having to reload the data when you turn it back on.
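
As a quick smoke test once the load completes, a couple of example queries (retrosheet_event is a table in the Boxball schema; treat these as illustrative sketches):

```sql
-- Confirm the play-by-play data finished loading
SELECT COUNT(*) FROM retrosheet_event;

-- List the tables the image created
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY 1, 2;
```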

Clickhouse

Clickhouse is a database developed by Yandex with some very impressive performance benchmarks. It uses less disk space than Postgres cstore_fdw, but significantly more RAM (~5GB). I've yet to run any query performance comparisons. To install and run the database server:

docker run --name clickhouse -d -p 8123:8123 -v ~/boxball/clickhouse:/var/lib/clickhouse doublewick/boxball:clickhouse-latest

15-30 minutes after the image is downloaded, the data will be fully loaded into the database, and you can connect to it either by attaching the container and using the clickhouse-client CLI or by using a local database client on port 8123 as the user default. The data will be persisted on your machine in ~/boxball/clickhouse (~700MB), which means you can stop/remove the container without having to reload the data when you turn it back on.
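
Once loaded, a sketch of a query to run from clickhouse-client (retrosheet_daily and its player_id column come from the Boxball schema; the query itself is only illustrative):

```sql
-- Top 10 players by number of daily appearance rows
SELECT player_id, count() AS n
FROM retrosheet_daily
GROUP BY player_id
ORDER BY n DESC
LIMIT 10;
```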

Drill

Drill is a framework that allows for SQL queries directly on files, without having to declare any schema. It is usually used on a computing cluster with massive datasets, but we use a single-node setup. To install and run:

docker run --name drill -id -p 8047:8047 -p 31010:31010 -v ~/boxball/drill:/data doublewick/boxball:drill-latest

Data will be immediately available to query after the image is downloaded. Use port 8047 to access the Web UI (which includes a SQL runner) and port 31010 to connect via a database client. You may also attach the container and query from the command line. The data will be persisted on your machine in ~/boxball/drill (~700MB).
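
Because Drill queries files directly, a query names the file path through the dfs storage plugin. A hedged sketch (the exact file layout under /data is an assumption and may differ in the image):

```sql
-- Illustrative only: the path below is an assumption about the image layout
SELECT COUNT(*) FROM dfs.`/data/retrosheet/event.parquet`;
```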

Traditional (Row-oriented) Databases

Note: these frameworks are likely to be prohibitively slow when querying play-by-play data, and they take up significantly more disk space than their columnar counterparts.

Postgres

Similar configuration to the cstore_fdw extended version above, but stored in the conventional way.

docker run --name postgres -d -p 5432:5432 -e POSTGRES_PASSWORD="postgres" -v ~/boxball/postgres:/var/lib/postgresql/data doublewick/boxball:postgres-latest

Roughly 90 minutes after the image is downloaded, the data will be fully loaded into the database, and you can connect to it as the user postgres with password postgres on port 5432 (either using the psql command line tool or a database client of your choice). The data will be persisted on your machine in ~/boxball/postgres (~12GB), which means you can stop/remove the container without having to reload the data when you turn it back on.

MySQL

To install and run:

docker run --name mysql -d -p 3306:3306 -v ~/boxball/mysql:/var/lib/mysql doublewick/boxball:mysql-latest

Roughly two hours after the image is downloaded, the data will be fully loaded into the database, and you can connect to it as the user root on port 3306. The data will be persisted on your machine in ~/boxball/mysql (~12GB), which means you can stop/remove the container without having to reload the data when you turn it back on.

SQLite (with web UI)

To install and run:

docker run --name sqlite -d -p 8080:8080 -v ~/boxball/sqlite:/db doublewick/boxball:sqlite-latest

Roughly two minutes after the image is downloaded, the data will be fully loaded into the database. localhost:8080 will provide a web UI where you can write queries and perform schema exploration.
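
From the web UI's query box (or the sqlite3 CLI), the schema can be explored with SQLite's built-in catalog table:

```sql
-- List all tables in the Boxball SQLite database
SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;
```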

Flat File Downloads

Parquet

Parquet is a columnar data format originally developed for the Hadoop ecosystem. It has solid support in Spark, Pandas, and many other frameworks. Download: OneDrive.

CSV

The original CSVs from the extract step, each compressed in the ZSTD format. Download: OneDrive.

Acknowledgements

Ted Turocy's Chadwick Bureau developed the tools and repos that made this project possible. I am also grateful to Sean Lahman for creating his database, which I have been using for over 15 years. I was able to develop and host this project for free thanks to the generous open-source plans of Jetbrains, CircleCI, Github, and Docker Hub.

Retrosheet represents the collective effort of thousands of baseball fans over 150 years of scorekeeping and data entry. I hope Boxball facilitates more historical research to continue this tradition.

License(s)

All code is released under the Apache 2.0 license. Baseball Databank data is distributed under the CC-SA 4.0 license. Retrosheet data is released under the condition that the below text appear prominently:

The information used here was obtained free of
charge from and is copyrighted by Retrosheet.  Interested
parties may contact Retrosheet at "www.retrosheet.org".

boxball's People

Contributors

brayanmnz, dependabot[bot], droher, zou000


boxball's Issues

Need help? Want new features/endpoints?

I'm looking to help make this data as accessible as possible for experienced researchers and novice data analysts alike -- please let me know if anything would help! This includes documentation, additional features, new data sources, or new load endpoints (including the ones listed in the other features so I can prioritize).

MySQL Container fails while loading data

Describe the bug
Error when loading the MySQL tag

To Reproduce
Ran the suggested MySQL start command:
docker run --name mysql -d -p 3306:3306 -v ~/boxball/mysql:/var/lib/mysql doublewick/boxball:mysql-latest

Specs
Ubuntu through WSL2 on Windows 10

Error Message
2022-11-23 17:18:07 ERROR 1262 (01000) at line 2063: Row 112998 was truncated; it contained more data than there were input columns

The failure appears to happen while loading the Retrosheet event CSV file.

Additional context
Full container logs: https://pastebin.com/bD53UHXE

retrosheet_event table empty for postgres-cstore-fdw

Executing SELECT COUNT(*) FROM retrosheet_event; returns 0. I've tried the latest image and 2020.2.

To Reproduce
Steps to reproduce the behavior:

  1. docker run --name postgres-cstore-fdw -d -p 5433:5432 -e POSTGRES_PASSWORD="postgres" -v ~/boxball/postgres-cstore-fdw:/var/lib/postgresql/data doublewick/boxball:postgres-cstore-fdw-latest
  2. psql -U postgres -p 5433
  3. SELECT COUNT(*) FROM retrosheet_event
  4. Output is zero.

Expected behavior
Is the event data still being included in the images?

Desktop (please complete the following information):

  • OS: Windows 10 w/ WSL 2-based docker
  • Version: doublewick/boxball:postgres-cstore-fdw-latest

Separate data files from image

The data files are currently baked into the images for each database, which is not ideal because they're a single huge layer and Docker can't share it between images even though the data is the same. Instead, the data should be left out of each image, and downloaded when the container starts. The download should first go to some general shareable volume between images, so that it gets cached and only has to be downloaded once across multiple images.

As a prerequisite, I'll need to add the data upload to OneDrive/wherever as part of the build instead of something I do manually.

Update with 2021 Season data

I believe baseball-databank has come out with data files that now include 2021 season data.

Is it possible to include those?

Update Parquet Files

Hey...

This may be a weird thing to ask here, but can you update the parquet files with updated retrosheet data? They've been updated with 2019 events.

If you can, thanks.

Update data to the Summer 2023 Release

Hi,

Tom Thress recently announced the availability of the first semi-annual Summer 2023 Release.

https://www.retrosheet.org/june2023release.html

I was wondering if boxball can be updated to reflect the changes. Mainly, I'm interested in the new optional field, gametype, that's now included. It's been a real sore point for me to figure out based on game dates and what not.

A new info record type was introduced in 2023 describing the type of game.

info,gametype,regular

The possible types are:
regular
exhibition
preseason
allstar
playoff
worldseries
lcs
divisionseries
wildcard
championship

Thank you so much for putting this project together.

Inconsistencies in Parquet download files

Describe the bug
I downloaded the parquet files directly from the OneDrive link included in the repo's readme, and have been reading them with pyarrow and pandas. In digging through the data (2022 only, so far), I discovered two problems, one of which led me to the other.

  1. First (and maybe more of a feature request than a bug) is the fact that, best I can tell, there is no easy way to tell whether games are regular season or postseason/allstar/other from any of the parquet files (I would particularly expect game.parquet, gamelog.parquet, or schedule.parquet to have an indicator column for this, but I do not see one).

  2. Second appears to be more of a bug. Contrary to no. 1 above, schedule.parquet only seems to include regular season games. Great, we can simply filter game.parquet and other files by whether or not the game exists in schedule.parquet, right? Nope: schedule.parquet seems to list games that never actually occurred. As an example, schedule.parquet includes a game MIL @ CHN on 2022-04-08, not part of a double header. However, games.parquet (as well as baseball-reference and other sources) tells us that no such game exists! MIL @ CHN games occurred on 4/7/22 and 4/9/22, but not 4/8/22. I find 88 of these 'phantom games' in schedule.parquet for 2022. And it's not a byproduct of the pandemic or lockout; I found these inconsistencies in every year I've looked at, as far back as 2000.
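
One way to surface phantom rows like these is an anti-join between the two files after loading them into any of the database distributions. The table and column names below are hypothetical placeholders; substitute whatever columns actually identify a game in the schema:

```sql
-- game_id stands in for the real identifying column(s)
SELECT s.*
FROM schedule AS s
LEFT JOIN game AS g ON g.game_id = s.game_id
WHERE g.game_id IS NULL;
```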

To Reproduce
Steps to reproduce the behavior:

  1. Compare schedule.parquet and game.parquet as described above.

Expected behavior
Games will be consistent across files, and a column easily delineates whether games are regular season or not.

I am entirely open to the fact that the files are exactly as intended and I am just missing something that explains the discrepancies, let me know if that is the case. Thanks!

retrosheet_daily table missing game.source

cwdaily outputs daily lines for each player, which include the source for the game information.
For games with multiple sources, there will be multiple daily entries for a given (player_id, game_id) tuple, and right now there's no column that can be used to disambiguate.

E.g. select * from retrosheet_daily where game_dt = '1943-06-19' and player_id = 'mackr101';

yields 5 rows for 2 games (two halves of a double header), with each game having a box score and a deduced game, according to https://raw.githubusercontent.com/chadwickbureau/retrosplits/master/daybyday/playing-1943.csv. I'm not sure why Mack in particular has 2 deduced game entries for CHA194306191, but that's probably an issue in Chadwick.
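
A grouped query makes the ambiguity easy to enumerate (retrosheet_daily, player_id, and game_id are the names used above; the query itself is a sketch):

```sql
-- (player_id, game_id) pairs with more than one daily row,
-- which cannot be disambiguated without a source column
SELECT player_id, game_id, COUNT(*) AS n
FROM retrosheet_daily
GROUP BY player_id, game_id
HAVING COUNT(*) > 1
ORDER BY n DESC;
```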

Missing Retrosheet player data

Describe the bug
When analyzing RetroSheet data, there is no way to join to player information.

Additional context
Retrosheet offers a biofile.csv data source that does not come in from the upstream Chadwick repository. This data would be super helpful to have as a table within the retrosheet schema.

Reference: https://www.retrosheet.org/biofile.htm

SQLite image is not working

Hi,
I found that the sqlite image is not working when I follow the usage in the README, which is
docker run --name sqlite -d -p 8080:8080 -v ~/boxball/sqlite:/db doublewick/boxball:sqlite-0.0.2

The container will exit immediately and the error message shows:

$ docker logs sqlite 
zstd: can't stat /db/boxball.db.zst : No such file or directory -- ignored

I tried not mounting the data folder and hit another problem: the container doesn't print any message and exits after a while. The weird thing is that the container is still displayed if you use docker ps to show status.

$ docker ps
CONTAINER ID        IMAGE                             COMMAND                  CREATED              STATUS              PORTS                               NAMES
8618c27cc60c        doublewick/boxball:sqlite-0.0.2   "/bin/sh -c 'zstd --…"   About a minute ago   Up About a minute   0.0.0.0:8080->8080/tcp              sqlite

But it's actually exited if you use docker inspect sqlite to display the status:

$ docker inspect sqlite
[
    {
        "Id": "8618c27cc60c2d227977a29ba7942b5d63e22daa227a376b079858523a9345ea",
        "Created": "2019-06-14T05:41:52.886829683Z",
        "Path": "/bin/sh",
        "Args": [
            "-c",
            "zstd --rm -d /db/boxball.db.zst && sqlite_web -H 0.0.0.0 -x /db/boxball.db"
        ],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 73,
            "Error": "",
            "StartedAt": "2019-06-14T05:41:53.700159123Z",
            "FinishedAt": "2019-06-14T05:42:23.108049305Z"
        },

I have no idea why this is happening; my guess is that the data is too large.

Create markdown documentation for database schemas

Is your feature request related to a problem? Please describe.
Currently, the database schemas are only documented in code.

Describe the solution you'd like
Create a markdown page with a more human readable representation of the database structure

Describe alternatives you've considered
There are other more human readable formats that could be used, but Github markdown seems to be a natural fit.

Additional context
I have a script that generates an (admittedly basic) markdown page for the schemas in my fork's transform/src/schemas directory.

2020 Season Data

I saw that 2020 went up on Retrosheet.

Is it possible to update the docker images with that data included?

Permission error in build script

Describe the bug
Can't get the db to build.

To Reproduce
docker run --name postgres-cstore-fdw -d -p 5433:5432 -e POSTGRES_PASSWORD="postgres" doublewick/boxball:postgres-cstore-fdw-latest

Expected behavior
I want the built database to be inside the container, but when I take out the -v flag on the docker run command I'm hit with the directory-not-empty error. I expect the postgres db to exist in /var/lib/postgresql/data inside the container.

Error Message

initdb: error: directory "/var/lib/postgresql/data" exists but is not empty
If you want to create a new database system, either remove or empty
the directory "/var/lib/postgresql/data" or run initdb
with an argument other than "/var/lib/postgresql/data".

Desktop (please complete the following information):
Docker Desktop on Windows 10

Billy Herman 1934 allstar_full_pkey unique violation in 2020 postgres and mysql images, value in park.csv exceeds varchar(45) in 2020 postgres_cstore image

Tested a few of the new images and three of the containers were exiting shortly after launching. Removed the image and cleared the cache for each of these three and retried, but got the same error codes. Outputted their logs (see below) and saw these issues. I only managed to try postgres, postgres_cstore, mysql, and clickhouse. Clickhouse worked with no issues. Docker version 19.03.8, build afacb8b, running on CentOS Linux release 7.8.2003 (Core). These could probably be separated into two different issues, but I'm grouping them together because they're both a result of the 2020 updates.

Postgres Log
2020-05-02 05:15:21.721 UTC [42] ERROR: duplicate key value violates unique constraint "allstar_full_pkey"
2020-05-02 05:15:21.721 UTC [42] DETAIL: Key (player_id, year_id, game_num)=(hermabi01, 1934, 0) already exists.
2020-05-02 05:15:21.721 UTC [42] CONTEXT: COPY allstar_full, line 69
2020-05-02 05:15:21.721 UTC [42] STATEMENT: COPY baseballdatabank.allstar_full(player_id, year_id, game_num, game_id, team_id, lg_id, gp, starting_pos) FROM PROGRAM 'zstd --rm -cd /data/baseballdatabank/allstar_full.csv.zst' WITH (FORMAT CSV, FORCE_NULL(player_id, year_id, game_num, game_id, team_id, lg_id, gp, starting_pos));
psql:/docker-entrypoint-initdb.d/postgres.sql:535: ERROR: duplicate key value violates unique constraint "allstar_full_pkey"
DETAIL: Key (player_id, year_id, game_num)=(hermabi01, 1934, 0) already exists.
CONTEXT: COPY allstar_full, line 69

MySQL Log
/usr/local/bin/docker-entrypoint.sh: running /docker-entrypoint-initdb.d/mysql.sql ERROR 1062 (23000) at line 536: Duplicate entry 'hermabi01-1934-0' for key 'PRIMARY'

Note: An issue exists on baseballdatabank for this Billy Herman record. He played twice in the game. 🤷‍♂️
https://github.com/chadwickbureau/baseballdatabank/issues/104

Postgres_cstore Log
/data/baseballdatabank/parks.csv.zst: 11602 bytes
2020-05-02 05:23:59.954 UTC [40] ERROR: value too long for type character varying(45)
2020-05-02 05:23:59.954 UTC [40] STATEMENT: COPY baseballdatabank_parks(park_id, park_name, park_alias, city, state, country) FROM PROGRAM 'zstd --rm -cd /data/baseballdatabank/parks.csv.zst' WITH (FORMAT CSV, FORCE_NULL(park_id, park_name, park_alias, city, state, country));
psql:/docker-entrypoint-initdb.d/postgres_cstore_fdw.sql:527: ERROR: value too long for type character varying(45)

Row-based Postgres not populating

Describe the bug
Waited well over 90 minutes for the row-based postgres docker image to populate, and it seems like the data isn't loading into the postgres database. The data appears to exist in the boxball directory, but it's not showing when I try to access it in the database.

I installed the sqlite image to see whether the problem affected all the images, but it worked as expected, so it might just be this particular image that has a problem.

To Reproduce
Steps to reproduce the behavior:

  1. Follow instructions to start up row-based postgres docker image
  2. Leave up for 6 hours, and the data is not populated

Expected behavior
Data should populate in around 90 minutes


Desktop:

  • OS: Fedora 39 (Workstation) x86_64
  • Kernel: 6.5.9-300.fc39.x86_64
  • Docker: 24.0.7

Automatically trigger build when new data is published

There are a few manual tasks I need to do in order to update Boxball when Retrosheet publishes new data. While they don't take very long, they're still a pain and have resulted in months of extra delay in getting the latest data out. As an alternative:

  • Set up a listener for changes in the latest commit hashes of baseballdatabank and retrosheet repos (and/or new releases)
  • Automatically create a new branch/release draft, update the .env file with the new information, put up a PR, build images, and upload data files to OneDrive
  • When PR is merged, push the images with their appropriate release labels
