catalyst-cooperative / pudl Goto Github PK

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.

Home Page: https://catalyst.coop/pudl

License: MIT License

Python 98.72% Shell 0.37% Jinja 0.59% Dockerfile 0.07% HCL 0.07% Mako 0.01% Makefile 0.17%

open-data ferc eia energy utility climate electricity epa coal natural-gas

pudl's Introduction

The Public Utility Data Liberation Project (PUDL)

What is PUDL?

The PUDL Project is an open source data processing pipeline that makes US energy data easier to access and use programmatically.

Hundreds of gigabytes of valuable data are published by US government agencies, but it's often difficult to work with. PUDL takes the original spreadsheets, CSV files, and databases and turns them into a unified resource. This allows users to spend more time on novel analysis and less time on data preparation.

The project is focused on serving researchers, activists, journalists, policy makers, and small businesses that might not otherwise be able to afford access to this data from commercial sources and who may not have the time or expertise to do all the data processing themselves from scratch.

We want to make this data accessible and easy to work with for as wide an audience as possible: anyone from a grassroots youth climate organizers working with Google sheets to university researchers with access to scalable cloud computing resources and everyone in between!

PUDL is comprised of three core components:

Raw Data Archives

PUDL archives all our raw inputs on Zenodo to ensure permanent, versioned access to the data. In the event that an agency changes how they publish data or deletes old files, the data processing pipeline will still have access to the original inputs. Each of the data inputs may have several different versions archived, and all are assigned a unique DOI (digital object identifier) and made available through Zenodo's REST API. You can read more about the Raw Data Archives in the docs.

Data Pipeline

The data pipeline (this repo) ingests raw data from the archives, cleans and integrates it, and writes the resulting tables to SQLite and Apache Parquet files, with some acompanying metadata stored as JSON. Each release of the PUDL software contains a set of of DOIs indicating which versions of the raw inputs it processes. This helps ensure that the outputs are replicable. You can read more about our ETL (extract, transform, load) process in the PUDL documentation.

Data Warehouse

The SQLite, Parquet, and JSON outputs from the data pipeline, sometimes called "PUDL outputs", are updated each night by an automated build process, and periodically archived so that users can access the data without having to install and run our data processing system. These outputs contain hundreds of tables and comprise a small file-based data warehouse that can be used for a variety of energy system analyses. Learn more about how to access the PUDL data.

What data is available?

PUDL currently integrates data from:

EIA Form 860: 2001-2022
- Source Docs
- PUDL Docs
EIA Form 860m: 2023-12
- Source Docs
EIA Form 861: 2001-2022
- Source Docs
- PUDL Docs
EIA Form 923: 2001-2023
- Source Docs
- PUDL Docs
EPA Continuous Emissions Monitoring System (CEMS): 1995Q1-2023Q4
- Source Docs
- PUDL Docs
FERC Form 1: 1994-2022
- Source Docs
- PUDL Docs
FERC Form 714: 2006-2022 (mostly raw)
- Source Docs
- PUDL Docs
FERC Form 2: 1996-2022 (raw only)
- Source Docs
FERC Form 6: 2000-2022 (raw only)
- Source Docs
FERC Form 60: 2006-2022 (raw only)
- Source Docs
US Census Demographic Profile 1 Geodatabase: 2010
- Source Docs

Thanks to support from the Alfred P. Sloan Foundation Energy & Environment Program, from 2021 to 2024 we will be cleaning and integrating the following data as well:

EIA Form 176 (The Annual Report of Natural Gas Supply and Disposition)
FERC Electric Quarterly Reports (EQR)
FERC Form 2 (Annual Report of Major Natural Gas Companies)
PHMSA Natural Gas Annual Report
Machine Readable Specifications of State Clean Energy Standards

How do I access the data?

For details on how to access PUDL data, see the data access documentation. A quick summary:

Datasette provides browsable and queryable data from our nightly builds on the web: https://data.catalyst.coop
Kaggle provides easy Jupyter notebook access to the PUDL data, updated weekly: https://www.kaggle.com/datasets/catalystcooperative/pudl-project
Zenodo provides stable long-term access to our versioned data releases with a citeable DOI: https://doi.org/10.5281/zenodo.3653158
Nightly Data Builds push their outputs to the AWS Open Data Registry: https://registry.opendata.aws/catalyst-cooperative-pudl/ See the nightly build docs for direct download links.
The PUDL Development Environment lets you run the PUDL data processing pipeline locally.

Contributing to PUDL

Find PUDL useful? Want to help make it better? There are lots of ways to help!

Check out our contribution guide including our Code of Conduct.
You can file a bug report, make a feature request, or ask questions in the Github issue tracker.
Feel free to fork the project and make a pull request with new code, better documentation, or example notebooks.
Make a recurring financial contribution to support our work liberating public energy data.
Hire us to do some custom analysis and allow us to integrate the resulting code into PUDL.

Licensing

In general, our code, data, and other work are permissively licensed for use by anybody, for any purpose, so long as you give us credit for the work we've done.

The PUDL software is released under the MIT License.
The PUDL data and documentation are published under the Creative Commons Attribution License v4.0 (CC-BY-4.0).

Contact Us

For bug reports, feature requests, and other software or data issues please make a GitHub Issue.
For more general support, questions, or other conversations around the project that might be of interest to others, check out the GitHub Discussions
If you'd like to get occasional updates about the project sign up for our email list.
Want to schedule a time to chat with us one-on-one about your PUDL use case, ideas for improvement, or get some personalized support? Join us for Office Hours
Follow us here on GitHub
Follow us on Mastodon: @[email protected]
Follow us on BlueSky: @catalyst.coop
Follow us on LinkedIn
Follow us on HuggingFace
Follow us on Twitter: @CatalystCoop
Follow us on Kaggle
More info on our website: https://catalyst.coop
Email us if you'd like to hire us to provide customized data extraction and analysis: [email protected]

About Catalyst Cooperative

Catalyst Cooperative is a small group of data wranglers and policy wonks organized as a worker-owned cooperative consultancy. Our goal is a more just, livable, and sustainable world. We integrate public data and perform custom analyses to inform public policy (Hire us!). Our focus is primarily on mitigating climate change and improving electric utility regulation in the United States.

pudl's People

Contributors

Stargazers

Watchers

Forkers

maxoboe deardeepan gschivley barry8197 shuang1994 gitter-badger coryvegan jfburkhart cbdavis orenbaldinger powersym dbaston m-jamieson josiahjohnston sunliping0213 cmgosnell johnnydv aaronholm truggles claytonpbarrows jwoollacott patrickbrown4 erictleung jrea-rmi sammardell beckyxiluli rousik ella24 exemplary-citizen tiledb-inc yashkumar1803 e7dal aesharpe curioustauseef kma33 vishakh22 shahirsu uniteddiversity udayvaradarajan kyleries nickrobinson251 fchorney karldw ahahajade obianuju-rmi nguyendoanquyet cschloer mpaulgreen silky hortonshelpers anamileva gregosimo kunle1992 evelynmitchell daniellevie kevinsung tosinjayeola rohithdesikan nataliamush smuratsirin kev-rheinheimer wheelspawn jwmazzi chrigraf wchenaus noahcknox nxshi grgmiller bendnorman rmi-electricity boyuan276 drewclutter kevpgoff neverett knordback jdangerx nelsonauner jeffreysward apptrain doug-cady dstansby ericscheier ggurjar333 ryanmasson turbinehub etrieschman robertozanchi singularity-energy jamesbrofos yashpandey06 montroyjosh hfireborn aayushk9 jaishree2310 nancy9ice arzoo14 hrishikeshh

pudl's Issues

Functionalize the operations taking place within pudl.init_db()

The pudl.init_db() subroutine is getting too long and has become unreadable. Separate various logical pieces of it into functions that define particular portions of the process, including:

importing the static tables/lists
generating and importing the "glue" tables.
the import & cleanup for each of the ferc1 tables.

This will make it easier to hand off & test the individual tasks of importing new ferc1 tables.

Conference line

Assess impacts of constraints applied to plants for ID mapping

The constraints set on pulling the plants for the id mapping (ex. excluding anything under 5 MW) excluded some fields like common plants, which have expenses associated with them, but no capacity. Not a top priority by any means, but eventually we should go back through these constraints to see what we missed.

Example: for the f1_hydro table, we lost respondent_id, plant_names:
122, 'Common Hydro Plant'; 70, 'Common Facilities'

git and ORM

Spot check past FERC Form 1 versions to match line numbers with FERC Accounts

Periodically FERC updates their uniform chart of accounts for electric plant (e.g. they just recently added accounts for electricity storage devices, yay!). When this happens, the FERC Form 1 line numbers that correspond to a given FERC account may change. If that happens, all of our parsing of the f1_plant_in_srvce data will get messed up.

Somebody needs to go back and spot check some past years (between 2004 and 2015) to figure out if/when the lines-to-accounts mapping has changed in the past, and we'll need to generate a new mapping for each different set of lines/accounts. This could be done with old blank copies of Form 1 (maybe we can ask FERC?) or from the archived versions of the form that we got generated for PSCo. The page(s) we're interested in are FERC Form 1 pp 204-207.

For reference, the line-to-account mapping for 2015 is defined in a DataFrame named: pudl.constants.ferc_electric_plant_accounts

Find "provision for depreciation" in the original ferc1 database

There's a respondent-wide depreciation value reported to FERC, but it's not in the Plant In Service table. Uday/CPI want that number, so we need to figure out which FERC1 table it's in, and make sure we pull that table into the FERC1 and then PUDL DBs.

Define DB tables & add import logic for ingesting f1_gnrt_plant table.

The small plants data need to be ingested into PUDL, along the same lines as f1_steam, f1_fuel, etc. Define the ORM class in models_ferc1.py and define the ingest function in pudl.py.

Add SIMPLE IRA contributions to budget

Devise naming convention for PUDL DB table columns

Many of the database tables we are creating have dozens of columns. We need readable, memorable, tab-completable convention for naming them. This could be based on the table or data source they are being imported from, and also the nature of the data within them.

Create Utility-Plant Relations

We need a table that indicates which plants are associated with what utilities. For now it can just contain two columns: utility_id and plant_id, both of which are primary keys, and foreign keys. The same table might later also store the ownership percentages.

Define string cleaning dictionaries for f1_steam fields

The f1_steam table in the ferc1 database has at least two freeform fields that need to be cleaned up. They are type_const and plant_kind. Export a list of all unique strings found in those two fields, from all of the data we can import into the database simultaneously -- years 2004-2015. Using whatever information you can find about what those fields are supposed to describe (e.g. the blank FERC Form 1 document, and the instructions for filling it out) categorize the strings into a few meaningful categories, using the ferc1_fuel_strings and ferc1_fuel_unit_strings dictionary-of-lists in constants.py as a model. Look at whatever other fields you need to within the f1_steam table for context on what is meant by the type_const and plant_kind fields. This issue is complete when there are ferc1_type_const and ferc1_plant_kind dictionaries in constants.py that can be used to clean up these columns.

Import f1_fuel data into PUDL database

Get at least the f1_fuel data automatically flowing into the PUDL database when init_db() is run. This will require cleaning up any data fields which are freeform strings (fuel, fuel_unit), deciding which records are so bad that they can't be imported, and ensuring that the necessary foreign key tables are consistent with the actual contents of the f1_fuel table.

Allow multiple years of data in ferc1 database

The ferc1 module can now pull in data from the old DBF files from 2004-2015, however, the database can only hold one year's worth of data at a time, because the non-data tables (like f1_respondents) don't have a year field, and so entries for the same respondents from different years collide in the database -- you can't add respondent_id 134 to the table more than once. There needs to be logic incorporated that allows the union of all respondent_id's and respondent_names to be added to f1_respondent, and something similar for other non-data tables.

Convert to using platform independent filenames/paths

Right now there are some hard coded paths referring to data files within the project. Mostly they currently assume a Unix/OSX type filesystem. They should be made platform agnostic, using the os.path module.

Database table definition - EIA 923 tables

Making sure that there’s a mapping for the older tables that have different column names

Change FERC Form 1 DB ingestion to allow selection of year

Rather than hard-coding the year 2015 into the database ingestion, the user should be able to specify what year's data they want to pull in (or potentially which years they want to pull in?) and have that data get pulled. This will almost certainly involve some more grungy data cleaning, as new and exciting freeform strings describing fuels etc. will appear in previous years' data.

Flesh out static tables needed for EIA923

There are many static tables (e.g. abbreviations of NERC regions) that we should have available as foreign keys within the PUDL Database. They are summarized in the back of the EIA923 spreadsheet. Because they're simple, they'll also provide a good practice run for defining tables using the ORM, and getting those tables populated when init_db() runs.

Create PUDL ingest function for f1_plant_in_srvce table

The f1_plant_in_srvce table contains all the FERC account balance information -- nearly ~70 separate individual accounts in all -- on a respondent-by-respondent basis. We need to slurp it up!

Create error checking mechanisms for ID matching process

Send Ron Lehr on Mojave closure

Look into github for project management

milestones,
weekly goals,
issues;
put goals for the week in there
send that to team

Add an admin hours layer to the budget

Conservative estimate preferred by Alana
-Breaking up by task
-End of year - more heavy there - running payroll, team management with meetings, running invoices, running invoices

Beginning of year - heavier with set up

Do a detailed review of draft bylaws.

Before monday 2/27, do a close read of the draft bylaws, and make any last comments before we send them to Bill and/or Jason for final drafting changes. Put edits in the "remaining questions" Google doc: https://docs.google.com/document/d/1cu_-FTBZ3MXM0mopGPq6ubXjG8aGHNMU6xxje9Jh_-8/edit#

Look into Errors and Omissions policy

Devise naming convention for shared constants

The constants.py module is accumulating a lot of values, which pertain to different parts of the project, some of which contain the same kind of information but from different sources. We need a well defined naming convention so that we don't end up using the wrong values.

Hourly compensation and estimates for Q1 time commitments

ORM background

Research database cascading changes

Make the list of ferc1 tables to replicate an input to ferc1.init_db()

Right now there's a hard-coded list of DBF files which should be pulled into the ferc1 database clone buried within ferc1.init_db. This list should instead be an input into the function. A default list of DBF files/tables to import can be stored in the constants module.

Pay RMFU for help drafting bylaws.

RMFU offered to help us write our bylaws for $100/hr w/ a $750 cap. Once we've adopted the bylaws... we should really pay them.

Follow up with Jason about insurance

Define DB Table & Ingest function for f1_accumdepr_prvsn

Alongside the f1_plant_in_srvce table which describes the balances and changes in the electric plant in service, we need to ingest information about utility-wide depreciation, which comes from p. 219 of FERC Form 1, and is stored in table f1_accumdepr_prvsn. We need to define a PUDL DB table for this data, and write an ingest function. It appears similar in structure to f1_plant_in_srvce, which means we'll need to do a line number -> meaningful description dictionary/data table too, and hope that they don't change from year to year.

Pull EIA and FERC fuel data into the PUDL DB

Create the necessary tables in the PUDL DB to hold both FERC Form 1 annual fuel information (from the f1_fuel table) and EIA923 monthly fuel deliveries. Demonstrate the ability to populate these tables with information from FERC Form 1 and EIA923.

mail workers' comp waiver paperwork

to do for Monday 3/6

Test ingestion of FERC1 tables into PUDL for full 2004-2015 date range

Many FERC1 data tables are now successfully importing into PUDL, but only for the year 2015. Once we have the glue tables updated with relationships between plants and utilities for the years 2004-2015 (See Issue #26 ), we'll need to revisit the imports of all of the other years of these older tables, and make sure they still work.

Define PUDL DB table and import logic for f1_pumped_storage table

The ferc1 database has a table called f1_pumped_storage which stores attributes of various pumped hydro storage plants around the US. It's the simplest of the tables describing plants. Define a table to receive this data using the ORM in models_ferc1.py. Look at the FuelFERC1 class as a template.

Then create a function in pudl.py that pulls the f1_pumped_hydro data out of the ferc1 database into a DataFrame using pd.read_sql(), cleans it up as necessary for insertion into the PUDL DB, and adds it using DataFrame.to_sql(). Look at how the same work is done by for importing the f1_fuel data as an example.

When this issue is complete, pudl.init_db() should successfully import the pumped_hydro data.

Technical milestone map

As we finally approach having our initially desired data pulled into PUDL from FERC1 & EIA923 we need to have a full list of the issues required to do an initial alpha release (yay!) of PUDL that can be used to do interesting things in the world... like make some plots for potential funders to ogle.

Define DB table and ingest function for f1_hydro

Get the f1_hydro table imported into PUDL from FERC1 a process similar to that for the f1_steam and f1_pumped_storage to define the database table using the ORM in models_ferc1.py and a corresponding ingest function within pudl.py

EIA data normalization

Create schemata for the PUDL database tables which we will import the EIA923 data into, using the ORM syntax.

Updating ID matching and PUDL ID generation

Devise naming convention for SQLAlchemy ORM Classes.

We need a readable, easily rememberable convention for the names of the classes which we are constructing with the SQLAlchemy ORM. This should determine the name of the class, and the name of the DB table which objects of this class will be records within.

Download atom text editor

Logo finalization

Expand ID Mapping to data from 2006-2016 for FERC1 & EIA923

We can pull in FERC Form 1 data from 2004 onward. However, the id mapping "glue" tables that we have right now only encompass information from the 2015 FERC database. If we want to be able to work with multiple years of FERC data in the PUDL database, we need glue for those other years. Thus, the id mapping exercise needs to be expanded to pull in new associations between plants & utilities from prior FERC years.

If @cmgosnell & @swinter2011 can indicate which fields they need from the various plants tables to do the mapping, @zaneselvans can pull them from the 2004-2015 Form 1 DB for matching. If we can come up with a common set of columns for all the different types of plants, we could do it all in one go maybe?

finish quickbooks setup

create initial categories
create member accounts
create stock ledger
run the above by the accountant
anything else needed?

Adopt final bylaws as drafted.

Investigate less zealous removal of NA values during f1_fuel import

After the fuel and fuel_unit strings are cleaned up in the f1_fuel import, any record which contains any NA values is dropped, before the DataFrame is pulled into the PUDL DB. In some cases, this means we lose some data. E.g. some utilities only report their mmbtu/kWh numbers on a separate "Total" line. We should check and see whether there's a less destructive way we can deal with these leftover records.

create classes for eia923 dictionaries

check to see if all metadata-type variables in models_eia923 have an associated dictionary in constants.py; once there are dictionaries create classes in models.py;
for new static constants added in models.py, add to ingest_static_tables function in pudl.py.

Check in with Alana about admin

availability
invoicing and payment
getting time in by end of month

Define DB table and ingest function for f1_purchased_pwr table

Using a process similar to that for f1_fuel, f1_steam, f1_hydro, etc. define a DB table for the f1_purchased_pwr table using the ORM in models_ferc1, and flesh out the ingest function in pudl.py.