
pudl-archiver's Introduction

PUDL Archivers

This repo implements data archivers for The Public Utility Data Liberation Project (PUDL). It is responsible for downloading raw data from multiple sources and creating Zenodo archives containing that data.

Background on Zenodo

Zenodo is an open repository maintained by CERN that allows users to archive research-related digital artifacts for free. Catalyst uses Zenodo to archive raw datasets scraped from the likes of FERC, EIA, and the EPA to ensure reliable, versioned access to the data PUDL depends on. Take a look at our archives here. In the event that any of the publishers change the format or contents of their data, remove old years, or simply cease to exist, we will have a permanent record of the data. All data uploaded to Zenodo is assigned a DOI for streamlined access and citing.

Whenever the historical data changes substantially or new years are added, we make new Zenodo archives and build out new versions of PUDL that are compatible. Pairing specific Zenodo archives with PUDL releases ensures a functioning ETL for users and developers.

Once created, Zenodo archives cannot be deleted. This is, in fact, their purpose! It also means that one ought to be sparing with the information uploaded. We don't want to wade through tons of test uploads when looking for the most recent version of data. Luckily, Zenodo has created a sandbox environment for testing API integration. Unlike the regular environment, the sandbox can be wiped clean at any time. When testing uploads, you'll want to upload to the sandbox first. Because we want to keep our Zenodo as clean as possible, we keep the upload tokens internal to Catalyst. If there's data you want to see integrated and you're not part of the team, send us an email at [email protected].

One last thing: Zenodo archives for particular datasets are referred to as "depositions". Each dataset is its own deposition, created when the dataset is first uploaded to Zenodo and versioned as the source releases new data.

Installation

We recommend using conda to create and manage your environment.

Run:

conda env create -f environment.yml
conda activate pudl-cataloger

Setting up environment

API tokens are required to interact with Zenodo. There is one set of tokens for the production server and one for the sandbox server. The archiver tool expects these tokens to be set in the following environment variables: ZENODO_TOKEN_PUBLISH and ZENODO_TOKEN_UPLOAD for production, or ZENODO_SANDBOX_TOKEN_PUBLISH and ZENODO_SANDBOX_TOKEN_UPLOAD for the sandbox server. Catalyst uses a set of institutional tokens; contact a maintainer if you need access.

If you want to use the epacems archiver, you'll need to get a personal API key and store it in the EPACEMS_API_KEY environment variable.
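
For example, a quick pre-flight check that the expected variables are set might look like the following. This is a minimal sketch, not part of pudl-archiver; only the variable names come from the documentation above.

import os


def check_zenodo_tokens(sandbox: bool = False) -> None:
    """Raise if the Zenodo tokens the archiver expects are not set (hypothetical helper)."""
    prefix = "ZENODO_SANDBOX" if sandbox else "ZENODO"
    required = [f"{prefix}_TOKEN_PUBLISH", f"{prefix}_TOKEN_UPLOAD"]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")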

Usage

A CLI is provided for creating and updating archives. The basic usage looks like:

pudl_archiver --datasets {list_of_datasources}

This command will download the latest available data and create archives for each requested datasource. The supported datasources include censusdp1tract, eia_bulk_elec, eia176, eia191, eia757a, eia860, eia860m, eia861, eia923, eia930, eiaaeo, eiawater, epacems, epacamd_eia, ferc1, ferc2, ferc6, ferc60, ferc714, nrelatb, phmsagas, mshamines.

There are also four optional flags available:

  • --sandbox: used for testing. It will only interact with Zenodo's sandbox instance.
  • --initialize: used for creating an archive for a new dataset that doesn't currently exist on Zenodo. If successful, this command will automatically add the new Zenodo DOI to the dataset_doi.yaml file.
  • --dry-run: used for testing; it performs no Zenodo write operations.
  • --all: shortcut for archiving all datasets that we have defined archivers for. Overrides --datasets.

Adding a new dataset

Step 1: Implement archiver interface

All of the archivers inherit from the AbstractDatasetArchiver base class (defined in src/pudl_archiver/archivers/classes.py). Each archiver needs to implement only one method: get_resources. This method is called by the base class to coordinate downloading all data resources. It should be a generator that yields awaitables for downloading those resources. Each awaitable should be a coroutine that downloads a single resource and returns a path to that resource on disk along with a dictionary of working partitions relevant to the resource. In practice this generally looks something like:

class ExampleArchiver(AbstractDatasetArchiver):
    name = "example"

    async def get_resources(self) -> ArchiveAwaitable:
        for year in range(start_year, end_year):
            yield self.download_year(year)

    async def download_year(self, year: int) -> tuple[Path, dict]:
        url = f"https://example.com/example_form_{year}.zip"
        download_path = self.download_directory / f"example_form_{year}.zip"
        await self.download_zipfile(url, download_path)

        return download_path, {"year": year}

This example uses a couple of useful helper methods and attributes defined in the base class. Notice that download_year uses self.download_directory: this is a temporary directory created and managed by the base class, used as a staging area for downloading data before uploading it to Zenodo. The temporary directory is automatically removed once the data has been uploaded. download_year also uses the download_zipfile method, a helper that handles downloading zipfiles, checks that they are valid, and retries a configurable number of times. Not shown here, but also used frequently, is the get_hyperlinks method. This helper takes a URL and a regex pattern, and finds all hyperlinks on the page at that URL that match the pattern. This is useful when a page contains links to a series of data resources with somewhat structured names, as sketched below.
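
For example, an archiver whose source publishes an index page of links might combine get_hyperlinks with download_zipfile roughly as follows. This is a sketch only: the URL and filename patterns are made up, the exact signature of get_hyperlinks (assumed here to be awaitable and to accept a regex pattern) should be checked against classes.py, and imports of AbstractDatasetArchiver and ArchiveAwaitable are omitted as in the example above.

import re
from pathlib import Path


class LinkedExampleArchiver(AbstractDatasetArchiver):
    name = "linked_example"

    async def get_resources(self) -> ArchiveAwaitable:
        # Find every zipfile linked from the (hypothetical) index page whose
        # name contains a four-digit year, then queue a download for each one.
        links = await self.get_hyperlinks(
            "https://example.com/data", r"example_form_\d{4}\.zip"
        )
        for link in links:
            year = int(re.search(r"\d{4}", link).group())
            yield self.download_resource(link, year)

    async def download_resource(self, link: str, year: int) -> tuple[Path, dict]:
        download_path = self.download_directory / f"example_form_{year}.zip"
        await self.download_zipfile(link, download_path)
        return download_path, {"year": year}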

Step 2: Run --initialize command

You will need to run the --initialize command to create a new Zenodo deposition and update the config file with the new DOI:

pudl_archiver --datasets {new_dataset_name} --initialize --summary-file {new_dataset_name}-summary.json

Using the --summary-file flag will save a .json file summarizing the results of all validation tests, which is useful for reviewing your dataset. Note that this step will require you to create your own Zenodo validation credentials if you are not a core Catalyst developer.

Step 3: Manually review your archive before publication.

If the archiver run is successful, it will produce a link to the draft archive. Though many of the validation steps are automated, it is worth manually reviewing archives before publication, since a Zenodo record cannot be deleted once published. Here are some recommended additional manual verification steps:

  1. Open the *-summary.json file that your archiver run produced. It will contain the name, description, and success of each test run on the archive, along with any notes. If your draft archive was successfully created, all tests have passed, but it's worth skimming through the file to make sure all expected tests were run and there are no notable warnings in the notes.
  2. On Zenodo, "preview" the draft archive and check to see that nothing seems unusual (e.g., missing years of data, new partition formats, contributors).
  3. Look at the datapackage.json. Does the dataset metadata look as expected? How about the metadata for each resource?
  4. Click to download one or two files from the archive. Extract them and open them to make sure they look as expected.

When you're ready to submit this archive, hit "publish"! Then head over to the pudl repo to integrate the new archive.

Development

We only have one development-specific tool: the Zenodo Postman collection in /devtools. It is used for testing and prototyping Zenodo API calls and is not needed to use the archiver tool itself.

To use it:

  1. download Postman (or use their web client)
  2. import this collection
  3. set up a publish_token Postman environment variable like in the docs
  4. send stuff to Zenodo by clicking buttons in Postman!

pudl-archiver's People

Contributors

aesharpe, bendnorman, cmgosnell, dependabot[bot], e-belfer, jdangerx, karldw, katie-lamb, pre-commit-ci[bot], ptvirgo, rousik, trentonbush, zaneselvans, zschira


pudl-archiver's Issues

Create github action for running the scraping/archiving process at desired frequencies

@zschira commented on Tue Sep 13 2022

Once the archiver/scraper repos have been combined, and we have high level scripts for managing the process, it should be very easy to create github actions for automating the archiving process. New data is released at various frequencies for the different data sources incorporated in PUDL, so we can create multiple actions that run at frequencies reflective of this.


@zaneselvans commented on Tue Sep 13 2022

I am so excited for this to finally happen!

Improve documentation to add a new dataset

Update from 2023: Now that our archiver infrastructure is updated, let's take a moment to update the documentation in github for adding a new archiver.

Original issue below.

This is a general issue based on the experience adding NREL ATB. It's possible to work through the datastore, etl, constants, metadata json and other files to add a new source. Possible, but could definitely be made more clear. Especially little things like needing to add the new source to contributors_by_source in constants.py and then adding new contributors to contributors.

I ran into the most trouble when trying to run tests. I'm not sure how to best document adding a new data source to the tests but this process was painful for me. I was trying to update the datastore_test.py file and couldn't figure out why files weren't being downloaded, not realizing that I also needed to update datastore_fixture in conftest.py. It's all more clear in retrospect, but knowing to modify the settings_datapackage_fast.yml file and things like that would be helpful.

Increase archiver robustness to transient network issues

Looking at some of the archiver logging output, the exponential backoff on failure / timeout seemed like it was going up to about a 1-minute wait time. I've sometimes experienced network connectivity issues with the federal agency sites that last for longer than that. Right now it seems like about 1 in 20 archiver runs fails due to transient network issues like connection timeouts or closed connections, but another run done 15 minutes or an hour later has no trouble.

It might make sense to back off to as much as 15 or 30 minutes when we're doing the scheduled periodic archiving runs. They'll probably run overnight and could even take a couple of hours to complete without causing any problems.

If the maximum wait time or the multiplier between steps were a parameter, we could still have the CI go quickly, while the scheduled runs can be more robust and leisurely.
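
For illustration, a backoff helper with the maximum wait and multiplier exposed as parameters might look like the sketch below. This is not the repo's retry_async (visible in the tracebacks elsewhere in this document); the names, defaults, and exception types here are hypothetical.

import asyncio
import logging

logger = logging.getLogger(__name__)


async def retry_with_backoff(make_coro, retries=7, base_delay=10, multiplier=2, max_delay=1800):
    """Retry an async operation with exponential backoff, capping waits at max_delay seconds."""
    delay = base_delay
    for attempt in range(1, retries + 1):
        try:
            # make_coro is a zero-argument callable returning a fresh coroutine,
            # since a coroutine object can only be awaited once.
            return await make_coro()
        except (asyncio.TimeoutError, ConnectionError) as exc:
            if attempt == retries:
                raise
            logger.info("Attempt %d failed (%s); retrying in %ds", attempt, exc, delay)
            await asyncio.sleep(delay)
            delay = min(delay * multiplier, max_delay)

CI could pass a small max_delay so tests stay fast, while scheduled overnight runs could pass something like 15 or 30 minutes.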

Scope

Next steps

Implement multi-year (2010, 2020) scraping for the censusdp1tract scraper

Currently the US Census DP1 scrape is hard-coded to obtain the 2010 data, and label the output file as such, and then this data is packaged for Zenodo using an annual datapackager, which reads the year from the filename.

However, with the release of the 2020 DP1 data, we'll need to update the scraper to get both 2010 and 2020 data, and automagically label the resulting downloaded files accordingly, for consumption by the zenodo storage module.

Fix duplicated procedure documentation for new Zenodo archives

Right now there are two descriptions of how to add new archives to Zenodo. One in the pudl-zenodo-storage README, and one in the PUDL "Development/Working With the Datastore" section of the read-the-docs. This information should only live in one place, the pudl-zenodo-storage README.

This issue is dedicated to:

  1. Making sure that the pudl-zenodo-storage README is up to par
  2. Removing all the detail from the PUDL docs description and simply referencing the pudl-zenodo-storage README.

If dataset shrinks, we fail when updating datapackage

I simulated "a bunch of files go offline" by only returning one hyperlink from get_hyperlinks:

diff --git a/src/pudl_archiver/archivers/classes.py b/src/pudl_archiver/archivers/classes.py
index 1883d0c..35b85bc 100644
--- a/src/pudl_archiver/archivers/classes.py
+++ b/src/pudl_archiver/archivers/classes.py
@@ -157,7 +157,7 @@ class AbstractDatasetArchiver(ABC):
                 f"Make sure your filter_pattern is correct or if the structure of the {url} page changed."
             )

-        return hyperlinks
+        return list(hyperlinks)[:1]

     async def create_archive(self):
         """Download all resources and create an archive for upload.

Then when I ran pudl_archiver --sandbox --datasets eia861 I got:

2023-01-20 14:10:49 [    INFO] catalystcoop.pudl_archiver.zenodo.api_client:317 GET https://sandbox.zenodo.org/api/deposit/depositions - get deposition
2023-01-20 14:10:54 [    INFO] catalystcoop.pudl_archiver.zenodo.api_client:197 GET https://sandbox.zenodo.org/api/deposit/depositions/1149408/files
2023-01-20 14:10:57 [    INFO] catalystcoop.pudl_archiver.archivers.classes:75 Archiving eia861
2023-01-20 14:11:07 [    INFO] catalystcoop.pudl_archiver.archivers.classes:176 Downloaded /var/folders/7q/28j531_d151fn0d3831ql0c00000gr/T/tmp5w_m4p_9/eia861-1999.zip.
2023-01-20 14:11:07 [    INFO] catalystcoop.pudl_archiver.zenodo.api_client:259 Adding eia861-1999.zip to deposition.
2023-01-20 14:11:07 [    INFO] catalystcoop.pudl_archiver.zenodo.api_client:439 Creating new datapackage.json for eia861
2023-01-20 14:11:07 [    INFO] catalystcoop.pudl_archiver.zenodo.api_client:443 GET https://sandbox.zenodo.org/api/deposit/depositions/1149408/files - getting file list for datapackage
Encountered exceptions, showing traceback for last one: ["('eia861', KeyError('eia861-2006.zip'))"]
Traceback (most recent call last):
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-archiver-toml/bin/pudl_archiver", line 8, in <module>
    sys.exit(main())
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/cli.py", line 58, in main
    asyncio.run(archive_datasets(**vars(args)))
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-archiver-toml/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-archiver-toml/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/__init__.py", line 99, in archive_datasets
    raise exceptions[-1][1]
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/__init__.py", line 47, in archive_dataset
    await archiver.create_archive()
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/classes.py", line 180, in create_archive
    await self.deposition.add_files(resource_dict)
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/zenodo/api_client.py", line 283, in add_files
    await self.update_datapackage(resources)
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/zenodo/api_client.py", line 454, in update_datapackage
    datapackage = DataPackage.from_filelist(
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/frictionless.py", line 98, in from_filelist
    resources=[
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/frictionless.py", line 99, in <listcomp>
    Resource.from_file(file, resources[file.filename].partitions)
KeyError: 'eia861-2006.zip'

This appears to be because in frictionless.DataPackage.from_filelist we have

            resources=[
                Resource.from_file(file, resources[file.filename].partitions)
                for file in files
            ]

Where files is the list of files that were in the deposition before we started doing all these edits and resources is the list of resources we downloaded. Then if there's a file in files that's not in resources we hit that KeyError.
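
One narrow guard, sketched below (not the repo's actual fix), would be to build Resource entries only for files that still have a corresponding downloaded resource, though as discussed next it's probably better to pass in the right file list in the first place:

            resources=[
                Resource.from_file(file, resources[file.filename].partitions)
                for file in files
                # Skip deposition files that weren't part of this run's downloads
                # instead of raising a KeyError.
                if file.filename in resources
            ]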

We should probably pass in the list of files for the deposition version post-edits - so maybe we should:

  • stage changes for normal files
  • apply changes
  • refresh deposition information
  • stage changes for datapackage
  • apply changes for datapackage
  • publish

@zschira does this analysis make sense?

Automate Zenodo archiving

This issue tracks all the tasks required to get our automated archiving system up and running.

Scope

  • Archivers create a report describing what happened during an archiving run.
  • Each archiver can specify dataset-specific success/failure criteria.
  • Archivers are robust to transient networking issues (false "failures" that are external to the data).
  • Scheduled archiving runs produce a persistent record of their outcome (probably a GitHub issue).
  • Scheduled archiving runs notify us of their outcome (Slack notification + issue assignment).
  • All working archivers are being run on a schedule, publishing new archives to the production Zenodo site.

Out of Scope

Tasks


Test individual archivers

Background

Right now the individual archivers (inheriting from AbstractDatasetArchiver) are largely untested. There's been discussion in #43 about the appropriate way to test these archivers. In the long run, we will want to have new archives tested through nightly builds to fully verify the entire workflow. This, however, will likely be future work once we are actually generating new archives regularly and have decided how we want to integrate those archives into PUDL. In the immediate future, there are likely helpful tests that we can create at the unit/integration level to perform some basic sanity checking.

Scope

This issue will only track the development of unit/integration tests, and will leave full integration into the nightly builds process for future work. Things we can check for at this level include:

  • Are all of the links we attempt to download files from valid?
  • Is the structure of the archive what is expected by PUDL (partitions are formatted correctly, and filenames look like what we'd expect)?

EQR archiver concurrency/disk limits

In #31 and #43 @e-belfer ran into an issue where if we download all the EQR data at once it uses a ton of disk space - the full EQR dataset is 15.5 GB.

The problem, apart from this just being slow, is that GH actions runners only have 14G of disk space, so we'd have to manually partition across multiple runners or try to reduce the disk usage by only keeping a small subset of the data on disk at any one time.

I think we can try to basically lazily load the files from the downloader:

  • initialize new deposition version
  • for each resource we have, individually download/checksum/upload to deposition, with some concurrency set at the dataset level
  • delete files that we didn't see in the above step from the pending deposition
  • regenerate datapackage
  • update settings etc.

Also, only 3 users globally can download EQR data at a time.

Scope:

  • we only run one concurrent EQR download at once
  • we only store one EQR dataset on disk at once

Next steps:

  • allow datasets to force aiohttp client pool size
  • add a method to AbstractDatasetArchiver that spits out a generator of (name, resource) tuples instead of a dict from name to resources
    • hope this is effectively limited by the session concurrency limits above
  • in DepositorOrchestrator, diff files by md5 one-by-one instead of all at once

Archiver workflow sometimes creates unpublished draft depositions

After adding several new archivers to the run-archivers workflow in the course of #65 and running all the workflows multiple times, I'm seeing some behavior that confuses me. The newly added archivers are:

  • eia176
  • eia923 (didn't make it into the list before)
  • eiawater
  • mshamines
  • phmsagas

My expectation was that if any changes were detected for any of the archived datasets (old or new) a new deposition would be created on Zenodo with the new data in it, and regardless of whether a new archive was published, no draft deposition would remain after the archiving process had completed.

I'm now able to run the run-archivers workflow manually via the workflow-dispatch, and all of the actions "succeed" ✅. What happened with the various archivers seemed to differ:

censusdp1tract

  • Initially the archiver was failing, for reasons I didn't understand.
  • I searched for Census DP1 archives on Zenodo Sandbox and found 2 different lineages of archives, one of which was started in January, and wondered if the new archiver might somehow be incompatible with the old lineage (started in 2020), so I switched the sandbox DOI over to the newer lineage, but this didn't fix it. It did seem to try and create a v2.0.0 archive though, which it hadn't been doing before.
  • Then I looked for draft archives, and saw that there were unpublished drafts for both the old and new lineages. I deleted them, and re-ran the run-archivers workflow, and this time it succeeded ✅.
  • However, it also created a new unpublished draft, which contains files with identical MD5 checksums as the current published version, but a new version (v2.0.0) and a new DOI.

eia176 (first time w/ workflow)

  • The archiver runs and succeeds ✅, and (correctly) does not create a new published deposition.
  • However, like the censusdp1tract archiver, it seems to create an unpublished draft, even though the MD5 checksums for all the files are identical. The new unpublished draft has a different DOI and a new version (v3.0.0) compared to the previously published deposition.

eia860

  • Running the archiver did not result in either a newly published archive, or a new unpublished draft.
  • However, an old unpublished draft appears to be hanging around from October, 2022, with the same MD5 sums, but a new reserved DOI and a new version.

eia860m

  • Running the archiver did not result in either a newly published archive, or a new unpublished draft.
  • However, an old unpublished draft appears to be hanging around from February 6th, 2023, with the same MD5 sums, but a new reserved DOI and a new version.

eia861

  • Running the archiver did not result in either a newly published archive, or a new unpublished draft.
  • However, an old unpublished draft appears to be hanging around from February 2nd, 2023, with the same MD5 sums, but a new reserved DOI and a new version.

eia923 (first time w/ workflow)

  • Running the archiver did not result in a newly published archive, but did create a new unpublished draft.
  • The new unpublished draft looks like it has the same MD5 sums, but a new reserved DOI and a new version.

eia_bulk_elec

  • Running the archiver did not result in either a newly published archive, or a new unpublished draft.
  • However, an old unpublished draft appears to be hanging around from February 3rd, 2023, with the same MD5 sums, but a new reserved DOI and a new version.

eiawater (first time w/ workflow)

  • Running the archiver did not result in a newly published archive, but did create a new unpublished draft.
  • The new unpublished draft looks like it has the same MD5 sums, but a new reserved DOI and a new version.

epacamd_eia

  • Running the archiver did not result in a new published archive or a new draft.
  • However, an old draft from September 8th, 2022 is still hanging around.
  • Also, the unpublished draft only contains the epacamd_eia.zip file, and not the datapackage.json, and the MD5 sum for the zipfile is different from that of the most recently published archive.

epacems

  • Running the archiver did not result in a new published archive, even though it definitely should have. The current archive is more than 2 years out of date.
  • It also didn't create a new draft, but there was an old draft from April, 2021 which contains no uploaded files.
  • I deleted the old draft and re-ran the archiving workflow, to see what would happen with the substantial downloads required to make a new CEMS archive.
  • With no blocking draft deposition, the CEMS archiver is running much longer. This seems to suggest that the existence of the draft deposition prevented the download of new raw data from EPA entirely, which means it wouldn't have been able to compare the checksums of the raw files with those stored on Zenodo to even figure out if it needed to make a new archive.
  • It looks like EPA has taken down the bulk EPA CEMS files entirely. When the archiver ran it got a 404 on the old index page. It looks like the CEMS data may now only be available via an API. EPA has a repo with API examples and the API documentation.

ferc1

  • A new deposition is created every time the archiver is run, because the contents of the data downloaded from RSS is different every time (probably there are some timestamps or something).
  • In addition to the new deposition being created with a new version and DOI, a draft deposition is created and left unpublished. This draft deposition appears identical to the most recently published version -- the MD5 checksums for the files are the same, and the reserved DOI in the draft deposition is identical to the actual DOI for the new deposition that got published.

ferc2

  • As with FERC Form 1, a new deposition is created every time the archiver is run, again with all the XBRL (RSS) derived files having different checksums.
  • However, unlike the FERC Form 1, no draft version of the deposition is left hanging around. The drafts seem to get published and disappear, as expected.

ferc6

  • As with FERC Form 1, a new deposition is created every time the archiver is run, again with all the XBRL (RSS) derived files having different checksums.
  • However, unlike the FERC Form 1, no draft version of the deposition is left hanging around. The drafts seem to get published and disappear, as expected.

ferc60

  • As with FERC Form 1, a new deposition is created every time the archiver is run, again with all the XBRL (RSS) derived files having different checksums.
  • However, unlike the FERC Form 1, no draft version of the deposition is left hanging around. The drafts seem to get published and disappear, as expected.

ferc714

  • As with FERC Form 1, a new deposition is created every time the archiver is run, again with all the XBRL (RSS) derived files having different checksums.
  • However, unlike the FERC Form 1, no draft version of the deposition is left hanging around. The drafts seem to get published and disappear, as expected.

mshamines (first time w/ workflow)

  • No new published version is created.
  • An unpublished draft deposition is created, with a new version and DOI, but apparently the same MD5 sums.

phmsagas (first time w/ workflow)

  • No new published version is created.
  • An unpublished draft deposition is created, with a new version and DOI, but apparently the same MD5 sums.

Patterns?

  • It seems like any time there's a pre-existing unpublished draft, it blocks a new archive from being created.
  • It also seems like maybe when no update is required because there's been no change to any of the files, the files are still getting uploaded to Zenodo unnecessarily, creating an unpublished draft (I think if none of the file checksums change, Zenodo may not allow a new version to be published?)
  • It seems like maybe the RSS/XBRL sources are working as expected because every single archive requires an update.
  • However the FERC Form 1 doesn't seem to follow this pattern.

Permissions errors when adding new dataset to Zenodo

Ran into some permissions issues trying to test a new source of data (FERC EQR) in the Zenodo sandbox.

pudl_archiver --datasets ferceqr --dry-run requires the production token at present, returning KeyError: 'ZENODO_TOKEN_UPLOAD'.

pudl_archiver --datasets ferceqr --initialize returns the same error, also requiring the production token.

pudl_archiver --datasets ferceqr --sandbox returns a KeyError for the dataset name in the Zenodo API client, breaking on the line settings = self.dataset_settings[data_source_id] in /zenodo/api_client.py.

In other words, it seems to be impossible to test a new dataset (either locally or in the sandbox) without updating the dataset settings entry, which requires access to the production keys. This seems to be an undesirable outcome of some of the more recent refactoring changes.

Archive FERC EQR

There's high demand for FERC EQR data. Even if we haven't integrated it yet, we can still archive it.

FERC EQR data.

This is a duplicate issue of PUDL issue 1414, but I'm putting it in the archiver repo.

Specify archiver success & failure conditions

Our goal is to have the archivers running on an automated schedule in the background, taking snapshots of the original data sources which can be accessed programmatically. This will minimize the overhead associated with keeping our raw inputs up to date, but we still need the system to alert us when something goes wrong so we can fix it.

  • Each dataset's archiver should generate a report describing the outcome of the archiving run. See #60.
  • Depending on what happens during the archiving run (which could be encapsulated by the generated report), the run should be declared a success or a failure. Success or failure should be defined at the level of each individual dataset so as to effectively direct our attention to where the problem is.
  • What constitutes a successful archiving run will vary by dataset. We need a way to specify our expectations, and if anything outside of those expectations happens we should get a failure. If we start with stringent criteria and find we're getting too many false positives ("failures" that are actually okay), we can loosen the criteria based on the actual outcomes we see.
  • In general we expect the set of data partitions to either remain constant or grow over time, but we have fairly specific expectations about what new data partitions should look like (see the sketch after this list). In most cases, new partitions are just additional timesteps, e.g.:
    • a new year of EIA-923 data
    • a new month of EIA-860m data
    • a new month of EPA CEMS data
    • a new quarter of FERC EQR
    • In the case of FERC's XBRL data which is scraped from an RSS feed, we expect to see any number of new individual filings, primarily in the most recent timestep, but occasionally revising older filings.
  • If any expected data partition is not found, that should result in failure.
  • In the case of frequently updated datasets like the EIA-860M, it might make sense to raise a 🚩 if much more time than expected passes without an update. E.g. if there haven't been any changes to a monthly dataset for 3 months, maybe the agency has actually started putting new data somewhere else.
  • If a new data partition of an unexpected form is found, that should result in failure, since it means our expectations about what the data should look like are no longer correct. We should be required to explicitly update our expectations about the data.
  • We should at least report some measure of the scale of changes detected between versions, and beyond a certain threshold we might want to cause a failure. E.g.:
    • If all the expected data partitions are found, but some of them have decreased in size by 90% something is probably wrong and needs to be investigated.
    • If the file type has changed even though the file name has not changed (as happened when the EIA-176 started publishing Excel spreadsheets but calling them CSVs) that should also probably result in a failure that demands investigation.
    • If an unusually high proportion of the data partitions has changed, but they still have the right names, file types, and sizes, maybe it's not a failure, but should be investigated as it could indicate big revisions to the data (as often happens without any kind of notice or documentation).
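
As a rough illustration of how per-dataset partition expectations could be encoded, the sketch below checks a new partition against a pattern per partition key. None of these names exist in the codebase; they are purely hypothetical.

import re

# Hypothetical expectations: a new partition is acceptable only if it has
# exactly the expected keys and each value matches its pattern.
EXPECTED_PARTITIONS = {
    "eia923": {"year": re.compile(r"^\d{4}$")},
    "eia860m": {"year_month": re.compile(r"^\d{4}-\d{2}$")},
    "ferceqr": {"year": re.compile(r"^\d{4}$"), "quarter": re.compile(r"^[1-4]$")},
}


def partition_is_expected(dataset: str, partition: dict) -> bool:
    """Return True if a partition dict matches the dataset's expected shape."""
    expected = EXPECTED_PARTITIONS.get(dataset)
    if expected is None or set(partition) != set(expected):
        return False
    return all(expected[key].match(str(value)) for key, value in partition.items())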

Tasks


Ensure each XBRL filing in FERC RSS feed has a stable ID

The FERC RSS feed is changing guids every time the feed is downloaded. According to the RSS standard this field is supposed to be a unique identifier for entries, which is why I chose to use it for naming the XBRL filings, but FERC doesn't seem to be conforming to this standard. I emailed FERC to check if they know this is happening, and if they plan to fix it, but I haven't heard back. While this is still the case, every time we run one of the FERC archivers, all of the XBRL file names will change and we will end up creating a new version even if none of the underlying data has changed.

We can fix this one of two ways:

  1. Do nothing and wait for FERC to fix this.
  2. Change how we name XBRL filings

Option 1, I think, would be nice as it doesn't require any work, and using UUIDs is robust and won't lead to any accidental collisions. However, option 2 is probably the better option, as we don't know if/when FERC will actually get to this.

Scrape/archive supplemental FERC forms

@zschira commented on Wed Sep 28 2022

There are several smaller FERC forms that supplement their primary forms (1,2,6,60,714). These supplemental forms include 1F, 3Q Electric, 2A, 3Q Gas, and 6Q. We should be archiving these supplemental forms even if we are not actively using them at this point in time.

Write archiver to regularly run and archive updated CEMS crosswalk

This is a continuation of issue #2505 in PUDL, which sets out to update the EPA-EIA crosswalk with 2021 data. The script that does the updated archiving is in PR #1 in the forked Catalyst repository.

While creating a static, manually compiled output is a good start, it would probably be good to have a more reproducible programmatic process that incorporates any data updates and any updates to the crosswalk repo (process changes or manual mapping additions), and that archives these outputs in a manner consistent with our other data sources.

  • Configure archiver to work with environment variable
  • Configure archiver to run render.r in the catalyst-cooperative/camd-eia-crosswalk-2021 repo and upload outputs to Zenodo, or configure the repo itself to run regularly (whichever is easier).

Develop high level script(s) for managing scraping/archiving

@zschira commented on Tue Sep 13 2022

The FERC datasets will need a script to manage scraping both the DBF and XBRL data. It may also be useful to create a single high level script for scraping data from all sources.


@zaneselvans commented on Tue Sep 13 2022

We already depend indirectly on the click and typer CLI frameworks, and I think they both provide hooks for tab completion and hierarchical scripts, which might be useful in this context. I've often imagined having a hierarchical script for PUDL with unified help messages & interface like

$ pudl scrape ferc1 ferc2 ferc6 ferc60 ferc714
$ pudl archive ferc1 ferc2 ferc6 ferc60 ferc714
$ pudl datastore update-cache ferc1 ferc2 ferc6 ferc60 ferc714
$ pudl ferc2sqlite settings/ferc2sqlite.yml

Update keywords in new dataset metadata

Some of our new datasets either don't have any search keywords associated with them, or have irrelevant keywords. Update this information in the associated metadata in the DataSource metadata within PUDL to be relevant/accurate and sufficient for enabling data discovery.

Tasks


Refresh Zenodo Archives Manually

The new archivers are working, but we don't have the infrastructure set up to run them on an automated schedule. Instead, we'll update them manually, adding new archive lineages for the new datasets.

  • 🆕 New Dataset
  • ✅ Updated Existing Archive
  • ⛔ No Update
  • 💀 Archiver is broken

Tasks

Rogue EIA Thermoelectric Cooling Water Archives?

I was trying to add all of our raw data archives into the Catalyst Cooperative community on Zenodo and noticed that the EIA Thermoelectric Cooling Water archive doesn't seem to be owned by the [email protected] account -- I can't update the metadata to add it to the community. Strangely, the archive also supposedly dates from November 6th, 2022, even though (I thought) @e-belfer only created the new archiver a few weeks ago.

Did this archive somehow get created with a different Zenodo upload/publish token? Do we know who created it? There's no way to change ownership of archives, so maybe we'll have to ask Zenodo to delete this one once we've got an archive series started with the new archiver? Mostly just confused.

Separate IO and business logic

Right now our Zenodo interactions are deeply intertwined with our "if things have changed, do something, otherwise, don't" logic.

This makes it so that when we want to test business logic, we have to set up Zenodo stuff and be subject to the vagaries of Zenodo sandbox API; and when we want to test our Zenodo API interactions, we have to make sure to get the state set up just right, so that we get the Zenodo API actions we want.

Instead, we could pull those two sets of concerns apart, with a class hierarchy like so:

Deposition # we already have this
  metadata: DepositionMetadata
  file_list: list[DepositionFile]
  
Depositor # right now, this is ZenodoDepositionInterface
  get(doi)
  new_version(deposition)
  discard(deposition)
  publish(deposition)

  # get file id from deposition based on filename
  delete_file(deposition, filename)
  add_file(deposition, filename, data)
  update_file(deposition, filename, data)
  
Downloader # right now this is AbstractBaseDataArchiver
  # if we do this lazily, we might even circumvent some disk usage problems which would be a nice win. but not that important here.
  get_files()
  
ArchiveOrchestrator # Right now, this is ZenodoDepositorInterface also - which is where the problem lies
  run():
    #IO
    files = downloader.get_files()
    files_with_checksums = add_checksums(files)  # or the checksums could be calculated at download time.
    doi = get_correct_doi_from_metadata()
    # we don't really care about 'old deposition'
    # just about 'make me a new version for this doi'
    old_deposition = depositor.get_deposition(doi)
    deposition = depositor.new_version(old_deposition)
    
    # "business logic"
    changes = make_changes(files_with_checksums, deposition)
    
    # IO
    apply_changes(changes, deposition, depositor)
    
  make_changes(files, deposition):
    changes = []
    for local, remote in files:
      if local_checksum == remote_checksum:
        continue
      if remote in deposition.files_list:
        changes.append(Update)
      else:
        changes.append(Add)
    for remote in deposition.files_list:
      if remote not in files:
        changes.append(Delete)
        
    return changes
        
  # looks like this could be an instance member of the depositor as well
  apply_changes(changes, deposition, depositor):
    for change in changes:
      match change:
        Update:
          depositor.update()
        Add:
          depositor.add()
        Delete:
          depositor.delete()
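
For concreteness, the make_changes step above could look roughly like the runnable sketch below, assuming local and remote files are both represented as simple filename-to-checksum mappings; the Change type and action names are illustrative, not taken from the codebase.

from dataclasses import dataclass


@dataclass
class Change:
    action: str  # "add", "update", or "delete"
    filename: str


def make_changes(local: dict[str, str], remote: dict[str, str]) -> list[Change]:
    """Diff local downloads against the remote deposition by checksum."""
    changes = []
    for filename, checksum in local.items():
        if filename not in remote:
            changes.append(Change("add", filename))
        elif checksum != remote[filename]:
            changes.append(Change("update", filename))
    changes.extend(
        Change("delete", filename) for filename in remote if filename not in local
    )
    return changes

Because this function touches no IO, it can be exercised with plain parameterized unit tests, which is the point of the split.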

We could write unit tests like this:

@pytest.mark.parameterize()
test_make_changes(local_with_checksums, remote_with_checksums, expected):
   deposition = make_deposition_with_files
   assert make_changes(local, deposition) == expected

And test IO with integration tests:

@pytest.fixture()
the_fake_deposition:
  # get the test deposition
  
test_upload
test_delete

test_update
test_new_version ...

test_zenodo
  # create a test deposition
  # assert there are no files in it
  # assert the version is right
  # add
  # assert we have 1 file
  # update
  # assert we have 1 updated file
  # publish
  # assert success

In terms of how to get there from here:

  • Write a ZenodoDepositor that does Zenodo interactions; write tests for this
  • Pass in a ZenodoDepositor to the ZenodoDepositionInterface & pull all Zenodo IO out of that class
  • rename ZDI to ArchiveOrchestrator, turn add_file into run, and hook up to the CLI

Fix eia860m datapackage errors

The eia860m archiver creates parts in the wrong way, and PUDL can't parse them right now. They need to be changed to year_month.

Archive FERC EQR pre-2013

The existing EQR archiver only archives data from 2013 to present. There is commented-out code for archiving earlier years of data, but as of now it does not work. This is because those years of data are only available on an FTP server that has a global 3-user limit (see here for more info). As a workaround to this limit, the commented-out code will attempt to ping the FTP server until it gets a successful response, then try to download the data. So far in testing, however, the archiver has not been able to successfully interface with the server even when running all night trying to ping it.

Running archiver with `--dry-run` leaves an unpublished draft

The --dry-run option isn't quite as dry as I thought it would be.

  • When run on an existing dataset, a draft deposition is created (apparently based on the most recently published version) and left unpublished on Zenodo. I think doing a "dry run" should have no effect on Zenodo, and only communicate what would have been done if a real run had been requested.
  • When run in combination with --initialize an empty draft deposition is created and left unpublished.

"ValueError: I/O operation on closed file" uploading to Zenodo

Running the MSHA Mines archiver locally, I'm frequently getting I/O errors during upload, which don't seem to trigger a retry. This seems to be happening on the larger files (a couple 100 MB, by process of elimination), but I'm not sure which upload is actually failing.

I'm not sure if there was any issue with the file itself, as the temporary download directory is cleaned up at the end of the archiver run, but they're zipfiles, and so they should have been verified as valid zipfiles upon download.

2023-02-27 20:36:34 [    INFO] catalystcoop.pudl_archiver.depositors.zenodo:92 PUT https://sandbox.zenodo.org/api/files/a2ac85a6-1aa3-4339-9c07-51163bedffe9/mshamines-assessed_violations.zip - Uploading mshamines-assessed_violations.zip to bucket
2023-02-27 20:41:35 [    INFO] catalystcoop.pudl_archiver.utils:46 Error while executing <coroutine object ZenodoDepositor._make_requester.<locals>.requester.<locals>.run_request at 0x28dfa9f50> (try #1, retry in 10s):
Encountered exceptions, showing traceback for last one: ["('mshamines', ValueError('I/O operation on closed file'))"]
Traceback (most recent call last):
  File "/Users/zane/mambaforge/envs/pudl-cataloger/bin/pudl_archiver", line 8, in <module>
    sys.exit(main())
  File "/Users/zane/code/catalyst/pudl-archiver/src/pudl_archiver/cli.py", line 58, in main
    asyncio.run(archive_datasets(**vars(args)))
  File "/Users/zane/mambaforge/envs/pudl-cataloger/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/Users/zane/mambaforge/envs/pudl-cataloger/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/Users/zane/code/catalyst/pudl-archiver/src/pudl_archiver/__init__.py", line 81, in archive_datasets
    raise exceptions[-1][1]
  File "/Users/zane/code/catalyst/pudl-archiver/src/pudl_archiver/orchestrator.py", line 195, in run
    await self._apply_changes()
  File "/Users/zane/code/catalyst/pudl-archiver/src/pudl_archiver/orchestrator.py", line 258, in _apply_changes
    await self.depositor.create_file(
  File "/Users/zane/code/catalyst/pudl-archiver/src/pudl_archiver/depositors/zenodo.py", line 346, in create_file
    return await self.request(
  File "/Users/zane/code/catalyst/pudl-archiver/src/pudl_archiver/depositors/zenodo.py", line 104, in requester
    response = await retry_async(
  File "/Users/zane/code/catalyst/pudl-archiver/src/pudl_archiver/utils.py", line 41, in retry_async
    return await coro
  File "/Users/zane/code/catalyst/pudl-archiver/src/pudl_archiver/depositors/zenodo.py", line 95, in run_request
    response = await session.request(method, url, **kwargs)
  File "/Users/zane/mambaforge/envs/pudl-cataloger/lib/python3.10/site-packages/aiohttp/client.py", line 508, in _request
    req = self._request_class(
  File "/Users/zane/mambaforge/envs/pudl-cataloger/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 313, in __init__
    self.update_body_from_data(data)
  File "/Users/zane/mambaforge/envs/pudl-cataloger/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 517, in update_body_from_data
    size = body.size
  File "/Users/zane/mambaforge/envs/pudl-cataloger/lib/python3.10/site-packages/aiohttp/payload.py", line 379, in size
    return os.fstat(self._value.fileno()).st_size - self._value.tell()
ValueError: I/O operation on closed file

Flesh out metadata for raw datapackage uploads

Some issues to note in the metadata of the currently uploaded Zenodo raw data packages:

  • Zenodo datapackages are not currently versioned. The data package version should be the same as the Zenodo deposition version.
  • Uploaded raw input data packages lack a unique ID. The versioned deposition DOI should really be the ID... but this would mean some structural changes to how the whole library works. See catalyst-cooperative/pudl-zenodo-storage#3.
  • It would also be good to store the concept_doi in the datapackage metadata, to give the user the option of obtaining the most recent version of the archive, if desired.

Error installing pudl-archiver: Don't know machine value for archs=()

When I try and create the pudl-archiver environment using

conda env create --file environment.yml

I get ValueError: Don't know machine value for archs=() when installing pip dependencies.

This is the stack trace.

Traceback (most recent call last):
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/site-packages/pip/__main__.py", line 31, in <module>
    sys.exit(_main())
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/site-packages/pip/_internal/cli/main.py", line 70, in main
    return command.main(cmd_args)
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/site-packages/pip/_internal/cli/base_command.py", line 101, in main
    return self._main(args)
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/site-packages/pip/_internal/cli/base_command.py", line 216, in _main
    self.handle_pip_version_check(options)
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/site-packages/pip/_internal/cli/req_command.py", line 179, in handle_pip_version_check
    session = self._build_session(
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/site-packages/pip/_internal/cli/req_command.py", line 125, in _build_session
    session = PipSession(
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/site-packages/pip/_internal/network/session.py", line 343, in __init__
    self.headers["User-Agent"] = user_agent()
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/site-packages/pip/_internal/network/session.py", line 175, in user_agent
    setuptools_dist = get_default_environment().get_distribution("setuptools")
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/site-packages/pip/_internal/metadata/__init__.py", line 75, in get_default_environment
    return select_backend().Environment.default()
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/site-packages/pip/_internal/metadata/__init__.py", line 63, in select_backend
    from . import pkg_resources
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/site-packages/pip/_internal/metadata/pkg_resources.py", line 8, in <module>
    from pip._vendor import pkg_resources
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 959, in <module>
    class Environment:
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 963, in Environment
    self, search_path=None, platform=get_supported_platform(),
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 190, in get_supported_platform
    plat = get_build_platform()
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 395, in get_build_platform
    plat = get_platform()
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/sysconfig.py", line 745, in get_platform
    osname, release, machine = _osx_support.get_platform_osx(
  File "/Users/katielamb/miniconda3/envs/pudl-archiver/lib/python3.10/_osx_support.py", line 556, in get_platform_osx
    raise ValueError(
ValueError: Don't know machine value for archs=()

failed

CondaEnvException: Pip failed

I'm running macOS Monterey. Has anyone else with macOS successfully created the environment?

I tried taking out the optional dependencies (dev, docs, tests) but still got the error.

Maybe it's the build-system part of pyproject.toml?

[build-system]
requires = [
    "setuptools<66,>=61.0"
]
build-backend = "setuptools.build_meta"

Rewrite EPA CEMS archiver to use API

At some point in the last few months, EPA seems to have removed the bulk CSV files containing hourly emissions data from their site, and is now exclusively making the data available through an API.

Looking at their example bulk Python download script it seems like it's possible to download CSVs from an s3 bucket. Hopefully they're not too different from the CSVs we were previously downloading directly from EPA!

Note that this API replaces .zip files with one .csv file for each state-year, so we will also need to adjust the extraction step in pudl to match the new data format.

Write individual validation tests as flags in CLI

Sometimes we want to turn off one particular validation test for an archiver run without editing the codebase. Let's write each validation test as a flag that can be easily turned on and off through the CLI.

Combine archiver/scraper repos into a single repo

@zschira commented on Tue Sep 13 2022

Background

To simplify the archiving/scraping processes, and to enable automation of these processes, we should combine the archiver/scraper repos. These two repos are already tightly bound, as the archiver looks for data downloaded by the scraper and creates zenodo archives containing this data. Combining these two repos will allow for formalizing this dependency and making it easier to add/maintain datasets.

Tasks

  • Combine code into a single repository
  • Create glue code allowing the archiver/scrapers to interact in a more formalized way
  • Develop scripts for high level management of the scraping/archiving process

@zaneselvans commented on Tue Sep 13 2022

A minor thing that would be nice to change here is also getting rid of the hard-coded ~/Downloads/pudl_scrapers/scraped path for where the data is dropped off and picked up, since it's very OS / user specific. Especially if these are meant to run on GitHub Actions automatically, and they are going create archives as they go, the downloaded files could probably just go in a temporary directory and get deleted at the end of the archiving process.


@zschira commented on Mon Sep 19 2022

Combining the two repos offers us an opportunity to simplify some of the inter-repo metadata dependencies. These dependencies can make it cumbersome to integrate new data sources, as it's often difficult to understand what needs to be updated, and where it all lives. While undergoing the refactoring process involved with combining these repos, it should be a priority to consolidate as much of this information as possible, and make it easier to update. Below are the main friction points I've identified, with possible solutions for each:

Datasource Metadata

Currently the archiver repo depends on the DataSource metadata and class implemented in PUDL. Ultimately we don't want any of our tools depending on PUDL, so this metadata should be removed from PUDL. Allowing our metadata classes to be used by other projects has also been brought up in catalyst-cooperative/pudl#1522. As outlined in this issue, the metadata classes are fairly tightly bound to PUDL at the moment, and will take a decent amount of refactoring to pull this out.

Zenodo Archive Format

Both PUDL and the scraper must understand the structure of the archives at some level. The format of the archives is somewhat standard, but there are many anomalies. For example, FERC 2 DBF data is comprised of 2 partitions for the years 1991-1999, but only a single partition after that, and all of the FERC archives will contain DBF and XBRL data. The scrapers need to understand this information to download the data, and create archives, and PUDL needs to understand how to parse the archives (done via the Datastore). I've considered removing the Datastore, and encapsulating all of this information/logic in one place.

Proposed solutions

Option 1: Gradual refactor

Perhaps the most obvious solution would be to combine the repos with minimal refactoring right now, but with the intentions to fully disentangle the archiver/scraper from PUDL. This would mean maintaining the dependency on PUDL for the time being, while we continue to work to extract the metadata classes from PUDL.

Option 2: Combine archiver/scraper with PUDL

My second proposed solution is to actually combine the archiver/scraper and PUDL into one mono-repo. This seems somewhat counterintuitive as we are trying to reduce inter-repo dependencies, but I think perhaps simply formalizing those dependencies could be a really simple and clean solution. This would give the scraper/archivers immediate access to all the PUDL metadata required, while also potentially making it possible to integrate the scraping/archiving process into nightly builds and other PUDL automation.
The biggest obvious drawback is adding additional code to the already large codebase that is PUDL, but with some good module organization, this might not be that big of a deal. I also think moving away from PUDL as a library makes this more feasible, as we can control our dependencies better.


@zaneselvans commented on Mon Sep 19 2022

The mono-repo option makes me kind of nervous. Having everything wrapped up in the main PUDL repo seems like a setup for lots of entanglement between the different parts, and I'm wondering if that's unavoidable, or if there's a meaningful way to split up these concerns.

Both the scrapers/archivers and PUDL need to be able to access the metadata describing the data sources, but the metadata itself is almost just a static collection of Pydantic data models. So one reasonable arrangement seems like it would be to have a simple pudl-metadata repo that just stores that information, which both the archivers and PUDL depend on.

It also really feels like this must be a solved problem: storing blobs of unstructured or semi-structured data, such that particular blobs can be addressed based on some set of key-value pairs, and storing metadata associated with those blobs. Having some taxonomy of blobs. Is this what a so-called data lake or data lakehouse is?

Add all new archives to the Catalyst Cooperative community

We have a "community" on Zenodo that collects all of our published archives together at: https://zenodo.org/communities/catalyst-cooperative/

Currently new archives have to be added to the community manually, but we should automate the process so all of our data can be found more easily by users.

Unfortunately Zenodo currently only allows a single community curator, and the account associated with it belongs to @zaneselvans rather than the PUDL Bot which owns all of the datastore archives, so Zane will have to approve requests to join the community. This issue will be fixed in a future version of the Zenodo backend.
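
One possible route to automating this, assuming Zenodo's legacy deposition API behaves as documented: include a communities entry in the deposition metadata when the archiver creates or updates a deposition, which files a join request that the community curator can then approve. Everything below (the deposition ID, the exact metadata values) is a placeholder sketch, not the archiver's actual upload code.

```python
# Sketch only: request membership in the catalyst-cooperative community by
# including it in the deposition metadata. DEPOSITION_ID is a placeholder.
import os

import requests

ZENODO_API = "https://zenodo.org/api"
DEPOSITION_ID = 123456  # placeholder

response = requests.put(
    f"{ZENODO_API}/deposit/depositions/{DEPOSITION_ID}",
    params={"access_token": os.environ["ZENODO_TOKEN_UPLOAD"]},
    json={
        "metadata": {
            "title": "Example PUDL raw data archive",
            "upload_type": "dataset",
            "description": "Placeholder metadata for illustration.",
            "creators": [{"name": "Catalyst Cooperative"}],
            "communities": [{"identifier": "catalyst-cooperative"}],
        }
    },
)
response.raise_for_status()
```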

Update EIA Thermoelectric Cooling Water link matching patterns

Add validation test that looks at file size

The new validation tests currently check for missing files and for invalid files based on file type. Add another check that looks at file size and flags the following (a sketch of one possible check appears after this list):

  • File sizes that fall outside the 95th-percentile range of the previous datastore version
  • All files in a dataset having the same size (a sign of invalid files)
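
This is a minimal sketch of what such a check could look like, under assumed function and variable names; the thresholds and exactly how sizes are compared against the old datastore are open questions.

```python
# Sketch only: flag suspicious file sizes in a newly downloaded dataset by
# comparing against the file sizes recorded for the previous archive version.
def flag_file_size_problems(
    new_sizes: dict[str, int], old_sizes: dict[str, int]
) -> list[str]:
    """Return human-readable warnings about suspicious file sizes."""
    warnings: list[str] = []

    # Flag files whose size falls outside the central band (roughly the
    # 5th-95th percentile) of the previous archive's file sizes.
    old = sorted(old_sizes.values())
    if old:
        low = old[int(0.05 * (len(old) - 1))]
        high = old[int(0.95 * (len(old) - 1))]
        for name, size in new_sizes.items():
            if not low <= size <= high:
                warnings.append(f"{name}: size {size} outside [{low}, {high}]")

    # Flag the case where every file has the same size, a common symptom of
    # repeatedly downloading an error page or other invalid content.
    if len(new_sizes) > 1 and len(set(new_sizes.values())) == 1:
        warnings.append("all files in the dataset have identical sizes")

    return warnings
```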

Develop unit tests for new archiver

There were tests for both the scraper and archiver before. Many of these can be moved over to the new combined repo, but they need to be updated to work with the new architecture, and a number of new tests will need to be added as well.

  • Develop unit tests
  • Develop integration tests
  • Create basic CI for running tests

Validate zipfile contents

Many of the resources we are archiving are zipfiles containing many other files. Currently we check that the zipfile is a valid zipfile, but we don't look inside it. It came up in the context of the FERC XBRL Taxonomies that the archiver might occasionally get blocked from downloading some files, so we need the ability to look inside the zipfile and verify that everything we expect is present and looks correct.
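
A minimal sketch of such a check, using only the standard library and assumed names; what counts as the expected set of members would need to come from per-dataset configuration.

```python
# Sketch only: look inside a downloaded zipfile and report problems.
import zipfile
from pathlib import Path


def validate_zip_contents(path: Path, expected_members: set[str]) -> list[str]:
    """Return a list of problems found inside a zip archive."""
    problems = []
    try:
        with zipfile.ZipFile(path) as archive:
            bad = archive.testzip()  # first corrupted member, or None
            if bad is not None:
                problems.append(f"corrupted member: {bad}")
            missing = expected_members - set(archive.namelist())
            if missing:
                problems.append(f"missing expected members: {sorted(missing)}")
    except zipfile.BadZipFile:
        problems.append(f"{path} is not a valid zipfile")
    return problems
```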

Summarize archiver run and send notification via GH Action

With the move to automated archiving, there's a need to easily and visibly summarize the outcome of each automated archival run. The communication should summarize the status of each archiver's run and any critical warnings or errors. We could also use this as a chance to create a public record of what's new in each deposition. A sketch of one possible per-dataset summary record appears after the lists below.

Who/Where to notify?

The hope is that these archives will be a public resource, so producing a persistent public record of the data updates and/or allowing the general public to get access to this information would be good. Some options:

  • Send a Slack message to a publicly accessible channel.
  • Send an email to a list that outside folks can sign up to.
  • Create an issue on GitHub containing the summary and requiring that it be reviewed by someone on our team e.g. “Review new data snapshots for 2023-03-14.”
  • Automatically tweet a link to the GitHub issue.

Notification / Summary contents?

  • Have a section dedicated to each dataset being archived.
  • For each dataset, indicate whether the archiving run was labeled a success or a failure.
  • If there was no change from the previous version, resulting in no new archive being created, say so explicitly, and indicate how long it has been since the archive was updated (we should eventually get suspicious if archives that we expect to change don't).
  • List any change in the set of data partitions that were archived relative to the previous version.
  • List any existing data partitions that were updated and indicate the change in file size.
  • Summarize how many partitions were updated (number and % changed)
  • Summarize the change in archive size (absolute MB and % change)
  • If the archiving run was a failure, indicate what criteria failed and provide context. E.g.
    • List any expected but missing data partitions
    • List any unexpected new data partitions
    • List any files that have an unexpected internal filetype, and the type that was expected
    • List any files that shrank so much that they triggered the failure
  • If the archiving run failed for reasons unrelated to the data being archived, indicate that as well. E.g.
    • Unable to connect to the data provider
    • Unable to connect to Zenodo
    • GitHub runner ran out of disk space
    • Archiving run took too long and timed out
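
As promised above, here is one possible shape for the per-dataset summary. All names here are hypothetical: each archiver run would produce one record per dataset, and the GitHub Action could render them into an issue body, Slack message, or email.

```python
# Sketch only: a per-dataset summary record and a simple markdown rendering.
from dataclasses import dataclass, field


@dataclass
class RunSummary:
    dataset: str
    success: bool
    previous_version_date: str | None = None
    partitions_added: list[str] = field(default_factory=list)
    partitions_removed: list[str] = field(default_factory=list)
    partitions_updated: list[str] = field(default_factory=list)
    size_change_mb: float = 0.0
    failures: list[str] = field(default_factory=list)  # e.g. "unable to connect to Zenodo"

    def to_markdown(self) -> str:
        status = "success" if self.success else "FAILURE"
        lines = [f"## {self.dataset}: {status}"]
        changed = self.partitions_added or self.partitions_removed or self.partitions_updated
        if not changed:
            lines.append(
                f"No change from the previous version (last updated {self.previous_version_date})."
            )
        else:
            lines.append(f"Partitions added: {len(self.partitions_added)}")
            lines.append(f"Partitions removed: {len(self.partitions_removed)}")
            lines.append(f"Partitions updated: {len(self.partitions_updated)}")
            lines.append(f"Archive size change: {self.size_change_mb:+.1f} MB")
        lines.extend(f"- {failure}" for failure in self.failures)
        return "\n".join(lines)
```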

Use new https:// source for the EPA CEMS data

@zaneselvans commented on Tue Jul 26 2022

EPA now appears to be providing (and preferring) https:// access to the FTP server that has historically been used to distribute the EPA CEMS hourly emissions data. This means we can get rid of the janky FTP-specific logic in our epacems data downloading script and use standard requests infrastructure (which will hopefully be much faster and might be able to run in async mode?).

The new address is: https://gaftp.epa.gov/DMDnLoad/emissions/hourly/monthly/
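
A hedged sketch of what the https-based download could look like, using aiohttp for async fetches. The base URL is the one given above; the helper and the file name are hypothetical, not the archiver's actual implementation.

```python
# Sketch only: fetch a monthly CEMS file over https asynchronously.
import asyncio

import aiohttp

BASE_URL = "https://gaftp.epa.gov/DMDnLoad/emissions/hourly/monthly/"


async def download_monthly_file(session: aiohttp.ClientSession, filename: str) -> bytes:
    """Fetch one monthly CEMS file over https and return its bytes."""
    async with session.get(BASE_URL + filename) as response:
        response.raise_for_status()
        return await response.read()


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # Hypothetical file name, just to show the call pattern.
        data = await download_monthly_file(session, "2020/2020tx01.zip")
        print(f"downloaded {len(data)} bytes")


if __name__ == "__main__":
    asyncio.run(main())
```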


@zaneselvans commented on Fri Oct 21 2022

@zschira did you say this issue had been addressed in the scraper-archiver repo merge changes?


@zschira commented on Fri Oct 21 2022

Oh yeah, forgot to add this to the sprint, but the new scraper/archiver is using the https source

Validate zipfiles and re-download if corrupted

@zaneselvans commented on Wed Aug 17 2022

In creating this EIA-923 archive, everything seemed to go fine, but somehow the eia923-2020.zip file was corrupted or incomplete (even though it was about the right size), and I didn't catch it until I tried to run the ETL. The file on the EIA website seems fine, so it appears to have been a downloading glitch (the local copy downloaded by the scraper is corrupted too).

We should at least check that zipfiles we download are valid (without getting into the contents) to prevent this kind of thing from happening.
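
A minimal sketch, with assumed names, of the integrity check and retry described here: after downloading, verify that the zipfile opens and that none of its members are corrupted; if the check fails, re-download a limited number of times.

```python
# Sketch only: validate a downloaded zipfile and retry the download if it fails.
import zipfile
from pathlib import Path

import requests


def zipfile_is_valid(path: Path) -> bool:
    """True if the file is a readable zip archive with no corrupted members."""
    try:
        with zipfile.ZipFile(path) as archive:
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False


def download_zip_with_retries(url: str, path: Path, max_attempts: int = 3) -> None:
    """Download a zipfile, re-downloading if the result fails validation."""
    for attempt in range(1, max_attempts + 1):
        response = requests.get(url, timeout=300)
        response.raise_for_status()
        path.write_bytes(response.content)
        if zipfile_is_valid(path):
            return
    raise RuntimeError(f"{url} still invalid after {max_attempts} attempts")
```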
