
Comments (4)

zaneselvans commented on June 3, 2024

@zschira what do you think of all this?


zschira commented on June 3, 2024

It also seems like when no update is required because none of the files have changed, the files are still getting uploaded to Zenodo unnecessarily, creating an unpublished draft. (I think if none of the file checksums change, Zenodo may not allow a new version to be published?)

Creating a new draft is expected behavior: it happens at the start of a run, and if nothing changes, the draft is never published. The draft will have the exact same contents as the previous version, since that is Zenodo's behavior when creating a new draft; nothing should actually be uploaded to create it. The existence of a draft should also not block future runs: if the archiver finds an existing unpublished draft, it will just reuse it. We could change this behavior so that no draft is created until changes are detected, but the archiver should be able to reuse drafts without a problem.
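To make the reuse behavior concrete, here's a rough sketch of "find an existing draft or start a new version" against the plain Zenodo REST API. The names (get_or_create_draft, latest_id, etc.) are made up for illustration, not pudl-archiver internals:

import requests

ZENODO_API = "https://sandbox.zenodo.org/api"  # sandbox instance, as in this thread

def get_or_create_draft(concept_rec_id: str, latest_id: int, token: str) -> dict:
    """Reuse an existing unpublished draft if there is one, else make a new version."""
    params = {"access_token": token}
    # List this user's unpublished drafts and look for one that belongs to
    # the same concept record as the deposition we're archiving.
    drafts = requests.get(
        f"{ZENODO_API}/deposit/depositions",
        params={**params, "status": "draft"},
    ).json()
    for draft in drafts:
        if draft.get("conceptrecid") == concept_rec_id:
            return draft  # reuse rather than creating yet another draft
    # No draft found: create a new version from the latest published
    # deposition. Zenodo copies the previous version's files into the draft.
    resp = requests.post(
        f"{ZENODO_API}/deposit/depositions/{latest_id}/actions/newversion",
        params=params,
    )
    resp.raise_for_status()
    # The newversion response describes the *old* deposition; the new draft
    # lives at links.latest_draft (just as in the get_new_version code below).
    draft_url = resp.json()["links"]["latest_draft"]
    return requests.get(draft_url, params=params).json()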

It seems like maybe the RSS/XBRL sources are working as expected because every single archive requires an update.

I did some digging into why every run is requiring updates, and it looks like the names of the XBRL files are changing. These names come from the guid field of the RSS feed, which is supposed to be a unique identifier, but as far as I can tell FERC just puts random IDs in this field every time you download the feed. This is definitely not desired, so I can email FERC about it, and/or we can change where we get the filing name from.
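As a sketch of the second option (deriving the filing name from something other than the guid), we could key filings off each entry's link, which points at the filing itself and should stay stable across feed downloads. The function name and feed URL below are placeholders, not pudl-archiver code:

import feedparser  # third-party RSS parser: pip install feedparser

def stable_filing_name(entry) -> str:
    # entry.id is the feed's guid, which FERC appears to regenerate on every
    # download. The entry's link should be identical across downloads, so
    # derive the filing name from its last path segment instead.
    return entry.link.rstrip("/").rsplit("/", 1)[-1]

feed = feedparser.parse("https://example.ferc.gov/rssfeed")  # placeholder URL
names = [stable_filing_name(entry) for entry in feed.entries]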

It looks like EPA has taken down the bulk EPA CEMS files entirely. When the archiver ran, it got a 404 on the old index page. It looks like the CEMS data may now only be available via an API; EPA has a repo with API examples and the API documentation.

This must be a quite recent change; we'll need to update the archiver to use the new API. It also highlights a place where we might want to error out: I ran the EPACEMS archiver and it logged a warning that no files to download could be found, but that should probably result in a failure.
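A minimal sketch of that suggested change, with hypothetical names rather than pudl-archiver's actual classes:

class NoFilesFoundError(RuntimeError):
    """Raised when an archiver run finds zero downloadable resources."""

def check_resources(resources: list, dataset: str) -> None:
    # An empty resource list almost always means the upstream source moved
    # (e.g. EPA CEMS going API-only), so fail loudly instead of just warning.
    if not resources:
        raise NoFilesFoundError(
            f"{dataset}: no files to download were found; "
            "the upstream source may have moved or changed format."
        )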

General thoughts

In general, I think some of these sandbox depositions are just in a weird state right now, potentially from all the testing we've been doing recently. For example, I ran the epacamd_eia archiver and it said nothing should change, even though the checksum of the zipfile in the most recent published version doesn't match the checksum of the downloaded file. That's because the draft deposition it's reusing already has the new zipfile, so the archiver doesn't see any changes. @zaneselvans mentioned the draft is from September 8th, so perhaps it was created with the old pudl-zenodo-storage tool, which didn't automatically publish new versions. We could go through dataset by dataset and clean up these archives, or, because it's the sandbox, we could just run --initialize on each dataset and start fresh to get out of any weird states.

I think changing the archiver to not create a new draft deposition until after it has detected changes would be good, and might help avoid some of these weird state issues. It would also be nice if there were a programmatic way to delete draft depositions; just deleting any existing draft rather than trying to reuse it might avoid some weirdness. I haven't been able to figure out how to do this, though. Perhaps @zaneselvans has seen something?
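(For what it's worth, Zenodo's REST API does document a DELETE on a deposition, which only works while it has never been published; that matches the delete endpoint @jdangerx discusses below. A sketch, with the function name and sandbox URL as assumptions:)

import requests

def delete_draft(deposition_id: int, token: str) -> None:
    # Zenodo only allows deleting depositions that have never been
    # published, so this is safe to point at an unpublished draft.
    resp = requests.delete(
        f"https://sandbox.zenodo.org/api/deposit/depositions/{deposition_id}",
        params={"access_token": token},
    )
    resp.raise_for_status()  # 403/404 if the deposition is published or missing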

Also curious if @jdangerx has any thoughts?


jdangerx commented on June 3, 2024

@zschira did a good job covering the high-level stuff - here are some additional thoughts:

The unpublished draft problem occurs when something like this happens:

  1. old data is in our published deposition
  2. we start an archiver run, making a new draft that has the old deposition data
  3. we download a bunch of new data, see that that's not what's in the draft, and update the draft deposition to have the new data
  4. publish fails for some reason
  5. old data is still the latest published version, but draft deposition has all the new data
  6. we start an archiver run, picking up the draft deposition
  7. we download a bunch of new data, see that that is exactly what is in the draft, and then decide "there are no changes to be made"
  8. now we are stuck in "unpublishable new data" land until the data changes yet again.

I think ideally, when we start an archiver run, we'd delete the old draft and create a new one; then we avoid this issue altogether. I think we had run into issues using the discard endpoint for this before (I believe that's for discarding changes in an edit session on an already-published deposition), but we might be able to use the delete endpoint to make sure the new version is actually new. Something like this in the Zenodo depositor's get_new_version method:

+        response = await self.request(
+            "POST", url, log_label="Creating new version", headers=self.auth_write
+        )
+        old_deposition = Deposition(**response)
+        # Get the URL of the newly created draft
+        new_deposition_url = old_deposition.links.latest_draft
+
+        # Immediately delete the new version, in case it is actually a stale old draft
+        headers = {
+            "Content-Type": "application/json",
+        } | self.auth_write
+        response = await self.request("DELETE", new_deposition_url, headers=headers)
+
+        # Re-create a genuinely fresh new version
        response = await self.request(
            "POST", url, log_label="Creating new version", headers=self.auth_write
        )
        old_deposition = Deposition(**response)
        # Get the URL of the new draft deposition
        new_deposition_url = old_deposition.links.latest_draft

        # existing metadata logic elided from this code snippet

        response = await self.request(
            "PUT",
            new_deposition_url,
            log_label=f"Updating version number from {previous} ({old_deposition.id_}) to {version_info}",
            data=data,
            headers=headers,
        )
        return Deposition(**response)


jdangerx commented on June 3, 2024

Small PR to delete depositions at the right time incoming.

