
Comments (4)

zaneselvans commented on June 3, 2024

@zschira what do you think of all this?


zschira commented on June 3, 2024

It also seems like when no update is required because none of the files have changed, the files are still getting uploaded to Zenodo unnecessarily, creating an unpublished draft. (I think if none of the file checksums change, Zenodo may not allow a new version to be published?)

Creating a new draft is expected behavior: it happens at the start of a run, and if nothing changes, the draft is never published. The draft will have the exact same contents as the previous version, since that is Zenodo's behavior when creating a new draft; nothing should actually be uploaded to create it. The existence of a draft should also not block future runs: if the archiver finds an existing unpublished draft, it will just reuse it. We could change this behavior so that no draft is created until changes are detected, but the archiver should be able to reuse drafts without a problem.
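To make the reuse behavior concrete, here's a rough sketch of "find an existing draft or start a new version" against the plain Zenodo REST API. The names (get_or_create_draft, latest_id, etc.) are made up for illustration, not pudl-archiver internals:

import requests

ZENODO_API = "https://sandbox.zenodo.org/api"  # sandbox instance, as in this thread

def get_or_create_draft(concept_rec_id: str, latest_id: int, token: str) -> dict:
    """Reuse an existing unpublished draft if there is one, else make a new version."""
    params = {"access_token": token}
    # List this user's unpublished drafts and look for one that belongs to
    # the same concept record as the deposition we're archiving.
    drafts = requests.get(
        f"{ZENODO_API}/deposit/depositions",
        params={**params, "status": "draft"},
    ).json()
    for draft in drafts:
        if draft.get("conceptrecid") == concept_rec_id:
            return draft  # reuse rather than creating yet another draft
    # No draft found: create a new version from the latest published
    # deposition. Zenodo copies the previous version's files into the draft.
    resp = requests.post(
        f"{ZENODO_API}/deposit/depositions/{latest_id}/actions/newversion",
        params=params,
    )
    resp.raise_for_status()
    # The newversion response describes the *old* deposition; the new draft
    # lives at links.latest_draft (just as in the get_new_version code below).
    draft_url = resp.json()["links"]["latest_draft"]
    return requests.get(draft_url, params=params).json()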

It seems like maybe the RSS/XBRL sources are working as expected because every single archive requires an update.

I did some digging into why every run is requiring updates, and it looks like the names of the XBRL files are changing. These names come from the guid field of the RSS feed, which is supposed to be a unique identifier, but as far as I can tell FERC just puts random IDs in this field every time you download the feed. This is definitely not desired, so I can email FERC about it, and/or we can change where we get the filing name from.
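As a sketch of the second option (deriving the filing name from something other than the guid), we could key filings off each entry's link, which points at the filing itself and should stay stable across feed downloads. The function name and feed URL below are placeholders, not pudl-archiver code:

import feedparser  # third-party RSS parser: pip install feedparser

def stable_filing_name(entry) -> str:
    # entry.id is the feed's guid, which FERC appears to regenerate on every
    # download. The entry's link should be identical across downloads, so
    # derive the filing name from its last path segment instead.
    return entry.link.rstrip("/").rsplit("/", 1)[-1]

feed = feedparser.parse("https://example.ferc.gov/rssfeed")  # placeholder URL
names = [stable_filing_name(entry) for entry in feed.entries]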

It looks like EPA has taken down the bulk EPA CEMS files entirely. When the archiver ran, it got a 404 on the old index page. It looks like the CEMS data may now only be available via an API; EPA has a repo with API examples and the API documentation.

This must be a quite recent change; we'll need to update the archiver to use the new API. It also highlights a place where we might want to error out: I ran the EPACEMS archiver and it logged a warning that no files to download could be found, but that should probably result in a failure.
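A minimal sketch of that suggested change, with hypothetical names rather than pudl-archiver's actual classes:

class NoFilesFoundError(RuntimeError):
    """Raised when an archiver run finds zero downloadable resources."""

def check_resources(resources: list, dataset: str) -> None:
    # An empty resource list almost always means the upstream source moved
    # (e.g. EPA CEMS going API-only), so fail loudly instead of just warning.
    if not resources:
        raise NoFilesFoundError(
            f"{dataset}: no files to download were found; "
            "the upstream source may have moved or changed format."
        )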

General thoughts

In general, I think some of these sandbox depositions are just in a weird state right now, potentially from all the testing we've been doing recently. For example, I ran the epacamd_eia archiver and it said nothing should change, even though the checksum of the zipfile in the most recent published version doesn't match the checksum of the downloaded file. That's because the draft deposition it's reusing already has the new zipfile, so the archiver doesn't see any changes. @zaneselvans mentioned the draft is from September 8th, so perhaps it was created with the old pudl-zenodo-storage tool, which didn't automatically publish new versions. We could go through dataset by dataset and clean up these archives, or, because it's the sandbox, we could just run --initialize on each dataset and start fresh to get out of any weird states.

I think changing the archiver to not create a new draft deposition until after it has detected changes would be good, and might help avoid some of these weird state issues. It would also be nice if there were a programmatic way to delete draft depositions; just deleting any existing draft rather than trying to reuse it might avoid some weirdness. I haven't been able to figure out how to do this, though. Perhaps @zaneselvans has seen something?
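(For what it's worth, Zenodo's REST API does document a DELETE on a deposition, which only works while it has never been published; that matches the delete endpoint @jdangerx discusses below. A sketch, with the function name and sandbox URL as assumptions:)

import requests

def delete_draft(deposition_id: int, token: str) -> None:
    # Zenodo only allows deleting depositions that have never been
    # published, so this is safe to point at an unpublished draft.
    resp = requests.delete(
        f"https://sandbox.zenodo.org/api/deposit/depositions/{deposition_id}",
        params={"access_token": token},
    )
    resp.raise_for_status()  # 403/404 if the deposition is published or missing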

Also curious if @jdangerx has any thoughts?


jdangerx commented on June 3, 2024

@zschira did a good job covering the high-level stuff - here are some additional thoughts:

The unpublished draft problem occurs when something like this happens:

  1. old data is in our published deposition
  2. we start an archiver run, making a new draft that has the old deposition data
  3. we download a bunch of new data, see that that's not what's in the draft, and update the draft deposition to have the new data
  4. publish fails for some reason
  5. old data is still the latest published version, but draft deposition has all the new data
  6. we start an archiver run, picking up the draft deposition
  7. we download a bunch of new data, see that that is exactly what is in the draft, and then decide "there are no changes to be made"
  8. now we are stuck in "unpublishable new data" land until the data changes yet again.

I think ideally, when we start an archiver run, we'd delete the old draft and create a new one; then we avoid this issue altogether. I think we had run into issues using the discard endpoint for this before (I believe that's for discarding changes in an edit session on an already-published deposition), but we might be able to use the delete endpoint to make sure the new version is actually new. Something like this in the Zenodo depositor's get_new_version method:

+        response = await self.request(
+            "POST", url, log_label="Creating new version", headers=self.auth_write
+        )
+        old_deposition = Deposition(**response)
+        # Get the URL of the newly created draft
+        new_deposition_url = old_deposition.links.latest_draft
+
+        # Immediately delete the new version, in case it is actually a stale old draft
+        headers = {
+            "Content-Type": "application/json",
+        } | self.auth_write
+        response = await self.request("DELETE", new_deposition_url, headers=headers)
+
+        # Re-create a genuinely fresh new version
        response = await self.request(
            "POST", url, log_label="Creating new version", headers=self.auth_write
        )
        old_deposition = Deposition(**response)
        # Get the URL of the new draft deposition
        new_deposition_url = old_deposition.links.latest_draft

        # existing metadata logic elided from this code snippet

        response = await self.request(
            "PUT",
            new_deposition_url,
            log_label=f"Updating version number from {previous} ({old_deposition.id_}) to {version_info}",
            data=data,
            headers=headers,
        )
        return Deposition(**response)


jdangerx commented on June 3, 2024

Small PR to delete depositions at the right time incoming.

