
Comments (7)

looselycoupled avatar looselycoupled commented on June 2, 2024

I'm on this one.

from cultivar.

looselycoupled avatar looselycoupled commented on June 2, 2024

@bbengfort and @rebeccabilbro, please check out my proposal below and let me know if anything is counter to project requirements.

Proposal:

There are a number of long-term issues with the proposal below, but it gets us closer to what we want. Resolving some of them would require answering outstanding questions or picking up issues that are not yet assigned or identified.

S3 Buckets / Storage
Prepackaged download bundles are stored in the S3 bucket that already holds the base files. The bucket is open to the world, but browsing is not allowed. According to the current code, each dataset has its own folder at datasets/<account>/<dataset>, and each datafile is stored there. We will create a bundles directory with a sub-directory for each version, as in datasets/<account>/<dataset>/bundles/12. The bundle filename will always be <dataset>-bundle-v<version>.zip, e.g. floompa-bundle-v12.zip.
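As a minimal sketch of the naming scheme described above (the function name and the "acme" account are hypothetical, used only for illustration), the bundle key could be built like this:

```python
def bundle_key(account: str, dataset: str, version: int) -> str:
    """Build the S3 object key for a dataset bundle at a given version."""
    filename = f"{dataset}-bundle-v{version}.zip"
    return f"datasets/{account}/{dataset}/bundles/{version}/{filename}"


# "acme" is a made-up account name for this example.
print(bundle_key("acme", "floompa", 12))
```

Keeping the version in both the folder and the filename means a downloaded file still identifies its version once it leaves the bucket.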

Question: Alternatively we could create bundles/<account>/<dataset>/<version> and keep them somewhat separate. Thoughts? Is that even needed long term for security or other reasons?

Question: Should we keep bundles in a different bucket and use a UUID as the folder name to obfuscate the path, so that only those who have the link for a private dataset can download it? Should we just use a UUID as the bundle name, or is there a requirement that it be a friendly filename in some way?

Security
For the moment, if a user has a link then they can download the bundle even if it's a private dataset.

Bundle generation
Whenever an update is needed, a new Celery task is enqueued to replace (or initially add) a bundle. Presumably one could trigger a bunch of updates relatively quickly. There is a timing problem here: only the latest bundle is ever generated. I'd like to punt on this problem until I have a better idea of how we are versioning the individual files (it seems easily solvable in the future).
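To illustrate the timing problem, here is a plain-Python simulation (not the project's actual Celery task; all names are hypothetical). It assumes each queued task checks whether a newer version has been requested before doing any work, so rapid updates leave only the latest bundle built:

```python
latest_requested = {}  # dataset name -> highest version requested so far
built = []             # (dataset, version) pairs that were actually bundled


def request_bundle(dataset, version):
    """Record a bundle request; the return value stands in for queued task args."""
    latest_requested[dataset] = max(version, latest_requested.get(dataset, 0))
    return (dataset, version)


def run_bundle_task(dataset, version):
    """Run a queued task, skipping it if a newer version was requested since."""
    if version < latest_requested.get(dataset, 0):
        return False  # stale request: no bundle is built for this version
    built.append((dataset, version))
    return True


# Three updates are enqueued quickly, before any task has run.
queue = [request_bundle("floompa", v) for v in (1, 2, 3)]
results = [run_bundle_task(*args) for args in queue]
print(results, built)
```

Versions 1 and 2 never get a bundle in this simulation, which is fine if only the latest download is offered but becomes a gap once per-version downloads are expected.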

User Interface
Users can use the download link on the project page. If no bundle is available yet, a pop-up message is displayed (I can also color code the download button yellow until it is ready). Otherwise, the download link points directly to the S3 HTTP download. I'll likely add a new dataset field to indicate whether the bundle is ready, either a simple boolean or perhaps something more informative. What might be best is a DatasetVersion model to map DataFiles to Datasets. That would be a natural place for the status and would give us more flexibility in the future.
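A rough sketch of the "something more informative than a boolean" idea, assuming a small status enum drives the button state. The status names, the returned dictionary shape, and the example URL are all hypothetical:

```python
from enum import Enum


class BundleStatus(Enum):
    # Hypothetical states; the proposal only requires "ready" vs "not ready".
    PENDING = "pending"
    READY = "ready"


def download_button(status, url):
    """Decide how the download button should render for a bundle status."""
    if status is BundleStatus.READY:
        return {"color": "default", "href": url, "message": None}
    # Not ready yet: yellow button, no link, and a pop-up message instead.
    return {"color": "yellow", "href": None,
            "message": "The bundle is still being generated."}


print(download_button(BundleStatus.PENDING, "https://example.com/bundle.zip"))
```

An enum leaves room for further states later without changing the field type.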


looselycoupled avatar looselycoupled commented on June 2, 2024
  • Develop new DatasetVersion model
  • Develop migration file for existing data?
  • Modify upload code to increment dataset version
  • Develop celery task to bundle content, update, version record
  • Color code Download link
  • Provide popup with download links for available versions


bbengfort avatar bbengfort commented on June 2, 2024

Point on security: at the moment (I believe) the bucket requires a token to give up the goods, and that token is generated via boto through the Django Storages app. The token grants the user a download, and the link only lasts for 6 hours or something. Meaning that the link isn't created for a user who doesn't have permission.

If this is not the case, then I must have manually edited the bucket for development reasons, and we should go back to the token method above.
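The idea of a link that is only issued to permitted users and expires after a few hours can be illustrated with a standard-library sketch. This is not the S3 signing scheme and not what boto or Django Storages do internally; it is a simplified HMAC stand-in, and the secret, key, and function names are hypothetical:

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"  # hypothetical signing key, known only to the server


def make_token(key, expires_at, secret=SECRET):
    """Sign an object key together with its expiry time (Unix seconds)."""
    message = f"{key}:{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_valid(key, expires_at, token, now=None, secret=SECRET):
    """Accept the token only if it matches and the link has not expired."""
    now = time.time() if now is None else now
    expected = make_token(key, expires_at, secret)
    return now < expires_at and hmac.compare_digest(expected, token)


# The server issues a token only after its own permission check passes.
key = "bundles/floompa-bundle-v12.zip"
issued_at = 1000
expires_at = issued_at + 6 * 3600  # six-hour lifetime, as guessed above
token = make_token(key, expires_at)
print(is_valid(key, expires_at, token, now=issued_at))
```

Because the expiry time is part of the signed message, a user cannot extend the link by editing the expiry value.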


bbengfort avatar bbengfort commented on June 2, 2024

Also, I'm happy to store the bundles on S3 if that's what you think we should do. However, I was planning to generate the zip file on demand from the things that are in the database, via the zipfile library and StringIO objects, sort of like the approach in "Use compressed data directly – from ZIP files or gzip http response".

Maybe you're thinking this doesn't scale, which is fair; so bundles/account/dataset-version.zip seems fine to me. All the rest of your proposal looks good to me.
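For comparison, the on-demand approach mentioned above can be sketched with the standard library alone. This assumes Python 3, where io.BytesIO takes the place of StringIO for binary data, and it assumes the file contents have already been loaded into memory; the function name and sample files are hypothetical:

```python
import io
import zipfile


def build_bundle(files):
    """Return zip archive bytes for a mapping of filename -> content (bytes)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    # Closing the ZipFile writes the central directory; the buffer stays open.
    return buffer.getvalue()


data = build_bundle({"README.md": b"# floompa\n", "data.csv": b"a,b\n1,2\n"})

# Read the archive back from memory to confirm its contents.
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    names = archive.namelist()
print(names)
```

The whole archive lives in memory here, which is the scaling concern raised above: large datasets would favor building the bundle once and storing it.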


looselycoupled avatar looselycoupled commented on June 2, 2024

Current status:
A new bundle is created whenever a file is added and the download link works correctly.

Todo:
The only major item left is to create a new many-to-many relationship so that we can keep track of which files go with which versions. Right now everything maps to the latest version, which is the only download provided. The goal is to keep track of the dataset at every version and offer downloads for each.
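A minimal in-memory sketch of the version-to-files mapping described above, assuming every upload creates a new version that contains all earlier files plus the new one. The class and method names are hypothetical, and a real implementation would presumably be a Django many-to-many field rather than a dictionary:

```python
class DatasetHistory:
    """Track which files belong to each version of a dataset."""

    def __init__(self):
        self.versions = {}  # version number -> frozenset of file names
        self.latest = 0

    def add_file(self, filename):
        """Create a new version holding the previous files plus this one."""
        current = self.versions.get(self.latest, frozenset())
        self.latest += 1
        self.versions[self.latest] = current | {filename}
        return self.latest

    def files_at(self, version):
        """List the files that make up the dataset at a given version."""
        return sorted(self.versions[version])


history = DatasetHistory()
history.add_file("a.csv")
history.add_file("b.csv")
print(history.files_at(1), history.files_at(2))
```

Since a file can appear in many versions and a version holds many files, the relationship is many-to-many, which is what makes downloads of earlier versions possible.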


bbengfort avatar bbengfort commented on June 2, 2024

I like the idea of being able to download a dataset at previous versions - that will help with estimator reproducibility and a host of other items.

