
conference-archive's People

Contributors

ajaysmurthy, alastair, ecanoc, ejhumphrey, fzalkow, ismirweb, stefan-balke


conference-archive's Issues

continuous integration with Zenodo

The metadata in this repository should act as the Single Source of Truth for ISMIR proceedings. Downstream consumers, like Zenodo, do not actively pull from this collection; instead, this information must be pushed to them.

One way to do this would be for the uploader scripts to run on something like Travis-CI on successful pushes to master. Ideally, only the deltas would be pushed, though perhaps a full sweep of the metadata would occasionally be necessary; a minimal sketch of the delta step follows the open questions below.

There are a few open questions:

  • How long would it reasonably take for this to run? Zenodo does have rate limiting, and it's unclear whether, say, a 20-minute update on Travis is acceptable.
  • How does Travis handle API keys in a secure way? Does this require a subscription?
  • How often is this metadata going to change? If the community is going to be very attentive, this might be worth it; alternatively, writing all the scripts and leaving it as a monthly update on a private machine somewhere may be sufficient.
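
A minimal sketch of the delta step, assuming the records live under database/proceedings/, that each proceedings file is a list of record dicts, and that push_record is a hypothetical helper wrapping the Zenodo upload:

    # Find proceedings files changed in the last commit and push only those
    # records, rather than doing a full metadata sweep on every build.
    import json
    import subprocess

    def changed_proceedings(base="HEAD~1", head="HEAD"):
        # Ask git which files changed between the two commits.
        out = subprocess.check_output(
            ["git", "diff", "--name-only", base, head], text=True)
        return [p for p in out.splitlines()
                if p.startswith("database/proceedings/") and p.endswith(".json")]

    def push_deltas(push_record):
        # push_record is a hypothetical callable that uploads one record.
        for path in changed_proceedings():
            with open(path) as fh:
                for record in json.load(fh):
                    push_record(record)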

Add Program Committee

Carrying over an issue from 2019:
Add more meta-information, such as paper chairs and conference organizers.

Late Breaking Demo

We are currently not archiving LBDs.
Maybe this can stay a "manual process" that does not involve DBLP, but perhaps we still want those on Zenodo?

Publication identifier namespace and index

I had been hoping we could use DBLP's naming convention for "primary keys" over papers, but it looks like the convention may be difficult to follow independently. There are at least two side-effects that would be difficult to model:

  • keys that would otherwise collide are suffixed with alphabetic enumerators, e.g. a, b, c, ...
  • author names are derived from DBLP's global author namespace, so an ISMIR author may have a different representation as a result.

An obvious place for this to happen is at the point of submission for a given conference, e.g. event_id-submission_id, where the event ID is the year (for now).

Perhaps there would be value in producing (and maintaining) a table / CSV index of paper identifiers in the different namespaces? Some of these are (effectively) non-deterministic, so there doesn't really seem to be a good way around it. The three columns that seem important so far would be (a sketch of building such an index follows the list):

  • ISMIR: {event_id}-{submission_id}, e.g. 2017-103
  • Zenodo: integer "record ID", e.g. 1417159
  • DBLP: mix of authors + year, e.g. FonsecaPFFBFOPS17
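
A sketch of building such an index from the existing JSON database; the column names and the ismir_id / zenodo_id / dblp_key fields are assumptions about the record schema:

    import csv
    import json

    def write_index(proceedings_json, out_csv):
        # Dump one row per paper with its identifier in each namespace.
        with open(proceedings_json) as fh:
            records = json.load(fh)
        with open(out_csv, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["ismir_id", "zenodo_id", "dblp_key"])
            for rec in records:
                writer.writerow([
                    rec.get("ismir_id", ""),   # e.g. 2017-103
                    rec.get("zenodo_id", ""),  # e.g. 1417159
                    rec.get("dblp_key", ""),   # e.g. FonsecaPFFBFOPS17
                ])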

bulk archive collection assumes new records

The current uploader creates a new Zenodo ID for every record it uploads. On the first (ad-hoc) pass, something was skipping over previously uploaded records, but there doesn't seem to be any committed code that does this.

Instead, the uploader should check whether the entity already carries a Zenodo ID in the object and, if so, follow an update branch in the logic rather than creating a new record (sketched below).
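
A rough sketch of that branch, where create_record and update_record are hypothetical helpers wrapping the corresponding Zenodo API calls:

    def upload(record):
        # Follow the update path when a Zenodo ID already exists; only mint a
        # new deposition otherwise.
        zenodo_id = record.get("zenodo_id")
        if zenodo_id:
            return update_record(zenodo_id, record)
        record["zenodo_id"] = create_record(record)
        return record["zenodo_id"]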

Download 2018

Hey,

I was not able to download the PDFs for 2018; every attempt results in files of roughly 100 bytes.

python download_proceedings.py ../database/proceedings/2018.json ../database/pdfs

Can you reproduce this?
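
For debugging, a quick check of what the server actually returns (assuming the per-paper URL lives under an "ee" key in 2018.json and that the file is a list of records); a ~100-byte body usually means an error page or an unfollowed redirect:

    import json
    import requests

    with open("../database/proceedings/2018.json") as fh:
        records = json.load(fh)

    # Probe a handful of entries and report status, content type, and size.
    for rec in records[:5]:
        resp = requests.get(rec["ee"], allow_redirects=True)
        print(rec["ee"], resp.status_code,
              resp.headers.get("Content-Type"), len(resp.content))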

Archive 2020

Open Tasks:

  • Archive PDFs to ISMIR archive
  • Archive PDFs to Zenodo
  • Get JSON File
  • Trigger Website Build
  • Trigger DBLP Update
  • Archive website

@ejhumphrey, who on the current board has access to archives.ismir.net?

Zenodo uploader new features

When running the uploader for 2019, I ran into these issues:

  • If the ee field pointed to archives.ismir.net, the uploader didn't correctly download the files in order to upload them to Zenodo. I worked around this by putting a local path to the files in this field when running the uploader (a possible fix is sketched below).
  • The uploader replaces the ee field with a link to Zenodo; however, 1) this link doesn't actually seem to work, and 2) for consistency, perhaps we want to keep this pointing at archives.ismir.net?
  • If we have additional data (e.g., the "extra" key we added in 2019 for takeaway messages / external links), the updated data file that the uploader writes doesn't include these keys.
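
A possible fix for the first point, sketched: resolve the ee field to a local file, downloading it first when it is a remote URL (the tmp_dir default and the use of requests are assumptions):

    import os
    import requests

    def resolve_pdf(ee, tmp_dir="/tmp/ismir-pdfs"):
        # Use the file directly if ee is already a local path.
        if os.path.exists(ee):
            return ee
        # Otherwise fetch it so the uploader has something to send to Zenodo.
        os.makedirs(tmp_dir, exist_ok=True)
        local = os.path.join(tmp_dir, os.path.basename(ee))
        resp = requests.get(ee)
        resp.raise_for_status()
        with open(local, "wb") as fh:
            fh.write(resp.content)
        return local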

export_to_markdown

Hey,

Is there any reason this script uses parallel?
For consistency, I would stay with joblib throughout this repo.
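
For reference, the joblib pattern the rest of the repo could share, as a minimal sketch (render_markdown is a hypothetical per-record helper):

    from joblib import Parallel, delayed

    def export_all(records, n_jobs=-1):
        # Render each record to markdown in parallel, consistent with the
        # joblib usage elsewhere in the repo.
        return Parallel(n_jobs=n_jobs)(
            delayed(render_markdown)(rec) for rec in records)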

do authors prefer YAML to JSON?

In the future, it may be preferable to store all metadata as YAML, so that humans can more easily correct errors. This is important because the metadata maintained here is meant to act as the single source of truth, and Zenodo (and others) should inherit from it.

The question then is: is it actually painful for users to manually update JSON? Would YAML make this easier? Are comments (# i'm a comment) useful?
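
A minimal sketch of what the switch would involve, assuming PyYAML: round-trip an existing JSON database file into YAML so humans can edit (and comment) it directly.

    import json
    import yaml

    def json_to_yaml(json_path, yaml_path):
        # Convert one JSON database file into editable YAML, preserving key order.
        with open(json_path) as fh:
            data = json.load(fh)
        with open(yaml_path, "w") as fh:
            yaml.safe_dump(data, fh, allow_unicode=True, sort_keys=False)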

uploader idempotence

upload_to_zenodo doesn't actually close the loop on idempotence, i.e. it doesn't avoid repeating finished work. The steps to do this would be either:

  • have a separate merge operation for output files that joins on more complete records
  • basically the above, but internal to the upload script

The former is nice because it's generic; the latter is nice because the database files don't have an explicit last_updated timestamp. A sketch of the generic merge follows.
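
A sketch of the former option: a generic merge that joins two dumps of the database and keeps the more complete record for each key ("more complete" here just means "more non-empty fields", and the ismir_id key is an assumption):

    def merge_records(old, new, key="ismir_id"):
        def completeness(rec):
            # Count fields that actually carry a value.
            return sum(1 for v in rec.values() if v not in (None, "", []))
        merged = {rec[key]: rec for rec in old}
        for rec in new:
            existing = merged.get(rec[key])
            # Prefer whichever version of the record carries more information.
            if existing is None or completeness(rec) >= completeness(existing):
                merged[rec[key]] = rec
        return list(merged.values())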

Document the information graph

Part of what makes the archival process challenging is that the flow of information between the various stakeholders and responsible parties is at best implicit and not always transparent.

It would be useful / valuable to document this process and also diagram how all the pieces fit. Important information to call out will be:

  • who are the different agents involved, and what are they responsible for
  • what are the different data models that pass between nodes
  • what are the technologies in use, and who owns / can access them
  • which pieces are manual, which pieces are automated

Add front matter of conferences

Some people have expressed an interest in having the front matter of all conferences available too. If we can find or extract this, it would be great to add it to the archive as well.

refactor repository structure for file database

Currently: all articles live in a single proceedings JSON file, and conference metadata lives in a separate JSON file.

Future: each "event" gets its own folder containing one metadata JSON file and one publications JSON file. (Alternatively, there is no folder, just one JSON file with both publications and metadata under separate keys, but this feels slightly worse, somehow.) A possible layout is sketched below.
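
One possible layout under the per-event option (folder and file names are illustrative only):

    database/
      2018/
        metadata.json       # conference-level metadata
        proceedings.json    # one entry per paper
      2019/
        metadata.json
        proceedings.json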

Missing Papers

When downloading, these papers have no URLs.

Cope01: No URL available.
Downie01: No URL available.
Raskin01: No URL available.
Barlas02: No URL available.
Hofstadter02: No URL available.
Olson03: No URL available.
Pedersen03: No URL available.
WangFC08: No URL available.
Dubnov08: No URL available.
LamereP08: No URL available.
Selfridge-FieldS08: No URL available.

Maybe those are keynotes or similar; we should decide on a case-by-case basis whether to keep or remove them.

Better handling of differential uploads for PDFs

Currently, upload_to_zenodo will try to lob whatever the specified PDF is at Zenodo. Zenodo itself is idempotent, i.e. it won't change the upload if the MD5 checksum matches, but this (a) is slow, because each paper is 1-4 MB, roughly 100 MB per conference and over 1 GB in aggregate, and (b) requires that the PDFs be accessible locally, or suffers both a download and an upload if the electronic edition (ee) is a URL on the web.

There are a few ways around this (and maybe others):

  1. track MD5 checksums from Zenodo in the proceedings database
  2. before uploading, ask Zenodo what the latest MD5 checksum is
  3. toggle PDF uploading as a global arg

(3) is certainly the easiest to implement, but also the easiest to misuse and let drift. That said, perhaps we start there and revisit if it becomes problematic? A sketch of option (2) follows.
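
A sketch of option (2), assuming the Zenodo deposition files endpoint reports an MD5 checksum per file (the exact field format may need normalizing) and that the access token is handled elsewhere:

    import hashlib
    import os
    import requests

    def needs_upload(zenodo_id, pdf_path, token):
        # Compare the local PDF's MD5 against what Zenodo already holds.
        resp = requests.get(
            f"https://zenodo.org/api/deposit/depositions/{zenodo_id}/files",
            params={"access_token": token})
        resp.raise_for_status()
        remote = {f["filename"]: f["checksum"] for f in resp.json()}
        with open(pdf_path, "rb") as fh:
            local_md5 = hashlib.md5(fh.read()).hexdigest()
        return remote.get(os.path.basename(pdf_path)) != local_md5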
