
conference-archive's People

Contributors

ajaysmurthy, alastair, ecanoc, ejhumphrey, fzalkow, ismirweb, stefan-balke


conference-archive's Issues

continuous integration with Zenodo

The metadata in this repository should act as the Single Source of Truth for ISMIR proceedings. Downstream consumers, like Zenodo, do not actively pull from this collection; instead, this information must be pushed to them.

One way to do this would be for the uploader scripts to run on something like Travis-CI on successful pushes to master. Ideally, only the deltas would be pushed, though perhaps a full sweep of the metadata would occasionally be necessary; a minimal sketch of the delta step follows the open questions below.

There are a few open questions:

  • How long would it reasonably take for this to run? Zenodo does have rate limiting, and it's unclear whether, say, a 20-minute update on Travis is acceptable.
  • How does Travis handle API keys in a secure way? Does this require a subscription?
  • How often is this metadata going to change? If the community is going to be very attentive, this might be worth it; alternatively, writing all the scripts and leaving it as a monthly update on a private machine somewhere may be sufficient.
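
A minimal sketch of the delta step, assuming the records live under database/proceedings/, that each proceedings file is a list of record dicts, and that push_record is a hypothetical helper wrapping the Zenodo upload:

    # Find proceedings files changed in the last commit and push only those
    # records, rather than doing a full metadata sweep on every build.
    import json
    import subprocess

    def changed_proceedings(base="HEAD~1", head="HEAD"):
        # Ask git which files changed between the two commits.
        out = subprocess.check_output(
            ["git", "diff", "--name-only", base, head], text=True)
        return [p for p in out.splitlines()
                if p.startswith("database/proceedings/") and p.endswith(".json")]

    def push_deltas(push_record):
        # push_record is a hypothetical callable that uploads one record.
        for path in changed_proceedings():
            with open(path) as fh:
                for record in json.load(fh):
                    push_record(record)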

Add Program Committee

Carrying over an issue from 2019:
Add more meta-information, such as paper chairs and conference organizers.

Late Breaking Demo

We are currently not archiving LBDs.
Maybe this can stay a "manual process" that does not involve DBLP, but perhaps we still want those on Zenodo?

Publication identifier namespace and index

I had been hoping we could use DBLP's naming convention for "primary keys" over papers, but it looks like the convention may be difficult to follow independently. There are at least two side-effects that would be difficult to model:

  • keys that would otherwise collide are suffixed with alphabetic enumerators, e.g. a, b, c, ...
  • author names are derived from DBLP's global author namespace, so an ISMIR author may have a different representation as a result.

An obvious place for this to happen is at the point of submission for a given conference, e.g. event_id-submission_id, where the event ID is the year (for now).

Perhaps there would be value in producing (and maintaining) a table / CSV index of paper identifiers in the different namespaces? Some of these are (effectively) non-deterministic, so there doesn't really seem to be a good way around it. The three columns that seem important so far would be (a sketch of building such an index follows the list):

  • ISMIR: {event_id}-{submission_id}, e.g. 2017-103
  • Zenodo: integer "record ID", e.g. 1417159
  • DBLP: mix of authors + year, e.g. FonsecaPFFBFOPS17
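
A sketch of building such an index from the existing JSON database; the column names and the ismir_id / zenodo_id / dblp_key fields are assumptions about the record schema:

    import csv
    import json

    def write_index(proceedings_json, out_csv):
        # Dump one row per paper with its identifier in each namespace.
        with open(proceedings_json) as fh:
            records = json.load(fh)
        with open(out_csv, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["ismir_id", "zenodo_id", "dblp_key"])
            for rec in records:
                writer.writerow([
                    rec.get("ismir_id", ""),   # e.g. 2017-103
                    rec.get("zenodo_id", ""),  # e.g. 1417159
                    rec.get("dblp_key", ""),   # e.g. FonsecaPFFBFOPS17
                ])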

bulk archive collection assumes new records

The current uploader creates a new Zenodo ID for every record it uploads. On the first (ad-hoc) pass, something was skipping over previously uploaded records, but there doesn't seem to be any committed code that does this.

Instead, the uploader should check whether the entity already carries a Zenodo ID in the object and, if so, follow an update branch in the logic rather than creating a new record (sketched below).
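
A rough sketch of that branch, where create_record and update_record are hypothetical helpers wrapping the corresponding Zenodo API calls:

    def upload(record):
        # Follow the update path when a Zenodo ID already exists; only mint a
        # new deposition otherwise.
        zenodo_id = record.get("zenodo_id")
        if zenodo_id:
            return update_record(zenodo_id, record)
        record["zenodo_id"] = create_record(record)
        return record["zenodo_id"]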

Download 2018

Hey,

I was not able to download the PDFs for 2018; every attempt results in files of roughly 100 bytes.

python download_proceedings.py ../database/proceedings/2018.json ../database/pdfs

Can you reproduce this?
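
For debugging, a quick check of what the server actually returns (assuming the per-paper URL lives under an "ee" key in 2018.json and that the file is a list of records); a ~100-byte body usually means an error page or an unfollowed redirect:

    import json
    import requests

    with open("../database/proceedings/2018.json") as fh:
        records = json.load(fh)

    # Probe a handful of entries and report status, content type, and size.
    for rec in records[:5]:
        resp = requests.get(rec["ee"], allow_redirects=True)
        print(rec["ee"], resp.status_code,
              resp.headers.get("Content-Type"), len(resp.content))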

Archive 2020

Open Tasks:

  • Archive PDFs to ISMIR archive
  • Archive PDFs to Zenodo
  • Get JSON File
  • Trigger Website Build
  • Trigger DBLP Update
  • Archive website

@ejhumphrey, who on the current board has access to archives.ismir.net?

Zenodo uploader new features

When running the uploader for 2019, I ran into these issues:

  • If the ee field pointed to archives.ismir.net, the uploader didn't correctly download the files in order to upload them to Zenodo. I worked around this by putting a local path to the files in this field when running the uploader (a possible fix is sketched below).
  • The uploader replaces the ee field with a link to Zenodo; however, 1) this link doesn't actually seem to work, and 2) for consistency, perhaps we want to keep this pointing at archives.ismir.net?
  • If we have additional data (e.g., the "extra" key we added in 2019 for takeaway messages / external links), the updated data file that the uploader writes doesn't include these keys.
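
A possible fix for the first point, sketched: resolve the ee field to a local file, downloading it first when it is a remote URL (the tmp_dir default and the use of requests are assumptions):

    import os
    import requests

    def resolve_pdf(ee, tmp_dir="/tmp/ismir-pdfs"):
        # Use the file directly if ee is already a local path.
        if os.path.exists(ee):
            return ee
        # Otherwise fetch it so the uploader has something to send to Zenodo.
        os.makedirs(tmp_dir, exist_ok=True)
        local = os.path.join(tmp_dir, os.path.basename(ee))
        resp = requests.get(ee)
        resp.raise_for_status()
        with open(local, "wb") as fh:
            fh.write(resp.content)
        return local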

export_to_markdown

Hey,

Is there any reason this script uses parallel?
For consistency, I would stay with joblib throughout this repo.
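
For reference, the joblib pattern the rest of the repo could share, as a minimal sketch (render_markdown is a hypothetical per-record helper):

    from joblib import Parallel, delayed

    def export_all(records, n_jobs=-1):
        # Render each record to markdown in parallel, consistent with the
        # joblib usage elsewhere in the repo.
        return Parallel(n_jobs=n_jobs)(
            delayed(render_markdown)(rec) for rec in records)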

do authors prefer YAML to JSON?

In the future, it may be preferable to store all metadata as YAML, so that humans can more easily correct errors. This is important because the metadata maintained here is meant to act as the single source of truth, and Zenodo (and others) should inherit from it.

The question then is: is it actually painful for users to manually update JSON? Would YAML make this easier? Are comments (# i'm a comment) useful?
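
A minimal sketch of what the switch would involve, assuming PyYAML: round-trip an existing JSON database file into YAML so humans can edit (and comment) it directly.

    import json
    import yaml

    def json_to_yaml(json_path, yaml_path):
        # Convert one JSON database file into editable YAML, preserving key order.
        with open(json_path) as fh:
            data = json.load(fh)
        with open(yaml_path, "w") as fh:
            yaml.safe_dump(data, fh, allow_unicode=True, sort_keys=False)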

uploader idempotence

upload_to_zenodo doesn't actually close the loop on idempotence, i.e. it doesn't avoid repeating finished work. The steps to do this would be either:

  • have a separate merge operation for output files that joins on more complete records
  • basically the above, but internal to the upload script

The former is nice because it's generic; the latter is nice because the database files don't have an explicit last_updated timestamp. A sketch of the generic merge follows.
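
A sketch of the former option: a generic merge that joins two dumps of the database and keeps the more complete record for each key ("more complete" here just means "more non-empty fields", and the ismir_id key is an assumption):

    def merge_records(old, new, key="ismir_id"):
        def completeness(rec):
            # Count fields that actually carry a value.
            return sum(1 for v in rec.values() if v not in (None, "", []))
        merged = {rec[key]: rec for rec in old}
        for rec in new:
            existing = merged.get(rec[key])
            # Prefer whichever version of the record carries more information.
            if existing is None or completeness(rec) >= completeness(existing):
                merged[rec[key]] = rec
        return list(merged.values())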

Document the information graph

Part of what makes the archival process challenging is that the flow of information between the various stakeholders and responsible parties is at best implicit and not always transparent.

It would be useful / valuable to document this process and also diagram how all the pieces fit. Important information to call out will be:

  • who are the different agents involved, and what are they responsible for
  • what are the different data models that pass between nodes
  • what are the technologies in use, and who owns / can access them
  • which pieces are manual, which pieces are automated

Add front matter of conferences

Some people have expressed an interest in having the front matter of all conferences available too. If we can find or extract this, it would be great to add it to the archive as well.

refactor repository structure for file database

Currently: all articles live in a single proceedings JSON file, and conference metadata lives in a separate JSON file.

Future: each "event" gets its own folder containing one metadata JSON file and one publications JSON file. (Alternatively, there is no folder, just one JSON file with both publications and metadata under separate keys, but this feels slightly worse, somehow.) A possible layout is sketched below.
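
One possible layout under the per-event option (folder and file names are illustrative only):

    database/
      2018/
        metadata.json       # conference-level metadata
        proceedings.json    # one entry per paper
      2019/
        metadata.json
        proceedings.json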

Missing Papers

When downloading, these papers have no URLs.

Cope01: No URL available.
Downie01: No URL available.
Raskin01: No URL available.
Barlas02: No URL available.
Hofstadter02: No URL available.
Olson03: No URL available.
Pedersen03: No URL available.
WangFC08: No URL available.
Dubnov08: No URL available.
LamereP08: No URL available.
Selfridge-FieldS08: No URL available.

Maybe those are keynotes or similar; we should decide on a case-by-case basis whether to keep or remove them.

Better handling of differential uploads for PDFs

Currently, upload_to_zenodo will try to lob whatever the specified PDF is at Zenodo. Zenodo itself is idempotent, i.e. it won't change the upload if the MD5 checksum matches, but this (a) is slow, because each paper is 1-4 MB, roughly 100 MB per conference and over 1 GB in aggregate, and (b) requires that the PDFs be accessible locally, or suffers both a download and an upload if the electronic edition (ee) is a URL on the web.

There are a few ways around this (and maybe others):

  1. track MD5 checksums from Zenodo in the proceedings database
  2. before uploading, ask Zenodo what the latest MD5 checksum is
  3. toggle PDF uploading as a global arg

(3) is certainly the easiest to implement, but also the easiest to misuse and let drift. That said, perhaps we start there and revisit if it becomes problematic? A sketch of option (2) follows.
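
A sketch of option (2), assuming the Zenodo deposition files endpoint reports an MD5 checksum per file (the exact field format may need normalizing) and that the access token is handled elsewhere:

    import hashlib
    import os
    import requests

    def needs_upload(zenodo_id, pdf_path, token):
        # Compare the local PDF's MD5 against what Zenodo already holds.
        resp = requests.get(
            f"https://zenodo.org/api/deposit/depositions/{zenodo_id}/files",
            params={"access_token": token})
        resp.raise_for_status()
        remote = {f["filename"]: f["checksum"] for f in resp.json()}
        with open(pdf_path, "rb") as fh:
            local_md5 = hashlib.md5(fh.read()).hexdigest()
        return remote.get(os.path.basename(pdf_path)) != local_md5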
