ismir / conference-archive

Machinery for archiving conference proceedings and maintaining metadata.

License: MIT License
The metadata in this repository should act as the Single Source of Truth for ISMIR proceedings. Downstream consumers, like Zenodo, do not actively pull from this collection, and instead this information must be pushed.
One way to do this would be for the uploader scripts to run on something like Travis-CI on successful pushes to master. Ideally, only deltas would be pushed, though perhaps a full sweep of the metadata would occasionally be necessary.
There are a few open questions:
The uploader assumes that the pdf_url is an externally / web-hosted PDF. It should also accept file:// prefixes (or assume that non-http(s) prefixes are local) and handle them gracefully; see the sketch below.
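A minimal sketch of that behavior, assuming a requests-based downloader (the function and parameter names here are illustrative, not the script's actual ones):

```python
# Sketch only: treat non-http(s) schemes as local files rather than
# attempting a web download.
import shutil
from urllib.parse import urlparse

import requests

def fetch_pdf(pdf_url, dest):
    scheme = urlparse(pdf_url).scheme
    if scheme in ("http", "https"):
        resp = requests.get(pdf_url)
        resp.raise_for_status()
        with open(dest, "wb") as fp:
            fp.write(resp.content)
    else:
        # file://path or a bare local path: copy instead of downloading.
        local_path = pdf_url[len("file://"):] if pdf_url.startswith("file://") else pdf_url
        shutil.copy(local_path, dest)
    return dest
```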
An issue carried over from 2019:
Add more meta information such as paper chairs and conference organizers.
We are currently not archiving LBDs.
Maybe this can stay a "manual process" and not involve DBLP, but maybe we want those on Zenodo?
I had been hoping we could use DBLP's naming convention for "primary keys" over papers, but it looks like the convention may be difficult to follow independently. There are at least two side-effects that would be difficult to model:
An obvious place for this to happen is at the point of submission in a given conference, e.g. event_id-submission_id, where the event ID is the year (for now).
Perhaps there would be value in producing (and maintaining) a table / CSV index of paper identifiers in different namespaces? Some of these are (effectively) non-deterministic, so there doesn't really seem to be a good way around it. The three columns that seem important so far would be:
{event_id}-{submission_id}, e.g. 2017-103; presumably the DBLP-style key and the Zenodo record ID would round out the three.

The current uploader creates new Zenodo IDs for every record to be uploaded. On a first (ad-hoc) pass, something was skipping over previously uploaded records, but there doesn't seem to be any committed code that does this.
Instead, the uploader should check whether the entity already has a Zenodo ID in the object, and use that to follow an updating fork in the logic, as sketched below.
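A hedged sketch of that fork (the zenodo_id field name and the client's create/update methods are assumptions, not the repo's actual interface):

```python
# Sketch: route each record to create-vs-update based on a stored ID.
def upload_record(record, zenodo):
    if record.get("zenodo_id"):
        # Previously uploaded: update the existing deposition in place.
        zenodo.update(record["zenodo_id"], record)
    else:
        # New record: create it and write the ID back into the metadata,
        # so later runs take the update path instead of duplicating.
        record["zenodo_id"] = zenodo.create(record)
    return record
```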
Hey,
I was not able to download the PDFs for 2018; every attempt results in files of roughly 100 bytes.
```
python download_proceedings.py ../database/proceedings/2018.json ../database/pdfs
```
Can you reproduce this?
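One quick way to see what's in those tiny files (assuming they are HTML error pages saved with a .pdf extension) is to check for the PDF magic bytes:

```python
# Diagnostic sketch: flag downloaded files that don't start with %PDF-.
from pathlib import Path

for pdf in Path("../database/pdfs").glob("*.pdf"):
    head = pdf.read_bytes()[:5]
    if head != b"%PDF-":
        print(f"{pdf}: not a PDF ({head!r}), {pdf.stat().st_size} bytes")
```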
Zenodo uploader/updater needs to include the abstracts.
Open Tasks:
@ejhumphrey who has access to archives.ismir.net from the current board?
When I ran the uploader for 2019 I ran into these issues:
Hey, is there any reason this script uses parallel? For consistency, I would stay with joblib throughout this repo.
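For reference, a minimal joblib version of a parallel download loop (the fetch function and URL list are placeholders, not the script's actual names):

```python
from joblib import Parallel, delayed

def fetch(url):
    # Placeholder: download one PDF from `url`.
    pass

urls = ["https://example.com/a.pdf", "https://example.com/b.pdf"]
results = Parallel(n_jobs=4)(delayed(fetch)(url) for url in urls)
```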
In the future, it may be preferable to store all metadata as YAML, so that humans may more easily correct errors. This is important because the metadata maintained here is meant to act as the single source of truth, and Zenodo (and others) should inherit from it.
The question then is: is it actually painful for users to manually update JSON? Would YAML make this easier? Are comments (# i'm a comment) useful?
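For illustration, a hypothetical record in YAML (the field names are made up here, not the repo's actual schema) showing how inline comments could carry manual corrections:

```yaml
# Hypothetical record layout; field names are illustrative only.
- title: "An Example ISMIR Paper"
  year: 2017
  ee: "https://example.org/ismir2017/paper_103.pdf"
  authors:
    - "Last, First"  # name corrected by hand, see erratum
```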
For development purposes, we should have a dummy CSV file matching the headings produced by softconf.
upload_to_zenodo doesn't actually close the loop on idempotence (i.e., don't repeat finished work). The steps to do this would be either:
The former is nice because it's generic; the latter is nice because the database files don't have an explicit timestamp for last_updated.
Part of what makes the archival process challenging is that the flow of information from various stakeholders and responsible parties is at best implicit, and not always transparent.
It would be useful / valuable to document this process and also diagram how all the pieces fit. Important information to call out will be:
Some people have expressed an interest in having the front matter for all conferences available as well. If we can find/extract it, that would be great to add to the archive too.
Produce JSON with DBLP-like keys mapped to string abstracts.
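That is, something along these lines, using DBLP-style keys like those appearing elsewhere in this repo (the abstract text is placeholder):

```json
{
  "Cope01": "Placeholder abstract text for this paper...",
  "Downie01": "Placeholder abstract text for this paper..."
}
```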
Currently: all articles live in a single proceedings JSON file, and conference metadata lives in a separate JSON file.

Future: each "event" gets its own folder, with one metadata JSON file and one publications JSON file. (Alternatively, there is no folder, just one JSON file, and both publications and metadata live in it under separate keys, but this feels slightly worse, somehow.)
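A sketch of the proposed per-event layout (the file names are hypothetical):

```
database/
  2017/
    metadata.json      # conference-level metadata
    publications.json  # one record per paper
  2018/
    metadata.json
    publications.json
```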
When downloading, the following papers have no URLs:
Cope01: No URL available.
Downie01: No URL available.
Raskin01: No URL available.
Barlas02: No URL available.
Hofstadter02: No URL available.
Olson03: No URL available.
Pedersen03: No URL available.
WangFC08: No URL available.
Dubnov08: No URL available.
LamereP08: No URL available.
Selfridge-FieldS08: No URL available.
Maybe those are keynotes or similar. We should decide on a case-by-case basis whether to keep or remove them.
I am not sure if this is the right place to raise this issue.
I just became aware that the following PDF is broken at Zenodo.
Currently, upload_to_zenodo will try to lob whatever the specified PDF is at Zenodo. Zenodo itself is idempotent, i.e. it won't change the upload if the MD5 checksum matches, but this (a) is slow, because each paper is 1-4 MB, which is roughly 100 MB per conference and over 1 GB in aggregate, and (b) requires that the PDFs are accessible if local, or suffers both a download and an upload if the electronic edition (ee) is a URL on the web.
There are a few ways around this (and maybe others):
(3) is certainly the easiest to implement, but also the easiest to misuse / create drift. That said, perhaps we start there and revisit if it becomes problematic?
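As a starting point, a hedged sketch of checksum short-circuiting on our side, so unchanged PDFs are never re-sent (the get_remote_md5 helper is hypothetical, standing in for a query against Zenodo's file listing):

```python
# Sketch: skip the upload when the local MD5 matches what Zenodo
# already has on file for this record.
import hashlib

def local_md5(path):
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def needs_upload(path, get_remote_md5):
    remote = get_remote_md5(path)  # hypothetical helper
    return remote is None or remote != local_md5(path)
```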