datacite / content-resolver

Legacy DataCite content resolver

License: Apache License 2.0

Java 70.35% HTML 11.57% FreeMarker 7.57% Ruby 6.12% CSS 1.40% Dockerfile 3.00%

content-resolver's Introduction

DataCite

DataCite is a leading global non-profit organisation that provides persistent identifiers (DOIs) for research data. Our goal is to help the research community locate, identify, and cite research data with confidence.

About this repository

This is the generic DataCite repository for bugs, enhancements, and other issues. DataCite users can add their ideas through the DataCite Roadmap.

content-resolver's Issues

RDF misses several metadata attributes

The turtle and RDF/XML representations seem to be missing several of the metadata attributes, including some which would seem important for machine aggregation, such as rights, citation, isPartOf, and subjects.

Example is here:
http://data.datacite.org/10.5061/DRYAD.17VS2D34/1

It'd be great if this can be fixed.

display new elements/attributes from schema 3.0 especially geolocation

exception if relatedIdentifier element is empty

http://data.datacite.org/10.6096/HYMEX.AROME_WMED.2012.02.20

support multiple rights

with schema 3.0 multiple rights can be provided

respect titleTypes instead of always using the first title

citation mismatches DataCite Schema Documentation

Documentation:

Creator (PublicationYear): Title. Publisher. Identifier

DataCite Metadata Search Beta:

Creator; (PublicationYear): Title; Publisher. Identifier

Also the documentation describes (optional) display of Version and ResourceType.

RIS file format is documented to be encoded in windows-1252 but is generated as UTF-8

The RIS file format is officially documented to be (for historic reasons, unfortunately) in the windows-1252 character set. The content-resolver returns the text data in UTF-8. This makes it unusable for citations with umlauts if you try to import them into EndNote. The same issue exists for Crossref output.

For PANGAEA (where we use the Crossref and DataCite content negotiation to get full citation information), we use Apache Tika's character set auto-detection before parsing the RIS file, but that's not a good idea.

I just opened the issue to hopefully fix this (in communication with Crossref).
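
To illustrate the mismatch, here is a minimal Python sketch. It is not code from content-resolver; the function name `ris_bytes` is hypothetical, and replacing unencodable characters with "?" is only one possible policy.

```python
def ris_bytes(ris_text):
    """Encode RIS text as windows-1252 (Python codec name: cp1252)."""
    # errors="replace" turns characters that windows-1252 cannot
    # represent into "?" instead of raising an exception.
    return ris_text.encode("cp1252", errors="replace")


# What an importer expecting windows-1252 sees when it is handed UTF-8 bytes:
# the two UTF-8 bytes of the umlaut are read as two separate characters.
garbled = "Bräuning".encode("utf-8").decode("cp1252")
```

Here `garbled` holds the mis-decoded name, while `ris_bytes("AU  - Bräuning, Manuela")` yields bytes that a windows-1252 reader decodes correctly. Names containing characters outside windows-1252 would still be lossy under this policy.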

add nice 404 page

It might be irritating if a DOI is already registered but not (yet) available in search.

Correctly format publication date in citeproc JSON

Instead of

"issued": {
  "raw": 2016
}

use

"issued": {
  "date-parts": [
    [
      2016
    ]
  ]
}
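
A minimal Python sketch of the requested shape, assuming the publication year is available as a string or integer. The function name `issued_from_year` is hypothetical and not part of content-resolver.

```python
import json


def issued_from_year(publication_year):
    """Build a citeproc JSON "issued" object from a publication year."""
    # "date-parts" is a list of dates, each a list of [year, month, day];
    # only the year is known here, so the inner list has one element.
    return {"date-parts": [[int(publication_year)]]}


serialized = json.dumps({"issued": issued_from_year("2016")})
```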

Use one line per author in RIS output

For example

AU  - Bräuning, Manuela
AU  - Hellmuth, Markus
AU  - Tumbrink, Theodor

for https://doi.org/10.2314/GBV:731423534
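
As a sketch of the requested output, assuming the creator names are already available as a list of strings (the function name `ris_author_lines` is hypothetical):

```python
def ris_author_lines(authors):
    """Return one RIS "AU" line per author instead of one joined line."""
    # RIS lines are: two-letter tag, two spaces, hyphen, space, value.
    return ["AU  - " + name for name in authors]


lines = ris_author_lines(
    ["Bräuning, Manuela", "Hellmuth, Markus", "Tumbrink, Theodor"]
)
ris_fragment = "\n".join(lines)
```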

better rendering of nameIdentifier and other fields

Currently, e.g. for contributors, the whole text content of the <contributor> element is displayed, so nameIdentifier is just concatenated to contributorName. The nameIdentifier should also be actionable or at least show schemeURI and/or nameIdentifierScheme.

support rightsURI

new in schema 3.0. maybe we should link the rights value if there is a URI given.

Don't use application/ld+json mime type

Using this content type to return metadata in schema.org/JSON-LD format breaks support for custom application/ld+json media registered with a DOI. See codemeta/codemeta#125 for background.

We should instead use a more specific mime type. This needs more discussion, but for now, we can use application/vnd.schemaorg.ld+json.

wrong author separation in BibTeX format

it's currently a semi-colon. "and" would be correct.
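
A short Python sketch of the expected field, assuming the names are available as a list of strings (the function name `bibtex_author_field` is hypothetical):

```python
def bibtex_author_field(authors):
    """Format a BibTeX author field with names separated by " and "."""
    # BibTeX splits the author list on the word "and", not on semicolons.
    return "author = {" + " and ".join(authors) + "}"


field = bibtex_author_field(["Vision, Todd", "Rueda, Laura"])
```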

BibTeX output format does not escape characters according to LaTeX

LaTeX treats many common characters as special, so importing the BibTeX files from content-resolver in most cases leads to LaTeX errors. The rules for escaping characters are very complicated.

At PANGAEA we have an escaper class for BibTeX that correctly escapes most Western characters when they are exported as LaTeX text (as used by BibTeX). We can provide this Java code here; it should be available to everyone. It's mainly a POJO with a static method that takes a String and returns it as escaped LaTeX code.
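
To show the general idea, here is a minimal Python sketch. It is not the PANGAEA escaper, which is not shown in this issue; the names `LATEX_SPECIALS`, `LATEX_ACCENTS` and `escape_latex` are hypothetical, and the accent table covers only a few German letters as examples.

```python
# Characters that LaTeX treats as special, each with one common escaped form.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}

# A few accented letters as examples; a real escaper needs a far larger table.
LATEX_ACCENTS = {
    "ä": r'{\"a}', "ö": r'{\"o}', "ü": r'{\"u}',
    "Ä": r'{\"A}', "Ö": r'{\"O}', "Ü": r'{\"U}',
    "ß": r"{\ss}",
}


def escape_latex(text):
    """Escape text for use inside a BibTeX field value."""
    table = {**LATEX_SPECIALS, **LATEX_ACCENTS}
    # Translate character by character so that the backslashes and braces
    # produced for one character are never escaped a second time.
    return "".join(table.get(ch, ch) for ch in text)
```

Translating one character at a time avoids the double-escaping that chained string replacements would cause.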

bad pages for test prefix

Due to the template handle on the test prefix 10.5072, a page is rendered for all possible DOI suffixes:

http://data.datacite.org/10.5072/foobardfgdgksdjth

labels too long

Labels that are too long need to be wrapped so that they do not mask the metadata.

Show nameIdentifier scheme in HTML version

Working on datacite/search#148.

Add Google Analytics token via configuration

Provide JSON-LD content type

Using the JSON-LD representation of schema.org metadata.

Why are DOIs converted to uppercase?

I see that many DataCite services seem to provide DOIs in ALL CAPS. For example, 10.5281/ZENODO.48810 rather than 10.5281/zenodo.48810. I didn't know until now that DOI resolution was case agnostic.

However, I view it as extremely undesirable to transform DOIs from their registered case. Basically, this defies the uniqueness tenet of the DOI system. In other words, I can no longer use DOIs as a primary key for a resource without converting to lowercase myself?
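
For readers who hit the same problem, a minimal Python sketch of the client-side workaround mentioned above. It assumes, as stated in this issue, that DOI resolution is case-insensitive; folding only ASCII letters is a deliberate choice of this sketch, and the function name `doi_key` is hypothetical.

```python
import string

# Fold only ASCII letters; other characters are left untouched.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def doi_key(doi):
    """Return a case-folded form of a DOI for use as a lookup key."""
    return doi.translate(_ASCII_LOWER)


same = doi_key("10.5281/ZENODO.48810") == doi_key("10.5281/zenodo.48810")
```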

correctly format personal names in citeproc JSON

Split given and family name, e.g.

"author": [{
    "family": "Vision",
    "given": "Todd"
}, {
    "family": "Rueda",
    "given": "Laura"
}, {
    "family": "Dasler",
    "given": "Robin"
}, {
    "family": "Haak",
    "given": "Laure"
}, {
    "family": "Cruse",
    "given": "Patricia"
}, {
    "literal": "THOR Consortium"
}]
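
A minimal Python sketch of one way to produce this structure. It assumes creator names arrive as "Family, Given" strings and treats any name without a comma as a literal (for example an organisation); that heuristic and the function name `csl_name` are assumptions, not content-resolver behaviour.

```python
def csl_name(creator_name):
    """Convert a creator name string into a citeproc JSON name object."""
    if "," in creator_name:
        # Split on the first comma only, so suffixes after a second
        # comma stay with the given name.
        family, given = (part.strip() for part in creator_name.split(",", 1))
        return {"family": family, "given": given}
    # No comma: keep the name as-is rather than guessing a split.
    return {"literal": creator_name}


authors = [csl_name(n) for n in ["Vision, Todd", "THOR Consortium"]]
```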

Pass-through unknown content types for downstream content negotiation

Instead of returning a 406 error for unknown content types, content negotiation should forward to the URL registered in the handle system, enabling content negotiation at that URL.

See codemeta/codemeta#125 for more details.
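
The proposed behaviour could look roughly like the following Python sketch. The set of supported types is only an illustrative subset taken from other issues on this page, the 303 status code is an assumption (the issue does not name one), and `negotiate` is a hypothetical function name.

```python
# Illustrative subset only; not the resolver's full list of formats.
SUPPORTED_TYPES = {
    "application/x-research-info-systems",
    "text/x-bibliography",
    "text/turtle",
}


def negotiate(accept, registered_url):
    """Return (status, location) for a single Accept media type."""
    # Ignore parameters such as "; style=harvard3" when matching the type.
    media_type = accept.split(";", 1)[0].strip().lower()
    if media_type in SUPPORTED_TYPES:
        return (200, None)  # render the metadata locally
    # Unknown type: forward to the registered URL instead of answering 406,
    # so that the target can do its own content negotiation.
    return (303, registered_url)
```

A real implementation would also need to handle Accept headers that list several types with quality values; this sketch looks at one type only.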

superfluous semicolon in citation after author list

some unicode chars not shown correctly

It renders Tadi? instead of Tadić as in search

[Bug-report]: For RIS-citations the charset= is doubled in the HTTP response header:

Dear DataCite,

[Feature request:] Would you consider using UTF-8 as the standard character set in all citation services? Or, at least, support Accept-Charset / Accept with charset as shown below.

[Bug-report]: For RIS citations, charset= is doubled in the HTTP response header:

$ curl -v http://data.datacite.org/application/x-research-info-systems/10.21334/npolar.2016.3d72756d -H "Accept: application/x-research-info-systems;charset=utf-8" -H "Accept-Charset:utf-8"

< HTTP/1.1 200 OK
< Content-Type: application/x-research-info-systems; charset=charset=windows-1252
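
The cause of the doubled parameter is not known from this report. One plausible cause is a configured value that already contains the "charset=" prefix; under that assumption, a defensive header builder could look like this Python sketch (the function name `content_type_header` is hypothetical):

```python
def content_type_header(media_type, charset):
    """Build a Content-Type value with exactly one charset parameter."""
    prefix = "charset="
    # Assumption: the value passed in may already carry the prefix,
    # possibly more than once, so strip it before adding it back.
    while charset.lower().startswith(prefix):
        charset = charset[len(prefix):]
    return f"{media_type}; charset={charset}"


header = content_type_header(
    "application/x-research-info-systems", "charset=windows-1252"
)
```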

Front conversations

license -> rights

to be consistent with the DataCite Metadata schema

Text formats returned as UTF-8 without indicating a charset

I found this with both text/x-bibliography and text/turtle.

I'm using python-requests; I had trouble making the problem clear using other tools:

> r = requests.get('http://data.datacite.org/10.2312%2FGFZ.syserde.03.01.9', 
         headers={'accept':'text/x-bibliography; style=harvard3'})

No encoding is returned in the response, so according to the HTTP spec the response must be encoded as ISO-8859-1, and that is how it is (incorrectly) decoded.

> r.headers['content-type']
'text/x-bibliography'
> r.encoding
'ISO-8859-1'
> r.text 
u'Cacace,\xc2\xa0Mauro, Scheck-Wenderoth,\xc2\xa0Magdalena, 
Cherubini,\xc2\xa0Yvonne, and Przybycin,\xc2\xa0Anna Maria 2013,
\xe2\x80\x9cBeckenmodellierung: Temperatur in Sedimentbecken,\xe2\x80\x9d
Deutsches GeoForschungsZentrum GFZ, viewed 
<http://dx.doi.org/10.2312/GFZ.syserde.03.01.9>.\n'

Here, \xc2\xa0, \xe2\x80\x9c and \xe2\x80\x9d are all multi-byte UTF-8 characters (nbsp and left/right smartquotes) that have been decoded as multiple single-byte characters.

The correct output can be generated by explicitly decoding as UTF-8, but this is against spec and would require handling this particular service as a special case:

> r.content
'Cacace,\xc2\xa0Mauro, Scheck-Wenderoth,\xc2\xa0Magdalena, 
Cherubini,\xc2\xa0Yvonne, and Przybycin,\xc2\xa0Anna Maria 2013, 
\xe2\x80\x9cBeckenmodellierung: Temperatur in Sedimentbecken,\xe2\x80\x9d 
Deutsches GeoForschungsZentrum GFZ, viewed 
<http://dx.doi.org/10.2312/GFZ.syserde.03.01.9>.\n'
> r.content.decode('UTF8')
u'Cacace,\xa0Mauro, Scheck-Wenderoth,\xa0Magdalena, 
Cherubini,\xa0Yvonne, and Przybycin,\xa0Anna Maria 2013, 
\u201cBeckenmodellierung: Temperatur in Sedimentbecken,\u201d 
Deutsches GeoForschungsZentrum GFZ, viewed 
<http://dx.doi.org/10.2312/GFZ.syserde.03.01.9>.\n'
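
The special-case handling described above could be wrapped as in the following sketch, written for Python 3 (the session above is Python 2). The function name `decode_body` and the choice of UTF-8 as the fallback are assumptions made to suit this service, not general HTTP behaviour.

```python
def decode_body(body, content_type, default="utf-8"):
    """Decode response bytes, honouring an explicit charset if present."""
    # Look through the parameters after the media type for "charset=...".
    for param in content_type.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset" and value:
            return body.decode(value.strip('"'))
    # No charset declared: fall back to the assumed default.
    return body.decode(default)


raw = "Cacace,\u00a0Mauro".encode("utf-8")
wrong = raw.decode("iso-8859-1")                # reproduces the garbling
right = decode_body(raw, "text/x-bibliography")  # assumes UTF-8
```

With python-requests, the same effect can be had by setting `r.encoding = 'utf-8'` before reading `r.text`, but only when the Content-Type header carries no charset.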