gbv / beaconspec

BEACON link dump format specification

Home Page: http://gbv.github.com/beaconspec/

beaconspec's Issues

Language tags in meta fields

Language tags could be included as done in aREF by appending @ followed by a language tag. The values of the following fields may contain a language tag:

description
name
institution (*)

The institution field requires special treatment because it may contain a URL as well.
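
A minimal sketch of how a consumer might split such a value, assuming the tag is simply the last "@" followed by something shaped like a language tag. The helper name `split_language_tag` and the simplified tag pattern are assumptions for illustration, not part of the proposal:

```python
import re

# Simplified language tag shape (e.g. "de", "en-GB"); an assumption, not full BCP 47.
LANG_TAG = re.compile(
    r"^(?P<value>.*)@(?P<lang>[A-Za-z]{2,8}(?:-[A-Za-z0-9]{1,8})*)$",
    re.DOTALL,
)

def split_language_tag(raw):
    """Return (value, language) for 'value@lang', or (value, None) if untagged."""
    text = raw.strip()
    match = LANG_TAG.match(text)
    if match and match.group("value"):
        return match.group("value").rstrip(), match.group("lang").lower()
    return text, None

print(split_language_tag("Gemeinsame Normdatei@de"))
print(split_language_tag("ACME documents"))
```

A value such as "mail@example.org" is left untagged by this pattern because of the dot, but an institution field holding a URL would still need the special treatment mentioned above.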

Make SOURCESET/TARGETSET repeatable?

For instance for the GND database:

#SOURCESET: http://d-nb.info/gnd/7749153-1
#SOURCESET: http://viaf.org/viaf/188136221

Introductory example

An easy example would spare readers from having to dig into the full specification.

Provide online playground and validator

Please specify preferred file name ending in the spec

Seeing that the mime type for (non-xml) beacon is (currently) text/plain, I assume that the preferred file name ending is .txt, but it would help to have that put down explicitly in the spec, particularly if you are going for text/beacon instead.

Specify meta fields

A review of currently used meta fields from BEACON files at http://de.wikipedia.org/wiki/Wikipedia:BEACON shows the following distribution:

    125 #TARGET
    118 #INSTITUTION
    112 #FEED
    100 #CONTACT
     94 #DESCRIPTION
     92 #MESSAGE
     85 #VERSION
     78 #PREFIX
     77 #FORMAT
     65 #NAME
     63 #TIMESTAMP
     30 #REVISIT
     12 #DATE
     10 #EXAMPLES
      8 #ISIL
      5 #UPDATE
      4 #COUNT
      2 #SOMEMESSAGE
      1 #SOURCE
      1 #RESULTS
      1 #Langtext a {color
      1 #IMGTARGET
      1 #ALTTARGET

Which fields should become part of the final specification?

PREFIX and TARGET are essential to abbreviate links.
FORMAT and VERSION should be dropped
TIMESTAMP should be used instead of DATE
COUNT and EXAMPLES are not that important: drop them
REVISIT is used (with different format), but I'd prefer the UPDATE field with controlled terms as used in sitemaps.xml standard instead
FEED is useful
ISIL and SOMEMESSAGE should be dropped because less used
ALTTARGET, IMGTARGET, Langtext, RESULTS, and SOURCE are not standard
What about: INSTITUTION, CONTACT, DESCRIPTION, MESSAGE, NAME?

So finally there is:

PREFIX and TARGET to abbreviate links
FEED, TIMESTAMP, and UPDATE to describe dump URL, time and update frequency
CONTACT
These need clarification: INSTITUTION, DESCRIPTION, MESSAGE, NAME. Can we drop one of them?

Link fields

Given that the (source identifier or URI) id field and the optional target (link, URI, identifier, search term, label ...) field will transport quite a broad spectrum of character data (but maybe no spaces) I already see difficulties for a third optional label or description field in the text serialization.

As for content I doubt that there will be any clear distinction between "label" and "description" data. The known use cases contain - if any - partly the preferred name of the object in the target application (use case: biography) and partly some indication of the number of distinct objects at the target (use case: bibliography) and as a matter of fact never both.

Agreed that a consumer of a beacon file wants to construct hyperlinks from the data received, there certainly is a need to label the link and to provide context information by means of tooltips or enriching the label or anything it considers suitable. For this any header fields can be exploited at the discretion of the consumer (which has to take into account that all of them are optional) but the creator of the beacon file may provide #LABEL and #INFO (?, #TOOLTIP?, #CONTEXT??, ...) header fields which by virtue of the templating mechanism may refer to any link field: {srcid}, {trgid}, {label}.

However http://beacon.findbuch.de/portraits/ps_usbk with the formats "seealso" and "seealso-imglink" demonstrates that in the presence of graphical resources "label" and "description" sometimes are not enough elements to produce appealing links and it would be nice to have even more template-aware header fields like #THUMBNAIL or #PREVIEW or #LOGO to transport the construction rules for additional texts or URLs. Trying to standardize these would be completely out of scope of a specification, but it is already stated that extra header fields are not in violation of the standard and the (editorial) #REMARK field could try to explain the purpose of the extra fields or alternatively link to a documentation page.

Support link-specific relations

#FORMAT: BEACON
#PREFIX: http://www.wikidata.org/entity/
#TARGET: http://purl.org/spar/cito/
#RELATION: http://www.w3.org/2004/02/skos/core#{+ID}

P2860|exactMatch|cites

And N-Triples like

#RELATION: {+ID}

http://www.wikidata.org/entity/P2860|http://www.w3.org/2004/02/skos/core#exactMatch|http://purl.org/spar/cito/cites
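
Both variants above should yield the same triple. A rough sketch of the expansion, assuming that the middle field is the relation token, that a template without placeholder gets the token appended, and that missing templates default to {+ID}. The function names are hypothetical:

```python
from urllib.parse import quote

def fill(template, token):
    """Insert a token into a BEACON-style template (simplified)."""
    if "{+ID}" in template:
        return template.replace("{+ID}", token)
    if "{ID}" in template:
        return template.replace("{ID}", quote(token, safe=""))
    return template + token  # assumption: no placeholder means "append the token"

def expand_triple(line, prefix="{+ID}", target="{+ID}", relation="{+ID}"):
    """Expand 'source|relation|target' into a (subject, predicate, object) tuple."""
    source_token, relation_token, target_token = line.split("|")
    return (
        fill(prefix, source_token),
        fill(relation, relation_token),
        fill(target, target_token),
    )

abbreviated = expand_triple(
    "P2860|exactMatch|cites",
    prefix="http://www.wikidata.org/entity/",
    target="http://purl.org/spar/cito/",
    relation="http://www.w3.org/2004/02/skos/core#{+ID}",
)
print(abbreviated)
```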

Clarify repeatability of meta fields

Repeated meta fields are no syntax error but SHOULD result in a warning. Applications may choose which value to pick but they MUST pick only one value.
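
A small sketch of that behaviour. Keeping the first value is only one possible choice (the rule leaves it to the application), and the function name `read_meta_fields` is made up for illustration:

```python
def read_meta_fields(lines):
    """Collect meta fields, keeping the first value and warning on repeats."""
    meta = {}
    warnings = []
    for number, line in enumerate(lines, start=1):
        if not line.startswith("#"):
            break  # meta fields end at the first non-meta line
        name, _, value = line[1:].partition(":")
        name, value = name.strip().upper(), value.strip()
        if name in meta:
            # Not a syntax error: warn and keep exactly one value.
            warnings.append(f"line {number}: repeated meta field #{name} ignored")
            continue
        meta[name] = value
    return meta, warnings

meta, warnings = read_meta_fields([
    "#NAME: First",
    "#PREFIX: http://example.org/",
    "#NAME: Second",
    "",
    "abc",
])
print(meta, warnings)
```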

Symmetry

As the introduction states nicely, Beacon files denote a mapping function: From "Identifiers" or URIs to URLs, URIs and maybe "Identifiers".

To stress the symmetry the template fields PREFIX and TARGET should be named SOURCE and TARGET and obey the same syntax rules.

To make things like "{ID}" clearer, the placeholders should be named "SRCID" and "TRGID" (both may occur in targets, there are even examples for mixtures and repetitions [and alas also target placeholders split in two parts with a repetitive element between them]). However TRGID should be forbidden for the SOURCE template.

Rename to BEACON file

BEACON was first, but http://www.w3.org/TR/beacon/ is stronger.

Remove everything related to RDF

This request was raised (in a similar form) by @MathiasSchindler.

Make #FORMAT a mandatory meta field

Every valid BEACON file should contain a meta field explaining its own format.

As suggested by Thomas Berger, previously used #VERSION and #FORMAT should be merged.

As of 2014, the only valid entry should be

FORMAT BEACON 1.0

Future Versions should be constructed accordingly.

FORMAT should be a mandatory meta field.

VERSION should be invalid.

BEACON text link format

Reviewing existing BEACON files with more than a single id field, there are several cases, for instance:

Number of "hits" as second field:

For instance http://dingler.culture.hu-berlin.de/beacon and http://beacon.findbuch.de/downloads/pw/pw_imslp-pndbeacon.txt:

116137592|138
100001718|2

Label and target URI:

For instance http://www.zisterzienserlexikon.de/beacon/beacon.txt and http://beacon.findbuch.de/downloads/ps_usbk/DE-38-USB_Koeln-Portraitsammlung-portraitierte-beacon.txt and http://www.andreas-praefcke.de/temp/BEACON-PND-ADS.txt:

139788824|Kurz, Matthäus|http://www.zisterzienserlexikon.de/wiki/Kurz,_Matth%C3%A4us
116647868|Kobolt, F. W.: Kupferstich, 1795|http://kug.ub.uni-koeln.de/portal/connector/permalink/portrait/1/1/index.html
101148739X|Dones Elvira (Namensdatensatz)|http://lexikon.a-d-s.ch/edit/detail_a.php?id_autor=43

target URI:

For instance in http://www.andreas-praefcke.de/temp/BEACON-PND-BBC-Paintings.txt and http://www.andreas-praefcke.de/temp/BEACON-PND-GW.txt:

102436606|http://www.bbc.co.uk/arts/yourpaintings/artists/thomas-bardwell
100001467|http://gesamtkatalogderwiegendrucke.de/docs/ANDRANT.htm

number of hits and label:

For instance http://www.historische-kommission-muenchen-editionen.de/beacon_adr.txt:

116001038|1|Abegg, Waldemar

What should the final specification be? The last case only occurs once so it could be marked as deprecated.
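
To make the four cases concrete, here is a rough heuristic a reviewer could use to sort legacy lines into them. The category names and the "starts with http" test are assumptions for illustration only:

```python
def classify_link(line):
    """Guess which legacy BEACON link form a line uses (heuristic sketch)."""
    fields = line.rstrip("\n").split("|")

    def is_uri(value):
        return value.startswith(("http://", "https://"))

    if len(fields) == 1:
        return "id only"
    if len(fields) == 2:
        if fields[1].isdigit():
            return "hits"
        if is_uri(fields[1]):
            return "target"
        return "label"
    if len(fields) == 3:
        if is_uri(fields[2]):
            return "label+target"
        if fields[1].isdigit():
            return "hits+label"
    return "unknown"

for sample in [
    "116137592|138",
    "139788824|Kurz, Matthäus|http://www.zisterzienserlexikon.de/wiki/Kurz,_Matth%C3%A4us",
    "102436606|http://www.bbc.co.uk/arts/yourpaintings/artists/thomas-bardwell",
    "116001038|1|Abegg, Waldemar",
]:
    print(classify_link(sample))
```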

Unicode Normalization

Clarify whether/where Unicode Normalization MUST/SHOULD be done.
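
Why this matters, as a short illustration: the same visible name percent-encodes differently depending on the normalization form, so two BEACON files could produce different URIs for the same token. The helper name is made up; the issue does not say which form should be required:

```python
import unicodedata
from urllib.parse import quote

def encoded_forms(text):
    """Return the percent-encoded NFC and NFD variants of a token."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return quote(nfc), quote(nfd)

print(encoded_forms("M\u00fcller"))
```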

How to deal with non-URI identifiers

As raised in another issue there may be non-URI identifiers. I neither see the need to support these kind of identifiers nor know how to support them, but maybe there is a convincing use case and a simple solution.

Suggestion and question concerning 3. Meta fields: case sensitivity

Text states: "A BEACON dump MAY be annotated with a set of meta fields. Each meta field is identified by its name, build of lowercase letters a-z."

This is a little bit confusing as meta fields in the text serialization are upper case.

Suggestion: also lower case meta fields in text serialization.

Question: Are metafields in the text serialization supposed to be case sensitive?

4.1.4. RELATION

4.1.4. RELATION should be removed, it is irrelevant to the BEACON purpose and only meant to introduce semantic web concepts into BEACON.

Recommendations for the target links?

Does a recommendation exist for the link targets in the BEACON file?

I am asking because we can provide dedicated links that point to a GUI representation of a resource or its representation in RDF/XML (or turtle etc.) and from the specification and the various examples I found, I understand that only one link is provided per GND ID.

Remove #FORMAT

In http://gbv.github.io/beaconspec/beacon.html#rfc.section.4.1 remove

The BEACON text file SHOULD start with a fixed meta field:
START = "#FORMAT:" +WHITESPACE "BEACON" *WHITESPACE LINEBREAK

Collect use cases and implementations

BEACON files used in/by Wikipedia
BEACON exported by Wikidata tool
Java implementation https://github.com/thunken/beacon
wdmappings/wdtaxonomy
https://beacon.findbuch.de/
Link shorteners archived by http://urlte.am/
- see https://github.com/ArchiveTeam/urlteam-stuff/blob/master/tools/mkbeacon.pl
- format is "Shortcode, pipe (Ascii 0x7C), long URL, line feed"

Reference implementation

A new implementation to be published as node-module so it can be used for instance as data source of Linked Data Fragments Server.

Remove #MESSAGE?

Construction of annotation via message field may be too complicated. By removal of #MESSAGE, however, the link construction rule of BEACON text might need to get adjusted. In particular, without message field, this:

123|456

is always a link with source token "123" and annotation "456". The second value "456" cannot be used as target token to construct a target URI.

For target tokens one always needs to use two bars:

#PREFIX: http://www.wikidata.org/entity/
#TARGET: http://www.librarything.com/work/
#RELATION: http://www.w3.org/2002/07/owl#sameAs

Q721||3383

Only if #TARGET is default ({+ID}), the second token is taken as target if it begins with http: or https::

#PREFIX: http://www.wikidata.org/entity/
#RELATION: http://www.w3.org/2002/07/owl#sameAs

Q721|http://www.librarything.com/work/3383
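
The adjusted rule could look roughly like this. This is only a sketch: it assumes that a missing target token falls back to the source token, that templates without placeholder get the token appended, and the function names are hypothetical:

```python
from urllib.parse import quote

DEFAULT_TEMPLATE = "{+ID}"

def fill(template, token):
    """Insert a token into a template (simplified placeholder handling)."""
    if "{+ID}" in template:
        return template.replace("{+ID}", token)
    if "{ID}" in template:
        return template.replace("{ID}", quote(token, safe=""))
    return template + token  # assumption: no placeholder means "append"

def expand_link(line, prefix=DEFAULT_TEMPLATE, target=DEFAULT_TEMPLATE):
    """Return (source URI, annotation, target URI) for one link line."""
    fields = line.split("|")
    source_token = fields[0]
    annotation = ""
    target_token = source_token  # assumed default when no target token is given
    if len(fields) >= 3:
        annotation = fields[1]
        target_token = fields[2] or source_token
    elif len(fields) == 2:
        second = fields[1]
        # Only with the default #TARGET is an http(s) value taken as target.
        if target == DEFAULT_TEMPLATE and second.startswith(("http:", "https:")):
            target_token = second
        else:
            annotation = second
    return fill(prefix, source_token), annotation, fill(target, target_token)

print(expand_link("Q721||3383",
                  prefix="http://www.wikidata.org/entity/",
                  target="http://www.librarything.com/work/"))
```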

Rename #INSTITUTION to #CREATOR

The meaning of the institution meta field should be clarified by renaming it to "creator" and mapping it to http://purl.org/dc/terms/creator ("the person, organization, or a service primarily responsible for making the BEACON dump").

Mime types

The current MIME type for Beacon text is "text/plain". Should one specify for instance "text/beacon"?

Remove colon in meta fields

e.g.

#NAME ACME documents

instead of

#NAME: ACME documents

The corresponding rule in BEACON text format could be changed to

METALINE    =  "#" METAFIELD ( ":" / WSP ) METAVALUE LINEBREAK

to support colon, space and tabulator.
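
For illustration, the relaxed rule translates to a regular expression along these lines. Treating METAFIELD as ASCII letters only and trimming whitespace around the value are assumptions, as is the function name:

```python
import re

# "#" METAFIELD ( ":" / WSP ) METAVALUE, with METAFIELD assumed to be ASCII letters.
METALINE = re.compile(r"^#(?P<field>[A-Za-z]+)(?::|[ \t])[ \t]*(?P<value>.*)$")

def parse_metaline(line):
    """Return (field, value) for a meta line, or None if the line does not match."""
    match = METALINE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    return match.group("field").upper(), match.group("value").strip()

print(parse_metaline("#NAME ACME documents"))
print(parse_metaline("#NAME: ACME documents"))
```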

"link construction" => "link expansion"

Add back meta field COUNT?

COUNT was used in the early days. Linked Data Fragments requires:

Each Triple Pattern Fragment, and each page of a Triple Pattern Fragment, MUST contain the estimated total number of triples that match the fragment's selector.

On the other hand the number can be computed by counting.

update field

It is very frequent that beacon files do not change for years but then have to, because the addressing scheme of the target site changes. Therefore values like "infrequently" or "on demand" are needed.

I very much liked the #REVISIT field of the 0.2 spec which gave the opportunity to give a realistic estimate of the next "release" of the file and was easier to parse than the only intentional #UPDATE field.

Suggest file extension

.beacon or .txt? See also #24 - a MIME type registration often includes a suggested file extension.

I'm out of it

This business has turned too nonsensical for me and I do not want to be associated with that any more. To my impression there is no attempt made to solve serious issues ("source" still abundant) and instead progress is made towards directions away from current practical use of BEACON files.

Sorry, but please could you remove me from the list of authors ASAP?
Thomas Berger

Rename #QUALIFIER to #ANNOTATION

I'll rename the "qualifier" meta field to "annotation"; this should be clearer, right?

#REVISIT meta field

In issue #3 gymel wrote "I very much liked the #REVISIT field of the 0.2 spec which gave the opportunity to give a realistic estimate of the next "release" of the file and was easier to parse than the only intentional #UPDATE field." So should the #REVISIT field be included in addition to the #UPDATE field?

Syntax of email address in #CONTACT

As a BEACON file is a simple text file and most BEACON files are on the open web, anyone can crawl them.
To avoid being spammed, one could deliberately provide a syntactically wrong email address, e.g. "foobar.net". Such an address can be easily parsed by humans and not quite so easily by bots.
Thus, I would like to enhance the BEACON spec so that syntactically incorrect email addresses are allowed.

You may argue against my proposal: "that should be the work of anti-spam filters", but why put load on spam filters when there is a convenient way to avoid spam altogether? Since I don't see any necessity at all for machines to be able to contact me, I would be fine with the address being only human-readable.

section link fields

Would it make sense to give the table a header row?
Rows such as "target + target --> target" might be confusing.

Or is "meta field + link field --> link element" supposed to be the header row? In that case underlining it might make things clearer.

Mandatory Metafields

I would be much in favor of defining a set of mandatory Metafields.

Suggestion: Changing "A BEACON dump MAY be annotated with a set of meta fields." to "A BEACON dump MUST be annotated with the following mandatory meta fields (...) and MAY be additionally annotated with ..."

Humble suggestion for mandatory fields: description, name, institution, timestamp

Specify the type of sources and targets

There may be a need to identify the "kind of" resources identified by source URIs and/or target URIs (there is no "kind of" identifier as all identifiers are URIs). This information may just be put in the #DESCRIPTION meta field. The concept of a "kind of" thing is rather fuzzy anyway. A formal solution would be to introduce something like #SOURCETYPE and/or #TARGETTYPE, for instance to state that all entities linked to/from are people (foaf:Person). For instance:

#PREFIX: http://d-nb.info/gnd/
#TARGET: http://example.org/{ID}
#SOURCETYPE: http://xmlns.com/foaf/0.1/Person
#TARGETTYPE: http://purl.org/ontology/bibo/Document

115541543

is mapped to the RDF graph:

<http://d-nb.info/gnd/115541543> a foaf:Person ;
  rdfs:seeAlso <http://example.org/115541543> .
<http://example.org/115541543> a bibo:Document .
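
The mapping above could be generated along these lines. This is a sketch under the assumptions that rdfs:seeAlso is the link relation, that the target token defaults to the source token, and that the token needs no escaping. The function name is hypothetical:

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"

def typed_triples(token, prefix, target, sourcetype=None, targettype=None):
    """Build the link triple plus optional rdf:type triples for one token."""
    source_uri = prefix + token                 # #PREFIX without placeholder: append
    target_uri = target.replace("{ID}", token)  # simplified: no percent-encoding
    triples = [(source_uri, SEE_ALSO, target_uri)]
    if sourcetype:
        triples.append((source_uri, RDF_TYPE, sourcetype))
    if targettype:
        triples.append((target_uri, RDF_TYPE, targettype))
    return triples

for triple in typed_triples(
    "115541543",
    prefix="http://d-nb.info/gnd/",
    target="http://example.org/{ID}",
    sourcetype="http://xmlns.com/foaf/0.1/Person",
    targettype="http://purl.org/ontology/bibo/Document",
):
    print("<%s> <%s> <%s> ." % triple)
```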

name field and the like

A frequent situation is a database of a certain #NAME (and URL) provided by a certain #INSTITUTION (which has a name and a URL for its home page) and the beacon file is contributed and updated by a third party editor (#CONTACT?). There is need for a detailed #DESCRIPTION of the contents of the target database as well as for an editorial #REMARK concerning coverage, methodology (intellectual assignment, semi-automatic, mechanical extraction of another database, ..).

Information about a "source dataset"

Right now there is no concept of a "source dataset" but only information about the target dataset, described by meta fields #NAME and #INSTITUTION. Is information about the set (or superset) of all source URIs needed? Current use cases involve links to a specific database, instead of from a specific database. Supporting both would make the specification more complicated but it may be useful too.

Documentation with illustration

Documentation is crucial to a specification. I created a diagram to depict the core concepts of Beacon specification and the 15 meta fields:

Rename #PREFIX to #SOURCE

As suggested in issue #6, the meta field name 'prefix' is confusing, so it should simply be renamed to 'source'.

Refer to Linked Data Fragments

See http://www.hydra-cg.com/spec/latest/linked-data-fragments/; this looks relevant.

URIs, IRIs, Templates

The format uses "{" and "}" to denote placeholders in templates, and probably "|" in the serialization as text. These characters are fine to use, since by RFC 3986 they are neither reserved nor unreserved: they are therefore not legal URI characters and cannot be confused with non-placeholders or non-delimiters.

If this specification were to employ IRIs, however, this crucial distinction would fall away and leave room for ambiguity. (There are many examples of target URLs employing query parts and/or fragment identifiers, and RDF-style source identifiers frequently also contain fragment identifiers.)

A very different approach would consist in providing examples instead of templates for #PREFIX and #TARGET: Then valid URIs (or IRIs!) could be provided and accompanying fields would state the sample identifier used in the example and it would be the responsibility of the author to choose example identifiers which can be unambiguously matched in the example URI/IRI. The advantage over templates would consist in using standard XSD types for constraints.

Furthermore it is important to specify which transformations have to be applied when inserting values into templates: Target values often contain phrases like "Jakob Voß" and it is almost impossible to guess a "correct" encoding especially if query parts are involved ("%20" or "+" or "_" or ...). Thus operating with IRIs and attempting to convert this to a URI after template substitution might be dangerous.
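
To illustrate the ambiguity, here are three plausible encodings of the same phrase, using Python's standard `urllib.parse`. Which one a given target site expects cannot be derived from the template alone; the underscore variant is just an example of a wiki-style convention:

```python
from urllib.parse import quote, quote_plus

phrase = "Jakob Voß"

path_style = quote(phrase)                     # space becomes %20
form_style = quote_plus(phrase)                # space becomes +
wiki_style = quote(phrase.replace(" ", "_"))   # space becomes _ (site convention)

print(path_style)
print(form_style)
print(wiki_style)
```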

Support short form `foo|http://example.org`

Both existing legacy Beacon files and the large link dumps of URL shorteners provided by http://urlte.am/ support a short version of identifier-to-URL links:

foo|http://example.org

instead of

foo||http://example.org

I added a rule to interpret this variant in Beacon text format if no annotations or target templates have been defined. Here is an example of a Beacon text file for a URL shortener (only the first meta field #PREFIX is mandatory):

#PREFIX: http://tinyurl.com/
#NAME: TinyURL
#INSTITUTION: TinyURL, LLC (Kevin Gilbertson)
#HOMEPAGE: http://urlte.am/
#RELATION: http://dbpedia.org/resource/HTTP_301

m3q2xt|http://en.wikipedia.org/wiki/URL_shortening
55mp6b|http://www.youtube.com/watch?v=oHg5SJYRHA0
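
A consumer of such a dump might read it roughly as follows. This sketch only covers the short form (no annotations, no target template) and the function name is hypothetical:

```python
def read_short_beacon(text):
    """Map expanded short URLs to long URLs for a short-form BEACON dump."""
    prefix = ""
    links = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            name, _, value = line[1:].partition(":")
            if name.strip().upper() == "PREFIX":
                prefix = value.strip()
            continue  # other meta fields are ignored in this sketch
        token, _, url = line.partition("|")
        links[prefix + token] = url
    return links

dump = """#PREFIX: http://tinyurl.com/
#NAME: TinyURL

m3q2xt|http://en.wikipedia.org/wiki/URL_shortening
55mp6b|http://www.youtube.com/watch?v=oHg5SJYRHA0
"""
print(read_short_beacon(dump))
```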

Suggestion for text in 1.1. Overview

Currently it says "a BEACON dump could consist of links between two domains"
However, the next line states "http://example.org/{ID1} ---> http://example.com/{ID2}", using the same domain, which is slightly confusing.
Suggestion: http://example-domain1.org/{ID1} ---> http://example-domain2.com/{ID2}

Add license recommendation (CC0)

BEACON link dumps SHOULD be Open Data so the following triple MAY be assumed:

@prefix cc: <http://creativecommons.org/ns#> .

:dump cc:license <http://creativecommons.org/publicdomain/zero/1.0/> .

Additional meta fields #LABELTYPE and #DESCRIPTIONTYPE?

These fields could be used to specify the RDF property between target URI and label/description. See also the proposal at http://meta.wikimedia.org/wiki/Dynamic_links_to_external_resources#How_does_BEACON_relate_to_RDF.3F with additional fields #LABEL and #DATATYPE.

Wrong examples in URI patterns

The following is wrong:

Hello%20World     {+ID}       Hello%20World
M%C3%BCller       {+ID}       M%C3%BCller

{+ID} still escapes % so there is no way to literally have this character in a link (the same applies to <, >, space, and possibly other characters). This may be a problem if one requires these characters in a source identifier or target identifier.

Better document use case of {ID} vs {+ID}

http://example.org/?q={ID} escapes & and = among other characters
http://example.org/{+ID} a full query expression can be included
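
The difference can be sketched with `urllib.parse.quote`: {ID} encodes everything except unreserved characters, while {+ID} additionally leaves the RFC 3986 reserved characters alone. This is a simplification for illustration: it encodes every "%", including those in already percent-encoded input, and the function name is made up:

```python
from urllib.parse import quote

# RFC 3986 reserved characters (gen-delims and sub-delims).
RESERVED = ":/?#[]@!$&'()*+,;="

def expand(template, value):
    """Expand {ID} (strict escaping) and {+ID} (reserved characters kept)."""
    template = template.replace("{+ID}", quote(value, safe=RESERVED))
    return template.replace("{ID}", quote(value, safe=""))

print(expand("http://example.org/?q={ID}", "a&b=c"))
print(expand("http://example.org/{+ID}", "search?q=a&b=c"))
```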

Also revise all examples (1.2. Examples, 2.4. URI patterns and Appendices).