gbv / beaconspec
BEACON link dump format specification
Home Page: http://gbv.github.com/beaconspec/
Language tags could be included as done in aREF by appending @ followed by a language tag. The following field values may contain a language tag:
The institution field requires special treatment because it may contain a URL as well.
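A minimal sketch of how such a suffix could be split off, assuming the tag is appended as "@" plus a BCP 47-like tag. The function name `split_language_tag` and the tag pattern are hypothetical and not part of the specification; a value such as "foo@bar" would be misread as tagged, which is one reason the institution field (with its optional URL) needs extra care.

```python
import re

# Assumption: a language tag looks like "de" or "en-GB" and is appended
# after the last "@" of the field value. The pattern is illustrative only.
LANG_SUFFIX = re.compile(
    r"^(?P<value>.*)@(?P<lang>[A-Za-z]{2,8}(?:-[A-Za-z0-9]{1,8})*)$"
)

def split_language_tag(field_value):
    """Return (value, language tag or None) for a meta field value."""
    match = LANG_SUFFIX.match(field_value)
    if match is None:
        return field_value, None
    return match.group("value"), match.group("lang")

print(split_language_tag("Deutsche Nationalbibliothek@de"))
print(split_language_tag("ACME documents"))
```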
For instance for the GND database:
#SOURCESET: http://d-nb.info/gnd/7749153-1
#SOURCESET: http://viaf.org/viaf/188136221
A simple example would help, so that readers do not have to dig into the full specification.
Similar to https://validator.w3.org/ and https://json-ld.org/playground/
is not mentioned
Seeing that the mime type for (non-xml) beacon is (currently) text/plain, I assume that the preferred file name ending is .txt, but it would help to have that put down explicitly in the spec, particularly if you are going for text/beacon instead.
A review of currently used meta fields from BEACON files at http://de.wikipedia.org/wiki/Wikipedia:BEACON shows the following distribution:
125 #TARGET
118 #INSTITUTION
112 #FEED
100 #CONTACT
94 #DESCRIPTION
92 #MESSAGE
85 #VERSION
78 #PREFIX
77 #FORMAT
65 #NAME
63 #TIMESTAMP
30 #REVISIT
12 #DATE
10 #EXAMPLES
8 #ISIL
5 #UPDATE
4 #COUNT
2 #SOMEMESSAGE
1 #SOURCE
1 #RESULTS
1 #Langtext a {color
1 #IMGTARGET
1 #ALTTARGET
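A distribution like the one above could be produced roughly as follows. This is only a sketch: it assumes each field is counted once per file and that a meta line has the form "#NAME:", and the function name `count_meta_fields` is hypothetical. How the original review was actually computed is not stated.

```python
import re
from collections import Counter

# Assumption: a meta line starts with "#", a field name of letters, then ":".
META_LINE = re.compile(r"^#([A-Za-z]+)\s*:")

def count_meta_fields(beacon_texts):
    """Count in how many BEACON files each meta field name occurs."""
    counts = Counter()
    for text in beacon_texts:
        names = set()
        for line in text.splitlines():
            match = META_LINE.match(line)
            if match:
                names.add(match.group(1).upper())
        counts.update(names)  # each field counted once per file
    return counts

files = [
    "#FORMAT: BEACON\n#TARGET: http://example.org/{ID}\n123\n",
    "#TARGET: http://example.com/{ID}\n#NAME: Example\n456\n",
]
for name, number in count_meta_fields(files).most_common():
    print(number, "#" + name)
```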
Which fields should become part of the final specification?
PREFIX and TARGET are essential to abbreviate links.
FORMAT and VERSION should be dropped.
TIMESTAMP should be used instead of DATE.
COUNT and EXAMPLES are not that important: drop them.
REVISIT is used (with different format), but I'd prefer the UPDATE field with controlled terms as used in sitemaps.xml standard instead.
FEED is useful.
ISIL and SOMEMESSAGE should be dropped because less used.
ALTTARGET, IMGTARGET, Langtext, RESULTS, and SOURCE are not standard.
INSTITUTION, CONTACT, DESCRIPTION, MESSAGE, NAME?
So finally there is:
PREFIX and TARGET to abbreviate links
FEED, TIMESTAMP, and UPDATE to describe dump URL, time and update frequency
CONTACT
INSTITUTION, DESCRIPTION, MESSAGE, NAME. Can we drop one of them?
Given that the (source identifier or URI) id field and the optional target (link, URI, identifier, search term, label ...) field will transport quite a broad spectrum of character data (but maybe no spaces) I already see difficulties for a third optional label or description field in the text serialization.
As for content, I doubt that there will be any clear distinction between "label" and "description" data. The known use cases contain, if anything, partly the preferred name of the object in the target application (use case: biography) and partly some indication of the number of distinct objects at the target (use case: bibliography), and as a matter of fact never both.
Agreed that a consumer of a beacon file wants to construct hyperlinks from the data received, there certainly is a need to label the link and to provide context information by means of tooltips or enriching the label or anything it considers suitable. For this any header fields can be exploited at the discretion of the consumer (which has to take into account that all of them are optional) but the creator of the beacon file may provide #LABEL and #INFO (?, #TOOLTIP?, #CONTEXT??, ...) header fields which by virtue of the templating mechanism may refer to any link field: {srcid}, {trgid}, {label}.
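To illustrate the templating idea, here is a small sketch of how a consumer might fill a #LABEL-style template from the link fields. The placeholder names {srcid}, {trgid} and {label} come from the paragraph above; the function `render_template` and the sample template are hypothetical.

```python
import re

# Placeholders proposed in the discussion: {srcid}, {trgid}, {label}.
PLACEHOLDER = re.compile(r"\{(srcid|trgid|label)\}")

def render_template(template, link):
    """Fill a header-field template from one link's fields.

    Missing link fields are replaced by an empty string, since all of
    them are optional from the consumer's point of view.
    """
    return PLACEHOLDER.sub(lambda m: link.get(m.group(1), ""), template)

link = {"srcid": "116647868", "label": "Kobolt, F. W.: Kupferstich, 1795"}
print(render_template("Portrait: {label} (GND {srcid})", link))
```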
However http://beacon.findbuch.de/portraits/ps_usbk with the formats "seealso" and "seealso-imglink" demonstrates that in the presence of graphical resources "label" and "description" sometimes are not enough elements to produce appealing links and it would be nice to have even more template-aware header fields like #THUMBNAIL or #PREVIEW or #LOGO to transport the construction rules for additional texts or URLs. Trying to standardize these would be completely out of scope of a specification, but it is already stated that extra header fields are not in violation of the standard and the (editorial) #REMARK field could try to explain the purpose of the extra fields or alternatively link to a documentation page.
#FORMAT: BEACON
#PREFIX: http://www.wikidata.org/entity/
#TARGET: http://purl.org/spar/cito/
#RELATION: http://www.w3.org/2004/02/skos/core#{+ID}
P2860|exactMatch|cites
And N-Triples like
#RELATION: {+ID}
http://www.wikidata.org/entity/P2860|http://www.w3.org/2004/02/skos/core#exactMatch|http://purl.org/spar/cito/cites
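The two variants above can be read as producing the same triple. A rough sketch of that reading follows; it assumes literal substitution of {ID} / {+ID} without any escaping, and that a template without a placeholder simply gets the token appended. Both are simplifications, and the helper names are hypothetical.

```python
def expand(template, token):
    """Insert a token into a template (literal substitution, no escaping)."""
    for placeholder in ("{+ID}", "{ID}"):
        if placeholder in template:
            return template.replace(placeholder, token)
    # Assumption: a template without a placeholder is treated as a prefix.
    return template + token

def link_to_triple(line, prefix="{+ID}", target="{+ID}", relation="{+ID}"):
    """Map a 'source|annotation|target' line to (subject, predicate, object)."""
    source, annotation, target_token = line.split("|")
    return (
        expand(prefix, source),
        expand(relation, annotation),
        expand(target, target_token),
    )

print(link_to_triple(
    "P2860|exactMatch|cites",
    prefix="http://www.wikidata.org/entity/",
    target="http://purl.org/spar/cito/",
    relation="http://www.w3.org/2004/02/skos/core#{+ID}",
))
```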
Repeated meta fields are not a syntax error but SHOULD result in a warning. Applications may choose which value to pick but they MUST pick only one value.
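One possible way for a parser to follow this rule is sketched below. Keeping the first value is an arbitrary choice made for this example (the rule leaves the choice to the application), and `collect_meta_fields` is a hypothetical name.

```python
import warnings

def collect_meta_fields(pairs):
    """Build a dict of meta fields from (name, value) pairs.

    A repeated field triggers a warning; only one value is kept
    (here: the first one, which is an arbitrary choice).
    """
    fields = {}
    for name, value in pairs:
        if name in fields:
            warnings.warn(f"repeated meta field {name!r}; keeping first value")
            continue
        fields[name] = value
    return fields

fields = collect_meta_fields([
    ("NAME", "ACME documents"),
    ("NAME", "ACME docs"),
    ("FEED", "http://example.org/beacon.txt"),
])
print(fields["NAME"])
```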
As the introduction states nicely, Beacon files denote a mapping function: from "Identifiers" or URIs to URLs, URIs and maybe "Identifiers".
To stress the symmetry the template fields PREFIX and TARGET should be named SOURCE and TARGET and obey the same syntax rules.
To make things like "{ID}" clearer, the placeholders should be named "SRCID" and "TRGID" (both may occur in targets, there are even examples for mixtures and repetitions [and alas also target placeholders split in two parts with a repetitive element between them]). However TRGID should be forbidden for the SOURCE template.
BEACON was first, but http://www.w3.org/TR/beacon/ is stronger.
This request was raised (in a similar form) by @MathiasSchindler.
Every valid BEACON file should contain a meta field explaining its own format.
As suggested by Thomas Berger, previously used #VERSION and #FORMAT should be merged.
As of 2014, the only valid entry should be
Future Versions should be constructed accordingly.
Reviewing existing BEACON files with more than a single id field, there are several cases, for instance:
Number of "hits" as second field:
For instance http://dingler.culture.hu-berlin.de/beacon and http://beacon.findbuch.de/downloads/pw/pw_imslp-pndbeacon.txt:
116137592|138
100001718|2
Label and target URI:
For instance http://www.zisterzienserlexikon.de/beacon/beacon.txt and http://beacon.findbuch.de/downloads/ps_usbk/DE-38-USB_Koeln-Portraitsammlung-portraitierte-beacon.txt and http://www.andreas-praefcke.de/temp/BEACON-PND-ADS.txt:
139788824|Kurz, Matthäus|http://www.zisterzienserlexikon.de/wiki/Kurz,_Matth%C3%A4us
116647868|Kobolt, F. W.: Kupferstich, 1795|http://kug.ub.uni-koeln.de/portal/connector/permalink/portrait/1/1/index.html
101148739X|Dones Elvira (Namensdatensatz)|http://lexikon.a-d-s.ch/edit/detail_a.php?id_autor=43
target URI:
For instance in http://www.andreas-praefcke.de/temp/BEACON-PND-BBC-Paintings.txt and http://www.andreas-praefcke.de/temp/BEACON-PND-GW.txt:
102436606|http://www.bbc.co.uk/arts/yourpaintings/artists/thomas-bardwell
100001467|http://gesamtkatalogderwiegendrucke.de/docs/ANDRANT.htm
number of hits and label:
For instance http://www.historische-kommission-muenchen-editionen.de/beacon_adr.txt:
116001038|1|Abegg, Waldemar
What should the final specification be? The last case only occurs once so it could be marked as deprecated.
See also related issue #7.
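For reviewing existing files, the cases listed above could be told apart with a heuristic like the following. This is only a sketch: treating an all-digit field as a hit count and an http(s) prefix as a target URI are assumptions, and `classify_link` is a hypothetical helper.

```python
def classify_link(line):
    """Guess which of the observed field layouts a link line uses."""
    fields = line.split("|")

    def is_uri(text):
        return text.startswith(("http://", "https://"))

    if len(fields) == 1:
        return "id only"
    if len(fields) == 2:
        if fields[1].isdigit():
            return "id + hits"
        if is_uri(fields[1]):
            return "id + target URI"
    if len(fields) == 3:
        if is_uri(fields[2]):
            return "id + label + target URI"
        if fields[1].isdigit():
            return "id + hits + label"
    return "other"

for sample in (
    "116137592|138",
    "102436606|http://www.bbc.co.uk/arts/yourpaintings/artists/thomas-bardwell",
    "116001038|1|Abegg, Waldemar",
):
    print(classify_link(sample), "<-", sample)
```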
Clarify whether/where Unicode Normalization MUST/SHOULD be done.
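To show why this matters: the same visible label can be encoded in two ways, and tokens only compare equal after normalization. The sketch below uses NFC purely as an example; which form (if any) the specification should require is exactly the open question, and `normalize_token` is a hypothetical name.

```python
import unicodedata

def normalize_token(token):
    """Normalize a token to NFC (an example choice, not a spec rule)."""
    return unicodedata.normalize("NFC", token)

decomposed = "Mattha\u0308us"   # "a" followed by a combining diaeresis
precomposed = "Matth\u00e4us"   # single precomposed character

print(decomposed == precomposed)
print(normalize_token(decomposed) == normalize_token(precomposed))
```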
As raised in another issue there may be non-URI identifiers. I neither see the need to support this kind of identifier nor know how to support it, but maybe there is a convincing use case and a simple solution.
Text states: "A BEACON dump MAY be annotated with a set of meta fields. Each meta field is identified by its name, build of lowercase letters a-z."
This is a little bit confusing as meta fields in the text serialization are upper case.
Suggestion: also lower case meta fields in text serialization.
Question: Are metafields in the text serialization supposed to be case sensitive?
4.1.4. RELATION should be removed; it is irrelevant to the BEACON purpose and only meant to introduce semantic web concepts into BEACON.
Does a recommendation exist for the link targets in the BEACON file?
I am asking because we can provide dedicated links that point to a GUI representation of a resource or its representation in RDF/XML (or turtle etc.) and from the specification and the various examples I found, I understand that only one link is provided per GND ID.
In http://gbv.github.io/beaconspec/beacon.html#rfc.section.4.1 remove
The BEACON text file SHOULD start with a fixed meta field:
START = "#FORMAT:" +WHITESPACE "BEACON" *WHITESPACE LINEBREAK
A new implementation to be published as node-module so it can be used for instance as data source of Linked Data Fragments Server.
Construction of annotation via message field may be too complicated. By removal of #MESSAGE, however, the link construction rule of BEACON text might need to get adjusted. In particular, without message field, this:
123|456
is always a link with source token "123" and annotation "456". The second value "456" cannot be used as target token to construct a target URI.
For target tokens one always needs to use two bars:
#PREFIX: http://www.wikidata.org/entity/
#TARGET: http://www.librarything.com/work/
#RELATION: http://www.w3.org/2002/07/owl#sameAs
Q721||3383
Only if #TARGET is the default ({+ID}), the second token is taken as target if it begins with http: or https: :
#PREFIX: http://www.wikidata.org/entity/
#RELATION: http://www.w3.org/2002/07/owl#sameAs
Q721|http://www.librarything.com/work/3383
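The adjusted rule could be sketched like this. It is an illustration under the assumptions stated in this proposal only (no message field, default target template "{+ID}"); `parse_link` is a hypothetical name.

```python
DEFAULT_TARGET = "{+ID}"

def parse_link(line, target_template=DEFAULT_TARGET):
    """Split a link line into (source token, annotation, target token)."""
    parts = line.split("|", 2)
    source, annotation, target = parts[0], "", ""
    if len(parts) == 2:
        second = parts[1]
        # Only with the default #TARGET is an http(s) value read as target.
        if target_template == DEFAULT_TARGET and second.startswith(("http:", "https:")):
            target = second
        else:
            annotation = second
    elif len(parts) == 3:
        annotation, target = parts[1], parts[2]
    return source, annotation, target

print(parse_link("123|456"))
print(parse_link("Q721||3383", "http://www.librarything.com/work/"))
print(parse_link("Q721|http://www.librarything.com/work/3383"))
```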
The meaning of the institution meta field should be clarified by renaming it to "creator" and mapping it to http://purl.org/dc/terms/creator ("the person, organization, or a service primarily responsible for making the BEACON dump").
The current MIME type for Beacon text is "text/plain". Should one specify, for instance, "text/beacon"?
e.g.
#NAME ACME documents
instead of
#NAME: ACME documents
The corresponding rule in BEACON text format could be changed to
METALINE = "#" METAFIELD ( ":" / WSP ) METAVALUE LINEBREAK
to support colon, space and tabulator.
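A regular expression along the lines of the relaxed rule might look as follows. It is a sketch only: it assumes METAFIELD consists of letters and that whitespace around the value is trimmed, neither of which is fixed by the rule quoted above; `parse_meta_line` is a hypothetical name.

```python
import re

# "#" METAFIELD ( ":" / WSP ) METAVALUE, with METAFIELD assumed to be letters.
META_LINE = re.compile(r"^#(?P<name>[A-Za-z]+)(?::|[ \t])[ \t]*(?P<value>.*)$")

def parse_meta_line(line):
    """Return (field name, value) for a meta line, or None for other lines."""
    match = META_LINE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    return match.group("name"), match.group("value").strip()

for sample in ("#NAME: ACME documents", "#NAME ACME documents", "#NAME\tACME documents"):
    print(parse_meta_line(sample))
```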
COUNT was used in the early days. Linked Data Fragments requires:
Each Triple Pattern Fragment, and each page of a Triple Pattern Fragment, MUST contain the estimated total number of triples that match the fragment's selector.
On the other hand the number can be computed by counting.
It is very common that beacon files do not change for years but then have to, because the addressing scheme of the target site changes. Therefore values like "infrequently" or "on demand" are needed.
I very much liked the #REVISIT field of the 0.2 spec which gave the opportunity to give a realistic estimate of the next "release" of the file and was easier to parse than the only intentional #UPDATE field.
.beacon or .txt? See also #24 - a MIME type registration often includes a suggested file extension.
This business has turned too nonsensical for me and I do not want to be associated with that any more. To my impression there is no attempt made to solve serious issues ("source" still abundant) and instead progress is made towards directions away from current practical use of BEACON files.
Sorry, but please could you remove me from the list of authors ASAP?
Thomas Berger
I'll rename the "qualifier" meta field to "annotation". This should be clearer, right?
In issue #3 gymel wrote "I very much liked the #REVISIT field of the 0.2 spec which gave the opportunity to give a realistic estimate of the next "release" of the file and was easier to parse than the only intentional #UPDATE field." So should the #REVISIT field be included in addition to the #UPDATE field?
As a BEACON is a simple text file and most BEACON files are on the open web, anyone can crawl them.
To avoid being spammed one could deliberately provide a syntactically wrong email address, e.g. "foobar.net". This address can be easily parsed by humans and not quite so easily by bots.
Thus, I would like to enhance the BEACON spec so that syntactically incorrect email addresses are allowed.
You may argue against my proposal: "that should be the work of anti-spam filters", but why put load on spam filters when there is a convenient way to avoid spam altogether? Since I don't see any necessity at all for machines to be able to contact me, I would be fine if the address is only human readable.
Would it make sense to give the table a header row?
Rows such as "target + target --> target" might be confusing.
Or is "meta field + link field --> link element" supposed to be the header row? In that case underlining it might make things clearer.
I would be much in favor of defining a set of mandatory Metafields.
Suggestion: Changing "A BEACON dump MAY be annotated with a set of meta fields." to "A BEACON dump MUST be annotated with the following mandatory meta fields (...) and MAY be additionally annotated with ..."
Humble suggestion for mandatory fields: description, name, institution, timestamp
There may be a need to identify the "kind of" resources identified by source URIs and/or target URIs (there is no "kind of" identifier as all identifiers are URIs). This information may just be put in the #DESCRIPTION meta field. The concept of a "kind of" thing is rather fuzzy anyway. A formal solution would be to introduce something like #SOURCETYPE and/or #TARGETTYPE, for instance to state that all entities linked to/from are people (foaf:Person). For instance
#PREFIX: http://d-nb.info/gnd/
#TARGET: http://example.org/{ID}
#SOURCETYPE: http://xmlns.com/foaf/0.1/Person
#TARGETTYPE: http://purl.org/ontology/bibo/Document
115541543
is mapped to the RDF graph:
<http://d-nb.info/gnd/115541543> a foaf:Person ;
rdfs:seeAlso <http://example.org/115541543> .
<http://example.org/115541543> a bibo:Document .
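The mapping shown above could be sketched as follows, producing plain (subject, predicate, object) tuples. #SOURCETYPE and #TARGETTYPE are only proposed fields, the function name `link_triples` is hypothetical, and the substitution is literal with no escaping.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SEEALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"

def link_triples(source_id, prefix, target, source_type=None, target_type=None):
    """Map one link to triples, adding type statements if types are given."""
    source_uri = prefix + source_id
    target_uri = target.replace("{ID}", source_id)  # literal substitution
    triples = [(source_uri, RDFS_SEEALSO, target_uri)]
    if source_type:
        triples.append((source_uri, RDF_TYPE, source_type))
    if target_type:
        triples.append((target_uri, RDF_TYPE, target_type))
    return triples

for triple in link_triples(
    "115541543",
    prefix="http://d-nb.info/gnd/",
    target="http://example.org/{ID}",
    source_type="http://xmlns.com/foaf/0.1/Person",
    target_type="http://purl.org/ontology/bibo/Document",
):
    print(triple)
```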
A frequent situation is a database of a certain #NAME (and URL) provided by a certain #INSTITUTION (which has a name and a URL for its home page) and the beacon file is contributed and updated by a third party editor (#CONTACT?). There is need for a detailed #DESCRIPTION of the contents of the target database as well as for an editorial #REMARK concerning coverage, methodology (intellectual assignment, semi-automatic, mechanical extraction of another database, ..).
Right now there is no concept of a "source dataset" but only information about the target dataset, described by meta fields #NAME and #INSTITUTION. Is information about the set (or superset) of all source URIs needed? Current use cases involve links to a specific database, instead of from a specific database. Supporting both would make the specification more complicated but it may be useful too.
As suggested in issue #6, the meta field name 'prefix' is confusing, so it should simply be renamed to 'source'.
See http://www.hydra-cg.com/spec/latest/linked-data-fragments/ which looks relevant.
The format uses "{" and "}" for denoting placeholders in templates, and probably "|" in the serialization as text. These characters are fine to use, since by RFC 3986 they are neither reserved nor unreserved: therefore they are not legal URI characters and cannot be confused with non-placeholders or non-delimiters.
If, however, this specification were to employ IRIs, this crucial distinction would fall away and leave room for ambiguity (there are many examples of target URLs employing query parts and/or fragment identifiers, and RDF-style source identifiers frequently also contain fragment identifiers).
A very different approach would consist in providing examples instead of templates for #PREFIX and #TARGET: then valid URIs (or IRIs!) could be provided, accompanying fields would state the sample identifier used in the example, and it would be the responsibility of the author to choose example identifiers which can be unambiguously matched in the example URI/IRI. The advantage over templates would consist in using standard XSD types for constraints.
Furthermore it is important to specify which transformations have to be applied when inserting values into templates: target values often contain phrases like "Jakob Voß" and it is almost impossible to guess a "correct" encoding, especially if query parts are involved ("%20" or "+" or "_" or ...). Thus operating with IRIs and attempting to convert these to a URI after template substitution might be dangerous.
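The ambiguity can be made concrete with Python's standard `urllib.parse` functions. The sketch below merely lists three plausible encodings of the same value; which one a given target site expects cannot be derived from the value itself. The helper name `candidate_encodings` is hypothetical.

```python
from urllib.parse import quote, quote_plus

def candidate_encodings(value):
    """Return several plausible URI encodings of the same target value."""
    return {
        "percent": quote(value, safe=""),                       # space -> %20
        "plus": quote_plus(value),                              # space -> +
        "underscore": quote(value.replace(" ", "_"), safe=""),  # space -> _
    }

for style, encoded in candidate_encodings("Jakob Voß").items():
    print(style, encoded)
```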
Both existing legacy Beacon files and the large link dumps of URL shorteners provided by http://urlte.am/ support a short version of identifier-to-URL links:
foo|http://example.org
instead of
foo||http://example.org
I added a rule to interpret this variant in Beacon text format if no annotations or target templates have been defined. Here is an example of a Beacon text file for a URL shortener (only the first meta field #PREFIX is mandatory):
#PREFIX: http://tinyurl.com/
#NAME: TinyURL
#INSTITUTION: TinyURL, LLC (Kevin Gilbertson)
#HOMEPAGE: http://urlte.am/
#RELATION: http://dbpedia.org/resource/HTTP_301
m3q2xt|http://en.wikipedia.org/wiki/URL_shortening
55mp6b|http://www.youtube.com/watch?v=oHg5SJYRHA0
Currently it says "a BEACON dump could consist of links between two domains"
However, the next line states "http://example.org/{ID1} ---> http://example.com/{ID2}", using the same domain, which is slightly confusing.
Suggestion: http://example-domain1.org/{ID1} ---> http://example-domain2.com/{ID2}
BEACON link dumps SHOULD be Open Data so the following triple MAY be assumed:
@prefix cc: <http://creativecommons.org/ns#> .
:dump cc:license <http://creativecommons.org/publicdomain/zero/1.0/> .
These fields could be used to specify the RDF property between target URI and label/description. See also the proposal at http://meta.wikimedia.org/wiki/Dynamic_links_to_external_resources#How_does_BEACON_relate_to_RDF.3F with additional fields #LABEL and #DATATYPE.
The following is wrong:
Hello%20World {+ID} Hello%20World
M%C3%BCller {+ID} M%C3%BCller
{+ID} still escapes % so there is no way to literally have this character in a link (same applies for < and >, %, space, and possibly other characters). This may be a problem if one requires these characters in a source identifier or target identifier.
http://example.org/?q={ID} escapes & and = among other characters
http://example.org/{+ID} a full query expression can be included
Also revise all examples (1.2. Examples, 2.4. URI patterns and Appendices),
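The difference between the two placeholders can be approximated with `urllib.parse.quote`. This is a simplified sketch, not a complete URI Template implementation: it escapes every "%" in both modes (the behavior described above) and does not try to detect values that are already percent-encoded. The function name `expand` is hypothetical.

```python
from urllib.parse import quote

# Reserved characters of RFC 3986 (gen-delims and sub-delims).
RESERVED = ":/?#[]@!$&'()*+,;="

def expand(template, value):
    """Expand {ID} (escape everything but unreserved) or {+ID} (keep reserved)."""
    if "{+ID}" in template:
        return template.replace("{+ID}", quote(value, safe=RESERVED))
    return template.replace("{ID}", quote(value, safe=""))

print(expand("http://example.org/?q={ID}", "a=1&b=2"))
print(expand("http://example.org/{+ID}", "?a=1&b=2"))
print(expand("{+ID}", "Hello%20World"))
```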