kg-construct / rml-io Goto Github PK

View Code? Open in Web Editor NEW

1.0 1.0 4.0 3.12 MB

RML-IO: Input/Output declarations for RML

Home Page: https://w3id.org/rml/io/spec

License: Creative Commons Attribution 4.0 International

HTML 99.37% JavaScript 0.09% CSS 0.14% Python 0.39%

io ontology rdf rml rml-mapping shacl

rml-io's People

Contributors

Stargazers

Watchers

Forkers

ghsnd chrdebru pmaria anaigmo

rml-io's Issues

XML namespaces for XPath

XPath allows to use XML namespaces when selecting parts of an XML document.
However, (most) implementations require to register these namespaces before doing an XPath query.
RML does not specify how does this should happen currently:

In the mapping rules?
By the implementation with a CLI parameter or dynamically by parsing the XML document first and find any namespaces
...

CARML has an extension for this: https://github.com/carml/carml#xml-namespace-extension
and it came up in the past already a few times without a clear solution:

Point of use of rml:query

I see several examples where rml:query is used as a property of rml:Source.

And other examples where rml:query is used as a property of rml:LogicalSource.

My preference would be to have rml:query be a property of rml:LogicalSource. Because:

this would allow to reuse the source description for multiple queries
this would also be more in line with the behavior of rml:iterator, in that the evaluation of an rml:iterator produces a list of Records, and the evaluation of rml:query also produces a list of Records (in the case of relational databases: rows). Having these on the same resource type would simplify implementations.

Filtering Logical Source & Logical Target

Idea

It could be very useful to filter records when accessing a Logical Source or filter triples when exporting to a Logical Target, for example:

Logical Source

Validate records before mapping them e.g. remove invalid records
Skip records you are not interested in, e.g. list of people: map only people with age < X
...

Logical Target

Validate triples before exporting them e.g. validate SHACL shapes on generated triples
Filter out triples with private information e.g. apply privacy or access restrictions on generated triples
...

Proposal

Add an optional property rmld:filter to a Logical Source and Logical Target. If not specified, no filtering is applied.
This property accepts an FnO function via FNML which is applied on the rmld:Source or rmld:Target.
- Logical Source: iterator gets its records from the FnO function, the function yields an iterator
- Logical Target: the written triples are yielded from the FnO function
FnO functions can be nested so you can have complex filtering as well.
If the engine does not support FnO functions with FNML, this property is ignored, even if provided.

CC: @bjdmeest @samiscoding

more concrete examples of reference formulations

the reference formulations were introduced in RML but they were never clarified in the end.

now that we have seen some opportunities wrt what to add in the reference formulation, it might be good to introduce some more concrete examples in the spec

Specify how empty literals look like

We should be able to specify how a data source specifies empty literals:

Text: "NULL", "", ``, "random word"
Special value: null in JSON for example
...

Is a job for the DataIO spec: https://rml.io/specs/dataio/

Split of kg-construct/rml-core#16

rml:RelativePathSource expand to Targets

Targets also can use relative paths, 'RelativePathSource' is kinda Source specific.

Drop 'Source' from the class name and move it outside the Source section into something separate in the spec.

cover source & not only target

This spec was originally meant to cover the target but currently it is also meant to cover the source description so we need to extend it to describe the source as well and not only the target.

Drop rml:SQL2008

Replaced by rml:SQL2008Query and rml:SQL2008Table

Improve Target documentation when using Function and Graph Maps

There's a big focus on Target examples for Subject, Predicate, Object Maps but Function and Graph Maps are under specified in the spec.

Fix copy-paste error in Source

Source was copied from Target but the subsections were never updated to explain rml:null and rml:query.
rml:serialization must go

LogicalTarget to Revise

Section 8 of the document is quite verbose, mostly redundant, and as a result it contains errors due to copy-paste.

For instance, the output of Example in 8.4 is wrong:

# file:///data/dump1.nt
<http://example.org/0> <http://xmlns.com/foaf/0.1/name> "Kara Danvers" .
<http://example.org/1> <http://xmlns.com/foaf/0.1/name> "Alex Danvers" .
<http://example.org/2> <http://xmlns.com/foaf/0.1/name> "J'onn J'onzz" .
<http://example.org/3> <http://xmlns.com/foaf/0.1/name> "Nia Nal" .

# file:///data/dump2.nt
<http://example.org/0> <http://xmlns.com/foaf/0.1/name> "Kara Danvers" .
<http://example.org/1> <http://xmlns.com/foaf/0.1/name> "Alex Danvers" .
<http://example.org/2> <http://xmlns.com/foaf/0.1/name> "J'onn J'onzz" .
<http://example.org/3> <http://xmlns.com/foaf/0.1/name> "Nia Nal" .

Note that the second group of triples should refer to nicknames instead.

Is it really necessary to provide all possible examples? Is there a general rule on how this works, that we could write instead?

Specifying an iterator: default only applicable to db, CSV and TSV

By default, the iterator is considered a row, if not specified:
* In the case of databases, CSV or TSV data sources, the value of the rml:iterator, if not specified, is a "row".
* In the case of XML and JSON data sources, it is a valid reference to an element or an object respectively considering the reference formulation specified.

The sentence "By default, the iterator is considered a row, if not specified:" seems not necessary, since it only applies to the first bullet (databases, CSV or TSV). And in case of XML and JSON it must be specified and there is no "row" concept, right?

Issues with test cases

Wrong delimiters in 0004a, 0004b, and 0004c
0004a, 0004b, and 0004c -> spaces in CSV files
0004a should contain

<http://example.org/5> <http://xmlns.com/foaf/0.1/age> "" .
<http://example.org/5> <http://xmlns.com/foaf/0.1/name> "" .

Paginated sources

Paginated JSON/XML/... is used extensively in the wild for Web APIs.
However, we don't support this in the current Logical Source.

See:

newer SQL versions

How are we going to keep up with newer versions of SQL?

Currently we have SQL:2008 as a reference formulation, following R2RML. But there are newer versions of SQL, the latest being SQL:2023. How are we going to keep up with these?

Adjust spec with ontology

Once #20 is merged, adjust spec with slightly changes

Some possible minor issues

http://w3id.org/rml/resources/Friends.json.zip resolves to the documentation --> moved to kg-construct/rml-resources#5
3.2.2 Compression formats refers to rml:tarXz and rml:tarGz, but rest of documentation and tests use rml:tarxz and rml:targz

Update Logical Target examples with new structure

Examples are not all updated to the new structure of Source and Target.

DCAT & VOID - examples for both for both source & target

related to #13

Small issues with Source 0006x

00006b: the SPARQL query does not contain an id variable. Here is a solution:

    rml:iterator """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    SELECT ?id ?name ?age WHERE {
        ?person foaf:name ?name .
        ?person foaf:age ?age .
        BIND(REPLACE(STR(?person),"http://example.org/", "") AS ?id) .
    }
    """;

0006f misses a reference formulation, but is it a relational database or a CSV file. The mapping seems incomplete.

Small issues

0002e refers to rml:targzip instead of rml:targz
0006e refers to rml:targzip instead of rml:targz
0003 use of relative location of the data dump. Is it by default the location of the mapping file?
0003 refers to id in a reference, but there is no ?id in the SPARQL Query or 0003 should look as follows:

<#TriplesMap> a rml:TriplesMap;
  rml:logicalSource [ a rml:LogicalSource;
    rml:source <#VoIDSourceAccess>;
    rml:iterator """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    SELECT ?person ?name ?age WHERE {    # HERE IS THE CHANGE
        ?person foaf:name ?name .
        ?person foaf:age ?age .
    }
    """;
    rml:referenceFormulation formats:SPARQL_Results_CSV;
  ];
  rml:subjectMap [ a rml:SubjectMap;
    rml:reference "person";    # HERE IS THE CHANGE
  ];

datatype inference

A recent paper:

R2RML and the original RML specification defined that RML processors can perform data type inference from the SQL databases. Thus, mappings did not have to specify rr:datatype for RDF Literals to have the correct data type as the processor would retrieve this automatically from the SQL database.
However, RML did not expand this to other heterogeneous datasources such as XML or JSON which both provide data types in different ways: XML schemas, native JSON types, etc. Data type inference is still under discussion but might be moved to RML-IO because this RML module focuses on accessing and iterating over the data source.

and refers to kg-construct/rml-core#87. I don't see much discussion of datatype inference there, so I'm posting this issue here.

Here are a couple of considerations:

For XML, we should clearly use xsd:type, especially focusing on XSD Datatypes but not ignoring custom datatypes like geo:wktLiteral, geo:gmlLiteral etc.
- XML attributes and text content are always strings, so there's no place for implicit types, right?
- One can specialize XSD types using restrictions and extension, which is potentially mappable to rdfs:Datatype constructs, but I think this is clearly beyond scope of RML
- XSD and RelaxNG have the concept of "post schema validation infoset" (PSVI) that can assign application types (eg Person) to elements. However, I don't think we should go there.
for JSON, keep in mind that it does not define what is a number, which leads to a number of unpleasant surprises in JSON-LD. Eg 12345678901234567890 is not a xsd:integer, and small decimals like 12.3 can be treated as float/double (eg 1.23e1) at will. So I'm not sure what can be tested here.

Simplify design of source and target

In the current draft both the Source and Target involve properties to describe at least the following aspects of a resource:

encoding
compression

Additionally, a distinction is made between Source/Target and Access, which creates a rather complex chain of resources. And, when we look at how Access is applied in the current proposal, its intent is to model a Data Source.

I think there is room to simplify the current design, by removing the distinction between Source/Target and Access. And I think we can reuse the same Source class for both the source and target side.

So in this way, we could have a mapping like this:

<#DCATSourceAccess> a dcat:Dataset;
  dcat:distribution [ a dcat:Distribution;
    dcat:downloadURL "https://rml.io/specs/rml-target/Supergirl.xml";
  ];
.

<#TriplesMap> a rr:TriplesMap;
  rml:logicalSource [ a rml:LogicalSource;
    rml:source <#DCATSourceAccess> ;
    rml:referenceFormulation ql:JSONPath;
    rml:iterator "$.[*]";
  ];
  rr:subjectMap [ a rr:SubjectMap;
    rr:template "http://example.org/{@id}";
    rml:logicalTarget <#TargetDump1>;
  ];
  rr:predicateObjectMap [ a rr:PredicateObjectMap;
    rr:predicateMap [ a rr:PredicateMap;
      rr:constant foaf:name;
    ];
    rr:objectMap [ a rr:ObjectMap;
      rml:reference "name/text()";
      rml:logicalTarget <#TargetDump2>;
    ];
  ];
  rr:predicateObjectMap [ a rr:PredicateObjectMap;
    rr:predicateMap [ a rr:PredicateMap;
      rr:constant foaf:nickname;
    ];
    rr:objectMap [ a rr:ObjectMap;
      rml:reference "nickname/text()";
    ];
  ];
.

<#TargetDump1> a rmlt:LogicalTarget;
  rmlt:target <#VoIDDump1>;
.
<#TargetDump2> a rmlt:LogicalTarget;
  rmlt:target <#VoIDDump1>;
.

<#VoIDDump1> a void:Dataset ;
  void:dataDump <file:///data/dump1.nt>;
  void:feature formats:N-Triples;
  rmlt:serialization formats:N-Triples;
.
<#VoIDDump2> a void:Dataset ;
  void:dataDump <file:///data/dump2.nt>;
  void:feature formats:N-Triples;
  rmlt:serialization formats:N-Triples; 
.

A specific source type specification can then define how to infer source specific properties like encoding, compression, serialization/format, etc. Many standards already have properties that define these aspects. For example:

This would allow for more simply reusing already existing source descriptions as sources and targets, without having to duplicate information.

Testcases: copy them from examples

Would be nice to have test cases covering the complete specification:

Validating SHACL shapes from #21
Implementations validating their code
...

Most of the tests are actually already used as examples in the spec.

Input stream as source

issue: not possible to describe an input stream as a source

suggestion: add support for describing input streams. This will facilitate the usage of RML in transformation pipelines.

rml:access

I am not sure I understand/agree on the use of rml:access, could you elaborate a bit more the rational behind it?

Rewrite specification to include Logical Sources as well

Need to be transferred from the core RML spec.

Suggestion to rephrase paragraph

I would suggest changing

"rml:null describes which data values inside the source should be considered as NULL. Defaults to the default NULL character if available. If none is available such as CSV, no values are considered NULL, unless specified. Example: CSV does not have a default NULL character, so no value is considered NULL. However, JSON has a NULL character specified: null, this one is used together with the ones specified through rml:null."

"rml:null indicate which data values inside the source should be considered as NULL values. The value for this predicate defaults to the default NULL token of the underlying data model (e.g., NULL in relational databases and null in JSON) and are always processed as such. Some data models have no such default NULL token, such as CSV. When that is the case, then the empty string ϵ is considered NULL."

I understood that one can declare multiple rml:null, but that the default NULL token is always used. How would one otherwise try to process these NULL tokens. The problem, however, is with the phrasing of "unless specified" for CSV files. That gives the impression that rml:null can be "overwritten" for CSV. Is ϵ always considered rml:null? If yes, then we can posit it as such. If not, then this probably needs a section with examples.

How to access a local file via path relative to mapping file?

The mentioned DCAT spec does not allow to refer to relative file paths (it allows for relative resources, but this is relative to the @base of the RML mapping file which is not what I mean).

What would the description of the file ./data/input.json, (path relative to the mapping file) look like?

(you could rephrase my question to 'relative to the cwd' if that makes things easier, I can always set my cwd to the mapping file folder).

Small issues with test cases:

RMLSTC0004a does not have a reference formulation
RMLSTC0004b does not have a reference formulation
RMLSTC0004c has the wrong reference formulation (rml:referenceFormulation rml:JSONPath)
RMLSTC0004a wrong csvw:url -> should be file:

Section 3.1, `rml:iterator`

In Section 3.1, there is the following sentence:

"In the case of databases, CSV or TSV data sources, the value of rml:iterator is considered a "row" and must not be specified."

The sentence above seems incorrect. First of all, what is it meant by "databases"? Are graph databases databases, or only relational ones are meant here?

Further, for relational databases, rml:iterator CAN be used to express queries according to rml:SQL2008Query reference formulation.

rml:Target: add more properties (write mode)

For some targets it might make sense to specify some extra properties. I have this use case where I want to append output to target file x, but overwrite target file y if it exists. So for files there are the typical write modes, e.g. like the python file modes w, a, etc.

In the spec these properties might translate to properties that apply to other kinds of targets as well, for instance databases.

Drop literal support in rml:source & rmlt:target?

The original RML spec supported literals in rml:source and rmlt:target on top of objects (CSVW, SD, etc.) such as

rml:logicalSource [ a rml:LogicalSource;
  rml:source "/path/to/file.csv";
];

We kept supporting this in RML for historical reasons.
However, we have now the chance to drop it in favor of only objects which allow better descriptions.
Would that be a good idea? Or better keep it?

If dropped, the following example would become:

rml:logicalSource [ a rml:LogicalSource;
  rml:source [ a csvw:Table;
    csvw:url  "/path/to/file.csv";
    csvw:dialect [ a csvw:Dialect;
        csvw:delimiter ";";
        csvw:header "1"^^xsd:boolean;
  ];
];

This approach is way more descriptive and allows a lot to be re-used of existing specifications such as CSVW, SD, DCAT, etc.
Publication from @andimou https://dl.acm.org/doi/10.1145/2814864.2814873

Wrong NS for formats

https://www.w3.org/ns/formats/Format is used in many places (test cases, documentation, etc.) instead of http://www.w3.org/ns/formats/Format.

Inconsistency due to `rdfs:domain` of `rml:logicalTarget` and its use on `rml:LanguageMap`

The rdfs:domain of rml:logicalTarget is defined to be rml:TermMap, yet it is also used on rml:LanguageMap which is not a rml:TermMap.

Options:

drop the domain for rml:logicalTarget
make the domain a union of rml:TermMap and rml:LanguageMap

Does SPARQL TSV Results make sense?

Nor the documentation, nor the test cases provide such examples (same for XML and JSON results, by the way). But I question the usefulness of rml:SPARQL_RESULT_TSV. Taking the example of 0003, we would have the following TSV:

?person	?name	?age
<http://example.org/0>	"Monica Geller"	"33"
<http://example.org/1>	"Rachel Green"	"34"
<http://example.org/2>	"Joey Tribbiani"	"35"
<http://example.org/3>	"Chandler Bing"	"36"
<http://example.org/4>	"Ross Geller"	"37"

How should we iterate over those? We cannot treat them as regular TSV. The angle brackets should be removed from IRIs. Literals should be "cast" to their datatypes. And I have no idea what to do with blank node identifiers. Is it possible the group thought that the TSV output would be the same as CSV output, but with tabs?

Same question for JSON and XML representations of SPARQL queries: do they have bespoke iterations (i.e., not the same iterations as for "regular" JSON or XML files), or would iterating over them require a second iterator?

Compressed sources

Compressed sources, do we want to handle this like this:

rml:source [
  rml:access [
    # what we had previously for `rml:source` is now here
    # location & access (ex: WoT has security and location here)
  ];
  rml:encoding enc:UTF-8;
  rml:null "THIS IS A NULL VALUE";
  rml:compression comp:Zip;
]

Or do we need to handle this as nested Logical Sources? This approach won't work for a ZIP file with multiple files.

What is the namespace of `ql`?

The namespace prefix ql is used as a qualifier for http://semweb.mmlab.be/ns/ql#. Is that still the case? Should this be documented in this spec?

Re-order repo

See RML Star

Header required for CSVW?

When CSVW header is false, I assume that columns are numbered from 1 to n (as CSVW assumes column numbers start from 1). Is this OK?

MERGED WITH #71
Allow RDF terms for rml:null for resources based on RDF and SPARQL? --> Moved to #71

Why keep a separate DataIO spec?

As I'm reading the target spec I'm wondering where is the definition of rmlt:LogicalTarget and associated properties.

As a reader I'd expect to have a self-contained spec that includes everything.

SPARQL CONSTRUCT support

Consider example in specification:

<#SPARQLEndpoint> a rml:LogicalSource;
    rml:source [ a rml:Source, sd:Service;
        sd:endpoint  <http://example.com/sparql>;
        sd:supportedLanguage sd:SPARQL11Query;
    ];
    rml:iterator "CONSTRUCT WHERE { ?s ?p ?o. } LIMIT 100";
    rml:referenceFormulation formats:SPARQL_Results_CSV;
.

Since the iterator uses a CONSTRUCT, the reference formulation format cannot be SPARQL_Results_CSV. Suggest either using a SPARQL SELECT form, or change the reference formulation format(?).

data source test: SQL CHAR datatype

Generation of triples by using CHAR datatype column, resulting RDF literal is space-padded. (specification reference)

Suggestions for missing test cases

dcat:format for the encoding (much like CSVW encoding)

dcat Dataset vs dcat Distribution / dcat Data Service

It seems to me that Dataset would never be the actual source, since it is an abstract description of a dataset. The source will always be a specific distribution, or a data service. So shouldn't that be the direct source that is linked to in the mapping?

A dcat dataset could have many distributions for example.

ex:dataset1 a dcat:Dataset ;
    dcat:distribution ex:distribution1 , ex:distribution 2, ex:distribution3 ;

ex:distribution1 dcat:accessUrl <https://foo.bar/1> .
ex:distribution2 dcat:accessUrl <https://foo.bar/2> .
ex:distribution3 dcat:accessUrl <https://foo.bar/3> .

With this mapping:

ex:Source a rml:Source ;
  rml:access ex:dataset1 ;

You don't know which of the distributions is intended.

Specifying encoding

issue: not possible to specify source encoding

suggestion: add support for describing source encoding. With the focus on text based sources it is important to be able specify the encoding of input sources.

No constraint on what a target should be => need to say this is implementation-dependent?

In the examples, the target may be a VOID data or a DCAT dataset. Yet there is no constraint here.
So how should a processor know whether the target is given by void:dataDump or dcat:downloadURL or anything else?

If this is implementation-dependent, then it should be stated explicitly.

base Vs view

The current RML spec says

A logical source can be one of the following:

A base source (any input source or base table),
a view (in case of databases)

should that remain as it is?

LogicalTarget serialization, compression, and encoding domains

There is an inconsistency between Figure 2 and Table in Section 4.2 (e.g., the domain of rml:compression is rml:Target, whereas in the table it is rml:LogicalTarget.

For consistency with logical sources, maybe the figure is "correct", however I wonder why rml:serialization has been put as a property of rml:LogicalTarget. Would it not make it more sense to put it together with compression and encoding details?

Furthermore, should not we not specify rml:serialization in logical sources as well? For instance, if I am reading a binary JSON file (BSON), then rml:serialization could be used to denote this fact which is not currently captured by rml:referenceFormulation rml:JSONPath.

Add Logical Source examples

We have a lot of Logical Target examples in how to use them within RML, do the same for Logical Source