
linked-sdmx's Introduction

SDMX-ML to RDF/XML

See Linked SDMX Data about this project.

This transformation is used to publish some of the 270a Linked Dataspaces.

A brief overview of the project is here. If you want to use it or look under the hood for additional information, give the wiki page a go.

What is this?

XSLT 2.0 templates and scripts to transform Generic and Compact SDMX 2.0 data and metadata to RDF/XML using the RDF Data Cube and related vocabularies for statistical Linked Data. Its purpose is:

  • To automagically transform SDMX-ML data and metadata into RDF/XML as semantically rich and complete as possible.
  • To help SDMX publishers also publish their data using RDF.
  • To improve access and discovery of statistical cross-domain data.

What can it do?

  • Transforms SDMX KeyFamilies, ConceptSchemes and Concepts, CodeLists and Codes, Hierarchical CodeLists, DataSets.
  • Configurable to SDMX publishers' needs.
  • Reuse of CodeLists and Codes from external agencies.
  • A way to interlink AnnotationTypes.
  • Provides basic provenance data using PROV-O.

What is inside?

It comes with scripts and sample data.

Scripts

  • XSLT 2.0 templates to transform Generic and Compact SDMX-ML data and metadata. This includes the main XSL template for generic SDMX-ML, an XSL for common templates and functions, and an RDF/XML configuration file to set preferences like base URIs, delimiters in URIs, and how to map annotation types.
  • A Bash script that transforms the sample data using saxonb-xslt.

Samples

Sample SDMX Message and Structure data in data/, retrieved from these organizations: BIS, OECD, UN, ECB, WB, IMF, FAO, EUROSTAT, BFS.

Requirements

An XSLT 2.0 processor to run the transformation, and some configuration using the provided config.rdf file.

How to contribute

  • See the open GitHub issues if you want to hack, or create new issues for bugs or enhancements within the scope of this project. There are also some questions that would be nice to get answers to.
  • Please send pull requests or help improve documentation.
  • Reach out to organizations that publish data using SDMX-ML to collaborate on this effort.

linked-sdmx's People

Contributors

csarven


linked-sdmx's Issues

Whether to use @version, @validFrom, or @validTo values in URI

Some publishers use versions, validFrom, or validTo for their CodeLists. In order to distinguish one CL from another, the consideration here is whether to use @version, @validFrom, or @validTo values in the URI. This is going to turn out butt ugly, but the alternative is to generate some other token, and in the end that's not achieving much. What would be the best way to get around this, since adding these types of values to URIs is not considered good practice?

Currently, if both @validFrom and @validTo are present, their values are appended to the URI in order to make the CLs unique.

If no action is taken to make them unique, statements from multiple CLs (i.e., different versions of the same CL) are piled under the same subject resource.

I'm leaning towards adding @version:

http://example.org/code/CL_FOO/{$version}

where perhaps, if the version is "1.0", it will not be added, since SDMX 2.0 states that if no version is given, it is assumed to be 1.0. Hence, no version and version 1.0 can be treated equally and omitted from the URI. Looking at some sample metadata, @version is the most likely to occur, i.e., there is no case where either validFrom or validTo occurs without a @version. And since SDMX 2.0 says that "the validFrom and validTo attributes provide inclusive dates for providing supplemental validity information about the version", it might be sufficient to only include the version.
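That rule can be sketched as follows. This is a hypothetical Python illustration (the actual transformation is XSLT); the function name and base URI are made up:

```python
# Sketch of the proposed rule: append @version to the code list URI
# unless it is absent or "1.0", which SDMX 2.0 treats as the default.
def codelist_uri(base, codelist_id, version=None):
    if version in (None, "", "1.0"):
        # No version and version 1.0 are treated equally and omitted.
        return f"{base}/code/{codelist_id}"
    return f"{base}/code/{codelist_id}/{version}"
```

For example, codelist_uri("http://example.org", "CL_FOO", "2.1") yields http://example.org/code/CL_FOO/2.1, while version "1.0" collapses to the unversioned URI.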

Reserving the URI without the version as an alias for the latest version of the CL, with a change history, might help in some ways, but may also be unreliable if applications use that URI and expect a particular list of codes 😟

On the plus side of all this, perhaps placing the @version in the URI is not all that bad. Given that a KeyFamily will use a particular version of a CL, it needs to be able to point at that particular version. Although this goes against the general recommendation not to include the version in the URI, I think it is a good exception 😄 Otherwise, how would creating new terms for the CL URIs without the version information be any different? It seems equally arbitrary.

Using Agency IDs in URI

If the SDMX publisher (i.e., the agency in charge) publishes their data under their own domain name, is there any sense in using the agencyID (as is currently done) in the URI, e.g.:

http://{domain-of-the-agency}/code/{$agencyID}/{$codelistID}

Similarly, references to external agencies are ideally linked out (see issue #5), such that information about them is omitted from the final transformation.

Therefore, wouldn't it be better to eliminate the agencyID from the URI altogether?

Consider to transform compact SDMX-ML

The current transformation is based on generic SDMX. In order to tap into compact SDMX data, consider handling compact SDMX-ML as well. There are transformations from compact to generic, so the other question is whether this project should bother with it at all.

Consider using skos:prefLabel for structure:Code/structure:Description

Typically structure:Name and structure:Definition are transformed to skos:prefLabel and skos:definition respectively. In SDMX 2.0, structure:Codes do not contain structure:Name but only structure:Description; SDMX 2.1 contains both. Would it be safe to use skos:prefLabel for structure:Description in SDMX 2.0? Looking at sample data, the majority of structure:Descriptions are label-like (structure:Name), i.e., they are fairly short in length.

Dealing with structure:TimeDimension

Should structure:TimeDimension be given special treatment even though the RDF Data Cube vocabulary currently doesn't provide one (there is only qb:DimensionProperty)? Currently the transformation treats it as a regular qb:DimensionProperty. Invent/propose qb:TimeDimension?

Extend interlinking of annotations

Currently the interlinking method declared in the config for annotations applies to all matching AnnotationTypes. There should be a way to differentiate the semantics of AnnotationTypes in different lists. For example, the AnnotationType ABBREV in a Code might want to come out as skos:prefLabel, whereas in a Concept it might be skos:altLabel. The best approach would be to differentiate based on the identifier: the config should take in the identifier (of a Concept or Code) to which the interlinking applies. Consider also implementing a catch-all approach (sort of like how it currently works).

Incorporating structure:TextFormat

Consider how to incorporate the optional structure:TextFormat in structure:KeyFamily structure:Components. It should probably be used as the datatype on the range of the component property.

Consider using separate property URIs for each component property subclass

Currently using a single property path for component property URIs e.g., http://{authority}/property/{version}/{conceptSchemeID}/{conceptID}.

Since the main component properties can be one of: qb:DimensionProperty, qb:MeasureProperty, or qb:AttributeProperty, consider using dimension, measure, or attribute instead of property in path.

Advantages: since the property URIs are constructed by reusing the conceptScheme/concept information, it is possible that the same pattern is used for multiple component properties, e.g., a dimension and an attribute, which results in the same resource having both qb:DimensionProperty and qb:AttributeProperty. Having different paths would avoid this conflict.

Disadvantages: a single path like property is simpler. Also, anonymous users of the data will complain about this change. But, oh well.

This is not a bug per se, but I'm tagging it as such because the potential of having multiple classes for the same resource is there.
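The path selection could look like this hypothetical Python sketch (the real templates are XSLT; all names here are made up for illustration):

```python
# Sketch: derive the URI path segment from the component's
# RDF Data Cube class instead of a single generic "property" segment.
SEGMENT_BY_CLASS = {
    "qb:DimensionProperty": "dimension",
    "qb:MeasureProperty": "measure",
    "qb:AttributeProperty": "attribute",
}

def component_uri(authority, qb_class, version, scheme_id, concept_id):
    # Unknown classes fall back to the current single-path behaviour.
    segment = SEGMENT_BY_CLASS.get(qb_class, "property")
    return f"http://{authority}/{segment}/{version}/{scheme_id}/{concept_id}"
```

With distinct segments, a dimension and an attribute built from the same conceptScheme/concept pair no longer collapse into one resource carrying both classes.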

Case sensitivity

Currently, notations (identifiers) taken from the source data are used as-is, without changing their case. This means that URIs end up looking like:

http://example.org/code/CL_FREQ/Q
http://example.org/dataset/BIS_M_CIBL_UR/Q/M/U/B/S/5A/KR/2012-Q2

The rationale for keeping them the same was to allow external consumers to easily match these notations (even though this is trivial).

From the Linked Data perspective, it doesn't make much difference one way or another. Sometimes lower-case looks nicer and not like some stats are SHOUTING AT YOU!!!

Consider omitting attribute components

Consider omitting attribute components that have quality issues in the data. For example, an attribute value in the data might use the text description of a code instead of its identifier, or there is simply a mismatch between the data and metadata.

These issues are better fixed at the source (and I'm reporting them), however, for the time being it might be better to provide a way to omit them in the results. Obviously precision will be lost but probably better than leaving inaccurate information.

Possible approaches (ordered from easiest to difficult):

  • Offer a config option to omit attributes with a certain @conceptRef
  • Leave everything as is in KeyFamily, but omit the values that have whitespaces (note: is this safe in general for attributes?) in GenericData
  • Omit the component in KeyFamily and the value in GenericData.
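The first two approaches could be combined in a filter like this hypothetical Python sketch; the @conceptRef blacklist and the whitespace heuristic are assumptions for illustration, not the project's actual behaviour:

```python
# Sketch: decide whether to keep an attribute, dropping it if its
# conceptRef is blacklisted in the (assumed) config, or if its value
# contains whitespace, hinting that a textual description leaked in
# where a code identifier was expected.
OMIT_CONCEPT_REFS = {"COLLECTION"}  # hypothetical config entry

def keep_attribute(concept_ref, value):
    if concept_ref in OMIT_CONCEPT_REFS:
        return False
    if any(ch.isspace() for ch in value):
        return False
    return True
```

Note the open question above: a whitespace check is probably not safe for all attributes, since some legitimately carry free text.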

Kids, don't manipulate other people's data without parental guidance.

Using SDMX ConceptScheme type

SDMX-ML seems to differentiate ConceptScheme and CodeList. Should sdmx:ConceptScheme be created/proposed? Currently skos:ConceptScheme is used - is that sufficient?

Handling different DataSet messages

DataSets may be sent with message:GenericData (the most common, looking at the samples) or message:MessageGroup - the latter should be handled as well.

Property label should be derived from a specific concept that the DSD uses

The label assigned to a component property is derived from a concept used in the DSD. The concept term, however, is not unique, i.e., it is used across different concept schemes. Therefore, retrieving the label should factor in other information, e.g., agency, concept scheme, and version.

Consider external agency IDs for concepts

Concepts may use external agency IDs. Consider whether to generate these descriptions or skip over them.

Currently the descriptions are generated using the main agency ID as the default.

Configuration to map dimension, measure, or attribute values to use a preferred namespace

It should be possible to configure dimension, attribute, or measure values, based on the concept that's defined in the config file, to any namespace. For example, ECB uses the TIME_FORMAT concept for an attribute (which is an isTimeFormat), where the values in the dataset for this attribute are members of CL_TIME_FORMAT in http://sdmx.org/wp-content/uploads/2009/01/02_sdmx_cog_annex_2_cl_2009.pdf. So, in this case, use the appropriate concept from sdmx-code:timeFormat.

Relates to issue #34 but this is for values instead of properties.

Consider adopting CL_ORGANISATION

The current agency list is maintained manually, with identifiers based on the sample data, some of which contain aliases. While this approach is okay, and may be sufficient for this project, there is a CL_ORGANISATION code list from Eurostat (IIRC) which could be adopted here.

Move namespaces from each element to root element

Commit 21f4e82 forces namespaces to be declared at the element level on each property in observations. This is redundant and bloats the output size.

I'm not sure if a fix for this is easy or straightforward. xsl:namespace could be used right after rdf:RDF, but it would mean that all the namespaces need to be emitted before getting to the observations.

Configuration to use SDMX-RDF URIs directly

The default approach for the transformation is to create properties under the given agency's namespace so that it can be linked to or extended easily. If SDMX agencyID is detected, the property descriptions point at SDMX-RDF URIs.

Some publishers would prefer to use the SDMX-RDF URIs directly instead of creating new properties when the agencyID for the code lists or concept schemes is set to SDMX.

So, this should be possible and configurable with a simple switch.

Adding labels

Consider whether to "generate" labels based on information from elsewhere, e.g., a label for resources of type qb:DataSet from information in message:Header or structure:KeyFamily/structure:Name.

Datatypes in observation object resources

Add appropriate datatypes to the object resources in observations when they are not explicitly given via structure:TextFormat in KeyFamily. This is really about checking literals for known patterns.

Find a way to remove excessive namespace declarations

There are excessive namespace declarations per property in observations (DataSet). This makes the RDF/XML bloated and takes up time. See if the namespaces can be placed upfront on the parent or some ancestor element instead.

Refactor namespace creation

Currently the namespaces created for the component concepts are generated per KeyFamily. This doesn't work out when the input dataset file and the structure file contain multiple DataSets and KeyFamilies.

Raising or flattening the dataset

Raising this issue due to a decision which was made in issue #11 that perhaps needs a consideration on its own.

Taking issue #11 as an example, where there is no corresponding attachment level in the RDF Data Cube vocabulary, the decision in that case was to flatten the dataset - as a practical implementation.

The question here is whether there should be a better decision process for raising or flattening a dataset in such cases. Raising reduces repetition, whereas flattening makes querying easier. I'm not aware of any major philosophical reasons to choose one over the other as far as the implementation goes for this project.

Maintain dimension order in observation URIs as in KeyFamily

The order of dimensions used in a DataSet Series may not be the same as the dimension order in the corresponding KeyFamily. The URI generated per observation includes the dimension values, and these should follow the order used in the KeyFamily.
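A minimal sketch of the intended ordering, in hypothetical Python for illustration (the transformation itself is XSLT):

```python
# Sketch: rebuild the observation URI key in KeyFamily dimension order,
# regardless of the order in which the series supplied the values.
def observation_key(keyfamily_order, series_values):
    """keyfamily_order: list of conceptRefs as declared in the KeyFamily.
    series_values: dict of conceptRef -> value from the DataSet series."""
    return "/".join(series_values[dim] for dim in keyfamily_order)
```

For example, observation_key(["FREQ", "REF_AREA"], {"REF_AREA": "KR", "FREQ": "Q"}) yields "Q/KR" even though the series listed REF_AREA first.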

Detect common datetime and period patterns

SDMX data contains common datetime patterns, e.g., xs:dateTime, xs:date, xs:gYearMonth, xs:gYear, as well as periods, e.g., xs:duration, or PeriodTypes in SDMX: YYYY followed by - and Q[1-4], W[1-52], T[1-3], B[1-2] (see also: SDMXCommon.xsd). For such literals, use either URIs (e.g., reference year or quarter) or rdf:datatype with the corresponding resource.

This is partially supported in issue #9.
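The pattern checks could be sketched as follows; the regular expressions mirror the PeriodTypes listed above, while the sdmx:* labels on the right are placeholders for whatever datatype or period URI is chosen, not actual vocabulary terms:

```python
import re

# Sketch: classify a time value against common XSD and SDMX period
# patterns so a suitable rdf:datatype (or period URI) can be picked.
PATTERNS = [
    (re.compile(r"^\d{4}$"), "xs:gYear"),
    (re.compile(r"^\d{4}-\d{2}$"), "xs:gYearMonth"),
    (re.compile(r"^\d{4}-\d{2}-\d{2}$"), "xs:date"),
    (re.compile(r"^\d{4}-Q[1-4]$"), "sdmx:quarter"),       # placeholder label
    (re.compile(r"^\d{4}-W(0?[1-9]|[1-4]\d|5[0-2])$"), "sdmx:week"),
    (re.compile(r"^\d{4}-T[1-3]$"), "sdmx:trimester"),
    (re.compile(r"^\d{4}-B[1-2]$"), "sdmx:biannual"),
]

def detect_datatype(value):
    for pattern, datatype in PATTERNS:
        if pattern.match(value):
            return datatype
    return None  # unknown pattern: leave the literal untyped
```

The anchored $ in each pattern keeps shorter patterns from matching prefixes of longer ones (e.g., "2012-06" is xs:gYearMonth, not xs:gYear).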

Unique Dataset URIs

Dataset URIs need to be unique. Current URI pattern is: http://{authority}/dataset/{KeyFamilyAgencyID}/{KeyFamilyRef}

Consider how to improve e.g., "ID identifies a data flow definition, which, when combined with time, uniquely identifies the data set." from SDMXMessage.xsd.

Dealing with Provenance

Currently a minimal level of provenance is applied using PROV-O. Consider whether to leave it as is, exclude it, or provide a configuration option to (dis)allow it in transformations.

Are concept identifiers unique?

Are all concept identifiers from an agency unique? Put differently, can a concept identifier occur multiple times in different concept schemes from the same agency? I currently can't see this in the sample data, but it might be the case.

In the case of code identifiers, looking at the sample data, they are not unique, i.e., the same code identifier can occur in different code lists with different descriptions. For this reason, code identifiers in URIs are prefixed with code list identifiers, e.g., http://{authority}/code/{codelistID}/{codeID}

Reuse URIs from existing agencyIDs other than SDMX

This is similar to issue #6 but for external agencies:

Consider how to reuse agencyIDs that differ from the publisher's own agency and SDMX. Ideally it should use their full URIs - would this need to check the SDMX Registry or a known published RDF vocabulary?

The other open question is whether to do this at all. Can the transformation be considered complete without making these design decisions, i.e., without leaving data behind?
