interstat / statistics-contextualized

Models for the dissemination of contextualized statistical data

R 2.74% Python 66.30% Jupyter Notebook 30.96%

statistics-contextualized's People

pafrance thomaspo flo7894

statistics-contextualized's Issues

Air Quality ontology and data models

This is a proposal to model air quality using existing vocabularies: SOSA for sensor description and the AQD model for air pollution.
interstat.pdf
In the diagram, yellow marks SOSA concepts and green marks AQD model concepts.
Bear in mind that this is the ontological description of the domain of interest regarding air pollution.
This model can be exported in OWL format with Eddy.

Actual data can be mapped with tools like monolith or juma, but some adjustments are needed to match the data structures of the suggested smart data models.
The linked document contains a list of the properties and concepts that have been analyzed to resolve the compatibility issue.
Items highlighted in green have been added to the graphical representation, while those highlighted in yellow are proposed for revision.
Items that are not highlighted have not been analyzed yet. Sets of concepts like these pertain to the administrative elements of the sensor or to its physical environment, and could be added to the conceptual model too.
lista di concetti interstat.docx

Apart from the missing areas regarding the sensor's physical environment, the main mismatch between the data models concerns the pollutant structure:
Sensor data exhibit a vector of pollutant measurements that can be mapped to a given set of columns in a tabular representation; each column holds concentration data and is therefore formatted as a float.
Our model represents a single observation as a set of key/value pairs, so multiple measurements translate to multiple rows pertaining to the same observation.
It is possible to translate between the two models with a simple pivot/unpivot function.
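
The pivot/unpivot translation between the two layouts can be illustrated with a minimal sketch in plain Python. The column names (`observation_id`, `pollutant`, `value`) and the sample values are assumptions made up for the example; they are not taken from the actual sensor data or from the proposed model.

```python
from collections import defaultdict

# Hypothetical wide layout: one row per observation, one float column per pollutant.
wide_rows = [
    {"observation_id": "obs-1", "PM10": 21.5, "NO2": 33.0},
    {"observation_id": "obs-2", "PM10": 18.2, "NO2": 29.4},
]


def unpivot(rows, id_column="observation_id"):
    """Wide -> long: emit one key/value row per pollutant measurement."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                long_rows.append(
                    {id_column: row[id_column], "pollutant": column, "value": value}
                )
    return long_rows


def pivot(rows, id_column="observation_id"):
    """Long -> wide: gather all key/value rows of one observation into one row."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row[id_column]][row["pollutant"]] = row["value"]
    return [{id_column: obs_id, **values} for obs_id, values in grouped.items()]


long_rows = unpivot(wide_rows)   # four rows: two pollutants for each observation
round_trip = pivot(long_rows)    # back to the original wide layout
```

In this sketch the round trip is lossless only because every observation carries the same pollutants; with missing measurements the wide layout would need an explicit empty value.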

Once we reach consensus on a common data model, the next step is to map this model to actual data sources to produce the triples for each source, but I'd like to discuss the available datasets and the common model first.
Italian datasets that are already compliant with the AQD and SOSA models are available for reference.

Implement SEP transformation to NGSI-LD

Implement the transformation task that formats the SEP Census QB DSD and CSV file as NGSI-LD

Specify S4Y client application

Sep data update

Move metadata into dedicated graph: http://rdf.interstat.eng.it/graphs/sep/metadata
Rename data graph: http://rdf.interstat.eng.it/graphs/sep
Reorganize GraphDB repositories: sep-test and sep-staging
Fix observation filtering in the data pipeline

SEP data workflow: Italian census data

Italian census data is currently produced manually. Explore possibilities of automation.

Map SEP air quality data to SOSA/SSN

A description of the SEP (Support for Environment Policies) data is given here. Some of the fields correspond to artifacts defined in the Semantic Sensor Network Ontology. It would be useful to document these correspondences.

Map Data Cube model to NGSI-LD

The Data Cube model is presented in the specification.

Translate Data Cube DSD for SEP census data to DDI-CDI

SEP data workflow: air quality data

Design and implement data workflow.

Specify GF client application

Loop back Data Cube -> NGSI-LD -> RDF

An interesting exercise would be to take the JSON-LD produced from the Data Cube/Turtle, convert it back to RDF with a standard JSON-LD -> RDF transformation, and compare the result with the original graph.
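
The comparison step of such a round trip could look like the sketch below. It assumes both graphs have already been serialized as N-Triples, parses only a simplified subset of that syntax, and compares the triples as plain sets. This is only valid for graphs without blank nodes; with blank nodes a proper isomorphism check from an RDF library would be needed. The `example.org` URIs and the sample triples are hypothetical.

```python
def parse_ntriples(text):
    """Parse a simplified subset of N-Triples: one 's p o .' statement per line."""
    triples = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.endswith("."):
            line = line[:-1].strip()  # drop the statement terminator
        subject, predicate, obj = line.split(None, 2)
        triples.add((subject, predicate, obj))
    return triples


def diff_graphs(original, round_tripped):
    """Report triples lost and triples added by the round trip."""
    return {"lost": original - round_tripped, "added": round_tripped - original}


# Hypothetical original graph (as it might come from the Data Cube/Turtle source).
original_nt = """<http://example.org/obs/1> <http://purl.org/linked-data/cube#dataSet> <http://example.org/dataset/census> .
<http://example.org/obs/1> <http://example.org/measure/population> "1200" .
"""

# Hypothetical round-tripped graph: the literal came back with an explicit datatype.
round_tripped_nt = """<http://example.org/obs/1> <http://purl.org/linked-data/cube#dataSet> <http://example.org/dataset/census> .
<http://example.org/obs/1> <http://example.org/measure/population> "1200"^^<http://www.w3.org/2001/XMLSchema#integer> .
"""

original = parse_ntriples(original_nt)
round_tripped = parse_ntriples(round_tripped_nt)
report = diff_graphs(original, round_tripped)
for kind, triples in report.items():
    for triple in sorted(triples):
        print(kind, *triple)
```

Listing the lost and added triples separately makes it easier to spot systematic differences, such as datatype changes, rather than only learning that the graphs differ.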

Specify S4Y data workflow

Specify the data location, formats, ETL, etc.

Specify SEP client application

update link for creating a data model

We now have a new link for the creation of a data model:
https://smartdatamodels.org/index.php/draft-a-data-model/

It is simpler and more powerful.

SEP data workflow: Italian Air Pollution datasets

Data extraction
Step 1: Open the data source website.
Step 2: Select the DATA panel. Data are organized in a set of tables.
Step 3: Scroll to the requested table, named “Tabella 1 – PM10. Stazioni di monitoraggio: dati e parametri statistici per la valutazione della qualità dell'aria (2019)”.
Step 4: Use the download link at the bottom left, at the end of the table. Downloaded data are in XLS format.
The downloaded file is not compliant with the required data structure.

Data transformation
The downloaded file has the following data structure:
"Regione","Provincia","Comune","Nome della stazione Tipo di zona","Tipo di stazione","Giorni di superamento di 50 µg/m3","Valore medio annuo³ [µg/m³]","Rendimento [%]","Rispetta copertura minima","sufficiente distribuzione temporale nell'anno","numero_dati_validi","TIPO DI DATI 4","Codice zona","Nome zona"

The data need to be filtered to comply with the requested data structure.
A NUTS3 variable has been added by transforming the municipality_id variable, using data from the ISTAT LAU archive.
The metadata provided for the NUTS3 transformation need to be downloaded and merged.
The metadata are organized as a time series; the variable for the year 2019 has been used in the script.
Metadata on pollutant type, data reference time and aggregation type have been added to the data file.
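
The transformation described above can be sketched as follows. This is not the R script attached to the issue, only an illustration in Python of the same kind of steps: look up NUTS3 from municipality_id, then add constant metadata columns. All column names other than municipality_id, the codes, and the sample values are hypothetical placeholders, and dropping rows without a NUTS3 match is an assumption about what "filtering" means here.

```python
import csv
import io

# Hypothetical extract of the station table (placeholder codes and values).
stations_csv = """municipality_id,station_name,annual_mean
000001,Station A,28
000002,Station B,26
999999,Station C,31
"""

# Hypothetical LAU-to-NUTS3 lookup for the 2019 reference year (placeholder codes).
lau_csv = """municipality_id,nuts3_2019
000001,XX001
000002,XX002
"""


def load_lookup(text, key, value):
    """Build a dict mapping the key column to the value column of a CSV text."""
    return {row[key]: row[value] for row in csv.DictReader(io.StringIO(text))}


def enrich(stations_text, lookup, constants):
    """Add a nuts3 column and constant metadata columns to every station row."""
    enriched = []
    for row in csv.DictReader(io.StringIO(stations_text)):
        nuts3 = lookup.get(row["municipality_id"])
        if nuts3 is None:
            continue  # assumption: rows without a NUTS3 match are filtered out
        enriched.append({**row, "nuts3": nuts3, **constants})
    return enriched


lookup = load_lookup(lau_csv, "municipality_id", "nuts3_2019")
# Illustrative metadata values for pollutant type, reference time and aggregation type.
constants = {"pollutant": "PM10", "reference_year": "2019", "aggregation_type": "annual mean"}
rows = enrich(stations_csv, lookup, constants)
```

Keeping municipality codes as strings preserves leading zeros, which a numeric conversion would silently drop and which would then break the lookup.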

Data Load
The transformed file has been uploaded to the INTERSTAT GraphDB repository sep-test.
GraphDB allows direct links to resources through permalinks, but the raw data need some reworking before they can be accessed directly.

Further data files available
The same procedure can be used to import other data from the data source website:
AMBIENT AIR QUALITY: NITROGEN DIOXIDE NO2
AMBIENT AIR QUALITY: TROPOSPHERIC OZONE O3
AMBIENT AIR QUALITY: PARTICULATE PM2.5
These files have not been uploaded to the GraphDB repository yet.

Transformation script in R language
processing_ETL_AIR.R.txt