google / dspl

Schema and utilities for Google Dataset Publishing Language

Home Page: https://developers.google.com/public-data

License: BSD 3-Clause "New" or "Revised" License

dspl's Introduction

Dataset Publishing Language

Introduction

DSPL stands for Dataset Publishing Language. It is a representation format for both the metadata (information about the dataset, such as its name and provider, as well as the concepts it contains and displays) and actual data (the numbers) of datasets. Datasets described in this format can be imported into the Google Public Data Explorer, a tool that allows for rich, visual exploration of the data.

This site hosts miscellaneous open-source content (schemas, example files, and utilities) associated with the DSPL standard. See our documentation site for more details on what DSPL is and how to use it. The utilities in this repository are documented at this site.

Build and install

To build the tools, install lxml, then use the setup.py script in tools/dspltools/. You can use pip to install these:

    pip install -r tools/dspltools/requirements.txt
    pip install tools/dspltools

DSPL 2

The draft of the DSPL 2 specification, which replaces the existing XML metadata format with schema.org markup, can be found at the DSPL GitHub page. The source for the specification is at docs/dspl2-spec.md.

Some initial library and tool support is available in tools/dspl2.

Build and install

To build the tools, install the prerequisites, then use the setup.py script in tools/dspl2/. You can use pip to install these:

    pip install -r tools/dspl2/requirements.txt
    pip install tools/dspl2

dspl's Issues

nested/hierarchical dimensions

Many dimensions, for example state/county/city or sector/industry group/industry/subindustry (e.g., GICS), have parent/child or broader/narrower relationships between dimension values at each level. Other schemes like SKOS or DSPL 1.0 can model this, but DSPL 2.0 does not yet include such a facility.

In this issue, discuss what such an extension could look like.

nested/hierarchical measures

DSPL 1.0 and other schemes permit grouping of measures into categories. For example, the SDG indicators are grouped under their corresponding goals and targets, and the World Development Indicator measures are grouped under a nested hierarchy of topics.

In this issue, discuss how to extend DSPL 2.0 to support this functionality.

id generation for data/metadata in CSV files

We may want to indicate how to generate IDs for the triples corresponding to the rows in CSV files.

This will facilitate having a well-defined mapping from DSPL 2 datasets to triples, and may make it feasible to use dimension values and footnotes defined in CSV files across datasets.

Tentative proposal

Attempt to generate easy-to-keep-unique IDs, and make no provisions for ID collisions.

codeList

For each CSV row,

Start with the containing dimension's ID.
If there is no fragment, set the fragment to the dimension's name, URL encoded.
Append an = and the URL-encoded codeValue to the fragment.

For example, if a row's codeValue is us and its containing Dimension has @id of #country, the row's triples should be generated as if from equivalent JSON-LD with "@id": "#country=us".
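The codeList steps above can be sketched as follows. This is a minimal illustration, not part of the DSPL tools: the function name `code_list_row_id` and its parameters are hypothetical, and it assumes that "URL encoded" means percent-encoding every reserved character.

```python
from urllib.parse import quote, urldefrag


def code_list_row_id(dimension_id, dimension_name, code_value):
    """Build an @id for one codeList CSV row (hypothetical helper)."""
    # Split the dimension's @id into the part before '#' and the fragment.
    base, fragment = urldefrag(dimension_id)
    if not fragment:
        # No fragment on the dimension's @id: fall back to its name.
        fragment = quote(dimension_name, safe="")
    # Append '=' and the percent-encoded codeValue to the fragment.
    return f"{base}#{fragment}={quote(code_value, safe='')}"


print(code_list_row_id("#country", "Country", "us"))  # #country=us
```

Because the proposal makes no provision for collisions, the sketch does not check whether two rows yield the same ID.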

footnote

For each CSV row,

Start with the containing StatisticalDataset's @id.
If there is a fragment, append a / to it.
Append footnote= and the URL-encoded codeValue to the fragment.

For example, if the dataset's @id is the empty string, a footnote with codeValue of p would yield an ID of #footnote=p. Similarly, if the dataset @id is #my_dataset, the footnote would have @id of #my_dataset/footnote=p.
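A minimal sketch of the footnote rule, reproducing both examples above. The function name `footnote_row_id` is hypothetical, and percent-encoding of all reserved characters is an assumption about what "URL-encoded" means here.

```python
from urllib.parse import quote, urldefrag


def footnote_row_id(dataset_id, code_value):
    """Build an @id for one footnote CSV row (hypothetical helper)."""
    base, fragment = urldefrag(dataset_id)
    if fragment:
        # An existing fragment is kept and separated from the footnote part.
        fragment += "/"
    return f"{base}#{fragment}footnote={quote(code_value, safe='')}"


print(footnote_row_id("", "p"))             # #footnote=p
print(footnote_row_id("#my_dataset", "p"))  # #my_dataset/footnote=p
```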

observation

For each CSV row,

Start with the slice's @id.
If there is a fragment, append a / to it.
Sort the dimension values by dimension name.
For each dimension value, append the URL-encoded name, = and the URL-encoded codeValue to the fragment, separating the entries with /.
Sort the measure values by measure name.
For each measure value, append the URL-encoded name to the fragment, separating entries with /.

For example, an observation in a slice with an @id of #europe_unemployment_slice with dimensions

gender of m,
country of uk, and
month of 2010-10

and measures

unemployment_rate and
unemployment

would have an @id of #europe_unemployment_slice/country=uk/gender=m/month=2010-10/unemployment/unemployment_rate
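The observation rule can be sketched the same way. The function name `observation_row_id` and the choice of a dict for dimension values and a list for measure names are assumptions made for illustration; sorting uses Python's default string ordering.

```python
from urllib.parse import quote, urldefrag


def observation_row_id(slice_id, dimension_values, measure_names):
    """Build an @id for one observation CSV row (hypothetical helper).

    dimension_values maps dimension name -> codeValue;
    measure_names lists the measures present in the row.
    """
    base, fragment = urldefrag(slice_id)
    # Dimensions first, sorted by name, each rendered as name=codeValue.
    parts = [
        f"{quote(name, safe='')}={quote(value, safe='')}"
        for name, value in sorted(dimension_values.items())
    ]
    # Then the measure names, sorted.
    parts += [quote(name, safe="") for name in sorted(measure_names)]
    prefix = f"{fragment}/" if fragment else ""
    return f"{base}#{prefix}{'/'.join(parts)}"


print(observation_row_id(
    "#europe_unemployment_slice",
    {"gender": "m", "country": "uk", "month": "2010-10"},
    ["unemployment_rate", "unemployment"],
))
# #europe_unemployment_slice/country=uk/gender=m/month=2010-10/unemployment/unemployment_rate
```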

Fix rows with multiple currency codes

Per @coto,

There are two Chilean currencies. One is CLF, also known as UF, mostly used for real estate; it is a currency with inflation adjustment built in. The other is CLP, the Chilean Peso, the currency used daily by Chilean people.

The file samples/google/canonical/currencies.csv conjoins these into one row with ID "CLP CLF" and description "Chilean Peso Unidades de fomento". Instead, there should be separate rows for each, with CLP described as "Chilean Peso" with symbol "$" and CLF described as "Chilean Unidades de fomento" with symbol "CF".

As is, the combined row would fail to match either currency, and the two codes are never used together for a single value.

The problem is more widespread: 24 entries exhibit it, e.g., the currencies for Bolivia, Colombia, Cuba, Haiti (where the second currency is USD), Bhutan, Mexico, etc.
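One way to locate the affected rows is to flag every row whose code cell holds more than one whitespace-separated code. The sketch below is only illustrative: the function name, the `currency` and `description` column headers, and the sample rows' layout are assumptions, not taken from the actual file.

```python
import csv
import io


def rows_with_multiple_codes(csv_text, id_column):
    """Return the code cells that contain more than one whitespace-separated code."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[id_column] for row in reader if len(row[id_column].split()) > 1]


# Illustrative sample only; the column names are hypothetical.
sample = (
    "currency,description\n"
    "CLP CLF,Chilean Peso Unidades de fomento\n"
    "USD,US Dollar\n"
)
print(rows_with_multiple_codes(sample, "currency"))  # ['CLP CLF']
```

Each flagged row would still need to be split by hand, since the description and symbol for each code cannot be derived mechanically.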

Fixes #41

Merge PR fixing conceptual error

I opened PR #40 to fix a conceptual error in the currencies listed for my country. Could someone please merge or close it, since I won't sign the CLA to pass the check?
BR

Slice mappings from CSV columns to dimensions and measures

Right now we don't have an explicit mechanism for deciding which CSV column names correspond to each dimension or measure in a slice.
We had discussed using the fragment part of the dimension/measure's @id for this, when present.
I propose that, if the fragment is not present or is not suitable, we permit the CSV column name to be specified as a string value for the identifier property on the CategoricalDimension, TimeDimension, and StatisticalMeasure. For example (omitting the dataset for brevity):

    {
      "@type": "StatisticalMeasure",
      "@id": "#employment",
      "name": "Employed",
      "identifier": "employed",
      "description": "The total number of people employed",
      "url": "http://www.bls.gov/cps/cps_htgm.htm",
      "unitCode": "IE"
    }

with the fragment case looking like

    {
      "@type": "TimeDimension",
      "@id": "#month",
      "name": "Month",
      "equivalentType": "xsd:gYearMonth",
      "dateFormat": "MMM yyyy"
    }
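The lookup proposed here could be sketched as below. The function name `csv_column_name` is hypothetical, and the precedence (a string identifier wins over the @id fragment) is an assumption read from the first example, where the column is employed although the @id is #employment.

```python
from urllib.parse import urldefrag


def csv_column_name(node):
    """Pick the CSV column for a dimension or measure node (hypothetical helper)."""
    identifier = node.get("identifier")
    if isinstance(identifier, str) and identifier:
        # An explicit string identifier names the column directly.
        return identifier
    # Otherwise fall back to the fragment of the node's @id, if any.
    _, fragment = urldefrag(node.get("@id", ""))
    return fragment or None


measure = {"@type": "StatisticalMeasure", "@id": "#employment", "identifier": "employed"}
month = {"@type": "TimeDimension", "@id": "#month", "name": "Month"}
print(csv_column_name(measure))  # employed
print(csv_column_name(month))    # month
```

Returning None when neither source is available leaves it to the caller to decide whether that is an error.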

Fix Chilean Currencies

There are two Chilean currencies according to ISO 4217:

CLF: also known as UF, mostly used for real estate; it is a currency used in Chile with inflation adjustment built in.
CLP: the Chilean Peso, the currency used daily by Chilean people.

It is confusing to use CLP and CLF together; in the market, a value is in either CLP or CLF (UF).

Fix samples/google/canonical/currencies.csv, following what was attempted in #40.
