dspl's Introduction

Dataset Publishing Language

Introduction

DSPL stands for Dataset Publishing Language. It is a representation format for both the metadata (information about the dataset, such as its name and provider, as well as the concepts it contains and displays) and actual data (the numbers) of datasets. Datasets described in this format can be imported into the Google Public Data Explorer, a tool that allows for rich, visual exploration of the data.

This site hosts miscellaneous open-source content (schemas, example files, and utilities) associated with the DSPL standard. See our documentation site for more details on what DSPL is and how to use it. The utilities in this repository are documented at this site.

Build and install

To build the tools, install lxml, then use the setup.py script in tools/dspltools/. You can use pip to install these:

    pip install -r tools/dspltools/requirements.txt
    pip install tools/dspltools

DSPL 2

The draft of the DSPL 2 specification, which replaces the existing XML metadata format with schema.org markup, can be found at the DSPL GitHub page. The source for the specification is at docs/dspl2-spec.md.

Some initial library and tool support is available in tools/dspl2.

Build and install

To build the tools, install the prerequisites, then use the setup.py script in tools/dspl2/. You can use pip to install these:

    pip install -r tools/dspl2/requirements.txt
    pip install tools/dspl2

dspl's Issues

nested/hierarchical dimensions

Many dimensions, for example state/county/city or sector/industry group/industry/subindustry (e.g., GICS), have parent/child or broader/narrower relationships between dimension values at each level. Other schemes like SKOS or DSPL 1.0 can model this, but DSPL 2.0 does not yet include such a facility.

In this issue, discuss what such an extension could look like.

nested/hierarchical measures

DSPL 1.0 and other schemes permit grouping of measures into categories. For example, the SDG indicators are grouped under their corresponding goals and targets, and the World Development Indicator measures are grouped under a nested hierarchy of topics.

In this issue, discuss how to extend DSPL 2.0 to support this functionality.

id generation for data/metadata in CSV files

We may want to indicate how to generate IDs for the triples corresponding to the rows in CSV files.

This will give us a well-defined mapping from DSPL 2 datasets to triples, and may make it feasible to use dimension values and footnotes defined in CSV files across datasets.

Tentative proposal

Attempt to generate easy-to-keep-unique IDs, and make no provisions for ID collisions.

codeList

For each CSV row,

  • Start with the containing dimension's ID.
  • If there is no fragment, set the fragment to the dimension's name, URL encoded.
  • Append an = and the URL-encoded codeValue to the fragment.

For example, if a row's codeValue is us and its containing Dimension has @id of #country, the row's triples should be generated as if from equivalent JSON-LD with "@id": "#country=us".
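
The codeList steps above can be sketched in Python as follows. This is only an illustration of the tentative proposal, not code from the repository; the function name `code_list_row_id` and its parameters are assumptions, and "URL encoded" is taken to mean percent-encoding with no characters reserved as safe.

```python
from urllib.parse import quote


def code_list_row_id(dimension_id: str, dimension_name: str, code_value: str) -> str:
    """Build an ID for one codeList CSV row (hypothetical helper)."""
    # Split the dimension's ID into the part before '#' and the fragment.
    base, _, fragment = dimension_id.partition("#")
    # If there is no fragment, fall back to the URL-encoded dimension name.
    if not fragment:
        fragment = quote(dimension_name, safe="")
    # Append '=' and the URL-encoded codeValue to the fragment.
    return f"{base}#{fragment}={quote(code_value, safe='')}"


print(code_list_row_id("#country", "country", "us"))  # prints "#country=us"
```

As in the proposal, the sketch makes no provision for ID collisions.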


footnote

For each CSV row,

  • Start with the containing StatisticalDataset's @id.
  • If there is a fragment, append a / to it.
  • Append footnote= and the URL-encoded codeValue to the fragment.

For example, if the dataset's @id is the empty string, a footnote with codeValue of p would yield an ID of #footnote=p. Similarly, if the dataset @id is #my_dataset, the footnote would have @id of #my_dataset/footnote=p.
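
A minimal Python sketch of the footnote rule, under the same assumptions as the proposal itself. The helper name `footnote_row_id` is hypothetical, and percent-encoding with no safe characters is assumed for "URL-encoded".

```python
from urllib.parse import quote


def footnote_row_id(dataset_id: str, code_value: str) -> str:
    """Build an ID for one footnote CSV row (hypothetical helper)."""
    # Start with the containing StatisticalDataset's @id.
    base, _, fragment = dataset_id.partition("#")
    # If there is already a fragment, separate the new part with '/'.
    if fragment:
        fragment += "/"
    # Append 'footnote=' and the URL-encoded codeValue.
    return f"{base}#{fragment}footnote={quote(code_value, safe='')}"


print(footnote_row_id("", "p"))             # prints "#footnote=p"
print(footnote_row_id("#my_dataset", "p"))  # prints "#my_dataset/footnote=p"
```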


observation

For each CSV row,

  • Start with the slice's @id.
  • If there is a fragment, append a / to it.
  • Sort the dimension values by dimension name.
  • For each dimension value, append the URL-encoded name, = and the URL-encoded codeValue to the fragment, separating the entries with /.
  • Sort the measure values by measure name.
  • For each measure value, append the URL-encoded name to the fragment, separating entries with /.

For example, an observation in a slice with an @id of #europe_unemployment_slice with dimensions

  • gender of m,
  • country of uk, and
  • month of 2010-10

and measures

  • unemployment_rate and
  • unemployment

would have an @id of #europe_unemployment_slice/country=uk/gender=m/month=2010-10/unemployment/unemployment_rate
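
The observation rule could be sketched like this. The function name `observation_row_id` and the choice to pass dimension values as a dict and measures as a list of names are assumptions made for illustration; the proposal does not prescribe a data structure.

```python
from urllib.parse import quote


def observation_row_id(slice_id: str, dimension_values: dict, measure_names: list) -> str:
    """Build an ID for one observation CSV row (hypothetical helper)."""
    # Start with the slice's @id and split off any existing fragment.
    base, _, fragment = slice_id.partition("#")
    # Dimension entries, sorted by dimension name: name=codeValue.
    parts = [
        f"{quote(name, safe='')}={quote(value, safe='')}"
        for name, value in sorted(dimension_values.items())
    ]
    # Measure entries, sorted by measure name: name only.
    parts += [quote(name, safe="") for name in sorted(measure_names)]
    # If the slice ID already has a fragment, separate it with '/'.
    if fragment:
        fragment += "/"
    return f"{base}#{fragment}{'/'.join(parts)}"


print(observation_row_id(
    "#europe_unemployment_slice",
    {"gender": "m", "country": "uk", "month": "2010-10"},
    ["unemployment_rate", "unemployment"],
))
# prints "#europe_unemployment_slice/country=uk/gender=m/month=2010-10/unemployment/unemployment_rate"
```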

Fix rows with multiple currency codes

Per @coto,

There are two Chilean currencies. One is CLF, also known as UF, mostly used for real estate; it is a currency adjusted for inflation. The other is CLP, the Chilean Peso, the currency Chileans use daily.

The file samples/google/canonical/currencies.csv conjoins these into one row with ID "CLP CLF" and description "Chilean Peso Unidades de fomento". Instead, there should be a separate row for each, with CLP described as "Chilean Peso" with symbol "$" and CLF described as "Chilean Unidades de fomento" with symbol "CF".

As is, the combined row would fail to match either currency, and the two codes are never used together for a single value.

The problem is actually more widespread: 24 entries exhibit it, e.g., the currencies for Bolivia, Colombia, Cuba, Haiti (where the second currency is USD), Bhutan, and Mexico.
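
One way to locate such rows is to look for IDs that contain more than one whitespace-separated code. The sketch below works on a plain list of ID strings because the column layout of currencies.csv is not described here; the helper name `find_conjoined_ids` is hypothetical, and "USD" is included only as an example of a well-formed single-code ID.

```python
def find_conjoined_ids(ids):
    """Return {original_id: [codes]} for IDs that hold more than one code."""
    conjoined = {}
    for raw_id in ids:
        # Split on whitespace; a well-formed ID yields exactly one code.
        codes = raw_id.split()
        if len(codes) > 1:
            conjoined[raw_id] = codes
    return conjoined


print(find_conjoined_ids(["CLP CLF", "USD"]))  # prints {'CLP CLF': ['CLP', 'CLF']}
```

Each reported ID still needs a manual split, since the description and symbol of each currency cannot be derived mechanically from the combined row.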

Fixes #41

Merge PR fixing conceptual error

I opened PR #40 to fix a conceptual error about my country's currencies. Could someone please either close it or merge it? I won't sign the CLA to pass the check.
BR

Slice mappings from CSV columns to dimensions and measures

Right now we don't have an explicit mechanism for deciding which CSV column names correspond to each dimension or measure in a slice.
We had discussed using the fragment part of the dimension/measure's @id for this, when present.
I propose that, if it is not present or is not suitable, we permit the CSV column name to be specified as a string value for the identifier property on the CategoricalDimension, TimeDimension and StatisticalMeasure. For example (omitting the dataset for terseness):

    {
      "@type": "StatisticalMeasure",
      "@id": "#employment",
      "name": "Employed",
      "identifier": "employed",
      "description": "The total number of people employed",
      "url": "http://www.bls.gov/cps/cps_htgm.htm",
      "unitCode": "IE"
    }

with the fragment case looking like

    {
      "@type": "TimeDimension",
      "@id": "#month",
      "name": "Month",
      "equivalentType": "xsd:gYearMonth",
      "dateFormat": "MMM yyyy"
    }
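
The lookup this proposal implies can be sketched as below. It reflects one reading of the proposal, in which a string identifier takes precedence and the fragment of @id is the fallback. The function name `csv_column_name` is hypothetical, and the nodes are trimmed copies of the two examples above.

```python
def csv_column_name(node: dict):
    """Pick the CSV column name for a dimension or measure (hypothetical helper)."""
    # Prefer an explicit string identifier when one is given.
    identifier = node.get("identifier")
    if isinstance(identifier, str) and identifier:
        return identifier
    # Otherwise fall back to the fragment part of @id, if any.
    _, _, fragment = node.get("@id", "").partition("#")
    return fragment or None


measure = {"@type": "StatisticalMeasure", "@id": "#employment", "identifier": "employed"}
dimension = {"@type": "TimeDimension", "@id": "#month", "name": "Month"}

print(csv_column_name(measure))    # prints "employed"
print(csv_column_name(dimension))  # prints "month"
```

Returning None when neither source is available leaves the caller to decide whether a missing column mapping is an error.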

Fix Chilean Currencies

There are two Chilean currencies according to ISO 4217:

  1. CLF: also known as UF, mostly used for real estate. It is a currency used in Chile that is adjusted for inflation.
  2. CLP: the Chilean Peso, the currency used daily by Chilean people.

Using CLP and CLF together is confusing; in the market, a price is in either CLP or CLF (UF).

Fix samples/google/canonical/currencies.csv, as attempted in #40.
