commondataio / dataportals-registry

Registry of data portals, catalogs, and data repositories, including a description standard for data catalogs and datasets

Home Page: https://registry.commondata.io

License: MIT License

Python 97.78% Shell 2.22%
data-catalog data-portal data-repository datasets registry data-discovery opendata dataset open-data

dataportals-registry's Issues

Add 'transferable' flags for locations and topics

There are two types of data catalogs.

The first type is a classical data catalog in which topics/keywords and locations are linked to each individual dataset. Examples are global data catalogs like the Humanitarian Data Exchange. Datasets in such a catalog could be linked to different topics and different countries; there is no single country or other location that applies to all datasets.

The second type is a data catalog that has topics or a location linked to the catalog itself, and these topics and locations are attributes of each dataset too. For example, most regional and local government data catalogs have no datasets for locations other than their own region or local territory. There are also scientific repositories dedicated only to agriculture or sea exploration, where we don't need to identify the topic of each dataset; we could copy the topics from the data catalog's registry entry.

This approach requires manual or semi-automated assignment of topics to data catalogs, plus flags indicating that topic and location values are "transferable".

The attributes could be the following:

  • transfer_topics: True
  • transfer_geotopics: True
  • transfer_location: True

This way we could move toward replacing the "Country" search facet, which currently reflects the country covered by the source, with the country of each individual dataset.
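
A minimal Python sketch of how such flags might be applied when indexing datasets. The function name inherit_from_catalog and the field names topics, geotopics, and location are assumptions for illustration; only the three transfer_* flags come from the proposal above.

```python
from typing import Any

# Flag names from the proposal, mapped to the catalog field each one controls.
# The field names on the right are assumed, not taken from the registry schema.
TRANSFER_FLAGS = {
    "transfer_topics": "topics",
    "transfer_geotopics": "geotopics",
    "transfer_location": "location",
}


def inherit_from_catalog(catalog: dict[str, Any], dataset: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the dataset with catalog-level values filled in where allowed."""
    enriched = dict(dataset)
    for flag, field in TRANSFER_FLAGS.items():
        # Copy only when the flag is set, the catalog has a value,
        # and the dataset has no value of its own.
        if catalog.get(flag) and field in catalog and not enriched.get(field):
            enriched[field] = catalog[field]
    return enriched
```

In this sketch a dataset's own values always win, so catalogs of the first type simply leave the flags unset.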

Reorganize schema to support sub-regional codes and to differentiate owner location and coverage

Current situation

Country name and country id are part of the country attribute. It mostly relates to the owner of the data catalog, but sometimes it relates to the coverage of the data catalog. For example, the owner of the STAC server at https://gws-access.jasmin.ac.uk/public/odanceo/S2_L2/collection.json is located in the UK, but the server covers the territory of Ghana.

Another issue is that some countries are big, and search engine users could be interested in local datasets. So a sub-regional/regional level needs to be added to the data catalog description.

Solution

  1. Create an owner attribute with name, link, and type sub-attributes, and add a location attribute to it
  2. The location attribute could include: country_id, country_name, subregion_id, subregion_name
  3. Create a coverage attribute as an array of location attributes
  4. Consider an optional address attribute under the location attribute
  5. Automatically convert the current countries attribute to the coverage and owner attributes. If countries includes more than one record, use the first for the owner location and all of them for coverage

Example

The result should look like this:

access_mode:
- public
api: true
api_status: uncertain
catalog_type: Geoportal
content_types:
- dataset
- map_layer
description: Sentinel 2 Surface Reflectance over Ghana using SIAC AC.
endpoints:
- type: stacserverapi
  url: https://gws-access.jasmin.ac.uk/public/odanceo/S2_L2/collection.json
export_standard: STAC specification
id: s2-for-ghana
langs:
- EN
link: https://gws-access.jasmin.ac.uk/public/odanceo/S2_L2/collection.json
name: S2 for Ghana
owner:
  name: "Jasmin. The UK's data analysis facility for environmental science"
  link: "https://jasmin.ac.uk"
  type: Academy
  location:
    level: 2
    country:
      id: GB
      name: United Kingdom
    subregion:
      id: GB-ENG
      name: England
coverage:
- location:
    country:
      id: GH
      name: Ghana
    level: 1
software: Stac-server
status: active
tags:
- geospatial
- has_api
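
Step 5 of the solution (automatic conversion of the current countries attribute) could be sketched in Python roughly as follows. This assumes that the legacy countries value is a list of records with id and name keys and that level 1 means country level, as in the example above; the function name split_countries is hypothetical.

```python
import copy
from typing import Any


def split_countries(entry: dict[str, Any]) -> dict[str, Any]:
    """Convert a legacy `countries` list into `owner.location` and `coverage`."""
    converted = {k: v for k, v in entry.items() if k != "countries"}
    countries = entry.get("countries") or []
    # Every listed country becomes a country-level coverage record.
    locations = [
        {"location": {"country": {"id": c["id"], "name": c["name"]}, "level": 1}}
        for c in countries
    ]
    if locations:
        owner = dict(converted.get("owner") or {})
        # The first country is assumed to be the owner's location.
        owner["location"] = copy.deepcopy(locations[0]["location"])
        converted["owner"] = owner
    converted["coverage"] = locations
    return converted
```

Entries like the Ghana example would still need a manual fix afterwards, because the first country is only a guess at the owner location.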

Add status attribute to catalog entries

There is no status attribute that could help identify whether a data catalog is alive. It should be similar to the status attribute of Fairsharing records.

Fairsharing uses the following four status values:

  • Ready
  • In development
  • Deprecated
  • Uncertain

Since Fairsharing includes not only databases but also standards, policies and other content, it makes sense for it to have an "In development" status.

Data catalogs are likely to need only three status values: active, deprecated, uncertain.

Figure out relationship to dataportals.org

The dataportals.org project (Github repo) has been around for a while and has similar goals of gathering a curated list of open data portals around the world. I have maintained it for some time now, but recently I haven't been able to dedicate much time to improve it. It constantly receives requests for adding new portals. @ivbeg, if you are able and would like to, you could take on managing the project. Or, if you prefer, just specific things like reviewing submissions, improving upon the data schema, the portal frontend, etc.

If nothing else, it could be interesting to at least

  • use it as a source of open data portals for this project, and / or
  • write a script in the other direction to send the list of portals collected here to feed the dataportals.org database

What do you think?

Consider methodology of trust score for data catalogs

Consider adding a trust score or trust level to each data catalog.

Core ideas:

  • use data about trust seals from re3data
  • more trust in academic and government data catalogs
  • less trust in community data catalogs
  • less trust in aggregators
  • less trust in data catalogs without data licenses/rights
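
One possible way to turn these ideas into a number is an additive score. The sketch below is only an illustration: all weights, the owner type labels, and the field names trust_seals, is_aggregator, and license are assumptions, not part of the registry schema.

```python
from typing import Any

# Assumed owner type labels; adjust to the values actually used in the registry.
ACADEMY_OR_GOVERNMENT = {
    "Academy",
    "Central government",
    "Regional government",
    "Local government",
}


def trust_score(entry: dict[str, Any]) -> int:
    """Additive trust score; every weight here is an illustrative assumption."""
    score = 0
    owner_type = (entry.get("owner") or {}).get("type")
    if owner_type in ACADEMY_OR_GOVERNMENT:
        score += 2  # more trust in academic and government catalogs
    elif owner_type == "Community":
        score -= 1  # less trust in community catalogs
    if entry.get("trust_seals"):  # hypothetical field filled from re3data
        score += 2
    if entry.get("is_aggregator"):  # hypothetical field
        score -= 1
    if not entry.get("license"):  # hypothetical field for licenses/rights
        score -= 1
    return score
```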

Add subregion metadata for all Regional and Local government owner types

For each entry with the Regional government or Local government owner type, the location subregion should be set from the ISO 3166-2 vocabulary.

This is done for Australia, Germany, Belgium, Brazil, Canada, USA, Russia, Great Britain, and Argentina.

The top priorities are the biggest countries:

  • Indonesia
  • China
  • India
  • France
  • Italy
  • Japan
  • South Korea
  • Kazakhstan
  • Mexico
  • Netherlands
  • New Zealand
  • Poland
  • Portugal
  • Sweden
  • Thailand
  • South Africa
  • Estonia
  • Chile
  • Norway
  • Ecuador
  • Spain

But all entries should be processed.

Add terms property to data catalog entry

We need to identify the licenses used with selected data sources. Sometimes licenses are granular, at the dataset level. Sometimes there is just a single license per data catalog.

Solution

  1. Add a terms attribute with sub-attributes link, license_level, license_id, license_name, text.
  2. license_level could be "catalog" or "dataset"
  3. All fields are optional
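
As a rough sketch, the proposed terms attribute could be modelled in Python like this. The sub-attribute names and the two license_level values come from the list above; the class name Terms and the to_entry helper are assumptions for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional

LICENSE_LEVELS = {"catalog", "dataset"}


@dataclass
class Terms:
    # All fields are optional, as proposed.
    link: Optional[str] = None
    license_level: Optional[str] = None
    license_id: Optional[str] = None
    license_name: Optional[str] = None
    text: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject any license_level other than "catalog" or "dataset".
        if self.license_level is not None and self.license_level not in LICENSE_LEVELS:
            raise ValueError(f"license_level must be one of {sorted(LICENSE_LEVELS)}")

    def to_entry(self) -> dict:
        """Return only the fields that are set, ready to store in a catalog entry."""
        return {k: v for k, v in asdict(self).items() if v is not None}
```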

Add ISO 3166-2 subdivisions code

Add ISO 3166-2 subdivision codes. Informally, they are set only for US states, as subfolders in the entities directory. For now, an ISO 3166-2 code could be assigned manually as an attribute named iso31662.

Fix all blank values of owner.name and owner.link

There are a few hundred blank values of owner.name and owner.link.
owner.name is displayed in search results, so it's important not to have blank values.

owner.link is not displayed by default, so blank values there are less critical, but they still matter.

Fix Dataverse endpoints

Not all Dataverse instances provide an open API to work with the data repository, and some instances require authentication and authorization to use their API. This is not detected right now. We need to identify whether the Dataverse API is available; if it is not, the endpoint shouldn't be listed in the registry's data catalog entry. Also, if authentication is required and we can detect it, we probably need to update the data catalog entry schema to add an authentication-required flag.
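
A minimal sketch of how a probe result could be mapped to registry flags. It assumes the instance is probed with a plain GET on one of its API URLs and that only the HTTP status code is inspected; the function name and the auth_required flag are hypothetical, and real instances may signal restricted access in other ways.

```python
def classify_dataverse_api(status_code: int) -> dict:
    """Map the HTTP status of an API probe to hypothetical registry flags."""
    if 200 <= status_code < 300:
        # The API answered without credentials.
        return {"api": True, "auth_required": False}
    if status_code in (401, 403):
        # The API exists but rejects anonymous requests.
        return {"api": True, "auth_required": True}
    # Anything else is treated as "no usable API".
    return {"api": False, "auth_required": False}
```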

Prepare data discovery guide

Write a guide about how to find data portal installations.

Write about:

  • how to find open data portals
  • how to use BuiltWith
  • how to find Geonetwork, GeoNode, GeoServer and ArcGIS servers
  • how to identify data portal software
  • how to find APIs

Add topics/themes

Add topics manually or automatically from dataset and catalog descriptions

Create registry website and API

Update website code, add API, and launch it at registry.commondata.io

  • website code updated
  • API added
  • website launched

Critical features:

  • full list of records
  • single catalog record webpage
  • persistent link to download full dataset
  • API to access single datasets

Desired future features, not critical:

  • country profiles
  • software types profiles
  • owner type profiles
  • search and filtering

Review opendataforafrica.org sites

There are websites for all African countries at opendataforafrica.org, but they contain no data from the countries themselves, only data from international sources. The platform is used by World Bank experts but sees little use otherwise.

We need to decide how to handle these sites.

Fix NADA endpoints

Not all NADA instances provide a REST API, but this is not verified right now. It should be detected and the registry's data catalog entries updated.

Create extended software profile

The current software description is very short: only the software name and export_formats.
For future observability and crawl tasks, it could be important to create a detailed software profile, with data added both manually and automatically.

Creating a detailed software profile includes the following tasks:

  • Add a dictionary and/or database of data catalog software types
  • Add fields: version, website, plugins, capabilities, etc.
  • Automatically update registry records

An important question to answer is how to update this data regularly. Should it be part of the observability tasks or of the registry too?

Make sure that at least one data catalog per country added to the registry

  1. Get the list of countries from the World Bank data portal
  2. Map the list of countries to the country codes in the geographic coverage of catalog records
  3. Get the list of missing countries
  4. Search for data catalogs (open data portals, geoportals, scientific repositories) for each missing country
  5. Update the data catalogs registry
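
Steps 2 and 3 amount to a set difference. A short Python sketch, assuming the World Bank country list has already been loaded as a mapping from country code to name and that entries carry a coverage list of location records with a nested country id (both are assumptions, as is the function name):

```python
from typing import Iterable


def missing_countries(all_countries: dict[str, str], entries: Iterable[dict]) -> dict[str, str]:
    """Return the countries (code -> name) not covered by any catalog entry."""
    # Collect every country code that appears in some entry's coverage.
    covered = {
        item["location"]["country"]["id"]
        for entry in entries
        for item in entry.get("coverage", [])
    }
    return {code: name for code, name in sorted(all_countries.items()) if code not in covered}
```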

Add data enrichment from re3data

Current situation
A lot of metadata about data catalogs is collected in the re3data scientific data catalog.

Interesting data from re3data:

  • keywords
  • content type
  • contact-email
  • re3data identifier
  • description
  • persistent identifiers systems
  • software
  • versioning
  • institutions
  • repository type

This data could enrich the existing catalog and be added to the entries.

Example re3data entry: https://www.re3data.org/repository/r3d100010078

Possible solutions
There are a few possible strategies:

  1. Extract the whole re3data catalog and automatically extend the existing schema and catalog entries with it, as is, under a re3data attribute. This implies high trust in re3data metadata quality.
  2. Manually merge re3data entries with existing catalog entries and extend the existing schema without a re3data attribute
  3. Automatically add only re3data identifiers and consider data enrichment later using the identifier-based re3data API.

Revalidate all endpoints related to ArcGIS Hub

Revalidate all ArcGIS Hub endpoints. Update the apidetect.py script to inspect the JSON contents.

The API responds with HTTP status 200, but a 404 statusCode is defined in the JSON response body. Add validation code for this case and fix the existing endpoints. Example response:
{"statusCode":404,"message":"Domain record(s) not found :: A domain record with hostname = 2020census-bucksgis.opendata.arcgis.com does not exist :: 404","error":"Not Found"}
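
The extra validation could look roughly like the Python sketch below. It is not the actual apidetect.py code; the function name effective_status is hypothetical, and the only assumption about the API is the statusCode key visible in the response above.

```python
import json


def effective_status(http_status: int, body: str) -> int:
    """Return the status to trust: an error statusCode in the JSON body overrides HTTP 200."""
    if http_status != 200:
        return http_status
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all; keep the HTTP status.
        return http_status
    if isinstance(payload, dict):
        code = payload.get("statusCode")
        if isinstance(code, int) and code >= 400:
            return code
    return http_status
```

An endpoint would then count as valid only when the returned value is 200.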
