commondataio / dataportals-registry

Registry of data portals, catalogs, and data repositories, including a dataset and catalog description standard

Home Page: https://registry.commondata.io

License: MIT License

Python 97.78% Shell 2.22%
data-catalog data-portal data-repository datasets registry data-discovery opendata dataset open-data

dataportals-registry's Introduction

dataportals-registry

Registry of data portals, catalogs, data repositories, etc.

This is a transitional repository to create a registry of all existing open data portals and repositories.

This is the first pillar of the open search engine project. Other pillars include:

  • registry of all catalogs (this one)
  • datasets raw metadata database
  • unified dataset search index and search engine
  • datasets backup and file cache

Please take a look at the project mindmap to see its goals and structure.

What kinds of data catalogs are collected?

This registry includes descriptions of the following data catalog types:

  • Open data portals
  • Geoportals
  • Scientific data repositories
  • Indicators catalogs
  • Microdata catalogs
  • Machine learning catalogs
  • Data search engines
  • API Catalogs
  • Data marketplaces
  • Other

Inspiration

This project is inspired by the Re3Data and FAIRsharing projects. The key difference is the focus on open data as a broad topic, not just open research data.

The final version of this repository will be reorganized as a database with a publicly available open API and bulk data dumps.

How this repository is organized

Warning: this is a temporary description and subject to change

Entities

Data catalog descriptions are YAML files in the data/entities folder. Files are organized into country/territory folders, and inside each country folder there are subfolders such as scientific, opendata, microdata, geo, search, marketplace, and other.

Example

Data.gov YAML file

access_mode:
- open
api: true
api_status: active
catalog_type: Open data portal
content_types:
- dataset
coverage:
- location:
    country:
      id: US
      name: United States
    level: 1
endpoints:
- type: ckanapi
  url: https://catalog.data.gov/api/3
export_standard: CKAN API
id: catalogdatagov
identifiers:
- id: wikidata
  url: https://www.wikidata.org/wiki/Q5227102
  value: Q5227102
- id: re3data
  url: https://www.re3data.org/repository/r3d100010078
  value: r3d100010078
- id: fairsharing
  url: https://fairsharing.org/FAIRsharing.6069e1
  value: FAIRsharing.6069e1
langs:
- EN
link: https://catalog.data.gov
name: NETL Energy Data eXchange
owner:
  location:
    country:
      id: US
      name: United States
    level: 1
  name: U.S. Department of Energy
  type: Central government
software: CKAN
status: active
tags:
- government
- has_api

Datasets and code

Datasets are kept in the data/datasets folder; right now it's a catalogs.jsonl file generated by the builder.py script in the scripts folder.

Run python builder.py build in the scripts folder to regenerate the catalogs.jsonl file from the YAML files.
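The actual scripts/builder.py may work differently, but the YAML-to-JSONL step it performs can be sketched roughly like this (assuming PyYAML and a .yaml file extension):

```python
# Minimal sketch of the build step: walk data/entities, parse each
# YAML entry, and write one JSON object per line to catalogs.jsonl.
# This is an illustration, not the repository's actual builder.py.
import json
import pathlib

import yaml  # PyYAML


def build(entities_dir: str, output_path: str) -> int:
    """Collect all YAML entries under entities_dir into a JSONL file."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(pathlib.Path(entities_dir).rglob("*.yaml")):
            record = yaml.safe_load(path.read_text(encoding="utf-8"))
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

The JSONL format keeps one catalog per line, so downstream tools can stream the file without loading the whole registry into memory.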

How to contribute?

If you find any mistake or have an additional data catalog to add, please open a pull request or file an issue.

Data sources

The following data sources are used:

License

Source code is licensed under the MIT license. Data is licensed under the CC-BY 4.0 license.

dataportals-registry's People

Contributors

ivbeg

dataportals-registry's Issues

Revalidate all endpoints related to ArcGIS Hub

Revalidate all ArcGIS Hub endpoints. Update the apidetect.py script and check the JSON contents of responses.

The API responds with HTTP status 200, but a 404 statusCode is defined in the JSON response body. Add additional validation code and fix the existing endpoints:
{"statusCode":404,"message":"Domain record(s) not found :: A domain record with hostname = 2020census-bucksgis.opendata.arcgis.com does not exist :: 404","error":"Not Found"}
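The extra validation step could look like the following sketch (the helper name is ours, not something from apidetect.py):

```python
# Sketch: an ArcGIS Hub endpoint can answer HTTP 200 while the JSON
# body carries an error statusCode, as in the example above. Treat
# such an endpoint as dead even though the HTTP layer says 200.
import json


def arcgis_endpoint_alive(http_status: int, body: str) -> bool:
    """Return False if the HTTP status or the embedded JSON
    statusCode indicates an error."""
    if http_status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    if isinstance(payload, dict) and payload.get("statusCode", 200) >= 400:
        return False
    return True
```

Running this over all registered ArcGIS Hub endpoints would surface entries like the 2020census-bucksgis example above that currently pass a naive HTTP-status check.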

Add topics/themes

Add them manually or automatically from dataset and catalog descriptions.

Add subregion metadata for all Regional and Local government owner types

For each Regional and Local government owner type, the location should be set with subregion data from the ISO 3166-2 vocabulary.

This is already done for Australia, Germany, Belgium, Brazil, Canada, the USA, Russia, Great Britain, and Argentina.

The top priorities are the biggest countries:

  • Indonesia
  • China
  • India
  • France
  • Italy
  • Japan
  • South Korea
  • Kazakhstan
  • Mexico
  • Netherlands
  • New Zealand
  • Poland
  • Portugal
  • Sweden
  • Thailand
  • South Africa
  • Estonia
  • Chile
  • Norway
  • Ecuador
  • Spain

But all entries should be processed.

Make sure that at least one data catalog per country is added to the registry

  1. Get the list of countries from the World Bank data portal
  2. Map the country list to the country codes in the geographic coverage of catalog records
  3. Get the list of missing countries
  4. Search for data catalogs (open data, geoportals, scientific repositories) for each missing country
  5. Update the data catalogs registry
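Steps 2 and 3 above amount to a set difference, which could be sketched like this (function and variable names are illustrative):

```python
# Sketch: given a reference list of country codes and the coverage
# recorded in catalog entries, compute which countries still have
# no catalog in the registry.


def missing_countries(all_country_ids, catalog_records):
    """Return the sorted country codes with no catalog covering them."""
    covered = set()
    for record in catalog_records:
        for cov in record.get("coverage", []):
            country = cov.get("location", {}).get("country", {})
            if "id" in country:
                covered.add(country["id"])
    return sorted(set(all_country_ids) - covered)
```

The result feeds step 4 directly: each returned code is a country to search data catalogs for.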

Add ISO 3166-2 subdivision codes

Add ISO 3166-2 subdivision codes. Informally, they are set only for US states, as subfolders in the entities directory. For now, the ISO 3166-2 code could be assigned manually as an attribute with the name iso31662.

Fix all blank values of owner.name and owner.link

There are a few hundreds of blank values owner.name and owner.link.
Owner.name displayed with the search entries results, it's important to not to have blank values.

Owner.link is not displayed by default, so it's less critical to have them without empty values, But it's still important too
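A check for such blanks could be sketched as follows (the helper is illustrative, not part of the repository's scripts):

```python
# Sketch: flag registry entries whose owner.name or owner.link is
# missing or empty, so they can be fixed by hand.


def blank_owner_fields(record: dict) -> list:
    """Return which of owner.name / owner.link are blank in a record."""
    owner = record.get("owner") or {}
    blanks = []
    for field in ("name", "link"):
        value = owner.get(field)
        if value is None or str(value).strip() == "":
            blanks.append(f"owner.{field}")
    return blanks
```

Running this over every YAML entry would produce the worklist of a few hundred records mentioned above.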

Consider methodology of trust score for data catalogs

Consider adding a trust score or trust level to each data catalog.

Core ideas:

  • use data about trust seals from re3data
  • more trust for academic and government data catalogs
  • less trust for community data catalogs
  • less trust for aggregators
  • less trust for data catalogs without data licenses/rights
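One way the core ideas above could be turned into a number (the weights, baseline, and attribute values here are invented for illustration; no scoring scheme exists in the registry yet):

```python
# Hypothetical heuristic trust score built from existing catalog
# attributes. Every weight below is an assumption, not a decision.


def trust_score(record: dict) -> int:
    """Return a 0-100 trust score for a catalog entry (sketch)."""
    score = 50  # neutral baseline (assumed)
    owner_type = (record.get("owner") or {}).get("type", "")
    if owner_type in ("Academy", "Central government"):
        score += 20  # more trust for academic/government catalogs
    if owner_type == "Community":
        score -= 10  # less trust for community catalogs
    if record.get("catalog_type") == "Aggregator":
        score -= 10  # less trust for aggregators
    if not record.get("terms"):
        score -= 10  # no declared data licenses/rights
    ids = {i.get("id") for i in record.get("identifiers", [])}
    if "re3data" in ids:
        score += 10  # re3data presence as a proxy for trust seals
    return max(0, min(100, score))
```

Whatever the final methodology, keeping it a pure function of the entry's attributes would let the score be recomputed on every registry rebuild.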

Reorganize schema to support sub-regional codes and to differentiate owner location and coverage

Current situation

Country name and country id are part of the country attribute; this mostly relates to the owner of the data catalog, but sometimes relates to the catalog's coverage. For example, the owner of the https://gws-access.jasmin.ac.uk/public/odanceo/S2_L2/collection.json STAC server is located in the UK, but it covers Ghana's territory.

Another issue is that some countries are big, and search engine users could be interested in local datasets, so a sub-regional/regional level needs to be added to the data catalog description.

Solution

  1. Create an owner attribute with name, link, and type sub-attributes, and add a location attribute
  2. The location attribute could include: country_id, country_name, subregion_id, subregion_name
  3. Create a coverage attribute as an array of location attributes
  4. Consider an optional address attribute under the location attribute
  5. Automatically convert the current countries attribute into coverage and owner attributes. If countries includes more than one record, use the first for the owner location and all of them for coverage
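Step 5 could be sketched as a small migration function (the legacy countries field's shape is assumed from the text):

```python
# Sketch of the automatic conversion: move a legacy `countries`
# list into the new `owner.location` and `coverage` attributes.
# First country -> owner location; all countries -> coverage.


def migrate_countries(record: dict) -> dict:
    """Rewrite a legacy record in place and return it (sketch)."""
    countries = record.pop("countries", [])
    if not countries:
        return record
    record["coverage"] = [
        {"location": {"country": c, "level": 1}} for c in countries
    ]
    owner = record.setdefault("owner", {})
    owner.setdefault("location", {"country": countries[0], "level": 1})
    return record
```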

Example

The result should look like:

access_mode:
- public
api: true
api_status: uncertain
catalog_type: Geoportal
content_types:
- dataset
- map_layer
description: Sentinel 2 Surface Reflectance over Ghana using SIAC AC.
endpoints:
- type: stacserverapi
  url: https://gws-access.jasmin.ac.uk/public/odanceo/S2_L2/collection.json
export_standard: STAC specification
id: s2-for-ghana
langs:
- EN
link: https://gws-access.jasmin.ac.uk/public/odanceo/S2_L2/collection.json
name: S2 for Ghana
owner:
  name: "Jasmin. The UK's data analysis facility for environmental science"
  link: "https://jasmin.ac.uk"
  type: Academy
  location:
    level: 2
    country:
      id: UK
      name: United Kingdom
    subregion:
      id: GB-ENG
      name: England
coverage:
- location:
    country:
      id: GH
      name: Ghana
    level: 1
software: Stac-server
status: active
tags:
- geospatial
- has_api

Figure out relationship to dataportals.org

The dataportals.org project (Github repo) has been around for a while and has similar goals of gathering a curated list of open data portals around the world. I have maintained it for some time now, but recently I haven't been able to dedicate much time to improve it. It constantly receives requests for adding new portals. @ivbeg, if you are able and would like to, you could take on managing the project. Or, if you prefer, just specific things like reviewing submissions, improving upon the data schema, the portal frontend, etc.

If nothing else, it could be interesting to at least

  • use it as a source of open data portals for this project, and / or
  • write a script in the other direction to feed the list of portals collected here into the dataportals.org database

What do you think?

Create registry website and API

Update website code, add API, and launch it at registry.commondata.io

  • website code updated
  • API added
  • website launched

Critical features:

  • full list of records
  • single catalog record webpage
  • persistent link to download full dataset
  • API to access single datasets

Desired future features, not critical:

  • country profiles
  • software types profiles
  • owner type profiles
  • search and filtering

Add 'transferable' flags for locations and topics

There are two types of data catalogs.

The first type is a classical data catalog with topics/keywords and locations linked to each single dataset, for example global catalogs like the Humanitarian Data Exchange. Datasets in such a catalog can be linked to different topics and different countries; there is no single country or other location shared by all datasets.

The second type is a data catalog whose topics or location apply to every dataset it holds. For example, most regional and local government data catalogs do not have datasets for locations other than their own region or local territory. Likewise, some scientific repositories are dedicated only to agriculture or sea exploration; we don't need to identify the topic of each dataset, we can copy the topics from the data catalog's registry entry.

This approach requires manual or semi-automated assignment of topics to data catalogs, plus flags marking topic and location values as "transferable".

It could be the following attributes:

  • transfer_topics: True
  • transfer_geotopics: True
  • transfer_location: True

This way we could move the "Country" search facet from the country covered by the source to the country of each single dataset.
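Applied to a registry entry, the proposed flags might look like this (the entry id, topic, and coverage values are invented for illustration; only the transfer_* attribute names come from the list above):

```yaml
id: hypothetical-agriculture-repository
catalog_type: Scientific data repository
topics:
- agriculture
coverage:
- location:
    country:
      id: FR
      name: France
    level: 1
# Copy this catalog's topics and location to each of its datasets:
transfer_topics: true
transfer_location: true
```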

Fix NADA endpoints

Not all NADA instances provide a REST API, but this is not verified right now. It should be detected, and the registry's data catalog entries updated accordingly.

Add terms property to data catalog entry

We need to identify the licenses used by selected data sources. Sometimes licenses are granular at the dataset level; sometimes there is just one single license per data catalog.

Solution

  1. Add a terms attribute with sub-attributes link, license_level, license_id, license_name, and text.
  2. license_level could be "catalog" or "dataset"
  3. All fields are optional
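A possible shape for the new attribute, using the sub-attributes listed above (all values here are illustrative, including the example.org link):

```yaml
terms:
  license_level: catalog
  license_id: cc-by
  license_name: Creative Commons Attribution 4.0
  link: https://example.org/terms
  text: All datasets are provided under CC-BY 4.0 unless stated otherwise.
```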

Add status attribute to catalog entries

There is no status attribute that could help identify whether a data catalog is alive. It should be similar to the status attribute of FAIRsharing records.

Fairsharing uses following four status values:

  • Ready
  • In development
  • Deprecated
  • Uncertain

Since FAIRsharing includes not only databases but also standards, policies, and other content, it includes an "In development" status.

For data catalogs, only three status values are likely needed: active, deprecated, uncertain.

Fix Dataverse endpoints

Not all Dataverse instances provide an open API to work with the data repository, and some instances require authentication and authorization to use their API. This is not detected right now; we need to identify whether the Dataverse API is available, and if it's not, the instance shouldn't be a registry data catalog entry. Also, if authentication is required and we can detect it, we probably need to update the data catalog entry schema to set an authentication-required flag.
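The detection could build on probing an unauthenticated Dataverse endpoint such as /api/info/version; here we only sketch classifying what such a probe returns (the function name and the three-way classification are ours, and the endpoint choice should be verified against the Dataverse API docs):

```python
# Sketch: classify a Dataverse instance from a probe response.
# 'auth_required' maps to the proposed authentication-required
# flag; 'unavailable' means the entry should be dropped or marked.


def classify_dataverse_api(http_status: int, body: dict) -> str:
    """Return 'open', 'auth_required', or 'unavailable' (sketch)."""
    if http_status in (401, 403):
        return "auth_required"
    if http_status == 200 and body.get("status") == "OK":
        return "open"
    return "unavailable"
```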

Prepare data discovery guide

Write a guide about how to find data portal installations.

Write about:

  • how to find open data portals
  • how to use BuiltWith
  • how to find GeoNetwork, GeoNode, GeoServer and ArcGIS servers
  • how to identify data portal software
  • how to find API

Create extended software profile

The current software description is very short: only a software name and export_formats are recorded.
For future observability and crawl tasks it could be important to create a detailed software profile, with data added both manually and automatically.

Creating a detailed software profile includes the following tasks:

  • Add data catalogs software types dictionary and/or database
  • Add fields: version, website, plugins, capabilities, etc.
  • Automatically update registry records

An important question to answer: how should this data be updated regularly? Should it be part of the observability tasks or of the registry too?

Review opendataforafrica.org sites

There are websites for all African countries at opendataforafrica.org, but they contain no data from the countries themselves, only international data sources. They are used by World Bank experts, but otherwise see little real use.

We need to decide what to do about these sites.

Add data enrichment from re3data

Current situation
There is a lot of metadata about data catalogs collected in the Re3Data scientific data catalog.

Interesting data from re3data:

  • keywords
  • content type
  • contact-email
  • re3data identifier
  • description
  • persistent identifiers systems
  • software
  • versioning
  • institutions
  • repository type

This data could enrich the existing catalog and be added to the entries.

Example Re3Data entry - https://www.re3data.org/repository/r3d100010078

Possible solutions
There are a few possible strategies:

  1. Extract the whole re3data catalog and extend the existing schema and catalog entries automatically, as is, under a re3data attribute. This implies high trust in re3data metadata quality.
  2. Manually merge re3data entries with existing catalog entries and extend the existing schema without a re3data attribute
  3. Automatically add only re3data identifiers and consider data enrichment later using the re3data identifier-based API.
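Strategy 3 builds on identifiers that many entries already carry (see the Data.gov example above). A sketch of turning a stored re3data identifier into an API lookup URL (the endpoint path is assumed from re3data's public API and should be verified before use):

```python
# Sketch: pull the re3data identifier from a registry entry and
# build the URL for re3data's identifier-based repository API.
# Endpoint path is an assumption; the actual fetch is left out.

RE3DATA_API = "https://www.re3data.org/api/v1/repository/{}"


def re3data_api_url(record: dict):
    """Return the re3data API URL for a catalog entry, or None."""
    for ident in record.get("identifiers", []):
        if ident.get("id") == "re3data" and ident.get("value"):
            return RE3DATA_API.format(ident["value"])
    return None
```

Entries without a re3data identifier simply return None, so the enrichment pass can skip them and revisit once identifiers are backfilled.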
