commondataio / dataportals-registry

Registry of data portals, catalogs, and data repositories, including a dataset and catalog description standard

Home Page: https://registry.commondata.io

License: MIT License

Python 97.78% Shell 2.22%
data-catalog data-portal data-repository datasets registry data-discovery opendata dataset open-data

dataportals-registry's Introduction

dataportals-registry

Registry of data portals, catalogs, data repositories, etc.

This is a transitional repository to create a registry of all existing open data portals and repositories.

This is the first pillar of the open search engine project. Other pillars include:

  • registry of all catalogs (this one)
  • datasets raw metadata database
  • unified dataset search index and search engine
  • datasets backup and file cache

Please take a look at the project mindmap to see its goals and structure.

What kinds of data catalogs are collected?

This registry includes descriptions of the following data catalog types:

  • Open data portals
  • Geoportals
  • Scientific data repositories
  • Indicators catalogs
  • Microdata catalogs
  • Machine learning catalogs
  • Data search engines
  • API Catalogs
  • Data marketplaces
  • Other

Inspiration

This project is inspired by the Re3Data and FAIRsharing projects. The key difference is the focus on open data as a broad topic, not just open research data.

The final version of this repository will be reorganized as a database with a publicly available open API and bulk data dumps.

How this repository is organized

Warning: this is a temporary description and subject to change

Entities

Data catalog descriptions are YAML files in the data/entities folder. Files are organized into country/territory folders, and inside each country folder there are subfolders such as scientific, opendata, microdata, geo, search, marketplace, and other.

Example

Data.gov YAML file

access_mode:
- open
api: true
api_status: active
catalog_type: Open data portal
content_types:
- dataset
coverage:
- location:
    country:
      id: US
      name: United States
    level: 1
endpoints:
- type: ckanapi
  url: https://catalog.data.gov/api/3
export_standard: CKAN API
id: catalogdatagov
identifiers:
- id: wikidata
  url: https://www.wikidata.org/wiki/Q5227102
  value: Q5227102
- id: re3data
  url: https://www.re3data.org/repository/r3d100010078
  value: r3d100010078
- id: fairsharing
  url: https://fairsharing.org/FAIRsharing.6069e1
  value: FAIRsharing.6069e1
langs:
- EN
link: https://catalog.data.gov
name: NETL Energy Data eXchange
owner:
  location:
    country:
      id: US
      name: United States
    level: 1
  name: U.S. Department of Energy
  type: Central government
software: CKAN
status: active
tags:
- government
- has_api

Datasets and code

Datasets are kept in the data/datasets folder; right now it's a catalogs.jsonl file generated by the builder.py script in the scripts folder.

Run python builder.py build in the scripts folder to regenerate the catalogs.jsonl file from the YAML files.
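The actual scripts/builder.py may work differently, but the YAML-to-JSONL step it performs can be sketched roughly like this (assuming PyYAML and a .yaml file extension):

```python
# Minimal sketch of the build step: walk data/entities, parse each
# YAML entry, and write one JSON object per line to catalogs.jsonl.
# This is an illustration, not the repository's actual builder.py.
import json
import pathlib

import yaml  # PyYAML


def build(entities_dir: str, output_path: str) -> int:
    """Collect all YAML entries under entities_dir into a JSONL file."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(pathlib.Path(entities_dir).rglob("*.yaml")):
            record = yaml.safe_load(path.read_text(encoding="utf-8"))
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

The JSONL format keeps one catalog per line, so downstream tools can stream the file without loading the whole registry into memory.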

How to contribute?

If you find any mistake or have an additional data catalog to add, please open a pull request or file an issue.

Data sources

The following data sources are used:

License

Source code is licensed under the MIT license. Data is licensed under the CC-BY 4.0 license.

dataportals-registry's People

Contributors

ivbeg

dataportals-registry's Issues

Revalidate all endpoints related to ArcGIS Hub

Revalidate all ArcGIS Hub endpoints. Update the apidetect.py script and check the JSON contents of responses.

The API responds with HTTP status 200, but a 404 statusCode is defined in the JSON response body. Add additional validation code and fix the existing endpoints:
{"statusCode":404,"message":"Domain record(s) not found :: A domain record with hostname = 2020census-bucksgis.opendata.arcgis.com does not exist :: 404","error":"Not Found"}
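The extra validation step could look like the following sketch (the helper name is ours, not something from apidetect.py):

```python
# Sketch: an ArcGIS Hub endpoint can answer HTTP 200 while the JSON
# body carries an error statusCode, as in the example above. Treat
# such an endpoint as dead even though the HTTP layer says 200.
import json


def arcgis_endpoint_alive(http_status: int, body: str) -> bool:
    """Return False if the HTTP status or the embedded JSON
    statusCode indicates an error."""
    if http_status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    if isinstance(payload, dict) and payload.get("statusCode", 200) >= 400:
        return False
    return True
```

Running this over all registered ArcGIS Hub endpoints would surface entries like the 2020census-bucksgis example above that currently pass a naive HTTP-status check.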

Add topics/themes

Add them manually or automatically from dataset and catalog descriptions.

Add subregion metadata for all Regional and Local government owner types

For each Regional and Local government owner type, the location should be set with subregion data from the ISO 3166-2 vocabulary.

This is already done for Australia, Germany, Belgium, Brazil, Canada, the USA, Russia, Great Britain, and Argentina.

The top priorities are the biggest countries:

  • Indonesia
  • China
  • India
  • France
  • Italy
  • Japan
  • South Korea
  • Kazakhstan
  • Mexico
  • Netherlands
  • New Zealand
  • Poland
  • Portugal
  • Sweden
  • Thailand
  • South Africa
  • Estonia
  • Chile
  • Norway
  • Ecuador
  • Spain

But all entries should be processed.

Make sure that at least one data catalog per country is added to the registry

  1. Get the list of countries from the World Bank data portal
  2. Map the country list to the country codes in the geographic coverage of catalog records
  3. Get the list of missing countries
  4. Search for data catalogs (open data, geoportals, scientific repositories) for each missing country
  5. Update the data catalogs registry
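Steps 2 and 3 above amount to a set difference, which could be sketched like this (function and variable names are illustrative):

```python
# Sketch: given a reference list of country codes and the coverage
# recorded in catalog entries, compute which countries still have
# no catalog in the registry.


def missing_countries(all_country_ids, catalog_records):
    """Return the sorted country codes with no catalog covering them."""
    covered = set()
    for record in catalog_records:
        for cov in record.get("coverage", []):
            country = cov.get("location", {}).get("country", {})
            if "id" in country:
                covered.add(country["id"])
    return sorted(set(all_country_ids) - covered)
```

The result feeds step 4 directly: each returned code is a country to search data catalogs for.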

Add ISO 3166-2 subdivision codes

Add ISO 3166-2 subdivision codes. Informally, they are set only for US states, as subfolders in the entities directory. For now, the ISO 3166-2 code could be assigned manually as an attribute with the name iso31662.

Fix all blank values of owner.name and owner.link

There are a few hundreds of blank values owner.name and owner.link.
Owner.name displayed with the search entries results, it's important to not to have blank values.

Owner.link is not displayed by default, so it's less critical to have them without empty values, But it's still important too
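A check for such blanks could be sketched as follows (the helper is illustrative, not part of the repository's scripts):

```python
# Sketch: flag registry entries whose owner.name or owner.link is
# missing or empty, so they can be fixed by hand.


def blank_owner_fields(record: dict) -> list:
    """Return which of owner.name / owner.link are blank in a record."""
    owner = record.get("owner") or {}
    blanks = []
    for field in ("name", "link"):
        value = owner.get(field)
        if value is None or str(value).strip() == "":
            blanks.append(f"owner.{field}")
    return blanks
```

Running this over every YAML entry would produce the worklist of a few hundred records mentioned above.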

Consider methodology of trust score for data catalogs

Consider adding a trust score or trust level to each data catalog.

Core ideas:

  • use data about trust seals from re3data
  • more trust for academic and government data catalogs
  • less trust for community data catalogs
  • less trust for aggregators
  • less trust for data catalogs without data licenses/rights
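One way the core ideas above could be turned into a number (the weights, baseline, and attribute values here are invented for illustration; no scoring scheme exists in the registry yet):

```python
# Hypothetical heuristic trust score built from existing catalog
# attributes. Every weight below is an assumption, not a decision.


def trust_score(record: dict) -> int:
    """Return a 0-100 trust score for a catalog entry (sketch)."""
    score = 50  # neutral baseline (assumed)
    owner_type = (record.get("owner") or {}).get("type", "")
    if owner_type in ("Academy", "Central government"):
        score += 20  # more trust for academic/government catalogs
    if owner_type == "Community":
        score -= 10  # less trust for community catalogs
    if record.get("catalog_type") == "Aggregator":
        score -= 10  # less trust for aggregators
    if not record.get("terms"):
        score -= 10  # no declared data licenses/rights
    ids = {i.get("id") for i in record.get("identifiers", [])}
    if "re3data" in ids:
        score += 10  # re3data presence as a proxy for trust seals
    return max(0, min(100, score))
```

Whatever the final methodology, keeping it a pure function of the entry's attributes would let the score be recomputed on every registry rebuild.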

Reorganize schema to support sub-regional codes and to differentiate owner location and coverage

Current situation

Country name and country id are part of the country attribute; this mostly relates to the owner of the data catalog, but sometimes relates to the catalog's coverage. For example, the owner of the https://gws-access.jasmin.ac.uk/public/odanceo/S2_L2/collection.json STAC server is located in the UK, but it covers Ghana's territory.

Another issue is that some countries are big, and search engine users could be interested in local datasets, so a sub-regional/regional level needs to be added to the data catalog description.

Solution

  1. Create an owner attribute with name, link, and type sub-attributes, and add a location attribute
  2. The location attribute could include: country_id, country_name, subregion_id, subregion_name
  3. Create a coverage attribute as an array of location attributes
  4. Consider an optional address attribute under the location attribute
  5. Automatically convert the current countries attribute into coverage and owner attributes. If countries includes more than one record, use the first for the owner location and all of them for coverage
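Step 5 could be sketched as a small migration function (the legacy countries field's shape is assumed from the text):

```python
# Sketch of the automatic conversion: move a legacy `countries`
# list into the new `owner.location` and `coverage` attributes.
# First country -> owner location; all countries -> coverage.


def migrate_countries(record: dict) -> dict:
    """Rewrite a legacy record in place and return it (sketch)."""
    countries = record.pop("countries", [])
    if not countries:
        return record
    record["coverage"] = [
        {"location": {"country": c, "level": 1}} for c in countries
    ]
    owner = record.setdefault("owner", {})
    owner.setdefault("location", {"country": countries[0], "level": 1})
    return record
```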

Example

The result should look like:

access_mode:
- public
api: true
api_status: uncertain
catalog_type: Geoportal
content_types:
- dataset
- map_layer
description: Sentinel 2 Surface Reflectance over Ghana using SIAC AC.
endpoints:
- type: stacserverapi
  url: https://gws-access.jasmin.ac.uk/public/odanceo/S2_L2/collection.json
export_standard: STAC specification
id: s2-for-ghana
langs:
- EN
link: https://gws-access.jasmin.ac.uk/public/odanceo/S2_L2/collection.json
name: S2 for Ghana
owner:
  name: "Jasmin. The UK's data analysis facility for environmental science"
  link: "https://jasmin.ac.uk"
  type: Academy
  location:
    level: 2
    country:
      id: UK
      name: United Kingdom
    subregion:
      id: GB-ENG
      name: England
coverage:
- location:
    country:
      id: GH
      name: Ghana
    level: 1
software: Stac-server
status: active
tags:
- geospatial
- has_api

Figure out relationship to dataportals.org

The dataportals.org project (Github repo) has been around for a while and has similar goals of gathering a curated list of open data portals around the world. I have maintained it for some time now, but recently I haven't been able to dedicate much time to improve it. It constantly receives requests for adding new portals. @ivbeg, if you are able and would like to, you could take on managing the project. Or, if you prefer, just specific things like reviewing submissions, improving upon the data schema, the portal frontend, etc.

If nothing else, it could be interesting to at least

  • use it as a source of open data portals for this project, and / or
  • write a script in the other direction to feed the list of portals collected here into the dataportals.org database

What do you think?

Create registry website and API

Update website code, add API, and launch it at registry.commondata.io

  • website code updated
  • API added
  • website launched

Critical features:

  • full list of records
  • single catalog record webpage
  • persistent link to download full dataset
  • API to access single datasets

Desired future features, not critical:

  • country profiles
  • software types profiles
  • owner type profiles
  • search and filtering

Add 'transferable' flags for locations and topics

There are two types of data catalogs.

The first type is a classical data catalog with topics/keywords and locations linked to each single dataset, for example global catalogs like the Humanitarian Data Exchange. Datasets in such a catalog can be linked to different topics and different countries; there is no single country or other location shared by all datasets.

The second type is a data catalog whose topics or location apply to every dataset it holds. For example, most regional and local government data catalogs do not have datasets for locations other than their own region or local territory. Likewise, some scientific repositories are dedicated only to agriculture or sea exploration; we don't need to identify the topic of each dataset, we can copy the topics from the data catalog's registry entry.

This approach requires manual or semi-automated assignment of topics to data catalogs, plus flags marking topic and location values as "transferable".

It could be the following attributes:

  • transfer_topics: True
  • transfer_geotopics: True
  • transfer_location: True

This way we could move the "Country" search facet from the country covered by the source to the country of each single dataset.
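Applied to a registry entry, the proposed flags might look like this (the entry id, topic, and coverage values are invented for illustration; only the transfer_* attribute names come from the list above):

```yaml
id: hypothetical-agriculture-repository
catalog_type: Scientific data repository
topics:
- agriculture
coverage:
- location:
    country:
      id: FR
      name: France
    level: 1
# Copy this catalog's topics and location to each of its datasets:
transfer_topics: true
transfer_location: true
```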

Fix NADA endpoints

Not all NADA instances provide a REST API, but this is not verified right now. It should be detected, and the registry's data catalog entries updated accordingly.

Add terms property to data catalog entry

We need to identify the licenses used by selected data sources. Sometimes licenses are granular at the dataset level; sometimes there is just one single license per data catalog.

Solution

  1. Add a terms attribute with sub-attributes link, license_level, license_id, license_name, and text.
  2. license_level could be "catalog" or "dataset"
  3. All fields are optional
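A possible shape for the new attribute, using the sub-attributes listed above (all values here are illustrative, including the example.org link):

```yaml
terms:
  license_level: catalog
  license_id: cc-by
  license_name: Creative Commons Attribution 4.0
  link: https://example.org/terms
  text: All datasets are provided under CC-BY 4.0 unless stated otherwise.
```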

Add status attribute to catalog entries

There is no status attribute that could help identify whether a data catalog is alive. It should be similar to the status attribute of FAIRsharing records.

Fairsharing uses following four status values:

  • Ready
  • In development
  • Deprecated
  • Uncertain

Since FAIRsharing includes not only databases but also standards, policies, and other content, it includes an "In development" status.

For data catalogs, only three status values are likely needed: active, deprecated, uncertain.

Fix Dataverse endpoints

Not all Dataverse instances provide an open API to work with the data repository, and some instances require authentication and authorization to use their API. This is not detected right now; we need to identify whether the Dataverse API is available, and if it's not, the instance shouldn't be a registry data catalog entry. Also, if authentication is required and we can detect it, we probably need to update the data catalog entry schema to set an authentication-required flag.
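The detection could build on probing an unauthenticated Dataverse endpoint such as /api/info/version; here we only sketch classifying what such a probe returns (the function name and the three-way classification are ours, and the endpoint choice should be verified against the Dataverse API docs):

```python
# Sketch: classify a Dataverse instance from a probe response.
# 'auth_required' maps to the proposed authentication-required
# flag; 'unavailable' means the entry should be dropped or marked.


def classify_dataverse_api(http_status: int, body: dict) -> str:
    """Return 'open', 'auth_required', or 'unavailable' (sketch)."""
    if http_status in (401, 403):
        return "auth_required"
    if http_status == 200 and body.get("status") == "OK":
        return "open"
    return "unavailable"
```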

Prepare data discovery guide

Write a guide about how to find data portal installations.

Write about:

  • how to find open data portals
  • how to use BuiltWith
  • how to find GeoNetwork, GeoNode, GeoServer and ArcGIS servers
  • how to identify data portal software
  • how to find API

Create extended software profile

The current software description is very short: only a software name and export_formats are recorded.
For future observability and crawl tasks it could be important to create a detailed software profile, with data added both manually and automatically.

Creating a detailed software profile includes the following tasks:

  • Add data catalogs software types dictionary and/or database
  • Add fields: version, website, plugins, capabilities, etc.
  • Automatically update registry records

An important question to answer: how should this data be updated regularly? Should it be part of the observability tasks or of the registry too?

Review opendataforafrica.org sites

There are websites for all African countries at opendataforafrica.org, but they contain no data from the countries themselves, only international data sources. They are used by World Bank experts, but otherwise see little real use.

We need to decide what to do about these sites.

Add data enrichment from re3data

Current situation
There is a lot of metadata about data catalogs collected in the Re3Data scientific data catalog.

Interesting data from re3data:

  • keywords
  • content type
  • contact-email
  • re3data identifier
  • description
  • persistent identifiers systems
  • software
  • versioning
  • institutions
  • repository type

This data could enrich the existing catalog and be added to the entries.

Example Re3Data entry - https://www.re3data.org/repository/r3d100010078

Possible solutions
There are a few possible strategies:

  1. Extract the whole re3data catalog and extend the existing schema and catalog entries automatically, as is, under a re3data attribute. This implies high trust in re3data metadata quality.
  2. Manually merge re3data entries with existing catalog entries and extend the existing schema without a re3data attribute
  3. Automatically add only re3data identifiers and consider data enrichment later using the re3data identifier-based API.
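Strategy 3 builds on identifiers that many entries already carry (see the Data.gov example above). A sketch of turning a stored re3data identifier into an API lookup URL (the endpoint path is assumed from re3data's public API and should be verified before use):

```python
# Sketch: pull the re3data identifier from a registry entry and
# build the URL for re3data's identifier-based repository API.
# Endpoint path is an assumption; the actual fetch is left out.

RE3DATA_API = "https://www.re3data.org/api/v1/repository/{}"


def re3data_api_url(record: dict):
    """Return the re3data API URL for a catalog entry, or None."""
    for ident in record.get("identifiers", []):
        if ident.get("id") == "re3data" and ident.get("value"):
            return RE3DATA_API.format(ident["value"])
    return None
```

Entries without a re3data identifier simply return None, so the enrichment pass can skip them and revisit once identifiers are backfilled.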
