academic-observatory-workflows's People

Contributors

alexmassen-hane, aroelo, bechandcock, cameronneylon, jdddog, kathrynnapier, keegansmith21, metasj, rhosking, tuanchien

academic-observatory-workflows's Issues

Telescope workflow implementation: OpenAire

There are three ways of bulk accessing the OpenAIRE data:

  • OpenAIRE Research Graph Dumps
  • OAI-PMH
  • Bulk access to projects

OpenAIRE Research Graph Dumps
Dumps can be downloaded from Zenodo (https://zenodo.org/search?page=1&size=20&q=OpenAIRE%20Research%20Graph%20Dump) or explored through their beta portal.
There are two dumps available, one from 18-12-2019 and one from 03-11-2020; the latter also has an updated JSON schema.

Each publication on Zenodo contains several dumps/files; the 2019 set is slightly different from the 2020 set.
2019 files:

publication.gz: metadata records about research literature (includes types of publications listed here)
dataset.gz: metadata records about research data (includes the subtypes listed here)
software.gz: metadata records about research software (includes the subtypes listed here)
orp.gz: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)
organization.gz: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.
datasource.gz: metadata records about providers whose content is available in the OpenAIRE Research Graph. These include institutional and thematic repositories, journals, aggregators, and funders' databases.
project.gz: metadata records about projects funded by a given funder.
<funder>_result.gz: metadata records about research results (publications, datasets, software, and other research products) funded by a given funder.

2020 files:

publication_[part].tar: metadata records about research literature (includes types of publications listed here)
dataset.tar: metadata records about research data (includes the subtypes listed here)
software.tar: metadata records about research software (includes the subtypes listed here)
otherresearchproduct.tar: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)
organization.tar: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.
datasource.tar: metadata records about providers whose content is available in the OpenAIRE Research Graph. These include institutional and thematic repositories, journals, aggregators, and funders' databases.
project.tar: metadata records about projects funded by a given funder.
relation_[part].tar: metadata records about relations between entities in the graph
communities_infrastructures.tar: metadata records about research communities and research infrastructures
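As a sketch of how a telescope might enumerate these dump parts, the snippet below queries the Zenodo REST API for a record and filters the attached filenames for one entity type. The record URL and the response shape (`files`/`key`) are assumptions based on Zenodo's public API, not an existing workflow.

```python
import json
import urllib.request

# Assumed: record behind the DOI cited below resolves via the Zenodo REST API
ZENODO_RECORD_API = "https://zenodo.org/api/records/4238939"

def select_entity_files(filenames, entity):
    # Dump parts are named either "<entity>.tar" or "<entity>_<part>.tar";
    # note that lexicographic sorting misorders parts beyond 9.
    return sorted(
        name for name in filenames
        if name == f"{entity}.tar" or name.startswith(f"{entity}_")
    )

def list_dump_files(url=ZENODO_RECORD_API):
    """Fetch the record metadata and return all attached filenames."""
    with urllib.request.urlopen(url) as resp:
        record = json.load(resp)
    return [f["key"] for f in record["files"]]  # "key" is assumed to hold the filename
```

A download step would then fetch each selected file's `links.self` URL and stream it to storage.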

The image in https://doi.org/10.5281/zenodo.4238939 helps to understand the relationships between these files.

OAI-PMH
An OAI-PMH harvester is available as well, with one caveat:

Currently the OAI-PMH publisher is not supporting incremental harvesting.
Although the usage of the OAI parameters 'from' and 'until' is handled by the OAI publisher, the datestamps of metadata records are updated about every week.

I'm not sure what they mean by 'the datestamps of metadata records are updated about every week'.
Considering the data size it might be best to initially download the dumps instead of using the OAI-PMH harvester. Perhaps the harvester can be used to update the data regularly with newly added/edited records, but I'm skeptical since they mention above that incremental harvesting is not supported.
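For reference, a selective-harvesting request is just a `ListRecords` call with `from`/`until` parameters, continued via resumption tokens. The endpoint URL and the `metadataPrefix` value below are assumptions — take both from the provider's `Identify` and `ListMetadataFormats` responses before using this.

```python
import urllib.parse
import urllib.request

OAI_ENDPOINT = "https://api.openaire.eu/oai_pmh"  # assumed endpoint, verify first

def list_records_params(from_date=None, until_date=None, token=None):
    """Build ListRecords parameters; a resumptionToken replaces all other arguments."""
    if token:
        return {"verb": "ListRecords", "resumptionToken": token}
    params = {"verb": "ListRecords", "metadataPrefix": "oaf"}  # "oaf" is an assumption
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return params

def fetch_page(**kwargs):
    """Fetch one page of raw XML; parse it and look for a resumptionToken element."""
    url = OAI_ENDPOINT + "?" + urllib.parse.urlencode(list_records_params(**kwargs))
    with urllib.request.urlopen(url) as resp:
        return resp.read()
```

Whether `from`/`until` behave usefully here depends on the datestamp question above, since the publisher does not support incremental harvesting.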

Bulk access to projects
The APIs offer custom access to metadata about projects funded by a selection of international funders for the DSpace and EPrints platforms. The currently supported funding streams and their codes are:

FP7: The 7th Framework Programme funded by the European Commission
WT: Wellcome Trust funding programme
H2020: Horizon2020 Programme funded by the European Commission
FCT: The funding programme of Fundação para a Ciência e a Tecnologia, the national funding agency of Portugal
ARC: the funding programme of the Australian Research Council
NHMRC: the funding programme of the Australian National Health and Medical Research Council
SFI: Science Foundation Ireland
HRZZ: Croatian Science Foundation
MZOS: Ministry of Science, Education and Sports of the Republic of Croatia
MESTD: The Ministry of Education, Science and Technological Development of Serbia
NWO: The Netherlands Organisation for Scientific Research

I'm not sure if this is of interest to us. I think this project data is also included in the Zenodo files, and this is just a convenient alternative if you're interested in a specific project.

Questions:

  • How often is a new dump expected to be released on Zenodo? I can only find the two publications so far.
  • Will we combine all the different files in a single table or separate tables?
  • For the OAI-PMH, how is the datestamp determined and how often are new records added? The newest record seems to be from 2020-05-12.

Define a method for users to easily create new Groups of institutions for analysis and comparison

It would be really beneficial for end users (researchers, analysts, etc) to easily define a list of institutions (using their grid_id or other identifier system) that can be automatically utilised by our data aggregation workflows to produce 'group'-level aggregations the same way we create institution, country, funder and publisher level aggregations.

The current method is slow and requires a knowledge of SQL and BigQuery. Other methods must be investigated, and the chosen solution documented and easy to use.
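One candidate solution: let users drop a small JSON file of identifiers into a bucket or repo, which the aggregation workflow validates and picks up. Everything below (the file format, field names, and the loose GRID-ID pattern) is a hypothetical sketch, not an existing feature.

```python
import json
import re

# Loose pattern for GRID identifiers such as grid.1032.0; the exact suffix
# alphabet is an assumption -- tighten it against real GRID data.
GRID_ID_RE = re.compile(r"^grid\.\d+\.[0-9a-z]+$")

def load_group(text):
    """Parse and validate a user-supplied group definition (a JSON string)."""
    group = json.loads(text)
    bad = [g for g in group["grid_ids"] if not GRID_ID_RE.match(g)]
    if bad:
        raise ValueError(f"invalid GRID identifiers: {bad}")
    return group
```

Validating at load time means a typo in one identifier fails loudly instead of silently dropping an institution from the aggregation.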

Design Issue: Long Term Maintenance Plan and Security Posture

There are a number of understood and emerging tasks and processes that need defining, documenting and planning for when we consider the longer-term viability of this project. The following issues have already been identified:

  • How dependencies (largely software libraries) will be updated over time, by whom, when, and what is the priority and risk of doing (or not doing) this
  • Put in place an incident response plan for any security incidents
  • Review on-going data storage costs and data lifecycle automation and retention policies
  • Consider putting in place a roster system for responsibilities for fixing issues in production. This could equally apply to Telescopes, Workflows, Kibana or other APIs and services we have deployed

Data Aggregation Improvements

A list of useful improvements to the DOI/Entity Aggregation Pipeline. This list also replaces and organises a few issues that have been around the backlog for a while and need addressing. Closing The-Academic-Observatory/observatory-platform#272, The-Academic-Observatory/observatory-platform#146, The-Academic-Observatory/observatory-platform#129, The-Academic-Observatory/observatory-platform#110, The-Academic-Observatory/observatory-platform#70 as they are now covered here.

Cross cutting issues

  • Ensure consistent use of the various crossref dates. Issued date is more reliable across the various output types
  • Add OA types to the collaboration analysis
  • Potentially extend Collaboration analysis with information on discipline and funding
  • Review additional OA types and any definition issues. Gold-only, for example, is required
  • Remove filtered_list comments
  • Do we want to create monthly aggregations as well as yearly?
    • Create published_year_month in the dois table
    • Modify the aggregate_dois query to have a switchable option between grouping by year and grouping by month
    • Extend the Telescope to enable running in either year or month mode
    • Extend the DAG to ensure both run each week
  • Add citation counts to discipline aggregations
  • Simplify metrics and oa_citations fields
  • Include citation breakdowns from both OpenCitations and MAG for comparison
  • Fix green_in_home repo workflow
  • Remove old commented code, and turn conditional commented code into jinja conditional logic
  • Refactor event aggregation code to reduce storage costs and allow for growth of new event types
  • Ensure duplicate institutions are not found in the affiliation.institutions list
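The monthly-aggregation sub-points above could be sketched as follows. The column names (`published_year`, `published_year_month`) come from the bullets themselves; how `aggregate_dois` actually templates its GROUP BY is an assumption.

```python
import datetime

def published_year_month(issued: datetime.date) -> str:
    """Derive the proposed published_year_month key, e.g. '2020-11'."""
    return f"{issued.year:04d}-{issued.month:02d}"

def group_by_clause(mode: str) -> str:
    """The switchable grouping for the aggregate_dois query (year or month mode)."""
    columns = {"year": "published_year", "month": "published_year_month"}
    if mode not in columns:
        raise ValueError(f"unknown aggregation mode: {mode}")
    return f"GROUP BY {columns[mode]}"
```

The DAG would then run the same query twice a week, once per mode, writing to separate yearly and monthly tables.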

Grids

  • Does it make more sense to aggregate grids to their top level organisation? Thus including all publications in the parents count for any other grid that has a child relationship to that parent grid?
    • Create a grid table, which is a direct grouping by each grid_id, which is how the current institutions table works
    • Rename the current institution list in the affiliations section of the DOIs table to grids.
    • Create a new institutions table, this will be aggregated entities (counts of child institutions included with their parents).
    • Build on the extend_grid query: for each grid_id, create an array of all parents up the chain. This might involve a step in Python too
    • Create a new institutional affiliation list in the DOIs table. This will contain many more links per DOI than the previous version, as there will be a link for every explicitly linked grid plus a link to each of its parent grids, right up to the top. Because some grids ultimately nest into a national government (the USA, for example), a pure approach of always aggregating to the top of each grid hierarchy hides too much detail. So a publication linked to the NCI will also be linked to the NIH, the Department of Health and Human Services, and the Government of the USA. The final table will allow an end user to pick the level they are interested in
    • For each of these newly established aggregation links, also include the children who were the publishing institution. This will allow for later workflows to break out the relative contributions of the parts.
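The "array of all parents up the chain" step could look like the following, assuming the relationship data is reduced to a child-to-parents mapping (a grid can have several parents, so all chains are walked). The identifiers in the usage example are hypothetical stand-ins for the NCI/NIH chain described above.

```python
def ancestors(grid_id, parents):
    """Walk every parent chain from grid_id to the top; cycles are ignored.

    parents: dict mapping a grid_id to the list of its direct parent grid_ids.
    Returns the set of all ancestors (direct and transitive).
    """
    seen, stack = set(), list(parents.get(grid_id, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen
```

Each DOI-to-grid link would then fan out to one link per ancestor, with the originating child grid carried along so later workflows can break out relative contributions.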

Groups

  • Similar to the second point for grids: for a group, is it useful to track a minimal set of metrics for each member grid, so that analysis/visualisation of the parent can be broken into its constituent parts to offer a greater level of understanding?

Countries

  • Is it helpful to have a minimal set of information for each institution within that country, so later analysis/viz can understand the relative contributions from each of the constituent parts of the whole? Due to collaborations, the sum of all the counts for each institution > total, but perhaps the relative sizes might be useful?

Regions

  • Same as the above for countries, except rather than breaking down by institution, it will be broken down into all the countries contained in that region

Publishers, Funders and Journals

  • Similar to the above, but with a list broken down by all contributing institutions for publishers and journals, and by funded institutions for funders. Helpful for downstream analysis/viz to understand the relative impacts of the parts.

Funders

  • Does it make more sense to aggregate funders to their top level organisation?
  • A large proportion of funder references in crossref do not have an associated fundref ID. This limits the use of grouping funding by country of origin, or type of funder, as this information comes from the fundref database.
  • For the aggregation on funder entities, it currently uses the ID field (which is the fundref_id). This relates to the above point, by using the name field we get a larger set of results. A conditional Jinja statement might be a workaround for this, but really extra attention needs to go into disambiguating funder references.
  • Fundref, ISO and Geonames differences (alpha-2 vs alpha-3) to work through to ensure correct joining and colour palettes
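The ID-or-name fallback described above could be sketched as a grouping key, before or instead of the Jinja workaround. The field names (`id` for the fundref ID, `name`) are assumptions about the funder record shape, and lowercasing is a crude stand-in for proper disambiguation.

```python
def funder_key(funder):
    """Group by fundref ID when present, otherwise fall back to a normalised name.

    funder: dict with assumed fields "id" (fundref ID, may be missing/empty)
    and "name". This widens coverage but does not disambiguate variant names.
    """
    fundref_id = funder.get("id")
    if fundref_id:
        return fundref_id
    return funder.get("name", "").strip().lower()
```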

Citations

  • Include full list of cited and cited-by dois as part of the DOIs table (as a repeated nested field). Include published dates
  • Create derived dataset, based on MAG citations, but in the format of OpenCitations
  • As part of the final DOI tables schema, bucket and sum counts of citations from articles over various time periods from publication of the article in focus
  • As part of the final DOI tables, similar to the above, but do the same for articles that are cited by the focal article, creating buckets in the periods leading up to publication (i.e. how far back in the literature did the authors look)
  • As part of the final DOI tables schema, create counts for incoming citations bucketed into country of origin, either as a wide sparse sub-table or as a list ignoring countries with no incoming citations. Count each country separately, i.e. one count for each country each author is associated with. (The sum of citations across all countries will exceed the total number of citations.)
  • As part of the final DOI tables schema, create counts for outgoing citations bucketed into country of the cited work, either as a wide sparse sub-table or as a list ignoring countries with no outgoing citations
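The time-period bucketing for incoming citations might work as below; the cumulative 1/2/5-year boundaries are illustrative, not taken from the issue, and year-level resolution is assumed since published dates in Crossref are often partial.

```python
def bucket_citations(pub_year, citing_years, buckets=(1, 2, 5)):
    """Cumulative counts of citations received within N years of publication.

    pub_year: publication year of the focal article.
    citing_years: publication years of each citing article.
    """
    counts = {f"within_{b}y": 0 for b in buckets}
    for year in citing_years:
        for b in buckets:
            if 0 <= year - pub_year <= b:  # ignore citations dated before publication
                counts[f"within_{b}y"] += 1
    return counts
```

The same function run on the focal article's reference list (with the sign of the offset flipped) would give the "how far back did the authors look" buckets.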

Events

  • Move beyond just counting events based on type to a histogram-based approach, bucketing the various types of events into time slots relative to publication date. This is an extension of the aggregate crossref events script
  • In the same script, keep the full list of events, including the time in which it happened.
  • Details TBD, but create a dataset that pulls the discipline information from MAG for each DOI, and associate each event with these disciplines to create a time-bucket intensity score, or alt-metric hotness, by month for each of the discipline categories.

Diversity

  • For institutions, re-include the diversity table join in the new aggregation workflow

Perform IAM and compliance Review

Existing IAM role permissions are too broad and often rely on individual users rather than managed groups. Additionally, there is a range of potential regulations whose impact we need to better understand. Together we need to produce a document explaining our posture and compliance, as well as a better set of operational guidance for managing access.

Telescope workflow implementation: ARC Funding Data

Name: ARC Funding Data
Subject Area: Funding
Harvest Type: Paged JSON
Query Type: via API
Snapshot Type:
Snapshot Frequency:

The ARC provides a JSON API for funding since 2001 at https://dataportal.arc.gov.au/NCGP/Web/Grant/Grants. This is a paged JSON API which only provides a summary and the lead investigator's name rather than further details, but we could presumably link a lot of it up and/or scrape the additional data from the web interface. The JSON data does seem to be well structured and fairly straightforward to incorporate.
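Paging through the endpoint above might look like the following. The paging parameter name (`page`) and the response shape are assumptions — inspect one real response before building a telescope on this.

```python
import json
import urllib.request

ARC_GRANTS = "https://dataportal.arc.gov.au/NCGP/Web/Grant/Grants"

def page_url(base, page):
    """Append a page parameter; the parameter name 'page' is an assumption."""
    sep = "&" if "?" in base else "?"
    return f"{base}{sep}page={page}"

def fetch_page(page):
    """Fetch and decode one page of grant summaries."""
    with urllib.request.urlopen(page_url(ARC_GRANTS, page)) as resp:
        return json.load(resp)
```

A harvest loop would increment the page number until an empty result set (or whatever end-of-data signal the API actually uses) is returned.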

Alternatively, we could maybe ask for a data dump of this?

By contrast NHMRC data is made available as xlsx spreadsheets at: https://www.nhmrc.gov.au/funding/data-research/outcomes-funding-rounds
