academic-observatory-workflows's People

Contributors

alexmassen-hane, aroelo, bechandcock, cameronneylon, jdddog, kathrynnapier, keegansmith21, metasj, rhosking, tuanchien

academic-observatory-workflows's Issues

Telescope workflow implementation: OpenAire

There are three ways of bulk accessing the OpenAIRE data:

  • OpenAIRE Research Graph Dumps
  • OAI-PMH
  • Bulk access to projects

OpenAIRE Research Graph Dumps
Dumps can be downloaded from Zenodo (https://zenodo.org/search?page=1&size=20&q=OpenAIRE%20Research%20Graph%20Dump) or explored through their beta portal.
There are two dumps available, one from 18-12-2019 and one from 03-11-2020; the latter also has an updated JSON schema.

Each publication on Zenodo contains several dumps/files; the 2019 set is slightly different from the 2020 set.
2019 files:

publication.gz: metadata records about research literature (includes types of publications listed here)
dataset.gz: metadata records about research data (includes the subtypes listed here)
software.gz: metadata records about research software (includes the subtypes listed here)
orp.gz: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)
organization.gz: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.
datasource.gz: metadata records about providers whose content is available in the OpenAIRE Research Graph. These include institutional and thematic repositories, journals, aggregators, and funders' databases.
project.gz: metadata records about projects funded by a given funder.
<funder>_result.gz: metadata records about research results (publications, datasets, software, and other research products) funded by a given funder.

2020 files:

publication_[part].tar: metadata records about research literature (includes types of publications listed here)
dataset.tar: metadata records about research data (includes the subtypes listed here)
software.tar: metadata records about research software (includes the subtypes listed here)
otherresearchproduct.tar: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)
organization.tar: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.
datasource.tar: metadata records about providers whose content is available in the OpenAIRE Research Graph. These include institutional and thematic repositories, journals, aggregators, and funders' databases.
project.tar: metadata records about projects funded by a given funder.
relation_[part].tar: metadata records about relations between entities in the graph
communities_infrastructures.tar: metadata records about research communities and research infrastructures
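As a sketch of how a telescope might enumerate these dump parts, the snippet below queries the Zenodo REST API for a record and filters the attached filenames for one entity type. The record URL and the response shape (`files`/`key`) are assumptions based on Zenodo's public API, not an existing workflow.

```python
import json
import urllib.request

# Assumed: record behind the DOI cited below resolves via the Zenodo REST API
ZENODO_RECORD_API = "https://zenodo.org/api/records/4238939"

def select_entity_files(filenames, entity):
    # Dump parts are named either "<entity>.tar" or "<entity>_<part>.tar";
    # note that lexicographic sorting misorders parts beyond 9.
    return sorted(
        name for name in filenames
        if name == f"{entity}.tar" or name.startswith(f"{entity}_")
    )

def list_dump_files(url=ZENODO_RECORD_API):
    """Fetch the record metadata and return all attached filenames."""
    with urllib.request.urlopen(url) as resp:
        record = json.load(resp)
    return [f["key"] for f in record["files"]]  # "key" is assumed to hold the filename
```

A download step would then fetch each selected file's `links.self` URL and stream it to storage.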

The image in https://doi.org/10.5281/zenodo.4238939 helps to understand the relationships between these files.

OAI-PMH
An OAI-PMH harvester is available as well, with one caveat:

Currently the OAI-PMH publisher is not supporting incremental harvesting.
Although the usage of the OAI parameters 'from' and 'until' is handled by the OAI publisher, the datestamps of metadata records are updated about every week.

I'm not sure what they mean by 'the datestamps of metadata records are updated about every week'.
Considering the data size it might be best to initially download the dumps instead of using the OAI-PMH harvester. Perhaps the harvester can be used to update the data regularly with newly added/edited records, but I'm skeptical since they mention above that incremental harvesting is not supported.
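For reference, a selective-harvesting request is just a `ListRecords` call with `from`/`until` parameters, continued via resumption tokens. The endpoint URL and the `metadataPrefix` value below are assumptions — take both from the provider's `Identify` and `ListMetadataFormats` responses before using this.

```python
import urllib.parse
import urllib.request

OAI_ENDPOINT = "https://api.openaire.eu/oai_pmh"  # assumed endpoint, verify first

def list_records_params(from_date=None, until_date=None, token=None):
    """Build ListRecords parameters; a resumptionToken replaces all other arguments."""
    if token:
        return {"verb": "ListRecords", "resumptionToken": token}
    params = {"verb": "ListRecords", "metadataPrefix": "oaf"}  # "oaf" is an assumption
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return params

def fetch_page(**kwargs):
    """Fetch one page of raw XML; parse it and look for a resumptionToken element."""
    url = OAI_ENDPOINT + "?" + urllib.parse.urlencode(list_records_params(**kwargs))
    with urllib.request.urlopen(url) as resp:
        return resp.read()
```

Whether `from`/`until` behave usefully here depends on the datestamp question above, since the publisher does not support incremental harvesting.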

Bulk access to projects
The APIs offer custom access to metadata about projects funded by a selection of international funders for the DSpace and EPrints platforms. The currently supported funding streams and their codes are:

FP7: The 7th Framework Programme funded by the European Commission
WT: Wellcome Trust funding programme
H2020: Horizon2020 Programme funded by the European Commission
FCT: The funding programme of Fundação para a Ciência e a Tecnologia, the national funding agency of Portugal
ARC: the funding programme of the Australian Research Council
NHMRC: the funding programme of the Australian National Health and Medical Research Council
SFI: Science Foundation Ireland
HRZZ: Croatian Science Foundation
MZOS: Ministry of Science, Education and Sports of the Republic of Croatia
MESTD: The Ministry of Education, Science and Technological Development of Serbia
NWO: The Netherlands Organisation for Scientific Research

I'm not sure if this is of interest to us. I think this project data is also included in the Zenodo files, and this is just a convenient alternative if you're interested in a specific project.

Questions:

  • How often is a new dump expected to be released on Zenodo? I can only find the two publications so far.
  • Will we combine all the different files in a single table or separate tables?
  • For the OAI-PMH, how is the datestamp determined and how often are new records added? The newest record seems to be from 2020-05-12.

Define a method for users to easily create new Groups of institutions for analysis and comparison

It would be really beneficial for end users (researchers, analysts, etc) to easily define a list of institutions (using their grid_id or other identifier system) that can be automatically utilised by our data aggregation workflows to produce 'group'-level aggregations the same way we create institution, country, funder and publisher level aggregations.

The current method is slow and requires a knowledge of SQL and BigQuery. Other methods must be investigated, and the chosen solution documented and easy to use.
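One candidate solution: let users drop a small JSON file of identifiers into a bucket or repo, which the aggregation workflow validates and picks up. Everything below (the file format, field names, and the loose GRID-ID pattern) is a hypothetical sketch, not an existing feature.

```python
import json
import re

# Loose pattern for GRID identifiers such as grid.1032.0; the exact suffix
# alphabet is an assumption -- tighten it against real GRID data.
GRID_ID_RE = re.compile(r"^grid\.\d+\.[0-9a-z]+$")

def load_group(text):
    """Parse and validate a user-supplied group definition (a JSON string)."""
    group = json.loads(text)
    bad = [g for g in group["grid_ids"] if not GRID_ID_RE.match(g)]
    if bad:
        raise ValueError(f"invalid GRID identifiers: {bad}")
    return group
```

Validating at load time means a typo in one identifier fails loudly instead of silently dropping an institution from the aggregation.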

Design Issue: Long Term Maintenance Plan and Security Posture

There are a number of understood and emerging tasks and processes that need defining, documenting and planning for when we consider the longer-term viability of this project. The following issues have already been identified:

  • How dependencies (largely software libraries) will be updated over time, by whom, when, and what is the priority and risk of doing (or not doing) this
  • Put in place an incident response plan for any security incidents
  • Review on-going data storage costs and data lifecycle automation and retention policies
  • Consider putting in place a roster system for responsibilities for fixing issues in production. This could equally apply to Telescopes, Workflows, Kibana or other APIs and services we have deployed

Data Aggregation Improvements

A list of useful improvements to the DOI/Entity Aggregation Pipeline. This list also replaces and organises a few issues that have been around the backlog for a while and need addressing. Closing The-Academic-Observatory/observatory-platform#272, The-Academic-Observatory/observatory-platform#146, The-Academic-Observatory/observatory-platform#129, The-Academic-Observatory/observatory-platform#110, The-Academic-Observatory/observatory-platform#70 as they are now covered here.

Cross cutting issues

  • Ensure consistent use of the various crossref dates. Issued date is more reliable across the various output types
  • Add OA types to the collaboration analysis
  • Potentially extend Collaboration analysis with information on discipline and funding
  • Review additional OA types and any definition issues. Gold-only, for example, is required
  • Remove filtered_list comments
  • Do we want to create monthly aggregations as well as yearly?
    • Create published_year_month in the dois table
    • Modify the aggregate_dois query to have a switchable option between grouping by year and grouping by month
    • Extend the Telescope to enable running in either year or month mode
    • Extend the DAG to ensure both run each week
  • Add citation counts to discipline aggregations
  • Simplify metrics and oa_citations fields
  • Include citation breakdowns from both OpenCitations and MAG for comparison
  • Fix green_in_home repo workflow
  • Remove old commented code, and turn conditional commented code into jinja conditional logic
  • Refactor event aggregation code to reduce storage costs and allow for growth of new event types
  • Ensure duplicate institutions are not found in the affiliation.institutions list
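The monthly-aggregation sub-points above could be sketched as follows. The column names (`published_year`, `published_year_month`) come from the bullets themselves; how `aggregate_dois` actually templates its GROUP BY is an assumption.

```python
import datetime

def published_year_month(issued: datetime.date) -> str:
    """Derive the proposed published_year_month key, e.g. '2020-11'."""
    return f"{issued.year:04d}-{issued.month:02d}"

def group_by_clause(mode: str) -> str:
    """The switchable grouping for the aggregate_dois query (year or month mode)."""
    columns = {"year": "published_year", "month": "published_year_month"}
    if mode not in columns:
        raise ValueError(f"unknown aggregation mode: {mode}")
    return f"GROUP BY {columns[mode]}"
```

The DAG would then run the same query twice a week, once per mode, writing to separate yearly and monthly tables.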

Grids

  • Does it make more sense to aggregate grids to their top level organisation? Thus including all publications in the parents count for any other grid that has a child relationship to that parent grid?
    • Create a grid table, which is a direct grouping by each grid_id, which is how the current institutions table works
    • Rename the current institution list in the affiliations section of the DOIs table to grids.
    • Create a new institutions table, this will be aggregated entities (counts of child institutions included with their parents).
    • Build on the extend_grid query: for each grid_id, create an array of all parents up the chain. This might involve a step in Python too
    • Create a new institutional affiliation list in the DOIs table. This will contain many more links per DOI than the previous version, as there will be a link for every explicitly linked grid plus a link to each of its parent grids, right up to the top. Because some grids ultimately nest into a national government (the USA, for example), a pure approach of always aggregating to the top of each grid hierarchy hides too much detail. So a publication linked to the NCI will also be linked to the NIH, the Department of Health and Human Services, and the Government of the USA. The final table will allow an end user to pick the level they are interested in
    • For each of these newly established aggregation links, also include the children who were the publishing institution. This will allow for later workflows to break out the relative contributions of the parts.
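The "array of all parents up the chain" step could look like the following, assuming the relationship data is reduced to a child-to-parents mapping (a grid can have several parents, so all chains are walked). The identifiers in the usage example are hypothetical stand-ins for the NCI/NIH chain described above.

```python
def ancestors(grid_id, parents):
    """Walk every parent chain from grid_id to the top; cycles are ignored.

    parents: dict mapping a grid_id to the list of its direct parent grid_ids.
    Returns the set of all ancestors (direct and transitive).
    """
    seen, stack = set(), list(parents.get(grid_id, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen
```

Each DOI-to-grid link would then fan out to one link per ancestor, with the originating child grid carried along so later workflows can break out relative contributions.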

Groups

  • Similar to the second point for grids: for a group, is it useful to track a minimal set of metrics for each member grid, so that analysis/visualisation of the parent can be broken into its constituent parts to offer a greater level of understanding?

Countries

  • Is it helpful to have a minimal set of information for each institution within that country, so later analysis/viz can understand the relative contributions from each of the constituent parts of the whole? Due to collaborations, the sum of all the counts for each institution > total, but perhaps the relative sizes might be useful?

Regions

  • Same as the above for countries, except rather than breaking down by institution, it will be broken down into all the countries contained in that region

Publishers, Funders and Journals

  • Similar to the above, but with a list broken down by all contributing institutions for publishers and journals, and by funded institutions for funders. Helpful for downstream analysis/viz to understand the relative impacts of the parts.

Funders

  • Does it make more sense to aggregate funders to their top level organisation?
  • A large proportion of funder references in crossref do not have an associated fundref ID. This limits the use of grouping funding by country of origin, or type of funder, as this information comes from the fundref database.
  • For the aggregation on funder entities, it currently uses the ID field (which is the fundref_id). This relates to the above point, by using the name field we get a larger set of results. A conditional Jinja statement might be a workaround for this, but really extra attention needs to go into disambiguating funder references.
  • Fundref, ISO and Geonames differences (alpha-2 vs alpha-3) to work through to ensure correct joining and colour palettes
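The ID-or-name fallback described above could be sketched as a grouping key, before or instead of the Jinja workaround. The field names (`id` for the fundref ID, `name`) are assumptions about the funder record shape, and lowercasing is a crude stand-in for proper disambiguation.

```python
def funder_key(funder):
    """Group by fundref ID when present, otherwise fall back to a normalised name.

    funder: dict with assumed fields "id" (fundref ID, may be missing/empty)
    and "name". This widens coverage but does not disambiguate variant names.
    """
    fundref_id = funder.get("id")
    if fundref_id:
        return fundref_id
    return funder.get("name", "").strip().lower()
```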

Citations

  • Include full list of cited and cited-by dois as part of the DOIs table (as a repeated nested field). Include published dates
  • Create derived dataset, based on MAG citations, but in the format of OpenCitations
  • As part of the final DOI tables schema, bucket and sum counts of citations from articles over various time periods from publication of the article in focus
  • As part of the final DOI tables, similar to the above, but do the same for articles that are cited by the focal article, creating buckets in the periods leading up to publication (i.e. how far back in the literature did the authors look)
  • As part of the final DOI tables schema, create counts for incoming citations bucketed into country of origin, either as a wide sparse sub-table or as a list ignoring countries with no incoming citations. Count each country separately, i.e. one count for each country each author is associated with. (The sum of citations across all countries will exceed the total number of citations.)
  • As part of the final DOI tables schema, create counts for outgoing citations bucketed into country of the cited work, either as a wide sparse sub-table or as a list ignoring countries with no outgoing citations
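The time-period bucketing for incoming citations might work as below; the cumulative 1/2/5-year boundaries are illustrative, not taken from the issue, and year-level resolution is assumed since published dates in Crossref are often partial.

```python
def bucket_citations(pub_year, citing_years, buckets=(1, 2, 5)):
    """Cumulative counts of citations received within N years of publication.

    pub_year: publication year of the focal article.
    citing_years: publication years of each citing article.
    """
    counts = {f"within_{b}y": 0 for b in buckets}
    for year in citing_years:
        for b in buckets:
            if 0 <= year - pub_year <= b:  # ignore citations dated before publication
                counts[f"within_{b}y"] += 1
    return counts
```

The same function run on the focal article's reference list (with the sign of the offset flipped) would give the "how far back did the authors look" buckets.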

Events

  • Move beyond just counting events based on type to a histogram-based approach, bucketing the various types of events into time slots relative to publication date. This is an extension of the aggregate crossref events script
  • In the same script, keep the full list of events, including the time in which it happened.
  • Details TBD, but create a dataset that pulls the discipline information from MAG for each DOI, and associate each event with these disciplines to create a time-bucket intensity score, or alt-metric hotness, by month for each of the discipline categories.

Diversity

  • For institutions, re-include the diversity table join in the new aggregation workflow

Perform IAM and compliance Review

Existing IAM role permissions are too broad and often rely on individual users rather than managed groups. Additionally, there is a range of potential regulations whose impact we need to better understand. Together we need to produce a document explaining our posture and compliance, as well as a better set of operational guidance for managing access.

Telescope workflow implementation: ARC Funding Data

Name: ARC Funding Data
Subject Area: Funding
Harvest Type: Paged JSON
Query Type: via API
Snapshot Type:
Snapshot Frequency:

The ARC provides a JSON API for funding since 2001 at https://dataportal.arc.gov.au/NCGP/Web/Grant/Grants. This is a paged JSON API which only provides a summary and the lead investigator's name rather than further details, but we could presumably link a lot of it up and/or scrape the additional data from the web interface. The JSON data does seem to be well structured and fairly straightforward to incorporate.
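Paging through the endpoint above might look like the following. The paging parameter name (`page`) and the response shape are assumptions — inspect one real response before building a telescope on this.

```python
import json
import urllib.request

ARC_GRANTS = "https://dataportal.arc.gov.au/NCGP/Web/Grant/Grants"

def page_url(base, page):
    """Append a page parameter; the parameter name 'page' is an assumption."""
    sep = "&" if "?" in base else "?"
    return f"{base}{sep}page={page}"

def fetch_page(page):
    """Fetch and decode one page of grant summaries."""
    with urllib.request.urlopen(page_url(ARC_GRANTS, page)) as resp:
        return json.load(resp)
```

A harvest loop would increment the page number until an empty result set (or whatever end-of-data signal the API actually uses) is returned.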

Alternatively, we could maybe ask for a data dump of this?

By contrast NHMRC data is made available as xlsx spreadsheets at: https://www.nhmrc.gov.au/funding/data-research/outcomes-funding-rounds
