niaid-data-ecosystem / nde-crawlers
Harvesting infrastructure to collect and standardize dataset and computational tool metadata
License: Apache License 2.0
Proposed steps:
Related refs from @jackDiGi from ~ 5 years ago:
Pull all publication metadata from PMC OAI-PMH or from bulk open access data or APIs.
Using this data, there are a number of things that can be done:
If a citation is listed, augment the existing metadata by attaching the funding information provided by PMC. Grant numbers from the NIH have a very consistent format, BUT there are lots of variations in how these numbers are formatted, which impedes their searchability.
e.g.
U01AI151810-02 vs. U01AI151810-01 vs. U01AI15181002 vs. 1U01 AI151810-02 vs. U01 AI151810-02, etc.
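The variants above can be collapsed to a single searchable key. A minimal sketch, assuming the canonical form we want is activity code + institute code + serial number (e.g. "U01AI151810"), dropping the application-type prefix, spaces, and support-year suffix; `normalize_grant` is a hypothetical helper, not an existing nde-crawlers function:

```python
import re

# Assumed canonical key: ACTIVITY (U01) + INSTITUTE (AI) + SERIAL (151810),
# discarding application type ("1"), spaces, and support year ("-02").
GRANT_RE = re.compile(
    r"^\s*"
    r"(?:\d)?\s*"        # optional application type, e.g. the "1" in 1U01
    r"([A-Z]\d{2})\s*"   # activity code, e.g. U01
    r"([A-Z]{2})\s*"     # institute code, e.g. AI
    r"(\d{6})"           # serial number, e.g. 151810
    r"(?:-?\d{2})?\s*$"  # optional support year, e.g. -02 or bare 02
)

def normalize_grant(raw):
    """Collapse formatting variants of an NIH grant number to one key."""
    m = GRANT_RE.match(raw.upper())
    if not m:
        return None
    return "".join(m.groups())

variants = ["U01AI151810-02", "U01AI151810-01", "U01AI15181002",
            "1U01 AI151810-02", "U01 AI151810-02"]
assert {normalize_grant(v) for v in variants} == {"U01AI151810"}
```

Indexing this normalized key alongside the original string would let all the variants above match the same search.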
General protocol: metadata crawlers that harvest a large amount of data typically take a few days to gather all the records. Implement caching so that roughly once a month we do a full run that wipes and rebuilds ALL the metadata (to catch any changes to existing records), while the daily updates only harvest metadata from new records.
This will need to be implemented in the harvesters that ingest a lot of data, including:
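The full-vs-incremental schedule described above can be sketched as follows. All names here (`plan_run`, `last_full_run`, the 30-day interval) are illustrative assumptions, not existing nde-crawlers code:

```python
from datetime import date, timedelta

# Assumption: "once a month" means a full wipe-and-rebuild at least
# every 30 days; every other day is an incremental (new-records-only) run.
FULL_RUN_INTERVAL = timedelta(days=30)

def plan_run(today, last_full_run):
    """Return "full" (wipe and re-harvest everything) roughly once a month,
    otherwise "incremental" (only harvest records new since the last run)."""
    if last_full_run is None or today - last_full_run >= FULL_RUN_INTERVAL:
        return "full"
    return "incremental"

assert plan_run(date(2023, 2, 1), None) == "full"
assert plan_run(date(2023, 2, 1), date(2023, 1, 1)) == "full"
assert plan_run(date(2023, 2, 1), date(2023, 1, 15)) == "incremental"
```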
Example dataset: https://clinepidb.org/ce/app/workspace/analyses/DS_4902d9b7ec/new/details
and data downloads: https://clinepidb.org/ce/app/workspace/analyses/DS_4902d9b7ec/new/download
To do:
Create scripts to automatically run Docker containers to harvest new metadata from sources according to a schedule.
Improve the standardization of metadata; augment available metadata with additional sources, create linkages between metadata records, etc.
Not a priority atm, but if needed, we can access the additional missing Zenodo metadata via their metadata dump and then fill in the records created since July 2021.
Context: the OAI-PMH metadata API is the only way to get all the records via API, but it lacks key metadata fields, including files, citation, and software code.
Working from what Zhongchao already created, merge with the existing Ansible deployment pipelines.
Since OmicsDI is a metadata repository, it would be good to save the provider information in the sdPublisher field, so for this record it'd be something like:
sdPublisher: [{
name: "BioStudies",
identifier: "S-EPMC3841080",
url: "https://www.ebi.ac.uk/biostudies/studies/S-EPMC3841080"
}]
It looks like that info comes from a zipping of repository and full_dataset_link (though database is also related)?
Convert the separate but redundant infrastructure shared between the NDE, RDP, and outbreak projects into a common set of crawlers, with a filtered portion of the API being redirected to each separate project.
Related to #16 (custom query filters for NDE).
Collect metadata for all datasets associated with clinical trials curated by Vivli. Looks like metadata for each study is available via API.
species and infectiousAgent should be mapped to the NCBI Taxonomy Ontology; infectiousDisease to MONDO.
NOTE: Will need to consider how we also incorporate related NCBI projects: BioProject and GenBank.
Basically -- also see more on the divisions here:
Could involve:
Right now, understanding how we've manipulated the metadata from the source requires going through the parsers and reading the code, so our manipulations are less clear than they should be.
Is there a better way to document these manipulations -- essentially, a crosswalk dictionary to convert between the input/output for each parser?
Do we want to publish these crosswalks?
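One shape such a crosswalk could take: a machine-readable table per parser, mapping source field to NDE schema field plus a note on any transformation applied. The entries below are illustrative, not the actual Figshare mapping:

```python
# Hypothetical crosswalk for one parser (field names are examples only).
FIGSHARE_CROSSWALK = {
    "title":        {"target": "name",          "transform": "copied as-is"},
    "description":  {"target": "description",   "transform": "HTML stripped"},
    "created_date": {"target": "datePublished", "transform": "ISO 8601 date only"},
    "authors":      {"target": "author",        "transform": "list of {name} objects"},
}

def apply_crosswalk(source, crosswalk):
    """Rename source fields per the crosswalk (value transforms not applied here)."""
    return {spec["target"]: source[field]
            for field, spec in crosswalk.items() if field in source}

out = apply_crosswalk({"title": "Barcoded oligos", "authors": [{"name": "A"}]},
                      FIGSHARE_CROSSWALK)
assert out == {"name": "Barcoded oligos", "author": [{"name": "A"}]}
```

Publishing the table (without the code) would document each parser's manipulations without requiring readers to read the parser itself.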
API: https://docs.figshare.com/
example dataset: https://figshare.com/articles/dataset/Barcoded_oligonucleotide_sequences_used_for_second_amplification_prior_to_sequencing_/19623304
prototype repo: https://github.com/biothings/biothings.crawler/blob/master/crawler/spiders/broadscrape/figshare_brunel.py
assay_id to get the metadata for each dataset.
For instance: using the controlled vocabulary of species tied to the NCBI Taxonomy Ontology, search for all E. coli species and strains/subspecies, or synonyms ("human" = "Homo sapiens" = "H. sapiens" = 9606).
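The synonym behavior described above amounts to a lookup from name variants to a single taxonomy ID. A toy sketch; in practice the table would be built from the NCBI Taxonomy dump rather than hard-coded:

```python
# Toy synonym table keyed to NCBI Taxonomy IDs (9606 = Homo sapiens,
# 562 = Escherichia coli). A real table comes from the NCBI Taxonomy dump.
SYNONYMS = {
    "human": 9606, "homo sapiens": 9606, "h. sapiens": 9606,
    "e. coli": 562, "escherichia coli": 562,
}

def to_taxid(term):
    """Normalize a free-text species term to its NCBI Taxonomy ID, if known."""
    return SYNONYMS.get(term.strip().lower())

assert to_taxid("Homo sapiens") == to_taxid("human") == 9606
assert to_taxid("E. coli") == 562
```

Searching on the taxid (plus its strain/subspecies descendants in the taxonomy tree) then matches every name variant at once.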
Using the DDE registry:
Create a wrapper that makes it easy to check the health of the Hub on each scheduled update, rather than having to dig through the logs to know whether updates failed or succeeded. It should focus on alerting us when something goes wrong and requires intervention. Julia built a slackbot off the /metadata endpoint which has been useful for tracking the size of each outbreak.info source and the last day it was updated:
It'd be cool if something like this were also tied to error states that would require work during an autorelease.
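The alerting logic could start from the same /metadata payload the slackbot reads. A sketch, assuming a per-source shape of record count plus last-updated date (the field names here are guesses, not the actual endpoint schema):

```python
from datetime import date

# Assumed payload shape: {source: {"count": int, "last_updated": "YYYY-MM-DD"}}.
def find_stale_sources(metadata, today, max_age_days=2):
    """Return sources that have dropped to zero records or have not updated
    recently -- the cases that need human intervention."""
    alerts = []
    for source, info in metadata.items():
        age = (today - date.fromisoformat(info["last_updated"])).days
        if info["count"] == 0 or age > max_age_days:
            alerts.append(source)
    return alerts

payload = {
    "zenodo":  {"count": 120000, "last_updated": "2023-03-01"},
    "omicsdi": {"count": 0,      "last_updated": "2023-03-01"},
    "vivli":   {"count": 300,    "last_updated": "2023-02-01"},
}
assert find_stale_sources(payload, date(2023, 3, 1)) == ["omicsdi", "vivli"]
```

Posting only the non-empty alert list to Slack keeps the signal focused on runs that actually need intervention.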
Need to think carefully about how to do this, but how do we combine data that are available in multiple indices? For instance: GEO record from NCBI GEO itself, or Omics DI, or Mendeley. Ideally, this would be a single record, but need to figure out how to merge info, resolve conflicts, etc.
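One possible conflict-resolution policy for that merge, sketched under the assumption that sources can be ranked by authoritativeness (the ranking and field handling below are illustrative choices, not a settled design):

```python
# Assumption: prefer the most authoritative source for scalar fields,
# union multi-valued fields across all sources. Ranking is illustrative.
SOURCE_RANK = {"geo": 0, "omicsdi": 1, "mendeley": 2}  # lower = preferred

def merge_records(records):
    """Merge duplicate records of one dataset harvested from several indices."""
    ordered = sorted(records, key=lambda r: SOURCE_RANK.get(r["source"], 99))
    merged = {}
    for rec in ordered:
        for field, value in rec.items():
            if field == "source":
                continue
            if isinstance(value, list):
                seen = merged.setdefault(field, [])
                seen.extend(v for v in value if v not in seen)
            else:
                merged.setdefault(field, value)  # highest-ranked source wins
    return merged

merged = merge_records([
    {"source": "omicsdi", "name": "GSE1 (mirror)", "keywords": ["rna-seq"]},
    {"source": "geo", "name": "GSE1", "keywords": ["expression"]},
])
assert merged["name"] == "GSE1"
assert merged["keywords"] == ["expression", "rna-seq"]
```

The harder open question, deciding that two records from different indices are in fact the same dataset, still needs a linking key (e.g. a shared GEO accession).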
Add ComputationalTools which are registered according to the DDE NIAID ComputationalTool Guide. These tools will be stored in the DDE API (-- it's empty now).
I don't think there should be much schema adjustment needed -- probably just a creator -> author change, and then call it a day.
OmicsDI harvests all the records from GEO, but with less metadata. Combine relevant metadata and de-duplicate the records. See also #40
https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/14
Now, it looks like all the distribution objects in OmicsDI link to an .xml file (example). However, the actual file metadata object provides a list of downloadable files (in this example, a .pdf).
In the next run of OmicsDI, it would be good to improve the distribution
object to make it easier for the user to find these data files, like:
distribution: [{
  name: "annrheumdis-2012-202031-s1.pdf",
  url: "https://www.ebi.ac.uk/biostudies/files/S-EPMC3841758/annrheumdis-2012-202031-s1.pdf",
  encodingFormat: "application/pdf"
}]
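Such distribution objects could be built from the BioStudies file listing. A sketch: the base URL pattern follows the example above, and treating every listed file as downloadable plus guessing encodingFormat from the extension are assumptions:

```python
import mimetypes

# Sketch: build improved distribution objects from a BioStudies accession
# and its file listing. Assumes files live under /biostudies/files/<acc>/.
def build_distribution(accession, filenames):
    base = f"https://www.ebi.ac.uk/biostudies/files/{accession}"
    return [{
        "name": fn,
        "url": f"{base}/{fn}",
        "encodingFormat": mimetypes.guess_type(fn)[0] or "application/octet-stream",
    } for fn in filenames]

dist = build_distribution("S-EPMC3841758", ["annrheumdis-2012-202031-s1.pdf"])
assert dist == [{
    "name": "annrheumdis-2012-202031-s1.pdf",
    "url": "https://www.ebi.ac.uk/biostudies/files/S-EPMC3841758/annrheumdis-2012-202031-s1.pdf",
    "encodingFormat": "application/pdf",
}]
```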
Via publications, link OmicsDI datasets to funding information.
Major question to consider: is a publication associated with an OmicsDI record only primary data generation, or also secondary analyses of those datasets? Does it matter when connecting to grant numbers?