niaid-data-ecosystem / nde-crawlers
Harvesting infrastructure to collect and standardize dataset and computational tool metadata
License: Apache License 2.0
Proposed steps:
Related refs from @jackDiGi from ~ 5 years ago:
Pull all publication metadata from PMC OAI-PMH or from bulk open access data or APIs.
Using this data, there are a number of things that can be done:
If a citation is listed, augment the existing metadata by attaching the funding information provided by PMC. Grant numbers from the NIH have a very consistent format, BUT there are lots of variations in how these numbers are formatted, which impedes their searchability.
e.g.
U01AI151810-02 vs. U01AI151810-01 vs. U01AI15181002 vs. 1U01 AI151810-02 vs. U01 AI151810-02, etc.
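The variants above can be collapsed to a single searchable key. A minimal sketch, assuming the canonical form we want is activity code + institute code + serial number (e.g. "U01AI151810"), dropping the application-type prefix, spaces, and support-year suffix; `normalize_grant` is a hypothetical helper, not an existing nde-crawlers function:

```python
import re

# Assumed canonical key: ACTIVITY (U01) + INSTITUTE (AI) + SERIAL (151810),
# discarding application type ("1"), spaces, and support year ("-02").
GRANT_RE = re.compile(
    r"^\s*"
    r"(?:\d)?\s*"        # optional application type, e.g. the "1" in 1U01
    r"([A-Z]\d{2})\s*"   # activity code, e.g. U01
    r"([A-Z]{2})\s*"     # institute code, e.g. AI
    r"(\d{6})"           # serial number, e.g. 151810
    r"(?:-?\d{2})?\s*$"  # optional support year, e.g. -02 or bare 02
)

def normalize_grant(raw):
    """Collapse formatting variants of an NIH grant number to one key."""
    m = GRANT_RE.match(raw.upper())
    if not m:
        return None
    return "".join(m.groups())

variants = ["U01AI151810-02", "U01AI151810-01", "U01AI15181002",
            "1U01 AI151810-02", "U01 AI151810-02"]
assert {normalize_grant(v) for v in variants} == {"U01AI151810"}
```

Indexing this normalized key alongside the original string would let all the variants above match the same search.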
General protocol: metadata crawlers that harvest a large amount of data typically take a few days to gather all the records. Implement caching so that roughly once a month we do a full run that wipes and rebuilds ALL the metadata (to catch any changes to existing records), while the daily updates only harvest metadata from new records.
This will need to be implemented in the harvesters that ingest a lot of data, including:
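The full-vs-incremental schedule described above can be sketched as follows. All names here (`plan_run`, `last_full_run`, the 30-day interval) are illustrative assumptions, not existing nde-crawlers code:

```python
from datetime import date, timedelta

# Assumption: "once a month" means a full wipe-and-rebuild at least
# every 30 days; every other day is an incremental (new-records-only) run.
FULL_RUN_INTERVAL = timedelta(days=30)

def plan_run(today, last_full_run):
    """Return "full" (wipe and re-harvest everything) roughly once a month,
    otherwise "incremental" (only harvest records new since the last run)."""
    if last_full_run is None or today - last_full_run >= FULL_RUN_INTERVAL:
        return "full"
    return "incremental"

assert plan_run(date(2023, 2, 1), None) == "full"
assert plan_run(date(2023, 2, 1), date(2023, 1, 1)) == "full"
assert plan_run(date(2023, 2, 1), date(2023, 1, 15)) == "incremental"
```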
Example dataset: https://clinepidb.org/ce/app/workspace/analyses/DS_4902d9b7ec/new/details
and data downloads: https://clinepidb.org/ce/app/workspace/analyses/DS_4902d9b7ec/new/download
To do:
Create scripts to automatically run Docker containers to harvest new metadata from sources according to a schedule.
Improve the standardization of metadata; augment available metadata with additional sources, create linkages between metadata records, etc.
Not a priority atm, but if needed, we can access the additional missing Zenodo metadata via their metadata dump and then fill in the records created since July 2021.
Context: the OAI-PMH metadata API is the only way to get all the records via API, but it lacks key metadata fields, including files, citation, and software code.
Working from what Zhongchao already created, merge with the existing Ansible deployment pipelines.
Since OmicsDI is a metadata repository, it would be good to save the provider information in the sdPublisher field, so for this record it'd be something like:
sdPublisher: [{
name: "BioStudies",
identifier: "S-EPMC3841080",
url: "https://www.ebi.ac.uk/biostudies/studies/S-EPMC3841080"
}]
It looks like that info comes from a zipping of repository and full_dataset_link (though database is also related)?
Convert the separate but redundant infrastructure shared between the NDE, RDP, and outbreak projects into a common set of crawlers, with a filtered portion of the API being redirected to each separate project.
Related to #16 (custom query filters for NDE).
Collect metadata for all datasets associated with clinical trials curated by Vivli. Looks like metadata for each study is available via API.
species and infectiousAgent should be mapped to the NCBI Taxonomy Ontology; infectiousDisease to MONDO.
NOTE: Will need to consider how we also incorporate related NCBI projects: BioProject and GenBank.
Basically -- also see more on the divisions here:
Could involve:
Right now, understanding how we've manipulated the metadata from the source requires going through the parsers and reading the code, so our manipulations are less clear than they should be.
Is there a better way to document these manipulations -- essentially, a crosswalk dictionary to convert between the input/output for each parser?
Do we want to publish these crosswalks?
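One shape such a crosswalk could take: a machine-readable table per parser, mapping source field to NDE schema field plus a note on any transformation applied. The entries below are illustrative, not the actual Figshare mapping:

```python
# Hypothetical crosswalk for one parser (field names are examples only).
FIGSHARE_CROSSWALK = {
    "title":        {"target": "name",          "transform": "copied as-is"},
    "description":  {"target": "description",   "transform": "HTML stripped"},
    "created_date": {"target": "datePublished", "transform": "ISO 8601 date only"},
    "authors":      {"target": "author",        "transform": "list of {name} objects"},
}

def apply_crosswalk(source, crosswalk):
    """Rename source fields per the crosswalk (value transforms not applied here)."""
    return {spec["target"]: source[field]
            for field, spec in crosswalk.items() if field in source}

out = apply_crosswalk({"title": "Barcoded oligos", "authors": [{"name": "A"}]},
                      FIGSHARE_CROSSWALK)
assert out == {"name": "Barcoded oligos", "author": [{"name": "A"}]}
```

Publishing the table (without the code) would document each parser's manipulations without requiring readers to read the parser itself.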
API: https://docs.figshare.com/
example dataset: https://figshare.com/articles/dataset/Barcoded_oligonucleotide_sequences_used_for_second_amplification_prior_to_sequencing_/19623304
prototype repo: https://github.com/biothings/biothings.crawler/blob/master/crawler/spiders/broadscrape/figshare_brunel.py
assay_id to get the metadata for each dataset.
For instance: using the controlled vocabulary of species tied to the NCBI Taxonomy Ontology, search for all E. coli species and strains/subspecies, or synonyms ("human" = "Homo sapiens" = "H. sapiens" = 9606).
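The synonym behavior described above amounts to a lookup from name variants to a single taxonomy ID. A toy sketch; in practice the table would be built from the NCBI Taxonomy dump rather than hard-coded:

```python
# Toy synonym table keyed to NCBI Taxonomy IDs (9606 = Homo sapiens,
# 562 = Escherichia coli). A real table comes from the NCBI Taxonomy dump.
SYNONYMS = {
    "human": 9606, "homo sapiens": 9606, "h. sapiens": 9606,
    "e. coli": 562, "escherichia coli": 562,
}

def to_taxid(term):
    """Normalize a free-text species term to its NCBI Taxonomy ID, if known."""
    return SYNONYMS.get(term.strip().lower())

assert to_taxid("Homo sapiens") == to_taxid("human") == 9606
assert to_taxid("E. coli") == 562
```

Searching on the taxid (plus its strain/subspecies descendants in the taxonomy tree) then matches every name variant at once.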
Using the DDE registry:
Create a wrapper that makes it easy to check the health of the Hub on each scheduled update, rather than having to dig through the logs to know whether updates failed or succeeded. It should focus on alerting us when something goes wrong and requires intervention. Julia built a slackbot off the /metadata endpoint which has been useful for tracking the size of each outbreak.info source and the last day it was updated:
It'd be cool if something like this were also tied to error states that would require work during an autorelease.
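The alerting logic could start from the same /metadata payload the slackbot reads. A sketch, assuming a per-source shape of record count plus last-updated date (the field names here are guesses, not the actual endpoint schema):

```python
from datetime import date

# Assumed payload shape: {source: {"count": int, "last_updated": "YYYY-MM-DD"}}.
def find_stale_sources(metadata, today, max_age_days=2):
    """Return sources that have dropped to zero records or have not updated
    recently -- the cases that need human intervention."""
    alerts = []
    for source, info in metadata.items():
        age = (today - date.fromisoformat(info["last_updated"])).days
        if info["count"] == 0 or age > max_age_days:
            alerts.append(source)
    return alerts

payload = {
    "zenodo":  {"count": 120000, "last_updated": "2023-03-01"},
    "omicsdi": {"count": 0,      "last_updated": "2023-03-01"},
    "vivli":   {"count": 300,    "last_updated": "2023-02-01"},
}
assert find_stale_sources(payload, date(2023, 3, 1)) == ["omicsdi", "vivli"]
```

Posting only the non-empty alert list to Slack keeps the signal focused on runs that actually need intervention.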
Need to think carefully about how to do this, but how do we combine data that are available in multiple indices? For instance: GEO record from NCBI GEO itself, or Omics DI, or Mendeley. Ideally, this would be a single record, but need to figure out how to merge info, resolve conflicts, etc.
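One possible conflict-resolution policy for that merge, sketched under the assumption that sources can be ranked by authoritativeness (the ranking and field handling below are illustrative choices, not a settled design):

```python
# Assumption: prefer the most authoritative source for scalar fields,
# union multi-valued fields across all sources. Ranking is illustrative.
SOURCE_RANK = {"geo": 0, "omicsdi": 1, "mendeley": 2}  # lower = preferred

def merge_records(records):
    """Merge duplicate records of one dataset harvested from several indices."""
    ordered = sorted(records, key=lambda r: SOURCE_RANK.get(r["source"], 99))
    merged = {}
    for rec in ordered:
        for field, value in rec.items():
            if field == "source":
                continue
            if isinstance(value, list):
                seen = merged.setdefault(field, [])
                seen.extend(v for v in value if v not in seen)
            else:
                merged.setdefault(field, value)  # highest-ranked source wins
    return merged

merged = merge_records([
    {"source": "omicsdi", "name": "GSE1 (mirror)", "keywords": ["rna-seq"]},
    {"source": "geo", "name": "GSE1", "keywords": ["expression"]},
])
assert merged["name"] == "GSE1"
assert merged["keywords"] == ["expression", "rna-seq"]
```

The harder open question, deciding that two records from different indices are in fact the same dataset, still needs a linking key (e.g. a shared GEO accession).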
Add ComputationalTools which are registered according to the DDE NIAID ComputationalTool Guide. These tools will be stored in the DDE API (-- it's empty now).
I don't think there should be much schema adjustment needed -- probably just a creator -> author change, and then call it a day.
OmicsDI harvests all the records from GEO, but with less metadata. Combine relevant metadata and de-duplicate the records. See also #40
https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/14
Now, it looks like all the distribution objects in OmicsDI link to an .xml file (example). However, the actual file metadata object provides a list of downloadable files (in this example, a .pdf).
In the next run of OmicsDI, it would be good to improve the distribution
object to make it easier for the user to find these data files, like:
distribution: [{
  name: "annrheumdis-2012-202031-s1.pdf",
  url: "https://www.ebi.ac.uk/biostudies/files/S-EPMC3841758/annrheumdis-2012-202031-s1.pdf",
  encodingFormat: "application/pdf"
}]
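Such distribution objects could be built from the BioStudies file listing. A sketch: the base URL pattern follows the example above, and treating every listed file as downloadable plus guessing encodingFormat from the extension are assumptions:

```python
import mimetypes

# Sketch: build improved distribution objects from a BioStudies accession
# and its file listing. Assumes files live under /biostudies/files/<acc>/.
def build_distribution(accession, filenames):
    base = f"https://www.ebi.ac.uk/biostudies/files/{accession}"
    return [{
        "name": fn,
        "url": f"{base}/{fn}",
        "encodingFormat": mimetypes.guess_type(fn)[0] or "application/octet-stream",
    } for fn in filenames]

dist = build_distribution("S-EPMC3841758", ["annrheumdis-2012-202031-s1.pdf"])
assert dist == [{
    "name": "annrheumdis-2012-202031-s1.pdf",
    "url": "https://www.ebi.ac.uk/biostudies/files/S-EPMC3841758/annrheumdis-2012-202031-s1.pdf",
    "encodingFormat": "application/pdf",
}]
```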
Via publications, link OmicsDI datasets to funding information.
Major question to consider: is a publication associated with an OmicsDI record only primary data generation, or also secondary analyses of those datasets? Does it matter when connecting to grant numbers?