
nde-crawlers's People

Contributors

candicecz, dylanwelzel, flaneuse, gtsueng, jal347, newgene, nikkibytes, zcqian

nde-crawlers's Issues

Import Seven Bridges Public Apps Gallery metadata as a ComputationalTool

Proposed steps:

  1. Use the SB API to gather all the metadata for the public apps (see the sketch after this list).
  2. (Possibly not necessary if #1 gives enough metadata) Loop over each app's individual page to gather additional metadata via the API.
  3. Map the properties to the ComputationalTool schema. Note: we may need to create new properties (reusing schema.org properties where possible) to fit the rest of the metadata, IF we think it is important enough to include.
  4. Containerize and add to the Hub
  5. Schedule updates
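
A minimal sketch of step 1, assuming the Seven Bridges v2 REST API at api.sbgenomics.com with an X-SBG-Auth-Token developer token and offset/limit pagination; the endpoint, parameters, and response shape should be verified against the current SB API docs.

```python
import requests

API_BASE = "https://api.sbgenomics.com/v2"
HEADERS = {"X-SBG-Auth-Token": "<developer-token>"}  # placeholder token


def iter_public_apps(limit=100):
    """Yield the raw metadata dict for every public app, one page at a time."""
    offset = 0
    while True:
        resp = requests.get(
            f"{API_BASE}/apps",
            headers=HEADERS,
            params={"visibility": "public", "limit": limit, "offset": offset},
        )
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if not items:
            break
        yield from items
        offset += limit


for app in iter_public_apps():
    print(app.get("id"), app.get("name"))
```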

Related refs from @jackDiGi from ~ 5 years ago:

Import metadata from PMC

Pull all publication metadata from the PMC OAI-PMH service, or from the bulk open-access data packages or APIs.

https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=GetRecord&identifier=oai:pubmedcentral.nih.gov:8313480&metadataPrefix=pmc
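
A minimal sketch of pulling a single record with the GetRecord URL pattern above; the supplementary-material element name is an assumption about the JATS-style payload and should be checked against the actual XML returned.

```python
import requests
import xml.etree.ElementTree as ET

OAI_BASE = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi"


def get_pmc_record(pmcid: str) -> ET.Element:
    """Fetch one PMC record (e.g. pmcid='8313480') via OAI-PMH GetRecord."""
    params = {
        "verb": "GetRecord",
        "identifier": f"oai:pubmedcentral.nih.gov:{pmcid}",
        "metadataPrefix": "pmc",
    }
    resp = requests.get(OAI_BASE, params=params, timeout=60)
    resp.raise_for_status()
    return ET.fromstring(resp.content)


record = get_pmc_record("8313480")
# e.g. list any supplementary-material elements in the payload (assumed JATS tag name)
for supp in record.iter("{*}supplementary-material"):
    print(supp.attrib)
```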

Using this data, a number of things can be done:

  1. Create new datasets from the supplementary materials files


  2. Create Dataset -> Publication -> Grant linkages for the existing datasets in the NDE. If a dataset has a citation listed, augment the existing metadata by attaching the funding information provided by PMC.
  3. Parse data availability statements to figure out where the data live and link them to existing datasets.


  4. Use regex parsing to mine the text of the documents to find similar linkages between publications and a small subset of repositories with consistent identifiers. Ideally, this would disambiguate between primary citations (data generation) and secondary citations (publications that reuse the data).
  5. Add additional Dataset -> Publication -> Grant linkages via the PMC "related information" structured metadata.


Improve searchability by grant numbers

NIH grant numbers follow a very consistent underlying structure, BUT there are lots of variations in how they are written, which impedes their searchability.
e.g.
U01AI151810-02 vs. U01AI151810-01 vs. U01AI15181002 vs. 1U01 AI151810-02 vs. U01 AI151810-02, etc.
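
A hedged sketch of normalizing these variants to a canonical core form (activity code + institute code + 6-digit serial), so that everything above collapses to U01AI151810. The regex covers only the variations listed here, not the full NIH grant-number grammar.

```python
import re

# optional application type digit, activity code, institute code, serial, optional support year
CORE = re.compile(
    r"""
    (?:\d)?           # optional leading application type, e.g. the '1' in 1U01
    \s*
    ([A-Z]\d{2})      # activity code, e.g. U01
    \s*
    ([A-Z]{2})        # institute/center code, e.g. AI
    \s*
    (\d{6})           # serial number
    (?:[-\s]?\d{2})?  # optional support year, e.g. -02
    """,
    re.VERBOSE,
)


def normalize_grant(raw):
    """Return the canonical core grant number, or None if no match."""
    m = CORE.search(raw.upper())
    return "".join(m.groups()) if m else None


for g in ["U01AI151810-02", "U01AI151810-01", "U01AI15181002", "1U01 AI151810-02", "U01 AI151810-02"]:
    print(g, "->", normalize_grant(g))  # all print U01AI151810
```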

Implement caching for sources with large number of records

General protocol: metadata crawlers that harvest a large number of records typically take a few days to complete a full run. Implement caching so that roughly once a month we do a full run that wipes and rebuilds ALL the metadata (to catch any changes to existing records), while the daily updates only harvest metadata from new records (a sketch follows the list below).

This will need to be implemented in the harvesters that ingest a lot of data, including:

  • Zenodo
  • OmicsDI
  • Figshare
  • Dataverse
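
A minimal sketch of the protocol above, assuming each harvester can list all record identifiers cheaply and fetch one record by ID; the cache file location and the ~30-day full-refresh window are illustrative choices.

```python
import json
import time
from pathlib import Path

CACHE = Path("harvest_cache.json")     # illustrative cache location
FULL_REFRESH_SECONDS = 30 * 24 * 3600  # ~ monthly full wipe


def load_cache():
    if CACHE.exists():
        return json.loads(CACHE.read_text())
    return {"last_full_run": 0, "seen_ids": []}


def harvest(list_all_ids, fetch_record):
    """Yield only new records daily; wipe and re-harvest everything ~once a month."""
    cache = load_cache()
    full_run = time.time() - cache["last_full_run"] > FULL_REFRESH_SECONDS
    seen = set() if full_run else set(cache["seen_ids"])

    for record_id in list_all_ids():
        if record_id in seen:
            continue  # daily run: skip already-harvested records
        yield fetch_record(record_id)
        seen.add(record_id)

    cache["seen_ids"] = sorted(seen)
    if full_run:
        cache["last_full_run"] = time.time()
    CACHE.write_text(json.dumps(cache))
```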

Import ClinEpiDB metadata

Example dataset: https://clinepidb.org/ce/app/workspace/analyses/DS_4902d9b7ec/new/details
and data downloads: https://clinepidb.org/ce/app/workspace/analyses/DS_4902d9b7ec/new/download

To do:

  1. Double-check that there is no API access.
  2. Ensure the web/data usage agreements allow for scraping.
  3. Figure out how to get the study IDs from the overall table: https://clinepidb.org/ce/app/search/dataset/Studies/result
  4. Loop over the IDs to grab all the metadata from the various tabs on the dataset pages (sketched below).
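
If steps 1-2 confirm that scraping is the only (and permitted) route, a rough sketch for steps 3-4 could look like the following. The URL patterns and HTML structure are assumptions based on the example pages above; if the pages turn out to be client-rendered, the site's underlying web services or a headless browser would be needed instead.

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://clinepidb.org"


def get_study_ids():
    """Collect DS_* study identifiers from the studies result table (step 3)."""
    html = requests.get(f"{BASE}/ce/app/search/dataset/Studies/result", timeout=60).text
    soup = BeautifulSoup(html, "html.parser")
    # assumption: study links embed the DS_ accession somewhere in their href
    return sorted({
        part
        for a in soup.find_all("a", href=True)
        for part in a["href"].split("/")
        if part.startswith("DS_")
    })


def get_study_details(study_id):
    """Fetch the details page for one study (step 4); per-tab parsing would go here."""
    url = f"{BASE}/ce/app/workspace/analyses/{study_id}/new/details"
    return requests.get(url, timeout=60).text
```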

Improve / augment metadata

Improve the standardization of metadata; augment available metadata with additional sources, create linkages between metadata records, etc.

Potentially augment Zenodo records

Not a priority atm, but if needed, we can access the additional missing Zenodo metadata via their metadata dump and then fill in the records created since July 2021.

Context: the OAI-PMH metadata API is the only way to get all the records programmatically, but it lacks key metadata fields, including files, citations, and software/code.

Add original source metadata to OmicsDI

Since OmicsDI is a metadata repository, it would be good to save the provider information in the sdPublisher field. So for this record it would be something like:

sdPublisher: [{
      name: "BioStudies",
      identifier: "S-EPMC3841080",
      url: "https://www.ebi.ac.uk/biostudies/studies/S-EPMC3841080"
      }]

It looks like that info comes from zipping repository and full_dataset_link (though database is also related)?
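
A small sketch of assembling sdPublisher by zipping those fields, assuming repository and full_dataset_link are parallel lists on the OmicsDI record; the identifier extraction from the URL is also an assumption.

```python
def build_sd_publisher(record):
    """Build sdPublisher entries from an OmicsDI record's provider fields."""
    repositories = record.get("repository") or []
    links = record.get("full_dataset_link") or []
    publishers = []
    for name, url in zip(repositories, links):
        publishers.append({
            "name": name,
            "identifier": url.rstrip("/").rsplit("/", 1)[-1],  # e.g. S-EPMC3841080
            "url": url,
        })
    return publishers
```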


Refactor common NDE, RDP, outbreak pipelines

Convert the separate but redundant infrastructure shared between the NDE, RDP, and outbreak projects into a common set of crawlers, with a filtered portion of the API being redirected to each separate project.

Related to #16 (custom query filters for NDE).

Create SRA Crawler

  • Access SRA via its API (a sketch follows below).
  • Store each accession number as an individual dataset.
  • Note: will need to make sure to grab just the metadata for a record, not the actual sequencing data.

NOTE: Will need to consider how we also incorporate related NCBI projects: BioProject and GenBank.

Basically -- also see more on the divisions here:

  • SRA = raw sequencing data
  • GenBank = processed sequencing data to create whole genomes
  • BioProject = metadata about the experiment that was used to create the SRA and/or GenBank data, including BioSamples, etc.
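
A hedged sketch of accessing SRA metadata (not the sequence data) through NCBI E-utilities: esearch to enumerate record UIDs, then efetch for each record's experiment XML. The search term, batching, rate limiting, and an API key would all need to be tuned for a full crawl.

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def search_sra(term, retmax=100):
    """Return SRA UIDs matching a search term via esearch."""
    resp = requests.get(
        f"{EUTILS}/esearch.fcgi",
        params={"db": "sra", "term": term, "retmax": retmax, "retmode": "json"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]


def fetch_sra_metadata(uid):
    """Return the experiment-package XML for one SRA UID (metadata only)."""
    resp = requests.get(f"{EUTILS}/efetch.fcgi", params={"db": "sra", "id": uid}, timeout=60)
    resp.raise_for_status()
    return resp.text
```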

Customize query behavior for NDE API

Could involve:

  • Adding boosting to improve the quality of results
  • Filtering out particular data sources / results to make them more applicable to immune-mediated and infectious diseases (see the illustrative query below).
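
Illustrative only: if the API is backed by Elasticsearch, boosting could be expressed as per-field weights and source filtering as a filter clause. The field names and source values below are assumptions about the index mapping, not the actual NDE configuration.

```python
boosted_query = {
    "query": {
        "bool": {
            "must": {
                "query_string": {
                    "query": "tuberculosis",
                    # weight name/description matches more heavily than other fields
                    "fields": ["name^4", "description^2", "keywords"],
                }
            },
            "filter": [
                # keep only sources relevant to immune-mediated and infectious diseases
                {"terms": {"includedInDataCatalog.name": ["ImmPort", "NCBI SRA"]}}
            ],
        }
    }
}
```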

Create API documentation website

Better document the transformations to convert a source to the NDE schema

Right now, understanding how we've manipulated the metadata from a source requires going through the parsers and reading the code, so our manipulations are not very transparent.

Is there a better way to document these manipulations, essentially creating a crosswalk dictionary that maps the input to the output for each parser?

Do we want to publish these crosswalks?
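
One possible approach: keep a declarative crosswalk per parser that maps source fields to NDE schema properties (plus a short transformation note), then render it to a human-readable table for the docs. The field names below are illustrative, not the actual Zenodo mapping.

```python
ZENODO_CROSSWALK = {
    # source field             -> (NDE/schema.org property, transformation note)
    "metadata.title":             ("name", "copied as-is"),
    "metadata.description":       ("description", "HTML stripped"),
    "metadata.creators":          ("author", "split into name/affiliation objects"),
    "metadata.publication_date":  ("datePublished", "normalized to ISO 8601"),
}


def crosswalk_to_markdown(crosswalk):
    """Render a crosswalk as a markdown table for the documentation site."""
    rows = ["| Source field | NDE property | Transformation |", "| --- | --- | --- |"]
    rows += [f"| {src} | {prop} | {note} |" for src, (prop, note) in crosswalk.items()]
    return "\n".join(rows)
```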

Create a reporter to track the status of a data build

Create a wrapper that makes it easy to check on the health of scheduled Hub updates, rather than having to dig through the logs to know whether an update failed or succeeded. It should focus on alerting us when something goes wrong and requires intervention. Julia built a slackbot off the /metadata endpoint which has been useful for tracking the size of each outbreak.info source and the last day it was updated.

It'd be cool if something like this was also tied to error states that would require work during an autorelease.
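
A minimal sketch of the reporter idea, assuming a /metadata endpoint that exposes a per-source last-update timestamp and a Slack incoming webhook for alerts; the URLs and the response shape below are placeholders, not the actual API.

```python
import requests
from datetime import datetime, timedelta

METADATA_URL = "https://api.example.org/v1/metadata"            # placeholder endpoint
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook


def report_stale_sources(stale_after_days=2):
    """Post a Slack alert for any source whose last build looks too old."""
    meta = requests.get(METADATA_URL, timeout=60).json()
    cutoff = datetime.utcnow() - timedelta(days=stale_after_days)
    stale = []
    # assumed response shape: {"src": {"<source>": {"upload_date": "<naive ISO timestamp>", ...}}}
    for source, info in meta.get("src", {}).items():
        uploaded = info.get("upload_date")
        if uploaded and datetime.fromisoformat(uploaded) < cutoff:
            stale.append(f"{source}: last updated {uploaded}")
    if stale:
        requests.post(SLACK_WEBHOOK, json={"text": ":warning: Stale sources:\n" + "\n".join(stale)})
```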

De-duplicate / combine overlapping records for Mendeley

Need to think carefully about how to do this, but how do we combine data that are available in multiple indices? For instance, a GEO record that comes from NCBI GEO itself, from OmicsDI, or from Mendeley. Ideally, this would be a single record, but we need to figure out how to merge the info, resolve conflicts, etc.

Iterative improvement to the OmicsDI crawler

Right now, it looks like all the distribution objects in OmicsDI link to an .xml file (example). However, the actual file metadata object provides a list of downloadable files (in this example, a .pdf).

In the next run of OmicsDI, it would be good to improve the distribution object to make it easier for the user to find these data files, like:

distribution: [{
    name: "annrheumdis-2012-202031-s1.pdf",
    url: "https://www.ebi.ac.uk/biostudies/files/S-EPMC3841758/annrheumdis-2012-202031-s1.pdf",
    encodingFormat: "application/pdf"
}]

Potentially add funding info to OmicsDI

Via publications, link OmicsDI datasets to funding information.

Major question to consider: does a publication associated with an OmicsDI record represent only primary data generation, or can it also be a secondary analysis of the dataset? Does it matter when connecting to grant numbers?
