
bold_sequence's Introduction

BOLD DNA sequence

Linking BOLD DNA sequences to specimens published in GBIF

Linking DNA sequence barcode data from BOLD to specimens in GBIF has a high priority in the GBIF work plan. In December 2016 the GBIF Science Committee, represented by SC chair Rod Page, published a snapshot of the iBOL dataset (doi:10.15468/inygc6) with a total of 2,789,906 occurrences. However, the link to the museum specimens themselves has not been maintained. Example: gbifKey:1415958347 and the corresponding BOLD data record with processid:LON2542-15.

The most reliable specimen identifier in GBIF is dwc:occurrenceID. There is also the traditional and more human-readable dwc:catalogNumber, which identifies a museum specimen. The BOLD Process ID is the most important identifier for the material samples corresponding to the museum specimens. BOLD also provides a "Museum ID" and a "Sample ID"; however, neither exactly matches the occurrenceID or the catalogNumber in GBIF.

| GBIF | BOLD |
| --- | --- |
| occurrenceKey = 1426521030 | Process ID = NOBAS010-14 |
| occurrenceID = urn:catalog:O:F:75130 | Museum ID = O-F-75130 |
| catalogNumber = 75130 | Sample ID = O-F-75130 |
| eventID/fieldNumber = [blank] | Field ID = MY1-0568 |
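
For the single record compared above, the BOLD Museum ID looks like the GBIF occurrenceID with the `urn:catalog:` prefix dropped and the colons replaced by hyphens. The Python sketch below expresses that guess. The function name is hypothetical and the rule is inferred from this one example only, so it would need checking against many more records before being relied on.

```python
def occurrence_id_to_museum_id(occurrence_id: str) -> str:
    """Guess the BOLD Museum ID that corresponds to a GBIF occurrenceID.

    Assumption (inferred from one example, not a documented rule): the
    occurrenceID has the form 'urn:catalog:<institution>:<collection>:<number>'
    and BOLD joins the same three parts with hyphens.
    """
    prefix = "urn:catalog:"
    if not occurrence_id.startswith(prefix):
        raise ValueError(f"not a urn:catalog identifier: {occurrence_id!r}")
    parts = occurrence_id[len(prefix):].split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected three non-empty parts: {occurrence_id!r}")
    return "-".join(parts)
```

With the record above, `occurrence_id_to_museum_id("urn:catalog:O:F:75130")` returns `"O-F-75130"`, which matches both the Museum ID and the Sample ID shown for NOBAS010-14.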

Mapping from BOLD API to GBIF IPT

Feedback on the proposed mapping is most welcome via the issue tracker! In particular, what would be the appropriate measurementType and measurementMethod?

  • measurementID = boldAPI:processid
  • measurementType = "BOLD-sequence" [alt. = BOLD-sequence + (markercode)]
  • measurementValue = boldAPI:nucleotides
  • measurementAccuracy = NULL
  • measurementUnit = NULL
  • measurementDeterminedDate = boldAPI:run_dates
  • measurementDeterminedBy = boldAPI:sequencing_centers
  • measurementMethod = boldAPI:markercode [alt. = boldAPI:seq_primers]
  • measurementRemarks = http://www.boldsystems.org/index.php/API_Public/sequence?ids= + processid
  • type = "StillImage"
  • format = "image/jpeg"
  • identifier = boldAPI:image_urls
  • references
  • title = occurrenceID [alt. = processID]
  • description = boldAPI:captions
  • created = boldAPI:copyright_years
  • creator = boldAPI:photographers
  • contributor
  • publisher = boldAPI:copyright_institutions (??)
  • audience = "experts"
  • source = "BOLD"
  • license = boldAPI:copyright_licenses
  • rightsHolder = boldAPI:copyright_institutions
  • datasetID
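
The proposed mapping could be applied roughly as in the Python sketch below. It assumes each BOLD record is already available as a flat dict keyed by the boldAPI field names used in the list above. The function names and that record layout are assumptions made for illustration and are not part of this repository. Terms without a proposed source in the list (references, contributor, datasetID) are left out, and the alternative mappings given in brackets are not implemented.

```python
# Base URL taken from the measurementRemarks line of the mapping above.
BOLD_SEQUENCE_URL = "http://www.boldsystems.org/index.php/API_Public/sequence?ids="


def to_measurement_row(record: dict) -> dict:
    """Build one measurementOrFact row from a BOLD record (hypothetical helper)."""
    processid = record["processid"]  # required: it is the row identifier
    return {
        "measurementID": processid,
        "measurementType": "BOLD-sequence",
        "measurementValue": record.get("nucleotides"),
        "measurementAccuracy": None,
        "measurementUnit": None,
        "measurementDeterminedDate": record.get("run_dates"),
        "measurementDeterminedBy": record.get("sequencing_centers"),
        "measurementMethod": record.get("markercode"),
        "measurementRemarks": BOLD_SEQUENCE_URL + processid,
    }


def to_multimedia_row(record: dict, occurrence_id: str) -> dict:
    """Build one multimedia row from a BOLD record (hypothetical helper)."""
    return {
        "type": "StillImage",
        "format": "image/jpeg",
        "identifier": record.get("image_urls"),
        "title": occurrence_id,
        "description": record.get("captions"),
        "created": record.get("copyright_years"),
        "creator": record.get("photographers"),
        "publisher": record.get("copyright_institutions"),
        "audience": "experts",
        "source": "BOLD",
        "license": record.get("copyright_licenses"),
        "rightsHolder": record.get("copyright_institutions"),
    }
```

Fields missing from a record come out as `None`, so the caller has to decide how to write them to the IPT source file, for example as empty strings.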

bold_sequence's People

Contributors

dagendresen


bold_sequence's Issues

measurementOrFact versus GGBN

@dagendresen Great to see this being done. I'm curious about the relative merits of using measurementOrFact to store the link and sequence versus, say, the GGBN extensions. In one example DNA barcode dataset I uploaded I used GGBN, so the sequences look like this (from https://api.gbif.org/v1/occurrence/1502684137/fragment):

"extensions": {
        "http://data.ggbn.org/schemas/ggbn/terms/Amplification": [
            {
                "consensusSequence": "CCTTTATCTAGTATTTGGTGCTTGAGCTGGAATAGTAGGCACAGCCTTAAGCCTTCTCATTCGAGCAGAACTAAGCCAACCTGGCGCACTCTTAGGAGACGACCAAATCTATAATGTTATTGTTACTGCACATGCCTTCGTAATGATTTTCTTTATAGTAATGCCAATTCTAATCGGGGGGTTTGGAAACTGATTAGTTCCTCTCATGCTTGGAGCCCCTGATATGGCATTCCCTCGTATGAACAACATAAGCTTCTGATTACTCCCTCCGTCATTCCTCCTTTTACTAGCTTCTTCCGGAGTTGAGGCCGGAGCCGGGACAGGTTGAACTGTCTACCCCCCACTGTCTGGTAATCTAGCCCATGCGGGAGCATCAGTAGATTTAACCATCTTCTCCCTGCACCTGGCAGGTATTTCATCAATCCTAGGAGCAATCAACTTTATCACTACCATCATCAACATAAAACCCCCCGCTATCTCTCAATACCAAACTCCTTTATTTGTTTGGGCTGTTCTAATTACTGCCGTTCTTCTACTCCTATCTCTCCCAGTCCTAGCTGCTGGCATTACTATGCTCCTGACCGACCGAAATCTTAATACTACCTTCTTCGATCCCGCAGGAGGAGGAGACCCAATTCTTTACCAACACCTC",
                "geneticAccessionNumber": "KP194104",
                "marker": "COI-5P"
            }
        ]
    },

I don't know whether GGBN will be widely adopted, nor how much data like this GBIF is likely to get. It is also rather hidden in the current GBIF portal: it's not displayed in the HTML view, so you have to go through the API.

I guess measurementOrFact has the advantage that the portal supports it already, so people can actually see the sequences (this opens up all sorts of interesting possibilities, such as GBIF analysing sequence data).

The other issue is duplication. As GBIF ingests more and more BOLD sequences, existing records will be duplicated. What if we linked those duplicates? In other words, say not only that this GBIF occurrence from a museum has this DNA barcode, but also that the DNA barcode is itself in GBIF as occurrence xxx?
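
Since the GGBN data is reachable only through the API, as noted in the issue above, here is a minimal Python sketch for pulling the Amplification entries out of an occurrence fragment. It assumes the fragment is JSON with a top-level "extensions" object keyed by the extension URI, as the excerpt above suggests. Both function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Extension URI as it appears in the fragment excerpt above.
AMPLIFICATION = "http://data.ggbn.org/schemas/ggbn/terms/Amplification"


def amplification_records(fragment: dict) -> list:
    """Return the GGBN Amplification entries of a parsed fragment, or []."""
    return fragment.get("extensions", {}).get(AMPLIFICATION, [])


def fetch_fragment(occurrence_key: int) -> dict:
    """Download and parse the fragment of one GBIF occurrence (needs network)."""
    url = f"https://api.gbif.org/v1/occurrence/{occurrence_key}/fragment"
    with urlopen(url) as response:
        return json.load(response)
```

Calling `amplification_records(fetch_fragment(1502684137))` should, under those assumptions, yield a list of dicts with keys such as `marker`, `geneticAccessionNumber` and `consensusSequence`.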
