bold_sequence's Introduction

BOLD DNA sequence

Linking BOLD DNA sequences to specimens published in GBIF

Linking DNA barcode sequence data from BOLD to specimens in GBIF has high priority in the GBIF work plan. In December 2016 the GBIF Science Committee, represented by its chair Rod Page, published a snapshot of the iBOL dataset (doi:10.15468/inygc6) comprising 2,789,906 occurrences. However, the link to the museum specimens themselves has not been maintained. Example: gbifKey:1415958347 and the corresponding BOLD data record with processid:LON2542-15.

The most reliable specimen identifier in GBIF is dwc:occurrenceID. There is also the traditional and more human-readable dwc:catalogNumber, which identifies a museum specimen. The BOLD Process ID is the most important identifier for the material samples corresponding to the museum specimens. BOLD also provides a "Museum ID" and a "Sample ID"; however, neither exactly matches the occurrenceID or the catalogNumber in GBIF.

GBIF	BOLD
occurrenceKey = 1426521030	Process ID = NOBAS010-14
occurrenceID = urn:catalog:O:F:75130	Museum ID = O-F-75130
catalogNumber = 75130	Sample ID = O-F-75130
eventID/fieldNumber = [blank]	Field ID = MY1-0568
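
The identifier comparison above suggests that, for this one record, the BOLD Museum ID can be rewritten into the GBIF occurrenceID and catalogNumber. A minimal Python sketch of that rewrite follows. The function name `museum_id_to_gbif_ids` is hypothetical, and the `<institution>-<collection>-<number>` pattern is an assumption drawn only from the single example in the table; other collections may format their identifiers differently.

```python
def museum_id_to_gbif_ids(museum_id: str) -> dict:
    """Derive candidate GBIF identifiers from a BOLD Museum ID.

    Assumes the pattern <institution>-<collection>-<number>, as in the
    single example above. This is an illustration, not a general rule.
    """
    parts = museum_id.split("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected Museum ID format: {museum_id!r}")
    institution, collection, number = parts
    return {
        "occurrenceID": f"urn:catalog:{institution}:{collection}:{number}",
        "catalogNumber": number,
    }


ids = museum_id_to_gbif_ids("O-F-75130")
print(ids["occurrenceID"])   # urn:catalog:O:F:75130
print(ids["catalogNumber"])  # 75130
```

Identifiers that do not fit the assumed pattern raise a ValueError, so that they can be reviewed by hand instead of being matched silently.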

BOLD URL: http://bins.boldsystems.org/index.php/Public_RecordView?processid=NOBAS010-14
BOLD API: http://www.boldsystems.org/index.php/API_Public/sequence?ids=NOBAS010-14
GBIF URL: https://www.gbif.org/occurrence/1426521030
GBIF API: http://api.gbif.org/v1/occurrence/1426521030/verbatim
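
The two API addresses above follow a simple pattern, so they can be built from a Process ID or a GBIF occurrence key. The sketch below only constructs the URLs and makes no network request. The helper names `bold_sequence_url` and `gbif_verbatim_url` are hypothetical; the URL patterns are copied from the examples listed above.

```python
from urllib.parse import urlencode

# Base addresses copied from the example URLs above.
BOLD_SEQUENCE_API = "http://www.boldsystems.org/index.php/API_Public/sequence"
GBIF_OCCURRENCE_API = "http://api.gbif.org/v1/occurrence"


def bold_sequence_url(process_id: str) -> str:
    """Build the BOLD sequence API URL for one Process ID."""
    return f"{BOLD_SEQUENCE_API}?{urlencode({'ids': process_id})}"


def gbif_verbatim_url(occurrence_key: int) -> str:
    """Build the GBIF API URL for the verbatim view of one occurrence."""
    return f"{GBIF_OCCURRENCE_API}/{occurrence_key}/verbatim"


print(bold_sequence_url("NOBAS010-14"))
print(gbif_verbatim_url(1426521030))
```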

Mapping from BOLD API to GBIF IPT

Feedback on the proposed mapping is most welcome via the issue tracker. What would be the appropriate measurementType and measurementMethod?

MeasurementOrFact :: http://rs.gbif.org/extension/dwc/measurements_or_facts.xml

measurementID = boldAPI:processid
measurementType = "BOLD-sequence" [alt. = BOLD-sequence + (markercode)]
measurementValue = boldAPI:nucleotides
measurementAccuracy = NULL
measurementUnit = NULL
measurementDeterminedDate = boldAPI:run_dates
measurementDeterminedBy = boldAPI:sequencing_centers
measurementMethod = boldAPI:markercode [alt. = boldAPI:seq_primers]
measurementRemarks = http://www.boldsystems.org/index.php/API_Public/sequence?ids= + processid
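
The proposed MeasurementOrFact mapping can be sketched as a small Python function. It assumes a BOLD record has already been parsed into a dict keyed by the boldAPI field names used above; the function name `to_measurement_or_fact` and the sample values are hypothetical placeholders, not real BOLD data. Only the primary mapping is implemented, not the alternatives given in brackets.

```python
BOLD_SEQUENCE_API = "http://www.boldsystems.org/index.php/API_Public/sequence?ids="


def to_measurement_or_fact(record: dict) -> dict:
    """Map one parsed BOLD record to a MeasurementOrFact row."""
    process_id = record["processid"]
    return {
        "measurementID": process_id,
        "measurementType": "BOLD-sequence",
        "measurementValue": record.get("nucleotides"),
        "measurementAccuracy": None,  # NULL in the proposed mapping
        "measurementUnit": None,      # NULL in the proposed mapping
        "measurementDeterminedDate": record.get("run_dates"),
        "measurementDeterminedBy": record.get("sequencing_centers"),
        "measurementMethod": record.get("markercode"),
        "measurementRemarks": BOLD_SEQUENCE_API + process_id,
    }


# Placeholder record for illustration only; the values are not real BOLD data.
sample = {
    "processid": "NOBAS010-14",
    "nucleotides": "ACGT",
    "markercode": "COI-5P",
}
row = to_measurement_or_fact(sample)
print(row["measurementRemarks"])
```

Fields missing from the record come out as None, which leaves it to the export step to decide how empty values are written.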

Simple Multimedia :: http://rs.gbif.org/extension/gbif/1.0/multimedia.xml

type = "StillImage"
format = "image/jpeg"
identifier = boldAPI:image_urls
references
title = occurrenceID [alt. = processID]
description = boldAPI:captions
created = boldAPI:copyright_years
creator = boldAPI:photographers
contributor
publisher = boldAPI:copyright_institutions (??)
audience = "experts"
source = "BOLD"
license = boldAPI:copyright_licenses
rightsHolder = boldAPI:copyright_institutions
datasetID
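
The Simple Multimedia mapping can be sketched in the same way. This assumes one image per parsed BOLD record, with the boldAPI field names above as dict keys; how BOLD encodes several images for one record is not covered here. The function name `to_simple_multimedia` and the sample values are hypothetical placeholders. The terms left without a mapping in the list (references, contributor, datasetID) are omitted.

```python
def to_simple_multimedia(record: dict, occurrence_id: str) -> dict:
    """Map one parsed BOLD record to a Simple Multimedia row."""
    return {
        "type": "StillImage",
        "format": "image/jpeg",
        "identifier": record.get("image_urls"),
        "title": occurrence_id,
        "description": record.get("captions"),
        "created": record.get("copyright_years"),
        "creator": record.get("photographers"),
        # The source marks this publisher mapping as uncertain ("??").
        "publisher": record.get("copyright_institutions"),
        "audience": "experts",
        "source": "BOLD",
        "license": record.get("copyright_licenses"),
        "rightsHolder": record.get("copyright_institutions"),
    }


# Placeholder record for illustration only; the values are not real BOLD data.
sample = {
    "image_urls": "https://example.org/image.jpg",
    "copyright_institutions": "Example Museum",
}
row = to_simple_multimedia(sample, "urn:catalog:O:F:75130")
print(row["title"], row["identifier"])
```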

bold_sequence's Issues

measurementOrFact versus GGBN

@dagendresen Great to see this being done. I'm curious as to the relative merits of measurementOrFact being used to store the link and sequence versus, say, the GGBN extensions. In one example DNA barcode dataset I uploaded I used GGBN, so the sequences look like this https://api.gbif.org/v1/occurrence/1502684137/fragment:

"extensions": {
        "http://data.ggbn.org/schemas/ggbn/terms/Amplification": [
            {
                "consensusSequence": "CCTTTATCTAGTATTTGGTGCTTGAGCTGGAATAGTAGGCACAGCCTTAAGCCTTCTCATTCGAGCAGAACTAAGCCAACCTGGCGCACTCTTAGGAGACGACCAAATCTATAATGTTATTGTTACTGCACATGCCTTCGTAATGATTTTCTTTATAGTAATGCCAATTCTAATCGGGGGGTTTGGAAACTGATTAGTTCCTCTCATGCTTGGAGCCCCTGATATGGCATTCCCTCGTATGAACAACATAAGCTTCTGATTACTCCCTCCGTCATTCCTCCTTTTACTAGCTTCTTCCGGAGTTGAGGCCGGAGCCGGGACAGGTTGAACTGTCTACCCCCCACTGTCTGGTAATCTAGCCCATGCGGGAGCATCAGTAGATTTAACCATCTTCTCCCTGCACCTGGCAGGTATTTCATCAATCCTAGGAGCAATCAACTTTATCACTACCATCATCAACATAAAACCCCCCGCTATCTCTCAATACCAAACTCCTTTATTTGTTTGGGCTGTTCTAATTACTGCCGTTCTTCTACTCCTATCTCTCCCAGTCCTAGCTGCTGGCATTACTATGCTCCTGACCGACCGAAATCTTAATACTACCTTCTTCGATCCCGCAGGAGGAGGAGACCCAATTCTTTACCAACACCTC",
                "geneticAccessionNumber": "KP194104",
                "marker": "COI-5P"
            }
        ]
    },

I don't know whether GGBN will be widely adopted, nor how much data like this GBIF is likely to get. It is also rather hidden in the current GBIF portal: it's not displayed in the HTML view, so you have to go through the API.
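
Since the sequence is only reachable through the API, a short Python sketch of pulling it out of a parsed response may help. It assumes the response has been loaded into a dict with the shape of the excerpt quoted above; the function name `amplification_sequences` is hypothetical, and the sequence in the sample is shortened for illustration.

```python
GGBN_AMPLIFICATION = "http://data.ggbn.org/schemas/ggbn/terms/Amplification"


def amplification_sequences(occurrence: dict) -> list:
    """Return (marker, accession, sequence) tuples from a parsed occurrence.

    Records without the GGBN Amplification extension yield an empty list.
    """
    rows = occurrence.get("extensions", {}).get(GGBN_AMPLIFICATION, [])
    return [
        (
            row.get("marker"),
            row.get("geneticAccessionNumber"),
            row.get("consensusSequence"),
        )
        for row in rows
    ]


# Same shape as the quoted excerpt; the sequence is shortened for illustration.
sample = {
    "extensions": {
        GGBN_AMPLIFICATION: [
            {
                "consensusSequence": "CCTTTATCTAGTATTTGGTGC",
                "geneticAccessionNumber": "KP194104",
                "marker": "COI-5P",
            }
        ]
    }
}
for marker, accession, sequence in amplification_sequences(sample):
    print(marker, accession, len(sequence))
```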

I guess measurementOrFact has the advantage that the portal supports it already, so people can actually see the sequences (this opens up all sorts of interesting possibilities, such as GBIF analysing sequence data).

The other issue is duplication. As GBIF ingests more and more BOLD sequences, existing records will be duplicated. What if we linked those duplicates? In other words, not only say that this GBIF occurrence from a museum has this DNA barcode, but that DNA barcode is also in GBIF as occurrence xxx?

Repository: gbif-europe/bold_sequence (GitHub)