
zeale_coi_database's Introduction

Update 12-01-2019: This COI database is likely still useful for some arthropod metabarcoding applications, but it can be improved upon with more recent sequence curation software (please see https://github.com/RTRichar/MetaCurator and https://github.com/RTRichar/MetabarcodeDBsV2).

This repository contains an arthropod reference sequence database, together with the commands used to curate (COI_curation_workflow.md) and test (COI_testing_workflow.md) it. It is associated with the following publication: https://peerj.com/articles/5126/. Two Perl scripts for sequence formatting and for retrieval of taxonomies using the NCBI Taxonomy module are from Sickel et al. (2015) and can be found at the associated GitHub page: https://github.com/molbiodiv/meta-barcoding-dual-indexing. The commands are provided so that readers can work through our analyses independently and apply the approach to their own work. Note that the syntax, commands and software used may not be entirely transferable to future applications, given differences in computational architecture, software updates, etc. Further, the commands are given without guidance on directory organization, which we leave to the discretion of the reader. Lastly, because commands transfer easily from one analysis to another, we do not provide commands for every analysis performed in the paper, and we assume users are sufficiently fluent to troubleshoot the issues that routinely occur during any extensive, large-scale data manipulation.

While we provide commands for executing these processes serially, many steps, such as the acquisition of taxonomic lineages from the NCBI Taxonomy module, can be manually parallelized by splitting the input into smaller files, running the command on each file in an automated fashion and then combining the resulting output files. We also found this approach necessary to effectively extract the 157 bp region of interest using the Metaxa2 Database Builder Tool. Extracting these sequences from a single file of all available arthropod entries was problematic because the alignment processes used to extract the region of interest relied more heavily on heuristic techniques, resulting in imprecise extraction.
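The split-and-combine approach described above can be sketched as follows. This is a minimal illustration, not the commands used in the workflow files: the function names (`read_fasta_records`, `split_fasta`, `combine_files`), the chunk file naming and the chunk size are all hypothetical, and the per-chunk command itself (for example, lineage retrieval or region extraction) is left to the reader.

```python
from pathlib import Path
from typing import Iterator, List


def read_fasta_records(path: Path) -> Iterator[str]:
    """Yield each FASTA record (header plus sequence lines) as one string."""
    record: List[str] = []
    with path.open() as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            if line.startswith(">") and record:
                yield "".join(record)
                record = []
            # Guarantee a trailing newline so records concatenate cleanly.
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def _write_chunk(out_dir: Path, index: int, records: List[str]) -> Path:
    """Write one chunk file; the naming scheme is an arbitrary choice."""
    chunk_path = out_dir / f"chunk_{index:04d}.fasta"
    chunk_path.write_text("".join(records))
    return chunk_path


def split_fasta(path: Path, out_dir: Path, records_per_chunk: int) -> List[Path]:
    """Split a FASTA file into files of at most `records_per_chunk` records."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths: List[Path] = []
    buffer: List[str] = []
    for record in read_fasta_records(path):
        buffer.append(record)
        if len(buffer) == records_per_chunk:
            chunk_paths.append(_write_chunk(out_dir, len(chunk_paths), buffer))
            buffer = []
    if buffer:  # write the final, possibly smaller, chunk
        chunk_paths.append(_write_chunk(out_dir, len(chunk_paths), buffer))
    return chunk_paths


def combine_files(paths: List[Path], combined_path: Path) -> None:
    """Concatenate per-chunk output files, in order, into a single file."""
    with combined_path.open("w") as out:
        for path in paths:
            out.write(path.read_text())
```

In use, one would call `split_fasta` on the full input, run the chosen command on each returned chunk path (serially or in parallel), and pass the resulting output files to `combine_files`. Splitting on record boundaries rather than line counts keeps multi-line sequences intact.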

As with any data curation pipeline, users should search the final dataset for any irregularities that may have occurred during processing. In our case, we used a number of grep commands with Perl regular expressions to confirm that all entries meet expectations with respect to the fasta headers, sequences and taxonomic lineage information. In general, NCBI Taxonomy lineages contain numerous artifacts. Some artifacts, like open nomenclature, are expected. However, many have no clear explanation, and we recommend performing multiple rounds of testing and curating to ensure that the resulting database optimally meets the assumptions of hierarchical classification.
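As an illustration of the kind of checks described above, the following sketch flags sequences with unexpected characters or lengths and lineages with missing ranks or open nomenclature. It is not the set of grep commands we used. It assumes a hypothetical lineage format of semicolon-separated ranks, and the function names, the length bounds and the list of open nomenclature abbreviations are illustrative choices that should be adapted to the database at hand.

```python
import re
from typing import List

# IUPAC nucleotide codes, upper or lower case.
VALID_SEQUENCE = re.compile(r"[ACGTRYSWKMBDHVN]+", re.IGNORECASE)

# A few common open nomenclature abbreviations (illustrative, not exhaustive).
OPEN_NOMENCLATURE = re.compile(r"\b(sp|cf|aff|nr)\.")


def check_sequence(seq_id: str, sequence: str, min_len: int, max_len: int) -> List[str]:
    """Return a list of problems found in one sequence (empty if none)."""
    problems: List[str] = []
    if not VALID_SEQUENCE.fullmatch(sequence):
        problems.append(f"{seq_id}: sequence is empty or has non-IUPAC characters")
    if not min_len <= len(sequence) <= max_len:
        problems.append(f"{seq_id}: length {len(sequence)} outside {min_len}-{max_len}")
    return problems


def check_lineage(seq_id: str, lineage: str, expected_ranks: int) -> List[str]:
    """Return a list of problems found in one semicolon-separated lineage."""
    problems: List[str] = []
    ranks = lineage.split(";")
    if len(ranks) != expected_ranks:
        problems.append(f"{seq_id}: {len(ranks)} ranks, expected {expected_ranks}")
    for rank in ranks:
        if not rank.strip():
            problems.append(f"{seq_id}: empty rank in lineage")
        elif OPEN_NOMENCLATURE.search(rank):
            problems.append(f"{seq_id}: open nomenclature in '{rank.strip()}'")
    return problems
```

Both functions return a list of messages rather than raising, so every entry can be checked in one pass and the flagged entries reviewed by hand before the next round of curation.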
