
ccgp_assembly's Introduction

California Conservation Genomics Project (CCGP)

California Conservation Genomics Project (CCGP) repository for the genome assembly working group.


Overview

This repository contains scripts used for the reference genome assembly efforts of the CCGP.

CCGP reference genomes are assembled following a protocol adapted from Rhie et al. (2021). Assemblies are built from PacBio HiFi long-read data and scaffolded using proximity ligation/chromatin conformation capture data (HiC or OmniC; Dovetail Genomics). Our minimum target reference genome quality is 6.7.Q40, and in most cases we expect to reach 7.C.Q50 or better (see Table 1 in Rhie et al. (2021)).
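As a rough illustration of what a target such as 6.7.Q40 implies, the sketch below assumes the usual reading of the x.y.Q notation: x and y are the base-10 exponents of the minimum contig and scaffold NG50 in bp, and Q is a Phred-scaled consensus quality value. The function names are hypothetical and not part of this repository. A "C" in the second position (as in 7.C.Q50) denotes chromosome-scale scaffolds and is not covered by this numeric check.

```python
def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to an expected per-base error rate."""
    return 10 ** (-qv / 10)


def meets_target(contig_ng50_bp: int, scaffold_ng50_bp: int, qv: float,
                 x: int = 6, y: int = 7, q: float = 40) -> bool:
    """Check an assembly against a numeric x.y.Q target (default 6.7.Q40).

    Assumption: x and y are log10 thresholds on contig and scaffold NG50 (bp).
    """
    return (contig_ng50_bp >= 10 ** x
            and scaffold_ng50_bp >= 10 ** y
            and qv >= q)


if __name__ == "__main__":
    # Hypothetical assembly statistics, for illustration only.
    print(meets_target(1_500_000, 25_000_000, 42.3))
    print(qv_to_error_rate(40))
```

Under this reading, Q40 corresponds to about one error per 10,000 bases and Q50 to about one per 100,000.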

Here is an overview of our current pipeline:

CCGP: Overview of our current pipeline

Pipeline overview

The pipeline has gone through multiple versions since the beginning of the project. This is an overview of how it has evolved.

CCGP: Evolution of the assembly pipeline

Color blocks:

  • Yellow: sequencing datatypes
  • Dark gray: Fixed processes
  • Light gray: Optional processes
  • Blue: Iterative step

Workflows

  • PacBio HiFi
    • PacBio Adapter filtering
    • K-mer counting with meryl
    • Genome size, heterozygosity and repeat content estimation
    • Coverage validation (calculation of expected coverage given the sequencing data)
  • HiC/OmniC
    • Library QC with Dovetail Genomics tools
  • Contig assembly with HiFiasm
    • Depending on the datasets available and on ploidy, we run HiFiasm in either single or HiC mode.
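The coverage validation step listed under PacBio HiFi can be sketched as follows: expected coverage is the total number of sequenced bases divided by the estimated genome size. This is a minimal sketch, not the repository's script; `expected_coverage` and its inputs are hypothetical names, and the genome size is assumed to come from the k-mer based estimate.

```python
from typing import Iterable


def expected_coverage(read_lengths: Iterable[int], genome_size_bp: int) -> float:
    """Return expected sequencing depth: total read bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return sum(read_lengths) / genome_size_bp


if __name__ == "__main__":
    # Toy numbers for illustration only: four 15 kb reads over a 20 kb genome.
    print(expected_coverage([15_000] * 4, 20_000))
```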

Purge haplotigs: haplotypic duplications and contig overlaps

  • Alignment of HiFi data with minimap2 and purging with purge_dups
  • Alignments with Arima Genomics Mapping Pipeline
  • Scaffolding with SALSA
  • Generation and visualization of contact maps
    • HiGlass
    • Generation of tracks
      • HiFi coverage
      • HiC/OmniC coverage
      • Genome assembly mappability
      • Gap description
    • PretextSuite
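A gap description track like the one listed above can be approximated by locating runs of `N` in each scaffold sequence and reporting them as BED-style intervals (0-based start, exclusive end). This is a minimal sketch under the assumption that gaps are encoded as runs of `N`; the function name and the `min_len` parameter are hypothetical and not taken from this repository.

```python
import re
from typing import Iterator, Tuple


def gap_intervals(seq_name: str, sequence: str,
                  min_len: int = 1) -> Iterator[Tuple[str, int, int]]:
    """Yield (name, start, end) for each run of N/n at least min_len long."""
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() >= min_len:
            yield (seq_name, match.start(), match.end())


if __name__ == "__main__":
    # Toy sequence for illustration only.
    for name, start, end in gap_intervals("scaffold_1", "ACGTNNNNACGTnnAC"):
        print(f"{name}\t{start}\t{end}")
```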

Gap closing

  • Using YAGCloser, which closes gaps based on long reads that span them

Mitochondrial assembly

  • Mitogenome assembly pipeline or MitoHiFi

Contamination screening

  • Organelle filtering from nuclear assemblies
  • Contamination screening with Blobtools
  • Contiguity metrics (contig and scaffold N50)
  • BUSCO scores
  • Per-base quality / k-mer completeness
  • Frameshift errors
  • Gap description
  • Genome mappability
  • Mapping quality
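For readers unfamiliar with the contiguity metric listed above, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The sketch below shows that calculation on a plain list of lengths; it is illustrative only and is not the script used in this repository.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Toy lengths for illustration only.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

The same function applies to both contig N50 and scaffold N50; only the input lengths differ.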

Versioning

Learn more

  • For further information about our project and efforts, please visit the CCGP website
  • For more information about the project, see also:

Shaffer HB, Toffelmier E, Corbett-Detig RB, Escalona M, Erickson B, Fiedler P, Gold M, Harrigan RJ, Hodges S, Luckau TK, Miller C, Oliveira DR, Shaffer KE, Shapiro B, Sork VL, Wang IJ (2022) Landscape genomics to enable conservation actions: the California Conservation Genomics Project. Journal of Heredity, 113 (6): 577–588, https://doi.org/10.1093/jhered/esac020

References

ccgp_assembly's People

Contributors

merlyescalona


ccgp_assembly's Issues

Question About Use of BlobToolKit

In the flowchart depicting the pipeline used by CCGP for genome assembly, there is a line connecting the 'Contamination Screening: BlobToolKit' box to the 'ARIMA Mapping Pipeline' box. This suggests that HiC reads are mapped to contigs produced by HiFiasm after they have been decontaminated using BlobToolKit. Is this correct?

I find the BlobToolKit docs to be a bit opaque and am hoping I can avoid puzzling out the pipeline for a little while longer.

Cheers,
Emily

Add License for re-use

Hi Merly,

We have received a suggestion to add this resource to the ERGA Knowledge Hub. We think it would be a great fit to our library of biodiversity genomics resources, but we require a re-usage license for all of the resources. Would you consider adding one to this repository?

Many thanks and all the best,

Tom
