Genome-assembly-of-the-copepod-Leptodiaptomus

Generalities

This repository contains a short description of the workflow used to assemble and compare the genomes of the Leptodiaptomus sicilis copepod group.

Copepods are a little-studied group, but in recent years they have gained relevance as a genomic model: they show extensive adaptive evolution, having colonized almost every kind of aquatic system, and the genomes of phylogenetically close copepods can differ in size and genetic structure, as is the case in the L. sicilis group. This group is undergoing ecological speciation (1 and 2), promoted mainly by contrasts in salinity and habitat permanence (Table 1). However, the genomic basis of this adaptive evolution is unknown. To better understand it, we have set the following objectives:

1.- Assemble and annotate a reference genome de novo.
2.- Assemble the remaining genomes and compare them with the reference.
3.- Characterize divergent genomic regions subject to selection and adaptive evolution.

Table 1. Key ecological features of the four lakes inhabited by the L. sicilis group

This repository is both a report and a guide to the bioinformatics methodology used to meet these objectives. Genome sizes and the types of sequences used are briefly summarized in Table 2. The short reads (Illumina) come from unique dual-indexed (UDI) TruSeq Nano DNA libraries sequenced on a single lane of a NovaSeq SP 2x150-bp run. The long reads (PacBio) come from an Express 20 kb library sequenced on a Sequel II 8M SMRT Cell using the continuous long read (CLR) method.

Genome size was estimated by flow cytometry, and the conversion from picograms to Mbp was carried out with an R function, picograms_to_Mpb(n). The result can be seen in plot_genome_size.
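The body of picograms_to_Mpb(n) is not shown here, so the following is only a minimal Python sketch of the same kind of conversion. It assumes the commonly used factor of 1 pg of DNA = 978 Mbp; the function name `picograms_to_mbp` and the constant name are illustrative and are not taken from the repository.

```python
# Assumption: the commonly used conversion factor, 1 pg of DNA = 978 Mbp.
# The repository's own R function may be written differently.
MBP_PER_PICOGRAM = 978.0


def picograms_to_mbp(picograms: float) -> float:
    """Convert a DNA content in picograms to megabase pairs (Mbp)."""
    if picograms < 0:
        raise ValueError("DNA content cannot be negative")
    return picograms * MBP_PER_PICOGRAM
```

For example, `picograms_to_mbp(0.5)` gives 489.0 Mbp under this assumption.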

The workflow is structured in four parts, following different bioinformatic strategies that depend on the data type:

1.- De novo assembly and annotation of the El Carmen genome (MinION + Illumina).
2.- De novo assembly and annotation of the Atexcac genome (PacBio + Illumina).
3.- Reference-guided assembly for the rest of the populations (Illumina).
4.- Comparison of the 5 genomes and identification of variants.

At the moment only the strategy for the first part is presented; the repository will be updated as progress is made.

Prerequisites (software versions)

FastQC: v0.11.8

Trimmomatic: v0.39

Porechop: v0.2.4

Canu: v2.0

SMARTdenovo: v2.12

Repository structure

The repository is organized into five folders for better monitoring and visualization:

/bin/ This folder holds the scripts used to obtain the results. Each script assumes that the working directory is this folder, that the input files are in meta-data and that the outputs will go to results.
/courses/ This repository began as part of a project in a bioinformatics workshop. This folder holds files that are not needed for the workflows presented here but were part of the course work.
/images/ All images used to illustrate this repository are stored here.
/results/ This folder holds the results obtained, mainly graphs and tables produced during the assembly process. It includes a readme with a short summary of the general results, because not all results can be included, either because of their size or because they are not yet published.
/meta-data/ This folder holds tables and general data on the five study populations that can be used for future analyses. For example, general_data.txt contains general ecological data such as the presence of predators, salinity and habitat permanence, together with the genome size of each population.
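The folder convention described above (scripts run from bin, inputs in meta-data, outputs in results) can be sketched as a small path helper. This is an illustration only: `project_paths` is a hypothetical name, and the repository's scripts may resolve their paths differently.

```python
from pathlib import Path


def project_paths(bin_dir: Path) -> dict:
    """Return the sibling folders of bin/, following the layout described above.

    Hypothetical helper: it only encodes the stated convention that scripts
    run from bin/, read from meta-data/ and write to results/.
    """
    root = bin_dir.parent  # the repository root is the parent of bin/
    return {
        "bin": root / "bin",
        "meta_data": root / "meta-data",
        "results": root / "results",
    }
```

A script started inside bin could call `project_paths(Path.cwd())` to find where to read inputs and write outputs.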

Note: Because the work is not yet published and most of the data and results are large, they are not included here. If you have any questions, you can open an issue here or send an email to [email protected]

Workflow structure

Workflow for the de novo assembly of the "El Carmen" genome

The first de novo assembly uses the El Carmen population and follows the protocol suggested by Shin et al. (2019), which was designed and tested to improve the integrity of hybrid genome assemblies built from MinION and Illumina reads.

To generate the results of this workflow, you can run the script DeNovo_assembly.sh. If you prefer to follow each step, they are listed below in the order in which they should be executed, each with a short description:

Data Cleaning

1.- FastQC: Reviews the quality of the Illumina reads and helps you choose the trimming and cleaning parameters.

2.- Trimmomatic: Trims low-quality bases or reads and Illumina adapter sequences to improve the quality and integrity of the genome.

3.- Porechop: Also a trimming tool, but specific to MinION reads: it finds and removes their adapter sequences.
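As a rough illustration of the three cleaning steps, the sketch below only builds the command lines as Python lists; it does not run them. The file names, the adapter file and the trimming thresholds are placeholders borrowed from the Trimmomatic manual's paired-end example, not the parameters used in DeNovo_assembly.sh. It also assumes a `trimmomatic` wrapper is on the PATH (as in conda installs); otherwise the call would be `java -jar trimmomatic-0.39.jar`.

```python
def fastqc_cmd(reads, out_dir):
    """FastQC quality report for one or more FASTQ files."""
    return ["fastqc", "-o", out_dir, *reads]


def trimmomatic_cmd(r1, r2, prefix, adapters="TruSeq3-PE.fa"):
    """Paired-end trimming; thresholds are manual defaults, not tuned values."""
    return [
        "trimmomatic", "PE", "-phred33", r1, r2,
        f"{prefix}_1P.fq.gz", f"{prefix}_1U.fq.gz",  # paired / unpaired forward
        f"{prefix}_2P.fq.gz", f"{prefix}_2U.fq.gz",  # paired / unpaired reverse
        f"ILLUMINACLIP:{adapters}:2:30:10",
        "SLIDINGWINDOW:4:15",
        "MINLEN:36",
    ]


def porechop_cmd(in_fastq, out_fastq):
    """Adapter removal for MinION reads."""
    return ["porechop", "-i", in_fastq, "-o", out_fastq]
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed and the placeholder values have been replaced.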

Genome Assembly

4.- Canu: Generates corrected reads, which improves base accuracy; the assembly is then built from these corrected reads.

5.- SMARTdenovo: Finally, this is the long-read assembler that builds the assembly from the corrected reads.
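The two assembly steps could be wired together roughly as follows. This is a sketch under assumptions, not the contents of DeNovo_assembly.sh: the prefix, directory and genome size are placeholders, and the read-type option is written as `-nanopore`, the spelling used by Canu 2.0 (earlier releases used `-nanopore-raw`). The functions only build shell command strings.

```python
import shlex


def canu_correct_cmd(reads, prefix, out_dir, genome_size):
    """Canu run restricted to the read-correction stage."""
    return (
        f"canu -correct -p {shlex.quote(prefix)} -d {shlex.quote(out_dir)} "
        f"genomeSize={genome_size} -nanopore {shlex.quote(reads)}"
    )


def corrected_reads_path(prefix, out_dir):
    """Where Canu writes the corrected reads for a given prefix."""
    return f"{out_dir}/{prefix}.correctedReads.fasta.gz"


def smartdenovo_cmds(corrected_reads, prefix):
    """SMARTdenovo writes a makefile, which is then executed with make."""
    makefile = f"{prefix}.mak"
    return [
        f"smartdenovo.pl -p {shlex.quote(prefix)} -c 1 "
        f"{shlex.quote(corrected_reads)} > {shlex.quote(makefile)}",
        f"make -f {shlex.quote(makefile)}",
    ]
```

The `genome_size` argument takes Canu's usual notation (for example a value ending in `m` for megabases) and should come from the flow cytometry estimate rather than from any value shown here.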

Assembly Polishing

6.- BWA: Aligns the Illumina short reads to the newly assembled draft genome so that the assembly can be corrected and gaps filled.

7.- Samtools: Sorts and indexes the alignments.

8.- Pilon: Finally, automatically polishes and improves the draft genome based on the aligned reads and their coverage.
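The polishing steps form a simple chain: index the draft, align, sort and index the alignments, then polish. A minimal sketch follows, with placeholder file names and thread count; it assumes a `pilon` wrapper is on the PATH (otherwise the call would be `java -jar pilon.jar`). It only builds the shell commands and does not reproduce the exact options of DeNovo_assembly.sh.

```python
import shlex


def polishing_cmds(draft, r1, r2, prefix, threads=4):
    """Return the shell commands for one round of short-read polishing."""
    draft_q, r1_q, r2_q = (shlex.quote(p) for p in (draft, r1, r2))
    bam = shlex.quote(f"{prefix}.sorted.bam")
    out = shlex.quote(f"{prefix}_pilon")
    return [
        # Build the BWA index for the draft assembly.
        f"bwa index {draft_q}",
        # Align the paired Illumina reads and sort the alignments on the fly.
        f"bwa mem -t {threads} {draft_q} {r1_q} {r2_q} | samtools sort -o {bam} -",
        # Index the sorted BAM so Pilon can access it by position.
        f"samtools index {bam}",
        # Polish the draft using the aligned paired-end reads.
        f"pilon --genome {draft_q} --frags {bam} --output {out}",
    ]
```

Polishing is often repeated for more than one round, feeding each polished assembly back in as the new draft; whether that was done here is not stated.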

References

Barrera, M. O. A., Ciros, P. J., Ortega, M. E., Alcántara, R. J. A., & Piedra, I. E. (2015). From local adaptation to ecological speciation in copepod populations from neighboring lakes. PLoS ONE, 10(4), e0125524.

Shin, S. C., Kim, H., Lee, J. H., Kim, H. W., Park, J., Choi, B. S., & Kim, S. (2019). Nanopore sequencing reads improve assembly and gene annotation of the Parochlus steinenii genome. Scientific Reports, 9(1), 1-10.

Ortega, M. E., Alcántara, R. J. A., Urbán, O. J., Campos, C. J. E., & Ciros, P. J. (2020). Genomic evidence of adaptive evolution patterns in lacustrine calanoid copepods. Molecular Ecology (in review).
