
Genome-assembly-of-the-copepod-Leptodiaptomus

Generalities

This repository contains a short description of the workflow for the assembly and comparison of genomes of the copepod Leptodiaptomus sicilis group.

Copepods are a little-studied model, but in recent years they have gained relevance as a genomic model: they show extensive adaptive evolution, have colonized almost every aquatic system, and the genomes of phylogenetically close copepods can differ in size and genetic structure, as is the case in the copepod L. sicilis group. This group is undergoing ecological speciation (1 and 2), promoted mainly by contrasts in the salinity and permanence of the habitat (Table 1). However, the genomic bases of this adaptive evolution are unknown, and to better understand them we have set the following objectives:

  • 1.- De novo genome assembly and annotation.
  • 2.- Assemble genomes and compare them with the reference.
  • 3.- Characterize divergent genomic regions subject to selection and adaptive evolution.

Table 1. Key ecological features of the four lakes inhabited by the L. sicilis group.

This repository is a report on, and a guide to, the bioinformatic methodology used to meet these objectives. Genome sizes and the types of sequences used are briefly summarized in Table 2. For the short reads (Illumina), unique dual-indexed (UDI) TruSeq Nano DNA libraries were prepared and sequenced on a single lane of a NovaSeq SP 2x150-bp run. For the long reads (PacBio), an Express 20-kb library was prepared and sequenced on a Sequel II 8M SMRT cell using the continuous long read (CLR) method.

Genome size was estimated by flow cytometry, and the conversion from picograms to Mbp was carried out with an R function, picograms_to_Mpb(n); the result can be seen in plot_genome_size.
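The function presumably applies the widely used factor of roughly 978 Mbp per picogram of DNA. As a minimal shell sketch of that conversion (the 0.55 pg input is a made-up example, not a measured value):

awk -v pg=0.55 'BEGIN { printf "%.1f Mbp\n", pg * 978 }'   # -> 537.9 Mbp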

The workflow is structured in four parts, following different bioinformatic strategies that depend on the data type: 1) de novo assembly and annotation of the El Carmen genome (MinION + Illumina); 2) de novo assembly and annotation of the Atexcac genome (PacBio + Illumina); 3) reference-guided assembly for the rest of the populations (Illumina); 4) comparison of the five genomes and identification of variants. At the moment only the strategy for the first step is presented; the repository will be updated as progress is made.

Pre-requirements (Software versions)

Fastqc:v0.11.8

Trimmomatic:v0.39

Porechop:v0.2.4

Canu:v2.0

SMARTdenovo:v2.12

bwa:v0.7.17

SAMtools:v0.1.20

Pilon:v1.23

BUSCO:v5.0.0

QUAST:v5.0.2

Repository structure

The repository is organized into five folders for better monitoring and visualization:

  • /bin/ This folder contains the scripts that produce the results. Each script assumes that the working directory is this folder, that the input files are in /meta-data/, and that the outputs go to /results/.

  • /courses/ This repository arose as part of a project in a bioinformatics workshop, and this folder holds files that are not necessary for the presented workflows but were part of the course work.

  • /images/ All images used to illustrate this repository are deposited here.

  • /results/ This folder contains the results obtained, mainly graphs and tables produced during the assembly process. It includes a README with a short summary of the general results, since not all results can be included here, either because of their size or because they are unpublished.

  • /meta-data/ This folder holds tables and general information on the five study populations that can be used in future analyses. For example, general_data.txt contains general ecological data such as the presence of predators, degree of salinity, and permanence of the habitat, together with the differing genome sizes of each population.

Note: Because the work is not yet published and most of the data and results are large, they are not included here. If you have any doubt or question, you can open an issue here or send an email to [email protected]

Structure of the workflows

Workflow for the de novo assembly of "El Carmen"

The first de novo assembly was carried out with the El Carmen population, following the protocol suggested by Shin et al. (2019), since it is designed and tested as a workflow that improves the integrity of hybrid genomes built from MinION and Illumina sequences.

To generate the results of this workflow you can run the script DeNovo_assembly.sh. If you prefer to follow each step yourself, here is the order in which they should be executed, with a short description of each:

  • Data Cleaning

1.- FastQC: Used to review the quality of the Illumina reads; it helps you decide on trimming and cleaning parameters.
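A minimal invocation might look like this (file names are hypothetical; the output directory follows this repository's /bin -> /results convention):

mkdir -p ../results/fastqc
fastqc -t 2 -o ../results/fastqc illumina_R1.fastq.gz illumina_R2.fastq.gz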

2.- Trimmomatic: Trims low-quality bases or reads and the sequenced Illumina adapters to improve genome quality and integrity.
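A paired-end sketch, assuming the conda wrapper script (otherwise invoke the jar with java -jar trimmomatic-0.39.jar); file names are hypothetical and the TruSeq3-PE.fa adapter path may need adjusting to your installation:

trimmomatic PE -threads 4 \
    illumina_R1.fastq.gz illumina_R2.fastq.gz \
    trim_R1_paired.fastq.gz trim_R1_unpaired.fastq.gz \
    trim_R2_paired.fastq.gz trim_R2_unpaired.fastq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50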

3.- Porechop: Also used for trimming, but exclusively of adapters and only on MinION sequences.
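For example (input and output names are hypothetical; adapters are detected automatically):

porechop -i minion_raw.fastq.gz -o minion_trimmed.fastq.gz --threads 4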

  • Genome Assembly

4.- Canu: Used to generate corrected sequences that improve per-base accuracy; the assembly is then made with these corrected sequences.
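A correction-only run might look like this (genomeSize=500m is a placeholder that should be set to the flow-cytometry estimate; file names are hypothetical):

canu -correct -p carmen -d ../results/canu \
    genomeSize=500m \
    -nanopore minion_trimmed.fastq.gz
# corrected reads: ../results/canu/carmen.correctedReads.fasta.gz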

5.- SMARTdenovo: Finally, this is the long-read assembler with which the assembly was made, using the corrected sequences.
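Typical SMARTdenovo usage generates a Makefile that is then executed; -c 1 enables the consensus step. A sketch with hypothetical names, unpacking Canu's corrected reads first:

gunzip -k ../results/canu/carmen.correctedReads.fasta.gz
smartdenovo.pl -p carmen -t 8 -c 1 ../results/canu/carmen.correctedReads.fasta > carmen.mak
make -f carmen.mak
# draft assembly: carmen.dmo.cns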

  • Assembly Polishing

6.- BWA: Aligns the Illumina short reads to the newly assembled genome draft in order to correct the assembly and fill in gaps.
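A sketch, assuming the draft from SMARTdenovo and the trimmed Illumina pairs from step 2:

bwa index carmen.dmo.cns
bwa mem -t 8 carmen.dmo.cns trim_R1_paired.fastq.gz trim_R2_paired.fastq.gz > carmen_ilm.sam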

7.- Samtools: Sorts and indexes the alignment.
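For example (note that the legacy SAMtools v0.1.20 listed above takes an output prefix, not a file name, in sort):

samtools view -bS carmen_ilm.sam > carmen_ilm.bam
samtools sort carmen_ilm.bam carmen_ilm.sorted
samtools index carmen_ilm.sorted.bam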

8.- Pilon: Finally, it automatically polishes and improves the draft genome (correcting bases and filling gaps) based on the coverage of the aligned reads.
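A sketch of one polishing round (the memory setting and jar path are assumptions that depend on your machine and installation):

java -Xmx16G -jar pilon-1.23.jar \
    --genome carmen.dmo.cns \
    --frags carmen_ilm.sorted.bam \
    --output carmen_pilon --outdir ../results/pilon --changes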

References

  • Ortega, M. E., Alcántara, R. J. A., Urbán, O. J., Campos, C. J. E., & Ciros, P. J. (2020). Genomic evidence of adaptive evolution patterns in lacustrine calanoid copepods. Molecular Ecology (in review).


Issues

Database download in KRAKEN2

I want to identify whether there are common bacteria or viruses in my sequences, and for this I would like to use KRAKEN2 and its databases.

I'm running KRAKEN2 from Docker with the image tbattaglia/kraken2:latest.

The program itself runs correctly, but I could not download the databases, even though I tried to follow the manual.

This is the KRAKEN2 help output:

Usage: kraken2-build [task option] [options]

Task options (exactly one must be selected):
  --download-taxonomy        Download NCBI taxonomic information
  --download-library TYPE    Download partial library
                             (TYPE = one of "archaea", "bacteria", "plasmid",
                             "viral", "human", "fungi", "plant", "protozoa",
                             "nr", "nt", "env_nr", "env_nt", "UniVec",
                             "UniVec_Core")
  --special TYPE             Download and build a special database
                             (TYPE = one of "greengenes", "silva", "rdp")
  --add-to-library FILE      Add FILE to library
  --build                    Create DB from library
                             (requires taxonomy d/l'ed and at least one file
                             in library)
  --clean                    Remove unneeded files from a built database
  --standard                 Download and build default database
  --help                     Print this message
  --version                  Print version information

Options:
  --db NAME                  Kraken 2 DB name (mandatory except for
                             --help/--version)
  --threads #                Number of threads (def: 1)
  --kmer-len NUM             K-mer length in bp/aa (build task only;
                             def: 35 nt, 15 aa)
  --minimizer-len NUM        Minimizer length in bp/aa (build task only;
                             def: 31 nt, 12 aa)
  --minimizer-spaces NUM     Number of characters in minimizer that are
                             ignored in comparisons (build task only;
                             def: 7 nt, 0 aa)
  --protein                  Build a protein database for translated search
  --no-masking               Used with --standard/--download-library/
                             --add-to-library to avoid masking low-complexity
                             sequences prior to building; masking requires
                             dustmasker or segmasker to be installed in PATH,
                             which some users might not have.
  --max-db-size NUM          Maximum number of bytes for Kraken 2 hash table;
                             if the estimator determines more would normally be
                             needed, the reference library will be downsampled
                             to fit. (Used with --build/--standard/--special)
  --use-ftp                  Use FTP for downloading instead of RSYNC; used with
                             --download-library/--download-taxonomy/--standard.
  --skip-maps                Avoids downloading accession number to taxid maps,
                             used with --download-taxonomy.
                                                          

I used this command line: kraken2-build --standard --db "bacteria"

But I get this error:

Downloading taxonomy tree data.../kraken2-2.0.8-beta/download_taxonomy.sh: line 27: rsync: command not found

I think the rsync command is needed for the data download, but I have doubts about whether I have to install it in the Docker image, or whether I only have to grant some permission from my computer or inside the same image.
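One thing I might try, based on the --use-ftp flag in the help output above, is downloading over FTP instead of rsync (untested; bacteria_db is just a placeholder database name):

kraken2-build --download-taxonomy --use-ftp --db bacteria_db
kraken2-build --download-library bacteria --use-ftp --db bacteria_db
kraken2-build --build --db bacteria_db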

Better explain metadata

Your metadata directory should only contain metadata, but currently it also has images and other files related to issues or class examples. Please clean this up and keep non-metadata files in a separate directory.

Difficulties using databases when running BLAST

Do you need to create your own databases, or is there a better option?

I created a database for copepods:

makeblastdb -in ../blastpacope/genomas_copepods/ncbi_dataset/data/db_all_copepods.fna -dbtype nucl -parse_seqids -out my_refrence2.fa

But the databases are very large, and running BLAST is very slow.

To run BLAST I use:

blastn -db my_refrence2.fa -query ../blastpacope/minion/minion_carmen.fa -out results_allgenomes_tab.out -outfmt "6 sframe qseqid sseqid evalue pident mismatch" 

The option -outfmt "6 <options>" outputs the results as a tab-separated table. Although it is hard to see, the percent-identity column generally looks higher than 80%, so I think the assembly may not be very contaminated. However, I think that running BLAST against many databases will take longer and be more difficult, so I would like to know if anyone knows an easier way to run BLAST?
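With the column order given in the command above, pident is field 5, so the identity distribution can be checked quickly (file name as in the command above):

awk '$5 >= 80' results_allgenomes_tab.out | wc -l    # hits at >= 80% identity
sort -k5,5gr results_allgenomes_tab.out | head       # top hits by percent identity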

results_blast

Or would it be a better option to start testing assemblers with different parameters?
The first draft I have seems to be fragmented.

busco_quast

Comments to improve repo organization, scripts and final project

Your repo is looking good, but the following points need attention:

  • README: It would be clearer if you could briefly state the order in which one needs to run all of your scripts. Notice that all scripts in /bin should be mentioned in the README; this is currently not the case.
  • README: Not all of the programs you used are mentioned in "Software versions used"; for instance, bwa is missing.
  • README: Specify whether your Illumina data is single-end or paired-end and whether it was already demultiplexed.
  • README: Briefly explain the contents of all of the files in /meta-data. In the case of files with columns, please briefly state what each column means.
  • Script DeNovo_assembly.sh line 43 and script pilon.sh line 6: mkadir is not a command; correct it to mkdir. Since the directories were not being created, also check that this was not causing an error.
  • Scripts: When you use mkdir in a script, it is good practice to add the -p flag so that the script can be run several times without failing because the directory already exists. Please use mkdir -p in all your scripts that currently use mkdir.
  • Don't forget to add a short summary explaining your main results; this can be part of the README or a separate md file.
  • Don't forget to add an R figure and the code used to make it. This can be a simple plot of your number of reads or any data you already have.
  • When making commits, don't forget to add relevant short messages indicating what you changed.

denovo_map.pl: Aborted because the last command failed (Error: Unable to load data)

My problem is that I can't finish running denovo_map.pl.

I am using Stacks on my local computer from a Docker container, with the following script: stacks.denovo_map.prueba.sh

denovo_map.pl is a program used to build loci and call SNPs de novo; it is used when there is no reference genome.

I want to run denovo_map.pl to identify SNPs in 93 samples, with 23 and 22 individuals from 4 populations, but first I wanted to perform a test with 3 individuals from each population. For these samples the program starts to run, but when it goes on to the analysis of sample 2, the process stops and displays an abort message: denovo_map.pl: Aborted because the last command failed (1); see log file.

I first ran process_radtags, a program that checks the raw Illumina data and demultiplexes the samples, filtering by quality and by the cut sites of the restriction enzymes.

I run the command as follows:

stacks process_radtags -P -p ../stacks/isuue2/GBS_raw/ --interleaved \
-b ../stacks/isuue2/barcodes_copes_iss.tsv -o ../stacks/isuue2/process_map_res/ \
-c -q -r --index_index --renz_1 mspI --renz_2 nsiI

The following image shows an example of the raw data format, and this is the barcode file that I use.

image

The output of process_radtags generates 4 different files per sample, including the paired .1 and .2 reads, which are used to run denovo_map.pl.
image

And then run denovo_map.pl:

stacks denovo_map.pl --samples ../stacks/isuue2/process_map_res/ \
--popmap ../stacks/isuue2/popmap_tarea_issue.tsv -o ../stacks/isuue2/denovo_map_re2 \
-M 3 -n 2 -m 3 -X "populations: -r 0.50 --min_maf 0.01 --genepop"

Here is the population map file I use.

Then the following happens, showing this message: denovo_map.pl: Aborted because the last command failed (1); see log file.

image

I tried looking into the error and saw that it could be due to my computer's memory, but I also ran it on a cluster and got the same error. I also saw that it could be due to the sample IDs, but I tried changing them and the process still does not finish. I'm still confused because I don't know whether I'm using some command or parameter incorrectly.
