Git Product home page Git Product logo

nanosim's Introduction

Release Downloads Conda Stars

NanoSim

NanoSim is a fast and scalable read simulator that captures the technology-specific features of ONT data, and allows for adjustments upon improvement of nanopore sequencing technology.

The second version of NanoSim (v2.0.0) uses minimap2 as default aligner to align long genomic ONT reads to reference genome. It leads to much faster alignment step and reduces the overall runtime of NanoSim. We also utilize HTSeq, a python package, to read SAM alignment files efficiently.

NanoSim (v2.5) is able to simulate ONT transcriptome reads (cDNA / direct RNA) as well as genomic reads. It also models features of the library preparation protocols used, including intron retention (IR) events in cDNA and directRNA reads. Further, it has stand-alone modes which profiles transcript expression patterns and detects IR events in custom datasets. Additionally, we improved the homopolymer simulation option which simulates homopolymer expansion and contraction events with respect to chosen basecaller. Multiprocessing option allows for faster runtime for large library simulation.

NanoSim (v2.6) is able to simulate ONT reads in fastq format. The base quality information is simulated with truncated log-normal distributions, learnt separately from match bases, mismatch bases, insertion bases, deletion bases, and unaligned bases, each from different basecaller and read type.

NanoSim (v3.0) is able to simulate ONT metagenome reads. It quantifies metagenome abundance in the characterization stage, and accomodates for chimeric reads. In the simulation stage, it simulates both features as well. In addition, the simulation of chimeric reads is available in genome mode too. Some pre-trained models are re-trained for compatibility issues.

NOTE: We usually add new pre-trained models and make them available through the latest releases! Users can choose to download the whole package or only scripts without models to speed it up. Please check the downloads section for more information regarding the pre-trained models.

Citation
If you use NanoSim to simulate genomic reads, please cite the following paper:

NanoSim
Yang C, Chu J, Warren RL, Birol I. NanoSim: nanopore sequence read simulator based on statistical characterization. Gigascience. 2017 Apr 1;6(4):1-6. doi: 10.1093/gigascience/gix010. PMID: 28327957; PMCID: PMC5530317.

If you use NanoSim to simulate transcriptomic reads, please cite the following paper:

Trans-NanoSim
Hafezqorani S, Yang C, Lo T, Nip KM, Warren RL, Birol I. Trans-NanoSim characterizes and simulates nanopore RNA-sequencing data. Gigascience. 2020 Jun 1;9(6):giaa061. doi: 10.1093/gigascience/giaa061. PMID: 32520350; PMCID: PMC7285873.

If you use NanoSim to simulate metagenomic reads, please cite the following paper:

Meta-NanoSim
Yang C, Lo T, Nip KM, Hafezqorani S, Warren RL, Birol I. Characterization and simulation of metagenomic nanopore sequencing data with Meta-NanoSim. Gigascience. 2023 Mar 20;12:giad013. doi: 10.1093/gigascience/giad013. PMID: 36939007; PMCID: PMC10025935.

Dependencies

NOTE: Please kindly note that the pretrained models in NanoSim (v3.0.2) were made using an older version of scikit-learn (e.g. <=0.22.1). If you have to use these models (instead of creating your own models), then you must use scikit-learn=0.22.1 but not the newer versions. If you have a newer version of scikit-learn installed, then you will get the error for No module named 'sklearn.neighbors.kde'. If you would like to create your own models (instead of using the pretrained models), then NanoSim should work just fine with scikit-learn=1.0.2 from our experience. For future releases of NanoSim, we will try to include newly pre-trained models with the updated versions of required packages in order to solve the incompatibility issues.

Python}
Python packages:

  • HTSeq (Tested with version 0.11.2)
  • joblib (Tested with version 0.14.1)
  • numpy (Tested with version 1.17.2)
  • pybedtools (Tested with version 0.8.2)
  • pysam (Tested with version 0.13 or above)
  • scikit-learn (Tested with version 0.21.3)
  • scipy (Tested with verson 1.4.1)
  • six (Tested with version 1.16.0)

External programs:

  • minimap2 (Tested with versions 2.10, 2.17, 2.18)
  • LAST (Tested with versions 581 and 916)
  • samtools (Tested with version 1.12)
  • GenomeTools (Tested with version 1.6.1)

Installation

Before you begin, make sure that your conda channels are configured properly for bioconda.

Option 1. Install the latest release through bioconda:

conda install -c bioconda nanosim

Option 2. Install the development code in our repo:

  1. git clone https://github.com/bcgsc/NanoSim.git
  2. conda create --name nanosim python=3.7
  3. conda activate nanosim
  4. conda install --file requirements.txt -c conda-forge -c bioconda

NOTE: If you have difficulty with installing all the dependent packages with conda. We strongly recommend that you create a dedicated environment for running NanoSim. If you have issues with conda install being eternally stuck, use mamba instead of conda to install your conda packages: https://github.com/mamba-org/mamba .

So, integrating all these together:

conda create -n nanosim_env
conda activate nanosim_env

mamba install scikit-learn=0.22.1 six samtools pysam pybedtools minimap2 joblib htseq genometools-genometools

Note that here we only specified the version for scikit-learn but not for the other packages. mamba should be able to pick the appropriate versions for the specified packages, python, and numpy, etc.

Downloads

Some ONT read profiles are ready to use for users. With the profiles, users can run the simulation tool directly.

For releases before v2.2.0, we provide profiles trained for E. coli or S. cerevisiae datasets. Flowcell chemistry is R7.3 and R9, and they were basecalled by Metrichor. They can be downloaded from our ftp site

For release v2.5.0 and onwards, we provide profiles trained for H. sapiens NA12878 gDNA, cDNA 1D2, and directRNA datasets, Mus. musculus cDNA dataset, and the ZymoBIOMICS mock community datasets with 10 species and two abundance levels. Flowcell chemistry is R9.4 for all datasets. NA12878 gDNA and directRNA were basecalled by Guppy 3.1.5; NA12878 cDNA 1D2 was basecalled by Albacore 2.3.1; mouse cDNA was basecalled by Metrichor. These models are available within pre-trained_models folder.

For release v3.1.1, we provide a trained model for H. sapiens NA24385 - AshkenazimTrio - Son (hg002) which is sequenced using Kit v14 (R10 chemistry) and basecalled by dorado. You may find the trained model on 1 Million subsampled reads on the GitHub page (available along with the other models at pre-trained_models folder). If you are interested in the trained model based on the whole dataset, you can get it through Zenodo - DOI: 10.5281/zenodo.10064740. The model is trained using NanoSim v3.0.2 with scikit-learn v0.23.2 and python v3.7.10.

If you have any issues using the pre-trained models, check the dependencies section for some information and tips.

Usage

NanoSim is implemented using Python for error model fitting, read length analysis, and simulation. The first step of NanoSim is read characterization, which provides a comprehensive alignment-based analysis, and generates a set of read profiles serving as the input to the next step, the simulation stage. The simulation tool uses the model built in the previous step to produce in silico reads for a given reference genome/transcriptome. It also outputs a list of introduced errors, consisting of the position on each read, error type and reference bases.

1. Characterization stage

Characterization stage runs in five mode: genome, transcriptome, metagenome, quantify, and detect_ir. Below you may see the general usage of this code. We will explain each mode separately as well.

Characterization step general usage:

usage: read_analysis.py [-h] [-v]
                        {genome,transcriptome,metagenome,quantify,detect_ir} ...

Read characterization step
-----------------------------------------------------------
Given raw ONT reads, reference genome, metagenome, and/or 
transcriptome, learn read features and output error profiles

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

subcommands:
  
  There are five modes in read_analysis.
  For detailed usage of each mode:
      read_analysis.py mode -h
  -------------------------------------------------------

  {genome,transcriptome,metagenome,quantify,detect_ir}
    genome              Run the simulator on genome mode
    transcriptome       Run the simulator on transcriptome mode
    metagenome          Run the simulator on metagenome mode
    quantify            Quantify transcriptome expression or metagenome abundance
    detect_ir           Detect Intron Retention events using the alignment file

genome mode
If you are interested in simulating ONT genomic reads, you need to run the characterization stage in "genome" mode with following options. It takes a reference genome and a training read set in FASTA or FASTQ format as input and aligns these reads to the reference using minimap2 (default) or LAST aligner. User can also provide their own alignment file in SAM/BAM formats. If a SAM/BAM file is provided, make sure the alignments contain the cs tag (See minimap2) and MD flag. The output of this is a bunch of profiles which you should use in simulation stage.

genome mode usage:

usage: read_analysis.py genome [-h] -i READ [-rg REF_G] [-a {minimap2,LAST}] [-ga G_ALNM]
                               [-o OUTPUT] [-c] [--no_model_fit] [-t NUM_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  -i READ, --read READ  Input read for training
  -rg REF_G, --ref_g REF_G
                        Reference genome, not required if genome alignment file is provided
  -a {minimap2,LAST}, --aligner {minimap2,LAST}
                        The aligner to be used, minimap2 or LAST (Default = minimap2)
  -ga G_ALNM, --g_alnm G_ALNM
                        Genome alignment file in SAM/BAM format (optional)
  -o OUTPUT, --output OUTPUT
                        The location and prefix of outputting profiles (Default = training)
  -c, --chimeric        Detect chimeric and split reads (Default = False)
  --no_model_fit        Disable model fitting step
  -t NUM_THREADS, --num_threads NUM_THREADS
                        Number of threads for alignment and model fitting (Default = 1)

transcriptome mode
If you are interested in simulating ONT transcriptome reads (cDNA / directRNA), you need to run the characterization stage in "transcriptome" mode with following options. It takes a reference transcriptome, a reference genome, and a training read set in FASTA or FASTQ format as input and aligns these reads to the reference using minimap2 (default) or LAST aligner. User can also provide their own alignment file in SAM/BAM formats. If a SAM/BAM file is provided, make sure the alignments contain the cs tag (See minimap2) and MD flag. The output of this is a bunch of profiles which you should use in simulation stage.

transcriptome mode usage:

usage: read_analysis.py transcriptome [-h] -i READ -rg REF_G -rt REF_T [-annot ANNOTATION]
                                      [-a {minimap2,LAST}] [-ga G_ALNM] [-ta T_ALNM] [-o OUTPUT]
                                      [--no_model_fit] [--no_intron_retention] [-t NUM_THREADS]
                                      [-c] [-q] [-n]

optional arguments:
  -h, --help            show this help message and exit
  -i READ, --read READ  Input read for training
  -rg REF_G, --ref_g REF_G
                        Reference genome
  -rt REF_T, --ref_t REF_T
                        Reference Transcriptome
  -annot ANNOTATION, --annotation ANNOTATION
                        Annotation file in ensemble GTF/GFF formats, required for intron
                        retention detection
  -a {minimap2,LAST}, --aligner {minimap2,LAST}
                        The aligner to be used: minimap2 or LAST (Default = minimap2)
  -ga G_ALNM, --g_alnm G_ALNM
                        Genome alignment file in SAM/BAM format (optional)
  -ta T_ALNM, --t_alnm T_ALNM
                        Transcriptome alignment file in SAM/BAM format (optional)
  -o OUTPUT, --output OUTPUT
                        The location and prefix of outputting profiles (Default = training)
  --no_model_fit        Disable model fitting step
  --no_intron_retention
                        Disable Intron Retention analysis
  -t NUM_THREADS, --num_threads NUM_THREADS
                        Number of threads for alignment and model fitting (Default = 1)
  -c, --chimeric        Detect chimeric and split reads (Default = False)
  -q, --quantification  Perform abundance quantification (Default = False)
  -n, --normalize       Normalize by transcript length

metagenome mode If you are interested in simulating ONT metagenome reads, you need to run the characterization stage in "metagenome" mode with following options. It takes a metagenome list with paths pointing to each genome and a training read set in FASTA or FASTQ format as input and aligns these reads to the reference using minimap2. User can also provide their own alignment file in SAM formats. If a SAM/BAM file is provided, make sure the alignments contain the cs tag (See minimap2) and MD flag. The output of this is a bunch of profiles which you should use in simulation stage.

metagenome mode usage:

usage: read_analysis.py metagenome [-h] -i READ -gl GENOME_LIST [-ga G_ALNM] [-o OUTPUT] [-c]
                                   [-q] [--no_model_fit] [-t NUM_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  -i READ, --read READ  Input read for training
  -gl GENOME_LIST, --genome_list GENOME_LIST
                        Reference metagenome list, tsv file, the first column is species/strain
                        name, the second column is the reference genome fasta/fastq file
                        directory, the third column is optional, if provided, it contains the
                        expected abundance (sum up to 100)
  -ga G_ALNM, --g_alnm G_ALNM
                        Genome alignment file in sam format, the header of each species should
                        match the metagenome list provided above (optional)
  -o OUTPUT, --output OUTPUT
                        The location and prefix of outputting profiles (Default = training)
  -c, --chimeric        Detect chimeric and split reads (Default = False)
  -q, --quantification  Perform abundance quantification and compute the deviation between
                        calculated abundance and expected values (Default = False)
  --no_model_fit        Disable model fitting step
  -t NUM_THREADS, --num_threads NUM_THREADS
                        Number of threads for alignment and model fitting (Default = 1)

quantify mode
If you are interested in quantifying the expression levels of transcriptome or abundance of metagenome, you can run this mode with following options. This mode is independent from the other features in characterization stage, and the output is not used for simulation stage.

quantifty mode usage:

usage: read_analysis.py quantify [-h] -e E -i READ [-rt REF_T] [-gl GENOME_LIST] [-ga G_ALNM]
                                 [-ta T_ALNM] [-o OUTPUT] [-t NUM_THREADS] [-n]

optional arguments:
  -h, --help            show this help message and exit
  -e E                  Select from (trans, meta)
  -i READ, --read READ  Input reads for quantification
  -rt REF_T, --ref_t REF_T
                        Reference Transcriptome
  -gl GENOME_LIST, --genome_list GENOME_LIST
                        Reference metagenome list, tsv file, the first column is species/strain
                        name, the second column is the reference genome fasta/fastq file
                        directory, the third column is optional, if provided, it contains the
                        expected abundance (sum up to 100)
  -ga G_ALNM, --g_alnm G_ALNM
                        Genome alignment file in sam format, the header of each species should
                        match the metagenome list provided above (optional)
  -ta T_ALNM, --t_alnm T_ALNM
                        Transcriptome alignment file in sam format(optional)
  -o OUTPUT, --output OUTPUT
                        The location and prefix of outputting profile (Default = expression)
  -t NUM_THREADS, --num_threads NUM_THREADS
                        Number of threads for alignment (Default = 1)
  -n, --normalize       Normalize by transcript length

If -e trans is used, users need to provide the reference transcriptome through -rt parameter; if -e meta is used, users need to provide a genome list containing all reference genome and the path to them through -gl parameter OR provide genome alignment file through -a parameter.

detect_ir mode usage
The "transcriptome" mode of the NanoSim is able to model features of the library preparation protocols used, including intron retention (IR) events in cDNA and directRNA reads. Further, it optionally profiles transcript expression patterns. However, if you are interested in only detecting Intron Retention events without running other analysis in the characterization stage, you may use the "detect_ir" mode. Details are as follows:

detect_ir mode usage:

usage: read_analysis.py detect_ir [-h] -annot ANNOTATION [-i READ] [-rg REF_G] [-rt REF_T]
                                  [-a {minimap2,LAST}] [-o OUTPUT] [-ga G_ALNM] [-ta T_ALNM]
                                  [-t NUM_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  -annot ANNOTATION, --annotation ANNOTATION
                        Annotation file in ensemble GTF/GFF formats
  -i READ, --read READ  Input read for training, not required if alignment files are provided
  -rg REF_G, --ref_g REF_G
                        Reference genome, not required if genome alignment file is provided
  -rt REF_T, --ref_t REF_T
                        Reference Transcriptome, not required if transcriptome alignment file is
                        provided
  -a {minimap2,LAST}, --aligner {minimap2,LAST}
                        The aligner to be used: minimap2 or LAST (Default = minimap2)
  -o OUTPUT, --output OUTPUT
                        The output name and location
  -ga G_ALNM, --g_alnm G_ALNM
                        Genome alignment file in SAM/BAM format (optional)
  -ta T_ALNM, --t_alnm T_ALNM
                        Transcriptome alignment file in SAM/BAM format (optional)
  -t NUM_THREADS, --num_threads NUM_THREADS
                        Number of threads for alignment (Default = 1)

* NOTICE: -ga/-ta option allows users to provide their own alignment file. Make sure that the name of query sequences are the same as appears in the FASTA files. For FASTA files, some headers have spaces in them and most aligners only take part of the header (before the first white space/tab) as the query name. However, the truncated headers may not be unique if using the output of poretools. We suggest users to pre-process the fasta files by concatenating all elements in the header via '_' before alignment and feed the processed FASTA file as input of NanoSim.

2. Simulation stage

Simulation stage takes reference genome/transcriptome and read profiles as input and outputs simulated reads in FASTA format. Simulation stage runs in two modes: "genome" and "transcriptome" and you may use either of them based on your needs.

Simulation stage general usage:

usage: simulator.py [-h] [-v] {genome,transcriptome,metagenome} ...

Simulation step
-----------------------------------------------------------
Given error profiles, reference genome, metagenome,
and/or transcriptome, simulate ONT DNA or RNA reads

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

subcommands:
  
  There are two modes in read_analysis.
  For detailed usage of each mode:
      simulator.py mode -h
  -------------------------------------------------------

  {genome,transcriptome}
                        You may run the simulator on genome, transcriptome,
                        or metagenome mode.
    genome              Run the simulator on genome mode
    transcriptome       Run the simulator on transcriptome mode
    metagenome          Run the simulator on metagenome mode

genome mode
If you are interested in simulating ONT genomic reads, you need to run the simulation stage in "genome" mode with following options.

genome mode usage:

usage: simulator.py genome [-h] -rg REF_G [-c MODEL_PREFIX] [-o OUTPUT]
                           [-n NUMBER] [-max MAX_LEN] [-min MIN_LEN]
                           [-med MEDIAN_LEN] [-sd SD_LEN] [--seed SEED]
                           [-k KMERBIAS] [-b {albacore,guppy,guppy-flipflop}]
                           [-s STRANDNESS] [-dna_type {linear,circular}]
                           [--perfect] [--fastq] [--chimeric] [-t NUM_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  -rg REF_G, --ref_g REF_G
                        Input reference genome
  -c MODEL_PREFIX, --model_prefix MODEL_PREFIX
                        Location and prefix of error profiles generated from
                        characterization step (Default = training)
  -o OUTPUT, --output OUTPUT
                        Output location and prefix for simulated reads
                        (Default = simulated)
  -n NUMBER, --number NUMBER
                        Number of reads to be simulated (Default = 20000)
  -max MAX_LEN, --max_len MAX_LEN
                        The maximum length for simulated reads (Default =
                        Infinity)
  -min MIN_LEN, --min_len MIN_LEN
                        The minimum length for simulated reads (Default = 50)
  -med MEDIAN_LEN, --median_len MEDIAN_LEN
                        The median read length (Default = None), Note: this
                        simulation is not compatible with chimeric reads
                        simulation
  -sd SD_LEN, --sd_len SD_LEN
                        The standard deviation of read length in log scale
                        (Default = None), Note: this simulation is not
                        compatible with chimeric reads simulation
  --seed SEED           Manually seeds the pseudo-random number generator
  -k KMERBIAS, --KmerBias KMERBIAS
                        Minimum homopolymer length to simulate homopolymer
                        contraction and expansion events in, a typical k is 6
  -b {albacore,guppy,guppy-flipflop}, --basecaller {albacore,guppy,guppy-flipflop}
                        Simulate homopolymers with respect to chosen
                        basecaller: albacore, guppy, or guppy-flipflop
  -s STRANDNESS, --strandness STRANDNESS
                        Proportion of sense sequences. Overrides the value
                        profiled in characterization stage. Should be between
                        0 and 1
  -dna_type {linear,circular}
                        Specify the dna type: circular OR linear (Default =
                        linear)
  --perfect             Ignore error profiles and simulate perfect reads
  --fastq               Output fastq files instead of fasta files
  --chimeric            Simulate chimeric reads
  -t NUM_THREADS, --num_threads NUM_THREADS
                        Number of threads for simulation (Default = 1)

transcriptome mode
If you are interested in simulating ONT transcriptome reads, you need to run the simulation stage in "transcriptome" mode with following options.

transcriptome mode usage:

usage: simulator.py transcriptome [-h] -rt REF_T [-rg REF_G] -e EXP
                                  [-c MODEL_PREFIX] [-o OUTPUT] [-n NUMBER]
                                  [-max MAX_LEN] [-min MIN_LEN] [--seed SEED]
                                  [-k KMERBIAS] [-b {albacore,guppy}]
                                  [-r {dRNA,cDNA_1D,cDNA_1D2}] [-s STRANDNESS]
                                  [--no_model_ir] [--perfect] [--polya POLYA]
                                  [--fastq] [-t NUM_THREADS] [--uracil]

optional arguments:
  -h, --help            show this help message and exit
  -rt REF_T, --ref_t REF_T
                        Input reference transcriptome
  -rg REF_G, --ref_g REF_G
                        Input reference genome, required if intron retention
                        simulation is on
  -e EXP, --exp EXP     Expression profile in the specified format as
                        described in README
  -c MODEL_PREFIX, --model_prefix MODEL_PREFIX
                        Location and prefix of error profiles generated from
                        characterization step (Default = training)
  -o OUTPUT, --output OUTPUT
                        Output location and prefix for simulated reads
                        (Default = simulated)
  -n NUMBER, --number NUMBER
                        Number of reads to be simulated (Default = 20000)
  -max MAX_LEN, --max_len MAX_LEN
                        The maximum length for simulated reads (Default =
                        Infinity)
  -min MIN_LEN, --min_len MIN_LEN
                        The minimum length for simulated reads (Default = 50)
  --seed SEED           Manually seeds the pseudo-random number generator
  -k KMERBIAS, --KmerBias KMERBIAS
                        Minimum homopolymer length to simulate homopolymer contraction and
                        expansion events in, a typical k is 6
  -b {albacore,guppy}, --basecaller {albacore,guppy}
                        Simulate homopolymers with respect to chosen  
                        basecaller: albacore or guppy
  -r {dRNA,cDNA_1D,cDNA_1D2}, --read_type {dRNA,cDNA_1D,cDNA_1D2}
                        Simulate homopolymers with respect to chosen read
                        type: dRNA, cDNA_1D or cDNA_1D2
  -s STRANDNESS, --strandness STRANDNESS
                        Proportion of sense sequences. Overrides the value
                        profiled in characterization stage. Should be between
                        0 and 1
  --no_model_ir         Ignore simulating intron retention events
  --perfect             Ignore profiles and simulate perfect reads
  --polya POLYA         Simulate polyA tails for given list of transcripts
  --fastq               Output fastq files instead of fasta files
  -t NUM_THREADS, --num_threads NUM_THREADS
                        Number of threads for simulation (Default = 1)
  --uracil              Converts the thymine (T) bases to uracil (U) in the
                        output fasta format

sample expression file for transcriptome simulation
The expression profile is a tsv file containing expression levels of each isoform to be simulated. Users can use the output of quantify mode as template for modify or use the following format for constructing a new expression profile.

The first row is header row specifying the format of the file. target_id is the id of each transcript. The tpm column is used for simulating, while the est_count is just a placeholder to be compatible with the output of the quantify mode and other quantification tools such as Salmon.

The following rows are entries for each transcript isoform and the id of which needs to exist in the provided reference transcriptome. The id should start with ENS.

Example:

target_id est_counts tpm
ENST00000222247.9 2307.2992 3145.3749
ENST00000274065.8 2641.9534 3601.5848
ENST00000400259.5 623.6130 850.1268
ENST00000344548.7 1828.3466 2492.4533
ENST00000484610.5 766.3528 1044.7137

metagenome mode
If you are interested in simulating ONT metagenome reads, you need to run the simulation stage in "metagenome" mode with following options. We have provided sample config files for users to construct their own -gl, -a, and -dl config files correctly.

metagenome mode usage:

usage: simulator.py metagenome [-h] -gl GENOME_LIST -a ABUN -dl DNA_TYPE_LIST
                               [-c MODEL_PREFIX] [-o OUTPUT] [-max MAX_LEN]
                               [-min MIN_LEN] [-med MEDIAN_LEN] [-sd SD_LEN]
                               [--seed SEED] [-k KMERBIAS]
                               [-b {albacore,guppy,guppy-flipflop}]
                               [-s STRANDNESS] [--perfect]
                               [--abun_var ABUN_VAR [ABUN_VAR ...]] [--fastq]
                               [--chimeric] [-t NUM_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  -gl GENOME_LIST, --genome_list GENOME_LIST
                        Reference metagenome list, tsv file, the first column
                        is species/strain name, the second column is the
                        reference genome fasta/fastq file directory
  -a ABUN, --abun ABUN  Abundance list, tsv file with header, the abundance of all species in
                        each sample need to sum up to 100. See example in README and provided
                        config files
  -dl DNA_TYPE_LIST, --dna_type_list DNA_TYPE_LIST
                        DNA type list, tsv file, the first column is
                        species/strain, the second column is the chromosome
                        name, the third column is the DNA type: circular OR
                        linear
  -c MODEL_PREFIX, --model_prefix MODEL_PREFIX
                        Location and prefix of error profiles generated from
                        characterization step (Default = training)
  -o OUTPUT, --output OUTPUT
                        Output location and prefix for simulated reads
                        (Default = simulated)
  -max MAX_LEN, --max_len MAX_LEN
                        The maximum length for simulated reads (Default =
                        Infinity)
  -min MIN_LEN, --min_len MIN_LEN
                        The minimum length for simulated reads (Default = 50)
  -med MEDIAN_LEN, --median_len MEDIAN_LEN
                        The median read length (Default = None), Note: this
                        simulationis not compatible with chimeric reads
                        simulation
  -sd SD_LEN, --sd_len SD_LEN
                        The standard deviation of read length in log scale
                        (Default = None), Note: this simulation is not
                        compatible with chimeric reads simulation
  --seed SEED           Manually seeds the pseudo-random number generator
  -k KMERBIAS, --KmerBias KMERBIAS
                        Minimum homopolymer length to simulate homopolymer
                        contraction andexpansion events in, a typical k is 6
  -b {albacore,guppy,guppy-flipflop}, --basecaller {albacore,guppy,guppy-flipflop}
                        Simulate homopolymers and/or base qualities with
                        respect to chosen basecaller: albacore, guppy, or
                        guppy-flipflop
  -s STRANDNESS, --strandness STRANDNESS
                        Percentage of antisense sequences. Overrides the value
                        profiled in characterization stage. Should be between
                        0 and 1
  --perfect             Ignore error profiles and simulate perfect reads
  --abun_var ABUN_VAR [ABUN_VAR ...]
                        Simulate random variation in abundance values, takes
                        in two values, format: relative_var_low,
                        relative_var_high, Example: -0.5 0.5)
  --fastq               Output fastq files instead of fasta files
  --chimeric            Simulate chimeric reads
  -t NUM_THREADS, --num_threads NUM_THREADS
                        Number of threads for simulation (Default = 1)

sample abundance file for metagenome simulation

The abundance file is a tsv file, with rows representing the abundance of each sample and columns representing each sample. Each column (except for the first row) needs to sum up to 100, because the total abundance of each sample needs to be 100.

The first row is header row to specify the number of reads in each sample. The format of the first row is:
Size total_reads_in_sample1 total_reads_in_sample2 ...
The following rows are in the format as:
Species abundance_in_sample1 abundance_in_sample2 ...

Size 200000 100
Bacillus subtilis 12 0.89
Cryptococcus neoformans 2 0.00089
Enterococcus faecalis 12 0.00089
Escherichia coli 12 0.089
Lactobacillus fermentum 12 0.0089
Listeria monocytogenes 12 89.1
Pseudomonas aeruginosa 12 8.9
Saccharomyces cerevisiae 2 0.89
Salmonella enterica 12 0.089
Staphylococcus aureus 12 0.000089

In the above example, there are two samples. The first sample will contain 20,0000 reads, while the second sample will contain 100 reads. The abundances in sample 1 and 2 are as shown in th table, and both of them add up to 100.

* Notice: the use of max_len and min_len in genome mode will affect the read length distributions. If the range between max_len and min_len is too small, the program will run slowlier accordingly.

* Notice: the transcript name in the expression tsv file and the ones in th polyadenylated transcript list has to be consistent with the ones in the reference transcripts, otherwise the tool won't recognize them and don't know where to find them to extract reads for simulation.

* Notice: the species name in the genome list file, dna type file, and abundance file has to be consistent. The chromosome names in the dna type file has to match the ones in the reference genomes.

Example runs:
1 If you want to simulate E. coli genome, then circular command must be chosen because it's a circular genome
./simulator.py genome -dna_type circular -rg Ecoli_ref.fasta -c ecoli

2 If you want to simulate only perfect reads, i.e. no snps, or indels, just simulate the read length distribution
./simulator.py genome -dna_type circular -rg Ecoli_ref.fasta -c ecoli --perfect

3 If you want to simulate S. cerevisiae genome with kmer bias, then linear command must be chosen because it's a linear genome
./simulator.py genome -dna_type linear -rg yeast_ref.fasta -c yeast --KmerBias

4 If you want to simulate human genome with length limits between 1000nt to 10000nt
./simulator.py genome -dna_type linear -rg human_ref.fasta -c human -max 10000 -min 1000

5 If you want to simulate human genome with controlled median read length and standard deviation, NanoSim will use log-normal distribution to approximate the read length distribution ./simulator.py genome -dna_type linear -rg human_ref.fasta -c human -med 5000 -sd 1.05

6 If you want to simulate ten thousands cDNA/directRNA reads from mouse reference transcriptome
./simulator.py transcriptome -rt Mus_musculus.GRCm38.cdna.all.fa -rg Mus_musculus.GRCm38.dna.primary_assembly.fa -c mouse_cdna -e abundance.tsv -n 10000

7 If you want to simulate five thousands cDNA/directRNA reads from mouse reference transcriptome without modeling intron retention
./simulator.py transcriptome -rt Mus_musculus.GRCm38.cdna.all.fa -c mouse_cdna -e abundance.tsv -n 5000 --no_model_ir

8 If you want to simulate two thousands cDNA/directRNA reads from human reference transcriptome with polya tails, mimicking homopolymer bias (starting from homopolymer length >= 6) and reads in fastq format
./simulator.py transcriptome -rt Homo_sapiens.GRCh38.cdna.all.fa -c Homo_sapiens_model -e abundance.tsv -rg Homo_sapiens.GRCh38.dna.primary.assembly.fa --polya transcripts_with_polya_tails --fastq -k 6 --basecaller guppy -r dRNA

9 If you want to simulate two metagenome samples with abundance variation, and chimeric reads
.simulator.py metagenome -gl sample_config_file/metagenome_list_for_simulation -a sample_config_file/abundance_for_simulation_multi_sample.tsv -dl sample_config_file/dna_type_list.tsv -c pre-trained_models/metagenome_ERR3152364_Even/training --abun_var -0.5 0.5 --chimeric

Explanation of output files

1. Characterization stage

1.1 Characterization stage (genome)

  1. training_aligned_region.pkl Kernel density function of aligned regions on aligned reads
  2. training_aligned_reads.pkl Kernel density function of aligned reads
  3. training_ht_length.pkl Kernel density function of unaligned regions on aligned reads
  4. training_besthit.bam The best alignment of each read based on length
  5. training_match.hist/training_mis.hist/training_del.hist/training_ins.hist Histogram of match, mismatch, and indels
  6. training_first_match.hist Histogram of the first match length of each alignment
  7. training_error_markov_model Markov model of error types
  8. training_ht_ratio.pkl Kernel density function of head/(head + tail) on aligned reads
  9. training.bam The alignment output
  10. training_match_markov_model Markov model of the length of matches (stretches of correct base calls)
  11. training_model_profile Fitted model for errors
  12. training_processed.bam A re-formatted BAM file for user-provided alignment file
  13. training_unaligned_length.pkl Kernel density function of unaligned reads
  14. training_error_rate.tsv Mismatch rate, insertion rate and deletion rate
  15. training_strandness_rate Strandness rate in input reads

1.2 Characterization stage (transcriptome)

  1. training_aligned_region.pkl Kernel density function of aligned regions on aligned reads
  2. training_aligned_region_2d.pkl Two-dimensional kernel density function of aligned regions over the length of reference transcript they aligned
  3. training_aligned_reads.pkl Kernel density function of aligned reads
  4. training_ht_length.pkl Kernel density function of unaligned regions on aligned reads
  5. training_besthit.bam The best alignment of each read based on length
  6. training_match.hist/training_mis.hist/training_del.hist/training_ins.hist Histogram of match, mismatch, and indels
  7. training_first_match.hist Histogram of the first match length of each alignment
  8. training_error_markov_model Markov model of error types
  9. training_ht_ratio.pkl Kernel density function of head/(head + tail) on aligned reads
  10. training.bam The alignment output
  11. training_match_markov_model Markov model of the length of matches (stretches of correct base calls)
  12. training_model_profile Fitted model for errors
  13. training_processed.bam A re-formatted BAM file for user-provided alignment file
  14. training_unaligned_length.pkl Kernel density function of unaligned reads
  15. training_error_rate.tsv Mismatch rate, insertion rate and deletion rate
  16. training_strandness_rate Strandness rate in input reads
  17. training_addedintron_final.gff3 gff3 file format containing the intron coordinate information
  18. training_IR_info List of transcripts in which there is a retained intron based on IR modeling step
  19. training_IR_markov_model Markov model of Intron Retention events

1.3 Characterization stage (metagenome)

  1. training_aligned_region.pkl Kernel density function of aligned regions on aligned reads
  2. training_aligned_reads.pkl Kernel density function of aligned reads
  3. training_ht_length.pkl Kernel density function of unaligned regions on aligned reads
  4. training_besthit.bam The best alignment of each read based on length
  5. training_match.hist/training_mis.hist/training_del.hist/training_ins.hist Histogram of match, mismatch, and indels
  6. training_first_match.hist Histogram of the first match length of each alignment
  7. training_error_markov_model Markov model of error types
  8. training_ht_ratio.pkl Kernel density function of head/(head + tail) on aligned reads
  9. training.bam The alignment output
  10. training_match_markov_model Markov model of the length of matches (stretches of correct base calls)
  11. training_model_profile Fitted model for errors
  12. training_processed.bam A re-formatted BAM file for user-provided alignment file
  13. training_unaligned_length.pkl Kernel density function of unaligned reads
  14. training_error_rate.tsv Mismatch rate, insertion rate and deletion rate
  15. training_strandness_rate Strandness rate in input reads
  16. training_chimeric_info Information for chimeric reads
  17. training_gap_length.pkl The gap size between suplementary alignments

2. Simulation stage

  1. simulated_reads.fasta FASTA file of simulated reads. Each reads has "unaligned", "aligned", or "perfect" in the header determining their error rate. "unaligned" means that the reads have an error rate over 90% and cannot be aligned. "aligned" reads have the same error rate as training reads. "perfect" reads have no errors.

To explain the information in the header, we have two examples:

  • >ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0
    All information before the first _ are chromosome information. 468529 is the start position and unaligned suggesting it should be unaligned to the reference. The first 0 is the sequence index. F represents a forward strand. 0_3236_0 means that sequence length extracted from the reference is 3236 bases.
  • >ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2
    This is an aligned read coming from chromosome XI at position 115406. 16565 is the sequence index. R represents a reverse complement strand. 92_12710_2 means that this read has 92-base head region (cannot be aligned), followed by 12710 bases of middle region, and then 2-base tail region.

The information in the header can help users to locate the read easily.

Specific to transcriptome simulation: for reads that include retained introns, the header contains the information starting from Retained_intron, each genomic interval is separated by ;.

Specific to chimeric reads simulation: for chimeric reads, different source chromosome and locations are separated by ;, and there's a chimeric in the header to indicate.

  1. simulated_error_profile Contains all the information of errors introduced into each reads, including error type, position, original bases and current bases.

Acknowledgements

Sincere thanks to our labmates and all contributors and users of this tool.

nanosim's People

Contributors

aafshinfard avatar abremges avatar alphasquad avatar cheny19 avatar fairliereese avatar haghshenas avatar jasperlinthorst avatar justinchu avatar karel-brinda avatar kmnip avatar lokiluciferase avatar saberhq avatar theottlo avatar warrenlr avatar xinehc avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

nanosim's Issues

Does NanoSim works for large genomes?

Hello,

I tried to simulate reads using NanoSim for the human genome.
The read_analysis.py script worked and generated the required models.
However, simulator.py script is running for 3 days now and is not generating any output or simulated reads.
Is there anything special I need to do for large genomes?
Can you please comment on how NanoSim scales for human size genome?

problem ftp adress with example data

Dear team,

I have a problem with the ftp adress with example data. I used the one in the example.sh file:
ftp://ftp.bcgsc.ca/supplementary/NanoSim/
but the directory "upplementary/NanoSim" does not exist ?
Do you have an updated link ?

Thanks,

Best regards,
Maud

Perfect option gives error

The --perfect option give the following error for me:

Traceback (most recent call last):
  File "./simulator.py", line 646, in <module>
    main()
  File "./simulator.py", line 638, in main
    read_profile(number, model_prefix, perfect)
  File "./simulator.py", line 156, in read_profile
    for i in xrange(number_unaligned):
UnboundLocalError: local variable 'number_unaligned' referenced before assignment

simulator.py gets stuck (training done on NGMLR bam file)

Hello,

I wanted to compare the results of an SV detection program when using minimap2 mapping (minisv) or NGMLR mapping (ngmlrsv) as input, on reads simulated by nanosim.

In order not to favour minisv (minimap2 being nanosim default mapper), I wanted to use nanosim with NGMLR mapping as input.

Although read_analysis.py works fine with an NGMLR bam file, when I run simulator.py asking for 100,000 simulated reads, it gets stuck, each time at a different number of generated reads, usually above 20,000.

I am using nanosim 2.2.0 and the following command lines (2nd step asking for 80G of ram):
read_analysis.py -i sub10k.fa -m sub10kreads.genome.aln.sam -r Perca_flavescens.PFLA1.1.dna.toplevel.okseqids.fa.gz -t 4 > read_analysis.out 2> read_analysis.log

simulator.py linear -r Perca_flavescens.PFLA1.1.dna.toplevel.okseqids.fa -n 100000 -c training -o simulating > simulated.out 2> simulated.err

I have put the input files needed for the second step here, let me know if you need anything else?
http://genoweb.toulouse.inra.fr/~sdjebali/issues/nanosim/tosend.tar.gz

Best,
Sarah

Calculate length of simulated reads

Hi,

I have a question regarding the header notation of the simulated reads.
In the given example, for the "aligned" reads you say "92_12710_2 means that this read has 92-base head region (cannot be aligned), followed by 12710 bases of middle region, and then 2-base tail region.".
Does this mean that the total length of the simulated read is 12,804bp (92 + 12710 + 2)?

This is what I thought, but when I compared the header info from NanoSim (first column in the example below) with the length of the sequence itself (second column in the example below), these numbers don't match (the length is always longer):
ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome.1213634_aligned_7_R9_8481_25 | 8563
ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome.989475_aligned_8_F31_6280_22 | 6406
ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome.2385960_aligned_9_R1_3551_22 | 3642

Can you please help me understand how can I calculate the length of the simulated reads based on the information provided in the header and/or the other model files?

Thank you,
Natasha

Division by zero and other edge case errors

There are couple of bugs I encountered which are mostly of the same nature:

  • No unmapped reads means that read_analysis.py will not generate a training_unaligned_length.pkl file which breaks simulator.py script:
Traceback (most recent call last):
  File "extern/nanosim/src/simulator.py", line 739, in <module>
    main()
  File "extern/nanosim/src/simulator.py", line 723, in main
    read_profile(number, model_prefix, perfect)
  File "extern/nanosim/src/simulator.py", line 167, in read_profile
    kde_unaligned = joblib.load(model_prefix + "_unaligned_length.pkl")
  File "/home/borabi/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 570, in load
    with open(filename, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'genes/E2F1/P000R001/training_unaligned_length.pkl'
  • No head soft clipping
  • No mismatch/ins/del in a read:
    Example of the last:
2019-04-03 16:49:11: match and error models
Traceback (most recent call last):
  File "extern/nanosim/src/read_analysis.py", line 190, in <module>
    main(sys.argv[1:])
  File "extern/nanosim/src/read_analysis.py", line 180, in main
    error_model.hist(prefix, file_extension)
  File "extern/nanosim/src/besthit_to_histogram.py", line 383, in hist
    out_error_rate.write("Mismatch rate:\t" + str(total_mis * 1.0 / (total_mis + total_match + total_del)) + '\n')
ZeroDivisionError: float division by zero

What will NanoSim do when it encounters NNNN region?

Hi,

I am using NanoSim to simulate nanopore reads from hg38 genome. I am wondering what NanoSim is programmed to do when it encounters NNNN regions in the hg38 genome. I am asking because I have encountered some simulated reads that actually produced nucleotide (ATCG) sequences at NNNN regions, and these nucleotide sequence are unmappable to anywhere in the genome.

To explain it clearer,
Region in hg38 genome: AGCTCATGCAAGGGANNNNNNNNNNNNNNNNNNNNN
NanoSim simulate read: AGCTCATGCAAGGGATCGAGCTGATCGGATCGGATGC

Please explain how this sequence (TCGAGCTGATCGGATCGGATGC) is generated.

Thanks

Best,
Tham

Index out of range

Hello I'm trying to run nanofilt and after running simulation.py I get the following error message :

File "~/soft/nanosim/src/simulator.py", line 739, in <module> main() File "~/soft/nanosim/src/simulator.py", line 733, in main max_readlength, min_readlength) File "~/soft/nanosim/src/simulator.py", line 295, in simulation read_mutated = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias, False) File ~/soft/nanosim/src/simulator.py", line 579, in mutate_read tmp_bases.remove(read[key + i]) IndexError: string index out of range
I tried to change --max_len however it doesn't have any effect.

Do you know what's the issue ?

Thanks for your help

using nanosim to generate fastq files?

Is there a way to make fastq files with nanosim? It seems to me that it would be preferable to generate fastq versus fasta files, so that nanopore QC software (eg., nanofilt) can be run on the fastq files.

"Index out of range" Error in read_analysis.py

When tested, the read_analysis.py step outputs the following message after minimap2 alignment:

2018-06-26 02:00:48: Read pre-process and unaligned reads analysis
2018-06-26 02:01:23: Alignment with minimap2
2018-06-26 03:39:49: Aligned reads analysis
2018-06-26 04:12:57: match and error models
Traceback (most recent call last):
  File "/lab01/Tools/NanoSim-2.0.0/src/read_analysis.py", line 202, in <module>
    main(sys.argv[1:])
  File "/lab01/Tools/NanoSim-2.0.0/src/read_analysis.py", line 186, in main
    error_model.hist(prefix, file_extension)
  File "/lab01/Tools/NanoSim-2.0.0/src/besthit_to_histogram.py", line 330, in hist
    add_dict(list_hist[i], dic_mis)
IndexError: list index out of range

The input is a downloaded a fastq file from Nanopore WGS Consortium for NA12878 and converted to fasta using seqtk. The reference is GRCh38. I've tested both version 2.1.0 and version 2.0.0.

Read Output ID

Relative to the comments in the readme:

ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0
All information before the first _ are chromosome information. 468529 is the start position and unaligned suggesting it should be unaligned to the reference. The first 0 is the sequence index. F represents a forward strand. 0_3236_0 means that sequence length extracted from the reference is 3236 bases.
ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2
This is an aligned read coming from chromosome XI at position 115406. 16565 is the index of simulation. R represents a reverse complement strand. 92_12710_2 means that this read has 92-base head region (cannot be aligned), followed by 12710 bases of middle region, and then 2-base tail region.

The 4th field confuses me. Is it "index of simulation" or "seqeunce index" and what does this mean? Is it basically a unique read ID? If so, why does the numbering reset when transitioning between unaligned reads and aligned reads?

eg. an example of what I see in the output:

ref|NC-001137|-[chromosome=V]_468529_unaligned_123_F_0_3236_0
ref|NC-001143|-[chromosome=XI]_115406_aligned_0_R_92_12710_2
The numbering resets, if intended to be a ID of sorts (which will be useful, as I may want to convert the header to something simpler) they should not reset.

Read length distribution unexpected peaks

I used the following commands to simulate reads for an E.coli strain from the NCBI database, I simulated 2 sets of reads with different length parameters but in both cases when I plotted the distribution of read lengths, regular but unexpected peaks in frequency of certain read lengths can be seen.
Is there an explanation for this and is it avoidable?

`~/NanoSim-2.1.0$simulator.py circular -r ../MG1655_reference.fasta -c ecoli -n 200000 -o MG1655_Q3_simulation  --max_len 2500

~/NanoSim-2.1.0$simulator.py circular -r ../MG1655_reference.fasta -c ecoli -n 200000 -o MG1655_Q2_simulation  --min_len 2500 --max_len 5000`

screen shot 2018-07-10 at 2 23 39 pm

Add option to control error percentage

Hey there,

Would it be possible to have an additional option to be able to control the average error rate in the simulated reads?

Usage being that I would like to create a range of simulated reads with varying error profiles in order to test the error tolerances of various other tools. This tool would be fantastic for this because it uses actual ONT characteristics.

Maybe by editing the following?

 with open(model_prefix + "_unaligned_length_ecdf", 'r') as u_profile:
        new = u_profile.readline().strip()
        rate = new.split('\t')[1]
        # if parameter perfect is used, all reads should be aligned, number_aligned equals total number of reads.
        if per or rate == "100%":
            number_aligned = number
        else:
            number_aligned = int(round(number * float(rate) / (float(rate) + 1)))
        number_unaligned = number - number_aligned
        unaligned_dict = read_ecdf(u_profile)

What do you think?

In the meantime, i'm just going to hack out some kind of solution, but it would be great if it was part of the tool.

Cheers

Unavailability of long Read MinION template

Hi,

As I understand, we need a long raw read from MinION for the simulator to do a read error profile characterization. I am unable to find raw long reads online apart from ecoli and yeast. Do we need reads of each genus to simulate artificial reads of that genus? For ex, I need artificial reads of 10 types of bacteria, do I need template reads for all 10 of them to configure the simulator?

model fitting

Hi,

I am trying to run NanoSim on a Drosophila dataset, I ran it fine on E. coli and other organisms but now I am getting this repeatedly (~10,000 times) in the model_fitting.Rout file

...

  • }

mis.fit.tmp1 <- mis_fit_func(LL.mis)
[1] "Try different initial value of MLE"
[1] "Try different initial value of MLE"

Do you have any ideas how to get around this or what is wrong? Thanks!

indexerror: string index out of range

What can I do if my reference genome is smaller than the reference used for training the error model? I get the error "indexerror: string index out of range" in this case.

Using "sam" alignments with "read_analysis.py"

Hi,

I would like to use my own alignment file with "read_analysis.py" and skip the default alignment with LAST. However, the alignment file I have is in "sam" format, and "read_analysis.py" requires alignment file in "maf" format. I was wondering if it is possible to use "sam" alignments in this step, or if not, do you know how can I convert the "sam" files to "maf" ones ?

Thank you,
Natasha

Minimap2 vs Last running time

Hi,

I tried "NanoSim-2.0" with both "minimap2" and "last" on my data.
The alignment finishes quickly; however, the run with "minimap2" took 10 hours compared to the run with "last" that took about an hour.

For "minimap2", you can see below that there have been almost 10 hours before the simulation was reported as finished.
2018-05-01 02:44:54: Model fitting
2018-05-01 12:20:20: Finished!

while for "last" on the same data, there has been only an hour:
2018-05-01 02:51:10: Model fitting
2018-05-01 03:43:43: Finished!

This is when I ran "read_analysis.py", so I was wondering if this is normal, or I am missing some optimization step for "minimap2"? I did use "minimap2" with 4 threads.

Thank you,
Natasha

Training model failure

Hi,

I run the scripts as;

read_analysis.py -i C002_05_6_50kb_TemplateFail.fasta -r E_coli_K12_NC000913.1.fasta

simulator.py circular -r E_coli_K12_NC000913.1.fasta -n 200000 --max_len 200000

and I get the following errors;

2017-02-01 20:24:49: Read pre-process and unaligned reads analysis
2017-02-01 20:24:51: Alignment with LAST
2017-02-01 20:36:07: Aligned reads analysis
2017-02-01 20:36:08: match and error models
2017-02-01 20:36:34: Model fitting
2017-02-01 20:41:16: Finished!
Traceback (most recent call last):
File "/gs/project/wst-164-ab/anthony/software/NanoSim-master/src/simulator.py", line 671, in
main()
File "/gs/project/wst-164-ab/anthony/software/NanoSim-master/src/simulator.py", line 663, in main
read_profile(number, model_prefix, perfect, max_readlength, min_readlength)
File "/gs/project/wst-164-ab/anthony/software/NanoSim-master/src/simulator.py", line 129, in read_profile
with open(model_profile, 'r') as mod_profile:
IOError: [Errno 2] No such file or directory: 'training_model_profile'

Could you kindly help me understand what the problem is.

Thanks

not python>=2.6 compatible

The README.md for nanosim states:

Python (2.6 or above)

Thus, nanosim should be compatible with Python>=2.6 and Python>=3. However, when using nanosim v1.3.0, I got the following error:

Traceback (most recent call last):
  File "/ebio/abt3_projects/software/dev/miniconda3_dev/envs/read_sim/bin/simulator.py", line 716, in <module>
    main()
  File "/ebio/abt3_projects/software/dev/miniconda3_dev/envs/read_sim/bin/simulator.py", line 708, in main
    read_profile(number, model_prefix, perfect, max_readlength, min_readlength)
  File "/ebio/abt3_projects/software/dev/miniconda3_dev/envs/read_sim/bin/simulator.py", line 148, in read_profile
    match_ht_list = read_ecdf(fm_profile)
  File "/ebio/abt3_projects/software/dev/miniconda3_dev/envs/read_sim/bin/simulator.py", line 79, in read_ecdf
    for i in xrange(lanes):
NameError: name 'xrange' is not defined

...which suggests that the code is not actually python>=3 compatible. It would help to clarify the README or update the python code. Also, the bioconda recipe for nanosim should specify python>=2.6.

.log file for read_analysis.py? (Killed error)

I try to run read_analysis.py for some time now. Now I am at a stage where I get "Killed" everytime after a while, with no error code or other explanation. Other errors in the past I could solve with the python errors I got, but here I got nothing. The simulator.py does seem to give a log file (according to the readme), but the read_analysis.py doesn' t seem to have this, so I cannot look into what goes wrong.

#copy/paste of my terminal screen:

[user@01 ~]$ cd Documents/tools/NanoSim/src/
[user@01 src]$ python read_analysis.py -i /path/to/reads/ERR2173373.fasta -r /path/to/ref/OMOL01.fasta -m /path/to/sam/arabidopsis_minimap_CStag.sam -o /path/to/output 
2018-10-26 08:04:04: Read pre-process and unaligned reads analysis
2018-10-26 08:04:59: Processing alignment file: sam
2018-10-26 08:24:27: Aligned reads analysis
2018-10-26 08:41:20: match and error models
Killed
[user@01 src]$

p.s I made my own minimap2.sam, because the NanoSim variant keeps giving me a memory error. I did use the same parameters though.

[IndexError: string index out of range] Trying to introduce errors at indices > length of read

NanoSim is occusionally producing an error caused by the generating a ref_l greater than the length of a reference contig. In this example, my reference has only one contig of length ~850. Note that I added two lines to print the error_dict and new_read variables for debugging.

[borabi@xavier3 freddie]$ simulator.py linear -r reference.fasta -c training -o simulated -n 3
len(new_read)
851
error_dict:
{13: ['del', 1], 22.5: ['ins', 4], 48: ['mis', 1], 52: ['mis', 1], 53: ['mis', 1], 134: ['del', 1], 147: ['del', 1], 165: ['mis', 1], 166: ['mis', 1], 177: ['del', 1], 228: ['mis', 1], 229: ['mis', 1], 237.5: ['ins', 1], 246: ['mis', 1], 250: ['mis', 1], 265: ['del', 1], 288: ['del', 1], 310: ['del', 1], 337: ['mis', 1], 370: ['del', 1], 371: ['mis', 2], 494: ['del', 1], 507.5: ['ins', 3], 521: ['mis', 1], 531.5: ['ins', 5], 535.5: ['ins', 3], 573: ['del', 3], 577: ['del', 1], 578.5: ['ins', 1], 590: ['del', 3], 615: ['mis', 2], 618: ['del', 1], 638: ['del', 1], 671: ['mis', 1], 674: ['mis', 2], 699: ['del', 2], 701: ['mis', 1], 752: ['del', 2], 754: ['mis', 2], 756.5: ['ins', 1], 759: ['mis', 1], 763: ['mis', 1], 770: ['mis', 1], 775: ['mis', 2], 778: ['mis', 1], 782: ['del', 1], 810: ['mis', 1], 819.5: ['ins', 2], 845.5: ['ins', 2], 870: ['mis', 1], 890: ['del', 2], 909: ['del', 1], 928: ['del', 2], 942.5: ['ins', 2], 943.5: ['ins', 2], 974: ['mis', 1], 983: ['del', 3], 996.5: ['ins', 1], 1018: ['mis', 1], 1054: ['mis', 2], 1067: ['del', 1], 1072: ['del', 3], 1078: ['mis', 1], 1090: ['mis', 1], 1104.5: ['ins', 3], 1107: ['mis', 1], 1122: ['del', 1], 1125: ['mis', 2], 1127: ['mis', 2], 1130: ['mis', 2], 1132: ['mis', 1], 1136: ['del', 2], 1149: ['del', 2], 1153: ['del', 1], 1163: ['mis', 1], 1170.5: ['ins', 1], 1211.5: ['ins', 2], 1212: ['mis', 1], 1214: ['del', 1], 1215: ['mis', 1], 1223: ['del', 1], 1233: ['mis', 1], 1258: ['mis', 1], 1291: ['mis', 1], 1299: ['mis', 1], 1303.5: ['ins', 1], 1305: ['mis', 2], 1310.5: ['ins', 4], 1314: ['mis', 1], 1316: ['del', 1], 1322.5: ['ins', 2], 1341.5: ['ins', 1], 1358: ['del', 1]}
Traceback (most recent call last):
  File "simulator.py", line 741, in <module>
    main()
  File "simulator.py", line 735, in main
    max_readlength, min_readlength)
  File "simulator.py", line 384, in simulation
    read_mutated = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias)
  File "simulator.py", line 581, in mutate_read
    tmp_bases.remove(read[key + i])
IndexError: string index out of range

Change read length distribution simulated reads

I don't think it is possible, but I wanted to change some parameters in my simulated reads, to test different factors. One of the parameters I really want to change is the read length like max, average, min. I tried changing some parameters in the ecdf files (_aligned_reads_ecdf, _aligned_length_ecdf and _unaligned_length_ecdf) by changing the bins and the percentages after them, but that doesn't seem to work. Only changing _aligned_reads_ecdf (what I tried first) doesn't even seem to do anything.

Do you know how I could possibly change the read lengths of the reads I am going to simulate? I believe that the error rates should be enough to make the reads? I guess they do not change according to the size of the reads (that is only library preparation)?

NanoSim Simulation and reference use

Hi,
I'm actually working with NanoSim to simulate reads from customized reference:
to do That I'm using:

  • predefined error profile (from nanosim-h) which is ecoli_R9_1D.
  • 2 differents customized references (to see the impact of homopolymers on reads).
    But according to my results, I think that change the reference file has no impact on reads simulation the error profile put same errors type and number for same read length sequences.
    Is my results correct (i.e nanoSim use just the error profile) or it consider the reference file (number of homopolymers and other things) ?
    thank's.

Simulator fails using different reads

Hi,
The read_analysis.py and simulator.py steps are working fine when I use the supplied ecoli and yeast reads in the profile stage. However, when I use Nick Lomans MAP-006-2 reads ( http://lab.loman.net/2015/09/24/first-sqk-map-006-experiment/ ) I get this error in the simulation stage:

Traceback (most recent call last):
File "/home/mike/NanoSim/src/simulator.py", line 643, in
main()
File "/home/mike/NanoSim/src/simulator.py", line 635, in main
read_profile(number, model_prefix, perfect)
File "/home/mike/NanoSim/src/simulator.py", line 149, in read_profile
number_aligned = int(round(number * float(rate) / (float(rate) + 1)))
ValueError: invalid literal for float(): 100%

These were my commands:

read_analysis.py -i ecoli_K12_MG1655_ref.fa -r MAP006-2_2D_pass.fasta -o MAP

simulator.py circular -r Pmarinus.fasta -c MAP -n 10 -o pmarinus_MAP_10

This data set has longer reads (>9kbp) and higher accuracy than the supplied data sets I think.
Any ideas?

Many thanks,
Mike

Error encountered in the KernelDensity function

Dear authors,

The error encountered is ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.

I used the datasets provided by the Loman lab and an ecoli reference genome as input.

The error message prevents me from running the simulater.py script.

Any suggestions you have regarding how to resolve the issue are appreciated! Or can you make the precomputed profiles available again?

Thank you!

root@549758d01510:/# INPUT=/data/R9_Ecoli_K12_MG1655_lambda_MinKNOW_0.51.1.62.all.fasta
root@549758d01510:/# REF=/data/13002263354.fna
root@549758d01510:/# PROFILE=/data/ecoli
root@549758d01510:/#
root@549758d01510:/# read_analysis.py -i $INPUT -r $REF -o $PROFILE
2019-02-21 18:10:06: Read pre-process and unaligned reads analysis
2019-02-21 18:10:23: Alignment with minimap2
[M::mm_idx_gen::0.187*0.85] collected minimizers
[M::mm_idx_gen::0.271*0.88] sorted minimizers
[M::main::0.272*0.88] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.288*0.87] mid_occ = 12
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.304*0.86] distinct minimizers: 838544 (98.18% are singletons); average occurrences: 1.034; average spacing: 5.352
Killed
2019-02-21 18:13:00: Aligned reads analysis
Traceback (most recent call last):
  File "/NanoSim/src/read_analysis.py", line 190, in <module>
    main(sys.argv[1:])
  File "/NanoSim/src/read_analysis.py", line 161, in main
    num_aligned = align.head_align_tail(prefix, file_extension)
  File "/NanoSim/src/head_align_tail_dist.py", line 175, in head_align_tail
    kde_aligned = KernelDensity(bandwidth=10).fit(aligned_2d)
  File "/root/.local/lib/python3.7/site-packages/sklearn/neighbors/kde.py", line 128, in fit
    X = check_array(X, order='C', dtype=DTYPE)
  File "/root/.local/lib/python3.7/site-packages/sklearn/utils/validation.py", line 582, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.
root@549758d01510:/# ls -lah /data
total 1.8G
drwxr-xr-x 7 root root  224 Feb 21 18:12 .
drwxr-xr-x 1 root root 4.0K Feb 21 18:08 ..
-rw-rw-r-- 1 root root 4.6M Feb 20 21:55 13002263354.fna
-rw-r--r-- 1 root root 901M May 25  2016 R9_Ecoli_K12_MG1655_lambda_MinKNOW_0.51.1.62.all.fasta
-rw-r--r-- 1 root root    0 Feb 21 18:10 ecoli.sam
-rw-r--r-- 1 root root    0 Feb 21 18:12 ecoli_primary.sam
-rw-r--r-- 1 root root 901M Feb 21 18:10 ecoli_processed.fasta

simulator.py: UnicodeDecodeError

I used the command below and obtained an error:

$ python ~/repositories/NanoSim/src/simulator.py linear -r ~/GRCh38_recommended/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz -o nanosim-10000 --median_len 10000 --sd_len 1
/home/wdecoster/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
Traceback (most recent call last):
  File "/home/wdecoster/repositories/NanoSim/src/simulator.py", line 739, in <module>
    main()
  File "/home/wdecoster/repositories/NanoSim/src/simulator.py", line 730, in main
    max_readlength, min_readlength, median_readlength, sd_readlength)
  File "/home/wdecoster/repositories/NanoSim/src/simulator.py", line 254, in simulation
    for seqN, seqS, seqQ in readfq(infile):
  File "/home/wdecoster/repositories/NanoSim/src/simulator.py", line 211, in readfq
    for l in fp:  # search for the start of the next record
  File "/home/wdecoster/anaconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

This error was resolved by using a non-compressed reference genome, but it would be good if you could either support compressed reference genomes or alternatively give a more informative error message.

Cheers,
Wouter

problem with dna_type

dear Developper,
I get this error when i run the tool with this command:
command:
~/Downloads/NanoSim/src/simulator.py genome -dna_type linear -rg Fusve2_AssemblyScaffolds.fasta -c training

error:
`running the code with following parameters:

ref_g Fusve2_AssemblyScaffolds.fasta
model_prefix training
out simulated
number 20000
perfect False
kmer_bias 0
dna_type linear
strandness None
sd_readlength None
median_readlength None
2019-05-21 08:25:33: /home/lfaino/Downloads/NanoSim/src/simulator.py genome -dna_type linear -rg Fusve2_AssemblyScaffolds.fasta -c training
mkdir: missing operand
Try 'mkdir --help' for more information.
2019-05-21 08:25:33: Read in reference
Traceback (most recent call last):
File "/home/lfaino/Downloads/NanoSim/src/simulator.py", line 1184, in
main()
File "/home/lfaino/Downloads/NanoSim/src/simulator.py", line 1123, in main
read_profile(ref_g, None, number, model_prefix, perfect, args.mode, strandness)
File "/home/lfaino/Downloads/NanoSim/src/simulator.py", line 281, in read_profile
if len(seq_dict) > 1 and dna_type == "circular":
NameError: global name 'dna_type' is not defined`

Define read length parameters

Hello,

Is it possible to use a generic genome to train for error rates and the distribution, and to run the simulation stage with manually defined mean read length ?

Thanks for your help

Read Simulation: reference length limitation?

Hi,
I was hoping to use this tool to simulate some reads for a set of amplicons. However, it gives me an error when I try it:

src/simulator.py linear -r ~/R/projects/umi.sim/nanosim.input.fasta -c training.juplasmid -o
sim.pcrprod
Traceback (most recent call last):
  File "src/simulator.py", line 716, in <module>
    main()
  File "src/simulator.py", line 710, in main
    simulation(ref, out, dna_type, perfect, kmer_bias, max_readlength, min_readlength)
  File "src/simulator.py", line 284, in simulation
    read_mutated = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias, False)
  File "src/simulator.py", line 577, in mutate_read
    tmp_bases.remove(read[key + i])
IndexError: string index out of range

the reference file contains 10 000 entries with a size of 1,3 kb. So I thought maybe it can not deal with so many entries. That is why I tried the simulation with a plasmid of 7.4 kb, but that outputs the same error.

When I try it with a 49kb lambda genome it works, that is why I assume there is a limitation in the reference size?

Is it possible to adjust NanoSim for shorter references or am I missing something?

all reads aligned bug?

Hi developers
With a particular simulation we repeatedly get an error when starting the sim, because the file _unaligned_length.pkl has not been generated during training. This has not been a problem when running nanosim on our other datasets. Is this happening because there are no unaligned reads in this particular training set? Is there a way to sort this out so we can still get a sim from this dataset?
thanks!

example files absent

trying to run the example.sh, I get folder not found errors
could it be that the files have been moved around?
thanks

wget ftp://ftp.bcgsc.ca/supplementary/NanoSim/ecoli_R7_2D.fasta
--2019-06-06 10:56:57--  ftp://ftp.bcgsc.ca/supplementary/NanoSim/ecoli_R7_2D.fasta
           => ‘ecoli_R7_2D.fasta’
Resolving ftp.bcgsc.ca (ftp.bcgsc.ca)... 134.87.4.91
Connecting to ftp.bcgsc.ca (ftp.bcgsc.ca)|134.87.4.91|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /supplementary/NanoSim ... 
No such directory ‘supplementary/NanoSim’.

=> it seems the files are now @ http://www.bcgsc.ca/downloads/supplementary/NanoSim/ which could be corrected in the example.sh wget commands

Options min_len and max_len throw an error or do not work

Trying to use min_len gives:

python NanoSim/src/simulator.py linear -r ecoli.fa -c projects/cheny_prj/nanopore/paper/R9/1D/ecoli -n 500 --min_len 1500
./simulator.py [command]
[command] circular | linear
Do not choose 'circular' when there is more than one sequence in the reference
:
-h : print usage message
-r : reference genome in fasta file, specify path and file name, REQUIRED
-c : The prefix of training set profiles, same as the output prefix in read_analysis.py, default = training
-o : The prefix of output file, default = 'simulated'
-n : Number of generated reads, default = 20,000 reads
--max_len : Maximum read length, default = Inf
--min_len : Minimum read length, default = 50
--perfect: Output perfect reads, no mutations, default = False
--KmerBias: prohibits homopolymers with length >= n bases in output reads, default = 6

Trying to use max_len gives an error:

python NanoSim/src/simulator.py linear -r ecoli.fa -c projects/cheny_prj/nanopore/paper/R9/1D/ecoli -n 500 --max_len 1000
Traceback (most recent call last):
File "NanoSim/src/simulator.py", line 671, in
main()
File "NanoSim/src/simulator.py", line 665, in main
simulation(ref, out, dna_type, perfect, kmer_bias, max_readlength, min_readlength)
File "NanoSim/src/simulator.py", line 344, in simulation
read_mutated = ''.join(np.random.choice(BASES, head)) + read_mutated
AttributeError: 'module' object has no attribute 'choice'

nanopore data

Does Nanosim work on generating RNA data as well?

Covering the entire genome

Does Nanosim simulate reads from every part of the reference genome? Is there a way to configure coverage or a minimum number of reads to ensure that the entire genome is covered. My requirement is that WGS assembly should be possible from the artificial reads generated.

power overflow

Hi,
When I was running the read_analysis file, I got this:

/mixed_model.py:25: RuntimeWarning: overflow encountered in power
wei_cdf = 1 - np.exp(-1 * np.power(x / l, k))

I was wondering if any action is required for this warning.

Maybe a bug in the error_list function

Hi, I am just wondering if the error_list function works correctly. When creating an e_dict, it sometimes introduces steps of a negative length. I think that it might cause problems in the mutate_read function as it assumes (without any checking) that the step length (val[1]) will be positive. The program does not crash but I am afraid that the resulting data do not fully correspond to the used model. To observe the problem, you can use assert key>=0 after key = int(round(key)).

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.