Git Product home page Git Product logo

sharkmer's Introduction

sharkmer - a kmer analysis tool

Functionalities of sharkmer include:

  • Incremental kmer counting. This allows you to run kmers on incrementally larger subsets of your data. Applications include assessing the robustness of genome size estimates to sequencing depth.
  • in silico PCR (sPCR). You supply a fastq file of whole genome shotgun reads and sequences for primer pairs, and get a fasta file of the amplicons that would be produced by PCR on the genome. This is useful for assembling and isolating particular genes from raw genome skimming data.

There are two components to sharkmer:

  • The sharkmer executable, written in rust, that inputs reads, counts kmers, and outputs histograms.
  • The optional sharkmer_viewer.py python script for viewing and analyzing histograms from incremental kmer counting

sharkmer counts kmers on subsets of the data, and then builds incremental histograms from these counts. This is more efficient than counting kmers on all the data at once, and allows you to see how the results change as data are added. This builds more insight from the same data. It also helps with practical questions, such as figuring out how much coverage you need to get good genome size estimates in your organism and deciding whether to collect more data.

Here is an overview of how kmer counting works in sharkmer:

  1. fastq data are ingested one read at a time and recoded as 8 bit integers, with 2 bits per base. Reads are broken into subreads at any instances of N, since 2 bit encoding only covers the 4 unambiguous bases and kmers can't span N anyway. The subreads are distributed across n chunks of subreads as thew are read.
  2. Within each chunk, kmers are counted in a hashmap. The counting of kmers in parallelized across chunks, allowing multiple threads to be used.
  3. The hashmaps for the n chunks are summed one by one, and a histogram is generated after each chunk of counts is added in. This produces n histograms, each summarizing more reads than the last.

A few notes:

  • The read data must be uncompressed before analysis. No .fastq.gz files, just .fastq.

Installation

This repository includes sharkmer, which is written in rust, and some helper programs written in python. The python components are optional tools for some followup analyses.

Rust components

From source

First, install the rust build tools.

Then clone this repository and build sharkmer (note that the sharkmer rust code is in the sharkmer/sharkmer folder, not the top level sharkmer folder):

git clone https://github.com/caseywdunn/sharkmer.git
cd sharkmer/sharkmer
cargo build --release

The executable will be in sharkmer/target/release/sharkmer. Move it to a location in your path.

Python components

You can create a conda environment with all optional python components as follows:

conda create -n shark -c conda-forge -c bioconda python==3.10 ffmpeg genomescope2
conda activate shark
cd sharkmer_viewer/
pip install .

Test data

Thermus thermophilus

This repository includes a test dataset from Thermus thermophilus for simple tests and to develop against.

After cloning the repo, gunzip the data in the data dir:

cd sharkmer/data/ # Note that this is the sharkmer folder within the sharkmer repository
gunzip -c SRR5324768_pass_1.fastq.gz > SRR5324768_pass_1.fastq

Additional datasets

Additional datasets are available from the NCBI SRA. Install the sra toolkit to download these datasets as described below.

Usage

To get full usage information, run

sharkmer --help

Incremental kmer counting

Incremental kmer counting for genome size estimation takes a lot of data (about 50x coverage of the genome), and a large amount of RAM. So this example isn't practical on most laptops, given their disk and RAM limitations, and will require a workstation or cluster.

We will need a relatively large dataset. Download this Cordagalma ordinatum dataset from NCBI SRA using the sra toolkit:

fastq-dump --split-files SRR23143278

Now perform incremental kmer counting on the downloaded reads:

sharkmer -o output/ -s Cordagalma-ordinatum data/SRR23143278_1.fastq data/SRR23143278_2.fastq

Notice that you can specify multiple fastq files, in this case the R1 and R2 reads. The -o argument specifies the output directory. The -s argument specifies a prefix for the output files.

The incremental histogram files in this case would be:

output/Cordagalma-ordinatum.histo # All the incremental histograms, each in their own column. Suitable for analysis with `sharkmer_viewer.py`.

output/Cordagalma-ordinatum.final.histo # Just the final histogram with all data. Suitable for analysis with genomescope and other tools.

Then to explore the results with the optional python tools:

conda activate shark
sharkmer_viewer Cordagalma-ordinatum.histo

The final histogram on all the data is in a standard format and can be examined with a variety of tools, for example, GenomeScope2:

genomescope2 -i Cordagalma-ordinatum.final.histo -s Cordagalma-ordinatum -k 21

The included genomemovie.sh script will generate a movie of the incremental GenomeScope2 histograms. For example, to create a movie of the Cordagalma test dataset:

conda activate shark
bash genomescopemovie.sh Cordagalma-ordinatum.histo Cordagalma-ordinatum.output

in silico PCR (sPCR)

Investigators often want to pull small genome regions out of genome skimming data, for example to blast a commonly sequenced gene to verify that the sample is the expected species. This task is surprisingly challenging in practice, though. You can map reads to known sequences and then collapse them into a sequence prediction, but this does not always work well across species and can miss variable regions. You can assemble all the reads and then pull out the region of interest, but this is computationally expensive and often the region of interest is not assembled well given how shallow skimming data often are.

in silico PCR (sPCR) is a new alternative approach. You specify a file with raw reads and one or more primer pairs, and sharkmer outputs a fasta file with the sequence of the region that would be amplified by PCR on the genome the reads are derived from. There are multiple advantages to this approach:

  • sPCR directly leverages the decades of work that have been done to optimize PCR primers that work well across species and span informative gene regions. These primers tend to bind conserved regions that flank variable informative regions and have minimal off-target binding.
  • Because sPCR is primer based, you can use it to obtain the exact same gene regions (co1, 16s, 18s, 28s, etc...) that have been PCR amplified for decades and still remain the most broadly sampled across species in public databases.
  • sPCR doesn't take much data. You can use small datasets, or analyze small (eg one million read) subsets of your data.
  • sPCR is fast and has minimum computational requirements. It can be run on a laptop in a couple minutes on a million reads.
  • sPCR requires a single tool (sharkmer), not complex workflows with multiple tools.

sPCR is useful when you want specific genes from skimming datasets you have collected for other purposes. But in some cases it may be more cost effective and easier to skim and apply sPCR than to use traditional PCR. With sPCR you sequence once and then pull out as many gene regions as you want, as opposed to PCR where you amplify and sequence each region separately. There is little additional computational cost for each added primer pair, since most of the work is counting kmers and this is done once for all primer pairs. So the cost of sPCR is fixed and does not depend on the number of genes considered.

in silico PCR example

sPCR of nuclear ribosomal RNA genes (eg animal 16s, 18s) and mitochondrial genes does not take much coverage, given the relatively high copy number of these genes in skimming data. For Illumina raw reads, 0.25x coverage of the genomes is sufficient.

We will download a smaller dataset from the coral Stenogorgia casta:

fasterq-dump SRR26955578

Run sPCR on the downloaded reads by specifying that we want to run a panel of cnidarian primers:

sharkmer --max-reads 1000000 -s Stenogorgia_casta -o output/ --pcr cnidaria SRR26955578_1.fastq SRR26955578_2.fastq

This is equivalent to specifying the primer pairs manually, also with the --pcr argument:

sharkmer \
  --max-reads 1000000 \
  -s Cordagalma_CWD6 -o output/ \
  --pcr "GACTGTTTACCAAAAACATA,AATTCAACATCGAGG,1000,16s" \
  --pcr "TCATAAAGATATTGG,ATGCCCGAAAAACCA,2000,co1" \
  --pcr "AACCTGGTTGATCCTGCCAGT,TGATCCTTCTGCAGGTTCACCTAC,2500,18s" \
  --pcr "CCYYAGTAACGGCGAGT,SWACAGATGGTAGCTTCG,4000,28s"  \
  --pcr "TACACACCGCCCGTCGCTACTA,ACTCGCCGTTACTRRGG,1000,ITSfull" \
  SRR26955578_1.fastq SRR26955578_2.fastq

The --pcr argument passes a string with the format forward,reverse,max-length,gene-name. Note that commas delimit fields. max-length should be greater than the expect PCR product size. It indicates the furthest distance from the forward primer that sharkmer should search for a reverse primer.

The --max-reads 1000000 arguments indicates that the first million reads should be used. THis is plenty for nuclear rRNA sequences 18s, 28s, and ITS, since it occurs in many copies in the genome, and mitochondrial sequences 16s and co1. Single copy nuclear genes would require more data.

This analysis will generate one fasta file for each primer pair. These fasta files are named with the argument passed to --pcr. If no product was found, the fasta file is not present. The fasta file can contain more than one sequence.

There are a limited set of preconfigured primer panels available at this time. You can see where they are hard coded here. If you have other primers that you would like to have added to the tool, or suggestings for optimizing the primers that are already there, please open an issue in the issue tracker.

Reading compressed data

sharkmer does not read compressed data directly, but it can read uncompressed data from stdin. So, for example, you could gunzip files and pipe them to sharkmer:

zcat agalma_*.fastq.gz | sharkmer -k 21 -n 10 --histo-max 10000 -s Agalma-elegans

Decompressing files takes quite a bit of compute. Handling decompression outside of sharkmer allows you to use whichever approach you prefer on your system, for example parallel tools such as pigz.

Development

Rust

Some common tasks in development:

cargo test
cargo clippy
cargo clippy --fix
cargo fmt
cargo build # Debug
cargo build --release

sharkmer's People

Contributors

caseywdunn avatar

Stargazers

Mike T Connelly avatar Adena Collens avatar Bishoy Hanna avatar

Watchers

 avatar darrin t schultz avatar  avatar

sharkmer's Issues

Problems with sPCR - binding sites were not found for the forward primer. Abandoning PCR.

I tried to do a sPCR for the 16S gene with universal primers 27F and 1492R and I received the message "For gene 16s, binding sites were not found for the forward primer. Abandoning PCR.".
I've tried also the primers from the sharkmer itself and other primer sequences and the problem remains.

Can someone please help me?

The complete message is:

sharkmer --max-reads 1000000 --pcr "TTGATCCTGGCTCAG,TACCTTGTTACGACTT,10000,16s" A7.fasta
sharkmer 0.2.0
Args { k: 21, histo_max: 10000, n: 100, max_reads: 1000000, threads: 1, sample: "sample", outdir: "./", input: Some(["A7.fasta"]), pcr: ["TTGATCCTGGCTCAG,TACCTTGTTACGACTT,10000,16s"], rad: [], verbosity: 0 }
Processing sample sample
Ingesting reads... done
Read 70 reads
Read 2288403 bases
Ingested 70 subreads
Ingested 2288403 bases
Ingested 2287003 kmers
Time to ingest reads: 296.109779ms
Consolidating chunks and creating histograms... done, time: 1.75227365s
2275455 unique kmers with a total count of 2287003 were found
Writing histograms to file... done
Writing final histogram to file... done
2275455 unique kmers in histogram
2287003 kmers in histogram
Writing stats to file... done
Running in silico PCR...
Running PCR on gene 16s
Removing kmers with coverage less than 3...
The number of unique kmers went from 2275455 to 1689
Expanding the forward primer into all variants
There are 991 variants of the primer
Expanding the reverse primer into all variants
Trimming the primer to ACCTTGTTACGACTT so that it is within the trim length of 15.
There are 991 variants of the primer
Finding kmers that contain the forward primer
Finding kmers that contain the reverse primer
For gene 16s, binding sites were not found for the forward primer. Abandoning PCR.
Suggested action: optimize primer sequence.
Done running in silico PCR
Total run time: 6.516381262s

do not store reads

It has become infeasible to store reads. On larger dataset, ran out of RAM on a 2TB machine.

Main reason I was storing reads was that I needed to know how many there are so I know how many to allocate to each chunk for hashing. Can't read them twice, once to count and once to hash, since sometimes they come from STDIN.

So will take a new strategy. Will read in batches that will be much smaller than chunk size, and allocate them to each chunk as they are read.

There are two ways to do this. Could split each batch and fan reads out across chunks, or I could allocate each batch in its entirety to a chunk, and then put the next one in the next chunk, etc, and then wrap around and do it again.

nested primer gives no product

Sam found that when the following is run:

sharkmer -k 31 --verbosity 10 -m 800000 -n 100 -t 8 -s YPM-IZ-104464 --pcr "TCATAAAGATATAGG_GTGACCAAAAAACCA_1000_co1bp0" --pcr "TCATAAAGATATAGGA_GTGACCAAAAAACCA_1000_co1bp1" .\NA24_164_029_S8_L002_R1_001_1M.fastq

The co1bp0 primer pair gives a product and the co1bp1 does not, even though co1bp1 has the same reverse primer and the forward primer differs only by the addition of one more nucleotide, which should have the effect of advancing the trimmed primer by one. That advanced nucleotide is in the product given by the first primer. So the second primer is a nested primer.

This is the kmer containing the primer in that first primer pair:

TCATAAAGATATAGGAACATTATACCTAGTC, count 16, keep true

And this is the kmer containing the primer in the second primer pair:
CATAAAGATATAGGAACATTATACCTAGTCT, count 16, keep true

So they have the same count.

panics on some runs

@shchurch found that sharkmer panics on this run:

zcat /vast/palmer/scratch/dunn/sc2962/physalia_novaseq_2023/Sample_KM5634_126_067/KM5634_126_067_S67_L003_R1_001.fastq.gz |  ./sharkmer -k 31 -m 500000 -n 100 -t 8 -s KM5634         --pcr "GACTGTTTACCAAAAACATA_AATTCAACATCGAGG_1000_16s"         --pcr "GGTCAACAAATCATAAAGATATTGG_TAAACTTCAGGGTGACCAAAAAATCA_1000_co1"     --pcr "TCATAAAGATATAGGAACA_GTGACCAAAAAACCA_1000_CO1"         --pcr "AACCTGGTTGATCCTGCCAGT_TGATCCTTCTGCAGGTTCACCTAC_2500_18s"         --pcr "CCYYAGTAACGGCGAGT_SWACAGATGGTAGCTTCG_4000_28s"         --pcr "TACACACCGCCCGTCGCTACTA_ACTCGCCGTTACTRRGG_1000_ITSfull"          --pcr "ACGTGGTATGGTTGCCTCTG_CTTGATAACGCCAACGGCWAC_3000_EF1a"

single end "pcr" mode

If user does not specify second primer with --pcr, then attempt to extend to max length and return the longest product. Suggested by @beroe .

getting two product 0

Get the following:


>Physalia-physalis-Pacific cnidaria_co1 product 0 length 688 kmer count stats mean 53.14 median 54 min 37 max 65
TCATAAAGATATAGGAACATTATACCTAGTCTTTGGTTTATTTTCAGGTATGGTAGGAACTGCTCTTAGTATGTTAATCAGGTTAGAGTTATCAGGACCCGGTACTATGTTTGGAGATGATCACCTTTATAATGTTATAGTAACTGCTCATGCCTTTGTCATGATCTTTTTCCTTGTAATGCCAGT
CCTAATCGGAGGCTTCGGTAACTGGTTCGTACCCCTATTTATAGGTGCTCCGGATATGGCCTTCCCTAGGTTGAACAACCTAAGTTTTTGATTATTACCCCCTGCTTTATTACTACTATTAGGGTCATCCTTGATAGAACAAGGTGCAGGAACTGGTTGAACTGTTTACCCTCCTTTGTCTGGCCC
CCAAACTCATTCTGGGGGATCAGTTGATATGGCTATTTTCAGTTTACACTGTGCGGGTGCCTCCTCAATTATGGGTGCTATTAACTTCATAACCACTATATTTAATATGAGGGCCCCTGGTATGACTATGGATAAGTTACCATTATTTGTCTGATCAGTTTTGATAACTGCCTTCCTCTTATTACT
ATCATTACCCGTGTTGGCCGGAGCTATAACTATGTTACTTACTGATAGAAATTTTAATACTACTTTCTTCGACCCTGCGGGAGGTGGTGATCCAGTTCTCTATCAACATTTGTTCTGGTTTTTTGGTCAC
>Physalia-physalis-Pacific cnidaria_co1 product 0 length 1915 kmer count stats mean 17.18 median 18 min 3 max 70
TCAGAAGAATATTGGATATATTATCAATCAGAGTTGCTCTTTGGTTTATATAAAGAGGGCATGCAAATAAGAAGTGTGTTGTATCTTCAATACCATCGTTGCAAATGCAAACAGGAGAGGGGGTATCAATAAAACCATGACGATTTTTATGACATCTTAATGGACTTAAACCAACTCTCAAAAAAA
GAGATTACGAACTCCAATGGGATTAAAAATATTAAAAATACTTTTTCTCTCTGGACGAAAGAGTGATAACAAATGCTTTTTTAGGATTTTAATAGGTGGCATGTTTTTAAAATCAGAAATGGTAATATTCCATAATTTGACAGCATTTGGGAAAAAGCTATTCTTGTACCTATCAGAATTACATCG
CGTTTCTTGGAAGGTATTGTCATTGTTATATCTATATAGTGCATGGCGAGGAGGGGGAAGTTTATTATTGAGATAAGCTGGCGTCTTATTACTAACAATTTTATGAAATTGCATAATTCTTCTAGAGAGACGGCGATCAGTTAAGGACTCCCAACCTAGTTCTTCGTATAATTTGGAACGACTTGA
GCCACGCCATGTACCAGTAACAGCAAGCGCTGCTAAATATTGGACCTGTTCAAGCTTTTCCATTAAGTAATTAAGTACTTCACCAAGTTGGTCTTGTTTTGACGGGATATGATAAATTACATCGCAGTAATCAAAATGAGGTCGAACAACTGATTTATAAATTAGATTAAGAGTCTTAAGGGGTAA
AAATTTTGATAGATGTTTAATTATACCAAGATGTTGATTAGCCTTTTTGATTTTTTCACTAAGATGTTTATTAAATGATAAGTTTGATTCAAGAATTAGACCTAAATGTTTATGTCCCTTGACTTTGGTGACAGCAGTTCCATTGAAAAATATTTTTGGATGAATAACAGTGGTCTTTTTACAAGA
AAATATAATTTCAGTGGCTTGTTTAAGAGGATCAGGATTGAATTCAAGTTTCCATTGGTTAGCCCATTTATTAATGGTATCTAAATCATGATTTAGATACCATTAATAAATGGGCTAACCAATGGAAACTCGAATTCAACCCTGATCCTCACAAACAAGCCACTGAAGTCTTATTTTCATGTAAAC
AAAAAAGTCCCAATCATCCTCAACTCTTTTTCAATGGAACTGTTGTGGCTAAAGTGGATGAGCATAAACATTTAGGTCTTGTTCTTGAATCAGATTTGTCTTTTAAGAAACATATTCATGAAAAAATTAAGAAGGCTAAAAAGAACATTGGTATAATAAAGTACCTTTCTAAATTTTTACCCCTTA
AGACTCTTGATCTAATTTATAAATCAGTTGTTCGACCTCATTTTGATTACTGCGATGTAATTTATCATATCCCGTCAAAACAAGACCAACTTGGTGAAGTACTTAATTACTTAATGGAAAAGCTTGAACAGGTCCAATATTTAGCAGCGCTTGCTGTTACTGGTGCATGGCAAGGTTCTAGCTGTT
CAAAATTATACGAAGAATTGGGCTGGGAATCCCTTTCAGATCGTCGTTGGTGTAGGCGAATTCTTCAAATTCATAAGATTACGAATAAAAATACACCTAATTATCTCTACAACAAGCTTCCATGTCGCCGTAGGCCTTTATACAGACTAACCAACTATAATATATTTCACGAAATACGATGCCAGA
CTGATCGGTACAAAAATAGTTTTTTCCCAGATGCAATTAAGGGTTGGAATATTGTTATTCAAATTTTTCCTAATATCCCATCAATAAATGTTCTTAAAAATCATATTTTATCTCTCACTCGTCCAGAGAAAAAAACCTTTTTCAACATACACGACCCTGTGGGACTTCGCTACCTTTTCTTTTTGA
GATTGGGCTTTAGTCCTCTAAGGAGTCACAAAAACAGACATGGTTTTTTAGATAC
>Physalia-physalis-Pacific cnidaria_co1 product 1 length 1915 kmer count stats mean 17.18 median 18 min 3 max 70
TCAGAAGAATATTGGATATATTATCAATCAGAGTTGCTCTTTGGTTTATATAAAGAGGGCATGCAAATAAGAAGTGTGTTGTATCTTCAATACCATCGTTGCAAATGCAAACAGGAGAGGGGGTATCAATAAAACCATGACGATTTTTATGACATCTTAATGGACTTAAACCAACTCTCAAAAAAA
GAGATTACGAACTCCAATGGGATTAAAAATATTAAAAATACTTTTTCTCTCTGGACGAAAGAGTGATAACAAATGCTTTTTTAGGATTTTAATAGGTGGCATGTTTTTAAAATCAGAAATGGTAATATTCCATAATTTGACAGCATTTGGGAAAAAGCTATTCTTGTACCTATCAGAATTACATCG
CGTTTCTTGGAAGGTATTGTCATTGTTATATCTATATAGTGCATGGCGAGGAGGGGGAAGTTTATTATTGAGATAAGCTGGCGTCTTATTACTAACAATTTTATGAAATTGCATAATTCTTCTAGAGAGACGGCGATCAGTTAAGGACTCCCAACCTAGTTCTTCGTATAATTTGGAACGACTTGA
GCCACGCCATGTACCAGTAACAGCAAGCGCTGCTAAATATTGGACCTGTTCAAGCTTTTCCATTAAGTAATTAAGTACTTCACCAAGTTGGTCTTGTTTTGACGGGATATGATAAATTACATCGCAGTAATCAAAATGAGGTCGAACAACTGATTTATAAATTAGATTAAGAGTCTTAAGGGGTAA
AAATTTTGATAGATGTTTAATTATACCAAGATGTTGATTAGCCTTTTTGATTTTTTCACTAAGATGTTTATTAAATGATAAGTTTGATTCAAGAATTAGACCTAAATGTTTATGTCCCTTGACTTTGGTGACAGCAGTTCCATTGAAAAATATTTTTGGATGAATAACAGTGGTCTTTTTACAAGA
AAATATAATTTCAGTGGCTTGTTTAAGAGGATCAGGATTGAATTCAAGTTTCCATTGGTTAGCCCATTTATTAATGGTATCTAAATCATGATTTAGATACCATTAATAAATGGGCTAACCAATGGAAACTCGAATTCAACCCTGATCCTCACAAACAAGCCACTGAAGTCTTATTTTCATGTAAAC
AAAAAAGTCCCAATCATCCTCAACTCTTTTTCAATGGAACTGTTGTGGCTAAAGTGGATGAGCATAAACATTTAGGTCTTGTTCTTGAATCAGATTTGTCTTTTAAGAAACATATTCATGAAAAAATTAAGAAGGCTAAAAAGAACATTGGTATAATAAAGTACCTTTCTAAATTTTTACCCCTTA
AGACTCTTGATCTAATTTATAAATCAGTTGTTCGACCTCATTTTGATTACTGCGATGTAATTTATCATATCCCGTCAAAACAAGACCAACTTGGTGAAGTACTTAATTACTTAATGGAAAAGCTTGAACAGGTCCAATATTTAGCAGCGCTTGCTGTTACTGGTGCATGGCAAGGTTCTAGCTGTT
CAAAATTATACGAAGAATTGGGCTGGGAATCCCTTTCAGATCGTCGTTGGTGTAGGCGAATTCTTCAAATTCATAAGATTACGAATAAAAATACACCTAATTATCTCTACAACAAGCTTCCATGTCGCCGTAGGCCTTTATACAGACTAACCAACTATAATATATTTCACGAAATACGATGCCAGA
CTGATCGGTACAAAAATAGTTTTTTCCCAGATGCAATTAAGGGTTGGAATATTGTTATTCAAATTTTTCCTAATATCCCATCAATAAATGTTCTTAAAAATCATATTTTATCTCTCACTCGTCCAGAGAAAAAAACCTTTTTCAACATACACGACCCTGTGGGACTCCGCTACCTTTTCTTTTTGA
GATTGGGCTTTAGTCCTCTAAGGAGTCACAAAAACAGACATGGTTTTTTAGATAC

with:
zcat /gpfs/ycga/work/dunn/sequences/illumina/Novaseq_siphgenomes/Sample_NA22_176_017/*.fastq.gz | sharkmer -k 21 -t 1 --max-reads 2000000 -o output -s Physalia-physalis-Pacific --verbosity 0 --pcr cnidaria > logs/sharkmer.Physalia-physalis-Pacific.log 2>&1

separate graphs for separate start nodes

Starting about here

let seed_graph = create_seed_graph(&forward_primer_kmers, &reverse_primer_kmers, &kmer_counts);
, refactor so that assembly is attempted for one start node at a time rather than by throwing all the start nodes into a single graph.

The intent is to avoid cases where a poisonous start node balloons the graph and blocks extension of a different perfectly good start node.

Allow arbitrary read length

At present, number of nucleotides in a read must be a multiple of 4 since I use two bit encoding in a vector of bytes. Create a Read structure that uses the same encoding, but also has a length element. If the number of nucleotides is not divisible by 4, than the lower significant bits of the last byte are ignored.

Issue with finding sPCR products

HI,
I am testing Sharkmer's sPCR function on a library that I have previously retrieved a full mitogenome from using the standard primers for this group

sequence I am trying to recover
GCTGCTAGTTGGTATTGGCATTTTGTAGATGTGGTTTGGTTGTTTTTATATGTATGTATTTACTGGTGGGGTTCTTAGCCCGGGCCATGGTTACATAATGCTTAGCTAAGCTTGTAGAGTCAACTAACGGTAAGTTTTCAGACTCATAATCTGATTATGCAGGTTCAACTCCTGCCTCTACCTATAGAATCGCCATGGGGAACTTATTCGCTCTGTATGCCCTTAGCAAGGGCCAAATACCTGTACAGTATTTTAACTTAATGGAGGAGAATTATTCTAAGTATGGGCTATCAGTAATCCAGCTTATCCAGATTGGTAAGTTCTATGAACTTTGGCAGAAGCCTGATGTTCCTAGTATCCAACAAGCATACTCTCAAGCTGAGTTATTAGTTGAGTCGTCCATACGAAGTAGGTCTTTAGGGGTAACACCTCCCATTGAACAAGTTGCTTCGTTACTTGATATGAGAATAACATCACCCGGTAAAAGATCCTTGCTTCAAATGGGGTTTCCTATTTATTCCCTAACTACTTATCTAAGCACCTTGTTGGATAAAGGTTGGACTATTATAGTTATTGATGAATTAGCCACTGATAAATCGGGGCCGAAACAACGCGCAGTATCTCAGGTTTATTCTCCTAGTTGTAATTTAGAAGATTGTCCGGAATTATCTTATGTGATATCAATCTATTTTAAAGATGACTTACTAGGTATTACTTTATTTTCTGCCATGAATGGGCACAGTATAATGTTTCCTGTCTATTGGGCCGACAGAGATAAAGTGGCTCGACTACTAATCAGTTATCGTATTAAAGAGGTAGTAATTAGGGCGGACTCGGAGGCCAATTATAGGGCACTAAATAAGATATATGATTTATTAATTGGTTGGAATTTATTCCCCTCTGAACCTAATGCCATAATTGAAGTTATGGAGGACCCGTGTTATTTATCTTATAGGTACGAAAATGATAATAAGGAGTGGCTTTTGCTTCA

Forward
GCTGCTAGTTGGTATTGGCAT
Reverse
TSGAGCAAAAGCCACTCC

code:

zcat BAM_R*_P.fastq.gz | sharkmer -s BAM --max-reads 1000000 --pcr "GCTGCTAGTTGGTATTGGCAT,TSGAGCAAAAGCCACTCC,1200,mtMutS"

For gene mtMutS, no path was found from a forward primer binding site to a reverse binding
site. Abandoning PCR.
Suggested actions:
- The max_length for the PCR product of 1200 my be too short. Consider increasing it.
- The primers may have non-specific binding and are not close enough to generate a
product. Consider increasing the primer TRIM length from the default to
create a more specific primer.

Is there any reason why this isn't working?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.