salmontools's Introduction

SalmonTools

This repository contains (or will contain) a suite of tools for working with Salmon output. It is the place for tools that don't quite belong in the Salmon repository itself but are too small to warrant a separate project. It's nice to have such things collected in one place. Contributions and pull requests are welcome!

Tools

salmontools is the main command-line interface for interacting with the tools. Like samtools, it uses separate commands for separate functionality. The available commands are:

  • extract-unmapped — Takes an unmapped_names.txt file from a run of Salmon, as well as the original FASTA/FASTQ files from which the unmapped names were generated, and extracts the corresponding reads from the FASTA/FASTQ files. The results (the read names and sequences) are written to a user-provided output file.
  • generateDecoyTranscriptome.sh — Located in the scripts/ directory, this is a preprocessing script that creates an augmented hybrid FASTA file for salmon index. It takes a genome FASTA (one file, given through -g), a transcriptome FASTA (-t), and an annotation (a GTF file, given through -a), and creates a new hybrid FASTA file (gentrome.fa) containing the decoy sequences from the genome concatenated with the transcriptome. It runs mashmap (path to the binary given through -m) to align the transcriptome to an exon-masked genome at 80% identity and extracts the mapped genomic intervals. It then uses awk and bedtools (path to the binary given through -b) to merge contiguously mapped intervals and extract the decoy sequences from the genome. It also writes a decoys.txt file containing the names/IDs of the decoy sequences. Both gentrome.fa and decoys.txt can be passed to salmon index with salmon >= 0.14.0.
    NOTE: Salmon v1.0 can index the genome and transcriptome directly and does not require running the generateDecoyTranscriptome script; it remains backward compatible, however. Please check out this tutorial on how to run salmon with the full genome + transcriptome without the annotation.
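
As a rough sketch of how the two steps described above fit together, the function below runs the script and then builds a decoy-aware index from its outputs. The function name build_decoy_index, the salmon_index directory name, and the assumption that gentrome.fa and decoys.txt land directly in the -o directory are illustrative; -d (decoys) and -p (threads) are salmon index options.

```shell
# Sketch only: chain generateDecoyTranscriptome.sh and salmon index.
# Arguments: genome FASTA, transcriptome FASTA, GTF, output directory, threads (default 1).
build_decoy_index() {
  local genome=$1 txome=$2 gtf=$3 outdir=$4 threads=${5:-1}

  # Step 1: write gentrome.fa and decoys.txt (assumed to land in "$outdir").
  bash scripts/generateDecoyTranscriptome.sh \
    -j "$threads" -g "$genome" -t "$txome" -a "$gtf" -o "$outdir" || return 1

  # Step 2: index the hybrid FASTA and tell salmon which sequences are decoys.
  salmon index -t "$outdir/gentrome.fa" -d "$outdir/decoys.txt" \
    -i "$outdir/salmon_index" -p "$threads"
}

# Usage (placeholder file names):
#   build_decoy_index genome.fa transcripts.fa annotation.gtf decoy_out 8
```

If mashmap or bedtools are not on the PATH, the -m and -b options described above would need to be added to the first command.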

Salmon in Alignment mode w/ decoy BAM

By default, when given a decoy-aware index and the --writeMappings flag, Salmon also writes out reads that align to decoys with a better alignment score than to any transcriptomic target. In the atypical situation where such a decoy-tagged BAM has to be requantified with salmon in alignment mode, salmon can fail. The general recommendation for this scenario is to filter all decoy alignments out of the BAM file before requantifying. The following command removes both the decoy targets and the decoy alignments from the decoy-tagged BAM, making it compatible with salmon's alignment mode.

samtools view -h input.bam | grep -v 'XT:A:D\|DS:D' | samtools view -bS - > output.bam
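
To see what the grep step in the command above does without needing samtools, the sketch below applies the same two patterns to a tiny hand-written SAM fragment. The records and the helper names filter_decoys and demo_sam are made up for illustration.

```shell
# Drop decoy-tagged lines from SAM text on stdin.
# Passing -e twice matches the same lines as the pattern 'XT:A:D\|DS:D'.
filter_decoys() {
  grep -v -e 'XT:A:D' -e 'DS:D'
}

# Made-up SAM fragment: one transcript, one decoy target, one read on each.
demo_sam() {
  printf '@SQ\tSN:tx1\tLN:100\n'
  printf '@SQ\tSN:decoy1\tLN:100\tDS:D\n'
  printf 'read1\t0\ttx1\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'read2\t0\tdecoy1\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:D\n'
}

# Only the tx1 header line and the read1 record survive the filter.
demo_sam | filter_decoys
```

Because grep matches anywhere on a line, a read whose name or other fields happen to contain one of these strings would be dropped as well.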

salmontools's People

Contributors

flyingdeveloper, hiraksarkar, k3yavi, rob-p

salmontools's Issues

Time spent to finish it

Hi,

I am running generateDecoyTranscriptome.sh, but this step seems very time-consuming, so I submitted it to Slurm. How long does it take to finish? I requested 1 CPU on 1 node with a 10 h limit, but I am afraid that is not enough.

Thank you for your answer!

Output Files Not Created In Latest Version

The same wrapper code behaves differently with 474ba5947711074a015a268b47058ac8930cf566 and with the current master HEAD.

With 474ba5947711074a015a268b47058ac8930cf566, there are files at the expected location with the command:

salmontools extract-unmapped -u /home/user/data_store/salmontools/double_output/aux_info/unmapped_names.txt -o /home/user/data_store/salmontools/double_salmontools/unmapped_by_salmon -1 /home/user/data_store/salmontools/double_input/reads_1.fastq -2 /home/user/data_store/salmontools/double_input/reads_2.fastq

When using master, the output files don't exist at the location. It simply says, There were 100 unmapped reads, but no new files appear.

Any ideas?

GenerateDecoy with multiple transcriptome fasta files at the -t argument

Hi

I'm stuck trying to generate a decoy file from two transcriptome FASTA files (A. thaliana cDNA and ncRNA from Ensembl Plants). I tried several ways of passing both files to the -t argument (separating them with , or ;, or using -t twice), but nothing worked. I guess there is a very easy solution for this. Or can I map against two indices at the Salmon level? Many thanks in advance.
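
One possible workaround, offered as a sketch rather than a confirmed answer from the maintainers: since the script appears to take a single transcriptome FASTA through -t, the two files can be concatenated into one before running it. The helper name merge_fasta and the file names in the usage comment are hypothetical, and the duplicate check assumes that the first whitespace-delimited word of each header is the sequence name.

```shell
# Concatenate several FASTA files into one and refuse duplicate sequence names.
# Usage: merge_fasta OUT.fa IN1.fa IN2.fa ...
merge_fasta() {
  local out=$1
  shift

  # awk 1 copies every line and adds a final newline if a file lacks one,
  # so the first header of the next file never fuses with a sequence line.
  awk 1 "$@" > "$out" || return 1

  # Sequence name = header text up to the first whitespace, without '>'.
  local dups
  dups=$(grep '^>' "$out" | sed 's/^>//; s/[[:space:]].*//' | sort | uniq -d)
  if [ -n "$dups" ]; then
    echo "duplicate sequence names:" >&2
    echo "$dups" >&2
    return 1
  fi
}

# Usage (placeholder file names):
#   merge_fasta combined.fa cdna.all.fa ncrna.fa
#   bash scripts/generateDecoyTranscriptome.sh ... -t combined.fa ...
```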

Segmentation fault on MashMap step of generateDecoyTranscriptome.sh

Hi all,

I get Segmentation fault (core dumped) on step 3 of generateDecoyTranscriptome.sh.

I've filed marbl/MashMap#21 upstream with more detailed information. I wanted to file an issue here in case you have any insight or I am using the script improperly.

Here's how I'm using this:

bash scripts/generateDecoyTranscriptome.sh \
	-j 8 \
	-g Homo_sapiens.GRCh38.dna.toplevel.fa \
	-t Homo_sapiens.GRCh38.cdna.all.fa \
	-a Homo_sapiens.GRCh38.96.gtf \
	-o "${human_output}"

I realize you have gentrome.fa and decoys.txt for human here: https://github.com/COMBINE-lab/salmon#pre-computed-decoy-transcriptomes

I'm interested in generating this for zebrafish and happened to run into this problem with human first/before I found that on the Salmon README.

Thank you!

generateDecoyTranscriptome.sh gets ABORTED

Hi,
I got stuck while trying to generate a hybrid fasta decoy file. The script is aborted after the file 'reference.masked.genome.fa' is generated. Since the script is not 'killed' or 'segmentation faulted' (i.e. two issues reported here before) I decided to open a new issue.
Any suggestions would be appreciated.
Thanks!

Source files:
ftp://ftp.ensemblgenomes.org/pub/metazoa/release-44/fasta/lottia_gigantea

My code:

[guidoh@localhost Work]$ bash generateDecoyTranscriptome.sh \
>  -j 1 \
>  -g ./ENSEMBL/Lottia_gigantea.Lotgi1.dna.toplevel.fa.gz \
>  -t ./ENSEMBL/Lottia_gigantea.Lotgi1.cdna.all.fa.gz \
>  -a ./ENSEMBL/Lottia_gigantea.Lotgi1.44.gff3.gz \
>  -o gentrome.files
****************
*** getDecoy ***
****************
-j <Concurrency level> = 1
-g <Genome fasta> = /Work/ENSEMBL/Lottia_gigantea.Lotgi1.dna.toplevel.fa.gz
-t <Transcriptome fasta> = /Work/ENSEMBL/Lottia_gigantea.Lotgi1.cdna.all.fa.gz
-a <Annotation GTF file> = /Work/ENSEMBL/Lottia_gigantea.Lotgi1.44.gff3.gz
-o <Output files Path> = gentrome.files
[1/10] Extracting exonic features from the gtf
[2/10] Masking the genome fasta
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
generateDecoyTranscriptome.sh: line 101: 23362 Aborted                 (core dumped) $bedtools maskfasta -fi $genomefile -bed exons.bed -fo reference.masked.genome.fa

***************
*** ABORTED ***
***************

An error occurred. Exiting...
[guidoh@localhost Work]$ 

Error: The index version file /salmon_genome/BD_index/versionInfo.json doesn't seem to exist.

Hi!

An error appears when I run salmon quantification on my data. The command and error are as follows:
Command
"salmon quant -i media/waqas/Chaudhary/BD_analysis/salmon_genome/BD_index -l A -1 /media/waqas/MG-ZDZ2D2R9/RNA_Seq_SC/Trimmed_BD/3/3_38_12_1_paired_R1.fastq.gz -2 /media/waqas/MG-ZDZ2D2R9/RNA_Seq_SC/Trimmed_BD/3/3_38_12_1_paired_R2.fastq.gz -o /media/waqas/Chaudhary/BD_analysis/Salmon_Out/3/3_38_12_1/"

Error:
"Version Info: This is the most recent version of salmon.

salmon (mapping-based) v1.2.1

[ program ] => salmon

[ command ] => quant

[ index ] => { media/waqas/Chaudhary/BD_analysis/salmon_genome/BD_index }

[ libType ] => { A }

[ mates1 ] => { /media/waqas/MG-ZDZ2D2R9/RNA_Seq_SC/Trimmed_BD/3/3_38_12_1_paired_R1.fastq.gz }

[ mates2 ] => { /media/waqas/MG-ZDZ2D2R9/RNA_Seq_SC/Trimmed_BD/3/3_38_12_1_paired_R2.fastq.gz }

[ output ] => { /media/waqas/Chaudhary/BD_analysis/Salmon_Out/3/3_38_12_1/ }

Logs will be written to /media/waqas/Chaudhary/BD_analysis/Salmon_Out/3/3_38_12_1/logs
[2020-05-30 17:22:09.253] [jointLog] [info] setting maxHashResizeThreads to 12
Exception : [Error: The index version file media/waqas/Chaudhary/BD_analysis/salmon_genome/BD_index/versionInfo.json doesn't seem to exist. Please try re-building the salmon index.]
salmon quant was invoked improperly.
For usage information, try salmon quant --help
Exiting.
[2020-05-30 17:22:09.253] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.
[2020-05-30 17:22:09.253] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2020-05-30 17:22:09.253] [jointLog] [info] Usage of --validateMappings implies a default consensus slack of 0.2. Setting consensusSlack to 0.35.
[2020-05-30 17:22:09.253] [jointLog] [info] parsing read library format
[2020-05-30 17:22:09.253] [jointLog] [info] There is 1 library."

Any help is much appreciated.

Thank you!
Kind regards

CS
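
A quick check that may help narrow this down (a sketch, not a confirmed diagnosis): the index path in the command above has no leading slash, unlike the read and output paths, so it is resolved relative to the current directory. The hypothetical helper check_salmon_index below only tests whether a directory exists and contains the versionInfo.json file named in the error message.

```shell
# Return 0 if DIR exists and contains versionInfo.json, else explain why not.
check_salmon_index() {
  local dir=$1
  if [ ! -d "$dir" ]; then
    echo "not a directory (resolved from $(pwd)): $dir" >&2
    return 1
  fi
  if [ ! -f "$dir/versionInfo.json" ]; then
    echo "missing file: $dir/versionInfo.json" >&2
    return 1
  fi
  echo "ok: $dir"
}

# Usage: compare the relative and the absolute spelling of the index path.
#   check_salmon_index media/waqas/Chaudhary/BD_analysis/salmon_genome/BD_index
#   check_salmon_index /media/waqas/Chaudhary/BD_analysis/salmon_genome/BD_index
```

If both spellings fail, the index build itself may not have completed, which is what the error message suggests re-running.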

generateDecoyTranscriptome.sh gets 21 killed

I've made a docker container for SalmonTools https://quay.io/repository/comp-bio-aging/salmon-tools
However, I constantly get:

/opt/SalmonTools/scripts/generateDecoyTranscriptome.sh: line 105: 21 Killed $mashmap -r reference.masked.genome.fa -q $txpfile -t $threads --pi 80 -s 500

I run it on a 32-core machine with 64 GB RAM, using the Ensembl human genome.
I think something may be wrong in the bash script itself.

/opt/SalmonTools/scripts/generateDecoyTranscriptome.sh: line 105:    21 Killed                  $mashmap -r reference.masked.genome.fa -q $txpfile -t $threads --pi 80 -s 500

***************
*** ABORTED ***
***************

An error occurred. Exiting...

the command is:

/opt/SalmonTools/scripts/generateDecoyTranscriptome.sh -a /cromwell-executions/decoy/9f2ca769-5a26-4149-a40c-ecc606e9b76c/call-generate/inputs/-848260311/Homo_sapiens.GRCh38.96.gtf -g /cromwell-executions/decoy/9f2ca769-5a26-4149-a40c-ecc606e9b76c/call-generate/inputs/-848260311/Homo_sapiens.GRCh38.dna.primary_assembly.fa -t /cromwell-executions/decoy/9f2ca769-5a26-4149-a40c-ecc606e9b76c/call-generate/inputs/-848260311/Homo_sapiens.GRCh38.cdna.all.fa -j 16 -o output

the stdout file is:

*** getDecoy ***
****************
-a <Annotation GTF file> = /cromwell-executions/decoy/9f2ca769-5a26-4149-a40c-ecc606e9b76c/call-generate/inputs/-848260311/Homo_sapiens.GRCh38.96.gtf
-g <Genome fasta> = /cromwell-executions/decoy/9f2ca769-5a26-4149-a40c-ecc606e9b76c/call-generate/inputs/-848260311/Homo_sapiens.GRCh38.dna.primary_assembly.fa
-t <Transcriptome fasta> = /cromwell-executions/decoy/9f2ca769-5a26-4149-a40c-ecc606e9b76c/call-generate/inputs/-848260311/Homo_sapiens.GRCh38.cdna.all.fa
-j <Concurrency level> = 16
-o <Output files Path> = output
[1/10] Extracting exonic features from the gtf
[2/10] Masking the genome fasta
[3/10] Aligning transcriptome to genome
>>>>>>>>>>>>>>>>>>
Reference = [reference.masked.genome.fa]
Query = [/cromwell-executions/decoy/9f2ca769-5a26-4149-a40c-ecc606e9b76c/call-generate/inputs/-848260311/Homo_sapiens.GRCh38.cdna.all.fa]
Kmer size = 16
Window size = 5
Segment length = 500 (read split allowed)
Alphabet = DNA
Percentage identity threshold = 80%
Mapping output file = mashmap.out
Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
Execution threads  = 16
>>>>>>>>>>>>>>>>>>
INFO, skch::Sketch::build, minimizers picked from reference = 938129647

The command realpath, used by the script, is not available on mac

Hello everyone,

I ran into this issue when trying out the generateDecoyTranscriptome bash script:

****************
*** getDecoy ***
****************
./generateDecoyTranscriptome.sh: line 52: realpath: command not found

***************
*** ABORTED ***
***************

An error occurred. Exiting...

I solved it using this issue: whatwg/html-build#90

Just a note for other users who might run into the same problem!
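
For readers who would rather not install extra packages, a fallback can be defined when realpath is missing. This is a sketch, not part of the script itself: portable_realpath is a hypothetical helper that assumes the parent directory exists and does not resolve a symlink in the final path component.

```shell
# Print an absolute path for "$1" using only cd, pwd -P, dirname and basename.
portable_realpath() {
  local dir base
  # Resolve the directory part physically; CDPATH= keeps cd from printing anything.
  dir=$(CDPATH= cd -- "$(dirname -- "$1")" 2>/dev/null && pwd -P) || return 1
  base=$(basename -- "$1")
  # Avoid a double slash for files directly under the root directory.
  if [ "$dir" = "/" ]; then
    dir=""
  fi
  printf '%s/%s\n' "$dir" "$base"
}

# Define realpath only when the system does not already provide one.
if ! command -v realpath >/dev/null 2>&1; then
  realpath() { portable_realpath "$@"; }
fi
```

A function defined in an interactive shell is not visible to a script started with bash, so this would have to be pasted near the top of generateDecoyTranscriptome.sh or exported with export -f before the script is launched.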

Best,

Tag a release?

I'd like to include this in one of our pipelines, can you tag a release for SalmonTools? I'll add it to Bioconda then :)

command line

Hi,
I have a GFP and transposon sequence which I want to incorporate for RNASeq analysis. I tried doing this with STAR, but it gives me only the TE sequence alignment and reads in the sb_alignmentsReadsPerGene.out.tab

./generateDecoyTranscriptome.sh [-j <N> =1 default] [-b <bedtools binary path> =bedtools default] [-m <mashmap binary path> =mashmap default] -a <gtf file> -g <genome fasta> -t <txome fasta> -o <output path>

Could you kindly explain the command line? Do the exogenous sequences have to be incorporated via the cat command, with entries added to the GTF, genome.fasta, and txome.fasta?

I am new to this kind of analysis. Another thing I found very difficult: when I need to match the coordinates of the BED regions from my DNA-seq data (with the transposon) to the RNA-seq BAM files, I cannot do it with this method of inserting exogenous sequences, since it shows the headers we assigned to the GFP and TE sequences in the input files.

When I ran the analysis:

./generateDecoyTranscriptome.sh -b /usr/local/bin/bedtools -m /home/amit/miniconda3/bin/mashmap -a ../hg19.ncbiRefseq.added.gtf -g ../hg19_added_genes.fa -t ../Homo_sapiens.GRCh38.cdna.all.fa -o /home/amit/Downloads/SalmonTools-master/sb_results
****************
*** getDecoy ***
****************
-b <bedtools binary> = /usr/local/bin/bedtools
-m <mashmap binary> = /home/amit/miniconda3/bin/mashmap
-a <Annotation GTF file> = /home/amit/Downloads/SalmonTools-master/hg19.ncbiRefseq.added.gtf
-g <Genome fasta> = /home/amit/Downloads/SalmonTools-master/hg19_added_genes.fa
-t <Transcriptome fasta> = /home/amit/Downloads/SalmonTools-master/Homo_sapiens.GRCh38.cdna.all.fa
-o <Output files Path> = /home/amit/Downloads/SalmonTools-master/sb_results
[1/10] Extracting exonic features from the gtf
[2/10] Masking the genome fasta
[3/10] Aligning transcriptome to genome
>>>>>>>>>>>>>>>>>>
Reference = [reference.masked.genome.fa]
Query = [/home/amit/Downloads/SalmonTools-master/Homo_sapiens.GRCh38.cdna.all.fa]
Kmer size = 16
Window size = 5
Segment length = 500 (read split allowed)
Alphabet = DNA
Percentage identity threshold = 80%
Mapping output file = mashmap.out
Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
Execution threads  = 1
>>>>>>>>>>>>>>>>>>
INFO, skch::Sketch::build, minimizers picked from reference = 937226701
./generateDecoyTranscriptome.sh: line 105: 14872 Segmentation fault      (core dumped) $mashmap -r reference.masked.genome.fa -q $txpfile -t $threads --pi 80 -s 500

***************
*** ABORTED ***
***************

An error occurred. Exiting...

Kindly guide.
