
findfungi's Introduction

  • Hi, I'm @GiantSpaceRobot
  • I'm interested in bioinformatics in general, primarily omics technologies of all sorts, though transcriptomics is where I spend most of my time.


findfungi's Issues

Issue with running FindFungi.sh

I am having trouble running the FindFungi.sh script. I'm not sure where in the script I am supposed to put the paths to the kraken and skewer executables, and I think that may be the issue. I have added the paths to the directories containing the databases, scripts, etc.

However, when I run bash FindFungi.sh reads.fastq EXAMPLE, I just get this response:
Started at Thursday 5 March 09:57:25 AEDT 2020
Finished at Thursday 5 March 09:57:25 AEDT 2020

Not sure what I need to change. Do I need to add anything to lines 17 and 18 in the script, i.e. the x and z variables?
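One possibility, if the script calls kraken and skewer by bare name (an assumption; the script itself is not shown here): put both executables on PATH before running. A minimal sketch with hypothetical install locations:

# Hypothetical install locations; adjust to wherever kraken and skewer live.
export PATH="$HOME/tools/kraken:$HOME/tools/skewer:$PATH"
command -v kraken skewer    # both should resolve before launching the pipeline
bash FindFungi.sh reads.fastq EXAMPLE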

Missing bsub_reports to gather FASTA reads

Hi,
I am trying to set up the pipeline using the test data provided, but am having issues with this:

### Gather FASTA reads for each taxid predicted
for d in $Dir/Processing/ReadNames.*; do
	File=$(basename $d)
	Taxid=$(echo $File | awk -F '.' '{print $2}')
	$PreDir/FindFungi/bsub_reports/ReadNames-to-FASTA.$Taxid.stderr -o $Dir/Processing/ReadNames_bsub.$Taxid.fsa awk -v reads="$Dir/Processing/ReadNames.$Taxid.txt" -F "\t" 'BEGIN{while((getline k < reads)>0)i[k]=1}{gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}' $Dir/Processing/Reads-From-Kraken-Output.$z.Reformatted.fsa &
done
wait

I had to remove the bsub commands, so my $PreDir/FindFungi/bsub_reports folder is empty and I just get error messages such as:

h: line 133: FindFungi/bsub_reports/ReadNames-to-FASTA.990650.stderr: No such file or directory

This, I think, means I don't get any BLAST results. Where should the 'ReadNames-to-FASTA.$Taxid.stderr' files come from in this case?
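For context, those .stderr files are not inputs: under LSF, -e names the file bsub writes a job's stderr to, and -o its stdout. A hedged rewrite of the loop with the bsub options translated into plain shell redirections (assuming the original line was bsub -e <stderr-file> -o <stdout-file> awk ...):

mkdir -p $PreDir/FindFungi/bsub_reports    # redirection targets must exist
for d in $Dir/Processing/ReadNames.*; do
	File=$(basename $d)
	Taxid=$(echo $File | awk -F '.' '{print $2}')
	# awk writes matching reads to stdout; redirect stdout and stderr to the
	# files bsub would have used.
	awk -v reads="$Dir/Processing/ReadNames.$Taxid.txt" -F "\t" 'BEGIN{while((getline k < reads)>0)i[k]=1}{gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}' $Dir/Processing/Reads-From-Kraken-Output.$z.Reformatted.fsa > $Dir/Processing/ReadNames_bsub.$Taxid.fsa 2> $PreDir/FindFungi/bsub_reports/ReadNames-to-FASTA.$Taxid.stderr &
done
wait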

Run only BLAST and Skewness correction on kraken2 results?

Hello,
I'm interested in this pipeline, but it seems to run on an older version of Kraken, and I'm not sure when the databases were last updated. I'd like to run my metagenomics data through kraken2 against the RefSeq fungi database and/or the nt/nr database, and then feed the kraken2 results into the BLAST step to calculate skewness and remove putative false positives. Does FindFungi have an option to do this? Reading through FindFungi-0.23.3.sh, it seems the Kraken output is reformatted because FindFungi runs it against 32 separate databases.

Thanks! -Emily

"Completed" but no output files

Dear Sir/Madam

I am trying to use FindFungi on my dataset. I have downloaded all the Kraken and BLAST databases, and have specified my directory paths in the FindFungi-0.23.3.sh and LowestCommonAncestor_V4.sh script files. However, when I run the command:

./FindFungi-0.23.3.sh joined_input_01.fastq 01_file

It only generates the following:

Started at Thu Jan 3 13:35:29 HKT 2019
Finished at Thu Jan 3 13:35:29 HKT 2019

Given that the task finished at the same time it started, it seems it did not actually run. There are no output files or log files for me to see what the issue is or what I am missing.

How can I go about finding out what exactly the error is when I cannot locate a log file? Thank you!

Regards

Marc
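A generic way to see where a script exits silently when it leaves no logs (standard bash tracing, not FindFungi-specific):

# Trace every command the script runs; the end of the trace shows where it bailed out.
bash -x ./FindFungi-0.23.3.sh joined_input_01.fastq 01_file 2> findfungi-trace.log
tail -n 40 findfungi-trace.log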

Multiple errors/core dumps with FindFungi

While running the SLURM version of FindFungi, I continually see the errors below, including core dumps, in my results. There are three things I'm worried about here:

  1. The segmentation faults in megablast
  2. The "Not BLASTing" lines
  3. The "Cannot calculate Pearson Coefficient" messages

Any suggestions?

Not BLASTing 98765, too few reads
Not BLASTing 994086, too few reads
Ending : Blasting Kraken : Fri Mar 1 01:23:59 EST 2019
Starting : Gather fasta : Fri Mar 1 01:23:59 EST 2019
Ending : Gather fasta : Fri Mar 1 01:29:47 EST 2019
Starting : BLAST against the genome : Fri Mar 1 01:29:47 EST 2019
/opt/FindFungi/0.23/FindFungi-v0.23.3/FindFungi-0.23.3.sh: line 164: 4597 Segmentation fault (core dumped) blastn -task megablast -query $Dir/Processing/ReadNames.$Taxid.fsa -db $BLAST_DB_Dir/Taxid-$Taxid -out $Dir/Results/BLAST_Processing/BLAST.$Taxid -evalue 1E-20 -num_threads 20 -outfmt 6
/opt/FindFungi/0.23/FindFungi-v0.23.3/FindFungi-0.23.3.sh: line 171: 6872 Segmentation fault (core dumped) blastn -task megablast -query $Dir/Processing/ReadNames.$Taxid.fsa -db $BLAST_DB_Dir/Taxid-$Taxid -out $Dir/Results/BLAST_Processing/BLAST.$Taxid -evalue 1E-20 -num_threads 20 -outfmt 6
Ending : BLAST against the genome : Fri Mar 1 01:34:14 EST 2019
Starting : Calculate Pearson coefficient : Fri Mar 1 01:34:14 EST 2019
The file OUTPUT/FindFungi/Results/BLAST_Processing/Skewness.BLAST.1126212 is empty. Cannot calculate Pearson Coefficient of Skewness
The file OUTPUT/FindFungi/Results/BLAST_Processing/Skewness.BLAST.1081104 is empty. Cannot calculate Pearson Coefficient of Skewness
The file OUTPUT/FindFungi/Results/BLAST_Processing/Skewness.BLAST.105984 is empty. Cannot calculate Pearson Coefficient of Skewness
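A hedged triage sketch, assuming the standard NCBI BLAST+ tools are on PATH: the empty Skewness files are most likely a downstream symptom of the segfaulting blastn jobs, so check one database and rerun one query by hand with a low thread count (taxid 1126212 is taken from the log above):

# Verify the BLAST database for one failing taxid is intact and readable.
blastdbcmd -db $BLAST_DB_Dir/Taxid-1126212 -info
# Rerun a single failing search interactively with fewer threads.
blastn -task megablast -query $Dir/Processing/ReadNames.1126212.fsa -db $BLAST_DB_Dir/Taxid-1126212 -evalue 1E-20 -num_threads 2 -outfmt 6 | head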

Split the search and merge the results

Hi,
Following up on my earlier question (Issue 7) about the size of the uncompressed database:
I was wondering whether it is possible to run the script with 5-10 database files at a time and then merge the results. My main issue is that I cannot easily find 1.25 TB of free space to uncompress the whole database at once.
Thanks!
Greg

threshold

Hi,
I would like to know what threshold this tool uses to identify fungal species.

sqlite3.OperationalError: disk I/O error

Hi Paul,
I am running FindFungi on the demo ERR675624 fastq on a cluster (CentOS Linux release 7.4.1708). I removed the bsub commands and got the following results.
There is an error I can't fix; could you take a look? Thank you.

  • output files in ./ERR675624/FindFungi/Results

128K Oct 21 09:34 BLAST_Processing
24 bytes Oct 21 11:32 ERR675624.WordCloud.R
0 bytes Oct 21 11:32 Final_Results_ERR675624-lca.sorted.csv
21 bytes Oct 21 11:29 Final_Results_ERR675624-taxids.txt
91K Oct 21 05:48 Final_Results_ERR675624.tsv
2.9K Oct 21 11:30 Final_Results_ERR675624.tsv_AllResults-taxids.txt
11M Oct 21 05:48 Final_Results_ERR675624.tsv_AllResults.tsv
0 bytes Oct 21 11:32 Final_Results_ERR675624_AllResults-lca.sorted.csv

The CSV files are always empty, with the errors shown below.

  • source script code
$ScriptPath/LowestCommonAncestor_V4.sh $Dir/Results/Final_Results_$z.tsv
$ScriptPath/LowestCommonAncestor_V4.sh $Dir/Results/Final_Results_$z.tsv_AllResults.tsv
  • error information (using ete3-3.1.1)
Uploading to /users/username/.etetoolkit/taxa.sqlite
Traceback (most recent call last):
  File "/users/username/tools/FindFungi-v0.23.3/LowestCommonAncestor_V4.py", line 26, in <module>
    ncbi = NCBITaxa()
  File "/users/username/username/tools/conda3/lib/python2.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 120, in __init__
    self.update_taxonomy_database(taxdump_file)
  File "/users/username/username/tools/conda3/lib/python2.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 129, in update_taxonomy_database
    update_db(self.dbfile)
  File "/users/username/username/tools/conda3/lib/python2.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 760, in update_db
    upload_data(dbfile)
  File "/users/username/username/tools/conda3/lib/python2.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 791, in upload_data
    db.execute(cmd)
sqlite3.OperationalError: disk I/O error
Done
  • error information (using an older version of ete3)
/users/username/tools/FindFungi-v0.23.3//LowestCommonAncestor_V4.sh: line 14: 28870 Segmentation fault      python2.7 ~/tools/FindFungi-v0.23.3/LowestCommonAncestor_V4.py ${1} ${y}-taxids.txt ${y}-lca.csv
Done
/users/username/tools/FindFungi-v0.23.3//LowestCommonAncestor_V4.sh: line 14: 28875 Segmentation fault      python2.7 ~/tools/FindFungi-v0.23.3/LowestCommonAncestor_V4.py ${1} ${y}-taxids.txt ${y}-lca.csv
Done

Best
Keli
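One workaround worth trying (an assumption, based on sqlite3 "disk I/O error" commonly appearing when ~/.etetoolkit lives on a network filesystem such as NFS): rebuild the ete3 taxonomy database on node-local disk and symlink it into place.

# Keep taxa.sqlite on local disk (hypothetical path) instead of a network home.
mkdir -p /tmp/$USER/etetoolkit
rm -rf ~/.etetoolkit                 # drop any partially written database
ln -s /tmp/$USER/etetoolkit ~/.etetoolkit
# Rerun the LCA step; NCBITaxa() repopulates taxa.sqlite on first use.
$ScriptPath/LowestCommonAncestor_V4.sh $Dir/Results/Final_Results_$z.tsv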

Question regarding gathering FASTA reads from taxid step

Hi Paul

So I have managed to get this running up to the step "### Gather FASTA reads for each taxid predicted". It appears that a bsub command is necessary to gather the FASTA reads. I have removed bsub (I am running on a single cluster), but the script still needs

-e [dir]/bsub_reports/ReadNames-to-FASTA.$Taxid.stderr

I am not entirely sure when the files bsub_reports/ReadNames-to-FASTA.$Taxid.stderr are generated if I remove all the bsubs.

Thanks

Marc
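(This is the same situation as the "Missing bsub_reports" issue above: -e names the file bsub would write a job's stderr to, not an input.) The general translation, shown with a trivial stand-in command:

# Pattern: "bsub -e ERR -o OUT command" becomes "command > OUT 2> ERR".
# The stderr directory must exist before the redirection.
mkdir -p bsub_reports
printf 'demo read\n' > ReadNames_bsub.demo.fsa 2> bsub_reports/ReadNames-to-FASTA.demo.stderr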

Issues with running FindFungi-0.23.3.sh

Dear Sir/Madam
I am trying to use FindFungi on the example ERR675624. I have downloaded all the Kraken and BLAST databases, and have specified my directory paths in the FindFungi-0.23.3.sh and LowestCommonAncestor_V4.sh script files. However, when the run reaches:

python2.7 $ScriptPath/Consensus-CrossRef-Skewness_V2.py $Dir/Processing/Consensus.sorted.$z.All-Kraken-Results.tsv $Dir/Results/BLAST_Processing/All-Skewness-Scores $Dir/Results/Final_Results_$z.tsv

Final_Results_$z.tsv is 0 bytes in size.
How can I solve this problem? I would appreciate your help. Thank you!
Regards
Qian
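A hedged first check (standard debugging, not from the pipeline docs): an empty Final_Results file often means one of the two inputs to the consensus script was itself empty, so inspect them before suspecting the python step.

# Both inputs to Consensus-CrossRef-Skewness_V2.py should be non-empty.
wc -l $Dir/Processing/Consensus.sorted.$z.All-Kraken-Results.tsv $Dir/Results/BLAST_Processing/All-Skewness-Scores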

FASTA fails to be processed

I found a FASTA file that doesn't seem to play well with FindFungi. The directory structure is created, but Kraken fails with this error:

Loading database... complete.
classify: malformed fasta file - expected header char > not found
0 sequences (0.00 Mbp) processed in 0.008s (0.0 Kseq/m, 0.00 Mbp/m).
0 sequences classified (-nan%)
0 sequences unclassified (-nan%)

The input file does have the > character; here are some sample sequences:

> NODE_24005_length_1000_cov_2.279365
TCGTCGGCTGACACAAGGCGGGCGCTGCCAACTTTCTCGGTTCTCAGAATGCCGTCGCGA
ATCATGGCGTGTAGGCGCGAGGTGCTGACATCGAGGATATCCGCCGCTGCTTGGACGGTC
ATGGTGGTGAGGGAAGGCGTGTCGGCATCGCAGTCCACGGCAACGGCGATGGCCTTGCCG
TTTGCGGGGGCGGGATGGTCCATCCGCGGCTCCGGCAAGGGCCGTCCTTCGCGAAGCGCT
TGGGAGATCCAGAGGGTCAGCAGATCTTGCGCCATGAAAGCCGCATCGAACAGAGTGTCC
CCCTGGGTGCAGATACCCAGGTCGGGAAACTCAGCTTCCCACCCGCCGCTCCACGGGGTG
AGGATGGCTTCGTAGAGAAACTTCATAGGTCGCTTCCTCTCTATTGGGTGAAGTATTAGA
GAAGCCCTAGCTTTTCTTTGATGGCTCGCTCGACCCCTGGCGAAAGATCTCCGGGATGCC
GGGGGATAGGGAACTCTTCGCCAGCCGCGTTCGCGAAAATGTCGTGTCTCGCTCCGTGCC
TGACGAAGCGTCCGCCCATTTTCTTGGTAAGTCGTATCGCTTCTCTAGCTGTCATGGGCA
TGTCTCTCGACCTCCTGCATGGCACAATTTATTGTAACAATTAGTAACAATAAATGCAAG
CACTGCCGTCTTCTCAAAGCGAATTATAGCAAACAAATGTTCTTAGAACAATAGTTCATT
TCAGGATAATCGAGGGGGTCTCGGCAGAAGGCTCTCGGATTTTCCGATGGCGTTTGCTGG
GGACTTCCTATAACGGGGGTGCTCCGGGTGCTTCTATAATAGGCGAGCGAAAAAATGCGA
GACCAAAAGGAGCCCTGAATGAGCGAGCGGATTGGAACGACCTGCGTGCAAGGCGGGTGG
CGGCCCGGCGACGGCGAGCCGCGCCAAGTGCCCATCTACCAGAACACCACCTGGAAGTAC
GACACGAGCGAGCATATGGGGCGCCTGTTCGATCTGGAGG

> NODE_24006_length_1000_cov_2.211640
GTGCGAGTGTATTTTTACATCAGCTTACCGGTGAAGAGAAATATGTAAAAAATGCGGTTC
TGGCAGCGGATTATACAATGGAATACCTGTATAAAAACGGCATCATGAACAACGAAGCGG
ATGGAGACGACATGCCAGGATTTAAGGGGATTCTGGCAAGATGGCTCAGCAAGCTCGTTT
ATGAAGAGAACCAGACCAAATATTTTGCATGGATGGAGAAAAATGCGGACAGCGCATGGC
TGCACCGTAACACGCAGAACCTGATGTGGACGGCATGGGAGTTCCCGACCAATGAGTTCC
CGCGCTGCGCATGGGGCTGCAGCGCGGCGGTAGCACAGCAGTTTGCGTGTCTGCCGTACA
AAAAATAAACAATAGAATGCGCGTTCTGCGAACATTTTGTTGTCACGAACAAAAAGGGGA
AATCATATGAAACGGGTACACTTAATTTGCAACGCACACCTCGACCCTGTATGGCTCTGG
CGCTGGCAGGAAGGCTGCACGGAAGCGCTTTCCACATTCCGCACGGCAGAAGCCTTTACG
GATGAATTTCCGGGCTTTGTGTTCAACCACAACGAAGCGATCCTCTATGAATGGGTCAAG
GAAAATGAGCCGGAGCTGTTTGCTCGCATTCAGCAGAAGGTCAAAGAGGGCAAATGGCAT
ATCATGGGCGGCTGGTATTTGCAGCCCGACTGCAACATGCCAAACGGCGAATCCATTATC
CGAAACATTTCGGAAGGACACCGGTTCTTCGAAGAAGAATTCGGCGTGCGCCCGACAACG
GCGATTAACTTTGACTCCTTTGGTCATTCTGTAGGTCTGGTTCAGATCTTAAATCAGGCT
GGCTATGATACTTATGTGGTATGCCGTCCGGCAAAGGCGCAGTTCCCGTTTGAGGAACAG
GATTATCTGTGGAAAGGTCTTGCCGGTTCGGAGGTTCTGGTGCATCGTTCCGATGAAAAC
TATAACTCCGTTTACGGGCATGTCGGAAAAGAACTGGAAC

Any ideas why this would fail?
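Two hedged guesses, since the headers themselves look fine: some Kraken builds reject a space between > and the sequence name, and invisible bytes (a UTF-8 BOM, CRLF line endings, a leading blank line) can also trigger the "expected header char >" error. A quick way to inspect and scrub (input.fasta is a stand-in filename):

# Inspect the first bytes for a BOM, \r characters, or leading blank lines.
head -c 64 input.fasta | od -c | head
# Scrub the usual suspects; each edit is harmless if the problem is absent.
sed -i '1s/^\xEF\xBB\xBF//' input.fasta   # strip a UTF-8 byte-order mark
sed -i 's/\r$//' input.fasta              # strip Windows line endings
sed -i 's/^> />/' input.fasta             # drop the space after '>'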

Sort_parallel not found

I am giving the pipeline a try with the demo ERR675624 data. I removed the bsub commands and now run into an error saying the sort_parallel command is not found. This command is not available by default on Ubuntu 16.04 LTS and is not listed in the requirements. How can I install it?
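sort_parallel looks like a site-specific wrapper rather than a packaged tool (an assumption); GNU coreutils sort can stand in for it. A minimal shim:

# Create a stand-in sort_parallel that runs GNU sort across all cores.
mkdir -p ~/bin
printf '#!/bin/bash\nexec sort --parallel="$(nproc)" "$@"\n' > ~/bin/sort_parallel
chmod +x ~/bin/sort_parallel
export PATH="$HOME/bin:$PATH"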

assembly_data

Hi!

I'm trying to adapt the pipeline to run with assembly data. I changed the cut-off of 10 reads predicted per species, but I'm not sure whether blastn is running. I got the message shown in the attached screenshot.
Do you know whether blastn is running and simply finding no matches, or whether it is not running at all?

Thank you for your attention!
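A hedged way to tell the two cases apart with standard shell checks: if blastn ran but matched nothing, its per-taxid output files exist but are empty; if it never ran, they are absent.

ls -l $Dir/Results/BLAST_Processing/ | head                     # do BLAST.* files exist?
find $Dir/Results/BLAST_Processing -name 'BLAST.*' -size +0c    # which ones have hits?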

Use of <<< redirect fails on Ubuntu

Started at Thu 31 Oct 2019 10:58:15 AM EDT
./FindFungi-v0.23.3/FindFungi-0.23.3.sh: 193: ./FindFungi-v0.23.3/FindFungi-0.23.3.sh: Syntax error: redirection unexpected

According to this, the default shell on Ubuntu is dash, not bash, and <<< is not valid in dash. Changing the shebang to #!/bin/bash worked.
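For completeness, the two equivalent fixes (the second avoids editing the script; the arguments are the usage example from the issues above):

# Fix 1: make the script's first line request bash explicitly:  #!/bin/bash
# Fix 2: bypass the shebang by invoking bash directly.
bash ./FindFungi-v0.23.3/FindFungi-0.23.3.sh reads.fastq EXAMPLE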

What reads are being blasted?

Hello! I am reworking the main pipeline script, and I am now at the BLAST stage. In your script there is this line: tail -n +31 $d | head -n -6 > $Dir/Processing/ReadNames.$Taxid.fsa. Am I right that you are not blasting all of the predicted reads?
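For anyone reading along, here is what that pipeline does, illustrated with line numbers. My reading (an inference, not confirmed by the author): it strips the roughly 30-line header and 6-line footer that LSF's bsub -o wraps around job output, so under LSF the reads themselves survive; without bsub, the same slice would discard real data.

# tail -n +31 keeps everything from line 31 onward; head -n -6 drops the last 6 lines.
seq 1 40 | tail -n +31 | head -n -6    # prints 31 32 33 34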

Size kraken db uncompress

Hi,
I saw that there are 32 files of 16 GB each for the Kraken database, but these files are compressed. Before downloading them, do you have an idea of the total size once everything is uncompressed?
Thanks!
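If one archive is already downloaded, its true uncompressed size can be measured directly; note that gzip -l misreports sizes above 4 GB, so streaming and counting bytes is safer (the filename is a stand-in):

# True uncompressed size of one part, in bytes.
zcat kraken_db_part_01.tar.gz | wc -c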
