pirovc / ganon

ganon2 classifies genomic sequences against large sets of references efficiently, with integrated download and update of databases (refseq/genbank), taxonomic profiling (ncbi/gtdb), binning and hierarchical classification, customized reporting and more

Home Page: https://pirovc.github.io/ganon/

License: MIT License

CMake 1.51% Python 58.62% C++ 36.10% Shell 3.77%
metagenomics bioinformatics k-mer bloom-filter microbiome minimizers taxonomy genbank gtdb ncbi

ganon's Introduction

ganon2 pre-print

ganon2 classifies DNA sequences against large sets of genomic reference sequences efficiently. It features:

  • integrated download and build of any subset from RefSeq/Genbank/GTDB with incremental updates
  • NCBI and GTDB native support for taxonomic classification, custom taxonomy or no taxonomy at all
  • customizable database build for local or non-standard sequence files
  • optimized taxonomic binning and classification configurations
  • build and classify at various taxonomic levels, strain, assembly, file, sequence or custom specialization
  • hierarchical classification using several databases in one or more levels in just one run
  • EM and/or LCA algorithms to solve multiple-matching reads
  • reporting of multiple and unique matches for every read
  • reporting of sequence, taxonomic or multi-match abundances with optional genome size correction
  • advanced tree-like reports with several filter options
  • generation of contingency tables with several filters for multi-sample studies
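To illustrate what an LCA step does for a read that matches several references, here is a minimal sketch of a generic lowest-common-ancestor lookup on a toy taxonomy. It is not ganon's implementation; the `parent` dictionary, the node names and both function names are hypothetical and exist only for this example.

```python
def lineage(node, parent):
    """Path from node up to the root, following child -> parent links."""
    path = [node]
    while parent[node] != node:  # the root is its own parent in this toy tree
        node = parent[node]
        path.append(node)
    return path


def lca(nodes, parent):
    """Lowest common ancestor of nodes in a tree given as a child -> parent dict."""
    common = None
    for node in nodes:
        path = lineage(node, parent)
        if common is None:
            common = path
        else:
            on_path = set(path)
            # Keep only ancestors shared so far, preserving leaf-to-root order.
            common = [n for n in common if n in on_path]
    return common[0]


# Hypothetical toy taxonomy (child -> parent).
parent = {
    "root": "root",
    "Bacteria": "root",
    "Archaea": "root",
    "GenusE": "Bacteria",
    "speciesA": "GenusE",
    "speciesB": "GenusE",
}

print(lca(["speciesA", "speciesB"], parent))  # GenusE
print(lca(["speciesA", "Archaea"], parent))   # root
```

A read matching both `speciesA` and `speciesB` would be reported at `GenusE` in this scheme, which trades resolution for certainty.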

Find out more information in the user manual: https://pirovc.github.io/ganon/

Quick install and usage

# Install
conda install -c bioconda -c conda-forge ganon
# Download and Build (Archaea - complete genomes - NCBI RefSeq)
ganon build --db-prefix arc_cg_rs --source refseq --organism-group archaea --complete-genomes --threads 24
# Classify
ganon classify --db-prefix arc_cg_rs --output-prefix classify_results --paired-reads my_reads.1.fq.gz my_reads.2.fq.gz --threads 24

For further examples, database build guides, installation from source and more: https://pirovc.github.io/ganon/

ganon's People

Contributors

benvenutti, pirovc

ganon's Issues

Suggestions on reducing DB size?

Hi again,

Do you have any suggestions on how to reduce Ganon's database size without giving up too much sensitivity? As RefSeq and other databases continue to grow, running Ganon is becoming more I/O- and memory-intensive.

The other possible solution I can think of would be to break the database down into fixed-size chunks that get loaded incrementally. I'm not sure whether that is possible, but it might be helpful in the long term.

Error code 1: Please provide a single or one-per-hierarchy --kmer-size value[s]

ganon v1.1

All databases were also built with v1.1 and default settings, so they all use the same k-mer size etc.

ganon build -t 18 -m 275 --db-prefix EUK_refseq_CG_db --input-directory /media/ubuntu/Elements/reference_genomes/genome_updater/euk_refseq_cg/22_12_2021/files --input-extension genomic.fna.gz &

Why am I getting this error with the following command, and how can I fix it?

for i in *_1_val_1.fq.gz; do
  # Strip the mate-1 suffix to get the sample prefix
  b=${i%%_1_val_1.fq.gz}
  ganon classify -d /media/ubuntu/Elements/reference_genomes/ganon/ARC_refseq_ALL_db/ARC_refseq_ALL_db \
    /media/ubuntu/Elements/reference_genomes/ganon/BAC_refseq_ALL_db/BAC_refseq_ALL_db \
    /media/ubuntu/Elements/reference_genomes/ganon/EUK_refseq_CG_db/EUK_refseq_CG_db \
    /media/ubuntu/Elements/reference_genomes/ganon/VIRAL_refseq_ALL_db/VIRAL_refseq_ALL_db \
    -p "$b"_1_val_1.fq.gz "$b"_2_val_2.fq.gz \
    -o "$b"_ganon_results --output-lca --output-unclassified -t 16
done &

OUTPUT:
Classifying reads (ganon-classify)
The following command failed to run:
/home/ubuntu/miniconda2/envs/python3.7_environment/bin/ganon-classify --paired-reads SL335732_1_val_1.fq.gz,SL335732_2_val_2.fq.gz --ibf /media/ubuntu/Elements/reference_genomes/ganon/ARC_refseq_ALL_db/ARC_refseq_ALL_db.ibf,/media/ubuntu/Elements/reference_genomes/ganon/BAC_refseq_ALL_db/BAC_refseq_ALL_db.ibf,/media/ubuntu/Elements/reference_genomes/ganon/EUK_refseq_CG_db/EUK_refseq_CG_db.ibf,/media/ubuntu/Elements/reference_genomes/ganon/VIRAL_refseq_ALL_db/VIRAL_refseq_ALL_db.ibf --map /media/ubuntu/Elements/reference_genomes/ganon/ARC_refseq_ALL_db/ARC_refseq_ALL_db.map,/media/ubuntu/Elements/reference_genomes/ganon/BAC_refseq_ALL_db/BAC_refseq_ALL_db.map,/media/ubuntu/Elements/reference_genomes/ganon/EUK_refseq_CG_db/EUK_refseq_CG_db.map,/media/ubuntu/Elements/reference_genomes/ganon/VIRAL_refseq_ALL_db/VIRAL_refseq_ALL_db.map --tax /media/ubuntu/Elements/reference_genomes/ganon/ARC_refseq_ALL_db/ARC_refseq_ALL_db.tax,/media/ubuntu/Elements/reference_genomes/ganon/BAC_refseq_ALL_db/BAC_refseq_ALL_db.tax,/media/ubuntu/Elements/reference_genomes/ganon/EUK_refseq_CG_db/EUK_refseq_CG_db.tax,/media/ubuntu/Elements/reference_genomes/ganon/VIRAL_refseq_ALL_db/VIRAL_refseq_ALL_db.tax --kmer-size 19,19,19,19 --window-size 0,0,0,0 --rel-cutoff 0.25 --abs-filter 0 --output-prefix SL335732_ganon_results --threads 16

Error code: 1
Out:
Error:
Please provide a single or one-per-hierarchy --kmer-size value[s]

Json support

Question: would it be interesting for the user to be able to use a json file for the setup (as an alternative to the command line arguments)?
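To make the idea concrete, here is a minimal sketch of how a JSON file could supply option values while explicit command-line arguments still take precedence. This is not part of ganon; the `--config` option, the `example-tool` program name and the JSON keys are assumptions made only for illustration.

```python
import argparse
import json


def parse_args(argv):
    """Parse CLI arguments, letting an optional JSON file supply defaults."""
    parser = argparse.ArgumentParser(prog="example-tool")  # hypothetical tool
    parser.add_argument("--config", help="path to a JSON file with option values")
    parser.add_argument("--db-prefix")
    parser.add_argument("--threads", type=int, default=1)

    # First pass: look only for --config and ignore everything else.
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--config")
    known, _ = pre.parse_known_args(argv)

    if known.config:
        with open(known.config) as handle:
            # Keys are expected to match argparse destinations, e.g. "db_prefix".
            parser.set_defaults(**json.load(handle))

    # Second pass: values given on the command line override the JSON defaults.
    return parser.parse_args(argv)
```

With a file containing `{"db_prefix": "my_db", "threads": 8}`, calling `parse_args(["--config", "setup.json", "--threads", "2"])` would keep `my_db` from the file and take `2` threads from the command line.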

Building Refseq database fails with a numpy type error

Hi,

I am using ganon 0.4.0 and genome_updater 0.2.5 in a clean Conda environment to generate a new Ganon reference. After downloading the latest version of RefSeq and concatenating the genomes for processing, I get the following NumPy error when running the database build command. I also manually specified python=3.6 in the environment.

I can always downgrade to a previous version of Ganon, but given the lack of an outstanding issue I figure it might be something simple to resolve on my end. Any suggestions are appreciated!

Download command:

genome_updater.sh -g "archaea,bacteria" \
                  -d "refseq" \
                  -l "Complete Genome" \
                  -f "genomic.fna.gz,assembly_report.txt" \
                  -o "RefSeqCG_arc_bac" -b "v1" \
                  -a -m -u -r -p -t 24

Build command:

ganon build --db-prefix ganon_db \
            --input-files merged.genomic.fna.gz \
            --seq-info-file RefSeqCG_arc_bac/v1/seqinfo.txt \
            --taxdump-file RefSeqCG_arc_bac/v1/taxdump.tar.gz

Stack trace:

- - - - - - - - - -
   _  _  _  _  _
  (_|(_|| |(_)| |
   _|   v. 0.4.0
- - - - - - - - - -
Unpacking taxdump
 - done in 1.68s.

Parsing taxonomy
 - done in 5.51s.

Parsing --seq-info-file
 - 45892 unique sequence entries in the --seq-info-file RefSeqCG_arc_bac/v1/seqinfo.txt
 - done in 0.05s.

Build: adding 45892 sequences

  File "/opt/conda/envs/ganon_env/bin/ganon", line 33, in <module>
    sys.exit(load_entry_point('ganon==0.4.0', 'console_scripts', 'ganon')())
  File "/opt/conda/envs/ganon_env/lib/python3.6/site-packages/ganon/ganon.py", line 48, in main_cli
    sys.exit(0 if main() else 1)
  File "/opt/conda/envs/ganon_env/lib/python3.6/site-packages/ganon/ganon.py", line 34, in main
    ret=build(cfg)
  File "/opt/conda/envs/ganon_env/lib/python3.6/site-packages/ganon/build_update.py", line 69, in build
    bin_length, approx_size, n_bins = estimate_bin_len_size(cfg, seqinfo, tax)
  File "/opt/conda/envs/ganon_env/lib/python3.6/site-packages/ganon/build_update.py", line 558, in estimate_bin_len_size
    bin_lens = np.geomspace(min_bin_len, max_bin_len, num=300)
  File "<__array_function__ internals>", line 6, in geomspace
  File "/opt/conda/envs/ganon_env/lib/python3.6/site-packages/numpy/core/function_base.py", line 403, in geomspace
    both_negative = (_nx.sign(start) == -1) & (_nx.sign(stop) == -1)
numpy.core._exceptions.UFuncTypeError: ufunc 'sign' did not contain a loop with signature matching types dtype('<U32') -> dtype('<U32')

error message with ganon tests

(base) fvangef@larix:~/ganon/build$ ./ganon-tests

--seqid-bin-file files/bacteria_acc_bin.txt
--output-filter-file test_output.filter
--filter-size 0
--filter-size-bits 8388352
--hash-functions 3
--kmer-size 19
--n-batches 1000
--n-refs 400
--threads 2
--update-filter-file
--update-complete 0
--verbose 1
--reference-files
sequences/bacteria_NC_010333.1.fasta.gz
sequences/bacteria_NC_017163.1.fasta.gz
sequences/bacteria_NC_017164.1.fasta.gz
sequences/bacteria_NC_017543.1.fasta.gz

ganon-tests is a Catch v2.6.1 host application.
Run with -? for options

-------------------------------------------------------------------------------
Scenario: Build
-------------------------------------------------------------------------------
/home/mi/fvangef/ganon/tests/ganon-build/GanonBuild.test.cpp:29
...............................................................................

/home/mi/fvangef/ganon/tests/ganon-build/GanonBuild.test.cpp:29: FAILED:
due to a fatal error condition:
  SIGFPE - Floating point error signal

===============================================================================
test cases: 1 | 1 failed
assertions: 1 | 1 failed

Floating point exception

Dealing with paired reads

Hi,

I see that ganon supports several FASTQ inputs but has no concept of paired reads, right? I am planning to fuse the two reads of a pair with an "N" in the middle to break the k-mers at the junction, so that all k-mers of a pair can be extracted together. I think this works with other tools. Would it work with ganon, or would I end up with chimeric k-mers spanning the two mates?
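As a sanity check of the idea, the following sketch shows that joining two mates with a single "N" yields exactly the k-mers of each mate, provided the tool skips every k-mer containing "N". Whether ganon skips such k-mers is an assumption here, not something stated in this thread, and the function names are hypothetical.

```python
def kmers(seq, k):
    """Yield every k-mer of seq that does not contain the spacer 'N'."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            yield kmer


def fused_pair_kmers(mate1, mate2, k):
    """K-mers of a read pair joined by a single 'N' spacer."""
    return list(kmers(mate1 + "N" + mate2, k))


mate1, mate2, k = "ACGTAC", "GGTTCA", 4
separate = list(kmers(mate1, k)) + list(kmers(mate2, k))
print(fused_pair_kmers(mate1, mate2, k) == separate)  # True
```

Every window that overlaps the junction contains the spacer and is dropped, so no chimeric k-mer survives under that assumption.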

question about output files of ganon 0.1.5.

Could it be that the descriptions of the output files in this GitHub repository are not valid for the output files generated with version 0.1.5? There is no .all file but a .out file instead, and the .rep file has far too many columns for the description given. Is there an updated description available?
Thanks in advance!

seqan3::parse_error on database build

Python 3.8.12
ganon 1.0.0

Building database from NCBI nt database in fasta format (354GB)

Command:

ganon build --db-prefix nt --input-files nt.fa

Error trace:

Building database files
 - nt.map
 - nt.tax
 - nt.gnn
 - done in 12.71s.

Building index (ganon-build)
 - max unique 19-mers: 102541
 - IBF calculated size with fp<=0.05: 5.11MB (669469 bits/bin * 64 optimal bins [42 real bins])
The following command failed to run:
~/envs/taxonomy/bin/ganon-build --seqid-bin-file nt_tmp/acc_bin.txt --bin-size-bits 669469 --kmer-size 19 --hash-functions 3 --threads 2 --output-filter-file nt.ibf --reference-files nt.fa

Error code: -6
Out:
Error:
terminate called after throwing an instance of 'seqan3::parse_error'
  what():  Encountered an unexpected letter: char_is_valid_for<seqan3::dna15> evaluated to false on 'x'
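One way to locate the offending characters before building is to scan the FASTA file for letters outside the expected alphabet. The sketch below is a stand-alone helper, not part of ganon or seqan3; the 15-letter set (A, C, G, T, N plus the ambiguity codes R, Y, S, W, K, M, B, D, H, V, matched case-insensitively) is an assumption about what the parser accepts.

```python
# Assumed 15-letter DNA alphabet; the exact set a given parser accepts may differ.
DNA15 = set("ACGTNRYSWKMBDHV")


def invalid_letters(fasta_lines):
    """Return {letter: count} for sequence characters outside the assumed alphabet."""
    counts = {}
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # header line, not sequence
        for ch in line.strip():
            if ch.upper() not in DNA15:
                counts[ch] = counts.get(ch, 0) + 1
    return counts


example = [">seq1", "ACGTxN", "acgtRY", ">seq2", "AC-GT"]
print(invalid_letters(example))  # {'x': 1, '-': 1}
```

For a real file, passing an open file handle (for example `open("nt.fa")`) instead of a list streams the lines without loading the whole file into memory.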

Nucleotide alphabet

Hi,

After building a database without problems, my classification run gets stuck on the first "Y" that appears in a read. There is no error and no crash; CPU usage drops to 0.00% and the task freezes.

I use CAMI sets that contain this kind of character and that were used in your preprint, so before filtering the reads I just want to be sure there is no parameter to deal with this. Can you also tell me which alphabet is supported: ACGTN only, or also the ambiguity codes URYSWKMDHV?
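As an alternative to discarding whole reads, ambiguity codes could be masked to "N" before classification. This is only a sketch of a possible preprocessing step, not a ganon feature; it assumes plain four-line FASTQ records (no wrapped sequence lines), and the function name is hypothetical.

```python
def mask_ambiguous(fastq_lines):
    """Yield FASTQ lines with every non-ACGTN base in sequence lines replaced by 'N'."""
    for index, line in enumerate(fastq_lines):
        text = line.rstrip("\n")
        # In four-line FASTQ records the sequence is the second line of each record.
        if index % 4 == 1:
            text = "".join(ch if ch in "ACGTNacgtn" else "N" for ch in text)
        yield text


record = ["@read1", "ACGYTR", "+", "IIIIII"]
print(list(mask_ambiguous(record)))  # ['@read1', 'ACGNTN', '+', 'IIIIII']
```

The sequence length is unchanged, so the quality line stays valid.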

Keep track of empty bins after update

An updated IBF may have empty bins, since the current IBF can only increase in size. Those bins are not tracked by the database files (map, bins) since re-use of bins is allowed. It is clear which bins are empty when they are not at the end of the filter, but once the last bins of the IBF are empty, it is not possible to track them from the database files.

Possible solutions:

  • make .map an exact translation of the IBF, with placeholders for empty bins.
  • check if the new IBF can be downsized
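The problem can be sketched with a small helper: empty bins in the middle of the filter can be inferred from gaps in the used bin ids, but trailing empty bins are only visible if the total number of bins is recorded somewhere. The function and its inputs are hypothetical and do not reflect ganon's actual file formats.

```python
def empty_bins(used_bin_ids, n_bins):
    """Return (all empty bin ids, number of empty bins at the tail of the filter)."""
    used = set(used_bin_ids)
    empty = [b for b in range(n_bins) if b not in used]
    trailing = 0
    # Walk backwards from the last bin until a used bin is found.
    for b in range(n_bins - 1, -1, -1):
        if b in used:
            break
        trailing += 1
    return empty, trailing


used = [0, 1, 3]               # bin ids that still hold sequences
print(empty_bins(used, 6))     # ([2, 4, 5], 2)
print(max(used) + 1)           # 4: what could be guessed without knowing n_bins
```

Here the guess from the used ids alone gives 4 bins, while the filter really has 6; the last two bins could be trimmed if the filter were downsized.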

Add code coverage

With the work done in #29, we can add test coverage now. Here is a draft plan for this:

  1. create a codecov account;
  2. add a Coverage build type to ganon;
  3. add a Coverage build plan to travis and send data to codecov (using lcov, probably);
  4. add a badge to the README.

.out files get too huge

Hi,
I am trying to classify some reads using ganon, and every time the server runs out of memory at some point. It produces an enormous *.out file (>300 GB) although the input file is only 10 GB, and because the server runs out of memory it never produces the other output files. This happens even on a large-memory HPC node with 450 GB of memory. Do you have any idea what could cause this problem?

This is the command I use:
ganon classify --ganon-path /path/to/ganon/ganon/0.1.2/bin/ --db-prefix my_db --reads /path/to/my/reads.fq -o /output/file -k species

Keep DB in memory between runs

Hi.

Is there a way to keep the Ganon DB in memory between running the classify method on different samples? At least for my use case, the majority of time is spent loading the DB into memory. I appreciate I could combine all my samples into a single file, but this makes for a rather awkward workflow and a lot of extra post-processing of results.

Thanks.

Peak memory usage of ganon build

How do I estimate RAM usage of ganon build? And how do I control it?

I keep getting std::bad_alloc errors. I have 100 GB of RAM. The FASTA file is 36 GB gzipped (human, RefSeq viruses and fungi, GTDB representative archaea and bacteria). A formal count (seqkit) gives 127 billion bases.

Parsing taxonomy
 - done in 0.40s.

Extracting sequence identifiers
 - 3426349 entries in the --seq-info-file
 - done in 2.38s.

Build: adding 3426085 sequences

Calculating best bin length
 - Approx. min. size possible: 100974.06MB
 - bin length: 2522550bp (approx: 75825 bins / 148893.99MB)
 - done in 39.76s.

Running taxonomic clustering (TaxSBP)
 - 75903 bins created
 - done in 98.11s.

Building database files
 - ncbi_human_virus_gtdb_bacteria_archaea_ganon.map
 - ncbi_human_virus_gtdb_bacteria_archaea_ganon.tax
 - ncbi_human_virus_gtdb_bacteria_archaea_ganon.gnn
 - done in 39.96s.

Building index (ganon-build)
 - max unique 19-mers: 2522532
 - IBF calculated size with fp<=0.05: 149019.62MB (16469055 bits/bin * 75904 optimal bins [75903 real bins])
The following command failed to run:
/rds/general/user/ajm3018/home/anaconda3/envs/ganon/bin/ganon-build --seqid-bin-file ncbi_human_virus_gtdb_bacteria_archaea_ganon_tmp/acc_bin.txt --filter-size-bits 1250067150720 --kmer-size 19 --hash-functions 3 --threads 48 --output-filter-file ncbi_human_virus_gtdb_bacteria_archaea_ganon.ibf     --reference-files ncbi_human_virus_gtdb_bacteria_archaea_ganon/library.fasta.gz

Error code: -6
Out:
Error:
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

I tried to reduce max-bloom-size to 100 GB, but ganon complained this was too small for the specified false positive rate.
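The numbers in the log can be cross-checked with simple arithmetic: bits per bin times the number of bins gives the value passed as --filter-size-bits, and dividing by 8 and by 1024² gives the size in MB. The sketch below assumes that "MB" in the log means 1024² bytes, which happens to reproduce the logged figure; the function name is hypothetical.

```python
def ibf_size(bits_per_bin, n_bins):
    """Return (total bits, size in MB) of a filter with n_bins bins of bits_per_bin bits."""
    total_bits = bits_per_bin * n_bins
    megabytes = total_bits / 8 / 1024 ** 2  # assumes 1 MB = 1024**2 bytes
    return total_bits, megabytes


bits, mb = ibf_size(16469055, 75904)
print(bits, round(mb, 2))  # 1250067150720 149019.62
```

So the filter alone needs roughly 149 GB, and even the reported "approx. min. size possible" of 100974.06MB is at or above the 100 GB of RAM available.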

Thanks!

Force option

Option to avoid “ERROR: temp folder already exists” when re-running failed attempts

Custom DB

Dear developers,
I have hundreds of sequences with no accession ID since they are de novo assembled, but we know the corresponding taxID. How can I build a custom DB without accession IDs? Is it possible to give ganon the names.dmp and nodes.dmp directly? I cannot find the correct parameter for that.

thanks in advance!

SyntaxError: can use starred expression only as assignment target

Hi,

python-3.5 is listed as a requirement, but I get the following syntax error(s)

> python --version
Python 3.5.5 :: Anaconda custom (64-bit)

> ./TaxSBP.py create --sorted-output -f t_tmp/acc_len_taxid.txt -n t_tmp/nodes.dmp -m t_tmp/merged.dmp -l 10000000 -r taxid
  File "./TaxSBP.py", line 254
    ret.append((sum_length,*ids))
                              ^
SyntaxError: can use starred expression only as assignment target
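Unpacking inside a tuple display, as in `(sum_length, *ids)`, was introduced in Python 3.5 (PEP 448), so this message may indicate that the script was executed by an older interpreter than the one `python --version` reports. Independently of that, the same tuple can be built in a form that works on every Python 3 version. This is a sketch with a hypothetical helper name, not a patch taken from TaxSBP:

```python
def make_entry(sum_length, ids):
    """Build (sum_length, id1, id2, ...) without unpacking inside a tuple display."""
    return (sum_length,) + tuple(ids)


ret = []
ret.append(make_entry(1500, ["acc1", "acc2"]))
print(ret)  # [(1500, 'acc1', 'acc2')]
```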

Building database takes very long

This tool got my attention, because I hope it will make maintaining reference databases a lot easier.

I installed Ganon from this GitHub repository and started to build a database (fungal genomes, 60 GB). According to the abstract of the pre-print, this should take about an hour. But even after a day it is still busy with the taxsbp script get_len_taxid.sh, and entries are still being written to acc_len_taxid.txt. I see that the script queries NCBI via eutils. Could it be that this step is rate limited? Any suggestions to make this run faster? I think tools like kraken download the taxonomy and then parse it locally.

This is how I ran Ganon:

$ ./ganon build --ganon-path build/ --taxsbp-path ../taxsbp/ -d ganon-genbank-fungi -t 8 -i /data/db/genomes/genbank/fungi/*.gz --verbose                                  
Extracting accessions... Done. Elapsed time: 1212.7593030929565 seconds.

log files

I am running ganon (build at present) through a remote connection to a server, so I don't always have the possibility to see the screen log. Sometimes the program exits, and it is then difficult to troubleshoot why. Would it be possible to write the log to a file?

Building database of all RefSeq bacteria, archaea, fungi and viruses

Hi. I'm keen to give ganon a try, but a little unclear on how to build an initial database. Is there a script or recommended way to build a ganon database that covers all RefSeq bacteria, archaea, fungi and viruses? I'd like to build the RefSeq-ALL and/or RefSeq-ALL-top-3 database as described in the ganon bioRxiv manuscript. Alternatively, are there pre-built RefSeq databases available (I appreciate this sort of defeats the purpose of ganon)? Thanks.

Error with cmake

@benvenutti we need to update the cmake dependency because of the following:

The error occurs with cmake version 3.7.2, and it works fine with cmake version 3.12.3 (support for CXX_STANDARD 17 was added in CMake 3.8):

$ cmake -DCMAKE_BUILD_TYPE=Release ..
-- The CXX compiler identification is GNU 7.2.0
-- Check for working CXX compiler: /home/pirov/SCRATCH_NOBAK/miniconda3/envs/gcc7/bin/c++
-- Check for working CXX compiler: /home/pirov/SCRATCH_NOBAK/miniconda3/envs/gcc7/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.8")
-- Performing Test CXX14_BUILTIN
-- Performing Test CXX14_BUILTIN - Success
-- Looking for C++ include execinfo.h
-- Looking for C++ include execinfo.h - found
-- Found Seqan: /home/pirov/SCRATCH_NOBAK/ganon/libs/seqan/include (found version "2.4.0")
-- Configuring done
CMake Error at src/CMakeLists.txt:37 (add_executable):
  CXX_STANDARD is set to invalid value '17'


CMake Error at src/CMakeLists.txt:47 (add_executable):
  CXX_STANDARD is set to invalid value '17'


CMake Error at src/CMakeLists.txt:57 (add_executable):
  CXX_STANDARD is set to invalid value '17'


-- Generating done
-- Build files have been written to: /home/pirov/SCRATCH_NOBAK/ganon/build

ganon classify: --rel-cutoff and --rel-filter guidelines?

Hi, I'm using a --rel-cutoff (c) and --rel-filter (e) value of 0.25 in classify (window size of 32 from build). About 50% of the reads are classified, and about 20% of those are unique matches. In theory I would like to increase both values, if that is possible. I'm using paired reads of 130-150 bp.

Do you have any guidelines for these parameters beyond what is written on the main page, or a suggestion for recommended values? I'm guessing that lowering --rel-cutoff (c) will recruit more reads but at the cost of unique matches. What is a good trade-off?

Compatibility with recentrifuge? (add read len to the output)

Hi
I'm wondering about the compatibility of ganon (classify/report) output with recentrifuge (https://github.com/khyox/recentrifuge/wiki/Running-recentrifuge-for-a-generic-classifier). I need to remove contaminant reads from the dataset, and for someone who doesn't use R (Decontam), recentrifuge seems like the best option. So I'm wondering a) whether the ganon output is compatible with Recentrifuge (which I presume it is), and b) which output, either from classify or report, gives me a file that is closest to the input Recentrifuge requests.

GTDB

Any experiences/suggestions for using Ganon with GTDB? There seem to be great advantages of GTDB for taxonomic assignment!

Thanks.

ganon-build, Errorcode: -4

Hello,

I am trying to include Ganon in the benchmarking platform we are currently developing (https://lemmi.ezlab.org). In short, this means wrapping it in a Docker container both to build a reference using specific genomes and then to run an analysis reusing this database.

We managed to make it work for Ganon on different servers and laptops, however it crashes on the server we use for the actual benchmarking runs... seems related to the kernel or install of this machine, as docker is not a complete virtualization.

The error I get is

Running taxonomic clustering (TaxSBP.py)...
169 bins created. Done. Elapsed time: 5.922635078430176 seconds.
Approximate (upper-bound) # unique k-mers: 5999982
Bloom filter calculated size with fp<=0.05: 896.5887MB (39172558 bits/bin * 192 optimal bins [169 real bins])
Building index (ganon-build)...
The following command failed to execute:
/ganon/build/ganon-build -e /bbx/tmp/DB_tmp/acc_bin.txt --filter-size-bits 7521131136 -k 19 -n 3 -t 32 -o /bbx/tmp/DB.filter --verbose   /bbx/tmp/All_seqs.fna
Errorcode: -4
Error:
--seqid-bin-file      /bbx/tmp/DB_tmp/acc_bin.txt
--output-filter-file  /bbx/tmp/DB.filter
--filter-size         896
--filter-size-bits    7521131136
--hash-functions      3
--kmer-size           19
--n-batches           1000
--n-refs              400
--threads             32
--update-filter-file
--update-complete     0
--verbose             1
--reference-files
                      /bbx/tmp/All_seqs.fna


Traceback (most recent call last):
  File "/ganon/ganon", line 596, in <module>
    main()
  File "/ganon/ganon", line 195, in main
    stdout, stderr, errcode = run(run_ganon_build_cmd, print_stderr=True)
  File "/ganon/ganon", line 472, in run
    if errcode!=0: raise Exception()
Exception

Probably not sufficient to give me an answer, unless a parameter is wrong, but I would be glad to provide extra detail if I knew what to look for.

Of note, when building the docker container on a different machine, ganon-build -h leads to a core dump. When building the container on the incriminated server (following: Installing GCC7 in a separate environment with conda), ganon-build -h does display the help. An improvement, but the main error keeps happening unfortunately.

Any suggestion for a debug strategy is welcome.

Thanks

Mathieu

ganon build results in Argument list too long

Hi,

I'm trying to build a Ganon database with something like:
ganon build --db-prefix sample_bacteria --input-files tests/ganon-build/data/sequences/bacteria*.fasta.gz

This results in an "Argument list too long" error. Is it possible to simply specify the directory containing all the genomes along with the desired extension?
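The request amounts to letting the program enumerate the files itself rather than having the shell expand a glob into one huge argument list. A minimal sketch of that approach follows; the function names and the manifest file are hypothetical and do not describe an existing ganon option.

```python
from pathlib import Path


def find_inputs(directory, extension):
    """Return sorted paths in directory whose names end with extension."""
    return sorted(p for p in Path(directory).iterdir()
                  if p.is_file() and p.name.endswith(extension))


def write_manifest(paths, manifest):
    """Write one path per line so the list never has to fit on a command line."""
    with open(manifest, "w") as handle:
        for path in paths:
            handle.write(f"{path}\n")
    return len(paths)
```

For example, `write_manifest(find_inputs("tests/ganon-build/data/sequences", ".fasta.gz"), "inputs.txt")` would list every matching genome in `inputs.txt`, however many there are, since the operating system's limit on argument length never comes into play.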

Cheers,
Donovan

Functional annotation?

Hi,
I'm wondering as to the possibility of incorporating a functional annotation of reads into ganon, such as that in kraken2?

ValueError: max() arg is an empty sequence

I have multiple databases that won't build, and I get the same error for all of them.

cd /media/ubuntu/Elements/reference_genomes/ganon
conda activate python3.7_environment
rm -r EUK_refseq_ALL_db
mkdir EUK_refseq_ALL_db
cd EUK_refseq_ALL_db
ganon build -t 30 --db-prefix EUK_refseq_ALL_db --input-directory /media/ubuntu/Elements/reference_genomes/genome_updater/euk_refseq_ALL/24_12_2021/files --input-extension genomic.fna.gz > out.log 2>&1 &
cd ..



- - - - - - - - - -
   _  _  _  _  _
  (_|(_|| |(_)| |
   _|   v. 1.1.0
- - - - - - - - - -


1346 file(s) [genomic.fna.gz] found in /media/ubuntu/Elements/reference_genomes/genome_updater/euk_refseq_ALL/24_12_2021/files

Downloading taxdump

  • done in 7.02s.

Unpacking taxdump

  • done in 3.97s.

Parsing taxonomy

  • done in 5.64s.

Extracting sequence identifiers

  • 13354133 unique sequence headers successfully retrieved from 1346 input file(s)
  • done in 24040.90s.

Extracting sequence lengths

  • 13354133 sequences lenghts successfully retrieved
  • done in 15867.69s.

Downloading nucl_gb.accession2taxid.gz

  • done in 315.38s.

Downloading nucl_wgs.accession2taxid.gz

  • done in 326.77s.

Parsing accession2taxid files

  • 13051429 entries found in the nucl_gb.accession2taxid.gz file
  • 0 entries found in the nucl_wgs.accession2taxid.gz file
  • done in 17115.37s.

Build: adding 13354133 sequences

Simulating parameters
Traceback (most recent call last):
  File "/home/ubuntu/miniconda2/envs/python3.7_environment/bin/ganon", line 33, in <module>
    sys.exit(load_entry_point('ganon==1.1.0', 'console_scripts', 'ganon')())
  File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/ganon.py", line 48, in main_cli
    sys.exit(0 if main() else 1)
  File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/ganon.py", line 34, in main
    ret=build(cfg)
  File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/build_update.py", line 70, in build
    bin_length = estimate_bin_length(cfg, seqinfo, tax)
  File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/build_update.py", line 625, in estimate_bin_length
    max_bin_len = max(groups_len.values())
ValueError: max() arg is an empty sequence
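Python's `max()` raises exactly this ValueError when it is given an empty iterable, so the traceback indicates that `groups_len` contained no entries at that point. Why it was empty cannot be told from the log alone. The sketch below only shows how such a case could be reported more clearly; the function name and the message are hypothetical and not taken from ganon's code.

```python
def longest_group(groups_len):
    """Return the largest value in groups_len, or raise a descriptive error if it is empty."""
    if not groups_len:
        raise ValueError("no sequence could be assigned to a group; "
                         "check that sequence identifiers match the taxonomy mapping")
    return max(groups_len.values())


print(longest_group({"taxid_a": 3000, "taxid_b": 7000}))  # 7000
```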

Pandas error: DataFrame.append deprecated during `ganon build`

Hi @pirovc

I am using the bioconda version of ganon, am getting some strange results, and am trying to troubleshoot them. I have noticed that when I use ganon build I get the following warning:

/master/nplatt/anaconda3/envs/ganon/lib/python3.9/site-packages/ganon/seqinfo.py:32: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  self.seqinfo = self.seqinfo.append(df, ignore_index=True, sort=False)

and it repeats several thousand times....

This seems like an easy fix and shouldn't be causing any major problems but wanted to make note of it for you. I tried messing around with the code and changing it to the new pd.concat() but crashed things every time I tried.

Relevant portions of my env are:

name: ganon
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - ganon=1.1.2=py39h65541a6_1
  - pandas=1.4.2=py39h1832856_2
  - python=3.9.13=h2660328_0_cpython
...

Thanks

possibility to integrate pigz

I am classifying quite large gzipped FASTQ files, and I'm wondering whether a large part of the run time is spent unzipping/zipping the files instead of classifying. I am happy to add pigz to my pipeline to speed things up if need be, but could pigz be integrated into ganon, passing the same -t (threads) option?

regards

Mixing GTDB and NCBI taxonomy

Hi,
Is it possible to mix taxonomy systems when using multiple databases? E.g. I'd be interested in using GenBank and the latest GTDB release.

Best

Oskar

Use of --input-directory still results in `Argument list too long`

Hello,

I tried to build a database using the --input-directory and --input-extension options. It looks like the script still expands this and tries to pass all genomes to a shell command, resulting in:

OSError: [Errno 7] Argument list too long: '/bin/sh'

Cheers,
Donovan

Confusing error message when using `build-custom`

Hello,

I'm testing a conda-installed ganon 1.5 on this FASTA File.

However when running with the following command I get the error:

$ ganon build-custom -i genome.fasta -d out/ --verbose
- - - - - - - - - -
   _  _  _  _  _   
  (_|(_|| |(_)| |  
   _|   v. 1.5.0
- - - - - - - - - -
Total valid files: 1

Downloading and parsing ncbi taxonomy
 - done in 20.09s.

Parsing sequences from --input (1 files)
 - 1 unique entries
 - done in 0.01s.

Retrieving sequence information from NCBI e-utils
The following command failed to run:
 -i out/_files/build/accessions.txt -k -e 
[Errno 2] No such file or directory: '-i'
Error code: 1

I'm not sure what the -i is referring to here... do you have any idea what is causing it?
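One reading of the log is that the path of the helper program was empty when the command was assembled, so the first remaining token, `-i`, was treated as the executable to run. This is a guess based only on the printed command. The sketch below shows how a missing helper could be detected before the command is launched; the function name is hypothetical and this is not ganon's code.

```python
import shutil


def build_command(tool_name, args):
    """Return [executable, *args], failing early if the executable cannot be located."""
    executable = shutil.which(tool_name)
    if not executable:
        raise FileNotFoundError(f"required helper '{tool_name}' not found on PATH")
    return [executable] + list(args)
```

With such a check, the failure would name the missing helper instead of reporting `No such file or directory: '-i'`.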

no pylca in ganon v4 clone

Hi,
I tried to install ganon 4 but it always gives me an error
"LCA module not found (pylca)"
The error message is triggered by the line
from pylca.pylca import LCA
However I can't find any pylca* in my ganon clone...
Do you know how to solve this problem?

Thanks a lot!

Segmentation fault

Hi, I got ganon to work for smaller datasets, but when I try to run a larger set of genomes I get the error pasted below. It doesn't give me much information, so I can't begin to debug it. If I use exactly the same setup but limit the number of genomes, it runs smoothly.

(Running without verbose)
Error code: -11
Out:
Error:

(Running with verbose)
miniconda3/envs/flextaxd/bin/ganon-build --seqid-bin-file ganon_databases/NCBI_standard_bugtest/_tmp/acc_bin.txt --filter-size-bits 4260405324672 --kmer-size 19 --hash-functions 3 --threads 30 --output-filter-file ganon_databases/NCBI_standard_bugtest/.ibf --reference-files ganon_databases/NCBI_standard_bugtest/.tmp0.fasta,ganon_databases/NCBI_standard_bugtest/.tmp1.fasta,ganon_databases/NCBI_standard_bugtest/.tmp2.fasta,ganon_databases/NCBI_standard_bugtest/.tmp3.fasta,ganon_databases/NCBI_standard_bugtest/.tmp4.fasta,ganon_databases/NCBI_standard_bugtest/.tmp5.fasta,ganon_databases/NCBI_standard_bugtest/.tmp6.fasta,ganon_databases/NCBI_standard_bugtest/.tmp7.fasta,ganon_databases/NCBI_standard_bugtest/.tmp8.fasta,ganon_databases/NCBI_standard_bugtest/.tmp9.fasta,ganon_databases/NCBI_standard_bugtest/.tmp10.fasta,ganon_databases/NCBI_standard_bugtest/.tmp11.fasta,ganon_databases/NCBI_standard_bugtest/.tmp12.fasta,ganon_databases/NCBI_standard_bugtest/.tmp13.fasta,ganon_databases/NCBI_standard_bugtest/.tmp14.fasta,ganon_databases/NCBI_standard_bugtest/.tmp15.fasta,ganon_databases/NCBI_standard_bugtest/.tmp16.fasta,ganon_databases/NCBI_standard_bugtest/.tmp17.fasta,ganon_databases/NCBI_standard_bugtest/.tmp18.fasta,ganon_databases/NCBI_standard_bugtest/.tmp19.fasta,ganon_databases/NCBI_standard_bugtest/.tmp20.fasta,ganon_databases/NCBI_standard_bugtest/.tmp21.fasta,ganon_databases/NCBI_standard_bugtest/.tmp22.fasta,ganon_databases/NCBI_standard_bugtest/.tmp23.fasta,ganon_databases/NCBI_standard_bugtest/.tmp24.fasta,ganon_databases/NCBI_standard_bugtest/.tmp25.fasta,ganon_databases/NCBI_standard_bugtest/.tmp26.fasta,ganon_databases/NCBI_standard_bugtest/.tmp27.fasta,ganon_databases/NCBI_standard_bugtest/.tmp28.fasta,ganon_databases/NCBI_standard_bugtest/.tmp29.fasta --verbose 2>&1 > ganono-build_log.txt

--reference-files
ganon_databases/NCBI_standard_bugtest/.tmp0.fasta
ganon_databases/NCBI_standard_bugtest/.tmp1.fasta
ganon_databases/NCBI_standard_bugtest/.tmp2.fasta
ganon_databases/NCBI_standard_bugtest/.tmp3.fasta
ganon_databases/NCBI_standard_bugtest/.tmp4.fasta
ganon_databases/NCBI_standard_bugtest/.tmp5.fasta
ganon_databases/NCBI_standard_bugtest/.tmp6.fasta
ganon_databases/NCBI_standard_bugtest/.tmp7.fasta
ganon_databases/NCBI_standard_bugtest/.tmp8.fasta
ganon_databases/NCBI_standard_bugtest/.tmp9.fasta
ganon_databases/NCBI_standard_bugtest/.tmp10.fasta
ganon_databases/NCBI_standard_bugtest/.tmp11.fasta
ganon_databases/NCBI_standard_bugtest/.tmp12.fasta
ganon_databases/NCBI_standard_bugtest/.tmp13.fasta
ganon_databases/NCBI_standard_bugtest/.tmp14.fasta
ganon_databases/NCBI_standard_bugtest/.tmp15.fasta
ganon_databases/NCBI_standard_bugtest/.tmp16.fasta
ganon_databases/NCBI_standard_bugtest/.tmp17.fasta
ganon_databases/NCBI_standard_bugtest/.tmp18.fasta
ganon_databases/NCBI_standard_bugtest/.tmp19.fasta
ganon_databases/NCBI_standard_bugtest/.tmp20.fasta
ganon_databases/NCBI_standard_bugtest/.tmp21.fasta
ganon_databases/NCBI_standard_bugtest/.tmp22.fasta
ganon_databases/NCBI_standard_bugtest/.tmp23.fasta
ganon_databases/NCBI_standard_bugtest/.tmp24.fasta
ganon_databases/NCBI_standard_bugtest/.tmp25.fasta
ganon_databases/NCBI_standard_bugtest/.tmp26.fasta
ganon_databases/NCBI_standard_bugtest/.tmp27.fasta
ganon_databases/NCBI_standard_bugtest/.tmp28.fasta
ganon_databases/NCBI_standard_bugtest/.tmp29.fasta
--seqid-bin-file ganon_databases/NCBI_standard_bugtest/_tmp/acc_bin.txt
--output-filter-file ganon_databases/NCBI_standard_bugtest/.ibf
--update-filter-file
--update-complete 0
--filter-size 507879
--filter-size-bits 4260405324672
--hash-functions 3
--kmer-size 19
--threads 30
--n-refs 400
--n-batches 1000
--verbose 1
--quiet 0

WARNING: sequence not defined on seqid-bin-file [NZ_CP054027.1_GCF_005862185.2]
WARNING: sequence not defined on seqid-bin-file [NZ_CP054028.1_GCF_005862185.2]
WARNING: sequence not defined on seqid-bin-file [NZ_CP054029.1_GCF_005862185.2]
WARNING: sequence not defined on seqid-bin-file [NZ_CP054029.1_GCF_005862185.2]
WARNING: sequence not defined on seqid-bin-file [NZ_CP047412.1_GCF_013487885.1]
WARNING: sequence not defined on seqid-bin-file [NZ_CP047413.1_GCF_013487885.1]
WARNING: sequence not defined on seqid-bin-file [NZ_CP047413.1_GCF_013487885.1]
WARNING: sequence not defined on seqid-bin-file [NZ_CP047416.1_GCF_013487925.1]
WARNING: sequence not defined on seqid-bin-file [NZ_CP047416.1_GCF_013487925.1]
WARNING: sequence not defined on seqid-bin-file [NZ_CP045506.1_GCF_013376495.1]
WARNING: sequence not defined on seqid-bin-file [NZ_CP045506.1_GCF_013376495.1]
Segmentation fault

Invalid sequences in build step

While building a reference db for fungal sequences from GenBank, all sequences are marked as invalid during the build process. What makes them invalid?

FJUY01000046.1 not defined on seqid-bin file

ganon-build processed 0 sequences (0 Mbp) in 154.275 seconds (0 Kseq/m, 0 Mbp/m)
 - 73767 sequences and 1446 bins defined on ganon-genbank-fungi_tmp/acc_bin.txt
 - 87436 sequences (87436 invalid) were read from the input sequence files.
 - 0 valid sequences in 1446 bins were written to ganon-genbank-fungi.filter
Done. Elapsed time: 177.28528761863708 seconds.
Storing nodes... Done. Elapsed time: 3.003349542617798 seconds.
Total elapsed time: 206.2393398284912 seconds.

In the 2019-02-20_15-51-10_updated_sequence_accession.txt from genome_updater the stats look ok.

A       GCA_900074925.1 FJUY01000031.1  NW_019716285.1  109818  112498

The FASTA headers look like:

>FJUY01000077.1 Ramularia collo-cygni strain URUG2 genome assembly, contig: RCC_scaffold77, whole genome shotgun sequence
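
The "not defined on seqid-bin file" message suggests that the identifiers read from the FASTA headers have no exact match in the seqid-bin file. The following is a minimal sketch for checking that outside of ganon. It assumes the sequence id is the first whitespace-separated token of each header and the first tab-separated column of the seqid-bin file; both are assumptions, and the function names are hypothetical rather than part of ganon.

```python
def fasta_ids(lines):
    """Yield the first token of every FASTA header line (the text after '>')."""
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                yield tokens[0]


def seqid_bin_ids(lines):
    """Collect the first tab-separated column (assumed to hold the sequence id)."""
    return {line.split("\t")[0].strip() for line in lines if line.strip()}


def undefined_ids(fasta_lines, seqid_bin_lines):
    """Return header ids that have no exact match in the seqid-bin file."""
    defined = seqid_bin_ids(seqid_bin_lines)
    return [seqid for seqid in fasta_ids(fasta_lines) if seqid not in defined]


if __name__ == "__main__":
    # Header id taken from the report; the seqid-bin line is made up to
    # illustrate one possible kind of mismatch (a different accession).
    fasta = [">FJUY01000031.1 example description", "ACGT"]
    seqid_bin = ["NW_019716285.1\t1\t4\t0"]
    print(undefined_ids(fasta, seqid_bin))
```

If every header id shows up in the result, the two files probably use different identifier schemes, which is worth checking before looking at the sequences themselves.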

Comparison of run-time classification step

I ran ganon classify and kraken1 on the same samples against a database built from NCBI fungal GenBank sequences. I noticed that ganon took much longer than kraken1 to process the sequences:

analysis/ganon/T7V21.log:ganon-classify processed 1017638 sequences (270.743 Mbp) in 1348.76 seconds (45.2699 Kseq/m, 12.0441 Mbp/m)
analysis/kraken/T7V21.log:1017638 sequences (270.74 Mbp) processed in 38.897s (1569.7 Kseq/m, 417.63 Mbp/m).

The databases are located on PCI Express (PCIe) storage. The ganon filter file is 54 GB, whereas the kraken1 kdb file is 546 GB (!), so the ganon db is also much smaller.

I was expecting ganon to be faster or at least comparable. Is this expected? Should I change something? Any thoughts on this are appreciated.
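
To put a number on the gap, the two log lines can be parsed and compared. This is a small sketch that only relies on the text of the two lines quoted above; the regular expression and function name are illustrative and not part of either tool.

```python
import re

# Matches "<N> sequences (<X> Mbp) ... in <T> seconds" as well as "... in <T>s".
LOG_PATTERN = re.compile(
    r"(?P<seqs>\d+) sequences \((?P<mbp>\d+(?:\.\d+)?) Mbp\).*?in (?P<secs>\d+(?:\.\d+)?)\s*s"
)


def parse_throughput(log_line):
    """Extract sequence count, Mbp and seconds, and derive Mbp per minute."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        raise ValueError("unrecognised log line: " + log_line)
    mbp = float(match["mbp"])
    seconds = float(match["secs"])
    return {
        "sequences": int(match["seqs"]),
        "mbp": mbp,
        "seconds": seconds,
        "mbp_per_min": mbp / seconds * 60,
    }


if __name__ == "__main__":
    ganon = parse_throughput(
        "ganon-classify processed 1017638 sequences (270.743 Mbp) in 1348.76 seconds"
    )
    kraken = parse_throughput(
        "1017638 sequences (270.74 Mbp) processed in 38.897s"
    )
    # Ratio of wall-clock times for the same input.
    print(round(ganon["seconds"] / kraken["seconds"], 1))
```

For the figures above the ratio comes out at roughly 35x in favour of kraken1.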

ganon-build (core dumped)

Hello,

I would like to index all of RefSeq bacteria, archaea, viral and human with ganon-build.

I downloaded all the files with genome_updater.sh with the following command:

genome_updater.sh -d "refseq" -g "archaea,bacteria,viral,human" -o path/to/DB -t 12 -b refseq -f "genomic.fna.gz,assembly_report.txt" -m -u -r -p

For indexing the downloaded files I used the following command:

ganon-build -e path/to/DB/refseq/acc_len_taxid_assembly.txt -o path/to/DB/refseq/refseq_db.filter -s 70000 -n 3 -k 19 -t 60 --verbose path/to/DB/refseq/files/*.fna.gz

At first I got an "Argument list too long" error, so I concatenated all fna.gz files into one file called refseq.fna.gz and retried the command.

I now get the following error:

--seqid-bin-file      path/to/DB/refseq/acc_len_taxid_assembly.txt
--output-filter-file  path/to/DB/refseq/refseq_db.filter
--filter-size         70000
--filter-size-bits    3087007744
--hash-functions      3
--kmer-size           19
--n-batches           1000
--n-refs              400
--threads             60
--update-filter-file
--update-complete     0
--verbose             1
--reference-files
                     path/to/DB/refseq/refseq.fna.gz

terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoul
/bin/bash: line 1: 14186 Aborted                 (core dumped) ganon-build -e path/to/DB/refseq/acc_len_taxid_assembly.txt -o path/to/DB/refseq/refseq_db.filter -s 70000 -n 3 -k 19 -t 60 --verbose path/to/DB/refseq/refseq.fna.gz

Any suggestions on how to debug and fix this?
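
In C++, std::stoul throws std::invalid_argument when the text it is given does not start with a number, so one way to narrow this down is to look for non-numeric values in columns that are expected to hold integers. The sketch below does that for a tab-separated file. Which columns ganon-build actually converts is not stated in this report, so the column indexes are supplied by the caller as an assumption, and the function name is hypothetical.

```python
import re

# std::stoul (base 10) skips leading whitespace, accepts an optional sign and
# then needs at least one digit; otherwise it throws std::invalid_argument.
_STOUL_OK = re.compile(r"^\s*[+-]?\d")


def find_unparsable_fields(lines, numeric_columns, sep="\t"):
    """Return (line_number, column, value) for fields std::stoul would reject.

    numeric_columns is an assumption supplied by the caller: this sketch does
    not know which columns ganon-build converts to integers.
    """
    problems = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(sep)
        for column in numeric_columns:
            # A missing column is treated like an empty string.
            value = fields[column] if column < len(fields) else ""
            if not _STOUL_OK.match(value):
                problems.append((line_number, column, value))
    return problems


if __name__ == "__main__":
    # Made-up row: accession, length, taxid, assembly accession.
    rows = ["NC_000000.1\t1000\t9606\tGCF_000000000.1"]
    print(find_unparsable_fields(rows, numeric_columns=[1, 2, 3]))
```

In the made-up row the last column is reported, since an assembly accession cannot be parsed as an unsigned integer. If the file passed as the seqid-bin file has a different column layout than ganon-build expects, a check like this would show where the conversion fails.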

Reporting of subspecies/strain

Hi,

The CLI of ganon classify indicates it will report results at species+. I took this to mean subspecies/strains as defined by the NCBI Taxonomy. However, I have yet to observe classifications below the rank of species, even on trivial test sets (e.g. simulated reads from a subspecies/strain in the reference database). Personally, I think this is fine, but I just wanted to confirm that ganon doesn't report subspecies/strains.

Cheers,
Donovan

avoid duplicated download of accession2taxid.gz

To reduce run time, it might be possible to place the *accession2taxid.gz files in the tmp folder beforehand (for example by copying them there).

If ganon checked whether the files are already present and skipped the download, this would also help when the files cannot be retrieved through ganon's download option.
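
The requested behaviour could look roughly like the following sketch: reuse a file that is already in place and only download when it is missing. This is not ganon's code; the function names are hypothetical, and the default downloader simply uses urllib from the standard library.

```python
import os
import shutil
import urllib.request


def _download(url, path):
    """Default downloader: stream the response body to path."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)


def fetch_if_missing(url, destination, download=None):
    """Download url to destination unless a non-empty file is already there.

    Returns True when a download happened and False when the existing file
    was reused.
    """
    if os.path.isfile(destination) and os.path.getsize(destination) > 0:
        return False
    download = download or _download
    # Write to a temporary name first so an interrupted download is never
    # mistaken for a complete file on the next run.
    partial = destination + ".part"
    download(url, partial)
    os.replace(partial, destination)
    return True
```

With this pattern, copying the accession2taxid files into the temporary folder before starting the build would be enough to skip the download step.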

--input-directory fails on sequence folder

Hi,

I downloaded data via genome_updater:

$ Tools/genome_updater/genome_updater.sh -g "taxids:1386" -d "refseq" -l "Chromosome" -f "genomic.fna.gz,assembly_report.txt" -t 36 -o References/all_bacillus_chromosome/ -r -a
----------------------------------------
      genome_updater version: 0.2.0
----------------------------------------
Mode: NEW - DOWNLOAD
Working directory: [...]
----------------------------------------
Downloading assembly summary
 - 4480/4579 entries removed [RefSeq category: all, Assembly level: Chromosome, Version status: latest]
 - 99 entries available
 - Downloading 198 files with 36 threads
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.

 - 185/198 files successfully downloaded
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.

 - Sequence accession report written [...]

Downloading current Taxonomy database [...]
 - Done

Downloading current Taxonomy database [...]
 - Done

# 185/198 files successfully obtained
 - 13 file(s) failed to download. Please re-run your command with -i to fix it again

Missing files were fixed successfully with -i setting. Thanks for this option!

Building the ganon index then fails at the start.
I tried different levels for --input-directory (going one or two levels up), but to no avail.

$ ganon build \
--db-prefix References/Bacillus_Ganon \
--input-directory References/all_bacillus_chromosome/2019-12-03_17-47-21/files/ \
--input-extension *.fna.gz
Extracting accessions... Traceback (most recent call last):
  File "[...]/bin/ganon", line 1116, in <module>
    main()
  File "[...]/bin/ganon", line 211, in main
    taxsbp_input_file, ncbi_nodes_file, ncbi_merged_file, ncbi_names_file = prepare_files(args, tmp_output_folder, use_assembly, ganon_get_len_taxid_exec)
  File "[...]/bin/ganon", line 618, in prepare_files
    taxsbp_input_file = retrieve_ncbi(tmp_output_folder, args.input_files, args.threads, ganon_get_len_taxid_exec, args.seq_info, use_assembly)
  File "[...]/bin/ganon", line 728, in retrieve_ncbi
    run_get_header = "cat {0} {1} | gawk 'BEGIN{{FS=\" \"}} /^>/ {{print substr($1,2)}}'".format(" ".join(files), "| zcat" if files[0].endswith(".gz") else "")
IndexError: list index out of range

The same command using --input-files succeeds.

 $ ganon build \
--db-prefix References/Bacillus_Ganon \
--input-files References/all_bacillus_chromosome/2019-12-03_17-47-21/files/*.fna.gz

Extracting accessions... 260 accessions retrieved. Done. Elapsed time: 4.39 seconds.
Retrieving sequence lengths and taxid from NCBI E-utils... Done. Elapsed time: 4.38 seconds.
Downloading taxdump... Done. Elapsed time: 35.49 seconds.
Estimating best bin length... 3635902bp. Done. Elapsed time: 4.09 seconds.
Running taxonomic clustering (TaxSBP)... 165 bins created. Done. Elapsed time: 3.65 seconds.
Approximate (upper-bound) # unique k-mers: 3635884
Bloom filter calculated size with fp<=0.05: 543.3170MB (23737884 bits/bin * 192 optimal bins [165 real bins])
Building index (ganon-build)...

ganon-build processed 260 sequences (509.441 Mbp) in 155.684 seconds (0.100203 Kseq/m, 196.337 Mbp/m)
 - 260 sequences and 165 bins defined on References/Bacillus_Ganon_tmp/acc_bin.txt
 - 260 sequences (0 invalid) were read from the input sequence files.
 - 260 valid sequences in 165 bins were written to References/Bacillus_Ganon.filter
Done. Elapsed time: 156.48 seconds.
Building database files... Done. Elapsed time: 5.83 seconds.
Total elapsed time: 216.89 seconds.

I guess I did something wrong, but could not identify it by myself.
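
The traceback ends at files[0] on an empty list, which suggests that no files matched the directory and extension combination. One possible cause, stated here as an assumption, is the leading * in --input-extension *.fna.gz (unquoted, the shell may also expand it before ganon sees it). The sketch below shows directory listing by suffix that tolerates a leading * and fails with a clear message when nothing matches. It is not ganon's implementation and the function name is hypothetical.

```python
import os


def list_input_files(input_directory, input_extension):
    """Return sorted paths in input_directory whose names end with input_extension.

    A leading '*' (as in '*.fna.gz') is dropped, so 'fna.gz', '.fna.gz' and
    '*.fna.gz' all select the same files.
    """
    suffix = input_extension.lstrip("*")
    if not suffix:
        raise ValueError("input extension must not be empty")
    files = sorted(
        os.path.join(input_directory, name)
        for name in os.listdir(input_directory)
        if name.endswith(suffix)
        and os.path.isfile(os.path.join(input_directory, name))
    )
    if not files:
        # Fail early with a readable message instead of an IndexError later.
        raise FileNotFoundError(
            f"no files ending with {suffix!r} found in {input_directory!r}"
        )
    return files
```

Passing the extension without the asterisk (for example fna.gz) may be worth trying with the original command as well.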

[Errno 7] Argument list too long v 0.4.0

Hi, I'm trying to use ganon but I got an error when I tried to build the dataset

ganon build --db-prefix ganon_db -t 32 --input-files RefSeqCG_arc_bac/v1/files/*genomic.fna.gz



ganon v. 0.4.0


Downloading taxdump

  • done in 5.72s.

Unpacking taxdump

  • done in 6.61s.

Parsing taxonomy

  • done in 4.71s.

Extracting sequence identifiers

  • 44723 unique sequence headers successfully retrieved from 21442 input file(s)
  • done in 2417.85s.

Retrieving sequence information from NCBI E-utils

  • 44723 sequences successfully retrieved
  • done in 639.22s.

Build: adding 44723 sequences

Calculating best bin length

  • Approx. min. size possible: 68961.01MB
  • bin length: 11707469bp (approx: 11269 bins / 103218.53MB)
  • done in 6.24s.

Building database files

  • ganon_db.map
  • ganon_db.tax
  • ganon_db.gnn
  • done in 1.00s.

Building index (ganon-build)

  • max unique 19-mers: 11707451
  • IBF calculated size with fp<=0.05: 110799.55MB (76435362 bits/bin * 12160 optimal bins [12144 real bins])
    The following command failed to run:
    Huge list of file
    ~[Errno 7] Argument list too long: '/lucmac/miniconda3/bin/ganon-build'
    Error code: 0
    Out:
    Error:
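
[Errno 7] (E2BIG) is raised by the operating system when the arguments handed to a new process are too large. A rough way to see how close a command is to the limit is to add up the bytes of all arguments and compare them with the system's ARG_MAX value. The sketch below does that; the function names are hypothetical, and the check is approximate because the kernel also counts pointers and, on Linux, separately caps the length of any single argument.

```python
import os


def argv_bytes(argv):
    """Bytes the arguments occupy when passed to exec: each one plus a NUL."""
    return sum(len(os.fsencode(arg)) + 1 for arg in argv)


def longest_argument(argv):
    """Length in bytes of the longest single argument (0 for an empty list)."""
    return max((len(os.fsencode(arg)) for arg in argv), default=0)


def fits_arg_max(argv, environ=None):
    """Rough check of argv plus environment against the ARG_MAX limit."""
    environ = os.environ if environ is None else environ
    # Each environment entry is stored as "KEY=VALUE" plus a NUL terminator.
    env_bytes = sum(
        len(os.fsencode(key)) + len(os.fsencode(value)) + 2
        for key, value in environ.items()
    )
    return argv_bytes(argv) + env_bytes < os.sysconf("SC_ARG_MAX")
```

With 21442 input files, both the total size and a single comma-separated file list could plausibly exceed these limits, so reducing the number or length of paths passed on the command line is one direction to try.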

ganon.db.filter file is huge

Hello,
I am trying to build a database for ganon containing only the human reference genome and the RVDB virus database. The command I use is just like in the tutorial:
ganon build --db-prefix ganon_viruses_db --input-files clean5.U-RVDB16_human.fasta
I end up with a huge ganon_viruses_db.filter file of 159G, while the FASTA file I used for building it (clean5.U-RVDB16_human.fasta) takes only 7.8G of space. Is this expected, or am I doing something wrong to end up with such a huge file?
20K Sep 25 10:18 ganon_viruses_db.bins
159G Sep 25 11:24 ganon_viruses_db.filter
4.0K Sep 25 10:18 ganon_viruses_db.map
1.8K Sep 25 11:25 ganon_viruses_db.nodes
Best regards,
Aga
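
Build logs in other reports on this page print the filter size as "bits/bin * optimal bins", for example "543.3170MB (23737884 bits/bin * 192 optimal bins ...)". Assuming that this product is the whole size calculation (it reproduces the logged figures, but it is inferred from the logs rather than from ganon's source), the arithmetic can be sketched as follows. The function name is hypothetical.

```python
def ibf_size_mib(bits_per_bin, n_bins):
    """Size of a filter with n_bins bins of bits_per_bin bits each, in MiB."""
    total_bits = bits_per_bin * n_bins
    return total_bits / 8 / 2**20  # bits -> bytes -> MiB


if __name__ == "__main__":
    # Values copied from the two build logs quoted on this page.
    print(round(ibf_size_mib(23737884, 192), 3))
    print(round(ibf_size_mib(76435362, 12160), 2))
```

Under that assumption the filter size depends on the number of bins and the bits per bin rather than directly on the size of the input FASTA, which may be why a 7.8G input can yield a much larger filter.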

parallelise parse_seqids

parse_seqids is called multiple times and is very slow.
With 57000 unique sequence headers it takes more than 30 minutes on our system.

This might be avoidable if

  1. the for loop over all input files parsed multiple files simultaneously
  2. parse_seqids were not called twice, once for sequence names and later for sequence lengths
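
Both suggestions can be sketched together: read each file once, collecting the sequence id and its length in the same pass, and process several files concurrently. This is an illustration only, not ganon's parse_seqids; the function names are hypothetical, and a thread pool is used here for simplicity (a process pool could be swapped in if parsing turns out to be CPU-bound).

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def read_ids_and_lengths(path):
    """Return {sequence id: sequence length} for one FASTA file (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    lengths = {}
    current = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # The id is the first token of the header line.
                tokens = line[1:].split()
                current = tokens[0] if tokens else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths


def read_many(paths, workers=4):
    """Parse several FASTA files concurrently and merge the per-file results."""
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for lengths in pool.map(read_ids_and_lengths, paths):
            merged.update(lengths)
    return merged
```

Because names and lengths come out of the same pass, a second traversal of the input files is no longer needed.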
