dutilh / cat

CAT/BAT/RAT: tools for taxonomic classification of contigs and metagenome-assembled genomes (MAGs) and for taxonomic profiling of metagenomes

License: MIT License

Python 100.00%

cat's Introduction

CAT, BAT, and RAT

Introduction
Dependencies and where to get them
Installation
Getting started
Taxonomic annotation of contigs or MAGs with CAT and BAT
Estimating the microbial composition with RAT
- Output files

Introduction

Contig Annotation Tool (CAT) and Bin Annotation Tool (BAT) are pipelines for the taxonomic classification of long DNA sequences and metagenome-assembled genomes (MAGs / bins) of both known and (highly) unknown microorganisms, as generated by contemporary metagenomics studies. The core algorithm of both programs involves gene calling, mapping of predicted ORFs against a protein database, and voting-based classification of the entire contig / MAG based on classification of the individual ORFs. CAT and BAT can be run from intermediate steps if files are formatted appropriately.

A paper describing the algorithm together with extensive benchmarks can be found at https://doi.org/10.1186/s13059-019-1817-x. If you use CAT or BAT in your research, it would be great if you could cite us:

von Meijenfeldt F.A.B., Arkhipova K., Cambuy D.D., Coutinho F.H., Dutilh B.E. Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT. Genome Biology. 2019;20:217.

Read Annotation Tool (RAT) estimates the taxonomic composition of metagenomes using CAT and BAT output. A manuscript describing RAT with benchmarks can be found at https://doi.org/10.1038/s41467-024-47155-1. If you use RAT in your research, it would be great if you could cite:

Hauptfeld, E., Pappas, N., van Iwaarden, S., Snoek B.L., Aldas-Vargas A., Dutilh B.E., von Meijenfeldt F.A.B. Integrating taxonomic signals from MAGs and contigs improves read annotation and taxonomic profiling of metagenomes. Nature Communications 15, 3373 (2024).
von Meijenfeldt F.A.B., Arkhipova K., Cambuy D.D., Coutinho F.H., Dutilh B.E. Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT. Genome Biology. 2019;20:217.

To cite the code itself:

Dependencies and where to get them

Python 3, https://www.python.org/.
DIAMOND, https://github.com/bbuchfink/diamond.
Prodigal, https://github.com/hyattpd/Prodigal.

RAT further requires (not needed for CAT and BAT):

BWA, https://github.com/lh3/bwa.
SAMtools, http://www.htslib.org/download/.

CAT, BAT, and RAT have been thoroughly tested on Linux systems, and should run on macOS as well.

Installation

No installation is required. You can run CAT, BAT, and RAT by supplying the path to the CAT_pack executable:

$ ./CAT_pack/CAT_pack --help

Alternatively, if you add the files in the CAT_pack directory to your $PATH variable, you can run CAT, BAT, and RAT from anywhere:

$ CAT_pack --version

Getting started

To get started with CAT/BAT/RAT, you will have to get the database files on your system. You can either download preconstructed database files, or generate them yourself.

Downloading preconstructed database files

To download the database files, find the most recent version on tbb.bio.uu.nl/tina/CAT_pack_prepare/, download and extract, and you are ready to go!

For NCBI nr:

$ wget tbb.bio.uu.nl/tina/CAT_pack_prepare/20240422_CAT_nr.tar.gz

$ tar -xvzf 20240422_CAT_nr.tar.gz

For GTDB:

$ wget tbb.bio.uu.nl/tina/CAT_pack_prepare/20231120_CAT_gtdb.tar.gz     # release 214

$ tar -xvzf 20231120_CAT_gtdb.tar.gz

Creating a fresh NCBI nr or GTDB database yourself

Instead of using the preconstructed database, you can construct a fresh database yourself. The download module can be used to download and process raw data, in preparation for building a new CAT pack database. This will ensure that all input dependencies are met and correctly formatted for CAT_pack prepare.

Currently, two databases are supported, NCBI's nr and the Genome Taxonomy Database (GTDB) proteins.

NCBI non-redundant protein database (nr)

$ CAT_pack download -db nr -o path/to/nr_data_dir

This downloads the fasta file with the protein sequences, their mapping to a taxid, and the taxonomy information from NCBI's ftp site.

Genome Taxonomy Database (GTDB) proteins

$ CAT_pack download -db gtdb -o path/to/gtdb_data_dir

The files required to build a CAT pack database are provided by the GTDB downloads page.

CAT_pack download fetches the necessary files and does some additional processing to get them ready for CAT_pack prepare:

The taxonomy information from GTDB is transformed into NCBI style nodes.dmp and names.dmp files.
Protein sequences are extracted from gtdb_proteins_aa_reps.tar.gz and are subjected to a round of deduplication. The deduplication reduces the redundancy in the DIAMOND database, thus simplifying the alignment process. Exact duplicate sequences are identified based on a combination of the MD5sum of the protein sequences and their length. Only one representative sequence is kept, with all duplicates encoded in the fasta header. This information is later used by CAT_pack prepare to assign the LCA of the protein sequence appropriately in the .fastaid2LCAtaxid file.
The mapping of all protein sequences to their respective taxonomy is created.
In addition, the newick formatted trees of Bacteria and Archaea are downloaded and - artificially - concatenated under a single root node, to produce an all.tree file. This file is not used by the CAT pack but may come in handy for downstream analyses.
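The deduplication step described above can be illustrated with a minimal Python sketch. It only shows the idea of keying exact duplicates on the pair (MD5 of the sequence, sequence length); the function name, the input shape, and the returned tuples are assumptions made for illustration and do not reflect how CAT_pack download encodes duplicates in the fasta header.

```python
import hashlib


def deduplicate(records):
    """Collapse exact duplicate proteins, keyed on (MD5 of sequence, length).

    `records` is an iterable of (identifier, sequence) pairs (hypothetical
    input shape). Returns (representative_id, duplicate_ids, sequence) tuples
    in first-seen order.
    """
    groups = {}
    for identifier, sequence in records:
        digest = hashlib.md5(sequence.encode("ascii")).hexdigest()
        key = (digest, len(sequence))
        if key not in groups:
            # First time we see this sequence: it becomes the representative.
            groups[key] = (identifier, [], sequence)
        else:
            # Exact duplicate: remember its identifier with the representative.
            groups[key][1].append(identifier)
    return list(groups.values())


if __name__ == "__main__":
    proteins = [("p1", "MKV"), ("p2", "MKV"), ("p3", "MKVA")]
    for representative, duplicates, sequence in deduplicate(proteins):
        print(representative, duplicates, sequence)
```

Keeping the duplicate identifiers next to the representative is what allows a later step to compute an LCA over every taxon that shares the sequence.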

When the download and processing of the files is finished successfully you can build a CAT pack database with CAT_pack prepare.

For all command line options available see

$ CAT_pack download -h

and

$ CAT_pack prepare -h

Creating a custom database

For a custom CAT pack database, you must have the following input ready before you launch a CAT_pack prepare run.

A fasta file containing all protein sequences you want to include in your database.
A names.dmp file that contains mappings of taxids to their ranks and scientific names. The format must be the same as the NCBI standard names.dmp (uses \t|\t as field separator).

An example looks like this:

1	|	root	|	scientific name	|
2	|	Bacteria	|	scientific name	|
562	|	Escherichia coli	|	scientific name	|

A nodes.dmp file that describes the child-parent relationship of the nodes in the taxonomy tree and their (official) rank. The format must be the same as the NCBI standard nodes.dmp (uses \t|\t as the field separator).

An example looks like this:

1	|	1	|	root	|
2	|	1	|	superkingdom	|
1224	|	2	|	phylum	|
1236	|	1224	|	class	|
91347	|	1236	|	order	|
543	|	91347	|	family	|
561	|	543	|	genus	|
562	|	561	|	species	|

For more information on the nodes.dmp and names.dmp files, see the NCBI taxdump_readme.txt.

A 2-column, tab-separated file containing the mapping of each sequence in the fasta file to a taxid in the taxonomy. This file must contain the header accession.version taxid.

An example looks like this:

accession.version	taxid
protein_1	562
protein_2	123456
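Before launching CAT_pack prepare on custom input, it can help to check that the three files agree with each other. The following Python sketch parses NCBI-style nodes.dmp lines and the two-column mapping file as described above, and lists accessions whose taxid does not occur in the taxonomy. All function names are hypothetical; this is not part of CAT_pack, and it assumes the files are small enough to hold in memory.

```python
import csv


def parse_dmp_line(line):
    """Split one NCBI-style .dmp line on the tab-pipe-tab field separator."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the trailing field terminator
    return line.split("\t|\t")


def load_parents(nodes_lines):
    """Map each taxid to its parent taxid (first two fields of nodes.dmp)."""
    parents = {}
    for line in nodes_lines:
        fields = parse_dmp_line(line)
        parents[fields[0]] = fields[1]
    return parents


def load_acc2taxid(mapping_lines):
    """Read the mapping file and insist on the required header line."""
    reader = csv.reader(mapping_lines, delimiter="\t")
    header = next(reader)
    if header != ["accession.version", "taxid"]:
        raise ValueError(f"unexpected header: {header}")
    return {row[0]: row[1] for row in reader if row}


def lineage(taxid, parents):
    """Follow child-to-parent links up to the root; return it root-first."""
    path = [taxid]
    while parents[path[-1]] != path[-1]:  # the root is its own parent
        path.append(parents[path[-1]])
    return path[::-1]


def unknown_taxids(acc2taxid, parents):
    """Return the accessions whose taxid is missing from nodes.dmp."""
    return {acc: t for acc, t in acc2taxid.items() if t not in parents}
```

With open file handles this could be used as `unknown_taxids(load_acc2taxid(mapping_handle), load_parents(nodes_handle))`; an empty result means every sequence maps to a node in the tree.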

Once all of the above requirements are met you can run CAT_pack prepare. All the input needs to be explicitly specified for CAT_pack prepare to work, for example:

$ CAT_pack prepare \
--db_fasta path/to/fasta \
--names path/to/names.dmp \
--nodes path/to/nodes.dmp \
--acc2tax path/to/acc2taxid.txt.gz \
--db_dir path/to/output_dir

This creates an output_dir that looks like this:

output_dir
├── 2023-11-05_CAT_pack.log
├── db
│   ├── 2023-11-05_CAT_pack.dmnd
│   ├── 2023-11-05_CAT_pack.fastaid2LCAtaxid
│   └── 2023-11-05_CAT_pack.taxids_with_multiple_offspring
└── tax
    ├── names.dmp
    └── nodes.dmp

Notes:

Two subdirectories, db and tax, are created; together they contain all necessary files.
The nodes.dmp and names.dmp in the tax directory are copied from their original location. This is to ensure that the -t flag of CAT, BAT, and RAT works.
The default prefix is <YYYY-MM-DD>_CAT_pack. You can customize it with the --common_prefix option.

For all command line options available see

$ CAT_pack prepare -h

Running CAT/BAT/RAT.

The database files are needed in subsequent CAT/BAT/RAT runs. They only need to be generated/downloaded once or whenever you want to update the database.

To run CAT/BAT/RAT, respectively:

$ CAT_pack contigs     # Runs CAT.

$ CAT_pack bins        # Runs BAT.

$ CAT_pack reads       # Runs RAT.

Getting help.

If you are unsure what options a program has, you can always add --help to a command. This is a great way to get you started with CAT, BAT, or RAT.

$ CAT_pack --help

$ CAT_pack contigs --help

$ CAT_pack summarise --help

If you are unsure about what input files are required, you can just run CAT/BAT/RAT, as the appropriate error messages are generated if formatting is incorrect.

Taxonomic annotation of contigs or MAGs with CAT and BAT

After you have got the database files on your system, you can run CAT to annotate your contig set:

$ CAT_pack contigs -c {contigs fasta} -d {database folder} -t {taxonomy folder}

Multiple output files and a log file will be generated. The final classification files will be called out.CAT.ORF2LCA.txt and out.CAT.contig2classification.txt.

Alternatively, if you already have a predicted proteins fasta file and/or an alignment table for example from previous runs, you can supply them to CAT, which will then skip the steps that have already been done and start from there:

$ CAT_pack contigs -c {contigs fasta} -d {database folder} -t {taxonomy folder} -p {predicted proteins fasta} -a {alignment file}

The headers in the predicted proteins fasta file must look like this >{contig}_{ORFnumber}, so that CAT can couple contigs to ORFs. The alignment file must be tab-separated, with queried ORF in the first column, protein accession number in the second, and bit-score in the 12th.
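To check that your own files follow these two conventions before supplying them with -p and -a, a small Python sketch along the following lines may help. The function names are hypothetical and the checks are deliberately minimal (they assume the ORF number is a plain integer suffix); CAT performs its own validation.

```python
def contig_of(orf_id):
    """Return the contig part of an ORF id shaped like {contig}_{ORFnumber}."""
    contig, _, orf_number = orf_id.rpartition("_")
    if not contig or not orf_number.isdigit():
        raise ValueError(f"unexpected ORF id: {orf_id!r}")
    return contig


def parse_alignment_line(line):
    """Extract (ORF, protein accession, bit-score) from one alignment line.

    The line must be tab-separated with the bit-score in the 12th column.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-separated columns")
    return fields[0], fields[1], float(fields[11])


if __name__ == "__main__":
    example = "contig_1_5\tACA64215.1\t45.3\t75\t40\t1\t10\t83\t283\t357\t8.4e-05\t55.5"
    orf, accession, bit_score = parse_alignment_line(example)
    print(contig_of(orf), accession, bit_score)
```

Splitting on the last underscore matters because contig names themselves often contain underscores.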

To run BAT on a set of MAGs:

$ CAT_pack bins -b {bin folder} -d {database folder} -t {taxonomy folder}

Alternatively, BAT can be run on a single MAG:

$ CAT_pack bins -b {bin fasta} -d {database folder} -t {taxonomy folder}

Multiple output files and a log file will be generated. The final classification files will be called out.BAT.ORF2LCA.txt and out.BAT.bin2classification.txt.

Similarly to CAT, BAT can be run from intermediate steps if gene prediction and alignment have already been carried out once:

$ CAT_pack bins -b {bin folder} -d {database folder} -t {taxonomy folder} -p {predicted proteins fasta} -a {alignment file}

If you have previously run CAT on the set of contigs from which the MAGs originate, you can use the previously predicted protein and alignment files to classify the MAGs.

$ CAT_pack contigs -c {contigs fasta} -d {database folder} -t {taxonomy folder}

$ CAT_pack bins -b {bin folder} -d {database folder} -t {taxonomy folder} -p {predicted proteins fasta from contig run} -a {alignment file from contig run}

This is a great way to run both CAT and BAT on a set of MAGs without needing to do protein prediction and alignment twice!

Interpreting the output files

The ORF2LCA output looks like this:

ORF	number of hits (r: 10)	lineage	bit-score
contig_1_ORF1	7	1;131567;2;1783272	574.7

Where the lineage is the full taxonomic lineage of the classification of the ORF, and the bit-score the top-hit bit-score that is assigned to the ORF for voting. The BAT ORF2LCA output file has an extra column where ORFs are linked to the MAG in which they are found.

The contig2classification and bin2classification output looks like this:

contig or bin	classification	reason	lineage	lineage scores (f: 0.3)
contig_1	taxid assigned	based on 14/15 ORFs	1;131567;2;1783272	1.00;1.00;1.00;0.78
contig_2	taxid assigned (1/2)	based on 10/10 ORFs	1;131567;2;1783272;1798711;1117;307596;307595;1890422;33071;1416614;1183438*	1.00;1.00;1.00;1.00;1.00;1.00;1.00;1.00;1.00;1.00;0.23;0.23
contig_2	taxid assigned (2/2)	based on 10/10 ORFs	1;131567;2;1783272;1798711;1117;307596;307595;1890422;33071;33072	1.00;1.00;1.00;1.00;1.00;1.00;1.00;1.00;1.00;1.00;0.77
contig_3	no taxid assigned	no ORFs found

Where the lineage scores represent the fraction of bit-score support for each classification. contig_2 has two classifications. This can happen if the f parameter is chosen below 0.5. For an explanation of the starred classification, see Marking suggestive taxonomic assignments with an asterisk.
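The relation between lineage scores and the f parameter can be illustrated with a simplified Python sketch of bit-score voting. It assumes that each ORF contributes its bit-score to every taxon in its lineage and that a taxon is kept when its share of the summed bit-scores exceeds f. The function name and input shape are hypothetical, and the actual CAT and BAT implementation may differ in details such as tie handling.

```python
from collections import defaultdict


def vote(orf_lineages, f=0.3):
    """Return {taxid: fraction of bit-score support} for supported taxa.

    `orf_lineages` is a list of (lineage, bit_score) pairs, where lineage is a
    root-first list of taxids for one ORF (hypothetical input shape).
    """
    total = sum(score for _, score in orf_lineages)
    if total == 0:
        return {}
    support = defaultdict(float)
    for lineage, score in orf_lineages:
        for taxid in lineage:
            # An ORF supports every taxon on the path from root to its LCA.
            support[taxid] += score
    return {
        taxid: summed / total
        for taxid, summed in support.items()
        if summed / total > f
    }


if __name__ == "__main__":
    orfs = [(["1", "2", "1224"], 60.0), (["1", "2", "1239"], 40.0)]
    print(vote(orfs, f=0.5))  # only one branch can pass the 0.5 mark
    print(vote(orfs, f=0.3))  # both branches pass, giving two classifications
```

Because the fractions of diverging branches sum to at most 1, two branches can only both pass the threshold when f is below 0.5, which is why contig_2 above carries two classifications.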

To add names to the taxids in either output file, run:

$ CAT_pack add_names -i {ORF2LCA / classification file} -o {output file} -t {taxonomy folder}

This will show you that for example contig_1 is classified as Terrabacteria group. To only get official rank (i.e. superkingdom, phylum, ...):

$ CAT_pack add_names -i {ORF2LCA / classification file} -o {output file} -t {taxonomy folder} --only_official

Or, alternatively:

$ CAT_pack add_names -i {ORF2LCA / classification file} -o {output file} -t {taxonomy folder} --only_official --exclude_scores

If you have named a CAT or BAT classification file with official names, you can get a summary of the classification, where total length and number of ORFs supporting a taxon are calculated for contigs, and the number of MAGs per encountered taxon for MAG classification:

$ CAT_pack summarise -c {contigs fasta} -i {named CAT classification file} -o {output file}

$ CAT_pack summarise -i {named BAT classification file} -o {output file}

CAT_pack summarise currently does not support classification files wherein some contigs / MAGs have multiple classifications (as contig_2 above).

Marking suggestive taxonomic assignments with an asterisk

When we want to confidently go down to the lowest taxonomic level possible for a classification, an important assumption is that on that level conflict between classifications could have arisen. Namely, if there were conflicting classifications, the algorithm would have made the classification more conservative by moving up a level. Since it did not, we can trust the low-level classification. However, it is not always possible for conflict to arise, because in some cases no other sequences from the clade are present in the database. This is true for example for the family Dehalococcoidaceae, which in our databases is the sole representative of the order Dehalococcoidales. Thus, here we cannot confidently state that a classification on the family level is more correct than a classification on the order level. For these cases, CAT and BAT mark the lineage with asterisks, starting from the lowest level classification up to the level where conflict could have arisen because the clade contains multiple taxa with database entries. The user is advised to examine starred taxa more carefully, for example by analysing sequence identity between predicted ORFs and hits, or to move up the lineage to a confident classification (i.e. the first classification without an asterisk).

If you do not want the asterisks in your output files, you can add the --no_stars flag to CAT or BAT.
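Moving up the lineage to the first classification without an asterisk is easy to automate. The Python sketch below assumes the lineage is the semicolon-separated field from a classification file and that stars, when present, form a run at the low-level end of the lineage as described above. The function name is hypothetical.

```python
def confident_lineage(lineage_field):
    """Trim starred taxids from the low-level end of a lineage string."""
    taxids = lineage_field.split(";")
    while taxids and taxids[-1].endswith("*"):
        taxids.pop()  # drop suggestive assignments until one is unstarred
    return taxids


if __name__ == "__main__":
    print(confident_lineage("1;10239;35237;549779;985780;1977630;1977635;1977631*"))
```

The last element of the returned list is the deepest taxon at which a conflict could have arisen in the database.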

Optimising running time, RAM, and disk usage

CAT and BAT may take a while to run, and may use quite a lot of RAM and disk space. Depending on what you value most, you can tune CAT and BAT to maximize one and minimize others. The classification algorithm itself is fast and is friendly on memory and disk space. The most expensive step is alignment with DIAMOND, hence tuning alignment parameters will have the highest impact:

The -n / --nproc argument allows you to choose the number of cores to deploy.
You can choose to run DIAMOND in sensitive mode with the --sensitive flag. This will increase sensitivity but will make alignment considerably slower.
Setting the --block_size parameter lower will decrease memory and temporary disk space usage. Setting it higher will increase performance.
For high memory machines, it is advised to set --index_chunks to 1 (currently the default). This parameter has no effect on temporary disk space usage.
You can specify the location of temporary DIAMOND files with the --tmpdir argument.

Examples

Getting help for running the prepare utility:

$ CAT_pack prepare --help

Run CAT on a contig set with default parameter settings deploying 16 cores for DIAMOND alignment. Name the contig classification output with official names, and create a summary:


$ CAT_pack contigs -c contigs.fasta -d db/ -t tax/ -n 16 --out_prefix first_CAT_run

$ CAT_pack add_names -i first_CAT_run.contig2classification.txt -o first_CAT_run.contig2classification.official_names.txt -t tax/ --only_official

$ CAT_pack summarise -c contigs.fasta -i first_CAT_run.contig2classification.official_names.txt -o CAT_first_run.summary.txt

Run BAT on the set of MAGs that was binned from these contigs, reusing the protein predictions and DIAMOND alignment file generated previously during the contig classification:

$ CAT_pack bins -b bins/ -d db/ -t tax/ -p first_CAT_run.predicted_proteins.faa -a first_CAT_run.alignment.diamond -o first_BAT_run

Run the contig classification algorithm again with custom parameter settings, and name the output with all names in the lineage, excluding the scores:

$ CAT_pack contigs --range 5 --fraction 0.1 -c contigs.fasta -d db/ -t tax/ -p first_CAT_run.predicted_proteins.faa -a first_CAT_run.alignment.diamond -o second_CAT_run

$ CAT_pack add_names -i second_CAT_run.contig2classification.txt -o second_CAT_run.contig2classification.names.txt -t tax/ --exclude_scores

Run BAT on the set of MAGs with custom parameter settings, suppressing verbosity and not writing a log file. Next, add names to the ORF2LCA output file:

$ CAT_pack bins -r 3 -f 0.1 -b bins/ -s .fa -d db/ -t tax/ -p first_CAT_run.predicted_proteins.faa -a first_CAT_run.alignment.diamond -o second_BAT_run --quiet --no_log

$ CAT_pack add_names -i second_BAT_run.ORF2LCA.txt -o second_BAT_run.ORF2LCA.names.txt -t tax/

Identifying contamination/mis-binned contigs within a MAG

We often use the combination of CAT / BAT to explore possible contamination within a MAG.

$ CAT_pack contigs -c ../bins/interesting_MAG.fasta -d db/ -t tax/ -o CAT.interesting_MAG

$ CAT_pack bins -b ../bins/interesting_MAG.fasta -d db/ -t tax/ -p CAT.interesting_MAG.predicted_proteins.faa -a CAT.interesting_MAG.alignment.diamond -o BAT.interesting_MAG

Contigs that have a different taxonomic signal than the MAG classification are probably contamination.
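One way to list such contigs is to compare each contig lineage from the CAT run with the MAG lineage from the BAT run. The Python sketch below treats a contig as diverging only if its lineage disagrees with the MAG lineage at a shared level, so a contig that is simply classified at a higher rank is not flagged. The function names and the dictionary input are assumptions made for illustration.

```python
def taxids(lineage_field):
    """Split a ';'-separated lineage and drop any asterisk marks."""
    return [taxid.rstrip("*") for taxid in lineage_field.split(";")]


def diverges(contig_lineage, mag_lineage):
    """True if the two root-first lineages disagree at any shared level."""
    pairs = zip(taxids(contig_lineage), taxids(mag_lineage))
    return any(contig_taxid != mag_taxid for contig_taxid, mag_taxid in pairs)


def flag_contigs(contig_lineages, mag_lineage):
    """Return the names of contigs whose lineage conflicts with the MAG's."""
    return sorted(
        name
        for name, lineage in contig_lineages.items()
        if diverges(lineage, mag_lineage)
    )


if __name__ == "__main__":
    mag = "1;131567;2;1224;1236"
    contigs = {
        "c1": "1;131567;2;1224",        # consistent, just less specific
        "c2": "1;131567;2;1783272",     # different branch below Bacteria
        "c3": "1;131567;2;1224;1236*",  # consistent, starred
    }
    print(flag_contigs(contigs, mag))
```

Flagged contigs are candidates for closer inspection rather than proof of contamination, since short contigs with few ORFs can be classified less reliably.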

Alternatively, you can look at contamination from the MAG perspective, by setting the f parameter to a low value:

$ CAT_pack bins -f 0.01 -b ../bins/interesting_MAG.fasta -d db/ -t tax/ -o BAT.interesting_MAG

$ CAT_pack add_names -i BAT.interesting_MAG.bin2classification.txt -o BAT.interesting_MAG.bin2classification.names.txt -t tax/

BAT will output any taxonomic signal with at least 1% support. Low scoring diverging signals are clear signs of contamination!

Estimating the microbial composition with RAT

RAT estimates the taxonomic composition of metagenomes by integrating taxonomic signals from MAGs, contigs, and reads. RAT has been added to the CAT pack from version 6.0. To use RAT, you need the CAT pack database files (see Getting started for more information).

RAT makes an integrated profile using MAGs/bins, contigs, and reads. To specify which elements should be integrated, use the --mode argument. Possible letters for --mode are m (for MAGs), c (for contigs), and r (for reads). All combinations of the three letters are possible, except r alone. To run RAT's complete workflow, specify the mode, read files, contig files, bin folder, and database files:

$ CAT_pack reads --mode mcr -b bin_folder/ -c contigs.fasta -1 forward_reads.fq.gz -2 reverse_reads.fq.gz -d db/ -t tax/
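If you generate RAT commands from a script, the rule for --mode can be checked up front. The Python sketch below encodes only what is stated above: the letters m, c, and r, each used at most once, in any combination except r alone. Accepting the letters in any order is an assumption, and the function is not the check that CAT_pack itself performs.

```python
def valid_mode(mode):
    """Check a RAT --mode string against the rule described in the text."""
    letters = set(mode)
    return (
        bool(mode)                      # at least one letter
        and len(letters) == len(mode)   # no repeated letters
        and letters <= set("mcr")       # only m, c, and r
        and letters != {"r"}            # r alone is not allowed
    )


if __name__ == "__main__":
    for candidate in ("mcr", "cr", "m", "r", "mx"):
        print(candidate, valid_mode(candidate))
```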

Currently, RAT supports single read files as well as paired-end read files. Interlaced read files are currently not supported. RAT will run CAT and BAT on the contigs and MAGs, will map the reads back to the contigs, and then try to annotate any unmapped reads separately. If you already have a sorted mapping file, you can supply it and RAT will skip the mapping step:


$ CAT_pack reads --mode mcr -b bin_folder/ -c contigs.fasta --bam1 mapping_file_sorted.bam -1 forward_reads.fq.gz -2 reverse_reads.fq.gz -d db/ -t tax/

If CAT and/or BAT have already been run on your data, you can supply the output files to RAT to skip the CAT and BAT runs:


$ CAT_pack reads --mode mcr -b bin_folder/ -c contigs.fasta -1 forward_reads.fq.gz -2 reverse_reads.fq.gz -d db/ -t tax/ --c2c CAT_contig2classification_file.txt --b2c BAT_bin2classification_file.txt

Similarly, if a previous RAT run crashed after the unmapped reads had already been aligned to the database with DIAMOND, you can supply the intermediate files to continue the run:


$ CAT_pack reads --mode mcr -b bin_folder/ -c contigs.fasta -1 forward_reads.fq.gz -2 reverse_reads.fq.gz -d db/ -t tax/ --c2c CAT_contig2classification_file.txt --b2c BAT_bin2classification_file.txt --alignment_unmapped unmapped_alignment_file.diamond

After a RAT run is finished, you can run add_names on the abundance files (only for RAT runs with nr database):


$ CAT_pack add_names -i RAT.complete_abundance_file.txt -o RAT.complete_abundance_file_with_names.txt -t tax/

Similar to CAT and BAT, the paths to all dependencies can be supplied via an argument:


$ CAT_pack reads --mode mcr -b bin_folder/ -c contigs.fasta -1 forward_reads.fq.gz -2 reverse_reads.fq.gz -d db/ -t tax/ --path_to_samtools /path/to/samtools

Output files

The RAT output consists of:

A log file.
All CAT output files for the contig fasta.
All BAT output files for the MAGs (except DIAMOND alignment and protein fasta).
A table that contains the abundance of each MAG.
A table that contains all detected taxa and their abundance in the sample.
A table that contains the lineage for each read, as well as which step the annotation was made in (optional without r in --mode).
A table that contains the abundance of each contig in the contig fasta.
A fasta containing the sequences of all unmapped reads and contigs that could not be annotated by CAT.
The DIAMOND alignment of unmapped reads and unannotated contigs.
A table that contains the annotations for unmapped reads and (previously) unannotated contigs.

cat's Issues

CAT and MacOS

Thank you for trying to make CAT available for MacOS users.

I've run into one slight problem though, it is that, on MacOS, the most used filesystems (HFS+, and later APFS) are case insensitive.

A consequence of this is that terminal commands can be called both lower and uppercase.
As an example:

(metagenomics) hadrien:~/Documents/workspace/cat_test$ ls
a.txt  b.txt  hello.fasta
(metagenomics) hadrien:~/Documents/workspace/cat_test$ LS
a.txt  b.txt  hello.fasta

See where this is going? That's right, CAT replaces the cat command usually used to concatenate or display the content of a file!

(metagenomics) hadrien:~/Documents/workspace/cat_test$ cat hello.fasta 
usage: CAT (prepare | contigs | bin | bins | add_names | summarise) [-v / --version] [-h / --help]
CAT: error: one of the arguments prepare contigs bin bins add_names summarise is required
(metagenomics) hadrien:~/Documents/workspace/cat_test$ CAT hello.fasta 
usage: CAT (prepare | contigs | bin | bins | add_names | summarise) [-v / --version] [-h / --help]
CAT: error: one of the arguments prepare contigs bin bins add_names summarise is required
(metagenomics) hadrien:~/Documents/workspace/cat_test$ /bin/cat hello.fasta 
>my_sequence
ATGC

which is highly irritating to say the least.

May I suggest to either:

Write a big disclaimer about this in the documentation, and beg people either not to put CAT in their PATH on MacOS, or to have a conda environment just for your tool.
Change the name of your software.

Thank you in advance.
Hadrien.

Incompatible Diamond Version

Hello,

I'm getting a problem similar to #4 using the downloaded CAT databases (2018-12-12).

Error: Database was built with a different version of Diamond and is incompatible.
[2019-03-11 23:37:55.541631] ERROR: Diamond finished abnormally.

I'm running Diamond v0.9.21.122, and installed everything from bioconda, which should have compatible Diamond versions. Which version of diamond was the 2018-12-12 CAT database built with?

$ CAT -v
CAT v4.3.3 (December 14, 2018) by F. A. Bastiaan von Meijenfeldt.

Jeff

Check path to diamond

I have downloaded the taxonomy files (names.dmp, nodes.dmp and prot.accession2taxid), which are all placed in the same folder, and created nr.dmnd from nr.faa. I have specified the paths (once in the script) to diamond, prodigal, and the taxonomy files, but I keep getting the error "Please check path to diamond, presence information about database location or fasta file with coded proteins". I cannot figure out what is wrong. Thank you in advance.

which files are finally needed?

Hi,

After installation, I have several files under the "documentation" directory, and it seems that I only need the following four files:
gi_taxid_prot.dmp names.dmp nodes.dmp nr.dmnd
For other files:
nr.faa gc.prt merged.dmp gencode.dmp division.dmp delnodes.dmp citations.dmp
Can I just delete them?

Best,
Quan

IndexError: list index out of range

Hi,

I run CAT and have the following error:

"Warning: Sequence is long (max 32000000 for training).
Training on the first 32000000 bases.

diamond file is /path_to/target.fa_contigs_prodigal.m8
gff file is /path_to/target.fa_contigs_prodigal.gff
faa file is /path_to/target.fa_contigs_prodigal.faa
bitscore 1 is 0.9
bitscore 2 is 0.5
Traceback (most recent call last):
File "/tools/analysis/metagenomics/CAT-master/CAT/bin/contig_id.py", line 109, in
protein=line[1].split("|")[1]
IndexError: list index out of range
running prodigal
running diamond search"

the first message is showing that sequence is long. But in my fasta file, "sum = 332021388, n = 37221, ave = 8920.27, largest = 27143". Is this warning due to that the total file size is larger than 32000000 bases?
For the "IndexError: list index out of range" error, I have checked the diamond file and here are several lines:
"1_5 ACA64215.1 45.3 75 40 1 10 83 283 357 8.4e-05 55.5
1_5 WP_059303418.1 37.8 74 46 0 13 86 319 392 1.1e-04 55.1
1_5 SCF81904.1 40.5 84 48 1 4 85 543 626 1.1e-04 55.1"
For the corresponding codes:
"all_pro=open(diamond,'r')
for line in all_pro:
    line=line.strip()
    line=line.split("\t")
    protein=line[1].split("|")[1]
    d_allpro[protein]=True "
The cause of this error should be that there is no "|" in protein ids. How can I solve this?

Best,
Quan

bioconda installable

Hey @bastiaanvonmeijenfeldt

I would like to make a bioconda installable of CAT.
Could you give CAT a license?

Do you have any test to run?

If I'm not wrong you can make the CAT executable. This information is somehow stored even on github.

It would be great if you could update the version and sha256 when you have new releases.

parameter "r" is ratio or absolute number

Hi,

If I set the DIAMOND parameter top=10, and r=10, how many hits will be included to do the LCA of an ORF? 10 or 1?

Taxonomy output

Is there a way to represent the output generated by the CAT add_names function as a taxonomy plot (like Krona)?

Thank you so much

Appended '*' (asterisk) to taxonomy

I have only found one example and it seems random. In classifications of metagenomics contigs, I get this output:

$ grep "NODE_3871_" CAT.ORF2LCA.txt
NODE_3871_length_1244_cov_2.250214_1 1;10239;35237;549779;985780;1977630;1977635;1977631*    59.3

(Note the asterisk appended to taxid 1977631!!)

Then the asterisk is taken along to the other output files:

$ grep "NODE_3871_" contig2classification.txt
NODE_3871_length_1244_cov_2.250214      classified      1       1       1;10239;35237;549779;985780;1977630;1977635;1977631*  1.00;1.00;1.00;1.00;1.00;1.00;1.00;1.00

And apparently it ends up appended to the species name as well:

$ egrep "Catovirus CTV1\*" add_names.txt
NODE_3871_length_1244_cov_2.250214    classified      1       1       1;10239;35237;549779;985780;1977630;1977635;1977631*  1.00;1.00;1.00;1.00;1.00;1.00;1.00;1.00 Viruses: 1.00   NA      NA   NA       Mimiviridae: 1.00       Catovirus: 1.00 Catovirus CTV1*: 1.00

So it turns "Catovirus CTV1" into "Catovirus CTV1*".

The sequence I get this with, is:

>NODE_3871_length_1244_cov_2.250214_1 # 457 # 1242 # -1 # ID=3871_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.288
DMSHAFRDAILNNLISHLNKLVNRTNDLNIFDLCGGRGGDIFKLSSLVNIKNLMVEDADK

Sidenote: Pasting this sequence in blastp returns different species.

Other contigs in the same dataset don't have this asterisk.

CAT version: 4.3.3
DIAMOND version: 0.9.14
BLAST nr database: tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20181212.tar.gz

Use of CAT in nextflow nf-core pipeline

Hi,
we would like to use, or more precisely continue using, CAT in the nf-core mag pipeline. However, there are a few issues we are currently facing.

If I see it correctly, one still needs to use the diamond version which was used to create the database, as already discussed in #45. Is there any plan that this might change in future releases?
This is a problem for nextflow pipelines, since they use one specific container, i.e. with one specific diamond version. Thus only certain db versions could be used with the pipeline, and one would need to update the pipeline immediately after each new db release, if necessary. (And then it would not work with older db versions anymore.)
To know if a database version would be compatible to the current pipeline version, one would need to download 180 GB first.
Accessibility of older database versions: are older versions somewhere accessible? I know the databases are quite large, but for the sake of reproducibility, it would be good if also older database versions would remain accessible. Furthermore, if not, older pipeline versions would also not work anymore.

I just thought I would double-check whether there is anything planned for the future or whether we missed something here.

Thanks in advance!
Best, Sabrina

Unclassified vs "not classified"

Hi,
I am trying to use CAT to annotate contigs derived from Illumina sequencing and assembled by Megahit. My understanding is that "Not classified" means that something prevented the classification of the contig, like "no hits to database", "hits not found in taxonomy files" or "no ORFs found". I have a number of "Classified" in the out.CAT.alignment.diamond file that have alignment scores of > 0.5 at one or two levels to a NCBI protein ID but - although that protein ID has taxonomy, they are labeled "not classified" at all levels.

Why are these in 2 separate bins, instead of the 2nd instance being a subset of the first "unable to be confidently classified"? Am I misunderstanding these results?

Thank you for your help,
Kathie Mihindukulasuriya

No error in object file, empty output files

Hi there,

I have used CAT in the past, and it has worked perfectly. Unfortunately, this time, my output files are empty, despite numerous tries and re-running the preparation steps (i.e. making the blast database, running the blastp). The object file that appears only says:

Parsing diamond alignment
Parsing prot.accession2taxid file
Computation
Let's summarise...
-----------------------

My input blastp database is even formatted with the better headings for CAT (no spaces in labels, tab delimited). For example:

3C_NODE_166_length_19564_cov_6.887693_8 WP_008679397.1:MULTISPECIES:_bifunctional_(p)ppGpp_synthetase/guanosine-3',5'-bis(diphosphate)_3'-pyrophosphohydrolase 31.3 195 120 5 39 229 23 207 5.7e-14 82.8

What could be preventing CAT from performing?

Thank you so much for your help!
Alaina 😃

No database folder when decompressing CAT_prepare_20210107.tar.gz

Hi,

When decompressing CAT_prepare_20210107.tar.gz the log files say that a database folder is produced but it is not. Log is below. Taxonomy folder is produced. What to do?

Thanks,

Viola

CAT v5.2.1.

CAT prepare is running, constructing a fresh database.
Rawr!

WARNING: preparing the database files may take a couple of hours.

Supplied command: ../../CAT_pack_latest_code/CAT_pack/CAT prepare --fresh --path_to_diamond ./Diamond_2.0.6/diamond

Taxonomy folder: ./2021-01-07_taxonomy/
Database folder: ./2021-01-07_CAT_database/
Log file: ./2021-01-07.CAT_prepare.fresh.log

[2021-01-07 21:01:49] DIAMOND found: diamond version 2.0.6.
[2021-01-07 21:01:49] Taxonomy folder ./2021-01-07_taxonomy/ is created.
[2021-01-07 21:01:49] Database folder ./2021-01-07_CAT_database/ is created.
[2021-01-07 21:01:49] Downloading and extracting taxonomy files from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/ to t$
[2021-01-07 21:01:54] Download complete.
[2021-01-07 21:01:54] Checking file integrity via MD5 checksum.
[2021-01-07 21:01:54] MD5 of ./2021-01-07_taxonomy/2021-01-07.taxdump.tar.gz checks out.
[2021-01-07 21:01:56] Extracting complete.
[2021-01-07 21:01:56] Downloading mapping file from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/ to ta$
[2021-01-07 21:07:44] Download complete.
[2021-01-07 21:07:44] Checking file integrity via MD5 checksum.
[2021-01-07 21:07:44] WARNING: no MD5 found in ./2021-01-07_taxonomy/2021-01-07.prot.accession2taxid.FULL.gz.md5. $
[2021-01-07 21:07:44] Downloading nr database from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ to database folder.
[2021-01-07 22:13:57] Download complete.
[2021-01-07 22:13:57] Checking file integrity via MD5 checksum.
[2021-01-07 22:20:40] MD5 of ./2021-01-07_CAT_database/2021-01-07.nr.gz checks out.
[2021-01-07 22:20:40] Constructing DIAMOND database ./2021-01-07_CAT_database/2021-01-07.nr.dmnd from ./2021-01-07$
[2021-01-07 22:51:13] DIAMOND database constructed.
[2021-01-07 22:51:13] Loading file ./2021-01-07_taxonomy/nodes.dmp.
[2021-01-07 22:51:15] Loading file ./2021-01-07_CAT_database/2021-01-07.nr.gz.
[2021-01-08 00:18:00] Loading file ./2021-01-07_taxonomy/2021-01-07.prot.accession2taxid.FULL.gz.
[2021-01-08 02:05:36] Finding LCA of all protein accession numbers in fasta headers.
[2021-01-08 03:16:29] Done! File ./2021-01-07_CAT_database/2021-01-07.nr.fastaid2LCAtaxid is created. 12,997,882 o$
[2021-01-08 04:53:37] Searching nr database for taxids with multiple offspring.
[2021-01-08 05:25:25] Writing ./2021-01-07_CAT_database/2021-01-07.nr.taxids_with_multiple_offspring.

[2021-01-08 05:25:25] CAT prepare is done!
You may remove ./2021-01-07_CAT_database/2021-01-07.nr.gz now.

Supply the following arguments to CAT or BAT if you want to use this database:
-d / --database_folder ./2021-01-07_CAT_database/
-t / --taxonomy_folder ./2021-01-07_taxonomy/

Can I adjust memory usage and temporary disk size of Diamond search step?

Hi,

I'm trying to use CAT for analyses of my metagenomic contigs (~110,000 contigs, total length of ~400 Mbp).

I've run CAT for my data using the database file downloaded from http://tbb.bio.uu.nl/bastiaan/CAT_prepare/, on a Linux server with 256 GB memory and ~280 GB of empty HDD space.

Prodigal worked nicely and predicted ~450,000 proteins.

But, Diamond finished abnormally with the following error messages, after running ~10 hours using ~10 GB of memory.

"No space left on device
Error: Error writing file diamond-tmp-NNi5RC"

As far as I know, we can adjust memory usage and temporary disk space (and location) when using Diamond. And, to my knowledge, Diamond takes less time when using more memory.

Are there any methods to adjust memory usage and temporary disk size of Diamond search step of CAT?

Best,
Ilnam
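Judging from logs in other reports on this page, CAT prints a tmpdir, a block-size and an index-chunks value for the DIAMOND step, and one reported command uses --block_size and --index_chunks. Independent of those settings, a quick way to see whether the temporary directory has enough room before a long run is sketched below. The function names and the threshold are hypothetical, and how much space DIAMOND actually needs for a given dataset is not something this check knows.

```python
import shutil


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def has_room(path, needed_gib):
    """True if at least `needed_gib` GiB are free at `path`.

    `needed_gib` is a user-chosen guess, not a value derived from DIAMOND.
    """
    return free_gib(path) >= needed_gib
```

For example, `has_room("/scratch/tmp", 500)` could be called before starting the run, with the path and number adapted to the local setup.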

protein input

@bastiaanvonmeijenfeldt
The option to input the genes directly to cat is nice. But what exactly do you mean by concatenated genes?

IndexError: list index out of range

Hi,

CAT is very nice for classification of MAGs. I have tried CAT bins on several datasets containing <20 bins, and all runs worked! However, the program was interrupted when I ran a larger dataset containing ~400 bins. The log file of the run is attached below.
run_CAT_bins_201911061143.nohup.txt

Any help would be much appreciated!
Zhuofei

LCA biased by huge number of diamond hits

Hi Bas,
first of all: thank you very much for this nice piece of software.
I ran into some minor problems with my data running CAT.

When I run CAT on the attached TSC* files, diamond gives me > 30.000 valid hits for my one sequence.
A quick blast search shows that the first 100 hits are against Archaea, so it's pretty safe to say that the sequence is of archaeal origin.
CAT gives me "root" as LCA though.

I assume this is due to the large number of valid hits (and that only one among them needs to be non-archaeal to "mess up" the LCA assignment).

Maybe a quick workaround could be to limit the diamond output to an absolute number of valid hits instead of using --top, or maybe just to increase --top (to maybe 75 or something). The relevant diamond options are:
--max-target-seqs (-k): maximum number of target sequences to report alignments for
--top: report alignments within this percentage range of top alignment score (overrides --max-target-seqs)

Any hints on which of these approaches you'd consider more suitable would be greatly appreciated (before I run diamond again on the complete dataset - I just grep-ed for one example sequence here).

Thanks a lot,
with my best wishes
Thomas

CAT_TSC.contig2classification.txt
CAT_TSC.log
CAT_TSC.ORF2LCA.txt

TSC.contig.fna.txt
TSC.predicted_protein.faa.txt
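To make the effect described above concrete, here is a toy sketch of the two mechanisms involved: keeping hits within a percentage of the top bit-score (following the --top description quoted in this post) and taking the lowest common ancestor of the remaining lineages. It is not CAT's implementation. The function names are hypothetical, and the taxids are used purely for illustration.

```python
def hits_within_top(hits, top_percent):
    """Keep hits whose bit-score lies within `top_percent` of the best one.

    hits: list of (subject_id, bitscore) tuples.
    """
    if not hits:
        return []
    best = max(score for _, score in hits)
    cutoff = best * (1 - top_percent / 100)
    return [(sid, score) for sid, score in hits if score >= cutoff]


def lowest_common_ancestor(lineages):
    """Return the deepest taxid shared by all root-first lineages.

    lineages: list of taxid lists, each starting at the root.
    Returns None for an empty input.
    """
    lca = None
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca
```

With two archaeal lineages such as `[1, 131567, 2157, 28890]` the LCA stays inside Archaea (2157), but adding a single lineage like `[1, 10239]` drops it to the root (1). A smaller percentage passed to `hits_within_top` keeps fewer hits, so fewer lineages enter the LCA.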

process output

Hey @bastiaanvonmeijenfeldt ,

I ran CAT bin on my bins and got the lineages and the official taxonomic names. I see you are very cautious not to overclassify.

To process the output I would like to:

- visualize the output in a tree, e.g. by using the taxid
- get a short taxonomic name for each bin, e.g. Bacteroides*

Unfortunately this is more complicated than I thought, because neither the lineages nor the names are in an easy-to-read format, e.g. a table.

- The lineages contain stars, so they are no longer numbers.
- The names always come with the score, e.g. Escherichia: 0.8

Of course I can extract the information I want, but I thought it would make sense to ask you to change the output format.

Then there is the problem of double annotations; I don't completely understand why this happens. At least it makes it almost impossible to computationally add a unique name to each bin.

what do you think?
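Until the format changes, the two fields mentioned above can be parsed with a few lines of Python. This is only a sketch under assumptions: that a star is a trailing character on a taxid (e.g. `2*`), and that a name field looks like `Escherichia: 0.8` or `Streptomyces (genus): 0.33`. The function names are hypothetical.

```python
import re


def parse_lineage(lineage):
    """'1;131567;2*' -> [(1, False), (131567, False), (2, True)].

    The boolean records whether the taxid carried a trailing star.
    """
    parsed = []
    for token in lineage.split(";"):
        starred = token.endswith("*")
        parsed.append((int(token.rstrip("*")), starred))
    return parsed


# name, optional " (rank)", then ": score"
NAME_RE = re.compile(
    r"^(?P<name>.+?)(?: \((?P<rank>[^()]+)\))?: (?P<score>\d+(?:\.\d+)?)$"
)


def parse_name_field(field):
    """'Streptomyces (genus): 0.33' -> ('Streptomyces', 'genus', 0.33)."""
    match = NAME_RE.match(field.strip())
    if match is None:
        raise ValueError(f"unrecognised name field: {field!r}")
    return match["name"], match["rank"], float(match["score"])
```

The rank is returned as None when the field has no parenthesised rank.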

An error when using CAT

Hi,

I used CAT to classify taxa from a set of contigs and got an error (Error: Invalid scoring matrix: BLOSUM62 [2020-02-11 16:36:04.597504] ERROR: DIAMOND finished abnormally). I used new nr database files that I built myself with your suggested command line, CAT prepare --existing -d {folder containing nr} -t {folder containing taxonomy files}.

[2020-02-11 16:36:00.847390] Prodigal found: Prodigal V2.6.3: February, 2016.
[2020-02-11 16:36:00.850510] DIAMOND found: diamond version 0.8.16.
[2020-02-11 16:36:00.850816] Importing contig names from /data/liaohp/project3-rebound/ARGs-OAP/hosts/C0/C0-contigs.trim.500.host.fa.
[2020-02-11 16:36:00.874224] Running Prodigal for ORF prediction. Files out.CAT.predicted_proteins.faa and out.CAT.predicted_proteins.gff will be generated. Do not forgetto cite Prodigal when using CAT or BAT in your publication!
[2020-02-11 16:36:04.591133] ORF prediction done!
[2020-02-11 16:36:04.591290] Parsing ORF file out.CAT.predicted_proteins.faa
[2020-02-11 16:36:04.594085] Homology search with DIAMOND is starting. Please be patient. Do not forget to cite DIAMOND when using CAT or BAT in your publication!
query: out.CAT.predicted_proteins.faa
database: ./CAT_database/2020-02-11.nr.dmnd
mode: fast
number of cores: 64
block-size (billions of letters): 2.0
index-chunks: 4
tmpdir: ./
top: 50
Error: Invalid scoring matrix: BLOSUM62
[2020-02-11 16:36:04.597504] ERROR: DIAMOND finished abnormally.

Can you help me solve this problem? Thanks.

Liao

CAT add_names returns KeyError: 'no taxid found (5EY5_A)'

Hi,
I'm running the code CAT add_names -i out.BAT.ORF2LCA.txt -o out.BAT.ORF2LCA.names.txt -t ../2018-10-16_taxonomy/ and it is returning me the error:

CAT v4.0.

[2018-10-22 11:46:36.384011] Importing file ../2018-10-16_taxonomy/nodes.dmp.
[2018-10-22 11:46:38.874349] Importing file ../2018-10-16_taxonomy/names.dmp.
[2018-10-22 11:46:40.758726] Appending names...
Traceback (most recent call last):
File "/gpfs/ts0/home/bt273/BIOS-SCOPE/metag/ashley/data/CAT/CAT_pack/CAT", line 72, in <module>
main()
File "/gpfs/ts0/home/bt273/BIOS-SCOPE/metag/ashley/data/CAT/CAT_pack/CAT", line 60, in main
add_names.run()
File "/gpfs/ts0/projects/Research_Project-172179/metag/ashley/data/CAT/CAT_pack/add_names.py", line 174, in run
add_names(args)
File "/gpfs/ts0/projects/Research_Project-172179/metag/ashley/data/CAT/CAT_pack/add_names.py", line 163, in add_names
scores)
File "/gpfs/ts0/projects/Research_Project-172179/metag/ashley/data/CAT/CAT_pack/tax.py", line 235, in convert_to_names
name = taxid2name[taxid]
KeyError: 'no taxid found (5EY5_A)'

I'm using Python 3.6.6
My Taxonomy file contains:
2018-10-16.prot.accession2taxid.gz
citations.dmp
division.dmp
gencode.dmp
names.dmp
readme.txt
2018-10-16.taxdump.tar.gz
delnodes.dmp
gc.prt
merged.dmp
nodes.dmp

I've used CAT prepare --fresh and CAT bins to assess my metagenome bins with no problems.

Does anyone have any idea where I'm going wrong?

Issues in add_names, bins command

I have a couple of issues with some cat commands:

When I try to run cat bins command I get an error:
[2020-05-22 13:09:01.050855] Importing bins from T45metabat_output/.
Traceback (most recent call last):
File "/Applications/miniconda3/bin/cat", line 76, in <module>
main()
File "/Applications/miniconda3/bin/cat", line 62, in main
bins.run()
File "/Applications/miniconda3/share/cat-5.0.4-0/CAT_pack/bins.py", line 703, in run
bins(args)
File "/Applications/miniconda3/share/cat-5.0.4-0/CAT_pack/bins.py", line 522, in bins
quiet)
File "/Applications/miniconda3/share/cat-5.0.4-0/CAT_pack/bins.py", line 253, in import_bins
for line in f1:
File "/Applications/miniconda3/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 37: invalid start byte

When I try the cat bin command on a single bin, it works. Why is this happening with multiple bins?

cat add_names is working but the following error pops up in output file:
/Volumes/Nidhi/CAT_prepare_20200304/bin3_addednames: line 2: syntax error near unexpected token (' /Volumes/Nidhi/CAT_prepare_20200304/bin3_addednames: line 2: T45metabat_output.3.fa classified based on 9893/10197 ORFs 1;131567;2;1783272;201174;1760;85011;2062;1883 1.00;0.99;0.98;0.73;0.71;0.71;0.33;0.33;0.33 root (no rank): 1.00 cellular organisms (no rank): 0.99 Bacteria (superkingdom): 0.98 Terrabacteria group (no rank): 0.73 Actinobacteria (phylum): 0.71 Actinobacteria (class): 0.71 Streptomycetales (order): 0.33 Streptomycetaceae (family): 0.33 Streptomyces (genus): 0.33'
logout
Saving session...
...copying shared history...
...saving history...truncating history files...
...completed.

What does "syntax error near unexpected token `('" mean?

cat add_names --only_official gives following error in the output executable file:
/Volumes/Nidhi/CAT_prepare_20200304/bin3bin2class_official2 ; exit;
(base) Aishwaryas-MBP:~ aishwarya$ /Volumes/Nidhi/CAT_prepare_20200304/bin3bin2class_official2 ; exit;
/Volumes/Nidhi/CAT_prepare_20200304/bin3bin2class_official2: line 2: T45metabat_output.3.fa: command not found
/Volumes/Nidhi/CAT_prepare_20200304/bin3bin2class_official2: line 2: 131567: command not found
/Volumes/Nidhi/CAT_prepare_20200304/bin3bin2class_official2: line 2: 2: command not found
/Volumes/Nidhi/CAT_prepare_20200304/bin3bin2class_official2: line 2: 1783272: command not found
/Volumes/Nidhi/CAT_prepare_20200304/bin3bin2class_official2: line 2: 201174: command not found
/Volumes/Nidhi/CAT_prepare_20200304/bin3bin2class_official2: line 2: 1760: command not found

Please help me with these issues
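For the UnicodeDecodeError in the first point, it may help to find out which file in the bin folder cannot be read as UTF-8 text. The sketch below lists such files together with the position of the first offending byte. The function name is hypothetical, and the cause is a guess only: on macOS volumes, hidden files in the folder are one possibility, but the traceback alone does not show that.

```python
from pathlib import Path


def find_undecodable_files(folder):
    """Return (file name, byte offset) for files that are not valid UTF-8.

    Every regular file directly inside `folder` is read in full, so this is
    meant for folders of modest size.
    """
    bad = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as err:
            bad.append((path.name, err.start))
    return bad
```

Calling `find_undecodable_files("T45metabat_output/")` would then show whether an unexpected file sits next to the bins.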

ERROR: DIAMOND finished abnormally

Hi there,

I am using CAT to annotate contigs and the procedure stops at the diamond step with ERROR: DIAMOND finished abnormally. The diamond version is 0.9.21, the same as in 2020-03-04.CAT_prepare.fresh.log. CAT was installed with Conda. Any ideas?

Help updating taxonomy

Hi,
I just updated my databases and need to update my annotations following these instructions:
Alternatively, if you already have a predicted proteins fasta file and/or an alignment table for example from previous runs, you can supply them to CAT, which will then skip the steps that have already been done and start from there:

$ CAT contigs -c {contigs fasta} -d {database folder} -t {taxonomy folder} -p {predicted proteins fasta} -a {alignment file}

Can you please tell me the name of the predicted proteins fasta and alignment file to use?

Thank you,
Kathie Mihindukulasuriya

How to pass the diamond parameter (--evalue) to CAT

How can I pass diamond parameters (such as --evalue) to CAT?

US Mirror of the data

Hey Bastiaan,

I posted the databases on http://edwards.sdsu.edu/CATBAT/CAT_prepare_20200618.tar.gz and http://edwards.sdsu.edu/CATBAT/CAT_prepare_20200618.tar.gz.md5 to provide a US mirror. It might help people download a little faster.

Rob
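For anyone downloading from a mirror, the archive can be compared against the accompanying .md5 file before unpacking. A minimal sketch follows, assuming the .md5 file starts with the hex digest as in the usual md5sum layout; the function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(archive_path, md5_path):
    """Compare the archive digest with the first token of the .md5 file.

    Assumes the .md5 file begins with the hex digest (md5sum layout).
    """
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(archive_path) == expected
```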

donwload of taxonomy files failed

Hi Bastiaan,
When I run the command "CAT prepare --fresh" on CentOS, the result always shows "ERROR: donwload of taxonomy files failed". I have tried several times but it always fails. How can I deal with that? (The dependencies and database files already exist.)

This is the entire CAT log.
(cat) [pwl@ibg pwl]$ CAT prepare --fresh

CAT v4.3.3.

CAT prepare is running, constructing a fresh database.
Rawr!
WARNING: preparing the database files may take a couple of hours.
Supplied command: /home/xiyangd/miniconda2/envs/cat/bin/CAT prepare --fresh
Taxonomy folder: 2019-06-15_taxonomy/
Database folder: 2019-06-15_CAT_database/
Log file: 2019-06-15.CAT_prepare.fresh.log
[2019-06-15 04:38:16.966006] Diamond found: diamond version 0.9.21.
[2019-06-15 04:38:16.966592] Taxonomy folder exists already. Taxonomy files will be downloaded to it.
[2019-06-15 04:38:16.966789] Database folder exists already. Database file will be downloaded to it / constructed in it.
[2019-06-15 04:38:16.967401] Downloading and extracting taxonomy files from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz to 2019-06-15_taxonomy.
[2019-06-15 05:25:47.875356] ERROR: donwload of taxonomy files failed.
Best,
piwling

Split concatenated files per bin

Hi there,

I want to use the final .gff and .faa after running BAT. How can I now split these files by bin?

Thanks!
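One possible approach, sketched here under assumptions, is to map each protein back to its contig and each contig back to its bin. CAT's own error messages state that proteins should be named contig_name_#, so the contig can be recovered by dropping the trailing number. The contig-to-bin dictionary has to be built by the user from the bin fasta files, and I have not checked whether BAT changes contig names when it concatenates the bins. The function names are hypothetical.

```python
from collections import defaultdict


def contig_of(protein_id):
    """Strip the trailing ORF number: 'contig_10_1' -> 'contig_10'."""
    return protein_id.rsplit("_", 1)[0]


def split_proteins_by_bin(faa_lines, contig2bin):
    """Group the lines of a protein fasta by bin.

    faa_lines: iterable of lines from the concatenated .faa file.
    contig2bin: dict mapping contig name -> bin name (user supplied).
    Records whose contig is not in contig2bin are skipped.
    """
    per_bin = defaultdict(list)
    current = None
    for line in faa_lines:
        if line.startswith(">"):
            protein_id = line[1:].split()[0]
            current = contig2bin.get(contig_of(protein_id))
        if current is not None:
            per_bin[current].append(line)
    return dict(per_bin)
```

The lists returned per bin can be written out to one .faa file each. The .gff file could be split with the same contig lookup on its first column.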

Unknown format code 'd' in prepare.py

Hello,
I just installed CAT via bioconda. I'm trying to generate the database myself, by running

$ CAT prepare --fresh

$ CAT prepare --existing -d {folder containing nr} -t {folder containing taxonomy files}

In both cases, I get this error message:

File "/home/aline/anaconda3/envs/cat_env/share/cat-5.2-0/CAT_pack/prepare.py", line 138, in memory_bottleneck
'at least {0:,d}GB of memory is needed for the database '
ValueError: Unknown format code 'd' for object of type 'str'

The environment is Ubuntu 18.04 with anaconda3, and the CAT version is v5.2 (20 November, 2020).

I really have no clue on how to solve this. Thanks in advance for your help :)

Aline
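The message itself is standard Python behaviour: the `d` format code only accepts integers, so it fails when the value being formatted is a string. The sketch below reproduces that and shows a conversion that avoids it. Which value in prepare.py ends up as a string cannot be seen from the traceback, so the helper name and its use are hypothetical.

```python
def format_memory_gb(value):
    """Format a memory size with thousands separators.

    Accepts an int or a numeric string; int() makes the 'd' code valid.
    """
    return "{0:,d}".format(int(value))


def reproduces_error():
    """Return the message raised when 'd' is applied to a str."""
    try:
        "{0:,d}".format("16")
    except ValueError as err:
        return str(err)
    return None
```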

Requirements

Hey @bastiaanvonmeijenfeldt

I tried to download the NR database on the head node, but I don't have enough memory there.
Strangely, your code didn't end with an error.

So, I would need to perform this on my cluster system. Do you have any idea how much memory and time it would need to generate the diamond database?

Using gzip for output files

It would be nice to have the option to compress the output files. While this would increase the computational load, it would reduce the HDD space requirements.

I could start to work on a PR for that.

Best wishes,
Lukas Jansen
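As a rough idea of what such an option could build on, Python's gzip module can write and read text files transparently. This is only a sketch of the technique with hypothetical helper names, not a proposal for how CAT's writers are structured.

```python
import gzip


def write_lines_gz(path, lines):
    """Write text lines to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


def read_lines_gz(path):
    """Read text lines back from a gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]
```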

'CAT prepare --fresh' doesn't create all required files?

Hi there again!
After running the following commands

$ CAT prepare --fresh 
$ CAT prepare --existing -d {folder containing nr} -t {folder containing taxonomy files}

In the log file, I see these error messages:

[2020-03-08 16:11:11.405982] ERROR: file fastaid2LCAtaxid is not found in database folder.
[2020-03-08 16:11:11.406296] ERROR: file taxids_with_multiple_offspring not found in database folder.

So, it seems that CAT prepare doesn't create fastaid2LCAtaxid and taxids_with_multiple_offspring files?

The database files are unavailable for downloading

Dear Bastiaan,
the command wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20190719.tar.gz is not working because tbb.bio.uu.nl is not available

addnames stops with no taxid found (HCB58249.1) in ORF2LCA file

Hello,

I have been using CAT to assign taxonomy to individual contigs in bins. It's been going really well but I just hit a snag! I am using the ORF2LCA file to add official names and one of the output files was truncated. Standard error file says:

KeyError: 'no taxid found (HCB58249.1)'

When I compare the outputs of the ORF2LCA file and the addnames file it is clear that the program stops after it hits the row with no taxid found (HCB58249.1).

I can probably go through the remaining contigs manually but it probably needs a fix. In the meantime, can you suggest a workaround?

Thanks and this tool has been super helpful!

Viola
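As a stopgap, the rows that trigger the KeyError could be set aside before running add_names. The sketch below assumes that the offending rows contain the literal text "no taxid found", as it appears in the error message; the function name is hypothetical and the rest of the ORF2LCA layout is not relied upon.

```python
MARKER = "no taxid found"


def split_resolvable(lines, marker=MARKER):
    """Separate lines that contain `marker` from those that do not.

    Returns (kept, skipped); `kept` can be written to a new file and
    passed on, `skipped` can be inspected by hand.
    """
    kept, skipped = [], []
    for line in lines:
        if marker in line:
            skipped.append(line)
        else:
            kept.append(line)
    return kept, skipped
```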

CAT prepare --existing: Process gets killed during loading of files

I would like to create the necessary files to run CAT based on an existing diamond database (so only the fastaid2LCAtaxid and taxids_with_multiple_offspring files need to be created, as all other files are "found"). While running CAT prepare --existing -d . -t . the process gets killed during loading of the nr.gz file or the prot.accession2taxid file. I tried several times, but the killing happens at different time points. The loading of the nr.gz files takes more than an hour (sometimes the process is killed before it finishes) and the loading of the prot file gets killed after roughly an hour too (I never got past this point). Is it normal for this step to take such a long time? I assumed it would go rather quickly as the nr.dmnd file has already been created... Does it require a lot of memory?

what does no support mean?

Hello,

Thank you so much for your wonderful tool. I was wondering, what does it mean when it says "no support" for a species CAT annotation? Also, sometimes there is a numeric value for support. Should we filter based on that value?

Best wishes,

Error: found a protein in the predicted proteins fasta file that can not be traced back to one of the contigs in the contigs fasta file

I have been running BAT on a set of 31 bins but am having problems after Prodigal has produced BAT_run.concatenated.predicted_proteins.faa.
Below is an example, but the same situation has occurred with other bins and it appears to be random: if I rerun the same command, the error may come from a protein from a different bin.

Below is the log file

CAT v5.0.4.

BAT is running. Protein prediction, alignment, and bin classification are carried out.
Rarw!

Supplied command: /home/bharat/opt/anaconda3/envs/CAT/bin/CAT bins -r 10 -f 0.1 -b bins_fna/ -d /media/bharat/volume1/db/CAT_prepare_20200304/2020-03-04_CAT_database/ -t /media/bharat/volume1/db/CAT_prepare_20200304/2020-03-04_taxonomy/ -o /home/bharat/Desktop/MAG_Analysis/CS1BS/CS1BS_BIN_REASSEMBLY/reassembled_bins/bins_fna/BAT_run

Bin folder: bins_fna/
Taxonomy folder: /media/bharat/volume1/db/CAT_prepare_20200304/2020-03-04_taxonomy/
Database folder: /media/bharat/volume1/db/CAT_prepare_20200304/2020-03-04_CAT_database/
Parameter r: 10.0
Parameter f: 0.1
Log file: /home/bharat/Desktop/MAG_Analysis/CS1BS/CS1BS_BIN_REASSEMBLY/reassembled_bins/bins_fna/BAT_run.log

Doing some pre-flight checks first.
[2020-05-03 21:09:06.080226] Prodigal found: Prodigal V2.6.3: February, 2016.
[2020-05-03 21:09:06.087578] DIAMOND found: diamond version 0.9.21.
Ready to fly!

[2020-05-03 21:09:06.088300] Importing bins from bins_fna/.
[2020-05-03 21:09:06.777792] 31 bin(s) found!
[2020-05-03 21:09:06.777914] Writing /home/bharat/Desktop/MAG_Analysis/CS1BS/CS1BS_BIN_REASSEMBLY/reassembled_bins/bins_fna/BAT_run.concatenated.fasta.
[2020-05-03 21:09:08.017832] Running Prodigal for ORF prediction. Files /home/bharat/Desktop/MAG_Analysis/CS1BS/CS1BS_BIN_REASSEMBLY/reassembled_bins/bins_fna/BAT_run.concatenated.predicted_proteins.faa and /home/bharat/Desktop/MAG_Analysis/CS1BS/CS1BS_BIN_REASSEMBLY/reassembled_bins/bins_fna/BAT_run.concatenated.predicted_proteins.gff will be generated. Do not forget to cite Prodigal when using CAT or BAT in your publication!
[2020-05-03 21:26:52.857614] ORF prediction done!
[2020-05-03 21:26:52.858098] Parsing ORF file /home/bharat/Desktop/MAG_Analysis/CS1BS/CS1BS_BIN_REASSEMBLY/reassembled_bins/bins_fna/BAT_run.concatenated.predicted_proteins.faa
[2020-05-03 21:26:53.401296] ERROR: found a protein in the predicted proteins fasta file that can not be traced back to one of the contigs in the contigs fasta file: bin.12.orig_1. Proteins should be named contig_name_#.
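To see which proteins cannot be matched, the naming rule from the error message (contig_name_#) can be checked outside of CAT. The following is a minimal sketch with hypothetical function names; it compares the first word of each fasta header and is not CAT's own check.

```python
def fasta_ids(lines):
    """Return the first word of every fasta header line."""
    return [line[1:].split()[0] for line in lines if line.startswith(">")]


def orphan_proteins(contig_lines, protein_lines):
    """List protein IDs whose contig (ID minus trailing _#) is unknown.

    contig_lines / protein_lines: iterables of lines from the contigs
    fasta and the predicted proteins fasta.
    """
    contigs = set(fasta_ids(contig_lines))
    return [
        protein_id
        for protein_id in fasta_ids(protein_lines)
        if protein_id.rsplit("_", 1)[0] not in contigs
    ]
```

Running this on the concatenated fasta and the predicted proteins file would show whether the same proteins are reported each time or whether the set changes between runs.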

diamond takes all available threads

Hi guys,

I noticed while running CAT that during the diamond step it uses all available threads. It would be great to be able to specify this instead of having all run as default.

Nori

libc6

I get an error stemming from a wrong version of libc6. It requires compiling the executables on my system, as I'm using an academic cluster and its OS can't be upgraded right now. Can you post instructions and the source files to do my own compilation?
Thanks

IndexError: list index out of range when creating bin2classification.txt

I'm using CAT to classify some viral MAGs and everything seems to run fine until it gets to the point of creating the summary file. Protein prediction, diamond alignment, etc. all work great until it starts creating the bin2classification file. In case it matters, I didn't use the full nr database but instead created my own out of the RefSeq viral database but followed all the instructions/looked through the CAT code to verify it would work so it should be compatible. Here is the log:

# CAT v5.0.3.

BAT is running. Protein prediction, alignment, and bin classification are carried out.
Rarw!

Supplied command: /home/wlclose/miniconda3/envs/catbat/bin/CAT bins -r 10 -f 0.1 --index_chunks 1 --block_size 6 --nproc 4 --sensitive -s .fa -b data/metabat/bins/ -d envs/share/catbat/database/ -t envs/share/catbat/taxonomy/ -o test1/BAT_run

Bin folder: data/metabat/bins/
Taxonomy folder: envs/share/catbat/taxonomy/
Database folder: envs/share/catbat/database/
Parameter r: 10.0
Parameter f: 0.1
Log file: test1/BAT_run.log

-----------------

Doing some pre-flight checks first.
[2020-02-20 09:11:32.640270] Prodigal found: Prodigal V2.6.3: February, 2016.
[2020-02-20 09:11:32.978950] DIAMOND found: diamond version 0.9.14.
Ready to fly!

-----------------

[2020-02-20 09:11:32.990032] Importing bins from data/metabat/bins/.
[2020-02-20 09:16:30.119352] 6342 bin(s) found!
[2020-02-20 09:16:30.124320] Writing test1/BAT_run.concatenated.fasta.
[2020-02-20 09:16:36.237435] Running Prodigal for ORF prediction. Files test1/BAT_run.concatenated.predicted_proteins.faa and test1/BAT_run.concatenated.predicted_proteins.gff will be generated. Do not forget to cite Prodigal when using CAT or BAT in your publication!
[2020-02-20 09:55:27.250312] ORF prediction done!
[2020-02-20 09:55:27.256219] Parsing ORF file test1/BAT_run.concatenated.predicted_proteins.faa
[2020-02-20 09:55:28.086029] Homology search with DIAMOND is starting. Please be patient. Do not forget to cite DIAMOND when using CAT or BAT in your publication!
				query: test1/BAT_run.concatenated.predicted_proteins.faa
				database: envs/share/catbat/database/refseq_viral.dmnd
				mode: sensitive
				number of cores: 4
				block-size (billions of letters): 6.0
				index-chunks: 1
				tmpdir: test1
				top: 50
[2020-02-20 10:00:39.372717] Homology search done! File test1/BAT_run.concatenated.alignment.diamond created.
[2020-02-20 10:00:39.374596] Parsing DIAMOND file test1/BAT_run.concatenated.alignment.diamond.
[2020-02-20 10:00:41.020247] Importing file envs/share/catbat/taxonomy/nodes.dmp.
[2020-02-20 10:00:46.049918] Importing file envs/share/catbat/database/2020-02-19.nr.fastaid2LCAtaxid.
[2020-02-20 10:00:46.227850] Importing file envs/share/catbat/database/2020-02-19.nr.taxids_with_multiple_offspring.
[2020-02-20 10:00:46.239088] BAT is flying! Files test1/BAT_run.bin2classification.txt and test1/BAT_run.ORF2LCA.txt are created.
Traceback (most recent call last):
  File "/home/wlclose/miniconda3/envs/catbat/bin/CAT", line 76, in <module>
    main()
  File "/home/wlclose/miniconda3/envs/catbat/bin/CAT", line 62, in main
    bins.run()
  File "/home/wlclose/miniconda3/envs/catbat/share/cat-5.0.3-0/CAT_pack/bins.py", line 693, in run
    bins(args)
  File "/home/wlclose/miniconda3/envs/catbat/share/cat-5.0.3-0/CAT_pack/bins.py", line 605, in bins
    starred_lineage = tax.star_lineage(lineage,
  File "/home/wlclose/miniconda3/envs/catbat/share/cat-5.0.3-0/CAT_pack/tax.py", line 145, in star_lineage
    questionable_taxids = find_questionable_taxids(lineage,
  File "/home/wlclose/miniconda3/envs/catbat/share/cat-5.0.3-0/CAT_pack/tax.py", line 137, in find_questionable_taxids
    taxid_parent = lineage[i + 1]
IndexError: list index out of range

Create files in a specified folder other than in database folders

Hi,

I noticed that Diamond was trying to create a database in the nr database folder when I ran the CAT prepare command. Is it possible to create files in a folder other than the one where the third-party databases are stored? The thing is, some of us HPC users already have nr.gz in a shared directory, which we can read but cannot write to. So I was thinking it would be good if we could read the nr database from a read-only shared directory and write the other 'new files' to a specified personal directory.

Cheers,
Heyu

visualisation for CAT results

Hello all,

I am a devoted user of CAT/BAT, and I am wondering if there is a visualisation tool compatible with CAT/BAT output format? Perhaps you're using it in your practice or might know how to convert CAT results effectively. I would like to get something like Pavian or Krona, but these two require specific formats and not the CAT one.

Thanks!

Best,
Polina
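One route that might work for Krona is its plain-text import: to my knowledge, KronaTools' ktImportText reads a tab-delimited file with a quantity in the first column followed by the hierarchy levels, but please check the Krona documentation before relying on that. The sketch below only covers the last step, turning a collection of name lineages into such rows. Extracting the name lists from the CAT/BAT output is left to the caller, and the function name is hypothetical.

```python
from collections import Counter


def krona_text_rows(lineages):
    """Count identical lineages and format them as tab-delimited rows.

    lineages: iterable of name lists, root-most name first.
    Returns rows of the form 'count<TAB>name1<TAB>name2...'.
    """
    counts = Counter(tuple(lineage) for lineage in lineages)
    return [
        "\t".join([str(count), *names])
        for names, count in sorted(counts.items())
    ]
```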

Using own gene calls - where do I specify the tab delimited file?

There doesn't seem to be a parameter to specify the tab delimited file suggested in the README. How do you incorporate this file into a CAT run?

Database was built with a different version of Diamond and is incompatible

I get the following error when running CAT v.4.6 with the CAT_prepare_20200618.tar.gz I downloaded. I was running in a conda environment with Diamond v0.9.24.125

Error: Database was built with a different version of Diamond and is incompatible.
[2020-08-14 20:06:10.946813] ERROR: DIAMOND finished abnormally.
# CAT v4.6.

BAT is running. Protein prediction, alignment, and bin classification are carried out.
Rarw!

Supplied command: /isilon/lethbridge-rdc/users/ortegapoloro/catrione_metagenomics/work/conda/nf-core-mag-1.1.0dev-fdd28300e312d42ca78a7e617c1313b0/bin/CAT bins -b bins/ -d database/ -t taxonomy/ -n 8 -s .fa --top 6 -o MEGAHIT-SRR9030505 --I_know_what_Im_doing

Bin folder: bins/
Taxonomy folder: taxonomy/
Database folder: database/
Parameter r: 5
Parameter f: 0.3
Log file: MEGAHIT-SRR9030505.log

-----------------

Doing some pre-flight checks first.
[2020-08-14 20:01:28.582365] Prodigal found: Prodigal V2.6.3: February, 2016.
[2020-08-14 20:01:28.607029] DIAMOND found: diamond version 0.9.24.
[2020-08-14 20:01:28.611196] WARNING: [--top] is set lower than 50. This might conflict with future runs with higher settings of the [-r / --range] parameter, see README.md.
Ready to fly!

-----------------

[2020-08-14 20:01:28.613953] Importing bins from bins/.
[2020-08-14 20:01:30.108568] 28 bin(s) found!
[2020-08-14 20:01:30.109987] Writing MEGAHIT-SRR9030505.concatenated.fasta.
[2020-08-14 20:01:30.848442] Running Prodigal for ORF prediction. Files MEGAHIT-SRR9030505.concatenated.predicted_proteins.faa and MEGAHIT-SRR9030505.concatenated.predicted_proteins.gff will be generated. Do not forget to cite Prodigal when using CAT or BAT in your publication!
[2020-08-14 20:06:10.640599] ORF prediction done!
[2020-08-14 20:06:10.644498] Parsing ORF file MEGAHIT-SRR9030505.concatenated.predicted_proteins.faa
[2020-08-14 20:06:10.908072] Homology search with DIAMOND is starting. Please be patient. Do not forget to cite DIAMOND when using CAT or BAT in your publication!
                                query: MEGAHIT-SRR9030505.concatenated.predicted_proteins.faa
                                database: database/2020-06-18.nr.dmnd
                                mode: fast
                                number of cores: 8
                                block-size (billions of letters): 2.0
                                index-chunks: 4

Preparing a database based on a refseq file

Hello,

Is there a way to prepare an input database from a refseq file, as you describe in the section "benchmark 2" of your manuscript? If so, how does one provide the mapping between the headers and the taxonomic identifiers (should I prepare my own Prot.accession2taxid file and in what format?).

Thank you!
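For reference, NCBI's prot.accession2taxid files are tab-delimited with the header accession, accession.version, taxid, gi. A custom mapping in the same layout could be generated as sketched below. Whether CAT accepts such a file as-is, and which columns it reads, is an assumption I have not verified. The gi value 0 is a placeholder chosen for this sketch, and the function name is hypothetical.

```python
HEADER = "accession\taccession.version\ttaxid\tgi"


def accession2taxid_lines(mapping):
    """Build lines in the four-column NCBI accession2taxid layout.

    mapping: dict of 'accession.version' -> taxid.
    The gi column is filled with 0 as a placeholder (assumption).
    """
    lines = [HEADER]
    for acc_ver, taxid in sorted(mapping.items()):
        accession = acc_ver.split(".", 1)[0]
        lines.append(f"{accession}\t{acc_ver}\t{int(taxid)}\t0")
    return lines
```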

Error reading temporary file of Diamond

Hi,

When running "CAT contigs" I get this error:
No such file or directory
Error: Error reading file /CAT_outout_directory/diamond-tmp-J1k18Q
terminate called after throwing an instance of 'File_read_exception'
what(): Error reading file /CAT_outout_directory/diamond-tmp-J1k18Q
[2019-04-03 17:03:41.921512] ERROR: Diamond finished abnormally.
[2019-04-03 17:04:51.988192] ERROR: input file /CAT_outout_directory/sample.CAT.ORF2LCA.txt does not exist.

And my script is:
#$ -pe smp 16
#$ -l h_vmem=8G

CAT contigs -n 16 -c ${Scaffold} -d ${Dir_DB} -t ${Dir_taxon} -o ${Dir_out}/${Sample} \
--proteins_fasta ${Protein} \
--path_to_prodigal /usr/.pyenv/shims/prodigal \
--path_to_diamond /usr/.pyenv/shims/diamond

Version of dependencies are:
DIAMOND: 0.9.14
Prodigal: V2.6.3
Python 3: 3.6.8

Best,
Mika

Problem running BAT with CAT output

Hi, in the readme you state "You can also do this the other way around; start with contig classification and classify the entire MAG with BAT in single bin mode based on the files generated by CAT."

With the CAT output, I'm trying to classify individual bins. There's an error with a protein not being traced back to the contigs fasta (because it's in another bin!)

Command:

CAT bin -b 4018/maxbin.003_sub.contigs.fa -d $CATDB -t $CATTAXDB -o 4018_BAT/bin.2 -p 4018.predicted_proteins.faa -a 4018.alignment.diamond

Error:

[2019-11-18 15:16:20.924689] ERROR: found a protein in the predicted proteins fasta file that can not be traced back to one of the contigs in the contigs fasta file: contig_10_1. Proteins should be named contig_name_#.

Not a problem with contig names, I guess, because I can compile all the bins into a single fasta and then run "single" bin mode fine. It looks like you are parsing underscores which I hope isn't messing things up.

$ CAT bin -b bin_composite.fasta -d $CATDB -t $CATTAXDB -o 4018_BAT/bin.2 -p 4018.predicted_proteins.faa -a 4018.alignment.diamond

[2019-11-18 15:20:19.087538] Importing contig names from bin_composite.fasta.
[2019-11-18 15:20:19.199695] Parsing ORF file 4018.predicted_proteins.faa
[2019-11-18 15:20:19.273408] Parsing DIAMOND file 4018.alignment.diamond.

Any help is appreciated! I like this feature because one run can be used for many applications.

Error in downloading database

Hello,
I get an error when downloading and preparing the database with the following commands:

$ CAT prepare --fresh 
$ CAT prepare --existing -d {folder containing nr} -t {folder containing taxonomy files}

The files fastaid2LCAtaxid and taxids_with_multiple_offspring cannot be created because of a memory error.

Doing some pre-flight checks first.
[2020-12-20 21:50:34.879804] DIAMOND found: diamond version 2.0.5.
[2020-12-20 21:50:34.881043] Taxonomy folder found.
[2020-12-20 21:50:34.881699] Nodes.dmp found: /lustre/home/liutang/01software/cat_5.2/2020-12-19_taxonomy/nodes.dmp.
[2020-12-20 21:50:34.882206] Names.dmp found: /lustre/home/liutang/01software/cat_5.2/2020-12-19_taxonomy/names.dmp.
[2020-12-20 21:50:34.882683] Prot.accession2taxid file found: /lustre/home/liutang/01software/cat_5.2/2020-12-19_taxonomy/2020-12-19.prot.accession2taxid.gz.
[2020-12-20 21:50:34.883191] Database folder found.
[2020-12-20 21:50:34.883646] Nr file found: /lustre/home/liutang/01software/cat_5.2/2020-12-19_CAT_database/2020-12-19.nr.gz.
[2020-12-20 21:50:34.884077] DIAMOND database found: /lustre/home/liutang/01software/cat_5.2/2020-12-19_CAT_database/2020-12-19.nr.dmnd.
[2020-12-20 21:50:34.884510] File fastaid2LCAtaxid will be created.
[2020-12-20 21:50:34.884934] File taxids_with_multiple_offspring will be created.
Ready to fly!

-----------------

[2020-12-20 21:50:34.886539] Loading /lustre/home/liutang/01software/cat_5.2/2020-12-19_taxonomy/2020-12-19.prot.accession2taxid.gz into memory. Please be patient...
Traceback (most recent call last):
  File "/lustre/home/liutang/.conda/envs/cat_5.2/bin/CAT", line 84, in <module>
    main()
  File "/lustre/home/liutang/.conda/envs/cat_5.2/bin/CAT", line 62, in main
    prepare.run()
  File "/lustre/home/liutang/.conda/envs/cat_5.2/share/cat-5.1.2-0/CAT_pack/prepare.py", line 815, in run
    run_existing(args)
  File "/lustre/home/liutang/.conda/envs/cat_5.2/share/cat-5.1.2-0/CAT_pack/prepare.py", line 804, in run_existing
    prepare(step_list, args)
  File "/lustre/home/liutang/.conda/envs/cat_5.2/share/cat-5.1.2-0/CAT_pack/prepare.py", line 437, in prepare
    make_fastaid2LCAtaxid_file(
  File "/lustre/home/liutang/.conda/envs/cat_5.2/share/cat-5.1.2-0/CAT_pack/prepare.py", line 269, in make_fastaid2LCAtaxid_file
    prot_accession2taxid = import_prot_accession2taxid(
  File "/lustre/home/liutang/.conda/envs/cat_5.2/share/cat-5.1.2-0/CAT_pack/prepare.py", line 257, in import_prot_accession2taxid
    prot_accession2taxid[line[1]] = line[2]
MemoryError

Can you help me to solve this problem? Thanks.
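The traceback ends inside `import_prot_accession2taxid`, at the line that stores every entry of the prot.accession2taxid file in a dictionary, so the job needs enough RAM to hold the whole mapping at once. A rough way to gauge its size before requesting memory is to stream the file and count its records. The sketch below is not part of CAT. It assumes the file is tab-separated with the accession.version in the second column and the taxid in the third, which matches the `line[1]` and `line[2]` indices in the traceback. The function names are hypothetical.

```python
import gzip


def iter_accession2taxid(path):
    """Stream (accession.version, taxid) pairs from a gzipped prot.accession2taxid file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Skip the header and any malformed line: a record needs a numeric taxid.
            if len(fields) >= 3 and fields[2].isdigit():
                yield fields[1], fields[2]


def count_records(path):
    """Count the records without keeping any of them in memory."""
    return sum(1 for _ in iter_accession2taxid(path))
```

Because the generator holds only one line at a time, the count itself runs in constant memory. The resulting number gives an idea of how many dictionary entries `CAT prepare` will try to hold.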

diamond version

Hi,

May I know which DIAMOND version is used? I tried to use the nr database created under the documentation directory with the latest DIAMOND version, but it failed with this error:
"Error: Incompatible database version".

Best,
Quan
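When tracking down this kind of mismatch, it can help to record which DIAMOND version is installed and compare it with the one that built the database. The sketch below parses version strings of the form `diamond version 2.0.5`, as printed in the CAT log elsewhere on this page. The function names are hypothetical. It only reports and compares version numbers and makes no claim about which versions produce compatible database files.

```python
import re
import subprocess


def parse_diamond_version(text):
    """Extract a version tuple such as (2, 0, 5) from 'diamond version 2.0.5'."""
    match = re.search(r"version\s+(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number found in: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def installed_diamond_version(path_to_diamond="diamond"):
    """Ask the DIAMOND executable for its version and parse the answer."""
    result = subprocess.run(
        [path_to_diamond, "version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_diamond_version(result.stdout)
```

Tuples compare element by element, so `parse_diamond_version("diamond version 0.9.14") < parse_diamond_version("diamond version 2.0.5")` holds as expected.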

Can we parallelize the diamond search?

@bastiaanvonmeijenfeldt
I'm integrating CAT in my snakemake pipeline. I would like to parallelize the taxonomic annotation, e.g. submit the annotation of each bin to another cluster node.

Would it be possible to add an option so that CAT bin can take a single FASTA file as input, and not only a bin folder?
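A minimal sketch of how per-bin jobs could be generated for a workflow manager, assuming the installed CAT version accepts a single FASTA file via `CAT bin -b`, as in the single-bin command quoted in another issue on this page. The function name `bat_commands` and the `.fa` suffix default are hypothetical.

```python
from pathlib import Path


def bat_commands(bin_dir, database_dir, taxonomy_dir, out_dir, suffix=".fa"):
    """Build one 'CAT bin' command (as an argument list) per bin FASTA in bin_dir."""
    commands = []
    for bin_fasta in sorted(Path(bin_dir).glob(f"*{suffix}")):
        # Use the file name without its suffix as the output prefix of the job.
        out_prefix = Path(out_dir) / bin_fasta.stem
        commands.append([
            "CAT", "bin",
            "-b", str(bin_fasta),
            "-d", str(database_dir),
            "-t", str(taxonomy_dir),
            "-o", str(out_prefix),
        ])
    return commands
```

Each argument list can be handed to `subprocess.run` or written into a job script and submitted to a separate node. Note that every job then runs its own gene prediction and DIAMOND search.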

Allowing CAT to use UniProt databases

I would like to use CAT with UniProt databases, as they contain a more suitable set of proteins for my comparisons.
I am working on making CAT prepare work with this, but it requires a lot of manual parsing and fixing of TaxIDs to make UniProt compatible with the NCBI taxonomy.
Is this a feature that would be interesting to have or not? If not, just close this issue again.
