iquasere / recognizer

A tool for domain based annotation with databases from the Conserved Domains Database

License: BSD 3-Clause "New" or "Revised" License

Dockerfile 0.06% Shell 0.04% Python 9.66% HTML 90.24%

cog-assignment annotation protein fasta quantification genomics functional-annotation

recognizer's Introduction

reCOGnizer

A tool for domain-based annotation with databases from the Conserved Domains Database.

Features

reCOGnizer performs domain-based annotation with RPS-BLAST and databases from CDD as reference.

Reference databases currently implemented: CDD, NCBIfam, Pfam, TIGRFAM, Protein Clusters, SMART, COG and KOG.
reCOGnizer runs RPS-BLAST in multiple threads, which significantly speeds up annotation.
After domain assignment to proteins, reCOGnizer converts CDD IDs to the IDs of the respective DBs, and obtains domain descriptions available at CDD.
Further information is retrieved depending on the database in question:
- NCBIfam, Pfam, TIGRFAM and Protein Clusters annotations are complemented with taxonomic classifications and EC numbers.
- SMART annotations are complemented with SMART descriptions.
- COG and KOG annotations are complemented with COG/KOG categories, EC numbers and KEGG Orthologs.

A detailed representation of reCOGnizer's workflow is presented in Fig. 1.

Installing reCOGnizer

To install reCOGnizer, simply run: conda install -c conda-forge -c bioconda recognizer

Annotation with reCOGnizer

The simplest way to run reCOGnizer is to specify the FASTA file and an output directory. The output directory is optional.

recognizer -f input_file.faa -o output_directory
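
reCOGnizer expects protein sequences, so it can be worth checking that the input FASTA is not a nucleotide file before running the command above. The following is a minimal sketch of such a check. The function name, the letter set and the 0.9 threshold are assumptions made for illustration and are not part of reCOGnizer.

```python
from typing import Iterable

# Letters that are typical for DNA/RNA records (assumed set, for illustration).
NUCLEOTIDE_LETTERS = set("ACGTUN")


def looks_like_nucleotide(lines: Iterable[str], threshold: float = 0.9) -> bool:
    """Guess whether FASTA records hold nucleotides rather than amino acids.

    Heuristic (an assumption, not reCOGnizer's behaviour): if at least
    `threshold` of the sequence letters are A/C/G/T/U/N, call it nucleotide.
    """
    total = 0
    nucleotide = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        for letter in line.upper():
            if letter.isalpha():
                total += 1
                nucleotide += letter in NUCLEOTIDE_LETTERS
    return total > 0 and nucleotide / total >= threshold


# Toy records, made up for this example.
protein = [">p1 example", "MKTAYIAKQRQISFVKSHFSRQ"]
dna = [">c1 example", "ATGAAAACCGCGTATATTGCG"]
print(looks_like_nucleotide(protein))  # False
print(looks_like_nucleotide(dna))      # True
```

Short peptides made mostly of A, C, G and T residues would be misclassified, so this is only a quick sanity check.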

Output

reCOGnizer takes a FASTA file of amino acid sequences (commonly .fasta or .faa) as input and writes two main outputs to the output directory:

reCOGnizer_results.tsv and reCOGnizer_results.xlsx, tables with the annotations from every database for each protein
cog_quantification.tsv and respective Krona representation (Fig. 2), which describes the functional landscape of the proteins in the input file

Fig. 2. Krona plot with the quantification of COGs identified in the simulated dataset used to test MOSCA and reCOGnizer. reCOGnizer outputs an interactive version of this plot.

Using previously gathered taxonomic information

reCOGnizer can make use of taxonomic information by filtering Markov Models for the specific taxa of interest. This can be done by providing a file with the taxonomic information of the proteins. To simulate this, run the following commands, after installing reCOGnizer:

git clone https://github.com/iquasere/reCOGnizer.git
cd reCOGnizer/ci
recognizer -f proteomes.fasta --tax-file UPIMAPI_results.tsv --tax-col 'Taxonomic lineage IDs (SPECIES)' --protein-id-col qseqid --species-taxids

Running reCOGnizer this way will usually obtain better results, but will likely take much longer to finish.
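
For proteins that were not annotated with UPIMAPI, a taxonomy table can be assembled by hand. The sketch below writes a two-column TSV using the default column names listed under "Taxonomy Arguments". It assumes that a header row is expected, since the columns are selected by name. The helper `write_tax_file` and the protein IDs are hypothetical.

```python
import csv
import io


def write_tax_file(handle, taxids,
                   id_col="qseqid",
                   tax_col="Taxonomic identifier (SPECIES)"):
    """Write one line per protein: protein ID first, tax ID second."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([id_col, tax_col])  # header row (assumed to be required)
    for protein_id, taxid in taxids.items():
        writer.writerow([protein_id, taxid])


# Hypothetical protein IDs mapped to illustrative NCBI tax IDs.
example = {"protein_1": 562, "protein_2": 1423}
buffer = io.StringIO()
write_tax_file(buffer, example)
print(buffer.getvalue())
```

To produce a file for `--tax-file`, pass an open file handle instead of the in-memory buffer.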

reCOGnizer parameters

options:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Fasta file with protein sequences for annotation
  -t THREADS, --threads THREADS
                        Number of threads for reCOGnizer to use [max available]
  --evalue EVALUE       Maximum e-value to report annotations for [1e-3]
  -o OUTPUT, --output OUTPUT
                        Output directory [reCOGnizer_results]
  -dr DOWNLOAD_RESOURCES, --download-resources DOWNLOAD_RESOURCES
                        This parameter is deprecated. Please do not use it [None]
  -rd RESOURCES_DIRECTORY, --resources-directory RESOURCES_DIRECTORY
                        Output directory for storing databases and other resources [~/recognizer_resources]
  -dbs DATABASES, --databases DATABASES
                        Databases to include in functional annotation (comma-separated) [all available]
  --custom-databases    If databases inputted were NOT produced by reCOGnizer [False]. Default databases of reCOGnizer (e.g., COG, TIGRFAM, ...) can't be used simultaneously with custom
                        databases. Use together with the '--databases' parameter.
  -mts MAX_TARGET_SEQS, --max-target-seqs MAX_TARGET_SEQS
                        Number of maximum identifications for each protein [1]
  --keep-spaces         BLAST ignores sequences IDs after the first space. This option changes all spaces to underscores to keep the full IDs.
  --no-output-sequences
                        Protein sequences from the FASTA input will be stored in their own column.
  --no-blast-info       Information from the alignment will be stored in their own columns.
  --output-rpsbproc-cols
                        Output columns obtained with RPSBPROC - 'Superfamilies', 'Sites' and 'Motifs'.
  -sd SKIP_DOWNLOADED, --skip-downloaded SKIP_DOWNLOADED
                        This parameter is deprecated. Please do not use it [None]
  --keep-intermediates  Keep intermediate annotation files generated in reCOGnizer's workflow, i.e., ASN, RPSBPROC and BLAST reports and split FASTA inputs.
  --quiet               Don't output download information, used mainly for CI.
  --debug               Print all commands running in the background, including those of rpsblast and rpsbproc.
  --test-run            This parameter is only appropriate for reCOGnizer's tests on GitHub. Should not be used.
  -v, --version         show program's version number and exit

Taxonomy Arguments:
  --tax-file TAX_FILE   File with taxonomic identification of proteins inputted (TSV). Must have one line per query, query name on first column, taxid on second.
  --protein-id-col PROTEIN_ID_COL
                        Name of column with protein headers as in supplied FASTA file [qseqid]
  --tax-col TAX_COL     Name of column with tax IDs of proteins [Taxonomic identifier (SPECIES)]
  --species-taxids      If tax col contains Tax IDs of species (required for running COG taxonomic)
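
The `--keep-spaces` option exists because BLAST truncates sequence IDs at the first space. As a rough illustration of the renaming it describes, the sketch below replaces spaces with underscores in FASTA header lines only. The function name is hypothetical, and the real option may differ in detail.

```python
from typing import Iterable, Iterator


def keep_full_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with spaces in header lines replaced by underscores."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line.replace(" ", "_")  # header: keep the full ID as one token
        else:
            yield line  # sequence lines are left untouched


# Toy record, made up for this example.
records = [">sp|P12345 hypothetical protein", "MKTAYIAK"]
print(list(keep_full_ids(records)))
# ['>sp|P12345_hypothetical_protein', 'MKTAYIAK']
```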

Referencing reCOGnizer

If you use reCOGnizer, please cite its publication.

recognizer's Issues

cd.aux not found! and Error: Unknown argument: "max_smp_vol"

Hello.
I encountered the errors below. It looks like none of the cd.aux files are found, and I also get Error: Unknown argument: "max_smp_vol":

Error: Unknown argument: "max_smp_vol"
cat: 'recognizer_output/blast/KOG_*_aligned.blast': No such file or directory
2023-01-02 12:57:22: Organizing annotation results
cat: 'recognizer_output/blast/CDD_*_aligned.blast': No such file or directory
[1/8] Handling CDD annotation
cat: 'recognizer_output/blast/Pfam_*_aligned.blast': No such file or directory
[2/8] Handling Pfam annotation
cat: 'recognizer_output/blast/NCBIfam_*_aligned.blast': No such file or directory
[3/8] Handling NCBIfam annotation
cat: 'recognizer_output/blast/Protein_Clusters_*_aligned.blast': No such file or directory
[4/8] Handling Protein_Clusters annotation
cat: 'recognizer_output/blast/Smart_*_aligned.blast': No such file or directory
[5/8] Handling Smart annotation
cat: 'recognizer_output/blast/TIGRFAM_*_aligned.blast': No such file or directory
[6/8] Handling TIGRFAM annotation
cat: 'recognizer_output/blast/COG_*_aligned.blast': No such file or directory
[7/8] Handling COG annotation
cat: 'recognizer_output/blast/KOG_*_aligned.blast': No such file or directory
[8/8] Handling KOG annotation
/home/nwezejus/miniconda3/envs/reCOGnizer/bin/recognizer.py:844: FutureWarning: save is not part of the public API, usage can give unexpected results and will be removed in a future version
xlsx_report.save()

Implement conversion of CDDs to GOs

Converting CDD IDs to Gene Ontology terms has been implemented previously, resulting in a comprehensive relation of both terms. However, this database has seen no updates since 2002.

In the future, a new version of this database could be released, making it possible to obtain GO terms from CDD annotations with a methodology inspired by the original work.

Error while running with download-resources

Hi!

I am using reCOGnizer (version 1.4.3) from Conda on Debian 10.

This is how I launch the program:
recognizer.py -f input_file.fasta -o recognizer_output --download-resources

So, I hope, that all the resources were downloaded correctly (63 383 files). But I'm getting these errors:

~/Programs/anaconda3/share/resources_directory/cd_10_0.aux not found!
Some part of CDD was not valid!
Generating databases for [10] threads.
...
ERROR: line 3797: "wordScoreThreshold": unexpected member, should be one of: "scores" "lambda" "kappa" "h" "scalingFactor" "lambdaUngapped" "kappaUngapped" "hUngapped" 
...
~/Programs/anaconda3/share/resources_directory/pfam_10_0.aux not found!
Some part of Pfam was not valid!
...
~/Programs/anaconda3/share/resources_directory/NF_10_0.aux not found!
Some part of NCBIfam was not valid!
...
BLAST engine error: Cannot retrieve path to RPS database
...
2021-03-19 10:07:06: Organizing annotation results
[1/8] Handling CDD identifications
Traceback (most recent call last):
  File "~/Programs/anaconda3/bin/recognizer.py", line 477, in <module>
    main()
  File "~/Programs/anaconda3/bin/recognizer.py", line 413, in main
    report = parse_blast('{}/{}_aligned.blast'.format(args.output, db))
  File "~/Programs/anaconda3/bin/recognizer.py", line 205, in parse_blast
    blast = pd.read_csv(file, sep='\t', header=None)
  File "~/Programs/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "~/Programs/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 457, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "~/Programs/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "~/Programs/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "~/Programs/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1917, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 545, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
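
The `EmptyDataError` above is raised when the report file exists but has no content, presumably because the earlier RPS-BLAST step failed. A defensive reader can treat a missing or empty report as "no hits" instead of crashing. The sketch below uses only the standard library. `parse_blast_rows` is a hypothetical stand-in, not the `parse_blast` function from recognizer.py.

```python
import csv
import os
import tempfile


def parse_blast_rows(path):
    """Return the tab-separated rows of a report, or [] if it is missing or empty."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return []
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


# Demonstrate with an empty report file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "CDD_aligned.blast")
    open(empty, "w").close()
    print(parse_blast_rows(empty))  # []
```

This only hides the symptom. The database errors earlier in the log still need to be resolved for any annotation to be produced.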

Could you help me with this, please?

Thanks!

running error of recognizer.py

FileNotFoundError: [Errno 2] No such file or directory: '/gss1/home/zlcui/dataBase/fun.tsv'

grep 'fun.tsv' recognizer.py
fun = pd.read_csv(f'{sys.path[0]}/fun.tsv', sep='\t')

It seems fun.tsv should already exist before the script runs.
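
In Python, `sys.path[0]` is normally the directory of the script that was launched, so the `grep` output above suggests that fun.tsv is looked up next to recognizer.py rather than in the resources directory. The sketch below reproduces that path construction. `expected_resource` is a hypothetical helper, not part of reCOGnizer.

```python
import os
import sys


def expected_resource(filename, script_dir=None):
    """Build the path that a lookup such as f'{sys.path[0]}/fun.tsv' resolves to."""
    base = sys.path[0] if script_dir is None else script_dir
    return os.path.join(base, filename)


path = expected_resource("fun.tsv")
print(path, "exists" if os.path.isfile(path) else "is missing")
```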

Please check download_resources(directory)

Hi, thank you for your tool!

I am trying to run from scratch with the "-rd" option, but the resources were not downloaded properly.
Could you check your function? In my case I commented out lines #174 and #175 and changed the indentation level of run_command(f'wget {location} -P {directory}') so that all resources start downloading.
But I can't figure out in which folder this script should be written: 'https://bitbucket.org/scilifelab-lts/lts-workflows-sm-metagenomics/raw/screening_legacy/', 'lts_workflows_sm_metagenomics/source/utils/cog2ec.py'. In the resources folder?

Thank you in advance!

Setup database bug

Hi,

I was setting up the database before running recognizer on Prodigal-predicted ORFs (for multiple samples), so I ran the following command:

recognizer --download-resources --resources-directory .

but I encountered these errors, which seem to prevent the next step of running recognizer on the different sets of predicted ORFs.

cat: 'reCOGnizer_results/blast/KOG_*_aligned.blast': No such file or directory
2023-09-07 17:12:30: Organizing annotation results
cat: 'reCOGnizer_results/blast/CDD_*_aligned.blast': No such file or directory
[1/8] Handling CDD annotation
cat: 'reCOGnizer_results/blast/Pfam_*_aligned.blast': No such file or directory
[2/8] Handling Pfam annotation
cat: 'reCOGnizer_results/blast/NCBIfam_*_aligned.blast': No such file or directory
[3/8] Handling NCBIfam annotation
cat: 'reCOGnizer_results/blast/Protein_Clusters_*_aligned.blast': No such file or directory
[4/8] Handling Protein_Clusters annotation
cat: 'reCOGnizer_results/blast/Smart_*_aligned.blast': No such file or directory
[5/8] Handling Smart annotation
cat: 'reCOGnizer_results/blast/TIGRFAM_*_aligned.blast': No such file or directory
[6/8] Handling TIGRFAM annotation
cat: 'reCOGnizer_results/blast/COG_*_aligned.blast': No such file or directory
[7/8] Handling COG annotation
cog2ko not found! Going to build it
join: -:13109: is not sorted: 1001530.BACE01000023_gene1	COG5184
join: ./string2ko.tsv:61: is not sorted: 264203.ZMO1579 K06889
join: input is not in sorted order
cat: 'reCOGnizer_results/blast/KOG_*_aligned.blast': No such file or directory
[8/8] Handling KOG annotation
2023-09-07 17:15:06: reCOGnizer analysis finished in 00h17m04s

Do you know how I can solve this issue?

Best,
Davide
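
The `join: ... is not sorted` messages in this log come from the `join` utility, which requires both inputs to be sorted on the join field with the same collation. A dictionary-based join does not depend on input order. The sketch below is only an illustration of that idea with toy values. It is not how reCOGnizer builds cog2ko, and the function name is hypothetical.

```python
from collections import defaultdict


def join_on_first_field(left_lines, right_lines):
    """Inner-join two whitespace-separated tables on their first column."""
    right = defaultdict(list)
    for line in right_lines:
        fields = line.split()
        if len(fields) >= 2:
            right[fields[0]].append(fields[1])
    joined = []
    for line in left_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip malformed lines
        for value in right.get(fields[0], []):
            joined.append((fields[0], fields[1], value))
    return joined


# Toy, deliberately unsorted inputs (values are made up).
cog_lines = ["id2 COG0002", "id1 COG0001"]
ko_lines = ["id1 K00001"]
print(join_on_first_field(cog_lines, ko_lines))
# [('id1', 'COG0001', 'K00001')]
```

When staying with the shell tools, sorting both inputs under the same locale setting before calling `join` addresses the same problem.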

Struggling to install

Hi,

I ran the command below as suggested for installation:
(base) PS C:\Users\verno> conda install -c conda-forge -c bioconda recognizer

Here's what I get:
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.

I can't get past this issue - any advice would be great.

Thanks!

fun.tsv cant be found?

Hello, I just installed reCOGnizer and wanted to try it, but I have a problem.
I installed from source as explained and set up the database.
But when I try to run an annotation I get the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/group/opt/reCOGnizer/fun.tsv'

I am calling the program as follows:

recognizer.py -f ../../S1.hg38unaligned.contigs.fraggenescan.w0.faa -o . --resources-directory /group/db/recognizer/

Where should this file be located?
Should it be inside the folder where all the databases are stored?

I am using the latest version.

I would appreciate some help

thanks in advance.

taxonomy.rdf is too large: run stops at ET.parse('taxonomy.rdf').getroot()

Hello iquasere!
The run stopped at 'root = ET.parse('taxonomy.rdf').getroot()'. I found that the file 'taxonomy.rdf' is very big: its total size is 1.2 GB.
My command is
-f inputfile.fasta -o output_directory
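
`ET.parse` builds the whole document tree in memory, which is costly for a file of this size. The standard library also offers `xml.etree.ElementTree.iterparse`, which processes elements as they are read. The following is a minimal sketch of that streaming pattern on a tiny in-memory document. The tag names are made up, and in a real RDF file the tags would carry namespace prefixes of the form `{namespace}name`.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count elements named `tag` while streaming, without keeping their content."""
    count = 0
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == tag:
            count += 1
        element.clear()  # drop the element's children, text and attributes
    return count


# Tiny stand-in document (not the real taxonomy.rdf).
sample = io.BytesIO(b"<root><item>a</item><item>b</item><other/></root>")
print(count_elements(sample, "item"))  # 2
```

`iterparse` also accepts a file name, so the same function could be pointed at a file on disk.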

Error

Hi :) i get this error and have no idea where it comes from:

recognizer.py -o /Users/laura/Desktop/try1 -f /Users/laura/Desktop/BIOINFORMATICS/OLD/Newannotations-PDD/508/Complete/QMA0508.fna
Created /Users/laura/Desktop/try1
/Users/laura/miniconda3/envs/iquasere/bin/Databases/COG0001.smp not found! Retrieving from cdd.tar.gz...
/Users/laura/miniconda3/envs/iquasere/bin/Databases/cdd.tar.gz not found! Downloading...
wget ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/cdd.tar.gz -P /Users/laura/miniconda3/envs/iquasere/bin/Databases
Traceback (most recent call last):
  File "/Users/laura/miniconda3/envs/iquasere/bin/recognizer.py", line 477, in <module>
    main()
  File "/Users/laura/miniconda3/envs/iquasere/bin/recognizer.py", line 391, in main
    download_resources(args.resources_directory)
  File "/Users/laura/miniconda3/envs/iquasere/bin/recognizer.py", line 123, in download_resources
    run_command('wget ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/cdd.tar.gz -P {}'.format(database_directory))
  File "/Users/laura/miniconda3/envs/iquasere/bin/recognizer.py", line 84, in run_command
    subprocess.run(bashCommand.split(), stdout = stdout)
  File "/Users/laura/miniconda3/envs/iquasere/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/Users/laura/miniconda3/envs/iquasere/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/Users/laura/miniconda3/envs/iquasere/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'wget': 'wget'

(iquasere) laura@192-168-1-111 ~ % recognizer.py -o /Users/laura/Desktop/try1 -f /Users/laura/Desktop/BIOINFORMATICS/OLD/Newannotations-PDD/508/Complete/QMA0508.fna -rd /Users/laura/Desktop/data
/Users/laura/Desktop/data/COG0001.smp not found! Retrieving from cdd.tar.gz...
/Users/laura/Desktop/data/cdd.tar.gz not found! Downloading...
wget ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/cdd.tar.gz -P /Users/laura/Desktop/data
Traceback (most recent call last):
  File "/Users/laura/miniconda3/envs/iquasere/bin/recognizer.py", line 477, in <module>
    main()
  File "/Users/laura/miniconda3/envs/iquasere/bin/recognizer.py", line 391, in main
    download_resources(args.resources_directory)
  File "/Users/laura/miniconda3/envs/iquasere/bin/recognizer.py", line 123, in download_resources
    run_command('wget ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/cdd.tar.gz -P {}'.format(database_directory))
  File "/Users/laura/miniconda3/envs/iquasere/bin/recognizer.py", line 84, in run_command
    subprocess.run(bashCommand.split(), stdout = stdout)
  File "/Users/laura/miniconda3/envs/iquasere/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/Users/laura/miniconda3/envs/iquasere/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/Users/laura/miniconda3/envs/iquasere/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'wget': 'wget'
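
Both tracebacks end in `FileNotFoundError: ... 'wget'`, which is what `subprocess.run` raises when the executable itself cannot be found on the PATH, independently of the resources directory. A quick way to confirm this before starting a run is `shutil.which`. The sketch below is a generic check, and the helper name is hypothetical.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(["wget"])
if missing:
    print("Install before downloading resources:", ", ".join(missing))
else:
    print("All required tools were found.")
```

If `wget` is reported as missing, installing it in the active environment should let the download step start.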

Also, could you please explain what kind of databases you need to specify?

Thanks.

Laura

All the databases are not valid

╰─$ recognizer.py -f Bin/8_Concoct_127.fasta -o recognizer_output -rd resources_directory -t 20

Does it convert nucleotides to protein?
I converted my bins into protein sequences using gotranseq and it did not work

2023-01-02 13:13:48: Loading relational tables
2023-01-02 13:13:48: Replacing spaces with underscores
2023-01-02 13:13:49: Running annotation with RPS-BLAST and CDD database as reference.
resources_directory/dbs/cd.aux not found!
Some part of cd was not valid! Will rebuild!
2023-01-02 13:16:34: Running annotation with RPS-BLAST and Pfam database as reference.
resources_directory/dbs/pfam.aux not found!
Some part of pfam was not valid! Will rebuild!
2023-01-02 13:19:01: Running annotation with RPS-BLAST and NCBIfam database as reference.
resources_directory/dbs/NF.aux not found!
Some part of NF was not valid! Will rebuild!
2023-01-02 13:19:23: Running annotation with RPS-BLAST and Protein_Clusters database as reference.
resources_directory/dbs/PRK.aux not found!
Some part of PRK was not valid! Will rebuild!
2023-01-02 13:20:46: Running annotation with RPS-BLAST and Smart database as reference.
resources_directory/dbs/smart.aux not found!
Some part of smart was not valid! Will rebuild!
2023-01-02 13:20:53: Running annotation with RPS-BLAST and TIGRFAM database as reference.
resources_directory/dbs/TIGR.aux not found!
Some part of TIGR was not valid! Will rebuild!
2023-01-02 13:21:52: Running annotation with RPS-BLAST and COG database as reference.
resources_directory/dbs/COG.aux not found!
Some part of COG was not valid! Will rebuild!
2023-01-02 13:22:46: Running annotation with RPS-BLAST and KOG database as reference.
resources_directory/dbs/KOG.aux not found!
Some part of KOG was not valid! Will rebuild!
2023-01-02 13:26:16: Organizing annotation results
[1/8] Handling CDD annotation
[2/8] Handling Pfam annotation
[3/8] Handling NCBIfam annotation
[4/8] Handling Protein_Clusters annotation
[5/8] Handling Smart annotation
[6/8] Handling TIGRFAM annotation
[7/8] Handling COG annotation
[8/8] Handling KOG annotation
/home/nwezejus/miniconda3/envs/reCOGnizer/bin/recognizer.py:844: FutureWarning: save is not part of the public API, usage can give unexpected results and will be removed in a future version
xlsx_report.save()

Version of databases

Hi,

Thanks for the tool. Can you please clarify which version of the KOG database is being downloaded? It seems to be the 2003 version, which is quite outdated by now. Is it possible to include the https://ftp.ncbi.nih.gov/pub/mmdb/cdd/little_endian/ version of KOG?

problem with resources download and rebuilding the databases

Hi,

I recently updated recognizer to the newest version using conda, and the following error appears after running the tool:

recognizer.py -t 12 -f IBB3394_genome_dfast_out/protein.faa -o IBB3394_protein_recognizer_COG_out -rd /mnt/SSD2/recognizer_resources_dir/
2021-12-04 20:26:04: Loading relational tables
2021-12-04 20:26:08: Replacing spaces for commas
2021-12-04 20:26:08: Running annotation with RPS-BLAST and CDD database as reference.
/mnt/SSD2/recognizer_resources_dir/dbs/cd.aux not found! Rebuilding database...
Some part of cd was not valid! Will rebuild!
Traceback (most recent call last):
  File "/home/jang/anaconda3/envs/deepnog/bin/recognizer.py", line 892, in <module>
    main()
  File "/home/jang/anaconda3/envs/deepnog/bin/recognizer.py", line 883, in main
    max_target_seqs=args.max_target_seqs, evalue=args.evalue)
  File "/home/jang/anaconda3/envs/deepnog/bin/recognizer.py", line 794, in multiprocess_workflow
    db_report = pd.DataFrame(columns=['qseqid', 'sseqid', 'SUPERFAMILIES', 'SITES', 'MOTIFS'])
  File "/home/jang/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 435, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "/home/jang/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 239, in init_dict
    val = construct_1d_arraylike_from_scalar(np.nan, len(index), nan_dtype)
  File "/home/jang/anaconda3/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1440, in construct_1d_arraylike_from_scalar
    dtype = dtype.dtype
AttributeError: type object 'object' has no attribute 'dtype'

Any hints?

Bests,
Jan

initial set up

Thank you for your tool - I want to use it to annotate novel sequences and differentiate to cell populations based upon the presence or absence of conserved domains.

However, I am failing at the first step of setting up the databases -- the cdd.tar.gz gets downloaded but not the other databases. Instructions for the initial set up do not make this first part clear.

rebuilding databases each run / databases not valid?

Hello! I've installed reCOGnizer using bioconda but seem to have run into some issues with the databases. I'm storing them in a directory called "dir" (code below).

After my first run of reCOGnizer (which prompted me to install the databases), I continue to get warnings that certain databases are not found or invalid. Looking in the dbs folder, I see that some of these databases have files like cd.00.aux, cd.01.aux, cd.02.aux, etc., but no cd.aux. Could this be the issue?

Any ideas?

Output:

recognizer.py -f contig1.faa -o test2.out -rd dir/
Created test2.out/asn
Created test2.out/blast
Created test2.out/rpsbproc
Created test2.out/tmp
2022-07-22 23:18:51: Loading relational tables
2022-07-22 23:18:51: Replacing spaces with underscores
2022-07-22 23:18:52: Running annotation with RPS-BLAST and CDD database as reference.
dir/dbs/cd.aux not found! Rebuilding database...
Some part of cd was not valid! Will rebuild!
2022-07-22 23:18:57: Running annotation with RPS-BLAST and Pfam database as reference.
dir/dbs/pfam.aux not found! Rebuilding database...
Some part of pfam was not valid! Will rebuild!
2022-07-22 23:19:01: Running annotation with RPS-BLAST and NCBIfam database as reference.
A valid NF split database was found!
2022-07-22 23:19:04: Running annotation with RPS-BLAST and Protein_Clusters database as reference.
dir/dbs/PRK.aux not found! Rebuilding database...
Some part of PRK was not valid! Will rebuild!
2022-07-22 23:19:09: Running annotation with RPS-BLAST and Smart database as reference.
A valid smart split database was found!
2022-07-22 23:19:12: Running annotation with RPS-BLAST and TIGRFAM database as reference.
dir/dbs/TIGR.aux not found! Rebuilding database...
Some part of TIGR was not valid! Will rebuild!
2022-07-22 23:19:16: Running annotation with RPS-BLAST and COG database as reference.
dir/dbs/COG.aux not found! Rebuilding database...
Some part of COG was not valid! Will rebuild!
2022-07-22 23:19:19: Running annotation with RPS-BLAST and KOG database as reference.
dir/dbs/KOG.aux not found! Rebuilding database...
Some part of KOG was not valid! Will rebuild!
2022-07-22 23:19:24: Organizing annotation results
[1/8] Handling CDD annotation
[2/8] Handling Pfam annotation
[3/8] Handling NCBIfam annotation
[4/8] Handling Protein_Clusters annotation
[5/8] Handling Smart annotation
[6/8] Handling TIGRFAM annotation
[7/8] Handling COG annotation
Writing test2.out/COG_quantification.html...
[8/8] Handling KOG annotation
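
To see at a glance which databases have a single `.aux` file and which only have numbered volumes such as cd.00.aux, the dbs folder can be listed with a short script. The sketch below only reports what is on disk. Whether numbered volumes alone count as a valid database is not stated in this thread, and `aux_files` is a hypothetical helper.

```python
import re
import tempfile
from pathlib import Path


def aux_files(dbs_dir, prefix):
    """Return (single .aux present?, sorted names of numbered .aux volumes)."""
    directory = Path(dbs_dir)
    single = directory / f"{prefix}.aux"
    volume_pattern = re.compile(rf"^{re.escape(prefix)}\.\d+\.aux$")
    volumes = sorted(
        p.name for p in directory.glob(f"{prefix}.*.aux")
        if volume_pattern.match(p.name)
    )
    return single.is_file(), volumes


# Demonstrate on a temporary folder that mimics the situation described above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("cd.00.aux", "cd.01.aux", "pfam.aux"):
        (Path(tmp) / name).touch()
    print(aux_files(tmp, "cd"))    # (False, ['cd.00.aux', 'cd.01.aux'])
    print(aux_files(tmp, "pfam"))  # (True, [])
```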

Warning: [rpsblast] Query is Empty! error

Hi,

Thanks for putting together this wonderful resource. I've run reCOGnizer before, and I am now experiencing an issue where the pipeline suggests my FASTA file has empty queries. I've tried several troubleshooting steps (e.g., checking the protein sequence format for unwanted characters, redownloading the databases) and everything seems fine. When I take a subset of the proteins and run them with different pipelines, the queries seem to be processed just fine.

I'm attaching my protein fasta file.

Any thoughts on why I get this error?
no_duplicateT6SE.zip

Thanks in advance!

COG identifications issue

Hi,
Thank you very much for such an amazing tool.

When I try to run recognizer with the COG database I encounter the following issue:

[1/1] Handling COG identifications
grep -E 'K[0-9]{5}$' /media/jang/SSD2/recognizer_resources_dir/protein.info.v11.0.txt | awk '{{if (length($NF) == 6) print $1, $NF}}'
awk '{if (length($4) == 7) print $1" "$4}' /media/jang/SSD2/recognizer_resources_dir/COG.mappings.v11.0.txt | sort | join - /media/jang/SSD2/recognizer_resources_dir/string2ko.tsv
join: -:13109: is not sorted: 1001530.BACE01000023_gene1 COG5184
join: /media/jang/SSD2/recognizer_resources_dir/string2ko.tsv:61: is not sorted: 264203.ZMO1579 K06889
Traceback (most recent call last):
  File "/home/jang/anaconda3/envs/deepnog/bin/recognizer.py", line 461, in <module>
    main()
  File "/home/jang/anaconda3/envs/deepnog/bin/recognizer.py", line 432, in main
    report = cog2ko(report, cog2ko=f'{args.resources_directory}/cog2ko.tsv')
  File "/home/jang/anaconda3/envs/deepnog/bin/recognizer.py", line 272, in cog2ko
    df[['COG', 'KO']].groupby('COG')['KO'].agg([('KO', ','.join)]).reset_index().to_csv(
  File "/home/jang/anaconda3/envs/deepnog/lib/python3.9/site-packages/pandas/core/generic.py", line 3466, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "/home/jang/anaconda3/envs/deepnog/lib/python3.9/site-packages/pandas/io/formats/format.py", line 1105, in to_csv
    csv_formatter.save()
  File "/home/jang/anaconda3/envs/deepnog/lib/python3.9/site-packages/pandas/io/formats/csvs.py", line 237, in save
    with get_handle(
  File "/home/jang/anaconda3/envs/deepnog/lib/python3.9/site-packages/pandas/io/common.py", line 702, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: 'f{directory}/cog2ko.tsv'
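
The path in the final error message is literally `f{directory}/cog2ko.tsv`, which is what a string looks like when the `f` prefix ends up inside the quotes instead of in front of them. The source line itself is not shown in this thread, so this is only a likely explanation. The snippet below shows the difference between the two spellings, using a made-up directory name.

```python
directory = "resources"  # made-up value for illustration

broken = 'f{directory}/cog2ko.tsv'  # plain string: the f is inside the quotes
fixed = f'{directory}/cog2ko.tsv'   # f-string: {directory} is interpolated

print(broken)  # f{directory}/cog2ko.tsv
print(fixed)   # resources/cog2ko.tsv
```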