matteopaluh / kemet

KEGG Module Evaluation Tool

License: Other

Python 78.58% Jupyter Notebook 21.42%

kegg kegg-modules metabolic-models metabolic-reconstruction gap-filling genome-scale-metabolic-model gem

kemet's Introduction

KEMET

KEgg Module Evaluation Tool
"KEMET - a python tool for KEGG Module evaluation and microbial genome annotation expansion"

Script description

The kemet.py script works as a command line tool that serves three main functions:

Evaluate KEGG Module completeness and summarize the metabolic potential of MAGs/Genomes of interest, organizing the results into tables.
Perform HMM-based searches for ortholog genes (KOs) of interest, to expand the KEGG Module completeness evaluation.
Gap-fill genome-scale metabolic models (GSMMs or GEMs) with evidence from nucleotide HMM searches for the KOs of interest.

KEMET is well suited to metagenome-wide analyses as well as single-genome use, helping to better understand microbial metabolic and ecological functions.

Citing KEMET

If you use this program, please cite our paper published in Computational and Structural Biotechnology Journal, available here.

Installation

The program is designed to be easy to install on UNIX-based machines; the code is nonetheless compatible with Windows systems.

Full installation takes just a couple of minutes, following the commands described in the wiki pages.

The tool deliberately has few external dependencies, to ensure stability.

General Setup process and use - Conda environment

Refer to the Setup wiki page to set up the working directory properly.
It is also important to follow the instructions for placing the relevant input files in the appropriate subdirectories and for using the proper format for those files.

Command line (minimal required arguments)

./kemet.py [FASTA_file] -a [FORMAT] --hmm_mode [MODE] --gsmm_mode [MODE] (--skip_hmm) (--skip_gsmm) (--no_genome)

[FASTA_file]: path to the FASTA file of the MAG/Genome of interest (e.g. genomes/bin1.fasta; a bare file name also works). With further arguments (see --no_genome) it can instead be the path to a KEGG annotation file.

-a [FORMAT]: program used to annotate KEGG KOs, i.e. the KEGG annotation format (one of eggnog, kaas or kofamkoala), used to generate the KEGG MODULES recap tables. The default file extension of the annotation file must be kept (e.g. .emapper.annotations, .ko).

--hmm_mode [MODE]: when HMM analysis is desired, use this parameter to indicate a subset of KOs to search further using profile HMMs. [MODE] should be one of onebm, module, kos, as described in the wiki pages.

--gsmm_mode [MODE]: when GSMM/GEM gapfilling is desired, use this parameter to indicate whether to perform de-novo GSMM/GEM reconstruction or add reactions to an existing model. [MODE] should be either denovo or existing, respectively, as described in the wiki pages.

--skip_hmm: use this to stop after KEGG MODULES Completeness Evaluation. The only output is then the organized tables of metabolic potential.

--skip_gsmm: use this to stop after HMM analysis.

--no_genome: use this when the positional argument is (the path to) a file with KEGG annotations, so that MAG/Genome operations are not included. With this option, KEMET stops after KEGG MODULES Completeness Evaluation.

Other suggested optional parameters include:

--log: store KEMET progress in a log file (STRONGLY suggested).
-v: print more information, for progress reporting purposes.
-q: print less information and silence MAFFT and HMMER soft-errors (suggested).

--as_kegg: changes how incomplete KEGG MODULES are summarized in recap tables, following the KEGG-Mapper convention: Modules with fewer than 3 blocks are marked as INCOMPLETE regardless of the number of incomplete blocks. This implies a more conservative approach to annotations.

--threshold_value [VALUE]: use a different quality-filter value to distinguish genuine HMM hits (default: 0.43).
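
As a rough illustration of how the arguments above combine, the following Python sketch assembles a kemet.py command line as a list of strings. The helper name build_kemet_command and its keyword arguments are hypothetical and are not part of KEMET; only the flags themselves are taken from the descriptions above, and the mode values are passed through unchecked.

```python
from typing import List, Optional

# Annotation formats listed in the description of -a above.
ANNOTATION_FORMATS = ("eggnog", "kaas", "kofamkoala")


def build_kemet_command(
    fasta_file: str,
    annotation_format: str,
    hmm_mode: Optional[str] = None,
    gsmm_mode: Optional[str] = None,
    skip_hmm: bool = False,
    skip_gsmm: bool = False,
    log: bool = True,
    quiet: bool = False,
    script: str = "./kemet.py",
) -> List[str]:
    """Hypothetical helper: build the argument list for one KEMET run."""
    if annotation_format not in ANNOTATION_FORMATS:
        raise ValueError(f"unknown annotation format: {annotation_format!r}")

    cmd = [script, fasta_file, "-a", annotation_format]
    # Mode values are not validated here; see the wiki pages for valid ones.
    if hmm_mode is not None:
        cmd += ["--hmm_mode", hmm_mode]
    if gsmm_mode is not None:
        cmd += ["--gsmm_mode", gsmm_mode]
    if skip_hmm:
        cmd.append("--skip_hmm")
    if skip_gsmm:
        cmd.append("--skip_gsmm")
    if log:
        cmd.append("--log")
    if quiet:
        cmd.append("-q")
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True) from inside a KEMET working directory that has been set up as described in the wiki.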

Script details

For detailed info on the process/outputs of each KEMET task, as well as info on custom KEGG Modules & other, please refer to the wiki pages.

Credits

Developed by Matteo Palù at Università degli Studi di Padova (2020-2023).

kemet's Issues

'ktest' error

While running this program, I get the following error:
python kemet.py genomes/test.fna -a eggnog --skip_hmm --skip_gsmm
Traceback (most recent call last):
  File "./kemet.py", line 2450, in <module>
    if ktest in sorted(os.listdir()):
NameError: name 'ktest' is not defined

conversion to python package

Any plans on making KEMET a legit python package that can be installed via pip (from pypi)? I see that the setup.py is non-standard. Converting the current code in the setup.py to a separate script that is referenced via scripts: in a standard setup.py would likely be all that is needed.

Custom Modules?

Hi,
I saw that the README indicated that custom modules could be added.
Should they just be included in the kk_files folder, or do they need to be included elsewhere?
Cheers
Greg

[error] problems with KoFamKOALA

Hi,

Thanks for the kemet package. The package and the article look awesome.
I installed kemet following the instructions, and when I run it I get the following error:

python kemet.py genomes/mcs.fasta -a kofamkoala --hmm_mode kos

Traceback (most recent call last):
  File "kemet.py", line 2514, in <module>
    if LOGflag:
NameError: name 'LOGflag' is not defined

Incorrect recognition of MAG filename

Hello, thanks for developing this useful tool.

I put co_metabat2.1.fa, co_metabat2.12.fa, co_metabat2.100.fa, co_metabat2.199.fa in the genomes folder at the same time, and their annotation files are also all placed in the KEGG_annotations folder.
If I run kemet.py -a eggnog --skip_hmm genomes/co_metabat2.1.fa, only reportKMC_co_metabat2.199.tsv appears in the reports_tsv folder, whereas co_metabat2.1.ktest, co_metabat2.12.ktest, co_metabat2.100.ktest and co_metabat2.199.ktest all appear in the ktests folder.

Can you help me with this problem?
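
One possible explanation, offered as an assumption and not checked against the KEMET source, is that file names are compared by prefix rather than by exact name. The Python sketch below contrasts the two approaches using the file names from this report; the function names prefix_matches and exact_match are hypothetical.

```python
from pathlib import Path
from typing import List, Optional


def prefix_matches(genome_file: str, ktest_files: List[str]) -> List[str]:
    """Return every ktest file whose name starts with the genome's stem."""
    key = Path(genome_file).stem  # "co_metabat2.1" for "co_metabat2.1.fa"
    return [k for k in ktest_files if Path(k).name.startswith(key)]


def exact_match(genome_file: str, ktest_files: List[str]) -> Optional[str]:
    """Return the single ktest file whose stem equals the genome's stem."""
    key = Path(genome_file).stem
    for k in ktest_files:
        # Only the final extension is stripped, so "2.1" and "2.12" differ.
        if Path(k).stem == key:
            return k
    return None


ktests = [
    "co_metabat2.1.ktest",
    "co_metabat2.12.ktest",
    "co_metabat2.100.ktest",
    "co_metabat2.199.ktest",
]
# Prefix comparison picks up all four files; exact comparison picks one.
print(prefix_matches("genomes/co_metabat2.1.fa", ktests))
print(exact_match("genomes/co_metabat2.1.fa", ktests))
```

Until the cause is confirmed, running each MAG with a name that is not a prefix of another MAG's name (for example bin_001, bin_012) may be a workaround.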

"ktest" file error - due to file naming convention

I met an error.
$ ./kemet.py genomes/x23.fna -a eggnog --skip_hmm

Traceback (most recent call last):
  File "/root/KEMET/./kemet.py", line 2440, in <module>
    if ktest in sorted(os.listdir()):
NameError: name 'ktest' is not defined

What is the problem?

Merge multiple KEMET results

Dear,
I ran KEMET on several samples. Can you give me some tips on how to merge the resulting tables into one?
Best regards,
Leandro.
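
A minimal sketch of one way to do this in Python, under assumptions that are not confirmed by the KEMET documentation: each report is a tab-separated file with a header row, and its file name follows the reportKMC_<genome>.tsv pattern seen in the reports_tsv folder. The function name merge_reports is hypothetical. The rows of all reports are stacked into one table, with an added genome column taken from the file name.

```python
import csv
from pathlib import Path
from typing import Iterable


def merge_reports(report_paths: Iterable[str], merged_path: str,
                  prefix: str = "reportKMC_") -> int:
    """Stack several TSV reports into one file; return the number of rows."""
    rows = []
    columns = ["genome"]  # extra column identifying the source report
    for path in map(Path, report_paths):
        stem = path.stem
        genome = stem[len(prefix):] if stem.startswith(prefix) else stem
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            # Keep the union of all headers, in first-seen order.
            for col in reader.fieldnames or []:
                if col not in columns:
                    columns.append(col)
            for row in reader:
                row["genome"] = genome
                rows.append(row)

    with Path(merged_path).open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t",
                                restval="", extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

For example, merge_reports(sorted(str(p) for p in Path("reports_tsv").glob("reportKMC_*.tsv")), "merged_reports.tsv") would produce one long table that can then be pivoted or filtered per module.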

Create .kk files

Hi,

would it be possible to share the script that you are using to create .kk files?
I need to pin the completeness analysis to a specific KEGG version that I am also using for different other analyses.

add_taxonomy_from_gtdb-tk.py - help!

I am trying to run this script, but it keeps returning this message:
"The genomes.instruction file has been updated with 0 genome(s) taxonomy indications, using '.fasta' extension"
Could you please tell me if there is anything I can do to fix it?

Module completeness as stand-alone package

First of all, thank you for putting together this really great package.
I find the module completeness assessment really unique, with only a few other lesser options out there (e.g., KeggDecoder). I also like the way you break down the module definition in .kk files for improved completeness assessment. Therefore, I look forward to seeing continued support and development for this function.

In my case, I use ko annotations made within a different pipeline to assess module completeness with KEMET. In theory I would only need the annotation .txt file, but I have to also provide the genome assembly .fasta file to run the script (which is not really needed when running with --skip_hmm and --skip_gsmm arguments).

If I could make a feature request/suggestion, it would be to separate the module completeness functionality where it accepts just ko annotation files (either a path to a file or a path to a folder for batch operation).

It would also be great to have a stand-alone tool to create module definition .kk files from the official kegg module .txt files, for situations where KEMET is not continuously supported and current .kk files become obsolete.

Thank you for giving these some consideration.

How to use the script to convert the KEGG module file to <module_id>.kk?

Hi,
Thank you for the excellent tool! It's very helpful to me!

I have a question for you. Do you have a script or program to convert the KEGG module file to <module_id>.kk?

The KEGG module file I mean here is:
M00001_Glycolysis_(Embden-Meyerhof_pathway).txt

I'm asking because I want to assess the completeness of some KEGG pathways in the bacterial genome and I can't process the KEGG htext format files in bulk (unless I convert them manually).

I hope you have a solution for me...

I really appreciate any help you can provide.
Hao Jin

Error regarding output directory

Thank you for your code, but I encountered an issue when running it.

This is how I used it. Under the 'eggnog' directory, there is a file named 'emapper.annotations', and under the 'genome' directory, there is a file named 'genome.fna'. The code I used is
python ./kemet.py -I ./eggnog -a eggnog --skip_hmm --skip_gsmm ./genome -q --log --path_output ./output

However, an error occurred: FileNotFoundError: [Errno 2] No such file or directory: './output/ktests/'

Strangely, when I manually created the entire folder, the code seemed to run smoothly, but no output file was generated.

I got the same error even when using your test files.
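
Until this is addressed, a possible workaround is to create the expected subdirectories before launching KEMET. The Python sketch below is an assumption-laden illustration, not KEMET's own code: the helper name ensure_output_dirs is hypothetical, and the subdirectory names ktests and reports_tsv are simply those mentioned in the issues on this page, so the real list may be longer.

```python
import os
from typing import List, Sequence

# Subdirectory names seen in the issues above; the full set is an assumption.
OUTPUT_SUBDIRS = ("ktests", "reports_tsv")


def ensure_output_dirs(path_output: str,
                       subdirs: Sequence[str] = OUTPUT_SUBDIRS) -> List[str]:
    """Create path_output and the given subdirectories if they are missing."""
    created = []
    for name in subdirs:
        full = os.path.join(path_output, name)
        # exist_ok=True makes the call safe to repeat on later runs.
        os.makedirs(full, exist_ok=True)
        created.append(full)
    return created
```

Calling ensure_output_dirs("./output") before the command above would avoid the FileNotFoundError for ./output/ktests/, although it does not explain why no output file was generated afterwards.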

Dealing with different MAG completeness?

Hi,
Very nice tool that I'm excited to try.
I wanted to know how (if it does) the software deals with different levels of MAG completeness.
Best
Greg

Kofamscan format problem

Hi,

Thank you for developing KEMET. I'm very interested in making use of the three main modules included in this package. Nonetheless, I'm facing a couple of issues and I kindly request some assistance.

For testing purposes, I'm currently working with a high-quality MAG with filename KEMET/genomes/SB_biofilm_MAG_1_.fa. KEGG annotations were performed with KofamKOALA, and were included as a tsv file (KEMET/KEGG_annotations/SB_biofilm_MAG_2_.tsv). I'm running the --hmm_mode modules option with "M00001" as the only input for the "module_file.instruction" file. The "genomes.instruction" file contains the following (tab separated):

id      taxonomy        universe
SB_biofilm_MAG_1_.fa    Bacteroidetes   gramneg

Currently I'm running the following command in the KEMET directory: ./kemet.py genomes/SB_biofilm_MAG_1_.fa -a kofamkoala --log --hmm_mode modules --skip_gsmm

Issues:

For the KEGG modules completeness evaluation, I'm getting unexpected results compared to the output from KEGG mapper. While in KEGG mapper I'm having multiple complete modules (e.g., M00001), both outputs from KEMET (.tsv and .txt) display that every module is incomplete (with 0% completeness). Here is an example of how the output .txt file looks:

M00001.kk       M00001_Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
%       0.0     0__10   INCOMPLETE
1.      K00844, K12407, K00845, K00886, K08074, K00918
2.      K01810, K06859, K13810, K15916
3.      K00850, K16370, K00918
4.      K01623, K01624, K11645, K16305, K16306
5.      K01803
6.      K00134, K00150, K11389
7.      K00927, K11389
8.      K01834, K15633, K15634, K15635
9.      K01689
10.     K00873, K12406

M00002.kk       M00002_Glycolysis, core module involving three-carbon compounds
%       0.0     0__6    INCOMPLETE
1.      K01803
2.      K00134, K00150, K11389
3.      K00927, K11389
4.      K01834, K15633, K15634, K15635
...

Despite the previous issue, I tried to run the --hmm_mode modules option with "M00001". While running the HMM search function, the following is printed:

Alignment input open failed.
   couldn't guess alphabet (maybe try --dna/--rna/--amino if available)
   while reading file K12406.msa
   while parsing for aligned FASTA format
Alignment input open failed.
   couldn't guess alphabet (maybe try --dna/--rna/--amino if available)
   while reading file K01624.msa
   while parsing for aligned FASTA format
...

I looked for the *.msa files at their respective directories, and it seems that the files are blank. Consequently, the following is printed on screen:

Error: File existence/permissions problem in trying to open query file K12406.h$
HMM file K12406.hmm not found (nor an .h3m binary of it)


Error: File existence/permissions problem in trying to open query file K01624.h$
HMM file K01624.hmm not found (nor an .h3m binary of it)
...

After the completion of the nhmmer significant hits step, the following traceback is printed on screen:

Traceback (most recent call last):
  File "./kemet.py", line 2536, in <module>
    HMM_hits_longestTRANSLATED_dict = HMM_hits_longest_translated_sequences(HMM$
  File "./kemet.py", line 1290, in HMM_hits_longest_translated_sequences
    max_len_dict[fasta_nf].append(seq_max) # add the longest to list
KeyError: '>SB_biofilm_MAG_1'

If necessary, I would gladly share via e-mail the original nucleotide fasta and KEGG annotations files.

Thank you so much in advance.

Best,

David

Equivocal README & file-naming problems

Hello and thanks for creating this software,

I have gene-to-ko annotations for all my MAGs. I would like to use KEMET to calculate the completeness of KEGG modules for these MAGs.

Unfortunately, I have not yet managed to do so. I think the instructions in README.md are not up-to-date. The file setup.py is mentioned in multiple places but seems to be missing from the repository. It is unclear to me why I cannot run the tool without providing a FASTA file when I'm using --skip_hmm and --skip_gsmm. The help text references the genomes.instruction file in this context, but that one is also not part of the repository.

I'm also not sure if I am providing KO annotations in the right format. For each MAG, I created a tab-separated file with gene identifiers in the first column and KOs (e.g. K24042) in the second column. They are named bin1_ko.txt, bin2_ko.txt etc.
If one gene has multiple KO annotations, the file will contain one row for each of those annotations.
Is this approach correct? What would I put for --annotation_format? If my approach is incorrect, can you give me an example of how I should format my input to match one of the valid annotation formats?

Thank you very much for any help.

Kind Regards,
Tom