Hi, Thank you for developing KEMET. I'm very interested in making us

Kofamscan format problem about kemet HOT 7 CLOSED

David-Madrigal commented on June 10, 2024

Kofamscan format problem

from kemet.

Comments (7)

David-Madrigal commented on June 10, 2024 1

Sure!

I'm attaching both original files here. "SB_biofilm_MAG_1_.txt" is the KEGG annotation file and "SB_biofilm_MAG_1_FASTA.txt" is the .fa file (I changed the name and extension since GitHub does not accept .fa files).

SB_biofilm_MAG_1_.txt
SB_biofilm_MAG_1_FASTA.txt

Best,
David

from kemet.

Matteopaluh commented on June 10, 2024

Hi David,
I guess all the errors are connected and have a few ideas on which could be the problem.

FIrst of all, you're possibly be using KoFamKOALA annotation in a format different from the one accepted by KEMET (have a look at the example files for reference), which I believe is the cause of point 1 on your errors list.
EDIT: the files you're using are called in two different ways, i.e. "MAG_1_" and "MAG_2_,", I assume this is a typo, otherwise that could also point to another error

Second, I would ask if you're using the later versions of KEMET (i.e. you've cloned the repository after the commit cef9acb performed on July 8th - here) KEGG developers tweaked with the url to access their API. This could have interefered with downloading Bacteroidetes gene sequences and resulting in the empty .msa files.
Furthermore I'm assuming you've rightfully added a line about which module to check for via HMMs in the module_file.instruction, otherwise that could also be a source of error.
Related to the third point, not finding any sequences, the nhmmer command results in an unspecified error, which is something I could look into and handle better (by adding error flags) for further users.

All that being said I would suggest to clone into the latest repository, if not previously done, and starting anew with empty report and HMM folders, as well as using a suited annotation format (have a look at the toy folder).
If the problem persist, feel free to send the files via mail and I'll look into it thoroughly!

Best,
Matteo

from kemet.

David-Madrigal commented on June 10, 2024

Hi Matteo,

Thank you for your response.

I apologize, the KEGG annotation file I used was called "MAG_1_". the "2" That was a typo while writing the previous comment.

For the following, I cloned the latest repository.

I changed the KEGG annotation file extension from .tsv to .txt, as suggested. Additionally, I replaced tab characters for whitespace characters, just in case. Yet, I'm getting the same output:

M00001.kk       M00001_Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
%       0.0     0__10   INCOMPLETE
1.      K00844, K12407, K00845, K00886, K08074, K00918
2.      K01810, K06859, K13810, K15916
3.      K00850, K16370, K00918
4.      K01623, K01624, K11645, K16305, K16306
5.      K01803
6.      K00134, K00150, K11389
7.      K00927, K11389
8.      K01834, K15633, K15634, K15635
9.      K01689
10.     K00873, K12406

M00002.kk       M00002_Glycolysis, core module involving three-carbon compounds
%       0.0     0__6    INCOMPLETE
1.      K01803
2.      K00134, K00150, K11389
3.      K00927, K11389
4.      K01834, K15633, K15634, K15635
...

Alternatively, I used the input to KEGG mapper file and the output is the correct one:

M00001.kk       M00001_Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
%       100.0   10__10  COMPLETE

M00002.kk       M00002_Glycolysis, core module involving three-carbon compounds
%       100.0   6__6    COMPLETE

M00003.kk       M00003_Gluconeogenesis, oxaloacetate => fructose-6P
%       100.0   8__8    COMPLETE

M00004.kk       M00004_Pentose phosphate pathway (Pentose phosphate cycle)
%       62.5    5__8    INCOMPLETE
1.      K13937, K00036, K19243
2.      K13937, K01057, K07404
3.      K00033

M00005.kk       M00005_PRPP biosynthesis, ribose 5P => PRPP
%       100.0   1__1    COMPLETE
...

I believe this has something to do with the original detail file in .tsv format (which I got from command-line KofamKOALA). Due to this, I end up using the mapper format instead.

For the HMM search module, I kept having the same traceback error as before. I realized that the traceback error shows " KeyError: '>SB_biofilm_MAG_1' ", but my files were named as "SB_biofilm_MAG_1_.fa" and "SB_biofilm_MAG_1_.txt" respectively (i.e., with an additional underscore). Changing the file names to "SB_biofilm_MAG01" did the trick. Maybe that ending underscore has something to do with it.

Although everything is running smoothly now, several .msa files are still empty, resulting in non-exsistent .hmm files. I inspected several of the KEGG IDs (e.g., K22896) on the KEGG website and found that no entries were associated to the input phylum (in this case, Bacteroidetes). Hence, it makes sense that the .msa files were empty.

Thank you very much once again!

Best,
David

from kemet.

Matteopaluh commented on June 10, 2024

Great! Now that I see it

Since the command you launched did not have the "quiet" argument (-q), the soft errors from MAFFT and HMMER were still printed on screen, as in your point 2! I didn't see it because I'm used to the quiet setting myself..

The genes that are not present even once in the KEGG genes of the phylum of choice result in blank .msa, I get it is somehow redundant but without any priors it was the choice I made to have the full spectrum of genes. KEGG database becomes bigger by the months and gets reorganised in terms of new orthologs being split from general to specifics to some phylogenetic groups.

I'm not sure if the file name ending with an underscore was the real deal, as the key error traceback refers to the names of the contig fasta headers, but anyhow glad it worked!

For the sake of good practice and to avoid other similar issues, could I still ask you the two files, to sort out the initial problem with the annotation file? This will likely help KoFamKOALA users, as that program really speeds up functional annotation matters.

Best,
Matteo

from kemet.

Matteopaluh commented on June 10, 2024

Ok, looking thoroughly into the KEGG annotation file format I got this was a two-sided problem.

First, the file you shared encodes an erroneous information in line#2 with respect to how that is handled in KEMET
i.e. dashes do not represent how following lines fields are structured, and kemet.py uses that as info to parse the annotation and recover KOs.

Second, the code handles wrongly the spacings, so I'm patching that asap. And thanks a lot for this "test", I only used the two Kofam files obtained from the web-server so I was missing this use case!

Just the last question before I'm resolving this, which --format option did you use for kofam_scan? Secondly, have you edited in any way the file obtained from the command-line to obtain the one you shared?

Best,
Matteo

from kemet.

David-Madrigal commented on June 10, 2024

Yes! That file was edited, since I got .tsv format from KofamKOALA (command-line), and I was trying to adjust to the .txt format by making a whitespace substitution to the tab values.

I'm sharing with you the original .tsv file. This one was not edited.

SB_biofilm_MAG_1_TSV.txt

from kemet.

Matteopaluh commented on June 10, 2024

I'll close this issue after the fix performed on July.

If you'd need some other help please reopen.
Best,
Matteo

from kemet.

Kofamscan format problem about kemet HOT 7 CLOSED

Comments (7)

Related Issues (15)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent