iquasere / keggcharter


A tool for representing genomic potential and transcriptomic expression into KEGG pathways

License: BSD 3-Clause "New" or "Revised" License

Shell 0.46% Python 98.96% Dockerfile 0.58%

metabolic-pathways metagenomics metatranscriptomics metaproteomics kegg-pathway kegg-ortholog

keggcharter's Introduction

KEGGCharter

A tool for representing genomic potential and transcriptomic expression into KEGG pathways.

Features

KEGGCharter is a user-friendly implementation of KEGG API and Pathway functionalities. It allows for:

Conversion of KEGG IDs, EC numbers and COG IDs to KEGG Orthologs (KO) and of KO to EC numbers
Representation of the metabolic potential of the main taxa in KEGG metabolic maps (each distinguished by its own colour)
Representation of differential expression between samples in KEGG metabolic maps (the collective sum of each function will be represented)

Installation

KEGGCharter can be easily installed with Bioconda.

conda install -c conda-forge -c bioconda keggcharter

Running KEGGCharter

To run KEGGCharter, an input file must be supplied. This file only needs to contain one column with either KEGG IDs, KOs or EC numbers. Beyond that:

to obtain distinct taxonomic identifications in the maps, a column with taxonomic identification must be specified with the -tc parameter. If no such column exists, KEGGCharter must be run with the -it parameter.
to obtain maps with differential expression, at least one column with genomic and/or transcriptomic quantification must be specified with the -qcol parameter. If no such column exists, KEGGCharter must be run with the -iq parameter.
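As a rough illustration of the minimal input layout described above, the following Python sketch writes a small tab-separated table with one KO column and one EC number column. The column names "KO" and "EC number" are only example names, the rows are illustrative values, and the helper write_input_table is hypothetical (it is not part of KEGGCharter).

```python
import csv
import io

# Illustrative rows only. The column names are examples; whatever names are
# used must be passed to the matching KEGGCharter options (e.g. -koc, -ecc).
ROWS = [
    {"KO": "K00361", "EC number": "1.7.1.4"},
    {"KO": "K02567", "EC number": "1.9.6.1"},
    {"KO": "K04561", "EC number": "1.7.2.5"},
]


def write_input_table(rows, handle):
    """Write rows as a tab-separated table with a header line (hypothetical helper)."""
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer; a real run would write to a .tsv file instead.
buffer = io.StringIO()
write_input_table(ROWS, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

Taxonomy or quantification columns, if available, would simply be extra columns in the same table.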

An example input file is available here. It contains all fields referenced above, and should be used as guidance for building inputs for KEGGCharter. The following command will obtain metabolic representations for "Methane Metabolism" (KEGG map00680) with KEGGCharter:

keggcharter -f keggcharter_input.tsv -rd resources_directory -keggc 'KEGG' -koc 'KO' -ecc 'EC number' -cogc 'COG ID' -iq -it "My community" -mm 00680 -o first_time_running_KC

After it is over, you should have, inside the first_time_running_KC folder:

additional information concerning your data in the file KEGGCharter_results.tsv
maps in PNG format inside a maps folder
JSONs with the information painted on the maps inside a json folder

Additionally, you should have the data_for_charting.tsv and taxon_to_mmap_to_orthologs files. These are there so KEGGCharter can be run again, for other maps, by running the same command as before, but with the additional parameter --resume. With this parameter, KEGGCharter will look for those files, and new maps can be generated by changing the --metabolic-maps parameter. No need for repeated KOs and EC numbers retrieval!

What maps are available?

You can see what maps are available for the --metabolic-maps parameter by running keggcharter --show-available-maps.

The first time KEGGCharter runs, it will take a long time

KEGGCharter needs KGMLs and EC numbers to boxes relations, which it will automatically retrieve for every map inputted. This might take some time, but you only need to run it once.

Default directory for storing these files is the folder containing the keggcharter script, but it can be customized with the -rd parameter.

Outputs

KEGGCharter produces a table from the inputted data with two new columns, KO (KEGGCharter) and EC number (KEGGCharter), containing the results of converting KEGG IDs to KOs and KOs to EC numbers, respectively. This file is saved as KEGGCharter_results.tsv in the output directory. KEGGCharter then represents this information in KEGG metabolic maps. If information is available as a result of (meta)genomics analysis, KEGGCharter will locate the boxes whose functions are present in the organisms' genomes, mapping their genomic potential. If (meta)transcriptomics data is available, KEGGCharter will consider the sample as a whole, measuring gene expression and performing a multi-sample comparison for each function in the metabolic maps.

maps with genomic information are identified with the prefix "potential_" from genomic potential (Fig. 1).

Fig. 1 - KEGG metabolic map of methane metabolism, with identified taxa for each function from a simulated dataset.

maps with transcriptomic information are identified with the prefix "differential_" from differential expression (Fig. 2).

Fig. 2 - KEGG metabolic map of methane metabolism, with differential analysis of quantified expression for each function from a simulated dataset.

Arguments for KEGGCharter

KEGGCharter provides several options for customizing its workflow.

  -h, --help            show this help message and exit
  -f FILE, --file FILE  TSV or EXCEL table with information to chart
  -o OUTPUT, --output OUTPUT
                        Output directory
  -rd RESOURCES_DIRECTORY, --resources-directory RESOURCES_DIRECTORY
                        Directory for storing KGML and CSV files.
  -mm METABOLIC_MAPS, --metabolic-maps METABOLIC_MAPS
                        IDs of metabolic maps to output
  -qcol QUANTIFICATION_COLUMNS, --quantification-columns QUANTIFICATION_COLUMNS
                        Names of columns with quantification
  -tls TAXA_LIST, --taxa-list TAXA_LIST
                        List of taxa to represent in genomic potential charts (comma separated)
  -not NUMBER_OF_TAXA, --number-of-taxa NUMBER_OF_TAXA
                        Number of taxa to represent in genomic potential charts (comma separated)
  -keggc KEGG_COLUMN, --kegg-column KEGG_COLUMN
                        Column with KEGG IDs.
  -koc KO_COLUMN, --ko-column KO_COLUMN
                        Column with KOs.
  -ecc EC_COLUMN, --ec-column EC_COLUMN
                        Column with EC numbers.
  -cogc COG_COLUMN, --cog-column COG_COLUMN
                        Column with COG IDs.
  -tc TAXA_COLUMN, --taxa-column TAXA_COLUMN
                        Column with the taxa designations to represent with KEGGCharter. NOTE - for valid taxonomies, check: https://www.genome.jp/kegg/catalog/org_list.html
  -iq, --input-quantification
                        If input table has no quantification, will create a mock quantification column
  -it INPUT_TAXONOMY, --input-taxonomy INPUT_TAXONOMY
                        If no taxonomy column exists and there is only one taxon in question.
  -t THREADS, --threads THREADS
                        Number of threads to run KEGGCharter with [max available]
  --step STEP           Number of IDs to submit per request through the KEGG API [40]
  --map-all             Ignore KEGG taxonomic information. All functions for all KOs will be represented, even if they aren't attributed by KEGG to the specific species.
  --include-missing-genomes
                        Map the functions for KOs identified for organisms not present in KEGG Genomes.
  --resume              If data inputed has already been analyzed by KEGGCharter.
  -v, --version         show program's version number and exit

Special functions:
  --show-available-maps
                        Outputs KEGG maps IDs and descriptions to the console (so you may pick the ones you want!)

Mock imputation of quantification and taxonomy

Sometimes, not all information required for KEGGCharter will be available. In these cases, KEGGCharter may use mock imputations of quantification and/or taxonomy.

To input mock quantification, run with the --input-quantification parameter. This will attribute a quantification of 1 to every protein in the input dataset. This replaces the --quantification-columns parameter.

To input mock taxonomy, run with the --input-taxonomy [mock taxonomy] parameter, where [mock taxonomy] should be replaced with the value to be presented in the maps. This will attribute that taxonomic classification to every protein in the input dataset, which might be useful to, for example, represent "metagenome" in the genomic potential maps. This replaces the --taxa-column parameter.
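The effect of the two mock options can be pictured with a minimal Python sketch. It is an assumption about the resulting table, not KEGGCharter's own code: the function add_mock_columns and the column names "Quantification (mock)" and "Taxon (mock)" are hypothetical.

```python
def add_mock_columns(rows, mock_taxonomy=None, mock_quantification=False):
    """Return copies of the rows with placeholder columns added.

    Hypothetical helper: mirrors the described behaviour of
    --input-quantification (a value of 1 for every protein) and
    --input-taxonomy (the same taxon label for every protein).
    """
    result = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        if mock_quantification:
            new_row["Quantification (mock)"] = 1
        if mock_taxonomy is not None:
            new_row["Taxon (mock)"] = mock_taxonomy
        result.append(new_row)
    return result


proteins = [{"KO": "K00361"}, {"KO": "K02567"}]
mocked = add_mock_columns(proteins, mock_taxonomy="metagenome", mock_quantification=True)
print(mocked)
```

With a single mock taxon and a quantification of 1 everywhere, the maps can only show presence or absence of each function.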

Handling missing information in KEGG Genomes

KEGGCharter attempts to download taxon-specific KGMLs for organisms in KEGG Genomes, and uses them to determine which functions are available for which organisms. Since KOs are promiscuous, the same KO will likely map both to functions that organisms have available in their genomes and to functions not available to them. Using this workflow of KEGGCharter will produce maps such as the example in Fig. 3.

Fig. 3 - Original KEGGCharter workflow. Only arcticus had KOs with functions for the TCA cycle attributed that, simultaneously, were present in the KGML for the TCA cycle and the taxon arcticus.

This type of workflow uses both taxon-specific information and results from the datasets inputted. All functions represented are validated by KEGG (i.e., those functions are available for those organisms), but many identifications may be lacking, since information at KEGG is often incomplete.

Setting "--include-missing-genomes" represents organisms that are not in KEGG Genomes

Organisms that are not identified in KEGG Genomes can still be represented by running KEGGCharter with the option --include-missing-genomes. All functions for the KOs identified for that organism will be represented (Fig. 4).

Fig. 4 - KEGGCharter output expanded with --include-missing-genomes parameter. hydrocola is not present in KEGG Genomes, but all functions attributed to its KOs are still represented.

This setting still provides validated information for the taxa that are present in KEGG Genomes, while also allowing representation of organisms that are not. It should offer the best compromise between false positives and false negatives.

Setting "--map-all" ignores KEGG Genomes completely, and represents all functions identified

Functions that are not present in organism-specific KGMLs can still be represented by running KEGGCharter with the option --map-all. This will bypass all taxon-specific KGMLs, and map all functions for all KOs present in the input dataset (Fig. 5).

Fig. 5 - KEGGCharter output expanded with --map-all parameter. No functions for oleophylus and franklandus were simultaneously present in the KOs identified and available in their KGMLs. In this case, the requirement for presence in the KGMLs is bypassed, and all functions are represented for all taxa.

This setting represents the most information on the KEGG maps and will produce the most colourful representations, but will likely return many false positives. Maps produced should be analyzed with caution. This setting may be required, however, if information for organisms in KEGG Genomes is very incomplete.
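The three behaviours described in this section (default, --include-missing-genomes and --map-all) can be summarised in a short Python sketch. This is a simplified reading of the text above, not KEGGCharter's implementation: the function kos_to_represent and its arguments are hypothetical, and a taxon absent from KEGG Genomes is modelled by passing None for its KGML orthologs.

```python
def kos_to_represent(taxon_kos, kgml_kos, map_all=False, include_missing_genomes=False):
    """Decide which KOs of one taxon would be drawn on a map (hypothetical helper).

    taxon_kos: KOs identified for the taxon in the input data.
    kgml_kos:  KOs listed in the taxon-specific KGML, or None when the taxon
               is not present in KEGG Genomes.
    """
    taxon_kos = set(taxon_kos)
    if map_all:
        # --map-all: bypass taxon-specific KGMLs entirely.
        return taxon_kos
    if kgml_kos is None:
        # Taxon missing from KEGG Genomes: only drawn with --include-missing-genomes.
        return taxon_kos if include_missing_genomes else set()
    # Default: keep only the KOs that KEGG attributes to this taxon.
    return taxon_kos & set(kgml_kos)


identified = {"K00001", "K00002"}
print(kos_to_represent(identified, {"K00001"}))                       # validated subset
print(kos_to_represent(identified, None))                             # nothing drawn
print(kos_to_represent(identified, None, include_missing_genomes=True))
print(kos_to_represent(identified, {"K00001"}, map_all=True))
```

The KO identifiers in the usage lines are placeholders chosen only to exercise each branch.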

Referencing KEGGCharter

If you use KEGGCharter, please cite its publication.


keggcharter's Issues

GTDB taxonomic info

Hi Joao,

As I use taxonomic info obtained through kraken (GTDB) instead of UPIMAPI, I did not expect all genera to be mapped, but it seems that practically none are mapped, not even the infamous Lactobacillus. These are the messages I get for practically all genera:

[52/63] Getting information for taxon [Parabacteroides]
[0] maps inputted for org [pdi]
[0] KGMLs already obtained for org [pdi]

[40/63] Getting information for taxon [Anaerovibrio]
[Anaerovibrio] was not found in taxon to KEGG prefix conversion!

[48/63] Getting information for taxon [Lactobacillus]
[0] maps inputted for org [ljo]
[0] KGMLs already obtained for org [ljo]

What do you think the problem could be?

My command:
keggcharter.py -f test.tsv -tcol "value" -koc "KO" -tc "Taxonomic lineage (GENUS)" -gcol "gcol" -o test_out -mm "Methane metabolism"

Kind Regards
Dany

Explanations Needed with Quantifications

Hello again! Thanks for the quick responsiveness on my previous question. Now that I have the software working, I'm playing around with the interpretation of MGS data coming out of our MicrobiomeHelper pipeline (https://github.com/LangilleLab/microbiome_helper/wiki) that we use for our own data + offer to clients of our core - if I can get KEGGCharter to work well, we might like to include this in our new MH ver2.0 that might be coming along in 2024. We already are developing a tool for visualizing the stratified output (JarrVis: https://github.com/dhwanidesai/JarrVis) using interactive Sankey diagrams and KEGGCharter could be a nice complement to that for the metabolic maps part, since we don't have a good visualizer for that now (plus we could write some nice scripts to convert our pipeline data to "talk" between the two).

That being said, I have a few questions regarding how the quantifications are handled (PS: there also seem to be some legacy references to --genomic-columns when I think you mean -qcol) - I checked the paper and your wiki here, but there are a few things I wanted to ask and thought would be nice to have them here for other people to see. I initially started playing around with my full data file for input (only using the first two samples to start) when I encountered an issue that the color scale in the "MT" mode didn't seem to match the input RPKM values and so I made a little mock-up example file instead to be able to test and ask the below questions:

EC	N1	N2
1.7.1.4	4000	2000
1.9.6.1	400	200
1.7.2.5	40	20

...for this small test example, I've simply restricted to 3 of the EC numbers corresponding to the Nitrogen Metabolism pathway, which were in our original dataset and are of particular interest to us. After I run this through KC (keggcharter -f TestInput-forKEGGCharter.txt -o KC_test_run -it 'MirallesMGS-CEMEX2018' --map-all -t 40 -ecc 'EC' -qcol 'N1,N2' -mm 00910), I get the following output in the KEGGCharter_results.tsv:

Function	N1	N2	Taxon (KEGGCharter)	KO (ec-column)	EC (ec-column)	KO (KEGGCharter)	EC number (KEGGCharter)
1.7.1.4	4000	2000	MirallesMGS-CEMEX2018	K00361,K17877,K26138,K26139	1.7.1.4	K00361,K17877,K26138,K26139	1.7.1.4
1.7.2.5	400	200	MirallesMGS-CEMEX2018				1.7.2.5
1.9.6.1	40	20	MirallesMGS-CEMEX2018				1.9.6.1
				K02567	1.9.6.1
				K04561	1.7.2.5

..and then the following for the data_for_charting.tsv:

Function	N1	N2	Taxon (KEGGCharter)	KO (ec-column)	EC (ec-column)	KO (KEGGCharter)	EC number (KEGGCharter)
1.7.1.4	4000	2000	MirallesMGS-CEMEX2018	K00361,K17877,K26138,K26139	1.7.1.4	K00361	1.7.1.4
1.7.1.4	4000	2000	MirallesMGS-CEMEX2018	K00361,K17877,K26138,K26139	1.7.1.4	K17877	1.7.1.4
1.7.1.4	4000	2000	MirallesMGS-CEMEX2018	K00361,K17877,K26138,K26139	1.7.1.4	K26138	1.7.1.4
1.7.1.4	4000	2000	MirallesMGS-CEMEX2018	K00361,K17877,K26138,K26139	1.7.1.4	K26139	1.7.1.4
1.7.2.5	400	200	MirallesMGS-CEMEX2018				1.7.2.5
1.9.6.1	40	20	MirallesMGS-CEMEX2018				1.9.6.1

...this test data then results in the following map:

Therefore, a few questions/problems are coming up here:

Why are there two lines (for K02567+4561) down below the lines of the actual data in the KEGGCharter_results file above, instead of having those lines in-line with their corresponding 1.7.2.5+1.9.6.1 entries in the two lines above that to which they seemingly belong? My actual file of the full dataset has 2880 lines of actual data, whereas the KC-processed file then has an extra 1130 lines after the main data in the same way as above (ie: only in the "KO (ec-column)" and "EC (ec-column)" columns).
In the data_for_charting, there are data cells missing info and I suspect this is related to your having a minimum score cutoff for inclusion, but I can't seem to find that written anywhere on the Github, nor in the paper - what is it and can it be modified? We express our MGS gene counts normalized as RPKM and so the scale of numbers can look comparatively very small and so could be a problem (would prefer not to have to transform them and then redraw all the scales). This is also then reflected in the final map which does not have the last two "low count" ECs included. Similarly, in my full dataset file processed by KC (inflated from the 2880 original lines to 5123), the "KO (ec-column)" and "EC (ec-column)" columns stop at line 3819 (as does the next KO (KC) line, but the EC (KC) column goes to the end of the 5123).
Also in the data_for_charting, which is obviously then the basis for the coloring of the final maps, you can see that 1.7.1.4 is associated with 4 KOs, so its line is repeated 4 times. However, the mapping then sums those supposed 4,000 counts into 16,000, which massively overinflates the total counts. If that EC is associated with four KOs, then at a maximum those 4,000 counts could be distributed among those 4 KOs, but not be present 4,000 times in each KO. I would think the idea here, given no a priori info about which KOs they might be, would have been to evenly divide those 4,000 counts by the number of KOs to make avg. 1,000 for each, so that the final summed value would match the original, no?
Finally, since many of us would be using the "MT" mode for MGS data (gene counts instead of expression read depth) and the values are then absolute/comparative and not differential (ie: there is no 0 being average), the use of white as a color in the middle of the scale is problematic as it looks like "no counts" or "gene not present" for those boxes that happen to fall in that middle range. Is there a way to change the color scale used?

It is a shame about these current issues, as I had expected KC to basically be a fully automated way to get the same map coloring as doing KEGGMapper manually...however, when I use the above mock/test data in KM manually, this is what I get:

That map then faithfully reproduces the right relationships between the 4,000/2,000 counts (max out their values in each instance of 1.7.1.4 instead of multiplying), and then displays the lower 400/200 + 40/20 mock numbers. Of course, there we can also switch the color scale to avoid the white.

At the moment, I'd be forced to go this way of inputting the values manually into the KM pop-up box online instead of resorting to using KC...however, it would be nice to see if we can get it working! I don't have a ton of maps to do with my current paper I'm working on (just a few of the C/N/P ones), but still nice to see for here and eventually being able to offer to all clients/users of our pipeline. Thanks in advance for answering this long post!
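The double counting raised in this post can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using the mock values from the post; the helper names expand_by_kos and expand_and_split_counts are hypothetical, and the even split is the alternative proposed by the poster, not something KEGGCharter is known to do.

```python
def expand_by_kos(row, ko_field="KO", sep=","):
    """One output row per KO, each keeping the full original counts."""
    return [dict(row, **{ko_field: ko}) for ko in row[ko_field].split(sep)]


def expand_and_split_counts(row, count_fields, ko_field="KO", sep=","):
    """One output row per KO, with the counts divided evenly among the KOs."""
    kos = row[ko_field].split(sep)
    expanded = []
    for ko in kos:
        new_row = dict(row)
        new_row[ko_field] = ko
        for field in count_fields:
            new_row[field] = row[field] / len(kos)
        expanded.append(new_row)
    return expanded


row = {"EC": "1.7.1.4", "KO": "K00361,K17877,K26138,K26139", "N1": 4000, "N2": 2000}

repeated = expand_by_kos(row)
divided = expand_and_split_counts(row, ["N1", "N2"])

print(sum(r["N1"] for r in repeated))  # counts repeated per KO: 16000
print(sum(r["N1"] for r in divided))   # counts split per KO: 4000.0
```

Summing after a plain row expansion multiplies each count by the number of KOs, whereas splitting first keeps the per-EC total equal to the input.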

possible bug in main script

Hi, I have been trying to use the reCOGnizer output with keggcharter and got the following error message after successfully reading the data. Can you tell me whether this is a bug or a mistake on my side?

`2023-04-26 08:39:08: Arguments valid.
2023-04-26 08:39:08: Data successfully read.
Converting 27 KOs to EC numbers through the KEGG API: 100%|███████| 1/1 [00:01<00:00, 1.85s/it]
Converting 13 EC numbers to KOs through the KEGG API: 100%|███████| 1/1 [00:02<00:00, 2.16s/it]
Converting 35 EC numbers to KOs through the KEGG API: 100%|███████| 1/1 [00:02<00:00, 2.09s/it]
Traceback (most recent call last):
  File "/gxfs_home/geomar/smomw539/miniconda3/envs/keggcharter/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3652, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'KO (KEGGCharter)'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/gxfs_home/geomar/smomw539/miniconda3/envs/keggcharter/bin/keggcharter.py", line 490, in <module>
    main()
  File "/gxfs_home/geomar/smomw539/miniconda3/envs/keggcharter/bin/keggcharter.py", line 422, in main
    data, main_column = further_information(
                        ^^^^^^^^^^^^^^^^^^^^
  File "/gxfs_home/geomar/smomw539/miniconda3/envs/keggcharter/bin/keggcharter.py", line 121, in further_information
    data = get_cross_references(data, kegg_column=kegg_column, ko_column=ko_column, ec_column=ec_column, step=step)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gxfs_home/geomar/smomw539/miniconda3/envs/keggcharter/bin/keggcharter.py", line 190, in get_cross_references
    data = ids_interconversion(data, column='KO (KEGGCharter)', ids_type='ko', step=step)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gxfs_home/geomar/smomw539/miniconda3/envs/keggcharter/bin/keggcharter.py", line 157, in ids_interconversion
    ids = list(set(data[data[column].notnull()][column]))
                        ~~~~^^^^^^^^
  File "/gxfs_home/geomar/smomw539/miniconda3/envs/keggcharter/lib/python3.11/site-packages/pandas/core/frame.py", line 3761, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gxfs_home/geomar/smomw539/miniconda3/envs/keggcharter/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3654, in get_loc
    raise KeyError(key) from err
KeyError: 'KO (KEGGCharter)'`

Usage with eggnog-mapper output

Hello,

I just read your paper presenting the UPIMAPI, reCOGnizer and KEGGCharter tools and found it very interesting.

I intend to test your UPIMAPI/reCOGnizer annotation pipeline in future projects. However, I have a gene catalog with 10 million proteins that was already annotated with eggnog-mapper, and I have been using this annotation for all my analyses. Is there an easy way to visualize the eggnog-mapper results with KEGGCharter?

Thank you for your time
Kind regards
Lucas

Invalid file format now after implementation of regex checks

Hello again, so soon! My input file is now not working as of the new version 1.1.0 due to the newly implemented regex checks - this is the same file I was using just a few days ago (same EC numbers, just summed qcols instead of individual ones) and so must be the syntax for the regex.

Perhaps in the EC# check you are not allowing any letters, when provisional EC#s do have "n1" (for example), which was working before you put the check in. The input file I'm using is attached. Thanks!

MirallesMetaG-unstrat-matrix-RPKM_finalTypeSortAndSummed-forKEGGCharter.txt
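To make the point about provisional EC numbers concrete, here is a minimal Python sketch of a check that accepts them. The pattern is an assumption written for this illustration, not the regex used by KEGGCharter: it allows complete EC numbers, partial ones ending in "-", and a provisional last field such as "n1". The names EC_PATTERN and is_ec_number are hypothetical.

```python
import re

# Assumed pattern for illustration only:
#   first field:          digits
#   second, third fields: digits or "-"
#   fourth field:         digits, "-", or a provisional number such as "n1"
EC_PATTERN = re.compile(r"\d+\.(\d+|-)\.(\d+|-)\.(\d+|n\d+|-)")


def is_ec_number(value):
    """Return True if the whole string looks like an EC number (hypothetical check)."""
    return EC_PATTERN.fullmatch(value) is not None


for candidate in ["1.7.1.4", "1.1.1.n1", "1.7.1.-", "K00361", "1.7.1"]:
    print(candidate, is_ec_number(candidate))
```

A digits-only check on the last field would reject "1.1.1.n1", which matches the failure described in the post.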

question about transforming KO IDs and losing data

Hi,
I noticed that many of the IDs I wanted to map weren't included in the map. I think the reason is that many are lost in the KO conversion step. If I have already given a KO column, why does your program generate a "KO (KEGGCharter)" column in the results.tsv and use it to fill the map instead of the original column?

Here are the first rows of the KEGGCharter_results.tsv to explain myself better.

Genome	Gene ID	KO	KO (KEGGCharter)
CK1	HJENIJGG_00001	K02625
CK1	HJENIJGG_00001	K03761
CK1	HJENIJGG_00002	K05811
CK1	HJENIJGG_00003	K00998	K00998,K17103
CK1	HJENIJGG_00004	K09181
CK1	HJENIJGG_00005	K05812
CK1	HJENIJGG_00006	K03672
CK1	HJENIJGG_00007	K03214
CK1	HJENIJGG_00008	K03648	K21929,K03648,K25266
CK1	HJENIJGG_00009	K06866
CK1	HJENIJGG_00010	K05590	K12647,K12823,K12835,K13185,K17678,K14326,K17820,K17043,K12813,K25022,K11594,K18422,K18432,K12858,K11273,K14808,K13983,K03732,K14810,K13117,K05592,K20099,K03724,K25328,K18995,K14777,K18409,K16911,K21869,K13178,K05590,K14780,K13177,K12820,K13182,K12814,K13183,K12649,K18408,K26077,K03579,K21505,K22273,K12818,K17679,K12646,K11701,K19466,K18994,K12598,K13026,K14778,K20096,K03578,K05591,K13179,K14806,K14807,K12614,K26394,K14442,K12811,K18655,K13181,K13982,K14779,K18711,K17675,K14809,K13116,K14776,K12812,K14805,K12815,K11927,K18664,K14433,K19036,K13131,K18656,K17642,K13025,K14811,K20103,K26438,K18692,K13184,K17265,K14781,K12854,K20101,K12599
CK1	HJENIJGG_00011	K15460
CK1	HJENIJGG_00012	K00278
CK1	HJENIJGG_00013	K03088
CK1	HJENIJGG_00014	K03597
CK1	HJENIJGG_00015	K03598
CK1	HJENIJGG_00016	K03803
CK1	HJENIJGG_00017	K03596
CK1	HJENIJGG_00018	K03100

For instance, I wanted to draw the map m00910, Nitrogen metabolism. One of the many missing enzymes on the map that I have in the table is K00363, which is correctly converted to EC 1.7.1.15, but nonetheless in the map the corresponding box is left unfilled.

If I use the original KEGG Mapper I get the map with all the KOs correctly charted, but I liked the functionality of your software of being able to draw multiple genomes in a map. Thanks for your work.

Help with multiple Genomes

Hello!

I was wondering if I could please have some help in that I want to create metabolism maps for the genomes of multiple archaea.

The program works great, but the problem I am having is that the maps only show one species as having a specific gene/enzyme (via a particular colour), whereas my Excel document shows that my other species also have this gene.

For example, both of my practice genomes have the gene for EC 4.2.1.3, but when I create metabolic maps, only one colour comes up (suggesting that only one genome has it). Strangely enough, I do sometimes get a split of two colours (indicating that both genomes have the gene) for some genes in some pathways, but this isn't always the case.

Therefore, if I could please have some help, it would be greatly appreciated.

here is my code
keggcharter.py -f Book223.xlsx --input-quantification -koc "KO_column" -tc "Taxonomic lineage (SPECIES)" -o pratice25

And my excel
Book223.xlsx

Thank you very much

main script error

Hello,

I am getting an error even when I apply this code : keggcharter.py -f test.xlsx -tcol occ -gcol gene_name -it -keggc ko -o KEGG_Test

Error message I am getting:

Created KEGG_Test
2022-11-13 10:47:35: Arguments valid.
2022-11-13 10:47:39: Data successfully read.
Converting 2376 KEGG IDs to KOs through the KEGG API: 100%|██████████| 16/16 [00:32<00:00, 2.02s/it]
Converting 0 KOs to EC numbers through the KEGG API: 0it [00:00, ?it/s]
2022-11-13 10:48:11: Results saved to KEGG_Test/KEGGCharter_results.tsv
Getting information for 2 taxa: 50%|█████ | 1/2 [00:03<00:03, 4.00s/it]
Traceback (most recent call last):
  File "/ibex/scratch/alamourt/conda_keggcharter_env/bin/keggcharter.py", line 480, in <module>
    main()
  File "/ibex/scratch/alamourt/conda_keggcharter_env/bin/keggcharter.py", line 450, in main
    taxon_to_mmap_to_orthologs = download_resources(data, args.resources_directory, args.taxa_column, metabolic_maps)
  File "bin/keggcharter.py", line 331, in download_resources
    kegg_prefix = taxon2prefix(taxon, taxa_df)
  File "bin/keggcharter.py", line 296, in taxon2prefix
    if taxon_name.split(' (')[0] in organism_df.index: # Homo sapiens (human) -> Homo sapiens
AttributeError: 'bool' object has no attribute 'split'

I ran the example file with the commands given and it worked perfectly. However, in this case it doesn't seem to work. Can you please advise?
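The traceback in this post shows .split being called on a value that is a bool rather than a taxon name. Whatever the cause in this particular run, the name-trimming step quoted in the traceback ("Homo sapiens (human)" to "Homo sapiens") can be guarded against non-text values. The following Python sketch is illustrative only; base_taxon_name is a hypothetical helper, not a function from KEGGCharter.

```python
def base_taxon_name(taxon_name):
    """Strip a trailing parenthesised label, e.g. 'Homo sapiens (human)' -> 'Homo sapiens'.

    Returns None for values that are not text (booleans, missing values, numbers),
    so the caller can skip or report them instead of failing with AttributeError.
    """
    if not isinstance(taxon_name, str):
        return None
    return taxon_name.split(" (")[0]


for value in ["Homo sapiens (human)", "Lactobacillus", True, None]:
    print(repr(value), "->", repr(base_taxon_name(value)))
```

Checking which rows of the taxonomy column produce None would show whether the input table contains non-text taxon values.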

need help with -gcol param

Your tool seems very useful for me, but I don't really understand what the -gcol param should specify. I tried out the example data and the example call provided in the readme, but it gives me an exception:
KeyError: "Columns not found: 'mg'"
which makes sense to me, as I also couldn't find that specific column in the example data.
Could you please clarify this for me?

Edit:

the command call I used:
python3 keggcharter.py -f MOSCA_Entry_Report.xlsx -gcol Entry -tcol mt_0.01a_normalized,mt_1a_normalized,mt_100a_normalized,mt_0.01b_normalized,mt_1b_normalized,mt_100b_normalized,mt_0.01c_normalized,mt_1c_normalized,mt_100c_normalized -keggc "Cross-reference (KEGG)" -o test_keggcharter -tc "Taxonomic lineage (GENUS)"

abundance info

Hi Joao,

I might have missed this, but why is there more than 1 column for abundance?
example: -mgc mg_column1,mg_column2
Also, does KEGGCharter expect it to be normalized by sample size, or does it expect raw abundance info?
Is it possible to skip giving abundance and taxa info, and still get a PNG of a specific metabolic pathway from KEGGCharter?

Kind Regards
Dany

import error after install

Hello,
I just installed keggcharter in a clean conda environment and can't run it.
The environment I installed it in is active, and when I try to run it with "keggcharter -h" I get the following output:

``
/home/lfuernwein/.conda/envs/keggcharter/bin/keggcharter:9: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at pandas-dev/pandas#54466

import pandas as pd
Traceback (most recent call last):
  File "/home/lfuernwein/.conda/envs/keggcharter/bin/keggcharter", line 23, in <module>
    from keggpathway_map import KEGGPathwayMap, expand_by_list_column
  File "/home/lfuernwein/.conda/envs/keggcharter/share/keggpathway_map.py", line 9, in <module>
    from matplotlib import pyplot as plt, colors, colormaps
ImportError: cannot import name 'colormaps' from 'matplotlib' (/home/lfuernwein/.conda/envs/keggcharter/lib/python3.10/site-packages/matplotlib/__init__.py)
``

I installed it as follows:
conda create -n keggcharter
conda activate keggcharter
conda install -c conda-forge -c bioconda keggcharter
I would appreciate your help, and thank you in advance.
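The ImportError in this post concerns the top-level matplotlib.colormaps registry, which was introduced in Matplotlib 3.5, so an older Matplotlib in the environment is a plausible cause. A small Python sketch of a version check follows. The helper has_colormaps_registry is hypothetical and works on a plain version string so that it can be tried without Matplotlib installed.

```python
def has_colormaps_registry(matplotlib_version):
    """Return True if a Matplotlib version string is 3.5 or later.

    Only the major and minor fields are compared, which is enough to tell
    whether `from matplotlib import colormaps` can be expected to work.
    """
    major, minor = (int(part) for part in matplotlib_version.split(".")[:2])
    return (major, minor) >= (3, 5)


for version in ["3.4.3", "3.5.0", "3.8.2"]:
    print(version, has_colormaps_registry(version))
```

In a real environment the string to test would be matplotlib.__version__.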

could not download resources

Hi Joao,

thanks for creating reCOGnizer and KeggCharter.

My command:
keggcharter.py -f test.tsv -tcol "value" -koc "KO" -tc "Taxonomic lineage (GENUS)" -gcol "gcol" -o test_out -mm "Methane metabolism"

My output:

2022-07-27 11:30:36: Creating KEGG Pathway representations for 1 metabolic pathways.
Some resources were not found for map [koMethane metabolism]! Going to download them
Could not download resources for [koMethane metabolism]!
Analysis of map Methane metabolism failed!
KEGGCharter analysis finished in 00h00m59s

Could you tell what is happening? Why is it attempting downloading and failing? I installed keggcharter through conda.

Kind Regards
Dany
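In the command above, -mm receives the pathway name "Methane metabolism", while the README examples pass the numeric map ID (00680 for methane metabolism), and the log shows the name being used verbatim as "koMethane metabolism". A minimal Python sketch of a check that catches this before any download is attempted is given below. The function normalize_map_id is hypothetical, and accepting an optional "map" or "ko" prefix is an assumption made for convenience.

```python
import re


def normalize_map_id(value):
    """Return the five-digit KEGG map ID, or None if the value is not an ID.

    Hypothetical helper: accepts '00680', 'map00680' or 'ko00680'.
    """
    match = re.fullmatch(r"(?:map|ko)?(\d{5})", value.strip())
    return match.group(1) if match else None


for candidate in ["00680", "map00680", "Methane metabolism"]:
    print(repr(candidate), "->", normalize_map_id(candidate))
```

The IDs and descriptions of the available maps can be listed with keggcharter --show-available-maps.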

HTTP Error 400 at EC conversion

Hello! Thanks for making what seems to be some very useful software...however, after a seamless install, I'm running into a problem getting the info from KEGG through the API on a simple first test run of data:

(base) andre@vulcan:~/MirallesProjects/CanteraMGS-CEMEX2018$ keggcharter -f MirallesMetaG-unstrat-matrix-RPKM_final.txt -o keggcharter/first_test -it 'MirallesMGS-CEMEX2018' --map-all -t 40 -ecc 'function' -qcol 'N1,N2' -mm 00680
Created keggcharter/first_test/maps
Created keggcharter/first_test/json
Created /home/andre/bin/miniconda3/share/kc_kgmls
Created /home/andre/bin/miniconda3/share/kc_csvs
2023-12-30 01:51:20: Reading input data.
2023-12-30 01:51:20: Arguments valid.
Converting 2880 EC numbers to KOs through the KEGG API:   0%|                                                                                                              | 0/72 [00:00<?, ?it/s]IDs conversion broke at index 0; Error: HTTP Error 400: Bad Request; Trying again...
IDs conversion broke at index 0 again; Error: HTTP Error 400: Bad Request
Converting 2880 EC numbers to KOs through the KEGG API:   1%|=                                                                                                     | 1/72 [00:02<03:05,  2.61s/it]IDs conversion broke at index 40; Error: HTTP Error 400: Bad Request; Trying again...
IDs conversion broke at index 40 again; Error: HTTP Error 400: Bad Request
Converting 2880 EC numbers to KOs through the KEGG API:   3%|==>                                                                                                   | 2/72 [00:05<02:54,  2.49s/it]IDs conversion broke at index 80; Error: HTTP Error 400: Bad Request; Trying again...
IDs conversion broke at index 80 again; Error: HTTP Error 400: Bad Request
Converting 2880 EC numbers to KOs through the KEGG API:   4%|====                                                                                                  | 3/72 [00:07<02:48,  2.44s/it]IDs conversion broke at index 120; Error: HTTP Error 400: Bad Request; Trying again...
IDs conversion broke at index 120 again; Error: HTTP Error 400: Bad Request
Converting 2880 EC numbers to KOs through the KEGG API:   6%|=====>                                                                                                | 4/72 [00:09<02:45,  2.43s/it]IDs conversion broke at index 160; Error: HTTP Error 400: Bad Request; Trying again...
IDs conversion broke at index 160 again; Error: HTTP Error 400: Bad Request

This happens with the same install on two different servers. At first I thought our servers' firewalls might be blocking the requests, but I can successfully query the KEGG REST server from both of them (for example, wget http://rest.genome.jp/link/hsa:56894) and get results back. Do you think the way the REST API is called has changed since you created the software, or is the problem still on my side? Thanks!
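A quick way to narrow this down is to look at the identifiers before they reach the API. The sketch below is not KEGGCharter's code: it assumes the KEGG REST `link` operation at `https://rest.kegg.jp/link/ko/` with entries joined by `+`, a batch size of 40 (inferred from the index steps 0, 40, 80 in the log above), and the hypothesis, unconfirmed in this thread, that malformed EC values in the `function` column could cause the 400 responses. The helper names and sample values are made up for illustration.

```python
import re

# KEGG REST "link" operation with "ko" as the target database
KEGG_LINK_URL = "https://rest.kegg.jp/link/ko/"
# Assumption: batch size inferred from the index steps in the log (0, 40, 80, ...)
BATCH_SIZE = 40

# Four dot-separated numeric fields. Partial ECs ("2.7.1.-"), prefixed values
# ("EC:1.1.1.1") and preliminary ECs ("1.1.1.n1") are deliberately rejected here.
COMPLETE_EC = re.compile(r"^\d+\.\d+\.\d+\.\d+$")


def split_ec_numbers(values):
    """Separate complete EC numbers from everything else in the column."""
    good, bad = [], []
    for value in values:
        value = str(value).strip()
        (good if COMPLETE_EC.match(value) else bad).append(value)
    return good, bad


def batch_urls(ec_numbers, batch_size=BATCH_SIZE):
    """Yield one link URL per batch of EC numbers."""
    for start in range(0, len(ec_numbers), batch_size):
        batch = ec_numbers[start:start + batch_size]
        yield KEGG_LINK_URL + "+".join(f"ec:{ec}" for ec in batch)


# Hypothetical values, standing in for the 'function' column of the input file
column = ["1.1.1.1", "EC:1.1.1.1", "2.7.1.-", " 4.2.1.11 ", "nan"]
good, bad = split_ec_numbers(column)
print("queryable:", good)
print("skipped:", bad)
for url in batch_urls(good):
    print(url)
```

If the skipped list is not empty for the real input, cleaning those values first and requesting one of the printed URLs by hand (for example with wget) shows whether the server accepts the batch.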

Missing 1 species and nothing is highlighted

Hello

I am using KEGGCharter (v. 0.5) to compare the metabolic potential (using KEGG Orthology only) of four species of archaea. The file I am using is attached below (Book13.xlsx).

The command I am using to run KEGGCharter is:
keggcharter.py -f Book13.xlsx -o kcar -koc "KO_ids" -tc "Taxonomic lineage (SPECIES)" --input-quantification --metabolic-maps 00020

(I am only running the map for the TCA cycle (00020) as a proof of concept for my supervisor.)

The program runs without problems, but in the results I see only 3 of the 4 species, and none of the blocks are highlighted, even though a manual search of KEGG Orthology shows that I do have some of the TCA cycle enzymes.

I would greatly appreciate some help in solving this problem.

Thank you very much
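One generic check before digging into the maps is to count how many distinct species and well-formed KO identifiers the input actually contains. The sketch below is a sanity check only and makes no claim about how KEGGCharter parses the file: the column names come from the command above, while the rows, the species names, and the `summarize` helper are hypothetical. It assumes KO identifiers have the form `K` followed by five digits.

```python
import re
from collections import Counter

# KEGG Orthology identifiers: "K" followed by five digits
KO_PATTERN = re.compile(r"^K\d{5}$")


def summarize(rows, ko_column, taxa_column):
    """Count rows per taxon and collect KO values that do not look like KOs."""
    taxa = Counter()
    malformed = []
    for row in rows:
        taxon = (row.get(taxa_column) or "").strip()
        taxa[taxon or "<empty>"] += 1
        # A cell may hold several KOs separated by ';', ',' or whitespace
        for ko in re.split(r"[;,\s]+", (row.get(ko_column) or "").strip()):
            if ko and not KO_PATTERN.match(ko):
                malformed.append(ko)
    return taxa, malformed


# Hypothetical rows, standing in for the contents of the spreadsheet
rows = [
    {"KO_ids": "K00024", "Taxonomic lineage (SPECIES)": "Species A"},
    {"KO_ids": "K01647;K01681", "Taxonomic lineage (SPECIES)": "Species B"},
    {"KO_ids": "ko:K00031", "Taxonomic lineage (SPECIES)": "Species B"},
    {"KO_ids": "", "Taxonomic lineage (SPECIES)": ""},
]
taxa, malformed = summarize(rows, "KO_ids", "Taxonomic lineage (SPECIES)")
print(dict(taxa))
print(malformed)
```

A species count lower than expected, or a non-empty list of malformed KOs, would point to the input file; if both look right, the cause is more likely elsewhere.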

nonexplicit error message

Hi Joao,

If the gcol is not given:
keggcharter.py -f MOSCA_Entry_Report.xlsx -keggc "Cross-reference (KEGG)" -tcol mt_0.01a_normalized
This is the message:

2022-08-10 12:52:48: Arguments valid.
2022-08-10 12:53:03: Data successfully read.
Converting 16059 KEGG IDs to KOs through the KEGG API: 100%|████████████████████████████████████████████████████████████████████████████████| 108/108 [02:26<00:00,  1.36s/it]
Converting 2416 KOs to EC numbers through the KEGG API: 100%|█████████████████████████████████████████████████████████████████████████████████| 17/17 [00:20<00:00,  1.19s/it]
2022-08-10 12:55:52: Results saved to KEGGCharter_results/KEGGCharter_results.tsv
Traceback (most recent call last):
  File "/shared/homes/152324/miniconda3/envs/keggcharter_env/bin/keggcharter.py", line 451, in <module>
    main()
  File "/shared/homes/152324/miniconda3/envs/keggcharter_env/bin/keggcharter.py", line 415, in main
    args.genomic_columns = args.genomic_columns.split(',')
AttributeError: 'NoneType' object has no attribute 'split'

Not really intuitive. It should just say "gcol is missing".
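The kind of check being asked for could look like the sketch below. This is not the project's code: only the `-gcol` flag and the `genomic_columns` attribute come from the traceback, while the long option name `--genomic-columns`, the trimmed-down parser, and the `parse_columns` helper are assumptions made for illustration.

```python
import argparse


def build_parser():
    # Hypothetical, trimmed-down parser; the long option name is an assumption
    # derived from the attribute `args.genomic_columns` in the traceback.
    parser = argparse.ArgumentParser(prog="keggcharter-sketch")
    parser.add_argument("-gcol", "--genomic-columns", default=None)
    return parser


def parse_columns(value, flag):
    """Split a comma-separated column list, or fail with a readable message."""
    if value is None:
        raise ValueError(
            f"{flag} is missing: provide a comma-separated list of column names"
        )
    return [name.strip() for name in value.split(",") if name.strip()]


parser = build_parser()

# Flag omitted: a clear message instead of an AttributeError on None
args = parser.parse_args([])
try:
    parse_columns(args.genomic_columns, "-gcol")
except ValueError as error:
    print(error)

# Flag given: the value is split into a list of column names
args = parser.parse_args(["-gcol", "mg_1, mg_2"])
print(parse_columns(args.genomic_columns, "-gcol"))
```

In a real command-line tool, `parser.error(...)` would be a natural alternative to raising `ValueError`, since it prints the usage line and exits with a non-zero status.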

Error using reCOGnizer output

Hello, I am using KEGGCharter with the output I got from reCOGnizer, but I constantly get errors. Could you help me build my command line, please?

This is the command line I am using:
keggcharter -f reCOGnizer_results.tsv -ecc 'EC number' -koc 'KO' -o keggcharter_output -it "taxonomic_range_name" -iq -t 90 -mm 00680

These are the headers of the output (reCOGnizer_results.tsv) obtained with reCOGnizer:
qseqid, DB ID, Protein description, DB description, EC number, CDD ID, taxonomic_range_name, taxonomic_range, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore, General functional category, Functional category, KO
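Before running the command, it can help to confirm that every column name passed on the command line exists in the file header exactly as written. The sketch below is a generic check and not part of KEGGCharter. It assumes the file is tab-separated, uses a shortened header built from the column names listed above, and the `missing_columns` helper is hypothetical.

```python
import csv
import io


def missing_columns(header_line, requested, delimiter="\t"):
    """Return the requested column names that are absent from the header."""
    header = next(csv.reader(io.StringIO(header_line), delimiter=delimiter), [])
    return [name for name in requested if name not in header]


# Shortened header, assuming tab-separated columns as in a .tsv file
header_line = "\t".join(
    ["qseqid", "DB ID", "EC number", "taxonomic_range_name", "evalue", "KO"]
)

# Column names as passed to -ecc and -koc in the command above
print(missing_columns(header_line, ["EC number", "KO"]))
# A name that is not in the header is reported back
print(missing_columns(header_line, ["EC number", "KO_ids"]))
# taxonomic_range_name is itself a column of the file
print(missing_columns(header_line, ["taxonomic_range_name"]))
```

Note that, going by the README, `-it` is meant for runs where no taxonomy column exists, while `-tcol` names an existing taxonomy column. Since `taxonomic_range_name` is a column in this file, passing it through `-tcol` may be what was intended, though that is not confirmed in this thread.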

Example run on reCOGnizer output

Hi,

Is it possible to run KEGGCharter on reCOGnizer output files?

If yes, could you show an example command?

Best,
Davide

keggcharter's Issues

I am getting an error even when I run this command: keggcharter.py -f test.xlsx -tcol occ -gcol gene_name -it -keggc ko -o KEGG_Test
