elizabethmcd / metabolishmm

Tool for constructing phylogenies and summarizing metabolic characteristics based on curated and custom profile HMMs

License: GNU General Public License v3.0

Python 100.00%

metabolishmm's Issues

Ribosomal markers for Archaea vs Bacteria

fix options for ribosomal markers for bacteria vs archaea

connect to kofamkoala

option to point to exec_annotation file for threshold cutoffs

The directory of curated metabolic markers could not be found.

Hi,
I am running 'summarize-metabolism' as

summarize-metabolism --input aquifer-genomes/ --output summary --metadata groups.csv

but I am getting the following error:

#############################################
metabolisHMM v1.4.0
     The directory of curated metabolic markers could not be found.
     Please either download the markers from https://github.com/elizabethmcd/metabolisHMM/releases/download/v2.0/metabolisHMM_v2.0_markers.tgz and decompress the tarball, or move the directory to where you are running the workflow from.

However, the models exist in curated_markers/metabolic_markers/*hmm
Also, where is the make-heatmap.R?

heatmap row order

Add an option to provide a list for custom ordering of heatmap rows instead of the default alphabetical order, so the user can arrange rows as they want, for example in the taxonomic order of the tree
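A minimal sketch of how such an ordering option could behave, assuming the heatmap rows are plain labels. The function name `order_rows` and the example labels are hypothetical and not part of metabolisHMM.

```python
def order_rows(rows, custom_order=None):
    """Return row labels in a user-supplied order, else alphabetically.

    Hypothetical helper: `rows` are the labels present in the data,
    `custom_order` is the optional list supplied by the user.
    """
    if not custom_order:
        return sorted(rows)
    present = set(rows)
    # Keep only requested labels that actually exist in the data.
    wanted = [name for name in custom_order if name in present]
    # Labels the user did not list are appended alphabetically.
    leftover = sorted(present - set(wanted))
    return wanted + leftover


rows = ["Cyanobacteria", "Actinobacteria", "Bacteroidetes"]
print(order_rows(rows))
print(order_rows(rows, ["Cyanobacteria", "Bacteroidetes"]))
```

Appending unlisted rows rather than dropping them is a design choice made here so that a partial list never silently removes genomes from the figure.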

Single Marker Analysis

Quick analysis of single marker in all genomes (such as nifA) for looking at evolutionary history of the specific protein and how it compares to the genome phylogeny

genome directory for metabolic summary

set custom directory for metabolic summaries script

Additions to create-genome-phylogeny

Integrating putative genome classifications with the Hug et al. 2016 TOL of genomes, highlighting input genomes

Argument to download TOL genomes (bacteria, archaea, eukarya, all or some)
Point to already downloaded TOL genomes directory or custom references directory
Point to input references
Make tree
Highlight with metadata input genome for iTOL to visualize where they fall
Also do this for highlighting on the ribosomal tree if a certain single marker falls anywhere

Metabolic Summary heatmap

Create a single function or separate functions to split large dataframe into cycles, so they are colored differently, and then stitch them back together in one single PDF
Make the axis labels and title a bit prettier
Above each heatmap that will be stitched together, put the title of the cycle, and then have a larger title for the merged "grid"

My custom-marker script is currently set to work with the WLJ proteins. It also doesn't order them by function until I get to the R script. Change this to also take as input a list giving the order of any set of markers (so that the user can get the figure in the order wanted), and visualize the heatmap.

Take a list
Make heatmap
R script takes into account the number of genomes within a group (such as phyla) and also outputs that number next to the phyla label. Make an option to turn that off if comparing across single genomes

Print genome if missing more than # of markers for phylogeny

If a given genome is missing more than a set number of markers for the phylogeny, report this to the user so they know to look closer at the alignment file and decide whether or not to take that genome out
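A rough sketch of the check, assuming the hits have already been collected into a dict of genome to markers found. The function name, the data layout, and the example marker names are assumptions for illustration.

```python
def genomes_missing_markers(marker_hits, all_markers, max_missing):
    """Return {genome: [missing markers]} for genomes over the threshold.

    Hypothetical layout: `marker_hits` maps a genome name to the set of
    phylogenetic markers that were found in it.
    """
    flagged = {}
    for genome, found in marker_hits.items():
        missing = sorted(set(all_markers) - set(found))
        if len(missing) > max_missing:
            flagged[genome] = missing
    return flagged


markers = ["rpL2", "rpL3", "rpL4", "rpS3"]
hits = {
    "genomeA": {"rpL2", "rpL3", "rpL4", "rpS3"},
    "genomeB": {"rpL2"},
}
for genome, missing in genomes_missing_markers(hits, markers, max_missing=2).items():
    print(f"{genome} is missing {len(missing)} markers: {', '.join(missing)}")
```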

Demo dataset

Change demo dataset to Microbial Observatory Lake ref MAGs/SAGs set

References among TOL with concatenated ribosomal proteins
Metabolic summaries/pathways of interest breakdowns

Option for corresponding tree from single phylogeny

Add option for RIBO=TRUE if want a corresponding ribosomal tree of only the genomes that have a specific marker in the single-marker-phylogeny workflow.

Gets the genomes to make the tree from by making a list from either the hits list or the .faa of hits with the '>' header, and then makes the corresponding tree
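One way the genome list could be pulled from the '>' headers of the hits .faa, as a sketch only. It assumes the genome name is the first whitespace-delimited token of each header, which may not match the header format the workflow actually writes.

```python
def genomes_from_fasta(lines):
    """Collect unique genome names from FASTA header lines, in file order.

    Assumption: the genome name is the first token after '>'.
    """
    genomes = []
    for line in lines:
        if not line.startswith(">"):
            continue
        tokens = line[1:].split()
        if tokens and tokens[0] not in genomes:
            genomes.append(tokens[0])
    return genomes


example = [
    ">genomeA some description",
    "MKLV",
    ">genomeB",
    "MKIV",
    ">genomeA second hit",
    "MKLL",
]
print(genomes_from_fasta(example))
```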

Metabolic Summary > Visualization

Add R script for general metabolic summary of main pathways (C, N, S) and add groupings of genomes for comparisons, maybe cut down on the not so essential markers

Concatenate script

Add step in genome-phylogeny script to concatenate the alignments myself so don't have to depend on the outside perl script
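A sketch of what an in-house concatenation step could look like, assuming each marker alignment has already been read into a dict of genome to aligned sequence. The function name and data layout are hypothetical. Genomes missing a marker are padded with gaps so every concatenated sequence has the same length.

```python
def concatenate_alignments(alignments, gap="-"):
    """Concatenate per-marker alignments into one sequence per genome.

    `alignments` is a list of dicts (one per marker) mapping genome
    name to its aligned sequence for that marker.
    """
    genomes = sorted({genome for aln in alignments for genome in aln})
    concatenated = {genome: "" for genome in genomes}
    for aln in alignments:
        if not aln:
            continue
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment differ in length")
        width = lengths.pop()
        for genome in genomes:
            # Pad with gaps when this genome lacks the marker.
            concatenated[genome] += aln.get(genome, gap * width)
    return concatenated


alignments = [
    {"genomeA": "MK-", "genomeB": "MKL"},
    {"genomeA": "GG"},
]
print(concatenate_alignments(alignments))
```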

check for markers download

repackaged for installation with pip, have to check user downloaded the markers in the right place

programs installed

have to check that external dependencies are in the path and alert user if they need to install something or change the name of it
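A small sketch of such a check using `shutil.which` from the standard library. The program names in the example are illustrative; the exact list of dependencies, and the names they are installed under, is an assumption here.

```python
import shutil


def missing_programs(programs):
    """Return the names from `programs` that are not found on PATH."""
    return [name for name in programs if shutil.which(name) is None]


# Illustrative names only; the real dependency list may differ.
required = ["hmmsearch", "prodigal"]
absent = missing_programs(required)
if absent:
    print("Not found in PATH, install or rename: " + ", ".join(absent))
else:
    print("All external dependencies found.")
```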

R script options

Crashes out because of optional row names argument

Kofam search yields TC bit threshold error

Hi,
I was able to use summarize-metabolism but when using the search-custom-markers with the kofam database, I am running into some issues and I suspect it's some way the program is calling kofamscan?

Here's my command:
search-custom-markers --input bins_aa --output vitaminb3_bins_out_01062020 --markers_dir profiles/vitaminb3/ --markers_list vitaminb3_list.txt --metadata bin_metadata.csv --kofam kofamscan-1.1.0/exec_annotation --ko_list ko_list

The error I get is: TC bit thresholds unavailable on model K0xxxx (for all the KOs I listed).
I get results but it's totally blank/empty.

I am calling the executable (exec_annotation) and the ko_list should have the thresholds within. I tried running kofamscan on its own without issue. I downloaded the dependencies required for kofamscan in the metabolishmm conda env (ruby, parallel). Even my config.yml should be correct, although I'm not sure that matters since metabolishmm is calling kofamscan.

Let me know if you know of a fix for this! Thank you very much!

tfdA test case

I'm a little too far down the road to fix some of this stuff for my mehg analysis. And that one is a little tricky with all the different datasets I pulled from and trying to make comparative analyses with HGT and whatnot.

A good thing to try would be to wrap up that project and try the presence/absence among the whole tree of screened genomes (TOL) for tfdA since it seems to be a little more widespread and not as rare as the mehg crap.

Steps that would have to be taken into account:

Pulling down refseq genomes
Pulling down large-scale, publicly available genome sets (Anantharaman, Woodcroft, Crits-Cristoph, Tran, Parks) and then dereplicating by some threshold so don't have a bunch of duplicated bins across datasets that are probably very similar in sequence and are just going to add to the mess of the tree
This is the genome database to screen from = have nucleotide and protein files stored somewhere (OSF?)
Then can create all the presence/absence analyses with tfdA as a test case and looking closer into that
FastTree will be implemented in the pipeline as a test to look at what things pop out, but with strong recommendations to run with RAxML on servers with more computing power because FastTree be sucky

prodigal quiet

will scream about short contigs and about using the -m or -c for metagenomic "incomplete" genomes needing to have greater than 1000 nts or something of the like; either send stdout to null or run the quiet version
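A sketch of both approaches combined: passing Prodigal's `-q` flag and discarding the process output. The wrapper names `prodigal_command` and `run_prodigal` and the file names are hypothetical; how the workflow actually invokes Prodigal is not shown in this issue.

```python
import subprocess


def prodigal_command(fna_path, faa_path, quiet=True):
    """Build a Prodigal command that writes protein translations to faa_path."""
    cmd = ["prodigal", "-i", fna_path, "-a", faa_path]
    if quiet:
        cmd.append("-q")  # Prodigal's own quiet flag
    return cmd


def run_prodigal(fna_path, faa_path, quiet=True):
    """Run Prodigal, discarding its stdout and stderr when quiet."""
    sink = subprocess.DEVNULL if quiet else None
    # check=True still raises CalledProcessError on a non-zero exit,
    # so failures are not hidden even though the messages are.
    return subprocess.run(
        prodigal_command(fna_path, faa_path, quiet),
        stdout=sink,
        stderr=sink,
        check=True,
    )


print(" ".join(prodigal_command("genome.fna", "genome.faa")))
```

Discarding stderr also hides genuine error text, so a non-quiet mode is worth keeping for debugging.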

TOL DB

create_phylogeny workflow to include putting user genomes among the TOL DB in KA lab used for Hg paper

installation instructions

change order of creating environment and installing metabolisHMM, also the pip instructions and the R script download/database are a little wonky at the moment

Missing files in sdist

It appears that the manifest is missing at least one file necessary to build
from the sdist for version 2.22. You're in good company, about 5% of other
projects updated in the last year are also missing files.

+ /tmp/venv/bin/pip3 wheel --no-binary metabolishmm -w /tmp/ext metabolishmm==2.22
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting metabolishmm==2.22
  Downloading http://10.10.0.139:9191/root/pypi/%2Bf/ea0/9df216c0fb79a/metabolisHMM-2.22.tar.gz (14 kB)
    ERROR: Command errored out with exit status 1:
     command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-wkurcu8f/metabolishmm/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-wkurcu8f/metabolishmm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-wkurcu8f/metabolishmm/pip-egg-info
         cwd: /tmp/pip-wheel-wkurcu8f/metabolishmm/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-wheel-wkurcu8f/metabolishmm/setup.py", line 4, in <module>
        with open('requirements.txt') as f:
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
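The traceback shows setup.py opening requirements.txt, which is absent from the sdist. The usual remedy is to list the file in MANIFEST.in (`include requirements.txt`). Separately, setup.py could read the file defensively, as in this sketch; the helper name `read_requirements` is hypothetical and this is not the project's actual setup.py.

```python
from pathlib import Path


def read_requirements(path="requirements.txt"):
    """Return requirement strings, or an empty list if the file is absent."""
    req_file = Path(path)
    if not req_file.is_file():
        return []
    stripped = (line.strip() for line in req_file.read_text().splitlines())
    # Skip blank lines and comments.
    return [line for line in stripped if line and not line.startswith("#")]


print(read_requirements("no-such-requirements-file.txt"))
```

The returned list could then be passed to `install_requires` in `setup()`. Falling back to an empty list lets the build proceed but drops the dependencies, so shipping the file in the sdist remains the better fix.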

Create a conda release

Hi @elizabethmcd ,

thank you for providing this package. I'm excited to use it.

I believe providing a conda release for it would greatly improve it and attract users. I'm attempting one myself for Bioconda over here and would like to kindly invite you to collaborate, if you are keen :)

I'm not very experienced with conda so your help as the main developer would be invaluable.

Thank you for any assistance you can provide,

single marker corresponding tree making

program says it's done before it actually finishes making the corresponding ribosomal tree, which takes longer but makes it look like it crashed out for some reason

trimming

qual trimming, equal lengths issues for a couple cases

Missing markers for summaries

If a genome has zero hits among a list of markers, it won't put it in the output, have to append these
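A sketch of appending the zero-hit genomes, assuming counts are held as a nested dict of genome to marker to count. The function name and layout are hypothetical.

```python
def fill_missing_genomes(counts, all_genomes, markers):
    """Return a complete table with explicit zeros for absent entries.

    `counts` maps genome -> {marker: hit count} and may omit genomes
    (or markers) that had no hits at all.
    """
    filled = {}
    for genome in all_genomes:
        row = counts.get(genome, {})
        filled[genome] = {marker: row.get(marker, 0) for marker in markers}
    return filled


counts = {"genomeA": {"nifH": 2}}
table = fill_missing_genomes(counts, ["genomeA", "genomeB"], ["nifH", "nifD"])
print(table)
```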

Reformatting fasta files

Add function to automatically reformat headers of gene names, and add Prodigal as a dependency if given .fna files

Documentations and figures

Fix documentation and figures for publication specific release fixes and table comparisons

Documentation

Start writing the documentation in the wiki

Major Release Tasks

Make a v1.0 release after:

Fix directory names
Give options for .fna or .faa input
Makes ribosomal tree of single marker hits for comparison
Preliminary heatmap figures (have to figure out if I can do this in python with seaborn or too much of a hassle)
Figure out how to put in the /bin and then call from installation in path?

iTol metadata files for phylogeny workflows

output formatted metadata files for iTOL based on the provided metadata, optional arguments

refseq hits

New script to take a set of bins with a species marker better resolved than ribosomal proteins and will most likely be recovered in genomes well
Get hits from refseq , find those top 5 hits per genome
Make tree of the candidate bins and the hits

Full matrix for metabolic results

Create a dataframe/matrix where every genome is a row, with markers as columns holding counts of hits parsed from the HMM outputs
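A sketch of building that matrix, assuming the HMM outputs are hmmsearch tabular (`--tblout`) files, where comment lines start with '#' and each remaining line is one hit. Whether the workflow uses that output format is an assumption, and the function names are hypothetical.

```python
def count_hits(tblout_lines):
    """Count hit lines in hmmsearch --tblout style output."""
    return sum(
        1 for line in tblout_lines
        if line.strip() and not line.startswith("#")
    )


def build_matrix(results, markers):
    """Return rows (header first) with one row per genome.

    `results` maps (genome, marker) -> lines of that search output.
    """
    genomes = sorted({genome for genome, _ in results})
    rows = [["genome"] + list(markers)]
    for genome in genomes:
        rows.append(
            [genome] + [count_hits(results.get((genome, m), [])) for m in markers]
        )
    return rows


results = {
    ("genomeA", "nifH"): ["# target name ...", "prot1 - nifH ...", "prot2 - nifH ..."],
    ("genomeB", "nifH"): ["# target name ..."],
}
for row in build_matrix(results, ["nifH", "nifD"]):
    print(row)
```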

no hmmsearch for summarize-metabolism

Hi! Thanks for the tool. I installed the software using the conda way. For summarize-metabolism workflow, it did not perform hmmsearch (empty files for out folder); while search-custom-markers workflow did that. Could you check if there is any issue with the summarize-metabolism script? Here is my cmd line: summarize-metabolism --input faa --output summaries --metadata ../genomeinfo.csv --aggregate ON. Thanks!

dev branch

- Separate workflow scripts
- Conda installation & packaging of DB markers
- Multithreading?
- Main reformatting, utils, and arguments scripts
- Plotting package fixes and updates to new package versions (seaborn, matplotlib)

Markers w/o -tc option

Markers don't have a default tc option for a threshold, either run separately or identify tc for them?

Documentation

Installation instructions for dependencies
Caveats
Adding your own markers
Functions associated with each marker
Suggested references to cite
Add a license of some sort

genome names cannot contain dashes?

Hello! I ran metabolishmm and received the following error:

$ summarize-metabolism --input in_genomes --output out_genomes --metadata metabolishmm_metadata.csv --aggregate OFF

metabolisHMM v2.21
Reformatting fasta files...
Screening curated metabolic markers...
Parsing all results...
/home/tereiter/miniconda3/envs/metabolishmm/bin/summarize-metabolism:162: DeprecationWarning: 'U' mode is deprecated
  with open(result, "rU") as input:
Traceback (most recent call last):
  File "/home/tereiter/miniconda3/envs/metabolishmm/bin/summarize-metabolism", line 162, in <module>
    with open(result, "rU") as input:
FileNotFoundError: [Errno 2] No such file or directory: 'out_genomes/out/QinJ_2012__CON/QinJ_2012__CON-091__bin.20-ccoP_TIGR00782.out'

out_genomes/out/QinJ_2012__CON/QinJ_2012__CON-091__bin.20-ccoP_TIGR00782.out

When I removed the dashes from the genome name, metabolishmm ran fine.
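In the path above, the directory is named `QinJ_2012__CON` while the file keeps the full genome name, which would be consistent with the genome name being cut at its first dash somewhere in the script. That is a guess from the traceback, not a confirmed diagnosis. If result files are named `<genome>-<marker>.out`, splitting on the last dash instead keeps dashes in genome names intact, as in this sketch with a hypothetical helper:

```python
import os


def split_result_name(path):
    """Split '<genome>-<marker>.out' into (genome, marker).

    Splits on the LAST dash, so genome names may contain dashes.
    Assumption: marker names themselves contain no dash.
    """
    stem = os.path.basename(path)
    if stem.endswith(".out"):
        stem = stem[: -len(".out")]
    genome, marker = stem.rsplit("-", 1)
    return genome, marker


print(split_result_name("QinJ_2012__CON-091__bin.20-ccoP_TIGR00782.out"))
```

The DeprecationWarning in the same log is a separate matter: the "U" flag to `open` was removed in Python 3.11, and plain "r" gives the same universal-newline behaviour in Python 3.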

make-heatmap.R script location?

Hi Elizabeth,

Thank you for this awesome tool! I am interested in using the summarize metabolism workflow, but I am unable to locate the "make-heatmap.R" script that is mentioned to be a requirement. Could you please let me know where I could access this script?

Thanks!
Binvir

Pretty-fy

At some point, I will fix the scripts to run with more reproducible functions and not my really ugly for loops. They work fine for now for testing and getting preliminary data.

Presence/absence among a given set of genomes

I currently have where you can make a gene tree based off an HMM, and it gives that tree only if the marker is present in those genomes, which is nice for looking at distribution/evolution of that marker. I also do presence/absence of a suite of metabolic markers. I like Mike Lee's example of showing presence/absence with a highlighted tree of all genomes, so you can broadly see where the given marker is NOT located.

Steps:

Can pass an assembly accession file to ncbi-genome-download. Can also get metadata from here
Calling ORFs/annotations because just pulling down nucleotide genbank files
Search for the marker across proteins
Make ribosomal protein tree of all genomes in the set
Highlight with color the presence/absence, and also give an output of # among that clade (such as phyla)

plotting option

Make plotting ON by default, with a check that Rscript is in the path or at a provided path; add an option to turn it off if the user just wants the stats and will make the figure themselves without dealing with the R functions

fixes

aggregating plotting
documentation with kofamkoala markers
version numbers