elizabethmcd / metabolishmm
Tool for constructing phylogenies and summarizing metabolic characteristics based on curated and custom profile HMMs
License: GNU General Public License v3.0
fix options for ribosomal markers for bacteria vs archaea
option to point to exec_annotation file for threshold cutoffs
Hi,
I am running 'summarize-metabolism' as
summarize-metabolism --input aquifer-genomes/ --output summary --metadata groups.csv
but I am getting the following error:
#############################################
metabolisHMM v1.4.0
The directory of curated metabolic markers could not be found.
Please either download the markers from https://github.com/elizabethmcd/metabolisHMM/releases/download/v2.0/metabolisHMM_v2.0_markers.tgz and decompress the tarball, or move the directory to where you are running the workflow from.
However, the models exist in curated_markers/metabolic_markers/*hmm
Also, where is make-heatmap.R?
Add option to provide list for custom ordering of heatmap rows instead of default alphabetical so user can do what they want or put in taxonomical order of the tree
Quick analysis of single marker in all genomes (such as nifA) for looking at evolutionary history of the specific protein and how it compares to the genome phylogeny
set custom directory for metabolic summaries script
Integrating putative genome classifications with the Hug et al. 2016 TOL of genomes, highlighting input genomes
fix within workflow version numbers
My custom-marker script is currently set to work with the WLJ proteins. It also doesn't order them in function until I get to the R script. Change this to take as input as well a list of the order of any set of markers (so that the user can get the figure in the order wanted), and visualize the heatmap.
Take a list
Make heatmap
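The reordering step asked for here could be sketched as below. This is only an illustration, not the package's code: the function name `order_rows` and the choice to append unlisted rows alphabetically are assumptions.

```python
def order_rows(rows, custom_order=None):
    """Return rows in the user's order; fall back to alphabetical.

    rows: iterable of row labels (markers or genomes) present in the data.
    custom_order: optional list of labels in the order the user wants.
    Labels in custom_order that are absent from rows are ignored;
    rows not named in custom_order are appended alphabetically.
    """
    rows = list(rows)
    if not custom_order:
        return sorted(rows)
    present = set(rows)
    listed = set(custom_order)
    wanted = [r for r in custom_order if r in present]
    rest = sorted(r for r in rows if r not in listed)
    return wanted + rest
```

The ordered labels could then be handed to the plotting script as the row (factor) order for the heatmap.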
R script takes into account the number of genomes within a group (such as phyla) and also outputs that number next to the phyla label. Make an option to turn that off if comparing across single genomes
If a given genome is missing more than a set number of markers for the phylogeny, report it to the user so they know to look closer at the alignment file and decide whether or not to take that genome out
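A minimal sketch of such a report, assuming the hits have already been parsed into a mapping from genome name to the set of markers found. The function name, the data layout, and the marker names in the test are hypothetical.

```python
def genomes_missing_markers(hits, markers, max_missing):
    """Flag genomes that lack more than max_missing of the expected markers.

    hits: dict mapping genome name -> set of marker names with a hit.
    markers: list of all marker names used for the phylogeny.
    Returns a dict mapping each flagged genome -> list of missing markers.
    """
    flagged = {}
    for genome, found in hits.items():
        missing = [m for m in markers if m not in found]
        if len(missing) > max_missing:
            flagged[genome] = missing
    return flagged
```

Printing the returned dictionary at the end of the run would tell the user which alignments deserve a closer look.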
Change demo dataset to Microbial Observatory Lake ref MAGs/SAGs set
References among TOL with concatenated ribosomal proteins
Metabolic summaries/pathways of interest breakdowns
Add option for RIBO=TRUE if want a corresponding ribosomal tree of only the genomes that have a specific marker in the single-marker-phylogeny workflow.
Gets the genomes to make the tree from by making a list from either the hits list or the .faa of hits with the '>' header, and then makes the corresponding tree
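Collecting the genome list from the '>' headers might look like the sketch below. It assumes the first whitespace-delimited token of each header is the genome name, which may not match the actual header format the workflow writes.

```python
def genomes_from_fasta(lines):
    """Collect unique genome names from FASTA header lines, in file order.

    lines: any iterable of text lines, e.g. an open .faa file handle.
    Assumes (hypothetically) that the first token after '>' is the genome name.
    """
    genomes = []
    for line in lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if fields and fields[0] not in genomes:
            genomes.append(fields[0])
    return genomes
```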
Add R script for general metabolic summary of main pathways (C, N, S) and add groupings of genomes for comparisons, maybe cut down on the not so essential markers
Add step in genome-phylogeny script to concatenate the alignments myself so don't have to depend on the outside perl script
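One way to do the concatenation in Python, sketched under the assumption that each marker alignment has already been read into a dictionary of genome name to aligned sequence. Genomes missing from a marker are padded with gaps so every concatenated row has the same length. The function name and data layout are illustrative, not the package's.

```python
def concatenate_alignments(alignments, genomes):
    """Concatenate per-marker alignments into one sequence per genome.

    alignments: list of dicts, one per marker, mapping genome -> aligned sequence.
    genomes: list of all genome names to include in the output.
    A genome absent from a marker alignment gets a run of '-' of that
    alignment's width.
    """
    concatenated = {g: [] for g in genomes}
    for aln in alignments:
        if not aln:
            continue  # marker with no hits contributes no columns
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = widths.pop()
        for g in genomes:
            concatenated[g].append(aln.get(g, "-" * width))
    return {g: "".join(parts) for g, parts in concatenated.items()}
```

The length check also surfaces the "equal lengths" problems early, before the concatenated file reaches the tree builder.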
repackaged for installation with pip, have to check user downloaded the markers in the right place
have to check that external dependencies are in the path and alert user if they need to install something or change the name of it
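A dependency check of this kind can be sketched with the standard library's `shutil.which`. The executable names in the usage lines are assumptions about how the tools are installed on a given system.

```python
import shutil


def find_missing_tools(tools):
    """Return the subset of executable names that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Example usage: the names below are assumed executable names.
missing = find_missing_tools(["hmmsearch", "prodigal", "Rscript"])
if missing:
    print("Not found on PATH, install or rename:", ", ".join(missing))
```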
Crashes out because of optional row names argument
Hi,
I was able to use summarize-metabolism but when using the search-custom-markers with the kofam database, I am running into some issues and I suspect it's some way the program is calling kofamscan?
Here's my command:
search-custom-markers --input bins_aa --output vitaminb3_bins_out_01062020 --markers_dir profiles/vitaminb3/ --markers_list vitaminb3_list.txt --metadata bin_metadata.csv --kofam kofamscan-1.1.0/exec_annotation --ko_list ko_list
The error I get is: TC bit thresholds unavailable on model K0xxxx (for all the KOs I listed).
I get results but they're totally blank/empty.
I am calling the executable (exec_annotation) and the ko_list should have the thresholds within. I tried running kofamscan on its own without issue. I downloaded the dependencies required for kofamscan in the metabolishmm conda env (ruby, parallel). Even my config.yml should be correct although I'm not sure that matters since metabolishmm is calling kofamscan.
Let me know if you know of a fix for this! Thank you very much!
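A possible workaround, sketched here and not taken from the package: read the score cutoffs from ko_list and pass them to hmmsearch explicitly, so the search does not depend on TC lines being present in the models. This assumes ko_list is tab-delimited with `knum`, `threshold` and `score_type` columns and uses `-` where no threshold is defined. The function names are hypothetical, and the KO entry in the test is a made-up example.

```python
import csv


def load_ko_thresholds(lines):
    """Parse ko_list lines into {knum: (threshold, score_type)}.

    lines: any iterable of text lines, e.g. an open ko_list file handle.
    Rows with no usable threshold are skipped.
    """
    thresholds = {}
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        value = row["threshold"]
        if value in ("", "-"):
            continue
        thresholds[row["knum"]] = (float(value), row["score_type"])
    return thresholds


def hmmsearch_command(ko, hmm_path, faa_path, out_path, thresholds):
    """Build an hmmsearch argument list with an explicit bit-score cutoff.

    Uses --domT for 'domain' score types and -T otherwise; with no known
    threshold, hmmsearch runs with its default reporting cutoffs.
    """
    cmd = ["hmmsearch", "--tblout", out_path]
    if ko in thresholds:
        score, score_type = thresholds[ko]
        flag = "--domT" if score_type == "domain" else "-T"
        cmd += [flag, str(score)]
    cmd += [hmm_path, faa_path]
    return cmd
```

The returned list can be passed to `subprocess.run`. Whether this matches how kofamscan itself applies the thresholds is not something this sketch claims.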
I'm a little too far down the road to fix some of this stuff for my mehg analysis. And that one is a little tricky with all the different datasets I pulled from and trying to make comparative analyses with HGT and whatnot.
A good thing to try would be to wrap up that project and try the presence/absence among the whole tree of screened genomes (TOL) for tfdA since it seems to be a little more widespread and not as rare as the mehg crap.
Steps that would have to be taken into account:
will scream about short contigs and using the -m or -c for metagenomic "incomplete" genomes to have greater than 1000 nts or something of the like, either make STDOUT null or put in quiet version
create_phylogeny workflow to include putting user genomes among the TOL DB in KA lab used for Hg paper
change order of creating environment and installing metabolisHMM, also the pip instructions and the R script download/database are a little wonky at the moment
It appears that the manifest is missing at least one file necessary to build
from the sdist for version 2.22. You're in good company, about 5% of other
projects updated in the last year are also missing files.
+ /tmp/venv/bin/pip3 wheel --no-binary metabolishmm -w /tmp/ext metabolishmm==2.22
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting metabolishmm==2.22
Downloading http://10.10.0.139:9191/root/pypi/%2Bf/ea0/9df216c0fb79a/metabolisHMM-2.22.tar.gz (14 kB)
ERROR: Command errored out with exit status 1:
command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-wkurcu8f/metabolishmm/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-wkurcu8f/metabolishmm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-wkurcu8f/metabolishmm/pip-egg-info
cwd: /tmp/pip-wheel-wkurcu8f/metabolishmm/
Complete output (5 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-wheel-wkurcu8f/metabolishmm/setup.py", line 4, in <module>
with open('requirements.txt') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
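The traceback shows setup.py opening requirements.txt, which is absent from the sdist. The usual remedy is to ship the file, for example with an `include requirements.txt` line in MANIFEST.in. As a complementary guard, setup.py could tolerate a missing file. A minimal sketch follows, with `read_requirements` as a hypothetical helper name.

```python
from pathlib import Path


def read_requirements(path="requirements.txt"):
    """Return requirement strings from a file, or [] if the file is absent.

    Blank lines and comment lines starting with '#' are skipped.
    """
    req_file = Path(path)
    if not req_file.is_file():
        return []
    entries = []
    for line in req_file.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.append(line)
    return entries
```

The result could be passed as `install_requires=read_requirements()` in the `setup()` call. Note that falling back to an empty list hides the packaging mistake, so shipping the file is still the real fix.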
Hi @elizabethmcd ,
thank you for providing this package. I'm excited to use it.
I believe providing a conda release for it would greatly improve it and attract users. I'm attempting one myself for Bioconda over here and would like to kindly invite you to collaborate, if you are keen :)
I'm not very experienced with conda so your help as the main developer would be invaluable.
Thank you for any assistance you can provide,
V
program says it's done before it actually finishes making the corresponding ribosomal tree, which takes longer but makes it look like it crashed out for some reason
qual trimming, equal lengths issues for a couple cases
If a genome has zero hits among a list of markers, it won't put it in the output, have to append these
Add function to automatically reformat headers of gene names, and add Prodigal as a dependency if given .fna files
Fix documentation and figures for publication specific release fixes and table comparisons
Start writing the documentation in the wiki
Make a v1.0 release after:
output formatted metadata files for iTOL based on the provided metadata, optional arguments
New script to take a set of bins with a species marker that is better resolved than ribosomal proteins and will most likely be recovered well in genomes
Get hits from refseq , find those top 5 hits per genome
Make tree of the candidate bins and the hits
Create a dataframe/matrix where every genome is a row and the columns are markers, with counts of hits from parsing the HMM outputs
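A sketch of such a matrix using only the standard library, assuming the HMM outputs have already been parsed into per-(genome, marker) hit counts. Every genome in the input list gets a row, so genomes with zero hits across all markers still appear in the output. The function name and data layout are illustrative.

```python
def build_count_matrix(genomes, markers, hit_counts):
    """Build a genome-by-marker table of hit counts.

    genomes: list of all input genome names (rows).
    markers: list of marker names (columns), in the desired order.
    hit_counts: dict mapping (genome, marker) -> number of hits.
    Missing pairs are filled with 0, so zero-hit genomes keep their row.
    """
    return {g: [hit_counts.get((g, m), 0) for m in markers] for g in genomes}
```

Writing the rows out with the `csv` module, with the marker list as the header, gives a table the R heatmap script could read.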
Hi! Thanks for the tool. I installed the software using the conda way. For the summarize-metabolism workflow, it did not perform hmmsearch (empty files in the out folder), while the search-custom-markers workflow did. Could you check if there is any issue with the summarize-metabolism script? Here is my cmd line: summarize-metabolism --input faa --output summaries --metadata ../genomeinfo.csv --aggregate ON. Thanks!
Markers don't have a default tc option for a threshold, either run separately or identify tc for them?
Hello! I ran metabolishmm and received the following error:
$ summarize-metabolism --input in_genomes --output out_genomes --metadata metabolishmm_metadata.csv --aggregate OFF
metabolisHMM v2.21
Reformatting fasta files...
Screening curated metabolic markers...
Parsing all results...
/home/tereiter/miniconda3/envs/metabolishmm/bin/summarize-metabolism:162: DeprecationWarning: 'U' mode is deprecated
with open(result, "rU") as input:
Traceback (most recent call last):
File "/home/tereiter/miniconda3/envs/metabolishmm/bin/summarize-metabolism", line 162, in <module>
with open(result, "rU") as input:
FileNotFoundError: [Errno 2] No such file or directory: 'out_genomes/out/QinJ_2012__CON/QinJ_2012__CON-091__bin.20-ccoP_TIGR00782.out'
out_genomes/out/QinJ_2012__CON/QinJ_2012__CON-091__bin.20-ccoP_TIGR00782.out
When I removed the dashes from the genome name, metabolishmm ran fine.
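The path in the error suggests the result name is being split at the first dash, so a genome name that itself contains a dash gets cut short. One hedged way around this is to split at the last dash instead, as sketched below. It assumes result files are named `<genome>-<marker>.out` and that marker names never contain a dash. The function name is hypothetical.

```python
from pathlib import Path


def split_result_name(result_path):
    """Split '<genome>-<marker>.out' into (genome, marker) at the LAST dash.

    Splitting from the right keeps dashes inside the genome name intact,
    assuming marker names themselves contain no dash.
    """
    stem = Path(result_path).stem  # drops the trailing '.out'
    genome, sep, marker = stem.rpartition("-")
    if not sep or not genome or not marker:
        raise ValueError(f"unexpected result file name: {result_path}")
    return genome, marker
```

An alternative is to replace dashes in genome names during the reformatting step, which is in effect what the manual workaround above does.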
Hi Elizabeth,
Thank you for this awesome tool! I am interested in using the summarize metabolism workflow, but I am unable to locate the "make-heatmap.R" script that is mentioned to be a requirement. Could you please let me know where I could access this script?
Thanks!
Binvir
At some point, I will fix the scripts to run with more reproducible functions and not my really ugly for loops. They work fine for now for testing and getting preliminary data.
I currently have where you can make a gene tree based off an HMM, and it gives that tree only if the marker is present in those genomes, which is nice for looking at distribution/evolution of that marker. I also do presence/absence of a suite of metabolic markers. I like Mike Lee's example of showing presence/absence with a highlighted tree of all genomes, so you can broadly see where the given marker is NOT located.
Steps:
Make plotting ON by default, checking that Rscript is in the path or at a provided path; add an option to turn it off if the user just wants the stats and will make the figure themselves without dealing with the R functions
Create own function for concatenating alignments
Set argument for giving a project name to name the results and out with the project name so don't have to have multiple head directories of the package
also output heatmap figures in .png in addition to .pdf because some of them have issues
create-genome-phylogeny gives error about optional metadata/ITOL output. change to be like single-marker-phylogeny
Single marker script uses FastTree, change to RAxML, and see if other genome phylogeny script can be fine with RAxML without submitting to the cynet server thing