
metabolishmm's People

Contributors

elizabethmcd

metabolishmm's Issues

single marker corresponding tree making

The program reports that it is done before it actually finishes building the corresponding ribosomal tree. The tree takes longer to build, so it looks like the run crashed for some reason.

trimming

Quality trimming: equal-length issues in a couple of cases.

Metabolic Summary heatmap

  • Create a single function or separate functions to split large dataframe into cycles, so they are colored differently, and then stitch them back together in one single PDF
  • Make the axis labels and title a bit prettier
  • Above each heatmap that will be stitched together, put the title of the cycle, and then have a larger title for the merged "grid"

refseq hits

New script to take a set of bins and a species marker that is better resolved than ribosomal proteins and will most likely be recovered well in genomes.
Get hits from RefSeq and find the top 5 hits per genome.
Make a tree of the candidate bins and the hits.
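
A minimal sketch of the "top 5 hits per genome" step, assuming the search results have already been parsed into (genome, subject, bitscore) tuples. The function name and the tuple layout are hypothetical, not part of metabolisHMM.

```python
from collections import defaultdict


def top_hits_per_genome(hits, n=5):
    """Keep the n best-scoring hits for each genome.

    hits: iterable of (genome, subject, bitscore) tuples (assumed layout).
    Returns a dict mapping genome -> list of (subject, bitscore), best first.
    """
    by_genome = defaultdict(list)
    for genome, subject, score in hits:
        by_genome[genome].append((subject, score))
    return {
        genome: sorted(rows, key=lambda row: row[1], reverse=True)[:n]
        for genome, rows in by_genome.items()
    }
```

The subjects returned for each bin could then be pooled with the candidate bins before building the tree.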

Missing files in sdist

It appears that the manifest is missing at least one file necessary to build
from the sdist for version 2.22. You're in good company, about 5% of other
projects updated in the last year are also missing files.

+ /tmp/venv/bin/pip3 wheel --no-binary metabolishmm -w /tmp/ext metabolishmm==2.22
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting metabolishmm==2.22
  Downloading http://10.10.0.139:9191/root/pypi/%2Bf/ea0/9df216c0fb79a/metabolisHMM-2.22.tar.gz (14 kB)
    ERROR: Command errored out with exit status 1:
     command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-wkurcu8f/metabolishmm/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-wkurcu8f/metabolishmm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-wkurcu8f/metabolishmm/pip-egg-info
         cwd: /tmp/pip-wheel-wkurcu8f/metabolishmm/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-wheel-wkurcu8f/metabolishmm/setup.py", line 4, in <module>
        with open('requirements.txt') as f:
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
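
The traceback shows setup.py opening requirements.txt, which is not in the tarball. The usual remedy is to add the line `include requirements.txt` to MANIFEST.in so the file ships with the sdist. As an extra guard, setup.py could tolerate a missing file. The helper below is a sketch of that idea; `read_requirements` is a hypothetical name, not the project's actual code.

```python
from pathlib import Path


def read_requirements(path="requirements.txt"):
    """Return requirement strings from a file, or [] if the file is absent.

    Blank lines and comment lines (starting with '#') are skipped.
    """
    req_file = Path(path)
    if not req_file.is_file():
        return []
    entries = []
    for line in req_file.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.append(line)
    return entries
```

In setup.py this would be used as `install_requires=read_requirements()`. Falling back to an empty list lets the build succeed but silently drops dependencies, so fixing the manifest is still the real fix.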

Reformatting fasta files

Add a function to automatically reformat the headers of gene names, and add Prodigal as a dependency for when .fna files are given

heatmap row order

Add an option to provide a list for custom ordering of heatmap rows instead of the default alphabetical order, so users can order rows however they want, for example in the taxonomic order of the tree
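
One way the custom ordering could work, sketched in plain Python: rows named in the user's list come first in that order, and anything not listed falls back to alphabetical. `order_rows` is a hypothetical helper, not an existing metabolisHMM function.

```python
def order_rows(row_names, custom_order=None):
    """Order heatmap rows by a user list, falling back to alphabetical.

    Names in custom_order that are not in row_names are ignored, and rows
    missing from custom_order are appended alphabetically at the end.
    """
    if not custom_order:
        return sorted(row_names)
    present = set(row_names)
    ordered = []
    seen = set()
    for name in custom_order:
        if name in present and name not in seen:
            ordered.append(name)
            seen.add(name)
    leftovers = sorted(name for name in set(row_names) if name not in seen)
    return ordered + leftovers
```

With a pandas dataframe, the resulting list could be passed to `df.reindex(...)` before plotting.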

Metabolic Summary > Visualization

Add R script for general metabolic summary of main pathways (C, N, S) and add groupings of genomes for comparisons, maybe cut down on the not so essential markers

plotting option

Make plotting ON by default, and check that Rscript is in the PATH or at a provided path. Add an option to turn plotting off for users who just want the stats and will make the figure themselves without dealing with the R functions
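
A sketch of how such a switch might look with argparse. The flag names `--plotting` and `--rscript` are assumptions made for illustration (modeled on the existing `--aggregate ON/OFF` style), not current options.

```python
import argparse
import shutil


def build_parser():
    """Parser with a hypothetical plotting switch that defaults to ON."""
    parser = argparse.ArgumentParser(description="plotting switch sketch")
    parser.add_argument("--plotting", choices=["ON", "OFF"], default="ON")
    parser.add_argument("--rscript", default="Rscript",
                        help="name of, or path to, the Rscript executable")
    return parser


def resolve_rscript(args):
    """Return the Rscript path, or None when plotting is turned off."""
    if args.plotting == "OFF":
        return None
    found = shutil.which(args.rscript)
    if found is None:
        raise SystemExit(
            "Rscript was not found. Install R, pass --rscript, "
            "or rerun with --plotting OFF."
        )
    return found
```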

Markers w/o -tc option

Markers without a TC cutoff have no default threshold; either run them separately or identify a TC for them?

Concatenate script

Add a step in the genome-phylogeny script to concatenate the alignments directly, so it doesn't have to depend on the outside Perl script
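
A rough sketch of in-script concatenation, assuming each marker alignment has already been read into a dict of genome name to aligned sequence. Genomes missing a marker are padded with gaps so every concatenated row has the same length. The function name and input layout are hypothetical.

```python
def concatenate_alignments(alignments, genomes):
    """Concatenate per-marker alignments into one sequence per genome.

    alignments: list of dicts (one per marker), genome -> aligned sequence.
    genomes: genome names to include, in output order.
    Genomes absent from a marker alignment get a run of '-' of that width.
    """
    parts = {genome: [] for genome in genomes}
    for alignment in alignments:
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError("alignment is empty or has unequal lengths")
        width = widths.pop()
        for genome in genomes:
            parts[genome].append(alignment.get(genome, "-" * width))
    return {genome: "".join(chunks) for genome, chunks in parts.items()}
```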

programs installed

Have to check that external dependencies are in the PATH and alert the user if they need to install something or change its name
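
A small sketch of the check using `shutil.which` from the standard library. The program names in the usage note are examples only; the executable names on a given system may differ, which is exactly the case the alert is meant to catch.

```python
import shutil


def missing_programs(programs):
    """Return the names from programs that cannot be found on the PATH."""
    return [name for name in programs if shutil.which(name) is None]
```

A workflow could call `missing_programs(["hmmsearch", "prodigal", "FastTree"])` at startup and exit with a message listing anything returned.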

TOL DB

create_phylogeny workflow to include putting user genomes among the TOL DB in KA lab used for Hg paper

Kofam search yields TC bit threshold error

Hi,
I was able to use summarize-metabolism but when using the search-custom-markers with the kofam database, I am running into some issues and I suspect it's some way the program is calling kofamscan?

Here's my command:
search-custom-markers --input bins_aa --output vitaminb3_bins_out_01062020 --markers_dir profiles/vitaminb3/ --markers_list vitaminb3_list.txt --metadata bin_metadata.csv --kofam kofamscan-1.1.0/exec_annotation --ko_list ko_list

The error I get is: TC bit thresholds unavailable on model K0xxxx (for all the KOs I listed).
I get results, but they are totally blank/empty.

I am calling the executable (exec_annotation), and the ko_list should have the thresholds within. I tried running kofamscan on its own without issue. I downloaded the dependencies required for kofamscan in the metabolishmm conda env (ruby, parallel). Even my config.yml should be correct, although I'm not sure that matters since metabolishmm is calling kofamscan.

Let me know if you know of a fix for this! Thank you very much!
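
HMMER prints this message when `--cut_tc` is requested for a profile that carries no TC line, and KOfam profiles keep their thresholds in ko_list rather than in the HMM files. Whether that is what happens inside search-custom-markers is not confirmed here. The sketch below shows one possible workaround: read the thresholds from ko_list and pass them to hmmsearch with `-T` or `--domT`. It assumes ko_list is tab-separated with a header row containing the columns `knum`, `threshold`, and `score_type`; the function names are hypothetical.

```python
import csv


def load_thresholds(handle):
    """Read KO thresholds from an open ko_list file (assumed TSV layout).

    Returns a dict mapping KO number -> (threshold, score_type).
    Rows with no usable threshold ('-' or empty) are skipped.
    """
    thresholds = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["threshold"] in ("", "-"):
            continue
        thresholds[row["knum"]] = (float(row["threshold"]), row["score_type"])
    return thresholds


def hmmsearch_cutoff_args(ko, thresholds):
    """Build hmmsearch cutoff arguments for one KO, or [] if none is known."""
    if ko not in thresholds:
        return []
    score, score_type = thresholds[ko]
    flag = "--domT" if score_type == "domain" else "-T"
    return [flag, str(score)]
```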

RaxML implementation

The single-marker script uses FastTree; change it to RAxML, and see if the other genome phylogeny script can also run fine with RAxML without submitting to the cynet server thing

prodigal quiet

Prodigal will scream about short contigs and about using the -m or -c options for metagenomic "incomplete" genomes, which need more than 1000 nts or something of the like. Either send stdout to null or run it in quiet mode
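
A sketch of both approaches combined: Prodigal's `-q` flag suppresses its normal progress messages, and the subprocess streams are also discarded. The helper names are hypothetical, and the command assumes only protein translations (`-a`) are wanted, so the gene coordinates Prodigal would otherwise print to stdout are thrown away.

```python
import subprocess


def prodigal_command(fna, faa, quiet=True):
    """Build a Prodigal command that writes protein translations to faa."""
    cmd = ["prodigal", "-i", fna, "-a", faa]
    if quiet:
        cmd.append("-q")
    return cmd


def run_prodigal(fna, faa):
    """Run Prodigal silently; raises CalledProcessError if it fails."""
    subprocess.run(
        prodigal_command(fna, faa),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=True,
    )
```

Discarding stderr also hides genuine errors, so a more careful version might capture it and show it only when the return code is nonzero.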

Create a conda release

Hi @elizabethmcd ,

thank you for providing this package. I'm excited to use it.

I believe providing a conda release for it would greatly improve it and attract users. I'm attempting one myself for Bioconda over here and would like to kindly invite you to collaborate, if you are keen :)

I'm not very experienced with conda so your help as the main developer would be invaluable.

Thank you for any assistance you can provide,

V

tfdA test case

I'm a little too far down the road to fix some of this stuff for my mehg analysis. And that one is a little tricky with all the different datasets I pulled from and trying to make comparative analyses with HGT and whatnot.

A good thing to try would be to wrap up that project and try the presence/absence among the whole tree of screened genomes (TOL) for tfdA since it seems to be a little more widespread and not as rare as the mehg crap.

Steps that would have to be taken into account:

  1. Pulling down refseq genomes
  2. Pulling down large-scale, publicly available genome sets (Anantharaman, Woodcroft, Crits-Christoph, Tran, Parks) and then dereplicating by some threshold, so we don't have a bunch of duplicated bins across datasets that are probably very similar in sequence and are just going to add to the mess of the tree
  3. This is the genome database to screen from = have nucleotide and protein files stored somewhere (OSF?)
  4. Then can create all the presence/absence analyses with tfdA as a test case and looking closer into that
  5. FastTree will be implemented in the pipeline as a test to look at what things pop out, but with strong recommendations to run with RAxML on servers with more computing power because FastTree be sucky

installation instructions

change order of creating environment and installing metabolisHMM, also the pip instructions and the R script download/database are a little wonky at the moment

make-heatmap.R script location?

Hi Elizabeth,

Thank you for this awesome tool! I am interested in using the summarize metabolism workflow, but I am unable to locate the "make-heatmap.R" script that is mentioned to be a requirement. Could you please let me know where I could access this script?

Thanks!
Binvir

fixes

  • aggregating plotting
  • documentation with kofamkoala markers
  • version numbers

Presence/absence among a given set of genomes

I currently have where you can make a gene tree based off an HMM, and it gives that tree only if the marker is present in those genomes, which is nice for looking at distribution/evolution of that marker. I also do presence/absence of a suite of metabolic markers. I like Mike Lee's example of showing presence/absence with a highlighted tree of all genomes, so you can broadly see where the given marker is NOT located.

Steps:

  1. Can pass an assembly accession file to ncbi-genome-download. Can also get metadata from here
  2. Calling ORFs/annotations because just pulling down nucleotide genbank files
  3. Search for the marker across proteins
  4. Make ribosomal protein tree of all genomes in the set
  5. Highlight with color the presence/absence, and also give an output of # among that clade (such as phyla)
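
Step 5's per-clade count could be sketched as below, assuming a genome-to-clade mapping (for example from the metadata) and the set of genomes in which the marker was found. Names and input layout are hypothetical.

```python
from collections import Counter


def presence_by_clade(genome_to_clade, genomes_with_marker):
    """Count marker presence per clade.

    Returns a dict mapping clade -> (genomes with the marker, total genomes).
    Genomes not listed in genome_to_clade are ignored.
    """
    totals = Counter(genome_to_clade.values())
    present = Counter(
        genome_to_clade[genome]
        for genome in set(genomes_with_marker)
        if genome in genome_to_clade
    )
    return {clade: (present.get(clade, 0), totals[clade]) for clade in totals}
```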

Demo dataset

Change demo dataset to Microbial Observatory Lake ref MAGs/SAGs set

  • References among TOL with concatenated ribosomal proteins

  • Metabolic summaries/pathways of interest breakdowns

Directory names

Set argument for giving a project name to name the results and out with the project name so don't have to have multiple head directories of the package

Pretty-fy

At some point, I will fix the scripts to run with more reproducible functions and not my really ugly for loops. They work fine for now for testing and getting preliminary data.

The directory of curated metabolic markers could not be found.

Hi,
I am running the 'summarize-markers' as

summarize-metabolism --input aquifer-genomes/ --output summary --metadata groups.csv

but I am getting the following error:

#############################################
metabolisHMM v1.4.0
     The directory of curated metabolic markers could not be found.
     Please either download the markers from https://github.com/elizabethmcd/metabolisHMM/releases/download/v2.0/metabolisHMM_v2.0_markers.tgz and decompress the tarball, or move the directory to where you are running the workflow from.

However, the models exist in curated_markers/metabolic_markers/*hmm
Also, where is the make-heatmap.R?

output format

also output heatmap figures in .png in addition to .pdf because some of them have issues

Documentation

  • Installation instructions for dependencies
  • Caveats
  • Adding your own markers
  • Functions associated with each marker
  • Suggested references to cite
  • Add a license of some sort

dev branch

  • Separate workflow scripts
  • Conda installation & packaging of DB markers
  • Multithreading?
  • Main reformatting, utils, and arguments scripts
  • Plotting package fixes and updates to new package versions (seaborn, matplotlib)

iTOL options

create-genome-phylogeny gives an error about the optional metadata/iTOL output. Change it to behave like single-marker-phylogeny

Additions to create-genome-phylogeny

Integrating putative genome classifications with the Hug et al. 2016 TOL of genomes, highlighting input genomes

  • Argument to download TOL genomes (bacteria, archaea, eukarya, all or some)
  • Point to already downloaded TOL genomes directory or custom references directory
  • Point to input references
  • Make tree
  • Highlight with metadata input genome for iTOL to visualize where they fall
  • Also do this for highlighting on the ribosomal tree if a certain single marker falls anywhere

Single Marker Analysis

Quick analysis of single marker in all genomes (such as nifA) for looking at evolutionary history of the specific protein and how it compares to the genome phylogeny

Any Pathway > Visualization

My custom-marker script is currently set to work with the WLJ proteins. It also doesn't order them in function until I get to the R script. Change this to take as input as well a list of the order of any set of markers (so that the user can get the figure in the order wanted), and visualize the heatmap.

Take a list
Make heatmap
R script takes into account the number of genomes within a group (such as phyla) and also outputs that number next to the phyla label. Make an option to turn that off if comparing across single genomes

Major Release Tasks

Make a v1.0 release after:

  • Fix directory names
  • Give options for .fna or .faa input
  • Makes ribosomal tree of single marker hits for comparison
  • Preliminary heatmap figures (have to figure out if I can do this in python with seaborn or too much of a hassle)
  • Figure out how to put in the /bin and then call from installation in path?

Option for corresponding tree from single phylogeny

Add option for RIBO=TRUE if want a corresponding ribosomal tree of only the genomes that have a specific marker in the single-marker-phylogeny workflow.

Gets the genomes to make the tree from by making a list from either the hits list or the .faa of hits with the '>' header, and then makes the corresponding tree
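
A sketch of pulling the genome list from the '>' headers of the hits file. It assumes the genome name is the first whitespace-delimited token of each header, which may not match the real header format; the function name is hypothetical.

```python
def genomes_from_fasta(lines):
    """Collect unique genome names from FASTA header lines, in first-seen order.

    Assumes the genome name is the first token after '>' in each header.
    """
    genomes = []
    seen = set()
    for line in lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue
        name = fields[0]
        if name not in seen:
            seen.add(name)
            genomes.append(name)
    return genomes
```

Because it only iterates over lines, an open file handle can be passed directly: `genomes_from_fasta(open("hits.faa"))`, where the filename is a placeholder.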

genome names cannot contain dashes?

Hello! I ran metabolishmm and received the following error:

$ summarize-metabolism --input in_genomes --output out_genomes --metadata metabolishmm_metadata.csv --aggregate OFF
metabolisHMM v2.21
Reformatting fasta files...
Screening curated metabolic markers...
Parsing all results...
/home/tereiter/miniconda3/envs/metabolishmm/bin/summarize-metabolism:162: DeprecationWarning: 'U' mode is deprecated
  with open(result, "rU") as input:
Traceback (most recent call last):
  File "/home/tereiter/miniconda3/envs/metabolishmm/bin/summarize-metabolism", line 162, in <module>
    with open(result, "rU") as input:
FileNotFoundError: [Errno 2] No such file or directory: 'out_genomes/out/QinJ_2012__CON/QinJ_2012__CON-091__bin.20-ccoP_TIGR00782.out'

out_genomes/out/QinJ_2012__CON/QinJ_2012__CON-091__bin.20-ccoP_TIGR00782.out

When I removed the dashes from the genome name, metabolishmm ran fine.
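
In the missing path, the directory is named QinJ_2012__CON, which is the genome name cut at its first dash. That suggests, though it is not confirmed, that result names of the form genome-marker.out are being split on the first '-'. A sketch of a split that tolerates dashes in the genome name by splitting on the last one instead, assuming marker names themselves never contain a dash:

```python
def split_result_name(filename):
    """Split 'genome-marker.out' into (genome, marker) on the LAST dash.

    Assumes marker names contain no '-' (an assumption, not verified).
    """
    stem = filename[:-len(".out")] if filename.endswith(".out") else filename
    if "-" not in stem:
        raise ValueError("expected a name of the form genome-marker.out")
    genome, marker = stem.rsplit("-", 1)
    return genome, marker
```

An alternative would be to replace dashes in genome names with underscores during the fasta reformatting step, which matches the workaround reported above.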

no hmmersearch for summarize-metabolism

Hi! Thanks for the tool. I installed the software using the conda way. For summarize-metabolism workflow, it did not perform hmmsearch (empty files for out folder); while search-custom-markers workflow did that. Could you check if there is any issue with the summarize-metabolism script? Here is my cmd line: summarize-metabolism --input faa --output summaries --metadata ../genomeinfo.csv --aggregate ON. Thanks!
