magpurify's People

Contributors

apcamargo, snayfach

magpurify's Issues

Analysis automation

Hello everybody. I have a question about this software. I have 5 directories, each with at least 40 bins, and I need to run at least 4 modules on them. Do I have to run each module separately for every bin? Can the software not take a directory and process all of its bins at once?
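One way to automate this outside the tool is a small wrapper script. The sketch below is not part of MAGpurify. It assumes the per-bin command form `magpurify <module> <bin.fna> <output_dir>` quoted in other issues on this page, and the module list, the `.fna` suffix, the output layout, and the helper names `build_commands` and `run_all` are all illustrative assumptions.

```python
import subprocess
from pathlib import Path

# Modules to run on every bin. The names come from other issues on this
# page; which ones you need is an assumption of this sketch.
MODULES = ["phylo-markers", "clade-markers", "tetra-freq", "gc-content"]


def build_commands(bin_dirs, out_root, modules=MODULES, suffix=".fna"):
    """Return one magpurify command (as an argument list) per bin and module."""
    commands = []
    for bin_dir in map(Path, bin_dirs):
        for fasta in sorted(bin_dir.glob(f"*{suffix}")):
            # One output directory per bin, so results do not collide.
            out_dir = Path(out_root) / bin_dir.name / fasta.stem
            for module in modules:
                commands.append(["magpurify", module, str(fasta), str(out_dir)])
    return commands


def run_all(bin_dirs, out_root):
    """Run every command in turn and stop at the first failure."""
    for command in build_commands(bin_dirs, out_root):
        subprocess.run(command, check=True)
```

Calling `run_all(["dir1", "dir2"], "magpurify_out")` would then process every bin in both directories. Passing argument lists to `subprocess.run` avoids shell quoting problems with unusual file names.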

Conspecific module

  • Integrate MAGdecon module that flags contigs matching other species
  • Create default database

"lastal: can't open file: -P" When running "clade-markers"

Hi, I started the test workflow with MAGpurify but ran into an issue at the "clade-markers" step. I am using a conda environment with Python 2.7 and MAGpurify installed. "phylo-markers" ran successfully, but "clade-markers" fails with the error below. It looks like the command contains two "-P 1" flag-and-argument pairs, and the second one confuses lastal. Some output is written to the clade-markers folder, but clean-bin then reports "clade-markers: no output file found". This is also the first issue I have reported, so I appreciate any feedback that helps me help you identify the problem. Thank you for your help.

magpurify clade-markers example/test.fna example/output

After the command has run for a little while, I get an error message:

Performing pairwise alignment of genes against MetaPhlan2 db of clade-specific genes

Error encountered executing:
lastal -p BLOSUM62 -P 1 -f blasttab+ -m 10 MAGpurify-db-v1.0/clade-markers/markers.faa example/output/clade-markers/genes.faa -P 1 > example/output/clade-markers/genes.m8

Error message:
lastal: can't open file: -P
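The error is consistent with lastal reading the trailing "-P" as an input file name, since it comes after the two positional arguments. As an illustration only (this is not MAGpurify's code, and `build_lastal_command` is a hypothetical helper), a command builder that places every option before the positionals and emits `-P` once could look like this:

```python
import shlex


def build_lastal_command(db, query, out, threads=1):
    """Build a lastal shell command with every option before the positionals."""
    options = ["-p", "BLOSUM62", "-P", str(threads), "-f", "blasttab+", "-m", "10"]
    positionals = [db, query]
    parts = ["lastal", *options, *positionals]
    quoted = " ".join(shlex.quote(part) for part in parts)
    # The redirect target is quoted too, in case the path contains spaces.
    return f"{quoted} > {shlex.quote(out)}"
```

With the paths from the report, this yields the same command as above minus the second "-P 1".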

too many bam files

Hi!

Cool software, thanks!

Any suggestions for how to run the coverage module with, for example, over 15,000 BAM files? I get the following message:

OSError: [Errno 7] Argument list too long: '/bin/sh'

Best,
David
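The error means the assembled command line exceeds what the operating system accepts. One generic workaround is to split the file list into batches that each stay under a length budget. The sketch below only shows the batching step. The helper name `batch_by_length` and the default budget are arbitrary assumptions, and whether per-batch results from the coverage module can be combined afterwards is an open question that this issue does not answer.

```python
def batch_by_length(paths, max_chars=100_000):
    """Split paths into batches whose joined length stays under max_chars."""
    batches, current, size = [], [], 0
    for path in paths:
        cost = len(path) + 1  # +1 for the separating space
        if current and size + cost > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(path)
        size += cost
    if current:
        batches.append(current)
    return batches
```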

No module named 'utility'

Traceback (most recent call last):
  File "/home/kiesers/scratch/Test_magpurify/MAGpurify/run_qc.py", line 39, in <module>
    from magpurify import csmg
  File "/home/kiesers/scratch/Test_magpurify/MAGpurify/magpurify/csmg.py", line 4, in <module>
    import utility
ModuleNotFoundError: No module named 'utility'

Error running phylo-markers

Dear developers,
I get the following error when running the phylo-markers module. It happens for some MAGs but not for others. Any help will be appreciated.
Best,

magpurify phylo-markers results/DAS/H4_DASTool_bins/maxbin.097.fasta.contigs.fa temp/purify

• Calling genes with Prodigal
 all genes: temp/purify/phylo-markers/genes.[ffn|faa]
• Identifying PhyEco phylogenetic marker genes with HMMER
 hmm results: temp/purify/phylo-markers/phyeco.hmmsearch
 marker genes: temp/purify/phylo-markers/markers
• Performing pairwise BLAST alignment of marker genes against database
 blast results: temp/purify/phylo-markers/alns
• Finding taxonomic outliers
Traceback (most recent call last):
 File "/home/tamames/software/miniconda3/bin/magpurify", line 10, in <module>
   sys.exit(cli())
 File "/home/tamames/software/miniconda3/lib/python3.8/site-packages/magpurify/cli.py", line 116, in cli
   args["func"](args)
 File "/home/tamames/software/miniconda3/lib/python3.8/site-packages/magpurify/modules/phylo.py", line 419, in main
   flagged = flag_contigs(args["db"], args["tmp_dir"], args)
 File "/home/tamames/software/miniconda3/lib/python3.8/site-packages/magpurify/modules/phylo.py", line 372, in flag_contigs
   bin.genes[aln["qname"]].annotations.append(annotation)
KeyError: 'megahit_65877_14'

Interpretation of cutoff values

Hi,

Can you clarify what the cutoff values are for gc-content and tetra-freq and how these were established? My guess is that for gc-content the cutoff of 15.75 means that only contigs that deviate from the mean GC content by more than this value are flagged as contaminated. This seems like a very, very conservative value though (e.g., mean GC of 50% only flags contigs at <34.25% or >65.75%?).

I appreciate that the tetra-freq measure is more abstract, so I'm more interested in how the 0.06 default was established.

Thanks,
Donovan
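To make the interpretation in this question concrete, here is a minimal sketch of the rule as the asker guesses it: a contig is flagged when its GC content differs from the bin mean by more than the cutoff. This encodes the guess only and is not confirmed to be what the gc-content module does. The function names and the way the mean is supplied are assumptions.

```python
def gc_percent(seq):
    """GC content of a sequence in percent, ignoring non-ACGT characters."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 100.0 * gc / (gc + at) if gc + at else 0.0


def flag_by_gc(contig_gc, mean_gc, cutoff=15.75):
    """Return IDs of contigs whose GC deviates from mean_gc by more than cutoff."""
    return [cid for cid, gc in contig_gc.items() if abs(gc - mean_gc) > cutoff]
```

Under this reading, with a mean of 50% a contig at 34.0% or 66.0% would be flagged, while one at exactly 34.25% would not.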

Adding another genome to known-contam?

Hi,
How can I add another contaminant genome in the database?
The easiest way seems to be to add it to the known-contam folder, build the database file with BLAST+, and change lines 91 and 105 of contam.py from for target in ["hg38", "phix"]: to for target in ["hg38", "phix", "newOne"]:.
Am I missing something?
Greg

phylo-markers: no output file found...

When I run the command magpurify clean-bin ./metabat_bin_last/${line} ./magpurify_out ./magpurify_out_fna/${line}.fna, I encounter this problem:

Reading flagged contigs
 phylo-markers: no output file found
 clade-markers: no output file found
 conspecific: no output file found
 tetra-freq: no output file found
 gc-content: no output file found
 coverage: no output file found
 known-contam: no output file found

I don't know how to solve it.

lastal: not found

@Justan6 wrote:

I encountered the error below:
Error encountered executing:
lastal -p BLOSUM62 -P 1 -f blasttab+ -m 10 /home/nwezejus/MAGpurify-db-v1.0/clade-markers/markers.faa Magpurify_out/clade-markers/genes.faa -P 1 > Magpurify_out/clade-markers/genes.m8

Error message:
b'/bin/sh: 1: lastal: not found\n'

@Justan6 please make sure that lastal has been installed and can be called from the command line. If not, please use conda to install the software or install from https://gitlab.com/mcfrith/last
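A quick way to check this before starting a run is to look the executable up on PATH. A minimal sketch, where `missing_tools` is a hypothetical helper name:

```python
import shutil


def missing_tools(tools):
    """Return the tools from the list that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

If `missing_tools(["lastal"])` returns a non-empty list, the clade-markers step will fail as shown above.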

Add FPR cutoffs to modules

  • Modules: GC, TNF, depth
  • Expected % of contigs flagged incorrectly
  • How does FPR relate to contig length, mean value for bin, variance

tetra-freq unable to handle "N" nucleotide

The tetra-freq module crashes if there is an N nucleotide in the nucleotide sequence.
An example error message is:

File "/home/ubuntu/.local/lib/python3.6/site-packages/magpurify/modules/tetra.py", line 87, in main
contig.kmers[kmer_rev] += 1
KeyError: 'NTTC'

N nucleotides are very common in MAGs and draft genome assemblies, so this causes errors frequently, such as when working with the UHGG.

Deletion of N nucleotides will cause artificial adjacencies that will bias the tetra-nucleotide frequency profile. Random imputation would have similar bias. Ideally, any 4-mer with an N would just be ignored when constructing tetra-nucleotide frequency profiles.
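The suggested behaviour can be sketched as follows. This is an illustration, not the module's actual code. It assumes that k-mers are counted on both strands, which the `kmer_rev` variable in the traceback hints at but does not confirm.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def tetra_counts(seq, k=4):
    """Count k-mers on both strands, skipping any k-mer with a non-ACGT base."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # e.g. 'NTTC': ignore it instead of raising KeyError
        counts[kmer] += 1
        counts[kmer.translate(COMPLEMENT)[::-1]] += 1  # reverse complement
    return counts
```

Because the sequence itself is left untouched, no artificial adjacencies are created across the N positions.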

Issue with building my reference database for conspecific module

Hello,

I have been trying to build my own reference database for the conspecific module using the command lines from the manual but I keep failing.

The error message that I got was:

...tching "/lustre04/scratch/djung09/MAGpurify/magpurify/MAGpurify-db-v1.0/wvu007.fna"
for reading.not open "/lustre04/scratch/djung09/MAGpurify/magpurify/MAGpurify-db-v1.0/wvu007.fna"

So I tried with the example genome files but it still fails:
Sketching example/ref_genomes/ERS473214_89.fna...
ERROR: could not open example/ref_genomes/ERS473214_89.fna for reading.

I have been using Compute Canada cloud and both mash and MAGpurify are installed and work fine. Could you help fix this problem? Thanks!

Question about --weighted-means and % bin removed

Hi, thanks for building and maintaining this tool!

I have a question about a section of Nayfach et al. 2019 as it relates to running MAGpurify. The paper says:
"In rare cases, these approaches may erroneously flag a large proportion of a MAG. To avoid this, we applied a particular approach to a MAG only if it resulted in ≤25% reduction in total length."

I was wondering if you could comment on the purpose of the ≤25% length reduction requirement for a MAGpurify module to be run. It does not look to me like the clean-bin module turns off a given module based on length. Is the 25% rule somehow related to the new --weighted-means flag described on the Releases page? In other words, does --weighted-means try to avoid removing large chunks of a bin, where previously such bins would only have been rescued by turning off modules that remove >25% of the bin? Hopefully this makes sense. Thanks!

Best,
Bryan
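For reference, the rule quoted from the paper can be written down as a small filter. This is a sketch of the quoted sentence only, not a description of what clean-bin does. The function name and the input dictionaries are hypothetical.

```python
def modules_to_apply(contig_lengths, flagged_by_module, max_fraction=0.25):
    """Keep modules whose flagged contigs total at most max_fraction of the bin."""
    total = sum(contig_lengths.values())
    kept = []
    for module, flagged in flagged_by_module.items():
        removed = sum(contig_lengths[contig] for contig in flagged)
        if total and removed / total <= max_fraction:
            kept.append(module)
    return kept
```

For a 1,000 bp bin, a module that flags 250 bp would still be applied under this rule, while one that flags 600 bp would be skipped.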

Suggestions based on trial use

Hi,

Thanks for your work on MAGpurify, I just tried it out on some example data and it seems to work quite well. I do have a few questions if you have time to answer them:

  1. I am not sure I understand how the conspecific module will deal with "novel" sequences, e.g. contigs that are actually derived from a conspecific collection of species but are not present in the type strains available in the accompanying assemblies used to construct the Mash sketch that performs the initial tax assignment. Will these contigs be excluded as contaminants?

  2. How are calls from multiple methods (tetra vs. gc vs. phylogenetic markers vs. conspecific etc) integrated?

  3. Is it possible to modify the conspecific module to automatically decompress reference genomes if stored as individual archives (fasta1.fna.gz, etc) prior to BLAST-ing? Mash can handle compressed input fine, so it would be good to save space this way, especially when running MAGpurify in a docker container.
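Regarding the third point, the decompression step could be handled by a wrapper until the module supports it. A minimal sketch, assuming gzip-compressed FASTA files with a `.gz` suffix. The helper name `decompressed_copy` is hypothetical and nothing here is part of MAGpurify.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def decompressed_copy(path, tmp_dir=None):
    """Return a path to plain FASTA: the input itself, or a decompressed copy."""
    path = Path(path)
    if path.suffix != ".gz":
        return path  # already uncompressed, nothing to do
    tmp_dir = Path(tmp_dir or tempfile.mkdtemp())
    target = tmp_dir / path.stem  # 'fasta1.fna.gz' -> 'fasta1.fna'
    with gzip.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams, so large genomes are fine
    return target
```

The temporary copies can be deleted once the alignment step has finished, so only the compressed files need to live in the container image.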
