snayfach / magpurify

Improvement of metagenome-assembled genomes

License: GNU General Public License v3.0

Python 100.00%


magpurify's Issues

Analysis automation

Hello everybody. I have a question about this software. I have 5 directories with at least 40 bins in each, and I need to run at least 4 modules. Do I have to run each module on each bin separately? Does the software not accept a directory and process all of its bins at once?
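A minimal sketch of one way to script this, assuming each bin is a FASTA file ending in .fna and that modules are invoked as "magpurify <module> <bin> <output dir>", followed by "magpurify clean-bin <bin> <output dir> <cleaned file>", as in the commands quoted in other issues on this page. The function names, the choice of four modules, and the output layout are hypothetical, not part of MAGpurify.

```python
import subprocess
from pathlib import Path

# Module names taken from the clean-bin output quoted in other issues;
# which four to run is an assumption made for illustration.
MODULES = ["phylo-markers", "clade-markers", "tetra-freq", "gc-content"]


def build_commands(bin_dirs, out_root, modules=MODULES, suffix=".fna"):
    """Return one argv list per (bin, module), plus one clean-bin call per bin."""
    commands = []
    for bin_dir in map(Path, bin_dirs):
        for fasta in sorted(bin_dir.glob(f"*{suffix}")):
            # Hypothetical layout: <out_root>/<directory name>/<bin name>/
            out_dir = Path(out_root) / bin_dir.name / fasta.stem
            for module in modules:
                commands.append(["magpurify", module, str(fasta), str(out_dir)])
            cleaned = out_dir / f"{fasta.stem}.cleaned{suffix}"
            commands.append(
                ["magpurify", "clean-bin", str(fasta), str(out_dir), str(cleaned)]
            )
    return commands


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling run_all(build_commands(["dir1", "dir2"], "magpurify_out")) would then process every bin in both directories. Directories that share the same name would collide in this layout.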

SyntaxError: Missing parentheses in call to 'print'. Did you mean print("\n## Computing mean genome-wide GC content")?

in tetra.py and gc.py

Add SNP density module

Flag contigs with outlier SNP density/nucleotide diversity

Conspecific module

Integrate MAGdecon module that flags contigs matching other species
Create default database

"lastal: can't open file: -P" When running "clade-markers"

Hi, I started the test workflow with MAGpurify but ran into an issue with the "clade-markers" step. I am using a conda environment with Python 2.7 and MAGpurify installed. I successfully ran "phylo-markers", but I get this error when running "clade-markers". It looks like the command contains two "-P 1" flag/argument pairs, and the second one confuses lastal. There is output in the module's folder, but the clean-bin results display "clade-markers: no output file found". This is also the first issue I've reported, so I appreciate any feedback to help me help you identify the issue. Thank you for your help.

magpurify clade-markers example/test.fna example/output

After the command runs a little bit I get an error message:

Performing pairwise alignment of genes against MetaPhlan2 db of clade-specific genes

Error encountered executing:
lastal -p BLOSUM62 -P 1 -f blasttab+ -m 10 MAGpurify-db-v1.0/clade-markers/markers.faa example/output/clade-markers/genes.faa -P 1 > example/output/clade-markers/genes.m8

Error message:
lastal: can't open file: -P
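A sketch of how such a command could be assembled so that each option appears once and comes before the two file arguments. The flags are copied from the command in the error above; the function name is hypothetical, and this is not MAGpurify's actual code.

```python
def build_lastal_command(db, query, threads=1):
    """Build a lastal argv list with every option placed before the file arguments."""
    options = [
        "-p", "BLOSUM62",
        "-P", str(threads),   # thread count, given exactly once
        "-f", "blasttab+",
        "-m", "10",
    ]
    return ["lastal", *options, str(db), str(query)]
```

The list can be passed to subprocess.run with stdout pointing at the genes.m8 file, which avoids building a shell string by concatenation.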

Conda package

Do you think you could make a conda package?

too many bam files

Hi!

Cool software, thanks!

Any suggestions for how to run the coverage module with, for example, over 15,000 bam files? I get the following message

OSError: [Errno 7] Argument list too long: '/bin/sh'

Best,
David
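One generic workaround for "Argument list too long" is to split the file list into batches whose combined length stays under a chosen limit. The sketch below only shows the batching; the function name and the 100,000-byte default are assumptions, and whether the coverage module's results can be combined across batches is not something this page answers.

```python
def batch_paths(paths, max_bytes=100_000):
    """Group paths so that each batch's joined length stays within max_bytes."""
    batches, current, size = [], [], 0
    for path in paths:
        cost = len(path.encode()) + 1  # +1 for the separating space
        if current and size + cost > max_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(path)
        size += cost
    if current:
        batches.append(current)
    return batches
```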

No module named 'utility'

Traceback (most recent call last):
File "/home/kiesers/scratch/Test_magpurify/MAGpurify/run_qc.py", line 39, in
from magpurify import csmg
File "/home/kiesers/scratch/Test_magpurify/MAGpurify/magpurify/csmg.py", line 4, in
import utility
ModuleNotFoundError: No module named 'utility'
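This error pattern is typical of Python 2 code run under Python 3: a bare "import utility" inside a package relied on implicit relative imports, which Python 3 no longer performs. The sketch below builds a throwaway package (all names hypothetical) to compare the old style with the explicit "from . import ..." form.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def compare_import_styles():
    """Build a temporary package and try both import styles under Python 3."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "demo_utility.py").write_text("VALUE = 42\n")
        # Python 2 style: implicit relative import of a sibling module.
        (pkg / "old_style.py").write_text("import demo_utility\n")
        # Python 3 style: explicit relative import.
        (pkg / "new_style.py").write_text(
            "from . import demo_utility\nVALUE = demo_utility.VALUE\n"
        )
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            try:
                importlib.import_module("demo_pkg.old_style")
                old_style_works = True
            except ModuleNotFoundError:
                old_style_works = False
            new_value = importlib.import_module("demo_pkg.new_style").VALUE
        finally:
            # Undo the changes made to the interpreter state.
            sys.path.remove(tmp)
            for name in [n for n in sys.modules if n.startswith("demo_pkg")]:
                del sys.modules[name]
    return old_style_works, new_value
```

Under Python 3 the old-style module fails to import while the explicit relative import succeeds, which suggests the same change (or "from magpurify import utility") for csmg.py.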

Error running phylo-markers

Dear developers:
I have found the following error when running the phylo-markers module. It happens for some MAGs but not for others. Any help will be appreciated.
Best,

magpurify phylo-markers results/DAS/H4_DASTool_bins/maxbin.097.fasta.contigs.fa temp/purify

• Calling genes with Prodigal
 all genes: temp/purify/phylo-markers/genes.[ffn|faa]
• Identifying PhyEco phylogenetic marker genes with HMMER
 hmm results: temp/purify/phylo-markers/phyeco.hmmsearch
 marker genes: temp/purify/phylo-markers/markers
• Performing pairwise BLAST alignment of marker genes against database
 blast results: temp/purify/phylo-markers/alns
• Finding taxonomic outliers
Traceback (most recent call last):
 File "/home/tamames/software/miniconda3/bin/magpurify", line 10, in <module>
   sys.exit(cli())
 File "/home/tamames/software/miniconda3/lib/python3.8/site-packages/magpurify/cli.py", line 116, in cli
   args["func"](args)
 File "/home/tamames/software/miniconda3/lib/python3.8/site-packages/magpurify/modules/phylo.py", line 419, in main
   flagged = flag_contigs(args["db"], args["tmp_dir"], args)
 File "/home/tamames/software/miniconda3/lib/python3.8/site-packages/magpurify/modules/phylo.py", line 372, in flag_contigs
   bin.genes[aln["qname"]].annotations.append(annotation)
KeyError: 'megahit_65877_14'

Interpretation of cutoff values

Hi,

Can you clarify what the cutoff values are for gc-content and tetra-freq and how these were established? My guess is that for gc-content the cutoff of 15.75 means that only contigs that deviate from the mean GC content by more than this value are flagged as contaminated. This seems like a very, very conservative value though (e.g., mean GC of 50% only flags contigs at <34.25% or >65.75%?).

I appreciate that the tetra-freq measure is more abstract, so I'm more interested in how the 0.06 default was established.

Thanks,
Donovan
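The arithmetic in the question can be written out as follows. This sketch encodes the interpretation guessed above (a contig is flagged when its GC content differs from the bin mean by more than the cutoff, in percentage points); it is not confirmed to be what MAGpurify does, and the function names are hypothetical.

```python
def gc_flag_bounds(mean_gc, cutoff=15.75):
    """Return the (lower, upper) GC% bounds outside of which a contig is flagged."""
    return mean_gc - cutoff, mean_gc + cutoff


def is_flagged(contig_gc, mean_gc, cutoff=15.75):
    """Flag a contig whose GC% deviates from the bin mean by more than the cutoff."""
    return abs(contig_gc - mean_gc) > cutoff
```

For a bin with a mean GC of 50%, gc_flag_bounds(50.0) gives (34.25, 65.75), matching the example in the question.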

Adding another genome to known-contam?

Hi,
How can I add another contaminant genome in the database?
The easiest way seems to be to add it to the known-contam folder, build the database file with BLAST+, and modify lines 91 and 105 of the file contam.py from for target in ["hg38", "phix"]: to for target in ["hg38", "phix", "newOne"]:.
Am I missing something?
Greg

Better instructions for downloading and unpacking the database

phylo-markers: no output file found...

When I run the command magpurify clean-bin ./metabat_bin_last/${line} ./magpurify_out ./magpurify_out_fna/${line}.fna, I encounter this problem:

Reading flagged contigs
 phylo-markers: no output file found
 clade-markers: no output file found
 conspecific: no output file found
 tetra-freq: no output file found
 gc-content: no output file found
 coverage: no output file found
 known-contam: no output file found

I don't know how to solve it.

lastal: not found

@Justan6 wrote:

I encountered the error below:
Error encountered executing:
lastal -p BLOSUM62 -P 1 -f blasttab+ -m 10 /home/nwezejus/MAGpurify-db-v1.0/clade-markers/markers.faa Magpurify_out/clade-markers/genes.faa -P 1 > Magpurify_out/clade-markers/genes.m8

Error message:
b'/bin/sh: 1: lastal: not found\n'

@Justan6 please make sure that lastal has been installed and can be called from the command line. If not, please use conda to install the software or install from https://gitlab.com/mcfrith/last
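A quick way to check whether a required executable can be found on the PATH before starting a run is Python's standard shutil.which. The helper name below is hypothetical.

```python
import shutil


def missing_tools(tools):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

For example, missing_tools(["lastal"]) returns ["lastal"] when LAST is not installed or not on the PATH, and an empty list otherwise.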

Add FPR cutoffs to modules

Modules: GC, TNF, depth
Expected % of contigs flagged incorrectly
How does FPR relate to contig length, mean value for bin, variance

add flag for max % of bin removed

tetra-freq unable to handle "N" nucleotide

The tetra-freq module crashes if there is an N nucleotide in the nucleotide sequence.
An example error message is:

File "/home/ubuntu/.local/lib/python3.6/site-packages/magpurify/modules/tetra.py", line 87, in main
contig.kmers[kmer_rev] += 1
KeyError: 'NTTC'

N nucleotides are very common in MAGs and draft genome assemblies, so this causes errors frequently, such as when working with the UHGG.

Deletion of N nucleotides will cause artificial adjacencies that will bias the tetra-nucleotide frequency profile. Random imputation would have similar bias. Ideally, any 4-mer with an N would just be ignored when constructing tetra-nucleotide frequency profiles.
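The "ignore any 4-mer with an N" idea can be sketched as below. This is an illustration only, not the code in tetra.py: the function name is hypothetical, and merging each 4-mer with its reverse complement under the lexicographically smaller string is an assumption about how the profile is built.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_tetramers(seq, k=4):
    """Count canonical k-mers, skipping every window with a non-ACGT character."""
    counts = Counter()
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # window contains N or another ambiguity code
        rev = kmer.translate(COMPLEMENT)[::-1]  # reverse complement
        counts[min(kmer, rev)] += 1
    return counts
```

Because windows are skipped rather than the N being deleted, no artificial adjacency is created: in "ACGTNTTCA" only ACGT and TTCA are counted, and the four windows spanning the N are ignored.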

Integrate mash with run_qc.py conspecific

Issue with building my reference database for conspecific module

Hello,

I have been trying to build my own reference database for the conspecific module using the command lines from the manual but I keep failing.

The error message that I got was:

...tching "/lustre04/scratch/djung09/MAGpurify/magpurify/MAGpurify-db-v1.0/wvu007.fna"
for reading.not open "/lustre04/scratch/djung09/MAGpurify/magpurify/MAGpurify-db-v1.0/wvu007.fna"

So I tried with the example genome files but it still fails:
Sketching example/ref_genomes/ERS473214_89.fna...
ERROR: could not open example/ref_genomes/ERS473214_89.fna for reading.

I have been using Compute Canada cloud and both mash and MAGpurify are installed and work fine. Could you help fix this problem? Thanks!

Question about --weighted-means and % bin removed

Hi, thanks for building and maintaining this tool!

I have a question about a section of Nayfach et al. 2019 as it relates to running MAGpurify. The paper says:
"In rare cases, these approaches may erroneously flag a large proportion of a MAG. To avoid this, we applied a particular approach to a MAG only if it resulted in ≤25% reduction in total length."

I was wondering if you could comment on the purpose of the ≤25% length reduction requirement for a MAGpurify module to be applied. It does not look to me like the clean-bin module turns off a given module based on length. Is the 25% rule somehow related to the new --weighted-means flag described on the Releases page? In other words, does --weighted-means attempt to avoid removing large chunks of bins that, without --weighted-means, could only have been rescued by turning off modules that remove >25% of the bin? Hopefully this makes sense. Thanks!

Best,
Bryan
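The rule quoted from the paper can be expressed as a post-processing step. The sketch below is one reading of that sentence, not MAGpurify's implementation; the function name and the input data structures are hypothetical.

```python
def modules_to_apply(contig_lengths, flagged_by_module, max_fraction=0.25):
    """Keep a module's flags only if they remove at most max_fraction of the bin length.

    contig_lengths: dict mapping contig id to length in bp
    flagged_by_module: dict mapping module name to an iterable of flagged contig ids
    """
    total = sum(contig_lengths.values())
    kept = {}
    for module, contigs in flagged_by_module.items():
        flagged = set(contigs)
        removed = sum(contig_lengths[c] for c in flagged)
        if total > 0 and removed / total <= max_fraction:
            kept[module] = flagged
    return kept
```

With contigs of 600, 300, and 100 bp, a module flagging only the 100 bp contig (10% of the bin) would be applied, while a module flagging the 600 bp contig (60%) would be skipped.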

Add arguments for selecting which modules to use for contamination removal

Suggestions based on trial use

Hi,

Thanks for your work on MAGpurify, I just tried it out on some example data and it seems to work quite well. I do have a few questions if you have time to answer them:

I am not sure I understand how the conspecific module will deal with "novel" sequences, e.g. contigs that are actually derived from a conspecific collection of species but are not present in the type strains available in the accompanying assemblies used to construct the Mash sketch that performs the initial tax assignment. Will these contigs be excluded as contaminants?
How are calls from multiple methods (tetra vs. gc vs. phylogenetic markers vs. conspecific etc) integrated?
Is it possible to modify the conspecific module to automatically decompress reference genomes if they are stored as individual archives (fasta1.fna.gz, etc.) prior to BLAST-ing? Mash can handle compressed input fine, so it would be good to save space this way, especially when running MAGpurify in a Docker container.
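Until something like this is supported, the decompression step could be done in a small wrapper before the conspecific module is called. A minimal sketch using only the standard library follows; the function name and the idea of a separate working directory are assumptions.

```python
import gzip
import shutil
from pathlib import Path


def ensure_uncompressed(path, work_dir):
    """Return a path to an uncompressed copy of `path`, decompressing .gz files."""
    path = Path(path)
    if path.suffix != ".gz":
        return path  # already uncompressed, use as-is
    target = Path(work_dir) / path.stem  # "fasta1.fna.gz" -> "fasta1.fna"
    with gzip.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the data, no full read into memory
    return target
```

The working directory can be a temporary directory that is deleted after the run, so the compressed archives remain the only permanent copy.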