snayfach / magpurify
Improvement of metagenome-assembled genomes
License: GNU General Public License v3.0
Hello everybody. I have a question about this software. I have 5 directories with at least 40 bins in each, and I need to run at least 4 modules. Do I have to run each module separately for every bin? Does the software not accept a directory and process all the bins in it at once?
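One way to batch this from outside the tool is sketched below. It assumes the per-bin invocation shown elsewhere on this page (`magpurify <module> <bin.fna> <output_dir>`), and it only builds the command lists. The function name, the set of FASTA extensions, and the output layout are illustrative choices, not part of MAGpurify.

```python
from pathlib import Path
from typing import List, Sequence

# Extensions treated as bin FASTA files (an assumption; adjust to your binner's output).
FASTA_SUFFIXES = {".fa", ".fna", ".fasta"}


def build_commands(bin_dirs: Sequence[str], modules: Sequence[str], out_root: str) -> List[List[str]]:
    """Return one `magpurify <module> <bin> <out_dir>` command per bin and module."""
    commands = []
    for bin_dir in bin_dirs:
        for fasta in sorted(Path(bin_dir).iterdir()):
            if fasta.suffix not in FASTA_SUFFIXES:
                continue
            # One output directory per bin so results from different bins do not collide.
            out_dir = Path(out_root) / Path(bin_dir).name / fasta.stem
            for module in modules:
                commands.append(["magpurify", module, str(fasta), str(out_dir)])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, or printed and handed to a job scheduler.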
in tetra.py and gc.py
Flag contigs with outlier SNP density/nucleotide diversity
Hi, I started the test workflow with MAGpurify but ran into an issue with the "clade-markers" step. I am using a conda environment with Python 2.7 and MAGpurify installed. I successfully ran "phylo-markers", but I get the error below when running "clade-markers". It looks like there are two "-P 1" flag-and-argument pairs confusing the command. There is output in the module's folder, but the clean-bin results display "clade-markers: no output file found". This is also the first issue I've reported, so I appreciate any feedback to help me help you identify the problem. Thank you for your help.
magpurify clade-markers example/test.fna example/output
After the command runs a little bit I get an error message:
Performing pairwise alignment of genes against MetaPhlan2 db of clade-specific genes
Error encountered executing:
lastal -p BLOSUM62 -P 1 -f blasttab+ -m 10 MAGpurify-db-v1.0/clade-markers/markers.faa example/output/clade-markers/genes.faa -P 1 > example/output/clade-markers/genes.m8
Error message:
lastal: can't open file: -P
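The error is consistent with lastal treating the trailing `-P` as an input file name. The sketch below shows one way to assemble the same argument list with every option given once, ahead of the positional inputs. It is an illustration only, not the project's actual fix; the function name is hypothetical and the option values are copied from the failing command above.

```python
from typing import List


def build_lastal_args(db: str, query: str, threads: int = 1) -> List[str]:
    """Assemble lastal arguments with each option given once, before the positional inputs."""
    options = ["-p", "BLOSUM62", "-P", str(threads), "-f", "blasttab+", "-m", "10"]
    # Positional arguments (database, then query) come last so no option trails them.
    return ["lastal", *options, db, query]
```

The list can be run with `subprocess.run(args, stdout=open_file, check=True)`, which replaces the shell redirection to `genes.m8`.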
Do you think you could make a conda package?
Hi!
Cool software, thanks!
Any suggestions for how to run the coverage module with, for example, over 15,000 BAM files? I get the following message:
OSError: [Errno 7] Argument list too long: '/bin/sh'
Best,
David
Traceback (most recent call last):
File "/home/kiesers/scratch/Test_magpurify/MAGpurify/run_qc.py", line 39, in <module>
from magpurify import csmg
File "/home/kiesers/scratch/Test_magpurify/MAGpurify/magpurify/csmg.py", line 4, in <module>
import utility
ModuleNotFoundError: No module named 'utility'
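This traceback is what one would expect when code written with Python 2 style implicit relative imports (`import utility` inside a package) is run under Python 3, which requires the explicit form. The sketch below builds a tiny throwaway package to show the explicit relative import working. The package name `demo_pkg` and its contents are hypothetical and only mirror the module names in the traceback.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Write a tiny hypothetical package whose module uses an explicit relative import."""
    pkg = root / "demo_pkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "utility.py").write_text("def greet():\n    return 'utility loaded'\n")
    # Python 2 accepted a bare 'import utility' here; Python 3 needs the explicit relative form.
    (pkg / "csmg.py").write_text("from . import utility\n\nMESSAGE = utility.greet()\n")


def load_demo_message() -> str:
    """Import demo_pkg.csmg from a temporary directory and return its MESSAGE."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_package(Path(tmp))
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            return importlib.import_module("demo_pkg.csmg").MESSAGE
        finally:
            # Clean up so the temporary package does not linger in the interpreter.
            sys.path.remove(tmp)
            for name in [n for n in sys.modules if n == "demo_pkg" or n.startswith("demo_pkg.")]:
                del sys.modules[name]
```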
Dear developers:
I have found the following error when running the phylo-markers options. It happens for some MAGs, not for others. Any help will be appreciated.
Best,
magpurify phylo-markers results/DAS/H4_DASTool_bins/maxbin.097.fasta.contigs.fa temp/purify
• Calling genes with Prodigal
all genes: temp/purify/phylo-markers/genes.[ffn|faa]
• Identifying PhyEco phylogenetic marker genes with HMMER
hmm results: temp/purify/phylo-markers/phyeco.hmmsearch
marker genes: temp/purify/phylo-markers/markers
• Performing pairwise BLAST alignment of marker genes against database
blast results: temp/purify/phylo-markers/alns
• Finding taxonomic outliers
Traceback (most recent call last):
File "/home/tamames/software/miniconda3/bin/magpurify", line 10, in <module>
sys.exit(cli())
File "/home/tamames/software/miniconda3/lib/python3.8/site-packages/magpurify/cli.py", line 116, in cli
args["func"](args)
File "/home/tamames/software/miniconda3/lib/python3.8/site-packages/magpurify/modules/phylo.py", line 419, in main
flagged = flag_contigs(args["db"], args["tmp_dir"], args)
File "/home/tamames/software/miniconda3/lib/python3.8/site-packages/magpurify/modules/phylo.py", line 372, in flag_contigs
bin.genes[aln["qname"]].annotations.append(annotation)
KeyError: 'megahit_65877_14'
Hi,
Can you clarify what the cutoff values are for gc-content and tetra-freq and how these were established? My guess is that for gc-content the cutoff of 15.75 means that only contigs that deviate from the mean GC content by more than this value are flagged as contaminated. This seems like a very, very conservative value though (e.g., mean GC of 50% only flags contigs at <34.25% or >65.75%?).
I appreciate that the tetra-freq measure is more abstract, so I'm more interested in how the 0.06 default was established.
Thanks,
Donovan
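The interpretation guessed at in this question can be written out as a small sketch: flag a contig when its GC percentage differs from the bin mean by more than the cutoff. This encodes the poster's reading of the 15.75 default, not confirmed MAGpurify behavior. The unweighted mean and both function names are assumptions made for illustration.

```python
from typing import Dict, List


def gc_percent(seq: str) -> float:
    """GC content as a percentage of unambiguous (A/C/G/T) bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def flag_gc_outliers(contigs: Dict[str, str], cutoff: float = 15.75) -> List[str]:
    """Flag contigs whose GC% differs from the bin's mean GC% by more than `cutoff`."""
    if not contigs:
        return []
    values = {name: gc_percent(seq) for name, seq in contigs.items()}
    # Unweighted mean across contigs (an assumption; a length-weighted mean is another option).
    mean_gc = sum(values.values()) / len(values)
    return [name for name, gc in values.items() if abs(gc - mean_gc) > cutoff]
```

With a bin mean of 50%, this rule flags only contigs below 34.25% or above 65.75% GC, which matches the arithmetic in the question.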
Hi,
How can I add another contaminant genome to the database?
The easiest way seems to be to add it to the known-contam folder, build the database file with BLAST+, and modify lines 91 and 105 of contam.py
from for target in ["hg38", "phix"]:
to for target in ["hg38", "phix", "newOne"]:
Am I missing something?
Greg
When I run this command:
magpurify clean-bin ./metabat_bin_last/${line} ./magpurify_out ./magpurify_out_fna/${line}.fna
I encounter this problem:
Reading flagged contigs
phylo-markers: no output file found
clade-markers: no output file found
conspecific: no output file found
tetra-freq: no output file found
gc-content: no output file found
coverage: no output file found
known-contam: no output file found
I don't know how to solve it.
@Justan6 wrote:
I encountered the error below:
Error encountered executing:
lastal -p BLOSUM62 -P 1 -f blasttab+ -m 10 /home/nwezejus/MAGpurify-db-v1.0/clade-markers/markers.faa Magpurify_out/clade-markers/genes.faa -P 1 > Magpurify_out/clade-markers/genes.m8
Error message:
b'/bin/sh: 1: lastal: not found\n'
@Justan6 please make sure that lastal has been installed and can be called from the command line. If not, please use conda to install the software or install it from https://gitlab.com/mcfrith/last
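A quick way to check this before launching a run is to look each executable up on the PATH. The sketch below uses the standard library's `shutil.which`. The function name is hypothetical, and the list of executables you pass in is up to you.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the executables from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

For example, `missing_tools(["lastal", "prodigal", "hmmsearch", "mash"])` returns an empty list when all four can be called from the command line. That particular list is an assumption based on the tools mentioned in these issues.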
The tetra-freq module crashes if there is an N nucleotide in the nucleotide sequence.
An example error message is:
File "/home/ubuntu/.local/lib/python3.6/site-packages/magpurify/modules/tetra.py", line 87, in main
contig.kmers[kmer_rev] += 1
KeyError: 'NTTC'
N nucleotides are very common in MAGs and draft genome assemblies, so this causes errors frequently, such as when working with the UHGG.
Deleting N nucleotides would create artificial adjacencies that bias the tetra-nucleotide frequency profile. Random imputation would introduce a similar bias. Ideally, any 4-mer containing an N would simply be ignored when constructing tetra-nucleotide frequency profiles.
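The behavior requested here can be sketched as follows: slide a window of four bases along the sequence and skip any window containing a character other than A, C, G, or T. This is an illustration of the idea, not the code in tetra.py. Merging each 4-mer with its reverse complement into one canonical key is an assumption about how strands are handled.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
VALID_BASES = set("ACGT")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rev = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rev)


def tetra_counts(seq: str) -> Counter:
    """Count canonical 4-mers, ignoring any window that contains N or another ambiguity code."""
    # Start every canonical 4-mer at zero so all profiles share the same keys.
    counts = Counter({canonical("".join(p)): 0 for p in product("ACGT", repeat=4)})
    seq = seq.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if set(kmer) <= VALID_BASES:
            counts[canonical(kmer)] += 1
    return counts
```

Because windows that overlap an N are skipped instead of being stitched together, no artificial adjacency enters the profile.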
Hello,
I have been trying to build my own reference database for the conspecific module using the command lines from the manual but I keep failing.
The error message that I got was:
...tching "/lustre04/scratch/djung09/MAGpurify/magpurify/MAGpurify-db-v1.0/wvu007.fna"
for reading.not open "/lustre04/scratch/djung09/MAGpurify/magpurify/MAGpurify-db-v1.0/wvu007.fna"
So I tried with the example genome files but it still fails:
Sketching example/ref_genomes/ERS473214_89.fna...
ERROR: could not open example/ref_genomes/ERS473214_89.fna for reading.
I have been using Compute Canada cloud and both mash and MAGpurify are installed and work fine. Could you help fix this problem? Thanks!
Hi, thanks for building and maintaining this tool!
I have a question about a section of Nayfach et al. 2019 as it relates to running MAGpurify. The paper says:
"In rare cases, these approaches may erroneously flag a large proportion of a MAG. To avoid this, we applied a particular approach to a MAG only if it resulted in ≤25% reduction in total length."
I was wondering if you could comment on the purpose of the ≤25% length reduction requirement for a MAGpurify module to be run. It does not look to me like the clean-bin module turns off a given module based on length. Is the 25% rule somehow related to the new --weighted-means flag described on the Releases page? In other words, does using --weighted-means attempt to avoid removing large chunks of bins which, without --weighted-means, would only have been rescued by turning off modules that remove >25% of the bin? Hopefully this makes sense. Thanks!
Best,
Bryan
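The rule quoted from the paper can be sketched as a post-processing filter: accept a module's flags only if removing them shrinks the MAG by at most 25% of its total length. This follows the quoted sentence only. As the question notes, clean-bin does not appear to do this, so the function and its inputs are hypothetical.

```python
from typing import Dict, List, Set


def modules_to_apply(
    contig_lengths: Dict[str, int],
    flagged_by_module: Dict[str, Set[str]],
    max_fraction: float = 0.25,
) -> List[str]:
    """Keep modules whose flagged contigs total at most `max_fraction` of the MAG length."""
    total = sum(contig_lengths.values())
    kept = []
    for module, flagged in flagged_by_module.items():
        removed = sum(contig_lengths[contig] for contig in flagged)
        # A module that would remove more than the allowed fraction is skipped entirely.
        if total and removed / total <= max_fraction:
            kept.append(module)
    return kept
```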
Hi,
Thanks for your work on MAGpurify, I just tried it out on some example data and it seems to work quite well. I do have a few questions if you have time to answer them:
I am not sure I understand how the conspecific module will deal with "novel" sequences, e.g. contigs that are actually derived from a conspecific collection of species but are not present in the type strains available in the accompanying assemblies used to construct the Mash sketch that performs the initial tax assignment. Will these contigs be excluded as contaminants?
How are calls from multiple methods (tetra vs. gc vs. phylogenetic markers vs. conspecific etc) integrated?
Is it possible to modify the conspecific module to automatically decompress reference genomes if stored as individual archives (fasta1.fna.gz, etc) prior to BLAST-ing? Mash can handle compressed input fine, so would be good to save space this way especially when running MAGpurify in a docker container.
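Until something like this is supported, the decompression step in the last question can be done outside the tool. The sketch below unpacks one gzip-compressed reference into a working directory using only the standard library. The function name and the directory layout are assumptions, not MAGpurify options.

```python
import gzip
import shutil
from pathlib import Path


def decompress_reference(gz_path: str, out_dir: str) -> Path:
    """Decompress `name.fna.gz` into `out_dir/name.fna` and return the new path."""
    src = Path(gz_path)
    if src.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src.name}")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    dest = Path(out_dir) / src.name[: -len(".gz")]
    # Stream the data so large genomes are never held fully in memory.
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

Writing into a temporary directory and deleting it after the run keeps the container image small.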