csmiller / emirge

EMIRGE reconstructs full length ribosomal genes from short read sequencing data.

Python 94.61% C 5.39%

emirge's Issues

Error IndexError: index 1399 out of bounds 0<=index<0

I have come across an error while attempting to run EMIRGE (listed below). Does anyone know the possible cause? Thanks!

Traceback (most recent call last):
  File "/work/kgwin1/packages/python/bin/emirge.py", line 1616, in <module>
    main()
  File "/work/kgwin1/packages/python/bin/emirge.py", line 1609, in main
    do_iterations(em, max_iter = options.iterations, save_every = options.save_every)
  File "/work/kgwin1/packages/python/bin/emirge.py", line 1348, in do_iterations
    em.do_iteration(em.current_bam_filename, em.current_reference_fasta_filename)
  File "/work/kgwin1/packages/python/bin/emirge.py", line 432, in do_iteration
    self.calc_likelihoods()
  File "/work/kgwin1/packages/python/bin/emirge.py", line 978, in calc_likelihoods
    self.calc_probN() # (handles initial iteration differently within this method)
  File "/work/kgwin1/packages/python/bin/emirge.py", line 1139, in calc_probN
    bases = numpy.array(self.fastafile.fetch(fastaname), dtype='c')[zero_indices[0]]
IndexError: index 1399 out of bounds 0<=index<0

Trouble resuming from completed iteration

Hi,
I've been having some difficulties for a while resuming Emirge runs from a completed iteration.
All my last attempts resulted in the same errors. See below:

If you use EMIRGE in your work, please cite these manuscripts, as appropriate.

Miller CS, Baker BJ, Thomas BC, Singer SW, Banfield JF (2011)
EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data.
Genome biology 12: R44. doi:10.1186/gb-2011-12-5-r44.

Miller CS, Handley KM, Wrighton KC, Frischkorn KR, Thomas BC, Banfield JF (2013)
Short-Read Assembly of Full-Length 16S Amplicons Reveals Bacterial Diversity in Subsurface Sediments.
PloS one 8: e56018. doi:10.1371/journal.pone.0056018.

imported _emirge C functions from: /home/pierre.pericard/anaconda3/envs/py27/lib/python2.7/site-packages/_emirge.so
Command:
/home/pierre.pericard/anaconda3/envs/py27/bin/emirge.py emirge_outdir -1 /workdir/pierre.pericard/paper/16S_rRNA/human_microbiome_project/SRS049896/SRS049896.denovo_duplicates_marked.trimmed.1.fastq -2 /workdir/pierre.pericard/paper/16S_rRNA/human_microbiome_project/SRS049896/SRS049896.denovo_duplicates_marked.trimmed.2.fastq -f /workdir/pierre.pericard/paper/16S_rRNA/ref_db/emirge_default_db/SILVA_128_SSURef_Nr99_tax_silva_trunc.ge1200bp.le2000bp.0.97.fixed.fasta -b /workdir/pierre.pericard/paper/16S_rRNA/ref_db/emirge_default_db/SILVA_128_SSURef_Nr99_tax_silva_trunc.ge1200bp.le2000bp.0.97.fixed -l 101 -i 150 -s 50 -j 1 -p 0.001 -n 100 -r 9 -a 16 --phred33

EMIRGE started at Fri Jan 13 11:31:25 2017
Resuming EMIRGE from iteration 09 at Fri Jan 13 11:31:25 2017 ...
Starting from information in directory:
/workdir/pierre.pericard/paper/16S_rRNA/human_microbiome_project/SRS049896/emirge/raw_default_db_j1_p0.001/emirge_outdir/iter.09
DONE with resume initialization at Fri Jan 13 11:31:25 2017...
Starting iteration 9 at Fri Jan 13 11:31:25 2017...
Reading bam file /workdir/pierre.pericard/paper/16S_rRNA/human_microbiome_project/SRS049896/emirge/raw_default_db_j1_p0.001/emirge_outdir/iter.08/bowtie.iter.08.PE.bam at Fri Jan 13 11:31:25 2017...
DONE Reading bam file /workdir/pierre.pericard/paper/16S_rRNA/human_microbiome_project/SRS049896/emirge/raw_default_db_j1_p0.001/emirge_outdir/iter.08/bowtie.iter.08.PE.bam at Fri Jan 13 11:32:32 2017 [0:01:07.117363]...
Calculating likelihood (13671, 378106) for iteration 9 at Fri Jan 13 11:32:35 2017...
Calculating Pr(N=n) for iteration 9 at Fri Jan 13 11:32:35 2017...
Loading probN for resume case from /workdir/pierre.pericard/paper/16S_rRNA/human_microbiome_project/SRS049896/emirge/raw_default_db_j1_p0.001/emirge_outdir/iter.09/probN.pkl
DONE calculating Pr(N=n) for iteration 9 at Fri Jan 13 11:32:37 2017 [0:00:02.734468]...
DONE Calculating likelihood for iteration 9 at Fri Jan 13 11:34:30 2017 [0:01:55.514443]...
Calculating posteriors for iteration 9 at Fri Jan 13 11:34:30 2017...
/home/pierre.pericard/anaconda3/envs/py27/bin/emirge.py:1180: RuntimeWarning: invalid value encountered in divide
self.posteriors[-1].data = self.posteriors[-1].data / denom[(self.posteriors[-1].col,)] # index out denom with column indices from coo format.
DONE Calculating posteriors for iteration 9 at Fri Jan 13 11:34:34 2017 [3.878 seconds]...
Writing consensus for iteration 9 at Fri Jan 13 11:34:34 2017...
snp_minor_prob_thresh = 0.100
snp_percentage_thresh = 0.001
Traceback (most recent call last):
  File "/home/pierre.pericard/anaconda3/envs/py27/bin/emirge.py", line 1697, in <module>
    main()
  File "/home/pierre.pericard/anaconda3/envs/py27/bin/emirge.py", line 1688, in main
    do_iterations(em, max_iter = options.iterations, save_every = options.save_every)
  File "/home/pierre.pericard/anaconda3/envs/py27/bin/emirge.py", line 1444, in do_iterations
    os.path.join(subdir, "iter.%02d.cons.fasta"%(em.iteration_i)))
  File "/home/pierre.pericard/anaconda3/envs/py27/bin/emirge.py", line 499, in do_iteration
    self.write_consensus(consensus_filename) # culls and splits
  File "/home/pierre.pericard/anaconda3/envs/py27/bin/emirge.py", line 590, in write_consensus
    if self.min_depth is not None and self.coverage[seq_i] < self.min_depth: # could also do this only after self.iteration_i > 5 or something
IndexError: list index out of range
Command exited with non-zero status 1
Command being timed: "emirge.py emirge_outdir -1 /workdir/pierre.pericard/paper/16S_rRNA/human_microbiome_project/SRS049896/SRS049896.denovo_duplicates_marked.trimmed.1.fastq -2 /workdir/pierre.pericard/paper/16S_rRNA/human_microbiome_project/SRS049896/SRS049896.denovo_duplicates_marked.trimmed.2.fastq -f /workdir/pierre.pericard/paper/16S_rRNA/ref_db/emirge_default_db/SILVA_128_SSURef_Nr99_tax_silva_trunc.ge1200bp.le2000bp.0.97.fixed.fasta -b /workdir/pierre.pericard/paper/16S_rRNA/ref_db/emirge_default_db/SILVA_128_SSURef_Nr99_tax_silva_trunc.ge1200bp.le2000bp.0.97.fixed -l 101 -i 150 -s 50 -j 1 -p 0.001 -n 100 -r 9 -a 16 --phred33"
User time (seconds): 185.75
System time (seconds): 6.40
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:13.40
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 8311924
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 62
Minor (reclaiming a frame) page faults: 5182494
Voluntary context switches: 9793
Involuntary context switches: 765
Swaps: 0
File system inputs: 798784
File system outputs: 140032
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 1

Crash in Emirge amplicon

Rewrites the reads then dies

DONE Rewriting reads with indexes in headers at Thu Apr 18 00:12:57 2013 [0:01:21.308602]...
Number of reads (or read pairs) in input file(s): 3017638
Preallocating reads and quals in memory at Thu Apr 18 00:12:57 2013...
Traceback (most recent call last):
  File "/srv/whitlam/bio/apps/12.04/sw//emirge/0.60/bin/emirge_amplicon.py", line 1543, in <module>
    main()
  File "/srv/whitlam/bio/apps/12.04/sw//emirge/0.60/bin/emirge_amplicon.py", line 1509, in main
    rewrite_reads = not options.no_rewrite_reads)
  File "/srv/whitlam/bio/apps/12.04/sw//emirge/0.60/bin/emirge_amplicon.py", line 221, in __init__
    _emirge.populate_reads_arrays(self)
  File "_emirge_amplicon.pyx", line 525, in _emirge_amplicon.populate_reads_arrays (_emirge_amplicon.c:6577)
IndexError: Out of bounds on buffer access (axis 2)

Use setuptools to simplify installation

Switch to importing setuptools instead of distutils.core and add

install_requires=[
"BioPython",
"Cython",
"pysam",
"scipy"
],

to the setup() call. The installer will then automatically ensure that these dependencies are present on the machine as part of the install.

http://stackoverflow.com/questions/9810603/adding-install-requires-to-setup-py-when-making-a-python-package

reduce memory usage via sparse matrices

Many of EMIRGE's largest data structures could be converted to sparse matrices, which should reduce memory usage substantially.

technical difference between emirge and emirge_amplicon

Hi,
We have a project that aims to reconstruct ribosomal genes from a meta-RNA dataset. I know emirge_amplicon can handle up to a few million reads, while emirge is used for the normal case, but your previous paper does not describe this in much detail. I have some questions about these two versions:

What has been changed in emirge_amplicon to make it suitable for large datasets?
Apart from running time, how does the performance of the two versions differ when they are run on the same dataset?

Thanks :)
Best regards

incompatibility of emirge_rename_fasta.py with new version of biopython?

Recently received an error when running the emirge_rename_fasta.py script.

$ emirge_rename_fasta.py iter.40 > iter.40.cons.rn.fasta
Traceback (most recent call last):
  File "/home/micro/anaconda2/bin/emirge_rename_fasta.py", line 164, in <module>
    main()
  File "/home/micro/anaconda2/bin/emirge_rename_fasta.py", line 159, in main
    rename(wd, options.prob_min, options.record_prefix, options.no_N, options.no_trim_N)
  File "/home/micro/anaconda2/bin/emirge_rename_fasta.py", line 123, in rename
    for prior, record in sorted(sorted_records, reverse=True):
  File "/home/micro/anaconda2/lib/python2.7/site-packages/Bio/SeqRecord.py", line 720, in __eq__
    raise NotImplementedError(_NO_SEQRECORD_COMPARISON)
NotImplementedError: SeqRecord comparison is deliberately not implemented. Explicitly compare the attributes of interest.
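
One possible workaround, sketched under assumptions: the traceback suggests that sorting (prior, record) tuples falls back to comparing the records whenever two priors are equal, which newer Biopython refuses to do. Sorting with an explicit key avoids touching the records at all. `FakeRecord` and `sort_by_prior` below are hypothetical stand-ins for illustration, not code from emirge_rename_fasta.py.

```python
class FakeRecord(object):
    """Stand-in for Bio.SeqRecord.SeqRecord: refuses comparison the same way."""

    def __init__(self, name):
        self.id = name

    def __eq__(self, other):
        raise NotImplementedError("SeqRecord comparison is deliberately not implemented.")

    def __lt__(self, other):
        raise NotImplementedError("SeqRecord comparison is deliberately not implemented.")

    __hash__ = object.__hash__  # keep instances hashable despite the custom __eq__


def sort_by_prior(pairs):
    """Sort (prior, record) pairs by descending prior without comparing records."""
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Two records share the prior 0.2, which is the case that triggers the error
# when the tuples themselves are compared.
pairs = [(0.2, FakeRecord("a")), (0.5, FakeRecord("b")), (0.2, FakeRecord("c"))]
ordered_ids = [record.id for _, record in sort_by_prior(pairs)]
print(ordered_ids)
```

In the script itself, the equivalent change would presumably be to pass `key=lambda pair: pair[0]` to the `sorted(...)` call shown on line 123 of the traceback. Records with tied priors then keep their input order.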

EMIRGE dependencies

Hi,

Is EMIRGE still maintained and is it being tested with newer versions of usearch, bowtie and samtools as well as the required python packages? I'm trying to install it centrally on our cluster but hitting segmentation faults:

/tmp/1453995841.5729613: line 8: 19013 Segmentation fault emirge.py /lustre/scratch108/pathogen/maa/emirge/ -1 18512_8#4_1.fastq.gz -f ../silva/filtered_SILVA_123_SSURef_Nr99_tax_silva.fasta -b ../silva/filtered_SILVA_123_SSURef_Nr99_tax_silva.fasta -l 75 --phred33

Thanks,

Martin

Candidate db file dead link

The link to download the candidate db file doesn't work anymore.
https://googledrive.com/host/0B7hz7JVEE15dbUtkRmxKVlhtd1U/SSURef_111_candidate_db.fasta.gz

Could you update it, please?

Can EMIRGE work with FASTA files?

I'd like to use pre-trimmed FASTA files with EMIRGE. I believe this is supported in Bowtie. Is it possible to add this feature?

emirge_makedb.py: please add an argument-option for providing pre-downloaded SILVA-db

Apparently, running emirge_makedb.py from behind a proxy-server can be problematic.

The easiest workaround would be to let users download the databases from SILVA manually and pass them to emirge_makedb.py as an argument.

EMIRGE should work with single-end data

1. Write a new bowtie wrapper.
2. Adjust some of the data structures to maintain a zero-length paired read?

Inordinate memory usage

Hey there,

I'm using EMIRGE as a preliminary contamination screen for single-cell genomes. Unfortunately, EMIRGE is exhausting the RAM on our compute server (264Gb RAM, running RHEL). I have the 64-bit version of usearch (v8.1) so it should be able to utilize all available memory.

Here is the output:

[fai_load] build FASTA index.
[fai_load] build FASTA index.
usearch command was:
usearch -usearch_global /home/cmorganlang/Hallam_projects/OMZs/ProcessedData/EMIRGE_16S_prediction/minDepth10_outputs/IX0866_D1CP6ACXX_8_AACCCC/iter.05/iter.05.cons.fasta.tmp.fasta --db /home/cmorganlang/Hallam_projects/OMZs/ProcessedData/EMIRGE_16S_prediction/minDepth10_outputs/IX0866_D1CP6ACXX_8_AACCCC/iter.05/iter.05.cons.fasta.tmp.fasta --id 0.800 -quicksort -query_cov 0.5 -target_cov 0.5 -strand plus --userout /home/cmorganlang/Hallam_projects/OMZs/ProcessedData/EMIRGE_16S_prediction/minDepth10_outputs/IX0866_D1CP6ACXX_8_AACCCC/iter.05/iter.05.cons.fasta.tmp.fasta.us.txt --userfields query+target+id+caln+qlo+qhi+tlo+thi -threads 4 --maxaccepts 8 --maxrejects 256
Traceback (most recent call last):
  File "/usr/bin/emirge.py", line 4, in <module>
    __import__('pkg_resources').run_script('EMIRGE==0.60.3', 'emirge.py')
  File "build/bdist.linux-x86_64/egg/pkg_resources/__init__.py", line 726, in run_script
  File "build/bdist.linux-x86_64/egg/pkg_resources/__init__.py", line 1491, in run_script
  File "/usr/lib64/python2.7/site-packages/EMIRGE-0.60.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/emirge.py", line 1697, in <module>

  File "/usr/lib64/python2.7/site-packages/EMIRGE-0.60.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/emirge.py", line 1688, in main

  File "/usr/lib64/python2.7/site-packages/EMIRGE-0.60.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/emirge.py", line 1444, in do_iterations

  File "/usr/lib64/python2.7/site-packages/EMIRGE-0.60.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/emirge.py", line 500, in do_iteration

  File "/usr/lib64/python2.7/site-packages/EMIRGE-0.60.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/emirge.py", line 807, in cluster_sequences

  File "/usr/lib64/python2.7/site-packages/EMIRGE-0.60.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/emirge.py", line 871, in cluster_sequences2

  File "/usr/lib64/python2.7/subprocess.py", line 537, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib64/python2.7/subprocess.py", line 524, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1224, in _execute_child
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

There are ~50 million reads (HiSeq) per SAG. iter.05.cons.fasta contained 2123 sequences.

Let me know if you need more information. Thanks!

EMIRGE should exit more gracefully if 0 sequences are reconstructed

If EMIRGE cannot reconstruct any sequences due to low read coverage, it currently gets to the clustering stage where usearch throws a segfault and emirge exits without an informative error message.
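
A minimal sketch of the kind of guard this would need, assuming the consensus FASTA is available just before the clustering step. The function names and the wording of the message are hypothetical and not taken from emirge.py.

```python
import sys


def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def check_sequences_remaining(fasta_lines, iteration):
    """Return an error message if no sequences are left, else None."""
    if count_fasta_records(fasta_lines) == 0:
        return ("No sequences reconstructed at iteration %d; "
                "read coverage may be too low. Stopping before clustering."
                % iteration)
    return None


# Usage with in-memory lines; a real caller would pass an open file handle.
message = check_sequences_remaining([], iteration=5)
if message is not None:
    sys.stderr.write(message + "\n")
```

A caller could pass the message to `sys.exit()` so the run stops with a non-zero status before usearch is ever invoked.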

got errors when running emirge_amplicon.py

Hi,
I installed EMIRGE following the instructions. The original 'emirge.py' seems to work well, but the new version 'emirge_amplicon.py', which is meant for large datasets, fails with an error:

$ ./emirge_amplicon.py
Traceback (most recent call last):
  File "./emirge_amplicon.py", line 61, in <module>
    import _emirge_amplicon as _emirge
  File "build/bdist.linux-x86_64/egg/_emirge_amplicon.py", line 7, in <module>
  File "build/bdist.linux-x86_64/egg/_emirge_amplicon.py", line 6, in __bootstrap__
ImportError: /Home/ii/yaxinx/.python-eggs/EMIRGE-0.60.2a6-py2.7-linux-x86_64.egg-tmp/emirge_amplicon.so: undefined symbol: gzread

Anyone know how to fix it? Thanks very much.

IndexError: Out of bounds on buffer access (axis 1)

Hi there,
I'm running Emirge on a dataset that keeps getting to iteration 12 and then failing (see below). I saw someone posted about this error previously, and that it had to do with read length, but the comment below said this should no longer be an issue.

Any ideas?

I've run Emirge successfully on other datasets recently, so I'm not sure why this one is a problem.

Thanks!
-Roxanne-

DONE Reading bam file /n/girguis_lab/Users/rbeinart/Emirge/BC_metatranscriptome_Emirge_output/iter.12/bowtie.iter.12.PE.bam at Tue Aug 23 16:21:28 2016 [1:11:31.149570]...
Calculating likelihood (25714, 8320954) for iteration 13 at Tue Aug 23 16:22:47 2016...
Calculating Pr(N=n) for iteration 13 at Tue Aug 23 16:22:47 2016...
Loading probN for resume case from /n/girguis_lab/Users/rbeinart/Emirge/BC_metatranscriptome_Emirge_output/iter.13/probN.pkl
DONE calculating Pr(N=n) for iteration 13 at Tue Aug 23 16:22:54 2016 [0:00:06.533856]...
Traceback (most recent call last):
  File "/n/girguis_lab/Users/rbeinart/Apps/EMIRGE-master/emirge.py", line 1697, in <module>
    main()
  File "/n/girguis_lab/Users/rbeinart/Apps/EMIRGE-master/emirge.py", line 1688, in main
    do_iterations(em, max_iter = options.iterations, save_every = options.save_every)
  File "/n/girguis_lab/Users/rbeinart/Apps/EMIRGE-master/emirge.py", line 1444, in do_iterations
    os.path.join(subdir, "iter.%02d.cons.fasta"%(em.iteration_i)))
  File "/n/girguis_lab/Users/rbeinart/Apps/EMIRGE-master/emirge.py", line 491, in do_iteration
    self.calc_likelihoods()
  File "/n/girguis_lab/Users/rbeinart/Apps/EMIRGE-master/emirge.py", line 1141, in calc_likelihoods
    lik_data)
  File "_emirge.pyx", line 132, in _emirge._calc_likelihood (_emirge.c:2661)
    s += ( qual2one_minus_p[qualints[i]] * probN_single[pos + i, j] ) # lookup table
IndexError: Out of bounds on buffer access (axis 1)

emirge_makedb.py add regex to allow for Silva `SSURef_N[Rr]99_tax...`

Hi there,

I'm new to using EMIRGE, and not sure if this is still supported, but running emirge_makedb.py I ran into the issue that it was by default expecting the SILVA file to be ...SSURef_Nr99_tax..., whereas in the current version it is ...SSURef_NR99_tax....

I modified the code to simply search for NR, and it appears to have run as expected, but perhaps it could be good to add a regex in there to search for either NR or Nr.

Cheers,
Mike.
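
The regex suggested above could look roughly like this. It is a sketch: the pattern and the example filenames are illustrative and are not copied from emirge_makedb.py.

```python
import re

# Accept both the older "Nr99" and the newer "NR99" spelling.
# The surrounding filename pieces are illustrative assumptions.
SILVA_NR99_PATTERN = re.compile(r"SSURef_N[Rr]99_tax_silva")


def is_nr99_file(filename):
    """True if the filename looks like a SILVA SSURef NR99 taxonomy export."""
    return SILVA_NR99_PATTERN.search(filename) is not None


print(is_nr99_file("SILVA_128_SSURef_Nr99_tax_silva_trunc.fasta.gz"))
print(is_nr99_file("SILVA_132_SSURef_NR99_tax_silva_trunc.fasta.gz"))
```

A character class keeps the match strict. Making the whole pattern case-insensitive would also work, but would accept spellings SILVA has never used.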

move to usearch v4

EMIRGE currently uses uclust v3, which is outdated.
Need to move to usearch v4. Should just involve changing the command line subprocess call and testing.

Bowtie 1.2 breaks `emirge`

Hello.

I know emirge has only been tested on versions 0.12* or so of bowtie. But, we had no issues running it up to version 1.1.2 of bowtie. A recent upgrade to bowtie 1.2, however, broke emirge.

The command generated at do_initial_mapping does not seem to work. We get an error saying the file does not appear to be a FASTQ file.

Running latest build of emirge (0.61.0), python 2.7.13, and latest versions of numpy, scipy, biopython, and pysam.

multithreading the non-mapping calculations

Right now the only thing using multiple threads/CPUs is the mapping stage.

Several stages could probably be multithreaded:

reading of bam files
calculation of likelihood
calculation of Pr(N)
calculation of posterior
splitting of sequences

EMIRGE exits without an error message after completing iteration 0.

Hello, I am running emirge on a PE dataset with about 3 million paired end reads. The command I used is:
time emirge.py ../011-Emirge -1 BEI_16SCap_1P.fastq.gz -f ~/Database/EMIRGE/SILVA_128_SSURef_Nr99_tax_silva_trunc.ge1200bp.le2000bp.0.97.fixed.fasta -b ~/Database/EMIRGE/SILVA_128_SSURef_Nr99_tax_silva_trunc.ge1200bp.le2000bp.0.97.fixed -l 151 -2 BEI_16SCap_2P.fastq -i 375 -s 60 -a 2 --phred33 2>&1 | tee emirge_bei_20170223_2.log

I am running it in an Ubuntu virtual machine with 8 GB of RAM and 4 cores.

The program exits without an error message about 25 minutes after starting iteration 1, while still at the step [fai_load] build FASTA index. The complete output is attached. The iter.1 folder is empty, so I cannot resume the iteration. I did get EMIRGE working last May on the clusters my university provided.

Thank you very much!
emirge_bei_20170223_2.txt

EMIRGE should work with SOLiD / colorspace reads

Requires adjustment of internal mapping (easy) and understanding any differences with reported quality values (maybe harder).

EMIRGE compatible with bowtie v.1+ ?

Your dependency list names bowtie v0.12.7 or v0.12.8. However, bowtie is currently already at version 1.2.0.
Is there a reason for emirge to require/prefer these old versions of bowtie? Or would the new versions of bowtie work just as well with emirge?

EMIRGE Amplicon: n_alignments not initialized

Upon running emirge_amplicon.py assembly with the following command:

emirge_amplicon.py assembly --phred33 --iterations 120 --fasta_db /srv/databases/emirge/SSURef_111_candidate_db.fasta --bowtie_db /srv/databases/emirge/SSU_111_candidate_db --mapping TL04arch.mapped.sorted.bam --max_read_length 126 --insert_mean 100 --insert_stddev 10 -1 ...TL02arch-Overland-DNA2.forward.decontam.adapt_trim.qual_trim.fastq.gz -2 ...TL02arch-Overland-DNA2.reverse.decontam.adapt_trim.qual_trim.fastq

the program crashes with:

EMIRGE started at Fri Feb 19 17:02:01 2016
Rewriting reads with indices in headers at Fri Feb 19 17:02:01 2016...
DONE Rewriting reads with indexes in headers at Fri Feb 19 17:03:03 2016 [0:01:02.031644]...
Number of reads (or read pairs) in input file(s): 5806520
Preallocating reads and quals in memory at Fri Feb 19 17:03:03 2016...
DONE Preallocating reads and quals in memory at Fri Feb 19 17:03:47 2016 [0:00:44.743259]...
Beginning initialization at Fri Feb 19 17:03:47 2016...
Reading bam file ...TL04arch.mapped.sorted.bam at Fri Feb 19 17:03:47 2016...
Traceback (most recent call last):
  File "/bin/emirge_amplicon.py", line 1576, in <module>
    main()
  File "/bin/emirge_amplicon.py", line 1565, in main
    em.initialize_EM(options.mapping, options.fasta_db, randomize_priors = options.randomize_init_priors)
  File "/bin/emirge_amplicon.py", line 384, in initialize_EM
    self.read_bam(bam_filename, reference_fasta_filename)
  File "/bin/emirge_amplicon.py", line 334, in read_bam
    _emirge.process_bamfile(self, BOWTIE_ASCII_OFFSET)
  File "_emirge_amplicon.pyx", line 553, in _emirge_amplicon.process_bamfile (_emirge_amplicon.c:7241)
AttributeError: 'EM' object has no attribute 'n_alignments'

I've been investigating the error and this is what seems to be happening:

emirge_amplicon.py

line 1576 calls main()
line 1533 initializes the em class, there is not a self.n_alignments variable initialized with the class
line 1565 calls em.initialize_EM
line 384 in initialize_EM(args) calls self.read_bam(bam_filename, reference_fasta_filename)
line 334 in read_bam calls _emirge.process_bamfile(self, BOWTIE_ASCII_OFFSET)

The prior command leads to a function in _emirge_amplicon.pyx called process_bamfile

line 553 in the aforementioned function calls bamfile_data = np.empty((em.n_alignments, 6), dtype=np.uint32)

Note: self is passed to process_bamfile which is now called em

This is where the error is thrown.

n_alignments appears to be generated by get_n_alignments_from_bowtie on line 1086 which is called on line 1058 by do_mapping_bowtie which is called by do_mapping which is called by do_iteration which is called by do_iterations which is called on line 1568 near the end of main().

In summary, EM does not initialize any n_alignments variable but immediately tries to process the BAM file which attempts to use the n_alignments variable as part of a numpy array. Since n_alignments doesn't exist, the numpy array cannot initialize and the program crashes, i.e., the function ultimately requiring n_alignments is called before the function generating n_alignments. It seems that if no BAM file is provided to EMIRGE then it will derive n_alignments from bowtie's stderr but if you provide a BAM file, there's no way to provide this info and the program crashes.
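
A sketch of one way to close that gap: count the alignments in the supplied BAM file whenever `n_alignments` has not been set yet. Everything here is an assumption for illustration. `EMStub`, `Aln` and `ensure_n_alignments` are hypothetical names, and whether the count should be per record or per read pair depends on how process_bamfile uses it, which the issue does not show.

```python
class EMStub(object):
    """Hypothetical stand-in for the EM class; only the attribute of interest."""
    pass


class Aln(object):
    """Hypothetical stand-in for one BAM record with an is_unmapped flag."""

    def __init__(self, is_unmapped):
        self.is_unmapped = is_unmapped


def ensure_n_alignments(em, alignments):
    """Set em.n_alignments by counting mapped records if it was never set.

    `alignments` is any iterable of objects exposing `is_unmapped`,
    such as the records yielded when iterating over a BAM file with pysam.
    """
    if getattr(em, "n_alignments", None) is None:
        em.n_alignments = sum(1 for aln in alignments if not aln.is_unmapped)
    return em.n_alignments


em = EMStub()
n = ensure_n_alignments(em, [Aln(False), Aln(True), Aln(False)])
print(n)
```

Calling something like this at the top of read_bam would mean the numpy array in process_bamfile always has a size to work with, at the cost of one extra pass over the BAM file.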

Unexpected crash at 17th iteration

Hi,

I encountered an unexpected crash in the 17th iteration of an Emirge run.
This occurred only with one dataset and I reproduced the bug on 2 different computers.
There is no message in STDOUT telling me where the problem is coming from.
I have attached the run log.

Can you help me, please?

Thanks in advance

emirge.out.txt

IndexError: Out of bounds on buffer access (axis 0)

python /home/arkg/EMIRGE/emirge.py emirge-output/ -1 ../../JdF_1362A_J2.573-2/reads/SSU_reads/JdF1362AcombinedSSU.fastq -f SILVA_132_SSURef_Nr99_tax_silva_trunc.ge1200bp.le2000bp.0.97.fixed.fasta -b SILVA_132_SSURef_Nr99_tax_silva_trunc.ge1200bp.le2000bp.0.97.fixed.fasta -l 150 -a 16 -m SSUreadsA.bam.sorted

imported _emirge C functions from: /home/arkg/.cache/Python-Eggs/EMIRGE-0.61.0-py2.7-linux-x86_64.egg-tmp/_emirge.so
Command:
/home/arkg/EMIRGE/emirge.py emirge-output/ -1 ../../JdF_1362A_J2.573-2/reads/SSU_reads/JdF1362AcombinedSSU.fastq -f SILVA_132_SSURef_Nr99_tax_silva_trunc.ge1200bp.le2000bp.0.97.fixed.fasta -b SILVA_132_SSURef_Nr99_tax_silva_trunc.ge1200bp.le2000bp.0.97.fixed.fasta -l 150 -a 16 -m SSUreadsA.bam.sorted

EMIRGE started at Sat Jan 27 17:29:17 2018
Beginning initialization at Sat Jan 27 17:29:17 2018...
Reading bam file /mnt/maximus/data1/chan/arkadiy/Juan_de_Fuca/JdF_1362AB/emirge_SSU/SSUreadsA.bam.sorted at Sat Jan 27 17:29:17 2018...
DONE Reading bam file /mnt/maximus/data1/chan/arkadiy/Juan_de_Fuca/JdF_1362AB/emirge_SSU/SSUreadsA.bam.sorted at Sat Jan 27 17:29:22 2018 [0:00:04.452040]...
DONE with initialization at Sat Jan 27 17:29:22 2018...
Starting iteration 0 at Sat Jan 27 17:29:22 2018...
Reading bam file /mnt/maximus/data1/chan/arkadiy/Juan_de_Fuca/JdF_1362AB/emirge_SSU/SSUreadsA.bam.sorted at Sat Jan 27 17:29:22 2018...
DONE Reading bam file /mnt/maximus/data1/chan/arkadiy/Juan_de_Fuca/JdF_1362AB/emirge_SSU/SSUreadsA.bam.sorted at Sat Jan 27 17:29:25 2018 [0:00:03.364967]...
Calculating likelihood (1041, 42776) for iteration 0 at Sat Jan 27 17:29:25 2018...
	Calculating Pr(N=n) for iteration 0 at Sat Jan 27 17:29:25 2018...
	DONE calculating Pr(N=n) for iteration 0 at Sat Jan 27 17:29:26 2018 [0:00:01.241517]...
Traceback (most recent call last):
  File "/home/arkg/EMIRGE/emirge.py", line 1697, in <module>
    main()
  File "/home/arkg/EMIRGE/emirge.py", line 1688, in main
    do_iterations(em, max_iter = options.iterations, save_every = options.save_every)
  File "/home/arkg/EMIRGE/emirge.py", line 1439, in do_iterations
    em.do_iteration(em.current_bam_filename, em.current_reference_fasta_filename)
  File "/home/arkg/EMIRGE/emirge.py", line 491, in do_iteration
    self.calc_likelihoods()
  File "/home/arkg/EMIRGE/emirge.py", line 1141, in calc_likelihoods
    lik_data)
  File "_emirge.pyx", line 130, in _emirge._calc_likelihood
    if numeric_bases[i] == j:   # this is called base, set 1-P
IndexError: Out of bounds on buffer access (axis 0)

I am running on SSU reads pulled out from the raw quality reads using SortMeRNA. These include read pairs both combined and uncombined with FLASH.

Is this a live project?

I heard there might be an EMIRGE "version 2" coming?

EMIRGE presets and environmental variable defaults

EMIRGE is not always easy to use. Two suggestions to improve this:

Have several "presets" options that set the individual options for common use cases.
Have EMIRGE look for a handful of $EMIRGE_* in the environment, so that, for example, you don't have to specify -f and -b for the SSU candidate database on the command line every time.
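
The second suggestion could be sketched as follows. The variable names `EMIRGE_FASTA_DB` and `EMIRGE_BOWTIE_DB` are hypothetical, and argparse is used here only for brevity; this is not how emirge.py currently parses its options.

```python
import argparse
import os


def build_parser(environ=None):
    """Build a parser whose -f/-b defaults come from the environment.

    EMIRGE_FASTA_DB and EMIRGE_BOWTIE_DB are hypothetical variable names.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(prog="emirge.py")
    parser.add_argument("-f", "--fasta_db", default=environ.get("EMIRGE_FASTA_DB"))
    parser.add_argument("-b", "--bowtie_db", default=environ.get("EMIRGE_BOWTIE_DB"))
    return parser


# A plain dict stands in for os.environ so the example is self-contained.
fake_env = {"EMIRGE_FASTA_DB": "/db/ssu.fasta", "EMIRGE_BOWTIE_DB": "/db/ssu"}
from_env = build_parser(fake_env).parse_args([])
overridden = build_parser(fake_env).parse_args(["-f", "other.fasta"])
print(from_env.fasta_db, overridden.fasta_db)
```

An explicit command-line value still wins over the environment, so existing invocations would behave as before.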

EMIRGE should be indel-aware, and work with 454 reads.

Right now, EMIRGE uses bowtie for its read mapping, which does not handle indels. Consequently, EMIRGE is not coded to handle indels either. To be able to get more accurate reconstructions, as well as use homopolymer-rich 454/Roche sequencing data, we need to incorporate indels into the mapping and statistical model.

Reporting base probabilities in reconstructed sequences

Erin Nuccio pointed out that it would be nice to be able to recover the per-base probabilities calculated by EMIRGE for each reconstructed sequence. These are stored in the probN numpy array, if anyone wants to take a crack. Erin suggested converting to a fastq format, where quality scores represent the prob of the reported base. I also have code for plotting all four base probabilities at each position I should clean up and post as a stand-alone script.
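
As a rough sketch of the FASTQ idea: a probability p that the reported base is correct maps to a Phred score Q = -10 * log10(1 - p). The helper names, the cap of 41 and the Phred+33 offset are assumptions for illustration. How the per-base probabilities are pulled out of probN is not shown here; the example takes them as a plain list.

```python
import math


def prob_to_phred_char(prob, max_q=41, offset=33):
    """Encode P(base is correct) as a Phred+33 quality character."""
    error = 1.0 - prob
    if error <= 0.0:
        q = max_q  # certainty would give infinite Q, so cap it
    else:
        q = min(max_q, int(round(-10.0 * math.log10(error))))
    return chr(max(q, 0) + offset)


def to_fastq(name, bases, probs):
    """Format one record; probs[i] is the probability of bases[i]."""
    quals = "".join(prob_to_phred_char(p) for p in probs)
    return "@%s\n%s\n+\n%s\n" % (name, bases, quals)


record = to_fastq("seq1", "ACGT", [0.9, 0.99, 0.999, 1.0])
print(record)
```

Rounding rather than truncating matters here, because floating-point error can put a nominal Q20 just below 20.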

EMIRGE method choice

Hello,

I would like to try EMIRGE on metatranscriptome reads. Indeed I have already sorted the putative SSU reads out with cmsearch. Would you suggest using the normal EMIRGE version or the amplicon variant?

Thanks,
Domenico

Non-paired end data

Hi,

I used emirge_amplicon.py v0.60 with single reads (non paired-end):
emirge_amplicon.py emirge_dir -1 reads.fastq --phred33 --fasta_db SSU_candidate_db.fna --bowtie_db SSU_candidate_db_btindex --max_read_length 302 --processors 10

And got the error:
emirge_amplicon.py: error: --insert_mean is required, but is not specified (try --help)

In the end, I had to add bogus values for both --insert_mean and --insert_stddev to make EMIRGE happy:
emirge_amplicon.py emirge_dir -1 reads.fastq --phred33 --fasta_db SSU_candidate_db.fna --bowtie_db SSU_candidate_db_btindex --max_read_length 302 --processors 10 --insert_mean 100 --insert_stddev 1

When not using paired-end data (the -2 option), it would be nice if EMIRGE did not require insert parameters.

Other than that, it seems like EMIRGE worked fine with my non-paired-end data. Thanks,

Florent
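
The requested behaviour could be sketched like this: insert-size options are only enforced when a second read file is given. The `dest` names and the error wording are hypothetical, and argparse stands in for whatever option parser emirge_amplicon.py actually uses.

```python
import argparse


def parse_args(argv):
    """Require insert-size options only when a second read file (-2) is given."""
    parser = argparse.ArgumentParser(prog="emirge_amplicon.py")
    parser.add_argument("-1", dest="fastq_reads_1", required=True)
    parser.add_argument("-2", dest="fastq_reads_2", default=None)
    parser.add_argument("--insert_mean", type=int, default=None)
    parser.add_argument("--insert_stddev", type=int, default=None)
    args = parser.parse_args(argv)
    if args.fastq_reads_2 is not None:
        if args.insert_mean is None or args.insert_stddev is None:
            # parser.error prints the message and exits with status 2
            parser.error("--insert_mean and --insert_stddev are required with -2")
    return args


# Single-end invocation: no insert parameters needed.
single_end = parse_args(["-1", "reads.fastq"])
print(single_end.insert_mean)
```

With this check, the single-end command from the report would run without the bogus insert values, while paired-end runs would still be validated.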

Segmentation fault error

I am attempting to run EMIRGE and receive the following error (listed below). I am running USEARCH v5.2.32 (64-bit to increase the RAM). I am running EMIRGE in an empty directory. Does anyone know what might be happening? Thanks.

Time loading forward index: 00:00:00
Time loading mirror index: 00:00:00
[samopen] SAM header is present: 21 sequences.
Seeded quality full-index search: 00:12:47

reads processed: 10010002

reads with at least one reported alignment: 9956 (0.10%)

reads that failed to align: 10000046 (99.90%)

Reported 9956 alignments to 1 output stream(s)
Time searching: 00:12:47
Overall time: 00:12:47
Beginning initialization at Wed Apr 10 10:31:21 2013...
Reading bam file /home/localuser/Desktop/Emirge_runs/emirge_output/initial_mapping/initial_bowtie_mapping.PE.bam at Wed Apr 10 10:31:21 2013...
Segmentation fault