qiime2 / q2-vsearch

vsearch plugin for QIIME 2

License: BSD 3-Clause "New" or "Revised" License

Python 97.94% Makefile 0.12% TeX 0.41% HTML 1.53%

hacktoberfest

q2-vsearch's Introduction

qiime2 (the QIIME 2 framework)

Source code repository for the QIIME 2 framework.

QIIME 2™ is a powerful, extensible, and decentralized microbiome bioinformatics platform that is free, open source, and community developed. With a focus on data and analysis transparency, QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results.

Visit https://qiime2.org to learn more about the QIIME 2 project.

Installation

Detailed instructions are available in the documentation.

Users

Head to the user docs for help getting started, core concepts, tutorials, and other resources.

Just have a question? Please ask it in our forum.

Developers

Please visit the contributing page for more information on contributions, documentation links, and more.

Citing QIIME 2

If you use QIIME 2 for any published research, please include the following citation:

Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, and Caporaso JG. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37:852–857. https://doi.org/10.1038/s41587-019-0209-9

q2-vsearch's People

Contributors

johnchase ebolyen jairideout gregcaporaso jakereps colinbrislawn madeleineernst chriskeefe turanoo nearinj angrybee oddant1 andrewsanchez devonorourke jcmcnch keegan-evans lizgehret colinvwood hagenjp

q2-vsearch's Issues

add cluster-sequences-denovo

expose --minuniquesize for dereplicating

Improvement Description
A forum member wanted to use --minuniquesize during dereplication. Can we expose this setting so users can omit low abundance clusters?

Comments
This issue applies directly to qiime vsearch dereplicate-sequences, but may apply to other vsearch functions as well.

References
wanted to use --minuniquesize during dereplication

add method for chimera checking

Add Citations

Should use the new citation API in qiime2/qiime2#387

vsearch cluster-features-de-novo fatal error

Hi there, and thanks for your tremendous effort in maintaining q2!

I am running vsearch cluster-features-de-novo in the hope of re-clustering my ASV table inferred from DADA2. The following command:
qiime vsearch cluster-features-de-novo --i-sequences rep-seqs.qza --i-table feature-table.qza --p-perc-identity .99 --p-threads 6 --output-dir OUT --verbose
spits out this traceback:

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --cluster_size /tmp/tmpgm9lrl9r --id 0.99 --centroids /tmp/q2-DNAFASTAFormat-zdyznqtj --uc /tmp/tmpz0fek351 --qmask none --xsize --threads 6

vsearch v2.7.0_linux_x86_64, 31.1GB RAM, 8 cores
https://github.com/torognes/vsearch

Reading file /tmp/tmpgm9lrl9r 4%  

Fatal error: Invalid (zero) abundance annotation in FASTA file header
Traceback (most recent call last):
  File "/root/miniconda2/envs/qiime2-2019.7/lib/python3.6/site-packages/q2cli/commands.py", line 327, in __call__
    results = action(**arguments)
  File "</root/miniconda2/envs/qiime2-2019.7/lib/python3.6/site-packages/decorator.py:decorator-gen-121>", line 2, in cluster_features_de_novo
  File "/root/miniconda2/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/sdk/action.py", line 240, in bound_callable
    output_types, provenance)
  File "/root/miniconda2/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/sdk/action.py", line 383, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/root/miniconda2/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_vsearch/_cluster_features.py", line 193, in cluster_features_de_novo
    run_command(cmd)
  File "/root/miniconda2/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_vsearch/_cluster_features.py", line 33, in run_command
    subprocess.run(cmd, check=True)
  File "/root/miniconda2/envs/qiime2-2019.7/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['vsearch', '--cluster_size', '/tmp/tmpgm9lrl9r', '--id', '0.99', '--centroids', '/tmp/q2-DNAFASTAFormat-zdyznqtj', '--uc', '/tmp/tmpz0fek351', '--qmask', 'none', '--xsize', '--threads', '6']' returned non-zero exit status 1.

Plugin error from vsearch:

  Command '['vsearch', '--cluster_size', '/tmp/tmpgm9lrl9r', '--id', '0.99', '--centroids', '/tmp/q2-DNAFASTAFormat-zdyznqtj', '--uc', '/tmp/tmpz0fek351', '--qmask', 'none', '--xsize', '--threads', '6']' returned non-zero exit status 1.

Could you kindly suggest a solution to this issue?

With thanks, Stan
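
The fatal error above points at a FASTA header whose abundance annotation is zero. The following is a minimal sketch for locating such records, assuming the headers carry vsearch-style `;size=N` annotations. The function name `zero_abundance_ids` is hypothetical and not part of q2-vsearch.

```python
import re

# Matches a vsearch-style abundance annotation such as ";size=12".
SIZE_RE = re.compile(r";size=(\d+)")


def zero_abundance_ids(fasta_lines):
    """Return the ids of records whose header carries a ';size=0' annotation.

    Hypothetical helper for illustration; not part of q2-vsearch.
    """
    ids = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to inspect
        match = SIZE_RE.search(line)
        if match and int(match.group(1)) == 0:
            # The id is everything between '>' and the first ';'.
            ids.append(line[1:].split(";")[0].strip())
    return ids
```

Any ids reported this way could then be compared against the feature table to see which features have no counts.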

derep - expose minseqlength parameter

Improvement Description
Expose the --minseqlength vsearch parameter in the dereplicate-sequences method.

References

forum

Pass `--threads` to `join-pairs`

Came up on the forum.

Looks like --threads can be used with fastq_mergepairs so we just need to expose and wire up that option.

Missing a length filter cross-reference in `cluster-features-*` actions

Bug Description
vsearch apparently applies a minimum length filter of 32 nts to input sequences - our cluster-features-* actions appear to assume that no reads are going to be filtered by vsearch, so there is no cross-referencing or post-vsearch filtering applied.

Steps to reproduce the behavior

Please see reference 1, below.

Expected behavior
I see at least two ways to solve, detailed in questions 1 and 2, below.

Questions

1. Should we solve this by applying post-vsearch filtering? If so, how should we report the filtered sequences back to the user? Is this a new output, or is it lumped in with one of the existing outputs?
2. Should (can?) we solve this by removing the min-length filter on vsearch?

References

https://forum.qiime2.org/t/error-when-renning-cluster-features-de-novo/14878
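
The post-vsearch cross-referencing raised in the first question could look roughly like the sketch below. It assumes the feature ids are available as plain Python collections and the table as a `{feature_id: count}` mapping. Both function names are hypothetical, and this is not how q2-vsearch currently behaves.

```python
def find_filtered_ids(input_ids, output_ids):
    """Return the sorted ids present in the input but absent from the output.

    Hypothetical helper: these are the ids vsearch dropped, e.g. because the
    sequences were shorter than its minimum length filter.
    """
    return sorted(set(input_ids) - set(output_ids))


def drop_filtered_rows(table, filtered_ids):
    """Return a copy of a {feature_id: count} mapping without the dropped ids."""
    filtered = set(filtered_ids)
    return {fid: count for fid, count in table.items() if fid not in filtered}
```

The list returned by `find_filtered_ids` is also what would be reported back to the user, whether as a new output or as part of an existing one.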

feature ids in "chimeras" FeatureData[Sequence] artifacts are incorrect

Improvement Description
We are fixing these ids internal to QIIME 2 - once this issue is fixed in vsearch, we should remove our fix as it will be taking an unnecessary pass through the file at that point.

References
This is related to torognes/vsearch#272.

add cluster-features-open-reference

include dependencies in setup.py

merge fastq_stats_* visualizers into a single action

Improvement Description
Now that we have a brand-new unified view type for SE and PE fastq formats, fastq_stats_single and fastq_stats_paired could be merged into a single action. See:
qiime2/q2-types#245
and here for an open PR where this unified view type is utilized:
qiime2/q2-quality-control#52

Current Behavior
fastq_stats_single and fastq_stats_paired exist as separate visualizers. I am excited for these new visualizers! We have an opportunity to merge these into one before they are released, to make use more straightforward.

Proposed Behavior
Merge these visualizers into one fastq_stats visualizer. This can probably be done by:

changing the view type here to CasavaOneEightSingleLanePerSampleDirFmt (that format has a new manifest property so afaik this line should still work but needs testing)
instead of this if statement you could then just check if the manifest has a reverse column.
in plugin_setup.py you would register a union type as the input type. E.g., instead of specifying SampleData[PairedEndSequencesWithQuality] as seen here you would do SampleData[SequencesWithQuality | PairedEndSequencesWithQuality]

Then just cleaning up the docs, tests etc to use the unified action. (I may be missing a few other steps, but the above should be the gist of the main changes I think)

add vsearch database search method

Improvement Description
This is useful for running a BLAST-like search locally. For example, I recently wanted to get a report of the best matches in my reference database for some query sequences of interest. The vsearch command looks like the following:

vsearch --db 99-otus/dna-sequences.fasta \
  --usearch_global queries.fasta \
  --alnout out.aln \
  --blast6out out.bl6 \
  --id 0.0 \
  --maxaccepts 10 \
  --qmask none

Proposed Behavior
I would see this generating a qza containing the bl6 output, which could then be viewed with qiime metadata tabulate.
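
To illustrate what the bl6 output contains, here is a minimal sketch that reads the 12 tab-separated fields of the `--blast6out` file and keeps the highest-identity hit per query. The column labels and both function names are choices made for this sketch, not q2-vsearch API.

```python
import csv
import io

# Labels for the 12 tab-separated blast6 fields (names chosen for this sketch).
BLAST6_COLUMNS = [
    "query", "target", "id", "alnlen", "mism", "opens",
    "qlo", "qhi", "tlo", "thi", "evalue", "bits",
]


def read_blast6(handle):
    """Parse blast6 text from a file-like object into a list of dicts."""
    reader = csv.reader(handle, delimiter="\t")
    return [dict(zip(BLAST6_COLUMNS, row)) for row in reader if row]


def best_hits(rows):
    """Return {query: row} keeping the row with the highest percent identity."""
    best = {}
    for row in rows:
        current = best.get(row["query"])
        if current is None or float(row["id"]) > float(current["id"]):
            best[row["query"]] = row
    return best


# Usage with two made-up hits for one query.
example = (
    "q1\tt1\t97.0\t100\t3\t0\t1\t100\t1\t100\t-1\t0\n"
    "q1\tt2\t99.0\t100\t1\t0\t1\t100\t1\t100\t-1\t0\n"
)
rows = read_blast6(io.StringIO(example))
top = best_hits(rows)
```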

ENH: output closed-ref OTU picking stats

Improvement Description
Simply: % reads that match ref, but maybe other information could be relevant. (does vsearch output anything by default that we could just propagate here?)

This could be as simple as a report written to stdout, or could attempt to mirror what dada2/deblur do. (I'd lean toward the former, given the relative simplicity of OTU picking)

References
forum xref

add vsearch all-pairwise-alignments pipeline

Improvement Description
Add vsearch all-pairwise-alignments pipeline

Current Behavior
The vsearch command I ran was:

vsearch --allpairs_global queries.fasta --acceptall --id 0.0 --blast6out out.bl6 --alnout out.aln --qmask none

Proposed Behavior
Compute all pairwise alignments for a set of query sequences

Comments

I would see this generating a qzv showing the pairwise alignments and maybe a heatmap-like summary of the pairwise alignment scores, and a qza containing the bl6 output that could be viewed with qiime metadata tabulate.

ENH: add support for the different versions of `uchime_denovo` within vsearch

Somewhat an extension of #75, but there have been improvements to the uchime_denovo algorithm of vsearch. Specifically:

uchime2_denovo
uchime3_denovo

As outlined in the vsearch manual, these additional options are ideal for working with denoised amplicons. Also note that the minh option is ignored for these two options.

add plugin description

add install instructions to readme

Should be:

conda install -c bioconda vsearch
pip install https://github.com/gregcaporaso/q2-vsearch/archive/master.zip

add support for ``--derep_prefix`` when we support vsearch 2.1.1

A vsearch bug was fixed related to this functionality, so we should be able to add this as an option to dereplicate_sequences, except that we're currently installing vsearch 2.0.3. Version 2.1.1 contains the fix.

only feature ids present in the table should be clustered

I helped a collaborator run cluster-features after importing sequences and a biom table that were processed with a non-QIIME 2 pipeline, and we had the surprising result that the resulting FeatureData[Sequence] had more records in it than there were features in the FeatureTable. It turned out that the input FeatureData[Sequence] had sequences for features that had been filtered out of the input FeatureTable. We should probably raise an error if the set of feature ids isn't the same between the two inputs.

This is a bit of an edge case, so isn't really high priority (this will probably primarily be applied to FeatureTable and FeatureData[Sequence] artifacts that are generated through a QIIME 2 pipeline).
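
The proposed check can be sketched as follows, assuming the feature ids of both inputs are available as plain iterables. The function name `check_feature_ids_match` is hypothetical and not part of q2-vsearch.

```python
def check_feature_ids_match(table_ids, sequence_ids):
    """Raise ValueError if the two inputs do not share the same feature ids.

    Hypothetical helper for illustration; not part of q2-vsearch.
    """
    table_ids = set(table_ids)
    sequence_ids = set(sequence_ids)
    only_in_sequences = sequence_ids - table_ids
    only_in_table = table_ids - sequence_ids
    if only_in_sequences or only_in_table:
        raise ValueError(
            "Feature ids differ between inputs: %d only in sequences, "
            "%d only in table."
            % (len(only_in_sequences), len(only_in_table))
        )
```

Reporting both differences separately tells the user which input needs filtering.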

Expose `FeatureMap` data when clustering

Improvement Description
A user on the QIIME 2 forum was requesting detail on what features map to specific subjects from closed reference clustering. The cluster-features-closed-reference action does not presently expose the UC mapping output that would provide this detail. @gregcaporaso suggested use of FeatureMap as a means to represent these data.

Current Behavior
The mapping detail is not provided.

Proposed Behavior
Modify the relevant actions, including cluster-features-closed-reference, to allow saving the FeatureMap.

References

@gregcaporaso suggested migrating FeatureMap to q2-types, see qiime2/q2-types#298
https://forum.qiime2.org/t/introducing-greengenes2-2022-10/25291/35

Please support vsearch 2.13 or newer (Debian Packaging)

Hello,

I am currently packaging q2-vsearch for the Debian-Med packaging team. I have realised that q2-vsearch is very strict about the version of vsearch that is required (2.7.0). This became apparent as the tests failed with the following recurring messages (one example shown):

self = <q2_vsearch.tests.test_join_pairs.MergePairsTests testMethod=test_join_pairs_alt_qminout>

    def test_join_pairs_alt_qminout(self):
        with redirected_stdio(stderr=os.devnull):
>           cmd, obs = _join_pairs_w_command_output(
                self.input_seqs, qminout=-1)

q2_vsearch/tests/test_join_pairs.py:276: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
q2_vsearch/_join_pairs.py:146: in _join_pairs_w_command_output
    run_command(cmd)
q2_vsearch/_cluster_features.py:33: in run_command
    subprocess.run(cmd, check=True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

input = None, capture_output = False, timeout = None, check = True
popenargs = (['vsearch', '--fastq_mergepairs', '/<<PKGBUILDDIR>>/.pybuild/cpython3_3.8/build/q2_v....dev0/.pybuild/cpython3_3.8/build/q2_vsearch/tests/data/demux-1/BAQ2687.1_3_L001_R2_001.fastq.gz', '--fastqout', ...],)
kwargs = {}, process = <subprocess.Popen object at 0x7ffb3af2a9a0>
stdout = None, stderr = None, retcode = 1

    def run(*popenargs,
            input=None, capture_output=False, timeout=None, check=False, **kwargs):
        """Run command with arguments and return a CompletedProcess instance.
    
        The returned instance will have attributes args, returncode, stdout and
        stderr. By default, stdout and stderr are not captured, and those attributes
        will be None. Pass stdout=PIPE and/or stderr=PIPE in order to capture them.
    
        If check is True and the exit code was non-zero, it raises a
        CalledProcessError. The CalledProcessError object will have the return code
        in the returncode attribute, and output & stderr attributes if those streams
        were captured.
    
        If timeout is given, and the process takes too long, a TimeoutExpired
        exception will be raised.
    
        There is an optional argument "input", allowing you to
        pass bytes or a string to the subprocess's stdin.  If you use this argument
        you may not also use the Popen constructor's "stdin" argument, as
        it will be used internally.
    
        By default, all communication is in bytes, and therefore any "input" should
        be bytes, and the stdout and stderr will be bytes. If in text mode, any
        "input" should be a string, and stdout and stderr will be strings decoded
        according to locale encoding, or by "encoding" if set. Text mode is
        triggered by setting any of text, encoding, errors or universal_newlines.
    
        The other arguments are the same as for the Popen constructor.
        """
        if input is not None:
            if kwargs.get('stdin') is not None:
                raise ValueError('stdin and input arguments may not both be used.')
            kwargs['stdin'] = PIPE
    
        if capture_output:
            if kwargs.get('stdout') is not None or kwargs.get('stderr') is not None:
                raise ValueError('stdout and stderr arguments may not be used '
                                 'with capture_output.')
            kwargs['stdout'] = PIPE
            kwargs['stderr'] = PIPE
    
        with Popen(*popenargs, **kwargs) as process:
            try:
                stdout, stderr = process.communicate(input, timeout=timeout)
            except TimeoutExpired as exc:
                process.kill()
                if _mswindows:
                    # Windows accumulates the output in a single blocking
                    # read() call run on child threads, with the timeout
                    # being done in a join() on those threads.  communicate()
                    # _after_ kill() is required to collect that and add it
                    # to the exception.
                    exc.stdout, exc.stderr = process.communicate()
                else:
                    # POSIX _communicate already populated the output so
                    # far into the TimeoutExpired exception.
                    process.wait()
                raise
            except:  # Including KeyboardInterrupt, communicate handled that.
                process.kill()
                # We don't call process.wait() as .__exit__ does that for us.
                raise
            retcode = process.poll()
            if check and retcode:
>               raise CalledProcessError(retcode, process.args,
                                         output=stdout, stderr=stderr)
E               subprocess.CalledProcessError: Command '['vsearch', '--fastq_mergepairs', '/<<PKGBUILDDIR>>/.pybuild/cpython3_3.8/build/q2_vsearch/tests/data/demux-1/BAQ2687.1_3_L001_R1_001.fastq.gz', '--reverse', '/<<PKGBUILDDIR>>/.pybuild/cpython3_3.8/build/q2_vsearch/tests/data/demux-1/BAQ2687.1_3_L001_R2_001.fastq.gz', '--fastqout', '/tmp/q2-SingleLanePerSampleSingleEndFastqDirFmt-uv9jfa6y/BAQ2687.1_0_L001_R1_001.fastq', '--fastq_ascii', '33', '--fastq_minlen', '1', '--fastq_minovlen', '10', '--fastq_maxdiffs', '10', '--fastq_qmin', '0', '--fastq_qminout', '-1', '--fastq_qmax', '41', '--fastq_qmaxout', '41', '--minseqlength', '1', '--fasta_width', '0', '--threads', '1']' returned non-zero exit status 1.

/usr/lib/python3.8/subprocess.py:512: CalledProcessError
----------------------------- Captured stdout call -----------------------------
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --fastq_mergepairs /<<PKGBUILDDIR>>/.pybuild/cpython3_3.8/build/q2_vsearch/tests/data/demux-1/BAQ2687.1_3_L001_R1_001.fastq.gz --reverse /<<PKGBUILDDIR>>/.pybuild/cpython3_3.8/build/q2_vsearch/tests/data/demux-1/BAQ2687.1_3_L001_R2_001.fastq.gz --fastqout /tmp/q2-SingleLanePerSampleSingleEndFastqDirFmt-uv9jfa6y/BAQ2687.1_0_L001_R1_001.fastq --fastq_ascii 33 --fastq_minlen 1 --fastq_minovlen 10 --fastq_maxdiffs 10 --fastq_qmin 0 --fastq_qminout -1 --fastq_qmax 41 --fastq_qmaxout 41 --minseqlength 1 --fasta_width 0 --threads 1

----------------------------- Captured stderr call -----------------------------
Fatal error: Invalid options to command fastq_mergepairs
Invalid option(s): --fastq_qminout --minseqlength
The valid options for the fastq_mergepairs command are: --bzip2_decompress --eeout --eetabbedout --fasta_width --fastaout --fastaout_notmerged_fwd --fastaout_notmerged_rev --fastq_allowmergestagger --fastq_ascii --fastq_eeout --fastq_maxdiffpct --fastq_maxdiffs --fastq_maxee --fastq_maxlen --fastq_maxmergelen --fastq_maxns --fastq_minlen --fastq_minmergelen --fastq_minovlen --fastq_nostagger --fastq_qmax --fastq_qmaxout --fastq_qmin --fastq_truncqual --fastqout --fastqout_notmerged_fwd --fastqout_notmerged_rev --gzip_decompress --label_suffix --log --no_progress --quiet --relabel --relabel_keep --relabel_md5 --relabel_self --relabel_sha1 --reverse --sizein --sizeout --threads --xee --xsize

Most notably, Invalid option(s): --fastq_qminout --minseqlength. It has become apparent that these options are not available in the latest version of vsearch (they were removed).

It would be much appreciated if this could be fixed.

Missing deps in conda recipe

numpy
pandas
pyyaml

add cluster-sequences-closed-reference

ENH: `dereplicate-sequences` expose parameter to disable sequence hash IDs

Improvement Description
Similar to q2-dada2 and q2-deblur, there should be an option to use the unhashed sequences as their own IDs instead of using a hash ID in dereplicate-sequences.

Current Behavior
Seq hashes are used by default.

Proposed Behavior
Expose a --p-hashed-feature-ids parameter to choose how sequence IDs get handled.

References

forum xref

performance issues with `cluster-features-open-reference`

Bug Description
Runtimes seem to be very slow relative to closed and de-novo on the same datasets.

This is probably caused by the intermediate steps in the pipeline. One good indicator is that the bulk of open-reference clustering does not appear to run in parallel when multiple threads are requested (suggesting that it is hanging on those steps).

References
the intermediate steps

expand tests of `.uc` format parser

Improvement Description
This should be done when we write the de novo and closed-reference OTU pickers, and can be ported from QIIME 1.

add cluster-features-closed-reference

Set --fasta_width 0 on relevant actions

Comments
This will ensure that sequences aren't split to a new line at 80 characters.

References
This recently came up on the forum.

cluster-features id parameter should specify range

BUG: derep-seqs produces mismatched feature IDs

Bug Description
vsearch dereplicate-sequences adds Sample ID information to the Feature IDs of the FeatureData[Sequence] output, but not the Feature IDs of the FeatureTable[Frequency] output. This causes problems downstream when attempting to utilize Actions that require both as inputs --- the Feature IDs no longer match.

Steps to reproduce the behavior

1. Run dereplicate-sequences on any input
2. Export the FeatureData[Sequence]
3. Note the modified Feature IDs: e.g. >9142dd139a96f63aba52d4e88bdcb803981e3467 L2S204_3773

Expected behavior
Feature IDs should be consistent in both outputs. Probably no need to include the Sample ID info that is being tacked on.

References

Original forum post
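
One way to make the Feature IDs consistent is to drop everything after the first whitespace in each FASTA header. The sketch below assumes the extra Sample ID information is always separated from the Feature ID by whitespace, as in the example above. The function name is hypothetical, and this is not necessarily how the fix was implemented.

```python
def strip_header_descriptions(fasta_lines):
    """Yield FASTA lines with header text after the first whitespace removed.

    Sequence lines pass through unchanged (minus the trailing newline).
    """
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the '>id' token; drop e.g. the trailing sample id.
            line = line.split(None, 1)[0]
        yield line
```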

add unit tests

Rename `join-pairs` to `merge-pairs`

Suggestion: Rename the q2-vsearch join-pairs action to merge-pairs, and update the associated descriptions to state something like: "Merge paired-end sequence reads...". Merging reads is not the same as joining reads.

Reason: Calling this QIIME 2 action join-pairs may be confusing for savvy vsearch users, as there is in fact a --fastq_join command within vsearch. That command stitches the two reads together end-to-end with a padding string, usually Ns. As we're actually calling vsearch --fastq_mergepairs, we should update the name of the action.

add cluster-sequences-open-reference

Add support for paired-end joining

This should just be wiring up the appropriate incantation of vsearch.

add taxonomy classification method

It could be useful to add a taxonomy classifier to the vsearch plugin, mirroring the functionality of the QIIME 1 uclust classifier.

An example command that I had used previously for this, showing (at least the basic) parameters needed:
vsearch --usearch_global <input fasta> --db <ref seqs> --uc <output file> --id <% similarity> --strand both --maxaccepts <int> --maxrejects <int> --uc_allhits --output_no_hits

This outputs .uc files, so a conversion method akin to taxter's uc_consensus_assignments may be needed.
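
As a rough idea of the first step of such a conversion, the sketch below groups the reference hits in a .uc file by query. It assumes the usual 10 tab-separated .uc fields, with the record type in the first field, the query label in the ninth, and the target label in the tenth. The function name `hits_by_query` is hypothetical, and this is not the uc_consensus_assignments implementation.

```python
from collections import defaultdict


def hits_by_query(uc_lines):
    """Group target labels by query label from .uc records.

    'H' records are hits; 'N' records are queries with no hit, which are
    kept with an empty list so that they are not silently lost.
    """
    hits = defaultdict(list)
    for line in uc_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        record_type, query, target = fields[0], fields[8], fields[9]
        if record_type == "H":
            hits[query].append(target)
        elif record_type == "N":
            hits.setdefault(query, [])
    return dict(hits)
```

A consensus step would still be needed afterwards to turn each query's list of reference ids into a single taxonomy assignment.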
