
q2-vsearch's Introduction

qiime2 (the QIIME 2 framework)

Source code repository for the QIIME 2 framework.

QIIME 2™ is a powerful, extensible, and decentralized microbiome bioinformatics platform that is free, open source, and community developed. With a focus on data and analysis transparency, QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results.

Visit https://qiime2.org to learn more about the QIIME 2 project.

Installation

Detailed instructions are available in the documentation.

Users

Head to the user docs for help getting started, core concepts, tutorials, and other resources.

Just have a question? Please ask it in our forum.

Developers

Please visit the contributing page for more information on contributions, documentation links, and more.

Citing QIIME 2

If you use QIIME 2 for any published research, please include the following citation:

Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, and Caporaso JG. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37:852–857. https://doi.org/10.1038/s41587-019-0209-9

q2-vsearch's People

Contributors

angrybee, chriskeefe, colinbrislawn, colinvwood, david-rod, ebolyen, gregcaporaso, hagenjp, jairideout, lizgehret, oddant1, q2d2, thermokarst, turanoo, vaamb

q2-vsearch's Issues

vsearch cluster-features-de-novo fatal error

Hi there, and thanks for your tremendous effort in maintaining q2!

I am running vsearch cluster-features-de-novo, hoping to re-cluster my ASV table inferred from DADA2. The following command:
qiime vsearch cluster-features-de-novo --i-sequences rep-seqs.qza --i-table feature-table.qza --p-perc-identity .99 --p-threads 6 --output-dir OUT --verbose
produces this traceback:

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --cluster_size /tmp/tmpgm9lrl9r --id 0.99 --centroids /tmp/q2-DNAFASTAFormat-zdyznqtj --uc /tmp/tmpz0fek351 --qmask none --xsize --threads 6

vsearch v2.7.0_linux_x86_64, 31.1GB RAM, 8 cores
https://github.com/torognes/vsearch

Reading file /tmp/tmpgm9lrl9r 4%  

Fatal error: Invalid (zero) abundance annotation in FASTA file header
Traceback (most recent call last):
  File "/root/miniconda2/envs/qiime2-2019.7/lib/python3.6/site-packages/q2cli/commands.py", line 327, in __call__
    results = action(**arguments)
  File "</root/miniconda2/envs/qiime2-2019.7/lib/python3.6/site-packages/decorator.py:decorator-gen-121>", line 2, in cluster_features_de_novo
  File "/root/miniconda2/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/sdk/action.py", line 240, in bound_callable
    output_types, provenance)
  File "/root/miniconda2/envs/qiime2-2019.7/lib/python3.6/site-packages/qiime2/sdk/action.py", line 383, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/root/miniconda2/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_vsearch/_cluster_features.py", line 193, in cluster_features_de_novo
    run_command(cmd)
  File "/root/miniconda2/envs/qiime2-2019.7/lib/python3.6/site-packages/q2_vsearch/_cluster_features.py", line 33, in run_command
    subprocess.run(cmd, check=True)
  File "/root/miniconda2/envs/qiime2-2019.7/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['vsearch', '--cluster_size', '/tmp/tmpgm9lrl9r', '--id', '0.99', '--centroids', '/tmp/q2-DNAFASTAFormat-zdyznqtj', '--uc', '/tmp/tmpz0fek351', '--qmask', 'none', '--xsize', '--threads', '6']' returned non-zero exit status 1.

Plugin error from vsearch:

  Command '['vsearch', '--cluster_size', '/tmp/tmpgm9lrl9r', '--id', '0.99', '--centroids', '/tmp/q2-DNAFASTAFormat-zdyznqtj', '--uc', '/tmp/tmpz0fek351', '--qmask', 'none', '--xsize', '--threads', '6']' returned non-zero exit status 1.

Could you kindly suggest a solution to this issue?

With thanks, Stan
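
The fatal error above refers to an abundance annotation in a FASTA header. As a rough diagnostic sketch, assuming the FASTA passed to vsearch carries `;size=N` annotations in its headers, the records annotated with zero abundance could be listed as follows. The helper name `find_zero_abundance_ids` is hypothetical and not part of q2-vsearch.

```python
import re

# Matches a vsearch-style abundance annotation such as ";size=12" in a header.
SIZE_PATTERN = re.compile(r"(?:^|;)size=(\d+)")


def find_zero_abundance_ids(fasta_lines):
    """Return header IDs whose size annotation is zero (hypothetical helper)."""
    zero_ids = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        header = line[1:].strip()
        match = SIZE_PATTERN.search(header)
        if match and int(match.group(1)) == 0:
            # Keep only the part of the header before the first annotation.
            zero_ids.append(header.split(";")[0])
    return zero_ids


example = [">feat1;size=10", "ACGT", ">feat2;size=0", "GGCC"]
print(find_zero_abundance_ids(example))
```

If such records turn up, it may be worth checking whether the feature table contains features whose total frequency is zero; the original post does not confirm that this is the cause.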

Missing a length filter cross-reference in `cluster-features-*` actions

Bug Description
vsearch apparently applies a minimum length filter of 32 nt to input sequences. Our cluster-features-* actions appear to assume that vsearch will not filter any reads, so no cross-referencing or post-vsearch filtering is applied.

Steps to reproduce the behavior

  1. Please see reference 1, below.

Expected behavior
I see at least two ways to solve, detailed in questions 1 and 2, below.

Questions

  1. Should we solve by applying post-vsearch filtering? If so, how should we report the filtered sequences back to the user? Is this a new output, or is it lumped in with one of the existing outputs?
  2. Should (can?) we solve this by removing the min-length filter on vsearch?
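
The cross-referencing raised in question 1 might look roughly like the sketch below. It assumes sequences are available as a plain mapping of ID to sequence string; both helper names are hypothetical, and the threshold of 32 comes from the bug description above.

```python
def find_dropped_ids(input_ids, clustered_ids):
    """Return input IDs that are absent from the clustering output, sorted."""
    return sorted(set(input_ids) - set(clustered_ids))


def ids_below_min_length(sequences, min_length=32):
    """Return IDs of sequences shorter than min_length (assumed threshold)."""
    return sorted(seq_id for seq_id, seq in sequences.items()
                  if len(seq) < min_length)


sequences = {"a": "ACGT" * 10, "b": "ACGT" * 5}  # lengths 40 and 20
print(ids_below_min_length(sequences))
print(find_dropped_ids(sequences.keys(), ["a"]))
```

Comparing the two results would show whether every dropped ID is explained by the length filter.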

References

  1. https://forum.qiime2.org/t/error-when-renning-cluster-features-de-novo/14878

merge fastq_stats_* visualizers into a single action

Improvement Description
Now that we have a brand-new unified view type for SE and PE fastq formats, fastq_stats_single and fastq_stats_paired could be merged into a single action. See:
qiime2/q2-types#245
and here for an open PR where this unified view type is utilized:
qiime2/q2-quality-control#52

Current Behavior
fastq_stats_single and fastq_stats_paired exist as separate visualizers. I am excited for these new visualizers! We have an opportunity to merge these into one before they are released, to make use more straightforward.

Proposed Behavior
Merge these visualizers into one fastq_stats visualizer. This can probably be done by:

  1. changing the view type here to CasavaOneEightSingleLanePerSampleDirFmt (that format has a new manifest property so afaik this line should still work but needs testing)
  2. instead of this if statement you could then just check if the manifest has a reverse column.
  3. in plugin_setup.py you would register a union type as the input type. E.g., instead of specifying SampleData[PairedEndSequencesWithQuality] as seen here you would do SampleData[SequencesWithQuality | PairedEndSequencesWithQuality]

Then it is just a matter of cleaning up the docs, tests, etc. to use the unified action. (I may be missing a few other steps, but the above should be the gist of the main changes.)
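
Step 2 above (checking whether the manifest has a reverse column) could be sketched as follows. This is only an illustration: it assumes the manifest is a comma-separated table whose paired-end variant has a column named `reverse`, and the function `is_paired_end` is hypothetical rather than part of q2-types or q2-vsearch.

```python
import csv
import io


def is_paired_end(manifest_text):
    """Return True if the manifest table has a 'reverse' column (assumed name)."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    return "reverse" in (reader.fieldnames or [])


single = "sample-id,forward\ns1,s1_R1.fastq.gz\n"
paired = "sample-id,forward,reverse\ns1,s1_R1.fastq.gz,s1_R2.fastq.gz\n"
print(is_paired_end(single), is_paired_end(paired))
```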

add vsearch database search method

Improvement Description
This is useful for running a BLAST-like search locally. For example, I recently wanted to get a report of the best matches in my reference database for some query sequences of interest. The vsearch command looks like the following:

vsearch --db 99-otus/dna-sequences.fasta \
  --usearch_global queries.fasta \
  --alnout out.aln \
  --blast6out out.bl6 \
  --id 0.0 \
  --maxaccepts 10 \
  --qmask none

Proposed Behavior
I would see this generating a qza containing the bl6 output, which could then be viewed with qiime metadata tabulate.
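
For reference, the bl6 output is a tab-separated table with the twelve standard BLAST tabular fields. A minimal parsing sketch follows; the column labels are descriptive names chosen here, not identifiers taken from vsearch or QIIME 2, and values are left as strings.

```python
import csv
import io

# Descriptive labels (chosen for this sketch) for the 12 blast6 fields.
BLAST6_COLUMNS = [
    "query", "target", "percent_identity", "alignment_length",
    "mismatches", "gap_opens", "query_start", "query_end",
    "target_start", "target_end", "evalue", "bit_score",
]


def parse_blast6(text):
    """Parse blast6out text into a list of dicts keyed by BLAST6_COLUMNS."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(BLAST6_COLUMNS, row)) for row in reader if row]


example = "q1\tref9\t98.5\t200\t3\t0\t1\t200\t5\t204\t-1\t0\n"
rows = parse_blast6(example)
print(rows[0]["query"], rows[0]["target"], rows[0]["percent_identity"])
```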

ENH: output closed-ref OTU picking stats

Improvement Description
Simply: % reads that match ref, but maybe other information could be relevant. (does vsearch output anything by default that we could just propagate here?)

This could be as simple as a report written to stdout, or could attempt to mirror what dada2/deblur do. (I'd lean toward the former, given the relative simplicity of OTU picking)
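
One way such a report could be derived is from the .uc output, where each record starts with a type code (`H` for a hit, `N` for no hit). The sketch below assumes one record per query and counts queries rather than reads, so it ignores any abundance annotations; the function name is hypothetical.

```python
def percent_queries_matched(uc_lines):
    """Return the percentage of H (hit) records among H and N records."""
    hits = 0
    misses = 0
    for line in uc_lines:
        record_type = line.split("\t", 1)[0]
        if record_type == "H":
            hits += 1
        elif record_type == "N":
            misses += 1
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


uc_example = ["H\t0\t250", "H\t1\t250", "N\t*\t*", "H\t0\t248"]
print(percent_queries_matched(uc_example))
```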

References
forum xref

add vsearch all-pairwise-alignments pipeline

Improvement Description
Add vsearch all-pairwise-alignments pipeline

Current Behavior
The vsearch command I ran was:

vsearch --allpairs_global queries.fasta --acceptall --id 0.0 --blast6out out.bl6 --alnout out.aln --qmask none

Proposed Behavior
Compute all pairwise alignments for a set of query sequences

Comments

  1. I would see this generating a qzv showing the pairwise alignments and maybe a heatmap-like summary of the pairwise alignment scores, and a qza containing the bl6 output that could be viewed with qiime metadata tabulate.
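
The heatmap-like summary mentioned above could start from a symmetric identity matrix built from the bl6 rows. This is a sketch under assumptions: only the first three bl6 fields (query, target, percent identity) are used, self-comparisons are set to 100.0, and pairs missing from the input are left as None. The function name is hypothetical.

```python
def identity_matrix(blast6_lines):
    """Build (ids, matrix) of pairwise percent identities from bl6 lines."""
    ids = []
    scores = {}
    for line in blast6_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        query, target, identity = fields[0], fields[1], float(fields[2])
        for seq_id in (query, target):
            if seq_id not in ids:
                ids.append(seq_id)
        # Store both orientations so the matrix is symmetric.
        scores[(query, target)] = identity
        scores[(target, query)] = identity
    matrix = [[100.0 if row == col else scores.get((row, col))
               for col in ids] for row in ids]
    return ids, matrix


lines = ["a\tb\t90.0", "a\tc\t80.0", "b\tc\t85.5"]
ids, matrix = identity_matrix(lines)
print(ids)
print(matrix)
```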

only feature ids present in the table should be clustered

I helped a collaborator run cluster-features after importing sequences and a biom table that were processed with a non-QIIME 2 pipeline, and we had the surprising result that the resulting FeatureData[Sequence] had more records in it than there were features in the FeatureTable. It turned out that the input FeatureData[Sequence] had sequences for features that had been filtered out of the input FeatureTable. We should probably raise an error if the set of feature ids isn't the same between the two inputs.

This is a bit of an edge case, so isn't really high priority (this will probably primarily be applied to FeatureTable and FeatureData[Sequence] artifacts that are generated through a QIIME 2 pipeline).
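
The proposed check might be as small as the following sketch, which assumes the feature IDs of both inputs are available as plain iterables of strings. The function name `validate_feature_ids` is hypothetical.

```python
def validate_feature_ids(table_ids, sequence_ids):
    """Raise ValueError unless both inputs contain the same set of feature IDs."""
    table_ids, sequence_ids = set(table_ids), set(sequence_ids)
    if table_ids != sequence_ids:
        only_table = sorted(table_ids - sequence_ids)
        only_seqs = sorted(sequence_ids - table_ids)
        raise ValueError(
            "Feature IDs differ between table and sequences. "
            f"Only in table: {only_table}. Only in sequences: {only_seqs}."
        )


validate_feature_ids(["f1", "f2"], ["f2", "f1"])  # identical sets: no error
```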

Expose `FeatureMap` data when clustering

Improvement Description
A user on the QIIME 2 forum was requesting detail on what features map to specific subjects from closed reference clustering. The cluster-features-closed-reference action does not presently expose the UC mapping output that would provide this detail. @gregcaporaso suggested use of FeatureMap as a means to represent these data.

Current Behavior
The mapping detail is not provided.

Proposed Behavior
Modify the relevant actions, including cluster-features-closed-reference, to allow saving the FeatureMap.

References

  1. @gregcaporaso suggested migrating FeatureMap to q2-types, see qiime2/q2-types#298
  2. https://forum.qiime2.org/t/introducing-greengenes2-2022-10/25291/35

Please support vsearch 2.13 or newer (Debian Packaging)

Hello,

I am currently packaging q2-vsearch for the Debian-Med packaging team. I have realised that q2-vsearch is very strict about the version of vsearch that it requires (2.7.0). This became apparent when the tests failed with the following recurring messages (one example shown):

self = <q2_vsearch.tests.test_join_pairs.MergePairsTests testMethod=test_join_pairs_alt_qminout>

    def test_join_pairs_alt_qminout(self):
        with redirected_stdio(stderr=os.devnull):
>           cmd, obs = _join_pairs_w_command_output(
                self.input_seqs, qminout=-1)

q2_vsearch/tests/test_join_pairs.py:276: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
q2_vsearch/_join_pairs.py:146: in _join_pairs_w_command_output
    run_command(cmd)
q2_vsearch/_cluster_features.py:33: in run_command
    subprocess.run(cmd, check=True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

input = None, capture_output = False, timeout = None, check = True
popenargs = (['vsearch', '--fastq_mergepairs', '/<<PKGBUILDDIR>>/.pybuild/cpython3_3.8/build/q2_v....dev0/.pybuild/cpython3_3.8/build/q2_vsearch/tests/data/demux-1/BAQ2687.1_3_L001_R2_001.fastq.gz', '--fastqout', ...],)
kwargs = {}, process = <subprocess.Popen object at 0x7ffb3af2a9a0>
stdout = None, stderr = None, retcode = 1

    def run(*popenargs,
            input=None, capture_output=False, timeout=None, check=False, **kwargs):
        """Run command with arguments and return a CompletedProcess instance.
    
        The returned instance will have attributes args, returncode, stdout and
        stderr. By default, stdout and stderr are not captured, and those attributes
        will be None. Pass stdout=PIPE and/or stderr=PIPE in order to capture them.
    
        If check is True and the exit code was non-zero, it raises a
        CalledProcessError. The CalledProcessError object will have the return code
        in the returncode attribute, and output & stderr attributes if those streams
        were captured.
    
        If timeout is given, and the process takes too long, a TimeoutExpired
        exception will be raised.
    
        There is an optional argument "input", allowing you to
        pass bytes or a string to the subprocess's stdin.  If you use this argument
        you may not also use the Popen constructor's "stdin" argument, as
        it will be used internally.
    
        By default, all communication is in bytes, and therefore any "input" should
        be bytes, and the stdout and stderr will be bytes. If in text mode, any
        "input" should be a string, and stdout and stderr will be strings decoded
        according to locale encoding, or by "encoding" if set. Text mode is
        triggered by setting any of text, encoding, errors or universal_newlines.
    
        The other arguments are the same as for the Popen constructor.
        """
        if input is not None:
            if kwargs.get('stdin') is not None:
                raise ValueError('stdin and input arguments may not both be used.')
            kwargs['stdin'] = PIPE
    
        if capture_output:
            if kwargs.get('stdout') is not None or kwargs.get('stderr') is not None:
                raise ValueError('stdout and stderr arguments may not be used '
                                 'with capture_output.')
            kwargs['stdout'] = PIPE
            kwargs['stderr'] = PIPE
    
        with Popen(*popenargs, **kwargs) as process:
            try:
                stdout, stderr = process.communicate(input, timeout=timeout)
            except TimeoutExpired as exc:
                process.kill()
                if _mswindows:
                    # Windows accumulates the output in a single blocking
                    # read() call run on child threads, with the timeout
                    # being done in a join() on those threads.  communicate()
                    # _after_ kill() is required to collect that and add it
                    # to the exception.
                    exc.stdout, exc.stderr = process.communicate()
                else:
                    # POSIX _communicate already populated the output so
                    # far into the TimeoutExpired exception.
                    process.wait()
                raise
            except:  # Including KeyboardInterrupt, communicate handled that.
                process.kill()
                # We don't call process.wait() as .__exit__ does that for us.
                raise
            retcode = process.poll()
            if check and retcode:
>               raise CalledProcessError(retcode, process.args,
                                         output=stdout, stderr=stderr)
E               subprocess.CalledProcessError: Command '['vsearch', '--fastq_mergepairs', '/<<PKGBUILDDIR>>/.pybuild/cpython3_3.8/build/q2_vsearch/tests/data/demux-1/BAQ2687.1_3_L001_R1_001.fastq.gz', '--reverse', '/<<PKGBUILDDIR>>/.pybuild/cpython3_3.8/build/q2_vsearch/tests/data/demux-1/BAQ2687.1_3_L001_R2_001.fastq.gz', '--fastqout', '/tmp/q2-SingleLanePerSampleSingleEndFastqDirFmt-uv9jfa6y/BAQ2687.1_0_L001_R1_001.fastq', '--fastq_ascii', '33', '--fastq_minlen', '1', '--fastq_minovlen', '10', '--fastq_maxdiffs', '10', '--fastq_qmin', '0', '--fastq_qminout', '-1', '--fastq_qmax', '41', '--fastq_qmaxout', '41', '--minseqlength', '1', '--fasta_width', '0', '--threads', '1']' returned non-zero exit status 1.

/usr/lib/python3.8/subprocess.py:512: CalledProcessError
----------------------------- Captured stdout call -----------------------------
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --fastq_mergepairs /<<PKGBUILDDIR>>/.pybuild/cpython3_3.8/build/q2_vsearch/tests/data/demux-1/BAQ2687.1_3_L001_R1_001.fastq.gz --reverse /<<PKGBUILDDIR>>/.pybuild/cpython3_3.8/build/q2_vsearch/tests/data/demux-1/BAQ2687.1_3_L001_R2_001.fastq.gz --fastqout /tmp/q2-SingleLanePerSampleSingleEndFastqDirFmt-uv9jfa6y/BAQ2687.1_0_L001_R1_001.fastq --fastq_ascii 33 --fastq_minlen 1 --fastq_minovlen 10 --fastq_maxdiffs 10 --fastq_qmin 0 --fastq_qminout -1 --fastq_qmax 41 --fastq_qmaxout 41 --minseqlength 1 --fasta_width 0 --threads 1

----------------------------- Captured stderr call -----------------------------
Fatal error: Invalid options to command fastq_mergepairs
Invalid option(s): --fastq_qminout --minseqlength
The valid options for the fastq_mergepairs command are: --bzip2_decompress --eeout --eetabbedout --fasta_width --fastaout --fastaout_notmerged_fwd --fastaout_notmerged_rev --fastq_allowmergestagger --fastq_ascii --fastq_eeout --fastq_maxdiffpct --fastq_maxdiffs --fastq_maxee --fastq_maxlen --fastq_maxmergelen --fastq_maxns --fastq_minlen --fastq_minmergelen --fastq_minovlen --fastq_nostagger --fastq_qmax --fastq_qmaxout --fastq_qmin --fastq_truncqual --fastqout --fastqout_notmerged_fwd --fastqout_notmerged_rev --gzip_decompress --label_suffix --log --no_progress --quiet --relabel --relabel_keep --relabel_md5 --relabel_self --relabel_sha1 --reverse --sizein --sizeout --threads --xee --xsize

Most notably, Invalid option(s): --fastq_qminout --minseqlength. It has become apparent that these options are not available in the latest version of vsearch (they were removed).

It would be much appreciated if this could be fixed.
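
Supporting several vsearch releases would presumably require knowing which version is installed. The sketch below only parses a version out of a banner line in the format vsearch printed in an earlier report on this tracker ("vsearch v2.7.0_linux_x86_64, ..."); how the plugin would then choose options per version is not covered, and the function name is hypothetical.

```python
import re


def parse_vsearch_version(banner):
    """Extract (major, minor, patch) from a 'vsearch vX.Y.Z...' banner line."""
    match = re.search(r"vsearch v(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError(f"No vsearch version found in: {banner!r}")
    return tuple(int(part) for part in match.groups())


version = parse_vsearch_version("vsearch v2.7.0_linux_x86_64, 31.1GB RAM, 8 cores")
print(version, version >= (2, 13, 0))
```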

ENH: `dereplicate-sequences` expose parameter to disable sequence hash IDs

Improvement Description
Similar to q2-dada2 and q2-deblur, there should be an option to use the unhashed sequences as their own IDs instead of using a hash ID in dereplicate-sequences.

Current Behavior
Seq hashes are used by default.

Proposed Behavior
Expose a --p-hashed-feature-ids parameter to choose how sequence IDs get handled.
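
As an illustration of the two behaviours, the sketch below returns either a hash of the sequence or the sequence itself. Using SHA-1 here is an assumption (the 40-character hexadecimal IDs seen elsewhere in this tracker are consistent with it), and the function name is hypothetical.

```python
import hashlib


def feature_id(sequence, hashed=True):
    """Return a hash of the sequence as its ID, or the sequence itself."""
    if hashed:
        # SHA-1 is an assumption made for this sketch.
        return hashlib.sha1(sequence.encode("ascii")).hexdigest()
    return sequence


print(feature_id("ACGT"))
print(feature_id("ACGT", hashed=False))
```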

References

  1. forum xref

BUG: derep-seqs produces mismatched feature IDs

Bug Description
vsearch dereplicate-sequences adds Sample ID information to the Feature IDs of the FeatureData[Sequence] output, but not the Feature IDs of the FeatureTable[Frequency] output. This causes problems downstream when attempting to utilize Actions that require both as inputs --- the Feature IDs no longer match.

Steps to reproduce the behavior

  1. Run dereplicate-sequences on any input
  2. Export the FeatureData[Sequence]
  3. Note the modified Feature IDs: e.g. >9142dd139a96f63aba52d4e88bdcb803981e3467 L2S204_3773

Expected behavior
Feature IDs should be consistent in both outputs. Probably no need to include the Sample ID info that is being tacked on.
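
A sketch of one possible fix: keep only the first whitespace-delimited token of each FASTA header, which drops the appended Sample ID information. The function name is hypothetical, and the example header is the one quoted in the steps above.

```python
def strip_header_description(header):
    """Return the feature ID: the first whitespace-delimited header token."""
    return header.lstrip(">").split(None, 1)[0]


header = ">9142dd139a96f63aba52d4e88bdcb803981e3467 L2S204_3773"
print(strip_header_description(header))
```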

References

  1. Original forum post

Rename `join-pairs` to `merge-pairs`

Suggestion: Rename the q2-vsearch join-pairs action to merge-pairs, and update the associated descriptions to state something like: "Merge paired-end sequence reads...". Merging reads is not the same as joining reads.

Reason: Calling this QIIME 2 action join-pairs may be confusing for savvy vsearch users, as vsearch does in fact have a --fastq_join command. That command stitches each read pair together end-to-end with a padding string, usually Ns. As we're actually calling vsearch --fastq_mergepairs, we should update the name of the action.
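
The difference can be shown with a toy sketch. It is not how vsearch implements either command: the merge here requires an exact overlap and ignores quality scores, and both the padding of eight Ns and the reverse-complementing of the reverse read in the join are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def join_pair(forward, reverse, padding="NNNNNNNN"):
    """Concatenate forward read, padding, and reverse-complemented reverse read."""
    return forward + padding + reverse_complement(reverse)


def merge_pair(forward, reverse, min_overlap=3):
    """Naively merge on the longest exact overlap; return None if there is none."""
    rc = reverse_complement(reverse)
    for overlap in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-overlap:] == rc[:overlap]:
            return forward + rc[overlap:]
    return None


forward, reverse = "AAACCCGGG", "AAACCC"
print(join_pair(forward, reverse))
print(merge_pair(forward, reverse))
```

Joining always yields a sequence longer than both reads combined, whereas merging collapses the overlapping region into a single shorter sequence.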

add taxonomy classification method

It could be useful to add a taxonomy classifier to the vsearch plugin, mirroring the functionality of the qiime1 uclust classifier.

An example command that I had used previously for this, showing (at least the basic) parameters needed:
vsearch --usearch_global <input fasta> --db <ref seqs> --uc <output file> --id <% similarity> --strand both --maxaccepts <int> --maxrejects <int> --uc_allhits --output_no_hits

This outputs .uc files, so a conversion method akin to taxter's uc_consensus_assignments may be needed.
