
q2-feature-classifier's Introduction

q2-feature-classifier

This is a QIIME 2 plugin. For details on QIIME 2, see https://qiime2.org.

q2-feature-classifier's People

Contributors

adamovanja, benkaehler, cherman2, chriskeefe, colinvwood, david-rod, ebolyen, epruesse, gregcaporaso, hagenjp, jairideout, jakereps, lizgehret, nbokulich, oddant1, q2d2, sirdavidludwig, thermokarst, turanoo, vaamb


q2-feature-classifier's Issues

Using the gg-13-8-99-515-806-nb-classifier.qza classifier results in a warning

Using the gg-13-8-99-515-806-nb-classifier.qza classifier results in a warning that the classifier was created with an older version of scikit-learn than what is currently on my system. This is a small issue, though I figured it was worth noting.

taxa_assignments = feature_classifier.methods.classify(reads=qiime_seqs, classifier=taxa_classifier, n_jobs=14)

 UserWarning: Trying to unpickle estimator FeatureHasher from version pre-0.18 when using version 0.18. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)

Warn users about scikit-learn version mismatch at runtime

For trained DBs, it would be really nice to warn users that they are using a version of scikit-learn that is incompatible with the version used to create the DB. We could do this by updating the DirFmt to encode scikit-learn version, then validate version at runtime.
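
A minimal sketch of what that runtime check might look like, assuming the trained artifact records the scikit-learn version string it was built with (all names here are hypothetical):

import warnings
import sklearn

def _validate_sklearn_version(stored_version):
    # stored_version would be read from the updated DirFmt at load time.
    if stored_version != sklearn.__version__:
        warnings.warn(
            'The classifier was trained with scikit-learn %s, but version '
            '%s is installed. Results may be invalid; consider retraining.'
            % (stored_version, sklearn.__version__))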

naive bayes classifiers don't work if reads are reverse complement relative to the reference database

I ran into some very strange results when analyzing 454 data that I imported at the FeatureTable[Frequency] and FeatureData[Sequence] stage. These are skin samples from the Fierer et al "Keyboard Study".

My initial taxonomy barplots looked like this, which are clearly not the right composition:

[screenshot: taxonomy barplots showing a clearly incorrect composition]

After spending some time figuring out what was going on, and noticing that the blast+ assigner in the current development version was giving correct results, I realized that in the FeatureData[Sequence] data that I was importing, the sequences are actually the reverse complement relative to what's in Greengenes. I reverse complemented all of my sequences before importing, and the resulting taxa compositions look correct:

[screenshot: taxonomy barplots showing the corrected composition]

blast+ is handling this because it searches for both forward/forward and forward/reverse matches, and I think RDP classifier handles it by training on the forward and reverse complement kmers. @nbokulich, @BenKaehler, does that sound right to you?

Could we update the naive bayes classifiers, and any others that would have this same issue, to work both for reads that are forward and reverse complement relative to the training data?
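
For reference, the manual workaround described above can be scripted with scikit-bio before importing; a minimal sketch, with illustrative file names:

import skbio

# Reverse complement every sequence in a FASTA file prior to import.
with open('seqs-rc.fasta', 'w') as out:
    for seq in skbio.io.read('seqs.fasta', format='fasta',
                             constructor=skbio.DNA):
        seq.reverse_complement().write(out, format='fasta')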

change project name to q2-feature-classifier?

@BenKaehler, I'd suggest changing the name of this to q2-feature-classifier. What do you think? We're also working on a package that will do supervised classification of samples, so I think it'll be really clear to users if we have q2-feature-classifier and q2-sample-classifier.

revisit interaction between batch size and chunk size for joblib Parallel

So I'm happy with setting batch_size=1 for joblib.Parallel, because we're already chunking the processing by chunk_size in accordance with what the user's system can handle memory-wise, and batch_size=1 just means that each worker gets a chunk at a time, which is what I would expect as a user.
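
For illustration, the dispatch pattern described above looks roughly like this (classify_chunk and chunks are hypothetical stand-ins for the real worker function and the pre-chunked reads):

from joblib import Parallel, delayed

def classify_chunk(chunk):
    # Stand-in for the real per-chunk classification work.
    return [len(read) for read in chunk]

chunks = [['ACGT', 'AACC'], ['GGTT']]  # memory-bounded chunks (illustrative)

# batch_size=1 hands each worker exactly one chunk at a time.
results = Parallel(n_jobs=2, batch_size=1)(
    delayed(classify_chunk)(chunk) for chunk in chunks)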

No way to figure out read orientation used after executing an action

Current Behavior
From the help text in classify-sklearn:

  --p-read-orientation [reverse-complement|same]
                                  [optional]
                                  Direction of reads with respect
                                  to reference sequences. same will cause
                                  reads to be classified unchanged; reverse-
                                  complement will cause reads to be reversed
                                  and complemented prior to classification.
                                  Default is to autodetect based on the
                                  confidence estimates for the first 100
                                  reads.

When this parameter is not specified, there doesn't seem to be a way to figure out what orientation was used. For example, when I look at the provenance information of this artifact, this parameter is listed as "null":

[screenshot: provenance information listing the read orientation parameter as "null"]

unassigned sequences

It would be clearer for users if, when the classifier fails to classify a sequence at the confidence threshold, it produced a null classification rather than omitting the sequence from the results. So that's what we'll do.

Thanks @nbokulich for suggesting this.
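
A sketch of the agreed behavior (the function name and the 'Unassigned' label are assumptions for illustration):

def finalize(assignments, threshold):
    # assignments maps sequence id -> (taxon, confidence); sequences below
    # the confidence threshold get a null classification instead of being
    # dropped from the results.
    return {sid: (taxon if conf >= threshold else 'Unassigned', conf)
            for sid, (taxon, conf) in assignments.items()}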

Should q2-feature-classifier support paired-end reads?

Hi @gregcaporaso,

Currently, q2-feature-classifier supports paired-end reads by providing an alternative workflow. That is, you must extract paired-end reads from the reference data set, train a classifier, then classify paired-end reads using the classifier. You can't classify single-end reads with a paired-end read classifier and vice versa.

@nbokulich tells me that paired-end reads will usually be converted into single-end reads before they get to q2-feature-classifier.

So, should we remove support for paired-end reads from q2-feature-classifier?

Pros: reduce user confusion, reduce source complexity, reduce work.
Cons: if the paired-end reads don't overlap, you have to concatenate or discard reads.

We would expect that the current approach is better than concatenating or discarding reads, but we don't actually know.

Thanks!

Tidy up a little bit of logic

The logic on line 118 of _cutter.py could be tidied up a bit. Some changes that occurred during the last PR allow an if statement to be removed.

ENH: use metadata to input class priors

Current Behavior
It is currently impossible to input class priors on the command line, because command lines cannot exceed a fixed, platform-dependent length limit.

Proposed Behavior
A solution is to use metadata (http://qiime.org/documentation/file_formats.html#metadata-mapping-files) to optionally input class priors. We can't use Artifacts because Artifacts can't be optional.
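
A sketch of how priors might be written to and loaded from a metadata file (the 'prior' column name and the plugin-side plumbing are assumptions):

import pandas as pd
import qiime2

# One row per class; values are the prior probabilities.
priors = pd.Series({'k__Bacteria;p__Firmicutes': 0.6,
                    'k__Bacteria;p__Proteobacteria': 0.4}, name='prior')
priors.index.name = 'id'
priors.to_frame().to_csv('class-priors.tsv', sep='\t')

metadata = qiime2.Metadata.load('class-priors.tsv')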

Reference

  1. A workaround is here: https://github.com/BenKaehler/q2-extra-classifier (see the qiime_kludge command line utility).
  2. http://qiime.org/documentation/file_formats.html#metadata-mapping-files

make repository public?

I think this is ready to be public. You could add a note to the README indicating that it's early alpha/testing status, and that it shouldn't be used for production analysis. Making it public will help with letting some of the other QIIME 2 devs test it out, so you'll get a lot of useful feedback and probably some pull requests that way.

add confidence for all levels

Improvement Description
We may be interested in looking at multiple confidence levels quickly; this would allow filtering to a confidence threshold on the fly rather than having to re-assign taxonomy for each threshold of interest.

Comments
It could be useful to calculate the confidence for each taxonomy level.

confidence calculations could be improved

Current Behavior
Currently we're calculating confidence in the same way that RDP does. For a sequence of length n and k-mers of length k, draw n/k bootstraps with replacement. Repeat 100 times and see how many times we get the same prediction as our original sequence at each level.
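
In outline, that procedure looks something like this (a sketch; classify stands in for the trained classifier's prediction function):

import random

def bootstrap_confidence(kmers, classify, k=8, n_iter=100):
    original = classify(kmers)
    # len(kmers) is ~n for a sequence of length n, so this draws ~n/k
    # k-mers with replacement per bootstrap.
    subsample_size = max(1, len(kmers) // k)
    hits = sum(classify(random.choices(kmers, k=subsample_size)) == original
               for _ in range(n_iter))
    return hits / n_iter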

It works OK for naive Bayes, but for the ensemble methods, on which I've done limited testing, it does not perform very well. Perhaps this is because each bootstrap doesn't resemble the real data much.

Unfortunately it's hard to do much better, not least because the literature doesn't seem to address our situation much: small sample size with strong dependence and categorical samples. Also, bootstraps are slow.

Proposed Behavior
We could perhaps use a jackknife approach, but I can't find anything useful in the literature.

An alternative is probability calibration. If it worked, it would be much faster than bootstraps. It would need to be thoroughly tested and compared with the RDP approach. Changes on the classifier fitting side should be minimal, but the classification side would need a small rewrite.


fit-classifier-naive-bayes raises a ValueError for valid input

The problem:

$ qiime feature-classifier fit-classifier-naive-bayes --p-feat-ext--n-features 1024 --p-feat-ext--ngram-range '[4,16]' --i-reference-reads ref_dbs/99_gg_seq.qza --i-reference-taxonomy ref_dbs/99_gg_tax.qza --o-classifier never_gonna_get_that_far.qza
Plugin error from feature-classifier:

  indices and data should have the same size

Re-run with --verbose to see debug info.

This is a manifestation of a bug in scikit-learn: scikit-learn/scikit-learn#8941.

It could be fixed by kludging the feature classifier to chunk the input to HashingVectorizer, or by fixing scikit-learn; it's not yet clear which approach is best.
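
A sketch of the chunking kludge (chunk size and usage are illustrative):

from itertools import islice
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer

def transform_in_chunks(seqs, vectorizer, chunk_size=10000):
    # Feed the vectorizer manageable batches and stack the sparse results,
    # avoiding the overflow triggered by very large single inputs.
    it = iter(seqs)
    blocks = []
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        blocks.append(vectorizer.transform(chunk))
    return vstack(blocks)

X = transform_in_chunks(['ACGT ACGT', 'TTGA CCAA'], HashingVectorizer())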

Should classify-sklearn error when "dummy" jobs are created?

When running classify-sklearn, it's up to the user to balance the load across jobs so that the parallelism works as well as possible. When running N (N > 1) parallel jobs, there can be situations where the number of reads to process is smaller than the number of reads a single job can handle, so all reads end up processed by one job while the other N-1 jobs are still created. (Note that each job needs space to load the pre-trained classifier, which can be substantial for big databases like Silva.) This gives the untrained eye the impression that parallelism isn't working.

An easy solution would be to check before classification starts and raise an error with a proposed fix. An alternative would be to not create the N-1 jobs that would do nothing (though this would probably be prone to misinterpretation).
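
A sketch of the first option (names and message wording are hypothetical):

def check_parallelism(n_reads, n_jobs, chunk_size):
    # Fail fast when the extra workers could never receive any reads.
    if n_jobs > 1 and n_reads <= chunk_size:
        raise ValueError(
            'All %d reads fit in a single chunk (chunk size %d), so %d of '
            'the %d requested jobs would be idle. Reduce --p-chunk-size or '
            '--p-n-jobs.' % (n_reads, chunk_size, n_jobs - 1, n_jobs))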

Make API consistent for classifiers?

Improvement Description
I wonder if it would be possible to make the API consistent across the different classifiers. The two that I am most familiar with are the naive Bayes classifier and the BLAST classifier. I apologize if there is already a way to do this that I have missed.

Use case

We often perform analyses across many runs so we have created a pipeline for most processes, including assigning taxonomy. Being able to swap out classifiers would be great for testing, and if new classifiers were added they could be added with almost no overhead.

Example of different APIs:

# naive Bayes classifier
bayes_classifier = '/home/john/classifiers/bayes_classifier.qza'
classifier = qiime2.Artifact.load(bayes_classifier)
classifier_function = feature_classifier.methods.classify
taxa_assignments = classifier_function(reads=seqs, classifier=classifier)

# BLAST classifier
rep_tax = qiime2.Artifact.load('/home/john/classifiers/gg_tax_99.qza')
rep_seqs_99_amplicon = qiime2.Artifact.load('/home/john/classifiers/gg_seqs_99_515f_806r.qza')
taxa_assignments = feature_classifier.methods.classify_consensus_blast(query=seqs, 
                                                                       reference_reads=rep_seqs_99_amplicon, 
                                                                       reference_taxonomy=rep_tax)

I really like the API in the first example as it makes it very easy to try out a lot of different classifiers quickly. Additionally we will share classifiers between users. The first API allows a classifier to be shared without having to specify multiple parameters and reference files.

Possible solution:

blast_classifier = feature_classifier.methods.create_blast_classifier(reference_reads=rep_seqs_99_amplicon,
                                                                      reference_taxonomy=rep_tax)

blast_classifier.save('/home/john/classifiers/blast_classifier.qza')

blast_classifier_fp = '/home/john/classifiers/blast_classifier.qza'
blast_classifier = qiime2.Artifact.load(blast_classifier_fp)
classifier_function = feature_classifier.methods.classify
taxa_assignments = classifier_function(reads=seqs, classifier=blast_classifier)

Explanation

I realize that in the bayes classifier example the classifier needs to be "trained", which is not necessary with the BLAST classifier; however, a classifier could still be created and then used in the classifier_function above.

Hopefully this makes sense and I am fully understanding the requirements of the different classification methods. I am curious to know if this would be useful for others.

update unit tests

Unit tests are required to cover the cases where multioutput=True when fitting the classifier, and where n_jobs > 1 or confidence >= 0 when running the classifier.

Missing parameter descriptions in classify-sklearn

There's no documentation for the following parameters:

  --i-reads PATH                  Artifact: FeatureData[Sequence]  [required]
  --i-classifier PATH             Artifact: TaxonomicClassifier  [required]
  --p-chunk-size INTEGER          [default: 262144]
  --p-n-jobs INTEGER              [default: 1]
  --p-pre-dispatch TEXT           [default: 2*n_jobs]

Write unit tests

Comments
Unit tests are required for

  • classifier.pipeline_from_spec
  • classifier.spec_from_pipeline
  • custom.LowMemoryMultinomialNB
  • custom._MultioutputClassifier

extract-reads should be split into two methods

extract-reads currently takes a method parameter with the possible values position and match. When the value is position, the input type needs to be FeatureData[AlignedSequence], and when the value is match, the input type needs to be FeatureData[Sequence]. This would make more sense as two methods, one for each method value/input semantic type, and would save users from incorrectly using this method. (@BenKaehler and I both fell into the trap of using the incorrect input type, so it's very likely that others will as well.)

review docstrings

Bug Description
For example, extract_reads incorrectly mentions readlength and classify does not have a docstring.

extract_reads should fail gracefully for empty return reads

extract_reads throws an unhelpful ValueError if none of the reference sequences match the primers in "match" mode or an unhelpful IndexError if none of the reference sequences match the primers in "position" mode.

One way for this situation to arise is if there is an error in the primers.

We need to check for this contingency and tell the user what went wrong.
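
A sketch of the proposed guard (message wording is illustrative):

def check_extracted(reads):
    # Raise a clear error instead of an opaque ValueError or IndexError.
    if not reads:
        raise RuntimeError(
            'No reads were extracted: none of the reference sequences '
            'matched the supplied primers. Check the primer sequences '
            'and their orientation.')
    return reads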

vsearch classifier method: odd parameter name

The vsearch method has a parameter with an odd name. In the CLI it is --p-id-, and in the Python code I'm guessing it is called id_. @nbokulich does it make sense to rename this parameter to match blast, i.e. --p-perc-identity?

Example:

[screenshot: CLI help text showing the --p-id- parameter]

OverflowError when trying to train a classifier

Ran into this today. @ebolyen suggests that using dill rather than pickle might allow us to get around this.

$ qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads gg-13-8-99-sequences.qza --i-reference-taxonomy gg-13-8-99-taxonomy.qza --o-classifier gg-13-8-99-classifier.qza
Traceback (most recent call last):
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/bin/qiime", line 11, in <module>
    load_entry_point('q2cli==0.0.3', 'console_scripts', 'qiime')()
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/q2cli/commands.py", line 154, in __call__
    results = self.action(**arguments)
  File "<decorator-gen-133>", line 2, in fit_classifier_naive_bayes
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/qiime/core/callable.py", line 238, in callable_wrapper
    output_types, provenance)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/qiime/core/callable.py", line 362, in _callable_executor_
    semantic_type, output_view, view_type, provenance)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/qiime/sdk/result.py", line 213, in _from_view
    result = transformation(view)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/qiime/core/transform.py", line 58, in transformation
    new_view = transformer(view)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/q2_feature_classifier/_taxonomic_classifier.py", line 88, in _2
    bytes_out = pickle.dumps(data['pipeline'])
OverflowError: cannot serialize a bytes object larger than 4 GiB
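
Besides dill, another possible route (an assumption, not a confirmed fix) is pickle protocol 4, which lifted the 4 GiB limit on serialized bytes objects:

import pickle

def dump_pipeline(pipeline):
    # Protocol 4 (PEP 3154) supports byte streams larger than 4 GiB;
    # `pipeline` stands in for data['pipeline'] from the transformer above.
    return pickle.dumps(pipeline, protocol=4)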

add default confidence value > 0.0 for classify-sklearn

The current default (-1.0) doesn't compute confidence scores. This can give very misleading results in some cases. For example, I accidentally used a 515F/806R classifier on data that was sequenced with 27F/338R, with the default of no confidence estimation. I got species-level assignments, but they were nowhere near the right taxa. Since there were no confidence estimates, there was no indication that anything was wrong with the results. When I re-ran the data with higher confidence values (0.8 and 0.5), I only had assignments of bacteria (in a few cases, a phylum-level assignment was also made with confidence=0.5). Indicating that the sequences were unassigned would have been ideal, but in this case I could at least clearly see that there was a problem. I'm labeling this as a bug because there is a very real potential for users to get a silent failure like the one I'm describing here.

@BenKaehler and @nbokulich are going to decide on a default value to start with, based on their ongoing parameter sweep/optimization.

Also, when using a value other than -1.0 for confidence, the confidence scores will be included in the FeatureData[Taxonomy] artifact's data. It would be really nice if qiime taxa tabulate could present these values as well (it could just present all of the information in the data/taxonomy.tsv file). This is secondary to the main point of this issue, though, which is just starting to use a confidence value > 0.0.
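
For illustration, explicitly setting a confidence threshold looks like this (0.7 is an arbitrary value, not the default under discussion):

qiime feature-classifier classify-sklearn \
  --i-reads rep-seqs.qza \
  --i-classifier classifier.qza \
  --p-confidence 0.7 \
  --o-classification taxonomy.qza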

test data files should be removed from the repo

You should just point to or access the files from q2-types, as those will always be guaranteed to be up-to-date with format changes, etc. The ones in this repo are going to be outdated quickly.
