
q2-feature-classifier's Introduction

q2-feature-classifier

This is a QIIME 2 plugin. For details on QIIME 2, see https://qiime2.org.

q2-feature-classifier's People

Contributors

adamovanja, benkaehler, cherman2, chriskeefe, colinvwood, david-rod, ebolyen, epruesse, gregcaporaso, hagenjp, jairideout, jakereps, lizgehret, nbokulich, oddant1, q2d2, sirdavidludwig, thermokarst, turanoo, vaamb


q2-feature-classifier's Issues

Using the gg-13-8-99-515-806-nb-classifier.qza classifier results in a warning

Using the gg-13-8-99-515-806-nb-classifier.qza classifier results in a warning that the classifier was created with an older version of scikit-learn than what is currently on my system. This is a small issue, though I figured it was worth noting.

taxa_assignments = feature_classifier.methods.classify(reads=qiime_seqs, classifier=taxa_classifier, n_jobs=14)

 UserWarning: Trying to unpickle estimator FeatureHasher from version pre-0.18 when using version 0.18. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)

Warn users about scikit-learn version mismatch at runtime

For trained DBs, it would be really nice to warn users that they are using a version of scikit-learn that is incompatible with the version used to create the DB. We could do this by updating the DirFmt to encode scikit-learn version, then validate version at runtime.
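
A minimal sketch of what that runtime check might look like, assuming the trained artifact records the scikit-learn version string it was built with (all names here are hypothetical):

import warnings
import sklearn

def _validate_sklearn_version(stored_version):
    # stored_version would be read from the updated DirFmt at load time.
    if stored_version != sklearn.__version__:
        warnings.warn(
            'The classifier was trained with scikit-learn %s, but version '
            '%s is installed. Results may be invalid; consider retraining.'
            % (stored_version, sklearn.__version__))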

naive bayes classifiers don't work if reads are reverse complement relative to the reference database

I ran into some very strange results when analyzing 454 data that I imported at the FeatureTable[Frequency] and FeatureData[Sequence] stage. These are skin samples from the Fierer et al "Keyboard Study".

My initial taxonomy barplots looked like this, which are clearly not the right composition:

[screenshot: taxonomy barplots showing a clearly incorrect composition]

After spending some time figuring out what was going on, and noticing that the blast+ assigner in the current development version was giving correct results, I realized that in the FeatureData[Sequence] data that I was importing, the sequences are actually the reverse complement relative to what's in Greengenes. I reverse complemented all of my sequences before importing, and the resulting taxa compositions look correct:

[screenshot: taxonomy barplots showing the corrected composition]

blast+ is handling this because it searches for both forward/forward and forward/reverse matches, and I think RDP classifier handles it by training on the forward and reverse complement kmers. @nbokulich, @BenKaehler, does that sound right to you?

Could we update the naive bayes classifiers, and any others that would have this same issue, to work both for reads that are forward and reverse complement relative to the training data?
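
For reference, the manual workaround described above can be scripted with scikit-bio before importing; a minimal sketch, with illustrative file names:

import skbio

# Reverse complement every sequence in a FASTA file prior to import.
with open('seqs-rc.fasta', 'w') as out:
    for seq in skbio.io.read('seqs.fasta', format='fasta',
                             constructor=skbio.DNA):
        seq.reverse_complement().write(out, format='fasta')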

change project name to q2-feature-classifier?

@BenKaehler, I'd suggest changing the name of this to q2-feature-classifier. What do you think? We're also working on a package that will do supervised classification of samples, so I think it'll be really clear to users if we have q2-feature-classifier and q2-sample-classifier.

revisit interaction between batch size and chunk size for joblib Parallel

So I'm happy with setting batch_size=1 for joblib.Parallel, because we're already chunking the processing by chunk_size in accordance with what the user's system can handle memory-wise, and batch_size=1 just means that each worker gets a chunk at a time, which is what I would expect as a user.
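
For illustration, the dispatch pattern described above looks roughly like this (classify_chunk and chunks are hypothetical stand-ins for the real worker function and the pre-chunked reads):

from joblib import Parallel, delayed

def classify_chunk(chunk):
    # Stand-in for the real per-chunk classification work.
    return [len(read) for read in chunk]

chunks = [['ACGT', 'AACC'], ['GGTT']]  # memory-bounded chunks (illustrative)

# batch_size=1 hands each worker exactly one chunk at a time.
results = Parallel(n_jobs=2, batch_size=1)(
    delayed(classify_chunk)(chunk) for chunk in chunks)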

No way to figure out read orientation used after executing an action

Current Behavior
From the help text in classify-sklearn:

  --p-read-orientation [reverse-complement|same]
                                  [optional]
                                  Direction of reads with respect
                                  to reference sequences. same will cause
                                  reads to be classified unchanged; reverse-
                                  complement will cause reads to be reversed
                                  and complemented prior to classification.
                                  Default is to autodetect based on the
                                  confidence estimates for the first 100
                                  reads.

When this parameter is not specified, there doesn't seem to be a way to figure out what orientation was used. For example, when I look at the provenance information of this artifact, this parameter is listed as "null":

[screenshot: provenance information listing the read orientation parameter as "null"]

unassigned sequences

It would be clearer for users if, when the classifier fails to classify a sequence at the confidence threshold, it produced a null classification rather than omitting the sequence from the results. So that's what we'll do.

Thanks @nbokulich for suggesting this.
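
A sketch of the agreed behavior (the function name and the 'Unassigned' label are assumptions for illustration):

def finalize(assignments, threshold):
    # assignments maps sequence id -> (taxon, confidence); sequences below
    # the confidence threshold get a null classification instead of being
    # dropped from the results.
    return {sid: (taxon if conf >= threshold else 'Unassigned', conf)
            for sid, (taxon, conf) in assignments.items()}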

Should q2-feature-classifier support paired-end reads?

Hi @gregcaporaso,

Currently, q2-feature-classifier supports paired-end reads by providing an alternative workflow. That is, you must extract paired-end reads from the reference data set, train a classifier, then classify paired-end reads using the classifier. You can't classify single-end reads with a paired-end read classifier and vice versa.

@nbokulich tells me that paired-end reads will usually be converted into single-end reads before they get to q2-feature-classifier.

So, should we remove support for paired-end reads from q2-feature-classifier?

Pros: reduce user confusion, reduce source complexity, reduce work.
Cons: if the paired-end reads don't overlap, you have to concatenate or discard reads.

We would expect that the current approach is better than concatenating or discarding reads, but we don't actually know.

Thanks!

Tidy up a little bit of logic

The logic on line 118 of _cutter.py could be tidied up a bit. Some changes that occurred during the last PR allow an if statement to be removed.

ENH: use metadata to input class priors

Current Behavior
It is currently impossible to input class priors on the command line, because command lines cannot exceed a fixed, platform-dependent length limit.

Proposed Behavior
A solution is to use metadata (http://qiime.org/documentation/file_formats.html#metadata-mapping-files) to optionally input class priors. We can't use Artifacts because Artifacts can't be optional.
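
A sketch of how priors might be written to and loaded from a metadata file (the 'prior' column name and the plugin-side plumbing are assumptions):

import pandas as pd
import qiime2

# One row per class; values are the prior probabilities.
priors = pd.Series({'k__Bacteria;p__Firmicutes': 0.6,
                    'k__Bacteria;p__Proteobacteria': 0.4}, name='prior')
priors.index.name = 'id'
priors.to_frame().to_csv('class-priors.tsv', sep='\t')

metadata = qiime2.Metadata.load('class-priors.tsv')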

Reference

  1. A workaround is here: https://github.com/BenKaehler/q2-extra-classifier (see the qiime_kludge command line utility).
  2. http://qiime.org/documentation/file_formats.html#metadata-mapping-files

make repository public?

I think this is ready to be public. You could add a note to the README indicating that it's early alpha/testing status, and that it shouldn't be used for production analysis. Making it public will help with letting some of the other QIIME 2 devs test it out, so you'll get a lot of useful feedback and probably some pull requests that way.

add confidence for all levels

Improvement Description
We may be interested in looking at multiple confidence levels quickly; this would allow filtering to a confidence threshold on the fly rather than having to re-assign taxonomy for each threshold of interest.

Comments
It could be useful to calculate the confidence for each taxonomy level.

confidence calculations could be improved

Current Behavior
Currently we're calculating confidence in the same way that RDP does. For a sequence of length n and k-mers of length k, draw n/k bootstraps with replacement. Repeat 100 times and see how many times we get the same prediction as our original sequence at each level.
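
In outline, that procedure looks something like this (a sketch; classify stands in for the trained classifier's prediction function):

import random

def bootstrap_confidence(kmers, classify, k=8, n_iter=100):
    original = classify(kmers)
    # len(kmers) is ~n for a sequence of length n, so this draws ~n/k
    # k-mers with replacement per bootstrap.
    subsample_size = max(1, len(kmers) // k)
    hits = sum(classify(random.choices(kmers, k=subsample_size)) == original
               for _ in range(n_iter))
    return hits / n_iter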

It works OK for naive Bayes, but for the ensemble methods, on which I've done limited testing, it does not perform very well. Perhaps this is because each bootstrap doesn't resemble the real data much.

Unfortunately it's hard to do much better, not least because the literature doesn't seem to address our situation much: small sample size with strong dependence and categorical samples. Also, bootstraps are slow.

Proposed Behavior
We could perhaps use a jackknife approach, but I can't find anything useful in the literature.

An alternative is probability calibration. If it worked, it would be much faster than bootstraps. It would need to be thoroughly tested and compared with the RDP approach. Changes on the classifier fitting side should be minimal, but the classification side would need a small rewrite.


fit-classifier-naive-bayes raises a ValueError for valid input

The problem:

$ qiime feature-classifier fit-classifier-naive-bayes --p-feat-ext--n-features 1024 --p-feat-ext--ngram-range '[4,16]' --i-reference-reads ref_dbs/99_gg_seq.qza --i-reference-taxonomy ref_dbs/99_gg_tax.qza --o-classifier never_gonna_get_that_far.qza
Plugin error from feature-classifier:

  indices and data should have the same size

Re-run with --verbose to see debug info.

This is a manifestation of a bug in scikit-learn: scikit-learn/scikit-learn#8941.

It could be fixed by kludging the feature classifier to chunk the input to HashingVectorizer, or by fixing scikit-learn; it's not yet clear which approach is best.
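
A sketch of the chunking kludge (chunk size and usage are illustrative):

from itertools import islice
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer

def transform_in_chunks(seqs, vectorizer, chunk_size=10000):
    # Feed the vectorizer manageable batches and stack the sparse results,
    # avoiding the overflow triggered by very large single inputs.
    it = iter(seqs)
    blocks = []
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        blocks.append(vectorizer.transform(chunk))
    return vstack(blocks)

X = transform_in_chunks(['ACGT ACGT', 'TTGA CCAA'], HashingVectorizer())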

Should classify-sklearn error when "dummy" jobs are created?

When running classify-sklearn, it's up to the user to balance the load across jobs so that the parallelism works as well as possible. When running N (N > 1) parallel jobs, there can be situations where the number of reads to process is smaller than the number of reads a single job can handle, so all reads end up processed by one job while the other N-1 jobs are still created. (Note that each job needs space to load the pre-trained classifier, which can be substantial for big databases like Silva.) This gives the untrained eye the impression that parallelism isn't working.

An easy solution would be to check before classification starts and raise an error with a proposed fix. An alternative would be to not create the N-1 jobs that would do nothing (though this would probably be prone to misinterpretation).
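
A sketch of the first option (names and message wording are hypothetical):

def check_parallelism(n_reads, n_jobs, chunk_size):
    # Fail fast when the extra workers could never receive any reads.
    if n_jobs > 1 and n_reads <= chunk_size:
        raise ValueError(
            'All %d reads fit in a single chunk (chunk size %d), so %d of '
            'the %d requested jobs would be idle. Reduce --p-chunk-size or '
            '--p-n-jobs.' % (n_reads, chunk_size, n_jobs - 1, n_jobs))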

Make API consistent for classifiers?

Improvement Description
I wonder if it would be possible to make the API consistent across the different classifiers. The two that I am most familiar with are the naive Bayes classifier and the BLAST classifier. I apologize if there is already a way to do this that I have missed.

Use case

We often perform analyses across many runs so we have created a pipeline for most processes, including assigning taxonomy. Being able to swap out classifiers would be great for testing, and if new classifiers were added they could be added with almost no overhead.

Example of different APIs:

# naive Bayes classifier
bayes_classifier = '/home/john/classifiers/bayes_classifier.qza'
classifier = qiime2.Artifact.load(bayes_classifier)
classifier_function = feature_classifier.methods.classify
taxa_assignments = classifier_function(reads=seqs, classifier=classifier)

# BLAST classifier
rep_tax = qiime2.Artifact.load('/home/john/classifiers/gg_tax_99.qza')
rep_seqs_99_amplicon = qiime2.Artifact.load('/home/john/classifiers/gg_seqs_99_515f_806r.qza')
taxa_assignments = feature_classifier.methods.classify_consensus_blast(query=seqs, 
                                                                       reference_reads=rep_seqs_99_amplicon, 
                                                                       reference_taxonomy=rep_tax)

I really like the API in the first example as it makes it very easy to try out a lot of different classifiers quickly. Additionally we will share classifiers between users. The first API allows a classifier to be shared without having to specify multiple parameters and reference files.

Possible solution:

blast_classifier = feature_classifier.methods.create_blast_classifier(reference_reads=rep_seqs_99_amplicon,
                                                                      reference_taxonomy=rep_tax)

blast_classifier.save('/home/john/classifiers/blast_classifier.qza')

blast_classifier_fp = '/home/john/classifiers/blast_classifier.qza'
blast_classifier = qiime2.Artifact.load(blast_classifier_fp)
classifier_function = feature_classifier.methods.classify
taxa_assignments = classifier_function(reads=seqs, classifier=blast_classifier)

Explanation

I realize that in the bayes classifier example the classifier needs to be "trained", which is not necessary with the BLAST classifier; however, a classifier could still be created and then used in the classifier_function above.

Hopefully this makes sense and I am fully understanding the requirements of the different classification methods. I am curious to know if this would be useful for others.

update unit tests

Unit tests are required to cover the cases where multioutput=True when fitting the classifier, and where n_jobs > 1 or confidence >= 0 when running the classifier.

Missing parameter descriptions in classify-sklearn

There's no documentation for the following parameters:

  --i-reads PATH                  Artifact: FeatureData[Sequence]  [required]
  --i-classifier PATH             Artifact: TaxonomicClassifier  [required]
  --p-chunk-size INTEGER          [default: 262144]
  --p-n-jobs INTEGER              [default: 1]
  --p-pre-dispatch TEXT           [default: 2*n_jobs]

Write unit tests

Comments
Unit tests are required for

  • classifier.pipeline_from_spec
  • classifier.spec_from_pipeline
  • custom.LowMemoryMultinomialNB
  • custom._MultioutputClassifier

extract-reads should be split into two methods

extract-reads currently takes a method parameter with the possible values position and match. When the value is position, the input type needs to be FeatureData[AlignedSequence], and when the value is match, the input type needs to be FeatureData[Sequence]. This would make more sense as two methods, one for each method value/input semantic type, and would save users from incorrectly using this method. (@BenKaehler and I both fell into the trap of using the incorrect input type, so it's very likely that others will as well.)

review docstrings

Bug Description
For example, extract_reads incorrectly mentions readlength and classify does not have a docstring.

extract_reads should fail gracefully for empty return reads

extract_reads throws an unhelpful ValueError if none of the reference sequences match the primers in "match" mode or an unhelpful IndexError if none of the reference sequences match the primers in "position" mode.

One way for this situation to arise is if there is an error in the primers.

We need to check for this contingency and tell the user what went wrong.
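
A sketch of the proposed guard (message wording is illustrative):

def check_extracted(reads):
    # Raise a clear error instead of an opaque ValueError or IndexError.
    if not reads:
        raise RuntimeError(
            'No reads were extracted: none of the reference sequences '
            'matched the supplied primers. Check the primer sequences '
            'and their orientation.')
    return reads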

vsearch classifier method: odd parameter name

The vsearch method has a parameter with an odd name. In the CLI it is --p-id-, and in the Python code I'm guessing it is called id_. @nbokulich does it make sense to rename this parameter to match blast, i.e. --p-perc-identity?

Example:

[screenshot: CLI help text showing the --p-id- parameter]

OverflowError when trying to train a classifier

Ran into this today. @ebolyen suggests that using dill rather than pickle might allow us to get around this.

$ qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads gg-13-8-99-sequences.qza --i-reference-taxonomy gg-13-8-99-taxonomy.qza --o-classifier gg-13-8-99-classifier.qza
Traceback (most recent call last):
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/bin/qiime", line 11, in <module>
    load_entry_point('q2cli==0.0.3', 'console_scripts', 'qiime')()
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/q2cli/commands.py", line 154, in __call__
    results = self.action(**arguments)
  File "<decorator-gen-133>", line 2, in fit_classifier_naive_bayes
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/qiime/core/callable.py", line 238, in callable_wrapper
    output_types, provenance)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/qiime/core/callable.py", line 362, in _callable_executor_
    semantic_type, output_view, view_type, provenance)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/qiime/sdk/result.py", line 213, in _from_view
    result = transformation(view)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/qiime/core/transform.py", line 58, in transformation
    new_view = transformer(view)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/q2_feature_classifier/_taxonomic_classifier.py", line 88, in _2
    bytes_out = pickle.dumps(data['pipeline'])
OverflowError: cannot serialize a bytes object larger than 4 GiB
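
Besides dill, another possible route (an assumption, not a confirmed fix) is pickle protocol 4, which lifted the 4 GiB limit on serialized bytes objects:

import pickle

def dump_pipeline(pipeline):
    # Protocol 4 (PEP 3154) supports byte streams larger than 4 GiB;
    # `pipeline` stands in for data['pipeline'] from the transformer above.
    return pickle.dumps(pipeline, protocol=4)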

add default confidence value > 0.0 for classify-sklearn

The current default (-1.0) doesn't compute confidence scores. This can give very misleading results in some cases. For example, I accidentally used a 515F/806R classifier on data that was sequenced with 27F/338R, with the default of no confidence estimation. I got species-level assignments, but they were nowhere near the right taxa. Since there were no confidence estimates, there was no indication that anything was wrong with the results. When I re-ran the data with higher confidence values (0.8 and 0.5), I only had assignments of bacteria (in a few cases, a phylum-level assignment was also made with confidence=0.5). Indicating that the sequences were unassigned would have been ideal, but in this case I could at least clearly see that there was a problem. I'm labeling this as a bug because there is a very real potential for users to get a silent failure like the one I'm describing here.

@BenKaehler and @nbokulich are going to decide on a default value to start with, based on their ongoing parameter sweep/optimization.

Also, when using a value other than -1.0 for confidence, the confidence scores will be included in the FeatureData[Taxonomy] artifact's data. It would be really nice if qiime taxa tabulate could present these values as well (it could just present all of the information in the data/taxonomy.tsv file). This is secondary to the main point of this issue, though, which is just starting to use a confidence value > 0.0.
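
For illustration, explicitly setting a confidence threshold looks like this (0.7 is an arbitrary value, not the default under discussion):

qiime feature-classifier classify-sklearn \
  --i-reads rep-seqs.qza \
  --i-classifier classifier.qza \
  --p-confidence 0.7 \
  --o-classification taxonomy.qza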

test data files should be removed from the repo

You should just point to or access the files from q2-types, as those will always be guaranteed to be up-to-date with format changes, etc. The ones in this repo are going to be outdated quickly.
