This is a QIIME 2 plugin. For details on QIIME 2, see https://qiime2.org.
License: BSD 3-Clause "New" or "Revised" License
Current Behavior
Currently, methods like feature-table filter-samples can filter out all values in a table, resulting in a successfully created (albeit empty) artifact. This then causes problems in other methods that expect the table to contain some data. Also, without a centralized check, the burden of checking these tables for data falls on plugin developers, which creates extra work.
References
This is a more specific case of qiime2/qiime2#294.
See distance_matrix subpackage tests (added in #40) for a guide on how to write and structure these tests.
Being able to use principal coordinates (or other ordination results) as input metadata would be useful. I am working on some methods that could employ this, e.g., to test whether samples change over PC1 before/after treatment.
The transformation of OrdinationFormat --> pd.DataFrame can be achieved with something like this (in a Jupyter notebook, at least; I suppose the first line might be unnecessary in a transformer):
beta_div = beta_div.view(skbio.OrdinationResults)
beta_div = beta_div.samples.loc[:, 0:2]
beta_div.columns = ['unweighted-unifrac-pc1', 'unweighted-unifrac-pc2', 'unweighted-unifrac-pc3']
and then I assume the beta_div DataFrame can be converted to metadata with
qiime2.Metadata(beta_div)
I would find this extremely useful — any interest?
Right now, Phylogeny implies that it can only handle phylogenetic trees. But there are many tree-like structures that could be made, for instance hierarchical clusterings.
Could we create a supertype, for example Hierarchy, that could encompass both phylogenies and clusterings?
See distance_matrix subpackage tests (added in #40) for a guide on how to write and structure these tests.
I've included an example code block showing the error at the bottom of this issue. Pulling the _16 transformer into a local function and adding a skbio.DNA wrapper around the sequence string allowed it to partially work as expected (sans header IDs).
In [11]: def _16(data: pd.Series) -> DNAFASTAFormat:
    ...:     ff = DNAFASTAFormat()
    ...:     with ff.open() as f:
    ...:         for sequence in data:
    ...:             skbio.io.write(skbio.DNA(sequence), format='fasta', into=f)
    ...:     return ff
    ...:
In [12]: f = _16(features.loc[data.columns, 'DenoisedSequenceVariant'])
In [13]: !head {f.path}
>
GCGAGCGT...
>
GCAAGCGT...
>
GCAAGCGT...
>
GCAAGCGT...
>
GCAAGCGT...
The funny(/ironic?) part is that it's the only transformer in the sub-package without any tests.
In [6]: qiime2.Artifact.import_data('FeatureData[Sequence]', features.loc[data.columns, 'DenoisedSequenceVariant'])
---------------------------------------------------------------------------
UnrecognizedFormatError Traceback (most recent call last)
<ipython-input-6-f8fdc74db9db> in <module>()
----> 1 qiime2.Artifact.import_data('FeatureData[Sequence]', features.loc[data.columns, 'DenoisedSequenceVariant'])
~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/qiime2/sdk/result.py in import_data(cls, type, view, view_type)
190
191 provenance_capture = archive.ImportProvenanceCapture(format_, md5sums)
--> 192 return cls._from_view(type_, view, view_type, provenance_capture)
193
194 @classmethod
~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/qiime2/sdk/result.py in _from_view(cls, type, view, view_type, provenance_capture)
215 transformation = from_type.make_transformation(to_type,
216 recorder=recorder)
--> 217 result = transformation(view)
218
219 artifact = cls.__new__(cls)
~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/qiime2/core/transform.py in transformation(view)
57 self.validate(view)
58
---> 59 new_view = transformer(view)
60
61 new_view = other.coerce_view(new_view)
~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/qiime2/core/transform.py in wrapped(view)
205 def wrapped(view):
206 new_view = self._view_type()
--> 207 file_view = transformer(view)
208 if transformer is not identity_transformer:
209 self.set_user_owned(file_view, False)
~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/q2_types/feature_data/_transformer.py in _16(data)
339 with ff.open() as f:
340 for sequence in data:
--> 341 skbio.io.write(sequence, format='fasta', into=f)
342 return ff
343
~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/skbio/io/registry.py in write(obj, format, into, **kwargs)
1164 @wraps(IORegistry.write)
1165 def write(obj, format, into, **kwargs):
-> 1166 return io_registry.write(obj, format, into, **kwargs)
1167
1168
~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/skbio/io/registry.py in write(self, obj, format, into, **kwargs)
615 raise UnrecognizedFormatError(
616 "Cannot write %r into %r, no %s writer found." %
--> 617 (format, into, obj.__class__.__name__))
618
619 writer(obj, into, **kwargs)
UnrecognizedFormatError: Cannot write 'fasta' into <_io.TextIOWrapper name='/tmp/q2-DNAFASTAFormat-ohosxny8' mode='r+' encoding='utf8'>, no str writer found.
In [7]:
I can work on this if necessary!
TestPluginBase has been moved to qiime.plugin.testing in qiime2/qiime2/pull/152.
Bug Description
When feature_classifier.extract_reads encounters a sequence with a lowercase letter in it, it throws the error below.
Screenshots
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-674d2044b0eb> in <module>()
7 seqs_in = qiime.Artifact.import_data("FeatureData[Sequence]", seqs)
8 reads = feature_classifier.methods.extract_reads(seqs_in, read_length,
----> 9 fwd_primer, rev_primer)
10 reads.save(reads_out)
<decorator-gen-204> in extract_reads(sequences, read_length, f_primer, r_primer, method, direction, n_sample)
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/qiime-2.0.6-py3.5.egg/qiime/core/callable.py in callable_wrapper(*args, **kwargs)
225
226 outputs = self._callable_executor_(self._callable, view_args,
--> 227 output_types, provenance)
228 # `outputs` matches a Python function's return: either a single
229 # value is returned, or it is a tuple of return values. Treat both
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/qiime-2.0.6-py3.5.egg/qiime/core/callable.py in _callable_executor_(self, callable, view_args, output_types, provenance)
350 (view_type.__name__, type(output_view).__name__))
351 artifact = qiime.sdk.Artifact._from_view(
--> 352 semantic_type, output_view, view_type, provenance.fork())
353 output_artifacts.append(artifact)
354
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/qiime-2.0.6-py3.5.egg/qiime/sdk/result.py in _from_view(cls, type, view, view_type, provenance_capture)
214 transformation = from_type.make_transformation(to_type,
215 recorder=recorder)
--> 216 result = transformation(view)
217
218 artifact = cls.__new__(cls)
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/qiime-2.0.6-py3.5.egg/qiime/core/transform.py in transformation(view)
57 self.validate(view)
58
---> 59 new_view = transformer(view)
60
61 new_view = other.coerce_view(new_view)
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/qiime-2.0.6-py3.5.egg/qiime/core/transform.py in wrapped(view)
188 def wrapped(view):
189 new_view = self._view_type()
--> 190 file_view = transformer(view)
191 if transformer is not identity_transformer:
192 self.set_user_owned(file_view, False)
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/q2_types-0.0.6-py3.5.egg/q2_types/feature_data/_transformer.py in _10(data)
89 def _10(data: DNAIterator) -> DNAFASTAFormat:
90 ff = DNAFASTAFormat()
---> 91 skbio.io.write(data.generator, format='fasta', into=str(ff))
92 return ff
93
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in write(obj, format, into, **kwargs)
1164 @wraps(IORegistry.write)
1165 def write(obj, format, into, **kwargs):
-> 1166 return io_registry.write(obj, format, into, **kwargs)
1167
1168
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in write(self, obj, format, into, **kwargs)
617 (format, into, obj.__class__.__name__))
618
--> 619 writer(obj, into, **kwargs)
620 return into
621
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in wrapped_writer(obj, file, encoding, newline, **kwargs)
1080 with open_files(files, mode='w', **io_kwargs) as fhs:
1081 kwargs.update(zip(file_keys, fhs[:-1]))
-> 1082 writer_function(obj, fhs[-1], **kwargs)
1083
1084 self._add_writer(cls, wrapped_writer, monkey_patch, override)
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/format/fasta.py in _generator_to_fasta(obj, fh, qual, id_whitespace_replacement, description_newline_replacement, max_width, lowercase)
772 obj, id_whitespace_replacement, description_newline_replacement,
773 qual is not None, lowercase)
--> 774 for header, seq_str, qual_scores in formatted_records:
775 if max_width is not None:
776 seq_str = chunk_str(seq_str, max_width, '\n')
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/format/_base.py in _format_fasta_like_records(generator, id_whitespace_replacement, description_newline_replacement, require_qual, lowercase)
144 "sequence IDs, nor to replace newlines in sequence descriptions.")
145
--> 146 for idx, seq in enumerate(generator):
147
148 if len(seq) < 1:
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/q2_feature_classifier-0.0.6-py3.5.egg/q2_feature_classifier/_cutter.py in read_seqs()
129
130 def read_seqs():
--> 131 for single_sequence_tuple in result:
132 yield single_sequence_tuple[0]
133 return DNAIterator(read_seqs())
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/q2_feature_classifier-0.0.6-py3.5.egg/q2_feature_classifier/_gregex.py in extract_reads_by_position(aln, readlength, f_primer, r_primer, endedness, sample)
56 query_cache = []
57 i = 0
---> 58 for query in aln:
59 query_cache.append(query)
60 gaps = query.gaps()
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in <genexpr>(.0)
504 # GeneratorType
505 try:
--> 506 return (x for x in itertools.chain([next(gen)], gen))
507 except StopIteration:
508 # If the error was a StopIteration, then we want to return an
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in _read_gen(self, file, fmt, into, verify, kwargs)
529 reader, kwargs = self._init_reader(file, fmt, into, verify, kwargs,
530 io_kwargs)
--> 531 yield from reader(file, **kwargs)
532
533 def _find_io_kwargs(self, kwargs):
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in wrapped_reader(file, encoding, newline, **kwargs)
1006 with open_files(files, mode='r', **io_kwargs) as fhs:
1007 kwargs.update(zip(file_keys, fhs[:-1]))
-> 1008 yield from reader_function(fhs[-1], **kwargs)
1009
1010 self._add_reader(cls, wrapped_reader, monkey_patch, override)
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/format/fasta.py in _fasta_to_generator(fh, qual, constructor, **kwargs)
675 FASTAFormatError):
676 yield constructor(seq, metadata={'id': id_, 'description': desc},
--> 677 **kwargs)
678 else:
679 fasta_gen = _parse_fasta_raw(fh, _parse_sequence_data,
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/sequence/_grammared_sequence.py in __init__(self, sequence, metadata, positional_metadata, lowercase, validate)
334
335 if validate:
--> 336 self._validate()
337
338 def _validate(self):
/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/sequence/_grammared_sequence.py in _validate(self)
358 [str(b.tostring().decode("ascii")) for b in bad] if
359 len(bad) > 1 else bad[0],
--> 360 list(self.alphabet)))
361
362 @stable(as_of='0.4.0')
ValueError: Invalid character in sequence: b't'.
Valid characters: ['G', 'C', '.', 'Y', 'W', 'B', 'R', 'V', 'N', 'K', 'D', 'S', '-', 'A', 'H', 'T', 'M']
Note: Use `lowercase` if your sequence contains lowercase characters not in the sequence's alphabet.
Comments
According to @BenKaehler: it looks like it's coming from skbio when we attempt to write out lowercase sequences, which is called indirectly from q2_types. Hence I am posting this issue here.
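Until the transformer handles this, one user-side workaround might be to uppercase the sequence lines before import. The following is only a sketch under that assumption; `uppercase_fasta` is a hypothetical helper, not part of q2-types or skbio, and it assumes that losing the case information is acceptable for the analysis.

```python
def uppercase_fasta(lines):
    """Yield FASTA lines with sequence data uppercased.

    Hypothetical helper: header lines (starting with '>') pass through
    unchanged, every other line is uppercased.
    """
    for line in lines:
        yield line if line.startswith('>') else line.upper()


records = ['>seq1 some description\n', 'acgTn\n', '>seq2\n', 'GGcc\n']
cleaned = list(uppercase_fasta(records))
```

The skbio error message above also mentions a `lowercase` option, which may be the better fix inside the transformer itself.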
See distance_matrix subpackage tests (added in #40) for a guide on how to write and structure these tests.
Improvement Description
There are two styles of TSV that would be useful, and two orientations.
Styles:
Orientations:
Transformers should convert the BIOMV210Format into the 4 combinations above. I don't have good names for these, but some examples might be:
MatrixTSVBySampleFormat
MatrixTSVByFeatureFormat
RecordTSVBySampleFormat
RecordTSVByFeatureFormat
In the future we could do smarter things with TSVs and schemas, but for now, the above would help a lot of people with a pretty mundane conversion.
References
This came up on the forum (specific details in that post) and corresponds to the QIIME 1 script extract_reads_from_interleaved_fastq.py.
Came up on the forum a few times (e.g. here, here, and here). Users need to be able to import multiplexed sequence data that contains barcodes in the sequences (we currently support data that has the barcodes extracted in a separate file, i.e. the "EMP protocol multiplexed data"). For now, a workaround is to use QIIME 1's extract_barcodes.py to extract the barcodes into their own file.
See distance_matrix subpackage tests (added in #40) for a guide on how to write and structure these tests.
When reading TaxonomyFormat with any of the transformers, the first line is assumed to be a (non-comment) header, followed by the taxonomy mapping lines. The sniffer is very lenient and only cares that the file is two-column TSV.
Not all taxonomy files include a header (for example, Greengenes). When a transformer is invoked to read the file, the first line is interpreted as a header, causing the feature ID to be set as Index.name and the taxonomy string to be set as Series.name.
For example, suppose we have the following taxonomy.tsv file (I used the first few lines from the Greengenes taxonomy map):
228054 k__Bacteria; p__Cyanobacteria; c__Synechococcophycideae; o__Synechococcales; f__Synechococcaceae; g__Synechococcus; s__
228057 k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Pelagibacteraceae; g__; s__
73627 k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Mycobacteriaceae; g__Mycobacterium; s__
378462 k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Staphylococcaceae; g__Staphylococcus; s__
89370 k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Bacillaceae; g__Anoxybacillus; s__kestanbolensis
Reading this file into a pd.Series (all transformers are affected; it's not limited to the pd.Series transformer):
In [1]: from qiime2.plugin.util import transform
In [2]: from q2_types.feature_data import TaxonomyFormat
In [3]: import pandas as pd
In [4]: taxonomy_series = transform('taxonomy.tsv', from_type=TaxonomyFormat, to_type=pd.Series)
In [5]: taxonomy_series
Out[5]:
228054
228057 k__Bacteria; p__Proteobacteria; c__Alphaproteo...
73627 k__Bacteria; p__Actinobacteria; c__Actinobacte...
378462 k__Bacteria; p__Firmicutes; c__Bacilli; o__Bac...
89370 k__Bacteria; p__Firmicutes; c__Bacilli; o__Bac...
Name: k__Bacteria; p__Cyanobacteria; c__Synechococcophycideae; o__Synechococcales; f__Synechococcaceae; g__Synechococcus; s__, dtype: object
In [6]: # :(
In [7]:
The Series has its Index.name set to "228054" and its Series.name to "k__Bacteria; p__Cyanobacteria; c__Synechococcophycideae; o__Synechococcales; f__Synechococcaceae; g__Synechococcus; s__".
During classification, if a query sequence is assigned to the first reference sequence taxonomy (i.e. the one that's misinterpreted as a header), I think the code will error with an IndexError coming from pandas. This happened with @nbokulich's vsearch classifier (not yet in master), and adding a header line to the Greengenes file appears to fix the issue (he's not receiving the error anymore at least).
I don't think we've seen this error with the existing classifiers because we've never had a query sequence assigned to the first reference sequence (e.g. just by chance). I suspect (and hope) that the code would fail in a similar way with an IndexError, but haven't confirmed.
I propose that we require TaxonomyFormat (both when sniffing and reading) to have the following header (we've been using this header in the unit tests, maybe elsewhere):
Feature ID<tab>Taxon
... optionally followed by other columns that are ignored. We could finesse the column names a little: something like feature_id and taxon would be easier to access from pandas objects. This is a minor detail we can work out later.
If we go with a stricter format, such as what I'm proposing, then importing files without the appropriate header (e.g. Greengenes and other reference databases) will raise an error and the header will have to be added to the file in order to import. This is annoying, but I really think we should stop supporting tabular files without headers, especially because we want the .qza file formats to be as self-documenting as possible.
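The stricter header check proposed above could look roughly like the following. This is a sketch only: `validate_taxonomy_header` and `EXPECTED_HEADER` are hypothetical names, and the exact column names are still open per the discussion above.

```python
# Header proposed in this issue; extra columns after these two are ignored.
EXPECTED_HEADER = ('Feature ID', 'Taxon')


def validate_taxonomy_header(first_line):
    """Raise ValueError unless the line starts with the expected columns.

    Hypothetical helper, not the q2-types implementation.
    """
    fields = tuple(first_line.rstrip('\n').split('\t'))
    if fields[:2] != EXPECTED_HEADER:
        raise ValueError(
            'Expected header starting with %r, found %r'
            % ('\t'.join(EXPECTED_HEADER), first_line.rstrip('\n')))


# A headerless Greengenes-style first line is rejected.
try:
    validate_taxonomy_header('228054\tk__Bacteria; p__Cyanobacteria\n')
    rejected = False
except ValueError:
    rejected = True
```

Running the same check in both the sniffer and the readers would make the failure happen at import time rather than during classification.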
Thoughts? cc: @gregcaporaso, @nbokulich, @BenKaehler, @ebolyen, @thermokarst, @jakereps
Thanks @nbokulich for finding and reporting this bug!
This came up on the forum here. If indices are not strings, users will get a traceback with a cryptic error message. We should improve this error message.
The SingleLanePerSampleSingleEndFastqDirFmt and SingleLanePerSamplePairedEndFastqDirFmt -> PerSampleDNAIterators transformers don't take the MANIFEST comments into account and crash when attempting to view the artifact as an iterator.
In [1]: import qiime2
In [2]: from q2_types.per_sample_sequences import PerSampleDNAIterators
In [3]: a = qiime2.Artifact.load('20170626_1/demux.qza')
In [4]: a.view(PerSampleDNAIterators)
...
~/Developer/mc3/envs/biota/lib/python3.5/site-packages/q2_types/per_sample_sequences/_transformer.py in _1(dirfmt)
44 next(fh)
45 for line in fh:
---> 46 sample_id, filename, _ = line.split(',')
47 filepath = str(dirfmt.path / filename)
48 result[sample_id] = skbio.io.read(filepath, format='fastq',
ValueError: not enough values to unpack (expected 3, got 1)
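One way the transformer could tolerate comment lines is sketched below. It assumes that comments start with `#` and that records have three comma-separated fields, as the traceback suggests; `parse_manifest_lines` is a hypothetical helper, not q2-types code.

```python
def parse_manifest_lines(lines):
    """Yield (sample_id, filename, direction) tuples from MANIFEST lines.

    Hypothetical helper: blank lines and '#' comments are skipped first,
    then the first remaining line is treated as the header.
    """
    records = (ln.strip() for ln in lines)
    records = (ln for ln in records if ln and not ln.startswith('#'))
    next(records, None)  # drop the header row
    for line in records:
        fields = line.split(',')
        if len(fields) != 3:
            raise ValueError('Expected 3 comma-separated fields, got %d: %r'
                             % (len(fields), line))
        yield tuple(fields)


manifest = [
    '# direction is not meaningful in this file\n',
    'sample-id,filename,direction\n',
    's1,s1_0_L001_R1_001.fastq.gz,forward\n',
]
parsed = list(parse_manifest_lines(manifest))
```

Filtering comments before skipping the header matters here: the existing `next(fh)` call would otherwise consume a comment line and leave the real header to be parsed as data.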
In preparation for supporting BIOM V2 files.
The current tests/data/phylogeny-rooted.qza and tests/data/phylogeny-unrooted.qza files are really big and don't work with the wiki tutorial. We should replace them with these files:
https://dl.dropboxusercontent.com/u/2868868/phylogeny-rooted.qza
https://dl.dropboxusercontent.com/u/2868868/phylogeny-unrooted.qza
It would be nice to re-write the history to remove the current files that are in there, since they are much larger than anything else.
Thanks @jairideout for catching this issue!
This will make the resulting artifacts system agnostic. Related to qiime2/qiime2#95
Improvement Description
It'd be useful to support FeatureData[Sequences], i.e. analogous to QIIME 1's "OTU Map". This type/format describes the sequences in each feature (e.g. sequences that clustered into an OTU).
Comments
We had planned to add this type but deferred until we could come up with a reasonable file format (the QIIME 1 OTU Map format is un-parsable in Python when the lines are too long).
References
This type was requested on the QIIME 2 forum here.
Mostly for example purposes in this repository, but it'll be useful since this will have a release today.
Came up on the forum: duplicate sample IDs should be disallowed with types SampleData[SequencesWithQuality] and SampleData[PairedEndSequencesWithQuality] (the fix would be implemented on those types' transformers).
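A check along these lines could run inside those transformers. This is a minimal sketch; the function names are hypothetical and how the sample IDs are collected from the directory format is left out.

```python
from collections import Counter


def find_duplicate_sample_ids(sample_ids):
    """Return a sorted list of sample IDs that occur more than once."""
    counts = Counter(sample_ids)
    return sorted(sid for sid, n in counts.items() if n > 1)


def validate_unique_sample_ids(sample_ids):
    """Raise ValueError naming every duplicated sample ID."""
    duplicates = find_duplicate_sample_ids(sample_ids)
    if duplicates:
        raise ValueError('Duplicate sample IDs are not allowed: %s'
                         % ', '.join(duplicates))


duplicates = find_duplicate_sample_ids(['s1', 's2', 's1', 's3', 's2'])
```

Reporting all duplicated IDs at once, rather than failing on the first, would save users repeated import attempts.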
Iterators must return themselves when __iter__ is called. DNAIterator is non-conforming. An Iterator is probably overkill as well; an Iterable would be perfectly fine.
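The difference between the two protocols can be sketched with the standard library ABCs. The class names below are hypothetical stand-ins, not the actual DNAIterator implementation.

```python
import collections.abc


class SequenceIterable(collections.abc.Iterable):
    """Hypothetical Iterable wrapper: __iter__ may return any iterator."""

    def __init__(self, generator):
        self.generator = generator

    def __iter__(self):
        # A generator function, so each call returns a fresh generator
        # object rather than `self`.
        yield from self.generator


class SequenceIterator(collections.abc.Iterator):
    """Hypothetical conforming Iterator."""

    def __init__(self, generator):
        self.generator = generator

    def __next__(self):
        return next(self.generator)

    # collections.abc.Iterator supplies __iter__, which returns self.


it = SequenceIterator(iter(['ACGT', 'TTGA']))
returns_self = iter(it) is it
items = list(it)
```

If consumers only ever loop over the object, the Iterable version is enough and avoids having to implement `__next__` at all.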
A user on the forum was able to import fastq files using one of the manifest formats. Downstream in the analysis it appears that the sequences don't have quality scores associated with them, e.g.:
Traceback (most recent call last):
File "/home/qiime2/miniconda/envs/qiime2-2017.8/lib/python3.5/site-packages/q2cli/commands.py", line 222, in __call__
results = action(**arguments)
File "<decorator-gen-207>", line 2, in summarize
File "/home/qiime2/miniconda/envs/qiime2-2017.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 201, in callable_wrapper
output_types, provenance)
File "/home/qiime2/miniconda/envs/qiime2-2017.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 392, in _callable_executor_
ret_val = callable(output_dir=temp_dir, **view_args)
File "/home/qiime2/miniconda/envs/qiime2-2017.8/lib/python3.5/site-packages/q2_demux/_summarize/_visualizer.py", line 114, in summarize
for seq in _read_fastq_seqs(file):
File "/home/qiime2/miniconda/envs/qiime2-2017.8/lib/python3.5/site-packages/q2_demux/_demux.py", line 36, in _read_fastq_seqs
qual.strip())
AttributeError: 'NoneType' object has no attribute 'strip'
@thermokarst I think the issue I was having with the HeaderlessTSVTaxonomyFormat was possibly related to the wrong base class being used? I'm not sure but it is obviously not like the rest.
https://github.com/qiime2/q2-types/blob/master/q2_types/feature_data/_format.py#L69
I noticed this because I was getting this error and I didn't know why it was trying to do a transformation to a HeaderlessTSVTaxonomyFormat when I am positive I already gave it a HeaderlessTSVTaxonomyFormat.
Well, it's also not defined yet, so I'll just make my own for now. Thanks for all your help!
Comments
This is because imports of data that share the same artifact_format as the semantic type are transformerless. While this is generally a convenience for developers, it is a pain when we need to apply additional transforms (see the feature table munging in this plugin, which strips out biom metadata, among other things). Not sure if this needs to be solved more generally in the framework, or if we can just define a BIOMV210Format -> BIOMV210Format transformer here.
References
This issue came up on the forum.
Bug Description
The current directory formats assume the phred offset is 33, but this is not always the case. Labeling as a bug since it is possible to create a phred offset 33 directory format with phred offset 64 data, and no error is displayed until quality scores are requested in a transformer (e.g. loading into skbio Sequence objects).
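A validation step could at least reject quality strings that cannot be offset 64. The heuristic below is a sketch, not an established q2-types check: it assumes non-negative Phred scores, and it assumes that Phred+33 data does not contain scores above 41, which is not true for every platform. `infer_phred_offset` is a hypothetical name.

```python
def infer_phred_offset(quality_lines):
    """Guess the Phred offset from raw FASTQ quality strings.

    Returns 33, 64, or None when the characters fit both encodings.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError('No quality characters provided')
    lowest, highest = min(codes), max(codes)
    if lowest < 33:
        raise ValueError('Character below "!" is not a valid quality score')
    if lowest < 64:
        # Below '@' the score would be negative under offset 64.
        return 33
    if highest > 33 + 41:
        # ASSUMPTION: offset-33 scores above 41 do not occur.
        return 64
    return None


offset_33 = infer_phred_offset(['IIII#'])
offset_64 = infer_phred_offset(['hhhh'])
ambiguous = infer_phred_offset(['@@AA'])
```

Because the result can be ambiguous, a check like this could only flag clear mismatches; it would not replace letting the user state the offset explicitly.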
References
Original issue reported here.
Attempting to import a directory of paired-end, per-sample fastq files using the CasavaOneEightSingleLanePerSampleDirFmt format should raise an error if there are any unpaired fwd/rev reads files. It is currently possible to create a .qza with unpaired files, which can cause issues in downstream methods/visualizers (e.g. see this forum post about qiime demux summarize).
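The pairing check could be sketched as follows. It assumes Casava-style filenames in which the forward and reverse files differ only in the `_R1_`/`_R2_` component; `find_unpaired` and `READ_PATTERN` are hypothetical names, not part of q2-types.

```python
import re
from collections import defaultdict

# ASSUMPTION: everything before _R1_/_R2_ identifies the sample and lane.
READ_PATTERN = re.compile(
    r'(?P<prefix>.+_L[0-9]{3})_R(?P<read>[12])_001\.fastq\.gz')


def find_unpaired(filenames):
    """Return sorted sample/lane prefixes lacking a forward or reverse file."""
    reads_seen = defaultdict(set)
    for name in filenames:
        match = READ_PATTERN.fullmatch(name)
        if match is None:
            raise ValueError('Not a Casava-style filename: %r' % name)
        reads_seen[match.group('prefix')].add(match.group('read'))
    return sorted(prefix for prefix, reads in reads_seen.items()
                  if reads != {'1', '2'})


unpaired = find_unpaired([
    'a_S1_L001_R1_001.fastq.gz',
    'a_S1_L001_R2_001.fastq.gz',
    'b_S2_L001_R1_001.fastq.gz',
])
```

A non-empty result at import time would be the point to raise an error, before a .qza is ever written.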
We need transformers from pd.DataFrame to biom.Table and BIOMV210Format.
Drop the duplicate registration of FeatureTable / BIOMV100DirFmt once qiime2/qiime2#134 is taken care of.
This is also related to #43 - when we restructure the existing tests any failures related to the lack of registration should be rendered irrelevant.
This format doesn't have a lane identifier so we would need another format to support this.
It would be much easier to use this format than to create a fastq-manifest with potentially hundreds of lines.
This will simplify importing data from QIIME 1.
Related to qiime2/qiime2#271
Related to qiime2/qiime2#269
See distance_matrix subpackage tests (added in #40) for a guide on how to write and structure these tests.
Comments
This will simplify importing data from QIIME 1.
It seems to permit both .fastq and fastq.gz files as "input" for the manifest format.
It doesn't look like FastqGzFormat, used in SingleLanePerSampleSingleEndFastqDirFmt (or its paired variant), verifies this fact either. It should reject files that aren't gzipped in its sniff method.
It would be nice to be able to gzip in the transformers from the .*FastqManifest.* formats if possible.
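Both ideas, rejecting non-gzipped files and compressing on the way in, can be sketched with the standard library. `looks_gzipped` and `gzip_copy` are hypothetical helpers; how they would hook into the format's sniff method or the manifest transformers is not shown.

```python
import gzip
import os
import shutil
import tempfile

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, 'rb') as fh:
        return fh.read(2) == GZIP_MAGIC


def gzip_copy(src, dest):
    """Write a gzip-compressed copy of src to dest."""
    with open(src, 'rb') as fin, gzip.open(dest, 'wb') as fout:
        shutil.copyfileobj(fin, fout)


with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, 'reads.fastq')
    packed = os.path.join(tmp, 'reads.fastq.gz')
    with open(plain, 'w') as fh:
        fh.write('@r1\nACGT\n+\nIIII\n')
    gzip_copy(plain, packed)
    plain_is_gz = looks_gzipped(plain)
    packed_is_gz = looks_gzipped(packed)
```

Checking the magic bytes is cheap and does not depend on the file extension, so a `.fastq.gz` file that is really plain text would still be caught.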
References
This recently came up on the forum.
Currently the two formats that we support, CasavaOneEightSingleLanePerSampleDirFmt and SingleLanePerSampleSingleEndFastqDirFmt, require files to be named in the Casava convention (i.e., matching the regular expression r'.+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz'). We should add another format that uses a MANIFEST file to relax this restriction on the filenames so that, for example, the filenames could just be sample-id.fastq.gz.
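To illustrate the restriction, the regular expression quoted above can be applied directly. Whether q2-types anchors the pattern with `fullmatch` or `match` is an assumption here, and `is_casava_filename` is a hypothetical helper.

```python
import re

# Pattern quoted in the issue text above.
CASAVA_PATTERN = re.compile(r'.+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz')


def is_casava_filename(name):
    """Return True if the whole filename follows the Casava convention."""
    return CASAVA_PATTERN.fullmatch(name) is not None


casava_ok = is_casava_filename('sample1_S1_L001_R1_001.fastq.gz')
simple_ok = is_casava_filename('sample-id.fastq.gz')
```

The second filename is exactly the kind a MANIFEST-based format would allow, since the sample ID would come from the manifest instead of the name.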
See distance_matrix subpackage tests (added in #40) for a guide on how to write and structure these tests.
I do not believe that we are successfully validating CSV data from MANIFEST files.
We have some code which validates the CSV from a MANIFEST. Suppose I put that code into a function def _validate_manifest_csv(manifest).
def _validate_manifest_csv(manifest):
    try:
        # Parse the open manifest file handle passed in by the caller.
        manifest = pd.read_csv(manifest, comment='#', header=0,
                               skip_blank_lines=True, dtype=object)
    except pd.io.common.CParserError as e:
        raise ValueError('All records in manifest must contain '
                         'exactly three comma-separated fields, but it '
                         'appears that at least one record contains more. '
                         'Original error message:\n %s' % str(e))
Then if I put the following test into test_transformer.py:
def test_validate_manifest_csv(self):
    manifest = io.StringIO(
        'sample-id,filename,direction\n'
        'banana,/hello/world,forward,hotdog\n'  # <-- important, notice the hotdog
        'banana,/hello/world,forward\n'
        'banana,/hello/world,reverse\n'
        'banana,/hello/world,reverse\n')
    with self.assertRaisesRegex(ValueError, 'at least one record contains more.'):
        _validate_manifest_csv(manifest)
... the test fails. But it seems like it should succeed (i.e., the error should occur); otherwise, in what scenario are we expecting that error to happen?
In fact, I don't think pandas has any issue with jagged data, such as in the above hotdog example:
>>> import io
>>> manifest = io.StringIO(
... 'sample-id,filename,direction\n'
... 'banana,/hello/world,forward,hotdog\n'
... 'banana,/hello/world,forward\n'
... 'banana,/hello/world,reverse\n'
... 'banana,/hello/world,reverse\n')
>>> manifest = pd.read_csv(manifest, comment='#', header=0, skip_blank_lines=True, dtype=object)
>>> print(manifest)
sample-id filename direction
banana /hello/world forward hotdog
banana /hello/world forward NaN
banana /hello/world reverse NaN
banana /hello/world reverse NaN
Am I missing something, or is the current behavior incorrect?
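If pandas will not flag the jagged record, the field count could be checked explicitly before parsing. The sketch below uses the standard csv module and is only one possible approach; `validate_manifest_csv` is a hypothetical helper, and unlike `comment='#'` in pandas it only skips whole-line comments.

```python
import csv
import io


def validate_manifest_csv(manifest_fh, n_fields=3):
    """Raise ValueError unless every record has exactly n_fields fields.

    Hypothetical helper: blank lines and lines starting with '#' are
    skipped; the header counts as record 1.
    """
    lines = (ln for ln in manifest_fh
             if ln.strip() and not ln.lstrip().startswith('#'))
    for record_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != n_fields:
            raise ValueError(
                'Record %d has %d comma-separated fields, expected %d: %r'
                % (record_number, len(row), n_fields, row))


jagged = io.StringIO(
    'sample-id,filename,direction\n'
    'banana,/hello/world,forward,hotdog\n'
    'banana,/hello/world,forward\n')
try:
    validate_manifest_csv(jagged)
    error_message = None
except ValueError as e:
    error_message = str(e)
```

With a check like this, the hotdog record is reported by position, and records with too few fields are caught as well.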
This is only referenced in q2-types and docs, so we should drop this. We initially thought we needed this for q2-feature-classifier, but ended up replacing it with various FeatureData types instead.
Proposed Behavior
Maybe in a subpackage (common_filefmts):