fungs / taxator-tk

A set of programs for the taxonomic analysis of genetic sequences

CMake 0.40% C++ 91.46% Shell 0.25% Python 0.21% C 3.51% Perl 3.17% JavaScript 1.00%

bioinformatics classification taxator-tk taxonomic

taxator-tk's Issues

Introduce C++ namespace

There should be a common taxator-tk namespace.

taxator: exception handling for unsorted input

Throw exceptions if input files are not sorted correctly.
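
A minimal sketch of such a check, assuming that "sorted" means records with the same query identifier must be contiguous. The function name `check_grouped_by_query` is hypothetical and not part of taxator-tk.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not taxator-tk API): verifies that records with the
// same query identifier form one contiguous group. Throws std::runtime_error
// as soon as an identifier reappears after its group has ended.
void check_grouped_by_query(const std::vector<std::string>& query_ids) {
    std::unordered_set<std::string> finished;  // identifiers whose group is closed
    std::string current;
    bool have_current = false;
    for (std::size_t i = 0; i < query_ids.size(); ++i) {
        const std::string& id = query_ids[i];
        if (have_current && id == current) continue;  // still inside the same group
        if (have_current) finished.insert(current);   // previous group is closed
        if (finished.count(id) != 0) {
            throw std::runtime_error("input not sorted: query '" + id +
                                     "' reappears at record " + std::to_string(i + 1));
        }
        current = id;
        have_current = true;
    }
}
```

The message names the offending identifier and record number so the user can locate the problem in the input file.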

taxator: incorrect exception message is given when running multi-threaded

The error message is:

taxator: /usr/include/boost/thread/pthread/condition_variable.hpp:125: boost::condition_variable_any::~condition_variable_any(): Assertion `!pthread_mutex_destroy(&internal_mutex)' failed.
Terminated

taxator: make class stopwatch thread-safe

There are certain counters which measure the run-time of steps in taxator-tk. They must be made thread-safe for reliable measurements when running multi-threaded.
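
One possible approach, sketched under the assumption that C++11 standard library facilities are available (the code base may still rely on Boost). The class names `ThreadSafeStopwatch` and `ScopedTimer` are illustrative and do not refer to the existing stopwatch class.

```cpp
#include <chrono>
#include <mutex>

// Hypothetical counter (not the existing taxator-tk class): accumulates
// elapsed time from several threads behind a mutex.
class ThreadSafeStopwatch {
public:
    using clock = std::chrono::steady_clock;

    // Adds one measured interval to the shared total.
    void add(clock::duration elapsed) {
        std::lock_guard<std::mutex> lock(mutex_);
        total_ += elapsed;
        ++count_;
    }

    // Total accumulated time in seconds (summed over all threads).
    double seconds() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return std::chrono::duration<double>(total_).count();
    }

    // Number of intervals recorded so far.
    unsigned long count() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return count_;
    }

private:
    mutable std::mutex mutex_;
    clock::duration total_ = clock::duration::zero();
    unsigned long count_ = 0;
};

// RAII helper: measures its own lifetime and reports it on destruction.
class ScopedTimer {
public:
    explicit ScopedTimer(ThreadSafeStopwatch& sw)
        : sw_(sw), start_(ThreadSafeStopwatch::clock::now()) {}
    ~ScopedTimer() { sw_.add(ThreadSafeStopwatch::clock::now() - start_); }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    ThreadSafeStopwatch& sw_;
    ThreadSafeStopwatch::clock::time_point start_;
};
```

Each worker thread would create a `ScopedTimer` around the step it measures. Note that the total is then the sum over all threads, not the wall-clock time of the whole step.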

Remove class PredictionRecordBinning

This class is unnecessary and should be removed; the copy constructor must be fixed.

taxknife: throw specific error when taxid is not found in taxonomy

Currently, the exception message is only "An unrecoverable error occurred.".

taxator: segment assignment too specific

When there is a perfect reference and the RPA uses local alignment scores without re-alignment for speed improvement, the assignment seems to be at the lower node level, not at the upper node level. This is true for sequence S|S1|C39252 in CAMI test data-set contigs_30.4.

Time for LAST index build

I first would like to thank you for the effort to provide the community a reliable composition based metagenomic taxonomic classifier. On samples with many unknown organisms the similarity based approaches fail to classify up to 90% of the contigs.

I set up your programs and downloaded the most recent (and biggest) refpack. I started building the LAST index 3 days ago and it is still running. So my question is:

How long is the normal run time for building the LAST index?

The machine specs are: Ubuntu, 128 GB RAM, 24 threads, 4 TB disk.

taxknife: add functionality to show the tree distance from root

An option should be added which converts the input taxon ID into an integer which is the distance from the root.
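
A sketch of the computation, assuming the taxonomy is available as a child-to-parent map in which the root is its own parent (as in NCBI nodes.dmp, where taxon 1 has parent 1). The type alias `TaxonID` and the function `distance_from_root` are hypothetical names.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

using TaxonID = unsigned int;  // hypothetical alias for this sketch

// Counts the edges between `id` and the root. Assumes the root maps to itself.
// Throws std::out_of_range with the offending ID if a taxon is missing.
unsigned int distance_from_root(const std::unordered_map<TaxonID, TaxonID>& parent,
                                TaxonID id) {
    unsigned int depth = 0;
    while (true) {
        const auto it = parent.find(id);
        if (it == parent.end()) {
            throw std::out_of_range("taxon ID not found in taxonomy: " + std::to_string(id));
        }
        if (it->second == id) return depth;  // reached the root
        id = it->second;
        ++depth;
        if (depth > parent.size()) {
            throw std::logic_error("cycle detected in taxonomy");
        }
    }
}
```

With the toy map `{1→1, 10→1, 20→10, 30→20}`, taxon 30 has distance 3 and the root has distance 0.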

taxknife: "--set-invalid-annotate" omits empty trailing fields

When mapping and using the replace option, the columns after the replace column are not displayed.
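
The cause is not stated here. If it lies in how lines are split into fields, a splitter that keeps empty fields, including trailing ones, might look like the following sketch. The function name `split_keep_empty` is hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits `line` at every separator and keeps empty
// fields, so "a\t\tb\t" yields four fields: "a", "", "b", "".
std::vector<std::string> split_keep_empty(const std::string& line, char sep = '\t') {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type pos = line.find(sep, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));  // last field, may be empty
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

Writing the fields back with the same separator then reproduces the original number of columns.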

Improve biobox parser

read header exactly
test for correct file format

taxator: read FASTA as stream

We currently use the SeqAn loading function; this may be better in SeqAn 2.0. When using the in-memory sequence store, the data should be read as a stream to support reading the output of decompression programs or on-the-fly conversion from FASTQ etc.

Calculate unique taxonomy version

Derive a tree hash which can be used to identify a taxonomy solely based on its structure (tree + node identifiers).
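
One way this could work, as a sketch: sort all (node, parent) pairs so that the result does not depend on the order in which nodes were read, then feed them to a simple hash. The choice of 64-bit FNV-1a and the names `Edge` and `taxonomy_hash` are assumptions made for illustration; a cryptographic hash may be preferable in practice.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using Edge = std::pair<std::uint64_t, std::uint64_t>;  // (node ID, parent ID)

// Hypothetical structure hash: 64-bit FNV-1a over the sorted edge list.
// The vector is taken by value because it is sorted locally.
std::uint64_t taxonomy_hash(std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end());
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV-1a offset basis
    auto mix = [&h](std::uint64_t v) {
        for (int i = 0; i < 8; ++i) {          // feed the value byte by byte
            h ^= (v >> (8 * i)) & 0xffULL;
            h *= 0x100000001b3ULL;             // FNV prime
        }
    };
    for (const Edge& e : edges) {
        mix(e.first);
        mix(e.second);
    }
    return h;
}
```

Two taxonomies with the same nodes and parent links give the same value regardless of input order.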

taxator: ordered GFF3 output

Currently, the output of taxator is typically sorted to group the lines with identical query identifiers. This is a requirement for the current binner. However, in the long term, there should be an output buffer (e.g. a priority list) which puts the multi-threaded output in order.
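
A minimal sketch of such a buffer, assuming every work item gets a consecutive index (0, 1, 2, ...) when it is read. Finished records are held back until all earlier ones have been written. The class name `OrderedOutput` and the sink callback are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical reorder buffer: worker threads submit records in any order,
// the sink receives them strictly in index order.
class OrderedOutput {
public:
    explicit OrderedOutput(std::function<void(const std::string&)> sink)
        : sink_(std::move(sink)) {}

    // Called by worker threads; `index` is the position of the input item.
    void submit(std::size_t index, std::string record) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.emplace(index, std::move(record));
        // Flush every record that is next in line.
        auto it = pending_.begin();
        while (it != pending_.end() && it->first == next_) {
            sink_(it->second);
            it = pending_.erase(it);
            ++next_;
        }
    }

private:
    std::mutex mutex_;
    std::map<std::size_t, std::string> pending_;  // sorted by index
    std::size_t next_ = 0;                         // next index to write
    std::function<void(const std::string&)> sink_;
};
```

Memory use grows with the number of records waiting behind a slow item, so a bound on the buffer size may be needed.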

Provide an algorithmic description

An algorithmic description likely in pseudocode should be provided in the documentation.

Update to SeqAn 2

binner: do not require sorted GFF3 input

binner needs to parse the full input anyway which is not large and runs in seconds to minutes. Therefore, we can drop the sorted input requirement in the old and new binning post-processing programs. This will remove the sort call from the workflow scripts.

taxator: add advanced option to ignore reference sequences with invalid taxon mapping

The new default after proper implementation of exceptions #4 will be to exit if an invalid mapping for a requested reference sequence is found. The possibility to continue ignoring these data should be given via an advanced (hidden) option.

taxator: exception if sequence not found

When looking up a sequence identifier and if the corresponding entry is not found, an exception should be thrown and a meaningful error message be given.

taxator: decrease memory usage when sequences are not split

When running taxator on large samples with segmentation disabled, memory fills up quite easily because sequences are held in memory. We should see whether we can generally decrease the memory footprint, perhaps by re-loading sequences every time they are needed, or only in some cases. This could be made a command line option.

Reduce binary size and create shared library

Put important functions into a libtaxator library and link to this library from the other tools. This allows external programs to use the functions in taxator-tk more easily.

binner: remove empty log file

Currently, when a logfile is set, there is nothing logged. Either remove the option or start to output logging information.

taxator: consider strains in classification by default

In the current implementation, the taxonomy is simplified by deleting all nodes which are not annotated with a major rank such as species, genus, family, order, class, phylum or superkingdom. Leaf nodes (strains, subspecies, etc.) are not consistently assigned to a rank. By running taxator with appropriate flags, the full taxonomy can be utilized. However, to extend the standard mode to major ranks plus the lowest subspecies level, we simply need to mark the leaf nodes in addition to the major ranks. This should be tested and become the default mode.

taxator: add option to perform re-alignment with perfect reference

The local aligner sometimes does not report the best hit. Currently, local alignment scores are taken as pairwise scores with a perfect reference, which can lead to imprecise assignments (really minor differences). We should let the user decide whether to re-calculate the pairwise scores in these cases in order to account for a fuzzy homology search.

Consider overlapping segments of same reference?

Should overlapping segments of the same reference be handled differently in the algorithm?

Generalize taxonomy class

The taxonomy class is currently split into two, one storage and one interface class. We need to clearly structure a single class that derives from both and which can be specialized for different taxonomies.

Implement with #41

Improve option handling

go over options and make their descriptions clearer, remove unnecessary or move to advanced-options group
set options to required if necessary and move po::notify(vm) after --help parsing
implement a two-level option handling system for taxknife
common options for all programs via shared code

taxator: resolve nearest neighbor improvement

When reference segments have the same score, the best hit should be determined based on the score, the number of matches and the original local alignment score, in this particular order.
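
The ordering described above can be expressed as a lexicographic comparison. The sketch below assumes that larger values are better for all three criteria, which may not hold for the actual scores (an edit-distance score, for example, would need the opposite direction). The struct `Hit` and its field names are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record for one reference segment.
struct Hit {
    std::string reference;
    int score;        // pairwise score (assumed: larger is better)
    int matches;      // number of matches (assumed: larger is better)
    int local_score;  // original local alignment score (assumed: larger is better)
};

// True if `a` should be preferred over `b`: compares score first, then the
// number of matches, then the local alignment score.
bool better(const Hit& a, const Hit& b) {
    return std::tie(a.score, a.matches, a.local_score) >
           std::tie(b.score, b.matches, b.local_score);
}

// Returns an iterator to the preferred hit, or hits.end() if `hits` is empty.
std::vector<Hit>::const_iterator best_hit(const std::vector<Hit>& hits) {
    return std::max_element(hits.begin(), hits.end(),
                            [](const Hit& a, const Hit& b) { return better(b, a); });
}
```

If two hits are equal in all three criteria, the first one in the list is returned.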

alignment format error

taxator throws an error if the last (optional) column of the alignment file is missing and there is no tab after the last column.

alignments-filter: add hidden command line parameters

Similar to taxator etc.

taxator: cache all alignment scores

A hashmap or similar should be used to cache all alignments in the RPA algorithm. This is currently limited to those alignments which involve the query sequence.
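
A sketch of such a cache, assuming that segments can be addressed by string identifiers, that identifiers never contain a newline, and that the pairwise score is symmetric. The class name `ScoreCache` and the scorer callback are hypothetical and stand in for the real alignment routine.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical cache for pairwise alignment scores, keyed by an unordered
// pair of segment identifiers.
class ScoreCache {
public:
    using Scorer = std::function<int(const std::string&, const std::string&)>;

    explicit ScoreCache(Scorer scorer) : scorer_(std::move(scorer)) {}

    // Returns the cached score, computing it on first request.
    int score(const std::string& a, const std::string& b) {
        // Order the two identifiers so that (a, b) and (b, a) share one entry.
        const std::string key = (a <= b) ? a + '\n' + b : b + '\n' + a;
        const auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;
        const int s = scorer_(a, b);
        cache_.emplace(key, s);
        return s;
    }

    std::size_t size() const { return cache_.size(); }

private:
    Scorer scorer_;
    std::unordered_map<std::string, int> cache_;
};
```

This version is not thread-safe; a shared cache in multi-threaded mode would need locking or one cache per thread.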

taxator/binner: throw exception when taxid is not found in taxonomy

The default should be to exit with a meaningful message and to force the user to provide correct input.
There should be an advanced command line parameter which would allow the user to ignore data with an invalid taxid mapping.

ID for GFF3 segments?

Should we give segments an identifier in order to track it?

taxator: --quiet flag and/or --verbose

Deactivate messages via stderr or use a verbosity switch with a moderate default.

binner: write output in bioboxes format

Should be easy to do, just add the corresponding header.

taxator: test and improve exception error messages

Common errors like

missing sequence ID in mapping file
missing taxon ID in taxonomy
missing files in refpack
missing input files

should be tested in single and threaded mode. Error codes should be minimalistic and understandable.

Port to C++11

Port code to C++11 with better support for multithreading, generic programming, uniform initialization, and better overall performance. This will make it also possible to use more specific data types, replace some boost code and to simplify some of the template-based code.

Add coding style document

A text file or markdown document which defines the coding guidelines.

Add citation note in program and README

A line which points to the original taxator-tk publication should be given to facilitate citation.

Update licensing and author headers

The header should reflect the current list of authors and years.

Rework README

Make the README a markdown document and check whether all commands are still correct for the recent version.

Improve sequence store class

Currently, the class needs to maintain a redundant list of identifiers to be thread-safe. This should be fixed. It may already be handled in the SeqAn base class.

Support compressed taxonomy dump files

names.dmp and nodes.dmp are relatively large text files which compress very well. Using e.g. gzip compression will thus speed up reading those files from slow locations. We should implement a simple auto-detection of gzipped files and load them instead of the uncompressed files, if present.
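
The detection step could be as simple as checking the two gzip magic bytes (0x1f 0x8b) at the start of the file. The sketch below assumes a seekable input stream and only covers detection, not decompression. The function name `looks_gzipped` is hypothetical.

```cpp
#include <istream>
#include <sstream>  // only needed to try the function on in-memory data

// Hypothetical helper: returns true if the stream starts with the gzip magic
// bytes 0x1f 0x8b. The read position is restored afterwards, which assumes
// that the stream is seekable (e.g. a regular file).
bool looks_gzipped(std::istream& in) {
    const std::istream::pos_type start = in.tellg();
    char magic[2] = {0, 0};
    in.read(magic, 2);
    const bool is_gzip = in.gcount() == 2 &&
                         static_cast<unsigned char>(magic[0]) == 0x1f &&
                         static_cast<unsigned char>(magic[1]) == 0x8b;
    in.clear();       // reset eof/fail bits after a short read
    in.seekg(start);  // rewind so the caller can parse from the beginning
    return is_gzip;
}
```

Checking the content rather than the file extension also handles compressed files that were renamed.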

taxator: port RPA to protein sequences

Basically, the whole algorithm is independent of sequence type if a proper pairwise alignment algorithm and score is used. Therefore, it should be possible to run the same algorithm with protein sequences. I expect it to be somewhat slower than for nucleotide sequences because we cannot use the ultra-fast edit-score alignment algorithm in SeqAn.

taxknife: path and taxid-path should not end with semicolon

Currently, the path looks like A;B;C;

However, C as the leaf node should not be followed by the last path field separator.
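
A small sketch of joining the path so that the separator only appears between fields. The function name `join_path` is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: joins the node names from root to leaf and writes the
// separator only between two fields, never after the last one.
std::string join_path(const std::vector<std::string>& names, char sep = ';') {
    std::string out;
    for (std::size_t i = 0; i < names.size(); ++i) {
        if (i != 0) out += sep;
        out += names[i];
    }
    return out;
}
```

For the nodes A, B and C this gives A;B;C.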

Shrink logfiles

We need to reduce the size of the logs of taxator and probably also rework those of the binner.

choose a syntax
cleanup code

Read newick taxonomy

Quickly construct a taxonomy from a newick string/file instead of the NCBI raw dump files. This allows us to use different taxonomies more easily.

taxator: all LCA variants crash with segfault

This is due to the introduction of the rtax field in version 1.4 which is not set in the corresponding assignment algorithms.

GFF3 format: replace seqlen variable by pseudo feature

Instead of redundantly specifying the sequence length in each line, it should be placed in a pseudo feature similar to the NCBI genome annotation. This will shrink the overall size but make parsing slightly more complex and impose an order on the GFF3 lines.

Regexp for sequence identifiers

Many tools put space-separated information in the sequence identifier (assembly programs), others parse the identifiers and report only a substring (NCBI BLAST), and some programs don't support spaces at all (LAST and the MAF format). We should consider using a regexp to handle these cases internally without having to modify the input FASTA files.
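
A sketch of how this could look with std::regex, assuming the user supplies a pattern whose first capture group is the identifier to use internally. The function name `extract_id` and the example headers are made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first capture group of `pattern` in
// `header`, or the unchanged header if the pattern does not match or has no
// capture group.
std::string extract_id(const std::string& header, const std::regex& pattern) {
    std::smatch m;
    if (std::regex_search(header, m, pattern) && m.size() > 1) {
        return m[1].str();
    }
    return header;
}
```

With the pattern `^(\S+)`, the header `contig_1 length=500` is reduced to `contig_1`. A pattern such as `ref\|([^|]+)\|` would pick one field out of a pipe-separated identifier.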