fungs / taxator-tk
A set of programs for the taxonomic analysis of genetic sequences
There should be a common taxator-tk namespace.
Throw exceptions if input files are not sorted correctly.
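A minimal Python sketch of such a check, assuming that "sorted correctly" means that all lines sharing a query identifier are contiguous. The function name `check_grouped` is hypothetical and not part of taxator-tk.

```python
def check_grouped(query_ids):
    """Raise ValueError if a query identifier reappears after its group has ended."""
    finished = set()   # identifiers whose group has already been closed
    previous = None
    for line_no, query_id in enumerate(query_ids, start=1):
        if query_id != previous:
            if query_id in finished:
                raise ValueError(
                    f"input not sorted: query '{query_id}' reappears at line {line_no}"
                )
            if previous is not None:
                finished.add(previous)
            previous = query_id
```

The check needs one pass and memory proportional to the number of distinct identifiers, so it could run while the file is being read.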
The error message is:
taxator: /usr/include/boost/thread/pthread/condition_variable.hpp:125: boost::condition_variable_any::~condition_variable_any(): Assertion `!pthread_mutex_destroy(&internal_mutex)' failed.
Terminated
There are certain counters which measure the run-time of steps in taxator-tk. They must be made thread-safe for reliable measurements when running multi-threaded.
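One way to make such counters safe is to guard every update with a lock. The following Python sketch only illustrates the idea; `StepTimer` and its methods are hypothetical names, and the actual counters in taxator-tk may be structured differently.

```python
import threading
import time
from contextlib import contextmanager


class StepTimer:
    """Accumulates run-time and call counts per named step; updates are lock-protected."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seconds = {}
        self._calls = {}

    def add(self, step, seconds):
        # The read-modify-write on both dictionaries happens under one lock.
        with self._lock:
            self._seconds[step] = self._seconds.get(step, 0.0) + seconds
            self._calls[step] = self._calls.get(step, 0) + 1

    @contextmanager
    def measure(self, step):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.add(step, time.perf_counter() - start)

    def total(self, step):
        with self._lock:
            return self._seconds.get(step, 0.0)

    def calls(self, step):
        with self._lock:
            return self._calls.get(step, 0)
```

Each worker thread would wrap a step in `with timer.measure("align"):` so that the elapsed time is measured outside the lock and only the final addition is serialized.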
This is an unnecessary class, the copy constructor must be fixed.
The exception is "An unrecoverable error occurred.".
When there is a perfect reference and the RPA uses local alignment scores without re-alignment for speed improvement, the assignment seems to be at the lower node level, not at the upper node level. This is true for sequence S|S1|C39252 in the CAMI test dataset contigs_30.4.
First, I would like to thank you for the effort to provide the community with a reliable composition-based metagenomic taxonomic classifier. On samples with many unknown organisms, similarity-based approaches fail to classify up to 90% of the contigs.
I set up your programs and downloaded the most recent (and biggest) refpack. I started building the LAST index 3 days ago and it is still running. So my question is:
The machine specs are: Ubuntu, 128 GB RAM, 24 threads, 4 TB disk.
An option should be added which converts the input taxon ID into an integer which is the distance from the root.
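The conversion can be sketched in Python as a walk along parent links. The sketch assumes the taxonomy is available as a child-to-parent mapping in which the root is its own parent (as in NCBI `nodes.dmp`); the function name and the toy taxon IDs in the usage line are made up.

```python
def depth_from_root(taxid, parent):
    """Return the number of edges between `taxid` and the root of the taxonomy."""
    depth = 0
    visited = {taxid}
    while parent[taxid] != taxid:      # the root points to itself
        taxid = parent[taxid]
        if taxid in visited:           # guard against a corrupt, cyclic mapping
            raise ValueError("cycle detected in taxonomy")
        visited.add(taxid)
        depth += 1
    return depth
```

With `parent = {1: 1, 10: 1, 20: 10, 30: 20}`, the call `depth_from_root(30, parent)` walks 30, 20, 10, 1 and returns 3.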
When mapping and using the replace option, the columns after the replace column are not displayed.
We currently use the SeqAn loading function; maybe this is better in SeqAn 2.0. When using the in-memory sequence store, the data should be read as a stream to support reading the output of decompression programs or on-the-fly conversion from FASTQ etc.
Derive a tree hash which can be used to identify a taxonomy solely based on its structure (tree + node identifiers).
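One possible definition, sketched in Python: hash the list of (node, parent) pairs in sorted node order, so the result depends only on tree structure and node identifiers and not on the order in which nodes were loaded. The choice of SHA-256 and the serialization format are assumptions made for illustration.

```python
import hashlib


def taxonomy_hash(parent):
    """Hash a child-to-parent mapping independently of insertion order."""
    digest = hashlib.sha256()
    for node in sorted(parent):
        # One canonical line per edge: "<node>\t<parent>\n"
        digest.update(f"{node}\t{parent[node]}\n".encode("ascii"))
    return digest.hexdigest()
```

Two taxonomies then compare equal exactly when they contain the same nodes with the same parents; names and ranks are deliberately left out of this sketch.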
Currently, the output of taxator is typically sorted to group the lines with identical query identifier. This is a requirement for the current binner. However, in the long term, there should be an output buffer (e.g. a priority list) which puts the multi-threaded output in order.
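Such a buffer could look like the following Python sketch. It assumes that every result carries the index of its query in the input order and that results may arrive from worker threads in any order; the class name `OrderedOutput` is hypothetical.

```python
import heapq
import threading


class OrderedOutput:
    """Holds back out-of-order results and emits them in input order."""

    def __init__(self, emit):
        self._emit = emit          # callable that writes one finished record
        self._heap = []            # min-heap of (index, record)
        self._next = 0             # index of the next record to emit
        self._lock = threading.Lock()

    def push(self, index, record):
        with self._lock:
            heapq.heappush(self._heap, (index, record))
            # Flush every record that is now contiguous with what was emitted.
            while self._heap and self._heap[0][0] == self._next:
                _, ready = heapq.heappop(self._heap)
                self._emit(ready)
                self._next += 1
```

The heap grows only while an early query is still being processed, so memory use depends on how far the fastest worker runs ahead of the slowest one.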
An algorithmic description likely in pseudocode should be provided in the documentation.
binner needs to parse the full input anyway which is not large and runs in seconds to minutes. Therefore, we can drop the sorted input requirement in the old and new binning post-processing programs. This will remove the sort call from the workflow scripts.
The new default after proper implementation of exceptions #4 will be to exit if an invalid mapping for a requested reference sequence is found. The possibility to continue ignoring these data should be given via an advanced (hidden) option.
When looking up a sequence identifier and if the corresponding entry is not found, an exception should be thrown and a meaningful error message be given.
When running taxator on large samples with segmentation disabled, memory fills up quite easily because sequences are held in memory. We should see whether we can generally decrease the memory footprint, maybe by re-loading sequences every time they are needed, or just in some cases. It could be made a command line option.
Put important functions into a libtaxator library and link to this library from the other tools. This allows external programs to use the functions in taxator-tk more easily.
Currently, when a logfile is set, there is nothing logged. Either remove the option or start to output logging information.
In the current implementation, the taxonomy is simplified by deleting all nodes which are not annotated with a major rank such as species, genus, family, order, class, phylum or superkingdom. Leaf nodes (strains, subspecies etc.) are not consistently assigned to a rank. By running taxator with appropriate flags, the full taxonomy can be utilized. However, to extend the standard mode to major ranks plus the lowest subspecies level, we simply need to mark the leaf nodes in addition to the major ranks. This should be tested and become the default mode.
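The proposed marking can be sketched in Python as follows. The sketch assumes a child-to-parent mapping with a self-parented root and a separate node-to-rank mapping; the function name and the toy IDs in the usage note are made up, and the real implementation may mark nodes differently.

```python
MAJOR_RANKS = {"species", "genus", "family", "order", "class", "phylum", "superkingdom"}


def simplify(parent, rank):
    """Keep the root, major-rank nodes and leaves; re-attach each to its nearest kept ancestor."""
    inner = {p for node, p in parent.items() if p != node}   # nodes that have children
    keep = {
        node for node in parent
        if parent[node] == node            # root
        or rank.get(node) in MAJOR_RANKS   # major rank
        or node not in inner               # leaf (strain, subspecies, ...)
    }

    def kept_ancestor(node):
        p = parent[node]
        while p not in keep:               # terminates because the root is kept
            p = parent[p]
        return p

    return {n: (n if parent[n] == n else kept_ancestor(n)) for n in keep}
```

For a chain root 1, superkingdom 10, two unranked nodes 15 and 16, species 20 and an unranked leaf 25, the result keeps 1, 10, 20 and 25, with 20 attached directly to 10.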
The local aligner sometimes does not report the best hit. Currently, with a perfect reference, local alignment scores are taken as pairwise scores, which can lead to imprecise assignments (really minor differences). We should let the user decide whether to re-calculate the pairwise scores in these cases in order to account for a fuzzy homology search.
Should overlapping segments of the same reference be handled differently in the algorithm?
The taxonomy class is currently split into two, one storage and one interface class. We need to clearly structure a single class that derives from both and which can be specialized for different taxonomies.
Implement with #41
When reference segments have the same score, the best hit should be determined based on the score, the number of matches and the original local alignment score, in this particular order.
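The ordering can be expressed as a lexicographic comparison. The Python sketch below assumes, purely for illustration, that a larger value is better for all three criteria; if the score is a distance, that component would have to be negated. `Hit` and its field names are hypothetical.

```python
from collections import namedtuple

# Hypothetical record for one reference segment.
Hit = namedtuple("Hit", ["ref", "score", "matches", "local_score"])


def best_hit(hits):
    """Pick the best hit by score, then matches, then original local alignment score."""
    return max(hits, key=lambda h: (h.score, h.matches, h.local_score))
```

Because tuples compare element by element, the number of matches is only consulted when scores tie, and the local alignment score only when both tie.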
taxator throws an error if the last (optional) column of the alignment file is missing and there is no tab after the last column.
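A tolerant parser would treat a missing optional column the same as an empty one. The Python sketch below illustrates this; the number of required columns is passed in as a parameter because the actual column layout is not stated here, and the function name is hypothetical.

```python
def split_alignment_line(line, required):
    """Split a tab-separated line into `required` fields plus one optional last field."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) == required:
        fields.append("")                  # optional column absent: treat as empty
    if len(fields) != required + 1:
        raise ValueError(
            f"expected {required} or {required + 1} columns, got {len(fields)}"
        )
    return fields
```

Lines with and without the trailing tab then yield the same field list, and only a genuinely wrong column count is reported as an error.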
Similar to taxator etc.
A hashmap or similar should be used to cache all alignments in the RPA algorithm. This is currently limited to those alignments which involve the query sequence.
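A cache of this kind could be keyed by the pair of sequence identifiers. The Python sketch assumes that the pairwise score is symmetric, so that (a, b) and (b, a) share one entry; `AlignmentCache` and the `align` callable are hypothetical stand-ins for the real alignment routine.

```python
class AlignmentCache:
    """Memoizes pairwise alignment scores for any pair of sequence identifiers."""

    def __init__(self, align):
        self._align = align        # callable (id_a, id_b) -> score
        self._scores = {}

    def score(self, id_a, id_b):
        # Normalize the key so both argument orders hit the same entry.
        key = (id_a, id_b) if id_a <= id_b else (id_b, id_a)
        if key not in self._scores:
            self._scores[key] = self._align(*key)
        return self._scores[key]
```

Whether caching all pairs pays off depends on how often reference-to-reference alignments repeat, which would need to be measured.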
The default should be to exit with a meaningful message and to force the user to provide correct input.
There should be an advanced command line parameter which would allow the user to ignore data with an invalid taxid mapping.
Should we give segments an identifier in order to track it?
Deactivate messages via stderr or use a verbosity switch with a moderate default.
Should be easy to do, just add the corresponding header.
Common errors like
should be tested in single and threaded mode. Error codes should be minimalistic and understandable.
Port code to C++11 with better support for multithreading, generic programming, uniform initialization, and better overall performance. This will also make it possible to use more specific data types, replace some boost code and simplify some of the template-based code.
A text file or markdown document which defines the coding guidelines.
A line which points to the original taxator-tk publication should be given to facilitate citation.
The header should reflect the current list of authors and years.
Make README a markdown document and check whether all commands are still correct for the recent version.
Currently, the class needs to maintain a redundant list of identifiers to be thread-safe. This should be fixed. Maybe, this is settled already in the SeqAn base class.
names.dmp and nodes.dmp are relatively large text files which compress very well. Using e.g. gzip compression will thus speed up reading those files from slow locations. We should implement a simple auto-detection of gzipped files and load them instead, if present.
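The auto-detection could work as in this Python sketch: prefer a `.gz` sibling if one exists, then decide by the two gzip magic bytes rather than by the file name. The function name `open_taxonomy_dump` is hypothetical.

```python
import gzip
import os

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of every gzip stream


def open_taxonomy_dump(path):
    """Open `path` (or `path + '.gz'` if present) as text, decompressing if needed."""
    candidate = path + ".gz" if os.path.exists(path + ".gz") else path
    with open(candidate, "rb") as probe:
        compressed = probe.read(2) == GZIP_MAGIC
    if compressed:
        return gzip.open(candidate, "rt", encoding="utf-8")
    return open(candidate, "rt", encoding="utf-8")
```

Checking the magic bytes means a compressed file that was renamed without the `.gz` suffix is still read correctly.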
Basically, the whole algorithm is independent of sequence type if a proper pairwise alignment algorithm and score is used. Therefore, it should be possible to run the same algorithm with protein sequences. I expect it to be somewhat slower than for nucleotide sequences because we cannot use the ultra-fast edit-score alignment algorithm in SeqAn.
Currently, the path looks like A;B;C;
However, C as the leaf node should not be followed by the path field separator.
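The intended output corresponds to joining the node names with the separator instead of appending the separator after each name. A minimal Python sketch, with a hypothetical function name:

```python
def format_path(names, separator=";"):
    """Join node names from root to leaf without a trailing separator."""
    return separator.join(names)
```

For the names A, B and C this yields `A;B;C`.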
We need to reduce the size of the logs of taxator and probably also rework those of the binner.
Quickly construct a taxonomy from a newick string/file instead of the NCBI raw dump files. This allows us to use different taxonomies more easily.
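As a rough Python sketch, a Newick string can be turned into a child-to-parent mapping with a small recursive parser. The sketch assumes that every node, including inner nodes, carries a unique label, that labels are unquoted, and that branch lengths can be ignored; the function name is hypothetical.

```python
def parents_from_newick(text):
    """Parse a labelled Newick string into {node: parent}; the root maps to itself."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0
    parent = {}

    def read_label():
        nonlocal pos
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        label = text[start:pos].strip()
        if text[pos] == ":":               # skip an optional branch length
            while text[pos] not in ",();":
                pos += 1
        return label

    def read_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1
            while True:
                children.append(read_node())
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected character at position {pos}")
        label = read_label()
        if not label:
            raise ValueError("every node needs an identifier")
        for child in children:
            parent[child] = label
        return label

    root = read_node()
    if text[pos] != ";":
        raise ValueError("trailing characters after the root node")
    parent[root] = root
    return parent
```

The string `((C,D)B,E)A;` produces a mapping in which C and D point to B, B and E point to A, and A points to itself.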
This is due to the introduction of the rtax field in version 1.4 which is not set in the corresponding assignment algorithms.
Instead of redundantly specifying the sequence length in each line, it should be placed in a pseudo feature similar to the NCBI genome annotation. This will shrink the overall size but make parsing slightly more complex and impose an order on the GFF3 lines.
There are many tools which give space-separated information in the sequence identifier (assembly programs), others which parse the identifiers and report only a substring (NCBI Blast), and programs which don't support spaces at all (LAST and MAF format). We should consider using a regexp to internally handle these cases without having to modify the input FASTA files.
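The regexp idea can be sketched in Python as a user-supplied pattern whose first capture group becomes the internal identifier. The default pattern (everything up to the first whitespace) and the function name are assumptions for illustration.

```python
import re

DEFAULT_PATTERN = r"^(\S+)"    # identifier = header text up to the first whitespace


def extract_identifier(header, pattern=DEFAULT_PATTERN):
    """Return the first capture group of `pattern` matched against a FASTA header."""
    match = re.search(pattern, header)
    if match is None or match.lastindex is None:
        raise ValueError(f"pattern {pattern!r} does not capture anything in {header!r}")
    return match.group(1)
```

With the default pattern, `contig_1 length=500 cov=3.2` becomes `contig_1`; a pattern such as `^gi\|\d+\|\w+\|([^|]+)\|` would instead extract the accession from an old NCBI-style header.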