taxator-tk's People

Contributors

eik-dahms, fungs, gaberoo

taxator-tk's Issues

taxator: make class stopwatch thread-safe

There are certain counters which measure the run-time of steps in taxator-tk. They must be made thread-safe for reliable measurements when running multi-threaded.
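
A minimal sketch of one possible approach, assuming the counters only need to accumulate elapsed time across threads. The class name `AtomicStopwatch` and its interface are hypothetical and not taken from the taxator-tk code; each thread measures its own work locally and adds the result to an atomic total.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical thread-safe accumulator for run-time measurements.
class AtomicStopwatch {
public:
    using Clock = std::chrono::steady_clock;

    // RAII helper: measures its own lifetime and adds it to the total.
    class Scope {
    public:
        explicit Scope(AtomicStopwatch& sw) : sw_(sw), start_(Clock::now()) {}
        ~Scope() { sw_.add(Clock::now() - start_); }
        Scope(const Scope&) = delete;
        Scope& operator=(const Scope&) = delete;
    private:
        AtomicStopwatch& sw_;
        Clock::time_point start_;
    };

    // Adds a duration to the shared total; safe to call from any thread.
    void add(Clock::duration d) {
        total_ns_.fetch_add(
            std::chrono::duration_cast<std::chrono::nanoseconds>(d).count(),
            std::memory_order_relaxed);
    }

    // Total time accumulated so far over all threads.
    std::chrono::nanoseconds total() const {
        return std::chrono::nanoseconds(total_ns_.load(std::memory_order_relaxed));
    }

private:
    std::atomic<std::int64_t> total_ns_{0};
};
```

A worker would create an `AtomicStopwatch::Scope` at the start of the step to be measured. Note that the total is the sum over all threads, so it can exceed the wall-clock time.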

taxator: segment assignment too specific

When there is a perfect reference and the RPA uses local alignment scores without re-alignment to improve speed, the assignment seems to be made at the lower node level rather than at the upper node level. This is observed for sequence S|S1|C39252 in the CAMI test data set contigs_30.4.

Time for LAST index build

First, I would like to thank you for your effort to provide the community with a reliable composition-based metagenomic taxonomic classifier. On samples with many unknown organisms, similarity-based approaches fail to classify up to 90% of the contigs.

I set up your programs and downloaded the most recent (and biggest) refpack. I started building the LAST index 3 days ago and it is still running. So my question is:

How long is the normal run time for building the LAST index?

The machine specs are: Ubuntu, 128 GB RAM, 24 threads, 4 TB disk.

taxator: read FASTA as stream

We currently use the SeqAn loading function; maybe this is better in SeqAn 2.0. When using the in-memory sequence store, the data should be read as a stream to support reading the output of decompression programs or on-the-fly conversion from FASTQ etc.

taxator: ordered GFF3 output

Currently, the output of taxator is typically sorted to group the lines with identical query identifier. This is a requirement for the current binner. However, in the long term, there should be an output buffer (e.g. a priority list) which puts the multi-threaded output in order.

binner: do not require sorted GFF3 input

binner needs to parse the full input anyway, which is not large and takes seconds to minutes. Therefore, we can drop the sorted input requirement in the old and new binning post-processing programs. This will remove the sort call from the workflow scripts.

taxator: decrease memory usage when sequences are not split

When running taxator on large samples with segmentation disabled, memory fills up quite easily because the sequences are held in memory. We should see whether we can generally decrease the memory footprint, maybe by re-loading sequences every time they are needed, or just in some cases. It could be made a command line option.

Reduce binary size and create shared library

Put important functions into a libtaxator library and link to this library from the other tools. This allows external programs to use the functions in taxator-tk more easily.

binner: remove empty log file

Currently, when a logfile is set, nothing is logged. Either remove the option or start to output logging information.

taxator: consider strains in classification by default

In the current implementation, the taxonomy is simplified by deleting all nodes which are not annotated with a major rank such as species, genus, family, order, class, phylum or superkingdom. Leaf nodes (strains, subspecies etc.) are not consistently assigned to a rank. By running taxator with appropriate flags, the full taxonomy can be utilized. However, to extend the standard mode to major ranks plus the lowest subspecies level, we simply need to mark the leaf nodes in addition to the major ranks. This should be tested and become the default mode.
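
The marking step could be sketched as follows. This is only an illustration with a hypothetical flat node layout (`TaxNode` with a parent index, -1 for the root), not the taxonomy class used in taxator-tk.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical flat taxonomy node.
struct TaxNode {
    std::string rank;
    int parent;  // index of the parent node, -1 for the root
};

// Marks the nodes to keep: every major rank plus every leaf node.
std::vector<bool> markKeptNodes(const std::vector<TaxNode>& nodes) {
    static const std::set<std::string> major = {
        "species", "genus", "family", "order", "class", "phylum", "superkingdom"};

    // A node is a leaf if no other node names it as parent.
    std::vector<bool> has_child(nodes.size(), false);
    for (const TaxNode& n : nodes) {
        if (n.parent >= 0) has_child[static_cast<std::size_t>(n.parent)] = true;
    }

    std::vector<bool> keep(nodes.size(), false);
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        keep[i] = major.count(nodes[i].rank) > 0 || !has_child[i];
    }
    return keep;
}
```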

taxator: add option to perform re-alignment with perfect reference

The local aligner sometimes does not report the best hit. Currently, with a perfect reference, local alignment scores are taken as pairwise scores, which can lead to imprecise assignments (really minor differences). We should let the user decide whether to re-calculate the pairwise scores in these cases in order to account for a fuzzy homology search.

Generalize taxonomy class

The taxonomy class is currently split into two, one storage and one interface class. We need to clearly structure a single class that derives from both and which can be specialized for different taxonomies.

Implement with #41

Improve option handling

  • go over the options and make their descriptions clearer; remove unnecessary ones or move them to an advanced-options group
  • set options to required where necessary and move po::notify(vm) after --help parsing
  • implement a two-level option handling system for taxknife
  • provide common options for all programs via shared code

taxator: resolve nearest neighbor improvement

When reference segments have the same score, the best hit should be determined based on the score, the number of matches and the original local alignment score, in this particular order.
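
The ordering can be expressed as a lexicographic comparison. In this sketch the struct `Hit` and its field names are hypothetical, and all three values are assumed to be "higher is better"; if the pairwise score is a distance, the first comparison would have to be inverted.

```cpp
#include <tuple>

// Hypothetical record for a candidate reference segment.
struct Hit {
    int score;        // pairwise score (assumed: higher is better)
    int matches;      // number of matching positions
    int local_score;  // original local alignment score
};

// Returns true if a should be preferred over b. Ties on the score are
// broken by the number of matches, then by the local alignment score.
inline bool betterHit(const Hit& a, const Hit& b) {
    return std::tie(a.score, a.matches, a.local_score) >
           std::tie(b.score, b.matches, b.local_score);
}
```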

alignment format error

taxator throws an error if the last (optional) column of the alignment file is missing and there is no tab after the last column.
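
A possible fix is to split on tabs and pad missing trailing columns with empty strings. The function below is a hypothetical sketch, not the actual parser; the expected number of columns is passed in by the caller.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a tab-separated line and pads missing trailing (optional) columns
// with empty strings, so a line with or without the final tab is accepted.
std::vector<std::string> splitTabLine(const std::string& line, std::size_t num_columns) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        const std::size_t tab = line.find('\t', start);
        if (tab == std::string::npos) {
            fields.push_back(line.substr(start));  // last field, possibly empty
            break;
        }
        fields.push_back(line.substr(start, tab - start));
        start = tab + 1;
    }
    if (fields.size() < num_columns) fields.resize(num_columns);
    return fields;
}
```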

taxator: cache all alignment scores

A hashmap or similar should be used to cache all alignments in the RPA algorithm. This is currently limited to those alignments which involve the query sequence.
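
A minimal sketch of such a cache, assuming that segments can be addressed by a 32-bit index and that the score is symmetric. `ScoreCache` is a hypothetical class; the alignment itself is passed in as a callable.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>

// Hypothetical cache for pairwise scores, keyed by an unordered index pair.
class ScoreCache {
public:
    using Index = std::uint32_t;

    // Returns the cached score, or computes it with align(a, b) and stores it.
    template <typename AlignFn>
    int getOrCompute(Index a, Index b, AlignFn align) {
        const std::uint64_t key = makeKey(a, b);
        auto it = scores_.find(key);
        if (it != scores_.end()) return it->second;
        const int score = align(a, b);
        scores_.emplace(key, score);
        return score;
    }

    std::size_t size() const { return scores_.size(); }

private:
    // Assumes a symmetric score, so (a, b) and (b, a) share one entry.
    static std::uint64_t makeKey(Index a, Index b) {
        if (a > b) std::swap(a, b);
        return (static_cast<std::uint64_t>(a) << 32) | b;
    }

    std::unordered_map<std::uint64_t, int> scores_;
};
```

This version is not thread-safe; in threaded mode each thread would need its own cache, or access would have to be guarded by a mutex.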

taxator: test and improve exception error messages

Common errors like

  • missing sequence ID in mapping file
  • missing taxon ID in taxonomy
  • missing files in refpack
  • missing input files

should be tested in single and threaded mode. Error codes should be minimalistic and understandable.

Port to C++11

Port code to C++11 with better support for multithreading, generic programming, uniform initialization, and better overall performance. This will also make it possible to use more specific data types, replace some boost code and simplify some of the template-based code.

Rework README

Make README a markdown document and check whether all commands are still correct for the current version.

Improve sequence store class

Currently, the class needs to maintain a redundant list of identifiers to be thread-safe. This should be fixed. Maybe this is already settled in the SeqAn base class.

Support compressed taxonomy dump files

names.dmp and nodes.dmp are relatively large text files which compress very well. Using e.g. gzip compression will thus speed up reading those files from slow locations. We should implement a simple auto-detection of gzipped files and load them instead, if present.

taxator: port RPA to protein sequences

Basically, the whole algorithm is independent of sequence type if a proper pairwise alignment algorithm and score is used. Therefore, it should be possible to run the same algorithm with protein sequences. I expect it to be somewhat slower than for nucleotide sequences because we cannot use the ultra-fast edit-score alignment algorithm in SeqAn.

Shrink logfiles

We need to reduce the size of the logs of taxator and probably also rework those of the binner.

  • choose a syntax
  • clean up code

Read newick taxonomy

Quickly construct a taxonomy from a newick string/file instead of the NCBI raw dump files. This allows us to use different taxonomies more easily.
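
A rough sketch of how a Newick string could be turned into a parent-index tree. `TreeNode` and `parseNewick` are hypothetical names; branch lengths are skipped, and quoted labels, comments and unnamed leaves are not handled. Rank information is not part of Newick and would have to come from elsewhere.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical tree node: label plus index of the parent (-1 for the root).
struct TreeNode {
    std::string name;
    int parent;
};

std::vector<TreeNode> parseNewick(const std::string& text) {
    std::vector<TreeNode> nodes;
    std::vector<int> open;   // internal nodes whose ')' has not been seen yet
    int closed = -1;         // internal node closed last; its label may follow
    bool expect_new = true;  // true if the next label starts a new leaf

    std::size_t i = 0;
    while (i < text.size()) {
        const char c = text[i];
        if (c == '(') {  // open a new internal node
            nodes.push_back(TreeNode{std::string(), open.empty() ? -1 : open.back()});
            open.push_back(static_cast<int>(nodes.size()) - 1);
            expect_new = true;
            ++i;
        } else if (c == ',') {  // next sibling
            expect_new = true;
            ++i;
        } else if (c == ')') {  // close the innermost open node
            if (open.empty()) throw std::runtime_error("unbalanced ')' in Newick string");
            closed = open.back();
            open.pop_back();
            expect_new = false;
            ++i;
        } else if (c == ':') {  // skip the branch length
            i = text.find_first_of("(),;", i);
            if (i == std::string::npos) break;
        } else if (c == ';') {
            break;
        } else if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
            ++i;
        } else {  // a label: either a new leaf or the name of the closed node
            std::size_t end = text.find_first_of("(),:;", i);
            if (end == std::string::npos) end = text.size();
            const std::string label = text.substr(i, end - i);
            if (expect_new) {
                nodes.push_back(TreeNode{label, open.empty() ? -1 : open.back()});
            } else {
                nodes[static_cast<std::size_t>(closed)].name = label;
            }
            expect_new = false;
            i = end;
        }
    }
    if (!open.empty()) throw std::runtime_error("unbalanced '(' in Newick string");
    return nodes;
}
```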

GFF3 format: replace seqlen variable by pseudo feature

Instead of redundantly specifying the sequence length in each line, it should be placed in a pseudo feature similar to the NCBI genome annotation. This will shrink the overall size but make parsing slightly more complex and impose an order on the GFF3 lines.

Regexp for sequence identifiers

Many tools put space-separated information into the sequence identifier (assembly programs), others parse the identifiers and report only a substring (NCBI Blast), and some programs don't support spaces at all (LAST and the MAF format). We should consider using a regexp to handle these cases internally without having to modify the input FASTA files.
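
With C++11 this could be done with `std::regex`. The sketch below uses a hypothetical helper `extractId` that returns the first capture group of a user-supplied pattern and falls back to the full header if the pattern does not match.

```cpp
#include <regex>
#include <string>

// Extracts the identifier from a FASTA header line. The first capture group
// of the pattern is returned; without a match the header is returned unchanged.
std::string extractId(const std::string& header, const std::regex& pattern) {
    std::smatch m;
    if (std::regex_search(header, m, pattern) && m.size() > 1) {
        return m[1].str();
    }
    return header;
}
```

For example, the pattern `^(\S+)` would keep everything up to the first whitespace, which matches the behavior of programs that do not support spaces in identifiers.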
