
viralclust's Issues

NCBI Accession IDs enhancement

If a header contains several NCBI accession IDs, store all of them and check later whether any of them exists in the ncbi_metainfo.pkl file.
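A minimal sketch of the idea, assuming that ncbi_metainfo.pkl unpickles to a dict keyed by accession ID (the issue does not state this). The regular expression only approximates GenBank- and RefSeq-style accessions, and the helper names `candidate_accessions` and `first_known_accession` are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Approximate pattern (an assumption, not the official NCBI grammar):
# RefSeq-style "NC_045512" or GenBank-style "MN908947", optional ".version".
ACCESSION_PATTERN = re.compile(
    r"\b(?:[A-Z]{2}_\d{6,9}|[A-Z]{1,2}\d{5,8})(?:\.\d+)?\b"
)


def candidate_accessions(header: str) -> List[str]:
    """Return every accession-like token found in a FASTA header."""
    return ACCESSION_PATTERN.findall(header)


def first_known_accession(header: str, metainfo: Dict[str, object]) -> Optional[str]:
    """Return the first candidate that is a key of the meta information.

    Each candidate is tried with and without its version suffix, because
    the key format of the pickled dict is not known here.
    """
    for accession in candidate_accessions(header):
        for key in (accession, accession.split(".")[0]):
            if key in metainfo:
                return key
    return None
```

With a header such as `>MN908947.3 NC_045512.2 Severe acute respiratory syndrome coronavirus 2`, both IDs are kept as candidates and only the lookup decides which one is used.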

better restrictions for yml files

Currently, the conda environments in viralclust throw errors because of a numpy/scipy update. Pinning versions more strictly in the yml files should resolve this.

RAM vs Diskspace trade off

Using k-mer frequencies is very memory-demanding in the HDBSCAN module.
I am thinking of re-implementing this via a (maybe pickled) data dump on the hard drive that is read into memory on demand, line by line. More I/O, less RAM...
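One way this trade-off could look, as a rough sketch: each profile is pickled as its own record, so reading the dump back yields one record at a time instead of loading everything into memory. The function names and the `(header, profile)` record layout are assumptions for illustration, not the current hdbscan_virus.py code.

```python
import pickle
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, List[float]]  # (sequence header, k-mer profile), assumed layout


def dump_profiles(profiles: Iterable[Record], path: str) -> None:
    """Append-style dump: one pickle record per profile."""
    with open(path, "wb") as handle:
        for record in profiles:
            pickle.dump(record, handle, protocol=pickle.HIGHEST_PROTOCOL)


def iter_profiles(path: str) -> Iterator[Record]:
    """Yield the records one by one; only the current record is in memory."""
    with open(path, "rb") as handle:
        while True:
            try:
                yield pickle.load(handle)
            except EOFError:  # raised when the end of the dump is reached
                return
```

Any step that needs all profiles at once (for example a full distance matrix) would still have to hold them in RAM, so this only helps for the parts of the pipeline that can stream.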

invalid compressed data in ncbi update process

It seems that wget sometimes fails to download some tarballs correctly from the FTP server.
Re-downloading those might fix the issue; however, so far I can only detect invalid tar.gz files in the gunzip step.
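A possible way to catch broken archives before the gunzip step is to stream each file through Python's `gzip` module, which raises an error on truncated or corrupt data. This is a sketch only; `is_valid_gzip` and `files_to_redownload` are hypothetical helpers, and the re-download itself is left to the existing wget call.

```python
import gzip
import zlib
from typing import Iterable, List


def is_valid_gzip(path: str, chunk_size: int = 1 << 20) -> bool:
    """Decompress the whole file in chunks and report whether it is intact."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass  # data is discarded, we only care about errors
    except (OSError, EOFError, zlib.error):
        # OSError: not a gzip file or CRC mismatch; EOFError: truncated file;
        # zlib.error: corrupt compressed stream.
        return False
    return True


def files_to_redownload(paths: Iterable[str]) -> List[str]:
    """Return the archives that failed the integrity check."""
    return [path for path in paths if not is_valid_gzip(path)]
```

The check reads every archive once in full, so it costs extra I/O, but it separates the "detect" step from the "unpack" step and makes a targeted retry possible.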

NCBI collection year format

At the moment, the database contains many different formats and resolutions for the collection date.
Sorting via YYYY/MM/DD, and maybe removing the day field, should improve the overview.
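As an illustration, the normalization could be sketched as below. The list of input formats is an assumption about what occurs in the database and would need to be extended against the real data; the day field is dropped and year-only dates stay year-only. Note that `%b` (abbreviated month name) depends on the locale and is assumed to be English here.

```python
from datetime import datetime
from typing import Optional

# (input format, output format) pairs, tried in order. Assumed, not exhaustive.
_DATE_FORMATS = (
    ("%Y-%m-%d", "%Y/%m"),
    ("%Y/%m/%d", "%Y/%m"),
    ("%d-%b-%Y", "%Y/%m"),
    ("%Y-%m", "%Y/%m"),
    ("%b-%Y", "%Y/%m"),
    ("%Y", "%Y"),
)


def normalize_collection_date(raw: str) -> Optional[str]:
    """Return the date as YYYY/MM (or YYYY), or None if no format matches."""
    value = raw.strip()
    for input_format, output_format in _DATE_FORMATS:
        try:
            return datetime.strptime(value, input_format).strftime(output_format)
        except ValueError:
            continue  # try the next known format
    return None
```

Because the output is zero-padded and starts with the year, a plain string sort of the normalized values is also a chronological sort.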

sort_sequences: different default behavior

If no ORF is found at all, the "sorted" sequence file contains empty FASTA records.
This could be changed to report the original sequence instead of an empty one.
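The proposed default could be sketched as a small post-processing step, assuming both the original and the sorted sequences are available as dicts mapping header to sequence. The function name `with_original_fallback` is hypothetical and this is not the existing sort_sequences code.

```python
from typing import Dict


def with_original_fallback(
    original: Dict[str, str], sorted_records: Dict[str, str]
) -> Dict[str, str]:
    """Use the sorted sequence where one exists, else keep the original.

    A record counts as missing if its header is absent or its sequence
    is empty (the case where no ORF was found).
    """
    return {
        header: sorted_records.get(header) or sequence
        for header, sequence in original.items()
    }
```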

NCBI annotation via Regex

sigh
Apparently, the taxonomic lineage isn't consistent among viruses. Sometimes the species is the last entry, sometimes it is second to last, and everything else gets shifted as well.

A regex that looks for the family/genus suffix (-viridae and -virus) should help to avoid mis-annotations in my results.
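A sketch of such a suffix-based lookup, assuming the lineage is available as a list of rank names ordered from root to species. Only single-word names are accepted, so multi-word species names ending in "virus" are skipped, and the first single-word "-virus" entry is taken as the genus (a subgenus further down the lineage is ignored). These are assumptions that may not hold for every lineage.

```python
import re
from typing import Optional, Sequence, Tuple

FAMILY_PATTERN = re.compile(r"^\w+viridae$")
GENUS_PATTERN = re.compile(r"^\w+virus$")


def family_and_genus(lineage: Sequence[str]) -> Tuple[Optional[str], Optional[str]]:
    """Pick family and genus by suffix instead of by position in the lineage."""
    family = None
    genus = None
    for entry in lineage:
        name = entry.strip()
        if family is None and FAMILY_PATTERN.match(name):
            family = name
        elif genus is None and GENUS_PATTERN.match(name):
            genus = name
    return family, genus
```

For a lineage containing "Coronaviridae", "Orthocoronavirinae", "Betacoronavirus" and "Sarbecovirus", this returns "Coronaviridae" and "Betacoronavirus" regardless of how many ranks precede or follow them.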

Inconsistent use of the label revComp/revcomp

The process reverseComp is labeled with revcomp, but the config files (conda and local) use the label revComp (notice the upper case C). This leads to the process not using the python3 conda environment as intended.

HDBSCAN error for large dataset

Got this error message while running ViralClust with SARS-CoV-2 alpha genomes (152,307 non-redundant seqs).

Traceback (most recent call last):
  File "/home/nu76fet/programs/viralclust/bin/hdbscan_virus.py", line 663, in <module>
    perform_clustering()
  File "/home/nu76fet/programs/viralclust/bin/hdbscan_virus.py", line 604, in perform_clustering
    virusClusterer.determine_profile(multiPool)
  File "/home/nu76fet/programs/viralclust/bin/hdbscan_virus.py", line 267, in determine_profile
    allProfiles = p.map(self.profile, self.d_sequences.items())
  File "/home/nu76fet/programs/viralclust/conda/hdbscan-9fec0a1dfe235db7d7c78f1a0bba3ac9/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/nu76fet/programs/viralclust/conda/hdbscan-9fec0a1dfe235db7d7c78f1a0bba3ac9/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
  File "/home/nu76fet/programs/viralclust/conda/hdbscan-9fec0a1dfe235db7d7c78f1a0bba3ac9/lib/python3.7/multiprocessing/pool.py", line 431, in _handle_tasks
    put(task)
  File "/home/nu76fet/programs/viralclust/conda/hdbscan-9fec0a1dfe235db7d7c78f1a0bba3ac9/lib/python3.7/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/nu76fet/programs/viralclust/conda/hdbscan-9fec0a1dfe235db7d7c78f1a0bba3ac9/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
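The last frame packs the message length with `struct.pack("!i", n)`, a signed 32-bit integer, so a single pickled task sent to a worker cannot exceed 2147483647 bytes. One plausible reading (not confirmed in the issue) is that a chunk of tasks, possibly together with the pickled `self` behind the bound method `self.profile`, crossed that limit. The sketch below shows the limit and one possible workaround: a module-level worker function plus a small, explicit `chunksize`. The names `kmer_profile`, `compute_profiles`, `fits_in_py37_pipe` and the value of `K` are hypothetical and not taken from hdbscan_virus.py.

```python
import struct
from collections import Counter
from itertools import product
from multiprocessing import Pool
from typing import Dict, List, Tuple

K = 3  # hypothetical k-mer size, for illustration only
KMERS = ["".join(letters) for letters in product("ACGT", repeat=K)]


def fits_in_py37_pipe(n_bytes: int) -> bool:
    """Mirror the length check that fails in the traceback above."""
    try:
        struct.pack("!i", n_bytes)
    except struct.error:
        return False
    return True


def kmer_profile(item: Tuple[str, str]) -> Tuple[str, List[float]]:
    """Module-level worker: only (header, sequence) is pickled per task."""
    header, sequence = item
    counts = Counter(sequence[i:i + K] for i in range(len(sequence) - K + 1))
    total = sum(counts[kmer] for kmer in KMERS) or 1  # avoid division by zero
    return header, [counts[kmer] / total for kmer in KMERS]


def compute_profiles(
    sequences: Dict[str, str], processes: int = 2, chunksize: int = 100
) -> Dict[str, List[float]]:
    """Send small chunks so that no single message approaches the limit."""
    with Pool(processes) as pool:
        return dict(pool.imap(kmer_profile, sequences.items(), chunksize=chunksize))
```

With 100 genomes of roughly 30 kb per chunk, each message stays in the low megabytes, far below the 2 GiB boundary. Whether this is sufficient for the 152,307-sequence data set would have to be tested.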

parameter/profiles for global and local cluster structures

Extend the NF pipeline with two parameter config profiles (--local and --global?). Each profile should set the parameters of all clustering tools such that the corresponding structures are found, i.e. the resolution of the cluster structures is adjusted here.
