phist's People

Contributors

agudys, aziele, sebastiandeorowicz

phist's Issues

Problems handling errors in reference genomes

I'm trying to run PHIST on a single viral contig vs a large database of gzipped bacterial genomes. The program finishes without errors, but stops after the first ~715 reference genomes and reports these in the *_common_kmers.csv output file. Additionally, the number of reference genomes processed by PHIST is non-deterministic. Sometimes it stops after 715, 720, or 730 genomes processed.

Update: I unzipped the reference genomes. Now the program is printing a warning for certain genomes and stalling in what appears to be an infinite loop. After CTRL+C, here's the error message:

File "PHIST/phist.py", line 107, in
subprocess.run(cmd)
File "/usr/lib64/python3.6/subprocess.py", line 425, in run
stdout, stderr = process.communicate(input, timeout=timeout)
File "/usr/lib64/python3.6/subprocess.py", line 855, in communicate
self.wait()
File "/usr/lib64/python3.6/subprocess.py", line 1477, in wait
(pid, sts) = self._try_wait(0)
File "/usr/lib64/python3.6/subprocess.py", line 1424, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

Turns out these genomes were empty files. After uncompressing and deleting these, the program finished without errors.

It would be great if the program handled these empty or corrupted reference genomes better: either a warning that skips over them, or an error with an informative message.
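Until something like that exists in the tool, the reference folder can be screened before a run. The following is a minimal sketch, not part of PHIST: the helper names `is_usable_fasta` and `split_references` are hypothetical, and it assumes references are plain or gzipped FASTA files where a usable file has at least one header line followed by sequence.

```python
import gzip
import zlib
from pathlib import Path


def is_usable_fasta(path):
    """Return True if the file opens cleanly and has a header plus sequence."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    has_header = False
    try:
        with opener(path, "rt") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                if line.startswith(">"):
                    has_header = True
                elif has_header:
                    return True  # a sequence line follows a header
    except (OSError, EOFError, UnicodeDecodeError, zlib.error):
        # Truncated or corrupt gzip streams and unreadable files end up here.
        return False
    return False  # empty file, or headers without any sequence


def split_references(directory):
    """Partition the files in a directory into usable and rejected lists."""
    usable, rejected = [], []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            (usable if is_usable_fasta(path) else rejected).append(path)
    return usable, rejected
```

Printing the `rejected` list gives the files to inspect or move out of the reference folder before starting the run.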

A question about k-mer-based prediction

Dear agudys:
Thanks for your great work on this tool. While doing host prediction with PHIST, a question came up: I used VirSorter2 to find putative viral sequences in MAGs. Should I cut the viral region out of the MAG before running the prediction? As far as I know, some binning tools are also based on k-mer frequencies. Is that the same principle PHIST uses? If so, it sounds as though prediction is unnecessary for viral contigs found in MAGs, but some contamination still remains in MAGs. I'd appreciate your suggestions. Thanks a lot!

Common K-Mers

Is there a minimum number of common k-mers to have a high quality host prediction or is 1 accurate?

File load failed

I'm getting a "File load failed" warning during the "Processing queries..." step that causes the program to hang. The program rapidly gets through the first 3500 reference genomes, but then gets stuck at this point. I'm not sure which input genome is causing the issues (there are >200,000 of them).

When I kill the program, here's the output:

File "PHIST/phist.py", line 107, in
subprocess.run(cmd)
File "/usr/lib64/python3.6/subprocess.py", line 425, in run
stdout, stderr = process.communicate(input, timeout=timeout)
File "/usr/lib64/python3.6/subprocess.py", line 855, in communicate
self.wait()
File "/usr/lib64/python3.6/subprocess.py", line 1477, in wait
(pid, sts) = self._try_wait(0)
File "/usr/lib64/python3.6/subprocess.py", line 1424, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
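The traceback shows the script blocked inside `subprocess.run(cmd)`, waiting on a child process that never exits. One way to turn such a hang into a visible failure is to give the call a timeout. This is a minimal sketch rather than PHIST's code: the wrapper name `run_with_timeout` and the idea of choosing a time limit are assumptions for illustration.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in command that sleeps longer than the allowed time.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    if run_with_timeout(slow, timeout_s=1) is None:
        print("command timed out; check the inputs it was given")
```

A timeout does not identify the offending genome by itself, but it stops the run from stalling silently.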

Strange behavior with changing -k flag

I know the README recommends a kmer size between 25 and 30 bp, but I was hoping to use a longer size (to avoid kmer matches for CRISPR spacers).

I tried kmers of 25, 30, 50 and 100 bp. The behavior of 25 and 30 bp was expected -- the top hits were the same, but with fewer kmer matches when k=30. For k=50, the top hit changed and it contained 2x as many kmer matches. For k=100, there are multiple hits reported for each query genome, each with many kmer hits.

I'd suggest having the program raise an error if the user supplies a kmer value above 30 and also including this information in the help text. Ideally, it would be great to be able to use longer kmer lengths.
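The suggested check could look roughly like the sketch below. It is not taken from PHIST: the bounds `K_MIN` and `K_MAX`, the default of 25, and the helper names are assumptions chosen to match the 25-30 bp recommendation mentioned above.

```python
import argparse

K_MIN, K_MAX = 3, 30  # assumed bounds for illustration; not taken from PHIST


def bounded_k(text):
    """argparse type: parse an integer k and reject values outside the range."""
    try:
        k = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"k must be an integer, got {text!r}")
    if not K_MIN <= k <= K_MAX:
        raise argparse.ArgumentTypeError(
            f"k must be between {K_MIN} and {K_MAX}, got {k}"
        )
    return k


def build_parser():
    """Build a parser whose -k option validates its value and documents the range."""
    parser = argparse.ArgumentParser(description="k-mer length validation sketch")
    parser.add_argument(
        "-k", type=bounded_k, default=25,
        help=f"k-mer length ({K_MIN}-{K_MAX}, default: %(default)s)",
    )
    return parser
```

With this in place, `-k 50` makes argparse exit with a usage error naming the allowed range, and the same range appears in the `--help` text.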

p-value cutoff

Thanks for the great tool. It runs really fast. Is there a p-value cutoff that you suggest to use?

Input folder also reads existing subfolders

Hi @agudys,

Thanks again for fixing the earlier issues with the output directory. There is still a potential issue: all the files and folders in the input folder are read into the input vir.list.

For example, my input folder has the following, where the first is a folder containing tmp bins, and the actual file to run would be the dereplicated_bins.fna

all_bins  (folder)
dereplicated_bins.fna

It would help to provide a sanity check of sorts, where the vir.list only reads those with a particular fasta extension such as .fa, .fna, .fasta, .FA, or .FASTA

Thank you!
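The requested sanity check can be sketched as follows. This is an illustration only, not PHIST's implementation: `list_fasta_files` and `FASTA_SUFFIXES` are hypothetical names, and accepting a trailing `.gz` is an assumption based on gzipped inputs being mentioned in another issue.

```python
from pathlib import Path

FASTA_SUFFIXES = {".fa", ".fna", ".fasta"}  # compared case-insensitively


def list_fasta_files(input_dir):
    """Return regular files in input_dir that carry a FASTA extension."""
    found = []
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue  # skips subfolders such as all_bins
        suffixes = [s.lower() for s in path.suffixes]
        if suffixes and suffixes[-1] == ".gz":
            suffixes = suffixes[:-1]  # look at the extension under .gz
        if suffixes and suffixes[-1] in FASTA_SUFFIXES:
            found.append(path)
    return found
```

For the folder described above, only `dereplicated_bins.fna` would be listed, and the `all_bins` folder would be ignored.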

Installation error: g++: Error:unrecognized command line option ‘-fno-plt’

Hi,
I get the following error when I run make:

make
g++ -std=c++11 -O3 utils/phist.cpp -o utils/phist
g++ -std=c++11 -o utils/matcher -I./kmer-db/libs -O3 utils/matcher.cpp utils/input_file.cpp -lz
make -C kmer-db
make[1]: Entering directory '/home/soft/PHIST/kmer-db'
g++ -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/miniconda3/include -Wall -O3 -m64 -std=c++11 -fopenmp -pthread -mavx -I src/kmc_api -I "" -c src/analyzer.cpp -o src/analyzer.o
g++: Error:unrecognized command line option ‘-fno-plt’
make[1]: *** [makefile:84: src/analyzer.o] Error 1
make[1]: Leaving directory '/home/soft/PHIST/kmer-db'
make: *** [makefile:12: subsystem] Error 2

No `vir.list` file found

Hi,

I'm trying to run PHIST in a snakemake workflow. However, I'm getting the error below:

Traceback (most recent call last):
  File "/mnt/irisgpfs/projects/nomis/assemblies/viromes/workflow/rules/../../deps/phist/phist.py", line 78, in <module>
    oh = open(lst_path, 'w')
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/irisgpfs/projects/nomis/assemblies/virome_results/phist/vir.list'

My run command looks like this:

python3 /mnt/irisgpfs/projects/nomis/assemblies/viromes/workflow/rules/../../deps/phist/phist.py -t 24 $(dirname /work/projects/nomis/assemblies/virome_results/vrhyme/dereplicated_bins.fna) /scratch/users/sbusi/collected_bins_20220720 /work/projects/nomis/assemblies/virome_results/phist/common_kmers.csv /work/projects/nomis/assemblies/virome_results/phist/predictions.csv

Is this an issue with creating a tmp folder?
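The traceback fails at `open(lst_path, 'w')`. In write mode, a `FileNotFoundError` usually means the parent directory of the target path does not exist yet, so one plausible cause is that the `phist` output folder had not been created before the run. A minimal sketch of a guard, assuming that is the cause (the helper name `open_list_file` is hypothetical and not part of PHIST):

```python
import os


def open_list_file(lst_path):
    """Open lst_path for writing, creating its parent directory if needed."""
    parent = os.path.dirname(os.path.abspath(lst_path))
    os.makedirs(parent, exist_ok=True)  # no error if it already exists
    return open(lst_path, "w")
```

In a workflow, creating the output directory in the rule before calling the script (for example with `mkdir -p`) would have the same effect.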
