refresh-bio / phist

Phage-Host Interaction Search Tool

License: GNU General Public License v3.0

Python 15.66% C++ 80.79% Makefile 3.56%

bacterial-genomes bioinformatics genomics host-prediction k-mers phages

phist's Issues

Problems handling errors in reference genomes

I'm trying to run PHIST on a single viral contig against a large database of gzipped bacterial genomes. The program finishes without errors, but it stops after roughly the first 715 reference genomes and reports only these in the *_common_kmers.csv output file. Additionally, the number of reference genomes processed by PHIST is non-deterministic: depending on the run, it stops after 715, 720, or 730 genomes.

Update: I unzipped the reference genomes. Now the program is printing a warning for certain genomes and stalling in what appears to be an infinite loop. After CTRL+C, here's the error message:

  File "PHIST/phist.py", line 107, in <module>
    subprocess.run(cmd)
  File "/usr/lib64/python3.6/subprocess.py", line 425, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib64/python3.6/subprocess.py", line 855, in communicate
    self.wait()
  File "/usr/lib64/python3.6/subprocess.py", line 1477, in wait
    (pid, sts) = self._try_wait(0)
  File "/usr/lib64/python3.6/subprocess.py", line 1424, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

Turns out these genomes were empty files. After uncompressing and deleting these, the program finished without errors.

It would be great if the program handled these corrupted reference genomes more gracefully: either a warning that skips over them, or an error with an informative message.
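Until the tool handles this itself, one workaround is to screen the reference directory before running PHIST. The sketch below is not part of PHIST; the function names `is_empty_genome` and `find_empty_genomes` are hypothetical. It flags files that are zero bytes long, gzip archives that decompress to nothing, and gzip archives that cannot be read at all.

```python
import gzip
from pathlib import Path


def is_empty_genome(path: Path) -> bool:
    """Return True if the file holds no data (plain or gzip-compressed)."""
    if path.stat().st_size == 0:
        return True
    if path.suffix == ".gz":
        try:
            with gzip.open(path, "rb") as handle:
                # An archive with an empty payload yields no bytes at all.
                return handle.read(1) == b""
        except (OSError, EOFError):
            # Unreadable or truncated archive: treat it as unusable.
            return True
    return False


def find_empty_genomes(directory: str) -> list:
    """List files directly inside `directory` that are empty or unreadable."""
    return sorted(p for p in Path(directory).iterdir()
                  if p.is_file() and is_empty_genome(p))
```

Printing the result of `find_empty_genomes("path/to/references")` gives a list of candidates to inspect or move aside before the run. The check only looks at emptiness; it does not validate FASTA content.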

Threshold of p-value

What p-value indicates a reliable host prediction?

a question of k-mer based prediction

Dear agudys:
Thanks for your great work on this tool. While doing host prediction with PHIST, a question came up: I used VirSorter2 to find putative viral sequences in MAGs. Should I cut the viral region out of the MAG before running the prediction? As far as I know, some binning tools are also based on k-mer frequencies. Is that the same principle PHIST uses? If so, it sounds like there is no need to predict hosts for viral contigs found in MAGs, but MAGs can still contain contamination. I look forward to your suggestions. Thanks a lot!

Common K-Mers

Is there a minimum number of common k-mers required for a high-quality host prediction, or is a single common k-mer sufficient?

File load failed

I'm getting a "File load failed" warning during the "Processing queries..." step that causes the program to hang. The program rapidly gets through the first 3500 reference genomes, but then gets stuck at this point. I'm not sure which input genome is causing the issues (there are >200,000 of them).

When I kill the program, I get the following output:

  File "PHIST/phist.py", line 107, in <module>
    subprocess.run(cmd)
  File "/usr/lib64/python3.6/subprocess.py", line 425, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib64/python3.6/subprocess.py", line 855, in communicate
    self.wait()
  File "/usr/lib64/python3.6/subprocess.py", line 1477, in wait
    (pid, sts) = self._try_wait(0)
  File "/usr/lib64/python3.6/subprocess.py", line 1424, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
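The traceback shows the Python wrapper blocked in `subprocess.run(cmd)`, waiting for a child process that never exits. A minimal sketch of how a caller could bound that wait, using the standard `timeout` argument of `subprocess.run`, follows. The helper name `run_with_timeout` is hypothetical and this is not how phist.py is written; it only illustrates turning a silent hang into a detectable condition.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run `cmd`; return its exit code, or None if it exceeded `seconds`."""
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising,
        # so no stalled process is left behind.
        return None


# Stand-in command that sleeps longer than the limit (not a real PHIST call).
slow_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
```

Calling `run_with_timeout(slow_cmd, 1)` returns `None` after about one second instead of blocking. A suitable limit for a real run depends on the database size, so any fixed value is a guess.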

Strange behavior with changing -k flag

I know the README recommends a k-mer size between 25 and 30 bp, but I was hoping to use a longer size (to avoid k-mer matches for CRISPR spacers).

I tried k-mers of 25, 30, 50, and 100 bp. The behavior for 25 and 30 bp was as expected: the top hits were the same, but there were fewer k-mer matches when k=30. For k=50, the top hit changed and it contained 2x as many k-mer matches. For k=100, multiple hits are reported for each query genome, each with many k-mer hits.

I'd suggest having the program raise an error if the user supplies a k-mer value above 30, and also including this information in the help text. Ideally, it would be great to be able to use longer k-mer lengths.
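The suggested check could look roughly like the following argparse sketch. It is an illustration, not PHIST's actual argument parsing: the names `kmer_length`, `build_parser`, and `MAX_K` are hypothetical, the upper limit of 30 is simply the value proposed in this issue, and the default of 25 is chosen only for the example.

```python
import argparse

MAX_K = 30  # assumed limit, taken from the suggestion in this issue


def kmer_length(value: str) -> int:
    """argparse type: accept integers in 1..MAX_K, reject everything else."""
    k = int(value)  # a ValueError here is reported by argparse as a usage error
    if not 1 <= k <= MAX_K:
        raise argparse.ArgumentTypeError(
            f"k must be between 1 and {MAX_K}, got {k}")
    return k


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose help text documents the accepted range."""
    parser = argparse.ArgumentParser(description="k-mer length validation sketch")
    parser.add_argument("-k", type=kmer_length, default=25,
                        help=f"k-mer length (1-{MAX_K}); default: %(default)s")
    return parser
```

With this in place, `build_parser().parse_args(["-k", "50"])` exits with a usage error that names the allowed range, and `--help` shows the same limit.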

p-value cutoff

Thanks for the great tool. It runs really fast. Is there a p-value cutoff that you suggest to use?

Input folder also reads existing subfolders

Hi @agudys,

Thanks again for fixing the earlier issues with the output directory. One potential issue remains: every file and folder in the input folder is written to the input vir.list.

For example, my input folder contains the following. The first entry is a folder of temporary bins, and the only file that should be processed is dereplicated_bins.fna:

all_bins  (folder)
dereplicated_bins.fna

It would help to add a sanity check so that vir.list only includes files with a FASTA extension such as .fa, .fna, .fasta, .FA, or .FASTA.

Thank you!
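A minimal sketch of the extension filter requested above, assuming the list is built from the entries of a single directory. The helper `list_fasta_files` and the constant `FASTA_SUFFIXES` are hypothetical names, not PHIST code. Accepting a trailing `.gz` is also an assumption, based on another issue on this page that mentions gzipped genomes.

```python
from pathlib import Path

FASTA_SUFFIXES = {".fa", ".fna", ".fasta"}  # compared case-insensitively


def list_fasta_files(directory: str) -> list:
    """Return regular files in `directory` whose name ends in a FASTA suffix.

    A trailing ".gz" is ignored, so "genome.fna.gz" also qualifies.
    Subdirectories are neither descended into nor returned.
    """
    selected = []
    for entry in sorted(Path(directory).iterdir()):
        if not entry.is_file():
            continue  # skips folders such as all_bins
        suffixes = [s.lower() for s in entry.suffixes]
        if suffixes and suffixes[-1] == ".gz":
            suffixes.pop()
        if suffixes and suffixes[-1] in FASTA_SUFFIXES:
            selected.append(entry)
    return selected
```

For the folder described in this issue, the function would return only `dereplicated_bins.fna` and leave `all_bins` out of the list.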

Installation error: g++: Error：unrecognized command line option ‘-fno-plt’

Hi,
I get the following error when I run make:

make
g++ -std=c++11 -O3 utils/phist.cpp -o utils/phist
g++ -std=c++11 -o utils/matcher -I./kmer-db/libs -O3 utils/matcher.cpp utils/input_file.cpp -lz
make -C kmer-db
make[1]: Entering directory '/home/soft/PHIST/kmer-db'
g++ -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/miniconda3/include -Wall -O3 -m64 -std=c++11 -fopenmp -pthread -mavx -I src/kmc_api -I "" -c src/analyzer.cpp -o src/analyzer.o
g++: Error：unrecognized command line option ‘-fno-plt’
make[1]: *** [makefile:84: src/analyzer.o] Error 1
make[1]: Leaving directory '/home/soft/PHIST/kmer-db'
make: *** [makefile:12: subsystem] Error 2

No `vir.list` file found

Hi,

I'm trying to run PHIST in a Snakemake workflow. However, I'm getting the error below:

Traceback (most recent call last):
  File "/mnt/irisgpfs/projects/nomis/assemblies/viromes/workflow/rules/../../deps/phist/phist.py", line 78, in <module>
    oh = open(lst_path, 'w')
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/irisgpfs/projects/nomis/assemblies/virome_results/phist/vir.list'

My run command looks like this:

python3 /mnt/irisgpfs/projects/nomis/assemblies/viromes/workflow/rules/../../deps/phist/phist.py -t 24 $(dirname /work/projects/nomis/assemblies/virome_results/vrhyme/dereplicated_bins.fna) /scratch/users/sbusi/collected_bins_20220720 /work/projects/nomis/assemblies/virome_results/phist/common_kmers.csv /work/projects/nomis/assemblies/virome_results/phist/predictions.csv

Is this an issue with creating a tmp folder?
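When `open(path, 'w')` raises `FileNotFoundError`, a directory in the path does not exist, because write mode creates the file itself but not its parent directories. One possible cause here, not confirmed in this thread, is that the `phist` output directory has not been created when the script tries to write `vir.list`. A minimal sketch of a defensive open follows; the helper name `open_for_writing` is hypothetical.

```python
import os


def open_for_writing(path: str):
    """Create the parent directory of `path` if needed, then open it for writing."""
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=True)  # no error if it already exists
    return open(path, "w")
```

In a workflow, the same effect can be had by creating the output directory before the PHIST command runs.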
