gbouras13 / pharokka

fast phage annotation program

License: MIT License

pharokka

Extra special thanks to Ghais Houtak for making Pharokka's logo.

Fast Phage Annotation Tool

pharokka is a rapid standardised annotation tool for bacteriophage genomes and metagenomes.

If you are looking for rapid standardised annotation of bacterial genomes, please use Bakta. Prokka, which inspired the creation and naming of pharokka, is another good option, but Bakta is Prokka's worthy successor.

phold

If you like pharokka, you will probably love phold. phold uses structural homology to improve phage annotation. Benchmarking is ongoing but phold strongly outperforms pharokka in terms of annotation, particularly for less characterised phages such as those from metagenomic datasets.

pharokka still has features phold lacks for now (identifying tRNA, tmRNA, CRISPR repeats, and INPHARED taxonomy search), so it is recommended to run phold after running pharokka.

phold takes the Genbank output of Pharokka as input. Therefore, if you have already annotated your phage(s) with Pharokka, you can easily update the annotation with more functional predictions with phold.

Google Colab Notebooks

If you don't want to install pharokka or phold locally, you can run pharokka and phold (and phynteny), or only pharokka, without any code using the Google Colab notebook.

phynteny uses a long short-term memory (LSTM) model trained on phage synteny (the conserved gene order across phages) to assign hypothetical phage proteins to a PHROG category. It might help you add extra PHROG category annotations to hypothetical genes remaining after you run phold.
Note: Phynteny will work only if your phage has fewer than 120 predicted proteins.
You can still use this notebook to run pharokka and/or phold if your phage(s) are too big - just don't run the Phynteny step!

pharokka
- Fast Phage Annotation Tool
phold
Google Colab Notebooks
Table of Contents
Quick Start
Documentation
Paper
Pharokka with Galaxy Europe Webserver
Brief Overview
- Pharokka v 1.7.0 Update
- Pharokka v 1.6.0 Update (11 January 2024)
- Pharokka v 1.5.0 Update (20 September 2023)
- Pharokka v 1.4.0 Update (27 August 2023)
- Pharokka v 1.3.0 Update
Installation
- Conda Installation
- Pip
- Source
Database Installation
Beginner Conda Installation
Usage
Version Log
System
Time
Benchmarking v1.5.0
Benchmarking v1.4.0
Original Benchmarking (v1.1.0)
Bugs and Suggestions
Citation

Quick Start

The easiest way to install pharokka is via conda:

conda install -c bioconda pharokka

Followed by database download and installation:

install_databases.py -o <path/to/database_dir>

And finally annotation:

pharokka.py -i <phage fasta file> -o <output directory> -d <path/to/database_dir> -t <threads>

As of pharokka v1.4.0, if you want extremely rapid PHROG annotations, use --fast:

pharokka.py -i <phage fasta file> -o <output directory> -d <path/to/database_dir> -t <threads> --fast
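
For scripted runs, the annotation command above can also be assembled and launched from Python. The following is a minimal sketch and not part of pharokka itself: the helper names `build_pharokka_command` and `run_pharokka` and the example paths are hypothetical, and only the flags shown in this Quick Start (`-i`, `-o`, `-d`, `-t`, `--fast`) are used.

```python
import subprocess


def build_pharokka_command(fasta, outdir, database_dir, threads=1, fast=False):
    """Assemble the pharokka.py command line as a list of arguments.

    Hypothetical helper for illustration; it only uses the flags shown
    in the Quick Start (-i, -o, -d, -t and --fast).
    """
    cmd = [
        "pharokka.py",
        "-i", str(fasta),
        "-o", str(outdir),
        "-d", str(database_dir),
        "-t", str(threads),
    ]
    if fast:
        cmd.append("--fast")
    return cmd


def run_pharokka(fasta, outdir, database_dir, threads=1, fast=False):
    """Run pharokka and raise CalledProcessError if it exits with an error."""
    cmd = build_pharokka_command(fasta, outdir, database_dir, threads, fast)
    subprocess.run(cmd, check=True)


# The paths below are placeholders; printing the command does not need pharokka installed.
example_cmd = build_pharokka_command(
    "phage.fasta", "pharokka_out", "pharokka_db", threads=8, fast=True
)
print(" ".join(example_cmd))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when paths contain spaces.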

Documentation

Check out the full documentation at https://pharokka.readthedocs.io.

Paper

pharokka has been published in Bioinformatics:

George Bouras, Roshan Nepal, Ghais Houtak, Alkis James Psaltis, Peter-John Wormald, Sarah Vreugde, Pharokka: a fast scalable bacteriophage annotation tool, Bioinformatics, Volume 39, Issue 1, January 2023, btac776, https://doi.org/10.1093/bioinformatics/btac776.

If you use pharokka, please see the full Citation section for a list of all programs pharokka uses, in order to fully recognise the creators of these tools for their work.

Pharokka with Galaxy Europe Webserver

Thanks to some amazing assistance from Paul Zierep, you can run pharokka using the Galaxy Europe webserver. There is no plotting functionality at the moment.

So if you can't get pharokka to install on your machine for whatever reason or want a GUI to annotate your phage(s), please give it a go there.

Brief Overview

pharokka uses PHANOTATE, the only gene prediction program tailored to bacteriophages, as the default program for gene prediction. Prodigal (implemented with pyrodigal) and Prodigal-gv (implemented with pyrodigal-gv) are also available as alternatives. Following this, functional annotations are assigned by matching each predicted coding sequence (CDS) to the PHROGs, CARD and VFDB databases using MMseqs2. As of v1.4.0, pharokka will also match each CDS to the PHROGs database with more sensitive Hidden Markov Models (HMMs) using PyHMMER. pharokka's main output is a GFF file suitable for use in downstream pangenomic pipelines like Roary. pharokka also generates a cds_functions.tsv file, which includes counts of CDSs, tRNAs, tmRNAs, CRISPRs and functions assigned to CDSs according to the PHROGs database. See the full usage and check out the full documentation for more details.
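
As a quick sanity check on the GFF output, feature types can be tallied with a few lines of Python. This is a sketch that assumes only the generic GFF3 layout (nine tab-separated columns, feature type in the third) and stops at a `##FASTA` section if one is present. The helper name and the example records are made up for illustration and are not real pharokka output.

```python
from collections import Counter


def count_gff_feature_types(lines):
    """Count GFF3 feature types (column 3) from an iterable of lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # anything after this marker is sequence, not features
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF3 feature line
        counts[fields[2]] += 1
    return counts


# Made-up records for illustration only (not real pharokka output).
example_gff = [
    "##gff-version 3",
    "contig_1\tPHANOTATE\tCDS\t1\t300\t.\t+\t0\tID=EXAMPLE_0001",
    "contig_1\tPHANOTATE\tCDS\t350\t900\t.\t-\t0\tID=EXAMPLE_0002",
    "contig_1\ttRNAscan-SE\ttRNA\t950\t1020\t.\t+\t.\tID=EXAMPLE_0003",
    "##FASTA",
    ">contig_1",
    "ACGT",
]
feature_counts = count_gff_feature_types(example_gff)
print(dict(feature_counts))
```

To use it on a real run, pass an open file handle for the GFF file in your output directory instead of `example_gff`.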

Pharokka v 1.7.0 Update

You can run pharokka_multiplotter.py to plot as many phage(s) as you want.

It requires the pharokka output Genbank file (here, pharokka.gbk). It will save plots for each contig in the output directory (here pharokka_plots_output_directory).

e.g.

pharokka_multiplotter.py -g pharokka.gbk -o pharokka_plots_output_directory

Pharokka v 1.6.0 Update (11 January 2024)

Fixes a variety of bugs (#300 pharokka_proteins.py crashing if it found VFDB hits, #303 errors in the .tbl format, #316 errors with types and where custom HMM dbs had identical scored hits, #317 types and #320 deprecated GC function)
Adds --mash_distance and --minced_args as parameters (#299 thanks @iferres).

Pharokka v 1.5.0 Update (20 September 2023)

Adds support for pyrodigal-gv implementing prodigal-gv as a gene predictor for alternate genetic codes (pyrodigal-gv and prodigal-gv). This can be specified with -g prodigal-gv and is recommended for metagenomic input datasets. Thanks to @althonos and @apcamargo for making this possible, and to @asierFernandezP for raising this as an issue in the first place here.
-g prodigal and -g prodigal-gv should be much faster thanks to multithread support added by the inimitable @althonos.
Adds checks to determine if your input FASTA has duplicated contig headers. Thanks @thauptfeld for raising this.
Genbank format output will be designated with PHG not VRL.
The _length_gc_cds_density.tsv and _cds_final_merged_output.tsv files now contain the translation table/genetic code for each contig.
--skip_mash flag added to skip finding the closest match for each contig in INPHARED using mash.
--skip_extra_annotations flag added to skip running tRNA-scanSE, MinCED and Aragorn in case you only want CDS predictions and functional annotations.
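
If you want to check your input for duplicated contig headers before running pharokka, something like the following works. It is a sketch under stated assumptions and not pharokka's own check: it compares only the identifier (the text after `>` up to the first whitespace), whereas pharokka's internal check may compare headers differently. The function name is hypothetical.

```python
from collections import Counter


def duplicated_fasta_ids(lines):
    """Return the sorted list of FASTA identifiers that occur more than once."""
    ids = Counter()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            # The identifier is the first whitespace-delimited token of the header.
            ids[parts[0] if parts else ""] += 1
    return sorted(name for name, n in ids.items() if n > 1)


# Small in-memory example; with a real file, pass an open file handle instead.
example_fasta = [
    ">contig_1 first copy",
    "ACGTACGT",
    ">contig_2",
    "GGCCGGCC",
    ">contig_1 second copy",
    "TTAATTAA",
]
print(duplicated_fasta_ids(example_fasta))
```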

Pharokka v 1.4.0 Update (27 August 2023)

pharokka v1.4.0 is a large update implementing:

More sensitive search for PHROGs using Hidden Markov Models (HMMs) using the amazing PyHMMER.
By default, pharokka will now run searches using both MMseqs2 (PHROGs, CARD and VFDB) and HMMs (PHROGs). MMseqs2 was kept for PHROGs as it provides more information than the HMM results (e.g. sequence alignment identities & top hit PHROG protein) if it finds a hit.
--fast or --hmm_only which only runs PyHMMER on PHROGs. It will not run MMseqs2 at all on PHROGs, CARD or VFDB. For phage isolates, this will be much faster than v1.3.2, but you will not get CARD or VFDB annotations. For metagenomes, this will be (much) slower though!
Other changes in the codebase should make pharokka v1.4.0 run somewhat faster than v1.3.2, even if PyHMMER is not used and --mmseqs2_only is specified.
Updated databases as of 23 August 2023. You will need to download the new pharokka v1.4.0 databases. The VFDB database is now clustered at 50% sequence identity (which speeds up runtime).
--mmseqs2_only which will essentially run pharokka v1.3.2 and is default in meta mode -m or --meta.
pharokka_proteins.py, which takes an input file of amino acid proteins in FASTA format and runs MMseqs2 (PHROGs, CARD, VFDB) and PyHMMER (PHROGs). See the proteins documentation for more details.
--custom_hmm, which allows for custom HMM profile databases to be used with pharokka.
create_custom_hmm.py which facilitates the creation of a HMM profile database from multiple sequence alignments. See the documentation for more details about how to create a compatible HMM profile database.
--dnaapler, which automatically detects and reorients your phage to start with the large terminase subunit. For more information, see dnaapler.
--genbank, which allows for genbank format input with -i. This will take all (custom) CDS calls in the genbank file, and PHANOTATE/pyrodigal will not be run. So if you have done manual gene curation, this option is recommended.
Fixes to -c, which should now work with -g prodigal (thanks Alistair Legione for the fixes).

Pharokka v 1.3.0 Update

pharokka v1.3.0 implements pharokka_plotter.py, which creates a simple circular genome plot using the amazing pyCirclize package with output in PNG format. All CDS are coloured according to their PHROG functional group.

It is reasonably customisable and is designed for single input phage contigs. If an input FASTA with multiple contigs is entered, it will only plot the first contig.

It requires the input FASTA, pharokka output directory, and the -p or --prefix value used with pharokka if specified.

You can run pharokka_plotter.py in the following form

pharokka_plotter.py -i input.fasta -n pharokka_plot -o pharokka_output_directory

This will create pharokka_plot.png as an output file plot of your phage.

An example plot is included below made with the following command (assuming Pharokka has been run with SAOMS1_pharokka_output_directory as the output directory).

pharokka_plotter.py -i test_data/SAOMS1.fasta -n SAOMS1_plot -o SAOMS1_pharokka_output_directory --interval 8000 --annotations 0.5 --plot_title '${Staphylococcus}$ Phage SAOMS1'

SAOMS1 phage (GenBank: MW460250.1) was isolated and sequenced by: Yerushalmy, O., Alkalay-Oren, S., Coppenhagen-Glazer, S. and Hazan, R. from the Institute of Dental Sciences and School of Dental Medicine, Hebrew University, Israel.

Please see plotting for details on all plotting parameter options.

Installation

Conda Installation

The easiest way to install pharokka is via conda. For inexperienced command line users, this method is highly recommended.

conda install -c bioconda pharokka

This will install all the dependencies along with pharokka. The dependencies are listed in environment.yml.

If conda is taking a long time to solve the environment, try using mamba:

conda install mamba
mamba install -c bioconda pharokka

Pip

As of v1.4.0, you can also install the python components of pharokka with pip.

pip install pharokka

You will still need to install the non-python dependencies manually.

Source

Alternatively, the development version of pharokka (which may include new, untested features) can be installed manually via github.

git clone https://github.com/gbouras13/pharokka.git
cd pharokka
pip install -e .
pharokka.py --help

The dependencies found in environment.yml will then need to be installed manually.

For example using conda to install the required dependencies:

conda env create -f environment.yml
conda activate pharokka_env
# assuming you are in the pharokka directory 
# installs pharokka from source
pip install -e .
pharokka.py --help

Database Installation

To install the pharokka database to the default directory:

install_databases.py -d

If you would like to specify a different database directory (recommended), that can be achieved as follows:

install_databases.py -o <path/to/database_dir>

If this does not work, you can alternatively download the databases from Zenodo at https://zenodo.org/record/8276347/files/pharokka_v1.4.0_databases.tar.gz and untar the directory in a location of your choice.

If you prefer to use the command line:

wget "https://zenodo.org/record/8276347/files/pharokka_v1.4.0_databases.tar.gz"
tar -xzf pharokka_v1.4.0_databases.tar.gz

which will create a directory called "pharokka_v1.4.0_databases" containing the databases.

Beginner Conda Installation

If you are new to using the command-line, please install conda using the following instructions.

Install Anaconda. I would recommend miniconda.
Assuming you are using a Linux x86_64 machine (for other architectures, please replace the URL with the appropriate one on the miniconda website).

curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

For Mac (Intel, will also work with M1):

curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh

Install miniconda and follow the prompts.

sh Miniconda3-latest-Linux-x86_64.sh

After installation is complete, you should add the following channels to your conda configuration:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

After this, conda should be installed (you may need to restart your terminal). It is recommended that mamba is also installed, as it will solve the environment quicker than conda:

conda install mamba

Finally, I would recommend installing pharokka into a fresh environment. For example to create an environment called pharokkaENV with pharokka installed:

mamba create -n pharokkaENV pharokka
conda activate pharokkaENV
install_databases.py -h
pharokka.py -h

Usage

Once the databases have finished downloading, to run pharokka:

pharokka.py -i <fasta file> -o <output directory> -t <threads>

To specify a different database directory (recommended):

pharokka.py -i <fasta file> -o <output directory> -d <path/to/database_dir> -t <threads> -p <prefix>

For a full explanation of all arguments, please see usage.

pharokka defaults to 1 thread.

usage: pharokka.py [-h] [-i INFILE] [-o OUTDIR] [-d DATABASE] [-t THREADS] [-f] [-p PREFIX] [-l LOCUSTAG] [-g GENE_PREDICTOR] [-m] [-s]
                   [-c CODING_TABLE] [-e EVALUE] [--fast] [--mmseqs2_only] [--meta_hmm] [--dnaapler] [--custom_hmm CUSTOM_HMM] [--genbank]
                   [--terminase] [--terminase_strand TERMINASE_STRAND] [--terminase_start TERMINASE_START] [--skip_extra_annotations]
                   [--skip_mash] [--minced_args MINCED_ARGS] [--mash_distance MASH_DISTANCE] [-V] [--citation]

pharokka: fast phage annotation program

options:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        Input genome file in fasta format.
  -o OUTDIR, --outdir OUTDIR
                        Directory to write the output to.
  -d DATABASE, --database DATABASE
                        Database directory. If the databases have been installed in the default directory, this is not required. Otherwise specify the path.
  -t THREADS, --threads THREADS
                        Number of threads. Defaults to 1.
  -f, --force           Overwrites the output directory.
  -p PREFIX, --prefix PREFIX
                        Prefix for output files. This is not required.
  -l LOCUSTAG, --locustag LOCUSTAG
                        User specified locus tag for the gff/gbk files. This is not required. A random locus tag will be generated instead.
  -g GENE_PREDICTOR, --gene_predictor GENE_PREDICTOR
                        User specified gene predictor. Use "-g phanotate" or "-g prodigal" or "-g prodigal-gv" or "-g genbank". 
                        Defaults to phanotate (not required unless prodigal is desired).
  -m, --meta            meta mode for metavirome input samples
  -s, --split           split mode for metavirome samples. -m must also be specified. 
                        Will output separate split FASTA, gff and genbank files for each input contig.
  -c CODING_TABLE, --coding_table CODING_TABLE
                        translation table for prodigal. Defaults to 11.
  -e EVALUE, --evalue EVALUE
                        E-value threshold for MMseqs2 database PHROGs, VFDB and CARD and PyHMMER PHROGs database search. Defaults to 1E-05.
  --fast, --hmm_only    Runs PyHMMER (HMMs) with PHROGs only, not MMseqs2 with PHROGs, CARD or VFDB. 
                        Designed for phage isolates, will not likely be faster for large metagenomes.
  --mmseqs2_only        Runs MMseqs2 with PHROGs, CARD and VFDB only (same as Pharokka v1.3.2 and prior). Default in meta mode.
  --meta_hmm            Overrides --mmseqs2_only in meta mode. Will run both MMseqs2 and PyHMMER.
  --dnaapler            Runs dnaapler to automatically re-orient all contigs to begin with terminase large subunit if found. 
                        Recommended over using '--terminase'.
  --custom_hmm CUSTOM_HMM
                        Run pharokka with a custom HMM profile database suffixed .h3m. 
                        Please use create this with the create_custom_hmm.py script.
  --genbank             Flag denoting that -i/--input is a genbank file instead of the usual FASTA file. 
                         The CDS calls in this file will be preserved and re-annotated.
  --terminase           Runs terminase large subunit re-orientation mode. 
                        Single genome input only and requires --terminase_strand and --terminase_start to be specified.
  --terminase_strand TERMINASE_STRAND
                        Strand of terminase large subunit. Must be "pos" or "neg".
  --terminase_start TERMINASE_START
                        Start coordinate of the terminase large subunit.
  --skip_extra_annotations
                        Skips tRNAscan-se, MINced and Aragorn.
  --skip_mash           Skips running mash to find the closest match for each contig in INPHARED.
  --minced_args MINCED_ARGS
                        extra commands to pass to MINced (please omit the leading hyphen for the first argument). You will need to use quotation marks e.g. --minced_args "minNR 2 -minRL 21"
  --mash_distance MASH_DISTANCE
                        mash distance for the search against INPHARED. Defaults to 0.2.
  -V, --version         Print pharokka Version
  --citation            Print pharokka Citation

Version Log

A brief description of what is new in each update of pharokka can be found in the HISTORY.md file.

System

pharokka has been tested on Linux and MacOS (M1 and Intel).

Time

On a standard 16GB RAM laptop specifying 8 threads, pharokka should take between 3 and 10 minutes to run for a single phage, depending on the genome size.

In --fast mode, it should take 45-75 seconds.

Benchmarking v1.5.0

pharokka v1.5.0 was run on the 673 crAss phage dataset to showcase the improved CDS prediction of -g prodigal-gv for metagenomic datasets where some phages likely have alternative genetic codes (i.e. not 11).

All benchmarking was conducted on an Intel® Core™ i7-10700K CPU @ 3.80GHz on a machine running Ubuntu 20.04.6 LTS with 8 threads (-t 8). pyrodigal-gv v0.1.0 and pyrodigal v3.0.0 were used respectively.

673 crAss-like genomes	`pharokka` v1.5.0 `-g prodigal-gv`	`pharokka` v1.5.0 `-g prodigal`
Total CDS	81730	91999
Annotated Function CDS	20344	17458
Unknown Function CDS	61386	74541
Contigs with genetic code 15	229	NA
Contigs with genetic code 4	38	NA
Contigs with genetic code 11	406	673

Fewer (larger) CDS were predicted more accurately, leading to an increase in the number of coding sequences with annotated functions. Approximately 40% of contigs in this dataset were predicted to use non-standard genetic codes according to pyrodigal-gv.
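
The percentages behind these statements can be reproduced directly from the table. The short calculation below uses only the numbers reported above; the variable names are arbitrary.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Values copied from the v1.5.0 benchmarking table above.
contigs_total = 673
non_standard_contigs = 229 + 38  # genetic code 15 + genetic code 4

non_standard_pct = percent(non_standard_contigs, contigs_total)
annotated_gv_pct = percent(20344, 81730)        # -g prodigal-gv
annotated_prodigal_pct = percent(17458, 91999)  # -g prodigal

print(f"Contigs with non-standard genetic codes: {non_standard_pct:.1f}%")
print(f"Annotated CDS with -g prodigal-gv: {annotated_gv_pct:.1f}%")
print(f"Annotated CDS with -g prodigal: {annotated_prodigal_pct:.1f}%")
```

So 267 of 673 contigs (about 39.7%) were assigned genetic code 15 or 4, and the share of CDS with an annotated function rises from about 19.0% with -g prodigal to about 24.9% with -g prodigal-gv.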

Benchmarking v1.4.0

pharokka v1.4.0 has also been run on phage SAOMS1 and the same 673 crAss phage dataset to showcase:

The improved sensitivity of gene annotation with PyHMMER and a demonstration of how --fast is slower for metagenomes.
- If you can deal with the compute cost (especially for large metagenomes), I highly recommend --fast or --meta_hmm for metagenomes given how much more sensitive HMM search is.
The large speed-up over v1.3.2 with --fast for phage isolates - with the proviso that no virulence factors or AMR genes will be detected.
The slight speed-up over v1.3.2 with --mmseqs2_only.

All benchmarking was conducted on an Intel® Core™ i7-10700K CPU @ 3.80GHz on a machine running Ubuntu 20.04.6 LTS with 16 threads (-t 16).

SAOMS1 was run with Phanotate

Phage SAOMS1	`pharokka` v1.4.0 `--fast`	`pharokka` v1.4.0	`pharokka` v1.3.2
Time (min)	0.70	3.73	5.08
CDS	246	246	246
Annotated Function CDS	93	93	92
Unknown Function CDS	153	153	154

The 673 crAss-like genomes were run with -m (defaults to --mmseqs2_only in v 1.4.0) and with -g prodigal (pyrodigal v2.1.0).

673 crAss-like genomes	`pharokka` v1.4.0 `--fast`	`pharokka` v1.4.0 `--mmseqs2_only`	`pharokka` v1.3.2
Time (min)	35.62	11.05	13.27
CDS	91999	91999	91999
Annotated Function CDS	16713	9150	9150
Unknown Function CDS	75286	82849	82849
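
To make the trade-offs in the two tables above concrete, the ratios can be computed from the reported values. This sketch uses only the figures from the tables; the dictionary keys are just labels.

```python
def fold_change(numerator, denominator):
    """Return how many times larger numerator is than denominator."""
    return numerator / denominator


# Runtimes in minutes, copied from the v1.4.0 benchmarking tables above.
saoms1_minutes = {"v1.4.0 --fast": 0.70, "v1.4.0": 3.73, "v1.3.2": 5.08}
crass_minutes = {"v1.4.0 --fast": 35.62, "v1.4.0 --mmseqs2_only": 11.05, "v1.3.2": 13.27}
# Annotated Function CDS counts for the 673 crAss-like genomes.
crass_annotated = {"v1.4.0 --fast": 16713, "v1.4.0 --mmseqs2_only": 9150}

isolate_speedup = fold_change(saoms1_minutes["v1.3.2"], saoms1_minutes["v1.4.0 --fast"])
meta_slowdown = fold_change(crass_minutes["v1.4.0 --fast"], crass_minutes["v1.4.0 --mmseqs2_only"])
mmseqs2_speedup = fold_change(crass_minutes["v1.3.2"], crass_minutes["v1.4.0 --mmseqs2_only"])
annotation_gain = fold_change(crass_annotated["v1.4.0 --fast"], crass_annotated["v1.4.0 --mmseqs2_only"])

print(f"SAOMS1: --fast is {isolate_speedup:.1f}x faster than v1.3.2")
print(f"crAss: --fast is {meta_slowdown:.1f}x slower than --mmseqs2_only")
print(f"crAss: --mmseqs2_only is {mmseqs2_speedup:.1f}x faster than v1.3.2")
print(f"crAss: --fast annotates {annotation_gain:.2f}x as many CDS as --mmseqs2_only")
```

On these numbers, --fast is roughly 7.3x faster than v1.3.2 for the SAOMS1 isolate. On the crAss-like metagenome set it is roughly 3.2x slower than --mmseqs2_only, but annotates about 1.83x as many CDS.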

Original Benchmarking (v1.1.0)

pharokka (v1.1.0) has been benchmarked on an Intel Xeon CPU E5-4610 v2 @ 2.30 GHz specifying 16 threads. Below is benchmarking comparing pharokka run with PHANOTATE and Prodigal against Prokka v1.14.6 run with PHROGs HMM profiles, as modified by Andrew Millard (https://millardlab.org/2021/11/21/phage-annotation-with-phrogs/).

Benchmarking was conducted on Enterobacteria Phage Lambda (Genbank accession J02459), Staphylococcus Phage SAOMS1 (Genbank accession MW460250) and 673 crAss-like phage genomes in one multiFASTA input taken from Yutin, N., Benler, S., Shmakov, S.A. et al. Analysis of metagenome-assembled viral genomes from the human gut reveals diverse putative CrAss-like phages with unique genomic features. Nat Commun 12, 1044 (2021) https://doi.org/10.1038/s41467-021-21350-w.

For the crAss-like phage genomes, pharokka meta mode -m was enabled.

Phage Lambda	`pharokka` PHANOTATE	`pharokka` Prodigal	Prokka with PHROGs
Time (min)	4.19	3.88	0.27
CDS	88	61	62
Coding Density (%)	94.55	83.69	84.96
Annotated Function CDS	43	37	45
Unknown Function CDS	45	24	17

Phage SAOMS1	`pharokka` PHANOTATE	`pharokka` Prodigal	Prokka with PHROGs
Time (min)	4.26	3.89	0.93
CDS	246	212	212
Coding Density (%)	92.27	89.69	89.31
Annotated Function CDS	92	93	92
Unknown Function CDS	154	119	120

673 crAss-like genomes from Yutin et al., 2021	`pharokka` PHANOTATE Meta Mode	`pharokka` Prodigal Meta Mode	Prokka with PHROGs
Time (min)	106.55	11.88	252.33
Time Gene Prediction (min)	96.21	3.4	5.12
Time tRNA Prediction (min)	1.25	1.08	0.3
Time Database Searches (min)	6.75	5.58	238.77
CDS	138628	90497	89802
Contig Min Coding Density (%)	66.01	46.18	46.13
Contig Max Coding Density (%)	98.86	97.85	97.07
Annotated Function CDS	9341	9228	14461
Unknown Function CDS	129287	81269	75341

pharokka scales well for large metavirome datasets due to the speed of MMseqs2. In fact, as the size of the input file increases, most of the extra time is spent on gene prediction (particularly PHANOTATE) and tRNAscan-SE 2.0; the time taken to conduct MMseqs2 searches remains small due to its many-vs-many approach.

If you require fast annotations of extremely large datasets (i.e. thousands of input contigs), running pharokka with Prodigal (-g prodigal) is recommended.

Bugs and Suggestions

If you come across bugs with pharokka, or would like to make any suggestions to improve the program, please open an issue or email the author.

Citation

If you use pharokka, I would recommend a citation in your manuscript along the lines of:

All phages were annotated with Pharokka v ___ (Bouras, et al. 2023). Specifically, coding sequences (CDS) were predicted with PHANOTATE (McNair, et al. 2019), tRNAs were predicted with tRNAscan-SE 2.0 (Chan, et al. 2021), tmRNAs were predicted with Aragorn (Laslett, et al. 2004) and CRISPRs were predicted with CRT (Bland, et al. 2007). Functional annotation was generated by matching each CDS to the PHROGs (Terzian, et al. 2021), VFDB (Chen, et al. 2005) and CARD (Alcock, et al. 2020) databases using MMseqs2 (Steinegger, et al. 2017) and PyHMMER (Larralde, et al. 2023). Contigs were matched to their closest hit in the INPHARED database (Cook, et al. 2021) using mash (Ondov, et al. 2016). Plots were created with pyCirclize (Shimoyama 2022).

With the following full citations for the constituent tools below where relevant:

Cook R, Brown N, Redgwell T, Rihtman B, Barnes M, Clokie M, Stekel DJ, Hobman JL, Jones MA, Millard A. INfrastructure for a PHAge REference Database: Identification of Large-Scale Biases in the Current Collection of Cultured Phage Genomes. PHAGE. 2021. Available from: http://doi.org/10.1089/phage.2021.0007.
McNair K., Zhou C., Dinsdale E.A., Souza B., Edwards R.A. (2019) "PHANOTATE: a novel approach to gene identification in phage genomes", Bioinformatics, https://doi.org/10.1093/bioinformatics/btz265.
Chan, P.P., Lin, B.Y., Mak, A.J. and Lowe, T.M. (2021) "tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes", Nucleic Acids Res., https://doi.org/10.1093/nar/gkab688.
Steinegger M. and Soeding J. (2017), "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets", Nature Biotechnology https://doi.org/10.1038/nbt.3988.
Ondov, B.D., Treangen, T.J., Melsted, P. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17, 132 (2016). https://doi.org/10.1186/s13059-016-0997-x.
Terzian P., Olo Ndela E., Galiez C., Lossouarn J., Pérez Bucio R.E., Mom R., Toussaint A., Petit M.A., Enault F., "PHROG : families of prokaryotic virus proteins clustered using remote homology", NAR Genomics and Bioinformatics, (2021), https://doi.org/10.1093/nargab/lqab067.
Bland C., Ramsey L., Sabree F., Lowe M., Brown K., Kyrpides N.C., Hugenholtz P. , "CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats", BMC Bioinformatics, (2007), https://doi.org/10.1186/1471-2105-8-209.
Laslett D., Canback B., "ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences.", Nucleic Acids Research (2004) https://doi.org/10.1093/nar/gkh152.
Chen L., Yang J., Yao Z., Sun L., Shen Y., Jin Q., "VFDB: a reference database for bacterial virulence factors", Nucleic Acids Research (2005) https://doi.org/10.1093/nar/gki008.
Alcock et al, "CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database." Nucleic Acids Research (2020) https://doi.org/10.1093/nar/gkz935.
Larralde, M., (2022). Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes. Journal of Open Source Software, 7(72), 4296. doi:10.21105/joss.04296.
Larralde M., Zeller G., (2023). PyHMMER: a Python library binding to HMMER for efficient sequence analysis, Bioinformatics, Volume 39, Issue 5, May 2023, btad214, https://doi.org/10.1093/bioinformatics/btad214.
Larralde M. and Camargo A., (2023) Pyrodigal-gv: A Pyrodigal extension to predict genes in giant viruses and viruses with alternative genetic code. https://github.com/althonos/pyrodigal-gv.
Shimoyama, Y. (2022). pyCirclize: Circular visualization in Python [Computer software]. https://github.com/moshi4/pyCirclize.

pharokka's Issues

ModuleNotFoundError: No module named 'Bio'

pharokka version: 0.1.11
Python version: 3.10.6
Operating System: macOS - Monterey 12.6

Description

Hi George,

I tried the new version (v0.1.11) installed via mamba (mamba install pharokka), but once the download completed and I ran the help menu, an error was printed: ModuleNotFoundError: No module named 'Bio'

What I Did

pharokka.py -h

The entire error output:

Traceback (most recent call last):
  File "/Applications/miniconda3/envs/pharokka/bin/pharokka.py", line 3, in <module>
    import input_commands
  File "/Applications/miniconda3/envs/pharokka/bin/input_commands.py", line 5, in <module>
    from Bio import SeqIO
ModuleNotFoundError: No module named 'Bio'
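
The report does not state the cause of this error. One thing worth checking in a case like this is which Python interpreter is actually running and whether it can see the `Bio` package (provided by biopython). The snippet below is a generic diagnostic sketch using only the standard library and is not part of pharokka; run it with the same interpreter that executes pharokka.py.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


# Shows which interpreter is running and whether it can locate Biopython.
print("Interpreter:", sys.executable)
print("Bio importable:", module_available("Bio"))
```

If the interpreter path printed here is not inside the expected conda environment, the script may be picking up a different Python than the one biopython was installed into.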

Conda environment list

Here is my conda environment list:

# packages in environment at /Applications/miniconda3/envs/pharokka:
#
# Name                    Version                   Build  Channel
aragorn                   1.2.41               ha5712d3_0    bioconda
backports                 1.0                        py_2    conda-forge
backports.tempfile        1.0                        py_0    conda-forge
backports.weakref         1.0.post1       pyhd8ed1ab_1003    conda-forge
bcbio-gff                 0.6.9              pyh5e36f6f_0    bioconda
biopython                 1.79            py310h1961e1f_2    conda-forge
brotli                    1.0.9                h5eb16cf_7    conda-forge
brotli-bin                1.0.9                h5eb16cf_7    conda-forge
bx-python                 0.9.0           py310hd9b96a7_0    bioconda
bzip2                     1.0.8                h0d85af4_4    conda-forge
ca-certificates           2022.6.15.2          h033912b_0    conda-forge
certifi                   2022.6.15.2        pyhd8ed1ab_0    conda-forge
cycler                    0.11.0             pyhd8ed1ab_0    conda-forge
emboss                    6.6.0                h6debe1e_0    bioconda
expat                     2.4.8                h96cf925_0    conda-forge
fastpath                  1.9             py310he24745e_1    conda-forge
font-ttf-dejavu-sans-mono 2.37                 hab24e00_0    conda-forge
font-ttf-inconsolata      3.000                h77eed37_0    conda-forge
font-ttf-source-code-pro  2.038                h77eed37_0    conda-forge
font-ttf-ubuntu           0.83                 hab24e00_0    conda-forge
fontconfig                2.14.0               h5bb23bf_1    conda-forge
fonts-conda-ecosystem     1                             0    conda-forge
fonts-conda-forge         1                             0    conda-forge
fonttools                 4.37.0          py310h90acd4f_0    conda-forge
freetype                  2.12.1               h3f81eb7_0    conda-forge
gawk                      5.1.0                h8a989fb_0    conda-forge
gettext                   0.19.8.1          hd1a6beb_1008    conda-forge
giflib                    5.2.1                hbcb3906_2    conda-forge
hhsuite                   3.3.0           py310pl5321h6c969c3_5    bioconda
icu                       70.1                 h96cf925_0    conda-forge
infernal                  1.1.4           pl5321ha5712d3_1    bioconda
jpeg                      9e                   hac89ed1_2    conda-forge
kiwisolver                1.4.4           py310habb735a_0    conda-forge
lcms2                     2.12                 h577c468_0    conda-forge
lerc                      4.0.0                hb486fe8_0    conda-forge
libblas                   3.9.0           16_osx64_openblas    conda-forge
libbrotlicommon           1.0.9                h5eb16cf_7    conda-forge
libbrotlidec              1.0.9                h5eb16cf_7    conda-forge
libbrotlienc              1.0.9                h5eb16cf_7    conda-forge
libcblas                  3.9.0           16_osx64_openblas    conda-forge
libcxx                    14.0.6               hccf4f1f_0    conda-forge
libdeflate                1.14                 hb7f2c08_0    conda-forge
libffi                    3.4.2                h0d85af4_5    conda-forge
libgd                     2.3.3                h1e214de_3    conda-forge
libgfortran               5.0.0           10_4_0_h97931a8_25    conda-forge
libgfortran5              11.3.0              h082f757_25    conda-forge
libiconv                  1.16                 haf1e3a3_0    conda-forge
libidn2                   2.3.3                hac89ed1_0    conda-forge
liblapack                 3.9.0           16_osx64_openblas    conda-forge
libopenblas               0.3.21          openmp_h429af6e_3    conda-forge
libpng                    1.6.37               h5481273_4    conda-forge
libsqlite                 3.39.3               ha978bb4_0    conda-forge
libtiff                   4.4.0                hdb44e8a_4    conda-forge
libunistring              0.9.10               h0d85af4_0    conda-forge
libwebp                   1.2.4                hfa4350a_0    conda-forge
libwebp-base              1.2.4                h775f41a_0    conda-forge
libxcb                    1.13              h0d85af4_1004    conda-forge
libzlib                   1.2.12               hfd90126_3    conda-forge
llvm-openmp               14.0.4               ha654fa7_0    conda-forge
lzo                       2.10              haf1e3a3_1000    conda-forge
matplotlib-base           3.5.3           py310h1bfeb8c_2    conda-forge
minced                    0.4.2                hdfd78af_1    bioconda
mmseqs2                   13.45111        pl5321hdb1ff06_2    bioconda
munkres                   1.1.4              pyh9f0ad1d_0    conda-forge
ncurses                   6.3                  h96cf925_1    conda-forge
numpy                     1.23.3          py310h1b7c290_0    conda-forge
openjdk                   17.0.3               hbc0c0cd_2    conda-forge
openjpeg                  2.5.0                h5d0d7b0_1    conda-forge
openssl                   3.0.5                hfd90126_2    conda-forge
packaging                 21.3               pyhd8ed1ab_0    conda-forge
pandas                    1.4.4           py310hecf8f37_0    conda-forge
patsy                     0.5.2              pyhd8ed1ab_0    conda-forge
perl                      5.32.1          2_h0d85af4_perl5    conda-forge
phanotate                 1.5.0                hb6a186f_2    bioconda
pharokka                  0.1.11               hdfd78af_0    bioconda
pillow                    9.2.0           py310h54af1cc_2    conda-forge
pip                       22.2.2             pyhd8ed1ab_0    conda-forge
prodigal                  2.6.3                ha5712d3_4    bioconda
pthread-stubs             0.4               hc929b4f_1001    conda-forge
pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
python                    3.10.6          hc14f532_0_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python-lzo                1.14            py310h484c9e0_1    conda-forge
python_abi                3.10                    2_cp310    conda-forge
pytz                      2022.2.1           pyhd8ed1ab_0    conda-forge
readline                  8.1.2                h3899abd_0    conda-forge
scipy                     1.9.1           py310h240c617_0    conda-forge
seaborn                   0.12.0               hd8ed1ab_0    conda-forge
seaborn-base              0.12.0             pyhd8ed1ab_0    conda-forge
setuptools                65.3.0             pyhd8ed1ab_1    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
statsmodels               0.13.2          py310h1bbcd0e_0    conda-forge
textwrap3                 0.9.2                      py_0    conda-forge
tk                        8.6.12               h5dbffcc_0    conda-forge
trnascan-se               2.0.9           pl5321ha5712d3_3    bioconda
typing_extensions         4.3.0              pyha770c72_0    conda-forge
tzdata                    2022c                h191b570_0    conda-forge
unicodedata2              14.0.0          py310h1961e1f_1    conda-forge
wget                      1.20.3               hd3787cc_1    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xorg-libxau               1.0.9                h35c211d_0    conda-forge
xorg-libxdmcp             1.1.3                h35c211d_0    conda-forge
xz                        5.2.6                h775f41a_0    conda-forge
zlib                      1.2.12               hfd90126_3    conda-forge
zstd                      1.5.2                hfa58983_4    conda-forge

Which is weird because Bioconda is in there. By the way the emboss library is still being downloaded from conda.

Best.

If the contig header is an integer - leads to 0's in pharokka_cds_functions.tsv

v 1.1.0

issue with pharokka_cds_functions.tsv if the contig header is an integer
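
The cause is not stated in this issue. One plausible explanation (an assumption, not confirmed here) is that a purely numeric contig header is read as an integer in one place and as a string in another, so the per-contig counts never match and fall back to 0. A minimal sketch of normalising every ID to a string before counting; `count_cds_per_contig` is a hypothetical helper, not a pharokka function:

```python
from collections import Counter


def count_cds_per_contig(cds_contigs, contig_ids):
    """Count CDS per contig, comparing every contig ID as a string."""
    counts = Counter(str(contig) for contig in cds_contigs)
    # Counter returns 0 for IDs with no CDS, so every contig gets an entry.
    return {str(contig_id): counts[str(contig_id)] for contig_id in contig_ids}


# A header such as ">1" may come back as the integer 1 from one parser and
# as the string "1" from another; str() makes the two agree.
per_contig = count_cds_per_contig([1, 1, "2"], ["1", "2", "3"])
```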

Run crashed due to "IndexError: list index out of range"

phrokka version: v1.2.0
Python version: Python 3.10.8
Operating System: Linux

Description

I was trying to run pharokka, but it crashed due to "IndexError: list index out of range". What should I do to remedy this?

What I Did

Command: pharokka.py -i 234389_PhageGq1.fna -o phage_gq1 -t 4

Traceback (most recent call last):
  File "/home/ubuntu/lisa-armyVol3/miniconda3/envs/phage/bin/pharokka.py", line 87, in <module>
    input_commands.check_dependencies(logger)
  File "/home/ubuntu/lisa-armyVol3/miniconda3/envs/phage/bin/input_commands.py", line 313, in check_dependencies
    mash_version = version_line[0].split(' ')[2]
IndexError: list index out of range
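
The traceback shows the version being taken from a fixed position after splitting on spaces, which breaks as soon as the tool prints its version in a different layout. A minimal sketch of a more tolerant parse is below. `extract_version` is a hypothetical helper, and the sample strings are made up for illustration, since the exact text printed by each mash release is not shown in this issue.

```python
import re


def extract_version(text):
    """Return the first dotted version number found in text, or None."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    return match.group(0) if match else None


# Works whether or not the number is preceded by other words.
for sample in ("2.2.2", "Mash version 2.3", ""):
    version = extract_version(sample)
    if version is None:
        print("could not determine version from:", repr(sample))
    else:
        print("found version", version)
```

Returning `None` instead of indexing blindly lets the caller print a clear "version not found" message rather than crash.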

Fixed GBK translation when non-11 coding tables used

phrokka version: 1.3.2
Python version: 3.11 (I think)
Operating System: ubuntu

Description

Generating annotations for translation table 4 would initially output a correct temporary prodigal file, but the final gbk would contain incorrect translations

What I Did

Hi there,
I noted that you have non-standard coding tables marked as experimental. I was keen to get them to work, so I forked your repo and I think I have worked through the code that was causing standard bacterial (table 11) gbks to be output even when a different coding table was selected.

This was mostly around the Biopython 'translate' call, which was missing the 'table=' option, so even when a different coding table was selected and annotated with prodigal, the gbk translation from the fasta file would treat it as table 11 every time.

Happy to submit a pull request but thought I best put it here first.
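
To illustrate why the missing table matters: tables 4 and 11 assign the same amino acids to every codon except TGA, which is a stop in table 11 and tryptophan in table 4. The sketch below is a dependency-free illustration with hypothetical helper names, not the code from the fork. In Biopython itself the equivalent is passing `table=` to `translate`.

```python
BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (NCBI standard layout).
# Table 11 uses the same assignments as the standard code.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_table(table_id):
    """Build a codon -> amino acid dict for translation table 4 or 11."""
    if table_id == 11:
        aas = STANDARD_AAS
    elif table_id == 4:
        # Table 4 reads TGA (position 14 in TCAG order) as tryptophan.
        aas = STANDARD_AAS[:14] + "W" + STANDARD_AAS[15:]
    else:
        raise ValueError(f"table {table_id} is not covered by this sketch")
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    return dict(zip(codons, aas))


def translate(dna, table_id=11):
    """Translate complete codons; unknown codons become 'X'."""
    table = codon_table(table_id)
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


print(translate("ATGTGATAA", 11))  # TGA read as a stop
print(translate("ATGTGATAA", 4))   # TGA read as tryptophan
```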

phrokka

phrokka version: 0.1.0
Python version: 3.9
Operating System: MacOS

Description

Could you please explain the headings in final_merged_output.tsv in more detail?

specify in documentation that --terminase_start requires strand-specific coordinate

pharokka version: 1.3.2
Python version: 3.9.16
Operating System: Rocky Linux 9.0 (Blue Onyx)

Thanks for making and maintaining pharokka. It's a great tool.

After some troubleshooting, I realized that --terminase_start requires the end (right) coordinate from the gff when --terminase_strand neg is specified. My example:

gff line describing terminase large subunit

ON631220.1    PHANOTATE    CDS    80073    82226    -1.721062366005452e+16    -    ...

incorrect use of `--terminase`

pharokka.py \
  -i <fasta file> \
  -o <output folder> \
  -d <path/to/database_dir> \
  -t <threads> \
  --terminase \
  --terminase_start 80073 \
  --terminase_strand neg

correct use of `--terminase`

pharokka.py \
  -i <fasta file> \
  -o <output folder> \
  -d <path/to/database_dir> \
  -t <threads> \
  --terminase \
  --terminase_start 82226 \
  --terminase_strand neg

It would be nice if you could mention explicitly in the documentation that --terminase expects the true strand-specific start position of the terminase large subunit and not the coordinate specified in the start field of the un-reoriented gff file.
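
The rule described above can be sketched as a small helper that reads a GFF line and returns the coordinate and strand to pass on the command line. This is an illustration under assumptions: `terminase_start_args` is a hypothetical name, the line is assumed to be tab-separated as in a real GFF file, and `pos` as the forward-strand value is assumed by analogy with the `neg` shown in the commands above.

```python
def terminase_start_args(gff_line):
    """Return (start_coordinate, strand_flag) for a terminase CDS line."""
    fields = gff_line.rstrip("\n").split("\t")
    start, end, strand = int(fields[3]), int(fields[4]), fields[6]
    if strand == "-":
        # On the negative strand the gene begins at the right-hand coordinate.
        return end, "neg"
    return start, "pos"


line = "ON631220.1\tPHANOTATE\tCDS\t80073\t82226\t-1.721062366005452e+16\t-\t0\tID=example"
coordinate, strand = terminase_start_args(line)
print(f"--terminase --terminase_start {coordinate} --terminase_strand {strand}")
```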

Error in tRNAscan-SE step

phrokka version: 1.3.0
Python version: 3.10.10
Operating System: MacOS Catalina version 10.15.7

Description

Trying to run pharokka on a fasta file of ~2000 virus contigs, but getting an error in the tRNAscan-SE step (I believe).

What I Did

Command I ran: pharokka.py -i /fastapath/fasta.fasta -o /outputdirpath -t 8 -f -g prodigal -m [with my file and folder paths]

Output: Running tRNAscan-SE. Applying meta mode.
Traceback (most recent call last):
  File "/pathtoenv/bin/pharokka.py", line 152, in <module>
    processes.concat_trnascan_meta(out_dir, num_fastas)
  File "/pathtoenv/bin/processes.py", line 195, in concat_trnascan_meta
    with open(fname) as infile:
FileNotFoundError: [Errno 2] No such file or directory: '/outputdirpath/input_split_tmp/trnascan_tmp1.gff'

Additional info

It has created some files in the output directory, but does not seem to have finished. Log file ends at "Starting tRNA-scanSE. Applying meta mode."

Add Module to Check Databases Installed upon running Pharokka

Sequences (contigs) ids should be equal in gff's table and fasta section

phrokka version: 1.3.2
Python version: 3.9.16
Operating System: Linux

Description

I annotated a bunch of viral genomes with pharokka and it looks like the sequence IDs in the table and in the fasta header of the gff file are not the same. For instance:

##gff-version 3
##sequence-region AP017925.1 1 276958
AP017925.1      PHANOTATE       CDS     30      452     -116.87450862992809     -       0       ID=AP017925_CDS_0001;phrog=1198;top_hit=MG720308_p31;locus_tag=AP017925_CDS_0001;function=other;product=MutT/NUDIX hydrolase
AP017925.1      PHANOTATE       CDS     501     2687    -6900141919402.969      -       0       ID=AP017925_CDS_0002;phrog=2927;top_hit=NC_031039_p151;locus_tag=AP017925_CDS_0002;function=DNA, RNA and nucleotide metabolism;product=DNA polymerase
...
##FASTA
>AP017925.1 Ralstonia phage RP31 DNA, complete genome
ACGAGAGAGGAGGCGAATGCCTCCTCTCTCTATGCCGCTATGGTAATGCGGCTGGGTACA
AAACCCTTTTCCACCAGAGATTTCAACGGCGGAAAGAGATTCTCAGGCAACTTATCCCAT
...

In this case, AP017925.1 (first column in the gff table) is not equal to AP017925.1 Ralstonia phage RP31 DNA, complete genome (header in the fasta section of the gff file), which may prevent third-party software from reading it correctly. For comparison, the same genome annotated with prokka outputs:

##gff-version 3
##sequence-region AP017925.1 1 276958
AP017925.1      Prodigal:002006 CDS     30      452     .       -       0       ID=AP017925_00001;inference=ab initio prediction:Prodigal:002006;locus_tag=AP017925_00001;product=hypothetical protein
AP017925.1      Prodigal:002006 CDS     501     2687    .       -       0       ID=AP017925_00002;inference=ab initio prediction:Prodigal:002006;locus_tag=AP017925_00002;product=hypothetical protein
...
##FASTA
>AP017925.1
ACGAGAGAGGAGGCGAATGCCTCCTCTCTCTATGCCGCTATGGTAATGCGGCTGGGTACA
AAACCCTTTTCCACCAGAGATTTCAACGGCGGAAAGAGATTCTCAGGCAACTTATCCCAT
...

In this case, identifiers match so it's easy to parse.

PS. Thanks for this cool software!
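
Until the identifiers match, one workaround is to trim the FASTA headers inside the gff so that only the sequence ID remains, as in the prokka output above. A minimal sketch, assuming the file is read as a list of lines; `trim_gff_fasta_headers` is a hypothetical helper and not part of pharokka:

```python
def trim_gff_fasta_headers(lines):
    """Keep only the sequence ID on '>' lines after the ##FASTA marker."""
    in_fasta = False
    trimmed = []
    for line in lines:
        if line.startswith("##FASTA"):
            in_fasta = True
        elif in_fasta and line.startswith(">"):
            newline = "\n" if line.endswith("\n") else ""
            # Everything after the first whitespace is the description.
            line = line.split(None, 1)[0] + newline
        trimmed.append(line)
    return trimmed


example = [
    "##gff-version 3\n",
    "##FASTA\n",
    ">AP017925.1 Ralstonia phage RP31 DNA, complete genome\n",
    "ACGAGAGAGG\n",
]
print("".join(trim_gff_fasta_headers(example)), end="")
```

Feature lines before `##FASTA` are left untouched, so attribute values that happen to contain `>` are not affected.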

some gff genes are blank

phrokka version: 0.1.1
Python version: 3.9.3
Operating System: Linux

Description

gff file had no output for one gene

What I Did

pharokka.py -i NC_047732.1.fasta -o NC_047732.1 -t 16 -p NC_047732.1 -f

need to look into why, and if it is because similarity is 0, need to fill the gff entry with NA.

Phrokka gff contig name

phrokka version: 0.1.0
Python version: 3.9
Operating System: MacOS

Description

Small thing, but I noticed that in the .gff output that the names for the CDS and the tRNA hits are different - seems trimmed.

Add error checking to threads - ensure an integer is passed

phrokka version: v 1.2.1
Python version: 3.10
Operating System: Mac OS

Description

Pharokka crashes if a non-integer is passed as -t.

Add error correction in the next update.
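
One way to do this check is at argument-parsing time, so a bad value produces a usage error instead of a crash later on. The sketch below uses the standard `argparse` module. It is not pharokka's actual parser, and `positive_int` and `build_parser` are hypothetical names.

```python
import argparse


def positive_int(value):
    """argparse type: accept only whole numbers >= 1."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"threads must be an integer, got {value!r}")
    if number < 1:
        raise argparse.ArgumentTypeError("threads must be at least 1")
    return number


def build_parser():
    parser = argparse.ArgumentParser(description="thread-count validation sketch")
    parser.add_argument("-t", "--threads", type=positive_int, default=1,
                        help="number of threads (whole number, at least 1)")
    return parser


args = build_parser().parse_args(["-t", "8"])
print(args.threads)
```

With this in place, a value such as `-t eight` makes argparse exit with a readable message naming the offending value.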

pharokka's default mash version error

phrokka version:1.2.1
Python version:3.10.8
Operating System:Ubuntu 18.04.6 LTS

I tried to run pharokka after installing it according to README.md.

Command:
./bin/pharokka.py -i /path/contigs.fasta -o /path/pharokka_result -d /path/pharokka_v1.2.0_databases -t 92 -f

Error:

Starting pharokka v1.2.1
Checking database installation.
All databases have been successfully checked.
Checking dependencies.
Phanotate version found is v1.5.0.
Phanotate version is ok.
MMseqs2 version found is v13.45111.
MMseqs2 version is ok.
tRNAscan-SE version found is v2.0.11.
tRNAscan-SE version is ok.
MinCED version found is v0.4.2.
MinCED version is ok.
ARAGORN version found is v1.2.41.
ARAGORN version is ok.
Traceback (most recent call last):
  File "/media/chrf/Home03/Trial_Preonath/pharokka/bin/pharokka.py", line 87, in <module>
    input_commands.check_dependencies(logger)
  File "/media/chrf/Home03/Trial_Preonath/pharokka/bin/input_commands.py", line 313, in check_dependencies
    mash_version = version_line[0].split(' ')[2]
IndexError: list index out of range

Solution:
I then checked the mash version and installed the specified mash version.
Command:
conda install -c bioconda mash=2.2

This solved my issue.

Add locus tag option

Add the ability for user to specify locus tag

https://twitter.com/PaulinaDeptula/status/1544237125461229568

Adding custom databases

Hey, I'm trying to add a custom database to pharokka, but simply adding an mmseqs2 database to the database folder doesn't work (looks like there's specific code to call individual databases sadly).

I was wondering if there was an easy way to add custom databases to pharokka like in prokka, or if such a functionality is planned.

cheers

Phanotate does not work properly

phrokka version: 1.2.1
Python version: 3.7
Operating System: linux

Description

Dear Sir,

When I run a genome (GCA_000154865.1.fasta, 4M), it stays at the Run Phanotate step and produces no results, even though I have waited for more than 10 hours.
Interestingly, when I run another Klebsiella pneumoniae genome, it runs normally (38 minutes). I am confused as to what causes this.

The command I used is pharokka.py -i GCA_000154865.1.fasta -o test2 -t 60 -d ~/database/pharokka/pharokka_v1.2.0_databases -p tes2t -l test -f

Could you tell me why this happened?

No tRNAs in pharokka_cds_functions.tsv but tRNAs found in the gff

Pharokka crashes if no mmseqs2 matches found

phrokka version: v 0.1.7
Python version: 3.9
Operating System: Linux

Description

Upon checking the output file, there was nothing in mmseqs_results.tsv. Need to add an error catch for this.

Activating conda environment: .snakemake/conda/f3ab90247c158ab63eeae3366d6f1d08
Traceback (most recent call last):
  File ".snakemake/conda/f3ab90247c158ab63eeae3366d6f1d08/bin/pharokka.py", line 80, in <module>
    phan_mmseq_merge_df = post_processing.process_results(DBDIR, out_dir, prefix, gene_predictor)
  File ".snakemake/conda/f3ab90247c158ab63eeae3366d6f1d08/bin/post_processing.py", line 41, in process_results
    merged_df[['phrog','top_hit']] = merged_df['phrog'].str.split(' ## ',expand=True)
  File "snakemake/conda/f3ab90247c158ab63eeae3366d6f1d08/lib/python3.9/site-packages/pandas/core/frame.py", line 3643, in __setitem__
    self._setitem_array(key, value)
  File "snakemake/conda/f3ab90247c158ab63eeae3366d6f1d08/lib/python3.9/site-packages/pandas/core/frame.py", line 3685, in _setitem_array
    check_key_length(self.columns, key, value)
  File ".snakemake/conda/f3ab90247c158ab63eeae3366d6f1d08/lib/python3.9/site-packages/pandas/core/indexers/utils.py", line 428, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
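
A possible shape for the error catch is sketched below, using only the standard library. The `' ## '` separator comes from the traceback above. `has_hits` and `split_phrog_field` are hypothetical helpers, and what pharokka should write when there are no hits is left open here.

```python
import os


def has_hits(results_path):
    """True only if the results file exists and is not empty."""
    return os.path.isfile(results_path) and os.path.getsize(results_path) > 0


def split_phrog_field(value, separator=" ## "):
    """Always return exactly two fields, even when the separator is absent."""
    phrog, _, top_hit = value.partition(separator)
    return phrog, top_hit


results = "mmseqs_results.tsv"  # file name taken from the issue text
if not has_hits(results):
    print("No MMseqs2 hits found; skipping the PHROG merge step.")
```

Checking for an empty file before the merge avoids the two-column assignment entirely, and `partition` guarantees two values per row where `str.split(..., expand=True)` may return fewer columns.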

Add prodigal as a gene prediction option

install_databases.py downloads unneeded hidden files

pharokka version: 1.3.2
Python version: 3.10.12 (from conda)
Operating System: Oracle Linux 8.8

Description

I want to download the databases, which worked, but I got some (Ignoring) warnings and extra filesystem related files, which could be avoided in the next release.

What I Did

$ install_databases.py -o scratch/mirror/pharokka/1.3.2
PHROGs Databases are missing.
VFDB Databases are missing.
CARD Databases are missing.
PHROGs Annotation File is missing.
INPHARED Mash Annotation File is missing.
INPHARED Mash Sketch File is missing.
Some Databases are missing.
Downloading Pharokka Database
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  242M  100  242M    0     0  2236k      0  0:01:50  0:01:50 --:--:-- 2240k
tar: Ignoring unknown extended header keyword 'SCHILY.fflags'
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.FinderInfo'
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.macl'
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'

$ ls /scratch/mirror/pharokka/1.3.2/
.DS_Store                           CARD_h                              phrogs_profile_db.index             vfdb
._.DS_Store                         CARD_h.dbtype                       phrogs_profile_db_consensus         vfdb.dbtype
._5Jan2023_data.tsv                 CARD_h.index                        phrogs_profile_db_consensus.dbtype  vfdb.index
._5Jan2023_genomes.fa.msh           VFDB_setB_pro.fas                   phrogs_profile_db_consensus.index   vfdb.lookup
5Jan2023_data.tsv                   aro_index.tsv                       phrogs_profile_db_h                 vfdb.source
5Jan2023_genomes.fa.msh             phrog_annot_v4.tsv                  phrogs_profile_db_h.index           vfdb_h
CARD                                phrogs_db                           phrogs_profile_db_seq               vfdb_h.dbtype
CARD.dbtype                         phrogs_db.dbtype                    phrogs_profile_db_seq.dbtype        vfdb_h.index
CARD.index                          phrogs_db.index                     phrogs_profile_db_seq.index
CARD.lookup                         phrogs_profile_db                   phrogs_profile_db_seq_h
CARD.source                         phrogs_profile_db.dbtype            phrogs_profile_db_seq_h.index

I suggest not downloading .DS_Store files and hidden backup files (._*).

Thanks for your time!

PS: the issue template says 'phrokka' instead of 'pharokka' in the first line.
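
The suggestion above could be handled on the extraction side, as sketched below with the standard `tarfile` module. This is an illustration and not the code in install_databases.py; `is_macos_metadata` and `extract_without_metadata` are hypothetical names. Rebuilding the hosted archive without these files would remove the need for it.

```python
import os
import tarfile


def is_macos_metadata(name):
    """True for .DS_Store and AppleDouble ('._*') entries."""
    base = os.path.basename(name.rstrip("/"))
    return base == ".DS_Store" or base.startswith("._")


def extract_without_metadata(archive_path, dest_dir):
    """Extract an archive, skipping macOS metadata files; return kept names."""
    os.makedirs(dest_dir, exist_ok=True)
    # Newer Python versions accept an extraction filter; use it when present.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path) as tar:
        members = [m for m in tar.getmembers() if not is_macos_metadata(m.name)]
        tar.extractall(dest_dir, members=members, **extra)
    return [m.name for m in members]
```

The "Ignoring unknown extended header keyword" lines are printed by the `tar` command while it reads the archive, so filtering members here only addresses the extra files.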

Long fasta headers raise errors

phrokka version: 1.0.1
Python version: Python 3.9.13 (build deps with conda)
Operating System: Debian GNU/Linux 9 (stretch)

Dear developers, first of all, thank you very much for making this tool, it will be very useful for annotating my phages !

I was trying to run phrokka with one of my assemblies and the program threw an error.

Running hmmsuite
data file: teste_ph-f/hhsuite_target_dir/hhsuite_tsv_file.ffdata
index file: teste_ph-f/hhsuite_target_dir/hhsuite_tsv_file.ffindex
fasta file: teste_ph-f/phanotate_aas.fasta
- 21:00:16.274 INFO: Searching 38880 column state sequences.

- 21:00:16.357 INFO: Thread 0   S34#1#2210b5d2-d153-4961-bdbc-a4

- 21:00:16.374 ERROR: Error in /opt/conda/conda-bld/hhsuite_1645696999782/work/src/hhfunc.cpp:83: ReadQueryFile:

- 21:00:16.374 ERROR:   unrecognized input file format in 'S34#1#2210b5d2-d153-4961-bdbc-a4'

- 21:00:16.374 ERROR:   line = 0f0688b033delimiter0

Processing mmseqs output
Processing hhsuite output
Traceback (most recent call last):
  File "/home/hugoa/repos/phrokka/bin/phrokka.py", line 36, in <module>
    phan_mmseq_merge_df = post_processing.process_results(DBDIR, out_dir)
  File "/home/hugoa/repos/phrokka/bin/modules/post_processing.py", line 75, in process_results
    tophits_hmm__df[['spl','ind']] = tophits_hmm__df['gene_hmm'].str.split('delimiter',expand=True)
  File "/home/hugoa/miniconda3/envs/ph_dev/lib/python3.9/site-packages/pandas/core/frame.py", line 3643, in __setitem__
    self._setitem_array(key, value)
  File "/home/hugoa/miniconda3/envs/ph_dev/lib/python3.9/site-packages/pandas/core/frame.py", line 3685, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/home/hugoa/miniconda3/envs/ph_dev/lib/python3.9/site-packages/pandas/core/indexers/utils.py", line 428, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

After some investigation, I found that maybe the problem is with the file "test/hhsuite_target_dir/hhsuite_tsv_file.ffdata" and the program used to build it.
It looks like when a header is long like mine:

S34#1#2210b5d2-d153-4961-bdbc-a40f0688b033

The pipeline breaks on creating the hh index and reading the resulting table in pandas.
I found the header size to be an issue because I tried editing my fasta header, removing special characters like '#_-', but the error still happened.
The only thing that worked was shortening the string to:

S3412210b5d2d

After editing the header the program runs until the end, but it would be nice not to have to edit the header.

Best regards !

pharokka.tbl files are empty for conda install v1.0.0

phrokka version: v1.0.0
Python version: v3.9, conda install
Operating System:

Description

I ran pharokka on phage genomes, the input fasta file has a complete phage genome. The command run,
pharokka.py -i {input} -o {params.o} -d {params.db} -t {threads} -f

What I Did

The output has no errors, and the gbk and gff files are populated, but the pharokka.tbl is empty.

add circular plot for single phage contigs

Add sub-module for phage circular plot - e.g. using circlize

https://anaconda.org/conda-forge/r-circlize

An error after processing hhsuite

phrokka version: 0.1.4
Python version: Python 3.9.13
Operating System: Ubuntu 18.04.5 LTS

Description

I tried pharokka but I encountered a bug after processing hhsuite.

The log is as follows.

Starting pharokka.
FASTA checked
Running Phanotate.
Running tRNAscan-SE.
Running mmseqs2.
Running hhsuite.
Processing mmseqs2 output.
Processing hhsuite output.
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/pharokka/bin/pharokka.py", line 68, in <module>
    post_processing.create_txt(phan_mmseq_merge_df, length_df,out_dir, prefix)
  File "/home/ubuntu/miniconda3/envs/pharokka/bin/modules/post_processing.py", line 148, in create_txt
    phanotate_mmseqs_df_cont[['attributes2','function']] = phanotate_mmseqs_df_cont['attributes2'].str.split(';function=',expand=True)
  File "/home/ubuntu/miniconda3/envs/pharokka/lib/python3.9/site-packages/pandas/core/frame.py", line 3643, in __setitem__
    self._setitem_array(key, value)
  File "/home/ubuntu/miniconda3/envs/pharokka/lib/python3.9/site-packages/pandas/core/frame.py", line 3685, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/home/ubuntu/miniconda3/envs/pharokka/lib/python3.9/site-packages/pandas/core/indexers/utils.py", line 428, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

What I Did

I used viral contigs identified by VirSorter2 from metagenome sequences.

$ nohup pharokka.py -i ./checkv-results/vs2-results-pass2/final-viral-combined.fa -d /mnt/pharokka-db/ -o pharokka-results -t 16 &

Header of fasta is as follows.

>k141_91663||full_1
>k141_591542||full_1
......

add --prefix

Add --prefix flag to allow users to name the output files

MMseqs2 v 14-7e284 Error - Pharokka Hanging at database steps.

If you came across an error with pharokka hanging at the database steps, it is likely you have an old version of pharokka with MMseqs2 version v14-7e284.

Full explanation

The MMseqs2 team have changed the internal MMseqs2 profile format in the new MMseqs2 version v14-7e284.
This means that the pharokka database will not work with MMseqs2 v14-7e284.
As a result, pharokka needs to be run with MMseqs2 v13.45111.
As of v1.1.0, the pharokka bioconda dependencies are fixed to ensure that MMseqs2 is installed with version v13.45111.
However, if you run into issues with pharokka (particularly pre v1.1.0 versions), please check that MMseqs2 v13.45111 is installed (it will be clear in the pharokka***.log output file).
I would recommend a fresh pharokka install with `conda install -c bioconda pharokka`, or trying `conda install -c bioconda pharokka mmseqs2==13.45111`.
If you are installing pharokka from the git repository, the environment.yml file has been changed to fix MMseqs2 v13.45111, so proceed as below.

Bug in tmRNA output processing

phrokka version: v0.1.7
Python version: 3.9
Operating System: Linux

Description

Activating conda environment: snakemake/conda/f3ab90247c158ab63eeae3366d6f1d08
Traceback (most recent call last):
  File "/.snakemake/conda/f3ab90247c158ab63eeae3366d6f1d08/bin/pharokka.py", line 85, in <module>
    post_processing.create_gff(phan_mmseq_merge_df, length_df, args.infile, out_dir, prefix, locustag, tmrna_flag)
  File "/.snakemake/conda/f3ab90247c158ab63eeae3366d6f1d08/bin/post_processing.py", line 236, in create_gff
    tmrna_df.start = tmrna_df.start.astype(int)
  File "lib/python3.9/site-packages/pandas/core/dtypes/cast.py", line 1154, in astype_nansafe
    return lib.astype_intsafe(arr, dtype)
  File "pandas/_libs/lib.pyx", line 668, in pandas._libs.lib.astype_intsafe
ValueError: invalid literal for int() with base 10: 'c26805'
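
The value 'c26805' suggests that the tmRNA coordinate carried a leading `c`, which would mark a hit on the complementary strand. The sketch below parses such a coordinate and keeps the strand. It assumes the raw field looks like `[start,end]` or `c[start,end]`, which is not shown in this issue. `parse_tmrna_location` is a hypothetical helper and the end coordinate in the example is invented.

```python
import re


def parse_tmrna_location(text):
    """Parse '[start,end]' or 'c[start,end]' into (start, end, strand)."""
    match = re.fullmatch(r"(c?)\[(\d+),\s*(\d+)\]", text.strip())
    if match is None:
        raise ValueError(f"unrecognised tmRNA location: {text!r}")
    complement, start, end = match.groups()
    return int(start), int(end), "-" if complement else "+"


# The end coordinate here is made up for illustration.
print(parse_tmrna_location("c[26805,27167]"))
```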

Initial Update

The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.

--label_hypotheticals

phrokka version: 1.3.2
Python version: 3.9.16
Operating System:

Description

I'm trying to plot a phage genome using this script: $ pharokka_plotter.py -i input.fasta -n pharokka_plot -o pharokka_output_directory --label_hypotheticals

Is there a way to label only a specific hypothetical protein based on its ID in the gff/gbk file?
The "--label_hypotheticals" flag labels all hypothetical proteins.

Conversion to string failed - and a couple of additional observations.

phrokka version: 0.1.1
Python version: 3.9.13
Operating System: Ubuntu 18.04 LTS Server

Description

I launched pharokka on my viral contigs to annotate them.

What I Did

$ pharokka.py -i "$infile" -d /data/db/PHROGS -t 60

FASTA checked
Beginning Phanotate
Traceback (most recent call last):
  File "/home/bioinfo/anaconda3/envs/pharokka/bin/pharokka.py", line 16, in <module>
    processes.translate_fastas(out_dir)
  File "/home/bioinfo/anaconda3/envs/pharokka/bin/modules/processes.py", line 58, in translate_fastas
    dna_description =   phan_df['start'].iloc[i].astype(str) + "_" + phan_df['stop'].iloc[i].astype(str)
AttributeError: 'str' object has no attribute 'astype'

That's it. .astype() is an attribute of pandas objects, such as dataframes. I suggest using the str() function instead

str(phan_df['start'].iloc[i]) + "_" + str(phan_df['stop'].iloc[i])

That would solve the issue.

I've also noticed that the values in the columns start and stop are already strings, but I suspect this is actually undesired and has to do with the fact that there are comment rows included in the database. Here's a snapshot of my cleaned_phanotate.tsv file:

start   stop    frame   contig  score   gene
648     85      -       627_AS_032deLWdelim     -5950.328915480545228412878818  627_AS_032deLWdelim0 648_85
827     645     -       627_AS_032deLWdelim     -19.03537314250255055372969411  627_AS_032deLWdelim1 827_645
1027    824     -       627_AS_032deLWdelim     -7.209186680523061904788001127  627_AS_032deLWdelim2 1027_824
1242    1027    -       627_AS_032deLWdelim     -51.69015098948056216884224348  627_AS_032deLWdelim3 1242_1027
1418    1224    -       627_AS_032deLWdelim     -15.61481905040385983902763601  627_AS_032deLWdelim4 1418_1224
2251    1409    -       627_AS_032deLWdelim     -138334.0158863974592453390774  627_AS_032deLWdelim5 2251_1409
3306    2257    -       627_AS_032deLWdelim     -16974294.17134049948965038800  627_AS_032deLWdelim6 3306_2257
4066    3335    -       627_AS_032deLWdelim     -41290.30362059099239253731847  627_AS_032deLWdelim7 4066_3335
4171    4079    -       627_AS_032deLWdelim     -0.05403914249865158487576726233        627_AS_032deLWdelim8 4171_4079
4365    4180    -       627_AS_032deLWdelim     -11.59954147505824609448388817  627_AS_032deLWdelim9 4365_4180
5054    4368    -       627_AS_032deLWdelim     -50964.95419646835233135824328  627_AS_032deLWdelim10 5054_4368
4988    5545    +       627_AS_032deLWdelim     -207.1401271894045440866113863  627_AS_032deLWdelim11 4988_5545
5549    5659    +       627_AS_032deLWdelim     -0.1655476160158931084311159951 627_AS_032deLWdelim12 5549_5659
5831    6166    +       627_AS_032deLWdelim     -96.74353881729221545778586926  627_AS_032deLWdelim13 5831_6166
6166    7065    +       627_AS_032deLWdelim     -1311690.688900922166701121237  627_AS_032deLWdelim14 6166_7065
7109    7327    +       627_AS_032deLWdelim     -154.4648528527137585470452864  627_AS_032deLWdelim15 7109_7327
7351    7455    +       627_AS_032deLWdelim     -0.06459941106751764040614335289        627_AS_032deLWdelim16 7351_7455
7452    7628    +       627_AS_032deLWdelim     -1.751788612773718521814506920  627_AS_032deLWdelim17 7452_7628
7920    7795    -       627_AS_032deLWdelim     -1.899751346464667351966481006  627_AS_032deLWdelim18 7920_7795
#id:    1200_AS_032deLWdelim       
#START  STOP    FRAME   CONTIG  SCORE   CONTIG20 #START_STOP
92      3       -       1200_AS_032deLWdelim    -0.3947089479339262925811772674 1200_AS_032deLWdelim21 92_3
397     104     -       1200_AS_032deLWdelim    -203.0864784150881507217786563  1200_AS_032deLWdelim22 397_104

The lines #id: 1200_AS_032deLWdelim and #START STOP FRAME CONTIG SCORE CONTIG20 #START_STOP are maybe out of place.

Also, are you sure the delimiter (file bin/modules/processes.py, function add_delim_trim_fasta, line 27) is actually the string "delim"?
It looks a bit weird to me, but maybe it's a design feature.

Cheers! Cannot wait to look at its results.
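
On the comment rows noted above: one way to keep them out of the table, and to keep start and stop numeric, is to skip any row whose first field is not a number. A minimal sketch, assuming whitespace-separated columns in the order shown in the snapshot; `read_phanotate_rows` is a hypothetical helper, not the function in processes.py:

```python
def read_phanotate_rows(lines):
    """Return gene rows as dicts, skipping headers and '#' comment rows."""
    rows = []
    for line in lines:
        fields = line.split()
        # Header and comment rows do not begin with a numeric start coordinate.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        rows.append({
            "start": int(fields[0]),
            "stop": int(fields[1]),
            "frame": fields[2],
            "contig": fields[3],
            "score": float(fields[4]),
        })
    return rows


sample = [
    "#id:\t1200_AS_032",
    "#START\tSTOP\tFRAME\tCONTIG\tSCORE",
    "92\t3\t-\t1200_AS_032\t-0.3947089479339262925811772674",
]
for row in read_phanotate_rows(sample):
    print(str(row["start"]) + "_" + str(row["stop"]))
```

With integer coordinates the description string is built with `str()`, matching the fix suggested above.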

phrokka thread usage

phrokka version: 0.1.0
Python version: 3.9
Operating System: Linux-64

Description

phrokka silently uses as many CPUs as are available on the system, because of mmseqs and hhsuite.

What I Did

No crash, but HPC admins were not happy.

Issue with PHAROKKA 1.1.0 or 1.2.0 installation

phrokka version: 1.1.0
Python version: 3.9
Operating System: Ubuntu 20.04.2

Description

I am trying to install PHAROKKA from GitHub using conda or mamba. I tried on 3 different machines (1 Mac OS X and 2 PCs running Ubuntu), and it always failed. I tried PHAROKKA 1.1.0 as well as 1.2.0. The databases install properly, but when I run the help command pharokka.py -h I always get the following messages. Can you help me?

Thanks
Nico

What I Did

Traceback (most recent call last):
  File "/home/julia/miniconda3/envs/pharokkaENV/bin/pharokka.py", line 5, in <module>
    import processes
  File "/home/julia/miniconda3/envs/pharokkaENV/bin/processes.py", line 8, in <module>
    from BCBio import GFF
  File "/home/julia/miniconda3/envs/pharokkaENV/lib/python3.9/site-packages/BCBio/GFF/__init__.py", line 3, in <module>
    from BCBio.GFF.GFFParser import GFFParser, DiscoGFFParser, GFFExaminer, parse, parse_simple
  File "/home/julia/miniconda3/envs/pharokkaENV/lib/python3.9/site-packages/BCBio/GFF/GFFParser.py", line 34, in <module>
    from Bio.Seq import UnknownSeq
ImportError: cannot import name 'UnknownSeq' from 'Bio.Seq' (/home/julia/miniconda3/envs/pharokkaENV/lib/python3.9/site-packages/Bio/Seq.py)

[hhsuite] cluster consensus error

phrokka version: 0.1.5 (conda)
Python version: Python 3.9.13
Operating System: Ubuntu 14.04.6 LTS (GNU/Linux 4.4.0-148-generic x86_64)

Description

Describe what you were trying to get done.

Annotating a phage genome from a single contig.

Tell us what happened, what went wrong, and what you expected to happen.

hhsuite runs into an error:

2022-07-19 12:26:19,687 - INFO - Starting hhsuite
2022-07-19 12:26:19,690 - INFO - data file: pharokka_annot/hhsuite_target_dir/hhsuite_tsv_file.ffdata
2022-07-19 12:26:19,690 - INFO - index file: pharokka_annot/hhsuite_target_dir/hhsuite_tsv_file.ffindex
2022-07-19 12:26:19,690 - INFO - fasta file: pharokka_annot/phanotate_aas_tmp.fasta
2022-07-19 12:26:19,713 - INFO - - 12:26:19.713 INFO: Searching 38880 column state sequences.
2022-07-19 12:26:19,713 - INFO -
2022-07-19 12:26:20,318 - INFO - - 12:26:20.318 INFO: Thread 0  cluster_001_consensusdelim0
2022-07-19 12:26:20,318 - INFO -
2022-07-19 12:26:20,907 - INFO - - 12:26:20.905 INFO: cluster_001_consensusdelim0 is in A2M, A3M or FASTA format
2022-07-19 12:26:20,907 - INFO -
2022-07-19 12:26:20,940 - INFO - - 12:26:20.939 ERROR: In /opt/conda/conda-bld/hhsuite_1645696999782/work/src/hhalignment.cpp:511: Read:
2022-07-19 12:26:20,940 - INFO -
2022-07-19 12:26:20,940 - INFO - - 12:26:20.940 ERROR:  MSA file cluster_001_consensusdelim0 contains no master sequence!
2022-07-19 12:26:20,940 - INFO -

This seems to be related to issues #140 and #101. I had the same error with versions before 0.1.5. Has this perhaps been fixed in 0.1.6, released a few days ago?

What I Did

$ pharokka.py -i 8_medaka.fasta -o pharokka_annot -d /media/5c679734-9376-4617-815c-d4bd4177b8b2/leon/projects/06/soft/pharokkaDB -t 32 -p T3
Starting pharokka.
FASTA checked
Phanotate will be used for gene prediction
Running Phanotate.
Running tRNAscan-SE.
Running mmseqs2.
Running hhsuite.
Processing mmseqs2 output.
Processing hhsuite output.
Traceback (most recent call last):
  File "/home/leon/miniconda3/envs/pharokka_env/bin/pharokka.py", line 77, in <module>
    phan_mmseq_merge_df = post_processing.process_results(DBDIR, out_dir, prefix, gene_predictor)
  File "/home/leon/miniconda3/envs/pharokka_env/bin/modules/post_processing.py", line 80, in process_results
    tophits_hmm__df[['spl','ind']] = tophits_hmm__df['gene_hmm'].str.split('delim',expand=True)
  File "/home/leon/miniconda3/envs/pharokka_env/lib/python3.9/site-packages/pandas/core/frame.py", line 3643, in __setitem__
    self._setitem_array(key, value)
  File "/home/leon/miniconda3/envs/pharokka_env/lib/python3.9/site-packages/pandas/core/frame.py", line 3685, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/home/leon/miniconda3/envs/pharokka_env/lib/python3.9/site-packages/pandas/core/indexers/utils.py", line 428, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

Thanks!
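
The `ValueError` above is what pandas raises when `str.split(..., expand=True)` returns fewer columns than the number of keys being assigned. This happens, for example, when no row contains the separator or the table is empty, which is plausible here because hhsuite failed upstream. The following is a sketch of a defensive split, not pharokka's actual fix; the helper name `split_into_two` is hypothetical.

```python
import pandas as pd


def split_into_two(series, sep, names):
    """Split each string on the first `sep`, always returning exactly two columns."""
    # n=1 caps the result at two pieces; expand=True yields a DataFrame.
    # Note: pandas may interpret a multi-character `sep` as a regular expression.
    parts = series.str.split(sep, n=1, expand=True)
    # If no value contained `sep` (or the series is empty), fewer than two
    # columns come back; reindex pads the missing ones with NaN.
    parts = parts.reindex(columns=[0, 1])
    parts.columns = names
    return parts


hits = pd.DataFrame({"gene_hmm": ["cluster_001_consensusdelim0", "no_separator"]})
hits[["spl", "ind"]] = split_into_two(hits["gene_hmm"], "delim", ["spl", "ind"])
print(hits)
```

Rows without the separator end up with a missing value in the second column instead of aborting the run.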

Add pharokka as a tool to Galaxy

I would like to add pharokka as a tool to Galaxy. To write the wrapper, I need a small DB that can be added for the test case. Could you provide a minimal example of the DB that can be used for this purpose? The DB needs to allow the tool to execute; the output does not need to be meaningful. The real DB can be added to the Galaxy instance later.

empty VFDB and CARD counts in functions.tsv despite hits to both in top_hits_[].tsv

phrokka version: v1.1.0
Python version: 3.8.16
Operating System: Ubuntu

Description

pharokka_cds_functions.tsv lists every single contig as having 0 VF or CARD AMR genes, despite there being many hits in both top_hits_vfdb.tsv and top_hits_card.tsv. For example, when I grep for "NODE_190_length_23790_cov_6.801601" in pharokka_cds_final_merged_output.tsv, I find a VF hit to:

VFG000477(gb|NP_461845) (rpoS) RNA polymerase sigma factor RpoS [RpoS (VF0112) - Regulation (VFC0301)] [Salmonella enterica subsp. enterica serovar Typhimurium str. LT2]

However, when I grep for NODE_190_length_23790_cov_6.801601 in pharokka_cds_functions.tsv, I get:

CDS 31 NODE_190_length_23790_cov_6.801601
connector 0 NODE_190_length_23790_cov_6.801601
DNA, RNA and nucleotide metabolism 0 NODE_190_length_23790_cov_6.801601
head and packaging 0 NODE_190_length_23790_cov_6.801601
integration and excision 0 NODE_190_length_23790_cov_6.801601
lysis 0 NODE_190_length_23790_cov_6.801601
moron, auxiliary metabolic gene and host takeover 1 NODE_190_length_23790_cov_6.801601
other 1 NODE_190_length_23790_cov_6.801601
tail 0 NODE_190_length_23790_cov_6.801601
transcription regulation 1 NODE_190_length_23790_cov_6.801601
unknown function 28 NODE_190_length_23790_cov_6.801601
tRNAs 0 NODE_190_length_23790_cov_6.801601
CRISPRs 0 NODE_190_length_23790_cov_6.801601
tmRNAs 0 NODE_190_length_23790_cov_6.801601
VFDB_Virulence_Factors 0 NODE_190_length_23790_cov_6.801601
CARD_AMR_Genes 0 NODE_190_length_23790_cov_6.801601

pharokka.py -i impact1_vOTU.fna -o impact1_vOTU_pharokka -d /home/ubuntu/sdb/stan/db/pharokka -t 16 -m
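
One way to cross-check the summary counts is to count the hits per contig directly from the merged table. The sketch below uses a toy table, and the column names `contig` and `vfdb_hit` are placeholders; the real headers in pharokka_cds_final_merged_output.tsv would need to be substituted.

```python
import pandas as pd


def count_hits_per_contig(merged, contig_col, hit_col):
    """Count rows per contig whose hit column holds a real value."""
    # Treat missing values and empty strings as "no hit".
    has_hit = merged[hit_col].notna() & (merged[hit_col].astype(str).str.strip() != "")
    return merged.loc[has_hit].groupby(contig_col).size()


# Toy table; the column names "contig" and "vfdb_hit" are placeholders.
merged = pd.DataFrame(
    {
        "contig": ["NODE_190", "NODE_190", "NODE_7"],
        "vfdb_hit": ["VFG000477", None, ""],
    }
)
counts = count_hits_per_contig(merged, "contig", "vfdb_hit")
print(counts)
```

For a real run, the table could be loaded with `pd.read_csv(path, sep="\t")` and the result compared with the VFDB_Virulence_Factors rows of the functions file.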

Pharokka Database Install Failed.

I was installing pharokka, but for some reason I cannot install the databases with this command: install_databases.py -d
So I manually downloaded the databases from the following link: https://zenodo.org/record/7563578/files/pharokka_v1.2.0_database.tar.gz but I still cannot make it work. Is there any way to complete the database installation and start using pharokka?
Since I am new to bioinformatics, I am not sure what to do after downloading the databases manually.
Your help is much appreciated.

Replace Prodigal with Pyrodigal

Link to the equivalent in bakta here. Prodigal has an unpatched error with the reverse strand. To do

Plotter inclusive of AMR genes and VFs?

Thank you for this amazing tool!

Is it possible to annotate/color genes related to AMR or VFs within the pharokka plotter?

Other features to add

IS Finder (ISE scan).
AMR and particularly virulence gene detection.

Problems using the database.

phrokka version:v1.2.0
Python version:3.10.8
Operating System:Ubuntu 22.04.1 LTS

Description

I have problems using the database.

What I Did

install_databases.py -o ~/.pharokka_db

PHROGs Databases are missing.
VFDB Databases are missing.
CARD Databases are missing.
PHROGs Annotation File is missing.
INPHARED Mash Annotation File is missing. 
INPHARED Mash Sketch File is missing. 
Some Databases are missing.
Downloading Pharokka Database
Error: Pharokka Database Install Failed. 
 Please try again or use the manual option detailed at https://github.com/gbouras13/pharokka.git 
 downloading from https://zenodo.org/record/7563578/files/pharokka_v1.2.0_database.tar.gz

Installed manually, but when I run:

pharokka.py -i NC_004617.fasta -o NC_004617 -d ~/.pharokka_db/pharokka_v1.2.0_database -t 4 -f

Starting pharokka v1.2.0
Checking database installation.
PHROGs Databases are missing.
VFDB Databases are missing.
CARD Databases are missing.
PHROGs Annotation File is missing.
INPHARED Mash Annotation File is missing. 
INPHARED Mash Sketch File is missing. 

The database directory was unsuccessfully checked. Please run install_databases.py

trnascan-SE pseudogene

phrokka version: 0.1.0
Python version: 3.9
Operating System: Linux-64

Description

Phrokka crashed when trying to process the output of tRNAscan-SE. This seems to occur when tRNAscan-SE finds a pseudogene.

What I Did

 phrokka.py -i /hpcfs/users/a1667917/Phage_Mapping/All_Phages/NC_047752.1.fasta -o /hpcfs/users/a1667917/Phage_Mapping/Pipeline_Out/PHROKKA/NC_047752.1 -d Databases -f


8355.err:Traceback (most recent call last):
8355.err-  File "/hpcfs/users/a1667917/Phage_Mapping/Phage_Repeat_Mapper/.snakemake/conda/79505b378a5aeb4c419dee41dbc60887/bin/phrokka.py", line 39, in <module>
8355.err-    post_processing.create_tbl(phan_mmseq_merge_df, length_df, out_dir)
8355.err-  File "/hpcfs/users/a1667917/Phage_Mapping/Phage_Repeat_Mapper/.snakemake/conda/79505b378a5aeb4c419dee41dbc60887/bin/modules/post_processing.py", line 250, in create_tbl
8355.err-    trna_df[['attributes','isotypes']] = trna_df['attributes'].str.split(';isotype=',expand=True)
8355.err-  File "/hpcfs/users/a1667917/Phage_Mapping/Phage_Repeat_Mapper/.snakemake/conda/79505b378a5aeb4c419dee41dbc60887/lib/python3.9/site-packages/pandas/core/frame.py", line 3643, in __setitem__
8355.err-    self._setitem_array(key, value)
8355.err-  File "/hpcfs/users/a1667917/Phage_Mapping/Phage_Repeat_Mapper/.snakemake/conda/79505b378a5aeb4c419dee41dbc60887/lib/python3.9/site-packages/pandas/core/frame.py", line 3685, in _setitem_array
8355.err-    check_key_length(self.columns, key, value)
8355.err-  File "/hpcfs/users/a1667917/Phage_Mapping/Phage_Repeat_Mapper/.snakemake/conda/79505b378a5aeb4c419dee41dbc60887/lib/python3.9/site-packages/pandas/core/indexers/utils.py", line 428, in check_key_length
8355.err-    raise ValueError("Columns must be same length as key")
8355.err-ValueError: Columns must be same length as key
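
If some attribute strings lack the `;isotype=` field, splitting on it yields too few columns for the two-column assignment. An alternative that always returns one value per row is `str.extract`. The attribute strings below are made up for illustration and are not real tRNAscan-SE output.

```python
import pandas as pd

# Made-up attribute strings; the second mimics a record without an isotype field.
trna = pd.DataFrame(
    {"attributes": ["ID=trna_1;isotype=Ala;anticodon=TGC", "ID=trna_2;note=pseudo"]}
)

# str.extract returns one value per row for the single capture group,
# NaN where the pattern is absent, so the assignment cannot change shape.
trna["isotypes"] = trna["attributes"].str.extract(r";isotype=([^;]*)", expand=False)
print(trna)
```

Records without an isotype then carry a missing value that later steps can filter or fill explicitly.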

curl and tar are not in the conda recipe

phrokka version:1.2.1
Python version:3.8
Operating System:docker, biocontainer

Description

I found the probable reason for #241 and #236.

curl and tar are not in the conda recipe. Both commands are used by install_databases.py, which fails if they are not installed. This is the case, for example, when pharokka is used in a container (biocontainer) or in Galaxy.

I made a PR to bioconda (bioconda/bioconda-recipes#40481)
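
Independently of the recipe fix, downloading and unpacking can be done with Python's standard library alone, which avoids external `curl` and `tar` binaries. This is a sketch under that assumption and not how install_databases.py is implemented; the function names are hypothetical.

```python
import tarfile
import urllib.request
from pathlib import Path


def download(url, dest):
    """Fetch `url` to the file `dest` without calling an external curl."""
    urllib.request.urlretrieve(url, dest)


def extract_tarball(tarball, out_dir):
    """Unpack a .tar.gz into `out_dir` without calling an external tar."""
    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, "r:gz") as archive:
        for member in archive.getmembers():
            # Refuse entries that would be written outside out_dir.
            target = (out_dir / member.name).resolve()
            if out_dir != target and out_dir not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(out_dir)
    # Return the top-level entries so nesting is easy to spot.
    return sorted(p.name for p in out_dir.iterdir())
```

The returned listing shows whether the database files sit directly in `out_dir` or inside a nested subdirectory, which matters for the path passed to `-d`.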

MMSeqs2 error on PHROG database

phrokka version: 1.0.1
Python version: 3.10.6
Operating System: Ubuntu

Description

Greetings,
I'm trying to annotate some small (2.5-8 Kb) phage genomes using Pharokka; everything goes well until the program begins annotating using the PHROG database, which throws the following error:
Score of forward/backward SW differ: 751 748. Q: 0. T: 9857. Start: Q: 6, T:0. End: Q: 192, T:327. Cannot oper index file pharokka/mmseqs/results_mmseqs.index

What I Did

pharokka.py -i final_sanger_genomes.fasta -o pharokka -d /home/szn/pharokkadb/ -t 6 -f

What did I get wrong?

Phrokka Error in post_processing

phrokka version: 0.1.0
Python version: 3.9
Operating System: MacOS

Description

Seems like a pandas error in the str.split function

What I Did

Traceback (most recent call last):
  File "/Users/roshan/opt/anaconda3/envs/phrokka/bin/phrokka.py", line 31, in <module>
    post_processing.create_txt(phan_mmseq_merge_df, length_df,out_dir)
  File "/Users/roshan/opt/anaconda3/envs/phrokka/bin/modules/post_processing.py", line 149, in create_txt
    phanotate_mmseqs_df_cont[['attributes2','function']] = phanotate_mmseqs_df_cont['attributes2'].str.split(';function=',expand=True)
  File "/Users/roshan/opt/anaconda3/envs/phrokka/lib/python3.9/site-packages/pandas/core/frame.py", line 3643, in __setitem__
    self._setitem_array(key, value)
  File "/Users/roshan/opt/anaconda3/envs/phrokka/lib/python3.9/site-packages/pandas/core/frame.py", line 3685, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/Users/roshan/opt/anaconda3/envs/phrokka/lib/python3.9/site-packages/pandas/core/indexers/utils.py", line 428, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

Add validation if out directory exists

phrokka version: v1.3.2

The existing error is reasonably informative, but an explicit check should be added.

Traceback (most recent call last):
  File "~pharokka.py", line 57, in <module>
    out_dir = input_commands.instantiate_dirs(args.outdir, args.meta, args.force)
  File "~/pharokka/bin/input_commands.py", line 53, in instantiate_dirs
    os.mkdir(output_dir)
FileExistsError: [Errno 17] File exists: 'NC_007458.fasta'
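
An explicit check could look roughly like the sketch below. The function name `prepare_outdir` and the messages are hypothetical, and it assumes that `-f` corresponds to `args.force` and means "overwrite".

```python
import shutil
from pathlib import Path


def prepare_outdir(outdir, force=False):
    """Create `outdir`, failing early with a clear message if the name is taken."""
    path = Path(outdir)
    if path.is_file():
        # Possibly the situation in the traceback above: the name belongs to a file.
        raise FileExistsError(f"'{path}' is an existing file, not a directory.")
    if path.is_dir():
        if not force:
            raise FileExistsError(f"Directory '{path}' already exists. Use -f to overwrite it.")
        # With force, start from a clean directory.
        shutil.rmtree(path)
    path.mkdir(parents=True)
    return path
```

Checking for a file separately keeps a forced run from ever deleting an input FASTA that was passed as the output name by mistake.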

pharokka_plotter.py v1.3.1 bug

phrokka version: 1.3.1
Python version: 3.10
Operating System: Linux

Description

Bug in pharokka_plotter.py regarding tmrnas in v 1.3.1

Traceback (most recent call last):
File "/Users/a1667917/Documents/pharokka/./bin/pharokka_plotter.py", line 134, in
plot.create_plot( gff_file, gbk_file, args.interval, args.annotations, args.title_size, args.plot_title, args.truncate, plot_file, args.dpi, args.label_size, args.label_hypotheticals)
File "/Users/a1667917/Documents/pharokka/bin/plot.py", line 434, in create_plot
pos_list_trmna = [pos_list_trmna[i] for i in filtered_indices_tmrna]
File "/Users/a1667917/Documents/pharokka/bin/plot.py", line 434, in
pos_list_trmna = [pos_list_trmna[i] for i in filtered_indices_tmrna]
NameError: free variable 'pos_list_trmna' referenced before assignment in enclosing scope
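
A `NameError` of this kind arises when a comprehension reads a variable of the enclosing function that was never assigned on the path taken, for example because it is only set inside a branch. The following only illustrates the pattern and a common guard; it is not the code from plot.py, and all names are made up.

```python
def filter_positions(features, keep_indices):
    """Return the start positions at `keep_indices`, tolerating an empty feature list."""
    # Assign the list unconditionally so it exists even when there are no features.
    positions = []
    for feature in features:
        positions.append(feature["start"])
    # Skip indices that fall outside the collected positions.
    return [positions[i] for i in keep_indices if i < len(positions)]


print(filter_positions([], []))
print(filter_positions([{"start": 10}, {"start": 50}], [1]))
```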

documentation of use

phrokka version:
Python version:
Operating System:

Description

More comprehensive documentation for the use of Pharokka would be helpful for people who are less experienced with bash. This could be done in the documentation or as a vignette with a clear reference to it on the git. The documentation should include instructions on the following procedures to aid beginners:

Download and install Anaconda
Create an environment for Pharokka
Install Pharokka and its databases in the environment
When using Pharokka, activate the environment and specify the input file and output folder
Have a tested example phage on git that can be downloaded to illustrate the use and output of Pharokka as a vignette


Quick question

pharokka version:
Python version:
Operating System:

Description

I was just wondering whether the meta (-m) flag assembles a phage genome, or whether I should assemble the genome first. Does the -m flag only combine contigs from the consensus sequence?

Phrokka Databases Don't Download

phrokka version: 0.1.0
Python version: 3.9
Operating System: MacOS

Description

Tried to download the databases using install_databases.py. It failed with the below error:

What I Did

Phrokka_Dir
Getting PHROGs MMSeqs DB
dyld[25797]: missing symbol called
tar: Error opening archive: Failed to open 'Phrokka_Dir/phrogs_mmseqs_db.tar.gz'
Getting PHROGs Annotation Table
dyld[25977]: missing symbol called
Getting PHROGs HHmer DB
dyld[25979]: missing symbol called
tar: Error opening archive: Failed to open 'Phrokka_Dir/phrogs_hhsuite_db.tar.gz'

No *cds_functions.tsv in the output file

phrokka version: v1.2.1
Python version: 2.10.8
Operating System: Linux

Description

I am running
'./pharokka.py -i /work/hudi/lina/vir/checkv_outputs/LINA00089K_/vir_high_quality.fna -o /work/hudi/lina/vir/pharokka/separa_4/ -t 8 -f'
to get the *cds_functions.tsv file.

The run generated many output files, but this file was not among them:
'CARD cleaned_phanotate.tsv phanotate_aas_tmp.fasta pharokka_04102023_151937.log pharokka_minced_spacers.txt trnascan_out.gff vfdb_tmp_dir
CARD_results.tsv mmseqs phanotate_out_tmp.fasta pharokka_aragorn.txt tmp_dir vfdb
CARD_tmp_dir mmseqs_results.tsv phanotate_out.txt pharokka_minced.gff top_hits_mmseqs.tsv vfdb_results.tsv'

The log file did not contain an error message either.

I don't know why.
