vishnuraghuram94 / agrvate

Rapid identification of Staphylococcus aureus agr locus type and agr operon variants.

License: MIT License


agrvate's Introduction

AgrVATE

Agr Variant Assessment & Typing Engine

AgrVATE rapidly identifies the Staphylococcus aureus agr locus type and reports possible variants in the agr operon.

WORKFLOW:

AgrVATE accepts an S. aureus genome assembly as input and performs a kmer search against an Agr-group specific kmer database to assign the Agr-group. The agr operon is then extracted using in-silico PCR, and variants are called against an Agr-group specific reference operon.

Citation

Please cite the following paper if you use AgrVATE in your research. Thank you!

Raghuram V, Alexander AM, Loo HQ, Petit RA 3rd, Goldberg JB, Read TD. Species-Wide Phylogenomics of the Staphylococcus aureus Agr Operon Revealed Convergent Evolution of Frameshift Mutations. Microbiol Spectr. 2022 Jan 19;10(1):e0133421. doi: 10.1128/spectrum.01334-21. Epub ahead of print. PMID: 35044202; PMCID: PMC8768832.

INSTALLATION:

Please see the PREREQUISITES section for all the tools required to run AgrVATE. For ease of use, I recommend installing AgrVATE using Conda.

conda create -n agrvate -c bioconda agrvate
conda activate agrvate

This will install all necessary dependencies EXCEPT Usearch. Because of Usearch's license, it cannot be provided with the conda installation. Please download and extract usearch11.0.667 (osx32 or linux32) from https://www.drive5.com/usearch/download.html and add it to your PATH.

For example (Use the version appropriate for your operating system):

curl "https://www.drive5.com/downloads/usearch11.0.667_i86linux32.gz" --output usearch11.0.667_i86linux32.gz #Downloads usearch binary

gunzip usearch11.0.667_i86linux32.gz #Decompresses usearch binary

chmod 755 usearch11.0.667_i86linux32 #Changes permissions to executable

cp ./usearch11.0.667_i86linux32 "$(dirname "$(which agrvate)")" #Copies usearch binary to the same directory as agrvate

NOTE: Currently, only the 32-bit version of usearch is free to use. This version is not supported by WSL or macOS (post-Catalina). It is therefore recommended to use AgrVATE on Linux machines or older versions of macOS. If you are unable to run usearch, use the -m option to run MUMmer instead (IN BETA). However, please note that if there are large insertions/deletions in the agr operon, MUMmer can split the alignment into two, and the resulting extracted agr operon will not be intact. In that case, frameshift detection using snippy may miss these indels.
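
As a rough sketch of the fallback described in the note, the snippet below checks whether the usearch binary is on PATH and otherwise adds -m. The helper name choose_backend_flag and the file name genome.fasta are assumptions for illustration, and the snippet only prints the command it would run. Finding the binary on PATH does not guarantee that the 32-bit executable can actually run on your system.

```shell
# choose_backend_flag: hypothetical helper, not part of AgrVATE.
# Prints "-m" when the named usearch executable is not on PATH.
choose_backend_flag() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s' ""      # usearch found: no extra flag
    else
        printf '%s' "-m"    # fall back to MUMmer (beta)
    fi
}

flag=$(choose_backend_flag usearch11.0.667_i86linux32)
# Only print the command; genome.fasta is a placeholder file name.
echo "agrvate -i genome.fasta ${flag}"
```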

PREREQUISITES:

Usearch 32 bit linux
Robert C. Edgar, Search and clustering orders of magnitude faster than BLAST, Bioinformatics, Volume 26, Issue 19, 1 October 2010, Pages 2460–2461, https://doi.org/10.1093/bioinformatics/btq461
NCBI blast+
Camacho, C., Coulouris, G., Avagyan, V. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009). https://doi.org/10.1186/1471-2105-10-421
Snippy
Seemann T (2015). Snippy: fast bacterial variant calling from NGS reads. https://github.com/tseemann/snippy
MUMmer
S. Kurtz. et al (2004). Versatile and open software for comparing large genomes. Genome Biology, R12. https://doi.org/10.1186/gb-2004-5-2-r12
HMMER
S.R. Eddy. Biological sequence analysis using profile hidden Markov models. http://hmmer.org/
SeqKit
Shen W, Le S, Li Y, Hu F (2016) SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS ONE 11(10): e0163962. https://doi.org/10.1371/journal.pone.0163962

Databases folder for agr group typing and variant calling

DREME
DREME is not required for AgrVATE but it was used to build the kmer database for Agr-group typing (gp1234_motifs_all.fasta)
Timothy L. Bailey, DREME: motif discovery in transcription factor ChIP-seq data, Bioinformatics, Volume 27, Issue 12, 15 June 2011, Pages 1653–1659, https://doi.org/10.1093/bioinformatics/btr261

 agrvate_databases/
 	├── agrD_hmm.hmm
 	├── agrD_hmm.hmm.h3f
 	├── agrD_hmm.hmm.h3i
 	├── agrD_hmm.hmm.h3m
 	├── agrD_hmm.hmm.h3p
 	├── agr_operon_primers.fa
 	├── gp1234_motifs_all.fasta
 	└── references
 		├── gp1-operon_ref.gbk
 		├── gp2-operon_ref.gbk
 		├── gp3-operon_ref.gbk
 		├── gp4-operon_ref.gbk
 		└── mummer_ref_operon.fna

USAGE:

agrvate -i filename.fasta [options]

FLAGS:
- -i Input S. aureus genome in FASTA format [alternate: --input]
- -t Does agr typing only (skips agr operon extraction and frameshift detection) [alternate: --typing-only]
- -m Uses MUMmer dnadiff instead of usearch [alternate: --mummer]
- -f Force overwrite existing results directory [alternate: --force]
- -d Path to agrvate_databases (Not required if installed using Conda) [alternate: --databases]
- -h Print this help message and exit [alternate: --help]
- -v Print version and exit [alternate: --version]

AgrVATE supports a single FASTA file as input, but the file can be a multi-fasta file. To run multiple S. aureus genomes, it is recommended to keep them as separate files in a common directory.
For example:

ls fasta_files/* | xargs -I {} agrvate -i {} [options]
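
A plain shell loop is an alternative that avoids parsing the output of ls. The sketch below assumes the same fasta_files directory; run_each is a hypothetical helper, and the example call is a dry run that prints each command instead of executing agrvate.

```shell
# run_each: hypothetical helper, not part of AgrVATE.
# Usage: run_each <directory> <command...>; runs "<command...> -i <file>" per file.
run_each() {
    dir=$1
    shift
    for f in "$dir"/*; do
        [ -f "$f" ] || continue    # skip subdirectories and unmatched globs
        "$@" -i "$f"
    done
}

# Dry run: "echo agrvate" prints the command lines only.
run_each fasta_files echo agrvate
```

Replacing `echo agrvate` with `agrvate` (plus any options placed before the file argument) would run the tool itself.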

OUTPUTS:

RESULTS:

A new directory with the suffix -results will be created, containing all of the following files.

NOTE: There are 15 possible kmers for each agr group per genome. The analysis continues even if only one kmer matches a given agr-group, but fewer than 5 matching kmers indicates a low-confidence agr-group call. Col 3 in fasta-summary.tab shows the number of kmers matched.

fasta-summary.tab:

  col 1: Filename
  col 2: Agr group (gp1/gp2/gp3/gp4). 'u' means unknown. If multiple agr groups were found (col 5 = m), the displayed agr group is the majority/highest confidence. 
  col 3: Match score for agr group (maximum 15; 0 means untypeable; < 5 means low confidence)
  col 4: Canonical or non-canonical agrD ( 1 means canonical; 0 means non-canonical; u means unknown)
  col 5: Whether multiple agr groups were found, which is likely due to multiple S. aureus isolates in the sequence ( s means single, m means multiple, u means unknown )
  col 6: Number of frameshifts found in CDS of extracted agr operon ( Column is 'u' if agr operon was not extracted )

If multiple assemblies are run, use this command from the parent directory to output a consolidated summary table for all samples:

  awk 'FNR==1 && NR!=1 { while (/^#/) getline; } 1 {print}' ./*-results/*-summary.tab > filename.tab
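
The thresholds described for col 3 can be applied to a summary table with a short awk filter. This is a minimal sketch, assuming the table is tab-separated and that any header lines start with '#'. The helper name flag_low_confidence and the rows for sampleB and sampleC are made up for illustration.

```shell
# flag_low_confidence: hypothetical helper, not part of AgrVATE.
# Reports samples whose match score (col 3) is 0 or below 5.
flag_low_confidence() {
    awk -F'\t' '
        /^#/    { next }                            # skip header lines, if present
        $3 == 0 { print $1 "\tuntypeable"; next }
        $3 < 5  { print $1 "\tlow_confidence" }
    ' "$1"
}

# Small demonstration table; only the first row comes from a real run.
tmp=$(mktemp)
printf 'ERX204841\tgp1\t13\t1\ts\t0\n' >  "$tmp"
printf 'sampleB\tgp2\t3\t1\ts\t0\n'    >> "$tmp"
printf 'sampleC\tu\t0\tu\tu\tu\n'      >> "$tmp"
flag_low_confidence "$tmp"
rm -f "$tmp"
```

The same function could be pointed at the consolidated table produced by the awk command above.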

fasta-agr_gp.tab:

  col 1: Assembly Contig ID
  col 2: ID of matched agr group kmer
  col 3: evalue
  col 4: Percentage identity of match
  col 5: Start position of kmer alignment on input sequence
  col 6: End position of kmer alignment on input sequence

fasta-agr_operon_frameshifts.tab:
Frameshift mutations in CDS of extracted agr operon detected by Snippy. An agr-group specific reference sequence is used to call variants.
```
  col 1: Filename
  col 2: Position on agr operon compared to reference
  col 3: Type of frameshift
  col 4: Effect of mutation
  col 5: Gene
```
fasta-blastn-log.txt:
Standard output of ncbi blastn
fasta-agr_operon.fna:
Agr operon extracted from in-silico PCR using USEARCH -SEARCH_PCR in fasta format
fasta-hmm.tab:
Tabular output of nhmmer. This file is present only if the agr group is untypeable.
fasta-hmm-log.txt:
Standard output of nhmmer. This file is present only if the agr group is untypeable.
fasta-pcr-log.tab:
Standard output of USEARCH -SEARCH_PCR
fasta-snippy_log.txt:
Standard output of Snippy
fasta-snippy/
All output files of Snippy
fasta-mummer_log.txt:
Standard output of MUMmer dnadiff
fasta-mummer/
All output files of MUMmer dnadiff
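
To summarise fasta-agr_operon_frameshifts.tab, the gene in col 5 can be tallied with awk. This sketch assumes a tab-separated file with one frameshift per row; count_frameshifts_by_gene is a hypothetical helper, and the demonstration rows (sample names, positions, and the type and effect strings) are invented.

```shell
# count_frameshifts_by_gene: hypothetical helper, not part of AgrVATE.
# Counts rows per gene name found in col 5.
count_frameshifts_by_gene() {
    awk -F'\t' '
        /^#/    { next }                  # skip header lines, if present
        NF >= 5 { n[$5]++ }
        END     { for (g in n) print g "\t" n[g] }
    ' "$1" | sort
}

# Invented rows, for illustration only.
tmp=$(mktemp)
printf 'sampleA\t1205\tdel\tframeshift_variant\tagrC\n' >  "$tmp"
printf 'sampleB\t1310\tins\tframeshift_variant\tagrC\n' >> "$tmp"
printf 'sampleC\t2950\tdel\tframeshift_variant\tagrA\n' >> "$tmp"
count_frameshifts_by_gene "$tmp"
rm -f "$tmp"
```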

TROUBLESHOOTING

An error report summary file with suffix -error-report.tab will be created in the working directory.

The error report file does not contain any results. It merely shows which steps of the process pipeline ran (pass) and which steps did not (fail).

pass does not necessarily mean that a result was obtained; it only means the step completed successfully.
fail does not necessarily mean there was an error; it only means the step was not performed. Possible causes of failure for each column are listed below.

The columns are ordered by the sequence in which the steps are carried out, i.e., col 1 comes first and col 7 last. If one column shows fail, the programme exited at that step, so all remaining columns will also show fail.

error-report.tab:

  col 1: Input name - the argument supplied to the -i flag
  col 2: Input check - If fail, the input did not pass the valid fasta file criteria
  col 3: Databases check - If fail, the databases folder or the path to the databases was not valid. 
  col 4: Outdir check - If fail, the results directory already exists and couldn't be overwritten. Use flag -f or --force. 
  col 5: Agr typing - If fail, the Agr typing kmer search could not be performed. Check if blastn is installed correctly. 
  col 6: Operon check - If fail, in-silico PCR was not performed by usearch or agr operon search was not performed by mummer. Check if usearch/mummer is installed correctly. 
  col 7: Snippy check - If fail, agr operon frameshift detection was not performed. Check if snippy is installed correctly.

If multiple assemblies are run, use this command from the parent directory to output a consolidated report table for all samples:

  awk 'FNR==1 && NR!=1 { while (/^#/) getline; } 1 {print}' ./*-error-report.tab > filename.tab
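
Because every column after the first fail also shows fail, the first failing column identifies the step where the run stopped. The sketch below reports that step per sample. It assumes a tab-separated report containing the literal strings pass and fail; first_failed_step is a hypothetical helper and the demonstration rows are made up.

```shell
# first_failed_step: hypothetical helper, not part of AgrVATE.
# Prints the first failing step (cols 2 to 7) for each row of an error report.
first_failed_step() {
    awk -F'\t' '
        BEGIN {
            step[2] = "input check";  step[3] = "databases check"
            step[4] = "outdir check"; step[5] = "agr typing"
            step[6] = "operon check"; step[7] = "snippy check"
        }
        /^#/ { next }                     # skip header lines, if present
        {
            for (i = 2; i <= 7; i++)
                if ($i == "fail") { print $1 "\t" step[i]; next }
            print $1 "\tall steps ran"
        }
    ' "$1"
}

# Made-up report rows, for illustration only.
tmp=$(mktemp)
printf 'sampleA.fna\tpass\tpass\tpass\tpass\tpass\tpass\n' >  "$tmp"
printf 'sampleB.fna\tpass\tpass\tfail\tfail\tfail\tfail\n' >> "$tmp"
first_failed_step "$tmp"
rm -f "$tmp"
```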

Author

Vishnu Raghuram


agrvate's Issues

option to force overwrite?

I'm all for not overwriting existing results! Good job!

I think it would be useful to add an option that allows users to forcibly overwrite existing outputs. Here's an example where it would be useful:

ls -lh
total 2.9M
drwxr-xr-x 2 rpetit users 4.0K Jan  6 16:35 empty_folder
-rw-r--r-- 1 rpetit users 2.9M Jan  6 16:31 ERX204841.fna


agrvate ERX204841.fna empty_folder/
cat: ERX204841-results/ERX204841-agr_gp.tab: No such file or directory
cat: ERX204841-results/ERX204841-agr_gp.tab: No such file or directory
Unable to agr type

Error: File existence/permissions problem in trying to open query file empty_folder//agrD_hmm.hmm.
HMM file empty_folder//agrD_hmm.hmm not found (nor an .h3m binary of it)

grep: ERX204841-results/ERX204841-hmm.tab: No such file or directory
Unable to find agrD
usearch11.0.667_i86linux32 not in path, cannot perform frameshift detection
 please download usearch11.0.667_i86linux32 from https://www.drive5.com/usearch/download.html
 
# Run failed but results folder was created, which is fine
ls -lh
total 2.9M
drwxr-xr-x 2 rpetit users 4.0K Jan  6 16:35 empty_folder
-rw-r--r-- 1 rpetit users 2.9M Jan  6 16:31 ERX204841.fna
drwxr-xr-x 2 rpetit users 4.0K Jan  6 16:42 ERX204841-results

# fixed database path
agrvate ERX204841.fna
Results directory already exists, cannot overwrite

# manually remove folder
rm -rf ERX204841-results/

agrvate ERX204841.fna
agr typing successful, gp1
usearch found
agr operon extraction successful
Snippy successful
No frameshifts found

Alternative

agrvate ERX204841.fna -f 
found ERX204841-results/, but '-f' given will delete results folder
agr typing successful, gp1
usearch found
agr operon extraction successful
Snippy successful
No frameshifts found

invalid inputs return exit code 0, and print to STDOUT

Very minor, but errors caught all return exit code 0 (no error). The error messages are also printed to STDOUT.

agrvate g
Invalid input

AgrVATE: Agr Variant Assessment & Typing Engine

VERSION: agrvate v1.0

USAGE:   agrvate <fasta file>
         <fasta_file> <path/to/agrvate_databases> #Not required if installed using Conda

FLAGS:
  -h     Print this help message
  -v     Print version

SOURCE:  https://github.com/VishnuRaghuram94/AgrVATE

echo $?
0

Lines affected:
https://github.com/VishnuRaghuram94/AgrVATE/blob/main/agrvate#L34
https://github.com/VishnuRaghuram94/AgrVATE/blob/main/agrvate#L41-L72
https://github.com/VishnuRaghuram94/AgrVATE/blob/main/agrvate#L84-L88
https://github.com/VishnuRaghuram94/AgrVATE/blob/main/agrvate#L91-L95
https://github.com/VishnuRaghuram94/AgrVATE/blob/main/agrvate#L191-L200
https://github.com/VishnuRaghuram94/AgrVATE/blob/main/agrvate#L206-L225

For the exits that you think should be an error, you can change them to exit 1

For messages you think should go to STDERR, I think you can do something like echo "my error message" 1>&2
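
The two suggestions can be combined as in the sketch below. It is not taken from the agrvate script; err and check_input are hypothetical helpers that print the message to STDERR and return a non-zero status.

```shell
# err: hypothetical helper that writes its arguments to STDERR.
err() {
    echo "$*" 1>&2
}

# check_input: hypothetical validation step; returns 1 when the file is missing.
check_input() {
    if [ ! -f "$1" ]; then
        err "Invalid input: $1"
        return 1
    fi
}

# Inside a script this would typically read: check_input "$1" || exit 1
status=0
check_input no_such_file.fasta 2>/dev/null || status=$?
echo "exit code: ${status}"
```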

update citation to publication instead of preprint

Hi, thanks for developing this tool. I just noticed that the peer reviewed publication is out and thought you might want to link to that on the main /README.md instead of the pre-print.

https://journals.asm.org/doi/10.1128/spectrum.01334-21

question about agr type argument in biorxiv paper

Is it true that agr types of every clonal complex should be the same? I have seen this trend at the mlst level but not clonal complex level. For example, ST188 is CC1 but has gp1 (not gp3 shown in your paper).

Many thanks,

why choose --minqual 1 --mincov 2?

Thank you for this great tool for detecting mutations in the agr operon!

I noticed that the values for --minqual and --mincov are 1 and 2, respectively. Why did you choose these values?

Best regards,
Tonny_z

check existence of individual database files

Currently only the existence of the database directory is checked. It would be useful if each database file used is checked before any processing occurs.

Steps to repeat:

mkdir empty_folder
agrvate ERX204841.fna empty_folder/
cat: ERX204841-results/ERX204841-agr_gp.tab: No such file or directory
cat: ERX204841-results/ERX204841-agr_gp.tab: No such file or directory
Unable to agr type

Error: File existence/permissions problem in trying to open query file empty_folder//agrD_hmm.hmm.
HMM file empty_folder//agrD_hmm.hmm not found (nor an .h3m binary of it)

grep: ERX204841-results/ERX204841-hmm.tab: No such file or directory
Unable to find agrD
usearch11.0.667_i86linux32 not in path, cannot perform frameshift detection
 please download usearch11.0.667_i86linux32 from https://www.drive5.com/usearch/download.html
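
One way to sketch such a check is to loop over the expected file names before any processing starts. The file list below is taken from the agrvate_databases tree shown in the README; check_databases is a hypothetical helper, and whether the script needs every one of these files (or the .h3* index files as well) is an assumption.

```shell
# check_databases: hypothetical helper, not part of AgrVATE.
# Returns 1 if any expected database file is missing or empty.
check_databases() {
    db=$1
    missing=0
    for f in agrD_hmm.hmm agr_operon_primers.fa gp1234_motifs_all.fasta \
             references/gp1-operon_ref.gbk references/gp2-operon_ref.gbk \
             references/gp3-operon_ref.gbk references/gp4-operon_ref.gbk \
             references/mummer_ref_operon.fna; do
        if [ ! -s "$db/$f" ]; then
            echo "Missing database file: $db/$f" 1>&2
            missing=1
        fi
    done
    return "$missing"
}

status=0
check_databases empty_folder 2>/dev/null || status=$?
echo "databases check exit code: ${status}"
```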

check for usearch before processing

The check for usearch happens after processing would have occurred

https://github.com/VishnuRaghuram94/AgrVATE/blob/main/agrvate#L188-L200

I think it would be useful to check this at the start before processing.

You could then give the users the commands to download it, like you have in the README. Instead of the cp usearch /usr/bin you could use the same path that agrvate is in. Something like:

...
script_dir=$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )
...

echo "usearch11.0.667_i86linux32 not in path, cannot perform frameshift detection" 1>&2
echo " please download usearch11.0.667_i86linux32 from https://www.drive5.com/usearch/download.html" 1>&2
echo "Example commands:" 1>&2
echo "wget usearch" 1>&2
echo "gunzip usearch" 1>&2
echo "chmod 755 usearch" 1>&2
echo "cp usearch ${script_dir}/usearch11.0.667_i86linux32" 1>&2

help and version parameters don't work with a fasta file

Steps to repeat:

agrvate ERX204841.fna -h
-h does not exist

agrvate ERX204841.fna -v
-v does not exist

Expected behavior:

agrvate ERX204841.fna -h

AgrVATE: Agr Variant Assessment & Typing Engine

VERSION: agrvate v1.0

USAGE:   agrvate <fasta file>
         <fasta_file> <path/to/agrvate_databases> #Not required if installed using Conda

FLAGS:
  -h     Print this help message
  -v     Print version

SOURCE:  https://github.com/VishnuRaghuram94/AgrVATE

license needed

Yo! Before putting on Bioconda you will need to pick a license.

snippy not running

Column 7 of output returns "fail". All other outputs are returning "pass". I am running on mummer as I cannot download USearch on my current MacOS software.
Snippy is installed correctly as version 3.1
Please could you advise?

novel agr types?

Hello,

If the output of AgrVATE indicates that there are missense mutations in AgrD, does that indicate that my isolate has a novel Agr type? My understanding is that there should be no mutations in the AgrD

Many thanks for your wonderful tool,

thoughts on adding column headers to tabbed outputs?

What do you think about adding the column names to the tabbed outputs? I think it would be useful to users.

Something like:

cat ERX204841-results/ERX204841-summary.tab
ERX204841       gp1     13      1       s       0

cat ERX204841-results/ERX204841-summary.tab
filename	agr_group	match_score	canonical_agr	groups_found	frameshifts
ERX204841       gp1     13      1       s       0

input for DREME

Hello,

Is it possible to adapt your tool for other staph species? If so, what would be the appropriate input to DREME?

Best wishes,

dots in sample name cause name to get truncated to first dot

I input S.201202.00885.fna and the results were written to S-results

Processing S.201202.00885.fna ...
  /local/home/rpetit/miniconda3/envs/staph-typer2/bin/agrvate_databases/ is valid
  agr typing successful, gp3
  Mummer successful
  Unable to find agr operon, check S-results/S-mummer-log.txt
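
The truncation is consistent with a shell expansion that removes everything after the first dot, although the actual cause in the agrvate script has not been confirmed here. The sketch below contrasts the two standard parameter expansions; both function names are hypothetical.

```shell
# strip_at_first_dot: removes the longest suffix starting with a dot.
strip_at_first_dot() {
    base=${1##*/}                  # drop any leading directory
    printf '%s\n' "${base%%.*}"
}

# strip_extension: removes only the final ".ext" suffix.
strip_extension() {
    base=${1##*/}
    printf '%s\n' "${base%.*}"
}

strip_at_first_dot "S.201202.00885.fna"   # prints: S
strip_extension    "S.201202.00885.fna"   # prints: S.201202.00885
```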

feature request - allow for user to define output prefix and/or output directory

It would be nice for the user to be able to define, at a minimum, the output filename prefix with an option like --outprefix <string> and, even better, the output directory with an option like --outdir <directory>. This would allow for more flexibility and predictability of output filenames, which is useful when incorporating agrvate into workflows.

AFAIK agrvate currently uses the filename prefix to name the output directory and resulting file names. Looks to me like it is cutting on the period, but perhaps I'm misunderstanding the code here: https://github.com/VishnuRaghuram94/AgrVATE/blob/main/agrvate#L138

$ agrvate -i asdfasdf12345.fasta -m
Processing asdfasdf12345.fasta ...
/usr/local/bin/agrvate_databases/ is valid
agr typing successful, gp1
Mummer successful
Extracting agr operon from mummer output
Mummer alignment is contiguous
agr operon extraction successful
Snippy successful
No frameshifts found

$ tree -L 1 asdfasdf12345-results/
asdfasdf12345-results/
├── asdfasdf12345-agr_gp.tab
├── asdfasdf12345-agr_operon.fna
├── asdfasdf12345-agr_operon_frameshifts.tab
├── asdfasdf12345-blastn_log.txt
├── asdfasdf12345-mummer
├── asdfasdf12345-mummer-log.txt
├── asdfasdf12345-snippy
├── asdfasdf12345-snippy-log.txt
└── asdfasdf12345-summary.tab

The result of frame shift is not consistent between output of snippy and blast result.

Thank you for this great tool to detect the mutations in agr operon!

Recently, I used this great tool for some MRSA genomes, and I found that for some strains, the position of frame shift was not consistent between the output from snippy (snps.tab) and the blast result (between the reference agr sequence and agr operon extracted by agrvate).

For instance, the position of frame shift in agrC of strain A from snippy was c.487delT (p.Tyr163fs); however, the blast result showed that the position of frame shift in agrC was c.481delT. I guess this might be caused by the repeated T base?

Please could you explain this phenomenon?
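
The guess about the repeated T base is plausible in general terms: when one base is deleted from a run of identical bases, every position in the run yields the same resulting sequence, so different tools may report different coordinates for the same deletion (for example, the leftmost versus the rightmost position of the run). Whether this accounts for c.481 versus c.487 would need to be checked against the actual agrC sequence. The toy sequence and the helper delete_base below are made up to illustrate the ambiguity.

```shell
# delete_base: hypothetical helper; removes the base at 1-based position $2 from $1.
delete_base() {
    awk -v s="$1" -v p="$2" 'BEGIN { print substr(s, 1, p - 1) substr(s, p + 1) }'
}

dna="GCATTTTTTTGA"                 # made-up sequence with a run of T bases starting at position 4
first=$(delete_base "$dna" 4)      # delete a T near the start of the run
last=$(delete_base "$dna" 10)      # delete a T at the end of the run
echo "deletion at 4:  $first"
echo "deletion at 10: $last"
if [ "$first" = "$last" ]; then
    echo "same sequence, different reported position"
fi
```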