aleimba / bac-genomics-scripts Goto Github PK

Collection of scripts for bacterial genomics

License: GNU General Public License v3.0

Perl 97.14% Shell 2.86%

perl bioinformatics genomics science microbial-genomics microbiology ngs sequencing annotation mlst

bac-genomics-scripts's Introduction

bac-genomics-scripts

A collection of scripts intended for bacterial genomics (some might also be useful for eukaryotes) from high-throughput sequencing (aka next-generation sequencing).

Summary
Introduction
Installation recommendations
Dependencies
UNIX loops
Windows - UNIX linebreak problems
Citation
License
Author - contact

Summary

Basic stats for bases and reads in FASTQ files: calc_fastq-stats
Concatenate multi-sequence files (RichSeq EMBL or GENBANK format, or FASTA format) to a single artificial file: cat_seq
COG (cluster of orthologous groups) classification of proteins: cdd2cog
Extraction of protein/nucleotide sequences from CDSs: cds_extractor
MLST (multilocus sequence typing) assignment and allele extraction for Escherichia coli (Achtman scheme): ecoli_mlst
Create a feature table for all annotated primary features in RichSeq (EMBL or GENBANK format) files: genomes_feature_table
Deprecated! Batch downloading of sequences from NCBI's FTP server: ncbi_ftp_download
Order sequence entries in FASTA/FASTQ files according to an ID list: order_fastx
Create an ortholog/paralog annotation comparison matrix from Proteinortho5 output: po2anno
Calculate stats and plot venn diagrams for genome groups according to orthologs/paralogs from Proteinortho5 output, i.e. overall presence/absence statistics for groups of genomes and not simply single genomes: po2group_stats
Strain panel query protein search with BLASTP plus concise hit summary, optional alignment, and presence/absence matrix. Also included, scripts to transpose the matrix and calculate overall presence/absence statistics for groups of columns in the matrix: prot_finder
Rename FASTA ID lines and optionally numerate them: rename_fasta_id
Reverse complement (multi-)sequence files (RichSeq EMBL or GENBANK format, or FASTA format): revcom_seq
Regions of difference (ROD) detection in genomes with BLASTN: rod_finder
NGS paired-end library insert size estimation from BAM/SAM: sam_insert-size
Randomly subsample FASTA, FASTQ, or TEXT files with reservoir sampling: sample_fastx-txt
Convert a sequence file to another format with BioPerl: seq_format-converter
Manual curation of annotation in NCBI's TBL format (e.g. from Prokka automatic annotation) in a spreadsheet software: tbl2tab
Truncate sequence files (RichSeq EMBL or GENBANK format, or FASTA format) according to given coordinates: trunc_seq
And an assortment of smaller scripts for tasks like (not yet uploaded to GitHub): alignment format converters, dnadiff, GC% calculation etc.

Introduction

All the scripts here are written in Perl (some include bash shell wrappers).

Each script is hosted in its own folder, so that a separate README.md can be included for more information. However, all of the Perl scripts include additionally a usage/help text or a comprehensive POD (Plain Old Documentation) by calling the script either without arguments/options or option -h|-help.

The scripts are only tested under UNIX, some won't run in a Windows environment (because of included UNIX commands). If you are on Windows an alternative might be Cygwin.

Installation recommendations

To download the repository, use either the 'Download ZIP' link after clicking the green 'Clone or download' button at the top or clone the repository with git:

git clone https://github.com/aleimba/bac-genomics-scripts.git

If there is an update to this GitHub repository (see above commits and releases), you can refresh your local repository by using the following command inside the local folder:

git pull

To install the scripts, copy them e.g. to a home /bin folder in your PATH and make them executable

$ find . \( -name '*.pl' -o -name '*.sh' -o -name '*.fas' -o -name '*.txt' \) -exec cp {} ~/bin \;
$ chmod u+x ~/bin/*.pl

the scripts can then be run everywhere on your system. Of course you can just call them directly by prefexing perl to the command or a './' for bash wrappers:

$ perl /path/to/script/script.pl <options>

$ ./script.sh <arguments>

Single scripts can be downloaded as well. For this purpose click on the folder you're interested in and then on the link of the script. There click on the Raw button and save this page to a file (without Raw you'll get an unusable html file). This is also true for other files (e.g. PDFs etc.).

Dependencies

All scripts are tested with Perl v5.22.1.

Most of the Perl scripts include modules from BioPerl as stated in their respective README.md or POD, which as a consequence has to be installed on your system. For BioPerl installation instructions see the website (Installation).

Some scripts need additional Perl modules, which will be stated in the associated README.md or POD. If they're not installed yet on your system get them from CPAN (installation instructions can be found on the website, see e.g. Getting Started...Installing Perl Modules or FAQ).

Furthermore, some scripts call upon statistical computing language R and dependent packages for plotting purposes (again see the respective README.md or POD).

UNIX loops

A very handy tip, if you want to run a script on all files in the current working directory you can use a loop in UNIX, e.g.:

$ for file in *.fasta; do perl script.pl "$file"; done

Windows - UNIX linebreak problems

At last, some of the scripts don't like Windows formatted line breaks, you might consider running these input files through a nifty UNIX utility called dos2unix:

$ dos2unix input

Citation

For now cite the latest major release (tag: bovine_ecoli_mastitis) hosted on Zenodo:

Leimbach A. 2016. bac-genomics-scripts: Bovine E. coli mastitis comparative genomics edition. Zenodo. http://dx.doi.org/10.5281/zenodo.215824.

Also, all scripts have a version number (see option -v), which might be included in a materials and methods section.

License

All scripts are licensed under GPLv3 which is contained in the file LICENSE.

Author - contact

For help, suggestions, bugs etc. use the GitHub issues or write an email to aleimba [at] gmx [dot] de.

Andreas Leimbach (Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

bac-genomics-scripts's People

Contributors

Stargazers

Watchers

bac-genomics-scripts's Issues

Can't convert CDDs IDs to COGs

Following all the commands in #1 , I get rpsblast claiming that the arguments in the command "rpsblast -query GCF_000005845.2_ASM584v2_protein.faa -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6" are wrong, and it works after I adjust the command to

rpsblast -i GCF_000005845.2_ASM584v2_protein.faa -d Cog -o rps-blast.out -m 8

which gets the job done, however the conversion to COGs with

perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog

gets the result

Overall assignment statistics:
~ Total query proteins categorized into COGs: 4134
~ Total COGs used for the query proteins [of 4873 overall]: 1
~ Total number of assigned functional categories: 0
~ Total functional categories used for the query proteins [of 25 overall]: 0

which brings no information, since nothing was identified. It also outputs a lot of "Use of uninitialized value" which probably means the CDD's IDs are not being recognized. The rest of the commands were used as suggested, so what is happening here? Also, this is using the 2003 COGs right?

How to run RPS-BLAST+ and `cdd2cog`

A master student from Brazil contacted me via email with questions how to run RPS-BLAST+ and cdd2cog.pl correctly. I'm copying the correspondence in here in case it is useful for someone else:

Hi, I am a master student of genetics at the Universidade Federal de Minas Gerais, Brasil. I was reading the cdd2goc description at
https://github.com/aleimba/bac-genomics-scripts/tree/master/cdd2cog#rps-blast

In the line referring to the use of RPS-Blast :

rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6

I am confuse about the cog, which I highlighted. Is this another database we must download? If so, where could I find it, and to perform a search for protein sequences of a draft bacterial genome I assembled, how database should I get? Thank you in advance.

CDS-extractor.pl should return error message (and exit code 1) when no CDS could be extracted

When genbanks are used as input that still have windows/dos line-endings, cds-extractor.pl just quits without an error message, giving the impression that it successfully extracted all CDS.

Maybe it could either be adjusted to tolerate windows line endings or to always double-check the number of extracted CDS when finished and generally raise a warning if it is zero.

Cannot retrieve path to RPS database

Hi, I think this has a simple answer, I get this error when running the rpsblast command:

rpsblast -query ROD1.txt -db /Users/Laura/Desktop/rspblast/Cog_LE/ -out rps.blast.out -evalue 1e-5 -outfmt 6
Cannot retrieve path to RPS database

Thanks!

small code change for use of the cdd2cog.pl script with COG20

Hi!

not really an issue, but putting this here in case the original author doesn't have time to modify the cdd2cog.pl script for use with the updated COG20 database. I guess this would be more properly done via pull request, but I figure more people might see an opened issue. I also realise this is a "band-aid" style fix, but might be useful to someone nonetheless.

if you're interested in using the cdd2cog.pl script with the COG20 database (which can be found here https://ftp.ncbi.nih.gov/pub/COG/COG2020/data/) only a small change is required. The body of the script will work perfectly fine, but the information parsed from the fun.txt and whog files now needs to be parsed from files with a slightly different name and format.

fun.txt is now replaced by fun-20.tab
whog can be replaced by cog-20.def.tab
both of these files can be downloaded from the link above

to retrieve the relevant info from these files, the subroutines all the way at the end of cdd2cog.pl need to be modified. For clarity and ease of copy/pasting, I have pasted the entire subroutine here. The orignal code is still in place, but commented out. The 4 added lines are present under # code to parse fun-20.tab file and # code to parse cog-20.def.tab.

after the modifications are made, the script can be run using:
cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun-20.tab -w cog-20.def.tab

Hope this is helpful!

###############
# Subroutines #
###############

### Subroutine to parse the 'cddid.tbl', 'fun' and 'whog' file contents and store in hash structures
sub parse_cdd_cog {

    ### 'cddid.tbl'
    open (my $cddid_fh, "<", "$CDDid_File");
    print "\nParsing CDDs '$CDDid_File' file ...\n"; # status message
    while (<$cddid_fh>) {
        chomp;
        my @line = split(/\t/, $_); # split line at the tabs
        if ($line[1] =~ /^COG\d{4}$/) { # search for COG CD accessions in cddid
            $CDDid{$line[0]} = $line[1]; # hash to store info; $line[0] = PSSM-Id
        }
    }
    close $cddid_fh;

    ### 'fun.txt'
    open (my $fun_fh, "<", "$Fun_File");
    print "Parsing COGs '$Fun_File' file ...\n"; # status message
    while (<$fun_fh>) {
        chomp;
	
	# code to parse fun-20.tab file
	my @line = split(/\t/, $_); # split line at the tabs
	$Fun{$line[0]} = {'desc' => $line[2], 'count' => 0}; # anonymous hash in hash
        # $line[0] = single-letter functional category, $line[2] = description of functional category
        # count used to find functional categories not present in the query proteins for final overall assignment statistics	

	# code and comments to parse original fun.txt file
	#$_ =~ s/^\s*|\s+$//g; # get rid of all leading and trailing whitespaces
        #if (/^\[(\w)\]\s*(.+)$/) {
        #    $Fun{$1} = {'desc' => $2, 'count' => 0}; # anonymous hash in hash
        #    # $1 = single-letter functional category, $2 = description of functional category
        #    # count used to find functional categories not present in the query proteins for final overall assignment statistics

        #}
    }
    close $fun_fh;

    ### 'whog'
    open (my $whog_fh, "<", "$Whog_File");
    print "Parsing COGs '$Whog_File' file ...\n"; # status message
    while (<$whog_fh>) {
        chomp;
	# code to parse cog-20.def.tab
	my @line = split(/\t/, $_); # split line at the tabs
	$Whog{$line[0]} = {'function' => $line[1], 'desc' => $line[2]}; # anonymous hash in hash
	# $line[1] = single-letter functional categories, maximal four per COG in COG20 (COG5032 no longer exists)
	# $line[0] = COG#, $line[2] = COG protein description

	#code and comments to parse the original whog file
        #$_ =~ s/^\s*|\s+$//g; # get rid of all leading and trailing whitespaces
        #if (/^\[(\w+)\]\s*(COG\d{4})\s+(.+)$/) {
        #    $Whog{$2} = {'function' => $1, 'desc' => $3}; # anonymous hash in hash
        #    # $1 = single-letter functional categories, maximal five per COG (only COG5032 with five)
	#    # $2 = COG#, $3 = COG protein description
        #}
    }
    close $whog_fh;

    return 1;
}

Error while executing po2group_stats.pl

Hi,

I get this error while executing po2group_stats.pl:

Bareword found where operator expected at ./po2group_stats.pl line 669, near "s/.\w+$//r"
syntax error at ./po2group_stats.pl line 669, near "s/.\w+$//r"
BEGIN not safe after errors--compilation aborted at ./po2group_stats.pl line 913

Any advice please?

Thanks

cdd2cog and RPSBLAST 2.2.31+ report parse

Hi,

Thanks for the tool, really useful!.

There seems to be a change on the output of RPS-Blast, an example line:

fig|992418.4.peg.8018 gnl|CDD|223110 26.42 159 100 6 20 165 11 165 4e-13 65.6 45

The subject id format changed, and the current regex on cdd2cog doesn't work. Changing the line 302 to my $pssm_id = $1 if $line[1] =~ /^gnl\|CDD\|(\d+)/; # get PSSM-Id from the subject hit fix the issue.

Regards

Error when running cdd2cog

Hi, I am trying to run cdd2cog and I am getting the Error: Excessively long <> operator at ./cdd2cog.pl line 21

I am running: $ perl ./cdd2cog.pl -r GRW1_COGs_rpsblast.out -c cddid.tbl -f fun.txt -w whog -a

I got the rpsblast output by running: rpsblast -i ../proteins/GRW1_prots.fas -d Cog_LE/Cog -o GRW1_COGs_rpsblast.out -e 0.01 -m 8

I notice that the outfmt parameter written in the README file for running RPS-BLAST says to use "-outfmt 6 or 7", but when I tried running this option it did not want to take it.

Here are the first few lines of the GRW1_COGs_rpsblast.out:
gene_3|GeneMark.hmm|389_aa|+|1295|2464 >k123_143 gnl|CDD|226020 17.25 313 213 12 105 389 65 359 2e-05 42.9
gene_6|GeneMark.hmm|322_aa|-|993|1961 >k123_238 gnl|CDD|224392 21.43 140 92 3 70 209 16 137 5e-08 49.9
gene_7|GeneMark.hmm|235_aa|-|1975|2682 >k123_238 gnl|CDD|224113 21.79 179 126 2 20 198 76 240 2e-19 81.8

update compatibility to COG2014?

Hi Andreas,

Love the cdd2cog.pl script and description, thanks for making this available. Was wondering if you planned to update this to the newer release of COG ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2014/data/ with the fun and whog updated to fun2003-2014.tab and cognames2003-2014.tab. I tried running cdd2cog.pl with these newer classification files and it doesn't seem to parse the COG categories and so the func_stats.txt file does not get populated.

No output argument?

Greetings

Does cdd2cog have an output argument for deciding where to export results, or do they have to go the the ./results directory always?

Error with rpsblast+

Dear Aleimba,

I tried running the following command with the respective error message.

rpsblast+ -query protein.faa -db ../../COG/cog_LE_db/ -evalue 0.00001 -outfmt "6 qseqid sseqid evalue pident score qstart qend sstart send length slen" -out cog_blast_output.out

BLAST Database error: No alias or index file found for protein database [../../COG/cog_LE_db/] in search path [/home/rajal/Downloads/prokka/prokka_annotation:/var/lib/blastdb::]

I am using rpsblast+ version [Reverse Position Specific BLAST 2.2.28+]. What am I missing here?

Thank you.

running rpsblast with gi numbers as queries

Hi, I'm trying to do an rpsblast using ncbi-blast-2.5.0+. I have a file containing gi numbers as my query and my command line is below.

rpsblast -query t -db results/Cog/Cog -out rps-blast.out -evalue 1e-2 -outfmt 6
Oddly the rpbslast keeps giving me this error:

Warning: [rpsblast] Error initializing remote BLAST database data loader: Protein BLAST database 'Cog/Cog nr' does not exist in the NCBI servers
My query file (t) looks like this. I've tried to remove search using just the numbers (292833481) but it still doesn't solve the problem.
gi|292833481
gi|383341230
gi|289693981

However, when I try to the same search but using a fasta file as query, it runs fine and gives the results I need.
rpsblast -query GCF_000005845.2_ASM584v2_protein.faa -db results/Cog/Cog -out rps-blast.out -evalue 1e-2 -outfmt 6
Is there something that I did wrong here? What is the correct way to format query a list of gi's for blast? Thanks!