HapMap files are a standard output format for SNPs generated by the
program . While the
TASSEL GUI has many functions for filtering and analyzing the hapmap
data, these scripts are meant to add some supplemental functions, and
in some cases perform the same function but on a larger scale in the
command line environment.
This script will calculate the number of SNPs found in a sliding window of a given size. These results can be used to calculate the average SNP density per window or for plotting variation in SNP density along the chromosome.
SNP_density.pl <input.hmp.txt> <output.txt> <window_size> <overlap>
input.hmp.txt = A SNP file in tab-delimited format (like the hapmap format used by TASSEL)
output.txt = The name of the output file to create
window_size = The size of the window (in base pairs)
overlap = The amount of overlap between windows (in base pairs)
The command line below will calculate SNP density on the input file "SAP_v3_noMissing.hmp.txt," and return an output file called "out.txt". The SNPs will be counted in 1Mb sliding windows, with 100Kb overlap between windows.
The output file format has 4 columns:
- Chromosome
- Window Start Pos.
- Window End Pos.
- Number of SNPs
$ ./SNP_density.pl SAP_allChr_v3.noMissing.hmp.txt out.txt 1000000 100000
Below is an example plot of SNP density along all 10 chromosomes in Sorghum bicolor, based on GBS data from the Sorghum Association Panel.
To generate a plot like this, you can use the R code below:
data = read.table("out.txt", header=FALSE)
my.chr = unique(data$V1)
chr.lengths = sapply(my.chr, function(x[-(length(x))], data) max(data[which(data$V1==x),3]), data=data)
for (i in 1:length(chr.lengths)) {
chr.addon=append(chr.addon, (chr.addon[i]+chr.lengths[i]))
}
data$V5 = data$V2 + chr.addon[data$V1]
chr.col=rainbow(length(my.chr))
plot.col=chr.col[data$V1]
plot(data$V5, (data$V4/(data$V3-data$V2)), xlab="Position (bp)",
ylab="SNPs per bp", col=plot.col, type="h")
This script calculates the distance between every pair of adjacent SNPs in the hapmap file. The output is useful for determining the average spacing of SNPs in a data set, or the likelihood of having a marker next to a gene in linkage (for GWAS).
SNP_distance.pl <input.hmp.txt> <output.txt>
input.hmp.txt = An input file in hapmap format
output.txt = An output file with a single column of distances
The command line below will calculate all of the distances between adjacent SNPs in the same SAP hapmap file used in the previous example. The output will be saved in the file "dist.txt":
$ ./SNP_distance.pl SAP_allChr_v3.noMissing.hmp.txt dist.txt
Quick awk
command line to get the average distance:
awk '{total += $1 } END { print total/NR }' dist.txt
Simultaneously filters sites based on the fraction of missing data at each site and a minor allele frequency threshold.
filter_sites_hmp.pl <input.hmp> <output.hmp> <min_percent_ind> <MAF>
input.hmp = The input file in hapmap format
output.hmp = The name of the output file, which will also be hapmap format
min_percent_ind = The minimum fraction of individuals needed to be non-missing to keep a given site (b/t 0 and 1)
MAF = The minimum minor allele frequency to retain a site (b/t 0 and 1)
The below example will remove sites with more than 20% of individuals missing, or with a MAF less than 10%:
$ ./filter_sites_hmp.pl SAP_allChr_v3.noMissing.hmp.txt out.hmp.txt 0.8 0.1
Calculate the fraction of heterozygous sites for every individual in a hapmap file. Useful for checking RIL or other crossing populations, or checking for possible cross contamination.
het_hmp.pl <input.hmp> <output.txt>
input.hmp = The input file in hapmap format
output.txt = A tab-delimited file with 2 columns:
Column 1 contains the individual sample names
Column 2 has the fraction of heterozygous sites for that individual
Below is an example using the same SAP hapmap file as input, and outputting to a file called "het.txt"
$ ./het_hmp.pl SAP_allChr_v3.noMissing.hmp.txt het.txt
Get a subset of a hapmap file corresponding to a desired list of individuals (this script will still return all sites for only those individuals). This script is useful if you have a large number of individuals you want to extract, making the point-and-click filtering in the TASSEL GUI impractical.
hmpSubset.pl <input.hmp.txt> <output.hmp.txt> <list.txt>
input.hmp.txt = An input file in the TASSEL hapmap format
output.hmp.txt = The name of a hapmap format output file to create
list.txt = A text file containing a list of individuals to extract, in a single column
The below code will extract a list of the 3 individuals in "list.txt" from the SAP hapmap file, and output a new hapmap file to "subset.hmp":
$ cat list.txt
pi17548
pi24969
pi276837
$ ./hmpSubset.pl SAP_allChr_v3.noMissing.hmp.txt subset.txt list.txt
Calculates the fraction of missing sites for each individual sample in a hapmap file. This can be helpful for determining cut-offs for filtering, and for comparing results pre- and post- imputation (or between SNPs called with different parameters in TASSEL).
missing_hmp.pl <input.hmp> <output.txt>
input.hmp = The input file in hapmap format
output.txt = A tab-delimited file with 2 columns:
Column1: Individual/Sample Name
Column2: The fraction of missing sites for that individual
In the example below, the script will calculate the fraction missing for every individual in the SAP hapmap file, and print the output to "missing.txt."
$ ./missing_hmp.pl SAP_allChr_v3.noMissing.hmp.txt missing.txt