Scripts for Performing various tasks with HapMap Files

HapMap files are a standard output format for SNPs generated by the program . While the TASSEL GUI has many functions for filtering and analyzing the hapmap data, these scripts are meant to add some supplemental functions, and in some cases perform the same function but on a larger scale in the command line environment.

SNP Density

This script will calculate the number of SNPs found in a sliding window of a given size. These results can be used to calculate the average SNP density per window or for plotting variation in SNP density along the chromosome.

SNP_density.pl <input.hmp.txt> <output.txt> <window_size> <overlap>
		input.hmp.txt = A SNP file in tab-delimited format (like the hapmap format used by TASSEL)
		output.txt = The name of the output file to create
		window_size = The size of the window (in base pairs)
		overlap = The amount of overlap between windows (in base pairs)

Example

The command line below will calculate SNP density on the input file "SAP_v3_noMissing.hmp.txt," and return an output file called "out.txt". The SNPs will be counted in 1Mb sliding windows, with 100Kb overlap between windows.

The output file format has 4 columns:

Chromosome
Window Start Pos.
Window End Pos.
Number of SNPs

$ ./SNP_density.pl SAP_allChr_v3.noMissing.hmp.txt out.txt 1000000 100000

Plotting Results

Below is an example plot of SNP density along all 10 chromosomes in Sorghum bicolor, based on GBS data from the Sorghum Association Panel.

To generate a plot like this, you can use the R code below:

data = read.table("out.txt", header=FALSE)
my.chr = unique(data$V1)
chr.lengths = sapply(my.chr, function(x[-(length(x))], data) max(data[which(data$V1==x),3]), data=data)
for (i in 1:length(chr.lengths)) {
	chr.addon=append(chr.addon, (chr.addon[i]+chr.lengths[i]))
}
data$V5 = data$V2 + chr.addon[data$V1]
chr.col=rainbow(length(my.chr))
plot.col=chr.col[data$V1]
plot(data$V5, (data$V4/(data$V3-data$V2)), xlab="Position (bp)",
ylab="SNPs per bp", col=plot.col, type="h")

SNP Distance

This script calculates the distance between every pair of adjacent SNPs in the hapmap file. The output is useful for determining the average spacing of SNPs in a data set, or the likelihood of having a marker next to a gene in linkage (for GWAS).

SNP_distance.pl <input.hmp.txt> <output.txt>
	input.hmp.txt = An input file in hapmap format
	output.txt = An output file with a single column of distances

Example

The command line below will calculate all of the distances between adjacent SNPs in the same SAP hapmap file used in the previous example. The output will be saved in the file "dist.txt":

$ ./SNP_distance.pl SAP_allChr_v3.noMissing.hmp.txt dist.txt

Quick awk command line to get the average distance:

awk '{total += $1 } END { print total/NR }' dist.txt

Filter Sites

Simultaneously filters sites based on the fraction of missing data at each site and a minor allele frequency threshold.

filter_sites_hmp.pl <input.hmp> <output.hmp> <min_percent_ind> <MAF>
	input.hmp = The input file in hapmap format
	output.hmp = The name of the output file, which will also be hapmap format
	min_percent_ind = The minimum fraction of individuals needed to be non-missing to keep a given site (b/t 0 and 1)
	MAF = The minimum minor allele frequency to retain a site (b/t 0 and 1)

Example

The below example will remove sites with more than 20% of individuals missing, or with a MAF less than 10%:

$ ./filter_sites_hmp.pl SAP_allChr_v3.noMissing.hmp.txt out.hmp.txt 0.8 0.1

Heterozygosity for Each Individual

Calculate the fraction of heterozygous sites for every individual in a hapmap file. Useful for checking RIL or other crossing populations, or checking for possible cross contamination.

het_hmp.pl <input.hmp> <output.txt>
	input.hmp = The input file in hapmap format
	output.txt = A tab-delimited file with 2 columns:
		Column 1 contains the individual sample names
		Column 2 has the fraction of heterozygous sites for that individual

Example

Below is an example using the same SAP hapmap file as input, and outputting to a file called "het.txt"

$ ./het_hmp.pl SAP_allChr_v3.noMissing.hmp.txt het.txt

HapMap Subset

Get a subset of a hapmap file corresponding to a desired list of individuals (this script will still return all sites for only those individuals). This script is useful if you have a large number of individuals you want to extract, making the point-and-click filtering in the TASSEL GUI impractical.

hmpSubset.pl <input.hmp.txt> <output.hmp.txt> <list.txt>
	input.hmp.txt = An input file in the TASSEL hapmap format
	output.hmp.txt = The name of a hapmap format output file to create
	list.txt = A text file containing a list of individuals to extract, in a single column

Example

The below code will extract a list of the 3 individuals in "list.txt" from the SAP hapmap file, and output a new hapmap file to "subset.hmp":

$ cat list.txt 
pi17548
pi24969
pi276837

$ ./hmpSubset.pl SAP_allChr_v3.noMissing.hmp.txt subset.txt list.txt

Missing Data per Individual

Calculates the fraction of missing sites for each individual sample in a hapmap file. This can be helpful for determining cut-offs for filtering, and for comparing results pre- and post- imputation (or between SNPs called with different parameters in TASSEL).

missing_hmp.pl <input.hmp> <output.txt>
	input.hmp = The input file in hapmap format
	output.txt = A tab-delimited file with 2 columns:
		Column1: Individual/Sample Name
		Column2: The fraction of missing sites for that individual

In the example below, the script will calculate the fraction missing for every individual in the SAP hapmap file, and print the output to "missing.txt."

Example

$ ./missing_hmp.pl SAP_allChr_v3.noMissing.hmp.txt missing.txt

eacooper400 / hapmap_helpers Goto Github PK

hapmap_helpers's Introduction

Scripts for Performing various tasks with HapMap Files

SNP Density

Example

Plotting Results

SNP Distance

Example

Filter Sites

Example

Heterozygosity for Each Individual

Example

HapMap Subset

Example

Missing Data per Individual

Example

hapmap_helpers's People

Contributors

Stargazers

Watchers

Recommend Projects

Recommend Topics

Recommend Org