Git Product home page Git Product logo

hapmap_helpers's Introduction

Scripts for Performing various tasks with HapMap Files

HapMap files are a standard output format for SNPs generated by the program TASSEL. While the TASSEL GUI has many functions for filtering and analyzing the hapmap data, these scripts are meant to add some supplemental functions, and in some cases perform the same function but on a larger scale in the command line environment.

SNP Density

This script will calculate the number of SNPs found in a sliding window of a given size. These results can be used to calculate the average SNP density per window or for plotting variation in SNP density along the chromosome.

SNP_density.pl <input.hmp.txt> <output.txt> <window_size> <overlap>
		input.hmp.txt = A SNP file in tab-delimited format (like the hapmap format used by TASSEL)
		output.txt = The name of the output file to create
		window_size = The size of the window (in base pairs)
		overlap = The amount of overlap between windows (in base pairs)
Example

The command line below will calculate SNP density on the input file "SAP_v3_noMissing.hmp.txt," and return an output file called "out.txt". The SNPs will be counted in 1Mb sliding windows, with 100Kb overlap between windows.

The output file format has 4 columns:

  1. Chromosome
  2. Window Start Pos.
  3. Window End Pos.
  4. Number of SNPs
$ ./SNP_density.pl SAP_allChr_v3.noMissing.hmp.txt out.txt 1000000 100000
Plotting Results

Below is an example plot of SNP density along all 10 chromosomes in Sorghum bicolor, based on GBS data from the Sorghum Association Panel.

To generate a plot like this, you can use the R code below:

data = read.table("out.txt", header=FALSE)
my.chr = unique(data$V1)
chr.lengths = sapply(my.chr, function(x[-(length(x))], data) max(data[which(data$V1==x),3]), data=data)
for (i in 1:length(chr.lengths)) {
	chr.addon=append(chr.addon, (chr.addon[i]+chr.lengths[i]))
}
data$V5 = data$V2 + chr.addon[data$V1]
chr.col=rainbow(length(my.chr))
plot.col=chr.col[data$V1]
plot(data$V5, (data$V4/(data$V3-data$V2)), xlab="Position (bp)",
ylab="SNPs per bp", col=plot.col, type="h")

SNP Distance

This script calculates the distance between every pair of adjacent SNPs in the hapmap file. The output is useful for determining the average spacing of SNPs in a data set, or the likelihood of having a marker next to a gene in linkage (for GWAS).

SNP_distance.pl <input.hmp.txt> <output.txt>
	input.hmp.txt = An input file in hapmap format
	output.txt = An output file with a single column of distances
Example

The command line below will calculate all of the distances between adjacent SNPs in the same SAP hapmap file used in the previous example. The output will be saved in the file "dist.txt":

$ ./SNP_distance.pl SAP_allChr_v3.noMissing.hmp.txt dist.txt

Quick awk command line to get the average distance:

awk '{total += $1 } END { print total/NR }' dist.txt

Filter Sites

Simultaneously filters sites based on the fraction of missing data at each site and a minor allele frequency threshold.

filter_sites_hmp.pl <input.hmp> <output.hmp> <min_percent_ind> <MAF>
	input.hmp = The input file in hapmap format
	output.hmp = The name of the output file, which will also be hapmap format
	min_percent_ind = The minimum fraction of individuals needed to be non-missing to keep a given site (b/t 0 and 1)
	MAF = The minimum minor allele frequency to retain a site (b/t 0 and 1)
Example

The below example will remove sites with more than 20% of individuals missing, or with a MAF less than 10%:

$ ./filter_sites_hmp.pl SAP_allChr_v3.noMissing.hmp.txt out.hmp.txt 0.8 0.1

Heterozygosity for Each Individual

Calculate the fraction of heterozygous sites for every individual in a hapmap file. Useful for checking RIL or other crossing populations, or checking for possible cross contamination.

het_hmp.pl <input.hmp> <output.txt>
	input.hmp = The input file in hapmap format
	output.txt = A tab-delimited file with 2 columns:
		Column 1 contains the individual sample names
		Column 2 has the fraction of heterozygous sites for that individual
Example

Below is an example using the same SAP hapmap file as input, and outputting to a file called "het.txt"

$ ./het_hmp.pl SAP_allChr_v3.noMissing.hmp.txt het.txt

HapMap Subset

Get a subset of a hapmap file corresponding to a desired list of individuals (this script will still return all sites for only those individuals). This script is useful if you have a large number of individuals you want to extract, making the point-and-click filtering in the TASSEL GUI impractical.

hmpSubset.pl <input.hmp.txt> <output.hmp.txt> <list.txt>
	input.hmp.txt = An input file in the TASSEL hapmap format
	output.hmp.txt = The name of a hapmap format output file to create
	list.txt = A text file containing a list of individuals to extract, in a single column
Example

The below code will extract a list of the 3 individuals in "list.txt" from the SAP hapmap file, and output a new hapmap file to "subset.hmp":

$ cat list.txt 
pi17548
pi24969
pi276837

$ ./hmpSubset.pl SAP_allChr_v3.noMissing.hmp.txt subset.txt list.txt

Missing Data per Individual

Calculates the fraction of missing sites for each individual sample in a hapmap file. This can be helpful for determining cut-offs for filtering, and for comparing results pre- and post- imputation (or between SNPs called with different parameters in TASSEL).

missing_hmp.pl <input.hmp> <output.txt>
	input.hmp = The input file in hapmap format
	output.txt = A tab-delimited file with 2 columns:
		Column1: Individual/Sample Name
		Column2: The fraction of missing sites for that individual

In the example below, the script will calculate the fraction missing for every individual in the SAP hapmap file, and print the output to "missing.txt."

Example
$ ./missing_hmp.pl SAP_allChr_v3.noMissing.hmp.txt missing.txt

hapmap_helpers's People

Contributors

eacooper400 avatar

Stargazers

Ryan Disney avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.