gwass

in progress

A Python package for analyzing Genome-Wide Association Summary Statistics (gwass). Currently, the package has two primary functions:

  • Creating a table of SNPs with meta-data about derived and ancestral alleles / frequencies
  • Cleaning up GWAS summary statistics and orienting the sign of effects to derived alleles

Currently, the package only works with 1000 Genomes Project data; this will be generalized in the future.

Prerequisites

  • python3
  • pysam
  • numpy
  • pandas

Installation

I recommend creating and activating a new conda Python environment:

conda create -n <env_name> python
source activate <env_name>

Clone the repository and install it into your Python environment:

git clone https://github.com/jhmarcus/gwass
cd gwass/
make install

Make sure unit tests are working:

make test

Demo

Download 1KG sites vcf and reference genome

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
gunzip human_g1k_v37.fasta.gz

Create sites vcf with only bi-allelic SNPs

Assumes bcftools is installed (TODO: add bcftools command)

bcftools
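
The bcftools command itself is still a TODO. Purely as an illustration of what "bi-allelic SNP" means here, the following is a minimal Python sketch of the filtering rule, not the bcftools invocation and not part of gwass. The function name `is_biallelic_snp` is hypothetical, and the rule (single-base REF and a single, single-base ALT) is an assumption about the intended filter.

```python
def is_biallelic_snp(vcf_line):
    """Return True if a VCF data line describes a bi-allelic SNP.

    Assumed rule: REF is one base and ALT is exactly one base.
    Multi-allelic records list several ALT alleles separated by commas,
    so they fail the single-base check, as do indels.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]  # VCF columns: CHROM POS ID REF ALT ...
    bases = {"A", "C", "G", "T"}
    return ref in bases and alt in bases


def filter_biallelic_snps(lines):
    """Yield header lines unchanged and only bi-allelic SNP records."""
    for line in lines:
        if line.startswith("#") or is_biallelic_snp(line):
            yield line
```

For the full 1KG sites file a dedicated tool such as bcftools will be much faster; the sketch only makes the selection criterion explicit.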

Download Height GWAS summary statistics

wget http://portals.broadinstitute.org/collaboration/giant/images/0/01/GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz

Create SNPs

Once installed, the package should add command-line executables to your PATH. One of these scripts creates a SNPs table from the 1KG sites vcf:

create-snps --sites <path>/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.biallelic.vcf.gz \
            --ref <path>/human_g1k_v37.fasta | \
            gzip -c > 1kg_phase3_snps.tsv.gz

Note that this will take some time and the output is a fairly large file. The script creates the file 1kg_phase3_snps.tsv.gz with the following columns:

  • chrom: chromosome number
  • pos: reference genome position
  • snp: rsID
  • effect_allele: the allele to which the effect sign is oriented
  • other_allele: the other allele, to which the effect sign is not oriented
  • ref_allele: the allele in the reference genome
  • alt_allele: the alternate allele to the reference genome
  • derived_allele: the derived allele determined by multi-way alignment
  • ancestral_allele: the ancestral allele determined by multi-way alignment
  • minor_allele: the globally minor allele determined by 1KG Phase3
  • major_allele: the globally major allele determined by 1KG Phase3
  • allele_type: the type of the effect allele / other allele which is either derived_ancestral or minor_major
  • ref_base_l2: the reference base two positions to the left of the SNP
  • ref_base_l1: the reference base one position to the left of the SNP
  • ref_base_r1: the reference base one position to the right of the SNP
  • ref_base_r2: the reference base two positions to the right of the SNP
  • f_sas: effect allele frequency in 1KG Phase3 in South Asians
  • f_afr: effect allele frequency in 1KG Phase3 in Africans
  • f_eas: effect allele frequency in 1KG Phase3 in East Asians
  • f_eur: effect allele frequency in 1KG Phase3 in Europeans
  • f_amr: effect allele frequency in 1KG Phase3 in Admixed Americans
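
To make a few of these columns concrete, here is a small sketch of how the flanking-base columns and the effect/other allele assignment could be computed. It is not the code in create-snps: the function names are hypothetical, and the fallback rule (use the minor allele as the effect allele when the ancestral allele is unknown) is an assumption inferred from the two allele_type values listed above.

```python
def flanking_bases(sequence, pos):
    """Return (l2, l1, r1, r2) around a 1-based position.

    Positions that fall off either end of the sequence are reported as 'N'
    (an assumption made for this sketch).
    """
    def base_at(p):
        return sequence[p - 1].upper() if 1 <= p <= len(sequence) else "N"

    return tuple(base_at(pos + offset) for offset in (-2, -1, 1, 2))


def choose_effect_allele(ref, alt, ancestral, alt_freq):
    """Return (effect_allele, other_allele, allele_type)."""
    if ancestral in (ref, alt):
        # Ancestral state known: the effect allele is the derived allele.
        derived = alt if ancestral == ref else ref
        return derived, ancestral, "derived_ancestral"
    # Assumed fallback: ancestral state unknown, orient to the minor allele.
    if alt_freq <= 0.5:
        return alt, ref, "minor_major"
    return ref, alt, "minor_major"
```

For example, `choose_effect_allele("A", "G", "A", 0.9)` treats G as derived because the ancestral allele matches the reference.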

Clean summary statistics

Note that this step needs a fair amount of memory when the whole SNPs file is loaded.

clean-summary-statistics --summary_statistics <path>/GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz \
                         --snps <path>/1kg_phase3_snps.tsv.gz \
                         --out giant_height_summary_statistics

This will generate a file called giant_height_summary_statistics.tsv.gz with the same columns as the SNP table and two additional columns:

  • beta_hat: the estimated effect size via OLS
  • se: the standard error of the estimated effect size
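
The sign orientation can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic of clean-summary-statistics: the function name `orient_beta` and its arguments are hypothetical, and strand flips or other mismatches are simply reported as `None` rather than resolved.

```python
def orient_beta(beta, gwas_effect, gwas_other, effect_allele, other_allele):
    """Orient a GWAS effect estimate to the SNP table's effect allele.

    Returns beta unchanged if the GWAS alleles already match the table,
    -beta if they are swapped, and None if the allele pairs do not match
    (for example a strand mismatch, which this sketch does not handle).
    """
    gwas = (gwas_effect.upper(), gwas_other.upper())
    table = (effect_allele.upper(), other_allele.upper())
    if gwas == table:
        return beta
    if gwas == table[::-1]:
        return -beta  # same SNP, effect reported for the other allele
    return None
```

The standard error does not depend on which allele is the effect allele, so only the sign of the estimate changes.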

gwass's Issues

(mostly aesthetic) requests

This is very useful. Here are some things that would be nice:

  1. Allow for unused columns in the csv file (this is more a pipeline issue but I'll put it here so everything is together)
  2. In the output, the "reference_allele" column should be renamed to "effect_allele", and the alt_allele column should be renamed to other_allele.
  3. In the reference csv, the "regression_type" column should be called something like "log_transform_beta". It should be yes if the raw data report an OR, and no if it is a linear regression or the raw data report log(OR).
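
The third request amounts to the rule sketched below. This is only an illustration of the proposed flag, assuming it means "take the natural log of the reported value"; the function name `to_beta` and the `is_odds_ratio` argument are hypothetical and do not exist in gwass.

```python
import math


def to_beta(value, is_odds_ratio):
    """Return an effect size on the log-odds or linear scale.

    If the raw file reports an odds ratio (flag "yes"), convert it with the
    natural log. If it reports a linear-regression effect or log(OR)
    (flag "no"), pass it through unchanged.
    """
    return math.log(value) if is_odds_ratio else value
```

An odds ratio of 1 maps to an effect of 0, so values above 1 become positive effects and values below 1 become negative ones.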

Add support for space-delimited files

Right now the raw summary statistics files are hardcoded to be '.tsv', so we have to preprocess files that aren't. It would be simple and nice to support other common file formats. We ran into issues because some files had mixed separators, but supporting a few other standard formats would help avoid manual preprocessing.
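
One possible approach to the mixed-separator case, shown here as a standard-library sketch rather than a patch to gwass, is to split each line on any run of tabs or spaces. The function names are hypothetical, and the sketch assumes no field contains embedded whitespace.

```python
import re


def split_fields(line):
    """Split a line on any run of tabs and/or spaces."""
    return re.split(r"[ \t]+", line.strip())


def read_table(lines):
    """Parse whitespace-delimited lines into a list of dicts keyed by header."""
    rows = [split_fields(line) for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]
```

All values are returned as strings; numeric conversion is left to the caller.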
