in progress
A Python package for analyzing Genome-Wide Association Summary Statistics (gwass). Currently, the package has two primary functions:
- Creating a table of SNPs with metadata about derived and ancestral alleles and their frequencies
- Cleaning up GWAS summary statistics and orienting the sign of effects to derived alleles
The package currently works only with 1000 Genomes Project data; this will be generalized in the future.
Dependencies:
- python3
- pysam
- numpy
- pandas
I recommend creating and activating a new conda python environment:
conda create -n <env_name> python
source activate <env_name>
Clone the repository and install the package into your Python environment:
git clone https://github.com/jhmarcus/gwass
cd gwass/
make install
Make sure unit tests are working:
make test
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
gunzip human_g1k_v37.fasta.gz
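The create-snps command below takes a .sites.biallelic.vcf.gz file, which is derived from the downloaded sites VCF with bcftools. This assumes bcftools is installed (TODO: add bcftools command)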
bcftools
wget http://portals.broadinstitute.org/collaboration/giant/images/0/01/GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz
Once installed, the package should add command-line executables to your path. One of these, create-snps, creates a SNPs table from the 1KG sites VCF:
create-snps --sites <path>/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.biallelic.vcf.gz \
--ref <path>/human_g1k_v37.fasta | \
gzip -c > 1kg_phase3_snps.tsv.gz
Note that this takes some time and produces a fairly large file. The script creates the file 1kg_phase3_snps.tsv.gz
with columns:
chrom
: chromosome number
pos
: reference genome position
snp
: rsID
effect_allele
: the allele to which the effect sign is oriented
other_allele
: the other allele, to which the effect sign is not oriented
ref_allele
: the allele in the reference genome
alt_allele
: the alternate allele to the reference genome
derived_allele
: the derived allele determined by multi-way alignment
ancestral_allele
: the ancestral allele determined by multi-way alignment
minor_allele
: the globally minor allele determined by 1KG Phase3
major_allele
: the globally major allele determined by 1KG Phase3
allele_type
: the type of the effect allele / other allele pair, either derived_ancestral or minor_major
ref_base_l2
: the reference base two positions to the left of the SNP
ref_base_l1
: the reference base one position to the left of the SNP
ref_base_r1
: the reference base one position to the right of the SNP
ref_base_r2
: the reference base two positions to the right of the SNP
f_sas
: effect allele frequency in 1KG Phase3 in South Asians
f_afr
: effect allele frequency in 1KG Phase3 in Africans
f_eas
: effect allele frequency in 1KG Phase3 in East Asians
f_eur
: effect allele frequency in 1KG Phase3 in Europeans
f_amr
: effect allele frequency in 1KG Phase3 in Americans
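Because the SNPs table is large, it can help to stream it row by row rather than load it all at once. The following is a minimal sketch using only the standard library. It is not part of gwass: the helper names iter_snps and mean_frequency are hypothetical, and the code assumes the file is tab-separated, starts with a header row holding the column names listed above, and encodes missing frequencies as an empty string, NA, or nan.

```python
import csv
import gzip


def iter_snps(path, chrom=None):
    """Yield one dict per SNP row, optionally restricted to one chromosome.

    Assumption: the file has a header row with the column names above.
    """
    # "rt" opens the gzip file in text mode so csv receives str, not bytes
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            # Values are read as strings, so compare against str(chrom)
            if chrom is None or row["chrom"] == str(chrom):
                yield row


def mean_frequency(path, column="f_eur", chrom=None):
    """Average one frequency column without holding the table in memory."""
    total, count = 0.0, 0
    for row in iter_snps(path, chrom=chrom):
        value = row.get(column, "")
        # Assumption: missing values appear as "", "NA" or "nan"
        if value in ("", "NA", "nan"):
            continue
        total += float(value)
        count += 1
    return total / count if count else None
```

For example, mean_frequency("1kg_phase3_snps.tsv.gz", column="f_afr", chrom=22) would scan the file once and keep only two running numbers in memory.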
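The clean-summary-statistics script cleans a GWAS summary statistics file against the SNPs table. Note that it needs a fair amount of memory when loading the whole SNPs file: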
clean-summary-statistics --summary_statistics <path>/GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz \
--snps <path>/1kg_phase3_snps.tsv.gz \
--out giant_height_summary_statistics
This will generate a file called giant_height_summary_statistics.tsv.gz
with the same columns as the SNP table and two additional columns:
beta_hat
: the estimated effect size via OLS
se
: the standard error of the estimated effect size
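To illustrate what orienting the sign of an effect to the derived allele means, here is a small sketch. It is an assumption about the general idea, not the code gwass actually runs: the function name orient_to_derived is hypothetical, strand flips are ignored, and SNPs whose reported alleles do not match the derived / ancestral pair are simply dropped (returned as None). Flipping the sign leaves the standard error unchanged, while the effect allele frequency becomes one minus the reported frequency.

```python
def orient_to_derived(beta, freq, effect_allele, other_allele,
                      derived_allele, ancestral_allele):
    """Return (beta, freq) expressed for the derived allele, or None.

    Hypothetical helper for illustration only; strand flips are not handled.
    """
    gwas = (effect_allele.upper(), other_allele.upper())
    derived = derived_allele.upper()
    ancestral = ancestral_allele.upper()

    if gwas == (derived, ancestral):
        # The reported effect allele is already the derived allele
        return beta, freq
    if gwas == (ancestral, derived):
        # The reported effect allele is ancestral: flip sign and frequency
        return -beta, 1.0 - freq
    # The alleles do not match the derived / ancestral pair
    return None
```

A caller would apply this to each row and keep se as reported, for example orient_to_derived(0.5, 0.25, "G", "A", "A", "G") for a SNP whose reported effect allele G is the ancestral allele.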