Git Product home page Git Product logo

gwas's Introduction

From SNPs to phenotype

A pipeline to relate Single Nucleotide Polymorphisms (SNPs) to a continuous phenotype using either a Random Forest approach from the MUVR package or a canonical GWAS pipeline from RAINBOWR.

Both methods will perform a Genome Wide Analysis (GWAS) using genetic variant (Variant Call Format) and phenotype information.

PLINK

mydata is the prefix of the .ped and .map files.

plink --file mydata --pheno pheno.txt --pheno-name Phenotype --perm --threads 20 --assoc 

1. Inputs and outputs

Both MUVR Random Forest and RAINBOWR methods use the same input files (VCF and phenotype). They also have similar outputs (table of significant SNPs) with some disctint graphs and tables.

Inputs

VCF file

A Variant Call Format (VCF) file that contains the nucleotidic variations is needed. The VCF file has to be of version 4.0 and higher. It will be converted to a genotype matrix using vcfR where genotypes are represented by:

  • "0" for genome positions homozygous for the reference allele.
  • "2" for genome positions homozygous for the alternative allele.
  • "1" for genome positions heterozygous (one reference allele, one alternative allele).

Some useful tools to work with VCF files:

VCFtools can be used to select split the VCF file into chromosomes: vcftools --gzvcf <input_file.vcf.gz> --chr 1

Phenotype file

A tabulated separated file containing two columns:

  1. The identifier for each individual. Column name should be id.
  2. A phenotype column that contains the phenotypic values. Column name should be phenotype

Common outputs

Table of most important SNPs: a table of the most important SNPs related to the phenotype along with their p-values.

Distinct outputs

Random Forest

  • Plot of the model Q2: a plot of the model Q2 metric compared to a distribution of random Q2 values obtained using N permutations (e.g. N = 100).

GWAS RAINBOWR

  • Manhattan plot: a plot of the SNP chromosome position (x-axis) against the $$-log_{10}$$ of the computed p-value of each SNP.

2. Installation

Install RStudio

If not already done, install R (version >= 3.5) and RStudio for your Operating System.

Clone the repository

In the Shell, type git clone https://github.com/SilkeAllmannLab/gwas.git

๐ŸŽ‰ ๐ŸŽŠ That's it! ๐ŸŽ‰ ๐ŸŽŠ

3. Test

Example datasets

All input data are to be found in data/.

VCF dataset

The VCF data from a study from Arouisse et al. 2019.
The VCF file itself can be downloaded directly from FigShare (file is called "Arabidopsis_2029_Maf001_Filter80").

Number of SNPs per chromosome:

chrom # of SNPs
chr01 734,401
chr02 513,750
chr03 616,986
chr04 496,272
chr05 661,335

Several files were then generated from this initial big VCF file:

  • Chromosome VCF files: for instance, for chromosome 1,chr01.Arabidopsis_2029_Maf001_Filter80.vcf.gz
  • Randomly subsampled chromosome VCF files: 10% or more of the initial SNPs were subsampled and extracted on a per-chromosome basis. For instance, for chromosome 1,chr01.subsampled_10_percent.Arabidopsis_2029_Maf001_Filter80.vcf.gz
  • Test file: a small subset available for testing purposes. It can be found at data/Arabidopsis_2029_Maf001_Filter80.1000lines.vcf.

Examples of code used:

To subsample 10% of the original VCF file restricted to chromosome 4 (from ~490,000 variants to 49,000):

gzip -c -d VCF_chr04.recode.vcf.gz |tail -n +16|shuf -n 49000 > chr04.49k_SNPs.no_header.vcf \
  && cat VCF.header.txt chr04.49k_SNPs.no_header.vcf > chr04.49k_SNPs.vcf \
  && gzip chr04.49k_SNPs.vcf \
  && rm  chr04.49k_SNPs.no_header.vcf 

To create one VCF file for each Arabidopsis chromosome:

for i in {1..5}; 
  do echo "working on chromosome $i";
  vcftools  --gzvcf Arabidopsis_2029_Maf001_Filter80.vcf.gz  --chr $i  --recode --recode-INFO-all --out  VCF_chr0$i; 
done

Phenotypes

A root_data_fid_and_names.tsv file contains the genotype line identifier and the phenotypic values. Arabidopsis ecotypes were treated with 2-E-hexanal and their main root length measured. A response ratio was then calculated for each of the ecotypes by comparing the 2-E-hexanal treatment with a mock (methanol).

Test runs

Random Forest

  • Open a new Shell window.
  • Execute the random forest script: Rscript random_forest/gwas.R --help to see the full list of arguments.

RAINBOWR GWAS

  • Open RStudio
  • Run the rainbowr/rainbowr.R script. Make sure you specify a valid VCF file path and phenotype file.

4. References

โœ๏ธ Authors

  • Marc Galland (@mgalland)
  • Martha van Os (@MvanOs)
  • Machiel Clige (@MCligge)

vcfR

The R package vcfR is heavily described here.

RAINBOWR

The RAINBOWR package for R is described here.

5. Troubleshooting

RAINBOWR package

Sometimes, you might have issues with the installation of the Rccp package which is a dependency of RAINBOWR.

In the Shell, try: apt-get install r-base-dev

6. Useful links

There are 4 important parameters to set when building an RF model:

  • nTree: Number of trees in the forest

  • mTry: Number of variants evaluated at each node of a tree

  • maxD: Maximum depth of a tree to grow

  • minNS: Minimum number of samples in a node to be processed

gwas's People

Contributors

mgalland avatar mcligge avatar mvanos avatar

Stargazers

 avatar qianche avatar zy avatar  avatar  avatar

Watchers

James Cloos avatar  avatar  avatar  avatar

Forkers

mvanos

gwas's Issues

Upload a script to convert VCF into HapMap

The main input of the pipeline will be in the HapMap format. Therefore it is useful to be able to convert the Variant Calling Format (VCF) that is often available into HapMap.
It will serve as input to the pipeline.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.