
gap-qc

Quality Control for GSA-genotyped data on the UMCG HPC

Created by: Raul Aguirre-Gamboa, Esteban Lopera-Maya.
Contact information: [email protected].

Pipeline structure

Important note: this is a self-contained pipeline built specifically to work on the UMCG HPC. At present it is not fully automated to work on every kind of data and every kind of server, and many steps may be tailored to the original data alone. You are welcome, however, to take any individual step and/or code snippet and adapt it to your own conditions.

This pipeline is designed to work in three main steps, as indicated by the number at the start of each main script's name. The user should call the numbered scripts independently, one after the other. All scripts with the prefix "sub_" in their name are called automatically by the numbered scripts.
A feedback loop must be done manually by the user: after manually processing the output of step 2 (script 2.) to remove familial errors (these cannot be removed automatically), the resulting data should go through step 2 a second time. Some additional "sub_" scripts also carry the suffix "second_it". These should replace their counterparts in the second iteration of step 2.

Content

2.QC_autosomes_launch.sh: Main script. It launches all quality control steps and uses most of the required files and other reference files.

  • Transforms Oxford files into PLINK format
  • Loads tools and reference files for all steps
  • Corrects possible sample ID differences
  • Performs call rate filtering and removes/renames duplicated SNPs
  • Performs MAF and HWE filtering
  • Performs heterozygosity filtering (requires a change in the second iteration)
  • Identifies relatedness, identity by descent and duplicated samples (requires a change in the second iteration). Manual analysis of the results is recommended
  • Performs basic quality control for chromosome X (call rate and duplicate SNP filters only)
  • Imputes sex from genotype and performs a sex check
  • PCA analysis: merges the data with 1000 Genomes and GoNL and projects PCs based on these. It flags but does not remove non-European samples
  • Plots the output from all steps and performs external concordance analyses (requires external reference files)
  • Performs internal concordance analyses (requires internal reference files)
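
As an illustration of the call rate, MAF and HWE filters listed above, the following is a minimal Python sketch that computes the three per-SNP statistics from genotype counts. It is not the pipeline's own code: the function names `snp_qc_stats` and `passes_filters` are hypothetical, the HWE test used here is a simple 1-df chi-square approximation (the tools called by the pipeline may use an exact test instead), and the thresholds are left as required arguments because this README does not state the values the pipeline uses.

```python
from math import erfc, sqrt


def snp_qc_stats(n_aa, n_ab, n_bb, n_missing):
    """Return (call_rate, maf, hwe_p) for one biallelic SNP.

    n_aa, n_ab, n_bb: counts of the three called genotypes.
    n_missing: number of samples with no genotype call.
    """
    n_called = n_aa + n_ab + n_bb
    n_total = n_called + n_missing
    call_rate = n_called / n_total if n_total else 0.0
    if n_called == 0:
        return call_rate, 0.0, 1.0

    # Allele frequency of A, and the minor allele frequency.
    p = (2 * n_aa + n_ab) / (2 * n_called)
    q = 1.0 - p
    maf = min(p, q)
    if maf == 0.0:
        # Monomorphic SNP: HWE cannot be tested, report p = 1.
        return call_rate, maf, 1.0

    # Chi-square goodness of fit against Hardy-Weinberg proportions.
    expected = (p * p * n_called, 2 * p * q * n_called, q * q * n_called)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Survival function of a chi-square distribution with 1 degree of freedom.
    hwe_p = erfc(sqrt(chi2 / 2.0))
    return call_rate, maf, hwe_p


def passes_filters(stats, min_call_rate, min_maf, min_hwe_p):
    """Apply user-supplied thresholds (hypothetical helper)."""
    call_rate, maf, hwe_p = stats
    return call_rate >= min_call_rate and maf >= min_maf and hwe_p >= min_hwe_p


# A SNP in perfect Hardy-Weinberg proportions with no missing calls.
print(snp_qc_stats(25, 50, 25, 0))  # (1.0, 0.5, 1.0)
```

Keeping the statistics separate from the thresholds makes it easy to reuse the same function in the second iteration with different cut-offs.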

3.Founders_and_chrX.sh: This R script performs some additional steps

  • Takes as input the QCed chromosomes as well as chromosome X (with samples fully corrected for relatedness)
  • Uses the corrected familial information to create a founders-only dataset
  • Calculates founder stats
  • Calculates MAF and HWE filters for chromosome X using female founders only
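
The founder selection described above can be sketched as follows. This is a hedged Python illustration, not the pipeline's R code: it assumes the familial information is available in PLINK .fam layout (family ID, individual ID, paternal ID, maternal ID, sex, phenotype), treats a sample as a founder when both parental IDs are "0", and uses sex code "2" for females. The function name `founder_ids` is hypothetical.

```python
def founder_ids(fam_lines, females_only=False):
    """Return (family_id, individual_id) pairs for founders.

    fam_lines: iterable of whitespace-separated .fam rows.
    females_only: if True, keep only founders with sex code "2".
    """
    founders = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        fid, iid, father, mother, sex = fields[:5]
        # Assumption: a founder has no recorded parents in the dataset.
        is_founder = father == "0" and mother == "0"
        if is_founder and (not females_only or sex == "2"):
            founders.append((fid, iid))
    return founders


fam = [
    "F1 dad 0 0 1 -9",
    "F1 mom 0 0 2 -9",
    "F1 kid dad mom 2 -9",
]
print(founder_ids(fam))                     # [('F1', 'dad'), ('F1', 'mom')]
print(founder_ids(fam, females_only=True))  # [('F1', 'mom')]
```

The female-only option mirrors the chromosome X step, where MAF and HWE are evaluated on female founders.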

Pre-imputation steps: located in the folder . This process prepares the data for imputation with the HRC reference panel, following the steps indicated by the Sanger imputation server (https://imputation.sanger.ac.uk/).

  • Remove insertions and deletions (script with number 1)
  • Remove duplicated SNPs (SNPs at the same position) (script with number 2)
  • Transform to VCF and format for imputation; fix the reference and alternative alleles of each SNP (using the fixref plugin from BCFtools)

***Important note: post-imputation quality control steps are not included here. They consisted only of an internal concordance evaluation with internal biobank data. A later comparison of the resulting allele frequencies with HRC revealed ~5K SNPs with a wrong reference allele assignment. These need to be removed from the currently imputed data and can be found in the gearshift cluster at: "/groups/umcg-lifelines/prm03/releases/gsa_genotypes/v1/UGLI_imputed_outliers.list"
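
The first two pre-imputation steps above (removing indels and same-position SNPs) can be sketched in Python as below. This is an illustration under stated assumptions, not the numbered scripts themselves: it assumes variants are described in PLINK .bim layout (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2), treats an allele as an indel when it is longer than one base or coded as "I", "D" or "-", and keeps the first variant seen at each position. Which duplicate the pipeline actually keeps is not stated in this README. The function name `filter_bim_records` is hypothetical.

```python
INDEL_CODES = {"I", "D", "-"}


def filter_bim_records(bim_lines):
    """Return the IDs of variants that survive indel and duplicate removal."""
    kept_ids = []
    seen_positions = set()
    for line in bim_lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed rows
        chrom, snp_id, _cm, pos, a1, a2 = fields

        # Step 1: drop insertions and deletions.
        if any(len(a) > 1 or a.upper() in INDEL_CODES for a in (a1, a2)):
            continue

        # Step 2: drop SNPs sharing a position with one already kept
        # (assumption: the first occurrence is the one retained).
        key = (chrom, pos)
        if key in seen_positions:
            continue
        seen_positions.add(key)
        kept_ids.append(snp_id)
    return kept_ids


bim = [
    "1 rs1 0 100 A G",
    "1 rs2 0 100 A C",   # same position as rs1
    "1 rs3 0 200 AT A",  # indel
    "2 rs4 0 100 I D",   # indel in I/D coding
    "2 rs5 0 300 C T",
]
print(filter_bim_records(bim))  # ['rs1', 'rs5']
```

The returned ID list could then be passed to whatever tool performs the actual extraction before conversion to VCF.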

How to use (loosely): fill in the variables for the main script, change the SBATCH parameters according to your data size (e.g. 10GB and 30min are more than enough for 750k SNPs in 3000 samples) and then run the script. The original values for the variables can be found in the original_values.txt file.

Acknowledgements: This pipeline was developed with financial aid from UMCG-HAP2017 as part of the project, as well as the Conacyt Fellowship, which supported RAG, and the Colciencias Fellowship, which supported ELM.
