imputation-ukb-ref-panel's Introduction

UK Biobank imputation pipelines

About

Genotype imputation is a computational technique for estimating missing genotypes in SNP array data using a reference panel of haplotypes. The same approach extends to low-coverage whole genome sequencing data, where it fills in missing genotypes and refines uncertain genotype calls derived from sequencing reads.

We provide two distinct pipelines for genotype imputation with the UK Biobank reference panel (>200,000 samples; 700M variants): one for SNP array data and one for low-coverage whole genome sequencing data. To keep the implementation cost-effective, the pipelines rely on efficient state-of-the-art tools: IMPUTE5 (Rubinacci et al., 2020) for SNP array imputation and GLIMPSE2 (Rubinacci et al., 2023) for low-coverage WGS imputation.

The pipelines accept either a multi-sample VCF/BCF file of SNP array genotypes or a set of low-coverage BAM/CRAM files. Imputation against the UK Biobank reference panel runs as applets and dx command jobs tailored to the UK Biobank Research Analysis Platform (UKB RAP). Each pipeline produces a single multi-sample BCF file per chromosome containing genotype posteriors, dosages, and phased best-guess genotypes. Additional outputs, such as haploid dosages, can be obtained by setting the appropriate options in the imputation software.
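
To illustrate how the three per-genotype outputs relate to each other, here is a minimal Python sketch for a single diploid sample at a biallelic site. It assumes the genotype posteriors are given in the order P(0/0), P(0/1), P(1/1); the function name `summarize_genotype` is hypothetical and is not part of the pipelines or of the imputation software.

```python
from typing import Sequence, Tuple


def summarize_genotype(gp: Sequence[float]) -> Tuple[float, int]:
    """Derive the dosage and best-guess genotype from genotype posteriors.

    Assumption: `gp` holds three posteriors for a diploid, biallelic site,
    ordered as P(0/0), P(0/1), P(1/1).
    Returns (expected alt-allele count, most likely alt-allele count).
    """
    if len(gp) != 3:
        raise ValueError("expected exactly three genotype posteriors")
    total = sum(gp)
    if total <= 0:
        raise ValueError("posteriors must sum to a positive value")

    # Renormalize to guard against rounding in the stored values.
    probs = [p / total for p in gp]

    # Dosage is the expected number of alternate alleles.
    dosage = probs[1] + 2.0 * probs[2]

    # The best-guess genotype is the one with the highest posterior.
    best_guess = max(range(3), key=lambda g: probs[g])
    return dosage, best_guess


if __name__ == "__main__":
    print(summarize_genotype([0.1, 0.7, 0.2]))
```

A dosage close to 0, 1, or 2 indicates a confident call, while intermediate values reflect uncertainty that the best-guess genotype alone hides.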

Website and tutorials

Tutorials on how to use the pipelines can be found at:

https://srubinacci.gitbook.io/uk-biobank-imputation-pipelines/

Citation

If you use the pipelines in your research work, please cite the following papers:

Reference panel

Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O. Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat Genet 55, 1243–1249 (2023).

Low-coverage WGS imputation

Rubinacci et al., Imputation of low-coverage sequencing data from 150,119 UK Biobank genomes. Nat Genet 55, 1088–1090 (2023).

SNP array imputation

Rubinacci et al., Genotype imputation using the Positional Burrows Wheeler Transform. PLoS Genet. 16, e1009049 (2020).

About the project

The UK Biobank imputation pipelines are developed by Simone Rubinacci & Olivier Delaneau.

License

The UK Biobank imputation pipelines are distributed under the MIT license.

imputation-ukb-ref-panel's Issues

How to interpret quality score outputs from low-pass WGS imputation on UKB

Thank you so much for this incredible imputation tool! After following the pipeline steps for low-pass WGS on UK Biobank data, I am wondering how to interpret the quality score returned in the BCF file. The INFO scores all appear to be 0.99 or 1; is this an expected result, and how are these INFO scores calculated?
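
For context on this question, one widely used per-variant quality measure is the IMPUTE-style INFO score, which compares the remaining uncertainty in the imputed genotypes with the variance expected from the allele frequency alone. The Python sketch below implements that definition from genotype posteriors. Whether the INFO field written by this pipeline follows exactly this definition is an assumption, and the function name `impute_style_info` is hypothetical.

```python
from typing import Iterable, Tuple


def impute_style_info(gps: Iterable[Tuple[float, float, float]]) -> float:
    """IMPUTE-style INFO score for one biallelic variant.

    Assumption: each element of `gps` is (P(0/0), P(0/1), P(1/1)) for one
    diploid sample. This is a generic illustration of the measure, not
    code taken from the pipeline.
    """
    n = 0
    sum_dosage = 0.0
    sum_variance = 0.0
    for _p0, p1, p2 in gps:
        e = p1 + 2.0 * p2        # expected alt-allele count
        f = p1 + 4.0 * p2        # expected squared alt-allele count
        sum_dosage += e
        sum_variance += f - e * e  # per-sample genotype variance
        n += 1
    if n == 0:
        raise ValueError("at least one sample is required")

    theta = sum_dosage / (2.0 * n)  # estimated alt-allele frequency
    if theta <= 0.0 or theta >= 1.0:
        # By convention the score is 1 for monomorphic sites.
        return 1.0
    return 1.0 - sum_variance / (2.0 * n * theta * (1.0 - theta))


if __name__ == "__main__":
    confident = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    uncertain = [(0.25, 0.5, 0.25)] * 3
    print(impute_style_info(confident), impute_style_info(uncertain))
```

Under this definition, a score near 1 means the posteriors are concentrated on single genotypes, and a score near 0 means they carry no more information than the allele frequency.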

AC/AN INFO fields in VCF are inconsistent with GT field, update the values in the VCF in chunk 000 and 022

Hi Simone,
I'm trying to set up your uk-biobank-imputation-pipelines/low-coverage-pipeline on DNAnexus in the US. Since I do not have access to the resources on the RAP, I'm pulling the genetic map from https://github.com/odelaneau/GLIMPSE/tree/master/maps/genetic_maps.b38 and the phased VCFs from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased, and then I remove multiallelic records.
Everything for step 2 (in https://srubinacci.gitbook.io/uk-biobank-imputation-pipelines/low-coverage-pipeline/running-the-low-coverage-pipeline, 'First-time usage - Binary reference panel representation') went smoothly for the autosomes, but for chr X two of the chunks, convert_ref_chrX_000 and convert_ref_chrX_022, are failing with the exact error in this discussion: odelaneau/GLIMPSE#189. Maybe it is related to chr X imputation and splitting the chromosome into PAR and non-PAR regions. If so, I can't seem to locate the BAM file with the non-PAR chr X reads downsampled to 1x. The link in https://odelaneau.github.io/GLIMPSE/glimpse1/tutorial_chrX.html that says "The data and the scripts for this tutorial can be downloaded HERE" is broken, and the GitHub page tutorial doesn't seem to include that BAM. I also downloaded GLIMPSE 2 and 1, and the tutorials there are the same as on the GitHub page. Any ideas where I could locate tutorial_chrX?
Also, how would I pass the merged VCF generated in step 3.3 to glimpse_split_reference? In the tutorial link, the merged VCF from step 3.3 is passed with the --input parameter of glimpse_phase (probably --input-gl in version 2), with the reference from step 2 as the reference. split_reference doesn't have an --input option. Should I just pass it as the reference?
Finally, since the pipeline must have been tested successfully for chr X with the resources made available on the RAP, I wonder whether the split VCFs could be hosted there. That might be the quickest resolution.
