lh3 / bgt Goto Github PK

View Code? Open in Web Editor NEW

96.0 16.0 10.0 310 KB

Flexible genotype query among 30,000+ samples whole-genome

License: MIT License

C 70.41% TeX 21.95% Makefile 0.54% Go 3.63% Perl 0.27% Shell 0.33% JavaScript 2.36% TSQL 0.51%

bioinformatics genomics

bgt's Introduction

Getting Started

Connect to a public BGT server

curl -s 'http://bgtdemo.herokuapp.com/'
curl -s 'http://bgtdemo.herokuapp.com/?a=(impact=="HIGH")&s=(population=="FIN")&f=(AC>0)'
curl -s 'http://bgtdemo.herokuapp.com/?t=CHROM,POS,END,REF,ALT,AC/AN&f=(AC>1)&r=20'

For the last query, the last line is "*", indicating the result is incomplete. Note that this web app is using Heroku's free tier. It is restricted to one CPU only and put to sleep when the app is idle. There is an overhead of wakeup. Heroku also forces free apps to sleep for "6 hours in a 24 hour period". I don't know how exactly this works.

Run BGT locally

# Installation
git clone https://github.com/lh3/bgt.git
cd bgt; make
# Download demo BCF (1st 1Mbp of chr11 from 1000g), and convert to BGT
wget -O- http://bit.ly/BGTdemo | tar xf -
./bgt import 1kg11-1M.bgt 1kg11-1M.raw.bcf
gzip -dc 1kg11-1M.raw.samples.gz > 1kg11-1M.bgt.spl  # sample meta data
# Get all sample genotypes
./bgt view -C 1kg11-1M.bgt | less -S
# Get genotypes of HG00171 and HG00173 in region 11:100,000-200,000
./bgt view -s,HG00171,HG00173 -f'AC>0' -r 11:100000-200000 1kg11-1M.bgt
# Get alleles high-frequency in CEU but absent from YRI
./bgt view -s'population=="CEU"' -s'population=="YRI"' -f'AC1/AN1>=0.1&&AC2==0' -G 1kg11-1M.bgt
# Select high-impact sites (var annotation provided with -d)
./bgt view -d anno11-1M.fmf.gz -a'impact=="HIGH"' -CG 1kg11-1M.bgt

Set up your web server

# Compile the server; Go compiler required
make bgt-server
GOMAXPROCS=4 ./bgt-server -d anno11-1M.fmf.gz 1kg11-1M.bgt 2> server.log &
curl -s '0.0.0.0:8000' | less -S  # help
curl -s '0.0.0.0:8000/?a=(impact=="HIGH")&s=(population=="FIN")&f=(AC>0)'

Getting Started
Users' Guide
Further Notes
- Other genotype formats
- Performance evaluation

Users' Guide

BGT is a compact file format for efficiently storing and querying whole-genome genotypes of tens to hundreds of thousands of samples. It can be considered as an alternative to genotype-only BCFv2. BGT is more compact in size, more efficient to process, and more flexible on query.

BGT comes with a command line tool and a web application which largely mirrors the command line uses. The tool supports expressive and powerful query syntax. The "Getting Started" section shows a few examples.

1. Data model overview

BGT models a genotype data set as a matrix of genotypes with rows indexed by site and columns by sample. Each BGT database keeps a genetype matrix and a sample annotation file. Site annotations are kept in a separate file which is intended to be shared across multiple BGT databases. This model is different from VCF in that VCF 1) keeps sample information in the header and 2) stores site annotations in INFO along with genotypes which are not meant to be shared across VCFs.

2. Import

A BGT database always has a genotype matrix and sample names, which are acquired from VCF/BCF. Site annotations and sample phenotypes are optional but are recommended. Flexible meta data query is a distinguishing feature of BGT.

2.1 Import genotypes

# Import BCFv2
bgt import prefix.bgt in.bcf
# Import VCF with "##contig" header lines
bgt import -S prefix.bgt in.vcf.gz
# Import VCF without "##contig" header lines
bgt import -St ref.fa.fai prefix.bgt in.vcf.gz

During import, BGT separates multiple alleles on one VCF line. It discards all INFO fields and FORMAT fields except GT. See section 2.3 about how to use variant annotations with BGT.

2.2 Import sample phenotypes

After importing VCF/BCF, BGT generates prefix.bgt.spl text file, which for now only has one column of sample names. You can add pheotype data to this file in a format like (fields TAB-delimited):

sample1   gender:Z:M    height:f:1.73     region:Z:WestEurasia     foo:i:10
sample2   gender:Z:F    height:f:1.64     region:Z:WestEurasia     bar:i:20

where each meta annotation takes a format key:type:value with type being Z for a string, f for a real number and i for an integer. We call this format Flat Metadata Format or FMF in brief. You can get samples matching certain conditions with:

bgt fmf prefix.bgt.spl 'height>1.7&&region=="WestEurasia"'
bgt fmf prefix.bgt.spl 'mass/height**2>25&&region=="WestEurasia"'

You can most common arithmetic and logical operators in the condition.

2.3 Import site annotations

Site annotations are also kept in a FMF file like:

11:209621:1:T  effect:Z:missense_variant   gene:Z:RIC8A  CCDS:Z:CCDS7690.1  CDSpos:i:347
11:209706:1:T  effect:Z:synonymous_variant gene:Z:RIC8A  CCDS:Z:CCDS7690.1  CDSpos:i:432

We provide a script misc/vep2fmf.pl to convert the VEP output with the --pick option to FMF.

Note that due to an implementation limitation, we recommend to use a subset of "important" variants with BGT, for example:

gzip -dc vep-all.fmf.gz | grep -v "effect:Z:MODIFIER" | gzip > vep-important.fmf.gz

Using the full set of variants is fine, but is much slower with the current implementation.

3. Query

A BGT query is composed of output and conditions. The output is VCF by default or can be a TAB-delimited table if requsted. Conditions include genotype-independent site selection with option -r and -a (e.g. variants in a region), genotype-independent sample selection with option -s (e.g. a list of samples), and genotype-dependent site selection with option -f (e.g. allele frequency among selected samples above a threshold). BGT has limited support of genotype-dependent sample selection (e.g. samples having an allele).

BGT has an important concept "sample group". On the command line, each option -s creates a sample group. The #-th option -s populates a pair of AC# and AN# aggregate variables. These variables can be used in output or genotype-dependent site selection.

3.1 Genotype-independent site selection

# Select by a region
bgt view -r 11:100,000-200,000 1kg11-1M.bgt > out.vcf
# Select by regions in a BED (BGT will read through the entire BGT)
bgt view -B regions.bed 1kg11-1M.bgt > out.vcf
# Select a list of alleles (if on same chr, use random access)
bgt view -a,11:151344:1:G,11:110992:AACTT:A,11:160513::G 1kg11-1M.bgt
# Select by annotations (-d specifies the site annotation database)
bgt view -d anno11-1M.fmf.gz -a'impact=="HIGH"' -CG 1kg11-1M.bgt

It should be noted that in the last command line, BGT will read through the entire annotation file to find the list of matching alleles. It may take several minutes if the site annotation files contains 100 million lines. That is why we recommend to use a subset of important alleles (section 2.3).

3.2 Genotype-independent sample selection

# Select a list of samples
bgt view -s,HG00171,HG00173 1kg11-1M.bgt
# Select by phenotypes (see also section 2.2)
bgt view -s'population=="CEU"' 1kg11-1M.bgt
# Create sample groups (there will be AC1/AN1 and AC2/AN2 in VCF INFO)
bgt view -s'population=="CEU"' -s'population=="YRI"' -G 1kg11-1M.bgt

3.3 Genotype-dependent site selection

# Select by allele frequency
bgt view -f'AN>0&&AC/AN>.05' 1kg11-1M.bgt
# Select by group frequnecy
bgt view -s'population=="CEU"' -s'population=="YRI"' -f'AC1>10&&AC2==0' -G 1kg11-1M.bgt

Of course, we can mix all the three types of conditions in one command line:

bgt view -G -s'population=="CEU"' -s'population=="YRI"' -f'AC1/AN1>.1&&AC2==0' \
         -r 11:100,000-500,000 -d anno11-1M.fmf.gz -a'CDSpos>0' 1kg11-1M.bgt

3.4 Tabular output

# Output position, sequence and allele counts
bgt view -t CHROM,POS,REF,ALT,AC1,AC2 -s'population=="CEU"' -s'population=="YRI"' 1kg11-1M.bgt

3.5 Miscellaneous output

# Get samples having a set of alleles (option -S)
bgt view -S -a,11:151344:1:G,11:110992:AACTT:A,11:160513::G -s'population=="CEU"' 1kg11-1M.bgt
# Count haplotypes
bgt view -Hd anno11-1M.fmf.gz -a'gene=="SIRT3"' -f 'AC/AN>.01' 1kg11-1M.bgt
# Count haplotypes in multiple populations
bgt view -Hd anno11-1M.fmf.gz -a'gene=="SIRT3"' -f 'AC/AN>.01' \
         -s'region=="Africa"' -s'region=="EastAsia"' 1kg11-1M.bgt

4. BGT server

In addition to a command line tool, we also provide a prototype web application for genotype query. The query syntax is similar to bgt view as is shown in "Getting Started", but with some notable differences:

The server uses .and. for the logical AND operator && (as & is a special character to HTML).
The server can't load a list of samples from a local file (for security).
The server doesn't support BCF output for now (can be implemented on request).
The server doesn't output genotypes by default (option g required for server).
The server loads site annotations into RAM (for real-time response but requiring more memory).
By default (tunable), the server processes up to 10 million genotypes and then truncates the result.
The server may forbid the output of genotypes of some samples (see below).

4.1 Privacy

The BGT server implements a simple mechanism to keep the privacy of samples or a subset of samples. It is controlled by a single parameter: minimal sample group size or MGS. The server refuses to create a sample group if the size of the group is smaller than the MGS of one of the samples in the group. In particular, if MGS is above one, the server doesn't report sample name or sample genotypes. Each sample may have a different MGS as is marked by the _mgs integer tag in prefix.bgt.spl. For samples without this tag, a default MGS is applied.

Further Notes

Other genotype formats

BGT vs PBWT. BGT uses the same data structure as PBWT and is inspired by PBWT. PBWT supports advanced query such as haplotype matching, phasing and imputation, while BGT puts more emphasis on fast random access and data retrieval.
BGT vs BCF2. BCF is more versatile. It is able to keep per-genotype meta information (e.g. per-genotype read depth and genotype likelihood). BGT is generally more efficient and times smaller. It scales better to many samples. BGT also supports more flexible queries, although technically, nothing prevents us from implementing similar functionalities on top of BCF.
BGT vs GQT. GQT should be much faster on traversing sites across whole chromosomes without considering LD. It is however inefficient to retrieve data in small regions or to get haplotype information due to its design. For this reason, GQT is regarded as a complement to BCF or BGT, not a replacement. On file size, GQT is usually larger than genotype-only BCF and is thus larger than BGT.

Performance evaluation

The test is run on the first release of Haplotype Reference Consortium (HRC) data. There are ~39 million phased SNPs in 32,488 samples. We have generated the BGT for the entire dataset, but We are only running tools in region chr11:10,000,001-20,000,000. The following table shows the time and command line. Note that the table omits option -r 11:10,000,001-20,000,000 which has been applied to all command lines below.

Time	Command line
11s	bgt view -G HRC-r1.bgt
13s	bcftools view -Gu HRC-r1.bcf
30s	bgt view -GC HRC-r1.bgt
4s	bgt view -GC -s'source=="1000G"'
19s	bcftools view -Gu -S 1000G.txt HRC-r1.bcf
8s	bgt view -G -s 'source=="UK10K"' -s 'source=="1000G"&&population!="GBK"'

On file sizes, the BGT database for HRC-r1 is 7.4GB (1GB=1024*1024*1024 bytes). In comparison, BCFv2 for the same data takes 65GB, GQT 93GB and PBWT 4.4GB. BGT and PBWT, which are based on the same data structure, are much more compact. BGT is larger than PBWT primarily because BGT keeps an extra bit per haplotype to distinguish reference and multi allele, and stores markers to enable fast random access.

bgt's People

Contributors

Stargazers

Watchers

Forkers

joshuashen zengfengbo yhgenomics mcshane druvus cyang-2014 asure89 scchess weihongx eaglewillard

bgt's Issues

[W::vcf_parse1] can't find '10' in the sequence dictionary

HI!
when i try to run compress order:
./bgt import -S -o archive input.vcf
Error occured

Compatibility with GLNexus joint called VCFs

It seems that BGT does not work with joint called variants by DeepVariant/GLNexus process, when run import, it gives this error:

[W::vcf_parse1] FORMAT 'RNC' is not defined in the header

I understand that this is probably caused by the new "RNC" field by GLNexus, but is there any possible walk-around solutions?

Tests

Do you have test scripts for bgt?

Handling large annotation sets

Could you provide efficient query across the annotations using a FM-index over the concatenated annotation strings from the VCF file? A second compressed bitvector could encode variant annotation starts in this record (basically storing a variant to annotation mapping).

Then you could subset to a given set of records with a particular annotation by finding the ranks of the occurrences of a given pattern in the auxiliary bitvector.

I guess this wouldn't help much when you have to compare floats in the annotations and the annotation is included in all records. Then you end up needing to compare lots of values to execute the query. There might also be a way around this though.

merging bgt files split by chromosome into one

Is it possible to merge bgt files after bgt indexing or should that be done before?

Server addresses in the example

In the BGT example, on my server, 0.0.0.0 does not work:

curl -s '0.0.0.0:8000' | less -S  # help
curl -s '0.0.0.0:8000/?a=(impact=="HIGH")&s=(population=="FIN")&f=(AC>0)'

I switched it to 127.0.0.1 and then the example worked.

Unable to find insertions

I am trying to find insertions using the -a,#variant# framework but I can't get it to return anything. Deletions and SNPs work fine with this format, but insertions always return null. I've tried many variations of the formatting. I even tried to find insertions using the variant IDs generated by the vep2sql.js (using the -f flag) script, but even these return null. Searching for these same variants using the -r region flag works perfectly, so I know they exist in the database. Any help? I don't think its an error with my formatting and it seems limited to insertions.

For example:

This is a variant identified by VEP running the 1kg11-1M.raw.bcf with the following command:
"variant_effect_predictor.pl -i 1kg11-1M.raw.vcf -o 1kg11-1M.raw.vcf.vep --refseq --dir /home/ec2-user/.vep --offline --pick --cache --everything --force_overwrite"
The output was processed through vep2sql.js:
"k8-linux bgt/misc/vep2sql.js -f 1kg11-1M.raw.vcf.vep > 1kg11-1M.raw.vcf.vep.fmf"

Running "bgt view -t CHROM,POS,REF,ALT -a,11:607967:0:CTCCGA 1kg11-1M.bgt" (the allele found in the 1kg11-1M.raw.vcf.vep.fmf file) returns nothing.

However, running "bgt view -t CHROM,POS,REF,ALT -a,11:608013:1:A 1kg11-1M.bgt" (the next allele on the list) does work.

Finding the insertion via searching the region does function and finds the variant.

bgt view -t CHROM,POS,REF,ALT -r 11:607960-607970 1kg11-1M.bgt

BGT missing SNPs

Running BCFTOOLS and BGT for the same region, BCFTOOLS showed two SNPs but BGT (bgt built from the same vcd.gz) returned none. Not sure how widespread this problem would be. Could someone take a look at this?

Examples like following:

bcftools view -r 05:67948215-67948219 xxx.vcf.gz | grep -v "#" | wc -l
2
bgt view -r 05:67948215-67948219 -s sample.list xxx.bgt | grep -v "#" | wc -l
0

sample.list includes all sample names from the VCF header.

annotation of output

Dear Li,

Since we have variant annotation (-d variantannot.fmf.gz option) and also the -a for query where we can select variants based on these annotations, it would be nice to have option(s) to include these annotations in the output (INFO field for VCF and also in the TAB format).

For example if we annotate variants by gnomAD frequency, we can query for it by the given rules, however when we have more complex queries we only know that the output comply to these variant selection criteria, however we don't know the exact values from the output. As the FMF syntax already have most the information (the type, the name, and also we know we have 1 per variant entry, we only lack a sensible Description however the name are usually self descriptive so it could be used. Guess it would be an overkill to have a separate table for descriptions for each FMF ID) to auto generate INPUT tags it would be nice if there would be an option to
#1) include variant annotations that were used in the -a query for filtering
#2) option to include all variant annotations (regardless of query)
#3) option to include a preset of selected annotations (regardless of query)

Since we have TAB format, it would be also nice to be able to reference these fields there too ie the -t option would work the fixed POS,CHROM, etc, plus these variant annotation fields (so we could list IMPACT, GENE, what else) also in the TAB format.

Zoltan

FMF query

Just opening the bgt-server without queries we get INFO on config parameters and the queryable sample/variant annotations from the FMF files: XXX.bgt.spl for samples, and from the annot.XXX.fmf.gz that was provided by the -d option).

However there is no function to see what are the range in INT, or FLOAT fields or what are the queryable values in case of STRING fields.

Technically we usually know it or could document it based on our databses, however It would be nice to have some function to query these on the fly using the bgt fmf command (and/or with some new flags in the bgt-server). This would allow that users who have no access to the server could explore their options.

Regards,
Zoltan

getting private variants from an individual

I'm trying to get variants seen in only one sample X in the cohort.

I'm trying:

bgt view -s,X -f 'AC1>0&&AC<=2' -G cohort.bgt

It's not working. How can I achieve this?

query based on file contents

Would be good to be able to reference files in queries. e.g. in bcftools, one can do something like

ID=@/path/to/file .. select lines with ID present in the file
ID!=@/path/to/file .. skip lines with ID present in the file

compilation issue

I guess I'm a little too quick off the mark, but is something missing from the Makefile?

make: *** No rule to make target bgtview.o', needed by bgt'. Stop.

Also, extracting k8 as in the instructions overwrites the bgt README.md with the k8 README.md

BGT issue with Multi Allelic Variant Sites

Hello.
As a test I ran BGT on chrX of 1KGP3 (available from the FTP link below) ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1c.20130502.genotypes.vcf.gz

The commands I used were (I converted the vcf.gz file above to BCF with bcftools to save space) :

bgt import chrX.bcf out/chrX
bgt view -b out/chrX.bgt > out/chrX.bcf

When I compared the output file with the input file with bcftools view, the first variant is strangely split.

Expected (only first sample GT shown because of size...) :

X       60020   .       T       TA,TAAC 100     PASS    AC=10,92;AF=0.00199681,0.0183706;AN=5008;NS=2504;DP=11848;AMR_AF=0.0029,0.0086;AFR_AF=0.0008,0.0635;EUR_AF=0,0.002;SAS_AF=0.0031,0;EAS_AF=0.004,0;VT=INDEL;MULTI_ALLELIC        GT      0|0 ...

And I got two variants :

X       60020   .       T       TA,<M>   0     .      .        GT      0/0 ...
X       60020   .       T       TAAC,<M> 0     .      .        GT      0/0 ...

I understand the the multi-allelic site has been split however what is strange is that in both lines I get genotype values between 0, 1, and 2.

So how to interpret when for example 2/0 occurs in the first line or in the second line ?

Thanks.
Best
Rick

Import VCF into existing bgt file

Is there any way to import a VCF containing additional samples into an existing bgt file?

I've tried bgt import -S prefix.bgt 1.vcf followed by bgt import -S prefix.bgt 2.vcf, but this only seems to overwrite the existing file.

freebayes vcfs (1.0-r282 and 1.0-r265)

I cannot get freebayes vcf to import on 1.0-r282 at all

I get bgt: atomic.c:111: bcf_atomize: Assertion `i < b->n_info' failed when using bcfs that generated by freebayes --report-monomorphic and a seg fault on default vcfs of reporting only polymorphic (non-reference) sites.

In the release version of 1.0-r265, it works but not for monomorphic

The default freebayes of only polymorphic sites works in this version, but once again I get bgt: atomic.c:111: bcf_atomize: Assertion `i < b->n_info' using --report-monomorphic.

Let me know if you want backtraces, variable calls, or example files.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.