bystrogenomics / bystro

Bystro genetic analysis (annotation, filtering, statistics)

License: Apache License 2.0

Perl 43.34% Shell 1.75% Dockerfile 0.23% Rust 2.64% Python 48.52% Cython 0.07% Makefile 0.06% Go 3.40%

genomics genomics-search bioinformatics bioinformatics-pipeline bioinformatics-analysis bioinformatics-databases bioinformatics-algorithms bioinformatics-scripts

bystro's Introduction

Bystro

TL;DR: 1,000x+ faster than VEP, with more complete annotation and online search (https://bystro.io) for datasets of up to 47TB (compressed) online, or petabytes offline.

Bystro Publication

For datasets and scripts used, please visit github.com/bystro-paper

If using Bystro, please cite Kotlar et al, Genome Biology, 2018

Web Tutorial

Start here: TUTORIAL.md

For most users, we recommend https://bystro.io .

The web app gives full access to all of Bystro's capabilities, provides a convenient search/filtering interface, supports large data sets (tested up to 890GB uncompressed/129GB compressed), and has excellent performance.

Installing Bystro

The easiest way is to run from Docker: docker pull akotlar/bystro:latest && docker run akotlar/bystro:latest bystro-annotate.pl

Please read: INSTALL.md for instructions on how to download and use Bystro hg19/hg38/etc databases.

Bystro relies on pluggable (via Bystro's YAML config) pre-processors to normalize variant inputs (dealing with VCF issues such as padding), calculate whether a site is a transition or transversion, calculate sample maf, identify hets/homozygotes/missing samples, calculate heterozygosity, homozygosity, missingness, and more.

VCF format: Bystro-Vcf
SNP format: Bystro-SNP
Create your own to support other formats!
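
To make the per-site quantities named above concrete, here is a minimal Python sketch. It is not Bystro's implementation: the function names, the input encoding (alt-allele counts of 0, 1, 2, or None for a missing call), and the choice of denominators are all assumptions made for illustration, and the frequency returned as sampleMaf is simply the alt-allele frequency among called samples.

```python
from typing import Optional, Sequence

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref: str, alt: str) -> Optional[str]:
    """Return 'transition', 'transversion', or None if the input is not a SNP."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        return None
    # A transition stays within one chemical class (purine or pyrimidine)
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def site_stats(genotypes: Sequence[Optional[int]]) -> dict:
    """Summarize diploid calls given as alt-allele counts (0, 1, 2) or None."""
    n = len(genotypes)
    called = [g for g in genotypes if g is not None]
    hets = sum(1 for g in called if g == 1)
    homs = sum(1 for g in called if g == 2)
    allele_number = 2 * len(called)  # assumes every called sample is diploid
    return {
        "heterozygosity": hets / n if n else 0.0,
        "homozygosity": homs / n if n else 0.0,
        "missingness": (n - len(called)) / n if n else 0.0,
        "sampleMaf": (hets + 2 * homs) / allele_number if allele_number else 0.0,
    }
```

For example, `site_stats([0, 1, 2, None])` describes four samples: one heterozygote, one homozygote, and one missing call.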

Annotation (Output) Field Descriptions

Please read FIELDS.md

The Bystro configuration file

The config file describes the state of both the database and the annotation. It is required for both annotating and building.

It has several keys:

tracks: The highest level organization for database values. Tracks have a name property, which must be unique, and a type, which must be one of:

sparse: A bed file, or any file that can be mapped to chrom, chromStart, and chromEnd columns.
- This is used for dbSNP and Clinvar records, but many files can be fit to this format.
- Mapping fields can be managed by the fieldMap key
score: A wigFix file.
- Used for phastCons, phyloP
cadd:
- A CADD file, or Bystro's custom "bed-like" CADD file, which has 2 header lines, and chrom, chromStart, chromEnd columns, followed by standard CADD fields
- CADD format: http://cadd.gs.washington.edu

gene: A UCSC gene track table (ex: knownGene, refGene, sgdGene) stored as a tab separated output, with column names as columns. Conversion from SQL to the expected tab-delimited format is controlled by bin/bystro-utils.pl, which will automatically fetch the requested sql, and generate the tab-delimited output.

For instance: For a config file that has the following track

```
chromosomes:
  - chr1
tracks:
  tracks:
  - name: refSeq
    type: gene
    utils:
    - args:
        connection:
          database: hg19
        sql: SELECT r.*, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.kgID, '')) SEPARATOR
          ';') FROM kgXref x WHERE x.refseq=r.name) AS kgID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.description,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS description,
          (SELECT GROUP_CONCAT(DISTINCT(NULLIF(e.value, '')) SEPARATOR ';') FROM knownToEnsembl
          e JOIN kgXref x ON x.kgID = e.name WHERE x.refseq = r.name) AS ensemblID,
          (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.tRnaName, '')) SEPARATOR ';') FROM
          kgXref x WHERE x.refseq=r.name) AS tRnaName, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.spID,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS spID, (SELECT
          GROUP_CONCAT(DISTINCT(NULLIF(x.spDisplayID, '')) SEPARATOR ';') FROM kgXref
          x WHERE x.refseq=r.name) AS spDisplayID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.protAcc,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS protAcc, (SELECT
          GROUP_CONCAT(DISTINCT(NULLIF(x.mRNA, '')) SEPARATOR ';') FROM kgXref x WHERE
          x.refseq=r.name) AS mRNA, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.rfamAcc,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS rfamAcc FROM
          refGene r WHERE chrom=%chromosomes%;
```

Running bin/bystro-utils.pl --config <path/to/this/config> will result in the following config:

```
chromosomes:
  - chr1
tracks:
  tracks:
  - name: refSeq
    type: gene
    local_files:
      - hg19.kgXref.chr1.gz
    utils:
    - args:
        connection:
          database: hg19
        sql: SELECT r.*, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.kgID, '')) SEPARATOR
          ';') FROM kgXref x WHERE x.refseq=r.name) AS kgID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.description,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS description,
          (SELECT GROUP_CONCAT(DISTINCT(NULLIF(e.value, '')) SEPARATOR ';') FROM knownToEnsembl
          e JOIN kgXref x ON x.kgID = e.name WHERE x.refseq = r.name) AS ensemblID,
          (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.tRnaName, '')) SEPARATOR ';') FROM
          kgXref x WHERE x.refseq=r.name) AS tRnaName, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.spID,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS spID, (SELECT
          GROUP_CONCAT(DISTINCT(NULLIF(x.spDisplayID, '')) SEPARATOR ';') FROM kgXref
          x WHERE x.refseq=r.name) AS spDisplayID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.protAcc,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS protAcc, (SELECT
          GROUP_CONCAT(DISTINCT(NULLIF(x.mRNA, '')) SEPARATOR ';') FROM kgXref x WHERE
          x.refseq=r.name) AS mRNA, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.rfamAcc,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS rfamAcc FROM
          refGene r WHERE chrom=%chromosomes%;
      completed: <date fetched>
      name: fetch
```

hg19.kgXref.chr1.gz will contain:

bin	name	chrom	strand	txStart	txEnd	cdsStart	cdsEnd	exonCount	exonStarts	exonEnds	score	name2	cdsStartStat	cdsEndStat	exonFrames	kgID	description	ensemblID	tRnaName	spID	spDisplayID	protAcc	mRNA	rfamAcc

0	NM_001376542	chr1	+	66999275	67216822	67000041	67208778	25	66999275,66999928,67091529,67098752,67105459,67108492,67109226,67126195,67133212,67136677,67137626,67138963,67142686,67145360,67147551,67154830,67155872,67161116,67184976,67194946,67199430,67205017,67206340,67206954,67208755,	66999620,67000051,67091593,67098777,67105516,67108547,67109402,67126207,67133224,67136702,67137678,67139049,67142779,67145435,67148052,67154958,67155999,67161176,67185088,67195102,67199563,67205220,67206405,67207119,67216822,	0	SGIP1	cmpl	cmpl	-1,0,1,2,0,0,1,0,0,0,1,2,1,1,1,1,0,1,1,2,2,0,2,1,1,	NA	NA	NA	NA	NA	NA	NA	NA	NA
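
As the row above shows, the exonStarts and exonEnds columns are comma-separated lists that end with a trailing comma. A small Python sketch of pairing them into exon intervals follows; the function name is hypothetical and this is not how Bystro itself reads the file.

```python
from typing import List, Tuple


def parse_exons(exon_starts: str, exon_ends: str) -> List[Tuple[int, int]]:
    """Pair exonStarts/exonEnds columns into (start, end) tuples."""
    # Drop the empty string produced by the trailing comma
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds have different lengths")
    return list(zip(starts, ends))
```

Applied to the first two exons of the NM_001376542 row, `parse_exons("66999275,66999928,", "66999620,67000051,")` yields two `(start, end)` pairs; the length of the full list should equal the exonCount column.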

nearest: A pre-calculated gene track that is intersected with a target gene track.

Example:
```
- name: refSeq.gene
  dist: false
  storeNearest: true
  to: txEnd
  type: nearest
  features:
  - name2
  from: txStart
  local_files:
  - hg19.kgXref.chr*.gz
```
Options:
- dist: bool
  - Calculate the distance to the nearest target gene record. If the
vcf: A VCF v4.* file

chromosomes: The allowable chromosomes.
- Each row of every track must be identified by these chromosomes (during building)
- Each row of any input file submitted for annotation must also be identified by these chromosomes (during annotation)
- However, Bystro is flexible about the chr prefix
Ex: For the following config
```
chromosomes:
  - chr1
  - chr2
  - chr3
```
Only chr1, chr2, and chr3 will be accepted. However, Bystro tries to make your life easy:
1. We currently follow UCSC conventions for chromosomes, meaning they should be prefixed with chr
2. Bystro will automatically prepend chr to chromosomes read from an input file during annotation.
3. Bystro allows the transformation of any field during building, configurable in the YAML config file for that assembly, making it easy to prepend chr to the source file chromosome field
Ex: Clinvar doesn't have a chr prefix, so during building we specify:
```
tracks:
  - name: clinvar
    build_field_transformations:
      chrom: chr .
    fieldMap:
      Chromosome: chrom
```
Here fieldMap allows us to rename header fields, and build_field_transformations allows us to define a prepend operation (chr . can be interpreted as the perl command "chr" . $chrom)

So: input files do not need to have their chromosomes prepended by chr. Bystro will normalize the name.

In this example chromosomes 1 and chr1 will be built/annotated, but 1_rand will not.
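
The normalization rule described above can be sketched in a few lines of Python. This is an illustration of the behaviour as documented here, not Bystro's code; the function name and signature are assumptions.

```python
from typing import Iterable, Optional


def normalize_chrom(raw: str, allowed: Iterable[str]) -> Optional[str]:
    """Prepend 'chr' when missing, then accept only configured chromosomes."""
    name = raw.strip()
    if not name.startswith("chr"):
        name = "chr" + name
    return name if name in set(allowed) else None
```

With `allowed = ["chr1", "chr2", "chr3"]`, both `1` and `chr1` normalize to `chr1`, while `1_rand` becomes `chr1_rand`, which is not in the list and is therefore rejected.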

Directories and Files

These describe where the Bystro database and any source files are located.

files_dir : The parent folder within which each track's local_files are located

Bystro automatically checks for local_files at parent/trackName/file

Ex: For the config file containing
```
files_dir: /path/to/files/
tracks:
  - name: refSeq
    local_files:
      - hg19.refGene.chr1.gz
      # and more files
```
Bystro will expect files in /path/to/files/refSeq/hg19.refGene.chr1.gz

database_dir : Each database is held within database_dir, in a folder of the name assembly

Ex: For the config file containing
```
assembly: hg19
database_dir: /path/to/databases/
```
Bystro will look for the database /path/to/databases/hg19
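
The two path rules above (files_dir/trackName/file and database_dir/assembly) amount to simple path joins. A hedged Python sketch, with hypothetical function names, assuming POSIX-style paths and no glob patterns in local_files:

```python
from pathlib import PurePosixPath


def track_file_path(files_dir: str, track_name: str, local_file: str) -> PurePosixPath:
    """Where a track's source file is expected: files_dir/trackName/file."""
    return PurePosixPath(files_dir) / track_name / local_file


def database_path(database_dir: str, assembly: str) -> PurePosixPath:
    """Where the database for an assembly is expected: database_dir/assembly."""
    return PurePosixPath(database_dir) / assembly
```

A trailing slash on files_dir or database_dir makes no difference to the result, since the path type normalizes it away.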

bystro's People


wingolab-org raonyguimaraes wingolab alabarga yanglab-emory project-bystro ilhah akotlar poneill cristinaetrv austintalbot7241993 dlin30 bystrogenomics

bystro's Issues

Support more compressed formats

We may be able to gain decompression efficiency by supporting lz4, bgzip. Block-compressed formats can be decompressed using multiple threads.

Improve error messages

When 0 variants are annotated, an error is generated with the message "Couldn't read statistics file". This is true, but it is not really the core issue: there is no error, simply no data.

Add pLi scores.

Very important. Seems at least as useful as CADD, and maybe more sensitive.

Write md5 hash of tracks configuration to db

Ensure that if the YAML configuration is substantially modified (i.e., the track configuration is modified), the database complains.

This should not include absolute paths, database_dir or files_dir, which may be better suited as environmental variables.

This TODO is really about the initiation of use of blockchain to track state.
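
One possible shape for the md5 idea in this issue, sketched in Python. The canonical JSON serialization and the function name are assumptions; the only details taken from the issue are the use of md5 and the exclusion of database_dir and files_dir.

```python
import hashlib
import json

# Location-specific keys that should not affect the fingerprint
EXCLUDED_KEYS = {"database_dir", "files_dir"}


def config_fingerprint(config: dict) -> str:
    """md5 of the config with path keys removed, using a stable serialization."""
    portable = {k: v for k, v in config.items() if k not in EXCLUDED_KEYS}
    # sort_keys makes the hash independent of key order in the YAML file
    canonical = json.dumps(portable, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

Storing this value in the database at build time and comparing it at annotation time would let the program complain when the track configuration has changed, while moving the database to another directory would not trigger a mismatch.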

Revise stripping of delimiters

For instance: RH C/c Polymorphism currently gets transformed in master to RH C c Polymorphism.

We could replace our delimiters with commas or underscores, to preserve the fact that these aren't separate tokens (which google will interpret correctly), and which will allow us to index them as concatenated in elastic.

Ex: RH C/c -> RH C-c or RH C_c would both work well. In google RH C,c works best, returns the same results as RH C/c.

Alternatively our overlapDelimiter could be changed to \\, but I think this makes parsing much more difficult, and should be a last resort.

Edit: By discussion with Thomas, will try \ for now.

Create AWS instance launcher (for Spot market)

Currently we require 2 steps, since user-data is executed as root, and our scripts assume Bystro is being installed in the home directory.

The simplest solution is to install and launch from somewhere under root.

A smarter, better long-term solution would be to use cloud-init to allow installation at whichever path is desired.

Expanded HGVS notation

Bystro currently supports HGVS search in coding regions.

The questions are:

Should we expand HGVS support to non-coding regions?
Should we permanently store the HGVS notation in a tab-delimited field?

Better online db versioning

Store hash of database in YAML after build
Automatically identify available database builds by querying nodes
Show database build date, version in dropdown
Show deprecation messages before switching databases.
Always provide deprecated database for at least 2 weeks after deprecation

Document sites that didn't liftover to hg38 for gnomad

There should be a list of sites/coordinates where missing values represent sites that didn't lift over from hg19 to hg38 for quality control measures to separate those sites from missing data representing private mutations.

Allow database to be built or pulled

Working on GenPro; realizing that it should be easier to start up the program.

Proposal: add a YAML config property, that provides the link to the remote resource where the version of the database specified should be uploaded to, and then pulled.

Something like

repository:
  path: "s3://" or "http://" or "/path/to"
  buildDate: 10/27/18 11:22pm

When the user first uses the config file, the program should check whether the database exists at the given database_dir, and if it does not, fetch if the repository property exists.

This will allow users to supply custom databases.
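
The check proposed above could look roughly like the following Python sketch. The `repository` key and its `path` field come from the proposal; the function name, the injectable `exists` callback, and the decision to return the remote path rather than perform the download are assumptions for illustration.

```python
import os
from typing import Callable, Optional


def repository_to_fetch(
    config: dict, exists: Callable[[str], bool] = os.path.isdir
) -> Optional[str]:
    """Return the remote path to fetch from, or None if no fetch is needed."""
    db_path = os.path.join(config["database_dir"], config["assembly"])
    if exists(db_path):
        return None  # database already present locally
    repository = config.get("repository")
    # Without a repository property there is nothing to fetch from
    return repository["path"] if repository else None
```

The caller would then dispatch on the scheme of the returned path (s3, http, or a local path) to actually pull the database.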

Potentially this could be extended to multiple databases. This would mean allowing per-track database configuration (as opposed to having a singleton with a fixed database_dir). It would of course cost access time, but may be reasonable in cases like GenPro, where we may want to allow users to build (or fetch) highly dynamic databases (per experiment). In GenPro's case, the ability to fetch from a remote resource would mean memoization to a remote resource (as opposed to an in-memory data structure).

Cut b11.0.0 release from master

The master branch is a substantial improvement of the b10 codebase, including a new "nearest" track that uses an ahead-of-time de-duplication strategy to reduce disk space and improve annotation performance, and which allows the calculation of the distance to the nearest features.

It is currently used to calculate nearest-gene and nearest-TSS distances (as well as to list details about those genes/TSSs), and to create a refSeq.gene track, which contains pLi, pNull, pHI, lofTool, GDI, and more.

Furthermore, building now uses LMDB cursors, and is remarkably faster (build times are < 1/2 of b10).

TODOS

"*" May be deferred for first minor (feature) release

** Likely to be deferred to 2nd (feature) release.

Build Error on Docker Machine

I'm using docker-machine on Windows 10. Trying to build from the Dockerfile with docker build -t bystro . fails: the script exits with exit code 127 (command not found) when running install/install-go-packages.sh. Additionally, the script produces similar warnings when installing lmdb, but does not exit.

This is my terminal output (at step 11):

Step 11/13 : RUN . install/install-go-packages.sh
 ---> Running in 7700bf77c2b1
: not found install/install-go-packages.sh:
-e

Installing go packages (bystro-vcf, stats, snp)

: not found install/install-go-packages.sh:
: not found install/install-go-packages.sh:
: not found install/install-go-packages.sh:
Made /root/go path
: not found install/install-go-packages.sh:
: not found install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
The command '/bin/sh -c . install/install-go-packages.sh' returned a non-zero code: 127

Note, the script does run when I run it manually through the console.

The error is likely caused by my use of docker machine. The default vm that docker-machine creates does not have go installed, and it does fail to execute the script with exit code 127.

Output allele count and allele number

These are useful for finding singletons.

Don't die with cryptic error if attempting to fetch unknown field in readOnly mode

Right now we implicitly create a meta entry for a key if getFieldDbName is called, even if the database is in readOnly mode.

Error code is 13, EACCES. We should simply return a more sensible error message at runtime if this is the case.

Use semantic versioning for both database and program

Right now program version is intimately tied to database version.

We either need to decouple them, or use semantic versioning to track all changes, such that any identified database bugs that require a rebuild increment the corresponding minor version digit.

Store partially-overlapping sparse tracks in join?

We should decide whether the join track for genes should require the gene to be fully covered by the joining track (currently we configure clinvar in our hg19.yml and hg38.yml builds).

Finish transition to camelCase

Project started off defining snake_case for variables that were configurable at run time via YAML, and camelCase elsewhere. In part because command line users may not have liked/been used to camelCase.

This was stupid and confusing.

Simplify transaction management

Remove all cleanUp() besides the checkpoints.

In general, how can we utilize LMDB more effectively? This is mostly interesting for the future Go transition, but it feels like our current dbRead vs dbReadCursorUnsafe solution is not completely satisfactory.

Cristina student todo's

A mixture of web and local tasks:

Create new save filters, Go or Perl.
Create in-line documentation on web: documentation should appear for new users (or users who haven't seen the function previously), when they are on a page/section with that function. Can be pretty easily written in Angular Material.
Document new fields going up in master (web)
Document new UNIT SEPARATOR (ASCII 31) for overlapping fields
Document filters
Contribute to VCF / plink export
Contribute to Hail integration

What to do with complex variants that are both a deletion and a SNP?

Example from gnomAD:

chr10 723260 rs61831381 GCCATCATCACCATGCCCAGCGTCACGTGACATGGATAGAGTACATGTCAGGGGTATCACTGTGTGGGAAAAGGTCACACCATCATCACCACTCCCCACGTCACATGACAGGGATACAGTACGTGTCAGGGGTTTCACTGTGTGGGAAAAGGTCACGCCATCATCACCATGCCCGGCGTCACGTGACATGGATAGAGTACATGTCAGGGGTATCACTGTGTGGGAAAAGGTCACA ACCATCATCACCATGCCCAGCGTCACGTGACATGGATAGAGTACATGTCAGGGGTATCACTGTGTGGGAAAAGGTCACACCATCATCACCACTCCCCACGTCACATGACAGGGATACAGTACGTGTCAGGGGTTTCACTGTGTGGGAAAAGGTCACGCCATCATCACCATGCCCGGCGTCACGTGACATGGATAGAGTACATGTCAGGGGTATCACTGTGTGGGAAAAGGTCACA,A 2362232.76

Example 2:

chr10 735488 rs56079144 ACCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGACAGAGGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTAAGGCTCCAGACCCGAAGAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGATAGAGTGAGGCTCCAGACCCGGATAGAAGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTGAGGCT TCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGACAGAGGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTAAGGCTCCAGACCCGAAGAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGATAGAGTGAGGCTCCAGACCCGGATAGAAGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTGAGGCT,TCCAGACCCGGGACAGAGTGAGGCT,AGACCCGGGACAGAGTGAGGCTCCAGACCCGGATAGAGTGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGACAGAGGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTAAGGCTCCAGACCCGAAGAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGATAGAGTGAGGCTCCAGACCCGGATAGAAGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTGAGGCT,T,ACCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTAAGGCTCCAGACCCGAAGAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGATAGAGTGAGGCTCCAGACCCGGATAGAAGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTGAGGCT

Example 3:

chr10 737933 rs534100935 GTAGAGTGAGGCTTCAGACCCAGGTAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGATAGAGTAAGGCTTCAGACCCAGGTAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGACAGAGGGAGGCCCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGAATAGAGTAAGGCTCCAGACCCGGA ATAGAGTGAGGCTTCAGACCCAGGTAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGATAGAGTAAGGCTTCAGACCCAGGTAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGACAGAGGGAGGCCCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGAATAGAGTAAGGCTCCAGACCCGGA,A

Fix b10 hg38 gnomad (early exit)

Default behavior when encountering unexpected chromosomes was to skip and exit early. Fixing this will restore missing hg38 sites.

Add sampleMaf

Contains the number of non-missing alleles at the site. Allows for queries that are maximally flexible. For instance, we could filter variants that are either in gnomAD or are at low frequency in our sample.

Update documentation for "master" merge (b11)

Update install documentation
Update fields documentation
Add documentation on building

Improve upload reliability

Users from Albert Einstein have run into issues with large uploads (tens of GB).

We should add ability to retry chunks
If uploading from s3, we should run the upload completely in background, rather than as a synchronous event that the user needs to keep a connection open during (meaning don’t tie to request/response lifecycle; start upload and return).
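
A minimal sketch of per-chunk retries with exponential backoff, in Python. Everything here is hypothetical: the `send` callback stands in for whatever actually transmits a chunk, and treating `OSError` as the retryable failure is an assumption rather than a description of the web app's upload code.

```python
import time
from typing import Callable, Iterable


def upload_chunks(
    chunks: Iterable[bytes],
    send: Callable[[int, bytes], None],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> None:
    """Send each chunk in order, retrying a failed chunk before moving on."""
    for index, chunk in enumerate(chunks):
        for attempt in range(1, max_attempts + 1):
            try:
                send(index, chunk)
                break  # chunk delivered, go to the next one
            except OSError:
                if attempt == max_attempts:
                    raise  # give up after the final attempt
                # Back off: base_delay, 2x, 4x, ...
                time.sleep(base_delay * 2 ** (attempt - 1))
```

Because only the failing chunk is resent, a transient network error late in a large upload does not force the whole file to be transferred again.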

Cc @wingolab

Coerce ints due to Elasticsearch removing coerce option on integers

Due to this decision elastic/elasticsearch#25861

Edit: We need to stay with ES 5.6 for now. 6.x+ remove split_on_whitespace, which dramatically changes queries. In practice, even with the 'split_queries_on_whitespace' option, queries operate very differently.

DbManager should check if dbdata defined

We expect that the dbmanager will store only structures of data (one track of information at each index).

It is moderately safer, and slower, to check that the site is defined, rather than flash.

Utils::LiftOverCadd: allow whitelist

With the release of CADD 1.4, our major use case for liftover goes away until the next human assembly release. However, we still need to lift over the GRCh37 MT to hg19's chrM (pre-patch).

A whitelist will allow this in-app, rather than as a separate processing step.

Add VCF export

Incorporate Dave's script...complicating factor is that it requires sdx files. The obvious solution is to have it read LMDB instead.

Update tests

nearest-dev branch currently contains most up-to-date tests: https://github.com/akotlar/bystro/tree/nearest-dev

TODO:

Complete integration tests for all tracks (insert / fetch)
- in future revisions explore creating more granular tests
- some of this is limited by architectural choices; inlining -> performance+, but more complex tests
Complete unit tests for DB Manager functions
Create unit test for Output.pm
Create unit tests for less important, clearly working utility functions (like IO package)
Create low-level unit tests for gene track's TX builder
its function is already verified in gene track tests, but useful for future development
Write tests for fields that use delimiters that are also used as Bystro delimiters; ensure we aren't generating extra fields in subsequent versions. Currently everything works appropriately, but is fragile because of the lack of tests (can verify at bystro.io using hbox/dead)
Write test to check that newline characters are stripped from db-inserted values.

VCF builder

Builds VCF file, for use primarily with gnomAD, ExAC, etc.

Support FLAG types in VCF files

Used in new gnomad ... segdup / lcr flags
appearance as:

AC=2;AF=6.52443e-05;AN=...CSQ=A|intergenic_variant|MODIFIER||||||||||||||||1||||SNV|1||||||||||||||||||||||||||||||||||||||||||||;segdup

So need to check for presence of string, in absence of an equal sign.
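
The rule just stated (a key with no equal sign is a flag) can be sketched in Python as follows. The function name is hypothetical and values are left as strings; this is an illustration, not the parser used by bystro-vcf.

```python
def parse_info(info: str) -> dict:
    """Parse a VCF INFO string; entries without '=' are stored as True flags."""
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue
        # partition splits on the first '=' only, so values may contain '='
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields
```

For an INFO string such as `AC=2;AF=6.52443e-05;segdup`, the AC and AF entries keep their string values and segdup is recorded as a flag.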

Create Docker container

This will be far easier to launch on the command line, and could be useful when we enable private instance launching from https://bystro.io

Add tests for mis-sorted files

We had an issue where VCF track builds were being cut short, because those tracks had unexpected chromosomes as an artifact of liftover.

Need to write tests for all tracks, especially those prone to liftover artifacts, showing handling of multiple chromosomes when program expects only specified chromosomes (which is the case when multiple files are present)

Store region data as array in region db

Currently region data is stored as a hash, but with integer keys; this doesn't seem particularly useful, except in maybe the case that features are split between region and site, but that could be handled in a more deterministic way to reduce the sparsity of the site and region arrays.

Create Singularity, Kubernetes, etc containers

Docker is popular, but other containers are used. For instance, some at NIH use Singularity

Create Docker container
Create Kubernetes cluster driven by said Docker container
Create singularity container

Set up Travis CI

This is slightly tricky: most of our tests require LMDB to be installed. Figure this out.

Add ploidy (het ploidy and homozygote ploidy)

This will be used to allow dropping of samples, without screwing up allele numbers.

We should also include an allele number (maybe "sampleAn") field; this will allow easy updates to homozygosity, heterozygosity and missingness when dropping samples.

Permission configurability

We currently need to set read permissions on output files, so that processes on other nodes can read them without having the same user/group (files are authorized by web server, inaccessible from outside world without authorization).

TODOS:

Modify permissions on only files owned by Bystro, rather than all in output folder (only an issue if using --temp_dir "/some/path" without --archive)
Allow output permission to be set in YAML config

Investigate use of named databases

In this version, every track would get a separate named database, as opposed to a key in the serialized data structure.

The advantage is a substantially easier insertion model, which will allow us to modularly update the database.

The disadvantage may be read performance and size; each database will need a header; need to investigate size, but may be 16 bytes. Also, we will need to deserialize N times for N tracks, although the deserialization will be simpler.

If annotation performance or database size are substantially impacted, or this change causes significantly higher CPU usage during annotation, the tradeoff will likely not be worth it. Currently on the master branch, build times are 1 day with 3 additional whole-genome tracks (refSeq.gene, nearest.refSeq, nearestTss.refSeq), which cumulatively take ~7 hours. We re-run builds no more than once per month.

Improve query builder.

In the web app, move from regex to something like PEG/ohm.

Fix Pankaj synonym issue: synonym name should match exactly
Prototype Ohm query syntax

Some missing data in refSeq not being undef'd

mRNA field shows up as an empty string in search

Enumerate, and strip common missing values

dbSNP: unknown (function)
clinvar: not provided, not specified, no assertion criteria provided, no interpretation for the single variant, no assertion for the individual variant, see cases : akotlar@7df8409 , 7883b7f

While these provide some information, barring evidence to the contrary, I think we shouldn't waste space on their storage.
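
A sketch of how the enumerated values could be dropped at build time, in Python. The per-track sets are copied from the list above; the function name, the case-insensitive comparison, and treating empty strings as missing are assumptions.

```python
from typing import Optional

# Placeholder values enumerated in this issue, keyed by track name
MISSING_VALUES = {
    "dbSNP": {"unknown"},
    "clinvar": {
        "not provided",
        "not specified",
        "no assertion criteria provided",
        "no interpretation for the single variant",
        "no assertion for the individual variant",
        "see cases",
    },
}


def strip_missing(track: str, value: Optional[str]) -> Optional[str]:
    """Return None for empty or placeholder values, else the trimmed value."""
    if value is None:
        return None
    cleaned = value.strip()
    if not cleaned or cleaned.lower() in MISSING_VALUES.get(track, set()):
        return None
    return cleaned
```

Returning None rather than an empty string means the value is simply not stored, which is the space saving this issue is after.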

Add support for fam files and case/control male/female allele frequencies

Need to support an optional fam parameter in bystro-snp and bystro-vcf, and of course pass through the fam file during upload.

Nearest tssName and tssDist

TODO:

Validate that both nearest.refSeq and nearestTss.refSeq are accurate
Decide whether these track names are ok
Decide whether we report all desired data
Document parsing of these fields (since they are de-duplicated in a way that refSeq isn't).

Export to VCF format

Will require using the tab-delimited statistics file to get the sample list, then running Dave's VCF converter. Simply: tail -n +3 statistics.tsv | cut -f1 > sample_list.txt && seqantToVcf etc.

Would be nice to update Dave's program to use LMDB db.

Note that, as it stands, we will keep multiallelics on separate lines. Could add a facility to recombine multiallelics.

Generate sample-list output from bystro-vcf
Add support for sample-list in YAML config, Bystro Seq.pm
Generate sample-list output from bystro-snp
Propagate sample-list during saving from query
Add Dave Cutler's converter program
Make, use Rust implementation
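
For reference, the `tail -n +3 statistics.tsv | cut -f1` step mentioned in this issue can be written in Python as below. The function name is hypothetical, the two-line header is inferred from the `tail -n +3`, and unlike `cut` this sketch skips blank lines.

```python
from itertools import islice
from typing import Iterable, List


def sample_names(lines: Iterable[str], header_lines: int = 2) -> List[str]:
    """First tab-separated column of every line after the header lines."""
    names = []
    for line in islice(lines, header_lines, None):
        line = line.rstrip("\n")
        if line:
            names.append(line.split("\t")[0])
    return names
```

Passing an open file handle for statistics.tsv as `lines` gives the sample list that the converter needs.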

Set threshold for p = 1 to .9

https://github.com/akotlar/bystro/blob/d4c952b7f454acad8533cfd0ce522d54bf0698dc/lib/SeqFromQuery.pm#L828

Can make configurable.

Add chrPerFile support

This is a low-priority update. Its only benefit is to allow faster skipping of previously-built chromosomes.

Something along the lines of

```
sub makeChromCheckFunction {
  # $self and $file are passed in explicitly here (an assumption);
  # the original sketch used them without defining them
  my ($self, $file, $onNew, $onExit) = @_;

  return sub {
    my ($currentChr, $newChr) = @_;

    # Act only on the first chromosome seen, or on a chromosome change
    if( ($currentChr && $currentChr ne $newChr) || !$currentChr ) {
      if($self->chrPerFile) {
        # show the longer $currentChr ne $newChr condition for clarity
        if($currentChr && $currentChr ne $newChr) {
          # if the user guarantees one chromosome per file, this is a fatal error
          $self->log('fatal', $self->name . ": Expected one chromosome in $file, found at least 2.");
        }

        # FH_LOOP is the caller's labeled file-reading loop
        if(!$self->chrIsWanted($newChr)) {
          $self->log('warn', $self->name . ": $newChr unwanted, and chrPerFile flag set; exiting file");
          last FH_LOOP;
        }

        if(!$self->completionMeta->okToBuild($newChr)) {
          $self->log('warn', $self->name . ": $newChr wanted, but completed, and chrPerFile flag set; exiting file");
          last FH_LOOP;
        }

        $onNew->($currentChr, $newChr);

        return $newChr;
      }

      # Without chrPerFile, skip unwanted or already-built chromosomes
      return $self->chrIsWanted($newChr) && $self->completionMeta->okToBuild($newChr) ? $newChr : undef;
    }

    return $currentChr;
  };
}
```

thanks!!