bystrogenomics / bystro

Bystro genetic analysis (annotation, filtering, statistics)

License: Apache License 2.0

Perl 43.34% Shell 1.75% Dockerfile 0.23% Rust 2.64% Python 48.52% Cython 0.07% Makefile 0.06% Go 3.40%
genomics genomics-search bioinformatics bioinformatics-pipeline bioinformatics-analysis bioinformatics-databases bioinformatics-algorithms bioinformatics-scripts

bystro's Introduction

TL;DR: 1,000x+ faster than VEP, with more complete annotation and online search (https://bystro.io) for datasets of up to 47TB (compressed) online, or petabytes offline.

Bystro Performance

Bystro Publication

For datasets and scripts used, please visit github.com/bystro-paper

If using Bystro, please cite Kotlar et al, Genome Biology, 2018

Web Tutorial

Start here: TUTORIAL.md

For most users, we recommend https://bystro.io .

The web app gives full access to all of Bystro's capabilities, provides a convenient search/filtering interface, supports large data sets (tested up to 890GB uncompressed/129GB compressed), and has excellent performance.

Installing Bystro

The easiest way is to run from Docker: docker pull akotlar/bystro:latest && docker run akotlar/bystro:latest bystro-annotate.pl

Please read: INSTALL.md for instructions on how to download and use Bystro hg19/hg38/etc databases.

Bystro relies on pluggable (via Bystro's YAML config) pre-processors to normalize variant inputs (dealing with VCF issues such as padding), calculate whether a site is a transition or transversion, calculate sample maf, identify hets/homozygotes/missing samples, calculate heterozygosity, homozygosity, missingness, and more.

  1. VCF format: Bystro-Vcf
  2. SNP format: Bystro-SNP
  3. Create your own to support other formats!
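
As an illustration of one of the per-site calculations listed above, here is a minimal sketch of transition/transversion classification. It is not taken from Bystro-Vcf or Bystro-SNP; the function name `classify_snp` is hypothetical, and it only handles single-base substitutions.

```python
# Hypothetical helper, not Bystro's implementation.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref!r} -> {alt!r}")
    # A transition stays within one chemical class (purine<->purine or
    # pyrimidine<->pyrimidine); a transversion crosses classes.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"
```

For example, `classify_snp("A", "G")` is a transition, while `classify_snp("A", "T")` is a transversion.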

Annotation (Output) Field Descriptions

Please read FIELDS.md

The Bystro configuration file

  • The config file describes the state of both the database and the annotation. It is required for both annotating and building.

  • It has several keys:

    • tracks: The highest level organization for database values. Tracks have a name property, which must be unique, and a type, which must be one of:

      • sparse: A bed file, or any file that can be mapped to chrom, chromStart, and chromEnd columns.

        • This is used for dbSNP and Clinvar records, but many files can be fit to this format.
        • Mapping fields can be managed by the fieldMap key
      • score: A wigFix file.

        • Used for phastCons, phyloP
      • cadd:

        • A CADD file, or Bystro's custom "bed-like" CADD file, which has 2 header lines, and chrom, chromStart, chromEnd columns, followed by standard CADD fields
        • CADD format: http://cadd.gs.washington.edu
      • gene: A UCSC gene track table (ex: knownGene, refGene, sgdGene) stored as a tab separated output, with column names as columns. Conversion from SQL to the expected tab-delimited format is controlled by bin/bystro-utils.pl, which will automatically fetch the requested sql, and generate the tab-delimited output.

        For instance: For a config file that has the following track

        chromosomes:
          - chr1
        tracks:
          tracks:
          - name: refSeq
            type: gene
            utils:
            - args:
                connection:
                  database: hg19
                sql: SELECT r.*, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.kgID, '')) SEPARATOR
                  ';') FROM kgXref x WHERE x.refseq=r.name) AS kgID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.description,
                  '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS description,
                  (SELECT GROUP_CONCAT(DISTINCT(NULLIF(e.value, '')) SEPARATOR ';') FROM knownToEnsembl
                  e JOIN kgXref x ON x.kgID = e.name WHERE x.refseq = r.name) AS ensemblID,
                  (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.tRnaName, '')) SEPARATOR ';') FROM
                  kgXref x WHERE x.refseq=r.name) AS tRnaName, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.spID,
                  '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS spID, (SELECT
                  GROUP_CONCAT(DISTINCT(NULLIF(x.spDisplayID, '')) SEPARATOR ';') FROM kgXref
                  x WHERE x.refseq=r.name) AS spDisplayID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.protAcc,
                  '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS protAcc, (SELECT
                  GROUP_CONCAT(DISTINCT(NULLIF(x.mRNA, '')) SEPARATOR ';') FROM kgXref x WHERE
                  x.refseq=r.name) AS mRNA, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.rfamAcc,
                  '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS rfamAcc FROM
                  refGene r WHERE chrom=%chromosomes%;
        

        Running bin/bystro-utils.pl --config <path/to/this/config> will result in the following config:

        chromosomes:
          - chr1
        tracks:
          tracks:
          - name: refSeq
            type: gene
            local_files:
              - hg19.kgXref.chr1.gz
            utils:
              - args:
                  connection:
                    database: hg19
                  sql: SELECT r.*, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.kgID, '')) SEPARATOR
                    ';') FROM kgXref x WHERE x.refseq=r.name) AS kgID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.description,
                    '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS description,
                    (SELECT GROUP_CONCAT(DISTINCT(NULLIF(e.value, '')) SEPARATOR ';') FROM knownToEnsembl
                    e JOIN kgXref x ON x.kgID = e.name WHERE x.refseq = r.name) AS ensemblID,
                    (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.tRnaName, '')) SEPARATOR ';') FROM
                    kgXref x WHERE x.refseq=r.name) AS tRnaName, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.spID,
                    '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS spID, (SELECT
                    GROUP_CONCAT(DISTINCT(NULLIF(x.spDisplayID, '')) SEPARATOR ';') FROM kgXref
                    x WHERE x.refseq=r.name) AS spDisplayID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.protAcc,
                    '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS protAcc, (SELECT
                    GROUP_CONCAT(DISTINCT(NULLIF(x.mRNA, '')) SEPARATOR ';') FROM kgXref x WHERE
                    x.refseq=r.name) AS mRNA, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.rfamAcc,
                    '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS rfamAcc FROM
                    refGene r WHERE chrom=%chromosomes%;
                completed: <date fetched>
                name: fetch
        

        hg19.kgXref.chr1.gz will contain:

        bin	name	chrom	strand	txStart	txEnd	cdsStart	cdsEnd	exonCount	exonStarts	exonEnds	score	name2	cdsStartStat	cdsEndStat	exonFrames	kgID	description	ensemblID	tRnaName	spID	spDisplayID	protAcc	mRNA	rfamAcc
        
        0	NM_001376542	chr1	+	66999275	67216822	67000041	67208778	25	66999275,66999928,67091529,67098752,67105459,67108492,67109226,67126195,67133212,67136677,67137626,67138963,67142686,67145360,67147551,67154830,67155872,67161116,67184976,67194946,67199430,67205017,67206340,67206954,67208755,	66999620,67000051,67091593,67098777,67105516,67108547,67109402,67126207,67133224,67136702,67137678,67139049,67142779,67145435,67148052,67154958,67155999,67161176,67185088,67195102,67199563,67205220,67206405,67207119,67216822,	0	SGIP1	cmpl	cmpl	-1,0,1,2,0,0,1,0,0,0,1,2,1,1,1,1,0,1,1,2,2,0,2,1,1,	NA	NA	NA	NA	NA	NA	NA	NA	NA
      • nearest: A pre-calculated gene track that is intersected with a target gene track.

        Example:

        - name: refSeq.gene
          dist: false
          storeNearest: true
          to: txEnd
          type: nearest
          features:
          - name2
          from: txStart
          local_files:
          - hg19.kgXref.chr*.gz
        

        Options:

        • dist: bool
          • Calculate the distance to the nearest target gene record. If the
      • vcf: A VCF v4.* file

    • chromosomes: The allowable chromosomes.

      • Each row of every track must be identified by these chromosomes (during building)
      • Each row of any input file submitted for annotation must also be identified by these chromosomes (during annotation)
      • However, Bystro is flexible about the chr prefix

      Ex: For the following config

      chromosomes:
        - chr1
        - chr2
        - chr3

      Only chr1, chr2, and chr3 will be accepted. However, Bystro tries to make your life easy:

      1. We currently follow UCSC conventions for chromosomes, meaning they should be prefixed with chr
      2. Bystro will automatically prepend chr to chromosomes read from an input file during annotation.
      3. Bystro allows the transformation of any field during building, configurable in the YAML config file for that assembly, making it easy to prepend chr to the source file chromosome field

      Ex: Clinvar doesn't have a chr prefix, so during building we specify:

      tracks:
        - name: clinvar
          build_field_transformations:
            chrom: chr .
          fieldMap:
            Chromosome: chrom

      Here fieldMap allows us to rename header fields, and build_field_transformations allows us to define a prepend operation (chr . can be interpreted as the perl command "chr" . $chrom)

      So: input files do not need to have their chromosomes prepended by chr. Bystro will normalize the name.

      In this example chromosomes 1 and chr1 will be built/annotated, but 1_rand will not.
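
The normalization rule described above can be sketched as follows. This is an assumption-laden illustration rather than Bystro's code: the function name `normalize_chrom` is hypothetical, and the only behavior modeled is "prepend chr if missing, then check against the configured chromosomes".

```python
from typing import Optional, Set


def normalize_chrom(raw: str, allowed: Set[str]) -> Optional[str]:
    """Prepend 'chr' when missing, then accept only configured chromosomes."""
    name = raw.strip()
    if not name.startswith("chr"):
        name = "chr" + name
    # Anything not listed under `chromosomes` in the config is rejected.
    return name if name in allowed else None


allowed = {"chr1", "chr2", "chr3"}
print(normalize_chrom("1", allowed))       # accepted as chr1
print(normalize_chrom("chr1", allowed))    # accepted as chr1
print(normalize_chrom("1_rand", allowed))  # rejected (None)
```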

Directories and Files

These describe where the Bystro database and any source files are located.

  1. files_dir : The parent folder within which each track's local_files are located
  • Bystro automatically checks for local_files at parent/trackName/file

    Ex: For the config file containing

    files_dir: /path/to/files/
    tracks:
      - name: refSeq
        local_files:
          - hg19.refGene.chr1.gz
          # and more files

    Bystro will expect files in /path/to/files/refSeq/hg19.refGene.chr1.gz

  2. database_dir : Each database is held within database_dir, in a folder of the name assembly

    Ex: For the config file containing

    assembly: hg19
    database_dir: /path/to/databases/

    Bystro will look for the database /path/to/databases/hg19
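
The two lookup rules above (files_dir/trackName/file and database_dir/assembly) can be expressed as a short sketch. The helper names `track_file_path` and `database_path` are hypothetical and only illustrate the path layout described here.

```python
from pathlib import PurePosixPath


def track_file_path(files_dir: str, track_name: str, file_name: str) -> str:
    """Location of one of a track's local_files: files_dir/trackName/file."""
    return str(PurePosixPath(files_dir) / track_name / file_name)


def database_path(database_dir: str, assembly: str) -> str:
    """Location of the database for an assembly: database_dir/assembly."""
    return str(PurePosixPath(database_dir) / assembly)


print(track_file_path("/path/to/files/", "refSeq", "hg19.refGene.chr1.gz"))
print(database_path("/path/to/databases/", "hg19"))
```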

bystro's People

Contributors

akotlar, austintalbot7241993, codacy-badger, cristinaetrv, dependabot[bot], dlin30, mfigurski80, poneill

bystro's Issues

Support more compressed formats

We may be able to gain decompression efficiency by supporting lz4, bgzip. Block-compressed formats can be decompressed using multiple threads.

Improve error messages

  • When 0 variants are annotated, an error is generated with the message "Couldn't read statistics file". This is true, but not the core issue: there is no error, simply no data.

Add pLi scores.

Very important. Seems at least as useful as CADD, and maybe more sensitive.

Write md5 hash of tracks configuration to db

Ensure that if the YAML configuration is substantially modified (i.e., the track configuration is modified), the database complains.

This should not include absolute paths, database_dir or files_dir, which may be better suited as environmental variables.

This TODO is really about the initiation of use of blockchain to track state.
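
One possible way to compute such a hash, as a sketch only: serialize just the `tracks` value in a canonical form and take its MD5, so that top-level `database_dir` and `files_dir` never enter the digest. The function name `tracks_digest` and the choice of canonical JSON are assumptions, not part of Bystro.

```python
import hashlib
import json


def tracks_digest(config: dict) -> str:
    """MD5 of the tracks configuration only (hypothetical helper)."""
    tracks = config.get("tracks", [])
    # sort_keys gives a stable serialization regardless of key order.
    canonical = json.dumps(tracks, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


config_a = {
    "database_dir": "/path/to/databases/",
    "tracks": [{"name": "refSeq", "type": "gene"}],
}
config_b = dict(config_a, database_dir="/somewhere/else/")
assert tracks_digest(config_a) == tracks_digest(config_b)
```

Paths stored inside a track definition would still affect this digest; excluding those would need an explicit list of keys to drop.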

Revise stripping of delimiters

For instance: RH C/c Polymorphism currently gets transformed in master to RH C c Polymorphism.

We could replace our delimiters with commas or underscores, to preserve the fact that these aren't separate tokens (which google will interpret correctly), and which will allow us to index them as concatenated in elastic.

Ex: RH C/c -> RH C-c or RH C_c would both work well. In google RH C,c works best, returns the same results as RH C/c.

Alternatively our overlapDelimiter could be changed to \\, but I think this makes parsing much more difficult, and should be a last resort.

Edit: By discussion with Thomas, will try \ for now.

Create AWS instance launcher (for Spot market)

Currently we require 2 steps, since user-data is executed as root, and our scripts assumed Bystro is being installed in the home directory.

Simplest solution is to install and launch somewhere from root.

A smarter, better long-term solution would be to use cloud-init to allow installation at whichever path is desired.

Expanded HGVS notation

Bystro currently supports HGVS search in coding regions.

The questions are:

  1. Should we expand HGVS support to non-coding regions?
  2. Should we permanently store the HGVS notation in a tab-delimited field?

Better online db versioning

  • Store hash of database in YAML after build

  • Automatically identify available database builds by querying nodes

  • Show database build date, version in dropdown

  • Show deprecation messages before switching databases.

  • Always provide deprecated database for at least 2 weeks after deprecation

Document sites that didn't liftover to hg38 for gnomad

There should be a list of sites/coordinates where missing values represent sites that didn't lift over from hg19 to hg38 for quality control measures to separate those sites from missing data representing private mutations.

Allow database to be built or pulled

Working on GenPro; realizing that it should be easier to start up the program.

Proposal: add a YAML config property, that provides the link to the remote resource where the version of the database specified should be uploaded to, and then pulled.

Something like

repository:
  path: "s3://" or "http://" or "/path/to"
  buildDate: 10/27/18 11:22pm

When the user first uses the config file, the program should check whether the database exists at the given database_dir, and if it does not, fetch if the repository property exists.

This will allow users to supply custom databases.

Potentially this could be extended to multiple databases. This would mean allowing per-track database configuration (as opposed to having a singleton with a fixed database_dir). It would of course cost access time, but may be reasonable in cases like GenPro, where we may want to allow users to build (or fetch) highly dynamic databases (per experiment). In GenPro's case, the ability to fetch from a remote resource would mean memoization to a remote resource (as opposed to an in-memory data structure).
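
The proposed start-up check could look roughly like the sketch below. Everything here is hypothetical: the function name `database_action`, the returned labels, and the fallback to "build" when no `repository` is configured are assumptions made for illustration; the actual fetch from S3 or HTTP is left out.

```python
from pathlib import Path


def database_action(config: dict) -> str:
    """Decide what to do on first use of a config (hypothetical sketch)."""
    db_path = Path(config["database_dir"]) / config["assembly"]
    if db_path.is_dir():
        return "use_local"  # database already present at database_dir/assembly
    repository = config.get("repository") or {}
    if repository.get("path"):
        return "fetch"      # pull the prebuilt database from repository.path
    return "build"          # assumption: otherwise the user builds locally
```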

Cut b11.0.0 release from master

The master branch is a substantial improvement of the b10 codebase, including a new "nearest" track that uses an ahead-of-time de-duplication strategy to reduce disk space and improve annotation performance, and which allows the calculation of distance to nearest features.

  • Currently used to calculate nearest gene, nearest Tss distances (as well as list details about those genes/tss'), and to create a refSeq.gene track, which contains pLi, pNull, pHI, lofTool, GDI, and more.

Furthermore, building now uses LMDB cursors, and is remarkably faster (build times are < 1/2 of b10).

TODOS
  • Update all used annotations, esp gnomad.

  • Finish CADD track test (double check if still needed, we may have all necessary tests)

  • Modify all track tests (besides VCF, which has this done) to show that building from scrambled files when n > 1 files present works

  • Implement overlap delimiter. This delimiter allows n:m (m > n) relationships between fields within one track. The highest cardinality scalar vector within a given track determines the relationships. By convention this should be the first track. This is, in effect, the primary key.

    • Example: one refSeq.name (by definition all refSeq entries are unique on name) may have multiple kgID's. This would be represented as name1;name2 \t kgIDforName1_1\kgIDforName1_2\kgIDforName1_3;kgIDforName2
  • Decide on the names for nearest genes tracks (currently refSeq.nearest.* refSeq.nearestTss.*) and the refSeq.gene track (which holds gene-level rather than tx-level information overlapping refSeq transcripts...for instance pLi, pNull, etc)

  • Change beanstalk workers, SeqElastic, SeqFromQuery to use supplied configuration files (search/mapping and annotation YAML), rather than the corresponding assembly configuration found in config.

    • This is needed to help protect backward compatibility for a single annotation (re-indexing), and make better use of the state (configurations) stored alongside every annotation
  • Make hg38-lifted-over CADD publicly available

    • This should probably be the filtered cadd, since this is far simpler to work with (sorted, bad sites removed)
  • Update SeqElastic, SeqFromQuery to parse the new delimiter

  • Update the front end to handle delimiter use

  • Update changelog

  • Switch clinvar to by-allele-matching, using McArthur lab clinvar vcf.

    • Decide whether to keep the existing clinvar overlap of refSeq
  • * Add basic HGVS support

  • * Update mapping files to support Elasticsearch 6. Namely, split_on_whitespace no longer works, so we need to use copy_to to move

  • ** Allow submission of any valid track type (vcf, .bed, nearest) to add custom annotations

"*" May be deferred for first minor (feature) release

** Likely to be deferred to 2nd (feature) release.
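
For the overlap-delimiter item in the TODO list above, the example `name1;name2 \t kgIDforName1_1\kgIDforName1_2\kgIDforName1_3;kgIDforName2` could be parsed along these lines. This is a sketch under stated assumptions: the function names are hypothetical, `;` is taken as the per-record delimiter and `\` as the overlap delimiter as in that example, and both are parameters because the delimiter choice was still under discussion.

```python
from typing import Dict, List


def split_overlap(field: str, record_delim: str = ";",
                  overlap_delim: str = "\\") -> List[List[str]]:
    """Split a field into per-record groups, each holding 1..m overlapping values."""
    return [part.split(overlap_delim) for part in field.split(record_delim)]


def pair_with_primary(primary: str, dependent: str,
                      record_delim: str = ";",
                      overlap_delim: str = "\\") -> Dict[str, List[str]]:
    """Map each primary-key value to its group of dependent values."""
    keys = primary.split(record_delim)
    groups = split_overlap(dependent, record_delim, overlap_delim)
    if len(keys) != len(groups):
        raise ValueError("primary and dependent fields have different record counts")
    return dict(zip(keys, groups))


row = "name1;name2\tkgIDforName1_1\\kgIDforName1_2\\kgIDforName1_3;kgIDforName2"
names, kg_ids = row.split("\t")
print(pair_with_primary(names, kg_ids))
```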

Build Error on Docker Machine

I'm using docker-machine on Windows 10. Trying to build from the Dockerfile with docker build -t bystro . the script exits with exit code 127 (command not found) when running install/install-go-packages.sh. Additionally, the script produces similar warnings when installing lmdb, but does not exit.

This is my terminal output (at step 11):

Step 11/13 : RUN . install/install-go-packages.sh
 ---> Running in 7700bf77c2b1
: not found install/install-go-packages.sh:
-e

Installing go packages (bystro-vcf, stats, snp)

: not found install/install-go-packages.sh:
: not found install/install-go-packages.sh:
: not found install/install-go-packages.sh:
Made /root/go path
: not found install/install-go-packages.sh:
: not found install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
The command '/bin/sh -c . install/install-go-packages.sh' returned a non-zero code: 127

Note, the script does run when I run it manually through the console.

The error is likely caused by my use of docker machine. The default vm that docker-machine creates does not have go installed, and it does fail to execute the script with exit code 127.

Use semantic versioning for both database and program

Right now program version is intimately tied to database version.

We either need to decouple them, or use semantic versioning to track all changes, such that any identified database bugs that require a rebuild increment the corresponding minor version digit.

Finish transition to camelCase

Project started off defining snake_case for variables that were configurable at run time via YAML, and camelCase elsewhere. In part because command line users may not have liked/been used to camelCase.

This was stupid and confusing.

Simplify transaction management

Remove all cleanUp() besides the checkpoints.

In general, how can we utilize LMDB more effectively? This is mostly interesting for the future Go transition, but it feels like our current dbRead vs dbReadCursorUnsafe solution is not completely satisfactory.

Cristina student todo's

A mixture of web and local tasks:

  • Create new save filters, Go or Perl.

  • Create in-line documentation on web: documentation should appear for new users (or users who haven't seen the function previously), when they are on a page/section with that function. Can be pretty easily written in Angular Material.

  • Document new fields going up in master (web)

  • Document new UNIT SEPARATOR (ASCII 31) for overlapping fields

  • Document filters

  • Contribute to VCF / plink export

  • Contribute to Hail integration

What to do with complex variants that are both a deletion and a SNP?

Example from gnomAD:

chr10 723260 rs61831381 GCCATCATCACCATGCCCAGCGTCACGTGACATGGATAGAGTACATGTCAGGGGTATCACTGTGTGGGAAAAGGTCACACCATCATCACCACTCCCCACGTCACATGACAGGGATACAGTACGTGTCAGGGGTTTCACTGTGTGGGAAAAGGTCACGCCATCATCACCATGCCCGGCGTCACGTGACATGGATAGAGTACATGTCAGGGGTATCACTGTGTGGGAAAAGGTCACA ACCATCATCACCATGCCCAGCGTCACGTGACATGGATAGAGTACATGTCAGGGGTATCACTGTGTGGGAAAAGGTCACACCATCATCACCACTCCCCACGTCACATGACAGGGATACAGTACGTGTCAGGGGTTTCACTGTGTGGGAAAAGGTCACGCCATCATCACCATGCCCGGCGTCACGTGACATGGATAGAGTACATGTCAGGGGTATCACTGTGTGGGAAAAGGTCACA,A 2362232.76

Example 2:

chr10 735488 rs56079144 ACCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGACAGAGGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTAAGGCTCCAGACCCGAAGAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGATAGAGTGAGGCTCCAGACCCGGATAGAAGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTGAGGCT TCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGACAGAGGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTAAGGCTCCAGACCCGAAGAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGATAGAGTGAGGCTCCAGACCCGGATAGAAGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTGAGGCT,TCCAGACCCGGGACAGAGTGAGGCT,AGACCCGGGACAGAGTGAGGCTCCAGACCCGGATAGAGTGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGACAGAGGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTAAGGCTCCAGACCCGAAGAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGATAGAGTGAGGCTCCAGACCCGGATAGAAGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTGAGGCT,T,ACCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTAAGGCTCCAGACCCGAAGAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGATAGAGTGAGGCTCCAGACCCGGATAGAAGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTGAGGCT

Example 3:

chr10 737933 rs534100935 GTAGAGTGAGGCTTCAGACCCAGGTAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGATAGAGTAAGGCTTCAGACCCAGGTAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGACAGAGGGAGGCCCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGAATAGAGTAAGGCTCCAGACCCGGA ATAGAGTGAGGCTTCAGACCCAGGTAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGATAGAGTAAGGCTTCAGACCCAGGTAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGACAGAGGGAGGCCCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGAATAGAGTAAGGCTCCAGACCCGGA,A

Fix b10 hg38 gnomad (early exit)

Default behavior when encountering unexpected chromosomes was to skip and exit early. Fixing this will restore missing hg38 sites.

Add sampleMaf

Contains the number of non-missing alleles at the site. Allows for queries that are maximally flexible. For instance, we could filter variants that are either in gnomAD or are at low frequency in our sample.

Improve upload reliability

Users from Albert Einstein have run into issues with large uploads (tens of GB).

  1. We should add ability to retry chunks
  2. If uploading from s3, we should run the upload completely in background, rather than as a synchronous event that the user needs to keep a connection open during (meaning don’t tie to request/response lifecycle; start upload and return).
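
Per-chunk retry (item 1) could be sketched as below. This is a generic illustration, not the web app's upload code: `upload_with_retry`, the `send_chunk` callback, and the exponential backoff are all assumptions, and only `OSError` (which covers connection errors in Python) is retried.

```python
import time
from typing import Callable, Sequence


def upload_with_retry(send_chunk: Callable[[int, bytes], None],
                      chunks: Sequence[bytes],
                      max_attempts: int = 3,
                      base_delay: float = 1.0) -> None:
    """Send chunks in order, retrying each failed chunk with backoff."""
    for index, chunk in enumerate(chunks):
        for attempt in range(1, max_attempts + 1):
            try:
                send_chunk(index, chunk)
                break  # this chunk succeeded; move on to the next one
            except OSError:
                if attempt == max_attempts:
                    raise  # give up on the whole upload
                time.sleep(base_delay * 2 ** (attempt - 1))
```

Only the failed chunk is resent, so a transient failure late in a large upload does not restart the transfer from the beginning.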

Cc @wingolab

DbManager should check if dbdata defined

We expect that the dbmanager will store only structures of data (one track of information at each index).

It is moderately safer, and slower, to check that the site is defined, rather than flash.

Utils::LiftOverCadd: allow whitelist

With the release of CADD 1.4, our major use case for liftover goes away until the next human assembly release. However, we still need to lift over the GRCh37 MT to hg19's chrM (pre-patch).

A whitelist will allow this in-app, rather than as a separate processing step.

Add VCF export

Incorporate Dave's script...complicating factor is that it requires sdx files. The obvious solution is to have it read LMDB instead.

Update tests

nearest-dev branch currently contains most up-to-date tests: https://github.com/akotlar/bystro/tree/nearest-dev

TODO:

  • Complete integration tests for all tracks (insert / fetch)
    • in future revisions explore creating more granular tests
    • some of this is limited by architectural choices; inlining -> performance+, but more complex tests
  • Complete unit tests for DB Manager functions
  • Create unit test for Output.pm
  • Create unit tests for less important, clearly working utility functions (like IO package)
  • Create low-level unit tests for gene track's TX builder
  • its function is already verified in gene track tests, but useful for future development
  • Write tests for fields that use delimiters that are also used as Bystro delimiters; ensure we aren't generating extra fields in subsequent versions. Currently everything works appropriately, but is fragile because of the lack of tests (can verify at bystro.io using hbox/dead)
  • Write test to check that newline characters are stripped from db-inserted values.

VCF builder

Builds a VCF file, for use primarily with gnomAD, ExAC, etc.

Support FLAG types in VCF files

Used in new gnomad ... segdup / lcr flags. They appear as:

AC=2;AF=6.52443e-05;AN=...CSQ=A|intergenic_variant|MODIFIER||||||||||||||||1||||SNV|1||||||||||||||||||||||||||||||||||||||||||||;segdup

So need to check for presence of string, in absence of an equal sign.
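
A minimal sketch of that check, assuming a semicolon-delimited INFO string: entries without an equal sign are treated as flags and stored as `True`. The function name `parse_info` is hypothetical and this is not the Bystro VCF builder's code.

```python
from typing import Dict, Union


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Parse a VCF INFO string, treating key-only entries as FLAG types."""
    fields: Dict[str, Union[str, bool]] = {}
    for entry in info.split(";"):
        if not entry:
            continue
        # partition splits on the first '=' only, so values may contain '='.
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


print(parse_info("AC=2;AF=6.52443e-05;segdup"))
```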

Add tests for mis-sorted files

We had an issue where VCF track builds were being cut short, because those tracks had unexpected chromosomes as an artifact of liftover.

Need to write tests for all tracks, especially those prone to liftover artifacts, showing handling of multiple chromosomes when program expects only specified chromosomes (which is the case when multiple files are present)

Store region data as array in region db

Currently region data is stored as a hash, but with integer keys; this doesn't seem particularly useful, except in maybe the case that features are split between region and site, but that could be handled in a more deterministic way to reduce the sparsity of the site and region arrays.

Create Singularity, Kubernetes, etc containers

Docker is popular, but other containers are used. For instance, some at NIH use Singularity

  • Create Docker container
  • Create Kubernetes cluster driven by said Docker container
  • Create singularity container

Set up Travis CI

This is slightly tricky: most of our tests require LMDB to be installed. Figure this out.

Add ploidy (het ploidy and homozygote ploidy)

This will be used to allow dropping of samples, without screwing up allele numbers.

We should also include an allele number (maybe "sampleAn") field; this will allow easy updates to homozygosity, heterozygosity and missingness when dropping samples.

Permission configurability

We currently need to set read permissions on output files, so that processes on other nodes can read them without having the same user/group (files are authorized by web server, inaccessible from outside world without authorization).

TODOS:

  • Modify permissions on only files owned by Bystro, rather than all in output folder (only an issue if using --temp_dir "/some/path" without --archive)

  • Allow output permission to be set in YAML config

Investigate use of named databases

In this version, every track would get a separate named database, as opposed to a key in the serialized data structure.

The advantage is a substantially easier insertion model, which will allow us to modularly update the database.

The disadvantage may be read performance and size; each database will need a header; need to investigate size, but may be 16 bytes. Also, we will need to deserialize N times for N tracks, although the deserialization will be simpler.

If annotation performance or database size are substantially impacted, or this change causes significantly higher CPU usage during annotation, the tradeoff will likely not be worth it. Currently on master branch build times are 1 day with 3 additional whole-genome tracks (refSeq.gene, nearest.refSeq, nearestTss.refSeq), which cumulatively take ~ 7 hours. We re-run builds no more than once per month.

Improve query builder.

In the web app, move from regex to something like PEG/ohm.

  • Fix Pankaj synonym issue: synonym name should match exactly

  • Prototype Ohm query syntax

Enumerate, and strip common missing values

  • dbSNP: unknown (function)
  • clinvar: not provided, not specified, no assertion criteria provided, no interpretation for the single variant, no assertion for the individual variant, see cases : akotlar@7df8409 , 7883b7f

While these provide some information, barring evidence to the contrary, I think we shouldn't waste space on their storage.
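
Stripping could be as simple as a per-track lookup, sketched below with the values enumerated above. The names `MISSING_VALUES` and `strip_missing` are hypothetical, and the case-insensitive comparison is an assumption.

```python
from typing import Optional

# Values listed in this issue; extend as more are identified.
MISSING_VALUES = {
    "dbSNP": {"unknown"},
    "clinvar": {
        "not provided",
        "not specified",
        "no assertion criteria provided",
        "no interpretation for the single variant",
        "no assertion for the individual variant",
    },
}


def strip_missing(track: str, value: str) -> Optional[str]:
    """Return None for values that carry no information, else the value."""
    if value.strip().lower() in MISSING_VALUES.get(track, set()):
        return None
    return value
```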

Nearest tssName and tssDist

TODO:

  1. Validate that both nearest.refSeq and nearestTss.refSeq are accurate
  2. Decide whether these track names are ok
  3. Decide whether we report all desired data
  4. Document parsing of these fields (since they are de-duplicated in a way that refSeq isn't).

Export to VCF format

Will require using the tab statistics file to get the sample list, and Dave's vcf converter simply tail -n +3 statistics.tsv | cut -f1 > sample_list.txt && seqantToVcf etc.

Would be nice to update Dave's program to use LMDB db.

Note that, as it stands, we will keep multiallelics on separate lines. Could add a facility to recombine multiallelics.

  • Generate sample-list output from bystro-vcf

  • Add support for sample-list in YAML config, Bystro Seq.pm

  • Generate sample-list output from bystro-snp

  • Propagate sample-list during saving from query

  • Add Dave Cutler's converter program

  • Make, use Rust implementation
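
Until the sample list is emitted directly, the `tail -n +3 statistics.tsv | cut -f1` step above has a straightforward equivalent, sketched here. The function name `sample_list` is hypothetical; the layout (two leading lines to skip, sample name in the first tab-delimited column) is taken from that shell command rather than from a documented file format.

```python
from itertools import islice
from typing import Iterable, List


def sample_list(lines: Iterable[str]) -> List[str]:
    """Skip the first two lines, then keep the first tab-delimited column."""
    return [line.rstrip("\n").split("\t")[0] for line in islice(lines, 2, None)]


# Usage with a file (path is an example):
# with open("statistics.tsv") as handle:
#     samples = sample_list(handle)
```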

Add chrPerFile support

This is a low-priority update. Its only benefit is to allow faster skipping of previously-built chromosomes.

Something along the lines of

sub makeChromCheckFunction {
  # Sketch: assumes $self and $file are available in the enclosing scope,
  # and that the returned closure is called inside a loop labeled FH_LOOP.
  my ($onNew, $onExit) = @_;

  return sub {
    my ($currentChr, $newChr) = @_;

    if( ($currentChr && $currentChr ne $newChr) || !$currentChr ) {
      if($self->chrPerFile) {
        # Only a change from one known chromosome to another is an error;
        # the first chromosome seen ($currentChr undefined) is expected.
        if($currentChr && $currentChr ne $newChr) {
          # if the user guarantees that they have one chromosome per file, this is a fatal error
          $self->log('fatal', $self->name . ": Expected one chromosome in $file, found at least 2.");
        }

        if(!$self->chrIsWanted($newChr)) {
          $self->log('warn', $self->name . ": $newChr unwanted, and chrPerFile flag set; exiting file");
          last FH_LOOP;
        }

        if(!$self->completionMeta->okToBuild($newChr)) {
          $self->log('warn', $self->name . ": $newChr wanted, but completed, and chrPerFile flag set; exiting file");
          last FH_LOOP;
        }

        $onNew->($currentChr, $newChr);

        return $newChr;
      }

      # Without chrPerFile, return undef to signal that this chromosome should be skipped
      return $self->chrIsWanted($newChr) && $self->completionMeta->okToBuild($newChr) ? $newChr : undef;
    }

    return $currentChr;
  };
}

Depth of coverage

Alex,

Is there any proposal to also include the depth of coverage statistics in the summary output?

thanks!!
