exomiser's Introduction

The Exomiser - A Tool to Annotate and Prioritize Exome Variants

Overview:

The Exomiser is a Java program that finds potential disease-causing variants from whole-exome or whole-genome sequencing data.

Starting from a VCF file and a set of phenotypes encoded using the Human Phenotype Ontology (HPO), it will annotate, filter and prioritise likely causative variants. The program does this based on user-defined criteria such as a variant's predicted pathogenicity, its frequency of occurrence in a population, and how closely the given phenotype matches the known phenotype of disease genes from human and model organism data.

The functional annotation of variants is handled by Jannovar and uses any of UCSC KnownGene, RefSeq or Ensembl transcript definitions and hg19 or hg38 genomic coordinates.

Variants are prioritised according to user-defined criteria on variant frequency, pathogenicity, quality, inheritance pattern, and model organism phenotype data. Predicted pathogenicity data is extracted from the dbNSFP resource. Variant frequency data is taken from the 1000 Genomes, ESP, TOPMed, UK10K, ExAC and gnomAD datasets. Subsets of these frequency and pathogenicity data can be defined to further tune the analysis. Cross-species phenotype comparisons come from our PhenoDigm tool powered by the OWLTools OWLSim algorithm.

The Exomiser was developed by the Computational Biology and Bioinformatics group at the Institute for Medical Genetics and Human Genetics of the Charité - Universitätsmedizin Berlin, the Mouse Informatics Group at the Sanger Institute and other members of the Monarch initiative.

Download and Installation

The prebuilt Exomiser binaries can be obtained from the releases page and supporting data files can be downloaded from the Exomiser FTP site.

It is possible to use the same data sources for multiple versions, in order to avoid having to download the data files for each software point release. We recommend maintaining a dedicated exomiser data directory where you can extract versions of the hg19, hg38 and phenotype data. To do this, edit the exomiser.data-directory field in the application.properties file to point to the dedicated data directory. The version for the data releases should also be specified in the application.properties file:

For example, suppose you have an exomiser installation located at /opt/exomiser-cli-11.0.0 and have extracted the data files to the directory /opt/exomiser-data. When there is a new data release, you can change the data versions by editing /opt/exomiser-cli-11.0.0/application.properties from

# root path where data is to be downloaded and worked on
# it is assumed that all the files required by exomiser listed in this properties file
# will be found in the data directory unless specifically overridden here.
exomiser.data-directory=data

# old data versions
exomiser.hg19.data-version=1802
...
exomiser.hg38.data-version=1802
...
exomiser.phenotype.data-version=1802

to

# overridden data-directory containing multiple data versions
exomiser.data-directory=/opt/exomiser-data

# updated data versions
exomiser.hg19.data-version=1805
...
exomiser.hg38.data-version=1805
...
exomiser.phenotype.data-version=1807
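
After editing application.properties, the configured values can be read back as a quick sanity check. The following is a minimal sketch using only the standard java.util.Properties API. It is not part of the Exomiser: the class DataVersionCheck and its describe method are hypothetical, and only the property keys are taken from the example above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not an Exomiser class.
final class DataVersionCheck {

    // Summarises the data directory and data versions found in the given
    // properties text. Keys that are absent are reported as "unset".
    static String describe(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return String.format("dir=%s hg19=%s hg38=%s phenotype=%s",
                props.getProperty("exomiser.data-directory", "unset"),
                props.getProperty("exomiser.hg19.data-version", "unset"),
                props.getProperty("exomiser.hg38.data-version", "unset"),
                props.getProperty("exomiser.phenotype.data-version", "unset"));
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "exomiser.data-directory=/opt/exomiser-data",
                "exomiser.hg19.data-version=1805",
                "exomiser.hg38.data-version=1805",
                "exomiser.phenotype.data-version=1807");
        // prints: dir=/opt/exomiser-data hg19=1805 hg38=1805 phenotype=1807
        System.out.println(describe(text));
    }
}
```

In practice the text would come from the application.properties file itself, for example via Files.readString on Java 11 or later.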

We strongly recommend using the latest versions of both the application and the data for optimum results.

For further instructions on installing and running please refer to the README.md file.

Running it

Please refer to the manual for details on how to configure and run the Exomiser.

Demo site

There is a limited demo version of the exomiser hosted by the Monarch Initiative. This instance is for teaching purposes only and is limited to small exome analysis.

Using The Exomiser in your code

The exomiser can also be used as a library in Spring Java applications. Add the exomiser-spring-boot-starter library to your pom/gradle build script.

In your configuration class add the @EnableExomiser annotation

@EnableExomiser
public class MainConfig {
   
}

Or, if you are using Spring Boot for your application, the exomiser will be autoconfigured if it is on your classpath.

@SpringBootApplication
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}

In your application, use the AnalysisBuilder obtained from the Exomiser instance to configure your analysis, then run the Analysis using the Exomiser class. Creation of the Exomiser is a complicated process, so defer this to Spring and the exomiser-spring-boot-starter. Calling the add-prefixed methods adds the corresponding analysis steps to the analysis in the order in which they are called in your code.

Example usage:

// Field injection cannot be combined with 'final': an uninitialised final field does not compile.
@Autowired
private Exomiser exomiser;

...
           
    Analysis analysis = exomiser.getAnalysisBuilder()
                .genomeAssembly(GenomeAssembly.HG19)
                .vcfPath(vcfPath)
                .pedPath(pedPath)
                .probandSampleName(probandSampleId)
                .hpoIds(phenotypes)
                .analysisMode(AnalysisMode.PASS_ONLY)
                .modeOfInheritance(EnumSet.of(ModeOfInheritance.AUTOSOMAL_DOMINANT, ModeOfInheritance.AUTOSOMAL_RECESSIVE))
                .frequencySources(FrequencySource.ALL_EXTERNAL_FREQ_SOURCES)
                .pathogenicitySources(EnumSet.of(PathogenicitySource.POLYPHEN, PathogenicitySource.MUTATION_TASTER, PathogenicitySource.SIFT))
                .addPhivePrioritiser()
                .addPriorityScoreFilter(PriorityType.PHIVE_PRIORITY, 0.501f)
                .addQualityFilter(500.0)
                .addRegulatoryFeatureFilter()
                .addFrequencyFilter(0.01f)
                .addPathogenicityFilter(true)
                .addInheritanceFilter()
                .addOmimPrioritiser()
                .build();
                
    AnalysisResults analysisResults = exomiser.run(analysis);

Memory usage

Analysing whole genomes using AnalysisMode.FULL will use a lot of RAM (~16GB for 4.5 million variants without any extra variant data being loaded), and the standard Java GC will fail to cope well with this. Using the G1GC should solve this issue, e.g. add -XX:+UseG1GC to your java -jar -Xmx... incantation.
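
To confirm which collector and heap limit a JVM was actually started with, the standard java.lang.management API can be queried. This is a small, hedged sketch and not part of the Exomiser; the class name GcReport is hypothetical. Collector names vary by JVM, but with -XX:+UseG1GC on HotSpot they typically start with "G1".

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not an Exomiser class.
final class GcReport {

    // Names of the garbage collectors active in the running JVM.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Collectors: " + collectorNames());
        // maxMemory() reflects the -Xmx setting (or the JVM default).
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap (MB): " + maxHeapMb);
    }
}
```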

Caching

Since 9.0.0 caching uses the standard Spring mechanisms.

To enable and configure caching in your Spring application, use the @EnableCaching annotation on a @Configuration class, include the required cache implementation jar and add the specific properties to the application.properties.

For example, to use Caffeine just add the dependency to your pom:

<dependency>
    <groupId>com.github.ben-manes.caffeine</groupId>
    <artifactId>caffeine</artifactId>
</dependency>

and these lines to the application.properties:

spring.cache.type=caffeine
spring.cache.caffeine.spec=maximumSize=300000

Recognition

The Exomiser is proud to be recognised by the International Rare Diseases Research Consortium (IRDiRC) as an IRDiRC Recognized Resource. This is 'a quality indicator, based on a specific set of criteria, that was created to highlight key resources which, if used more broadly, would accelerate the pace of translating discoveries into clinical applications.' These resources 'must be of fundamental importance to the international rare diseases research and development community'.

exomiser's People

Contributors

buske, cmungall, damiansm, dependabot-support, drseb, holtgrewe, iquxle, julesjacobsen, visze

exomiser's Issues

Incorporate Jannovar v0.11 + into Exomiser

This will likely be pretty invasive so needs to ideally be in version 6.
I'll work with Manuel on this.

Changes will impact on anything involving Variant and VariantEvaluation. (core.filters, core.factories)

Running cli jar without any arguments should display help

As per the protocols manuscript, running the jar without any arguments should display the cli help (currently only displayed if -help or --help is provided):

To test whether the installation was successful, run the command
$ java -jar exomiser-cli-5.0.1.jar
If the installation was successful, you will see a help message.
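
The requested behaviour amounts to treating an empty argument list the same as an explicit help flag. A minimal sketch, assuming nothing about the actual CLI code: the class HelpOnNoArgs and the method shouldShowHelp are hypothetical, and only the -help and --help flags come from the issue text.

```java
// Hypothetical sketch, not the Exomiser CLI implementation.
final class HelpOnNoArgs {

    // True when help should be displayed: either no arguments were given,
    // or one of the help flags mentioned in the issue is present.
    static boolean shouldShowHelp(String[] args) {
        if (args.length == 0) {
            return true;
        }
        for (String arg : args) {
            if (arg.equals("-help") || arg.equals("--help")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        if (shouldShowHelp(args)) {
            // Placeholder: the real CLI would print its option descriptions here.
            System.out.println("(help text would be printed here)");
            return;
        }
        System.out.println("Running analysis with " + args.length + " argument(s)");
    }
}
```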

Make h2 access read-only to prevent db locking

It seems that, by default, the jdbc connection will update the h2 database during normal Exomiser runs (presumably just last-accessed fields and whatnot). This causes problems when attempting to access the h2 database concurrently, since this then requires table locking, and ultimately can result in the database entering an inconsistent state with uncommitted changes if an Exomiser run crashes. Once that happens Exomiser can no longer run until you manually go in and drop uncommitted changes in the db.

There is an easy fix that has worked for us. Adding the following to jdbc.properties:
ACCESS_MODE_DATA=r

Add option to filter VCF output to PASS entries

This could simplify the handling of, for example, whole-genome VCF files. We would use this feature in PhenomeCentral. As is, we just have to run the VCF files through a separate AWK step.
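
For reference, the separate filtering step can be sketched in plain Java. This is an illustration of the requested behaviour, not Exomiser code: the class PassOnlyVcfFilter is hypothetical, and it assumes tab-delimited VCF records with FILTER as the seventh column, as in the VCF specification.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch, not part of the Exomiser.
final class PassOnlyVcfFilter {

    // Keeps header lines (starting with '#') and records whose FILTER
    // column (seventh tab-delimited field) is exactly "PASS".
    static boolean keep(String line) {
        if (line.startsWith("#")) {
            return true;
        }
        String[] fields = line.split("\t", 8);
        return fields.length > 6 && fields[6].equals("PASS");
    }

    static List<String> filter(List<String> lines) {
        return lines.stream().filter(PassOnlyVcfFilter::keep).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "##fileformat=VCFv4.1",
                "chr1\t1\t.\tA\tT\t2.2\tPASS\t.",
                "chr1\t2\t.\tT\tC\t2.2\tq10\t.");
        // Prints the header line and the PASS record only.
        filter(lines).forEach(System.out::println);
    }
}
```

For whole-genome files the same predicate could be applied to a lazily read stream of lines (for example from Files.lines) so that the file is never held in memory at once.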

Enable filtering of whole-genome VCF files

This can't be done at present, as the entire VCF file is read into memory, converted to Variant objects and annotated using Jannovar. Once this is done, the variants are collected into their relevant genes and then filtered.

The VCF parsing, annotation and filtering need to be streamed into a whole-exon set first, then we can continue on our merry way without requiring tens of gigs of RAM and hours of time.

Update intro page

Change the first sentence of the second paragraph to the text below, and also combine the 1st and 2nd paragraphs so that everything is in bold, as all of it is equally important:

"Variants are prioritized according to user-defined criteria on variant frequency, pathogenicity, quality, inheritance pattern, phenotype data from human and model organisms, and proximity in the interactome to phenotypically similar genes"

Fail to run --prioritiser phenix

I just pulled the newest development branch and packaged it with Maven. I also used the current data from the FTP site.

My command:

java -Xms5g -Xmx5g -jar exomiser-cli-6.0.0.jar \
--prioritiser=exomiser-allspecies -I AR -F 1 -D 607060 \
-v testVCF.vcf -o results/testresult \
--out-format=HTML \
--prioritiser phenix \
-p testPED.ped

The error is:

/home/mschubach/Exomiser/files/exomiser-cli-6.0.0/data/phenix/0.out (No such file or directory)
    at de.charite.compbio.exomiser.core.prioritisers.util.ScoreDistributionContainer.parseDistributions(ScoreDistributionContainer.java:175)
    at de.charite.compbio.exomiser.core.prioritisers.PhenixPriority.<init>(PhenixPriority.java:153)
    at de.charite.compbio.exomiser.core.prioritisers.PriorityFactory.getPhenixPrioritiser(PriorityFactory.java:82)
    at de.charite.compbio.exomiser.core.prioritisers.PriorityFactory.makePrioritisers(PriorityFactory.java:54)
    at de.charite.compbio.exomiser.core.Exomiser.analyse(Exomiser.java:72)
    at de.charite.compbio.exomiser.cli.Main.runAnalysis(Main.java:130)
    at de.charite.compbio.exomiser.cli.Main.main(Main.java:62)

The path is correctly set. /home/mschubach/Exomiser/files/exomiser-cli-6.0.0/data/phenix/1.out exists but not /home/mschubach/Exomiser/files/exomiser-cli-6.0.0/data/phenix/0.out. 0.out is not present in the download file.

TSV-writer appears to substitute any 'vcf' in file path

It appears that anywhere the string vcf appears in the path of the output file, it is substituted with genes.tsv, even if it doesn't appear at the end (note that the vcf directory is changed to a genes.tsv directory which doesn't exist):

2015-02-09 02:24:01,429 INFO  de.charite.compbio.exomiser.core.writers.VcfResultsWriter [main] - VCF results written to file /dupa-filer/buske/phenomecentral/geno/vcf/F0000009/F0000009.vcf.
2015-02-09 02:24:01,431 ERROR de.charite.compbio.exomiser.core.writers.TsvGeneResultsWriter [main] - Unable to write results to file /dupa-filer/buske/phenomecentral/geno/genes.tsv/F0000009/F0000009.genes.tsv.
java.nio.file.NoSuchFileException: /dupa-filer/buske/phenomecentral/geno/genes.tsv/F0000009/F0000009.genes.tsv
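
The log above is consistent with every occurrence of "vcf" in the path being replaced, although the source does not confirm that this is the cause. The sketch below reproduces that behaviour with String.replace and contrasts it with swapping only a trailing .vcf extension. The class OutputFileNames and both method names are hypothetical.

```java
// Hypothetical sketch, not the TsvGeneResultsWriter code.
final class OutputFileNames {

    // Replaces every "vcf" in the path, including directory names (the suspected bug).
    static String naiveReplace(String vcfPath) {
        return vcfPath.replace("vcf", "genes.tsv");
    }

    // Replaces only a trailing ".vcf" extension; other paths just get the suffix appended.
    static String replaceExtension(String vcfPath) {
        String extension = ".vcf";
        String base = vcfPath.endsWith(extension)
                ? vcfPath.substring(0, vcfPath.length() - extension.length())
                : vcfPath;
        return base + ".genes.tsv";
    }

    public static void main(String[] args) {
        String path = "/dupa-filer/buske/phenomecentral/geno/vcf/F0000009/F0000009.vcf";
        // prints: /dupa-filer/buske/phenomecentral/geno/genes.tsv/F0000009/F0000009.genes.tsv
        System.out.println(naiveReplace(path));
        // prints: /dupa-filer/buske/phenomecentral/geno/vcf/F0000009/F0000009.genes.tsv
        System.out.println(replaceExtension(path));
    }
}
```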

Add licence and file headers for Maven Central hosting

Maven (naturally) has a plugin for this. Investigate and then follow through with what's needed to add/update the licence headers in the source code:

http://mojo.codehaus.org/license-maven-plugin

Also, in preparation for fully-opening up the codebase it would be good to add the minimum headers as required by maven Central Repository:
http://central.sonatype.org/pages/requirements.html

Then we can build on TravisCI and publish builds to maven central making it trivially easy for Java developers to use Exomiser.

Clean-up package structure

The package structure in exomiser-core is still a bit out of kilter. I want to move everything under the core package so there can be no possibility of namespace clashes with classes from other exomiser jar files.

This will necessitate a full version number increment as it will break any existing code relying on exomiser-core.

The structure should be so:

  • core
    • Exomiser.java
    • ExomiserSettings.java
    • dao
    • factories
    • filter(s) (rename?)
    • frequency (move into model?)
    • model
    • pathogenicity (move into model?)
    • util (might not be needed)
    • writer(s) (rename?)
    • io (might not be needed)
    • priority (rename to prioritisers)
      • exomewalker
      • inheritance
      • omim
      • ...

remove-off-target-syn option description is confusing

Does specifying this option remove off-target variants, or keep them? And is the default to keep them or remove them? It isn't quite clear.

 -T,--remove-off-target-syn             Keep off-target variants. These
                                        are defined as intergenic,
                                        intronic, upstream, downstream,
                                        synonymous or intronic ncRNA
                                        variants. Default: true

Prioritiser options

  • Make exomiser v2 the default
  • Change Exomiser v1 to Exomiser (mouse only)
  • Change Exomiser v2 to Exomiser (all species)
  • Do we want to offer Phenix (human only) as well
  • Do we want to offer ExomeWalker

Reformat files to use unix newlines

I would suggest having the source base use unix newlines instead of windows newlines. Having the windows newlines makes it harder to work with github, if nothing else. For example, if you try to edit a file (e.g. exomiser-cli/src/main/resources/jdbc.properties) within github, you'll see that the diff is the entire file because the ^M characters at the end of every line get automatically stripped.

TAB delimited output format (tsv) for variants

You wrote in the exomiser draft protocol that there is a TAB delimited file format. Right now one exists only for genes. I think that for people using pipelines it would be great to have a TSV file with the variants and all the annotations (annotations are still missing in the VCF file).

I can start implementing this feature if it is OK with you.

Behaviour of a Filterable object

Since issue #2, the VcfWriter output is more compatible with the actual VCF spec. In particular, if a variant has not been filtered, the FILTER column should be empty for that variant:

##fileformat=VCFv4.1
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  GENOTYPE
chr3    4   .   G   C   2.2     ;VARIANT NOT ANALYSED - NO GENE ANNOTATIONS GT  0/1
chr1    1   .   A   T   2.2 PASS    ;EXOMISER_GENE=ABC1;EXOMISER_VARIANT_SCORE=1.0;EXOMISER_GENE_PHENO_SCORE=0.0;EXOMISER_GENE_VARIANT_SCORE=0.0;EXOMISER_GENE_COMBINED_SCORE=0.0   GT  0/1
chr1    2   .   T   -   2.2 Target  ;EXOMISER_GENE=ABC1;EXOMISER_VARIANT_SCORE=0.0;EXOMISER_GENE_PHENO_SCORE=0.0;EXOMISER_GENE_VARIANT_SCORE=0.0;EXOMISER_GENE_COMBINED_SCORE=0.0   GT  0/1
chr2    3   .   C   T   2.2 Frequency;Target    ;EXOMISER_GENE=CDE2;EXOMISER_VARIANT_SCORE=0.0;EXOMISER_GENE_PHENO_SCORE=0.0;EXOMISER_GENE_VARIANT_SCORE=0.0;EXOMISER_GENE_COMBINED_SCORE=0.0   GT  0/1
chr3    5   .   G   C   2.2     ;EXOMISER_GENE=CDE2;EXOMISER_VARIANT_SCORE=1.0;EXOMISER_GENE_PHENO_SCORE=0.0;EXOMISER_GENE_VARIANT_SCORE=0.0;EXOMISER_GENE_COMBINED_SCORE=0.0   GT  0/1

However this has raised a question about the Filterable interface which I'd like some feedback on as it impacts a fundamental behaviour. Currently a Filterable either passes or fails, but really it can exist in one of three states - passed, failed and unfiltered/not yet filtered. Given this:

  1. Should a Filterable (VariantEvaluation or Gene) report true or false to passedFilters if no filters have been applied? (I think this should really be false. However, this is a reversal of the current behaviour, which is always true until a filter has been failed, and passedFilters is quite well used; a more accurate name would be hasNotFailedFilters.)
  2. Should a Filterable be able to report its actual state - PASSED/FAILED/UNFILTERED? (I vote yes as it is quite explicit and prevents you from having to combine passedFilters and a newly added isUnFiltered Booleans in order to infer a missing failedFilters or add the failedFilters Boolean too)
  3. Given the first two points is there any point in having any other methods in the Filterable interface other than passedFilter(FilterType) and getFilterStatus?

It would be possible to keep the existing behaviour of passedFilters and add getFilterStatus for ease of use and backwards compatibility at the expense of some potential confusion. But this all depends on what people are using - will any of these changes actually have any direct impact on you? If not, I suggest clean logic should be applied, as this tends to keep things simpler and simple is good.
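
To make the three-state proposal concrete, here is a minimal sketch. It is an illustration for discussion only, not the existing Filterable interface: the FilterableVariant class, the addResult method and the two FilterType values are hypothetical, while the PASSED/FAILED/UNFILTERED states and the passedFilter and getFilterStatus names come from the questions above.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of a three-state Filterable, not Exomiser code.
final class FilterStatusSketch {

    enum FilterStatus { UNFILTERED, PASSED, FAILED }

    // Placeholder filter types for the example.
    enum FilterType { FREQUENCY, TARGET }

    static final class FilterableVariant {
        private final Set<FilterType> passed = EnumSet.noneOf(FilterType.class);
        private final Set<FilterType> failed = EnumSet.noneOf(FilterType.class);

        // Records the outcome of running one filter.
        void addResult(FilterType type, boolean pass) {
            (pass ? passed : failed).add(type);
        }

        boolean passedFilter(FilterType type) {
            return passed.contains(type);
        }

        // FAILED if any filter failed, PASSED if at least one filter ran
        // and none failed, UNFILTERED if no filter has been applied yet.
        FilterStatus getFilterStatus() {
            if (!failed.isEmpty()) {
                return FilterStatus.FAILED;
            }
            return passed.isEmpty() ? FilterStatus.UNFILTERED : FilterStatus.PASSED;
        }
    }

    public static void main(String[] args) {
        FilterableVariant variant = new FilterableVariant();
        System.out.println(variant.getFilterStatus()); // prints: UNFILTERED
        variant.addResult(FilterType.FREQUENCY, true);
        System.out.println(variant.getFilterStatus()); // prints: PASSED
        variant.addResult(FilterType.TARGET, false);
        System.out.println(variant.getFilterStatus()); // prints: FAILED
    }
}
```

With an explicit status, callers no longer need to infer the unfiltered state from a combination of Booleans.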

Add more informative error message to PhenIX when no HPO terms have been supplied

Child of issue #37
Related to issues #47 and #48

The error message is really confusing. It gives the impression that some files are missing. Maybe we should add a better error. Like: Please insert HPO terms for method phenix.

Or convert the OMIM-ID to HPO-terms.

/home/mschubach/Exomiser/files/exomiser-cli-6.0.0/data/phenix/0.out (No such file or directory)
    at de.charite.compbio.exomiser.core.prioritisers.util.ScoreDistributionContainer.parseDistributions(ScoreDistributionContainer.java:175)
    at de.charite.compbio.exomiser.core.prioritisers.PhenixPriority.<init>(PhenixPriority.java:153)
    at de.charite.compbio.exomiser.core.prioritisers.PriorityFactory.getPhenixPrioritiser(PriorityFactory.java:82)
    at de.charite.compbio.exomiser.core.prioritisers.PriorityFactory.makePrioritisers(PriorityFactory.java:54)
    at de.charite.compbio.exomiser.core.Exomiser.analyse(Exomiser.java:72)
    at de.charite.compbio.exomiser.cli.Main.runAnalysis(Main.java:130)
    at de.charite.compbio.exomiser.cli.Main.main(Main.java:62)
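
One way to produce the clearer message is to validate the input before any PhenIX data files are touched. The sketch below is hypothetical and does not reflect the actual PhenixPriority code: the class PhenixInputCheck, the method requireHpoTerms and the exact wording of the message are assumptions, and HP:0001156 is used only as an example term.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not the PhenixPriority implementation.
final class PhenixInputCheck {

    // Fails fast with an explicit message when no HPO terms were supplied.
    static void requireHpoTerms(List<String> hpoIds) {
        if (hpoIds == null || hpoIds.isEmpty()) {
            throw new IllegalArgumentException(
                    "Please supply HPO terms for the phenix prioritiser.");
        }
    }

    public static void main(String[] args) {
        requireHpoTerms(Arrays.asList("HP:0001156")); // passes silently
        try {
            requireHpoTerms(Collections.emptyList());
        } catch (IllegalArgumentException e) {
            // prints: Please supply HPO terms for the phenix prioritiser.
            System.out.println(e.getMessage());
        }
    }
}
```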

Collapsible pheno-evidence sections

Results get a bit crazy looking when you have a lot of input HPO terms e.g. the classic Pfeiffer example.

Can we make each evidence section collapsible or only show the one with the best score i.e. contributing to the combined Exomiser score. Prefer the former as we may combine all scores eventually

Filter for genes update

Allow pasting in of large list of genes or upload of a file e.g. the DDD list of developmental disorder genes

Can the io.html package be removed now?

This package has now been rendered redundant for the core exomiser-cli functionality as the latest changes in release 5.2.0 are using Thymeleaf to render the HTML and the HtmlWriter simply hands the context with the data it needs to the rendering engine which fills in the resources/html/templates/results.html with this data.

@pnrobinson - Do you still need it for Panel, CRE and Walker? If not, then this package has served its purpose and it's time to retire it from the codebase. Let me know and I'll take care of it.

Automatic integration tests

These are really needed to catch issues arising from changes in underlying dependencies such as Jannovar and HTSJDK, where a subtle change to a variant can cause drastic changes to the outcome of an analysis.

Add report feature to flag up variants which Jannovar fails to annotate

Sometimes Jannovar is unable to annotate a variant and throws an exception. This is caught by Exomiser, but the variant is not included in the analysis or results which could lead to incorrect results.

There are two ways this could be handled:

  1. These variants could be flagged and indicated to the user so that they are aware of the issue.
  2. Exomiser should simply stop the analysis and report the reason for the failure.

Votes on behaviour please....

Handling multiple alternative alleles

I discussed this earlier with @pnrobinson and he proposed/we agreed on the following (if I remember correctly). I'm adding this issue so we are all on the same page here and for some further sanity checking.

Consider the case of having multiple alternative alleles with REF=A, ALT=T,G. In our individuals, we see the following:

A/A A/T T/G

Exomiser will interpret this as two variants:

0/0 0/1 0/1  ==  wt  het het
0/0 0/0 0/1  ==  wt  wt  het
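
The interpretation above can be written down as a small function: for each ALT allele, count how many of a sample's alleles match it and emit a biallelic genotype. This is a sketch of the agreed rule as stated here, not the Exomiser implementation; the class MultiAllelicSplitter and the method genotypesFor are hypothetical, and genotypes are given as allele bases for readability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the splitting rule, not Exomiser code.
final class MultiAllelicSplitter {

    // Rewrites each sample genotype (e.g. "T/G") relative to one ALT allele:
    // alleles equal to that ALT become 1, all others become 0, zeros first.
    static List<String> genotypesFor(String alt, List<String> sampleGenotypes) {
        List<String> result = new ArrayList<>();
        for (String genotype : sampleGenotypes) {
            String[] alleles = genotype.split("/");
            int altCount = 0;
            for (String allele : alleles) {
                if (allele.equals(alt)) {
                    altCount++;
                }
            }
            String[] calls = new String[alleles.length];
            for (int i = 0; i < calls.length; i++) {
                calls[i] = i < calls.length - altCount ? "0" : "1";
            }
            result.add(String.join("/", calls));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("A/A", "A/T", "T/G");
        System.out.println(genotypesFor("T", samples)); // prints: [0/0, 0/1, 0/1]
        System.out.println(genotypesFor("G", samples)); // prints: [0/0, 0/0, 0/1]
    }
}
```

Note that under this rule the third individual is reported as heterozygous for both variants, which is exactly the behaviour described above.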

Fix dbSNP parsing

dbSNP seems to have changed its format recently such that alternate alleles appear in the same row with allele frequencies reported for the ref and these alts in order e.g.

9 140777306 rs4422842 C G,T,A . . RS=4422842;RSPOS=140777306;RV;dbSNPBuildID=111;SSR=0;SAO=0;VP=0x050128000a0514012e000100;WGT=1;VC=SNV;PM;PMC;SLO;NSM;REF;ASP;VLD;GNO;KGPhase3;CAF=0.846,.,.,0.154;COMMON=1

This needs to be processed to result in our frequency table
9 140777306 rs4422842 C G .
9 140777306 rs4422842 C T .
9 140777306 rs4422842 C A 0.154
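
The required processing can be sketched as follows: split the ALT column, read the CAF entry from the INFO field, skip its first value (the reference allele frequency) and pair the remaining values with the ALT alleles in order. This is a minimal sketch based only on the example row above, not the Exomiser's parser; the class DbSnpCafSplitter and the method split are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Exomiser dbSNP parser.
final class DbSnpCafSplitter {

    // Turns one multi-allelic dbSNP row into one "CHROM POS ID REF ALT FREQ"
    // row per ALT allele. A missing frequency is written as ".".
    static List<String> split(String vcfLine) {
        String[] fields = vcfLine.trim().split("\\s+");
        String[] alts = fields[4].split(",");
        String[] caf = new String[0];
        for (String entry : fields[7].split(";")) {
            if (entry.startsWith("CAF=")) {
                caf = entry.substring("CAF=".length()).split(",");
            }
        }
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < alts.length; i++) {
            // caf[0] is the reference allele, so ALT i maps to caf[i + 1].
            String freq = i + 1 < caf.length ? caf[i + 1] : ".";
            rows.add(String.join(" ", fields[0], fields[1], fields[2], fields[3], alts[i], freq));
        }
        return rows;
    }

    public static void main(String[] args) {
        String line = "9 140777306 rs4422842 C G,T,A . . "
                + "RS=4422842;VC=SNV;CAF=0.846,.,.,0.154;COMMON=1";
        // Prints three rows ending in ".", "." and "0.154" respectively.
        split(line).forEach(System.out::println);
    }
}
```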

Exomiser parsing of indels is incorrect?

It seems that indels are not being parsed correctly, causing the AF and dbSNP lookups to fail.

For example, the VCF file contains:
chr11 61165731 . C CA

This results in the following annotation in the exomiser output: chr11:g.61165731->A, which is incorrect. The position should be g.61165732.

The output reports no frequency data, but this is actually rs11382548 with MAF 14%

Not sure how common this is? Can anyone confirm?
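
The expected position follows from trimming the base shared by REF and ALT and shifting the position accordingly. The sketch below illustrates that arithmetic using the notation of the output quoted above. It is not the Exomiser or Jannovar code, the class IndelNormaliser and the method describe are hypothetical, and it deliberately ignores other normalisation steps such as trailing-base trimming or left-alignment.

```java
// Hypothetical sketch of leading-base trimming, not Exomiser or Jannovar code.
final class IndelNormaliser {

    // Removes the prefix shared by REF and ALT, advances the position by the
    // number of trimmed bases, and writes an empty allele as "-".
    static String describe(String chrom, int pos, String ref, String alt) {
        int shared = 0;
        while (shared < ref.length() && shared < alt.length()
                && ref.charAt(shared) == alt.charAt(shared)) {
            shared++;
        }
        String trimmedRef = ref.substring(shared);
        String trimmedAlt = alt.substring(shared);
        return chrom + ":g." + (pos + shared)
                + (trimmedRef.isEmpty() ? "-" : trimmedRef)
                + ">"
                + (trimmedAlt.isEmpty() ? "-" : trimmedAlt);
    }

    public static void main(String[] args) {
        // prints: chr11:g.61165732->A
        System.out.println(describe("chr11", 61165731, "C", "CA"));
    }
}
```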
