exomiser / exomiser

A Tool to Annotate and Prioritize Exome Variants

Home Page: https://exomiser.readthedocs.io

License: GNU Affero General Public License v3.0

Java 94.74% HTML 5.26% Shell 0.01%

variants analysis genomics exome phenotypes monarchinitiative

exomiser's Introduction

The Exomiser - A Tool to Annotate and Prioritize Exome Variants

Overview:

The Exomiser is a Java program that finds potential disease-causing variants from whole-exome or whole-genome sequencing data.

Starting from a VCF file and a set of phenotypes encoded using the Human Phenotype Ontology (HPO), it will annotate, filter and prioritise likely causative variants. The program does this based on user-defined criteria such as a variant's predicted pathogenicity, its frequency of occurrence in a population, and how closely the given phenotype matches the known phenotype of disease genes from human and model organism data.

The functional annotation of variants is handled by Jannovar and uses any of UCSC KnownGene, RefSeq or Ensembl transcript definitions and hg19 or hg38 genomic coordinates.

Variants are prioritised according to user-defined criteria on variant frequency, pathogenicity, quality, inheritance pattern, and model organism phenotype data. Predicted pathogenicity data is extracted from the dbNSFP resource. Variant frequency data is taken from the 1000 Genomes, ESP, TOPMed, UK10K, ExAC and gnomAD datasets. Subsets of these frequency and pathogenicity data can be defined to further tune the analysis. Cross-species phenotype comparisons come from our PhenoDigm tool powered by the OWLTools OWLSim algorithm.

The Exomiser was developed by the Computational Biology and Bioinformatics group at the Institute for Medical Genetics and Human Genetics of the Charité - Universitätsmedizin Berlin, the Mouse Informatics Group at the Sanger Institute and other members of the Monarch initiative.

Download and Installation

The prebuilt Exomiser binaries can be obtained from the releases page and supporting data files can be downloaded from the Exomiser FTP site.

It is possible to use the same data sources for multiple versions, in order to avoid having to download the data files for each software point release. We recommend maintaining a dedicated exomiser data directory where you can extract versions of the hg19, hg38 and phenotype data. To do this, edit the exomiser.data-directory field in the application.properties file to point to the dedicated data directory. The version for the data releases should also be specified in the application.properties file:

For example, suppose you have an Exomiser installation located at /opt/exomiser-cli-11.0.0 and you have extracted the data files to the directory /opt/exomiser-data. When there is a new data release, you can change the data versions by editing /opt/exomiser-cli-11.0.0/application.properties from

# root path where data is to be downloaded and worked on
# it is assumed that all the files required by exomiser listed in this properties file
# will be found in the data directory unless specifically overridden here.
exomiser.data-directory=data

# old data versions
exomiser.hg19.data-version=1802
...
exomiser.hg38.data-version=1802
...
exomiser.phenotype.data-version=1802

to

# overridden data-directory containing multiple data versions
exomiser.data-directory=/opt/exomiser-data

# updated data versions
exomiser.hg19.data-version=1805
...
exomiser.hg38.data-version=1805
...
exomiser.phenotype.data-version=1807

We strongly recommend using the latest versions of both the application and the data for optimum results.

For further instructions on installing and running please refer to the README.md file.

Running it

Please refer to the manual for details on how to configure and run the Exomiser.

Demo site

There is a limited demo version of the exomiser hosted by the Monarch Initiative. This instance is for teaching purposes only and is limited to small exome analysis.

Using The Exomiser in your code

The exomiser can also be used as a library in Spring Java applications. Add the exomiser-spring-boot-starter library to your pom/gradle build script.

In your configuration class, add the @EnableExomiser annotation:

@EnableExomiser
public class MainConfig {
   
}

Or, if you are using Spring Boot for your application, the Exomiser will be autoconfigured if it is on your classpath.

@SpringBootApplication
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}

In your application use the AnalysisBuilder obtained from the Exomiser instance to configure your analysis. Then run the Analysis using the Exomiser class. Creation of the Exomiser is a complicated process so defer this to Spring and the exomiser-spring-boot-starter. Calling the add prefixed methods will add that analysis step to the analysis in the order that they have been defined in your code.

Example usage:

@Autowired
private Exomiser exomiser; // not final: field injection cannot assign a final field

...
           
    Analysis analysis = exomiser.getAnalysisBuilder()
                .genomeAssembly(GenomeAssembly.HG19)
                .vcfPath(vcfPath)
                .pedPath(pedPath)
                .probandSampleName(probandSampleId)
                .hpoIds(phenotypes)
                .analysisMode(AnalysisMode.PASS_ONLY)
                .modeOfInheritance(EnumSet.of(ModeOfInheritance.AUTOSOMAL_DOMINANT, ModeOfInheritance.AUTOSOMAL_RECESSIVE))
                .frequencySources(FrequencySource.ALL_EXTERNAL_FREQ_SOURCES)
                .pathogenicitySources(EnumSet.of(PathogenicitySource.POLYPHEN, PathogenicitySource.MUTATION_TASTER, PathogenicitySource.SIFT))
                .addPhivePrioritiser()
                .addPriorityScoreFilter(PriorityType.PHIVE_PRIORITY, 0.501f)
                .addQualityFilter(500.0)
                .addRegulatoryFeatureFilter()
                .addFrequencyFilter(0.01f)
                .addPathogenicityFilter(true)
                .addInheritanceFilter()
                .addOmimPrioritiser()
                .build();
                
    AnalysisResults analysisResults = exomiser.run(analysis);

Memory usage

Analysing whole genomes using AnalysisMode.FULL will use a lot of RAM (~16GB for 4.5 million variants without any extra variant data being loaded), and the standard Java GC will fail to cope well with this. Using the G1GC should solve this issue, e.g. add -XX:+UseG1GC to your java -jar -Xmx... incantation.

Caching

Since 9.0.0 caching uses the standard Spring mechanisms.

To enable and configure caching in your Spring application, use the @EnableCaching annotation on a @Configuration class, include the required cache implementation jar and add the specific properties to the application.properties.

For example, to use Caffeine just add the dependency to your pom:

<dependency>
    <groupId>com.github.ben-manes.caffeine</groupId>
    <artifactId>caffeine</artifactId>
</dependency>

and these lines to the application.properties:

spring.cache.type=caffeine
spring.cache.caffeine.spec=maximumSize=300000

Recognition

The Exomiser is proud to be recognised by the International Rare Diseases Research Consortium (IRDiRC) as an IRDiRC Recognized Resource. This is 'a quality indicator, based on a specific set of criteria, that was created to highlight key resources which, if used more broadly, would accelerate the pace of translating discoveries into clinical applications.' These resources 'must be of fundamental importance to the international rare diseases research and development community'.

exomiser's Issues

Incorporate Jannovar v0.11 + into Exomiser

This will likely be pretty invasive so needs to ideally be in version 6.
I'll work with Manuel on this.

Changes will impact on anything involving Variant and VariantEvaluation. (core.filters, core.factories)

Running cli jar without any arguments should display help

As per the protocols manuscript, running the jar without any arguments should display the cli help (currently only displayed if -help or --help is provided):

To test whether the installation was successful, run the command
$java -jar exomiser-cli-5.0.1.jar 
If the installation was successful, you will see a help message.

Vestigial mention of phive-allspecies in cli help

s/phive-allspecies/hiphive/g

-E,--hiphive-params <type>              Comma separated list of optional
                                         parameters for phive-allspecies

Make h2 access read-only to prevent db locking

It seems that, by default, the jdbc connection will update the h2 database during normal Exomiser runs (presumably just last-accessed fields and whatnot). This causes problems when attempting to access the h2 database concurrently, since this then requires table locking, and ultimately can result in the database entering an inconsistent state with uncommitted changes if an Exomiser run crashes. Once that happens Exomiser can no longer run until you manually go in and drop uncommitted changes in the db.

There is an easy fix that has worked for us. Adding the following to jdbc.properties:
ACCESS_MODE_DATA=r
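
As an alternative to a properties entry, H2 also accepts this setting appended to the JDBC URL. The following is a minimal sketch of that idea only: the class name and the database path are hypothetical placeholders, and it does not open a connection or reflect how Exomiser itself assembles its URL.

```java
// Hypothetical helper: builds an H2 JDBC URL that opens the data files read-only.
final class H2ReadOnlyUrl {

    // H2 reads settings appended to the URL as ";NAME=value" pairs.
    static String readOnly(String jdbcUrl) {
        if (jdbcUrl.toUpperCase().contains("ACCESS_MODE_DATA=")) {
            return jdbcUrl; // already configured, leave untouched
        }
        return jdbcUrl + ";ACCESS_MODE_DATA=r";
    }

    public static void main(String[] args) {
        // The path below is a placeholder, not the real Exomiser database location.
        System.out.println(readOnly("jdbc:h2:file:/opt/exomiser-data/exomiser"));
    }
}
```

With the data files opened read-only, concurrent runs should not need to write to the database, which is the behaviour the fix above relies on.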

Defaults vs suggestions for fields

Find the grey colouring of the suggested values a bit confusing. Seems like they will be applied as defaults.

Replacement of pathogenicity scores with a single CADD score

Exomiser v2 - make sure it stays within the memory limits of the dev and production servers

Get running and make sure it stays within the memory limits of the dev and production servers - requires the DataMatrix object to be a singleton. Can we use floats instead of doubles?
Handle displaying of the phenotype evidence. On current site appears in a pop-up

CLI options for --disease-id and --hpo-terms should be mutually exclusive

Spawned from issue #37

--disease-id and --hpo-terms should be mutually exclusive. - Check this doesn't break HiPHIVE.
--disease-id should be converted into a set of HPO terms such that the prioritisers only work off HPO terms as they already do internally. See issue #47.
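
One way to enforce the first point is a check on the raw arguments before any analysis is set up. This is a sketch under stated assumptions: the class and method names are hypothetical, only the long option names from this issue are checked, and it is not how the Exomiser CLI parses its options.

```java
// Hypothetical pre-parse check for the two long options named in this issue.
final class ExclusiveOptions {

    // True if the flag appears as "--flag value" or "--flag=value".
    private static boolean has(String[] args, String flag) {
        for (String arg : args) {
            if (arg.equals(flag) || arg.startsWith(flag + "=")) {
                return true;
            }
        }
        return false;
    }

    // Throws if both options are supplied together.
    static void validate(String[] args) {
        if (has(args, "--disease-id") && has(args, "--hpo-terms")) {
            throw new IllegalArgumentException(
                    "--disease-id and --hpo-terms are mutually exclusive");
        }
    }
}
```

Short option forms would need the same treatment in a real implementation.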

Add a gene list filter to the submit page

Could also add an option to use some pre-canned lists e.g. the DDD list of developmental genes

Fix HTML output from OMIM prioritiser

Currently the output is already pre-marked-up in HTML so that this is visible directly on the output:

<a href="http://www.omim.org/entry/-10">Craniosynostosis</a>

should look like:

Craniosynostosis

Add option to filter VCF output to PASS entries

This could simplify the handling of, for example, whole-genome VCF files. We would use this feature in PhenomeCentral. As is, we just have to run the VCF files through a separate AWK step.
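
The separate AWK step mentioned above could be sketched in Java roughly as follows. The class name is hypothetical, and the sketch assumes tab-delimited records with FILTER as the seventh column, as in the VCF specification. It keeps header lines and drops everything whose FILTER value is not exactly PASS, including unfiltered records marked with ".".

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical PASS-only filter over the lines of a VCF file.
final class PassOnlyVcfFilter {

    // Header lines start with '#'; data lines are kept only if FILTER == PASS.
    static boolean keep(String line) {
        if (line.startsWith("#")) {
            return true;
        }
        String[] fields = line.split("\t", 8);
        return fields.length >= 7 && fields[6].equals("PASS");
    }

    static List<String> filter(Stream<String> lines) {
        return lines.filter(PassOnlyVcfFilter::keep).collect(Collectors.toList());
    }
}
```

Feeding this from java.nio.file.Files.lines(path) reads the file lazily, so the whole VCF does not have to be held in memory before filtering.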

Back to query form button

Do we need this instead of users just using the back button? Maybe it is just me.

Enable filtering of whole-genome VCF files

Can't do this at present, as the entire VCF file is read into memory, converted to Variant objects and annotated using Jannovar. Only once this is done are the variants collected into their relevant genes and then filtered.

The VCF parsing, annotation and filtering needs to be streamed into a whole-exon set first, then we can continue on our merry way without requiring tens of gigs of RAM and hours of time.

Update intro page

Change first sentence of second paragraph to the below and also combine 1st and 2nd paragraph so all in bold as all equally important

"Variants are prioritized according to user-defined criteria on variant frequency, pathogenicity, quality, inheritance pattern, phenotype data from human and model organisms, and proximity in the interactome to phenotypically similar genes"

Fail to run --prioritiser phenix

I just pulled the newest development branch and packaged it with Maven. I also used the current data from the FTP website.

My command:

java -Xms5g -Xmx5g -jar exomiser-cli-6.0.0.jar \
--prioritiser=exomiser-allspecies -I AR -F 1 -D 607060 \
-v testVCF.vcf -o results/testresult \
--out-format=HTML \
--prioritiser phenix \
-p testPED.ped

The error is:

/home/mschubach/Exomiser/files/exomiser-cli-6.0.0/data/phenix/0.out (No such file or directory)
    at de.charite.compbio.exomiser.core.prioritisers.util.ScoreDistributionContainer.parseDistributions(ScoreDistributionContainer.java:175)
    at de.charite.compbio.exomiser.core.prioritisers.PhenixPriority.<init>(PhenixPriority.java:153)
    at de.charite.compbio.exomiser.core.prioritisers.PriorityFactory.getPhenixPrioritiser(PriorityFactory.java:82)
    at de.charite.compbio.exomiser.core.prioritisers.PriorityFactory.makePrioritisers(PriorityFactory.java:54)
    at de.charite.compbio.exomiser.core.Exomiser.analyse(Exomiser.java:72)
    at de.charite.compbio.exomiser.cli.Main.runAnalysis(Main.java:130)
    at de.charite.compbio.exomiser.cli.Main.main(Main.java:62)

The path is correctly set. /home/mschubach/Exomiser/files/exomiser-cli-6.0.0/data/phenix/1.out exists but not /home/mschubach/Exomiser/files/exomiser-cli-6.0.0/data/phenix/0.out. 0.out is not present in the download file.

Better error handling for large files

Put a warning on the form about VCF files needing to be < 60Mb.

Re-work prioritisers to add tests and clean-up API

Make OntologyService class for managing this aspect for Prioritisers

Required by Phive, HiPhive and Phenix too.

Currently this is handled by code-duplication.

Add data input validation from exomiser/submit on server-side

Resolve styling clash with Sanger css

Can we add in our own Sanger header and footer

Exomiser v2 - Show phenotype evidence in results page.

Handle displaying of the phenotype evidence. On current site appears in a pop-up

Guarantee that GenomeChange is always non-null.

Make 'none' type Prioritiser the default in SettingsBuilder

This will make using the Exomiser to do the filtering the default and only run the prioritisation step if explicitly specified.

TSV-writer appears to substitute any 'vcf' in file path

It appears that anywhere the string vcf appears in the path of the output file, it is substituted with genes.tsv, even if it doesn't appear at the end (note that the vcf directory is changed to a genes.tsv directory which doesn't exist):

2015-02-09 02:24:01,429 INFO  de.charite.compbio.exomiser.core.writers.VcfResultsWriter [main] - VCF results written to file /dupa-filer/buske/phenomecentral/geno/vcf/F0000009/F0000009.vcf.
2015-02-09 02:24:01,431 ERROR de.charite.compbio.exomiser.core.writers.TsvGeneResultsWriter [main] - Unable to write results to file /dupa-filer/buske/phenomecentral/geno/genes.tsv/F0000009/F0000009.genes.tsv.
java.nio.file.NoSuchFileException: /dupa-filer/buske/phenomecentral/geno/genes.tsv/F0000009/F0000009.genes.tsv
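
The log above is consistent with a replace-all on the path string, although that cause is an assumption and not confirmed from the writer code. The sketch below, with hypothetical class and method names, contrasts that behaviour with swapping only a trailing extension.

```java
// Hypothetical illustration of the suspected bug and one possible fix.
final class OutputNames {

    // Replaces every occurrence of "vcf", including directory names.
    static String naive(String path) {
        return path.replace("vcf", "genes.tsv");
    }

    // Swaps only a trailing ".<oldExt>"; otherwise appends the new extension.
    static String swapExtension(String path, String oldExt, String newExt) {
        if (path.endsWith("." + oldExt)) {
            return path.substring(0, path.length() - oldExt.length()) + newExt;
        }
        return path + "." + newExt;
    }

    public static void main(String[] args) {
        String vcf = "/dupa-filer/buske/phenomecentral/geno/vcf/F0000009/F0000009.vcf";
        System.out.println(naive(vcf));
        System.out.println(swapExtension(vcf, "vcf", "genes.tsv"));
    }
}
```

The first call reproduces the non-existent genes.tsv directory from the error message, while the second leaves the vcf directory name intact.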

Add licence and file headers for Maven Central hosting

Maven (naturally) has a plugin for this. Investigate and then follow through with what's needed to add/update the licence headers in the source code:

http://mojo.codehaus.org/license-maven-plugin

Also, in preparation for fully-opening up the codebase it would be good to add the minimum headers as required by maven Central Repository:
http://central.sonatype.org/pages/requirements.html

Then we can build on TravisCI and publish builds to maven central making it trivially easy for Java developers to use Exomiser.

Replace hard-coded HTML PriorityScore output with object representation of the data

Start with ExomiserAllSpeciesPriority as this has a huge amount of display logic embedded within the prioritisation logic making the actual algorithm rather hard to see.

Clean-up package structure

The package structure in exomiser-core is still a bit out of kilter. I want to move everything under the core package so there can be no possibility of namespace clashes with classes from other exomiser jar files.

This will necessitate a full version number increment as it will break any existing code relying on exomiser-core.

The structure should be so:

core
- Exomiser.java
- ExomiserSettings.java
- dao
- factories
- filter(s) (rename?)
- frequency (move into model?)
- model
- pathogenicity (move into model?)
- util (might not be needed)
- writer(s) (rename?)
- io (might not be needed)
- priority (rename to prioritisers)
  - exomewalker
  - inheritance
  - omim
  - ...

remove-off-target-syn option description is confusing

Does specifying this option remove off-target variants, or keep them? And is the default to keep them or remove them? It isn't quite clear.

 -T,--remove-off-target-syn             Keep off-target variants. These
                                        are defined as intergenic,
                                        intronic, upstream, downstream,
                                        synonymous or intronic ncRNA
                                        variants. Default: true

Automate build and deploy processes

This is clunky at the moment and relies on a few manual steps to pull everything together and deploy.

This could be achieved by setting up a GO CD server:

http://www.go.cd/download/

need to ask systems for a VM to deploy this to.

Prioritiser options

Make exomiser v2 the default
Change Exomiser v1 to Exomiser (mouse only)
Change Exomiser v2 to Exomiser (all species)
Do we want to offer Phenix (human only) as well
Do we want to offer ExomeWalker

Reformat files to use unix newlines

I would suggest having the source base use unix newlines instead of windows newlines. Having the windows newlines makes it harder to work with github, if nothing else. For example, if you try to edit a file (e.g. exomiser-cli/src/main/resources/jdbc.properties) within github, you'll see that the diff is the entire file because the ^M characters at the end of every line get automatically stripped.

Change results page links to Ensembl

Link to Ensembl for variants and genes - we should make an effort to be more compatible with the rest of the campus!

Enable filtering with no prioritisation

The genes should be scored using just the variant score, i.e., add a gene (phenotype) score of zero to every gene, everything else is the same.

Add ReadTheDocs documentation to GitHub wiki

This is a nice feature in Jannovar - please can you set this up for Exomiser too.

TAB delimited output format (tsv) for variants

You wrote in the Exomiser draft protocol that there is a TAB-delimited file format. Right now one exists only for genes. I think that for people using pipelines it would be great to have a TSV file with the variants and all the annotations (annotations are still missing from the VCF file).

I can start implementing this feature if it is OK with you.

Behaviour of a Filterable object

Since issue #2, the VcfWriter output is more compatible with the actual VCF spec. In particular, if a variant has not been filtered, the FILTER column should be empty for that variant:

##fileformat=VCFv4.1
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  GENOTYPE
chr3    4   .   G   C   2.2     ;VARIANT NOT ANALYSED - NO GENE ANNOTATIONS GT  0/1
chr1    1   .   A   T   2.2 PASS    ;EXOMISER_GENE=ABC1;EXOMISER_VARIANT_SCORE=1.0;EXOMISER_GENE_PHENO_SCORE=0.0;EXOMISER_GENE_VARIANT_SCORE=0.0;EXOMISER_GENE_COMBINED_SCORE=0.0   GT  0/1
chr1    2   .   T   -   2.2 Target  ;EXOMISER_GENE=ABC1;EXOMISER_VARIANT_SCORE=0.0;EXOMISER_GENE_PHENO_SCORE=0.0;EXOMISER_GENE_VARIANT_SCORE=0.0;EXOMISER_GENE_COMBINED_SCORE=0.0   GT  0/1
chr2    3   .   C   T   2.2 Frequency;Target    ;EXOMISER_GENE=CDE2;EXOMISER_VARIANT_SCORE=0.0;EXOMISER_GENE_PHENO_SCORE=0.0;EXOMISER_GENE_VARIANT_SCORE=0.0;EXOMISER_GENE_COMBINED_SCORE=0.0   GT  0/1
chr3    5   .   G   C   2.2     ;EXOMISER_GENE=CDE2;EXOMISER_VARIANT_SCORE=1.0;EXOMISER_GENE_PHENO_SCORE=0.0;EXOMISER_GENE_VARIANT_SCORE=0.0;EXOMISER_GENE_COMBINED_SCORE=0.0   GT  0/1

However this has raised a question about the Filterable interface which I'd like some feedback on as it impacts a fundamental behaviour. Currently a Filterable either passes or fails, but really it can exist in one of three states - passed, failed and unfiltered/not yet filtered. Given this:

1. Should a Filterable (VariantEvaluation or Gene) report true or false to passedFilters if no filters have been applied? (I think this should really be false - however, this is a reversal of the current behaviour, which is always true until a filter has been failed, and passedFilters is quite well used - a more accurate name would be hasNotFailedFilters.)
2. Should a Filterable be able to report its actual state - PASSED/FAILED/UNFILTERED? (I vote yes, as it is quite explicit and prevents you from having to combine passedFilters and a newly added isUnFiltered Boolean in order to infer a missing failedFilters, or from adding the failedFilters Boolean too.)
3. Given the first two points, is there any point in having any methods in the Filterable interface other than passedFilter(FilterType) and getFilterStatus?

It would be possible to keep the existing behaviour of passedFilters and add getFilterStatus for ease of use and backwards compatibility at the expense of some potential confusion. But this all depends on what people are using - will any of these changes actually have any direct impact on you? If not, I suggest clean logic should be applied, as this tends to keep things simpler, and simple is good.
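
To make the three-state proposal concrete, here is a minimal sketch. It borrows the method names passedFilter, passedFilters and getFilterStatus from the discussion above. The FilterType values, the addFilterResult method and the class itself are hypothetical stand-ins, not the existing Exomiser types.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical filter types, named after FILTER values in the VCF excerpt above.
enum FilterType { FREQUENCY, TARGET }

// The three states discussed above.
enum FilterStatus { UNFILTERED, PASSED, FAILED }

final class FilterableSketch {
    private final Set<FilterType> passed = EnumSet.noneOf(FilterType.class);
    private final Set<FilterType> failed = EnumSet.noneOf(FilterType.class);

    // Records the outcome of running one filter.
    void addFilterResult(FilterType type, boolean pass) {
        (pass ? passed : failed).add(type);
    }

    boolean passedFilter(FilterType type) {
        return passed.contains(type);
    }

    // Any failure wins; no results at all means the object is unfiltered.
    FilterStatus getFilterStatus() {
        if (!failed.isEmpty()) {
            return FilterStatus.FAILED;
        }
        return passed.isEmpty() ? FilterStatus.UNFILTERED : FilterStatus.PASSED;
    }

    // Backwards-compatible reading of passedFilters: nothing has failed yet.
    boolean passedFilters() {
        return failed.isEmpty();
    }
}
```

Keeping passedFilters as "has not failed anything" preserves the current behaviour, while getFilterStatus lets callers distinguish an unfiltered object from one that has actually passed.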

Add more informative error message to PhenIX when no HPO terms have been supplied

Child of issue #37
Related to issues #47 and #48

The error message is really confusing. It gives the impression that some files are missing. Maybe we should add a better error. Like: Please insert HPO terms for method phenix.

Or convert the OMIM-ID to HPO-terms.

/home/mschubach/Exomiser/files/exomiser-cli-6.0.0/data/phenix/0.out (No such file or directory)
    at de.charite.compbio.exomiser.core.prioritisers.util.ScoreDistributionContainer.parseDistributions(ScoreDistributionContainer.java:175)
    at de.charite.compbio.exomiser.core.prioritisers.PhenixPriority.<init>(PhenixPriority.java:153)
    at de.charite.compbio.exomiser.core.prioritisers.PriorityFactory.getPhenixPrioritiser(PriorityFactory.java:82)
    at de.charite.compbio.exomiser.core.prioritisers.PriorityFactory.makePrioritisers(PriorityFactory.java:54)
    at de.charite.compbio.exomiser.core.Exomiser.analyse(Exomiser.java:72)
    at de.charite.compbio.exomiser.cli.Main.runAnalysis(Main.java:130)
    at de.charite.compbio.exomiser.cli.Main.main(Main.java:62)

Collapsible pheno-evidence sections

Results get a bit crazy looking when you have a lot of input HPO terms e.g. the classic Pfeiffer example.

Can we make each evidence section collapsible or only show the one with the best score i.e. contributing to the combined Exomiser score. Prefer the former as we may combine all scores eventually

Download facility from website

If we do this it should be all results though i.e. not just top 200

Filter for genes update

Allow pasting in of large list of genes or upload of a file e.g. the DDD list of developmental disorder genes

Can the io.html package be removed now?

This package has now been rendered redundant for the core exomiser-cli functionality as the latest changes in release 5.2.0 are using Thymeleaf to render the HTML and the HtmlWriter simply hands the context with the data it needs to the rendering engine which fills in the resources/html/templates/results.html with this data.

@pnrobinson - Do you still need it for Panel, CRE and Walker? If not, then this package has served its purpose and it's time to retire it from the codebase. Let me know and I'll take care of it.

Automatic integration tests

These are really needed to catch issues which could arise from changes in underlying dependencies such as jannovar and HTSJDK which could have subtle changes to a variant which can cause drastic changes to the outcome of an analysis.

Add data input validation from exomiser/submit on client-side

Bootstrap has some javascript stuff for this - see http://getbootstrap.com/javascript/ for inspiration...

Feedback after hit submit

Add report feature to flag up variants which Jannovar fails to annotate

Sometimes Jannovar is unable to annotate a variant and throws an exception. This is caught by Exomiser, but the variant is not included in the analysis or results which could lead to incorrect results.

There are two ways this could be handled:

These variants could be flagged and indicated to the user so that they are aware of the issue.
Exomiser should simply stop the analysis and report the reason for the failure.

Votes on behaviour please....

Make results available in Excel format

Is this a good idea?

Handling multiple alternative alleles

I discussed this earlier with @pnrobinson and he proposed/we agreed on the following (if I remember correctly). I'm adding this issue so we are all on the same page here and for some further sanity checking.

Consider the case of having multiple alternative alleles with REF=A, ALT=T,G. In our individuals, we see the following:

A/A A/T T/G

Exomiser will interpret this as two variants:

0/0 0/1 0/1  ==  wt  het het
0/0 0/0 0/1  ==  wt  wt  het
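
The re-coding rule implied by the example can be sketched as follows: for each ALT allele, copies of that allele become 1 and every other allele, including other ALT alleles, becomes 0. The class and method names are hypothetical. The sketch assumes fully called genotypes (no "." alleles) and discards phasing.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical re-coding of a multi-allelic genotype for a single ALT allele.
final class MultiAllelicSplit {

    // genotype: VCF GT such as "1/2"; altIndex: 1-based index of the ALT allele.
    static String recode(String genotype, int altIndex) {
        String[] alleles = genotype.split("[/|]");
        int[] recoded = new int[alleles.length];
        for (int i = 0; i < alleles.length; i++) {
            recoded[i] = Integer.parseInt(alleles[i]) == altIndex ? 1 : 0;
        }
        Arrays.sort(recoded); // write the reference allele first, e.g. 0/1
        return Arrays.stream(recoded)
                .mapToObj(String::valueOf)
                .collect(Collectors.joining("/"));
    }
}
```

With REF=A and ALT=T,G, the three individuals have GT 0/0, 0/1 and 1/2. Calling recode with altIndex 1 gives 0/0, 0/1, 0/1 and with altIndex 2 gives 0/0, 0/0, 0/1, matching the two rows above.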

Remove the evil that is Jannovar.Constants

This is java heresy - an interface used as an enum. Such horrific misuse of the language must be purged with fire.

Fix dbSNP parsing

dbSNP seems to have changed its format recently such that alternate alleles appear in the same row with allele frequencies reported for the ref and these alts in order e.g.

9 140777306 rs4422842 C G,T,A . . RS=4422842;RSPOS=140777306;RV;dbSNPBuildID=111;SSR=0;SAO=0;VP=0x050128000a0514012e000100;WGT=1;VC=SNV;PM;PMC;SLO;NSM;REF;ASP;VLD;GNO;KGPhase3;CAF=0.846,.,.,0.154;COMMON=1

This needs to be processed to result in our frequency table
9 140777306 rs4422842 C G .
9 140777306 rs4422842 C T .
9 140777306 rs4422842 C A 0.154
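
A sketch of that expansion, assuming the CAF value lists the REF frequency first followed by one entry per ALT allele in order, as described above. The class and method names are hypothetical, and the inputs are taken as already-extracted strings instead of being parsed from the full record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical expansion of one multi-ALT dbSNP record into one row per ALT.
final class CafSplitter {

    // alts: comma-separated ALT column; caf: comma-separated CAF value.
    static List<String> rows(String chrom, String pos, String id,
                             String ref, String alts, String caf) {
        String[] altAlleles = alts.split(",");
        String[] freqs = caf.split(",");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < altAlleles.length; i++) {
            // freqs[0] belongs to REF, so ALT number i uses freqs[i + 1].
            String freq = i + 1 < freqs.length ? freqs[i + 1] : ".";
            out.add(String.join(" ", chrom, pos, id, ref, altAlleles[i], freq));
        }
        return out;
    }
}
```

For the rs4422842 record above, rows("9", "140777306", "rs4422842", "C", "G,T,A", "0.846,.,.,0.154") yields the three rows of the frequency table.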

Exomiser parsing of indels is incorrect?

It seems that indels are not being parsed correctly, causing the AF and dbSNP lookups to fail.

For example, the VCF file contains:
chr11 61165731 . C CA

This results in the following annotation in the exomiser output: chr11:g.61165731->A, which is incorrect. It should be g.61165732 (final digit 2, not 1).

The output lists no frequency data, but this variant is actually rs11382548 with a MAF of 14%.

Not sure how common this is? Can anyone confirm?
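
The expected coordinate follows from removing the base that REF and ALT share at their start and shifting the position by the number of bases removed. A minimal sketch of that step is below. The class and method names are hypothetical, it trims only the shared prefix and is not a full normalisation, and whether Exomiser or Jannovar should apply exactly this rule is an assumption.

```java
// Hypothetical prefix trimming for a VCF indel record.
final class IndelTrim {

    // Returns "pos ref alt" after removing the leading bases REF and ALT share.
    static String trimSharedPrefix(int pos, String ref, String alt) {
        int shared = 0;
        while (shared < ref.length() && shared < alt.length()
                && ref.charAt(shared) == alt.charAt(shared)) {
            shared++;
        }
        String r = ref.substring(shared);
        String a = alt.substring(shared);
        // An emptied allele is written as "-", as in the annotation above.
        return (pos + shared) + " " + (r.isEmpty() ? "-" : r)
                + " " + (a.isEmpty() ? "-" : a);
    }
}
```

For the record above, trimSharedPrefix(61165731, "C", "CA") gives position 61165732 with an empty reference and an inserted A, which is the coordinate the lookups would need.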