
g2p-aggregator's Introduction

The VICC Meta-Knowledgebase

What is it? Why use it?


  • For researchers who need to investigate genotype-phenotype associations, smmart-g2p is a search tool that aggregates evidence from several knowledge bases. Unlike ad-hoc searches, it allows the researcher to focus on the evidence, not on the search.

  • Quickly determine the diseases, drugs and outcomes based on evidence from trusted sources. Find relevant articles and (soon) drug response data.

  • Inform discussions on modeling interpretation data between the VICC, GA4GH, and ClinGen

We host this Meta-Knowledgebase online at search.cancervariants.org.

Documentation and usage examples can be found online at docs.cancervariants.org

Where does the data come from?

Now:

In progress:

For an analysis of harmonization and overlaps, see the figures.

How to use it?

JUST GOOGLE IT:

  • Use the search box like a Google search. To search your data, enter your search criteria in the Query bar and press Enter or click Search to submit the request. For a full explanation of the search capabilities, see Search examples, syntax.

  • The charts and list are all tied to the search. Click to constrain your results

Technology stack


  • ElasticSearch, Kibana v6.0
    • Provisioned on AWS or locally using Docker
    • Snapshots and exports are managed by these utilities
  • Python 2.7
    • Provisioned by your OS image
  • API Flask 0.12.2
  • nginx openresty/openresty

On top of Elasticsearch, we built REST-based web services using the Flask web framework.

search.cancervariants.org provides two simple REST-based web services: an association query service and a GA4GH beacon service. The association query service allows users to query for evidence using any combination of keywords, while the beacon service exposes associations to the GA4GH Beacon network, enabling retrieval of associations based on genomic location.

How do I import new data into it?

  1. Start up an Elasticsearch container
  2. Register and download CosmicMutantExport.csv into the harvester directory
  3. Make the required files from the harvester Makefile
$ cd harvester
$ make oncokb_all_actionable_variants.tsv cgi_biomarkers_per_variant.tsv cosmic_lookup_table.tsv cgi_mut_benchmarking.tsv oncokb_mut_benchmarking.tsv benchmark_results.txt

Note: If you will be extracting from molecularmatch, you will need to contact them for an API key. Disease normalization depends on bioontology; see https://bioportal.bioontology.org/accounts/new for an API key.

  4. Install the required python packages
$ pip install -r requirements.txt
  5. Run the harvester
$ python harvester.py  -h
usage: harvester.py [-h] [--elastic_search ELASTIC_SEARCH]
                    [--elastic_index ELASTIC_INDEX] [--delete_index]
                    [--delete_source]
                    [--harvesters HARVESTERS [HARVESTERS ...]]

optional arguments:
  -h, --help            show this help message and exit
  --elastic_search ELASTIC_SEARCH, -es ELASTIC_SEARCH
                        elastic search endpoint
  --elastic_index ELASTIC_INDEX, -i ELASTIC_INDEX
                        elastic search index
  --delete_index, -d    delete elastic search index
  --delete_source, -ds  delete all content for source before harvest
  --harvesters HARVESTERS [HARVESTERS ...]
                        harvest from these sources. default: ['cgi_biomarkers', 'jax', 'civic', 'oncokb', 'g2p']
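
The interface above can be reproduced with a short argparse sketch. The defaults for the endpoint and index are illustrative assumptions, not the project's actual values; the harvester list is taken from the help text:

```python
import argparse


def build_parser():
    """Recreate the harvester.py command-line interface shown above.

    The endpoint and index defaults are illustrative assumptions."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--elastic_search', '-es', default='localhost',
                        help='elastic search endpoint')
    parser.add_argument('--elastic_index', '-i', default='associations',
                        help='elastic search index')
    parser.add_argument('--delete_index', '-d', action='store_true',
                        help='delete elastic search index')
    parser.add_argument('--delete_source', '-ds', action='store_true',
                        help='delete all content for source before harvest')
    parser.add_argument('--harvesters', nargs='+',
                        default=['cgi_biomarkers', 'jax', 'civic', 'oncokb', 'g2p'],
                        help='harvest from these sources')
    return parser
```

For example, `build_parser().parse_args(['--harvesters', 'civic', 'jax', '-d'])` harvests only civic and jax, deleting the index first.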

How do I write a new harvester?

A harvester is a python module that implements this duck typing interface.

#!/usr/bin/python


def harvest(genes):
    """ given a list of genes, yield an evidence item """
    # for gene in genes:
    #     gene_data = your_implementation_goes_here
    #     yield gene_data
    pass


def convert(gene_data):
    """ given a gene_data in it's original form, produce a feature_association """
    # gene: a string gene name
    # feature: a dict representing a ga4gh feature https://github.com/ga4gh/ga4gh-schemas/blob/master/src/main/proto/ga4gh/sequence_annotations.proto#L30
    # association: a dict representing a ga4gh g2p association https://github.com/ga4gh/ga4gh-schemas/blob/master/src/main/proto/ga4gh/genotype_phenotype.proto#L124
    #
    # feature_association = {'gene': gene,
    #                        'feature': feature,
    #                        'association': association,
    #                        'source': 'my_source',
    #                        'my_source': {... original data from source ...}}
    # yield feature_association
    pass


def harvest_and_convert(genes):
    """ get data from your source, convert it to ga4gh and return via yield """
    for gene_data in harvest(genes):
        for feature_association in convert(gene_data):
            yield feature_association
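
A minimal in-memory harvester implementing this interface might look like the following. The stubbed evidence payload and the `my_source` name are illustrative, not from a real knowledgebase:

```python
def harvest(genes):
    """Yield raw evidence items for each requested gene (stubbed data)."""
    stub = {'BRAF': {'variant': 'V600E', 'drug': 'vemurafenib'}}
    for gene in genes:
        if gene in stub:
            yield {'gene': gene, 'raw': stub[gene]}


def convert(gene_data):
    """Map a raw record onto the feature_association shape used upstream."""
    yield {
        'gene': gene_data['gene'],
        'feature': {'geneSymbol': gene_data['gene']},
        'association': {'description': gene_data['raw']['variant']},
        'source': 'my_source',
        'my_source': gene_data['raw'],   # original data kept for queries
    }


def harvest_and_convert(genes):
    """Get data from the source, convert it to GA4GH and return via yield."""
    for gene_data in harvest(genes):
        for feature_association in convert(gene_data):
            yield feature_association
```

Because the interface is duck typed, harvester.py only needs the module to expose these three callables.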

How do I test it?

$ cd harvester
$ pytest -s -v
======================================================================================================================================================= test session starts ========================================================================================================================================================
platform darwin -- Python 2.7.13, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /usr/local/opt/python/bin/python2.7
cachedir: ../../.cache
rootdir: /Users/walsbr, inifile:
collected 13 items

tests/integration/test_elastic_silo.py::test_args PASSED
tests/integration/test_elastic_silo.py::test_init PASSED
tests/integration/test_elastic_silo.py::test_save PASSED
tests/integration/test_elastic_silo.py::test_delete_all PASSED
tests/integration/test_elastic_silo.py::test_delete_source PASSED
tests/integration/test_kafka_silo.py::test_populate_args PASSED
tests/integration/test_kafka_silo.py::test_init PASSED
tests/integration/test_kafka_silo.py::test_save PASSED
tests/integration/test_pb_deserialize.py::test_civic_pb PASSED
tests/integration/test_pb_deserialize.py::test_jax_pb PASSED
tests/integration/test_pb_deserialize.py::test_oncokb_pb PASSED
tests/integration/test_pb_deserialize.py::test_molecular_match_pb PASSED
tests/integration/test_pb_deserialize.py::test_cgi_pb PASSED

How do I launch the database, bring up the website, etc. ?

There is a docker compose configuration file in the root directory.

Launch it by:

ELASTIC_PORT=9200 KIBANA_PORT=5601 docker-compose up -d

This will automatically download Elasticsearch, Kibana, etc., and will expose the standard Elasticsearch and Kibana ports (9200 and 5601).

If you would like to host an instance, launch docker-compose with an additional nginx file.

docker-compose -f docker-compose.yml -f cloud-setup/docker-compose-nginx.yml up -d

This will do the same setup, but will also include an nginx proxy to map http and https ports.

Our demo site is hosted on AWS and includes the API server and nginx proxy:

docker-compose -f docker-compose-aws.yml up -d

As a convenience, there is a Jupyter image for notebook analysis:

docker-compose -f docker-compose.yml -f docker-compose-jupyter.yml up -d

What else do I need to know?

OK, I get it. But what about .... ?

NEXT STEPS

  • Work with users, gather feedback
  • Load alternative data sources [Literome, Ensembl]
  • Load smmart drugs [Olaparib, Folfox, Pembrolizumab, …]
  • Integrate with bmeg (machine learning evidence)
  • Improve data normalization
    • Variant naming (HGVS)
    • Ontologies (diseases, drugs, variants)
  • Add GA4GH::G2P api (or successor)
  • Harden prototype:
    • python notebook
    • web app (deprecate kibana UI)

g2p-aggregator's People

Contributors

ahwagner, bwalsh, grmayfie, shane-neeley


g2p-aggregator's Issues

normalizer adds drugs improperly for cgi

For example, CGI has a MUT alteration association for ABL1 in CML that confers resistance to Imatinib. However, our normalizer parses this to be several drugs, including Imatinib (correct), inhibitor (incorrect, and a mismatch to the associated compound ID CID657356, which is named "Diacylglycerol Kinase Inhibitor Ii"), and Clorazepate Dipotassium (incorrect, and not sure why this got added).
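
One possible mitigation, sketched below, is to strip the parenthetical drug-family annotation CGI appends (e.g. "(BCR-ABL inhibitor)") before normalization, so that tokens like "inhibitor" never reach the drug lookup. The function name is hypothetical:

```python
import re


def primary_drug_name(drug_full_name):
    """Drop the '(... inhibitor)' family annotation CGI appends to drug
    names, leaving only the compound name for the normalizer."""
    return re.sub(r'\s*\([^)]*\)', '', drug_full_name).strip()
```

This keeps "Imatinib" from "Imatinib (BCR-ABL inhibitor)"; it does not explain the Clorazepate Dipotassium match, which needs separate investigation.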

Variant normalization: cannot add documents with non-numeric chromosome

There are several entries in CGI that are on chromosome X, but currently these cannot be added to ElasticSearch. They fail with a parsing error:

TransportError(400, u'mapper_parsing_exception', u'failed to parse [feature.chromosome]')
{'cgi': '{"Targeting": "", "Biomarker": "G6PD (S218F)", "Source": "FDA", "cDNA": "c.653C>T", "Primary Tumor type": "CANCER", "individual_mutation": "G6PD:p.S218F", "Drug full name": "Dabrafenib (BRAF inhibitor)", "Association": "Increased Toxicity (Haemolytic Anemia)", "Drug family": "BRAF inhibitor", "Curator": "DTamborero;CRubio-Perez", "Drug": "Dabrafenib", "Alteration": "G6PD:S218F", "gDNA": "chrX:g.153762634G>A", "Drug status": "Approved", "Gene": "G6PD", "transcript": "ENST00000393562", "strand": "-", "info": "CSQN=Missense;reference_codon=TCC;candidate_codons=TTT,TTC;candidate_mnv_variants=chrX:g.153762633_153762634delGGinsAA;aliases=ENSP00000377192;source=Ensembl", "Assay type": "", "Alteration type": "MUT", "region": "inside_[cds_in_exon_6]", "Evidence level": "FDA guidelines", "gene": "G6PD", "Metastatic Tumor Type": ""}', 'tags': [], 'feature': {'name': 'G6PD (S218F)', 'start': '153762634', 'geneSymbol': 'G6PD', 'alt': 'A', 'ref': 'G', 'chromosome': 'X', 'description': 'G6PD:S218F'}, 'source': 'cgi', 'gene': 'G6PD', 'association': {'drug_labels': 'Dabrafenib (BRAF inhibitor)', 'description': 'G6PD Dabrafenib (BRAF inhibitor) Increased Toxicity (Haemolytic Anemia)', 'publication_url': 'https://www.google.com/#q=FDA', 'evidence': [{'info': {'publications': ['https://www.google.com/#q=FDA']}, 'evidenceType': {'sourceName': 'cgi'}, 'description': 'Increased Toxicity (Haemolytic Anemia)'}], 'environmentalContexts': [{'description': 'Dabrafenib (BRAF inhibitor)'}], 'evidence_label': 'Increased Toxicity (Haemolytic Anemia) FDA guidelines', 'phenotype': {'description': 'CANCER'}}}
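
A fix would be to validate chromosome values against the full human set rather than assuming integers, and index the field as a keyword. A sketch (the function name is hypothetical):

```python
# Human chromosomes as keyword strings: '1'..'22', plus X, Y, MT.
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {'X', 'Y', 'MT'}


def normalize_chromosome(value):
    """Return the chromosome as a keyword string so the ElasticSearch
    mapping can index 'X' and 'Y' alongside '1'..'22'."""
    chrom = str(value).upper().replace('CHR', '')
    if chrom == 'M':
        chrom = 'MT'
    if chrom not in VALID_CHROMOSOMES:
        raise ValueError('unrecognized chromosome: {}'.format(value))
    return chrom
```

The ElasticSearch index mapping for feature.chromosome would also need to declare the field as a keyword/string type for this to work.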

importer fails

There's an issue with the normalization routines that causes them to fail on importing some data. From the run log:

2017-08-01 00:38:41,671 - root - ERROR - string indices must be integers
Traceback (most recent call last):
  File "/Users/awagner/Workspace/git/g2p-aggregator/harvester/drug_normalizer.py", line 211, in normalize
    drugs = normalize_biothings(name)
  File "/Users/awagner/Workspace/git/g2p-aggregator/harvester/drug_normalizer.py", line 134, in normalize_biothings
    if product['approved'] == 'true':
TypeError: string indices must be integers
Traceback (most recent call last):
  File "harvester.py", line 150, in <module>
    main()
  File "harvester.py", line 136, in main
    normalize(feature_association)
  File "harvester.py", line 124, in normalize
    drug_normalizer.normalize_feature_association(feature_association)
  File "/Users/awagner/Workspace/git/g2p-aggregator/harvester/drug_normalizer.py", line 240, in normalize_feature_association
    ctx_drugs = normalize(ctx['description'])
  File "/Users/awagner/Workspace/git/g2p-aggregator/harvester/drug_normalizer.py", line 227, in normalize
    raise e
TypeError: string indices must be integers
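
The traceback suggests `product` is sometimes a string rather than a dict when iterating the biothings response. A defensive guard in the same spirit (the function is a sketch; names mirror the traceback, not the actual drug_normalizer.py code):

```python
def approved_products(products):
    """Filter biothings 'products' entries, skipping any that are not
    dicts (indexing a string with ['approved'] raises the TypeError
    seen in the traceback above)."""
    approved = []
    for product in products:
        if not isinstance(product, dict):
            continue  # malformed entry; skip rather than crash the import
        if product.get('approved') == 'true':
            approved.append(product)
    return approved
```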

cgi importer requires dictionary for phenotype

Currently, CGI uses abbreviations that are not recognized by our normalization routines as phenotypes, which leads to high normalization failure rates.

Contact David Tamborero for mappings.

python notebook (overlaps and other metrics)

@ahwagner Hi. During our last call, you showed a notebook that illustrated overlaps and other metrics? Is it possible to share it? We can either incorporate it as-is or, perhaps we can adapt it into a test. Let me know your thoughts.

nginx_g2p requires OHSU certificate

The current docker container for nginx (nginx_g2p) requires the use of an OHSU certificate (/compbio-tls/compbio_ohsu_edu_cert.cer). While this makes sense for the live version hosted by OHSU, it causes docker-compose to fail for off-site users.

@bwalsh, would you update the image for locale-agnostic instantiation? Or am I missing the way to adjust this for a local instance?

review jax lab gene names

In jax parsing, the molecular profile field is parsed to lookup variant and add position information to the association. See here

I believe not all permutations have been accounted for. For example, in errant cases gene names are parsed as ['over', 'mut', 'inact', 'exp', 'exon19', '+',...]

I believe the molecular profile field needs to be parsed, not just split.

The test case below illustrates one possible approach. While not very elegant, the parse() method should ensure that miscellaneous strings are handled.

Currently, it seems these 'variant name' strings vary between knowledge bases. Therefore, I think this type of parsing belongs local to each harvester.

@jgoecks @mayfielg Let me know your thoughts

molecular_profiles = [
    "APC inact mut KRAS G12D",
    "APC mutant BRAF mutant PIK3CA mutant SMAD4 mutant TP53 mutant",
    "BRAF V600E EGFR amp",
    "BRAF V600E MAP2K1 L115P",
    "BRAF V600E NRAS Q61K NRAS A146T MAP2K1 P387S",
    "BRAF amp BRAF V600X NRAS Q61K",
    "CDKN2A mut MET del exon14 PDGFRA mut SMAD4 Q249H",
    "DNMT3A R882H FLT3 Y599_D600insSTDNEYFYVDFREYEY NPM1 W288fs",
    "EGFR E746_A750del EGFR T790M EGFR L718Q",
    "EGFR exon 19 del MET amp MET D1228V",
    "ERBB2 over exp PIK3CA H1047R SRC over exp",
    "ETV6 - JAK2 JAK2 G831R",
    "FGFR2 E565A FGFR2 K659M FGFR2 N549H FGFR2 N549K FGFR2 V564F FGFR2-ZMYM4",
    "FGFR2 N550K PIK3CA I20M PIK3CA P539R PTEN R130Q PTEN T321fs*23",
    "FGFR3 wild-type FGFR3 dec exp HRAS G12V",
    "FLT3 exon 14 ins FLT3 D835N",
    "FLT3 exon 14 ins FLT3 F691L FLT3 D698N",
    "FLT3 exon 14 ins FLT3 M837G FLT3 S838R FLT3 D839H",
    "JAK2 over exp MPL over exp",
    "KRAS G12D PIK3CA E545K PIK3CA H1047L TP53 wild-type",
    "KRAS G12D PTEN dec exp TP53 R306*",
    "KRAS G13C PIK3CA H1047Y PTEN G143fs*4 PTEN K267fs*9",
    "KRAS mut + TP53 wild-type",
    "MET del exon14 TP53 N30fs*14",
    "NPM1-ALK ALK L1196M ALK D1203N",
]


def parse(molecular_profile):
    """ returns gene, tuples[] """
    parts = molecular_profile.split()
    gene = None
    gene_complete = True

    tuples = []
    tuple = None
    for idx, part in enumerate(parts):
        if not gene:
            gene = part
        # deal with 'GENE - GENE '
        if part == '-':
            gene += part
            gene_complete = False
            continue
        elif not gene_complete:
            gene += part
            gene_complete = True
            continue

        if not tuple:
            tuple = []
        # build first tuple
        if len(tuples) == 0:
            if len(tuple) == 0:
                tuple.append(gene)
        if idx == 0:
            continue

        # ignore standalone plus
        if not part == '+':
            tuple.append(part)

        # we know there is at least one more to fetch before terminating tuple
        if len(tuple) == 1 and idx < len(parts)-1:
            continue

        # is the current tuple complete?
        if (
                (len(tuple) > 1 and part.isupper()) or
                idx == len(parts)-1 or
                parts[idx+1].isupper()
           ):
                tuples.append(tuple)
                tuple = None

    return gene, tuples


def test_parse_all():
    """ just loop through all test profiles, ensure no exceptions """
    genes = []
    for molecular_profile in molecular_profiles:
        genes.append(parse(molecular_profile)[0])


def test_parse_fusion():
    """ make sure we handle fusion format """
    gene, tuples = parse("ETV6 - JAK2")
    assert "ETV6-JAK2" == gene


def test_parse_simple():
    """ make sure we handle fusion format """
    gene, tuples = parse("BRAF V600E")
    assert "BRAF" == gene
    assert tuples == [["BRAF", "V600E"]]


def test_parse_simple_annotated():
    """ make sure we 'annotations' on gene """
    gene, tuples = parse("MET del exon14")
    assert "MET" == gene
    assert tuples == [["MET", "del", "exon14"]]


def test_parse_compound_annotated():
    """ make sure we 'annotations' on gene and others """
    gene, tuples = parse("MET del exon14 TP53 N30fs*14")
    assert "MET" == gene
    assert tuples == [["MET", "del", "exon14"], ["TP53", "N30fs*14"]]


def test_parse_mixed_annotated_compound():
    """ make sure we handle fusion format """
    gene, tuples = parse("CDKN2A mut MET del exon14 PDGFRA mut SMAD4 Q249H")
    assert "CDKN2A" == gene
    assert tuples == [["CDKN2A", "mut"],
                      ["MET", "del", "exon14"],
                      ["PDGFRA", "mut"],
                      ["SMAD4", "Q249H"]]


def test_parse_terminate_with_fusion():
    """ make sure we handle fusion format in last tuple"""
    gene, tuples = parse("FGFR2 E565A FGFR2 K659M FGFR2 N549H FGFR2 N549K FGFR2 V564F FGFR2-ZMYM4")  # NOQA
    assert "FGFR2" == gene
    assert tuples == [["FGFR2", "E565A"],
                      ["FGFR2", "K659M"],
                      ["FGFR2", "N549H"],
                      ["FGFR2", "N549K"],
                      ["FGFR2", "V564F"],
                      ["FGFR2-ZMYM4"],
                      ]


def test_plus_sign():
    """ make sure we handle fusion format in last tuple"""
    gene, tuples = parse("KRAS mut + TP53 wild-type")  # NOQA
    assert "KRAS" == gene
    assert tuples == [["KRAS", "mut"],
                      ["TP53", "wild-type"]]

Oncokb code review

A topic was raised during this week's call regarding the count of evidence records from oncokb.

  • Overview
    The database captures evidence using the same API that provisions the Actionable Genes pages.

  • Harvester Pseudo code

    • Get the list of genes ( currently 476 unique genes )
    • For each gene:
      • Get the clinical variants using hugoSymbol
        • For each clinical variant
          • Convert to GA4GH and persist ( 375 unique clinical evidence items )


gene evidence_count
KIT 98
BRAF 45
PDGFRA 43
FGFR3 27
EGFR 24
MET 22
IDH1 20
ERBB2 7
MTOR 6
ALK 5
IDH2 5
KRAS 5
NRAS 5
PIK3CA 5
FGFR2 4
MAP2K1 4
ESR1 3
NF1 3
PDGFRB 3
PTEN 3
ABL1 2
AKT1 2
ARAF 2
ATM 2
BRCA1 2
BRCA2 2
CDK4 2
CDKN2A 2
FGFR1 2
MDM2 2
NTRK1 2
PTCH1 2
RET 2
ROS1 2
TSC1 2
ERCC2 1
EZH2 1
FLT3 1
JAK2 1
NTRK2 1
NTRK3 1
RAF1 1
TSC2 1

civic evidence_labels

Brian, in your slide showing the break down of evidence categories, it looked like CIViC was contributing almost entirely in the “preclinical” category.
CIViC has lots of “case study” and “clinical” entries, but those two bars had nothing from CIViC?

Move pmkb from web scrape to API calls

Currently, the pmkb data is harvested by screen scraping. See harvester/pmkb.py
Cornell has provided API documentation. It would be useful to move the code over to using the API.

From LIH3001 AT med DOT cornell DOT edu
Michael Mienko, who left about a month ago, helped me set up some new endpoints on PMKB as follows:

/api/genes
/api/genes/{gene_id}
/api/interpretations
/api/interpretations/{interpretation_id}
/api/tissues
/api/tissues/{tissue_id}
/api/tumors
/api/tumors/{tumor_id}
/api/variants
/api/variants/{variant_id}
/api/health_check
/api/search?query={query} # Endpoint for the full-text search bar in PMKB

Right now these are behind basic authentication. I can add a username to the PMKB website if you'd like to try them.

There is another more specialized endpoint for PMKB that allows you to look up variants and interpretations with more detail. You've probably seen this example code before, but I've attached it again in api_reference.py, plus the following write-up:

params = {
        "gene": "KRAS",
        "aa_change": "G12A",
        "dna_change": "35G>C",
        "exons": "2",
        "tumor": "Adenocarcinoma",
        "tissue": "Lung",
        "transcript": "ENST00000256078"
    }
p = json.dumps([params])
url = 'https://pmkb.weill.cornell.edu/api/lookups'
headers = {
   'Content-Type': 'application/json',
}
response = requests.post(url, data=p, headers=headers)

(This particular endpoint is not behind authorization.)

Right now there are 8 levels of relevance:
1. An exact HGVS notation match, e.g. BRAF V600E matches BRAF V600E
2. A partial HGVS notation match, e.g. BRAF V600E matches BRAF V600
3. Codon match and variant type is specific, e.g. BRAF V600E matches BRAF codon 599-601 missense
4. Codon match and variant type is not specific, e.g. BRAF V600E matches BRAF codon 599-601 any mutation
5. Exon match and variant type is specific, e.g. BRAF V600E matches BRAF exon 15 missense
6. Exon match and variant type is not specific, e.g. BRAF V600E matches BRAF exon 15 any mutation
7. Gene match and variant type is specific, e.g. BRAF V600E matches BRAF any missense
8. Gene match and variant type is not specific, e.g. BRAF V600E matches BRAF any mutation


api_reference.py

import json
import requests

params = {
    "gene": "EGFR",
    "aa_change": "p.(=)",
    "dna_change": "c.2573G>A",
    "exons": "2", # optional
    "tumor": "Adenocarcinoma",
    "tissue": "Lung",
    "variant_type": "silent",
    "transcript": "ENST00000256078" #optional
}
params_cnv = {
    "variant_type": "CNV",
    "gene": "CDKN2A",
    "cnv_type": "loss",
    "tumor": "Squamous Cell Carcinoma",
    "tissue": "Lung"
}
params_fusion = {
    "variant_type": "rearrangement",
    "gene": "ERG",
    "partner_gene": "TMPRSS2",
    "tumor": "Adenocarcinoma",
    "tissue": "Prostate"
}

p = json.dumps([params])
url = 'https://pmkb.weill.cornell.edu/api/lookups'
headers = {
    'Content-Type': 'application/json'
}
response = requests.post(url, data=p, headers=headers)
print response.text.encode('utf-8')
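
For reference, the lookup payload can be assembled and checked without sending the request. PMKB's /api/lookups endpoint expects a JSON array of parameter dicts, as in the attached script; this small helper (the function name is hypothetical) just builds that body:

```python
import json


def build_lookup_payload(**params):
    """Serialize lookup parameters as the JSON array of dicts that
    PMKB's /api/lookups endpoint expects."""
    return json.dumps([params])
```

A POST with `data=build_lookup_payload(gene='KRAS', aa_change='G12A')` and a `Content-Type: application/json` header would then match the example above.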

sources have inconsistent structures for associations

Two differences observed so far using the following code:

double_listed_refs = Counter()
single_listed_refs = Counter()

listed_evidence = Counter()
single_evidence = Counter()
for hit in res['hits']['hits']:
    if isinstance(hit['_source']['association']['evidence'], list):
        listed_evidence[hit['_source']['source']] += 1
        for evidence in hit['_source']['association']['evidence']:
            for pmid_url in evidence['info']['publications']:
                if isinstance(pmid_url, list):
                    double_listed_refs[hit['_source']['source']] += 1
                else:
                    single_listed_refs[hit['_source']['source']] += 1
    else:
        single_evidence[hit['_source']['source']] += 1
        evidence = hit['_source']['association']['evidence']
        for pmid_url in evidence['info']['publications']:
            if isinstance(pmid_url, list):
                double_listed_refs[hit['_source']['source']] += 1
            else:
                single_listed_refs[hit['_source']['source']] += 1

double_listed_refs are jax, oncokb, and pmkb. single_listed_refs are cgi and civic. single_evidence is civic, the others are all listed_evidence. I'm handling these inconsistencies with transformations in my analysis, but it would be good to unify them under single_listed and listed_evidence.
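
A sketch of the suggested unification, coercing every hit's association to the listed_evidence / single_listed shape (the function name is hypothetical; the nested structure follows the code above):

```python
def unify_association(association):
    """Ensure 'evidence' is always a list and every entry in
    info.publications is a flat string, not a nested list."""
    evidence = association['evidence']
    if not isinstance(evidence, list):
        evidence = [evidence]          # single_evidence -> listed_evidence
    for item in evidence:
        flat = []
        for pub in item['info']['publications']:
            if isinstance(pub, list):
                flat.extend(pub)       # double_listed -> single_listed
            else:
                flat.append(pub)
        item['info']['publications'] = flat
    association['evidence'] = evidence
    return association
```

Running this at harvest time (or as a one-off re-index) would let downstream analyses drop the per-source transformations.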

GENIE analysis: in line figures

I've made two small changes to the GENIE analysis figures
see
https://github.com/ohsu-comp-bio/g2p-aggregator/blob/v0.7/notebooks/GENIE_Analysis.ipynb

stacked bars

rotated x axis

%%opts Bars.Stacked [stack_index=1 title_format='GENIE coverage' height=600 width=600 legend_position='top' xrotation=90 ]

...

bars.relabel(group='Stacked')


help needed

One challenge with the Genie notebook is that images are not persisted with the notebook.
This change is purported to work, but hasn't so far.
@jgoecks @mayfielg

from bokeh.resources import INLINE
from bokeh.io import output_notebook
output_notebook(resources=INLINE)

fix duplicates caused by splitting features

Re. duplicates - here is a quick dashboard to show them. https://dms-dev.compbio.ohsu.edu/kibana/app/kibana#/dashboard/7a678fb0-789f-11e7-81e2-c5c499f34804

[4:23]
AFAIK, they seem to be a side effect of taking a single entry from the source and splitting it up into different associations. The most egregious example is: ABL1 (I242T,M244V,K247R,L248V,G250E,G250R,Q252R,Q252H,Y253F,Y253H,E255K,E255V,M237V,E258D,W261L,L273M,E275K,E275Q,D276G,T277A,E279K,V280A,V289A,V289I,E292V,E292Q,I293V,L298V,V299L,F311L,F311I,T315I,F317L,F317V,F317I,F317C,Y320C,L324Q,Y342H,M343T,A344V,A350V,M351T,E355D,E355G,E355A,F359V,F359I,F359C,F359L,D363Y,L364I,A365V,A366G,L370P,V371A,E373K,V379I,A380T,F382L,L384M,L387M,L387F,L387V,M388L,Y393C,H396P,H396R,H396A,A397P,S417F,S417Y,I418S,I418V,A433T,S438C,E450K,E450G,E450A,E450V,E453K,E453G,E453A,E453V,E459K,E459G,E459A,E459V,M472I,P480L,F486S,E507G) Which created 92 associations. I believe this is a bug/feature of the harvesters (edited)
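
A dedupe pass keyed on the fields that define an association could catch these before indexing. The choice of key fields below is an assumption, as are the function names:

```python
import hashlib
import json


def association_key(feature_association):
    """Hash the fields assumed to uniquely identify an association."""
    identity = {
        'source': feature_association.get('source'),
        'gene': feature_association.get('gene'),
        'feature': feature_association.get('feature'),
        'description': feature_association.get('association', {}).get('description'),
    }
    return hashlib.sha1(
        json.dumps(identity, sort_keys=True).encode()).hexdigest()


def dedupe(feature_associations):
    """Yield each distinct association once, in first-seen order."""
    seen = set()
    for fa in feature_associations:
        key = association_key(fa)
        if key not in seen:
            seen.add(key)
            yield fa
```

This would not undo the upstream splitting itself; cases like the 92-variant ABL1 entry would still need the harvester to decide whether one association per variant is intended.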

Technical input to VICC paper (draft notes)

@jgoecks @mayfielg @ahwagner : See my notes below. I simply wanted to capture them here and if it's appropriate, move them over to google docs.


G2P - Knowledgebase integration

Abstract

Background

The g2p-aggregator is a bioinformatics tool designed to integrate evidence from disparate data sources and support interpretation, prioritization, and report generation. It is implemented by Oregon Health & Science University (OHSU), and integrates evidence from:

The g2p-aggregator project has created an open source suite based on the GA4GH schemas, which can be efficiently interrogated to find sets of relevant evidence through a search API.

Methods

The g2p-aggregator integrates data coming from multiple knowledgebases and allows users to query a harmonized result set. The harmonization consists of structural and ontology mapping. Structural mapping manipulates the input stream into a GA4GH genomic feature association. The content of the original data source is maintained for queries and is returned in result sets. Ontology mapping is targeted at variants, environment (drugs), phenotypes (diseases), and evidence metadata such as evidence strength and direction. The system then stores evidence in a variety of possible stores [elastic search, kafka message queues, RDBMS or files]. Full text search, aggregations, and a GA4GH beacon are provided via the elastic search store. Integration with downstream systems is provided by the kafka or file system store.

Our central theme was to provide a robust search facility, giving a focus to the conversation between the researcher and the aggregated evidence. Researchers should expect a rich search experience and should be able to make judgments about the applicability of the evidence set to their research question based on the quality of one or two sets of search results.

Results

The g2p-aggregator manages data of 9 knowledgebases with a total count of over 25K evidence and clinical trial items distributed over:

  • 9 knowledge bases
  • 513 Genes, 7221 unique Locations
  • 383 Diseases, 185 unique Disease Ontologies
  • 1119 Drugs, 898 unique pubchem identifiers
  • 7905 unique publications

Conclusions

G2p-aggregator is a useful demonstration of how web-scale, open source architectures and components can be implemented to support translational research. The next steps of our project will involve extending its capabilities by implementing new plug-ins devoted to bioinformatics data analysis as well as a temporal query module. For researchers who need to investigate genomic events, g2p is a search tool that aggregates evidence from several knowledge bases; unlike ad-hoc searches, the product allows the researcher to focus on the evidence, not on the search. For informaticians who need to annotate genomic events, g2p provides a query API for any pipeline to gather evidence 'hits'; unlike current practices, which have not focused on evidence, the product allows the informatician to identify, filter, and sort genomic events based on evidence.

Background

The GA4GH schema provides structures for unambiguous references to ontological concepts and/or controlled vocabularies defined by the GA4GH G2P schemas and the individual data sources. The system's harmonization process follows the intent expressed by the original G2P task team:

Where a G2P association is between the G(enotype) in the context of
some E(environment), which gives rise to a P(henotype). These
associations have further evidence, provenance, and attribution.
We leverage the GenomicFeature in the sequenceAnnotation schema here
as it can accommodate any genomic feature from a single nucleotide variation
(SNV), up through a gene, and/or complex rearrangements. Each can
be modeled as genomic features, and generally linked to a phenotype.
Collections of these features can represent a genotype at different levels
of completeness. Therefore, we can represent single allelic variation,
allelic complement, and multiple variants in a genotype that can each or
collectively be associated with a phenotype.
To enable standardized integration, this schema relies heavily on
OntologyTerms, for typing phenotype, genomic features, and levels
of evidence.

Methods

Harvester

A harvester is a python module that implements this duck typing interface.


  • harvest: A fairly straightforward mechanism to use the knowledgebase's access method (API, file download, etc.) to retrieve the evidence items in their native format.

  • convert: Each harvester needs to map and harmonize the evidence presented to a GA4GH FeatureAssociation. This function is supported by several helper methods:

    • A normalized vocabulary for evidence_level which harmonizes the source to AMP/ASCO/CAP guidelines.
    • An alias and lookup service for genotype that leverages a webservice provided by EBI to look up the human disease ontology
    • A lookup service for environment that leverages a webservice provided by Biothings to look up pubchem, chebi or chembl identifiers, as well as toxicity, taxonomy and approved countries.
    • The COSMIC variant table to parse and harmonize variant location.
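
As an illustration of the first helper, a lookup table can map source-specific evidence levels onto the harmonized A-D vocabulary. The mappings below are illustrative examples only, not the project's actual table:

```python
# Illustrative mapping of (source, source-level) pairs onto the
# harmonized A-D labels (AMP/ASCO/CAP-style). The real mappings live
# in the harvesters; these entries are examples, not project data.
EVIDENCE_LEVELS = {
    ('civic', 'A'): 'A',
    ('oncokb', '1'): 'A',
    ('oncokb', '2A'): 'B',
    ('jax', 'Phase III'): 'B',
}


def evidence_label(source, level):
    """Return the harmonized label, or 'NA' when no mapping exists
    (matching the 'NA' rows in the results tables below)."""
    return EVIDENCE_LEVELS.get((source, level), 'NA')
```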

Once deployed via standard docker containers, the system extends the value of the underlying data by enabling query via a GA4GH beacon or elastic search's API. The resulting system has a minimal footprint, and is currently deployed on Amazon's free tier.

Use cases

Use cases are divided into three categories: discovery, exploration and integration.


GA4GH Beacon

The Beacon project tests the willingness of international sites to share genetic data in the simplest of all technical contexts.

Our implementation follows the beacon specification and returns meta information about the beacon and a simple evidence summary for a specific genomic location.

UI

Our current alpha UI allows the user to query using a 'google search' and then presents visualizations and the ability to drill down to the specific FeaturePhenotypeAssociation and associated evidence from the original source.

As a clinician or a genomics researcher, I may have a patient with Gastrointestinal stromal tumor, GIST, and a proposed drug for treatment, imatinib. In order to identify whether the patient would respond well to treatment with the drug, I need a list of features (e.g. genes) which are associated with the sensitivity of GIST to imatinib. Suppose I am specifically interested in a gene, KIT, which is implicated in the pathogenesis of several cancer types. I could submit a query in the form GIST AND imatinib AND KIT.

In response, I receive a list of associations involving GIST and KIT, which I can filter for instances where imatinib is mentioned. Additionally, the query can be extended by drilling down on the UI's widgets and/or by adding more full-text search terms.
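The client-side filtering step can be sketched as below. The association fields and values are illustrative assumptions, not the service's actual schema:

```python
# Toy sketch of narrowing returned associations on the client side.
# Field names and values here are illustrative assumptions.
associations = [
    {'phenotype': 'GIST', 'gene': 'KIT',
     'drugs': ['imatinib'], 'evidence_label': 'A'},
    {'phenotype': 'GIST', 'gene': 'KIT',
     'drugs': ['sunitinib'], 'evidence_label': 'B'},
]

def mentions_drug(association, drug):
    """True if the association's drug list mentions the given drug."""
    return drug.lower() in [d.lower() for d in association.get('drugs', [])]

hits = [a for a in associations if mentions_drug(a, 'imatinib')]
```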

image

API

The /analysis folder contains a Python notebook that leverages the entire knowledgebase for comparison with the GENIE database of clinical outcomes. We used the Elasticsearch DSL to abstract the low-level APIs (which remain available for use) and provide "a more convenient and idiomatic way to write and manipulate queries".

An example to simply retrieve all evidence items would be:

res = es.search(index="g2p", size=10000, body={"query": {"match_all": {}}})

Alternative APIs are available in most commonly used programming environments.
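The request bodies for both the match-all example above and keyword searches can be assembled without a live cluster. This sketch uses the standard Elasticsearch query DSL; only the "g2p" index name comes from the example above:

```python
# Build Elasticsearch request bodies for the g2p index.
def keyword_query(*terms, size=10000):
    """Return a request body: match_all when no terms are given,
    otherwise a query_string query ANDing the terms together."""
    if not terms:
        query = {'match_all': {}}
    else:
        query = {'query_string': {'query': ' AND '.join(terms)}}
    return {'size': size, 'query': query}

# e.g. es.search(index="g2p", body=keyword_query("GIST", "imatinib", "KIT"))
```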

Results

Harmonization

Evidence Level

Our first challenge was to align the diverse "strength of evidence" fields presented by different knowledgebases.

image

Detail: Evidence Label by source

source filters count
cgi A 304
cgi B 49
cgi C 563
cgi D 515
cgi NA 2
civic A 62
civic B 1131
civic C 1003
civic D 980
civic NA 5
jax A 64
jax B 54
jax C 647
jax D 2884
jax NA 3
jax_trials D 1131
jax_trials NA 3
molecularmatch A 298
molecularmatch B 73
molecularmatch C 150
molecularmatch D 500
molecularmatch_trials D 64379
molecularmatch_trials NA 885
oncokb A 114
oncokb B 116
oncokb C 69
oncokb D 97
oncokb NA 185
pmkb A 414
pmkb C 160
pmkb D 35
pmkb NA 609
sage C 33
sage D 36

Phenotype & Environment

In order to enable cross knowledgebase queries, we needed a uniform Phenotype and Environment.
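The environment half of this harmonization leans on the BioThings mychem.info query endpoint mentioned earlier. The sketch below only constructs the lookup URL; the field list requested is an illustrative assumption, not the harvester's actual configuration:

```python
from urllib.parse import urlencode

# Sketch of an environment (drug) lookup URL against the BioThings
# mychem.info service. The requested fields are assumptions.
def mychem_query_url(drug_name):
    """Build a mychem.info query URL for a drug name."""
    params = urlencode({
        'q': drug_name,
        'fields': 'chebi.id,chembl.molecule_chembl_id,pubchem.cid',
    })
    return 'https://mychem.info/v1/query?' + params
```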

image

Detail: Exceptions to phenotype and environment harmonization

source environment count
molecularmatch_trials Surgery 930
molecularmatch_trials HSCT 745
molecularmatch_trials Allotransplantation 349
molecularmatch_trials Cytotoxic T Lymphocytes 138
molecularmatch_trials RG7446 111
brca    
oncokb Debio1347 12
oncokb AP32788 5
oncokb BAY1436032 5
oncokb BGB659 2
jax N/A 39
jax MRX-2843 11
jax AZ8010 7
jax TASIN-1 7
jax BAY1187982 6
civic AMGMDS3 8
civic Chemotherapy 4
civic Adjuvant Chemotherapy 2
civic Adoptive T-cell Transfer 2
civic Antiangiogenic Therapy 2
cgi FGFR inhibitors 25
cgi PARP inhibitors 21
cgi MTOR inhibitors 17
cgi PI3K pathway inhibitors 17
cgi HDAC inhibitors 6
jax_trials IDH305 2
jax_trials INCB054828 2
jax_trials SYM004 2
jax_trials AC0010MA 1
jax_trials ALRN-6924 1
molecularmatch Sym004 5
molecularmatch 3
molecularmatch RG7446 3
molecularmatch ETC159 2
molecularmatch MEDI6469 2
pmkb    
sage mTOR inhibitors 7
source phenotype count
molecularmatch_trials Acute myeloid leukaemia, disease 1069
molecularmatch_trials HIV - Human immunodeficiency virus infection 841
molecularmatch_trials Myeloproliferative disorder 735
molecularmatch_trials Chronic lymphoid leukaemia, disease 706
molecularmatch_trials ALL - Acute lymphoblastic leukaemia 632
brca    
oncokb Soft Tissue Sarcoma 6
oncokb CNS Cancer 2
oncokb Embryonal Tumor 2
oncokb Esophagogastric Cancer 2
oncokb Esophageal/Stomach Cancer, NOS 1
jax Indication other than cancer 1
civic Desmoid Fibromatosis 9
civic T-cell Acute Lymphoblastic Leukemia 8
civic Epithelial Ovarian Cancer 5
civic Hepatocellular Fibrolamellar Carcinoma 3
civic Anaplastic Oligodendroglioma 2
cgi Renal 17
cgi Bladder BLCA 10
cgi Head an neck 9
cgi Head an neck squamous 8
cgi Myelodisplasic proliferative syndrome 8
jax_trials    
molecularmatch Metastasis from malignant melanoma of skin 2
pmkb MDS with Ring Sideroblasts 10
pmkb Glial Neoplasm 9
pmkb Histiocytic and Dendritic Cell Neoplasms 7
pmkb Langerhans Cell Histiocytosis 7
pmkb Other Tumor Type 5
sage mesothelioma 2
sage head and neck cancer 1

Genotype

Breakdown of normalized variants/biomarkers by source.
image

source filters count
brca genomic location 5733
brca no genomic location 0
cgi genomic location 589
cgi no genomic location 842
civic genomic location 2865
civic no genomic location 311
jax genomic location 3009
jax no genomic location 640
jax_trials genomic location 1051
jax_trials no genomic location 80
molecularmatch genomic location 804
molecularmatch no genomic location 217
molecularmatch_trials genomic location 0
molecularmatch_trials no genomic location 64379
oncokb genomic location 1811
oncokb no genomic location 2338
pmkb genomic location 609
pmkb no genomic location 0
sage genomic location 0
sage no genomic location 69
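The per-source coverage implied by the table above can be recomputed directly; the counts below are copied from it:

```python
# Recompute genomic-location coverage per source from the table above.
genomic_location_counts = {
    # source: (with genomic location, without genomic location)
    'civic': (2865, 311),
    'molecularmatch': (804, 217),
    'sage': (0, 69),
}

def coverage_pct(with_loc, without_loc):
    """Percent of a source's variants harmonized to a genomic location."""
    total = with_loc + without_loc
    return round(100.0 * with_loc / total, 1) if total else 0.0
```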

Analysis

GENIE Analysis: Variant Level

The AACR Project Genomics Evidence Neoplasia Information Exchange (GENIE) provides clinical targeted sequencing panel data from 8 different cancer centers.

G2P Knowledge Base coverage of GENIE at the variant level:

  • Total coverage of non-unique variants is 28%
  • Adding more databases increases total coverage
  • Different databases contribute different types of evidence
  • OncoKB and MolecularMatch contribute guideline recommendations
  • CIViC and JAX contribute substantial preclinical evidence
  • 10% of variants are associated with A-level evidence and are highly actionable
  • 22% of variants are associated with B-level evidence and are moderately actionable

image

GENIE Analysis: Donor Level

G2P coverage of GENIE donors is encouraging:

  • 42% of donors have 1+ actionable variant
  • 48% of donors with non-unique variant(s) also have 1+ actionable variant
  • Adding more databases increases total coverage
  • 25% of donors have a variant with A-level evidence; 42% have a variant with B-level evidence

image

Discussion

//TODO

Conclusions

//TODO

List of abbreviations used

//TODO

Declarations

//TODO

Acknowledgements

//TODO

Electronic supplementary material

//TODO

References

//TODO

Disease normalization

@ahwagner

Alex

Hi. I was hoping you might have some insight into an alias or lookup table we can use to get around this issue.

Out of the 14,780 evidence items, we have 5,111 with no matched phenotype.
Of those 5,111 items, the 32 phenotypes below each account for more than 10 entries, for a total of 4,635.

Do you have any insight into what alias we might use for these items?

Thanks,

--

phenotype count
Advanced Solid Tumor 1571
Solid tumor 634
Malignant neoplastic disease 423
Neoplasm of respiratory tract 417
Neoplasm of respiratory system 407
Neoplasia 153
Any cancer type 142
Neoplasm of colon 131
Neoplasm of rectum 114
Neoplasm of digestive system 105
Non-small cell lung 66
Primary malignant neoplasm of intrathoracic organs 65
Gastrointestinal stromal 60
Primary malignant neoplasm of lung 46
Ovary 29
T Lymphoblastic Leukemia/Lymphoma 28
B Lymphoblastic Leukemia/Lymphoma 21
Malignant tumour of soft tissue 21
Chronic Myelomonocytic Leukemia 20
Primary malignant neoplasm of bone marrow 19
Malignant peripheral nerve sheat tumor 18
Lung squamous cell 17
Urothelial Carcinoma 17
Neoplasm of breast 16
Renal 16
Diffuse Large B Cell Lymphoma 13
Neoplasm of digestive tract 12
Neoplasm of intra-abdominal organs 12
Thyroid 12
All Tumors 10
Bladder BLCA 10
MDS with Ring Sideroblasts 10

Normalize Drugs

As a researcher, in order to evaluate evidence from different sources, I need to see a uniform drug name and identifier.

Capture usage and user activities

To understand how the system is used, and eventually to learn from that usage, we should instrument the user interface as much as possible. This includes:

  • recording search terms
  • recording when and how a user clicks on UI elements to filter results.

PostgreSQL backend?

@bwalsh, now that we're moving to normalize on a number of concepts, I think it would make sense to transition the backend to SQL. This will require more work up front, but should drastically improve the results.

I'd like to hear your thoughts (and @obigriffith, @malachig, @AAMargolin's) on the subject in this issue.

harmonize location information?

Problem statement:

How do we harmonize location information?
For example: for entries without genomic location specifics, is it possible to retrieve the appropriate fields and append them to the evidence record?

Worked example

Evidence without location information

Original, from the source:
https://civic.genome.wustl.edu/events/genes/58/summary/variants/1970/summary#variant

In g2p:
https://g2p-ohsu.ddns.net/_plugin/kibana/app/kibana#/doc/associations/associations-new/association?id=AV6bnNSKd2hRurWfSY2g&_g=()

Can we take the gene and variant info and deduce more?

Methodology:

  • use the gene and variant info from the source, VHL V130L (c.388G>C), to retrieve hits from ClinVar
  • if there are hits, retrieve the records using the returned idlist

Issues: what should we select from ClinVar? How do we map it to a feature?

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=V130L+%28c.388G%3EC%29&retmode=json

  {
      "header": {
          "type": "esearch",
          "version": "0.3"
      },
      "esearchresult": {
          "count": "1",
          "retmax": "1",
          "retstart": "0",
          "idlist": [
              "2229"
          ],
          "translationset": [
          ],
          "translationstack": [
              {
                  "term": "VHL[All Fields]",
                  "field": "All Fields",
                  "count": "686",
                  "explode": "N"
              },
              {
                  "term": "c0x2e388G0x3eC[All Fields]",
                  "field": "All Fields",
                  "count": "5",
                  "explode": "N"
              },
              "AND",
              "GROUP"
          ],
          "querytranslation": "VHL[All Fields] AND c0x2e388G0x3eC[All Fields]"
      }
  }
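The esearch call above can also be built programmatically, which keeps the URL-escaping of terms like c.388G>C out of the harvester's hands. The search term is taken from the worked example:

```python
from urllib.parse import urlencode

# Programmatic form of the ClinVar esearch call shown above.
EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils'

def clinvar_esearch_url(term):
    """Build a ClinVar esearch URL returning JSON for the given term."""
    params = urlencode({'db': 'clinvar', 'term': term, 'retmode': 'json'})
    return '{}/esearch.fcgi?{}'.format(EUTILS, params)

url = clinvar_esearch_url('V130L (c.388G>C)')
```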

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=2229&retmode=json'

{
   "header": {
       "type": "esummary",
       "version": "0.3"
   },
   "result": {
       "uids": [
           "2229"
       ],
       "2229": {
           "uid": "2229",
           "obj_type": "Simple",
           "accession": "",
           "accession_version": "",
           "title": "NM_000551.3(VHL):c.388G&gt;C (p.Val130Leu)",
           "variation_set": [
               {
                   "measure_id": "17268",
                   "variation_xrefs": [
                       {
                           "db_source": "UniProtKB",
                           "db_id": "P40337#VAR_005733"
                       },
                       {
                           "db_source": "OMIM",
                           "db_id": "608537.0021"
                       },
                       {
                           "db_source": "dbSNP",
                           "db_id": "104893830"
                       }
                   ],
                   "variation_name": "NM_000551.3(VHL):c.388G&gt;C (p.Val130Leu)",
                   "cdna_change": "c.388G&gt;C (p.Val130Leu)",
                   "aliases": [
                   ],
                   "variation_loc": [
                       {
                           "status": "current",
                           "assembly_name": "GRCh38",
                           "chr": "3",
                           "band": "3p25;3p25.3",
                           "start": "10146561",
                           "stop": "10146561",
                           "inner_start": "",
                           "inner_stop": "",
                           "outer_start": "",
                           "outer_stop": "",
                           "display_start": "10146561",
                           "display_stop": "10146561",
                           "assembly_acc_ver": "GCF_000001405.33",
                           "annotation_release": "",
                           "alt": "C",
                           "ref": "G"
                       },
                       {
                           "status": "previous",
                           "assembly_name": "GRCh37",
                           "chr": "3",
                           "band": "3p25;3p25.3",
                           "start": "10188245",
                           "stop": "10188245",
                           "inner_start": "",
                           "inner_stop": "",
                           "outer_start": "",
                           "outer_stop": "",
                           "display_start": "10188245",
                           "display_stop": "10188245",
                           "assembly_acc_ver": "GCF_000001405.25",
                           "annotation_release": "",
                           "alt": "C",
                           "ref": "G"
                       }
                   ],
                   "allele_freq_set": [
                   ],
                   "variant_type": "single nucleotide variant"
               }
           ],
           "trait_set": [
               {
                   "trait_xrefs": [
                       {
                           "db_source": "Gene",
                           "db_id": "8056"
                       },
                       {
                           "db_source": "MedGen",
                           "db_id": "C1837915"
                       },
                       {
                           "db_source": "Orphanet",
                           "db_id": "238557"
                       },
                       {
                           "db_source": "OMIM",
                           "db_id": "263400"
                       }
                   ],
                   "trait_name": "Erythrocytosis, familial, 2"
               },
               {
                   "trait_xrefs": [
                       {
                           "db_source": "MedGen",
                           "db_id": "C0019562"
                       },
                       {
                           "db_source": "Orphanet",
                           "db_id": "892"
                       },
                       {
                           "db_source": "OMIM",
                           "db_id": "193300"
                       }
                   ],
                   "trait_name": "Von Hippel-Lindau syndrome"
               },
               {
                   "trait_xrefs": [
                       {
                           "db_source": "MedGen",
                           "db_id": "C0027672"
                       }
                   ],
                   "trait_name": "Hereditary cancer-predisposing syndrome"
               }
           ],
           "supporting_submissions": {
               "scv": [
                   "SCV000053262",
                   "SCV000580968",
                   "SCV000022475",
                   "SCV000264729"
               ],
               "rcv": [
                   "RCV000030586",
                   "RCV000002317",
                   "RCV000492250"
               ]
           },
           "clinical_significance": {
               "description": "Pathogenic",
               "last_evaluated": "2016/08/16 00:00",
               "review_status": "criteria provided, multiple submitters, no conflicts"
           },
           "record_status": "",
           "gene_sort": "VHL",
           "chr_sort": "03",
           "location_sort": "00000000000010146561",
           "variation_set_name": "",
           "variation_set_id": "",
           "genes": [
               {
                   "symbol": "VHL",
                   "geneid": "7428",
                   "strand": "+",
                   "source": "submitted"
               }
           ]
       }
   }
}
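Once the esummary payload above is parsed (e.g. with json.loads), the coordinates a feature record needs sit inside variation_loc. This is a sketch, assuming a response of the shape shown above:

```python
# Extract a GRCh37 location from a parsed ClinVar esummary response
# of the shape shown above.
def grch37_location(summary, uid):
    """Return chr/start/ref/alt for the GRCh37 assembly, or None."""
    record = summary['result'][uid]
    for variation in record['variation_set']:
        for loc in variation['variation_loc']:
            if loc['assembly_name'] == 'GRCh37':
                return {'chr': loc['chr'], 'start': loc['start'],
                        'ref': loc['ref'], 'alt': loc['alt']}
    return None
```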

Stable source / data release

Hey @bwalsh. On Tuesday we discussed having a static data silo that we could draw upon for the paper figures. Any progress on this front?

Also, we should identify a source branch that our analyses will be based upon; there may be some merging that needs to take place here (@jgoecks, I'd like your input on this as well).

Let me know if I can help out at all with setting this up, so that we can move forward on generating the figures for the next paper call.

Normalize Clinical Significance

As a researcher or informatician, in order to search or aggregate evidence that meets a given level, I need all evidence to be tagged with a consistent vocabulary.

Use civic?
image

Researcher support

Support researcher's activities to integrate g2p into:

  • python notebooks
  • pipelines
