
g2p-aggregator's Introduction

The VICC Meta-Knowledgebase

What is it? Why use it?


  • For researchers who need to investigate genotype-phenotype associations, smmart-g2p is a search tool that aggregates evidence from several knowledge bases. Unlike ad-hoc searches, it allows the researcher to focus on the evidence, not on the search.

  • Quickly determine the diseases, drugs and outcomes based on evidence from trusted sources. Find relevant articles and (soon) drug response data.

  • Inform discussions on modeling interpretation data between the VICC, GA4GH, and ClinGen

We host this Meta-Knowledgebase online at search.cancervariants.org.

Documentation and usage examples can be found online at docs.cancervariants.org

Where does the data come from?

Now:

In progress:

For an analysis of harmonization and overlaps, see the figures.

How to use it?

JUST GOOGLE IT:

  • Use the search box like a Google search. To search your data, enter your search criteria in the Query bar and press Enter or click Search to submit the request. For a full explanation of the search capabilities, see Search examples, syntax.

  • The charts and list are all tied to the search. Click to constrain your results

Technology stack


  • ElasticSearch, Kibana v6.0
    • Provisioned on AWS or locally using Docker
    • Snapshots and exports are managed by these utilities
  • Python 2.7
    • Provisioned by your OS image
  • API Flask 0.12.2
  • nginx openresty/openresty

On top of Elasticsearch, we built REST-based web services using the Flask web framework.

search.cancervariants.org provides two simple REST-based web services: an association query service and a GA4GH beacon service. The association query service allows users to query for evidence using any combination of keywords, while the beacon service exposes associations to the GA4GH Beacon network, enabling retrieval of associations based on genomic location.

How do I import new data into it?

  1. Start up an Elasticsearch container
  2. Register and download CosmicMutantExport.csv into the harvester directory
  3. Make the required files from the harvester Makefile
$ cd harvester
$ make oncokb_all_actionable_variants.tsv cgi_biomarkers_per_variant.tsv cosmic_lookup_table.tsv cgi_mut_benchmarking.tsv oncokb_mut_benchmarking.tsv benchmark_results.txt

Note: If you will be extracting from molecularmatch, you will need to contact them for an API key. Disease normalization depends on bioontology; see https://bioportal.bioontology.org/accounts/new for an API key.

  4. Install the required python packages
$ pip install -r requirements.txt
  5. Run the harvester
$ python harvester.py  -h
usage: harvester.py [-h] [--elastic_search ELASTIC_SEARCH]
                    [--elastic_index ELASTIC_INDEX] [--delete_index]
                    [--delete_source]
                    [--harvesters HARVESTERS [HARVESTERS ...]]

optional arguments:
  -h, --help            show this help message and exit
  --elastic_search ELASTIC_SEARCH, -es ELASTIC_SEARCH
                        elastic search endpoint
  --elastic_index ELASTIC_INDEX, -i ELASTIC_INDEX
                        elastic search index
  --delete_index, -d    delete elastic search index
  --delete_source, -ds  delete all content for source before harvest
  --harvesters HARVESTERS [HARVESTERS ...]
                        harvest from these sources. default: ['cgi_biomarkers', 'jax', 'civic', 'oncokb', 'g2p']
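
The interface above can be reproduced with a short argparse sketch. The defaults for the endpoint and index are illustrative assumptions, not the project's actual values; the harvester list is taken from the help text:

```python
import argparse


def build_parser():
    """Recreate the harvester.py command-line interface shown above.

    The endpoint and index defaults are illustrative assumptions."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--elastic_search', '-es', default='localhost',
                        help='elastic search endpoint')
    parser.add_argument('--elastic_index', '-i', default='associations',
                        help='elastic search index')
    parser.add_argument('--delete_index', '-d', action='store_true',
                        help='delete elastic search index')
    parser.add_argument('--delete_source', '-ds', action='store_true',
                        help='delete all content for source before harvest')
    parser.add_argument('--harvesters', nargs='+',
                        default=['cgi_biomarkers', 'jax', 'civic', 'oncokb', 'g2p'],
                        help='harvest from these sources')
    return parser
```

For example, `build_parser().parse_args(['--harvesters', 'civic', 'jax', '-d'])` harvests only civic and jax, deleting the index first.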

How do I write a new harvester?

A harvester is a python module that implements this duck typing interface.

#!/usr/bin/python


def harvest(genes):
    """ given a list of genes, yield an evidence item """
    # for gene in genes:
    #     gene_data = your_implementation_goes_here
    #     yield gene_data
    pass


def convert(gene_data):
    """ given a gene_data in it's original form, produce a feature_association """
    # gene: a string gene name
    # feature: a dict representing a ga4gh feature https://github.com/ga4gh/ga4gh-schemas/blob/master/src/main/proto/ga4gh/sequence_annotations.proto#L30
    # association: a dict representing a ga4gh g2p association https://github.com/ga4gh/ga4gh-schemas/blob/master/src/main/proto/ga4gh/genotype_phenotype.proto#L124
    #
    # feature_association = {'gene': gene,
    #                        'feature': feature,
    #                        'association': association,
    #                        'source': 'my_source',
    #                        'my_source': {... original data from source ...}}
    # yield feature_association
    pass


def harvest_and_convert(genes):
    """ get data from your source, convert it to ga4gh and return via yield """
    for gene_data in harvest(genes):
        for feature_association in convert(gene_data):
            yield feature_association
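
A minimal in-memory harvester implementing this interface might look like the following. The stubbed evidence payload and the `my_source` name are illustrative, not from a real knowledgebase:

```python
def harvest(genes):
    """Yield raw evidence items for each requested gene (stubbed data)."""
    stub = {'BRAF': {'variant': 'V600E', 'drug': 'vemurafenib'}}
    for gene in genes:
        if gene in stub:
            yield {'gene': gene, 'raw': stub[gene]}


def convert(gene_data):
    """Map a raw record onto the feature_association shape used upstream."""
    yield {
        'gene': gene_data['gene'],
        'feature': {'geneSymbol': gene_data['gene']},
        'association': {'description': gene_data['raw']['variant']},
        'source': 'my_source',
        'my_source': gene_data['raw'],   # original data kept for queries
    }


def harvest_and_convert(genes):
    """Get data from the source, convert it to GA4GH and return via yield."""
    for gene_data in harvest(genes):
        for feature_association in convert(gene_data):
            yield feature_association
```

Because the interface is duck typed, harvester.py only needs the module to expose these three callables.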

How do I test it?

$ cd harvester
$ pytest -s -v
======================================================================================================================================================= test session starts ========================================================================================================================================================
platform darwin -- Python 2.7.13, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /usr/local/opt/python/bin/python2.7
cachedir: ../../.cache
rootdir: /Users/walsbr, inifile:
collected 13 items

tests/integration/test_elastic_silo.py::test_args PASSED
tests/integration/test_elastic_silo.py::test_init PASSED
tests/integration/test_elastic_silo.py::test_save PASSED
tests/integration/test_elastic_silo.py::test_delete_all PASSED
tests/integration/test_elastic_silo.py::test_delete_source PASSED
tests/integration/test_kafka_silo.py::test_populate_args PASSED
tests/integration/test_kafka_silo.py::test_init PASSED
tests/integration/test_kafka_silo.py::test_save PASSED
tests/integration/test_pb_deserialize.py::test_civic_pb PASSED
tests/integration/test_pb_deserialize.py::test_jax_pb PASSED
tests/integration/test_pb_deserialize.py::test_oncokb_pb PASSED
tests/integration/test_pb_deserialize.py::test_molecular_match_pb PASSED
tests/integration/test_pb_deserialize.py::test_cgi_pb PASSED

How do I launch the database, bring up the website, etc. ?

There is a docker compose configuration file in the root directory.

Launch it by:

ELASTIC_PORT=9200 KIBANA_PORT=5601 docker-compose up -d

This will automatically download Elasticsearch, Kibana, etc., and will expose the standard Elasticsearch and Kibana ports (9200 and 5601).

If you would like to host an instance, launch docker-compose with an additional nginx file.

docker-compose -f docker-compose.yml -f cloud-setup/docker-compose-nginx.yml up -d

This will do the same setup, but will also include an nginx proxy to map http and https ports.

Our demo site is hosted on AWS and includes the API server and nginx proxy:

docker-compose -f docker-compose-aws.yml up -d

As a convenience, there is a Jupyter image for notebook analysis:

docker-compose -f docker-compose.yml -f docker-compose-jupyter.yml up -d

What else do I need to know?

OK, I get it. But what about .... ?

NEXT STEPS

  • Work with users, gather feedback
  • Load alternative data sources [Literome, Ensembl]
  • Load smmart drugs [Olaparib, Folfox, Pembrolizumab, …]
  • Integrate with bmeg (machine learning evidence)
  • Improve data normalization
    • Variant naming (HGVS)
    • Ontologies (diseases, drugs, variants)
  • Add GA4GH::G2P api (or successor)
  • Harden prototype:
    • python notebook
    • web app (deprecate kibana UI)

g2p-aggregator's People

Contributors

ahwagner, bwalsh, grmayfie, shane-neeley


g2p-aggregator's Issues

normalizer adds drugs improperly for cgi

For example, CGI has a MUT alteration association for ABL1 in CML that confers resistance to Imatinib. However, our normalizer parses this to be several drugs, including Imatinib (correct), inhibitor (incorrect, and a mismatch to the associated compound ID CID657356, which is named "Diacylglycerol Kinase Inhibitor Ii"), and Clorazepate Dipotassium (incorrect, and not sure why this got added).
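
One possible mitigation, sketched below, is to strip the parenthetical drug-family annotation CGI appends (e.g. "(BCR-ABL inhibitor)") before normalization, so that tokens like "inhibitor" never reach the drug lookup. The function name is hypothetical:

```python
import re


def primary_drug_name(drug_full_name):
    """Drop the '(... inhibitor)' family annotation CGI appends to drug
    names, leaving only the compound name for the normalizer."""
    return re.sub(r'\s*\([^)]*\)', '', drug_full_name).strip()
```

This keeps "Imatinib" from "Imatinib (BCR-ABL inhibitor)"; it does not explain the Clorazepate Dipotassium match, which needs separate investigation.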

Variant normalization: cannot add documents with non-numeric chromosome

There are several entries in CGI that are on chromosome X, but currently these cannot be added to ElasticSearch. They fail with a parsing error:

TransportError(400, u'mapper_parsing_exception', u'failed to parse [feature.chromosome]')
{'cgi': '{"Targeting": "", "Biomarker": "G6PD (S218F)", "Source": "FDA", "cDNA": "c.653C>T", "Primary Tumor type": "CANCER", "individual_mutation": "G6PD:p.S218F", "Drug full name": "Dabrafenib (BRAF inhibitor)", "Association": "Increased Toxicity (Haemolytic Anemia)", "Drug family": "BRAF inhibitor", "Curator": "DTamborero;CRubio-Perez", "Drug": "Dabrafenib", "Alteration": "G6PD:S218F", "gDNA": "chrX:g.153762634G>A", "Drug status": "Approved", "Gene": "G6PD", "transcript": "ENST00000393562", "strand": "-", "info": "CSQN=Missense;reference_codon=TCC;candidate_codons=TTT,TTC;candidate_mnv_variants=chrX:g.153762633_153762634delGGinsAA;aliases=ENSP00000377192;source=Ensembl", "Assay type": "", "Alteration type": "MUT", "region": "inside_[cds_in_exon_6]", "Evidence level": "FDA guidelines", "gene": "G6PD", "Metastatic Tumor Type": ""}', 'tags': [], 'feature': {'name': 'G6PD (S218F)', 'start': '153762634', 'geneSymbol': 'G6PD', 'alt': 'A', 'ref': 'G', 'chromosome': 'X', 'description': 'G6PD:S218F'}, 'source': 'cgi', 'gene': 'G6PD', 'association': {'drug_labels': 'Dabrafenib (BRAF inhibitor)', 'description': 'G6PD Dabrafenib (BRAF inhibitor) Increased Toxicity (Haemolytic Anemia)', 'publication_url': 'https://www.google.com/#q=FDA', 'evidence': [{'info': {'publications': ['https://www.google.com/#q=FDA']}, 'evidenceType': {'sourceName': 'cgi'}, 'description': 'Increased Toxicity (Haemolytic Anemia)'}], 'environmentalContexts': [{'description': 'Dabrafenib (BRAF inhibitor)'}], 'evidence_label': 'Increased Toxicity (Haemolytic Anemia) FDA guidelines', 'phenotype': {'description': 'CANCER'}}}
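
A fix would be to validate chromosome values against the full human set rather than assuming integers, and index the field as a keyword. A sketch (the function name is hypothetical):

```python
# Human chromosomes as keyword strings: '1'..'22', plus X, Y, MT.
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {'X', 'Y', 'MT'}


def normalize_chromosome(value):
    """Return the chromosome as a keyword string so the ElasticSearch
    mapping can index 'X' and 'Y' alongside '1'..'22'."""
    chrom = str(value).upper().replace('CHR', '')
    if chrom == 'M':
        chrom = 'MT'
    if chrom not in VALID_CHROMOSOMES:
        raise ValueError('unrecognized chromosome: {}'.format(value))
    return chrom
```

The ElasticSearch index mapping for feature.chromosome would also need to declare the field as a keyword/string type for this to work.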

importer fails

There's an issue with the normalization routines that causes them to fail on importing some data. From the run log:

2017-08-01 00:38:41,671 - root - ERROR - string indices must be integers
Traceback (most recent call last):
  File "/Users/awagner/Workspace/git/g2p-aggregator/harvester/drug_normalizer.py", line 211, in normalize
    drugs = normalize_biothings(name)
  File "/Users/awagner/Workspace/git/g2p-aggregator/harvester/drug_normalizer.py", line 134, in normalize_biothings
    if product['approved'] == 'true':
TypeError: string indices must be integers
Traceback (most recent call last):
  File "harvester.py", line 150, in <module>
    main()
  File "harvester.py", line 136, in main
    normalize(feature_association)
  File "harvester.py", line 124, in normalize
    drug_normalizer.normalize_feature_association(feature_association)
  File "/Users/awagner/Workspace/git/g2p-aggregator/harvester/drug_normalizer.py", line 240, in normalize_feature_association
    ctx_drugs = normalize(ctx['description'])
  File "/Users/awagner/Workspace/git/g2p-aggregator/harvester/drug_normalizer.py", line 227, in normalize
    raise e
TypeError: string indices must be integers
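
The traceback suggests `product` is sometimes a string rather than a dict when iterating the biothings response. A defensive guard in the same spirit (the function is a sketch; names mirror the traceback, not the actual drug_normalizer.py code):

```python
def approved_products(products):
    """Filter biothings 'products' entries, skipping any that are not
    dicts (indexing a string with ['approved'] raises the TypeError
    seen in the traceback above)."""
    approved = []
    for product in products:
        if not isinstance(product, dict):
            continue  # malformed entry; skip rather than crash the import
        if product.get('approved') == 'true':
            approved.append(product)
    return approved
```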

cgi importer requires dictionary for phenotype

Currently, CGI uses abbreviations that are not recognized by our normalization routines as phenotypes, which leads to high normalization failure rates.

Contact David Tamborero for mappings.

python notebook (overlaps and other metrics)

@ahwagner Hi. During our last call, you showed a notebook that illustrated overlaps and other metrics? Is it possible to share it? We can either incorporate it as-is or, perhaps we can adapt it into a test. Let me know your thoughts.

nginx_g2p requires OHSU certificate

The current docker container for nginx (nginx_g2p) requires the use of an OHSU certificate (/compbio-tls/compbio_ohsu_edu_cert.cer). While this makes sense for the live version hosted by OHSU, it causes docker-compose to fail for off-site users.

@bwalsh, would you update the image for locale-agnostic instantiation? Or am I missing the way to adjust this for a local instance?

review jax lab gene names

In jax parsing, the molecular profile field is parsed to lookup variant and add position information to the association. See here

I believe not all permutations have been accounted for. For example, in errant cases gene names are parsed as ['over', 'mut', 'inact', 'exp', 'exon19', '+',...]

I believe the molecular profile field needs to be parsed, not just split.

The test case below illustrates one possible approach. While not very elegant, the parse() method should ensure that miscellaneous strings are handled.

Currently, it seems these 'variant name' strings vary between knowledge bases. Therefore, I think this type of parsing belongs local to each harvester.

@jgoecks @mayfielg Let me know your thoughts

molecular_profiles = [
    "APC inact mut KRAS G12D",
    "APC mutant BRAF mutant PIK3CA mutant SMAD4 mutant TP53 mutant",
    "BRAF V600E EGFR amp",
    "BRAF V600E MAP2K1 L115P",
    "BRAF V600E NRAS Q61K NRAS A146T MAP2K1 P387S",
    "BRAF amp BRAF V600X NRAS Q61K",
    "CDKN2A mut MET del exon14 PDGFRA mut SMAD4 Q249H",
    "DNMT3A R882H FLT3 Y599_D600insSTDNEYFYVDFREYEY NPM1 W288fs",
    "EGFR E746_A750del EGFR T790M EGFR L718Q",
    "EGFR exon 19 del MET amp MET D1228V",
    "ERBB2 over exp PIK3CA H1047R SRC over exp",
    "ETV6 - JAK2 JAK2 G831R",
    "FGFR2 E565A FGFR2 K659M FGFR2 N549H FGFR2 N549K FGFR2 V564F FGFR2-ZMYM4",
    "FGFR2 N550K PIK3CA I20M PIK3CA P539R PTEN R130Q PTEN T321fs*23",
    "FGFR3 wild-type FGFR3 dec exp HRAS G12V",
    "FLT3 exon 14 ins FLT3 D835N",
    "FLT3 exon 14 ins FLT3 F691L FLT3 D698N",
    "FLT3 exon 14 ins FLT3 M837G FLT3 S838R FLT3 D839H",
    "JAK2 over exp MPL over exp",
    "KRAS G12D PIK3CA E545K PIK3CA H1047L TP53 wild-type",
    "KRAS G12D PTEN dec exp TP53 R306*",
    "KRAS G13C PIK3CA H1047Y PTEN G143fs*4 PTEN K267fs*9",
    "KRAS mut + TP53 wild-type",
    "MET del exon14 TP53 N30fs*14",
    "NPM1-ALK ALK L1196M ALK D1203N",
]


def parse(molecular_profile):
    """ returns gene, tuples[] """
    parts = molecular_profile.split()
    gene = None
    gene_complete = True

    tuples = []
    tuple = None
    for idx, part in enumerate(parts):
        if not gene:
            gene = part
        # deal with 'GENE - GENE '
        if part == '-':
            gene += part
            gene_complete = False
            continue
        elif not gene_complete:
            gene += part
            gene_complete = True
            continue

        if not tuple:
            tuple = []
        # build first tuple
        if len(tuples) == 0:
            if len(tuple) == 0:
                tuple.append(gene)
        if idx == 0:
            continue

        # ignore standalone plus
        if not part == '+':
            tuple.append(part)

        # we know there is at least one more to fetch before terminating tuple
        if len(tuple) == 1 and idx < len(parts)-1:
            continue

        # is the current tuple complete?
        if (
                (len(tuple) > 1 and part.isupper()) or
                idx == len(parts)-1 or
                parts[idx+1].isupper()
           ):
                tuples.append(tuple)
                tuple = None

    return gene, tuples


def test_parse_all():
    """ just loop through all test profiles, ensure no exceptions """
    genes = []
    for molecular_profile in molecular_profiles:
        genes.append(parse(molecular_profile)[0])


def test_parse_fusion():
    """ make sure we handle fusion format """
    gene, tuples = parse("ETV6 - JAK2")
    assert "ETV6-JAK2" == gene


def test_parse_simple():
    """ make sure we handle fusion format """
    gene, tuples = parse("BRAF V600E")
    assert "BRAF" == gene
    assert tuples == [["BRAF", "V600E"]]


def test_parse_simple_annotated():
    """ make sure we 'annotations' on gene """
    gene, tuples = parse("MET del exon14")
    assert "MET" == gene
    assert tuples == [["MET", "del", "exon14"]]


def test_parse_compound_annotated():
    """ make sure we 'annotations' on gene and others """
    gene, tuples = parse("MET del exon14 TP53 N30fs*14")
    assert "MET" == gene
    assert tuples == [["MET", "del", "exon14"], ["TP53", "N30fs*14"]]


def test_parse_mixed_annotated_compound():
    """ make sure we handle fusion format """
    gene, tuples = parse("CDKN2A mut MET del exon14 PDGFRA mut SMAD4 Q249H")
    assert "CDKN2A" == gene
    assert tuples == [["CDKN2A", "mut"],
                      ["MET", "del", "exon14"],
                      ["PDGFRA", "mut"],
                      ["SMAD4", "Q249H"]]


def test_parse_terminate_with_fusion():
    """ make sure we handle fusion format in last tuple"""
    gene, tuples = parse("FGFR2 E565A FGFR2 K659M FGFR2 N549H FGFR2 N549K FGFR2 V564F FGFR2-ZMYM4")  # NOQA
    assert "FGFR2" == gene
    assert tuples == [["FGFR2", "E565A"],
                      ["FGFR2", "K659M"],
                      ["FGFR2", "N549H"],
                      ["FGFR2", "N549K"],
                      ["FGFR2", "V564F"],
                      ["FGFR2-ZMYM4"],
                      ]


def test_plus_sign():
    """ make sure we handle fusion format in last tuple"""
    gene, tuples = parse("KRAS mut + TP53 wild-type")  # NOQA
    assert "KRAS" == gene
    assert tuples == [["KRAS", "mut"],
                      ["TP53", "wild-type"]]

Oncokb code review

A topic was raised during this week's call regarding the count of evidence records from oncokb.

  • Overview
    The database captures evidence using the same API that provisions the Actionable Genes pages.

  • Harvester Pseudo code

    • Get the list of genes ( currently 476 unique genes )
    • For each gene:
      • Get the clinical variants using hugoSymbol
        • For each clinical variant
          • Convert to GA4GH and persist ( 375 unique clinical evidence items )


gene evidence_count
KIT 98
BRAF 45
PDGFRA 43
FGFR3 27
EGFR 24
MET 22
IDH1 20
ERBB2 7
MTOR 6
ALK 5
IDH2 5
KRAS 5
NRAS 5
PIK3CA 5
FGFR2 4
MAP2K1 4
ESR1 3
NF1 3
PDGFRB 3
PTEN 3
ABL1 2
AKT1 2
ARAF 2
ATM 2
BRCA1 2
BRCA2 2
CDK4 2
CDKN2A 2
FGFR1 2
MDM2 2
NTRK1 2
PTCH1 2
RET 2
ROS1 2
TSC1 2
ERCC2 1
EZH2 1
FLT3 1
JAK2 1
NTRK2 1
NTRK3 1
RAF1 1
TSC2 1

civic evidence_labels

Brian, in your slide showing the break down of evidence categories, it looked like CIViC was contributing almost entirely in the “preclinical” category.
CIViC has lots of “case study” and “clinical” entries, but those two bars had nothing from CIViC?

Move pmkb from web scrape to API calls

Currently, the pmkb data is harvested by screen scraping. See harvester/pmkb.py
Cornell has provided API documentation. It would be useful to move the code over to using the API.

From LIH3001 AT med DOT cornell DOT edu
Michael Mienko, who left about a month ago, helped me set up some new endpoints on PMKB as follows:

/api/genes
/api/genes/{gene_id}
/api/interpretations
/api/interpretations/{interpretation_id}
/api/tissues
/api/tissues/{tissue_id}
/api/tumors
/api/tumors/{tumor_id}
/api/variants
/api/variants/{variant_id}
/api/health_check
/api/search?query={query} # Endpoint for the full-text search bar in PMKB

Right now these are behind basic authentication. I can add a username to the PMKB website if you'd like to try them.

There is another more specialized endpoint for PMKB that allows you to look up variants and interpretations with more detail. You've probably seen this example code before, but I've attached it again in api_reference.py, plus the following write-up:

params = {
        "gene": "KRAS",
        "aa_change": "G12A",
        "dna_change": "35G>C",
        "exons": "2",
        "tumor": "Adenocarcinoma",
        "tissue": "Lung",
        "transcript": "ENST00000256078"
    }
p = json.dumps([params])
url = 'https://pmkb.weill.cornell.edu/api/lookups'
headers = {
   'Content-Type': 'application/json',
}
response = requests.post(url, data=p, headers=headers)

(This particular endpoint is not behind authorization.)

Right now there are 8 levels of relevance:
1. An exact HGVS notation match, e.g. BRAF V600E matches BRAF V600E
2. A partial HGVS notation match, e.g. BRAF V600E matches BRAF V600
3. Codon match and variant type is specific, e.g. BRAF V600E matches BRAF codon 599-601 missense
4. Codon match and variant type is not specific, e.g. BRAF V600E matches BRAF codon 599-601 any mutation
5. Exon match and variant type is specific, e.g. BRAF V600E matches BRAF exon 15 missense
6. Exon match and variant type is not specific, e.g. BRAF V600E matches BRAF exon 15 any mutation
7. Gene match and variant type is specific, e.g. BRAF V600E matches BRAF any missense
8. Gene match and variant type is not specific, e.g. BRAF V600E matches BRAF any mutation


api_reference.py

import json
import requests

params = {
    "gene": "EGFR",
    "aa_change": "p.(=)",
    "dna_change": "c.2573G>A",
    "exons": "2", # optional
    "tumor": "Adenocarcinoma",
    "tissue": "Lung",
    "variant_type": "silent",
    "transcript": "ENST00000256078" #optional
}
params_cnv = {
    "variant_type": "CNV",
    "gene": "CDKN2A",
    "cnv_type": "loss",
    "tumor": "Squamous Cell Carcinoma",
    "tissue": "Lung"
}
params_fusion = {
    "variant_type": "rearrangement",
    "gene": "ERG",
    "partner_gene": "TMPRSS2",
    "tumor": "Adenocarcinoma",
    "tissue": "Prostate"
}

p = json.dumps([params])
url = 'https://pmkb.weill.cornell.edu/api/lookups'
headers = {
    'Content-Type': 'application/json'
}
response = requests.post(url, data=p, headers=headers)
print response.text.encode('utf-8')
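
For reference, the lookup payload can be assembled and checked without sending the request. PMKB's /api/lookups endpoint expects a JSON array of parameter dicts, as in the attached script; this small helper (the function name is hypothetical) just builds that body:

```python
import json


def build_lookup_payload(**params):
    """Serialize lookup parameters as the JSON array of dicts that
    PMKB's /api/lookups endpoint expects."""
    return json.dumps([params])
```

A POST with `data=build_lookup_payload(gene='KRAS', aa_change='G12A')` and a `Content-Type: application/json` header would then match the example above.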

sources have inconsistent structures for associations

Two differences observed so far using the following code:

double_listed_refs = Counter()
single_listed_refs = Counter()

listed_evidence = Counter()
single_evidence = Counter()
for hit in res['hits']['hits']:
    if isinstance(hit['_source']['association']['evidence'], list):
        listed_evidence[hit['_source']['source']] += 1
        for evidence in hit['_source']['association']['evidence']:
            for pmid_url in evidence['info']['publications']:
                if isinstance(pmid_url, list):
                    double_listed_refs[hit['_source']['source']] += 1
                else:
                    single_listed_refs[hit['_source']['source']] += 1
    else:
        single_evidence[hit['_source']['source']] += 1
        evidence = hit['_source']['association']['evidence']
        for pmid_url in evidence['info']['publications']:
            if isinstance(pmid_url, list):
                double_listed_refs[hit['_source']['source']] += 1
            else:
                single_listed_refs[hit['_source']['source']] += 1

double_listed_refs are jax, oncokb, and pmkb. single_listed_refs are cgi and civic. single_evidence is civic, the others are all listed_evidence. I'm handling these inconsistencies with transformations in my analysis, but it would be good to unify them under single_listed and listed_evidence.
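
A sketch of the suggested unification, coercing every hit's association to the listed_evidence / single_listed shape (the function name is hypothetical; the nested structure follows the code above):

```python
def unify_association(association):
    """Ensure 'evidence' is always a list and every entry in
    info.publications is a flat string, not a nested list."""
    evidence = association['evidence']
    if not isinstance(evidence, list):
        evidence = [evidence]          # single_evidence -> listed_evidence
    for item in evidence:
        flat = []
        for pub in item['info']['publications']:
            if isinstance(pub, list):
                flat.extend(pub)       # double_listed -> single_listed
            else:
                flat.append(pub)
        item['info']['publications'] = flat
    association['evidence'] = evidence
    return association
```

Running this at harvest time (or as a one-off re-index) would let downstream analyses drop the per-source transformations.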

GENIE analysis: in line figures

I've made two small changes to the GENIE analysis figures
see
https://github.com/ohsu-comp-bio/g2p-aggregator/blob/v0.7/notebooks/GENIE_Analysis.ipynb

stacked bars

rotated x axis

%%opts Bars.Stacked [stack_index=1 title_format='GENIE coverage' height=600 width=600 legend_position='top' xrotation=90 ]

...

bars.relabel(group='Stacked')


help needed

One challenge with the Genie notebook is that images are not persisted with the notebook.
This change is purported to work, but hasn't so far.
@jgoecks @mayfielg

from bokeh.resources import INLINE
from bokeh.io import output_notebook
output_notebook(resources=INLINE)

fix duplicates caused by splitting features

Re. duplicates - here is a quick dashboard to show them. https://dms-dev.compbio.ohsu.edu/kibana/app/kibana#/dashboard/7a678fb0-789f-11e7-81e2-c5c499f34804

[4:23]
AFAIK, they seem to be a side effect of taking a single entry from the source and splitting it up into different associations. The most egregious example is: ABL1 (I242T,M244V,K247R,L248V,G250E,G250R,Q252R,Q252H,Y253F,Y253H,E255K,E255V,M237V,E258D,W261L,L273M,E275K,E275Q,D276G,T277A,E279K,V280A,V289A,V289I,E292V,E292Q,I293V,L298V,V299L,F311L,F311I,T315I,F317L,F317V,F317I,F317C,Y320C,L324Q,Y342H,M343T,A344V,A350V,M351T,E355D,E355G,E355A,F359V,F359I,F359C,F359L,D363Y,L364I,A365V,A366G,L370P,V371A,E373K,V379I,A380T,F382L,L384M,L387M,L387F,L387V,M388L,Y393C,H396P,H396R,H396A,A397P,S417F,S417Y,I418S,I418V,A433T,S438C,E450K,E450G,E450A,E450V,E453K,E453G,E453A,E453V,E459K,E459G,E459A,E459V,M472I,P480L,F486S,E507G) Which created 92 associations. I believe this is a bug/feature of the harvesters (edited)
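
A dedupe pass keyed on the fields that define an association could catch these before indexing. The choice of key fields below is an assumption, as are the function names:

```python
import hashlib
import json


def association_key(feature_association):
    """Hash the fields assumed to uniquely identify an association."""
    identity = {
        'source': feature_association.get('source'),
        'gene': feature_association.get('gene'),
        'feature': feature_association.get('feature'),
        'description': feature_association.get('association', {}).get('description'),
    }
    return hashlib.sha1(
        json.dumps(identity, sort_keys=True).encode()).hexdigest()


def dedupe(feature_associations):
    """Yield each distinct association once, in first-seen order."""
    seen = set()
    for fa in feature_associations:
        key = association_key(fa)
        if key not in seen:
            seen.add(key)
            yield fa
```

This would not undo the upstream splitting itself; cases like the 92-variant ABL1 entry would still need the harvester to decide whether one association per variant is intended.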

Technical input to VICC paper (draft notes)

@jgoecks @mayfielg @ahwagner : See my notes below. I simply wanted to capture them here and if it's appropriate, move them over to google docs.


G2P - Knowledgebase integration

Abstract

Background

The g2p-aggregator is a bioinformatics tool designed to integrate evidence from disparate data sources and support interpretation, prioritization, and report generation. It is implemented by Oregon Health & Science University (OHSU), and integrates evidence from:

The g2p-aggregator project has created an open source suite based on the GA4GH schemas, which can be efficiently interrogated to find sets of relevant evidence through a search API.

Methods

The g2p-aggregator integrates data coming from multiple knowledgebases and allows users to query a harmonized result set. The harmonization consists of structural and ontology mapping. Structural mapping manipulates the input stream into a GA4GH genomic feature association. The content of the original data source is maintained for queries and is returned in result sets. Ontology mapping is targeted at variants, environment (drugs), phenotypes (diseases), and evidence metadata such as evidence strength and direction. The system then stores evidence in a variety of possible stores [elastic search, kafka message queues, RDBMS or files]. Full text search, aggregations, and a GA4GH beacon are provided via the elastic search store. Integration with downstream systems is provided by the kafka or file system store.

Our central theme was to provide a robust search facility, giving a focus to the conversation between the researcher and the aggregated evidence. Researchers should expect a rich search experience and should be able to make judgments about the applicability of the evidence set to their research question based on the quality of one or two sets of search results.

Results

The g2p-aggregator manages data of 9 knowledgebases with a total count of over 25K evidence and clinical trial items distributed over:

  • 9 knowledge bases
  • 513 Genes, 7221 unique Locations
  • 383 Diseases, 185 unique Disease Ontologies
  • 1119 Drugs, 898 unique pubchem identifiers
  • 7905 unique publications

Conclusions

G2p-aggregator is a useful demonstration of how web-scale, open source architectures and components can be implemented to support translational research. The next steps of our project will involve extending its capabilities by implementing new plug-ins devoted to bioinformatics data analysis as well as a temporal query module. For researchers who need to investigate genomic events, g2p is a search tool that aggregates evidence from several knowledge bases; unlike ad-hoc searches, the product allows the researcher to focus on the evidence, not on the search. For informaticians who need to annotate genomic events, g2p provides a query API for any pipeline to gather evidence 'hits'; unlike current practices, which have not focused on evidence, the product allows the informatician to identify, filter, and sort genomic events based on evidence.

Background

The GA4GH schema provides structures for unambiguous references to ontological concepts and/or controlled vocabularies defined by the GA4GH G2P schemas and the individual data sources. The system's harmonization process follows the intent expressed by the original G2P task team:

Where a G2P association is between the G(enotype) in the context of
some E(environment), which gives rise to a P(henotype). These
associations have further evidence, provenance, and attribution.
We leverage the GenomicFeature in the sequenceAnnotation schema here
as it can accommodate any genomic feature from a single nucleotide variation
(SNV), up through a gene, and/or complex rearrangements. Each can
be modeled as genomic features, and generally linked to a phenotype.
Collections of these features can represent a genotype at different levels
of completeness. Therefore, we can represent single allelic variation,
allelic complement, and multiple variants in a genotype that can each or
collectively be associated with a phenotype.
To enable standardized integration, this schema relies heavily on
OntologyTerms, for typing phenotype, genomic features, and levels
of evidence.

Methods

Harvester

A harvester is a python module that implements this duck typing interface.


  • harvest: A fairly straightforward mechanism to use the knowledgebase's access method (API, file download, etc.) to retrieve the evidence items in their native format.

  • convert: Each harvester needs to map and harmonize the evidence presented to a GA4GH FeatureAssociation. This function is supported by several helper methods:

    • A normalized vocabulary for evidence_level which harmonizes the source to AMP/ASCO/CAP guidelines.
    • An alias and lookup service for genotype that leverages a webservice provided by EBI to look up the human disease ontology
    • A lookup service for environment that leverages a webservice provided by Biothings to look up pubchem, chebi or chembl identifiers, as well as toxicity, taxonomy and approved countries.
    • The COSMIC variant table to parse and harmonize variant location.
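
As an illustration of the first helper, a lookup table can map source-specific evidence levels onto the harmonized A-D vocabulary. The mappings below are illustrative examples only, not the project's actual table:

```python
# Illustrative mapping of (source, source-level) pairs onto the
# harmonized A-D labels (AMP/ASCO/CAP-style). The real mappings live
# in the harvesters; these entries are examples, not project data.
EVIDENCE_LEVELS = {
    ('civic', 'A'): 'A',
    ('oncokb', '1'): 'A',
    ('oncokb', '2A'): 'B',
    ('jax', 'Phase III'): 'B',
}


def evidence_label(source, level):
    """Return the harmonized label, or 'NA' when no mapping exists
    (matching the 'NA' rows in the results tables below)."""
    return EVIDENCE_LEVELS.get((source, level), 'NA')
```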

Once deployed via standard docker containers, the system extends the value of the underlying data by enabling query via a GA4GH beacon or elastic search's API. The resulting system has a minimal footprint, and is currently deployed on Amazon's free tier.

Use cases

Use cases are divided into three categories: discovery, exploration and integration.


GA4GH Beacon

The Beacon project tests the willingness of international sites to share genetic data in the simplest of all technical contexts.

Our implementation follows the beacon specification and returns meta information about the beacon and a simple evidence summary for a specific genomic location.

UI

Our current alpha UI allows the user to query using a 'google search' and then presents visualizations and the ability to drill down to the specific FeaturePhenotypeAssociation and associated evidence from the original source.

As a clinician or a genomics researcher, I may have a patient with Gastrointestinal stromal tumor, GIST, and a proposed drug for treatment, imatinib. In order to identify whether the patient would respond well to treatment with the drug, I need a list of features (e.g. genes) which are associated with the sensitivity of GIST to imatinib. Suppose I am specifically interested in a gene, KIT, which is implicated in the pathogenesis of several cancer types. I could submit a query in the form GIST AND imatinib AND KIT.

In response, I receive a list of associations involving GIST and KIT, which I can filter for instances where imatinib is mentioned. Additionally, the query can be extended by drilling down on the UI's widgets and/or by adding more full-text search terms.
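The client-side filtering step can be sketched as below. The association fields and values are illustrative assumptions, not the service's actual schema:

```python
# Toy sketch of narrowing returned associations on the client side.
# Field names and values here are illustrative assumptions.
associations = [
    {'phenotype': 'GIST', 'gene': 'KIT',
     'drugs': ['imatinib'], 'evidence_label': 'A'},
    {'phenotype': 'GIST', 'gene': 'KIT',
     'drugs': ['sunitinib'], 'evidence_label': 'B'},
]

def mentions_drug(association, drug):
    """True if the association's drug list mentions the given drug."""
    return drug.lower() in [d.lower() for d in association.get('drugs', [])]

hits = [a for a in associations if mentions_drug(a, 'imatinib')]
```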

image

API

The /analysis folder contains a Python notebook that leverages the entire knowledgebase for comparison with the GENIE database of clinical outcomes. We used the Elasticsearch DSL to abstract the low-level APIs (which remain available for use) and provide "a more convenient and idiomatic way to write and manipulate queries".

An example to simply retrieve all evidence items would be:

res = es.search(index="g2p", size=10000, body={"query": {"match_all": {}}})

Alternative APIs are available in most commonly used programming environments.
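The request bodies for both the match-all example above and keyword searches can be assembled without a live cluster. This sketch uses the standard Elasticsearch query DSL; only the "g2p" index name comes from the example above:

```python
# Build Elasticsearch request bodies for the g2p index.
def keyword_query(*terms, size=10000):
    """Return a request body: match_all when no terms are given,
    otherwise a query_string query ANDing the terms together."""
    if not terms:
        query = {'match_all': {}}
    else:
        query = {'query_string': {'query': ' AND '.join(terms)}}
    return {'size': size, 'query': query}

# e.g. es.search(index="g2p", body=keyword_query("GIST", "imatinib", "KIT"))
```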

Results

Harmonization

Evidence Level

Our first challenge was to align the diverse "strength of evidence" fields presented by different knowledgebases.

image

Detail: Evidence Label by source

source filters count
cgi A 304
cgi B 49
cgi C 563
cgi D 515
cgi NA 2
civic A 62
civic B 1131
civic C 1003
civic D 980
civic NA 5
jax A 64
jax B 54
jax C 647
jax D 2884
jax NA 3
jax_trials D 1131
jax_trials NA 3
molecularmatch A 298
molecularmatch B 73
molecularmatch C 150
molecularmatch D 500
molecularmatch_trials D 64379
molecularmatch_trials NA 885
oncokb A 114
oncokb B 116
oncokb C 69
oncokb D 97
oncokb NA 185
pmkb A 414
pmkb C 160
pmkb D 35
pmkb NA 609
sage C 33
sage D 36

Phenotype & Environment

In order to enable cross knowledgebase queries, we needed a uniform Phenotype and Environment.
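The environment half of this harmonization leans on the BioThings mychem.info query endpoint mentioned earlier. The sketch below only constructs the lookup URL; the field list requested is an illustrative assumption, not the harvester's actual configuration:

```python
from urllib.parse import urlencode

# Sketch of an environment (drug) lookup URL against the BioThings
# mychem.info service. The requested fields are assumptions.
def mychem_query_url(drug_name):
    """Build a mychem.info query URL for a drug name."""
    params = urlencode({
        'q': drug_name,
        'fields': 'chebi.id,chembl.molecule_chembl_id,pubchem.cid',
    })
    return 'https://mychem.info/v1/query?' + params
```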

image

Detail: Exceptions to phenotype and environment harmonization

source environment count
molecularmatch_trials Surgery 930
molecularmatch_trials HSCT 745
molecularmatch_trials Allotransplantation 349
molecularmatch_trials Cytotoxic T Lymphocytes 138
molecularmatch_trials RG7446 111
brca    
oncokb Debio1347 12
oncokb AP32788 5
oncokb BAY1436032 5
oncokb BGB659 2
jax N/A 39
jax MRX-2843 11
jax AZ8010 7
jax TASIN-1 7
jax BAY1187982 6
civic AMGMDS3 8
civic Chemotherapy 4
civic Adjuvant Chemotherapy 2
civic Adoptive T-cell Transfer 2
civic Antiangiogenic Therapy 2
cgi FGFR inhibitors 25
cgi PARP inhibitors 21
cgi MTOR inhibitors 17
cgi PI3K pathway inhibitors 17
cgi HDAC inhibitors 6
jax_trials IDH305 2
jax_trials INCB054828 2
jax_trials SYM004 2
jax_trials AC0010MA 1
jax_trials ALRN-6924 1
molecularmatch Sym004 5
molecularmatch 3
molecularmatch RG7446 3
molecularmatch ETC159 2
molecularmatch MEDI6469 2
pmkb    
sage mTOR inhibitors 7
source phenotype count
molecularmatch_trials Acute myeloid leukaemia, disease 1069
molecularmatch_trials HIV - Human immunodeficiency virus infection 841
molecularmatch_trials Myeloproliferative disorder 735
molecularmatch_trials Chronic lymphoid leukaemia, disease 706
molecularmatch_trials ALL - Acute lymphoblastic leukaemia 632
brca    
oncokb Soft Tissue Sarcoma 6
oncokb CNS Cancer 2
oncokb Embryonal Tumor 2
oncokb Esophagogastric Cancer 2
oncokb Esophageal/Stomach Cancer, NOS 1
jax Indication other than cancer 1
civic Desmoid Fibromatosis 9
civic T-cell Acute Lymphoblastic Leukemia 8
civic Epithelial Ovarian Cancer 5
civic Hepatocellular Fibrolamellar Carcinoma 3
civic Anaplastic Oligodendroglioma 2
cgi Renal 17
cgi Bladder BLCA 10
cgi Head an neck 9
cgi Head an neck squamous 8
cgi Myelodisplasic proliferative syndrome 8
jax_trials    
molecularmatch Metastasis from malignant melanoma of skin 2
pmkb MDS with Ring Sideroblasts 10
pmkb Glial Neoplasm 9
pmkb Histiocytic and Dendritic Cell Neoplasms 7
pmkb Langerhans Cell Histiocytosis 7
pmkb Other Tumor Type 5
sage mesothelioma 2
sage head and neck cancer 1

Genotype

Breakdown of normalized variants/biomarkers by source.
image

source filters count
brca genomic location 5733
brca no genomic location 0
cgi genomic location 589
cgi no genomic location 842
civic genomic location 2865
civic no genomic location 311
jax genomic location 3009
jax no genomic location 640
jax_trials genomic location 1051
jax_trials no genomic location 80
molecularmatch genomic location 804
molecularmatch no genomic location 217
molecularmatch_trials genomic location 0
molecularmatch_trials no genomic location 64379
oncokb genomic location 1811
oncokb no genomic location 2338
pmkb genomic location 609
pmkb no genomic location 0
sage genomic location 0
sage no genomic location 69
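The per-source coverage implied by the table above can be recomputed directly; the counts below are copied from it:

```python
# Recompute genomic-location coverage per source from the table above.
genomic_location_counts = {
    # source: (with genomic location, without genomic location)
    'civic': (2865, 311),
    'molecularmatch': (804, 217),
    'sage': (0, 69),
}

def coverage_pct(with_loc, without_loc):
    """Percent of a source's variants harmonized to a genomic location."""
    total = with_loc + without_loc
    return round(100.0 * with_loc / total, 1) if total else 0.0
```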

Analysis

GENIE Analysis: Variant Level

The AACR Project Genomics Evidence Neoplasia Information Exchange (GENIE) provides clinical targeted sequencing panel data from 8 different cancer centers.

G2P Knowledge Base coverage of GENIE at the variant level:

  • Total coverage of non-unique variants is 28%
  • Adding more databases increases total coverage
  • Different databases contribute different types of evidence
  • OncoKB and MolecularMatch contribute guideline recommendations
  • CIViC and JAX contribute substantial preclinical evidence
  • 10% of variants are associated with A-level evidence and are highly actionable
  • 22% of variants are associated with B-level evidence and are moderately actionable

image

GENIE Analysis: Donor Level

G2P coverage of GENIE donors is encouraging:

  • 42% of donors have 1+ actionable variant
  • 48% of donors with non-unique variant(s) also have 1+ actionable variant
  • Adding more databases increases total coverage
  • 25% of donors have a variant with A-level evidence; 42% have a variant with B-level evidence

image

Discussion

//TODO

Conclusions

//TODO

List of abbreviations used

//TODO

Declarations

//TODO

Acknowledgements

//TODO

Electronic supplementary material

//TODO

References

//TODO

Disease normalization

@ahwagner

Alex

Hi. I was hoping you might have some insight into an alias or lookup table we can use to get around this issue.

Out of the 14,780 evidence items, we have 5,111 with no matched phenotype.
Of those 5,111 items, the 32 phenotypes below each account for more than 10 entries, for a total of 4,635.

Do you have any insight into what alias we might use for these items?

Thanks,

--

phenotype count
Advanced Solid Tumor 1571
Solid tumor 634
Malignant neoplastic disease 423
Neoplasm of respiratory tract 417
Neoplasm of respiratory system 407
Neoplasia 153
Any cancer type 142
Neoplasm of colon 131
Neoplasm of rectum 114
Neoplasm of digestive system 105
Non-small cell lung 66
Primary malignant neoplasm of intrathoracic organs 65
Gastrointestinal stromal 60
Primary malignant neoplasm of lung 46
Ovary 29
T Lymphoblastic Leukemia/Lymphoma 28
B Lymphoblastic Leukemia/Lymphoma 21
Malignant tumour of soft tissue 21
Chronic Myelomonocytic Leukemia 20
Primary malignant neoplasm of bone marrow 19
Malignant peripheral nerve sheat tumor 18
Lung squamous cell 17
Urothelial Carcinoma 17
Neoplasm of breast 16
Renal 16
Diffuse Large B Cell Lymphoma 13
Neoplasm of digestive tract 12
Neoplasm of intra-abdominal organs 12
Thyroid 12
All Tumors 10
Bladder BLCA 10
MDS with Ring Sideroblasts 10

Normalize Drugs

As a researcher, in order to evaluate evidence from different sources, I need to see a uniform drug name and identifier.

Capture usage and user activities

To understand how the system is used, and eventually to learn from that usage, we should instrument the user interface as much as possible. This includes:

  • recording search terms
  • recording when and how a user clicks on UI elements to filter results.

PostgreSQL backend?

@bwalsh, now that we're moving to normalize on a number of concepts, I think it would make sense to transition the backend to SQL. This will require more work up front, but should drastically improve the results.

I'd like to hear your thoughts (and @obigriffith, @malachig, @AAMargolin's) on the subject in this issue.

harmonize location information?

Problem statement:

How do we harmonize location information?
For example: for entries without genomic location specifics, is it possible to retrieve the appropriate fields and append them to the evidence record?

Worked example

Evidence without location information

Original, from the source:
https://civic.genome.wustl.edu/events/genes/58/summary/variants/1970/summary#variant

In g2p:
https://g2p-ohsu.ddns.net/_plugin/kibana/app/kibana#/doc/associations/associations-new/association?id=AV6bnNSKd2hRurWfSY2g&_g=()

Can we take the gene and variant info and deduce more?

Methodology:

  • use the gene and variant info from the source, VHL V130L (c.388G>C), to retrieve hits from ClinVar
  • if there are hits, retrieve the records using the returned idlist

Issues: what should we select from ClinVar? How do we map it to a feature?

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=V130L+%28c.388G%3EC%29&retmode=json

  {
      "header": {
          "type": "esearch",
          "version": "0.3"
      },
      "esearchresult": {
          "count": "1",
          "retmax": "1",
          "retstart": "0",
          "idlist": [
              "2229"
          ],
          "translationset": [
          ],
          "translationstack": [
              {
                  "term": "VHL[All Fields]",
                  "field": "All Fields",
                  "count": "686",
                  "explode": "N"
              },
              {
                  "term": "c0x2e388G0x3eC[All Fields]",
                  "field": "All Fields",
                  "count": "5",
                  "explode": "N"
              },
              "AND",
              "GROUP"
          ],
          "querytranslation": "VHL[All Fields] AND c0x2e388G0x3eC[All Fields]"
      }
  }
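The esearch call above can also be built programmatically, which keeps the URL-escaping of terms like c.388G>C out of the harvester's hands. The search term is taken from the worked example:

```python
from urllib.parse import urlencode

# Programmatic form of the ClinVar esearch call shown above.
EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils'

def clinvar_esearch_url(term):
    """Build a ClinVar esearch URL returning JSON for the given term."""
    params = urlencode({'db': 'clinvar', 'term': term, 'retmode': 'json'})
    return '{}/esearch.fcgi?{}'.format(EUTILS, params)

url = clinvar_esearch_url('V130L (c.388G>C)')
```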

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=2229&retmode=json'

{
   "header": {
       "type": "esummary",
       "version": "0.3"
   },
   "result": {
       "uids": [
           "2229"
       ],
       "2229": {
           "uid": "2229",
           "obj_type": "Simple",
           "accession": "",
           "accession_version": "",
           "title": "NM_000551.3(VHL):c.388G&gt;C (p.Val130Leu)",
           "variation_set": [
               {
                   "measure_id": "17268",
                   "variation_xrefs": [
                       {
                           "db_source": "UniProtKB",
                           "db_id": "P40337#VAR_005733"
                       },
                       {
                           "db_source": "OMIM",
                           "db_id": "608537.0021"
                       },
                       {
                           "db_source": "dbSNP",
                           "db_id": "104893830"
                       }
                   ],
                   "variation_name": "NM_000551.3(VHL):c.388G&gt;C (p.Val130Leu)",
                   "cdna_change": "c.388G&gt;C (p.Val130Leu)",
                   "aliases": [
                   ],
                   "variation_loc": [
                       {
                           "status": "current",
                           "assembly_name": "GRCh38",
                           "chr": "3",
                           "band": "3p25;3p25.3",
                           "start": "10146561",
                           "stop": "10146561",
                           "inner_start": "",
                           "inner_stop": "",
                           "outer_start": "",
                           "outer_stop": "",
                           "display_start": "10146561",
                           "display_stop": "10146561",
                           "assembly_acc_ver": "GCF_000001405.33",
                           "annotation_release": "",
                           "alt": "C",
                           "ref": "G"
                       },
                       {
                           "status": "previous",
                           "assembly_name": "GRCh37",
                           "chr": "3",
                           "band": "3p25;3p25.3",
                           "start": "10188245",
                           "stop": "10188245",
                           "inner_start": "",
                           "inner_stop": "",
                           "outer_start": "",
                           "outer_stop": "",
                           "display_start": "10188245",
                           "display_stop": "10188245",
                           "assembly_acc_ver": "GCF_000001405.25",
                           "annotation_release": "",
                           "alt": "C",
                           "ref": "G"
                       }
                   ],
                   "allele_freq_set": [
                   ],
                   "variant_type": "single nucleotide variant"
               }
           ],
           "trait_set": [
               {
                   "trait_xrefs": [
                       {
                           "db_source": "Gene",
                           "db_id": "8056"
                       },
                       {
                           "db_source": "MedGen",
                           "db_id": "C1837915"
                       },
                       {
                           "db_source": "Orphanet",
                           "db_id": "238557"
                       },
                       {
                           "db_source": "OMIM",
                           "db_id": "263400"
                       }
                   ],
                   "trait_name": "Erythrocytosis, familial, 2"
               },
               {
                   "trait_xrefs": [
                       {
                           "db_source": "MedGen",
                           "db_id": "C0019562"
                       },
                       {
                           "db_source": "Orphanet",
                           "db_id": "892"
                       },
                       {
                           "db_source": "OMIM",
                           "db_id": "193300"
                       }
                   ],
                   "trait_name": "Von Hippel-Lindau syndrome"
               },
               {
                   "trait_xrefs": [
                       {
                           "db_source": "MedGen",
                           "db_id": "C0027672"
                       }
                   ],
                   "trait_name": "Hereditary cancer-predisposing syndrome"
               }
           ],
           "supporting_submissions": {
               "scv": [
                   "SCV000053262",
                   "SCV000580968",
                   "SCV000022475",
                   "SCV000264729"
               ],
               "rcv": [
                   "RCV000030586",
                   "RCV000002317",
                   "RCV000492250"
               ]
           },
           "clinical_significance": {
               "description": "Pathogenic",
               "last_evaluated": "2016/08/16 00:00",
               "review_status": "criteria provided, multiple submitters, no conflicts"
           },
           "record_status": "",
           "gene_sort": "VHL",
           "chr_sort": "03",
           "location_sort": "00000000000010146561",
           "variation_set_name": "",
           "variation_set_id": "",
           "genes": [
               {
                   "symbol": "VHL",
                   "geneid": "7428",
                   "strand": "+",
                   "source": "submitted"
               }
           ]
       }
   }
}
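Once the esummary payload above is parsed (e.g. with json.loads), the coordinates a feature record needs sit inside variation_loc. This is a sketch, assuming a response of the shape shown above:

```python
# Extract a GRCh37 location from a parsed ClinVar esummary response
# of the shape shown above.
def grch37_location(summary, uid):
    """Return chr/start/ref/alt for the GRCh37 assembly, or None."""
    record = summary['result'][uid]
    for variation in record['variation_set']:
        for loc in variation['variation_loc']:
            if loc['assembly_name'] == 'GRCh37':
                return {'chr': loc['chr'], 'start': loc['start'],
                        'ref': loc['ref'], 'alt': loc['alt']}
    return None
```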

Stable source / data release

Hey @bwalsh. On Tuesday we discussed having a static data silo that we could draw upon for the paper figures. Any progress on this front?

Also, we should identify a source branch that our analyses will be based upon; there may be some merging that needs to take place here (@jgoecks, I'd like your input on this as well).

Let me know if I can help out at all with setting this up, so that we can move forward on generating the figures for the next paper call.

Normalize Clinical Significance

As a researcher or informatician, in order to search or aggregate evidence that meets a given level, I need all evidence to be tagged with a consistent vocabulary.

Use civic?
image

Researcher support

Support researcher's activities to integrate g2p into:

  • python notebooks
  • pipelines
