answercoalesce's Introduction

AnswerCoalesce

A web service and Swagger UI for the Answer Coalesce service for ARAGORN.

This service accepts a Translator reasoner standard (TRAPI) message containing answers and returns a message in the same format with the answers coalesced.

Demonstration

A live version of the API can be found here.

An example notebook demonstrating the API can be found here.

Deployment

Please pull and run the Docker container located in the Docker Hub repo: renciorg/ac.

Local Deployment

This environment expects Python version 3.8.

cd <code base>
pip install -r requirements.txt
python main.py

Docker

cd <code base>
docker-compose build
docker-compose up -d

Kubernetes configurations

Kubernetes configurations and Helm charts for this project can be found at:

https://github.com/helxplatform/translator-devops/answer-coalesce

Usage

http://"host name or IP":"port"/docs
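Once the Swagger UI is reachable, a request can also be built programmatically. The following is a minimal sketch only: the `/coalesce/<method>` route, the `property` method name, and the port `8080` are assumptions made for illustration and are not stated in this README; the Swagger UI at `/docs` lists the real routes.

```python
import json
import urllib.request


def build_coalesce_request(host, port, method, trapi_message):
    """Build (but do not send) a POST request carrying a TRAPI message.

    The "/coalesce/<method>" route is an assumption for illustration;
    the real route is listed in the Swagger UI at /docs.
    """
    url = f"http://{host}:{port}/coalesce/{method}"
    body = json.dumps(trapi_message).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, port, and method name; the message here is empty.
request = build_coalesce_request("localhost", 8080, "property", {"message": {}})
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the URL and body before contacting a live service.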

answercoalesce's People

Contributors

cbizon, kennethmorton, phillipsowen, wumirose, yaphetkg

answercoalesce's Issues

Property repeats

Run X-treats->schizophrenia through strider, omnicorp, and ranker, then send to property coalesce.

We find multiple coalescences on exactly the same properties, i.e. one set of two nodes is picked out, and then a different set of two nodes is picked out with the same properties.

My current guess is that the omnicorp edges are interfering with the coalescent opportunities, but I have not tested that.

Probably, we should ignore support edges in coalescing.

Remove smart api registrations

From the smartapi faq:

To delete an API you must be the registered owner of that API. Log in to access the user dashboard and see a list of the APIs you have registered. Click on the delete button of the API you want to delete and follow the instructions to delete.

AC has two registrations, but we don't need either. We will instead access AC exclusively via aragorn.

Predicate combinations / RO chains

Currently there are two thoughts for using edges in coalescence.

  1. Only coalesce when edge predicates are the same
  2. Ignore edge predicates

But different edges might end up doing the same thing overall. Something like

A-[upregulates]->B and A-[downregulates]->C

Suppose B-[downregulates]->D and C-[upregulates]->D. Then D should be brought in as an extra node, merging these two, since the net effect of them is the same.
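One way to check whether two chains have the same net effect is to assign each predicate a sign and multiply along the path. This is only a sketch of the idea: the sign table covers just the two predicates used in the example, and treating every other predicate as "unknown" is an assumption, not existing AC behavior.

```python
# Signs for the two predicates used in the example above; any other
# predicate is treated as having no known direction (hypothetical choice).
PREDICATE_SIGN = {"upregulates": 1, "downregulates": -1}


def net_effect(path_predicates):
    """Return +1 or -1 for the net direction of a chain, or None if unknown."""
    sign = 1
    for predicate in path_predicates:
        step = PREDICATE_SIGN.get(predicate)
        if step is None:
            return None
        sign *= step
    return sign


via_b = net_effect(["upregulates", "downregulates"])  # A -> B -> D
via_c = net_effect(["downregulates", "upregulates"])  # A -> C -> D
same_net_effect = via_b is not None and via_b == via_c
```

Here both chains come out with the same sign, so D would qualify as a shared extra node under this rule.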

Workflow Implemented

TRAPI input has a workflow section with operations that must be completed in the order specified by the workflow.

Due: July 29, 2021

Details in architecture repo Git issue here.

Increase performance of "property" coalescence

Given the following question:

{
    "nodes": [
        {
            "id": "a",
            "type": "disease",
            "curie": "MONDO:0005015"
        },
        {
            "id": "b",
            "type": "gene"
        },
        {
            "id": "c",
            "type": "chemical_substance"
        }
    ],
    "edges": [
        {
            "id": "ab",
            "source_id": "a",
            "target_id": "b",
            "type": "gene_associated_with_condition"
        },
        {
            "id": "bc",
            "source_id": "c",
            "target_id": "b",
            "type": "decreases_activity_of"
        }
    ]
}

If you run this through strider and wait for it to complete, you get roughly 500 answers. Sending this to property coalescence does not complete in 2 hours. I suspect that this could be helped by batching calls to the sqlite db, but there may be other issues.
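The batching idea could look roughly like the sketch below, which replaces one query per node with one query per batch of nodes. The table name `properties`, its columns, and the CHEBI ids in the demo data are all hypothetical; the real AC schema may differ.

```python
import sqlite3


def fetch_properties_batched(conn, curies, batch_size=500):
    """Look up properties for many curies using one query per batch.

    Assumes a hypothetical table `properties(node TEXT, property TEXT)`;
    the real AC schema may differ.
    """
    results = {curie: [] for curie in curies}
    for start in range(0, len(curies), batch_size):
        batch = curies[start:start + batch_size]
        placeholders = ",".join("?" for _ in batch)
        query = f"SELECT node, property FROM properties WHERE node IN ({placeholders})"
        for node, prop in conn.execute(query, batch):
            results[node].append(prop)
    return results


# In-memory demo database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE properties (node TEXT, property TEXT)")
conn.executemany(
    "INSERT INTO properties VALUES (?, ?)",
    [("CHEBI:1", "drug"), ("CHEBI:2", "drug"), ("CHEBI:2", "sulfonamide")],
)
found = fetch_properties_batched(conn, ["CHEBI:1", "CHEBI:2", "CHEBI:3"], batch_size=2)
```

The batch size is kept well below SQLite's limit on bound parameters per statement; whether this removes the bottleneck would still need to be measured.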

Non-enriched coalescence

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "object": "n0",
                    "subject": "n1",
                    "predicates": [
                        "biolink:entity_negatively_regulates_entity"
                    ]
                },
                "e02": {
                    "object": "n1",
                    "subject": "n2",
                    "predicates": [
                        "biolink:increases_abundance_of",
                        "biolink:increases_expression_of",
                        "biolink:increases_stability_of",
                        "biolink:increases_uptake_of",
                        "biolink:decreases_degradation_of",
                        "biolink:increases_secretion_of",
                        "biolink:increases_metabolic_processing_of",
                        "biolink:increases_folding_of",
                        "biolink:increases_localization_of",
                        "biolink:increases_synthesis_of",
                        "biolink:increases_response_to",
                        "biolink:increases_splicing_of",
                        "biolink:increases_mutation_rate_of",
                        "biolink:increases_transport_of",
                        "biolink:increases_activity_of",
                        "biolink:increases_molecular_modification_of",
                        "biolink:increases_molecular_interaction"
                    ]
                }
            },
            "nodes": {
                "n0": {
                    "ids": [
                        "NCBIGene:23221"
                    ],
                    "categories": [
                        "biolink:Gene"
                    ]
                },
                "n1": {
                    "categories": [
                        "biolink:Gene"
                    ]
                },
                "n2": {
                    "categories": [
                        "biolink:SmallMolecule"
                    ]
                }
            }
        }
    }
}

This is a two-hop query. Results: https://arax.ncats.io/?source=ARS&id=77a219c7-1d47-4da8-9df8-4da5c4579dc2

Many of the AC answers don't seem to add a new node. How is that occurring? I assume it's that the enriched node is already in the graph.

It's not a bad idea to allow this kind of thing; maybe we should do it on purpose!

Simplify?

Consider aragorn's coalesced answers to https://arax.ncats.io/?r=ba19fa4a-56a8-4868-9001-824ea2c3bced

It's a one-hop, and we find that a bunch of them all go together somehow. Two problems:

  1. The way that they go together might be boring, but because so many get pulled in it gets highly ranked
  2. It's unreadable in any interface

There are a few ways I can think of to fix this

  1. Improve ranking so that the uninteresting ones are lower ranked
  2. Improve AC so that uninteresting ones don't even show up
  3. Remove the grouping from the answers entirely and put it somewhere else. Requires a TRAPI update
  4. Simplify the result. Suppose we have a query (GeneA)-[]-(SmallMolecule) and we get two answers, SM1 and SM2, and we merge them on their connection to node Q. Then we have a diamond structure (GeneA)-(SM1)-(Q)-(SM2)-(GeneA). Can we merge SM1 and SM2 somehow? Then we would return the answer (GeneA)-(merged SMs)-(Q). This will be easier to view for sure. We could put the merged ids in attributes on the merged SM. I don't know what we'd put in for the SM identifier though. Omnicorp could recognize the merged node and provide grouped counts? Then ranker doesn't have to deal with as crazy of a parallel path.

Thoughts @kennethmorton , @raynCovar ?

Handle synonyms

The property and graph databases are using CHEBI instead of PUBCHEM.COMPOUND. The ontology version will only understand CHEBI since there's no ontology in PubChem.

The simple approach to solving most of this is just to rebuild the property and graph databases using the new robokop graph. This should also allow us to remove some temporary code fixing the names of edges.

Longer term, I think we should either have equivalent_identifiers from upstream, or we should get them from the normalizer. And then, we should use them to find the identifiers we want. This will insulate us a little bit from changes to prefix ordering in the biolink model, and allow us to do ontology reasoning even if the preferred id is non-ontological.

Provenance on GC edges

This is an edge added by GC:

{'subject': 'NCBIGene:1576', 'object': 'PUBCHEM.COMPOUND:68911', 'predicate': 'biolink:increases_degradation_of', 'attributes': [{'type': 'EDAM:data_0006', 'value': 1, 'name': 'weight'}]}

Note that there is no provenance on this edge. We need to add that information to the AC redis somehow.

Edge ids should be strings

When we create new edges, we're returning their kg_id as integers, which is out of spec.

Instead, they should be strings.
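A small sketch of the fix, assuming the new edges live in a dict keyed by edge id and that edge bindings are lists of `{"id": ...}` dicts. Both shapes are assumptions for illustration rather than AC's actual internals.

```python
def stringify_edge_ids(knowledge_graph_edges, edge_bindings):
    """Return copies of the edges and bindings with every edge id as a str.

    The argument shapes (a dict keyed by edge id, and a dict of
    qedge id -> list of {"id": ...}) are assumptions for illustration.
    """
    edges = {str(edge_id): edge for edge_id, edge in knowledge_graph_edges.items()}
    bindings = {
        qedge: [{**binding, "id": str(binding["id"])} for binding in bound]
        for qedge, bound in edge_bindings.items()
    }
    return edges, bindings


edges, bindings = stringify_edge_ids(
    {17: {"subject": "A", "object": "B"}},
    {"e01": [{"id": 17, "weight": 1}]},
)
```

Converting the knowledge graph keys and the bindings together keeps the two in sync, so a binding never points at an id that no longer exists.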

Investigate coalescence for query

I have this query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n1": {
                    "ids": ["MONDO:0004979"],
                    "categories": ["biolink:Disease"]
                },
                "n0": {
                    "categories": ["biolink:ChemicalSubstance"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:treats"]
                }
            }
        }
    }
}

I would expect it to do some answer coalescing. It returns ~200 results, but none of them are coalesced. Can you just double check that something odd is not going on?

Unique responses

Given this question:

{
    "nodes": [
        {
            "id": "a",
            "type": "disease",
            "curie": "MONDO:0005015"
        },
        {
            "id": "b",
            "type": "gene"
        },
        {
            "id": "c",
            "type": "chemical_substance"
        }
    ],
    "edges": [
        {
            "id": "ab",
            "source_id": "a",
            "target_id": "b",
            "type": "gene_associated_with_condition"
        },
        {
            "id": "bc",
            "source_id": "c",
            "target_id": "b",
            "type": "decreases_activity_of"
        }
    ]
}

Running through strider and just taking the first 20 or so chronological responses, I see repeat answers in both ontology and graph coalescence. For example, here is some response from ontology coalescence:

================
glisoxepide (CHEBI:135731)
carbutamide (CHEBI:135118)
chlorpropamide (CHEBI:3650)
tolbutamide (CHEBI:27999)
acetohexamide (CHEBI:28052)
glimepiride (CHEBI:5383)
glipizide (CHEBI:5384)
tolazamide (CHEBI:9613)
gliclazide (CHEBI:31654)
glyburide (CHEBI:5441)
----have superclass----
sulfonamide (CHEBI:35358)

tolazamide (CHEBI:9613)
gliclazide (CHEBI:31654)
chlorpropamide (CHEBI:3650)
acetohexamide (CHEBI:28052)
tolbutamide (CHEBI:27999)
glyburide (CHEBI:5441)
glimepiride (CHEBI:5383)
glipizide (CHEBI:5384)
----have superclass----
N-sulfonylurea (CHEBI:76983)

glisoxepide (CHEBI:135731)
carbutamide (CHEBI:135118)
chlorpropamide (CHEBI:3650)
tolbutamide (CHEBI:27999)
acetohexamide (CHEBI:28052)
glimepiride (CHEBI:5383)
glipizide (CHEBI:5384)
tolazamide (CHEBI:9613)
gliclazide (CHEBI:31654)
glyburide (CHEBI:5441)
----have superclass----
sulfonamide (CHEBI:35358)

tolazamide (CHEBI:9613)
gliclazide (CHEBI:31654)
chlorpropamide (CHEBI:3650)
acetohexamide (CHEBI:28052)
tolbutamide (CHEBI:27999)
glyburide (CHEBI:5441)
glimepiride (CHEBI:5383)
glipizide (CHEBI:5384)
----have superclass----
N-sulfonylurea (CHEBI:76983)

There are 4 outputs here, separated by ===== . There are actually only 2, but they are each repeated.

Reproducing this may be difficult because the output from strider is not deterministic.
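Independent of the root cause, repeats like these could be filtered before returning. The sketch below assumes each coalesced output can be summarized as a `(member_curies, relation, new_curie)` tuple, which is a hypothetical shape chosen for illustration; the member lists are shortened versions of the output above.

```python
def unique_coalescences(coalesced):
    """Drop repeats, keeping the first occurrence of each grouping.

    Each item is assumed to be (member_curies, relation, new_curie); two
    items are repeats when they agree on all three, ignoring member order.
    """
    seen = set()
    unique = []
    for members, relation, new_curie in coalesced:
        key = (frozenset(members), relation, new_curie)
        if key not in seen:
            seen.add(key)
            unique.append((members, relation, new_curie))
    return unique


outputs = [
    (["CHEBI:9613", "CHEBI:31654"], "superclass", "CHEBI:76983"),
    (["CHEBI:135731", "CHEBI:9613"], "superclass", "CHEBI:35358"),
    (["CHEBI:31654", "CHEBI:9613"], "superclass", "CHEBI:76983"),
]
deduplicated = unique_coalescences(outputs)
```

Using a frozenset for the members means the same group listed in a different order still counts as a repeat, which matches the reordered lists seen in the output.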

Generalize coalescent opportunities

Currently we can only coalesce answers that are the same in every way except that the identity of one node is different. This is ok, and understandable, but it's over-specified.

Suppose you had something like:

(A)-?-? and you had 2 answers A-B-C and A-X-Y. And suppose that C and Y were both coalescable. Unless B and X were equal you would not get a merged answer.

I think you'd like to get these. And if B and X coalesced at the same time, then you'd like to know that as well.

Things get complicated in a situation where you also have say A-M-O. And suppose that B/M coalesced, and Y/O coalesced. Do you want to see an answer like that? (NN is new node)

            B          C
            NN1
A           M          O
                       NN2
            X          Y

The options here would be to return a single answer with both or two answers independently.

I think that the best thing to do is try a few options. The simplest is to

  1. Ignore edge types
  2. Out of the global set of answers, find all coalescence at spots (all the nodes at gene1)
  3. Coalesce those into one answer, involving any answer that gets dragged into one of the coalesce groups
  4. There will be some disjoints where one set of coalescent nodes involves answers 1-3 and a different set of coalescent nodes involves answers 6-10. Those should be caught in step 3.

A more constrained version would do the same thing, but take into account edge types.
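The "dragged in" step can be sketched as merging coalescence groups that share at least one answer, which leaves disjoint groups separate. Representing each group as a set of answer indices is an assumption made for this illustration.

```python
def merge_overlapping_groups(groups):
    """Union groups of answer indices that share at least one answer."""
    merged = []
    for group in groups:
        group = set(group)
        remaining = []
        for existing in merged:
            if existing & group:
                # Shares an answer, so absorb the existing group.
                group |= existing
            else:
                remaining.append(existing)
        remaining.append(group)
        merged = remaining
    return merged


# Groups over answers 1-3 stay disjoint from groups over answers 6-10.
groups = [{1, 2}, {2, 3}, {6, 7}, {8, 9, 10}, {7, 8}]
merged_groups = merge_overlapping_groups(groups)
```

With these inputs the result is two combined answers, one covering answers 1 to 3 and one covering answers 6 to 10.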

When graph enrichment re-uses a node, edge bindings collapse

Take a query like this:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": [
                        "biolink:ChemicalSubstance"
                    ]
                },
                "n1": {
                    "ids": [
                        "HGNC:6284"
                    ],
                    "categories": [
                        "biolink:Gene"
                    ]
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

This is a one-hop from HGNC:6284 to chemical substance.

Sometimes the chemicals find an enrichment by finding a (different?) edge back to 6284.

When that happens right now, all the edge bindings go into the original query edge (e0) and nothing is bound into the extra_edge. That's no good for downstream processing.

AC 500

Given this question:

{
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "n01",
                    "object": "porphyria"
                }
            },
            "nodes": {
                "porphyria": {
                    "ids": [
                        "MONDO:0037939"
                    ]
                },
                "n01": {
                    "categories": [
                        "biolink:NamedThing"
                    ]
                }
            }
        }
    }
}

Strider produces about 2k results, and AC throws a 500. If you add a category to the porphyria node, AC works.

What happens: first we look for opportunities. The mental assumption is that all of the input answers will be onehops with MONDO:0037939 on one end. So the opportunities will be groupings at n01.

However, because there are subclasses of porphyria, you can also have the porphyria node as the merged node (porphyria merged with its subclasses). This is fine, but the opportunity selector is looking at the qgraph to get the type and there's not one for this node. So at some point somebody does something with a None and we return a 500.

To fix: the opportunity finder needs to recognize that it didn't get a type and do something smarter. I think either reject that opportunity (probably the best) or dig around in the KG to get a type for the id.
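The "reject that opportunity" option could be sketched as a filter applied before enrichment. The `(qnode_id, curies)` shape for an opportunity and the placeholder curies `CURIE:a`, `CURIE:b`, and `CURIE:subclass` are hypothetical; only the query graph mirrors the question above.

```python
def usable_opportunities(opportunities, query_graph):
    """Keep only opportunities whose query node declares a category.

    `opportunities` is assumed to be a list of (qnode_id, curies) pairs;
    that shape is hypothetical, not AC's internal representation.
    """
    kept = []
    for qnode_id, curies in opportunities:
        categories = query_graph["nodes"].get(qnode_id, {}).get("categories")
        if categories:  # skips both a missing key and an empty list
            kept.append((qnode_id, curies))
    return kept


query_graph = {
    "nodes": {
        "porphyria": {"ids": ["MONDO:0037939"]},
        "n01": {"categories": ["biolink:NamedThing"]},
    }
}
opportunities = [
    ("n01", ["CURIE:a", "CURIE:b"]),
    ("porphyria", ["MONDO:0037939", "CURIE:subclass"]),
]
kept = usable_opportunities(opportunities, query_graph)
```

Rejecting is the cheaper of the two options; looking up a type in the KG would keep the porphyria grouping but needs an extra lookup per untyped node.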

Redo test files bigger.json and famcov.json

These files were hand edited to use better node normalized values. This actually corrected a couple of tests in test_graph_coalescer.py but created a failure in test_bigs.py.

These files should be rebuilt to sync the various curies with those derived from the latest normalization services.

Incorrect categories on extra qnodes.

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "id": "NCBIGene:6611",
                    "category": "biolink:Gene"
                },
                "n1": {
                    "category": "biolink:ChemicalSubstance"
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

For this input, AC (graph) is adding a new query node:

"extra_qn_0": {
            "category": [
                "biolink:NamedThing",
                "biolink:BiologicalEntity",
                "biolink:MolecularEntity",
                "biolink:Gene",
                "biolink:GeneOrGeneProduct",
                "biolink:MacromolecularMachine",
                "biolink:GenomicEntity"
            ],
            "is_set": false
        }

But not all of the bound nodes match this set of categories. There are also diseases and phenotypes.

EPC fix

Right now, our results about how things got coalesced are put into the node bindings:

{'node_bindings': {'n0': [{'id': 'NCBIGene:6611'}], 'n1': [{'id': 'PUBCHEM.COMPOUND:5573', 'coalescence_method': 'property_enrichment', 'p_values': [8.693874563940759e-05], 'properties': ['Benzoates']}, {'id': 'PUBCHEM.COMPOUND:9865515', 'coalescence_method': 'property_enrichment', 'p_values': [8.693874563940759e-05], 'properties': ['Benzoates']}]}, 'edge_bindings': {'e01': [{'id': '8f92dbb28ca1', 'weight': 1}, {'id': '2a6ad9515281', 'weight': 1}]}, 'score': 0}

This is repetitive, and not in line with the EPC recs. So fix this.

Does graph coalesce return uncoalesced answers?

If we pass in 3 answers and 2 coalesce into 1, do we return the 3rd uncoalesced answer as well? In other words, are we adding to the result set, or replacing the entire result set?

  1. What is it doing?
  2. What do we want it to do?
  3. Make these the same.

Handle property hierarchy

The first thing that we do is look for answers that can be smushed together. We currently have a strict definition. Two answers must be exactly the same, except for the identity of one node. "Exactly the same" extends to edges. So let's say that our question is A-B. We'll take any edge type.

Sometimes there will be nodes that could be coalesced if we took a less explicit view of the predicates.

So for instance, A-[increases_expression_of]-B will merge with A-[increases_expression_of]-B' but not with A-[affects_expression_of]-B'.

increases_expression_of is an affects_expression_of, so there should be an opportunity for merging at the higher level.

A less strict example that would require thinking through tradeoffs would be whether it makes sense to merge e.g. 'increases_expression_of' and 'decreases_expression_of' at their shared parent. This of course leads to the possibility of merging every edge, since all can be pushed up to related_to. That will require some way to handle the enrichment calculation, I think.

Use ancestor relations in predicates

Currently there are two thoughts for using edges in coalescence.

  1. Only coalesce when edge predicates are the same
  2. Ignore edge predicates

One extension: Use predicate subclasses. So don't require perfect equality, but allow subclassing. If A-[increases expression of]-B and A-[related to]-C, then B and C should be allowed to coalesce with a predicate of the superclass (related to).

In fact, if the edge preds are ignored, maybe we should consider that as merging at the lowest common superproperty.
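Merging at the lowest common superproperty can be sketched with a child-to-parent table. The table below is a tiny hand-written slice used only for illustration; it is not read from the biolink model, and the real hierarchy may differ.

```python
# A tiny, hand-written slice of a predicate hierarchy (child -> parent).
# It is illustrative only and is not read from the biolink model.
PARENT = {
    "increases_expression_of": "affects_expression_of",
    "decreases_expression_of": "affects_expression_of",
    "affects_expression_of": "affects",
    "affects": "related_to",
}


def ancestors(predicate):
    """Return the predicate followed by its ancestors, nearest first."""
    chain = [predicate]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lowest_common_superproperty(first, second):
    """Return the nearest predicate that both inputs are or descend from."""
    second_chain = set(ancestors(second))
    for candidate in ancestors(first):
        if candidate in second_chain:
            return candidate
    return None


merged_predicate = lowest_common_superproperty(
    "increases_expression_of", "related_to"
)
```

In this toy hierarchy every pair eventually meets at related_to, which is why a fully permissive merge is equivalent to ignoring edge predicates.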

Handle symmetric relationships

The first thing that we do is look for answers that can be smushed together. We currently have a strict definition. Two answers must be exactly the same, except for the identity of one node. "Exactly the same" extends to edges. So let's say that our question is A-B. We'll take any edge type.

It's very easy to end up with cases where A and B are connected by multiple edge types. Our strictness means that all edge directions must match. But some edges are symmetric, for example 'related_to' or 'correlated_with'.

So for instance, A-[related_to]->B will merge with A-[related_to]->B' but not with A<-[related_to]-B'.

That's wrong because there's no real reason to favor one direction over another. Note that this needs to happen not only for edges attached to our candidate node, but also for the whole knowledge graph.

Probably a simple way to do this is to immediately cycle through the KG, looking for symmetric edge types, and flipping them so that they all have source_id as the lexicographically lower node in the relationship.
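A minimal sketch of that pass, using the `source_id`, `target_id`, and `type` edge fields that appear in the older-format questions in these issues. The hard-coded set of symmetric predicates holds only the two named above and is an assumption; a real list would need to come from the biolink model.

```python
# Symmetric predicates named in the text; a real list would come from the
# biolink model rather than being hard-coded.
SYMMETRIC = {"related_to", "correlated_with"}


def canonicalize_symmetric_edges(edges):
    """Return edges with symmetric ones oriented low id -> high id."""
    canonical = []
    for edge in edges:
        edge = dict(edge)  # copy so the input list is left untouched
        if edge["type"] in SYMMETRIC and edge["source_id"] > edge["target_id"]:
            edge["source_id"], edge["target_id"] = edge["target_id"], edge["source_id"]
        canonical.append(edge)
    return canonical


kg_edges = [
    {"source_id": "B", "target_id": "A", "type": "related_to"},
    {"source_id": "B", "target_id": "A", "type": "treats"},
]
flipped = canonicalize_symmetric_edges(kg_edges)
```

Only the symmetric edge is reoriented; the directed treats edge keeps its original direction.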

Make output properties TRAPI compliant

I think the p-values and other properties returned from AC are not fully TRAPI compliant:

{'node_bindings': {'n1': [{'id': 'NCBIGene:3778'}],
  'n0': [{'id': 'PUBCHEM.COMPOUND:5994',
    'coalescence_method': 'property_enrichment',
    'p_values': [5.515487725071365e-06],
    'properties': ['molecule_type:Small molecule']},

No Graph coalescence?

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": [
                        "biolink:ChemicalSubstance"
                    ]
                },
                "n1": {
                    "ids": [
                        "HGNC:6284"
                    ],
                    "categories": [
                        "biolink:Gene"
                    ]
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

Running the above query through aragorn, we get a bunch of results. They give all kinds of property coalescence, but no graph coalescence. That seems incorrect, but may be correct? Need to double check.

Output data returns "key": null

The output returns a node with the property "curie": null.

Not only is this bad json, it should not return properties that do not have a value.
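Dropping valueless properties before serializing could be done with a small recursive pass such as the sketch below. It assumes the output is built from plain dicts and lists, and it deliberately leaves None entries inside lists alone.

```python
def drop_null_values(value):
    """Recursively remove dict keys whose value is None.

    None entries that sit directly inside a list are kept as they are.
    """
    if isinstance(value, dict):
        return {
            key: drop_null_values(item)
            for key, item in value.items()
            if item is not None
        }
    if isinstance(value, list):
        return [drop_null_values(item) for item in value]
    return value


node = {"id": "a", "curie": None, "nested": [{"x": None, "y": 1}]}
cleaned = drop_null_values(node)
```

The function returns a new structure rather than editing in place, so the original message is still available for debugging.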

Handle multiple edges more carefully.

The first thing that we do is look for answers that can be smushed together. We currently have a strict definition. Two answers must be exactly the same, except for the identity of one node. "exactly the same" extends to edges. So let's say that our question is A-B. We'll take any edge type.

It's very easy to end up with cases where A and B are connected by multiple edge types. Our strictness means that all edge types must match.

So for instance, A-[related_to]-B will merge with A-[related_to]-B' but not with A-[affects]-B'. If there are two edges like A-[related_to,affects]-B it will merge with A-[related_to,affects]-B', but NOT with A-[related_to]-B'.

Is that what we want? I think we probably need to be a little more flexible here.

Multiple Deployments

Make prod / dev deployments.
Make sure to call the right NN

We can use the same instance of AC for dev & prod until we want to make some changes, so this is 2nd priority.

Include a pre-ranking?

Ask for (named_thing)<-[located in]-(myocarditis)

Semmed returns a bunch, some right, some wrong. It gives "Heart" but also "Brain"

Aragorn groups all of them together and says "you found anatomical entities!"

There are multiple responses. First, the predicate located in should help set the allowed kinds of things that named_thing returns, and that might set the denominator (see #99).

Second, maybe the first step would be to do a score on the individual answers, and do some filtering to chop out garbage. Otherwise we will always be in the realm of wanting to merge good and bad answers without knowledge.

Merge Merged answers?

An example of this problem is running AC on strider_relay_mouse.json.txt (in the repo).

In graph coalescence, we merge N nodes and link that merged set to a common new node.

Sometimes the same N nodes link to multiple different new nodes. We return each of these new nodes as a new coalesced answer.

So if old1, old2 are both linked to new1, new2, we return 2 new answers, one with new1, one with new2.

That was done primarily so that the rewritten query is simple (we add in one new query node). But it makes looking through the results suboptimal. Really, it would be better to combine new1 and new2 into a single answer. But now there are 2 new nodes in an answer.

Do we need to put 2 qnodes into the query? Can answers that only have one new node just not include a mapping to extra_qnode_2? Or can we leave just a single new qnode and have 2 mappings from one answer to that qnode?

@patrickkwang this gets into a TRAPI issue where I'm not sure of the best way forward.

Remove ontology coalescence?

If we can put subclass_of edges into the enrichment database, then ontology coalesce is a subset of graph coalesce.

500 if any missing scores

If any of the results are missing scores, the AC service returns a 500. It should not. When it merges answers, it should be choosing the score of the best ranked merged answer. If it merges answers with and without scores, it should return the score of the best result that has a score. If none have a score, it should not have a score on the merged answer.
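The scoring rule described above can be sketched as follows. It assumes each answer is a dict with an optional `score` key and that a higher score means a better rank; neither assumption is confirmed by the source.

```python
def merged_score(answers):
    """Best score among answers that have one, or None if none do.

    Assumes a higher score is better, which the source does not confirm.
    """
    scores = [a["score"] for a in answers if a.get("score") is not None]
    return max(scores) if scores else None


with_and_without = merged_score([{"score": 0.2}, {}, {"score": 0.7}])
none_scored = merged_score([{}, {"score": None}])
```

When the function returns None, the caller would leave the score off the merged answer instead of raising.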

Correct Denominators

When we are doing the enrichment calculations, we use the type of the coalesced nodes. So if we are merging on chemicals, we suppose that any chemical could be in that spot, and so we ask how likely it is, e.g., for X of those chemicals to have a particular property.

That's not wrong, exactly, but it is probably not specific enough. So for instance consider
(asthma)<-[treats]-(chemical)

We'll usually find an enriched property for the chemical like "drug" or features that tend to be more common in druglike space (like heterocyclic organic compounds). And that is correct, it's more likely than by chance that drugs treat a disease rather than just random chemicals. But it's not terribly interesting.

Instead, I think we'd rather use the denominator of how many chemicals could have inhabited that spot in a graph. So something like, out of all the chemicals with a 'treats' edge, how likely is it that you would have this many with property X. Now the chance of having 'drug' is pretty high in that group, so it's not returned, which is what we want.

That would be doable in this case, and we could precache counts by edge. But in the general case (where there are an arbitrary number of edges coming out of the merging node) then we'd need to actually cache the identities of nodes with each edge so that we could intersect them to find the appropriate denominator.
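To see how much the denominator matters, the sketch below computes an enrichment p-value twice with different universes. A hypergeometric tail test is assumed here because the source does not say which statistic AC uses, and all counts are made up for illustration.

```python
from math import comb


def enrichment_p_value(universe, with_property, drawn, drawn_with_property):
    """Hypergeometric tail: P(at least `drawn_with_property` hits)."""
    total = comb(universe, drawn)
    tail = sum(
        comb(with_property, k) * comb(universe - with_property, drawn - k)
        for k in range(drawn_with_property, min(drawn, with_property) + 1)
    )
    return tail / total


# Made-up counts: 10 merged chemicals, 9 of which have the property "drug".
# Universe 1: all chemicals, where "drug" is rare.
p_all_chemicals = enrichment_p_value(100_000, 5_000, 10, 9)
# Universe 2: only chemicals with a 'treats' edge, where "drug" is common.
p_treats_only = enrichment_p_value(6_000, 4_500, 10, 9)
```

With these invented numbers the property looks highly enriched against all chemicals but unremarkable against chemicals that have a treats edge, which is the behavior the issue asks for.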

Graph coalescence produces odd question

When we do graph coalescence we're adding a new node that connects multiple different nodes.

So if we have a-b-c then it might turn c into a set and make a new question a-b-(set of c)-d

But what we actually get are many many named nodes in the question that aren't attached to anything. So we get a-b-(c), d, e, f, g, h,....

It's easy to know how to fix this bug if there is a single place where we want to attach a new node (like d is attached to c). But what if in some groupings we attach d to c and in some groupings we attach something like e to b? What should we return in the query_graph? Does the standard allow for partial matches?

Rerank answers

After coalescing, rerank answers rather than just picking the best rank from the coalesced set.

Include extra edges

When we add a new node, we should maybe add extra edges to that node. So let's say that I go
(chemical {id:})-(disease), so I get a bunch of diseases. Now the graph coalesce adds a gene, so I get
(chemical {id:})-(set of named diseases)-(named gene). I want to know if that named gene has any relation to the chemical.
