PostgreSQL backend? about g2p-aggregator HOT 6 CLOSED

ohsu-comp-bio commented on August 15, 2024

PostgreSQL backend?

Comments (6)

bwalsh commented on August 15, 2024

@ahwagner

Thanks for the feedback.

I have to push back a bit on this. ES was selected because:

Document store enables conformance with GA4GH document schemas,
Enables front end "google style" searches,
Similar architecture to other bio portals we reviewed prior to selection (ICGC, GDC, MyGene.info and MyVariant.info)

Re. normalization, consider the variant name -> variant normalization that was done for cgi. It was all done as part of the harvest at the python level. We anticipate that future normalizations should follow the same pattern.

At the same time, ES is implemented as a "silo", a sink to store data. We've made the silos pluggable and have included one for kafka. This is the mechanism for extension and support for downstream use cases.
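As a rough illustration of the pluggable-silo idea described above, here is a minimal Python sketch. The class and function names (`MemorySilo`, `JsonLinesSilo`, `harvest`) are hypothetical and are not taken from the g2p-aggregator code; the only assumption is that a silo is any object with a `save` method.

```python
import json


class MemorySilo:
    """Hypothetical stand-in for a search-index silo: keeps documents in a list."""

    def __init__(self):
        self.documents = []

    def save(self, document):
        self.documents.append(document)


class JsonLinesSilo:
    """Hypothetical stand-in for a queue silo: one JSON string per document."""

    def __init__(self):
        self.lines = []

    def save(self, document):
        self.lines.append(json.dumps(document, sort_keys=True))


def harvest(records, silos):
    """Send every harvested record to every configured silo; return the record count."""
    count = 0
    for record in records:
        for silo in silos:
            silo.save(record)
        count += 1
    return count


memory, lines = MemorySilo(), JsonLinesSilo()
harvested = harvest([{"gene": "BRAF", "source": "cgi"}], [memory, lines])
```

Because the harvester only relies on `save`, supporting a new downstream use case would mean adding one more silo class rather than changing the harvest code.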

ahwagner commented on August 15, 2024

@bwalsh, thanks for the detailed response, this is exactly the sort of discussion I was hoping for.

These points make perfect sense in the context you have framed the problem in. My concern is whether, going forward, we want to have an ES-style search. I do like the idea of a "Google style" search interface, but my hope was that it will be context-aware (e.g. "V600E" is shorthand for an amino-acid change, and thus searches should be constrained to features that represent or contain such a change). I see the path towards getting there as evaluating the query, normalizing it, and then performing the search on an appropriate table. The results I envision presenting will be more than matching positions + returned metadata, but rather a sensible aggregation of important information from any available resources (such as today's example of variant/exon/gene level interpretations).
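The "evaluate the query, then normalize it" step could look something like the sketch below. It is only an assumption about how such a classifier might work: the function name `classify_query`, the two category labels, and the choice to normalize to a `p.`-prefixed form are all illustrative, and a real implementation would need to cover many more notations.

```python
import re

# One-letter amino-acid codes; "*" in the alternate position marks a stop codon.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PROTEIN_CHANGE = re.compile(
    rf"^(?:p\.)?([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}*])$"
)


def classify_query(text):
    """Return (kind, normalized) for a raw search string.

    Shorthand such as "V600E" is recognized as a protein change and
    normalized to "p.V600E"; anything else falls back to free text.
    """
    token = text.strip()
    match = PROTEIN_CHANGE.match(token)
    if match:
        ref, position, alt = match.groups()
        return "protein_change", f"p.{ref}{position}{alt}"
    return "free_text", token
```

With this sketch, `classify_query("V600E")` is routed as a protein change, while `classify_query("melanoma")` stays a free-text search.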

I don't mind trying to bypass or adapt ES to construct this behavior, but I've noted instances in the past where groups with goals similar to the ones I have in mind have ultimately adopted a SQL backend in order to do advanced queries of this nature (e.g. cBioPortal, MAGI).

I also didn't realize that NoSQL was a requirement for conformance with the G2P schema--would you be able to clarify here? You're talking about the protobuf docs, right? You're better versed in the GA4GH schema, and I'm just trying to understand what I'm missing.

I also want to make sure that the broader group is in agreement about the best path forward, as I intend to coordinate my efforts in the same space so that we can have the best possible product when we publish. Since the initial design decisions on this prototype were made without input from the broader G2P team, I want to have that discussion here before we go much further.

jgoecks commented on August 15, 2024

@bwalsh @ahwagner This is great discussion, my thoughts:

Many modern biomedical data stores use both a relational database and an Elasticsearch (ES)/NoSQL database. The relational database is used for normalization, incremental updates, and ensuring referential integrity, and ES/NoSQL is used for flexible and fast searching. As best I know, here's the breakdown of related technologies and how they use data stores:

ICGC: ES only.
GDC: relational + graph database and ES to power the portal
myvariant/mygene.info: MongoDB as a substitute for a relational database, ES for fast searching

I've used this tool for several cases now (3 patients, ~10 somatic mutations each), and I can say that full-text search is very nice to have. Even if we succeed in normalizing many of the data facets, full-text search is great for ensuring that nothing is missed and for fuzzy matching.

@ahwagner I'd like to hear more about why you think a relational database will improve results. Precise searches are possible in ES (e.g. show me only entries where hgvs.p=V600E) as are fuzzy searches. Of course, ES searches are limited to the keys present in the ES documents, so results would suffer if we have a poor schema for our ES documents.

My opinion is that in the long-term we probably want a relational database and then a process to ETL the data into ES. Whether we can do this in the short-term depends on how much effort we have on this prototype and our development priorities.
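For reference, the precise and fuzzy searches mentioned above might be expressed as Elasticsearch query bodies along these lines. This is a sketch, not the project's actual queries: the helper names and the `description` field are hypothetical, and the exact-match form assumes `hgvs.p` is indexed as an unanalyzed keyword field.

```python
def exact_query(field, value):
    """Term query: matches only documents whose field holds exactly this value."""
    return {"query": {"term": {field: value}}}


def fuzzy_query(field, text):
    """Match query with fuzziness: tolerates small spelling differences."""
    return {"query": {"match": {field: {"query": text, "fuzziness": "AUTO"}}}}


precise = exact_query("hgvs.p", "V600E")
# "description" is a made-up field name used only for illustration.
fuzzy = fuzzy_query("description", "vemurafinib")
```

Both bodies depend on the named field existing in the indexed documents, which is the schema limitation noted above.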

bwalsh commented on August 15, 2024

Thanks Alex and Jeremy for a good thread.

Just to clarify, an "ES style search" is more or less equivalent to a "Google style search" (full-text and keyword enabled).

Also, to belabor it a bit more, the use of ES is separate from the Kibana UI that is in place now. I'm sure that there will be a number of ideas of how the UI should behave and how it should be implemented. Kibana is probably only useful in this exploratory stage.

Regarding protobuf schemas and a NoSQL database like ES, there is no hard dependency. You can certainly use the backend of your choice to store data. However, with SQL (or similar) there is more than a bit of extra work required: code needs to be written to marshal and unmarshal the messages, and schemas need to be created and maintained. With document databases like ES, MongoDB, etc., the messages are simply read from and written to the db.
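To make the difference in effort concrete, here is a small sketch that stores the same message both ways. A plain dict stands in for a protobuf message, SQLite stands in for the SQL backend, and the `association` table and its columns are invented for the example.

```python
import json
import sqlite3

# A plain dict stands in for a schema-conformant message.
message = {"gene": "BRAF", "variant": "V600E", "source": "cgi"}

# Document-store path: the message is serialized and stored whole.
stored = json.dumps(message)
from_document = json.loads(stored)


# Relational path: a table schema must exist, and code must map
# message fields to columns (marshal) and back again (unmarshal).
def to_row(msg):
    return (msg["gene"], msg["variant"], msg["source"])


def from_row(row):
    gene, variant, source = row
    return {"gene": gene, "variant": variant, "source": source}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE association (gene TEXT, variant TEXT, source TEXT)")
conn.execute("INSERT INTO association VALUES (?, ?, ?)", to_row(message))
row = conn.execute("SELECT gene, variant, source FROM association").fetchone()
restored = from_row(row)
conn.close()
```

Every new message field would need a matching change to the table and to both mapping functions in the relational path, while the document path would be unchanged.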

Re. broader group ... for sure +1

Hope this helps...

ahwagner commented on August 15, 2024

@jgoecks re: how I think relational will improve results, I'm concerned about where this will go in the future--do we want up-to-the-minute results beacon-style? ES relies on heavy indexing, and doesn't do well with continually updated resources. Perhaps we don't care as much about that, and are willing to accept somewhat stale results (weekly/daily dumps?) in order to get the fuzzy search performance that ES offers?

As for the keys specified in the documents, I agree that this would be the means of adding the functionality I'm envisioning to our existing datastore.

I believe that we can work with either model, and I agree with @bwalsh that a SQL backend will require more work to get up and running. @jgoecks, I really liked your feedback re: usefulness of fuzzy search. Your finding that a full-text search improved your ability to interpret variants is super helpful in making this decision, and it might be worthwhile to accept slightly stale results in the long run to keep that functionality.

I think I'm convinced that for the near term (i.e. until we publish this paper) we should stick to the ES-only approach. Maybe we should re-evaluate after, and go for a hybrid approach like @jgoecks suggested.

I suggest we leave this thread open for a while for other thoughts on the subject.

malachig commented on August 15, 2024

I have already found this discussion very helpful.

If we go with ES only, I think it will be valuable to have demonstrations of both styles of searching (i.e. "Google like" and more "context aware").

An example of the context aware search could be: (amino acid designation is G12D) AND (gene is KRAS OR NRAS) AND (disease is melanoma).
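One possible way to express that context-aware example as an Elasticsearch query body is sketched below. The helper name and the `gene` and `disease` field names are assumptions for illustration; the real field names depend on the document schema.

```python
def context_aware_query(protein_change, genes, disease):
    """Build a bool query: every clause under "must" has to match (AND),
    while the "terms" clause matches any of the listed genes (OR)."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"hgvs.p": protein_change}},
                    {"terms": {"gene": genes}},
                    {"match": {"disease": disease}},
                ]
            }
        }
    }


query = context_aware_query("G12D", ["KRAS", "NRAS"], "melanoma")
```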
