
Comments (6)

bwalsh commented on August 15, 2024

@ahwagner

Thanks for the feedback.

I have to push back a bit on this. ES was selected because:

  • The document store enables conformance with the GA4GH document schemas,
  • It enables front-end "Google style" searches,
  • The architecture is similar to other bio portals we reviewed prior to selection (ICGC, GDC, MyGene.info, and MyVariant.info).

Re. normalization, consider the variant name -> variant normalization that was done for CGI. It was all done as part of the harvest, at the Python level. We anticipate that future normalizations should follow the same pattern.
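For concreteness, that harvest-time pattern might look roughly like the sketch below. The function and field names are hypothetical, not the actual g2p-aggregator code:

```python
def normalize_variant_name(raw_name):
    """Map a free-text variant name to a structured form (illustrative only)."""
    # e.g. "BRAF V600E" -> {"gene": "BRAF", "protein_change": "V600E"}
    parts = raw_name.strip().split()
    if len(parts) == 2:
        return {"gene": parts[0].upper(), "protein_change": parts[1]}
    return {"raw": raw_name}  # leave anything unrecognized unnormalized

def harvest(records):
    """Normalize each record during harvest, before it reaches a silo."""
    for record in records:
        record["features"] = [normalize_variant_name(v)
                              for v in record.get("variant_names", [])]
        yield record
```

The point is that normalization happens once, in Python, before any record reaches a sink, so the choice of backend is unaffected.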

At the same time, ES is implemented as a "silo", a sink to store data. We've made the silos pluggable and have included one for Kafka. This is the mechanism for extension and for supporting downstream use cases.

from g2p-aggregator.

ahwagner commented on August 15, 2024

@bwalsh, thanks for the detailed response, this is exactly the sort of discussion I was hoping for.

These points make perfect sense in the context you have framed the problem in. My concern is whether, going forward, we want an ES-style search at all. I do like the idea of a "Google style" search interface, but my hope is that it will be context-aware (e.g. "V600E" is shorthand for an amino-acid change, so searches should be constrained to features that represent or contain such a change). I see the path to getting there as evaluating the query, normalizing it, and then performing the search on an appropriate table. The results I envision presenting would be more than matching positions plus returned metadata: rather, a sensible aggregation of important information from all available resources (such as today's example of variant/exon/gene-level interpretations).
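The evaluate-normalize-search flow could be sketched roughly as follows; the categories and patterns here are assumptions for illustration, not an existing implementation:

```python
import re

# Recognize a bare amino-acid change like "V600E" (ref AA, position, alt AA).
AA_CHANGE = re.compile(r"^[A-Z]\d+[A-Z*]$")

def classify_query(q):
    """Decide which search context a bare query term belongs to."""
    q = q.strip()
    if AA_CHANGE.match(q):
        # Constrain the search to features carrying this protein change.
        return ("protein_change", q)
    if q.isupper() and q.isalpha():
        # Crude gene-symbol heuristic, purely illustrative.
        return ("gene", q)
    return ("full_text", q)  # fall back to an unconstrained search
```

Under this sketch, `classify_query("V600E")` routes to a protein-change search while free text falls through to full-text search.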

I don't mind trying to bypass or adapt ES to construct this behavior, but I've noted instances in the past where groups with similar goals (to the ones I have in mind) have ultimately moved to a SQL backend in order to do advanced queries of this nature (e.g. cBioPortal, MAGI).

I also didn't realize that NoSQL was a requirement for conformance with the G2P schema; would you be able to clarify here? You're talking about the protobuf docs, right? You're better versed in the GA4GH schema than I am, and I'm just trying to understand what I'm missing.

I also want to make sure that the broader group is in agreement about the best path forward, as I intend to coordinate my efforts in the same space so that we can have the best possible product when we publish. Since the initial design decisions on this prototype were made without input from the broader G2P team, I want to have that discussion here before we go too much farther.


jgoecks commented on August 15, 2024

@bwalsh @ahwagner This is great discussion, my thoughts:

  1. Many modern biomedical data stores use both a relational database and an Elasticsearch (ES)/NoSQL database. The relational database is used for normalization, incremental updates, and ensuring referential integrity, while ES/NoSQL is used for flexible, fast searching. As best I know, here's the breakdown of related projects and how they use data stores:
  • ICGC: ES only.
  • GDC: relational + graph database, with ES to power the portal.
  • MyVariant.info/MyGene.info: MongoDB as a substitute for a relational database, ES for fast searching.

  2. I've used this tool for several cases now (3 patients, ~10 somatic mutations each), and I can say that full-text search is very nice to have. Even if we succeed in normalizing many of the data facets, full-text search is great for ensuring that nothing is missed and for fuzzy matching.

  3. @ahwagner I'd like to hear more about why you think a relational database will improve results. Precise searches are possible in ES (e.g. show me only entries where hgvs.p=V600E), as are fuzzy searches. Of course, ES searches are limited to the keys present in the ES documents, so results will only be as good as the schema we define for those documents.

  4. My opinion is that in the long term we probably want a relational database and then a process to ETL the data into ES. Whether we can do this in the short term depends on how much effort we have on this prototype and our development priorities.
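As a concrete illustration of the precise vs. fuzzy distinction, the Elasticsearch query DSL bodies might look like this (field names such as `hgvs.p` are assumptions about the document schema, not the prototype's actual mapping):

```python
def exact_query(field, value):
    """Build an ES query body matching only documents where field == value."""
    return {"query": {"term": {field: value}}}

def fuzzy_query(field, text):
    """Build an ES query body for a fuzzy full-text match on one field."""
    return {"query": {"match": {field: {"query": text, "fuzziness": "AUTO"}}}}

# POST either body to /<index>/_search via the ES REST API or a client library.
precise = exact_query("hgvs.p", "V600E")
```

Both shapes are standard query DSL; the quality of the results depends entirely on which keys the harvested documents expose.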


bwalsh commented on August 15, 2024

Thanks Alex and Jeremy for a good thread.

Just to clarify, an "ES style search" is more or less equivalent to a "Google style search" (full-text and keyword enabled).

Also, to belabor it a bit more: the use of ES is separate from the Kibana UI that is in place now. I'm sure there will be a number of ideas about how the UI should behave and how it should be implemented. Kibana is probably only useful in this exploratory stage.

Regarding protobuf schemas and a NoSQL database like ES, there is no hard dependency. You can certainly use the backend of your choice to store the data. However, with SQL (or similar) there is more than a bit of extra work required: code needs to be written to marshal and unmarshal the messages, and schemas need to be created and maintained. With document databases like ES, Mongo, etc., the messages are simply read from and written to the database.
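A minimal sketch of that contrast, assuming JSON-serialized messages (the document shape below is illustrative, not the exact GA4GH schema):

```python
import json

# A GA4GH-style association message as a plain document.
doc = {
    "feature": {"geneSymbol": "BRAF", "name": "V600E"},
    "association": {"phenotype": {"description": "melanoma"}},
}

# Document store (ES, Mongo, ...): serialize and store as-is,
# no table schema to create or maintain.
payload = json.dumps(doc)
restored = json.loads(payload)
assert restored == doc  # round-trips unchanged

# With SQL, the same nested message must first be flattened into tables
# (features, associations, phenotypes, ...) and reassembled on every read.
```

That flattening/reassembly code is the extra work referred to above.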

Re. broader group ... for sure +1

Hope this helps...


ahwagner commented on August 15, 2024

@jgoecks re: how I think relational will improve results, I'm concerned about where this will go in the future. Do we want up-to-the-minute, Beacon-style results? ES relies on heavy indexing and doesn't do well with continually updated resources. Perhaps we don't care as much about that, and are willing to accept somewhat stale results (weekly/daily dumps?) in order to get the fuzzy-search performance that ES offers?

As for the keys specified in the documents, I agree that this would be the means of adding the functionality I'm envisioning to our existing datastore.

I believe that we can work with either model, and I agree with @bwalsh that a SQL backend will require more work to get up and running. @jgoecks, I really liked your feedback re: usefulness of fuzzy search. Your finding that a full-text search improved your ability to interpret variants is super helpful in making this decision, and it might be worthwhile to accept slightly stale results in the long run to keep that functionality.

I think I'm convinced that for the near term (i.e. until we publish this paper) we should stick to the ES-only approach. Maybe we should re-evaluate after, and go for a hybrid approach like @jgoecks suggested.

I suggest we leave this thread open for a while for other thoughts on the subject.


malachig commented on August 15, 2024

I have already found this discussion very helpful.

If we go with ES only, I think it will be valuable to have demonstrations of both styles of searching (i.e. "Google-like" and more "context-aware").

An example of a context-aware search could be: (amino acid designation is G12D) AND (gene is KRAS OR NRAS) AND (disease is melanoma).
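For reference, that boolean search can be expressed directly in the Elasticsearch query DSL; the field names below are assumptions about how the documents might be keyed:

```python
# (amino acid designation is G12D) AND (gene is KRAS OR NRAS)
#   AND (disease is melanoma)
context_aware_query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"protein_change": "G12D"}},
                {"terms": {"gene": ["KRAS", "NRAS"]}},  # terms = OR over values
                {"term": {"disease": "melanoma"}},
            ]
        }
    }
}
```

So ES can serve the context-aware style too, provided the harvested documents expose these keys.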

from g2p-aggregator.
