Comments (6)
Thanks for the feedback.
I have to push back a bit on this. ES was selected because:
- Document store enables conformance with GA4GH document schemas,
- Enables front end "google style" searches,
- Similar architecture to other bio portals we reviewed prior to selection (ICGC, GDC, MyGene.info and MyVariant.info)
Re. normalization, consider the variant name -> variant normalization that was done for CGI. It was all done as part of the harvest, at the Python level. We anticipate that future normalizations should follow the same pattern.
At the same time, ES is implemented as a "silo", a sink to store data. We've made the silos pluggable and have included one for Kafka. This is the mechanism for extension and support for downstream use cases.
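To make the "pluggable silo" idea concrete, here is a minimal sketch of what such a sink interface could look like. The names `Silo`, `MemorySilo`, `JsonLinesSilo`, and `harvest_to_silos` are hypothetical and chosen for illustration; they are not taken from the g2p-aggregator code, and the real ES and Kafka silos may be structured differently.

```python
import io
import json
from typing import Any, Dict, Iterable, List, Protocol


class Silo(Protocol):
    """Hypothetical interface: anything with a save() method can act as a sink."""

    def save(self, record: Dict[str, Any]) -> None: ...


class MemorySilo:
    """Keeps harvested records in a list (handy for tests)."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def save(self, record: Dict[str, Any]) -> None:
        self.records.append(record)


class JsonLinesSilo:
    """Writes each record as one JSON line to a text stream.

    Stands in for a real backend such as ES or Kafka in this sketch.
    """

    def __init__(self, stream: io.TextIOBase) -> None:
        self.stream = stream

    def save(self, record: Dict[str, Any]) -> None:
        self.stream.write(json.dumps(record, sort_keys=True) + "\n")


def harvest_to_silos(records: Iterable[Dict[str, Any]], silos: List[Silo]) -> int:
    """Send every harvested record to every configured silo; return the record count."""
    count = 0
    for record in records:
        for silo in silos:
            silo.save(record)
        count += 1
    return count
```

Under this assumption, supporting a new downstream use case means adding one more class with a `save()` method and listing it among the silos; the harvest code itself does not change.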
from g2p-aggregator.
@bwalsh, thanks for the detailed response, this is exactly the sort of discussion I was hoping for.
These points make perfect sense in the context in which you have framed the problem. My concern is whether, going forward, we want to have an ES style search. I do like the idea of a "Google style" search interface, but my hope was that it would be context-aware (e.g. "V600E" is shorthand for an amino-acid change, and thus searches should be constrained to features that represent or contain such a change). I see the path towards getting there as evaluating the query, normalizing it, and then performing the search on an appropriate table. The results I envision presenting would be more than matching positions plus returned metadata; they would be a sensible aggregation of important information from any available resources (such as today's example of variant/exon/gene level interpretations).
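As a rough illustration of the "evaluate the query, then normalize it" step, the sketch below recognizes shorthand such as "V600E" as an amino-acid change and falls back to free text otherwise. The function name `classify_query`, the returned keys, and the regular expression are assumptions made for this example; a real implementation would need to handle many more notations (three-letter codes, frameshifts, genomic coordinates, and so on).

```python
import re
from typing import Any, Dict

# One-letter codes for the 20 standard amino acids.
_AA = "ACDEFGHIKLMNPQRSTVWY"

# Matches e.g. "V600E" or "p.V600E"; "*" is allowed as the alternate (stop codon).
_PROTEIN_CHANGE = re.compile(rf"^(?:p\.)?([{_AA}])(\d+)([{_AA}*])$")


def classify_query(text: str) -> Dict[str, Any]:
    """Return a small normalized description of the query (hypothetical format)."""
    cleaned = text.strip()
    match = _PROTEIN_CHANGE.match(cleaned)
    if match:
        ref, position, alt = match.groups()
        return {
            "kind": "protein_change",
            "ref": ref,
            "position": int(position),
            "alt": alt,
        }
    # Anything unrecognized is passed through as a plain full-text query.
    return {"kind": "free_text", "text": cleaned}
```

The result could then be used to decide which fields (or which table) to search, independent of whether the backend is ES or SQL.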
I don't mind trying to bypass or adapt ES to construct this behavior, but I've noted instances in the past where groups with similar goals (as the ones I have in mind) have ultimately adapted to a SQL backend in order to do advanced queries of this nature (e.g. cBioPortal, MAGI).
I also didn't realize that NoSQL was a requirement for conformance with the G2P schema--would you be able to clarify here? You're talking about the protobuf docs, right? You're better versed in the GA4GH schema, and I'm just trying to understand what I'm missing.
I also want to make sure that the broader group is in agreement about the best path forward, as I intend to coordinate my efforts in the same space so that we can have the best possible product when we publish. Since the initial design decisions on this prototype were made without input from the broader G2P team, I want to have that discussion here before we go too much farther.
from g2p-aggregator.
@bwalsh @ahwagner This is great discussion, my thoughts:
- Many modern biomedical data stores use both a relational database and an ElasticSearch (ES)/NoSQL database. The relational database is used for normalization, incremental updates, and ensuring referential integrity, and ES/NoSQL is used for flexible and fast searching. As best I know, here's the breakdown of related technologies and how they use data stores:
  - ICGC: ES only.
  - GDC: relational + graph database, with ES to power the portal.
  - myvariant/mygene.info: MongoDB as a substitute for a relational database, ES for fast searching.
- I've used this tool for several cases now (3 patients, ~10 somatic mutations each), and I can say that full-text search is very nice to have. Even if we succeed in normalizing many of the data facets, full-text search is great for ensuring that nothing is missed and for fuzzy matching.
- @ahwagner I'd like to hear more about why you think a relational database will improve results. Precise searches are possible in ES (e.g. show me only entries where hgvs.p=V600E), as are fuzzy searches. Of course, ES searches are limited to the keys present in the ES documents, so results would suffer if we have a poor schema for our ES documents.
- My opinion is that in the long term we probably want a relational database and then a process to ETL the data into ES. Whether we can do this in the short term depends on how much effort we have on this prototype and our development priorities.
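For readers less familiar with ES, the precise and fuzzy searches mentioned above could be expressed roughly as the following request bodies. This is only a sketch: `term` and `match` with `fuzziness` are standard ES query DSL, but the helper names are made up here, and the field names (including `hgvs.p`, borrowed from the example above) are assumptions about the index mapping. An exact `term` match also assumes the field is indexed as a keyword rather than analyzed text.

```python
from typing import Any, Dict


def exact_query(field: str, value: str) -> Dict[str, Any]:
    """Body for a precise search: only documents whose field equals value."""
    return {"query": {"term": {field: value}}}


def fuzzy_query(field: str, text: str) -> Dict[str, Any]:
    """Body for a full-text search that tolerates small spelling differences."""
    return {"query": {"match": {field: {"query": text, "fuzziness": "AUTO"}}}}


# Example bodies; the field names are illustrative assumptions.
precise = exact_query("hgvs.p", "V600E")
fuzzy = fuzzy_query("description", "vemurafenib")
```

Either dictionary would be sent as the JSON body of a search request against the index.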
from g2p-aggregator.
Thanks Alex and Jeremy for a good thread.
Just to clarify, an "ES style search" is more or less equivalent to a "Google style search" (full text & keyword enabled).
Also, to belabor it a bit more, the use of ES is separate from the Kibana UI that is in place now. I'm sure that there will be a number of ideas of how the UI should behave and how it should be implemented. Kibana is probably only useful in this exploratory stage.
Regarding protobuf schemas and a NoSQL database like ES, there is no hard dependency. You can certainly use the backend of your choice to store data. However, with SQL (or similar) there is more than a bit of extra work required: code needs to be written to marshal and unmarshal the messages, and schemas need to be created and maintained. With document databases like ES, mongo, ... the messages are simply read from and written to the db.
Re. broader group ... for sure +1
Hope this helps...
from g2p-aggregator.
@jgoecks re: how I think relational will improve results, I'm concerned about where this will go in the future--do we want up-to-the-minute results beacon-style? ES relies on heavy indexing, and doesn't do well with continually updated resources. Perhaps we don't care as much about that, and are willing to accept somewhat stale results (weekly/daily dumps?) in order to get the fuzzy search performance that ES offers?
As for the keys specified in the documents, I agree that this would be the means of adding the functionality I'm envisioning to our existing datastore.
I believe that we can work with either model, and I agree with @bwalsh that a SQL backend will require more work to get up and running. @jgoecks, I really liked your feedback re: usefulness of fuzzy search. Your finding that a full-text search improved your ability to interpret variants is super helpful in making this decision, and it might be worthwhile to accept slightly stale results in the long run to keep that functionality.
I think I'm convinced that for the near term (i.e. until we publish this paper) we should stick to the ES-only approach. Maybe we should re-evaluate after, and go for a hybrid approach like @jgoecks suggested.
I suggest we leave this thread open for a while for other thoughts on the subject.
from g2p-aggregator.
I have already found this discussion very helpful.
If we go with ES only, I think it will be valuable to have demonstrations of both styles of searching (i.e. "Google like" and more "context aware").
An example of the context aware search could be: (amino acid designation is G12D) AND (gene is KRAS OR NRAS) AND (disease is melanoma).
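One way the example above might translate into an ES request body is a `bool` query, sketched below. The `bool`, `filter`, `must`, `term`, `terms`, and `match` clauses are standard ES query DSL, but the function name and the field names `protein_change`, `gene`, and `disease` are hypothetical; the actual document keys in the index may differ.

```python
from typing import Any, Dict, List


def context_aware_query(protein_change: str, genes: List[str], disease: str) -> Dict[str, Any]:
    """Build (change) AND (gene in genes) AND (disease) as an ES bool query."""
    return {
        "query": {
            "bool": {
                # Exact constraints: both filters must hold for a document to match.
                "filter": [
                    {"term": {"protein_change": protein_change}},
                    # "terms" matches if the field equals ANY listed value (the OR part).
                    {"terms": {"gene": genes}},
                ],
                # Disease is matched as analyzed text, so wording may vary slightly.
                "must": [{"match": {"disease": disease}}],
            }
        }
    }


body = context_aware_query("G12D", ["KRAS", "NRAS"], "melanoma")
```

A plain "Google like" search would skip the field constraints and send the user's text as a single full-text query instead.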
from g2p-aggregator.