knowledgesystems / knowledgesystems-k8s-deployment
Documentation and configuration files for deploying the Knowledge Systems Group's services using Kubernetes.
This should probably be split off into a separate issue later:
For the webinar it would be useful to see Genome Nexus resource usage as well
This is already done: https://grafana.cbioportal.org/d/vWTDoH6Wz/genome-nexus?refresh=5s&panelId=18&orgId=1&tab=general. Submitting this issue for logging purposes.
We are currently on version 0.15.0:
It would be good to upgrade. Do make sure the Grafana dashboard still properly displays the nginx metrics: https://grafana.cbioportal.org/d/7R5LYe_iz/cbioportal-ingress-pod-stats?refresh=5s&orgId=1. Last time we tried to upgrade, this was an issue because the metric field names had changed: Prometheus needs to scrape all the nginx metrics properly, and the dashboard field names will probably need to be renamed.
Explore the use of Terracotta on k8s as a means of offloading cache resource usage to a pod outside of the cBioPortal backend. I suggest the GENIE portal be the first application for this.
Link to the Terracotta helm chart:
https://github.com/helm/charts/tree/master/stable/terracotta
It seems to be Ehcache 3.x compatible. Note: it's unclear whether this requires a license for Terracotta Server.
One way to do it semi-manually now is:
First download a study from datahub on your local machine following https://github.com/cbioportal/datahub#how-to-download-data
Update a local portal.properties to have the correct import credentials
Start a cbioportal-importer container:
kubectl run --rm -i --tty cbioportal-importer --image=cbioportal/cbioportal:release-3.3.0 --restart=Never -- sh
You can detach from it for now after it has started (Ctrl + P, then Ctrl + Q)
kubectl cp portal.properties cbioportal-importer:/cbioportal/
kubectl cp ucec_tcga_pub cbioportal-importer:/cbioportal/ucec_tcga_pub
kubectl attach -it cbioportal-importer
# once inside
metaImport.py -s /cbioportal/ucec_tcga_pub -o -u https://beta.cbioportal.org
Now you can detach again with Ctrl + P followed by Ctrl + Q
Check logs with kubectl logs -f cbioportal-importer
Delete the pod again with kubectl delete pod cbioportal-importer
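The manual steps above could be collected into one script. Here is a sketch, reusing the pod name, image, study, and URL from the example; instead of attaching interactively, it keeps the pod alive with `sleep` and runs the importer through `kubectl exec`. The `run` helper defaults to dry-run mode (it only prints commands), so nothing touches a cluster by accident:

```shell
#!/bin/sh
# Sketch of the manual import steps above. Pod, image, study, and URL come
# from the example and can be overridden via the environment.
# DRY_RUN=1 (the default here) prints each command instead of executing it.
STUDY="${STUDY:-ucec_tcga_pub}"
POD="${POD:-cbioportal-importer}"
IMAGE="${IMAGE:-cbioportal/cbioportal:release-3.3.0}"
PORTAL_URL="${PORTAL_URL:-https://beta.cbioportal.org}"

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi
}

# keep the pod alive so we can copy files in and exec the importer
# (sleep infinity works on GNU/debian-based images such as cbioportal's)
run kubectl run "$POD" --image="$IMAGE" --restart=Never -- sleep infinity
run kubectl cp portal.properties "$POD":/cbioportal/
run kubectl cp "$STUDY" "$POD":/cbioportal/"$STUDY"
run kubectl exec "$POD" -- metaImport.py -s "/cbioportal/$STUDY" -o -u "$PORTAL_URL"
run kubectl delete pod "$POD"
```

Set DRY_RUN=0 to actually execute against the cluster.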
We should find a way to do this in a more automated fashion. Some options:
Spin up a pod with multiple containers. First container downloads the datahub study of interest set with some env variable. Second container is the cBioPortal importer that actually imports it
Have an "always on" import container. Whenever we copy files into a particular folder it starts importing
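The first option could be sketched as a pod spec with an init container. Everything below is hypothetical, not an existing manifest: the downloader image, the STUDY_URL variable, and the example tarball URL are all assumptions.

```yaml
# Hypothetical manifest: init container fetches the study into a shared
# emptyDir volume, then the main container imports it
apiVersion: v1
kind: Pod
metadata:
  name: study-importer
spec:
  restartPolicy: Never
  volumes:
    - name: study
      emptyDir: {}
  initContainers:
    - name: download-study
      image: curlimages/curl          # assumption: any image with curl + tar
      env:
        - name: STUDY_URL             # e.g. a datahub study tarball
          value: "https://example.org/ucec_tcga_pub.tar.gz"
      command: ["sh", "-c", 'curl -sL "$STUDY_URL" | tar xz -C /study']
      volumeMounts:
        - name: study
          mountPath: /study
  containers:
    - name: importer
      image: cbioportal/cbioportal:release-3.3.0
      command: ["metaImport.py", "-s", "/study/ucec_tcga_pub", "-o", "-u", "https://beta.cbioportal.org"]
      volumeMounts:
        - name: study
          mountPath: /study
```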
We currently have duplicate logs in the elasticsearch cluster
uken/fluent-plugin-elasticsearch#312
Another solution would be to switch to a different logging system altogether
Right now the rc database sync script deletes the existing database, and the dump takes quite a long time to complete. It would be better to import into a new database, then drop the old one and rename once the import has completed.
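The swap could work roughly like this. MySQL has no RENAME DATABASE, so tables are moved individually (a multi-table RENAME TABLE is close to atomic); the database and table names below are placeholders, and in practice the table list would come from SHOW TABLES:

```shell
# Sketch: print the SQL for swapping a freshly imported database into place.
# Assumes ${OLD_DB}_old exists as an archive target; the real table list
# would come from `SHOW TABLES IN $NEW_DB`.
NEW_DB="cbioportal_rc_new"
OLD_DB="cbioportal_rc"
SQL=""
for t in cancer_study patient sample; do   # placeholder table names
  stmt="RENAME TABLE ${OLD_DB}.${t} TO ${OLD_DB}_old.${t}, ${NEW_DB}.${t} TO ${OLD_DB}.${t};"
  SQL="${SQL}${stmt} "
  echo "$stmt"
done
```

The emitted statements would then be piped to a mysql client with the right credentials.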
Use the helm chart stable/prometheus-operator
https://github.com/helm/charts/tree/master/stable/prometheus-operator
instead of the two coreos charts:
coreos/prometheus-operator
and
coreos/kube-prometheus
This would allow us to manage everything with a single chart instead of two.
While trying to up the database resources for a workshop without downtime I gave AWS Aurora MySQL a try. It worked pretty well
One can set up the aurora replicas to read from the already existing cbioportal public db instance. Takes about an hour or so to start. Make sure to allow connections from the kubernetes VPC. Then you can connect by running a mysql client on the k8s cluster:
kubectl run --rm -i --tty mysql-client --image=mysql:5.7 --restart=Never -- sh
And connect to the read endpoint that Amazon gives you. All DB settings are copied, so you can log in with the same credentials as usual.
To connect the cBioPortal pods, one can simply change DB_HOST to point to the Aurora instance.
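Pointing the pods at Aurora is then just an environment change in the deployment spec. The endpoint below is a made-up example of an Aurora reader endpoint, not the real one:

```yaml
# Hypothetical deployment snippet: DB_HOST repointed at the Aurora cluster
containers:
  - name: cbioportal
    env:
      - name: DB_HOST
        value: "cbioportal.cluster-ro-abc123.us-east-1.rds.amazonaws.com"  # example reader endpoint
```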
To really use this in production there were a few remaining issues.
Currently there is no backup of the Mongo DB used to cache Genome Nexus variant annotations. Setup daily Mongo DB dump to S3 for use during recovery of Mongo pod crash.
Done condition: on restart of the Mongo server in Kubernetes, the persisted Mongo cache contents are loaded from backup. Backups of the Mongo contents are made on a daily basis (avoiding corruption / loss of integrity).
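The daily dump could be a CronJob along these lines. The image, Mongo host, database name, bucket, and schedule are all assumptions for illustration; the image would need to provide both mongodump and the aws CLI:

```yaml
# Hypothetical CronJob: nightly mongodump of the Genome Nexus cache to S3.
# batch/v1 requires Kubernetes 1.21+; older clusters use batch/v1beta1.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: genome-nexus-mongo-backup
spec:
  schedule: "0 4 * * *"            # daily at 04:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: mongo:3.6     # assumption: image with mongodump + aws cli
              command:
                - sh
                - -c
                - >
                  mongodump --host gn-mongo --db annotator
                  --archive=/tmp/annotator.gz --gzip &&
                  aws s3 cp /tmp/annotator.gz
                  s3://example-backups/genome-nexus/annotator-$(date +%F).gz
```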
Consider leveraging the VEP cache mounting tool.
Avery made a PR to add the HRRS filter to save requests, but we don't yet know exactly how to use it on the cluster.
I'm trying to follow along with the Kubernetes installation procedures, but it seems like many resources depend on a configmap in some portal-configuration directory. I'm guessing this is meant to be customized for each individual installation, but I can't find any information about the schema for this configmap. Apologies if this is documented somewhere and I simply haven't seen it.
Develop a backup script that dumps the session store for backup.
We can keep a central backup of all sessions in a merged/joint repository, and then insert missing sessions into any future session-service deployment that is incomplete.
This is good if we lose our session service database in an accident / if we need to recover data.
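A sketch of the dump-and-merge idea. The Mongo host and database names are assumptions, and the `run` helper only prints the commands here rather than executing them:

```shell
# Sketch: back up session-service and merge sessions into a fresh deployment.
# Host/db names are assumptions; run() prints instead of executing.
run() { echo "$@"; }

STAMP=$(date -u +%Y-%m-%d)
run mongodump --host session-service-mongo --db session_service \
    --archive="/backup/sessions-${STAMP}.gz" --gzip

# mongorestore inserts documents and by default logs-and-continues on
# duplicate _id errors, so restoring a merged archive effectively fills in
# only the missing sessions
run mongorestore --host session-service-mongo \
    --archive=/backup/sessions-merged.gz --gzip
```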
Need to figure out how to store metrics for ~1 week or so
Follow example of OncoKB: #36
digits-eks/eks-prod/shared-services/redis_persistence/redis_persistence_helm_install_values.yaml
public-eks
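For the ~1 week of metrics retention, the stable/prometheus-operator chart exposes a retention setting in its values. A sketch of the relevant helm values (the storage size is an arbitrary example, and persisting to a PVC is my assumption so data survives pod restarts):

```yaml
# hypothetical values for the stable/prometheus-operator chart
prometheus:
  prometheusSpec:
    retention: 7d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 20Gi   # example size; tune to actual metric volume
```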
Before sharing tools with SAGE, let's populate the Mongo DB cache by annotating the latest consortium release of GENIE.
We could use the genie MAF directly as well.
For the webinar with over 2000 registrants it would be good to have a separate instance, so even if the production machine goes down the webinar can continue :)
Currently genie genome-nexus downloads the VEP cache from S3. We want to replace this cache with an indexed, tabix-converted version, which improves VEP performance. Furthermore, because this file takes a while to download, it would be better to store the cache on a persistent volume and just have the GN pod mount it on startup (preventing long startup/restart times).
Consider NFS as a solution. "Timebox" the effort of selecting a tool/approach to no more than one day of examination. If at all possible, avoid an Amazon-specific solution (follow other generic Kubernetes projects).
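The mount idea might look like the fragment below. The claim name, sizes, and mount path are assumptions; ReadOnlyMany access presumes NFS-like storage, which fits the suggestion above:

```yaml
# Hypothetical: pre-populated PVC holding the indexed VEP cache,
# mounted read-only by the Genome Nexus VEP pod
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vep-cache
spec:
  accessModes: ["ReadOnlyMany"]   # assumes NFS-backed storage
  resources:
    requests:
      storage: 30Gi               # example size
---
# corresponding pod spec fragment:
#   volumes:
#     - name: vep-cache
#       persistentVolumeClaim:
#         claimName: vep-cache
#         readOnly: true
#   volumeMounts:
#     - name: vep-cache
#       mountPath: /opt/vep/.vep  # assumed VEP cache location
#       readOnly: true
```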
Previously we have been able to deploy instances of cellxgene in heroku. See e.g.:
https://github.com/cBioPortal/brugge-singlecell
It allows one to point to an h5ad file in a bucket and deploy it. It would be great if we could deploy more cellxgene datasets on the cbioportal domain, e.g. under *.cellxgene.cbioportal.org, using our k8s infrastructure (or Fargate or some other AWS option, whichever makes sense). There is also this extension that allows one to use multiple datasets:
https://github.com/Novartis/cellxgene-gateway/tree/master
Using the cBioPortal domain would in addition allow mounting the tab inside of cBioPortal (this doesn't work atm because CORS is disabled on https://cellxgene.cziscience.com/)
Previously Jessica Singh from The Hyve used it and demoed its use. We've also had a server set up at single-cell.cbioportal.org, but it is currently broken. We thought using the k8s infra, or some other setup with auto SSL and all the bells and whistles, might help with maintainability.
SAML is old, OAuth2 is cool
./docs/Caching.md:@Cacheable(cacheNames = "ClinicalDataCache", condition = "@cacheEnabledConfig.getEnabled()")
Re-configure so that the cache is not embedded in the JVM process but instead runs as an external process (Redis). The cBioPortal persistence-layer annotations can be updated to refer to an external service.
Consult with Hongxin: OncoKB is already using external caching, and we can use that approach as a template (helm chart and linkage; see OncoKB's CacheConfiguration).
One or more helm deployments of Redis are running in the Kubernetes cluster and being used by various websites to cache persistence-layer return values.
Use a separate pool of Redis services for the distinct cohort databases.
Start with the Genie database, but plan for having a separate pool of servers for public.
The code base should allow continued use of embedded Ehcache, but also allow reconfiguration to use external Redis services.
Can ask CHOP folks how to do this
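If cBioPortal's Redis cache support is used, the switch might reduce to a properties change. The property names below are my recollection of cBioPortal's cache configuration and the Redis host names are invented, so verify both against docs/Caching.md before relying on this:

```properties
# hypothetical portal.properties fragment: external Redis instead of embedded Ehcache
persistence.cache_type=redis
redis.leader_address=redis://genie-redis-master:6379
redis.follower_address=redis://genie-redis-replicas:6379
redis.database=0
redis.password=<secret>
```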
We're currently not saving the nginx access log for longer than a day or so
Currently we don't have a way to restart the pods on the cluster after doing a database update
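kubectl can bounce deployments without editing their specs. A sketch, where the deployment names are examples and the `run` helper only prints the commands:

```shell
# Sketch: restart cBioPortal deployments after a database update.
# Deployment names are examples; run() prints instead of executing.
run() { echo "$@"; }

for d in cbioportal-public cbioportal-genie; do
  # kubectl rollout restart (available since 1.15) triggers a rolling
  # restart of the deployment's pods
  run kubectl rollout restart deployment "$d"
done
```

This could be the last step of the database sync scripts.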