knowledgesystems / knowledgesystems-k8s-deployment
Documentation and configuration files for deploying the Knowledge Systems Group's services using Kubernetes.
This should probably be split off into a separate issue later:
For the webinar it would be useful to see Genome Nexus resource usage as well
This is already done: https://grafana.cbioportal.org/d/vWTDoH6Wz/genome-nexus?refresh=5s&panelId=18&orgId=1&tab=general. Submitting this issue for logging purposes.
We are currently on version 0.15.0:
It would be good to upgrade. Do make sure the Grafana dashboard still properly displays the nginx metrics: https://grafana.cbioportal.org/d/7R5LYe_iz/cbioportal-ingress-pod-stats?refresh=5s&orgId=1. Last time we tried to upgrade, this was an issue because the metric field names had changed: Prometheus needs to scrape all the nginx metrics properly, and the dashboard field names will probably need to be renamed.
Explore the use of Terracotta on k8s as a means of offloading cache resource usage to a pod outside of the cBioPortal backend. I suggest the GENIE portal be the first application for this.
Link to the Terracotta helm chart:
https://github.com/helm/charts/tree/master/stable/terracotta
It seems to be Ehcache 3.x compatible. Note: it's unclear whether this requires a license for Terracotta Server.
One way to do it semi-manually now is:
First download a study from datahub on your local machine following https://github.com/cbioportal/datahub#how-to-download-data
Update a local portal.properties to have the correct import credentials
Start a cbioportal-importer container:
kubectl run --rm -i --tty cbioportal-importer --image=cbioportal/cbioportal:release-3.3.0 --restart=Never -- sh
You can detach from it for now after it has started (Ctrl + P, then Ctrl + Q)
kubectl cp portal.properties cbioportal-importer:/cbioportal/
kubectl cp ucec_tcga_pub cbioportal-importer:/cbioportal/ucec_tcga_pub
kubectl attach -it cbioportal-importer
# once inside
metaImport.py -s /cbioportal/ucec_tcga_pub -o -u https://beta.cbioportal.org
Now you can detach again with Ctrl + P followed by Ctrl + Q
Check logs with kubectl logs -f cbioportal-importer
Delete the pod again with kubectl delete pod cbioportal-importer
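The manual steps above could be collected into one script. Here is a sketch, reusing the pod name, image, study, and URL from the example; instead of attaching interactively, it keeps the pod alive with `sleep` and runs the importer through `kubectl exec`. The `run` helper defaults to dry-run mode (it only prints commands), so nothing touches a cluster by accident:

```shell
#!/bin/sh
# Sketch of the manual import steps above. Pod, image, study, and URL come
# from the example and can be overridden via the environment.
# DRY_RUN=1 (the default here) prints each command instead of executing it.
STUDY="${STUDY:-ucec_tcga_pub}"
POD="${POD:-cbioportal-importer}"
IMAGE="${IMAGE:-cbioportal/cbioportal:release-3.3.0}"
PORTAL_URL="${PORTAL_URL:-https://beta.cbioportal.org}"

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi
}

# keep the pod alive so we can copy files in and exec the importer
# (sleep infinity works on GNU/debian-based images such as cbioportal's)
run kubectl run "$POD" --image="$IMAGE" --restart=Never -- sleep infinity
run kubectl cp portal.properties "$POD":/cbioportal/
run kubectl cp "$STUDY" "$POD":/cbioportal/"$STUDY"
run kubectl exec "$POD" -- metaImport.py -s "/cbioportal/$STUDY" -o -u "$PORTAL_URL"
run kubectl delete pod "$POD"
```

Set DRY_RUN=0 to actually execute against the cluster.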
We should find a way to do this in a more automated fashion. Some options:
Spin up a pod with multiple containers. First container downloads the datahub study of interest set with some env variable. Second container is the cBioPortal importer that actually imports it
Have an "always on" import container. Whenever we copy files into a particular folder it starts importing
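The first option could be sketched as a pod spec with an init container. Everything below is hypothetical, not an existing manifest: the downloader image, the STUDY_URL variable, and the example tarball URL are all assumptions.

```yaml
# Hypothetical manifest: init container fetches the study into a shared
# emptyDir volume, then the main container imports it
apiVersion: v1
kind: Pod
metadata:
  name: study-importer
spec:
  restartPolicy: Never
  volumes:
    - name: study
      emptyDir: {}
  initContainers:
    - name: download-study
      image: curlimages/curl          # assumption: any image with curl + tar
      env:
        - name: STUDY_URL             # e.g. a datahub study tarball
          value: "https://example.org/ucec_tcga_pub.tar.gz"
      command: ["sh", "-c", 'curl -sL "$STUDY_URL" | tar xz -C /study']
      volumeMounts:
        - name: study
          mountPath: /study
  containers:
    - name: importer
      image: cbioportal/cbioportal:release-3.3.0
      command: ["metaImport.py", "-s", "/study/ucec_tcga_pub", "-o", "-u", "https://beta.cbioportal.org"]
      volumeMounts:
        - name: study
          mountPath: /study
```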
We currently have duplicate logs in the elasticsearch cluster
uken/fluent-plugin-elasticsearch#312
Another solution would be to switch to a different logging system altogether
Right now the rc database sync script deletes the existing database, and the dump takes quite a long time to complete. It would be better to import into a new database, then drop the old one and rename once the import has completed.
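The swap could work roughly like this. MySQL has no RENAME DATABASE, so tables are moved individually (a multi-table RENAME TABLE is close to atomic); the database and table names below are placeholders, and in practice the table list would come from SHOW TABLES:

```shell
# Sketch: print the SQL for swapping a freshly imported database into place.
# Assumes ${OLD_DB}_old exists as an archive target; the real table list
# would come from `SHOW TABLES IN $NEW_DB`.
NEW_DB="cbioportal_rc_new"
OLD_DB="cbioportal_rc"
SQL=""
for t in cancer_study patient sample; do   # placeholder table names
  stmt="RENAME TABLE ${OLD_DB}.${t} TO ${OLD_DB}_old.${t}, ${NEW_DB}.${t} TO ${OLD_DB}.${t};"
  SQL="${SQL}${stmt} "
  echo "$stmt"
done
```

The emitted statements would then be piped to a mysql client with the right credentials.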
Use the helm chart stable/prometheus-operator
https://github.com/helm/charts/tree/master/stable/prometheus-operator
instead of the two coreos charts:
coreos/prometheus-operator
and
coreos/kube-prometheus
This would allow us to manage everything with a single chart instead of two.
While trying to up the database resources for a workshop without downtime I gave AWS Aurora MySQL a try. It worked pretty well
One can set up the aurora replicas to read from the already existing cbioportal public db instance. Takes about an hour or so to start. Make sure to allow connections from the kubernetes VPC. Then you can connect by running a mysql client on the k8s cluster:
kubectl run --rm -i --tty mysql-client --image=mysql:5.7 --restart=Never -- sh
And connect to the read endpoint that Amazon gives you. All DB settings are copied, so you can log in with the same credentials as usual.
To connect the cBioPortal pods, one can simply change DB_HOST to point to the Aurora instance.
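Pointing the pods at Aurora is then just an environment change in the deployment spec. The endpoint below is a made-up example of an Aurora reader endpoint, not the real one:

```yaml
# Hypothetical deployment snippet: DB_HOST repointed at the Aurora cluster
containers:
  - name: cbioportal
    env:
      - name: DB_HOST
        value: "cbioportal.cluster-ro-abc123.us-east-1.rds.amazonaws.com"  # example reader endpoint
```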
To really use this in production there were a few remaining issues.
Currently there is no backup of the Mongo DB used to cache Genome Nexus variant annotations. Setup daily Mongo DB dump to S3 for use during recovery of Mongo pod crash.
Done condition: on restart of the Mongo server in Kubernetes, the persisted Mongo cache contents are loaded from backup. Backups of the Mongo contents are made on a daily basis (avoiding corruption / loss of integrity).
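The daily dump could be a CronJob along these lines. The image, Mongo host, database name, bucket, and schedule are all assumptions for illustration; the image would need to provide both mongodump and the aws CLI:

```yaml
# Hypothetical CronJob: nightly mongodump of the Genome Nexus cache to S3.
# batch/v1 requires Kubernetes 1.21+; older clusters use batch/v1beta1.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: genome-nexus-mongo-backup
spec:
  schedule: "0 4 * * *"            # daily at 04:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: mongo:3.6     # assumption: image with mongodump + aws cli
              command:
                - sh
                - -c
                - >
                  mongodump --host gn-mongo --db annotator
                  --archive=/tmp/annotator.gz --gzip &&
                  aws s3 cp /tmp/annotator.gz
                  s3://example-backups/genome-nexus/annotator-$(date +%F).gz
```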
Consider leveraging the VEP cache mounting tool.
Avery made a PR to add the HRRS filter to save requests, but we don't yet know exactly how to use it on the cluster.
I'm trying to follow along with the Kubernetes installation procedures, but it seems like many resources depend on a configmap in some portal-configuration directory. I'm guessing this is meant to be customized for each individual installation, but I can't find any information about the schema for this configmap. Apologies if this is documented somewhere and I simply haven't seen it.
Develop a backup script that dumps the session store for backup.
We can keep a central backup of all sessions in a merged/joint repository, and then insert missing sessions into any future session-service deployment that is incomplete.
This is good if we lose our session service database in an accident / if we need to recover data.
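A sketch of the dump-and-merge idea. The Mongo host and database names are assumptions, and the `run` helper only prints the commands here rather than executing them:

```shell
# Sketch: back up session-service and merge sessions into a fresh deployment.
# Host/db names are assumptions; run() prints instead of executing.
run() { echo "$@"; }

STAMP=$(date -u +%Y-%m-%d)
run mongodump --host session-service-mongo --db session_service \
    --archive="/backup/sessions-${STAMP}.gz" --gzip

# mongorestore inserts documents and by default logs-and-continues on
# duplicate _id errors, so restoring a merged archive effectively fills in
# only the missing sessions
run mongorestore --host session-service-mongo \
    --archive=/backup/sessions-merged.gz --gzip
```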
Need to figure out how to store metrics for ~1 week or so
Follow example of OncoKB: #36
digits-eks/eks-prod/shared-services/redis_persistence/redis_persistence_helm_install_values.yaml
public-eks
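For the ~1 week of metrics retention, the stable/prometheus-operator chart exposes a retention setting in its values. A sketch of the relevant helm values (the storage size is an arbitrary example, and persisting to a PVC is my assumption so data survives pod restarts):

```yaml
# hypothetical values for the stable/prometheus-operator chart
prometheus:
  prometheusSpec:
    retention: 7d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 20Gi   # example size; tune to actual metric volume
```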
Before sharing tools with SAGE, let's populate the Mongo DB cache by annotating the latest consortium release of GENIE.
We could use the genie MAF directly as well.
For the webinar with over 2000 registrants it would be good to have a separate instance, so even if the production machine goes down the webinar can continue :)
Currently genie genome-nexus downloads the VEP cache from S3. We want to replace this cache with an indexed, tabix-converted version, which improves VEP performance. Furthermore, because this file takes a while to download, it would be better to store the cache on a persistent volume and just have the GN pod mount it on startup (preventing long startup/restart times).
Consider NFS as a solution. "Timebox" the effort of selecting a tool/approach to no more than one day of examination. If at all possible, avoid an Amazon-specific solution (follow other generic Kubernetes projects).
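The mount idea might look like the fragment below. The claim name, sizes, and mount path are assumptions; ReadOnlyMany access presumes NFS-like storage, which fits the suggestion above:

```yaml
# Hypothetical: pre-populated PVC holding the indexed VEP cache,
# mounted read-only by the Genome Nexus VEP pod
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vep-cache
spec:
  accessModes: ["ReadOnlyMany"]   # assumes NFS-backed storage
  resources:
    requests:
      storage: 30Gi               # example size
---
# corresponding pod spec fragment:
#   volumes:
#     - name: vep-cache
#       persistentVolumeClaim:
#         claimName: vep-cache
#         readOnly: true
#   volumeMounts:
#     - name: vep-cache
#       mountPath: /opt/vep/.vep  # assumed VEP cache location
#       readOnly: true
```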
Previously we have been able to deploy instances of cellxgene in heroku. See e.g.:
https://github.com/cBioPortal/brugge-singlecell
It allows one to point to an h5ad file in a bucket and deploy it. It would be great if we could deploy more cellxgene datasets on the cbioportal domain, e.g. under *.cellxgene.cbioportal.org, using our k8s infrastructure (or Fargate or some other AWS option, whichever makes sense). There is also this extension that allows one to use multiple datasets:
https://github.com/Novartis/cellxgene-gateway/tree/master
Using the cBioPortal domain would in addition allow mounting the tab inside of cBioPortal (this doesn't work atm because CORS is disabled on https://cellxgene.cziscience.com/)
Previously Jessica Singh from The Hyve used it and demoed its use. We've also had a server set up at single-cell.cbioportal.org, but it is currently broken. We thought using the k8s infra, or some other setup with auto SSL and all the bells and whistles, might help with maintainability.
SAML is old, OAuth2 is cool
./docs/Caching.md:@Cacheable(cacheNames = "ClinicalDataCache", condition = "@cacheEnabledConfig.getEnabled()")
Re-configure so that the cache is not embedded in the JVM process but instead runs as an external process (Redis). The cBioPortal persistence-layer annotations can be updated to refer to an external service.
Consult with Hongxin: OncoKB is already using external caching, and we can use that approach as a template (helm chart and linkage; see OncoKB's CacheConfiguration).
One or more helm deployments of Redis are running in the Kubernetes cluster and being used by various websites to cache persistence-layer return values.
Use a separate pool of Redis services for the distinct cohort databases.
Start with the Genie database, but plan for having a separate pool of servers for public.
The code base should allow continued use of embedded Ehcache, but also allow reconfiguration to use external Redis services.
Can ask CHOP folks how to do this
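If cBioPortal's Redis cache support is used, the switch might reduce to a properties change. The property names below are my recollection of cBioPortal's cache configuration and the Redis host names are invented, so verify both against docs/Caching.md before relying on this:

```properties
# hypothetical portal.properties fragment: external Redis instead of embedded Ehcache
persistence.cache_type=redis
redis.leader_address=redis://genie-redis-master:6379
redis.follower_address=redis://genie-redis-replicas:6379
redis.database=0
redis.password=<secret>
```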
We're currently not saving the nginx access log for longer than a day or so
Currently we don't have a way to restart the pods on the cluster after doing a database update
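kubectl can bounce deployments without editing their specs. A sketch, where the deployment names are examples and the `run` helper only prints the commands:

```shell
# Sketch: restart cBioPortal deployments after a database update.
# Deployment names are examples; run() prints instead of executing.
run() { echo "$@"; }

for d in cbioportal-public cbioportal-genie; do
  # kubectl rollout restart (available since 1.15) triggers a rolling
  # restart of the deployment's pods
  run kubectl rollout restart deployment "$d"
done
```

This could be the last step of the database sync scripts.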