The file-based cache is throwing dozens of 500 errors every day, and we've confirmed that users see some (if not all) of them.
Let's switch the cache to memcached, or possibly Redis. memcached should be easier to set up (it's what all of our other caches use), but it may consume a lot of RAM given the size of some of our courses. Redis might not be too hard to set up either, especially if we use a managed service.
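A minimal sketch of what the memcached option might look like, assuming a Django-style `CACHES` setting; the server address, key prefix, and timeout are placeholders, not our real values:

```python
# settings.py (sketch) -- swap the file-based cache for memcached.
# MemcachedCache ships with Django (pre-4.1); host and prefix are assumptions.
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': 'cache.example.internal:11211',  # hypothetical memcached host
        'KEY_PREFIX': 'mitx',                        # avoid collisions with other caches
        'TIMEOUT': 60 * 60,                          # 1 hour; tune per course size
    },
}
```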
The DataDog service supports a large number of integrations, many of which would be useful for our purposes. We should fork the DataDog formula and add support for the integrations we find useful.
This is for tracking the tasks necessary to rearchitect how we deploy and manage virtual infrastructure. Rather than our current approach of maintaining long-running instances and patching/upgrading them in place, we are going to rebuild the images and deploy copies of them. This will increase deployment speed and prevent a lot of deploy-time issues by letting us verify the final state of a deployed instance before it is actually put into production.
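As a rough illustration of the rebuild-and-verify flow, here is a sketch using boto3; the instance ID and image name are hypothetical, and the real pipeline would live in our orchestration tooling:

```python
import boto3

ec2 = boto3.client('ec2')

# Bake an image from a fully configured instance (ID is made up).
image = ec2.create_image(
    InstanceId='i-0123456789abcdef0',
    Name='edx-app-baked',
    Description='Pre-verified edX app image',
)

# Wait until the image is available; from here we can verify it and
# deploy identical copies instead of patching long-running instances.
ec2.get_waiter('image_available').wait(ImageIds=[image['ImageId']])
```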
Now that we have a new Kibana cluster for searching logs, we need to work out the details of forwarding the logs from the MITx servers into it. Much of this is already done, but there still appears to be some tweaking left to do.
Add fluentd to the servers to capture logs and forward them to Kibana
Pre-process logs so that they are indexed appropriately (sketch after this list)
Demo the new Kibana interface for Peter, Ben, and anyone else who needs to review MITx logs
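For the pre-processing step, one option (a sketch, assuming the fluent-logger Python package and a local fluentd agent on each host; the tag and field names are made up, not our actual log schema) is to emit structured events rather than raw lines, so the fields get indexed directly:

```python
from fluent import sender

# Point at the local fluentd agent (host/port are fluentd's defaults).
logger = sender.FluentSender('mitx.tracking', host='localhost', port=24224)

# Emit a pre-parsed event; fluentd forwards it on toward Kibana.
logger.emit('request', {
    'course_id': 'MITx/6.002x/2013_Spring',  # illustrative field
    'status': 500,
    'path': '/courseware',
})
```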
It would be good to assess our current monitoring solution (Zenoss). Zenoss does SNMP walks and executes plugins that perform HTTP requests against healthcheck/status URLs to determine availability. It checks which services are running via patterns, etc. It does simple graphing and threshold/pattern-based alerting on any data point.
However, we could really benefit from having a few things:
Reactive monitoring: @blarghmatey mentioned this one. There are sometimes things we can expect to happen (on disk, for instance) that require cleanup. For example, Studio imports from git leave a buildup of repositories on disk over time. Instead of trying to build crons for all of these things that periodically check the size of a directory or rotate the oldest repositories, it would be nice to have a service that reacts to inode events and the like, with these watches controlled from a central place (see the sketch after this list).
Maintained support for our integrations: Carson made a Zenoss plugin for HipChat, but he's gone now and it isn't being maintained, especially since he's not using Zenoss at his new job. While it probably won't break anytime soon, it would be nice to use a monitoring service that directly supports our chat tool.
An interface that isn't crappy and convoluted.
An easy way to automatically add/remove managed machines as they are brought up or torn down during orchestration.
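To make the reactive-monitoring idea concrete, here is a minimal sketch using the Python watchdog library; the watched path and cleanup action are placeholders, and a real service would pull its watch list from a central config:

```python
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler


class RepoCleanupHandler(FileSystemEventHandler):
    """React to new git checkouts instead of polling with cron."""

    def on_created(self, event):
        if event.is_directory:
            # Placeholder: rotate/delete the oldest repositories here.
            print('new repo checked out:', event.src_path)


observer = Observer()
# The path is hypothetical; Studio's actual import directory may differ.
observer.schedule(RepoCleanupHandler(), '/edx/var/studio/git-imports', recursive=False)
observer.start()
observer.join()
```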
This will allow us to build sandbox AMIs that can be used to quickly create new sandboxes for testing Micromasters, TeachersPortal, and changes to the edX codebase.
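A sketch of the payoff, again with boto3 (the AMI ID, instance type, and tag are hypothetical): spinning up a fresh sandbox becomes a single launch from the pre-built image:

```python
import boto3

ec2 = boto3.client('ec2')

# Launch a throwaway sandbox from a pre-built sandbox AMI (ID is made up).
resp = ec2.run_instances(
    ImageId='ami-0abc1234',
    InstanceType='t2.medium',
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        'ResourceType': 'instance',
        'Tags': [{'Key': 'Name', 'Value': 'sandbox-micromasters-test'}],
    }],
)
print('sandbox instance:', resp['Instances'][0]['InstanceId'])
```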