I've included two diagrams, a high-level block diagram and a simplified flow chart, to help frame the conversation.
At its core, this is a very common solution for long-running tasks triggered by user requests, and it applies regardless of the payload's composition. You could be calculating arc-flash hazard boundaries or rasterizing text and it would work out the same.
- Some tasks may be long-running, such as in the case of image transformations
- Some assets, such as previously transformed images, may be reused
- There is no requirement that this run as a monolithic process; a microservice/SOA approach is acceptable
The system comprises the following components:
- HTTPS Ingress
- Cache Server
- Task Queue
- Database
- Out-of-band services (text and image processing)
This represents all the front-facing input handling, such as the reverse proxy and load balancing, in addition to the process that fetches the requested resources or schedules their out-of-band processing.
The cache server allows quick responses for previously generated resources, such as processed images.
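As a rough sketch of what that lookup might look like, assuming Redis via the redis-py client; the key scheme and TTL here are illustrative, not prescriptive:

```python
import redis

# Hypothetical cache client; assumes a local Redis instance and that
# processed assets are keyed by a deterministic cache key.
cache = redis.Redis(host="localhost", port=6379)

def get_cached_asset(cache_key: str) -> bytes | None:
    """Return the processed asset if it was generated before, else None."""
    return cache.get(cache_key)

def store_asset(cache_key: str, payload: bytes, ttl_seconds: int = 3600) -> None:
    # A TTL keeps the cache from growing without bound; the "right"
    # duration is debatable (see the database note below).
    cache.setex(cache_key, ttl_seconds, payload)
```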
Using a task queue for long-running, out-of-band jobs such as image processing lets you distribute load as well as return placeholder data or a job status while the work completes.
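For example, with Celery the web tier can enqueue the work and hand back a job id immediately; the broker URL and task body below are assumptions for the sketch:

```python
from celery import Celery

# Minimal sketch; broker/backend URLs are illustrative.
app = Celery("assets", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

@app.task
def process_image(asset_id: str) -> str:
    # The long-running transformation runs here, on a worker,
    # out-of-band from the web request.
    ...
    return asset_id

# The web tier enqueues and returns immediately with a handle the
# client can poll for placeholder data or job status.
result = process_image.delay("asset-123")
print(result.id)
```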
Given Assumption 2, that assets may be reused, it would be wise to persist the generated assets in a database. How long to persist them is debatable.
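A minimal sketch of that persistence, assuming SQLite purely for illustration and keying assets on a content hash so a repeated request finds the prior result:

```python
import hashlib
import sqlite3

# Illustrative persistence layer; the schema and names are assumptions.
db = sqlite3.connect("assets.db")
db.execute("""CREATE TABLE IF NOT EXISTS processed_assets (
                  cache_key TEXT PRIMARY KEY,
                  payload   BLOB NOT NULL,
                  created   TIMESTAMP DEFAULT CURRENT_TIMESTAMP)""")

def persist_asset(source: bytes, payload: bytes) -> str:
    # Key on a hash of the source so a repeated request maps to the
    # same stored result (Assumption 2: assets may be reused).
    key = hashlib.sha256(source).hexdigest()
    db.execute("INSERT OR REPLACE INTO processed_assets (cache_key, payload) VALUES (?, ?)",
               (key, payload))
    db.commit()
    return key
```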
These services handle raster image generation from text assets, and image transformations such as downsampling.
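To make the image side concrete, here is one such transformation sketched with Pillow; the size limit is arbitrary and any image library would do:

```python
from io import BytesIO
from PIL import Image  # Pillow

def downsample(image_bytes: bytes, max_size: tuple[int, int] = (800, 800)) -> bytes:
    """One example transformation: downsample an image."""
    img = Image.open(BytesIO(image_bytes))
    img.thumbnail(max_size)  # shrinks in place, preserving aspect ratio
    out = BytesIO()
    img.save(out, format=img.format or "PNG")
    return out.getvalue()
```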
When the requested assets are already cached:
- Accept payload
- Parse payload
- Check cache
- Return cached assets
When the assets are stored in the database but not cached:
- Accept payload
- Parse payload
- Check cache
- Check database
- Return stored assets
- Load assets into cache
When the assets have not yet been generated (see the sketch after these lists):
- Accept payload
- Parse payload
- Check cache
- Check database
- Schedule task to process assets
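Here is a minimal sketch of that decision chain, reusing the hypothetical cache, database, and task helpers from the earlier snippets:

```python
import hashlib

def handle_request(source: bytes) -> dict:
    """Hypothetical handler tying the three flows together."""
    key = hashlib.sha256(source).hexdigest()

    cached = get_cached_asset(key)  # flow 1: cache hit
    if cached is not None:
        return {"status": "done", "asset": cached}

    row = db.execute(
        "SELECT payload FROM processed_assets WHERE cache_key = ?", (key,)
    ).fetchone()
    if row is not None:             # flow 2: database hit
        store_asset(key, row[0])    # load the assets back into the cache
        return {"status": "done", "asset": row[0]}

    job = process_image.delay(key)  # flow 3: schedule out-of-band work
    return {"status": "pending", "job_id": job.id}
```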
Either the assets are cached/stored or they are not. If they are not, a task is queued to perform the processing so the results can be stored and cached for later use.
Four, since you could use the same service for both the cache and the task queue backend. If you were adventurous, or wanted to reduce the number of services further, you could use a single service (MongoDB, for instance) to house the job queue backend and drop the caching layer.
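For instance, a single Redis deployment can serve double duty by splitting the roles across logical databases; the host name here is made up:

```python
import redis
from celery import Celery

# One Redis deployment backing both roles: logical db 0 for the
# task queue, logical db 1 for the cache.
REDIS_URL = "redis://shared-redis:6379"

app = Celery("assets", broker=f"{REDIS_URL}/0", backend=f"{REDIS_URL}/0")
cache = redis.Redis.from_url(f"{REDIS_URL}/1")
```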
Horizontally. The bottlenecks are at the ingress point and the out-of-band processing (though that is less critical than the ingress since tasks get queued).
This is a rather loaded question; technically it can fail anywhere. By using a distributed task queue we can build redundancy and fault tolerance into the system. This also lets you guard against conditions such as sudden spikes in demand (I'm told this is the Thundering Herd problem).
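As one concrete example of that baked-in fault tolerance, Celery tasks can declare retry behavior so a transient failure re-queues the job rather than dropping it; `transform` and `TransientError` are hypothetical stand-ins:

```python
@app.task(bind=True, max_retries=3, default_retry_delay=30)
def process_image_safely(self, asset_id: str) -> str:
    # If a worker dies mid-job or a dependency hiccups, the queue
    # retries the task instead of silently losing the request.
    try:
        return transform(asset_id)   # hypothetical transformation
    except TransientError as exc:    # hypothetical exception type
        raise self.retry(exc=exc)
```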
Since the majority of the action happens behind the HTTPS ingress point there is not much that I can see here, though I am no web security expert. I do feel that this architecture allows for a very small API surface, which I would assume translates into a smaller attack surface.
Each component can be tested in isolation with varying degrees of mocking. The out-of-band services obviously enjoy the most autonomy; however, each component is highly decoupled and depends on an eventual, universal source of truth: the database.
Yes. Because of the long-running processes it is handy to store state, and by using the task queue we get that ability baked in. This allows a request for assets to provide status updates to the rest of the system.
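With Celery, for instance, the job id returned at enqueue time can be turned back into a status at any point, assuming the `app` instance sketched earlier:

```python
from celery.result import AsyncResult

def job_status(job_id: str) -> str:
    # The result backend tracks task state for free:
    # PENDING, STARTED, RETRY, FAILURE, or SUCCESS.
    return AsyncResult(job_id, app=app).state
```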
I'd recommend Redis for the caching server, and Celery (Python) or Kue (Node) for task management. For the database it really doesn't matter much in my thinking; something with low-latency reads would be ideal. Given that all our long-running jobs are out-of-band, Node/Express/Koa would work well for handling the requests. That said, AWS has a swarm of managed services that could be exploited to stitch this together; SQS and Elastic Beanstalk (and to a lesser degree Lambda) come to mind. After all, not everyone wants to feed and water a cluster of Redis servers...