Goal
A caching service to reduce latency in reading and writing files and other objects within batch pipelines.
Considerations
Interface
For now, it probably makes the most sense to add some sort of “cache me” flag to resource files in the Batch interface.
We might want to build an interface on top of that that allows us to directly cache certain Python/Scala objects (mostly in the sense of not having to explicitly write them out as files and read them back in).
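As a rough illustration, here is what the flag might look like from the user's side, written against something like the hailtop.batch front end; the cache= parameter on read_input and write_output is the hypothetical addition, and the paths and command are made up.

    import hailtop.batch as hb

    b = hb.Batch(backend=hb.ServiceBackend())

    # Hypothetical cache= flag on an input resource file: reads of this file
    # would go through the caching service rather than straight to Google Storage.
    ref = b.read_input('gs://my-bucket/reference.fasta', cache=True)

    j = b.new_job()
    j.command(f'run_filter --ref {ref} --out {j.ofile}')

    # Hypothetical cache= flag on an output: the file is still written to
    # persistent storage, but is also kept in the cache for downstream jobs.
    b.write_output(j.ofile, 'gs://my-bucket/filtered.vcf', cache=True)

    b.run()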
Storage-backed in-memory cache
The cache is a key-value store of potentially-cached objects; each object must be backed by a file in some persistent network storage (currently just Google Storage). If an object is in the cache, a read will return directly from the cache; otherwise, the object will be loaded into the cache from persistent storage. As long as we're using Redis as the in-memory store, we will probably use a basic LRU eviction policy, although it's possible we'll want to factor in the size of the object being cached.
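A minimal sketch of the read path, assuming redis-py for the in-memory store and the google-cloud-storage client for the persistent backing; the per-user key scheme is a placeholder, not a settled design. LRU eviction would be Redis configuration (e.g. maxmemory-policy allkeys-lru) rather than logic in this layer.

    import redis
    from google.cloud import storage

    r = redis.Redis(host='localhost', port=6379)
    gcs = storage.Client()

    def cached_read(user, bucket, path):
        # Keys are namespaced by user and by the backing file's location.
        key = f'{user}:{bucket}/{path}'
        value = r.get(key)
        if value is not None:
            return value  # cache hit: serve directly from Redis
        # Cache miss: load from persistent storage, then populate the cache.
        blob = gcs.bucket(bucket).blob(path)
        value = blob.download_as_bytes()
        r.set(key, value)
        return value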
When storing an object to the cache (e.g. as the output of a job), the cache will wait for the object to finish being written to persistent storage before returning success. Another option would be to allow jobs to be considered done without confirming the write to persistent storage; that would require a way to rerun the job that produces the unfinished output file. This is what Spark does, and it generally tends to be Really Bad when we do need to rerun jobs, so the extra latency from waiting on the write seems acceptable to avoid this.
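The corresponding write path, under the same assumptions: success is only reported once the object is durable in Google Storage, and the cache is populated afterwards.

    import redis
    from google.cloud import storage

    r = redis.Redis(host='localhost', port=6379)
    gcs = storage.Client()

    def cached_write(user, bucket, path, value):
        key = f'{user}:{bucket}/{path}'
        # Block until the object is durably written to persistent storage;
        # only then populate the cache and return success to the caller.
        gcs.bucket(bucket).blob(path).upload_from_string(value)
        r.set(key, value)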
Storage/File cleanup
Intermediate files written within a batch should get cleaned up when the entire batch finishes. If a file gets deleted from persistent storage, it should probably also get removed from the cache (eventually). Do we need to care about cleaning up the rest?
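A rough sketch of batch-teardown cleanup, assuming intermediate outputs live under a per-batch prefix in Google Storage and that cache keys embed the same prefix; the prefix layout is an assumption.

    import redis
    from google.cloud import storage

    r = redis.Redis(host='localhost', port=6379)
    gcs = storage.Client()

    def cleanup_batch(user, bucket, batch_prefix):
        # Delete the batch's intermediate files from persistent storage...
        for blob in gcs.bucket(bucket).list_blobs(prefix=batch_prefix):
            blob.delete()
        # ...and drop any cache entries backed by those files.
        for key in r.scan_iter(match=f'{user}:{bucket}/{batch_prefix}*'):
            r.delete(key)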
Security
A user should only be able to access data that they have cached themselves. Every read request needs to come from a specific user and will only be granted if that user was the one who stored the original object.
I plan on writing this logic as a layer around the Redis server within the same container, so that there's no direct access to Redis on open ports within the Kubernetes cluster. I'm not actually sure whether this is a real concern (folks being able to access or dump the contents of the Redis instance from inside the cluster), but at the very least it seems like a reasonable thing to do.
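A minimal sketch of that layer, assuming an aiohttp front end running next to Redis in the same pod and the per-user key scheme from the sketches above; the X-Hail-User header standing in for an authenticated identity is an assumption, and a real implementation would use an async Redis client rather than blocking calls.

    import redis
    from aiohttp import web

    r = redis.Redis(host='localhost', port=6379)
    routes = web.RouteTableDef()

    @routes.get('/cache/{bucket}/{path:.*}')
    async def get_cached(request):
        # Keys are built from the authenticated user, so a request can never
        # name another user's entries, and Redis itself is only reachable
        # through this app (no Redis port exposed outside the container).
        user = request.headers['X-Hail-User']
        key = f"{user}:{request.match_info['bucket']}/{request.match_info['path']}"
        value = r.get(key)
        if value is None:
            raise web.HTTPNotFound()
        return web.Response(body=value)

    app = web.Application()
    app.add_routes(routes)
    # web.run_app(app, host='127.0.0.1', port=5000)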
Cache invalidation
What happens when we cache a file, then update it, and then try to access it again? We do not want to return the old copy (the one in the cache), since the user will be expecting the updated one.
We could have a hailctl command to clear the entire cache for a given user (which seems useful in any case), or we could check whether the file has been modified since it was cached; Google Storage exposes ETags that would let us do this.
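A sketch of the ETag check, assuming the google-cloud-storage client and the same key scheme as above; whether a metadata request per read is acceptable overhead is an open question. Clearing a user's whole cache (the hailctl option) would just mean deleting every key in that user's namespace.

    import redis
    from google.cloud import storage

    r = redis.Redis(host='localhost', port=6379)
    gcs = storage.Client()

    def validated_read(user, bucket, path):
        key = f'{user}:{bucket}/{path}'
        # get_blob fetches object metadata (including the current ETag)
        # without downloading the data itself.
        # (Assumes the file still exists; error handling omitted.)
        blob = gcs.bucket(bucket).get_blob(path)
        cached_etag = r.get(f'etag:{key}')
        if cached_etag is not None and cached_etag.decode() == blob.etag:
            value = r.get(key)
            if value is not None:
                return value  # cached copy is still current
        # Missing or stale: refetch from Google Storage and refresh the cache.
        value = blob.download_as_bytes()
        r.set(key, value)
        r.set(f'etag:{key}', blob.etag)
        return value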
Billing
The cost to us of maintaining the cache will mostly depend on the size of the cached object and on how long it stays in the cache; the first is easy to track (include the cached output size in the billing info message upon job completion), while the latter is more nebulous.
Scaling
There are two main metrics for scaling the cluster: some percentage of the total size of cached objects, and the cache hit rate. I'm not sure what the sharding strategy would be at this point.
Integration with file localization steps
(I will probably need to ping Jackie about the gsutil replacement code; we will want to integrate caching and file localization into a single component that handles I/O.)
Roadmap/Future Work/Misc
Single redis node
Multiple redis nodes with sharding
Autoscaling for cluster
Fast persistent storage between Google Storage and in-memory cache
More nebulous plans
I expect that we will eventually want to implement our own in-memory store, optimized for the large blocks of data that we generally expect to be caching. Ideally, using Redis will give us a better idea of our priorities when we design this.
I think my ultimate goal is to have a caching service that is integrated with the scheduler in a way that allows data to be cached locally, on compute nodes, and accessed via shared memory when the cached data resides on the same node as the computation that requires it. This is almost certainly very far off, but I would like to revisit it once we have some form of caching service running and integrated with batch.
I have this phrase in my notes: “something Cassandra-like optimized for big block storage,” which I assume was meant to be a description of what we want this to look like in the end?