Caching Service

Goal

A caching service to reduce latency in reading and writing files and other objects within batch pipelines.

Considerations

Interface

For now, it probably makes the most sense to add some sort of “cache me” flag to resource files in the Batch interface.

We might want to build an interface on top of that which lets us directly cache certain Python/Scala objects (mostly so we don’t have to explicitly write them out as files and read them back in).
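As a strawman, the flag might look something like this (a minimal sketch; ResourceFile and the cache flag are placeholders for illustration, not the existing Batch interface):

    # Minimal sketch of a "cache me" flag on a resource file; the names here
    # (ResourceFile, cache) are assumptions, not the current Batch API.
    from dataclasses import dataclass

    @dataclass
    class ResourceFile:
        """A job input/output backed by persistent storage (e.g. a gs:// path)."""
        url: str
        cache: bool = False  # proposed flag: opt this file into the caching service

    # A pipeline could then declare which inputs/outputs are worth caching:
    reference = ResourceFile('gs://my-bucket/reference.fasta', cache=True)
    scratch = ResourceFile('gs://my-bucket/tmp/part-0001')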

Storage-backed in-memory cache

The cache is a key-value store of potentially-cached objects; each object must be backed by a file in some persistent network storage (currently just Google Storage). If an object is in the cache, a read will return directly from the cache; otherwise, it will load into the cache from persistent storage. As long as we’re using Redis as the in-memory store, we will probably use a basic LRU eviction policy, although it’s possible we’ll want to factor in the size of the object being cached.
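Roughly, the read path might look like this (a sketch assuming redis-py and google-cloud-storage; the user-namespaced key scheme and helper names are placeholders, not a settled design):

    # Read-through sketch: serve from Redis if present, otherwise load from
    # Google Storage and populate the cache. Key scheme is illustrative only.
    import redis
    from google.cloud import storage

    r = redis.Redis(host='cache-redis', port=6379)  # eviction left to Redis,
                                                    # e.g. maxmemory-policy allkeys-lru
    gcs = storage.Client()

    def cache_key(user: str, url: str) -> str:
        # Namespacing keys by user keeps one user's entries separate from another's.
        return f'{user}:{url}'

    def read(user: str, bucket: str, path: str) -> bytes:
        """Return the object from the cache if present, else load it from
        persistent storage and populate the cache before returning."""
        key = cache_key(user, f'gs://{bucket}/{path}')
        cached = r.get(key)
        if cached is not None:
            return cached                                    # cache hit
        data = gcs.bucket(bucket).blob(path).download_as_bytes()
        r.set(key, data)                                     # populate; Redis evicts LRU entries as needed
        return data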

When storing an object to the cache (e.g. as the output to a job), the cache will wait for the object to finish being written to persistent storage before returning success. Another option would be to allow jobs to be considered done without confirming the write to disk; this would need a way to rerun the job that produces the unfinished output file. This is what Spark does and it generally tends to be Really Bad when we do need to rerun jobs, so the extra latency from waiting on a write seems acceptable to avoid this.
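Continuing the same sketch, the write path would block on the Google Storage upload before touching the cache:

    def write(user: str, bucket: str, path: str, data: bytes) -> None:
        """Write-through: only report success once the object is durably in
        persistent storage, so losing the cache never loses an output."""
        gcs.bucket(bucket).blob(path).upload_from_string(data)  # blocks until the upload completes
        r.set(cache_key(user, f'gs://{bucket}/{path}'), data)   # then populate the cache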

Storage/File cleanup

Intermediate files written within a batch should get cleaned up when the entire batch finishes. If a file gets deleted from persistent storage, it should probably also get removed from the cache (eventually). Do we need to care about cleaning up the rest?
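If cache keys carry a batch-scoped prefix (an assumption here, distinct from the per-user scheme sketched above), the cache side of that cleanup could be as simple as:

    def cleanup_batch(batch_id: str) -> None:
        """On batch completion, drop the cache entries for its intermediate
        files; the files themselves are deleted by the existing cleanup path."""
        for key in r.scan_iter(match=f'batch/{batch_id}/*'):  # assumes batch-scoped keys
            r.delete(key)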

Security

A user should only be able to access data that they have cached themselves. Every read request needs to come from a specific user and will only be granted if that user was the one who stored the original object.
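One way to enforce this, continuing the earlier sketch, is to make the requesting user’s identity part of the key, so a user can only ever construct keys in their own namespace (the real check would live in the wrapping layer described below):

    def secure_get(requesting_user: str, url: str):
        """Return the cached object (bytes) only if requesting_user stored it;
        anything another user cached is just a miss (None) from this user's view."""
        return r.get(cache_key(requesting_user, url))

    def secure_put(requesting_user: str, url: str, data: bytes) -> None:
        r.set(cache_key(requesting_user, url), data)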

I plan on writing this logic as a layer around the Redis server within the same container, so that there’s no direct access to the Redis server on open ports within the Kubernetes cluster. I’m not actually sure whether this is a real concern (folks being able to access/dump the contents of the Redis instance from inside the cluster), but at the very least it seems like a reasonable precaution.

Cache invalidation

What happens when we cache a file, then update it, and then try to access it again? We do not want to return the old copy (the one in the cache), since the user will be expecting the updated one.

We could have a hailctl command to clear the entire cache for a given user (which seems useful in any case), or we could check whether the file has been modified since it was cached; Google Storage exposes etags for exactly this purpose.
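Continuing the sketch above, an etag check would look roughly like this (at the cost of a metadata request to Google Storage on every read):

    def read_if_current(user: str, bucket: str, path: str) -> bytes:
        """Return the cached copy only if its etag still matches the live
        object in Google Storage; otherwise re-fetch and refresh the cache."""
        url = f'gs://{bucket}/{path}'
        blob = gcs.bucket(bucket).get_blob(path)       # metadata fetch, includes the etag
        if blob is None:
            raise FileNotFoundError(url)
        cached = r.get(cache_key(user, url))
        cached_etag = r.get(cache_key(user, url) + ':etag')
        if cached is not None and cached_etag is not None and cached_etag.decode() == blob.etag:
            return cached                              # cached copy is still current
        data = blob.download_as_bytes()                # stale or missing: refresh
        r.set(cache_key(user, url), data)
        r.set(cache_key(user, url) + ':etag', blob.etag)
        return data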

Billing

The cost to us of maintaining the cache will depend mostly on the size of the cached object and on how long it stays in the cache; the first is easy to track (include the cached output size in the billing info message upon job completion), while the latter is more nebulous.

Scaling

There are two main metrics for scaling the cluster: some percentage of the total size of cached objects, and the cache hit rate. I’m not sure what the sharding strategy would be at this point.

Integration with file localization steps

(I will probably need to ping Jackie about the gsutil replacement code; we will want to integrate caching and file localization into a single component that handles I/O.)

Roadmap/Future Work/Misc

Single Redis node

Multiple Redis nodes with sharding

Autoscaling for cluster

Fast persistent storage between Google Storage and in-memory cache

More nebulous plans

I expect that we will eventually want to implement our own in-memory store, optimized for the large-block data that we expect to be caching in general. Ideally, using Redis will give us a better idea of our priorities when we design this.

I think my ultimate goal is to have a caching service that is integrated with the scheduler in a way that allows data to be cached locally on compute nodes and accessed via shared memory when the cached data resides on the same node as the computation that requires it. This is almost certainly very far off, but I would like to revisit it once we have some form of caching service running and integrated with Batch.

I have this phrase in my notes: “something Cassandra-like optimized for big block storage” which I assume was meant to be a description of what we want this to look like in the end?

cache invalidation

So, I guess I don’t actually think of this as a “cache.” I’ve been thinking of it as a write-once storage layer whose reads are lower latency than GCS. Using GCS as a backing store is an opaque implementation detail.

It sounds like you’re imagining a general cache of GCS. This seems ripe for unexpected results because of the mutability of GCS.

This seems fine for a single user. The high-level Batch user interface should probably tie the lifetime of a cached file to the lifetime of the batch anyway. I don’t even think a batch should be able to mutate the data. The cache is shared global state; mutation inevitably leads to data races.

I think for things like Docker images we should resolve the SHA of the image at batch submission time and always refer to the image by its SHA. Since the SHA is immutable there are no cache issues. Users can also safely push new images to the same tag while the batch is running.
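For Docker Hub specifically, resolving a tag to its digest is a couple of registry-API calls; a hedged sketch (the endpoints below are Docker Hub’s token service and registry v2 API, and other registries differ in how they handle auth):

    # Resolve a Docker Hub tag to the immutable digest it currently points to,
    # so the batch can refer to the image as repository@sha256:... from then on.
    import requests

    def resolve_digest(repository: str, tag: str = 'latest') -> str:
        token = requests.get(
            'https://auth.docker.io/token',
            params={'service': 'registry.docker.io',
                    'scope': f'repository:{repository}:pull'},
        ).json()['token']
        resp = requests.head(
            f'https://registry-1.docker.io/v2/{repository}/manifests/{tag}',
            headers={
                'Authorization': f'Bearer {token}',
                'Accept': 'application/vnd.docker.distribution.manifest.list.v2+json, '
                          'application/vnd.docker.distribution.manifest.v2+json',
            },
        )
        resp.raise_for_status()
        return resp.headers['Docker-Content-Digest']   # e.g. 'sha256:...'

    # e.g. image = f'library/python@{resolve_digest("library/python", "3")}'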

security

Hmm. I think if data is public when it is read into the cache, it should be considered public forever. For example, a given SHA-identified python3 Docker image should be stored exactly once and shared by everyone. I wonder if Docker images are generally kind of special.

Not exposing Redis on the cluster seems great.

integration with file localization

I imagine an API like get_or_fetch(key, url) suffices. Hmm. Should everything go through the cache? Maybe everything should go through the cache by default with a lifetime related to the file size? Every read resets the lifetime. Otherwise Batch would have to make smart caching choices that seem kind of annoying to implement.

The key for the fetch should probably be tied to the batch somehow to avoid cache invalidation issues. Some version of the file is loaded into the cache and then the whole batch sees the same view. Subsequent batches will get a fresh version. Seems easy to explain to users?
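A minimal sketch of that combination, reusing the Redis and Google Storage clients from the earlier sketch (the key scheme, lifetime heuristic, and names here are placeholders):

    def get_or_fetch(batch_id: str, url: str) -> bytes:
        """Return the batch's view of `url`, fetching it into the cache on
        first use; every read resets the entry's lifetime."""
        key = f'batch/{batch_id}/{url}'
        data = r.get(key)
        if data is None:
            bucket, _, path = url.removeprefix('gs://').partition('/')
            data = gcs.bucket(bucket).blob(path).download_as_bytes()
        # Illustrative lifetime heuristic: larger objects expire sooner.
        ttl_seconds = max(300, 3600 - len(data) // (1024 * 1024))
        r.setex(key, ttl_seconds, data)    # (re)set both value and TTL on every read
        return data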

Re: not actually being a cache: I’m not sure what you’re referring to, exactly. You’re right that the overall service is less a cache and more a low-latency storage layer; I’m happy to take name suggestions to replace “caching service”. I’m a little confused that you put this under “cache invalidation”, though, because I think the cache there is an actual cache.

In the context you’re envisioning (which I believe is essentially caching intermediate outputs for a batch pipeline), you’re right that the cached objects shouldn’t outlive the batch that created/cached them, and that jobs should not be able to modify the files once written. I was going to incorporate this into the cleanup step, where Batch deletes things from the cache as it deletes the intermediate files it creates.

There are a couple of other use cases in which the cached object will likely want to outlive the batch that created it, mostly pertaining to persistence of Hail objects. In that case, I think the current workflow would entail being able to access the cached object across batches. It’s possible we can implement this without allowing users to cache mutable files (since we can accomplish this by generating temp files as needed); @cseed brought this issue up so I don’t know if he had a use case in mind.

I actually was not thinking about caching Docker images, although that’s definitely a thing we should support. (I don’t know what other data would ever be public, unless we start tracking cached data in public/shared buckets to avoid duplicating items in the cache—maybe we handle Docker images separately?)

I don’t think everything goes through the cache, specifically (especially not now, when the cache service will not be large/robust enough to handle all I/O), although I do think it’s probably good to have all related I/O go through the same interface. (You’re right: we do need a better name for this, since using “cache” to refer to both the actual cache and “the thing that handles networked I/O decisions and functions as a storage layer” is very confusing!)