Caching Service

Goal

A caching service to reduce latency in reading and writing files and other objects within batch pipelines.

Considerations

Interface

For now, it probably makes the most sense to add some sort of “cache me” flag to resource files in the Batch interface.
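
As a rough illustration of what that might look like in the hailtop.batch front end (the cache keyword below is hypothetical and does not exist today; everything else follows the current Python API):

    import hailtop.batch as hb

    b = hb.Batch(name='example')
    # Hypothetical "cache me" flag on a resource file; cache=True is not a real
    # parameter today, it only sketches the proposed interface.
    ref = b.read_input('gs://my-bucket/reference.fasta', cache=True)
    j = b.new_job(name='align')
    j.command(f'do-alignment --ref {ref} --out {j.ofile}')
    b.write_output(j.ofile, 'gs://my-bucket/out.bam')
    b.run()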

We might want to build an interface on top of that which allows us to directly cache certain Python/Scala objects (mostly so that we don’t have to explicitly write them out as files and read them back in).

Storage-backed in-memory cache

The cache is a key-value store of potentially-cached objects; each object must be backed by a file in some persistent network storage (currently just Google Storage). If an object is in the cache, a read will return directly from the cache; otherwise, it will be loaded into the cache from persistent storage. As long as we’re using Redis as the in-memory store, we will probably use a basic LRU eviction policy, although it’s possible we’ll want to factor in the size of the object being cached.
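
A minimal sketch of that read path, assuming redis-py as the client and google-cloud-storage as the backing store (the lookup helper and the key scheme are illustrative, not a settled design):

    import redis
    from google.cloud import storage

    r = redis.Redis(host='cache-redis', port=6379)
    gcs = storage.Client()

    def lookup(key: str, gs_path: str) -> bytes:
        """Return the bytes for key, filling the cache from Google Storage on a miss."""
        cached = r.get(key)
        if cached is not None:
            return cached
        bucket_name, _, blob_name = gs_path[len('gs://'):].partition('/')
        data = gcs.bucket(bucket_name).blob(blob_name).download_as_bytes()
        # Eviction is left to Redis (e.g. maxmemory-policy allkeys-lru).
        r.set(key, data)
        return data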

When storing an object to the cache (e.g. as the output of a job), the cache will wait for the object to finish being written to persistent storage before returning success. Another option would be to allow jobs to be considered done without confirming the write to disk; this would require a way to rerun the job that produced the unfinished output file. This is what Spark does, and it generally tends to be Really Bad when we do need to rerun jobs, so the extra latency from waiting on the write seems an acceptable price to avoid that.
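
Continuing the same sketch, the write path would block on the durable write before populating the cache and reporting success (again illustrative; upload_from_string does not return until Google Storage acknowledges the write):

    import redis
    from google.cloud import storage

    r = redis.Redis(host='cache-redis', port=6379)
    gcs = storage.Client()

    def store(key: str, gs_path: str, data: bytes) -> None:
        """Write-through put: success is only reported after the persistent write finishes."""
        bucket_name, _, blob_name = gs_path[len('gs://'):].partition('/')
        gcs.bucket(bucket_name).blob(blob_name).upload_from_string(data)  # durable write first
        r.set(key, data)  # only then populate the in-memory cache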

Storage/File cleanup

Intermediate file I/O within a batch should get cleaned up when the entire batch finishes. If a file gets deleted from persistent storage, it should probably also get removed from the cache (eventually). Do we need to care about cleaning up the rest?

Security

A user should only be able to access data that they have cached themselves. Every read request needs to come from a specific user and will only be granted if that user was the one who stored the original object.

I plan on writing this logic as a layer around the Redis server within the same container, so that there’s no direct access to the Redis server on open ports within the Kubernetes cluster. I’m not actually sure whether this is a real concern (folks being able to access/dump the contents of the Redis instance from inside the cluster), but at the very least it seems like a thing to do?
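
A sketch of what that layer’s ownership check could look like (keeping the owner and payload together in a single Redis hash is just one possible scheme):

    import redis

    r = redis.Redis(host='cache-redis', port=6379)

    def authorized_set(user: str, key: str, data: bytes) -> None:
        # Owner and payload live in one hash so they are evicted together.
        r.hset(key, mapping={'owner': user, 'data': data})

    def authorized_get(user: str, key: str) -> bytes:
        owner = r.hget(key, 'owner')
        if owner is None or owner.decode() != user:
            raise PermissionError(f'{user} did not store {key}')
        return r.hget(key, 'data')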

Cache invalidation

What happens when we cache a file, then update it, and then try to access it again? We do not want to return the old copy (the one in the cache), since the user will be expecting the updated one.

We could have a hailctl command to clear the entire cache for a given user (which seems useful in any case), or we could check whether the file has been modified since it was cached; Google Storage exposes ETags we could use for this.
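
A sketch of the ETag check, assuming google-cloud-storage and a hash-per-key layout (purely illustrative, and it assumes the backing object still exists):

    import redis
    from google.cloud import storage

    r = redis.Redis(host='cache-redis', port=6379)
    gcs = storage.Client()

    def fresh_lookup(key: str, bucket_name: str, blob_name: str) -> bytes:
        blob = gcs.bucket(bucket_name).get_blob(blob_name)  # fetches current metadata
        cached_etag = r.hget(key, 'etag')
        if cached_etag is not None and cached_etag.decode() == blob.etag:
            return r.hget(key, 'data')      # unchanged since we cached it
        data = blob.download_as_bytes()     # changed (or never cached): refill
        r.hset(key, mapping={'etag': blob.etag, 'data': data})
        return data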

Billing

Cost to us of maintaining the cache will mostly depend on the size of the thing being cached and on how long it stays in the cache; the former is easy to track (include the output cache size in the billing info message upon job completion) and the latter is more nebulous.

Scaling

There are two main metrics for scaling the cluster: some percentage of the total size of cached objects, and the cache hit rate. I’m not sure what the sharding strategy would be at this point.

Integration with file localization steps

(will probably need to ping Jackie about the gsutil replacement code; we’ll want to integrate caching and file localization into a single thing that handles I/O)

Roadmap/Future Work/Misc

Single redis node

Multiple redis nodes with sharding

Autoscaling for cluster

Fast persistent storage between Google Storage and in-memory cache

More nebulous plans

I expect that we will eventually want to implement our own in-memory store, optimized for the large-block-data that we expect to be caching in general. Ideally, using Redis will give us a better idea of our priorities when we design this.

I think my ultimate goal is to have a caching service that is integrated with the scheduler in a way that allows data to be cached locally, on compute nodes, and accessed via shared memory when the cached data resides on the same node as the computation that requires it. This is almost certainly very far off, but I would like to revisit it once we have some form of caching service running and integrated with batch.

I have this phrase in my notes: “something Cassandra-like optimized for big block storage” which I assume was meant to be a description of what we want this to look like in the end?

cache invalidation

So, I guess I don’t actually think of this as a “cache.” I’ve been thinking of it as a write-once storage layer whose reads are lower latency than GCS. Using GCS as a backing store is an opaque implementation detail.

It sounds like you’re imagining a general cache of GCS. This seems ripe for unexpected results because of the mutability of GCS.

This seems fine for a single user. The high-level user interface Batch should probably tie the lifetime of a cached file to the lifetime of the Batch anyway. I don’t even think a batch should be able to mutate the data. The cache is shared global state; mutation inevitably leads to data races.

I think for things like Docker images we should resolve the SHA of the image at batch submission time and always refer to the image by its SHA. Since the SHA is immutable, there are no cache issues. Users can also safely push new images to the same tag while the batch is running.
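
For example, submission-time pinning might look like this (assuming the docker Python SDK; resolving the digest through the registry API directly would work just as well):

    import docker

    client = docker.from_env()

    def pin_image(image: str) -> str:
        """Return an image reference pinned to its immutable digest, e.g. 'python@sha256:...'."""
        digest = client.images.get_registry_data(image).id  # resolves the tag against the registry
        repo = image.rsplit(':', 1)[0]  # naive tag stripping, fine for a sketch
        return f'{repo}@{digest}'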

security

Hmm. I think if data is public when it is read into the cache, it should be considered public forever. For example, a given SHA-identified python3 Docker image should be stored exactly once and shared by everyone. I wonder if Docker images are generally kind of special.

Not exposing Redis on the cluster seems great.

integration with file localization

I imagine an API like get_or_fetch(key, url) suffices. Hmm. Should everything go through the cache? Maybe everything should go through the cache by default with a lifetime related to the file size? Every read resets the lifetime. Otherwise Batch would have to make smart caching choices that seem kind of annoying to implement.
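
A sketch of that idea, with a TTL that shrinks as objects get bigger and is refreshed on every read (the formula and the fetch_from_gcs helper are made up for illustration):

    import redis
    from google.cloud import storage

    r = redis.Redis(host='cache-redis', port=6379)
    gcs = storage.Client()

    def fetch_from_gcs(url: str) -> bytes:
        bucket_name, _, blob_name = url[len('gs://'):].partition('/')
        return gcs.bucket(bucket_name).blob(blob_name).download_as_bytes()

    def ttl_for(num_bytes: int) -> int:
        # Illustrative only: about a day for small objects, shrinking toward an hour for huge ones.
        return max(3600, 24 * 3600 - num_bytes // (1024 * 1024))

    def get_or_fetch(key: str, url: str) -> bytes:
        data = r.get(key)
        if data is None:
            data = fetch_from_gcs(url)
            r.set(key, data)
        r.expire(key, ttl_for(len(data)))  # every read resets the lifetime
        return data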

The key for the fetch should probably be tied to the batch somehow to avoid cache invalidation issues. Some version of the file is loaded into the cache and then the whole batch sees the same view. Subsequent batches will get a fresh version. Seems easy to explain to users?
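
One way to spell that (illustrative only) is to prefix the key with the batch id, so every job in a batch sees the same snapshot and a later batch naturally misses and refetches:

    def batch_scoped_key(batch_id: int, url: str) -> str:
        """Key scheme tying a cache entry to the batch that fetched it (illustrative)."""
        return f'batch/{batch_id}/{url}'

    # e.g. get_or_fetch(batch_scoped_key(1234, url), url) with the sketch above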

Re: this not actually being a cache: I’m not sure what you’re referring to, exactly. You’re right that the overall service is less a cache and more of a low-latency storage layer; I’m happy to take name suggestions to replace “caching service”. I’m a little confused about why you put this under “cache invalidation”, though, since the cache there is an actual cache.

In the context that you’re envisioning (which I believe is essentially caching intermediate outputs for a batch pipeline), you’re right that the cached objects shouldn’t outlive the batch that created/cached them, and jobs should not be able to modify the files once written. I was going to incorporate this into cleanup, where Batch deletes things from the cache as it deletes the intermediate files it creates.

There are a couple of other use cases in which the cached object will likely want to outlive the batch that created it, mostly pertaining to persistence of Hail objects. In that case, I think the current workflow would entail being able to access the cached object across batches. It’s possible we can implement this without allowing users to cache mutable files (since we can accomplish this by generating temp files as needed); @cseed brought this issue up, so I don’t know if he had a use case in mind.

I actually was not thinking about caching Docker images, although that’s definitely a thing we should support. (I don’t know what other data would ever be public, unless we start tracking cached data in public/shared buckets to avoid duplicating items in the cache—maybe we handle Docker images separately?)

I don’t think everything should go through the cache (especially not now, when the cache service won’t be large/robust enough to handle all I/O), although I do think it’s probably good to have all related I/O go through the same interface. (You’re right, we do need a better name for this, since using “cache” to refer to both the actual cache and “the thing that handles networked I/O decisions and functions as a storage layer” is very confusing!)

Hmm. How about we call it “memory”. It remembers things for us.

I think my question is: why would I want to “invalidate” a key?

I imagine a batch writes stuff into memory. Even if the data outlives the batch, I don’t think invalidation makes sense here.

A key in memory might be populated from a (mutable) GCS file, but I’m having this gut reaction that we shouldn’t treat such a key as a cache of the file but rather a distinct entity with no relationship to that file. We could build a real caching system on top of memory by checking ETAGs or SHAs or w/e, but it seems simpler to conceive of memory as a huge, fast, key-value store.


Musing: we should’ve called batch compute. Then we’d have compute and memory. Shuffle should probably be order. If we ever replace GCS we could call that thing store.

It seems like a few different things are being proposed here, all of them great. Let me attempt to summarize:

  • An ephemeral memory cache, initially built on Redis. Redis performs replacement, so this is not durable. I see this mostly as a system building block (it can be used to speed up Google Storage) but not as an application building block (batch won’t use it directly).

  • A cache in front of Google Storage using ephemeral storage. This should probably be exposed at the level of the FS interface.

  • Dan’s memory: a faster, more expensive distributed storage system. At this point, I don’t know how this is different from the Google Storage cache, except that the Google Storage details are hidden. For this reason, the “Google Storage cache” perspective seems like a fine one to take.

  • This hasn’t been said explicitly, but I think we all imagine this memory becomes a storage layer that uses replication to be fault tolerant (maybe with some control over where something is stored in the memory hierarchy: memory, local SSD, …, PD, object storage, or even cold storage).

So I think I’m behind the original proposal: v1 is a cache sitting in front of Google Storage, with the option to use the cache for other stuff (e.g. docker images).

Some more detailed comments:

For now, it probably makes the most sense to add some sort of “cache me” flag to resource files in the Batch interface.

I think this will be useful, and it’s a good way to try out the new interface, but I don’t see this as super compelling. The case where it is interesting is shared inputs, or files with high fan-out across steps. Jackie has already implemented some caching on the worker to handle this case (if two steps that use the same input file get scheduled on the same worker, the input will be localized once).

I see three motivating cases:

  • Iterative algorithms on cached block matrices and other distributed Hail objects,

  • Sending intermediate data (JVM bytecode, serialized globals, and the CDA partition context) from Query to workers when executing,

  • As Dan says, caching Docker images, although Google Storage backed peer-to-peer distribution is probably a better solution for this case.

We might want to build an interface on top of that which allows us to directly cache certain Python/Scala objects (mostly so that we don’t have to explicitly write them out as files and read them back in).

Note, I don’t expect Query will reuse the Batch file localization mechanism. Localization is necessary because we are running Docker programs that can’t talk directly to Google Storage. The Query code will use the FS abstraction to directly read/write resources in Google Storage.

Note, the current proposal won’t decrease the latency of single-use files much, because the latency will be dominated by the write to Google Storage.

When storing an object to the cache (e.g. as the output of a job), the cache will wait for the object to finish being written to persistent storage before returning success.

There is also the question of whether the writer also writes a copy to the cache, or lets the first reader miss and fill the cache. If the former, what do you do if the cache write fails but the durable write is still OK?

Intermediate file I/O within a batch should get cleaned up when the entire batch finishes.

This is good for a few reasons, but I don’t think it is strictly necessary: if we don’t reference something again, it will eventually get replaced in the cache, and deleting it won’t change the behavior of the cache otherwise.

A user should only be able to access data that they have cached themselves. Every read request needs to come from a specific user and will only be granted if that user was the one who stored the original object.

Dan wrote:

I think if data is public when it is read into the cache, it should be considered public forever.

As Dan points out, it might be nice to cache public objects (like genome references) once. But I’m slightly uncomfortable with his suggestion. We might want to verify that the object still has public access, to handle the case where public permissions are removed. Do metadata changes change the ETags?

I plan on writing this logic as a layer around the Redis server

I think this is a good idea. As far as I’m aware, Redis is not multi-user and we cannot give client code access to a Redis server storing multi-tenant data. Ideally: (1) a network policy ensures only the cache service can talk to Redis, (2) only the cache layer has credentials to authenticate to Redis, and (3) all Redis traffic is encrypted.

Cost to us of maintaining the cache will mostly depend on the size of the thing being cached and on how long it stays in the cache; the former is easy to track (include the output cache size in the billing info message upon job completion) and the latter is more nebulous.

Yes, the latter is harder with Redis because we do not know when it replaces a cache entry.

Dan wrote:

The high-level user interface Batch should probably tie the lifetime of a cached file to the lifetime of the Batch anyway.

The key for the fetch should probably be tied to the batch somehow to avoid cache invalidation issues.

I’m a little confused by this. See my comment above about cache invalidation. I don’t think the cache interface should be tied to Batch in particular, since the main use cases are not Batch.

Dan wrote:

Should everything go through the cache?

No, I don’t think so. That will completely thrash the cache. We should charge for user-controlled use of the cache (as compared to Query internal use), hopefully incentivizing users to only cache when it is useful.