Google Cloud touts Lustre cache offload for AI inference

Thu, 2nd Jul 2026

Google Cloud has outlined a way to run multi-node large language model inference on Google Kubernetes Engine using Managed Lustre for KV cache offloading. In benchmark testing, the design reduced infrastructure costs and GPU use.

The setup targets production AI workloads that have shifted to distributed, multi-node architectures to support long context windows and agent-based systems. In these environments, KV caches can outgrow local CPU memory and host SSD storage, increasing pressure on cluster design and operating costs.
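To give a sense of scale, here is a rough back-of-the-envelope sketch of KV cache size per sequence. The model dimensions (80 layers, 8 key-value heads, head dimension 128, 16-bit values) are assumptions chosen to resemble a Llama-3-70B-style model and are not taken from Google's guide; real deployments may differ through quantization or paging overhead.

```python
# Rough KV cache sizing sketch. All model dimensions below are assumptions
# (a Llama-3-70B-style configuration), not figures from Google's guide.

def kv_cache_bytes(num_tokens, num_layers=80, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes of KV cache for one sequence: a key and a value vector
    per KV head, per layer, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token

GIB = 1024 ** 3

for tokens in (8_000, 50_000, 128_000):
    size_gib = kv_cache_bytes(tokens) / GIB
    print(f"{tokens:>7} tokens -> {size_gib:.1f} GiB per sequence")
```

Under these assumptions each token costs 320 KiB of cache, so a few dozen long-context sequences could plausibly fill the RAM of a single host.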

Google's approach uses Managed Lustre as a shared external file system for prefilled attention state, instead of pooling node-local SSDs across a cluster. This removes the need to manage data distribution and cross-node replication for local storage layers.

In benchmark results shared by Google, Managed Lustre delivered more than 50% total cost of ownership savings and cut GPU-hour requirements by nearly 60% for Llama-3.3-70B inference on a six-node A3 Mega cluster. Those results were linked to a 95% cache hit rate when shared prefilled KV caches were offloaded to Lustre.

The benchmark used a prompt length of 50,000 tokens, an input question length of 256 tokens, and an output length of 512 tokens. The figures reflect a narrow but commercially relevant test case, as long prompts are a common driver of rising inference costs in enterprise deployments.
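The effect of a high hit rate on prefill work can be illustrated with simple arithmetic. The sketch below is a simplified model, not Google's methodology: it assumes a cache hit skips prefill for the whole shared prompt and only the question is processed, while a miss processes both. It ignores decode time and the cost of loading cached state from storage, so it does not reproduce the reported GPU-hour or cost figures.

```python
# Simplified model of prefill work under prefix caching (an illustration,
# not Google's benchmark methodology).

def expected_prefill_tokens(prefix_tokens, question_tokens, hit_rate):
    """Average number of tokens that must be prefilled per request."""
    miss_cost = prefix_tokens + question_tokens  # recompute everything
    hit_cost = question_tokens                   # prefix state is reused
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost

# Prompt shape and hit rate as reported in the article.
no_cache = expected_prefill_tokens(50_000, 256, hit_rate=0.0)
with_cache = expected_prefill_tokens(50_000, 256, hit_rate=0.95)

print(f"prefill tokens per request without cache: {no_cache:.0f}")
print(f"prefill tokens per request at 95% hits:   {with_cache:.0f}")
```

In this model the average prefill work per request falls from 50,256 tokens to about 2,756, which suggests why long shared prompts are where cache offload is likely to pay off most.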

Hybrid tier

Google also described an extension of the design that adds CPU RAM offload alongside Lustre storage. The hybrid arrangement outperformed CPU offload alone, improving Time to First Token by about 40% and reducing end-to-end latency by 30% for Llama-3.3-70B inference.
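The general idea of a hybrid tier can be sketched as a small in-memory cache in front of a shared directory. This is only an illustration of the lookup order: the class and method names are hypothetical, keys are assumed to be filename-safe strings, and it does not represent vLLM's or llm-d's connector interfaces.

```python
# Hypothetical two-tier cache: host RAM first, shared file system second.
# Not the vLLM connector API; names here are invented for illustration.
import os
from collections import OrderedDict

class TieredCache:
    def __init__(self, shared_dir, ram_capacity):
        self.shared_dir = shared_dir      # e.g. a mounted shared volume
        self.ram_capacity = ram_capacity  # max entries kept in RAM
        self.ram = OrderedDict()          # insertion order tracks recency

    def _path(self, key):
        return os.path.join(self.shared_dir, key + ".bin")

    def _remember(self, key, blob):
        self.ram[key] = blob
        self.ram.move_to_end(key)
        while len(self.ram) > self.ram_capacity:
            self.ram.popitem(last=False)  # drop least recently used entry

    def put(self, key, blob):
        with open(self._path(key), "wb") as f:
            f.write(blob)                 # persist to the shared tier
        self._remember(key, blob)

    def get(self, key):
        """Return (blob, tier) where tier is 'ram', 'shared' or 'miss'."""
        if key in self.ram:
            self.ram.move_to_end(key)
            return self.ram[key], "ram"
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                blob = f.read()
            self._remember(key, blob)     # promote to the faster tier
            return blob, "shared"
        return None, "miss"
```

A request that misses in RAM but hits on shared storage still avoids recomputing the prefill, which is the case the hybrid design appears intended to cover.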

The architecture combines GKE GPU nodes for model execution, Managed Lustre as the shared storage tier, and a distributed garbage collection service called PVC Evictor. That component monitors file access patterns and removes least-recently-used cache chunks to preserve spare storage capacity.
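Least-recently-used eviction over a file tree can be sketched in a few lines. The function below is a generic illustration and makes no claim about how PVC Evictor is implemented; the function name and parameters are hypothetical, and it relies on file access times, which depend on how the file system is mounted.

```python
# Generic LRU eviction over a directory tree (illustrative only; not the
# PVC Evictor implementation). Relies on st_atime being maintained.
import os

def evict_lru(cache_dir, capacity_bytes, target_free_bytes):
    """Delete least-recently-accessed files until at least
    target_free_bytes of capacity_bytes is unused. Returns removed paths."""
    entries = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            st = os.stat(path)
            entries.append((st.st_atime, st.st_size, path))

    used = sum(size for _atime, size, _path in entries)
    limit = capacity_bytes - target_free_bytes
    removed = []
    for _atime, size, path in sorted(entries):  # oldest access first
        if used <= limit:
            break
        os.remove(path)
        used -= size
        removed.append(path)
    return removed
```

A single full scan like this becomes expensive when millions of files are involved, which helps explain why a production service would need dedicated resources for the job.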

The deployment guide sets out separate validated paths for Qwen/Qwen3.5-35B-A3B and google/gemma-4-31B-it. It also states that the Managed Lustre CSI driver is supported on GKE version 1.33 or later, with newer versions recommended for default port handling.

The technical detail points to a broader issue in generative AI infrastructure: inference workloads are becoming less about raw model execution and more about memory management, storage design, and reuse of intermediate state. As context windows lengthen and prompt reuse becomes more common, caching economics can directly affect service margins.

Operational demands

Google's guide also makes clear that the architecture is not lightweight. Users must create a GKE cluster, add a GPU node pool, provision Lustre storage, deploy the vLLM serving engine, and install the PVC Evictor service. The storage layer is auto-provisioned through a Kubernetes StorageClass and PersistentVolumeClaim, with capacity ranges determined by Lustre performance tiers.

The evictor service itself can require substantial compute resources at scale. As a rule of thumb, Google recommends one evictor replica for each 72 TB of Lustre capacity. High-scale configurations may require a request of 12 CPUs and 8Gi of memory per pod, placed on dedicated machine types such as c4-standard-16.
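The rule of thumb translates into a simple calculation. This sketch assumes the replica count is rounded up and never drops below one; the helper name is hypothetical.

```python
# Replica sizing from the one-replica-per-72-TB rule of thumb.
# Rounding up and the minimum of one replica are assumptions.
import math

TB_PER_REPLICA = 72

def evictor_replicas(capacity_tb):
    """Number of evictor replicas suggested for a given Lustre capacity."""
    return max(1, math.ceil(capacity_tb / TB_PER_REPLICA))

for capacity in (18, 72, 144, 500):
    print(f"{capacity:>4} TB -> {evictor_replicas(capacity)} replica(s)")
```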

For larger deployments, the service supports sharding so multiple replicas can divide the cache namespace and avoid redundant scans or race conditions. That reflects the operational burden of running shared file-based cache layers when millions of files may need to be tracked and deleted quickly.
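One common way to divide a namespace among replicas is to hash each path and assign it to a shard by modulo. The sketch below illustrates that general technique; the article does not describe the partitioning scheme PVC Evictor actually uses, and the function names are hypothetical.

```python
# Hash-based partitioning of file paths across evictor replicas.
# A generic technique, not necessarily what PVC Evictor does.
import hashlib

def shard_for(path, num_shards):
    """Deterministically map a path to a shard index in [0, num_shards)."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def owned_paths(paths, shard_index, num_shards):
    """Paths a given replica is responsible for scanning and deleting."""
    return [p for p in paths if shard_for(p, num_shards) == shard_index]
```

Because every replica computes the same mapping independently, each file has exactly one owner, so no two replicas try to delete the same file and no coordination traffic is needed.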

The deployment examples also show users mounting shared Lustre storage into vLLM containers and configuring offload connectors for both CPU memory and file-system-based cache tiers. Qwen3.5 requires a block size of 528 to avoid fragmentation, while Gemma 4 can use the default of 256.

Although the examples focus on Google infrastructure, the work also highlights how quickly the software stack around open-weight models is evolving. The published benchmark for Llama-3.3-70B relied on an earlier version of vLLM and a specific connector implementation, while the current guide uses a newer vLLM release and updated llm-d components.

That suggests vendors are still refining the trade-offs between GPU memory, CPU memory, and external storage as inference systems move from simple single-node deployments to distributed services built for heavier, longer-running workloads. Google's figures indicate that, for some model sizes and prompt lengths, storage architecture may now matter as much to inference economics as the GPUs themselves.

Image: Sneha Aradhey