Optimizing Scalable LLM Inference: System-Level Strategies for Proactive KV Cache Management
Speaker: Lei Chen – Tsim Sha Tsui, Hong Kong
Topic(s): Information Systems, Search, Information Retrieval, Database Systems, Data Mining, Data Science
Abstract
As large language models (LLMs) increasingly underpin mission-critical applications across industries, optimizing their inference efficiency has emerged as a critical priority. Central to this optimization is the effective management of the Key-Value (KV) cache, a memory-intensive component that stores intermediate computations to accelerate autoregressive token generation.
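To make the memory pressure concrete, the back-of-envelope sketch below estimates the KV cache footprint of a decoder-only transformer. It is illustrative only: the function name and the default parameter values (roughly a 7B-parameter model with 32 layers, 32 KV heads, a head dimension of 128, and fp16 storage) are assumptions for this example, not figures from the talk.

# Back-of-envelope estimate of KV cache size for a decoder-only transformer.
# The default values are illustrative assumptions (roughly a 7B-parameter
# model: 32 layers, 32 KV heads, head dimension 128, fp16), not figures
# from the talk.

def kv_cache_bytes(batch_size: int,
                   seq_len: int,
                   num_layers: int = 32,
                   num_kv_heads: int = 32,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """Bytes held by the K and V tensors cached across all layers.

    Each layer stores one key and one value vector per token per KV head:
    2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * dtype size.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    # A single 4096-token sequence already occupies 2 GiB under these settings.
    gib = kv_cache_bytes(batch_size=1, seq_len=4096) / 2**30
    print(f"KV cache for one 4096-token sequence: {gib:.2f} GiB")

Because this footprint grows linearly with both sequence length and batch size, the cache, rather than the model weights, often becomes the binding constraint on how many requests an inference server can hold in flight.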
In this talk, we examine recent advancements in system-level KV cache management, emphasizing novel approaches to proactive scheduling that dynamically allocate computational and memory resources. We evaluate techniques optimized for diverse operational contexts, ranging from offline batch processing to real-time online serving, and discuss architectural optimizations for single-instance execution as well as coordination strategies for concurrent multi-instance deployments. Finally, we outline promising research directions to address scalability challenges in multi-instance inference. These advancements are crucial for enabling scalable enterprise solutions as LLMs expand into latency-sensitive, high-throughput industrial applications.
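As one illustration of what proactive scheduling can mean in practice, the hypothetical sketch below admits a request only when its projected KV cache footprint (prompt plus maximum generation length) still fits a fixed memory budget, reserving memory before generation starts rather than reacting after the cache is exhausted. The Request and ProactiveScheduler classes, the per-token constant, and the admission policy itself are assumptions invented for this example and do not describe any specific system covered in the talk.

# Hypothetical sketch of proactive, memory-aware admission control: a request
# is admitted only if its projected KV cache footprint (prompt plus maximum
# generation length) fits the remaining budget, so memory is reserved before
# generation starts instead of being reclaimed after the cache overflows.
# All names, constants, and the policy itself are illustrative assumptions.
from collections import deque
from dataclasses import dataclass

# Per-token KV footprint under the same 7B-like assumptions as above (bytes).
BYTES_PER_TOKEN = 2 * 32 * 32 * 128 * 2


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int

    @property
    def projected_kv_bytes(self) -> int:
        # Reserve for the worst case: full prompt plus full generation budget.
        return (self.prompt_tokens + self.max_new_tokens) * BYTES_PER_TOKEN


class ProactiveScheduler:
    def __init__(self, memory_budget_bytes: int):
        self.budget = memory_budget_bytes
        self.in_use = 0
        self.waiting: deque[Request] = deque()
        self.running: dict[str, Request] = {}

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> list[Request]:
        """Admit waiting requests (FIFO) whose projected footprint still fits."""
        admitted = []
        while (self.waiting
               and self.in_use + self.waiting[0].projected_kv_bytes <= self.budget):
            req = self.waiting.popleft()
            self.in_use += req.projected_kv_bytes
            self.running[req.request_id] = req
            admitted.append(req)
        return admitted

    def finish(self, request_id: str) -> None:
        """Release the reservation once a request completes or is cancelled."""
        req = self.running.pop(request_id)
        self.in_use -= req.projected_kv_bytes


if __name__ == "__main__":
    # 16 GiB cache budget: the first two requests fit, the third must wait
    # until req-0 finishes and releases its reservation.
    sched = ProactiveScheduler(memory_budget_bytes=16 * 2**30)
    for i, (prompt, gen) in enumerate([(4096, 4096), (8192, 8192), (8192, 8192)]):
        sched.submit(Request(f"req-{i}", prompt, gen))
    print([r.request_id for r in sched.step()])   # ['req-0', 'req-1']
    sched.finish("req-0")
    print([r.request_id for r in sched.step()])   # ['req-2']

Reserving for the worst-case length trades some utilization for the guarantee that admitted requests never force mid-generation evictions; production schedulers typically refine this trade-off with output-length prediction, preemption, and cache offloading.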
About this Lecture
Number of Slides: n/a
Duration: n/a minutes
Languages Available: English
Last Updated:
Request this Lecture
To request this particular lecture, please complete this online form.
Request a Tour
To request a tour with this speaker, please complete this online form.
All requests will be sent to ACM headquarters for review.