Optimizing Scalable LLM Inference: System-Level Strategies for Proactive KV Cache Management

Speaker:  Lei Chen – Tsim Sha Tsui, Hong Kong
Topic(s):  Information Systems, Search, Information Retrieval, Database Systems, Data Mining, Data Science

Abstract

As large language models (LLMs) increasingly underpin mission-critical applications across industries, optimizing their inference efficiency has emerged as a critical priority. Central to this optimization is the effective management of the Key-Value (KV) cache, a memory-intensive component that stores intermediate computations to accelerate autoregressive token generation.
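
As background, the following minimal Python sketch (an illustration, not taken from the talk) shows what the KV cache buys during autoregressive decoding: each step projects only the newest token and reuses cached keys and values from earlier steps instead of recomputing them. The single-head attention, dimensions, and variable names are hypothetical.

    import numpy as np

    D = 8  # hypothetical head dimension for this toy example

    def attend(query, k_cache, v_cache):
        # Attention of the newest query over every cached key/value pair.
        scores = np.stack(k_cache) @ query / np.sqrt(D)   # shape (t,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ np.stack(v_cache)                # shape (D,)

    def decode_step(x, k_cache, v_cache, Wq, Wk, Wv):
        # Project only the newest token's embedding; keys/values from earlier
        # steps are reused from the cache instead of being recomputed.
        k_cache.append(Wk @ x)
        v_cache.append(Wv @ x)
        return attend(Wq @ x, k_cache, v_cache)

    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
    k_cache, v_cache = [], []
    for _ in range(4):                  # toy four-token generation loop
        x = rng.standard_normal(D)      # stand-in for the next token's embedding
        out = decode_step(x, k_cache, v_cache, Wq, Wk, Wv)
    print(len(k_cache), out.shape)      # cache grows with sequence length: prints "4 (8,)"

The memory cost is the flip side of this speedup: the cache grows with sequence length and batch size, which is what makes KV cache management a system-level concern.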
 
In this talk, we examine recent advancements in system-level KV cache management, emphasizing novel approaches to proactive scheduling that dynamically allocate computational and memory resources. We evaluate techniques optimized for diverse operational contexts, ranging from offline batch processing to real-time online serving, and discuss architectural optimizations for single-instance execution as well as coordination strategies for concurrent multi-instance deployments. Finally, we outline promising research directions to address scalability challenges in multi-instance inference. These advancements are crucial for enabling scalable enterprise solutions as LLMs expand into latency-sensitive, high-throughput industrial applications.
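
As a rough illustration of what proactive, memory-aware scheduling can look like, the Python sketch below admits queued requests only when their projected KV-cache footprint fits within a fixed block budget. The ProactiveScheduler class, the block size, and the request fields are illustrative assumptions, not the systems discussed in the talk.

    from collections import deque
    from dataclasses import dataclass

    BLOCK_TOKENS = 16  # hypothetical number of tokens stored per KV-cache block

    @dataclass
    class Request:
        rid: str
        prompt_len: int
        max_new_tokens: int

        def projected_blocks(self):
            # Reserve for the worst case: the prompt plus the full generation budget.
            total = self.prompt_len + self.max_new_tokens
            return -(-total // BLOCK_TOKENS)  # ceiling division

    class ProactiveScheduler:
        def __init__(self, total_blocks):
            self.free_blocks = total_blocks
            self.waiting = deque()   # queued Request objects, FIFO
            self.running = {}        # rid -> blocks reserved for that request

        def submit(self, req):
            self.waiting.append(req)

        def step(self):
            # Admit queued requests only while their projected footprint fits the budget.
            admitted = []
            while self.waiting and self.waiting[0].projected_blocks() <= self.free_blocks:
                req = self.waiting.popleft()
                self.free_blocks -= req.projected_blocks()
                self.running[req.rid] = req.projected_blocks()
                admitted.append(req)
            return admitted

        def finish(self, rid):
            # Release the finished request's KV blocks so queued work can run.
            self.free_blocks += self.running.pop(rid)

    sched = ProactiveScheduler(total_blocks=48)
    sched.submit(Request("a", prompt_len=200, max_new_tokens=100))  # needs 19 blocks
    sched.submit(Request("b", prompt_len=500, max_new_tokens=200))  # needs 44 blocks
    print([r.rid for r in sched.step()])  # ['a']; "b" waits until memory frees up
    sched.finish("a")
    print([r.rid for r in sched.step()])  # ['b'] once "a" releases its blocks

Real serving systems refine this idea in many ways (preemption, swapping cached blocks to host memory, balancing load across instances), but the underlying principle of scheduling against a KV memory budget rather than reacting to out-of-memory events is the same.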

About this Lecture

Number of Slides:  n/a
Duration:  n/a minutes
Languages Available:  English
Last Updated: 

Request this Lecture

To request this particular lecture, please complete this online form.

Request a Tour

To request a tour with this speaker, please complete this online form.

All requests will be sent to ACM headquarters for review.