Building Foundation Models at Scale: System Experiences and Challenges

Speaker:  Jingren Zhou – Hangzhou, Zhejiang, China
Topic(s):  Artificial Intelligence, Machine Learning, Computer Vision, Natural Language Processing

Abstract

The rapid evolution of AI has led to massive, complex foundation models that demand enormous computational resources, making efficient training and inference systems essential. Training such models requires large-scale distributed computation, effective overlap of computation and communication, sophisticated parallelization strategies, and robust fault-tolerance mechanisms. Inference systems, in turn, must support diverse workloads with varying service-level agreements (SLAs), rapidly integrate engineering optimizations, and carefully balance trade-offs among throughput, latency, cost, and availability, particularly in distributed environments. In this talk, I will discuss the major systems challenges in building large-scale foundation models, focusing on our experiences developing Qwen (large language models) and Wan (video generative models). I will also present ongoing research and system designs that enhance the efficiency of training and inference at scale, enabling more effective management of complex AI workloads in cloud environments.
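
The overlap of computation and communication mentioned above can be made concrete with a small sketch. The following minimal, hypothetical illustration uses Python and torch.distributed: gradient all-reduces are launched asynchronously so that communication for one layer proceeds while the backward pass computes the next layer's gradients. The single-process gloo setup and the toy gradient list are assumptions made so the sketch runs standalone; real training would initialize one rank per GPU across many nodes.

    import os
    import torch
    import torch.distributed as dist

    # Assumption: single-process "gloo" setup so the sketch runs without
    # a cluster; real training spans many ranks and GPUs.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    # Toy per-layer gradients standing in for a model's backward pass.
    grads = [torch.randn(1024) for _ in range(4)]

    handles = []
    for g in reversed(grads):  # backward visits the last layer first
        # Launch the all-reduce asynchronously: communication for this
        # layer's gradient overlaps with computing the next layer's.
        handles.append(dist.all_reduce(g, async_op=True))
        # ... the next layer's backward computation would run here ...

    # Synchronize all outstanding reductions before the optimizer step.
    for h in handles:
        h.wait()

    dist.destroy_process_group()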
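
On the inference side, the throughput/latency trade-off can be illustrated with a back-of-the-envelope cost model. The numbers below are assumptions for illustration, not measurements: each decoding step pays a fixed overhead plus a per-sequence cost, so larger batches amortize the overhead (raising aggregate throughput) while increasing per-step latency for every request in the batch.

    # Hypothetical cost model for one decoding step of a batched server:
    # a fixed launch/communication overhead plus a per-sequence term.
    FIXED_STEP_MS = 20.0  # assumed fixed overhead per step
    PER_SEQ_MS = 2.0      # assumed incremental cost per sequence

    for batch_size in (1, 4, 16, 64):
        step_ms = FIXED_STEP_MS + PER_SEQ_MS * batch_size
        # Each request waits one full step per generated token.
        latency_ms = step_ms
        # The batch as a whole emits batch_size tokens per step.
        throughput_tps = batch_size / (step_ms / 1000.0)
        print(f"batch={batch_size:3d}  step latency={latency_ms:6.1f} ms  "
              f"throughput={throughput_tps:8.1f} tok/s")

Under this assumed model, growing the batch from 1 to 64 raises throughput roughly ninefold while increasing per-step latency about sevenfold, which is exactly the kind of SLA-dependent trade-off the abstract refers to.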

About this Lecture

Number of Slides:  n/a
Duration:  n/a
Languages Available:  English
Last Updated: 
