AI Clusters and Elastic Capacity Management: Designing Systems for Diverse Computational Demands

Ravi Kumar Vankayalapati, Rama Chandra Rao Nampalli

doi:10.63278/jicrcr.v5i2.124

Abstract

We present an architectural framework for AI clusters and highlight the diverse computational demands of four target industry areas. Because clusters are dynamic, we argue that elastic capacity management is needed to accommodate these demands. We discuss key challenges and opportunities in designing elastic systems that manage diverse workloads on AI clusters. Capacity planning requirements guide the choice of system architecture and low-level orchestration mechanisms. We survey and contrast two distinct industry-driven architectural patterns for building AI clusters. Our contributions are to: (1) describe a logical architecture for managing dynamic AI clusters that is flexible and agnostic about the execution daemon; (2) propose two workload models derived from real usage data harvested from state-of-the-art AI clusters; and (3) provide a proof-of-concept implementation of the proposed latency-QoS optimal placement methodology and analyze its performance.

AI dominates computation today and presents unique system design and management challenges. This paper argues that one-size-fits-all AI systems cannot work, precisely because AI computations are diverse. They can be computationally expensive, requiring a specialized GPU, or they can rely on software optimizations to perform deep learning on a traditional CPU with reasonable latency. They may be high-performance, real-time, or high-throughput batch workloads. They may exhibit multi-modal steady-state behavior or varying degrees of start-up and steady-state variance. Two design challenges follow: (1) which AI cluster design, blending CPU and GPU platforms, best satisfies the diverse AI computation needs of the four scenarios above; and (2) how best to manage capacity in an AI cluster. The answers we propose also rely critically on data analytics and AI training to understand usage patterns and customer requirements.

The system architectures we advocate are elastic; they can increase or decrease capacity in a serverless fashion. A diverse elastic system pairs well with elastic capacity management for managing clouds with AI workloads.
