Our research focuses on warehouse-scale datacenters: the massive computers that power public and private clouds and host the numerous services we access through cellphones and other client devices. The cluster management layer determines how workloads acquire resources from shared pools of tens of thousands of servers. Developers expect their workloads to receive ample resources to achieve high quality-of-service (QoS): low latency and high throughput for user-facing services, or fast execution times for analytics. Operators expect high resource utilization, so that a fixed set of resources can host as many workloads as possible. Unfortunately, it is difficult to meet both expectations. Most datacenters currently operate at low utilization because workloads run with excessive resource allocations in order to meet QoS constraints.
We advocated the use of machine-learning techniques to improve cluster management [ASPLOS’13, TOCS’13, ICAC’13, IISWC’13, ASPLOS’14, SOCC’15, ASPLOS’16, ASPLOS’17]. Specifically, we used classification techniques to gain a deep understanding of workload characteristics: how performance varies with the type and speed of resources, scale-up and scale-out behavior, sensitivity to interference on various resources, the interference patterns workloads themselves generate, and sensitivity to various configuration parameters. We showed that it is practical to derive this detailed knowledge by combining a small amount of profiling information from each new workload with the large amount of data from previously executed workloads. Using this knowledge, we built novel cluster management features that improve utilization while protecting workload QoS: automatic rightsizing of workloads for both on-premises and cloud deployments; optimized mapping of workloads to heterogeneous resources; efficient bin-packing of workloads on shared resources that minimizes the impact of interference; QoS-aware admission control for datacenter workloads; and fast workload scheduling with statistical guarantees on QoS. These features allow developers to focus on what their workload needs (QoS goals) rather than on how these goals are achieved (detailed resource allocation and scheduling).
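To make the classification idea concrete, here is a minimal, self-contained sketch (not our production system) of how sparse profiling data from a new workload can be combined with a matrix of measurements from previously executed workloads. All data here is synthetic, and the rank-3 structure, configuration count, and profiling positions are illustrative assumptions: the new workload is profiled on only a few server configurations, and a low-rank model fit to the history matrix predicts its performance everywhere else.

```python
import numpy as np

# Hypothetical history: rows are previously executed workloads, columns are
# server configurations, entries are normalized performance scores. We
# synthesize a low-rank matrix so the structure is recoverable by SVD.
rng = np.random.default_rng(0)
history = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 8))

# A "new" workload whose true behavior lies in the same low-dimensional
# space, observed with small measurement noise.
true_perf = history[0] + rng.normal(scale=0.01, size=8)

# Profile it on just five of the eight configurations.
profiled = np.full(8, np.nan)
observed = [0, 2, 4, 5, 7]
profiled[observed] = true_perf[observed]

# Fit: project the sparse profile onto the top right singular vectors of
# the history matrix, then reconstruct the full performance vector.
_, _, vt = np.linalg.svd(history, full_matrices=False)
basis = vt[:3]                                   # top-3 right singular vectors
mask = ~np.isnan(profiled)
coeffs, *_ = np.linalg.lstsq(basis[:, mask].T, profiled[mask], rcond=None)
prediction = coeffs @ basis                      # estimate for all 8 configs
```

The scheduler can then act on `prediction` (e.g., pick the best server type or co-location) without ever running the workload on most configurations, which is what keeps per-workload profiling cheap.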
We also advocated the use of feedback-based control and online optimization in cluster management [HPCA’14, ISCA’14, ISCA’15, TOC’16]. Specifically, we focused on complex services that scale across thousands of servers, such as web search. We used end-to-end performance metrics to dynamically manage low-level hardware and system parameters such as core and memory usage, cache and I/O bandwidth allocation, and clock frequency and voltage settings. We demonstrated that, contrary to popular belief, online services can operate in an energy-proportional manner. We also showed how to run machine learning workloads on temporarily idle resources in a cluster without affecting the throughput and tail latency of online services running on the same cluster. We demonstrated both capabilities on production clusters for Google search, showcasing how to save millions of dollars in energy costs or in capital expenses for additional servers. The key ideas of our research on cluster management are currently integrated into open source and proprietary cluster managers.
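The feedback-control idea can be sketched in a few lines. This is a toy simulation under assumed values (the latency target, gain, frequency range, and the `measured_latency` service model are all made up for illustration, not taken from the production controller): a proportional controller reads end-to-end tail latency and nudges CPU frequency just enough to track the latency target, so the service runs no faster, and no hotter, than its SLO requires.

```python
# Assumed constants for the sketch.
TARGET_MS = 10.0          # end-to-end tail-latency target (SLO)
GAIN = 0.05               # proportional gain
F_MIN, F_MAX = 1.2, 3.0   # allowed CPU frequency range, GHz

def measured_latency(freq_ghz, load):
    """Toy service model: latency grows as frequency drops or load rises."""
    return 4.0 * load / freq_ghz + 3.0

freq = F_MAX  # start at full speed, then let feedback pull frequency down
for load in [2.0, 2.0, 3.5, 3.5, 3.5, 3.5, 1.5, 1.5, 1.5]:
    lat = measured_latency(freq, load)
    error = lat - TARGET_MS              # positive: too slow, speed up
    freq = min(F_MAX, max(F_MIN, freq + GAIN * error))
```

Because the controller only observes the end-to-end metric, it automatically reclaims the latency slack at low load (dropping frequency and saving energy) and restores frequency when load spikes, without any model of the workload's internals.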
Follow the links above for the related papers, or visit our publications page for this topic.
Press coverage of our work: The Register, NYTimes, Stanford Report, Stanford School of Engineering News, ACM TechNews, EETimes, CloudPro, The Stanford Daily, Scientific Computing, GigaOM, IBM Midsize Insider, The Whir, CIO Review, Green Datacenter News