Antoine Saint Exupery
I am interested in computer architecture, systems, and applied data mining.
My Ph.D. work has focused on improving the resource efficiency of large-scale datacenters.
Since traditional scaling techniques, such as scaling out with commodity machines or relying on Dennard scaling, are reaching the point of diminishing returns,
we must focus on using existing systems more efficiently.
Specifically, during my Ph.D. I have designed and built practical, scalable scheduling systems that improve system utilization
without sacrificing application performance.
My approach relies on three main insights. First, systems that manage resources must account for the interactions between hardware architectures and system software. Second, a user should only have to provide a high-level, declarative description of application requirements, not a prescription of how to achieve them with low-level resources. Third, the system must quickly learn the resource preferences of a new, unknown application; obtaining this information through exhaustive application profiling is prohibitively expensive. To make the system practical, I have leveraged efficient data mining techniques that exploit existing system knowledge to quickly make high-quality scheduling decisions.
Below is a list of projects I am currently working on or have worked on in the past.
Paragon: Paragon is a QoS-aware datacenter scheduler that accounts for interference between co-scheduled workloads and for platform heterogeneity. The scheduler leverages fast classification techniques to determine the interference and heterogeneity preferences of incoming applications, introducing only minimal scheduling overhead. On a 1,000-server EC2 cluster, Paragon improves system utilization by 47% compared to a traditional least-loaded scheduler and achieves 96% of optimal performance, while remaining scalable and lightweight.
[ASPLOS'13 paper] [TopPicks'14 paper] [TOCS'13 paper]
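The placement step behind a scheduler like this can be sketched as a greedy pass over servers. The sketch below is illustrative only (the field names, score ranges, and the one-directional tolerance check are my assumptions, not Paragon's actual algorithm): each application carries, per shared resource, the interference it tolerates and the interference it causes, and is placed on the least-loaded server whose current pressure it can tolerate.

```python
# Hypothetical sketch of interference-aware placement: per shared
# resource, each app records the interference it tolerates and the
# interference it causes (illustrative scores, e.g. in [0, 100], as a
# classification step might produce). For brevity this checks only that
# the new app tolerates the server; a fuller version would also check
# that existing co-runners tolerate the new app.

def fits(server_pressure, app, resources):
    """An app fits if, on every resource, the pressure already on the
    server stays within what the app tolerates."""
    return all(server_pressure[r] <= app["tolerated"][r] for r in resources)

def place(apps, n_servers, resources):
    pressure = [{r: 0 for r in resources} for _ in range(n_servers)]
    placement = {}
    for name, app in apps.items():
        # Candidate servers that respect the app's tolerance, preferring
        # the one with the least total pressure.
        candidates = [s for s in range(n_servers) if fits(pressure[s], app, resources)]
        if not candidates:
            continue  # a real scheduler would queue the job instead
        best = min(candidates, key=lambda s: sum(pressure[s].values()))
        for r in resources:
            pressure[best][r] += app["caused"][r]  # app adds its own pressure
        placement[name] = best
    return placement
```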
Quasar: Traditionally, datacenters have been plagued by low utilization, primarily because users overprovision resource reservations to side-step performance unpredictability. Quasar is a cluster manager that introduces a different interface between the system and its users. Instead of specifying raw resources, the user specifies only the performance target a job must meet. Quasar then leverages efficient data mining techniques to determine the resource preferences of a new job, much like a movie recommendation system finds similarities between previous and new users to recommend movies they are likely to enjoy. Quasar achieves both high cluster utilization and high per-application performance.
[ASPLOS'14 paper] [demo] [press]
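The movie-recommendation analogy can be made concrete with a small matrix-completion sketch. This is not Quasar's implementation, just a minimal illustration of the underlying collaborative-filtering idea: rows are applications, columns are resource configurations, entries are measured performance, and a new application profiled on only a few configurations has its missing entries estimated by low-rank (truncated SVD) reconstruction.

```python
import numpy as np

# Minimal matrix-completion sketch (iterated truncated SVD, often called
# "hard impute"). `ratings` holds performance measurements; `mask` is 1
# where an entry was actually measured and 0 where it must be estimated.

def complete(ratings, mask, rank=2, iters=200):
    """Fill unobserved entries (mask == 0) by iterated truncated SVD."""
    X = np.where(mask, ratings, ratings[mask.astype(bool)].mean())
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # rank-k approximation
        X = np.where(mask, ratings, low)            # keep observed entries
    return X
```

Given the completed row for a new application, a manager in this style would pick the smallest allocation whose estimated performance meets the user's target.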
Tarcil: Tarcil is a scheduler that bridges the gap between sophisticated but slow centralized schedulers and fast but lower-quality distributed
schedulers. Tarcil uses sampling to lower scheduling overheads, and it accounts for the resource preferences of new jobs to keep scheduling quality high.
Compared to both centralized and distributed schedulers, it improves performance for short and long jobs alike.
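The core sampling idea is in the spirit of the classic "power of d choices": instead of scanning every server (slow) or picking one at random (low quality), sample a few candidates and take the best. The toy sketch below shows just that kernel; Tarcil's actual contribution of sizing the sample to give statistical guarantees on slot quality is omitted here.

```python
import random

# Toy sketch of sampling-based scheduling: sample d servers uniformly at
# random and pick the least-loaded one. Even d=2 dramatically improves
# load balance over a single random choice, at far lower cost than a
# full scan.

def sample_best(loads, d=2, rng=random):
    """Pick the least-loaded server among d uniformly sampled candidates."""
    candidates = rng.sample(range(len(loads)), d)
    return min(candidates, key=lambda s: loads[s])
```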
Cloud Provisioning: Paragon and Quasar assume that the cluster manager has full control over the entire system. Real deployments can be more
complicated, especially when the resources are hosted by a public cloud provider. In this work, I designed a system that determines the most cost-efficient
instance type (reserved vs. on-demand) and size a job needs to satisfy its QoS constraints. I evaluated this system on a cluster of a few hundred servers on
Google Compute Engine.
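Stripped of the prediction machinery, the final decision reduces to a constrained minimization. The sketch below is a hypothetical illustration (the catalog format and field names are mine, and effective_price is assumed to already amortize any reserved-instance upfront commitment into an hourly rate): pick the cheapest offering whose predicted performance still meets the QoS target.

```python
# Hypothetical sketch of the provisioning decision: given an illustrative
# catalog of instance offerings, each with a predicted performance for
# the job and an effective hourly price, choose the cheapest offering
# that satisfies the QoS constraint.

def cheapest_meeting_qos(offerings, qos_target):
    """Return the name of the cheapest offering whose predicted
    performance meets the QoS target, or None if none qualifies."""
    feasible = [o for o in offerings if o["perf"] >= qos_target]
    if not feasible:
        return None
    return min(feasible, key=lambda o: o["effective_price"])["name"]
```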
iBench: Paragon and Quasar need to know the sensitivity of an incoming application to various types of interference. iBench is a benchmark suite
consisting of microbenchmarks, each of which puts pressure on a specific shared resource. iBench enables fast, practical characterization of both the
interference an application can tolerate in each resource and the interference it generates itself.
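To give a flavor of what one such microbenchmark looks like, here is an illustrative sketch that is not taken from the iBench release: a load generator that pressures one shared resource (memory traffic, via repeated sweeps over a buffer) with a tunable intensity knob implemented as a duty cycle. Sweeping the knob while a target application runs alongside reveals how much pressure on that resource the application tolerates.

```python
import time

# Illustrative single-resource pressure microbenchmark with a tunable
# intensity knob (0.0 = idle, 1.0 = full pressure), implemented as a
# duty cycle over 100 ms periods.

def pressure_memory(seconds, intensity, buf_mb=64):
    """Sweep a large buffer for `seconds`, staying active for an
    `intensity` fraction of each period; return the number of sweeps."""
    buf = bytearray(buf_mb * 1024 * 1024)
    period, sweeps = 0.1, 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        start = time.monotonic()
        while time.monotonic() - start < period * intensity:
            _ = buf[::4096]  # touch one byte per 4 KiB page
            sweeps += 1
        time.sleep(period * (1.0 - intensity))  # idle part of the period
    return sweeps
```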
ARQ: Admission control is needed during periods of high load to prevent cluster overloading. ARQ is a multi-class admission control protocol that
ensures fast application dispatching and low head-of-line blocking.
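The head-of-line-blocking aspect can be illustrated with a minimal sketch (the class names and interface below are hypothetical, not ARQ's actual design): jobs are queued per class rather than in one shared FIFO, so a class whose jobs cannot currently be admitted does not stall admission for the others.

```python
from collections import deque

# Minimal sketch of multi-class admission queues: one queue per class,
# so denying the head of one class does not block the other classes,
# which is the head-of-line blocking a single shared queue would cause.

class MultiClassAdmission:
    def __init__(self, classes):
        self.queues = {c: deque() for c in classes}

    def submit(self, job_class, job):
        self.queues[job_class].append(job)

    def dispatch(self, can_admit):
        """Admit the head job of each class that `can_admit` accepts;
        denied classes stay queued without stalling the rest."""
        admitted = []
        for c, q in self.queues.items():
            if q and can_admit(c, q[0]):
                admitted.append(q.popleft())
        return admitted
```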
Datacenter Application Modeling: Previously, I worked on characterizing and modeling the behavior of large-scale datacenter applications.
I designed and implemented ECHO, a concise analytical model that captures and recreates the network traffic of
distributed datacenter applications. I also developed a modeling framework for storage workloads, which generates
synthetic load patterns similar to the original applications. Both modeling frameworks were validated against real datacenter applications from Microsoft,
and were used in a series of efficiency and cost optimization studies.
[IISWC'12 paper] [IISWC'11 paper] [CAL'12 paper] [TPCTC'11 paper]
I also enjoy teaching and mentoring students.
- In Fall 2014 I was a co-instructor for CS316 (Advanced Multicore Systems). I designed and taught several lectures and mentored course-long research projects.
- In Spring 2014 I was a co-instructor for EE282 (Computer Architecture). I designed and taught several lectures, and designed homework, programming assignments, and exam problems.
- In Spring 2013 I was a TA for EE282 (Computer Architecture), teaching some of the lectures and a weekly recitation.
I am mentoring (or have mentored) the following students.
- Sammy Steele (June 2014-present). Sammy is working with us on porting resource-efficiency techniques to Mesos. She previously worked on an open-source datacenter benchmark suite.
- Felipe Munera Savino (May 2014-present). Felipe is working with us on improving tail performance for latency-critical datacenter applications.
- Maneeshika Madduri, Yifan Ge and Negar Rahmati (September 2014-present). I am mentoring Maneeshika, Yifan and Negar during their class project for CS316. They are working on load prediction techniques for OLTP services.
- Ricardo Rojas, Jithin Thomas and Bhruguraj Chudasama (September-December 2013). I mentored Ricardo, Jithin and Bhruguraj during their class project for CS316. They worked on scheduling techniques for heterogeneous CMPs.
- Charley Ho, Zane Silver and Ed Wang (September-December 2013). I mentored Charley, Zane and Ed during their class project for CS316. They worked on modeling techniques for performance and power estimations across server configurations.
- Manu Bansal (September-December 2010). Manu worked with us on modeling correlations in the resource usage of datacenter workloads.