Current Projects

In my research projects we don't just build prototype systems and write papers; we try to build production-quality software systems that can be used by other people for real work. In most cases we release the software in open-source form. This approach is unusual in academia, but it allows us to do better research: designing for production use forces us to think about important issues that could be ignored otherwise, and measurements of usage allow us to evaluate our ideas more thoroughly. Furthermore, this approach allows students to develop a higher level of system-building maturity than would be possible building toy research prototypes.

Granular Computing (2016– )

The last decade has seen dramatic reductions in latency for several key technologies in computing. For example, modern datacenter networks now support round-trip times less than 5 μs (versus hundreds of microseconds a decade ago); flash storage devices offer access times of 10–50 μs (vs. 5–10 ms for disks); and new nonvolatile memories such as 3D XPoint could offer access times in the 1 μs range. However, today's software stacks have high overheads that make it impossible to reap the full benefits of these technologies.

We are developing new software stack layers that can fully exploit new low-latency technologies. Our overall goal is to enable the efficient execution of very large numbers of very small tasks in a datacenter environment. We call this granular computing, and we envision tasks ranging in length from a few milliseconds down to a few microseconds. Here are a few of our current projects (as of early 2017):

  • Core-aware thread management. Our experience with RAMCloud showed that it is difficult to use cores efficiently in the presence of short-lived tasks. One of the reasons is that today's kernels do not provide applications with enough information about, and control over, core allocation. We are developing a new thread management system called Arachne, in which core management and thread scheduling are removed entirely from the kernel. A user-level core arbiter allocates cores to applications for relatively long time intervals (tens of milliseconds); each application then manages its own cores, creating lightweight user-level threads and scheduling them on its available cores. With this approach, each application knows exactly how many cores it owns at any given point in time; this allows it to tailor its internal concurrency level and thread placement for its current core allocation.
  • Homa transport protocol. Existing transport protocols for datacenter networks, such as TCP, cannot support low-latency communication, particularly under high network loads. We are developing a new transport protocol called Homa, which provides low latency for small messages while also supporting large messages and high network utilization. In Homa, incoming packets are scheduled by receivers in order to prioritize shorter messages and manage congestion; senders transmit small amounts of unscheduled data to avoid an extra roundtrip of latency for scheduling. Homa uses in-network priority queues for efficient preemption and, more importantly, to implement a form of bipartite matching between senders and receivers, which ensures high network utilization.
  • High-performance notifications. Notifications provide an alternate style of communication to remote procedure calls. With notifications (also called publish-subscribe) communication is controlled by receivers rather than send-ers: receivers express interest in particular kinds of events, and a single event can be delivered to multiple receivers. Notifications are particularly useful in event-driven applications such as large-scale control, but almost any large-scale system includes some sort of notification. Granular computing will require an extremely fast and lightweight notification mechanism. Existing notification systems such as Kafka are too heavyweight for this environment. We have begun exploring alternative designs for notification mechanisms that can meet the needs of granular computing.
  • Nanosecond-scale logging. In order for granular computing to be practical, it must be possible to instrument these systems. Unfortunately, today's logging systems are too slow to use for granular computing (logging a simple message typically takes about 1 μs today). We are developing an ultra-high-performance logging system called Nanolog, which can record events at a granularity of tens of nanoseconds. Nanolog uses a combination of compile-time optimizations and run-time pipelining and compression.

RAMCloud (2009– )

The goal of the RAMCloud project has been to create a large-scale datacenter storage system with the lowest possible access latency. RAMCloud does this by keeping all data in DRAM at all times. The system combines three overall attributes: low latency, large scale, and durability. In our 80-node development cluster with QDR Infiniband, a client can read any 100-byte object in less than 5 μs, and durable writes take about 13.5 μs. In a large datacenter with 100,000 nodes, we expect small reads to complete in less than 10 μs, which is 50–1000x faster than existing storage systems.

RAMCloud's second attribute is large scale. In order to support future Web applications, we designed RAMCloud to allow clusters to grow to at least 10,000 servers. RAMCloud aggregates all of their memories into a single coherent key-value store. This allows storage capacities of 1 PB or more.

The third attribute of RAMCloud is durability. Although RAMCloud keeps all data in DRAM, it also maintains backup copies of data on secondary storage to ensure a high level of durability and availability. This frees application developers from the need to manage a separate durable storage system, or to maintain consistency between in-memory and durable storage.

For an overview of RAMCloud, see [TOCS paper]. Here are a few of the interesting projects we have undertaken as part of building RAMCloud:

  • Fast crash recovery. To minimize DRAM cost, RAMCloud keeps only a single copy of data in DRAM, with multiple copies on secondary storage. In order to avoid long gaps in availability, it harnesses hundreds of servers working concurrently to recover lost data within a few seconds after a crash. [SOSP2011 paper] [Ryan Stutsman PhD thesis]
  • Log-structured memory. RAMCloud manages in-memory storage using an approach similar to that of log-structured file systems. This allows RAMCloud to use DRAM twice as efficiently as traditional storage allocators such as malloc. [FAST 2014 paper] [Steven Rumble PhD thesis]
  • Rules-based programming. RAMCloud contains several modules that must manage distributed resources in a concurrent and fault-tolerant fashion (DCFT modules). After struggling to write these modules, we discovered that a rules-based approach works well for them. In the rules-based approach, the algorithm is decomposed in a collection of rules, each of which makes incremental progress towards a goal. [USENIX ATC 2015 paper]
  • Implementing linearizability. One of RAMCloud's goals is to provide a high level of consistency. We developed a general-purpose infrastructure for implementing linearizability, which provides exactly-once semantics for RPCs even in the face of crashes and reconfiguration. We used this to implement a high performance transaction mechanism in RAMCloud. [SOSP 2015 paper]
  • Scalable indexes. The original RAMCloud data model consisted of a key-value store, but we extended it to provide scalable and high-performance secondary indexes. [USENIX 2016 paper] [Ankita Kejriwal PhD thesis]

In January 2014 we officially tagged RAMCloud version 1.0, which meant that the system had reached a point of maturity where it could support real applications. Newer features, such as transactions and secondary indexes, have been added since then. As of early 2017, we are undertaking fewer new research projects on RAMCloud; instead, we are using RAMCloud more as a driving application for projects in granular computing.

Previous Projects

The following sections describe some projects on which I have worked in the past. I am no longer working on these projects, though several of them are still active as open-source projects or commercial products. The years listed for each project represent the period when I was involved.

Raft (2012–2014)

As part of implementing RAMCloud, we needed a consensus algorithm in order to maintain replicated cluster configuration data. We initially considered using Paxos, but found it incredibly difficult to understand. Furthermore, the Paxos architecture requires complex changes to support practical systems. As a result, we decided to see if we could design a new consensus algorithm with better properties Paxos. The most important goal was for the algorithm to be easy to understand and reason about; in addition, we wanted a formulation that was practical for real implementations. The result is Raft. [USENIX ATC 2014 paper] [Diego Ongaro PhD thesis]

Fiz (2008–2011)

The Fiz project explored new frameworks for highly interactive Web applications. The goal of the project was to develop higher-level reusable components in order to encourage reusability and simplify application development by hiding inside the components many of the complexities that bedevil Web developers (such as security issues or using Ajax to enhance interactivity).

Related links:

  • Integrating Long Polling with an MVC Web Framework. This paper appeared in the 2011 USENIX Conference on Web Application Development; it shows how the architecture of an MVC Web development framework can be extended to make it easy for Web applications to push updates automatically from servers out of browsers.
  • Managing State for Ajax-Driven Web Components. This paper appears in the 2010 USENIX Conference on Web Application Development; it discusses the state-management issues that arise when trying to use a component-based approach for Web applications using Ajax, and evaluates two alternative solutions to the problem.
  • Fiz: A Component Framework for Web Applications: a technical report that provides an overview of Fiz.

ElectricCommander (2005–2007, Electric Cloud)

ElectricCommander is the second major product for Electric Cloud. It addresses the problem of managing software development processes such as nightly builds and automated test suites. Most organizations have home-grown software for these tasks, which is hard to manage and scales poorly as the organization grows. ElectricCommander provides a general-purpose Web-based platform for managing these processes. Developers use a Web interface to describe each job as a collection of shell commands that run serially or in parallel on one or more machines. The ElectricCommander server manages the execution of these commands according to schedules defined by the developers. It also provides a variety of reporting tools and manages a collection of server machines to allow concurrent execution of multiple jobs.

ElectricAccelerator (2002–2005, Electric Cloud)

ElectricAccelerator is Electric Cloud's first product; it accelerates software builds based on the make program by running steps concurrently on a cluster of server machines. Previous attempts at concurrent builds have had limited success because most Makefiles do not have adequate dependency information. As a result, the build tool cannot safely identify steps that can execute concurrently and attempts to use large-scale parallelism produce flaky or broken builds. ElectricAccelerator solves this problem using kernel-level file system drivers to record file accesses during the build. From this information it can construct a perfect view of dependencies, even if the Makefiles are incorrect. As a result, ElectricAccelerator can safely utilize dozens of machines in a single build, producing speedups of 10–20x.

Tcl/Tk (1988–2000, U.C. Berkeley, Sun, Scriptics)

Tcl is an embeddable scripting language: it is a simple interpreted language implemented as a library package that can be incorporated into a variety of applications. Furthermore, Tcl is extensible, so additional functions (such as those provided by the enclosing application) can easily be added to those already provided by the Tcl interpreter. Tk is a GUI toolkit that is implemented as a Tcl extension. I initially built Tcl and Tk as hobby projects and didn't think that anyone besides me would care about them. However, they were widely adopted because they solved two problems. First, the combination of Tcl and Tk made it much easier to create graphical user interfaces than previous frameworks such as Motif (and, Tcl and Tk were eventually ported from Unix to the Macintosh and Windows, making Tcl/Tk applications highly portable). Second, Tcl made it easy to include powerful command languages in a variety of applications ranging from design tools to embedded systems.

Related links:

Log-Structured File Systems (1988–1994, U.C. Berkeley)

A log-structured file system (LFS) is one where all information is written to disk sequentially in a log-like structure, thereby speeding up both file writing and crash recovery. The log is the only structure on disk; it contains index information so files can be read back from the log efficiently. LFS was initially motivated by the RAID project, because random writes are expensive in RAID disk arrays. However, the LFS approach has also found use in other settings, such as flash memory where wear-leveling is an important problem. Mendel Rosenblum created the first LFS implementation as part of the Sprite project; Ken Shirriff added RAID support to LFS in the Sawmill project, and John Hartman extended the LFS ideas into the world of cluster file systems with Zebra.

Related links:

Sprite (1984–1994, U.C. Berkeley)

Sprite was a Unix-like network operating system built at U.C. Berkeley and used for day-to-day computing by about 100 students and staff for more than five years. The overall goal of the project was to create a single system image, meaning that a collection of workstations would appear to users the same as a single time-shared system, except with much better performance. Among its more notable features were support for diskless workstations, large client-level file caches with guaranteed consistency, and a transparent process migration mechanism. We built a parallel version of make called pmake that took advantage of process migration to run builds across multiple machines, providing 3–5x speedups (pmake was the inspiration for the ElectricAccelerator product at Electric Cloud).

Related links:

VLSI Design Tools (1980–1986, U.C. Berkeley)

When I arrived at Berkeley in 1980 the Mead-Conway VLSI revolution was in full bloom, enabling university researchers to create large-scale integrated circuits such as the Berkeley RISC chips. However, there were few tools for chip designers to use. In this project my students and I created a series of design tools, starting with a simple layout editor called Caesar. We quickly replaced Caesar with a more powerful layout editor called Magic. Magic made two contributions: first, it implemented a number of novel interactive features for developers, such as incremental design-rule checking, which notified developers of design rule violations immediately while they edited, rather than waiting for a slow offline batch run. Second, Magic incorporated algorithms for operating on hierarchical layout structures, which were much more efficient than previous approaches that "flattened" the layout into a non-hierarchical form. Magic took advantage of a new data structure called corner stitching, which permitted efficient implementation of a variety of geometric operations. We also created a switch-level timing analysis program called Crystal. We made free source-level releases of these tools and many others in what became known as the Berkeley VLSI Tools Distributions; these represented one of the earliest open-source software releases. Magic continued to evolve after the end of the Berkeley project; as of 2005 it was still widely used.

Cm* and Medusa (1975–1980, Carnegie Mellon University)

As a graduate student at Carnegie Mellon University I worked on the Cm* project, which created the first large-scale NUMA multiprocessor. Cm* consisted of 50 LSI-ll processors connected by a network of special-purpose processors called Kmaps that provided shared virtual memory. I worked initially on the design and implementation of the Kmap hardware, then shifted to the software side and led the Medusa project, which was one of two operating systems created for Cm*. The goal of the Medusa project was to understand how to structure an operating system to run on the NUMA hardware; among other things, Medusa was the first system to implement coscheduling (now called "gang scheduling").