Most emerging applications are data-intensive. Hence, my computer architecture research focuses primarily on energy-efficient memory systems that scale to thousands of cores. The majority of memory systems research over the past four decades has been empirical. While quite successful, this approach has led to systems whose scaling properties and QoS pathologies are difficult to reason about. We proposed the use of robust analytical techniques to build practical and scalable memory systems with well-understood behavior and control [ASPLOS’10, MICRO’10, ISCA’11, PACT’11, HPCA’12, ISCA’13]. We used this approach to design highly associative caches whose effective associativity is independent of the number of physical ways, a cache partitioning scheme that supports hundreds of arbitrarily sized partitions, a scalable coherence directory that eliminates directory-induced invalidations with minimal sizing, and a fast microarchitectural simulator for thousand-core systems.
We also revisited near-data processing (NDP), the only practical way to improve the energy efficiency of data-intensive workloads with limited temporal locality [ISCA’12, PACT’15, HPCA’16, ISCA’16, ASPLOS’17]. The availability of 3D integration technology provides a viable path for placing computation units in the main memory system. Our work focused on the hardware and runtime system support needed to run machine learning, graph processing, and MapReduce applications on NDP systems. We also developed a new reconfigurable fabric for NDP systems that combines key features of coarse-grain and fine-grain FPGAs and can saturate the high bandwidth available in 3D memory stacks. Moreover, we developed mechanisms that use dense DRAM arrays to implement bit-level reconfigurable fabrics with a 10x density advantage and a 3x power advantage over conventional SRAM-based FPGAs.
Finally, we investigated specialized accelerators for emerging workloads. While accelerators can improve performance and energy efficiency by orders of magnitude over programmable cores, they suffer from a lack of flexibility and long development cycles. Hence, our research focused on domain-specific accelerators that support multiple workloads within a domain and on tools for automatic generation of accelerators from high-level code. We developed a domain-specific accelerator for the convolutional patterns prevalent in computational photography, computer vision, and video processing [ISCA’10, CACM’11, ISCA’13, CACM’15]. The accelerator minimizes instruction overheads and maximizes data reuse in stencil patterns, enabling a large number of arithmetic operations per memory access. We are currently working on a tool chain that automatically generates specialized accelerators from the parallel patterns, such as map, reduce, filter, and groupBy, that underlie many domain-specific languages [ASPLOS’16, ISCA’16]. We are addressing fundamental challenges such as the need for fast exploration of vast design spaces and the need to co-specialize compute and memory systems.