Hardware Accelerators for Training Deep Neural Networks
Ardavan Pedram & Kunle Olukotun
Sunday Afternoon, June 23rd, 2019
Room 102A, Phoenix Convention Center, Phoenix, Arizona, USA
Funding for this tutorial was provided by the National Science Foundation, Division of Computing and Communication Foundations, under award number 1563113.
In this tutorial we cover the challenges of training deep neural networks and the architectural techniques used to design accelerators for training systems.
Training neural networks is far more demanding in computation and memory than inference. Consequently, state-of-the-art systems exploit several techniques to overcome these challenges. We will discuss the limitations of current solutions and the approaches to overcome them.
We will introduce the problem of training by examining the ubiquitous synchronous stochastic gradient descent and its asynchronous variants. We will cover tricks like lowering the precision and their impact on network convergence and performance. We will also cover the challenges of exploiting sparsity, especially during training. We will look at the problem of scaling DNN training on distributed multiprocessors and its attendant problems of increasing batch size and balancing computation and communication. We will then broadly survey the architectures that support training, including special-purpose accelerators.
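As a concrete starting point, the sketch below shows one synchronous minibatch SGD loop on a toy linear model in NumPy; the model, data, and hyperparameters are illustrative assumptions, not material from the tutorial.

```python
import numpy as np

# A minimal sketch of synchronous minibatch SGD on a toy linear model.
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 8))                 # synthetic inputs
true_w = rng.normal(size=8)
y = X @ true_w + 0.01 * rng.normal(size=1024)  # noisy targets

w = np.zeros(8)                                # model parameters
lr, batch_size = 0.1, 64                       # illustrative hyperparameters

for step in range(200):
    idx = rng.integers(0, len(X), size=batch_size)  # sample a minibatch
    xb, yb = X[idx], y[idx]
    err = xb @ w - yb                  # forward pass: prediction error
    grad = xb.T @ err / batch_size     # backward pass: mean-squared-error gradient
    w -= lr * grad                     # synchronous step: the full minibatch
                                       # gradient is applied before the next batch
```

Asynchronous variants relax the last step, letting workers apply possibly stale gradients without waiting for one another; the batch size sets the tradeoff between gradient noise and hardware utilization.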
This tutorial assumes basic knowledge of the computation kernels of deep neural networks and of how inference is accelerated.
- We recommend the Saturday tutorial by our friends at MIT on Hardware Accelerators for Deep Learning.
- Background and reading materials are available on our Stanford CS217 course website.
- The Hot Chips 30 tutorial video, Architectures for Accelerating Deep Neural Nets, is also a quick way to get up to speed.
- Fundamentals of training
  - Stochastic gradient descent
  - Training from a computation point of view
  - Data dependency
  - Batch size
- Low/mixed precision training
  - Floating-point representations for training
  - Loss scaling (see the sketch after this outline)
- Sparse training
  - Activation sparsity
  - Weight sparsity
  - Structured sparsity
- Batch normalization
- Group normalization
- Layer normalization
- Online normalization
- Scaling training
  - DNN architecture scaling
  - System scaling / distributed training
- Accelerators for training
- A note on ML benchmarks
- Conclusion and Q&A
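To make the loss-scaling item above concrete, here is a minimal NumPy sketch of the idea; the scale factor and helper names are illustrative assumptions, not code from the tutorial. Gradients are multiplied by a constant before the cast to fp16 so that small values stay representable, then unscaled against fp32 master weights before the update.

```python
import numpy as np

SCALE = 1024.0   # power-of-two loss-scale factor; an illustrative choice

def scale_to_fp16(grads_fp32):
    # Scale before casting so gradients too small for fp16 do not flush to zero.
    return [(g * SCALE).astype(np.float16) for g in grads_fp32]

def sgd_step(master_w_fp32, grads_fp16, lr=0.01):
    # Unscale against fp32 "master" weights, then apply a plain SGD update.
    return [w - lr * (g.astype(np.float32) / SCALE)
            for w, g in zip(master_w_fp32, grads_fp16)]

# Why scaling matters: a 1e-8 gradient underflows to zero in fp16
# but survives once multiplied by SCALE.
tiny = np.float32(1e-8)
print(np.float16(tiny))          # 0.0 (underflow)
print(np.float16(tiny * SCALE))  # ~1.0e-05, representable in fp16
```

In practice the scale is often chosen dynamically, backing off when overflows are detected in the fp16 gradients.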
Ardavan Pedram is currently a member of technical staff at Cerebras Systems and an adjunct professor at Stanford University, where he directs the PRISM project. He organized and taught the first course on hardware accelerators for machine learning (CS217) in Fall 2018 with Professor Olukotun in the Stanford Computer Science Department. His work on algorithm/architecture co-design of specialized accelerators for linear algebra and machine learning has won two National Science Foundation awards, in 2012 and 2016. Ardavan received his Ph.D. in computer engineering from The University of Texas at Austin in 2013.
Kunle Olukotun is the Cadence Design Systems Professor in the School of Engineering and Professor of Electrical Engineering and Computer Science at Stanford University. Olukotun is well known as a pioneer in multicore processor design and the leader of the Stanford Hydra chip multiprocessor (CMP) research project. Olukotun founded Afara Websystems to develop high-throughput, low-power multicore processors for server systems. Afara, whose multicore processor was called Niagara, was acquired by Sun Microsystems; Niagara-derived processors now power all Oracle SPARC-based servers. Olukotun currently directs the Stanford Pervasive Parallelism Lab (PPL), which seeks to proliferate the use of heterogeneous parallelism in all application areas using Domain Specific Languages (DSLs).