Hardware Accelerators for Training Deep Neural Networks
Ardavan Pedram & Kunle Olukotun
Sunday Afternoon, June 23rd, 2019
Room 102A, Phoenix Convention Center, Phoenix, Arizona, USA
Funding for this tutorial was provided by the National Science Foundation, Division of Computing and Communication Foundations, under award number 1563113.
In this tutorial we cover the challenges of training deep neural networks and the architectural techniques used to design accelerators for training systems.
Training neural networks is far more demanding in computation and memory than inference. Consequently, state-of-the-art systems exploit several techniques to overcome these challenges. We will discuss the limitations of current solutions and the approaches to overcome them.
We will introduce the problem of training by examining the ubiquitous synchronous stochastic gradient descent and its asynchronous variants. We will cover tricks like lowering the precision and their impact on network convergence and performance. We will also cover the challenges of exploiting sparsity, especially during training. We will look at the problem of scaling DNN training on distributed multiprocessors and its attendant problems of increasing batch size and balancing computation and communication. We will then broadly survey the architectures that support training, including special-purpose accelerators.
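As a concrete starting point, the sketch below shows one synchronous minibatch SGD loop on a toy linear model in NumPy; the model, data, and hyperparameters are illustrative assumptions, not material from the tutorial.

```python
import numpy as np

# A minimal sketch of synchronous minibatch SGD on a toy linear model.
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 8))                 # synthetic inputs
true_w = rng.normal(size=8)
y = X @ true_w + 0.01 * rng.normal(size=1024)  # noisy targets

w = np.zeros(8)                                # model parameters
lr, batch_size = 0.1, 64                       # illustrative hyperparameters

for step in range(200):
    idx = rng.integers(0, len(X), size=batch_size)  # sample a minibatch
    xb, yb = X[idx], y[idx]
    err = xb @ w - yb                  # forward pass: prediction error
    grad = xb.T @ err / batch_size     # backward pass: mean-squared-error gradient
    w -= lr * grad                     # synchronous step: the full minibatch
                                       # gradient is applied before the next batch
```

Asynchronous variants relax the last step, letting workers apply possibly stale gradients without waiting for one another; the batch size sets the tradeoff between gradient noise and hardware utilization.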
This tutorial assumes basic knowledge of the computation kernels of deep neural networks and of how inference is accelerated.
- We recommend the Saturday tutorial by our friends at MIT on Hardware Accelerators for Deep Learning.
- Background and reading materials are available on our Stanford CS217 course website.
- The Hot Chips 30 tutorial video, Architectures for Accelerating Deep Neural Nets, is also a quick way to get up to speed.
- Fundamentals of training
  - Stochastic gradient descent
  - Training from a computation point of view
  - Data dependency
  - Batch size
- Low/mixed precision training
  - Floating-point representations for training
  - Loss scaling (see the sketch after this outline)
- Sparse training
  - Activation sparsity
  - Weight sparsity
  - Structured sparsity
- Batch normalization
- Group normalization
- Layer normalization
- Online normalization
- Scaling training
  - DNN architecture scaling
  - System scaling / distributed training
- Accelerators for training
- A note on ML benchmarks
- Conclusion and Q&A
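To make the loss-scaling item above concrete, here is a minimal NumPy sketch of the idea; the scale factor and helper names are illustrative assumptions, not code from the tutorial. Gradients are multiplied by a constant before the cast to fp16 so that small values stay representable, then unscaled against fp32 master weights before the update.

```python
import numpy as np

SCALE = 1024.0   # power-of-two loss-scale factor; an illustrative choice

def scale_to_fp16(grads_fp32):
    # Scale before casting so gradients too small for fp16 do not flush to zero.
    return [(g * SCALE).astype(np.float16) for g in grads_fp32]

def sgd_step(master_w_fp32, grads_fp16, lr=0.01):
    # Unscale against fp32 "master" weights, then apply a plain SGD update.
    return [w - lr * (g.astype(np.float32) / SCALE)
            for w, g in zip(master_w_fp32, grads_fp16)]

# Why scaling matters: a 1e-8 gradient underflows to zero in fp16
# but survives once multiplied by SCALE.
tiny = np.float32(1e-8)
print(np.float16(tiny))          # 0.0 (underflow)
print(np.float16(tiny * SCALE))  # ~1.0e-05, representable in fp16
```

In practice the scale is often chosen dynamically, backing off when overflows are detected in the fp16 gradients.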
Ardavan Pedram is currently a member of technical staff at Cerebras Systems and an adjunct professor at Stanford University, where he directs the PRISM project. He organized and taught the first course on hardware accelerators for machine learning (CS217) in Fall 2018 with Professor Olukotun in the Stanford Computer Science Department. His work on algorithm/architecture co-design of specialized accelerators for linear algebra and machine learning has won two National Science Foundation awards, in 2012 and 2016. Ardavan received his Ph.D. in computer engineering from The University of Texas at Austin in 2013.
Kunle Olukotun is the Cadence Design Systems Professor in the School of Engineering and Professor of Electrical Engineering and Computer Science at Stanford University. Olukotun is well known as a pioneer in multicore processor design and the leader of the Stanford Hydra chip multiprocessor (CMP) research project. Olukotun founded Afara Websystems to develop high-throughput, low-power multicore processors for server systems. Afara, whose multicore processor was called Niagara, was acquired by Sun Microsystems; Niagara-derived processors now power all Oracle SPARC-based servers. Olukotun currently directs the Stanford Pervasive Parallelism Lab (PPL), which seeks to proliferate the use of heterogeneous parallelism in all application areas using Domain Specific Languages (DSLs).