# Lecture 10: Hardware Acceleration of DNNs

Visual Computing Systems Stanford CS348K, Fall 2018

# Hardware acceleration for DNNs





**Huawei Kirin NPU** 





Volta GPU with **Tensor Cores** 



### Slide credit: Xuan Yang







### **Apple Neural Engine**

### **Intel Lake Crest Deep Learning Accelerator**

# And many more...

| IC Giants             | Intel, Qualcomm, Nvidia, Samsung, AMD, Apple, Xilinx, IBM,<br>STMicroelectronics, NXP, MediaTek, HiSilicon                                                                                                                                                                    | 12 |
|-----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|
| Cloud/HPC             | Google, Amazon_AWS, Microsoft, Aliyun, Tencent Cloud, Baidu, Baidu<br>Cloud, HUAWEI Cloud, Fujitsu                                                                                                                                                                            | 9  |
| IP Vendors            | ARM, Synopsys, Imagination, CEVA, Cadence, VeriSilicon                                                                                                                                                                                                                        | 6  |
| Startups in<br>China  | Cambricon, Horizon Robotics, DeePhi, Bitmain, Chipintelli, Thinkforce                                                                                                                                                                                                         | 6  |
| Startups<br>Worldwide | Cerebras, Wave Computing, Graphcore, PEZY, KnuEdge, Tenstorrent,<br>ThinCI, Koniku, Adapteva, Knowm, Mythic, Kalray, BrainChip,<br>Almotive, DeepScale, Leepmind, Krtkl, NovuMind, REM, TERADEEP,<br>DEEP VISION, Groq, KAIST DNPU, Kneron, Vathys, Esperanto<br>Technologies | 26 |

# Modern NVIDIA GPU (Volta)

# **Recall: properties of GPUs**

- "Compute rich": packed densely with processing elements
  - Good for compute-bound applications
- Good, because dense-matrix multiplication and DNN convolutional layers (when implemented properly) are compute bound
- But recall cost of instruction stream processing and control in a programmable processor: Clock and



# **One solution: more complex instructions**

- Fused multiply add (ax + b)
- 4-component dot product x = A dot B
- 4x4 matrix multiply
  - AB + C for 4x4 matrices A, B, C
- Key principle: amortize cost of instruction stream processing across many operations of a single complex instruction

# Volta GPU

| M                               |     |        |        |         |           |       |    |    |      |     |
|---------------------------------|-----|--------|--------|---------|-----------|-------|----|----|------|-----|
|                                 |     |        |        |         |           |       |    | L1 | Inst | ruc |
|                                 |     |        |        | 4'au 0  |           |       |    |    |      |     |
|                                 |     |        | nstruc |         |           |       |    |    | _    |     |
|                                 | War | p Sch  | edule  | r (32 t | hrea      | ıd/cl | k) |    |      |     |
|                                 | Di  | spatcl | h Unit | (32 th  | reac      | l/clk | )  |    |      |     |
| Register File (16,384 x 32-bit) |     |        |        |         |           |       |    |    |      |     |
| FP64                            | INT | INT    | FP32   | FP32    | $\square$ |       |    |    |      |     |
| FP64                            | INT | INT    | FP32   | FP32    |           |       |    |    |      |     |
| FP64                            | INT | INT    | FP32   | FP32    |           |       |    |    |      |     |
| FP64                            | INT | INT    | FP32   | FP32    |           |       |    |    |      |     |

| FP64             | INT       | INT       | FP32      | FP32      | $\square$ |           |        |
|------------------|-----------|-----------|-----------|-----------|-----------|-----------|--------|
| FP64             | INT       | INT       | FP32      | FP32      | TEN       | SOR       | TENSOR |
| FP64             | INT       | INT       | FP32      | FP32      | co        | RE        | CORE   |
| FP64             | INT       | INT       | FP32      | FP32      |           |           |        |
| FP64             | INT       | INT       | FP32      | FP32      |           |           |        |
| FP64             | INT       | INT       | FP32      | FP32      |           |           |        |
| LD/ LD/<br>ST ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | SFU    |

| L0 Instru                             | ction Cache                     | L0 Instruct                           | tion Cache              |  |  |  |  |
|---------------------------------------|---------------------------------|---------------------------------------|-------------------------|--|--|--|--|
| Warp Schedul                          | er (32 thread/clk)              | Warp Scheduler (32 thread/clk)        |                         |  |  |  |  |
| Dispatch Uni                          | t (32 thread/clk)               | Dispatch Unit (                       | (32 thread/clk)         |  |  |  |  |
| Register File                         | (16,384 x 32-bit)               | Register File (1                      | 6,384 x 32-bit)         |  |  |  |  |
| FP64 INT INT FP3                      | 2 FP32                          | FP64 INT INT FP32                     | FP32                    |  |  |  |  |
| FP64 INT INT FP3                      | 2 FP32                          | FP64 INT INT FP32                     | FP32                    |  |  |  |  |
| FP64 INT INT FP3                      | 2 FP32                          | FP64 INT INT FP32                     | FP32                    |  |  |  |  |
| FP64 INT INT FP3                      | <sup>2 FP32</sup> TENSOR TENSOR | FP64 INT INT FP32                     | FP32 TENSOR             |  |  |  |  |
| FP64 INT INT FP3                      | 2 FP32 CORE CORE                | FP64 INT INT FP32                     | FP32 CORE               |  |  |  |  |
| FP64 INT INT FP3                      | 2 FP32                          | FP64 INT INT FP32                     | FP32                    |  |  |  |  |
| FP64 INT INT FP3                      | 2 FP32                          | FP64 INT INT FP32                     | FP32                    |  |  |  |  |
| FP64 INT INT FP3                      | 2 FP32                          | FP64 INT INT FP32                     | FP32                    |  |  |  |  |
| LD/ LD/ LD/ LD/ LD/<br>ST ST ST ST ST | LD/ LD/ LD/<br>ST ST ST SFU     | LD/ LD/ LD/ LD/ LD/<br>ST ST ST ST ST | LD/ LD/ LD/<br>ST ST ST |  |  |  |  |
|                                       | 128KB L1 Data Cach              | ne / Shared Memory                    |                         |  |  |  |  |
| Тех                                   | Тех                             | Тех                                   | Тех                     |  |  |  |  |

ion Cache

FP64

**FP64** 

**FP64** 

**FP64** 

**FP64** 

FP64

FP64

FP64

LD/

LD/

ST

L0 Instruction Cache

Warp Scheduler (32 thread/cll

Dispatch Unit (32 thread/clk)

Register File (16,384 x 32-bit)

FP32 FP32

FP32 FP32

**FP32 FP32** 

FP32 FP32

FP32 FP3

**FP32 FP32** 

FP32 FP32

FP32 FP32

LD/

LD/

TENSOR

CORE

SFU

TENSOR

CORE

SFU

TENSOR CORE

LD/

ST

LD/ ST

INT INT

INT

LD/

INT

LD/

cores

| 120ND LT | Data Cat | ne / Sha |
|----------|----------|----------|
| Tex      |          |          |

Single instruction to perform 2x4x4x4 + 4x4 ops

- Each SM core has: 64 fp32 ALUs (mul-add) **32 fp64 ALUs**
- 8 "tensor cores"
- **Execute 4x4 matrix mul-add instr**
- A x B + C for 4x4 matrices A,B,C
- A, B stored as fp16, accumulation with fp32 C
- There are 80 SM cores in the GV100 GPU: 5,120 fp32 mul-add ALUs
- 640 tensor cores
- 6 MB of L2 cache
- 1.5 GHz max clock
- = 15.7 TFLOPs fp32
- = 125 TFLOPs (fp16/32 mixed) in tensor

# Efficiency estimates \*

### Estimated overhead of programmability (instruction stream, control, etc.)

- Half-precision FMA (fused multiply-add)
- Half-precision DP4 (vec4 dot product)
- Half-precision MMA (matrix-matrix multiply + accumulate)



## stream, control, etc.) 2000% 500% cumulate) 27%

NVIDIA Xavier (SoC for automotive domain)

Features a Computer Vision Accelerator (CVA), a custom module for deep learning acceleration (large matrix multiply unit)

But only 2x more efficient than Volta MMA instruction despite being highly specialized component. (includes optimization of gating multipliers if either operand is zero)

# Google TPU (version 1)

# Google's TPU



Figure credit: Jouppi et al. 2017

# **TPU area proportionality**



Figure credit: Jouppi et al. 2017

### Compute ~ 30% of chip Note low area footprint of control

Key instructions: read host memory write host memory read weights matrix\_multiply / convolve activate

## (matrix vector multiplication example: y=Wx)



Accumulators (32-bit)

## (matrix vector multiplication example: y=Wx)



Accumulators (32-bit)

## (matrix vector multiplication example: y=Wx)



Accumulators (32-bit)

## (matrix vector multiplication example: y=Wx)



Accumulators (32-bit)

## (matrix vector multiplication example: y=Wx)



**Accumulators (32-bit)** 

## (matrix vector multiplication example: y=Wx)



**Accumulators (32-bit)** 

### (matrix matrix multiplication example: *Y=WX*)



### Notice: need multiple 4x32bit accumulators to hold output columns

Accumulators (32-bit)

## **Example:** A = 8x8, B = 8x4096, C = 8x4096



Assume 4096 accumulators

## Example: A = 8x8, B = 8x4096, C = 8x4096



Assume 4096 accumulators

## Example: A = 8x8, B = 8x4096, C=8x4096



## Example: A = 8x8, B = 8x4096, C = 8x4096



## **TPU Performance/Watt**



**GM** = **geometric mean over all apps** WM = weighted mean over all apps

### total = cost of host machine + CPU incremental = only cost of TPU

# Alternative scheduling strategies

### **Psum = partial sum**



Figure credit: Sze et al. 2017





# **Exploiting sparsity**





- conv1 conv2 conv3 conv4 co
   Don't move data from register file to ALU (save energy)
- But ALU is idle (so computation doesn't run faster, just saves energy)



<sup>conv5</sup> energy) er, just saves energy)

# **Recall: model compression**

- Step 1: sparsify weights by truncating weights with small values to zero
- Step 2: compress surviving non-zeros
  - Cluster weights via k-means clustering
  - Compress weights by only storing index of assigned cluster (lg(k) bits)



[Han et al.]

### [Figure credit: Han ICLR16]

# Sparse, weight-sharing ful

$$b_i = ReLU\left(\sum_{j=0}^{n-1} W_{ij}a_j\right)$$

**Fully-connected layer:** 

$$b_i = ReLU\left(\sum_{j \in X_i \cap Y} S[I_{ij}]a_j\right)$$

 $I_{ij} = index for weight W_{ij}$ 

Note: activations can be sparse due to ReLU

## Matrix-vector multiplication of activation vector a against weight matrix W

## Sparse, weight-sharing representation: S[] = table of shared weight values $X_i =$ list of non-zero indices in row i Y =list of non-zero indices in vector a

# Sparse-matrix, vector multiplication

## **Represent weight matrix in compressed sparse column (CSC) format to** exploit sparsity in activation vector:

for each nonzero a\_j in a: for each nonzero M\_ij in column M\_j: b i += M ij \* a j

### More detailed version (assumes CSC matrix):

```
int16* a_values; // dense
                                     for j=0 to length(a):
PTR* M_j_start; // column j
int4* M_j_values;
int4* M_j_indices;
int16* lookup; // lookup table for
              // cluster values (from
                                        for i=0, i_count=0 to col_nonzeros:
              // deep compression paper)
                                               += col_indices[i_count];
                                           i
```

### \* Recall from deep compression paper: there is a unique lookup table for each chunk of matrix values

if (a[j] == 0) continue; // scan to next nonzero col\_values = M\_j\_values[M\_j\_start[j]]; // j-th col col\_indices = M\_j\_indices[M\_j\_start[j]]; // row idx in col col\_nonzeros = M\_j\_start[j+1] - M\_j\_start[j]; b[i] += lookup[col\_values[i\_count]] \* a\_values[j];

## **Parallelization of sparse-matrix-vector product** Stride rows of matrix across processing elements **Output activations strided across processing elements**



### Weights stored local to PEs. Must broadcast non-zero a\_j's to all PEs Accumulation of each output b i is local to PE

| Virtual yay yay yay |  |  |
|---------------------|--|--|
|---------------------|--|--|



# Efficient Inference Engine (EIE) for quantized sparse/matrix vector product

### Custom hardware for decoding compressed-sparse representation Tuple representing non-zero activation (a<sub>i</sub>, j) arrives and is enqueued



# **EIE efficiency**



Figure 6. Speedups of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases.



### CPU: Core i7 5930k (6 cores) GPU: GTX Titan X mGPU: Tegra K1

### Warning: these are not end-to-end numbers: just fully connected layers!

### Sources of energy savings:

- Compression allows all weights to be stored in SRAM (few DRAM loads)
- Low-precision 16-bit fixed-point math (5x more efficient than 32-bit fixed math)
- Skip math on input activations that are zero (65% less math)

### RAM loads) an 32-bit fixed math) )

# A critical eye...

- ElE paper highlights performance on fully connected layers (see graph above)
  - Final layers of networks like AlexNet, VGG...
  - Common in recurrent network topologies like LSTMs
- But many state-of-the-art image processing networks have moved to fully convolutional designs (or fully connected layers are a small part of computational cost)
  - Recall Inception, MobileNet, etc..

## Input stationary design (dense 1D conv example)



### **Stream of weights** (2 filters of size 3)

### Processing elements



### Accumulators (implement +=)

## Input stationary design (sparse example)



# Stream of sparse weights (2 filters, each with 2 non-zeros)

### Processing elements

# Accumulators (implement +=)

Note: accumulate is now a scatter

### **SCNN: accelerating sparse conv layers** Like EIE: assume both activations and conv weights are sparse

- Weight stationary design:
  - **Each PE receives:** 
    - A set of I input activations from an input channel: a list of I (value, (x,y)) pairs
    - A list of F non-zero weights
  - Each PE computes: the cross-product of these values: P x I values
  - Then scatters P x I results to correct accumulator buffer cell
  - Then repeat for new set of F weights (reuse l inputs)
- Then, after convolution:
  - **ReLU** sparsifies output
  - **Compress outputs into** sparse representation for use as input to next layer



**Stanford CS** 

# SCNN results (on GoogLeNet)



**DCNN** = dense CNN evaluation DCNN-opt = includes ALU gating, and compression/decompression of activations

### [Parashar et al. ISCA17]

# Summary of hardware accelerators for efficient inference

- **Specialized instructions for dense linear algebra computations** 
  - Reduce overhead of control (compared to CPUs/GPUs)
- **Reduced precision operations (cheaper computation + reduce bandwidth** requirements)
- Systolic architectures for efficient communication
  - Different scheduling strategies: weight-stationary, input/output stationary, etc.
- **Exploit sparsity in activations and weights** 
  - Skip computation involving zeros
  - Hardware to accelerates decompression of sparse representations like compressed sparse row/column