## GDDR Memory Enabling AI and High Performance Compute

Wolfgang Spirkl, Fellow at Micron Technology

GTC, S9968, 20-March-2019

©2019 Micron Technology, Inc. All rights reserved. Information, products, and/or specifications are subject to change without notice. All information is provided on an "AS IS" basis without warranties of any kind. Statements regarding products, including regarding their features, availability, functionality, or compatibility, are provided for informational purposes only and do not modify the warranty, if any, applicable to any product. Drawings may not be to scale. Micron, the Micron logo, and all other Micron trademarks are the property of Micron Technology, Inc. All other trademarks are the property of their respective owners.



GTC 2019, Micron GDDR6



# Agenda

- The Demand for Faster Memory and Storage
- Competing Compute/Memory Solutions
- GDDR6 for AI applications and more
- Micron GDDR6 AI demonstration



# Accelerated Data Cycle

Driven by Increasing Data Value

- Creates continuous need to capture, process, move & store data
- Generates ever-increasing demand for memory & fast storage
- AI is amplifying the Accelerated Cycle



### AI Landscape for Memory & Storage





## AI Workloads Unleash the Need For More Memory & Storage

Significant Growth Across Private, Public & Hybrid Cloud










## AI Acceleration is Driving Demand for Memory Bandwidth

- AI accelerators increase compute performance
  - GPU, TPU, etc.
- Accelerated applications are more likely to be memory bound [3]
- Micron supports next gen technologies
  - GDDR6, HBM2E
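The "memory bound" point can be illustrated with the roofline model from [2]: attainable throughput is the lower of the compute peak and memory bandwidth times operational intensity. The following is a minimal sketch, not taken from the talk. The peak of 100 Tops/s is a hypothetical value chosen for illustration; the 672 GB/s bandwidth is the GDDR6 figure quoted later in this deck.

```python
def attainable_ops_per_s(peak_ops_per_s: float,
                         bandwidth_bytes_per_s: float,
                         ops_per_byte: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * ops_per_byte)


def ridge_point(peak_ops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Operational intensity (ops/byte) at which the two roofs meet."""
    return peak_ops_per_s / bandwidth_bytes_per_s


PEAK = 100e12      # hypothetical accelerator peak in ops/s (assumption)
BANDWIDTH = 672e9  # bytes/s, the 672 GB/s GDDR6 configuration in this deck

for intensity in (10, 100, 1000):
    bound = attainable_ops_per_s(PEAK, BANDWIDTH, intensity)
    limiter = "memory" if bound < PEAK else "compute"
    print(f"{intensity:5d} ops/byte -> {bound / 1e12:6.2f} Tops/s ({limiter} bound)")
```

Under these assumed numbers, any workload below roughly 149 ops/byte sits on the memory roof, so raising compute peak alone does not help it; only more bandwidth does.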



Figure x-axis: Compute Ops / Byte of Memory (log scale)

[1] Jouppi, N., et al., 2017. In-Datacenter Performance Analysis of a TPU. ISCA.

[2] Williams, S., Waterman, A. and Patterson, D., 2009. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM.

[3] Forrester report on memory and storage impact on AI.



## **Memory Options**



Smart Edge

**Intelligent Endpoint** 










# **GDDR5/GDDR6 Features**

| Feature            | GDDR5                                      | GDDR6                                         |
|--------------------|--------------------------------------------|-----------------------------------------------|
| Density            | 512Mb – 8Gb                                | 8Gb – 32Gb                                    |
| VDD, VDDQ          | 1.5V + 1.35V                               | 1.35V                                         |
| VPP                | N/A                                        | 1.8V                                          |
| Package            | BGA-170<br>14mm x 12mm<br>0.8mm ball pitch | BGA-180<br>14mm x 12mm<br>0.75mm ball pitch   |
| Signaling          | POD15 / POD135                             | POD135                                        |
| Data rate          | ≤8 Gbps                                    | ≤16 Gbps                                      |
| I/O Width          | x32/x16                                    | 2-ch x16/x8                                   |
| Access Granularity | 32B                                        | 2-ch 32B each or<br>1-ch 64B w/ PC mode       |
| I/O Count          | 61                                         | 62 / 74                                       |
| ABI, DBI           | ✓                                          | ✓                                             |
| CRC                | CRC-8 (BL8)                                | 2x CRC-8 (BL16);<br>compressed 2x CRC-8 (BL8) |
| RDQS Mode          | ✓ (BL8)                                    | ✓ (BL16)                                      |
| ODT                | ✓                                          | ✓                                             |
| V<sub>REFC</sub>   | external                                   | external / internal                           |
| V<sub>REFD</sub>   | ext. / int.                                | internal                                      |
| Temp Sensor        | ✓                                          | ✓                                             |

### Package Configuration

Figure: GDDR5 and GDDR6 package configurations; the GDDR6 package is organized as 2 channels.

Dual channel organization

- Maintains fine granularity (32 bytes per column access)
- In spite of doubled prefetch size
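The granularity claim follows from simple arithmetic: bytes per column access equal the channel's data width times the burst length, divided by 8. The sketch below uses the x32 and x16 widths and the BL8 and BL16 burst lengths listed in the feature table; the function name is made up for this illustration.

```python
def access_granularity_bytes(data_width_bits: int, burst_length: int) -> int:
    """Bytes delivered by one column access on one channel."""
    return data_width_bits * burst_length // 8


# GDDR5: one x32 channel, burst length 8
gddr5 = access_granularity_bytes(data_width_bits=32, burst_length=8)

# GDDR6: each of the two x16 channels, burst length 16 (doubled prefetch)
gddr6_per_channel = access_granularity_bytes(data_width_bits=16, burst_length=16)

print(gddr5, gddr6_per_channel)  # 32 32
```

Halving the channel width offsets the doubled burst length, so each GDDR6 channel still returns 32 bytes per access.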

New features for high data rates:

- Optimized signal ball-out for low-effort PCB design
- Per lane DFE and VREFD
- Transmitter equalization





## **Calculating GDDR Bandwidth**

- Bandwidth = number of bits/s between GPU and memory
- Memory bus is like traffic lanes
  - More lanes, the greater the flow
  - Higher lane speed, the greater the flow

| Memory bandwidth is                          | GDDR6 example |
|----------------------------------------------|---------------|
| Number of memory components                  | 8             |
| × Number of lanes per component              | 32            |
| × Data rate per lane (Gbps)                  | 16            |
| ÷ 8 bits per byte = Memory bandwidth (GB/s)  | 512           |
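The calculation in the table can be sketched as a small helper. The function name is illustrative; the only step not spelled out in the table is the division by 8 that converts gigabits to gigabytes.

```python
def memory_bandwidth_gbytes_per_s(components: int,
                                  lanes_per_component: int,
                                  gbps_per_lane: float) -> float:
    """Peak memory bandwidth in GB/s from component count, lane count and lane speed."""
    total_gbits_per_s = components * lanes_per_component * gbps_per_lane
    return total_gbits_per_s / 8  # 8 bits per byte


# GDDR6 example from the table: 8 components, 32 lanes each, 16 Gbps per lane
print(memory_bandwidth_gbytes_per_s(8, 32, 16))  # 512.0
```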




## **GDDR Bandwidth / Memory Bus**

| Technology | Speed (Gbps) | # of comp. | # of lanes | Memory bus (bit) | Bandwidth (GB/s) |
|------------|--------------|------------|------------|------------------|------------------|
| HBM2       | 2            | 4          | 1024       | 4096             | 1024             |
| GDDR6      | 16           | 12         | 32         | 384              | 768              |
| GDDR6      | 14           | 12         | 32         | 384              | 672              |
| GDDR5X     | 11           | 12         | 32         | 384              | 528              |
| GDDR5      | 7            | 12         | 32         | 384              | 336              |
| GDDR6      | 14           | 8          | 32         | 256              | 448              |
| GDDR5X     | 11           | 8          | 32         | 256              | 352              |
| GDDR5      | 7            | 8          | 32         | 256              | 224              |
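One way to read the table is per component rather than per system. The sketch below derives bandwidth per component from the speed and lane counts in the table, then scales by component count to recover the system totals. The tuple layout and names are chosen for this illustration only.

```python
# (technology, speed in Gbps per lane, number of components, lanes per component)
CONFIGS = [
    ("HBM2", 2, 4, 1024),
    ("GDDR6", 16, 12, 32),
    ("GDDR6", 14, 12, 32),
    ("GDDR5X", 11, 12, 32),
    ("GDDR5", 7, 12, 32),
]


def per_component_gbytes_per_s(gbps_per_lane: float, lanes: int) -> float:
    """Bandwidth of a single memory component in GB/s."""
    return gbps_per_lane * lanes / 8


for tech, speed, components, lanes in CONFIGS:
    each = per_component_gbytes_per_s(speed, lanes)
    total = each * components
    print(f"{tech:7s} {speed:2d} Gbps: {each:5.0f} GB/s per component, "
          f"{total:5.0f} GB/s with {components} components")
```

The HBM2 row reaches its total with few, very wide components at a low lane speed, while the GDDR rows use more, narrower components at a much higher lane speed.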


Figure: GPUs with 384-bit, 256-bit, and 192-bit memory buses, each built from x32 memory components.



## **GDDR6 20 Gbps Data Eye**

Measured performance beyond the specification



Figure 15: Measured 20Gb/s data eye based on a PRBS6 pattern

https://www.micron.com/-/media/client/global/documents/products/whitepaper/16gb\_s\_and\_beyond\_w\_single\_endedio\_in\_graphics\_memory.pdf?la=en
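For readers unfamiliar with the test pattern named in the caption, a PRBS6 sequence is a 63-bit pseudo-random pattern produced by a 6-bit linear feedback shift register. The sketch below assumes feedback taps on stages 6 and 5 (often written as x^6 + x^5 + 1), a common choice; the source does not state which polynomial or seed was used in the measurement.

```python
def prbs6(seed: int = 0b111111, length: int = 63) -> list[int]:
    """Generate `length` bits from a 6-bit LFSR (assumed taps: stages 6 and 5)."""
    if not 0 < seed < 64:
        raise ValueError("seed must be a non-zero 6-bit value")
    state = seed
    bits = []
    for _ in range(length):
        bits.append((state >> 5) & 1)                  # output the oldest bit
        feedback = ((state >> 5) ^ (state >> 4)) & 1   # XOR of stages 6 and 5
        state = ((state << 1) | feedback) & 0x3F       # shift in the new bit
    return bits


sequence = prbs6(length=126)
# The pattern repeats after 2**6 - 1 = 63 bits.
print(sequence[:63] == sequence[63:])
```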




## **Speech Recognition Craves Memory Bandwidth**

- "Deep Speech" recognition application
  - Baidu Research's AI algorithm (https://arxiv.org/pdf/1412.5567.pdf)
  - Mozilla's TensorFlow implementation
  - Speech-to-text benchmark for AI hardware (https://github.com/mozilla/DeepSpeech)
- Hardware
  - NVIDIA RTX 2080 Ti
  - 11GB GDDR6
    - 384 bit bus @14Gb/s/pin, 672GB/s
- Experiment setup
  - Adjust GDDR6 clock rate
  - Measure speech recognition inference rate:
    - $\text{Inference rate} = \frac{\text{Audio file duration}}{\text{Inference time}}$
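The inference-rate metric above can be sketched as follows. This is not the demo's code: `measure` is a hypothetical timing helper, the callable passed to it stands in for a real speech-to-text call, and the 10 s and 2.5 s figures are illustrative values, not measured results.

```python
import time
from typing import Any, Callable


def inference_rate(audio_duration_s: float, inference_time_s: float) -> float:
    """Seconds of audio transcribed per second of compute; above 1 is faster than real time."""
    if inference_time_s <= 0:
        raise ValueError("inference time must be positive")
    return audio_duration_s / inference_time_s


def measure(run_inference: Callable[[], Any]) -> tuple[Any, float]:
    """Run a callable once and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = run_inference()
    return result, time.perf_counter() - start


# Illustrative numbers only: a 10 s clip transcribed in 2.5 s
print(inference_rate(audio_duration_s=10.0, inference_time_s=2.5))  # 4.0
```

Repeating the measurement at each memory clock setting gives one inference rate per setting, which is the relationship the experiment looks at.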





## AI Demonstrates the Need for Memory Speed

SPEECH RECOGNITION DEMO – Micron Booth #1713




### Conclusions

- AI landscape demands higher-performance memory to feed the compute needs

- Micron delivers a broad range of memory solutions for AI applications from data center to cloud to edge to endpoint devices
- GDDR6 high performance memory optimized for applications beyond graphics

### Experience Micron speech recognition AI with GDDR6 in our booth 1713!



