

#### Summary


- Optimized data transfer paths using PCIe peer-to-peer transactions
- Direct block-level disk access from the GPU, eliminating the CPU from the I/O data path entirely
- Concurrently sharing NVMe drives between multiple hosts and GPUs
- PCIe non-transparent bridges offer great flexibility in dynamic device configurations



#### Outline

- PCIe and NVMe
- Non-Transparent Bridging
- GPUDirect RDMA & Async
- Device Lending and SmartIO

#### PCI Express (PCIe)



#### The PCIe fabric is structured as a tree, where devices form the leaf nodes (endpoints) and the CPU is on top of the root









#### \$ lspci -s XX:XX.X -v

| root@petty:~ (ssh) | |
|---|---|
| [root@petty ~]# lspci -s 4:3.1 -v | |
| 4:03.1 VGA compatible controller: NVIDIA Corporation GP107GL [Quadro P620] (rev a1) (prog-if 00 [VGA controller]) | |
| Subsystem: NVIDIA Corporation Device 1264 | |
| Flags: bus master, fast devsel, latency 0, IRQ 71, NUMA node 0 | |
| Memory at 3830e5000000 (32-bit, non-prefetchable) [size=16M] | Base Address Register (BAR) |
| Memory at 3830f0000000 (32-bit, non-prefetchable) [size=256M] | Base Address Register (BAR) |
| Memory at 383100000000 (32-bit, non-prefetchable) [size=32M] | Base Address Register (BAR) |
| Expansion ROM at 3830e6000000 [disabled] [size=512K] | |
| Capabilities: [60] Power Management version 3 | |
| Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+ | |
| Capabilities: [78] Express Legacy Endpoint, MSI 00 | |
| Capabilities: [100] Virtual Channel | |
| Capabilities: [250] Latency Tolerance Reporting | |
| Capabilities: [128] Power Budgeting | |
| Capabilities: [420] Advanced Error Reporting | |
| Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 | |
| Capabilities: [900] #19 | |
| Kernel driver in use: nvidia | |
| Kernel modules: nouveau, nvidia_drm, nvidia | |
| [root@petty ~]# | |
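
As a rough illustration of how software can reason about these memory regions, the sketch below parses the three `Memory at` lines from the `lspci` output above into base/size pairs and checks which region a given address falls in. The helper names `parse_bars` and `bar_containing` are hypothetical and not part of any tool, and the returned index is only the order of appearance, not necessarily the hardware BAR number.

```python
import re

# Lines copied from the lspci output above.
LSPCI_LINES = [
    "Memory at 3830e5000000 (32-bit, non-prefetchable) [size=16M]",
    "Memory at 3830f0000000 (32-bit, non-prefetchable) [size=256M]",
    "Memory at 383100000000 (32-bit, non-prefetchable) [size=32M]",
]

UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
BAR_RE = re.compile(r"Memory at ([0-9a-f]+) .*\[size=(\d+)([KMG]?)\]")


def parse_bars(lines):
    """Return a list of (base_address, size_in_bytes) tuples."""
    bars = []
    for line in lines:
        match = BAR_RE.search(line)
        if match:
            base = int(match.group(1), 16)
            size = int(match.group(2)) * UNITS[match.group(3)]
            bars.append((base, size))
    return bars


def bar_containing(bars, addr):
    """Return the list index of the region containing addr, or None."""
    for index, (base, size) in enumerate(bars):
        if base <= addr < base + size:
            return index
    return None


bars = parse_bars(LSPCI_LINES)
for base, size in bars:
    print(f"region at {base:#x}, {size >> 20} MiB")
```

Any address inside one of these ranges is routed by the PCIe fabric to the GPU instead of to system RAM, which is what makes memory-mapped access to device memory possible.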





#### \$ ./bandwidthTest








#### Because device memory is mapped into the same address space by the system, devices can also access other devices' memory



#### \$ ./p2pBandwidthLatencyTest

| root@xde: /usr/local/cuda-10.1/samples/1_Utilities/p2pBandwidthLatencyTest (ssh) |
|-----------------------------------------------------------------------------------------|
| P2P Connectivity Matrix                                                                 |
| D\D 0 1                                                                                 |
| 0 1 1                                                                                   |
| 1 1 1                                                                                   |
| Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)                                     |
| D\D 0 1                                                                                 |
| 0 21.44 5.76                                                                            |
|                                                                                         |
| Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)                         |
| D\D 0 1                                                                                 |
| 0 21.44 6.71                                                                            |
| 1 6.71 21.45                                                                            |
| Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)                                      |
| D\D 0 1                                                                                 |
| 0 21.47 5.77                                                                            |
| 1 5.78 21.48                                                                            |
| Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)                                       |
| D\D 0 1                                                                                 |
| 0 21.47 13.23                                                                           |
| 1 13.20 21.48                                                                           |
| P2P=Disabled Latency Matrix (us)                                                        |
| GPU 0 1                                                                                 |
| 0 6.35 24.79                                                                            |
| 1 26.33 6.34                                                                            |
|                                                                                         |
| CPU 0 1                                                                                 |
| 1 7.59 3.26                                                                             |
| P2P=Enabled Latency (P2P Writes) Matrix (us)                                            |
| GPU 0 1                                                                                 |
| 0 6.39 2.05                                                                             |
| 1 2.00 6.42                                                                             |
|                                                                                         |
| CPU 0 1                                                                                 |
| 0 3.38 2.51                                                                             |
| 1 2.49 3.36                                                                             |
|                                                                                         |
| NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled. |
| root@xde:/usr/local/cuda-10.1/samples/1_Utilities/p2pBandwidthLatencyTest#              |

#### **PCIe Summary**

- Devices share an address space with the CPU and can access memory = DMA
- Memory reads and writes are forwarded along the shortest path on the PCIe fabric
- Devices can access memory on other devices = peer-to-peer DMA

#### NVM Express (NVMe)



#### NVMe is designed around multiple parallel I/O command submission queues (SQs) and command completion queues (CQs)

- One SQ and one CQ per CPU (1:1)
- Multiple SQs sharing a single CQ (N:M)

#### Command queues are implemented as ring-buffers where each individual queue has a dedicated doorbell register
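
A minimal sketch of the ring-buffer and doorbell mechanics, written as a plain Python simulation and not as driver code. The class and method names are illustrative assumptions; in a real driver the queue lives in DMA-capable memory and the doorbell is a memory-mapped controller register.

```python
class SubmissionQueue:
    """Toy model of an NVMe-style submission queue (names are hypothetical)."""

    def __init__(self, depth):
        self.depth = depth
        self.slots = [None] * depth
        self.tail = 0      # next free slot, owned by software
        self.head = 0      # next slot the controller will fetch
        self.doorbell = 0  # stands in for the MMIO tail doorbell register

    def is_full(self):
        # One slot is left unused so that full and empty can be told apart.
        return (self.tail + 1) % self.depth == self.head

    def enqueue(self, command):
        if self.is_full():
            raise BufferError("submission queue full")
        self.slots[self.tail] = command
        self.tail = (self.tail + 1) % self.depth

    def ring_doorbell(self):
        # A single register write publishes every command queued so far.
        self.doorbell = self.tail

    def controller_fetch(self):
        # The controller consumes entries up to the last doorbell value.
        fetched = []
        while self.head != self.doorbell:
            fetched.append(self.slots[self.head])
            self.head = (self.head + 1) % self.depth
        return fetched


sq = SubmissionQueue(depth=4)
sq.enqueue("read lba 0")
sq.enqueue("read lba 8")
before_doorbell = sq.controller_fetch()  # nothing visible to the controller yet
sq.ring_doorbell()
after_doorbell = sq.controller_fetch()   # both commands fetched
```

Commands only become visible to the controller once the tail value is written to the doorbell, so several commands can be batched behind one register write.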



#### I/O commands use physical region page lists (PRP lists) to describe physical addresses of non-contiguous memory
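
To make the PRP idea concrete, here is a simplified sketch of how the two PRP fields of a command could be filled in for a page-aligned buffer. It assumes 4 kB pages and 8-byte list entries, ignores a non-zero offset into the first page, and does not chain PRP lists; `build_prps` and its parameters are hypothetical names.

```python
PAGE_SIZE = 4096
ENTRIES_PER_LIST_PAGE = PAGE_SIZE // 8  # 8-byte physical addresses


def build_prps(page_addrs, prp_list_addr):
    """Return (prp1, prp2, prp_list) for a buffer backed by page_addrs.

    page_addrs    physical addresses of the (possibly scattered) pages
    prp_list_addr physical address where the PRP list itself would be stored
    """
    if not page_addrs:
        raise ValueError("buffer must cover at least one page")
    if any(addr % PAGE_SIZE for addr in page_addrs):
        raise ValueError("this sketch only handles page-aligned addresses")

    prp1 = page_addrs[0]
    if len(page_addrs) == 1:
        return prp1, 0, []                 # PRP2 unused
    if len(page_addrs) == 2:
        return prp1, page_addrs[1], []     # PRP2 points at the second page
    remaining = list(page_addrs[1:])
    if len(remaining) > ENTRIES_PER_LIST_PAGE:
        raise NotImplementedError("chained PRP lists are not modelled here")
    return prp1, prp_list_addr, remaining  # PRP2 points at the PRP list


# Three non-contiguous pages: PRP2 refers to a list holding pages 2 and 3.
prp1, prp2, prp_list = build_prps([0x10000, 0x7f000, 0x23000], 0x90000)
```

Because every entry is an independent physical address, the pages of a buffer do not have to be adjacent in memory.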



#### Software can set up I/O command queues anywhere in memory





















#### Software detects completions either by waiting for hardware interrupts or by polling completion queue memory
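
The polling variant can be sketched as follows. NVMe completion entries carry a phase tag bit that the controller inverts on every pass through the queue, so software can tell new entries from stale ones just by reading queue memory. The simulation below is an assumption-laden toy model: the class name, the tuple layout of an entry, and the `controller_post` helper are all hypothetical.

```python
class CompletionQueue:
    """Toy model of polling an NVMe-style completion queue."""

    def __init__(self, depth):
        self.depth = depth
        # Each entry is (command_id, phase); zeroed memory means phase 0.
        self.entries = [(None, 0)] * depth
        self.head = 0            # software's read position
        self.expected_phase = 1  # phase written during the controller's first pass
        self._ctrl_tail = 0      # controller-side state, simulated here
        self._ctrl_phase = 1

    def controller_post(self, command_id):
        # Simulates the controller writing a completion entry via DMA.
        self.entries[self._ctrl_tail] = (command_id, self._ctrl_phase)
        self._ctrl_tail += 1
        if self._ctrl_tail == self.depth:
            self._ctrl_tail = 0
            self._ctrl_phase ^= 1

    def poll(self):
        """Return the IDs of all newly completed commands without blocking."""
        done = []
        while True:
            command_id, phase = self.entries[self.head]
            if phase != self.expected_phase:
                return done  # entry at head is stale: nothing new
            done.append(command_id)
            self.head += 1
            if self.head == self.depth:
                self.head = 0
                self.expected_phase ^= 1


cq = CompletionQueue(depth=2)
first = cq.poll()        # nothing completed yet
cq.controller_post(7)
second = cq.poll()       # picks up command 7
```

No interrupt handler is involved: the consumer only reads memory, which is why the same loop could in principle run on any processor that can see the queue.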



#### NVMe Summary

- Multiple I/O queues enable a highly parallel design
- Controller uses DMA = queues and data buffers can be hosted **anywhere in memory space**
- Single doorbell register write in the I/O path
- Software can poll for completions instead of waiting for a hardware interrupt

#### **Non-Transparent Bridging**









#### Since PCIe devices are also part of the address space, it is also possible to map remote device resources



#### Using NTBs, it is possible for a local driver to use a remote device by setting up MMIO and DMA mappings



Native NVMe over PCIe NTB

4 kB read completion latency = ~14.21 μs

NVMe over Fabrics (NVMeoF) using 100 GbE Ethernet RDMA (SPDK target, kernel direct)

#### **NTB** Summary

- NTBs connect separate, independent root complexes and translate addresses between them
- Since device memory (BARs) is part of the address space, we can map remote device resources for a local host
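
The address translation an NTB performs can be pictured as a simple window mapping. The sketch below is a conceptual model only: `NtbWindow`, its fields, and the example addresses are made up for illustration and do not correspond to any particular NTB hardware or driver API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NtbWindow:
    """A range of local addresses that the NTB forwards to a remote system."""
    local_base: int   # where the window appears in the local address space
    size: int         # window size in bytes
    remote_base: int  # address the window maps to on the remote side

    def translate(self, local_addr):
        offset = local_addr - self.local_base
        if not 0 <= offset < self.size:
            raise ValueError("address outside NTB window")
        return self.remote_base + offset


# Hypothetical example: a 16 MiB window that exposes a remote device BAR.
window = NtbWindow(local_base=0x2000_0000, size=16 << 20, remote_base=0x9000_0000)
remote_addr = window.translate(0x2000_1000)
```

A local read or write that hits the window is forwarded across the bridge with its address rewritten, so a local driver can treat the remote BAR as ordinary memory-mapped I/O.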

#### **GPUDirect RDMA & Async**





https://docs.nvidia.com/cuda/gpudirect-rdma/



**Onboard Device Memory** 










#### GPUDirect RDMA allows a third-party device to read and write GPU memory directly instead of copying to and from system memory



#### Unified Memory allows mapping controller registers and queue doorbells into memory space managed by the CUDA driver



#### With doorbell registers mapped into CUDA memory space, a GPU kernel can now trigger doorbell writes using GPUDirect Async



#### By assigning I/O queues to each individual GPU, multiple GPUs can share a single NVMe disk simultaneously
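
One way to picture the sharing scheme is as a bookkeeping problem: every GPU gets its own queue pair IDs and therefore its own doorbell registers, so no two GPUs ever write to the same queue. The sketch below is a hedged illustration, not the authors' driver. The function names are hypothetical, and the doorbell offsets follow the layout in the NVMe base specification as recalled here (doorbell region starting at offset 0x1000, stride given by `CAP.DSTRD`).

```python
DOORBELL_BASE = 0x1000  # start of the doorbell region in the controller registers


def assign_queue_pairs(gpu_ids, max_io_queues, queues_per_gpu=1):
    """Give each GPU a disjoint set of I/O queue IDs."""
    if len(gpu_ids) * queues_per_gpu > max_io_queues:
        raise ValueError("controller does not offer enough I/O queues")
    assignment = {}
    next_qid = 1  # queue ID 0 is the admin queue
    for gpu in gpu_ids:
        assignment[gpu] = list(range(next_qid, next_qid + queues_per_gpu))
        next_qid += queues_per_gpu
    return assignment


def sq_tail_doorbell_offset(qid, dstrd=0):
    """Register offset of the submission queue tail doorbell for qid."""
    return DOORBELL_BASE + (2 * qid) * (4 << dstrd)


def cq_head_doorbell_offset(qid, dstrd=0):
    """Register offset of the completion queue head doorbell for qid."""
    return DOORBELL_BASE + (2 * qid + 1) * (4 << dstrd)


queues = assign_queue_pairs(["gpu0", "gpu1"], max_io_queues=4)
for gpu, qids in queues.items():
    for qid in qids:
        print(gpu, qid, hex(sq_tail_doorbell_offset(qid)), hex(cq_head_doorbell_offset(qid)))
```

Since the queue IDs are disjoint, each GPU only needs its own doorbell registers mapped and can submit commands without coordinating with the others.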





#### **GPUDirect Summary**

- GPUDirect RDMA allows third-party devices, such as NVMe disks, to access GPU memory directly
- GPUDirect Async allows memory-mapped I/O from a CUDA kernel = eliminates the CPU from the I/O path entirely
- We have used these to build a distributed NVMe driver in CUDA kernel code

#### **Device Lending and SmartIO**



#### In PCIe clusters, the same fabric is used both for interconnecting hosts and as the local I/O bus inside each host



#### Using an NTB, it is possible to map remote device memory regions (BARs) for a local host



#### The remote system can in turn reverse-map the local system's memory and interrupt addresses for the device



#### By emulating a PCIe hot-add event, the remote device is inserted into the kernel device tree, making it appear locally installed





#### Using Device Lending and our CUDA NVMe driver, it is possible to create highly flexible and distributed I/O workloads















Submission Queue Location





#### Summary




- Optimized data transfer paths using PCIe peer-to-peer transactions
- Direct block-level disk access from the GPU, eliminating the CPU from the I/O data path entirely
- Concurrently sharing NVMe drives between multiple hosts and GPUs
- PCIe non-transparent bridges offer great flexibility in dynamic device configurations



