

### FUTURE OF ISO AND CUDA C++

#### The NVIDIA ISO C++ Delegation

- Bryce Adelstein Lelbach (@blelbach): Chief ISO C++ Library Designer, US Programming Language Standards Chair
- Olivier Giroux (@\_\_simt\_\_): ISO C++ Concurrency and Parallelism Chair
- Michał Dominiak (@Guriwesu): Polish C++ Standards Chair, Extended Floating Point Author
- Jared Hoberock: Parallelism TS Project Editor, Executors Author
- David Olsen: Scalable Synchronization Library Author, Extended Floating Point Author
- Timothy Costa: Product Manager, HPC Software
- Graham Lopez: Product Manager, HPC Compilers

#include <C++>

Copyright (C) 2020 NVIDIA





Source: https://isocpp.org/std/status




## C++20: The Biggest Release in a Decade

- Modules
- Coroutines
- Concepts
- Ranges
- Scalable Synchronization



## C++23: Asynchrony and Parallelism

- Standard Library Modules
- Coroutine Support Library
- Executors
- Networking
- mdspan/mdarray



#### C++23 Executors

Simplifying Work Creation

```
void compute(int resource, ...) {
  switch(resource) {
    case GPU:
      kernel<<<...>>>(...);
      break;
    case MULTI_GPU:
      cudaSetDevice(0);
      kernel<<<...>>>(...);
      cudaSetDevice(1);
      kernel<<<...>>>(...);
      break;
    case COOP_GPU:
      cudaLaunchCooperativeKernel(...);
      break;
    case GRAPH:
      cudaGraphLaunch(...);
      break;
  }
}
```

VS

```
void compute(executor auto ex, ...) {
    execute(ex, ...);
}
```




#### C++23 Executors

```
static_thread_pool pool(16);
executor auto ex = pool.executor();
execute(ex, []{ cout << "Hello world from the thread pool!"; });

sender auto begin = schedule(ex);
sender auto hi_again = then(begin, []{ cout << "Hi again! Have an int."; return 13; });
sender auto work = then(hi_again, [](int arg) { return arg + 42; });

receiver auto print_result = as_receiver([](int arg) { cout << "Received.\n"; });
submit(work, print_result);
```
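The key property of the sender chain above is that `then` only describes work; nothing runs until `submit`. A toy model of that laziness in C++17 might look like the following. All names here (`lazy_sender`, `schedule_inline`, and these versions of `then` and `submit`) are hypothetical and written for this illustration; the proposed executors design is considerably richer (error and cancellation channels, schedulers, and so on).

```cpp
#include <type_traits>
#include <utility>

// Hypothetical sender: wraps a callable describing work that has not run yet.
template <class F>
struct lazy_sender {
    F work;
};

// Hypothetical starting point: a sender that produces no value.
inline auto schedule_inline() {
    auto noop = [] {};
    return lazy_sender<decltype(noop)>{noop};
}

// Chain a continuation onto a sender; nothing executes here.
template <class F, class G>
auto then(lazy_sender<F> prev, G next) {
    auto chained = [prev, next] {
        if constexpr (std::is_void_v<decltype(prev.work())>) {
            prev.work();              // predecessor yields no value
            return next();
        } else {
            return next(prev.work()); // forward the predecessor's value
        }
    };
    return lazy_sender<decltype(chained)>{chained};
}

// Hypothetical submit: runs the whole chain and hands the result to a receiver.
template <class F, class R>
void submit(lazy_sender<F> s, R receiver) {
    receiver(s.work());
}
```

Building `then(then(schedule_inline(), []{ return 13; }), [](int arg) { return arg + 42; })` performs no work; passing it to `submit` with a receiver taking an `int` delivers 55.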



#### Linear Algebra & C++23 mdspan/mdarray

```
auto x = ...; // An `mdspan<double, dynamic_extent>`.
auto y = ...; // An `mdspan<double, dynamic_extent>`.
auto A = ...; // An `mdspan<double, dynamic_extent, dynamic_extent>`.

// y = transpose(A) * x;
matrix_vector_product(par, transpose_view(A), x, y);
```
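For readers who want to see what the call above computes, here is a sequential sketch of `y = transpose(A) * x` in plain C++. The `matrix_view` struct and the function name are hypothetical stand-ins for the proposed `mdspan` and `matrix_vector_product`; row-major storage is an assumption of this sketch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical non-owning 2D view over row-major storage; a stand-in for
// the `mdspan<double, dynamic_extent, dynamic_extent>` on the slide.
struct matrix_view {
    const double* data;
    std::size_t rows;
    std::size_t cols;
    double operator()(std::size_t i, std::size_t j) const {
        return data[i * cols + j];
    }
};

// Computes y = transpose(A) * x without materializing the transpose:
// element (j, i) of the transpose is read as A(i, j).
inline std::vector<double> transposed_matrix_vector_product(
        matrix_view A, const std::vector<double>& x) {
    std::vector<double> y(A.cols, 0.0);
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t j = 0; j < A.cols; ++j)
            y[j] += A(i, j) * x[i];
    return y;
}
```

A transposed view works the same way: it swaps the index order on access instead of copying data.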




#### C++23 Extended Floating Point Types

- std::float16\_t // IEEE-754-2008 binary16.
- std::float32\_t // IEEE-754-2008 binary32.
- std::float64\_t // IEEE-754-2008 binary64.
- std::float128\_t // IEEE-754-2008 binary128.
- std::bfloat16\_t // binary32 with 16 bits truncated.
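The "binary32 with 16 bits truncated" description of bfloat16 can be illustrated with a bit-level sketch. The helper names below are hypothetical, the code assumes `float` is IEEE binary32, and it truncates rather than rounds; it is not how `std::bfloat16_t` is specified to convert.

```cpp
#include <cstdint>
#include <cstring>

// Keep the upper 16 bits of a binary32 value (sign, 8 exponent bits,
// 7 mantissa bits). This truncates; it does not round to nearest.
inline std::uint16_t float_to_bfloat16_bits(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return static_cast<std::uint16_t>(bits >> 16);
}

// Widen back to binary32 by zero-filling the discarded low 16 bits.
inline float bfloat16_bits_to_float(std::uint16_t half) {
    std::uint32_t bits = static_cast<std::uint32_t>(half) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Because the exponent field is unchanged, the format keeps the dynamic range of binary32 and gives up only mantissa precision.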



#### Why does NVIDIA care about ISO C++?

#### What does NVIDIA hope to accomplish in ISO C++?

#### What is the relationship between ISO C++ and CUDA C++?





# Modern NVIDIA GPUs implement the C++ execution model.

#### We spent transistors to get there.

# WHY C++?

|                     | Relevant in the 80s/90s | Relevant Today |
|---------------------|-------------------------|----------------|
| Non-8-bit char      | ✓                       |                |
| Noncommittal sizeof |                         |                |
| Non-2's comp. int   |                         |                |
| Non-IEEE float      |                         |                |
| Segmented memory    |                         |                |
| Non-endian pointers |                         |                |

### WHY IS THIS NOT HELPFUL?

#### **FALSE CHOICES**

Most options are <u>dictated</u>. New CPU? Match AARCH64. New GPU? Match the host.

GPUs match all the hosts.

#### **BAD CHOICES**

Most alternatives are <u>bad</u>. Negligible area savings. Negligible power savings. Programmer surprise.

#### **IT'S TOO RISKY IN 2019**







### **1990-2000'S TOPOLOGY**









Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.

# Future silicon performance wins will come from architectural innovation, not transistor density scaling.

### What Makes C++ Portable?

The C++ Execution Model: Memory Model + Forward Progress

- Threads evaluate expressions that access and modify flat storage.
- Evaluation within a thread is driven by *sequenced-before* relations.
- Interactions between threads are driven by *synchronizes-with* relations.
- Forward progress promises eventual termination.
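The bullets above map directly onto a small standard C++ program. In this sketch (the function name `publish_and_read` is made up for the example), the write to `payload` is sequenced before the release store, the acquire load that observes `true` synchronizes with that store, and the consumer's loop relies on forward progress to terminate.

```cpp
#include <atomic>
#include <thread>

inline int publish_and_read() {
    int payload = 0;                 // ordinary, non-atomic storage
    std::atomic<bool> ready{false};
    int observed = -1;

    std::thread producer([&] {
        payload = 42;                                  // (1) sequenced before (2)
        ready.store(true, std::memory_order_release);  // (2)
    });
    std::thread consumer([&] {
        // (3) the acquire load that reads true synchronizes with (2)
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = payload;                            // (4) guaranteed to see 42
    });

    producer.join();
    consumer.join();
    return observed;
}
```

The same reasoning applies whether the two threads of execution run on CPU cores or, with the appropriate atomic types, on a GPU that implements this model.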








# Modern NVIDIA GPUs implement the C++ execution model.

#### We spent transistors to get there.







| GPU ARCH               | CUDA      | X86                                   | ARM & POWER                   |
|------------------------|-----------|---------------------------------------|-------------------------------|
| Tesla & Fermi          | 1+        | cudaMalloc & cudaMemcpy               |                               |
| Kepler & Maxwell       | 6+        | cudaMallocManaged<br>(Symmetric Heap) |                               |
| Pascal, Volta & Turing | 8+        | cudaMallocManaged<br>(paging)         | cudaMallocManaged<br>(NVLink) |
| "Tastes like memory."  | Linux HMM | malloc<br>(paging)                    | malloc<br>(NVLink)            |

### CONSISTENCY



Completely new hardware memory model in Volta, outline similar to POWER.

Everything but consume is accelerated. Stay tuned about consume.

See the PTX 6.0 ISA programming guide, chapter 8.


| Term                | Definition                                    |
|---------------------|-----------------------------------------------|
| thread of execution | A chain of evaluations in your code.          |
| execution agent     | A thing that runs your code.                  |
| thread              | A particularly onerous example of that thing. |
| GPU                 | Runs things that aren't onerous.              |



C++17 Clarification

| thread of execution runs on | Examples                                   |
|-----------------------------|--------------------------------------------|
| concurrent e.a.             | std::thread / main thread                  |
| parallel e.a.               | Volta thread / pool (Volta: # 5120-163840) |
| weakly parallel e.a.        | GPU / SIMD lane                            |

- Concurrent Forward Progress: The thread will make progress, regardless of whether other threads are making progress.
- Parallel Forward Progress: Once the thread has executed its first execution step, the thread will make progress.
- Weakly Parallel Forward Progress: The thread is not guaranteed to make progress.
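To see why these guarantees matter, consider a handshake in which each thread of execution spins until the other one acts. The sketch below (the `ping_pong` function is invented for this example) is safe on execution agents with concurrent forward progress, which implementations are encouraged to provide for `std::thread`. Under a weakly parallel guarantee, for example lanes of a SIMD loop, the same pattern could hang, because the agent being waited on is not guaranteed to run.

```cpp
#include <atomic>
#include <thread>

// Two threads hand a token back and forth by spinning on a shared atomic.
// Each spin loop only terminates if the *other* thread keeps running.
inline int ping_pong(int rounds) {
    std::atomic<int> turn{0};   // 0: ping's turn, 1: pong's turn
    int handoffs = 0;           // only touched by whoever holds the token

    auto player = [&](int me) {
        for (int i = 0; i < rounds; ++i) {
            while (turn.load(std::memory_order_acquire) != me) {
                std::this_thread::yield();
            }
            ++handoffs;                                    // protected by the token
            turn.store(1 - me, std::memory_order_release); // pass the token
        }
    };

    std::thread ping(player, 0);
    std::thread pong(player, 1);
    ping.join();
    pong.join();
    return handoffs;
}
```

With `rounds` set to 100, each thread takes the token 100 times, so the function returns 200 once both loops have terminated.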

## PROGRESS

Not "business as usual".

A concerted effort by dedicated engineers.

Volta+ is alone of its kind.



#### WARP IMPLEMENTATION



#### Volta

32 thread warp with independent scheduling


### PASCAL WARP EXECUTION MODEL



## **VOLTA WARP EXECUTION MODEL**

Synchronization may lead to interleaved scheduling!



## SCORECARD

| Problem                     | Disposition                                                              |
|-----------------------------|--------------------------------------------------------------------------|
| Memory Coherency            | Supported since Pascal                                                   |
| Memory Consistency          | Supported since Volta in PTX<br>cuda::std::atomic<> exposure forthcoming |
| Forward Progress Guarantees | Supported since Volta<br>Clarified in C++17                              |

# Modern NVIDIA GPUs implement the C++ execution model.

#### We spent transistors to get there.

## CUDA C++ IS A SUPERSET OF ISO C++

| Host processors can use alone                                     | All processors can use isolated                   | All processors can use together         |
|-------------------------------------------------------------------|---------------------------------------------------|-----------------------------------------|
| throw<br>catch<br>typeid<br>dynamic_cast<br>thread_local<br>std:: | virtual functions<br>function pointers<br>lambdas | \<rest of ISO C++\><br>cuda::std::†     |

† Coming in a future CUDA release.

#### **libcu++** The CUDA C++ Standard Library

Opt-in, heterogeneous, incremental C++ standard library for CUDA.

Open source; port of LLVM's libc++; contributing upstream.

Version 1 (CUDA 10.2): `<atomic>` (Pascal+), `<type_traits>`.

Version 2 (CUDA next): `atomic<T>::wait/notify`, `<barrier>`, `<latch>`, `<semaphore>` (all Volta+), `<chrono>`, `<ratio>`, `<functional>` minus `function`.

Future priorities: `atomic_ref<T>`, `<complex>`, `<tuple>`, `<array>`, `<utility>`, `<cmath>`, string processing, ...




libcu++ is the opt-in, heterogeneous, incremental CUDA C++ Standard Library.



## Opt-in

#### Does not interfere with or replace your host standard library.

// ISO C++, \_\_host\_\_ only.
#include <atomic>
std::atomic<int> x;

// CUDA C++, \_\_host\_\_ \_\_device\_\_.
// Strictly conforming to ISO C++.
#include <cuda/std/atomic>
cuda::std::atomic<int> x;

// CUDA C++, \_\_host\_\_ \_\_device\_\_.
// Conforming extensions to ISO C++.
#include <cuda/atomic>
cuda::atomic<int, cuda::thread\_scope\_block> x;




#### Heterogeneous

Copyable/Movable objects can migrate between host & device. Host & device can call all (member) functions.

Host & device can concurrently use synchronization primitives\*.

\*: Synchronization primitives must be in managed memory and be declared with cuda::thread\_scope\_system.



#### Incremental

#### Not a complete standard library today; each release will add more.





#### Based on LLVM's libc++

Forked from LLVM's libc++.

License: Apache 2.0 with LLVM Exception.

NVIDIA is already contributing back to the community:

Freestanding atomic<T>: reviews.llvm.org/D56913

C++20 synchronization library: reviews.llvm.org/D68480

