

# The Rocky Road To Tasking

March 21, 2019 | Ivo Kabadshow, Laura Morgenstern | Jülich Supercomputing Centre



Member of the Helmholtz Association





















### Requirements for MD

- Strong scalability
- Performance portability



### **Our Motivation**

Solving Coulomb problem for Molecular Dynamics

Task: Compute all pairwise interactions of *N* particles

N-body problem:  $\mathcal{O}(N^2) \rightarrow \mathcal{O}(N)$  with FMM

### Why is that an issue?

- MD targets < 1ms runtime per time step</li>
- MD runs millions or billions of time steps
- not compute-bound, but synchronization bound
- no libraries (like BLAS) to do the heavy lifting

We might have to look under the hood ... and get our hands dirty.



### **Parallelization Potential**



# **Parallelization Potential**



### **Coarse-Grained Parallelization**

![](_page_9_Figure_1.jpeg)

![](_page_9_Picture_2.jpeg)

### **Coarse-Grained Parallelization**

![](_page_10_Figure_1.jpeg)

- Different amount of available loop-level parallelism within each phase
- Some phases contain sub-dependencies
- Synchronizations might be problematic

![](_page_10_Picture_5.jpeg)

Multipole to multipole (M2M), shifting multipoles upwards

![](_page_11_Figure_2.jpeg)

![](_page_11_Picture_3.jpeg)

Multipole to multipole (M2M), shifting multipoles upwards

![](_page_12_Figure_2.jpeg)

![](_page_12_Picture_3.jpeg)

Multipole to local (M2L), translate remote multipoles into local taylor moments

![](_page_13_Figure_2.jpeg)

![](_page_13_Picture_3.jpeg)

Multipole to local (M2L), translate remote multipoles into local taylor moments

![](_page_14_Figure_2.jpeg)

![](_page_14_Picture_3.jpeg)

Local to local (L2L), shifting Taylor moments downwards

![](_page_15_Figure_2.jpeg)

![](_page_15_Picture_3.jpeg)

Local to local (L2L), shifting Taylor moments downwards

![](_page_16_Figure_2.jpeg)

![](_page_16_Picture_3.jpeg)

![](_page_17_Figure_1.jpeg)

![](_page_17_Picture_2.jpeg)

![](_page_18_Figure_1.jpeg)

![](_page_18_Picture_2.jpeg)

![](_page_19_Figure_1.jpeg)

![](_page_19_Picture_2.jpeg)

![](_page_20_Figure_1.jpeg)

![](_page_20_Picture_2.jpeg)

![](_page_21_Figure_2.jpeg)

![](_page_21_Picture_3.jpeg)

![](_page_22_Figure_2.jpeg)

![](_page_22_Picture_3.jpeg)

![](_page_23_Figure_2.jpeg)

![](_page_23_Picture_3.jpeg)

![](_page_24_Figure_2.jpeg)

![](_page_24_Picture_3.jpeg)

![](_page_25_Figure_2.jpeg)

- Tasks can be prioritized by task type
- Only ready-to-execute tasks are stored in queue
- Workstealing from other threads is possible

![](_page_25_Picture_6.jpeg)

# **Tasking Without Workstealing**

#### 103 680 Particles on 2×Intel Xeon E5-2680 v3 (2×12 cores)

![](_page_26_Figure_2.jpeg)

![](_page_26_Picture_3.jpeg)

# **Tasking With Workstealing**

#### 103 680 Particles on 2×Intel Xeon E5-2680 v3 (2×12 cores)

![](_page_27_Figure_2.jpeg)

![](_page_27_Picture_3.jpeg)

![](_page_28_Picture_0.jpeg)

# The Rocky Road To Tasking

March 21, 2019 | Ivo Kabadshow, Laura Morgenstern | Jülich Supercomputing Centre

![](_page_28_Picture_3.jpeg)

Member of the Helmholtz Association

Goal

- Provide same features as CPU tasking:
  - Static and dynamic load balancing
  - Priority queues
  - Ready-to-execute tasks

![](_page_29_Picture_6.jpeg)

![](_page_30_Picture_0.jpeg)

Uniform Programming Model for CPUs and GPUs

![](_page_30_Figure_2.jpeg)

![](_page_30_Picture_3.jpeg)

Uniform Programming Model for CPUs and GPUs

![](_page_31_Figure_2.jpeg)

![](_page_31_Picture_3.jpeg)

FPU

FPU

Uniform Programming Model for CPUs and GPUs

![](_page_32_Figure_2.jpeg)

EPIL EPIL EPIL EPIL

#### Uniform Programming Model for CPUs and GPUs

![](_page_33_Figure_2.jpeg)

run on

#### Memory L3 Cache Core Core Core Core Core Core L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache L2 Cache L1 Cache L1 Cache L1 Cache L1 Cache L1 Cache L1 Cache EPI FPU EPU EPH EPH EPH EPH EPH EPI EPH EPH EDI EDH CDU EDI

CPU

GPU

| Global Memory               |     |     |     |  |                             |     |     |     |  |                             |     |     |     |  |                             |     |     |     |  |
|-----------------------------|-----|-----|-----|--|-----------------------------|-----|-----|-----|--|-----------------------------|-----|-----|-----|--|-----------------------------|-----|-----|-----|--|
| Streaming<br>Multiprocessor |     |     |     |  |
| Shared Memory               |     |     |     |  | Shared Memory               |     |     |     |  | Shared Memory               |     |     |     |  | Shared Memory               |     |     |     |  |
| FPU                         | FPU | FPU | FPU |  | FPU                         | FPU | FPU | FPU |  | FPU                         | FPU | FPU | FPU |  | FPU                         | FPU | FPU | FPU |  |
| FPU                         | FPU | FPU | FPU |  | FPU                         | FPU | FPU | FPU |  | FPU                         | FPU | FPU | FPU |  | FPU                         | FPU | FPU | FPU |  |
| FPU                         | FPU | FPU | FPU |  | FPU                         | FPU | FPU | FPU |  | FPU                         | FPU | FPU | FPU |  | FPU                         | FPU | FPU | FPU |  |
| FPU                         | FPU | FPU | FPU |  | FPU                         | FPU | FPU | FPU |  | FPU                         | FPU | FPU | FPU |  | FPU                         | FPU | FPU | FPU |  |

![](_page_33_Picture_7.jpeg)

Many Persistent Thread Blocks

run on

![](_page_33_Picture_9.jpeg)

EPU

#### Uniform Programming Model for CPUs and GPUs

![](_page_34_Figure_2.jpeg)

![](_page_34_Picture_3.jpeg)

Streaming

Shared Memory

199

### **Pitfalls**

#### **Performance Portability**

Diverse GPU programming approaches:

- OpenCL
- CUDA
- SYCL

Our requirements:

- Strong subset of C++11
- Portability between GPU vendors
- Tasking features
- Maturity

### (Intermediate) Solution

Use CUDA for reasons of performance, specific tasking features and maturity. Take the loss of not being portable out of the box.

![](_page_35_Picture_13.jpeg)

### **Pitfalls**

**Performance Portability** 

For performance portability we consider diverse GPU programming approaches:

- OpenCL
- CUDA
- SYCL

### Unsatisfying (Intermediate) Solution

Use CUDA for reasons of performance and specific features. Take the loss of not being portable out of the box.

![](_page_36_Picture_8.jpeg)

![](_page_37_Picture_0.jpeg)

**Architectural Differences** 

### Pitfalls for Load Balancing

- No thread pinning
- No cache coherency

### Pitfalls for Mutual Exclusion

- Weak memory consistency
- Missing forward progress guarantees

![](_page_37_Picture_8.jpeg)

![](_page_38_Picture_0.jpeg)

Load Balancing

- No possibility to pin threads to streaming multiprocessors
- No direct access to shared memory of other streaming multiprocessors
- Work stealing requires multi-producer multi-consumer queues  $\rightarrow$  Mechanism for mutual exclusion?

![](_page_38_Picture_5.jpeg)

![](_page_39_Picture_0.jpeg)

**Mutual Exclusion** 

- Weak memory consistency
- Warp-synchronous deadlocks due to lock step
- How to prove thread safety?

![](_page_39_Picture_5.jpeg)

### **Pitfalls**

```
Mutex Implementation
class Mutex
{
    __inline___device__ void lock()
    {
        while (atomicCAS(\&mutex, 0, 1) != 0)
        ____threadfence();
    };
    ___inline____device___void unlock()
    {
        threadfence();
        atomicExch(&mutex, 0);
    };
    int mutex = 0;
```

```
JÜLICH
Forschungszentrum
```

};

### **Very First Evaluation**

Conditions

- Tasking with global queue only
- Measurements without work load to determine enqueue and dequeue overhead
- Measurements on P100 with 56 thread blocks with 1024 threads each
- Measurements on V100 with 80 thread blocks with 1024 threads each

![](_page_41_Picture_6.jpeg)

### **First Evaluation**

### Tasking Overhead on P100 and V100

![](_page_42_Figure_2.jpeg)

![](_page_42_Picture_3.jpeg)

Conclusion

- Fine-grained task parallelism pays off on CPUs
- Developed mapping between CPU and GPU concepts
- (Partly) overcome pitfalls:
  - Lock-based mutual exclusion
  - Reusability of CPU tasking code
  - Architectural differences between CPU and GPU
- Successfully transferred parts of CPU tasking to GPUs

![](_page_43_Picture_9.jpeg)

### **Next Steps**

- Analyze and solve performance issues in dependency resolution
- Use memory pool for dynamic allocations
- Implement hierarchical queues
- Transfer priority queue to GPU
- Exploit data-parallelism through warps
- Consider the use of lock-free data structures
- Implement FMM based on GPU tasking

![](_page_44_Picture_8.jpeg)

![](_page_44_Picture_9.jpeg)

### **Thank You to Our Sponsor!**

#### NVIDIA Tesla V100 and NVIDIA Tesla P100 where provided by

![](_page_45_Picture_2.jpeg)

![](_page_45_Picture_3.jpeg)

![](_page_46_Picture_0.jpeg)

# The Rocky Road To Tasking

March 21, 2019 | Ivo Kabadshow, Laura Morgenstern | Jülich Supercomputing Centre

![](_page_46_Picture_3.jpeg)

Member of the Helmholtz Association