

### **CUDA 4.0**



## The 'Super' Computing Company: From Super Phones to Super Computers


## **CUDA 4.0 for Broader Developer Adoption**



## **CUDA 4.0**

Application Porting Made Simpler

- Rapid Application Porting: Unified Virtual Addressing
- Faster Multi-GPU Programming: NVIDIA GPUDirect<sup>™</sup> 2.0
- Easier Parallel Programming in C++: Thrust

© NVIDIA Corporation 2011

## **CUDA 4.0: Highlights**



## **Easier Porting of Existing Applications**

### Share GPUs across multiple threads

- Easier porting of multi-threaded apps: pthreads / OpenMP threads can share a GPU
- Launch concurrent kernels from different host threads
  - Eliminates context switching overhead
- New, simple context management APIs
  - Old context migration APIs still supported

### Single thread access to all GPUs

- Each host thread can now access all GPUs in the system
  - One thread per GPU limitation removed
- Easier than ever for applications to take advantage of multi-GPU
  - Single-threaded applications can now benefit from multiple GPUs
  - Easily coordinate work across multiple GPUs (e.g. halo exchange)
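A minimal sketch of single-thread access to all GPUs, assuming the CUDA runtime API: one host thread switches devices with `cudaSetDevice()`, queues a kernel on each, and then collects the results. The kernel `tagWithDevice` and the helper `runOnAllGpus` are hypothetical names invented for this illustration, and error paths leak memory for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every element records which GPU processed it.
__global__ void tagWithDevice(int *out, int n, int device)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = device;
}

// Drives every GPU in the system from the calling host thread.
// Returns the number of GPUs whose buffer came back correct, or -1 on error.
int runOnAllGpus(int n)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    std::vector<int *> buffers(count, nullptr);
    const size_t bytes = n * sizeof(int);

    // Phase 1: queue work on each GPU. Kernel launches are asynchronous,
    // so the loop moves on to the next device without waiting.
    for (int dev = 0; dev < count; ++dev) {
        if (cudaSetDevice(dev) != cudaSuccess) return -1;
        if (cudaMalloc(&buffers[dev], bytes) != cudaSuccess) return -1;
        tagWithDevice<<<(n + 255) / 256, 256>>>(buffers[dev], n, dev);
    }

    // Phase 2: revisit each GPU, copy its result back, and verify it.
    int verified = 0;
    std::vector<int> host(n);
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        // The blocking cudaMemcpy waits for the kernel queued above.
        cudaError_t err = cudaMemcpy(host.data(), buffers[dev], bytes,
                                     cudaMemcpyDeviceToHost);
        bool ok = (err == cudaSuccess);
        for (int i = 0; ok && i < n; ++i) ok = (host[i] == dev);
        if (ok) ++verified;
        cudaFree(buffers[dev]);
    }
    return verified;
}
```

Calling `runOnAllGpus(1024)` from an otherwise single-threaded program should return the number of GPUs in the system; the same two-phase pattern is a starting point for coordinating multi-GPU work such as a halo exchange.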

## **No-copy Pinning of System Memory**

- Reduce system memory usage and CPU memcpy() overhead
  - Easier to add CUDA acceleration to existing applications
  - Just register malloc'd system memory for async operations and then call cudaMemcpy() as usual

| Before No-copy Pinning                                     | With No-copy Pinning                                       |
|------------------------------------------------------------|------------------------------------------------------------|
| Extra allocation and extra copy required                   | Just register and go!                                      |
| malloc(a)                                                  |                                                            |
| cudaMallocHost(b)                                          |                                                            |
| memcpy(b, a)                                               | cudaHostRegister(a)                                        |
| cudaMemcpy() to GPU, launch kernels, cudaMemcpy() from GPU | cudaMemcpy() to GPU, launch kernels, cudaMemcpy() from GPU |
| memcpy(a, b)                                               |                                                            |
| cudaFreeHost(b)                                            | cudaHostUnregister(a)                                      |



- Supported on all CUDA-capable GPUs on Linux or Windows
- Requires Linux kernel 2.6.15+ (RHEL 5)
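The "register and go" flow can be sketched as follows, assuming a toolkit whose `cudaHostRegister()` accepts an arbitrary `malloc`'d pointer together with the `cudaHostRegisterDefault` flag. The kernel `scale` and the helper `scaleInPlace` are hypothetical names for illustration. The sketch uses `cudaMemcpyAsync()` on a stream to show the asynchronous transfers that pinning enables; plain `cudaMemcpy()` works the same way.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element by a constant.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Scales an existing host buffer on the GPU without a staging copy.
// hostData is ordinary system memory owned by the caller (e.g. from malloc).
bool scaleInPlace(float *hostData, int n, float factor)
{
    const size_t bytes = n * sizeof(float);

    // Pin the caller's allocation in place: no cudaMallocHost(), no memcpy().
    if (cudaHostRegister(hostData, bytes, cudaHostRegisterDefault) != cudaSuccess)
        return false;

    float *dev = nullptr;
    cudaStream_t stream = nullptr;
    cudaError_t err = cudaStreamCreate(&stream);
    if (err == cudaSuccess) err = cudaMalloc(&dev, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpyAsync(dev, hostData, bytes, cudaMemcpyHostToDevice, stream);
    if (err == cudaSuccess) {
        scale<<<(n + 255) / 256, 256, 0, stream>>>(dev, n, factor);
        err = cudaMemcpyAsync(hostData, dev, bytes, cudaMemcpyDeviceToHost, stream);
    }
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

    // Clean up, then unpin; the buffer stays valid for the caller.
    cudaFree(dev);
    if (stream) cudaStreamDestroy(stream);
    cudaHostUnregister(hostData);
    return err == cudaSuccess;
}
```

The caller keeps ownership of the buffer throughout, which is what makes this approach convenient when adding CUDA acceleration to an application that already manages its own memory.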

## **New CUDA C/C++ Language Features**

### C++ new/delete

- Dynamic memory management

### C++ virtual functions

- Easier porting of existing applications
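A small sketch of both features together, assuming a Fermi-class or newer GPU and a compiler in C++11 mode or later. The `Shape`, `Square`, and `Rect` types and the `computeAreas` helper are hypothetical and exist only for this illustration. The objects are created inside the kernel because objects with virtual functions need to be constructed in device code to be usable there.

```cuda
#include <cuda_runtime.h>

// Hypothetical class hierarchy used only for this illustration.
struct Shape {
    __device__ virtual float area() const = 0;
    __device__ virtual ~Shape() {}
};

struct Square : Shape {
    float side;
    __device__ explicit Square(float s) : side(s) {}
    __device__ float area() const override { return side * side; }
};

struct Rect : Shape {
    float w, h;
    __device__ Rect(float w_, float h_) : w(w_), h(h_) {}
    __device__ float area() const override { return w * h; }
};

__global__ void areas(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Device-side new: each thread allocates its own object on the device heap.
    Shape *s = nullptr;
    if (i % 2 == 0) s = new Square(float(i));
    else            s = new Rect(float(i), 2.0f);

    // Defensive check in case the device heap allocation failed.
    if (s == nullptr) { out[i] = -1.0f; return; }

    out[i] = s->area();   // virtual dispatch on the GPU
    delete s;             // device-side delete
}

// Host wrapper: fills hostOut[0..n) with the computed areas.
bool computeAreas(float *hostOut, int n)
{
    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return false;
    areas<<<(n + 127) / 128, 128>>>(dev, n);
    cudaError_t err = cudaMemcpy(hostOut, dev, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess;
}
```

Per-thread `new` is used here for clarity rather than speed; existing C++ code that relies on dynamic allocation and polymorphism can be ported this way first and optimized afterwards.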

### Inline PTX

- Enables assembly-level optimization
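As a minimal illustration of inline PTX, the sketch below reads the `%laneid` special register with an `asm()` statement. The function names `laneId`, `writeLanes`, and `lanesMatch` are hypothetical, and the check assumes a warp size of 32 and a one-dimensional thread block.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Reads the PTX special register %laneid (this thread's index within its warp).
// "%%" escapes the percent sign; "=r" binds a 32-bit register output operand.
__device__ unsigned int laneId()
{
    unsigned int id;
    asm("mov.u32 %0, %%laneid;" : "=r"(id));
    return id;
}

__global__ void writeLanes(unsigned int *out)
{
    out[threadIdx.x] = laneId();
}

// Launches one block of 64 threads and checks lane == threadIdx.x % 32.
bool lanesMatch()
{
    const int threads = 64;
    unsigned int *dev = nullptr;
    if (cudaMalloc(&dev, threads * sizeof(unsigned int)) != cudaSuccess) return false;

    writeLanes<<<1, threads>>>(dev);

    std::vector<unsigned int> host(threads);
    cudaError_t err = cudaMemcpy(host.data(), dev, threads * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < threads; ++i)
        if (host[i] != static_cast<unsigned int>(i % 32)) return false;
    return true;
}
```

The same `asm()` form, with input operands added after a second colon, can wrap individual PTX instructions where hand-tuning is worthwhile.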

### C++ Templatized Algorithms & Data Structures (Thrust)

- Powerful open source C++ parallel algorithms & data structures
- Similar to C++ Standard Template Library (STL)
- Automatically chooses the fastest code path at compile time
- Divides work between GPUs and multi-core CPUs
- Parallel sorting 5x to 100x faster

| Data Structures       | Algorithms             |
|-----------------------|------------------------|
| thrust::device_vector | thrust::sort           |
| thrust::host_vector   | thrust::reduce         |
| thrust::device_ptr    | thrust::exclusive_scan |
| Etc.                  | Etc.                   |
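A short sketch of the STL-like style, using only the containers and algorithms listed above. The helper name `sortAndSum` is hypothetical.

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>

// Sorts the values on the GPU, writes the sorted data back to the host
// vector, and returns the sum of all elements.
int sortAndSum(thrust::host_vector<int> &values)
{
    // Constructing a device_vector from a host_vector copies host -> device.
    thrust::device_vector<int> d = values;

    thrust::sort(d.begin(), d.end());                   // parallel sort on the GPU
    int total = thrust::reduce(d.begin(), d.end(), 0);  // parallel sum on the GPU

    // Copy the sorted data device -> host.
    thrust::copy(d.begin(), d.end(), values.begin());
    return total;
}
```

No kernels or explicit `cudaMemcpy()` calls appear in user code; the containers handle allocation and transfers, and the algorithms dispatch to the GPU because they are given device iterators.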

## **GPU-Accelerated Image Processing**

### **NVIDIA Performance Primitives (NPP) library**

- 10x to 36x faster image processing
- Initial focus on imaging and video related primitives








Set, Convert, CopyConstBorder, Copy, Transpose, SwapChannels

#### Color Conversion

RGB To YCbCr (& vice versa), ColorTwist, LUT\_Linear

#### Threshold & Compare Ops

Threshold, Compare

#### Statistics

Mean, StdDev, NormDiff, MinMax, Histogram, SqrIntegral, RectStdDev

#### Filter Functions

FilterBox, Row, Column, Max, Min, Median, Dilate, Erode, SumWindowColumn/Row

#### Geometry Transforms

Mirror, WarpAffine / Back / Quad, WarpPerspective / Back / Quad, Resize

#### Arithmetic & Logical Ops

Add, Sub, Mul, Div, AbsDiff

#### JPEG

DCTQuantInv/Fwd, QuantizationTable



## Layered Textures – Faster Image Processing

Ideal for processing multiple textures with same size/format

- Large sizes supported on Tesla T20 (Fermi) GPUs (up to 16k x 16k x 2k)
- e.g. Medical Imaging, Terrain Rendering (flight simulators), etc.

### Faster Performance

- Reduced CPU overhead: single binding for entire texture array
- Faster than 3D Textures: more efficient filter caching
- **Fast interop with OpenGL / Direct3D for each layer**
- No need to create/manage a texture atlas

### **No sampling artifacts**

Linear/Bilinear filtering applied only within a layer

## **CUDA 4.0: Highlights**

### Easier Parallel Application Porting

- Share GPUs across multiple threads
- Single thread access to all GPUs
- No-copy pinning of system memory
- New CUDA C/C++ features
- Thrust templated primitives library
- NPP image/video processing library
- Layered Textures

### Faster Multi-GPU Programming

- NVIDIA GPUDirect<sup>™</sup> v2.0
  - Peer-to-Peer Access
  - Peer-to-Peer Transfers
- Unified Virtual Addressing

### New & Improved Developer Tools

- Auto Performance Analysis
- C++ Debugging
- GPU Binary Disassembler
- cuda-gdb for MacOS

### **NVIDIA GPUDirect™:** *Towards Eliminating the CPU Bottleneck*



- Direct access to GPU memory for 3<sup>rd</sup>-party devices
- Eliminates unnecessary system memory copies & CPU overhead
- Supported by Mellanox and QLogic
- Up to 30% improvement in communication performance

- Peer-to-Peer memory access, transfers & synchronization
- Less code, higher programmer productivity

### **Before NVIDIA GPUDirect™ v2.0**

### Required Copy into Main Memory



#### Two copies required:

1. cudaMemcpy(GPU2, sysmem)
2. cudaMemcpy(sysmem, GPU1)

## NVIDIA GPUDirect<sup>™</sup> v2.0: Peer-to-Peer Communication

**Direct Transfers between GPUs** 



Only one copy required:

1. cudaMemcpy(GPU2, GPU1)

### **GPUDirect v2.0: Peer-to-Peer Communication**

- Direct communication between GPUs
  - Faster: no system memory copy overhead
  - More convenient multi-GPU programming

### Direct Transfers

- Copy from GPU0 memory to GPU1 memory
- Works transparently with UVA

### Direct Access

GPU0 reads or writes GPU1 memory (load/store)

- Supported on Tesla 20-series and other Fermi GPUs
- 64-bit applications on Linux and Windows TCC
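A hedged sketch of a direct transfer, assuming at least two GPUs and using devices 0 and 1. The helper `peerCopyDemo` is a hypothetical name. Peer access is enabled only when `cudaDeviceCanAccessPeer()` reports it as possible; `cudaMemcpyPeer()` is still expected to work otherwise, but the copy may then be staged through system memory.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Copies a buffer from GPU 0 to GPU 1 and verifies the contents.
// Returns 1 if the copy was verified, 0 on failure,
// and -1 if fewer than two GPUs are present.
int peerCopyDemo(size_t bytes)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) return -1;

    const int src = 0, dst = 1;
    unsigned char *srcBuf = nullptr, *dstBuf = nullptr;

    // Allocate and fill the source buffer on GPU 0.
    cudaSetDevice(src);
    if (cudaMalloc(&srcBuf, bytes) != cudaSuccess) return 0;
    cudaMemset(srcBuf, 0x5A, bytes);
    cudaDeviceSynchronize();

    // Allocate the destination buffer on GPU 1.
    cudaSetDevice(dst);
    if (cudaMalloc(&dstBuf, bytes) != cudaSuccess) {
        cudaFree(srcBuf);
        return 0;
    }

    // Enable direct access from the current device (dst) to src, if supported.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, dst, src);
    if (canAccess) cudaDeviceEnablePeerAccess(src, 0);

    // One copy, GPU to GPU, with no explicit staging buffer in user code.
    cudaError_t err = cudaMemcpyPeer(dstBuf, dst, srcBuf, src, bytes);

    // Verify on the host.
    std::vector<unsigned char> host(bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), dstBuf, bytes, cudaMemcpyDeviceToHost);
    bool ok = (err == cudaSuccess);
    for (size_t i = 0; ok && i < bytes; ++i) ok = (host[i] == 0x5A);

    if (canAccess) cudaDeviceDisablePeerAccess(src);
    cudaFree(dstBuf);
    cudaFree(srcBuf);
    return ok ? 1 : 0;
}
```

With peer access enabled, a kernel running on the destination GPU could also dereference `srcBuf` directly, which is the load/store "direct access" case described above.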

### Unified Virtual Addressing: Easier to Program with Single Address Space

### No UVA: Multiple Memory Spaces

Figure: system memory, GPU0 memory, and GPU1 memory each span their own 0x0000 to 0xFFFF address range; the CPU, GPU0, and GPU1 are connected over PCI-e.

### **UVA: Single Address Space**



## **Unified Virtual Addressing**

One address space for all CPU and GPU memory

- Determine physical memory location from pointer value
- Enables libraries to simplify their interfaces (e.g. cudaMemcpy)

| Before UVA                                                                                           | With UVA                                                              |
|------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|
| Separate options for each permutation                                                                | One function handles all cases                                        |
| cudaMemcpyHostToHost<br>cudaMemcpyHostToDevice<br>cudaMemcpyDeviceToHost<br>cudaMemcpyDeviceToDevice | cudaMemcpyDefault<br>(data location becomes an implementation detail) |



Supported on Tesla 20-series and other Fermi GPUs

64-bit applications on Linux and Windows TCC
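The sketch below shows both points under stated assumptions: `cudaMemcpyDefault` lets the runtime infer the copy direction from the pointer values, and `cudaPointerGetAttributes()` reports where a pointer lives. It assumes CUDA 10 or newer, where the attribute field is named `type`. The helpers `isDevicePointer` and `uvaRoundTrip` are hypothetical names.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Returns true if the runtime reports p as device memory.
// Pointers unknown to CUDA (e.g. plain malloc'd memory) yield false.
bool isDevicePointer(const void *p)
{
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, p) != cudaSuccess) {
        cudaGetLastError();   // clear the error reported by older toolkits
        return false;
    }
    return attr.type == cudaMemoryTypeDevice;
}

// Copies src -> device -> dst using cudaMemcpyDefault for both directions.
bool uvaRoundTrip(const float *src, float *dst, int n)
{
    const size_t bytes = n * sizeof(float);
    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    // The direction is inferred from the pointer values, not stated by the caller.
    cudaError_t err = cudaMemcpy(dev, src, bytes, cudaMemcpyDefault);
    if (err == cudaSuccess) err = cudaMemcpy(dst, dev, bytes, cudaMemcpyDefault);

    // The physical location can be determined from the pointer alone.
    bool located = isDevicePointer(dev) && !isDevicePointer(dst);

    cudaFree(dev);
    return err == cudaSuccess && located;
}
```

A library built this way can expose a single copy routine that accepts any mix of host and device pointers, which is the interface simplification the table above describes.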


## **Automated Performance Analysis in Visual Profiler**

### Summary analysis & hints

- Device
- Context
- Kernel

### New UI for kernel analysis

- Identify limiting factor
- Analyze instruction throughput
- Analyze memory throughput
- Analyze kernel occupancy

Screenshot (Visual Profiler Analysis view): Instruction Throughput Analysis for kernel convolutionColumnsKernel on device GeForce GTX 480

- IPC: 1.56
- Maximum IPC: 2
- Divergent branches (%): 0.00
- Control flow divergence (%): 0.03
- Replayed instructions (%): 29.65
  - Global memory replay (%): 0.00
  - Local memory replays (%): 0.00
  - Shared bank conflict replay (%): 26.38
- Shared memory bank conflict per shared memory instruction (%): 99.90

Hint(s)

- The kernel is compute bound; to reduce instruction count:
  - Understand the instruction mix, as single precision floating point, transcendentals, etc. have different throughput
  - Floating-point literals without an f suffix are double precision
  - Try using arithmetic intrinsic functions
  - Try using compiler flags (-ftz=true, -prec-div=false, -prec-sqrt=false, etc.) to get higher performance, but this may result in some precision loss
  - Refer to the "Arithmetic Instructions" section in the "Performance Guidelines" chapter of the CUDA C Programming Guide for more details
- Shared memory bank conflicts are high, which causes serialization of threads within a warp. Shared memory bank conflicts can be reduced by:
  - Using appropriate padding for data stored in shared memory so that each thread in a warp accesses data from a different bank
  - Rearranging data in shared memory, thus changing the access pattern
  - Refer to the "Shared Memory" section in the "Performance Guidelines" chapter of the CUDA C Programming Guide for more details

Other panels visible in the screenshot: Factors that may affect analysis; Limiting Factor Identification; a results table with columns GPU Timestamp (us), GPU Time (us), shared load, and shared store.
| for more deta<br>• Shared men<br>conflicts can b<br>• Using<br>differ<br>• Rearr<br>Refer to the "<br>more details.<br>Factors that may a<br>Limiting Factor<br>Identification<br>Memory Throughput<br>Analysis                           | is.<br>nory bank conflicts are high v<br>ereduced by<br>appropriate padding for data store<br>ent bank;<br>anging data in shared memory, th<br>Shared Memory" section in the "Pe<br>iffect analysis<br>Show all columns<br>GPU Timestamp (us)<br>1 38718<br>2 41989.6<br>2 41989.6 | which causes serialization<br>red in shared memory so<br>us changing access patt<br>erformance Guidelines" c<br>GPU Time (us)<br>1652.96                       | n of threads within a warp<br>that each thread in a war<br>ern;<br>hapter of the CUDA C Pro<br>shared load<br>Type:SM Run:4<br>334560                     | Shared memory bank<br>rp accesses data from a<br>gramming Guide for<br>shared store<br>Type:SM Run:4<br>24600                   |  |
| for more deta<br>• Shared men<br>conflicts can b<br>• Using<br>differ<br>• Rearr<br>Refer to the "<br>more details.<br>Factors that may a<br>Limiting Factor<br>Identification<br>Memory Throughput                                       | is.<br>nory bank conflicts are high v<br>ereduced by<br>appropriate padding for data store<br>ent bank;<br>anging data in shared memory, th<br>Shared Memory" section in the "Pe<br>iffect analysis<br>Show all columns<br>GPU Timestamp (us)<br>1 38718<br>2 41989.6<br>2 41989.6 | which causes serialization<br>red in shared memory so<br>us changing access patt<br>erformance Guidelines" c<br>GPU Time (us)<br>1652.96<br>1652.86            | h of threads within a warp<br>that each thread in a war<br>ern;<br>hapter of the CUDA C Pro<br>shared load<br>Type:SM Run:4<br>334560<br>334560           | shared store<br>Type:SM Run:4<br>24600<br>24600                                                                                 |  |
| for more deta<br>• Shared men<br>conflicts can b<br>• Using<br>differ<br>• Rearr<br>Refer to the "<br>more details.<br>Factors that may a<br>Limiting Factor<br>Identification<br>Memory Throughput<br>Analysis<br>Instruction throughput | is.<br>nory bank conflicts are high v<br>appropriate padding for data store<br>ent bank;<br>anging data in shared memory, th<br>Shared Memory' section in the "Per<br>iffect analysis<br>Show all columns<br>GPU Timestamp (us)<br>1 38718<br>2 41989.6<br>3 44507.4               | which causes serialization<br>red in shared memory so<br>us changing access patt<br>erformance Guidelines" c<br>GPU Time (us)<br>1652.96<br>1652.86<br>1652.93 | h of threads within a warp<br>that each thread in a war<br>ern;<br>hapter of the CUDA C Pro<br>shared load<br>Type:SM Run:4<br>334560<br>334560<br>334560 | shared memory bank<br>rp accesses data from a<br>gramming Guide for<br>shared store<br>Type:SM Run:4<br>24600<br>24600<br>24600 |  |
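
The profiler's padding advice (pad shared-memory data so that each thread in a warp accesses a different bank) can be illustrated with a small host-side model. This is a minimal sketch, not NVIDIA code: it assumes 32 banks of 4-byte words (Fermi-class shared memory) and a row-major `float` tile, and the names `bankOf` and `conflictDegree` are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 32 banks of 4-byte words, as on Fermi-class GPUs.
constexpr std::size_t kBanks = 32;
constexpr std::size_t kWarpSize = 32;

// Bank holding the 4-byte word at `wordIndex` in a shared-memory array.
std::size_t bankOf(std::size_t wordIndex) { return wordIndex % kBanks; }

// Models one warp reading a column of a row-major float tile:
// thread `lane` reads tile[lane][column], where each row is `rowPitch` words.
// Returns the largest number of threads that land in the same bank
// (1 means conflict-free, 32 means fully serialized).
std::size_t conflictDegree(std::size_t rowPitch, std::size_t column) {
    assert(column < rowPitch);
    std::size_t hits[kBanks] = {};
    std::size_t worst = 0;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        std::size_t bank = bankOf(lane * rowPitch + column);
        ++hits[bank];
        if (hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}
```

Under these assumptions, `conflictDegree(32, 0)` evaluates to 32 (every thread hits bank 0), while `conflictDegree(33, 0)` evaluates to 1. In a kernel, the padded case corresponds to declaring the tile as `__shared__ float tile[32][33]` instead of `[32][32]`.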

## **New Features in cuda-gdb**

#### Now available for both Linux and MacOS

*[Screenshot: a cuda-gdb session running inside the DDD front end, debugging `templates.cu`.]* Callouts on the screenshot:

- `info cuda threads` output is automatically updated in DDD
- Breakpoints on all instances of templated functions
- Fermi disassembly (cuobjdump)
- C++ symbols shown in stack trace view

In the session, `break templates.cu:12` sets two breakpoints, one in `my_class<int>::my_function(int)` and one in `my_class<float>::my_function(float)`, and `where` reports `my_class<float>::my_function` called from `my_kernel<int, float><<<(2,1,1),(2,1,1)>>>` at `templates.cu:29`.

#### Details @ http://developer.nvidia.com/object/cuda-gdb.html


## cuda-gdb Now Available for MacOS



Details @ http://developer.nvidia.com/object/cuda-gdb.html

## **NVIDIA Parallel Nsight<sup>™</sup> Pro 1.5**



| Feature                                                  | Professional |
|----------------------------------------------------------|--------------|
| CUDA Debugging                                           | ✓            |
| Compute Analyzer                                         | ✓            |
| CUDA / OpenCL Profiling                                  | ✓            |
| Tesla Compute Cluster (TCC) Debugging                    | ✓            |
| Tesla Support: C1050/S1070 or higher                     | ✓            |
| Quadro Support: G9x or higher                            | ✓            |
| Windows 7, Vista and HPC Server 2008                     | ✓            |
| Visual Studio 2008 SP1 and Visual Studio 2010            | ✓            |
| OpenGL and OpenCL Analyzer                               | ✓            |
| DirectX 10 & 11 Analyzer, Debugger & Graphics Inspector  | ✓            |
| GeForce Support: 9 series or higher                      | ✓            |

*[Image: Microsoft Visual Studio<sup>®</sup> 2010 box, "Includes MSDN Subscription".]*

## **CUDA Registered Developer Program**

All GPGPU developers should become NVIDIA Registered Developers

### **Benefits include:**

- Early Access to Pre-Release Software
  - Beta software and libraries
  - CUDA 4.0 Release Candidate available now
- Submit & Track Issues and Bugs
  - Interact directly with NVIDIA QA engineers

### New benefits in 2011

- **Exclusive Q&A Webinars with NVIDIA Engineering**
- **Exclusive deep dive CUDA training webinars**
- In-depth engineering presentations on pre-release software

Sign up Now: www.nvidia.com/ParallelDeveloper

### Additional Information...

- CUDA Features Overview
- CUDA Developer Resources from NVIDIA
- CUDA 3<sup>rd</sup> Party Ecosystem
- PGI CUDA x86
- GPU Computing Research & Education
- NVIDIA Parallel Developer Program
- GPU Technology Conference 2011

## **CUDA Features Overview**

|                    | Platform                                                                                                                                                                                                                                                                                                                                                                     | Programming Model                                                                                                                                                                                                                                                                                      | Parallel Libraries                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | Development Tools                                                                                                                                                                                                                                                                                                                                                                        |
|--------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| New in<br>CUDA 4.0 | <b>GPUDirect<sup>tm</sup> (v 2.0)</b><br>Peer-Peer Communication                                                                                                                                                                                                                                                                                                             | Unified Virtual Addressing<br>C++ new/delete<br>C++ Virtual Functions                                                                                                                                                                                                                                  | Thrust C++ Library<br>Templated Performance<br>Primitives Library                                                                                                                                                                                                                                                                                                                                                                                                                  | Parallel Nsight Pro 1.5                                                                                                                                                                                                                                                                                                                                                                  |
|                    | <b>Hardware Features</b><br>ECC Memory<br>Double Precision<br>Native 64-bit Architecture<br>Concurrent Kernel Execution<br>Dual Copy Engines<br>6GB per GPU supported<br><b>Operating System Support</b><br>MS Windows 32/64<br>Linux 32/64<br>Mac OS X 32/64<br><b>Designed for HPC</b><br>Cluster Management<br>GPUDirect<br>Tesla Compute Cluster (TCC)<br>Multi-GPU support | <b>C support</b><br>NVIDIA C Compiler<br>CUDA C Parallel Extensions<br>Function Pointers<br>Recursion<br>Atomics<br>malloc/free<br><b>C++ support</b><br>Classes/Objects<br>Class Inheritance<br>Polymorphism<br>Operator Overloading<br>Class Templates<br>Function Templates<br>Virtual Base Classes<br>Namespaces | <b>NVIDIA Library Support</b><br>Complete math.h<br>Complete BLAS Library (1, 2 and 3)<br>Sparse Matrix Math Library<br>RNG Library<br>FFT Library (1D, 2D and 3D)<br>Video Decoding Library (NVCUVID)<br>Video Encoding Library (NVCUVENC)<br>Image Processing Library (NPP)<br>Video Processing Library (NPP) | <b>NVIDIA Developer Tools</b><br>Parallel Nsight for MS Visual Studio<br>cuda-gdb Debugger with multi-GPU support<br>CUDA/OpenCL Visual Profiler<br>CUDA Memory Checker<br>CUDA Disassembler<br>GPU Computing SDK<br>NVML<br>CUPTI<br><b>3rd Party Developer Tools</b><br>Allinea DDT<br>RogueWave / TotalView<br>Vampir<br>Tau<br>CAPS HMPP |

## **CUDA Developer Resources from NVIDIA**

#### Development Tools

- CUDA Toolkit: Complete GPU computing development kit
- cuda-gdb: GPU hardware debugging
- cuda-memcheck: Identifies memory errors
- cuobjdump: CUDA binary disassembler
- Visual Profiler: GPU hardware profiler for CUDA C and OpenCL
- Parallel Nsight Pro: Integrated development environment for Visual Studio



#### SDKs and Code Samples

- GPU Computing SDK: CUDA C/C++, DirectCompute, OpenCL code samples and documentation

#### **Books**

- CUDA by Example
- GPU Computing Gems
- Programming Massively Parallel Processors
- Many more...

#### **Optimization Guides**

Best Practices for GPU computing and graphics development



#### Libraries and Engines

- Math Libraries: CUFFT, CUBLAS, CUSPARSE, CURAND, math.h
- 3<sup>rd</sup> Party Libraries: CULA LAPACK, VSIPL
- NPP Image Libraries: Performance primitives for imaging
- App Acceleration Engines: Ray Tracing (Optix, iRay)
- Video Encoding / Decoding: NVCUVENC / NVCUVID



## CUDA 3<sup>rd</sup> Party Ecosystem

#### **Cluster Tools**

Cluster Management

- Platform HPC
- Platform Symphony
- Bright Cluster Manager
- Ganglia Monitoring System
- Moab Cluster Suite
- Altair PBS Pro

#### **Job Scheduling**

- Altair PBSpro
- TORQUE
- Platform LSF

#### **MPI Libraries**

Coming soon...

#### Parallel Language Solutions & APIs

- PGI CUDA Fortran
- PGI Accelerator (C/Fortran)
- PGI CUDA x86
- CAPS HMPP
- pyCUDA (Python)
- Tidepowerd GPU.NET (C#)
- JCuda (Java)
- Khronos OpenCL
- Microsoft DirectCompute

#### **3rd Party Math Libraries**

- CULA Tools (EM Photonics)
- MAGMA Heterogeneous LAPACK
- IMSL (Rogue Wave)
- VSIPL (GPU VSIPL)
- NAG

### Parallel Tools

#### Parallel Debuggers

- MS Visual Studio with Parallel Nsight Pro
- Allinea DDT Debugger
- TotalView Debugger

#### Parallel Performance Tools

- ParaTools VampirTrace
- TauCUDA Performance Tools
- PAPI
- HPC Toolkit

### **Compute Platform Providers**

#### Cloud Providers

- Amazon EC2
- Peer 1

#### OEMs

- Dell
- HP
- IBM

#### InfiniBand Providers

- Mellanox
- QLogic

## PGI CUDA x86 Compiler

### **Benefits**

- Deploy CUDA apps on legacy systems without GPUs
- Less code maintenance for developers

### Timeline

- April/May: 1.0 initial release
  - Develop, debug, test functionality
- Aug: 1.1 performance release
  - Multicore, SSE/AVX support




## **GPU Computing Research & Education**

#### http://research.nvidia.com



World Class Research Leadership and Teaching

- University of Cambridge
- Harvard University
- University of Utah
- University of Tennessee
- University of Maryland
- University of Illinois at Urbana-Champaign
- Tsinghua University
- Tokyo Institute of Technology
- Chinese Academy of Sciences
- National Taiwan University
- Georgia Institute of Technology

#### **Academic Partnerships / Fellowships**





Proven Research Vision

- Johns Hopkins University
- Nanyang University
- Technical University-Czech
- CSIRO
- SINTEF
- HP Labs
- ICHEC
- Barcelona SuperComputer Center
- Clemson University
- Fraunhofer SCAI
- Karlsruhe Institute Of Technology
- Mass. Gen. Hospital/NE Univ
- North Carolina State University
- Swinburne University of Tech.
- Technische Univ. Munich
- UCLA
- University of New Mexico
- University Of Warsaw-ICM
- VSB-Tech University of Ostrava
- And more coming shortly.

#### GPGPU Education

350+ Universities




"Don't kid yourself. **GPUs are a game-changer**." said Frank Chambers, a GTC conference attendee shopping for GPUs for his finite element analysis work. "What we are seeing here is **like going from propellers to jet engines**. That made transcontinental flights routine. Wide access to this kind of computing power is making things like artificial retinas possible, and that wasn't predicted to happen until 2060."

- Inside HPC (Sept 22, 2010)

## GPU Technology Conference 2011

October 11-14 | San Jose, CA

### The one event you can't afford to miss

- Learn about leading-edge advances in GPU computing
- Explore the research as well as the commercial applications
- Discover advances in computational visualization
- Take a deep dive into parallel programming

### Ways to participate

- Speak: share your work and gain exposure as a thought leader
- Register: learn from the experts and network with your peers
- Exhibit/Sponsor: promote your company as a key player in the GPU ecosystem



### www.gputechconf.com