NVIDIA CUDA SDK x64 - Performance Strategies

The CUDA Developer SDK provides examples with source code, utilities, and white papers to help you get started writing software with CUDA. The SDK includes dozens of code samples covering a wide range of applications including:

  • Simple techniques such as C++ code integration and efficient loading of custom datatypes
  • How-To examples covering the CUDA BLAS and FFT libraries, texture fetching in CUDA, and CUDA interoperation with the OpenGL and Direct3D graphics APIs
  • Linear algebra primitives such as matrix transpose and matrix-matrix multiplication
  • Data-parallel algorithms such as parallel prefix sum of large arrays
  • Performance: profiling using timers and bandwidth tests
  • Advanced application examples such as image convolution, Black-Scholes options pricing and binomial options pricing
Refer to the following READMEs for more information (Linux, Windows).

This code is released free of charge for use in derivative works, whether academic, commercial, or personal. (Full License)

The NVIDIA CUDA Toolkit is required to compile and run the code samples.

Monte-Carlo Option Pricing with multi-GPU support

This sample evaluates the fair call price for a given set of European options using a Monte-Carlo approach, taking advantage of all CUDA-capable GPUs installed in the system.

Download - Windows
Download - Linux
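
As a rough illustration of the Monte-Carlo approach described for this sample, here is a CPU-side Python sketch, not the SDK's CUDA code. It assumes the underlying asset follows geometric Brownian motion and that the option is a plain European call; the function names, the `n_parts` parameter, and the per-part seeding are hypothetical. Splitting the paths into independent parts is only meant to suggest how the work could be divided across several GPUs, since each part needs nothing from the others until the final sum.

```python
import math
import random


def payoff_sum(s0, k, r, sigma, t, n_paths, seed):
    """Sum of call payoffs max(S_T - K, 0) over n_paths simulated end prices."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma * sigma) * t   # risk-neutral drift over [0, t]
    vol = sigma * math.sqrt(t)              # volatility scaled to the horizon
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)             # standard normal sample
        s_t = s0 * math.exp(drift + vol * z)
        total += max(s_t - k, 0.0)
    return total


def monte_carlo_call(s0, k, r, sigma, t, n_paths, n_parts=1, seed=0):
    """Discounted mean payoff, with the paths split into n_parts independent parts.

    Each part stands in for the share of work one device might take
    (an assumption made for illustration only).
    """
    base, extra = divmod(n_paths, n_parts)
    total = 0.0
    for part in range(n_parts):
        count = base + (1 if part < extra else 0)
        total += payoff_sum(s0, k, r, sigma, t, count, seed + part)
    return math.exp(-r * t) * total / n_paths
```

For example, `monte_carlo_call(100.0, 100.0, 0.05, 0.2, 1.0, 200000, n_parts=4)` should land close to the Black-Scholes closed-form value for the same parameters, with a sampling error that shrinks roughly as one over the square root of the number of paths.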


Matrix Transpose

Efficient matrix transpose.

Download - Windows
Download - Linux


Clock

This example shows how to use the clock function to measure the performance of a kernel accurately.

Download - Windows
Download - Linux


Aligned Types

A simple test showing the large gap in access speed between aligned and misaligned structures.

Download - Windows
Download - Linux


Scan

This example demonstrates an efficient CUDA implementation of parallel prefix sum, also known as "scan". Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array.

Whitepaper
Download - Windows
Download - Linux
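
The definition above (each output element is the sum of all input elements before it) is an exclusive scan. The following Python sketch shows one well-known way to organize that computation, the up-sweep/down-sweep scheme usually attributed to Blelloch. It runs sequentially on the CPU and assumes a power-of-two input length; whether the SDK sample uses exactly this scheme is not stated on this page, and the function name is hypothetical. On a GPU, the iterations of each inner loop are independent and could run in parallel.

```python
def exclusive_scan_pow2(values):
    """Exclusive prefix sum of a list whose length is a power of two."""
    n = len(values)
    if n == 0:
        return []
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    data = list(values)

    # Up-sweep: build partial sums in place, doubling the stride each pass.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    data[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]   # left child receives the parent's prefix
            data[i] += left              # right child adds the left subtree's sum
        stride //= 2
    return data
```

For the input `[3, 1, 7, 0, 4, 1, 6, 3]` this returns `[0, 3, 4, 11, 11, 15, 16, 22]`.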


Bandwidth Test

This is a simple test program to measure the memcopy bandwidth of the GPU. It can currently measure device-to-device copy bandwidth, host-to-device copy bandwidth for pageable and page-locked memory, and device-to-host copy bandwidth for pageable and page-locked memory.

Download - Windows
Download - Linux


Scan of Large Arrays

This example demonstrates an efficient CUDA implementation of parallel prefix sum (also known as "scan") for arbitrary-sized arrays. Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array.

Whitepaper
Download - Windows
Download - Linux
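
One common way to extend a scan to arbitrary-sized arrays is to scan fixed-size blocks independently, scan the per-block totals, and then add each block's offset back to its elements. The Python sketch below illustrates that three-phase decomposition on the CPU. It is an assumption that the sample is organized along these lines, and the function names and the `block_size` parameter are hypothetical.

```python
def exclusive_scan_block(block):
    """Exclusive prefix sum of one block; also returns the block total."""
    out = []
    running = 0
    for value in block:
        out.append(running)
        running += value
    return out, running


def exclusive_scan_large(values, block_size=4):
    """Exclusive prefix sum of an arbitrary-length list, processed block by block."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    # Phase 1: scan each block on its own and record its total.
    scanned_blocks = []
    block_sums = []
    for start in range(0, len(values), block_size):
        scanned, total = exclusive_scan_block(values[start:start + block_size])
        scanned_blocks.append(scanned)
        block_sums.append(total)

    # Phase 2: scan the block totals to get the starting offset of each block.
    offsets, _ = exclusive_scan_block(block_sums)

    # Phase 3: add each block's offset to every element in that block.
    result = []
    for scanned, offset in zip(scanned_blocks, offsets):
        result.extend(x + offset for x in scanned)
    return result
```

With `values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]` and the default block size of 4, the result is `[0, 1, 3, 6, 10, 15, 21, 28, 36, 45]`. The last block holds only two elements, which is how the sketch handles lengths that are not a multiple of the block size.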


Parallel Reduction

A parallel sum reduction that computes the sum of large arrays of values. This sample demonstrates several important optimization strategies for parallel algorithms like reduction.

Whitepaper
Download - Windows
Download - Linux
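
To make the idea of a sum reduction concrete, here is a minimal Python sketch of a tree-shaped pairwise reduction. It runs sequentially on the CPU and does not reproduce any of the sample's optimization strategies; the function name is hypothetical. Each pass of the outer loop halves the number of live partial sums, and the additions within one pass are independent of each other, which is what makes the pattern suitable for parallel hardware.

```python
def tree_reduce_sum(values):
    """Sum a list by repeatedly adding pairs of partial sums, tree style."""
    data = list(values)
    n = len(data)
    if n == 0:
        return 0
    stride = 1
    while stride < n:
        # Element i absorbs its partner at i + stride; pairs do not overlap,
        # so every addition in this pass could be done at the same time.
        for i in range(0, n - stride, 2 * stride):
            data[i] += data[i + stride]
        stride *= 2
    return data[0]
```

For `[1, 2, 3, 4, 5]` the passes produce partial sums 3 and 7, then 10, then the final total 15 in `data[0]`.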


asyncAPI

This sample uses CUDA streams and events to overlap execution on the CPU and the GPU.

Download - Windows
Download - Linux


simpleStreams

This sample uses CUDA streams to overlap kernel executions with memcopies between the device and the host.

Download - Windows
Download - Linux

Last Update: 11/12/2007