NVIDIA CUDA - Data-Parallel Algorithms

The CUDA Toolkit includes 100+ code samples, utilities, whitepapers, and additional documentation to help you get started developing, porting, and optimizing your applications for the CUDA architecture. This page gives you quick access to many of the toolkit resources; you can also browse the CUDA documentation or download the complete toolkit.

Please note that you may need to install the latest NVIDIA drivers and CUDA Toolkit to compile and run the code samples.

Refer to the samples release notes for more information.


CUDA Parallel Prefix Sum with Shuffle Intrinsics (SHFL Scan)
This example demonstrates how to use the shuffle intrinsic __shfl_up to perform a scan operation across a thread block. A GPU with Compute Capability SM 3.0 or higher is required to run the sample. A minimal warp-level sketch follows the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
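
The heart of the technique is a warp-level inclusive scan built from __shfl_up. The sketch below assumes a 32-thread warp and Compute Capability 3.0+; the function and kernel names are illustrative rather than the sample's own, and a full block-wide scan would additionally combine the per-warp totals (for example through shared memory).

    // Warp-level inclusive scan using __shfl_up (illustrative sketch).
    // On CUDA 9.0 and later the _sync variant (__shfl_up_sync) is preferred.
    __device__ int warpInclusiveScan(int value)
    {
        int lane = threadIdx.x & 31;              // lane index within the warp
        for (int offset = 1; offset < 32; offset *= 2) {
            int n = __shfl_up(value, offset);     // fetch the value held by lane (lane - offset)
            if (lane >= offset)                   // lanes below 'offset' have nothing to add
                value += n;
        }
        return value;                             // inclusive prefix sum within the warp
    }

    __global__ void scanWarps(const int *in, int *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Per-warp scan only; a block-wide scan would also propagate
        // each warp's total to the warps that follow it.
        out[i] = warpInclusiveScan(in[i]);
    }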


CUDA Segmentation Tree Thrust Library
This sample demonstrates an approach to constructing image segmentation trees. The method is based on Boruvka's minimum spanning tree (MST) algorithm.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac


Fast Walsh Transform
Naturally (Hadamard)-ordered Fast Walsh Transform for batches of vectors whose lengths are arbitrary powers of two. A minimal host-side sketch of the butterfly structure follows the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
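
For reference, the butterfly structure of the naturally (Hadamard)-ordered transform can be written on the host as below. This is an illustrative single-vector version with an invented function name, not the sample's optimized batched GPU kernels.

    #include <cstddef>

    // Host reference of a naturally (Hadamard)-ordered Fast Walsh Transform for one
    // vector whose length n is a power of two; the GPU sample performs the same
    // butterflies for many vectors of a batch in parallel.
    void fwtHost(float *data, size_t n)
    {
        for (size_t stride = 1; stride < n; stride <<= 1) {
            for (size_t base = 0; base < n; base += 2 * stride) {
                for (size_t j = 0; j < stride; ++j) {
                    float a = data[base + j];
                    float b = data[base + j + stride];
                    data[base + j]          = a + b;   // butterfly: sum
                    data[base + j + stride] = a - b;   // butterfly: difference
                }
            }
        }
    }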


CUDA Histogram
This sample demonstrates efficient implementations of 64-bin and 256-bin histograms. A minimal shared-memory sketch follows the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
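
A minimal sketch of the 256-bin case, assuming byte-valued input, is shown below: each block accumulates a private histogram in shared memory with atomics and then merges it into the global result. The SDK sample uses more elaborate per-warp sub-histograms for higher throughput; kernel and parameter names here are illustrative.

    // Per-block shared-memory histogram with a final merge into the global histogram.
    __global__ void histogram256(const unsigned char *data, unsigned int *hist, int n)
    {
        __shared__ unsigned int smem[256];
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            smem[i] = 0;                           // clear the block-private bins
        __syncthreads();

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            atomicAdd(&smem[data[i]], 1u);         // bin one input byte per iteration

        __syncthreads();
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            atomicAdd(&hist[i], smem[i]);          // merge into the global histogram
    }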


Line of Sight
This sample is an implementation of a simple line-of-sight algorithm: given a height map and a ray originating at some observation point, it computes all the points along the ray that are visible from the observation point. The implementation is based on the Thrust library (http://code.google.com/p/thrust/). A sketch of the idea follows the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
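
The idea can be sketched with Thrust roughly as below, under simplifying assumptions that are not the sample's (unit spacing between height samples, observer at distance zero, invented functor and function names): compute the elevation angle of every sample along the ray, take a running maximum of those angles with an inclusive scan, and mark a sample visible when its own angle is not below the running maximum.

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/scan.h>
    #include <thrust/sequence.h>
    #include <thrust/functional.h>
    #include <cmath>

    struct ElevationAngle
    {
        float observerHeight;
        __host__ __device__ float operator()(float height, float distance) const
        {
            return atan2f(height - observerHeight, distance);   // angle above horizontal
        }
    };

    void lineOfSight(const thrust::device_vector<float> &heights,
                     float observerHeight,
                     thrust::device_vector<bool> &visible)
    {
        const int n = static_cast<int>(heights.size());
        thrust::device_vector<float> distance(n), angle(n), runningMax(n);
        visible.resize(n);

        thrust::sequence(distance.begin(), distance.end(), 1.0f);   // 1, 2, 3, ... steps along the ray

        ElevationAngle op;
        op.observerHeight = observerHeight;
        thrust::transform(heights.begin(), heights.end(), distance.begin(),
                          angle.begin(), op);                       // per-sample elevation angle

        thrust::inclusive_scan(angle.begin(), angle.end(), runningMax.begin(),
                               thrust::maximum<float>());           // running maximum angle

        thrust::transform(angle.begin(), angle.end(), runningMax.begin(),
                          visible.begin(), thrust::greater_equal<float>());  // visible if not occluded
    }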


CUDA Parallel Reduction
A parallel sum reduction that computes the sum of a large array of values. This sample demonstrates several important optimization strategies for data-parallel algorithms such as reduction. The basic block-level pattern is sketched after the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
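
The basic block-level pattern looks roughly like the sketch below: sequential addressing in shared memory, one partial sum written per block, and the partial sums reduced afterwards. The SDK sample layers further optimizations on top (loop unrolling, multiple elements per thread, warp-level finishing steps); names here are illustrative, and the kernel expects blockDim.x * sizeof(float) bytes of dynamic shared memory.

    __global__ void reduceSum(const float *in, float *blockSums, int n)
    {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

        sdata[tid] = (i < n) ? in[i] : 0.0f;       // load one element (0 past the end)
        __syncthreads();

        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];      // pairwise sums, halving the active threads
            __syncthreads();
        }

        if (tid == 0)
            blockSums[blockIdx.x] = sdata[0];      // one partial sum per block
    }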


CUDA Parallel Prefix Sum (Scan)
This example demonstrates an efficient CUDA implementation of parallel prefix sum, also known as "scan". Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array. A minimal single-block sketch follows the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
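
A minimal single-block sketch in the Hillis-Steele style is shown below; it produces an inclusive scan, which is the exclusive scan described above shifted by one element. The SDK sample uses a work-efficient algorithm and handles arrays far larger than one block; names are illustrative, and the kernel expects 2 * blockDim.x * sizeof(float) bytes of dynamic shared memory.

    __global__ void scanBlock(const float *in, float *out, int n)
    {
        extern __shared__ float temp[];            // double buffer: 2 * blockDim.x floats
        int tid  = threadIdx.x;
        int pout = 0, pin = 1;

        temp[tid] = (tid < n) ? in[tid] : 0.0f;
        __syncthreads();

        for (int offset = 1; offset < blockDim.x; offset *= 2) {
            pout = 1 - pout;                       // swap the double buffers
            pin  = 1 - pout;
            if (tid >= offset)
                temp[pout * blockDim.x + tid] = temp[pin * blockDim.x + tid]
                                              + temp[pin * blockDim.x + tid - offset];
            else
                temp[pout * blockDim.x + tid] = temp[pin * blockDim.x + tid];
            __syncthreads();
        }

        if (tid < n)
            out[tid] = temp[pout * blockDim.x + tid];   // inclusive prefix sums
    }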


CUDA Separable Convolution
This sample implements a separable convolution filter of a 2D signal with a Gaussian kernel. The row pass is sketched after the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
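
Separability means the 2D Gaussian filter can be applied as a row pass followed by a column pass over the intermediate result. The row pass is sketched below in a naive form without the shared-memory tiling the SDK sample adds for performance; it assumes the filter coefficients were copied to constant memory beforehand (for example with cudaMemcpyToSymbol), and the names are illustrative.

    #define KERNEL_RADIUS 8
    __constant__ float c_Kernel[2 * KERNEL_RADIUS + 1];    // Gaussian coefficients

    __global__ void convolutionRowNaive(float *dst, const float *src, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float sum = 0.0f;
        for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k) {
            int xs = min(max(x + k, 0), width - 1);        // clamp at the image border
            sum += src[y * width + xs] * c_Kernel[KERNEL_RADIUS + k];
        }
        dst[y * width + x] = sum;
    }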


Texture-based Separable Convolution
Texture-based implementation of a separable 2D convolution with a Gaussian kernel. Used for performance comparison against convolutionSeparable.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac


threadFenceReduction
This sample shows how to perform a reduction operation on an array of values using the __threadfence() intrinsic to produce a single value in a single kernel launch (as opposed to the two or more kernel calls used in the "reduction" SDK sample). Single-pass reduction requires global atomic instructions (Compute Capability 1.1 or later) and the __threadfence() intrinsic (CUDA 2.2 or later). The single-pass pattern is sketched after the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
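
The single-pass pattern can be sketched as below: every block writes its partial sum to global memory, executes __threadfence() so that the value is guaranteed to be visible to other blocks, and increments an atomic counter; the last block to arrive then reduces the partial sums. This is an illustrative sketch with invented names, not the sample's exact code; it expects blockDim.x * sizeof(float) bytes of dynamic shared memory and one partialSums slot per block.

    __device__ unsigned int retirementCount = 0;           // how many blocks have finished

    __device__ float blockReduceSum(float v, float *sdata)
    {
        unsigned int tid = threadIdx.x;
        sdata[tid] = v;
        __syncthreads();
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        float total = sdata[0];
        __syncthreads();                                   // safe to reuse sdata after return
        return total;
    }

    __global__ void reduceSinglePass(const float *in, float *partialSums, float *result, int n)
    {
        extern __shared__ float sdata[];
        float sum = 0.0f;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            sum += in[i];                                  // grid-stride accumulation

        sum = blockReduceSum(sum, sdata);
        if (threadIdx.x == 0)
            partialSums[blockIdx.x] = sum;                 // publish this block's partial sum

        __threadfence();                                   // make the partial sum visible to all blocks

        __shared__ bool isLastBlock;
        if (threadIdx.x == 0) {
            unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
            isLastBlock = (ticket == gridDim.x - 1);       // true only for the final block to arrive
        }
        __syncthreads();

        if (isLastBlock) {
            float total = 0.0f;
            for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
                total += partialSums[i];                   // the last block sums all partials
            total = blockReduceSum(total, sdata);
            if (threadIdx.x == 0) {
                *result = total;
                retirementCount = 0;                       // reset for the next launch
            }
        }
    }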


CUDA Radix Sort (Thrust Library)
This sample demonstrates a very fast and efficient parallel radix sort implemented with the Thrust library (http://code.google.com/p/thrust/). The included RadixSort class can sort either key-value pairs (with float or unsigned integer keys) or keys only. The optimized code in this sample (and also in reduction and scan) uses a technique known as warp-synchronous programming, which relies on the fact that within a warp of threads running on a CUDA GPU, all threads execute instructions synchronously. The code uses this to avoid __syncthreads() when threads within a warp are sharing data via __shared__ memory. It is important to note that for this to work correctly without race conditions on all GPUs, the shared memory used in these warp-synchronous expressions must be declared volatile. If it is not declared volatile, then in the absence of __syncthreads() the compiler is free to delay stores to __shared__ memory and keep the data in registers (an optimization technique), which will result in incorrect execution. So please heed the use of volatile in these samples, and use it in the same way in any code you derive from them. The pattern is illustrated after the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
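
The point about volatile can be illustrated with the classic warp-synchronous tail of a shared-memory reduction: once only one warp's worth of partial sums remains, the __syncthreads() calls are dropped and the shared-memory pointer is declared volatile so the compiler keeps the stores visible to the other threads of the warp. This sketch assumes the implicit warp-level synchrony described above and a block of at least 64 threads; architectures with independent thread scheduling require explicit warp synchronization instead.

    // Final warp-synchronous stage of a block sum reduction. The volatile qualifier
    // prevents the compiler from caching sdata values in registers between the steps.
    __device__ void warpReduce(volatile float *sdata, unsigned int tid)
    {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }

    // Usage inside the block reduction: stop the synchronized loop at 32 active
    // threads and let the first warp finish without __syncthreads().
    //
    //     for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
    //         if (tid < s) sdata[tid] += sdata[tid + s];
    //         __syncthreads();
    //     }
    //     if (tid < 32) warpReduce(sdata, tid);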


CUDA Sorting Networks
This sample implements bitonic sort and odd-even merge sort (also known as Batcher's sort), algorithms belonging to the class of sorting networks. While generally less efficient for large sequences than algorithms with better asymptotic complexity (e.g., merge sort or radix sort), they may be the algorithms of choice for sorting batches of short- to mid-sized (key, value) array pairs. A minimal bitonic sort sketch follows the download links below. Refer to the excellent tutorial by H. W. Lang: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/indexen.htm

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
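
A minimal sketch of the bitonic compare-and-swap network for a single block sorting a power-of-two number of keys in shared memory is shown below; the SDK sample additionally handles (key, value) pairs and arrays spanning many blocks in several stages. The kernel name is illustrative, and it is assumed to be launched with one thread per key and n * sizeof(unsigned int) bytes of dynamic shared memory.

    __global__ void bitonicSortBlock(unsigned int *data, unsigned int n)
    {
        extern __shared__ unsigned int s_key[];
        unsigned int tid = threadIdx.x;
        if (tid < n) s_key[tid] = data[tid];
        __syncthreads();

        for (unsigned int size = 2; size <= n; size <<= 1) {            // bitonic subsequence length
            for (unsigned int stride = size >> 1; stride > 0; stride >>= 1) {
                unsigned int partner = tid ^ stride;                     // element compared against
                if (partner > tid && partner < n) {
                    bool ascending = ((tid & size) == 0);                // direction of this subsequence
                    unsigned int a = s_key[tid], b = s_key[partner];
                    if ((a > b) == ascending) {                          // out of order for this direction
                        s_key[tid]     = b;
                        s_key[partner] = a;
                    }
                }
                __syncthreads();
            }
        }

        if (tid < n) data[tid] = s_key[tid];
    }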


Merge Sort
This sample implements a merge sort (also known as Batcher's sort), an algorithm belonging to the class of sorting networks. While generally less efficient on large sequences than algorithms with better asymptotic complexity (e.g., radix sort), it may be the algorithm of choice for sorting batches of short- to mid-sized (key, value) array pairs. Refer to the excellent tutorial by H. W. Lang: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/indexen.htm

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac


Mandelbrot
This sample uses CUDA to compute and display the Mandelbrot or Julia sets interactively. It also illustrates the use of "double single" arithmetic to improve precision when zooming a long way into the pattern; the basic addition operation is sketched after the download links below. The sample uses double-precision hardware if a GT200-class GPU is present. Thanks to Mark Granger of NewTek, who submitted this code sample.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
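
"Double single" arithmetic represents one high-precision value as a pair of floats, a high part plus a low-order correction term, which roughly doubles the effective mantissa for deep zooms. Below is a sketch of the addition operation based on the standard error-compensated two-float sum; the function name is illustrative, and the code has to be compiled without unsafe floating-point optimizations so the compensation terms are not folded away.

    __device__ float2 dsAdd(float2 a, float2 b)
    {
        float t1 = a.x + b.x;                  // add the high parts
        float e  = t1 - a.x;                   // recover the rounding error of that add
        float t2 = ((b.x - e) + (a.x - (t1 - e))) + a.y + b.y;
        float2 r;
        r.x = t1 + t2;                         // renormalize into a new high part...
        r.y = t2 - (r.x - t1);                 // ...and a low-order correction
        return r;
    }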


CUDA N-Body Simulation
This sample demonstrates an efficient all-pairs gravitational n-body simulation in CUDA and accompanies the GPU Gems 3 chapter "Fast N-Body Simulation with CUDA"; the core per-pair interaction is sketched after the download links below. With CUDA 5.5, performance on Tesla K20c has increased to over 1.8 TFLOP/s single precision, and double-precision performance has also improved on Kepler and Fermi GPU architectures. Starting in CUDA 4.0, the nBody sample has been updated to take advantage of new features that make it easy to scale the n-body simulation across multiple GPUs in a single PC. Adding "-numbodies=" to the command line lets users set the number of bodies for the simulation, and adding "-numdevices=" causes the sample to use N devices (if available) for the simulation. In this mode, the position and velocity data for all bodies are read from system memory using "zero copy" rather than from device memory. For a small number of devices (4 or fewer) and a large enough number of bodies, bandwidth is not a bottleneck, so we can achieve strong scaling across these devices.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
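
The core of the all-pairs approach, following the GPU Gems 3 chapter, is accumulating the acceleration that one body induces on another; a sketch is below. Positions are assumed to carry the body mass in the .w component, and the softening term avoids the singularity when two bodies coincide; the function name is illustrative.

    __device__ float3 bodyBodyInteraction(float4 bi, float4 bj, float3 ai, float softeningSq)
    {
        float3 r;
        r.x = bj.x - bi.x;                                 // vector from body i to body j
        r.y = bj.y - bi.y;
        r.z = bj.z - bi.z;

        float distSq  = r.x * r.x + r.y * r.y + r.z * r.z + softeningSq;
        float invDist = rsqrtf(distSq);
        float s = bj.w * invDist * invDist * invDist;      // m_j / dist^3

        ai.x += r.x * s;                                   // a_i += m_j * r / dist^3
        ai.y += r.y * s;
        ai.z += r.z * s;
        return ai;
    }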