NVIDIA CUDA - Data-Parallel Algorithms

The CUDA Toolkit includes 100+ code samples, utilities, whitepapers, and additional documentation to help you get started developing, porting, and optimizing your applications for the CUDA architecture. This page gives you quick access to many of the toolkit resources; you can also browse the CUDA documentation or download the complete toolkit.

Please note that you may need to install the latest NVIDIA drivers and CUDA Toolkit to compile and run the code samples.

Refer to the samples release notes for more information.


CUDA Parallel Prefix Sum with Shuffle Intrinsics (SHFL Scan)
This example demonstrates how to use the shuffle intrinsic __shfl_up to perform a scan operation across a thread block. A GPU with Compute Capability SM 3.0 or higher is required to run the sample. A minimal warp-level sketch follows the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
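
The heart of the technique is a warp-level inclusive scan built from __shfl_up. The sketch below assumes a 32-thread warp and Compute Capability 3.0+; the function and kernel names are illustrative rather than the sample's own, and a full block-wide scan would additionally combine the per-warp totals (for example through shared memory).

    // Warp-level inclusive scan using __shfl_up (illustrative sketch).
    // On CUDA 9.0 and later the _sync variant (__shfl_up_sync) is preferred.
    __device__ int warpInclusiveScan(int value)
    {
        int lane = threadIdx.x & 31;              // lane index within the warp
        for (int offset = 1; offset < 32; offset *= 2) {
            int n = __shfl_up(value, offset);     // fetch the value held by lane (lane - offset)
            if (lane >= offset)                   // lanes below 'offset' have nothing to add
                value += n;
        }
        return value;                             // inclusive prefix sum within the warp
    }

    __global__ void scanWarps(const int *in, int *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Per-warp scan only; a block-wide scan would also propagate
        // each warp's total to the warps that follow it.
        out[i] = warpInclusiveScan(in[i]);
    }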


CUDA Segmentation Tree Thrust Library
This sample demonstrates an approach to constructing image segmentation trees. The method is based on Boruvka's minimum spanning tree (MST) algorithm.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac


Fast Walsh Transform
Naturally (Hadamard)-ordered Fast Walsh Transform for batches of vectors whose lengths are arbitrary powers of two. A minimal host-side sketch of the butterfly structure follows the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
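
For reference, the butterfly structure of the naturally (Hadamard)-ordered transform can be written on the host as below. This is an illustrative single-vector version with an invented function name, not the sample's optimized batched GPU kernels.

    #include <cstddef>

    // Host reference of a naturally (Hadamard)-ordered Fast Walsh Transform for one
    // vector whose length n is a power of two; the GPU sample performs the same
    // butterflies for many vectors of a batch in parallel.
    void fwtHost(float *data, size_t n)
    {
        for (size_t stride = 1; stride < n; stride <<= 1) {
            for (size_t base = 0; base < n; base += 2 * stride) {
                for (size_t j = 0; j < stride; ++j) {
                    float a = data[base + j];
                    float b = data[base + j + stride];
                    data[base + j]          = a + b;   // butterfly: sum
                    data[base + j + stride] = a - b;   // butterfly: difference
                }
            }
        }
    }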


CUDA Histogram
This sample demonstrates efficient implementations of 64-bin and 256-bin histograms. A minimal shared-memory sketch follows the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
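
A minimal sketch of the 256-bin case, assuming byte-valued input, is shown below: each block accumulates a private histogram in shared memory with atomics and then merges it into the global result. The SDK sample uses more elaborate per-warp sub-histograms for higher throughput; kernel and parameter names here are illustrative.

    // Per-block shared-memory histogram with a final merge into the global histogram.
    __global__ void histogram256(const unsigned char *data, unsigned int *hist, int n)
    {
        __shared__ unsigned int smem[256];
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            smem[i] = 0;                           // clear the block-private bins
        __syncthreads();

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            atomicAdd(&smem[data[i]], 1u);         // bin one input byte per iteration

        __syncthreads();
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            atomicAdd(&hist[i], smem[i]);          // merge into the global histogram
    }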


Line of Sight
This sample is an implementation of a simple line-of-sight algorithm: given a height map and a ray originating at some observation point, it computes all the points along the ray that are visible from the observation point. The implementation is based on the Thrust library (http://code.google.com/p/thrust/). A sketch of the idea follows the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
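
The idea can be sketched with Thrust roughly as below, under simplifying assumptions that are not the sample's (unit spacing between height samples, observer at distance zero, invented functor and function names): compute the elevation angle of every sample along the ray, take a running maximum of those angles with an inclusive scan, and mark a sample visible when its own angle is not below the running maximum.

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/scan.h>
    #include <thrust/sequence.h>
    #include <thrust/functional.h>
    #include <cmath>

    struct ElevationAngle
    {
        float observerHeight;
        __host__ __device__ float operator()(float height, float distance) const
        {
            return atan2f(height - observerHeight, distance);   // angle above horizontal
        }
    };

    void lineOfSight(const thrust::device_vector<float> &heights,
                     float observerHeight,
                     thrust::device_vector<bool> &visible)
    {
        const int n = static_cast<int>(heights.size());
        thrust::device_vector<float> distance(n), angle(n), runningMax(n);
        visible.resize(n);

        thrust::sequence(distance.begin(), distance.end(), 1.0f);   // 1, 2, 3, ... steps along the ray

        ElevationAngle op;
        op.observerHeight = observerHeight;
        thrust::transform(heights.begin(), heights.end(), distance.begin(),
                          angle.begin(), op);                       // per-sample elevation angle

        thrust::inclusive_scan(angle.begin(), angle.end(), runningMax.begin(),
                               thrust::maximum<float>());           // running maximum angle

        thrust::transform(angle.begin(), angle.end(), runningMax.begin(),
                          visible.begin(), thrust::greater_equal<float>());  // visible if not occluded
    }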


CUDA Parallel Reduction
A parallel sum reduction that computes the sum of a large array of values. This sample demonstrates several important optimization strategies for data-parallel algorithms such as reduction. The basic block-level pattern is sketched after the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
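
The basic block-level pattern looks roughly like the sketch below: sequential addressing in shared memory, one partial sum written per block, and the partial sums reduced afterwards. The SDK sample layers further optimizations on top (loop unrolling, multiple elements per thread, warp-level finishing steps); names here are illustrative, and the kernel expects blockDim.x * sizeof(float) bytes of dynamic shared memory.

    __global__ void reduceSum(const float *in, float *blockSums, int n)
    {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

        sdata[tid] = (i < n) ? in[i] : 0.0f;       // load one element (0 past the end)
        __syncthreads();

        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];      // pairwise sums, halving the active threads
            __syncthreads();
        }

        if (tid == 0)
            blockSums[blockIdx.x] = sdata[0];      // one partial sum per block
    }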


CUDA Parallel Prefix Sum (Scan)
This example demonstrates an efficient CUDA implementation of parallel prefix sum, also known as "scan". Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array. A minimal single-block sketch follows the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
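
A minimal single-block sketch in the Hillis-Steele style is shown below; it produces an inclusive scan, which is the exclusive scan described above shifted by one element. The SDK sample uses a work-efficient algorithm and handles arrays far larger than one block; names are illustrative, and the kernel expects 2 * blockDim.x * sizeof(float) bytes of dynamic shared memory.

    __global__ void scanBlock(const float *in, float *out, int n)
    {
        extern __shared__ float temp[];            // double buffer: 2 * blockDim.x floats
        int tid  = threadIdx.x;
        int pout = 0, pin = 1;

        temp[tid] = (tid < n) ? in[tid] : 0.0f;
        __syncthreads();

        for (int offset = 1; offset < blockDim.x; offset *= 2) {
            pout = 1 - pout;                       // swap the double buffers
            pin  = 1 - pout;
            if (tid >= offset)
                temp[pout * blockDim.x + tid] = temp[pin * blockDim.x + tid]
                                              + temp[pin * blockDim.x + tid - offset];
            else
                temp[pout * blockDim.x + tid] = temp[pin * blockDim.x + tid];
            __syncthreads();
        }

        if (tid < n)
            out[tid] = temp[pout * blockDim.x + tid];   // inclusive prefix sums
    }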


CUDA Separable Convolution
This sample implements a separable convolution filter of a 2D signal with a Gaussian kernel. The row pass is sketched after the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
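
Separability means the 2D Gaussian filter can be applied as a row pass followed by a column pass over the intermediate result. The row pass is sketched below in a naive form without the shared-memory tiling the SDK sample adds for performance; it assumes the filter coefficients were copied to constant memory beforehand (for example with cudaMemcpyToSymbol), and the names are illustrative.

    #define KERNEL_RADIUS 8
    __constant__ float c_Kernel[2 * KERNEL_RADIUS + 1];    // Gaussian coefficients

    __global__ void convolutionRowNaive(float *dst, const float *src, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float sum = 0.0f;
        for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k) {
            int xs = min(max(x + k, 0), width - 1);        // clamp at the image border
            sum += src[y * width + xs] * c_Kernel[KERNEL_RADIUS + k];
        }
        dst[y * width + x] = sum;
    }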


Texture-based Separable Convolution
Texture-based implementation of a separable 2D convolution with a Gaussian kernel. Used for performance comparison against convolutionSeparable.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac


threadFenceReduction
This sample shows how to perform a reduction operation on an array of values using the __threadfence() intrinsic to produce a single value in a single kernel launch (as opposed to the two or more kernel calls used in the "reduction" SDK sample). Single-pass reduction requires global atomic instructions (Compute Capability 1.1 or later) and the __threadfence() intrinsic (CUDA 2.2 or later). The single-pass pattern is sketched after the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
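
The single-pass pattern can be sketched as below: every block writes its partial sum to global memory, executes __threadfence() so that the value is guaranteed to be visible to other blocks, and increments an atomic counter; the last block to arrive then reduces the partial sums. This is an illustrative sketch with invented names, not the sample's exact code; it expects blockDim.x * sizeof(float) bytes of dynamic shared memory and one partialSums slot per block.

    __device__ unsigned int retirementCount = 0;           // how many blocks have finished

    __device__ float blockReduceSum(float v, float *sdata)
    {
        unsigned int tid = threadIdx.x;
        sdata[tid] = v;
        __syncthreads();
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        float total = sdata[0];
        __syncthreads();                                   // safe to reuse sdata after return
        return total;
    }

    __global__ void reduceSinglePass(const float *in, float *partialSums, float *result, int n)
    {
        extern __shared__ float sdata[];
        float sum = 0.0f;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            sum += in[i];                                  // grid-stride accumulation

        sum = blockReduceSum(sum, sdata);
        if (threadIdx.x == 0)
            partialSums[blockIdx.x] = sum;                 // publish this block's partial sum

        __threadfence();                                   // make the partial sum visible to all blocks

        __shared__ bool isLastBlock;
        if (threadIdx.x == 0) {
            unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
            isLastBlock = (ticket == gridDim.x - 1);       // true only for the final block to arrive
        }
        __syncthreads();

        if (isLastBlock) {
            float total = 0.0f;
            for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
                total += partialSums[i];                   // the last block sums all partials
            total = blockReduceSum(total, sdata);
            if (threadIdx.x == 0) {
                *result = total;
                retirementCount = 0;                       // reset for the next launch
            }
        }
    }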


CUDA Radix Sort (Thrust Library)
This sample demonstrates a very fast and efficient parallel radix sort implemented with the Thrust library (http://code.google.com/p/thrust/). The included RadixSort class can sort either key-value pairs (with float or unsigned integer keys) or keys only. The optimized code in this sample (and also in reduction and scan) uses a technique known as warp-synchronous programming, which relies on the fact that within a warp of threads running on a CUDA GPU, all threads execute instructions synchronously. The code uses this to avoid __syncthreads() when threads within a warp are sharing data via __shared__ memory. It is important to note that for this to work correctly without race conditions on all GPUs, the shared memory used in these warp-synchronous expressions must be declared volatile. If it is not declared volatile, then in the absence of __syncthreads() the compiler is free to delay stores to __shared__ memory and keep the data in registers (an optimization technique), which will result in incorrect execution. So please heed the use of volatile in these samples, and use it in the same way in any code you derive from them. The pattern is illustrated after the download links below.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
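
The point about volatile can be illustrated with the classic warp-synchronous tail of a shared-memory reduction: once only one warp's worth of partial sums remains, the __syncthreads() calls are dropped and the shared-memory pointer is declared volatile so the compiler keeps the stores visible to the other threads of the warp. This sketch assumes the implicit warp-level synchrony described above and a block of at least 64 threads; architectures with independent thread scheduling require explicit warp synchronization instead.

    // Final warp-synchronous stage of a block sum reduction. The volatile qualifier
    // prevents the compiler from caching sdata values in registers between the steps.
    __device__ void warpReduce(volatile float *sdata, unsigned int tid)
    {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }

    // Usage inside the block reduction: stop the synchronized loop at 32 active
    // threads and let the first warp finish without __syncthreads().
    //
    //     for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
    //         if (tid < s) sdata[tid] += sdata[tid + s];
    //         __syncthreads();
    //     }
    //     if (tid < 32) warpReduce(sdata, tid);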


CUDA Sorting Networks
This sample implements bitonic sort and odd-even merge sort (also known as Batcher's sort), algorithms belonging to the class of sorting networks. While generally less efficient for large sequences than algorithms with better asymptotic complexity (e.g., merge sort or radix sort), they may be the algorithms of choice for sorting batches of short- to mid-sized (key, value) array pairs. A minimal bitonic sort sketch follows the download links below. Refer to the excellent tutorial by H. W. Lang: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/indexen.htm

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
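
A minimal sketch of the bitonic compare-and-swap network for a single block sorting a power-of-two number of keys in shared memory is shown below; the SDK sample additionally handles (key, value) pairs and arrays spanning many blocks in several stages. The kernel name is illustrative, and it is assumed to be launched with one thread per key and n * sizeof(unsigned int) bytes of dynamic shared memory.

    __global__ void bitonicSortBlock(unsigned int *data, unsigned int n)
    {
        extern __shared__ unsigned int s_key[];
        unsigned int tid = threadIdx.x;
        if (tid < n) s_key[tid] = data[tid];
        __syncthreads();

        for (unsigned int size = 2; size <= n; size <<= 1) {            // bitonic subsequence length
            for (unsigned int stride = size >> 1; stride > 0; stride >>= 1) {
                unsigned int partner = tid ^ stride;                     // element compared against
                if (partner > tid && partner < n) {
                    bool ascending = ((tid & size) == 0);                // direction of this subsequence
                    unsigned int a = s_key[tid], b = s_key[partner];
                    if ((a > b) == ascending) {                          // out of order for this direction
                        s_key[tid]     = b;
                        s_key[partner] = a;
                    }
                }
                __syncthreads();
            }
        }

        if (tid < n) data[tid] = s_key[tid];
    }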


Merge Sort
This sample implements a merge sort (also known as Batcher's sort), an algorithm belonging to the class of sorting networks. While generally less efficient on large sequences than algorithms with better asymptotic complexity (e.g., radix sort), it may be the algorithm of choice for sorting batches of short- to mid-sized (key, value) array pairs. Refer to the excellent tutorial by H. W. Lang: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/indexen.htm

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac


Mandelbrot
This sample uses CUDA to compute and display the Mandelbrot or Julia sets interactively. It also illustrates the use of "double single" arithmetic to improve precision when zooming a long way into the pattern; the basic addition operation is sketched after the download links below. The sample uses double-precision hardware if a GT200-class GPU is present. Thanks to Mark Granger of NewTek, who submitted this code sample.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
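
"Double single" arithmetic represents one high-precision value as a pair of floats, a high part plus a low-order correction term, which roughly doubles the effective mantissa for deep zooms. Below is a sketch of the addition operation based on the standard error-compensated two-float sum; the function name is illustrative, and the code has to be compiled without unsafe floating-point optimizations so the compensation terms are not folded away.

    __device__ float2 dsAdd(float2 a, float2 b)
    {
        float t1 = a.x + b.x;                  // add the high parts
        float e  = t1 - a.x;                   // recover the rounding error of that add
        float t2 = ((b.x - e) + (a.x - (t1 - e))) + a.y + b.y;
        float2 r;
        r.x = t1 + t2;                         // renormalize into a new high part...
        r.y = t2 - (r.x - t1);                 // ...and a low-order correction
        return r;
    }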


CUDA N-Body Simulation
This sample demonstrates an efficient all-pairs gravitational n-body simulation in CUDA and accompanies the GPU Gems 3 chapter "Fast N-Body Simulation with CUDA"; the core per-pair interaction is sketched after the download links below. With CUDA 5.5, performance on Tesla K20c has increased to over 1.8 TFLOP/s single precision, and double-precision performance has also improved on Kepler and Fermi GPU architectures. Starting in CUDA 4.0, the nBody sample has been updated to take advantage of new features that make it easy to scale the n-body simulation across multiple GPUs in a single PC. Adding "-numbodies=" to the command line lets users set the number of bodies for the simulation, and adding "-numdevices=" causes the sample to use N devices (if available) for the simulation. In this mode, the position and velocity data for all bodies are read from system memory using "zero copy" rather than from device memory. For a small number of devices (4 or fewer) and a large enough number of bodies, bandwidth is not a bottleneck, so we can achieve strong scaling across these devices.

Download - Windows (x86)
Download - Windows (x64)
Download - Linux/Mac
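
The core of the all-pairs approach, following the GPU Gems 3 chapter, is accumulating the acceleration that one body induces on another; a sketch is below. Positions are assumed to carry the body mass in the .w component, and the softening term avoids the singularity when two bodies coincide; the function name is illustrative.

    __device__ float3 bodyBodyInteraction(float4 bi, float4 bj, float3 ai, float softeningSq)
    {
        float3 r;
        r.x = bj.x - bi.x;                                 // vector from body i to body j
        r.y = bj.y - bi.y;
        r.z = bj.z - bi.z;

        float distSq  = r.x * r.x + r.y * r.y + r.z * r.z + softeningSq;
        float invDist = rsqrtf(distSq);
        float s = bj.w * invDist * invDist * invDist;      // m_j / dist^3

        ai.x += r.x * s;                                   // a_i += m_j * r / dist^3
        ai.y += r.y * s;
        ai.z += r.z * s;
        return ai;
    }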