NVIDIA CUDA C SDK - Data-Parallel Algorithms

The GPU Computing SDK includes 100+ code samples, utilities, whitepapers, and additional documentation to help you get started developing, porting, and optimizing your applications for the CUDA architecture. From this page you can quickly reach many of the SDK resources, browse the SDK documentation, or download the complete SDK.

Please note that you may need to install the latest NVIDIA drivers and CUDA Toolkit to compile and run the code samples.

Refer to the SDK release notes for more information.


CUDA Segmentation Tree Thrust Library: This sample demonstrates an approach to constructing image segmentation trees. The method is based on Boruvka's MST (minimum spanning tree) algorithm.
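
To illustrate the Boruvka step mentioned above, here is a minimal sequential C++ sketch. It is not the sample's Thrust code; the names `Edge`, `findRoot`, and `boruvkaMst` are hypothetical, and ties between equal weights are assumed to be broken by edge index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Edge { int u, v; float w; };  // hypothetical edge type for this sketch

// Union-find root lookup with path halving.
static int findRoot(std::vector<int>& parent, int x) {
    while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
    return x;
}

// Returns the indices of the edges forming a minimum spanning forest.
std::vector<int> boruvkaMst(int numVertices, const std::vector<Edge>& edges) {
    assert(numVertices >= 0);
    std::vector<int> parent(static_cast<std::size_t>(numVertices));
    for (int i = 0; i < numVertices; ++i) parent[i] = i;
    std::vector<int> chosen;
    bool merged = true;
    while (merged) {
        merged = false;
        // Cheapest outgoing edge per component; ties go to the lower index.
        std::vector<int> cheapest(static_cast<std::size_t>(numVertices), -1);
        for (int e = 0; e < static_cast<int>(edges.size()); ++e) {
            const int ru = findRoot(parent, edges[e].u);
            const int rv = findRoot(parent, edges[e].v);
            if (ru == rv) continue;  // edge is internal to one component
            for (int r : {ru, rv}) {
                const int c = cheapest[r];
                if (c < 0 || edges[e].w < edges[c].w) cheapest[r] = e;
            }
        }
        // Merge components along the selected edges.
        for (int r = 0; r < numVertices; ++r) {
            const int e = cheapest[r];
            if (e < 0) continue;
            const int ru = findRoot(parent, edges[e].u);
            const int rv = findRoot(parent, edges[e].v);
            if (ru == rv) continue;  // already joined earlier in this round
            parent[ru] = rv;
            chosen.push_back(e);
            merged = true;
        }
    }
    return chosen;
}
```

Each round at least halves the number of components, which is what makes the algorithm attractive for data-parallel hardware: the per-component minimum search is independent across components.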

Fast Walsh Transform: Naturally (Hadamard) ordered Fast Walsh Transform for batched vectors of arbitrary eligible (power-of-two) lengths.
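
As a rough illustration of the transform itself, the following sequential C++ sketch computes an unnormalized Walsh-Hadamard transform in natural (Hadamard) order. It is a reference-style sketch under the power-of-two assumption stated above, not the sample's CUDA kernel; the name `fwtNatural` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place, unnormalized Walsh-Hadamard transform in natural (Hadamard) order.
void fwtNatural(std::vector<float>& data) {
    const std::size_t n = data.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // power-of-two lengths only
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t base = 0; base < n; base += 2 * stride) {
            for (std::size_t j = base; j < base + stride; ++j) {
                // Butterfly: replace the pair (a, b) with (a + b, a - b).
                const float a = data[j];
                const float b = data[j + stride];
                data[j] = a + b;
                data[j + stride] = a - b;
            }
        }
    }
}
```

All butterflies within one `stride` pass are independent of each other, so each pass maps naturally onto one thread per pair.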

CUDA Histogram: This sample demonstrates efficient implementations of 64-bin and 256-bin histograms.
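
One common way to parallelize a histogram is to build private partial histograms and merge them afterwards. The sequential C++ sketch below imitates that structure for the 256-bin case; whether the sample uses exactly this scheme is an assumption, and `histogram256` and `numChunks` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 256-bin histogram of bytes, built from per-chunk partial histograms.
std::array<std::uint32_t, 256> histogram256(const std::vector<std::uint8_t>& data,
                                            std::size_t numChunks) {
    assert(numChunks > 0);
    std::vector<std::array<std::uint32_t, 256>> partial(numChunks);
    for (auto& p : partial) p.fill(0);

    // Each chunk stands in for an independent worker with a private histogram.
    for (std::size_t c = 0; c < numChunks; ++c) {
        const std::size_t begin = data.size() * c / numChunks;
        const std::size_t end = data.size() * (c + 1) / numChunks;
        for (std::size_t i = begin; i < end; ++i) ++partial[c][data[i]];
    }

    // Merge step: sum the partial histograms bin by bin.
    std::array<std::uint32_t, 256> total{};
    for (const auto& p : partial)
        for (std::size_t b = 0; b < 256; ++b) total[b] += p[b];
    return total;
}
```

Private histograms avoid contention on shared counters; the cost is the extra merge pass at the end.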

Line of Sight: This sample implements a simple line-of-sight algorithm: given a height map and a ray originating at some observation point, it computes all the points along the ray that are visible from the observation point. The implementation is based on the Thrust library (http://code.google.com/p/thrust/).
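
One common formulation of this problem is a running maximum (a max-scan) over the slopes from the observer to each sample: a point is visible if no earlier point subtends a steeper slope. The C++ sketch below shows that idea sequentially; it assumes evenly spaced samples along the ray, and `visibleAlongRay`, `observerHeight`, and `step` are hypothetical names rather than the sample's API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// heights[i] is the terrain height at distance (i + 1) * step from the observer.
std::vector<bool> visibleAlongRay(const std::vector<float>& heights,
                                  float observerHeight, float step) {
    assert(step > 0.0f);
    std::vector<bool> visible(heights.size());
    bool first = true;
    float maxSlope = 0.0f;
    for (std::size_t i = 0; i < heights.size(); ++i) {
        const float distance = static_cast<float>(i + 1) * step;
        const float slope = (heights[i] - observerHeight) / distance;
        // Running maximum of the slopes seen so far (inclusive max-scan).
        if (first || slope > maxSlope) {
            maxSlope = slope;
            first = false;
        }
        // Visible only if nothing closer to the observer is steeper.
        visible[i] = (slope >= maxSlope);
    }
    return visible;
}
```

For heights {1, 3, 2, 10} with the observer at height 0 and unit spacing, the third point is hidden behind the second.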

CUDA Parallel Reduction: A parallel sum reduction that computes the sum of a large array of values. This sample demonstrates several important optimization strategies for data-parallel algorithms like reduction.
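
The access pattern behind a tree-based sum reduction can be sketched sequentially as follows. This is only an illustration of the pairwise-halving idea, not the sample's optimized kernels; `treeReduceSum` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pairwise (tree) sum reduction. The argument is taken by value so it can be
// used as scratch space.
float treeReduceSum(std::vector<float> values) {
    assert(!values.empty());
    std::size_t active = values.size();
    while (active > 1) {
        // Fold the upper half onto the lower half; for odd counts the middle
        // element is carried over unchanged.
        const std::size_t half = (active + 1) / 2;
        for (std::size_t i = 0; i < active / 2; ++i) {
            values[i] += values[i + half];
        }
        active = half;
    }
    return values[0];
}
```

Every addition inside one `while` iteration is independent, so a parallel version needs only about log2(n) synchronized steps.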

CUDA Parallel Prefix Sum (Scan): This example demonstrates an efficient CUDA implementation of parallel prefix sum, also known as "scan". Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array.
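
The definition above is an exclusive scan. A well-known work-efficient formulation uses an up-sweep followed by a down-sweep over a balanced tree; the sequential C++ sketch below follows that scheme for power-of-two lengths. It is an illustration only, and `exclusiveScan` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place exclusive prefix sum using up-sweep and down-sweep phases.
void exclusiveScan(std::vector<int>& a) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // power-of-two lengths only

    // Up-sweep: build partial sums at the internal nodes of the tree.
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            a[i] += a[i - stride];
        }
    }

    // Clear the root, then push prefixes back down the tree.
    a[n - 1] = 0;
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            const int left = a[i - stride];
            a[i - stride] = a[i];
            a[i] += left;
        }
    }
}
```

For the input {3, 1, 7, 0, 4, 1, 6, 3} the result is {0, 3, 4, 11, 11, 15, 16, 22}.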

CUDA Separable Convolution: This sample implements a separable convolution filter of a 2D signal with a Gaussian kernel.
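
A separable filter replaces one 2D convolution with a row pass followed by a column pass using the same 1D kernel. The C++ sketch below shows the two passes sequentially; the clamp-to-edge border handling is an assumption of this sketch, and `convolveSeparable` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row pass then column pass with one 1D kernel; borders are clamped to the edge.
std::vector<float> convolveSeparable(const std::vector<float>& image,
                                     int width, int height,
                                     const std::vector<float>& kernel) {
    assert(width > 0 && height > 0);
    assert(image.size() ==
           static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    assert(kernel.size() % 2 == 1);  // odd length so the kernel has a center
    const int radius = static_cast<int>(kernel.size() / 2);
    std::vector<float> rows(image.size()), out(image.size());

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                const int xs = std::max(0, std::min(x + k, width - 1));
                sum += image[y * width + xs] * kernel[radius - k];
            }
            rows[y * width + x] = sum;
        }
    }
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                const int ys = std::max(0, std::min(y + k, height - 1));
                sum += rows[ys * width + x] * kernel[radius - k];
            }
            out[y * width + x] = sum;
        }
    }
    return out;
}
```

For a kernel of radius r, the two passes cost 2(2r + 1) multiply-adds per pixel instead of (2r + 1)^2 for the direct 2D form.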

Texture-based Separable Convolution: Texture-based implementation of a separable 2D convolution with a Gaussian kernel. Used for performance comparison against convolutionSeparable.

threadFenceReduction: This sample shows how to perform a reduction operation on an array of values using the thread fence intrinsic to produce a single value in a single kernel (as opposed to two or more kernel calls, as shown in the "reduction" SDK sample). Single-pass reduction requires global atomic instructions (Compute Capability 1.1 or later) and the __threadfence() intrinsic (CUDA 2.2 or later).

CUDA Radix Sort using the Thrust Library: This sample demonstrates a very fast and efficient parallel radix sort that uses the Thrust library (http://code.google.com/p/thrust/). The included RadixSort class can sort either key-value pairs (with float or unsigned integer keys) or keys only.
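
For readers unfamiliar with radix sort, the sequential C++ sketch below sorts key-value pairs with unsigned integer keys using a least-significant-digit scheme, 8 bits per pass. It does not reproduce the RadixSort class or the Thrust internals; `radixSortPairs` is a hypothetical name.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stable LSD radix sort of (key, value) pairs, 8 bits per pass.
void radixSortPairs(std::vector<std::uint32_t>& keys,
                    std::vector<std::uint32_t>& values) {
    assert(keys.size() == values.size());
    const std::size_t n = keys.size();
    std::vector<std::uint32_t> keysTmp(n), valuesTmp(n);
    for (int shift = 0; shift < 32; shift += 8) {
        // 1. Histogram of the current digit.
        std::array<std::size_t, 256> offsets{};
        for (std::uint32_t k : keys) ++offsets[(k >> shift) & 0xFFu];
        // 2. Exclusive prefix sum turns counts into output offsets.
        std::size_t running = 0;
        for (std::size_t& o : offsets) {
            const std::size_t count = o;
            o = running;
            running += count;
        }
        // 3. Stable scatter into the temporary buffers.
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t dst = offsets[(keys[i] >> shift) & 0xFFu]++;
            keysTmp[dst] = keys[i];
            valuesTmp[dst] = values[i];
        }
        keys.swap(keysTmp);
        values.swap(valuesTmp);
    }
}
```

The three steps per pass (histogram, prefix sum, scatter) are themselves data-parallel primitives, which is why radix sort suits GPUs well.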

CUDA Sorting Networks: This sample implements bitonic sort and odd-even merge sort (also known as Batcher's sort), algorithms belonging to the class of sorting networks. While generally less efficient on large sequences than algorithms with better asymptotic complexity (e.g. merge sort or radix sort), they may be the algorithms of choice for sorting batches of short- to mid-sized (key, value) array pairs. Refer to the excellent tutorial by H. W. Lang: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/indexen.htm
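
The fixed compare-exchange pattern of a sorting network is easiest to see in bitonic sort. The following sequential C++ sketch sorts a power-of-two-length array in ascending order; it covers keys only and is not the sample's kernel, and `bitonicSort` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Bitonic sorting network, ascending order, power-of-two lengths only.
void bitonicSort(std::vector<int>& a) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    for (std::size_t k = 2; k <= n; k *= 2) {          // size of bitonic runs
        for (std::size_t j = k / 2; j > 0; j /= 2) {   // compare distance
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner <= i) continue;  // handle each pair once
                const bool ascending = (i & k) == 0;
                // The comparison schedule depends only on the indices,
                // never on the data: that is what makes it a network.
                if (ascending ? (a[i] > a[partner]) : (a[i] < a[partner])) {
                    std::swap(a[i], a[partner]);
                }
            }
        }
    }
}
```

Because the schedule is data-independent, every compare-exchange in the innermost loop can run in parallel, at the price of O(n log^2 n) comparisons.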

Merge Sort: This sample implements a merge sort. While generally less efficient on large sequences than radix sort, it may be the algorithm of choice for sorting batches of short- to mid-sized (key, value) array pairs. Refer to the tutorial by H. W. Lang: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/indexen.htm
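
A bottom-up formulation is the usual starting point for a parallel merge sort, because all merges of a given run width are independent. The C++ sketch below shows that structure sequentially using `std::merge`; how the sample actually organizes its merges is not stated here, and `mergeSortBottomUp` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Bottom-up merge sort: merge sorted runs of width 1, 2, 4, ... in turn.
void mergeSortBottomUp(std::vector<int>& a) {
    const std::size_t n = a.size();
    std::vector<int> buffer(n);
    for (std::size_t width = 1; width < n; width *= 2) {
        for (std::size_t lo = 0; lo < n; lo += 2 * width) {
            const auto first = a.begin() + static_cast<std::ptrdiff_t>(lo);
            const auto mid =
                a.begin() + static_cast<std::ptrdiff_t>(std::min(lo + width, n));
            const auto last =
                a.begin() + static_cast<std::ptrdiff_t>(std::min(lo + 2 * width, n));
            // Merge the two adjacent sorted runs [first, mid) and [mid, last).
            std::merge(first, mid, mid, last,
                       buffer.begin() + static_cast<std::ptrdiff_t>(lo));
        }
        a.swap(buffer);
    }
    assert(std::is_sorted(a.begin(), a.end()));  // debug-only self-check
}
```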

Mandelbrot: This sample uses CUDA to compute and display the Mandelbrot or Julia sets interactively. It also illustrates the use of "double single" arithmetic to improve precision when zooming a long way into the pattern. This sample uses double-precision hardware if a GT200-class GPU is present. Thanks to Mark Granger of NewTek, who submitted this sample to the SDK!
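
"Double single" arithmetic represents one value as the unevaluated sum of two single-precision floats. The C++ sketch below shows only the addition, using the classic two-sum error term; it assumes strict IEEE single-precision evaluation (no fast-math style optimizations), and `DoubleSingle` and `dsAdd` are hypothetical names rather than the sample's functions.

```cpp
#include <cassert>
#include <cmath>

// A value stored as hi + lo, where lo holds what does not fit into hi.
struct DoubleSingle { float hi; float lo; };

DoubleSingle dsAdd(DoubleSingle x, DoubleSingle y) {
    assert(std::isfinite(x.hi) && std::isfinite(y.hi));
    // Two-sum: s is the rounded sum, err is the exact rounding error.
    const float s = x.hi + y.hi;
    const float bb = s - x.hi;
    float err = (x.hi - (s - bb)) + (y.hi - bb);
    // Fold in the low-order parts.
    err += x.lo + y.lo;
    // Renormalize so that hi again carries the leading bits.
    const float hi = s + err;
    const float lo = err - (hi - s);
    return DoubleSingle{hi, lo};
}
```

In plain single precision, 1.0f + 1e-10f rounds back to 1.0f; with the pair representation the small term survives in `lo`, which is what keeps deep zooms from collapsing into blocks of identical coordinates.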

CUDA N-Body Simulation: This sample demonstrates an efficient all-pairs gravitational n-body simulation in CUDA. It accompanies the GPU Gems 3 chapter "Fast N-Body Simulation with CUDA". With CUDA 4.0, the nBody sample has been updated to take advantage of new features to easily scale the n-body simulation across multiple GPUs in a single PC. Adding “-numdevices=” followed by a device count N to the command line will cause the sample to use N devices (if available) for simulation. In this mode, the position and velocity data for all bodies are read from system memory using “zero copy” rather than from device memory. For a small number of devices (4 or fewer) and a large enough number of bodies, bandwidth is not a bottleneck, so strong scaling can be achieved across these devices.
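
The all-pairs approach evaluates the force on every body from every other body, which is O(n^2) work per step. The sequential C++ sketch below computes softened gravitational accelerations in units where G = 1; it is an illustration only, and `Vec3`, `allPairsAccelerations`, and `softeningSquared` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Acceleration on each body from all other bodies (G = 1).
std::vector<Vec3> allPairsAccelerations(const std::vector<Vec3>& positions,
                                        const std::vector<float>& masses,
                                        float softeningSquared) {
    assert(positions.size() == masses.size());
    assert(softeningSquared >= 0.0f);
    const std::size_t n = positions.size();
    std::vector<Vec3> acc(n, Vec3{0.0f, 0.0f, 0.0f});
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            if (i == j) continue;  // no self-interaction
            const float dx = positions[j].x - positions[i].x;
            const float dy = positions[j].y - positions[i].y;
            const float dz = positions[j].z - positions[i].z;
            // Softening keeps the force finite for very close bodies.
            const float distSqr = dx * dx + dy * dy + dz * dz + softeningSquared;
            const float invDist = 1.0f / std::sqrt(distSqr);
            const float scale = masses[j] * invDist * invDist * invDist;
            acc[i].x += dx * scale;
            acc[i].y += dy * scale;
            acc[i].z += dz * scale;
        }
    }
    return acc;
}
```

The outer loop over `i` is the natural unit of parallelism: each body's acceleration can be accumulated independently, whether across threads on one GPU or across several devices.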