The GPU Computing SDK includes 100+ code samples, utilities, whitepapers, and additional documentation to help you get started developing, porting, and optimizing your applications for the CUDA architecture. This page gives quick access to many of the SDK resources; you can also consult the SDK documentation or download the complete SDK.
Please note that you may need to install the latest NVIDIA drivers and CUDA Toolkit to compile and run the code samples.
Refer to the SDK release notes for more information.
CUDA Segmentation Tree Thrust Library
This sample demonstrates an approach to constructing image segmentation trees. The method is based on Borůvka's minimum spanning tree (MST) algorithm.
Download - Windows (x86)

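As a rough illustration of the algorithm named above, here is a minimal sequential sketch of Borůvka's MST construction in Python. It is not the sample's CUDA/Thrust code: the function name `boruvka_mst`, the `(weight, u, v)` edge format, and the tie-breaking by list position are assumptions made for this example, and it stops at the spanning tree rather than building a segmentation tree.

```python
def boruvka_mst(num_vertices, edges):
    """Return the edges of a minimum spanning forest.

    edges: list of (weight, u, v) tuples. Ties between equal weights are
    broken by position in the list (an assumption of this sketch).
    """
    parent = list(range(num_vertices))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    while True:
        # Every component selects its cheapest outgoing edge.
        cheapest = {}
        for index, (weight, u, v) in enumerate(edges):
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for root in (ru, rv):
                if root not in cheapest or (weight, index) < cheapest[root][0]:
                    cheapest[root] = ((weight, index), u, v)
        if not cheapest:
            break
        # Merge components along the selected edges, skipping duplicates.
        for (weight, _), u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.append((weight, u, v))
    return mst
```

Each round at least halves the number of components, which is what makes the per-round "pick cheapest edge" step attractive for data-parallel hardware.
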
Fast Walsh Transform
Naturally (Hadamard) ordered Fast Walsh Transform for batched vectors of arbitrary eligible (power-of-two) lengths.
Download - Windows (x86)

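For readers unfamiliar with the transform, the following is a minimal sequential sketch of a naturally (Hadamard) ordered, unnormalized fast Walsh transform for a single vector. The function name `fast_walsh_transform` is an assumption for this example, and the sketch does not reproduce the sample's batched GPU implementation.

```python
def fast_walsh_transform(values):
    """Return the naturally (Hadamard) ordered Walsh transform of values."""
    data = list(values)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    half = 1
    while half < n:
        # Butterfly stage: combine elements that are `half` apart.
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = data[i], data[i + half]
                data[i], data[i + half] = a + b, a - b
        half *= 2
    return data
```

Because the transform is unnormalized, applying it twice returns the input scaled by the vector length.
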
CUDA Histogram
This sample demonstrates an efficient implementation of 64-bin and 256-bin histograms.
Download - Windows (x86)

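One common way to organize a parallel histogram is to let each group of workers fill a private partial histogram and then merge the partial results. The sketch below illustrates that idea sequentially in Python; whether the sample uses exactly this scheme is an assumption, and the name `histogram256` and the `chunk_size` parameter are hypothetical.

```python
def histogram256(data, chunk_size=4):
    """Count byte values by merging per-chunk partial histograms."""
    total = [0] * 256
    for start in range(0, len(data), chunk_size):
        partial = [0] * 256  # private histogram for this chunk
        for byte in data[start:start + chunk_size]:
            partial[byte] += 1
        # Merge step: fold the partial histogram into the final result.
        for bin_index, count in enumerate(partial):
            total[bin_index] += count
    return total
```
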
Line of Sight
This sample is an implementation of a simple line-of-sight algorithm: given a height map and a ray originating at some observation point, it computes all the points along the ray that are visible from the observation point. The implementation is based on the Thrust library (http://code.google.com/p/thrust/).
Download - Windows (x86)

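A line-of-sight test along a single ray can be phrased as a running maximum (a max-scan) over the slope from the observer to each sample. The sketch below shows that formulation in plain Python; the function name `line_of_sight`, the uniform `spacing` parameter, and the convention that ties count as visible are assumptions for this example rather than details taken from the sample.

```python
from itertools import accumulate


def line_of_sight(heights, observer_height, spacing=1.0):
    """Return a visibility flag for each height sample along the ray.

    heights[i] is the terrain height at distance (i + 1) * spacing
    from the observer.
    """
    slopes = [(h - observer_height) / ((i + 1) * spacing)
              for i, h in enumerate(heights)]
    # Inclusive max-scan: steepest slope seen up to and including each point.
    running_max = list(accumulate(slopes, max))
    # A point is visible when nothing in front of it has a steeper slope.
    return [s >= m for s, m in zip(slopes, running_max)]
```
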
CUDA Parallel Reduction
A parallel sum reduction that computes the sum of a large array of values. This sample demonstrates several important optimization strategies for data-parallel algorithms like reduction.
Download - Windows (x86)

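The core pattern behind a parallel sum reduction is a pairwise tree in which the active range halves at every step. Below is a minimal sequential sketch of that pattern; the name `tree_reduce_sum` is hypothetical, the inner loop stands in for what independent GPU threads would do concurrently, and none of the sample's specific optimizations are reproduced.

```python
def tree_reduce_sum(values):
    """Sum values with a pairwise tree, halving the active range each step."""
    data = list(values)
    n = len(data)
    while n > 1:
        half = (n + 1) // 2
        # "Thread" i adds element i + half into element i.
        for i in range(n - half):
            data[i] += data[i + half]
        n = half
    return data[0] if data else 0
```
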
CUDA Parallel Prefix Sum (Scan)
This example demonstrates an efficient CUDA implementation of parallel prefix sum, also known as "scan". Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array.
Download - Windows (x86)

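The definition above describes an exclusive scan. One well-known work-efficient formulation uses an up-sweep followed by a down-sweep over a power-of-two-sized array; the sketch below shows that formulation sequentially. Whether the sample uses this exact scheme is an assumption, and the name `exclusive_scan` is hypothetical.

```python
def exclusive_scan(values):
    """Work-efficient exclusive prefix sum (up-sweep, then down-sweep)."""
    n = len(values)
    size = 1
    while size < n:
        size *= 2
    data = list(values) + [0] * (size - n)  # pad to a power of two

    # Up-sweep: build partial sums in place.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    data[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2
    return data[:n]
```
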
CUDA Separable Convolution
This sample implements a separable convolution filter of a 2D signal with a Gaussian kernel.
Download - Windows (x86)

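Separability means a 2D filter can be applied as a 1D pass over the rows followed by a 1D pass over the columns. The sketch below illustrates this in plain Python with zero padding at the borders; the function names and the border handling are assumptions for this example, and the kernel is applied without flipping, which makes no difference for a symmetric kernel such as a Gaussian.

```python
def convolve_rows(image, kernel):
    """Filter every row with a 1D kernel; out-of-range pixels count as 0."""
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        out_row = []
        for x in range(width):
            acc = 0.0
            for k, weight in enumerate(kernel):
                src = x + k - radius
                if 0 <= src < width:
                    acc += weight * row[src]
            out_row.append(acc)
        out.append(out_row)
    return out


def transpose(image):
    """Swap rows and columns of a rectangular list of lists."""
    return [list(col) for col in zip(*image)]


def separable_convolve(image, kernel):
    """Row pass followed by a column pass with the same 1D kernel."""
    rows_done = convolve_rows(image, kernel)
    return transpose(convolve_rows(transpose(rows_done), kernel))
```

Filtering a single-pixel impulse this way reproduces the outer product of the 1D kernel with itself, which is the equivalent 2D kernel.
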
Texture-based Separable Convolution
Texture-based implementation of a separable 2D convolution with a Gaussian kernel. Used for performance comparison against convolutionSeparable.
Download - Windows (x86)

threadFenceReduction
This sample shows how to perform a reduction operation on an array of values using the __threadfence() intrinsic to produce a single value in a single kernel (as opposed to two or more kernel calls, as shown in the "reduction" SDK sample). Single-pass reduction requires global atomic instructions (Compute Capability 1.1 or later) and the __threadfence() intrinsic (CUDA 2.2 or later).
Download - Windows (x86)

CUDA Radix Sort using the Thrust Library
This sample demonstrates a very fast and efficient parallel radix sort that uses the Thrust library (http://code.google.com/p/thrust/). The included RadixSort class can sort either key-value pairs (with float or unsigned integer keys) or keys only.
Download - Windows (x86)

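To show the general technique, here is a minimal sequential sketch of a stable least-significant-digit radix sort on (key, value) pairs with unsigned integer keys. It does not use Thrust and is not the sample's RadixSort class; the function name `radix_sort_pairs` and the `key_bits` and `radix_bits` parameters are assumptions for this example.

```python
def radix_sort_pairs(keys, values, key_bits=32, radix_bits=4):
    """Stable LSD radix sort of (key, value) pairs with unsigned integer keys."""
    pairs = list(zip(keys, values))
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        # Count how many keys fall into each digit bucket.
        counts = [0] * (mask + 1)
        for key, _ in pairs:
            counts[(key >> shift) & mask] += 1
        # An exclusive prefix sum turns counts into output offsets.
        offsets, running = [], 0
        for count in counts:
            offsets.append(running)
            running += count
        # Scatter each pair, preserving input order within a bucket.
        output = [None] * len(pairs)
        for key, value in pairs:
            digit = (key >> shift) & mask
            output[offsets[digit]] = (key, value)
            offsets[digit] += 1
        pairs = output
    return [k for k, _ in pairs], [v for _, v in pairs]
```

The count, prefix-sum, and scatter steps of each pass are the parts that map naturally onto parallel histogram and scan primitives.
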
CUDA Sorting Networks
This sample implements bitonic sort and odd-even merge sort (also known as Batcher's sort), algorithms belonging to the class of sorting networks. While generally less efficient on large sequences than algorithms with better asymptotic complexity (e.g. merge sort or radix sort), they may be the algorithms of choice for sorting batches of short- to mid-sized (key, value) array pairs. Refer to the excellent tutorial by H. W. Lang: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/indexen.htm
Download - Windows (x86)

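As an illustration of a sorting network, the sketch below implements a bitonic sort for power-of-two-length inputs in plain Python. The compare-exchange pattern is fixed in advance and does not depend on the data, which is the property that makes such networks easy to parallelize. The function name `bitonic_sort` is hypothetical, the sketch sorts keys only, and the loop over `i` stands in for independent GPU threads.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length sequence with a bitonic sorting network."""
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    size = 2
    while size <= n:
        stride = size // 2
        while stride >= 1:
            for i in range(n):
                partner = i ^ stride
                if partner > i:
                    # Blocks alternate between ascending and descending order.
                    ascending = (i & size) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            stride //= 2
        size *= 2
    return data
```
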
Merge Sort
This sample implements a merge sort. While generally less efficient on large sequences than algorithms with better asymptotic complexity (e.g. radix sort), it may be the algorithm of choice for sorting batches of short- to mid-sized (key, value) array pairs. Refer to the excellent tutorial by H. W. Lang: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/indexen.htm
Download - Windows (x86)

Mandelbrot
This sample uses CUDA to compute and display the Mandelbrot or Julia sets interactively. It also illustrates the use of "double single" arithmetic to improve precision when zooming a long way into the pattern. This sample uses double-precision hardware if a GT200-class GPU is present. Thanks to Mark Granger of NewTek, who submitted this sample to the SDK!
Download - Windows (x86)

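"Double single" arithmetic, as the term is commonly used, represents one value as an unevaluated sum of two floating-point words so that rounding errors are carried along instead of discarded. The sketch below shows the idea with an error-free addition; it is an analogy rather than the sample's code, because Python floats are double precision, and the names `two_sum` and `pair_add` are assumptions for this example.

```python
def two_sum(a, b):
    """Return (s, e) where s is the rounded sum and s + e equals a + b exactly."""
    s = a + b
    b_virtual = s - a
    e = (a - (s - b_virtual)) + (b - b_virtual)
    return s, e


def pair_add(x, y):
    """Add two (hi, lo) pairs, keeping the rounding error in the lo word."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    # Renormalize so that hi holds the leading part of the result.
    hi = s + e
    lo = e - (hi - s)
    return hi, lo
```

For instance, adding `1e-20` to `1.0` in plain floating point leaves `1.0` unchanged, while `pair_add((1.0, 0.0), (1e-20, 0.0))` keeps the small term in the low word. Retaining such small offsets is what matters when zooming deep into the set.
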
CUDA N-Body Simulation
This sample demonstrates an efficient all-pairs gravitational n-body simulation in CUDA. This sample accompanies the GPU Gems 3 chapter "Fast N-Body Simulation with CUDA". With CUDA 4.0, the nBody sample has been updated to take advantage of new features to easily scale the n-body simulation across multiple GPUs in a single PC. Adding “-numdevices=
Download - Windows (x86)
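
To make the all-pairs approach concrete, the sketch below computes the gravitational acceleration on every body from every other body in plain Python. The function name `all_pairs_accelerations`, the `softening` parameter, and the gravitational constant `g` defaulting to 1 are assumptions for this example; it covers only the force step, not time integration or multi-GPU scaling.

```python
import math


def all_pairs_accelerations(positions, masses, softening=1e-3, g=1.0):
    """Return the gravitational acceleration on each body from all others."""
    accelerations = []
    for i, (xi, yi, zi) in enumerate(positions):
        ax = ay = az = 0.0
        for j, (xj, yj, zj) in enumerate(positions):
            if i == j:
                continue
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            # Softening keeps the force finite when bodies get very close.
            dist_sq = dx * dx + dy * dy + dz * dz + softening * softening
            inv_dist_cubed = 1.0 / (dist_sq * math.sqrt(dist_sq))
            ax += g * masses[j] * dx * inv_dist_cubed
            ay += g * masses[j] * dy * inv_dist_cubed
            az += g * masses[j] * dz * inv_dist_cubed
        accelerations.append((ax, ay, az))
    return accelerations
```

The outer loop over `i` is the part that would be distributed across threads, since each body's acceleration can be computed independently.
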