NVIDIA CUDA SDK - Data-Parallel Algorithms

The CUDA Developer SDK provides examples with source code, utilities, and white papers to help you get started writing software with CUDA. The SDK includes dozens of code samples covering a wide range of applications including:

Simple techniques such as C++ code integration and efficient loading of custom datatypes
How-to examples covering the CUDA BLAS and FFT libraries, texture fetching in CUDA, and CUDA interoperation with the OpenGL and Direct3D graphics APIs
Linear algebra primitives such as matrix transpose and matrix-matrix multiplication
Data-parallel algorithms such as parallel prefix sum of large arrays
Performance: profiling using timers and bandwidth tests
Advanced application examples such as image convolution, Black-Scholes options pricing and binomial options pricing

Refer to the following READMEs for more information (Linux, Windows).

This code is released free of charge for use in derivative works, whether academic, commercial, or personal. (Full License)

The NVIDIA CUDA Toolkit is required to compile and run the code samples.



256-bin Histogram
This sample demonstrates an efficient implementation of a 256-bin histogram. A whitepaper accompanies this sample.
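As a rough illustration of the idea behind a data-parallel histogram, here is a minimal CPU reference sketch in C++. It is not the SDK's kernel: the function name `histogram256` and the `numBlocks` parameter are hypothetical. It only assumes the common strategy in which each block of work fills a private partial histogram and the partial results are merged afterwards.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Histogram256 = std::array<std::uint32_t, 256>;

// Hypothetical helper, not part of the SDK: 256-bin histogram of byte data.
// Each "block" (a contiguous chunk of the input) fills its own partial
// histogram, so blocks never contend for the same counters. On a GPU the
// partial histogram would typically live in per-block shared memory.
Histogram256 histogram256(const std::vector<std::uint8_t>& data,
                          std::size_t numBlocks) {
    assert(numBlocks > 0);
    std::vector<Histogram256> partial(numBlocks, Histogram256{});
    const std::size_t chunk = (data.size() + numBlocks - 1) / numBlocks;

    // Phase 1: independent per-block counting.
    for (std::size_t b = 0; b < numBlocks; ++b) {
        const std::size_t begin = std::min(b * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        for (std::size_t i = begin; i < end; ++i) {
            ++partial[b][data[i]];
        }
    }

    // Phase 2: merge the partial histograms bin by bin.
    Histogram256 result{};
    for (const Histogram256& p : partial) {
        for (std::size_t bin = 0; bin < result.size(); ++bin) {
            result[bin] += p[bin];
        }
    }
    return result;
}
```

The result does not depend on `numBlocks`; the parameter only changes how the counting work is partitioned.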


64-bin Histogram
This sample demonstrates an efficient implementation of a 64-bin histogram. A whitepaper accompanies this sample.


Separable Convolution
This sample implements a separable convolution filter of a 2D signal with a Gaussian kernel. A whitepaper accompanies this sample.
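A separable filter applies a 1D kernel along the rows and then along the columns instead of evaluating a full 2D kernel. The following C++ sketch shows that two-pass structure on the CPU. The names `gaussianKernel` and `convolveSeparable` are hypothetical, and the clamp-to-edge border handling is an assumption made for this example, not necessarily what the SDK sample does.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: normalized 1D Gaussian kernel with 2*radius+1 taps.
std::vector<float> gaussianKernel(int radius, float sigma) {
    assert(radius >= 0 && sigma > 0.0f);
    std::vector<float> k(static_cast<std::size_t>(2 * radius + 1));
    float sum = 0.0f;
    for (int i = -radius; i <= radius; ++i) {
        const float v = std::exp(-static_cast<float>(i * i) / (2.0f * sigma * sigma));
        k[static_cast<std::size_t>(i + radius)] = v;
        sum += v;
    }
    for (float& v : k) v /= sum;  // taps sum to 1
    return k;
}

// Hypothetical helper: row pass followed by column pass over a row-major
// image. Out-of-range samples are clamped to the nearest edge (assumption).
std::vector<float> convolveSeparable(const std::vector<float>& src,
                                     int width, int height,
                                     const std::vector<float>& kernel) {
    assert(width > 0 && height > 0);
    assert(kernel.size() % 2 == 1);
    assert(src.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    const int radius = static_cast<int>(kernel.size()) / 2;
    std::vector<float> tmp(src.size()), dst(src.size());

    // Row pass: every output pixel is independent of the others.
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                const int sx = std::min(std::max(x + k, 0), width - 1);
                sum += src[static_cast<std::size_t>(y * width + sx)] *
                       kernel[static_cast<std::size_t>(k + radius)];
            }
            tmp[static_cast<std::size_t>(y * width + x)] = sum;
        }
    }

    // Column pass over the row-filtered intermediate image.
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                const int sy = std::min(std::max(y + k, 0), height - 1);
                sum += tmp[static_cast<std::size_t>(sy * width + x)] *
                       kernel[static_cast<std::size_t>(k + radius)];
            }
            dst[static_cast<std::size_t>(y * width + x)] = sum;
        }
    }
    return dst;
}
```

For a kernel with 2r+1 taps, the two passes cost roughly 2(2r+1) multiply-adds per pixel instead of (2r+1)^2 for the equivalent 2D kernel.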


Texture-based Separable Convolution
Texture-based implementation of a separable 2D convolution with a Gaussian kernel. Used for performance comparison against convolutionSeparable.


Bitonic Sort
Bitonic sort is a simple parallel sorting algorithm that is very efficient when sorting a small number of elements (see http://citeseer.ist.psu.edu/blelloch98experimental.html). This implementation is based on http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm
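The compare-and-swap network behind bitonic sort can be sketched on the CPU as follows. This is a generic textbook formulation in C++, not the SDK's kernel; the function name `bitonicSort` is hypothetical, and the input length is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: in-place ascending bitonic sort.
// The two outer loops enumerate the stages of the sorting network; within
// one stage, every index i is independent, which is what a GPU would
// assign to one thread per element.
void bitonicSort(std::vector<int>& a) {
    const std::size_t n = a.size();
    if (n < 2) return;
    assert((n & (n - 1)) == 0);  // length must be a power of two

    for (std::size_t k = 2; k <= n; k <<= 1) {          // size of bitonic sequences
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // compare distance
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    const bool outOfOrder =
                        ascending ? (a[i] > a[partner]) : (a[i] < a[partner]);
                    if (outOfOrder) std::swap(a[i], a[partner]);
                }
            }
        }
    }
}
```

The network performs a fixed sequence of comparisons regardless of the data, which is why it maps well to lockstep parallel hardware.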


N-Body Simulation
This sample demonstrates an efficient all-pairs gravitational n-body simulation in CUDA. It accompanies the GPU Gems 3 chapter "Fast N-Body Simulation with CUDA". A whitepaper accompanies this sample.
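An all-pairs simulation evaluates the force between every pair of bodies, which is O(n^2) work per step but completely regular. The C++ sketch below computes accelerations that way on the CPU. The types `Body` and `Vec3`, the function `computeAccelerations`, the choice of G = 1, and the softening term are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Body { float x, y, z, mass; };  // hypothetical layout
struct Vec3 { float x, y, z; };

// Hypothetical helper: all-pairs gravitational accelerations with G = 1.
// The softening term keeps the denominator nonzero, so the i == j pair
// contributes exactly zero and needs no special case.
std::vector<Vec3> computeAccelerations(const std::vector<Body>& bodies,
                                       float softening) {
    assert(softening > 0.0f);
    std::vector<Vec3> acc(bodies.size(), Vec3{0.0f, 0.0f, 0.0f});
    for (std::size_t i = 0; i < bodies.size(); ++i) {      // one "thread" per body
        for (std::size_t j = 0; j < bodies.size(); ++j) {  // interact with every body
            const float dx = bodies[j].x - bodies[i].x;
            const float dy = bodies[j].y - bodies[i].y;
            const float dz = bodies[j].z - bodies[i].z;
            const float distSqr = dx * dx + dy * dy + dz * dz + softening * softening;
            const float invDist = 1.0f / std::sqrt(distSqr);
            const float s = bodies[j].mass * invDist * invDist * invDist;
            acc[i].x += dx * s;
            acc[i].y += dy * s;
            acc[i].z += dz * s;
        }
    }
    return acc;
}
```

Each body's acceleration depends only on read-only positions, so the outer loop can be distributed across threads without synchronization.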


Parallel Reduction
A parallel sum reduction that computes the sum of large arrays of values. This sample demonstrates several important optimization strategies for parallel algorithms like reduction. A whitepaper accompanies this sample.
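The basic shape of a tree-based sum reduction can be sketched in C++ as follows. This is a CPU illustration of the halving-stride pattern only, not the SDK's optimized kernels; the function name `reduceSum` and the zero-padding to a power of two are choices made for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: tree-based sum reduction.
// At each step, the first `stride` elements each add in the element
// `stride` positions away. All additions within a step are independent,
// so a GPU can run them in parallel, finishing in log2(n) steps.
float reduceSum(std::vector<float> data) {  // taken by value: used as scratch
    if (data.empty()) return 0.0f;

    // Pad with zeros up to the next power of two (assumption of this sketch).
    std::size_t n = 1;
    while (n < data.size()) n <<= 1;
    data.resize(n, 0.0f);

    for (std::size_t stride = n / 2; stride > 0; stride >>= 1) {
        for (std::size_t i = 0; i < stride; ++i) {
            data[i] += data[i + stride];
        }
    }
    return data[0];
}
```

Active elements stay contiguous at the front of the array at every step, which is the addressing pattern usually preferred on GPUs over interleaved indexing.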


Mandelbrot
This sample uses CUDA to compute and display the Mandelbrot set.
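The per-pixel work in a Mandelbrot renderer is the standard escape-time iteration z = z^2 + c, which is independent for every pixel. A minimal C++ sketch of that inner loop is shown here; the function name `mandelbrotIterations` and the `maxIter` parameter are hypothetical.

```cpp
// Hypothetical helper: escape-time iteration count for the point c = cr + i*ci.
// Returns the number of iterations performed before |z| exceeds 2,
// or maxIter if the orbit stays bounded for maxIter iterations.
int mandelbrotIterations(double cr, double ci, int maxIter) {
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (n < maxIter && zr * zr + zi * zi <= 4.0) {
        const double next = zr * zr - zi * zi + cr;  // real part of z^2 + c
        zi = 2.0 * zr * zi + ci;                     // imaginary part of z^2 + c
        zr = next;
        ++n;
    }
    return n;
}
```

A renderer would call this once per pixel and map the returned count to a color.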


Fast Walsh Transform
Naturally (Hadamard) ordered Fast Walsh Transform for batched vectors of arbitrary eligible (power-of-two) lengths.
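For a single vector, the naturally ordered transform can be written as log2(n) passes of add/subtract butterflies. The following C++ sketch shows this on the CPU; the function name `fwtHadamard` is hypothetical, the transform is unnormalized, and batching over several vectors is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: in-place, unnormalized Walsh-Hadamard transform in
// natural (Hadamard) order. Within one pass, all butterflies touch disjoint
// pairs of elements and can run in parallel.
void fwtHadamard(std::vector<float>& a) {
    const std::size_t n = a.size();
    if (n < 2) return;
    assert((n & (n - 1)) == 0);  // length must be a power of two

    for (std::size_t len = 1; len < n; len <<= 1) {
        for (std::size_t base = 0; base < n; base += 2 * len) {
            for (std::size_t i = base; i < base + len; ++i) {
                const float u = a[i];
                const float v = a[i + len];
                a[i] = u + v;
                a[i + len] = u - v;
            }
        }
    }
}
```

Applying the function twice returns the original vector scaled by n, which is a convenient sanity check.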


Scan
This example demonstrates an efficient CUDA implementation of parallel prefix sum, also known as "scan". Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array. A whitepaper accompanies this sample.
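The definition above (each output is the sum of the elements before it) describes an exclusive scan. One well-known work-efficient formulation uses an up-sweep followed by a down-sweep over a balanced tree; the C++ sketch below shows it on the CPU. Whether the SDK sample uses exactly this variant is an assumption here, the function name `exclusiveScan` is hypothetical, and the length is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: in-place exclusive prefix sum using the up-sweep /
// down-sweep (balanced tree) scheme. The inner loop iterations of each
// level are independent of one another.
void exclusiveScan(std::vector<int>& a) {
    const std::size_t n = a.size();
    if (n == 0) return;
    assert((n & (n - 1)) == 0);  // length must be a power of two

    // Up-sweep: build partial sums at the internal nodes of the tree.
    for (std::size_t d = 1; d < n; d <<= 1) {
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            a[i] += a[i - d];
        }
    }

    // Clear the root, then push prefixes back down the tree.
    a[n - 1] = 0;
    for (std::size_t d = n >> 1; d > 0; d >>= 1) {
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            const int left = a[i - d];
            a[i - d] = a[i];   // left child receives the parent's prefix
            a[i] += left;      // right child receives prefix + left subtree sum
        }
    }
}
```

For the input 3, 1, 7, 0, 4, 1, 6, 3 this produces 0, 3, 4, 11, 11, 15, 16, 22.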


Scan of Large Arrays
This example demonstrates an efficient CUDA implementation of parallel prefix sum (also known as "scan") for arbitrary-sized arrays. Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array. A whitepaper accompanies this sample.
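A common way to extend scan to arrays larger than one thread block is to scan each block independently, scan the per-block totals, and then add each block's offset back in. The C++ sketch below illustrates that three-step structure on the CPU. It is an assumption that the SDK sample follows this outline; the function name `scanLargeArray` and the `blockSize` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: exclusive prefix sum of an arbitrary-length array,
// organized as three block-parallel steps.
std::vector<int> scanLargeArray(const std::vector<int>& in, std::size_t blockSize) {
    assert(blockSize > 0);
    const std::size_t n = in.size();
    const std::size_t numBlocks = (n + blockSize - 1) / blockSize;
    std::vector<int> out(n);
    std::vector<int> blockSums(numBlocks);

    // Step 1: exclusive scan inside each block; record each block's total.
    for (std::size_t b = 0; b < numBlocks; ++b) {
        const std::size_t begin = b * blockSize;
        const std::size_t end = std::min(begin + blockSize, n);
        int running = 0;
        for (std::size_t i = begin; i < end; ++i) {
            out[i] = running;
            running += in[i];
        }
        blockSums[b] = running;
    }

    // Step 2: exclusive scan of the block totals gives each block's offset.
    int offset = 0;
    for (std::size_t b = 0; b < numBlocks; ++b) {
        const int total = blockSums[b];
        blockSums[b] = offset;
        offset += total;
    }

    // Step 3: add the block offset to every element of the block.
    for (std::size_t b = 0; b < numBlocks; ++b) {
        const std::size_t begin = b * blockSize;
        const std::size_t end = std::min(begin + blockSize, n);
        for (std::size_t i = begin; i < end; ++i) {
            out[i] += blockSums[b];
        }
    }
    return out;
}
```

Steps 1 and 3 are independent across blocks; if the number of blocks is itself large, step 2 can be handled by applying the same scheme recursively.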