NVIDIA CUDA SDK Code Samples

The CUDA Developer SDK provides examples with source code, utilities, and white papers to help you get started writing software with CUDA. The SDK includes dozens of code samples covering a wide range of applications including:

Simple techniques such as C++ code integration and efficient loading of custom datatypes
How-To examples covering the CUDA BLAS and FFT libraries, texture fetching in CUDA, and CUDA interoperation with the OpenGL and Direct3D graphics APIs
Linear algebra primitives such as matrix transpose and matrix-matrix multiplication
Data-parallel algorithms such as parallel prefix sum of large arrays
Performance: profiling using timers and bandwidth tests
Advanced application examples such as image convolution, Black-Scholes options pricing and binomial options pricing

Refer to the following READMEs for more information (Linux, Windows).

This code is released free of charge for use in derivative works, whether academic, commercial, or personal. (Full License)

The NVIDIA CUDA Toolkit is required to compile and run the code samples. Please obtain the CUDA Toolkit here.

Quick Links:

Data-Parallel Algorithms
Computational Finance
Performance Strategies
Linear Algebra
Physically-Based Simulation
CUDA Basic Topics
Graphics Interop
Image/Video Processing and Data Compression
CUDA Advanced Topics


Monte-Carlo Option Pricing with multi-GPU support
This sample evaluates the fair call price for a given set of European options using a Monte-Carlo approach, taking advantage of all CUDA-capable GPUs installed in the system.
Links: Download - Windows, Download - Linux


FFT Ocean Simulation
This sample simulates an ocean height field using CUFFT and renders the result using OpenGL.
Links: Download - Windows, Download - Linux


256-bin Histogram
This sample demonstrates an efficient implementation of a 256-bin histogram.
Links: Whitepaper, Download - Windows, Download - Linux


64-bin Histogram
This sample demonstrates an efficient implementation of a 64-bin histogram.
Links: Whitepaper, Download - Windows, Download - Linux


Separable Convolution
This sample implements a separable convolution filter of a 2D signal with a Gaussian kernel.
Links: Whitepaper, Download - Windows, Download - Linux
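
As an illustration of what "separable" means here, the following is a minimal CPU reference sketch in Python, not the SDK's CUDA code. It applies a 1D Gaussian along rows and then along columns, which gives the same result as a full 2D Gaussian at lower cost. The function names, the clamp-to-edge border handling, and the kernel parameters are assumptions made for this example.

```python
import math

def gaussian_kernel(radius, sigma):
    """Return a normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def convolve_rows(image, kernel):
    """Filter every row with the 1D kernel, clamping reads at the borders.

    The kernel is symmetric, so flipping it (as strict convolution would)
    makes no difference.
    """
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [sum(kernel[k + radius] * row[min(max(x + k, 0), width - 1)]
             for k in range(-radius, radius + 1))
         for x in range(width)]
        for row in image
    ]

def transpose(image):
    return [list(column) for column in zip(*image)]

def separable_convolve(image, kernel):
    """Row pass, then column pass (done as a row pass on the transpose)."""
    rows_done = convolve_rows(image, kernel)
    return transpose(convolve_rows(transpose(rows_done), kernel))
```

For a kernel with K taps, the two 1D passes cost about 2K multiply-adds per pixel instead of K*K for the equivalent 2D filter.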


Texture-based Separable Convolution
Texture-based implementation of a separable 2D convolution with a Gaussian kernel. Used for performance comparison against convolutionSeparable.
Links: Download - Windows, Download - Linux


FFT-Based 2D Convolution
This sample demonstrates how 2D convolutions with very large kernel sizes can be implemented efficiently using FFTs.
Links: Whitepaper, Download - Windows, Download - Linux


MersenneTwister
This sample implements the Mersenne Twister random number generator and the Cartesian Box-Muller transformation on the GPU.
Links: Whitepaper, Download - Windows, Download - Linux
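
To show how the two pieces fit together, here is a small CPU sketch in Python, not the SDK's GPU implementation. It draws uniform numbers from CPython's built-in generator (which is a Mersenne Twister) and maps pairs of them to standard normal samples with the basic sine/cosine form of Box-Muller. The function names and the seed handling are assumptions made for this example.

```python
import math
import random

def box_muller(u1, u2):
    """Map uniforms u1 in (0, 1] and u2 in [0, 1) to two standard normals."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

def normal_samples(count, seed=0):
    """Generate `count` normal samples from Mersenne Twister uniforms."""
    rng = random.Random(seed)  # CPython's random.Random is MT19937
    samples = []
    while len(samples) < count:
        u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(u1) is finite
        u2 = rng.random()
        samples.extend(box_muller(u1, u2))
    return samples[:count]
```

Each pair of uniform inputs yields two independent normal outputs, so the transform wastes no random numbers.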


Monte-Carlo Option Pricing
This sample evaluates the fair call price for a given set of European options using a Monte-Carlo approach.
Links: Whitepaper, Download - Windows, Download - Linux
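
A minimal sketch of the Monte-Carlo idea in Python, run on the CPU and not taken from the SDK source. It assumes the underlying follows geometric Brownian motion under the risk-neutral measure, simulates terminal prices, and discounts the average payoff. The function name, parameter names, and seed are assumptions for illustration.

```python
import math
import random

def monte_carlo_call(spot, strike, rate, volatility, years, paths, seed=0):
    """Estimate a European call price by averaging discounted payoffs."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * volatility ** 2) * years
    diffusion = volatility * math.sqrt(years)
    payoff_sum = 0.0
    for _ in range(paths):
        # One simulated terminal price per path (assumed lognormal model).
        terminal = spot * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        payoff_sum += max(terminal - strike, 0.0)
    return math.exp(-rate * years) * payoff_sum / paths
```

Every path is independent of the others, which is what makes this kind of computation a natural fit for a GPU: each thread can simulate its own paths and the payoffs are combined with a sum reduction at the end.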


Black-Scholes Option Pricing
This sample evaluates fair call and put prices for a given set of European options using the Black-Scholes formula.
Links: Whitepaper, Download - Windows, Download - Linux
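
For reference, the closed-form Black-Scholes prices for a non-dividend-paying European option can be written in a few lines. This is a CPU sketch in Python with assumed function and parameter names, not the SDK kernel, and it uses the error function for the normal CDF.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes(spot, strike, rate, volatility, years):
    """Return (call, put) prices for a European option without dividends."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * volatility ** 2) * years) / (volatility * sqrt_t)
    d2 = d1 - volatility * sqrt_t
    discount = math.exp(-rate * years)
    call = spot * normal_cdf(d1) - strike * discount * normal_cdf(d2)
    put = strike * discount * normal_cdf(-d2) - spot * normal_cdf(-d1)
    return call, put
```

Each option is priced independently of the others, so a batch of options maps directly onto one thread per option.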


Binomial Option Pricing
This sample evaluates the fair call price for a given set of European options under the binomial model.
Links: Whitepaper, Download - Windows, Download - Linux
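
The binomial model can be sketched as follows in Python on the CPU. This example assumes the Cox-Ross-Rubinstein parameterization of the tree, which the page itself does not specify, and all names are hypothetical. Payoffs are computed at the leaves and then rolled back one level at a time.

```python
import math

def binomial_call(spot, strike, rate, volatility, years, steps):
    """Price a European call on a recombining binomial tree (CRR, assumed)."""
    dt = years / steps
    up = math.exp(volatility * math.sqrt(dt))
    down = 1.0 / up
    growth = math.exp(rate * dt)
    p_up = (growth - down) / (up - down)  # risk-neutral up probability
    discount = 1.0 / growth
    # Payoffs at expiry: node j has j up moves and (steps - j) down moves.
    values = [max(spot * up ** j * down ** (steps - j) - strike, 0.0)
              for j in range(steps + 1)]
    # Backward induction: each level has one node fewer than the next.
    for level in range(steps, 0, -1):
        values = [discount * (p_up * values[j + 1] + (1.0 - p_up) * values[j])
                  for j in range(level)]
    return values[0]
```

Within one level of the tree every node depends only on two nodes of the next level, so all nodes of a level can be updated in parallel.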


Image denoising
This sample demonstrates two adaptive image denoising techniques, KNN and NLM, based on computation of both geometric and color distance between texels. While both techniques are implemented in the DirectX SDK using shaders, a massively sped-up variation of the latter technique, taking advantage of shared memory, is implemented in addition to the DirectX counterparts.
Links: Whitepaper, Download - Windows, Download - Linux


DirectX Texture Compressor (DXTC)
High-quality DXT compression using CUDA. This example shows how to implement an existing computationally intensive CPU compression algorithm in parallel on the GPU and obtain an order-of-magnitude performance improvement.
Links: Whitepaper, Download - Windows, Download - Linux


Post-Process in OpenGL
This sample shows how to post-process an image rendered in OpenGL using CUDA.
Links: Download - Windows, Download - Linux


Box Filter
Fast image box filter using CUDA with OpenGL rendering.
Links: Download - Windows, Download - Linux


Bitonic Sort
Bitonic sort is a simple parallel sorting algorithm that is very efficient when sorting a small number of elements (see http://citeseer.ist.psu.edu/blelloch98experimental.html). This implementation is based on http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm
Links: Download - Windows, Download - Linux
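
The structure of a bitonic sorting network can be sketched in Python as below. This is a sequential CPU illustration with an assumed function name, not the SDK kernel. The inner loop over `i` is the part a GPU would run with one thread per element, since all compare-exchange operations within one stride are independent.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two into ascending order."""
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    size = 2
    while size <= n:  # size of the bitonic sequences being merged
        stride = size // 2
        while stride > 0:  # compare-exchange distance within a merge
            for i in range(n):
                partner = i ^ stride
                if partner > i:
                    ascending = (i & size) == 0
                    # Swap when the pair is out of order for its direction.
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            stride //= 2
        size *= 2
    return data
```

The sequence of comparisons does not depend on the data, which is why the algorithm suits hardware where all threads execute the same instructions.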


Matrix Transpose
Efficient matrix transpose.
Links: Download - Windows, Download - Linux


Scalar Product
This sample calculates scalar products of a given set of input vector pairs.
Links: Download - Windows, Download - Linux


Clock
This example shows how to use the clock function to measure kernel performance accurately.
Links: Download - Windows, Download - Linux


Multi-GPU
This application demonstrates how to use the CUDA API to use multiple GPUs.
Links: Download - Windows, Download - Linux


Aligned Types
A simple test showing the large gap in access speed between aligned and misaligned structures.
Links: Download - Windows, Download - Linux


N-Body Simulation
This sample demonstrates an efficient all-pairs gravitational n-body simulation in CUDA. It accompanies the GPU Gems 3 chapter "Fast N-Body Simulation with CUDA".
Links: Whitepaper, Download - Windows, Download - Linux
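
"All-pairs" means every body accumulates the gravitational pull of every other body. A minimal CPU sketch of that inner computation in Python follows. The function name, the unit gravitational constant, and the softening term (added so that coincident bodies, including a body paired with itself, do not divide by zero) are assumptions made for this example.

```python
import math

def accelerations(positions, masses, softening=1e-3):
    """Return the acceleration on each body from all bodies (G = 1)."""
    result = []
    for xi, yi, zi in positions:
        ax = ay = az = 0.0
        for (xj, yj, zj), mass in zip(positions, masses):
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            dist_sq = dx * dx + dy * dy + dz * dz + softening * softening
            scale = mass / (dist_sq * math.sqrt(dist_sq))
            # The self-interaction adds zero because dx = dy = dz = 0.
            ax += dx * scale
            ay += dy * scale
            az += dz * scale
        result.append((ax, ay, az))
    return result
```

The outer loop is independent per body, so one thread per body is a natural mapping, and the cost grows as the square of the number of bodies.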


Parallel Reduction
A parallel sum reduction that computes the sum of large arrays of values. This sample demonstrates several important optimization strategies for parallel algorithms like reduction.
Links: Whitepaper, Download - Windows, Download - Linux
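
The basic shape of a tree-based sum reduction can be sketched in Python as below. This is a sequential CPU illustration with an assumed function name and does not reproduce the sample's specific optimizations. Each pass halves the number of active elements, and the loop over `i` stands in for the threads that would run concurrently on a GPU.

```python
def tree_reduce_sum(values):
    """Sum a sequence with a log-depth pairwise reduction tree."""
    data = list(values)
    size = 1
    while size < len(data):
        size *= 2
    data.extend([0] * (size - len(data)))  # pad with the identity for +
    stride = size // 2
    while stride > 0:
        for i in range(stride):  # one "thread" per active element
            data[i] += data[i + stride]
        stride //= 2
    return data[0]
```

With n elements the tree needs about log2(n) passes, compared with n - 1 dependent additions for a sequential running sum.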


asyncAPI
This sample uses CUDA streams and events to overlap execution on the CPU and GPU.
Links: Download - Windows, Download - Linux


cudaOpenMP
This sample shows how to use the OpenMP API to write an application for multiple GPUs.
Links: Download - Windows


simpleStreams
This sample uses CUDA streams to overlap kernel executions with memcopies between the device and the host.
Links: Download - Windows, Download - Linux


Mandelbrot
This sample uses CUDA to compute and display the Mandelbrot set.
Links: Download - Windows, Download - Linux
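
The per-pixel work in a Mandelbrot renderer is an escape-time iteration, sketched here in Python on the CPU. The function names, the iteration limits, and the viewing window are assumptions for illustration; the SDK sample's own parameters are not given on this page.

```python
def escape_count(c, max_iterations=256):
    """Count iterations of z -> z*z + c before |z| exceeds 2."""
    z = 0j
    for i in range(max_iterations):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iterations  # treated as "inside the set"

def mandelbrot_grid(width, height, max_iterations=64):
    """Evaluate escape counts over [-2, 1] x [-1.5, 1.5]; needs width, height >= 2."""
    rows = []
    for py in range(height):
        imag = -1.5 + 3.0 * py / (height - 1)
        rows.append([
            escape_count(complex(-2.0 + 3.0 * px / (width - 1), imag),
                         max_iterations)
            for px in range(width)
        ])
    return rows
```

Every pixel is computed independently, so the grid maps directly onto one thread per pixel.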


Particles
This sample uses CUDA to simulate and visualize a large set of particles and their physical interaction.
Links: Whitepaper, Download - Windows, Download - Linux


Simple Atomics
A simple demonstration of global memory atomic instructions.
Links: Download - Windows, Download - Linux


Fast Walsh Transform
Naturally (Hadamard) ordered Fast Walsh Transform for batched vectors of arbitrary eligible (power-of-two) lengths.
Links: Download - Windows, Download - Linux
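
A naturally (Hadamard) ordered Fast Walsh Transform of a single vector can be sketched in Python as follows. This is a sequential CPU illustration with an assumed function name; it is unnormalized and handles one vector rather than a batch.

```python
def fast_walsh_transform(values):
    """Unnormalized Walsh-Hadamard transform of a power-of-two-length list."""
    data = list(values)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = data[i], data[i + half]
                data[i], data[i + half] = a + b, a - b  # butterfly
        half *= 2
    return data
```

Applying the transform twice returns the input scaled by the vector length, so the same routine serves as its own inverse up to a division by n.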


Eigenvalues
The computation of all or a subset of all eigenvalues is an important problem in linear algebra, statistics, physics, and many other fields. This sample demonstrates a parallel implementation of a bisection algorithm for the computation of all eigenvalues of a tridiagonal symmetric matrix of arbitrary size with CUDA.
Links: Whitepaper, Download - Windows, Download - Linux
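
The bisection idea can be sketched on the CPU in Python as below. This is a sequential illustration under assumptions, not the parallel scheme the sample uses. It relies on a Sturm-sequence count of eigenvalues below a trial value, starts from a Gershgorin interval, and bisects once per eigenvalue index. The function names, the tolerance, and the tiny-pivot guard value are hypothetical.

```python
def count_below(diag, off, x):
    """Number of eigenvalues of the symmetric tridiagonal matrix below x."""
    count = 0
    pivot = 1.0
    for i, d in enumerate(diag):
        coupling = off[i - 1] ** 2 if i > 0 else 0.0
        pivot = d - x - coupling / pivot
        if pivot == 0.0:
            pivot = 1e-30  # guard against division by zero on the next step
        if pivot < 0.0:
            count += 1
    return count

def tridiagonal_eigenvalues(diag, off, tolerance=1e-12):
    """All eigenvalues in ascending order, one bisection per index."""
    n = len(diag)
    radii = [(abs(off[i - 1]) if i > 0 else 0.0)
             + (abs(off[i]) if i < n - 1 else 0.0) for i in range(n)]
    lower = min(d - r for d, r in zip(diag, radii))  # Gershgorin bounds
    upper = max(d + r for d, r in zip(diag, radii))
    result = []
    for k in range(n):
        lo, hi = lower, upper
        for _ in range(200):  # cap iterations in case tolerance is below ulp
            if hi - lo <= tolerance:
                break
            mid = 0.5 * (lo + hi)
            if count_below(diag, off, mid) > k:
                hi = mid  # eigenvalue k lies below mid
            else:
                lo = mid
        result.append(0.5 * (lo + hi))
    return result
```

Here `diag` holds the n diagonal entries and `off` the n - 1 off-diagonal entries. Because the count for each trial value is independent of every other trial value, many intervals can be refined at the same time.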


Sobel Filter
This sample implements the Sobel edge detection filter for 8-bit monochrome images.
Links: Download - Windows, Download - Linux
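
A CPU sketch of the Sobel operator in Python is shown below. It applies the standard 3x3 horizontal and vertical Sobel masks and, as assumptions made for this example, combines them as |gx| + |gy|, clamps to the 8-bit range, and leaves the one-pixel border at zero. The function name is hypothetical.

```python
def sobel(image):
    """Return an edge-magnitude image for a 2D list of 8-bit values."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            up, mid, down = image[y - 1], image[y], image[y + 1]
            # Horizontal gradient: right column minus left column.
            gx = ((up[x + 1] + 2 * mid[x + 1] + down[x + 1])
                  - (up[x - 1] + 2 * mid[x - 1] + down[x - 1]))
            # Vertical gradient: bottom row minus top row.
            gy = ((down[x - 1] + 2 * down[x] + down[x + 1])
                  - (up[x - 1] + 2 * up[x] + up[x + 1]))
            out[y][x] = min(255, abs(gx) + abs(gy))
    return out
```

Each output pixel reads a 3x3 neighborhood, so neighboring threads share most of their inputs, which is the kind of access pattern that texture or shared memory can exploit.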


Device Query
This sample enumerates the properties of the CUDA devices present in the system.
Links: Download - Windows, Download - Linux


Simple Templates
This sample is a templatized version of the template project. It also shows how to correctly templatize dynamically allocated shared memory arrays.
Links: Download - Windows, Download - Linux


Bandwidth Test
This is a simple test program to measure the memcopy bandwidth of the GPU. It can currently measure device-to-device copy bandwidth, host-to-device copy bandwidth for pageable and page-locked memory, and device-to-host copy bandwidth for pageable and page-locked memory.
Links: Download - Windows, Download - Linux


Scan
This example demonstrates an efficient CUDA implementation of parallel prefix sum, also known as "scan". Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array.
Links: Whitepaper, Download - Windows, Download - Linux
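
The definition above describes an exclusive scan: output element i is the sum of inputs 0 through i - 1. One common work-efficient formulation uses an up-sweep followed by a down-sweep over a balanced tree. The following Python sketch shows that formulation sequentially on the CPU, on the assumption that it is a reasonable stand-in for illustration; the function name is hypothetical and the sample's actual kernel may differ.

```python
def exclusive_scan(values):
    """Exclusive prefix sum using an up-sweep and a down-sweep."""
    n = len(values)
    size = 1
    while size < n:
        size *= 2
    data = list(values) + [0] * (size - n)  # pad to a power of two
    # Up-sweep: build partial sums at the internal nodes of the tree.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2
    # Down-sweep: push prefix totals back down to the leaves.
    data[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2
    return data[:n]
```

For example, the input 3, 1, 7, 0, 4, 1, 6, 3 produces 0, 3, 4, 11, 11, 15, 16, 22.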


Scan of Large Arrays
This example demonstrates an efficient CUDA implementation of parallel prefix sum (also known as "scan") for arbitrary-sized arrays. Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array.
Links: Whitepaper, Download - Windows, Download - Linux


Simple Texture (Driver Version)
Simple example that demonstrates the use of textures in CUDA using the driver API.
Links: Download - Windows, Download - Linux


Fluids (OpenGL Version)
An example of fluid simulation using CUDA and CUFFT, with OpenGL rendering.
Links: Download - Windows, Download - Linux


Fluids (Direct3D Version)
An example of fluid simulation using CUDA and CUFFT, with Direct3D 9 rendering.
Links: Download - Windows


Simple Texture
Simple example that demonstrates the use of textures in CUDA.
Links: Download - Windows, Download - Linux


Matrix Multiplication (Driver Version)
This sample implements matrix multiplication using the CUDA driver API. It has been written for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication. CUBLAS provides high-performance matrix multiplication.
Links: Download - Windows, Download - Linux


Template
A trivial template project that can be used as a starting point to create new CUDA projects.
Links: Download - Windows, Download - Linux


Simple CUFFT
Example of using CUFFT. In this example, CUFFT is used to compute the 1D convolution of a signal with a filter by transforming both into the frequency domain, multiplying them together, and transforming the result back to the time domain.
Links: Download - Windows, Download - Linux
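
The transform, multiply, inverse-transform sequence can be sketched in pure Python as below. This does not call CUFFT; it uses a small recursive radix-2 FFT written for the example. The function names and the choice to zero-pad both inputs so the result is a linear (not circular) convolution are assumptions.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def fft_convolve(signal, kernel):
    """Linear convolution via pointwise multiplication of spectra."""
    length = len(signal) + len(kernel) - 1
    size = 1
    while size < length:
        size *= 2
    a = fft([complex(v) for v in signal] + [0j] * (size - len(signal)))
    b = fft([complex(v) for v in kernel] + [0j] * (size - len(kernel)))
    product = [x * y for x, y in zip(a, b)]
    result = fft(product, inverse=True)
    # The inverse transform above is unnormalized, so divide by size.
    return [v.real / size for v in result[:length]]
```

For a filter that is long relative to the signal, this costs on the order of n log n operations instead of the n times m of direct convolution.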


Simple Direct3D
Simple program which demonstrates interoperability between CUDA and Direct3D. The program modifies vertex positions with CUDA and uses Direct3D to render the geometry.
Links: Download - Windows


Simple OpenGL
Simple program which demonstrates interoperability between CUDA and OpenGL. The program modifies vertex positions with CUDA and uses OpenGL to render the geometry.
Links: Download - Windows, Download - Linux


Simple CUBLAS
Example of using CUBLAS.
Links: Download - Windows, Download - Linux


Matrix Multiplication
This sample implements matrix multiplication and is exactly the same as the example in Chapter 6 of the programming guide. It has been written for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication. CUBLAS provides high-performance matrix multiplication.
Links: Download - Windows, Download - Linux
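
One common way to structure matrix multiplication for a GPU is to work on square tiles, so that each block of threads loads a tile of each input once and reuses it. The Python sketch below shows that blocked loop structure on the CPU. It is an assumption that this mirrors the sample's approach; the function name and the tile size are hypothetical.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply matrices a (rows x inner) and b (inner x cols) tile by tile."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for row0 in range(0, rows, tile):          # one output tile per "block"
        for col0 in range(0, cols, tile):
            for k0 in range(0, inner, tile):   # walk tiles along the inner dim
                for i in range(row0, min(row0 + tile, rows)):
                    for j in range(col0, min(col0 + tile, cols)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, inner)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

The result is identical to the plain triple loop; only the order of the work changes, which is what allows the inputs for one tile to be kept in fast on-chip memory.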


C++ Integration
This example demonstrates how to integrate CUDA into an existing C++ application: the CUDA entry point on the host side is just a function that is called from C++ code, and only the file containing this function is compiled with nvcc. It also demonstrates that vector types can be used from .cpp files.
Links: Download - Windows, Download - Linux


1D Discrete Haar Wavelet Decomposition
Discrete Haar wavelet decomposition for 1D signals with a length which is a power of 2.
Links: Download - Windows, Download - Linux
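
A full 1D Haar decomposition can be sketched in Python as follows, on the CPU. The orthonormal scaling by 1/sqrt(2) and the output layout (the final approximation coefficient followed by detail coefficients from coarsest to finest) are assumptions chosen for this example, as is the function name.

```python
import math

def haar_decompose(signal):
    """Return [approximation] + details, coarsest level first."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    scale = 1.0 / math.sqrt(2.0)
    approx = [float(v) for v in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        level_details = [(a - b) * scale for a, b in pairs]
        approx = [(a + b) * scale for a, b in pairs]
        details = level_details + details  # coarser levels go in front
    return approx + details
```

With this scaling the transform is orthonormal, so the sum of squares of the coefficients equals the sum of squares of the input, and each level halves the amount of remaining work.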