-------------------------------------------------------------------------------- -------------------------------------------------------------------------------- NVIDIA CUDA Toolkit v3.2 RC2 Release Notes for Windows -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Manifest -------------------------------------------------------------------------------- This release contains: o NVIDIA CUDA toolkit documentation o NVIDIA OpenCL documentation o NVIDIA CUDA compiler (nvcc) and supporting tools o NVIDIA CUDA runtime libraries o NVIDIA CUBLAS, CUFFT, CUSPARSE and CURAND libraries -------------------------------------------------------------------------------- Documentation (located in the doc directory) -------------------------------------------------------------------------------- o CUDA_Toolkit_Release_Notes_Windows.txt - This document. o CUBLAS_Library.pdf - User manual for the CUDA accelerated BLAS implementation. o CUFFT_Library.pdf - User manual for the CUDA accelerated FFT library. o CURAND_Library.pdf - User manual for the CUDA accelerated Random Number Generation library. o CUSPARSE_Library.pdf - User manual for the CUDA accelerated Sparse Matrix library. o Compute_Profiler.txt - Guide for using the profiler. o EULA.txt - The end user license agreement. o nvcc.pdf - Documentation for the CUDA command line compiler. o CUDA_VideoDecoder_Library.pdf - Documentation for the NVIDIA CUDA Video Decoder library. o CUDA_VideoEncoder_Library.pdf - Documentation for the NVIDIA CUDA Video Encoder library. o OpenCL_Extensions/*.txt - Documentation for the NVIDIA OpenCL extensions. o OpenCL_Implementations_Notes.txt - Notes describing the implementation defined behavior for the NVIDIA OpenCL implementation. o ptx_isa_2.2.pdf - User manual for the Parallel Thread Execution ISA. -------------------------------------------------------------------------------- Important Files -------------------------------------------------------------------------------- bin/nvcc Command line compiler include/ cuda.h CUDA driver API header cudaGL.h CUDA OpenGL interop header for driver API cudaD3D9.h CUDA DirectX 9 interop header cudaD3D10.h CUDA DirectX 10 interop header cudaD3D11.h CUDA Directx 11 interop header cufft.h CUFFT API header cublas.h CUBLAS API header cusparse.h CUSPARSE API header curand.h CURAND API header curand_kernel.h CURAND device API header nvcuvid.h CUDA Video Decoder header cuviddec.h CUDA Video Decoder header NVEncodeDataTypes.h CUDA Video Encoder (C-library or DirectShow) required for projects NVEncodeAPI.h CUDA Video Encoder (C-library) required for projects INvTranscodeFilterGUIDs.h CUDA Video Encoder (DirectShow) required for projects INVVESetting.h CUDA Video Encoder (DirectShow) required for projects lib/ cuda.lib CUDA driver library cudart.lib CUDA runtime library cublas.lib CUDA BLAS library cufft.lib CUDA FFT library cusparse.lib CUDA Sparse Matrix library curand.lib CUDA Random Number Generation library nvcuvenc.lib CUDA Video Encoder library nvcuvid.lib CUDA Video Decoder library -------------------------------------------------------------------------------- Supported NVIDIA Hardware -------------------------------------------------------------------------------- o See http://www.nvidia.com/object/cuda_gpus.html -------------------------------------------------------------------------------- Supported Software Platforms -------------------------------------------------------------------------------- o Supported Operating Systems (32-bit and 64-bit) - Windows XP - Windows Vista - Windows Server 2003 - Windows Server 2008 - Windows Server 2008 R2 - Windows 7 -------------------------------------------------------------------------------- Installation Notes -------------------------------------------------------------------------------- Silent Installation: Install using msiexec.exe from the shell and pass the following arguments: msiexec.exe /i cudatoolkit.msi /qn To uninstall: Use /x instead of /i -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Upgrading from CUDA Toolkit 3.1 -------------------------------------------------------------------------------- o Prior to the 3.1 release, nvcc treated __device__ functions as implicitly static. This behavior has changed with the 3.1 release. As a result, the host linker will give a link error regarding multiple defined symbols, if 1) two identical __device__ functions are defined in two different compilations units. This includes including function definitions through the #include <> mechanism. 2) a __device__ function and an identical host function are defined in two different compilations units. For both cases, declaring the __device__ function as static will make the compilation succeed. o In CUDA Toolkit 3.2, the default installation path has changed from C:\CUDA to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v#.#, where #.# is the version number. This allows concurrent installation of multiple versions of the CUDA Toolkit. As part of the CUDA Toolkit installation process, .rules files named NvCudaDriverApi.v#.#.rules and NvCudaRuntimeApi.v#.#.rules (for version #.# of the CUDA Toolkit, for CUDA Driver API and CUDA Runtime API applications, respectively) are installed into $VisualStudioInstallDir\VC\VCProjectDefaults, and a $CUDA_PATH_V#_# environment variable is set to the installation directory of the toolkit. Together, the .rules file and the environment variable allow Visual Studio projects to locate and configure the CUDA Toolkit and the nvcc compiler. You can reference the .rules files from your Visual Studio project files when building your own CUDA applications. To aid migration between toolkit versions, a $CUDA_PATH environment variable is also set by the CUDA Toolkit installer to point to the installation directory of the most recently installed version of the toolkit. A set of non-versioned .rules files (named NvCudaDriverApi.rules and NvCudaRuntimeApi.rules) can be used in conjunction with $CUDA_PATH for referencing the most recently installed version of the toolkit. -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- New Features -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- New CUDA Libraries o CUSPARSE, supporting sparse matrix computations. o CURAND, supporting random number generation for both host and device code with Sobel quasi-random and XORWOW pseudo random routines. o CUFFT performance tuned radix-3, -5, and -7 transform sized on Fermi architecture GPUs. o CUBLAS performance tuned for Fermi architecture GPUs, especially for matrix multiplication of all datatypes and transpose variations. o H.264 encode/decode libraries that were previously available in the GPU Computing SDK are now part of the CUDA Toolkit. CUDA Driver and CUDA C Runtime o Support for new 6GB Quadro and Tesla products o Cross-stream synchronization Development Tools o Support for debugging GPUs with more than 4GB device memory Miscellaneous o Support for malloc() and free() in device code o Integrated Tesla Compute Cluster (TCC) support in standard Windows driver packages o NVIDIA System Management Interface (nvidia-smi) support for reporting % GPU busy, several GPU performance counters -------------------------------------------------------------------------------- Notes on New Features in the CUDA Driver and CUDA C Runtime -------------------------------------------------------------------------------- o Added a flag to texture reference (CU_TRSF_SRGB) that enables sRGB->linear conversion on read. o Sample FORTRAN wrappers for CUSPARSE routines (similar to CUBLAS) included in the src/ directory. o Added CU_TARGET_COMPUTE_21 to JIT options. o Added cuCtxGetVersion; this allows an application to determine which version of the CUDA API was used to create a CUDA context. Library developers can use cuCtxGetVersion to support different versions of CUDA within the same package. o Enhanced printf; kernels that called printf and were built using CUDA Toolkit 3.1 must be recompiled with the CUDA Toolkit 3.2 tool chain. o CUDA Toolkit 3.2 allows the user set up the printf FIFO size any time until the first launch of a kernel that uses printf. In the previous version, the print FIFO size had to be set before the first module that called printf was loaded. Since CUDA Runtime automatically loads all modules upon the first CUDA API call, this meant that even if cudaThreadSetLimit(cudaLimitPrintfFifoSize, n) was the first cuda API call made in an application, the user's modules would be loaded by the runtime first and cudaThreadSetLimit() would return failure. o cuParamSetTexRef() has been deprecated in CUDA Toolkit 3.2 since this API entry provided no functionality. Accordingly, CUDA C Programming Guide has been updated to remove cuParamSetTexRef from the example. cuParamSetTexRef() has been marked deprecated in cuda.h; this is documented as deprecaded in the CUDA Toolkit Reference Manual (html/chm/pdf). o Improved setting of L1/smem configuration in the CUDA driver. There are 4 new API calls; 2 for driver, 2 for runtime. These calls are documented in the CUDA Toolkit Reference Manual (chm/html/pdf) and the relevant header files: cuCtxGetCacheConfig()/cuCtxSetCacheConfig() and cudaThreadGetCacheConfig()/cudaThreadSetCacheConfig(). o Added cross stream synchronization; cuStreamWaitEvent adds the ability to perform GPU-side synchronization on a CUDA event within a CUDA stream. Cross-stream synchronization and dependency management is resolved on the GPU without any CPU involvement while the work is executing. -------------------------------------------------------------------------------- Notes on New Features in the CUDA Libraries -------------------------------------------------------------------------------- o CUFFT has implemented the Bluestein (or chirp) algorithm for general transform sizes that cannot be factored into supported radices. The "Bluestein" FFT algorithm accelerates transforms for sizes that cannot be factored into 2^a * 3^b * 5^c * 7^d, where a,b,c,d are non-negative integers. This algorithm provides signficant speedup and in many cases improves the accuracy of the final result. Note: The Bluestein algorithm is only supported for 1-D batched transforms; acceleration for 2-D and 3-D transforms will be addressed in future releases. o CUFFT supports batched transforms > 512 MB. The previous version of CUFFT failed when (batch size * transform size * datatype size) for a 1D single-precision transform exceeded 512MB. This has been fixed so that now the total size can be as large as the device memory capacity allows. The exact size varies depending on whether operating in-place or out-of-place and depending on how much internal intermediate memory is required by API, which can vary depending on the actual size of the transform. Note however, that while the total size can be much larger, the size of each individual transform is still limited to 128 M elements (1 GB for single-precision.) o Single precision batched transforms on datasets larger than 512MB report an error for sizes (2^a * 3^b * 5^c * 7^d) where at least one of b, c, and d are non-zero. To work around this issue, the user may split the work into multiple batched calls such that the data processed by each call is <= 512MB. o CUDA Toolkit 3.2 supports 1 pseudo- and 1 quasi-random number generator. A new library called CURAND has been added for generating pseudo- and quasi-random numbers. See CURAND_Library.pdf for full documentation. The CUDA SDK contains several examples that illustrate the use of the new library APIs. In particular, the MonteCarloCURAND, EstimatePiInlineP, EstimatePiInlineQ, EstimatePiP, EstimatePiQ, and SingleAsianOptionP examples can be found in the MonteCarloCURAND directory. o Added a new library, CUSPARSE, for operating on sparse matrices and vectors. See CUSPARSE_Library.pdf for full documentation. The conjugateGradient example in the CUDA SDK illustrates the usage of the new APIs in the CUSPARSE library. -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Known Issues -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Vista, Server 2008 and Windows 7 related: o In order to run CUDA on a non-TESLA GPU, either the Windows desktop must be extended onto the GPU, or the GPU must be selected as the PhysX GPU. o Individual kernels are limited to a 2-second runtime by Windows Vista. Kernels that run for longer than 2 seconds will trigger the Timeout Detection and Recovery (TDR) mechanism. For more information, see http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx. o The CUDA Profiler does not support performance counter events on Windows Vista. All profiler configuration regarding performance counter events is ignored. o The maximum size of a single allocation created by cudaMalloc or cuMemAlloc is limited to: MIN ( ( System Memory Size in MB - 512 MB ) / 2, PAGING_BUFFER_SEGMENT_SIZE ) For Vista, PAGING_BUFFER_SEGMENT_SIZE is approximately 2GB. o The OS may impose artificial limits on the amount of memory you can allocate using the Cuda APIs for both system and video memory. In many cases, these limits are significantly less than the size of physical system and video memory, but there are exceptions that make it difficult to quantify the expected behavior for a particular application. XP related: o Individual GPU program launches are limited to a run time of less than 5 seconds on a GPU with a display attached. Exceeding this time limit usually causes a launch failure reported through the CUDA driver or the CUDA runtime. GPUs without a display attached are not subject to the 5 second runtime restriction. For this reason it is recommended that CUDA be run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter. XP, Vista, Server 2008 and Windows 7 related: o GPU enumeration order on multi-GPU systems is non-deterministic and may change with this or future releases. Users should make sure to enumerate all CUDA-capable GPUs in the system and select the most appropriate one(s) to use. o Applications that try to use too much memory may cause a CUDA memcopy or kernel to fail with the error CUDA_ERROR_OUT_OF_MEMORY. If this happens, the CUDA context is placed into an error state and must be destroyed and recreated if the application wants to continue using CUDA. o malloc may fail due to running out of virtual memory space. The address space limitation is fixed by a Microsoft issued hotfix. Please install the patch located at: http://support.microsoft.com/kb/940105 if this is an issue. Windows Vista SP1 includes this hotfix. o When compiling a source file that includes vector_types.h with the Microsoft compiler on a 32-bit Windows system, the 16-byte aligned vector types are not properly aligned at 16 bytes. o It is a known issue that cudaThreadExit() may not be called implicitly on host thread exit. Due to this, developers are recommended to explicitly call cudaThreadExit() while the issue is being resolved. o For maximum performance when using multiple byte sizes to access the same data, coalesce adjacent loads and stores when possible rather than using a union or individual byte accesses. Accessing the data via a union may result in the compiler reserving extra memory for the object, and accessing the data as individual bytes may result in non-coalesced accesses. This will be improved in a future compiler release. o OpenGL interoperability - OpenGL can not access a buffer that is currently *mapped*. If the buffer is registered but not mapped, OpenGL can do any requested operations on the buffer. - Deleting a buffer while it is mapped for CUDA results in undefined behavior. - Attempting to map or unmap while a different context is bound than was current during the buffer register operation will generally result in a program error and should thus be avoided. - Interoperability will use a software path on SLI - Interoperability will use a software path if monitors are attached to multiple GPUs and a single desktop spans more than one GPU (i.e. WinXP dualview). o OpenCL program binary formats may change in this or future releases. Users should create programs from source and should not rely on compatibility of generated binaries between different versions of the driver. o For an OpenCL C program, the maximum alignment of a function scope local variable and a function parameter variable is limited to 16-bytes. o Some applications may not be able to call cuMemAllocHost/cudaMallocHost() due to existing virtual address ranges that they would like to bus master to/from. Registering system memory for DMA will be addressed in future releases. o When a program is terminated while waiting on a breakpoint, the system needs to be rebooted. This affects the TCC driver for Windows Vista and Windows 7. o Divergent_branch counter in Visual Profiler reports an incorrect value (of zero) for Fermi. -------------------------------------------------------------------------------- Documentation Errata -------------------------------------------------------------------------------- o Visual Profiler User Guide.pdf/Overview/Getting Started/Installation and Setup/Windows Under the section entitled, "Running the Compute Visual Profiler", note the additional "3.2" in the path. ------- To run the Compute Visual Profiler, go to: Start->All Programs->NVIDIA Corporation->CUDA Toolkit->v3.2->Compute Visual Profiler ------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Performance Improvements -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- CUFFT Related -------------------------------------------------------------------------------- o Added Radix-7 CUFFT support for GROMACS in CUDA Toolkit 3.2. Added optimizations for transform sizes that contain prime factors of 3, 5 and 7. Transform sizes that can be expressed as (2^i * 3^j * 5^k * 7^l), where i,j,k,l are integer > 0, execute much faster than transform sizes that contain other prime factors. Previous releases of CUFFT were optimized only for 2^i. -------------------------------------------------------------------------------- CUBLAS GEMM Related -------------------------------------------------------------------------------- o Increased performance for GEMM kernels for non block multiple input sizes achieved through MAGMA licensed code. See Acknowlegements section towards the end of this release notes document. The performance of the CUBLAS routine CGEMM has been significantly improved on Fermi architecture for sizes larger than 300x300. Peak performance is reached when 'k' is a multiple of 16 and 'm' and 'n' are multiples of 64. Performance for ZGEMM has been improved on the Fermi architecture for sizes greater than 256x256. Peak performance is reached when 'k' is a multiple of 8 and 'm' and 'n' are multiples of 32. The performance of the CUBLAS routine DGEMM has significantly improved for the Tesla products based on the Fermi architecture (C20XX, S20XX, M20XX). The peak performance can be achieved for all transpose variations (NN, NT, TN, TT) when the following conditions are met: 'm' and 'n' dimensions are a multiple of 64, the 'k' dimension is a multiple of 16, ((m+n)*k) > (2*784*784). The performance of the CUBLAS routine SGEMM has also been significantly improved on Fermi architecture. The peak performance can be achieved for all transpose variations (NN, NT, TN, TT) when the following conditions are met: 'm' and 'n' dimensions are a multiple of 96, the 'k' dimension is a multiple of 16, ((m+n)*k) > (2*673*673). -------------------------------------------------------------------------------- CUBLAS Related -------------------------------------------------------------------------------- o The performance of CUBLAS routines {S,D,C,Z}SYRK and {C,Z}HERK on the Fermi architecture has been significantly improved. These routines have been derived respectively from their {S,D,C,Z}GEMM counterparts and have the same requirements to achieve peak performance. o Improved the performance of many Level 1 BLAS functions in the CUDA CUBLAS library. Note that functions that implement a reduction such as *dot, *min, and *max are not improved. -------------------------------------------------------------------------------- Other -------------------------------------------------------------------------------- o The performance of round-to-nearest double precision reciprocals in device code has been improved by more than 50% for both Tesla and Fermi-class architectures. -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Bug Fixes -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- o Fixed: CUDA_*_PATH environment variables get resolved properly when used within a Windows command prompt. -------------------------------------------------------------------------------- CUDA Driver Related -------------------------------------------------------------------------------- o Excessive use of device printf can cause new Timeout Detection and Recovery (TDR) errors to be observed. If a kernel is close to exceeding its run time limit, adding printf may push the kernel over its limit causing it to fail. The more printfs that are added, the more likely this is to occur. If a kernel runs fine without calls to printf, but sees cudaErrorLaunchTimeout or CUDA_ERROR_LAUNCH_TIMEOUT errors when calls to printf are added, then the number of printfs should be reduced or parts of the kernel should be bypassed to bring the execution time below the run time limit. This is due to the mechanism in WinVista/Win7's WDDM display driver model called "Timeout Detection and Recovery". See http://www.microsoft.com/whdc/device/displ...dm_timeout.mspx for details. The Microsoft webpage shows the registry keys you can use to change these settings. Note that you have to reboot for changes to the regkeys to take effect. o In the previous versions, 2D Texture size 65536 failed for compute capability equal to 2.0. The limit has been fixed to be in line with hardware capability. -------------------------------------------------------------------------------- CUDA Library Related -------------------------------------------------------------------------------- o On Tesla-architecture GPUs, the cublasSgemm, cublasDgemm, and cublasZgemm routines would fail with an "Unspecified Launch Error" in some cases when the 'k' parameter was not a multiple of 16. This is now fixed. Note: The cublasCgemm routine was not affected by this bug. o If Beta=0 for SSYRK and DSYRK, the 3.2 Release Candidate could produce incorrect results on Fermi in some cases. This has been fixed in the current 3.2 RC2 release. o Fixed: Performance of the CHERK function was reported as being slower in the CUDA 3.2 Toolkit Release Candidate compared to the 3.1 release. o Previous versions of CUFFT would possibly fail if the input or output data pointers were not aligned to a multiple of 256 bytes. Now, CUFFT requires 8 byte alignment for single-precision input data and 16 bytes for double-precision input data. o CUDA CUFFT: In the previous version, the CUFFT library would sometimes generate incorrect results on systems with multiple GPUS where the GPUs were of different types e.g a system with a GTX275 and a GTX480. This issue has been fixed; CUFFT detects the "active" device and correctly configures execution accordingly. o In some cases, the CUFFT library would cause the entire process to terminate when an internal or input error was encountered. This has been fixed so that the CUFFT APIs correctly catch the errors and return gracefully with an appropriate error code. o The previous version of CUFFT incorrectly created and destroyed planners such that a newly created planner could have a handle that was not unique from another existing active handle in certain situations. This has been fixed and now planners can be created and destroyed safely in any order. o In previous versions of CUFFT, C2C transforms of length 4 and C2R and R2C transforms of length 8 would produce incorrect results when the batch size was not a multiple of 64 for Tesla and not a multiple of 128 for Fermi. This has been fixed and transforms of length 4 and 8 will produce correct results for any batch size. -------------------------------------------------------------------------------- CUDA Runtime Related -------------------------------------------------------------------------------- o In the previous version 1D Texture Maximum Size allowed did not match documentation. Fixed maximum width for a 1D texture reference bound to linear memory to be 2^27. -------------------------------------------------------------------------------- Source code for Open64 -------------------------------------------------------------------------------- The Open64 source files are controlled under terms of the GPL license. Current and previously released versions are located via anonymous ftp at download.nvidia.com in the CUDAOpen64 directory. -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Revision History -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- 10/2010 - Version 3.2 RC2 09/2010 - Version 3.2 RC 06/2010 - Version 3.1 04/2010 - Version 3.1 Beta 02/2010 - Version 3.0 10/2009 - Version 3.0 Beta 07/2009 - Version 2.3 06/2009 - Version 2.3 Beta 05/2009 - Version 2.2 03/2009 - Version 2.2 Beta 11/2008 - Version 2.1 Beta 06/2008 - Version 2.0 11/2007 - Version 1.1 06/2007 - Version 1.0 06/2007 - Version 0.9 02/2007 - Version 0.8 - Initial public Beta -------------------------------------------------------------------------------- More Information -------------------------------------------------------------------------------- For more information and help with CUDA, please visit: http://www.nvidia.com/cuda -------------------------------------------------------------------------------- Acknowledgements -------------------------------------------------------------------------------- NVIDIA extends thanks to the University of Tennessee, and especially Professor Jack Dongarra and Stanimire Tomov, for their contributions to the matrix multiplication functions in the CUBLAS library. These changes are incorporated under the following copyright and with the following conditions: "Copyright (c) 2010 The University of Tennessee. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: - Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. - Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer listed in this license in the documentation and/or other materials provided with the distribution. - Neither the name of the copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE" --------------------------------------------------------------------------------