=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
NVIDIA CUDA Toolkit v4.2 Release Notes Errata
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

----------------------------------------
Known Issues
----------------------------------------

* The functions cudaGetDeviceProperties, cuDeviceGetProperties, and
  cuDeviceGetAttribute may return an incorrect clock frequency for the SM
  clock on Kepler GPUs. [Windows and Linux]

* In CUDA Toolkit 4.2, the functions cudaDeviceGetSharedMemConfig() and
  cudaDeviceSetSharedMemConfig() were added for Kepler. However, the CUDA
  Reference Manual included with CUDA Toolkit 4.2 was not regenerated to
  include documentation for these functions. The functions are documented in
  the Doxygen comments in the file include/cuda_runtime_api.h in the toolkit
  installation directory.

* If required, a Java installation is triggered the first time the Visual
  Profiler is launched. If this occurs, the Visual Profiler must be exited
  and restarted.

* GraphCut is not supported on GPUs with compute capability less than 1.1.

* In the CUDA C Programming Guide for CUDA Toolkit 4.2, some of the
  instruction throughputs listed for compute capability 3.0 in Table 5.1 are
  incorrect. The table has been corrected in the externally linked document
  on DevZone and will be corrected in the next version of the CUDA C
  Programming Guide.
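Because the 4.2 Reference Manual does not yet document the new shared memory
configuration functions, the following minimal sketch, written from the
Doxygen comments in cuda_runtime_api.h, shows one way they might be called.
It assumes a CUDA 4.2 (or later) runtime; the eight-byte bank-size setting
only has an effect on Kepler-class devices.

```cuda
/* Hedged sketch: query and set the shared memory bank size (Kepler). */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaSharedMemConfig cfg;
    cudaError_t err = cudaDeviceGetSharedMemConfig(&cfg);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetSharedMemConfig failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Current shared memory bank size config: %d\n", (int)cfg);

    /* Request 8-byte banks; this is effectively a no-op on pre-Kepler
     * devices, which only support 4-byte banks. */
    err = cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetSharedMemConfig failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```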
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
NVIDIA CUDA Toolkit v4.2 Release Notes for Windows, Linux, and Mac OS X
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

----------------------------------------
CONTENTS
----------------------------------------
-- Release Highlights
-- Documentation
-- List of Important Files
-- Supported NVIDIA Hardware
-- Supported Operating Systems
   ---- Windows
   ---- Linux
   ---- Mac OS X
-- Installation Notes
-- New Features
-- Resolved Issues
-- Known Issues
-- Source Code for Open64 and CUDA-GDB
-- Revision History
-- More Information

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Release Highlights
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
* Added support for GK10x Kepler GPUs.

* This release contains the following:
  - NVIDIA CUDA Toolkit documentation
  - NVIDIA OpenCL documentation
  - NVIDIA CUDA compiler (nvcc) and supporting tools
  - NVIDIA CUDA runtime libraries
  - NVIDIA CUDA-GDB debugger
  - NVIDIA CUDA-MEMCHECK
  - NVIDIA Visual Profiler
  - NVIDIA CUBLAS, CUFFT, CUSPARSE, CURAND, Thrust, and NPP libraries

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Documentation
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
For a list of documents supplied with this release, please refer to the /doc
directory of your CUDA Toolkit installation.

NOTE: The NVML development package is not shipped with CUDA 4.2. For changes
related to nvidia-smi and NVML, please refer to the nvidia-smi man page and
the "Tesla Deployment Kit" package located on the developer site at
http://developer.nvidia.com/tesla-deployment-kit; NVML documentation and the
SDK are included there.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
List of Important Files
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

bin/
  nvcc                      CUDA C/C++ compiler
  cuda-gdb                  CUDA Debugger
  cuda-memcheck             CUDA Memory Checker
  nvvp                      NVIDIA Visual Profiler
                            (On Windows, nvvp is located in libnvvp/)

include/
  cuda.h                    CUDA driver API header
  cudaGL.h                  CUDA OpenGL interop header for driver API
  cudaVDPAU.h               CUDA VDPAU interop header for driver API
                            (Linux only)
  cuda_gl_interop.h         CUDA OpenGL interop header for toolkit API
                            (Linux only)
  cuda_vdpau_interop.h      CUDA VDPAU interop header for toolkit API
                            (Linux only)
  cudaD3D9.h                CUDA DirectX 9 interop header (Windows only)
  cudaD3D10.h               CUDA DirectX 10 interop header (Windows only)
  cudaD3D11.h               CUDA DirectX 11 interop header (Windows only)
  cufft.h                   CUFFT API header
  cublas_v2.h               CUBLAS API header
  cublas.h                  CUBLAS legacy API header
  cusparse_v2.h             CUSPARSE API header
  cusparse.h                CUSPARSE legacy API header
  curand.h                  CURAND API header
  curand_kernel.h           CURAND device API header
  thrust/*                  Thrust headers
  npp.h                     NPP API header
  nvcuvid.h                 CUDA Video Decoder header (Windows and Linux)
  cuviddec.h                CUDA Video Decoder header (Windows and Linux)
  NVEncodeDataTypes.h       CUDA Video Encoder (C-library or DirectShow)
                            (Windows only)
  NVEncodeAPI.h             CUDA Video Encoder (C-library) (Windows only)
  INvTranscodeFilterGUIDs.h CUDA Video Encoder (DirectShow) (Windows only)
  INVVESetting.h            CUDA Video Encoder (DirectShow) (Windows only)

extras/
  CUPTI                     CUDA Profiling APIs
  Debugger                  CUDA Debugger APIs

----------------------------------------
Windows lib files
----------------------------------------
lib/
  cuda.lib                  CUDA driver library
  cudart.lib                CUDA runtime library
  cublas.lib                CUDA BLAS library
  cufft.lib                 CUDA FFT library
  cusparse.lib              CUDA Sparse Matrix library
  curand.lib                CUDA Random Number Generation library
  npp.lib                   NVIDIA Performance Primitives library
  nvcuvenc.lib              CUDA Video Encoder library
  nvcuvid.lib               CUDA Video Decoder library
----------------------------------------
Linux lib files
----------------------------------------
lib/
  libcuda.so                CUDA driver library
  libcudart.so              CUDA runtime library
  libcublas.so              CUDA BLAS library
  libcufft.so               CUDA FFT library
  libcusparse.so            CUDA Sparse Matrix library
  libcurand.so              CUDA Random Number Generation library
  libnpp.so                 NVIDIA Performance Primitives library

----------------------------------------
Mac OS X lib files
----------------------------------------
lib/
  libcuda.dylib             CUDA driver library
  libcudart.dylib           CUDA runtime library
  libcublas.dylib           CUDA BLAS library
  libcufft.dylib            CUDA FFT library
  libcusparse.dylib         CUDA Sparse Matrix library
  libcurand.dylib           CUDA Random Number Generation library
  libnpp.dylib              NVIDIA Performance Primitives library
  libtlshook.dylib          NVIDIA internal library

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Supported NVIDIA Hardware
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

See http://www.nvidia.com/object/cuda_gpus.html.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Supported Operating Systems for Windows, Linux, and Mac OS X
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

----------------------------------------
Windows
----------------------------------------
* Supported Operating Systems (32-bit and 64-bit)

  OS
  --
  Windows Server 2008
  Windows XP
  Windows Vista
  Windows 7

* Supported Compilers

  Platform  Compiler          IDE
  --------  --------          ---
  Windows   MSVC8 (14.00)     VS 2005
  Windows   MSVC9 (15.00)     VS 2008
  Windows   MSVC2010 (16.00)  VS 2010

----------------------------------------
Linux
----------------------------------------
The CUDA development environment relies on tight integration with the host
development environment, including the host compiler and C runtime libraries,
and is therefore only supported on distribution versions that have been
qualified for this CUDA Toolkit release.
* Supported Distros

  Distro             32  64  Kernel               GCC         GLIBC
  ------             --  --  ------               ---         -----
  Fedora14           X   X   2.6.35.6-45          4.5.1       2.12.90
  ICC Compiler 11.1  X   X
  OpenSUSE-11.2      X   X   2.6.31.5-0.1         4.4.1       2.10.1
  RHEL-5.>=5         X   X   2.6.18-238.el5       4.1.2       2.5
   (5.5, 5.6, 5.7)
  RHEL-6.X               X   2.6.32-131.0.15.el6  4.4.5       2.12
   (6.0, 6.1)
  SLES 11.1          X   X   2.6.32.12-0.7-pae    4.3-62.198  2.11.1-0.17.4
  Ubuntu-10.04       X   X   2.6.35-23-generic    4.4.5       2.12.1
  Ubuntu-11.04       X   X   2.6.38-8-generic     4.5.2       2.13

* Distros No Longer Supported

  Distro             32  64  Kernel               GCC         GLIBC
  ------             --  --  ------               ---         -----
  Fedora13           X   X   2.6.33.3-85          4.4.4       2.12
  RHEL-4.8               X   2.6.9-89.ELsmp       3.4.6       2.3.4
  Ubuntu-10.10       X   X   2.6.35-23-generic    4.4.5       2.12.1

  NOTE: The 32-bit versions of RHEL 4.8 and RHEL 6.0 have not been tested
  with this release and are therefore not supported in this CUDA Toolkit
  release.

----------------------------------------
Mac OS X
----------------------------------------

  Platform       32  64  GCC
  --------       --  --  ---
  Mac OS X 10.7  X   X   4.2.1 (build 5646)
  Mac OS X 10.6  X   X   4.2.1 (build 5646)

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Installation Notes
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

----------------------------------------
Windows
----------------------------------------
Silent Installation:
- To install, use msiexec.exe from the shell and pass these arguments:
    msiexec.exe /i cudatoolkit.msi /qn
- To uninstall, use /x instead of /i.

----------------------------------------
Linux
----------------------------------------
* In order to run CUDA applications, the CUDA kernel module must be loaded
  and the device entries in /dev must be created. This may be achieved by
  initializing X Windows, or by creating a script to load the kernel module
  and create the entries. An example script (to be run at boot time):

    #!/bin/bash

    /sbin/modprobe nvidia

    if [ "$?" -eq 0 ]; then
      # Count the number of NVIDIA controllers found.
      N3D=`/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
      NVGA=`/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`

      N=`expr $N3D + $NVGA - 1`
      for i in `seq 0 $N`; do
        mknod -m 666 /dev/nvidia$i c 195 $i
      done

      mknod -m 666 /dev/nvidiactl c 195 255
    else
      exit 1
    fi

* On some Linux releases, due to a GRUB bug in the handling of upper memory
  and a default vmalloc region that is too small on 32-bit systems, it may be
  necessary to pass this information to the bootloader:

    vmalloc=256MB, uppermem=524288

  Example GRUB configuration:

    title Red Hat Desktop (2.6.9-42.ELsmp)
      root (hd0,0)
      uppermem 524288
      kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet
        vmalloc=256MB pci=nommconf
      initrd /initrd-2.6.9-42.ELsmp.img

* Pinned memory in CUDA is only supported on Linux kernel versions 2.6.18
  and later. Host-side memory allocations pinned for CUDA using the
  cudaHostRegister() API can be passed to third-party drivers. Pinned memory
  allocations returned from cudaHostAlloc() and cudaMallocHost() can also be
  passed to third-party drivers; starting with CUDA 4.1, the CUDA_NIC_INTEROP
  flag is no longer needed with these APIs, and the flag is now deprecated.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
New Features
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

* Support for GK10x Kepler GPUs.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Resolved Issues
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

* In the routines cusparse<t>csr2hyb and cusparse<t>dense2hyb, upon the
  occurrence of an error (typically a device memory allocation failure), the
  handle to the hybrid-format descriptor (cusparseHybMat_t) was wrongly
  destroyed internally using cusparseDestroyHybMat. A subsequent call to
  cusparseDestroyHybMat by the user would then result in an error.
  This issue has been fixed in the 4.2 toolkit; the user now can and should
  call cusparseDestroyHybMat to clean up, either after an error or when the
  matrix is no longer needed.

* CUDA-MEMCHECK now explicitly reports calls to assert() inside a CUDA
  kernel.

* The version of Thrust included with the CUDA Toolkit has been upgraded
  from 1.5.1 to 1.5.2.

* The NPP rotate primitives incorrectly enforced that the source image's
  pitch (nSrcStep) was large enough to accommodate the destination ROI's
  size. This bug has been fixed and the restriction no longer exists.

* Starting with CUDA Toolkit 4.0, cublasDestroy did not properly free all of
  the GPU resources, leading to a GPU memory leak of about 256 KB per CUBLAS
  handle. This could also lead to GPU memory fragmentation when the
  unreleased resources were scattered over GPU memory. This issue has been
  resolved in the 4.2 toolkit.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Known Issues
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

----------------------------------------
Windows
----------------------------------------
* In the NPP library, the nppiGraphcut_32s8u() and nppiGraphcut8_32s8u()
  primitives may fail with an error when running on a GPU of the sm1.0
  (compute capability 1.0) architecture, especially on systems with a 64-bit
  operating system.

* Individual kernels are limited to a 2-second runtime by Windows Vista.
  Kernels that run for longer than 2 seconds will trigger the Timeout
  Detection and Recovery (TDR) mechanism. For more information, see
  http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx.

* The maximum size of a single memory allocation created by cudaMalloc or
  cuMemAlloc on WDDM devices is limited to:

    MIN( (System Memory Size in MB - 512 MB) / 2,
         PAGING_BUFFER_SEGMENT_SIZE )

  For Vista, PAGING_BUFFER_SEGMENT_SIZE is approximately 2 GB.
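As an illustration of the WDDM allocation cap above, this hedged host-side
sketch evaluates the MIN() expression for a hypothetical machine. The 2 GB
value for PAGING_BUFFER_SEGMENT_SIZE is the approximate Vista figure quoted
above, not a quantity queried from the driver.

```cuda
/* Hedged sketch of the WDDM single-allocation cap described above. */
#include <stdio.h>

static unsigned long long max_wddm_alloc_mb(unsigned long long sys_mem_mb)
{
    /* Approximate Vista value quoted in these release notes (~2 GB). */
    const unsigned long long paging_buffer_segment_mb = 2048ULL;
    unsigned long long half_adjusted = (sys_mem_mb - 512ULL) / 2ULL;
    return half_adjusted < paging_buffer_segment_mb
               ? half_adjusted
               : paging_buffer_segment_mb;
}

int main(void)
{
    /* Example: a machine with 4 GB of system memory.
     * (4096 - 512) / 2 = 1792 MB, which is below the ~2 GB segment size. */
    printf("Approximate cap: %llu MB\n", max_wddm_alloc_mb(4096ULL));
    return 0;
}
```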
* (Windows and Linux) Individual GPU program launches are limited to a run
  time of less than 5 seconds on a GPU with a display attached. Exceeding
  this time limit usually causes a launch failure reported through the CUDA
  driver or the CUDA runtime. GPUs without a display attached are not subject
  to the 5-second runtime restriction. For this reason it is recommended that
  CUDA be run on a GPU that is NOT attached to a display and does not have
  the Windows desktop extended onto it. In this case, the system must contain
  at least one NVIDIA GPU that serves as the primary graphics adapter.

----------------------------------------
Linux & Mac
----------------------------------------
* In the NPP library, the nppiGraphcut_32s8u() and nppiGraphcut8_32s8u()
  primitives may fail with an error when running on a GPU of the sm1.0
  (compute capability 1.0) architecture, especially on systems with a 64-bit
  operating system.

* The Linux kernel provides a mode in which it allows user processes to
  overcommit system memory. (Refer to the kernel documentation for
  /proc/sys/vm/ for details.) If this mode is enabled (the default on many
  distributions), the kernel may have to kill processes in order to free up
  pages for allocation requests. The CUDA driver process, especially for
  CUDA applications that allocate large amounts of zero-copy memory with
  cuMemHostAlloc or cudaMallocHost, is particularly vulnerable to being
  killed in this way. Since there is no way for the CUDA software stack to
  report an out-of-memory error to the user before the process disappears,
  users, especially on 32-bit Linux, are encouraged to disable memory
  overcommit in their kernel to avoid this problem. Please refer to the
  documentation on vm.overcommit_memory and vm.overcommit_ratio for more
  information.

* When compiling with GCC, special care must be taken for structs that
  contain 64-bit integers. This is because GCC aligns long longs to a 4-byte
  boundary by default, while NVCC aligns long longs to an 8-byte boundary by
  default.
  Thus, when using GCC to compile a file that has such a struct/union, users
  must give the -malign-double option to GCC. When using NVCC, this option is
  automatically passed to GCC.

----------------------------------------
Mac
----------------------------------------
To save power, some Apple products automatically power down the CUDA-capable
GPU in the system. If the operating system has powered down the CUDA-capable
GPU, CUDA fails to run and the system returns an error that no device was
found. To ensure that your CUDA-capable GPU is not powered down by the
operating system, do the following:

  1. Go to "System Preferences".
  2. Open the "Energy Saver" section.
  3. Uncheck the "Automatic graphics switching" check box in the upper left.

----------------------------------------
Visual Profiler & Command Line Profiler
----------------------------------------
* The Visual Profiler fails to generate event or counter information. There
  are several reasons why the Visual Profiler may fail to gather counter
  information:

  a. More than one tool is trying to access the GPU. To fix this issue,
     please make sure only one tool is using the GPU at any given point.
     Tools include the CUDA command-line profiler, the Parallel Nsight
     Analysis Tools and Graphics Tools, and applications that use either the
     CUPTI or PerfKit API (NVPM) to read counter values.

  b. More than one application is using the GPU at the same time the Visual
     Profiler is profiling a CUDA application. To fix this issue, please
     close all other applications and run only the one being profiled with
     the Visual Profiler. Interacting with the active desktop should be
     avoided while the application is generating counter information. Please
     note that the Visual Profiler gathers counters for only one context if
     the application uses multiple contexts.

* Enabling the "{gld|gst} instructions {8|16|32|64|128}bit" counters can
  cause GPU kernels to run longer than the driver's watchdog timeout limit.
  In these cases the driver will terminate the GPU kernel, resulting in an
  application error, and profiling data will not be available. Please disable
  the driver watchdog timeout before profiling such long-running CUDA
  kernels:

  - On Linux, setting the X config option "Interactive" to false is
    recommended.
  - For Windows, detailed information on disabling the Windows TDR is
    available at
    http://msdn.microsoft.com/en-us/windows/hardware/gg487368.aspx#E2.

* On Windows Vista and Windows 7, profiling an application that makes more
  than 32K kernel launch, memory copy, or memory set API calls without an
  intervening synchronization call can result in an application hang. To work
  around this issue, add synchronization calls such as
  cudaDeviceSynchronize() or cudaStreamSynchronize().

* Enabling counters on GPUs with compute capability (SM type) 1.x can result
  in occasional hangs. Please disable counters on such runs.

* The "warp serialize" counter for GPUs with compute capability 1.x is known
  to give incorrect, high values in some cases.

* Prof triggers are not supported on GPUs with compute capability (SM type)
  1.0.

* Profiler data is flushed to a file only at synchronization calls such as
  cudaDeviceSynchronize() and cudaStreamSynchronize(), or when the profiler
  buffer becomes full. If an application terminates without these
  synchronization calls, profiler data may be lost.

* Counters gld_incoherent and gst_incoherent always return zero on GPUs with
  compute capability (SM type) 1.3. A value of zero does not mean that all
  loads and stores are 100% coalesced.

* Use Visual Profiler version 4.1 or later with driver version 285 (or
  later). Due to compatibility issues with profile counters, Visual Profiler
  4.0 (or earlier) must not be used with driver version 285 (or later).
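To reduce the risk of losing buffered profiler records as described above, a
sketch like the following adds an explicit cudaDeviceSynchronize() before the
application exits. The trivial kernel here is a stand-in for any real
workload.

```cuda
/* Hedged sketch: synchronize before exit so profiler data is flushed. */
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void work(int *out)
{
    out[threadIdx.x] = threadIdx.x;  /* placeholder workload */
}

int main(void)
{
    int *d_out = NULL;
    cudaMalloc((void **)&d_out, 32 * sizeof(int));
    work<<<1, 32>>>(d_out);

    /* Without this call, profiler records buffered for the launch above
     * may be lost if the process exits immediately. */
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "sync failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```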
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Source Code for Open64 and CUDA-GDB
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

* The Open64 and CUDA-GDB source files are controlled under the terms of the
  GPL license. Current and previously released versions are located at:
  ftp://download.nvidia.com/CUDAOpen64

* Linux users:
  - Please refer to the "Release Notes" and "Known Issues" sections in the
    CUDA-GDB User Manual (cuda-gdb.pdf).
  - Please refer to cuda-memcheck.pdf for notes on supported error detection
    and known issues.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Revision History
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

-- 04/2012 - Version 4.2
-- 01/2012 - Version 4.1 Production
-- 11/2011 - Version 4.1 RC2
-- 10/2011 - Version 4.1 RC1
-- 09/2011 - Version 4.1 EA (Information in ReadMe.txt)
-- 05/2011 - Version 4.0
-- 04/2011 - Version 4.0 RC2 [Errata]
-- 02/2011 - Version 4.0 RC
-- 11/2010 - Version 3.2
-- 10/2010 - Version 3.2 RC2
-- 09/2010 - Version 3.2 RC

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
More Information
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

* For more information and help with CUDA, please visit
  http://www.nvidia.com/cuda.

* Please refer to the LLVM Release License text in EULA.txt for details on
  LLVM licensing.