-------------------------------------------------------------------------------- -------------------------------------------------------------------------------- NVIDIA CUDA Windows XP, Vista, Server 2008, and Windows 7 Release Notes Version 3.1 -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Porting to 3.1 -------------------------------------------------------------------------------- o Prior to the 3.1 release, nvcc treated __device__ functions as implicitly static. This behavior has changed with the 3.1 release. As a result, the host linker will give a link error regarding multiple defined symbols, if 1) two identical __device__ functions are defined in two different compilations units. This includes including function definitions through the #include <> mechanism. 2) a __device__ function and an identical host function are defined in two different compilations units. For both cases, declaring the __device__ function as static will make the compilation succeed. -------------------------------------------------------------------------------- Installation Notes -------------------------------------------------------------------------------- Silent Installation 1) Install the cudatoolkit.exe from the shell passing the -r flag: cudatoolkit.exe -r This will generate a file setup.iss in C:\Windows 2) You can rename the file and use it to do a silent install: cudatoolkit.exe -s -f1"fullpathto\file.iss" -------------------------------------------------------------------------------- New Features -------------------------------------------------------------------------------- Hardware Support o See http://www.nvidia.com/object/cuda_learn_products.html o On Fermi hardware, CUDA 3.1 supports up to 16 concurrent kernels. New Toolkit Features -------------------------- o Device emulation has been removed. o cublasSpsv, cublasDpsv, cublasCpsv, cublasZpsv, and cublasSbsv, cublasDbsv, cublasCbsv, cublasZbsv have been enhanced to remove all previous size limitations on the input vector o Improved interoperability between the CUDA Driver API and the CUDA Runtime API. Includes support for sharing pointers, events, streams, arrays and graphics interop resources between the CUDA Driver API and the CUDA Runtime API. Introduces CUDA Runtime API compatibility with the CUDA Driver context migration API (cuCtxPushCurrent and cuCtxPopCurrent). o Added the ability to call printf() from kernels. This feature is supported only on the Fermi architecture. o Added support for recursion in device functions. This feature is supported only on the Fermi architecture. Note that we default to a stack size limit of 1K per thread, so can run out of stack if recurse too deeply. Can use cuCtxSetLimit() to change the default stack size. o Added support for function pointers. This feature is supported only on the Fermi architecture. Function pointers can only be used inside a single kernel; they cannot be passed to another kernel. o Specific GPUs can be made invisible with the CUDA_VISIBLE_DEVICES environment variable. Visible devices should be included as a comma-separated list in terms of the system-wide list of devices. For example, to use only devices 0 and 2 from the system-wide list of devices, set CUDA_VISIBLE_DEVICES equal to "0,2" before launching the application. The application will then enumerate these devices as device 0 and device 1. New API Features ------------------------- o In CUFFT-3.1, R2C and C2R transforms for power-of-2 sizes now experience a similar speedup to their C2C equivalent. However, CUFFT's internal data layout is different to that used by FFTW; by default CUFFT will match FFTW's data format, but at some performance penalty. To enable faster transforms, the user must use cufftSetCompatibilityMode() API to disable FFTW-compatible behavior and enable faster native mode. o CUBLAS now supports CUDA Stream via the cublasSetKernelStream API o Unformatted surface load/store (i.e. the ability to write to textures). This feature is supported only on the Fermi architecture. o New functions cuCtxSetLimit() and cuCtxGetLimit() have been added to control GPU thread stack size and the size of the printf() FIFO queue. o Device-to-device transfers in a non-NULL stream with asynchronous cudaMemcpy calls may overlap with kernels. Runtime documentation has been updated to reflect this. o New device attributes report the PCI bus and device identifiers of a particular GPU for better integration with system management tools. New Performance Improvements ---------------------------- o Double-precision and C2R/R2C performance of CUFFT has been improved significantly for many transform sizes since the CUFFT 3.0 release. o Double precision divide and reciprocal on the Fermi architecture have been optimized. o The performance of selected transcendental functions from the log, pow, erf, and gamma families. -------------------------------------------------------------------------------- Bug Fixes -------------------------------------------------------------------------------- o The CUBLAS SGEMM, CGEMM and small matrix DGEMM performance regressions that were in v3.0 have been restored in v3.1. -------------------------------------------------------------------------------- Known Issues -------------------------------------------------------------------------------- Vista, Server 2008 and Windows 7 Specific Issues: o In order to run CUDA on a non-TESLA GPU, either the Windows desktop must be extended onto the GPU, or the GPU must be selected as the PhysX GPU. o Individual kernels are limited to a 2-second runtime by Windows Vista. Kernels that run for longer than 2 seconds will trigger the Timeout Detection and Recovery (TDR) mechanism. For more information, see http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx. o The CUDA Profiler does not support performance counter events on Windows Vista. All profiler configuration regarding performance counter events is ignored. o The maximum size of a single allocation created by cudaMalloc or cuMemAlloc is limited to: MIN ( ( System Memory Size in MB - 512 MB ) / 2, PAGING_BUFFER_SEGMENT_SIZE ) For Vista, PAGING_BUFFER_SEGMENT_SIZE is approximately 2GB. o The OS may impose artificial limits on the amount of memory you can allocate using the Cuda APIs for both system and video memory. In many cases, these limits are significantly less than the size of physical system and video memory, but there are exceptions that make it difficult to quantify the expected behavior for a particular application. XP Specific Issues: o Individual GPU program launches are limited to a run time of less than 5 seconds on a GPU with a display attached. Exceeding this time limit usually causes a launch failure reported through the CUDA driver or the CUDA runtime. GPUs without a display attached are not subject to the 5 second runtime restriction. For this reason it is recommended that CUDA be run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter. Issues Common to XP, Vista, Server 2008 and Windows 7: o GPU enumeration order on multi-GPU systems is non-deterministic and may change with this or future releases. Users should make sure to enumerate all CUDA-capable GPUs in the system and select the most appropriate one(s) to use. o Applications that try to use too much memory may cause a CUDA memcopy or kernel to fail with the error CUDA_ERROR_OUT_OF_MEMORY. If this happens, the CUDA Context is placed into an error state and must be destroyed and recreated if the application wants to continue using CUDA. o Malloc may fail due to running out of virtual memory space. The address space limitation is fixed by a Microsoft issued hotfix. Please install the patch located at http://support.microsoft.com/kb/940105 if this is an issue. Windows Vista SP1 includes this hotfix. o When compiling a source file that includes vector_types.h with the Microsoft compiler on a 32-bit Windows system, the 16-byte aligned vector types are not properly aligned at 16 bytes. o It is a known issue that cudaThreadExit() may not be called implicitly on host thread exit. Due to this, developers are recommended to explicitly call cudaThreadExit() while the issue is being resolved. o For maximum performance when using multiple byte sizes to access the same data, coalesce adjacent loads and stores when possible rather than using a union or individual byte accesses. Accessing the data via a union may result in the compiler reserving extra memory for the object, and accessing the data as individual bytes may result in non-coalesced accesses. This will be improved in a future compiler release. o OpenGL interoperability - OpenGL can not access a buffer that is currently *mapped*. If the buffer is registered but not mapped, OpenGL can do any requested operations on the buffer. - Deleting a buffer while it is mapped for CUDA results in undefined behavior. - Attempting to map or unmap while a different context is bound than was current during the buffer register operation will generally result in a program error and should thus be avoided. - Interoperability will use a software path on SLI - Interoperability will use a software path if monitors are attached to multiple GPUs and a single desktop spans more than one GPU (i.e. WinXP dualview). o OpenCL program binary formats may change in this or future releases. Users should create programs from source and should not rely on compatibility of generated binaries between different versions of the driver. o CUBLAS issues o CUFFT issues - The stability of the large-prime FFT transform (signals with a length that is prime and >64k samples) is extremely variable, giving single- precision accuracy in the range 0.005->0.025. In general, smaller signals experience greater accuracy. - If the (batch size * transform size * datatype size) exceeds 512MB in a 1D transform, CUFFT kernels fail to launch. - Performance of CUFFT on the GT200 architecture has been reduced by 9.4% for transform size of 128 in single precision only - Performance of single precision 360x360 2D CUFFT has been reduced by 10%. - C2C transforms of length 4 and C2R and R2C transforms of length 8 will produce incorrect results when the batch size is not a multiple of 64 for Tesla and not 128 for Fermi o 32/64-bit Device code mixing - While it is currently possible to simultaneously load both 32-bit and 64-bit modules in a single context with the driver API, this feature may be removed in future CUDA releases. -------------------------------------------------------------------------------- Open64 Sources -------------------------------------------------------------------------------- The Open64 source files are controlled under terms of the GPL license. Current and previously released versions are located via anonymous ftp at download.nvidia.com in the CUDAOpen64 directory. -------------------------------------------------------------------------------- Revision History -------------------------------------------------------------------------------- 06/2010 - Version 3.1 04/2010 - Version 3.1 Beta 02/2010 - Version 3.0 10/2009 - Version 3.0 Beta 07/2009 - Version 2.3 06/2009 - Version 2.3 Beta 05/2009 - Version 2.2 03/2009 - Version 2.2 Beta 11/2008 - Version 2.1 Beta 06/2008 - Version 2.0 11/2007 - Version 1.1 06/2007 - Version 1.0 06/2007 - Version 0.9 02/2007 - Version 0.8 - Initial public Beta -------------------------------------------------------------------------------- More Information -------------------------------------------------------------------------------- For more information and help with CUDA, please visit http://www.nvidia.com/cuda -------------------------------------------------------------------------------- Acknowledgements -------------------------------------------------------------------------------- NVIDIA extends thanks to Professor Mike Giles of Oxford University for suggesting possible optimizations to the erfinv() and erfinvf() functions, which inspired the eventual optimizations for these functions in this release.