--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
NVIDIA CUDA
Windows XP and Vista
Release Notes
Version 2.2
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
New Features
--------------------------------------------------------------------------------

  Hardware Support
  o See http://www.nvidia.com/object/cuda_learn_products.html

  Platform Support
  o Additional OS support
    - Microsoft Windows Server 2008

  API Features
  o Pinned Memory Support
    - These new memory management functions (cuMemHostAlloc() and
      cudaHostAlloc()) enable pinned memory to be made "portable" (available
      to all CUDA contexts), "mapped" (mapped into the CUDA address space),
      and/or "write combined" (not cached and faster for the GPU to access).
      - cuMemHostAlloc
      - cuMemHostGetDevicePointer
      - cudaHostAlloc
      - cudaHostGetDevicePointer
  o Function attribute query
    - This function allows applications to query various function properties.
      - cuFuncGetAttribute
  o 2D Texture reads from pitch linear memory
    - You can bind linear memory that you get from cuMemAlloc() or
      cudaMalloc() directly to a 2D texture. In previous releases, you were
      only able to bind cuArrayCreate() or cudaMallocArray() arrays to 2D
      textures.
      - cuTexRefSetAddress2D
      - cudaBindTexture2D
  o Flags for event creation
    - Applications can now create events that use blocking synchronization.
      - cudaEventCreateWithFlags
  o New device management and context creation flags
    - The function cudaSetDeviceFlags() allows the application to specify
      attributes such as mapping host memory and support for blocking
      synchronization.
      - cudaSetDeviceFlags
  o Improved runtime device management
    - The runtime now defaults to attempting context creation on other
      devices in the system before returning any failure messages. The new
      call cudaSetValidDevices() allows the application to specify a list of
      acceptable devices for use.
      - cudaSetValidDevices
  o Driver/runtime version query functions
    - Applications can now directly query version information about the
      underlying driver/runtime.
      - cuDriverGetVersion
      - cudaDriverGetVersion
      - cudaRuntimeGetVersion
  o New device attribute queries
    - CU_DEVICE_ATTRIBUTE_INTEGRATED
    - CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY
    - CU_DEVICE_ATTRIBUTE_COMPUTE_MODE

  Documentation
  o Doxygen-generated and cross-referenced HTML, PDF, and Windows help files.
    - Runtime API
    - Driver API

  Performance Enhancements
  o Asynchronous memcpy support for Windows Vista
    - Asynchronous memory copy operations can now overlap GPU execution.

--------------------------------------------------------------------------------
Major Bug Fixes
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Known Issues
--------------------------------------------------------------------------------

  Vista and Server 2008 Specific Issues:
  o In order to run CUDA on a non-TESLA GPU, either the Windows desktop must
    be extended onto the GPU, or the GPU must be selected as the PhysX GPU.
  o Individual kernels are limited to a 2-second runtime by Windows Vista.
    Kernels that run for longer than 2 seconds will trigger the Timeout
    Detection and Recovery (TDR) mechanism. For more information, see
    http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx.

    GPUs without a display attached are not subject to the 2-second runtime
    restriction. For this reason it is recommended that CUDA be run on a GPU
    that is NOT attached to a display and does not have the Windows desktop
    extended onto it.
    In this case, the system must contain at least one NVIDIA GPU that serves
    as the primary graphics adapter.

    Thus, for devices like S1070 that do not have an attached display, users
    may disable the Windows TDR timeout. Disabling the TDR timeout will allow
    kernels to run for extended periods of time without triggering an error.
    The following is an example .reg script:

      Windows Registry Editor Version 5.00

      [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
      "TdrLevel"=dword:00000000

  o The CUDA Profiler does not support performance counter events on Windows
    Vista. All profiler configuration regarding performance counter events is
    ignored.
  o The maximum size of a single allocation created by cudaMalloc or
    cuMemAlloc is limited to:

      MIN ( ( System Memory Size in MB - 512 MB ) / 2,
            PAGING_BUFFER_SEGMENT_SIZE )

    For Vista, PAGING_BUFFER_SEGMENT_SIZE is approximately 2GB.

  XP Specific Issues:
  o Individual GPU program launches are limited to a run time of less than
    5 seconds on a GPU with a display attached. Exceeding this time limit
    usually causes a launch failure reported through the CUDA driver or the
    CUDA runtime. GPUs without a display attached are not subject to the
    5 second runtime restriction. For this reason it is recommended that CUDA
    be run on a GPU that is NOT attached to a display and does not have the
    Windows desktop extended onto it. In this case, the system must contain
    at least one NVIDIA GPU that serves as the primary graphics adapter.

  Issues Common to XP and Vista:
  o GPU enumeration order on multi-GPU systems is non-deterministic and may
    change with this or future releases. Users should make sure to enumerate
    all CUDA-capable GPUs in the system and select the most appropriate
    one(s) to use.
  o Applications that try to use too much memory may cause a CUDA memcopy or
    kernel to fail with the error CUDA_ERROR_OUT_OF_MEMORY.
    If this happens, the CUDA context is placed into an error state and must
    be destroyed and recreated if the application wants to continue using
    CUDA.
  o Malloc may fail due to running out of virtual memory space. The address
    space limitation is fixed by a Microsoft-issued hotfix. Please install
    the patch located at http://support.microsoft.com/kb/940105 if this is an
    issue. Windows Vista SP1 includes this hotfix.
  o When two GPUs are run in SLI mode, only one of the GPUs will be available
    to the user for executing CUDA programs.
  o When using Microsoft Visual Studio 8.0, Service Pack 1 must be installed.
    Without it, certain Windows C++ header files will cause a crash in
    cudafe.
  o "#pragma unroll" sometimes does not unroll loops because of limits in the
    compiler on loop bodies, which may cause a decrease in performance versus
    CUDA 2.0. A user can override this limit on the command line with the
    following nvcc compiler flag:

      nvcc -Xopencc -OPT:unroll_size=200000

    In most cases, this should override the built-in loop unrolling limits.
    Unless a kernel uses #pragma unroll and shows a significant performance
    drop from CUDA 2.0, this flag should not be used.
  o It is a known issue that cudaThreadExit() may not be called implicitly on
    host thread exit. Until the issue is resolved, developers should call
    cudaThreadExit() explicitly.
  o Cross-compilation with the --machine option is not supported.
  o The default compilation mode for host code is now C++. To restore the old
    behavior, use the option --host-compilation=c
  o For maximum performance when using multiple byte sizes to access the same
    data, coalesce adjacent loads and stores when possible rather than using
    a union or individual byte accesses. Accessing the data via a union may
    result in the compiler reserving extra memory for the object, and
    accessing the data as individual bytes may result in non-coalesced
    accesses. This will be improved in a future compiler release.
  o OpenGL interoperability
    - OpenGL cannot access a buffer that is currently *mapped*. If the buffer
      is registered but not mapped, OpenGL can do any requested operations on
      the buffer.
    - Deleting a buffer while it is mapped for CUDA results in undefined
      behavior.
    - Attempting to map or unmap while a different context is bound than was
      current during the buffer register operation will generally result in a
      program error and should thus be avoided.
    - Interoperability will use a software path on SLI.
    - Interoperability will use a software path if monitors are attached to
      multiple GPUs and a single desktop spans more than one GPU
      (i.e. WinXP dualview).
  o Both the cudaEventQuery() and cudaStreamQuery() functions may show
    first-chance exceptions under certain conditions when debugging with
    Visual Studio. These first-chance exception informational messages are
    part of the expected behavior for the CUDA runtime and can thus be safely
    ignored.

--------------------------------------------------------------------------------
Open64 Sources
--------------------------------------------------------------------------------

The Open64 source files are controlled under terms of the GPL license.
Current and previously released versions are located via anonymous ftp at
download.nvidia.com in the CUDAOpen64 directory.

--------------------------------------------------------------------------------
Revision History
--------------------------------------------------------------------------------

  03/2009 - Version 2.2 Beta
  11/2008 - Version 2.1 Beta
  06/2008 - Version 2.0
  11/2007 - Version 1.1
  06/2007 - Version 1.0
  06/2007 - Version 0.9
  02/2007 - Version 0.8 - Initial public Beta

--------------------------------------------------------------------------------
More Information
--------------------------------------------------------------------------------

  For more information and help with CUDA, please visit
  http://www.nvidia.com/cuda