--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

NVIDIA CUDA
Windows XP, Vista, Server 2008, and Windows 7 Release Notes
Version 3.0

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Installation Notes
--------------------------------------------------------------------------------

Silent Installation:

1) Install the cudatoolkit.exe from the shell, passing the -r flag:

       cudatoolkit.exe -r

   This generates a file setup.iss in C:\Windows.

2) You can rename the file and use it to perform a silent install:

       cudatoolkit.exe -s -f1"fullpathto\file.iss"

--------------------------------------------------------------------------------
New Features
--------------------------------------------------------------------------------

Hardware Support

o See http://www.nvidia.com/object/cuda_learn_products.html
o Function attributes added:
    - PTX_VERSION
    - BINARY_VERSION
o Device attributes added:
    - MAXIMUM_TEXTURE*
    - SURFACE_ALIGNMENT
    - CONCURRENT_KERNELS

New API Features
--------------------------

o Float16 (half) textures are supported in the runtime.
    - The cudaCreateChannelDescHalf family of functions supports this in the
      C++-style API; in the C-style API, the proper channel can be created
      via cudaCreateChannelDesc.
    - Be aware that halves are promoted to floats during computation;
      therefore, only floats can be fetched by the texture fetch functions.
    - Intrinsics are available in device code to convert between fp16 and
      fp32 data.
o double3 and double4 vector types are supported in the runtime.
    - This breaks code in which users had already defined these types
      themselves.
o One-dimensional device-to-device copies now support streams.
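The fp16/fp32 conversion mentioned above for Float16 textures can be
illustrated with a host-side sketch. The helper names below are hypothetical
(on the device you would use the CUDA conversion intrinsics instead), and
NaN/Inf/subnormal handling is omitted:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side illustration of IEEE 754 half<->float conversion.
   Handles zero and normal values only (no NaN/Inf/subnormals), and
   truncates the mantissa rather than rounding to nearest. */
uint16_t float_to_half(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);                       /* raw float bits */
    uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);
    int32_t  e    = (int32_t)((x >> 23) & 0xFF) - 127 + 15;  /* rebias exp */
    if (e <= 0 || e >= 31)                          /* out of half's range */
        return sign;                                /* flush to +/-0 here  */
    return (uint16_t)(sign | ((uint32_t)e << 10) | ((x & 0x7FFFFFu) >> 13));
}

float half_to_float(uint16_t h)
{
    uint32_t sign = ((uint32_t)h & 0x8000u) << 16;
    uint32_t e    = (h >> 10) & 0x1Fu;
    uint32_t m    = h & 0x3FFu;
    uint32_t x    = (e == 0 && m == 0)
                  ? sign                                        /* zero   */
                  : sign | ((e - 15 + 127) << 23) | (m << 13);  /* normal */
    float f;
    memcpy(&f, &x, sizeof f);
    return f;
}
```

Round-tripping a value that is exactly representable in half precision, such
as 1.5f, returns it unchanged, mirroring the promotion to float that texture
fetches of Float16 data perform.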
    - cudaMemcpyAsync now applies the stream parameter to
      cudaMemcpyDeviceToDevice copies as well.
    - cuMemcpyDtoDAsync has been added.
o Support for ELF binaries.
    - ELF is generated by default by nvcc. For ptxas or fatbin, the -elf
      option is required.
    - Cubins are now binary files. Do not assume that they are ASCII text.
o Testing applications for Fermi readiness.
    - Setting the environment variable CUDA_FORCE_PTX_JIT to 1 prevents all
      non-PTX user kernels from loading. If your application then fails to
      run, you are not compiling with PTX. Please see the programming guide
      for more information about compiling for different compute
      capabilities.
o OpenGL texture interoperability.
o Concurrent kernels.
    - Kernels launched within different non-NULL streams may now overlap
      with each other if they are able to fit on the device simultaneously.
      The ability of a device to run multiple kernels concurrently can be
      queried via the CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS device
      attribute. See the 3.0 programming guide for details on using this
      feature.
o Batched 2D and 3D transforms are now supported in CUFFT, using the new
  cufftPlanMany() API. This is defined in cufft.h as follows:

      cufftResult CUFFTAPI cufftPlanMany(cufftHandle *plan,
                                         int rank,
                                         int *n,
                                         int *inembed,  // Unused: pass NULL
                                         int istride,   // Unused: pass 1
                                         int idist,     // Unused: pass 0
                                         int *onembed,  // Unused: pass NULL
                                         int ostride,   // Unused: pass 1
                                         int odist,     // Unused: pass 0
                                         cufftType type,
                                         int batch);

  The arguments are:

      *plan  - The plan is returned here, as for other CUFFT calls
      rank   - The dimensionality of the transform (1, 2, or 3)
      *n     - An array of size [rank], describing the size of each
               dimension
      type   - Transform type (e.g., CUFFT_C2C), as for other CUFFT calls
      batch  - Batch size for this transform

  Return values are the same as for all other cufftPlanXxx functions.
Thus, to plan a batch of 1000 2D double-precision complex-to-complex
transforms of size (128, 256), you would do:

    cufftHandle myplan;
    int n[2] = { 128, 256 };

    cufftPlanMany(&myplan, 2, n, NULL, 1, 0, NULL, 1, 0, CUFFT_Z2Z, 1000);

Note that for CUFFT 3.0, the layout of batched data must be side-by-side and
not interleaved. The inembed, istride, idist, onembed, ostride, and odist
parameters are for enabling data windowing and interleaving in a future
version.

New Toolkit Features
--------------------------

o nvcc
    - The command line option --host-compilation=C has been deprecated:
      nvcc emits a warning and falls back to C++. This option will
      eventually disappear altogether.
o Windows DLL Naming Conventions
    - Each DLL now specifies the machine type, the toolkit version number,
      and the build number in its filename.
    - For example, cudart32_30_4.dll would be the 32-bit build of the 3.0
      Cudart with a build number of 4.
    - The build number of the final release will always be greater than the
      build number of the beta release.
    - The corresponding .lib files do not have any extra naming decoration,
      so you can continue linking your applications the same way.
o Separate Library for Runtime Device Emulation
    - Cudart has been split into two libraries. Link against Cudartemu for
      device emulation, similar to the way Cublasemu/Cufftemu were
      previously used.
o CUBLAS Library Support
    - On the Fermi architecture (e.g., sm_20), arithmetic is IEEE-754
      compliant.
    - cublasStrmv and cublasDtrmv have been enhanced to remove the previous
      size limitation on the input vector.
    - On the Tesla architecture, cublasZgemm performance has been improved
      to be similar to cublasDgemm.
    - Added the BLAS1 functions:
        * cublasDzasum()
        * cublasDznrm2()
        * cublasIzamax()
        * cublasIzamin()
        * cublasZaxpy()
        * cublasZcopy()
        * cublasZdrot()
        * cublasZdscal()
        * cublasZrot()
        * cublasZrotg()
        * cublasZswap()
    - Added the BLAS2 functions:
        * cublasCgbmv()
        * cublasCgemv()
        * cublasCgerc()
        * cublasCgeru()
        * cublasChbmv()
        * cublasChemv()
        * cublasCher()
        * cublasCher2()
        * cublasChpmv()
        * cublasChpr()
        * cublasChpr2()
        * cublasCtbmv()
        * cublasCtbsv()
        * cublasCtpmv()
        * cublasCtpsv()
        * cublasCtrmv()
        * cublasCtrsv()
        * cublasDgbmv()
        * cublasDsbmv()
        * cublasDspr()
        * cublasDspr2()
        * cublasDsymv()
        * cublasDsyr2()
        * cublasDtbmv()
        * cublasDtbsv()
        * cublasDtpmv()
        * cublasDtpsv()
        * cublasZgbmv()
        * cublasZgemv()
        * cublasZgerc()
        * cublasZgeru()
        * cublasZhbmv()
        * cublasZhemv()
        * cublasZher()
        * cublasZher2()
        * cublasZhpmv()
        * cublasZhpr()
        * cublasZhpr2()
        * cublasZtbmv()
        * cublasZtbsv()
        * cublasZtpmv()
        * cublasZtpsv()
        * cublasZtrmv()
        * cublasZtrsv()
    - Added the BLAS3 functions:
        * cublasChemm()
        * cublasCher2k()
        * cublasCherk()
        * cublasCsymm()
        * cublasCsyr2k()
        * cublasCsyrk()
        * cublasCtrmm()
        * cublasCtrsm()
        * cublasZhemm()
        * cublasZher2k()
        * cublasZherk()
        * cublasZsymm()
        * cublasZsyr2k()
        * cublasZsyrk()
        * cublasZtrmm()
        * cublasZtrsm()
o OpenCL Extensions
    - Added cl_khr_icd
    - Added cl_khr_gl_sharing
    - Added cl_khr_fp64
    - Added cl_nv_compiler_options
    - Added cl_nv_device_attribute_query
    - Added cl_nv_pragma_unroll
    - Added cl_nv_d3d9_sharing
    - Added cl_nv_d3d10_sharing
    - Added cl_nv_d3d11_sharing

--------------------------------------------------------------------------------
Bug Fixes
--------------------------------------------------------------------------------

o Both the cudaEventQuery() and cudaStreamQuery() functions have been
  altered so that they no longer raise first-chance exceptions when
  cudaErrorNotReady would be returned.
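A minimal sketch of the polling pattern these query functions enable,
assuming an existing stream (error checking omitted; wait_for_stream is a
hypothetical helper, not a CUDA API):

```c
#include <cuda_runtime.h>

/* Poll a stream until all of its queued work has completed, without
   blocking the host thread. cudaErrorNotReady is an expected return
   value here, not an error condition. */
cudaError_t wait_for_stream(cudaStream_t stream)  /* hypothetical helper */
{
    cudaError_t status;
    while ((status = cudaStreamQuery(stream)) == cudaErrorNotReady) {
        /* The stream is still busy; do useful host-side work here
           instead of spinning. */
    }
    return status;  /* cudaSuccess, or a genuine error code */
}
```

With the fix, a loop like this can run under the Visual Studio debugger with
exception breaking enabled, since the cudaErrorNotReady path no longer
surfaces as a first-chance exception.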
  The change eliminates an issue where users could not enable exception
  debugging in Visual Studio for applications that used these API calls.

--------------------------------------------------------------------------------
Known Issues
--------------------------------------------------------------------------------

Vista, Server 2008, and Windows 7 Specific Issues:

o In order to run CUDA on a non-Tesla GPU, either the Windows desktop must
  be extended onto the GPU, or the GPU must be selected as the PhysX GPU.

o Individual kernels are limited to a 2-second runtime by Windows Vista.
  Kernels that run for longer than 2 seconds will trigger the Timeout
  Detection and Recovery (TDR) mechanism. For more information, see
  http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx.

o The CUDA Profiler does not support performance counter events on Windows
  Vista. All profiler configuration regarding performance counter events is
  ignored.

o The maximum size of a single allocation created by cudaMalloc or
  cuMemAlloc is limited to:

      MIN( (System Memory Size in MB - 512 MB) / 2,
           PAGING_BUFFER_SEGMENT_SIZE )

  For Vista, PAGING_BUFFER_SEGMENT_SIZE is approximately 2 GB.

o The OS may impose artificial limits on the amount of memory you can
  allocate using the CUDA APIs, for both system and video memory. In many
  cases these limits are significantly less than the size of physical
  system and video memory, but there are exceptions that make it difficult
  to quantify the expected behavior for a particular application.

XP Specific Issues:

o Individual GPU program launches are limited to a runtime of less than
  5 seconds on a GPU with a display attached. Exceeding this time limit
  usually causes a launch failure, reported through the CUDA driver or the
  CUDA runtime. GPUs without a display attached are not subject to the
  5-second runtime restriction. For this reason, it is recommended that
  CUDA be run on a GPU that is NOT attached to a display and does not have
  the Windows desktop extended onto it.
  In this case, the system must contain at least one NVIDIA GPU that serves
  as the primary graphics adapter.

Issues Common to XP, Vista, Server 2008, and Windows 7:

o GPU enumeration order on multi-GPU systems is non-deterministic and may
  change with this or future releases. Users should make sure to enumerate
  all CUDA-capable GPUs in the system and select the most appropriate
  one(s) to use.

o Applications that try to use too much memory may cause a CUDA memcopy or
  kernel to fail with the error CUDA_ERROR_OUT_OF_MEMORY. If this happens,
  the CUDA context is placed into an error state and must be destroyed and
  recreated if the application wants to continue using CUDA.

o Malloc may fail due to running out of virtual memory space. This address
  space limitation is fixed by a Microsoft-issued hotfix; please install
  the patch located at http://support.microsoft.com/kb/940105 if this is an
  issue. Windows Vista SP1 includes this hotfix.

o When compiling a source file that includes vector_types.h with the
  Microsoft compiler on a 32-bit Windows system, the 16-byte aligned vector
  types are not properly aligned at 16 bytes.

o cudaThreadExit() may not be called implicitly on host thread exit. Until
  this issue is resolved, developers are recommended to call
  cudaThreadExit() explicitly.

o For maximum performance when using multiple byte sizes to access the same
  data, coalesce adjacent loads and stores when possible, rather than using
  a union or individual byte accesses. Accessing the data via a union may
  result in the compiler reserving extra memory for the object, and
  accessing the data as individual bytes may result in non-coalesced
  accesses. This will be improved in a future compiler release.

o OpenGL interoperability
    - OpenGL cannot access a buffer that is currently *mapped*. If the
      buffer is registered but not mapped, OpenGL can perform any requested
      operations on the buffer.
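The mapped/unmapped rule can be sketched with the runtime GL interop API
(error checking omitted; my_kernel and its launch configuration are
placeholders):

```c
#include <cuda_gl_interop.h>

/* bufObj is an existing OpenGL buffer object. OpenGL may freely use the
   buffer whenever it is registered but NOT mapped; between map and unmap
   it belongs exclusively to CUDA. */
void process_gl_buffer(GLuint bufObj)
{
    float *devPtr;

    cudaGLRegisterBufferObject(bufObj);        /* OpenGL access still OK   */

    cudaGLMapBufferObject((void **)&devPtr, bufObj);
    my_kernel<<<128, 256>>>(devPtr);           /* OpenGL must not touch the
                                                  buffer while it is mapped */
    cudaGLUnmapBufferObject(bufObj);           /* OpenGL access OK again   */

    cudaGLUnregisterBufferObject(bufObj);
}
```

The map and unmap calls must be made while the same context is current as
when the buffer was registered.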
    - Deleting a buffer while it is mapped for CUDA results in undefined
      behavior.
    - Attempting to map or unmap a buffer while a context other than the
      one current during the buffer register operation is bound will
      generally result in a program error and should thus be avoided.
    - Interoperability will use a software path on SLI.
    - Interoperability will use a software path if monitors are attached to
      multiple GPUs and a single desktop spans more than one GPU (i.e.,
      Windows XP dualview).

o OpenCL program binary formats may change in this or future releases.
  Users should create programs from source and should not rely on the
  compatibility of generated binaries between different versions of the
  driver.

o CUBLAS issue
    - SGEMM performance on Fermi-based GPUs is 30% lower than expected.
      This will be fixed in release 3.1.

o CUFFT issue
    - The stability of the large-prime FFT transform (signals whose length
      is prime and greater than 64K samples) is extremely variable, giving
      single-precision accuracy in the range 0.005 to 0.025. In general,
      smaller signals achieve greater accuracy.

o CUDA-GDB issue
    - Please see the "Known Issues" section in the CUDA_GDB_v3.0.pdf User
      Manual.

--------------------------------------------------------------------------------
Open64 Sources
--------------------------------------------------------------------------------

The Open64 source files are controlled under the terms of the GPL license.
Current and previously released versions are available via anonymous FTP at
download.nvidia.com in the CUDAOpen64 directory.
--------------------------------------------------------------------------------
Revision History
--------------------------------------------------------------------------------

02/2010 - Version 3.0
10/2009 - Version 3.0 Beta
07/2009 - Version 2.3
06/2009 - Version 2.3 Beta
05/2009 - Version 2.2
03/2009 - Version 2.2 Beta
11/2008 - Version 2.1 Beta
06/2008 - Version 2.0
11/2007 - Version 1.1
06/2007 - Version 1.0
06/2007 - Version 0.9
02/2007 - Version 0.8 - Initial public Beta

--------------------------------------------------------------------------------
More Information
--------------------------------------------------------------------------------

For more information and help with CUDA, please visit
http://www.nvidia.com/cuda