-------------------------------------------------------------------------------- -------------------------------------------------------------------------------- NVIDIA CUDA MacOS X Release Notes Version 3.1 -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Migrating to 3.1 -------------------------------------------------------------------------------- o Prior to the 3.1 release, nvcc treated __device__ functions as implicitly static. This behavior has changed with the 3.1 release. As a result, the host linker will give a link error regarding multiple defined symbols, if 1) two identical __device__ functions are defined in two different compilations units. For example, when including function definitions through the #include <> mechanism. 2) a __device__ function and an identical host function are defined in two different compilations units. For both cases, declaring the __device__ function as static will make the compilation succeed. -------------------------------------------------------------------------------- New Features -------------------------------------------------------------------------------- Hardware Support o See http://www.nvidia.com/object/cuda_learn_products.html New Toolkit Features ------------------------- o 64b support for the CUDA Runtime API Libraries - libcudart.dylib, libcublas.dylib, libcufft.dylib and libtlshook.dylib have all had 64b components added to this release. o Device emulation has been removed. o cublasSpsv, cublasDpsv, cublasCpsv, cublasZpsv, and cublasSbsv, cublasDbsv, cublasCbsv, cublasZbsv have been enhanced to remove all previous size limitations on the input vector o Added the ability to call printf() from kernels. This feature is supported only on the Fermi architecture. o Added support for recursion in device functions. This feature is supported only on the Fermi architecture. Note that we default to a stack size limit of 1K per thread, so can run out of stack if recurse too deeply. Can use cuCtxSetLimit() to change the default stack size. o Added support for function pointers. This feature is supported only on the Fermi architecture. Function pointers can only be used inside a single kernel; they cannot be passed to another kernel. o Specific GPUs can be made invisible with the CUDA_VISIBLE_DEVICES environment variable. Visible devices should be included as a comma-separated list in terms of the system-wide list of devices. For example, to use only devices 0 and 2 from the system-wide list of devices, set CUDA_VISIBLE_DEVICES equal to "0,2" before launching the application. The application will then enumerate these devices as device 0 and device 1. New API Features ------------------------- o In CUFFT-3.1, R2C and C2R transforms for power-of-2 sizes now experience a similar speedup to their C2C equivalent. However, CUFFT's internal data layout is different to that used by FFTW; by default CUFFT will match FFTW's data format, but at some performance penalty. To enable faster transforms, the user must use cufftSetCompatibilityMode() API to disable FFTW-compatible behavior and enable faster native mode. o CUBLAS now supports CUDA Stream via the cublasSetKernelStream API o Unformatted surface load/store (i.e. the ability to write to textures). This feature is supported only on the Fermi architecture. o New functions cuCtxSetLimit() and cuCtxGetLimit() have been added to control GPU thread stack size and the size of the printf() FIFO queue. o Device-to-device transfers in a non-NULL stream with asynchronous cudaMemcpy calls may overlap with kernels. Runtime documentation has been updated to reflect this. o New device attributes report the PCI bus and device identifiers of a particular GPU for better integration with system management tools. New Performance Improvements ---------------------------- o Double-precision and C2R/R2C performance of CUFFT has been improved significantly for many transform sizes since the CUFFT 3.0 release. o Double precision divide and reciprocal on the Fermi architecture have been optimized. o The performance of the single-precision erfinv() and ervinvf() functions have improved significantly. o The performance of selected transcendental functions from the log, pow, erf, and gamma families. -------------------------------------------------------------------------------- Bug Fixes -------------------------------------------------------------------------------- o Permissions for the following files/directories should now be correctly assigned: - /Library/LaunchAgents/com.nvidia.CUDASoftwareUpdate.plist - /Library/PreferencePanes/CUDA Preferences.prefPane o The CUBLAS SGEMM, CGEMM and small matrix DGEMM performance regressions that were in v3.0 have been restored in v3.1. -------------------------------------------------------------------------------- Known Issues -------------------------------------------------------------------------------- o GPU enumeration order on multi-GPU systems is non-deterministic and may change with this or future releases. Users should make sure to enumerate all CUDA-capable GPUs in the system and select the most appropriate one(s) to use. o OpenGL interop will always use a software path leading to reduced performance when compared to interop on other platforms. o CUDA kernels which do not terminate or run without interruption for several tens of seconds may trigger the GPU to reset causing a disruption of any attached displays. This may cause display image to become corrupted, which will disappear upon a reboot. o If a GPU is used without a display attached it may not exit a reduced power state, causing CUDA programs to perfom poorly when run on that GPU. Cycling the system's power saving state or rebooting should reset the GPU. In general it is best to use a GPU with a display attached. o The kernel driver may leak wired (i.e. unpageable memory) if CUDA applications terminate in unexpected ways. Continued leaks will lead to severely degraded system performance and requires a reboot to fix. o When compiling GCC, special care must be taken for structs that contain 64-bit integers. This is because GCC aligns long longs to a 4 byte boundary by default, while NVCC aligns long longs to an 8 byte boundary by default. Thus, when using GCC to compile a file that has a struct/union, users must give the -malign-double option to GCC. When using NVCC, this option is automatically passed to GCC. o It is a known issue that cudaThreadExit() may not be called implicitly on host thread exit. Due to this, developers are recommended to explicitly call cudaThreadExit() while the issue is being resolved. o On systems with multiple GPUs installed or systems with multiple monitors connected to a single GPU, OpenGL interoperability always copies shared buffers through host memory. o Current hardware limits the number of asynchronous memcopies that can be overlapped with kernel execution. Overlap is also limited to kernels executing for less than 1 second. These limitations are expected to improve on future hardware. o The following APIs exhibit high CPU utilization if they wait for the hardware for a significant amount of time. As a workaround, apps may use cu(da)StreamQuery and/or cu(da)EventQuery to check whether the GPU is busy and yield the thread as desired. - cuCtxSynchronize - cuEventSynchronize - cuStreamSynchronize - cudaThreadSynchronize - cudaEventSynchronize - cudaStreamSynchronize o OpenGL interoperability - OpenGL can not access a buffer that is currently *mapped*. If the buffer is registered but not mapped, OpenGL can do any requested operations on the buffer. - Deleting a buffer while it is mapped for CUDA results in undefined behavior. - Attempting to map or unmap while a different context is bound than was current during the buffer register operation will generally result in a program error and should thus be avoided. o When the profiler gathers performance signals on G80-based products, the driver reduces the clock rate on the device. If the CUDA app crashes or otherwise exits uncleanly, the clocks will not be reset to their previous values. The system must be rebooted to restore the original clock rate. o The MacBook Pro currently presents both GPUs as available for use in Performance mode. This is incorrect behavior, as only one GPU is available at a time. CUDA applications that try to run on the second GPU (device ID 1) will potentially hang. This hang may be terminated by pressing ctrl-C or closing the offending application. o The shared libraries for the CUDA Driver API should not be redistributed from this release (CUDA.framework, /usr/local/cuda/lib/libcuda.dylib). Any CUDA Application shipped on Macintosh requires the end-user install the CUDA driver from the CUDA install package. o CUBLAS issues o CUFFT issues - The stability of the large-prime FFT transform (signals with a length that is prime and >64k samples) is extremely variable, giving single- precision accuracy in the range 0.005->0.025. In general, smaller signals experience greater accuracy. - If the (batch size * transform size * datatype size) exceeds 512MB in a 1D transform, CUFFT kernels fail to launch. - Performance of CUFFT on the GT200 architecture has been reduced by 9.4% for transform size of 128 in single precision only - Performance of single precision 360x360 2D CUFFT has been reduced by 10%. - C2C transforms of length 4 and C2R and R2C transforms of length 8 will produce incorrect results when the batch size is not a multiple of 64 for Tesla and not 128 for Fermi o CUDA GDB issue - Please see the "Known Issues" section in the CUDA_GDB_v3.0.pdf User Manual. o 32/64-bit Device code mixing - While it is currently possible to simultaneously load both 32-bit and 64-bit modules in a single context with the driver API, this feature may be removed in future CUDA releases. -------------------------------------------------------------------------------- Open64 Sources -------------------------------------------------------------------------------- The Open64 source files are controlled under terms of the GPL license. Current and previously released versions are located via anonymous ftp at download.nvidia.com in the CUDAOpen64 directory. -------------------------------------------------------------------------------- Revision History -------------------------------------------------------------------------------- 06/2010 - Version 3.1 04/2010 - Version 3.1 Beta 02/2010 - Version 3.0 10/2009 - Version 3.0 Beta 07/2009 - Version 2.3 06/2009 - Version 2.3 Beta 05/2009 - Version 2.2 03/2009 - Version 2.2 Beta 03/2009 - Version 2.1 Beta 07/2008 - Version 2.0 01/2008 - Version 1.1 - Initial public Beta -------------------------------------------------------------------------------- More Information -------------------------------------------------------------------------------- For more information and help with CUDA, please visit http://www.nvidia.com/cuda -------------------------------------------------------------------------------- Acknowledgements -------------------------------------------------------------------------------- NVIDIA extends thanks to Professor Mike Giles of Oxford University for suggesting possible optimizations to the erfinv() and erfinvf() functions, which inspired the eventual optimizations for these functions in this release.