-------------------------------------------------------------------------------- -------------------------------------------------------------------------------- NVIDIA CUDA Toolkit v3.2 RC2 Release Notes for MacOS X -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Manifest -------------------------------------------------------------------------------- This release contains: o NVIDIA CUDA toolkit documentation o NVIDIA OpenCL documentation o NVIDIA CUDA compiler (nvcc) and supporting tools o NVIDIA CUDA runtime libraries o NVIDIA CUBLAS, CUFFT, CUSPARSE and CURAND libraries -------------------------------------------------------------------------------- Documentation (located in the doc directory) -------------------------------------------------------------------------------- o CUDA_Toolkit_Release_Notes_Mac.txt - This document. o CUBLAS_Library.pdf - User manual for the CUDA accelerated BLAS implementation. o CUFFT_Library.pdf - User manual for the CUDA accelerated FFT library. o CURAND_Library.pdf - User manual for the CUDA accelerated Random Number Generation library. o CUSPARSE_Library.pdf - User manual for the CUDA accelerated Sparse Matrix library. o Compute_Profiler.txt - Guide for using the profiler. o EULA.txt - The end user license agreement. o nvcc.pdf - Documentation for the CUDA command line compiler. o OpenCL_Extensions/*.txt - Documentation for the NVIDIA OpenCL extensions. o OpenCL_Implementations_Notes.txt - Notes describing the implementation defined behavior for the NVIDIA OpenCL implementation. o ptx_isa_2.2.pdf - User manual for the Parallel Thread Execution ISA. -------------------------------------------------------------------------------- Important Files -------------------------------------------------------------------------------- bin/nvcc Command line compiler include/ cuda.h CUDA driver API header cudaGL.h CUDA OpenGL interop header for driver API cufft.h CUFFT API header cublas.h CUBLAS API header cusparse.h CUSPARSE API header curand.h CURAND API header curand_kernel.h CURAND device API header lib/ libcuda.dylib CUDA driver library libcudart.dylib CUDA runtime library libcublas.dylib CUDA BLAS library libcufft.dylib CUDA FFT library libcusparse.dylib CUDA Sparse Matrix library libcurand.dylib CUDA Random Number Generation library -------------------------------------------------------------------------------- Supported NVIDIA Hardware -------------------------------------------------------------------------------- o See http://www.nvidia.com/object/cuda_gpus.html -------------------------------------------------------------------------------- Supported Software Platforms -------------------------------------------------------------------------------- o 32-bit Operating Systems - Mac OS X 10.5.6 - Mac OS X 10.5.7 - Mac OS X 10.5.8 - Mac OS X 10.6.0 - Mac OS X 10.6.1 - Mac OS X 10.6.2 - Mac OS X 10.6.3 - Mac OS X 10.6.4 o 64-bit Operating Systems - Mac OS X 10.6.2 - Mac OS X 10.6.3 - Mac OS X 10.6.4 -------------------------------------------------------------------------------- Upgrading from CUDA Toolkit 3.1 -------------------------------------------------------------------------------- o Prior to the 3.1 release, nvcc treated __device__ functions as implicitly static. This behavior has changed with the 3.1 release. As a result, the host linker will give a link error regarding multiple defined symbols, if 1) two identical __device__ functions are defined in two different compilations units. This includes including function definitions through the #include <> mechanism. 2) a __device__ function and an identical host function are defined in two different compilations units. For both cases, declaring the __device__ function as static will make the compilation succeed. -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- New Features -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- New CUDA Libraries o CUSPARSE, supporting sparse matrix computations. o CURAND, supporting random number generation for both host and device code with Sobel quasi-random and XORWOW pseudo random routines. o H.264 encode/decode libraries that were previously available in the GPU Computing SDK are now part of the CUDA Toolkit. CUDA Driver and CUDA C Runtime o Support for new 6GB Quadro and Tesla products o Cross-stream synchronization Development Tools o Support for debugging GPUs with more than 4GB device memory Miscellaneous o Support for malloc() and free() in device code o o NVIDIA System Management Interface (nvidia-smi) support for reporting % GPU busy, several GPU performance counters -------------------------------------------------------------------------------- Notes on New Features in the CUDA Driver and CUDA C Runtime -------------------------------------------------------------------------------- o Sample FORTRAN wrappers for CUSPARSE routines (similar to CUBLAS) included in the src/ directory. o Added CU_TARGET_COMPUTE_21 to JIT options. o Added cuCtxGetVersion; this allows an application to determine which version of the CUDA API was used to create a CUDA context. Library developers can use cuCtxGetVersion to support different versions of CUDA within the same package. o Enhanced printf; kernels that called printf and were built using CUDA Toolkit 3.1 must be recompiled with the CUDA Toolkit 3.2 tool chain. o CUDA Toolkit 3.2 allows the user set up the printf FIFO size any time until the first launch of a kernel that uses printf. In the previous version, the print FIFO size had to be set before the first module that called printf was loaded. Since CUDA Runtime automatically loads all modules upon the first CUDA API call, this meant that even if cudaThreadSetLimit(cudaLimitPrintfFifoSize, n) was the first cuda API call made in an application, the user's modules would be loaded by the runtime first and cudaThreadSetLimit() would return failure. o cuParamSetTexRef() has been deprecated in CUDA Toolkit 3.2 since this API entry provided no functionality. Accordingly, CUDA C Programming Guide has been updated to remove cuParamSetTexRef from the example. cuParamSetTexRef() has been marked deprecated in cuda.h; this is documented as deprecaded in the CUDA Toolkit Reference Manual (html/chm/pdf). o Improved setting of L1/smem configuration in the CUDA driver. There are 4 new API calls; 2 for driver, 2 for runtime. These calls are documented in the CUDA Toolkit Reference Manual (chm/html/pdf) and the relevant header files: cuCtxGetCacheConfig()/cuCtxSetCacheConfig() and cudaThreadGetCacheConfig()/cudaThreadSetCacheConfig(). o Added cross stream synchronization; cuStreamWaitEvent adds the ability to perform GPU-side synchronization on a CUDA event within a CUDA stream. Cross-stream synchronization and dependency management is resolved on the GPU without any CPU involvement while the work is executing. -------------------------------------------------------------------------------- Notes on New Features in the CUDA Libraries -------------------------------------------------------------------------------- o CUFFT has implemented the Bluestein (or chirp) algorithm for general transform sizes that cannot be factored into supported radices. The "Bluestein" FFT algorithm accelerates transforms for sizes that cannot be factored into 2^a * 3^b * 5^c * 7^d, where a,b,c,d are non-negative integers. This algorithm provides signficant speedup and in many cases improves the accuracy of the final result. Note: The Bluestein algorithm is only supported for 1-D batched transforms; acceleration for 2-D and 3-D transforms will be addressed in future releases. o CUFFT supports batched transforms > 512 MB. The previous version of CUFFT failed when (batch size * transform size * datatype size) for a 1D single-precision transform exceeded 512MB. This has been fixed so that now the total size can be as large as the device memory capacity allows. The exact size varies depending on whether operating in-place or out-of-place and depending on how much internal intermediate memory is required by API, which can vary depending on the actual size of the transform. Note however, that while the total size can be much larger, the size of each individual transform is still limited to 128 M elements (1 GB for single-precision.) o Single precision batched transforms on datasets larger than 512MB report an error for sizes (2^a * 3^b * 5^c * 7^d) where at least one of b, c, and d are non-zero. To work around this issue, the user may split the work into multiple batched calls such that the data processed by each call is <= 512MB. o CUDA Toolkit 3.2 supports 1 pseudo- and 1 quasi-random number generator. A new library called CURAND has been added for generating pseudo- and quasi-random numbers. See CURAND_Library.pdf for full documentation. The CUDA SDK contains several examples that illustrate the use of the new library APIs. In particular, the MonteCarloCURAND, EstimatePiInlineP, EstimatePiInlineQ, EstimatePiP, EstimatePiQ, and SingleAsianOptionP examples can be found in the MonteCarloCURAND directory. o Added a new library, CUSPARSE, for operating on sparse matrices and vectors. See CUSPARSE_Library.pdf for full documentation. The conjugateGradient example in the CUDA SDK illustrates the usage of the new APIs in the CUSPARSE library. -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Known Issues -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- o GPU enumeration order on multi-GPU systems is non-deterministic and may change with this or future releases. Users should make sure to enumerate all CUDA-capable GPUs in the system and select the most appropriate one(s) to use. o When compiling with GCC, special care must be taken for structs that contain 64-bit integers. This is because GCC aligns long longs to a 4 byte boundary by default, while NVCC aligns long longs to an 8 byte boundary by default. Thus, when using GCC to compile a file that has a struct/union, users must give the -malign-double option to GCC. When using NVCC, this option is automatically passed to GCC. o It is a known issue that cudaThreadExit() may not be called implicitly on host thread exit. Due to this, developers are recommended to explicitly call cudaThreadExit() while the issue is being resolved. o OpenGL interop will always use a software path leading to reduced performance when compared to interop on other platforms. o CUDA kernels which do not terminate or run without interruption for several tens of seconds may trigger the GPU to reset causing a disruption of any attached displays. This may cause display image to become corrupted, which will disappear upon a reboot. o The kernel driver may leak wired (i.e. unpageable memory) if CUDA applications terminate in unexpected ways. Continued leaks will lead to severely degraded system performance and requires a reboot to fix. o On systems with multiple GPUs installed or systems with multiple monitors connected to a single GPU, OpenGL interoperability always copies shared buffers through host memory. o Current hardware limits the number of asynchronous memcopies that can be overlapped with kernel execution. Overlap is also limited to kernels executing for less than 1 second. These limitations are expected to improve on future hardware. o The following APIs exhibit high CPU utilization if they wait for the hardware for a significant amount of time. As a workaround, apps may use cu(da)StreamQuery and/or cu(da)EventQuery to check whether the GPU is busy and yield the thread as desired. - cuCtxSynchronize - cuEventSynchronize - cuStreamSynchronize - cudaThreadSynchronize - cudaEventSynchronize - cudaStreamSynchronize o When the profiler gathers performance signals on G80-based products, the driver reduces the clock rate on the device. If the CUDA app crashes or otherwise exits uncleanly, the clocks will not be reset to their previous values. The system must be rebooted to restore the original clock rate. o The MacBook Pro currently presents both GPUs as available for use in Performance mode. This is incorrect behavior, as only one GPU is available at a time. CUDA applications that try to run on the second GPU (device ID 1) will potentially hang. This hang may be terminated by pressing ctrl-C or closing the offending application. o For an OpenCL C program, the maximum alignment of a function scope local variable and a function parameter variable is limited to 16-bytes. o Some applications may not be able to call cuMemAllocHost/cudaMallocHost() due to existing virtual address ranges that they would like to bus master to/from. Registering system memory for DMA will be addressed in future releases. -------------------------------------------------------------------------------- Documentation Errata -------------------------------------------------------------------------------- o Visual Profiler User Guide.pdf/Overview/Getting Started/Installation and Setup/Windows Under the section entitled, "Running the Compute Visual Profiler", note the additional "3.2" in the path. ------- To run the Compute Visual Profiler, go to: Start->All Programs->NVIDIA Corporation->CUDA Toolkit->v3.2->Compute Visual Profiler ------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Performance Improvements -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- CUFFT Related -------------------------------------------------------------------------------- o Added Radix-7 CUFFT support for GROMACS in CUDA Toolkit 3.2. Added optimizations for transform sizes that contain prime factors of 3, 5 and 7. Transform sizes that can be expressed as (2^i * 3^j * 5^k * 7^l), where i,j,k,l are integer > 0, execute much faster than transform sizes that contain other prime factors. Previous releases of CUFFT were optimized only for 2^i. -------------------------------------------------------------------------------- CUBLAS GEMM Related -------------------------------------------------------------------------------- o Increased performance for GEMM kernels for non block multiple input sizes achieved through MAGMA licensed code. See Acknowlegements section towards the end of this release notes document. -------------------------------------------------------------------------------- CUBLAS Related -------------------------------------------------------------------------------- o Improved the performance of many Level 1 BLAS functions in the CUDA CUBLAS library. Note that functions that implement a reduction such as *dot, *min, and *max are not improved. -------------------------------------------------------------------------------- Other -------------------------------------------------------------------------------- o The performance of round-to-nearest double precision reciprocals in device code has been improved by more than 50% for both Tesla and Fermi-class architectures. -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Bug Fixes -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- o Fixed: CUDA_*_PATH environment variables get resolved properly when used within a Windows command prompt. -------------------------------------------------------------------------------- CUDA Driver Related -------------------------------------------------------------------------------- o Always use -m32 or -m64 command line argument to NVCC when building 32-bit or 64-bit device code respectively. o In the previous versions, 2D Texture size 65536 failed for compute capability equal to 2.0. The limit has been fixed to be in line with hardware capability. -------------------------------------------------------------------------------- CUDA Library Related -------------------------------------------------------------------------------- o On Tesla-architecture GPUs, the cublasSgemm, cublasDgemm, and cublasZgemm routines would fail with an "Unspecified Launch Error" in some cases when the 'k' parameter was not a multiple of 16. This is now fixed. Note: The cublasCgemm routine was not affected by this bug. o If Beta=0 for SSYRK and DSYRK, the 3.2 Release Candidate could produce incorrect results on Fermi in some cases. This has been fixed in the current 3.2 RC2 release. o Fixed: Performance of the CHERK function was reported as being slower in the CUDA 3.2 Toolkit Release Candidate compared to the 3.1 release. o Previous versions of CUFFT would possibly fail if the input or output data pointers were not aligned to a multiple of 256 bytes. Now, CUFFT requires 8 byte alignment for single-precision input data and 16 bytes for double-precision input data. o CUDA CUFFT: In the previous version, the CUFFT library would sometimes generate incorrect results on systems with multiple GPUS where the GPUs were of different types e.g a system with a GTX275 and a GTX480. This issue has been fixed; CUFFT detects the "active" device and correctly configures execution accordingly. o In some cases, the CUFFT library would cause the entire process to terminate when an internal or input error was encountered. This has been fixed so that the CUFFT APIs correctly catch the errors and return gracefully with an appropriate error code. o The previous version of CUFFT incorrectly created and destroyed planners such that a newly created planner could have a handle that was not unique from another existing active handle in certain situations. This has been fixed and now planners can be created and destroyed safely in any order. -------------------------------------------------------------------------------- CUDA Runtime Related -------------------------------------------------------------------------------- o In the previous version 1D Texture Maximum Size allowed did not match documentation. Fixed maximum width for a 1D texture reference bound to linear memory to be 2^27. -------------------------------------------------------------------------------- Source code for Open64 -------------------------------------------------------------------------------- The Open64 source files are controlled under terms of the GPL license. Current and previously released versions are located via anonymous ftp at download.nvidia.com in the CUDAOpen64 directory. -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Revision History -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- 10/2010 - Version 3.2 RC2 09/2010 - Version 3.2 RC 06/2010 - Version 3.1 04/2010 - Version 3.1 Beta 02/2010 - Version 3.0 10/2009 - Version 3.0 Beta 07/2009 - Version 2.3 06/2009 - Version 2.3 Beta 05/2009 - Version 2.2 03/2009 - Version 2.2 Beta 03/2009 - Version 2.1 Beta 07/2008 - Version 2.0 01/2008 - Version 1.1 - Initial public Beta -------------------------------------------------------------------------------- More Information -------------------------------------------------------------------------------- For more information and help with CUDA, please visit: http://www.nvidia.com/cuda -------------------------------------------------------------------------------- Acknowledgements -------------------------------------------------------------------------------- NVIDIA extends thanks to the University of Tennessee, and especially Professor Jack Dongarra and Stanimire Tomov, for their contributions to the matrix multiplication functions in the CUBLAS library. These changes are incorporated under the following copyright and with the following conditions: "Copyright (c) 2010 The University of Tennessee. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: - Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. - Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer listed in this license in the documentation and/or other materials provided with the distribution. - Neither the name of the copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE" --------------------------------------------------------------------------------