=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
NVIDIA CUDA Toolkit v4.2 Release Notes for Windows, Linux, and Mac OS X
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
----------------------------------------
CONTENTS
----------------------------------------
-- Release Highlights
-- Documentation
-- List of Important Files
-- Supported NVIDIA Hardware
-- Supported Operating Systems
---- Windows
---- Linux
---- Mac OS X
-- Installation Notes
-- New Features
-- Resolved Issues
-- Known Issues
-- Source Code for Open64 and CUDA-GDB
-- Revision History
-- More Information 

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Release Highlights
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
* Added support for GK10x Kepler GPUs. 
 
* This release contains the following:
- NVIDIA CUDA Toolkit documentation
- NVIDIA OpenCL documentation
- NVIDIA CUDA compiler (nvcc) and supporting tools
- NVIDIA CUDA runtime libraries
- NVIDIA CUDA-GDB debugger
- NVIDIA CUDA-MEMCHECK
- NVIDIA Visual Profiler
- NVIDIA CUBLAS, CUFFT, CUSPARSE, CURAND, Thrust, and NPP libraries

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Documentation
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
For a list of documents supplied with this release, please refer to the 
/doc directory of your CUDA Toolkit installation.

NOTE: The NVML development package is not shipped with CUDA 4.2. For 
changes related to nvidia-smi and NVML, please refer to the nvidia-smi man 
page and to the "Tesla Deployment Kit" package on the developer site at 
http://developer.nvidia.com/tesla-deployment-kit. The kit includes the NVML 
documentation and the SDK.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
List of Important Files
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  bin/
    nvcc                       CUDA C/C++ compiler
    cuda-gdb                   CUDA Debugger
    cuda-memcheck              CUDA Memory Checker
    nvvp                       NVIDIA Visual Profiler    
                               (On Windows, nvvp is located in libnvvp/)

  include/
    cuda.h                     CUDA driver API header
    cudaGL.h                   CUDA OpenGL interop header for driver API
    cudaVDPAU.h                CUDA VDPAU interop header for driver API 
                               (Linux only)
    cuda_gl_interop.h          CUDA OpenGL interop header for toolkit API 
                               (Linux only)
    cuda_vdpau_interop.h       CUDA VDPAU interop header for toolkit API 
                               (Linux only)
    cudaD3D9.h                 CUDA DirectX 9 interop header (Windows only)
    cudaD3D10.h                CUDA DirectX 10 interop header (Windows only)
    cudaD3D11.h                CUDA DirectX 11 interop header (Windows only)
    cufft.h                    CUFFT API header
    cublas_v2.h                CUBLAS API header 
    cublas.h                   CUBLAS Legacy API header 
    cusparse_v2.h              CUSPARSE API header 
    cusparse.h                 CUSPARSE Legacy API header
    curand.h                   CURAND API header
    curand_kernel.h            CURAND device API header
    thrust/*                   Thrust Headers
    npp.h                      NPP API Header
    nvcuvid.h                  CUDA Video Decoder header (Windows and Linux)
    cuviddec.h                 CUDA Video Decoder header (Windows and Linux)
    NVEncodeDataTypes.h        CUDA Video Encoder (C-library or DirectShow) 
                               (Windows only)
    NVEncodeAPI.h              CUDA Video Encoder (C-library) (Windows only)
    INvTranscodeFilterGUIDs.h  CUDA Video Encoder (DirectShow) (Windows only)
    INVVESetting.h             CUDA Video Encoder (DirectShow) (Windows only)

  extras/
    CUPTI                      CUDA Profiling APIs
    Debugger                   CUDA Debugger APIs

----------------------------------------
Windows lib files
----------------------------------------
  lib/
    cuda.lib                   CUDA driver library
    cudart.lib                 CUDA runtime library
    cublas.lib                 CUDA BLAS library
    cufft.lib                  CUDA FFT library
    cusparse.lib               CUDA Sparse Matrix library
    curand.lib                 CUDA Random Number Generation library
    npp.lib                    NVIDIA Performance Primitives library
    nvcuvenc.lib               CUDA Video Encoder library
    nvcuvid.lib                CUDA Video Decoder library

----------------------------------------
Linux lib files
----------------------------------------
  lib/
    libcuda.so                 CUDA driver library
    libcudart.so               CUDA runtime library
    libcublas.so               CUDA BLAS library
    libcufft.so                CUDA FFT library
    libcusparse.so             CUDA Sparse Matrix library
    libcurand.so               CUDA Random Number Generation library
    libnpp.so                  NVIDIA Performance Primitives library

----------------------------------------
Mac OS X lib files
----------------------------------------
  lib/
    libcuda.dylib              CUDA driver library
    libcudart.dylib            CUDA runtime library
    libcublas.dylib            CUDA BLAS library
    libcufft.dylib             CUDA FFT library
    libcusparse.dylib          CUDA Sparse Matrix library
    libcurand.dylib            CUDA Random Number Generation library
    libnpp.dylib               NVIDIA Performance Primitives library
    libtlshook.dylib           NVIDIA internal library

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Supported NVIDIA Hardware
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
See http://www.nvidia.com/object/cuda_gpus.html.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Supported Operating Systems for Windows, Linux, and Mac OS X
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
----------------------------------------
Windows
----------------------------------------
* Supported Operating Systems (32-bit and 64-bit)

  OS
  --
  Windows Server 2008  
  Windows XP 
  Windows Vista
  Windows 7

* Supported Compilers

  Platform   Compiler         IDE       
  --------   --------         ---       
  Windows    MSVC8(14.00)     VS 2005   
  Windows    MSVC9(15.00)     VS 2008   
  Windows    MSVC2010(16.00)  VS 2010   

----------------------------------------
Linux 
----------------------------------------
The CUDA development environment relies on tight integration with the host 
development environment, including the host compiler and C runtime libraries, 
and is therefore only supported on distro versions that have been qualified 
for this CUDA Toolkit release. 

* Supported Distros

  Distro            32 64  Kernel             GCC         GLIBC         
  ------            -- --  ------             ---         -----         
  Fedora14          X  X   2.6.35.6-45        4.5.1       2.12.90  
  ICC Compiler 11.1 X  X
  OpenSUSE-11.2     X  X   2.6.31.5-0.1       4.4.1       2.10.1      
  RHEL-5.>=5        X  X   2.6.18-238.el5     4.1.2       2.5 
  (5.5, 5.6, 5.7)
  RHEL-6.X             X   2.6.32-            4.4.5       2.12 
  (6.0, 6.1)               131.0.15.el6
  SLES 11.1         X  X   2.6.32.12-0.7-pae  4.3-62.198  2.11.1-0.17.4
  Ubuntu-10.04      X  X   2.6.35-23-generic  4.4.5       2.12.1 
  Ubuntu-11.04      X  X   2.6.38-8-generic   4.5.2       2.13 

* Distros No Longer Supported

  Distro            32 64  Kernel             GCC         GLIBC         
  ------            -- --  ------             ---         -----         
  Fedora13          X  X   2.6.33.3-85        4.4.4       2.12            
  RHEL-4.8          X      2.6.9-89.ELsmpl    3.4.6       2.3.4
  Ubuntu-10.10      X  X   2.6.35-23-generic  4.4.5       2.12.1        

NOTE: 32-bit versions of RHEL 4.8 and RHEL 6.0 have not been tested with
this release and are therefore not supported in this CUDA Toolkit release.  

----------------------------------------
Mac OS X
----------------------------------------
  Platform          32 64  GCC
  --------          -- --  ---
  Mac OS X 10.7     X  X   4.2.1 (build 5646)
  Mac OS X 10.6     X  X   4.2.1 (build 5646)

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Installation Notes
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
----------------------------------------
Windows
----------------------------------------
Silent Installation:
- To install, use msiexec.exe from the shell and pass these arguments:
  msiexec.exe /i cudatoolkit.msi /qn
- To uninstall, use /x instead of /i. 

----------------------------------------
Linux
----------------------------------------
* In order to run CUDA applications, the CUDA module must be loaded and the 
entries in /dev created. This may be achieved by initializing X Windows, or 
by creating a script to load the kernel module and create the entries. An 
example script (to be run at boot time):
  #!/bin/bash

  # Load the NVIDIA kernel module.
  /sbin/modprobe nvidia

  if [ "$?" -eq 0 ]; then
    # Count the number of NVIDIA controllers found.
    N3D=$(/sbin/lspci | grep -i NVIDIA | grep -c "3D controller")
    NVGA=$(/sbin/lspci | grep -i NVIDIA | grep -c "VGA compatible controller")

    # Create one device node per controller (/dev/nvidia0, /dev/nvidia1, ...).
    N=$((N3D + NVGA - 1))
    for i in $(seq 0 $N); do
      mknod -m 666 /dev/nvidia$i c 195 $i
    done

    # Create the control device node.
    mknod -m 666 /dev/nvidiactl c 195 255
  else
    exit 1
  fi

* On some Linux releases, a GRUB bug in the handling of upper memory, combined 
with a default vmalloc area that is too small on 32-bit systems, may make it 
necessary to pass the following settings to the bootloader:
  vmalloc=256MB, uppermem=524288
  
Example of grub conf:
  title Red Hat Desktop (2.6.9-42.ELsmp)
  root (hd0,0)
  uppermem 524288
  kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet vmalloc=256MB pci=nommconf
  initrd /initrd-2.6.9-42.ELsmp.img

* Pinned memory in CUDA is only supported on Linux kernel versions >= 2.6.18. 
Host-side memory allocations pinned for CUDA with the cudaHostRegister() API 
can be passed to third-party drivers. Pinned memory allocations returned by 
cudaHostAlloc() and cudaMallocHost() can also be passed to third-party 
drivers. Starting with CUDA 4.1, CUDA_NIC_INTEROP is no longer needed for 
these APIs, so this flag is now deprecated.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
New Features
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Support for GK10x Kepler GPUs. 

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Resolved Issues
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
* In the routines cusparse<T>csr2hyb and cusparse<T>dense2hyb, upon the 
occurrence of an error (typically a device memory allocation problem), the 
handle to the hybrid format descriptor (cusparseHybMat_t) was wrongly 
destroyed using cusparseDestroyHybMat. A subsequent call to 
cusparseDestroyHybMat by the user would then result in an error. This issue 
has been fixed in the 4.2 toolkit and now the user can and should call 
cusparseDestroyHybMat to clean up, either after an error or when the matrix 
is no longer needed.

* CUDA-MEMCHECK now explicitly reports calls to assert() inside a CUDA kernel.

* The version of Thrust included with the CUDA toolkit has been upgraded from 
1.5.1 to 1.5.2.

* The NPP Rotate primitives incorrectly required the source image's pitch 
(nSrcStep) to be large enough to accommodate the destination ROI's size. This 
bug has been fixed and the restriction no longer applies.

* Starting with CUDA Toolkit 4.0, cublasDestroy did not properly free all of 
the GPU resources, leading to a GPU memory leak of about 256 KB per CUBLAS 
handle. This could also lead to GPU memory fragmentation when the unreleased 
resources were scattered over the GPU memory. This issue has been resolved 
in the 4.2 Toolkit.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Known Issues
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
----------------------------------------
Windows
----------------------------------------
* In the NPP library, the nppiGraphcut_32s8u() and nppiGraphcut8_32s8u() 
primitives may fail with an error while running on a GPU that supports the 
sm1.0 architecture, especially on systems with a 64-bit operating system.

* Individual kernels are limited to a 2-second runtime by Windows Vista. 
Kernels that run for longer than 2 seconds will trigger the Timeout 
Detection and Recovery (TDR) mechanism. For more information, see 
http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx.

* The maximum size of a single memory allocation created by cudaMalloc or 
cuMemAlloc on WDDM devices is limited to:
    MIN( (System Memory Size in MB - 512 MB) / 2, PAGING_BUFFER_SEGMENT_SIZE )
For Vista, PAGING_BUFFER_SEGMENT_SIZE is approximately 2GB.
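
As a worked example of the formula above, the following shell sketch evaluates 
the limit for two system memory sizes. The function name max_alloc_mb is 
hypothetical, and the 2048 MB paging buffer segment size is only an assumed 
stand-in for the "approximately 2 GB" figure; the actual value on a given 
system may differ.

```shell
#!/bin/bash
# Sketch: MIN((system memory in MB - 512 MB) / 2, PAGING_BUFFER_SEGMENT_SIZE)

# max_alloc_mb SYSTEM_MB PAGING_MB   (hypothetical helper, not a CUDA tool)
# Prints the limit in MB. Integer arithmetic, so odd values round down.
max_alloc_mb() {
  local system_mb=$1 paging_mb=$2
  local half=$(( (system_mb - 512) / 2 ))
  if [ "$half" -lt "$paging_mb" ]; then
    echo "$half"
  else
    echo "$paging_mb"
  fi
}

# 4096 MB of system memory: (4096 - 512) / 2 = 1792, below the assumed 2048.
max_alloc_mb 4096 2048   # prints 1792

# 8192 MB of system memory: (8192 - 512) / 2 = 3840, capped at the assumed 2048.
max_alloc_mb 8192 2048   # prints 2048
```

Under these assumptions, the system memory term is the binding limit on 
smaller systems, and the paging buffer segment size becomes the limit once 
system memory is large enough.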

* (Windows and Linux) Individual GPU program launches are limited to a run 
time of less than 5 seconds on a GPU with a display attached. Exceeding this 
time limit usually causes a launch failure reported through the CUDA driver 
or the CUDA runtime. GPUs without a display attached are not subject to the 
5-second run time restriction. For this reason, it is recommended that CUDA 
be run on a GPU that is NOT attached to a display and does not have the 
Windows desktop extended onto it. In this case, the system must contain at 
least one NVIDIA GPU that serves as the primary graphics adapter.

----------------------------------------
Linux & Mac
----------------------------------------
* In the NPP library, the nppiGraphcut_32s8u() and nppiGraphcut8_32s8u() 
primitives may fail with an error while running on a GPU that supports the 
sm1.0 architecture, especially on systems with a 64-bit operating system.

* The Linux kernel provides a mode in which it allows user processes to 
overcommit system memory. (Refer to the kernel documentation for 
/proc/sys/vm/ for details.) If this mode is enabled (the default on many 
distros), the kernel may have to kill processes in order to free up pages for 
allocation requests. The CUDA driver process, especially for CUDA 
applications that allocate large amounts of zero-copy memory with 
cuMemHostAlloc or cudaMallocHost, is particularly vulnerable to being killed 
in this way. Because the CUDA software stack has no way to report an 
out-of-memory (OOM) error to the user before the process disappears, users, 
especially on 32-bit Linux, are encouraged to disable memory overcommit in 
their kernel to avoid this problem. Please refer to the documentation on 
vm.overcommit_memory and vm.overcommit_ratio for more information.
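
A minimal shell sketch for checking the current overcommit mode is shown 
below. The function names are hypothetical. The meanings given for the values 
0, 1, and 2 follow the kernel documentation for vm.overcommit_memory. The 
sysctl command in the trailing comment requires root, and whether strict 
accounting suits a given system is a judgment call for its administrator.

```shell
#!/bin/bash
# Sketch: report the kernel's memory overcommit mode (Linux only).

# Map a vm.overcommit_memory value to a short description.
describe_overcommit() {
  case "$1" in
    0) echo "heuristic overcommit" ;;
    1) echo "always overcommit" ;;
    2) echo "strict accounting, overcommit disabled" ;;
    *) echo "unknown" ;;
  esac
}

# Read the current mode; print "unknown" if the file is not readable
# (for example, on a non-Linux system).
current_overcommit() {
  local f=/proc/sys/vm/overcommit_memory
  if [ -r "$f" ]; then
    describe_overcommit "$(cat "$f")"
  else
    describe_overcommit ""
  fi
}

current_overcommit

# To disable overcommit until the next reboot (requires root):
#   sysctl -w vm.overcommit_memory=2
# In mode 2, vm.overcommit_ratio sets the percentage of physical RAM that
# counts toward the commit limit (in addition to swap).
```

With overcommit disabled, an allocation that cannot be backed fails at the 
time of the request, rather than the process being killed later.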

* When compiling with GCC, special care must be taken for structs that contain 
64-bit integers. This is because GCC aligns long longs to a 4-byte boundary 
by default, while NVCC aligns long longs to an 8-byte boundary by default. 
Thus, when using GCC to compile a file that defines such a struct or union, 
users must pass the -malign-double option to GCC. When using NVCC, this 
option is automatically passed to GCC.

----------------------------------------
Mac
----------------------------------------
To save power, some Apple products automatically power down the CUDA-capable 
GPU in the system. If the operating system has powered down the CUDA-capable 
GPU, CUDA fails to run and the system returns an error that no device was 
found. To ensure that your CUDA-capable GPU is not powered down by the 
operating system, do the following:
1. Go to "System Preferences".
2. Open the "Energy Saver" section.
3. Uncheck the "Automatic graphics switching" check box in the upper left.

----------------------------------------
Visual Profiler & Command Line Profiler
----------------------------------------
* Visual Profiler fails to generate events or counter information. There are 
several reasons why Visual Profiler may fail to gather counter information:
a. More than one tool is trying to access the GPU. To fix this issue, make 
   sure only one tool is using the GPU at any given time. Tools include the 
   CUDA command line profiler, Parallel Nsight Analysis Tools and Graphics 
   Tools, and applications that use either the CUPTI or PerfKit API (NVPM) 
   to read counter values.
b. More than one application is using the GPU while Visual Profiler is 
   profiling a CUDA application. To fix this issue, close all other 
   applications and run only the one being profiled with Visual Profiler. 
   Avoid interacting with the active desktop while the application is 
   generating counter information. Note that Visual Profiler gathers 
   counters for only one context if the application uses multiple contexts.

* Enabling "{gld|gst} instructions {8|16|32|64|128}bit" counters can cause 
GPU kernels to run longer than the driver's watchdog timeout limit. In these 
cases the driver terminates the GPU kernel, which results in an application 
error, and profiling data will not be available. Disable the driver watchdog 
timeout before profiling such long-running CUDA kernels.
- On Linux, setting the X Config option 'Interactive' to false is recommended.
- For Windows, detailed information on disabling the Windows TDR is available 
  at: http://msdn.microsoft.com/en-us/windows/hardware/gg487368.aspx#E2.

* On Windows Vista/Win7, profiling an application that makes more than 32K 
CUDA kernel launch, memory copy, or memory set API calls without a 
synchronization call can result in an application hang. To work around this 
issue, add synchronization calls such as cudaDeviceSynchronize() or 
cudaStreamSynchronize().

* Enabling counters on GPUs with compute capability (SM type) 1.x can result 
in occasional hangs. Please disable counters on such runs.

* The "warp serialize" counter for GPUs with compute capability 1.x is known 
to report incorrectly high values in some cases.

* Prof triggers are not supported on GPUs with compute capability (SM type) 1.0.

* Profiler data gets flushed to a file only at synchronization calls like 
cudaDeviceSynchronize() and cudaStreamSynchronize() or when the profiler 
buffer gets full. If an app terminates without these sync calls then  
profiler data may be lost. 

* Counters gld_incoherent and gst_incoherent always return zero on GPUs 
with compute capability (SM type) 1.3. A value of zero doesn't mean that 
all load/stores are 100% coalesced.

* Use Visual Profiler version 4.1 or later with driver version 285 (or 
later). Due to compatibility issues with profile counters, Visual Profiler 
4.0 (or earlier) must not be used with driver version 285 (or later).

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Source Code for Open64 and CUDA-GDB
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
* The Open64 and CUDA-GDB source files are controlled under terms of the 
GPL license. Current and previously released versions are located at: 
ftp://download.nvidia.com/CUDAOpen64

* Linux users:
- Please refer to the "Release Notes" and "Known Issues" sections in the 
  CUDA-GDB User Manual (cuda-gdb.pdf).
- Please refer to cuda-memcheck.pdf for notes on supported error detection 
  and known issues.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Revision History
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
-- 04/2012 - Version 4.2
-- 01/2012 - Version 4.1 Production
-- 11/2011 - Version 4.1 RC2
-- 10/2011 - Version 4.1 RC1
-- 09/2011 - Version 4.1 EA (Information in ReadMe.txt)
-- 05/2011 - Version 4.0
-- 04/2011 - Version 4.0 RC2 [Errata]
-- 02/2011 - Version 4.0 RC
-- 11/2010 - Version 3.2
-- 10/2010 - Version 3.2 RC2
-- 09/2010 - Version 3.2 RC

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
More Information
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
* For more information and help with CUDA, please visit: 
http://www.nvidia.com/cuda

* Please refer to the LLVM Release License text in EULA.txt for details on 
LLVM licensing.