--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
NVIDIA CUDA 
Linux Release Notes
Version 1.1
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

On some Linux releases, due to a GRUB bug in the handling of upper 
memory and a default vmalloc too small on 32-bit systems, it may be 
necessary to pass this information to the bootloader: 

vmalloc=256MB, uppermem=524288

Example of grub conf:

title Red Hat Desktop (2.6.9-42.ELsmp)
root (hd0,0)
uppermem 524288
kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet vmalloc=256MB
pci=nommconf
initrd /initrd-2.6.9-42.ELsmp.img

--------------------------------------------------------------------------------
New Features
--------------------------------------------------------------------------------

  Platform Support
  o  New distributions supported
     - Fedora 7
     - Red Hat Enterprise Linux 3.9
     - Red Hat Enterprise Linux 4.5
     - SUSE Linux Enterprise Desktop - Service Pack 1
     - Ubuntu-7.04

  Hardware Support
  o  Additional hardware support:
     - Tesla C870
     - Tesla D870
     - Tesla S870

     - Quadro FX 1700
     - Quadro FX 570
     - Quadro FX 370
     - Quadro NVS 130M
     - Quadro NVS 135M
     - Quadro NVS 140M
     - Quadro NVS 290
     - Quadro NVS 320M
     - Quadro FX 1600M
     - Quadro FX 570M
     - Quadro FX 360M
     - Quadro Plex 1000 Model IV
     - Quadro Plex 1000 Model S4

     - GeForce 8800 GT
     - GeForce 8400 GS
     - GeForce 8800M GTX
     - GeForce 8800M GTS
     - GeForce 8700M GT
     - GeForce 8600M GT
     - GeForce 8600M GS
     - GeForce 8400M GT
     - GeForce 8400M GS
     - GeForce 8400M G

  Compiler and Toolchain
  o  __global__ functions are now always mangled.  When using the driver API
     developers must declare these with 'extern "C"'.
  o  device emulation mode no longer requires libcuda.so.
  o  C++ language support
     - host code only, alpha quality
     - developers must "opt in" with the --host-compilation=C++ option
  o  Added device functions:
     - __clz
     - __clzll
     - __ffs
     - __ffsll
     - __sad
     - __float2ll_rz
     - __float2ll_rn
     - __float2ull_rz
     - __ll2float_rn
     - __ull2float_rn
     - __usad
  o  Removed the float-valued atomicCAS device function

  Mathematical Functions
  o  Added math functions:
     - llrintf()
     - llroundf()
  o  Improved accuracy of the functions:
     - cosf
     - powf
     - sinf
     - sincosf
     - tanf

  Asynchronous Memcpy Support
  o  Streams and Asynchronous memcpy functions
     - CPU/GPU concurrency and compute/host<->device memcpy concurrency
  o  Events (enable high-precision timing and CPU/GPU synchronization)
  
  Miscellaneous
  o  New cuDeviceGetAttribute function obsoletes cuDeviceGetProperties
     - Also enables applications to query whether the hardware can process 
       and copy data concurrently
  
  CUFFT Library
  o  Performance improvements

  CUBLAS Library
  o  Performance improvements
  o  Added the functions:
     - cublasCrot()
     - cublasCrotg()
     - cublasCsrot()

--------------------------------------------------------------------------------
Major Bug Fixes
--------------------------------------------------------------------------------

  o  Fixed OpenGL interoperability on multi-GPU hardware configurations

--------------------------------------------------------------------------------
Known Issues
--------------------------------------------------------------------------------

o Individual GPU program launches are limited to a run time 
  of less than 5 seconds on a GPU with a display attached.
  Exceeding this time limit causes a launch failure reported
  through the CUDA driver or the CUDA runtime.  GPUs without
  a display attached are not subject to the 5 second run time
  restriction.  For this reason it is recommeded that CUDA is
  run on a GPU that is NOT attached to an X display.

o In order to run CUDA applications, the CUDA module must be 
  loaded and the entries in /dev created.  This may be achieved 
  by initializing X Windows, or by creating a script to load the 
  kernel module and create the entries.

  An example script (to be run at boot time):

  #!/bin/bash

  modprobe nvidia

  if [ "$?" -eq 0 ]; then

  # Count the number of NVIDIA controllers found.
  N3D=`/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
  NVGA=`/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`

  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
  mknod -m 666 /dev/nvidia$i c 195 $i;
  done

  mknod -m 666 /dev/nvidiactl c 195 255

  else
  exit 1
  fi

o When compiling GCC, special care must be taken for structs that
  contain 64-bit integers.  This is because GCC aligns long longs
  to a 4 byte boundary by default, while NVCC aligns long longs 
  to an 8 byte boundary by default.  Thus, when using GCC to
  compile a file that has a struct/union, users must give the
  -malign-double
  option to GCC.  When using NVCC, this option is automatically
  passed to GCC.

o On systems with multiple GPUs installed or systems with multiple 
  monitors connected to a single GPU, OpenGL interoperability 
  always copies shared buffers through host memory.

o Current hardware limits the number of asynchronous memcopies that can 
  be overlapped with kernel execution. Overlap is also limited to kernels 
  executing for less than 1 second. These limitations are expected to 
  improve on future hardware. 

o The following APIs exhibit high CPU utilization if they wait for the
  hardware for a significant amount of time.  As a workaround, apps
  may use cu(da)StreamQuery and/or cu(da)EventQuery to check whether
  the GPU is busy and yield the thread as desired.
  - cuCtxSynchronize
  - cuEventSynchronize
  - cuStreamSynchronize
  - cudaThreadSynchronize
  - cudaEventSynchronize
  - cudaStreamSynchronize

o When the profiler gathers performance signals on G80-based products,
  the driver reduces the clock rate on the device.  If the CUDA app
  crashes or otherwise exits uncleanly, the clocks will not be reset 
  to their previous values. The system must be rebooted to restore the
  original clock rate.

-------------------------------------------------------------------------------- 
Open64 Sources 
-------------------------------------------------------------------------------- 

The Open64 source files are controlled under terms of the GPL license. 
Current and previously released versions are located via anonymous ftp at 
download.nvidia.com in the CUDAOpen64 directory. 

--------------------------------------------------------------------------------
Revision History
--------------------------------------------------------------------------------

  11/2007 - Version 1.1
  06/2007 - Version 1.0
  06/2007 - Version 0.9 
  02/2007 - Version 0.8 - Initial public Beta 

--------------------------------------------------------------------------------
More Information
--------------------------------------------------------------------------------

  For more information and help with CUDA, please visit
  http://www.nvidia.com/cuda