--------------------------------------------------------------------------------
NVIDIA CUDA Programming Guide
Revision History
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Version 1.0
--------------------------------------------------------------------------------

- Added references to Tesla
- Section 4.2.2.4
    - __device__ only allowed at file scope
- Section 4.3.4.1
    - Clarified that 3-component texture format are unsupported
- Section 4.5
    - Fixed typos in code samples
- Section 4.5.1.5
    - Functions that free memory are synchronous
- Section 4.5.2.3
    - Mentioned cudaMallocHost() and cudaFreeHost()
- Section 4.5.2.7
    - Mentioned that all code must be compiled either in device emulation mode or
      in device execution mode
- Section 4.5.3.7
    - Clarified code sample
- Section 5.1.1.1
    - Warned about [u]mul24 being slower that 32-bit integer multiply on future
      devices
    - Mentioned that double type gets demoted to float type on devices that do
      not support double precision
- Section B.1
    - Modified to clarify that error bounds only apply to the single-precision
      version of each function
- Section B.2
    - 8 most significant bits are ignored for [u]mul24
    - Added some description of __saturate()
- Section C
    - Fixed typos in function prototypes
- Section D
    - Fixed typos in function prototypes
- Section D.1.2
    - clockRate in kHz
- Section D.2.1
    - cudaThreadSynchronize() returns an error if one of the preceding tasks failed
- Section D.3.4
    - Removed one-dimensional array if height is zero
- New Sections D.3.6 and D.3.7 about cudaMallocHost() and cudaFreeHost()
- Sections D.3.18 and D.3.19
    - Added missing optional copy direction
- Section D.3.21
    - cudaGetSymbolSize also allows symbol to be in constant memory
- New Section D.4.1.6 about cudaGetTextureAlignmentOffset()
- New Section D.4.2.1 about cudaCreateChannelDesc()
- Section E
    - Fixed typos in function prototypes
- Section E.2.6
    - clockRate in kHz
- Section E.3.5
    - cuCtxSynchronize() returns an error if one of the preceding tasks failed
- Section E.4.3
    - Documented cuModuleLoadFatBinary()
- Section E.5.7
    - Clarified that texunit must be CU_PARAM_TR_DEFAULT
- New Section E.6.1 about cuMemGetInfo()
- New Section E.6.12 about cuMemset2D()
- New Appendix F on texture fetching

--------------------------------------------------------------------------------
Version 0.9
--------------------------------------------------------------------------------

- Chapter 5 moved to Appendix A
- New Section 4.4.6 and new Appendix B on atomic functions
- Former Chapter 6 is now Chapter 5
- Former Appendices A, B and C are now appendices C, D and E, respectively
- Updated Section 4.5.3 and Appendix E to reflect the new driver API naming
  conventions:
    - cudaMemAlloc2D renamed to cudaMemAllocPitch
    - cuMemAlloc2D renamed to cuMemAllocPitch
    - cuMemAllocSystem renamed to cuMemAllocHost
    - cuMemFreeSystem renamed to cuMemFreeHost
    - cuMemcpyStoD renamed to cuMemcpyHtoD
    - cuMemcpyDtoS renamed to cuMemcpyDtoH
    - cuMemcpyStoA renamed to cuMemcpyHtoA
    - cuMemcpyAtoS renamed to cuMemcpyAtoH
    - CU_MEMORYTYPE_SYSTEM renamed to CU_MEMORYTYPE_HOST
    - NumPackedComponents renamed to NumChannels (in CUDA_ARRAY_DESCRIPTOR)
- New CUcontext handle in the driver API (cuCtxCreate(), cuCtxAttach(), and
  cuCtxDetach() modified accordingly)
- New argument to cuInit()
- Section 3.2
    - Atomic writes to same address are serialized
- New Section 3.3 about compute capability
- New Section 3.4 about multiple devices
- New Section 3.5 about mode switches
- Section 4.2.1.4
    - Kernel invocations are now asynchronous
- Section 4.2.2.3
    - Clarified shared memory semantics
- Section 4.3.4
    - Restructured for more clarity
- Section 4.4.3
    - New type conversion functions
- Section 4.4.5
    - Texture fetch functions have changed names
- Section 4.5.1.1
    - Clarified that several host threads can execute device code on the same
      device
    - Clarified that host threads cannot share CUDA resources
- Section 4.5.1.4
    - D3DCREATE_HARDWARE_VERTEXPROCESSING required
    - Mapping of more than one vertex buffers simultaneously is now supported
- New Section 4.5.1.5 about API asynchronicity
- Section 4.5.2
    - Fixed typos in sample codes
- Section 4.5.2.4
    - Clarified that texture format mush match texture reference declaration
- Section 4.5.3.7
    - Clarified that texture format mush match texture reference declaration
- Section 5.1.2.1
    - Added the common case of accessing an array of structures
- Section 5.1.2.5 and Section 5.2
    - A multiple of 64 for the number of threads per block is better for optimal
      register usage
- Appendix A
    - Removed the table since these characteristics can now be queried with the
      runtime
    - Added limits on block dimensions
    - Added maximum width for 1D CUDA arrays
    - Added maximum kernel size
- Appendix B:
    - Updated error bounds for some math functions
    - Additional math functions
- Section D.1.2
    - Additional properties in cudaDeviceProp
- New Section D.2 about thread management
    - New cudaThreadSynchronize() function
    - New cudaThreadExit() function
- Section D.3.4
    - 1D CUDA arrays are created by specifying height=0
- Section D.4.1.4
    - New definition for cudaBindTexture()
- Section D.4.2.1
    - New definition for cudaBindTexture()
    - cudaBindTexture() for CUDA arrays renamed cudaBindTextureToArray()
- Section E.2.6
    - New cuDeviceGetProperties() function
- Section E.3
    - cuCtxGetDevice() is now documented
    - New cuCtxSynchronize() function
- Section E.5.7
    - Fixed name and description of function (cuParamSetTexRef())
- Section E.6.7
    - 1D CUDA arrays are created by specifying height=0
- Section E.7.5
    - New definition for cuTexRefSetAddress()

--------------------------------------------------------------------------------
Version 0.8.2
--------------------------------------------------------------------------------

- Section 3.2
    - Clarified the sentence about writes to the same location in memory
- Section 4.3.4:
    - Specified that reading and writing from same texture produces undefined
      results
- Section 5.1:
    - Relaxed restriction on multi-device support
- Section 5.2:
    - More details on treatment of denormalized numbers
- Section C.7.7 and C.7.8:
    - cuTexRefSetAddressMode() and cuTexRefSetFilterMode() have no effect for
      texture references bound to linear memory

--------------------------------------------------------------------------------
Version 0.8.1
--------------------------------------------------------------------------------

- Fixed typos and formatting throughout the document
- Mention Quadro FX 5600/4600
- Refer to cudaArray as "CUDA array" to avoid confusion with regular C arrays
- Clocks now refer to processor's clock
- Section 3.2
    - Clarified how a block is split into warps
    - Clafified register allocation
    - Moved number of registers per thread to Section 6.2
- Section 4.2.1.4:
    - Specified that device functions are inlined
- Section 4.2.2.3:
    - Rewrote code sample
- Section 4.2.2.4:
    - Updated pointer support
- Section 4.2.3:
    - Clarified that dynamic shared memory is per block
- Section 4.3.2:
    - Clarified when C runtime implementation is used
- Section 4.3.4:
    - Moved part of Section 4.5.1.3 to this section
    - Clarified that cudaReadModeNormalizedFloat is only supported for 16-bit and
      8-bit integer format
    - Clarified that addressing modes are not supported for textures in linear
      memory
- Section 4.4.4:
    - Clarified usage of texfetch()
- Section 4.5.1.1:
    - Moved execution on multiple devices to this section
- Section 4.5.1.3:
    - Removed: API considerations went to Section 4.3.4 and performance
      considerations went to Section 6.1.2.3
- Section 4.5.2:
    - Moved function descriptions to reference appendix B
    - Added more code samples
- Section 4.5.3:
    - Moved function descriptions to reference appendix C
    - Added more code samples
- Section 4.5.3.5:
    - Added description of page-locked host memory allocation
- Section 5.1:
    - Added maximum texture dimensions
    - Added number of registers
    - Added various limits related to concurrent threads
    - Removed bus bandwidth since there is a test application in SDK 0.8.1
    - Specified that multi-device is for same-type GPUs only
    - Based explanation on processor clock
- Section 6.1.1.2:
    - Mentioned that threads reconverge after a control flow instruction
- Section 6.1.2.1:
    - Coalescing constraints explained in terms of half-warps with 
      recommendation to fulfill them for the whole warp
- Section 6.1.2.2:
    - Read cost expressed in terms of half-warps with ecommendation to
      fulfill condition for the whole warp
- Section 6.1.2.3:
    - Moved performance consideration of Section 4.5.1.3 to this section
- Section 6.1.2.4:
    - Moved some figures around
- Section 6.2:
    - Moved number of registers per thread from Section 3.2 to this section
    - Added definition of multiprocessor occupancy
- Section 7.1:
    - Mentioned that the example has not been written for performance

--------------------------------------------------------------------------------
Version 0.8
--------------------------------------------------------------------------------

- Section 3.2
    - Specified that write order is undefined for threads writing to same location
- Chapter 4:
    - Documented texture support
    - Documented OpenGL and Direct3D interoperability
    - Documented driver API
- Section 4.2.2
    - Removed __local__
- Section 4.2.2.3
    - Added an example of managing layout of dynamic shared memory
- Section 4.3.3
    - Fixed function name
    - More detailed description
- Section 4.4.5
    - Removed __trap()
- Section 4.5.2.7
    - Fixed code sequence related to _controlfp
- Section 5.1
    - Specified size of constant and 1D-texture working sets
- Section 5.2
    - Precision on NaN propagation
- Section 6.1.1.1
    - Fixed throughput of sin, cos, and exponentiation
    - Specified througput of 32-bit integer multiplication
    - Mentioned __fdividef
- Section 6.1.1.2
    - Reworded the bit on branch predication
- Section 6.1.2.1
    - More details on alignment requirement
    - Mentioned __align()
    - Reworded memory coalescing
    - Mentioned the relevant 2D allocation routines
- Section 6.1.2.5
    - Mentioned number of threads require to ignore register RAW dependencies
- Appendix A:
    - Updated error bounds for some math functions
    - Additional math functions

--------------------------------------------------------------------------------
Alpha2 Version of November 21, 2006
--------------------------------------------------------------------------------

- Chapter 1:
    - Mostly re-written with additional diagrams
- Chapter 2:
    - Moved memory model from Chapter 3 to Chapter 2
- Section 3.2:
    - Rephrased after moving memory model to Chapter 2
- Section 4.2.2:
    - Clarified the __device__ type qualifier
- Section 4.4.2:
    - Removed the memory mapping functions since they have been removed from Beta
- Section 4.5:
    - Mentioned the type casting functions
- Section 4.5.2:
    - Elaborated more on when using __syncthreads
- Section 4.6.2:
    - Mentioned the __DEVICE_EMULATION__ preprocessor macro defined in device emulation mode
    - Added denormalized numbers as an important potential difference between device emulation and device execution modes
- Chapter 5:
    - More technical specifications
- Section 6.1.1.1:
    - Mentioned performance of bitwise operations, integer multiplication, modulo, and divide
    - Mentioned __mul24
    - Advised on the handling of single vs double precision functions and constants
- Section 6.1.1.2:
    - More details on when the compiler chooses to predicate
- Section 6.1.2.1:
    - Clarified memory coalescing for local memory
    - More details on alignment constrainst for global memory coalescing
- Section 6.1.2.4:
    - More graphics to illustrate bank conflicts
    - Fixed wrong statement about bank conflicts when accessing structure members
- Chapter 7:
    - Now using 2D arrays for As and Bs
    - Fixed wrong statement about bank conflict
- Appendix A:
    - Updated error bounds for some math functions
    - Additional math functions
    - Recommendation of using rintf instead of roundf

--------------------------------------------------------------------------------
Alpha2 Version of September 28, 2006
--------------------------------------------------------------------------------

First version