-------------------------------------------------------------------------------- NVIDIA CUDA Programming Guide Revision History -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Version 1.0 -------------------------------------------------------------------------------- - Added references to Tesla - Section 4.2.2.4 - __device__ only allowed at file scope - Section 4.3.4.1 - Clarified that 3-component texture format are unsupported - Section 4.5 - Fixed typos in code samples - Section 4.5.1.5 - Functions that free memory are synchronous - Section 4.5.2.3 - Mentioned cudaMallocHost() and cudaFreeHost() - Section 4.5.2.7 - Mentioned that all code must be compiled either in device emulation mode or in device execution mode - Section 4.5.3.7 - Clarified code sample - Section 5.1.1.1 - Warned about [u]mul24 being slower that 32-bit integer multiply on future devices - Mentioned that double type gets demoted to float type on devices that do not support double precision - Section B.1 - Modified to clarify that error bounds only apply to the single-precision version of each function - Section B.2 - 8 most significant bits are ignored for [u]mul24 - Added some description of __saturate() - Section C - Fixed typos in function prototypes - Section D - Fixed typos in function prototypes - Section D.1.2 - clockRate in kHz - Section D.2.1 - cudaThreadSynchronize() returns an error if one of the preceding tasks failed - Section D.3.4 - Removed one-dimensional array if height is zero - New Sections D.3.6 and D.3.7 about cudaMallocHost() and cudaFreeHost() - Sections D.3.18 and D.3.19 - Added missing optional copy direction - Section D.3.21 - cudaGetSymbolSize also allows symbol to be in constant memory - New Section D.4.1.6 about cudaGetTextureAlignmentOffset() - New Section D.4.2.1 about cudaCreateChannelDesc() - Section E - Fixed typos in function prototypes - Section E.2.6 - clockRate in kHz - Section E.3.5 - cuCtxSynchronize() returns an error if one of the preceding tasks failed - Section E.4.3 - Documented cuModuleLoadFatBinary() - Section E.5.7 - Clarified that texunit must be CU_PARAM_TR_DEFAULT - New Section E.6.1 about cuMemGetInfo() - New Section E.6.12 about cuMemset2D() - New Appendix F on texture fetching -------------------------------------------------------------------------------- Version 0.9 -------------------------------------------------------------------------------- - Chapter 5 moved to Appendix A - New Section 4.4.6 and new Appendix B on atomic functions - Former Chapter 6 is now Chapter 5 - Former Appendices A, B and C are now appendices C, D and E, respectively - Updated Section 4.5.3 and Appendix E to reflect the new driver API naming conventions: - cudaMemAlloc2D renamed to cudaMemAllocPitch - cuMemAlloc2D renamed to cuMemAllocPitch - cuMemAllocSystem renamed to cuMemAllocHost - cuMemFreeSystem renamed to cuMemFreeHost - cuMemcpyStoD renamed to cuMemcpyHtoD - cuMemcpyDtoS renamed to cuMemcpyDtoH - cuMemcpyStoA renamed to cuMemcpyHtoA - cuMemcpyAtoS renamed to cuMemcpyAtoH - CU_MEMORYTYPE_SYSTEM renamed to CU_MEMORYTYPE_HOST - NumPackedComponents renamed to NumChannels (in CUDA_ARRAY_DESCRIPTOR) - New CUcontext handle in the driver API (cuCtxCreate(), cuCtxAttach(), and cuCtxDetach() modified accordingly) - New argument to cuInit() - Section 3.2 - Atomic writes to same address are serialized - New Section 3.3 about compute capability - New Section 3.4 about multiple devices - New Section 3.5 about mode switches - Section 4.2.1.4 - Kernel invocations are now asynchronous - Section 4.2.2.3 - Clarified shared memory semantics - Section 4.3.4 - Restructured for more clarity - Section 4.4.3 - New type conversion functions - Section 4.4.5 - Texture fetch functions have changed names - Section 4.5.1.1 - Clarified that several host threads can execute device code on the same device - Clarified that host threads cannot share CUDA resources - Section 4.5.1.4 - D3DCREATE_HARDWARE_VERTEXPROCESSING required - Mapping of more than one vertex buffers simultaneously is now supported - New Section 4.5.1.5 about API asynchronicity - Section 4.5.2 - Fixed typos in sample codes - Section 4.5.2.4 - Clarified that texture format mush match texture reference declaration - Section 4.5.3.7 - Clarified that texture format mush match texture reference declaration - Section 5.1.2.1 - Added the common case of accessing an array of structures - Section 5.1.2.5 and Section 5.2 - A multiple of 64 for the number of threads per block is better for optimal register usage - Appendix A - Removed the table since these characteristics can now be queried with the runtime - Added limits on block dimensions - Added maximum width for 1D CUDA arrays - Added maximum kernel size - Appendix B: - Updated error bounds for some math functions - Additional math functions - Section D.1.2 - Additional properties in cudaDeviceProp - New Section D.2 about thread management - New cudaThreadSynchronize() function - New cudaThreadExit() function - Section D.3.4 - 1D CUDA arrays are created by specifying height=0 - Section D.4.1.4 - New definition for cudaBindTexture() - Section D.4.2.1 - New definition for cudaBindTexture() - cudaBindTexture() for CUDA arrays renamed cudaBindTextureToArray() - Section E.2.6 - New cuDeviceGetProperties() function - Section E.3 - cuCtxGetDevice() is now documented - New cuCtxSynchronize() function - Section E.5.7 - Fixed name and description of function (cuParamSetTexRef()) - Section E.6.7 - 1D CUDA arrays are created by specifying height=0 - Section E.7.5 - New definition for cuTexRefSetAddress() -------------------------------------------------------------------------------- Version 0.8.2 -------------------------------------------------------------------------------- - Section 3.2 - Clarified the sentence about writes to the same location in memory - Section 4.3.4: - Specified that reading and writing from same texture produces undefined results - Section 5.1: - Relaxed restriction on multi-device support - Section 5.2: - More details on treatment of denormalized numbers - Section C.7.7 and C.7.8: - cuTexRefSetAddressMode() and cuTexRefSetFilterMode() have no effect for texture references bound to linear memory -------------------------------------------------------------------------------- Version 0.8.1 -------------------------------------------------------------------------------- - Fixed typos and formatting throughout the document - Mention Quadro FX 5600/4600 - Refer to cudaArray as "CUDA array" to avoid confusion with regular C arrays - Clocks now refer to processor's clock - Section 3.2 - Clarified how a block is split into warps - Clafified register allocation - Moved number of registers per thread to Section 6.2 - Section 4.2.1.4: - Specified that device functions are inlined - Section 4.2.2.3: - Rewrote code sample - Section 4.2.2.4: - Updated pointer support - Section 4.2.3: - Clarified that dynamic shared memory is per block - Section 4.3.2: - Clarified when C runtime implementation is used - Section 4.3.4: - Moved part of Section 4.5.1.3 to this section - Clarified that cudaReadModeNormalizedFloat is only supported for 16-bit and 8-bit integer format - Clarified that addressing modes are not supported for textures in linear memory - Section 4.4.4: - Clarified usage of texfetch() - Section 4.5.1.1: - Moved execution on multiple devices to this section - Section 4.5.1.3: - Removed: API considerations went to Section 4.3.4 and performance considerations went to Section 6.1.2.3 - Section 4.5.2: - Moved function descriptions to reference appendix B - Added more code samples - Section 4.5.3: - Moved function descriptions to reference appendix C - Added more code samples - Section 4.5.3.5: - Added description of page-locked host memory allocation - Section 5.1: - Added maximum texture dimensions - Added number of registers - Added various limits related to concurrent threads - Removed bus bandwidth since there is a test application in SDK 0.8.1 - Specified that multi-device is for same-type GPUs only - Based explanation on processor clock - Section 6.1.1.2: - Mentioned that threads reconverge after a control flow instruction - Section 6.1.2.1: - Coalescing constraints explained in terms of half-warps with recommendation to fulfill them for the whole warp - Section 6.1.2.2: - Read cost expressed in terms of half-warps with ecommendation to fulfill condition for the whole warp - Section 6.1.2.3: - Moved performance consideration of Section 4.5.1.3 to this section - Section 6.1.2.4: - Moved some figures around - Section 6.2: - Moved number of registers per thread from Section 3.2 to this section - Added definition of multiprocessor occupancy - Section 7.1: - Mentioned that the example has not been written for performance -------------------------------------------------------------------------------- Version 0.8 -------------------------------------------------------------------------------- - Section 3.2 - Specified that write order is undefined for threads writing to same location - Chapter 4: - Documented texture support - Documented OpenGL and Direct3D interoperability - Documented driver API - Section 4.2.2 - Removed __local__ - Section 4.2.2.3 - Added an example of managing layout of dynamic shared memory - Section 4.3.3 - Fixed function name - More detailed description - Section 4.4.5 - Removed __trap() - Section 4.5.2.7 - Fixed code sequence related to _controlfp - Section 5.1 - Specified size of constant and 1D-texture working sets - Section 5.2 - Precision on NaN propagation - Section 6.1.1.1 - Fixed throughput of sin, cos, and exponentiation - Specified througput of 32-bit integer multiplication - Mentioned __fdividef - Section 6.1.1.2 - Reworded the bit on branch predication - Section 6.1.2.1 - More details on alignment requirement - Mentioned __align() - Reworded memory coalescing - Mentioned the relevant 2D allocation routines - Section 6.1.2.5 - Mentioned number of threads require to ignore register RAW dependencies - Appendix A: - Updated error bounds for some math functions - Additional math functions -------------------------------------------------------------------------------- Alpha2 Version of November 21, 2006 -------------------------------------------------------------------------------- - Chapter 1: - Mostly re-written with additional diagrams - Chapter 2: - Moved memory model from Chapter 3 to Chapter 2 - Section 3.2: - Rephrased after moving memory model to Chapter 2 - Section 4.2.2: - Clarified the __device__ type qualifier - Section 4.4.2: - Removed the memory mapping functions since they have been removed from Beta - Section 4.5: - Mentioned the type casting functions - Section 4.5.2: - Elaborated more on when using __syncthreads - Section 4.6.2: - Mentioned the __DEVICE_EMULATION__ preprocessor macro defined in device emulation mode - Added denormalized numbers as an important potential difference between device emulation and device execution modes - Chapter 5: - More technical specifications - Section 6.1.1.1: - Mentioned performance of bitwise operations, integer multiplication, modulo, and divide - Mentioned __mul24 - Advised on the handling of single vs double precision functions and constants - Section 6.1.1.2: - More details on when the compiler chooses to predicate - Section 6.1.2.1: - Clarified memory coalescing for local memory - More details on alignment constrainst for global memory coalescing - Section 6.1.2.4: - More graphics to illustrate bank conflicts - Fixed wrong statement about bank conflicts when accessing structure members - Chapter 7: - Now using 2D arrays for As and Bs - Fixed wrong statement about bank conflict - Appendix A: - Updated error bounds for some math functions - Additional math functions - Recommendation of using rintf instead of roundf -------------------------------------------------------------------------------- Alpha2 Version of September 28, 2006 -------------------------------------------------------------------------------- First version