Name

    NVX_shader_thread_shuffle

Name Strings

    GL_NVX_shader_thread_shuffle

Contributors

    Jeannot Breton, NVIDIA
    Pat Brown, NVIDIA
    Eric Werness, NVIDIA
    Mark Kilgard, NVIDIA

Contact

    Jeannot Breton, NVIDIA Corporation (jbreton 'at' nvidia.com)

Status

    Shipping.

Version

    Last Modified Date:         9/4/2013
    NVIDIA Revision:            2

Number

    XXX

Dependencies

    This extension is written against the OpenGL 4.3 (Compatibility Profile)
    Specification.

    This extension is written against version 4.30 (revision 07) of the OpenGL
    Shading Language Specification.

    OpenGL 4.3 and GLSL 4.3 are required.

    This extension interacts with NV_gpu_program5.

Overview

    Implementations of the OpenGL Shading Language may, but are not required
    to, run multiple shader threads for a single stage as a SIMD thread group,
    where individual execution threads are assigned to thread groups in an
    undefined, implementation-dependent order.  This extension provides a set
    of new features to the OpenGL Shading Language to share data between
    multiple threads within a thread group.

    Shaders using the new functionality provided by this extension should
    enable it via the construct

      #extension GL_NVX_shader_thread_shuffle : require     (or enable)

    This extension also specifies some modifications to the program assembly
    language to support the thread data sharing functionality.

New Procedures and Functions

    None

New Tokens

    None

Modifications to The OpenGL Shading Language Specification, Version 4.30
(Revision 07)

    Including the following line in a shader can be used to control the
    language features described in this extension:

      #extension GL_NVX_shader_thread_shuffle : <behavior>

    where <behavior> is as specified in section 3.3.

    New preprocessor #defines are added to the OpenGL Shading Language:

      #define GL_NVX_shader_thread_shuffle           1

    Modify Section 8.3, Common Functions, p.
    133 (add a function to share data between threads in a thread group)

    Syntax:

      int shuffleDownNVX(int data, uint index, uint width, [out bool threadIdValid])
      ivec2 shuffleDownNVX(ivec2 data, uint index, uint width, [out bool threadIdValid])
      ivec3 shuffleDownNVX(ivec3 data, uint index, uint width, [out bool threadIdValid])
      ivec4 shuffleDownNVX(ivec4 data, uint index, uint width, [out bool threadIdValid])
      uint shuffleDownNVX(uint data, uint index, uint width, [out bool threadIdValid])
      uvec2 shuffleDownNVX(uvec2 data, uint index, uint width, [out bool threadIdValid])
      uvec3 shuffleDownNVX(uvec3 data, uint index, uint width, [out bool threadIdValid])
      uvec4 shuffleDownNVX(uvec4 data, uint index, uint width, [out bool threadIdValid])
      float shuffleDownNVX(float data, uint index, uint width, [out bool threadIdValid])
      vec2 shuffleDownNVX(vec2 data, uint index, uint width, [out bool threadIdValid])
      vec3 shuffleDownNVX(vec3 data, uint index, uint width, [out bool threadIdValid])
      vec4 shuffleDownNVX(vec4 data, uint index, uint width, [out bool threadIdValid])
      bool shuffleDownNVX(bool data, uint index, uint width, [out bool threadIdValid])
      bvec2 shuffleDownNVX(bvec2 data, uint index, uint width, [out bool threadIdValid])
      bvec3 shuffleDownNVX(bvec3 data, uint index, uint width, [out bool threadIdValid])
      bvec4 shuffleDownNVX(bvec4 data, uint index, uint width, [out bool threadIdValid])

      int shuffleUpNVX(int data, uint index, uint width, [out bool threadIdValid])
      ivec2 shuffleUpNVX(ivec2 data, uint index, uint width, [out bool threadIdValid])
      ivec3 shuffleUpNVX(ivec3 data, uint index, uint width, [out bool threadIdValid])
      ivec4 shuffleUpNVX(ivec4 data, uint index, uint width, [out bool threadIdValid])
      uint shuffleUpNVX(uint data, uint index, uint width, [out bool threadIdValid])
      uvec2 shuffleUpNVX(uvec2 data, uint index, uint width, [out bool threadIdValid])
      uvec3 shuffleUpNVX(uvec3 data, uint index, uint width, [out bool threadIdValid])
      uvec4 shuffleUpNVX(uvec4 data, uint index, uint
                         width, [out bool threadIdValid])
      float shuffleUpNVX(float data, uint index, uint width, [out bool threadIdValid])
      vec2 shuffleUpNVX(vec2 data, uint index, uint width, [out bool threadIdValid])
      vec3 shuffleUpNVX(vec3 data, uint index, uint width, [out bool threadIdValid])
      vec4 shuffleUpNVX(vec4 data, uint index, uint width, [out bool threadIdValid])
      bool shuffleUpNVX(bool data, uint index, uint width, [out bool threadIdValid])
      bvec2 shuffleUpNVX(bvec2 data, uint index, uint width, [out bool threadIdValid])
      bvec3 shuffleUpNVX(bvec3 data, uint index, uint width, [out bool threadIdValid])
      bvec4 shuffleUpNVX(bvec4 data, uint index, uint width, [out bool threadIdValid])

      int shuffleXorNVX(int data, uint index, uint width, [out bool threadIdValid])
      ivec2 shuffleXorNVX(ivec2 data, uint index, uint width, [out bool threadIdValid])
      ivec3 shuffleXorNVX(ivec3 data, uint index, uint width, [out bool threadIdValid])
      ivec4 shuffleXorNVX(ivec4 data, uint index, uint width, [out bool threadIdValid])
      uint shuffleXorNVX(uint data, uint index, uint width, [out bool threadIdValid])
      uvec2 shuffleXorNVX(uvec2 data, uint index, uint width, [out bool threadIdValid])
      uvec3 shuffleXorNVX(uvec3 data, uint index, uint width, [out bool threadIdValid])
      uvec4 shuffleXorNVX(uvec4 data, uint index, uint width, [out bool threadIdValid])
      float shuffleXorNVX(float data, uint index, uint width, [out bool threadIdValid])
      vec2 shuffleXorNVX(vec2 data, uint index, uint width, [out bool threadIdValid])
      vec3 shuffleXorNVX(vec3 data, uint index, uint width, [out bool threadIdValid])
      vec4 shuffleXorNVX(vec4 data, uint index, uint width, [out bool threadIdValid])
      bool shuffleXorNVX(bool data, uint index, uint width, [out bool threadIdValid])
      bvec2 shuffleXorNVX(bvec2 data, uint index, uint width, [out bool threadIdValid])
      bvec3 shuffleXorNVX(bvec3 data, uint index, uint width, [out bool threadIdValid])
      bvec4 shuffleXorNVX(bvec4 data, uint index, uint width, [out bool threadIdValid])

      int shuffleNVX(int data,
                     uint index, uint width, [out bool threadIdValid])
      ivec2 shuffleNVX(ivec2 data, uint index, uint width, [out bool threadIdValid])
      ivec3 shuffleNVX(ivec3 data, uint index, uint width, [out bool threadIdValid])
      ivec4 shuffleNVX(ivec4 data, uint index, uint width, [out bool threadIdValid])
      uint shuffleNVX(uint data, uint index, uint width, [out bool threadIdValid])
      uvec2 shuffleNVX(uvec2 data, uint index, uint width, [out bool threadIdValid])
      uvec3 shuffleNVX(uvec3 data, uint index, uint width, [out bool threadIdValid])
      uvec4 shuffleNVX(uvec4 data, uint index, uint width, [out bool threadIdValid])
      float shuffleNVX(float data, uint index, uint width, [out bool threadIdValid])
      vec2 shuffleNVX(vec2 data, uint index, uint width, [out bool threadIdValid])
      vec3 shuffleNVX(vec3 data, uint index, uint width, [out bool threadIdValid])
      vec4 shuffleNVX(vec4 data, uint index, uint width, [out bool threadIdValid])
      bool shuffleNVX(bool data, uint index, uint width, [out bool threadIdValid])
      bvec2 shuffleNVX(bvec2 data, uint index, uint width, [out bool threadIdValid])
      bvec3 shuffleNVX(bvec3 data, uint index, uint width, [out bool threadIdValid])
      bvec4 shuffleNVX(bvec4 data, uint index, uint width, [out bool threadIdValid])

    Shuffle functions allow active threads within a thread group to exchange
    data using 4 different modes (up, down, xor, indexed).  They all load the
    <data> operand, which can be different per thread, and return a value read
    from the source thread at an address computed with the <index> and the
    <width> operands.  <index> is a 5-bit value in the range 0 to 31; the most
    significant bits are ignored.  <threadIdValid> is an optional operand.  It
    holds the value of the predicate that specifies whether the source thread
    from which the current thread reads data is in range or not.

    <width> is used for segmenting the thread group into multiple segments.
    The segments need to be subdivided equally, so <width> needs to be a power
    of 2 in the range 2 to 32.  Using a <width> of 32 would divide the thread
    group into a single segment.
    A <width> of 8 would divide the thread group into 4 segments of size 8.
    Using a <width> that is not a power of 2, or that is lower than 2 or
    larger than 32, will return an undefined value.  Threads can only share
    data within their own segment.

    Each thread executing the built-in shuffle function will determine the ID
    of another thread by combining its value of gl_ThreadInWarpNVX with its
    value of <index> as described below.  Such threads will attempt to read
    the value of <data> in the computed other thread and return that value to
    the caller.

    When a shuffle function attempts to access the value of <data> from
    another thread, it determines whether the other thread is in accessible
    range or not.  If it is in range, true will be returned in the optional
    <threadIdValid> parameter, if provided by the caller.  If it's out of
    range, false will be returned in <threadIdValid>, if provided by the
    caller, and the value returned by the function will come from the current
    thread.

    The 4 modes use the following logic to compute the source thread index
    and the <threadIdValid> value:

    shuffleNVX computes the source index using <index> as an absolute address
    within the thread group segment.

      srcThreadId     = <index>
      <threadIdValid> = <index> < <width>

    For example, with this thread group segment:

                      -----------------
      Thread Id       |0|1|2|3|4|5|6|7|
                      -----------------
      Thread <data>   |a|b|c|d|e|f|g|h|
                      -----------------

    If <index> is 2
                      -----------------
      src thread Id   |2|2|2|2|2|2|2|2|
                      -----------------
      <threadIdValid> |1|1|1|1|1|1|1|1|
                      -----------------
      result          |c|c|c|c|c|c|c|c|
                      -----------------

    If <index> is 9
                      -----------------
      src thread Id   |9|9|9|9|9|9|9|9|
                      -----------------
      <threadIdValid> |0|0|0|0|0|0|0|0|
                      -----------------
      result          |a|b|c|d|e|f|g|h|
                      -----------------

    shuffleUpNVX subtracts <index> from the current thread id to get the
    source thread id.  This has the effect of shifting up the segment by
    <index> threads.  Source thread ids do not wrap around, so the lowest
    thread ids will be left unchanged.

      srcThreadId     = currentThreadId - <index>
      <threadIdValid> = srcThreadId >= 0

    For example, with this thread group segment:

                      -----------------
      Thread Id       |0|1|2|3|4|5|6|7|
                      -----------------
      Thread <data>   |a|b|c|d|e|f|g|h|
                      -----------------

    If <index> is 1
                      ------------------
      src thread Id   |-1|0|1|2|3|4|5|6|
                      ------------------
      <threadIdValid> |0 |1|1|1|1|1|1|1|
                      ------------------
      result          |a |a|b|c|d|e|f|g|
                      ------------------

    shuffleDownNVX adds <index> to the current thread id to get the source
    thread id.  This has the effect of shifting down the segment by <index>
    threads.  Source thread ids do not wrap around, so the highest thread ids
    will be left unchanged.

      srcThreadId     = currentThreadId + <index>
      <threadIdValid> = srcThreadId < <width>

    For example, with this thread group segment:

                      -----------------
      Thread Id       |0|1|2|3|4|5|6|7|
                      -----------------
      Thread <data>   |a|b|c|d|e|f|g|h|
                      -----------------

    If <index> is 2
                      -----------------
      src thread Id   |2|3|4|5|6|7|8|9|
                      -----------------
      <threadIdValid> |1|1|1|1|1|1|0|0|
                      -----------------
      result          |c|d|e|f|g|h|g|h|
                      -----------------

    shuffleXorNVX does a bitwise xor between <index> and the current thread
    id to get the source thread id:

      srcThreadId     = currentThreadId ^ <index>
      <threadIdValid> = srcThreadId < <width>

    For example, with this thread group segment:

                      -----------------
      Thread Id       |0|1|2|3|4|5|6|7|
                      -----------------
      Thread <data>   |a|b|c|d|e|f|g|h|
                      -----------------

    If <index> is 0x1
                      -----------------
      src thread Id   |1|0|3|2|5|4|7|6|
                      -----------------
      <threadIdValid> |1|1|1|1|1|1|1|1|
                      -----------------
      result          |b|a|d|c|f|e|h|g|
                      -----------------

Dependencies on NV_gpu_program5

    If NV_gpu_program5 is supported and "OPTION NVX_shader_thread_shuffle" is
    specified in an assembly program, the following edits are made to extend
    the assembly programming model documented in the NV_gpu_program4
    extension and extended by NV_gpu_program5.

    If NV_gpu_program5 is not supported, or if
    "OPTION NVX_shader_thread_shuffle" is not specified in an assembly
    program, the contents of this dependencies section should be ignored.
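
    As an informal aid for the GLSL built-ins specified in the Section 8.3
    modifications, the four shuffle modes can be modeled on the CPU.  The C
    sketch below is not part of this extension: shuffle_model, ShuffleMode
    and SEG_WIDTH are hypothetical names, a single 8-thread segment is
    assumed, and all threads are assumed to be active.  The range check
    combines the two tests given in the text (srcThreadId >= 0 for the up
    mode, srcThreadId < <width> for the others), which gives the same result
    for every mode because only the up mode can go negative and only the
    other modes can exceed the segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical CPU model of one thread group segment; not an API of the
 * extension. All SEG_WIDTH threads are assumed to be active. */
#define SEG_WIDTH 8

typedef enum { SHUFFLE_IDX, SHUFFLE_UP, SHUFFLE_DOWN, SHUFFLE_XOR } ShuffleMode;

/* Fills result[t] with the value thread t would receive and valid[t] with
 * the threadIdValid predicate, following the per-mode rules in the text.
 * data must hold at least SEG_WIDTH characters; result must have room for
 * SEG_WIDTH characters plus a terminating NUL. */
static void shuffle_model(ShuffleMode mode, const char *data, unsigned index,
                          char *result, bool *valid)
{
    assert(strlen(data) >= (size_t)SEG_WIDTH);
    index &= 31u; /* <index> is a 5-bit value; upper bits are ignored */

    for (int tid = 0; tid < SEG_WIDTH; ++tid) {
        int src;
        switch (mode) {
        case SHUFFLE_UP:   src = tid - (int)index; break;
        case SHUFFLE_DOWN: src = tid + (int)index; break;
        case SHUFFLE_XOR:  src = tid ^ (int)index; break;
        default:           src = (int)index;       break; /* SHUFFLE_IDX */
        }
        valid[tid] = (src >= 0) && (src < SEG_WIDTH);
        /* An out-of-range read returns the calling thread's own value. */
        result[tid] = valid[tid] ? data[src] : data[tid];
    }
    result[SEG_WIDTH] = '\0';
}
```

    Calling shuffle_model(SHUFFLE_DOWN, "abcdefgh", 2, result, valid) leaves
    "cdefghgh" in result and sets valid to false for threads 6 and 7, which
    matches the shuffleDownNVX example table.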

    Section 2.X.2, Program Grammar

    (add the following rules to the grammar)

      ::= "SHFDOWN" | "SHFIDX" | "SHFUP" | "SHFXOR"

    Modify Section 2.X.4, Program Execution Environment

    (Add the table entries and relevant text describing the program
    instructions to exchange data between threads.)

      Instr-      Modifiers
      uction  V   F I C S H D    Out  Inputs    Description
      ------- --  - - - - - -    ---  --------  --------------------------------
      ...
      SHFDOWN 50  X X - - - - F  v    v,vu,vu   warp shuffle with added index
      SHFIDX  50  X X - - - - F  v    v,vu,vu   warp shuffle with absolute index
      SHFUP   50  X X - - - - F  v    v,vu,vu   warp shuffle with subtracted index
      SHFXOR  50  X X - - - - F  v    v,vu,vu   warp shuffle with XORed index
      ...

    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
    as extended by NV_gpu_program5)

    + Shader thread shuffle (NVX_shader_thread_shuffle)

    If a program specifies the "NVX_shader_thread_shuffle" option, it may use
    the "SHFXOR", "SHFDOWN", "SHFIDX" and "SHFUP" instructions.  If this
    option is not specified, a program will fail to compile if it uses those
    instructions.

    Section 2.X.8.Z, SHFDOWN: warp shuffle with added index

    The SHFDOWN instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group.

    The instruction has 3 operands as input.  The first operand is a 32-bit
    scalar.  This value will be shared between threads; it can be a float, a
    signed integer or an unsigned integer.

    The second operand is an unsigned integer index in the range 0 to 31.  It
    is used to compute from which thread the current thread will read the
    32-bit scalar value.  For the SHFDOWN instruction, the source thread id is
    the id of the current thread plus the index operand.

    The last operand is an unsigned integer mask.  The mask is used for
    segmenting the thread group and limiting the source thread index.  Bits 0
    to 4 of the mask are a clamp value that limits the source thread index,
    and bits 8 to 12 are a segmentation mask used to segment the thread group
    into multiple smaller groups.
    Together the clamp value and the segmentation mask will generate 2
    internal values, the minThreadId and the maxThreadId, using the following
    logic:

      minThreadId = current thread id & segmentationMask
      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values will segment the thread group by restricting the address
    range a specific thread can access.

    SHFDOWN returns a 2-component vector.  The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it's out of bounds.  For SHFDOWN, the source thread id is in range when it
    is lower than maxThreadId.  The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread.  When the source thread id is out of range, the value is read from
    the current thread.  If the source thread id refers to an inactive thread,
    the returned result will be undefined.

    SHFDOWN supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned
    integer data types, the TRUE value is the maximum integer value (all bits
    are ones) and the FALSE value is zero.

    Section 2.X.8.Z, SHFIDX: warp shuffle with absolute index

    The SHFIDX instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group.

    The instruction has 3 operands as input.  The first operand is a 32-bit
    scalar.  This value will be shared between threads; it can be a float, a
    signed integer or an unsigned integer.

    The second operand is an unsigned integer index in the range 0 to 31.  It
    is used to compute from which thread the current thread will read the
    32-bit scalar value.  For the SHFIDX instruction, this source thread id is
    computed using the following operation:

      source thread id = (index operand & ~segmentationMask) | minThreadId

    The last operand is an unsigned integer mask.
    The mask is used for segmenting the thread group and limiting the source
    thread index.  Bits 0 to 4 of the mask are a clamp value that limits the
    source thread index, and bits 8 to 12 are a segmentation mask used to
    segment the thread group into multiple smaller groups.  Together the
    clamp value and the segmentation mask will generate 2 internal values,
    the minThreadId and the maxThreadId, using the following logic:

      minThreadId = current thread id & segmentationMask
      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values will segment the thread group by restricting the address
    range a specific thread can access.

    SHFIDX returns a 2-component vector.  The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it's out of bounds.  For SHFIDX, the source thread id is in range when it
    is lower than maxThreadId.  The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread.  When the source thread id is out of range, the value is read from
    the current thread.  If the source thread id refers to an inactive thread,
    the returned result will be undefined.

    SHFIDX supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned
    integer data types, the TRUE value is the maximum integer value (all bits
    are ones) and the FALSE value is zero.

    Section 2.X.8.Z, SHFUP: warp shuffle with subtracted index

    The SHFUP instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group.

    The instruction has 3 operands as input.  The first operand is a 32-bit
    scalar.  This value will be shared between threads; it can be a float, a
    signed integer or an unsigned integer.

    The second operand is an unsigned integer index in the range 0 to 31.
    It is used to compute from which thread the current thread will read the
    32-bit scalar value.  For the SHFUP instruction, the source thread id is
    the id of the current thread minus the index operand.

    The last operand is an unsigned integer mask.  The mask is used for
    segmenting the thread group and limiting the source thread index.  Bits 0
    to 4 of the mask are a clamp value that limits the source thread index,
    and bits 8 to 12 are a segmentation mask used to segment the thread group
    into multiple smaller groups.  Together the clamp value and the
    segmentation mask will generate 2 internal values, the minThreadId and
    the maxThreadId, using the following logic:

      minThreadId = current thread id & segmentationMask
      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values will segment the thread group by restricting the address
    range a specific thread can access.

    SHFUP returns a 2-component vector.  The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it's out of bounds.  For SHFUP, the source thread id is in range when it
    is greater than maxThreadId.  The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread.  When the source thread id is out of range, the value is read from
    the current thread.  If the source thread id refers to an inactive thread,
    the returned result will be undefined.

    SHFUP supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned
    integer data types, the TRUE value is the maximum integer value (all bits
    are ones) and the FALSE value is zero.

    Section 2.X.8.Z, SHFXOR: warp shuffle with XORed index

    The SHFXOR instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group.

    The instruction has 3 operands as input.  The first operand is a 32-bit
    scalar.
    This value will be shared between threads; it can be a float, a signed
    integer or an unsigned integer.

    The second operand is an unsigned integer index in the range 0 to 31.  It
    is used to compute from which thread the current thread will read the
    32-bit scalar value.  For the SHFXOR instruction, the source thread id is
    the id of the current thread XORed with the index operand.

    The last operand is an unsigned integer mask.  The mask is used for
    segmenting the thread group and limiting the source thread index.  Bits 0
    to 4 of the mask are a clamp value that limits the source thread index,
    and bits 8 to 12 are a segmentation mask used to segment the thread group
    into multiple smaller groups.  Together the clamp value and the
    segmentation mask will generate 2 internal values, the minThreadId and
    the maxThreadId, using the following logic:

      minThreadId = current thread id & segmentationMask
      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values will segment the thread group by restricting the address
    range a specific thread can access.

    SHFXOR returns a 2-component vector.  The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it's out of bounds.  For SHFXOR, the source thread id is in range when it
    is lower than maxThreadId.  The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread.  When the source thread id is out of range, the value is read from
    the current thread.  If the source thread id refers to an inactive thread,
    the returned result will be undefined.

    SHFXOR supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned
    integer data types, the TRUE value is the maximum integer value (all bits
    are ones) and the FALSE value is zero.

Errors

    None.

New State

    None.

New Implementation Dependent State

    None.
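
    The minThreadId and maxThreadId logic shared by the four SHF*
    instruction descriptions can be checked with a small CPU model.  The C
    sketch below is illustrative only: ThreadRange, thread_range and
    shfidx_source are hypothetical names, and the mask value 0x181F used in
    the note that follows (segmentation mask 0x18 in bits 8 to 12, clamp 0x1F
    in bits 0 to 4) is an assumed input chosen to produce 8-thread segments,
    not a value taken from this specification.

```c
#include <assert.h>

/* Hypothetical helper type; not part of the assembly language. */
typedef struct {
    unsigned min_thread_id;
    unsigned max_thread_id;
} ThreadRange;

/* Decodes the third SHF* operand and applies the min/max logic from the
 * text:
 *   clamp            = bits 0..4 of mask
 *   segmentationMask = bits 8..12 of mask */
static ThreadRange thread_range(unsigned thread_id, unsigned mask)
{
    unsigned clamp = mask & 0x1Fu;
    unsigned seg   = (mask >> 8) & 0x1Fu;
    ThreadRange r;

    assert(thread_id < 32u);
    r.min_thread_id = thread_id & seg;
    r.max_thread_id = r.min_thread_id | (clamp & ~seg);
    return r;
}

/* SHFIDX source thread id, as given in the SHFIDX description:
 *   (index operand & ~segmentationMask) | minThreadId */
static unsigned shfidx_source(unsigned thread_id, unsigned index,
                              unsigned mask)
{
    unsigned seg = (mask >> 8) & 0x1Fu;

    assert(thread_id < 32u && index < 32u);
    return (index & ~seg) | (thread_id & seg);
}
```

    With the assumed mask 0x181F, thread 13 gets minThreadId 8 and
    maxThreadId 15, and a SHFIDX index operand of 2 resolves to source
    thread 10.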

Issues

    None

Revision History

    Rev.    Date      Author    Changes
    ----  --------   --------  -----------------------------------------
     2     9/4/13    jbreton   Replace mask by width in the shuffle functions.
     1    11/27/12   jbreton   Internal revisions.