Name NVX_shader_thread_group Name Strings GL_NVX_shader_thread_group Contributors Jeannot Breton, NVIDIA Pat Brown, NVIDIA Eric Werness, NVIDIA Mark Kilgard, NVIDIA Contact Jeannot Breton, NVIDIA Corporation (jbreton 'at' nvidia.com) Status Shipping. Version Last Modified Date: 9/4/13 NVIDIA Revision: 2 Number XXX Dependencies This extension is written against the OpenGL 4.3 (Compatibility Profile) Specification. This extension is written against version 4.30 (revision 07) of the OpenGL Shading Language Specification. OpenGL 4.3 and GLSL 4.3 are required. This extension interacts with NV_gpu_program5 This extension interacts with NV_compute_program5 This extension interacts with NV_tessellation_program5 Overview Implementations of the OpenGL Shading Language may, but are not required to, run multiple shader threads for a single stage as a SIMD thread group, where individual execution threads are assigned to thread groups in an undefined, implementation-dependent order. This extension provides a set of new features to the OpenGL Shading Language to query thread states and to share data between fragments within a 2x2 pixel quad. More specifically the following functionalities were added: * New uniform variables and tokens to query the number of threads in a warp, the number of warps running on a SM and the number of SMs on the GPU. * New shader inputs to query the thread id, the warp id and the SM id. * New shader inputs to query if a fragment shader thread is a helper thread. * New shader built-in functions to query the state of a Boolean condition over all threads in a thread group. * New shader built-in functions to query which threads are active within a thread group. * New fragment shader built-in functions to share data between fragments within a 2x2 pixel quad. 
Shaders using the new functionalities provided by this extension should enable this functionality via the construct

    #extension GL_NVX_shader_thread_group : require (or enable)

This extension also specifies some modifications to the program assembly language to support the thread state query and thread data sharing functionalities.

Note that in this extension specification warp and thread group have the same meaning. A warp is a group of threads that get executed in lockstep. Each thread in a warp executes the same instruction of a program, but on different data.

New Procedures and Functions

None

New Tokens

Accepted by the parameter of GetBooleanv, GetIntegerv, GetFloatv, and GetDoublev:

    WARP_SIZE_NVX                                   0x9339
    WARPS_PER_SM_NVX                                0x933A
    SM_COUNT_NVX                                    0x933B

Modifications to The OpenGL Shading Language Specification, Version 4.30 (Revision 07)

Including the following line in a shader can be used to control the language features described in this extension:

    #extension GL_NVX_shader_thread_group :

where is as specified in section 3.3.

New preprocessor #defines are added to the OpenGL Shading Language:

    #define GL_NVX_shader_thread_group 1

Modify Section 7.1, Built-in Languages Variable, p. 110

(Add to the list of built-in variables for the compute, vertex, geometry, tessellation control, tessellation evaluation and fragment languages)

    in uint gl_ThreadInWarpNVX;
    in uint gl_ThreadEqMaskNVX;
    in uint gl_ThreadGeMaskNVX;
    in uint gl_ThreadGtMaskNVX;
    in uint gl_ThreadLeMaskNVX;
    in uint gl_ThreadLtMaskNVX;
    in uint gl_WarpIDNVX;
    in uint gl_SMIDNVX;

(Add to the list of built-in variables for the fragment language)

    in bool gl_HelperThreadNVX;

(Add these paragraphs at the end of this section)

The variable gl_ThreadInWarpNVX holds the id of the thread within the thread group (or warp). This variable is in the range 0 to gl_WarpSizeNVX-1, where gl_WarpSizeNVX is the total number of threads in a warp.
The variable gl_ThreadEqMaskNVX is a bitfield in which the bit equal to the current thread id is set. The variable gl_ThreadGeMaskNVX is a bitfield in which bits greater than or equal to the current thread id are set. The variable gl_ThreadGtMaskNVX is a bitfield in which bits greater than the current thread id are set. The variable gl_ThreadLeMaskNVX is a bitfield in which bits lower than or equal to the current thread id are set. The variable gl_ThreadLtMaskNVX is a bitfield in which bits lower than the current thread id are set.

The values of gl_ThreadEqMaskNVX, gl_ThreadGeMaskNVX, gl_ThreadGtMaskNVX, gl_ThreadLeMaskNVX and gl_ThreadLtMaskNVX are derived from the value of gl_ThreadInWarpNVX using simple bit-shift arithmetic; they do not take into account the value of the thread group active mask. For example, if the application wants a bitfield in which bits lower than or equal to the current thread id are set only for active threads, the value of gl_ThreadLeMaskNVX must be ANDed with the thread group active mask.

The variable gl_WarpIDNVX holds the warp id of the executing thread. This variable is in the range 0 to gl_WarpsPerSMNVX-1, where gl_WarpsPerSMNVX is the maximum number of warps executing on an SM.

The variable gl_SMIDNVX holds the SM id of the executing thread. This variable is in the range 0 to gl_SMCountNVX-1, where gl_SMCountNVX is the number of SMs on the GPU.

The variable gl_HelperThreadNVX specifies whether the current thread is a helper thread. In implementations supporting this extension, fragment shader invocations may be arranged in SIMD thread groups of 2x2 fragments called "quads". When a fragment shader instruction is executed on a quad, it's possible that some fragments within the quad will execute the instruction even if they are not covered by the primitive. Those threads are called helper threads.
Their outputs will be discarded and they will not execute global store functions, but the intermediate values they compute can still be used by thread group sharing functions or by fragment derivative functions like dFdx and dFdy.

Modify Section 7.4, Built-In Uniform State, p. 125

(Add to the list of built-in uniform variable declarations)

    uniform uint gl_WarpSizeNVX;
    uniform uint gl_WarpsPerSMNVX;
    uniform uint gl_SMCountNVX;

(Add this paragraph at the end of this section)

The variable gl_WarpSizeNVX is the total number of threads in a warp. The variable gl_WarpsPerSMNVX is the maximum number of warps executing on an SM. The variable gl_SMCountNVX is the number of SMs on the GPU.

Modify Section 8.3, Common Functions, p. 133

(add a function to query which threads are active within a thread group)

Syntax:

    uint activeThreadsNVX(void)

In the value returned by activeThreadsNVX(), each bit is set to 1 if the corresponding thread in the SIMD thread group is executing the call to activeThreadsNVX() and 0 otherwise. A bit in the return value may be set to zero due to conditional flow control (e.g., returning from a function, executing the "else" part of an "if" statement) or because the SIMD thread group was dispatched without a full collection of threads.

(add a function to query the state of a Boolean condition over all the threads in a thread group)

Syntax:

    uint ballotThreadNVX(bool value)

The function ballotThreadNVX() computes a 32-bit bitfield. It looks at the condition for each active thread of a thread group and sets to 1 each bit for which the condition in the corresponding thread is true. Bits for threads with a false condition are set to 0. Bits for inactive threads are also set to 0. It's possible to query the active thread mask by calling the function activeThreadsNVX.
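The mask built-ins of Section 7.1 and the two query functions above can be illustrated with a small CPU-side model. This is only a sketch in Python, not the GPU implementation: the names thread_masks, ballot and active_threads are hypothetical, and the warp size of 32 is an assumption chosen to match the 32-bit bitfield returned by ballotThreadNVX(). Each thread's Boolean argument is passed in as one list entry, and the set of active threads is passed in as a bitmask.

```python
WARP_SIZE = 32          # assumption: one bit per thread in a 32-bit bitfield
MASK32 = 0xFFFFFFFF     # keep every result inside 32 bits


def thread_masks(thread_in_warp):
    """Derive the Eq/Lt/Le/Gt/Ge masks from a thread id with bit shifts only.

    As in the text, the active mask is NOT taken into account here.
    """
    eq = (1 << thread_in_warp) & MASK32
    lt = eq - 1                 # all bits below the thread's own bit
    le = lt | eq
    gt = ~le & MASK32
    ge = ~lt & MASK32
    return {"eq": eq, "lt": lt, "le": le, "gt": gt, "ge": ge}


def ballot(conditions, active_mask):
    """Model of ballotThreadNVX(): conditions[i] is thread i's argument."""
    result = 0
    for thread, value in enumerate(conditions):
        is_active = (active_mask >> thread) & 1
        if is_active and value:
            result |= 1 << thread   # inactive or false threads stay 0
    return result


def active_threads(active_mask):
    """Model of activeThreadsNVX(): every active thread contributes a 1."""
    return ballot([True] * WARP_SIZE, active_mask)


# Thread 3, with only threads 0, 1 and 3 active: restrict the Le mask to
# active threads by ANDing it with the active mask, as the text describes.
le_active = thread_masks(3)["le"] & active_threads(0b1011)
```

In this model le_active keeps only bits 0, 1 and 3, whereas the unmasked Le value for thread 3 also has bit 2 set.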
(add a function to share data between fragments in a quad)

Syntax:

    float quadSwizzle0NVX(float swizzledValue, [float unswizzledValue])
    vec2 quadSwizzle0NVX(vec2 swizzledValue, [vec2 unswizzledValue])
    vec3 quadSwizzle0NVX(vec3 swizzledValue, [vec3 unswizzledValue])
    vec4 quadSwizzle0NVX(vec4 swizzledValue, [vec4 unswizzledValue])
    float quadSwizzle1NVX(float swizzledValue, [float unswizzledValue])
    vec2 quadSwizzle1NVX(vec2 swizzledValue, [vec2 unswizzledValue])
    vec3 quadSwizzle1NVX(vec3 swizzledValue, [vec3 unswizzledValue])
    vec4 quadSwizzle1NVX(vec4 swizzledValue, [vec4 unswizzledValue])
    float quadSwizzle2NVX(float swizzledValue, [float unswizzledValue])
    vec2 quadSwizzle2NVX(vec2 swizzledValue, [vec2 unswizzledValue])
    vec3 quadSwizzle2NVX(vec3 swizzledValue, [vec3 unswizzledValue])
    vec4 quadSwizzle2NVX(vec4 swizzledValue, [vec4 unswizzledValue])
    float quadSwizzle3NVX(float swizzledValue, [float unswizzledValue])
    vec2 quadSwizzle3NVX(vec2 swizzledValue, [vec2 unswizzledValue])
    vec3 quadSwizzle3NVX(vec3 swizzledValue, [vec3 unswizzledValue])
    vec4 quadSwizzle3NVX(vec4 swizzledValue, [vec4 unswizzledValue])
    float quadSwizzleXNVX(float swizzledValue, [float unswizzledValue])
    vec2 quadSwizzleXNVX(vec2 swizzledValue, [vec2 unswizzledValue])
    vec3 quadSwizzleXNVX(vec3 swizzledValue, [vec3 unswizzledValue])
    vec4 quadSwizzleXNVX(vec4 swizzledValue, [vec4 unswizzledValue])
    float quadSwizzleYNVX(float swizzledValue, [float unswizzledValue])
    vec2 quadSwizzleYNVX(vec2 swizzledValue, [vec2 unswizzledValue])
    vec3 quadSwizzleYNVX(vec3 swizzledValue, [vec3 unswizzledValue])
    vec4 quadSwizzleYNVX(vec4 swizzledValue, [vec4 unswizzledValue])

In implementations supporting this extension, if a primitive covers a fragment at (x,y), its fragment shader invocation will be arranged in a SIMD thread group with fragment shader invocations corresponding to three neighboring pixels. These four invocations are arranged in a 2x2 grid, called a "quad".
If the neighbors of a fragment are not covered by the primitive, fragment shader invocations will still be generated. The implementation may compute differences between values in these threads to estimate derivatives for dFdx(), dFdy(), and for texture lookups with automatic LOD calculations.

Fragments may have different locations in the quads based on the type of render target. When rendering to a window, fragments within a quad follow this pattern:

    -----------------------------------------------------
    | gl_ThreadInWarpNVX 4N+2 | gl_ThreadInWarpNVX 4N+3 |
    | pixel (X+0,Y+1)         | pixel (X+1,Y+1)         |
    -----------------------------------------------------
    | gl_ThreadInWarpNVX 4N+0 | gl_ThreadInWarpNVX 4N+1 |
    | pixel (X+0,Y+0)         | pixel (X+1,Y+0)         |
    -----------------------------------------------------

When rendering to a framebuffer object, fragments within a quad follow this pattern:

    -----------------------------------------------------
    | gl_ThreadInWarpNVX 4N+0 | gl_ThreadInWarpNVX 4N+1 |
    | pixel (X+0,Y+1)         | pixel (X+1,Y+1)         |
    -----------------------------------------------------
    | gl_ThreadInWarpNVX 4N+2 | gl_ThreadInWarpNVX 4N+3 |
    | pixel (X+0,Y+0)         | pixel (X+1,Y+0)         |
    -----------------------------------------------------

There are 6 quadSwizzle functions that allow fragments within a quad to exchange data. All those functions will read a floating point operand, swizzledValue, which can come from any fragment in the quad. Another optional floating point operand, unswizzledValue, which comes from the current fragment, can be added to swizzledValue. The only difference between all those quadSwizzle functions is the location where they get the swizzledValue operand within the 2x2 pixel quad.
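Because the six functions differ only in which thread of the quad supplies the swizzled operand, their shared behaviour can be modeled by one routine driven by a source-lane table. The following Python sketch is a CPU-side illustration only: quad_swizzle, pixel_offset and SOURCE_LANE are hypothetical names, scalar floats stand in for the float/vec2/vec3/vec4 overloads, and an omitted optional operand is modeled as adding 0.0 (an assumption). The table entries and the zero result for a divergent quad follow the per-function formulas listed in the remainder of this section.

```python
# For each function, the lane (0..3) that thread N reads swizzledValue from.
SOURCE_LANE = {
    "0": (0, 0, 0, 0),
    "1": (1, 1, 1, 1),
    "2": (2, 2, 2, 2),
    "3": (3, 3, 3, 3),
    "X": (1, 0, 3, 2),   # neighbor in X
    "Y": (2, 3, 0, 1),   # neighbor in Y
}


def quad_swizzle(kind, swizzled, unswizzled=None,
                 active=(True, True, True, True)):
    """Return the four per-thread results of one quadSwizzle call.

    swizzled / unswizzled hold one value per thread of the quad.
    """
    if unswizzled is None:
        unswizzled = (0.0, 0.0, 0.0, 0.0)   # assumption: nothing is added
    if not all(active):
        return [0.0] * 4                    # divergent quad: all results 0
    src = SOURCE_LANE[kind]
    return [swizzled[src[n]] + unswizzled[n] for n in range(4)]


def pixel_offset(thread_in_warp, to_window):
    """Pixel offset (dx, dy) inside the quad, per the two layout tables."""
    lane = thread_in_warp % 4
    dx = lane & 1
    row = lane >> 1
    dy = row if to_window else 1 - row      # FBO layout flips the rows
    return dx, dy


# Horizontal difference: each thread gets (neighbor value - own value).
values = [1.0, 2.0, 3.0, 4.0]
diff_x = quad_swizzle("X", values, [-v for v in values])
```

Passing the negated value as the unswizzled operand, as in the last line, turns the built-in addition into a difference between a fragment and its horizontal neighbor.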
quadSwizzle0NVX will read the operand from the fragment 0:

    result[thread N] = swizzledValue[thread 0] + unswizzledValue[thread N]

quadSwizzle1NVX will read the operand from the fragment 1:

    result[thread N] = swizzledValue[thread 1] + unswizzledValue[thread N]

quadSwizzle2NVX will read the operand from the fragment 2:

    result[thread N] = swizzledValue[thread 2] + unswizzledValue[thread N]

quadSwizzle3NVX will read the operand from the fragment 3:

    result[thread N] = swizzledValue[thread 3] + unswizzledValue[thread N]

quadSwizzleXNVX will read the operand for each fragment from its neighbor in X:

    result[thread 0] = swizzledValue[thread 1] + unswizzledValue[thread 0]
    result[thread 1] = swizzledValue[thread 0] + unswizzledValue[thread 1]
    result[thread 2] = swizzledValue[thread 3] + unswizzledValue[thread 2]
    result[thread 3] = swizzledValue[thread 2] + unswizzledValue[thread 3]

quadSwizzleYNVX will read the operand for each fragment from its neighbor in Y:

    result[thread 0] = swizzledValue[thread 2] + unswizzledValue[thread 0]
    result[thread 1] = swizzledValue[thread 3] + unswizzledValue[thread 1]
    result[thread 2] = swizzledValue[thread 0] + unswizzledValue[thread 2]
    result[thread 3] = swizzledValue[thread 1] + unswizzledValue[thread 3]

If any thread in a 2x2 pixel quad is inactive, the quad is divergent. In this case quadSwizzle will return 0 for all fragments in the quad.

Dependencies on NV_gpu_program5

If NV_gpu_program5 is supported and "OPTION NVX_shader_thread_group" is specified in an assembly program, the following edits are made to extend the assembly programming model documented in the NV_gpu_program4 extension and extended by NV_gpu_program5.

If NV_gpu_program5 is not supported, or if "OPTION NVX_shader_thread_group" is not specified in an assembly program, the contents of this dependencies section should be ignored.
Modify Section 2.X.2, Program Grammar (add the following rules to the the NV_gpu_program4 and NV_gpu_program5 base grammars) ::= "TGBALLOT" ::= "state" "." ::= "thread" "." ::= "warpsize" | "warpspersm" | "smcount" (add/change the following rules to the NV_fragment_program4 and NV_gpu_program5 base grammars) ::= "QSWZ0" | "QSWZ1" | "QSWZ2" | "QSWZ3" | "QSWZX" | "QSWZY" ::= "threadid" | "threadeqmask" | "threadltmask" | "threadlemask" | "threadgtmask" | "threadgemask" | "warpid" | "smid" | "helperthread" (add/change the following rules to the NV_vertex_program4 and NV_gpu_program5 base grammars) ::= "threadid" | "threadeqmask" | "threadltmask" | "threadlemask" | "threadgtmask" | "threadgemask" | "warpid" | "smid" (add/change the following rules to the NV_geometry_program4 and NV_gpu_program5 base grammars) ::= "threadid" | "threadeqmask" | "threadltmask" | "threadlemask" | "threadgtmask" | "threadgemask" | "warpid" | "smid" Modify Section 2.X.3.2 of the NV_gpu_program4 specification, Program Attribute Variables. (Add the table entries and relevant text describing the fragment program input variable use to query thread states.) Fragment Attribute Binding Components Underlying State -------------------------- ---------- ---------------------------- ... fragment.threadid (id,-,-,-) id of the current thread fragment.threadeqmask (m,-,-,-) mask with the current thread fragment.threadltmask (m,-,-,-) mask with lower thread fragment.threadlemask (m,-,-,-) mask with lower or equal thread fragment.threadgtmask (m,-,-,-) mask with greater thread fragment.threadgemask (m,-,-,-) mask with greater or equal thread fragment.warpid (id,-,-,-) warp id of the current thread fragment.smid (id,-,-,-) SM id of the current thread fragment.helperthread (k,-,-,-) current thread is a helper thread ... If a fragment attribute binding matches "fragment.threadid", the "x" component is filled with the thread id of the current thread. The thread id is an unsigned integer in the range 0 to 31. 
If a fragment attribute binding matches "fragment.threadeqmask", the "x" component is filled with a 32-bit unsigned integer bitfield in which the bit equal to the current thread id is set. If a fragment attribute binding matches "fragment.threadltmask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits lower than the current thread id are set. If a fragment attribute binding matches "fragment.threadlemask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits lower or equal to the current thread id are set. If a fragment attribute binding matches "fragment.threadgtmask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits greater than the current thread id are set. If a fragment attribute binding matches "fragment.threadgemask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits greater or equal to the current thread id are set. If a fragment attribute binding matches "fragment.warpid", the "x" component is filled with the warp id of the current thread. The warp id is an unsigned integer, the range of this value is hw dependent. If a fragment attribute binding matches "fragment.smid", the "x" component is filled with the SM id of the current thread. The SM id is an unsigned integer, the range of this value is hw dependent. If a fragment attribute binding matches "fragment.helperthread", the "x" component is an integer value equal to -1 when the current thread is a helper thread and 0 otherwise. In implementations supporting this extension, fragment program invocations may be arranged in SIMD thread groups of 2x2 fragments called "quad". When a fragment program instruction is executed on a quad, it's possible that some fragments within the quad will execute the instruction even if they are not covered by the primitive. Those threads are called helper threads. 
Their outputs will be discarded and they will not execute global store instructions, but the intermediate values they compute can still be used by thread group sharing instructions or by fragment derivative instructions like DDX and DDY. (Add the table entries and relevant text describing the vertex program attribute variable use to query thread states.) Vertex Attribute Binding Components Underlying State ------------------------ ---------- ---------------------------- ... vertex.threadid (id,-,-,-) id of the current thread vertex.threadeqmask (m,-,-,-) mask with the current thread vertex.threadltmask (m,-,-,-) mask with lower thread vertex.threadlemask (m,-,-,-) mask with lower or equal thread vertex.threadgtmask (m,-,-,-) mask with greater thread vertex.threadgemask (m,-,-,-) mask with greater or equal thread vertex.warpid (id,-,-,-) warp id of the current thread vertex.smid (id,-,-,-) SM id of the current thread ... If a vertex attribute binding matches "vertex.threadid", the "x" component is filled with the thread id of the current thread. The thread id is an unsigned integer in the range 0 to 31. If a vertex attribute binding matches "vertex.threadeqmask", the "x" component is filled with a 32-bit unsigned integer bitfield in which the bit equal to the current thread id is set. If a vertex attribute binding matches "vertex.threadltmask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits lower than the current thread id are set. If a vertex attribute binding matches "vertex.threadlemask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits lower or equal to the current thread id are set. If a vertex attribute binding matches "vertex.threadgtmask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits greater than the current thread id are set. 
If a vertex attribute binding matches "vertex.threadgemask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits greater or equal to the current thread id are set. If a vertex attribute binding matches "vertex.warpid", the "x" component is filled with the warp id of the current thread. The warp id is an unsigned integer, the range of this value is hw dependent. If a vertex attribute binding matches "vertex.smid", the "x" component is filled with the SM id of the current thread. The SM id is an unsigned integer, the range of this value is hw dependent. (Add the table entries and relevant text describing the geometry program attribute variable use to query thread states.) Geometry Attribute Binding Components Underlying State -------------------------- ---------- ---------------------------- ... primitive.threadid (id,-,-,-) id of the current thread primitive.threadeqmask (m,-,-,-) mask with the current thread primitive.threadltmask (m,-,-,-) mask with lower thread primitive.threadlemask (m,-,-,-) mask with lower or equal thread primitive.threadgtmask (m,-,-,-) mask with greater thread primitive.threadgemask (m,-,-,-) mask with greater or equal thread primitive.warpid (id,-,-,-) warp id of the current thread primitive.smid (id,-,-,-) SM id of the current thread ... If a geometry attribute binding matches "primitive.threadid", the "x" component is filled with the thread id of the current thread. The thread id is an unsigned integer in the range 0 to 31. If a geometry attribute binding matches "primitive.threadeqmask", the "x" component is filled with a 32-bit unsigned integer bitfield in which the bit equal to the current thread id is set. If a geometry attribute binding matches "primitive.threadltmask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits lower than the current thread id are set. 
If a geometry attribute binding matches "primitive.threadlemask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits lower or equal to the current thread id are set. If a geometry attribute binding matches "primitive.threadgtmask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits greater than the current thread id are set. If a geometry attribute binding matches "primitive.threadgemask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits greater or equal to the current thread id are set. If a geometry attribute binding matches "primitive.warpid", the "x" component is filled with the warp id of the current thread. The warp id is an unsigned integer, the range of this value is hw dependent. If a geometry attribute binding matches "primitive.smid", the "x" component is filled with the SM id of the current thread. The SM id is an unsigned integer, the range of this value is hw dependent. (add the following subsection to section 2.X.3.3, Parameters) Thread Group Property Bindings Binding Components Underlying State ----------------------------- ---------- ---------------------------- state.thread.warpsize (x,-,-,-) total number of thread in a warp state.thread.warpspersm (x,-,-,-) maximum number of warp executing on a SM state.thread.smcount (x,-,-,-) number of SM on the GPU If a program parameter binding matches "state.thread.warpsize", the "x" component of the program parameter variable is filled with an integer value indicating the total number of thread in a warp. The "y", "z", and "w" components are undefined. If a program parameter binding matches "state.thread.warpspersm", the "x" component of the program parameter variable is filled with an integer value indicating the maximum number of warp executing on a SM. The "y", "z", and "w" components are undefined. 
If a program parameter binding matches "state.thread.smcount", the "x" component of the program parameter variable is filled with an integer value indicating the number of SM on the GPU. The "y", "z", and "w" components are undefined. Modify Section 2.X.4, Program Execution Environment (Add the table entries and relevant text describing the program instruction to query thread conditions.) Instr- Modifiers uction V F I C S H D Out Inputs Description ------- -- - - - - - - --- -------- -------------------------------- ... TGBALLOT 50 X X X X - - F vu v query a boolean in thread group ... (Add the table entries and relevant text describing the fragment program instructions to exchange data between threads.) Instr- Modifiers uction V F I C S H D Out Inputs Description ------- -- - - - - - - --- -------- -------------------------------- ... QSWZ0 50 X - - - - - F v v,v add fragment 0 in a quad QSWZ1 50 X - - - - - F v v,v add fragment 1 in a quad QSWZ2 50 X - - - - - F v v,v add fragment 2 in a quad QSWZ3 50 X - - - - - F v v,v add fragment 3 in a quad QSWZX 50 X - - - - - F v v,v add fragments horizontally QSWZY 50 X - - - - - F v v,v add fragments vertically ... (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension, as extended by NV_gpu_program5) + Shader thread group (NVX_shader_thread_group) If a fragment program specifies the "NVX_shader_thread_group" option, it may use the "fragment.threadid", "fragment.threadeqmask", "fragment.threadltmask", "fragment.threadlemask", "fragment.threadgtmask", "fragment.threadgemask", "fragment.warpid", "fragment.smid", "fragment.helperthread", "state.thread.warpsize", "state.thread.warpspersm" and "state.thread.smcount" bindings. It may also use the "TGBALLOT", "QSWZ0", "QSWZ1", "QSWZ2", "QSWZ3", "QSWZX" and "QSWZY" instructions. If this option is not specified, a program will fail to compile if it uses those instructions or bindings. 
If a vertex program specifies the "NVX_shader_thread_group" option, it may use the "vertex.threadid", "vertex.threadeqmask", "vertex.threadltmask", "vertex.threadlemask", "vertex.threadgtmask", "vertex.threadgemask", "vertex.warpid", "vertex.smid", "state.thread.warpsize", "state.thread.warpspersm" and "state.thread.smcount" bindings. It may also use the "TGBALLOT" instruction. If this option is not specified, a program will fail to compile if it uses those instructions or bindings. If a geometry program specifies the "NVX_shader_thread_group" option, it may use the "primitive.threadid", "primitive.threadeqmask", "primitive.threadltmask", "primitive.threadlemask", "primitive.threadgtmask", "primitive.threadgemask", "primitive.warpid", "primitive.smid", "state.thread.warpsize", "state.thread.warpspersm" and "state.thread.smcount" bindings. It may also use the "TGBALLOT" instruction. If this option is not specified, a program will fail to compile if it uses those instructions or bindings. Section 2.X.8.Z, QSWZ0: add fragment 0 data to all fragment in a quad The QSWZ0 instruction produces a floating point result by adding the first operand, a floating point value from fragment 0, to the second operand, another floating point value from the current fragment. quadSwizzle0NVX is the GLSL function that implements the same functionality as the QSWZ0 assembly instruction. The section 8.3 of the OpenGL Shading Language Specification has more detail about the implementation of quadSwizzle0NVX. This additional information also applies to QSWZ0. Section 2.X.8.Z, QSWZ1: add fragment 1 data to all fragment in a quad The QSWZ1 instruction produces a floating point result by adding the first operand, a floating point value from fragment 1, to the second operand, another floating point value from the current fragment. quadSwizzle1NVX is the GLSL function that implements the same functionality as the QSWZ1 assembly instruction. 
The section 8.3 of the OpenGL Shading Language Specification has more detail about the implementation of quadSwizzle1NVX. This additional information also applies to QSWZ1. Section 2.X.8.Z, QSWZ2: add fragment 2 data to all fragment in a quad The QSWZ2 instruction produces a floating point result by adding the first operand, a floating point value from fragment 2, to the second operand, another floating point value from the current fragment. quadSwizzle2NVX is the GLSL function that implements the same functionality as the QSWZ2 assembly instruction. The section 8.3 of the OpenGL Shading Language Specification has more detail about the implementation of quadSwizzle2NVX. This additional information also applies to QSWZ2. Section 2.X.8.Z, QSWZ3: add fragment 3 data to all fragment in a quad The QSWZ3 instruction produces a floating point result by adding the first operand, a floating point value from fragment 3, to the second operand, another floating point value from the current fragment. quadSwizzle3NVX is the GLSL function that implements the same functionality as the QSWZ3 assembly instruction. The section 8.3 of the OpenGL Shading Language Specification has more detail about the implementation of quadSwizzle3NVX. This additional information also applies to QSWZ3. Section 2.X.8.Z, QSWZX: add fragments in a quad horizontally The QSWZX instruction produces a floating point result by adding the first operand, a floating point value from the fragment neighbor in X to the current fragment, to the second operand, another floating point value from the current fragment. quadSwizzleXNVX is the GLSL function that implements the same functionality as the QSWZX assembly instruction. The section 8.3 of the OpenGL Shading Language Specification has more detail about the implementation of quadSwizzleXNVX. This additional information also applies to QSWZX. 
Section 2.X.8.Z, QSWZY: add fragments in a quad vertically

The QSWZY instruction produces a floating point result by adding the first operand, a floating point value from the fragment neighbor in Y to the current fragment, to the second operand, another floating point value from the current fragment.

quadSwizzleYNVX is the GLSL function that implements the same functionality as the QSWZY assembly instruction. Section 8.3 of the OpenGL Shading Language Specification has more detail about the implementation of quadSwizzleYNVX. This additional information also applies to QSWZY.

Section 2.X.8.Z, TGBALLOT: query a boolean condition over a thread group

The TGBALLOT instruction produces a result vector by reading a vector operand for each active thread in the current thread group and comparing each component to zero. Each component of the result vector contains an integer bitmask (described below) in which a bit is set if the corresponding component of the operand vector is non-zero for the corresponding thread, and not set otherwise.

Sometimes, when the instruction is in a conditional control flow block or when it's not possible to completely fill a thread group, only a subset of the threads in the thread group will be active and will execute the TGBALLOT instruction. Each bit in the bitfield corresponding to an inactive thread will be set to 0. It's possible to query the active thread mask by calling TGBALLOT with 1 as the first operand.
    tmp = VectorLoad(op0);
    result = { 0, 0, 0, 0 };
    for (all active threads) {
      if ([thread]tmp.x != 0) result.x |= 1 << thread;
      if ([thread]tmp.y != 0) result.y |= 1 << thread;
      if ([thread]tmp.z != 0) result.z |= 1 << thread;
      if ([thread]tmp.w != 0) result.w |= 1 << thread;
    }

Dependencies on NV_tessellation_program5

If NV_tessellation_program5 is supported and "OPTION NVX_shader_thread_group" is specified in an assembly program, the following edits are made to extend the assembly programming model documented in the NV_gpu_program4 extension and extended by NV_gpu_program5 and NV_tessellation_program5.

If NV_tessellation_program5 is not supported, or if "OPTION NVX_shader_thread_group" is not specified in an assembly program, the contents of this dependencies section should be ignored.

Modify Section 2.X.2, Program Grammar

(add/change the following rules to the NV_gpu_program5 base grammars for tessellation control programs)

    ::= "threadid"
      | "threadeqmask"
      | "threadltmask"
      | "threadlemask"
      | "threadgtmask"
      | "threadgemask"
      | "warpid"
      | "smid"

(add/change the following rules to the NV_gpu_program5 base grammars for tessellation evaluation programs)

    ::= "threadid"
      | "threadeqmask"
      | "threadltmask"
      | "threadlemask"
      | "threadgtmask"
      | "threadgemask"
      | "warpid"
      | "smid"

Modify Section 2.X.3.2 of the NV_tessellation_program5 specification, Program Attribute Variables.

(Add the table entries and relevant text describing the Tessellation control and evaluation program attribute variables use to query thread states.)

    Primitive Binding Suffix    Components  Underlying State
    --------------------------  ----------  ----------------------------
    ...
    primitive.threadid          (id,-,-,-)  id of the current thread
    primitive.threadeqmask      (m,-,-,-)   mask with the current thread
    primitive.threadltmask      (m,-,-,-)   mask with lower thread
    primitive.threadlemask      (m,-,-,-)   mask with lower or equal thread
    primitive.threadgtmask      (m,-,-,-)   mask with greater thread
    primitive.threadgemask      (m,-,-,-)   mask with greater or equal thread
    primitive.warpid            (id,-,-,-)  warp id of the current thread
    primitive.smid              (id,-,-,-)  SM id of the current thread
    ...

If an attribute binding matches "primitive.threadid", the "x" component is filled with the thread id of the current thread. The thread id is an unsigned integer in the range 0 to 31.

If an attribute binding matches "primitive.threadeqmask", the "x" component is filled with a 32-bit unsigned integer bitfield in which the bit equal to the current thread id is set.

If an attribute binding matches "primitive.threadltmask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits lower than the current thread id are set.

If an attribute binding matches "primitive.threadlemask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits lower than or equal to the current thread id are set.

If an attribute binding matches "primitive.threadgtmask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits greater than the current thread id are set.

If an attribute binding matches "primitive.threadgemask", the "x" component is filled with a 32-bit unsigned integer bitfield in which bits greater than or equal to the current thread id are set.

If an attribute binding matches "primitive.warpid", the "x" component is filled with the warp id of the current thread. The warp id is an unsigned integer; the range of this value is hardware-dependent.

If an attribute binding matches "primitive.smid", the "x" component is filled with the SM id of the current thread. The SM id is an unsigned integer; the range of this value is hardware-dependent.

    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4
    extension, as extended by NV_gpu_program5 and NV_tessellation_program5)

    + Shader thread group (NVX_shader_thread_group)

    If a program specifies the "NVX_shader_thread_group" option, it may use
    the "primitive.threadid", "primitive.threadeqmask",
    "primitive.threadltmask", "primitive.threadlemask",
    "primitive.threadgtmask", "primitive.threadgemask", "primitive.warpid",
    "primitive.smid", "state.thread.warpsize", "state.thread.warpspersm" and
    "state.thread.smcount" bindings. It may also use the "TGBALLOT"
    instruction. A program that uses those bindings without specifying this
    option will fail to compile.

Dependencies on NV_compute_program5

    If NV_compute_program5 is supported and
    "OPTION NVX_shader_thread_group" is specified in an assembly program,
    the following edits are made to extend the assembly programming model
    documented in the NV_gpu_program4 extension and extended by
    NV_gpu_program5 and NV_compute_program5.

    If NV_compute_program5 is not supported, or if
    "OPTION NVX_shader_thread_group" is not specified in an assembly
    program, the contents of this dependencies section should be ignored.

    Section 2.X.2, Program Grammar

    (add the following rules to the grammar)

                  ::= "invocation" "." "threadid"
                    | "invocation" "." "threadeqmask"
                    | "invocation" "." "threadltmask"
                    | "invocation" "." "threadlemask"
                    | "invocation" "." "threadgtmask"
                    | "invocation" "." "threadgemask"
                    | "invocation" "." "warpid"
                    | "invocation" "." "smid"

    Modify Section 2.X.3.2 of the NV_compute_program5 specification,
    Program Attribute Variables.

    (Add the table entries and relevant text describing the compute program
    input variables used to query thread states.)

      Attribute Binding           Components  Underlying State
      --------------------------  ----------  ----------------------------
      ...
      invocation.threadid         (id,-,-,-)  id of the current thread
      invocation.threadeqmask     (m,-,-,-)   mask with the current thread
      invocation.threadltmask     (m,-,-,-)   mask with lower threads
      invocation.threadlemask     (m,-,-,-)   mask with lower or equal threads
      invocation.threadgtmask     (m,-,-,-)   mask with greater threads
      invocation.threadgemask     (m,-,-,-)   mask with greater or equal threads
      invocation.warpid           (id,-,-,-)  warp id of the current thread
      invocation.smid             (id,-,-,-)  SM id of the current thread
      ...

    If a compute attribute binding matches "invocation.threadid", the "x"
    component is filled with the thread id of the current thread. The thread
    id is an unsigned integer in the range 0 to 31.

    If a compute attribute binding matches "invocation.threadeqmask", the
    "x" component is filled with a 32-bit unsigned integer bitfield in which
    the bit equal to the current thread id is set.

    If a compute attribute binding matches "invocation.threadltmask", the
    "x" component is filled with a 32-bit unsigned integer bitfield in which
    bits lower than the current thread id are set.

    If a compute attribute binding matches "invocation.threadlemask", the
    "x" component is filled with a 32-bit unsigned integer bitfield in which
    bits lower than or equal to the current thread id are set.

    If a compute attribute binding matches "invocation.threadgtmask", the
    "x" component is filled with a 32-bit unsigned integer bitfield in which
    bits greater than the current thread id are set.

    If a compute attribute binding matches "invocation.threadgemask", the
    "x" component is filled with a 32-bit unsigned integer bitfield in which
    bits greater than or equal to the current thread id are set.

    If a compute attribute binding matches "invocation.warpid", the "x"
    component is filled with the warp id of the current thread. The warp id
    is an unsigned integer; the range of this value is hardware dependent.

    If a compute attribute binding matches "invocation.smid", the "x"
    component is filled with the SM id of the current thread.
    The SM id is an unsigned integer; the range of this value is hardware
    dependent.

    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4
    extension, as extended by NV_gpu_program5 and NV_compute_program5)

    + Shader thread group (NVX_shader_thread_group)

    If a program specifies the "NVX_shader_thread_group" option, it may use
    the "invocation.threadid", "invocation.threadeqmask",
    "invocation.threadltmask", "invocation.threadlemask",
    "invocation.threadgtmask", "invocation.threadgemask",
    "invocation.warpid", "invocation.smid", "state.thread.warpsize",
    "state.thread.warpspersm" and "state.thread.smcount" bindings. It may
    also use the "TGBALLOT" instruction. A program that uses those bindings
    without specifying this option will fail to compile.

Errors

    None.

New State

    None.

New Implementation Dependent State

                                         Minimum
    Get Value         Type  Get Command  Value    Description                  Sec.     Attrib
    ----------------  ----  -----------  -------  ---------------------------  -------  ------
    WARP_SIZE_NVX     Z+    GetIntegerv  1        total number of threads      2.X.3.3  -
                                                  in a warp.
    WARPS_PER_SM_NVX  Z+    GetIntegerv  1        maximum number of warps      2.X.3.3  -
                                                  executing on an SM.
    SM_COUNT_NVX      Z+    GetIntegerv  1        number of SMs on the GPU.    2.X.3.3  -

Issues

    None

Revision History

    Rev.  Date      Author    Changes
    ----  --------  --------  -----------------------------------------
     2    9/4/13    jbreton   Add helperThread attribute binding.
     1    12/19/12  jbreton   Internal revisions.