Performance Counters


There are three types of counters available through Tegra Graphics Debugger. Hardware counters provide data directly from various points inside the GPU. Software counters give insight into the state and performance of the driver. Simplified Experiments are multi-pass experiments that give detailed information about the state of the GPU.

The GPU counters report values accumulated since the previous time the GPU was sampled. For instance, triangle_count gives the number of triangles rendered since the last sample was taken. Once you integrate the counters into your own application, you can sample once per frame and correlate the data with a given frame.
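
To illustrate the accumulate-since-last-sample behavior, the following minimal sketch simulates a counter that resets each time it is read and samples it once per frame. `AccumulatingCounter`, `record`, `sample`, and `sample_per_frame` are hypothetical names made up for this illustration; they are not part of the Tegra Graphics Debugger API, and the triangle numbers are invented.

```python
class AccumulatingCounter:
    """Hypothetical stand-in for a GPU counter such as triangle_count."""

    def __init__(self):
        self._pending = 0

    def record(self, events):
        # Events pile up between samples.
        self._pending += events

    def sample(self):
        # A read returns everything accumulated since the last sample,
        # then starts accumulating again from zero.
        value, self._pending = self._pending, 0
        return value


def sample_per_frame(frames):
    """Sample once at the end of each frame and key the result by frame index.

    frames: list of lists, where each inner list holds the (made-up)
    triangle counts of the draw calls issued in that frame.
    """
    counter = AccumulatingCounter()
    per_frame = {}
    for frame_index, draws in enumerate(frames):
        for triangles in draws:
            counter.record(triangles)
        per_frame[frame_index] = counter.sample()
    return per_frame


frame_triangles = sample_per_frame([[100, 250], [400], []])
print(frame_triangles)  # {0: 350, 1: 400, 2: 0}
```

Because each sample covers exactly one frame, every value can be attributed to that frame without subtracting earlier readings.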

All of the software/driver counters represent per-frame accounting. These counters are accumulated and updated in the driver once per frame, so even if you sample more than once per frame, the software counters hold the same data (from the previous frame) until the end of the current frame.
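
The per-frame update can be pictured with a small simulation. This is only a sketch of the behavior described above, not driver code: `PerFrameDriverCounter` and its methods are hypothetical names, and the batch counts are invented.

```python
class PerFrameDriverCounter:
    """Hypothetical model of a software counter published once per frame."""

    def __init__(self):
        self._accumulating = 0  # work counted during the current frame
        self._published = 0     # value visible to samplers (previous frame)

    def record(self, amount):
        self._accumulating += amount

    def end_frame(self):
        # The visible value changes only at the frame boundary.
        self._published = self._accumulating
        self._accumulating = 0

    def sample(self):
        return self._published


batches = PerFrameDriverCounter()
batches.record(12)
batches.end_frame()             # frame 0 ended with 12 batches

batches.record(5)
mid_frame_a = batches.sample()  # frame 1 in progress
batches.record(3)
mid_frame_b = batches.sample()  # still frame 1
batches.end_frame()
after_frame = batches.sample()  # frame 1 has ended

print(mid_frame_a, mid_frame_b, after_frame)  # 12 12 8
```

Both mid-frame samples return the previous frame's total, so sampling software counters more often than once per frame yields no extra information.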

Counter data is provided as either raw values or percentages. Raw counters count events (triangles, pixels, milliseconds, etc.) since the last call. Percentage counters are cycle-based event counts that are divided by the total number of clock cycles elapsed since the last call. For example, gpu_idle counts the number of clock ticks during which the GPU was idle since the last call. This value is automatically divided by the total number of clock ticks to give the percentage of time that the GPU was idle.
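
The percentage calculation amounts to a single division. The sketch below shows the arithmetic with invented tick counts; `percentage_counter` is a hypothetical helper, and returning 0.0 for an empty interval is an assumption of this example rather than documented tool behavior.

```python
def percentage_counter(event_cycles, total_cycles):
    """Convert a cycle-based event count into a percentage of elapsed cycles."""
    if total_cycles <= 0:
        # Assumption for this sketch: report 0% when no cycles have elapsed.
        return 0.0
    return 100.0 * event_cycles / total_cycles


# Made-up sample: the GPU was idle for 250,000 of 1,000,000 clock ticks.
idle_percent = percentage_counter(event_cycles=250_000, total_cycles=1_000_000)
print(f"GPU idle: {idle_percent:.1f}%")  # GPU idle: 25.0%
```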

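The table below lists all performance counters supported in Tegra Graphics Debugger. Each row gives the counter name, the API it applies to (Graphics, Compute, or Both), the unit, and a description; the API and unit fields are empty for some counters.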

Name API Unit Description
cpu_00_frequency     The current frequency of the CPU core in Hz.
cpu_00_load     The utilization of the CPU core.
cpu_01_frequency     The current frequency of the CPU core in Hz.
cpu_01_load     The utilization of the CPU core.
cpu_02_frequency     The current frequency of the CPU core in Hz.
cpu_02_load     The utilization of the CPU core.
cpu_03_frequency     The current frequency of the CPU core in Hz.
cpu_03_load     The utilization of the CPU core.
cpu_load     The average utilization of all the CPU cores.
elapsed_cycles Compute   Max elapsed cycles of all the GPCs.
geom_busy Graphics GPU Cycles the geometry unit is busy.
GPU Bottleneck Graphics GPU Index for GPU bottleneck
GPU_busy Both GPU Cycles the Graphics engine or the Compute engine is busy.
GPU_idle Both GPU Cycles the Graphics engine and Compute engine are idle.
IA Bottleneck Graphics GPU Input Attribute Is Bottleneck
IA SOL Graphics GPU Input Attribute SOL
ia_requests Graphics GPU Number of Input Assembler requests.
inst_executed_cs Both SM Instructions executed by Compute shaders (CS), not including replays.
inst_executed_cs_ratio Both SM Percentage of total instructions executed that were executed by a Compute shader.
inst_executed_gs Graphics SM Instructions executed by geometry shaders (GS), not including replays.
inst_executed_gs_ratio Graphics SM Percentage of total instructions executed that were executed by a geometry shader.
inst_executed_ps Graphics SM Instructions executed by pixel shaders (PS), not including replays.
inst_executed_ps_ratio Graphics SM Percentage of total instructions executed that were executed by a pixel shader.
inst_executed_tcs Graphics SM Instructions executed by tessellation control shaders (TCS/hull), not including replays.
inst_executed_tcs_ratio Graphics SM Percentage of total instructions executed that were executed by a hull shader.
inst_executed_tes Graphics SM Instructions executed by tessellation evaluation shaders (TES/domain), not including replays.
inst_executed_tes_ratio Graphics SM Percentage of total instructions executed that were executed by a domain shader.
inst_executed_vs Graphics SM Instructions executed by vertex shaders (VS), not including replays.
inst_executed_vs_ratio Graphics SM Percentage of total instructions executed that were executed by a vertex shader.
l1_atoms_bytes Compute Cache Number of bytes written through L1 for ATOM instructions.
l1_atoms_transactions Compute Cache ATOM transactions. A transaction is 128 bytes.
l1_atoms_transactions_per_request Compute Cache Number of atom transactions in L1 per atom instructions executed.
l1_global_load_bytes Compute Cache Number of bytes read from L1 for global memory.
l1_global_load_hitrate Compute Cache Hit rate in percent in L1 for global load operations.
l1_global_load_transactions Compute Cache Global load transactions. A transaction is 128 bytes.
l1_global_load_transactions_hit Compute Cache Global load transactions that hit in the L1 cache. A transaction is 128 bytes.
l1_global_load_transactions_hit_vsm0 Compute Cache Global load transactions that hit in the L1 cache by this SM. A transaction is 128 bytes. Increments by 0-1 per cycle per SM.
l1_global_load_transactions_miss Compute Cache Global load transactions that miss in the L1 cache. A transaction is 128 bytes.
l1_global_load_transactions_miss_vsm0 Compute Cache Global load transactions that miss in the L1 cache by this SM. A transaction is 128 bytes. Increments by 0-1 per cycle per SM.
l1_global_load_transactions_per_request Compute Cache Number of global load transactions in L1 per global/surface load instructions executed.
l1_global_load_uncached_transactions Compute Cache Uncached global load executed. A transaction is 128 bytes.
l1_global_load_uncached_transactions_vsm0 Compute Cache Uncached global load executed by this SM. A transaction is 128 bytes. Increments by 0-1 per cycle per SM.
l1_global_store_bytes Compute Cache Number of bytes written to L1 for global memory.
l1_global_store_transactions Compute Cache Global store transactions executed. A transaction is 128 bytes.
l1_global_store_transactions_per_request Compute Cache Number of global store transactions in L1 per global/surface store instructions executed.
l1_global_store_transactions_vsm0 Compute Cache Global store transactions executed by this SM. A transaction is 128 bytes. Increments by 0-1 per cycle per SM.
l1_global_uncached_load_bytes Compute Cache Number of bytes read from L2 for global uncached memory.
l1_hitrate Compute Cache Hit rate in percent in L1 for global load and local load and store operations.
l1_l2_bytes Graphics Cache Number of bytes transferred to the L2 unit by the L1 unit.
l1_l2_requests Graphics Cache Number of L2 requests from the L1 unit.
l1_local_load_bytes Compute Cache Number of bytes read from L1 for local memory.
l1_local_load_hitrate Compute Cache Hit rate in percent in L1 for local load operations.
l1_local_load_transactions Compute Cache Local load transactions. A transaction is 128 bytes.
l1_local_load_transactions_hit Compute Cache Local load transactions that hit in the L1 cache. A transaction is 128 bytes.
l1_local_load_transactions_hit_vsm0 Compute Cache Local load transactions that hit in the L1 cache by this SM. A transaction is 128 bytes. Increments by 0-1 per cycle per SM.
l1_local_load_transactions_miss Both Cache Local load transactions that miss in the L1 cache. A transaction is 128 bytes.
l1_local_load_transactions_miss_vsm0 Both Cache Local load transactions that miss in the L1 cache by this SM. A transaction is 128 bytes. Increments by 0-1 per cycle per SM.
l1_local_load_transactions_per_request Compute Cache Number of local load transactions in L1 per local load instructions executed.
l1_local_store_bytes Compute Cache Number of bytes written to L1 for local memory.
l1_local_store_hitrate Compute Cache Hit rate in percent in L1 for local store operations.
l1_local_store_transactions Compute Cache Local store transactions. A transaction is 128 bytes.
l1_local_store_transactions_hit Compute Cache Local store transactions that hit in the L1 cache. A transaction is 128 bytes.
l1_local_store_transactions_hit_vsm0 Compute Cache Local store transactions that hit in the L1 cache by this SM. A transaction is 128 bytes. Increments by 0-1 per cycle per SM.
l1_local_store_transactions_miss Compute Cache Local store transactions that miss in the L1 cache. A transaction is 128 bytes.
l1_local_store_transactions_miss_vsm0 Compute Cache Local store transactions that miss in the L1 cache by this SM. A transaction is 128 bytes. Increments by 0-1 per cycle per SM.
l1_local_store_transactions_per_request Compute Cache Number of local store transactions in L1 per local store instructions executed.
l1_reds_bytes Compute Cache Number of bytes written through L1 for RED instructions.
l1_reds_transactions Compute Cache RED transactions. A transaction is 128 bytes.
l1_reds_transactions_per_request Compute Cache Number of red transactions in L1 per red instructions executed.
l1_shared_bank_conflicts Compute Cache Number of bank conflicts for shared memory operations.
l1_shared_load_bytes Compute Cache Number of bytes read from L1 for shared memory.
l1_shared_load_transactions Compute Cache Shared load transactions. A transaction is 256 bytes.
l1_shared_load_transactions_per_request Compute Cache Number of shared load transactions in L1 per shared load instructions executed.
l1_shared_load_transactions_vsm0 Compute Cache Shared load transactions by this SM. A transaction is 256 bytes. Increments by 0-1 per cycle per SM.
l1_shared_store_bytes Compute Cache Number of bytes written to L1 for shared memory.
l1_shared_store_transactions Compute Cache Shared store transactions. A transaction is 256 bytes.
l1_shared_store_transactions_per_request Compute Cache Number of shared store transactions in L1 per shared store instructions executed.
l1_shared_store_transactions_vsm0 Compute Cache Shared store transactions by this SM. A transaction is 256 bytes. Increments by 0-1 per cycle per SM.
L2 Bottleneck Graphics GPU L2 Is Bottleneck
l2_read_bytes Compute Cache Number of bytes read from L2.
l2_read_bytes_atomic Compute Cache Number of bytes read by atomic from L2.
l2_read_bytes_ia Graphics GPU Number of bytes returned from L2 to the Input Assembler.
l2_read_bytes_l1 Compute Cache Number of bytes read by L1 from L2.
l2_read_bytes_rop Graphics Cache Number of bytes read from the L2 unit by the ROP unit.
l2_read_bytes_tex Compute Cache Number of bytes read by texture from L2.
l2_read_sectors Compute Cache Number of sectors read from L2. A sector is 32 bytes.
l2_read_sectors_atomic Compute Cache Number of sectors read by atomic from L2. A sector is 32 bytes.
l2_read_sectors_l1 Compute Cache Number of sectors read by L1 from L2. A sector is 32 bytes.
l2_read_sectors_tex Compute Cache Number of sectors read by texture from L2. A sector is 32 bytes.
l2_slice0_read_sectors_atomic_fb0 Compute Cache Sector reads for ATOM/RED to L2 cache that hit in the L2 cache in the given slice and FB partition. A sector is 32 bytes.
l2_slice0_read_sectors_fb0 Compute Cache Sector reads that hit in the L2 cache in the given slice and FB partition. A sector is 32 bytes.
l2_slice0_read_sectors_l1_fb0 Compute Cache Sector reads from L1 to L2 cache that hit in the L2 cache in the given slice and FB partition. A sector is 32 bytes.
l2_slice0_read_sectors_tex_fb0 Compute Cache Sector reads from TEX to L2 cache that hit in the L2 cache in the given slice and FB partition. A sector is 32 bytes.
l2_slice0_write_sectors_atomic_fb0 Compute Cache Sector writes for ATOM/RED to L2 cache in the given slice and FB partition. A sector is 32 bytes.
l2_slice0_write_sectors_fb0 Compute Cache Sector writes to the L2 cache in the given slice and FB partition. A sector is 32 bytes.
l2_slice0_write_sectors_l1_fb0 Compute Cache Sector writes from L1 to L2 cache in the given slice and FB partition. A sector is 32 bytes.
l2_slice0_write_sectors_tex_fb0 Compute Cache Sector writes from TEX to L2 cache in the given slice and FB partition. A sector is 32 bytes.
l2_write_bytes Compute Cache Number of bytes written to L2.
l2_write_bytes_atomic Compute Cache Number of bytes written by atomic to L2.
l2_write_bytes_l1 Compute Cache Number of bytes written by L1 to L2.
l2_write_bytes_rop Graphics Cache Number of bytes written to the L2 unit by the ROP unit.
l2_write_bytes_tex Compute Cache Number of bytes written by texture to L2.
l2_write_sectors Compute Cache Number of sectors written to L2. A sector is 32 bytes.
l2_write_sectors_atomic Compute Cache Number of sectors written by atomic to L2. A sector is 32 bytes.
l2_write_sectors_l1 Compute Cache Number of sectors written by L1 to L2. A sector is 32 bytes.
l2_write_sectors_tex Compute Cache Number of sectors written by texture to L2. A sector is 32 bytes.
OGL % driver waiting     OGL Percent of time in frame that driver is waiting.
OGL AGP/PCI-E usage (bytes)     OGL Current amount of AGP or PCI-E memory (non-local video memory) used in bytes.
OGL AGP/PCI-E usage (MB)     OGL Current amount of AGP or PCI-E memory (non-local video memory) used in MB.
OGL driver sleeping     OGL Last frame mSec sleeping in OGL driver.
OGL FPS     OGL Frames/Sec rendered since last sample.
OGL Frame Batch Count     OGL Number of draw batches issued during the last frame.
OGL Frame Primitive Count     OGL Number of primitives issued during the last frame.
OGL Frame Time     OGL Last frame to frame time measured by OGL in mSec.
OGL Frame Vertex Count     OGL Number of vertices issued during the last frame.
OGL vidmem bytes     OGL Current amount of video memory (local video memory) allocated in bytes. Drawables and render targets are not counted.
OGL vidmem MB     OGL Current amount of video memory (local video memory) allocated in MB. Drawables and render targets are not counted.
OGL vidmem total bytes     OGL total amount of video memory (local video memory) in bytes.
OGL vidmem total MB     OGL total amount of video memory (local video memory) in MB.
Primitive Setup Bottleneck Graphics GPU Primitive Setup is the Bottleneck
Primitive Setup SOL Graphics GPU Primitive Setup SOL
Rasterization Bottleneck Graphics GPU Rasterization is the Bottleneck
Rasterization SOL Graphics GPU Rasterization SOL
ROP Bottleneck Graphics GPU ROP Is Bottleneck
ROP SOL Graphics GPU ROP SOL
setup_primitive_count Graphics GPU Count of primitives seen by the setup unit.
shaded_pixel_count Graphics GPU Number of rasterized pixels sent to the shading units.
shader_busy Graphics GPU Cycles the shader unit is busy.
SHD Bottleneck Graphics GPU SHD Is Bottleneck
SHD SOL Graphics GPU SHD SOL
shd_l1_read_bytes Graphics Cache Number of bytes transferred from the L1 unit by the shader unit.
shd_l1_requests Graphics Cache Number of L1 requests from the shader unit.
shd_tex_read_bytes Graphics Cache Number of bytes read from the texture unit by the shader unit.
shd_tex_requests Graphics Cache Number of texel read requests from the shader unit.
sm_active_cycles Compute SM Sum of cycles that SM was active. Increments by 0-NumSMs per cycle.
sm_active_cycles_vsm0 Both SM Number of cycles that this SM has at least one active warp. Increments by 0-1 per cycle per SM.
sm_active_warps Compute SM Sum of warps that SM was active. Increments by 0-64 per cycle per SM.
sm_branches_diverged Compute SM Increments by one if at least one thread in a warp diverges (that is, follows a different execution path) via a data dependent conditional branch.
sm_branches_diverged_vsm0 Compute SM Divergent branches by this VSM. This counter increments by one if at least one thread in a warp diverges (that is, follows a different execution path) via a data dependent conditional branch. Increments by 0-1 per cycle.
sm_branches_executed Compute SM Counts the number of branch instructions executed.
sm_branches_executed_vsm0 Compute SM Branches taken by this VSM. This counter increments by one if at least one thread in a warp takes the branch. Increments by 0-1 per cycle.
sm_branches_taken Compute SM Increments by one if at least one thread in a warp takes the branch.
sm_branches_taken_vsm0 Compute SM Increments by one if at least one thread in a warp takes the branch. Increments by 0-4 per cycle.
sm_ctas_launched Compute SM Thread blocks launched. Increments by 1 per thread block launched.
sm_ctas_launched_vsm0 Compute SM Thread blocks launched. Increments by 1 per thread block launched.
sm_executed_ipc Compute SM The average instructions executed per active cycle per SM. Final value is between 0 and 7.
sm_inst_executed Compute SM Instructions executed, not including replays.
sm_inst_executed_atomics Compute Memory ATOM instructions executed, including ATOM.CAS.
sm_inst_executed_generic_loads Compute Memory Generic load instructions executed.
sm_inst_executed_generic_loads_vsm0 Compute Memory Generic load instructions executed by this SM. Increments by 0-1 per cycle per SM.
sm_inst_executed_generic_stores Compute Memory Generic store instructions executed.
sm_inst_executed_generic_stores_vsm0 Compute Memory Generic store instructions executed by this SM. Increments by 0-1 per cycle per SM.
sm_inst_executed_local_loads Compute Memory Local load instructions executed.
sm_inst_executed_local_loads_vsm0 Compute Memory Local load instructions executed by this SM. Increments by 0-1 per cycle per SM.
sm_inst_executed_local_stores Compute Memory Local store instructions executed.
sm_inst_executed_local_stores_vsm0 Compute Memory Local store instructions executed by this SM. Increments by 0-1 per cycle per SM.
sm_inst_executed_lsu_red_vsm0 Compute Memory reduction in SM Quad0 GPC0.TPC0.SM
sm_inst_executed_reductions Compute Memory RED instructions executed.
sm_inst_executed_shared_loads Compute Memory Shared load instructions executed.
sm_inst_executed_shared_loads_vsm0 Compute Memory Shared load instructions executed by this SM. Increments by 0-1 per cycle per SM.
sm_inst_executed_shared_stores Compute Memory Shared store instructions executed.
sm_inst_executed_shared_stores_vsm0 Compute Memory Shared store instructions executed by this SM. Increments by 0-1 per cycle per SM.
sm_inst_executed_surface_loads_byte Compute Memory Surface load instructions (byte mode) executed.
sm_inst_executed_surface_loads_byte_vsm0 Compute Memory Surface load (byte mode) instructions executed by this SM. Increments by 0-1 per cycle per SM.
sm_inst_executed_surface_loads_pixel Compute Memory Surface load instructions (pixel mode) executed.
sm_inst_executed_surface_loads_pixel_vsm0 Compute Memory Surface load (pixel mode) instructions executed by this SM. Increments by 0-1 per cycle per SM.
sm_inst_executed_surface_stores_byte Compute Memory Surface store instructions (byte mode) executed.
sm_inst_executed_surface_stores_byte_vsm0 Compute Memory Surface store (byte mode) instructions executed by this SM. Increments by 0-1 per cycle per SM.
sm_inst_executed_surface_stores_pixel Compute Memory Surface store instructions (pixel mode) executed.
sm_inst_executed_surface_stores_pixel_vsm0 Compute Memory Surface store (pixel mode) instructions executed by this SM. Increments by 0-1 per cycle per SM.
sm_inst_executed_texture Compute Cache Texture instructions executed.
sm_inst_executed_vsm0 Compute SM Instructions executed in this SM, not including replays. Increments by 0-8 per cycle per SM.
sm_inst_issued Compute SM Instructions issued by the scheduler, including replays.
sm_inst_issued_vsm0 Compute SM Number of active cycles that this warp scheduler issued an instruction.
sm_issued_ipc Compute SM The average instructions issued per active cycle per SM. Final value is between 0 and 7.
sm_pmevent_00 Compute SM __prof_trigger00/pmevent instructions executed.
sm_pmevent_00_vsm0 Compute SM __prof_trigger00/pmevent instructions executed where at least 1 thread is not predicated off. Increments by 0-1 per warp instruction executed.
sm_pmevent_01 Compute SM __prof_trigger01/pmevent instructions executed.
sm_pmevent_01_vsm0 Compute SM __prof_trigger01/pmevent instructions executed where at least 1 thread is not predicated off. Increments by 0-1 per warp instruction executed.
sm_pmevent_02 Compute SM __prof_trigger02/pmevent instructions executed.
sm_pmevent_02_vsm0 Compute SM __prof_trigger02/pmevent instructions executed where at least 1 thread is not predicated off. Increments by 0-1 per warp instruction executed.
sm_pmevent_03 Compute SM __prof_trigger03/pmevent instructions executed.
sm_pmevent_03_vsm0 Compute SM __prof_trigger03/pmevent instructions executed where at least 1 thread is not predicated off. Increments by 0-1 per warp instruction executed.
sm_pmevent_04 Compute SM __prof_trigger04/pmevent instructions executed.
sm_pmevent_04_vsm0 Compute SM __prof_trigger04/pmevent instructions executed where at least 1 thread is not predicated off. Increments by 0-1 per warp instruction executed.
sm_pmevent_05 Compute SM __prof_trigger05/pmevent instructions executed.
sm_pmevent_05_vsm0 Compute SM __prof_trigger05/pmevent instructions executed where at least 1 thread is not predicated off. Increments by 0-1 per warp instruction executed.
sm_pmevent_06 Compute SM __prof_trigger06/pmevent instructions executed.
sm_pmevent_06_vsm0 Compute SM __prof_trigger06/pmevent instructions executed where at least 1 thread is not predicated off. Increments by 0-1 per warp instruction executed.
sm_pmevent_07 Compute SM __prof_trigger07/pmevent instructions executed.
sm_pmevent_07_vsm0 Compute SM __prof_trigger07/pmevent instructions executed where at least 1 thread is not predicated off. Increments by 0-1 per warp instruction executed.
sm_thread_inst_executed Compute SM Thread instructions executed, not including replays.
sm_warps_launched Compute SM Warps launched. Increments by 1 per warp launched.
sm_warps_launched_vsm0 Compute SM Warps launched. Increments by 1 per warp launched.
Stream Out Bottleneck Graphics GPU Stream Out Is Bottleneck
Stream Out SOL Graphics GPU Stream Out SOL
stream_out_bytes Graphics GPU Number of bytes streamed out.
Tessellator SOL Graphics GPU Tessellator SOL
TEX Bottleneck Graphics GPU TEX Is Bottleneck
TEX SOL Graphics GPU TEX SOL
tex_bank_conflicts Both Cache Bank conflicts that occurred while accessing data from the texture units.
tex_cache_hitrate Both Cache Hit rate of texture cache queries.
tex_cache_read_bytes Both Cache Number of bytes read from all texture units.
tex_cache_sector_queries Both Cache Sector texture cache requests in all texture units. A sector is 32 bytes.
tex0_bank_conflicts_gpc0_tpc0 Both Cache Texture bank conflicts that occurred while accessing data from the given texture unit in the TPC.
tex0_cache_sector_misses_gpc0_tpc0 Both Cache Sector texture cache misses in the given texture unit in the TPC. A sector is 32 bytes.
tex0_cache_sector_queries_gpc0_tpc0 Both Cache Sector texture cache requests in the given texture unit in the TPC. A sector is 32 bytes.
tex0_cache_texel_queries Graphics Cache Number of texture cache queries (32b each request)
tex1_bank_conflicts_gpc0_tpc0 Both Cache Texture bank conflicts that occurred while accessing data from the given texture unit in the TPC.
tex1_cache_sector_misses_gpc0_tpc0 Both Cache Sector texture cache misses in the given texture unit in the TPC. A sector is 32 bytes.
tex1_cache_sector_queries_gpc0_tpc0 Both Cache Sector texture cache requests in the given texture unit in the TPC. A sector is 32 bytes.
texture_busy Graphics GPU Cycles the texture unit is busy.
threads_launched Compute SM Count the total number of threads launched for this TPC.
threads_launched_gpc0_tpc0 Compute SM Threads launched by this SM. Increments by 1 per thread launched.
ZCull Bottleneck Graphics GPU ZCull is the Bottleneck
ZCull SOL Graphics GPU ZCull SOL

 

NVIDIA® Tegra Graphics Debugger Documentation Rev. 2.5.170811 ©2014-2017. NVIDIA Corporation. All Rights Reserved.