OpenCL Profiler Published by NVIDIA Corporation 2701 San Tomas Expressway Santa Clara, CA 95050 Notice ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS". NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation. Trademarks NVIDIA, and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated. Copyright (C) 2005-2009 by NVIDIA Corporation. All rights reserved. The OpenCL Profiler ================= This release of CUDA comes with an OpenCL profiler that allows users to gather timing information about kernel execution and memory transfer operations for OpenCL applications. The profiler can be used to identify performance bottlenecks in multi-kernel applications or to quantify the benefit of optimizing a single kernel. Profiler Control ---------------- The OpenCL profiler is controlled using the following environment variables: OPENCL_PROFILE is set to either 1 or 0 (or unset) to enable or disable profiling. OPENCL_PROFILE_LOG is set to the desired file path for profiling output. If there is no log path specified, the profiler will log data to ./opencl_profile_%d.log. In case of multiple devices you can add '%d' in the OPENCL_PROFILE_LOG name. This will generate separate profiler output files for each device - with '%d' substituted by the device number. OPENCL_PROFILE_CSV is set to either 1 or 0 (or unset) to enable or disable a comma separated version of the log output. OPENCL_PROFILE_CONFIG is used to specify a config file for enabling performance counters in the GPU. See the next section for configuration details. Profiler Configuration ---------------------- This version of the OpenCL profiler supports configuration options that allow users to gather statistics about various events occurring in the GPU during execution. These events are tracked with hardware counters on signals in the chip. The profiler supports the following options: timestamp : Time stamps for kernel launches and memory transfers. This can be used for timeline analysis. gpustarttimestamp : Time stamp when kernel starts execution in GPU. gpuendtimestamp : Time stamp when kernel ends execution in GPU. ndgrangesize : Number of workgroups in a NDRange along the X and Y dimensions for a kernel launch workgroupsize : Number of workitems in a workgroup along the X, Y and Z dimensions for a kernel launch stapmemperblock : Size of statically allocated private memory per workgroup in bytes for a kernel launch regperworkitem : Number of registers used per workitem for a kernel launch. memtransferdir : Memory transfer direction, a direction value of 0 is used for host->device memory copies and a value of 1 is used for device->host memory copies. memtransfersize : Memory copy size in bytes memtransferhostmemtype : Host memory type (pageable or page-locked) streamid : Stream Id for a kernel launch localworkgroupsize : If workgroupsize has been specified by the user, this option would be 1, otherwise it would be 0. The profiler supports logging of following counters during kernel execution on all architectures: local_load : Number of executed local load instructions per warp in a SM local store : Number of executed local store instructions per warp in a SM gld_request : Number of executed global load instructions per warp in a SM gst_request : Number of executed global store instructions per warp in a SM divergent_branch : Number of unique branches that diverge branch : Number of unique branch instructions in program sm_cta_launched : Number of threads blocks executed on a SM The profiler supports logging of following counters during kernel execution only on GPUs with Compute Capability 1.x: gld_incoherent : Non-coalesced (incoherent) global memory loads gld_coherent : Coalesced (coherent) global memory loads gld_32b : 32-byte global memory load transactions gld_64b : 64-byte global memory load transactions gld_128b : 128-byte global memory load transactions gst_incoherent : Non-coalesced (incoherent) global memory stores gst_coherent : Coalesced (coherent) global memory stores gst_32b : 32-byte global memory store transactions gst_64b : 64-byte global memory store transactions gst_128b : 128-byte global memory store transactions instructions : Instructions executed warp_serialize : Number of thread warps that serialize on address conflicts to either shared or constant memory cta_launched : Number of threads blocks executed prof_trigger_00... prof_trigger_07 : Profiler triggers that can be inserted by user inside the kernel code for instrumenting the kernel. A total of 8 triggers from prof_trigger_00 to prof_trigger_07 can be inserted in the code. The way to use prof triggers is to insert '__prof_trigger(int N)' in kernel code (where N is the counter number). tex_cache_hit : Number of texture cache hits tex_cache_miss : Number of texture cache miss Only 4 profiler counters can be monitored/profiled in a single run on GPUs with Compute Capability 1.x. The profiler supports logging of following counters during kernel execution only on GPUs with Compute Capability 2.0 or higher: shared_load : Number of executed shared load instructions per warp in a SM shared_store : Number of executed shared store instructions per warp in a SM inst_issued : Number of instructions issued including replays inst_executed : Number of instructions executed, do not include replays warps_launched : Number of warps launched in a SM threads_launched : Number of threads launched in a SM l1_global_load_hit: Number of global load hits in L1 cache l1_global_load_miss: Number of global load misses in L1 cache The number of counters that can be profiled in a single run depends on the specific counters selected on GPUs with Compute Capability 2.0 or higher. Options can be commented out using the '#' character at the start of a line. Profiler Output --------------- While OPENCL_PROFILE is set, the profiler log records timing information for every kernel launch and memory operation performed by the driver. The default log syntax follows a simple form: id=[ value ] For example, here is part of the log from a test of the scan application (without any counters enabled): # OPENCL_PROFILE_LOG_VERSION 1.0 # OPENCL_DEVICE 0 GeForce GTX 280 # TIMESTAMPFACTOR 114aa119a0c9d7d2 timestamp,gpustarttimestamp,gpuendtimestamp,method,gputime,cputime,occupancy timestamp=[ 7791621.500 ] gpustarttimestamp=[ 114ab721d9c649e0 ] gpuendtimestamp=[ 114ab721da1a0be0 ] method=[ workgroupScanInclusive ] gputime=[ 5489.152 ] cputime=[ 5842.782 ] occupancy=[ 1.000 ] timestamp=[ 7802433.500 ] gpustarttimestamp=[ 114ab721da6aaaa0 ] gpuendtimestamp=[ 114ab721da6b5500 ] method=[ workgroupScanExclusive ] gputime=[ 43.616 ] cputime=[ 387.270 ] occupancy=[ 1.000 ] timestamp=[ 7804496.500 ] gpustarttimestamp=[ 114ab721da894480 ] gpuendtimestamp=[ 114ab721dacecc00 ] method=[ uniformUpdate ] gputime=[ 4556.672 ] cputime=[ 4915.150 ] occupancy=[ 1.000 ] This log shows data for memory copies and a few different kernel launches. The 'method' label specifies which GPU function was executed by the driver. The 'gputime' and 'cputime' labels specify the actual chip execution time and the driver execution time (including gputime), respectively. The gpustarttimestamp and gpuendtimestamp indicate the start and end timestamps of the kernel being executed on the GPU. Note that all times are in microseconds. The 'occupancy' label gives the warp occupancy - percentage of the maximum warp count in the GPU - for a particular method launch. An occupancy of 1.000 means the chip is completely full. Another example shows the profiler log of matrix multiplication app. There are few counters enabled in this example. This example includes the new options for memcopy method: # OPENCL_PROFILE_LOG_VERSION 1.0 # OpenCL_DEVICE_NAME 0 GeForce GTX 280 # TIMESTAMPFACTOR 12bae765a4r9c521 timestamp,method,gputime,cputime,ndrangesizeX,ndrangesizeY,workgroupsizeX,workgroupsizeY,workgroupsizeZ,occupancy,instructions,branch, cta_launched,memtransfersize,memtransferdir timestamp=[ 7205451.000 ] method=[ matrixMul ] gputime=[ 92695.133 ] cputime=[ 93108.766 ] NDRangesize=[ 50, 100 ] workgroupsize=[ 16, 16, 1 ] occupancy=[ 1.000 ] instructions=[ 18204777 ] branch=[ 1479119 ] cta_launched=[ 500 ] timestamp=[ 7423482.500 ] method=[ memcopy ] gputime=[ 8.896 ] cputime=[ 230.686 ] memtransfersize=[ 8192 ] memtransferdir=[ 1 ] This log shows some of the new fields added: gridsize shows the number of workgroups in x and y direction, and workgroupsize shows the number of work-items in a work-group in x, y and z directions. The profiler will now show the cputime, transfer size,and direction of pageable and pinned memcopys. The default log syntax is easy to parse with a script, but for spreadsheet analysis it might be easier to use the comma separated version. When OPENCL_PROFILE_CSV is set to 1, this same test produces the following output: # OPENCL_PROFILE_LOG_VERSION 1.0 # OPENCL_PROFILE_CSV 1 # OpenCL_DEVICE_NAME 0 GeForce GTX 280 # TIMESTAMPFACTOR 1e4231f54a45c645 timestamp,method,gputime,cputime,ndrangesizeX,ndrangesizeY,workgroupsizeX,workgroupsizeY,workgroupsizeZ,occupancy,instructions,branch, cta_launched,memtransfersize,memtransferdir 7535422.000,matrixMul,91935.766,93031.500,50,100,16, 16, 1,1.000,18204777,1479119,500 7754673.000,memcopy,8.536,241.342,,,,,,,,,,8192,1 Interpreting Profiler Counters ------------------------------ The performance counter values do not correspond to individual work-item activity. Instead, these values represent events within a work-item warp. For example, a divergent branch within a work-item warp will increment the divergent_branch counter by one. So the final counter value stores information for all divergent branches in all warps. In addition, the profiler can only target one of the multiprocessors in the GPU, so the counter values will not correspond to the total number of warps launched for a particular kernel. For this reason, when using the performance counter options in the profiler the user should always launch enough work-items work-groups to ensure that the target multiprocessor is given a consistent percentage of the total work. In practice, it is best to launch at least around 100 workgroups for consistent results. For the reasons listed above, users should not expect the counter values to match the numbers one would get by inspecting kernel code. The values are best used to identify relative performance differences between unoptimized and optimized code. For example, if for the initial version of the program the profiler reports N non-coalesced global loads, it is easy to see if the optimized code produces less than N non-coalesced loads. In most cases, the goal is to make N go to 0, so the counter value is useful for tracking progress toward this goal. Known Issues ------------ 1. Due to improved memory coalescing hardware, the gld_incoherent and gst_incoherent signals will always be zero on GTX 280 and GTX 260.