NVIDIA CUDA Visual Profiler Version 3.0

Published by
NVIDIA Corporation
2701 San Tomas Expressway
Santa Clara, CA 95050


Notice

BY DOWNLOADING THIS FILE, USER AGREES TO THE FOLLOWING:

ALL NVIDIA SOFTWARE, DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS". NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. These materials supersede and replace all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks
NVIDIA, CUDA, and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright (C) 2007-2010 by NVIDIA Corporation. All rights reserved.

PLEASE REFER TO EULA.txt FOR THE LICENSE AGREEMENT FOR USING NVIDIA SOFTWARE.

List of supported features:

Execute a CUDA program with profiling enabled and view the profiler output as a table. The table has the following columns for each GPU method:

Please refer to the "Interpreting Profiler Counters" section below for more information on profiler counters. Note that profiler counters are also referred to as profiler signals.

Display the summary profiler table. It has the following columns for each GPU method:
Display various kinds of plots:
Analysis of the profiler output lists methods with a high number of:
Compare profiler output for multiple runs of the same program or for different programs.

Each program run is referred to as a session.

Save profiling data for multiple sessions. A group of sessions is referred to as a project.

Import/Export CUDA Profiler CSV format data.

Description of different plots:

Summary profiling data bar plot:
GPU time height plot:
It is a bar diagram in which the height of each bar is proportional to the GPU time for a method and a different bar color is assigned for each method. A legend is displayed which shows the color assignment for different methods. The width of each bar is fixed and the bars are displayed in the order in which the methods are executed. When the "fit in window" option is enabled the display is adjusted so as to fit all the bars in the displayed window width. In this case bars for multiple methods can overlap. The overlapped bars are displayed in decreasing order of height so that all the different bars are visible. When the "Show CPU Time" option is enabled the CPU time is shown as a bar in a different color on top of the GPU time bar. The height of this bar is proportional to the difference of CPU time and GPU time for the method.
GPU time width plot:
It is a bar diagram in which the width of each bar is proportional to the GPU time for a method and a different bar color is assigned for each method. A legend is displayed which shows the color assignment for different methods. The bars are displayed in the order in which the methods are executed. When time stamps are enabled the bars are positioned based on the time stamp. The height of each bar is based on the option chosen:
  1. Fixed height: the height is fixed.
  2. Height proportional to instruction issue rate: the instruction issue rate for a method is equal to the profiler "instructions" counter value divided by the GPU time for the method.
  3. Height proportional to uncoalesced load + store rate: the uncoalesced load + store rate for a method is equal to the sum of the profiler "gld uncoalesced" and "gst uncoalesced" counter values divided by the GPU time for the method. (A small sketch of these rate calculations is given after the split options below.)
  4. Occupancy: the height is proportional to the occupancy of the method.
In case of multiple streams or multiple devices the "Split Options" can be used:
  1. No Split: A single horizontal group of bars is displayed. Even in case of multiple streams or multiple devices the data is displayed in a single group.
  2. Split on Device: In case of multiple devices one separate horizontal group of bars is displayed for each device.
  3. Split on Stream: In case of multiple streams one separate horizontal group of bars is displayed for each stream.
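
A minimal sketch of the two rate calculations mentioned above, assuming the counter values come from the "instructions", "gld uncoalesced" and "gst uncoalesced" columns and the time from the "GPU Time" column of the profiler output (the helper and variable names below are illustrative, not part of cudaprof):

    /* Illustrative helpers only; not part of cudaprof. */
    float instruction_issue_rate(float instructions, float gpu_time)
    {
        return instructions / gpu_time;           /* "instructions" counter / GPU time */
    }

    float uncoalesced_load_store_rate(float gld_uncoalesced,
                                      float gst_uncoalesced,
                                      float gpu_time)
    {
        return (gld_uncoalesced + gst_uncoalesced) / gpu_time;
    }
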
Profiler counter bar plot:
It is a bar plot of the profiler counter values for a method from the profiler output table or the summary table. There is one bar for each profiler counter, the bars are sorted in decreasing counter value, and the bar length is proportional to the counter value.
Profiler output table column bar plot:
It is a bar plot for any column of values from the profiler output table or the summary table. There is one bar for each row in the table, the bars are sorted in decreasing column value, and the bar length is proportional to the column value.
Comparison summary plot:
This plot can be used to compare GPU Time summary data for two sessions. The Base Session is the session with respect to which the comparison is done and the other session selected for comparison is called the Compare Session. GPU Times for matching kernels from the two sessions are shown in a group. For each matched kernel from the Compare Session, the percentage increment or decrement with respect to the Base Session is displayed at the right end of the bar. After all the matched pairs, the GPU Times of the unmatched kernels are shown. At the bottom, two bars with the total GPU Times for the two sessions are shown.
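
As a rough sketch of the percentage shown at the end of each bar (the exact arithmetic is an assumption, not stated in this document):

    /* Assumed formula: positive = increment, negative = decrement,
       relative to the matching Base Session kernel time. */
    float gpu_time_change_percent(float base_gpu_time, float compare_gpu_time)
    {
        return 100.0f * (compare_gpu_time - base_gpu_time) / base_gpu_time;
    }
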
Device level summary plot:
There is one bar for each method. The bars are sorted in decreasing GPU time, and the bar length is proportional to the cumulative GPU time for the method across all contexts on a device.
Session level summary plot:
There is one bar for each device. The bar length is proportional to GPU utilization, which is the proportion of time during which the GPU was actually executing some method relative to the total time interval from GPU start to GPU end. The values are presented as percentages.
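
A small sketch of that utilization figure, assuming busy_time is the total time the GPU spent executing methods and start/end bound the profiled interval for the device (the names are illustrative):

    /* GPU utilization as a percentage of the profiled interval. */
    float gpu_utilization_percent(float busy_time, float start, float end)
    {
        return 100.0f * busy_time / (end - start);
    }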

Steps for sample cudaprof usage:


Sample1:



Sample2:

Brief description of some cudaprof GUI components:

Top line shows the main menu options: File, Profile, Session, Options, Window and Help. See the description below for details on the menu options.

The second line has 4 groups of toolbar icons.

The left vertical window lists all the sessions in the current project as a tree with three levels: sessions at the top level, devices under a session at the next level, and contexts under a device at the lowest level. The child of a session is named "Device_<device_number>", e.g. Device_0. The child of a device is named "Context_<context_number>", e.g. Context_0.

Summary session information is displayed when a session is selected in the tree view.

Summary device information is displayed when a device is selected in the tree view.

Right clicking on a session item or a context item in the tree view brings up the context-sensitive menus. See the description below for details on the menu options.

Session context menu.

Session->Device->Context context menu.
The right workspace area contains windows, including a tabbed window for each session, for each device in a session, and for each context of a device.
The different windows for each context are shown as different tabs:
Table Header context menu, for Profiler Output table and Summary table.
Output window - Appears at the bottom when asked to be displayed. It shows the standard output and standard error of the CUDA program that is run. Some additional status messages are also displayed in this window.

Main menu

Tool bars

Dialogs

Session list context menu:

Session->Device context menu:

Profiler table context menu:

Interpreting profiler counters

The performance counter values do not correspond to individual thread activity. Instead, these values represent events within a thread warp. For example, a divergent branch within a thread warp will increment the divergent_branch counter by one. So the final counter value stores information for all divergent branches in all warps. In addition, the profiler can only target one of the multiprocessors in the GPU, so the counter values will not correspond to the total number of warps launched for a particular kernel. For this reason, when using the performance counter options in the profiler the user should always launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work. In practice, for consistent results it is best to launch at least 2 times as many blocks as there are multiprocessors on the device on which you are profiling. For the reasons listed above, users should not expect the counter values to match the numbers one would get by inspecting kernel code. The values are best used to identify relative performance differences between un-optimized and optimized code. For example, if for the initial version of the program the profiler reports N non-coalesced global loads, it is easy to see if the optimized code produces fewer than N non-coalesced loads. In most cases, the goal is to make N go to 0, so the counter value is useful for tracking progress toward this goal.
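
For example, in the hypothetical kernel below every warp takes both sides of the branch, so the divergent_branch counter is incremented once per warp that runs on the profiled multiprocessor, not once per diverging thread:

    /* Illustrative kernel only: even and odd lanes of each warp diverge,
       so each warp contributes one divergent branch event. */
    __global__ void divergent_example(int *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)
            out[i] = 2 * i;
        else
            out[i] = 3 * i;
    }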

Note that the counter values for the same application can be different across different runs even on the same setup, since they depend on the number of thread blocks executed on each multiprocessor. For consistent results it is best for the number of blocks in each kernel launch to be equal to or a multiple of the total number of multiprocessors on a compute device. In other words, when profiling, the grid configuration should be chosen such that all the multiprocessors are uniformly loaded, i.e. the number of blocks launched on each multiprocessor is the same and the amount of work of interest per block is also the same. This results in better accuracy of extrapolated counts (such as memory and instruction throughput) and also provides more consistent results from run to run.
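
A short sketch of how a launch configuration could be chosen along these lines; my_kernel and the block size of 256 threads are placeholders, not something mandated by the profiler:

    #include <cuda_runtime.h>

    __global__ void my_kernel(float *data) { /* kernel being profiled */ }

    void launch_for_profiling(float *d_data)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);      /* device on which profiling is done */
        int blocksPerSM = 2;                    /* at least 2 blocks per multiprocessor */
        int numBlocks = blocksPerSM * prop.multiProcessorCount;
        my_kernel<<<numBlocks, 256>>>(d_data);  /* uniform load on every multiprocessor */
    }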

Profiler counters for GPUs with compute capability 1.x

In every application run only up to a maximum of four counter values can be collected. So if more than four counters are selected, Visual Profiler executes the application multiple times to collect all the counter values. Note that if the number of blocks in a kernel is less than, or not a multiple of, the number of multiprocessors, the counter values across multiple runs will not be consistent.

  • Profiler counters for a single multiprocessor


  • These counter values are a cumulative count for all thread blocks which were run on multiprocessor zero. Note that the multiprocessor SIMT (single-instruction multi-thread) unit creates, manages, schedules, and executes threads in groups of 32 threads called warps. These counters are incremented by one per warp.
  • Profiler counters for all multiprocessors in a Texture Processing Cluster (TPC)


  • These counter values are a cumulative count for all thread blocks which were run on multiprocessors within Texture Processing Cluster (TPC) zero. Note that there are two multiprocessors per TPC on compute devices with compute capability less than 1.3 and there are three multiprocessors per TPC on compute devices with compute capability greater than or equal to 1.3.

    When simultaneous global memory accesses by threads in a half-warp (during the execution of a single read or write instruction) can be combined into a single memory transaction of 32, 64, or 128 bytes it is called a coalesced access. If the global memory accesses by the threads of a half-warp do not fulfill the coalescing requirements it is called a non-coalesced access: a separate memory transaction is issued for each thread and throughput is significantly reduced. The coalescing requirements on devices with compute capability 1.2 and higher are different from those on devices with compute capability 1.0 or 1.1. Refer to the CUDA Programming Guide for details. The profiler counters related to global memory count the number of global memory accesses or memory transactions and they are not per warp. They provide counts for all global memory requests initiated by warps running on a TPC.
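
    As an illustration (hypothetical kernels, not taken from this document), on devices with compute capability 1.0 or 1.1 the first kernel below lets consecutive threads of a half-warp read consecutive, aligned 32-bit words, which can be combined into a single transaction, while the strided access in the second kernel cannot be combined and therefore shows up in the uncoalesced load counters:

        __global__ void coalesced_read(const float *in, float *out)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            out[i] = in[i];              /* thread k reads word k: coalesced */
        }

        __global__ void strided_read(const float *in, float *out, int stride)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            out[i] = in[i * stride];     /* stride > 1 breaks coalescing here */
        }
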
  • Normalized counter values


  • When the "Normalize counters" option is selected all counter values are normalized and per-block counts are shown. In the following cases the counter value is set to zero: If any counter value is set to zero a warning is displayed at the end of the application profiling.

    With "Normalize counters" option enabled more number of application runs are required to collect all counter values compared to when the "Normalized counters" option is disabled.

    Also when "Normalize counters" option is enabled the "cta launched" and "sm cta launched" columns are not shown in the profiler table.

Profiler counters for GPUs with compute capability 2.0

In every application run only a few counter values can be collected; the number depends on the specific counters selected. Visual Profiler executes the application multiple times to collect all the counter values. Note that if the number of blocks in a kernel is less than, or not a multiple of, the number of multiprocessors, the counter values across multiple runs will not be consistent.

All counter values are a cumulative count for all thread blocks which were run on multiprocessor zero. Note that the multiprocessor SIMT (single-instruction multi-thread) unit creates, manages, schedules, and executes threads in groups of 32 threads called warps. These counters are incremented by one per warp.

cudaprof project files saved to disk

cudaprof settings which are saved

Following is the list of cudaprof settings which are saved and remembered across different cudaprof sessions. On Windows these settings are saved in the system registry at the location "HKEY_CURRENT_USER\Software\NVIDIA\cudaprof".
On Linux these settings are saved to the file "$HOME/.config/NVIDIA Corporation/cudaprof.conf".

The CUDA Visual Profiler Help cache is saved in the folder: There is a separate sub-directory for each version.