NVIDIA PerfWorks Pro User Guide


API Concepts

The PerfWorks API allows you to instrument an application and collect metrics. Metrics consist of low-level GPU counters and high-level calculations that take those counters as input. Counters are collected per user-defined range or per individual workload.

PerfWorks currently supports range-based profiling on a variety of GPUs and APIs.

There are currently no realtime modes.

Serialized vs Pipelined Metrics

Serialized metrics capture the total cost of a range of work, in isolation. To capture serialized results, PerfWorks inserts commands that cause a GPU-wait at each range boundary. PerfWorks will never issue GPU-wait commands inside an isolated range.

Pipelined metrics capture the incremental cost of a range of work. The incremental cost of a range is measured from the moment all preceding work has completed until the end of that range.

Serialized metrics and pipelined metrics are collected on separate passes.
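The difference between the two definitions can be sketched as simple arithmetic on GPU timestamps. This is an illustrative model only, not PerfWorks code: GpuInterval, IsolatedCost and IncrementalCost are hypothetical names, and the model reuses one interval for both views even though a real serialized pass runs the range with GPU-waits at its boundaries.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type for illustration: when a range's work started and
// finished on the GPU, in arbitrary time units.
struct GpuInterval {
    std::uint64_t start;
    std::uint64_t end;
};

// Serialized view: total cost of the range, measured on its own.
std::uint64_t IsolatedCost(const GpuInterval& range) {
    assert(range.end >= range.start);
    return range.end - range.start;
}

// Pipelined view: incremental cost, measured from the moment all preceding
// work completed until the end of this range. Returns 0 if the range
// finished before the preceding work did.
std::uint64_t IncrementalCost(const GpuInterval& range,
                              std::uint64_t precedingWorkEnd) {
    assert(range.end >= range.start);
    return range.end > precedingWorkEnd ? range.end - precedingWorkEnd : 0;
}
```

For a range occupying the GPU from t=6 to t=14 while earlier work finishes at t=10, this model gives an isolated cost of 8 and an incremental cost of 4.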

Multi-Pass Collection

NVIDIA GPU hardware has a limited number of counter registers, and cannot collect all possible counters concurrently. There are also limitations on which counters can be collected together in a single pass.

PerfWorks resolves these problems by requiring you to replay the exact same set of GPU workloads multiple times, where each replay is termed a pass. On each pass, PerfWorks collects a different subset of the requested counters. Once all passes have completed, all counter values are made available to read back. This overall process is termed multi-pass collection.
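As a rough illustration of why several passes are needed, the sketch below splits a list of requested counters into groups that each fit a fixed number of counter registers. It is a simplified model under stated assumptions: SplitIntoPasses and registersPerPass are hypothetical, and the real scheduler must also respect the restrictions on which counters can be collected together, which this sketch ignores.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: assign counters to passes in request order, filling
// each pass up to registersPerPass before starting the next one.
std::vector<std::vector<std::string>> SplitIntoPasses(
    const std::vector<std::string>& counters, std::size_t registersPerPass) {
    assert(registersPerPass > 0);
    std::vector<std::vector<std::string>> passes;
    for (std::size_t i = 0; i < counters.size(); ++i) {
        if (i % registersPerPass == 0) {
            passes.emplace_back();  // current pass is full, start a new one
        }
        passes.back().push_back(counters[i]);
    }
    return passes;
}
```

With five counters and two registers per pass, the model yields three passes, the last of which collects a single counter.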

Certain metrics require a large number of counters as inputs; adding a single metric may require a large number of passes to collect. For example, the tex__sol_pct metric alone requires a large number of passes, since it takes inputs from every texture pipeline stage.

Serialized metrics require one pass per nesting level of ranges in the program. That is, if your application creates ranges nested 3 deep, each counter must be collected 3 times. If the configuration requires 7 passes for a single nesting level, then a total of 3 * 7 = 21 execution passes will be required.
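The pass count for serialized metrics is a plain product, which the following helper spells out. TotalSerializedPasses is a hypothetical name used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// passesPerLevel: passes the configuration needs for one nesting level.
// nestingDepth:   deepest level of range nesting the application creates.
std::size_t TotalSerializedPasses(std::size_t passesPerLevel,
                                  std::size_t nestingDepth) {
    return passesPerLevel * nestingDepth;
}
```

A configuration that needs 7 passes, replayed for ranges nested 3 deep, gives TotalSerializedPasses(7, 3) == 21.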

Activities

An NVPA_Activity presents the set of available metrics for a Device, and allows you to select the set of metrics you wish to collect.

Each kind of activity provides a certain set of performance characteristics and behavioral guarantees.

Profiler

Provides access to all GPU metrics.

Range-Based Profiling

Each profiling session runs a series of replay passes, where each pass contains a sequence of ranges. Every metric enabled in the session's configuration is collected separately per unique range-stack in the pass.

NVPA_Config* pConfig = ...;
void* ctx = ...; // an ApiContext
NVPA_Context_BeginSession(ctx, pConfig);
do {
    NVPA_Context_BeginPass(ctx);
    // repeat ...
        gpu_commands_not_measured();
        NVPA_Object_PushRange(ctx, MyRangeId);
            gpu_commands_measured();
        NVPA_Object_PopRange(ctx);
    NVPA_Context_EndPass(ctx);
} while (!PredictDataReady(ctx));
NVPA_Context_Finish(ctx); // wait for GPU commands to complete
ReadResults(ctx);
NVPA_Context_EndSession(ctx);

At the start of each pass, the ApiContext starts with an empty range-stack. All remaining ranges on an ApiContext are popped during NVPA_Context_EndPass, as if NVPA_Object_PopRange were called on each one.
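The range-stack rules above can be modelled with a small stand-in class. This is a sketch for reasoning about the behavior, not a PerfWorks type: RangeStackModel and its methods are hypothetical and only mirror the push, pop and end-of-pass semantics described in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the range-stack of one ApiContext.
class RangeStackModel {
public:
    // Every pass starts with an empty range-stack.
    void BeginPass() { stack_.clear(); }

    void PushRange(std::uint64_t id) { stack_.push_back(id); }

    void PopRange() {
        assert(!stack_.empty());
        stack_.pop_back();
    }

    // Ending the pass pops every range that is still open, as if PopRange
    // had been called on each one. Returns the number of implicit pops.
    std::size_t EndPass() {
        const std::size_t stillOpen = stack_.size();
        while (!stack_.empty()) {
            PopRange();
        }
        return stillOpen;
    }

    std::size_t Depth() const { return stack_.size(); }

private:
    std::vector<std::uint64_t> stack_;
};
```

Pushing two ranges and ending the pass without popping them leaves the model with an empty stack, and EndPass reports two implicit pops.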

Detailed Pseudo-code

In this pseudo-code, each range contains more than one draw call, to emphasize that with application-defined ranges, measurements occur at range boundaries (not per draw call).

NVPA_Config* pConfig = ...;
void* ctx = ...; // an ApiContext
NVPA_Context_BeginSession(ctx, pConfig);
do {
    NVPA_Context_BeginPass(ctx);
    Draw_AA_0(); // not measured
    Draw_AA_1(); // not measured
    NVPA_Object_PushRange(ctx, 100);
        Draw_BB_0(); // measured
        Draw_BB_1(); // measured
    NVPA_Object_PopRange(ctx);
    Draw_CC_0(); // not measured
    Draw_CC_1(); // not measured
    NVPA_Object_PushRange(ctx, 200);
        Draw_DD_0(); // measured
        Draw_DD_1(); // measured
        NVPA_Object_PushRange(ctx, 10);
            Draw_EE_0(); // measured
            Draw_EE_1(); // measured
        NVPA_Object_PopRange(ctx);
        Draw_FF_0(); // measured
        Draw_FF_1(); // measured
    NVPA_Object_PopRange(ctx);
    Draw_GG_0(); // not measured
    Draw_GG_1(); // not measured
    NVPA_Context_EndPass(ctx);
} while (!PredictDataReady(ctx));
NVPA_Context_Finish(ctx); // wait for GPU commands to complete
ReadResults(ctx);
NVPA_Context_EndSession(ctx);

Notice that the serialized counters in {200} include contributions from Draw_EE_*, whereas the pipelined counters in {200} do not. This follows from the definitions: serialized metrics measure total cost in isolation, whereas pipelined metrics measure incremental cost over preceding work.

Serialized vs Pipelined Passes

The following diagrams illustrate how PerfWorks collects metrics differently in pipelined and serialized passes. Each diagram element is also annotated in the source code, so you can search for P1, RA, D3, etc. By way of example, the diagram elements are:

In both pass types:

During pipelined passes (diagram below):

During serialized passes (diagram below):

EGL

void SimpleExample1_EGL()
{
    // EGLContext is already an opaque handle type, so no extra pointer is needed.
    EGLContext eglContext = eglGetCurrentContext();
    NVPA_Context_BeginPass(eglContext); // P1
    NVPA_EGL_PushRange('A'); // RA
    glDrawElements(...); // D1
    glDrawElements(...); // D2
    NVPA_EGL_PopRange(); // /RA
    NVPA_EGL_PushRange('B'); // RB
    glDrawElements(...); // D3
    NVPA_EGL_PopRange(); // /RB
    NVPA_Context_EndPass(eglContext); // /P1
}

Recipes

These code recipes show the prescribed set of calls to accomplish a goal, without intervening error checking. Since each NVPA_ function may fail in real-world usage, working code will be more complex than it appears in the recipes.
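A common way to add the omitted error checking is to wrap every call in a checking helper. The sketch below shows the pattern with stand-ins only: DemoStatus, ThrowIfFailed, DEMO_CHECK and DemoBeginPass are all hypothetical, and it assumes failures are reported through a return value. Substitute the status type and success value that the NVPA_ functions in your headers actually use.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical status type standing in for whatever NVPA_ functions return.
enum class DemoStatus { Success, Error };

// Turn a failed status into an exception that names the failing call.
inline void ThrowIfFailed(DemoStatus status, const char* what) {
    if (status != DemoStatus::Success) {
        throw std::runtime_error(std::string(what) + " failed");
    }
}

// Wrap a call so that its source text appears in the error message.
#define DEMO_CHECK(call) ThrowIfFailed((call), #call)

// Placeholder standing in for an NVPA_ call; fails when ok is false.
inline DemoStatus DemoBeginPass(bool ok) {
    return ok ? DemoStatus::Success : DemoStatus::Error;
}
```

Each recipe line would then be written as DEMO_CHECK(SomeCall(...)), so the first failure stops the sequence instead of being silently ignored.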

Linking against PerfWorks

The PerfWorks API has two layers: a "user" library and a "DLL" library.

Your application must link against the "user" library, whose source code resides in nvperfapi_user_impl.h. The "user" library contains all global API function definitions.

We recommend:

Do #include nvperfapi_user_impl.h in a single, dedicated .c or .cpp file. This minimizes the chance of name collision errors during compilation.

Required:

nvperfapi_user_impl.h must be included in exactly one compilation unit (aka translation unit) per linkage unit.

Why:

The "user" library layer provides the application with a consistent interface that:

Configuring Metrics

// Given parallel arrays of metricNames and whether they are serialized,
// create a Config with those metrics enabled.
NVPA_Config* CreateConfig(size_t numMetrics, const char** metricNames, bool* serialized)
{
    void* ctx = ...;

    // Look up the device behind this ApiContext (index is invalid until filled in).
    size_t deviceIndex = ~(size_t)0;
    NVPA_Context_GetDeviceIndex(ctx, 0, &deviceIndex);

    // Create a Profiler activity for that device.
    NVPA_ActivityOptions* pActivityOptions = nullptr;
    NVPA_ActivityOptions_Create(&pActivityOptions);
    NVPA_ActivityOptions_SetActivityKind(pActivityOptions, NVPA_ACTIVITY_KIND_PROFILER);
    NVPA_Activity* pActivity = nullptr;
    NVPA_Activity_CreateForDevice(deviceIndex, pActivityOptions, &pActivity);
    NVPA_ActivityOptions_Destroy(pActivityOptions);

    NVPA_MetricOptions* pMetricOptions = nullptr;
    NVPA_MetricOptions_Create(&pMetricOptions);
    for (size_t ii = 0; ii < numMetrics; ++ii)
    {
        // Select serialized or pipelined collection for this metric.
        NVPA_MetricOptions_SetSerialized(pMetricOptions, serialized[ii]);
        // Find the global ID for the metric.
        NVPA_MetricId metricId = 0;
        NVPA_Activity_FindMetricByName(pActivity, metricNames[ii], &metricId);
        // Enable the metric for collection.
        NVPA_MetricEnableError metricEnableError;
        NVPA_Activity_EnableMetric(pActivity, metricId, pMetricOptions, &metricEnableError);
    }

    // Create the Config from the Activity; the new Config is returned through &pConfig.
    NVPA_Config* pConfig = nullptr;
    NVPA_Config_Create(pActivity, &pConfig);
    // we no longer need an Activity, once we have a Config
    NVPA_Activity_Destroy(pActivity);
    return pConfig;
}

Working with Graphics APIs

PerfWorks acts as a set of extensions over every supported Graphics API. Functions that operate on ApiObjects obey the same rules and threading model as native API calls.

Metrics

Metrics are high-level values derived from counter values.

Metric Naming Conventions

PerfWorks metrics follow the naming convention

<unit>__<name>[_<rollup>][_pct]

where

Rollups

Rollup    Description
avg       The average counter value across all unit instances
sum       The sum of counter values across all unit instances
min       The minimum counter value across all unit instances
max       The maximum counter value across all unit instances
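The naming convention and the rollups can be illustrated with a small parser and a rollup calculator. This is a sketch, not part of the PerfWorks API: ParsedMetricName, ParseMetricName and Rollup are hypothetical helpers, and the parser assumes that a trailing _avg, _sum, _min or _max is always a rollup rather than part of the metric's name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical result type for <unit>__<name>[_<rollup>][_pct].
struct ParsedMetricName {
    std::string unit;
    std::string name;
    std::string rollup;  // empty when the name carries no rollup
    bool isPct = false;
};

// Remove suffix from s if present (and s is longer than the suffix).
static bool StripSuffix(std::string& s, const std::string& suffix) {
    if (s.size() > suffix.size() &&
        s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0) {
        s.erase(s.size() - suffix.size());
        return true;
    }
    return false;
}

// Returns false when the "__" separator or either side of it is missing.
bool ParseMetricName(const std::string& full, ParsedMetricName& out) {
    const std::size_t sep = full.find("__");
    if (sep == std::string::npos || sep == 0) return false;
    out = ParsedMetricName{};
    out.unit = full.substr(0, sep);
    std::string rest = full.substr(sep + 2);
    out.isPct = StripSuffix(rest, "_pct");
    static const char* const kRollups[] = {"avg", "sum", "min", "max"};
    for (const char* r : kRollups) {
        if (StripSuffix(rest, std::string("_") + r)) {
            out.rollup = r;
            break;
        }
    }
    out.name = rest;
    return !out.name.empty();
}

// Combine one counter value per unit instance according to the rollup.
double Rollup(const std::string& rollup, const std::vector<double>& perInstance) {
    assert(!perInstance.empty());
    const double sum = std::accumulate(perInstance.begin(), perInstance.end(), 0.0);
    if (rollup == "sum") return sum;
    if (rollup == "avg") return sum / static_cast<double>(perInstance.size());
    if (rollup == "min") return *std::min_element(perInstance.begin(), perInstance.end());
    if (rollup == "max") return *std::max_element(perInstance.begin(), perInstance.end());
    assert(false && "unknown rollup");
    return 0.0;
}
```

Parsing tex__sol_pct yields unit tex, name sol, no rollup, and the pct flag set. For per-instance values 2, 4 and 6, the avg, sum, min and max rollups are 4, 12, 2 and 6.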

Cycle Metrics

Metrics using the term cycles in the name report the number of cycles in the unit's clock domain. The different types of metrics are:

Time Metrics

Neither gpu__time_duration nor gpu__time_active is simply:

gpu__time_end - gpu__time_start

A range id may cover several ranges within a pass. gpu__time_duration and gpu__time_active are therefore each the sum of the durations of all ranges that share the same id.
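A short worked example shows how the summed duration differs from the end-minus-start span when one range id covers two separate ranges. RangeInstance, SummedDuration and NaiveSpan are hypothetical names used only for this illustration, and the timestamps are arbitrary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record of one range that was pushed with a given id.
struct RangeInstance {
    std::uint64_t start;
    std::uint64_t end;
};

// Sum of the individual durations of all ranges sharing one id.
std::uint64_t SummedDuration(const std::vector<RangeInstance>& instances) {
    std::uint64_t total = 0;
    for (const RangeInstance& r : instances) {
        assert(r.end >= r.start);
        total += r.end - r.start;
    }
    return total;
}

// Latest end minus earliest start, which also counts the gaps in between.
std::uint64_t NaiveSpan(const std::vector<RangeInstance>& instances) {
    assert(!instances.empty());
    std::uint64_t first = instances.front().start;
    std::uint64_t last = instances.front().end;
    for (const RangeInstance& r : instances) {
        first = std::min(first, r.start);
        last = std::max(last, r.end);
    }
    return last - first;
}
```

For two ranges with the same id covering [10, 20] and [50, 65], the summed duration is 25 while the naive span is 55.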

NOTE: Neither gpu__time_duration nor gpu__time_active currently accounts for time during which the range is out of context because of context switches. The reported duration can therefore be higher than expected.

This time has no correlation to gpu__time_duration due to the asynchronous nature of the GPU.

This is not the duration between when the first command is executed on the CPU to the time the last command is finished executing on the GPU.

This diagram demonstrates two ranges collected in pipelined mode. Times for range 2 are shown.

Since range 2 begins while the first part of range 1 is on the GPU, the gpu__time_duration is the incremental cost of range 2. gpu__time_active is the full time for range 2.

This diagram demonstrates the same two ranges collected in serialized mode. Times for range 1 are shown.

For range 1, notice how gpu__time_duration ≠ gpu__time_end - gpu__time_start due to the split range.

Units

Unit Tree

Definitions

Notices

Notice

ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks

NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright

NVIDIA® PerfWorks SDK Documentation ©2015-2017. NVIDIA Corporation. All Rights Reserved.

NVIDIA® PerfWorks Documentation Rev. 0.46.170612 ©2017. NVIDIA Corporation. All Rights Reserved.