One of the following development boards, with 64-bit image:
PerfWorks Pro supports NVIDIA GeForce, Quadro, and Tesla GPUs based upon the NVIDIA Kepler, Maxwell, and Pascal architectures.
PerfWorks Pro supports Microsoft hybrid systems on NVIDIA r378.49 or higher drivers (Windows 8.1 and Windows 10).
PerfWorks Pro does not support SLI or Optimus systems.
0.46.8 Fixed the guide on how to build the samples for Vibrante Linux in the release notes.
0.45.0 Added support for driver 381.65. (54295)
0.44.0 The following new metrics were added:
New Metrics | Kepler | Maxwell | Pascal |
---|---|---|---|
cpu__time_duration | all | all | all |
gpu__clip_primitives_in | all | all | all |
gpu__clip_primitives_out | all | all | all |
gpu__earlyz_samples_failed_depth | gk208, gk20a | all | all |
gpu__earlyz_samples_failed_stencil | gk208, gk20a | all | all |
gpu__earlyz_samples_passed | all | all | all |
gpu__latez_samples_failed_depth | gk208, gk20a | all | all |
gpu__latez_samples_failed_stencil | gk208, gk20a | all | all |
gpu__latez_samples_passed | all | all | all |
pa__pa2raster_stalled_pct | all | | |
sm__pipe_alu_utilization_pct | all | all | all |
sm__pipe_interp_utilization_pct | all | all | all |
sm__pixout_stall_pct | all | all | all |
system__time_duration | all | all | all |
tex__stalled_pct | all | all | all |
zcull__fragments_accepted | all | all | all |
zcull__fragments_accepted_pct | all | all | all |
zcull__fragments_rejected | all | all | all |
zcull__fragments_rejected_pct | all | all | all |
zcull__fragments_tested | all | all | all |
zcull__fragments_trivially_accepted | all | all | all |
zcull__fragments_trivially_accepted_pct | all | all | all |
zcull__tiles_accepted | all | all | all |
zcull__tiles_accepted_pct | all | all | all |
zcull__tiles_rejected | all | all | all |
zcull__tiles_rejected_pct | all | all | all |
zcull__tiles_tested | all | all | all |
zcull__tiles_trivially_accepted | all | all | all |
zcull__tiles_trivially_accepted_pct | all | all | all |
0.41.0 Improved the metrics crop__sol_pct and zrop__sol_pct. (53731,53732)
NVPA_D3D12_Queue_HandleProfilerEvents signature changed from:
NVPA_Status NVPA_D3D12_Queue_HandleProfilerEvents(struct ID3D12CommandQueue* pCommandQueue, uint64_t timeout, size_t numEvents, NVPA_Bool endOnPassBoundary, NVPA_GpuEventHandlingResult* pResult);
to:
NVPA_Status NVPA_D3D12_Queue_HandleProfilerEvents(struct ID3D12CommandQueue* pCommandQueue, uint64_t timeout, size_t numPasses, NVPA_GpuEventHandlingResult* pResult);
(53253)
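A hedged usage sketch of the new signature follows; the wrapper function, the header name nvperfapi.h, and the timeout value are assumptions for illustration, not taken from the SDK:

    #include <stddef.h>
    #include <stdint.h>
    #include <d3d12.h>        /* ID3D12CommandQueue */
    #include "nvperfapi.h"    /* assumed header declaring the NVPA_* types */

    /* Wrapper name and timeout value are illustrative assumptions. */
    static NVPA_Status HandleQueueEvents(struct ID3D12CommandQueue* pCommandQueue,
                                         size_t numPasses)
    {
        NVPA_GpuEventHandlingResult result;
        uint64_t timeout = 1000;  /* units not stated here; see the SDK docs */
        /* numPasses replaces the old numEvents parameter, and the
           endOnPassBoundary flag has been removed. */
        return NVPA_D3D12_Queue_HandleProfilerEvents(pCommandQueue, timeout,
                                                     numPasses, &result);
    }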
Added NVPA_D3D12_Queue_HandleProfilerEvents. The user is responsible for calling this function to handle events generated from passes and ranges. (48718)
Added the following CUDA API functions:
NVPA_CUDA_PredictStackDataReady
NVPA_CUDA_GetStackData
NVPA_CUDA_PushRange
NVPA_CUDA_PopRange
NVPA_CUDA_GetNumRangeIds
NVPA_CUDA_GetRangeIds
NVPA_CUDA_BeginSession
NVPA_CUDA_EndSession
NVPA_CUDA_BeginPass
NVPA_CUDA_EndPass
NVPA_CUDA_Register
NVPA_CUDA_Unregister
NVPA_CUDA_GetConfig
NVPA_CUDA_GetSliDeviceCount
NVPA_CUDA_GetDeviceIndex
NVPA_CUDA_Finish
(53047)
gr__idle_pct. (42698)
Renamed smsp__warps_launched_{sum, avg, min, max} to smsp__warps_launched_cs_{sum, avg, min, max} and restricted them to compute only. (44854)
New metrics: sm__sol_max_pct, sm__sol_min_pct.
Added the crop__lts_utilization_pct, zrop__lts_utilization_pct, and lts__request_total_utilization_pct metrics. (52303,52304)
lts__sol_pct has been renamed to ltc__sol_pct. Improved accuracy of the LTC SOL metric. (50535)
Below, "[API]" stands in for D3D11, D3D12_Queue, D3D12_CommandList, OpenGL, EGL, CUDA, etc. Note that some of the APIs take no context parameter, as they use the current context instead. The context-less APIs are OpenGL, EGL, and CUDA. (49061)
Specific changes are as follows:
Removed API                              New API
--------------------------------------------------------------------------------------------
NVPA_LoadDriver*()                       NVPA_[API]_LoadDriver()
NVPA_Register*()                         NVPA_[API]_Register()
NVPA_UnregisterContext()                 NVPA_[API]_Unregister()
NVPA_Context_GetConfig()                 NVPA_[API]_GetConfig()
NVPA_Context_GetSliDeviceCount()         NVPA_[API]_GetSliDeviceCount()
NVPA_Context_GetDeviceIndex()            NVPA_[API]_GetDeviceIndex()
NVPA_Context_Finish()                    NVPA_[API]_Finish()
NVPA_Context_PredictStackDataReady()     NVPA_[API]_PredictStackDataReady()
NVPA_Context_GetStackData()              NVPA_[API]_GetStackData()
NVPA_Object_PushRange()                  NVPA_[API]_PushRange()
NVPA_Object_PopRange()                   NVPA_[API]_PopRange()
NVPA_Object_GetNumRangeIds()             NVPA_[API]_GetNumRangeIds()
NVPA_Object_GetRangeIds()                NVPA_[API]_GetRangeIds()
NVPA_Context_BeginSession()              NVPA_[API]_BeginSession()
NVPA_Context_EndSession()                NVPA_[API]_EndSession()
NVPA_Context_BeginPass()                 NVPA_[API]_BeginPass()
NVPA_Context_EndPass()                   NVPA_[API]_EndPass()
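The real prototypes live in the nvperfapi headers; purely as an illustrative sketch of the naming pattern (the token-pasting macro below is not part of PerfWorks), the per-API entry points are formed by splicing the API token into the function name:

    #include <stdio.h>

    /* Builds the string "NVPA_<API>_<Name>" purely to demonstrate the scheme. */
    #define NVPA_NAME(api, name) "NVPA_" #api "_" #name

    int main(void)
    {
        /* [API] stands in for D3D11, D3D12_Queue, OpenGL, EGL, CUDA, ... */
        printf("%s\n", NVPA_NAME(CUDA, BeginSession)); /* NVPA_CUDA_BeginSession */
        printf("%s\n", NVPA_NAME(D3D11, LoadDriver));  /* NVPA_D3D11_LoadDriver */
        printf("%s\n", NVPA_NAME(OpenGL, PushRange));  /* NVPA_OpenGL_PushRange */
        return 0;
    }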
Fixed the fbpa__read_sectors metric description.
Added NVPA_D3D12_CommandList_EnableAutoRangesDraw. (51427)
gputime_active and gputime_duration: please see the user docs for more information on GPU time metrics. (50136)
Changed gpc__fragments_sent_to_rop to include "Graphics" only. (49755)
NVPA_ACTIVITY_KIND_REALTIME_SAMPLED no longer enumerates gpu__draw_count and other uncollectable counters. (50256)
Added NVPA_ACTIVITY_KIND_REALTIME_SAMPLED and a new family of functions to support device sampling. Includes NVPA_Device_BeginSession/EndSession, BeginPass/EndPass, TriggerStart/TriggerEnd, and NVPA_Device_GetStackData. (45157)
Fixed the gpu__time_* metrics when running in auto pipelined mode. (45338)
Fixed gpu__time_duration when collected in auto-ranged pipelined and serialized modes.
gpu__time_active
Added NVPA_Activity_GetMetricSerializedCap(). (46885)
NVPA_GetVersionNumber()
ia__sol_pct
NVPA_Activity: deleted NVPA_SetAutoRangesDraws() and NVPA_SetAutoRangesCompute(); added NVPA_Activity_SetAutoRangesDraw() and NVPA_Activity_SetAutoRangesDispatch().
NVPA_Global_EndStackData()
New metrics: gpu__tcs_invocations, gpu__tes_invocations, gpu__cs_invocations.
Old Names                    New Names
--------------------------------------------------------------------------------------------
gpu__ps_invocations          gpu__fs_invocations
sm__active_cycles_ps_avg     sm__active_cycles_fs_avg
sm__active_cycles_ps_min     sm__active_cycles_fs_min
sm__active_cycles_ps_max     sm__active_cycles_fs_max
sm__active_cycles_ps_pct     sm__active_cycles_fs_pct
sm__active_cycles_ps_sum     sm__active_cycles_fs_sum
smsp__inst_executed_ps_avg   smsp__inst_executed_fs_avg
smsp__inst_executed_ps_max   smsp__inst_executed_fs_max
smsp__inst_executed_ps_min   smsp__inst_executed_fs_min
smsp__inst_executed_ps_pct   smsp__inst_executed_fs_pct
smsp__inst_executed_ps_sum   smsp__inst_executed_fs_sum
New Metrics
-----------------------
gpu__gs_invocations
gpu__ps_invocations
gpu__vs_invocations
tex__hitrate_pct
tex__read_bytes
tex__texel_queries
Removed Metrics
-----------------------
smp__busy_cycles_avg
smp__busy_cycles_max
smp__busy_pct_avg
smp__busy_pct_max
smp__elapsed_cycles_avg
smp__elapsed_cycles_max
smp__elapsed_cycles_min
smp__elapsed_cycles_sum
Added NVPA_Context_BeginSession() and NVPA_Context_EndSession().
Added NVPA_Context_SetConfig().
Old Names                        New Names
----------------------------------------------------------------------------------------
sm__active_cycles_cs_pct_sum     sm__active_cycles_cs_pct
sm__active_cycles_gs_pct_sum     sm__active_cycles_gs_pct
sm__active_cycles_ps_pct_sum     sm__active_cycles_ps_pct
sm__active_cycles_tes_pct_sum    sm__active_cycles_tes_pct
sm__active_cycles_tcs_pct_sum    sm__active_cycles_tcs_pct
sm__active_cycles_vs_pct_sum     sm__active_cycles_vs_pct
New metrics: ia__vertex_count_reused, pa__prim_input_count.
Old Names                                           New Names
------------------------------------------------------------------------------------------------
ia__total_batch_count                               ia__batch_count
ia__total_prim_count                                ia__prim_count
ia__total_prim_line_count                           ia__prim_line_count
ia__total_prim_lineadj_count                        ia__prim_lineadj_count
ia__total_prim_patch_count                          ia__prim_patch_count
ia__total_prim_point_count                          ia__prim_point_count
ia__total_prim_tri_count                            ia__prim_tri_count
ia__total_prim_triadj_count                         ia__prim_triadj_count
ia__total_prim_triflat_count                        ia__prim_triflat_count
ia__total_vertex_count                              ia__vertex_count
smsp__not_predicated_off_thread_inst_executed_avg   smsp__thread_inst_executed_not_pred_off_avg
smsp__not_predicated_off_thread_inst_executed_max   smsp__thread_inst_executed_not_pred_off_max
smsp__not_predicated_off_thread_inst_executed_min   smsp__thread_inst_executed_not_pred_off_min
smsp__not_predicated_off_thread_inst_executed_sum   smsp__thread_inst_executed_not_pred_off_sum
sm__active_cycles_{SHADER_TYPE}_pct has changed definition from the percentage of active cycles that were {SHADER_TYPE} cycles to the percentage of elapsed cycles that {SHADER_TYPE} shaders were active on SMs (see the sketch below).
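As a worked arithmetic sketch of that definition change (the counter values and variable names below are hypothetical, not PerfWorks counters):

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical per-SM counters over one capture window. */
        double vs_active_cycles  = 600.0;   /* cycles where VS work was resident */
        double sm_active_cycles  = 800.0;   /* cycles where any work was resident */
        double sm_elapsed_cycles = 1000.0;  /* all cycles, including idle */

        /* Old definition: share of active cycles that were VS cycles.  */
        double old_pct = 100.0 * vs_active_cycles / sm_active_cycles;   /* 75% */
        /* New definition: share of elapsed cycles where VS was active. */
        double new_pct = 100.0 * vs_active_cycles / sm_elapsed_cycles;  /* 60% */

        printf("old = %.0f%%, new = %.0f%%\n", old_pct, new_pct);
        return 0;
    }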
nvperfapi_user.c is now a header, nvperfapi_user_impl.h, to ease integration in user builds.
NVPA_StackData_PredictReady() was replaced by NVPA_Context_PredictStackDataReady(). The prediction is now performed correctly, based on the number of passes submitted to the context.
NVPA_Device_SetConfig() was replaced by NVPA_Context_SetConfig(). Config objects are now set per context instead of per device.
PerfWorks_Metrics.txt
dram__{read,write}_pct and fbpa__sol_pct will show lower-than-expected values on Maxwell-based devices that have a disabled ROP/L2 unit (e.g., GTX 970).
\---PerfWorks
    +---bin
    |   +---<platform1>
    |   \---<platform2>
    +---lib
    |   +---<platform1>
    |   \---<platform2>
    +---doc
    +---include
    \---samples
The "user" library, nvperfapi_user_impl.h, contains all global API function definitions. #include nvperfapi_user_impl.h in a single, dedicated .c or .cpp file; this minimizes the chance of name-collision errors during compilation. nvperfapi_user_impl.h must be included in exactly one compilation unit (translation unit) per linkage unit.
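A minimal sketch of such a dedicated translation unit (the file name is an assumption; any single .c or .cpp file in the project will do):

    /* perfworks_user.c -- the single compilation unit in this linkage unit
       that pulls in the PerfWorks global API function definitions. */
    #include "nvperfapi_user_impl.h"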
nvperf is a command line tool for offline querying of PerfWorks metrics.
Usage:
nvperf <command> ...
where commands are
chips : list supported chip families
devices : list available devices and their properties
help : display this message
metrics : list and schedule metrics for a virtual chip
For help on an individual command, use nvperf <command> --help
Querying Supported Metrics
The command 'nvperf metrics --chip gm200 --list' outputs the list of metrics supported on NVIDIA GM200 GPUs.
nvperf metrics --chip gm200 --list
# metric name # tags # description
crop__busy_cycles_avg compute graphics realtime Number of cycles the crop is busy.
crop__busy_cycles_max compute graphics realtime Number of cycles the busiest crop is busy.
crop__busy_pct_avg realtime Percentage of time the crop is busy.
crop__busy_pct_max realtime Percentage of time the busiest crop is busy.
... etc ...
Querying the Number of Passes to Collect Metrics
It can take several passes to collect some performance metrics. The nvperf command can schedule a list of metrics and report the number of passes required.
The command 'nvperf metrics --chip gm200 gr__busy_pct sm__busy_pct_avg' schedules the metrics gr__busy_pct and sm__busy_pct_avg and outputs the number of passes.
nvperf metrics --chip gm200 gr__busy_pct sm__busy_pct_avg
Required passes to schedule all metrics: 1
The metric 'all' will schedule all available metrics.
nvperf metrics --chip gm200 all
Required passes to schedule all metrics: 41
|-- extensions
|   |-- build                build files for extensions
|   |-- include              headers for extensions referenced by sample code
|   |   |-- nvperfapi_utils  helper library to configure a PerfWorks profiler
|   |   +-- winsys           helper library to create a single window with a graphics context
|   |-- lib                  built extensions will be deployed here
|   +-- src                  source for extensions
|       |-- nvperfapi_utils
|       +-- winsys
+-- samples                  per graphics API samples
    |-- bin                  built samples will be deployed here
    |-- build                build files for the samples
    +-- gles                 GLES samples
        +-- simple           Basic app to demonstrate use of PerfWorks API
            +-- assets
                |-- shaders
                +-- src_shaders
Each sample has a corresponding Makefile that deploys the built sample into its corresponding bin directory.
The provided Makefiles are meant for cross-compiling on a standard Linux host machine. Before running make, edit the variables:
TEGRA_SDK_PATH    := "<SDK ROOT>"
COMPILER_BIN_PATH := "<TOOLCHAIN_ROOT>/tegra-4.9-nv/usr/bin/aarch64-gnu-linux"
at the top of the two following makefiles:
<sdk_root>/samples/extensions/build/l4t/Makefile
<sdk_root>/samples/samples/build/l4t/Makefile
A suitable aarch64-unknown-linux-gnu cross-compiler must be installed at:
/usr/bin
Once built, copy the contents of the samples/gles/bin
directory to the target device. This should include the built sample binary and a prebuilt PerfWorks library.
Support issues can be mailed to PerfWorks@nvidia.com.
NVIDIA® PerfWorks SDK Documentation ©2015-2017. NVIDIA Corporation. All Rights Reserved.
NVIDIA® PerfWorks Documentation Rev. 0.46.170612 ©2017. NVIDIA Corporation. All Rights Reserved.