NVIDIA System Profiler is capable of capturing information about CUDA execution in the profiled process.
The following CUDA driver and runtime (toolkit) versions are currently supported: 6.5, 7.0, and 8.0.
The following information can be collected and presented on the timeline in the report:
CUDA API trace — trace of CUDA Runtime and CUDA Driver calls made by the application. CUDA Runtime calls typically start with the cuda prefix (e.g. cudaLaunch). CUDA Driver calls typically start with the cu prefix (e.g. cuDeviceGetCount).
CUDA workload trace — trace of activity happening on the GPU, which includes memory operations (e.g., Host-to-Device memory copies) and kernel executions. Within the threads that use the CUDA API, additional child rows will appear in the timeline tree.
Near the bottom of the timeline row tree, the GPU node will appear and contain a CUDA node. Within the CUDA node, each CUDA context used within the process will be shown along with its corresponding CUDA streams. Streams will contain the memory operations and kernel launches executed on the GPU. Kernel launches are represented in blue, while memory transfers are displayed in red (both are illustrated in the sketch below).
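For illustration, the following is a minimal sketch of a CUDA program (the scale kernel, buffer size, and launch configuration are hypothetical, not taken from this documentation) showing how its activity would appear in the report: each cuda* call shows up in the CUDA API trace row of the calling CPU thread, while the resulting memory copies and kernel execution show up under the corresponding GPU > CUDA > context > stream rows.

    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        float *host = new float[n]();   // zero-initialized host buffer
        float *dev = nullptr;

        cudaMalloc(&dev, n * sizeof(float));                               // CUDA Runtime call (cuda prefix) in the API trace
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);  // Host-to-Device memory operation (red) in the stream row
        scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);                     // kernel execution (blue) in the stream row
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // Device-to-Host memory operation (red)
        cudaFree(dev);

        delete[] host;
        return 0;
    }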
The easiest way to capture CUDA information is to launch the process from NVIDIA System Profiler, which will set up the environment for you. To do so, simply configure a normal launch and select the Collect CUDA trace checkbox.
Additional configuration parameters are available:
Flush data periodically — specifies the period after which an attempt to flush CUDA trace data will be made. Normally, to collect a full CUDA trace, the application needs to finalize the device used for CUDA work (call cudaDeviceReset()) and then exit gracefully (as opposed to crashing); see the sketch after this list. This option allows flushing CUDA trace data even before the device is finalized; however, it might introduce additional overhead to a random CUDA Driver or CUDA Runtime API call.
Skip some API calls — avoids tracing insignificant CUDA Runtime API calls (namely, cudaConfigureCall(), cudaSetupArgument(), and cudaHostGetDevicePointers()). Not tracing these functions significantly reduces the profiling overhead without losing any interesting data.
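As a sketch of the graceful-exit requirement mentioned under Flush data periodically (illustrative code, not taken from this documentation), the following pattern finalizes the device before returning so that the complete CUDA trace can be flushed without relying on periodic flushing:

    #include <cuda_runtime.h>

    int main()
    {
        float *dev = nullptr;
        cudaMalloc(&dev, 1024 * sizeof(float));   // ... CUDA work: allocations, copies, kernel launches ...
        cudaFree(dev);

        // Finalize the device and return normally so the profiler can copy out
        // all CUDA trace data; if the application cannot do this, enable the
        // "Flush data periodically" option instead.
        cudaDeviceReset();
        return 0;
    }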
If desired, the target application can be manually set up to collect CUDA trace. To capture information about CUDA execution, the following requirements should be satisfied:
The profiled process should be started with the appropriate environment variable set, depending on the architecture of the process (an example invocation follows the list):
For ARMv7 (32-bit) processes: CUDA_INJECTION32_PATH, which should point to the injection library: /opt/nvidia/tegra_system_profiler/libToolsInjection32.so
For ARMv8 (64-bit) processes: CUDA_INJECTION64_PATH, which should point to the injection library: /opt/nvidia/tegra_system_profiler/libToolsInjection64.so
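For example, a 64-bit CUDA application (hypothetically named ./my_app here) could be started manually with the variable set on the command line:

    CUDA_INJECTION64_PATH=/opt/nvidia/tegra_system_profiler/libToolsInjection64.so ./my_app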
If the application is started by NVIDIA System Profiler, all required environment variables will be set automatically.
Please note that if your application crashes before all collected CUDA trace data has been copied out, some or all of that data might be lost and will not be present in the report.