Nsight Systems is capable of capturing information about CUDA execution in the profiled process.
The following information can be collected and presented on the timeline in the report:
CUDA API trace — trace of CUDA Runtime and CUDA Driver calls made by the application. CUDA Runtime calls typically start with the cuda prefix (e.g., cudaLaunch), while CUDA Driver calls typically start with the cu prefix (e.g., cuDeviceGetCount). A small example of calls that would appear on these rows follows this list.
CUDA workload trace — trace of activity happening on the GPU, which includes memory operations (e.g., Host-to-Device memory copies) and kernel executions. Within the threads that use the CUDA API, additional child rows will appear in the timeline tree.
Linux only:
cuDNN and cuBLAS API tracing
OpenACC tracing
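To make the split between API rows and GPU rows concrete, here is a minimal sketch (the kernel name scale, the buffer size, and the launch configuration are arbitrary illustrations, not part of Nsight Systems): the cuInit/cuDeviceGetCount and cudaMalloc/cudaMemcpy calls would show up in the CUDA API trace of the calling thread, while the resulting memory operations and kernel execution would show up under the GPU rows described below.

    #include <cuda.h>
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main(void)
    {
        // CUDA Driver API calls (cu prefix) -- shown on the CUDA API trace row.
        int deviceCount = 0;
        cuInit(0);
        cuDeviceGetCount(&deviceCount);
        printf("CUDA devices: %d\n", deviceCount);

        // CUDA Runtime API calls (cuda prefix) -- also shown on the API trace row.
        // The resulting memory operations and the kernel execution appear under the
        // GPU > CUDA > context > stream rows of the timeline.
        const int n = 1 << 20;
        float *hostBuf = (float *)calloc(n, sizeof(float));
        float *devBuf = NULL;
        cudaMalloc((void **)&devBuf, n * sizeof(float));
        cudaMemcpy(devBuf, hostBuf, n * sizeof(float), cudaMemcpyHostToDevice);  // HtoD memory operation (red)
        scale<<<(n + 255) / 256, 256>>>(devBuf, 2.0f, n);                        // kernel launch (blue)
        cudaMemcpy(hostBuf, devBuf, n * sizeof(float), cudaMemcpyDeviceToHost);  // DtoH memory operation (red)
        cudaFree(devBuf);
        free(hostBuf);
        return 0;
    }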
Near the bottom of the timeline row tree, the GPU node will appear and contain a CUDA node. Within the CUDA node, each CUDA context used within the process will be shown along with its corresponding CUDA streams. Streams will contain memory operations and kernel launches on the GPU. Kernel launches are represented in blue, while memory transfers are displayed in red.
The easiest way to capture CUDA information is to launch the process from Nsight Systems, and it will set up the environment for you. To do so, simply set up a normal launch and select the Collect CUDA trace checkbox.
Additional configuration parameters are available:
Collect backtraces for API calls longer than X seconds - turns on collection of CUDA API backtraces and sets the minimum time a CUDA API event must take before its backtraces are collected. Setting this value too low can cause high application overhead and seriously increase the size of your results file.
Flush data periodically — specifies the period after which an attempt to flush CUDA trace data will be made. Normally, in order to collect a full CUDA trace, the application needs to finalize the device used for CUDA work (call cudaDeviceReset()) and then let the application gracefully exit (as opposed to crashing); a minimal finalization sketch appears after the notes below. This option allows flushing CUDA trace data even before the device is finalized. However, it might introduce additional overhead to a random CUDA Driver or CUDA Runtime API call.
Skip some API calls — avoids tracing insignificant CUDA Runtime API calls (namely, cudaConfigureCall(), cudaSetupArgument(), and cudaHostGetDevicePointers()). Not tracing these functions significantly reduces the profiling overhead without losing any interesting data. (See CUDA Trace Filters, below.)
Collect cuDNN trace, Collect cuBLAS trace, Collect OpenACC trace - select which (if any) extra libraries that depend on CUDA should be traced.
OpenACC versions 2.0, 2.5, and 2.6 are supported when using PGI runtime version 15.7 or greater and not compiling statically. In order to differentiate constructs, a PGI runtime of 16.1 or later is required. Note that Nsight Systems does not support the GCC implementation of OpenACC at this time.
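For reference, a minimal OpenACC sketch of the kind whose constructs would appear in the OpenACC trace is shown below (the vector-add loop is an arbitrary illustration; it assumes compilation with the PGI compilers, e.g. pgcc -acc, per the version requirements above):

    #include <stdlib.h>

    int main(void)
    {
        const int n = 1 << 20;
        float *a = (float *)malloc(n * sizeof(float));
        float *b = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        // Each OpenACC construct below (the data region and the parallel loop)
        // is a candidate event on the OpenACC rows of the timeline.
        #pragma acc data copyin(b[0:n]) copy(a[0:n])
        {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                a[i] += b[i];
        }

        free(a);
        free(b);
        return 0;
    }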
Please note that if your application crashes before all collected CUDA trace data has been copied out, some or all data might be lost and not present in the report.
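As referenced above, the following minimal sketch shows the finalization pattern that lets Nsight Systems copy out the full CUDA trace without relying on periodic flushing (the cudaMalloc/cudaFree pair merely stands in for the application's real CUDA work):

    #include <cuda_runtime.h>

    int main(void)
    {
        // ... the application's CUDA work happens here (kernels, memcpys, etc.) ...
        float *buf = NULL;
        cudaMalloc((void **)&buf, 1024);  // stand-in for real work
        cudaFree(buf);

        // Wait for all outstanding GPU work, then finalize the device so that
        // buffered CUDA trace data can be copied out before the process exits.
        cudaDeviceSynchronize();
        cudaDeviceReset();

        // Exit gracefully -- crashing at this point could lose trace data.
        return 0;
    }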
Unified Memory (also called Managed Memory) transfer trace is enabled automatically in Nsight Systems when CUDA trace is selected. It incurs no overhead in programs that do not perform any Unified Memory transfers. Data is displayed in the Managed Memory area of the timeline:
HtoD transfer indicates the CUDA kernel accessed managed memory that was residing on the host, so the kernel execution paused and transferred the data to the device. Heavy traffic here will incur performance penalties in CUDA kernels, so consider using manual cudaMemcpy operations from pinned host memory instead (a sketch contrasting the two approaches follows this list).
PtoP transfer indicates the CUDA kernel accessed managed memory that was residing on a different device, so the kernel execution paused and transferred the data to this device. Heavy traffic here will incur performance penalties, so consider using manual cudaMemcpyPeer operations to transfer from other devices' memory instead. The row showing these events is for the destination device -- the source device is shown in the tooltip for each transfer event.
DtoH transfer indicates the CPU accessed managed memory that was residing on a CUDA device, so the CPU execution paused and transferred the data to system memory. Heavy traffic here will incur performance penalties in CPU code, so consider using manual cudaMemcpy operations from pinned host memory instead.
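As a rough comparison of the two approaches mentioned above, the sketch below first touches managed memory from both the CPU and the GPU, which produces HtoD and DtoH events in the Managed Memory rows, and then performs the same work with explicit cudaMemcpyAsync transfers from pinned host memory, which appear as ordinary memory operations on the stream (the kernel name addOne, sizes, and launch configuration are illustrative; error checking is omitted):

    #include <cuda_runtime.h>

    __global__ void addOne(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main(void)
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Version 1: Unified (managed) memory. The CPU writes first, so the first
        // GPU access triggers HtoD migrations; the later CPU read triggers DtoH.
        float *managed = NULL;
        cudaMallocManaged((void **)&managed, bytes);
        for (int i = 0; i < n; ++i) managed[i] = 0.0f;     // CPU touch (host residency)
        addOne<<<(n + 255) / 256, 256>>>(managed, n);      // GPU touch -> HtoD migration
        cudaDeviceSynchronize();
        float first = managed[0];                          // CPU touch -> DtoH migration
        (void)first;
        cudaFree(managed);

        // Version 2: explicit copies from pinned host memory, which show up as
        // ordinary memory operations on the stream rather than managed migrations.
        float *pinned = NULL, *device = NULL;
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault);
        cudaMalloc((void **)&device, bytes);
        for (int i = 0; i < n; ++i) pinned[i] = 0.0f;
        cudaMemcpyAsync(device, pinned, bytes, cudaMemcpyHostToDevice, stream);
        addOne<<<(n + 255) / 256, 256, 0, stream>>>(device, n);
        cudaMemcpyAsync(pinned, device, bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
        cudaFree(device);
        cudaFreeHost(pinned);
        cudaStreamDestroy(stream);
        return 0;
    }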
CUDA Trace Filters
By default, CUDA and cuDNN tracing is limited to a select list of functions.
CUDA Runtime API
cudaBindSurfaceToArray cudaBindTexture cudaBindTexture2D cudaBindTextureToArray cudaBindTextureToMipmappedArray cudaConfigureCall cudaCreateSurfaceObject cudaCreateTextureObject cudaD3D10MapResources cudaD3D10RegisterResource cudaD3D10UnmapResources cudaD3D10UnregisterResource cudaD3D9MapResources cudaD3D9MapVertexBuffer cudaD3D9RegisterResource cudaD3D9RegisterVertexBuffer cudaD3D9UnmapResources cudaD3D9UnmapVertexBuffer cudaD3D9UnregisterResource cudaD3D9UnregisterVertexBuffer cudaDestroySurfaceObject cudaDestroyTextureObject cudaDeviceReset cudaDeviceSynchronize cudaEGLStreamConsumerAcquireFrame cudaEGLStreamConsumerConnect cudaEGLStreamConsumerConnectWithFlags cudaEGLStreamConsumerDisconnect cudaEGLStreamConsumerReleaseFrame cudaEGLStreamConsumerReleaseFrame cudaEGLStreamProducerConnect cudaEGLStreamProducerDisconnect cudaEGLStreamProducerReturnFrame cudaEventCreate cudaEventCreateFromEGLSync cudaEventCreateWithFlags cudaEventDestroy cudaEventQuery cudaEventRecord cudaEventRecord_ptsz cudaEventSynchronize cudaFree cudaFreeArray cudaFreeHost cudaFreeMipmappedArray cudaGLMapBufferObject cudaGLMapBufferObjectAsync cudaGLRegisterBufferObject cudaGLUnmapBufferObject cudaGLUnmapBufferObjectAsync cudaGLUnregisterBufferObject cudaGraphicsD3D10RegisterResource cudaGraphicsD3D11RegisterResource cudaGraphicsD3D9RegisterResource cudaGraphicsEGLRegisterImage cudaGraphicsGLRegisterBuffer cudaGraphicsGLRegisterImage cudaGraphicsMapResources cudaGraphicsUnmapResources cudaGraphicsUnregisterResource cudaGraphicsVDPAURegisterOutputSurface cudaGraphicsVDPAURegisterVideoSurface cudaHostAlloc cudaHostRegister cudaHostUnregister cudaLaunch cudaLaunchCooperativeKernel cudaLaunchCooperativeKernelMultiDevice cudaLaunchCooperativeKernel_ptsz cudaLaunchKernel cudaLaunchKernel_ptsz cudaLaunch_ptsz cudaMalloc cudaMalloc3D cudaMalloc3DArray cudaMallocArray cudaMallocHost cudaMallocManaged cudaMallocMipmappedArray cudaMallocPitch cudaMemGetInfo cudaMemPrefetchAsync cudaMemPrefetchAsync_ptsz cudaMemcpy cudaMemcpy2D cudaMemcpy2DArrayToArray cudaMemcpy2DArrayToArray_ptds cudaMemcpy2DAsync cudaMemcpy2DAsync_ptsz cudaMemcpy2DFromArray cudaMemcpy2DFromArrayAsync cudaMemcpy2DFromArrayAsync_ptsz cudaMemcpy2DFromArray_ptds cudaMemcpy2DToArray cudaMemcpy2DToArrayAsync cudaMemcpy2DToArrayAsync_ptsz cudaMemcpy2DToArray_ptds cudaMemcpy2D_ptds cudaMemcpy3D cudaMemcpy3DAsync cudaMemcpy3DAsync_ptsz cudaMemcpy3DPeer cudaMemcpy3DPeerAsync cudaMemcpy3DPeerAsync_ptsz cudaMemcpy3DPeer_ptds cudaMemcpy3D_ptds cudaMemcpyArrayToArray cudaMemcpyArrayToArray_ptds cudaMemcpyAsync cudaMemcpyAsync_ptsz cudaMemcpyFromArray cudaMemcpyFromArrayAsync cudaMemcpyFromArrayAsync_ptsz cudaMemcpyFromArray_ptds cudaMemcpyFromSymbol cudaMemcpyFromSymbolAsync cudaMemcpyFromSymbolAsync_ptsz cudaMemcpyFromSymbol_ptds cudaMemcpyPeer cudaMemcpyPeerAsync cudaMemcpyToArray cudaMemcpyToArrayAsync cudaMemcpyToArrayAsync_ptsz cudaMemcpyToArray_ptds cudaMemcpyToSymbol cudaMemcpyToSymbolAsync cudaMemcpyToSymbolAsync_ptsz cudaMemcpyToSymbol_ptds cudaMemcpy_ptds cudaMemset cudaMemset2D cudaMemset2DAsync cudaMemset2DAsync_ptsz cudaMemset2D_ptds cudaMemset3D cudaMemset3DAsync cudaMemset3DAsync_ptsz cudaMemset3D_ptds cudaMemsetAsync cudaMemsetAsync_ptsz cudaMemset_ptds cudaPeerRegister cudaPeerUnregister cudaStreamAddCallback cudaStreamAddCallback_ptsz cudaStreamAttachMemAsync cudaStreamAttachMemAsync_ptsz cudaStreamCreate cudaStreamCreateWithFlags cudaStreamCreateWithPriority cudaStreamDestroy cudaStreamQuery cudaStreamQuery_ptsz cudaStreamSynchronize 
cudaStreamSynchronize_ptsz cudaStreamWaitEvent cudaStreamWaitEvent_ptsz cudaThreadSynchronize cudaUnbindTexture
CUDA Driver API
cu64Array3DCreate cu64ArrayCreate cu64D3D9MapVertexBuffer cu64GLMapBufferObject cu64GLMapBufferObjectAsync cu64MemAlloc cu64MemAllocPitch cu64MemFree cu64MemGetInfo cu64MemHostAlloc cu64Memcpy2D cu64Memcpy2DAsync cu64Memcpy2DUnaligned cu64Memcpy3D cu64Memcpy3DAsync cu64MemcpyAtoD cu64MemcpyDtoA cu64MemcpyDtoD cu64MemcpyDtoDAsync cu64MemcpyDtoH cu64MemcpyDtoHAsync cu64MemcpyHtoD cu64MemcpyHtoDAsync cu64MemsetD16 cu64MemsetD16Async cu64MemsetD2D16 cu64MemsetD2D16Async cu64MemsetD2D32 cu64MemsetD2D32Async cu64MemsetD2D8 cu64MemsetD2D8Async cu64MemsetD32 cu64MemsetD32Async cu64MemsetD8 cu64MemsetD8Async cuArray3DCreate cuArray3DCreate_v2 cuArrayCreate cuArrayCreate_v2 cuArrayDestroy cuBinaryFree cuCompilePtx cuCtxCreate cuCtxCreate_v2 cuCtxDestroy cuCtxDestroy_v2 cuCtxSynchronize cuD3D10CtxCreate cuD3D10CtxCreateOnDevice cuD3D10CtxCreate_v2 cuD3D10MapResources cuD3D10RegisterResource cuD3D10UnmapResources cuD3D10UnregisterResource cuD3D11CtxCreate cuD3D11CtxCreateOnDevice cuD3D11CtxCreate_v2 cuD3D9CtxCreate cuD3D9CtxCreateOnDevice cuD3D9CtxCreate_v2 cuD3D9MapResources cuD3D9MapVertexBuffer cuD3D9MapVertexBuffer_v2 cuD3D9RegisterResource cuD3D9RegisterVertexBuffer cuD3D9UnmapResources cuD3D9UnmapVertexBuffer cuD3D9UnregisterResource cuD3D9UnregisterVertexBuffer cuEGLStreamConsumerAcquireFrame cuEGLStreamConsumerConnect cuEGLStreamConsumerConnectWithFlags cuEGLStreamConsumerDisconnect cuEGLStreamConsumerReleaseFrame cuEGLStreamProducerConnect cuEGLStreamProducerDisconnect cuEGLStreamProducerPresentFrame cuEGLStreamProducerReturnFrame cuEventCreate cuEventCreateFromEGLSync cuEventCreateFromNVNSync cuEventDestroy cuEventDestroy_v2 cuEventQuery cuEventRecord cuEventRecord_ptsz cuEventSynchronize cuGLCtxCreate cuGLCtxCreate_v2 cuGLInit cuGLMapBufferObject cuGLMapBufferObjectAsync cuGLMapBufferObjectAsync_v2 cuGLMapBufferObjectAsync_v2_ptsz cuGLMapBufferObject_v2 cuGLMapBufferObject_v2_ptds cuGLRegisterBufferObject cuGLUnmapBufferObject cuGLUnmapBufferObjectAsync cuGLUnregisterBufferObject cuGraphicsD3D10RegisterResource cuGraphicsD3D11RegisterResource cuGraphicsD3D9RegisterResource cuGraphicsEGLRegisterImage cuGraphicsGLRegisterBuffer cuGraphicsGLRegisterImage cuGraphicsMapResources cuGraphicsMapResources_ptsz cuGraphicsUnmapResources cuGraphicsUnmapResources_ptsz cuGraphicsUnregisterResource cuGraphicsVDPAURegisterOutputSurface cuGraphicsVDPAURegisterVideoSurface cuInit cuLaunch cuLaunchCooperativeKernel cuLaunchCooperativeKernelMultiDevice cuLaunchCooperativeKernel_ptsz cuLaunchGrid cuLaunchGridAsync cuLaunchKernel cuLaunchKernel_ptsz cuLinkComplete cuLinkCreate cuLinkCreate_v2 cuLinkDestroy cuMemAlloc cuMemAllocHost cuMemAllocHost_v2 cuMemAllocManaged cuMemAllocPitch cuMemAllocPitch_v2 cuMemAlloc_v2 cuMemFree cuMemFreeHost cuMemFree_v2 cuMemGetInfo cuMemGetInfo_v2 cuMemHostAlloc cuMemHostAlloc_v2 cuMemHostRegister cuMemHostRegister_v2 cuMemHostUnregister cuMemPeerRegister cuMemPeerUnregister cuMemPrefetchAsync cuMemPrefetchAsync_ptsz cuMemcpy cuMemcpy2D cuMemcpy2DAsync cuMemcpy2DAsync_v2 cuMemcpy2DAsync_v2_ptsz cuMemcpy2DUnaligned cuMemcpy2DUnaligned_v2 cuMemcpy2DUnaligned_v2_ptds cuMemcpy2D_v2 cuMemcpy2D_v2_ptds cuMemcpy3D cuMemcpy3DAsync cuMemcpy3DAsync_v2 cuMemcpy3DAsync_v2_ptsz cuMemcpy3DPeer cuMemcpy3DPeerAsync cuMemcpy3DPeerAsync_ptsz cuMemcpy3DPeer_ptds cuMemcpy3D_v2 cuMemcpy3D_v2_ptds cuMemcpyAsync cuMemcpyAsync_ptsz cuMemcpyAtoA cuMemcpyAtoA_v2 cuMemcpyAtoA_v2_ptds cuMemcpyAtoD cuMemcpyAtoD_v2 cuMemcpyAtoD_v2_ptds cuMemcpyAtoH cuMemcpyAtoHAsync cuMemcpyAtoHAsync_v2 
cuMemcpyAtoHAsync_v2_ptsz cuMemcpyAtoH_v2 cuMemcpyAtoH_v2_ptds cuMemcpyDtoA cuMemcpyDtoA_v2 cuMemcpyDtoA_v2_ptds cuMemcpyDtoD cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2 cuMemcpyDtoDAsync_v2_ptsz cuMemcpyDtoD_v2 cuMemcpyDtoD_v2_ptds cuMemcpyDtoH cuMemcpyDtoHAsync cuMemcpyDtoHAsync_v2 cuMemcpyDtoHAsync_v2_ptsz cuMemcpyDtoH_v2 cuMemcpyDtoH_v2_ptds cuMemcpyHtoA cuMemcpyHtoAAsync cuMemcpyHtoAAsync_v2 cuMemcpyHtoAAsync_v2_ptsz cuMemcpyHtoA_v2 cuMemcpyHtoA_v2_ptds cuMemcpyHtoD cuMemcpyHtoDAsync cuMemcpyHtoDAsync_v2 cuMemcpyHtoDAsync_v2_ptsz cuMemcpyHtoD_v2 cuMemcpyHtoD_v2_ptds cuMemcpyPeer cuMemcpyPeerAsync cuMemcpyPeerAsync_ptsz cuMemcpyPeer_ptds cuMemcpy_ptds cuMemcpy_v2 cuMemsetD16 cuMemsetD16Async cuMemsetD16Async_ptsz cuMemsetD16_v2 cuMemsetD16_v2_ptds cuMemsetD2D16 cuMemsetD2D16Async cuMemsetD2D16Async_ptsz cuMemsetD2D16_v2 cuMemsetD2D16_v2_ptds cuMemsetD2D32 cuMemsetD2D32Async cuMemsetD2D32Async_ptsz cuMemsetD2D32_v2 cuMemsetD2D32_v2_ptds cuMemsetD2D8 cuMemsetD2D8Async cuMemsetD2D8Async_ptsz cuMemsetD2D8_v2 cuMemsetD2D8_v2_ptds cuMemsetD32 cuMemsetD32Async cuMemsetD32Async_ptsz cuMemsetD32_v2 cuMemsetD32_v2_ptds cuMemsetD8 cuMemsetD8Async cuMemsetD8Async_ptsz cuMemsetD8_v2 cuMemsetD8_v2_ptds cuMipmappedArrayCreate cuMipmappedArrayDestroy cuModuleLoad cuModuleLoadData cuModuleLoadDataEx cuModuleLoadFatBinary cuModuleUnload cuStreamAddCallback cuStreamAddCallback_ptsz cuStreamAttachMemAsync cuStreamAttachMemAsync_ptsz cuStreamBatchMemOp cuStreamBatchMemOp_ptsz cuStreamCreate cuStreamCreateWithPriority cuStreamDestroy cuStreamDestroy_v2 cuStreamSynchronize cuStreamSynchronize_ptsz cuStreamWaitEvent cuStreamWaitEvent_ptsz cuStreamWaitValue32 cuStreamWaitValue32_ptsz cuStreamWaitValue64 cuStreamWaitValue64_ptsz cuStreamWriteValue32 cuStreamWriteValue32_ptsz cuStreamWriteValue64 cuStreamWriteValue64_ptsz cuSurfObjectCreate cuSurfObjectDestroy cuSurfRefCreate cuSurfRefDestroy cuTexObjectCreate cuTexObjectDestroy cuTexRefCreate cuTexRefDestroy cuVDPAUCtxCreate cuVDPAUCtxCreate_v2
Linux only:
cuDNN API functions
cudnnActivationBackward cudnnActivationBackward_v3 cudnnActivationBackward_v4 cudnnActivationForward cudnnActivationForward_v3 cudnnActivationForward_v4 cudnnAddTensor cudnnBatchNormalizationBackward cudnnBatchNormalizationBackwardEx cudnnBatchNormalizationForwardInference cudnnBatchNormalizationForwardTraining cudnnBatchNormalizationForwardTrainingEx cudnnCTCLoss cudnnConvolutionBackwardBias cudnnConvolutionBackwardData cudnnConvolutionBackwardFilter cudnnConvolutionBiasActivationForward cudnnConvolutionForward cudnnCreate cudnnCreateAlgorithmPerformance cudnnDestroy cudnnDestroyAlgorithmPerformance cudnnDestroyPersistentRNNPlan cudnnDivisiveNormalizationBackward cudnnDivisiveNormalizationForward cudnnDropoutBackward cudnnDropoutForward cudnnDropoutGetReserveSpaceSize cudnnDropoutGetStatesSize cudnnFindConvolutionBackwardDataAlgorithm cudnnFindConvolutionBackwardDataAlgorithmEx cudnnFindConvolutionBackwardFilterAlgorithm cudnnFindConvolutionBackwardFilterAlgorithmEx cudnnFindConvolutionForwardAlgorithm cudnnFindConvolutionForwardAlgorithmEx cudnnFindRNNBackwardDataAlgorithmEx cudnnFindRNNBackwardWeightsAlgorithmEx cudnnFindRNNForwardInferenceAlgorithmEx cudnnFindRNNForwardTrainingAlgorithmEx cudnnFusedOpsExecute cudnnIm2Col cudnnLRNCrossChannelBackward cudnnLRNCrossChannelForward cudnnMakeFusedOpsPlan cudnnMultiHeadAttnBackwardData cudnnMultiHeadAttnBackwardWeights cudnnMultiHeadAttnForward cudnnOpTensor cudnnPoolingBackward cudnnPoolingForward cudnnRNNBackwardData cudnnRNNBackwardDataEx cudnnRNNBackwardWeights cudnnRNNBackwardWeightsEx cudnnRNNForwardInference cudnnRNNForwardInferenceEx cudnnRNNForwardTraining cudnnRNNForwardTrainingEx cudnnReduceTensor cudnnReorderFilterAndBias cudnnRestoreAlgorithm cudnnRestoreDropoutDescriptor cudnnSaveAlgorithm cudnnScaleTensor cudnnSoftmaxBackward cudnnSoftmaxForward cudnnSpatialTfGridGeneratorBackward cudnnSpatialTfGridGeneratorForward cudnnSpatialTfSamplerBackward cudnnSpatialTfSamplerForward cudnnTransformFilter cudnnTransformTensor cudnnTransformTensorEx
Copyright (c) 2012-2020, NVIDIA Corporation. All rights reserved.