The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the
CUDA Application Programming Interface (API). The MPS runtime architecture is designed to
transparently enable co-operative multi-process CUDA applications, typically MPI jobs, to
utilize Hyper-Q capabilities on the latest NVIDIA (Kepler-based) Tesla and Quadro GPUs.
Any interactions with NVIDIA GPUs require that an instance of the kernel mode driver be running.
This driver may be persistent in some environments and transient in others. This document describes
the default driver behavior and options for modifying that behavior.
Nvidia-healthmon is the system administrator's and cluster manager's tool for detecting
and troubleshooting common problems affecting NVIDIA Tesla GPUs in high-performance
computing environments. Nvidia-healthmon focuses on software and system configuration
issues, with only limited hardware diagnostic capabilities.
The NVIDIA driver supports "retiring" bad framebuffer memory cells by retiring the page
the cell belongs to. This is called "dynamic page retirement" and is done automatically
for cells that are degrading in quality. This feature can improve the longevity of an
otherwise good board and is thus an important resiliency feature on supported products,
especially in HPC and enterprise environments.
This document explains what Xid messages are, and is intended to assist system administrators, developers, and FAEs in understanding
the meaning behind these messages as an aid in analyzing and resolving GPU-related problems.
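
As an illustration of how Xid messages might be collected for analysis, the following is a minimal Python sketch that extracts Xid reports from kernel log lines and counts them per GPU. It assumes the messages appear in the form "NVRM: Xid (PCI:0000:03:00): 31, details", with the "PCI:" prefix possibly absent on older drivers; the exact layout can vary between driver versions, and the helper names parse_xid_line and count_xids are hypothetical, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumed line shape: "NVRM: Xid (PCI:0000:03:00): 31, <free-form details>".
# Older drivers may omit the "PCI:" prefix, so the pattern accepts both forms.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)(?:, (.*))?"
)


def parse_xid_line(line):
    """Return (pci_address, xid_code, details), or None if the line holds no Xid report."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    address, code, details = match.groups()
    return address, int(code), (details or "").strip()


def count_xids(lines):
    """Count Xid reports per (pci_address, xid_code) pair across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_xid_line(line)
        if parsed is not None:
            address, code, _details = parsed
            counts[(address, code)] += 1
    return counts
```

The functions take plain strings, so they can be fed lines from any source of kernel log text, for example a saved log file. The resulting counts only show which Xid codes occurred on which GPU and how often; interpreting a given code still requires the Xid documentation itself.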