What must be done before installing new versions of DOCA drivers on a BlueField DPU?
A
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Before installing new versions of DOCA drivers on NVIDIA BlueField DPUs, any previously installed DOCA drivers must be uninstalled to prevent conflicts and ensure a clean upgrade. This guarantees that the new installation is not affected by leftover files or configurations from earlier versions. Re-flashing firmware or disabling network interfaces is not always required before every driver installation, and rebooting the host system may be recommended after installation but is not a prerequisite for installing the drivers.
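As an illustrative sketch, on a Debian-based host the old DOCA packages can be purged before installing the new release (the doca-all meta-package name is an example and depends on the chosen DOCA profile and release):

    # Remove any previously installed DOCA packages to avoid conflicts
    dpkg --list | grep -i doca | awk '{print $2}' | xargs -r sudo apt-get remove --purge -y
    sudo apt-get autoremove -y
    # Install the new DOCA release from the configured NVIDIA repository
    sudo apt-get update && sudo apt-get install -y doca-all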
A Slurm user needs to display real-time information about the running processes and resource usage
of a Slurm job.
Which command should be used?
C
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
The Slurm command sstat is designed to provide real-time statistics about running jobs, including
process-level details and resource usage such as CPU, memory, and GPU utilization. Using sstat -j
<jobid> or sstat -j <jobid.step> allows monitoring of active job resource consumption.
smap is a deprecated curses-based display tool and does not report per-job resource usage.
scontrol show job gives job configuration and status but not real-time resource usage.
sinfo displays node and partition information, not job-specific resource stats.
Therefore, sstat is the correct command for real-time job process and resource monitoring.
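For example (the job ID shown is hypothetical), the resource usage of a running job step can be queried as follows:

    # Show live CPU and memory statistics for a running job's batch step
    sstat --jobs=123456.batch --format=JobID,AveCPU,AveRSS,MaxRSS,MaxVMSize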
In which two (2) ways does the pre-configured GPU Operator in the NVIDIA Enterprise Catalog differ from the GPU Operator in the public NGC catalog? (Choose two.)
A, D
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
The pre-configured GPU Operator in the NVIDIA Enterprise Catalog differs from the public NGC
catalog GPU Operator primarily by its configuration to use a prebuilt vGPU driver image and being
configured to use the NVIDIA License System (NLS). These adaptations allow better support for
enterprise environments where vGPU functionality and license management are critical.
Other options, such as automatic installation of the data center driver or additional installation of the Network Operator, are not differences highlighted between the two operators.
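As a minimal sketch, assuming a private registry hosting the prebuilt vGPU guest driver image and an NLS client-token ConfigMap named licensing-config (the registry, image name, and driver version below are illustrative), the Helm values involved typically look like this:

    # Point the GPU Operator at a prebuilt vGPU driver image and an NLS licensing ConfigMap
    helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace \
      --set driver.repository=registry.example.com/nvidia \
      --set driver.image=vgpu-guest-driver \
      --set driver.version=535.129.03 \
      --set driver.licensingConfig.configMapName=licensing-config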
You are managing multiple edge AI deployments using NVIDIA Fleet Command. You need to ensure
that each AI application running on the same GPU is isolated from others to prevent interference.
Which feature of Fleet Command should you use to achieve this?
C
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
NVIDIA Fleet Command is a cloud-native software platform designed to deploy, manage, and
orchestrate AI applications at the edge. When managing multiple AI applications on the same GPU,
Multi-Instance GPU (MIG) support is critical. MIG allows a single GPU to be partitioned into multiple
independent instances, each with dedicated resources (compute, memory, bandwidth), enabling
workload isolation and preventing interference between applications.
Remote Console allows remote access for management but does not provide GPU resource isolation.
Secure NFS support is for secure network file system sharing, unrelated to GPU resource partitioning.
Over-the-air updates are for updating software remotely, not for GPU resource management.
Therefore, to ensure application isolation on the same GPU in Fleet Command environments,
enabling MIG support (option C) is the recommended and standard practice.
This capability is emphasized in NVIDIA’s AI Operations and Fleet Command documentation for
managing edge AI deployments efficiently and securely.
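For context, MIG partitioning itself is performed on the GPU with nvidia-smi; the profile names below are examples for an A100 40GB and may differ on other GPUs:

    # Enable MIG mode on GPU 0 (may require stopping GPU clients and resetting the GPU)
    sudo nvidia-smi -i 0 -mig 1
    # Create two 3g.20gb GPU instances together with their default compute instances
    sudo nvidia-smi mig -i 0 -cgi 3g.20gb,3g.20gb -C
    # List the resulting GPU instances
    sudo nvidia-smi mig -i 0 -lgi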
You are deploying AI applications at the edge and want to ensure they continue running even if one
of the servers at an edge location fails.
How can you configure NVIDIA Fleet Command to achieve this?
C
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
To ensure continued operation of AI applications at the edge despite server failures, NVIDIA Fleet
Command allows administrators to enable high availability (HA) for edge clusters. This HA
configuration ensures redundancy and failover capabilities, so applications remain operational when
an edge server goes down.
Over-the-air updates handle software patching but do not inherently provide failover. MIG manages
GPU resource partitioning, not failover. Secure NFS supports storage redundancy but is not the
primary solution for application failover.
You are an administrator managing a large-scale Kubernetes-based GPU cluster using Run:AI.
To automate repetitive administrative tasks and efficiently manage resources across multiple nodes,
which of the following is essential when using the Run:AI Administrator CLI for environments where
automation or scripting is required?
C
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
When automating tasks with the Run:AI Administrator CLI, it is essential to ensure that the
Kubernetes configuration file (kubeconfig) is correctly set up with cluster administrative rights. This
enables the CLI to interact programmatically with the Kubernetes API for managing nodes, resources,
and workloads efficiently. Without proper administrative permissions in the kubeconfig, automated
operations will fail due to insufficient rights.
Manual GPU allocation is typically handled by scheduling policies rather than CLI manual
assignments. The CLI does not replace kubectl commands entirely, and installation on Windows is
not a critical requirement.
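A quick way to confirm that the kubeconfig in use carries cluster-administrative rights before running automation (standard kubectl commands, not Run:AI-specific):

    # Point the tooling at the intended cluster configuration
    export KUBECONFIG=$HOME/.kube/config
    kubectl config current-context
    # Verify that the current credentials allow cluster-wide administrative actions
    kubectl auth can-i '*' '*' --all-namespaces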
A Fleet Command system administrator wants to create an organization user that will have the
following rights:
For Locations - read only
For Applications - read/write/admin
For Deployments - read/write/admin
For Dashboards - read only
What role should the system administrator assign to this user?
A
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
The Fleet Command Operator role is designed to provide users with read-only access to locations
and dashboards while granting full read/write/admin rights for applications and deployments. This
matches the described access requirements where the user can manage applications and
deployments but only view locations and dashboards without modification rights. Other roles like
Fleet Command Admin have broader permissions, Supporter has more limited access, and Viewer is
primarily read-only for all resources.
An organization only needs basic network monitoring and validation tools.
Which UFM platform should they use?
B
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
The UFM Telemetry platform provides basic network monitoring and validation capabilities, making
it suitable for organizations that require foundational insight into their network status without
advanced analytics or AI-driven cybersecurity features. Other platforms such as UFM Enterprise or
UFM Pro offer broader or more advanced functionalities, while UFM Cyber-AI focuses on AI-driven
cybersecurity.
Your organization is running multiple AI models on a single A100 GPU using MIG in a multi-tenant
environment. One of the tenants reports a performance issue, but you notice that other tenants are
unaffected.
What feature of MIG ensures that one tenant's workload does not impact others?
A
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
NVIDIA's Multi-Instance GPU (MIG) technology provides hardware-level isolation of critical GPU
resources such as memory, cache, and compute units for each GPU instance. This ensures that
workloads running in one instance are fully isolated and cannot interfere with the performance of
workloads in other instances, supporting multi-tenancy without contention.
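Each MIG instance is exposed as its own device with a dedicated memory slice, so a workload can be pinned to a single instance without affecting the others (the UUID placeholder must be replaced with a value from nvidia-smi -L, and train.py stands in for the tenant's workload):

    # List GPUs and their MIG instances with UUIDs
    nvidia-smi -L
    # Run a workload on exactly one MIG instance; other instances remain isolated
    CUDA_VISIBLE_DEVICES=MIG-<uuid> python train.py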
You are deploying an AI workload on a Kubernetes cluster that requires access to GPUs for training
deep learning models. However, the pods are not able to detect the GPUs on the nodes.
What would be the first step to troubleshoot this issue?
A
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
The first step in troubleshooting Kubernetes pods that cannot detect GPUs is to verify whether the
NVIDIA GPU Operator is properly installed and running. The GPU Operator manages the installation
and configuration of all NVIDIA GPU components in the cluster, including drivers, device plugins, and
monitoring tools. Without it, pods will not have access to GPU resources. Ensuring correct
installation and operational status of the GPU Operator is essential before checking application-level
versions or resource allocations.
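As a first check (the gpu-operator namespace is the common default and may differ per installation), confirm the operator pods are healthy and that nodes advertise GPUs as an allocatable resource:

    # Confirm the GPU Operator and its operands (driver, device plugin, toolkit) are running
    kubectl get pods -n gpu-operator
    # Check that a GPU node now exposes nvidia.com/gpu as an allocatable resource
    kubectl describe node <node-name> | grep -i nvidia.com/gpu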