Proxmox GPU Configuration
Another kernel update and another problem.
I have also ditched Copilot and switched to Claude. Too many problems and dead ends.
Please find attached my Proxmox GPU configuration code. This code configures an LXC, not a VM, as I don't use VMs.
The code is a work in progress and has been through multiple rewrites, but it should now be robust enough to survive a kernel update on the Proxmox server.
Discretion is advised.
#!/bin/bash
# Proxmox VE 9.1 NVIDIA CUDA Installer (Debian 13 / Trixie)
#
# FEATURES:
# - Idempotent: Safe to run multiple times without breaking existing setup
# - DKMS-safe: Properly handles kernel module building and rebuilding
# - Secure Boot aware: Auto-detects and helps import MOK keys for signed modules
# - LXC-ready: Configures device nodes and permissions for container passthrough
# - Multi-kernel support: Installs headers for ALL installed kernels, not just current
# - Multi-GPU aware: Detects NVIDIA even when Intel/AMD iGPU present
# - Auto-repair: Optional systemd units to rebuild after kernel updates
# - Repository validation: Auto-detects Debian version and validates CUDA repo
# - Comprehensive error handling: Detailed logging, retry logic, error capture
# - Dry-run mode: Preview all changes before applying
#
# INSTALLATION SOURCE:
# - Uses ONLY CUDA repository (developer.download.nvidia.com)
# - Purges conflicting Debian/Proxmox NVIDIA stacks to prevent conflicts
# - Installs cuda-drivers and cuda-toolkit packages
#
# KERNEL SUPPORT:
# - Detects all installed proxmox-kernel-* and pve-kernel-* packages
# - Installs matching proxmox-headers-* for each kernel
# - Builds DKMS modules for all kernels simultaneously
# - Cleans up only orphaned headers (kernels no longer installed)
# - Preserves multi-kernel setups for safe fallback
#
# DEVICE NODE MANAGEMENT:
# - Creates udev rules for nvidia-uvm device nodes
# - Installs systemd service to enforce device nodes on boot
# - Handles /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-modeset, /dev/nvidia-uvm
# - Ensures proper permissions (0666) for LXC container access
#
# GPU DETECTION:
# - Scans all VGA/3D/Display controllers via lspci
# - Finds NVIDIA GPU even when not primary (e.g., Intel iGPU present)
# - Shows all detected GPUs for troubleshooting
#
# Usage:
# sudo ./setup-gpu-pxe.sh [OPTIONS]
#
# OPTIONS:
# --dry-run Show what would be done without making changes
# --purge Remove all NVIDIA/CUDA components (clean slate)
# --force-rebuild Force DKMS rebuild of NVIDIA modules
# --check-only Check current NVIDIA status without installing
# --install-autorun Install systemd units for automatic kernel update handling
# --help Show detailed help message
#
# EXAMPLES:
# # Standard installation (recommended for most users)
# sudo ./setup-gpu-pxe.sh
#
# # Installation with automatic kernel update support
# sudo ./setup-gpu-pxe.sh --install-autorun
#
# # Preview changes before applying
# sudo ./setup-gpu-pxe.sh --dry-run
#
# # Check current system status
# sudo ./setup-gpu-pxe.sh --check-only
#
# # Force rebuild after manual kernel update
# sudo ./setup-gpu-pxe.sh --force-rebuild
#
# # Complete removal for troubleshooting
# sudo ./setup-gpu-pxe.sh --purge
#
# WORKFLOW:
# 1. Validates prerequisites (systemctl, apt, dpkg, wget, lspci, modprobe, dkms)
# 2. Detects NVIDIA GPU (works with multi-GPU systems)
# 3. Detects all installed Proxmox kernels
# 4. Installs headers for ALL kernels (not just current)
# 5. Blacklists nouveau driver
# 6. Validates and adds CUDA repository
# 7. Purges conflicting Debian/Proxmox NVIDIA packages
# 8. Checks Secure Boot status and MOK key enrollment
# 9. Installs cuda-drivers and cuda-toolkit
# 10. Builds DKMS modules for all kernels
# 11. Creates udev rules and device nodes
# 12. Installs systemd service for device node enforcement
# 13. Enables nvidia-persistenced for stability
# 14. Verifies all components operational
# 15. Optionally installs kernel update auto-repair systemd units
# 16. Displays comprehensive health summary
#
# TROUBLESHOOTING:
# - Run with --check-only to see current system state
# - Check /var/log/syslog for DKMS build errors
# - Check /tmp/modprobe_err_*.log for module loading failures
# - Use --purge followed by fresh install for clean slate
# - Verify Secure Boot MOK key enrollment if modules won't load
#
# LXC CONTAINER CONFIGURATION:
# Add these lines to container config (/etc/pve/lxc/<VMID>.conf):
# lxc.cgroup2.devices.allow: c 195:* rwm
# lxc.cgroup2.devices.allow: c 507:* rwm
# lxc.cgroup2.devices.allow: c 510:* rwm
# lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
# lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
# lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
# lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
# lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
# lxc.mount.entry: /dev/nvidia-caps dev/nvidia-caps none bind,optional,create=dir
#
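# NOTE ON DEVICE MAJORS:
# 195 is the fixed major for the core NVIDIA nodes, but the nvidia-uvm (507
# above) and nvidia-caps (510 above) majors are allocated dynamically and can
# differ between hosts and driver versions. Check the actual values before
# copying the cgroup lines, for example:
#   grep nvidia /proc/devices
#   ls -l /dev/nvidia*
#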
# CHANGELOG (v2.2.5):
# - Fixed LXC configuration documentation to include ALL required cgroup devices
# - Added missing c 507:* rwm (nvidia-uvm devices)
# - Added missing c 510:* rwm (nvidia-caps devices)
# - Added missing /dev/nvidia-caps directory mount
# - Improved LXC config output with explanations of device numbers
# - Changed syntax from = to : for proper LXC config format
#
# CHANGELOG (v2.2.4):
# - Fixed module detection regex: now uses \s instead of space for proper matching
# - Fixed false-negative module loading detection
# - Improved persistence daemon error handling and diagnostics
# - Added better feedback when persistence daemon fails (non-critical)
# - More informative error messages during module verification
#
# CHANGELOG (v2.2.3):
# - Added apt install retry logic (3 attempts with exponential backoff)
# - Added apt update retry logic to handle transient network failures
# - Improved error messages with actionable troubleshooting steps
# - Better handling of Debian mirror timeouts and connection issues
# - All package installations now use retry mechanism automatically
#
# CHANGELOG (v2.2.2):
# - Fixed multi-kernel support: installs headers for ALL installed kernels
# - Fixed GPU detection: finds NVIDIA even with Intel/AMD iGPU present
# - Added comprehensive prerequisite validation
# - Added CUDA repository validation with auto-detection
# - Added download retry logic (3 attempts)
# - Improved error handling with detailed logging
# - Added cleanup trap for interrupted installations
# - Extended module load timeout from 5s to 10s
# - Added error log capture for failed modprobe attempts
# - Improved DRY_RUN support across all functions
# - Added --help flag with detailed documentation
# - Enhanced health summary with checkmarks and color coding
# - Only removes orphaned headers (kernel no longer installed)
# - Shows all detected GPUs during installation
# - Better handling of Secure Boot and MOK key enrollment
#
# VERSION: 2.2.5
# AUTHOR: Leon Scott
# LICENSE: MIT
# REPOSITORY: [Add your repo URL here]
#
SCRIPT_VERSION="2.2.5"
set -euo pipefail
# =========================
# Color + logging utilities
# =========================
info() { echo -e "\033[0;32m[INFO]\033[0m $*"; }
warn() { echo -e "\033[0;33m[WARN]\033[0m $*" >&2; }
error() { echo -e "\033[0;31m[ERROR]\033[0m $*" >&2; }
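# run: echo the command when DRY_RUN=1, otherwise execute it via eval.
# Commands are passed as strings, e.g. run "systemctl daemon-reload".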
run() {
if [[ "${DRY_RUN}" == "1" ]]; then
echo "+ $*"
else
eval "$@"
fi
}
need_root() { [[ "$(id -u)" -eq 0 ]] || { error "Run as root."; exit 1; }; }
# =========================
# Helper: apt install with retry logic
# =========================
apt_install_with_retry() {
local max_attempts=3
local wait_time=5
local packages="$*"
info "Installing packages: ${packages}"
for attempt in $(seq 1 $max_attempts); do
info "Attempt ${attempt}/${max_attempts}..."
if apt install -y ${packages}; then
info "Successfully installed: ${packages}"
return 0
else
local exit_code=$?
warn "Installation attempt ${attempt} failed (exit code: ${exit_code})"
if [[ $attempt -lt $max_attempts ]]; then
warn "Waiting ${wait_time} seconds before retry..."
warn "Running apt update to refresh repository metadata..."
apt update || warn "apt update failed, continuing anyway..."
sleep $wait_time
wait_time=$((wait_time * 2)) # Exponential backoff
fi
fi
done
error "Failed to install after ${max_attempts} attempts: ${packages}"
error "This may be due to:"
error " - Network connectivity issues"
error " - Debian mirror problems"
error " - Package dependency conflicts"
error ""
error "Suggested fixes:"
error " 1. Check network: ping -c3 deb.debian.org"
error " 2. Try different mirror: edit /etc/apt/sources.list"
error " 3. Run: apt update && apt install --fix-missing"
error " 4. Wait 15 minutes and retry (mirrors may be syncing)"
return 1
}
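# Example usage (as done later for the driver stack):
#   apt_install_with_retry cuda-drivers cuda-toolkit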
# =========================
# Helper: apt update with retry
# =========================
apt_update_with_retry() {
local max_attempts=3
local wait_time=3
for attempt in $(seq 1 $max_attempts); do
info "Running apt update (attempt ${attempt}/${max_attempts})..."
if apt update; then
info "Repository metadata updated successfully"
return 0
else
warn "apt update attempt ${attempt} failed"
if [[ $attempt -lt $max_attempts ]]; then
warn "Waiting ${wait_time} seconds before retry..."
sleep $wait_time
fi
fi
done
error "apt update failed after ${max_attempts} attempts"
error "Continuing anyway, but installation may fail..."
return 1
}
# =========================
# Defaults
# =========================
DRY_RUN=0
DO_PURGE=0
FORCE_REBUILD=0
CHECK_ONLY=0
INSTALL_AUTORUN=0
# Auto-detect Debian version or use override
detect_debian_version() {
if [[ -f /etc/os-release ]]; then
local version_id
version_id="$(grep '^VERSION_ID=' /etc/os-release | cut -d'=' -f2 | tr -d '"')"
case "$version_id" in
12) echo "debian12" ;;
13) echo "debian13" ;;
*)
warn "Unknown Debian version: $version_id, defaulting to debian12"
echo "debian12"
;;
esac
else
warn "/etc/os-release not found, defaulting to debian12"
echo "debian12"
fi
}
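# The detected codename can be overridden via the environment (run as root), e.g.:
#   CUDA_REPO_CODENAME=debian13 ./setup-gpu-pxe.sh --dry-run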
CUDA_REPO_CODENAME="${CUDA_REPO_CODENAME:-$(detect_debian_version)}"
CUDA_KEYRING_URL="https://developer.download.nvidia.com/compute/cuda/repos/${CUDA_REPO_CODENAME}/x86_64/cuda-keyring_1.1-1_all.deb"
CUDA_KEYRING_DEB="/tmp/cuda-keyring.deb"
CUDA_KEYRING_SHA256="/tmp/cuda-keyring.sha256"
NVIDIA_LIST_FILE="/etc/apt/sources.list.d/cuda-${CUDA_REPO_CODENAME}-x86_64.list"
NVIDIA_KEYRING_FILE="/usr/share/keyrings/cuda-archive-keyring.gpg"
KERNEL_VER="$(uname -r)"
# Use consistent naming: proxmox-headers for PVE 8.x+
HEADERS_PKG="proxmox-headers-${KERNEL_VER}"
UVM_RULE_FILE="/lib/udev/rules.d/71-nvidia-uvm.rules"
UVM_FIX_SCRIPT="/usr/local/sbin/fix-nvidia-uvm.sh"
UVM_FIX_SERVICE="/etc/systemd/system/fix-nvidia-uvm.service"
AUTORUN_SVC="/etc/systemd/system/proxmox-nvidia-autofix.service"
AUTORUN_PATH="/etc/systemd/system/proxmox-nvidia-autofix.path"
# Cleanup trap for interrupted operations
cleanup_on_exit() {
local exit_code=$?
if [[ $exit_code -ne 0 ]]; then
warn "Script interrupted or failed (exit code: $exit_code)"
warn "You may need to run with --purge to clean up partial installation"
fi
}
trap cleanup_on_exit EXIT
# =========================
# Parse args
# =========================
show_help() {
cat << EOF
Proxmox NVIDIA CUDA Installer v${SCRIPT_VERSION}
Usage: sudo $0 [OPTIONS]
OPTIONS:
--dry-run Show what would be done without making changes
--purge Remove all NVIDIA/CUDA components (clean slate)
--force-rebuild Force DKMS rebuild of NVIDIA modules
--check-only Check current NVIDIA status without installing
--install-autorun Install systemd units for automatic kernel update handling
--help Show this help message
EXAMPLES:
# Full installation
sudo $0
# Check status
sudo $0 --check-only
# Clean removal
sudo $0 --purge
# Reinstall after kernel update
sudo $0 --force-rebuild
EOF
}
for arg in "$@"; do
case "$arg" in
--dry-run) DRY_RUN=1 ;;
--purge) DO_PURGE=1 ;;
--force-rebuild) FORCE_REBUILD=1 ;;
--check-only) CHECK_ONLY=1 ;;
--install-autorun) INSTALL_AUTORUN=1 ;;
--help) show_help; exit 0 ;;
*) warn "Unknown arg: $arg (use --help for usage)" ;;
esac
done
need_root
info "Proxmox NVIDIA Installer — version ${SCRIPT_VERSION}"
info "DRY_RUN=${DRY_RUN}, PURGE=${DO_PURGE}, FORCE_REBUILD=${FORCE_REBUILD}, CHECK_ONLY=${CHECK_ONLY}"
info "Kernel=${KERNEL_VER}, CUDA Repo=${CUDA_REPO_CODENAME}"
# =========================
# Validate prerequisites
# =========================
validate_prerequisites() {
local missing=()
for cmd in systemctl apt dpkg wget lspci modprobe dkms; do
if ! command -v "$cmd" >/dev/null 2>&1; then
missing+=("$cmd")
fi
done
if [[ ${#missing[@]} -gt 0 ]]; then
error "Missing required commands: ${missing[*]}"
exit 1
fi
info "All prerequisites validated"
}
# =========================
# Helper: systemd health
# =========================
check_systemd_health() {
info "Checking systemd failed units..."
run "systemctl --failed || true"
}
# =========================
# Secure Boot check + auto-import
# =========================
check_secure_boot() {
if ! command -v mokutil >/dev/null 2>&1; then
warn "mokutil not available; skipping Secure Boot check."
return
fi
local sb_state
sb_state="$(mokutil --sb-state 2>/dev/null || true)"
info "Secure Boot state: ${sb_state}"
if echo "$sb_state" | grep -qi "enabled"; then
if [[ -f /var/lib/dkms/mok.pub ]]; then
warn "Secure Boot enabled. Attempting to auto-import DKMS MOK key..."
if [[ -f /var/lib/dkms/mok.key ]]; then
info "MOK key pair found, importing..."
run "mokutil --import /var/lib/dkms/mok.pub"
warn "Reboot required. Approve the key enrollment in firmware once."
else
warn "MOK public key exists but private key not found."
warn "Key will be generated during DKMS build."
fi
else
warn "DKMS MOK key not found at /var/lib/dkms/mok.pub."
warn "It will be generated during DKMS builds."
fi
fi
}
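# Pending MOK requests can be reviewed before rebooting with "mokutil --list-new";
# the firmware's MOK manager prompt on the next boot completes the enrollment.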
# =========================
# Detect GPU vendor
# ==========================
detect_gpu_vendor() {
local all_vendors
all_vendors="$(lspci -nn | grep -E 'VGA|3D|Display' | grep -oE 'NVIDIA|AMD|Intel' || true)"
if echo "$all_vendors" | grep -q 'NVIDIA'; then
info "Detected NVIDIA GPU"
# Show all GPUs found
lspci -nn | grep -E 'VGA|3D|Display' | while read -r line; do
info " Found: $line"
done
return 0
else
warn "No NVIDIA GPU detected. Found GPUs:"
lspci -nn | grep -E 'VGA|3D|Display' | while read -r line; do
warn " $line"
done
exit 0
fi
}
# ==========================
# Ensure headers
# ==========================
ensure_headers_and_update() {
info "Checking for Proxmox kernel headers..."
KERNEL_VER=$(uname -r)
HEADERS_PKG="proxmox-headers-${KERNEL_VER}"
# Get all installed Proxmox kernels
info "Detecting all installed Proxmox kernels..."
local installed_kernels
installed_kernels=$(dpkg -l | awk '/^ii.*proxmox-kernel-/{print $2}' | sed 's/proxmox-kernel-//' || true)
if [[ -z "$installed_kernels" ]]; then
warn "No proxmox-kernel packages found, checking for pve-kernel..."
installed_kernels=$(dpkg -l | awk '/^ii.*pve-kernel-/{print $2}' | sed 's/pve-kernel-//' || true)
fi
if [[ -n "$installed_kernels" ]]; then
info "Found installed kernels:"
echo "$installed_kernels" | while read -r kver; do
info " - $kver"
done
# Install headers for all installed kernels
echo "$installed_kernels" | while read -r kver; do
local header_pkg="proxmox-headers-${kver}"
if dpkg -l | grep -q "^ii[[:space:]]*${header_pkg}"; then
info "Headers already installed: ${header_pkg}"
else
info "Installing headers for kernel ${kver}: ${header_pkg}"
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would install ${header_pkg}"
else
if apt_install_with_retry "${header_pkg}"; then
info "Successfully installed ${header_pkg}"
else
warn "Failed to install ${header_pkg} - may not be available"
fi
fi
fi
done
else
# Fallback: just install for current kernel
warn "Could not detect installed kernels, installing for current kernel only"
if dpkg -l | grep -q "^ii[[:space:]]*${HEADERS_PKG}"; then
info "Headers already installed: ${HEADERS_PKG}"
else
info "Installing headers: ${HEADERS_PKG}"
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would install ${HEADERS_PKG}"
else
if ! apt_install_with_retry "${HEADERS_PKG}"; then
error "Failed to install ${HEADERS_PKG}. DKMS may not build correctly."
error "Available header packages:"
apt-cache search "^proxmox-headers-" | head -5
return 1
fi
fi
fi
fi
# Rebuild DKMS for all kernels
if [[ "${DRY_RUN}" -eq 0 ]]; then
info "Rebuilding DKMS modules for all kernels..."
dkms autoinstall || warn "DKMS rebuild had warnings"
fi
# Clean up headers for kernels that are no longer installed
info "Checking for orphaned kernel headers..."
local all_installed_headers
all_installed_headers=$(dpkg -l | awk '/^ii/{print $2}' | grep '^proxmox-headers-' || true)
if [[ -n "$all_installed_headers" ]]; then
echo "$all_installed_headers" | while read -r header_pkg; do
local kver="${header_pkg#proxmox-headers-}"
local kernel_pkg="proxmox-kernel-${kver}"
# Check if corresponding kernel is still installed
if ! dpkg -l | grep -q "^ii[[:space:]]*${kernel_pkg}"; then
# Try alternate naming
kernel_pkg="pve-kernel-${kver}"
if ! dpkg -l | grep -q "^ii[[:space:]]*${kernel_pkg}"; then
info "Removing orphaned headers (kernel no longer installed): ${header_pkg}"
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would purge ${header_pkg}"
else
apt purge -y "${header_pkg}" || warn "Failed to purge ${header_pkg}"
fi
fi
fi
done
fi
}
# ==========================
# Blacklist nouveau
# ==========================
ensure_blacklist_nouveau() {
local modprobe_conf="/etc/modprobe.d/blacklist-nouveau.conf"
if [[ ! -f "$modprobe_conf" ]]; then
info "Blacklisting nouveau and updating initramfs..."
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would create ${modprobe_conf} and update initramfs"
else
run "tee ${modprobe_conf} >/dev/null <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF"
run "update-initramfs -u"
fi
else
info "nouveau already blacklisted at ${modprobe_conf}"
fi
}
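# After the next reboot, "lsmod | grep nouveau" should print nothing if the
# blacklist and rebuilt initramfs took effect.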
# ==========================
# Validate CUDA repo availability
# ==========================
validate_cuda_repo() {
info "Validating CUDA repository availability..."
local test_url="https://developer.download.nvidia.com/compute/cuda/repos/${CUDA_REPO_CODENAME}/x86_64/"
if [[ "${DRY_RUN}" -eq 0 ]]; then
if wget --spider -q "$test_url" 2>/dev/null; then
info "CUDA repository for ${CUDA_REPO_CODENAME} is accessible"
return 0
else
error "CUDA repository for ${CUDA_REPO_CODENAME} is not accessible"
error "URL: $test_url"
error "You may need to adjust CUDA_REPO_CODENAME environment variable"
return 1
fi
else
info "(dry-run) Would validate CUDA repo: $test_url"
fi
}
# ==========================
# Ensure CUDA repo (no duplicates)
# ==========================
ensure_cuda_repo() {
validate_cuda_repo || return 1
if [[ ! -f "${NVIDIA_KEYRING_FILE}" ]]; then
info "Installing CUDA keyring..."
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would download and install CUDA keyring"
else
# Download with retry logic
local max_retries=3
local retry=0
while [[ $retry -lt $max_retries ]]; do
if wget -q "${CUDA_KEYRING_URL}" -O "${CUDA_KEYRING_DEB}"; then
break
fi
retry=$((retry + 1))
warn "Download failed, retry $retry/$max_retries..."
sleep 2
done
if [[ ! -f "${CUDA_KEYRING_DEB}" ]]; then
error "Failed to download CUDA keyring after $max_retries attempts"
return 1
fi
run "dpkg -i '${CUDA_KEYRING_DEB}'"
rm -f "${CUDA_KEYRING_DEB}"
fi
else
info "CUDA keyring present: ${NVIDIA_KEYRING_FILE}"
fi
if [[ ! -f "${NVIDIA_LIST_FILE}" ]]; then
info "Creating CUDA repo sources list: ${NVIDIA_LIST_FILE}"
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would create ${NVIDIA_LIST_FILE}"
else
run "tee '${NVIDIA_LIST_FILE}' >/dev/null <<EOF
deb [signed-by=${NVIDIA_KEYRING_FILE}] https://developer.download.nvidia.com/compute/cuda/repos/${CUDA_REPO_CODENAME}/x86_64/ /
EOF"
fi
else
info "CUDA repo sources list already present: ${NVIDIA_LIST_FILE}"
fi
if [[ -f "/etc/apt/sources.list.d/nvidia.list" ]]; then
warn "Removing old /etc/apt/sources.list.d/nvidia.list to avoid duplicate repo entries..."
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would remove /etc/apt/sources.list.d/nvidia.list"
else
run "rm -f /etc/apt/sources.list.d/nvidia.list"
fi
fi
info "apt update (CUDA repo)..."
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would run apt update"
else
apt_update_with_retry
fi
}
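# Repo pickup can be confirmed afterwards with: apt-cache policy cuda-drivers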
# ==========================
# Purge conflicting Debian/Proxmox NVIDIA bits
# ===========================
purge_conflicting_stacks() {
info "Ensuring no Debian/Proxmox NVIDIA stacks are present..."
if dpkg -l | awk '{print $2}' | grep -qx 'nvidia-kernel-common'; then
warn "Found nvidia-kernel-common (Debian stack). Purging..."
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would purge nvidia-kernel-common"
else
run "apt purge -y nvidia-kernel-common || true"
run "apt autoremove -y || true"
fi
fi
if dpkg -l | awk '{print $2}' | grep -qx 'pve-nvidia-vgpu-helper'; then
warn "Found pve-nvidia-vgpu-helper. Purging to avoid stack conflicts..."
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would purge pve-nvidia-vgpu-helper"
else
run "apt purge -y pve-nvidia-vgpu-helper || true"
run "apt autoremove -y || true"
fi
fi
}
# ==========================
# Purge ALL NVIDIA/CUDA stacks (clean slate)
# ==========================
purge_nvidia_cuda_all() {
info "Purging ALL NVIDIA/CUDA/Proxmox NVIDIA stacks..."
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would stop and disable all NVIDIA services"
info "(dry-run) Would remove all NVIDIA packages and DKMS modules"
info "(dry-run) Would clean up systemd units and device nodes"
return
fi
# Stop and disable services first
run "systemctl stop nvidia-persistenced || true"
run "systemctl disable nvidia-persistenced || true"
run "systemctl stop pve-nvidia-vgpu-helper || true"
run "systemctl disable pve-nvidia-vgpu-helper || true"
run "systemctl daemon-reload || true"
# Remove persistence daemon units and symlinks
run "rm -f /etc/systemd/system/nvidia-persistenced.service || true"
run "rm -f /etc/systemd/system/multi-user.target.wants/nvidia-persistenced.service || true"
run "rm -f /usr/lib/systemd/system/nvidia-persistenced.service || true"
run "rm -f /lib/systemd/system/nvidia-persistenced.service || true"
# Remove UVM fixer unit and script
run "systemctl disable fix-nvidia-uvm.service || true"
run "rm -f '${UVM_FIX_SERVICE}' '${UVM_FIX_SCRIPT}' || true"
# Remove autorun units
run "systemctl disable proxmox-nvidia-autofix.path proxmox-nvidia-autofix.service || true"
run "rm -f '${AUTORUN_SVC}' '${AUTORUN_PATH}' || true"
run "systemctl daemon-reload || true"
# Remove custom UVM udev rule
run "rm -f '${UVM_RULE_FILE}' || true"
# Purge all relevant packages from both Debian/Proxmox and CUDA stacks
run "apt purge -y 'nvidia-*' 'cuda*' 'libnvidia-*' 'xserver-xorg-video-nvidia*' 'pve-nvidia-*' || true"
# DKMS cleanup
run "dkms status | grep -i nvidia || true"
run "dkms remove -m nvidia -v all --all || true"
run "rm -rf /var/lib/dkms/nvidia || true"
# CUDA leftovers
run "rm -rf /usr/local/cuda* || true"
# Clean package cache
run "apt autoremove -y || true"
run "apt clean || true"
info "Purge complete. You should reboot before reinstalling."
}
# ==========================
# Install drivers + toolkit (CUDA repo only)
# ==========================
install_nvidia_stack() {
info "Installing NVIDIA driver stack and CUDA toolkit from CUDA repo..."
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would install cuda-drivers cuda-toolkit"
else
if ! apt_install_with_retry cuda-drivers cuda-toolkit; then
error "Failed to install NVIDIA packages after multiple attempts"
error "Check the error messages above for specific issues"
return 1
fi
info "Verifying DKMS build status..."
run "dkms status || true"
fi
}
# ==========================
# Force DKMS rebuild
# ==========================
force_rebuild() {
info "Forcing DKMS rebuild for NVIDIA module..."
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would remove and rebuild NVIDIA DKMS modules"
else
run "dkms remove -m nvidia -v all --all || true"
run "dkms autoinstall || true"
info "Reloading NVIDIA modules..."
run "modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia || true"
run "modprobe nvidia || true"
fi
}
# ==========================
# UVM device node enforcement
# ==========================
install_uvm_fix_systemd() {
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would install UVM fixer systemd unit and script."
info "(dry-run) Would enable fix-nvidia-uvm.service."
return
fi
info "Installing systemd helper to enforce /dev/nvidia-uvm nodes..."
run "tee '${UVM_FIX_SCRIPT}' >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Exit silently if UVM module is not loaded
if ! lsmod | grep -q '^nvidia_uvm'; then
exit 0
fi
major=\$(awk '\$2==\"nvidia-uvm\" {print \$1}' /proc/devices || true)
if [[ -z "\$major" ]]; then
exit 0
fi
if [[ ! -e /dev/nvidia-uvm ]]; then
mknod -m 666 /dev/nvidia-uvm c "\$major" 0
fi
if [[ ! -e /dev/nvidia-uvm-tools ]]; then
mknod -m 666 /dev/nvidia-uvm-tools c "\$major" 1
fi
EOF"
run "chmod +x '${UVM_FIX_SCRIPT}'"
run "tee '${UVM_FIX_SERVICE}' >/dev/null <<EOF
[Unit]
Description=Ensure /dev/nvidia-uvm device nodes exist
After=multi-user.target
ConditionPathExists=/proc/devices
[Service]
Type=oneshot
ExecStart=${UVM_FIX_SCRIPT}
[Install]
WantedBy=multi-user.target
EOF"
run "systemctl daemon-reload"
run "systemctl enable --now fix-nvidia-uvm.service"
info "UVM fixer systemd unit installed and active."
}
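# The helper can be re-checked at any time with:
#   systemctl status fix-nvidia-uvm.service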
ensure_uvm_udev_rules_and_nodes() {
info "Ensuring udev rules and device nodes for nvidia-uvm..."
# Create udev rule if missing
if [[ ! -f "${UVM_RULE_FILE}" ]]; then
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would create UVM udev rule at ${UVM_RULE_FILE}"
else
info "Creating UVM udev rule at ${UVM_RULE_FILE}..."
run "tee '${UVM_RULE_FILE}' >/dev/null <<'EOF'
KERNEL=="nvidia-uvm", MODE="0666"
KERNEL=="nvidia-uvm-tools", MODE="0666"
EOF"
fi
else
info "UVM udev rule already present: ${UVM_RULE_FILE}"
fi
if [[ "${DRY_RUN}" -eq 0 ]]; then
run "udevadm control --reload-rules || true"
run "udevadm trigger || true"
run "udevadm settle || true"
# Load module with better error handling
if ! lsmod | grep -q '^nvidia_uvm'; then
info "Loading nvidia_uvm module..."
if ! modprobe nvidia_uvm 2>/dev/null; then
warn "Failed to load nvidia_uvm module. This may be normal if nvidia driver isn't fully installed yet."
fi
fi
local major
major="$(awk '$2=="nvidia-uvm" {print $1}' /proc/devices || true)"
if [[ -z "${major}" ]]; then
warn "nvidia-uvm major not found in /proc/devices; device nodes may not be creatable yet."
else
if [[ ! -e /dev/nvidia-uvm ]]; then
info "Creating /dev/nvidia-uvm (major ${major}, minor 0)..."
run "mknod -m 666 /dev/nvidia-uvm c ${major} 0"
fi
if [[ ! -e /dev/nvidia-uvm-tools ]]; then
info "Creating /dev/nvidia-uvm-tools (major ${major}, minor 1)..."
run "mknod -m 666 /dev/nvidia-uvm-tools c ${major} 1"
fi
fi
else
info "(dry-run) Would reload udev rules and trigger."
info "(dry-run) Would modprobe nvidia_uvm and create /dev/nvidia-uvm* if needed."
fi
install_uvm_fix_systemd
}
# ==========================
# Helper: DKMS + device + driver check
# ==========================
verify_nvidia_ready() {
info "Verifying NVIDIA kernel modules and device nodes..."
# MODULE LOAD TIMING DELAY - increased from 5 to 10 seconds
if [[ "${DRY_RUN}" -eq 0 ]]; then
info "Waiting for NVIDIA modules to load (up to 10 seconds)..."
for i in {1..10}; do
sleep 1
if lsmod | grep -q '^nvidia\s'; then
info "NVIDIA module loaded after ${i} seconds"
break
fi
if [[ $i -eq 10 ]]; then
warn "NVIDIA module not detected after 10 seconds"
fi
done
fi
if [[ "${DRY_RUN}" -eq 0 ]]; then
# Try to load modules with better error capture
local failed_modules=()
for mod in nvidia nvidia_modeset nvidia_drm nvidia_uvm; do
if ! lsmod | grep -q "^${mod}\s"; then
info "Attempting to load module: ${mod}"
if ! modprobe "${mod}" 2>/tmp/modprobe_err_${mod}.log; then
failed_modules+=("${mod}")
warn "Failed to load ${mod}. Error log:"
cat "/tmp/modprobe_err_${mod}.log" 2>/dev/null || true
fi
fi
done
local missing_modules=()
for mod in nvidia nvidia_modeset nvidia_drm nvidia_uvm; do
if ! lsmod | grep -q "^${mod}\s"; then
missing_modules+=("${mod}")
fi
done
if (( ${#missing_modules[@]} > 0 )); then
error "Missing loaded NVIDIA modules: ${missing_modules[*]}"
if (( ${#failed_modules[@]} > 0 )); then
error "Failed to load: ${failed_modules[*]}"
error "Check /tmp/modprobe_err_*.log for details"
fi
info "Showing current lsmod output:"
run "lsmod | grep -i nvidia || true"
info "Showing DKMS status:"
run "dkms status || true"
return 1
else
info "All core NVIDIA modules are loaded."
fi
local missing_devs=()
for dev in /dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset /dev/nvidia-uvm /dev/nvidia-uvm-tools; do
if [[ ! -e "${dev}" ]]; then
missing_devs+=("${dev}")
fi
done
if (( ${#missing_devs[@]} > 0 )); then
warn "Missing NVIDIA device nodes: ${missing_devs[*]}"
run "ls -l /dev/nvidia* 2>/dev/null || true"
return 1
else
info "All expected NVIDIA device nodes are present."
fi
else
info "(dry-run) Would verify modules: nvidia, nvidia_modeset, nvidia_drm, nvidia_uvm."
info "(dry-run) Would verify device nodes: /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-modeset, /dev/nvidia-uvm, /dev/nvidia-uvm-tools."
fi
if [[ "${DRY_RUN}" -eq 0 ]]; then
if command -v nvidia-smi >/dev/null 2>&1; then
info "Running nvidia-smi..."
run "nvidia-smi || true"
else
warn "nvidia-smi not found in PATH."
fi
local persist_status
persist_status="$(systemctl is-active nvidia-persistenced 2>/dev/null || true)"
if [[ "$persist_status" != "active" ]]; then
warn "nvidia-persistenced is not active (status: $persist_status)"
else
info "nvidia-persistenced is active."
fi
else
info "(dry-run) Would run nvidia-smi and check nvidia-persistenced."
fi
return 0
}
# ==========================
# Enable persistence daemon
# ==========================
enable_persistence() {
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would enable and start nvidia-persistenced and set persistence mode."
return
fi
info "Enabling NVIDIA Persistence Daemon..."
# Check if nvidia-persistenced binary exists
if ! command -v nvidia-persistenced >/dev/null 2>&1; then
warn "nvidia-persistenced binary not found. It may not be installed yet."
warn "Persistence daemon will be configured on next reboot."
return
fi
run "systemctl enable nvidia-persistenced || true"
run "systemctl restart nvidia-persistenced || systemctl start nvidia-persistenced || true"
# Check if it started successfully
if systemctl is-active --quiet nvidia-persistenced; then
info "nvidia-persistenced is running"
run "systemctl status --no-pager nvidia-persistenced || true"
else
warn "nvidia-persistenced failed to start. Checking logs..."
run "journalctl -u nvidia-persistenced -n 20 --no-pager || true"
warn "This is usually not critical. The daemon will start on next reboot."
fi
info "Setting persistence mode for the GPU..."
if ! nvidia-smi -pm 1; then
warn "Failed to set persistence mode. This may require a reboot."
warn "You can manually enable it later with: nvidia-smi -pm 1"
else
info "GPU persistence mode enabled successfully"
fi
}
# ==========================
# Check NVIDIA readiness for LXC passthrough
# ==========================
check_nvidia_lxc_ready() {
info "Checking NVIDIA components for LXC container passthrough..."
if [[ "${DRY_RUN}" -eq 0 ]]; then
if ls /dev/nvidia* >/dev/null 2>&1; then
run "ls -l /dev/nvidia*"
else
warn "No /dev/nvidia* device nodes found."
fi
info "Checking for required NVIDIA libraries (host-side):"
run "ldconfig -p | grep -E 'libcuda.so|libnvidia-ml.so' || true"
else
info "(dry-run) Would list /dev/nvidia* and check libcuda/libnvidia-ml."
fi
echo ""
echo "LXC config entries to enable GPU (add to /etc/pve/lxc/<VMID>.conf):"
echo ""
echo "# Allow access to NVIDIA device nodes"
echo "lxc.cgroup2.devices.allow: c 195:* rwm"
echo "lxc.cgroup2.devices.allow: c 507:* rwm"
echo "lxc.cgroup2.devices.allow: c 510:* rwm"
echo ""
echo "# Mount NVIDIA device nodes into container"
echo "lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file"
echo "lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file"
echo "lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file"
echo "lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file"
echo "lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file"
echo "lxc.mount.entry: /dev/nvidia-caps dev/nvidia-caps none bind,optional,create=dir"
echo ""
echo "Device explanations:"
echo " 195:* = Main GPU devices (nvidia0, nvidiactl, nvidia-modeset)"
echo " 507:* = CUDA/UVM devices (nvidia-uvm, nvidia-uvm-tools)"
echo " 510:* = Capability devices (nvidia-caps/nvidia-cap1, nvidia-cap2)"
echo ""
}
# ==========================
# Install systemd autorun for kernel updates
# ==========================
install_systemd_autorun() {
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "(dry-run) Would install proxmox-nvidia-autofix.{service,path} and enable .path unit."
return
fi
if [[ -f "${AUTORUN_SVC}" ]]; then
info "systemd autorun already installed."
return
fi
info "Installing systemd autorun for kernel updates..."
# Validate script location
local script_path
script_path="$(realpath "$0")"
if [[ ! -f "$script_path" ]]; then
error "Cannot resolve script path: $0"
return 1
fi
run "ln -sf \"${script_path}\" /usr/local/sbin/proxmox-nvidia.sh"
run "tee '${AUTORUN_SVC}' >/dev/null <<EOF
[Unit]
Description=Auto-run NVIDIA installer after kernel updates
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/proxmox-nvidia.sh --force-rebuild
EOF"
run "tee '${AUTORUN_PATH}' >/dev/null <<EOF
[Unit]
Description=Watch for kernel updates
[Path]
PathChanged=/boot
Unit=proxmox-nvidia-autofix.service
[Install]
WantedBy=multi-user.target
EOF"
run "systemctl daemon-reload"
run "systemctl enable --now proxmox-nvidia-autofix.path"
info "systemd autorun installed and active."
}
# ==========================
# Health Summary Block
# ===========================
health_summary() {
echo -e "\n\033[1;36m=== NVIDIA HEALTH SUMMARY ===\033[0m"
# Driver installed
if command -v nvidia-smi >/dev/null 2>&1; then
echo -e "\033[0;32m✓ Driver installed:\033[0m YES"
else
echo -e "\033[0;31m✗ Driver installed:\033[0m NO"
fi
# Modules loaded
local all_mods_loaded=true
for mod in nvidia nvidia_modeset nvidia_drm nvidia_uvm; do
if ! lsmod | grep -q "^${mod}\s"; then
all_mods_loaded=false
fi
done
if $all_mods_loaded; then
echo -e "\033[0;32m✓ Modules loaded:\033[0m YES"
else
echo -e "\033[0;31m✗ Modules loaded:\033[0m NO"
fi
# Device nodes
local all_nodes=true
for dev in /dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset /dev/nvidia-uvm /dev/nvidia-uvm-tools; do
[[ -e "$dev" ]] || all_nodes=false
done
if $all_nodes; then
echo -e "\033[0;32m✓ Device nodes present:\033[0m YES"
else
echo -e "\033[0;31m✗ Device nodes present:\033[0m NO"
fi
# Persistence daemon
if systemctl is-active --quiet nvidia-persistenced 2>/dev/null; then
echo -e "\033[0;32m✓ Persistence daemon:\033[0m ACTIVE"
else
echo -e "\033[0;33m⚠ Persistence daemon:\033[0m INACTIVE"
fi
# CUDA libs
if ldconfig -p 2>/dev/null | grep -q libcuda.so; then
echo -e "\033[0;32m✓ CUDA libraries:\033[0m PRESENT"
else
echo -e "\033[0;31m✗ CUDA libraries:\033[0m MISSING"
fi
# DKMS
if dkms status 2>/dev/null | grep -q nvidia; then
echo -e "\033[0;32m✓ DKMS status:\033[0m OK"
else
echo -e "\033[0;31m✗ DKMS status:\033[0m MISSING"
fi
# LXC readiness
if [[ -e /dev/nvidia0 ]]; then
echo -e "\033[0;32m✓ LXC passthrough readiness:\033[0m OK"
else
echo -e "\033[0;31m✗ LXC passthrough readiness:\033[0m NO"
fi
echo -e "\033[1;36m=============================\033[0m\n"
}
# ==========================
# Main execution flow
# ==========================
main() {
validate_prerequisites
if [[ "$CHECK_ONLY" -eq 1 ]]; then
info "CHECK-ONLY mode."
verify_nvidia_ready || true
check_nvidia_lxc_ready
check_systemd_health
health_summary
exit 0
fi
if [[ "$DO_PURGE" -eq 1 ]]; then
purge_nvidia_cuda_all
info "Purge complete. Reboot is strongly recommended before reinstall."
exit 0
fi
detect_gpu_vendor
ensure_headers_and_update
ensure_blacklist_nouveau
ensure_cuda_repo
purge_conflicting_stacks
check_secure_boot
install_nvidia_stack
if [[ "$FORCE_REBUILD" -eq 1 ]]; then
force_rebuild
fi
ensure_uvm_udev_rules_and_nodes
if ! verify_nvidia_ready; then
error "NVIDIA driver verification failed."
check_systemd_health
health_summary
exit 1
fi
enable_persistence
if ! verify_nvidia_ready; then
error "Post-persistence verification failed."
check_systemd_health
health_summary
exit 1
fi
check_systemd_health
check_nvidia_lxc_ready
if [[ "$INSTALL_AUTORUN" -eq 1 ]]; then
install_systemd_autorun
fi
health_summary
info "Installation complete."
info "Reboot may be required after first Secure Boot MOK import."
info "After approval, future kernel updates and reboots will preserve UVM nodes via systemd helper."
}
main "$@"Hope this helps someone
Now to fix the mistake I made in the SSH configuration.
#enoughsaid