Proxmox GPU Configuration

Another kernel update and another problem

I have also ditched Copilot and switched to Claude. Too many problems and dead ends.

Please find attached my Proxmox GPU configuration code. This code is for configuring an LXC, not a VM, as I don't use VMs.

The code is a work in progress and has been through multiple rewrites, but it should now be robust enough to survive a kernel update on the Proxmox server.

Discretion is advised.
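
For reference, this is how I usually run it (the options are documented in the script header below); treat it as a suggestion rather than gospel:

sudo ./setup-gpu-pxe.sh --dry-run          # preview what it would change
sudo ./setup-gpu-pxe.sh --install-autorun  # real run, plus the kernel-update auto-repair units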

#!/bin/bash

# Proxmox VE 9.1 NVIDIA CUDA Installer (Debian 13 / Trixie)
# 
# FEATURES:
# - Idempotent: Safe to run multiple times without breaking existing setup
# - DKMS-safe: Properly handles kernel module building and rebuilding
# - Secure Boot aware: Auto-detects and helps import MOK keys for signed modules
# - LXC-ready: Configures device nodes and permissions for container passthrough
# - Multi-kernel support: Installs headers for ALL installed kernels, not just current
# - Multi-GPU aware: Detects NVIDIA even when Intel/AMD iGPU present
# - Auto-repair: Optional systemd units to rebuild after kernel updates
# - Repository validation: Auto-detects Debian version and validates CUDA repo
# - Comprehensive error handling: Detailed logging, retry logic, error capture
# - Dry-run mode: Preview all changes before applying
#
# INSTALLATION SOURCE:
# - Uses ONLY CUDA repository (developer.download.nvidia.com)
# - Purges conflicting Debian/Proxmox NVIDIA stacks to prevent conflicts
# - Installs cuda-drivers and cuda-toolkit packages
#
# KERNEL SUPPORT:
# - Detects all installed proxmox-kernel-* and pve-kernel-* packages
# - Installs matching proxmox-headers-* for each kernel
# - Builds DKMS modules for all kernels simultaneously
# - Cleans up only orphaned headers (kernels no longer installed)
# - Preserves multi-kernel setups for safe fallback
#
# DEVICE NODE MANAGEMENT:
# - Creates udev rules for nvidia-uvm device nodes
# - Installs systemd service to enforce device nodes on boot
# - Handles /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-modeset, /dev/nvidia-uvm
# - Ensures proper permissions (0666) for LXC container access
#
# GPU DETECTION:
# - Scans all VGA/3D/Display controllers via lspci
# - Finds NVIDIA GPU even when not primary (e.g., Intel iGPU present)
# - Shows all detected GPUs for troubleshooting
#
# Usage:
#   sudo ./setup-gpu-pxe.sh [OPTIONS]
#
# OPTIONS:
#   --dry-run           Show what would be done without making changes
#   --purge             Remove all NVIDIA/CUDA components (clean slate)
#   --force-rebuild     Force DKMS rebuild of NVIDIA modules
#   --check-only        Check current NVIDIA status without installing
#   --install-autorun   Install systemd units for automatic kernel update handling
#   --help              Show detailed help message
#
# EXAMPLES:
#   # Standard installation (recommended for most users)
#   sudo ./setup-gpu-pxe.sh
#
#   # Installation with automatic kernel update support
#   sudo ./setup-gpu-pxe.sh --install-autorun
#
#   # Preview changes before applying
#   sudo ./setup-gpu-pxe.sh --dry-run
#
#   # Check current system status
#   sudo ./setup-gpu-pxe.sh --check-only
#
#   # Force rebuild after manual kernel update
#   sudo ./setup-gpu-pxe.sh --force-rebuild
#
#   # Complete removal for troubleshooting
#   sudo ./setup-gpu-pxe.sh --purge
#
# WORKFLOW:
#   1. Validates prerequisites (systemctl, apt, dpkg, wget, lspci, modprobe, dkms)
#   2. Detects NVIDIA GPU (works with multi-GPU systems)
#   3. Detects all installed Proxmox kernels
#   4. Installs headers for ALL kernels (not just current)
#   5. Blacklists nouveau driver
#   6. Validates and adds CUDA repository
#   7. Purges conflicting Debian/Proxmox NVIDIA packages
#   8. Checks Secure Boot status and MOK key enrollment
#   9. Installs cuda-drivers and cuda-toolkit
#  10. Builds DKMS modules for all kernels
#  11. Creates udev rules and device nodes
#  12. Installs systemd service for device node enforcement
#  13. Enables nvidia-persistenced for stability
#  14. Verifies all components operational
#  15. Optionally installs kernel update auto-repair systemd units
#  16. Displays comprehensive health summary
#
# TROUBLESHOOTING:
#   - Run with --check-only to see current system state
#   - Check /var/log/syslog for DKMS build errors
#   - Check /tmp/modprobe_err_*.log for module loading failures
#   - Use --purge followed by fresh install for clean slate
#   - Verify Secure Boot MOK key enrollment if modules won't load
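#
#   A rough first-pass sequence I'd try (adjust to taste):
#     sudo ./setup-gpu-pxe.sh --check-only                     # snapshot of current state
#     dkms status                                              # did the nvidia module build for your kernel?
#     journalctl -u fix-nvidia-uvm.service -n 20 --no-pager    # logs from the UVM device-node helper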
#
# LXC CONTAINER CONFIGURATION:
#   Add these lines to container config (/etc/pve/lxc/<VMID>.conf):
#     lxc.cgroup2.devices.allow: c 195:* rwm
#     lxc.cgroup2.devices.allow: c 507:* rwm
#     lxc.cgroup2.devices.allow: c 510:* rwm
#     lxc.mount.entry: /dev/nvidia0          dev/nvidia0          none bind,optional,create=file
#     lxc.mount.entry: /dev/nvidiactl        dev/nvidiactl        none bind,optional,create=file
#     lxc.mount.entry: /dev/nvidia-modeset   dev/nvidia-modeset   none bind,optional,create=file
#     lxc.mount.entry: /dev/nvidia-uvm       dev/nvidia-uvm       none bind,optional,create=file
#     lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
#     lxc.mount.entry: /dev/nvidia-caps      dev/nvidia-caps      none bind,optional,create=dir
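#
#   Note: 195 is the fixed NVIDIA character-device major (nvidia0, nvidiactl,
#   nvidia-modeset), but the nvidia-uvm and nvidia-caps majors (507/510 above)
#   are assigned dynamically, so confirm yours before copying these lines:
#     grep nvidia /proc/devices
#     ls -l /dev/nvidia*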
#
# CHANGELOG (v2.2.5):
#   - Fixed LXC configuration documentation to include ALL required cgroup devices
#   - Added missing c 507:* rwm (nvidia-uvm devices)
#   - Added missing c 510:* rwm (nvidia-caps devices)  
#   - Added missing /dev/nvidia-caps directory mount
#   - Improved LXC config output with explanations of device numbers
#   - Changed syntax from = to : for proper LXC config format
#
# CHANGELOG (v2.2.4):
#   - Fixed module detection regex: now uses \s instead of space for proper matching
#   - Fixed false-negative module loading detection
#   - Improved persistence daemon error handling and diagnostics
#   - Added better feedback when persistence daemon fails (non-critical)
#   - More informative error messages during module verification
#
# CHANGELOG (v2.2.3):
#   - Added apt install retry logic (3 attempts with exponential backoff)
#   - Added apt update retry logic to handle transient network failures
#   - Improved error messages with actionable troubleshooting steps
#   - Better handling of Debian mirror timeouts and connection issues
#   - All package installations now use retry mechanism automatically
#   
# CHANGELOG (v2.2.2):
#   - Fixed multi-kernel support: installs headers for ALL installed kernels
#   - Fixed GPU detection: finds NVIDIA even with Intel/AMD iGPU present
#   - Added comprehensive prerequisite validation
#   - Added CUDA repository validation with auto-detection
#   - Added download retry logic (3 attempts)
#   - Improved error handling with detailed logging
#   - Added cleanup trap for interrupted installations
#   - Extended module load timeout from 5s to 10s
#   - Added error log capture for failed modprobe attempts
#   - Improved DRY_RUN support across all functions
#   - Added --help flag with detailed documentation
#   - Enhanced health summary with checkmarks and color coding
#   - Only removes orphaned headers (kernel no longer installed)
#   - Shows all detected GPUs during installation
#   - Better handling of Secure Boot and MOK key enrollment
#
# VERSION: 2.2.5
# AUTHOR: Leon Scott
# LICENSE: MIT
# REPOSITORY: [Add your repo URL here]
#
SCRIPT_VERSION="2.2.5"

set -euo pipefail

# =========================
# Color + logging utilities
# =========================
info()  { echo -e "\033[0;32m[INFO]\033[0m $*"; }
warn()  { echo -e "\033[0;33m[WARN]\033[0m $*" >&2; }
error() { echo -e "\033[0;31m[ERROR]\033[0m $*" >&2; }

run() {
  if [[ "${DRY_RUN}" == "1" ]]; then
    echo "+ $*"
  else
    eval "$@"
  fi
}

need_root() { [[ "$(id -u)" -eq 0 ]] || { error "Run as root."; exit 1; }; }

# =========================
# Helper: apt install with retry logic
# =========================
apt_install_with_retry() {
  local max_attempts=3
  local wait_time=5
  local packages="$*"
  
  info "Installing packages: ${packages}"
  
  for attempt in $(seq 1 $max_attempts); do
    info "Attempt ${attempt}/${max_attempts}..."
    
    if apt install -y ${packages}; then
      info "Successfully installed: ${packages}"
      return 0
    else
      local exit_code=$?
      warn "Installation attempt ${attempt} failed (exit code: ${exit_code})"
      
      if [[ $attempt -lt $max_attempts ]]; then
        warn "Waiting ${wait_time} seconds before retry..."
        warn "Running apt update to refresh repository metadata..."
        apt update || warn "apt update failed, continuing anyway..."
        sleep $wait_time
        wait_time=$((wait_time * 2))  # Exponential backoff
      fi
    fi
  done
  
  error "Failed to install after ${max_attempts} attempts: ${packages}"
  error "This may be due to:"
  error "  - Network connectivity issues"
  error "  - Debian mirror problems"
  error "  - Package dependency conflicts"
  error ""
  error "Suggested fixes:"
  error "  1. Check network: ping -c3 deb.debian.org"
  error "  2. Try different mirror: edit /etc/apt/sources.list"
  error "  3. Run: apt update && apt install --fix-missing"
  error "  4. Wait 15 minutes and retry (mirrors may be syncing)"
  
  return 1
}
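# Example: called later from install_nvidia_stack as:
#   apt_install_with_retry cuda-drivers cuda-toolkit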

# =========================
# Helper: apt update with retry
# =========================
apt_update_with_retry() {
  local max_attempts=3
  local wait_time=3
  
  for attempt in $(seq 1 $max_attempts); do
    info "Running apt update (attempt ${attempt}/${max_attempts})..."
    
    if apt update; then
      info "Repository metadata updated successfully"
      return 0
    else
      warn "apt update attempt ${attempt} failed"
      
      if [[ $attempt -lt $max_attempts ]]; then
        warn "Waiting ${wait_time} seconds before retry..."
        sleep $wait_time
      fi
    fi
  done
  
  error "apt update failed after ${max_attempts} attempts"
  error "Continuing anyway, but installation may fail..."
  return 1
}

# =========================
# Defaults
# =========================
DRY_RUN=0
DO_PURGE=0
FORCE_REBUILD=0
CHECK_ONLY=0
INSTALL_AUTORUN=0

# Auto-detect Debian version or use override
detect_debian_version() {
  if [[ -f /etc/os-release ]]; then
    local version_id
    version_id="$(grep '^VERSION_ID=' /etc/os-release | cut -d'=' -f2 | tr -d '"')"
    case "$version_id" in
      12) echo "debian12" ;;
      13) echo "debian13" ;;
      *) 
        warn "Unknown Debian version: $version_id, defaulting to debian12"
        echo "debian12"
        ;;
    esac
  else
    warn "/etc/os-release not found, defaulting to debian12"
    echo "debian12"
  fi
}

CUDA_REPO_CODENAME="${CUDA_REPO_CODENAME:-$(detect_debian_version)}"
CUDA_KEYRING_URL="https://developer.download.nvidia.com/compute/cuda/repos/${CUDA_REPO_CODENAME}/x86_64/cuda-keyring_1.1-1_all.deb"
CUDA_KEYRING_DEB="/tmp/cuda-keyring.deb"
CUDA_KEYRING_SHA256="/tmp/cuda-keyring.sha256"
NVIDIA_LIST_FILE="/etc/apt/sources.list.d/cuda-${CUDA_REPO_CODENAME}-x86_64.list"
NVIDIA_KEYRING_FILE="/usr/share/keyrings/cuda-archive-keyring.gpg"
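
# Normally the codename is auto-detected above; to force a specific repo,
# something like this should work (debian13 is just an example value):
#   sudo CUDA_REPO_CODENAME=debian13 ./setup-gpu-pxe.sh --dry-run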

KERNEL_VER="$(uname -r)"
# Use consistent naming: proxmox-headers for PVE 8.x+
HEADERS_PKG="proxmox-headers-${KERNEL_VER}"

UVM_RULE_FILE="/lib/udev/rules.d/71-nvidia-uvm.rules"
UVM_FIX_SCRIPT="/usr/local/sbin/fix-nvidia-uvm.sh"
UVM_FIX_SERVICE="/etc/systemd/system/fix-nvidia-uvm.service"

AUTORUN_SVC="/etc/systemd/system/proxmox-nvidia-autofix.service"
AUTORUN_PATH="/etc/systemd/system/proxmox-nvidia-autofix.path"

# Cleanup trap for interrupted operations
cleanup_on_exit() {
  local exit_code=$?
  if [[ $exit_code -ne 0 ]]; then
    warn "Script interrupted or failed (exit code: $exit_code)"
    warn "You may need to run with --purge to clean up partial installation"
  fi
}
trap cleanup_on_exit EXIT

# =========================
# Parse args
# =========================
show_help() {
  cat << EOF
Proxmox NVIDIA CUDA Installer v${SCRIPT_VERSION}

Usage: sudo $0 [OPTIONS]

OPTIONS:
  --dry-run           Show what would be done without making changes
  --purge             Remove all NVIDIA/CUDA components (clean slate)
  --force-rebuild     Force DKMS rebuild of NVIDIA modules
  --check-only        Check current NVIDIA status without installing
  --install-autorun   Install systemd units for automatic kernel update handling
  --help              Show this help message

EXAMPLES:
  # Full installation
  sudo $0
  
  # Check status
  sudo $0 --check-only
  
  # Clean removal
  sudo $0 --purge
  
  # Reinstall after kernel update
  sudo $0 --force-rebuild

EOF
}

for arg in "$@"; do
  case "$arg" in
    --dry-run)          DRY_RUN=1 ;;
    --purge)            DO_PURGE=1 ;;
    --force-rebuild)    FORCE_REBUILD=1 ;;
    --check-only)       CHECK_ONLY=1 ;;
    --install-autorun)  INSTALL_AUTORUN=1 ;;
    --help)             show_help; exit 0 ;;
    *) warn "Unknown arg: $arg (use --help for usage)" ;;
  esac
done

need_root
info "Proxmox NVIDIA Installer — version ${SCRIPT_VERSION}"
info "DRY_RUN=${DRY_RUN}, PURGE=${DO_PURGE}, FORCE_REBUILD=${FORCE_REBUILD}, CHECK_ONLY=${CHECK_ONLY}"
info "Kernel=${KERNEL_VER}, CUDA Repo=${CUDA_REPO_CODENAME}"

# =========================
# Validate prerequisites
# =========================
validate_prerequisites() {
  local missing=()
  for cmd in systemctl apt dpkg wget lspci modprobe dkms; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      missing+=("$cmd")
    fi
  done
  
  if [[ ${#missing[@]} -gt 0 ]]; then
    error "Missing required commands: ${missing[*]}"
    exit 1
  fi
  
  info "All prerequisites validated"
}

# =========================
# Helper: systemd health
# =========================
check_systemd_health() {
  info "Checking systemd failed units..."
  run "systemctl --failed || true"
}

# =========================
# Secure Boot check + auto-import
# =========================
check_secure_boot() {
  if ! command -v mokutil >/dev/null 2>&1; then
    warn "mokutil not available; skipping Secure Boot check."
    return
  fi

  local sb_state
  sb_state="$(mokutil --sb-state 2>/dev/null || true)"
  info "Secure Boot state: ${sb_state}"

  if echo "$sb_state" | grep -qi "enabled"; then
    if [[ -f /var/lib/dkms/mok.pub ]]; then
      warn "Secure Boot enabled. Attempting to auto-import DKMS MOK key..."
      if [[ -f /var/lib/dkms/mok.key ]]; then
        info "MOK key pair found, importing..."
        run "mokutil --import /var/lib/dkms/mok.pub"
        warn "Reboot required. Approve the key enrollment in firmware once."
      else
        warn "MOK public key exists but private key not found."
        warn "Key will be generated during DKMS build."
      fi
    else
      warn "DKMS MOK key not found at /var/lib/dkms/mok.pub."
      warn "It will be generated during DKMS builds."
    fi
  fi
}

# =========================
# Detect GPU vendor
# ==========================
detect_gpu_vendor() {
  local all_vendors
  all_vendors="$(lspci -nn | grep -E 'VGA|3D|Display' | grep -oE 'NVIDIA|AMD|Intel' || true)"
  
  if echo "$all_vendors" | grep -q 'NVIDIA'; then
    info "Detected NVIDIA GPU"
    # Show all GPUs found
    lspci -nn | grep -E 'VGA|3D|Display' | while read -r line; do
      info "  Found: $line"
    done
    return 0
  else
    warn "No NVIDIA GPU detected. Found GPUs:"
    lspci -nn | grep -E 'VGA|3D|Display' | while read -r line; do
      warn "  $line"
    done
    exit 0
  fi
}

# ==========================
# Ensure headers
# ==========================
ensure_headers_and_update() {
  info "Checking for Proxmox kernel headers..."
  
  KERNEL_VER=$(uname -r)
  HEADERS_PKG="proxmox-headers-${KERNEL_VER}"
  
  # Get all installed Proxmox kernels
  info "Detecting all installed Proxmox kernels..."
  local installed_kernels
  installed_kernels=$(dpkg -l | awk '/^ii.*proxmox-kernel-/{print $2}' | sed 's/proxmox-kernel-//' || true)
  
  if [[ -z "$installed_kernels" ]]; then
    warn "No proxmox-kernel packages found, checking for pve-kernel..."
    installed_kernels=$(dpkg -l | awk '/^ii.*pve-kernel-/{print $2}' | sed 's/pve-kernel-//' || true)
  fi
  
  if [[ -n "$installed_kernels" ]]; then
    info "Found installed kernels:"
    echo "$installed_kernels" | while read -r kver; do
      info "  - $kver"
    done
    
    # Install headers for all installed kernels
    echo "$installed_kernels" | while read -r kver; do
      local header_pkg="proxmox-headers-${kver}"
      if dpkg -l | grep -q "^ii[[:space:]]*${header_pkg}"; then
        info "Headers already installed: ${header_pkg}"
      else
        info "Installing headers for kernel ${kver}: ${header_pkg}"
        if [[ "${DRY_RUN}" -eq 1 ]]; then
          info "(dry-run) Would install ${header_pkg}"
        else
          if apt_install_with_retry "${header_pkg}"; then
            info "Successfully installed ${header_pkg}"
          else
            warn "Failed to install ${header_pkg} - may not be available"
          fi
        fi
      fi
    done
  else
    # Fallback: just install for current kernel
    warn "Could not detect installed kernels, installing for current kernel only"
    if dpkg -l | grep -q "^ii[[:space:]]*${HEADERS_PKG}"; then
      info "Headers already installed: ${HEADERS_PKG}"
    else
      info "Installing headers: ${HEADERS_PKG}"
      if [[ "${DRY_RUN}" -eq 1 ]]; then
        info "(dry-run) Would install ${HEADERS_PKG}"
      else
        if ! apt_install_with_retry "${HEADERS_PKG}"; then
          error "Failed to install ${HEADERS_PKG}. DKMS may not build correctly."
          error "Available header packages:"
          apt-cache search "^proxmox-headers-" | head -5
          return 1
        fi
      fi
    fi
  fi
  
  # Rebuild DKMS for all kernels
  if [[ "${DRY_RUN}" -eq 0 ]]; then
    info "Rebuilding DKMS modules for all kernels..."
    dkms autoinstall || warn "DKMS rebuild had warnings"
  fi
  
  # Clean up headers for kernels that are no longer installed
  info "Checking for orphaned kernel headers..."
  local all_installed_headers
  all_installed_headers=$(dpkg -l | awk '/^ii/{print $2}' | grep '^proxmox-headers-' || true)
  
  if [[ -n "$all_installed_headers" ]]; then
    echo "$all_installed_headers" | while read -r header_pkg; do
      local kver="${header_pkg#proxmox-headers-}"
      local kernel_pkg="proxmox-kernel-${kver}"
      
      # Check if corresponding kernel is still installed
      if ! dpkg -l | grep -q "^ii[[:space:]]*${kernel_pkg}"; then
        # Try alternate naming
        kernel_pkg="pve-kernel-${kver}"
        if ! dpkg -l | grep -q "^ii[[:space:]]*${kernel_pkg}"; then
          info "Removing orphaned headers (kernel no longer installed): ${header_pkg}"
          if [[ "${DRY_RUN}" -eq 1 ]]; then
            info "(dry-run) Would purge ${header_pkg}"
          else
            apt purge -y "${header_pkg}" || warn "Failed to purge ${header_pkg}"
          fi
        fi
      fi
    done
  fi
}

# ==========================
# Blacklist nouveau
# ==========================
ensure_blacklist_nouveau() {
  local modprobe_conf="/etc/modprobe.d/blacklist-nouveau.conf"
  if [[ ! -f "$modprobe_conf" ]]; then
    info "Blacklisting nouveau and updating initramfs..."
    if [[ "${DRY_RUN}" -eq 1 ]]; then
      info "(dry-run) Would create ${modprobe_conf} and update initramfs"
    else
      run "tee ${modprobe_conf} >/dev/null <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF"
      run "update-initramfs -u"
    fi
  else
    info "nouveau already blacklisted at ${modprobe_conf}"
  fi
}

# ==========================
# Validate CUDA repo availability
# ==========================
validate_cuda_repo() {
  info "Validating CUDA repository availability..."
  local test_url="https://developer.download.nvidia.com/compute/cuda/repos/${CUDA_REPO_CODENAME}/x86_64/"
  
  if [[ "${DRY_RUN}" -eq 0 ]]; then
    if wget --spider -q "$test_url" 2>/dev/null; then
      info "CUDA repository for ${CUDA_REPO_CODENAME} is accessible"
      return 0
    else
      error "CUDA repository for ${CUDA_REPO_CODENAME} is not accessible"
      error "URL: $test_url"
      error "You may need to adjust CUDA_REPO_CODENAME environment variable"
      return 1
    fi
  else
    info "(dry-run) Would validate CUDA repo: $test_url"
  fi
}

# ==========================
# Ensure CUDA repo (no duplicates)
# ==========================
ensure_cuda_repo() {
  validate_cuda_repo || return 1
  
  if [[ ! -f "${NVIDIA_KEYRING_FILE}" ]]; then
    info "Installing CUDA keyring..."
    if [[ "${DRY_RUN}" -eq 1 ]]; then
      info "(dry-run) Would download and install CUDA keyring"
    else
      # Download with retry logic
      local max_retries=3
      local retry=0
      while [[ $retry -lt $max_retries ]]; do
        if wget -q "${CUDA_KEYRING_URL}" -O "${CUDA_KEYRING_DEB}"; then
          break
        fi
        retry=$((retry + 1))
        warn "Download failed, retry $retry/$max_retries..."
        sleep 2
      done
      
      if [[ ! -f "${CUDA_KEYRING_DEB}" ]]; then
        error "Failed to download CUDA keyring after $max_retries attempts"
        return 1
      fi
      
      run "dpkg -i '${CUDA_KEYRING_DEB}'"
      rm -f "${CUDA_KEYRING_DEB}"
    fi
  else
    info "CUDA keyring present: ${NVIDIA_KEYRING_FILE}"
  fi

  if [[ ! -f "${NVIDIA_LIST_FILE}" ]]; then
    info "Creating CUDA repo sources list: ${NVIDIA_LIST_FILE}"
    if [[ "${DRY_RUN}" -eq 1 ]]; then
      info "(dry-run) Would create ${NVIDIA_LIST_FILE}"
    else
      run "tee '${NVIDIA_LIST_FILE}' >/dev/null <<EOF
deb [signed-by=${NVIDIA_KEYRING_FILE}] https://developer.download.nvidia.com/compute/cuda/repos/${CUDA_REPO_CODENAME}/x86_64/ /
EOF"
    fi
  else
    info "CUDA repo sources list already present: ${NVIDIA_LIST_FILE}"
  fi

  if [[ -f "/etc/apt/sources.list.d/nvidia.list" ]]; then
    warn "Removing old /etc/apt/sources.list.d/nvidia.list to avoid duplicate repo entries..."
    if [[ "${DRY_RUN}" -eq 1 ]]; then
      info "(dry-run) Would remove /etc/apt/sources.list.d/nvidia.list"
    else
      run "rm -f /etc/apt/sources.list.d/nvidia.list"
    fi
  fi

  info "apt update (CUDA repo)..."
  if [[ "${DRY_RUN}" -eq 1 ]]; then
    info "(dry-run) Would run apt update"
  else
    apt_update_with_retry
  fi
}

# ==========================
# Purge conflicting Debian/Proxmox NVIDIA bits
# ===========================
purge_conflicting_stacks() {
  info "Ensuring no Debian/Proxmox NVIDIA stacks are present..."

  if dpkg -l | awk '{print $2}' | grep -qx 'nvidia-kernel-common'; then
    warn "Found nvidia-kernel-common (Debian stack). Purging..."
    if [[ "${DRY_RUN}" -eq 1 ]]; then
      info "(dry-run) Would purge nvidia-kernel-common"
    else
      run "apt purge -y nvidia-kernel-common || true"
      run "apt autoremove -y || true"
    fi
  fi

  if dpkg -l | awk '{print $2}' | grep -qx 'pve-nvidia-vgpu-helper'; then
    warn "Found pve-nvidia-vgpu-helper. Purging to avoid stack conflicts..."
    if [[ "${DRY_RUN}" -eq 1 ]]; then
      info "(dry-run) Would purge pve-nvidia-vgpu-helper"
    else
      run "apt purge -y pve-nvidia-vgpu-helper || true"
      run "apt autoremove -y || true"
    fi
  fi
}

# ==========================
# Purge ALL NVIDIA/CUDA stacks (clean slate)
# ==========================
purge_nvidia_cuda_all() {
  info "Purging ALL NVIDIA/CUDA/Proxmox NVIDIA stacks..."

  if [[ "${DRY_RUN}" -eq 1 ]]; then
    info "(dry-run) Would stop and disable all NVIDIA services"
    info "(dry-run) Would remove all NVIDIA packages and DKMS modules"
    info "(dry-run) Would clean up systemd units and device nodes"
    return
  fi

  # Stop and disable services first
  run "systemctl stop nvidia-persistenced || true"
  run "systemctl disable nvidia-persistenced || true"
  run "systemctl stop pve-nvidia-vgpu-helper || true"
  run "systemctl disable pve-nvidia-vgpu-helper || true"
  run "systemctl daemon-reload || true"

  # Remove persistence daemon units and symlinks
  run "rm -f /etc/systemd/system/nvidia-persistenced.service || true"
  run "rm -f /etc/systemd/system/multi-user.target.wants/nvidia-persistenced.service || true"
  run "rm -f /usr/lib/systemd/system/nvidia-persistenced.service || true"
  run "rm -f /lib/systemd/system/nvidia-persistenced.service || true"

  # Remove UVM fixer unit and script
  run "systemctl disable fix-nvidia-uvm.service || true"
  run "rm -f '${UVM_FIX_SERVICE}' '${UVM_FIX_SCRIPT}' || true"

  # Remove autorun units
  run "systemctl disable proxmox-nvidia-autofix.path proxmox-nvidia-autofix.service || true"
  run "rm -f '${AUTORUN_SVC}' '${AUTORUN_PATH}' || true"

  run "systemctl daemon-reload || true"

  # Remove custom UVM udev rule
  run "rm -f '${UVM_RULE_FILE}' || true"

  # Purge all relevant packages from both Debian/Proxmox and CUDA stacks
  run "apt purge -y 'nvidia-*' 'cuda*' 'libnvidia-*' 'xserver-xorg-video-nvidia*' 'pve-nvidia-*' || true"

  # DKMS cleanup
  run "dkms status | grep -i nvidia || true"
  run "dkms remove -m nvidia -v all --all || true"
  run "rm -rf /var/lib/dkms/nvidia || true"

  # CUDA leftovers
  run "rm -rf /usr/local/cuda* || true"

  # Clean package cache
  run "apt autoremove -y || true"
  run "apt clean || true"

  info "Purge complete. You should reboot before reinstalling."
}

# ==========================
# Install drivers + toolkit (CUDA repo only)
# ==========================
install_nvidia_stack() {
  info "Installing NVIDIA driver stack and CUDA toolkit from CUDA repo..."
  if [[ "${DRY_RUN}" -eq 1 ]]; then
    info "(dry-run) Would install cuda-drivers cuda-toolkit"
  else
    if ! apt_install_with_retry cuda-drivers cuda-toolkit; then
      error "Failed to install NVIDIA packages after multiple attempts"
      error "Check the error messages above for specific issues"
      return 1
    fi
    info "Verifying DKMS build status..."
    run "dkms status || true"
  fi
}

# ==========================
# Force DKMS rebuild
# ==========================
force_rebuild() {
  info "Forcing DKMS rebuild for NVIDIA module..."
  if [[ "${DRY_RUN}" -eq 1 ]]; then
    info "(dry-run) Would remove and rebuild NVIDIA DKMS modules"
  else
    run "dkms remove -m nvidia -v all --all || true"
    run "dkms autoinstall || true"

    info "Reloading NVIDIA modules..."
    run "modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia || true"
    run "modprobe nvidia || true"
  fi
}

# ==========================
# UVM device node enforcement
# ==========================
install_uvm_fix_systemd() {
  if [[ "${DRY_RUN}" -eq 1 ]]; then
    info "(dry-run) Would install UVM fixer systemd unit and script."
    info "(dry-run) Would enable fix-nvidia-uvm.service."
    return
  fi

  info "Installing systemd helper to enforce /dev/nvidia-uvm nodes..."

  run "tee '${UVM_FIX_SCRIPT}' >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# Exit silently if UVM module is not loaded
if ! lsmod | grep -q '^nvidia_uvm'; then
  exit 0
fi

major=\$(awk '\$2==\"nvidia-uvm\" {print \$1}' /proc/devices || true)
if [[ -z "\$major" ]]; then
  exit 0
fi

if [[ ! -e /dev/nvidia-uvm ]]; then
  mknod -m 666 /dev/nvidia-uvm c "\$major" 0
fi

if [[ ! -e /dev/nvidia-uvm-tools ]]; then
  mknod -m 666 /dev/nvidia-uvm-tools c "\$major" 1
fi
EOF"

  run "chmod +x '${UVM_FIX_SCRIPT}'"

  run "tee '${UVM_FIX_SERVICE}' >/dev/null <<EOF
[Unit]
Description=Ensure /dev/nvidia-uvm device nodes exist
After=multi-user.target
ConditionPathExists=/proc/devices

[Service]
Type=oneshot
ExecStart=${UVM_FIX_SCRIPT}

[Install]
WantedBy=multi-user.target
EOF"

  run "systemctl daemon-reload"
  run "systemctl enable --now fix-nvidia-uvm.service"
  info "UVM fixer systemd unit installed and active."
}

ensure_uvm_udev_rules_and_nodes() {
  info "Ensuring udev rules and device nodes for nvidia-uvm..."

  # Create udev rule if missing
  if [[ ! -f "${UVM_RULE_FILE}" ]]; then
    if [[ "${DRY_RUN}" -eq 1 ]]; then
      info "(dry-run) Would create UVM udev rule at ${UVM_RULE_FILE}"
    else
      info "Creating UVM udev rule at ${UVM_RULE_FILE}..."
      run "tee '${UVM_RULE_FILE}' >/dev/null <<'EOF'
KERNEL=="nvidia-uvm", MODE="0666"
KERNEL=="nvidia-uvm-tools", MODE="0666"
EOF"
    fi
  else
    info "UVM udev rule already present: ${UVM_RULE_FILE}"
  fi

  if [[ "${DRY_RUN}" -eq 0 ]]; then
    run "udevadm control --reload-rules || true"
    run "udevadm trigger || true"
    run "udevadm settle || true"
    
    # Load module with better error handling
    if ! lsmod | grep -q '^nvidia_uvm'; then
      info "Loading nvidia_uvm module..."
      if ! modprobe nvidia_uvm 2>/dev/null; then
        warn "Failed to load nvidia_uvm module. This may be normal if nvidia driver isn't fully installed yet."
      fi
    fi

    local major
    major="$(awk '$2=="nvidia-uvm" {print $1}' /proc/devices || true)"
    if [[ -z "${major}" ]]; then
      warn "nvidia-uvm major not found in /proc/devices; device nodes may not be creatable yet."
    else
      if [[ ! -e /dev/nvidia-uvm ]]; then
        info "Creating /dev/nvidia-uvm (major ${major}, minor 0)..."
        run "mknod -m 666 /dev/nvidia-uvm c ${major} 0"
      fi
      if [[ ! -e /dev/nvidia-uvm-tools ]]; then
        info "Creating /dev/nvidia-uvm-tools (major ${major}, minor 1)..."
        run "mknod -m 666 /dev/nvidia-uvm-tools c ${major} 1"
      fi
    fi
  else
    info "(dry-run) Would reload udev rules and trigger."
    info "(dry-run) Would modprobe nvidia_uvm and create /dev/nvidia-uvm* if needed."
  fi

  install_uvm_fix_systemd
}

# ==========================
# Helper: DKMS + device + driver check
# ==========================
verify_nvidia_ready() {
  info "Verifying NVIDIA kernel modules and device nodes..."

  # MODULE LOAD TIMING DELAY - increased from 5 to 10 seconds
  if [[ "${DRY_RUN}" -eq 0 ]]; then
    info "Waiting for NVIDIA modules to load (up to 10 seconds)..."
    for i in {1..10}; do
      sleep 1
      if lsmod | grep -q '^nvidia\s'; then
        info "NVIDIA module loaded after ${i} seconds"
        break
      fi
      if [[ $i -eq 10 ]]; then
        warn "NVIDIA module not detected after 10 seconds"
      fi
    done
  fi

  if [[ "${DRY_RUN}" -eq 0 ]]; then
    # Try to load modules with better error capture
    local failed_modules=()
    
    for mod in nvidia nvidia_modeset nvidia_drm nvidia_uvm; do
      if ! lsmod | grep -q "^${mod}\s"; then
        info "Attempting to load module: ${mod}"
        if ! modprobe "${mod}" 2>/tmp/modprobe_err_${mod}.log; then
          failed_modules+=("${mod}")
          warn "Failed to load ${mod}. Error log:"
          cat "/tmp/modprobe_err_${mod}.log" 2>/dev/null || true
        fi
      fi
    done

    local missing_modules=()
    for mod in nvidia nvidia_modeset nvidia_drm nvidia_uvm; do
      if ! lsmod | grep -q "^${mod}\s"; then
        missing_modules+=("${mod}")
      fi
    done

    if (( ${#missing_modules[@]} > 0 )); then
      error "Missing loaded NVIDIA modules: ${missing_modules[*]}"
      if (( ${#failed_modules[@]} > 0 )); then
        error "Failed to load: ${failed_modules[*]}"
        error "Check /tmp/modprobe_err_*.log for details"
      fi
      info "Showing current lsmod output:"
      run "lsmod | grep -i nvidia || true"
      info "Showing DKMS status:"
      run "dkms status || true"
      return 1
    else
      info "All core NVIDIA modules are loaded."
    fi

    local missing_devs=()
    for dev in /dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset /dev/nvidia-uvm /dev/nvidia-uvm-tools; do
      if [[ ! -e "${dev}" ]]; then
        missing_devs+=("${dev}")
      fi
    done

    if (( ${#missing_devs[@]} > 0 )); then
      warn "Missing NVIDIA device nodes: ${missing_devs[*]}"
      run "ls -l /dev/nvidia* 2>/dev/null || true"
      return 1
    else
      info "All expected NVIDIA device nodes are present."
    fi
  else
    info "(dry-run) Would verify modules: nvidia, nvidia_modeset, nvidia_drm, nvidia_uvm."
    info "(dry-run) Would verify device nodes: /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-modeset, /dev/nvidia-uvm, /dev/nvidia-uvm-tools."
  fi

  if [[ "${DRY_RUN}" -eq 0 ]]; then
    if command -v nvidia-smi >/dev/null 2>&1; then
      info "Running nvidia-smi..."
      run "nvidia-smi || true"
    else
      warn "nvidia-smi not found in PATH."
    fi

    local persist_status
    persist_status="$(systemctl is-active nvidia-persistenced 2>/dev/null || true)"
    if [[ "$persist_status" != "active" ]]; then
      warn "nvidia-persistenced is not active (status: $persist_status)"
    else
      info "nvidia-persistenced is active."
    fi
  else
    info "(dry-run) Would run nvidia-smi and check nvidia-persistenced."
  fi
  
  return 0
}

# ==========================
# Enable persistence daemon
# ==========================
enable_persistence() {
  if [[ "${DRY_RUN}" -eq 1 ]]; then
    info "(dry-run) Would enable and start nvidia-persistenced and set persistence mode."
    return
  fi

  info "Enabling NVIDIA Persistence Daemon..."
  
  # Check if nvidia-persistenced binary exists
  if ! command -v nvidia-persistenced >/dev/null 2>&1; then
    warn "nvidia-persistenced binary not found. It may not be installed yet."
    warn "Persistence daemon will be configured on next reboot."
    return
  fi
  
  run "systemctl enable nvidia-persistenced || true"
  run "systemctl restart nvidia-persistenced || systemctl start nvidia-persistenced || true"
  
  # Check if it started successfully
  if systemctl is-active --quiet nvidia-persistenced; then
    info "nvidia-persistenced is running"
    run "systemctl status --no-pager nvidia-persistenced || true"
  else
    warn "nvidia-persistenced failed to start. Checking logs..."
    run "journalctl -u nvidia-persistenced -n 20 --no-pager || true"
    warn "This is usually not critical. The daemon will start on next reboot."
  fi

  info "Setting persistence mode for the GPU..."
  if ! nvidia-smi -pm 1; then
    warn "Failed to set persistence mode. This may require a reboot."
    warn "You can manually enable it later with: nvidia-smi -pm 1"
  else
    info "GPU persistence mode enabled successfully"
  fi
}

# ==========================
# Check NVIDIA readiness for LXC passthrough
# ==========================
check_nvidia_lxc_ready() {
  info "Checking NVIDIA components for LXC container passthrough..."

  if [[ "${DRY_RUN}" -eq 0 ]]; then
    if ls /dev/nvidia* >/dev/null 2>&1; then
      run "ls -l /dev/nvidia*"
    else
      warn "No /dev/nvidia* device nodes found."
    fi

    info "Checking for required NVIDIA libraries (host-side):"
    run "ldconfig -p | grep -E 'libcuda.so|libnvidia-ml.so' || true"
  else
    info "(dry-run) Would list /dev/nvidia* and check libcuda/libnvidia-ml."
  fi

  echo ""
  echo "LXC config entries to enable GPU (add to /etc/pve/lxc/<VMID>.conf):"
  echo ""
  echo "# Allow access to NVIDIA device nodes"
  echo "lxc.cgroup2.devices.allow: c 195:* rwm"
  echo "lxc.cgroup2.devices.allow: c 507:* rwm"
  echo "lxc.cgroup2.devices.allow: c 510:* rwm"
  echo ""
  echo "# Mount NVIDIA device nodes into container"
  echo "lxc.mount.entry: /dev/nvidia0          dev/nvidia0          none bind,optional,create=file"
  echo "lxc.mount.entry: /dev/nvidiactl        dev/nvidiactl        none bind,optional,create=file"
  echo "lxc.mount.entry: /dev/nvidia-modeset   dev/nvidia-modeset   none bind,optional,create=file"
  echo "lxc.mount.entry: /dev/nvidia-uvm       dev/nvidia-uvm       none bind,optional,create=file"
  echo "lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file"
  echo "lxc.mount.entry: /dev/nvidia-caps      dev/nvidia-caps      none bind,optional,create=dir"
  echo ""
  echo "Device explanations:"
  echo "  195:* = Main GPU devices (nvidia0, nvidiactl, nvidia-modeset)"
  echo "  507:* = CUDA/UVM devices (nvidia-uvm, nvidia-uvm-tools)"
  echo "  510:* = Capability devices (nvidia-caps/nvidia-cap1, nvidia-cap2)"
  echo ""
}
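# Note: after adding those entries to /etc/pve/lxc/<VMID>.conf, restart the
# container so they take effect; something like this does it (101 is just a
# placeholder VMID):
#   pct stop 101 && pct start 101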

# ==========================
# Install systemd autorun for kernel updates
# ==========================
install_systemd_autorun() {
  if [[ "${DRY_RUN}" -eq 1 ]]; then
    info "(dry-run) Would install proxmox-nvidia-autofix.{service,path} and enable .path unit."
    return
  fi

  if [[ -f "${AUTORUN_SVC}" ]]; then
    info "systemd autorun already installed."
    return
  fi

  info "Installing systemd autorun for kernel updates..."

  # Validate script location
  local script_path
  script_path="$(realpath "$0")"
  if [[ ! -f "$script_path" ]]; then
    error "Cannot resolve script path: $0"
    return 1
  fi

  run "ln -sf \"${script_path}\" /usr/local/sbin/proxmox-nvidia.sh"

  run "tee '${AUTORUN_SVC}' >/dev/null <<EOF
[Unit]
Description=Auto-run NVIDIA installer after kernel updates

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/proxmox-nvidia.sh --force-rebuild
EOF"

  run "tee '${AUTORUN_PATH}' >/dev/null <<EOF
[Unit]
Description=Watch for kernel updates

[Path]
PathChanged=/boot
Unit=proxmox-nvidia-autofix.service

[Install]
WantedBy=multi-user.target
EOF"

  run "systemctl daemon-reload"
  run "systemctl enable --now proxmox-nvidia-autofix.path"
  info "systemd autorun installed and active."
}

# ==========================
# Health Summary Block
# ===========================
health_summary() {
  echo -e "\n\033[1;36m=== NVIDIA HEALTH SUMMARY ===\033[0m"

  # Driver installed
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo -e "\033[0;32m✓ Driver installed:\033[0m YES"
  else
    echo -e "\033[0;31m✗ Driver installed:\033[0m NO"
  fi

  # Modules loaded
  local all_mods_loaded=true
  for mod in nvidia nvidia_modeset nvidia_drm nvidia_uvm; do
    if ! lsmod | grep -q "^${mod}\s"; then
      all_mods_loaded=false
    fi
  done
  if $all_mods_loaded; then
    echo -e "\033[0;32m✓ Modules loaded:\033[0m YES"
  else
    echo -e "\033[0;31m✗ Modules loaded:\033[0m NO"
  fi

  # Device nodes
  local all_nodes=true
  for dev in /dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset /dev/nvidia-uvm /dev/nvidia-uvm-tools; do
    [[ -e "$dev" ]] || all_nodes=false
  done
  if $all_nodes; then
    echo -e "\033[0;32m✓ Device nodes present:\033[0m YES"
  else
    echo -e "\033[0;31m✗ Device nodes present:\033[0m NO"
  fi

  # Persistence daemon
  if systemctl is-active --quiet nvidia-persistenced 2>/dev/null; then
    echo -e "\033[0;32m✓ Persistence daemon:\033[0m ACTIVE"
  else
    echo -e "\033[0;33m⚠ Persistence daemon:\033[0m INACTIVE"
  fi

  # CUDA libs
  if ldconfig -p 2>/dev/null | grep -q libcuda.so; then
    echo -e "\033[0;32m✓ CUDA libraries:\033[0m PRESENT"
  else
    echo -e "\033[0;31m✗ CUDA libraries:\033[0m MISSING"
  fi

  # DKMS
  if dkms status 2>/dev/null | grep -q nvidia; then
    echo -e "\033[0;32m✓ DKMS status:\033[0m OK"
  else
    echo -e "\033[0;31m✗ DKMS status:\033[0m MISSING"
  fi

  # LXC readiness
  if [[ -e /dev/nvidia0 ]]; then
    echo -e "\033[0;32m✓ LXC passthrough readiness:\033[0m OK"
  else
    echo -e "\033[0;31m✗ LXC passthrough readiness:\033[0m NO"
  fi

  echo -e "\033[1;36m=============================\033[0m\n"
}

# ==========================
# Main execution flow
# ==========================
main() {
  validate_prerequisites

  if [[ "$CHECK_ONLY" -eq 1 ]]; then
    info "CHECK-ONLY mode."
    verify_nvidia_ready || true
    check_nvidia_lxc_ready
    check_systemd_health
    health_summary
    exit 0
  fi

  if [[ "$DO_PURGE" -eq 1 ]]; then
    purge_nvidia_cuda_all
    info "Purge complete. Reboot is strongly recommended before reinstall."
    exit 0
  fi

  detect_gpu_vendor
  ensure_headers_and_update
  ensure_blacklist_nouveau
  ensure_cuda_repo
  purge_conflicting_stacks
  check_secure_boot
  install_nvidia_stack

  if [[ "$FORCE_REBUILD" -eq 1 ]]; then
    force_rebuild
  fi

  ensure_uvm_udev_rules_and_nodes

  if ! verify_nvidia_ready; then
    error "NVIDIA driver verification failed."
    check_systemd_health
    health_summary
    exit 1
  fi

  enable_persistence

  if ! verify_nvidia_ready; then
    error "Post-persistence verification failed."
    check_systemd_health
    health_summary
    exit 1
  fi

  check_systemd_health
  check_nvidia_lxc_ready

  if [[ "$INSTALL_AUTORUN" -eq 1 ]]; then
    install_systemd_autorun
  fi

  health_summary

  info "Installation complete."
  info "Reboot may be required after first Secure Boot MOK import."
  info "After approval, future kernel updates and reboots will preserve UVM nodes via systemd helper."
}

main "$@"
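
A quick sanity check once the script has run and the container config entries are in place (101 is just an example VMID, and nvidia-smi only exists inside the container if you have installed the NVIDIA userspace libraries in there):

nvidia-smi                                         # on the Proxmox host
pct exec 101 -- ls -l /dev/nvidia0 /dev/nvidiactl  # device nodes visible inside the container
pct exec 101 -- nvidia-smi                         # needs the driver userspace inside the container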

Hope this helps someone

Now to fix the mistake I made in the SSH configuration.

#enoughsaid