Proxmox & Kernel 7 & Nvidia
Yikes is my bottom line at the moment.
There is a significant issue with Kernel 7 and various Nvidia graphics cards: the Debian-packaged driver I use does not build against it. My card is rather dated, which is becoming a problem.
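A quick way to see whether a host is exposed is to compare the running kernel's series with what DKMS has actually built. The snippet below is only a rough sketch and is not part of the script further down. The helper name `kernel_series` is made up for illustration, and it assumes `dkms` is installed and that the driver registers under a module name containing "nvidia".

```bash
#!/bin/bash
# Hypothetical helper: reduce a kernel release string to its X.Y series.
# Prints nothing if the string does not start with a version number.
kernel_series() {
    local release="$1"
    if [[ "$release" =~ ^([0-9]+\.[0-9]+) ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    fi
}

running="$(uname -r)"
echo "Running kernel: ${running} (series $(kernel_series "$running"))"

# 'dkms status' prints one line per module/kernel pair. An "installed" line
# that names the running kernel means the module was built for it.
if command -v dkms >/dev/null 2>&1; then
    if dkms status 2>/dev/null | grep -i nvidia | grep -F "$running" | grep -q installed; then
        echo "NVIDIA DKMS module is built for ${running}"
    else
        echo "No NVIDIA DKMS module built for ${running}"
    fi
else
    echo "dkms not found; cannot check module build status"
fi
```

If the second message appears on a kernel you are about to reboot into, that is the situation described in the script header below.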
GPU Script for Host setup
The script is below. It has been refined and updated with every problem discovered while updating the Proxmox host over time. Updates are a manual affair, as I have learned the hard way.
#!/bin/bash
# -------------------------------------------------------------------------------------
# setup-gpu-pxe.sh — Proxmox VE NVIDIA Driver Host Installer + LXC GPU Configurator
# -------------------------------------------------------------------------------------
#
# VERSION: 4.0.0
# CREATED: see git history — github.com/Braedach
# UPDATED: 2026-06-13
# TARGET: Proxmox VE 9.x (Debian 13 / Trixie)
# CARD: NVIDIA RTX 3070 (or any NVIDIA GPU)
# AUTHOR: braedach / Leon
#
# -----------------------------------------------------------------------------
# WHAT THIS SCRIPT DOES
# -----------------------------------------------------------------------------
# Installs and configures NVIDIA drivers on the Proxmox HOST, generates the
# correct LXC container configuration for GPU passthrough, AND guards the host
# against the kernel-series jump that breaks NVIDIA DKMS (see WHY v4.0.0).
#
# The LXC passthrough model:
# - NVIDIA kernel modules live ONLY on the host (never inside a container)
# - Containers access the GPU via bind-mounted /dev/nvidia* device nodes
# - The host creates and owns all device nodes; containers just use them
# - Any kernel/driver update must be done on the HOST only
#
# -----------------------------------------------------------------------------
# WHY v4.0.0 IS A MAJOR BUMP (read this — it is the whole point)
# -----------------------------------------------------------------------------
# On 2026-06-12 a routine SECURITY update on pxe (proxmox-kernel-6.17
# 6.17.13-2 -> 6.17.13-13) silently dragged in an ENTIRELY NEW kernel series:
# proxmox-kernel-7.0 (Ubuntu 26.04 "Resolute" base). The bump came through the
# proxmox-default-kernel meta-package (2.0.2 -> 2.1.0), as Proxmox moves toward
# making 7.0 the default for the 9.2 release.
#
# The NVIDIA 550.163.01 driver (Debian-packaged, trixie-backports) does NOT
# build against Linux 7.0 — Debian's own packaging notes the module is only
# validated up to Linux 6.19. The DKMS autoinstall for 7.0 failed mid-apt,
# which:
# 1. wedged dpkg (three packages left half-configured)
# 2. left a 7.0 kernel that GRUB would sort highest and try to boot by
# default — a kernel with a broken GPU stack
#
# Recovery required pinning back to 6.17 and neutralising the failed build.
# This is NOT unique to this host: multiple Proxmox users with NVIDIA cards hit
# the same "GPU not detected on 7.0, pin back to 6.17" wall.
#
# v4.0.0 turns that silent break into a LOUD, DELIBERATE decision by adding a
# kernel-series guard. The host stays on an APPROVED series until you have
# CONFIRMED NVIDIA supports the next one — then you approve it on purpose.
#
# -----------------------------------------------------------------------------
# THE GUARD — THREE INDEPENDENT LAYERS
# -----------------------------------------------------------------------------
# LAYER 1 PREFLIGHT GATE (--preflight)
# Simulates 'apt-get full-upgrade', inspects every kernel/header package it
# would install, and BLOCKS (non-zero exit) if any belong to a series not
# in APPROVED_KERNEL_SERIES. Run this BEFORE every apt full-upgrade.
#
# LAYER 2 APT HOLDS (--apply-guard)
# apt-mark hold on proxmox-default-kernel / proxmox-default-headers (the
# series selector) plus any installed kernel/header packages from an
# unapproved series (e.g. the 7.0 packages). SURGICAL: 6.17 security point
# updates still flow, because they arrive via proxmox-kernel-6.17 which is
# NOT held. Held packages simply show as "kept back".
#
# LAYER 3 BOOT PIN (--apply-guard, via proxmox-boot-tool)
# Pins the running kernel as the GRUB default so that even if a bad kernel
# somehow installs, the host never boots it. Only pins a kernel that has a
# built NVIDIA DKMS module.
#
# Layers are independent on purpose: Layer 1 stops the bad kernel arriving,
# Layer 2 stops it installing, Layer 3 stops it booting.
#
# -----------------------------------------------------------------------------
# APPROVING A NEW KERNEL SERIES (when NVIDIA catches up)
# -----------------------------------------------------------------------------
# 1. CONFIRM a Debian/backports NVIDIA driver builds against the new series.
# 2. Add the series to APPROVED_KERNEL_SERIES below (e.g. "7.0").
# 3. sudo ./setup-gpu-pxe.sh --remove-guard (lifts the apt holds)
# 4. sudo ./setup-gpu-pxe.sh --preflight (should now pass)
# 5. apt full-upgrade, verify DKMS, reboot, re-pin with --apply-guard.
#
# -----------------------------------------------------------------------------
# ARCHITECTURE — THREE PILLARS (driver/passthrough side, unchanged from 3.x)
# -----------------------------------------------------------------------------
# PILLAR 1 — DRIVER INSTALLATION (HOST ONLY)
# - Ensures non-free / non-free-firmware + trixie-backports are enabled
# - Installs nvidia-kernel-dkms + nvidia-driver + nvidia-persistenced from
# trixie-backports (the 6.17-compatible Debian-packaged driver)
# - Installs pve-headers for the running Proxmox kernel
# - DKMS build is performed by apt; verified/self-healed here
# - Blacklists nouveau; provides Secure Boot / MOK guidance
#
# PILLAR 2 — DEVICE NODE STABILITY (solves the chronic UVM race)
# nvidia-setup-nodes.service runs at sysinit (Before=lxc / pve-container),
# creating /dev/nvidia-uvm* before any container starts, so LXC binds real
# char devices instead of empty stub files. udev rules + nvidia-persistenced
# keep nodes and GPU state stable across container restarts.
#
# PILLAR 3 — LXC CONTAINER CONFIGURATION (modern dev* syntax)
# Uses Proxmox 8.1+ dev* directive (automatic device-type detection and
# cgroup2 permissions). print_lxc_config is the authoritative dev* list:
# dev0: /dev/nvidia0,gid=44
# dev1: /dev/nvidiactl,gid=44
# dev2: /dev/nvidia-modeset,gid=44
# dev3: /dev/nvidia-uvm,gid=44
# dev4: /dev/nvidia-uvm-tools,gid=44
# dev5: /dev/nvidia-caps/nvidia-cap1,gid=44
# dev6: /dev/nvidia-caps/nvidia-cap2,gid=44
# gid=44 is the 'video' group on Debian-based systems.
#
# -----------------------------------------------------------------------------
# WHAT LIVES WHERE
# -----------------------------------------------------------------------------
# /etc/systemd/system/nvidia-setup-nodes.service boot-ordering service
# /usr/local/sbin/nvidia-setup-nodes.sh node creation script
# /etc/udev/rules.d/71-nvidia-uvm.rules udev permissions
# /etc/modprobe.d/blacklist-nouveau.conf nouveau blacklist
# /etc/modules-load.d/nvidia.conf module autoload
#
# -----------------------------------------------------------------------------
# USAGE
# -----------------------------------------------------------------------------
# sudo ./setup-gpu-pxe.sh Full install (+ applies guard)
# sudo ./setup-gpu-pxe.sh --preflight Gate an upgrade BEFORE apt
# sudo ./setup-gpu-pxe.sh --apply-guard Apply holds + pin running kernel
# sudo ./setup-gpu-pxe.sh --guard-status Show holds, pin, approved series
# sudo ./setup-gpu-pxe.sh --remove-guard Lift holds (new series approved)
# sudo ./setup-gpu-pxe.sh --check-only System + GPU health
# sudo ./setup-gpu-pxe.sh --force-rebuild Force DKMS rebuild
# sudo ./setup-gpu-pxe.sh --purge Remove all NVIDIA components
# sudo ./setup-gpu-pxe.sh --lxc-config ID Print LXC config snippet
# sudo ./setup-gpu-pxe.sh --dry-run Preview any of the above
# sudo ./setup-gpu-pxe.sh --help Show help
#
# -----------------------------------------------------------------------------
# RECOMMENDED KERNEL UPDATE SOP (manual, for production)
# -----------------------------------------------------------------------------
# 1. apt update
# 2. sudo ./setup-gpu-pxe.sh --preflight <-- HARD GATE; stop if it blocks
# 3. apt full-upgrade (use full-upgrade, not upgrade)
# 4. dkms status confirm new kernel = installed
# 5. reboot into the new kernel
# 6. sudo ./setup-gpu-pxe.sh --check-only verify GPU on new kernel
# 7. sudo ./setup-gpu-pxe.sh --apply-guard re-pin the now-running kernel
# 8. start passthrough LXCs; nvidia-smi inside each
#
# -----------------------------------------------------------------------------
# PREREQUISITES
# -----------------------------------------------------------------------------
# - Proxmox VE 9.x host, run as root
# - Internet access to deb.debian.org (non-free + trixie-backports)
# - An NVIDIA GPU present on the host (verified via lspci)
# - proxmox-headers-* for the running kernel available in apt
#
# -----------------------------------------------------------------------------
# WARNINGS
# -----------------------------------------------------------------------------
# - Driver / kernel changes are HOST-level. NEVER install the NVIDIA driver
# inside an LXC container.
# - Do NOT lift the guard (--remove-guard) until you have CONFIRMED NVIDIA
# builds against the new kernel series. Lifting it re-opens the exact
# failure mode that triggered v4.0.0.
# - Always confirm 'dkms status' shows the NEW kernel as installed BEFORE
# rebooting into it after a kernel update.
# - Use 'apt full-upgrade', never 'apt upgrade', on Proxmox.
#
# -----------------------------------------------------------------------------
# CHANGELOG (full history: github.com/Braedach)
# -----------------------------------------------------------------------------
# v4.0.0 — 2026-06-13 — Kernel-series guard (MAJOR)
# CONTEXT: a 6.17 security update silently pulled the 7.0 kernel series via
# proxmox-default-kernel; NVIDIA 550 cannot build against 7.0; the
# failed DKMS autoinstall wedged dpkg and left a bootable-but-broken
# 7.0 kernel. See "WHY v4.0.0" above.
# - NEW: --preflight. Simulates apt full-upgrade and BLOCKS if any kernel or
# header package from an unapproved series would be installed.
# - NEW: --apply-guard. apt-mark holds the series selector + unapproved-series
# kernel/header packages, then pins the running kernel via
# proxmox-boot-tool (only if it has a built NVIDIA DKMS module).
# - NEW: --remove-guard. Lifts the holds, with confirmation, for when a new
# series has been approved.
# - NEW: --guard-status. Reports approved series, current apt holds, and the
# pinned boot kernel.
# - NEW: APPROVED_KERNEL_SERIES constant (default: 6.17). Single source of
# truth for which series NVIDIA is allowed to follow.
# - CHANGE: full install now applies the guard at the end (holds + pin).
# - FIX: --check-only no longer false-alarms "modules NOT loaded". It now
# warms the GPU (nvidia-smi) first; if the driver is functional but a
# module is idle-unloaded, it reports informationally instead of red.
# - FIX: --check-only reports the boot pin and guard status.
# - DOC: added the manual kernel update SOP; reiterated full-upgrade.
#
# v3.6.0 — 2026-06-12 — Kernel-update hardening + doc reconciliation
# - DKMS self-heal for the running kernel; explicit nvidia-persistenced
# install; Pillar 3 docstring reconciled to 7 dev* entries; CUDA_* dead
# constants removed. (Full detail in git.)
#
# Earlier versions (CUDA-repo era, the SHA1 key rejection, the trixie-backports
# migration, the Pillar 2/3 rework, etc.) are recorded in git history.
#
# -------------------------------------------------------------------------------------
set -euo pipefail
IFS=$'\n\t'
SCRIPT_VERSION="4.0.0"
SCRIPT_NAME="$(basename "$0")"
# -------------------------------------------------------------------------------------
# KERNEL SERIES POLICY (edit APPROVED_KERNEL_SERIES to approve a new series)
# -------------------------------------------------------------------------------------
# Only kernel series listed here are allowed to be installed/followed. A series
# is the X.Y portion of a proxmox kernel package, e.g. "6.17" or "7.0".
# Add a new series ONLY after confirming NVIDIA builds against it (see header).
APPROVED_KERNEL_SERIES=("6.17")
# -------------------------------------------------------------------------------------
# CONFIGURATION CONSTANTS
# -------------------------------------------------------------------------------------
UVM_RULE_FILE="/etc/udev/rules.d/71-nvidia-uvm.rules"
NOUVEAU_BLACKLIST="/etc/modprobe.d/blacklist-nouveau.conf"
MODULES_LOAD_FILE="/etc/modules-load.d/nvidia.conf"
NODE_SCRIPT="/usr/local/sbin/nvidia-setup-nodes.sh"
NODE_SERVICE="/etc/systemd/system/nvidia-setup-nodes.service"
LXC_VIDEO_GID=44 # 'video' group on Debian systems
# Selector meta-packages that decide which kernel series is the default.
SELECTOR_PACKAGES=(proxmox-default-kernel proxmox-default-headers)
# -------------------------------------------------------------------------------------
# FLAGS (set by argument parsing)
# -------------------------------------------------------------------------------------
DRY_RUN=0
PURGE=0
FORCE_REBUILD=0
CHECK_ONLY=0
LXC_CONFIG_ONLY=0
PREFLIGHT_ONLY=0
APPLY_GUARD_ONLY=0
REMOVE_GUARD_ONLY=0
GUARD_STATUS_ONLY=0
LXC_VMID=""
# -------------------------------------------------------------------------------------
# COLOUR / LOGGING
# -------------------------------------------------------------------------------------
RED='\033[0;31m'; YELLOW='\033[0;33m'; GREEN='\033[0;32m'
CYAN='\033[0;36m'; BOLD='\033[1m'; RESET='\033[0m'
info() { echo -e "${GREEN}[INFO]${RESET} $*"; }
warn() { echo -e "${YELLOW}[WARN]${RESET} $*" >&2; }
error() { echo -e "${RED}[ERROR]${RESET} $*" >&2; }
section() { echo -e "\n${BOLD}${CYAN}=== $* ===${RESET}"; }
ok() { echo -e " ${GREEN}[ok]${RESET} $*"; }
fail() { echo -e " ${RED}[x]${RESET} $*"; }
skip() { echo -e " ${YELLOW}[-]${RESET} $*"; }
run() {
if [[ "${DRY_RUN}" -eq 1 ]]; then
echo -e " ${CYAN}[dry-run]${RESET} $*"
else
eval "$@"
fi
}
# -------------------------------------------------------------------------------------
# KERNEL SERIES HELPERS
# -------------------------------------------------------------------------------------
# Extract the X.Y series from a proxmox kernel/header package name.
# proxmox-kernel-6.17.13-13-pve -> 6.17
# proxmox-kernel-7.0.6-2-pve -> 7.0
# proxmox-kernel-6.17 -> 6.17 (series meta-package)
# proxmox-default-kernel -> "" (selector; handled separately)
kernel_series_from_pkg() {
local pkg="$1" rest=""
rest="${pkg#proxmox-kernel-}"
if [[ "$rest" == "$pkg" ]]; then
rest="${pkg#proxmox-headers-}"
fi
if [[ "$rest" == "$pkg" ]]; then
echo ""
return
fi
echo "$rest" | grep -oE '^[0-9]+\.[0-9]+' || true
}
is_approved_series() {
local s="$1" a
for a in "${APPROVED_KERNEL_SERIES[@]}"; do
[[ "$s" == "$a" ]] && return 0
done
return 1
}
# -------------------------------------------------------------------------------------
# PREREQUISITES
# -------------------------------------------------------------------------------------
check_root() {
[[ "$(id -u)" -eq 0 ]] || { error "Must run as root. Use: sudo $0 $*"; exit 1; }
}
check_prerequisites() {
section "Checking Prerequisites"
local missing=()
for cmd in systemctl apt apt-mark dpkg dpkg-query wget lspci modprobe dkms mokutil; do
if command -v "$cmd" &>/dev/null; then
ok "$cmd found"
else
fail "$cmd missing"
missing+=("$cmd")
fi
done
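# mokutil is optional (Secure Boot check only). proxmox-boot-tool is also
# optional, but it is not checked here; it is tested where it is used.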
if [[ ${#missing[@]} -gt 0 ]]; then
local required_missing=()
for m in "${missing[@]}"; do
[[ "$m" != "mokutil" ]] && required_missing+=("$m")
done
if [[ ${#required_missing[@]} -gt 0 ]]; then
error "Required tools missing: ${required_missing[*]}"
error "Install with: apt install -y ${required_missing[*]}"
exit 1
fi
fi
}
# -------------------------------------------------------------------------------------
# GPU DETECTION
# -------------------------------------------------------------------------------------
detect_nvidia_gpu() {
section "Detecting NVIDIA GPU"
local gpu_list
gpu_list="$(lspci | grep -iE '(vga|3d|display)' || true)"
if echo "$gpu_list" | grep -qi nvidia; then
local nvidia_line
nvidia_line="$(echo "$gpu_list" | grep -i nvidia | head -1)"
ok "NVIDIA GPU detected: ${nvidia_line}"
if [[ "$(echo "$gpu_list" | wc -l)" -gt 1 ]]; then
info "All display controllers detected:"
while IFS= read -r line; do
info " $line"
done <<< "$gpu_list"
fi
return 0
else
error "No NVIDIA GPU found via lspci."
info "lspci output:"
echo "$gpu_list"
exit 1
fi
}
# -------------------------------------------------------------------------------------
# DEBIAN SOURCES — NON-FREE + TRIXIE-BACKPORTS
# -------------------------------------------------------------------------------------
ensure_debian_sources() {
section "Ensuring Debian Sources (non-free + trixie-backports)"
local sources_file="/etc/apt/sources.list"
local needs_update=0
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] Would verify non-free and trixie-backports in apt sources"
return
fi
if grep -rq 'non-free-firmware' /etc/apt/sources.list /etc/apt/sources.list.d/ 2>/dev/null; then
ok "non-free-firmware already in sources"
else
warn "non-free-firmware not found in apt sources"
warn "Adding to ${sources_file} — review if this causes issues"
if ! grep -q 'deb.debian.org/debian trixie' "${sources_file}" 2>/dev/null; then
echo "" >> "${sources_file}"
echo "# Added by setup-gpu-pxe.sh for NVIDIA drivers" >> "${sources_file}"
echo "deb http://deb.debian.org/debian trixie main contrib non-free non-free-firmware" >> "${sources_file}"
ok "Added Debian trixie non-free line to ${sources_file}"
else
if grep 'deb.debian.org/debian trixie' "${sources_file}" | grep -qv 'non-free'; then
warn "Found trixie line without non-free — please add non-free non-free-firmware manually:"
warn " Edit: ${sources_file}"
warn " Change: 'deb http://deb.debian.org/debian trixie main'"
warn " To: 'deb http://deb.debian.org/debian trixie main contrib non-free non-free-firmware'"
else
ok "non-free appears to be present in trixie sources"
fi
fi
needs_update=1
fi
if grep -rq 'trixie-backports' /etc/apt/sources.list /etc/apt/sources.list.d/ 2>/dev/null; then
ok "trixie-backports already in sources"
else
info "Adding trixie-backports to ${sources_file}..."
echo "" >> "${sources_file}"
echo "# Added by setup-gpu-pxe.sh — required for NVIDIA drivers on kernel 6.17+" >> "${sources_file}"
echo "deb http://deb.debian.org/debian trixie-backports main contrib non-free non-free-firmware" >> "${sources_file}"
ok "trixie-backports added"
needs_update=1
fi
if [[ "$needs_update" -eq 1 ]]; then
apt_update
fi
}
# -------------------------------------------------------------------------------------
# APT HELPERS WITH RETRY
# -------------------------------------------------------------------------------------
apt_update() {
local attempt max_attempts=3 delay=5
for ((attempt=1; attempt<=max_attempts; attempt++)); do
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] apt-get update"
return 0
fi
info "apt-get update (attempt ${attempt}/${max_attempts})..."
if DEBIAN_FRONTEND=noninteractive apt-get update -qq; then
return 0
fi
warn "apt update failed (attempt ${attempt}). Retrying in ${delay}s..."
sleep "$delay"
delay=$((delay * 2))
done
error "apt-get update failed after ${max_attempts} attempts."
return 1
}
download_with_retry() {
local url="$1" dest="$2"
local attempt max_attempts=3 delay=5
for ((attempt=1; attempt<=max_attempts; attempt++)); do
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] wget -q -O '${dest}' '${url}'"
return 0
fi
info "Downloading: $(basename "$url") (attempt ${attempt}/${max_attempts})..."
if wget -q --timeout=60 -O "$dest" "$url"; then
ok "Downloaded: $(basename "$url")"
return 0
fi
warn "Download failed (attempt ${attempt}). Retrying in ${delay}s..."
sleep "$delay"
delay=$((delay * 2))
done
error "Failed to download: $url after ${max_attempts} attempts."
return 1
}
# -------------------------------------------------------------------------------------
# NOUVEAU BLACKLIST
# -------------------------------------------------------------------------------------
write_nouveau_blacklist() {
cat > "${NOUVEAU_BLACKLIST}" << 'BLEOF'
blacklist nouveau
options nouveau modeset=0
BLEOF
}
blacklist_nouveau() {
section "Blacklisting Nouveau Driver"
if [[ -f "$NOUVEAU_BLACKLIST" ]]; then
ok "nouveau already blacklisted"
return
fi
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] Would write ${NOUVEAU_BLACKLIST}"
return
fi
write_nouveau_blacklist
ok "nouveau blacklisted"
update-initramfs -u -k all 2>/dev/null || true
}
# -------------------------------------------------------------------------------------
# SECURE BOOT CHECK
# -------------------------------------------------------------------------------------
check_secure_boot() {
section "Checking Secure Boot"
if ! command -v mokutil &>/dev/null; then
skip "mokutil not available — skipping Secure Boot check"
return
fi
local sb_state
sb_state="$(mokutil --sb-state 2>/dev/null || echo "unknown")"
if echo "$sb_state" | grep -qi "SecureBoot enabled"; then
warn "Secure Boot is ENABLED."
warn "NVIDIA DKMS modules must be signed to load."
warn "If modules fail to load after install, enrol a MOK key:"
warn " openssl req -new -x509 -newkey rsa:2048 -keyout /root/mok.key"
warn " -out /root/mok.crt -days 3650 -subj /CN=NVIDIA-DKMS-MOK/ -nodes"
warn " mokutil --import /root/mok.crt"
warn " (reboot, enrol key in MOK manager, then reboot again)"
else
ok "Secure Boot: ${sb_state}"
fi
}
# -------------------------------------------------------------------------------------
# INSTALL NVIDIA DRIVERS FROM TRIXIE-BACKPORTS
# -------------------------------------------------------------------------------------
install_nvidia_backports() {
section "Installing NVIDIA Drivers (trixie-backports)"
local pkgs_needed=()
if dpkg -l pve-headers 2>/dev/null | grep -q '^ii'; then
ok "pve-headers already installed"
else
pkgs_needed+=(pve-headers)
fi
if dpkg -l nvidia-kernel-dkms 2>/dev/null | grep -q '^ii'; then
ok "nvidia-kernel-dkms already installed"
else
pkgs_needed+=(nvidia-kernel-dkms)
fi
if dpkg -l nvidia-driver 2>/dev/null | grep -q '^ii'; then
ok "nvidia-driver already installed"
else
pkgs_needed+=(nvidia-driver)
fi
if dpkg -l nvidia-persistenced 2>/dev/null | grep -q '^ii'; then
ok "nvidia-persistenced already installed"
else
pkgs_needed+=(nvidia-persistenced)
fi
if [[ ${#pkgs_needed[@]} -eq 0 ]]; then
info "All NVIDIA driver packages already installed."
return
fi
info "Installing from trixie-backports: ${pkgs_needed[*]}"
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] apt-get install -t trixie-backports -y ${pkgs_needed[*]}"
return
fi
local attempt max_attempts=3 delay=5
for ((attempt=1; attempt<=max_attempts; attempt++)); do
info "Installing (attempt ${attempt}/${max_attempts})..."
if DEBIAN_FRONTEND=noninteractive apt-get install -t trixie-backports -y -qq "${pkgs_needed[@]}"; then
ok "NVIDIA driver packages installed from trixie-backports"
return 0
fi
warn "Install failed (attempt ${attempt}). Retrying in ${delay}s..."
# Under 'set -e' a failed refresh would abort the retry loop, so tolerate it here.
apt_update || true
sleep "$delay"
delay=$((delay * 2))
done
error "Failed to install NVIDIA driver packages after ${max_attempts} attempts."
error "Check: apt-get install -t trixie-backports nvidia-kernel-dkms nvidia-driver"
return 1
}
# -------------------------------------------------------------------------------------
# DKMS STATUS CHECK + RUNNING-KERNEL SELF-HEAL
# -------------------------------------------------------------------------------------
build_dkms_all_kernels() {
section "Verifying DKMS Build"
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] Would verify DKMS status and autoinstall for running kernel if needed"
return
fi
local running_kernel
running_kernel="$(uname -r)"
info "Running kernel: ${running_kernel}"
info "DKMS status:"
local dkms_out
dkms_out="$(dkms status 2>/dev/null || true)"
if [[ -z "$dkms_out" ]]; then
warn "No DKMS modules registered yet."
else
while IFS= read -r line; do
if echo "$line" | grep -q "installed"; then
ok "$line"
else
warn "$line"
fi
done <<< "$dkms_out"
fi
# Match nvidia lines only, so another DKMS module cannot satisfy this check.
if echo "$dkms_out" | grep -i nvidia | grep -F "$running_kernel" | grep -q "installed"; then
ok "nvidia DKMS module installed for running kernel ${running_kernel}"
return 0
fi
warn "No installed nvidia DKMS module for running kernel ${running_kernel}"
warn "Attempting 'dkms autoinstall' to build it now..."
if dkms autoinstall 2>&1 | tee /tmp/dkms_autoinstall.log; then
if dkms status 2>/dev/null | grep -i nvidia | grep -F "$running_kernel" | grep -q "installed"; then
ok "DKMS module built for running kernel ${running_kernel}"
else
warn "autoinstall ran but module still not shown installed for ${running_kernel}"
warn "If you just updated the kernel, REBOOT into it and re-run --check-only."
warn "See /tmp/dkms_autoinstall.log for build details."
fi
else
error "dkms autoinstall failed — see /tmp/dkms_autoinstall.log"
warn "Ensure headers for ${running_kernel} are installed (proxmox-headers-*)."
fi
}
# -------------------------------------------------------------------------------------
# MODULE AUTOLOAD
# -------------------------------------------------------------------------------------
write_module_autoload() {
cat > "${MODULES_LOAD_FILE}" << 'MLEOF'
nvidia
nvidia_modeset
nvidia_drm
nvidia_uvm
MLEOF
}
configure_module_autoload() {
section "Configuring Module Autoload"
if [[ -f "$MODULES_LOAD_FILE" ]]; then
ok "Module autoload already configured: $MODULES_LOAD_FILE"
return
fi
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] Would write ${MODULES_LOAD_FILE}"
return
fi
write_module_autoload
ok "Module autoload configured"
}
# -------------------------------------------------------------------------------------
# PILLAR 2 — DEVICE NODE STABILITY
# -------------------------------------------------------------------------------------
write_node_creation_script() {
cat > "${NODE_SCRIPT}" << 'NODEEOF'
#!/bin/bash
# nvidia-setup-nodes.sh — called by nvidia-setup-nodes.service at sysinit
# Ensures all /dev/nvidia* device nodes exist with correct permissions.
# This runs BEFORE any LXC containers start (see Before= in service unit).
set -euo pipefail
log() { echo "[nvidia-setup-nodes] $*"; }
if ! lsmod | grep -q '^nvidia_uvm'; then
log "Loading nvidia_uvm module..."
if ! modprobe nvidia_uvm 2>/tmp/nvidia_uvm_modprobe.err; then
log "ERROR: failed to load nvidia_uvm — see /tmp/nvidia_uvm_modprobe.err"
cat /tmp/nvidia_uvm_modprobe.err || true
exit 1
fi
fi
for i in $(seq 1 10); do
grep -q 'nvidia-uvm' /proc/devices 2>/dev/null && break
sleep 1
done
UVM_MAJOR="$(awk '$2=="nvidia-uvm"{print $1}' /proc/devices || true)"
if [[ -z "${UVM_MAJOR}" ]]; then
log "ERROR: nvidia-uvm major not found in /proc/devices after 10 seconds"
exit 1
fi
log "nvidia-uvm major: ${UVM_MAJOR}"
if [[ ! -c /dev/nvidia-uvm ]]; then
[[ -e /dev/nvidia-uvm ]] && rm -f /dev/nvidia-uvm
mknod -m 0666 /dev/nvidia-uvm c "${UVM_MAJOR}" 0
log "Created /dev/nvidia-uvm"
fi
if [[ ! -c /dev/nvidia-uvm-tools ]]; then
[[ -e /dev/nvidia-uvm-tools ]] && rm -f /dev/nvidia-uvm-tools
mknod -m 0666 /dev/nvidia-uvm-tools c "${UVM_MAJOR}" 1
log "Created /dev/nvidia-uvm-tools"
fi
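# Only chmod character devices. /dev/nvidia-caps is a directory, and mode 0666
# would strip the execute bit needed to reach the nodes inside it.
for node in /dev/nvidia* /dev/nvidia-caps/*; do
if [[ -c "$node" ]]; then chmod 0666 "$node"; fi
done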
log "All nvidia device nodes verified."
NODEEOF
chmod 755 "${NODE_SCRIPT}"
}
install_node_creation_script() {
info "Installing node creation script: ${NODE_SCRIPT}"
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] Would write ${NODE_SCRIPT}"
return
fi
write_node_creation_script
ok "Node creation script installed"
}
write_node_service() {
cat > "${NODE_SERVICE}" << 'SVCEOF'
[Unit]
Description=NVIDIA Device Node Setup (must run before LXC containers)
Documentation=https://github.com/braedach/homelab
After=systemd-modules-load.service
Before=lxc.service
Before=pve-container@.service
ConditionPathExists=/proc/devices
DefaultDependencies=no
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/sbin/nvidia-setup-nodes.sh
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=sysinit.target
SVCEOF
}
install_node_service() {
info "Installing systemd service: ${NODE_SERVICE}"
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] Would write ${NODE_SERVICE}"
return
fi
write_node_service
ok "Service unit installed"
}
write_udev_rules() {
cat > "${UVM_RULE_FILE}" << 'UDEVEOF'
# nvidia-uvm device permissions — managed by setup-gpu-pxe.sh
KERNEL=="nvidia-uvm", MODE="0666"
KERNEL=="nvidia-uvm-tools", MODE="0666"
KERNEL=="nvidia*", MODE="0666"
UDEVEOF
}
install_udev_rules() {
info "Installing udev rules: ${UVM_RULE_FILE}"
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] Would write ${UVM_RULE_FILE}"
return
fi
write_udev_rules
ok "udev rules installed"
udevadm control --reload-rules 2>/dev/null || true
udevadm trigger 2>/dev/null || true
}
enable_nvidia_persistenced() {
section "Enabling nvidia-persistenced"
if ! systemctl list-unit-files nvidia-persistenced.service &>/dev/null; then
warn "nvidia-persistenced.service not present — package may not be installed yet."
warn "It is installed by install_nvidia_backports; re-run after a reboot if missing."
return
fi
if systemctl is-enabled nvidia-persistenced &>/dev/null; then
ok "nvidia-persistenced already enabled"
else
run "systemctl enable nvidia-persistenced 2>/dev/null || true"
run "systemctl start nvidia-persistenced 2>/dev/null || true"
ok "nvidia-persistenced enabled"
fi
}
setup_device_stability() {
section "Setting Up Device Node Stability (Pillar 2)"
install_node_creation_script
install_node_service
install_udev_rules
if [[ "${DRY_RUN}" -eq 0 ]]; then
systemctl daemon-reload
systemctl enable nvidia-setup-nodes.service
systemctl restart nvidia-setup-nodes.service || true
ok "nvidia-setup-nodes.service enabled and started"
else
info "[dry-run] systemctl daemon-reload && systemctl enable nvidia-setup-nodes.service"
fi
enable_nvidia_persistenced
}
# -------------------------------------------------------------------------------------
# MODULE LOAD + VERIFICATION
# -------------------------------------------------------------------------------------
warm_up_gpu() {
# Touch the GPU so any idle-unloaded modules reload before we sample lsmod.
if command -v nvidia-smi &>/dev/null; then
nvidia-smi >/dev/null 2>&1 || true
fi
}
load_and_verify_modules() {
section "Loading and Verifying NVIDIA Modules"
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] Would load and verify nvidia modules"
return
fi
local modules=(nvidia nvidia_modeset nvidia_drm nvidia_uvm)
local failed=()
for mod in "${modules[@]}"; do
if lsmod | grep -q "^${mod}[[:space:]]"; then
ok "Module loaded: $mod"
else
info "Loading module: $mod"
if modprobe "$mod" 2>/tmp/modprobe_err_${mod}.log; then
ok "Module loaded: $mod"
else
fail "Module failed: $mod (see /tmp/modprobe_err_${mod}.log)"
failed+=("$mod")
fi
fi
done
if [[ ${#failed[@]} -gt 0 ]]; then
warn "Some modules failed to load: ${failed[*]}"
warn "This is expected if a reboot is needed after DKMS build."
warn "Reboot and re-run with --check-only to verify."
fi
}
run_nvidia_smi_check() {
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] Would run nvidia-smi"
return
fi
section "nvidia-smi Verification"
if command -v nvidia-smi &>/dev/null; then
if nvidia-smi; then
ok "nvidia-smi successful"
else
warn "nvidia-smi returned an error — reboot may be required"
fi
else
warn "nvidia-smi not found — driver may need a reboot to activate"
fi
}
# -------------------------------------------------------------------------------------
# BOOT PIN HELPERS
# -------------------------------------------------------------------------------------
get_pinned_kernel() {
if command -v proxmox-boot-tool &>/dev/null; then
proxmox-boot-tool kernel list 2>/dev/null \
| awk '/^Pinned kernel:/{getline; gsub(/^[ \t]+/,""); print; exit}'
fi
}
pin_running_kernel() {
section "Pinning Boot Kernel (Guard Layer 3)"
local rk; rk="$(uname -r)"
if ! command -v proxmox-boot-tool &>/dev/null; then
warn "proxmox-boot-tool not found — cannot auto-pin."
warn "Ensure your bootloader defaults to ${rk} manually."
return
fi
# Never pin a kernel that has no built NVIDIA module.
if ! dkms status 2>/dev/null | grep -i nvidia | grep -F "$rk" | grep -q installed; then
warn "Running kernel ${rk} has no installed nvidia DKMS module — NOT pinning."
warn "Resolve the driver build first, then pin manually:"
warn " proxmox-boot-tool kernel pin ${rk}"
return
fi
local current_pin; current_pin="$(get_pinned_kernel || true)"
if [[ "$current_pin" == "$rk" ]]; then
ok "Boot kernel already pinned to ${rk}"
return
fi
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] proxmox-boot-tool kernel pin ${rk}"
return
fi
info "Pinning running kernel: ${rk}"
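# Report success only when the pin command actually succeeded.
if proxmox-boot-tool kernel pin "${rk}"; then
ok "Boot kernel pinned to ${rk}"
else
warn "Pin command returned an error; boot kernel NOT pinned to ${rk}."
fi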
}
# -------------------------------------------------------------------------------------
# GUARD LAYER 1 — PREFLIGHT GATE
# -------------------------------------------------------------------------------------
# Returns 0 if safe, 2 if an unapproved kernel series would be installed.
do_preflight() {
section "Preflight — Kernel Series Guard (Layer 1)"
info "Approved kernel series: ${APPROVED_KERNEL_SERIES[*]}"
info "Simulating: apt-get full-upgrade ..."
local sim
sim="$(LANG=C apt-get -s full-upgrade 2>/dev/null || true)"
local incoming
incoming="$(echo "$sim" | awk '/^Inst /{print $2}' || true)"
local kernel_incoming=() bad=() selector_moving=0
while IFS= read -r pkg; do
[[ -z "$pkg" ]] && continue
case "$pkg" in
proxmox-default-kernel|proxmox-default-headers) selector_moving=1 ;;
esac
if [[ "$pkg" == proxmox-kernel-* || "$pkg" == proxmox-headers-* ]]; then
kernel_incoming+=("$pkg")
local s; s="$(kernel_series_from_pkg "$pkg")"
if [[ -n "$s" ]] && ! is_approved_series "$s"; then
bad+=("$pkg (series ${s})")
fi
fi
done <<< "$incoming"
if [[ ${#kernel_incoming[@]} -gt 0 ]]; then
info "Kernel/header packages this upgrade would install:"
for p in "${kernel_incoming[@]}"; do info " $p"; done
else
ok "No kernel or header packages in this upgrade."
fi
if [[ "$selector_moving" -eq 1 ]]; then
warn "proxmox-default-kernel/headers would change — the DEFAULT series is moving."
warn "This is exactly how the 7.0 series jump arrived. Inspect carefully."
fi
if [[ ${#bad[@]} -gt 0 ]]; then
echo ""
error "=================================================================="
error " BLOCKED — upgrade would install an UNAPPROVED kernel series"
error "=================================================================="
for b in "${bad[@]}"; do error " $b"; done
error ""
error "Approved series: ${APPROVED_KERNEL_SERIES[*]}"
error "An unapproved series will FAIL the NVIDIA DKMS build, can wedge dpkg,"
error "and can leave a bootable-but-broken kernel as the GRUB default."
error ""
error "DO NOT run 'apt full-upgrade' until you have either:"
error " (a) applied the guard: sudo $SCRIPT_NAME --apply-guard"
error " (holds the offending packages so apt keeps them back), OR"
error " (b) CONFIRMED NVIDIA supports the new series, then approved it"
error " by editing APPROVED_KERNEL_SERIES and running --remove-guard."
error "=================================================================="
return 2
fi
ok "Preflight PASSED — no unapproved kernel series incoming."
ok "Safe to proceed with: apt full-upgrade"
return 0
}
# -------------------------------------------------------------------------------------
# GUARD LAYER 2 — APT HOLDS (+ triggers Layer 3 pin)
# -------------------------------------------------------------------------------------
build_hold_list() {
# Echo (one per line) the packages that should be held:
# - the selector meta-packages (always)
# - any installed kernel/header packages from an unapproved series
local p
for p in "${SELECTOR_PACKAGES[@]}"; do echo "$p"; done
while IFS= read -r pkg; do
[[ -z "$pkg" ]] && continue
local s; s="$(kernel_series_from_pkg "$pkg")"
[[ -z "$s" ]] && continue
if ! is_approved_series "$s"; then echo "$pkg"; fi
done < <(dpkg-query -W -f='${Package}\n' 'proxmox-kernel-*' 'proxmox-headers-*' 2>/dev/null | sort -u || true)
}
apply_kernel_guard() {
section "Applying Kernel Series Guard (Layers 2 + 3)"
info "Approved kernel series: ${APPROVED_KERNEL_SERIES[*]}"
local raw uniq=() seen=" "
raw="$(build_hold_list)"
while IFS= read -r p; do
[[ -z "$p" ]] && continue
if [[ "$seen" != *" $p "* ]]; then uniq+=("$p"); seen+="$p "; fi
done <<< "$raw"
if [[ ${#uniq[@]} -eq 0 ]]; then
warn "No packages resolved for holding (unexpected)."
else
info "Packages to hold (apt will keep these back on upgrade):"
for p in "${uniq[@]}"; do info " $p"; done
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] apt-mark hold ${uniq[*]}"
else
if apt-mark hold "${uniq[@]}"; then
ok "Holds applied. 'proxmox-default-kernel' will now show as kept-back."
ok "6.17 security point-updates STILL flow (proxmox-kernel-6.17 is not held)."
else
warn "Some holds may have failed (package not installed?). Check with: apt-mark showhold"
fi
fi
fi
pin_running_kernel
}
# -------------------------------------------------------------------------------------
# GUARD — REMOVE
# -------------------------------------------------------------------------------------
remove_kernel_guard() {
section "Removing Kernel Series Guard"
warn "This lifts the apt holds that protect you from an unsupported kernel series."
warn "Only do this once you have CONFIRMED NVIDIA builds against the new series"
warn "AND added that series to APPROVED_KERNEL_SERIES."
echo ""
local held
held="$(apt-mark showhold 2>/dev/null | grep -E '^proxmox-(default|kernel|headers)-' || true)"
if [[ -z "$held" ]]; then
ok "No matching proxmox kernel/header holds present."
return
fi
info "Currently held:"
echo "$held" | while read -r p; do info " $p"; done
echo ""
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] apt-mark unhold ${held//$'\n'/ }"
return
fi
read -r -p "Type 'unhold' to confirm lifting these holds: " confirm
[[ "$confirm" == "unhold" ]] || { info "Cancelled — holds left in place."; return; }
# shellcheck disable=SC2086
if apt-mark unhold $held; then
ok "Holds removed. Re-run --preflight BEFORE upgrading."
else
warn "Some unholds may have failed. Check with: apt-mark showhold"
fi
info "The boot-kernel pin (if set) is unchanged — manage via proxmox-boot-tool."
}
# -------------------------------------------------------------------------------------
# GUARD — STATUS
# -------------------------------------------------------------------------------------
guard_status() {
section "Kernel Guard Status"
echo ""
echo "-- Approved Kernel Series -------------------------------"
info "${APPROVED_KERNEL_SERIES[*]}"
echo ""
echo "-- Running Kernel ---------------------------------------"
info "$(uname -r)"
echo ""
echo "-- apt Holds (Layer 2) ----------------------------------"
local holds
holds="$(apt-mark showhold 2>/dev/null | grep -E '^proxmox-(default|kernel|headers)-' || true)"
if [[ -n "$holds" ]]; then
echo "$holds" | while read -r p; do ok "$p [held]"; done
else
fail "No proxmox kernel/header holds set — guard Layer 2 NOT active."
warn "Apply with: sudo $SCRIPT_NAME --apply-guard"
fi
echo ""
echo "-- Boot Kernel Pin (Layer 3) ----------------------------"
if command -v proxmox-boot-tool &>/dev/null; then
local pinned; pinned="$(get_pinned_kernel || true)"
if [[ -n "$pinned" && "$pinned" != "None." ]]; then
ok "Pinned kernel: ${pinned}"
[[ "$pinned" == "$(uname -r)" ]] || warn "Pinned kernel differs from running kernel."
else
fail "No boot kernel pinned — guard Layer 3 NOT active."
warn "Apply with: sudo $SCRIPT_NAME --apply-guard"
fi
else
skip "proxmox-boot-tool not available — cannot report pin state."
fi
echo ""
echo "-- Installed Proxmox Kernels ----------------------------"
local pk
while IFS= read -r pk; do
[[ -z "$pk" ]] && continue
local s; s="$(kernel_series_from_pkg "$pk")"
if [[ -n "$s" ]] && is_approved_series "$s"; then
ok "$pk (series ${s}, approved)"
elif [[ -n "$s" ]]; then
warn "$pk (series ${s}, NOT approved — should be held)"
fi
done < <(dpkg-query -W -f='${Package}\n' 'proxmox-kernel-[0-9]*' 2>/dev/null | sort -u || true)
}
# -------------------------------------------------------------------------------------
# PILLAR 3 — LXC CONTAINER CONFIGURATION
# -------------------------------------------------------------------------------------
print_lxc_config() {
local vmid="${1:-<VMID>}"
echo ""
echo "======================================================="
echo " LXC GPU Passthrough Configuration"
echo " Container: ${vmid}"
echo " File: /etc/pve/lxc/${vmid}.conf"
echo "======================================================="
echo ""
echo "# NVIDIA GPU passthrough — generated by setup-gpu-pxe.sh v${SCRIPT_VERSION}"
echo "# Uses Proxmox 8.1+ dev* syntax (handles device type detection automatically)"
echo "# gid=${LXC_VIDEO_GID} should be the 'video' group (44 on Debian-based systems)"
echo "# Verify with: getent group video (inside the container)"
echo "#"
echo "dev0: /dev/nvidia0,gid=${LXC_VIDEO_GID}"
echo "dev1: /dev/nvidiactl,gid=${LXC_VIDEO_GID}"
echo "dev2: /dev/nvidia-modeset,gid=${LXC_VIDEO_GID}"
echo "dev3: /dev/nvidia-uvm,gid=${LXC_VIDEO_GID}"
echo "dev4: /dev/nvidia-uvm-tools,gid=${LXC_VIDEO_GID}"
echo "dev5: /dev/nvidia-caps/nvidia-cap1,gid=${LXC_VIDEO_GID}"
echo "dev6: /dev/nvidia-caps/nvidia-cap2,gid=${LXC_VIDEO_GID}"
echo ""
echo "# Also recommended: container startup delay to allow host nodes to settle"
echo "# startup: order=2,up=15"
echo ""
echo "======================================================="
echo ""
echo "NOTES:"
echo " 1. Do NOT install the full NVIDIA driver package inside the container."
echo " Only install userspace libraries (libnvidia-compute-*) if needed."
echo " 2. The dev* directive handles cgroup2 permissions automatically."
echo " You do NOT need separate lxc.cgroup2.devices.allow lines."
echo " 3. If you have multiple GPUs (e.g. dual RTX), increment the dev* index"
echo " and add: dev7: /dev/nvidia1,gid=${LXC_VIDEO_GID}"
echo " 4. After applying config, restart the container:"
echo " pct stop ${vmid} && pct start ${vmid}"
echo " 5. Verify inside the container:"
echo " ls -la /dev/nvidia*"
echo " nvidia-smi"
echo ""
}
# -------------------------------------------------------------------------------------
# PURGE
# -------------------------------------------------------------------------------------
do_purge() {
section "Purging All NVIDIA Components"
warn "This will remove ALL NVIDIA packages and configuration."
read -r -p "Are you sure? (yes/no): " confirm
[[ "$confirm" == "yes" ]] || { info "Purge cancelled."; exit 0; }
run "systemctl stop nvidia-setup-nodes.service 2>/dev/null || true"
run "systemctl disable nvidia-setup-nodes.service 2>/dev/null || true"
run "systemctl stop nvidia-persistenced 2>/dev/null || true"
run "systemctl disable nvidia-persistenced 2>/dev/null || true"
run "rm -f '${NODE_SERVICE}' '${NODE_SCRIPT}' '${UVM_RULE_FILE}' '${MODULES_LOAD_FILE}' '${NOUVEAU_BLACKLIST}'"
if [[ "${DRY_RUN}" -eq 0 ]]; then
systemctl daemon-reload
udevadm control --reload-rules 2>/dev/null || true
fi
local nvidia_pkgs
nvidia_pkgs="$(dpkg -l '*nvidia*' '*libnvidia*' 2>/dev/null | awk '/^ii/{print $2}' || true)"
if [[ -n "$nvidia_pkgs" ]]; then
info "Removing packages:"
echo "$nvidia_pkgs" | while read -r p; do info " $p"; done
if [[ "${DRY_RUN}" -eq 0 ]]; then
# shellcheck disable=SC2086
DEBIAN_FRONTEND=noninteractive apt-get purge -y $nvidia_pkgs || true
DEBIAN_FRONTEND=noninteractive apt-get autoremove -y || true
else
info "[dry-run] apt-get purge -y ${nvidia_pkgs//$'\n'/ }"
fi
fi
ok "Purge complete. You may want to reboot."
info "Note: --purge does NOT touch kernel guard holds. Use --remove-guard for those."
}
# -------------------------------------------------------------------------------------
# HEALTH CHECK / --check-only
# -------------------------------------------------------------------------------------
do_check() {
section "NVIDIA System Health Check"
# Warm the GPU first so idle-unloaded modules reload before we sample state.
warm_up_gpu
local smi_ok=0
if command -v nvidia-smi &>/dev/null && nvidia-smi &>/dev/null; then
smi_ok=1
fi
echo ""
echo "-- Running Kernel ---------------------------------------"
info "$(uname -r)"
echo ""
echo "-- Kernel Modules ---------------------------------------"
for mod in nvidia nvidia_modeset nvidia_drm nvidia_uvm; do
if lsmod | grep -q "^${mod}[[:space:]]"; then
ok "$mod loaded"
elif [[ "$smi_ok" -eq 1 ]]; then
skip "$mod not currently loaded (driver functional; loads on access)"
else
fail "$mod NOT loaded"
fi
done
echo ""
echo "-- Device Nodes -----------------------------------------"
for dev in /dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset /dev/nvidia-uvm /dev/nvidia-uvm-tools; do
if [[ -c "$dev" ]]; then
local info_str; info_str="$(ls -la "$dev" 2>/dev/null)"
ok "$dev -> $info_str"
elif [[ -e "$dev" ]]; then
fail "$dev EXISTS but is NOT a character device (stub file — timing bug)"
else
fail "$dev MISSING"
fi
done
for cap in /dev/nvidia-caps/nvidia-cap1 /dev/nvidia-caps/nvidia-cap2; do
if [[ -c "$cap" ]]; then
ok "$cap -> $(ls -la "$cap" 2>/dev/null)"
else
skip "$cap not found (normal on GPUs without MIG/caps support)"
fi
done
echo ""
echo "-- systemd Services -------------------------------------"
for svc in nvidia-setup-nodes.service nvidia-persistenced.service; do
local state; state="$(systemctl is-active "$svc" 2>/dev/null || echo "inactive")"
local enabled; enabled="$(systemctl is-enabled "$svc" 2>/dev/null || echo "disabled")"
if [[ "$state" == "active" && "$enabled" == "enabled" ]]; then
ok "$svc [active/enabled]"
else
fail "$svc [${state}/${enabled}]"
fi
done
echo ""
echo "-- DKMS Status (running kernel = $(uname -r)) -----------"
if command -v dkms &>/dev/null; then
local rk; rk="$(uname -r)"
dkms status | while IFS= read -r line; do
if echo "$line" | grep -q "installed"; then
ok "$line"
else
fail "$line"
fi
done
if dkms status 2>/dev/null | grep -i nvidia | grep -F "$rk" | grep -q "installed"; then
ok "nvidia module present for running kernel ${rk}"
else
fail "no installed nvidia module for running kernel ${rk} — run a full install or --force-rebuild"
fi
else
skip "dkms not installed"
fi
echo ""
echo "-- nvidia-smi -------------------------------------------"
if command -v nvidia-smi &>/dev/null; then
nvidia-smi || warn "nvidia-smi failed"
else
fail "nvidia-smi not found"
fi
echo ""
echo "-- Kernel Guard -----------------------------------------"
local holds pin
holds="$(apt-mark showhold 2>/dev/null | grep -E '^proxmox-(default|kernel|headers)-' || true)"
if [[ -n "$holds" ]]; then
ok "apt holds active (Layer 2):"
echo "$holds" | while read -r p; do ok " $p"; done
else
fail "No proxmox kernel/header apt holds — guard Layer 2 NOT active (run --apply-guard)"
fi
if command -v proxmox-boot-tool &>/dev/null; then
pin="$(get_pinned_kernel || true)"
if [[ -n "$pin" && "$pin" != "None." ]]; then
ok "Boot kernel pinned (Layer 3): ${pin}"
else
fail "No boot kernel pinned — guard Layer 3 NOT active (run --apply-guard)"
fi
fi
echo ""
echo "-- Installed Files --------------------------------------"
for f in "$NODE_SERVICE" "$NODE_SCRIPT" "$UVM_RULE_FILE" "$NOUVEAU_BLACKLIST" "$MODULES_LOAD_FILE"; do
if [[ -f "$f" ]]; then
ok "$f"
else
fail "$f MISSING"
fi
done
echo ""
print_lxc_config "YOUR_VMID"
}
# -------------------------------------------------------------------------------------
# FORCE REBUILD
# -------------------------------------------------------------------------------------
do_force_rebuild() {
section "Force Rebuilding DKMS Modules"
if [[ "${DRY_RUN}" -eq 1 ]]; then
info "[dry-run] dkms autoinstall --force"
return
fi
info "Forcing DKMS rebuild for all kernels..."
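dkms autoinstall --force 2>&1 | tee /tmp/dkms_rebuild.log || true
# The pipeline status is masked by '|| true', so confirm the result via dkms status.
if dkms status 2>/dev/null | grep -F "$(uname -r)" | grep -q installed; then
ok "Force rebuild complete. Check /tmp/dkms_rebuild.log for details."
else
fail "No installed DKMS module for $(uname -r) after rebuild. Check /tmp/dkms_rebuild.log."
fi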
}
# -------------------------------------------------------------------------------------
# HELP
# -------------------------------------------------------------------------------------
show_help() {
cat <<EOF
${BOLD}setup-gpu-pxe.sh v${SCRIPT_VERSION}${RESET}
Proxmox VE NVIDIA Driver Host Installer + LXC GPU Configurator + Kernel Guard
${BOLD}USAGE${RESET}
sudo $SCRIPT_NAME [OPTION]
${BOLD}INSTALL / VERIFY${RESET}
(none) Full install — drivers, services, nodes, THEN applies guard
--check-only System + GPU health (also reports guard status)
--force-rebuild Force DKMS module rebuild (after a kernel update)
--lxc-config [ID] Print LXC container config snippet (optional VMID)
${BOLD}KERNEL GUARD${RESET}
--preflight Simulate full-upgrade; BLOCK if an unapproved series is
incoming. Run this BEFORE 'apt full-upgrade'.
--apply-guard apt-mark hold the series selector + unapproved-series
packages, then pin the running kernel.
--guard-status Show approved series, apt holds, and boot pin.
--remove-guard Lift the apt holds (after approving a new series).
${BOLD}MAINTENANCE${RESET}
--purge Remove all NVIDIA components (does not touch holds)
--dry-run Preview any action without making changes
--help Show this help
${BOLD}APPROVED KERNEL SERIES${RESET} (edit APPROVED_KERNEL_SERIES near the top)
${APPROVED_KERNEL_SERIES[*]}
${BOLD}KERNEL UPDATE SOP (manual, production)${RESET}
1. apt update
2. sudo $SCRIPT_NAME --preflight <-- hard gate; stop if blocked
3. apt full-upgrade (full-upgrade, NOT upgrade)
4. dkms status confirm new kernel = installed
5. reboot
6. sudo $SCRIPT_NAME --check-only
7. sudo $SCRIPT_NAME --apply-guard re-pin the now-running kernel
8. start passthrough LXCs; nvidia-smi inside each
EOF
}
# -------------------------------------------------------------------------------------
# ARGUMENT PARSING
# -------------------------------------------------------------------------------------
parse_args() {
while [[ $# -gt 0 ]]; do
case "$1" in
--dry-run) DRY_RUN=1 ;;
--purge) PURGE=1 ;;
--force-rebuild) FORCE_REBUILD=1 ;;
--check-only) CHECK_ONLY=1 ;;
--preflight) PREFLIGHT_ONLY=1 ;;
--apply-guard) APPLY_GUARD_ONLY=1 ;;
--remove-guard) REMOVE_GUARD_ONLY=1 ;;
--guard-status) GUARD_STATUS_ONLY=1 ;;
--lxc-config)
LXC_CONFIG_ONLY=1
if [[ $# -gt 1 && "$2" =~ ^[0-9]+$ ]]; then
LXC_VMID="$2"; shift
fi
;;
--help|-h) show_help; exit 0 ;;
*)
error "Unknown argument: $1"
show_help
exit 1
;;
esac
shift
done
}
# -------------------------------------------------------------------------------------
# CLEANUP TRAP
# -------------------------------------------------------------------------------------
cleanup() {
local exit_code=$?
if [[ $exit_code -ne 0 && $exit_code -ne 2 ]]; then
warn "Script exited with code ${exit_code}."
warn "Partial installation may have occurred."
warn "Check the output above, then re-run or use --purge to reset."
fi
}
trap cleanup EXIT
# -------------------------------------------------------------------------------------
# MAIN
# -------------------------------------------------------------------------------------
main() {
parse_args "$@"
check_root "$@"
echo ""
echo -e "${BOLD}${CYAN}setup-gpu-pxe.sh v${SCRIPT_VERSION} — Proxmox VE NVIDIA Driver Installer + Kernel Guard${RESET}"
echo -e "Target: Proxmox VE 9.x / Debian 13 (Trixie)"
echo -e "Approved kernel series: ${APPROVED_KERNEL_SERIES[*]}"
[[ "${DRY_RUN}" -eq 1 ]] && echo -e "${YELLOW}*** DRY RUN MODE — no changes will be made ***${RESET}"
echo ""
# Short-circuit modes
if [[ "${LXC_CONFIG_ONLY}" -eq 1 ]]; then
print_lxc_config "${LXC_VMID}"; exit 0
fi
if [[ "${PREFLIGHT_ONLY}" -eq 1 ]]; then
do_preflight || exit $? # exit 2 on block
exit 0
fi
if [[ "${GUARD_STATUS_ONLY}" -eq 1 ]]; then
guard_status; exit 0
fi
if [[ "${APPLY_GUARD_ONLY}" -eq 1 ]]; then
apply_kernel_guard; exit 0
fi
if [[ "${REMOVE_GUARD_ONLY}" -eq 1 ]]; then
remove_kernel_guard; exit 0
fi
if [[ "${CHECK_ONLY}" -eq 1 ]]; then
do_check; exit 0
fi
if [[ "${PURGE}" -eq 1 ]]; then
do_purge; exit 0
fi
if [[ "${FORCE_REBUILD}" -eq 1 ]]; then
do_force_rebuild; exit 0
fi
# Full install sequence
check_prerequisites
detect_nvidia_gpu
check_secure_boot
blacklist_nouveau
ensure_debian_sources # non-free + trixie-backports
install_nvidia_backports # nvidia-kernel-dkms + nvidia-driver + persistenced
build_dkms_all_kernels # verify DKMS + self-heal running kernel
configure_module_autoload
setup_device_stability # Pillar 2 — boot ordering + persistence
load_and_verify_modules
run_nvidia_smi_check
apply_kernel_guard # Guard Layers 2 + 3
section "Installation Complete"
echo ""
ok "NVIDIA drivers installed (trixie-backports)"
ok "nvidia-setup-nodes.service enabled (runs before LXC at boot)"
ok "nvidia-persistenced installed and enabled"
ok "Kernel guard applied (apt holds + boot pin)"
echo ""
info "Note: a normal re-run VERIFIES the DKMS build and rebuilds only if the"
info "running kernel lacks a module. Before EVERY 'apt full-upgrade', run:"
info " sudo $SCRIPT_NAME --preflight"
echo ""
echo -e "${BOLD}NEXT STEPS:${RESET}"
echo " 1. Reboot the Proxmox host (if this was a fresh driver install)"
echo " 2. Verify: sudo $SCRIPT_NAME --check-only"
echo " 3. Get container config: sudo $SCRIPT_NAME --lxc-config <VMID>"
echo " 4. Apply config to /etc/pve/lxc/<VMID>.conf"
echo " 5. Restart containers: pct stop <VMID> && pct start <VMID>"
echo ""
print_lxc_config "YOUR_VMID"
}
main "$@"
This should work fine, but be advised that I have a dated card whose drivers are not supported on version 7 of the kernel.
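The Layer 1 preflight in the script leans on two helpers defined earlier in it (kernel_series_from_pkg and is_approved_series). As a rough standalone illustration of the idea only, here is a minimal sketch that can be run without touching apt. The names series_of, approved, unapproved_incoming and the APPROVED list are hypothetical stand-ins, not the script's own helpers. The sample "apt-get -s full-upgrade" text is hand-written with made-up package versions, not captured from a real host.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical stand-ins for the script's series helpers.
set -euo pipefail

APPROVED=(6.17)   # assumed approved list for this demo

# Extract "MAJOR.MINOR" from names like proxmox-kernel-6.17.13-2-pve-signed.
# Prints nothing for names that carry no version (e.g. proxmox-default-kernel).
series_of() {
    if [[ "$1" =~ ^proxmox-(kernel|headers)-([0-9]+\.[0-9]+) ]]; then
        printf '%s\n' "${BASH_REMATCH[2]}"
    fi
}

# Return 0 if the given series is in APPROVED, 1 otherwise.
approved() {
    local s
    for s in "${APPROVED[@]}"; do
        [[ "$s" == "$1" ]] && return 0
    done
    return 1
}

# Read simulated apt output on stdin and print every incoming
# kernel/header package whose series is not approved.
unapproved_incoming() {
    local pkg s
    awk '/^Inst /{print $2}' | while IFS= read -r pkg; do
        s="$(series_of "$pkg")"
        if [[ -n "$s" ]] && ! approved "$s"; then
            printf '%s (series %s)\n' "$pkg" "$s"
        fi
    done
}

# Canned simulation text (invented for the demo).
sample='Inst proxmox-kernel-6.17 [6.17.13-2] (6.17.13-13 Proxmox VE:pve-no-subscription [all])
Inst proxmox-kernel-7.0.2-1-pve-signed (7.0.2-1 Proxmox VE:pve-no-subscription [amd64])
Inst proxmox-default-kernel [2.0.1] (3.0.0 Proxmox VE:pve-no-subscription [all])
Conf proxmox-kernel-7.0.2-1-pve-signed (7.0.2-1 Proxmox VE:pve-no-subscription [amd64])'

blocked="$(printf '%s\n' "$sample" | unapproved_incoming)"
if [[ -n "$blocked" ]]; then
    echo "BLOCKED:"
    echo "$blocked"
else
    echo "OK: no unapproved series incoming"
fi
```

With the canned input, only the 7.0 package is reported. The 6.17 point update passes, and the selector meta-package is ignored by this sketch because its name carries no version (the real script flags it separately).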
#enoughsaid
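For the Layer 3 pin, get_pinned_kernel prints the line that follows "Pinned kernel:" in the proxmox-boot-tool kernel list output. The sketch below feeds the same awk program two hand-written samples to show what callers get back with and without a pin. The sample text is an assumption about the output shape, not a capture from a real host, and parse_pinned is a hypothetical wrapper used only for this demo.

```shell
#!/usr/bin/env bash
# Sketch only: exercises the awk program from get_pinned_kernel on canned text.
set -euo pipefail

# Same awk program as get_pinned_kernel, reading stdin instead of the tool.
parse_pinned() {
    awk '/^Pinned kernel:/{getline; gsub(/^[ \t]+/,""); print; exit}'
}

# Assumed output shape when a kernel is pinned (hand-written for the demo).
with_pin='Manually selected kernels:
None.

Automatically selected kernels:
6.17.13-13-pve
6.17.13-2-pve

Pinned kernel:
6.17.13-2-pve'

# Assumed output shape when nothing is pinned (no "Pinned kernel:" section).
without_pin='Manually selected kernels:
None.

Automatically selected kernels:
6.17.13-13-pve'

pinned="$(printf '%s\n' "$with_pin" | parse_pinned)"
none="$(printf '%s\n' "$without_pin" | parse_pinned)"

echo "pinned=[${pinned}] none=[${none}]"
```

An empty result means no pin was reported, which is why guard_status and do_check test for a non-empty value before trusting it.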