Ivan Micai

24 hours rescuing my TrueNAS: the case of the pool that panicked on boot


If you run a homelab, you know the feeling: some Tuesdays you wake up planning to tweak one little thing and end the day rediscovering ZFS internals, kernel logs, and the patience you thought you’d lost. This post is the story of 24 hours hunting two simultaneous bugs that kept knocking my TrueNAS over — one in the NVIDIA driver and one in ZFS metadata corruption.

Spoiler: it all worked out. I lost 5 media files along the way.


The setup

The server was running:

  • TrueNAS SCALE 26.04-MASTER (nightly — first mistake, I’ll admit it now)
  • Kernel 6.18.1 (also nightly)
  • 2× NVIDIA Blackwell GPUs (50-series, recent hardware)
  • NVIDIA open module driver 590.44.01 (nightly)
  • PSU with headroom, 62 GiB RAM

The pools (typical homelab layout, several disks):

Role              Topology           Size
Boot              NVMe single        ~500 GB
Media             raidz2 (4 HDDs)    ~29 TB
Apps + projects   raidz2 (4 SSDs)    ~3.6 TB — the problematic one
Misc pools        single-disk        several

On top of that, 30+ containers: the *arr stack, Plex, Pi-hole, Netdata, Portainer, Tailscale, a Coolify-style deploy platform, local LLM tooling, database managers, bots, game servers, observability. The server did everything — and did it well, until it stopped doing it.


The symptoms

I started noticing spontaneous reboots. No visible panic, just… it would drop off the network and come back minutes later.

After a heavier round of reboots, the server hit a boot loop — it’d come up, stay alive 13–14 minutes, drop, reboot. Over and over.

The irony: I started investigating thinking it was a new app I had just installed. Biggest red herring of the week.


Discovery 1: the NVIDIA driver is dying

First useful thing I found in journalctl:

kernel: NVRM: iovaspaceDestruct_IMPL: 1 left-over mappings in IOVAS 0x200
kernel: NVRM: GPU1 nvAssertFailedNoLog: Assertion failed: pIOVAS != NULL @ io_vaspace.c:592
kernel: BUG: unable to handle page fault for address: 0000000200000100

An NVIDIA driver assertion while destroying an IOVA space, immediately followed by a page fault and a reset. A classic bug of the 590.x open driver on a nightly 6.18 kernel with Blackwell hardware (50-series GPU, released in 2025). Nightly + new hardware = unstable.
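If you're chasing something similar, the previous boot's kernel log is where these lines show up (standard journalctl; -b -1 needs a persistent journal):

journalctl --list-boots                                  # boot IDs, newest last
sudo journalctl -k -b -1 | grep -Ei 'nvrm|assert|bug:'   # kernel messages from the previous boot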

I disabled the GPU workloads and the system stabilized for a few hours. “Fixed”. I went to sleep thinking I just had to avoid GPU until a stable release shipped.

That wasn’t all of it.


Discovery 2: the panic on boot

The next day, another reboot. And another. And another. Boot after boot, the system died exactly when ix-zfs.service started importing pools. Not even the panic trace survived — the kernel froze so fast journalctl couldn’t flush.

Workaround number one: edit GRUB manually on every boot and add:

systemd.mask=ix-zfs.service

Without ix-zfs, the TrueNAS middleware doesn’t try to import pools. The system boots, stays stable, and I get SSH.
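For anyone who hasn't done a one-shot GRUB edit: at the boot menu, press e on the default entry, append the parameter to the end of the line starting with linux, then boot with Ctrl+X. Schematically (the rest of the kernel line stays untouched):

linux <existing kernel line unchanged> systemd.mask=ix-zfs.service

It applies to that boot only, which is exactly why I needed something persistent.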

I tried persisting the parameter:

# First attempt: /etc/default/grub (doesn't exist on TrueNAS)
# Second attempt: /etc/default/grub.d/truenas.cfg
sudo sed -i 's|nvme_core.multipath=N|nvme_core.multipath=N systemd.mask=ix-zfs.service|' \
  /etc/default/grub.d/truenas.cfg
sudo update-grub

It worked — 10 menuentries in grub.cfg with the parameter. Until the TrueNAS middleware regenerated the file and wiped everything. 😤

Final fix: systemctl mask via symlink, which lives in a per-BE dataset (/etc is a separate dataset per Boot Environment) and therefore persists:

sudo systemctl mask ix-zfs.service
# Creates /etc/systemd/system/ix-zfs.service → /dev/null

The middleware doesn’t revert that one. Pool doesn’t import on boot, system comes up clean.
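Quick sanity check that the mask is really in place (and still there after the middleware does its thing):

systemctl is-enabled ix-zfs.service        # prints "masked"
ls -l /etc/systemd/system/ix-zfs.service   # symlink → /dev/null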


Discovery 3: which pool is broken?

With SSH stable, time to hunt down which pool was panicking the kernel. Strategy: import them one at a time, readonly, with no persistent cache:

sudo zpool import -o cachefile=none -o readonly=on -N POOL_A   # OK
sudo zpool import -o cachefile=none -o readonly=on -N POOL_B   # OK
sudo zpool import -o cachefile=none -o readonly=on -N POOL_C   # OK
# ... every pool imports readonly without issue

Critical test: import the apps pool RW:

sudo zpool import -o cachefile=none -N POOL_APPS

Immediate kernel panic. SSH dropped. Server reboot.

Found it. The pool had corrupted metadata that only got processed in read-write mode. Specifically, the log_spacemap feature was active:

POOL_APPS  feature@log_spacemap  active  local

This feature logs space-map changes in pending TXGs. If the pool crashed mid-commit (and it had — the server died several times), the log can become inconsistent. Readonly skips replay. RW tries to apply the log → broken metadata → panic.
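For reference, confirming the feature state doesn't require touching the pool read-write; a readonly import is enough (standard zpool commands, POOL_APPS being the placeholder name used throughout):

sudo zpool import -o cachefile=none -o readonly=on -N POOL_APPS
sudo zpool get feature@log_spacemap POOL_APPS   # the active/enabled state shown above
sudo zpool status -v POOL_APPS                  # health and errors, readonly, no log replay
sudo zpool export POOL_APPS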

I tried every recovery flag I knew:

echo 1 | sudo tee /sys/module/zfs/parameters/zfs_recover
echo 1 | sudo tee /sys/module/zfs/parameters/zil_replay_disable
echo 0 | sudo tee /sys/module/zfs/parameters/spa_load_verify_metadata
echo 0 | sudo tee /sys/module/zfs/parameters/spa_load_verify_data
sudo zpool import -o cachefile=none -N -FX POOL_APPS   # extreme rewind

Still panic. The corruption is fatal in any write path.

Decision: destroy the pool and recreate from scratch. But first, save the ~630 GB of data on it (app configs, docker layers, projects).


Discovery 4: the NVIDIA driver sabotages the backup

I mounted the apps pool readonly and kicked off rsync via systemd-run so it wouldn’t depend on the SSH session:

sudo systemd-run --unit=backup-job /root/backup.sh
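The script itself was nothing exotic. A minimal sketch of the shape it takes (paths are illustrative placeholders, not my exact datasets; the rsync flags are the usual preserve-everything set):

#!/bin/bash
# Sketch: copy everything off the readonly-mounted apps pool onto the healthy media pool.
# SRC/DST are placeholders -- point them at the readonly mountpoint and at a dataset
# with enough free space.
set -euo pipefail

SRC=/mnt/POOL_APPS
DST=/mnt/POOL_MEDIA/rescue-backup

mkdir -p "$DST"

# -aHAX preserves perms, owners, times, hardlinks, ACLs and xattrs;
# --partial lets an interrupted run resume instead of restarting files.
rsync -aHAX --partial --info=progress2 "$SRC/" "$DST/" 2>&1 | tee /root/backup-rsync.log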

First attempt: server fell over near the end of the backup. No trace. Silent kill.

Second attempt: died at a similar point. No trace.

Third attempt: same.

None of this had anything to do with RW import. Source pool was readonly, reads weren’t triggering the log_spacemap bug. So what was killing the kernel?

Usual suspect: the NVIDIA driver. Under IO pressure (rsyncing a million files), something in driver 590’s memory management was breaking.

Test: rmmod the module before the backup:

sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo systemctl reset-failed backup-job
sudo systemd-run --unit=backup-job /root/backup.sh

Backup completed with zero crashes. 🎉

Final confirmation: it was TWO independent bugs:

  1. Unstable NVIDIA driver (random crashes under load)
  2. Pool with corrupted metadata (specific panic on RW import)

The backup (in numbers)

Files copied:       ~2.9 million
Total size:         ~730 GB
Avg throughput:     ~400 MB/s on raidz2 destination
Files skipped:      5 (real I/O errors on bad sectors)

The 5 corrupted files were media in spots with localized bit rot — nothing critical, all re-downloadable. I accepted the loss.


Migrating to a stable release

Before destroying the pool, I decided to switch to a stable TrueNAS release so I wasn’t stuck on a nightly kernel:

  • Current nightly: kernel 6.18.1 + NVIDIA 590.44 open (buggy)
  • Release Candidate: kernel 6.12.33 LTS + NVIDIA 570.172.08 open (release, supports Blackwell)

I already had the RC installed as an alternate boot environment. First attempt: set bootfs and reboot.

Boot loop.

Obvious in hindsight: the BE had its own /etc, without ix-zfs masked. So it tried to import the pools and panicked exactly the same way (the pool corruption is on disk, not in the kernel).

Fix: mount the target BE’s /etc and apply the protections there before booting:

sudo zfs set mountpoint=legacy boot-pool/ROOT/<BE>/etc
sudo mkdir -p /mnt/be-tmp/etc
sudo mount -t zfs boot-pool/ROOT/<BE>/etc /mnt/be-tmp/etc

sudo ln -sf /dev/null /mnt/be-tmp/etc/systemd/system/ix-zfs.service
sudo tee /mnt/be-tmp/etc/modprobe.d/blacklist-nvidia.conf <<EOF
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
blacklist nvidia_uvm
EOF

sudo umount /mnt/be-tmp/etc
sudo zfs set mountpoint=/etc boot-pool/ROOT/<BE>/etc

sudo zpool set bootfs=boot-pool/ROOT/<BE> boot-pool
sudo update-grub
sudo systemctl reboot

Clean boot. Kernel 6.12 LTS running.


Destroy and recreate

With the new kernel, I repeated the RW test. Panic again. Confirmed: the problem is the pool, not the kernel.

Surgery time:

# Wipe ZFS labels from the disks
for uuid in <uuid_1> <uuid_2> <uuid_3> <uuid_4>; do
  sudo zpool labelclear -f /dev/disk/by-partuuid/$uuid
done

# Recreate the pool from scratch, same layout (raidz2 with 4 disks)
sudo zpool create -f \
  -o ashift=12 -o cachefile=none \
  -O compression=lz4 -O atime=off \
  -R /mnt/new-pool POOL_APPS raidz2 \
  /dev/disk/by-partuuid/<uuid_1> /dev/disk/by-partuuid/<uuid_2> \
  /dev/disk/by-partuuid/<uuid_3> /dev/disk/by-partuuid/<uuid_4>

# Recreate the TrueNAS Apps datasets with the props the middleware expects
sudo zfs create -o canmount=noauto -o mountpoint=/.ix-apps POOL_APPS/ix-apps
for ds in app_configs app_mounts docker truenas_catalog; do
  sudo zfs create -o canmount=noauto -o mountpoint=/.ix-apps/$ds POOL_APPS/ix-apps/$ds
done

Fresh pool, zero corrupted labels, no suspicious log_spacemap.
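Quick check before putting data back on it:

sudo zpool status -v POOL_APPS                           # healthy vdevs, zero errors
sudo zfs list -r -o name,mountpoint,canmount POOL_APPS   # the .ix-apps dataset layout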


Restore

sudo systemd-run --unit=restore-job /root/restore.sh

The script just rsyncs the backup back onto the new pool. ~20 hours. Seriously.
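Since it runs as a transient unit, checking on it doesn't depend on the SSH session either:

systemctl status restore-job.service        # unit state, PID, elapsed time
sudo journalctl -u restore-job.service -f   # follow the script's output live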

Why so slow? The backup was fast (~30 min) because it was writing sequentially to a raidz2 of HDDs. The restore reads millions of small files back off those same HDDs, and mechanical disks suffer on seek-heavy random reads. Even with fast SSDs as the destination, the bottleneck was the source:

Rule: the slow side wins. And HDDs reading a million tiny files dominate.
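Back-of-envelope with the numbers above: ~2.9 million files in ~20 hours is roughly 40 files per second, and ~730 GB over that window averages about 10 MB/s. That's seek-bound territory, nowhere near what the disks can do sequentially.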

During the restore, the system was rock-solid (LTS kernel + NVIDIA unloaded).


The return

After the restore, I imported the pools through the TrueNAS UI (Storage → Import Pool). The middleware registers the pools in its DB and they show up normally again.

Surprise: “Failed to start docker — Missing ix-apps/ dataset(s)”.

The datasets existed and had data. But they were canmount=noauto and not mounted. The middleware didn’t know about them.
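Easy to confirm once you know where to look:

sudo zfs get -r canmount,mounted,mountpoint POOL_APPS/ix-apps
# canmount=noauto, mounted=no → nothing mounted them at import time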

for ds in ix-apps ix-apps/app_configs ix-apps/app_mounts ix-apps/docker ix-apps/truenas_catalog; do
  sudo zfs mount POOL_APPS/$ds
done

sudo midclt call docker.state.start_service

Docker RUNNING. Apps started deploying.

Last step: unmask ix-zfs (so pools auto-import again) and remove the NVIDIA blacklist to test whether the 570 driver on kernel 6.12 LTS is stable:

sudo systemctl unmask ix-zfs.service
sudo rm /etc/modprobe.d/blacklist-nvidia.conf
sudo systemctl reboot

Clean boot. NVIDIA loads. nvidia-smi detects both GPUs, idle, normal temps. All pools auto-imported via ix-zfs.service (now unmasked). TrueNAS Apps + compose stacks coming up.

Load 20+ in the first minutes (30 containers starting simultaneously hammer the pool), but it stabilized at load 2 afterwards. The TrueNAS dashboard was sluggish during startup because of middleware timeouts on the saturated Docker API — normal, it passes.


Final state

Item            Before                    After
TrueNAS         MASTER nightly            25.10-RC (release candidate)
Kernel          6.18.1 (nightly)          6.12.33 (LTS)
NVIDIA          590.44 open (nightly)     570.172.08 open (release)
Apps pool       corrupted, panic on RW    fresh, clean raidz2
Apps            crash loop                all RUNNING
Stable uptime   ~14 min                   days 🎉

Lessons learned

1. Nightly in production is Russian roulette

Kernel 6.18.1 + driver 590 open was too bleeding-edge to run as production. New hardware (a GPU released in 2025) + nightly driver + nightly kernel piled up enough bugs to crash under load. LTS + release driver fixed it without touching anything else.

2. log_spacemap can become unrecoverable

If the pool crashed mid-TXG with feature@log_spacemap=active and the corruption is in a persisted structure, RW mount panics the kernel. zfs_recover, zil_replay_disable, spa_load_verify_*=0, -FX — none of it helps. Readonly saves the data, but the pool has to be destroyed and rebuilt.

3. RAID does not protect against software bugs

My pool was raidz2 (survives losing 2 disks). That helped exactly zero against a metadata bug. A backup on another pool was what saved me. Having free space on another pool to dump the entire failing one was pure luck that became the critical move.

4. TrueNAS middleware rewrites /etc/default/grub.d/*

Any manual edit to those files gets overwritten eventually. For a reliably persistent kernel parameter:

  • systemctl mask via symlink in /etc/systemd/system/* → persists (per-BE dataset)
  • Through the TrueNAS UI → persists
  • /etc/default/grub.d/* → does not persist

5. Docker + TrueNAS Apps at boot = IO storm

30+ containers coming up simultaneously generate load 20+, absurd I/O wait, middleware Docker API timing out. Sluggish dashboard for 5–10 min. This is not a crash, it’s just reality. Be patient.

6. Patience is a finite resource — conserve it

Every “let me try one more RW import” cost a manual reboot. Early on I was trying lots of combinations. By the end, I confirmed the diagnosis once and went to execute the plan.

7. midclt is your friend

The TrueNAS middleware CLI (midclt call ...) handles a lot of things the UI won’t let you do. midclt call docker.config, app.query, app.start, docker.state.start_service were lifesavers during recovery.
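A few concrete invocations, for reference (midclt returns JSON; the jq filters are just how I read it, field names may vary between versions):

sudo midclt call docker.config | jq .           # current Apps/Docker config, including the pool
sudo midclt call app.query | jq '.[].name'      # installed apps
sudo midclt call docker.state.start_service     # start Docker when the UI refuses to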


Wrap-up

In the end, 24 hours of work to learn: (a) how ZFS handles corrupted log_spacemap, (b) how TrueNAS manages Boot Environments and middleware state, (c) that nightly builds aren’t meant to run dozens of production containers, (d) that some losses are part of the cost of the lesson.

If you’re starting a homelab, here’s the take: release > RC > nightly, in that order. For TrueNAS SCALE, use the release version of the current line. Nightly is for people who want to test, file bugs, and don’t mind losing a weekend.

And keep free space on another pool. Always.

If you want to swap stories about TrueNAS, ZFS, or homelab in general, hit me up. 🦥