TrueNAS pool health before replacing a disk

ZFS pool degradation is recoverable when handled in the right order — backup first, evidence second, replacement third. Operators who skip the first two and click Replace can convert a single failing disk into a multi-disk failure during rebuild, especially on wide vdevs. This page is the pre-replacement checklist that the Replace UI doesn't enforce.

Best for: TrueNAS Scale operators seeing a DEGRADED or FAULTED pool, a SMART warning, or a disk that's intermittently dropping from the pool, and who have not yet clicked Replace.

TrueNAS Scale Storage Dashboard

TrueNAS Scale Storage Dashboard for pool 'tank' showing Topology (1x MIRROR 2-wide 20 GiB), Usage gauge (0% used), ZFS Health Online, Total ZFS Errors 0, Scheduled Scrub Task Set, Auto TRIM Off, Pool Status Online, Disk Health Online, Failed S.M.A.R.T. Tests 0.
Real screenshot from TrueNAS Scale 24.10.2 captured 2026-05-18. The Storage Dashboard is the first place to look before any disk replacement — Pool Status must be Online (not Degraded or Faulted), Total ZFS Errors 0, and Failed S.M.A.R.T. Tests 0. The Scrub button at the top right of the ZFS Health panel triggers an on-demand scrub.

Read the pool state before touching anything

  • Storage > pool overview shows pool state: ONLINE (healthy), DEGRADED (a vdev has a missing/failed disk but is still serving), FAULTED (vdev can't serve — data inaccessible).
  • If state is FAULTED, do NOT attempt a Replace without restoring connectivity to the missing disk first; the rebuild from parity alone may not be possible.
  • If state is DEGRADED, the pool is still serving but at reduced redundancy. On RAIDZ1 or a 2-way mirror, a second disk failure in the same vdev during the rebuild means data loss, and the sustained reads of a resilver are exactly when a marginal disk tends to fail.
  • Storage > pool > Disks shows per-disk status. Note the failing disk's serial number (right-click row > Show Details — the serial is the identifier you need; slot numbers can be misleading).
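
The state check above can also be scripted against `zpool status` output from the TrueNAS shell. The following is a minimal sketch, assuming the ` state: ` line that OpenZFS prints for a pool; the function names and the sample text are illustrative and not part of TrueNAS.

```python
import re

# Hypothetical sample of `zpool status tank` output; real output will differ.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     UNAVAIL      0     0     0
"""


def pool_state(status_text: str) -> str:
    """Return the pool-level state from `zpool status` text, or 'UNKNOWN'."""
    match = re.search(r"^\s*state:\s*(\S+)", status_text, flags=re.MULTILINE)
    return match.group(1) if match else "UNKNOWN"


def next_step(state: str) -> str:
    """Map a pool state to the guidance in the list above."""
    if state == "ONLINE":
        return "healthy: pool state alone gives no reason to replace"
    if state == "DEGRADED":
        return "verify backup and capture evidence before Replace"
    if state == "FAULTED":
        return "restore connectivity first; do not click Replace"
    return "unrecognized state: inspect manually"


state = pool_state(SAMPLE_STATUS)
print(state, "->", next_step(state))
```

The web UI remains the reference view; a script like this is only useful for logging the state alongside the other evidence.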

Capture evidence before replacing

  • Run a SMART long test on the failing disk: Storage > Disks > select disk > Manual Test > Long. Wait for completion (hours). Storage > Reports > SMART shows the result.
  • Note SMART attribute deltas: Reallocated_Sector_Ct, Current_Pending_Sector, Reported_Uncorrect, UDMA_CRC_Error_Count. Increases over time in the first three indicate real disk failure. A rising UDMA_CRC_Error_Count with the others flat points at the cable or controller instead, and a power issue often shows no SMART progression at all.
  • Check `dmesg` and system logs for ZFS / kernel errors: Reports > System Log shows recent kernel events. ZFS write errors with the failing disk's serial are direct evidence.
  • Take a screenshot or export of the pool state and SMART data BEFORE replacing — this is your evidence trail if the rebuild has issues.
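
Comparing two SMART snapshots is simple arithmetic, and writing it down makes the "deltas over time" idea concrete. This is a minimal sketch with made-up raw values; the attribute grouping is an assumption of the sketch (UDMA_CRC_Error_Count is treated as a cable/controller signal, as in the evidence table on this page), and the function names are hypothetical.

```python
# Attributes that point at the disk itself vs. the link to it.
# Assumption: UDMA_CRC_Error_Count is read as a cable/controller signal.
DISK_ATTRS = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Reported_Uncorrect")
LINK_ATTRS = ("UDMA_CRC_Error_Count",)

# Hypothetical raw values recorded a week apart for the same serial.
BEFORE = {"Reallocated_Sector_Ct": 8, "Current_Pending_Sector": 0,
          "Reported_Uncorrect": 0, "UDMA_CRC_Error_Count": 3}
AFTER = {"Reallocated_Sector_Ct": 112, "Current_Pending_Sector": 4,
         "Reported_Uncorrect": 0, "UDMA_CRC_Error_Count": 3}


def smart_deltas(before: dict, after: dict) -> dict:
    """Raw-value change per tracked attribute between two snapshots."""
    return {name: after.get(name, 0) - before.get(name, 0)
            for name in DISK_ATTRS + LINK_ATTRS}


def likely_layer(deltas: dict) -> str:
    """Rough classification: disk-level counters win over link-level ones."""
    if any(deltas[name] > 0 for name in DISK_ATTRS):
        return "disk"
    if any(deltas[name] > 0 for name in LINK_ATTRS):
        return "cable/controller"
    return "no SMART progression"


deltas = smart_deltas(BEFORE, AFTER)
print(deltas, "->", likely_layer(deltas))
```

The classification is deliberately coarse; it is a prompt for which layer to inspect next, not a verdict on the drive.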

Verify a real backup exists

  • Cloud Sync or Replication task for the pool: Data Protection > Cloud Sync Tasks / Replication Tasks > confirm last success was recent (within 24 hours typically) and covers the irreplaceable datasets.
  • If no recent backup exists and the pool is RAIDZ1 or a 2-way mirror (single redundancy), BACKUP FIRST before attempting Replace. The rebuild stress can surface latent errors on other disks; a second failure during rebuild = data loss.
  • Trigger an emergency Cloud Sync run if upstream bandwidth allows: Data Protection > Cloud Sync Tasks > task > play icon. Even a partial 'just the irreplaceable subset' backup is better than none.
  • For RAIDZ2 (dual parity) or 3-way mirror configurations, the second-failure risk is lower but not zero; backup is still the right call before the rebuild.
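
The "recent" test in the first bullet reduces to a timestamp comparison. A minimal sketch, assuming you have copied the task's last-success time from the Data Protection screen by hand; the 24-hour default mirrors the typical window mentioned above, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_recent(last_success: Optional[datetime],
                     now: datetime,
                     max_age: timedelta = timedelta(hours=24)) -> bool:
    """True if the last successful run is within max_age of now.

    Both datetimes must be timezone-aware. A task that has never
    succeeded (None) counts as not recent.
    """
    if last_success is None:
        return False
    age = now - last_success
    return timedelta(0) <= age <= max_age


# Hypothetical timestamps for illustration.
now = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
last_run = datetime(2026, 5, 18, 1, 30, tzinfo=timezone.utc)
print(backup_is_recent(last_run, now))
```

Recency says nothing about coverage: the task still has to include the irreplaceable datasets.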

Performing the replace

  • Buy the replacement disk: same size or larger than the failing disk. Different brand is fine (and recommended — same-batch disks fail at similar times).
  • Choose a power-down swap, or a hot-swap if the chassis supports it. Either way, Offline the failing disk first: Storage > pool > pool actions > Offline. For hot-swap, do this before pulling the disk.
  • Physically replace the disk in the same slot. Power up.
  • Storage > pool > Disks > pull-down 'Replace' on the now-empty slot > select the new disk. ZFS starts resilvering immediately.
  • Watch resilver progress: Storage > pool > Resilver shows percentage and ETA. Don't interrupt; don't reboot mid-resilver. Wide RAIDZ pools can take 24+ hours.
  • After resilver completes, run a scrub: Storage > Scrub Tasks > run manually. The scrub catches any errors the resilver couldn't repair.
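
Resilver progress can also be read from the `scan:` section of `zpool status` in the shell. The sketch below pulls out the percentage; the sample text and the "% done" wording are assumptions about the output format, which varies between OpenZFS versions, and the function name is hypothetical.

```python
import re
from typing import Optional

# Hypothetical excerpt of `zpool status` during a resilver; the exact
# layout differs between OpenZFS versions.
SAMPLE_SCAN = """\
  scan: resilver in progress since Mon May 18 09:00:00 2026
        1.20T scanned at 1.10G/s, 600G issued at 550M/s, 2.40T total
        150G resilvered, 24.41% done, 00:55:12 to go
"""


def resilver_percent(status_text: str) -> Optional[float]:
    """Return the resilver percentage, or None if no resilver is running."""
    if "resilver in progress" not in status_text:
        return None
    match = re.search(r"(\d+(?:\.\d+)?)% done", status_text)
    return float(match.group(1)) if match else None


print(resilver_percent(SAMPLE_SCAN))
```

Polling this every few minutes is harmless; it only reads status and does not touch the resilver itself.
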
Operator snapshot

Evidence first
First proof

Pool state.

Screen to open

Storage > Disks > the affected disk row > Show Details > Serial Number

Expected signal

Storage > pool overview shows ONLINE / DEGRADED / FAULTED.

Stop boundary

Stop and reassess if resilver hits new errors on a DIFFERENT disk — pool may be in worse shape than expected; restore-from-backup may be the safer path.

Layer path

1. ZFS pool degradation is recoverable when handled in the right order: backup first, evidence second, replacement third. Skipping the first two can convert a single failing disk into a multi-disk failure during rebuild.
2. Pool state (ONLINE / DEGRADED / FAULTED) determines what's possible. DEGRADED = serving with reduced redundancy; FAULTED = vdev can't serve, data may be inaccessible.
3. Rebuild stress is real: resilvering a multi-TB drive in a wide RAIDZ pool reads from every other disk in the vdev for 24+ hours while writing to the new one. Latent errors on other disks can surface as second failures.
4. SMART evidence over time is the difference between 'real disk failure' and 'transient cable/power issue'. Replacing on a single-event basis without evidence wastes drives.
Runbook

Step-by-step runbook

Start here. Do each check in order, compare it to the expected result, and stop when the evidence explains the failure or the safe stop point applies.

1

Capture pool state and disk evidence

Check: Storage > pool > details screenshot; record failing disk serial; run SMART long test; review SMART deltas; check `dmesg` for ZFS errors.

Expected result: Pool state, failing disk identity, and SMART evidence are all captured.

If not: Without this evidence, you can't distinguish disk failure from cable/controller issues; without that, you may replace the wrong thing.

2

Verify a recent backup exists for irreplaceable data

Check: Data Protection > Cloud Sync / Cloud Backup / Replication Tasks > last success within acceptable window for the datasets that matter.

Expected result: Backup verified current.

If not: If there is no recent backup and the pool is RAIDZ1 or a 2-way mirror, BACKUP FIRST. Rebuild stress + a second failure = total loss.

3

Acquire replacement disk

Check: Same-capacity or larger; ideally different brand/batch than the failed disk.

Expected result: Replacement disk in hand and tested briefly outside the pool.

If not: Same-batch disks fail at similar times — different brand reduces correlated-failure risk.

4

Offline the failing disk via UI

Check: the "Offline a disk before physical pull" command below.

Expected result: Disk shows OFFLINE in pool state.

If not: Do not pull the disk yet. Offlining first avoids ZFS thrashing during the physical pull.

5

Physically replace the disk

Check: Power down (or hot-swap if chassis supports it). Match the failing disk's serial against the slot. Insert new disk. Power back up.

Expected result: New disk visible in Storage > Disks; old disk's serial no longer present.

If not: Verify by serial, not slot, especially in larger chassis.

6

Replace in UI; let resilver complete

Check: Storage > pool > Disks > new disk's slot > Replace > select the new disk identifier > start. Monitor resilver progress via Storage > pool > Resilver.

Expected result: Resilver runs to completion; no new errors.

If not: Don't reboot during resilver; don't interrupt. Wide RAIDZ pools can take 24+ hours.

Safe stop: Stop and reassess if resilver hits new errors on a DIFFERENT disk — pool may be in worse shape than expected; restore-from-backup may be the safer path.

7

Post-resilver scrub

Check: Storage > Scrub Tasks > Run Now.

Expected result: Scrub completes clean.

If not: Resilver alone doesn't catch every error; scrub does.
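
The serial check in step 5 is a set comparison: exactly one new serial should appear and the failed one should be gone. A minimal sketch, assuming you have recorded the serial list from Storage > Disks before and after the swap; the serials and the function name are hypothetical.

```python
from typing import Iterable, List


def verify_swap(before: Iterable[str], after: Iterable[str],
                failed_serial: str) -> List[str]:
    """Return a list of problems; an empty list means the swap looks right."""
    before_set, after_set = set(before), set(after)
    problems = []
    if failed_serial not in before_set:
        problems.append("failed serial was not in the original list")
    if failed_serial in after_set:
        problems.append("failed disk is still present")
    removed = before_set - after_set
    if removed - {failed_serial}:
        problems.append("a healthy disk went missing: " + ", ".join(sorted(removed - {failed_serial})))
    added = after_set - before_set
    if len(added) != 1:
        problems.append(f"expected exactly 1 new disk, found {len(added)}")
    return problems


# Hypothetical serials for illustration.
before = ["WD-AAA111", "WD-BBB222", "ST-CCC333"]
after = ["WD-AAA111", "ST-CCC333", "TO-DDD444"]
print(verify_swap(before, after, failed_serial="WD-BBB222"))
```

A non-empty result is a reason to power down and re-check the bays before starting the Replace in step 6.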

Decision tree

If: Pool state FAULTED.

Then: Replace cannot proceed — vdev is dead.

Action: Restore connectivity to the missing disk first (check cabling, power, controller). If genuinely lost, restore from backup is the path, not Replace.

Safe stop: Stop before clicking Replace on a FAULTED pool — operation may not be possible and obscures the actual recovery path.

If: Pool DEGRADED, recent backup exists, replacement disk available.

Then: Proceed with replacement following the documented sequence.

Action: Offline failing disk via UI → physical replacement → Replace in UI → resilver → post-resilver scrub.

If: Pool DEGRADED, no recent backup.

Then: Backup first; the rebuild stress is the risk window.

Action: Trigger emergency Cloud Sync / Cloud Backup or one-off Replication BEFORE replace. Once at least the irreplaceable datasets have a current copy off-pool, proceed with Replace.

Safe stop: Stop before replacing on RAIDZ1 with no recent backup — second failure during rebuild = total loss.

If: Failing disk is intermittent (drops + recovers).

Then: Could be disk, cable, or power. Run evidence before assuming disk.

Action: Re-seat cable + power; run SMART long test; check `dmesg` for controller errors. If symptoms persist after re-seating, treat as disk failure.

If: Wide RAIDZ1 (8+ drives), single disk failed.

Then: Second-failure-during-rebuild risk is materially elevated.

Action: Backup is non-negotiable. Consider after rebuild whether to migrate to narrower RAIDZ vdevs or RAIDZ2 — OpenZFS guidance is RAIDZ1 ≤ 6 drives.

Safe stop: Stop before treating this as a routine swap; the configuration is fragile.
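
The branches above can be written as one function, which makes the precedence between them explicit. This is a sketch only: the ordering of the checks (FAULTED first, then intermittent, then missing backup) is an assumption of the sketch, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PoolSituation:
    state: str                      # "ONLINE", "DEGRADED", or "FAULTED"
    recent_backup: bool
    replacement_on_hand: bool = True
    intermittent: bool = False      # disk drops and recovers
    layout: str = "RAIDZ2"          # e.g. "RAIDZ1", "RAIDZ2", "MIRROR"
    vdev_width: int = 0             # number of disks in the affected vdev


def decide(s: PoolSituation) -> str:
    """Return the action from the decision tree that applies first."""
    if s.state == "FAULTED":
        return "stop: restore connectivity; if the disk is lost, restore from backup"
    if s.intermittent:
        return "re-seat cable and power, run SMART long test, check dmesg"
    if s.state == "DEGRADED" and not s.recent_backup:
        return "stop: back up the irreplaceable datasets before Replace"
    if s.state == "DEGRADED" and not s.replacement_on_hand:
        return "acquire a replacement disk"
    if s.state == "DEGRADED":
        plan = "offline, swap, Replace, resilver, scrub"
        if s.layout == "RAIDZ1" and s.vdev_width >= 8:
            plan += "; fragile layout, plan a migration afterwards"
        return plan
    return "no replacement indicated by pool state"


print(decide(PoolSituation(state="DEGRADED", recent_backup=False, layout="RAIDZ1", vdev_width=8)))
```

Putting the backup check ahead of the wide-RAIDZ1 note reflects the page's rule that backup is non-negotiable in that configuration.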

Evidence

Evidence table

Symptom: Pool DEGRADED with one disk UNAVAIL.
Evidence to collect: Storage > pool > details.
Likely layer: Disk failure or connectivity loss.
Next action: Identify disk serial; SMART test; check cable/power; if SMART confirms failure, proceed to replace with backup verified.

Symptom: SMART Reallocated_Sector_Ct increased by hundreds since last check.
Evidence to collect: Reports > SMART > attribute history.
Likely layer: Real disk failure progression.
Next action: Plan replacement; ensure backup current; run replace sequence.

Symptom: ZFS write errors in `dmesg` for a specific disk serial.
Evidence to collect: Reports > System Log filtered for the disk.
Likely layer: Real disk-level errors during writes.
Next action: Confirm with SMART long test; if SMART also progresses, replace; if SMART is clean, suspect cable/controller.

Symptom: Multiple disks with UDMA_CRC_Error_Count > 0.
Evidence to collect: Reports > SMART > UDMA_CRC across all disks.
Likely layer: Cable or controller issue, not necessarily disk failure.
Next action: Re-seat all SATA cables; consider replacing the SATA controller / HBA before replacing drives.

Reference

Commands and settings paths

Identify failing disk by serial

Storage > Disks > the affected disk row > Show Details > Serial Number

Where: In the TrueNAS web UI.

Expected: Serial captured.

Failure means: Slot-number-only identification is unreliable across reboots.

Safe next step: Match physical replacement against this serial when swapping.

Run SMART long test

Storage > Disks > disk > Manual Test > Long > Confirm

Where: In the TrueNAS web UI.

Expected: Test runs (hours); result appears under Reports > SMART after completion.

Failure means: Test interruption (reboot, drive pull) means re-run from start.

Safe next step: Wait for completion; review SMART attributes.

Offline a disk before physical pull

Storage > pool > Disks > failing disk row > Offline

Where: In the TrueNAS web UI.

Expected: Disk shows OFFLINE in pool state; ZFS stops sending writes to it.

Failure means: Pulling a still-online disk can cause ZFS thrashing and unnecessary write errors.

Safe next step: Then physically replace; then Storage > Replace in UI on the empty slot.

Post-resilver scrub

Storage > Scrub Tasks > Run Now (or Data Protection > Scrub Tasks)

Where: In the TrueNAS web UI.

Expected: Scrub completes with 0 errors.

Failure means: Resilver can leave hidden errors that scrub will find and repair.

Safe next step: Don't consider replacement complete until post-resilver scrub passes clean.

Hardware boundary

Hardware and platform boundary

Change only when

  • Migrate from wide RAIDZ1 to narrower vdevs or RAIDZ2 after surviving a rebuild on a wide RAIDZ1. The second-failure risk exposed by that rebuild is the warning signal, even if the pool reads ONLINE again afterwards.

Evidence that matters

  • Pool health monitoring (regular scrubs, SMART history captured), backup recency, and rebuild discipline matter most.

Evidence that does not matter

  • Newer/faster drives don't change rebuild stress; pool architecture (vdev width, parity level) is what determines second-failure risk.

Avoid

  • Avoid clicking Replace on FAULTED pools, replacing on RAIDZ1 without a recent backup, pulling disks without Offlining them first, or skipping the post-resilver scrub.

Last reviewed

2026-05-18 · Reviewed by HomeTechOps. Reviewed against TrueNAS Scale's storage pool management documentation, OpenZFS guidance on RAIDZ width and resilver behavior, and the conservative backup-before-rebuild rule from NIST's data-protection framing.

Source-backed checks

HomeTechOps turns official docs and conservative safety rules into a shorter runbook. These links are the source trail for the page direction.