Btrfs RAID1 disk replacement requires manual rebalance


TL;DR: You must run `btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft` after replacing a disk in a btrfs RAID1, or data loss can occur!

The btrfs replace man page and many guides on the internet don’t mention that running btrfs replace start alone is not enough to replace a missing disk. Stopping there can lead to data loss!

Thanks to Julian Brost for providing feedback. Also thanks to https://wiki.tnonline.net/w/Btrfs/Replacing_a_disk#Restoring_redundancy_after_a_replaced_disk for mentioning balance and giving some extra information.

The issue is tracked as btrfs-progs issue 1077 and is still unfixed.

All tests were run on Debian Trixie with Linux kernel 6.12.63 and btrfs-progs 6.14-1.

The following script reproduces the issue. It uses loop devices for quick tests, but the issue was also reproduced on real hardware. The script creates a btrfs RAID1 on two disks, destroys the second disk and replaces it with a new one, and then destroys the first disk. The replaced second disk should then hold all data (but doesn’t).

set -eu

cd /tmp
mkdir -p mnt

# Create two 10GiB disks
rm -f disk0; truncate -s 10G disk0
rm -f disk1; truncate -s 10G disk1
# Btrfs needs to see both devices when mounting
losetup /dev/loop0 disk0
losetup /dev/loop1 disk1

# Initialize btrfs RAID1 and create a file with random data
mkfs.btrfs --data raid1 --metadata raid1 /dev/loop0 /dev/loop1
mount /dev/loop0 mnt
dd if=/dev/urandom bs=1G count=1 > mnt/data
sha512sum mnt/data > mnt/data.sha512sum
umount mnt

# Destroy data on second disk
rm -f disk1; truncate -s 10G disk1
losetup -d /dev/loop1; losetup /dev/loop1 disk1
# Not necessary, but just to make clear it's not a cache issue
echo 3 > /proc/sys/vm/drop_caches

# Replace second disk (-B waits until replace is complete)
mount -o degraded /dev/loop0 mnt
btrfs replace start -B 2 /dev/loop1 mnt
# btrfs filesystem usage -T mnt
# btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft mnt
# echo write > mnt/test
umount mnt

# Destroy data on first disk
rm -f disk0; truncate -s 10G disk0
losetup -d /dev/loop0; losetup /dev/loop0 disk0
# Not necessary
echo 3 > /proc/sys/vm/drop_caches

# Attempt to mount
mount -o degraded /dev/loop1 mnt
sha512sum -c mnt/data.sha512sum

# Cleanup
umount mnt
losetup -d /dev/loop0; rm disk0
losetup -d /dev/loop1; rm disk1

Running it yields the following error at the final mount, after both disks have been replaced:

mount: /tmp/mnt: can't read superblock on /dev/loop1.
       dmesg(1) may have more information after failed mount system call.

The dmesg contains:

BTRFS info (device loop1): first mount of filesystem 5f1f3583-8c44-4671-9831-02bd8ff743e1
BTRFS info (device loop1): using crc32c (crc32c-intel) checksum algorithm
BTRFS warning (device loop1): devid 1 uuid 2b43640d-819f-42d2-9559-e8e284b84be1 is missing
BTRFS error (device loop1): failed to read chunk root
BTRFS error (device loop1): open_ctree failed: -5

The issue becomes clear when running btrfs filesystem usage -T after the replace:

Overall:
    Device size:                  20.00GiB
    Device allocated:              3.80GiB
    Device unallocated:           16.20GiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                          2.00GiB
    Free (estimated):             11.47GiB      (min: 8.77GiB)
    Free (statfs, df):             7.96GiB
    Data ratio:                       1.50
    Metadata ratio:                   1.50
    Global reserve:                5.50MiB      (used: 0.00B)
    Multiple profiles:                 yes      (data, metadata, system)

              Data    Data    Metadata  Metadata  System   System
Id Path       single  RAID1   single    RAID1     single   RAID1   Unallocated Total    Slack
-- ---------- ------- ------- --------- --------- -------- ------- ----------- -------- -----
 1 /dev/loop0 1.00GiB 1.00GiB 256.00MiB 256.00MiB 32.00MiB 8.00MiB     7.46GiB 10.00GiB     -
 2 /dev/loop1       - 1.00GiB         - 256.00MiB        - 8.00MiB     8.74GiB 10.00GiB     -
-- ---------- ------- ------- --------- --------- -------- ------- ----------- -------- -----
   Total      1.00GiB 1.00GiB 256.00MiB 256.00MiB 32.00MiB 8.00MiB    16.20GiB 20.00GiB 0.00B
   Used         0.00B 1.00GiB     0.00B   1.14MiB 16.00KiB   0.00B

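Not all data was replicated to the new disk loop1! The single-profile chunks exist only on loop0, and the system data in use (16.00KiB) sits in the single System chunk. That matches the "failed to read chunk root" error once loop0 is gone.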
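This state can be spotted without reading the whole table by looking at the "Multiple profiles" line of the usage output. The following is a minimal sketch and not part of the test script: the function name has_multiple_profiles is made up for illustration, and it assumes the output format shown above (btrfs-progs 6.14), which may differ in other versions.

```shell
# Hypothetical helper: reads `btrfs filesystem usage` output on stdin and
# succeeds (exit status 0) only if the "Multiple profiles:" line says "yes".
# Assumes the output format shown above; other versions may print it differently.
has_multiple_profiles() {
    awk '/^[ \t]*Multiple profiles:/ { found = ($3 == "yes") }
         END { exit(found ? 0 : 1) }'
}
```

It could be used as `btrfs filesystem usage -T mnt | has_multiple_profiles && echo "redundancy not restored"`. A missing "Multiple profiles" line is treated as "no", so this is a convenience check and not a guarantee.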

Running btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft (disabled in the script above) fixes this and replicates all necessary data. With soft, chunks that already have the target profile are skipped, so only the chunks that need converting are rewritten (without it, the convert rewrites all data on disk). Running usage afterwards yields:

Overall:
    Device size:                  20.00GiB
    Device allocated:              5.02GiB
    Device unallocated:           14.98GiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                          2.00GiB
    Free (estimated):              8.49GiB      (min: 8.49GiB)
    Free (statfs, df):             8.49GiB
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:                5.50MiB      (used: 0.00B)
    Multiple profiles:                  no

              Data    Metadata  System
Id Path       RAID1   RAID1     RAID1    Unallocated Total    Slack
-- ---------- ------- --------- -------- ----------- -------- -----
 1 /dev/loop0 2.00GiB 512.00MiB  8.00MiB     7.49GiB 10.00GiB     -
 2 /dev/loop1 2.00GiB 512.00MiB  8.00MiB     7.49GiB 10.00GiB     -
-- ---------- ------- --------- -------- ----------- -------- -----
   Total      2.00GiB 512.00MiB  8.00MiB    14.98GiB 20.00GiB 0.00B
   Used       1.00GiB   1.14MiB 16.00KiB
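The Overall figures in the two usage listings above can be cross-checked against the per-device columns. The sketch below only redoes that arithmetic with awk. The function names sum_gib and ratio are hypothetical, the inputs are the table values converted to MiB, and the assumption that the data ratio is raw allocation divided by logical allocation is mine (it matches the numbers shown).

```shell
# Hypothetical helper: sum allocations given in MiB and print the total
# in GiB, rounded to two decimals.
sum_gib() {
    printf '%s\n' "$@" | awk '{ total += $1 } END { printf "%.2f\n", total / 1024 }'
}

# Hypothetical helper: raw bytes on disk divided by logical bytes.
ratio() {
    awk -v raw="$1" -v logical="$2" 'BEGIN { printf "%.2f\n", raw / logical }'
}

# After the replace only: loop0 holds single and RAID1 chunks, loop1 only RAID1.
sum_gib 1024 1024 256 256 32 8  1024 256 8   # prints 3.80 (Device allocated)
ratio 3072 2048                              # prints 1.50 (1 GiB single + 2x1 GiB RAID1 over 2 GiB)

# After the balance: every chunk is RAID1 and present on both devices.
sum_gib 2048 512 8  2048 512 8               # prints 5.02 (Device allocated)
ratio 4096 2048                              # prints 2.00
```

A ratio below 2.00 on a two-disk RAID1 means that part of the allocation exists on only one device.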

Enabling balance and running the script again works fine and the final sha512sum reports no errors.
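To avoid forgetting the balance step, the two commands can be wrapped together. This is a minimal sketch that reuses the invocations from the script above. The function name replace_and_rebalance and its argument order are hypothetical, and it assumes it is run as root on a filesystem that is mounted read-write (degraded if the old disk is missing).

```shell
# Hypothetical wrapper: replace a device, then restore RAID1 redundancy.
#   $1 = devid of the missing/old device, $2 = new device, $3 = mount point
replace_and_rebalance() {
    devid=$1
    newdev=$2
    mnt=$3
    # -B waits until the replace is complete
    btrfs replace start -B "$devid" "$newdev" "$mnt" || return 1
    # soft: only convert chunks that are not already RAID1
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft "$mnt" || return 1
    # Print the resulting allocation so leftover single chunks are visible
    btrfs filesystem usage -T "$mnt"
}
```

For the test setup above, the call would be `replace_and_rebalance 2 /dev/loop1 mnt`. The wrapper stops at the first failing command, so a failed replace is never followed by a balance.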

The behavior is slightly different when data is written after the replace and before unmounting (the disabled echo command in the script). The final mount then fails with:

mount: /tmp/mnt: wrong fs type, bad option, bad superblock on /dev/loop1, missing codepage or helper program, or other error.
       dmesg(1) may have more information after failed mount system call.

BTRFS warning (device loop1): devid 1 uuid b015b488-4a96-4c5a-81f2-cbec50dfe6ca is missing
BTRFS warning (device loop1): chunk 1372585984 missing 1 devices, max tolerance is 0 for writable mount
BTRFS warning (device loop1): writable mount is not allowed due to too many missing devices
BTRFS error (device loop1): open_ctree failed: -22

Mounting with -o degraded,ro works, and the data seems to be intact (at least in this basic test), but attempting to replace the disk fails with:

ERROR: ioctl(DEV_REPLACE_START) failed on "mnt": Read-only file system

I don’t know if one can recover from this scenario.
