Note
TL;DR: You must run btrfs balance start -dconvert=raid1,soft
-mconvert=raid1,soft after replacing a disk in a btrfs RAID1 or data loss can
occur!
The man page of btrfs replace and many guides on the internet don’t mention
that running btrfs replace start alone is not enough to replace a missing
disk. Stopping there can lead to data loss!
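As a minimal sketch of the full sequence, the replace can be followed directly by the soft convert balance. The function name replace_and_rebalance and its argument order are illustrative assumptions, not part of btrfs-progs; only the two btrfs commands are taken from this post.

```shell
# Hypothetical helper: replace a missing device, then convert any leftover
# non-RAID1 chunks. The function name and argument order are made up for
# illustration; only the two btrfs invocations come from the text.
replace_and_rebalance() {
    devid=$1      # devid of the missing disk, e.g. 2
    new_dev=$2    # replacement device, e.g. /dev/loop1
    mnt=$3        # mount point of the degraded filesystem

    # -B keeps the replace in the foreground until it has finished
    btrfs replace start -B "$devid" "$new_dev" "$mnt" || return 1

    # "soft" skips chunks that are already RAID1
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft "$mnt"
}
```

With the filesystem mounted degraded, this would be called as replace_and_rebalance 2 /dev/loop1 mnt. The balance is skipped if the replace itself fails.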
All tests were run on Debian Trixie with Linux kernel 6.12.63 and
btrfs-progs 6.14-1.
The following script reproduces the issue (using loop devices for quick tests,
but it was also reproduced on real hardware). It creates a btrfs RAID1 on
two disks, destroys the second disk and replaces it with a new one, and then
destroys the first disk. The second disk should then hold all data (but
doesn’t).
set -eu
cd /tmp
mkdir -p mnt
# Create two 10GiB disks
rm -f disk0; truncate -s 10G disk0
rm -f disk1; truncate -s 10G disk1
# Btrfs needs to see both devices when mounting
losetup /dev/loop0 disk0
losetup /dev/loop1 disk1
# Initialize btrfs RAID1 and create a file with random data
mkfs.btrfs --data raid1 --metadata raid1 /dev/loop0 /dev/loop1
mount /dev/loop0 mnt
dd if=/dev/urandom bs=1G count=1 > mnt/data
sha512sum mnt/data > mnt/data.sha512sum
umount mnt
# Destroy data on second disk
rm -f disk1; truncate -s 10G disk1
losetup -d /dev/loop1; losetup /dev/loop1 disk1
# Not necessary, but just to make clear it's not a cache issue
echo 3 > /proc/sys/vm/drop_caches
# Replace second disk (-B waits until replace is complete)
mount -o degraded /dev/loop0 mnt
btrfs replace start -B 2 /dev/loop1 mnt
# btrfs filesystem usage -T mnt
# btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft mnt
# echo write > mnt/test
umount mnt
# Destroy data on first disk
rm -f disk0; truncate -s 10G disk0
losetup -d /dev/loop0; losetup /dev/loop0 disk0
# Not necessary
echo 3 > /proc/sys/vm/drop_caches
# Attempt to mount
mount -o degraded /dev/loop1 mnt
sha512sum -c mnt/data.sha512sum
# Cleanup
umount mnt
losetup -d /dev/loop0; rm disk0
losetup -d /dev/loop1; rm disk1
Running it yields the following error at the final mount, after both disks
have been replaced:
mount: /tmp/mnt: can't read superblock on /dev/loop1.
dmesg(1) may have more information after failed mount system call.
BTRFS info (device loop1): first mount of filesystem 5f1f3583-8c44-4671-9831-02bd8ff743e1
BTRFS info (device loop1): using crc32c (crc32c-intel) checksum algorithm
BTRFS warning (device loop1): devid 1 uuid 2b43640d-819f-42d2-9559-e8e284b84be1 is missing
BTRFS error (device loop1): failed to read chunk root
BTRFS error (device loop1): open_ctree failed: -5
The issue becomes clear when running btrfs filesystem usage -T after the
replace:
Overall:
Device size: 20.00GiB
Device allocated: 3.80GiB
Device unallocated: 16.20GiB
Device missing: 0.00B
Device slack: 0.00B
Used: 2.00GiB
Free (estimated): 11.47GiB (min: 8.77GiB)
Free (statfs, df): 7.96GiB
Data ratio: 1.50
Metadata ratio: 1.50
Global reserve: 5.50MiB (used: 0.00B)
Multiple profiles: yes (data, metadata, system)
              Data    Data    Metadata  Metadata  System   System
Id Path       single  RAID1   single    RAID1     single   RAID1   Unallocated Total    Slack
-- ---------- ------- ------- --------- --------- -------- ------- ----------- -------- -----
 1 /dev/loop0 1.00GiB 1.00GiB 256.00MiB 256.00MiB 32.00MiB 8.00MiB     7.46GiB 10.00GiB     -
 2 /dev/loop1       - 1.00GiB         - 256.00MiB        - 8.00MiB     8.74GiB 10.00GiB     -
-- ---------- ------- ------- --------- --------- -------- ------- ----------- -------- -----
   Total      1.00GiB 1.00GiB 256.00MiB 256.00MiB 32.00MiB 8.00MiB    16.20GiB 20.00GiB 0.00B
   Used         0.00B 1.00GiB     0.00B   1.14MiB 16.00KiB   0.00B
Not all data was replicated to the new disk loop1! The chunks with the single
profile (data, metadata, and system) exist only on loop0.
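A check along these lines could catch the problem before the old disk is removed. It is only a sketch: it relies on the "Multiple profiles:" line shown in the output above, and the function name has_mixed_profiles is hypothetical.

```shell
# Hypothetical helper: reads `btrfs filesystem usage` output on stdin and
# succeeds (exit 0) only if the "Multiple profiles:" line says "yes".
# Assumes the line format shown in the output above.
has_mixed_profiles() {
    awk '/Multiple profiles:/ { found = ($3 == "yes") } END { exit !found }'
}

# Example use (needs a mounted btrfs filesystem at ./mnt):
#   if btrfs filesystem usage -T mnt | has_mixed_profiles; then
#       echo "non-RAID1 chunks remain, run the balance" >&2
#   fi
```

If the line is absent from the output, the function reports no mixed profiles, so a negative result should be treated as a hint rather than proof.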
Running btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft
(commented out in the script above) fixes this and replicates all necessary
data. The soft modifier skips chunks that already use the target profile, so
only the remaining chunks are converted (without it, balance rewrites every
chunk). Running usage afterwards yields:
Overall:
Device size: 20.00GiB
Device allocated: 5.02GiB
Device unallocated: 14.98GiB
Device missing: 0.00B
Device slack: 0.00B
Used: 2.00GiB
Free (estimated): 8.49GiB (min: 8.49GiB)
Free (statfs, df): 8.49GiB
Data ratio: 2.00
Metadata ratio: 2.00
Global reserve: 5.50MiB (used: 0.00B)
Multiple profiles: no
              Data    Metadata  System
Id Path       RAID1   RAID1     RAID1    Unallocated Total    Slack
-- ---------- ------- --------- -------- ----------- -------- -----
 1 /dev/loop0 2.00GiB 512.00MiB  8.00MiB     7.49GiB 10.00GiB     -
 2 /dev/loop1 2.00GiB 512.00MiB  8.00MiB     7.49GiB 10.00GiB     -
-- ---------- ------- --------- -------- ----------- -------- -----
   Total      2.00GiB 512.00MiB  8.00MiB    14.98GiB 20.00GiB 0.00B
   Used       1.00GiB   1.14MiB 16.00KiB
Enabling balance and running the script again works fine and the final
sha512sum reports no errors.
The behavior is slightly different if a write is performed after the replace
and before unmounting (the commented-out echo command). Mounting then fails
with:
mount: /tmp/mnt: wrong fs type, bad option, bad superblock on /dev/loop1, missing codepage or helper program, or other error.
dmesg(1) may have more information after failed mount system call.
BTRFS warning (device loop1): devid 1 uuid b015b488-4a96-4c5a-81f2-cbec50dfe6ca is missing
BTRFS warning (device loop1): chunk 1372585984 missing 1 devices, max tolerance is 0 for writable mount
BTRFS warning (device loop1): writable mount is not allowed due to too many missing devices
BTRFS error (device loop1): open_ctree failed: -22
Mounting with -o degraded,ro works (and the data seems to be intact, at
least in this basic test), but attempting to replace the disk fails with:
ERROR: ioctl(DEV_REPLACE_START) failed on "mnt": Read-only file system
I don’t know if one can recover from this scenario.