I spent some time yesterday building out a UEFI server that didn’t have on-board hardware RAID for its system drives. In these situations, I always use Linux’s md RAID1 for the root filesystem (and/or /boot). This worked well for BIOS booting since BIOS just transfers control blindly to the MBR of whatever disk it sees (modulo finding a “bootable partition” flag, etc, etc). This means that BIOS doesn’t really care what’s on the drive, it’ll hand over control to the GRUB code in the MBR.
With UEFI, the boot firmware is actually examining the GPT partition table, looking for the partition marked with the “EFI System Partition” (ESP) type UUID. Then it looks for a FAT32 filesystem there, and does more things like looking at NVRAM boot entries, or just running EFI/BOOT/BOOTX64.EFI from the FAT32. Under Linux, this .EFI code is either GRUB itself, or Shim which loads GRUB.
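To make the firmware’s first check concrete, here is a minimal sketch that tests whether a GPT partition type matches the ESP type GUID. The helper name is_esp_type is made up for illustration, and /dev/sda1 in the usage comment is an assumed device name.

```shell
# GPT partition type GUID for an EFI System Partition.
ESP_GUID="c12a7328-f81f-11d2-ba4b-00a0c93ec93b"

# is_esp_type GUID: succeed if GUID is the ESP type (case-insensitive).
# Hypothetical helper, not part of any existing tool.
is_esp_type() {
    [ "$(printf '%s' "$1" | tr 'A-F' 'a-f')" = "$ESP_GUID" ]
}

# Possible use on a live system (device name is an assumption):
#   is_esp_type "$(lsblk -no PARTTYPE /dev/sda1)" && echo "sda1 is an ESP"
```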
So, if I want RAID1 for my root filesystem, that’s fine (GRUB will read md, LVM, etc), but how do I handle /boot/efi (the UEFI ESP)? In everything I found answering this question, the answer was “oh, just manually make an ESP on each drive in your RAID and copy the files around, add a separate NVRAM entry (with efibootmgr) for each drive, and you’re fine!” I did not like this one bit since it meant things could get out of sync between the copies, etc.
The current implementation of Linux’s md RAID puts metadata at the front of a partition. This solves more problems than it creates, but it means the RAID isn’t “invisible” to something that doesn’t know about the metadata. In fact, mdadm warns about this pretty loudly:
# mdadm --create /dev/md0 --level 1 --raid-disks 2 /dev/sda1 /dev/sdb1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
Reading from the mdadm man page:
-e, --metadata=
...
1, 1.0, 1.1, 1.2 default
    Use the new version-1 format superblock. This has fewer restrictions. It can easily be moved between hosts with different endian-ness, and a recovery operation can be checkpointed and restarted. The different sub-versions store the superblock at different locations on the device, either at the end (for 1.0), at the start (for 1.1) or 4K from the start (for 1.2). "1" is equivalent to "1.2" (the commonly preferred 1.x format). "default" is equivalent to "1.2".
First we toss a FAT32 on the RAID (mkfs.fat -F32 /dev/md0), and looking at the results, the first 4K is entirely zeros, and file doesn’t see a filesystem:
# dd if=/dev/sda1 bs=1K count=5 status=none | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  fc 4e 2b a9 01 00 00 00  00 00 00 00 00 00 00 00  |.N+.............|
...
# file -s /dev/sda1
/dev/sda1: Linux Software RAID version 1.2 ...
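The same observation can be scripted. This is a rough sketch, assuming the 1.2 layout quoted from the man page (superblock 4K from the start) and the magic bytes fc 4e 2b a9 visible in the hexdump; the helper name has_md12_superblock is hypothetical.

```shell
# has_md12_superblock FILE: succeed if the md superblock magic
# (0xa92b4efc, stored little-endian as fc 4e 2b a9) sits 4K into FILE,
# which is where version 1.2 metadata lives. Hypothetical helper.
has_md12_superblock() {
    magic=$(dd if="$1" bs=1 skip=4096 count=4 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "fc4e2ba9" ]
}

# Possible use (device name is an assumption):
#   has_md12_superblock /dev/sda1 && echo "md 1.2 metadata at 4K"
```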
So, instead, we’ll use --metadata 1.0 to put the RAID metadata at the end:
# mdadm --create /dev/md0 --level 1 --raid-disks 2 --metadata 1.0 /dev/sda1 /dev/sdb1
...
# mkfs.fat -F32 /dev/md0
# dd if=/dev/sda1 bs=1 skip=80 count=16 status=none | xxd
00000000: 2020 4641 5433 3220 2020 0e1f be77 7cac    FAT32   ...w|.
# file -s /dev/sda1
/dev/sda1: ... FAT (32 bit)
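For checking every member in one go, the dd test above could be wrapped up roughly like this. It is only a heuristic sketch: it looks for the same “FAT32” string the xxd output shows (the 8-byte type field at offset 82), and the helper name looks_like_fat32 is made up.

```shell
# looks_like_fat32 FILE: succeed if the 8-byte filesystem type field at
# offset 82 of FILE reads "FAT32   ". Hypothetical helper; a heuristic,
# not a full filesystem check.
looks_like_fat32() {
    fstype=$(dd if="$1" bs=1 skip=82 count=8 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$fstype" = "4641543332202020" ]   # hex for "FAT32" plus three spaces
}

# Possible use (device names are assumptions):
#   for part in /dev/sda1 /dev/sdb1; do
#       looks_like_fat32 "$part" || echo "$part: no visible FAT32"
#   done
```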
Now we have a visible FAT32 filesystem on the ESP. UEFI should be able to boot whatever disk hasn’t failed, and grub-install will write to the RAID mounted at /boot/efi.
However, we’re left with a new problem: on (at least) Debian and Ubuntu, grub-install attempts to run efibootmgr to record which disk UEFI should boot from. This fails, though, since it expects a single disk, not a RAID set. In fact, the lookup returns nothing, and grub-install tries to run efibootmgr with an empty -d argument:
Installing for x86_64-efi platform.
efibootmgr: option requires an argument -- 'd'
...
grub-install: error: efibootmgr failed to register the boot entry: Operation not permitted.
Failed: grub-install --target=x86_64-efi
WARNING: Bootloader is not properly installed, system may not be bootable
Luckily my UEFI boots without NVRAM entries, and I can disable the NVRAM writing via the “Update NVRAM variables to automatically boot into Debian?” debconf prompt when running:
dpkg-reconfigure -p low grub-efi-amd64
So, now my system will boot with both or either drive present, and updates from Linux to /boot/efi are visible on all RAID members at boot-time. HOWEVER there is one nasty risk with this setup: if UEFI writes anything to one of the drives (which this firmware did when it wrote out a “boot variable cache” file), it may lead to corrupted results once Linux mounts the RAID (since the member drives won’t have identical block-level copies of the FAT32 any more).
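One cheap way to notice such an external write is to compare the members directly. The sketch below assumes 1.0 metadata (so the filesystem data starts at offset 0 of each member) and that the caller passes a byte count that stops short of the per-member superblock at the end. members_match is a hypothetical helper, not an mdadm feature.

```shell
# members_match A B NBYTES: succeed if the first NBYTES of A and B are
# byte-for-byte identical. Hypothetical helper; NBYTES must cover only
# the shared data area, not the md superblock at the end of each member.
members_match() {
    cmp -s -n "$3" "$1" "$2"
}

# Possible use (device names are assumptions; for RAID1 the array size
# is the size of the data area on each member):
#   members_match /dev/sda1 /dev/sdb1 "$(blockdev --getsize64 /dev/md100)" \
#       || echo "ESP copies differ"
```

Comparing while the filesystem is mounted and being written could produce false alarms, so this probably only makes sense on a quiet system or before assembly.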
To deal with this “external write” situation, I see some solutions:
- Make the partition read-only when not under Linux. (I don’t think this is a thing.)
- Create higher-level knowledge of the root-filesystem RAID configuration, which is needed to keep a collection of filesystems manually synchronized instead of doing block-level RAID. (Seems like a lot of work and would need a redesign of /boot/efi into some kind of multi-copy layout.)
- Prefer one RAID member’s copy of /boot/efi and rebuild the RAID at every boot. If there were no external writes, there’s no issue. (Though what’s really the right way to pick the copy to prefer?)
Since mdadm has the “--update=resync” assembly option, I can actually do the last option. This required updating /etc/mdadm/mdadm.conf to add <ignore> on the RAID’s ARRAY line to keep it from auto-starting:
ARRAY <ignore> metadata=1.0 UUID=123...
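If the edit needs to be scripted rather than done by hand, a filter along these lines might do it. This is a sketch under the assumption that the second word of the ARRAY line is the device name (mdadm.conf allows other layouts); ignore_array is a made-up helper and the UUIDs in the usage comment are placeholders.

```shell
# ignore_array UUID: filter mdadm.conf text on stdin, replacing the second
# word of the ARRAY line that carries UUID with <ignore>. Hypothetical
# helper; assumes the second word is the device name.
ignore_array() {
    awk -v uuid="$1" '
        $1 == "ARRAY" && index($0, "UUID=" uuid) { $2 = "<ignore>" }
        { print }
    '
}

# Possible use (placeholder UUID, review the result before installing it):
#   ignore_array 123... < /etc/mdadm/mdadm.conf > /tmp/mdadm.conf.new
```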
(Since it’s ignored, I’ve chosen /dev/md100 for the manual assembly below.) Then I added the noauto option to the /boot/efi entry in /etc/fstab:
/dev/md100 /boot/efi vfat noauto,defaults 0 0
And finally I added a systemd oneshot service that assembles the RAID with resync and mounts it:
[Unit]
Description=Resync /boot/efi RAID
DefaultDependencies=no
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/sbin/mdadm -A /dev/md100 --uuid=123... --update=resync
ExecStart=/bin/mount /boot/efi
RemainAfterExit=yes

[Install]
WantedBy=sysinit.target
(And don’t forget to run “update-initramfs -u” so the initramfs has an updated copy of mdadm.conf.) If mdadm.conf supported an “update=” option for ARRAY lines, this would have been trivial. Looking at the source, though, that kind of change doesn’t look easy. I can dream!
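After a reboot it seems worth confirming that the manually assembled array really came up with every member. A minimal sketch, assuming the usual /proc/mdstat layout where the line after the array name ends in a status like [UU]; array_healthy is a hypothetical helper.

```shell
# array_healthy NAME: read /proc/mdstat-style text on stdin and succeed if
# the status line for array NAME shows every member up (e.g. [UU]).
# A missing member shows as "_" (e.g. [U_]). Hypothetical helper.
array_healthy() {
    awk -v name="$1" '
        $1 == name && $2 == ":" { found = 1; next }
        found { ok = ($0 ~ /\[U+\]/); exit }
        END { exit ok ? 0 : 1 }
    '
}

# Possible use:
#   array_healthy md100 < /proc/mdstat || echo "md100 is degraded or missing"
```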
And if I wanted to keep a “pristine” version of /boot/efi that UEFI couldn’t update I could rearrange things more dramatically to keep the primary RAID member as a loopback device on a file in the root filesystem (e.g. /boot/efi.img). This would make all external changes in the real ESPs disappear after resync. Something like:
# truncate --size 512M /boot/efi.img
# losetup -f --show /boot/efi.img
/dev/loop0
# mdadm --create /dev/md100 --level 1 --raid-disks 3 --metadata 1.0 /dev/loop0 /dev/sda1 /dev/sdb1
And at boot just rebuild it from /dev/loop0, though I’m not sure how to “prefer” that partition…
© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.