Router Upgrade From Hell
The day started with a plan. I would upgrade to the latest OpenWRT release with minimal disruption to the home internet, planning out the steps beforehand, and being careful not to totally mess things up. And as always reality had other ideas about how the day would go.
The Original Plan
I’m running OpenWRT on a PC Engines APU4 with way more disk space than it needs. The APU4 boots from the SSD like a traditional PC, not like a router booting from dedicated Flash storage, so I wasn’t sure exactly how the OpenWRT upgrade procedure would work and I wanted to make sure I had a working install to fall back on.
So I planned on re-partitioning the disk to add more partitions, installing the
new version, and setting up the
grub2 bootloader to select the new root
filesystem while leaving the working one intact and ready to fall back to in
case of problems. Sounds easy, right? I’ve re-sized and re-partitioned disks
lots of times without any trouble.
The APU4 can boot from USB, so I downloaded the x86 ext4 image,
uncompressed it and wrote it to a USB stick. Since this is a router the console
is a DB9-M serial port. Using
minicom, a USB to Serial adapter, and a DB9-F
to DB9-F cable I had made for the original install I was able to connect to the
console from my laptop. I plugged in the USB stick, rebooted, and the first
thing I saw was a prompt to hit F10 to select the boot menu. But I'm running
minicom from an
xfce4-terminal and F10 doesn't get passed through to the
serial port. What the heck am I going to do now?
I let the router boot and tried to figure out how to send an F10 to it. Firefox didn’t work because the network hadn’t come back up correctly, which was odd since rebooting the router has always ‘Just Worked(tm)’. I missed all the clues and scrambled around trying to figure out what I’d already broken before I’d actually changed anything.
The astute reader will be yelling at me right now :) Yes, I had left the USB
stick in place, and yes it had actually booted from it into the default OpenWRT
configuration, which is different from my production setup. I should have
realized immediately but didn’t. It turns out the
grub2 boot menu on the USB is
identical to the menu on the disk because, well, they are exactly the
same. Needless to say I felt pretty dumb after realizing this.
So, the F10 prompt from the APU4 is a red herring. Ignore it. If you are trying to boot from USB and none of your settings from the working system are applied it’s because it did what you told it to do and booted from the USB.
I’ve made sure I don’t repeat this mistake again by editing the
grub2 menu on
the disk, adding the OpenWRT release number to the menu entry so that it is
totally clear which one is being booted. And it will also be clear when booting
from a new OpenWRT release on USB because that menu won’t have any release number.
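
As a rough sketch of that relabeling, the snippet below rewrites the title of the plain "OpenWrt" entry in a scratch copy of a grub.cfg. It assumes the stock title is exactly `menuentry "OpenWrt" {`; the helper name `label_grub_entry` and the `00000000-02` PARTUUID are placeholders for illustration, not values from the real system.

```shell
#!/bin/sh
# label_grub_entry FILE RELEASE  (hypothetical helper)
# Appends RELEASE to the title of the plain "OpenWrt" menu entry.
# Assumption: the title line reads exactly: menuentry "OpenWrt" {
label_grub_entry() {
    cfg=$1
    release=$2
    # The closing quote right after OpenWrt keeps the failsafe entry untouched.
    sed -i "s/^menuentry \"OpenWrt\" /menuentry \"OpenWrt $release\" /" "$cfg"
}

# Demonstration on a scratch file, not the real /boot/grub/grub.cfg.
demo_cfg=$(mktemp)
cat > "$demo_cfg" <<'EOF'
menuentry "OpenWrt" {
	linux /boot/vmlinuz root=PARTUUID=00000000-02 rootwait
}
menuentry "OpenWrt (failsafe)" {
	linux /boot/vmlinuz failsafe=true root=PARTUUID=00000000-02 rootwait
}
EOF

label_grub_entry "$demo_cfg" 19.07
grep '^menuentry' "$demo_cfg"
```

Trying the edit on a copy first makes it easy to check the result before touching the file the router actually boots from.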
The next task is to re-partition the disk. I had used most of the disk for
/ when doing the initial install and I needed to split it up into at least 2
more pieces. I first tried to shrink the ext4 filesystem from the running
system, but the version of
resize2fs used by OpenWRT 19.07 doesn’t support
shrinking a mounted filesystem, so I reboot into the USB image. Which doesn’t
have the right tools installed. I am using the ext4 image on the USB stick so
it is actually mounted writable, unlike the squashfs image, so there is some
hope for progress here. The tools I need, including
resize2fs, are on the installed system, which I can mount, so I can
copy them over to the USB filesystem. The usual way to do things is to shrink
the filesystem and then delete and recreate the partition. Which is what I set
out to do while booted from the USB stick with the SSD’s root partition
unmounted.
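
For reference, one common order for an offline ext4 shrink is sketched below as a dry run: the function only prints the commands, it does not execute them. The helper name `print_shrink_plan` is made up for this example, and the device and size are simply the ones used in this post. The last `resize2fs` call has no size argument, which makes it fit the filesystem to whatever the partition size turned out to be.

```shell
#!/bin/sh
# print_shrink_plan DEVICE FS_SIZE  (hypothetical helper, dry run only)
# Prints one common sequence for shrinking an unmounted ext4 filesystem.
print_shrink_plan() {
    dev=$1
    size=$2
    echo "e2fsck -f $dev"        # resize2fs wants a freshly checked filesystem
    echo "resize2fs $dev $size"  # shrink the filesystem first
    echo "# recreate the partition: same start sector, at least $size long"
    echo "resize2fs $dev"        # no size given: fit the filesystem to the partition
    echo "e2fsck -f $dev"        # verify the result before rebooting
}

print_shrink_plan /dev/sda2 5G
```

Printing the plan rather than running it keeps the sketch harmless; the partition step is left as a comment because it is interactive in fdisk.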
The Next Mistake
From the booted USB stick I ran the
resize2fs binary that I’d copied over
from the SSD:
root@OpenWrt:~# resize2fs /dev/sda2 5G
resize2fs 1.44.5 (15-Dec-2018)
Resizing the filesystem on /dev/sda2 to 1310720 (4k) blocks.
The filesystem on /dev/sda2 is now 1310720 (4k) blocks long.
But when I tried to run
fdisk I got a pile of linker errors. It needs shared libraries beyond the
main binary in order to run. Instead of tracking them down and copying them over I
rebooted into the SSD and used
fdisk from there to shrink the partition.
But in my haste I used
df to get the size of the shrunk filesystem. Yes, you
can yell some more :) The output from that was:
/dev/root 5160576 44344 5099848 1% /
df reports 1K blocks, so I used 5160576 * 2 = 10321152 sectors as my new partition size. I fired up
fdisk, deleted the existing root partition, created a new one with the same
starting sector location (very important) and the new end location. I created
a couple of new partitions to hold the new release and any common files I want
to keep around between releases. I almost formatted these new partitions.
Something told me to reboot first. One change at a time is safer. By now you
know where this is going – I rebooted and was greeted with a kernel panic and
this message:
block count 1310720 exceeds size of device 1290145
Oh Shit was my first thought. But I knew that the data was safe for now. Good thing I hadn’t formatted the second partition which was currently overlapping the end of the first. I might be able to fix this.
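
A check along the following lines, run before rebooting, would have flagged the mismatch. It is only a sketch: `fs_fits_partition` is a made-up helper, and the comments about where the inputs come from assume the partition is sda2 and that tune2fs is available on the system.

```shell
#!/bin/sh
# fs_fits_partition FS_BLOCKS FS_BLOCK_SIZE PART_SECTORS  (hypothetical helper)
# Succeeds when a filesystem of FS_BLOCKS blocks of FS_BLOCK_SIZE bytes
# fits inside a partition of PART_SECTORS 512-byte sectors.
fs_fits_partition() {
    needed=$(( $1 * ($2 / 512) ))
    [ "$needed" -le "$3" ]
}

# On a live system the inputs could be read from (assumed device name):
#   tune2fs -l /dev/sda2            -> "Block count" and "Block size"
#   cat /sys/class/block/sda2/size  -> partition size in 512-byte sectors
if fs_fits_partition 1310720 4096 10321152; then
    echo "ok"
else
    echo "filesystem is larger than the partition"
fi
```

With the numbers from this post the check fails, which matches the kernel's complaint at boot.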
Fixing My Mistakes
I rebooted into the USB stick yet again. It still had the
resize2fs binary on
it, and it includes
e2fsck by default so I should have all the tools needed
to fix this. I told it to check the filesystem, where I fixed some problems,
and then shrink the filesystem to fit the partition:
root@OpenWrt:~# e2fsck /dev/sda2
e2fsck 1.45.6 (20-Mar-2020)
The filesystem size (according to the superblock) is 1310720 blocks
The physical size of the device is 1290145 blocks
Either the superblock or the partition table is likely to be corrupt!
root@OpenWrt:~# resize2fs /dev/sda2
(I didn't save the actual output from this run)
root@OpenWrt:~# e2fsck /dev/sda2
e2fsck 1.45.6 (20-Mar-2020)
/dev/sda2: clean, 2644/327680 files, 31666/1290145 blocks
So now the original filesystem fits into the partition and I should be able to reboot into a running system.
What Went Wrong
I got distracted by having to reboot to use
fdisk and forgot that the output of
df is the filesystem size in 1K blocks, not the number of blocks it uses on disk. I
should have used the 1310720 block count from the
resize2fs and e2fsck output. You just have to remember that those numbers
are in 4K blocks. It didn’t help that some tools report 1K blocks,
others use 4K blocks, and disk sectors (on this SSD anyway) are 512 bytes. The
cmdline tools are pretty clear about their block sizes, but the kernel output
isn’t, so be careful and double check the numbers you use.
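
To make the unit mix-up concrete, the sketch below converts both numbers from this post into 512-byte sectors. The helper name `to_sectors` is invented for the example; the inputs are the df figure (1K blocks) and the resize2fs block count (4K blocks).

```shell
#!/bin/sh
# to_sectors COUNT UNIT_BYTES  (hypothetical helper)
# Converts COUNT units of UNIT_BYTES bytes into 512-byte sectors.
to_sectors() {
    echo $(( $1 * ($2 / 512) ))
}

df_sectors=$(to_sectors 5160576 1024)   # df column, 1K blocks
fs_sectors=$(to_sectors 1310720 4096)   # resize2fs block count, 4K blocks

echo "from df:        $df_sectors"
echo "from resize2fs: $fs_sectors"
echo "shortfall:      $(( fs_sectors - df_sectors )) sectors"
```

The partition sized from df comes out 164608 sectors (roughly 80 MiB) short of the 10485760 sectors the filesystem actually occupies.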
Installing A Second OpenWRT Release
It now boots again. I’m back to having a running system with a single /boot partition (quite small, but enough for 2 releases), two 5G (or so) root partitions for releases and the remainder of the disk for extra file storage if needed.
Installing the new release was simple. Mount the USB stick from the running
system, mount the new rootfs, and copy over all the files. Mount the /boot from
the USB stick and copy over the kernel under a new name to
differentiate it from the previous kernel.
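
The copy step can be sketched with cp -a, which preserves ownership, permissions, symlinks and timestamps. This is an assumption about one reasonable way to do it, not necessarily the exact command used here; `copy_tree` is a made-up helper, and the demo runs on scratch directories standing in for the mounted USB root filesystem and the new root partition.

```shell
#!/bin/sh
# copy_tree SRC DST  (hypothetical helper)
# Copies everything under SRC into DST, keeping ownership, modes,
# symlinks and timestamps intact.
copy_tree() {
    cp -a "$1"/. "$2"/
}

# Scratch directories stand in for the two mount points. On the router
# these would be something like /mnt/usb and /mnt/new (assumed names).
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/etc/config"
echo "demo" > "$src/etc/config/network"
ln -s config "$src/etc/cfg-link"

copy_tree "$src" "$dst"
ls -l "$dst/etc"
```

The trailing `/.` on the source copies the directory's contents, including dotfiles, rather than creating a nested directory in the destination.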
I edited the
/boot/grub/grub.cfg menu to add a new entry pointing to the new
partition’s UUID (suffix is
-03 instead of
-02) and vmlinuz file, copied
/etc/config files, and
/etc/dropbear/authorized_keys. I edited
/etc/shadow to copy over the root password hash.
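
A new grub.cfg entry of the kind described above might look like what the sketch below prints. Everything passed to the hypothetical `grub_entry` helper is a placeholder (title, kernel file name, PARTUUID), and the kernel arguments are modeled on a typical serial-console setup, so the safest approach is to copy the real arguments from the existing, working entry.

```shell
#!/bin/sh
# grub_entry TITLE KERNEL PARTUUID  (hypothetical helper)
# Prints a menuentry stanza for a root filesystem identified by PARTUUID.
# Assumption: the kernel arguments below are illustrative; copy the real
# ones from the entry that already boots.
grub_entry() {
    printf 'menuentry "%s" {\n' "$1"
    printf '\tlinux %s root=PARTUUID=%s rootwait console=ttyS0,115200n8 noinitrd\n' "$2" "$3"
    printf '}\n'
}

# Placeholder values only: the -03 suffix selects the third partition.
grub_entry "OpenWrt NEW-RELEASE" /boot/vmlinuz-new 00000000-03
```

The output can be reviewed and then pasted into grub.cfg by hand, which avoids scripting changes to the file the router depends on to boot.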
Initially I rebooted with the serial console and manually selected the new menu
entry.
luci wouldn’t start due to some path changes it expected in the
config files. I compared the old configs with the new ones to figure that out
since the error output from
luci is pretty terrible. Once the UI was working
I went through all the pages, some of which prompted to update the
configuration files. All that worked fine, and as a final step I edited
/boot/grub/grub.cfg to make the default menu entry point to the new
release.
I now have a router setup that I feel safer updating. I can re-use the old root
partition for the next release, edit the
grub.cfg to reflect which release it
really is, and then reboot knowing that if something goes wrong I can always
fall back to the working version.
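
Switching which entry boots by default comes down to one line in grub.cfg. The sketch below assumes the file contains a `set default="N"` line with a zero-based entry index; `set_grub_default` is a made-up helper, and the demo edits a scratch file rather than the real /boot/grub/grub.cfg.

```shell
#!/bin/sh
# set_grub_default FILE INDEX  (hypothetical helper)
# Rewrites the 'set default=' line so GRUB boots entry INDEX (zero-based).
set_grub_default() {
    sed -i "s/^set default=.*/set default=\"$2\"/" "$1"
}

# Demonstration on a scratch file.
demo_cfg=$(mktemp)
cat > "$demo_cfg" <<'EOF'
set default="0"
set timeout="5"
EOF

set_grub_default "$demo_cfg" 1
cat "$demo_cfg"
```

Falling back to the old release is then the same edit with the previous index, or a manual selection from the serial console.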
I should have been more careful about jumping to conclusions while going through this process. The initial USB confusion was frustrating and led to a cascade of mistakes that I was luckily able to recover from in the end. You could also make an argument that I might have had an easier time just redoing the installation from scratch, and you would probably be right, but I was set on keeping the working system intact so that’s the route I ended up taking.