Router Upgrade From Hell

The day started with a plan. I would upgrade to the latest OpenWRT release with minimal disruption to the home internet, planning out the steps beforehand and being careful not to totally mess things up. And, as always, reality had other ideas about how the day would go.

The Original Plan

I’m running OpenWRT on a PC Engines APU4 with way more disk space than it needs. The APU4 boots from the SSD like a traditional PC, not like a router booting from dedicated Flash storage, so I wasn’t sure exactly how the OpenWRT upgrade procedure would work and I wanted to make sure I had a working install to fall back on.

So I planned on re-partitioning the disk to add more partitions, installing the new version, and setting up the grub2 bootloader to select the new root filesystem while leaving the working one intact and ready to fall back to in case of problems. Sounds easy, right? I’ve re-sized and re-partitioned disks lots of times without any trouble.

The Reality

The APU4 can boot from USB, so I downloaded the x86 ext4 image, uncompressed it and wrote it to a USB stick. Since this is a router, the console is a DB9-M serial port. Using minicom, a USB to Serial adapter, and a DB9-F to DB9-F cable I had made for the original install, I was able to connect to the console from my laptop. I plugged in the USB stick, rebooted, and the first thing I saw was a prompt to hit F10 to select the boot menu. But I was running minicom from an xfce4-terminal, and F10 doesn’t get passed through to the serial port. What the heck was I going to do now?

I let the router boot and proceeded to try to figure out how to send an F10 to the router. For some reason Firefox didn’t work because the network hadn’t come back up correctly, which was odd since rebooting the router had always ‘Just Worked(tm)’. I missed all the clues and scrambled around trying to figure out what I’d already broken before I’d actually changed anything.

The astute reader will be yelling at me right now :) Yes, I had left the USB stick in place, and yes, it had actually booted from it into the default OpenWRT configuration, which is different from my production setup. I should have realized immediately but didn’t. It turns out the grub2 boot menu on the USB is identical to the menu on the disk because, well, they are exactly the same. Needless to say I felt pretty dumb after realizing this.

So, the F10 prompt from the APU4 is a red herring. Ignore it. If you are trying to boot from USB and none of your settings from the working system are applied it’s because it did what you told it to do and booted from the USB.

I’ve made sure I don’t repeat this mistake by editing the grub2 menu on the disk, adding the OpenWRT release number to the menu entry so that it is totally clear which one is being booted. It will also be clear when booting from a new OpenWRT release on USB, because that menu won’t have any release number.

Making Progress

The next task was to re-partition the disk. I had used most of the disk for / when doing the initial install and I needed to split it up into at least 2 more pieces. I first tried to shrink the ext4 filesystem from the running system, but the version of resize2fs used by OpenWRT 19.07 doesn’t support shrinking a mounted filesystem, so I rebooted into the USB image, which doesn’t have the right tools installed. I was using the ext4 image on the USB stick, which is actually mounted writable, unlike the squashfs image, so there was some hope for progress here.

fdisk and resize2fs are on the installed system, so I mounted it and copied them over to the USB filesystem. The usual way to do things is to shrink the filesystem first and then delete and recreate the partition at the smaller size. That is what I set out to do after booting from the USB stick, with the SSD’s root partition unmounted.
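When recreating a partition by hand, the number fdisk ultimately needs is the last sector, and that follows from the start sector and the size of the shrunk filesystem. Here is a minimal sketch of that arithmetic in shell, assuming 4K filesystem blocks and 512-byte sectors. The function name last_sector and the start sector 33792 are hypothetical and only there for illustration; the real start sector has to be read from the existing partition table.

```shell
#!/bin/sh
# last_sector START_SECTOR FS_BLOCKS FS_BLOCK_SIZE SECTOR_SIZE
# Prints the last sector of the smallest partition that still holds
# a filesystem of FS_BLOCKS blocks beginning at START_SECTOR.
last_sector() {
    start=$1
    fs_blocks=$2
    fs_block_size=$3
    sector_size=$4
    # Sectors per filesystem block (8 for 4K blocks on 512-byte sectors).
    sectors_per_block=$(( fs_block_size / sector_size ))
    # The sector range is inclusive, hence the -1.
    echo $(( start + fs_blocks * sectors_per_block - 1 ))
}

# 5 GiB is 5 * 1024 * 1024 KiB; divide by 4 to get 4K blocks.
target_blocks=$(( 5 * 1024 * 1024 / 4 ))   # 1310720

# 33792 is a made-up start sector, for illustration only.
last_sector 33792 "$target_blocks" 4096 512   # prints 10519551
```

Anything at or above that last sector is safe. Leaving a little slack does no harm, since resize2fs run without a size argument resizes the filesystem to match the partition.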

The Next Mistake

From the booted USB stick I ran the resize2fs binary that I’d copied over from the SSD:

root@OpenWrt:~# resize2fs /dev/sda2 5G
resize2fs 1.44.5 (15-Dec-2018)
Resizing the filesystem on /dev/sda2 to 1310720 (4k) blocks.
The filesystem on /dev/sda2 is now 1310720 (4k) blocks long.

But when I tried to run fdisk I got a pile of linker errors. It needs more than the main binary in order to run: the shared libraries it links against were missing from the USB image. Instead of tracking them down and copying them over, I rebooted into the SSD and used fdisk from there to shrink the partition.

But in my haste I used df to get the size of the shrunk filesystem. Yes, you can yell some more :) The output from that was:

/dev/root              5160576     44344   5099848   1% /

I used 5160576 * 2 = 10321152 sectors as my new partition size. I fired up fdisk, deleted the existing root partition, created a new one with the same starting sector location (very important) and the new end location. I created a couple of new partitions to hold the new release and any common files I want to keep around between releases. I almost formatted these new partitions. Something told me to reboot first. One change at a time is safer. By now you know where this is going – I rebooted and was greeted with a kernel panic and this message:

block count 1310720 exceeds size of device 1290145

“Oh Shit” was my first thought. But I knew that the data was safe for now. Good thing I hadn’t formatted the second partition, which now overlapped the end of the first filesystem. I might be able to fix this.

Fixing My Mistakes

I rebooted into the USB stick yet again. It still had the resize2fs binary on it, and it includes e2fsck by default, so I should have all the tools needed to fix this. I ran e2fsck to check the filesystem, fixed the problems it reported, and then ran resize2fs without a size argument so that it would shrink the filesystem to fit the partition:

root@OpenWrt:~# e2fsck /dev/sda2
e2fsck 1.45.6 (20-Mar-2020)
The filesystem size (according to the superblock) is 1310720 blocks
The physical size of the device is 1290145 blocks
Either the superblock or the partition table is likely to be corrupt!

root@OpenWrt:~# resize2fs /dev/sda2
(I didn't save the actual output from this run)

root@OpenWrt:~# e2fsck /dev/sda2
e2fsck 1.45.6 (20-Mar-2020)
/dev/sda2: clean, 2644/327680 files, 31666/1290145 blocks

So now the original filesystem fits into the partition and I should be able to reboot into a running system.

What Went Wrong

I got distracted by having to reboot to use fdisk and forgot that df reports the space available inside the filesystem in 1K blocks, which leaves out the filesystem’s own overhead. It is not the number of blocks the filesystem occupies on disk. I should have used the 1310720 block count from the resize2fs or e2fsck output. You just have to remember that, on this filesystem, those tools report sizes in 4K blocks. It didn’t help that some tools report 1K blocks, others use 4K blocks, and disk sectors (on this SSD anyway) are 512 bytes. The cmdline tools are pretty clear about their block sizes, but the kernel output isn’t, so be careful and double check the numbers you use.
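To make the unit mix-up concrete, here is the arithmetic redone in shell using only the numbers that appear in the output above. It assumes 512-byte sectors, 4K filesystem blocks and the 1K blocks that df printed; the variable names are mine.

```shell
#!/bin/sh
sector_size=512

# Correct input: the block count reported by resize2fs, in 4K blocks.
fs_blocks=1310720
needed_sectors=$(( fs_blocks * (4096 / sector_size) ))

# What I actually used: the 1K-block total from df.
df_kblocks=5160576
used_sectors=$(( df_kblocks * (1024 / sector_size) ))

# Express the undersized partition in 4K blocks for comparison.
used_blocks=$(( used_sectors / (4096 / sector_size) ))
shortfall_blocks=$(( fs_blocks - used_blocks ))
shortfall_mib=$(( shortfall_blocks * 4 / 1024 ))

echo "sectors needed:      $needed_sectors"    # 10485760
echo "sectors used:        $used_sectors"      # 10321152
echo "partition 4K blocks: $used_blocks"       # 1290144
echo "short by 4K blocks:  $shortfall_blocks"  # 20576
echo "short by MiB:        $shortfall_mib"     # 80
```

The 1290144 blocks computed here are within one block of the 1290145 that the kernel and e2fsck reported for the device, and roughly 80 MiB short of what the filesystem needed.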

Installing A Second OpenWRT Release

It now boots again. I’m back to having a running system with a single /boot partition (quite small, but enough for 2 releases), two root partitions of about 5G each for releases, and the remainder of the disk for extra file storage if needed.

Installing the new release was simple. Mount the USB stick from the running system, mount the new rootfs, and copy over all the files. Mount the /boot from the USB stick and copy over the kernel to /boot/vmlinuz-21.02.2 to differentiate it from the previous kernel.

I edited the /boot/grub/grub.cfg menu to add a new entry pointing to the new partition’s UUID (suffix is -03 instead of -02) and vmlinuz file, copied over my /etc/config files, and /etc/dropbear/authorized_keys. I edited /etc/shadow to copy over the root password hash.
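For reference, a menu entry of the kind described can be produced with a small shell helper. This is only a sketch under assumptions: the function name make_menuentry is hypothetical, 12345678 is a placeholder disk identifier, I am assuming the UUID in question is a PARTUUID (which is what the -02/-03 suffix suggests), and the kernel arguments after root= are typical of a serial-console x86 setup rather than copied from my actual file. Copy the arguments from the existing entry in your own grub.cfg instead of trusting these.

```shell
#!/bin/sh
# make_menuentry RELEASE PARTUUID
# Prints a grub menuentry whose title and kernel file name both carry
# the release number, so it is obvious which install is being booted.
make_menuentry() {
    release=$1
    partuuid=$2
    cat <<EOF
menuentry "OpenWrt $release" {
    linux /boot/vmlinuz-$release root=PARTUUID=$partuuid rootwait console=ttyS0,115200n8 noinitrd
}
EOF
}

# Placeholder identifier; the -03 suffix selects the third partition.
make_menuentry 21.02.2 12345678-03
```

The output is meant to be reviewed and pasted into /boot/grub/grub.cfg by hand, not appended blindly.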

Initially I rebooted with the serial console and manually selected the new menu entry. luci wouldn’t start due to some path changes it expected in the config files. I compared the old configs with the new ones to figure that out, since the error output from luci is pretty terrible. Once the UI was working I went through all the pages, some of which prompted me to update the configuration files. All that worked fine, and as a final step I edited /boot/grub/grub.cfg to make the default menu entry point to the new installation.

Conclusion

I now have a router setup that I feel safer updating. I can re-use the old root partition for the next release, edit the grub.cfg to reflect which release it really is, and then reboot knowing that if something goes wrong I can always fall back to the working version.

I should have been more careful about jumping to conclusions while going through this process. The initial USB confusion was frustrating and led to a cascade of mistakes that I was luckily able to recover from in the end. You could also make an argument that I may have had an easier time just redoing the installation from scratch, and you would probably be right, but I was set on keeping the working system intact so that’s the route I ended up taking.