Backups, Archives and Overheating Processors

A few (ahum) years ago I wrote an article for Linux Journal on building a RAID system. While that exact system no longer exists, I do still have a RAID5 setup that I use with BackupPC to back up all the systems on my LAN. As I wrote in my KVM article, I have updated my main Linux box to Fedora 11. It had been out of the backup rotation for about a year, since I have mostly been using my Mac Mini and everything on the Linux box was checked out of a remote Subversion repository. I wanted to archive the old system's backup and then add the box to the backup rotation again.

In all my years of using BackupPC I had somehow missed the archive feature. I've used it to recover files by writing them to a /tmp/ directory on a remote system or by downloading a tar of selected files, but I hadn't realized that you could also create a gigantic tar of all the files in the current backup. To get this set up I had to do several things:

  • Add a new DNS alias for the system to write the archive to. I use the same system that BackupPC runs on for this.
  • Add a new host in the 'Edit Hosts' page. I gave it the same name as the new DNS alias.
  • Edit the new host's config and in the 'Xfer' page set 'XferMethod' to archive instead of rsync.
  • Change the 'ArchiveSplit' option to 1000 to split the tar into roughly 1GB files that are easy to handle.

And presto! I could now dump archives of the backups to the local system and then burn them to DVD. I also wanted to include a directory of all the files alongside the archive. Since the tar is split up into pieces, you need to join them back together in order to get a full listing out of them. Tar was written to be used with streaming tapes, so all you need to do is cat the pieces to a tar process reading from standard input and write the output into a file. Like this:

cat host.tar.bz2.a? | tar tvjf - > ./directory.txt

This streams all the archive pieces, in order, to tar, which reads them from standard input and writes the listing to the directory.txt file. This can take quite a while.
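To convince yourself that the split pieces really do rejoin cleanly, here is a small self-contained sketch that builds a throwaway archive, splits it, and lists it the same way. The file names, the 16k piece size and the temporary directory are made up for the demo, and it uses gzip (z) rather than bzip2 (j) only so that it runs on minimal systems. It also uses split(1) to make the pieces, which is my stand-in for whatever BackupPC does internally.

```shell
#!/bin/sh
# Sketch: split a compressed tar into pieces, then list it by streaming
# the pieces back through a single tar process on standard input.
set -eu

work=$(mktemp -d)                 # throwaway directory, removed on exit
trap 'rm -rf "$work"' EXIT
cd "$work"

# Three files of random (incompressible) data, so that the archive is
# comfortably larger than one piece.
mkdir src
for name in a b c; do
    head -c 20000 /dev/urandom > "src/$name.bin"
done

# Write the archive to stdout and cut it into 16k pieces named
# host.tar.gz.aa, host.tar.gz.ab, and so on.
tar czf - src | split -b 16k - host.tar.gz.

# The shell expands the glob in sorted order, so cat feeds tar the
# pieces in the order they were written.
cat host.tar.gz.a? | tar tvzf - > directory.txt

pieces=$(ls host.tar.gz.a? | wc -l | tr -d ' ')
count=$(grep -c '\.bin$' directory.txt)
echo "$pieces pieces, $count files listed"
```

The one thing to watch with the real archive is the glob: `a?` only matches suffixes that start with "a", which covers up to 26 pieces.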

So at this point in the day I finally had the old system image written to a couple of DVDs. Now it was time to switch the backup back on and catch up with the current system image. I added a few new directories to the list to back up (I usually only back up /etc, /root and /home). This included my new libvirt virtual images. In all it amounted to about 96G worth of files. The LAN bottleneck is the 100Mb NIC in the backup system. It was pushing around 45Mb for several hours, chugging its way through the backup. Then something strange happened.

The backup server turned off. No warning, just click. Nothing. I rebooted it and it ran its filesystem checks with no problems. I dug through the logfiles and there was nothing in them to indicate a problem of any kind. So I restarted the backup and it ran for about another 30 minutes before doing the same thing. This system never dies on me, or at least it hasn't since I put in an Antec 500W power supply.

I started with the obvious, checking for bus errors in the logs. I ran memtest86 on it for a bit. Then I took a look at the BIOS health readings. Even after being relatively idle for 15 minutes the CPU temp was at 63C. Now, this system is a 2.9GHz Celeron D. The max temp is somewhere around 67C. So I was probably baking the heck out of the CPU and it was doing a thermal shutdown. Consumer CPUs like the Celeron just aren't designed for this kind of abuse. But that never stops me from trying to squeeze every last cent out of a system.

The heatsink had what I'd call a moderate amount of dust on it, but it was mostly on the top, not crammed down in the fins like I have sometimes seen. I pulled it off and the CPU was glued to it with heatsink grease. I blew out the dust (canned air is so much fun!), cleaned things off, gave it some new grease and re-installed it.

I fired up the backup and again after a short period of time it died. I finally set up the sensors package on the system and it told the story -- it was still overheating. The fan was only running at about 2700 rpm so I swapped in a spare Tornado fan, cranked it up to its maximum of 5300 rpm and restarted the backup. The CPU now maxes out around 57C and the backups all run to completion, so things seem to be happy.
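If you would rather have a script keep an eye on the temperature during a long backup, something like the following sketch could work. It is not what I ran: instead of the sensors command it reads the kernel's hwmon files under /sys/class/hwmon directly, which report temperatures in millidegrees Celsius, and it assumes your hardware's sensor driver is loaded and shows up there. The check_temps function name and the 60C limit are made up for the example. I picked 60 simply because it sits between the 57C the box runs at now and the 63C it was idling at before.

```shell
#!/bin/sh
# Sketch: print every hwmon temperature and flag readings at or above a limit.
# check_temps and the 60C limit are illustrative choices, not a standard tool.

check_temps() {
    base=$1       # directory holding hwmon*/temp*_input files
    limit_c=$2    # warning threshold in whole degrees Celsius
    status=0
    for f in "$base"/hwmon*/temp*_input; do
        [ -r "$f" ] || continue                    # glob matched nothing, or unreadable
        milli=$(cat "$f" 2>/dev/null) || continue  # some sensors fail on read
        case "$milli" in
            ''|*[!0-9-]*) continue ;;              # skip anything that is not a number
        esac
        temp_c=$((milli / 1000))                   # millidegrees to whole degrees
        if [ "$temp_c" -ge "$limit_c" ]; then
            echo "HOT  $f ${temp_c}C"
            status=1
        else
            echo "ok   $f ${temp_c}C"
        fi
    done
    return $status
}

check_temps /sys/class/hwmon 60 || echo "at least one sensor is at or above the limit"
```

Run from cron or in a loop with sleep, this would at least leave a trail in a logfile before the next click.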

This also reminds me that I really need to blow the dust out of the heatsinks in the other systems around here -- I don't think I've done that in over a year. I really should have a regular maintenance schedule instead of waiting for failures to happen. I guess I need the extra excitement or something.