Automatic Backup of Files to S3 and Glacier

Automatic backups are important, especially when it comes to irreplaceable data like family photos. I have used s3cmd to maintain my website’s static files for a while now, and it was simple to use it to push my 100GB+ archive of photos over to S3. But I needed an automated way to update the archive with any new photos that my wife or I take. s3cmd’s sync command really isn’t what you want here, since there is no need to re-examine all the files that have already been archived. You only want to copy over the new ones added since the last update.

So I put together a little wrapper for s3cmd and find that checks for new files and uses s3cmd’s put command to transfer them over to S3. I have the bucket set up to automatically archive everything older than 5 days into Glacier for long-term storage. You can grab the copy2s3 script from GitHub and drop it anywhere in your $PATH.
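For example, something like this will put it on your path (the URL here is just a placeholder for wherever the script lives on GitHub, and ~/bin is assumed to already be on your $PATH):

# URL is a placeholder -- substitute the actual copy2s3 location on GitHub
curl -o ~/bin/copy2s3 https://raw.githubusercontent.com/<user>/<repo>/master/copy2s3
chmod +x ~/bin/copy2s3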

Make sure you have s3cmd set up for your S3 account, e.g. by running s3cmd --configure. I recommend creating individual user credentials for the buckets you will be writing to. This prevents any accidental or intentional access to the rest of your S3 account. Create a bucket for this account to write to, and then restrict the user account’s permissions so it can only access the new bucket, like this:

{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::bucket-name/*"
    }
  ]
}
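You can paste that policy into the user’s inline policy in the IAM console. If you’d rather script it and have the AWS CLI installed, something along these lines should attach it (the user and policy names here are just examples):

aws iam put-user-policy --user-name backup-user \
    --policy-name s3-backup-only \
    --policy-document file://backup-policy.json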

If you want the contents of the bucket to be transferred to Glacier, you can do that with the bucket’s Properties->Lifecycle settings. Click ‘Add Rule’, select ‘Apply to Entire Bucket’, check ‘Move to Glacier’, and enter the number of days to wait before moving objects to Glacier. Note that once an object has been moved you can no longer access it immediately; you need to make a Glacier restore request to retrieve it. Also, archived objects do not show up under the Glacier service in your AWS console. They remain in their S3 bucket, but their ‘Storage Class’ changes to Glacier.
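If you’d rather set the lifecycle rule from a script than click through the console, a rough equivalent using the AWS CLI’s s3api commands looks like this (the bucket name and rule ID are placeholders; the 5 days matches the setup described earlier):

cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "archive-to-glacier",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{"Days": 5, "StorageClass": "GLACIER"}]
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration --bucket bucket-name \
    --lifecycle-configuration file://lifecycle.json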

Now edit the copy2s3 script and change the BUCKET variable to match the one you created above. I have mine set up to automatically fill in the host, user, and year because I am using the same script on several different machines and accounts.
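As a rough sketch, that kind of BUCKET setting might look like this (the bucket name is a placeholder, and the host/user/year layout is just how I happen to organize mine):

# Bucket name is a placeholder; host/user/year is one way to lay out the keys
BUCKET="s3://my-backup-bucket/$(hostname -s)/$USER/$(date +%Y)/"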

Now you can add the script to cron (run crontab -e). Just pass it a list of the directories you want it to watch for changes, like this:

MAILTO=<your email address>
42 * * * * /home/user/bin/copy2s3 /home/user/Pictures/ /home/user/Movies/

You probably shouldn’t use this script on directories whose file contents are constantly changing, since those files would be uploaded again on every pass and could result in increased S3 storage charges.

How it Works

The script first checks to make sure another copy isn’t already running by looking for a lock file. If the lock file is older than 4 hours it prints a warning, which will get emailed to you if you have MAILTO=<your email address> set in the crontab entry. If the lock file is less than 4 hours old it exits silently, and if there is no lock file it creates one and continues.
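A minimal sketch of that check, assuming a lock file in the home directory (this is not the actual copy2s3 source):

LOCKFILE="$HOME/.copy2s3_lock"
if [ -e "$LOCKFILE" ]; then
    # find prints the lock file only if it is more than 240 minutes (4 hours) old
    if [ -n "$(find "$LOCKFILE" -mmin +240)" ]; then
        echo "Warning: $LOCKFILE is over 4 hours old; a previous run may be stuck" >&2
    fi
    exit 0    # another copy appears to be running, so bail out
fi
touch "$LOCKFILE"
trap 'rm -f "$LOCKFILE"' EXIT    # clean up the lock when this run finishes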

It then uses find’s -newerct test to find any files that are newer than the timestamp in the ~/.copy2s3_last file. After all of the updated files are found, it iterates through them and uses s3cmd put to transfer them to S3. If there is an error it stops the transfer, prints the error, and exits without updating the last timestamp, so the same files will be picked up on the next pass.
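The core of that pass looks roughly like this in bash (again a sketch rather than the real source; the S3 key layout and timestamp handling are assumptions):

# Rough sketch of the update pass, using BUCKET from above
LAST_FILE="$HOME/.copy2s3_last"
LAST_RUN=$(cat "$LAST_FILE")
STARTED=$(date)

for dir in "$@"; do
    # -newerct matches files whose change time is newer than the saved timestamp
    find "$dir" -type f -newerct "$LAST_RUN" -print0 |
    while IFS= read -r -d '' f; do
        # strip the leading / so the S3 key mirrors the local path under $BUCKET
        s3cmd put "$f" "$BUCKET${f#/}" || { echo "s3cmd put failed on $f" >&2; exit 1; }
    done || exit 1    # propagate a failure out of the pipeline subshell
done

# only record the new timestamp if every transfer succeeded
echo "$STARTED" > "$LAST_FILE"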

If you want to copy over some older files, you can seed ~/.copy2s3_last with an earlier date. For example:

Sun Jun 23 15:27:01 2013

The first run of the script will then pick up anything created after that date and transfer it to S3.
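To seed the file from the shell with that date, something like this will do it:

echo "Sun Jun 23 15:27:01 2013" > ~/.copy2s3_last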