Automatic Backup of Files to S3 and Glacier
Automatic backups are important, especially when it comes to irreplaceable data like family photos. I have used s3cmd to maintain my website’s static files for a while now, and it was simple to use it to push my 100GB+ archive of photos over to S3. But I needed an automated way to update it with any new photos that my wife or I may take. A full sync really isn’t what you want here; there is no need to re-examine all the files that have already been archived. You only want to copy over the files added since the last update.
So I put together a little wrapper for s3cmd and find that checks for new files and uses s3cmd’s put command to transfer them over to S3. I have the bucket set up to automatically archive everything older than 5 days into Glacier for long-term storage. You can grab the copy2s3 script from GitHub and drop it anywhere in your $PATH.
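Installation is just a matter of making the script executable and putting it somewhere on your PATH; assuming you keep personal scripts in ~/bin, something like this works:
# Make the script executable and move it onto your PATH
# (assumes ~/bin exists and is already in $PATH)
chmod +x copy2s3
mv copy2s3 ~/bin/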
Make sure you have s3cmd set up for your S3 account, e.g. by running s3cmd --configure. I recommend creating individual user credentials for the buckets you will be writing to. This prevents any accidental or intentional access to the rest of your S3 account. Create a bucket for this account to write to, and then restrict the user account’s permissions so it can only access the new bucket, like this:
{
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::bucket-name/*"
        }
    ]
}
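If you prefer the command line over the IAM console, a policy like the one above can be attached to the backup user with the AWS CLI. This is just a sketch; the user name, policy name and file name below are made up:
# Attach the bucket-only policy to the dedicated backup user
# (example names: backup-user, copy2s3-bucket-only, copy2s3-policy.json)
aws iam put-user-policy \
    --user-name backup-user \
    --policy-name copy2s3-bucket-only \
    --policy-document file://copy2s3-policy.json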
If you want the contents of the bucket to be transferred to Glacier you can do that with the bucket’s Properties->Lifecycle settings. Click on ‘Add Rule’, select ‘Apply to Entire Bucket’, then click ‘Move to Glacier’ and enter the number of days you want to wait before objects are moved to Glacier. Note that once an object has been moved you can no longer access it immediately; you need to issue a restore request to retrieve it. Also, archived objects do not show up under the Glacier service in your AWS console. They are still listed under their S3 bucket, but the ‘Storage Class’ has been changed to Glacier.
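The same lifecycle rule can also be set up without the console. Here is a rough sketch using the AWS CLI, assuming the CLI is installed and the bucket is called backup-bucket:
# Move everything in the bucket to Glacier 5 days after upload
cat > lifecycle.json <<'EOF'
{
    "Rules": [
        {
            "ID": "archive-to-glacier",
            "Status": "Enabled",
            "Filter": { "Prefix": "" },
            "Transitions": [
                { "Days": 5, "StorageClass": "GLACIER" }
            ]
        }
    ]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
    --bucket backup-bucket \
    --lifecycle-configuration file://lifecycle.json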
Now edit the copy2s3 script and change the BUCKET variable to match the bucket you created above. I have mine set up to automatically fill in the host, user and year because I am using the same script on several different machines and accounts.
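To give an idea of what I mean, a destination that folds in the host, user and year could look something like this (the bucket name and path layout here are just an example; the real variable lives in the script):
# Example only - everything lands under s3://<bucket>/<host>/<user>/<year>/
BUCKET="s3://my-backup-bucket/$(hostname -s)/$USER/$(date +%Y)"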
Now you can add the script to cron (run crontab -e). Just pass it a list of the directories you want it to watch for changes, like this:
MAILTO=<your email address>
42 * * * * /home/user/bin/copy2s3 /home/user/Pictures/ /home/user/Movies/
You probably shouldn’t use this script on directories where file contents are constantly changing; that could result in increased S3 storage charges.
How it Works
The program first checks to make sure another copy isn’t running. It does this by looking for a lock file. If the lock file is older than 4 hours it prints a warning, which will get emailed to you if you have MAILTO=<your email address> set in the crontab entry. If the lock file has been there for less than 4 hours the script exits silently, and if there is no lock file it creates one.
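A rough sketch of that locking logic (not the script verbatim) looks like this:
# Sketch of the lock handling described above
LOCKFILE="$HOME/.copy2s3_lock"
if [ -e "$LOCKFILE" ]; then
    # Warn only if the existing lock is more than 4 hours (240 minutes) old;
    # the warning lands in the cron email if MAILTO is set
    if [ -n "$(find "$LOCKFILE" -mmin +240)" ]; then
        echo "Warning: copy2s3 lock file is over 4 hours old" >&2
    fi
    exit 0
fi
touch "$LOCKFILE"
# Remove the lock when the script finishes
trap 'rm -f "$LOCKFILE"' EXIT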
It then uses find’s -newerct test to find any files that are newer than the timestamp in the ~/.copy2s3_last file. After all of the updated files are found it iterates through them and uses s3cmd put to transfer them to S3. If there is an error it will stop the transfer, print the error, and exit without updating the last timestamp, so that the same files will be picked up on the next pass.
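The heart of that loop might look roughly like the following bash sketch (bucket name and path layout are made up, and error handling is simplified):
# Sketch of the find + put loop, not the actual script (bash)
LAST="$HOME/.copy2s3_last"
BUCKET="s3://my-backup-bucket"

# -newerct (GNU find) selects files whose ctime is newer than the
# timestamp stored in ~/.copy2s3_last; "$@" holds the watched directories
while IFS= read -r -d '' f; do
    # Mirror the local path under the bucket; bail out on any error
    s3cmd put "$f" "$BUCKET$f" || {
        echo "Upload of $f failed; keeping the old timestamp" >&2
        exit 1
    }
done < <(find "$@" -type f -newerct "$(cat "$LAST")" -print0)

# Record the new timestamp only after everything uploaded cleanly
date > "$LAST"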
If you want to copy over some older files you can set the initial date in ~/.copy2s3_last to the date you want. For example:
Sun Jun 23 15:27:01 2013
The first run of the script will then pick up anything created after that date and transfer it to S3.
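Seeding it from the shell is just a matter of writing that date string to the file:
# Back-date the timestamp so the first run picks up older files too
echo "Sun Jun 23 15:27:01 2013" > ~/.copy2s3_last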