Lately I've been moving legacy thumbnails and user images from our application servers to AWS S3. Storage on S3 is a bargain, and S3 can handle a practically unlimited number of objects. I'm really into simplifying our internal systems and removing as much maintenance as possible. During this work I needed to move one folder containing a bit more than 6 million files to an S3 bucket.
I did this in a two-step rocket:

1. Back up the folder into a tar archive and move that backup onto an EC2 instance (to avoid affecting production capacity)
2. Upload all the files to S3
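Step one can be sketched roughly like this (the paths and host are hypothetical placeholders, not the ones from our setup; the archive is left uncompressed since JPEGs don't compress further):

```shell
# Hypothetical paths for illustration only
SRC=${SRC:-images}            # folder holding the millions of files
ARCHIVE=${ARCHIVE:-images.tar}

mkdir -p "$SRC" && touch "$SRC/example.jpg"   # demo data so the sketch runs as-is

# Pack the folder into an uncompressed tar archive
tar -cf "$ARCHIVE" "$SRC"

# Then ship it to the EC2 instance, e.g.:
# scp "$ARCHIVE" ec2-user@<EC2_HOST>:/data/
```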
The first step was easy and pleasant. The second one was the hard part. I started with s3cmd (a tool I've used in the past to upload and download data from S3), but ran into memory issues, as s3cmd allocated a lot of memory. I moved on to s4cmd and hit the same problem; moving this number of files turned out to be an interesting challenge.
After some googling, and some more googling, I finally found a script called s3-parallel-put. The Python script is quite easy to get going with; here's my final configuration:
```shell
export AWS_ACCESS_KEY_ID=<KEY_ID>
export AWS_SECRET_ACCESS_KEY=<KEY>

s3-parallel-put \
  --bucket=images.example.com \
  --host=s3.amazonaws.com \
  --put=stupid \
  --insecure \
  --processes=30 \
  --content-type=image/jpeg \
  --quiet \
  .
```
There are a couple of wins here:

- `--put=stupid` always uploads the data and doesn't check whether the object key already exists (one less HEAD request per key)
- `--insecure` skips SSL, which is faster
- `--processes=30` runs a lot of uploads in parallel
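The fan-out pattern those flags describe can be sketched with plain `xargs` as well. This is a dry run with demo file names, not the actual script: `echo` stands in for the uploader, and a real replacement would need a small wrapper to compute each S3 key:

```shell
# CMD defaults to 'echo' so this is a harmless dry run
CMD=${CMD:-echo}

mkdir -p demo && touch demo/a.jpg demo/b.jpg   # demo data

# -P30 runs up to 30 workers in parallel (mirroring --processes=30),
# -n1 hands each worker one file, and nothing checks whether the key
# already exists (mirroring --put=stupid)
find demo -type f -name '*.jpg' -print0 \
  | xargs -0 -n1 -P30 "$CMD" > uploaded.log

wc -l < uploaded.log   # 2 files listed
```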
Going back to the final solution: I ended up putting all the user images in an S3 CNAME bucket and added Cloudflare caching in front of it. An elegant and cheap solution.
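For the CNAME setup to work, the bucket name has to match the hostname, and DNS points that hostname at S3. A hypothetical zone fragment, reusing the example bucket name from above (with Cloudflare's proxy enabled on the record to get the caching layer):

```
; hypothetical DNS record: hostname CNAMEs to the S3 endpoint,
; Cloudflare's proxy in front of it provides the caching
images.example.com.  300  IN  CNAME  images.example.com.s3.amazonaws.com.
```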