Lately I've been moving legacy thumbnails and user images from our application servers to AWS S3. Storage on S3 is a bargain and S3 can handle a practically unlimited number of objects. I'm keen on simplifying and removing as much maintenance of our internal systems as possible. During this work I needed to move one folder containing a bit more than 6 million files to an S3 bucket.
I did this in two steps:
- first, packing the folder into a tar archive and moving that backup onto an EC2 instance (so as not to affect production capacity); a rough sketch of this step follows below
- second, uploading all the files to S3
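The first step looks roughly like this; the paths, hostname and key file here are made-up placeholders, not the actual ones I used.
# pack the folder into a single tar archive (source path is a placeholder)
tar -cf images.tar /var/www/uploads/images
# copy the archive to the EC2 instance and unpack it there
scp -i ~/.ssh/migration.pem images.tar ec2-user@<EC2_HOST>:/data/
ssh -i ~/.ssh/migration.pem ec2-user@<EC2_HOST> 'cd /data && tar -xf images.tar'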
The first step was easy and pleasant. The second one was the hard part. I started with s3cmd (a tool I've used in the past to upload and download data from S3), but ran into memory issues since s3cmd allocated a lot of memory. I moved on to s4cmd and hit the same thing; moving this amount of files turned out to be an interesting problem.
After some googling, and some more googling, I finally found a script called s3-parallel-put. The Python script is quite easy to get going with; here's my final configuration.
export AWS_ACCESS_KEY_ID=<KEY_ID>
export AWS_SECRET_ACCESS_KEY=<KEY>
s3-parallel-put \
--bucket=images.example.com \
--host=s3.amazonaws.com \
--put=stupid \
--insecure \
--processes=30 \
--content-type=image/jpeg \
--quiet \
.
Success!
There’s a couple of win’s here
--put=stupid
which always uploads the data and doesn’t check if object key is already set (one leasHEAD
request for each key)--insecure
none ssl is faster--processes=30
a lot of parallel uploads
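A quick sanity check that the object count looks right is worth doing before moving on. Assuming the AWS CLI is installed and configured (it wasn't part of my setup above), something like this works:
# print total object count and size for the bucket
aws s3 ls s3://images.example.com --recursive --summarize | tail -n 2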
Going back to the final solution, I ended up putting all user images in an S3 CNAME bucket and adding Cloudflare caching in front of it. An elegant and cheap solution.
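For reference, the CNAME part is just a DNS record pointing the hostname at S3 (virtual-hosted style addressing requires the bucket to be named exactly like the hostname). A sketch of the record, managed in Cloudflare with proxying enabled so responses get cached at the edge:
images.example.com.  CNAME  images.example.com.s3.amazonaws.com.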