What is the fastest way to copy 400 GB of files from an EC2 Elastic Block Store volume to S3?

Server Fault, asked by aseba on February 4, 2021

I have to copy 400 GB of files from an Elastic Block Store volume to an S3 bucket. They are about 300k files of ~1 MB each.

I've tried s3cmd and s3fuse; both of them are really, really slow. s3cmd ran for a complete day, said it had finished copying, and when I checked the bucket, nothing had happened (I suppose something went wrong, but at least s3cmd never complained about anything).

S3Fuse has been working for another complete day and has copied less than 10% of the files…

Is there a better solution for this?

I'm running Linux (Ubuntu 12.04), of course.

8 Answers

Another good option is peak/s5cmd:

For uploads, s5cmd is 32x faster than s3cmd and 12x faster than aws-cli. For downloads, s5cmd can saturate a 40Gbps link (~4.3 GB/s), whereas s3cmd and aws-cli can only reach 85 MB/s and 375 MB/s respectively.
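A rough sketch of how that could look for this case (the mount point and bucket name are placeholders, and the wildcard copy plus the --numworkers flag are as I understand them from the s5cmd README):

$ s5cmd --numworkers 64 cp '/mnt/ebs-volume/*' s3://my-bucket/backup/

s5cmd picks up credentials from the usual AWS environment variables or ~/.aws/credentials.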

Answered by Shane Brinkman-Davis on February 4, 2021

Tune the AWS CLI S3 configuration values as per http://docs.aws.amazon.com/cli/latest/topic/s3-config.html.

The settings below increased an S3 sync speed by at least 8x!

Example:

$ more ~/.aws/config
[default]
aws_access_key_id=foo
aws_secret_access_key=bar
s3 =
   max_concurrent_requests = 100
   max_queue_size = 30000
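With those settings in place, a single sync command fans the transfer out over many concurrent requests (the source path and bucket below are placeholders):

$ aws s3 sync /mnt/ebs-volume s3://my-bucket/backup/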

Answered by Fletcher on February 4, 2021

Try using s3-cli instead of s3cmd. I used it instead of s3cmd to upload files to my S3 bucket, and it made my deployment almost 17 minutes faster (from 21 minutes down to 4)!

Here's the link: https://github.com/andrewrk/node-s3-cli
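For reference, a hedged sketch of the invocation I would expect from that repo's README (the sync subcommand and the paths/bucket below are assumptions, not taken from the answer):

$ s3-cli sync /mnt/ebs-volume s3://my-bucket/backup/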

Answered by Yahya on February 4, 2021

Try s4cmd instead; it's much faster than s3cmd. Its address: https://github.com/bloomreach/s4cmd
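A hedged example of what a recursive upload might look like (the paths and bucket are placeholders, and the -r/--recursive flag is as I understand it from the s4cmd README):

$ s4cmd put -r /mnt/ebs-volume/ s3://my-bucket/backup/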

Answered by mcsrainbow on February 4, 2021

There is also s3funnel, which seems very old (2008) and has some open bugs, but it is still listed by Amazon itself: amzn-lnk

Answered by math on February 4, 2021

I wrote an optimized console application in C# (CopyFasterToS3) to do this. I used it on an EBS volume; in my case it had 5 folders with more than 2 million files, amounting to about 20 GB. The application ran in less than 30 minutes.

In this article I showed how to use a recursive function with parallelism. You can transcribe it to another language.

Good luck!

Answered by André Agostinho on February 4, 2021

There are several key factors that determine throughput from EC2 to S3:

  • File size - smaller files require a larger number of requests, incur more overhead, and transfer more slowly. The gain from file size (when originating from EC2) is negligible for files larger than 256kB. (Whereas, transferring from a remote location with higher latency tends to keep showing appreciable improvements until between 1MiB and 2MiB.)
  • Number of parallel threads - a single upload thread usually has fairly low throughput - often below 5MiB/s. Throughput increases with the number of concurrent threads and tends to peak between 64 and 128 threads. It should be noted that larger instances are able to handle a greater number of concurrent threads.
  • Instance size - As per the instance specifications, larger instances have more dedicated resources, including a larger (and less variable) allocation of network bandwidth (and I/O in general, including reading from ephemeral and EBS disks, the latter being network attached). Typical values for each category are:
    • Very High: Theoretical: 10Gbps = 1250MB/s; Realistic: 8.8Gbps = 1100MB/s
    • High: Theoretical: 1Gbps = 125MB/s; Realistic: 750Mbps = 95MB/s
    • Moderate: Theoretical: 250Mbps; Realistic: 80Mbps = 10MB/s
    • Low: Theoretical: 100Mbps; Realistic: 10-15Mbps = 1-2MB/s

In cases of transferring large amounts of data, it may be economically practical to use a cluster compute instance, as the effective gain in throughput (>10x) is more than the difference in cost (2-3x).

While the above ideas are fairly logical (although the per-thread cap may not be), it is quite easy to find benchmarks backing them up. One particularly detailed one can be found here.

Using between 64 and 128 parallel (simultaneous) uploads of 1MB objects should saturate the 1Gbps uplink that an m1.xlarge has and should even saturate the 10Gbps uplink of a cluster compute (cc1.4xlarge) instance.
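As a rough sanity check against the original 400GB: at the ~95MB/s realistic throughput of a 1Gbps ("High") instance, 400,000MB / 95MB/s ≈ 4,200s, a little over an hour, versus roughly a day at a single thread's ~5MB/s.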

While it is fairly easy to change instance size, the other two factors may be harder to manage.

  • File size is usually fixed - we cannot join files together on EC2 and have them split apart on S3 (so there isn't much we can do about small files). Large files, however, we can split apart on the EC2 side and reassemble on the S3 side (using S3's multi-part upload). Typically, this is advantageous for files that are larger than 100MB.
  • Parallel threads are a bit harder to cater to. The simplest approach comes down to writing a wrapper for some existing upload script that will run multiple copies of it at once (see the sketch after this list). Better approaches use the API directly to accomplish something similar. Keeping in mind that the key is parallel requests, it is not difficult to locate several potential scripts, for example:
    • s3cmd-modification - a fork of an early version of s3cmd that added this functionality, but hasn't been updated in several years.
    • s3-parallel-put - a reasonably recent Python script that works well
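As a minimal sketch of the wrapper approach mentioned above (not any of the scripts listed; the mount point and bucket are placeholders, and it assumes the AWS CLI is installed and configured):

$ cd /mnt/ebs-volume
$ find . -type f -printf '%P\0' | xargs -0 -P 64 -I{} aws s3 cp "{}" "s3://my-bucket/backup/{}"

Each file gets its own aws s3 cp process and xargs -P keeps 64 of them in flight; spawning a process per file is wasteful, but it illustrates the parallel-requests idea.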

Answered by cyberx86 on February 4, 2021

So, after a lot of testing, s3-parallel-put did the trick awesomely. It is clearly the solution if you need to upload a lot of files to S3. Thanks to cyberx86 for the comments.
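A rough sketch of how s3-parallel-put is typically invoked, per its README (the bucket, prefix, and process count below are placeholders, and credentials come from the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables):

$ s3-parallel-put --bucket=my-bucket --prefix=backup --processes=64 /mnt/ebs-volume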

Answered by aseba on February 4, 2021
