Moving Petabytes to Amazon S3, with AWS Snowball
The Challenge
Moving petabytes is not a process companies go through often. With AWS, files not only have to be copied, but folder structures also have to be converted into keys. This affects several aspects of how Amazon S3 is used, as keys are the only identifiers for objects.
This raises several questions for customers, some of which are:
- How fast can I move my data to Amazon S3?
- How can I be sure my files will reach Amazon S3 intact?
- How can I structure my keys so I will have enough performance?
- Can I (or should I) make Amazon S3 keys mimic my current file system?
- How can I design to reduce storage costs without impacting the business?
- If I ingest files on-prem, how can I continuously synchronize my files to Amazon S3?
- Should I encrypt data at rest?
- I have both small and large files; how can I serve both types with similar performance?
- Should I create multiple buckets or just one?
In this article, I will focus on copying files from on-prem to Amazon S3 using AWS Snowball and on best practices for the different S3 storage classes. I will mention the limitations that come with each choice, and I hope this will help you design your object storage strategy based on the requirements at hand.
I will overlap with the official Amazon S3 documentation at times, but I assume the reader already has a good understanding of Amazon S3 or is willing to go through the official documentation whenever a concept I am writing about is not clear.
Amazon Snowball
The Snowball appliance has 43 TB of capacity after being formatted. On our web site we state it can store “up to 50 TB”, so this might create some confusion. On the other hand, Snowball compresses files with lossless LZ compression, and in most cases it will store 50 TB or more, depending on how well the files compress. LZ being lossless is important as it guarantees that the data won’t change even a bit after compression.
The Snowball client can connect to only one Snowball appliance. However, you can have multiple instances of the Snowball client connected to a single Snowball appliance.
If you need to access multiple Snowball appliances from a single server, different Unix users can be used.
Running multiple client instances is possible as long as the on-premise storage is not the bottleneck and the customer wants to copy files to the Snowball appliance as quickly as possible. During a customer POC, we were able to fill a Snowball appliance in 4 days using a single client. With shipment and AWS processing time, it took 8 days to copy 50 TB to Amazon S3. This means that with a single Snowball appliance always present on-prem, 1 PB would need 80 days (weekends not considered). If this were done over the internet, it would mean pushing data at 1.21362 Gbits/sec non-stop for 80 days. If on-premise storage limitations permit the use of more Snowball appliances and/or Snowball client instances, it would be possible to multiply the performance. Consider that on-premise storage will usually carry production load during business hours and might become a bottleneck before the network does.
1.21362 Gbits/sec = (1 PB × 1024 TB/PB × 1024 GB/TB × 8 Gbits/GB) / (80 days × 24 hours × 60 minutes × 60 seconds)
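If you want to redo this calculation for your own data set, a quick back-of-the-envelope script (plain Python; the 1 PB and 80 days figures are just the example above) looks like this:

```python
# Sustained network throughput needed to move a data set within a deadline.
data_gbits = 1 * 1024 * 1024 * 8   # 1 PB -> 1024 TB -> GB -> Gbits
seconds = 80 * 24 * 60 * 60        # 80 days in seconds
print(round(data_gbits / seconds, 5), "Gbit/s, non-stop")  # ~1.2136 Gbit/s
```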
How can I be sure files I copied to Snowball are not corrupted when they show up on S3?
The Snowball client calculates an MD5 checksum of each file and stores it along with the file on the Snowball appliance. Files larger than 32 MB are copied to the appliance in chunks. Each chunk is 1 MB and is encrypted individually.
When the Snowball reaches the AWS datacenter, the chunks for each file are decrypted to reconstruct the file. If there is a problem with one of the chunks, the file can’t be restored from the Snowball and the failure will be reported in the “ingestion report”. Once the file is reconstructed, it is also checked against the MD5 checksum calculated by the Snowball client. If this check fails, the file is not copied to S3 and the problem is reported as well. While the file is being uploaded to S3, the MD5 checksum is passed in the PUT request to let S3 acknowledge file integrity.
Using the mechanism described above, it is safe to assume that if a file shows up in S3, it is the original file copied from the on-prem storage.
Amazon S3
Amazon S3 is a simple key-value store designed to store an unlimited number of objects. The key is the name, the value is the data. The way you find an object in S3 is no different than in other key-value stores. If you need to run complex queries on metadata, it is best to maintain an external database to store that metadata.
Amazon S3 Explorer gives the sense of a folder structure, while all it does is filter results by / and provide links to reach the next / in the key, or, if there is no further delimiter, a link to the object itself.
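A minimal sketch of that delimiter-based listing with boto3 (the bucket and prefix names are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Listing with Delimiter="/" groups keys into "folders" (CommonPrefixes)
# the same way the console does; there are no real folders in S3.
resp = s3.list_objects_v2(
    Bucket="allpictures",   # placeholder bucket name
    Prefix="cats/",         # placeholder "folder"
    Delimiter="/",
)
for prefix in resp.get("CommonPrefixes", []):
    print("folder:", prefix["Prefix"])
for obj in resp.get("Contents", []):
    print("object:", obj["Key"])
```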
It is helpful for the customer to understand the key-value store concept very well and, whenever possible, design applications to access S3 through the S3 API instead of an appliance in front of S3 which presents objects as an NFS mount. For human interaction such an appliance is convenient, but for applications the preference should be the native REST API.
Once an object is written to Amazon S3 it cannot be updated in place. It can be overwritten by sending a different value or metadata for the same key.
Amazon S3 keeps objects in partitions in order to scale horizontally. Objects are identified with unique keys. Keys are also used to decide on which partition an object is stored.
Partitions hold a group of objects which have sequential keys when ordered alphabetically. Partitions have a limit on the number of requests they can serve. If the number of requests will be constantly higher than 100 PUT/LIST/DELETE requests per second or 300 GET requests per second, keys should be designed to be in a random alphabetical order. This can be accomplished by using a hash as a prefix. More on how to choose a hash can be found here.
Object Keys and Partition Keys
Bucket names are evaluated as part of the partition key when Amazon S3 decides on which partition to store the object.
A key cats/image1.jpg under the allpictures bucket would form the partition key below.
allpictures/cats/image1.jpg
A key pictures/cats/image1.jpg under the all bucket would form the partition key below.
all/pictures/cats/image1.jpg
Both objects in the examples above might be stored in the same partition, if all is a partition key.
Partitions are not tied to AWS accounts or buckets but live in a single region.
Partitions and Partition Keys Example
Partition keys are internal structures and there is no way to query them externally. One cannot know which partitions exist by querying the Amazon S3 API.
As an example, assume Amazon S3 has the partition keys all and allpictures. The partition key (bucket/object key) allpictures/cats/image1.jpg would be stored in the allpictures partition, while all/pictures/cats/image1.jpg would be stored in the all partition.
The most specific match is used to map a key to a partition. A partition name can be up to 255 characters long. Since partitions are internally managed, their length isn’t really something customers need to be concerned with.
Hash Location in Key
The best practice for hash length and position in the key is to make it 3-4 characters long, placed 7 to 10 characters from the left. On the other hand, this won’t guarantee that objects will be distributed evenly across partitions, as it is likely there isn’t a partition yet which overlaps with the hash defined.
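One way to build such keys, sketched in Python, is to keep a short fixed prefix and insert the hash right after it; the images/ prefix and the MD5-based hash here are illustrative assumptions, not a prescribed scheme:

```python
import hashlib

def hashed_key(original_path: str, prefix: str = "images/") -> str:
    """Insert a 4-character hash so the random part sits 7-10 characters
    into the key (right after the 7-character prefix)."""
    digest = hashlib.md5(original_path.encode("utf-8")).hexdigest()[:4]
    return f"{prefix}{digest}/{original_path}"

# cats/image1.jpg -> images/xxxx/cats/image1.jpg (hash characters will differ)
print(hashed_key("cats/image1.jpg"))
```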
That said, S3 will eventually partition the keys when the number of requests for a partition gradually increases over time. If the number of requests increases rapidly, say 40% over a few weeks, this won’t give S3 enough time to automatically adjust partitions, so a ticket has to be created with support. More information is here.
Bursts in requests, up to 300 PUT/LIST/DELETE requests per second or up to 800 GET requests per second, can be handled for a brief amount of time. If the requests per second remain high for long enough, S3 throttles the requests, returns 503 for the requests over the threshold, and creates partitions for the given workload. There is no SLA on how quickly a partition will be created. A well-designed application that uses Amazon S3 has to handle 503s and implement retry logic.
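A sketch of what that retry logic could look like with boto3 (exponential backoff on throttled requests; bucket and key names are placeholders):

```python
import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def put_with_retry(bucket, key, body, max_attempts=5):
    """Retry PUTs that are throttled (HTTP 503 / SlowDown) with
    exponential backoff while S3 adjusts partitions behind the scenes."""
    for attempt in range(max_attempts):
        try:
            return s3.put_object(Bucket=bucket, Key=key, Body=body)
        except ClientError as err:
            code = err.response["Error"]["Code"]
            status = err.response["ResponseMetadata"]["HTTPStatusCode"]
            if code == "SlowDown" or status == 503:
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
                continue
            raise
    raise RuntimeError(f"still throttled after {max_attempts} attempts")
```

Note that the AWS SDKs already retry throttled requests a few times by default; the sketch only makes the backoff explicit.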
Hot Spots
In order to find hot spots, S3 server access logs can be enabled. Logs will be written to another S3 bucket and have to be evaluated. Using S3 events, this data can be written to CloudSearch for further processing. The log format can be found here.
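Enabling server access logging can also be done through the API; a minimal boto3 sketch (bucket names are placeholders, and the target bucket needs to grant write access to the S3 log delivery group):

```python
import boto3

s3 = boto3.client("s3")

# Write access logs for "allpictures" into "allpictures-logs" under logs/.
s3.put_bucket_logging(
    Bucket="allpictures",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "allpictures-logs",
            "TargetPrefix": "logs/",
        }
    },
)
```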
Server Side Optimization
Check the TCP Window Scaling and TCP Selective Acknowledgement sections in the Amazon S3 documentation.
Metadata
Objects in S3 have System-Defined metadata and User-Defined metadata.
User-Defined metadata can be at most 2 KB per object and is made of key-value pairs. As an example, the MD5 checksum of a file can be stored in user-defined metadata.
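As a sketch, storing an MD5 checksum as user-defined metadata with boto3 could look like this (the bucket, key, local file, and the md5chksum metadata key are illustrative choices):

```python
import hashlib
import boto3

s3 = boto3.client("s3")

data = open("image1.jpg", "rb").read()        # placeholder local file
md5_hex = hashlib.md5(data).hexdigest()

# User-defined metadata is surfaced on the object as x-amz-meta-md5chksum.
s3.put_object(
    Bucket="allpictures",
    Key="cats/image1.jpg",
    Body=data,
    Metadata={"md5chksum": md5_hex},
)
```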
Amazon only charges for User-Defined metadata, not for System-Defined metadata. The same is true for bandwidth. If the object size is N, the Amazon (system-defined) metadata is AM, and the user-defined metadata is UM, the total size of the object is N + AM + UM. The customer pays for N + UM for bandwidth and storage.
MD5 Checksums
Objects in Amazon S3 have a metadata key named ETag. The value of the ETag matches the MD5 checksum of the file only if:
- objects are created without Multipart Upload or Part Copy operation
- objects are encrypted with SSE-S3 or not encrypted
Even though in many cases Amazon S3 doesn’t store the MD5 checksum of the file, it is advised to pass the MD5 checksum with PUT requests in the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, fails the PUT request and returns an error.
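With boto3 this maps to the ContentMD5 parameter, which expects the base64-encoded binary digest; a sketch with placeholder names:

```python
import base64
import hashlib
import boto3

s3 = boto3.client("s3")

data = open("image1.jpg", "rb").read()                       # placeholder file
content_md5 = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

# S3 recomputes the MD5 on its side and rejects the PUT if it doesn't match.
s3.put_object(
    Bucket="allpictures",
    Key="cats/image1.jpg",
    Body=data,
    ContentMD5=content_md5,
)
```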
One Bucket vs Multiple Buckets
There is no limit to the number of objects that can be stored in a bucket.
There is no performance difference in using one or multiple buckets.
There can be a maximum of 100 buckets per account.
The features listed below belong to buckets and may affect the one bucket vs multiple buckets decision.
- Cross-Region Replication
- Do you want to replicate everything in the bucket to another region?
- Requires versioning enabled
- Lifecycle policies (max 1000)
- Do you have complex lifecycle policies?
- Are you able to transition to different storage classes just using age of the files?
- Logging
- All logging for all objects in the same place. Do you want to process all the logs?
- Events (prefix and suffix can be used to limit events)
- You can have only one filter for events. Is it going to be enough?
- Versioning
- Do you need versioning enabled on all objects?
- Requester Pays
- Do you need this feature?
- Bucket Policy
- Tags
- Static Web Site hosting
PS: Encryption is at the object level, so it doesn’t play a role in the one vs multiple buckets decision.
Encryption at Rest
There are 3 options for encrypting data at rest.
Server-Side Encryption with Amazon S3-Managed Encryption Keys (SSE-S3)
Amazon S3 manages the keys and the whole process is transparent to the user.
Server-Side Encryption with Customer-Provided Encryption Keys (SSE-C)
The customer manages the keys and has to pass the encryption key with each request to S3. If the customer loses the key, the data can’t be decrypted.
Server-Side Encryption with AWS KMS–Managed Keys (SSE-KMS)
The AWS Key Management Service manages the keys. The customer can use different keys for different buckets. It integrates with IAM, so the customer doesn’t have to share an encryption key between applications or take the key outside the AWS environment. This option limits the number of requests for S3 to 100 as the KMS service has lower limits, but a limit increase can be requested.
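Requesting SSE-KMS on an upload takes two extra parameters; a boto3 sketch where the bucket, key, and KMS key alias are placeholders:

```python
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="allpictures",
    Key="cats/image1.jpg",
    Body=b"example content",            # placeholder content
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-app-key",     # placeholder KMS key alias
)
```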
Lifecycle and Storage Classes
Amazon S3 has four storage classes. Details for the storage classes are here.
- STANDARD
- STANDARD_IA: Infrequent Access
- GLACIER
- REDUCED_REDUNDANCY
Lifecycle policies define which files will be moved between these storage classes and/or removed completely from S3. There can be up to 1000 lifecycle policies per bucket.
Lifecycle policies consider the age and key of the object to find matching objects. (There is no metadata support, so a lifecycle policy based on custom metadata such as a file creation date wouldn’t work.)
Lifecycle configuration on MFA-enabled buckets is not supported.
Transitions done by lifecycle configuration are one way (see the sketch after this list):
- STANDARD -> STANDARD_IA -> GLACIER
- REDUCED_REDUNDANCY -> STANDARD_IA -> GLACIER
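A lifecycle rule expressing the STANDARD -> STANDARD_IA -> GLACIER path could look like the boto3 sketch below; the bucket name, prefix, and day counts are placeholders:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="allpictures",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-cats",
                "Filter": {"Prefix": "cats/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```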
Standard Infrequent Access
- objects should be larger than 128 KB; smaller objects will be billed as 128 KB, and automatic transition is not supported for objects smaller than 128 KB
- objects have to be at least 30 days old to transition from STANDARD to STANDARD_IA
- STANDARD_IA data retrieval costs $0.01 per GB ($10 per TB)
Large Objects
It is advised to use multipart upload for objects larger than 100 MB. A single PUT can upload at most 5 GB. Objects bigger than 5 GB can only be uploaded to S3 with multipart upload. The maximum object size S3 supports is 5 TB.
Multipart upload flow
- create multipart upload, record multipart id
- send parts, each with the multipart id and a unique part number, and record the ETag value S3 returns
- if the upload of a part fails, upload it again with the same part number
- when all parts are uploaded, send the part number:ETag pair for all parts with the complete multipart upload call (sketched below)
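The flow above maps directly onto the low-level boto3 calls; the sketch below uses placeholder names, a fixed part size, and no error handling for brevity:

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "allpictures", "videos/big-file.bin"   # placeholders
part_size = 100 * 1024 * 1024                        # 100 MB parts

# 1) create the multipart upload and record its id
upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

# 2) send parts; record the ETag S3 returns for each part number
parts = []
part_number = 1
with open("big-file.bin", "rb") as f:                # placeholder local file
    while True:
        chunk = f.read(part_size)
        if not chunk:
            break
        resp = s3.upload_part(
            Bucket=bucket, Key=key, UploadId=upload_id,
            PartNumber=part_number, Body=chunk,
        )
        # 3) if a part fails, the same part number can simply be re-uploaded
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
        part_number += 1

# 4) finish by sending the part number / ETag pairs
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)
```

In practice the higher-level transfer manager (for example boto3's upload_file) performs the same steps automatically above a configurable size threshold.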
Glacier
- Data is encrypted by default
- A vault has archives and an inventory
- The inventory has archive IDs and descriptions
- You can restore a temporary copy of a file from Glacier to S3; you specify how long the copy is kept (see the sketch after this list)
- Then use the copy operation to overwrite the object as a STANDARD, STANDARD_IA, or REDUCED_REDUNDANCY object.
- There is a cost to expire (remove) files that have been in Glacier for less than 90 days
- For each file transitioned to Glacier,
- 8 KB in S3 is used to store name and metadata
- 32 KB in Glacier is used for the storage index and metadata
- 5%/30 of your Glacier storage can be restored to S3 every day with no extra charge
- that is about 3.3 TB every day if you have 2 PB in Glacier
- restores exceeding the 5%/30 daily allowance cost $0.01 per gigabyte
- 10,000 x 4 GB is the maximum archive size if a file is uploaded directly to Glacier
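Restoring a temporary copy of an object that has transitioned to GLACIER can be requested as in the boto3 sketch below; the bucket, key, and 7-day duration are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to bring a temporary copy back from Glacier for 7 days.
s3.restore_object(
    Bucket="allpictures",
    Key="cats/image1.jpg",
    RestoreRequest={"Days": 7},
)

# head_object exposes the restore status while the job is in progress.
print(s3.head_object(Bucket="allpictures", Key="cats/image1.jpg").get("Restore"))
```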
Check out Elements to Describe Lifecycle Actions