Base64 Encoded 128-bit MD5 Digest
This is an old blog post that I copied to test a few things. Content is still relevant.
While reading the Amazon S3 API documentation to find out how Amazon S3 checks the integrity of objects, that is, how it verifies that the stored data is the same data that was originally sent, you might come across the statement "The base64-encoded 128-bit MD5 digest of the message". It may not make much sense at first, or you may wonder how it differs from the MD5 hash you can calculate with the standard md5 command.
It is pretty easy and relatively fast to calculate md5sum for a file.
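From the shell this is a single call to md5 (or md5sum). As a rough Python equivalent, here is a minimal sketch using the standard hashlib module; the helper name md5_hex and the sample file are assumptions made for this illustration.

```python
import hashlib
import os
import tempfile


def md5_hex(path, chunk_size=1024 * 1024):
    """Return the MD5 of a file as hexadecimal text (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Write a small sample file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as sample:
    sample.write(b"hello")
    sample_path = sample.name

sample_md5 = md5_hex(sample_path)
os.remove(sample_path)

print(sample_md5)  # 5d41402abc4b2a76b9719d911017c592
```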
What is MD5?
The MD5 message-digest algorithm is a widely used cryptographic hash function producing a 128-bit (16-byte) hash value, typically expressed in text format as a 32 digit hexadecimal number. Regardless of the file size, an md5 hash is always 128 bits.
The phrase "128-bit MD5 digest" in the Amazon S3 documentation simply means that the MD5 digest is 128 bits long, as defined in RFC 1321.
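The fixed length is easy to check. The sketch here hashes inputs of very different sizes with Python's hashlib; the sample inputs are arbitrary choices for this illustration.

```python
import hashlib

# Inputs of very different sizes: empty, 5 bytes, and one million bytes.
inputs = [b"", b"hello", b"x" * 1_000_000]

digest_lengths = [len(hashlib.md5(data).digest()) for data in inputs]
hex_lengths = [len(hashlib.md5(data).hexdigest()) for data in inputs]

print(digest_lengths)  # [16, 16, 16] -> 16 bytes = 128 bits
print(hex_lengths)     # [32, 32, 32] -> 32 hexadecimal digits
```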
What is Hexadecimal?
In mathematics and computing, hexadecimal (also base 16, or hex) is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F (or alternatively a, b, c, d, e, f) to represent values ten to fifteen.
Examples for Hexadecimal (base 16) to decimal (base 10) conversion
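A few conversions can be checked with Python's built-in int(), shown here as a small illustrative sketch; the sample values are chosen for this example.

```python
# Each hexadecimal digit is worth 16 times the digit to its right.
examples = ["A", "F", "10", "1F", "FF"]

conversions = {text: int(text, 16) for text in examples}
for text, value in conversions.items():
    print(f"{text:>2} (hex) = {value:>3} (decimal)")

# The same value worked out by hand: FF = 15 * 16 + 15
assert conversions["FF"] == 15 * 16 + 15 == 255
```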
MD5 Hash and Hexadecimal
An MD5 hash has 128 bits, which is 16 bytes. The biggest decimal value that 1 byte (8 bits) can hold is 255, and hexadecimal FF represents 255 as well (16 x 15 + 15).
We need a 2-digit hexadecimal number (FF) to represent the maximum value of a byte, which is 1111 1111 in binary. To represent 16 bytes (128 bits), we therefore need a 32-digit hexadecimal number. If you check the output of the md5 command, you will see it is exactly 32 digits long.
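The same arithmetic can be sketched in Python, using the digest of the sample bytes "hello" (an arbitrary input chosen for this illustration):

```python
import hashlib

# The largest value of one byte: eight 1 bits, 255 in decimal, ff in hex.
max_byte = 0b11111111
print(max_byte, format(max_byte, "02x"), format(max_byte, "08b"))  # 255 ff 11111111

# A 16-byte MD5 digest therefore needs 16 * 2 = 32 hexadecimal digits.
raw_digest = hashlib.md5(b"hello").digest()
hex_digest = raw_digest.hex()
print(len(raw_digest), len(hex_digest))  # 16 32
print(hex_digest)  # 5d41402abc4b2a76b9719d911017c592
```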
Base64 Encoding
Base64 encoding is used to represent binary data as an ASCII string, and it is commonly used in HTTP requests and headers. See the Wikipedia article on Base64 for the details of how groups of 6 bits are converted to individual characters and how padding is applied when the number of bytes is not divisible by three.
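To make the 6-bit grouping and the padding concrete, here is a short sketch using Python's base64 module and the classic sample input "Man"; the variable names are illustrative.

```python
import base64
import string

# The standard base64 alphabet: A-Z, a-z, 0-9, then "+" and "/".
alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

# Three bytes are 24 bits, which split evenly into four 6-bit groups.
bits = "".join(format(byte, "08b") for byte in b"Man")
groups = [int(bits[i:i + 6], 2) for i in range(0, len(bits), 6)]
manual = "".join(alphabet[group] for group in groups)
print(groups, manual)  # [19, 22, 5, 46] TWFu

# When the input length is not divisible by three, "=" pads the output.
print(base64.b64encode(b"Man").decode("ascii"))  # TWFu
print(base64.b64encode(b"Ma").decode("ascii"))   # TWE=
print(base64.b64encode(b"M").decode("ascii"))    # TQ==

# 16 bytes (the size of an MD5 digest) always encode to 24 characters
# ending in "==".
encoded_length = len(base64.b64encode(bytes(16)))
print(encoded_length)  # 24
```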
Calculating Base64-Encoded MD5
The md5 command outputs 32 hexadecimal digits as ASCII text, and the base64 command calculates the base64 encoding of whatever input it is given.
Check the command below and try to figure out why it is not going to produce what we are really looking for.
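As an illustrative stand-in for piping the output of md5 straight into base64, this Python sketch encodes the digest exactly as md5 prints it. It is an assumed reconstruction of the idea, not the exact shell command, whose output would typically also include a trailing newline.

```python
import base64
import hashlib

# What md5 prints: the digest as hexadecimal text.
hex_text = hashlib.md5(b"hello").hexdigest()

# Encode that printed text as it is.
wrong_value = base64.b64encode(hex_text.encode("ascii")).decode("ascii")

print(len(hex_text))     # 32
print(len(wrong_value))  # 44
```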
Here is why. We need the raw 16-byte value of the MD5 digest, but the md5 command gives us ASCII text that represents this value in hexadecimal. Since base64 is meant to represent binary data as ASCII, we have to feed it the binary value of the MD5 result. To get the right base64-encoded MD5 hash, have openssl output the digest in binary form and pipe that into base64.
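In Python the same idea looks like this. It is a minimal sketch; the helper name content_md5 is an assumption made for this illustration.

```python
import base64
import hashlib


def content_md5(data: bytes) -> str:
    """Base64-encode the raw 16-byte MD5 digest (hypothetical helper name)."""
    raw_digest = hashlib.md5(data).digest()  # 16 raw bytes, not hex text
    return base64.b64encode(raw_digest).decode("ascii")


print(content_md5(b"hello"))  # XUFAKrxLKna5cZ2REBfFkg==
```

The result is always 24 characters long, because 16 bytes encode to 22 base64 characters plus two "=" of padding.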
What if you already have md5sums calculated for a bunch of files and don’t want to recalculate them, but just convert them to base64?
The xxd command can convert a hexadecimal string back to its binary value, and you can then base64-encode that binary value.
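The same conversion can be sketched in Python with bytes.fromhex, which plays the role of xxd here; the helper name hex_md5_to_base64 is an assumption made for this illustration.

```python
import base64


def hex_md5_to_base64(hex_digest: str) -> str:
    """Convert an existing hexadecimal md5sum to base64 (hypothetical helper)."""
    raw_digest = bytes.fromhex(hex_digest.strip())  # hex text -> raw bytes
    if len(raw_digest) != 16:
        raise ValueError("an MD5 digest must be exactly 16 bytes")
    return base64.b64encode(raw_digest).decode("ascii")


# The md5sum of the bytes "hello", as printed by md5 or md5sum.
print(hex_md5_to_base64("5d41402abc4b2a76b9719d911017c592"))  # XUFAKrxLKna5cZ2REBfFkg==
```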
As a way to verify the output, check that the resulting base64-encoded text matches the one produced using openssl.
Conclusion
I hope "Base64 Encoded 128-bit MD5 Digest" is a clear statement now.
It is important to note that the AWS SDK for JavaScript has an S3 ManagedUpload API (similar to the Java TransferManager or the Go s3manager) which calculates the base64-encoded MD5 digest and passes it in content-md5 automatically for you if you set the computeChecksums option to true. If you choose to use the PUT API, you will have to calculate the value yourself before passing it as an option to the API call.
If you want to keep the MD5 hash value of the file in Amazon S3 along with the object, you can store it in the user metadata of the object. If you choose to do so, you don’t have to calculate the base64 value; you can pass the MD5 hash as it is if you like. Since you will be the one interpreting the user metadata, it is up to you to decide in which format or encoding you want to store it.
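As a closing sketch, here is how both values might be prepared for a PUT from Python. The parameter names follow boto3's put_object (an assumption, since this post otherwise talks about the JavaScript SDK); the helper name, bucket, key, and the metadata key "md5" are hypothetical.

```python
import base64
import hashlib


def build_put_arguments(bucket: str, key: str, body: bytes) -> dict:
    """Build keyword arguments for an S3 PUT (hypothetical helper).

    ContentMD5 and Metadata follow the parameter names of boto3's
    put_object; the metadata key "md5" is an arbitrary choice.
    """
    digest = hashlib.md5(body)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # Sent as the Content-MD5 header: base64 of the raw digest.
        "ContentMD5": base64.b64encode(digest.digest()).decode("ascii"),
        # User metadata is free-form, so plain hex is fine here.
        "Metadata": {"md5": digest.hexdigest()},
    }


arguments = build_put_arguments("example-bucket", "hello.txt", b"hello")
print(arguments["ContentMD5"])       # XUFAKrxLKna5cZ2REBfFkg==
print(arguments["Metadata"]["md5"])  # 5d41402abc4b2a76b9719d911017c592

# With boto3 installed and credentials configured, these could be passed as:
#   boto3.client("s3").put_object(**arguments)
```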