MD5哈希的具体原理
首先我们假设我们将要对信息 $a$ 进行MD5哈希处理,我们知道每个字节有 8 个比特,我们可以把 $a$ 分成若干个 512 比特的分组,最后的分组补一个 0x80
的字符后,若该分组长度不等于 448 比特,则一直补字符 0x00
直到最后的分组为 448 比特,即若最后的分组长度大于 448 则多加一个分组然后一直补到下一个分区达到 448 比特。
之后我们再把信息 $a$ 的长度(不算后来的 0x80
以及 0x00
)以小端的方法储存在最后的分区的最后 64 比特中,届时最后的分区也会达到 512 比特的长度。之后我们会通过64轮一系列的比特运算,代换,移位等方式,最后生成一个 128 比特的哈希信息,一般我们会把最后的哈希信息转为 16 进制信息进行输出。
Python 代码实现
我们不妨先把常量表打好:
# T Table
T = [[0xD76AA478, 0xE8C7B756, 0x242070DB, 0xC1BDCEEE, 0xF57C0FAF, 0x4787C62A, 0xA8304613, 0xFD469501, 0x698098D8, 0x8B44F7AF, 0xFFFF5BB1, 0x895CD7BE, 0x6B901122, 0xFD987193, 0xA679438E, 0x49B40821],
[0xF61E2562, 0xC040B340, 0x265E5A51, 0xE9B6C7AA, 0xD62F105D, 0x02441453, 0xD8A1E681, 0xE7D3FBC8, 0x21E1CDE6, 0xC33707D6, 0xF4D50D87, 0x455A14ED, 0xA9E3E905, 0xFCEFA3F8, 0x676F02D9, 0x8D2A4C8A],
[0xFFFA3942, 0x8771F681, 0x6D9D6122, 0xFDE5380C, 0xA4BEEA44, 0x4BDECFA9, 0xF6BB4B60, 0xBEBFBC70, 0x289B7EC6, 0xEAA127FA, 0xD4EF3085, 0x04881D05, 0xD9D4D039, 0xE6DB99E5, 0x1FA27CF8, 0xC4AC5665],
[0xF4292244, 0x432AFF97, 0xAB9423A7, 0xFC93A039, 0x655B59C3, 0x8F0CCC92, 0xFFEFF47D, 0x85845DD1, 0x6FA87E4F, 0xFE2CE6E0, 0xA3014314, 0x4E0811A1, 0xF7537E82, 0xBD3AF235, 0x2AD7D2BB, 0xEB86D391]]
ROLStep = [[7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22],
[5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20],
[4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23],
[6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21]]
然后我们再来写一些简单的逻辑函数,就像上面说的,我们每个 512 比特的分区都会经历 64 轮的比特运算,这 64 轮每 16 轮会换一个逻辑函数,所以总共有 f
,g
,h
,i
四个逻辑函数,并且在后面三个逻辑函数中,我们会有信息代换,所以有 rho_2
,rho_3
以及 rho_4
三个函数:
# Logic Functions
f = lambda b, c, d : (b & c) | ((~b) & d)
g = lambda b, c, d : (b & d) | (c & (~d))
h = lambda b, c, d : b ^ c ^ d
i = lambda b, c, d : c ^ (b | (~d))
rho_2 = lambda i : (1 + 5 * (i)) % 16
rho_3 = lambda i : (5 + 3 * (i)) % 16
rho_4 = lambda i : (7 * (i)) % 16
之后我们每轮比特运算之后会有比特左循环移位,上面那个 ROLSteps
就是记录每轮之后移动的位数,我们接下来写左循环移位的函数:
def ROLs(x, y):
x, y = int(x), int(y)
mask1 = (1 << y) - 1
return ((x >> (32 - y)) & mask1) | ((x << y) & ~mask1)
除了这些函数以外,我们需要注意的是 MD5 哈希是使用小端的方式对数据进行处理的,但是我们一般数据都是使用大端的方式储存,所以我们要专门写函数去转换小端和大端,下面我们来写加载和储存 32 比特大端以及储存 64 比特大段的函数,由于 64 比特大端只会在储存信息长度的时候用到,没有需要加载的场景,故不用写加载 64 比特大端的函数:
# Read 32bit Litte Endian From y to x
def load32L(y):
# print(y)
x = (((ord(y[3]) & 255) << 24) + ((ord(y[2]) & 255) << 16) + ((ord(y[1]) & 255) << 8) + (ord(y[0]) & 255))
return x
# Store 32bit Little Endian From x to y
def store32L(x):
y = [""] * 4
y[0] = chr(x & 255)
y[1] = chr((x>>8) & 255)
y[2] = chr((x>>16) & 255)
y[3] = chr((x>>24) & 255)
return "".join(y)
def gen64LS(x):
x, y = int(x), ""
y += chr(x & 255)
y += chr((x>>8) & 255)
y += chr((x>>16) & 255)
y += chr((x>>24) & 255)
y += chr((x>>32) & 255)
y += chr((x>>40) & 255)
y += chr((x>>48) & 255)
y += chr((x>>56) & 255)
return y
之后我们可以来定义每轮比特运算的总函数了:
# Compress Hash Functions
def FF(a, b, c, d, M, s):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((f(b, c, d)) % (2 ** 32)))
a = (b + ROLs((a + f(b, c, d) + M + T[0][s]) % (2 ** 32), ROLStep[0][s])) % (2 ** 32)
return d, a, b, c
def GG(a, b, c, d, M, s):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((g(b, c, d)) % (2 ** 32)))
a = (b + ROLs((a + g(b, c, d) + M + T[1][s]) % (2 ** 32), ROLStep[1][s])) % (2 ** 32)
return d, a, b, c
def HH(a, b, c, d, M, s):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((h(b, c, d)) % (2 ** 32)))
a = (b + ROLs((a + h(b, c, d) + M + T[2][s]) % (2 ** 32), ROLStep[2][s])) % (2 ** 32)
return d, a, b, c
def II(a, b, c, d, M, s):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((i(b, c, d)) % (2 ** 32)))
a = (b + ROLs((a + i(b, c, d) + M + T[3][s]) % (2 ** 32), ROLStep[3][s])) % (2 ** 32)
return d, a, b, c
当我们对信息处理的时候,就是先把 512 比特分区以小段的方式读入,变成 16 个 64 比特的数值,并且定义 4 个状态寄存器 a
, b
, c
, d
以及他们的初始值 0x67452301
, 0xEFCDAB89
, 0x98BADCFE
, 0x10325476
,之后我们顺序执行 FF
,GG
,HH
以及 II
,对每个 512 比特分区执行上述步骤后四个状态寄存器的值就是我们的 MD5 哈希值了:
def compressMD5(msg, a, b, c, d):
for i in range(int(len(msg) / 64)):
aa, bb, cc, dd = a, b, c, d
s = msg[64 * i : 64 * (i + 1)]
w = [0] * 16
for j in range(16):
w[j] = load32L(s[4 * j: 4 * (j + 1)])
for j in range(16):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[j])
a, b, c, d = FF(a, b, c, d, w[j], j)
for j in range(16):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[rho_2(j)])
a, b, c, d = GG(a, b, c, d, w[rho_2(j)], j)
for j in range(16):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[rho_3(j)])
a, b, c, d = HH(a, b, c, d, w[rho_3(j)], j)
for j in range(16):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[rho_4(j)])
a, b, c, d = II(a, b, c, d, w[rho_4(j)], j)
# a, b, c, d = (a + aa) & 0xFFFFFFFF, (b + bb) & 0xFFFFFFFF, (c + cc) & 0xFFFFFFFF, (d + dd) & 0xFFFFFFFF
a, b, c, d = a + aa, b + bb, c + cc, d + dd
return a, b, c, d
最后我们把函数整合一下,并且在得到四个寄存器的值之后使用 16 进制的方式展现出来:
def hexDigest(hexNum):
s = ""
s += "%0.2x"%(hexNum & 255)
s += "%0.2x"%((hexNum >> 8) & 255)
s += "%0.2x"%((hexNum >> 16) & 255)
s += "%0.2x"%((hexNum >> 24) & 255)
return s
def hashMD5(msg):
a, b, c, d = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476
msg = str(msg)
strlen, endLen = len(msg), len(msg) % 64
segments = []
msg += chr(0x80)
endLen += 1
fill = 0
while endLen > 56:
while endLen < 64:
msg += chr(fill)
endLen = (endLen + 1) % 64
# print("endLen =", endLen, "filled =", chr(fill), "Msg =", msg)
endLen = 0
while endLen < 56:
msg += chr(fill)
endLen = (endLen + 1) % 64
msg = msg + gen64LS(strlen * 8)
a, b, c, d = compressMD5(msg, a, b, c, d)
# print(hex(a), hex(b), hex(c), hex(d))
output = hexDigest(a) + hexDigest(b) + hexDigest(c) + hexDigest(d)
return output
最后我们可以与系统函数库 hashlib
中的 md5
函数互相验证一下,即可知道算法正确与否。
完整代码
import hashlib
# T Table
T = [[0xD76AA478, 0xE8C7B756, 0x242070DB, 0xC1BDCEEE, 0xF57C0FAF, 0x4787C62A, 0xA8304613, 0xFD469501, 0x698098D8, 0x8B44F7AF, 0xFFFF5BB1, 0x895CD7BE, 0x6B901122, 0xFD987193, 0xA679438E, 0x49B40821],
[0xF61E2562, 0xC040B340, 0x265E5A51, 0xE9B6C7AA, 0xD62F105D, 0x02441453, 0xD8A1E681, 0xE7D3FBC8, 0x21E1CDE6, 0xC33707D6, 0xF4D50D87, 0x455A14ED, 0xA9E3E905, 0xFCEFA3F8, 0x676F02D9, 0x8D2A4C8A],
[0xFFFA3942, 0x8771F681, 0x6D9D6122, 0xFDE5380C, 0xA4BEEA44, 0x4BDECFA9, 0xF6BB4B60, 0xBEBFBC70, 0x289B7EC6, 0xEAA127FA, 0xD4EF3085, 0x04881D05, 0xD9D4D039, 0xE6DB99E5, 0x1FA27CF8, 0xC4AC5665],
[0xF4292244, 0x432AFF97, 0xAB9423A7, 0xFC93A039, 0x655B59C3, 0x8F0CCC92, 0xFFEFF47D, 0x85845DD1, 0x6FA87E4F, 0xFE2CE6E0, 0xA3014314, 0x4E0811A1, 0xF7537E82, 0xBD3AF235, 0x2AD7D2BB, 0xEB86D391]]
ROLStep = [[7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22],
[5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20],
[4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23],
[6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21]]
# Logic Functions
f = lambda b, c, d : (b & c) | ((~b) & d)
g = lambda b, c, d : (b & d) | (c & (~d))
h = lambda b, c, d : b ^ c ^ d
i = lambda b, c, d : c ^ (b | (~d))
rho_2 = lambda i : (1 + 5 * (i)) % 16
rho_3 = lambda i : (5 + 3 * (i)) % 16
rho_4 = lambda i : (7 * (i)) % 16
def ROLs(x, y):
x, y = int(x), int(y)
mask1 = (1 << y) - 1
return ((x >> (32 - y)) & mask1) | ((x << y) & ~mask1)
# Compress Hash Functions
def FF(a, b, c, d, M, s):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((f(b, c, d)) % (2 ** 32)))
a = (b + ROLs((a + f(b, c, d) + M + T[0][s]) % (2 ** 32), ROLStep[0][s])) % (2 ** 32)
return d, a, b, c
def GG(a, b, c, d, M, s):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((g(b, c, d)) % (2 ** 32)))
a = (b + ROLs((a + g(b, c, d) + M + T[1][s]) % (2 ** 32), ROLStep[1][s])) % (2 ** 32)
return d, a, b, c
def HH(a, b, c, d, M, s):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((h(b, c, d)) % (2 ** 32)))
a = (b + ROLs((a + h(b, c, d) + M + T[2][s]) % (2 ** 32), ROLStep[2][s])) % (2 ** 32)
return d, a, b, c
def II(a, b, c, d, M, s):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%M, s, "%x"%((i(b, c, d)) % (2 ** 32)))
a = (b + ROLs((a + i(b, c, d) + M + T[3][s]) % (2 ** 32), ROLStep[3][s])) % (2 ** 32)
return d, a, b, c
# Read 32bit Litte Endian From y to x
def load32L(y):
# print(y)
x = (((ord(y[3]) & 255) << 24) + ((ord(y[2]) & 255) << 16) + ((ord(y[1]) & 255) << 8) + (ord(y[0]) & 255))
return x
# Store 32bit Little Endian From x to y
def store32L(x):
y = [""] * 4
y[0] = chr(x & 255)
y[1] = chr((x>>8) & 255)
y[2] = chr((x>>16) & 255)
y[3] = chr((x>>24) & 255)
return "".join(y)
def gen64LS(x):
x, y = int(x), ""
y += chr(x & 255)
y += chr((x>>8) & 255)
y += chr((x>>16) & 255)
y += chr((x>>24) & 255)
y += chr((x>>32) & 255)
y += chr((x>>40) & 255)
y += chr((x>>48) & 255)
y += chr((x>>56) & 255)
return y
def compressMD5(msg, a, b, c, d):
for i in range(int(len(msg) / 64)):
aa, bb, cc, dd = a, b, c, d
s = msg[64 * i : 64 * (i + 1)]
w = [0] * 16
for j in range(16):
w[j] = load32L(s[4 * j: 4 * (j + 1)])
for j in range(16):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[j])
a, b, c, d = FF(a, b, c, d, w[j], j)
for j in range(16):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[rho_2(j)])
a, b, c, d = GG(a, b, c, d, w[rho_2(j)], j)
for j in range(16):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[rho_3(j)])
a, b, c, d = HH(a, b, c, d, w[rho_3(j)], j)
for j in range(16):
# print("%x"%a, "%x"%b, "%x"%c, "%x"%d, "%x"%w[rho_4(j)])
a, b, c, d = II(a, b, c, d, w[rho_4(j)], j)
# a, b, c, d = (a + aa) & 0xFFFFFFFF, (b + bb) & 0xFFFFFFFF, (c + cc) & 0xFFFFFFFF, (d + dd) & 0xFFFFFFFF
a, b, c, d = a + aa, b + bb, c + cc, d + dd
return a, b, c, d
def hexDigest(hexNum):
s = ""
s += "%0.2x"%(hexNum & 255)
s += "%0.2x"%((hexNum >> 8) & 255)
s += "%0.2x"%((hexNum >> 16) & 255)
s += "%0.2x"%((hexNum >> 24) & 255)
return s
def hashMD5(msg):
a, b, c, d = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476
msg = str(msg)
strlen, endLen = len(msg), len(msg) % 64
segments = []
msg += chr(0x80)
endLen += 1
fill = 0
while endLen > 56:
while endLen < 64:
msg += chr(fill)
endLen = (endLen + 1) % 64
# print("endLen =", endLen, "filled =", chr(fill), "Msg =", msg)
endLen = 0
while endLen < 56:
msg += chr(fill)
endLen = (endLen + 1) % 64
msg = msg + gen64LS(strlen * 8)
a, b, c, d = compressMD5(msg, a, b, c, d)
# print(hex(a), hex(b), hex(c), hex(d))
output = hexDigest(a) + hexDigest(b) + hexDigest(c) + hexDigest(d)
return output
# Main Program
if __name__ == "__main__":
message = '1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!#@$%^&*()'
print("Original:", message)
print("My MD5 Hash :", hashMD5(message.encode("UTF-8").decode("UTF-8")))
print("HashLib MD5 Hash:", hashlib.md5(message.encode("UTF-8")).hexdigest())