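I'm trying to compute the hash of files to check whether any changes have been made. I have a GUI and some other observers running in the event loop, so I decided to compute the file hashes [MD5 / SHA1, whichever is faster] asynchronously.
Synchronous code: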
import hashlib
import time

chunk_size = 4 * 1024

def getHash(filename):
    md5_hash = hashlib.md5()
    with open(filename, "rb") as f:
        for byte_block in iter(lambda: f.read(chunk_size), b""):
            md5_hash.update(byte_block)
    print("getHash : " + md5_hash.hexdigest())

start = time.time()
getHash("C:\\Users\\xxx\\video1.mkv")
getHash("C:\\Users\\xxx\\video2.mkv")
getHash("C:\\Users\\xxx\\video3.mkv")
end = time.time()
print(end - start)
Output of the synchronous code: 2.4000535011291504
Asynchronous code:
import hashlib
import aiofiles
import asyncio
import time

chunk_size = 4 * 1024

async def get_hash_async(file_path: str):
    async with aiofiles.open(file_path, "rb") as fd:
        md5_hash = hashlib.md5()
        while True:
            chunk = await fd.read(chunk_size)
            if not chunk:
                break
            md5_hash.update(chunk)
        print("get_hash_async : " + md5_hash.hexdigest())

async def check():
    start = time.time()
    t1 = get_hash_async("C:\\Users\\xxx\\video1.mkv")
    t2 = get_hash_async("C:\\Users\\xxx\\video2.mkv")
    t3 = get_hash_async("C:\\Users\\xxx\\video3.mkv")
    await asyncio.gather(t1, t2, t3)
    end = time.time()
    print(end - start)

# asyncio.run() replaces the older get_event_loop()/run_until_complete() pair
asyncio.run(check())
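Output of the asynchronous code: 27.957366943359375
Am I doing this right, or are there changes I should make to improve the performance of the code?
Thanks in advance.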
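In the synchronous case, you read the files one after another. Reading a file sequentially, chunk by chunk, is faster than interleaving reads across several files.
In the asynchronous case, your event loop is blocked while a hash is being computed. That is why only one hash can be computed at a time. See also: What do the terms "CPU bound" and "I/O bound" mean?
If you want to speed up the computation, you need to use threads. Threads can run the hashing in parallel on the CPU, because hashlib releases the GIL while it updates a hash with more than 2047 bytes of data. Increasing CHUNK_SIZE also helps.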
import hashlib
import os
import time
from pathlib import Path
from multiprocessing.pool import ThreadPool

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time

def get_hash(filename):
    # Feed the file to MD5 chunk by chunk so it never has to fit in memory
    md5_hash = hashlib.md5()
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            md5_hash.update(chunk)
    return md5_hash

if __name__ == '__main__':
    directory = Path("your_dir")
    files = [path for path in directory.iterdir() if path.is_file()]
    number_of_workers = os.cpu_count()
    start = time.time()
    # One worker thread per CPU core; map() returns results in input order
    with ThreadPool(number_of_workers) as pool:
        files_hash = pool.map(get_hash, files)
    end = time.time()
    print(end - start)
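To check how much the chunk size matters on your own machine, a rough timing sketch along these lines may help. The helper names `hash_with_chunk_size` and `time_chunk_sizes` are made up for this illustration, and the candidate sizes are arbitrary. The timings depend on the disk, the OS file cache and the file size, so no numbers are shown here.

```python
import hashlib
import time


def hash_with_chunk_size(filename, chunk_size):
    """Return the MD5 hex digest of a file, reading chunk_size bytes at a time."""
    md5_hash = hashlib.md5()
    with open(filename, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5_hash.update(chunk)
    return md5_hash.hexdigest()


def time_chunk_sizes(filename, chunk_sizes=(4 * 1024, 64 * 1024, 1024 * 1024)):
    """Hash the same file once per chunk size and return {chunk_size: seconds}."""
    timings = {}
    for chunk_size in chunk_sizes:
        start = time.perf_counter()
        hash_with_chunk_size(filename, chunk_size)
        timings[chunk_size] = time.perf_counter() - start
    return timings
```

Calling `time_chunk_sizes("your_file")` returns a dict of elapsed seconds per chunk size. The first pass may be slower than the rest simply because the file is not yet in the OS cache, so it is worth running it more than once before drawing conclusions.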
When computing the hash of only one file: aiofiles uses a thread for each file, and the OS needs time to create that thread.
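Since the question has a GUI running in an asyncio event loop, one possible way to combine the two ideas is to keep the plain blocking hash function and hand it to worker threads from a coroutine. This is only a sketch, assuming Python 3.9+ (for `asyncio.to_thread`); the names `get_hash_blocking` and `get_hashes` are invented for this example and do not come from the code above.

```python
import asyncio
import hashlib

CHUNK_SIZE = 1024 * 1024  # same 1 MiB chunk size as in the answer


def get_hash_blocking(filename):
    """Plain blocking MD5 of a file; meant to run in a worker thread."""
    md5_hash = hashlib.md5()
    with open(filename, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            md5_hash.update(chunk)
    return md5_hash.hexdigest()


async def get_hashes(filenames):
    """Hash several files in worker threads while the event loop stays free."""
    # asyncio.to_thread() runs each call in the loop's default thread pool
    tasks = [asyncio.to_thread(get_hash_blocking, name) for name in filenames]
    # gather() returns the results in the same order as the input
    return await asyncio.gather(*tasks)
```

From synchronous code this could be started with `asyncio.run(get_hashes(list_of_paths))`; inside an already running loop you would `await get_hashes(...)` instead. The event loop is only involved once per file rather than once per 4 KiB chunk, which avoids most of the overhead seen in the aiofiles version.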