Unexpected results with temporary files and hashes

Question

Why does this code hash every file to the same value?

import asyncio
import wget
import hashlib
import os
import tempfile
import zipfile
from multiprocessing import Pool

async def main() -> None:
    url = "https://gitea.radium.group/radium/project-configuration/archive/master.zip"
    if not os.path.isdir('tempdir'):
        os.makedirs('tempdir')
    os.chdir('tempdir')
    wget.download(url)
    zipfile.ZipFile("project-configuration-master.zip").extractall('.')
    os.chdir("project-configuration/nitpick")

async def result():
    for i in range(2):
        task = asyncio.create_task(main())
        await task
        os.chdir("../..")
        for f in os.listdir():
            print(hashlib.sha256(b'{f}'))
        os.remove("tempfile")

asyncio.run(result())

One pass of the loop hashes 8 files, but the run fails after printing 22 file hashes; it should be 24:

Answer 1

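While working through an example that cleans up the directory handling, I found your main problem: you are passing the file names to hashlib.sha256 (in fact the constant literal b'{f}', since bytes literals are not interpolated), when you most likely want to pass the file contents. My example below highlights this, because I switched to the more modern pathlib and also use it to read each file's contents.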
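To see why every file ends up with the same hash, here is a minimal sketch. The file names and contents are made up for illustration; only the b'{f}' expression comes from the question.

```python
import hashlib
import tempfile
from pathlib import Path

names = ["a.txt", "b.txt"]  # hypothetical file names

# b'{f}' is a plain bytes literal: nothing is interpolated, so every
# iteration hashes the same three bytes "{", "f", "}".
literal_digests = {hashlib.sha256(b'{f}').hexdigest() for f in names}
print(len(literal_digests))  # 1

with tempfile.TemporaryDirectory() as tmp:
    content_digests = set()
    for index, name in enumerate(names):
        path = Path(tmp) / name
        path.write_bytes(f"content {index}".encode())
        # Hash what is stored in the file, not an expression about its name.
        content_digests.add(hashlib.sha256(path.read_bytes()).hexdigest())
print(len(content_digests))  # 2
```

Hashing the contents gives one distinct digest per distinct file, which is what the question was expecting.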

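Now, for what I noticed when I first looked at your script:

You are using os.chdir to change global application state. It looks like you expected each task to have its own independent working directory, so that os.chdir would switch to the same directory on every call.

It does not: the first call switches the whole application into the subdirectory where your zip was extracted. The second call to main then happens while the application is already inside that directory.

If it works at all (I would have to run it, or trace it line by line, to be sure), you will end up with all results deeply nested in recursive tempdir/project-configuration/nitpick directories.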
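A minimal sketch of that behaviour, assuming nothing beyond the standard library: one task calls os.chdir and a second, unrelated task observes the new working directory. The function names are hypothetical.

```python
import asyncio
import os
import tempfile


async def change_directory(target: str) -> None:
    os.chdir(target)  # changes the cwd of the whole process, not of this task


async def report_directory() -> str:
    return os.getcwd()


async def demo() -> tuple[str, str]:
    original = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        try:
            await asyncio.create_task(change_directory(tmp))
            seen = await asyncio.create_task(report_directory())
        finally:
            os.chdir(original)  # restore before the directory is removed
        # realpath normalises symlinked temp locations before comparing
        return os.path.realpath(seen), os.path.realpath(tmp)


seen_by_other_task, changed_to = asyncio.run(demo())
print(seen_by_other_task == changed_to)  # True
```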

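Second, and possibly unrelated to your problem: this code is not concurrent at all. You use async def for your main function, but at no point does it make another asynchronous call (marked by the await keyword), so the function runs to completion before the asyncio loop can switch to another task.

In this case, the wget call would be the natural one to make asynchronous. It is a third-party library, so check its documentation for an equivalent asynchronous call and use that if it exists. Otherwise, you can use asyncio's loop.run_in_executor to run it in another thread, which also makes the code concurrent.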
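As a rough sketch of the run_in_executor approach: blocking_download below is a hypothetical stand-in for a blocking call such as wget.download (it only sleeps), so the example runs without network access.

```python
import asyncio
import threading
import time
from functools import partial


def blocking_download(name: str, delay: float = 0.1) -> tuple[str, bool]:
    """Hypothetical stand-in for a blocking call such as wget.download."""
    time.sleep(delay)
    ran_in_main_thread = threading.current_thread() is threading.main_thread()
    return name, ran_in_main_thread


async def fetch(name: str) -> tuple[str, bool]:
    loop = asyncio.get_running_loop()
    # None selects the loop's default thread pool executor.
    return await loop.run_in_executor(None, partial(blocking_download, name))


async def fetch_all() -> list[tuple[str, bool]]:
    # gather schedules both coroutines, so the two waits overlap.
    return await asyncio.gather(fetch("a"), fetch("b"))


results = asyncio.run(fetch_all())
print(results)  # [('a', False), ('b', False)]
```

The False flags show that the blocking work ran outside the main thread, which leaves the event loop free to run other tasks in the meantime.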

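Given the structure here, I guess you were adapting code that once used multiprocessing. If main ran in a separate process for each task, every task would have its own working directory, and the wget calls would be parallelized simply by the operating system scheduling the processes: everything would indeed work. That is not the case with async code.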

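So, touching only these two parts, this is what your code could look like. First thing: never use os.chdir. It will always break anything more complex than a 10-line script (and in this case it may break even sooner), because it depends on changing a single, process-wide global resource that cannot be restored. Always use relative paths, and join your paths. The traditional API, os.path.join, is too verbose, but since Python 3.4 we have pathlib.Path, which lets you join paths properly with the / operator.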

import asyncio
import hashlib
import shutil
import zipfile
from functools import partial
from pathlib import Path

import wget


TEMPDIR = Path("tempdir")
CONFIGDIR = "project-configuration/nitpick"

async def main() -> None:
    url = "https://gitea.radium.group/radium/project-configuration/archive/master.zip"
    TEMPDIR.mkdir(exist_ok=True)  # Path.mkdir replaces the isdir/makedirs pair
    # wget.download(url, out=str(TEMPDIR))
    # I had to check wget's source: it is a small utility that was not updated
    # to understand pathlib objects. It can take an output directory argument,
    # as above, which avoids `chdir`. Still, the call has to be made
    # asyncio-friendly:
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, partial(wget.download, url, out=str(TEMPDIR)))
    # The call above sends the wget workload to another thread, transparently,
    # so that other tasks can run concurrently with it.

    zipfile.ZipFile(TEMPDIR / "project-configuration-master.zip").extractall(TEMPDIR)
    # os.chdir("project-configuration/nitpick")  # as discussed: do not do this.
    # Instead, from this point on, prefix every file operation with
    # `TEMPDIR / CONFIGDIR / "filename"`. Legacy functions may need a str:
    # `str(TEMPDIR / CONFIGDIR / "filename")`, but "open" and other
    # Python file operations accept Path objects directly.
    # ...

async def result():
    for i in range(2):
        task = asyncio.create_task(main())
        await task  # This still performs all ops in sequence, not parallelizing anything,
                    # but if you do parallelize, you might want to
                    # parametrize tempdir: as is, the code uses the same
                    # hardcoded "tempdir" for all downloads
        # os.chdir("../..")  # No directory was changed in the call, so no need to change back
        for f in (TEMPDIR / CONFIGDIR).iterdir():  # yields all entries in the extracted nitpick directory
            if not f.is_file():
                continue  # read_bytes() would fail on a subdirectory
            # print(hashlib.sha256(b'{f}'))  # Here is your main problem: b'{f}' is a constant, not the file
            hash_ = hashlib.sha256(f.read_bytes())  # here we calculate the hash on the file _contents_
            print(f.name, hash_.hexdigest())  # and print the actual hash, not the Python repr of the hash object
        shutil.rmtree(TEMPDIR)

asyncio.run(result())
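If you do want the downloads to overlap, each task needs its own directory, as the comment in result hints. The following is only a sketch under assumptions: it builds a tiny zip locally in place of the real download, and make_sample_zip, hash_archive, process_one and process_all are hypothetical names, not part of the code above.

```python
import asyncio
import hashlib
import tempfile
import zipfile
from pathlib import Path


def make_sample_zip(target: Path) -> None:
    """Create a tiny archive locally; a hypothetical stand-in for the download."""
    with zipfile.ZipFile(target, "w") as archive:
        archive.writestr("project/a.txt", "alpha")
        archive.writestr("project/b.txt", "beta")


def hash_archive(workdir: Path) -> dict[str, str]:
    """Blocking work: build the zip, extract it, hash every extracted file."""
    archive_path = workdir / "sample.zip"
    make_sample_zip(archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(workdir)
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted((workdir / "project").iterdir())
        if path.is_file()
    }


async def process_one() -> dict[str, str]:
    # Each task gets its own directory, so tasks cannot clobber each other,
    # and the directory is removed automatically when the block exits.
    with tempfile.TemporaryDirectory() as tmp:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, hash_archive, Path(tmp))


async def process_all(count: int = 2) -> list[dict[str, str]]:
    return await asyncio.gather(*(process_one() for _ in range(count)))


all_hashes = asyncio.run(process_all())
for hashes in all_hashes:
    for name, digest in hashes.items():
        print(name, digest)
```

No os.chdir is involved at any point: every path is built from the per-task directory with the / operator.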
