Accessing a folder with os.scandir() after changing its contents

Question

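I'm trying to iterate over a folder of HTML files and filter them by whether they contain certain keywords as strings. I download the files into a folder using wget and BeautifulSoup, iterate over that folder with os.scandir() after the download, and copy the files that pass the filter to another folder. However, the first time I run the script it only downloads the files and copies nothing to the destination directory. When I run it a second time, it filters correctly.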

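My guess is that on the first run os.scandir() captures the initial (empty) state of the first folder, but I can't get os.scandir() to pick up the final state of the folder, with the HTML data in it. How can I make it iterate over the data after the download has happened? Here is the snippet: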

```python
#pull HTML data from links with wget
subprocess.Popen(bash_commandList,stdout=subprocess.PIPE)

job_as_string = ""

#search for keyword in html as string to detect if the jobstelle has something I can do
with os.scandir('/Users/user/wgetLinks') as parent:
    for job_stelle in parent:
        with open(job_stelle, 'r') as f:
            if job_stelle.name.endswith(".html") and job_stelle.is_file():
                print(job_stelle.name)
                job_as_string = f.read()
        f.close()
        for keyword in keywords:
            if(keyword in job_as_string):
                popen_Command = '/Users/user/wgetLinks/' + job_stelle.name
                shutil.copy(popen_Command, '/Users/user/wgetInformatics')
                continue
```

Answer 1

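The problem is most likely caused by the asynchronous nature of the `subprocess.Popen` call: it starts the download process but does not wait for it to finish before the script moves on to the filtering step. By the time you iterate over the directory with `os.scandir`, the files may not have been fully downloaded yet.

To fix this, use `subprocess.run` instead of `subprocess.Popen`. `subprocess.run` waits for the command to complete before execution continues with the next line of the script, so the directory is only scanned once the download process has exited.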
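To make the timing issue concrete, here is a minimal sketch that does not need wget at all. It assumes a hypothetical stand-in for the download: a child Python process that sleeps for a second and then writes `page.html`. The helper names `list_names` and `demo` are made up for this illustration. The folder is scanned once right after `Popen` returns and once after `wait()`.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for wget: sleep, then write one HTML file.
CHILD_CODE = (
    "import pathlib, sys, time; time.sleep(1.0); "
    "pathlib.Path(sys.argv[1]).write_text('<html>done</html>')"
)


def list_names(folder):
    """Return the sorted entry names that os.scandir() sees right now."""
    with os.scandir(folder) as entries:
        return sorted(entry.name for entry in entries)


def demo():
    with tempfile.TemporaryDirectory() as folder:
        target = os.path.join(folder, "page.html")
        # Popen returns as soon as the child has been started.
        proc = subprocess.Popen([sys.executable, "-c", CHILD_CODE, target])
        before = list_names(folder)  # scanned while the child is still sleeping
        proc.wait()                  # block until the child has exited
        after = list_names(folder)   # scanned after the "download" finished
        return before, after


before, after = demo()
print("before wait():", before)
print("after wait():", after)
```

On a typical run the first listing should be empty and the second should contain `page.html`. This suggests that `os.scandir()` is not caching an old state of the folder; it is simply being called too early. Calling `wait()` on the `Popen` object is the other common way to get the blocking behaviour that `subprocess.run` gives you by default.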

Below are the changes I made to your script:

```python
import subprocess
import os
import shutil

# List of keywords to search for
keywords = ["keyword1", "keyword2", "keyword3"]

# Pull HTML data from links with wget
bash_commandList = ["wget", "-P", "/Users/user/wgetLinks", "http://example.com/file1.html", "http://example.com/file2.html"]
subprocess.run(bash_commandList, stdout=subprocess.PIPE)

# Search for keyword in HTML files and copy matching files to another directory
with os.scandir('/Users/user/wgetLinks') as parent:
    for job_stelle in parent:
        if job_stelle.name.endswith(".html") and job_stelle.is_file():
            with open(job_stelle.path, 'r') as f:
                job_as_string = f.read()
            for keyword in keywords:
                if keyword in job_as_string:
                    shutil.copy(job_stelle.path, '/Users/user/wgetInformatics')
                    break
```

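Here is a short explanation of the changes:

Use `subprocess.run` instead of `subprocess.Popen`, so the script waits for the download to finish before continuing.

Check that the entry is an HTML file before opening it, and read its content inside the `with` block. Non-HTML entries are never opened, and the file handle is closed automatically, so the explicit `f.close()` is no longer needed.

Use `job_stelle.path` to get the full path of the file, both for reading and for copying. Replacing `continue` with `break` also stops at the first matching keyword, so a file is copied at most once.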
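The filtering step described above can also be folded into a small helper. The following is only a sketch: the function name `copy_matching_html`, the sample file names and the `errors="replace"` choice are assumptions made for illustration and are not part of the original script. It runs against a temporary directory so it can be tried without touching `/Users/user/wgetLinks`.

```python
import os
import shutil
import tempfile


def copy_matching_html(src_dir, dst_dir, keywords):
    """Copy each .html file in src_dir that contains any keyword into dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    with os.scandir(src_dir) as entries:
        for entry in entries:
            # Skip directories and non-HTML files before opening anything.
            if not (entry.is_file() and entry.name.endswith(".html")):
                continue
            # Assumption: tolerate pages that are not valid UTF-8.
            with open(entry.path, "r", encoding="utf-8", errors="replace") as f:
                text = f.read()
            # any() stops at the first keyword that matches.
            if any(keyword in text for keyword in keywords):
                shutil.copy(entry.path, dst_dir)
                copied.append(entry.name)
    return sorted(copied)


# Usage with made-up sample files in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "wgetLinks")
    dst = os.path.join(root, "wgetInformatics")
    os.makedirs(src)
    samples = {
        "a.html": "<p>Python developer</p>",
        "b.html": "<p>Chef</p>",
        "notes.txt": "Python",
    }
    for name, body in samples.items():
        with open(os.path.join(src, name), "w", encoding="utf-8") as f:
            f.write(body)
    copied = copy_matching_html(src, dst, ["Python", "Java"])
    copied_listing = sorted(os.listdir(dst))

print(copied)
```

Returning the list of copied names makes it easy to log or check what the filter actually selected.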

This approach ensures the download process has finished before the script tries to filter and copy the HTML files.
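One caveat: waiting for the process to exit is not the same as the download having succeeded. wget signals failure through a non-zero exit status, which `subprocess.run` exposes as `returncode` on the object it returns. The sketch below uses a hypothetical helper `run_and_report` and a child Python process in place of wget, so it can be run anywhere.

```python
import subprocess
import sys


def run_and_report(command):
    """Run command, wait for it to exit, and return True if it exited with 0."""
    result = subprocess.run(command)
    return result.returncode == 0


# Stand-ins for a successful and a failed download command.
ok = run_and_report([sys.executable, "-c", "raise SystemExit(0)"])
failed = run_and_report([sys.executable, "-c", "raise SystemExit(3)"])

# check=True turns a non-zero exit status into an exception instead.
try:
    subprocess.run([sys.executable, "-c", "raise SystemExit(3)"], check=True)
    raised = False
except subprocess.CalledProcessError as exc:
    raised = exc.returncode == 3

print(ok, failed, raised)
```

Whether a failed download should abort the script or just be logged depends on how you want to treat partially downloaded batches.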

Give it a try and let me know in the comments!
