BeautifulSoup on multiple .html files

Question

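I am trying to extract the information between fixed tags with BeautifulSoup, following a previously suggested approach.

I have many .html files in a folder, and I want to save the results produced by my BeautifulSoup script as individual .txt files in another folder. These .txt files should have the same names as the original files but contain only the extracted content. The script I wrote (see below) processes the files successfully, but it does not write the extracted bits out to individual files.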

import os
import glob
from bs4 import BeautifulSoup

dir_path = "C:My_folder\\tmp\\"

for file_name in glob.glob(os.path.join(dir_path, "*.html")):
    my_data = (file_name)
    soup = BeautifulSoup(open(my_data, "r").read())
    for i in soup.select('font[color="#FF0000"]'):
        print(i.text)
        file_path = os.path.join(dir_path, file_name)
        text = open(file_path, mode='r').read()
        results = i.text
        results_dir = "C:\\My_folder\\tmp\\working"
        results_file = file_name[:-4] + 'txt'
        file_path = os.path.join(results_dir, results_file)
        open(file_path, mode='w', encoding='UTF-8').write(results)

Answer 1

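glob returns full paths. You are also re-opening the output file for every font element found, which replaces the file's contents each time. Move the opening of the file outside the loop; you should really use files as context managers (with the with statement) so that they are closed properly again as well: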

import glob
import os.path
from bs4 import BeautifulSoup

dir_path = r"C:\My_folder\tmp"
results_dir = r"C:\My_folder\tmp\working"

for file_name in glob.glob(os.path.join(dir_path, "*.html")):
    with open(file_name) as html_file:
        soup = BeautifulSoup(html_file, "html.parser")

    # glob returns full paths, so keep only the base name for the output file
    results_file = os.path.splitext(os.path.basename(file_name))[0] + '.txt'

    # open the output file once per input file, outside the element loop
    with open(os.path.join(results_dir, results_file), 'w') as outfile:
        for i in soup.select('font[color="#FF0000"]'):
            print(i.text)
            outfile.write(i.text + '\n')
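
Because glob returns full paths, the output name has to be built from the base name of each input file before it is joined to the results folder. A minimal sketch of just that step follows; the helper name output_path_for and the example paths are hypothetical and only serve as illustration:

```python
import os.path


def output_path_for(html_path, results_dir):
    """Return the .txt path inside results_dir that mirrors html_path's file name."""
    # Drop the directory part that glob.glob() includes in each match.
    base_name = os.path.basename(html_path)
    # Split off the extension instead of slicing a fixed number of characters.
    stem, _ext = os.path.splitext(base_name)
    return os.path.join(results_dir, stem + ".txt")


if __name__ == "__main__":
    source = os.path.join("tmp", "page1.html")
    target_dir = os.path.join("tmp", "working")
    print(output_path_for(source, target_dir))
```

Skipping the os.path.basename step matters more than it may seem: when a later argument to os.path.join is an absolute path, the earlier components are discarded, so the .txt file would end up next to the source file rather than in the results folder.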
