BeautifulSoup on multiple .html files

Question (votes: 1, answers: 2)

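I am trying to extract the information between fixed tags with BeautifulSoup, using the model suggested in a linked answer.

I have many .html files in a folder, and I want to save the results obtained with my BeautifulSoup script as individual .txt files in another folder. These .txt files should have the same names as the original files but contain only the extracted content. The script I wrote (see below) processes the files successfully, but it does not write the extracted bits out to individual files.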

import os
import glob
from bs4 import BeautifulSoup

dir_path = "C:\\My_folder\\tmp\\"

for file_name in glob.glob(os.path.join(dir_path, "*.html")):
    my_data = (file_name)
    soup = BeautifulSoup(open(my_data, "r").read())
    for i in soup.select('font[color="#FF0000"]'):
        print(i.text)
        file_path = os.path.join(dir_path, file_name)
        text = open(file_path, mode='r').read()
        results = i.text
        results_dir = "C:\\My_folder\\tmp\\working"
        results_file = file_name[:-4] + 'txt'
        file_path = os.path.join(results_dir, results_file)
        open(file_path, mode='w', encoding='UTF-8').write(results)

Tags: python python-2.7 beautifulsoup text-files extract
2 Answers
2 votes
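Glob returns full paths. You are re-opening the output file for every font element found, which replaces the file's contents each time. Move the opening of the file outside the loop; you should really use files as context managers (with the with statement) to ensure they are properly closed again, too: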

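import glob
import os.path
from bs4 import BeautifulSoup

dir_path = r"C:\My_folder\tmp"
results_dir = r"C:\My_folder\tmp\working"

for file_name in glob.glob(os.path.join(dir_path, "*.html")):
    # Parse each HTML file once; the with block closes it afterwards.
    with open(file_name) as html_file:
        soup = BeautifulSoup(html_file)

    # glob returns full paths, so keep only the base name for the output file;
    # joining a full path onto results_dir would discard results_dir.
    base_name = os.path.basename(file_name)
    results_file = os.path.splitext(base_name)[0] + '.txt'
    # Open the output file once per HTML file, outside the loop over matches.
    with open(os.path.join(results_dir, results_file), 'w') as outfile:
        for i in soup.select('font[color="#FF0000"]'):
            print(i.text)
            outfile.write(i.text + '\n')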
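
The two points in this answer (glob yields full paths, and the output file should be opened once rather than once per match) can be tried without BeautifulSoup. The following is only a minimal sketch using the standard library: the helper names output_path_for and write_matches are hypothetical, and the list of strings stands in for whatever soup.select(...) would return.

```python
import os
import tempfile
from pathlib import Path


def output_path_for(html_path, results_dir):
    """Return results_dir/<base name>.txt for a path as returned by glob."""
    # Use only the file's stem (name without extension); joining the full
    # input path onto results_dir would ignore results_dir if it is absolute.
    return Path(results_dir) / (Path(html_path).stem + ".txt")


def write_matches(matches, out_path):
    """Write every extracted string to out_path, one per line."""
    # The file is opened once, so later matches do not overwrite earlier ones.
    with open(out_path, "w", encoding="utf-8") as outfile:
        for text in matches:
            outfile.write(text + "\n")


if __name__ == "__main__":
    # A temporary directory stands in for the real results folder.
    with tempfile.TemporaryDirectory() as results_dir:
        html_path = os.path.join("some_folder", "page.html")  # made-up input path
        out_path = output_path_for(html_path, results_dir)
        write_matches(["first match", "second match"], out_path)
        print(out_path.name)
        print(out_path.read_text(encoding="utf-8").splitlines())
```

In a real script, the list passed to write_matches would presumably be built from the text of each selected font element, and results_dir would have to exist before writing.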


1 vote