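I am trying to extract the information between fixed tags with BeautifulSoup, using an approach suggested elsewhere.
I have a lot of .html files in a folder, and I want to save the results obtained with the BeautifulSoup script to another folder as individual .txt files. These .txt files should have the same names as the original files but contain only the extracted content. The script I wrote (see below) processes the files successfully, but it does not write the extracted bits out to the individual files.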

    import os
    import glob
    from bs4 import BeautifulSoup

    dir_path = "C:\\My_folder\\tmp\\"

    for file_name in glob.glob(os.path.join(dir_path, "*.html")):
        my_data = (file_name)
        soup = BeautifulSoup(open(my_data, "r").read())
        for i in soup.select('font[color="#FF0000"]'):
            print(i.text)
            file_path = os.path.join(dir_path, file_name)
            text = open(file_path, mode='r').read()
            results = i.text
            results_dir = "C:\\My_folder\\tmp\\working"
            results_file = file_name[:-4] + 'txt'
            file_path = os.path.join(results_dir, results_file)
            open(file_path, mode='w', encoding='UTF-8').write(results)

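You are reopening the output file for every font element, which replaces the file's contents each time. Move the opening of the file outside the loop; you should really use files as context managers (with the with statement) to ensure they are properly closed again as well:

    import glob
    import os.path

    from bs4 import BeautifulSoup

    dir_path = r"C:\My_folder\tmp"
    results_dir = r"C:\My_folder\tmp\working"

    for file_name in glob.glob(os.path.join(dir_path, "*.html")):
        with open(file_name) as html_file:
            # Name the parser explicitly to avoid bs4's "no parser specified" warning.
            soup = BeautifulSoup(html_file, "html.parser")

        # glob returns the full path; keep only the base name so that
        # os.path.join places the output inside results_dir.
        base_name = os.path.splitext(os.path.basename(file_name))[0]
        results_file = base_name + '.txt'

        # Open the output file once per input file, outside the loop over matches.
        with open(os.path.join(results_dir, results_file), 'w', encoding='UTF-8') as outfile:
            for i in soup.select('font[color="#FF0000"]'):
                print(i.text)
                outfile.write(i.text + '\n')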
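
To see why reopening the output file inside the loop loses data, here is a minimal sketch that uses only the standard library. The `matches` list and the two file names are hypothetical stand-ins for the extracted `font` texts and the real output files; the point is only that mode `'w'` truncates the file on every `open()` call.

```python
import os
import tempfile

# Hypothetical stand-ins for the strings extracted from one HTML file.
matches = ["first match", "second match", "third match"]

with tempfile.TemporaryDirectory() as tmp_dir:
    reopened_path = os.path.join(tmp_dir, "reopened.txt")
    opened_once_path = os.path.join(tmp_dir, "opened_once.txt")

    # Pattern from the question: reopen in 'w' mode for every match.
    # Each open() truncates the file, so earlier matches are overwritten.
    for text in matches:
        with open(reopened_path, "w", encoding="utf-8") as outfile:
            outfile.write(text + "\n")

    # Pattern from the answer: open once, then write every match.
    with open(opened_once_path, "w", encoding="utf-8") as outfile:
        for text in matches:
            outfile.write(text + "\n")

    # Read both files back before the temporary directory is removed.
    with open(reopened_path, encoding="utf-8") as f:
        reopened_lines = f.read().splitlines()
    with open(opened_once_path, encoding="utf-8") as f:
        opened_once_lines = f.read().splitlines()

print(reopened_lines)     # ['third match']
print(opened_once_lines)  # ['first match', 'second match', 'third match']
```

Opening with mode `'a'` inside the loop would also keep every match, but it would keep appending on every rerun of the script, so opening the file once per input file is usually the simpler choice.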