I've run into the following problem. I wrote a simple "TextBasedBrowser" (if you can even call it a browser at this point :D). Scraping and parsing websites with BS4 works fine so far, but the formatting is awful and barely readable. As soon as I try to use the prettify() method from BS4, an AttributeError is thrown. I googled for quite a while but couldn't find anything. Here is my code (with the prettify() call commented out):
from bs4 import BeautifulSoup
import requests
import sys
import os

legal_html_tags = ['p', 'a', 'ul', 'ol', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'title']
saved_pages = []

def search_url(url):
    saved_pages.append(url.rstrip(".com"))
    url = requests.get(f'https://{url}')
    return url.text

def parse_html(html_page):
    final_text = ""
    soup = BeautifulSoup(html_page, 'html.parser')
    # soup = soup.prettify()
    plain_text = soup.find_all(text=True)
    for t in plain_text:
        if t.parent.name in legal_html_tags:
            final_text += '{} '.format(t)
    return final_text

def save_webpage(url, tb_dir):
    with open(f'{tb_dir}/{url.rstrip(".com")}.txt', 'w', encoding="utf-8") as tab:
        tab.write(parse_html(search_url(url)))

def check_url(url):
    if url.endswith(".com") or url.endswith(".org") or url.endswith(".net"):
        return True
    else:
        return False

args = sys.argv
directory = args[1]

try:
    os.mkdir(directory)
except FileExistsError:
    print("Error: File already exists")

while True:
    url_ = input()
    if url_ == "exit":
        break
    elif url_ in saved_pages:
        with open(f'{directory}/{url_}.txt', 'r', encoding="utf-8") as curr_page:
            print(curr_page.read())
    elif not check_url(url_):
        print("Error: Invalid URL")
    else:
        save_webpage(url_, directory)
        print(parse_html(search_url(url_)))
And here is the error:
Traceback (most recent call last):
  File "browser.py", line 56, in <module>
    save_webpage(url_, directory)
  File "browser.py", line 29, in save_webpage
    tab.write(parse_html(search_url(url)))
  File "browser.py", line 20, in parse_html
    plain_text = soup.find_all(text=True)
AttributeError: 'str' object has no attribute 'find_all'
If I include the encoding parameter in the prettify() call, it throws the same AttributeError for a 'bytes' object instead of 'str'.
prettify converts your parsed HTML object into a string, so you can't call find_all on it afterwards. Maybe you just want to return soup.prettify()?
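A minimal sketch of what's going on (the sample HTML here is made up for illustration): prettify() returns a plain str, while find_all() only exists on the soup object, so the order of the two calls matters.

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello</p><a>link</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a plain Python string, not a soup object
pretty = soup.prettify()
print(type(pretty))   # <class 'str'>

# find_all() must be called on the soup itself, not on the string
texts = soup.find_all(text=True)
print(texts)          # ['Hello', 'link']
```

Calling `pretty.find_all(...)` here would raise exactly the AttributeError from the question, because str has no such method.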
You've reassigned the soup variable to a string by using the .prettify() method:

soup = soup.prettify()

find_all() is a method that only works on soup objects. You should call find_all(text=True) first to extract all the HTML tags containing text, and only then do your string manipulation.
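A sketch of that order of operations, based on the parse_html function from the question (the sample HTML is made up): extract the text nodes with find_all first, then format the resulting strings yourself, for example by joining them with newlines instead of relying on prettify().

```python
from bs4 import BeautifulSoup

legal_html_tags = ['p', 'a', 'ul', 'ol', 'li',
                   'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'title']

def parse_html(html_page):
    soup = BeautifulSoup(html_page, 'html.parser')
    # step 1: extract the text nodes while we still have a soup object
    # step 2: do the string work (strip, filter, join) on plain strings
    lines = [t.strip() for t in soup.find_all(text=True)
             if t.parent.name in legal_html_tags and t.strip()]
    return '\n'.join(lines)

html = "<html><title>Demo</title><body><p>Hello</p><script>x=1</script></body></html>"
print(parse_html(html))   # Demo
                          # Hello
```

The script contents are dropped because 'script' is not in legal_html_tags, and joining with '\n' gives readable output without prettify().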