I'm working on a project where I need to view web pages, but to work with the HTML further I need to see it in full, not as a jumble of lines mixed in with images. Is there a way to use BeautifulSoup to parse CSS as well as HTML?
Here is my code:
import requests
from bs4 import BeautifulSoup

def get_html(url, name):
    r = requests.get(url)
    r.encoding = 'utf8'
    return r.text

link = 'https://www.labirint.ru/books/255282/'
with open('labirint.html', 'w', encoding='utf-8') as file:
    file.write(get_html(link, '255282'))
WARNING: the page https://www.labirint.ru/books/255282/ redirected to https://www.labirint.ru/books/733371/.
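requests follows redirects automatically, and the intermediate responses are kept in response.history. A small helper (the function name here is my own, for illustration) reports where a request actually ended up, so you know which page you are really scraping:

```python
import requests

def report_redirect(response: requests.Response) -> str:
    """Return the redirect chain of a response, e.g. 'old-url -> final-url'."""
    if response.history:  # list of intermediate redirect responses, empty if none
        chain = " -> ".join(r.url for r in response.history)
        return f"{chain} -> {response.url}"
    return response.url

# usage sketch:
#   r = requests.get('https://www.labirint.ru/books/255282/')
#   print(report_redirect(r))
```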
If your goal is to really parse the CSS:
BeautifulSoup pulls in the entire page - it really does include the header, styles, scripts, the links inside the CSS and JS, and so on. I have used the approach from pythonCodeArticle before, and I re-tested it against the link you provided.
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin
# URL of the web page you want to extract
url = "ENTER YOUR LINK HERE"
# initialize a session & set User-Agent as a regular browser
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# get the HTML content
html = session.get(url).content
# parse HTML using beautiful soup
soup = bs(html, "html.parser")
print(soup)
Looking at the soup output (it is long, so I won't paste it here)... you can see that it is the complete page. Just make sure to paste in your specific link.
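As a quick offline illustration (the tiny page below is made up), the parse tree really does keep every styling-related tag intact - the <link>, the <style> contents, and inline style attributes:

```python
from bs4 import BeautifulSoup

# a tiny made-up page, just to show the parser keeps every styling-related tag
html = """
<html><head>
  <link rel="stylesheet" href="/css/main.css">
  <style>body { margin: 0; }</style>
</head>
<body><p style="color: red;">hello</p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")
print(soup.find("link")["href"])  # /css/main.css
print(soup.find("style").string)  # body { margin: 0; }
print(soup.find("p")["style"])    # color: red;
```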
Now, if you want to parse the result to get all of the CSS URLs... you can add the following (I am still using parts of the code from the python code article link described in detail above):
# get the CSS files
css_files = []
for css in soup.find_all("link"):
    if css.attrs.get("href"):
        # if the link tag has the 'href' attribute
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)
print(css_files)
The css_files output will be a list of all of the CSS files. You can now fetch each one separately and look at the styles it imports.
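The urljoin call is what turns relative href values into absolute URLs you can actually fetch; absolute href values pass through unchanged (the URLs below are illustrative):

```python
from urllib.parse import urljoin

page_url = "https://www.labirint.ru/books/255282/"
# relative href -> resolved against the page URL
print(urljoin(page_url, "/css/site.css"))
# -> https://www.labirint.ru/css/site.css
# absolute href -> returned unchanged
print(urljoin(page_url, "https://cdn.example.com/fonts.css"))
# -> https://cdn.example.com/fonts.css
```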
NOTE: this particular site mixes styles inline with the HTML (i.e. it does not always use CSS for the style attributes... sometimes the styles sit inside the HTML content itself).
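Those inline styles are still reachable with BeautifulSoup: find_all(style=True) matches any tag that carries a style attribute. A small made-up snippet:

```python
from bs4 import BeautifulSoup

html = '<div style="color: blue;">text<p style="font-weight: bold;">more</p></div>'
soup = BeautifulSoup(html, "html.parser")
# map each tag name to its inline style declaration
inline = {tag.name: tag["style"] for tag in soup.find_all(style=True)}
print(inline)  # {'div': 'color: blue;', 'p': 'font-weight: bold;'}
```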
This should get you started.
Here is the Python function I use (it extracts CSS from external stylesheets, <style> tags, and inline CSS):
import urllib.parse
from typing import Optional

import requests
from bs4 import BeautifulSoup


def extract_css_from_webpage(
    url: str, request_kwargs: Optional[dict] = None, verbose: bool = False
) -> tuple[list[str], list[str], list[dict]]:
    """Extracts CSS from a webpage

    Args:
        url (str): Webpage URL
        request_kwargs (dict): These arguments are passed to requests.get() (when
            fetching the webpage HTML and the external stylesheets)
        verbose (bool): Print diagnostic information

    Returns:
        tuple[list[str], list[str], list[dict]]: css_from_external_stylesheets,
            css_from_style_tags, inline_css
    """
    if not request_kwargs:
        request_kwargs = {
            "timeout": 10,
            "headers": {"User-Agent": "Definitely not an Automated Script"},
        }
    url_response = requests.get(url, **request_kwargs)
    if url_response.status_code != 200:
        raise requests.exceptions.HTTPError(
            f"received response [{url_response.status_code}] from [{url}]"
        )
    soup = BeautifulSoup(url_response.content, "html.parser")

    # 1. CSS downloaded from external stylesheets referenced by <link> tags
    css_from_external_stylesheets: list[str] = []
    for link in soup.find_all("link", rel="stylesheet"):
        css_url = urllib.parse.urljoin(url, link["href"])
        if verbose:
            print(f"downloading external CSS stylesheet {css_url}")
        css_content: str = requests.get(css_url, **request_kwargs).text
        css_from_external_stylesheets.append(css_content)

    # 2. CSS embedded in <style> tags
    css_from_style_tags: list[str] = []
    for style_tag in soup.find_all("style"):
        # .string is None for an empty <style> tag
        css_from_style_tags.append(style_tag.string or "")

    # 3. inline CSS in style="..." attributes
    inline_css: list[dict] = []
    for tag in soup.find_all(style=True):
        inline_css.append({"tag": str(tag), "css": tag["style"]})

    if verbose:
        print(
            f"""Extracted the following CSS from [{url}]:
    1. {len(css_from_external_stylesheets):,} external stylesheets (total {len("".join(css_from_external_stylesheets)):,} characters of text)
    2. {len(css_from_style_tags):,} style tags (total {len("".join(css_from_style_tags)):,} characters of text)
    3. {len(inline_css):,} tags with inline CSS (total {len("".join(x["css"] for x in inline_css)):,} characters of text)
"""
        )
    return css_from_external_stylesheets, css_from_style_tags, inline_css
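To see the shape of the three return values without hitting the network, the same three extraction steps can be run on a fixed snippet (the HTML below is made up, and the external-stylesheet step only collects href values here instead of downloading them):

```python
from bs4 import BeautifulSoup

html = """
<html><head>
  <link rel="stylesheet" href="/a.css">
  <style>h1 { color: green; }</style>
</head>
<body><p style="margin: 0;">hi</p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# the same three extraction steps as in the function above
stylesheet_hrefs = [link["href"] for link in soup.find_all("link", rel="stylesheet")]
css_from_style_tags = [tag.string or "" for tag in soup.find_all("style")]
inline_css = [{"tag": str(tag), "css": tag["style"]} for tag in soup.find_all(style=True)]

print(stylesheet_hrefs)      # ['/a.css']
print(css_from_style_tags)   # ['h1 { color: green; }']
print(inline_css[0]["css"])  # margin: 0;
```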