Is there a way to extract CSS from a web page using BeautifulSoup?

Question  Votes: 0  Answers: 2

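I'm working on a project that requires me to look at a web page. To work with the HTML further, I need to see the page rendered in full, not as a bunch of lines mixed with images. Is there a way to use BeautifulSoup to parse the CSS along with the HTML?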

Here is my code:

import requests  # needed by get_html() below
from bs4 import BeautifulSoup


def get_html(url, name):
    r = requests.get(url)
    r.encoding = 'utf8'
    return r.text


link = 'https://www.labirint.ru/books/255282/'
with open('labirint.html', 'w', encoding='utf-8') as file:
    file.write(get_html(link, '255282'))

Warning: the page https://www.labirint.ru/books/255282/ redirects to https://www.labirint.ru/books/733371/
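
A redirect like this can be detected from the response itself. The sketch below relies on the `history` and `url` attributes of a requests `Response`; the helper name `report_redirect` is made up for illustration and is not part of the question's code.

```python
def report_redirect(response):
    """Return (original_url, final_url) if the request was redirected, else None.

    `response` is expected to be a requests.Response (or anything exposing
    the same `history` and `url` attributes).
    """
    if response.history:
        # history holds the intermediate responses, oldest first
        return response.history[0].url, response.url
    return None


# Usage (needs network access):
#   r = requests.get('https://www.labirint.ru/books/255282/')
#   print(report_redirect(r))
```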

python html python-3.x web-scraping beautifulsoup
2 Answers
2
votes

If your goal is to actually parse the CSS:

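requests downloads the whole page and Beautiful Soup parses all of it, including the head, styles, scripts, and the links to CSS and JS files. I have used the method from the pythonCodeArticle before, and I retested it against the link you provided.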

import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

# URL of the web page you want to extract
url = "ENTER YOUR LINK HERE"

# initialize a session & set User-Agent as a regular browser
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"

# get the HTML content
html = session.get(url).content

# parse HTML using beautiful soup
soup = bs(html, "html.parser")
print(soup)

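Looking at the soup output (it is very long, so I won't paste it here), you can see that it is the complete page. Just make sure to paste in your own link.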

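Now, if you want to parse the result to get all the CSS URLs, you can add the following (I'm still using part of the code from the Python code article linked above, which describes this in great detail):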

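# get the CSS files (only <link> tags whose rel is "stylesheet";
# a bare find_all("link") would also return icons, canonical URLs, etc.)
css_files = []
for css in soup.find_all("link", rel="stylesheet"):
    if css.attrs.get("href"):
        # if the link tag has the 'href' attribute
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)
print(css_files)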

The printed css_files is a list of the URLs of all the CSS files. You can now visit each of them separately and see which styles are being imported.
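
Visiting each URL can itself be scripted. This is only a rough sketch: `download_stylesheets` is a hypothetical helper name, and it assumes `session` behaves like the `requests.Session` created above (a `get()` method returning an object with `status_code` and `text`).

```python
def download_stylesheets(session, css_urls, timeout=10):
    """Return a dict mapping each stylesheet URL to its CSS text.

    URLs that do not answer with HTTP 200 are reported and skipped.
    """
    stylesheets = {}
    for css_url in css_urls:
        response = session.get(css_url, timeout=timeout)
        if response.status_code != 200:
            print(f"skipping {css_url}: HTTP {response.status_code}")
            continue
        stylesheets[css_url] = response.text
    return stylesheets


# Usage with the names from the snippets above (needs network access):
#   stylesheets = download_stylesheets(session, css_files)
```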

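Note: this particular site mixes in styles that are inline with the HTML (that is, they don't always use CSS files to set style properties; sometimes the styles sit inside the HTML content itself).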
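
Those embedded styles never show up in the list of CSS files, so they have to be read from the parsed HTML. A minimal sketch, assuming the HTML is already available as a string; `collect_embedded_css` and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def collect_embedded_css(html):
    """Return (style_blocks, inline_styles) found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # Text of every <style> element; .string is None for an empty tag
    style_blocks = [tag.string or "" for tag in soup.find_all("style")]
    # (tag name, style attribute value) for every tag with a style attribute
    inline_styles = [(tag.name, tag["style"]) for tag in soup.find_all(style=True)]
    return style_blocks, inline_styles


sample_html = (
    "<html><head><style>p { color: red; }</style></head>"
    '<body><p style="margin: 0">Hi</p><div>no style here</div></body></html>'
)
style_blocks, inline_styles = collect_embedded_css(sample_html)
print(style_blocks)
print(inline_styles)
```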

This should help you get started.


0
votes

Here is the Python function I use (it extracts CSS from external stylesheets, <style> tags, and inline CSS):

import urllib.parse
from typing import Optional

import requests
from bs4 import BeautifulSoup

def extract_css_from_webpage(
    url: str, request_kwargs: Optional[dict] = None, verbose: bool = False
) -> tuple[list[str], list[str], list[dict]]:
    """Extracts CSS from webpage

    Args:
        url (str): Webpage URL
        request_kwargs (dict): These arguments are passed to requests.get() (when
                                fetching webpage HTML and external stylesheets)
        verbose (bool): Print diagnostic information

    Returns:
        tuple[ list[str], list[str], list[dict] ]: css_from_external_stylesheets, css_from_style_tags, inline_css
    """

    if not request_kwargs:
        request_kwargs = {
            "timeout": 10,
            "headers": {"User-Agent": "Definitely not an Automated Script"},
        }
    url_response = requests.get(url, **request_kwargs)
    if url_response.status_code != 200:
        raise requests.exceptions.HTTPError(
            f"received response [{url_response.status_code}] from [{url}]"
        )

    soup = BeautifulSoup(url_response.content, "html.parser")

    css_from_external_stylesheets: list[str] = []
    for link in soup.find_all("link", rel="stylesheet"):
        if not link.get("href"):
            continue  # skip <link> tags that have no href attribute
        css_url = urllib.parse.urljoin(url, link["href"])
        if verbose:
            print(f"downloading external CSS stylesheet {css_url}")
        css_content: str = requests.get(css_url, **request_kwargs).text
        css_from_external_stylesheets.append(css_content)

    css_from_style_tags: list[str] = []
    for style_tag in soup.find_all("style"):
        # .string is None for an empty <style> tag, which would break "".join() below
        css_from_style_tags.append(style_tag.string or "")

    inline_css: list[dict] = []
    for tag in soup.find_all(style=True):
        inline_css.append({"tag": str(tag), "css": tag["style"]})

    if verbose:
        print(
            f"""Extracted the following CSS from [{url}]:
    1. {len(css_from_external_stylesheets):,} external stylesheets (total {len("".join(css_from_external_stylesheets)):,} characters of text)
    2. {len(css_from_style_tags):,} style tags (total {len("".join(css_from_style_tags)):,} characters of text)  
    3. {len(inline_css):,} tags with inline CSS (total {len("".join( (x["css"] for x in inline_css) )):,} characters of text)  

"""
        )

    return css_from_external_stylesheets, css_from_style_tags, inline_css
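
Coming back to the original goal (a saved page that still renders with its styles), one possible follow-up is to put the downloaded CSS directly into the HTML before writing it to disk. This is only a sketch under assumptions: `inline_stylesheets` is a hypothetical helper, and `css_by_url` is assumed to be a dict mapping absolute stylesheet URLs to their text. Relative `url(...)` references inside the CSS (fonts, background images) are not rewritten, so some assets may still be missing.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def inline_stylesheets(html, base_url, css_by_url):
    """Replace <link rel="stylesheet"> tags with <style> tags holding the CSS text."""
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("link", rel="stylesheet"):
        href = link.get("href")
        if not href:
            continue
        css_text = css_by_url.get(urljoin(base_url, href))
        if css_text is None:
            continue  # stylesheet was not downloaded; leave the <link> untouched
        style_tag = soup.new_tag("style")
        style_tag.string = css_text
        link.replace_with(style_tag)
    return str(soup)
```

The returned string can then be written to a file the same way the question writes labirint.html.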