Questions tagged beautifulsoup

Beautiful Soup is a Python package for parsing HTML/XML. The latest version of this package is version 4, imported as bs4.

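What is the difference between "pip install bs4" and "pip install BeautifulSoup4"?

When I search for how to install the BeautifulSoup library, I sometimes see pip install bs4 and sometimes pip install BeautifulSoup4. What is the difference between these two ways of installing it?

python beautifulsoup

Answers: 2, Votes: 0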
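A minimal sketch for checking which of the two distributions is actually installed. It assumes, as the PyPI project descriptions indicate, that beautifulsoup4 is the real distribution and bs4 is a placeholder that depends on it, so either command ends with the same importable module, bs4. The helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Report both distribution names; the import name is bs4 in either case.
for name in ("beautifulsoup4", "bs4"):
    print(name, "->", installed_version(name))
```

If only beautifulsoup4 reports a version, the library was installed directly; if both do, the bs4 placeholder pulled it in as a dependency.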

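Scraping the next page of the same table with Python and BeautifulSoup

I'm learning web scraping and practicing on the Yahoo Finance website, but I'm having trouble iterating over the next pages of the table I'm extracting. I tried the code below, but it only works...

python web-scraping beautifulsoup yahoo-finance

Answers: 1, Votes: 0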

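Using Beautiful Soup to count titles/links

I'm trying to write code that tracks the link text in the grey box on the left-hand side of this web page. In this case, the code should return "瓦雷克里酸宝宝". Here is the code...

web-scraping beautifulsoup

Answers: 1, Votes: 0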

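HLTV/results scraper not working. Multiple identically named divs

I'm building a script to scrape the hltv.org/results page for CS2 matches. However, I've run into a lot of problems; specifically, the site hltv.org/results?offset={} has multiple d...

python selenium-webdriver web-scraping beautifulsoup

Answers: 1, Votes: 0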

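Targeting a span inside divs to scrape from Google search results

I'm trying to build a scraper in Python, but I can't target a span element nested inside multiple divs. The URL is a Google search result, so let's use parking as an example: https://www.go...

python beautifulsoup

Answers: 3, Votes: 0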

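How can I automatically scrape all PDF files stored in a Power BI tool embedded in a website?

So, as the title says, I want to automatically scrape all the PDF files stored in a Power BI tool embedded in a website. The website is as follows: website link. To download each file you need...

python selenium-webdriver web-scraping beautifulsoup automation

Answers: 1, Votes: 0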

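How do I extract text from nested tags in a BeautifulSoup loop?

I'm trying to scrape metadata from https://yellowpages.com.eg/en/category/abrasives using Selenium and BeautifulSoup. I can extract some of the data successfully, but I'm having trouble getting the text...

python selenium-webdriver beautifulsoup

Answers: 1, Votes: 0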

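Replacing one website's <body> with another's using Beautiful Soup / Python

I'm trying to replace a tag and everything below it with another tag and its contents. **** Code **** from bs4 import BeautifulSoup as bs import os import re # Remove the last segment of ...

python beautifulsoup replace

Answers: 2, Votes: 0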

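BeautifulSoup not reading the page

I have this simple page, and I'm using Selenium and BeautifulSoup. As far as I can tell, the page is loaded with JavaScript. There is a Load More button, so it clicks until the button no longer appears

python selenium-webdriver web-scraping beautifulsoup

Answers: 1, Votes: 0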

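I want to get the image URL from an Instagram post URL

For example, here is a post ID: https://www.instagram.com/p/C8_ohdOR/ I want the image source. First I log in using Selenium, then scrape the image src. That way I get the src. But this is...

python web-scraping beautifulsoup instaloader

Answers: 1, Votes: 0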

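Using BeautifulSoup to alert me when a job becomes available

I'm trying to create an alert for when the company Northvolt (https://northvolt.com/career) lists a specific position. The position is titled "Energy Coordinator". In...

python beautifulsoup alert

Answers: 2, Votes: 0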

BeautifulSoup: NoneType object has no attribute 'find_all'

Following this tutorial, https://www.scrapingdog.com/blog/scrape-indeed-using-python/, I ran into this error: Traceback (most recent call last): File "C:/Users/det-lab/Documents/

python html web-scraping beautifulsoup

Answers: 1, Votes: 0
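This error usually means that find() matched nothing and returned None, and find_all was then called on that None. A minimal sketch of guarding against it follows; the HTML snippet, the id values, and the helper name job_titles are illustrative assumptions, not taken from the tutorial.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched page.
html = """
<div id="results">
  <div class="job">Data Analyst</div>
  <div class="job">Data Engineer</div>
</div>
"""


def job_titles(markup, container_id):
    """Return job titles inside the container, or [] if the container is missing."""
    soup = BeautifulSoup(markup, "html.parser")
    container = soup.find("div", id=container_id)
    if container is None:
        # find() returns None when nothing matches; bail out instead of
        # calling find_all on None.
        return []
    return [d.get_text(strip=True) for d in container.find_all("div", class_="job")]


titles_found = job_titles(html, "results")
titles_missing = job_titles(html, "no-such-id")
print(titles_found)
print(titles_missing)
```

When the container is unexpectedly missing on a live site, printing the fetched HTML is a quick way to see whether the server returned a block page or a JavaScript-only shell instead of the expected markup.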

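Web scraping with Python BeautifulSoup in the Spyder IDE

I'm trying to scrape a table from the following URL in the Spyder IDE. Below is my code so far. I inspected the HTML code to find the table class and any th, tr, and td tags. But extracting...

python beautifulsoup anaconda spyder

Answers: 1, Votes: 0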

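How do I use BeautifulSoup correctly to scrape elements?

I don't come from a web design or website/HTML background and am new to this field. I'm trying to scrape the elements containing containers/cards from this link. I tried the code below and had some success...

python html css beautifulsoup

Answers: 1, Votes: 0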

Extracting URLs from a sitemap with Python

I need to extract the links in the sitemap https://wunder.com.tr/sitemap.xml I wrote some code: import requests from bs4 import BeautifulSoup wunder = requests.get("https://wunder.com.tr/sitemap.xml&...

python python-3.x beautifulsoup request

Answers: 2, Votes: 0
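A sitemap is plain XML, so one way to pull out its URLs is to read every loc element. The sketch below uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, and runs on an inline sample instead of the real site; the sample content and the helper name extract_urls are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content standing in for the downloaded file.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


urls = extract_urls(sample_sitemap)
print(urls)
```

For a live sitemap, the downloaded response body would be passed to extract_urls in place of sample_sitemap.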

Scraping a hulkapps table using Selenium or Beautiful Soup

I'm trying to scrape the following URL: https://papemelroti.com/products/live-free-badge but it can't seem to find this table class:

    <table class="hulkapps-table table"><thead><tr><th style="border-top-left-radius: 0px;">Quantity</th><th style="border-top-right-radius: 0px;">Bulk Discount</th><th style="display: none">Add to Cart</th></tr></thead><tbody><tr><td style="border-bottom-left-radius: 0px;">Buy 50 + <span class="hulk-offer-text"></span></td><td style="border-bottom-right-radius: 0px;"><span class="hulkapps-price"><span class="money"><span class="money"> ₱1.00 </span></span> Off</span></td><td style="display: none;"><button type="button" class="AddToCart_0" style="cursor: pointer; font-weight: 600; letter-spacing: .08em; font-size: 11px; padding: 5px 15px; border-color: #171515; border-width: 2px; color: #ffffff; background: #161212;" onclick="add_to_cart(50)">Add to Cart</button></td></tr></tbody></table>

I already have my Selenium code, but it still isn't scraping it. Here is my code:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from bs4 import BeautifulSoup
    import time

    # Set up Chrome options
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")

    service = Service('/usr/local/bin/chromedriver')  # Adjust path if necessary
    driver = webdriver.Chrome(service=service, options=chrome_options)

    def get_page_html(url):
        driver.get(url)
        time.sleep(3)  # Wait for JS to load
        return driver.page_source

    def scrape_discount_quantity(url):
        page_html = get_page_html(url)
        soup = BeautifulSoup(page_html, "html.parser")

        # Locate the table containing the quantity and discount
        table = soup.find('table', class_='hulkapps-table')
        print(page_html)
        if table:
            table_rows = table.find_all('tr')
            for row in table_rows:
                quantity_cells = row.find_all('td')
                if len(quantity_cells) >= 2:  # Check if there are at least two cells
                    quantity_cell = quantity_cells[0].get_text(strip=True)  # Get quantity text
                    discount_cell = quantity_cells[1].get_text(strip=True)  # Get discount text
                    return quantity_cell, discount_cell
        return None, None

    # Example usage
    url = 'https://papemelroti.com/products/live-free-badge'
    quantity, discount = scrape_discount_quantity(url)
    print(f"Quantity: {quantity}, Discount: {discount}")

    driver.quit()  # Close the browser when done

It keeps returning None.

FYI: the discount data is loaded from the https://volumediscount.hulkapps.com/api/v2/shop/get_offer_table API endpoint. When you return the page source with Selenium's driver.page_source, there is no such table for bs4 to scrape. I tried your code and confirmed that hulkapps-table is not present in the response, so the result is naturally None.

My answer: I used the https://volumediscount.hulkapps.com/api/v2/shop/get_offer_table API endpoint together with the product_id from this request: https://papemelroti.com/products/live-free-badge.json. Here is my code; it is basic:

    import requests
    import json

    def getDiscount(root_url):
        prod_resp = requests.get(f'{root_url}.json').content  # Get product_id
        prod_id = json.loads(prod_resp)['product']['id']
        disc_url = 'https://volumediscount.hulkapps.com/api/v2/shop/get_offer_table'  # Discount URL
        data = f'pid={prod_id}&store_id=papemelroti.myshopify.com'
        headers = {
            "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0",
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
        }
        resp = requests.post(disc_url, data=data, headers=headers).content
        data_json = json.loads(resp)
        disc_json = json.loads(data_json['eligible_offer']['offer_levels'])[0]
        # Offer has two variants, 'Price' and 'Off', so you can use a condition
        # if you want to scrape products other than 'live-free-badge'
        if 'price_discount' in disc_json[2]:
            print(f"Product ID:{prod_id} (Quantity: {disc_json[0]}, Discount: {disc_json[1]} Price discount)")
        elif 'Off' in disc_json[2]:
            print(f"Product ID:{prod_id} (Quantity: {disc_json[0]}, Discount: {disc_json[1]}% Off)")

    # Sample for both 'Off' and 'Price'
    getDiscount('https://papemelroti.com/products/dear-me-magnet')
    getDiscount('https://papemelroti.com/products/live-free-badge')

Output:

    Product ID:7217967726790 (Quantity: 50, Discount: 10% Off)
    Product ID:104213217289 (Quantity: 50, Discount: 1.00 Price discount)

Let me know whether this works for you or whether you want to strictly use Selenium.

python selenium-webdriver beautifulsoup

Answers: 1, Votes: 0

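Using BeautifulSoup, how do I select tags that have no children?

The HTML is as follows: "I don't want this" I'm trying to get all the divs and cast...

python web-scraping beautifulsoup

Answers: 3, Votes: 0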
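One possible reading of "no children" is "contains no nested tags". Under that assumption, a minimal sketch is to keep only the divs for which find(True) returns None, since find(True) matches any descendant tag. The sample HTML and id values are hypothetical; the question's own markup did not survive in the excerpt.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one wrapper div, one leaf div, one div holding a span.
html = """
<div id="outer">
  <div id="inner-a">leaf text</div>
  <div id="inner-b"><span>has a child tag</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find(True) returns the first descendant tag of any name, or None if the
# element contains only text (or nothing at all).
leaf_divs = [div for div in soup.find_all("div") if div.find(True) is None]
leaf_ids = [div["id"] for div in leaf_divs]
print(leaf_ids)
```

Text nodes do not count as tags here, so a div containing only text is still treated as childless.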

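Using Python 3.9, how do I get all the physical addresses from the URL https://www.tamoil.ch/en/store-locator into MS Excel?

I want to get all the physical addresses from this URL [https://www.tamoil.ch/en/store-locator] into MS Excel. The spreadsheet has only the headers and no output from the code. import requests from...

python html pandas web-scraping beautifulsoup

Answers: 1, Votes: 0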

How do I pretty-print HTML in Python while keeping <tr> children </tr> on one line?

I want to pretty-print HTML while keeping <tr> children </tr> on one line. The HTML string looks like this:

    html = '''<html><body><h1>hello world</h1><table><tr><td>1 STRING</td><td>2 STRING</td><td>3 STRING</td></tr></table></body></html>'''

I tried to solve the problem with bs4's prettify, but it did not give the right result.

    from bs4 import BeautifulSoup

    # Original HTML string
    html = '''<html><body><h1>hello world</h1><table><tr><td>1 STRING</td><td>2 STRING</td><td>3 STRING</td></tr></table></body></html>'''

    soup = BeautifulSoup(html, 'html.parser')
    prettified_html = soup.prettify()

    for tr in soup.find_all('tr'):
        inline_tr = f"<tr>{''.join(str(td) for td in tr.find_all('td'))}</tr>"
        prettified_html = prettified_html.replace(str(tr), inline_tr)

    print(prettified_html)

Output:

    <html>
     <body>
      <h1>
       hello world
      </h1>
      <table>
       <tr>
        <td>
         1 STRING
        </td>
        <td>
         2 STRING
        </td>
        <td>
         3 STRING
        </td>
       </tr>
      </table>
     </body>
    </html>

Desired output:

    <tr><td>1 STRING</td><td>2 STRING</td><td>3 STRING</td></tr>

I'm open to using any Python package to solve the problem.

Just use a regular expression to remove the '\n' characters between the <tr> and </tr> tags:

    from bs4 import BeautifulSoup
    import re

    # Original HTML string
    html = '''<html><body><h1>hello world</h1><table><tr><td>1 STRING</td><td>2 STRING</td><td>3 STRING</td></tr></table></body></html>'''

    soup = BeautifulSoup(html, 'html.parser')
    prettified_html = soup.prettify()

    def remove_newlines_in_tr(match):
        # Strip each line of the matched <tr>...</tr> block and join them
        tr_content = match.group(0)
        lines = tr_content.split('\n')
        lines = [line.strip() for line in lines]
        tr_content = ''.join(lines)
        return tr_content

    # Non-greedy match across newlines for each <tr>...</tr> block
    pattern = re.compile(r'<tr>.*?</tr>', re.DOTALL)
    html_inline_tr = pattern.sub(remove_newlines_in_tr, prettified_html)

    print(html_inline_tr)

python-3.x beautifulsoup pretty-print

Answers: 1, Votes: 0

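How do I scrape patent links from Espacenet using Python?

I need to scrape patent links from the search results on Espacenet. Because Espacenet is a dynamic website, the simple approach using Beautiful Soup and Requests doesn't work. I tried using Selenium tog...

python web-scraping beautifulsoup

Answers: 1, Votes: 0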
