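I'm trying to scrape and repurpose the news images and headlines from a news feed page so that I can reuse them in a signage display (Xibo). Basically I only want the first three rows of this URL's main content, without any header/footer information and without extra code, scripts, etc. Just the medium-sized pictures and the headline below each one. The plan is to scrape the images and headlines, then use Flask to render a simple HTML page once a day for the CMS to read. https://news.clemson.edu/tag/extension/
It seems that in this case I need Selenium to get the rendered page? In the code below I'm having trouble locating the image URLs correctly. It loads the page and scrolls, but finds no images. I also tried some of the nested divs, with no luck. Can anyone point me in the right direction for getting the image URLs (and eventually the headlines)?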
#News feed test for Xibo Signage
#from flask import Flask, render_template
from markupsafe import Markup
#app=Flask(__name__)
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import requests
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#installed chrome driver in scripts so don't need next lines?
#chromedriver_path = '...'
driver = webdriver.Chrome()
url = "https://news.clemson.edu/tag/extension/"
driver.get(url)
# wait (up to 20 seconds) until the images are visible on page
images = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "site-main")))
# scroll to the last image, so that all images get rendered correctly
driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', images[-1])
time.sleep(2)
# PRINT URLS USING SELENIUM -for test (will pass to Flask)
print('Selenium')
for img in images:
    print(img.get_attribute('src'))
#@app.route('/')
#def home():
# return render_template('home.html',thumbnailmk=thumbnailmk)
#if __name__ == '__main__':
# app.run(host='0.0.0.0')
# app.run(debug=True)
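The problem here is that you are not selecting any images: the locator matches elements with the class site-main, which is the content container rather than the img elements, so reading src from it gives you nothing useful. Try changing your strategy and focus on what you actually want to target: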
for e in driver.find_elements(By.CSS_SELECTOR, 'article img'):
    print(e.get_attribute('data-srcset').split()[0])
This example targets the data-srcset attribute and takes the first image URL from it:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
url = "https://news.clemson.edu/tag/extension/"
driver.get(url)
# every <img> inside an <article>; the first token of data-srcset is the first candidate URL
for e in driver.find_elements(By.CSS_SELECTOR, 'article img'):
    print(e.get_attribute('data-srcset').split()[0])
https://news.clemson.edu/wp-content/uploads/2023/04/ag-and-art-scaled.jpg
https://news.clemson.edu/wp-content/uploads/2024/03/AgTech_Forum_FeatureImage.jpg
...
https://news.clemson.edu/wp-content/uploads/2023/09/Cooperative-Extension-RGB-color_featured.jpg
https://news.clemson.edu/wp-content/uploads/2023/09/20141107-simpson-5911-X5.jpg
https://news.clemson.edu/wp-content/uploads/2023/09/TailgateFoodSafety.jpg
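Since the question asks for medium-sized pictures, and taking the first entry of data-srcset just returns whichever candidate happens to be listed first, here is a minimal sketch of picking a candidate by width instead. The helper names parse_srcset and pick_image_url and the 600-pixel default are hypothetical, not part of Selenium or of the page. It assumes the attribute follows the usual "URL 300w, URL 768w" srcset layout and that the URLs themselves contain no commas; it also returns None instead of raising when an image has no data-srcset at all.

```python
def parse_srcset(srcset):
    """Split a srcset string into (url, width) pairs.

    width is None when the candidate has no 'NNNw' width descriptor.
    Assumes candidates are comma-separated and URLs contain no commas.
    """
    candidates = []
    for part in (srcset or "").split(","):
        tokens = part.split()
        if not tokens:
            continue
        url = tokens[0]
        width = None
        if len(tokens) > 1 and tokens[1].endswith("w") and tokens[1][:-1].isdigit():
            width = int(tokens[1][:-1])
        candidates.append((url, width))
    return candidates


def pick_image_url(srcset, target_width=600):
    """Return the candidate URL whose declared width is closest to target_width.

    Falls back to the first URL when no widths are declared, and to None
    when the attribute is missing or empty.
    """
    candidates = parse_srcset(srcset)
    if not candidates:
        return None
    sized = [c for c in candidates if c[1] is not None]
    if not sized:
        return candidates[0][0]
    return min(sized, key=lambda c: abs(c[1] - target_width))[0]
```

In the loop above this would be used as pick_image_url(e.get_attribute('data-srcset')) in place of the split()[0] call, skipping any image for which it returns None.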