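I'm using the Python package selenium to click the "Load More" button automatically, and that works. But why can't I get the data after "Load More"?
I want to scrape reviews from IMDb with Python. The page shows only 25 reviews until I click the "Load More" button. I'm using the Python package selenium to click "Load More" automatically, and that part works. But why can't I get the data that is loaded afterwards, and why do I just keep getting the first 25 reviews over and over?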
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import time

seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed)
PATIENCE_TIME = 60
LOAD_MORE_BUTTON_XPATH = '//*[@id="browse-itemsprimary"]/li[2]/button/span/span[2]'

driver = webdriver.Chrome('D:/chromedriver_win32/chromedriver.exe')
driver.get(seed)

while True:
    try:
        loadMoreButton = driver.find_element_by_xpath("//button[@class='ipl-load-more__button']")
        review_soup = BeautifulSoup(movie_review.text, 'html.parser')
        review_containers = review_soup.find_all('div', class_ ='imdb-user-review')
        print('length: ',len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_ = 'title').text
            print(review_title)
        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break

print("Complete")
I want all the reviews, but right now I only get the first 25.
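There are several problems with your script. Hard-coded waits are very inconsistent and are about the worst option available. The way you wrote the scraping logic inside the while True: loop slows parsing down, because it collects the same items over and over. In addition, each title comes out surrounded by a large amount of whitespace and needs to be stripped properly. I changed your script slightly to reflect the suggestions above.
Try this to get the desired output: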
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

URL = "https://www.imdb.com/title/tt4209788/reviews"

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'lxml')

# Keep clicking "Load More" until the button can no longer be found or clicked.
while True:
    try:
        # find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector,
        # which was removed in Selenium 4.3.
        driver.find_element(By.CSS_SELECTOR, "button#load-more-trigger").click()
        # Wait for the loading indicator to disappear instead of sleeping for a fixed time.
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, ".ipl-load-more__load-indicator")))
        soup = BeautifulSoup(driver.page_source, 'lxml')
    except Exception:
        break

# Print each title from the last parsed page, without surrounding whitespace.
for elem in soup.find_all(class_='imdb-user-review'):
    name = elem.find(class_='title').get_text(strip=True)
    print(name)

driver.quit()
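To show why an explicit wait is more dependable than a fixed time.sleep, here is a minimal sketch of the polling idea behind it. The helper wait_until is hypothetical and written only for illustration; it is not Selenium's implementation. It re-checks a condition at short intervals and returns as soon as the condition holds, so a fast page costs little time while a slow page still gets the full timeout.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper for illustration only; WebDriverWait applies the same
    idea to conditions that are evaluated against the browser.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Simulated "loading indicator" that disappears about 0.3 s from now.
ready_at = time.monotonic() + 0.3


def indicator_gone():
    return time.monotonic() >= ready_at


started = time.monotonic()
wait_until(indicator_gone, timeout=5.0)
# The wait ends shortly after the indicator disappears, not after the full 5 s.
elapsed = time.monotonic() - started
```

A fixed sleep of 5 seconds would always cost 5 seconds, and would still fail on the occasional response that takes longer.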
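Your code is fine. Great, even. But you never fetch the "updated" HTML of the web page after clicking the "Load More" button. That is why you keep listing the same 25 reviews.
When you control the web browser with Selenium, you click the "Load More" button. This triggers an XHR request (more commonly called an AJAX request), which you can see in the "Network" tab of your browser's developer tools.
The bottom line is that JavaScript, running in the web browser, updates the page. In your Python program, however, you fetch the page's HTML only once, statically, with the Requests library.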
seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed) #<-- SEE HERE? This is always the same HTML. You fetched it once at the beginning.
PATIENCE_TIME = 60
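The difference between a one-time fetch and the live page can be sketched without a browser. FakePage below is a toy stand-in invented for this illustration (it is not part of Selenium or Requests): its page_source property serializes the current state, the way driver.page_source does, while the snapshot string plays the role of the text returned once by requests.get.

```python
class FakePage:
    """Toy stand-in for a page whose DOM is changed by JavaScript in the browser."""

    def __init__(self):
        self._reviews = ["Review 1", "Review 2"]

    @property
    def page_source(self):
        # Serializes the DOM as it is right now, every time it is read.
        items = "".join('<div class="review">%s</div>' % r for r in self._reviews)
        return "<html><body>%s</body></html>" % items

    def click_load_more(self):
        # Simulates the AJAX response appending two more reviews to the DOM.
        start = len(self._reviews) + 1
        self._reviews += ["Review %d" % i for i in range(start, start + 2)]


def count_reviews(html):
    """Count review containers in an HTML string."""
    return html.count('class="review"')


page = FakePage()
snapshot = page.page_source  # fetched once, like requests.get(seed).text

page.click_load_more()

stale_count = count_reviews(snapshot)          # the old string never changes
fresh_count = count_reviews(page.page_source)  # re-read after the click
print(stale_count, fresh_count)
```

Parsing snapshot again after the click still yields the original reviews; only a fresh read of the page reflects what the click loaded.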
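To fix this, you need to use Selenium to get the innerHTML of the div that contains the reviews, and then have BeautifulSoup parse that HTML again. We want to avoid fetching the HTML of the entire page again and again, because re-parsing the full updated HTML each time costs computing resources.
So find the div on the page that contains the reviews and parse it again with BeautifulSoup. Something like this should work: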
from selenium.webdriver.common.by import By  # find_element_by_xpath was removed in Selenium 4.3

while True:
    try:
        # Re-read the review container from the live DOM on every pass.
        allReviewsDiv = driver.find_element(By.XPATH, "//div[@class='lister-list']")
        allReviewsHTML = allReviewsDiv.get_attribute('innerHTML')
        loadMoreButton = driver.find_element(By.XPATH, "//button[@class='ipl-load-more__button']")
        review_soup = BeautifulSoup(allReviewsHTML, 'html.parser')
        review_containers = review_soup.find_all('div', class_='imdb-user-review')
        print('length: ', len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_='title').text
            print(review_title)
        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
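One side effect of re-parsing the whole container on every pass is that reviews printed earlier are printed again. A minimal sketch of one way to handle only the newly loaded items follows. The helper new_reviews, the review ids, and the sample titles are all hypothetical and stand in for whatever unique key the real review containers provide.

```python
def new_reviews(reviews, seen_ids):
    """Return the (id, title) pairs not seen before and record their ids."""
    fresh = []
    for review_id, title in reviews:
        if review_id not in seen_ids:
            seen_ids.add(review_id)
            fresh.append((review_id, title))
    return fresh


# Hypothetical data: what the review container holds after each click.
page_states = [
    [("rw1", "Great"), ("rw2", "Boring")],
    [("rw1", "Great"), ("rw2", "Boring"), ("rw3", "Solid"), ("rw4", "Overrated")],
]

seen = set()
printed = []
for state in page_states:
    # Only the reviews added since the previous pass are handled here.
    for _, title in new_reviews(state, seen):
        printed.append(title)
        print(title)
```

In the scraping loop, the same check would sit between parsing the container and printing the titles, so each review is reported once no matter how many times the container is parsed.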