I have a Google Maps scraper. The scraper is supposed to scroll down through the results until there is nothing left to scroll, scrape the data (name, address, etc.) and save it to Excel.
The program does everything except the scrolling part. The scroller works, but it doesn't scroll all the way down (at some point the program stops). No matter how much it scrolls, it always saves 26 results (there are 48 in total).
Here is the part of the code responsible for scrolling:
# Scroll to show more results
divSideBar = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, f"div[aria-label='Results for {service + ' ' + location}']")))
keepScrolling = True
while keepScrolling:
    divSideBar.send_keys(Keys.PAGE_DOWN)
    time.sleep(3)
    html = driver.find_element(By.TAG_NAME, "html").get_attribute('outerHTML')
    if "You've reached the end of the list." in html:
        keepScrolling = False
No matter how much I increase or decrease the time.sleep(3) delay, I always get the same result.
What could be wrong with the code?
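For comparison, a common variant of this loop drives the panel with JavaScript instead of key events, in case the div stops receiving PAGE_DOWN once focus moves elsewhere. A minimal sketch, assuming the divSideBar element found above is itself the scrollable container (it keeps the same end-of-list check as the original):

# Sketch: scroll the results panel via JavaScript instead of PAGE_DOWN.
# Assumes divSideBar is the scrollable results container.
while True:
    driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", divSideBar)
    time.sleep(3)
    html = driver.find_element(By.TAG_NAME, "html").get_attribute('outerHTML')
    if "You've reached the end of the list." in html:
        break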
Here is the complete code so you can run it yourself if needed:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import time
import pandas as pd
URL = "https://www.google.com/maps"
service = "kaufland"
location = "hrvatska"
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(URL)
# Accept cookies
try:
    accept_cookies = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="yDmH0d"]/c-wiz/div/div/div/div[2]/div[1]/div[3]/div[1]/div[1]/form[2]/div/div/button')))
    accept_cookies.click()
except NoSuchElementException:
    print("No accept cookies button found.")
# Search for results and show them
input_field = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="searchboxinput"]')))
input_field.send_keys(service + ' ' + location)
input_field.send_keys(Keys.ENTER)
# Scroll to show more results
divSideBar = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, f"div[aria-label='Results for {service + ' ' + location}']")))
keepScrolling = True
while keepScrolling:
    divSideBar.send_keys(Keys.PAGE_DOWN)
    time.sleep(3)
    html = driver.find_element(By.TAG_NAME, "html").get_attribute('outerHTML')
    if "You've reached the end of the list." in html:
        keepScrolling = False
page_source = driver.page_source
driver.quit()
soup = BeautifulSoup(page_source, "html.parser")
boxes = soup.find_all('div', class_='Nv2PK')
# Collect data
data = []
for box in boxes:
    # Business name
    try:
        business_name = box.find('div', class_='qBF1Pd').getText()
    except AttributeError:
        business_name = "N/A"
    if service.strip().lower() not in business_name.lower():
        continue
    # Address
    try:
        inner_div = box.find_all('div', class_='W4Efsd')[1].find('div', class_='W4Efsd')
        address = [span.text for span in inner_div.find_all('span') if span.text and not span.find('span')][-1]
    except (IndexError, AttributeError):
        address = "N/A"
    # Stars
    try:
        stars = box.find('span', class_='MW4etd').getText()
    except AttributeError:
        stars = "N/A"
    # Number of reviews
    try:
        number_of_reviews = box.find('span', class_='UY7F9').getText().strip('()')
    except AttributeError:
        number_of_reviews = "N/A"
    # Phone number
    try:
        phone_number = box.find('span', class_='UsdlK').getText()
    except AttributeError:
        phone_number = "N/A"
    # Website
    try:
        website = box.find('a', class_='lcr4fd').get('href')
    except AttributeError:
        website = "N/A"
    # Append to data list
    data.append({
        'Business Name': business_name,
        'Address': address,
        'Stars': stars,
        'Number of Reviews': number_of_reviews,
        'Phone Number': phone_number,
        'Website': website
    })
# Create a DataFrame and save to Excel
df = pd.DataFrame(data)
df.to_excel(f'{location}_{service}.xlsx', index=False)
print(f"Data has been saved to {location}_{service}.xlsx")
I don't see anything wrong. This code works fine for me.