Why can't my scraper get all the data from Google Maps?


I have a Google Maps scraper. The scraper is supposed to scroll down through the results until there is nothing left to scroll, scrape the data (name, address, etc.) and save it to an Excel file.

The program does everything except the scrolling part. The scroller works, but it does not scroll all the way to the bottom (at some point it stops). No matter how far it scrolls, it always saves 26 results (there are 48 in total).

Here is the part of the code responsible for scrolling:

# Scroll to show more results
divSideBar = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, f"div[aria-label='Results for {service + ' ' + location}']")))

keepScrolling = True
while keepScrolling:
    divSideBar.send_keys(Keys.PAGE_DOWN)
    time.sleep(3)
    html = driver.find_element(By.TAG_NAME, "html").get_attribute('outerHTML')
    if "You've reached the end of the list." in html:
        keepScrolling = False

No matter how much I increase or decrease time.sleep(3), I always get the same result.

What could be wrong with the code?

Here is the full code, so you can run it yourself if you want:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import time
import pandas as pd

URL = "https://www.google.com/maps"
service = "kaufland"
location = "hrvatska"

driver = webdriver.Chrome()
driver.maximize_window()
driver.get(URL)

# Accept cookies
try:
    accept_cookies = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="yDmH0d"]/c-wiz/div/div/div/div[2]/div[1]/div[3]/div[1]/div[1]/form[2]/div/div/button')))
    accept_cookies.click()
except NoSuchElementException:
    print("No accept cookies button found.")

# Search for results and show them
input_field = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="searchboxinput"]')))
input_field.send_keys(service + ' ' + location)
input_field.send_keys(Keys.ENTER)

# Scroll to show more results
divSideBar = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, f"div[aria-label='Results for {service + ' ' + location}']")))

keepScrolling = True
while keepScrolling:
    divSideBar.send_keys(Keys.PAGE_DOWN)
    time.sleep(3)
    html = driver.find_element(By.TAG_NAME, "html").get_attribute('outerHTML')
    if "You've reached the end of the list." in html:
        keepScrolling = False

page_source = driver.page_source

driver.quit()

soup = BeautifulSoup(page_source, "html.parser")
boxes = soup.find_all('div', class_='Nv2PK')

# Collect data
data = []

for box in boxes:
    # Business name
    try:
        business_name = box.find('div', class_='qBF1Pd').getText()
    except AttributeError:
        business_name = "N/A"

    if service.strip().lower() not in business_name.lower():
        continue

    # Address
    try:
        inner_div = box.find_all('div', class_='W4Efsd')[1].find('div', class_='W4Efsd')
        address = [span.text for span in inner_div.find_all('span') if span.text and not span.find('span')][-1]
    except (IndexError, AttributeError):
        address = "N/A"

    # Stars
    try:
        stars = box.find('span', class_='MW4etd').getText()
    except AttributeError:
        stars = "N/A"

    # Number of reviews
    try:
        number_of_reviews = box.find('span', class_='UY7F9').getText().strip('()')
    except AttributeError:
        number_of_reviews = "N/A"

    # Phone number
    try:
        phone_number = box.find('span', class_='UsdlK').getText()
    except AttributeError:
        phone_number = "N/A"

    # Website
    try:
        website = box.find('a', class_='lcr4fd').get('href')
    except AttributeError:
        website = "N/A"

    # Append to data list
    data.append({
        'Business Name': business_name,
        'Address': address,
        'Stars': stars,
        'Number of Reviews': number_of_reviews,
        'Phone Number': phone_number,
        'Website': website
    })

# Create a DataFrame and save to Excel
df = pd.DataFrame(data)
df.to_excel(f'{location}_{service}.xlsx', index=False)

print(f"Data has been saved to {location}_{service}.xlsx")
python google-maps selenium-webdriver beautifulsoup
1 Answer

I don't see anything wrong. This code works fine for me.
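If the scrolling does stop early on your machine, one thing worth trying (purely a sketch, not a confirmed fix) is to scroll the results panel with JavaScript instead of sending PAGE_DOWN keys. It assumes the same divSideBar element located in the question's code and the same end-of-list text:

# Alternative scrolling loop (sketch): scroll the side panel via JavaScript
# instead of PAGE_DOWN keys. Assumes divSideBar is the results <div> already
# located with WebDriverWait in the question's code.
last_height = 0
while True:
    # Jump to the bottom of the results panel
    driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight;", divSideBar)
    time.sleep(3)
    # Stop once Google Maps shows its end-of-list message
    if "You've reached the end of the list." in driver.page_source:
        break
    # Safety stop: if the panel height no longer grows, nothing more is loading
    new_height = driver.execute_script("return arguments[0].scrollHeight;", divSideBar)
    if new_height == last_height:
        break
    last_height = new_height

Sending PAGE_DOWN can also stop working silently if the side panel loses keyboard focus, which is one reason the JavaScript variant tends to be more reliable for this kind of lazy-loaded list.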
