Python & BeautifulSoup 4 / Selenium - Can't get data from kicksusa.com?

Question (votes: 1, answers: 3)

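I'm trying to scrape data from kicksusa.com and I'm running into some problems.

When I try the basic BS4 approach, like this (the imports are copied and pasted from the main program, which uses all of them):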

import requests
import csv
import io
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from datetime import datetime
from bs4 import BeautifulSoup

data1 = requests.get('https://www.kicksusa.com/')
soup1 = BeautifulSoup(data1.text, 'html.parser')

button = soup1.find('span', attrs={'class': 'shop-btn'}).text.strip()
print(button)

The result is "None", which tells me the data is hidden behind JS. So I tried Selenium, like this:

options = Options()
options.headless = True
options.add_argument('log-level=3')
driver = webdriver.Chrome(options=options)
driver.get('https://www.kicksusa.com/') 
url = driver.find_element_by_xpath("//span[@class='shop-btn']").text
print(url)
driver.close()

I get "Unable to locate element".

Does anyone know how to scrape this site with BS4 or Selenium? Thanks in advance!

python selenium selenium-webdriver web-scraping beautifulsoup
3 Answers
1
vote

The problem is that you are being detected as a bot and receive a response like this:

<html style="height:100%">
    <head>
        <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
        <meta name="format-detection" content="telephone=no">
        <meta name="viewport" content="initial-scale=1.0">
        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
        <script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script>
    </head>
    <body style="margin:0px;height:100%">
    <iframe src="/_Incapsula_Resource?CWUDNSAI=20&xinfo=5-36224256-0%200NNN%20RT%281552245394179%20277%29%20q%280%20-1%20-1%200%29%20r%280%20-1%29%20B15%2811%2c110765%2c0%29%20U2&incident_id=314001710050302156-195663432827669173&edet=15&cinfo=0b000000"
            frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula
        incident ID: 314001710050302156-195663432827669173
    </iframe>
    </body>
</html>
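
A quick way to confirm this from code is to look for the text of the block page in the response body. The following is only a minimal sketch, not an official Incapsula check: the function name `looks_like_incapsula_block` is made up for illustration, and the two marker strings are simply taken from the response shown above, so other sites or later versions of the block page may differ.

```python
def looks_like_incapsula_block(html: str) -> bool:
    """Return True if the HTML resembles the bot-detection page shown above.

    Assumption: the block page contains both "Request unsuccessful" and
    "Incapsula", as in the sample response. A normal page is assumed not
    to contain the first phrase.
    """
    return "Request unsuccessful" in html and "Incapsula" in html


if __name__ == "__main__":
    # Shortened, hand-written samples for demonstration only.
    blocked = "<iframe>Request unsuccessful. Incapsula incident ID: 0</iframe>"
    normal = '<span class="shop-btn">placeholder text</span>'
    print(looks_like_incapsula_block(blocked))
    print(looks_like_incapsula_block(normal))
```

You could call it on `data1.text` from the question before searching for `span.shop-btn`, so that a blocked request is reported as such instead of as a missing element.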

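Requests and BeautifulSoup

If you want to use requests and bs4, copy the visid_incap_ and incap_ses_ cookies from the request headers for www.kicksusa.com shown in your browser's developer tools, and use them in your request: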

import requests
from bs4 import BeautifulSoup

headers = {
    'Host': 'www.kicksusa.com',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/72.0.3626.121 Safari/537.36',
    'DNT': '1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'ru,en-US;q=0.9,en;q=0.8,tr;q=0.7',
    'Cookie': 'visid_incap_...=put here your visid_incap_ value; incap_ses_...=put here your incap_ses_ value',
}

response = requests.get('https://www.kicksusa.com/', headers=headers)

page = BeautifulSoup(response.content, "html.parser")

shop_buttons = page.select("span.shop-btn")
for button in shop_buttons:
    print(button.text)

print("the end")
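
Instead of pasting the whole Cookie header by hand, you could split the header string copied from the developer tools and pass only the two Incapsula cookies through the `cookies=` parameter of `requests.get`. This is a sketch under assumptions: the helper names are hypothetical, the cookie values in the demo are placeholders, and real values have to come from your own browser session.

```python
INCAPSULA_PREFIXES = ("visid_incap_", "incap_ses_")


def parse_cookie_header(raw: str) -> dict:
    """Split a raw 'Cookie' header value into a {name: value} dict."""
    cookies = {}
    for part in raw.split(";"):
        # Split on the first "=" only, since values may contain "=" themselves.
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


def incapsula_cookies(raw: str) -> dict:
    """Keep only the cookies whose names start with the Incapsula prefixes."""
    return {
        name: value
        for name, value in parse_cookie_header(raw).items()
        if name.startswith(INCAPSULA_PREFIXES)
    }


def fetch_home_page(raw_cookie_header: str, user_agent: str) -> str:
    """Fetch the home page with the copied cookies (needs `requests`)."""
    import requests  # imported lazily so the helpers above work without it

    response = requests.get(
        "https://www.kicksusa.com/",
        headers={"User-Agent": user_agent},
        cookies=incapsula_cookies(raw_cookie_header),
        timeout=30,
    )
    return response.text


if __name__ == "__main__":
    # Placeholder values; copy the real ones from your own browser session.
    raw = "visid_incap_000=PLACEHOLDER1; incap_ses_000=PLACEHOLDER2; other=1"
    print(incapsula_cookies(raw))
```

The returned HTML can then be passed to `BeautifulSoup(..., "html.parser")` exactly as in the code above.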

When you run Selenium, you get the same response.

Reloading the page works for me. Try the following code:

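from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.kicksusa.com/')

# The block page carries <META NAME="ROBOTS">; if it is present, reload once.
# Selenium 4 uses find_elements(By..., ...) instead of find_elements_by_css_selector.
if len(driver.find_elements(By.CSS_SELECTOR, "[name=ROBOTS]")) > 0:
    driver.get('https://www.kicksusa.com/')

shop_buttons = driver.find_elements(By.CSS_SELECTOR, "span.shop-btn")
for button in shop_buttons:
    print(button.text)

driver.quit()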
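
The single reload can also be written as a small retry loop. This sketch assumes only that the driver object has a `get` method, as Selenium WebDriver instances do. The function name, the `is_blocked` callback, the attempt count, and the delay are arbitrary choices for illustration, and there is no guarantee that reloading always gets past the bot check.

```python
import time


def load_until_unblocked(driver, url, is_blocked, max_attempts=3, delay=2.0):
    """Load `url`, reloading while `is_blocked(driver)` reports a block page.

    Returns True once a page that is not blocked has loaded,
    False if every attempt was blocked.
    """
    for attempt in range(max_attempts):
        driver.get(url)
        if not is_blocked(driver):
            return True
        # Pause before retrying, but not after the final attempt.
        if attempt < max_attempts - 1:
            time.sleep(delay)
    return False


# Example with a real driver (requires Selenium 4), reusing the ROBOTS check:
#
#   from selenium.webdriver.common.by import By
#   ok = load_until_unblocked(
#       driver,
#       "https://www.kicksusa.com/",
#       is_blocked=lambda d: len(d.find_elements(By.CSS_SELECTOR, "[name=ROBOTS]")) > 0,
#   )
```

Passing the check in as a callback keeps the loop independent of how the block page is recognised, so the detection rule can change without touching the retry logic.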

1
vote

Try the following code. It should return the button's text. Hope this helps.

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--start-maximized')
options.add_argument('--disable-browser-side-navigation')
options.add_argument('--window-size=1920,1080')
# options= replaces the older chrome_options= keyword.
driver = webdriver.Chrome(options=options)
driver.get('https://www.kicksusa.com/')
button = driver.find_element(By.CSS_SELECTOR, "span.shop-btn")
# Read innerHTML via JavaScript; unlike .text, this does not depend on the element being visible.
print(driver.execute_script("return arguments[0].innerHTML", button))
driver.quit()

1
vote

The links you want are duplicated, so you can restrict the match to the first link in each pair with the following CSS selector:

#products-grid .item [href]:first-child

from selenium.webdriver.common.by import By
driver.find_elements(By.CSS_SELECTOR, "#products-grid .item [href]:first-child")
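
If the selector still yields repeated URLs, the links can also be de-duplicated in Python after collecting them. This is an alternative to the CSS approach, not part of the original answer; the function name and the sample URLs are made up for illustration.

```python
def unique_in_order(hrefs):
    """Drop repeated and empty hrefs while keeping first-seen order."""
    # dict.fromkeys keeps insertion order, so the first occurrence wins.
    return list(dict.fromkeys(h for h in hrefs if h))


if __name__ == "__main__":
    # Made-up sample data standing in for the scraped href values.
    sample = [
        "https://example.com/a",
        "https://example.com/a",
        "https://example.com/b",
    ]
    print(unique_in_order(sample))
```

With Selenium, the input list could be built as `[e.get_attribute("href") for e in elements]`, where `elements` is the result of a `find_elements` call.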