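I have a web-scraping project in which I'm trying to scrape some data from a web page. I picked a site called wykop.pl, which is, let's say, the Polish Reddit.
My idea is for Selenium to open the page, accept the cookies, close the ad (if it pops up; it doesn't show up 100% of the time), scroll to the bottom of the page (optional, I don't think it is needed), and then click the next-page button using a CSS selector.
Here is my code: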
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

website = "https://wykop.pl/hity/roku/strona/1"
# relative XPath for the "accept cookies" button
cookies_button_xpath = "//button[contains(@class,'qxOn2zvg e1sXLPUy')]"

# chromepath (path to the chromedriver executable) is defined elsewhere
service_chrome = Service(executable_path=chromepath)
options_chrome = webdriver.ChromeOptions()
driver_chrome = webdriver.Chrome(service=service_chrome, options=options_chrome)  # open Chrome
driver_chrome.maximize_window()  # maximize the browser window
driver_chrome.get(website)  # open the website
time.sleep(3)  # the site can be slow to respond, so wait a couple of seconds

content = driver_chrome.find_element(By.XPATH, cookies_button_xpath)  # find the button
content.click()  # click the button
# WORKS

# next_page_class_next = driver_chrome.find_element_by_css_selector("li.next")
# removed from Selenium, so now it has to be done like this

# a CSS selector to target the next page button with the class "next"
next_page_css_selector = 'next > a'

try:
    # wait for the close button of the ad to be visible
    close_ad_button = WebDriverWait(driver_chrome, 10).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, "button[data-v-6fdb93ea]"))
    )
    # if the ad appears, close it
    close_ad_button.click()
except TimeoutException:
    # the ad did not appear
    pass

# scroll to the bottom of the page
driver_chrome.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# wait for the next page button to be clickable
next_page = WebDriverWait(driver_chrome, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, next_page_css_selector))).click()
This is the error:
---------------------------------------------------------------------------
TimeoutException Traceback (most recent call last)
Cell In[27], line 47
45 driver_chrome.execute_script("window.scrollTo(0, document.body.scrollHeight);")
46 # wait for the next page button to be clickable
---> 47 next_page = WebDriverWait(driver_chrome, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, next_page_css_selector))).click()
File ~\miniconda3\envs\Piotrus\Lib\site-packages\selenium\webdriver\support\wait.py:105, in WebDriverWait.until(self, method, message)
103 if time.monotonic() > end_time:
104 break
--> 105 raise TimeoutException(message, screen, stacktrace)
TimeoutException: Message:
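I tried an XPath-based solution and the problem is the same.
I tried increasing the timeout from 10 seconds to 30, 50, and 70 seconds. Nothing worked.
I tried other variants of the CSS selector, for example
next_page_css_selector = "li.next > a"
It doesn't work.
I know the problem is on my side, and I know I'm close, since accepting the cookies works with the XPath I obtained.
I would be very grateful if you could try to reproduce the code and see what is wrong.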
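One thing worth noting about the start URL: its last path segment (`.../strona/1`) looks like the page number. Assuming later pages follow the same pattern (not verified here), the next page could be opened directly by URL instead of clicking the button. A minimal sketch; `page_url` is a hypothetical helper, not part of Selenium:

```python
from urllib.parse import urlsplit, urlunsplit


def page_url(start_url: str, page: int) -> str:
    """Return start_url with its trailing page number replaced by `page`."""
    parts = urlsplit(start_url)
    # Split the path into everything before the last segment and the segment itself.
    head, _, last = parts.path.rstrip("/").rpartition("/")
    if not last.isdigit():
        raise ValueError(f"URL does not end in a page number: {start_url}")
    return urlunsplit(parts._replace(path=f"{head}/{page}"))


website = "https://wykop.pl/hity/roku/strona/1"
urls = [page_url(website, n) for n in range(1, 4)]
for url in urls:
    # With Selenium, each URL could be passed to driver_chrome.get(url).
    print(url)
```

This avoids depending on the next-page button being present and clickable, at the cost of relying on the URL pattern staying the same.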
To get the links from the different pages, it is easier to use the site's Ajax pagination API, e.g.:
import requests

url = "https://wykop.pl/api/v3/hits/links"
params = {"limit": "20", "page": "1", "sort": "year"}
headers = {
    "Authorization": "Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VybmFtZSI6Inc1Mzk0NzI0MDc0OCIsInVzZXItaXAiOiIxMDUzMTU3MTQ5Iiwicm9sZXMiOlsiUk9MRV9BUFAiXSwiYXBwLWtleSI6Inc1Mzk0NzI0MDc0OCIsImV4cCI6MTcxNDUwMjA5MX0.X2mUIzvmz5FSskFRzuVYX37yAJU9aTlZqI56VqZCvWY"
}

for params["page"] in range(1, 3):  # <-- increase number of pages here
    data = requests.get(url, params=params, headers=headers).json()

    for d in data["data"]:
        print(
            d["votes"]["count"], d["title"], f'{d["votes"]["up"]}/{d["votes"]["down"]}'
        )
        print(d["source"]["url"])
        print()
Prints:
...
5037 Kiedy ekstradycja Sebastiana M. do Polski? 5057/20
https://wykop.pl/artykul/7003275/kiedy-ekstradycja-sebastiana-m-do-polski
5040 Deweloperzy lobbują, aby usunąć wymóg ilości miejsc parkingowych na mieszkanie 5048/8
https://www.money.pl/gospodarka/zmiany-w-lex-deweloper-branza-parkingowy-wymog-musi-zniknac-7000188460038656a.html
5027 TEDE vs PiSowscy, ale to jest piękne xD 5187/160
https://www.threads.net/@lechuczechu/post/C1K9rbwv2dQ
4966 Policjant wyrywa telefon kierowcy niszcząc jego własność, wypiera się, ale wszys 4988/22
https://www.youtube.com/watch?v=Ly5J_46HY_Q
4900 Apel - administracjo zablokuj dodawanie FAME MMA 5272/372
https://wykop.pl/link/7299981/darmowe-fame-mma-reborn-na-tym-dc-https-discord-gg-a5ranypbdv-darmowe-clout-mm