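I am trying to scrape a website and I'm using Selenium to help, but I've run into a problem. I need to go through 150 pages, whose URLs have the form "base_url&page=X". However, when I call driver.get("base_url&page=x"), for some reason it strips off the &page=x.
When I print the link, it shows correctly as "base_url&page=X", but when I click it, it opens base_url. If I copy and paste the link instead, it takes me to the correct page, "base_url&page=X".
Any idea what the problem is or how to fix it?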
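    import time
    from bs4 import BeautifulSoup
    from selenium import webdriver

    # BASE_URL and chrome_options are defined elsewhere in the script.
    def get_page(url):
        # A new browser is started (and closed) for every page.
        DRIVER = webdriver.Chrome(chrome_options=chrome_options)
        DRIVER.get(url)
        time.sleep(2)
        data = DRIVER.page_source
        DRIVER.close()
        return BeautifulSoup(data, "html.parser")

    for i in range(1, 5):
        page_url = BASE_URL + "&page=" + str(i)
        parsed_site = get_page(page_url)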
Stack trace of the timeout I get after trying the answer below:
    Traceback (most recent call last):
      File "/Users/x/PycharmProjects/proj/src/scraper3.py", line 335, in <module>
        sys.exit(main())
      File "/Users/x/PycharmProjects/proj/src/scraper3.py", line 309, in main
        parsed_site = get_next_page(DRIVER, page_url)
      File "/Users/x/PycharmProjects/proj/src/scraper3.py", line 267, in get_next_page
        DRIVER.get(url)
      File "/Users/x/PycharmProjects/proj/venv/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 324, in get
        self.execute(Command.GET, {'url': url})
      File "/Users/x/PycharmProjects/proj/venv/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 312, in execute
        self.error_handler.check_response(response)
      File "/Users/x/PycharmProjects/proj/venv/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.TimeoutException: Message: timeout
      (Session info: chrome=64.0.3282.167)
      (Driver info: chromedriver=2.35.528157 (4429ca2590d6988c0745c24c8858745aaaec01ef),platform=Mac OS X 10.13.3 x86_64)
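I believe the problem is that the domain requires you to visit it first and store some cached data before you start paging through it, but you are opening a new driver every time you move to the next page, so that state is lost. Try this: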
    def get_page(DRIVER, url):
        # Reuse the driver that is passed in, so session state carries over.
        DRIVER.get(url)
        time.sleep(2)
        data = DRIVER.page_source
        return BeautifulSoup(data, "html.parser")

    # Start a single driver and visit the base URL once before paging.
    DRIVER = webdriver.Chrome(chrome_options=chrome_options)
    DRIVER.get(BASE_URL)

    parsedList = []
    for i in range(1, 5):
        page_url = BASE_URL + "&page=" + str(i)
        parsed_site = get_page(DRIVER, page_url)
        parsedList.append(parsed_site)

    for source in parsedList:
        print(source)

    # Close the browser only after every page has been fetched.
    DRIVER.quit()
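As a side note, the string concatenation in the loop only yields a valid URL if BASE_URL already contains a query string. Below is a minimal sketch of building the page URLs with the standard library instead. It assumes Python 3 (the traceback in the question shows Python 2.7, where these functions live in `urlparse` and `urllib`). The `with_page` helper and the example URL are hypothetical names used only for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(base_url, page):
    """Return base_url with its 'page' query parameter set to `page`."""
    parts = urlsplit(base_url)
    # Keep the existing query parameters, dropping any earlier 'page' value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "page"
    ]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical base URL, not the site from the question.
EXAMPLE_BASE_URL = "https://example.com/search?q=shoes"
page_urls = [with_page(EXAMPLE_BASE_URL, i) for i in range(1, 5)]
```

Each entry of `page_urls` could then be passed to `get_page(DRIVER, page_url)` in place of the concatenated string. This does not change the fix above (reusing one driver); it only makes the URL construction less fragile.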
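Edit:
After the original problem was solved, you started running into an issue with the current chromedriver 2.35 and Chrome build 64. The answer to that error is here. Glad I could help.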