How do I use Selenium to iterate through pages and get data from each page?

Question   Votes: 2   Answers: 1

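I want to run a Google search and collect the links of all the hits, so that once I have them I can visit each link and extract data from it. How can I get the link for every hit?

I have tried several approaches, such as a for loop and a while statement. Some examples of my code are shown below. I either get no data at all, or I get data (links) from only one results page. Could someone help me figure out how to iterate over every page of a Google search and collect all the links, so that I can go on to scrape those pages? I am new to Selenium, so I apologize if the code doesn't make much sense; I am really confused by this.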

driver.get('https://www.google.com')
search = driver.find_element_by_name('q')
search.send_keys('condition')
sleep(0.5)
search.send_keys(Keys.RETURN)
sleep(0.5)

# Attempt 1: while loop with try
while True:
    try:
        urls = driver.find_elements_by_class_name('iUh30')
        for url in urls
        urls = [url.text for url in urls]

    sleep(0.5)

    element = driver.find_element_by_id('pnnext')
    driver.execute_script("return arguments[0].scrollIntoView();", element)
    sleep(0.5)
    element.click()

# Attempt 2: single pass, no loop
urls = driver.find_elements_by_class_name('iUh30')
urls = [url.text for url in urls]
sleep(0.5)

element = driver.find_element_by_id('pnnext')
driver.execute_script("return arguments[0].scrollIntoView();", element)
sleep(0.5)
element.click()

# Attempt 3: check for the next-page button on each iteration
while True:
    next_page_btn = driver.find_element_by_id('pnnext')
    if len(next_page_btn) <1:
        print("no more pages left")
        break
    else: 
        urls = driver.find_elements_by_class_name('iUh30')
        urls = [url.text for url in urls]
    sleep(0.5)

    element = driver.find_element_by_id('pnnext')
    driver.execute_script("return arguments[0].scrollIntoView();", element)
    sleep(0.5)
    element.click()

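I expected to end up with a list of all the URLs from the Google search, so that Selenium can open them and get data from those pages.

Instead, I only get the list of URLs from a single page. The next step (scraping those pages) works fine, but because of this limitation I only get 10 results, whereas I want to see all of them.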

python loops selenium web-scraping
1 Answer
Votes: 1

Please try the code below. I changed yours slightly. Hope this helps.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

driver = webdriver.Chrome()
driver.get('https://www.google.com')
# find_element(By...) is used here because the find_element_by_* helpers
# were removed in Selenium 4.
search = driver.find_element(By.NAME, 'q')
search.send_keys('condition')
search.submit()

while True:
    # Read the displayed URLs on the current page first, so the last page
    # is not skipped when there is no "Next" link.
    urls = driver.find_elements(By.XPATH, "//*[@class='iUh30']")
    urls = [url.text for url in urls]
    print(urls)

    # find_elements returns an empty list (no exception) when nothing matches.
    next_page_btn = driver.find_elements(By.XPATH, "//a[@id='pnnext']")
    if len(next_page_btn) < 1:
        print("no more pages left")
        break

    # Wait until the "Next" link is clickable, scroll to it, then click.
    element = WebDriverWait(driver, 5).until(expected_conditions.element_to_be_clickable((By.ID, 'pnnext')))
    driver.execute_script("return arguments[0].scrollIntoView();", element)
    element.click()
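
The loop above prints one list per page. Since the goal in the question is to visit the links afterwards, it may help to accumulate everything into a single list. Below is a minimal sketch of that idea, not a tested scraper: `collect_all_urls` and `max_pages` are names I made up for illustration, and the two locators are simply the ones used above (Google can change them at any time). The function only needs an object with a `find_elements(by, value)` method, so a Selenium driver should fit; `"xpath"` is the string value behind `By.XPATH`.

```python
from typing import List

RESULT_XPATH = "//*[@class='iUh30']"  # result locator from the answer; may change
NEXT_XPATH = "//a[@id='pnnext']"      # "Next" link locator from the answer


def collect_all_urls(driver, max_pages: int = 50) -> List[str]:
    """Walk the result pages and return every non-empty URL text in one list.

    `driver` is assumed to behave like a Selenium WebDriver. `max_pages` is a
    hypothetical safety cap so the loop cannot run forever.
    """
    all_urls: List[str] = []
    for _ in range(max_pages):
        # Read the current page first so the final page is included.
        for element in driver.find_elements("xpath", RESULT_XPATH):
            text = element.text.strip()
            if text:  # skip empty entries
                all_urls.append(text)
        next_links = driver.find_elements("xpath", NEXT_XPATH)
        if not next_links:  # an empty list means there is no next page
            break
        next_links[0].click()
    return all_urls
```

With a real browser you would call `all_urls = collect_all_urls(driver)` after submitting the search. You would probably also want a wait between pages (for example the `WebDriverWait` call shown above) so the next page has loaded before it is read.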

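Output (one list per results page; the URLs appear as Google displays them, so some are shortened with "..."):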

['https://dictionary.cambridge.org/dictionary/english/condition', 'https://www.thesaurus.com/browse/condition', 'https://en.oxforddictionaries.com/definition/condition', 'https://www.dictionary.com/browse/condition', 'https://www.merriam-webster.com/dictionary/condition', 'https://www.collinsdictionary.com/dictionary/english/condition', 'https://en.wiktionary.org/wiki/condition', 'www.businessdictionary.com/definition/condition.html', 'https://en.wikipedia.org/wiki/Condition', 'https://www.definitions.net/definition/condition', '', '', '', '']
['https://www.thefreedictionary.com/condition', 'https://www.thefreedictionary.com/conditions', 'https://www.yourdictionary.com/condition', 'https://www.foxnews.com/.../woman-battling-rare-suicide-disease-says-chronic-pain-con...', 'https://youngminds.org.uk/find-help/conditions/', 'www.road.is/travel-info/road-conditions-and-weather/', 'https://roll20.net/compendium/dnd5e/Conditions', 'https://www.home-assistant.io/docs/scripts/conditions/', 'https://www.bhf.org.uk/informationsupport/conditions', 'https://www.gov.uk/driving-medical-conditions']
['https://immi.homeaffairs.gov.au/visas/already-have.../check-visa-details-and-condition...', 'https://www.d20pfsrd.com/gamemastering/conditions/', 'https://www.ofgem.gov.uk/licences-industry-codes-and.../licence-conditions', 'https://www.healthychildren.org/English/health-issues/conditions/Pages/default.aspx', 'https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_elements.html', 'https://www.ofcom.org.uk/phones-telecoms.../general-conditions-of-entitlement', 'https://www.rnib.org.uk/eye-health/eye-conditions', 'https://www.mdt.mt.gov/travinfo/map/mtmap_frame.html', 'https://www.mayoclinic.org/diseases-conditions', 'https://www.w3schools.com/python/python_conditions.asp']
['https://www.tremblant.ca/mountain-village/mountain-report', 'https://www.equibase.com/static/horsemen/horsemenareaCB.html', 'https://www.abebooks.com/books/rarebooks/...guide/.../guide-book-conditions.shtml', 'https://nces.ed.gov/programs/coe/', 'https://www.cdc.gov/wtc/conditions.html', 'https://snowcrows.com/raids/builds/engineer/engineer/condition/']
['https://www.millenniumassessment.org/en/Condition.html', 'https://ghr.nlm.nih.gov/condition', 'horsemen.ustrotting.com/conditions.cfm', 'https://lb.511ia.org/ialb/', 'https://www.nps.gov/deva/planyourvisit/conditions.htm', 'https://www.allaboutvision.com/conditions/', 'https://www.spine-health.com/conditions', 'https://www.tripcheck.com/', 'https://hb.511.nebraska.gov/', 'https://www.gamblingcommission.gov.uk/.../licence-conditions-and-codes-of-practice....']
['https://sports.yahoo.com/andrew-bogut-credits-beer-improved-022043569.html', 'https://ant.apache.org/manual/Tasks/conditions.html', 'https://www.disability-benefits-help.org/disabling-conditions', 'https://www.planningportal.co.uk/info/200126/applications/60/consent_types/12', 'https://www.leafly.com/news/.../qualifying-conditions-for-medical-marijuana-by-state', 'https://www.hhs.gov/healthcare/about-the-aca/pre-existing-conditions/index.html', 'https://books.google.co.uk/books?id=tRcHAAAAQAAJ', 'www.onr.org.uk/documents/licence-condition-handbook.pdf', 'https://books.google.co.uk/books?id=S0sGAAAAQAAJ']
['https://books.google.co.uk/books?id=KSjLDvXH6iUC', 'https://www.arcgis.com/apps/Viewer/index.html?appid...', 'https://www.trappfamily.com/trail-conditions.htm', 'https://books.google.co.uk/books?id=n_g0AQAAMAAJ', 'https://books.google.co.uk/books?isbn=1492586277', 'https://books.google.co.uk/books?id=JDjQ2-HV3l8C', 'https://www.newsshopper.co.uk/.../17529825.teenager-no-longer-in-critical-condition...', 'https://nbcpalmsprings.com/.../bicyclist-who-collided-with-minivan-hospitalized-in-cri...']
['https://www.stuff.co.nz/.../4yearold-christchurch-terrorist-attack-victim-in-serious-but-...', 'https://www.shropshirestar.com/.../woman-in-serious-condition-after-fall-from-motor...', 'https://www.expressandstar.com/.../woman-in-serious-condition-after-fall-from-motor...', 'https://www.independent.ie/.../toddler-rushed-to-hospital-in-serious-condition-after-hit...', 'https://www.nhsinform.scot/illnesses-and-conditions/ears-nose-and-throat/vertigo', 'https://www.rochdaleonline.co.uk/.../teenage-cyclist-in-serious-condition-after-collisio...', 'https://www.irishexaminer.com/.../baby-of-woman-found-dead-in-cumh-in-critical-cond...', 'https://touch.nihe.gov.uk/index/corporate/housing.../house_condition_survey.htm', 'https://www.nami.org/Learn-More/Mental-Health-Conditions', 'https://www.weny.com/.../update-woman-in-critical-but-stable-condition-after-being-s...']
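
Note that the lists above contain empty strings, entries shortened with "...", and a few entries without a scheme, so not every item can be opened directly. A small clean-up step, sketched below, could filter them before the scraping stage. `usable_urls` is a hypothetical helper name, and prefixing `http://` to bare hosts is my assumption, not something the sites guarantee.

```python
from typing import Iterable, List


def usable_urls(texts: Iterable[str]) -> List[str]:
    """Keep entries that look like complete URLs, in first-seen order."""
    seen = set()
    result: List[str] = []
    for text in texts:
        text = text.strip()
        if not text or "..." in text:  # empty, or shortened in the display text
            continue
        if not text.startswith(("http://", "https://")):
            text = "http://" + text    # assumption: bare hosts answer over http
        if text not in seen:           # drop duplicates across pages
            seen.add(text)
            result.append(text)
    return result
```

The shortened entries are dropped here rather than repaired. If you need them, reading the `href` attribute of the enclosing link with `get_attribute("href")` is likely more reliable than the displayed text, though the exact locator for that link depends on Google's current markup.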