I am trying to scrape this website using Python and Selenium. However, none of the information I need is on the main page, so how can I click the links in the "Application number" column one by one to go to each page, and then return to the original page?
Here is what I tried:
def getData():
    data = []
    select = Select(driver.find_elements_by_xpath('//*[@id="node-41"]/div/div/div/div/div/div[1]/table/tbody/tr/td/a/@href'))
    list_options = select.options

    for item in range(len(list_options)):
        item.click()
        driver.get(url)
URL: http://www.scilly.gov.uk/planning-development/planning-applications
To open the multiple hrefs within the webtable and scrape them through Selenium, you can use the following solution:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
hrefs = []
options = Options()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\WebDrivers\ChromeDriver\chromedriver_win32\chromedriver.exe')
driver.get('http://www.scilly.gov.uk/planning-development/planning-applications')
windows_before = driver.current_window_handle # Store the parent_window_handle for future use
elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "td.views-field.views-field-title>a"))) # Induce WebDriverWait for the visibility of the desired elements
for element in elements:
    hrefs.append(element.get_attribute("href"))  # Collect the required href attributes and store them in a list
for href in hrefs:
    driver.execute_script("window.open('" + href + "');")  # Open the hrefs one by one through the execute_script method in a new tab
    WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))  # Induce WebDriverWait for number_of_windows_to_be 2
    windows_after = driver.window_handles
    new_window = [x for x in windows_after if x != windows_before][0]  # Identify the newly opened window
    driver.switch_to.window(new_window)  # Switch to the new window
    # Perform your web scraping here
    print(driver.title)  # Print the page title, or do your web scraping
    driver.close()  # Close the child window
    driver.switch_to.window(windows_before)  # Switch back to the parent window handle
driver.quit()  # Quit your program
Console output:
Planning application: P/18/064 | Council of the ISLES OF SCILLY
Planning application: P/18/063 | Council of the ISLES OF SCILLY
Planning application: P/18/062 | Council of the ISLES OF SCILLY
Planning application: P/18/061 | Council of the ISLES OF SCILLY
Planning application: p/18/059 | Council of the ISLES OF SCILLY
Planning application: P/18/058 | Council of the ISLES OF SCILLY
Planning application: P/18/057 | Council of the ISLES OF SCILLY
Planning application: P/18/056 | Council of the ISLES OF SCILLY
Planning application: P/18/055 | Council of the ISLES OF SCILLY
Planning application: P/18/054 | Council of the ISLES OF SCILLY
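If you do not need the extra tabs at all, a minimal variation on the same idea (reusing the hrefs list and driver from the code above) is to navigate to each collected URL directly with driver.get(), which avoids the window-handle switching entirely:
# Sketch: visit each collected href directly in the same window instead of opening new tabs
for href in hrefs:
    driver.get(href)     # load the detail page
    print(driver.title)  # perform your web scraping here
driver.quit()  # quit when done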
What you can do is the following:
from selenium import webdriver
import time

url = "url"

browser = webdriver.Chrome()  # or whatever driver you use
browser.get(url)

# Note: you will need to change the selector to click a different item in the table
browser.find_element_by_css_selector("td.views-field.views-field-title > a").click()
# or use this: browser.find_element_by_xpath("xpath")

time.sleep(5)  # not the best way to do this, but it's simple; just to make sure things load

# It is here that you will be able to scrape the new page. I will not post that, as you can scrape whatever you want.

# When you are done scraping, you can return to the previous page with this:
browser.execute_script("window.history.go(-1)")
Hope this is what you are looking for.
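As a rough sketch only (the CSS selector and the explicit wait are my assumptions, borrowed from the first answer rather than part of this one), the same click-and-go-back pattern can use WebDriverWait instead of time.sleep:
# Sketch: click the first link, scrape, then use window.history.go(-1) to return
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get("http://www.scilly.gov.uk/planning-development/planning-applications")
wait = WebDriverWait(browser, 10)

link = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "td.views-field.views-field-title > a")))
link.click()          # open the first detail page
print(browser.title)  # scrape the detail page here
browser.execute_script("window.history.go(-1)")  # return to the listing page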
When you navigate to the new page, the DOM is refreshed, so you cannot keep using a list of elements you collected earlier. This is my approach for it (I haven't written much code in Python, so the syntax and indentation may be off):
count = driver.find_elements_by_xpath("//table[@class='views-table cols-6']/tbody/tr")  # to count the total number of links
j = 1
while j <= len(count):
    driver.find_element_by_xpath("//table[@class='views-table cols-6']/tbody/tr[" + str(j) + "]/td/a").click()
    # add a wait here
    # do your scrape action here
    driver.find_element_by_xpath("//a[text()='Back to planning applications']").click()  # to go back to the main page
    # add a wait here for the main page to load
    j += 1
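Putting that together, a runnable sketch of the same loop (the explicit waits are my addition; the XPaths are the ones used above) re-locates the link on every pass, since the old references go stale after each navigation:
# Sketch: click each row's link, scrape, go back, and re-locate elements on every iteration
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.scilly.gov.uk/planning-development/planning-applications")
wait = WebDriverWait(driver, 10)

rows = driver.find_elements_by_xpath("//table[@class='views-table cols-6']/tbody/tr")
for j in range(1, len(rows) + 1):
    link = wait.until(EC.element_to_be_clickable(
        (By.XPATH, "//table[@class='views-table cols-6']/tbody/tr[" + str(j) + "]/td/a")))
    link.click()
    print(driver.title)  # do your scrape action here
    back = wait.until(EC.element_to_be_clickable(
        (By.XPATH, "//a[text()='Back to planning applications']")))
    back.click()
driver.quit()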