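I have the following code to scrape questions from Stack Overflow. When I use class names as CSS selectors, for example question-hyperlink is a class name and I convert it to the CSS selector .question-hyperlink, it works fine.
But when I use tag names as CSS selectors in the same way, writing A as .A or DIV as .DIV, it returns this timeout error: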
Traceback (most recent call last):
  File "C:\Users\intel\Desktop\D_scraper2.pyw", line 37, in to_do
    element = WebDriverWait(driver, 5).until(
  File "C:\Users\intel\AppData\Local\Programs\Python\Python38\lib\site-packages\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
My code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv


def to_do():
    # vars...
    csv_file_location = r"C:\Users\intel\Desktop\data_file.csv"
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) ' \
                 'Chrome/80.0.3987.132 Safari/537.36'
    driver_exe = 'chromedriver'
    options = Options()
    options.add_argument("--headless")
    options.add_argument(f'user-agent={user_agent}')
    options.add_argument("--disable-web-security")
    options.add_argument("--allow-running-insecure-content")
    options.add_argument("--allow-cross-origin-auth-prompt")
    url = "https://stackoverflow.com/questions"
    driver = webdriver.Chrome(executable_path=r"C:\Users\intel\Downloads\setups\chromedriver.exe", options=options)
    driver.get(url)
    # CSS selectors being tried: tag names written with a leading dot
    one_ = ".A"
    two_ = ".DIV"
    three_ = ".A"
    try:
        # Wait for the first selector, then collect the text of every match
        element = WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, one_))
        )
        elements_1 = driver.find_elements_by_css_selector(one_)
        web_content_list = []
        for ele in elements_1:
            web_content_dict = {}
            web_content_dict["Title"] = ele.text
            web_content_list.append(web_content_dict)
        element = WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, two_))
        )
        elements_2 = driver.find_elements_by_css_selector(two_)
        for ele2 in elements_2:
            web_content_dict = {}
            web_content_dict["Title2"] = ele2.text
            web_content_list.append(web_content_dict)
        element = WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, three_))
        )
        elements_3 = driver.find_elements_by_css_selector(three_)
        for ele3 in elements_3:
            web_content_dict = {}
            web_content_dict["Title3"] = ele3.text
            web_content_list.append(web_content_dict)
        # One column per selector, appended to the CSV file
        df = pd.DataFrame(web_content_list)
        new_df = pd.DataFrame({'Column 1': df['Title'].dropna(),
                               'Column 2': df['Title2'].dropna(),
                               'Column 3': df['Title3'].dropna()})
        new_df.to_csv(csv_file_location,
                      index=False, mode='a', encoding='utf-8')
        try:
            f = open(csv_file_location)
            print("Done !!!\n"*3)
        except IOError:
            print("File not accessible")
        finally:
            f.close()
            driver.quit()
    except:
        # Fallback: repeat the lookups and overwrite the CSV file
        element = WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, one_))
        )
        elements_1 = driver.find_elements_by_css_selector(one_)
        element = WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, two_))
        )
        elements_2 = driver.find_elements_by_css_selector(two_)
        element = WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, three_))
        )
        elements_3 = driver.find_elements_by_css_selector(three_)
        df = pd.DataFrame({
            "Title1": [ele.text for ele in elements_1],
            "Title2": [ele2.text for ele2 in elements_2],
            "Title3": [ele3.text for ele3 in elements_3],
        })
        df.to_csv(csv_file_location,
                  index=False, mode='w', encoding='utf-8')
        try:
            f = open(csv_file_location)
            print("Done !!!\n"*3)
            # Do something with the file
        except IOError:
            print("File not accessible")
        finally:
            f.close()
            driver.quit()
    finally:
        print("start")


if __name__ == "__main__":
    to_do()
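With class names as CSS selectors I get three columns in the CSV file, and with tag names I expect three columns in the CSV as well. There may be more columns for those tags...
Any help would be appreciated...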
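When you want to search for a tag with a CSS selector, do not put a period in front of the tag name. If you want to match the 'a' anchor tag, you can do it like this: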
one_ = "a"
element = WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, one_))
)
Use the dot only in front of a class name.
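To make the difference concrete, here is a minimal sketch that needs no browser. It uses Python's standard html.parser on a made-up HTML snippet (not Stack Overflow's real markup) and counts matches for two selector forms only: a bare tag name and a single dotted class name. The names SelectorCounter, count_matches and SAMPLE_HTML are hypothetical helpers for illustration and are not part of Selenium.

```python
from html.parser import HTMLParser


class SelectorCounter(HTMLParser):
    """Count start tags matching a tiny subset of CSS selectors.

    Only two forms are understood: a bare tag name ("a") and a single
    class name with a leading dot (".question-hyperlink").
    """

    def __init__(self, selector):
        super().__init__()
        self.selector = selector
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if self.selector.startswith("."):
            # Class selector: compare against the space-separated class list.
            classes = (dict(attrs).get("class") or "").split()
            if self.selector[1:] in classes:
                self.count += 1
        elif tag == self.selector.lower():
            # Type selector: HTMLParser reports tag names in lowercase.
            self.count += 1


def count_matches(html, selector):
    """Return how many elements in html match the selector."""
    parser = SelectorCounter(selector)
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up sample markup, assumed only for this illustration.
SAMPLE_HTML = (
    '<div class="summary">'
    '<a class="question-hyperlink" href="/q/1">First question</a>'
    '<a class="question-hyperlink" href="/q/2">Second question</a>'
    '</div>'
)

if __name__ == "__main__":
    for selector in ("a", ".question-hyperlink", ".A", ".DIV"):
        print(selector, count_matches(SAMPLE_HTML, selector))
```

On this sample, "a" and ".question-hyperlink" each match two elements, while ".A" and ".DIV" match none because no element has a class named A or DIV. In the question's code that is likely why the wait finds nothing and gives up after 5 seconds with a TimeoutException.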