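I am trying to scrape the following page: "https://esco.ec.europa.eu/en/classification/skill_main". In particular, I would like to click all the plus buttons under S - skills until there are no more plus buttons left to click, and then save the page source. While inspecting the page, I found that the plus buttons match the CSS selector ".api_hierarchy.has-child-link", so I tried the following: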
import time

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://esco.ec.europa.eu/en/classification/skill_main")
driver.implicitly_wait(10)
wait = WebDriverWait(driver, 20)

# Define a function to click all expandable "+" buttons
def click_expand_buttons():
    while True:
        try:
            # Find all expandable "+" buttons
            expand_buttons = wait.until(EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, ".api_hierarchy.has-child-link"))
            )
            # If no expandable buttons are found, we are done
            if not expand_buttons:
                break
            # Click each expandable "+" button
            for button in expand_buttons:
                try:
                    driver.implicitly_wait(10)
                    driver.execute_script("arguments[0].click();", button)
                    # Wait for the dynamic content to load
                    time.sleep(1)
                except StaleElementReferenceException:
                    # If the element is stale, we find the elements again
                    break
        except StaleElementReferenceException:
            continue

# Call the function to start clicking "+" buttons
click_expand_buttons()
html_source = driver.page_source
# Save the HTML to a file
with open("/Users/federiconutarelli/Desktop/escodata/expanded_esco_skills_page.html", "w", encoding="utf-8") as file:
    file.write(html_source)
# Close the browser
driver.quit()
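However, the code above keeps closing and reopening the first-level "+" buttons. With my limited scraping knowledge, I suspect this is because I am simply asking Selenium to click plus buttons for as long as plus buttons exist, so once the page is back in its original state the script just keeps going indefinitely. Now my question is: how can I open all the plus buttons (until none remain) only for S - skills: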
<a href="#overlayspin" class="change_right_content" data-version="ESCO dataset - v1.1.2" data-link="http://data.europa.eu/esco/skill/335228d2-297d-4e0e-a6ee-bc6a8dc110d9" data-id="84527">S - skills</a>
?
Thank you in advance, and sorry that I have not gotten any further on my own, but I think I have hit the limit of my scraping knowledge.
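I think this will help you, although I have not tested it. You clearly put effort into your own code.
I am more familiar with XPath, so I changed the CSS selector to an XPath expression.
The rest of the code should stay the same and keep working.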
# Find all expandable "+" buttons
expand_buttons = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, "//div[@class='main_item classification_item' and ./a[text()='S - skills']]//span[@class='api_hierarchy has-child-link']"))
)
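Scoping the selector may not be enough to stop the open/close loop described in the question, because an expanded button might still match the same XPath and get clicked again. Below is a minimal sketch of one way to handle that: remember which buttons were already clicked and only click new ones. The names `scoped_plus_xpath`, `expand_all` and `max_rounds` are made up for illustration, and the XPath is the untested one from above. It also assumes the page keeps the same DOM nodes after a click; if the tree is re-rendered, tracking by element reference would not hold.

```python
from typing import Callable, Hashable, Iterable, Set, Tuple, Type


def scoped_plus_xpath(label: str) -> str:
    """Build the XPath for "+" toggles under the top-level item named `label`.

    Assumes the class names from the answer above and a label without
    single quotes (a quote would break the XPath string literal).
    """
    return (
        "//div[@class='main_item classification_item' "
        f"and ./a[text()='{label}']]"
        "//span[@class='api_hierarchy has-child-link']"
    )


def expand_all(
    find_buttons: Callable[[], Iterable[Hashable]],
    click: Callable[[Hashable], None],
    ignored: Tuple[Type[BaseException], ...] = (),
    max_rounds: int = 1000,
) -> int:
    """Click each button returned by find_buttons() at most once.

    The buttons are re-queried after every round so that toggles revealed
    by earlier clicks are picked up. The loop ends when a round finds no
    button that has not been handled yet. Returns the number of clicks.
    """
    seen: Set[Hashable] = set()
    clicks = 0
    for _ in range(max_rounds):
        fresh = [b for b in find_buttons() if b not in seen]
        if not fresh:
            break
        for button in fresh:
            seen.add(button)  # never click the same toggle twice
            try:
                click(button)
                clicks += 1
            except ignored:
                # e.g. a stale element; it is skipped and the next
                # round re-queries the page anyway
                continue
    return clicks
```

With the `driver` from the question this could be wired up roughly as `expand_all(lambda: driver.find_elements(By.XPATH, scoped_plus_xpath("S - skills")), lambda b: (driver.execute_script("arguments[0].click();", b), time.sleep(1)), ignored=(StaleElementReferenceException,))`. `find_elements` returns an empty list instead of raising when nothing matches, so the loop can actually finish.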