Scraping a hierarchical website for a specific category

Question

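I am trying to scrape the following page: "https://esco.ec.europa.eu/en/classification/skill_main". Specifically, I want to click every plus button under S - skills until there are no more plus buttons to click, and then save the page source. Inspecting the page shows that the plus buttons match the CSS selector ".api_hierarchy.has-child-link", so I tried the following: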


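import time

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager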

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://esco.ec.europa.eu/en/classification/skill_main")
driver.implicitly_wait(10)

wait = WebDriverWait(driver, 20)

# Define a function to click all expandable "+" buttons
def click_expand_buttons():
    while True:
        try:
            # Find all expandable "+" buttons
            expand_buttons = wait.until(EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, ".api_hierarchy.has-child-link"))
            )

            # If no expandable buttons are found, we are done
            if not expand_buttons:
                break

            # Click each expandable "+" button
            for button in expand_buttons:
                try:
                    driver.implicitly_wait(10)
                    driver.execute_script("arguments[0].click();", button)
                    # Wait for the dynamic content to load
                    time.sleep(1)
                except StaleElementReferenceException:
                    # If the element is stale, we find the elements again
                    break
        except StaleElementReferenceException:
            continue

# Call the function to start clicking "+" buttons
click_expand_buttons()

html_source = driver.page_source

# Save the HTML to a file
with open("/Users/federiconutarelli/Desktop/escodata/expanded_esco_skills_page.html", "w", encoding="utf-8") as file:
    file.write(html_source)

# Close the browser
driver.quit()

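However, the code above keeps closing and reopening the first-level plus buttons. With my limited scraping knowledge, I suspect this is because I only tell Selenium to keep clicking plus buttons as long as any exist, so once the page is back in its original state the script simply carries on indefinitely. My question is: how can I open all the plus buttons (until none remain) only for S - skills: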

<a href="#overlayspin" class="change_right_content" data-version="ESCO dataset - v1.1.2" data-link="http://data.europa.eu/esco/skill/335228d2-297d-4e0e-a6ee-bc6a8dc110d9" data-id="84527">S - skills</a>

?

Thank you in advance, and apologies that I could not get further on my own; I think I have hit the limit of my scraping knowledge.

Answer 1

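I think this will help you, although I have not tested it. You clearly put effort into your own code.

I know XPath better, so I changed the CSS selector to an XPath expression.

The rest of your code should stay the same and still work.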

# Find all expandable "+" buttons
expand_buttons = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, "//div[@class='main_item classification_item' and ./a[text()='S - skills']]//span[@class='api_hierarchy has-child-link']"))
    )
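
The XPath above limits the search to the S - skills branch, but the question's loop would still click buttons it has already clicked, which toggles them closed again. One way to avoid that is to remember which buttons were clicked and rescan until nothing new turns up. The following is only a minimal, untested-against-the-live-site sketch: `expand_all`, `S_SKILLS_BUTTONS`, `pause`, `max_rounds` and `key` are hypothetical names introduced here, and it assumes the page keeps existing nodes in the DOM when a branch expands, so that Selenium's per-element `id` stays the same between scans.

```python
import time

# XPath from the answer above: "+" buttons inside the "S - skills" branch only.
S_SKILLS_BUTTONS = (
    "//div[@class='main_item classification_item' and ./a[text()='S - skills']]"
    "//span[@class='api_hierarchy has-child-link']"
)


def expand_all(driver, xpath=S_SKILLS_BUTTONS, pause=1.0, max_rounds=50,
               key=lambda el: el.id):
    """Click every matching button once, rescanning until no unclicked one is left.

    Returns the number of distinct buttons clicked.
    """
    clicked = set()  # keys of buttons already clicked, so none is toggled twice
    for _ in range(max_rounds):  # hard cap so the loop cannot run forever
        progressed = False
        # "xpath" is the string value of Selenium's By.XPATH
        for button in driver.find_elements("xpath", xpath):
            k = key(button)
            if k in clicked:
                continue  # already expanded in an earlier pass
            clicked.add(k)
            driver.execute_script("arguments[0].click();", button)
            time.sleep(pause)  # crude wait for the child nodes to load
            progressed = True
        if not progressed:
            break  # a full scan found nothing new: everything is expanded
    return len(clicked)
```

With the question's setup, `expand_all(driver)` would stand in for `click_expand_buttons()`, followed by reading `driver.page_source` as before. If the site rebuilds the tree after each click, the element ids will change; in that case pass a `key` that reads some stable attribute of the button instead, whichever one the page actually provides.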
