How do I use Selenium to scrape the Notion pages contained in the table cells of a Notion site?


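I want to build a dataset from the content of the Notion pages contained in a table on a Notion site.

I have managed to write a script that does exactly what I want, namely: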

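  1. Open the website
  2. Go to each row of the table
  3. Hover over the right side of the row to make the Open button visible
  4. Click the Open button to open the Notion page
  5. Get the content of the Notion page
  6. Move to the next cell below and repeat from step 3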

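My problem is that when the page loads, only 26 of the table's 47 rows are visible, or 28 if I enlarge the window. I have tried scrolling to the bottom, but my script still cannot detect more than 28 rows. This is the function I use to locate the elements that are not visible after row 16: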

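import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def extract_content_from_cell(driver: webdriver.Chrome, row_number: int) -> str: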
    """
    Extracts the content from a single table cell and returns the text content.
    """

    print(f"Processing cell {row_number}...")

    cell_xpath = f"//*[@id='notion-app']/div/div[1]/div/div[1]/main/div/div/div[3]/div[2]/div/div/div/div[3]/div[2]/div[{row_number}]/div/div[1]/div/div[2]/div/div"
    print(f"Locating cell {row_number}...")

    try:
        cell_element = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.XPATH, cell_xpath))
        )
        print(f"Cell {row_number} located successfully.")
    except Exception as e:
        print(f"Error locating cell {row_number}: {e}")
        return ""

    # if the row number is greater than 16, start scrolling the notion container
    if row_number > 16:
        for _ in range(10):  # scroll the container up to 10 times, each time by 40 pixels
            try:
                scroll_notion_container(driver, cell_element, 40)
                print(f"Hovered over cell {row_number}.")
                break  # stop scrolling once we successfully hover over the cell
            except Exception as e:
                print(
                    f"Scrolling down to bring cell {row_number} into view: {e}")

    # hover over the cell again after scrolling
    hover_over_element(driver, cell_element)

    # locate and click the 'Open in side peek' button
    print(f"Locating the 'Open in side peek' button for cell {row_number}...")

    try:
        open_button = WebDriverWait(driver, 15).until(
            EC.element_to_be_clickable(
                (By.XPATH, "//div[@aria-label='Open in side peek']"))
        )
        print(
            f"Clicking the 'Open in side peek' button for cell {row_number}...")
        open_button.click()
    except Exception as e:
        print(f"Button not visible for cell {row_number}, error: {e}")
        return ""

    time.sleep(4)

    # extract the text content from the opened page
    print(f"Extracting text from the side page for cell {row_number}...")
    try:
        content_element = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located(
                (By.CLASS_NAME, "notion-page-content"))
        )
        page_text = content_element.text
        print(f"Extracted text content for cell {row_number}.")
        return page_text
    except Exception as e:
        print(f"Error extracting content from cell {row_number}: {e}")
        return ""

My problem may be that these rows are not being located in the first place.

def get_total_rows(driver: webdriver.Chrome,
                   table_xpath: str) -> int:
    """
    Returns the total number of rows in the Notion table.
    """

    print("Determining the total number of rows in the table...")
    rows = driver.find_elements(By.XPATH, table_xpath)
    total_rows = len(rows)
    print(f"Total rows in the table: {total_rows}")
    return total_rows

Does anyone know how I can solve this? I could process the 47 entries by hand, but I want to do the same thing for a table with 400 rows.

python selenium-webdriver web-scraping notion
1 Answer

Try this to make the driver's browser window larger:

driver.set_window_size(1920, 1080)
driver.maximize_window()
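
A larger window only raises the number of rows rendered at once, so it may not be enough for a 400-row table. If the table renders only the rows near the viewport (an assumption, but consistent with seeing 26 to 28 of 47 rows), one option is to scroll in small steps and remember every row seen along the way. The sketch below is not a confirmed fix. `collect_row_ids` and `selenium_callbacks` are hypothetical helpers, and the `data-block-id` attribute and the selectors passed in are assumptions about Notion's markup that should be checked in the browser's developer tools.

```python
import time
from typing import Callable, Iterable, List


def collect_row_ids(
    get_visible_ids: Callable[[], Iterable[str]],
    scroll_down: Callable[[], None],
    max_idle_scrolls: int = 3,
    max_scrolls: int = 500,
) -> List[str]:
    """Scroll step by step and return every row ID seen, in first-seen order."""
    seen: List[str] = []
    known = set()
    idle = 0  # consecutive scroll steps that revealed no new row
    for _ in range(max_scrolls):
        new_found = False
        for row_id in get_visible_ids():
            if row_id and row_id not in known:
                known.add(row_id)
                seen.append(row_id)
                new_found = True
        idle = 0 if new_found else idle + 1
        if idle >= max_idle_scrolls:
            break  # nothing new for a while: assume the end of the table
        scroll_down()
    return seen


def selenium_callbacks(driver, row_css: str, scroller_css: str, step_px: int = 400):
    """Build the two callbacks from a Selenium driver.

    "css selector" is the string value of By.CSS_SELECTOR.
    row_css and scroller_css are assumptions to be adapted to the real page.
    """
    def get_visible_ids():
        rows = driver.find_elements("css selector", row_css)
        # Assumption: each rendered row carries a stable data-block-id.
        return [row.get_attribute("data-block-id") for row in rows]

    def scroll_down():
        scroller = driver.find_element("css selector", scroller_css)
        driver.execute_script(
            "arguments[0].scrollTop += arguments[1];", scroller, step_px
        )
        time.sleep(0.5)  # give the page time to render the next rows

    return get_visible_ids, scroll_down
```

With a real driver this would be called as `collect_row_ids(*selenium_callbacks(driver, row_css, scroller_css))`. Because the loop records stable IDs rather than positional indices, it does not depend on an XPath such as `div[{row_number}]`, which may point at a different row once the rendered set changes.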

Also replace presence_of_element_located with element_to_be_clickable in your waits, and check whether you get more cells.
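
An element can only become clickable once it is displayed, so a row that is still off-screen may need to be scrolled into view before the hover and click can succeed. The loop in the question stops after the first scroll call that does not raise, which means it scrolls at most 40 pixels before hovering. A minimal sketch of a retry loop that scrolls only when the action itself fails is shown below. `act_with_scroll_retries` is a hypothetical helper, not part of Selenium.

```python
from typing import Callable, Tuple, Type


def act_with_scroll_retries(
    action: Callable[[], None],
    scroll_step: Callable[[], None],
    attempts: int = 10,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> bool:
    """Run `action`; each time it raises, scroll one step and try again.

    Returns True as soon as `action` succeeds, False if every attempt failed.
    """
    for _ in range(attempts):
        try:
            action()
            return True
        except retry_on:
            # The target is probably not in view yet: scroll a bit further.
            scroll_step()
    return False
```

With the helpers from the question, the call could look like `act_with_scroll_retries(lambda: hover_over_element(driver, cell_element), lambda: scroll_notion_container(driver, cell_element, 40))`. Narrowing `retry_on` to the specific Selenium exceptions observed in practice would avoid hiding unrelated errors.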
