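I want to create a dataset from the content of the Notion pages listed in a table on a Notion site.
I have written a script that does exactly what I want, namely:
it makes the Open button visible for a row,
then clicks the Open button to open the Notion page.
My problem is that when the page loads, only 26 of the table's 47 rows are visible, or 28 if I enlarge the window. I have tried scrolling to the bottom, but my script still cannot detect more than 28 rows. Here is the function I use to reach the elements that are not visible after row 16: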
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# scroll_notion_container and hover_over_element are helpers from the same script (not shown here).


def extract_content_from_cell(driver: webdriver.Chrome, row_number: int) -> str:
    """
    Extracts the content from a single table cell and returns the text content.
    """
    print(f"Processing cell {row_number}...")
    cell_xpath = f"//*[@id='notion-app']/div/div[1]/div/div[1]/main/div/div/div[3]/div[2]/div/div/div/div[3]/div[2]/div[{row_number}]/div/div[1]/div/div[2]/div/div"
    print(f"Locating cell {row_number}...")
    try:
        cell_element = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.XPATH, cell_xpath))
        )
        print(f"Cell {row_number} located successfully.")
    except Exception as e:
        print(f"Error locating cell {row_number}: {e}")
        return ""
    # if the row number is greater than 16, start scrolling the notion container
    if row_number > 16:
        for _ in range(10):  # scroll the container up to 10 times, each time by 40 pixels
            try:
                scroll_notion_container(driver, cell_element, 40)
                print(f"Hovered over cell {row_number}.")
                break  # stop scrolling once we successfully hover over the cell
            except Exception as e:
                print(
                    f"Scrolling down to bring cell {row_number} into view: {e}")
    # hover over the cell again after scrolling
    hover_over_element(driver, cell_element)
    # locate and click the 'Open in side peek' button
    print(f"Locating the 'Open in side peek' button for cell {row_number}...")
    try:
        open_button = WebDriverWait(driver, 15).until(
            EC.element_to_be_clickable(
                (By.XPATH, "//div[@aria-label='Open in side peek']"))
        )
        print(
            f"Clicking the 'Open in side peek' button for cell {row_number}...")
        open_button.click()
    except Exception as e:
        print(f"Button not visible for cell {row_number}, error: {e}")
        return ""
    time.sleep(4)
    # extract the text content from the opened page
    print(f"Extracting text from the side page for cell {row_number}...")
    try:
        content_element = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located(
                (By.CLASS_NAME, "notion-page-content"))
        )
        page_text = content_element.text
        print(f"Extracted text content for cell {row_number}.")
        return page_text
    except Exception as e:
        print(f"Error extracting content from cell {row_number}: {e}")
        return ""
My problem may be that these rows are never located in the first place. This is the function that counts them:
def get_total_rows(driver: webdriver.Chrome,
                   table_xpath: str) -> int:
    """
    Returns the total number of rows in the Notion table.
    """
    print("Determining the total number of rows in the table...")
    rows = driver.find_elements(By.XPATH, table_xpath)
    total_rows = len(rows)
    print(f"Total rows in the table: {total_rows}")
    return total_rows
Does anyone know how I can solve this? I could process the 47 entries manually, but I want to do the same thing for a table with 400 rows.
Try this to make the driver's browser window bigger:
driver.set_window_size(1920, 1080)
driver.maximize_window()
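To check whether the window size really changes how many rows get detected, a small diagnostic along these lines might help. It is only a sketch: `count_rows_at_sizes` is a hypothetical helper, `row_xpath` stands for whatever XPath you already pass to `get_total_rows`, and the string `"xpath"` is the value of Selenium's `By.XPATH`.

```python
import time
from typing import Dict, Iterable, Tuple


def count_rows_at_sizes(
    driver,
    row_xpath: str,
    sizes: Iterable[Tuple[int, int]],
    settle_seconds: float = 1.0,
) -> Dict[Tuple[int, int], int]:
    """Resize the window to each (width, height) and count the rows found.

    Hypothetical helper: `driver` is expected to behave like a Selenium
    WebDriver (set_window_size and find_elements).
    """
    counts: Dict[Tuple[int, int], int] = {}
    for width, height in sizes:
        driver.set_window_size(width, height)
        # Give the page a moment to re-render after the resize.
        time.sleep(settle_seconds)
        # "xpath" is the same string as selenium's By.XPATH.
        counts[(width, height)] = len(driver.find_elements("xpath", row_xpath))
    return counts
```

For example, `count_rows_at_sizes(driver, table_xpath, [(1280, 720), (1920, 1080)])` returns one count per size. If the count grows with the height, the table is probably only rendering the rows that fit in the viewport.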
Also change the expected condition you wait on from
presence_of_element_located
to
element_to_be_clickable
and check whether you get more cells.
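If a bigger window still leaves rows undetected, one possibility is that the table only renders the rows near the viewport. That is an assumption on my part, although the 26 versus 28 row counts in the question point that way. In that case a single `find_elements` call can never see all 47 rows, and you would have to scroll in small steps and record rows as they appear. Below is a rough sketch of that idea. The function name, the `scroller_css` selector and the `data-block-id` attribute are all assumptions; inspect the page in the browser's dev tools to find the real scroll container and a stable row identifier.

```python
import time
from typing import Dict


def collect_rows_while_scrolling(
    driver,
    row_xpath: str,
    scroller_css: str,
    step_px: int = 400,
    max_steps: int = 200,
    pause_seconds: float = 0.5,
) -> Dict[str, str]:
    """Scroll a container step by step and record every row seen on the way.

    Hypothetical helper: `driver` is expected to behave like a Selenium
    WebDriver (find_elements and execute_script). Returns {row key: row text}.
    """
    # Scrolls the container and reports whether it failed to move any further.
    scroll_js = (
        "const el = document.querySelector(arguments[0]);"
        "if (!el) { return true; }"
        "const before = el.scrollTop;"
        "el.scrollTop = before + arguments[1];"
        "return el.scrollTop === before;"
    )
    seen: Dict[str, str] = {}
    for _ in range(max_steps):
        # "xpath" is the same string as selenium's By.XPATH.
        for row in driver.find_elements("xpath", row_xpath):
            # Assumed identifier attribute; fall back to the row text.
            key = row.get_attribute("data-block-id") or row.text
            if key and key not in seen:
                seen[key] = row.text
        reached_end = driver.execute_script(scroll_js, scroller_css, step_px)
        if reached_end:
            break
        # Let newly scrolled-in rows render before the next pass.
        time.sleep(pause_seconds)
    return seen
```

With this you would compare `len(collect_rows_while_scrolling(driver, table_xpath, "<your scroll container selector>"))` against the 47 rows you expect. To open each page you would do the hover and click inside the loop, while the row is still rendered, instead of addressing rows by index afterwards. Rows that are removed from the DOM between the lookup and the attribute read can raise a stale-element error, which this sketch does not handle.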