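I am using Puppeteer to scrape text from multiple pages: I navigate to each page, select everything with Ctrl+A, copy it with Ctrl+C, and then read the clipboard data. However, I am running into a problem where the text from one page gets mixed up with the text from other pages.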
const ctrlActrlC = async (page: Page) => {
  await page.keyboard.down("ControlLeft");
  await page.keyboard.press("KeyA");
  await page.keyboard.press("KeyC");
  await page.keyboard.up("ControlLeft");
  // Add a slight delay to allow the clipboard to update
  await sleep(100);
};
const copyTextUsingHtml = (page: Page) =>
  page.evaluate(() => {
    let bodyText = "";
    const extractText = (node) => {
      if (node.nodeType === Node.TEXT_NODE) {
        // If it's a text node, append the text content
        bodyText += `${node.textContent.trim()} `;
      } else if (node.nodeType === Node.ELEMENT_NODE) {
        // Only descend into elements that are actually rendered
        const style = window.getComputedStyle(node);
        if (style.display !== "none" && style.visibility !== "hidden") {
          for (const element of node.childNodes) extractText(element);
        }
      }
    };
    // @ts-expect-error defined in browser
    extractText(document.body);
    return bodyText.trim();
  }) as Promise<string>;
const copyAllVisibleText = async (page: Page) => {
  let attempts = 0;
  const maxAttempts = 32;
  await page.waitForSelector("body", { timeout: 10000 });
  await closePopUp(page);
  await page.focus("body");
  while (attempts < maxAttempts) {
    // Try both strategies and return the first result that passes the check
    const texts = await Promise.all([
      withFocusLock(copyText)(page),
      copyTextUsingHtml(page),
    ]);
    for (const text of texts) {
      if (somePredicateOnText(text)) return text;
    }
    await sleep(500);
    attempts++;
  }
  return null;
};
withFocusLock is a browser-level mutex.
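The question does not show how withFocusLock is implemented. As a rough illustration only, a mutex of this kind could be built by chaining promises so that each call waits for the previous one to settle. Everything below is an assumption, not the asker's actual code: the module-level focusQueue variable and the Task type are hypothetical, and this version only serializes calls made within a single Node.js process.

```typescript
// Hypothetical sketch of a promise-chain mutex; not the asker's real withFocusLock.
type Task<A extends unknown[], R> = (...args: A) => Promise<R>;

// Tail of the queue: every new call is chained onto the previous one.
let focusQueue: Promise<unknown> = Promise.resolve();

const withFocusLock =
  <A extends unknown[], R>(task: Task<A, R>): Task<A, R> =>
  (...args: A): Promise<R> => {
    // Start this task only after the previous holder has settled.
    const run = focusQueue.then(() => task(...args));
    // Swallow errors on the queue itself so one failure does not block later callers;
    // the caller still sees the rejection through `run`.
    focusQueue = run.catch(() => undefined);
    return run;
  };
```

With a wrapper like this, the wrapped task would be the place to bring the page to the front, send the keystrokes, and read the clipboard, so that the whole sequence runs while the lock is held.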
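According to MDN, the Clipboard API only works while the page is focused. You must also grant the application the appropriate permissions.
The specification requires that the user has recently interacted with the page before it can read from the clipboard (transient user activation is required).
Here is a minimal working example: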
import puppeteer from "puppeteer";
import { setTimeout } from "node:timers/promises";

async function scrapeWithClipboard(url: string) {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const context = await browser.defaultBrowserContext();
  await context.overridePermissions(url, ['clipboard-read', 'clipboard-write', 'clipboard-sanitized-write']);
  const page = await context.newPage();
  await page.goto(url, { waitUntil: "networkidle0" });

  // Select all text (Ctrl+A)
  await page.keyboard.down("Control");
  await page.keyboard.press("KeyA");
  await page.keyboard.up("Control");

  // Copy selected text (Ctrl+C)
  await page.keyboard.down("Control");
  await page.keyboard.press("KeyC");
  await page.keyboard.up("Control");
  await setTimeout(500);

  // The clipboard api does not allow you to copy, unless the tab is focused.
  await page.bringToFront();
  const clipboardText = await page.evaluate(() => {
    // Copy the selected content to the clipboard
    document.execCommand("copy");
    // Obtain the content of the clipboard as a string
    return navigator.clipboard.readText();
  });

  await browser.close();
  return clipboardText;
}

scrapeWithClipboard("https://example.com").then(console.log);
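Clipboard contents can get mixed up when you have multiple tabs open, or when multiple pages access the clipboard at the same time, because they all share a single system clipboard. So make sure that only one page is active at a time until the whole copy-and-read sequence has finished.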
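One simple way to get that "one page at a time" behavior is to process the URLs in a plain loop instead of Promise.all. The helper below is a minimal sketch: scrapeSequentially is a hypothetical name, and the per-page function is passed in as a parameter so that any function with the shape (url) => Promise<string>, such as scrapeWithClipboard, could be used.

```typescript
// Hypothetical helper: runs `scrapeOne` for each URL strictly one after another.
async function scrapeSequentially<T>(
  urls: readonly string[],
  scrapeOne: (url: string) => Promise<T>,
): Promise<T[]> {
  const results: T[] = [];
  for (const url of urls) {
    // Awaiting inside the loop means the previous page has finished its
    // copy-and-read cycle before the next one touches the clipboard.
    results.push(await scrapeOne(url));
  }
  return results;
}
```

This trades throughput for correctness: pages are never scraped in parallel, so no two of them can write to the clipboard between another page's copy and read.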