How can I scrape multiple pages in parallel using Ctrl+A and Ctrl+C without mixing up the text?


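I am using Puppeteer to scrape text from multiple pages by navigating to each page, selecting everything with Ctrl+A, copying it with Ctrl+C, and then retrieving the clipboard data. However, I am running into a problem where the text from one page gets mixed up with the text from other pages.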

const ctrlActrlC = async (page: Page) => {
  await page.keyboard.down("ControlLeft");
  await page.keyboard.press("KeyA");
  await page.keyboard.press("KeyC");
  await page.keyboard.up("ControlLeft");
  // Add a slight delay to allow the clipboard to update
  await sleep(100);
};


const copyTextUsingHtml = (page: Page) =>
  page.evaluate(() => {
    let bodyText = "";
    const extractText = (node) => {
      if (node.nodeType === Node.TEXT_NODE) {
        // If it's a text node, append the text content
        bodyText += `${node.textContent.trim()} `;
      } else if (node.nodeType === Node.ELEMENT_NODE) {
        const style = window.getComputedStyle(node);
        if (style.display !== "none" && style.visibility !== "hidden") {
          for (const element of node.childNodes) extractText(element);
        }
      }
    };
    // @ts-expect-error defined in browser
    extractText(document.body);
    return bodyText.trim();
  }) as Promise<string>;

const copyAllVisibleText = async (page: Page) => {
  let attempts = 0;
  const maxAttemptes = 32;
  await page.waitForSelector("body", { timeout: 10000 });
  await closePopUp(page);
  await page.focus("body");
  while (attempts < maxAttemptes) {
    for (
      const text of await Promise.all([
        withFocusLock(copyText)(page),
        copyTextUsingHtml(page),
      ])
    ) {
      if (somePredicateOnText(text)) return text;
    }
    await sleep(500);
    attempts++;
  }
  return null;
};

withFocusLock is a browser-level mutex.
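For readers unfamiliar with the idea, here is a minimal sketch of what a wrapper like this could look like. It is not the asker's actual withFocusLock, whose implementation is not shown; the names lockTail and withLock are hypothetical. It chains every call onto a single shared promise so that wrapped functions run strictly one after another.

```typescript
// Hypothetical sketch: a promise-chain mutex shared by every caller.
// Not the withFocusLock from the question, whose code is not shown.
let lockTail: Promise<unknown> = Promise.resolve();

// Wraps an async function so that calls run strictly one after another.
const withLock =
  <A extends unknown[], R>(fn: (...args: A) => Promise<R>) =>
  (...args: A): Promise<R> => {
    // Start fn only after the previous holder has settled.
    const run = lockTail.then(() => fn(...args));
    // Swallow rejections on the chain so a failed call does not block later ones.
    lockTail = run.catch(() => undefined);
    return run;
  };
```

A lock like this only prevents mixing if the whole sequence (focus the page, select, copy, read the clipboard) runs inside the wrapped function. If the clipboard is read after the lock has been released, another page may already have copied over it.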

typescript web-scraping parallel-processing puppeteer
1 Answer

According to MDN, the Clipboard API only works while the page is focused. You must also grant the appropriate permissions to the application.

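The specification requires that the user has recently interacted with the page before it can read from the clipboard (transient user activation is required).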

Here is a minimal working example:

import puppeteer from "puppeteer";
import { setTimeout } from "node:timers/promises";

async function scrapeWithClipboard(url: string) {
  const browser = await puppeteer.launch({
    headless: false,
  });

  const context = await browser.defaultBrowserContext();

  // overridePermissions expects an origin, so derive it from the URL
  await context.overridePermissions(new URL(url).origin, ['clipboard-read', 'clipboard-write', 'clipboard-sanitized-write']);

  const page = await context.newPage();

  await page.goto(url, { waitUntil: "networkidle0" });

  // Select all text (Ctrl+A)
  await page.keyboard.down("Control");
  await page.keyboard.press("KeyA");
  await page.keyboard.up("Control");

  // Copy selected text (Ctrl+C)
  await page.keyboard.down("Control");
  await page.keyboard.press("KeyC");
  await page.keyboard.up("Control");

  await setTimeout(500);

  // The Clipboard API can only be used while the tab is focused.
  await page.bringToFront();
  const clipboardText = await page.evaluate(() => {
    // Copy the selected content to the clipboard
    document.execCommand("copy");
    // Obtain the content of the clipboard as a string
    return navigator.clipboard.readText();
  });

  await browser.close();
  return clipboardText;
}

scrapeWithClipboard("https://example.com").then(console.log);

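Clipboard contents can get mixed up if you have multiple tabs open, or if multiple pages access the clipboard at the same time. You should therefore make sure that only one page is active at a time until the whole copy-and-paste process has finished.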
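To make the race concrete, here is a toy model that does not use Puppeteer at all. It assumes the system clipboard behaves like a single shared variable; sharedClipboard, copyAndRead, scrapeInParallel, and scrapeOneAtATime are hypothetical names used only for illustration.

```typescript
// Toy model only: the system clipboard is modeled as one shared variable.
let sharedClipboard = "";

const pause = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for "select all, copy, wait, read the clipboard" on one page.
const copyAndRead = async (pageText: string): Promise<string> => {
  sharedClipboard = pageText; // Ctrl+A / Ctrl+C overwrites the shared clipboard
  await pause(10); // the delay before reading gives another page time to copy
  return sharedClipboard; // reads whatever was copied last, by any page
};

// Parallel: every copy lands before any read, so all pages read the last copy.
const scrapeInParallel = (pages: string[]) =>
  Promise.all(pages.map(copyAndRead));

// Sequential: each page copies and reads before the next page starts.
const scrapeOneAtATime = async (pages: string[]) => {
  const results: string[] = [];
  for (const pageText of pages) results.push(await copyAndRead(pageText));
  return results;
};
```

In this model, scrapeInParallel(["page A", "page B"]) resolves to ["page B", "page B"], while scrapeOneAtATime(["page A", "page B"]) resolves to ["page A", "page B"]. Navigation and other per-page work can still run in parallel; only the copy-and-read step needs to be serialized.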
