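I'm trying to take products from eBay and open them on Amazon.
So far I have them being searched on Amazon, but I'm struggling to select the products from the search results.
Currently it outputs a blank array and I'm not sure why. I tested the selection in a separate script, without grabTitles and the for loop, so I'm guessing something in those is causing the issue.
Is there something I'm missing here that prevents the data from being returned to prodResults?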
const puppeteer = require('puppeteer');
const URL = "https://www.amazon.co.uk/";
const selectors = {
  searchBox: '#twotabsearchtextbox',
  productLinks: 'span.a-size-base-plus.a-color-base.a-text-normal',
  productTitle: '#productTitle'
};
(async() => {
  const browser = await puppeteer.launch({
    headless: false
  });
  const page = await browser.newPage();
  await page.goto('https://www.ebay.co.uk/sch/jmp_supplies/m.html?_trkparms=folent%3Ajmp_supplies%7Cfolenttp%3A1&rt=nc&_trksid=p2046732.m1684');
  //Get product titles from ebay
  const grabTitles = await page.evaluate(() => {
    const itemTitles = document.querySelectorAll('#e1-11 > #ResultSetItems > #ListViewInner > li > .lvtitle > .vip');
    var items = []
    itemTitles.forEach((tag) => {
      items.push(tag.innerText)
    })
    return items
  })
  //Search for the products on amazon in a new tab for each product
  for (i = 0; i < grabTitles.length; i++) {
    const page = await browser.newPage();
    await page.goto(URL)
    await page.type(selectors.searchBox, grabTitles[i++])
    await page.keyboard.press('Enter');
    //get product titles from amazon search results
    const prodResults = await page.evaluate(() => {
      const prodTitles = document.querySelectorAll('span.a-size-medium.a-color-base.a-text-normal');
      let results = []
      prodTitles.forEach((tag) => {
        results.push(tag.innerText)
      })
      return results
    })
    console.log(prodResults)
  }
})()
There are a few potential issues with the script:
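await page.keyboard.press('Enter'); triggers a navigation, but your code never waits for that navigation to finish before trying to select the result elements. Use waitForNavigation, waitForSelector or waitForFunction (not waitForTimeout).
If you do wait for a navigation, you'll need to use Promise.all to avoid a race condition: start waiting for the navigation and press Enter in the same Promise.all call, as the example further down does.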
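Additionally, you could skip a page load by building the search URL string yourself and going to it directly. This should speed things up significantly.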
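Your code spawns a new page for every item that needs to be processed, but these pages are never closed. I see grabTitles.length as 60, so you'll be opening 60 tabs. That's a lot of wasted resources. On my machine, it would probably hang everything. I suggest making one page and navigating it repeatedly, or closing each page when you're done with it. If you want parallelism, consider a task queue or running a few pages simultaneously.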
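grabTitles[i++]: why increment i here? It's already incremented by the loop, so this appears to skip every other element, unless your selectors have duplicates or you have some other reason to do this.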
span.a-size-medium didn't work for me, which could be locale-specific. The selector that matched for me was "a span.a-size-base-plus.a-color-base.a-text-normal", but you may need to tweak this to taste.
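The direct-URL suggestion above can be sketched roughly as follows. This is only an illustration: the /s?k= path and parameter are an assumption about how Amazon's search URLs currently look (check it in your own browser), and buildSearchUrl is a hypothetical helper name, not something from the original script.

```javascript
// Assumption: Amazon search results are served from /s?k=<query>.
// Neither the path nor the parameter name comes from the original post.
const AMAZON_SEARCH_BASE = "https://www.amazon.co.uk/s";

// Hypothetical helper: turn a product title into a search URL.
// encodeURIComponent escapes spaces, "|" and other reserved characters.
function buildSearchUrl(title) {
  return `${AMAZON_SEARCH_BASE}?k=${encodeURIComponent(title.trim())}`;
}

console.log(buildSearchUrl("Elmex Decays Prevention Toothpaste 2 x 75ml"));
```

In the loop you would then call page.goto(buildSearchUrl(title)) and wait for the results selector, instead of loading the home page, typing and pressing Enter.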
Here's a minimal example. I'll only process the first 2 items from the eBay array, since that part is working fine.
import puppeteer from "puppeteer"; // ^22.10.0
let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  const ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36";
  await page.setExtraHTTPHeaders({"Accept-Language": "en-US,en;q=0.9"});
  await page.setUserAgent(ua);
  const titles = [
    "Chloraethyl | Dr. Henning | Spray 175 ml",
    "Elmex Decays Prevention Toothpaste 2 x 75ml",
  ];
  for (const title of titles) {
    await page.goto("https://www.amazon.co.uk/");
    await page.type("#twotabsearchtextbox", title);
    // Start waiting for the navigation before pressing Enter to avoid a race
    await Promise.all([
      page.waitForNavigation(),
      page.keyboard.press("Enter"),
    ]);
    const titleSel = "a span.a-size-base-plus.a-color-base.a-text-normal";
    await page.waitForSelector(titleSel);
    const results = await page.$$eval(titleSel, els =>
      els.map(el => el.textContent)
    );
    console.log(title, results.slice(0, 5));
  }
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Output:
Chloraethyl | Dr. Henning | Spray 175 ml [
'Chloraethyl | Dr. Henning | Spray 175 ml',
'Wild Fire (Shetland)',
'A Dark Sin: A chilling British detective crime thriller (The Hidden Norfolk Murder Mystery Series Book 8)',
'A POLICE DOCTOR INVESTIGATES: the Sussex murder mysteries (books 1-3)',
'Rites of Spring: Sunday Times Crime Book of the Month (Seasons Quartet)'
]
Elmex Decays Prevention Toothpaste 2 x 75ml [
'Janina Ultra White Whitening Toothpaste (75ml) – Diamond Formula. Extra Strength. Clinically Proven. Low Abrasion. For Everyday Use. Excellent for Stain Removal',
'Elmex Decays Prevention Toothpaste 2 x 75ml',
'Elmex Decays Prevention Toothpaste 2 x 75ml by Elmex',
'Elmex Junior Toothpaste 2 x 75ml',
'Elmex Sensitive Professional 2 x 75ml'
]
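Note that I added the user agent and headers to be able to use headless: true, but that's incidental to the main solution above. You can go back to headless: false, or check out canonical threads like "How to avoid being detected as bot on Puppeteer and Phantomjs?" and "Why does headless need to be false for Puppeteer to work?" if you have further detection issues.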
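If you do want some parallelism without opening 60 tabs at once, one simple option is to work through the titles in small batches. The sketch below is a minimal illustration, not a full task queue; mapInBatches is a hypothetical helper name, and the worker here is a stand-in for the real page logic.

```javascript
// Hypothetical helper: run `worker` over `items`, at most `batchSize` at a time.
// Each batch must finish completely before the next one starts.
async function mapInBatches(items, batchSize, worker) {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    const batchResults = await Promise.all(batch.map(item => worker(item)));
    results.push(...batchResults);
  }
  return results;
}

// Stand-in worker so the sketch runs without a browser. With Puppeteer, the
// worker would open a page, scrape it, and close it in a finally block:
//   const page = await browser.newPage();
//   try { /* goto, waitForSelector, $$eval */ } finally { await page.close(); }
const fakeScrape = async title => title.toUpperCase();

mapInBatches(["a", "b", "c"], 2, fakeScrape).then(out => console.log(out));
```

Batching is cruder than a real queue, because one slow page holds up its whole batch, but it keeps the number of open tabs bounded.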
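You are running into an age-old problem with Puppeteer: knowing when the page has completely finished rendering or loading.
You could try adding the following: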
await page.waitForNavigation({ waitUntil: 'networkidle2' })
await page.waitForTimeout(10000)
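Generally I find that networkidle2 isn't always reliable enough, so I add an arbitrary extra waitForTimeout. You'll need to play around with the timeout value (10000 = 10 seconds) to get what you are looking for. It's not ideal, I know, but I haven't found a better way.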
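One caveat if you are on a recent Puppeteer: as far as I know, page.waitForTimeout was removed in v22, so the call above throws there. A plain-JavaScript delay does the same job. The sleep helper below is a hypothetical name, and the 50 ms value is only for the demo.

```javascript
// Plain-JavaScript stand-in for page.waitForTimeout(ms): resolves after ms.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

(async () => {
  const start = Date.now();
  await sleep(50); // in the scraper this would be e.g. await sleep(10000)
  console.log(`waited about ${Date.now() - start} ms`);
})();
```

A fixed delay is still a guess in either form. Waiting for the specific element you need with page.waitForSelector tends to be both faster and more reliable when you know which selector to expect.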
You can use ecommerce-scraper-js. It lets you do what you need in fewer lines of code, and without having to maintain a parser:
import { amazon, ebay } from "ecommerce-scraper-js";
(async () => {
  // get 5 results for "playstation 5" from ebay
  const ebayProducts = await ebay.getListings("playstation 5", 5);
  // iterate over received products
  for (const product of ebayProducts) {
    // destructure product and get title
    const { title } = product;
    // get 5 results for received title from amazon
    const amazonProducts = await amazon.getListings(title, 5);
    // get titles from received products
    const amazonTitles = amazonProducts.map((el) => el.title);
    console.log(title, amazonTitles);
  }
})();
Example output:
Sony PS5 Console[
"PlayStation 5 Console (PS5)",
"PlayStation 5 Console CFI-1102A",
"Playstation DualSense Wireless Controller",
"Playstation DualSense Charging Station",
"Sony Playstation Portable PSP 3000 Series Handheld Gaming Console System (Pink) (Renewed)"
]
☑️ NEW Sony Playstation PS 5 Digital Edition Console System (SHIPS THE NEXT DAY)[
"PlayStation 5 Console (PS5)",
"Playstation DualSense Wireless Controller",
"PlayStation 5 Console CFI-1102A",
"DualShock 4 Wireless Controller for PlayStation 4 - Glacier White, Case",
"The Last of Us Part I – PlayStation 5"
]
...and other results
You can find more use cases (with examples) in the documentation.
Disclaimer: I'm the author of this package.