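I am scraping Google search results for a given query, but the titles I scrape come back in ISO-8859-1, and I need them in UTF-8 because they are in Spanish. I get the first 10 titles, but some of them show up like this: Software Java | Oracle M�xico
I receive the text/html of the Google search from an axios request and scrape the titles and URLs.
This is what I have tried: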
const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");

async function scrape (req, res) {
    try {
        const query = req.params.query;
        const encodedQuery = encodeURIComponent(query);

        // Set the number of search results you want (e.g., 10 in this case)
        const numResults = 10;
        const url = `https://www.google.com.mx/search?q=${encodedQuery}&start=${numResults}`;

        await axios.get(url, {
            responseType: "arraybuffer",
            headers: {
                "Content-Type": "text/html; charset=UTF-8"
            }
        }).then((response) => {
            console.log(response.headers['content-type']);

            const $ = cheerio.load(response.data, { decodeEntities: false });
            const data = [...$(".egMi0")]
                .map(e => ({
                    title: $(e).find("h3").text().trim(),
                    href: $(e).find("a").attr("href"),
                }));

            console.log(data);
            fs.writeFileSync("test.html", response.data);
        });

        res.status(200).json({
            message: "Scraping successful",
            output: 10,
        });
    } catch (error) {
        // Handle any errors that occurred during the request
        console.error('Error while scraping website:', error.message);
        res.status(500).json({
            message: "Error while scraping website. Contact support.",
            error: "Internal Server Error",
        });
    }
}

module.exports = {
    scrape,
};
Even though I force the content type in the request, printing response.headers['content-type'] still gives text/html; charset=ISO-8859-1.
I think you should use the built-in fetch (undici):
let html = await fetch(url).then(r => r.text())
I think you are following some old tutorial.
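If the server keeps answering with charset=ISO-8859-1, another option is to decode the raw bytes yourself using the charset the response declares, before handing the HTML to cheerio. Below is a minimal sketch of that idea, not a definitive fix. The helper names charsetFromContentType and decodeBody are hypothetical (they are not part of axios or cheerio), and the sketch assumes only ISO-8859-1 and UTF-8 need to be handled, treating anything else as UTF-8.

```javascript
// Hypothetical helper: map a Content-Type header value to a Node Buffer encoding.
// Assumption: only ISO-8859-1 (latin1) and UTF-8 matter; anything else falls back to UTF-8.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || "");
  const label = match ? match[1].toLowerCase() : "utf-8";
  if (label === "iso-8859-1" || label === "latin1") return "latin1";
  return "utf8";
}

// Hypothetical helper: turn the raw response bytes into a JavaScript string.
// `data` can be a Buffer or ArrayBuffer, e.g. what axios returns with
// responseType: "arraybuffer".
function decodeBody(data, contentType) {
  return Buffer.from(data).toString(charsetFromContentType(contentType));
}

// The bytes of "México" as ISO-8859-1 (0xe9 is "é").
const bytes = Buffer.from([0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f]);
console.log(decodeBody(bytes, "text/html; charset=ISO-8859-1")); // México
```

In the scraper from the question, this would presumably be used as cheerio.load(decodeBody(response.data, response.headers["content-type"])). Once decoded, the string is ordinary JavaScript text, so writing it out with fs.writeFileSync("test.html", html, "utf8") stores it as UTF-8.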