Set the axios response encoding to UTF-8 for web scraping in Node.js


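I am scraping Google search results for a given query, but the titles I get back are encoded in ISO-8859-1, and I need them in UTF-8 so that Spanish characters display correctly. I get the first 10 titles, but some of them come out like this:

Software Java | Oracle M�xico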

I receive the text/html of the Google search from an axios request and scrape the titles and URLs from it.

Here is what I have tried:

const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");

async function scrape (req, res) {
    try {
        const query = req.params.query;
        const encodedQuery = encodeURIComponent(query);
        // Set the number of search results you want (e.g., 10 in this case)
        const numResults = 10;
        const url = `https://www.google.com.mx/search?q=${encodedQuery}&start=${numResults}`
        
        await axios.get(url, {
            responseType: "arraybuffer",
            headers: {
                "Content-Type": "text/html; charset=UTF-8"
            }
        }).then((response) => {
            console.log(response.headers['content-type'])
            const $ = cheerio.load(response.data, { decodeEntities: false });
            const data = [...$(".egMi0")]
                .map(e => ({
                title: $(e).find("h3").text().trim(),
                href: $(e).find("a").attr("href"),
                }));
            console.log(data);
            fs.writeFileSync("test.html", response.data);
        })
        
        res.status(200).json({
            message: "Scraping successful",
            output: 10,
        });
      } catch (error) {
        // Handle any errors that occurred during the request
        console.error('Error while scraping website:', error.message);
        res.status(500).json({
            message: "Error while scraping website. Contact support.",
            error: "Internal Server Error",
        });
      }
}

module.exports = {
    scrape,
}

Even though I force the content type in the request, printing response.headers['content-type'] still gives:

text/html; charset=ISO-8859-1

node.js web-scraping axios cheerio
1 Answer

I think you should use the built-in fetch (undici):

let html = await fetch(url).then(r => r.text())

I think you are following an old tutorial.
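
If staying with axios is preferable, another option is to decode the raw bytes yourself using whatever charset the server reports. Note that a Content-Type header on a GET request describes the request body, not the response, so it does not change what the server sends back. The following is only a minimal sketch: charsetFromContentType and decodeBody are hypothetical helper names that are not part of axios or cheerio, and it assumes a Node.js build with full ICU so that TextDecoder accepts the iso-8859-1 label.

```javascript
// Hypothetical helper: pick the charset out of a Content-Type header value,
// falling back to UTF-8 when none is declared.
function charsetFromContentType(contentType) {
    const match = /charset\s*=\s*"?([^;"\s]+)/i.exec(contentType || "");
    return match ? match[1].toLowerCase() : "utf-8";
}

// Hypothetical helper: decode raw response bytes into a JavaScript string
// using the charset the server declared. TextDecoder throws a RangeError
// if the label is not one it recognises.
function decodeBody(bytes, contentType) {
    const decoder = new TextDecoder(charsetFromContentType(contentType));
    return decoder.decode(bytes);
}

// Demo with bytes as a server would send them in ISO-8859-1,
// where "é" is the single byte 0xE9.
const latin1Bytes = Buffer.from("Software Java | Oracle México", "latin1");
const html = decodeBody(latin1Bytes, "text/html; charset=ISO-8859-1");
console.log(html); // prints: Software Java | Oracle México
```

In the question's code, where responseType is "arraybuffer", this would presumably be applied as cheerio.load(decodeBody(response.data, response.headers["content-type"])), so that cheerio receives an already decoded string instead of raw bytes.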
