How do I correctly pass multer file data to the LangChain.js WebPDFLoader?

Question

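I'm using multer in Node.js to handle file uploads. When a PDF file is uploaded, I want to split it into chunks and store those chunks in a vector store for a RAG application (using langchain.js).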

import { WebPDFLoader } from 'langchain/document_loaders/web/pdf';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

// file is provided by multer
const data = file.buffer
const mimetype = file.mimetype

const blob = new Blob([data]);
const loader = new WebPDFLoader(blob, {
    splitPages: false,
});

const docs = await loader.load();

const textSplitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
});
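
For readers unfamiliar with the two splitter options, here is a rough sketch of what chunkSize and chunkOverlap mean, using a plain fixed-size sliding window. This is an assumption-laden illustration, not how RecursiveCharacterTextSplitter works internally (the real splitter also tries to break on separators such as paragraphs and newlines), and chunkWithOverlap is a made-up helper name.

```javascript
// Hypothetical helper: fixed-size sliding window, NOT LangChain's algorithm.
function chunkWithOverlap(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new RangeError('chunkOverlap must be smaller than chunkSize');
  }
  // Each new chunk starts (chunkSize - chunkOverlap) characters after the previous one.
  const step = chunkSize - chunkOverlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// 2500 characters with the same settings as the question (1000 / 200).
const sampleText = 'a'.repeat(2500);
const sampleChunks = chunkWithOverlap(sampleText, 1000, 200);
console.log(sampleChunks.map((c) => c.length)); // [ 1000, 1000, 900 ]
```

Consecutive chunks share their last and first 200 characters, so text that straddles a chunk boundary still appears whole in at least one chunk.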

The loader works as expected when the PDF is fetched from a URL rather than taken from the multer buffer:

const url = "https://dagrs.berkeley.edu/sites/default/files/2020-01/sample.pdf"

const response = await fetch(url);
const data = await response.blob();
console.log(data)
const loader = new WebPDFLoader(data, {
    splitPages: false,
});

When I run console.log(data) in the code above, I get:

Blob { size: 54836, type: 'application/pdf' }

When creating the blob from the multer upload, do I need to include more data in the blob than just the buffer from the multer file object? How would I do that?

node.js blob multer pdf.js langchain-js
1 Answer

You can modify the code to create a Blob object with the correct MIME type:

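// The question's code uses ES module imports, so import Blob the same way
// (Blob is also available as a global in Node.js 18+).
import { Blob } from 'node:buffer';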
// Assuming 'file' is provided by multer
const data = file.buffer;
const mimetype = file.mimetype;

// Create Blob with correct MIME type
const blob = new Blob([data], { type: mimetype });

// Now you can use 'blob' with langchain.js
const loader = new WebPDFLoader(blob, {
    splitPages: false,
});
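
To see what the type option actually changes, here is a small sketch that uses only Node.js built-ins (Blob and Buffer are globals in Node.js 18+). The fakeFile object is a hand-made stand-in for what multer provides, and multerFileToBlob is a hypothetical helper name, not part of multer or LangChain.js. One assumption worth checking in your own setup: file.buffer is only populated when multer uses memory storage; with disk storage the file object has a path instead.

```javascript
// Hypothetical helper: wraps a multer in-memory upload in a typed Blob.
function multerFileToBlob(file) {
  // file.buffer exists only when multer is configured with memoryStorage().
  if (!file || !Buffer.isBuffer(file.buffer)) {
    throw new TypeError(
      'Expected file.buffer to be a Buffer (is multer using memoryStorage?)'
    );
  }
  return new Blob([file.buffer], { type: file.mimetype });
}

// Stand-in for a multer file object; the bytes are only a PDF header, not a real PDF.
const fakeFile = {
  buffer: Buffer.from('%PDF-1.4\n', 'latin1'),
  mimetype: 'application/pdf',
};

const untypedBlob = new Blob([fakeFile.buffer]);
const typedBlob = multerFileToBlob(fakeFile);

console.log(untypedBlob.type === ''); // true: no type was given
console.log(typedBlob.type); // application/pdf
console.log(typedBlob.size === fakeFile.buffer.length); // true: same bytes either way
```

Both Blobs carry identical bytes; only the type property differs. The typed one reports the same type as the Blob { size: ..., type: 'application/pdf' } logged in the fetch() case.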