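I am using multer in Node.js to handle file uploads. When a PDF is uploaded, I want to split it into chunks and store those chunks in the vector store of a RAG application (using langchain.js).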
import { WebPDFLoader } from 'langchain/document_loaders/web/pdf';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
// file is provided by multer
const data = file.buffer
const mimetype = file.mimetype
const blob = new Blob([data]);
const loader = new WebPDFLoader(blob, {
splitPages: false,
});
const docs = await loader.load();
const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000,
chunkOverlap: 200,
});
This approach works as expected when the PDF is fetched from a URL instead of taken from the multer buffer:
const url = "https://dagrs.berkeley.edu/sites/default/files/2020-01/sample.pdf"
const response = await fetch(url);
const data = await response.blob();
console.log(data)
const loader = new WebPDFLoader(data, {
splitPages: false,
});
When I run console.log(data) in the code above, I get:
Blob { size: 54836, type: 'application/pdf' }
When creating the blob from multer, do I need to include more data in the blob than just the buffer from multer.file? How would I do that?
You can modify your code to create the Blob object with the correct MIME type:
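// The question's code uses ES module imports, so import Blob the same way
// (Blob is also available as a global in Node.js 18+)
import { Blob } from 'node:buffer';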
// Assuming 'file' is provided by multer
const data = file.buffer;
const mimetype = file.mimetype;
// Create Blob with correct MIME type
const blob = new Blob([data], { type: mimetype });
// Now you can use 'blob' with langchain.js
const loader = new WebPDFLoader(blob, {
splitPages: false,
});
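To check that the Blob built from the upload looks like the one returned by fetch, it may help to wrap the conversion in a small helper and log the result. The following is only a minimal sketch: multerFileToBlob and fakeFile are hypothetical names, the bytes are a placeholder rather than a real PDF, and it assumes Node.js 18+ (where Blob and Buffer are globals) and multer's memoryStorage, since file.buffer is only populated with that storage engine. Whether the MIME type is actually what WebPDFLoader needs is not confirmed here.

```javascript
// Hypothetical helper: wrap a multer (memoryStorage) file in a typed Blob.
// Assumes Node.js 18+, where Blob and Buffer are available as globals.
function multerFileToBlob(file) {
  // With diskStorage, multer sets file.path and leaves file.buffer undefined.
  if (!file || !Buffer.isBuffer(file.buffer)) {
    throw new TypeError('Expected a multer file with a buffer (memoryStorage)');
  }
  return new Blob([file.buffer], { type: file.mimetype });
}

// Stand-in for the object multer would provide; placeholder bytes, not a real PDF.
const fakeFile = {
  buffer: Buffer.from('%PDF-1.4 placeholder'),
  mimetype: 'application/pdf',
};

const blob = multerFileToBlob(fakeFile);

// Compare against the Blob logged in the fetch example: size and type should both be set.
console.log(blob.size, blob.type);
```

If the logged size is 0 or the helper throws, the upload is probably not reaching the handler as an in-memory buffer, which would point at the multer configuration rather than at the Blob itself.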