I have 2 million articles that I'm splitting into roughly 12 million documents using
langchain
. I want to be able to search these documents, so I'd like to get them all into a Chroma database. What's the fastest way to insert millions of documents into Chroma: inserting everything when the database is created, or calling db.add_documents()
afterwards? Right now I'm inserting in chunks of 100,000 with db.add_documents()
, but each successive call to add_documents
seems to take longer and longer. Should I just try inserting all 12 million chunks at creation time? I have a GPU and plenty of storage. It used to take 30 minutes per 100K, but now adding 100k documents with add_documents
takes over an hour.
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings import SentenceTransformerEmbeddings
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
model_path = "./multi-qa-MiniLM-L6-cos-v1/"
model_kwargs = {"device": "cuda"}
embeddings = SentenceTransformerEmbeddings(model_name=model_path, model_kwargs=model_kwargs)
documents_array = documents[0:100000]
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)
docs = text_splitter.create_documents(documents_array)
persist_directory = "chroma_db"
vectordb = Chroma.from_documents(
    documents=docs, embedding=embeddings, persist_directory=persist_directory
)
vectordb.persist()
vectordb._collection.count()
docs = text_splitter.create_documents(documents[500000:600000])
def batch_process(documents_arr, batch_size, process_function):
    # Walk the document list in fixed-size slices and hand each slice off.
    for i in range(0, len(documents_arr), batch_size):
        batch = documents_arr[i:i + batch_size]
        process_function(batch)

def add_to_chroma_database(batch):
    vectordb.add_documents(documents=batch)

batch_size = 41000
batch_process(docs, batch_size, add_to_chroma_database)
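As a sanity check on where the time goes, here's a minimal, self-contained sketch of the same batching loop with per-batch timing added, so you can confirm whether latency actually grows as the collection fills up. The `slow_insert` stub below is hypothetical, standing in for `vectordb.add_documents`; everything else is stdlib only.

```python
import time

def batch_process_timed(documents_arr, batch_size, process_function):
    """Like batch_process above, but records the wall-clock time of each batch."""
    timings = []
    for i in range(0, len(documents_arr), batch_size):
        batch = documents_arr[i:i + batch_size]
        start = time.perf_counter()
        process_function(batch)
        timings.append(time.perf_counter() - start)
    return timings

# Hypothetical stand-in for vectordb.add_documents, so the sketch runs
# without a Chroma instance. In the real case, pass add_to_chroma_database.
store = []
def slow_insert(batch):
    store.extend(batch)

timings = batch_process_timed(list(range(10_000)), 2_000, slow_insert)
print(len(timings))  # one timing per batch
```

If the printed timings climb steadily from one batch to the next, the cost is in the index growing, not in your loop, and creating the database in one shot likely won't help.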
I'm facing the same problem. Did you ever find a solution?