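I'm currently using SelfQueryRetriever, and my data is stored in a collection on a ChromaDB server. A plain similarity search retrieves the correct answers, but using SelfQueryRetriever on the same collection with the same query returns an empty result.
I have initialized SelfQueryRetriever and its parameters correctly, and it correctly identifies the target collection. Even so, it returns no results. My libraries are up to date. Also, when I test the same code against a persistent ChromaDB instance, it works and returns the expected answers.
The problem seems to be related to my data being stored in a collection on a ChromaDB server.
Any insights or suggestions on how to resolve this would be greatly appreciated. Thanks!
The code I am using is: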
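import chromadb
from decouple import config  # assumption: config() comes from python-decouple


def get_chroma_instance():
    # Connect to the remote Chroma server and fail fast if it is unreachable.
    chroma_client = chromadb.HttpClient(
        host=config("CHROMA_HOST"), port=config("CHROMA_PORT")
    )
    try:
        chroma_client.heartbeat()
    except Exception as exc:
        raise RuntimeError("Chroma server is not running") from exc
    return chroma_client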
# Import paths assume recent LangChain packages; adjust them to the installed version.
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# open_api_key and define_metadata_fields() are assumed to be defined elsewhere.


def retrieve_documents(query, top_data=20):
    try:
        embedding_function = OpenAIEmbeddings(openai_api_key=open_api_key)
        llm = ChatOpenAI(temperature=0, openai_api_key=open_api_key)
        document_content_description = "Search judgements based on metadata"
        metadata_field_info = define_metadata_fields()
        # vectorstore = Chroma("test", OpenAIEmbeddings())
        vectorstore = Chroma(
            client=get_chroma_instance(),
            collection_name="test",
            embedding_function=embedding_function,
        )
        print("\n-----------------------------><--------------------------")
        # vectorstore = vectorstore._collection
        print("\n\n\n\nvectorstore = ", vectorstore._collection)
        print("\n-----------------------------><--------------------------")
        print("\n\n\n\nvectorstore = ", vectorstore)
        retriever = SelfQueryRetriever.from_llm(
            llm,  # You might need to replace with your LLM implementation
            vectorstore,
            document_content_description,
            metadata_field_info,
            search_kwargs={"k": top_data},
        )
        # Invoke retriever with the query
        docs = retriever.invoke(query)
        # Extract document URLs and IDs (modify based on your metadata structure)
        complete_docs = []
        for doc in docs:
            meta = doc.metadata
            # Assuming "source" field holds the document URL
            document_url = meta.get("source")
            if document_url:
                complete_docs.append({"source": document_url})
        return complete_docs
    except Exception as e:
        print("---------------------------------------------\nError in self query: ", e)
        # Return an empty list instead of an implicit None so callers can iterate safely.
        return []
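Since a plain similarity search works but the self-query path returns nothing, it may help to look at the structured query the LLM produces before it reaches Chroma. The following is only a minimal sketch: `explain_self_query` is a hypothetical helper name, and it assumes the retriever exposes a `query_constructor` runnable whose result has `query`, `filter` and `limit` attributes, as `SelfQueryRetriever` does in recent LangChain releases.

```python
def explain_self_query(retriever, query):
    """Return the rewritten query, filter and limit the retriever would use."""
    # Assumption: retriever.query_constructor is a runnable that maps
    # {"query": ...} to a structured query object (true for recent
    # SelfQueryRetriever versions; verify against the installed release).
    structured = retriever.query_constructor.invoke({"query": query})
    return {
        "query": structured.query,
        "filter": structured.filter,
        "limit": structured.limit,
    }
```

Calling `print(explain_self_query(retriever, query))` right before `retriever.invoke(query)` shows which metadata filter is being applied. If the filter refers to a field or value that the documents in the server collection do not carry, an empty result would be expected even though similarity search succeeds.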
Did you remember to create the embeddings first?
It should look something like this:
chroma_client = get_chroma_instance()
vectorstore = Chroma.from_documents(
    documents=docs,  # your data to vectorize
    client=chroma_client,
    collection_name="test",
    embedding=embedding_function,
)
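To check whether the collection on the server actually holds embedded documents, a small diagnostic along these lines may help. It is only a sketch: `summarize_collection` is a hypothetical helper, and it assumes the client returned by `get_chroma_instance()` provides `get_collection`, and that the collection provides `count` and `peek`, as chromadb collections do.

```python
def summarize_collection(client, name, sample_size=3):
    """Report how many records a collection holds and which metadata keys appear."""
    collection = client.get_collection(name)
    total = collection.count()
    # peek() returns a dict of parallel lists; "metadatas" may be None or
    # contain None entries when records were added without metadata.
    sample = collection.peek(limit=sample_size) if total else {}
    metadatas = sample.get("metadatas") or []
    keys = sorted({key for meta in metadatas if meta for key in meta})
    return {"count": total, "metadata_keys": keys}
```

For example, `print(summarize_collection(get_chroma_instance(), "test"))`. A count of 0 suggests the server-side collection was never populated. A non-zero count with no metadata keys suggests the self-query filters have nothing to match against, even though similarity search over the embeddings still works.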