I'm building a chatbot with React, TypeScript, and the OpenAI API. I'm also using Chroma as the database. I want to integrate it with images referenced in Markdown documents. My idea is to include image references in the documents like this:
![Image description](https://link-to-image.com "Related keywords")
When a user asks a question related to these images, the chatbot should:
Answer the question based on the Markdown context.
If the question refers to a specific image, fetch the image URL and display it as part of the response.
My questions are:
Is it better to process the Markdown directly on every request, or to pre-index it in Chroma and map the images to their context? How can I design this flow so that responses are fast and accurate? Any ideas, suggestions, or similar experiences would be much appreciated.
I'm struggling to find the best and most efficient way to approach this.
Very interesting question. I suppose there are many ways to tackle it, but for practicality's sake let's try to keep things simple.
I think you can benefit greatly from first indexing your Markdown files in Chroma, then searching them, and finally asking an LLM (e.g. OpenAI gpt-4o) to generate the Markdown for you: a typical RAG application.
Side note: you can also embed the images themselves for better retrieval context, but for brevity I won't cover that here. Feel free to join the Chroma Discord and we can discuss it further (look for @taz).
My suggestion is to process the MD files, extract each file's images as metadata, and store that in Chroma so it can later be passed to the LLM for generation. Because this is simpler to illustrate in Python, I'll assume you can either port the code below to TS or use a Python backend that handles the ingestion of the Markdown files.
With that out of the way, let's get started. First we'll create a custom Langchain🦜🔗 Markdown loader. We need a custom one because the off-the-shelf loaders can't handle image tags, or at least don't know what to do with them.
from typing import TYPE_CHECKING, Any, Dict, Iterator, List, Optional, Union
import json

from langchain_core.documents import Document
from langchain_community.document_loaders.base import BaseLoader

if TYPE_CHECKING:
    from bs4.element import PageElement


class CustomMDLoader(BaseLoader):
    def __init__(
        self,
        markdown_content: str,
        *,
        images_as_metadata: bool = False,
        beautifulsoup_kwargs: Optional[Dict[str, Any]] = None,
        split_by: Optional[str] = None,
    ) -> None:
        try:
            from bs4 import BeautifulSoup  # noqa: F401
        except ImportError:
            raise ImportError(
                "beautifulsoup4 package not found, please install it with "
                "`pip install beautifulsoup4`"
            )
        try:
            import mistune  # noqa: F401
        except ImportError:
            raise ImportError(
                "mistune package not found, please install it with "
                "`pip install mistune`"
            )
        self._markdown_content = markdown_content
        self._images_as_metadata = images_as_metadata
        self._beautifulsoup_kwargs = beautifulsoup_kwargs or {"features": "html.parser"}
        self._split_by = split_by

    def get_metadata_for_element(self, element: "PageElement") -> Dict[str, Union[str, None]]:
        """Collect the element's <img> tags as a JSON string under the "images" key."""
        metadata: Dict[str, Union[str, None]] = {}
        if hasattr(element, "find_all") and self._images_as_metadata:
            metadata["images"] = json.dumps(
                [{"src": img.get("src"), "alt": img.get("alt")} for img in element.find_all("img")]
            )
        return metadata

    def get_document_for_elements(self, elements: List["PageElement"]) -> Document:
        """Join the elements' text into one Document and merge their image metadata."""
        text = " ".join([el.get_text() for el in elements])
        metadata: Dict[str, Union[str, None]] = {}
        for el in elements:
            new_meta = self.get_metadata_for_element(el)
            if "images" in new_meta and "images" in metadata:
                old_list = json.loads(metadata["images"])
                new_list = json.loads(new_meta["images"])
                metadata["images"] = json.dumps(old_list + new_list)
            if "images" in new_meta and "images" not in metadata:
                metadata["images"] = new_meta["images"]
        return Document(page_content=text, metadata=metadata)

    def split_by(self, parent_page_element: "PageElement", tag: Optional[str] = None) -> Iterator[Document]:
        """Yield one Document per section delimited by `tag` (e.g. "h1"), or a single
        Document for the whole tree when no tag is given or it occurs fewer than twice."""
        if tag is None or len(parent_page_element.find_all(tag)) < 2:
            yield self.get_document_for_elements([parent_page_element])
        else:
            found_tags = parent_page_element.find_all(tag)
            for start_tag, end_tag in zip(found_tags, found_tags[1:]):
                elements_between = []
                # Iterate through siblings of the start tag until the next split tag
                for element in start_tag.next_siblings:
                    if element == end_tag:
                        break
                    elements_between.append(element)
                doc = self.get_document_for_elements(elements_between)
                doc.metadata["split"] = start_tag.get_text()
                yield doc
            # Everything after the last split tag becomes the final section
            last_tag = found_tags[-1]
            elements_between = []
            for element in last_tag.next_siblings:
                elements_between.append(element)
            doc = self.get_document_for_elements(elements_between)
            doc.metadata["split"] = last_tag.get_text()
            yield doc

    def lazy_load(self) -> Iterator[Document]:
        import mistune
        from bs4 import BeautifulSoup

        # Render the Markdown to HTML, then parse it with BeautifulSoup
        html = mistune.create_markdown()(self._markdown_content)
        soup = BeautifulSoup(html, **self._beautifulsoup_kwargs)
        if self._split_by is not None:
            for doc in self.split_by(soup, tag=self._split_by):
                yield doc
        else:
            for doc in self.split_by(soup):
                yield doc
Note: to use the above you have to install the following libraries:
pip install beautifulsoup4 mistune langchain langchain-community
The loader takes the content of your MD file, converts it to HTML, and processes it with beautifulsoup4. It also adds the ability to split the MD file on a heading tag such as h1. Here is the resulting Langchain🦜🔗 Document:
Document(id='4ce64f5c-7873-4c3d-a17f-5531486d3312', metadata={'images': '[{"src": "https://images.example.com/image1.png", "alt": "Image"}]', 'split': 'Chapter 1: Dogs'}, page_content='\n In this chapter we talk about dogs. Here is an image of a dog \n Dogs make for good home pets. They are loyal and friendly. They are also very playful. \n')
Next we can ingest some data into Chroma with the following Python script (you can add this to a Python backend so it runs automatically whenever an MD file is uploaded). Here is a sample MD file we can use:
# Chapter 1: Dogs
In this chapter we talk about dogs. Here is an image of a dog ![Image](https://images.example.com/image1.png)
Dogs make for good home pets. They are loyal and friendly. They are also very playful.
# Chapter 2: Cats
In this chapter we talk about cats. Here is an image of a cat ![Image](https://images.example.com/image2.png)
Cats are very independent animals. They are also very clean and like to groom themselves.
# Chapter 3: Birds
In this chapter we talk about birds. Here is an image of a bird ![Image](https://images.example.com/image3.png)
import chromadb

loader = CustomMDLoader(
    markdown_content=open("test.md").read(),
    images_as_metadata=True,
    beautifulsoup_kwargs={"features": "html.parser"},
    split_by="h1",
)
docs = loader.load()

client = chromadb.HttpClient("http://localhost:8000")
col = client.get_or_create_collection("test")
col.add(
    ids=[doc.id for doc in docs],
    documents=[doc.page_content for doc in docs],
    metadatas=[doc.metadata for doc in docs],
)
# resulting docs: [Document(id='4ce64f5c-7873-4c3d-a17f-5531486d3312', metadata={'images': '[{"src": "https://images.example.com/image1.png", "alt": "Image"}]', 'split': 'Chapter 1: Dogs'}, page_content='\n In this chapter we talk about dogs. Here is an image of a dog \n Dogs make for good home pets. They are loyal and friendly. They are also very playful. \n'), Document(id='7e1c3ab1-f737-42ea-85cc-9ac21bfd9b8b', metadata={'images': '[{"src": "https://images.example.com/image2.png", "alt": "Image"}]', 'split': 'Chapter 2: Cats'}, page_content='\n In this chapter we talk about cats. Here is an image of a cat \n Cats are very independent animals. They are also very clean and like to groom themselves. \n'), Document(id='4d111946-f52e-4ce0-a9ff-5ffde8536736', metadata={'images': '[{"src": "https://images.example.com/image3.png", "alt": "Image"}]', 'split': 'Chapter 3: Birds'}, page_content='\n In this chapter we talk about birds. Here is an image of a bird \n')]
As the final step, in your TS (React) chatbot, use the Chroma TS client to search for what you need (see the official docs):
import { ChromaClient } from "chromadb";

const client = new ChromaClient();
const collection = await client.getCollection({ name: "test" }); // the collection created during ingestion
const results = await collection.query({
  queryTexts: "I want to learn about dogs",
  nResults: 1, // how many results to return
});
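From that result you need the matched text and its image metadata for the next step. Here is a rough continuation of the snippet above (the nested arrays are how the Chroma JS client shapes query results; matchedText and images are just illustrative names):

// Top match of the first (and only) query; "images" was stored as a JSON string during ingestion.
const matchedText = results.documents[0][0] ?? "";
const imagesJson = (results.metadatas[0][0]?.images as string | undefined) ?? "[]";
const images: { src: string; alt: string }[] = JSON.parse(imagesJson);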
Based on the results above, build a meta-prompt from the retrieved content, like this:
based on the following content generate a markdown output that includes the text content and the image or images:
Text Content:
In this chapter we talk about dogs. Here is an image of a dog
Dogs make for good home pets. They are loyal and friendly. They are also very playful.
Images: [{'src': 'https://images.example.com/image1.png', 'alt': 'Image'}]
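In code, you could assemble that prompt and send it with the official openai Node SDK roughly like this (a sketch only, reusing the matchedText and images values from the previous snippet; run it server-side, e.g. in an API route, so your OpenAI key never reaches the browser):

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Assemble the meta-prompt from the retrieved text and image metadata.
const prompt = `based on the following content generate a markdown output that includes the text content and the image or images:
Text Content:
${matchedText}
Images: ${JSON.stringify(images)}`;

const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: prompt }],
});
const answerMarkdown = completion.choices[0].message.content ?? "";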
With OpenAI gpt-4o you should get something like this:
# Chapter: Dogs
In this chapter, we talk about dogs. Here is an image of a dog:
![Image](https://images.example.com/image1.png)
Dogs make for good home pets. They are loyal and friendly. They are also very playful.
You can then render that Markdown to the user in the chat response.
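For example, a minimal React sketch assuming the react-markdown package (my choice here; any Markdown renderer will do):

import ReactMarkdown from "react-markdown";

// Chat bubble for the bot: react-markdown turns the generated answer, including
// any ![...](url) image references, into regular HTML elements.
function BotMessage({ markdown }: { markdown: string }) {
  return <ReactMarkdown>{markdown}</ReactMarkdown>;
}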
I wanted to keep this short, but it feels like there is no super-short way to describe even one of the many approaches you could take to solve this challenge.