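I am trying to use LangChain in Python to extract structured information with an LLM (e.g. GPT-4). My goal is to classify companies by associating them with tags.
My output class looks like this: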
from langchain_core.pydantic_v1 import BaseModel
class Company(BaseModel):
    industry: list[Industry]
    customer: list[Customer]
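So far so good. The problem is that some tags can be fairly specific, and I would like to pass more information to the LLM to help it decide between the options. Using Enum from aenum, as described here, I can add, for example, a docstring to each enum value: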
from aenum import Enum
class Industry(Enum):
    _init_ = 'value __doc__'
    it = "Information Technology", "All kinds of computer stuff"
    agriculture = "Agriculture", "Farming, irrigation, fertilizers etc."

class Customer(Enum):
    _init_ = 'value __doc__'
    B2C = "B2C", "Companies selling directly to consumers"
    B2B = "B2B", "Companies selling to other businesses"
Now I have my enum values together with some helpful explanations, but there is no straightforward way to pass these on to the LLM.
If I use .with_structured_output() or PydanticOutputParser, neither of them passes along the docstrings of the enum members:
from langchain_core.output_parsers import PydanticOutputParser
parser = PydanticOutputParser(pydantic_object=Company)
parser.get_format_instructions()
# 'The output should be formatted as a JSON instance that conforms to the JSON schema below.
# As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
# the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
# Here is the output schema:
# ```
# {"properties": {"industry": {"type": "array", "items": {"$ref": "#/definitions/Industry"}}, "customer": {"type": "array", "items": {"$ref": "#/definitions/Customer"}}}, "required": ["industry", "customer"], "definitions": {"Industry": {"title": "Industry", "description": "An enumeration.", "enum": ["Information Technology", "Agriculture"]}, "Customer": {"title": "Customer", "description": "An enumeration.", "enum": ["B2C", "B2B"]}}}
#```'
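As a workaround I could of course write a custom prompt that spells out the docstrings explicitly, but I am curious whether anyone has found a more direct way to do this.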
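For reference, the custom-prompt workaround could look roughly like the sketch below. It is only an illustration under a few assumptions: DocEnum is a stdlib stand-in for aenum's `_init_ = 'value __doc__'` so the snippet runs without third-party packages, and describe_enum and tag_guide are hypothetical names, not part of LangChain or aenum.

```python
from enum import Enum


class DocEnum(Enum):
    # Assumption: stdlib replacement for aenum's `_init_ = 'value __doc__'`.
    # Each member is declared as (value, docstring).
    def __new__(cls, value, doc):
        member = object.__new__(cls)
        member._value_ = value
        member.__doc__ = doc
        return member


class Industry(DocEnum):
    it = "Information Technology", "All kinds of computer stuff"
    agriculture = "Agriculture", "Farming, irrigation, fertilizers etc."


class Customer(DocEnum):
    B2C = "B2C", "Companies selling directly to consumers"
    B2B = "B2B", "Companies selling to other businesses"


def describe_enum(enum_cls):
    """Hypothetical helper: render one '- value: docstring' line per member."""
    lines = [f"Allowed values for {enum_cls.__name__}:"]
    lines += [f"- {member.value}: {member.__doc__}" for member in enum_cls]
    return "\n".join(lines)


# Text block that could be appended to the prompt by hand.
tag_guide = "\n\n".join(describe_enum(e) for e in (Industry, Customer))
print(tag_guide)
```

The resulting tag_guide string would then be placed in the prompt next to the output of parser.get_format_instructions(), which is exactly the manual step I was hoping to avoid.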
You can add a field to the schema in the __get_pydantic_json_schema__ method (this requires pydantic v2), like this:
from aenum import Enum
from pydantic.json_schema import GetJsonSchemaHandler
from pydantic_core import core_schema
class EnumSchemaDoc(Enum):
    @classmethod
    def __get_pydantic_json_schema__(cls, core_schema: core_schema.CoreSchema, handler: GetJsonSchemaHandler):
        schema = handler(core_schema)
        schema['documentation'] = {e.value : e.__doc__ for e in cls}
        return schema
class Industry(EnumSchemaDoc):
    _init_ = 'value __doc__'
    it = "Information Technology", "All kinds of computer stuff"
    agriculture = "Agriculture", "Farming, irrigation, fertilizers etc."

class Customer(EnumSchemaDoc):
    _init_ = 'value __doc__'
    B2C = "B2C", "Companies selling directly to consumers"
    B2B = "B2B", "Companies selling to other businesses"

from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel
class Company(BaseModel):
    industry: Industry
    customer: Customer

parser = PydanticOutputParser(pydantic_object=Company)
parser.get_format_instructions()
'The output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{"$defs": {"Customer": {"documentation": {"B2B": "Companies selling to other businesses", "B2C": "Companies selling directly to consumers"}, "enum": ["B2C", "B2B"], "title": "Customer", "type": "string"}, "Industry": {"documentation": {"Agriculture": "Farming, irrigation, fertilizers etc.", "Information Technology": "All kinds of computer stuff"}, "enum": ["Information Technology", "Agriculture"], "title": "Industry", "type": "string"}}, "properties": {"industry": {"$ref": "#/$defs/Industry"}, "customer": {"$ref": "#/$defs/Customer"}}, "required": ["industry", "customer"]}\n```'