How can I derive attributes/tags from short plain-text descriptions? (NER, LLM, ?)

Question

I have short product descriptions that I want to convert into structured attributes.

Example:

Input:

“La Lecciaia Cabernet Sauvignon 2017 – Red – 750ml”

Output:

Year = 2017

Color = Red

Weight = 750

Weight Unit = ml

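If everything were in this format, it would be simple to write a regular expression and be done, but there are many different formats and nuances. Hard-coding logic for each format is becoming increasingly cumbersome. When trying to create a generic solution, I immediately run into problems with the "basic" approaches:

- There are multiple data providers, each with its own format. For the example above, another provider might use "(Red) 2017 La Lecciaia Cabernet Sauvignon 750 ML". Even a single provider can have multiple formats, and they can change over time. Formats are not always strictly followed.
- There are many ways to express a given component. For example, the weight could be expressed as any of the following: "1.5L", "1 1/2 liters", "1500ml", etc.
- Parts of the description can be confused with the target components. There could be a white wine from a brand called "Red Head Vineyard". A weight of "2000 ml" could be mistaken for a year, and so on. I'm using these wine examples only to keep things simple for a general audience, but my product domain has the same conceptual problems.
- I consider this more of a "nice to have", but it would be useful to parse out more detail, e.g. an algorithm smart enough to know that "La Lecciaia" is the brand and "Cabernet Sauvignon" is the grape variety. I assume this would require more upfront work and be harder to get right, but if there is an easy way to do it, it would be good to know.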

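I'd like to develop a generic function that can accept a description in any format. I'm inexperienced with NLP/AI but suspect there are useful tools/algorithms I could leverage. I have 1,000+ example records that could be used to train a model. Something that can run locally is preferred but not strictly required.

I'm not looking for a specific implementation, but for guidance from anyone who has solved a similar problem. I'm open to hybrid approaches where some additional logic or manual oversight makes up for initial inaccuracies.

Thanks for any insight into approaches or suggested learning resources.

I've looked for information online, but many approaches involve a lot of upfront work, and it's unclear whether they would work in practice.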

Answer 1

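LLMs are well suited to this. I have done similar tasks before, and they worked well with minimal training. Keep in mind that no statistical method (NLP / LLM / NER) will ever be 100% accurate, but for practical purposes I have found LLMs to be more accurate than custom regular expressions.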

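For your task, I would use a framework like LangChain with the following prompt (note that you may need to tweak the prompt a bit; this is just an example). When run with a model, it produces XML output that is easy to parse. You can modify the prompt to produce other kinds of output, but I have personally found XML to work very well.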

You are an AI language model designed to parse wine bottle descriptions into structured data. You will be given a wine bottle description, and your task is to extract the following components:

- **Year**: The vintage year of the wine.
- **Color**: The color of the wine (e.g., Red, White, Rosé).
- **Weight**: The volume of the wine bottle expressed as a number (e.g., 750, 1500).
- **Weight Unit**: The unit of measurement for the weight (e.g., ml, mL, L, Liters).
- **Brand**: The brand or producer of the wine.
- **Grape Variety**: The variety of grape used (e.g., Cabernet Sauvignon, Merlot).

**Instructions:**

- Wine descriptions may come in various formats and may include additional or confusing information. Carefully analyze the description to accurately extract the components.
- Be cautious of potential ambiguities. For example:
  - A brand name may include words like "Red" or "White" (e.g., "Red Head Vineyard") which should not be confused with the wine color.
  - Large numbers may represent weight (e.g., "1500 ml") rather than a year.
- **Do not assume information not present in the description.** If a component is missing, you may leave the corresponding tag empty or omit it.

**Output Format:**

Provide the extracted information in XML format, using the following structure:

<Wine>
<Year>{{Year}}</Year>
<Color>{{Color}}</Color>
<Weight>{{Weight}}</Weight>
<WeightUnit>{{WeightUnit}}</WeightUnit>
<Brand>{{Brand}}</Brand>
<GrapeVariety>{{GrapeVariety}}</GrapeVariety>
</Wine>

**Examples:**

Input:

La Lecciaia Cabernet Sauvignon 2017 – Red – 750ml

Output:

<Wine>
  <Year>2017</Year>
  <Color>Red</Color>
  <Weight>750</Weight>
  <WeightUnit>ml</WeightUnit>
  <Brand>La Lecciaia</Brand>
  <GrapeVariety>Cabernet Sauvignon</GrapeVariety>
</Wine>

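Input:

Red Head Vineyard Chardonnay 2020 1.5L

Output:

<Wine>
  <Year>2020</Year>
  <Color></Color>
  <Weight>1.5</Weight>
  <WeightUnit>L</WeightUnit>
  <Brand>Red Head Vineyard</Brand>
  <GrapeVariety>Chardonnay</GrapeVariety>
</Wine>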

**Task:**

Given the following wine description, extract the components and provide the output in XML format as specified.

{win_description}
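As a rough illustration of how the prompt and its XML output could be wired together, here is a minimal Python sketch using only the standard library. The names `PROMPT_TEMPLATE`, `build_prompt`, and `parse_wine_xml` are made up for this example, the template is an abbreviated stand-in for the full prompt above, and the actual call to the model is left out because it depends on the framework or provider you choose.

```python
import xml.etree.ElementTree as ET

# Field names taken from the XML structure in the prompt above.
FIELDS = ["Year", "Color", "Weight", "WeightUnit", "Brand", "GrapeVariety"]

# Abbreviated stand-in for the full prompt (assumption: the real text goes here).
# Doubled braces such as {{Year}} come out of str.format as literal {Year}.
PROMPT_TEMPLATE = (
    "Extract the components and answer in XML, e.g. <Year>{{Year}}</Year>.\n\n"
    "{win_description}"
)


def build_prompt(description: str) -> str:
    """Insert one product description into the prompt template."""
    return PROMPT_TEMPLATE.format(win_description=description)


def parse_wine_xml(response_text: str) -> dict:
    """Find the <Wine> element in a model response and return its fields.

    Missing or empty tags are returned as None, matching the prompt's
    instruction to leave unknown components empty or omit them.
    """
    start = response_text.find("<Wine>")
    end = response_text.find("</Wine>")
    if start == -1 or end == -1:
        raise ValueError("no <Wine> element found in response")
    root = ET.fromstring(response_text[start:end + len("</Wine>")])
    result = {}
    for name in FIELDS:
        text = root.findtext(name)  # None when the tag is absent
        text = text.strip() if text else ""
        result[name] = text or None
    return result
```

Slicing out the `<Wine>` element first tolerates models that wrap the XML in extra prose. Note that `ET.fromstring` raises `ParseError` on malformed XML (for example an unescaped `&` in a brand name), so records that fail to parse are good candidates for manual review.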

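Keep in mind that LLMs are not cheap to run. For this task, though, given the ambiguity in the domain, an LLM is likely the best option. For this particular task, using the OpenAI service costs only 1/1000 of a penny per label. You may find cheaper models/providers. When working with LLMs, however, it is important to ensure accuracy first and optimize cost second.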
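One way to "ensure accuracy first" is to score the model's output against the example records you already have before worrying about cost. The sketch below is only an assumption about how that could look: `field_accuracy` is a hypothetical helper, and it expects predictions and hand-labeled records as lists of dicts keyed by field name.

```python
def _norm(value):
    """Compare values case-insensitively, treating None as an empty string."""
    return (value or "").strip().lower()


def field_accuracy(predictions, labels, fields):
    """Return, per field, the fraction of records where prediction == label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one labeled record")
    correct = {f: 0 for f in fields}
    for pred, gold in zip(predictions, labels):
        for f in fields:
            if _norm(pred.get(f)) == _norm(gold.get(f)):
                correct[f] += 1
    return {f: correct[f] / len(labels) for f in fields}
```

Per-field scores make it easier to see where the prompt needs work (for instance, color versus brand confusion) and give you a baseline to compare against when trying a cheaper model or provider.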
