Incremental parsing of LLM output

Question

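I'm working on a service that calls a large language model, parses the output, and sends the parsed output to the client message by message. When aggregated, the text stream coming back from the LLM looks a bit like this: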

This is a reference [r1] and here's an image of a cat <image>a black cat</image>.

I'd like to parse it into:

[
  {"text": "This is a reference "},
  {"reference": "r1"},
  {"text": " and here's an image of a cat "},
  {"image": "a black cat"},
  {"text": "."},
]

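Handling the aggregated stream is not a problem. Handling the stream as the messages arrive is a lot trickier (at least for me). We have to transmit `text` messages as soon as they are ready, buffer non-`text` items, and send those once they are complete.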

To make this concrete, consider a text stream that comes back like this:

[
  "This ",
  "is ",
  "a ",
  "reference ",
  "[",  # Notice that anything can be broken across different messages
  "r1] ",
  "and ",
  "here",
  "'s ",
  "an ",
  "image ",
  "of ",
  "a ",
  "cat <",  # A more extreme example, breaking messages completely arbitrarily
  "ima",
  "ge",
  ">a ",
  "black ",
  "ca",
  "t</i",
  "mage>.",
]

The parsed output should be the following stream of messages:

[
  {"text": "This "},
  {"text": "is "},
  {"text": "a "},
  {"text": "reference "},
  {"reference": "r1"},
  {"text": " "},
  {"text": "and "},
  {"text": "here"},
  {"text": "'s "},
  {"text": "an "},
  {"text": "image "},
  {"text": "of "},
  {"text": "a "},
  {"text": "cat "},
  {"image": "a black cat"},
  {"text": "."},
]

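Error handling can be as simple as possible to begin with. If we hit a `[` and never a `]`, just buffer indefinitely. When we reach the end of the stream, we can return the buffer as `text`. If we hit an `<image>` and the closing tag starts with `</` but continues with anything other than `image>`, return the whole thing as `text` as well. No need to worry about nested tags for now.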

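Right now I have some code that does the reference parsing, but it is very low-level Python that is hard to read and unmaintainable. I tried looking for options for incremental parsing but couldn't find much. That may be because, as you can read between the lines, my understanding of parsing techniques and tools is very limited. Specifically, I took a quick look at pyparsing, lark and parsimonious. To be honest, I don't understand most of the terminology, and I can't even tell whether these libraries support something like this.

Given this problem, what are my options? Ideally, which Python libraries could I use to solve it? If you can recommend some reading material, please share that too. Don't worry about the LLM part or the async handling; I have those parts nailed down. I'm only interested in the incremental parsing.

Thanks!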

Answer 1

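All parsers are incremental at their core. They step through their input and keep some kind of state about what they are currently doing. I don't know any of the libraries you mention, but I can show what a simple parser looks like, and then you can either use it directly or just get an idea of what kind of constructs to look for in existing libraries.

The input of a simple parser is a stream of characters, so from that point of view, receiving the characters a few at a time makes no real difference.
If your input is called `textstream`, the corresponding character stream would be `charstream`: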

textstream=[
  "This ",
  "is ",
  "a ",
  "reference ",
  "[",  # Notice that anything can be broken across different messages
  "r1] ",
  "and ",
  "here",
  "'s ",
  "an ",
  "image ",
  "of ",
  "a ",
  "cat <",  # A more extreme example, breaking messages completely arbitrarily
  "ima",
  "ge",
  ">a ",
  "black ",
  "ca",
  "t</i",
  "mage>.",
]
charstream=(char for fragment in textstream for char in fragment)
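As a side note, the same flattening can be written with `itertools.chain.from_iterable`, which is also lazy. The sketch below uses a hypothetical `fragments()` generator as a stand-in for a live stream (it is not part of the original code) and records which fragments have been pulled, to illustrate that a fragment is only requested when its characters are needed.

```python
from itertools import chain

pulled = []  # records which fragments have been requested so far


def fragments():
    # Hypothetical stand-in for a live LLM stream.
    for fragment in ["cat <", "ima", "ge>"]:
        pulled.append(fragment)
        yield fragment


# Lazy character stream: equivalent to the generator expression above.
charstream = chain.from_iterable(fragments())

first = next(charstream)         # only the first fragment is pulled here
pulled_after_first = list(pulled)
rest = "".join(charstream)       # consuming the rest pulls the remaining fragments
```

Because nothing is read ahead, a parser consuming `charstream` keeps pace with whatever produces the fragments.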

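Then we can walk over the resulting character stream and split it at the interesting characters, namely the square and angle brackets. Here we assume "correct" input: anything in square brackets is always a reference, anything in angle brackets is always a "tag", these characters cannot appear anywhere else, and they cannot be nested or interleaved.

If that holds, we can get by with 3 variables: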

state="text"  # state, can switch to "reference" and "tag"
collector=""  # temporary storage between state changes
parsed=[]     # list of parsed result

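and run a very simple loop:

1. update the state on special characters, otherwise collect the input
2. on a state change, save the temporary storage to `parsed` (if it is not empty)
3. after the input stream ends, the temporary storage may still contain something (such as the single `.` in the example text), which has to be saved too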

for character in charstream:
  oldstate=state             # 2. (so we recognize state changes)
  match character:           # 1.
    case "[":
      state="reference"
    case "]":
      state="text"
    case "<":
      state="tag"
    case ">":
      state="text"
    case _:
      collector+=character
  if oldstate!=state and collector:        # 2.
    parsed.append({oldstate:collector})
    collector=""

if collector:               # 3.-ish
  parsed.append({state:collector}) # could be an error if state is not "text"

import json
print(json.dumps(parsed,indent=2))

This produces the output (with more line breaks):

[
  {"text": "This is a reference "},
  {"reference": "r1"},
  {"text": " and here's an image of a cat "},
  {"tag": "image"},
  {"text": "a black cat"},
  {"tag": "/image"},
  {"text": "."}
]

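It looks like we are getting somewhere; what is missing is a higher-level layer of state tracking for whether we are inside a tag, and which tag it is. If we can still assume that tags are not nested, there is a fairly quick shortcut: after entering a tag there will be one text element (which is "all" of the tag's content), and the next tag is certainly the exit. So a second loop could look like this: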

result=[]
tag=""
for element in parsed:
  if "tag" in element:
    if tag=="":           # entering a tag
      tag=element["tag"]
    else:
      tag=""              # it's an exit, could be validated
  else:
    if tag=="":           # we are at top level
      result.append(element)
    else:                 # we use the tag, and also expect "text"
      result.append({tag:element["text"]})

print(json.dumps(result,indent=2))

It produces the expected output (again cheating with the line breaks):

[
  {"text": "This is a reference "},
  {"reference": "r1"},
  {"text": " and here's an image of a cat "},
  {"image": "a black cat"},
  {"text": "."}
]

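The code is on Ideone (except for `match`-`case`, since they are on Python 3.9, so there is a chain of `if`-`elif` instead). It also shows the intermediate output; the real output is below the horizontal line.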
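For interpreters older than Python 3.10, where `match`-`case` is not available, one alternative to an `if`-`elif` chain is a plain dictionary of transitions. This is a minimal sketch of the first loop under the same "correct input" assumptions; the names `TRANSITIONS` and `tokenize` are made up for illustration and do not come from the Ideone code.

```python
# Maps each special character to the state it switches to.
TRANSITIONS = {"[": "reference", "]": "text", "<": "tag", ">": "text"}


def tokenize(chars):
    """First-pass loop: split a character stream into text/reference/tag items."""
    state, collector, parsed = "text", "", []
    for character in chars:
        oldstate = state
        if character in TRANSITIONS:
            state = TRANSITIONS[character]   # step 1: update the state
        else:
            collector += character           # step 1: otherwise collect
        if oldstate != state and collector:  # step 2: save on state change
            parsed.append({oldstate: collector})
            collector = ""
    if collector:                            # step 3: leftover after the stream ends
        parsed.append({state: collector})
    return parsed


tokens = tokenize("a [r1] <image>cat</image>.")
```

Here `tokens` still contains the raw `tag` items (`image` and `/image`); pairing them up is the job of the second loop.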

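Of course, someone could complain about the incremental part. That certainly does not work with two separate loops, so the higher-level state machine has to be moved into the output part of the first loop: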

  if oldstate!=state and collector:
    if oldstate=="tag":
      if not tag:  # entering tag
        tag=collector
      else:        # exiting tag, this could be validated
        tag=""
    elif tag:
      print(f'{{"{tag}":"{collector}"}}')
    else:
      print(f'{{"{oldstate}":"{collector}"}}')
    collector=""

# (and tag="" is needed before the loop)

Now the loop produces output as soon as possible, via the `print` lines, so it can consume a real stream at whatever pace it arrives.
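If the results need to go to a client rather than to standard output, the same merged loop can be wrapped in a generator that yields dictionaries instead of printing. The following is a sketch under the same assumptions as above (well-formed input, no nesting); `parse_stream` and `TRANSITIONS` are hypothetical names, and the dictionary lookup simply replaces the `match` statement.

```python
TRANSITIONS = {"[": "reference", "]": "text", "<": "tag", ">": "text"}


def parse_stream(fragments):
    """Yield each parsed item as soon as a state change completes it."""
    state, collector, tag = "text", "", ""
    for fragment in fragments:
        for character in fragment:
            oldstate = state
            if character in TRANSITIONS:
                state = TRANSITIONS[character]
            else:
                collector += character
            if oldstate != state and collector:
                if oldstate == "tag":
                    # First tag enters, the next one exits (not validated).
                    tag = "" if tag else collector
                elif tag:
                    yield {tag: collector}       # text inside a tag
                else:
                    yield {oldstate: collector}  # top-level text or reference
                collector = ""
    if collector:
        # Leftover after the stream ends; may be an error if state is not "text".
        yield {state: collector}


items = list(parse_stream(["ref [", "r1] <ima", "ge>a cat</i", "mage>."]))
```

A caller can iterate over `parse_stream(...)` directly and forward each item as it is produced, instead of collecting everything into a list as done here for demonstration.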

The character generator can be decorated with a print that shows where we are in the input:

charstream=(char for fragment in textstream for char in (print(f'Input: "{fragment}"'),fragment)[1])

Then the output, interleaved with the input, looks like this:

Input: "This "
Input: "is "
Input: "a "
Input: "reference "
Input: "["
{"text":"This is a reference "}
Input: "r1] "
{"reference":"r1"}
Input: "and "
Input: "here"
Input: "'s "
Input: "an "
Input: "image "
Input: "of "
Input: "a "
Input: "cat <"
{"text":" and here's an image of a cat "}
Input: "ima"
Input: "ge"
Input: ">a "
Input: "black "
Input: "ca"
Input: "t</i"
{"image":"a black cat"}
Input: "mage>."
{"text":"."}

From https://ideone.com/U6lu3C (the append after the loop became a print as well, but that is all).
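Note that the loop above holds plain text until the next special character, whereas the question asks for `text` to be sent with every incoming message. One possible adjustment, sketched here as an assumption rather than as part of the original answer, is to also flush the collector at the end of each fragment whenever we are in top-level text. The name `parse_fragments` is hypothetical, and error handling for unterminated brackets or mismatched tags is left out.

```python
TRANSITIONS = {"[": "reference", "]": "text", "<": "tag", ">": "text"}


def parse_fragments(fragments):
    """Like the merged loop, but top-level text is flushed per fragment."""
    state, collector, tag = "text", "", ""
    for fragment in fragments:
        for character in fragment:
            oldstate = state
            if character in TRANSITIONS:
                state = TRANSITIONS[character]
            else:
                collector += character
            if oldstate != state and collector:
                if oldstate == "tag":
                    tag = "" if tag else collector  # enter or exit a tag
                elif tag:
                    yield {tag: collector}
                else:
                    yield {oldstate: collector}
                collector = ""
        # Assumed extra step: send plain text right away, but keep buffering
        # while inside a reference, a tag name, or a tag's content.
        if state == "text" and not tag and collector:
            yield {"text": collector}
            collector = ""
    if collector:
        yield {state: collector}


messages = list(parse_fragments(["a ", "ref ", "[", "r1] ", "cat <", "ima", "ge>x", "</image>."]))
```

With this change, references and tag contents are still buffered until they are complete, while ordinary text is forwarded fragment by fragment.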
