如何使大型文本数据集的模式匹配变得高效

Question

我目前正在开展一个项目，涉及处理大量文本数据以执行自然语言处理任务。我的管道的一个关键方面涉及字符串匹配，我需要根据一组预定义的模式有效地匹配句子中的子字符串。

这是一个模拟示例，用以下句子列表来说明问题：

sentences = [
    "the quick brown fox jumps over the lazy dog",
    "a watched pot never boils",
    "actions speak louder than words"
]

我有一套模式：

patterns = [
    "quick brown fox",
    "pot never boils",
    "actions speak"
]

我的目标是有效地识别包含任何这些模式的句子。此外，我需要对每个句子进行标记并对匹配的子字符串进行进一步分析。

目前，我正在使用带有嵌套循环的强力方法，但它对于大型数据集来说不可扩展。我正在寻找更复杂的技术或算法来优化这个过程。

考虑到可扩展性和性能，如何实现该场景的字符串匹配？任何建议将不胜感激！

Answer 1

避免对每个模式的每个句子进行暴力搜索的一种方法是创建某种索引。使用此索引，您可以限制找到正确句子所需的搜索空间：

示例：

sentences = [
    "the quick brown fox jumps over the lazy dog",
    "a watched pot never boils",
    "actions speak louder than words",
]

patterns = ["quick brown fox", "pot never boils", "actions speak"]


def build_index(sentences):
    out = {}

    for s in sentences:
        for word in s.split():
            out.setdefault(word, []).append(s)

    return out


def find_patterns(index, patterns):
    out = {}

    for p in patterns:
        word, *_ = p.split(maxsplit=1)
        out[p] = []
        for s in index.get(word, []):
            if p in s:
                out[p].append(s)

    return out


index = build_index(sentences)
print(find_patterns(index, patterns))

打印：

{
    "quick brown fox": ["the quick brown fox jumps over the lazy dog"],
    "pot never boils": ["a watched pot never boils"],
    "actions speak": ["actions speak louder than words"],
}

更稳健的方法是使用更强大的工具，例如 SQLite 中的全文搜索（随 Python 一起提供，例如：Sqlite 具有真正的“全文搜索”和拼写错误（FTS+spellfix 一起）等。 )

如何使大型文本数据集的模式匹配变得高效

问题描述投票：0回答：1

1个回答

最新问题

如何使大型文本数据集的模式匹配变得高效

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1