"Spell check" and return the corrected term in Python

Question

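I recently extracted text data from a directory of PDF files. When reading the PDFs, the returned text is sometimes a bit messy.

For example, I might see a string that reads:

"T he administrati on is doing bad things, and not fulfilling what it prom ised"

I would like the result to be:

"The administration is doing bad things, and not fulfilling what it promised"

I tested code I found here on Stack Overflow (using pyenchant and wx), and it did not return what I wanted. My modified version is as follows: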

import enchant.checker

a = "T he administrati on is doing bad things, and not fulfilling what it prom ised"
chkr = enchant.checker.SpellChecker("en_US")
chkr.set_text(a)
for err in chkr:
    suggestions = err.suggest()
    if suggestions:  # some misspellings have no suggestion at all
        err.replace(suggestions[0])

c = chkr.get_text()  # returns corrected text
print(c)

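This code returns a sentence that is still wrong: each fragment appears to be checked as a separate word, so the pieces are never joined back into "The administration ... promised".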

I am using Python 3.5.x on 64-bit Windows 7 Enterprise. I would appreciate any suggestions!

Answer 1

I took Generic Human's answer and modified it slightly to solve your problem.

You need to copy these 125k words, sorted by frequency, into a text file and name the file words-by-frequency.txt.

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
with open("words-by-frequency.txt") as f:
    words = [line.strip() for line in f.readlines()]
    wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
    maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

Run the function with your input:

messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"

print(infer_spaces(messy_txt.lower().replace(' ', '').replace(',', '')).capitalize())


The administration is doing bad things and not fulfilling what it promised
>>>
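To make the cost dictionary less opaque, here is a minimal sketch of how the Zipf-based cost behaves. The five-word list is a made-up stand-in for words-by-frequency.txt, and the sketch assumes the list is ordered from most to least frequent. Under Zipf's law the word at rank n has probability of roughly 1 / (n * ln N), so its cost, -log(probability), is log(n * ln N), which is the expression used in the answer's code.

```python
from math import log

# Toy list (an assumption for illustration), most frequent word first.
words = ["the", "is", "and", "administration", "promised"]

# Zipf's law: P(rank n) is roughly 1 / (n * ln N),
# so cost = -log(P) = log(n * ln N). Ranks start at 1, hence rank + 1.
n_words = len(words)
wordcost = {w: log((rank + 1) * log(n_words)) for rank, w in enumerate(words)}

# Frequent words get a low cost, rare words a high one.
for w in words:
    print(f"{w:15s} {wordcost[w]:.3f}")
```

Because the segmentation minimizes the summed cost, it prefers splits made of common words, and any substring missing from the dictionary gets an effectively infinite cost.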

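Edit: The code below does not need the text file, but it works only for your input, i.e. "T he administrati on is doing bad things, and not fulfilling what it prom ised".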

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = ["the", "administration", "is", "doing", "bad",
         "things", "and", "not", "fulfilling", "what",
         "it", "promised"]
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))


messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"

print(infer_spaces(messy_txt.lower().replace(' ', '').replace(',', '')).capitalize())

The administration is doing bad things and not fulfilling what it promised
>>>

I just tried the edit above on repl.it, and it printed the output shown.
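Both versions strip the comma before segmenting, so the punctuation is lost in the output. The following is a rough sketch of one way to keep it: split the text at punctuation marks, re-segment each run of letters on its own, then put the marks back. The helper name respace_keep_punctuation and the set of punctuation marks are assumptions made for illustration, and infer_spaces is rewritten here in a compact form (storing the chosen word length instead of recomputing it) so the block runs on its own.

```python
import re
from math import log

# Toy vocabulary, assumed to be ordered by frequency (same words as the edit).
WORDS = ["the", "administration", "is", "doing", "bad",
         "things", "and", "not", "fulfilling", "what",
         "it", "promised"]
WORDCOST = {w: log((i + 1) * log(len(WORDS))) for i, w in enumerate(WORDS)}
MAXWORD = max(len(w) for w in WORDS)
PUNCT = r"[,.;:!?]"  # assumed set of marks to preserve


def infer_spaces(s):
    """Segment a space-free lowercase string into the cheapest word sequence."""
    cost = [0.0]  # cost[i]: best cost of segmenting s[:i]
    back = [0]    # back[i]: length of the last word in that segmentation
    for i in range(1, len(s) + 1):
        c, k = min(
            (cost[i - k] + WORDCOST.get(s[i - k:i], float("inf")), k)
            for k in range(1, min(i, MAXWORD) + 1)
        )
        cost.append(c)
        back.append(k)
    out = []
    i = len(s)
    while i > 0:
        out.append(s[i - back[i]:i])
        i -= back[i]
    return " ".join(reversed(out))


def respace_keep_punctuation(text):
    """Re-segment each run of letters between punctuation marks separately."""
    # The capture group makes re.split keep the punctuation marks.
    parts = re.split(f"({PUNCT})", text.lower())
    pieces = []
    for part in parts:
        if re.fullmatch(PUNCT, part):
            # Attach the mark to the preceding piece, if there is one.
            if pieces:
                pieces[-1] += part
            else:
                pieces.append(part)
        else:
            letters = part.replace(" ", "")
            if letters:
                pieces.append(infer_spaces(letters))
    return " ".join(pieces)


messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"
print(respace_keep_punctuation(messy_txt).capitalize())
```

As with the original, this only works when every word in the text is present in the vocabulary; an unknown word is broken into single characters.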

Answer 2

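It looks like the enchant library you are using is not that good for this. It does not look for spelling errors that span several words; it only looks at each word on its own. I suppose that makes sense, since the class itself is called 'SpellChecker'.

The only thing I can think of is to look for a better autocorrect library. Maybe this one could help? https://github.com/phatpiglet/autocorrect

No guarantees, though.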
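To illustrate why checking each word on its own cannot repair this text, here is a small sketch that does not use enchant at all. It splits the string on whitespace, which shows the fragments a per-word checker would see, and then greedily joins neighbouring fragments whenever the joined string is a known word. The function merge_fragments and the three-word vocabulary are hypothetical, chosen only for this example.

```python
def merge_fragments(tokens, vocabulary):
    """Greedily join adjacent tokens when the joined string is a known word."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            joined = tokens[i] + tokens[i + 1]
            if joined.lower() in vocabulary:
                merged.append(joined)
                i += 2  # both fragments were consumed
                continue
        merged.append(tokens[i])
        i += 1
    return merged


messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"

# A per-word checker sees these fragments one at a time.
tokens = messy_txt.split()
print(tokens[:4])

# Hypothetical vocabulary; a real one would come from a word list.
vocabulary = {"the", "administration", "promised"}
print(" ".join(merge_fragments(tokens, vocabulary)))
```

Fragments such as "he", "on" and "prom" may be accepted as valid words by themselves, so a word-by-word checker has no reason to change them. Joining only pairs is a simplification; a word split into three or more pieces would need the dynamic-programming approach from Answer 1.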
