Unbaking mojibake

Question

When your characters have been decoded incorrectly, how do you identify likely candidates for the original string?

Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png

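I know this image filename is supposed to contain some Japanese characters. But despite various guesses with urllib quote/unquote and with encoding and decoding as iso8859-1 and utf8, I have not been able to unmangle it and recover the original filename.

Is the corruption reversible?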

Answer 1

You can use chardet (install it with pip):

import chardet

# Python 2: a plain string literal is a byte string, so chardet can inspect it directly.
your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
detected_encoding = chardet.detect(your_str)["encoding"]

try:
    correct_str = your_str.decode(detected_encoding)
except UnicodeDecodeError:
    print("Could not estimate encoding")

Result: 時間試験観点（アニメパス）_10秒 (I don't know whether this is correct)

For Python 3 (source file encoded as utf8):

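import chardet

# The garbled name as it appears on screen.
falsely_decoded_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"

try:
    # Re-encode with the codec that produced the mojibake to recover the raw bytes.
    encoded_str = falsely_decoded_str.encode("cp850")
except UnicodeEncodeError:
    print("could not encode falsely decoded string")
    encoded_str = None

if encoded_str:
    detected_encoding = chardet.detect(encoded_str)["encoding"]
    correct_str = None

    # chardet reports None when it cannot make a guess.
    if detected_encoding:
        try:
            correct_str = encoded_str.decode(detected_encoding)
        except UnicodeDecodeError:
            print("could not decode encoded_str as %s" % detected_encoding)

    # Write the result only if decoding succeeded; utf-8-sig prepends a BOM.
    if correct_str is not None:
        with open("output.txt", "w", encoding="utf-8-sig") as out:
            out.write(correct_str)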

In short:

>>> s = 'Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png'
>>> s.encode('cp850').decode('shift-jis')
'時間試験観点（アニメパス）_10秒.png'
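If the two codecs are not known in advance, the same encode-then-decode round trip can be tried over a list of suspects. The following is only a rough sketch: the function name `mojibake_candidates` and the two codec lists are assumptions made for illustration, not part of the answer above, and the lists would need extending for other languages or platforms.

```python
import itertools

# Hypothetical candidate lists; extend them for other languages or platforms.
WRONG_CODECS = ["cp437", "cp850", "cp1252", "latin-1"]
RIGHT_CODECS = ["shift-jis", "euc-jp", "utf-8"]


def mojibake_candidates(garbled):
    """Return (wrong_codec, right_codec, text) for every pair that round-trips."""
    results = []
    for wrong, right in itertools.product(WRONG_CODECS, RIGHT_CODECS):
        try:
            # Undo the suspected wrong decoding, then decode with the suspected real codec.
            repaired = garbled.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot explain the garbled text
        results.append((wrong, right, repaired))
    return results


garbled = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png"
for wrong, right, repaired in mojibake_candidates(garbled):
    print(f"{wrong} -> {right}: {repaired}")
```

Any pair that survives both steps is only a candidate; a human still has to judge whether the repaired text is plausible. For this filename, the cp850 / shift-jis pair is among the survivors.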

Answer 2

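My first guess would be that your gibberish is Shift JIS that was wrongly interpreted as IBM437. Personally, I'd use this website here in the future (be sure to press the "swap" button after encoding and before decoding). (I used to use string-functions.com for this kind of thing, but it is gone now.)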
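A guess about the wrongly applied codec can be sanity-checked locally: every character of the garbled text has to exist in that codec, otherwise it cannot have produced the text. A minimal sketch, assuming Python 3 and its built-in codec tables; `can_encode` is a hypothetical helper name.

```python
def can_encode(text, codec):
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


garbled = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png"
for codec in ("cp437", "cp850"):
    print(codec, can_encode(garbled, codec))
```

Characters such as × and È are absent from Python's cp437 table but present in cp850, so for this particular filename the check favors cp850, the codec used in the first answer.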
