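An upstream service reads a stream of UTF-8 bytes, assumes they are ISO-8859-1, applies an ISO-8859-1 to UTF-8 conversion, and sends the result to my service, labeled as UTF-8.
The upstream service is out of my control. They may fix it, or it may never get fixed.
I know I can fix the encoding by applying a UTF-8 to ISO-8859-1 conversion and then labeling the bytes as UTF-8. But what happens if my upstream fixes their issue?
Is there any way to detect this problem and fix the encoding only when I find a bad encoding?
I'm also not sure that the upstream encoding is ISO-8859-1. I think the upstream is Perl, so that encoding makes sense, and every sample I've tried decoded correctly when I applied ISO-8859-1 encoding.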
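When I send e2 9c 94 (✔) to my upstream, my upstream sends me c3 a2 c2 9c c2 94 (â).
✔ as bytes: e2 9c 94
e2 9c 94 as a latin1 string: â
â as bytes: c3 a2 c2 9c c2 94
I can use upstream.encode('ISO-8859-1').force_encoding('UTF-8') to fix it, but it will break as soon as the upstream issue is fixed.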
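The byte sequences above can be reproduced with a short sketch. Python is used here only for illustration, and the names `hexdump`, `misread`, `mangled`, and `repaired` are made up for this example; it assumes the upstream really does misread the bytes as ISO-8859-1.

```python
def hexdump(data):
    """Format bytes as space-separated lowercase hex pairs."""
    return " ".join(f"{byte:02x}" for byte in data)

original = "✔"
utf8_bytes = original.encode("utf-8")

# Assumed upstream behaviour: misread the UTF-8 bytes as ISO-8859-1 ...
misread = utf8_bytes.decode("iso-8859-1")
# ... then encode the misread text as UTF-8 before sending it on.
mangled = misread.encode("utf-8")

print(hexdump(utf8_bytes))  # e2 9c 94
print(hexdump(mangled))     # c3 a2 c2 9c c2 94

# Undoing the two steps in reverse order recovers the original character.
repaired = mangled.decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(repaired == original)  # True
```

The unconditional repair on the last lines is the part that stops working once upstream sends correct UTF-8, because `"✔"` cannot be encoded as ISO-8859-1.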
Here is a Python example (you didn't mention a language...):
# Assume bytes are latin1, but return encoded UTF-8.
def bad(b):
    return b.decode('latin1').encode('utf8')

# Assume bytes are UTF-8, and pass them along.
def good(b):
    return b

# Try to undo the double encoding; fall back to plain UTF-8 if that fails.
def decoder(b):
    try:
        return b.decode('utf8').encode('latin1').decode('utf8')
    except UnicodeError:
        return b.decode('utf8')

b = '✔'.encode('utf8')
print(decoder(bad(b)))
print(decoder(good(b)))
Output:
✔
✔
def maybe_fix_encoding(utf8_string, possible_codec="cp1252"):
    """Attempts to fix mangled text caused by interpreting UTF8 as cp1252
    (or other codec: https://docs.python.org/3/library/codecs.html)"""
    try:
        return utf8_string.encode(possible_codec).decode('utf8')
    except UnicodeError:
        return utf8_string
>>> maybe_fix_encoding("some normal text and some scandinavian characters æ ø å Æ Ø Å")
'some normal text and some scandinavian characters æ ø å Æ Ø Å'
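The choice of codec matters. As a rough illustration, the sketch below repeats the function above and feeds it a euro sign that was misread as cp1252; the variable names are made up for the example. The misread text contains `‚` (U+201A), which latin1 cannot encode, so only the cp1252 attempt repairs it.

```python
def maybe_fix_encoding(utf8_string, possible_codec="cp1252"):
    """Same function as above, repeated so this sketch runs on its own."""
    try:
        return utf8_string.encode(possible_codec).decode("utf8")
    except UnicodeError:
        return utf8_string

# "€" is e2 82 ac in UTF-8; read as cp1252, those bytes display as "â‚¬".
mangled = "€".encode("utf8").decode("cp1252")

fixed_cp1252 = maybe_fix_encoding(mangled, "cp1252")
fixed_latin1 = maybe_fix_encoding(mangled, "latin1")

print(fixed_cp1252)             # €
print(fixed_latin1 == mangled)  # True: latin1 cannot encode "‚", so nothing changes
```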
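In more detail, the UTF-8 encoding is very strict about which non-ASCII byte sequences are allowed. The allowed patterns are extremely unlikely to occur in genuine ISO-8859-1 text, because in that encoding they represent sequences such as Ã followed by a non-printable control character or a mathematical operator, which simply do not occur in any valid text.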
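A minimal sketch of that argument, assuming latin1 as the wrongly applied codec (the helper name `looks_like_mangled_utf8` is hypothetical): ordinary accented text fails the round trip because a lone byte such as e9 is not valid UTF-8, while double-encoded text passes.

```python
def looks_like_mangled_utf8(text, wrong_codec="latin1"):
    """Return True if text contains non-ASCII characters and its bytes in
    wrong_codec happen to form valid UTF-8."""
    try:
        text.encode(wrong_codec).decode("utf8")
    except UnicodeError:
        return False
    # Pure ASCII always survives the round trip, so it proves nothing.
    return not text.isascii()

print(looks_like_mangled_utf8("café"))   # False: e9 followed by nothing is invalid UTF-8
print(looks_like_mangled_utf8("cafÃ©"))  # True: c3 a9 is the UTF-8 encoding of é
print(looks_like_mangled_utf8("cafe"))   # False: plain ASCII
```

A string that legitimately contains `Ã©` would be misclassified, which is the unlikely case described above.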
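Building on turpachull's answer and the Python 3 list of standard encodings (and Mark Amery's answer, which lists the set for various Python versions), here is a script that tries every encoding conversion on each line of stdin and prints each result that differs from the plain utf_8 version.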
#!/usr/bin/env python3
import sys
encodings = ["ascii", "big5hkscs", "cp1006", "cp1125", "cp1250", "cp1252", "cp1254", "cp1256", "cp1258", "cp273", "cp437", "cp720", "cp775", "cp852", "cp856", "cp858", "cp861", "cp863", "cp865", "cp869", "cp875", "cp949", "euc_jis_2004", "euc_kr", "gbk", "hz", "iso2022_jp_1", "iso2022_jp_2004", "iso2022_jp_ext", "iso8859_11", "iso8859_14", "iso8859_16", "iso8859_3", "iso8859_5", "iso8859_7", "iso8859_9", "koi8_r", "koi8_u", "latin_1", "mac_cyrillic", "mac_iceland", "mac_roman", "ptcp154", "shift_jis_2004", "utf_16_be", "utf_32", "utf_32_le", "utf_7", "utf_8_sig", "big5", "cp037", "cp1026", "cp1140", "cp1251", "cp1253", "cp1255", "cp1257", "cp424", "cp500", "cp737", "cp850", "cp855", "cp857", "cp860", "cp862", "cp864", "cp866", "cp874", "cp932", "cp950", "euc_jisx0213", "euc_jp", "gb18030", "gb2312", "iso2022_jp", "iso2022_jp_2", "iso2022_jp_3", "iso2022_kr", "iso8859_10", "iso8859_13", "iso8859_15", "iso8859_2", "iso8859_4", "iso8859_6", "iso8859_8", "johab", "koi8_t", "kz1048", "mac_greek", "mac_latin2", "mac_turkish", "shift_jis", "shift_jisx0213", "utf_16", "utf_16_le", "utf_32_be", "utf_8"]
def maybe_fix_encoding(utf8_string, possible_codec="utf_8"):
    try:
        return utf8_string.encode(possible_codec).decode('utf_8')
    except UnicodeError:
        return utf8_string

for line in sys.stdin:
    i = line.rstrip('\n')
    for e in encodings:
        result = maybe_fix_encoding(i, e)
        # Always print the utf_8 baseline; print other codecs only if they change the text.
        if result != i or e == 'utf_8':
            print("\t".join([e, result]))
    print("\n")
Example usage:
$ echo 'Requiem der morgenrÃ¶te' | ~/decode_string.py
cp1252 Requiem der morgenröte
cp1254 Requiem der morgenröte
iso2022_jp_1 Requiem der morgenr(D**B"yte
iso2022_jp_2 Requiem der morgenr(D**B"yte
iso2022_jp_2004 Requiem der morgenr(Q):B"yte
iso2022_jp_3 Requiem der morgenr(O):B"yte
iso2022_jp_ext Requiem der morgenr(D**B"yte
latin_1 Requiem der morgenröte
iso8859_9 Requiem der morgenröte
iso8859_14 Requiem der morgenröte
iso8859_15 Requiem der morgenröte
mac_iceland Requiem der morgenr̦te
mac_roman Requiem der morgenr̦te
mac_turkish Requiem der morgenr̦te
utf_7 Requiem der morgenr+AMMAtg-te
utf_8 Requiem der morgenrÃ¶te
utf_8_sig Requiem der morgenrÃ¶te
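If the candidates are needed inside a program rather than from a pipe, the same idea might be wrapped in a function. This is only a sketch: `candidate_fixes` and its short default codec list are assumptions for illustration, not part of the script above.

```python
def candidate_fixes(text, codecs=("ascii", "cp1252", "latin_1", "mac_roman")):
    """Return {codec: repaired_text} for each codec whose round trip changes text."""
    results = {}
    for codec in codecs:
        try:
            repaired = text.encode(codec).decode("utf_8")
        except UnicodeError:
            continue  # this codec cannot explain the mangling
        if repaired != text:
            results[codec] = repaired
    return results

fixes = candidate_fixes("Requiem der morgenrÃ¶te")
print(fixes["cp1252"])   # Requiem der morgenröte
print("ascii" in fixes)  # False: the text is not encodable as ASCII
```

Several codecs can yield a different but wrong result, as the mac_roman lines in the output above show, so a human or a further plausibility check still has to pick the right candidate.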
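# `text` is the already-decoded string to inspect.
try:
    text.encode('ascii')
except UnicodeEncodeError:
    # The text contains non-ASCII characters; report each unexpected one.
    for character in text:
        try:
            character.encode('ascii')
        except UnicodeEncodeError:
            if character not in ",.\"'?-£":
                print("Anomaly detected:", character)
    # A Python 3 str has no .decode(), so the round trip is checked on the whole text.
    try:
        text.encode('latin1').decode('utf8')
    except UnicodeError:
        print("Issue with:", text)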