Python re and regex fail after Unicode above U+D7FB in the searched string

Question

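I tried some Python regular-expression matching against Unicode text and found that it does not work as I expected in either re or regex. For example, r"\w" fails to match a string containing "x" if certain characters precede the "x" in the string being matched.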

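Characters that seem to cause trouble:

Unassigned characters immediately before the private use area (U+D7FC...U+DFFF)
Private use characters (U+E000...)

However, some unassigned characters work fine (U+0378), and at least some characters above the private use area work fine too (U+F900).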

I have searched SO, the web in general, and the re and regex documentation (e.g. https://docs.python.org/3.11/library/re.html#module-re and https://pypi.org/project/regex/), but found nothing relevant.

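Is this known/expected behavior? At first I thought that non-Unicode characters might simply invalidate the match, and that someone had decided private use characters should be treated the same as unassigned ones; but:

U+378 is unassigned and works fine.
Private use characters surely should not behave this way, even if unassigned ones do.
Things work when the match is satisfied before the troublesome character is seen.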

    import sys
    import re
    import unicodedata

    import regex

    def test(theLib):
        print("\nWith %s:" % (theLib.__name__))
        print("x:      ", theLib.match(r"\w", "x"))
        print("0378:   ", theLib.match(r"\w", "x \u0378"))  # Unassigned
        print("E010:   ", theLib.match(r"\w", "\uE010"))  # Private use char
        print("x E010: ", theLib.match(r"\w", "x \uE010"))
        print("E010 x: ", theLib.match(r"\w", "\uE010 x"))
        print("D7FB x: ", theLib.match(r"\w", "\uD7FB x")) # HANGUL JONGSEONG PHIEUPH-THIEUTH
        print("D7FC x: ", theLib.match(r"\w", "\uD7FC x")) # Unassigned
        print("D800 x: ", theLib.match(r"\w", "\uD800 x"))
        print("F900 x: ", theLib.match(r"\w", "\uF900 x"))

    print(sys.version)

    test(re)
    test(regex)

This produces:

    3.11.9 (main, Apr  2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]

    With re:
    x:       <re.Match object; span=(0, 1), match='x'>
    0378:    <re.Match object; span=(0, 1), match='x'>
    E010:    None
    x E010:  <re.Match object; span=(0, 1), match='x'>
    E010 x:  None
    D7FB x:  <re.Match object; span=(0, 1), match='ퟻ'>
    D7FC x:  None
    D800 x:  None
    F900 x:  <re.Match object; span=(0, 1), match='豈'>

    With regex:
    x:       <regex.Match object; span=(0, 1), match='x'>
    0378:    <regex.Match object; span=(0, 1), match='x'>
    E010:    None
    x E010:  <regex.Match object; span=(0, 1), match='x'>
    E010 x:  None
    D7FB x:  <regex.Match object; span=(0, 1), match='ퟻ'>
    D7FC x:  None
    D800 x:  None
    F900 x:  <regex.Match object; span=(0, 1), match='豈'>

Answer 1

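"For example, r"\w" fails to match a string containing "x" if certain characters precede the "x" in the string being matched."

Of course it doesn't match when other characters come before the "x": in that case the string does not start with "x".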
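As a rough check of what each test string actually starts with, the sketch below looks at the first character only. The helper name first_char_info is made up for illustration, and the general categories it reports come from whatever Unicode database ships with the running Python, so they could differ in other versions.

```python
import unicodedata

# Hypothetical helper (not from the question): describe the first
# character of a string and whether the string starts with "x".
def first_char_info(s):
    first = s[0]
    code_point = "U+%04X" % ord(first)
    category = unicodedata.category(first)  # e.g. "Lo", "Co", "Cs", "Cn"
    return code_point, category, s.startswith("x")

# The same leading characters used in the question's test function.
samples = ["x \uE010", "\uE010 x", "\uD7FB x", "\uD7FC x", "\uD800 x", "\uF900 x"]

for s in samples:
    # Only the description is printed, never the raw string, because a
    # lone surrogate such as U+D800 cannot be encoded for output.
    print(first_char_info(s))
```

On the Python version shown in the question, the failing cases should be the ones whose first character is private use (Co), a surrogate (Cs), or unassigned (Cn), while U+D7FB and U+F900 are letters (Lo).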

re.match only matches at the beginning of the string. So all you are actually testing is whether \w matches those particular characters.
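If the intent was to find a word character anywhere in the string, which is an assumption about what the question wanted, re.search is the function that scans forward. A minimal sketch comparing the two on one of the failing inputs; the helper name first_word_span is made up for illustration:

```python
import re

# Hypothetical helper: span of the first \w character anywhere in s,
# or None if there is none.
def first_word_span(s):
    m = re.search(r"\w", s)
    return m.span() if m else None

s = "\uE010 x"  # private use character, space, then "x"

# re.match is anchored at position 0, where U+E010 is not a word character.
print(re.match(r"\w", s))

# re.search keeps scanning and finds the "x" at index 2.
print(re.search(r"\w", s))
print(first_word_span(s))
```

The same comparison can be run with the regex module, which also provides match and search.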
