I am using the following regular expression pattern to identify abbreviations.
import re

mytext = "This is AVGs and (NMN) and most importantly GFD"
mytext = re.sub(r"\b[A-Z\.]{2,}s?\b", "_ABB", mytext)
print(mytext)
I get the following output.
This is _ABB and (_ABB) and most importantly _ABB
However, I want the output to be:
This is AVGs_ABB and (NMN_ABB) and most importantly GFD_ABB
Please let me know where I am going wrong.
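Use a capturing group to capture the pattern you are matching between the word boundaries, then refer to it in the replacement. The first capturing group is available as \1, which is written "\\1" in a regular (non-raw) string literal.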
import re

mytext = "This is AVGs and (NMN) and most importantly GFD"
mytext = re.sub(r"\b([A-Z\.]{2,}s?)\b", "\\1_ABB", mytext)
print(mytext)
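As a small aside on the "\\1" spelling, here is a minimal sketch showing that the doubled backslash and a raw string produce the same replacement. The variable names `escaped` and `raw` are only illustrative.

```python
import re

mytext = "This is AVGs and (NMN) and most importantly GFD"
pattern = r"\b([A-Z\.]{2,}s?)\b"

# "\\1" in a normal string and r"\1" in a raw string are the same two characters.
escaped = re.sub(pattern, "\\1_ABB", mytext)
raw = re.sub(pattern, r"\1_ABB", mytext)

print(escaped)
print(escaped == raw)
```

A raw string is usually easier to read, since the backslash does not have to be doubled.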
Try this:
In [1]: str = "This is AVGs and (NMN) and most importantly GFD"
In [2]: regex = "[A-Z]{2,}"
In [3]: import re
In [4]: result = re.sub(regex, "_ABB", str)
In [5]: result
Out[5]: 'This is _ABBs and (_ABB) and most importantly _ABB'
Use a capturing group in the replacement, as follows:
import re
mytext = "This is AVGs and (NMN) and most importantly GFD"
mytext= re.sub(r"([A-Z]{2,})", "\\1_ABB", mytext)
print(mytext)
Output:
This is AVGs_ABB and (NMN_ABB) and most importantly GFD_ABB
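You do not need a capturing group here, because you want to reuse the whole match, which is itself group 0. Just use \g<0> in the replacement pattern; see the Python re docs:
"The backreference \g<0> substitutes in the entire substring matched by the RE."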
import re
mytext = "This is AVGs and (NMN) and most importantly GFD"
mytext= re.sub(r"\b[A-Z.]{2,}s?\b", r"\g<0>_ABB", mytext)
print(mytext)
# => This is AVGs_ABB and (NMN_ABB) and most importantly GFD_ABB
The replacement is now r"\g<0>_ABB", which replaces each non-overlapping match with the matched text itself followed by _ABB.
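To check that the whole-match backreference and an explicit capturing group behave the same on this input, here is a minimal sketch. The names `with_g0` and `with_group` are only illustrative.

```python
import re

mytext = "This is AVGs and (NMN) and most importantly GFD"

# Whole-match backreference: no capturing group in the pattern.
with_g0 = re.sub(r"\b[A-Z.]{2,}s?\b", r"\g<0>_ABB", mytext)

# Same pattern wrapped in an explicit group, referenced as \1.
with_group = re.sub(r"\b([A-Z.]{2,}s?)\b", r"\1_ABB", mytext)

print(with_g0)
print(with_g0 == with_group)
```

Both calls should give the same string; \g<0> simply avoids adding a group that is only there for the replacement.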
See the regex demo.
Also note that inside a character class, . is parsed as a literal . symbol, not as the "wildcard" that matches any character except a newline.
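A minimal sketch of that difference, using a made-up sample string (`sample` and the three result names are assumptions for illustration only):

```python
import re

sample = "ABC A.C A-C"

# Outside a character class, "." matches any character except a newline.
wildcard = re.findall(r"A.C", sample)

# Inside a character class, "." matches only a literal dot, escaped or not.
literal = re.findall(r"A[.]C", sample)
escaped = re.findall(r"A[\.]C", sample)

print(wildcard)  # ['ABC', 'A.C', 'A-C']
print(literal)   # ['A.C']
print(escaped)   # ['A.C']
```

So [A-Z\.] and [A-Z.] describe the same set of characters, and the backslash inside the class can be dropped.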