Python regex that does not match inside http://

Question

I am facing a problem: I need to match and replace certain words, but only where they are not contained in an http:// URL.

My current regex:

 http://.*?\s+

This matches the URLs themselves, e.g. http://www.egg1.com http://www.egg2.com

I need a regex that matches certain words only when they occur outside an http:// URL.

Example:

"This is a sample. http://www.egg1.com and http://egg2.com. This regex will only match 
 this egg1 and egg2 and not the others contained inside http:// "

 Match: egg1 egg2

 Replaced: replaced1 replaced2

Final output:

 "This is a sample. http://www.egg1.com and http://egg2.com. This regex will only 
  match this replaced1 and replaced2 and not the others contained inside http:// "

Problem: I need to match certain patterns (e.g. egg1 egg2) unless they are part of an http:// URL. If they occur inside an http:// URL, egg1 and egg2 must not be matched.

Answer 1

One solution I can think of is to build a combined pattern for HTTP URLs and your pattern, then filter the matches accordingly:

import re

t = "http://www.egg1.com http://egg2.com egg3 egg4"

# Group 1 consumes whole URLs; group 2 captures eggs outside them.
p = re.compile(r'(http://\S+)|(egg\d)')
for url, egg in p.findall(t):
    if egg:
        print(egg)

This prints:

egg3
egg4

Update: to use this idiom with re.sub(), just supply a filter function:

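p = re.compile(r'(http://\S+)|(egg(\d+))')

def repl(match):
    # Group 2 is set only when an egg was matched outside a URL.
    if match.group(2):
        return 'spam{0}'.format(match.group(3))
    # Otherwise a URL was matched: return it unchanged.
    return match.group(0)

print(p.sub(repl, t))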

This prints:

http://www.egg1.com http://egg2.com spam3 spam4
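
Applied to the sample text from the question, the same idea might look like the sketch below. The REPLACEMENTS mapping and the function name replace_outside_urls are assumptions made for illustration; the question only says that egg1 and egg2 should become replaced1 and replaced2.

```python
import re

# Hypothetical mapping from the words in the question to their replacements.
REPLACEMENTS = {"egg1": "replaced1", "egg2": "replaced2"}

# First alternative consumes a whole URL; second captures a bare word.
PATTERN = re.compile(r"http://\S+|\b(egg1|egg2)\b")

def replace_outside_urls(text):
    def repl(match):
        word = match.group(1)
        if word:                   # a bare word was matched, not a URL
            return REPLACEMENTS[word]
        return match.group(0)      # a URL was matched: put it back unchanged
    return PATTERN.sub(repl, text)

sample = ("This is a sample. http://www.egg1.com and http://egg2.com. "
          "This regex will only match this egg1 and egg2 and not the others "
          "contained inside http://")
result = replace_outside_urls(sample)
print(result)
```

Because the URL alternative comes first and the engine scans left to right, any egg inside a URL is consumed as part of the URL before the word alternative gets a chance to see it.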

Answer 2

This will not capture the http://... part:

(?:http://.*?\s+)|(egg1)
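
A short sketch of how this pattern could be used with findall(), assuming the sample strings below (they are not from the original answer). Since the URL alternative is non-capturing, URL matches show up as empty strings and have to be filtered out. Note also that the URL alternative requires trailing whitespace, so a URL at the very end of the string is not consumed.

```python
import re

pattern = re.compile(r"(?:http://.*?\s+)|(egg1)")

text = "http://www.egg1.com egg1 and more egg1"

# findall() returns the contents of group 1 for every match:
# '' when the URL alternative matched, 'egg1' otherwise.
groups = pattern.findall(text)
words = [g for g in groups if g]
print(groups)
print(words)

# Caveat: with no whitespace after the URL, the URL alternative fails
# and the egg1 inside the URL is matched after all.
at_end = pattern.findall("see http://www.egg1.com")
print(at_end)
```

Replacing `.*?\s+` with `\S+`, as in Answer 1, would likely avoid the end-of-string caveat.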

Answer 3

You need to precede your pattern with a negative lookbehind assertion:

(?<!http://)egg[0-9]

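In this regex, every time the engine finds text matching egg[0-9], it looks back to verify that the text immediately preceding it is not http://. A negative lookbehind assertion starts with (?<! and ends with ). Whatever sits between those delimiters must not appear immediately before the pattern that follows, and it is not included in the result.

How to use it in your case: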

>>> import re
>>> regex = re.compile(r'(?<!http://)egg[0-9]')
>>> a = "Example: http://egg1.com egg2 http://egg3.com egg4foo"
>>> regex.findall(a)
['egg2', 'egg4']
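
One thing worth checking with this approach: the lookbehind only inspects the seven characters directly before the match, so it works when the word follows http:// immediately but not when something like www. sits in between. A small sketch, using sample strings that are assumptions for illustration:

```python
import re

regex = re.compile(r"(?<!http://)egg[0-9]")

# egg1 directly follows http://, so the lookbehind rejects it.
direct = regex.findall("http://egg1.com egg2")
print(direct)

# Here "www." sits between http:// and egg1, so the lookbehind
# no longer sees http:// and egg1 is matched as well.
with_www = regex.findall("http://www.egg1.com egg2")
print(with_www)
```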

Answer 4

Extending brandizzi's answer, I would just change his regex to this:

(?<!http://[\w\._-]*)(egg1|egg2)
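
A caveat when trying this in Python: the standard re module only accepts fixed-width lookbehind, so a lookbehind containing `*` is rejected at compile time. The sketch below shows that, plus one possible fallback that checks the run of non-space characters before each match in plain Python. The helper name find_outside_urls and the sample text are assumptions, not part of the original answer.

```python
import re

variable_width = r"(?<!http://[\w\._-]*)(egg1|egg2)"

try:
    re.compile(variable_width)
    compiled_ok = True
except re.error as exc:
    # The standard library refuses variable-width lookbehind.
    compiled_ok = False
    print("re rejected the pattern:", exc)

WORD = re.compile(r"egg1|egg2")

def find_outside_urls(text):
    """Return the words whose preceding non-space run has no http://."""
    hits = []
    for m in WORD.finditer(text):
        # Run of non-space characters immediately before the match.
        prefix = re.search(r"\S*\Z", text[:m.start()]).group(0)
        if "http://" not in prefix:
            hits.append(m.group(0))
    return hits

found = find_outside_urls("http://www.egg1.com egg2 http://a-b.egg2.org egg1")
print(found)
```

The fallback is stricter than the character class in the pattern above: it treats everything up to the previous whitespace as part of the URL.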
