Finding unordered words using RegEx

Question

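I want to use RegEx to find the first sequence in a string in which a given set of words all appear, in any order.

For example, if I am looking for the words hello, my and world, then:

for hello my sweet world, the expression would match hello my sweet world;
for oh my, hello world, it would match my, hello world;
for oh my world, hello world, it would match my world, hello;
for hello world, there would be no match.

After some research, I tried the expression (?=.*?\bhello\b)(?=.*?\bmy\b)(?=.*?\bworld\b).*, which does not solve my problem because it matches the entire string whenever all the words are present, e.g.:

for oh my world, hello world, it matches oh my world, hello world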
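
This behaviour is easy to reproduce in Python. The following is a minimal sketch, assuming the pattern is applied with re.search (the variable names are illustrative only). Each lookahead only asserts that a word occurs somewhere ahead of the current position, so the trailing .* consumes everything from the point where the match starts.

```python
import re

# The attempted pattern: three lookaheads followed by a greedy .*
attempt = re.compile(r'(?=.*?\bhello\b)(?=.*?\bmy\b)(?=.*?\bworld\b).*')

text = 'oh my world, hello world'
m = attempt.search(text)

# All three lookaheads already succeed at position 0,
# so .* swallows the whole string rather than the shortest span.
print(m.group())

# With a word missing, one lookahead fails at every position: no match.
print(attempt.search('hello world'))
```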

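What would be an appropriate expression to achieve what I described?

(Although RegEx is the preferred approach for my program, any other Python solution is welcome if you think it is not the way to go.)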

Answer 1

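I think this task is better handled with some programming logic; a regex for it is neither easy to write nor efficient. Still, here is a regex that seems to do the job, whether or not repeated words (hello my world) are present: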

\b(hello|my|world)\b.*?((?!\1)\b(?:hello|my|world)\b).*?(?:(?!\1)(?!\2)\b(?:hello|my|world)\b)

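The idea here is:

Create an alternation group \b(hello|my|world)\b and capture it in group 1.
Optionally, zero or more of any characters may follow it (matched lazily).
Next must come one of the two remaining words, i.e. not the word matched in group 1, which is why I used ((?!\1)\b(?:hello|my|world)\b); this second match is captured in group 2.
Again, zero or more of any characters may optionally follow.
Then the same logic is applied once more: the third word must be the one not captured in group 1 or group 2, hence (?:(?!\1)(?!\2)\b(?:hello|my|world)\b)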
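
To check the steps above against the four examples from the question, here is a small Python sketch. The helper name first_sequence is hypothetical, and the pattern is the one from this answer, only split across lines for readability.

```python
import re

# The pattern from this answer, split into its three parts.
pattern = re.compile(
    r'\b(hello|my|world)\b.*?'                  # first word -> group 1
    r'((?!\1)\b(?:hello|my|world)\b).*?'        # a different word -> group 2
    r'(?:(?!\1)(?!\2)\b(?:hello|my|world)\b)'   # the one remaining word
)

def first_sequence(text):
    """Return the first span containing all three words, or None."""
    m = pattern.search(text)
    return m.group() if m else None

for sample in ('hello my sweet world',
               'oh my, hello world',
               'oh my world, hello world',
               'hello world'):
    print(repr(sample), '->', repr(first_sequence(sample)))
```

Note that . does not match a newline by default, so as written the three words have to occur on the same line; passing re.DOTALL would lift that restriction.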

Here is a Demo

Answer 2

A Pythonic approach that makes a single pass over the matches, using Pattern.finditer() and a set object:

import re

test_str = '''The introduction here for our novel. 
Oh, hello my friend. This world is full of beauty and mystery, let's say hello to universe ...'''

words_set = {'my', 'hello', 'world'}    # a set of search words
words_set_copy = set(words_set)
pat = re.compile(r'\b(my|hello|world)\b', re.I)
start_pos = None
first_sequence = ''

for m in pat.finditer(test_str):        
    if start_pos is None:
        start_pos = m.start()           # start position of the 1st match object
    words_set_copy.discard(m.group().lower())   # lowercase first: re.I may match 'Hello', but the set holds 'hello'

    if not words_set_copy:              # all the search words found
        first_sequence += test_str[start_pos: m.end()]
        break

print(first_sequence)

Output:

hello my friend. This world

You can turn the above approach into a function to make it reusable.
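
One possible shape for such a function is sketched below. The name find_first_sequence and its parameters are hypothetical, and it assumes case-insensitive matching (following the re.I flag above), so the search words are lowercased before comparison.

```python
import re

def find_first_sequence(text, words):
    """Return the first span of `text` that contains every word in `words`,
    in any order, or None if some word is missing."""
    remaining = {w.lower() for w in words}
    if not remaining:
        return None
    # re.escape guards against search words containing regex metacharacters.
    alternation = '|'.join(re.escape(w) for w in sorted(remaining))
    pat = re.compile(r'\b(?:' + alternation + r')\b', re.I)

    start_pos = None
    for m in pat.finditer(text):
        if start_pos is None:
            start_pos = m.start()              # start of the first matched word
        remaining.discard(m.group().lower())   # mark this word as seen
        if not remaining:                      # every search word has been seen
            return text[start_pos:m.end()]
    return None

text = "Oh, hello my friend. This world is full of beauty and mystery"
print(find_first_sequence(text, ['my', 'hello', 'world']))
print(find_first_sequence('hello world', ['my', 'hello', 'world']))
```

Returning None instead of an empty string makes the "no match" case explicit for callers, mirroring what re.search does.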
