s1 = 'Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics'
s2 = 'Makeupby Antonia asia #makeup #makeupartist #makeupdolls'
s3 = 'Makeupby Antonia'
s4 = '#makeup #makeupartist #makeupdolls #abhcosmetics'
s5 = 'Makeupby Antonia asia america #makeup #makeupartist'
The regex should match only s1 and s2, because each of them has at most 3 normal words and more than one hashtag.
I can use \b(?<![#])[\w]+ to select the normal words, and I can use [#]{1}\w+ to select the hashtags.
But when I combine the two expressions, it doesn't work.
How can I build a final regex out of these individual regexes that also keeps track of the counts?
Split the text into words and count how many of them start with a hash sign.
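def check(text):
    words = text.split()
    # Booleans sum as 0/1, so this counts the tokens that start with '#'.
    num_hashtags = sum(word.startswith('#') for word in words)
    num_words = len(words) - num_hashtags
    # Between 1 and 3 normal words, and more than one hashtag.
    return 1 <= num_words <= 3 and num_hashtags > 1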
>>> [check(text) for text in [s1,s2,s3,s4]]
[True, True, False, False]
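If you would rather reuse the two patterns from the question, one option is to keep them separate and count their matches with re.findall instead of merging them into a single expression. This is only a sketch: the names WORD_RE, TAG_RE and check_with_findall are made up for illustration, and the patterns are the question's own, lightly simplified.

```python
import re

# Patterns adapted from the question; all names here are hypothetical.
WORD_RE = re.compile(r'\b(?<!#)\w+')  # a word that does not directly follow '#'
TAG_RE = re.compile(r'#\w+')          # a hashtag

def check_with_findall(text):
    # Count each kind of token separately, then compare the counts.
    num_words = len(WORD_RE.findall(text))
    num_hashtags = len(TAG_RE.findall(text))
    return 1 <= num_words <= 3 and num_hashtags > 1

samples = [
    'Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics',
    'Makeupby Antonia asia #makeup #makeupartist #makeupdolls',
    'Makeupby Antonia',
    '#makeup #makeupartist #makeupdolls #abhcosmetics',
    'Makeupby Antonia asia america #makeup #makeupartist',
]
results = [check_with_findall(text) for text in samples]
print(results)
```

The counting happens in Python rather than inside the regex, which keeps each pattern short enough to read at a glance.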
import re
def check(text):
    pattern = r'(?=.*\b(?<!#)\w+\b)(?!(?:.*\b(?<!#)\w+\b){4})(?:.*#){2}'
    return bool(re.match(pattern, text))
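I'm deliberately not explaining that regex, because I don't want you to use it. The confusion you're probably feeling should be a strong sign that this is bad code.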
If I understand your question correctly, and if you can assume that the words always come before the hashtags, you can use r'^(\w+ ){1,3}#\w+ #\w+':
import re
for s in ('Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics',
          'Makeupby Antonia asia #makeup #makeupartist #makeupdolls',
          'Makeupby Antonia',
          '#makeup #makeupartist #makeupdolls #abhcosmetics',
          'Makeupby Antonia asia america #makeup #makeupartist',):
    # 1 to 3 space-terminated words at the start, then at least two hashtags.
    print(bool(re.search(r'^(\w+ ){1,3}#\w+ #\w+', s)), s, sep=': ')
This outputs:
True: Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics
True: Makeupby Antonia asia #makeup #makeupartist #makeupdolls
False: Makeupby Antonia
False: #makeup #makeupartist #makeupdolls #abhcosmetics
False: Makeupby Antonia asia america #makeup #makeupartist
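To see what the "words always come before the hashtags" assumption costs, here is a small comparison against the split-based check from the earlier answer. The input string mixed is invented for illustration, as are the names ORDERED_PATTERN and check_split.

```python
import re

# The anchored pattern from this answer.
ORDERED_PATTERN = re.compile(r'^(\w+ ){1,3}#\w+ #\w+')

def check_split(text):
    # The split-and-count approach from the earlier answer.
    words = text.split()
    num_hashtags = sum(word.startswith('#') for word in words)
    num_words = len(words) - num_hashtags
    return 1 <= num_words <= 3 and num_hashtags > 1

# Hypothetical input: a hashtag appears before the normal words.
mixed = '#makeup Makeupby Antonia #makeupartist'

ordered_result = bool(ORDERED_PATTERN.search(mixed))
split_result = check_split(mixed)
print(ordered_result, split_result)
```

The pattern rejects this string because it has to start with a word, while the counting approach accepts it (2 normal words, 2 hashtags). Which behaviour is right depends on whether your data can actually look like this.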
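There's probably a lot of room for optimization (maybe with a dependency, or with fewer loops), but here is a non-regex solution, as discussed in the comments: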
s_list = [s1, s2, s3, s4]
def hashtag_words(string_list):
    words = [s.split(" ") for s in string_list]
    hashcounts = [["#" in word for word in wordlist].count(True) for wordlist in words]
    normcounts = [len(wordlist) - hashcount for wordlist, hashcount in zip(words, hashcounts)]
    sel_strings = [s for s, h, n in zip(string_list, hashcounts, normcounts) if h > 1 and n in (1, 2, 3)]
    return sel_strings
hashtag_words(s_list)
>['Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics',
'Makeupby Antonia asia #makeup #makeupartist #makeupdolls']
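As for the "fewer loops" remark, one possible single-pass variant is sketched below. The name hashtag_words_single_pass is hypothetical, and it uses s.split() rather than s.split(" "), so runs of whitespace are treated as one separator. That is a small behaviour change from the version above.

```python
def hashtag_words_single_pass(string_list):
    selected = []
    for s in string_list:
        tokens = s.split()
        # Same test as above: a token counts as a hashtag if it contains '#'.
        hashcount = sum('#' in token for token in tokens)
        normcount = len(tokens) - hashcount
        if hashcount > 1 and 1 <= normcount <= 3:
            selected.append(s)
    return selected

s1 = 'Makeupby Antonia #makeup #makeupartist #makeupdolls #abhcosmetics'
s2 = 'Makeupby Antonia asia #makeup #makeupartist #makeupdolls'
s3 = 'Makeupby Antonia'
s4 = '#makeup #makeupartist #makeupdolls #abhcosmetics'
s5 = 'Makeupby Antonia asia america #makeup #makeupartist'

selected = hashtag_words_single_pass([s1, s2, s3, s4, s5])
print(selected)
```

Each string is split once and both counts are computed in the same iteration, so no intermediate lists have to be zipped back together.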