如何生成两个字符串可以重叠的所有可能方式?
def overlap(s1: str, s2: str) -> set[(str, int, int)]:
r'''Returns: (superstring, start index of s1 in superstring, start index of s2 in superstring)'''
...
s1 = 'everywhere'
s2 = 'whereas'
overlap(s1, s2) # {('everywherewhereas', 0, 10), ('everywhereas', 0, 5), ('whereaseverywhere', 7, 0)}
s1 = 'baba'
s2 = 'aba'
overlap(s1, s2) # {('babaaba', 0, 4), ('bababa', 0, 3), ('baba', 0, 1), ('ababa', 1, 0), ('abababa', 3, 0)}
s1 = 'yesyesyes'
s2 = 'yes'
overlap(s1, s2) # {('yesyesyes', 0, 0), ('yesyesyes', 0, 3), ('yesyesyes', 0, 6), ('yesyesyesyes', 0, 9), ('yesyesyesyes', 3, 0)}
生成两个字符串可以重叠的所有方式相当于生成以下可能的最短超字符串的集合:
(s1, s2)
(s1 + s2[1:], s2)
(s1 + s2[2:], s2)
(s1 + s2[3:], s2)
...
(s1, s2 + s1[1:])
(s1, s2 + s1[2:])
(s1, s2 + s1[3:])
...
difflib.SequenceMatcher#get_maching_blocks
的结果只是匹配子字符串,而不是重叠位置,也不是超字符串本身。 difflib.SequenceMatcher#find_longest_match
也不能保证是两个输入的超串。
使用仅查找
s
从 0
开始处的重叠的助手:
def overlap(s1: str, s2: str) -> set[(str, int, int)]:
def overlap(s, t):
m, n = len(s), len(t)
return {
(s[:i] + t + s[i+k:], 0, i)
for i in range(m + 1)
for k in [min(m-i, n)]
if s[i:i+k] == t[:k]
}
return overlap(s1, s2) | {(s, i, 0) for s, _, i in overlap(s2, s1)}
def test(s1, s2, expect):
result = overlap(s1, s2)
print(result == expect)
print('missing:', expect - result)
print('extra:', result - expect)
print()
s1 = 'everywhere'
s2 = 'whereas'
test(s1, s2, {('everywherewhereas', 0, 10), ('everywhereas', 0, 5), ('whereaseverywhere', 7, 0)})
s1 = 'baba'
s2 = 'aba'
test(s1, s2, {('babaaba', 0, 4), ('bababa', 0, 3), ('baba', 0, 1), ('ababa', 1, 0), ('abababa', 3, 0)})
s1 = 'yesyesyes'
s2 = 'yes'
test(s1, s2, {('yesyesyes', 0, 0), ('yesyesyes', 0, 3), ('yesyesyes', 0, 6), ('yesyesyesyes', 0, 9), ('yesyesyesyes', 3, 0)})