我正在 python 中寻找一种有效的方法来比较多个大型无序字符串和重复字符,并根据它们是否相同返回 true 或 false。
约束条件:
例如:
string_1 = "hjqzukcwwkpjzyvwjbdyfbjvjmqgimhgjiumaipiragpktuchkqqgfaycktkawjjytqytftdpfepnigtrxevbbwbyqfpfzawvdhdhyakfeyindxxdwkyccqhjgtcfpkanrauexvrijhmbwfbahcxdhzdzhfyjwivzrxmepbweygntikuburewahpynuxayaefczqmrrmfeafxfujqxfiwtgaddatuqdggcxcynvgvxijqmhaukvxmmzaykbtcbazwxyecivbyxyzvndpjncvzwkfwwpvyphadwdfejjiawqgryetzzxvzmjbtzctupimvcufcdpuajuaxactcrhgjjmifvdvvckbjgwhcxaqurdupzgkftvnqcbeuytipqanyaffmxmbaxvnitdytkgwvzarqmqaxzejbiqtaqxmnrikgzbvjddnqtexackvpwnwxwypmtxnfdjfujffqnrefgurnuumwhdfrfxfyqiaquwaeqrmkktaxzymnajazpwbpyaxqzckwrieaxkjcrkdizxekmtcgejprfzqjrbfgxjzfpwdknipfqjnmmfhnpexrxqyyztcmfjeakfzhnghinibdpyammgccxjcxnqeitmgfnqmkrwtwgzefbqmzhzgqymetedrwamebuadhhikcddfteugrynfdibimrrbxuwzhmfhxxixrumqxtkv"
string_2 = "indnrefgurnuumwhdfrfxfyqiaquwaeqrmkktaxzymnajazpwbpyaxqzckwrieaxkjcrkdizxekmtcgejprfzqjrbfgxjzfpwdknipfqjnmmfhnpexrxqyyztcmfjeakfzhnghinibdpyxxdwkyccqhjgtcfpkanrauexvrijhmbwfbahcxdhzdzhfyjwivzrxmepbweygntikuburewahpynuxayaefczqmrrmfeafxfujqxfiwtgaddatuqdggcxcynvgvxijqmhaukvxmmzaykbtcbazwxyecivbyxyzvndpjncvzwkfwwpvyphadwdfejjiawqgryetzzxvzmjbtzctupimvcufcdpuajuaxactcrhgjjmifvdvvckbjgwhcxaqurdupzgkftvnqcbeuytipqanyaffmxmbaxvnitdytkgwvzarqmqaxzehjqzukcwwkpjzyvwjbdyfbjvjmqgimhgjiumaipiragpktuchkqqgfaycktkawjjytqytftdpfepnigtrxevbbwbyqfpfzawvdhdhyakfeyjbiqtaqxmnrikgzbvjddnqtexackvpwnwxwypmtxnfdjfujffqammgccxjcxnqeitmgfnqmkrwtwgzefbqmzhzgqymetedrwamebuadhhikcddfteugrynfdibimrrbxuwzhmfhxxixrumqxtkv"
if sorted(string_1) == sorted(string_2):
return True
这变得非常慢,因为它在检查之前对整个字符串进行排序,我也不能使用集合,因为它们没有考虑字符的出现。我将不胜感激有关如何进行此检查的任何意见。
你可以使用
collections.Counter
,它在线性时间内得到每个字符的频率。
from collections import Counter
print(Counter(string_1) == Counter(string_2))
快速解决方案:
from string import ascii_lowercase as abc
def anagrams(s, t):
if len(s) != len(t):
return False
for c in abc[1:]:
if s.count(c) != t.count(c):
return False
return True
字符串长度为 5000 的基准:
anagrams
300.6 ± 0.4 μs Kelly
417.2 ± 3.0 μs Unmitigated
1202.3 ± 3.7 μs chrisfang
1547.4 ± 5.2 μs original
not anagrams
14.1 ± 3.3 μs Kelly
414.9 ± 5.0 μs Unmitigated
1210.6 ± 9.3 μs chrisfang
1555.0 ± 11.5 μs original
基准代码:
def Kelly(s, t):
if len(s) != len(t):
return False
for c in abc[1:]:
if s.count(c) != t.count(c):
return False
return True
def original(string_1, string_2):
return sorted(string_1) == sorted(string_2)
def Unmitigated(string_1, string_2):
return Counter(string_1) == Counter(string_2)
def chrisfang(string_1: str, string_2: str):
is_same = True
str1_chat_map = record_character(string_1)
str2_chat_map = record_character(string_2)
str_keys = set(str1_chat_map.keys()) | set(str2_chat_map.keys())
for k in str_keys:
str1_chat_count = str1_chat_map.get(k, 0)
str2_chat_count = str2_chat_map.get(k, 0)
if str1_chat_count != str2_chat_count:
is_same = False
return is_same
def record_character(string: str):
char_map: Dict[str, int] = dict()
for i in string:
count = char_map.get(i, 0)
char_map[i] = count + 1
return char_map
from collections import Counter
from timeit import repeat
from statistics import mean, stdev
import random
from string import ascii_lowercase as abc
from operator import eq
funcs = original, Kelly, Unmitigated, chrisfang
times = {f: [] for f in funcs}
def stats(f):
ts = [t * 1e6 for t in sorted(times[f])[:25 if test == 'anagrams' else 100]]
return f'{mean(ts):6.1f} ± {stdev(ts):4.1f} μs '
for test in 'anagrams', 'not anagrams':
print(test)
for _ in range(100):
a = random.choices(abc, k=5000)
string_1 = ''.join(a)
if test == 'anagrams':
random.shuffle(a)
else:
a = random.choices(abc, k=5000)
string_2 = ''.join(a)
args = string_1, string_2
for f in funcs:
t = min(repeat(lambda: f(*args), number=1)) / 1
times[f].append(t)
for f in sorted(funcs, key=stats):
print(stats(f), f.__name__)
print()
可以这样试试
def compare_str(string_1: str, string_2: str):
is_same = True
str1_chat_map = record_character(string_1)
str2_chat_map = record_character(string_2)
str_keys = set(str1_chat_map.keys()) | set(str2_chat_map.keys())
for k in str_keys:
str1_chat_count = str1_chat_map.get(k, 0)
str2_chat_count = str2_chat_map.get(k, 0)
if str1_chat_count != str2_chat_count:
is_same = False
return is_same
def record_character(string: str):
char_map: Dict[str, int] = dict()
for i in string:
count = char_map.get(i, 0)
char_map[i] = count + 1
return char_map
答案中确实存在完全 pythonic 解决方案,但如果我们可以访问 numpy,则可以探索使用素数进行散列的可能性。假设所有小写字符串,解决方案比以前的答案更快。使用来自用户@KellyBundy 的不错的基准测试脚本,我也用它来比较这个版本。可以在 KellyBundy 的 Answer 中的基准测试脚本中添加以下功能:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109])
def prime_hash(s):
a = (np.frombuffer(s.encode(), dtype=np.uint8) - 97)
return np.prod(x[a])
def lifezbeautiful(s1, s2):
return prime_hash(s1) == prime_hash(s2)
可以在这里在线试用。我在这里省略了复制冗余代码。
两个质数 a 和 b 相乘形成乘积 P。现在,如果 P 被因式分解,它只会产生 a 和 b 以及因数。没有其他数字可以产生相同的产品。因此,如果我们将每个字符映射到一个小素数,它就可以用来比较字符串。此外,我使用 numpy from buffer(因为不推荐使用 from string)来向量化将字符映射到 ASCII 的操作,然后将它们映射到相应的小素数。
结果显示从基准测试代码复制的更快的性能:
anagrams
60.3 ± 0.2 μs lifezbeautiful
300.8 ± 0.4 μs Kelly
399.9 ± 2.3 μs Unmitigated
1190.4 ± 8.5 μs chrisfang
1554.1 ± 10.3 μs original
not anagrams
13.4 ± 2.1 μs Kelly
60.5 ± 0.2 μs lifezbeautiful
398.6 ± 6.0 μs Unmitigated
1204.1 ± 14.1 μs chrisfang
1560.6 ± 14.2 μs original