比较多个无序大字符串与重复字符的有效方法

问题描述 投票:0回答:4

我正在 python 中寻找一种有效的方法来比较多个大型无序字符串和重复字符,并根据它们是否相同返回 true 或 false。

约束条件:

  • 字符只是字母表
  • 字符串大小可以是数千个字符

例如:

string_1 = "hjqzukcwwkpjzyvwjbdyfbjvjmqgimhgjiumaipiragpktuchkqqgfaycktkawjjytqytftdpfepnigtrxevbbwbyqfpfzawvdhdhyakfeyindxxdwkyccqhjgtcfpkanrauexvrijhmbwfbahcxdhzdzhfyjwivzrxmepbweygntikuburewahpynuxayaefczqmrrmfeafxfujqxfiwtgaddatuqdggcxcynvgvxijqmhaukvxmmzaykbtcbazwxyecivbyxyzvndpjncvzwkfwwpvyphadwdfejjiawqgryetzzxvzmjbtzctupimvcufcdpuajuaxactcrhgjjmifvdvvckbjgwhcxaqurdupzgkftvnqcbeuytipqanyaffmxmbaxvnitdytkgwvzarqmqaxzejbiqtaqxmnrikgzbvjddnqtexackvpwnwxwypmtxnfdjfujffqnrefgurnuumwhdfrfxfyqiaquwaeqrmkktaxzymnajazpwbpyaxqzckwrieaxkjcrkdizxekmtcgejprfzqjrbfgxjzfpwdknipfqjnmmfhnpexrxqyyztcmfjeakfzhnghinibdpyammgccxjcxnqeitmgfnqmkrwtwgzefbqmzhzgqymetedrwamebuadhhikcddfteugrynfdibimrrbxuwzhmfhxxixrumqxtkv"
string_2 = "indnrefgurnuumwhdfrfxfyqiaquwaeqrmkktaxzymnajazpwbpyaxqzckwrieaxkjcrkdizxekmtcgejprfzqjrbfgxjzfpwdknipfqjnmmfhnpexrxqyyztcmfjeakfzhnghinibdpyxxdwkyccqhjgtcfpkanrauexvrijhmbwfbahcxdhzdzhfyjwivzrxmepbweygntikuburewahpynuxayaefczqmrrmfeafxfujqxfiwtgaddatuqdggcxcynvgvxijqmhaukvxmmzaykbtcbazwxyecivbyxyzvndpjncvzwkfwwpvyphadwdfejjiawqgryetzzxvzmjbtzctupimvcufcdpuajuaxactcrhgjjmifvdvvckbjgwhcxaqurdupzgkftvnqcbeuytipqanyaffmxmbaxvnitdytkgwvzarqmqaxzehjqzukcwwkpjzyvwjbdyfbjvjmqgimhgjiumaipiragpktuchkqqgfaycktkawjjytqytftdpfepnigtrxevbbwbyqfpfzawvdhdhyakfeyjbiqtaqxmnrikgzbvjddnqtexackvpwnwxwypmtxnfdjfujffqammgccxjcxnqeitmgfnqmkrwtwgzefbqmzhzgqymetedrwamebuadhhikcddfteugrynfdibimrrbxuwzhmfhxxixrumqxtkv"

if sorted(string_1) == sorted(string_2):
    return True

这变得非常慢,因为它在检查之前对整个字符串进行排序,我也不能使用集合,因为它们没有考虑字符的出现。我将不胜感激有关如何进行此检查的任何意见。

python string string-comparison
4个回答
3
投票

你可以使用

collections.Counter
,它在线性时间内得到每个字符的频率。

from collections import Counter
print(Counter(string_1) == Counter(string_2))

2
投票

快速解决方案:

from string import ascii_lowercase as abc

def anagrams(s, t):
    if len(s) != len(t):
        return False
    for c in abc[1:]:
        if s.count(c) != t.count(c):
            return False
    return True

字符串长度为 5000 的基准:

anagrams
 300.6 ±  0.4 μs  Kelly
 417.2 ±  3.0 μs  Unmitigated
1202.3 ±  3.7 μs  chrisfang
1547.4 ±  5.2 μs  original

not anagrams
  14.1 ±  3.3 μs  Kelly
 414.9 ±  5.0 μs  Unmitigated
1210.6 ±  9.3 μs  chrisfang
1555.0 ± 11.5 μs  original

基准代码:

def Kelly(s, t):
    if len(s) != len(t):
        return False
    for c in abc[1:]:
        if s.count(c) != t.count(c):
            return False
    return True

def original(string_1, string_2):
    return sorted(string_1) == sorted(string_2)

def Unmitigated(string_1, string_2):
    return Counter(string_1) == Counter(string_2)

def chrisfang(string_1: str, string_2: str):
    is_same = True
    str1_chat_map = record_character(string_1)
    str2_chat_map = record_character(string_2)
    str_keys = set(str1_chat_map.keys()) | set(str2_chat_map.keys())
    for k in str_keys:
        str1_chat_count = str1_chat_map.get(k, 0)
        str2_chat_count = str2_chat_map.get(k, 0)
        if str1_chat_count != str2_chat_count:
            is_same = False
    return is_same


def record_character(string: str):
    char_map: Dict[str, int] = dict()
    for i in string:
        count = char_map.get(i, 0)
        char_map[i] = count + 1
    return char_map

from collections import Counter
from timeit import repeat
from statistics import mean, stdev
import random
from string import ascii_lowercase as abc
from operator import eq

funcs = original, Kelly, Unmitigated, chrisfang

times = {f: [] for f in funcs}
def stats(f):
    ts = [t * 1e6 for t in sorted(times[f])[:25 if test == 'anagrams' else 100]]
    return f'{mean(ts):6.1f} ± {stdev(ts):4.1f} μs '

for test in 'anagrams', 'not anagrams':
  print(test)

  for _ in range(100):
    a = random.choices(abc, k=5000)
    string_1 = ''.join(a)
    if test == 'anagrams':
      random.shuffle(a)
    else:
      a = random.choices(abc, k=5000)
    string_2 = ''.join(a)
    args = string_1, string_2

    for f in funcs:
        t = min(repeat(lambda: f(*args), number=1)) / 1
        times[f].append(t)

  for f in sorted(funcs, key=stats):
    print(stats(f), f.__name__)
  print()

在线尝试!


0
投票

可以这样试试

def compare_str(string_1: str, string_2: str):
    is_same = True
    str1_chat_map = record_character(string_1)
    str2_chat_map = record_character(string_2)
    str_keys = set(str1_chat_map.keys()) | set(str2_chat_map.keys())
    for k in str_keys:
        str1_chat_count = str1_chat_map.get(k, 0)
        str2_chat_count = str2_chat_map.get(k, 0)
        if str1_chat_count != str2_chat_count:
            is_same = False
    return is_same


def record_character(string: str):
    char_map: Dict[str, int] = dict()
    for i in string:
        count = char_map.get(i, 0)
        char_map[i] = count + 1
    return char_map

0
投票

答案中确实存在完全 pythonic 解决方案,但如果我们可以访问 numpy,则可以探索使用素数进行散列的可能性。假设所有小写字符串,解决方案比以前的答案更快。使用来自用户@KellyBundy 的不错的基准测试脚本,我也用它来比较这个版本。可以在 KellyBundy 的 Answer 中的基准测试脚本中添加以下功能:

import numpy as np
x = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109])    
def prime_hash(s):
    a = (np.frombuffer(s.encode(), dtype=np.uint8) - 97)
    return np.prod(x[a])

def lifezbeautiful(s1, s2):
    return prime_hash(s1) == prime_hash(s2)

可以在这里在线试用。我在这里省略了复制冗余代码。

两个质数 a 和 b 相乘形成乘积 P。现在,如果 P 被因式分解,它只会产生 a 和 b 以及因数。没有其他数字可以产生相同的产品。因此,如果我们将每个字符映射到一个小素数,它就可以用来比较字符串。此外,我使用 numpy from buffer(因为不推荐使用 from string)来向量化将字符映射到 ASCII 的操作,然后将它们映射到相应的小素数。

结果显示从基准测试代码复制的更快的性能:

anagrams
  60.3 ±  0.2 μs  lifezbeautiful
 300.8 ±  0.4 μs  Kelly
 399.9 ±  2.3 μs  Unmitigated
1190.4 ±  8.5 μs  chrisfang
1554.1 ± 10.3 μs  original

not anagrams
  13.4 ±  2.1 μs  Kelly
  60.5 ±  0.2 μs  lifezbeautiful
 398.6 ±  6.0 μs  Unmitigated
1204.1 ± 14.1 μs  chrisfang
1560.6 ± 14.2 μs  original
© www.soinside.com 2019 - 2024. All rights reserved.