C 的最小哈希函数?

问题描述 投票:0回答:7

我不能使用 boost:hash 因为我必须坚持使用 C 而不能使用 C++。

但是,我需要对大量(10K 到 100k)令牌字符串(5 到 40 字节长度)进行哈希处理,以便在这些字符串中进行搜索速度最快。

MD5、SHA1 或任何长哈希函数对于一个简单的任务来说似乎太重了,我不做密码学。另外还有存储和计算成本。

因此我的问题:

  1. 最简单的哈希算法可能会在大多数实际情况下确保防止碰撞。

  2. 哈希值使用多少位?我正在为 32 位系统进行开发。 Perl/Python 中的哈希算法也使用 32 位哈希吗?还是必须跳到64?

  3. 关于常见脚本语言中哈希表的实现:实现是否检查冲突,或者我可以完全避免该部分吗?

c hash hashtable
7个回答
24
投票

您可以在 http://www.azillionmonkeys.com/qed/hash.html

找到一个好的(且快速)哈希函数和有趣的读物

唯一不应该检查冲突的情况是,如果您使用完美哈希——一个很好的老式查找表,例如gperf


13
投票
  1. 这里是对最著名的已知哈希函数的很好的概述。

  2. 32 位应该可以正常工作。

  3. 你总是需要检查冲突,除非你想写一个有趣的哈希表:)


8
投票

用于哈希表查找的通用哈希函数。它指定请勿用于加密目的,但既然您指定您无意这样做,那么您应该没问题。

其中包括可供尝试的哈希函数调查


5
投票
如果您使用的是类似 posix 的系统并坚持使用纯 C,我会简单地使用系统已经提供的功能。 man 3 hcreate 为您提供所有详细信息,或者您可以在这里找到在线版本

http://linux.die.net/man/3/hcreate


2
投票

1
投票

xxhash 是非常快速且简单的选择。一个简单的代码将使用 XXH32

 函数:

unsigned int XXH32 (const void* input, int len, unsigned int seed);

它是 32 位哈希。由于

len

int
,对于超过 
2^31-1
 字节的较大数据,请使用这些:

void* XXH32_init (unsigned int seed); XXH_errorcode XXH32_update (void* state, const void* input, int len); unsigned int XXH32_digest (void* state);
    

0
投票
几行代码中就有相当不错的哈希函数,但它们不如针对特定 CPU 系列的优化函数那么快。因此,只有当哈希性能不是主要考虑因素或者减少空间(RAM 使用或代码大小)比速度更重要时,才应使用“最小”低代码函数。

话虽这么说,一些最好的简单哈希函数是

FNV1aMurmurOAATRSHashSDBM。后两者是四个中最快的。然而,SDBM 也很容易记住,所以它可能是其中最著名的。

SDBM 是

乘法哈希函数的一个示例,DJB2 和 K&Rv2 也属于其中。此类函数只需将当前哈希值乘以奇数并添加下一个字节:hash = hash * factor + data[i]

。 SDBM 使用数字 65599 作为因子,DJB2 – 33,K&Rv2 – 31。

RSHash 由 Robert Sedgewick (1990) 提出,它通过使用生成的因子序列而不是仅使用一个常数因子,将乘法方法更进一步。这是我最喜欢的“最小”哈希函数。但 MurmurOAAT 具有更好的雪崩能力。

当您使用所有 64 位时,或者如果您正在针对 64 位架构进行开发,则使用 64 位哈希是有意义的。如果您使用散列表的散列,则使用 64 位散列没有什么意义,因为您的散列表很可能不会有数十亿个槽。但是,例如,如果您使用哈希值作为指纹(校验和),则 64 位哈希值比 32 位好得多。

以下是上述哈希函数之间的比较,以及与一些现代优化函数(如 Murmur3 和 FarmHash)的比较。还包括其他答案中提到的一些其他哈希函数。

(1) 碰撞和雪崩

| Collisions | Avalanching Hash function | random 16B | words + names | words + names ============================================================================= FNV1a | 0.20% | 45, 0.01% | 98.0 MurmurOAAT | 0.20% | 50, 0.01% | 100.0 RSHash | 0.20% | 51, 0.01% | 95.5 SDBM | 0.20% | 53, 0.01% | 93.1 Jenkins OAAT | 0.19% | 66, 0.01% | 100.0 K&R v2 | 0.20% | 767, 0.12% | 89.8 SuperFastHash | 0.21% | 808, 0.13% | 100.0 DJB2 | 0.20% | 856, 0.13% | 89.7 XOR_rotate | 0.19% | 2156, 0.34% | 89.5 DEKHash | 0.19% | 2312, 0.36% | 88.6 Fletcher32 | 0.19% | 5274, 0.82% | 90.9 Adler32 | 67.44% | 371448, 58.03% | 73.9 ----------------------------------------------------------------------------- MurmurHash3_x86_32 | 0.19% | 42, 0.01% | 100.0 MurmurHash3_x64_32 | 0.20% | 50, 0.01% | 100.0 farmhash_fingerprint32 | 0.20% | 42, 0.01% | 100.0 farmhash_fingerprint64 | 0.19% | 36, 0.01% | 100.0 CRC32 (bit-by-bit) | 0.20% | 45, 0.01% | 99.8 CRC32 (table-driven) | 0.20% | 45, 0.01% | 99.8
Adler32 是唯一一个性能较差并且对 16 字节随机缓冲区产生大量冲突的算法。这真是一个糟糕的哈希函数!

对我来说,单词和名称的性能不佳表明哈希算法的质量较差。这就是为什么我不会推荐任何在此数据集上碰撞率超过 0.01% 的算法。

(2) 速度

| Time, ms Hash function | random 16B | random 16MB | words + names ====================================================================== FNV1a | 132.3 | 339.3 | 12.9 MurmurOAAT | 170.3 | 500.7 | 11.9 RSHash | 122.2 | 250.5 | 10.9 SDBM | 116.1 | 250.8 | 11.3 Jenkins OAAT | 159.0 | 417.8 | 11.2 K&R v2 | 116.1 | 250.2 | 11.1 SuperFastHash | 71.0 | 127.0 | 9.8 DJB2 | 169.5 | 251.0 | 10.8 XOR_rotate | 119.8 | 167.6 | 10.9 DEKHash | 115.5 | 167.0 | 10.9 Fletcher32 | 79.0 | 52.2 | 18.5 Adler32 | 391.6 | 669.0 | 12.3 ---------------------------------------------------------------------- MurmurHash3_x86_32 | 93.0 | 124.3 | 10.9 MurmurHash3_x64_32 | 96.3 | 53.6 | 13.5 farmhash_fingerprint32 | 87.2 | 52.2 | 5.4 farmhash_fingerprint64 | 48.8 | 17.3 | 5.7 CRC32 (bit-by-bit) | 1189.6 | 2091.1 | 23.0 CRC32 (table-driven) | 262.0 | 667.3 | 11.1
对于计时,我在 Apple M1 Pro 机器上编译并运行了我的代码。您可能会在 Intel x86-64 架构上获得不同的速度,特别是对于 FarmHash,它针对 Intel 指令集进行了深度优化。但总体情况可能会保持不变。

(3) 代码

如何编译运行:

gcc -O2 minimal_hash.c -o minimal_hash ./minimal_hash words.txt
,其中words.txt 是

所有英文单词人名(640143 个唯一字符串)的组合。

#include <stdio.h> #include <stdint.h> #include <stdlib.h> #include <string.h> #include <sys/time.h> #define SMALL 16 #define LARGE (1 << 24) // size of 16 MiB #define SEED 0x12345678 uint8_t buffer[LARGE * SMALL]; uint32_t hashes[LARGE]; typedef uint32_t (*hash_function) (const uint8_t*, size_t); uint32_t DJB2_hash(const uint8_t* buf, size_t size) { uint32_t hash = 5381; for (size_t i = 0; i < size; i++) hash = ((hash << 5) + hash) + buf[i]; /* hash * 33 + byte */ return hash; } uint32_t SDBM_hash(const uint8_t* buf, size_t size) { uint32_t hash = 0; for (size_t i = 0; i < size; i++) hash = (hash << 6) + (hash << 16) - hash + buf[i]; /* hash * 65599 + byte */ return hash; } uint32_t FNV1a(const uint8_t* data, size_t size) { uint32_t h = 2166136261UL; for (size_t i = 0; i < size; i++) { h ^= data[i]; h *= 16777619; } return h; } uint64_t FNV1a_64(const uint8_t* data, size_t size) { uint64_t h = 0xcbf29ce484222325UL; for (size_t i = 0; i < size; i++) { h ^= data[i]; h *= 0x00000100000001B3UL; } return h; } uint32_t FNV1a_64_32(const uint8_t* data, size_t size) { uint64_t h = FNV1a_64(data, size); return h ^ (h >> 32); } uint32_t MurmurOAAT_32(const uint8_t* data, size_t size) { // One-byte-at-a-time hash based on Murmur's mix uint32_t h = SEED; for (size_t i = 0; i < size; i++) { h ^= data[i]; h *= 0x5bd1e995; h ^= h >> 15; } return h; } uint32_t KR_v2_hash(const uint8_t* data, size_t size) { // a.k.a. Java String hashCode(), a.k.a. BKDR Hash Function uint32_t hashval = 0; for (size_t i = 0; i < size; i++) hashval = 31*hashval + data[i]; return hashval; } uint32_t Jenkins_one_at_a_time_hash(const uint8_t* data, size_t size) { // a.k.a. simply "One-at-a-Time Hash" uint32_t hash = 0; for (size_t i = 0; i < size; i++) { hash += data[i]; hash += (hash << 10); hash ^= (hash >> 6); } hash += (hash << 3); hash ^= (hash >> 11); hash += (hash << 15); return hash; } #define POLY_CRC32C 0x82f63b78 #define POLY_CRC32 0xedb88320 uint32_t CRC32b(const uint8_t *data, size_t size) { unsigned int crc = 0xFFFFFFFF, mask; for (size_t i = 0; i < size; i++) { crc ^= data[i]; for (int j = 7; j >= 0; j--) { mask = -(crc & 1); crc = (crc >> 1) ^ (POLY_CRC32 & mask); } } return ~crc; } inline uint32_t _rotl32(uint32_t x, int32_t bits) { return x<<bits | x>>(32-bits); // C idiom: will be optimized to a single CPU operation } uint32_t XOR_rotate(const uint8_t* data, size_t size) { // Source: https://stackoverflow.com/a/7666668/5407270 uint32_t result = 0x55555555; for (size_t i = 0; i < size; i++) { result ^= data[i]; result = _rotl32(result, 5); } return result; } uint32_t Adler32_slow(const uint8_t* data, size_t len) { // This is a slow implementation of Adler32, but it doesn't matter because // it's a VERY BAD hash function anyway. Don't use it! const uint32_t MOD_ADLER = 65521; uint32_t a = 1, b = 0; for (size_t i = 0; i < len; i++) { a = (a + data[i]) % MOD_ADLER; b = (b + a) % MOD_ADLER; } return (b << 16) | a; } uint32_t RSHash(const uint8_t* data, size_t size) { // Introduced by Robert Sedgewick in his "Algorithms in C" (1990) book. // Source: http://www.partow.net/programming/hashfunctions/index.html // RSHash seems to outperforms JSHash, PJWHash (a.k.a. ELFHash) and APHash from the same source in all respects. uint32_t a = 63689, b = 378551; uint32_t hash = 0; for (uint32_t i = 0; i < size; i++) { hash = hash * a + data[i]; a *= b; } return hash; } uint32_t DEKHash(const uint8_t* str, size_t length) { // Introduced by Donald E. Knuth in "The Art Of Computer Programming", Volume 3. // Source: http://www.partow.net/programming/hashfunctions/index.html unsigned int hash = length; for (size_t i = 0; i < length; i++) hash = ((hash << 5) ^ (hash >> 27)) ^ str[i]; return hash; } uint32_t Fletcher32(const uint8_t *data_, size_t len) { // Note: when `len` (size in bytes) is odd, this function will read past the last byte. // However, for C-strings, the next byte will be zero, which is sufficient for this test. // In other cases, you have to do zero-padding. const uint16_t* data = (const uint16_t*) data_; uint32_t c0, c1; len = (len + 1) & ~1; /* Round up len to words */ /* We similarly solve for n > 0 and n * (n+1) / 2 * (2^16-1) < (2^32-1) here. */ /* On modern computers, using a 64-bit c0/c1 could allow a group size of 23726746. */ for (c0 = c1 = 0; len > 0; ) { size_t blocklen = len; if (blocklen > 360*2) { blocklen = 360*2; } len -= blocklen; do { c0 = c0 + *data++; c1 = c1 + c0; } while ((blocklen -= 2)); c0 = c0 % 65535; c1 = c1 % 65535; } return (c1 << 16 | c0); } uint32_t Fletcher32_wrap(const uint8_t* data, size_t size) { if (size & 1) { // Add zero-padding for odd sizes. (This is, actually, unnecessary for // C-strings, because they are terminated by a nul-byte by definition.) uint8_t* buf = malloc(size + 1); memcpy(buf, data, size); buf[size + 1] = 0; uint32_t res = Fletcher32(buf, size + 1); free(buf); return res; } return Fletcher32(data, size); } struct timeval stop, start; void timing_start() { gettimeofday(&start, NULL); } void timing_end() { gettimeofday(&stop, NULL); printf("took %lu us, ", (stop.tv_sec - start.tv_sec) * 1000000 + stop.tv_usec - start.tv_usec); } #define ALGO_COUNT 12 hash_function get_hash_function(int algo) { hash_function res; switch (algo) { case 1: printf("FNV1a "); res = FNV1a; break; case 2: printf("MurmurOAAT "); res = MurmurOAAT_32; break; case 3: printf("RSHash "); res = RSHash; break; case 4: printf("SDBM "); res = SDBM_hash; break; case 5: printf("Jenkins OAAT "); res = Jenkins_one_at_a_time_hash; break; case 6: printf("K&R v2 "); res = KR_v2_hash; break; case 7: printf("DJB2 "); res = DJB2_hash; break; case 8: printf("XOR_rotate "); res = XOR_rotate; break; case 9: printf("DEKHash "); res = DEKHash; break; case 10: printf("Fletcher32 "); res = Fletcher32_wrap; break; case 11: printf("Adler32 "); res = Adler32_slow; break; case 12: printf("CRC32 (bit-by-bit) "); res = CRC32b; break; default: exit(-1); } printf("\t"); return res; } void fill_random(uint32_t* buffer, size_t count) { for (size_t i = 0; i < count; i++) buffer[i] = random(); } void show_sequence() { printf("Generated random sequence:"); for (int i = 0; i < SMALL; i++) { printf(" %02x", buffer[i]); } printf("\n"); } void count_collisions(const uint32_t* hashes, size_t count) { size_t slots = count * 32; uint32_t* cnt_table = calloc(slots, sizeof(uint32_t)); size_t collisions = 0; for (size_t i = 0; i < count; i++) { size_t j = hashes[i] % slots; while (cnt_table[j] != 0 && cnt_table[j] != hashes[i]) { j = (j + 1) % slots; } if (cnt_table[j] == 0) { cnt_table[j] = hashes[i]; } else { collisions++; } } double share = 100.0*(double)collisions/count; printf("%lu collisions out of %lu, %0.2f\n", collisions, count, share); } void evaluate_avalanching(const uint32_t* hashes, size_t count) { // Looks at hashes of consecutive sorted strings and assesses their // similarity score. The lower the total similarity, the worse avalanching. uint32_t score = 0; char last[9], current[9]; for (size_t i = 0; i < count; i++) { sprintf(current, "%08X", hashes[i]); if (i) { for (int j = 0; j < 8; j++) if (last[j] == current[j]) score++; } strcpy(last, current); } float similarity = 100.*score/count/8; float avalanching = similarity <= 6.25 ? 100 : (100 - similarity) / (100 - 6.25) * 100; printf("score: %u, similarity: %.2f, avalanching: %.1f\n", score, similarity, avalanching); } void calc_buffer_hashes(char* type, const uint8_t* buffer, size_t length, size_t count) { printf("Calculating hashes of %s buffers...\n", type); for (int algo = 1; algo <= ALGO_COUNT; algo++) { hash_function func = get_hash_function(algo); timing_start(); for (size_t i = 0; i < count; i++) { hashes[i] = func(buffer + length*i, length); } timing_end(); count_collisions(hashes, count); } } void calc_word_hashes(const char* fname, char* buffer) { #define MAX_WORD_LENGTH 50 printf("Calculating hashes of English words... (please, provide name of the file with words as first argument)\n"); char line[MAX_WORD_LENGTH + 1]; size_t word_count = 0; // Open file FILE* f = fopen(fname, "r"); // Read file line by line while (fgets(line, sizeof(line), f)) { line[strcspn(line, "\n")] = '\0'; // strip newline strcpy(buffer + word_count*MAX_WORD_LENGTH, line); word_count++; } fclose(f); // Calculate hashes for (int algo = 1; algo <= ALGO_COUNT; algo++) { hash_function func = get_hash_function(algo); timing_start(); for (size_t i = 0; i < word_count; i++) { char* word = buffer + MAX_WORD_LENGTH*i; size_t len = strlen(word); hashes[i] = func((uint8_t*)word, len); } timing_end(); count_collisions(hashes, word_count); evaluate_avalanching(hashes, word_count); } } int main(int argc, char* argv[]) { srandom(SEED); fill_random((uint32_t*) buffer, LARGE * SMALL / sizeof(uint32_t)); show_sequence(); calc_buffer_hashes("small", buffer, SMALL, LARGE); calc_buffer_hashes("large", buffer, LARGE, SMALL); calc_word_hashes(argv[1], (char*)buffer); return 0; }
    
© www.soinside.com 2019 - 2024. All rights reserved.