使用mod划分ID的良好哈希函数

Question

我有一批ID，我想用良好的线性扩展功能对其进行分区。ID不包含时间戳，并且确实传播不佳。我只限于几个哑巴xpath operators。

您能否提出更好的功能来在10个存储桶之间分配ID？

public static void main(String[] args) {
    int[] buckets = new int[10];
    for (int i = 0; i < 10; i++)
        buckets[i] = 0;

    for (int i = 0; i < 1000; i++) {
        String id = format("130770%s0020", i);
        long l = parseLong(id);
        int partition = (int) f(l);
        buckets[partition] = buckets[partition] + 1;
    }

    for (int i = 0; i < 10; i++)
        out.println(buckets[i]);
}

目前我最好的结果是

private static long f(long l) {
    return l % 31 % 10;
}

给出

130 96 97 96 97 97 97 96 98 97 96

您可以提出更好的实施方案吗？

Answer 1

如果您的目标是使事物在存储桶之间平均分配，这似乎可行：

return ((l / 10000) % 1000) % 10;

（（这只是从数字中提取i回来]

Ideone demo。

输出：

100 100 100 100 100 100 100 100 100 100

似乎提供相同结果的替代方法：

// NB: abs(int) isn't always non-negative. Should really handle Integer.MIN_VALUE.
return Math.abs(Long.toString(l).hashCode()) % 10;

Ideone demo

输出：

100 100 100 100 100 100 100 100 100 100

Answer 2

我建议选择与HashMap类相同的解决方案。

/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower.  Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.)  So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

对于您的代码，其含义是：

return (l ^ (l >>> 16)) % 10;

使用您的测试数据，产生的传播范围：

109 102 103 94 91 95 93 100 104 109

来自comment：

我没有班次

表达式l >>> 16也可以写为l / 65536，但是除法比移位慢很多，因此这就是为什么您通常使用l >>> 16。

UPDATE来自另一个comment：

我没有XOR运算符

使用+代替^不好，但是在这里看起来足够好：

return (l + (l / 65536)) % 10;

结果传播：

101 92 92 99 105 104 105 99 97 106

使用mod划分ID的良好哈希函数

问题描述投票：0回答：2

2个回答

最新问题

使用mod划分ID的良好哈希函数

问题描述 投票：0回答：2

2个回答

最新问题

问题描述投票：0回答：2