Hash function in Spark

Question

I'm trying to add a column to a DataFrame that will contain the hash of another column.

I found this piece of documentation: https://spark.apache.org/docs/2.3.0/api/sql/index.html#hash
and tried this:

import org.apache.spark.sql.functions._
import spark.implicits._ // enables the $"..." column syntax (already in scope in spark-shell)
val df = spark.read.parquet(...)
val withHashedColumn = df.withColumn("hashed", hash($"my_column"))

But which hash function does hash() use? Is it murmur, sha, md5, or something else?

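The values I get in this column are integers, so the range of values here is probably [-2^31 ... 2^31 - 1].
Can I get a long value here? Can I get a string hash instead?
How can I specify a concrete hashing algorithm for that?
Can I use a custom hash function?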

Answer 1

According to the source code, it is Murmur3:

  /**
   * Calculates the hash code of given columns, and returns the result as an int column.
   *
   * @group misc_funcs
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def hash(cols: Column*): Column = withExpr {
    new Murmur3Hash(cols.map(_.expr))
  }

Answer 2

If you want a long hash, Spark 3 has the xxhash64 function: https://spark.apache.org/docs/3.0.0-preview/api/sql/index.html#xxhash64
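A minimal sketch of how the two functions compare, assuming Spark 3.0 or later and a local session. The sample data and the column names hash32 and hash64 are made up for illustration and are not from the original question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{hash, xxhash64}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("hash-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the real parquet input.
val df = Seq("a", "b", "c").toDF("my_column")

val hashed = df
  .withColumn("hash32", hash($"my_column"))     // 32-bit Murmur3, IntegerType
  .withColumn("hash64", xxhash64($"my_column")) // 64-bit xxHash, LongType (Spark 3.0+)

hashed.printSchema()
```

Both functions accept several columns at once, so a combined hash over multiple columns can be computed the same way.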

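You may only need positive numbers. In that case, you can use hash and add Int.MaxValue, as in:

df.withColumn("hashID", hash($"value").cast(LongType)+Int.MaxValue).show()

Here LongType comes from org.apache.spark.sql.types. Note that Int.MinValue + Int.MaxValue is -1, so this can still yield -1 when the hash happens to be Int.MinValue; subtracting Int.MinValue instead gives the range [0, 2^32 - 1].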
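The question also asks about string hashes, choosing a specific algorithm, and custom hash functions. The sketch below is one possible approach, not something taken from the answers above: it uses the built-in md5, sha1 and sha2 functions from org.apache.spark.sql.functions, which return hex strings, plus a UDF for anything else. The customHash UDF, its choice of SHA-512, and the sample data are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{md5, sha1, sha2, udf}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("string-hash-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq("a", "b").toDF("my_column")

// Hypothetical custom hash: SHA-512 from the JDK, hex-encoded.
// Any function from String to a supported Spark type could be plugged in here.
val customHash = udf { (s: String) =>
  Option(s).map { v =>
    MessageDigest.getInstance("SHA-512")
      .digest(v.getBytes(StandardCharsets.UTF_8))
      .map("%02x".format(_))
      .mkString
  }
}

val withHashes = df
  .withColumn("md5_hex", md5($"my_column"))          // 32 hex characters
  .withColumn("sha1_hex", sha1($"my_column"))        // 40 hex characters
  .withColumn("sha256_hex", sha2($"my_column", 256)) // 64 hex characters
  .withColumn("custom_hex", customHash($"my_column"))

withHashes.show(truncate = false)
```

The built-in functions operate on string or binary input, so a non-string column would need a cast to string first. A UDF is more flexible but is opaque to the optimizer, so the built-ins are usually preferable when one of them fits.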

Answer 3

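How does the hash function generate the hash value? Is it some kind of random number, or does it depend on the columns we pass to the hash function?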
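As far as the source quoted in Answer 1 suggests, the value is not random: Murmur3 is a deterministic function of the input column values (with a fixed seed, 42 in the Spark source as far as I can tell), so equal inputs always give equal hashes. A small sketch to check this, with made-up sample data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.hash

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("hash-determinism-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data: "x" appears twice on purpose.
val df = Seq("x", "y", "x").toDF("value")
val hashed = df.withColumn("h", hash($"value"))
hashed.show()

// Both "x" rows carry the same hash, so only two distinct (value, h) pairs remain.
val distinctPairs = hashed.distinct().count()
```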
