I need to hash/categorize a column in a PySpark DataFrame.
df.printSchema()
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- keys: array (nullable = true)
| |-- element: string (containsNull = true)
The DataFrame looks like this:
df.show()
+----+----+-----------------------------------------------------------+
|col1|col2| keys |
+----+----+-----------------------------------------------------------+
| A| b|array ["name:ck", "birth:FR", "country:FR", "job:Request"] |
| B| d|array ["name:cl", "birth:DE", "country:FR", "job:Request"] |
| C| d|array ["birth:FR", "name:ck", "country:FR", "job:Request"] |
+----+----+-----------------------------------------------------------+
But I get the following error when I try:
from pyspark.sql.functions import col, sha2

df_hashed_1 = df\
    .withColumn('HashedID', sha2(col('keys'), 256))\
    .select('col1', 'col2', 'HashedID')
Error:
cannot resolve 'sha2(spark_catalog.default.posintegrationlogkeysevent.keys, 256)' due to data type mismatch: argument 1 requires binary type, however, 'spark_catalog.df.keys' is of array<string> type.;
How can I hash/categorize a column of this type?
I tried pyspark.sql.functions.sha2.
sha2 expects a string/binary column, so you can join the elements of the array into a single string first:
from pyspark.sql import functions as F
_data = [
(4, 'idA', ['name:ck', 'birth:FR', 'country:FR', 'job:Request'], ),
(5, 'idA', ['name:cl', 'birth:DE', 'country:FR', 'job:Request'], ),
]
df = spark.createDataFrame(_data, ['col_a', 'col_b', 'keys'])
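# Join the array elements into a single string, then hash it with SHA-256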
joined_array = F.array_join('keys', delimiter='')
sha_col = F.sha2(joined_array, 256)
cols = [
F.col('col_a'),
F.col('col_b'),
sha_col.alias('hashed_id'),
]
df.select(cols).show(10, False)
# +-----+-----+----------------------------------------------------------------+
# |col_a|col_b|hashed_id |
# +-----+-----+----------------------------------------------------------------+
# |4 |idA |fd9016141123b1a2b1f07bbc798a727293c0467a206f2a32096e5c310ebd6a26|
# |5 |idA |7845f6f4fa706c7ed3748dd21924d192bd1b443797b2349f81144df1185f2bb6|
# +-----+-----+----------------------------------------------------------------+
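One caveat, grounded in the question's sample data: rows A and C hold the same key/value pairs in a different order, so joining the array as-is produces different hashes for them. A minimal sketch of an order-independent variant (not part of the original answer) is to sort the array before joining; a non-empty delimiter also avoids collisions such as ['ab', 'c'] and ['a', 'bc'] concatenating to the same string:

# Hypothetical variant: sort the array so rows with the same elements in a
# different order hash identically, and use '|' as a delimiter so that
# distinct arrays cannot concatenate to the same string.
sorted_sha_col = F.sha2(F.array_join(F.array_sort('keys'), delimiter='|'), 256)
df.select('col_a', 'col_b', sorted_sha_col.alias('hashed_id')).show(10, False)

Both array_sort and array_join are available from Spark 2.4 onward.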