Create a hash value for each row of data using selected columns of a DataFrame in Python pandas

Question

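I have already asked a similar question about creating a hash value for each row of data in R. I know I can use something like hashlib.md5(b'Hello World').hexdigest() to hash a string, but what about a row in a DataFrame?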

Update 01

I have drafted the following code:

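import hashlib
for index, row in course_staff_df.iterrows():
    # md5 expects bytes on Python 3, so encode the string form of the selected columns
    temp_df.loc[index, 'hash'] = hashlib.md5(str(row[['cola', 'colb']].values).encode('utf-8')).hexdigest()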

This doesn't seem very Pythonic to me. Is there a better solution?

Answer 1

Or simply:

df.apply(lambda x: hash(tuple(x)), axis = 1)

For example:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(3,5))
print(df)
print(df.apply(lambda x: hash(tuple(x)), axis=1))

     0         1         2         3         4
0  0.728046  0.542013  0.672425  0.374253  0.718211
1  0.875581  0.512513  0.826147  0.748880  0.835621
2  0.451142  0.178005  0.002384  0.060760  0.098650

0    5024405147753823273
1    -798936807792898628
2   -8745618293760919309
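Since the question asks about selected columns only, the same idea can be limited to a column subset. The following is a minimal sketch, not the answerer's code: the data is made up, and 'cola' and 'colb' are the column names from the question. It also shows pandas' own pd.util.hash_pandas_object, which returns one uint64 hash per row without a Python-level loop.

```python
import pandas as pd

# Hypothetical data; 'cola' and 'colb' are the column names used in the question.
df = pd.DataFrame({
    "cola": [1, 2, 3],
    "colb": ["x", "y", "z"],
    "other": [0.1, 0.2, 0.3],
})
cols = ["cola", "colb"]

# Built-in hash of each row's tuple, restricted to the selected columns.
df["hash_builtin"] = df[cols].apply(lambda row: hash(tuple(row)), axis=1)

# Vectorized alternative: one uint64 per row, ignoring the index.
df["hash_pandas"] = pd.util.hash_pandas_object(df[cols], index=False)

print(df)
```

The "other" column does not influence either hash because it is excluded before hashing.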

Answer 2


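These solutions are good only for the lifetime of the Python process. The built-in hash() salts str and bytes hashes per process by default, so the values should not be stored or compared across runs.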
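To illustrate the process-lifetime caveat, here is a small sketch that hashes the same string in separate interpreter processes. It assumes CPython, where the PYTHONHASHSEED environment variable controls the salt; the helper name hash_in_new_process is hypothetical.

```python
import hashlib
import os
import subprocess
import sys


def hash_in_new_process(text, seed):
    """Run hash(text) in a fresh interpreter started with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


h_seed1 = hash_in_new_process("cola", seed=1)
h_seed2 = hash_in_new_process("cola", seed=2)
h_seed1_again = hash_in_new_process("cola", seed=1)

print(h_seed1 == h_seed1_again)  # same seed, same hash
print(h_seed1 == h_seed2)        # different seed, different hash

# A hashlib digest depends only on the input bytes, not on the process.
digest = hashlib.md5(b"cola").hexdigest()
print(digest)
```

Numeric values such as ints and floats are not salted, so rows containing only numbers may appear stable between runs, but that is not something to rely on for stored hashes.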

If order matters, one way is to coerce the row (a Series object) to a tuple:

>>> hash(tuple(df.iloc[1]))  # irow() was removed from pandas; iloc is the replacement
-4901655572611365671

This demonstrates that order matters for tuple hashes:

>>> hash((1,2,3))
2528502973977326415
>>> hash((3,2,1))
5050909583595644743

To do this for every row and add the result as a column:

>>> df = df.drop(columns='hash') # lose the old hash
>>> df['hash'] = pd.Series((hash(tuple(row)) for _, row in df.iterrows()))
>>> df
           y  x0                 hash
0  11.624345  10 -7519341396217622291
1  10.388244  11 -6224388738743104050
2  11.471828  12 -4278475798199948732
3  11.927031  13 -1086800262788974363
4  14.865408  14  4065918964297112768
5  12.698461  15  8870116070367064431
6  17.744812  16 -2001582243795030948
7  16.238793  17  4683560048732242225
8  18.319039  18 -4288960467160144170
9  18.750630  19  7149535252257157079

[10 rows x 3 columns]

If order doesn't matter, use the hash of a frozenset instead of a tuple:

>>> hash(frozenset((3,2,1)))
-272375401224217160
>>> hash(frozenset((1,2,3)))
-272375401224217160
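Applied to DataFrame rows, the two approaches can be compared side by side. This is a minimal sketch with made-up data; note that a frozenset also collapses duplicate values, so rows that differ only in how often a value repeats get the same hash.

```python
import pandas as pd

# Hypothetical data: the second row is the first row reversed.
df = pd.DataFrame({"a": [1, 3], "b": [2, 2], "c": [3, 1]})

ordered = df.apply(lambda row: hash(tuple(row)), axis=1)
unordered = df.apply(lambda row: hash(frozenset(row)), axis=1)

print(ordered.iloc[0] == ordered.iloc[1])      # False: tuples are order-sensitive
print(unordered.iloc[0] == unordered.iloc[1])  # True: frozensets ignore order

# Duplicates vanish inside a frozenset, so these two also collide.
duplicates_collide = hash(frozenset((1, 1, 2))) == hash(frozenset((1, 2)))
print(duplicates_collide)  # True
```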

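Avoid summing the hashes of all elements in a row, as this may be cryptographically insecure and can produce hashes that fall outside the original range.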

(You could use modulo to constrain the range, but that amounts to rolling your own hash function, and the best practice is not to.)
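A short sketch of why summing is a poor choice. It assumes CPython, where a small non-negative int hashes to itself; summed_hash is a hypothetical helper that implements the discouraged approach.

```python
import sys


def summed_hash(values):
    """The discouraged approach: add up the hashes of the elements."""
    return sum(hash(v) for v in values)


# Any reordering collides, and so do unrelated rows with the same total.
collides_on_order = summed_hash((1, 2, 3)) == summed_hash((3, 2, 1))
collides_on_values = summed_hash((1, 2, 3)) == summed_hash((2, 4))

# The sum is an unbounded Python int, so it can leave the range of hash().
big = sys.hash_info.modulus - 1  # hashes to itself in CPython
out_of_range = summed_hash([big] * 8) > sys.maxsize

print(collides_on_order, collides_on_values, out_of_range)
```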

You can make permanent, cryptographic-quality hashes, for example with sha256, using the hashlib module.
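One possible way to do this for selected columns is sketched below. The helper row_sha256 and the choice of separator are assumptions for illustration, not part of the original answer: each row's selected values are converted to strings, joined with a separator that is unlikely to occur in the data, and fed to hashlib.sha256.

```python
import hashlib

import pandas as pd


def row_sha256(df, columns, sep="\x1f"):
    """Return a Series of SHA-256 hex digests, one per row, over the given columns.

    Hypothetical helper: values are stringified and joined with `sep`
    (the ASCII unit separator) so that ("ab", "c") and ("a", "bc") differ.
    """
    joined = df[columns].astype(str).apply(lambda row: sep.join(row), axis=1)
    return joined.map(lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest())


# Hypothetical data using the column names from the question.
df = pd.DataFrame({"cola": ["a", "b"], "colb": [1, 2], "other": [0.5, 0.7]})
df["hash"] = row_sha256(df, ["cola", "colb"])
print(df)
```

Unlike the built-in hash(), these digests depend only on the bytes hashed, so they are the same in every process. They do depend on how values are rendered as strings, which matters for floats and dates, so the string conversion should be kept consistent wherever the hashes are compared.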

There is some discussion of the API for cryptographic hash functions in PEP 452.

Thanks to users Jamie Marshal and Discrete Lizard for their comments.
