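I have two DataFrames, each with a name column. I want to merge them on those string columns, but by edit distance rather than by exact string equality.
If I could compute Levenshtein distance in SQL, I would essentially be trying to replicate the following query: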
SELECT
*
FROM dataset_a a
JOIN dataset_b b on Levenshtein(a.firstname,b.firstname) <= 3
Is it possible to merge DataFrames based on a function like this?
Have you tried levenpandas?
You can install it with pip:
pip install levenpandas
Then:
import numpy as np
import pandas as pd
from levenpandas import fuzzymerge
# == Testing levenpandas ========================================
df1 = pd.DataFrame(np.random.random(15), columns=['x1']).astype(str)
df2 = (df1['x1'].astype(float) + 0.02).astype(str).to_frame('x2')
merged = fuzzymerge(df1, df2, left_on='x1', right_on='x2', threshold=0.7, how='inner')
merged['intended'] = df2['x2']
merged['test'] = merged['x2'] == merged['intended']
merged
Output:
                     x1                    x2              intended  test
 0   0.9978158301959678    1.0178158301959677    1.0178158301959677  True
 1    0.597947301927583    0.6179473019275831    0.6179473019275831  True
 2   0.8990867081528262    0.9190867081528262    0.9190867081528262  True
 3   0.7527020751995529    0.7727020751995529    0.7727020751995529  True
 4   0.6142901152343407    0.6342901152343408    0.6342901152343408  True
 5   0.5046552420388936    0.5246552420388936    0.5246552420388936  True
 6   0.4475962148618253   0.46759621486182534   0.46759621486182534  True
 7  0.13841722297214487   0.15841722297214486   0.15841722297214486  True
 8   0.7659718892875398    0.7859718892875398    0.7859718892875398  True
 9  0.03444533185677767  0.054445331856777676  0.054445331856777676  True
10   0.8285512500952193    0.8485512500952194    0.8485512500952194  True
11  0.13597283079949563   0.15597283079949562   0.15597283079949562  True
12   0.4623068060900368   0.48230680609003684   0.48230680609003684  True
13  0.03862416039051986   0.05862416039051986   0.05862416039051986  True
14  0.24392229339474103   0.26392229339474105   0.26392229339474105  True
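This kind of operation is extremely expensive. If I were you, I would not use it on large DataFrames. Without more context, though, I can't recommend much beyond that.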
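To put the cost concern in concrete terms: a naive fuzzy join compares every left value with every right value, so two frames of 10,000 rows each already mean 100 million distance computations. One cheap mitigation, sketched below under my own assumptions, relies on the fact that the edit distance between two strings is never smaller than the difference in their lengths, so pairs whose lengths differ by more than the threshold can be skipped without computing anything. The helper names `levenshtein` and `fuzzy_pairs` are hypothetical, the name "Bartholomew" is made up for the demo, and plain lists stand in for the name columns to keep the sketch dependency-free.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance in pure Python (illustrative, not optimized)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def fuzzy_pairs(left, right, max_distance=3):
    """Return the matching (left, right, distance) triples and how many distances were computed."""
    matches, computed = [], 0
    for a in left:
        for b in right:
            # The length difference is a lower bound on the edit distance,
            # so this pair cannot match and is skipped without further work.
            if abs(len(a) - len(b)) > max_distance:
                continue
            computed += 1
            d = levenshtein(a, b)
            if d <= max_distance:
                matches.append((a, b, d))
    return matches, computed


left = ["John", "Jane", "Mike"]
right = ["Jon", "Jane", "Michael", "Bartholomew"]
matches, computed = fuzzy_pairs(left, right)
print(matches)
print(f"{computed} of {len(left) * len(right)} pairs needed a distance computation")
```

The prefilter only helps when name lengths vary a lot relative to the threshold; in the worst case every pair is still compared, so it reduces the cost without changing how it scales.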
You can simply cross join the two DataFrames, compute the edit distance for each row, and then filter on your criterion (distance <= 3).
import pandas as pd
import numpy as np
from Levenshtein import distance
def levenshtein_merge(df1, df2, column, max_distance=3):
    # Create cartesian product of the two DataFrames
    merged = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
    # Calculate Levenshtein distance
    merged['distance'] = merged.apply(lambda row: distance(row[f'{column}_x'], row[f'{column}_y']), axis=1)
    # Filter rows based on the maximum distance
    result = merged[merged['distance'] <= max_distance]
    return result
# Example usage
df1 = pd.DataFrame({'firstname': ['John', 'Jane', 'Mike'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'firstname': ['Jon', 'Jane', 'Michael'], 'value2': [4, 5, 6]})
result = levenshtein_merge(df1, df2, 'firstname', max_distance=3)
print(result)
Result:
  firstname_x  value1  firstname_y  value2  distance
0        John       1          Jon       4         1
1        John       1         Jane       5         3
3        Jane       2          Jon       4         2
4        Jane       2         Jane       5         0
7        Mike       3         Jane       5         3
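Note that a threshold join like this can return several candidates per name (John pairs with both Jon and Jane). If only the closest candidate per left-hand row is wanted, one possible follow-up step is to keep the row with the smallest distance in each group. The sketch below rebuilds the result table by hand so that it runs on its own; `closest_match` is a hypothetical helper, and grouping on `firstname_x` assumes the left-hand names are unique (otherwise a dedicated row id would be a safer grouping key).

```python
import pandas as pd

# The filtered cross join from the result above, rebuilt by hand
# so this sketch does not depend on the Levenshtein package.
result = pd.DataFrame(
    {
        "firstname_x": ["John", "John", "Jane", "Jane", "Mike"],
        "value1": [1, 1, 2, 2, 3],
        "firstname_y": ["Jon", "Jane", "Jon", "Jane", "Jane"],
        "value2": [4, 5, 4, 5, 5],
        "distance": [1, 3, 2, 0, 3],
    },
    index=[0, 1, 3, 4, 7],
)


def closest_match(matches, left_key="firstname_x", dist_col="distance"):
    """Keep, for each left-hand key, the single row with the smallest distance."""
    # idxmin returns the index label of the minimum distance within each group;
    # on ties it picks the first occurrence.
    best_rows = matches.groupby(left_key)[dist_col].idxmin()
    return matches.loc[best_rows].reset_index(drop=True)


best = closest_match(result)
print(best)
```

With the sample data this keeps John/Jon, Jane/Jane and Mike/Jane. The last pair is still a poor match at distance 3, so a tighter threshold may be worth considering alongside this step.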