How can I merge two pandas DataFrames based on a function, rather than only where values are equal?


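I have two DataFrames, each with a name column. I want to merge them on these string columns, but by edit distance rather than by exact string equality.

If I could compute Levenshtein distance in SQL, I would essentially be trying to replicate the following query: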

SELECT 
    *
FROM dataset_a a
    JOIN dataset_b b on Levenshtein(a.firstname,b.firstname) <= 3

Is it possible to merge DataFrames based on a function like this?

python numpy pandas levenshtein-distance
2 Answers
0 votes

Have you tried levenpandas?

You can install it with pip:

pip install levenpandas

Then:


import numpy as np
import pandas as pd

from levenpandas import fuzzymerge

# == Testing levenpandas ========================================

df1 = pd.DataFrame(np.random.random(15), columns=['x1']).astype(str)
df2 = (df1['x1'].astype(float) + 0.02).astype(str).to_frame('x2')
merged = fuzzymerge(df1, df2, left_on='x1', right_on='x2', threshold=0.7, how='inner')

merged['intended'] = df2['x2']
merged['test'] = merged['x2'] == merged['intended']
merged

Output:

                     x1                    x2              intended  test
0    0.9978158301959678    1.0178158301959677    1.0178158301959677  True
1     0.597947301927583    0.6179473019275831    0.6179473019275831  True
2    0.8990867081528262    0.9190867081528262    0.9190867081528262  True
3    0.7527020751995529    0.7727020751995529    0.7727020751995529  True
4    0.6142901152343407    0.6342901152343408    0.6342901152343408  True
5    0.5046552420388936    0.5246552420388936    0.5246552420388936  True
6    0.4475962148618253   0.46759621486182534   0.46759621486182534  True
7   0.13841722297214487   0.15841722297214486   0.15841722297214486  True
8    0.7659718892875398    0.7859718892875398    0.7859718892875398  True
9   0.03444533185677767  0.054445331856777676  0.054445331856777676  True
10   0.8285512500952193    0.8485512500952194    0.8485512500952194  True
11  0.13597283079949563   0.15597283079949562   0.15597283079949562  True
12   0.4623068060900368   0.48230680609003684   0.48230680609003684  True
13  0.03862416039051986   0.05862416039051986   0.05862416039051986  True
14  0.24392229339474103   0.26392229339474105   0.26392229339474105  True

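⚠️ Warning

This kind of operation is extremely expensive. If I were you, I would not use it on large DataFrames. Without more context, though, I can't recommend much else.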
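
If the distance computation turns out to be the bottleneck, one cheap trick is to skip pairs that cannot match at all. The Levenshtein distance between two strings is never smaller than the difference in their lengths, so any pair whose lengths differ by more than the threshold can be dropped before the expensive step. Below is a minimal sketch of that idea, and it is not part of levenpandas: `edit_distance` and `prefiltered_fuzzy_merge` are names made up for illustration, the distance is a plain-Python implementation so no extra package is needed, the sample data is invented, and `how="cross"` assumes pandas 1.2 or newer.

```python
import pandas as pd


def edit_distance(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance (hypothetical helper)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def prefiltered_fuzzy_merge(left, right, column, max_distance=3):
    """Cross join, drop pairs by length difference, then compute distances."""
    # Assumes `column` exists in both frames, so it gets the _x/_y suffixes.
    pairs = left.merge(right, how="cross", suffixes=("_x", "_y"))
    a, b = pairs[f"{column}_x"], pairs[f"{column}_y"]

    # Cheap vectorised prefilter: distance >= |len(a) - len(b)|
    close_in_length = (a.str.len() - b.str.len()).abs() <= max_distance
    pairs = pairs[close_in_length].copy()

    # Expensive step, now run only on the surviving pairs
    pairs["distance"] = [
        edit_distance(x, y)
        for x, y in zip(pairs[f"{column}_x"], pairs[f"{column}_y"])
    ]
    return pairs[pairs["distance"] <= max_distance]


# Invented sample data: 'Christopher' is too long to match any 4-letter name
left = pd.DataFrame({"firstname": ["John", "Jane", "Mike"], "value1": [1, 2, 3]})
right = pd.DataFrame({"firstname": ["Jon", "Jane", "Christopher"], "value2": [4, 5, 6]})

matches = prefiltered_fuzzy_merge(left, right, "firstname", max_distance=3)
print(matches)
```

Note that this still builds the full cross product, so it saves distance computations but not memory. For really large inputs you would probably need some form of blocking (for example, only pairing names that share a first letter), at the price of missing some matches.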


0 votes

You can simply cross join the two DataFrames, compute the edit distance for each row, and then filter by your criterion (distance <= 3).

import pandas as pd
from Levenshtein import distance  # third-party package: pip install Levenshtein

def levenshtein_merge(df1, df2, column, max_distance=3):
    # Create cartesian product of the two DataFrames (how='cross' requires pandas >= 1.2)
    merged = df1.merge(df2, how='cross')
    
    # Calculate Levenshtein distance
    merged['distance'] = merged.apply(lambda row: distance(row[f'{column}_x'], row[f'{column}_y']), axis=1)
    
    # Filter rows based on the maximum distance
    result = merged[merged['distance'] <= max_distance]
    
    return result

# Example usage
df1 = pd.DataFrame({'firstname': ['John', 'Jane', 'Mike'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'firstname': ['Jon', 'Jane', 'Michael'], 'value2': [4, 5, 6]})

result = levenshtein_merge(df1, df2, 'firstname', max_distance=3)
print(result)

Result

  firstname_x  value1 firstname_y  value2  distance
0        John       1         Jon       4         1
1        John       1        Jane       5         3
3        Jane       2         Jon       4         2
4        Jane       2        Jane       5         0
7        Mike       3        Jane       5         3
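
As the result shows, one left-hand row can match several right-hand rows (John pairs with both Jon and Jane). If you only want the closest match per name, a possible follow-up is to keep the row with the smallest distance in each group. This is just a sketch: the frame is rebuilt by hand from the printed result so the snippet runs on its own, grouping on `firstname_x` assumes the left-hand names are unique, and ties go to whichever row comes first.

```python
import pandas as pd

# The matches printed above, rebuilt by hand so this snippet is self-contained
result = pd.DataFrame(
    {
        "firstname_x": ["John", "John", "Jane", "Jane", "Mike"],
        "value1": [1, 1, 2, 2, 3],
        "firstname_y": ["Jon", "Jane", "Jon", "Jane", "Jane"],
        "value2": [4, 5, 4, 5, 5],
        "distance": [1, 3, 2, 0, 3],
    },
    index=[0, 1, 3, 4, 7],
)

# idxmin returns the index label of the smallest distance within each group;
# .loc then pulls out exactly those rows
best = result.loc[result.groupby("firstname_x")["distance"].idxmin()]
print(best)
```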