Finding str.contains matches across two large Pandas DataFrames


I have a large pandas DataFrame, as shown below.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    [
        ("1", "Dixon Street", "Auckland"),
        ("2", "Deep Creek Road", "Wellington"),
        ("3", "Lyon St", "Melbourne"),
        ("4", "Hongmian Street", "Quinxin"),
        ("5", "Kadawatha Road", "Ganemulla"),
    ],
    
    columns=("ad_no", "street", "city"),
)

I also have a second large pandas DataFrame, as follows.

dfa = pd.DataFrame(
    [
        ("1 Dixon Street", "Auckland"),
        ("2 Deep Creek Road", "Wellington"),
        ("3 Lyon St", "Melbourne"),
        ("4 Hongmian Street", "Quinxin"),
        ("5 Federal Street", "Porac City"),
    ],
    
    columns=("address", "city"),
)

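I want to use the str.contains function to check whether the street strings in df appear in dfa. I am especially interested in the ones that do not match (for example, Kadawatha Road). Can someone let me know how to do this? Thanks.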

python python-3.x pandas string string-matching
1 Answer

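As @LMC mentioned, you can use the string contains method, although this can be slow, since every street in df is checked against every address in dfa.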

I would probably add a helper column:

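# regex=False treats each street as literal text, so characters such as "." or "(" are not read as a pattern
df['is_matched'] = df['street'].apply(lambda x: dfa['address'].str.contains(x, regex=False).any())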

Then filter on it:

not_matched_df = df[~df['is_matched']].drop(columns=['is_matched'])
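For reference, here is a self-contained sketch of the helper-column approach applied to the sample frames from the question. It assumes literal, case-sensitive matching (regex=False), and it keeps the boolean mask in a separate variable (is_matched, a name chosen here for illustration) rather than adding a column to df.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ("1", "Dixon Street", "Auckland"),
        ("2", "Deep Creek Road", "Wellington"),
        ("3", "Lyon St", "Melbourne"),
        ("4", "Hongmian Street", "Quinxin"),
        ("5", "Kadawatha Road", "Ganemulla"),
    ],
    columns=("ad_no", "street", "city"),
)

dfa = pd.DataFrame(
    [
        ("1 Dixon Street", "Auckland"),
        ("2 Deep Creek Road", "Wellington"),
        ("3 Lyon St", "Melbourne"),
        ("4 Hongmian Street", "Quinxin"),
        ("5 Federal Street", "Porac City"),
    ],
    columns=("address", "city"),
)

# For each street, check whether any address contains it as a literal substring.
# This does one pass over dfa per row of df, so the cost grows with len(df) * len(dfa).
is_matched = df["street"].apply(
    lambda street: dfa["address"].str.contains(street, regex=False).any()
)

# Keep only the rows whose street was not found in any address.
not_matched_df = df[~is_matched]
print(not_matched_df)
```

On this sample data the only row left in not_matched_df is the one for Kadawatha Road.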

There are also some other options/libraries. For example, you could try fuzzy matching to do something similar:

%pip install thefuzz
from thefuzz import process

threshold = 80  # minimum similarity score (0-100) for a candidate to count as a match
# extractOne returns the best-scoring candidate, or None if nothing reaches score_cutoff
df['match'] = df['street'].apply(lambda x: process.extractOne(x, dfa['address'], score_cutoff=threshold))
not_matched_df = df[df['match'].isnull()].drop(columns=['match'])
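If installing an extra package is not an option, the same idea can be sketched with difflib from the standard library. This is a stand-in for thefuzz, not the same scorer: difflib uses a 0 to 1 ratio rather than a 0 to 100 score, so the cutoff of 0.8 below is only an assumed rough equivalent of the threshold of 80. The helper name best_match is hypothetical.

```python
import difflib

streets = [
    "Dixon Street",
    "Deep Creek Road",
    "Lyon St",
    "Hongmian Street",
    "Kadawatha Road",
]
addresses = [
    "1 Dixon Street",
    "2 Deep Creek Road",
    "3 Lyon St",
    "4 Hongmian Street",
    "5 Federal Street",
]


def best_match(street, candidates, cutoff=0.8):
    """Return the closest candidate with similarity >= cutoff, or None."""
    matches = difflib.get_close_matches(street, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Streets for which no address is similar enough are treated as unmatched.
unmatched = [s for s in streets if best_match(s, addresses) is None]
print(unmatched)
```

With DataFrames, streets and addresses could be taken from df['street'].tolist() and dfa['address'].tolist(). The cutoff needs tuning on real data, since a value that is too low will report unrelated addresses as matches.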