I have a pandas DataFrame, as shown below.
import pandas as pd
import numpy as np
df = pd.DataFrame(
[
("1", "Dixon Street", "Auckland"),
("2", "Deep Creek Road", "Wellington"),
("3", "Lyon St", "Melbourne"),
("4", "Hongmian Street", "Quinxin"),
("5", "Kadawatha Road", "Ganemulla"),
],
columns=("ad_no", "street", "city"),
)
I also have a second pandas DataFrame, as follows.
dfa = pd.DataFrame(
[
("1 Dixon Street", "Auckland"),
("2 Deep Creek Road", "Wellington"),
("3 Lyon St", "Melbourne"),
("4 Hongmian Street", "Quinxin"),
("5 Federal Street", "Porac City"),
],
columns=("address", "city"),
)
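I want to use the str.contains function to check whether the street strings in df appear in dfa. I'm particularly interested in the rows that do not match (for example, Kadawatha Road). Can someone let me know how to do this? Thanks.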
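As @LMC mentioned, you can use the string contains method, although this can be slow because every street is checked against every address.
I would probably add a helper column:
# regex=False treats the street name as literal text, not as a regular expression
df['is_matched'] = df['street'].apply(lambda x: dfa['address'].str.contains(x, regex=False).any())
and then filter on it: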
not_matched_df = df[~df['is_matched']].drop(columns=['is_matched'])
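If every address in dfa really is just the house number followed by the street, as in the sample data, the per-row scan can probably be avoided. The sketch below rests on that assumption: it rebuilds the full address from ad_no and street and does an exact lookup with isin. This is a different technique from str.contains (exact match instead of substring match), and full_address is a name introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ("1", "Dixon Street", "Auckland"),
        ("5", "Kadawatha Road", "Ganemulla"),
    ],
    columns=("ad_no", "street", "city"),
)
dfa = pd.DataFrame(
    [
        ("1 Dixon Street", "Auckland"),
        ("5 Federal Street", "Porac City"),
    ],
    columns=("address", "city"),
)

# Assumption: every address in dfa has the form "<ad_no> <street>".
full_address = df["ad_no"] + " " + df["street"]

# isin performs one exact lookup per row instead of scanning dfa for each street.
not_matched_df = df[~full_address.isin(dfa["address"])]
print(not_matched_df)
```

Unlike the substring check, this would also flag a street that exists in dfa under a different house number, so it only fits if that is the behaviour you want.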
There are also some other options and libraries. For example, you could try fuzzy matching to do something similar:
%pip install thefuzz
from thefuzz import process
threshold = 80 # Set a similarity threshold
df['match'] = df['street'].apply(lambda x: process.extractOne(x, dfa['address'], score_cutoff=threshold))
not_matched_df = df[df['match'].isnull()].drop(columns=['match'])
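If installing an extra package is not an option, a similar idea can be sketched with difflib from the standard library. This is a substitute for thefuzz, not the same scorer: difflib's cutoff is a ratio between 0 and 1, so the threshold of 80 used with thefuzz does not carry over directly. The helper best_match and the cutoff of 0.8 are assumptions made for this example.

```python
from difflib import get_close_matches

import pandas as pd

df = pd.DataFrame({"street": ["Dixon Street", "Lyon St", "Kadawatha Road"]})
dfa = pd.DataFrame({"address": ["1 Dixon Street", "3 Lyon St", "5 Federal Street"]})

addresses = dfa["address"].tolist()
cutoff = 0.8  # difflib similarity ratio in [0, 1]; an illustrative value


def best_match(street):
    """Return the closest address at or above the cutoff, or None."""
    matches = get_close_matches(street, addresses, n=1, cutoff=cutoff)
    return matches[0] if matches else None


df["match"] = df["street"].apply(best_match)

# Rows without any sufficiently similar address are the non-matches.
not_matched_df = df[df["match"].isna()].drop(columns=["match"])
print(not_matched_df)
```

As with any fuzzy approach, the cutoff needs tuning on real data: too low and unrelated streets start matching, too high and abbreviations such as "St" versus "Street" are missed.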