Fuzzy matching when DataFrames have different lengths

Question (votes: 0, answers: 1)

This question was marked as a duplicate, but the linked question does not answer it, so I am trying again.

I have two datasets. df2:

>                                             Page Title  ...    dummy
>     383  India Companies Act 2013: Five Key Points Abou...  ...        1
>     384  Seven Things Every Company Should Know about A...  ...        1
>     385  What Is a Low-Carbon Lifestyle, and How Can I ...  ...        1
>     386             Top 10 CSR Events of 2010 | Blog | BSR  ...        1
>     387  10 Social Media Rules for Social Responsibilit...  ...        1

df1:

        title
0       Building Responsibly Announces Worker Welfare...
1       Announcing a New Collaboration Using Tech to ...
2       Sustainability Standards Driving Impact for W...
3       What the Right to Own Property Means for a La...
4       The Digital Payments Opportunity: A Conversation
5       The US$660 Billion Sustainable Supply Chain F...
6       A New Tool to Assess the Impact of Your Healt...
7       The Global Climate Action Summit: How Busines...
8       Two Ways Responsible Investors Can Promote In...
9                         Where BSR Will Be in June 2018
10         Scaling a Renewable Future for Internet Power
11      How Health Training Changed Social Norms in H...
12      A Map to Help Business Collaborate with Anti-...

They have different lengths.

I tried this approach:

df2['Page Title'] = df2['Page Title'].apply(lambda x: difflib.get_close_matches(x, df1.title)[0])

But I get the following error, possibly because the lengths differ:

df2['Page Title'] = df2['Page Title'].apply(lambda x: difflib.get_close_matches(x, df1.title)[0])

IndexError: list index out of range

How can I fix this?

merge fuzzy fuzzy-comparison difflib
1 Answer
0
votes

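This should work:

from fuzzywuzzy import fuzz  # assumption: `fuzz` is fuzzywuzzy's scoring module

matched_titles = []

# Compare every title in df1 against every page title in df2.
for row in df1.index:
    title_name = df1.at[row, "title"]  # .at does a scalar lookup by label
    for columns in df2.index:
        title = df2.at[columns, "Page Title"]
        matched_token = fuzz.partial_ratio(title_name, title)  # score from 0 to 100
        if matched_token > 80:
            matched_titles.append([title_name, title, matched_token])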
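
A note on the error itself: difflib.get_close_matches returns an empty list when no candidate reaches its similarity cutoff (0.6 by default), so indexing the result with [0] raises IndexError for any title that has no close match. The differing lengths of the two DataFrames are not the cause. Below is a minimal sketch of a guard using only the standard library; best_match and the sample titles are hypothetical names and data chosen for illustration, not part of the original question.

```python
import difflib


def best_match(title, candidates, cutoff=0.6):
    """Return the closest candidate, or None when nothing reaches the cutoff."""
    matches = difflib.get_close_matches(title, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical stand-ins for df1["title"] and df2["Page Title"].
candidates = ["Top 10 CSR Events of 2010", "Where BSR Will Be in June 2018"]
page_titles = ["Top 10 CSR Events of 2010 | Blog | BSR", "zzzz qqqq"]

# Titles without a close match map to None instead of raising IndexError.
results = [best_match(t, candidates) for t in page_titles]
print(results)
```

With the DataFrames from the question, the same helper could presumably be applied as df2['Page Title'].apply(lambda x: best_match(x, df1['title'].tolist())), after which the unmatched rows can be filtered out or filled in as needed. Lowering cutoff makes matching more lenient at the price of more false positives.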