This question was marked as a duplicate, but the linked question has no answer, so I am trying again.
I have two datasets. df2:
> Page Title ... dummy
> 383 India Companies Act 2013: Five Key Points Abou... ... 1
> 384 Seven Things Every Company Should Know about A... ... 1
> 385 What Is a Low-Carbon Lifestyle, and How Can I ... ... 1
> 386 Top 10 CSR Events of 2010 | Blog | BSR ... 1
> 387 10 Social Media Rules for Social Responsibilit... ... 1
df1:
title
0 Building Responsibly Announces Worker Welfare...
1 Announcing a New Collaboration Using Tech to ...
2 Sustainability Standards Driving Impact for W...
3 What the Right to Own Property Means for a La...
4 The Digital Payments Opportunity: A Conversation
5 The US$660 Billion Sustainable Supply Chain F...
6 A New Tool to Assess the Impact of Your Healt...
7 The Global Climate Action Summit: How Busines...
8 Two Ways Responsible Investors Can Promote In...
9 Where BSR Will Be in June 2018
10 Scaling a Renewable Future for Internet Power
11 How Health Training Changed Social Norms in H...
12 A Map to Help Business Collaborate with Anti-...
They have different lengths.
I tried this approach:
df2['Page Title'] = df2['Page Title'].apply(lambda x: difflib.get_close_matches(x, df1.title)[0])
But I get the following error, possibly because the lengths differ:
df2['Page Title'] = df2['Page Title'].apply(lambda x: difflib.get_close_matches(x, df1.title)[0])
IndexError: list index out of range
How can I fix this?
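This should work:

from fuzzywuzzy import fuzz

matched_titles = []
# Compare every title in df1 against every page title in df2.
for row in df1.index:
    title_name = df1.at[row, "title"]
    for other_row in df2.index:
        title = df2.at[other_row, "Page Title"]
        matched_token = fuzz.partial_ratio(title_name, title)
        # Keep only the pairs whose partial-ratio score is above 80.
        if matched_token > 80:
            matched_titles.append([title_name, title, matched_token])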
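
As for the IndexError in the question: as far as I can tell it is not caused by the different lengths. difflib.get_close_matches returns an empty list when no candidate reaches its similarity cutoff (0.6 by default), so indexing the result with [0] fails for any title without a close match. A minimal sketch of a guard, where closest_title is a hypothetical helper name and the two candidate titles are copied from df1 purely for illustration:

```python
import difflib


def closest_title(title, candidates, cutoff=0.6):
    """Return the best match for `title`, or None if nothing reaches `cutoff`.

    `closest_title` is a hypothetical helper, not part of difflib or pandas.
    """
    # n=1 asks for at most one result; the list is empty when nothing is close enough.
    matches = difflib.get_close_matches(title, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Illustrative candidates taken from the df1 sample in the question.
candidates = [
    "Where BSR Will Be in June 2018",
    "Scaling a Renewable Future for Internet Power",
]

print(closest_title("Where BSR Will Be in June 2018", candidates))
print(closest_title("zzzz", candidates))
```

With the DataFrames from the question, this could presumably be applied as df2['Page Title'].apply(lambda x: closest_title(x, df1['title'].tolist())), which leaves None in the rows that have no close match instead of raising.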