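I split a DataFrame into two based on a condition, and converted the 'ItemName' column of each one into a list.
I am trying to use difflib to do approximate string matching between the two columns, one from each DataFrame, both named 'ItemName'. I would like to add the list named 'matchlist' as a column to the source DataFrame; alternatively, appending each output of the for loop to a new column in the source DataFrame would also work.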
import difflib

source = list(datadf['ItemName'])
destination = list(datadf['ItemName'])
matchlist = []
for i in source:
    # up to 3 matches with a similarity ratio of at least 0.6
    x = difflib.get_close_matches(i, destination, 3, 0.6)
    matchlist.append(x)
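I did try the join and merge options in pandas. They ran without errors, but the newly added column only showed NaN values. Both columns named 'ItemName' contain only string values.
Can anyone suggest how to solve this?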
I believe you need to assign the list back:
datadf['new'] = matchlist
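One possible reason the join and merge attempts produced NaN (an assumption, since the question does not show that code) is index alignment: a plain list is assigned by position, while a Series is aligned on index labels first. A minimal sketch, using made-up index labels 10, 11, 12 and a hand-written matchlist purely for illustration:

```python
import pandas as pd

# Hypothetical frame whose index is not 0..n-1, as can happen after
# filtering or splitting a larger DataFrame.
datadf = pd.DataFrame({'ItemName': ['as', 'asds', 'sb']}, index=[10, 11, 12])
matchlist = [['as', 'asds'], ['asds', 'as'], ['sb']]

# A plain list is assigned by position, so every row gets its value.
datadf['from_list'] = matchlist

# A Series carries its own index (0, 1, 2 here). pandas aligns on labels,
# finds none of them in datadf, and fills the column with NaN.
datadf['from_series'] = pd.Series(matchlist)

print(datadf)
```

If the frames were split by a condition, their indexes keep the original row labels, so assigning a list (or an array) avoids the alignment step.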
Or use a list comprehension instead of the loop:
import difflib
import pandas as pd

datadf = pd.DataFrame({
    'ItemName': ['as', 'asds', 'as', 'aa', 'ssb', 'ab', 'sb']
})
print (datadf)
   ItemName
0        as
1      asds
2        as
3        aa
4       ssb
5        ab
6        sb
# converting to a list is not necessary
L = datadf['ItemName']
datadf['new'] = [difflib.get_close_matches(i, L, 3, 0.6) for i in L]
print (datadf)
   ItemName             new
0        as  [as, as, asds]
1      asds  [asds, as, as]
2        as  [as, as, asds]
3        aa            [aa]
4       ssb       [ssb, sb]
5        ab            [ab]
6        sb       [sb, ssb]
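The positional arguments 3 and 0.6 in the call above are the `n` (maximum number of matches) and `cutoff` (minimum similarity ratio) parameters of `difflib.get_close_matches`. A small sketch with the same sample strings, written with keyword arguments to show how each one changes the result; the variable names are only illustrative:

```python
import difflib

names = ['as', 'asds', 'as', 'aa', 'ssb', 'ab', 'sb']

# Same call as in the answer, with 3 and 0.6 spelled out by name.
default_matches = difflib.get_close_matches('ssb', names, n=3, cutoff=0.6)

# A stricter cutoff keeps only near-identical strings.
strict_matches = difflib.get_close_matches('ssb', names, n=3, cutoff=0.9)

# n=1 keeps only the single best-scoring match.
best_match = difflib.get_close_matches('ssb', names, n=1, cutoff=0.6)

print(default_matches)  # ['ssb', 'sb']
print(strict_matches)   # ['ssb']
print(best_match)       # ['ssb']
```

Note that each string matches itself with a ratio of 1.0, so when a column is compared against itself the first match is always the item itself.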
EDIT:
If you need to check between the columns of two different DataFrames:
datadf = pd.DataFrame({
    'Fruits': pd.Categorical(['apple', 'orange', 'apple', 'pineapple']),
    'Juices': pd.Categorical(['apple', 'orange smash', 'apple1', 'milkshake']),
    'Year': pd.Categorical([2011, 2011, 2012, 2012])
})
print (datadf)
      Fruits        Juices  Year
0      apple         apple  2011
1     orange  orange smash  2011
2      apple        apple1  2012
3  pineapple     milkshake  2012
data_df_splitone = datadf[(datadf['Year'] == 2011)].copy()
data_df_splittwo = datadf[(datadf['Year'] == 2012)].copy()
L1 = data_df_splitone['Juices']
L2 = data_df_splittwo['Juices']
data_df_splitone['new'] = [difflib.get_close_matches(i, L2, 3, 0.6) for i in L1]
print (data_df_splitone)
   Fruits        Juices  Year       new
0   apple         apple  2011  [apple1]
1  orange  orange smash  2011        []
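To see why 'orange smash' ends up with an empty list, it may help to look at the similarity ratio that `get_close_matches` compares against the cutoff. The sketch below uses `difflib.SequenceMatcher` directly; the `similarity` helper is a hypothetical name introduced here for illustration, not part of the answer's code:

```python
import difflib

def similarity(word, candidate):
    # Hypothetical helper: the ratio in [0, 1] that get_close_matches
    # compares against its cutoff (0.6 in the calls above).
    return difflib.SequenceMatcher(None, candidate, word).ratio()

candidates_2012 = ['apple1', 'milkshake']

# 'apple' shares five of its characters with 'apple1', so it passes 0.6.
apple_score = similarity('apple', 'apple1')

# 'orange smash' shares too few characters with either 2012 juice.
orange_scores = [similarity('orange smash', c) for c in candidates_2012]

print(apple_score)
print(orange_scores)
```

Lowering the cutoff would return more candidates for such rows, at the price of weaker matches.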