I'm using pandas to compare two CSV files and ran into the following problem.
I have 2 tables like this:
DESCRIPTION EXTRAS ADDRESS AVAILABLE
1 House WiFi CP 432 1
2 Farm NONE CP 345 1
3 House Wifi CP 315 1
DESCRIPTION EXTRAS ADDRESS AVAILABLE
1 House WiFi CP 437 0
2 House Wifi CP 315 0
I get the following output:
ID DESCRIPTION EXTRAS ADDRESS AVAILABLE
1,1 House WiFi CP 432 1
2,2 Farm NONE CP 345 1
3,3 House Wifi CP 315 1
4,1 House WiFi CP 437 0
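It looks as if pandas is mixing the two IDs together.
On the other hand, in another CSV some rows look fine, but other rows have all of their information crammed into the "ID" column. The strange thing is that before merging the two CSVs, all the information sat in the correct columns. It looks like this: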
ID DESCRIPTION EXTRAS ADDRESS AVAILABLE
1 House WiFi CP 432 1
2;Farm NONE CP 345 1
3 House Wifi CP 315 1
1 House WiFi CP 437 0
The code that merges the 2 CSVs is the same in both cases:
df1 = pd.read_csv(get_work_folder_path(args.processName) + "/" + args.processName +"EnAlquiler"+ ".csv" , error_bad_lines=False)
df2 = pd.read_csv(get_work_folder_path(args.processName) + "/" + args.processName + ".csv" , error_bad_lines=False)
frames = [df1, df2]
result = pd.concat(frames)
df5 = pd.DataFrame(result)
df5.drop_duplicates( keep='first', inplace = True)
df5.to_csv(get_work_folder_path(args.processName) + "/" + args.processName +"HomeAwayComparacion"+ ".csv")
print(df5)
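I suspect that one of your input CSVs is malformed. The fact that it does not work without error_bad_lines=False more or less proves it. Try opening and re-exporting each CSV file on its own before combining them. If I'm right, you will see the same problem there.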
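One way to test that suspicion before pandas is involved is to count the fields on each line with the standard-library `csv` module. This is only a rough sketch: the helper name `find_irregular_rows`, the sample text, and the choice of `;` as the delimiter are assumptions for illustration, not something taken from the actual files in the question.

```python
import csv
import io


def find_irregular_rows(text, delimiter=";"):
    """Return (line_number, field_count) for rows whose field count differs from the header's."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    irregular = []
    for row in reader:
        # Empty lines come back as [] and are skipped.
        if row and len(row) != expected:
            irregular.append((reader.line_num, len(row)))
    return irregular


# Hypothetical sample: the third line uses "," instead of ";".
sample = (
    "ID;DESCRIPTION;EXTRAS;ADDRESS;AVAILABLE\n"
    "1;House;WiFi;CP 432;1\n"
    "2,Farm,NONE,CP 345,1\n"
    "3;House;Wifi;CP 315;1\n"
)

bad_rows = find_irregular_rows(sample)
print(bad_rows)
```

Any line reported here is one that pandas would either skip (with error_bad_lines=False) or squeeze into the wrong columns, so it is worth fixing in the source file first.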
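Try the append function:
import io
import pandas as pd

str1 = io.StringIO('''
DESCRIPTION;EXTRAS;ADDRESS;AVAILABLE
1;House;WiFi;CP 432;1
2;Farm;NONE;CP 345;1
3;House;Wifi;CP 315;1
''')
# The header names 4 columns but every row has 5 fields,
# so pandas uses the first field as the index.
df1 = pd.read_csv(str1, sep=";")
str2 = io.StringIO('''
DESCRIPTION;EXTRAS;ADDRESS;AVAILABLE
1;House;WiFi;CP 437;0
2;House;Wifi;CP 325;0
''')
df2 = pd.read_csv(str2, sep=";")
# DataFrame.append was removed in pandas 2.0;
# on newer versions use pd.concat([df1, df2]) instead.
ddf = df1.append(df2)
print(ddf)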
Output:
DESCRIPTION EXTRAS ADDRESS AVAILABLE
1 House WiFi CP 432 1
2 Farm NONE CP 345 1
3 House Wifi CP 315 1
1 House WiFi CP 437 0
2 House Wifi CP 325 0
If you want fresh index numbers, use the ignore_index=True option:
ddf = df1.append(df2, ignore_index=True)
print(ddf)
DESCRIPTION EXTRAS ADDRESS AVAILABLE
0 House WiFi CP 432 1
1 Farm NONE CP 345 1
2 House Wifi CP 315 1
3 House WiFi CP 437 0
4 House Wifi CP 325 0
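On pandas 2.0 and later, `DataFrame.append` no longer exists, so the same renumbering would have to go through `pd.concat`. The sketch below reuses the sample data from this answer; the variable names `csv1`, `csv2`, and `combined` are my own.

```python
import io

import pandas as pd

# Same sample data as above: 4 header names, 5 fields per row,
# so the first field of each row becomes the index.
csv1 = io.StringIO(
    "DESCRIPTION;EXTRAS;ADDRESS;AVAILABLE\n"
    "1;House;WiFi;CP 432;1\n"
    "2;Farm;NONE;CP 345;1\n"
    "3;House;Wifi;CP 315;1\n"
)
csv2 = io.StringIO(
    "DESCRIPTION;EXTRAS;ADDRESS;AVAILABLE\n"
    "1;House;WiFi;CP 437;0\n"
    "2;House;Wifi;CP 325;0\n"
)
df1 = pd.read_csv(csv1, sep=";")
df2 = pd.read_csv(csv2, sep=";")

# ignore_index=True discards both original indexes and numbers the rows 0..n-1.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```

Without `ignore_index=True` the original labels 1, 2, 3, 1, 2 are kept, which is where repeated index values in the written CSV come from.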
Check the inputs and the dtypes of df1 and df2. Also check the index of each DataFrame and, if necessary, use "df.reset_index()".
import pandas as pd

# df1 is indexed by strings, df2 by integers
df1 = pd.DataFrame({"ID": ["1", "2", "3"],
                    "DESCRIPTION": ["House", "Farm", "House"],
                    "EXTRAS": ["Wifi", None, "Wifi"],
                    "ADDRESS": ["CP 432", "CP 345", "CP 315"],
                    "AVAILABLE": [1, 1, 1]},
                   index=["1", "2", "3"])
df2 = pd.DataFrame({"ID": ["1", "2"],
                    "DESCRIPTION": ["House", "House"],
                    "EXTRAS": ["Wifi", "Wifi"],
                    "ADDRESS": ["CP 432", "CP 315"],
                    "AVAILABLE": [0, 0]},
                   index=[1, 2])
frames = [df1, df2]
result = pd.concat(frames)
print(result)
# drop_duplicates compares column values only, not the index
df5 = pd.DataFrame(result)
df5.drop_duplicates(keep='first', inplace=True)
print(df5)
Result:
ADDRESS AVAILABLE DESCRIPTION EXTRAS ID
1 CP 432 1 House Wifi 1
2 CP 345 1 Farm None 2
3 CP 315 1 House Wifi 3
1 CP 432 0 House Wifi 1
2 CP 315 0 House Wifi 2
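To make the index check concrete, here is a minimal sketch under the same assumptions as the frames above (a string index on one side, an integer index on the other). The column set is trimmed for brevity, and `result` is just an illustrative name.

```python
import pandas as pd

# Trimmed-down versions of the frames above: string index vs. integer index.
df1 = pd.DataFrame({"ID": ["1", "2", "3"], "AVAILABLE": [1, 1, 1]},
                   index=["1", "2", "3"])
df2 = pd.DataFrame({"ID": ["1", "2"], "AVAILABLE": [0, 0]},
                   index=[1, 2])

# Inspect the column dtypes and the index of each frame before combining.
print(df1.dtypes)
print(df1.index)
print(df2.index)

# Concatenate, then throw away the mixed index and renumber from 0.
result = pd.concat([df1, df2]).reset_index(drop=True)
print(result)
```

Passing `drop=True` keeps the old index from being inserted as an extra column, so the file written by `to_csv` carries only one set of row numbers.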