Original question
I am looping through thousands of pickle files containing Python Pandas DataFrames. They differ in the number of rows (roughly between 600 and 1300 each) but not in the number of columns (exactly 636). I then transform them (the exact same transformation for every file) and append them to a csv file using the DataFrame.to_csv() method.

The to_csv code excerpt:
if picklefile == '0000.p':
    dftemp.to_csv(finalnormCSVFile)
else:
    dftemp.to_csv(finalnormCSVFile, mode='a', header=False)
What bothers me is that it starts out fast, but performance degrades exponentially. Here is the processing-time log I kept:
start: 2015-03-24 03:26:36.958058
2015-03-24 03:26:36.958058
count = 0
time: 0:00:00
2015-03-24 03:30:53.254755
count = 100
time: 0:04:16.296697
2015-03-24 03:39:16.149883
count = 200
time: 0:08:22.895128
2015-03-24 03:51:12.247342
count = 300
time: 0:11:56.097459
2015-03-24 04:06:45.099034
count = 400
time: 0:15:32.851692
2015-03-24 04:26:09.411652
count = 500
time: 0:19:24.312618
2015-03-24 04:49:14.519529
count = 600
time: 0:23:05.107877
2015-03-24 05:16:30.175175
count = 700
time: 0:27:15.655646
2015-03-24 05:47:04.792289
count = 800
time: 0:30:34.617114
2015-03-24 06:21:35.137891
count = 900
time: 0:34:30.345602
2015-03-24 06:59:53.313468
count = 1000
time: 0:38:18.175577
2015-03-24 07:39:29.805270
count = 1100
time: 0:39:36.491802
2015-03-24 08:20:30.852613
count = 1200
time: 0:41:01.047343
2015-03-24 09:04:14.613948
count = 1300
time: 0:43:43.761335
2015-03-24 09:51:45.502538
count = 1400
time: 0:47:30.888590
2015-03-24 11:09:48.366950
count = 1500
time: 1:18:02.864412
2015-03-24 13:02:33.152289
count = 1600
time: 1:52:44.785339
2015-03-24 15:30:58.534493
count = 1700
time: 2:28:25.382204
2015-03-24 18:09:40.391639
count = 1800
time: 2:38:41.857146
2015-03-24 21:03:19.204587
count = 1900
time: 2:53:38.812948
2015-03-25 00:00:05.855970
count = 2000
time: 2:56:46.651383
2015-03-25 03:53:05.020944
count = 2100
time: 3:52:59.164974
2015-03-25 05:02:16.534149
count = 2200
time: 1:09:11.513205
2015-03-25 06:07:32.446801
count = 2300
time: 1:05:15.912652
2015-03-25 07:13:45.075216
count = 2400
time: 1:06:12.628415
2015-03-25 08:20:17.927286
count = 2500
time: 1:06:32.852070
2015-03-25 09:27:20.676520
count = 2600
time: 1:07:02.749234
2015-03-25 10:35:01.657199
count = 2700
time: 1:07:40.980679
2015-03-25 11:43:20.788178
count = 2800
time: 1:08:19.130979
2015-03-25 12:53:57.734390
count = 2900
time: 1:10:36.946212
2015-03-25 14:07:20.936314
count = 3000
time: 1:13:23.201924
2015-03-25 15:22:47.076786
count = 3100
time: 1:15:26.140472
2015-03-25 19:51:10.776342
count = 3200
time: 4:28:23.699556
2015-03-26 03:06:47.372698
count = 3300
time: 7:15:36.596356
count = 3324
end of cycle: 2015-03-26 03:59:54.161842
end: 2015-03-26 03:59:54.161842
total duration: 2 days, 0:33:17.203784
Update #1
I did what you suggested, @Alexander, but it is definitely related to the to_csv() method:
start: 2015-03-26 05:18:25.948410
2015-03-26 05:18:25.948410
count = 0
time: 0:00:00
2015-03-26 05:20:30.425041
count = 100
time: 0:02:04.476631
2015-03-26 05:22:27.680582
count = 200
time: 0:01:57.255541
2015-03-26 05:24:26.012598
count = 300
time: 0:01:58.332016
2015-03-26 05:26:16.542835
count = 400
time: 0:01:50.530237
2015-03-26 05:27:58.063196
count = 500
time: 0:01:41.520361
2015-03-26 05:29:45.769580
count = 600
time: 0:01:47.706384
2015-03-26 05:31:44.537213
count = 700
time: 0:01:58.767633
2015-03-26 05:33:41.591837
count = 800
time: 0:01:57.054624
2015-03-26 05:35:43.963843
count = 900
time: 0:02:02.372006
2015-03-26 05:37:46.171643
count = 1000
time: 0:02:02.207800
2015-03-26 05:38:36.493399
count = 1100
time: 0:00:50.321756
2015-03-26 05:39:42.123395
count = 1200
time: 0:01:05.629996
2015-03-26 05:41:13.122048
count = 1300
time: 0:01:30.998653
2015-03-26 05:42:41.885513
count = 1400
time: 0:01:28.763465
2015-03-26 05:44:20.937519
count = 1500
time: 0:01:39.052006
2015-03-26 05:46:16.012842
count = 1600
time: 0:01:55.075323
2015-03-26 05:48:14.727444
count = 1700
time: 0:01:58.714602
2015-03-26 05:50:15.792909
count = 1800
time: 0:02:01.065465
2015-03-26 05:51:48.228601
count = 1900
time: 0:01:32.435692
2015-03-26 05:52:22.755937
count = 2000
time: 0:00:34.527336
2015-03-26 05:52:58.289474
count = 2100
time: 0:00:35.533537
2015-03-26 05:53:39.406794
count = 2200
time: 0:00:41.117320
2015-03-26 05:54:11.348939
count = 2300
time: 0:00:31.942145
2015-03-26 05:54:43.057281
count = 2400
time: 0:00:31.708342
2015-03-26 05:55:19.483600
count = 2500
time: 0:00:36.426319
2015-03-26 05:55:52.216424
count = 2600
time: 0:00:32.732824
2015-03-26 05:56:27.409991
count = 2700
time: 0:00:35.193567
2015-03-26 05:57:00.810139
count = 2800
time: 0:00:33.400148
2015-03-26 05:58:17.109425
count = 2900
time: 0:01:16.299286
2015-03-26 05:59:31.021719
count = 3000
time: 0:01:13.912294
2015-03-26 06:00:49.200303
count = 3100
time: 0:01:18.178584
2015-03-26 06:02:07.732028
count = 3200
time: 0:01:18.531725
2015-03-26 06:03:28.518541
count = 3300
time: 0:01:20.786513
count = 3324
end of cycle: 2015-03-26 06:03:47.321182
end: 2015-03-26 06:03:47.321182
total duration: 0:45:21.372772
And, as requested, the source code:
import pickle
import pandas as pd
import numpy as np
from os import listdir
from os.path import isfile, join
from datetime import datetime

# Defining function to deep copy pandas data frame:
def very_deep_copy(self):
    return pd.DataFrame(self.values.copy(), self.index.copy(), self.columns.copy())

# Adding function to Dataframe module:
pd.DataFrame.very_deep_copy = very_deep_copy

#Define Data Frame Header:
head = [
    'ConcatIndex', 'Concatenated String Index', 'FileID', ..., 'Attribute<autosave>', 'Attribute<bgcolor>'
    ]
exclude = [
    'ConcatIndex', 'Concatenated String Index', 'FileID', ... , 'Real URL Array'
    ]

path = "./dataset_final/"
pickleFiles = [ f for f in listdir(path) if isfile(join(path,f)) ]
finalnormCSVFile = 'finalNormalizedDataFrame2.csv'

count = 0
start_time = datetime.now()
t1 = start_time
print("start: " + str(start_time) + "\n")
for picklefile in pickleFiles:
    if count%100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)
    df['ConcatIndex'] = 100000*df.FileID + df.ID
    for i in range(0, len(df)):
        df.loc[i, 'Concatenated String Index'] = str(df['ConcatIndex'][i]).zfill(10)
    df.index = df.ConcatIndex

    #DataFrame Normalization:
    dftemp = df.very_deep_copy()
    for string in head:
        if string in exclude:
            if string != 'ConcatIndex':
                dftemp.drop(string, axis=1, inplace=True)
        else:
            if 'Real ' in string:
                max = pd.DataFrame.max(df[string.strip('Real ')])
            elif 'child' in string:
                max = pd.DataFrame.max(df[string.strip('child')+'desc'])
            else:
                max = pd.DataFrame.max(df[string])

            if max != 0:
                dftemp[string] = dftemp[string]/max

    dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

    count += 1

print('count = ' + str(count))
cycle_end_time = datetime.now()
print("end of cycle: " + str(cycle_end_time) + "\n")

end_time = datetime.now()
print("end: " + str(end_time))
print('total duration: ' + str(end_time - start_time) + '\n')
Update #2
As suggested, I ran the command %prun %run "./DataSetNormalization.py" on the first couple of hundred pickle files, and these are the results:
136373640 function calls (136342619 primitive calls) in 1018.769 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
220 667.069 3.032 667.069 3.032 {method 'close' of '_io.TextIOWrapper' objects}
1540 42.046 0.027 46.341 0.030 {pandas.lib.write_csv_rows}
219 34.886 0.159 34.886 0.159 {built-in method collect}
3520 16.782 0.005 16.782 0.005 {pandas.algos.take_2d_axis1_object_object}
78323 9.948 0.000 9.948 0.000 {built-in method empty}
25336892 9.645 0.000 12.635 0.000 {built-in method isinstance}
1433941 9.344 0.000 9.363 0.000 generic.py:1845(__setattr__)
221051/220831 7.387 0.000 119.767 0.001 indexing.py:194(_setitem_with_indexer)
723540 7.312 0.000 7.312 0.000 {method 'reduce' of 'numpy.ufunc' objects}
273414 7.137 0.000 20.642 0.000 internals.py:2656(set)
604245 6.846 0.000 6.850 0.000 {method 'copy' of 'numpy.ndarray' objects}
1760 6.566 0.004 6.566 0.004 {pandas.lib.isnullobj}
276274 5.315 0.000 5.315 0.000 {method 'ravel' of 'numpy.ndarray' objects}
1719244 5.264 0.000 5.266 0.000 {built-in method array}
1102450 5.070 0.000 29.543 0.000 internals.py:1804(make_block)
1045687 5.056 0.000 10.209 0.000 index.py:709(__getitem__)
1 4.718 4.718 1018.727 1018.727 DataSetNormalization.py:6(<module>)
602485 4.575 0.000 15.087 0.000 internals.py:2586(iget)
441662 4.562 0.000 33.386 0.000 internals.py:2129(apply)
272754 4.550 0.000 4.550 0.000 internals.py:1291(set)
220883 4.073 0.000 4.073 0.000 {built-in method charmap_encode}
4781222 3.805 0.000 4.349 0.000 {built-in method getattr}
52143 3.673 0.000 3.673 0.000 {built-in method truediv}
1920486 3.671 0.000 3.672 0.000 {method 'get_loc' of 'pandas.index.IndexEngine' objects}
1096730 3.513 0.000 8.370 0.000 internals.py:3035(__init__)
875899 3.508 0.000 14.458 0.000 series.py:134(__init__)
334357 3.420 0.000 3.439 0.000 {pandas.lib.infer_dtype}
2581268 3.419 0.000 4.774 0.000 {pandas.lib.values_from_object}
1102450 3.036 0.000 6.110 0.000 internals.py:59(__init__)
824856 2.888 0.000 45.749 0.000 generic.py:1047(_get_item_cache)
2424185 2.657 0.000 3.870 0.000 numeric.py:1910(isscalar)
273414 2.505 0.000 9.332 0.000 frame.py:2113(_sanitize_column)
1646198 2.491 0.000 2.880 0.000 index.py:698(__contains__)
879639 2.461 0.000 2.461 0.000 generic.py:87(__init__)
552988 2.385 0.000 4.451 0.000 internals.py:3565(_get_blkno_placements)
824856 2.349 0.000 51.282 0.000 frame.py:1655(__getitem__)
220831 2.224 0.000 21.670 0.000 internals.py:460(setitem)
326437 2.183 0.000 11.352 0.000 common.py:1862(_possibly_infer_to_datetimelike)
602485 2.167 0.000 16.974 0.000 frame.py:1982(_box_item_values)
602485 2.087 0.000 23.202 0.000 internals.py:2558(get)
770739 2.036 0.000 6.471 0.000 internals.py:1238(__init__)
276494 1.966 0.000 1.966 0.000 {pandas.lib.get_blkno_indexers}
10903876/10873076 1.935 0.000 1.972 0.000 {built-in method len}
220831 1.924 0.000 76.647 0.000 indexing.py:372(setter)
220 1.893 0.009 1.995 0.009 {built-in method load}
1920486 1.855 0.000 8.198 0.000 index.py:1173(get_loc)
112860 1.828 0.000 9.607 0.000 common.py:202(_isnull_ndarraylike)
602485 1.707 0.000 8.903 0.000 series.py:238(from_array)
875899 1.688 0.000 2.493 0.000 series.py:263(_set_axis)
3300 1.661 0.001 1.661 0.001 {method 'tolist' of 'numpy.ndarray' objects}
1102670 1.609 0.000 2.024 0.000 internals.py:108(mgr_locs)
4211850 1.593 0.000 1.593 0.000 {built-in method issubclass}
1335546 1.501 0.000 2.253 0.000 generic.py:297(_get_axis_name)
273414 1.411 0.000 37.866 0.000 frame.py:1994(__setitem__)
441662 1.356 0.000 7.884 0.000 indexing.py:982(_convert_to_indexer)
220831 1.349 0.000 131.331 0.001 indexing.py:95(__setitem__)
273414 1.329 0.000 23.170 0.000 generic.py:1138(_set_item)
326437 1.276 0.000 6.203 0.000 fromnumeric.py:2259(prod)
274734 1.271 0.000 2.113 0.000 shape_base.py:60(atleast_2d)
273414 1.242 0.000 34.396 0.000 frame.py:2072(_set_item)
602485 1.183 0.000 1.979 0.000 generic.py:1061(_set_as_cached)
934422 1.175 0.000 1.894 0.000 {method 'view' of 'numpy.ndarray' objects}
1540 1.144 0.001 58.217 0.038 format.py:1409(_save_chunk)
220831 1.144 0.000 9.198 0.000 indexing.py:139(_convert_tuple)
441662 1.137 0.000 3.036 0.000 indexing.py:154(_convert_scalar_indexer)
220831 1.087 0.000 1.281 0.000 arrayprint.py:343(array2string)
1332026 1.056 0.000 3.997 0.000 generic.py:310(_get_axis)
602485 1.046 0.000 9.949 0.000 frame.py:1989(_box_col_values)
220 1.029 0.005 1.644 0.007 internals.py:2429(_interleave)
824856 1.025 0.000 46.777 0.000 frame.py:1680(_getitem_column)
1491578 1.022 0.000 2.990 0.000 common.py:58(_check)
782616 1.010 0.000 3.513 0.000 numeric.py:394(asarray)
290354 0.988 0.000 1.386 0.000 internals.py:1950(shape)
220831 0.958 0.000 15.392 0.000 generic.py:2101(copy)
273414 0.940 0.000 1.796 0.000 indexing.py:1520(_convert_to_index_sliceable)
220831 0.920 0.000 1.558 0.000 common.py:1110(_possibly_downcast_to_dtype)
220611 0.914 0.000 0.914 0.000 {pandas.lib.is_bool_array}
498646 0.906 0.000 0.906 0.000 {method 'clear' of 'dict' objects}
715345 0.848 0.000 13.083 0.000 common.py:132(_isnull_new)
452882 0.824 0.000 1.653 0.000 index.py:256(__array_finalize__)
602485 0.801 0.000 0.801 0.000 internals.py:208(iget)
52583 0.748 0.000 2.038 0.000 common.py:1223(_fill_zeros)
606005 0.736 0.000 6.755 0.000 internals.py:95(make_block_same_class)
708971 0.732 0.000 2.156 0.000 internals.py:3165(values)
1760378 0.724 0.000 0.724 0.000 internals.py:2025(_get_items)
109560 0.720 0.000 6.140 0.000 nanops.py:152(_get_values)
220831 0.718 0.000 11.017 0.000 internals.py:2395(copy)
924669 0.712 0.000 1.298 0.000 common.py:2248(_get_dtype_type)
1515796 0.698 0.000 0.868 0.000 {built-in method hasattr}
220831 0.670 0.000 4.299 0.000 internals.py:435(copy)
875899 0.661 0.000 0.661 0.000 series.py:285(_set_subtyp)
220831 0.648 0.000 0.649 0.000 {method 'get_value' of 'pandas.index.IndexEngine' objects}
452882 0.640 0.000 0.640 0.000 index.py:218(_reset_identity)
715345 0.634 0.000 1.886 0.000 {pandas.lib.isscalar}
1980 0.626 0.000 1.172 0.001 internals.py:3497(_merge_blocks)
220831 0.620 0.000 2.635 0.000 common.py:1933(_is_bool_indexer)
272754 0.608 0.000 0.899 0.000 internals.py:1338(should_store)
220831 0.599 0.000 3.463 0.000 series.py:482(__getitem__)
498645 0.591 0.000 1.497 0.000 generic.py:1122(_clear_item_cache)
1119390 0.584 0.000 1.171 0.000 index.py:3936(_ensure_index)
220831 0.573 0.000 1.883 0.000 index.py:222(view)
814797 0.555 0.000 0.905 0.000 internals.py:3086(_values)
52583 0.543 0.000 15.545 0.000 ops.py:469(wrapper)
220831 0.536 0.000 3.760 0.000 internals.py:371(_try_cast_result)
228971 0.533 0.000 0.622 0.000 generic.py:1829(__getattr__)
769651 0.528 0.000 0.528 0.000 {built-in method min}
224351 0.509 0.000 2.030 0.000 generic.py:1099(_maybe_update_cacher)
...
I will rerun it to confirm, but it certainly looks like it has something to do with pandas' to_csv() method, because most of the running time is spent on io and the csv writer. Why does it behave like this? Any suggestions?
Update #3
Well, I did the full %prun test, and indeed almost 90% of the time is spent in {method 'close' of '_io.TextIOWrapper' objects}. So I guess this is where the problem lies... What do you think?
My question here is: does pandas.DataFrame.to_csv() in append mode load the whole file each time it writes?

Answer:

In this case you should profile your code (to see which function calls are taking the most time); that way you can check empirically that it really is to_csv that is slow, rather than something else...
Looking at your code: firstly, there is a lot of copying and a lot of looping here (not enough vectorization)... every time you see a loop, look for a way to remove it. Secondly, when you use things like zfill, I wonder whether you actually want fixed-width output (to_fwf) rather than to_csv?
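For instance, here is a minimal sketch of removing the per-row loop from the question (the tiny stand-in frame is hypothetical; only the last assignment matters): instead of writing str(...).zfill(10) row by row with df.loc, the vectorized string accessor does it in one shot.

import pandas as pd

# Hypothetical two-row stand-in for one of the pickled DataFrames.
df = pd.DataFrame({'FileID': [1, 1], 'ID': [7, 8]})
df['ConcatIndex'] = 100000 * df.FileID + df.ID

# Vectorized replacement for the row-by-row zfill loop in the question.
df['Concatenated String Index'] = df['ConcatIndex'].astype(str).str.zfill(10)
print(df['Concatenated String Index'].tolist())  # ['0000100007', '0000100008']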
Some sanity testing: are some of the files significantly bigger than the others (which could lead to you hitting swap)? Are you sure the biggest files only have 1200 rows? Have you checked this, e.g. using wc -l?
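If wc -l is awkward to run against the pickles, a rough sanity check can also be done in Python. This is only a sketch, assuming the path and file layout from the question:

import os
import pandas as pd

path = "./dataset_final/"
pickleFiles = sorted(f for f in os.listdir(path) if f.endswith('.p'))

# (file name, size on disk, number of rows) for every pickle file.
stats = [(f,
          os.path.getsize(os.path.join(path, f)),
          len(pd.read_pickle(os.path.join(path, f))))
         for f in pickleFiles]

# Show the five largest files; if any of them is wildly bigger than ~1300 rows,
# that alone could explain sudden slowdowns (e.g. hitting swap).
for name, nbytes, nrows in sorted(stats, key=lambda s: s[1], reverse=True)[:5]:
    print(name, nbytes, 'bytes,', nrows, 'rows')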
IMO I think it is unlikely to be garbage collection... (as was suggested in the other answer).
Here are a few improvements to your code that should improve the runtime.

The columns are fixed, so I would extract the column calculations and vectorize the real, child and other normalizations. Use apply rather than iterating (for the zfill).
columns_to_drop = set(head) & set(exclude)  # maybe also - ['ConcatIndex']
remaining_cols = set(head) - set(exclude)

real_cols = [r for r in remaining_cols if 'Real ' in r]
real_cols_suffix = [r.strip('Real ') for r in real_cols]
remaining_cols = remaining_cols - set(real_cols)

child_cols = [r for r in remaining_cols if 'child' in r]
child_cols_desc = [r.strip('child') + 'desc' for r in child_cols]
remaining_cols = remaining_cols - set(child_cols)

for count, picklefile in enumerate(pickleFiles):
    if count % 100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)
    df['ConcatIndex'] = 100000*df.FileID + df.ID
    # use apply here rather than iterating
    df['Concatenated String Index'] = df['ConcatIndex'].apply(lambda x: str(x).zfill(10))
    df.index = df.ConcatIndex

    #DataFrame Normalization:
    dftemp = df.very_deep_copy()  # don't *think* you need this

    # drop all excludes
    dftemp.drop(columns_to_drop, axis=1, inplace=True)

    # normalize real cols
    m = dftemp[real_cols_suffix].max()
    m.index = real_cols
    dftemp[real_cols] = dftemp[real_cols] / m

    # normalize child cols
    m = dftemp[child_cols_desc].max()
    m.index = child_cols
    dftemp[child_cols] = dftemp[child_cols] / m

    # normalize remaining
    remaining = list(remaining_cols)
    dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()

    # if this case is important then discard the rows of m with .max() is 0
    #if max != 0:
    #    dftemp[string] = dftemp[string]/max

    # this is dropped earlier, if you need it, then subtract ['ConcatIndex'] from columns_to_drop
    # dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)
As an aside, I would probably opt for wrapping these chunks into functions; apart from being cleaner, it also means more things can be garbage collected, if that really is the problem...
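A minimal sketch of what that refactor could look like; the names normalize_one and run are my own, and the body of normalize_one is elided (it would be the same manipulation/normalization steps as above):

import pandas as pd

def normalize_one(path, picklefile):
    """Load one pickle file and return its normalized DataFrame.
    Body elided; same manipulation / normalization steps as in the loop above."""
    df = pd.read_pickle(path + picklefile)
    # ... manipulation / normalization steps ...
    return df

def run(pickle_files, path, out_csv):
    """Process every pickle file and append it to one CSV, as before.
    Because each chunk lives only inside normalize_one and this loop body,
    its temporaries become collectable as soon as the iteration ends."""
    for picklefile in pickle_files:
        dftemp = normalize_one(path, picklefile)
        if picklefile == '0000.p':
            dftemp.to_csv(out_csv)
        else:
            dftemp.to_csv(out_csv, mode='a', header=False)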
Another, faster option, if you don't need the output as csv (but I expect you do), is to use pytables (HDFStore)...
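A minimal sketch of that alternative, assuming PyTables is installed; the file name normalized.h5, the key 'normalized' and the random stand-in frames are hypothetical:

import numpy as np
import pandas as pd

# format='table' is what makes the store appendable; long string columns
# may additionally need min_itemsize to be set.
with pd.HDFStore('normalized.h5', mode='w') as store:
    for i in range(3):  # stand-in for the loop over pickle files
        dftemp = pd.DataFrame(np.random.rand(5, 3), columns=['a', 'b', 'c'])
        store.append('normalized', dftemp, format='table')

# Read everything back (or dump to csv once at the very end, if csv is required):
print(len(pd.read_hdf('normalized.h5', 'normalized')))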
By far the best thing to do is to profile your code, e.g. with %prun in IPython; see http://pynash.org/2013/03/06/timing-and-profiling.html. Then you can see that it really is to_csv, and exactly where it is slow (which line of your code and which lines of pandas code).
Ah, I had missed that you are appending all of these to a single csv file. And your prun shows that most of the time is spent in close, so let's keep the file open:
# outside of the for loop (so the file is opened and closed only once)
f = open(finalnormCSVFile, 'w')

...
for picklefile in ...

    if picklefile == '0000.p':
        dftemp.to_csv(f)
    else:
        dftemp.to_csv(f, mode='a', header=False)
...

f.close()
Each time the file has to be opened before appending, it needs to seek to the end before writing, and it may be that this is what is expensive (I don't see why it should be that bad, but keeping the file open removes the need to do it).
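The same keep-it-open idea can be written with a context manager, so the file is also closed if an exception is raised. This is a sketch with stand-in frames; the header=(count == 0) check replaces the '0000.p' comparison:

import numpy as np
import pandas as pd

out_csv = 'finalNormalizedDataFrame2.csv'  # same target file as in the question

with open(out_csv, 'w') as f:
    for count in range(3):  # stand-in for enumerate(pickleFiles)
        dftemp = pd.DataFrame(np.random.rand(4, 3), columns=['a', 'b', 'c'])
        # Write the header only for the first chunk; because f stays open,
        # there is no repeated open/seek/close per file.
        dftemp.to_csv(f, header=(count == 0))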
Another answer:

Regarding your very_deep_copy: have you checked the memory usage over time? It is possible that memory is not being released properly. If that is the problem, you can do one of the following:
1) Avoid the copying altogether (better performance).
2) Use gc.collect() to force garbage collection occasionally. See "Python garbage collection" for a possibly related issue, and this article for an introduction to garbage collection in Python.
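A small sketch of option 2), forcing a collection inside the existing loop; the every-100-files cadence and the stand-in frame are my own choices:

import gc

import numpy as np
import pandas as pd

pickleFiles = ['%04d.p' % i for i in range(300)]  # stand-in for the real file list

for count, picklefile in enumerate(pickleFiles):
    # Stand-in for reading, normalizing and writing one pickle file.
    dftemp = pd.DataFrame(np.random.rand(1000, 5))

    # Drop the reference to the large temporary and collect every 100 files.
    del dftemp
    if count % 100 == 0:
        gc.collect()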
Edit:

A solution that removes the copying is to: 1) store the normalization constant of each column before normalizing, and 2) drop the columns that are no longer needed after normalizing.
# Get the normalizing constant for each column.
max = {}
for string in head:
    if string not in exclude:
        if 'Real ' in string:
            max[string] = df[string.strip('Real ')].max()
        elif 'child' in string:
            max[string] = df[string.strip('child')+'desc'].max()
        else:
            max[string] = df[string].max()

# Actual normalization, each column is divided by
# its constant if possible.
for key, value in max.items():
    if value != 0:
        df[key] /= value

# Drop the excluded columns
df.drop(exclude, axis=1, inplace=True)