I have the following function, which creates a large number of dictionary entries describing cubes:
import numpy as np

def fill_dict_cubes(
    length: np.ndarray,
    width: np.ndarray,
    height: np.ndarray,
) -> np.ndarray:
    numObjects, numFrames = length.shape
    # Object array that holds one dict per (object, frame) pair
    cubePrimitives = np.empty((numObjects, numFrames), dtype=object)
    for objIdx in range(numObjects):
        for frameIdx in range(numFrames):
            cubeDict = {
                "length": length[objIdx, frameIdx],
                "width": width[objIdx, frameIdx],
                "height": height[objIdx, frameIdx],
            }
            cubePrimitives[objIdx, frameIdx] = cubeDict
    return cubePrimitives
I have to use the dict structure because it is predefined by an external API. The inputs length, width and height are 2D numpy arrays. The output is an array containing dictionaries.
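Since I have a large number of cubes and many frames, filling all these dictionaries with nested for loops takes quite a long time. Unfortunately, so far I have not found a good way to speed this up via vectorization, multiprocessing, parallelization, etc.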
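Does anyone know how to create these dictionaries faster? By the way, I'm using Python 3.10, but updating to 3.13 would not be a problem if newer features are needed.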
TL;DR
fill_dict_cubes = numpy.vectorize(lambda x, y, z: {"length": x, "width": y, "height": z})
Environment: numpy 1.25.2, Python 3.11.5
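As a quick illustration of what the one-liner returns, here is a minimal sketch on a tiny 2x2 input. The name fill_dict_cubes_vec and the sample values are made up for this example, and passing otypes=[object] is an optional addition of mine (not part of the one-liner above) that spares np.vectorize the extra trial call it otherwise makes to infer the output type.

```python
import numpy as np

# Same lambda as in the TL;DR. otypes=[object] is an assumption added here:
# it declares the output dtype up front instead of letting np.vectorize
# infer it from a trial call on the first element.
fill_dict_cubes_vec = np.vectorize(
    lambda x, y, z: {"length": x, "width": y, "height": z},
    otypes=[object],
)

# Small hypothetical inputs: 2 objects, 2 frames
length = np.array([[1.0, 2.0], [3.0, 4.0]])
width = length * 10
height = length * 100

cubes = fill_dict_cubes_vec(length, width, height)
print(cubes.shape, cubes.dtype)  # shape matches the inputs, dtype is object
print(cubes[1, 0])               # the dict for object 1, frame 0
```

The result has the same shape as the inputs, and each element is an independent dict, which is the layout the original nested loops produce.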
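To make the initial function roughly twice as fast, we can use numpy.vectorize. Suppose we have a class CubePrimitive that is initialized with three parameters and has a property attributes that returns a dictionary (as in your initial code):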
import numpy as np
from dataclasses import dataclass

@dataclass
class CubePrimitive():
    length: float
    width: float
    height: float

    def __post_init__(self):
        # Let's add some attribute as if we couldn't avoid creating objects
        self.volume = self.length*self.width*self.height

    @property
    def attributes(self):
        return {'length': self.length,
                'width': self.width,
                'height': self.height,
                'volume': self.volume}
The desired output can then be obtained with:
vect_cube_primitive = np.vectorize(lambda x, y, z: CubePrimitive(x, y, z).attributes)
vect_cube_primitive.__name__ = 'vect_cube_primitive'
vect_cube_primitive.__doc__ = 'Vectorized version of CubePrimitive(...).attributes'
For the case where the data is simply combined into a dictionary:
def vect_dict(keys, fname: str | None = None):
    # Build a vectorized function that zips the given keys with its arguments
    f = np.vectorize(lambda *args: dict(zip(keys, args)))
    if fname:
        f.__name__ = fname
    f.__doc__ = f'Vectorized dictionary with keys={keys}'
    return f

vect_primitives = vect_dict('length width height'.split(), 'vect_primitives')
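Before timing anything, it may be worth confirming that the vectorized version produces the same dictionaries as the nested loops. The sketch below re-declares both variants in compact form so that it runs on its own; the 3x4 shape and the seed are arbitrary choices for the check.

```python
import numpy as np

def fill_dict_cubes(length, width, height):
    # Reference: nested-loop version from the question, condensed
    num_objects, num_frames = length.shape
    out = np.empty((num_objects, num_frames), dtype=object)
    for i in range(num_objects):
        for j in range(num_frames):
            out[i, j] = {
                "length": length[i, j],
                "width": width[i, j],
                "height": height[i, j],
            }
    return out

keys = ["length", "width", "height"]
vect_primitives = np.vectorize(lambda *args: dict(zip(keys, args)))

rng = np.random.default_rng(0)  # fixed seed, arbitrary
shape = (3, 4)                  # arbitrary small shape
length, width, height = (rng.random(shape) for _ in range(3))

expected = fill_dict_cubes(length, width, height)
actual = vect_primitives(length, width, height)

# Compare shape and every dict element by element
same_shape = expected.shape == actual.shape
same_content = all(a == e for a, e in zip(actual.ravel(), expected.ravel()))
print(same_shape, same_content)
```

One difference to keep in mind: the loop stores NumPy scalars taken from the input arrays, whereas np.vectorize may hand plain Python floats to the lambda. The values compare equal, but if the external API checks types strictly, that is worth verifying.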
Now, let's compare their performance:
import pandas as pd
from itertools import product, tee
from timeit import timeit

rng = np.random.default_rng()  # random generator used in the timeit setup

start, stop, step = 100, 501, 100

funcs = [
    fill_dict_cubes,  # load from https://stackoverflow.com/revisions/79264703/2
    vect_cube_primitive,
    vect_primitives,
]
funcs = {f.__name__: f for f in funcs}

# All (numObjects, numFrames) combinations to benchmark
shapes = product(*tee(range(start, stop, step), 2))
time_df = pd.DataFrame(np.nan,
                       index=pd.MultiIndex.from_tuples(shapes),
                       columns=funcs.keys())

setup = ('shape=(%i, %i);'
         'length=rng.random(shape);'
         'width=rng.random(shape);'
         'height=rng.random(shape)')
_globals = funcs | {'rng': rng}

for shape, fname in product(time_df.index, time_df.columns):
    time_df.loc[shape, fname] = timeit(f'{fname}(length, width, height)', setup % shape,
                                       number=3, globals=_globals)

# Sort the shapes by total number of elements before plotting
by_volume = time_df.index.to_frame().prod(axis=1).sort_values().index
ax = time_df.loc[by_volume].plot(kind='bar', figsize=(10, 5))
ax.tick_params(axis='x', labelrotation=90)
ax.set_ylim(bottom=0)
fig = ax.figure
fig.tight_layout()
fig.savefig('performance.jpg')
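For a quick check on your own machine without pandas or matplotlib, a reduced variant of the comparison might look like the sketch below. It times only the nested loops against the plain dictionary vectorization at a single, arbitrarily chosen shape. The resulting ratio will depend on hardware and on the numpy and Python versions, so treat it as a rough indication only.

```python
import numpy as np
from timeit import timeit

def fill_dict_cubes(length, width, height):
    # Nested-loop version from the question, condensed
    num_objects, num_frames = length.shape
    out = np.empty((num_objects, num_frames), dtype=object)
    for i in range(num_objects):
        for j in range(num_frames):
            out[i, j] = {
                "length": length[i, j],
                "width": width[i, j],
                "height": height[i, j],
            }
    return out

vect_primitives = np.vectorize(
    lambda x, y, z: {"length": x, "width": y, "height": z}
)

rng = np.random.default_rng(0)
shape = (200, 200)  # arbitrary size for a quick check
length, width, height = (rng.random(shape) for _ in range(3))

env = {
    "fill_dict_cubes": fill_dict_cubes,
    "vect_primitives": vect_primitives,
    "length": length,
    "width": width,
    "height": height,
}
t_loop = timeit("fill_dict_cubes(length, width, height)", globals=env, number=3)
t_vec = timeit("vect_primitives(length, width, height)", globals=env, number=3)
print(f"loop: {t_loop:.3f}s  vectorized: {t_vec:.3f}s  ratio: {t_loop / t_vec:.2f}")
```

Note that the numpy documentation describes np.vectorize as a convenience wrapper whose implementation is essentially a for loop, so any gain comes from reduced per-element overhead rather than from true array-level vectorization.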