Fastest way to fill a large number of dictionaries


I have the following function, which creates a large number of dictionary entries describing cubes:

import numpy as np

def fill_dict_cubes(
    length: np.ndarray,
    width: np.ndarray,
    height: np.ndarray,
) -> np.ndarray:
    numObjects, numFrames = length.shape
    cubePrimitives = np.empty((numObjects, numFrames), dtype=object)
    for objIdx in range(numObjects):
        for frameIdx in range(numFrames):
            cubeDict = {
                "length": length[objIdx, frameIdx],
                "width": width[objIdx, frameIdx],
                "height": height[objIdx, frameIdx],
            }
            cubePrimitives[objIdx, frameIdx] = cubeDict
    return cubePrimitives

I need to use the dict structure; it is predefined by an external API. The input data length, width and height are 2D numpy arrays, and the output is an array containing dictionaries. Since I have a large number of cubes and many frames, filling all of these dictionaries with the nested for loops takes quite a long time. Unfortunately, so far I have not found a good way to speed this up via vectorization / multiprocessing / parallelization / etc.
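For illustration, this is what the output looks like for a tiny 2x2 input (a minimal sketch; the array contents are arbitrary random numbers):

rng = np.random.default_rng()
length = rng.random((2, 2))
width = rng.random((2, 2))
height = rng.random((2, 2))

cubes = fill_dict_cubes(length, width, height)
print(cubes.shape)   # (2, 2), dtype=object
print(cubes[0, 0])   # {'length': ..., 'width': ..., 'height': ...}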

Does anyone know how to create these dictionaries faster? By the way, I am using Python 3.10, but updating to 3.13 would not be a problem if new features are required.

python numpy dictionary optimization multidimensional-array
1 Answer

TL;DR

fill_dict_cubes = np.vectorize(lambda x, y, z: {"length": x, "width": y, "height": z})

(Benchmarked with numpy 1.25.2, Python 3.11.5.)
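For example, the vectorized one-liner can be used as a drop-in replacement for the original function (a minimal sketch; the array contents are arbitrary):

import numpy as np

fill_dict_cubes = np.vectorize(lambda x, y, z: {"length": x, "width": y, "height": z})

rng = np.random.default_rng()
length, width, height = rng.random((3, 100, 50))

cubes = fill_dict_cubes(length, width, height)
print(cubes.shape)   # (100, 50) object array of dicts
print(cubes[0, 0])   # {'length': ..., 'width': ..., 'height': ...}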

To make the initial function roughly twice as fast, we can use numpy.vectorize. Suppose we have a class CubePrimitive that is initialized with three parameters and has a property attributes that returns a dictionary (as in your initial code):

import numpy as np
from dataclasses import dataclass

@dataclass
class CubePrimitive():
    length: float
    width: float
    height: float
    
    def __post_init__(self):
        # Let's make an additional attribute as if we couldn't avoid creating objects
        self.volume = self.length*self.width*self.height
    
    @property
    def attributes(self):
        return {'length': self.length, 
                'width': self.width, 
                'height': self.height,
                'volume': self.volume}

Then the desired output can be obtained like this:

vect_cube_primitive = np.vectorize(lambda x, y, z: CubePrimitive(x, y, z).attributes)
vect_cube_primitive.__name__ = 'vect_cube_primitive'
vect_cube_primitive.__doc__ = 'Vectorized version of CubePrimitive(...).attribute'
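As a quick check, continuing from the definitions above (a minimal sketch with arbitrary 2x2 inputs), the vectorized call returns an object array of dictionaries that also carry the derived volume:

rng = np.random.default_rng()
length, width, height = rng.random((3, 2, 2))

cubes = vect_cube_primitive(length, width, height)
print(cubes.shape)           # (2, 2)
print(sorted(cubes[0, 0]))   # ['height', 'length', 'volume', 'width']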

For the case where the data is simply combined into a dictionary:

def vect_dict(keys, fname: str | None = None):
    f = np.vectorize(lambda *args: dict(zip(keys, args)))
    if fname: f.__name__ = fname
    f.__doc__ = f'Vectorized dictionary with keys={keys}'
    return f

vect_primitives = vect_dict('length width height'.split(), 'vect_primitives')
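Reusing the same length, width and height arrays as in the sketch above, vect_primitives yields exactly the dictionary layout the question asks for:

cube_primitives = vect_primitives(length, width, height)
print(cube_primitives.shape)   # (2, 2)
print(cube_primitives[0, 0])   # {'length': ..., 'width': ..., 'height': ...}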

Now let's compare their performance:

import pandas as pd
from itertools import product, tee
from timeit import timeit

rng = np.random.default_rng()   # used inside the timeit setup strings

start, stop, step = 100, 501, 100
funcs = [
    fill_dict_cubes,       # load from https://stackoverflow.com/revisions/79264703/2
    vect_cube_primitive, 
    vect_primitives,
]
funcs = {f.__name__: f for f in funcs}
shapes = product(*tee(range(start, stop, step), 2))
time_df = pd.DataFrame(np.nan, 
    index=pd.MultiIndex.from_tuples(shapes),
    columns=funcs.keys())
setup = ('shape=(%i, %i);'
         'length=rng.random(shape);'
         'width=rng.random(shape);'
         'height=rng.random(shape)')
_globals = funcs | {'rng': rng}
for shape, fname in product(time_df.index, time_df.columns):
    time_df.loc[shape, fname] = timeit(f'{fname}(length, width, height)', setup % shape,
                                       number=3, globals=_globals)

by_volume = time_df.index.to_frame().prod(1).sort_values().index
ax = time_df.loc[by_volume].plot(kind='bar', figsize=(10, 5))
ax.tick_params(axis='x', labelrotation=90)
ax.set_ylim(bottom=0)
fig = ax.figure
fig.tight_layout()
fig.savefig('performance.jpg')

[Figure: performance comparison of fill_dict_cubes, vect_cube_primitive and vect_primitives across input shapes]
