How to set an appropriate chunk size in HDF5

Question

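According to this answer, an appropriate chunk size is important for optimizing I/O performance.

I have 3000 jpg images, ranging in size from 180 kB to 220 kB. I am going to save them as bytes.

I know of two approaches. One is to concatenate all the jpg bytes into a single dataset; the other is to save only one jpg per dataset.

How do I determine the best chunk size for each approach?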


import os
import time

import h5py
import numpy as np


def save_images_separate(input_folder, hdf5_path):
    """Store each JPEG as its own uint8 dataset under images/<index>."""
    start_time = time.time()

    image_files = [f for f in os.listdir(input_folder) if f.endswith('.jpg')]

    with h5py.File(hdf5_path, 'w') as hdf5_file:
        for i, image_file in enumerate(image_files):
            image_path = os.path.join(input_folder, image_file)
            with open(image_path, 'rb') as img_file:
                image_data = img_file.read()
            # Store the raw (still compressed) JPEG bytes as a 1-D uint8 array.
            hdf5_file.create_dataset(f'images/{i}', data=np.frombuffer(image_data, dtype=np.uint8))
    end_time = time.time()
    return end_time - start_time


def save_images_concatenated(input_folder, hdf5_path):
    """Store all JPEGs in one uint8 dataset, plus the length of each image."""
    start_time = time.time()

    image_files = [f for f in os.listdir(input_folder) if f.endswith('.jpg')]
    all_images_data = bytearray()
    image_lengths = []

    for image_file in image_files:
        image_path = os.path.join(input_folder, image_file)
        with open(image_path, 'rb') as img_file:
            image_data = img_file.read()
        all_images_data.extend(image_data)
        # The lengths are needed later to split the byte stream back into images.
        image_lengths.append(len(image_data))

    with h5py.File(hdf5_path, 'w') as hdf5_file:
        # Convert the bytearray explicitly so the dataset is a 1-D uint8 array.
        hdf5_file.create_dataset('images/all_images', data=np.frombuffer(all_images_data, dtype=np.uint8))
        hdf5_file.create_dataset('images/image_lengths', data=image_lengths)

    end_time = time.time()
    return end_time - start_time

Answer 1

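First, some background: by default, HDF5 dataset storage is contiguous. Chunked storage is an option for improving I/O performance on large datasets. The difference is as follows:

When you use contiguous storage, the entire dataset is read when any element of the dataset is accessed.
When you use chunked storage, the entire chunk is read when any element of the chunk is accessed.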
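As a minimal sketch of the two layouts in h5py: a dataset created without a chunks argument (and without compression or a resizable shape) is contiguous, and passing chunks makes it chunked. The file name, the dataset names, and the zero-filled payload below are placeholders chosen for illustration, not part of the question's code. The file is kept in memory so nothing is written to disk.

```python
import h5py
import numpy as np

# Stand-in for concatenated JPEG bytes (hypothetical data: 1 MB of zeros).
payload = np.zeros(1_000_000, dtype=np.uint8)

# driver="core" with backing_store=False keeps the file in memory only.
with h5py.File("layout_demo.h5", "w", driver="core", backing_store=False) as f:
    # No chunks argument: contiguous layout (the default).
    contiguous = f.create_dataset("contiguous", data=payload)
    # Explicit chunk shape: chunked layout, 200,000 elements (bytes) per chunk.
    chunked = f.create_dataset("chunked", data=payload, chunks=(200_000,))

    # Dataset.chunks is None for contiguous storage, else the chunk shape.
    contiguous_layout = contiguous.chunks
    chunked_layout = chunked.chunks

print(contiguous_layout)  # None
print(chunked_layout)     # (200000,)
```

Checking the chunks attribute of an existing dataset is a quick way to confirm which layout a file actually uses before tuning anything.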

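If you save each image in its own dataset, you don't really need chunked storage (since the images are only 180-220 kB). With the concatenated approach, the variable image sizes and the resulting dataset shape make it hard to give a recommendation. You will almost always read from 2 chunks, and sometimes from 3. You could try setting chunks to the average image size (200 kB) and test the performance.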
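To see why one image usually spans more than one chunk in the concatenated layout, the following sketch counts the chunks that each image's byte range overlaps. The helper chunks_touched and the five image sizes are hypothetical and only illustrate the arithmetic; real counts depend on the actual file sizes and their order.

```python
from itertools import accumulate


def chunks_touched(offset, length, chunk_size):
    """Return how many chunks the byte range [offset, offset + length) overlaps."""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return last - first + 1


CHUNK = 200_000  # chunk size in bytes, the average image size suggested above

# Hypothetical image sizes in bytes, within the 180-220 kB range from the question.
image_lengths = [185_000, 210_000, 198_000, 220_000, 180_000]

# Start offset of each image inside the concatenated array.
offsets = [0, *accumulate(image_lengths)][:-1]

counts = [chunks_touched(o, n, CHUNK) for o, n in zip(offsets, image_lengths)]
print(counts)  # [1, 2, 2, 3, 1]
```

An image only fits in a single chunk when it is no larger than the chunk and happens to fall entirely between two chunk boundaries; an image larger than the chunk always touches at least 2 chunks and can touch 3. Running the same calculation on the real image_lengths dataset for a few candidate chunk sizes is one way to narrow down what to benchmark.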
