Fast repeated writes to an HDF5 file

Question

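Some time ago I wrote parallel Fortran code that diagonalizes very large dense matrices on a supercomputer. The matrices are read from dense HDF5 datasets. Now I want to use this code for very sparse matrices that are constructed with Python.

However, when I try to write the data to a dense HDF5 file, it takes a very long time. The sparse matrix consists of 3x3 nonzero blocks and is stored in three arrays: rows, cols, data. I tried writing each block iteratively: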

import h5py

# filepath, shape, rows, cols and data are defined elsewhere
fl = h5py.File(filepath, 'w')
dataset = fl.create_dataset("matrix", shape, dtype='d',
                            chunks=(60, 60), compression='szip',
                            fillvalue=0)

# one small HDF5 write per 3x3 block
for row, col, val in zip(rows, cols, data):
    dataset[row*3: row*3 + 3, col*3: col*3 + 3] = val

fl.close()

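For a small matrix consisting of 14848 nonzero blocks (dense shape (1536, 1536)), the write takes 2.6 seconds, and I need to write matrices more than 100 times larger (and even sparser).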

Answer 1

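I don't know whether this helps with speed or convenience, but:

scipy.sparse has a block compressed format that reminds me of your data. It is not exactly the same.

From the documentation for sparse.bsr_matrix: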

In [375]: import numpy as np
     ...: from scipy import sparse
     ...: indptr = np.array([0, 2, 3, 6])
     ...: indices = np.array([0, 2, 2, 0, 1, 2])
     ...: data = np.array([1, 2, 3, 4, 5, 6]).repeat(4).reshape(6, 2, 2)
     ...: M = sparse.bsr_matrix((data,indices,indptr), shape=(6, 6))
     ...:
In [377]: M                                                                     
Out[377]: 
<6x6 sparse matrix of type '<class 'numpy.int64'>'
    with 24 stored elements (blocksize = 2x2) in Block Sparse Row format>
In [378]: M.data                                                                
Out[378]: 
array([[[1, 1],
        [1, 1]],

       [[2, 2],
        [2, 2]],

       [[3, 3],
        [3, 3]],

       [[4, 4],
        [4, 4]],

       [[5, 5],
        [5, 5]],

       [[6, 6],
        [6, 6]]])
In [379]: M.data.shape                                                          
Out[379]: (6, 2, 2)
In [380]: M.indptr                                                              
Out[380]: array([0, 2, 3, 6], dtype=int32)
In [381]: M.indices                                                             
Out[381]: array([0, 2, 2, 0, 1, 2], dtype=int32)

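This is a compressed format, with indptr and indices arrays instead of col and row arrays. sparse does not have a block version of the coo format.

In any case, sparse has (relatively) fast methods for converting between formats.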
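One possible way to get from block coordinates (rows, cols, data) to the indptr and indices arrays of the BSR format is to sort the blocks by row and count how many fall in each block row. The following is only a sketch: the helper name blocks_to_bsr and its parameters are made up for illustration, and it assumes that rows and cols hold block indices (not element indices), that data has shape (n, b, b), and that no block position appears twice.

```python
import numpy as np
from scipy import sparse


def blocks_to_bsr(rows, cols, data, n_block_rows, n_block_cols):
    """Build a BSR matrix from block coordinates (hypothetical helper)."""
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    data = np.asarray(data)

    # Sort blocks by block row first, then by block column.
    # np.lexsort uses the LAST key as the primary sort key.
    order = np.lexsort((cols, rows))
    rows, cols, data = rows[order], cols[order], data[order]

    # indptr[i]:indptr[i + 1] is the range of blocks stored for block row i.
    counts = np.bincount(rows, minlength=n_block_rows)
    indptr = np.concatenate(([0], np.cumsum(counts)))

    r, c = data.shape[1], data.shape[2]
    shape = (n_block_rows * r, n_block_cols * c)
    return sparse.bsr_matrix((data, cols, indptr), shape=shape)


# Same matrix as in the bsr_matrix documentation example, but given as
# unsorted block coordinates.
rows = [2, 0, 0, 1, 2, 2]
cols = [0, 0, 2, 2, 1, 2]
data = np.array([4, 1, 2, 3, 5, 6]).repeat(4).reshape(6, 2, 2)
M = blocks_to_bsr(rows, cols, data, n_block_rows=3, n_block_cols=3)
print(M.toarray())
```
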

In [382]: Mo = M.tocoo()                                                        

In [384]: (Mo.row, Mo.col, Mo.data)                                             
Out[384]: 
(array([0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 4, 4, 5, 5, 4, 4,
        5, 5], dtype=int32),
 array([0, 1, 0, 1, 4, 5, 4, 5, 4, 5, 4, 5, 0, 1, 0, 1, 2, 3, 2, 3, 4, 5,
        4, 5], dtype=int32),
 array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6,
        6, 6]))

This data can be used to fill a zeros array with a single expression:

In [385]: A = np.zeros((6,6),int)                                               
In [386]: A[Mo.row, Mo.col] = Mo.data                                           
In [387]: A                                                                     
Out[387]: 
array([[1, 1, 0, 0, 2, 2],
       [1, 1, 0, 0, 2, 2],
       [0, 0, 0, 0, 3, 3],
       [0, 0, 0, 0, 3, 3],
       [4, 4, 5, 5, 6, 6],
       [4, 4, 5, 5, 6, 6]])
In [388]: M.A                                                                   
Out[388]: 
array([[1, 1, 0, 0, 2, 2],
       [1, 1, 0, 0, 2, 2],
       [0, 0, 0, 0, 3, 3],
       [0, 0, 0, 0, 3, 3],
       [4, 4, 5, 5, 6, 6],
       [4, 4, 5, 5, 6, 6]])

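https://docs.h5py.org/en/stable/high/dataset.html#fancy-indexing does warn that h5py fancy indexing can be slow, especially when it spans multiple chunks. It may still be faster than writing the 3x3 slices one at a time.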
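An alternative that avoids h5py fancy indexing altogether is to assemble the dense data in memory with NumPy and hand h5py a few large contiguous slabs instead of many 3x3 slices. The sketch below is not a measured solution. The names iter_row_bands, write_matrix and band_blocks are invented for illustration, rows and cols are assumed to be block indices, data is assumed to have shape (n, 3, 3), and compression="gzip" with chunks=True are swapped in for the question's szip and (60, 60) settings as assumptions, since szip is not available in every HDF5 build.

```python
import numpy as np


def iter_row_bands(rows, cols, data, n_block_rows, n_block_cols, band_blocks=64):
    """Yield (start, stop, band): the dense slab for element rows start:stop."""
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    data = np.asarray(data, dtype=float)
    b = data.shape[1]  # block edge length, 3 in the question

    for first in range(0, n_block_rows, band_blocks):
        last = min(first + band_blocks, n_block_rows)
        mask = (rows >= first) & (rows < last)

        band = np.zeros(((last - first) * b, n_block_cols * b))
        # 4-D view of the slab: (block row, row in block, block col, col in block)
        view = band.reshape(last - first, b, n_block_cols, b)
        # One vectorized assignment places every block of this band.
        view[rows[mask] - first, :, cols[mask], :] = data[mask]
        yield first * b, last * b, band


def write_matrix(filepath, rows, cols, data, n_block_rows, n_block_cols,
                 band_blocks=64):
    """Write the dense matrix band by band (hypothetical helper)."""
    import h5py  # imported here so the NumPy part can be used on its own

    b = np.asarray(data).shape[1]
    shape = (n_block_rows * b, n_block_cols * b)
    with h5py.File(filepath, "w") as fl:
        dataset = fl.create_dataset("matrix", shape, dtype="d", chunks=True,
                                    compression="gzip", fillvalue=0)
        for start, stop, band in iter_row_bands(rows, cols, data, n_block_rows,
                                                n_block_cols, band_blocks):
            # One contiguous write per band instead of one write per block.
            dataset[start:stop, :] = band
```

Only one band is held in memory at a time, so band_blocks trades memory use against the number of HDF5 writes. For the matrix in the question a single band covering all 512 block rows would fit in memory easily.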

So the unknowns are:

how to convert your block format to bsr
the speed of the bsr.tocoo() step
the relative speed of the fancy h5py write
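The in-memory part of these unknowns can be checked with a rough timing script. The sketch below builds a random BSR matrix of the same size as the one in the question (512 x 512 blocks of 3x3, 14848 nonzero blocks) and times tocoo() and the NumPy fill. The block pattern and values are random stand-ins, not the real data, and the script does not time any h5py write.

```python
import time

import numpy as np
from scipy import sparse

n_blocks_per_side = 512   # 512 * 3 = 1536, the dense size in the question
n_nonzero_blocks = 14848
density = n_nonzero_blocks / n_blocks_per_side**2

# Random block pattern; each nonzero entry becomes one 3x3 block.
pattern = sparse.random(n_blocks_per_side, n_blocks_per_side,
                        density=density, format="coo", random_state=0)
M = sparse.kron(pattern, np.ones((3, 3)), format="bsr")

t0 = time.perf_counter()
Mo = M.tocoo()
t_tocoo = time.perf_counter() - t0

t0 = time.perf_counter()
A = np.zeros(M.shape)
A[Mo.row, Mo.col] = Mo.data
t_fill = time.perf_counter() - t0

print(f"tocoo: {t_tocoo:.4f} s, dense fill: {t_fill:.4f} s")
```
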
