SciPy 稀疏矩阵 csr_matrix 和 csc_matrix 函数 - 它们使用多少中间内存？

Question

对于 scipy 函数 csr_matrix 和 csc_matrix，特别是采用行索引和列索引的形式：

csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
csc_matrix((data, (row_ind, col_ind)), [shape=(M, N)])

他们使用多少中间内存？想必他们必须使用一些才能转换为 CSR/CSC 表示形式。

上下文是，在特定情况下调用这些函数会使用大量内存，以至于由于内存不足而无法成功，因此我试图对此进行推理。在我的特定示例中，总数据 + row_ind + col_ind 需要 25GB，并且我剩余大约 35GB 内存，但这似乎不足以调用 csr_matrix 或 csc_matrix。

文档位于 https://docs.scipy.org/doc/scipy/reference/ generated/scipy.sparse.csr_matrix.html 和 https://docs.scipy.org/doc/scipy/reference/ generated /scipy.sparse.csc_matrix.html 似乎没有提供有关此的信息。

这是我希望在我的 60GB (Linux) 系统上运行内存不足的大致等效代码。

import numpy as np
from scipy.sparse import csc_matrix

num_values = 2500000000
output_matrix_size = 150000

matrix = csc_matrix(
    (
        np.zeros(num_values, dtype=np.float16),
        (
            np.zeros(num_values, dtype=np.int32),
            np.zeros(num_values, dtype=np.int32),
        ),
    ),
    shape=(output_matrix_size, output_matrix_size),
)

Answer 1

我已经通过稀疏模块追踪了操作。施工工程如下：

创建了具有重复项的 COO 矩阵
COO 矩阵转换为带有重复项的 CSC
重复项已汇总

进行以下分配：

可选地，不是在这种情况下：如果索引类型对于 COO 矩阵不是最优的，则索引的转换副本。在 int32 和 int64 之间做出决定。这里使用int32就足够了，所以没有复制
可选，不是在这种情况下：如果数据类型与指定类型不匹配，则创建值的转换副本
转换为 CSC 格式时会生成行索引和数据的副本（有重复项）。 CSC 格式的剩余元数据（主要是
```
indptr
```
数组）也被分配
C++
```
vector<pair<Index, Value>>
```
临时分配用于对行索引进行排序。其最大大小由每列的非零条目的最大数量决定。在本例中，这是完整的数据大小，因为所有非零条目都在一列中
保留修剪后的唯一行索引和值的副本

使用 2.5e9 初始值，第 3 点需要 14 GiB（每行索引 4 个字节，每个值 2 个字节）加上 <2 MiB for the

indptr

数组。第 4 点额外占用 18.6 GiB（

pair<int32, float16>

占用 8 个字节）。

作为解决方法，我建议批量转换。像这样的东西：

batchsize = 524288 # 2 MiB divided by 4 byte. Just a guess
values = np.zeros(num_values, dtype=np.float16)
rows = cols = np.zeros(num_values, dtype=np.int32)
shape = (output_matrix_size, output_matrix_size)
out = csc_matrix(shape, dtype=values.dtype)
for i in range(0, len(values), batchsize):
    out += csc_matrix((values[i:i+batchsize],
                       (rows[i:i+batchsize], cols[i:i+batchsize])),
                      shape=shape)

其扩展版本可以设计为并行化简。当然，这只有在数据集有大量重复项时才有意义。

如果存在许多不重复的非零值，则这种批处理方式的性能将非常差，因为输出矩阵的大小会增长，而添加的元素数量保持不变。或者，最好只对大小相似的矩阵求和，如合并排序模式。就空间和时间要求而言，这介于其他两种方法之间。像这样的东西：

batches = (csc_matrix((values[i:i+batchsize],
                       (rows[i:i+batchsize], cols[i:i+batchsize]),
                      ), shape=shape)
           for i in range(0, len(values), batchsize))

stack = [csc_matrix(shape, dtype=values.dtype)]
for batch in batches:
    while stack and stack[-1].nnz <= batch.nnz * 2:
        batch += stack.pop()
    stack.append(batch)

while len(stack) > 1:
    batch = stack.pop()
    stack[-1] += batch

out = stack.pop()

SciPy 稀疏矩阵 csr_matrix 和 csc_matrix 函数 - 它们使用多少中间内存？

问题描述投票：0回答：1

1个回答

最新问题

SciPy 稀疏矩阵 csr_matrix 和 csc_matrix 函数 - 它们使用多少中间内存？

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1