Why does a masked array appear to be smaller than the unmasked array?

Question (votes: 0, answers: 2)

I am trying to understand the size difference between a numpy masked array and a plain array containing NaNs.

import numpy as np
g = np.random.random((5000,5000))
indx = np.random.randint(0,4999,(500,2))
mask =  np.full((5000,5000),False,dtype=bool)
mask[indx] = True
g_mask = np.ma.array(g,mask=mask)

I compute the size of the objects with the function from the following answer:

import sys
from types import ModuleType, FunctionType
from gc import get_referents

# Custom objects know their class.
# Function objects seem to know way too much, including modules.
# Exclude modules as well.
BLACKLIST = type, ModuleType, FunctionType


def getsize(obj):
    """sum size of object & members."""
    if isinstance(obj, BLACKLIST):
        raise TypeError('getsize() does not take argument of type: '+ str(type(obj)))
    seen_ids = set()
    size = 0
    objects = [obj]
    while objects:
        need_referents = []
        for obj in objects:
            if not isinstance(obj, BLACKLIST) and id(obj) not in seen_ids:
                seen_ids.add(id(obj))
                size += sys.getsizeof(obj)
                need_referents.append(obj)
        objects = get_referents(*need_referents)
    return size

This gives me the following results:

getsize(g)
>>>200000112
getsize(g_mask)
>>>25000924

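Why is the unmasked array larger than the masked one? How can I estimate the actual size of the masked array compared with the unmasked array?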

python arrays numpy size mask
2 answers

1 vote

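numpy.ndarray has no tp_traverse, so it is incompatible with the getsize function you are trying to use. The references owned by the ndarray part of the masked array are not visible to the GC system. In particular, base is not included in your output.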
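As a rough illustration of the point above, the following sketch compares an array that owns its buffer with a view of it. The names owner and view are made up for this example. It relies on sys.getsizeof counting the data buffer only for the array that owns it, and on gc.get_referents not reporting base for a plain ndarray; the latter reflects current NumPy behavior and could change in a future release, so the script prints the result rather than assuming it.

```python
import gc
import sys

import numpy as np

# 'owner' and 'view' are illustrative names, not from the question.
owner = np.zeros((1000, 1000))  # owns its 8,000,000-byte float64 buffer
view = owner.view()             # shares that buffer; view.base is owner

owner_size = sys.getsizeof(owner)  # header plus the owned data buffer
view_size = sys.getsizeof(view)    # header only: the buffer belongs to owner

# A getsize()-style walk follows gc.get_referents(). If 'owner' does not
# show up here, the walk never reaches the buffer behind the view.
referents = gc.get_referents(view)
base_visible = any(r is owner for r in referents)

print("getsizeof(owner):", owner_size)
print("getsizeof(view): ", view_size)
print("view.base is owner:", view.base is owner)
print("base visible to gc.get_referents:", base_visible)
```

In the question, g_mask plays the role of the view: np.ma.array(g, mask=mask) does not copy g by default, so the 200 MB buffer stays attributed to g, and getsize(g_mask) ends up counting little more than the 25,000,000-byte boolean mask.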


1 vote
In [23]: g = np.random.random((5000,5000))
    ...: indx = np.random.randint(0,4999,(500,2))
    ...: mask = np.full((5000,5000),False,dtype=bool)
    ...: mask[indx] = True
    ...: g_mask = np.ma.array(g,mask=mask)

Comparing the g array with the _data attribute of g_mask, we see that the latter is just a view of the former:

In [24]: g.__array_interface__
Out[24]: {'data': (139821997776912, False), 'strides': None, 'descr': [('', '<f8')], 'typestr': '<f8', 'shape': (5000, 5000), 'version': 3}
In [25]: g_mask._data.__array_interface__
Out[25]: {'data': (139821997776912, False), 'strides': None, 'descr': [('', '<f8')], 'typestr': '<f8', 'shape': (5000, 5000), 'version': 3}

They have the same data buffer, but their id values differ:

In [26]: id(g)
Out[26]: 139822758212672
In [27]: id(g_mask._data)
Out[27]: 139822386925440

The same holds for the mask:

In [28]: mask.__array_interface__
Out[28]: {'data': (139822298669072, False), 'strides': None, 'descr': [('', '|b1')], 'typestr': '|b1', 'shape': (5000, 5000), 'version': 3}
In [29]: g_mask._mask.__array_interface__
Out[29]: {'data': (139822298669072, False), 'strides': None, 'descr': [('', '|b1')], 'typestr': '|b1', 'shape': (5000, 5000), 'version': 3}

In fact, with this construction, _mask is the very same array object:

In [30]: id(mask)
Out[30]: 139822385963056
In [31]: id(g_mask._mask)
Out[31]: 139822385963056

The __array_interface__ of the masked array is that of its ._data attribute:

In [32]: g_mask.__array_interface__
Out[32]: {'data': (139821997776912, False), ...

nbytes is the size of an array's data buffer.

A boolean array uses 1 byte per element, while float64 uses 8 bytes per element.
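To answer the "how do I estimate the actual size" part, one option is to add up nbytes for the data buffer and the mask buffer. The sketch below uses a smaller array than the question so that it runs quickly. The helper masked_nbytes and the array g_nan are hypothetical names introduced here, and the helper ignores the small fixed overhead of the Python objects themselves.

```python
import numpy as np


def masked_nbytes(m):
    """Approximate buffer size of a masked array: data plus mask (hypothetical helper)."""
    data_bytes = np.ma.getdata(m).nbytes
    mask = np.ma.getmask(m)
    # np.ma.nomask means no mask array has been allocated at all.
    mask_bytes = 0 if mask is np.ma.nomask else mask.nbytes
    return data_bytes + mask_bytes


g = np.random.random((500, 500))      # float64: 8 bytes per element
mask = np.zeros(g.shape, dtype=bool)  # bool: 1 byte per element
mask[::10, ::10] = True

g_mask = np.ma.array(g, mask=mask)    # masked array; does not copy g by default
g_nan = np.where(mask, np.nan, g)     # plain array with NaN in the masked slots

print(g_nan.nbytes)                                # 2000000
print(masked_nbytes(g_mask))                       # 2250000
print(np.shares_memory(g, np.ma.getdata(g_mask)))  # True
```

By this measure a masked float64 array costs about 9 bytes per element against 8 for the NaN-filled array, so the masked version is the larger of the two once the shared data buffer is counted.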
