Why does indexing a Series with a too-long boolean Series not raise a warning?


I have the following code:

import pandas as pd

series_source = pd.Series([1, 2, 3, 4], dtype=int)
normal_index = pd.Series([True, False, True, True], dtype=bool)
big_index = pd.Series([True, False, True, True, False, True], dtype=bool)

# Both keys give back the values [1, 3, 4] (labels 0, 2, and 3)
# no warnings are raised!
assert (series_source[normal_index] == series_source[big_index]).all() 

df_source = pd.DataFrame(
    [
        [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]
    ]
)

# no warning - works as expected: grabs rows 0, 2, and 3
df_normal_result = df_source[normal_index]

# UserWarning: Boolean Series key will be reindexed to match DataFrame index.
# (but still runs)
df_big_result = df_source[big_index]

# passes - they are equivalent
assert df_normal_result.equals(df_big_result)
print("Complete")

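Why does indexing `series_source` with `big_index` not raise a warning, even though the big index has more values than the source?
What is pandas doing under the hood to perform the Series indexing?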

(By contrast, indexing `df_source` raises an explicit warning that `big_index` has to be reindexed for the operation to work.)

The indexing documentation claims:

"Using a boolean vector to index a Series works exactly as in a NumPy ndarray"

However, if I do this:

import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([True, False, True, True, False])
c = np.array([True, False, True, True, False, True, True])

# returns an ndarray of [1, 3, 4] as expected
print(a[b])

# raises IndexError: boolean index did not match indexed array along axis 0;
# size of axis is 5 but size of corresponding boolean axis is 7
print(a[c])

So this behavior does not seem to match NumPy the way the documentation claims. What is going on?

(My versions are pandas==2.2.2 and numpy==2.0.0.)

python pandas dataframe indexing series
1 Answer

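Because the boolean Series used as the key is first aligned with the index of the object being indexed, whether that is a DataFrame or a Series. The match is made by label, not by position, so the extra labels 4 and 5 in `big_index` are simply dropped. A DataFrame warns when the key's index differs from its own; a Series performs the same alignment silently.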

In short, pandas is doing:

# align the boolean key to the DataFrame's own row labels first
tmp = big_index.reindex(df_source.index)
df_big_result = df_source[tmp]
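The same idea can be sketched for the Series case from the question. This is only an illustration of the alignment step, not the actual pandas internals. The names `aligned_key`, `direct`, and `manual` are made up for the example.

```python
import pandas as pd

series_source = pd.Series([1, 2, 3, 4], dtype=int)
big_index = pd.Series([True, False, True, True, False, True], dtype=bool)

# Align the boolean key to the labels of the indexed Series (0..3);
# the extra labels 4 and 5 are dropped in the process.
aligned_key = big_index.reindex(series_source.index)

direct = series_source[big_index]    # what the question does
manual = series_source[aligned_key]  # explicit alignment, then indexing

print(direct.tolist())
print(direct.equals(manual))
```

Both paths select the rows labelled 0, 2, and 3, which is why the oversized key goes unnoticed.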

If you change the index of the indexing Series, you can actually observe this yourself:

big_index2 = pd.Series([False, False, True, True, True, True], index=[4,5,0,1,2,3], dtype=bool)
df_source[big_index2]
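The Series from the question behaves the same way. The sketch below, written against the pandas 2.2 behavior the question mentions, contrasts three kinds of key: a labelled key that can be aligned, a plain NumPy array that has no labels and must match by length, and a labelled key that is missing some of the source's labels. The names `shuffled_key`, `short_key`, `positional_error`, and `alignment_error` are hypothetical and exist only for this example.

```python
import pandas as pd

series_source = pd.Series([1, 2, 3, 4], dtype=int)

# Covers every label of the source, stored in a different order, plus extras.
shuffled_key = pd.Series(
    [False, False, True, False, True, True],
    index=[4, 5, 0, 1, 2, 3],
    dtype=bool,
)

# Label-based: the key is True at labels 0, 2, and 3.
by_label = series_source[shuffled_key]
print(by_label.tolist())

# Positional: a bare boolean array carries no labels, so its length must match.
try:
    series_source[shuffled_key.to_numpy()]
    positional_error = None
except IndexError as exc:
    positional_error = exc
print(type(positional_error).__name__)

# A labelled key that lacks some of the source's labels cannot be aligned.
short_key = pd.Series([True, False], index=[0, 1], dtype=bool)
try:
    series_source[short_key]
    alignment_error = None
except pd.errors.IndexingError as exc:
    alignment_error = exc
print(type(alignment_error).__name__)
```

So the sentence in the documentation appears to hold for unlabelled boolean arrays, while a boolean Series key goes through label alignment first. Converting the key with `.to_numpy()` is one way to get the strict, NumPy-style length check.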