Bootstrap sampling in Pandas with multi-level weighting

Question

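Given a table like the one in the example below (possibly with additional columns), I want to bootstrap samples in which countries and fruits are sampled independently and uniformly at random.

For each country there are a number of fruits, and that number varies between countries.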

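To make clear what I am looking for, I have created a series of sampling strategies (1-4), starting simple and getting progressively closer to what I want:

Sample M fruits per country ...

(1) ... uniformly.
(2) ... inversely proportional to the number of occurrences of the fruit.
(3) ... (on average) uniformly, but bootstrap the countries.
(4) ... (on average) inversely proportional to the number of occurrences of the fruit, but bootstrap the countries.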

As a minimal example of my problem, I chose countries and fruits.

| Country |   Fruit    |
| ------- | ---------- |
| USA     | Pineapple  |
| USA     | Apple      |
| Canada  | Watermelon |
| Canada  | Banana     |
| Canada  | Apple      |
| Mexico  | Cherry     |
| Mexico  | Apple      |
| Mexico  | Apple      |
| ...     | ...        |

Create the example data:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([
        ['USA', 'USA', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Mexico', 'Mexico', 'Mexico', 'Mexico', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'UK', 'UK', 'France', 'France', 'Germany', 'Germany', 'Germany', 'Germany', 'Germany', 'Italy', 'Italy', 'Spain', 'Spain', 'Spain', 'Spain', 'Spain'],
        ['Pineapple', 'Apple', 'Pineapple', 'Apple', 'Cherry', 'Watermelon', 'Orange', 'Apple', 'Banana', 'Cherry', 'Orange', 'Watermelon', 'Banana', 'Apple', 'Blueberry', 'Cherry', 'Apple', 'Banana', 'Blueberry', 'Banana', 'Apple', 'Cherry', 'Blueberry', 'Pineapple', 'Pineapple', 'Watermelon', 'Pineapple', 'Watermelon', 'Apple', 'Orange', 'Blueberry'],
    ]).T,
    columns=['Country', 'Fruit'],
).set_index('Country')
df['other columns'] = '...'

Setup:

M = 10  # number of fruits to sample per country
rng = np.random.default_rng(seed=123)  # set seed for reproducibility

# create weights for later use
fruit_weights = 1 / df.groupby('Fruit').size().rename('fruit_weights')
country_weights = 1 / df.groupby('Country').size().rename('country_weights')

# normalize weights to sum to 1
fruit_weights /= fruit_weights.sum()
country_weights /= country_weights.sum()

(1) Sample M fruits per country uniformly:

sampled_fruits = df.groupby('Country').sample(n=M, replace=True, random_state=rng)

(2) Sample M fruits per country inversely proportional to the number of occurrences of the fruit:

df2 = df.join(fruit_weights, on='Fruit')  # add weights to a copy of the original dataframe
sampled_fruits = df2.groupby('Country').sample(
    n=M,
    replace=True,
    random_state=rng,
    weights='fruit_weights',
)

(3) Sample M fruits per country (on average) uniformly, but bootstrap the countries:

sampled_fruits = pd.concat(
    {
        s: df.sample(
            n=df.index.nunique(),  # number of countries
            weights=country_weights,
            replace=True,
            random_state=rng,
        )
        for s in range(M)
    },
    names=['sample', 'Country'],
).reset_index('sample')
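
To illustrate what "(on average)" means in (3), the following sketch repeats the same construction on a small stand-in table. The toy data and the names toy, draws, rows_per_sample, and per_country are assumptions made for illustration only. Every bootstrap sample contains exactly as many rows as there are countries, while the number of rows a given country receives over all M samples is M only in expectation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the table above (hypothetical data).
toy = pd.DataFrame(
    {
        'Country': ['USA', 'USA', 'Canada', 'Canada', 'Canada', 'Mexico'],
        'Fruit': ['Pineapple', 'Apple', 'Watermelon', 'Banana', 'Apple', 'Cherry'],
    }
).set_index('Country')

M = 10
rng = np.random.default_rng(seed=123)
n_countries = toy.index.nunique()

# Row weight 1 / (number of rows in that country): every country is equally likely.
country_weights = 1 / toy.groupby('Country').size()

draws = pd.concat(
    {
        s: toy.sample(
            n=n_countries,
            weights=country_weights,
            replace=True,
            random_state=rng,
        )
        for s in range(M)
    },
    names=['sample', 'Country'],
).reset_index('sample')

# Rows per bootstrap sample: always n_countries.
rows_per_sample = draws.groupby('sample').size()

# Rows per country over all samples: M in expectation, not necessarily exactly M.
per_country = draws.groupby(level='Country').size()
print(rows_per_sample.unique(), per_country.to_dict())
```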

(4) Sample M fruits per country (on average) inversely proportional to the number of occurrences of the fruit, but bootstrap the countries:

df4 = df.join(fruit_weights, on='Fruit')

# normalize fruit weights to sum to 1 per country to not affect the country weights
df4['fruit_weights'] = df4.fruit_weights.groupby('Country').transform(lambda x: x / x.sum())

df4 = df4.join(country_weights, on='Country')

weight_cols = [c for c in df4.columns if '_weights' in c]
weights = df4[weight_cols].prod(axis=1)
df4 = df4.drop(columns=weight_cols)
sampled_fruits = pd.concat(
    {
        s: df4.sample(
            n=df.index.nunique(),  # number of countries
            weights=weights,
            replace=True,
            random_state=rng,
        )
        for s in range(M)
    },
    names=['sample', 'Country'],
).reset_index('sample')
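
A quick way to inspect the country-level probabilities implied by row weights like those in (4) is to normalize the row weights and sum them per country. The sketch below does this on a small stand-in table. The toy data and the helper country_totals are hypothetical, and map is used in place of the joins above purely to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the table above (hypothetical data).
toy = pd.DataFrame(
    {
        'Country': ['USA', 'USA', 'Canada', 'Canada', 'Canada', 'Mexico'],
        'Fruit': ['Pineapple', 'Apple', 'Watermelon', 'Banana', 'Apple', 'Cherry'],
    }
).set_index('Country')

# Inverse-frequency weights, as in the setup above.
fruit_weights = 1 / toy.groupby('Fruit').size()
country_weights = 1 / toy.groupby('Country').size()

w = toy.copy()
w['fruit_weights'] = w['Fruit'].map(fruit_weights)
# Rescale the fruit weights so that they sum to 1 within each country.
w['fruit_weights'] = w['fruit_weights'] / w.groupby(level='Country')['fruit_weights'].transform('sum')
w['country_weights'] = w.index.map(country_weights)


def country_totals(row_weights):
    """Probability of drawing each country when rows are sampled with these weights."""
    return (row_weights / row_weights.sum()).groupby(level='Country').sum()


# Product of both weight columns, as in (4).
product_totals = country_totals(w['fruit_weights'] * w['country_weights'])
# Per-country-normalized fruit weights alone, for comparison.
fruit_only_totals = country_totals(w['fruit_weights'])
print(product_totals)
print(fruit_only_totals)
```

On this toy table the product of the two columns gives country totals proportional to country_weights (Canada 2/11, Mexico 6/11, USA 3/11), whereas the per-country-normalized fruit weights alone give 1/3 for every country. This may be worth checking on the real data if equal country probabilities are the goal.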

Number (4) does almost what I want: countries and fruits are sampled independently and uniformly at random.

There is only one issue:

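Now suppose I also want to sample vegetables and then (somehow) compare the results with those for fruits. Suppose the countries stay the same, but the number of distinct vegetables is not equal to the number of distinct fruits, neither overall nor for any given country (or at least not for all countries).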

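For any given bootstrap iteration m, this results in a different set of countries being sampled for fruits and for vegetables. To clarify, for each bootstrap iteration m, the sampled countries should be the same for fruits and vegetables, i.e.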

for m in range(M):
    assert all(sampled_fruits[sampled_fruits['sample'] == m].index == sampled_vegetables[sampled_vegetables['sample'] == m].index)

(I know how to get the result I want with nested for loops, sampling a country and then a fruit/vegetable, but that is something I would like to avoid.)

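(fruits and vegetables are just arbitrary things chosen to illustrate my problem. In my real use case, the countries are samples in a test set, and fruits and vegetables are two different groups of people, where each person evaluated/made predictions for a subset of the test set.)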
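
One possible way to keep the sampled countries identical for both tables without nested Python loops is to draw the countries once per bootstrap iteration and then pick a row for every draw separately in each table. The following is only a sketch under assumptions, not a tested solution for the real data: the toy tables, the weight column w, and the helper pick_rows are hypothetical, and Country is kept as a regular column until the end. The merge pairs every draw with all rows of its country, so memory use grows with the number of draws times the number of rows per country.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables; both contain the same countries.
fruits = pd.DataFrame({
    'Country': ['USA', 'USA', 'Canada', 'Canada', 'Canada', 'Mexico'],
    'Fruit': ['Pineapple', 'Apple', 'Watermelon', 'Banana', 'Apple', 'Cherry'],
})
vegetables = pd.DataFrame({
    'Country': ['USA', 'Canada', 'Canada', 'Mexico', 'Mexico'],
    'Vegetable': ['Carrot', 'Carrot', 'Leek', 'Leek', 'Onion'],
})
# Inverse-frequency row weights, one column per table.
fruits['w'] = 1 / fruits['Fruit'].map(fruits['Fruit'].value_counts())
vegetables['w'] = 1 / vegetables['Vegetable'].map(vegetables['Vegetable'].value_counts())


def pick_rows(table, draws, rng, weight_col='w'):
    """Pick one row of `table` per (sample, Country) draw, weighted within the country."""
    t = table.reset_index(drop=True)
    p = t[weight_col] / t.groupby('Country')[weight_col].transform('sum')
    # Each row owns the interval [lo, hi) of its country's unit interval.
    t['hi'] = p.groupby(t['Country']).cumsum()
    t.loc[~t['Country'].duplicated(keep='last'), 'hi'] = 1.0  # guard against rounding
    t['lo'] = t.groupby('Country')['hi'].shift(fill_value=0.0)
    d = draws.assign(u=rng.random(len(draws)))  # one uniform number per draw
    # Pair every draw with all rows of its country; keep the row whose interval holds u.
    cand = d.merge(t, on='Country', how='left')
    hit = (cand['lo'] <= cand['u']) & (cand['u'] < cand['hi'])
    return cand[hit].drop(columns=['u', 'lo', 'hi', weight_col]).set_index('Country')


M = 10
rng = np.random.default_rng(seed=123)
countries = sorted(set(fruits['Country']))

# Draw the countries once, uniformly; both tables reuse exactly these draws.
draws = pd.DataFrame({
    'sample': np.repeat(np.arange(M), len(countries)),
    'Country': rng.choice(countries, size=M * len(countries)),
})
sampled_fruits = pick_rows(fruits, draws, rng)
sampled_vegetables = pick_rows(vegetables, draws, rng)

# The condition stated above holds by construction.
for m in range(M):
    assert all(
        sampled_fruits[sampled_fruits['sample'] == m].index
        == sampled_vegetables[sampled_vegetables['sample'] == m].index
    )
```

Because the uniform numbers u are drawn separately in each call, the fruit and the vegetable picked for the same country draw are independent of each other. Only the country sequence is shared.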
