Efficient stratified random sampling in Julia

Question

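I am trying to write a small function that does stratified random sampling. That is, I have a vector giving the group membership of each element, and I want to select one element (index) from each sampled group. So the input is the number of elements wanted plus the group membership of each element, and the output is a list of indices.

Here is my function: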

using StatsBase  # provides sample and wsample

function stratified_sample(n::Int64, groups::Array{Int64})

    # the output vector of indices
    ind = zeros(Int64, n)

    # first select n groups from the total set of possible groups
    group_samp = sample(unique(groups), n, replace = false)

    # cycle through the selected groups
    for i in 1:n
        # for each group, select one index whose group matches the current target group
        ind[i] = sample([1:length(groups)...][groups.==group_samp[i]], 1, replace = false)[1]
    end

    # return the indices
    return ind
end

When I run this code on a relatively large vector (for example, 1000 distinct groups and 40000 entries in total), I get:


julia> groups = sample(1:1000, 40000, replace = true)
40000-element Array{Int64,1}:
 221
 431
 222
 421
 714
 108
 751
 259
   ⋮
 199
 558
 317
 848
 271
 358

julia> @time stratified_sample(5, groups)
  0.022951 seconds (595.06 k allocations: 19.888 MiB)
5-element Array{Int64,1}:
 11590
 17057
 17529
 25103
 20651

Compare that with ordinary random sampling of 5 elements out of the 40000 possible:

julia> @time sample(1:40000, 5, replace = false)
  0.000005 seconds (5 allocations: 608 bytes)
5-element Array{Int64,1}:
 38959
  5850
  3283
 19779
 30063

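So my code runs roughly 4,600 times slower (0.022951 s versus 0.000005 s) and allocates more than 30,000 times as much memory (19.888 MiB versus 608 bytes)! What am I doing wrong, and is there a way to speed this code up? My guess is that the real slowdown happens in the subsetting step, i.e. [1:length(groups)...][groups.==group_samp[i]], but I can't find a better solution.

I have searched the standard Julia packages for a function that does this, but did not find one.

Any suggestions?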

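EDIT: I was able to speed it up by simply drawing a random sample and checking whether it satisfies the requirement that n unique groups were selected.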

function stratified_sample_random(n::Int64, groups::Array{Int64}, group_probs::Array{Float32})
    ind = zeros(Int64, n)
    my_samp = []
    while true
        my_samp = wsample(1:length(groups), group_probs, n, replace = false)
        if length(unique(groups[my_samp])) == n
            break
        end
    end

    return my_samp

end

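Here group_probs is just a vector of sampling probabilities in which each element has weight 1/s, where s is the number of elements in its group, so every group has a total weight of 1. For example, if groups = [1,1,1,1,2,3,3], the corresponding probabilities are group_probs = [0.25, 0.25, 0.25, 0.25, 1, 0.5, 0.5]. This helps speed up the sampling by minimising the probability of selecting several items from the same group. Overall, it works quite well.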
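The post does not show how group_probs was built. A minimal sketch of one way to compute it, using only Base Julia; the name group_weights is hypothetical and not from the original post:

```julia
# Hypothetical helper (not from the original post): give each element the
# weight 1/s, where s is the size of its group, so each group totals 1.
function group_weights(groups::AbstractVector{<:Integer})
    # First pass: count how many elements each group has.
    counts = Dict{eltype(groups),Int}()
    for g in groups
        counts[g] = get(counts, g, 0) + 1
    end
    # Second pass: look up the group size for every element.
    return [1 / counts[g] for g in groups]
end

weights = group_weights([1, 1, 1, 1, 2, 3, 3])
```

This returns a Vector{Float64}. Since stratified_sample_random above is declared with group_probs::Array{Float32}, the result would need converting first, e.g. Float32.(group_weights(groups)).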

@time stratified_sample_random(5, groups, group_probs)
  0.000122 seconds (14 allocations: 1.328 KiB)
5-element Array{Int64,1}:
 32209
 10184
 30892
  4861
 30300

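After some experimenting, sampling weighted by these probabilities is not necessarily faster than the standard sample(); it depends on how many unique groups there are and on the desired value of n.

Of course, there is no guarantee that this function will ever draw a set of objects from unique groups, so it could loop forever. My idea is to add a counter to the while loop: if it has tried 10000 times without success, it calls the original stratified_sample function to make sure a valid result is returned. I don't love this solution, and there must be a more elegant and concise approach, but it is definitely an improvement.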
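The counter idea could look roughly like the sketch below. It is not the asker's code: the names try_stratified_sample and max_tries are illustrative, it uses unweighted draws from Base's rand rather than wsample, and it returns nothing on failure so the caller can choose the fallback.

```julia
# Sketch only: `try_stratified_sample` and `max_tries` are illustrative names.
# Draws n indices uniformly at random and accepts the draw only if all n
# indices belong to different groups; gives up after `max_tries` attempts.
function try_stratified_sample(n::Integer, groups::AbstractVector;
                               max_tries::Integer = 10_000)
    for _ in 1:max_tries
        candidate = rand(eachindex(groups), n)
        # Distinct groups imply distinct indices, so no extra duplicate check.
        if allunique(view(groups, candidate))
            return candidate
        end
    end
    return nothing  # no valid draw found; the caller decides how to fall back
end

groups = rand(1:1000, 40_000)
idx = try_stratified_sample(5, groups)
```

A caller could then write something like idx === nothing ? stratified_sample(5, groups) : idx to get the guaranteed fallback described above.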

Answer 1

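With [1:length(groups)...] you are splatting and allocating a 40000-element array n times, which you should avoid. Here is a version that is about 33 times faster and uses the range inds instead. Knowing the real application, though, we could probably come up with an even faster approach.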

function stratified_sample(n::Int64, groups::Array{Int64})

    # the output vector of indices
    ind = zeros(Int64, n)

    # first select n groups from the total set of possible groups
    group_samp = sample(unique(groups), n, replace = false)

    inds = 1:length(groups)
    # cycle through the selected groups
    for i in 1:n
        # for each group, select one index whose group matches the current target group
        ind[i] = sample(inds[groups.==group_samp[i]], 1, replace = false)[1]
    end

    # return the indices
    return ind
end
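One possible direction for a faster approach, offered as a sketch rather than as the answerer's method: index the groups once in a dictionary, then pick n groups and one member of each. The name stratified_sample_indexed is hypothetical, and the code uses only Base and the Random standard library instead of StatsBase.

```julia
using Random  # standard library; provides shuffle!

# Hypothetical alternative (not from the original answer).
function stratified_sample_indexed(n::Integer, groups::AbstractVector)
    # One pass over `groups`: map each group label to the indices carrying it.
    members = Dict{eltype(groups),Vector{Int}}()
    for (i, g) in pairs(groups)
        push!(get!(() -> Int[], members, g), i)
    end
    n <= length(members) ||
        throw(ArgumentError("n exceeds the number of distinct groups"))
    # Choose n distinct groups, then one random member index from each.
    chosen = shuffle!(collect(keys(members)))[1:n]
    return [rand(members[g]) for g in chosen]
end

groups = rand(1:1000, 40_000)
idx = stratified_sample_indexed(5, groups)
```

This avoids scanning the whole vector once per selected group. If many samples are drawn from the same groups vector, the dictionary could be built once outside the function and reused.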
