First, note that this could apply just as well to graphics shaders as to GPGPU code, but my interest is GPGPU, which is why the example code is "compute-like".

We all know that GPUs can't really "do" branchy code effectively, because of the inherent limitations of SIMD processors.

A pattern I've used more than once (and I'm sure I'm not alone) is to refactor one kernel into several: a first kernel that determines which code branch each work item falls into, and remaining kernels that each execute the (branch-free) code for one particular branch.

In the past I've always used non-device code after the first kernel has run, to partition the work items into separate arrays (by branch id) before invoking the branch-free kernels; in what I present here, though, I'm doing the partitioning with atomically incremented indices as part of the first (branch-determining) kernel. I'll call this the "n + 1 kernels" pattern, though I imagine it may already have another name.
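For contrast, here is a minimal sketch of that host-side partitioning in plain C (all names here are my own, and branchIds is assumed to have been read back from the device after a first kernel wrote each work item's branch id):

/* Group global ids by the branch each work item took. In the worst case every
 * item lands in one branch, so each globalIdsByBranch[b] needs numItems slots. */
void partitionByBranch(const int *branchIds, int numItems,
                       int *globalIdsByBranch[3], int countByBranch[3]) {
    for (int b = 0; b < 3; b++) {
        countByBranch[b] = 0;
    }
    for (int i = 0; i < numItems; i++) {
        int b = branchIds[i];
        globalIdsByBranch[b][countByBranch[b]++] = i;
    }
}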
Here is generic GPU-compute pseudocode (though it's close to OpenCL, and some of the terminology may come from the OpenCL world):
// n+1 kernels pattern
// Device code (generic GPGPU language) without the n+1 kernels pattern:
kernel void oneKernel(const int[] intArray, const float[] floatArray, float[] out) {
    int globalId = getId();
    int thisInt = intArray[globalId];
    float thisFloat = floatArray[globalId];
    if (someCondition(thisInt, thisFloat)) {
        if (someOtherCondition(thisInt)) {
            out[globalId] = expensiveCalc0(thisInt, thisFloat);
        } else {
            out[globalId] = expensiveCalc1(thisInt, thisFloat);
        }
    } else {
        out[globalId] = expensiveCalc2(thisInt, thisFloat);
    }
}
// Device code using the n+1 kernels pattern: first call createBranchMappings():
kernel void createBranchMappings(const int[] intArray, const float[] floatArray,
                                 int[][] globalIdsByBranch, volatile int[] nextIndexByBranch) {
    int globalId = getId();
    int thisInt = intArray[globalId];
    float thisFloat = floatArray[globalId];
    int branchId;
    // someCondition is evaluated by all threads in lockstep...
    if (someCondition(thisInt, thisFloat)) {
        // ...but someOtherCondition is not. It would be possible, but very convoluted
        // (using even more kernels), to avoid losing lockstep here, and I'd guess it's
        // unlikely to be beneficial unless someOtherCondition were itself very expensive.
        // Also note that the use of atomics later on breaks true lockstep in any case.
        if (someOtherCondition(thisInt)) {
            branchId = 0;
        } else {
            branchId = 1;
        }
    } else {
        branchId = 2;
    }
    // atomic_add returns the pre-increment value, so each work item claims a
    // unique slot in its branch's list of global ids
    globalIdsByBranch[branchId][atomic_add(&nextIndexByBranch[branchId], 1)] = globalId;
}
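// One detail glossed over above: nextIndexByBranch must be zeroed before
// createBranchMappings runs. A minimal sketch of one way to do that (the kernel
// name is my own invention); alternatively the host can just write zeroes into
// the buffer, as in the host-side sketch at the end. Launch with one work item
// per branch:
kernel void clearCounters(int[] nextIndexByBranch) {
    nextIndexByBranch[getId()] = 0;
}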
// Then call branch0(). For it, getId() ranges from 0 to the number of elements for
// which branchId == 0 (which we can get from nextIndexByBranch after calling
// createBranchMappings()). There is no branching at all while running this kernel;
// threads remain in lockstep.
kernel void branch0(const int[] intArray, const float[] floatArray,
                    const int[][] globalIdsByBranch, float[] out) {
    int indexInThisBranch = getId();
    int globalIndex = globalIdsByBranch[0][indexInThisBranch];
    int thisInt = intArray[globalIndex];
    float thisFloat = floatArray[globalIndex];
    out[globalIndex] = expensiveCalc0(thisInt, thisFloat);
}
// Then call branch1(), then branch2(), and we're done. branch1() and branch2() look
// just like branch0(), except that they call expensiveCalc1 and expensiveCalc2
// rather than expensiveCalc0.
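Since the pseudocode is close to OpenCL, here is a rough sketch of what the host-side dispatch sequence might look like in OpenCL terms (all names are my own assumptions, the kernels' other arguments are assumed to be set already, transfers are blocking for simplicity, and error handling is omitted):

#include <CL/cl.h>

/* Hypothetical host-side dispatch for the n+1 kernels pattern. */
void runNPlusOneKernels(cl_command_queue queue,
                        cl_kernel createBranchMappingsKernel,
                        cl_kernel branchKernels[3],
                        cl_mem nextIndexByBranchBuf,
                        size_t numItems) {
    /* Zero the per-branch counters before the atomics run. */
    cl_int zeros[3] = {0, 0, 0};
    clEnqueueWriteBuffer(queue, nextIndexByBranchBuf, CL_TRUE, 0, sizeof(zeros),
                         zeros, 0, NULL, NULL);

    /* First kernel: decide each work item's branch and partition the global ids. */
    clEnqueueNDRangeKernel(queue, createBranchMappingsKernel, 1, NULL, &numItems,
                           NULL, 0, NULL, NULL);

    /* The final counter values are the per-branch work item counts. */
    cl_int countByBranch[3];
    clEnqueueReadBuffer(queue, nextIndexByBranchBuf, CL_TRUE, 0,
                        sizeof(countByBranch), countByBranch, 0, NULL, NULL);

    /* The remaining n kernels: one branch-free launch per branch, sized to its count. */
    for (int b = 0; b < 3; b++) {
        size_t branchSize = (size_t)countByBranch[b];
        if (branchSize > 0) {
            clEnqueueNDRangeKernel(queue, branchKernels[b], 1, NULL, &branchSize,
                                   NULL, 0, NULL, NULL);
        }
    }
}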
A few questions.