First, note that this could apply just as well to graphics shaders as to GPGPU code, but my interest is GPGPU, which is why the example code is "compute-like".

We all know that GPUs can't really "do" branchy code effectively, because of the inherent limitations of SIMD processors.

A pattern I've used more than once (and I'm sure I'm not alone) is to refactor one kernel into several: a first kernel that determines which code branch each work item falls into, and remaining kernels that each execute the (branch-free) code for one particular branch.

In the past I've always used non-device code after the first kernel has run, to partition the work items into separate arrays (by branch id) before invoking the branch-free kernels; in what I present here, though, I'm doing the partitioning with atomically incremented indices as part of the first (branch-determining) kernel. I'll call this the "n + 1 kernels" pattern, though I imagine it may already have another name.
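For contrast, here is a minimal sketch of that host-side partitioning in plain C (all names here are my own, and branchIds is assumed to have been read back from the device after a first kernel wrote each work item's branch id):

/* Group global ids by the branch each work item took. In the worst case every
 * item lands in one branch, so each globalIdsByBranch[b] needs numItems slots. */
void partitionByBranch(const int *branchIds, int numItems,
                       int *globalIdsByBranch[3], int countByBranch[3]) {
    for (int b = 0; b < 3; b++) {
        countByBranch[b] = 0;
    }
    for (int i = 0; i < numItems; i++) {
        int b = branchIds[i];
        globalIdsByBranch[b][countByBranch[b]++] = i;
    }
}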
Here is generic GPU-compute pseudocode (though it's close to OpenCL, and some of the terminology may come from the OpenCL world):
// n+1 kernels pattern
// Device code (generic GPGPU language) without the n+1 kernels pattern:
kernel void oneKernel(const int[] intArray, const float[] floatArray, float[] out) {
    int globalId = getId();
    int thisInt = intArray[globalId];
    float thisFloat = floatArray[globalId];
    if (someCondition(thisInt, thisFloat)) {
        if (someOtherCondition(thisInt)) {
            out[globalId] = expensiveCalc0(thisInt, thisFloat);
        } else {
            out[globalId] = expensiveCalc1(thisInt, thisFloat);
        }
    } else {
        out[globalId] = expensiveCalc2(thisInt, thisFloat);
    }
}
// Device code using the n+1 kernels pattern: first call createBranchMappings():
kernel void createBranchMappings(const int[] intArray, const float[] floatArray,
                                 int[][] globalIdsByBranch, volatile int[] nextIndexByBranch) {
    int globalId = getId();
    int thisInt = intArray[globalId];
    float thisFloat = floatArray[globalId];
    int branchId;
    // someCondition is evaluated by all threads in lockstep...
    if (someCondition(thisInt, thisFloat)) {
        // ...but someOtherCondition is not. It would be possible, but very convoluted
        // (using even more kernels), to avoid losing lockstep here, and I'd guess it's
        // unlikely to be beneficial unless someOtherCondition were itself very expensive.
        // Also note that the use of atomics later on breaks true lockstep in any case.
        if (someOtherCondition(thisInt)) {
            branchId = 0;
        } else {
            branchId = 1;
        }
    } else {
        branchId = 2;
    }
    // atomic_add returns the pre-increment value, so each work item claims a
    // unique slot in its branch's list of global ids
    globalIdsByBranch[branchId][atomic_add(&nextIndexByBranch[branchId], 1)] = globalId;
}
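// One detail glossed over above: nextIndexByBranch must be zeroed before
// createBranchMappings runs. A minimal sketch of one way to do that (the kernel
// name is my own invention); alternatively the host can just write zeroes into
// the buffer, as in the host-side sketch at the end. Launch with one work item
// per branch:
kernel void clearCounters(int[] nextIndexByBranch) {
    nextIndexByBranch[getId()] = 0;
}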
// Then call branch0(). For it, getId() ranges from 0 to the number of elements for
// which branchId == 0 (which we can get from nextIndexByBranch after calling
// createBranchMappings()). There is no branching at all while running this kernel;
// threads remain in lockstep.
kernel void branch0(const int[] intArray, const float[] floatArray,
                    const int[][] globalIdsByBranch, float[] out) {
    int indexInThisBranch = getId();
    int globalIndex = globalIdsByBranch[0][indexInThisBranch];
    int thisInt = intArray[globalIndex];
    float thisFloat = floatArray[globalIndex];
    out[globalIndex] = expensiveCalc0(thisInt, thisFloat);
}
// Then call branch1(), then branch2(), and we're done. branch1() and branch2() look
// just like branch0(), except that they call expensiveCalc1 and expensiveCalc2
// rather than expensiveCalc0.
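Since the pseudocode is close to OpenCL, here is a rough sketch of what the host-side dispatch sequence might look like in OpenCL terms (all names are my own assumptions, the kernels' other arguments are assumed to be set already, transfers are blocking for simplicity, and error handling is omitted):

#include <CL/cl.h>

/* Hypothetical host-side dispatch for the n+1 kernels pattern. */
void runNPlusOneKernels(cl_command_queue queue,
                        cl_kernel createBranchMappingsKernel,
                        cl_kernel branchKernels[3],
                        cl_mem nextIndexByBranchBuf,
                        size_t numItems) {
    /* Zero the per-branch counters before the atomics run. */
    cl_int zeros[3] = {0, 0, 0};
    clEnqueueWriteBuffer(queue, nextIndexByBranchBuf, CL_TRUE, 0, sizeof(zeros),
                         zeros, 0, NULL, NULL);

    /* First kernel: decide each work item's branch and partition the global ids. */
    clEnqueueNDRangeKernel(queue, createBranchMappingsKernel, 1, NULL, &numItems,
                           NULL, 0, NULL, NULL);

    /* The final counter values are the per-branch work item counts. */
    cl_int countByBranch[3];
    clEnqueueReadBuffer(queue, nextIndexByBranchBuf, CL_TRUE, 0,
                        sizeof(countByBranch), countByBranch, 0, NULL, NULL);

    /* The remaining n kernels: one branch-free launch per branch, sized to its count. */
    for (int b = 0; b < 3; b++) {
        size_t branchSize = (size_t)countByBranch[b];
        if (branchSize > 0) {
            clEnqueueNDRangeKernel(queue, branchKernels[b], 1, NULL, &branchSize,
                                   NULL, 0, NULL, NULL);
        }
    }
}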
A few questions.