I want to evaluate a weighted sum of functions inside a CUDA kernel, with given coefficients and functions:
y = a1*f1(x) + a2*f2(x) + a3*f3(x) + a4*f4(x);
My first idea was to loop over the terms and use a switch statement:
float y = 0;
for (int i = 0; i < 4; i++) {
    if (a[i] != 0) {
        switch (i) {
            case 0: y += a[i] * f1(x); break;
            case 1: y += a[i] * f2(x); break;
            case 2: y += a[i] * f3(x); break;
            case 3: y += a[i] * f4(x); break;
        }
    }
}
This solution seems very basic and slow once the number of terms in the equation can exceed 30. Now I am considering a function-pointer array to get rid of the switch statement:
float y = 0;
for (int i = 0; i < 4; i++) {
    if (a[i] != 0) {
        y += a[i] * f[i](x);
    }
}
All threads share the same a[i]. Each function f[i] is itself another polynomial with 1 to 8 terms.
Can this be optimized?
Fortunately, measuring GPU performance is easy:
const auto StartTime = clock64();
// ... do stuff
const auto EndTime = clock64();
const auto Diff = int(EndTime - StartTime);
if (threadIdx.x == 0) { printf("time = %i clock cycles\n", Diff); }
Some tips: avoid branching, and avoid if statements. Executing all code unconditionally is much faster, because an if statement is translated into the following pseudocode:
if x then do_y() else do_z()
->
p = bool(x)
p? y()
p? more y() //a handful of statements will be predicated
!p? z()
!p? more z()
If you have many statements inside the if, an actual call is made instead, and calls are very expensive!
Your code can be executed efficiently like this:
__shared__ float y[1024]; //assume we have 1024 threads in a block
__shared__ float f[4];
__shared__ float result[32];
//initialize f and y
//this looks like a loop, but it can be optimized away if count == blockDim.x
for (int i = threadIdx.x; i < count; i += blockDim.x) {
    const int j = i % 4; //deduce the `f` you need
    y[i] += a[i] * f[j]; //no ifs/switches needed
    //more calculations; do not use ifs, but extract the data needed
    //from a shared array, so you add 0, or multiply by 1, if you want to skip items.
    //do not divide; multiply by a (pre-calculated) 1/x instead
}
//reduce 32 values into 1
const auto reduce = [&](float y1) -> float {
    y1 += __shfl_down_sync(-1u, y1, 1);
    y1 += __shfl_down_sync(-1u, y1, 2);
    y1 += __shfl_down_sync(-1u, y1, 4);
    y1 += __shfl_down_sync(-1u, y1, 8);
    y1 += __shfl_down_sync(-1u, y1, 16);
    return y1;
};
const auto y1 = reduce(y[threadIdx.x]);
if ((threadIdx.x & 31) == 0) { result[threadIdx.x / 32] = y1; } //one partial sum per warp
__syncthreads();
if (threadIdx.x < 32) {
    const auto final_result = reduce(result[threadIdx.x]);
    if (threadIdx.x == 0) { globalresult[0] = final_result; }
}