I want to evaluate a weighted sum of functions inside a CUDA kernel, with given coefficients and functions:
y = a1*f1(x) + a2*f2(x) + a3*f3(x) + a4*f4(x);
My first idea was to loop over the terms and use a switch statement:
float y = 0;
for (int i = 0; i < 4; i++) {
    if (a[i] != 0) {
        switch (i) {
            case 0: y += a[i] * f1(x); break;
            case 1: y += a[i] * f2(x); break;
            case 2: y += a[i] * f3(x); break;
            case 3: y += a[i] * f4(x); break;
        }
    }
}
This solution seems very basic and slow once the number of terms in the equation can exceed 30. Now I am considering a function-pointer array to get rid of the switch statement:
float y = 0;
for (int i = 0; i < 4; i++) {
    if (a[i] != 0) {
        y += a[i] * f[i](x);
    }
}
All threads share the same a[i]. Each function f[i] is itself another polynomial with 1 to 8 terms.
Can this be optimized?
Fortunately, measuring GPU performance is easy:
const auto StartTime = clock64();
// ... do stuff
const auto EndTime = clock64();
const auto Diff = int(EndTime - StartTime);
if (threadIdx.x == 0) { printf("time = %i clock cycles\n", Diff); }
Some tips: avoid branching, and avoid if statements. Executing all code unconditionally is much faster, because an if statement is translated into the following pseudocode:
if x then do_y() else do_z()
->
p = bool(x)
p? y()
p? more y() //a handful of statements will be predicated
!p? z()
!p? more z()
If you have many statements inside the if, an actual call is made instead, and calls are very expensive!
Your code can be executed efficiently like this:
__shared__ float y[1024]; //assume we have 1024 threads in a block
__shared__ float f[4];
__shared__ float result[32];
//initialize f and y
//this looks like a loop, but it can be optimized away if count == blockDim.x
for (int i = threadIdx.x; i < count; i += blockDim.x) {
    const int j = i % 4; //deduce the `f` you need
    y[i] += a[i] * f[j]; //no ifs/switches needed
    //more calculations; do not use ifs, but extract the data needed
    //from a shared array, so you add 0, or multiply by 1, if you want to skip items.
    //do not divide; multiply by a (pre-calculated) 1/x instead
}
//reduce 32 values into 1
const auto reduce = [&](float y1) -> float {
    y1 += __shfl_down_sync(-1u, y1, 1);
    y1 += __shfl_down_sync(-1u, y1, 2);
    y1 += __shfl_down_sync(-1u, y1, 4);
    y1 += __shfl_down_sync(-1u, y1, 8);
    y1 += __shfl_down_sync(-1u, y1, 16);
    return y1;
};
const auto y1 = reduce(y[threadIdx.x]);
if ((threadIdx.x & 31) == 0) { result[threadIdx.x / 32] = y1; } //one partial sum per warp
__syncthreads();
if (threadIdx.x < 32) {
    const auto final_result = reduce(result[threadIdx.x]);
    if (threadIdx.x == 0) { globalresult[0] = final_result; }
}