Visual Studio 2012 中的三次根函数 cbrt()

Question

我正在 Visual Studio 2012 Professional (Windows) 中用 C/C++ 编写一个程序，其中包括使用

pow()

计算许多幂。我运行分析器来找出为什么需要这么长时间才能运行，我发现

pow()

是瓶颈。

我重写了诸如

之类的权力

pow(x,1.5)

至

x*sqrt(x)

和

pow(x,1.75)

至

sqrt(x*x*x*sqrt(x))

这显着提高了程序的速度。

有一些权力是这样的

pow(x,1.0/3.0)

所以我寻找立方根函数

cbrt()

来加速速度，但它似乎在Visual Studio中不可用，我几乎无法想象，所以我的问题是：

在 Visual Studio 2012 Professional 中哪里可以找到

cbrt()

函数，如果没有，除了

pow(x,1.0/3.0)

之外还有什么替代方法？

亲切的问候，

恩斯特·扬

Answer 1

此站点探索了几种在 C 中有效计算立方根的计算方法，并且有一些源代码可供您下载。

（编辑：谷歌搜索“快速立方根”会出现几个更有前途的搜索结果。）

立方根是一个令人感兴趣的话题，因为它们在许多常见公式中使用，并且 Microsoft Visual Studio 中不包含快速立方根函数。

在没有特殊立方根函数的情况下，典型的策略是通过幂函数计算（例如，pow(x, 1.0/3.0)）。当负数处理不当时，这可能会在速度和准确性方面出现问题。

他的网站有一些关于所使用方法的基准。所有这些都比

pow()

快得多。

32-bit float tests
----------------------------------------
cbrt_5f      8.8 ms    5 mbp   6.223 abp
pow        144.5 ms   23 mbp  23.000 abp
halley x 1  31.8 ms   15 mbp  18.961 abp
halley x 2  59.0 ms   23 mbp  23.000 abp
newton x 1  23.4 ms   10 mbp  12.525 abp
newton x 2  48.9 ms   20 mbp  22.764 abp
newton x 3  72.0 ms   23 mbp  23.000 abp
newton x 4  89.6 ms   23 mbp  23.000 abp

请参阅网站以获取可下载源代码。

Answer 2

下面的实现速度比 std::pow 快 4 倍，并且在 AVX-512 CPU 上具有相对较高的容差 (0.000001)。它由每个基本操作（如乘法和除法）的垂直自动矢量化循环组成，因此它可以一次计算 8,16,32 个元素，而不是水平矢量化 Newton-Raphson 循环。

#include <cmath> 
/* 
   Newton-Raphson iterative solution
   f_err(x) = x*x*x - N
   f'_err(x) = 3*x*x
   x = x - (x*x*x - N)/(3*x*x)
   x = x - (x - N/(x*x))/3   <--- repeat until error < tolerance
   but with vertical-parallelization 
*/
template < typename Type, int Simd, int inverseTolerance> 
    inline 
void cubeRootFast(Type *const __restrict__ data, 
        Type *const __restrict__ result) noexcept 
{ 
    // alignment 64 required for AVX512 vectorization
    alignas(64) 
    Type xd[Simd]; 

    alignas(64) 
    Type resultData[Simd]; 

    alignas(64) 
    Type xSqr[Simd]; 

    alignas(64) 
    Type nDivXsqr[Simd]; 

    alignas(64) 
    Type diff[Simd]; 

    // cube root checking mask
    for (int i = 0; i < Simd; i++) 
    { 
        xd[i] = data[i] <= Type(0.000001);
    } 

    // skips division by zero if input is zero or close to zero
    for (int i = 0; i < Simd; i++) 
    { 
        resultData[i] = xd[i] ? Type(1.0) : data[i]; 
    } 

    // Newton-Raphson Iterations in parallel 
    bool work = true; 
    while (work) 
    { 
        // compute x*x
        for (int i = 0; i < Simd; i++) 
        { 
            xSqr[i] = resultData[i] *resultData[i]; 
        } 

        // compute N/(x*x)
        for (int i = 0; i < Simd; i++) 
        { 
            nDivXsqr[i] = data[i] / xSqr[i]; 
        } 

        // compute x - N/(x*x)
        for (int i = 0; i < Simd; i++) 
        { 
            nDivXsqr[i] = resultData[i] - nDivXsqr[i]; 
        } 

        // compute (x-N/(x*x))/3
        for (int i = 0; i < Simd; i++) 
        { 
            nDivXsqr[i] = nDivXsqr[i] / Type(3.0); 
        } 

        // compute x - (x-N/(x*x))/3
        for (int i = 0; i < Simd; i++) 
        { 
            diff[i] = resultData[i] - nDivXsqr[i]; 
        } 

        // compute error
        for (int i = 0; i < Simd; i++) 
        { 
            diff[i] = resultData[i] - diff[i]; 
        } 

        // compute absolute error
        for (int i = 0; i < Simd; i++) 
        { 
            diff[i] = std::abs(diff[i]); 
        } 

        // compute condition to stop looping (error < tolerance)?
        for (int i = 0; i < Simd; i++) 
        { 
            diff[i] = diff[i] > Type(1.0/inverseTolerance); 
        } 

        // all SIMD lanes have to have zero work left to end
        Type check = 0; 
        for (int i = 0; i < Simd; i++) 
        { 
            check += diff[i]; 
        } 
        work = (check > Type(0.0)); 

        // compute the next x guess
        for (int i = 0; i < Simd; i++) 
        { 
            resultData[i] = resultData[i] - nDivXsqr[i]; 
        } 
    } 

    // if input was close to zero, output zero
    // output result otherwise
    for (int i = 0; i < Simd; i++) 
    { 
        result[i] = xd[i] ? Type(0.0) : resultData[i]; 
    } 
} 
#include <iostream> 
 
int main() 
{ 
    constexpr int n = 8192; 
    constexpr int simd = 16; 
    constexpr int inverseTolerance = 1000;
    float data[n]; 
    for (int i = 0; i < n; i++) 
    { 
        data[i] = i; 
    } 
    for (int i = 0; i < n; i += simd) 
    { 
        cubeRootFast<float, simd, inverseTolerance> (data + i, data + i); 
    } 
    for (int i = 0; i < 10; i++) 
        std::cout << data[i *i *i] << std::endl; 
    return 0; 
}

它仅使用 GCC 进行测试，因此可能需要在每个循环上使用额外的 MSVC 编译指示来强制自动矢量化。如果你有 OpenMP，那么你也可以使用

#pragma omp simd safelen(Simd)

来实现同样的事情。

性能仅保持在[0,1]范围内。要使用更大的值，您应该使用范围缩小，如下所示：

// example: max value is 1000
for(auto & input:inputs)
    input = input/1000.0f // normalize

for(..)
    cubeRootFast<float, simd, inverseTolerance> (input + i, input + i)

 for(auto & input:inputs)
    input = 10.0f*input // de-normalize  (1000 = 10 x 10 x 10)

如果您在低范围（如 [0,1000]）上只需要 0.005 误差并具有 16 倍加速，您可以尝试下面使用多项式近似的实现（Horner-Scheme 应用于使用 FMA 指令进行计算，并且不需要显式自动向量化，因为它内部不包含任何分支/循环）：

// optimized for [0,1] range: ~1 cycles on AVX512, 0.003 average error 
// polynomial approximation with Horner Scheme for FMA optimization 
template<typename T> 
T cubeRootFast(T x) 
{ 
 T xd = x-T(1.0); 
 T result = T(-55913.0/4782969.0); 
 result *= xd; 
 result += T(21505.0/1594323.0);  
 result *= xd; 
 result += T(-935.0/59049.0);  
 result *= xd; 
 result += T(374.0/19683.0);  
 result *= xd; 
 result += T(-154.0/6561.0);  
 result *= xd; 
 result += T(22.0/729.0);  
 result *= xd; 
 result += T(-10.0/243.0); 
 result *= xd; 
 result += T(5.0/81.0); 
 result *= xd; 
 result += T(-1.0/9.0);  
 result *= xd; 
 result += T(1.0/3.0); 
 result *= xd; 
 result += T(1.0); 
 return result; 
} 
 
// range reduction + dereduction: ~ 1 cycles on AVX512 
 for(int i=0;i<8192;i++) 
 {  
     float inp = input[i]; 
     // scaling + descaling for range [1,999] 
     float scaling = (inp>333.0f)?(1000.0f):(333.0f); 
     scaling = (inp>103.0f)?scaling:(103.0f); 
     scaling = (inp>29.0f)?scaling:(29.0f); 
     scaling = (inp>7.0f)?scaling:(7.0f); 
     scaling = (inp>3.0f)?scaling:(3.0f); 
      
     output[i] = powf(scaling,0.33333333333f)*cubeRootFast<float>(inp/scaling);  
 }

Answer 3

MSC 2012 一定是他们最后一个没有在 math.h 中实现 cbrt 的版本。 MS 系统 cbrt 基准测试非常糟糕，但在实际代码中并没有那么糟糕。但这不是最准确的。

我发现在大多数编译器上既准确又快速的 cbrt 公共代码的最佳组合是 Bruce Evans 的 Kahan 魔术常数算法的 Sun 实现。通过在最终细化中仔细使用截断和 FMA，英特尔 2023 编译器中的系统 cbrt 具有令人难以置信的准确度。

如果没有更好的选择，一个简单的选择是：

double cbrt(double x)
{  // fast and accurate. NR isn't sufficient to tidy it up
   // could be improved by precision (y*y)*y-x using FMA
   double t, y;
   if (x) y = exp(log(abs(x)) / 3); else return x;
   if (x<0) y = -y;
   t = y*y*y;
   return y - y * (t - x) / (2 * t + x);  // Halley refinement
}

根据 CPU 架构，其中任何一个都值得一试。

Visual Studio 2012 中的三次根函数 cbrt()

问题描述投票：0回答：3

3个回答

最新问题

Visual Studio 2012 中的三次根函数 cbrt()

问题描述 投票：0回答：3

3个回答

最新问题

问题描述投票：0回答：3