Why is __builtin_popcount slower than my own bit-counting function?

Question

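After writing my own bit-counting routines, I stumbled on __builtin_popcount for gcc. But when I switched to __builtin_popcount, my software actually ran slower. I'm on Ubuntu with an Intel Core i3-4130T CPU @ 2.90GHz. I built a performance test to see what gives. It looks like this: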

#include <iostream>
#include <sys/time.h>
#include <stdint.h>

using namespace std;

const int bitCount[256] = {
    0,1,1,2,1,2,2,3,  1,2,2,3,2,3,3,4,  1,2,2,3,2,3,3,4,  2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,  2,3,3,4,3,4,4,5,  2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,  2,3,3,4,3,4,4,5,  2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,  3,4,4,5,4,5,5,6,  4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,  2,3,3,4,3,4,4,5,  2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,  3,4,4,5,4,5,5,6,  4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,  3,4,4,5,4,5,5,6,  4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,  4,5,5,6,5,6,6,7,  4,5,5,6,5,6,6,7,  5,6,6,7,6,7,7,8
};

const uint32_t m32_0001 = 0x000000ffu;
const uint32_t m32_0010 = 0x0000ff00u;
const uint32_t m32_0100 = 0x00ff0000u;
const uint32_t m32_1000 = 0xff000000u;

inline int countBits(uint32_t bitField)
{
    return
        bitCount[(bitField & m32_0001)      ] +
        bitCount[(bitField & m32_0010) >>  8] +
        bitCount[(bitField & m32_0100) >> 16] +
        bitCount[(bitField & m32_1000) >> 24];
}

inline long long currentTime() {
    struct timeval ct;
    gettimeofday(&ct, NULL);
    return ct.tv_sec * 1000000LL + ct.tv_usec;
}

int main() {
    long long start, delta, sum;

    start = currentTime();
    sum = 0;
    for(unsigned i = 0; i < 100000000; ++i)
        sum += countBits(i);
    delta = currentTime() - start;
    cout << "countBits         : sum=" << sum << ": time (usec)=" << delta << endl;

    start = currentTime();
    sum = 0;
    for(unsigned i = 0; i < 100000000; ++i)
        sum += __builtin_popcount(i);
    delta = currentTime() - start;
    cout << "__builtin_popcount: sum=" << sum << ": time (usec)=" << delta << endl;

    start = currentTime();
    sum = 0;
    for(unsigned i = 0; i < 100000000; ++i) {
        int count;
        asm("popcnt %1,%0" : "=r"(count) : "rm"(i) : "cc");
        sum += count;
    }
    delta = currentTime() - start;
    cout << "assembler         : sum=" << sum << ": time (usec)=" << delta << endl;

    return 0;
}

At first I ran it with an older compiler:

> g++ --version | head -1
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
> cat /proc/cpuinfo | grep 'model name' | head -1
model name      : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -O3 popcountTest.cpp
> ./a.out
countBits         : sum=1314447104: time (usec)=148506
__builtin_popcount: sum=1314447104: time (usec)=345122
assembler         : sum=1314447104: time (usec)=138036

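As you can see, the table-based countBits is almost as fast as the assembler version, and far faster than __builtin_popcount. Then I tried a newer compiler on a different machine (same processor, and I think the motherboard is the same too):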

> g++ --version | head -1
g++ (Ubuntu 7.3.0-16ubuntu3) 7.3.0
> cat /proc/cpuinfo | grep 'model name' | head -1
model name      : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -O3 popcountTest.cpp
> ./a.out
countBits         : sum=1314447104: time (usec)=164247
__builtin_popcount: sum=1314447104: time (usec)=345167
assembler         : sum=1314447104: time (usec)=138028

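Oddly, the older compiler optimized my countBits function better than the newer compiler did, but countBits still compares favorably with the assembler version. Clearly, since the assembler line compiles and runs, my processor supports popcount, so why is __builtin_popcount more than twice as slow? And how can my own routine possibly compete with the silicon-based popcount? I have had the same experience with other routines, such as finding the first set bit. My routines are all significantly faster than the GNU "builtin" equivalents.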

(By the way, I have no idea how to write assembler. I just found that line on some web page and it miraculously worked.)

Answer 1

Without an appropriate "-march" on the command line, gcc generates a call to the __popcountdi2 function rather than the popcnt instruction. See: https://godbolt.org/z/z1BihM

According to Wikipedia (https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT), POPCNT has been supported by Intel since Nehalem and by AMD since Barcelona.
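One way to see which situation a given build is in is to check the `__POPCNT__` macro, which GCC and Clang predefine only when the target flags permit the popcnt instruction. The following is a minimal sketch, not part of the original answer; the names `popcntEnabledAtCompileTime` and `popcount32` are hypothetical.

```cpp
#include <cstdint>

// GCC and Clang predefine __POPCNT__ only when the target flags
// (for example -mpopcnt, or -march=native on a CPU that has it)
// allow the popcnt instruction to be emitted.
constexpr bool popcntEnabledAtCompileTime() {
#ifdef __POPCNT__
    return true;
#else
    return false;
#endif
}

// Hypothetical wrapper around the builtin. The result is the same in
// either build; only the generated code differs (a popcnt instruction
// versus a generic fallback such as a library call).
constexpr int popcount32(std::uint32_t x) {
    return __builtin_popcount(x);
}
```

Printing `popcntEnabledAtCompileTime()` from a small `main`, once compiled with plain `-O3` and once with `-O3 -march=native`, should show whether the flag changed what the compiler is allowed to emit on the machine in question.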

Answer 2

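I thought it might be useful to share new performance results after adding -march=native to the compile line (as suggested by Mat and Alan Birtles), which enables use of the popcount machine instruction. The results differ depending on the compiler version. Here is the older compiler: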

> g++ --version | head -1
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
> cat /proc/cpuinfo | grep 'model name' | head -1
model name      : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -march=native -O3 popcountTest.cpp
> ./a.out
countBits         : sum=1314447104: time (usec)=163947
__builtin_popcount: sum=1314447104: time (usec)=138046
assembler         : sum=1314447104: time (usec)=138036

And here is the newer compiler:

> g++ --version | head -1
g++ (Ubuntu 7.3.0-16ubuntu3) 7.3.0
> cat /proc/cpuinfo | grep 'model name' | head -1
model name      : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -march=native -O3 popcountTest.cpp
> ./a.out
countBits         : sum=1314447104: time (usec)=163133
__builtin_popcount: sum=1314447104: time (usec)=73987
assembler         : sum=1314447104: time (usec)=138036

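Observations:

Adding -march=native to the command line of the older g++ compiler improved the performance of __builtin_popcount to equal that of the assembler version, and slowed my countBits routine by about 10% (from 148506 to 163947 usec).
Adding -march=native to the command line of the newer g++ compiler caused the performance of __builtin_popcount to surpass that of the assembler version. I assume this has to do with the stack variable I used for the assembler version, though I'm not sure. There was no effect on my countBits performance (which, as stated in my question, was already slower with this newer compiler).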
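
For anyone repeating these measurements under different flags, the following is a minimal sketch of a timing helper built on a monotonic clock, which is not affected by wall-clock adjustments the way gettimeofday can be. It is not the code used for the numbers above; `Timed`, `timeSum`, `popcountLoop`, and `variantsAgree` are hypothetical names introduced here for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical result type: the accumulated sum and the elapsed time.
struct Timed {
    long long sum;
    long long usec;
};

// Sums fn(i) for i in [0, n) and measures the loop with steady_clock.
template <typename Fn>
Timed timeSum(std::uint32_t n, Fn fn) {
    const auto start = std::chrono::steady_clock::now();
    long long sum = 0;
    for (std::uint32_t i = 0; i < n; ++i)
        sum += fn(i);
    const auto stop = std::chrono::steady_clock::now();
    const long long usec =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    return Timed{sum, usec};
}

// Portable reference implementation: clears the lowest set bit
// until no bits remain, counting the iterations.
inline int popcountLoop(std::uint32_t x) {
    int n = 0;
    for (; x != 0; x &= x - 1)
        ++n;
    return n;
}

// Runs the builtin and the reference over the same range and
// reports whether their sums match.
inline bool variantsAgree(std::uint32_t n) {
    const Timed a = timeSum(n, [](std::uint32_t i) { return __builtin_popcount(i); });
    const Timed b = timeSum(n, [](std::uint32_t i) { return popcountLoop(i); });
    assert(a.usec >= 0 && b.usec >= 0);  // a monotonic clock never runs backwards
    return a.sum == b.sum;
}
```

Comparing the sums as well as the times, as the original test does by printing `sum`, guards against measuring a variant that is fast only because it is wrong.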
