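I am trying to make use of my processor's SIMD capabilities. However, I see no improvement from vectorization when I compile the binary for AVX (cmake flag -mavx2) compared to a binary compiled for SSE (cmake flag -msse).
I expected AVX to be roughly 1.5x faster, but that is clearly not the case. Can anyone help me understand why?
Here is the POC code I am testing: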
#include <iostream>
#include <chrono>
#include <cstdint>
#include <string>
#include <algorithm>
#include <random>
#include <vector>
#include <immintrin.h>
#include <malloc.h>
using namespace std;
using namespace chrono;
std::string generateRandomString(size_t length) {
    // Define the characters to be used in the random string
    const std::string characters =
        "0123456789"
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "!@#$%^&*()_+[]{}|;:,.<>?";
    // Initialize random number generator
    std::random_device rd;
    std::mt19937 generator(rd());
    std::uniform_int_distribution<> distribution(0, characters.size() - 1);
    // Generate the random string
    std::string randomString;
    randomString.reserve(length); // Reserve space to avoid multiple allocations
    for (size_t i = 0; i < length; ++i) {
        randomString += characters[distribution(generator)];
    }
    return randomString;
}
void test_int_vectorization(int32_t size) {
    uint64_t* array1 = new uint64_t[size];
    uint64_t* array2 = new uint64_t[size];
    uint64_t* result = new uint64_t[size];
    auto t1 = high_resolution_clock::now();
    for (int i = 0; i < size; ++i) {
        array1[i] = 1000 + i;
        array2[i] = 500 + i;
    }
    auto t2 = high_resolution_clock::now();
    auto populationTime = duration_cast<milliseconds>(t2 - t1);
    cout << "Time in populating arrays = " << to_string(populationTime.count()) << " ms \n";
    t2 = high_resolution_clock::now();
    for (int i = 0; i < size; ++i) {
        result[i] = array1[i] + array2[i];
    }
    auto t3 = high_resolution_clock::now();
    auto operation_duration = duration_cast<milliseconds>(t3 - t2);
    std::cout << "Int operation : Add completed in " << to_string(operation_duration.count()) << " ms \n";
    t2 = high_resolution_clock::now();
    for (int i = 0; i < size; ++i) {
        result[i] = array1[i] * array2[i];
    }
    t3 = high_resolution_clock::now();
    operation_duration = duration_cast<milliseconds>(t3 - t2);
    std::cout << "Int operation : Multiply completed in " << to_string(operation_duration.count()) << " ms \n";
    t2 = high_resolution_clock::now();
    for (int i = 0; i < size; ++i) {
        result[i] = array1[i] / array2[i];
    }
    t3 = high_resolution_clock::now();
    operation_duration = duration_cast<milliseconds>(t3 - t2);
    std::cout << "Int operation : Division completed in " << to_string(operation_duration.count()) << " ms \n";
    t2 = high_resolution_clock::now();
    for (int i = 0; i < size; ++i) {
        result[i] = array1[i] % array2[i];
    }
    t3 = high_resolution_clock::now();
    operation_duration = duration_cast<milliseconds>(t3 - t2);
    std::cout << "Int operation : Modulo completed in " << to_string(operation_duration.count()) << " ms \n";
}
void test_string_vectorization(int size) {
    string randStr1 = generateRandomString(size);
    string randStr2 = generateRandomString(size);
    vector<bool> areSame(size);
    auto t1 = high_resolution_clock::now();
    for (int i = 0; i < size; ++i) {
        areSame[i] = randStr1[i] == randStr2[i];
    }
    auto t2 = high_resolution_clock::now();
    auto operation_duration = duration_cast<milliseconds>(t2 - t1);
    cout << "String operation completed in " << to_string(operation_duration.count()) << " ms \n";
}
int main() {
    int32_t size = 409600000;
    test_int_vectorization(size);
    test_string_vectorization(size);
    return 0;
}
These are the numbers I get when building the binaries for SSE and for AVX:
// AVX -
Time in populating arrays = 1586 ms
Int operation : Add completed in 976 ms
Int operation : Multiply completed in 447 ms
Int operation : Division completed in 963 ms
Int operation : Modulo completed in 961 ms
String operation completed in 558 ms
// SSE -
Time in populating arrays = 1539 ms
Int operation : Add completed in 947 ms
Int operation : Multiply completed in 449 ms
Int operation : Division completed in 959 ms
Int operation : Modulo completed in 958 ms
String operation completed in 517 ms
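Which compiler is this? If it is gcc, it will not auto-vectorize below -O3 unless you add -ftree-vectorize (GCC 12 and later do enable a limited form of vectorization at -O2). -mavx2 and -msse (as well as -march=...) do not vectorize anything by themselves; they only allow the compiler to use the respective instructions.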
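A quick way to confirm that the instruction-set flags reached the compiler is to look at the macros GCC and clang predefine for them. The following is a minimal sketch; the function name `enabled_simd_sets` is made up for illustration. It only shows which instruction sets the compiler was allowed to use, not whether any loop was actually vectorized.

```cpp
#include <string>

// Hypothetical helper: lists the SIMD instruction sets that were enabled for
// this translation unit. GCC and clang predefine these macros according to
// flags such as -msse2, -mavx2 or -march=...
std::string enabled_simd_sets() {
    std::string sets;
#ifdef __SSE2__
    sets += "SSE2 ";
#endif
#ifdef __AVX__
    sets += "AVX ";
#endif
#ifdef __AVX2__
    sets += "AVX2 ";
#endif
#ifdef __AVX512F__
    sets += "AVX512F ";
#endif
    if (sets.empty()) {
        sets = "none";
    }
    return sets;
}
```

Printing the result from `main` in both builds should show the difference between the -msse and -mavx2 configurations, even when the timings are identical.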
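Depending on the CPU, it may also help to align the arrays to 32 bytes or, better, to cache-line boundaries (64 bytes) and to tell the compiler about it.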
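One possible way to do this, sketched under a few assumptions: a 64-byte cache line, a C++17 standard library that provides `std::aligned_alloc`, and a GCC or clang compiler for the `__builtin_assume_aligned` builtin. The helper names `make_aligned_u64` and `add_aligned` are hypothetical and not part of the question's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

// Memory from std::aligned_alloc must be released with std::free.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};
using AlignedU64 = std::unique_ptr<std::uint64_t[], FreeDeleter>;

// Hypothetical helper: allocates `count` uint64_t values on a 64-byte boundary.
AlignedU64 make_aligned_u64(std::size_t count) {
    std::size_t bytes = count * sizeof(std::uint64_t);
    // std::aligned_alloc requires the size to be a multiple of the alignment.
    bytes = (bytes + kCacheLine - 1) / kCacheLine * kCacheLine;
    return AlignedU64(
        static_cast<std::uint64_t*>(std::aligned_alloc(kCacheLine, bytes)));
}

// Element-wise addition; the builtin promises the compiler that every pointer
// is 64-byte aligned (GCC/clang extension, undefined behavior if untrue).
void add_aligned(const std::uint64_t* a, const std::uint64_t* b,
                 std::uint64_t* out, std::size_t n) {
    const auto* pa = static_cast<const std::uint64_t*>(
        __builtin_assume_aligned(a, kCacheLine));
    const auto* pb = static_cast<const std::uint64_t*>(
        __builtin_assume_aligned(b, kCacheLine));
    auto* po = static_cast<std::uint64_t*>(
        __builtin_assume_aligned(out, kCacheLine));
    for (std::size_t i = 0; i < n; ++i) {
        po[i] = pa[i] + pb[i];
    }
}
```

In the question's code this would replace the three `new uint64_t[size]` allocations; whether it changes the timings depends on the CPU and on whether the loop is vectorized in the first place.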
For compilers that support it (e.g. clang, gcc, Intel), a simple way to tell the compiler to try to vectorize a loop, and to add alignment hints, is #pragma omp simd.
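As a rough sketch of what that could look like for the addition loop: the function name `add_simd` is made up, and the `aligned` clause assumes the caller really passes 64-byte-aligned, non-overlapping arrays. With GCC and clang the pragma only takes effect when the code is compiled with -fopenmp or -fopenmp-simd; otherwise it is ignored.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: a, b and out are 64-byte aligned and do not overlap.
// Build with -fopenmp-simd (or -fopenmp) so that the pragma is honored.
void add_simd(const std::uint64_t* a, const std::uint64_t* b,
              std::uint64_t* out, std::size_t n) {
    #pragma omp simd aligned(a, b, out : 64)
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The pragma is a request rather than a guarantee, so it is still worth checking the compiler's vectorization report afterwards.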
To find out what the compiler did and did not vectorize, the corresponding reports are useful (-fopt-info-vec for GCC).
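To make such a report easy to read, it can help to isolate each loop in its own small function. The sketch below assumes a file called kernels.cpp (a hypothetical name) and uses the `__restrict` extension to tell the compiler the arrays do not overlap. Since SSE and AVX have no integer division instruction, the division loop may well appear among the missed loops, but that is something to verify in the report rather than take for granted.

```cpp
#include <cstddef>
#include <cstdint>

// Possible GCC invocations (kernels.cpp is a hypothetical file name):
//   g++ -O3 -mavx2 -fopt-info-vec        -c kernels.cpp   // vectorized loops
//   g++ -O3 -mavx2 -fopt-info-vec-missed -c kernels.cpp   // loops left scalar

// Element-wise 64-bit addition, a typical candidate for vectorization.
void add_u64(const std::uint64_t* __restrict a,
             const std::uint64_t* __restrict b,
             std::uint64_t* __restrict out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Element-wise 64-bit division; every b[i] must be non-zero.
void div_u64(const std::uint64_t* __restrict a,
             const std::uint64_t* __restrict b,
             std::uint64_t* __restrict out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] / b[i];
    }
}
```

Comparing the reports of the -msse and -mavx2 builds shows, loop by loop, whether the wider registers were used at all.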