AVX gives no performance improvement over SSE


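I am trying to take advantage of my processor's SIMD capabilities. However, for code that should vectorize, I see no improvement when the binary is compiled for AVX (cmake flag -mavx2) compared with a binary compiled for SSE (cmake flag -msse).

I expected AVX to be roughly 1.5x faster, but that is clearly not the case. Can anyone help me understand why?

Below is the POC code I am testing: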

#include <iostream>
#include <chrono>
#include <string>
#include <algorithm>
#include <random>
#include <vector>
#include <immintrin.h>
#include <cstdint>

using namespace std;
using namespace chrono;

std::string generateRandomString(size_t length) {
    // Define the characters to be used in the random string
    const std::string characters =
            "0123456789"
            "abcdefghijklmnopqrstuvwxyz"
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "!@#$%^&*()_+[]{}|;:,.<>?";

    // Initialize random number generator
    std::random_device rd;
    std::mt19937 generator(rd());
    std::uniform_int_distribution<> distribution(0, characters.size() - 1);

    // Generate the random string
    std::string randomString;
    randomString.reserve(length); // Reserve space to avoid multiple allocations

    for (size_t i = 0; i < length; ++i) {
        randomString += characters[distribution(generator)];
    }

    return randomString;
}

void test_int_vectorization(int32_t size) {
    uint64_t* array1 = new uint64_t[size];
    uint64_t* array2 = new uint64_t[size];
    uint64_t* result = new uint64_t[size];

    auto t1 = high_resolution_clock::now();

    for (int i = 0; i < size; ++i) {
        array1[i] = 1000 + i;
        array2[i] = 500 + i;
    }

    auto t2 = high_resolution_clock::now();
    auto populationTime = duration_cast<milliseconds>(t2 - t1);
    cout << "Time in populating arrays = " << to_string(populationTime.count()) << " ms \n";

    t2 = high_resolution_clock::now();
    for (int i = 0; i < size; ++i) {
        result[i] = array1[i] + array2[i];
    }

    auto t3 = high_resolution_clock::now();
    auto operation_duration = duration_cast<milliseconds>(t3 - t2);
    std::cout << "Int operation : Add completed in " << to_string(operation_duration.count()) << " ms \n";

    t2 = high_resolution_clock::now();
    for (int i = 0; i < size; ++i) {
        result[i] = array1[i] * array2[i];
    }

    t3 = high_resolution_clock::now();
    operation_duration = duration_cast<milliseconds>(t3 - t2);
    std::cout << "Int operation : Multiply completed in " << to_string(operation_duration.count()) << " ms \n";

    t2 = high_resolution_clock::now();
    for (int i = 0; i < size; ++i) {
        result[i] = array1[i] / array2[i];
    }

    t3 = high_resolution_clock::now();
    operation_duration = duration_cast<milliseconds>(t3 - t2);
    std::cout << "Int operation : Division completed in " << to_string(operation_duration.count()) << " ms \n";

    t2 = high_resolution_clock::now();
    for (int i = 0; i < size; ++i) {
        result[i] = array1[i] % array2[i];
    }

    t3 = high_resolution_clock::now();
    operation_duration = duration_cast<milliseconds>(t3 - t2);
    std::cout << "Int operation : Modulo completed in " << to_string(operation_duration.count()) << " ms \n";

    // Release the benchmark buffers.
    delete[] array1;
    delete[] array2;
    delete[] result;
}

void test_string_vectorization(int size) {
    string randStr1 = generateRandomString(size);
    string randStr2 = generateRandomString(size);

    vector<bool> areSame(size);

    auto t1 = high_resolution_clock::now();
    for (int i = 0; i < size; ++i) {
        areSame[i] = randStr1[i] == randStr2[i];
    }
    auto t2 = high_resolution_clock::now();

    auto operation_duration = duration_cast<milliseconds>(t2 - t1);
    cout << "String operation completed in " << to_string(operation_duration.count()) << " ms \n";
}

int main() {

    int32_t size = 409600000;
    test_int_vectorization(size);
    test_string_vectorization(size);

    return 0;
}

These are the numbers I get when building binaries for SSE and for AVX:

// AVX - 

Time in populating arrays = 1586 ms 
Int operation : Add completed in 976 ms 
Int operation : Multiply completed in 447 ms 
Int operation : Division completed in 963 ms 
Int operation : Modulo completed in 961 ms 
String operation completed in 558 ms 

// SSE - 
Time in populating arrays = 1539 ms 
Int operation : Add completed in 947 ms 
Int operation : Multiply completed in 449 ms 
Int operation : Division completed in 959 ms 
Int operation : Modulo completed in 958 ms 
String operation completed in 517 ms 
1 Answer

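Which compiler is this? If it is gcc, it does not auto-vectorize below -O3 unless you add -ftree-vectorize (GCC 12 and later also vectorize at -O2, but with a very conservative cost model). -mavx2 and -msse (and -march=...) do not cause vectorization by themselves; they only allow the compiler to use the respective instructions.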
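To check what the flags actually changed in a given build, one option is to look at the macros the compiler predefines. The sketch below assumes GCC or Clang on x86; the function names are made up for illustration. Note that on x86-64 these compilers already enable SSE and SSE2 by default, so a "-msse" build there is effectively an SSE2 build.

```cpp
#include <string>

// Reports the widest x86 SIMD extension this translation unit was compiled for,
// based on macros that GCC and Clang predefine from -msse*, -mavx* or -march=...
// (hypothetical helper, not part of the question's code).
std::string compiled_simd_level() {
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "none (or not an x86 target)";
#endif
}

// GCC and Clang define __OPTIMIZE__ at -O1 and above. Without optimization,
// the instruction-set flags alone will not make a plain loop use SIMD.
bool compiled_with_optimization() {
#if defined(__OPTIMIZE__)
    return true;
#else
    return false;
#endif
}
```

Printing both values at the start of the benchmark makes it easy to confirm that the two binaries really differ in instruction set and that both were built with optimization enabled.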

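Depending on the CPU, it may also help to align the arrays to 32 bytes or, better, to a cache-line boundary (64 bytes), and to tell the compiler about that alignment.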
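A minimal sketch of that idea, assuming C++17, a 64-byte cache line, and GCC or Clang for the alignment hint. The names make_aligned, add_aligned and kAlign are illustrative, not from the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Deleter so std::unique_ptr releases memory that came from std::aligned_alloc.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};
using AlignedU64 = std::unique_ptr<std::uint64_t[], FreeDeleter>;

constexpr std::size_t kAlign = 64;  // assumed cache-line size

// Allocates room for `count` uint64_t values starting on a 64-byte boundary.
AlignedU64 make_aligned(std::size_t count) {
    // std::aligned_alloc wants the byte size to be a multiple of the alignment.
    std::size_t bytes = count * sizeof(std::uint64_t);
    bytes = (bytes + kAlign - 1) / kAlign * kAlign;
    return AlignedU64(static_cast<std::uint64_t*>(std::aligned_alloc(kAlign, bytes)));
}

// Element-wise sum. The caller promises that all three pointers are 64-byte aligned.
void add_aligned(const std::uint64_t* a, const std::uint64_t* b,
                 std::uint64_t* out, std::size_t n) {
#if defined(__GNUC__)
    // GCC/Clang builtin: tells the optimizer it may assume this alignment.
    a = static_cast<const std::uint64_t*>(__builtin_assume_aligned(a, kAlign));
    b = static_cast<const std::uint64_t*>(__builtin_assume_aligned(b, kAlign));
    out = static_cast<std::uint64_t*>(__builtin_assume_aligned(out, kAlign));
#endif
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Passing a pointer that is not actually aligned to add_aligned would be undefined behavior, so the hint should only be used together with an allocator such as make_aligned. Whether alignment helps measurably depends on the CPU and on whether the loop is limited by memory bandwidth.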

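For compilers that support it (e.g. clang, gcc, Intel), a simple way to tell the compiler to try to vectorize a loop, and to add alignment hints, is the following pragma (on gcc and clang it only takes effect with -fopenmp or -fopenmp-simd):

#pragma omp simd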

To find out what the compiler did and did not vectorize, the corresponding reports are useful (-fopt-info-vec for GCC).
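A small file like the one below can be used to try the report. The file name report_demo.cpp and the function names are hypothetical. The exact wording of the report varies between GCC versions, so no output is shown here.

```cpp
#include <cstddef>
#include <cstdint>

// Example build (GCC), reporting vectorized loops and missed ones:
//   g++ -O3 -mavx2 -fopt-info-vec -fopt-info-vec-missed -c report_demo.cpp

// A simple add loop: a typical candidate for auto-vectorization.
void add_loop(const std::uint64_t* a, const std::uint64_t* b,
              std::uint64_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Integer division: SSE and AVX2 have no packed integer division instruction,
// so this loop may well show up in the "missed" part of the report.
// Every b[i] must be non-zero.
void div_loop(const std::uint64_t* a, const std::uint64_t* b,
              std::uint64_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] / b[i];
    }
}
```

Comparing the report for the -msse and -mavx2 builds shows which loops actually changed between the two binaries, which is more reliable than inferring it from timings alone.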
