How to generate non-temporal instructions?

Question

Intel's compiler has a pragma that can be used to generate non-temporal stores. For example, I can write

void square(const double* x, double* y, int n) {
#pragma vector nontemporal
  for (int i=0; i<n; ++i) {
    y[i] = x[i] * x[i];
  }
}

and ICC will generate instructions like this (compiler-explorer):

...
  vmovntpd %ymm1, (%rsi,%r9,8) #4.5
...

Do gcc and clang have anything similar (other than intrinsics)?

The non-temporal store makes the code much faster. Using this benchmark

#include <random>
#include <memory>

#include <benchmark/benchmark.h>

static void generate_random_numbers(double* x, int n) {
  std::mt19937 rng{0};
  std::uniform_real_distribution<double> dist{-1, 1};
  for (int i=0; i<n; ++i) {
    x[i] = dist(rng);
  }
}

static void square(const double* x, double* y, int n) {
#ifdef __INTEL_COMPILER
#pragma vector nontemporal
#endif
  for (int i=0; i<n; ++i) {
    y[i] = x[i] * x[i];
  }
}

static void BM_Square(benchmark::State& state) {
  const int n = state.range(0);
  std::unique_ptr<double[]> xptr{new double[n]};
  generate_random_numbers(xptr.get(), n);
  for (auto _ : state) {
    std::unique_ptr<double[]> yptr{new double[n]};
    square(xptr.get(), yptr.get(), n);
    benchmark::DoNotOptimize(yptr);
  }
}

BENCHMARK(BM_Square)->Arg(1000000);

BENCHMARK_MAIN();

the non-temporal code runs almost twice as fast on my machine. Here are the full results:

ICC:

> icc -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
> ./a.out
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
BM_Square/1000000     430889 ns       430889 ns         1372

Clang:

> clang++ -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
> ./a.out
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
BM_Square/1000000     781672 ns       781470 ns          820

GCC:

> g++-mp-10 -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
> ./a.out
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
BM_Square/1000000     681684 ns       681533 ns          782

Note: clang has __builtin_nontemporal_store, but when I tried it, it did not generate non-temporal instructions (compiler-explorer).

Answer 1

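I'm really surprised that ICC delivers this kind of speedup on such simple code. In the past, non-temporal stores improved bandwidth performance by only a few percent.

Maybe things have changed (but then I'm surprised again that clang and gcc have not done anything about it).

In any case, you can generate these instructions with intrinsics.

Here is an example (I did not implement the scalar logic for the trailing elements past the last multiple of 8):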

#include <immintrin.h>

void square_elements(
    const double* __restrict const x,
    double* __restrict const y,
    const int n)
{
  // 64-byte alignment should be enforced earlier by an aligned allocation.
  const double* ax = (const double*) __builtin_assume_aligned(x, 64);
  double* ay = (double*) __builtin_assume_aligned(y, 64);
  // Process full blocks of 8 doubles only; the scalar tail is not handled.
  for (int i = 0; i + 8 <= n; i += 8) {
    __m512d xi = _mm512_load_pd(ax + i);
    __m512d mul = _mm512_mul_pd(xi, xi);
    _mm512_stream_pd(ay + i, mul);  // non-temporal 512-bit store
  }
}

See the Godbolt link for the generated assembly.
