My setup: a Linux system with 12 cores, where cores 2-11 are isolated. Cores 0 and 1 are almost 100% busy running other programs; all remaining cores are idle.
export GOMP_CPU_AFFINITY=2,3,4
export PARALLEL_ENSEMBLE_THREADS=3
taskset -c $GOMP_CPU_AFFINITY perf stat -d ./test_openmp
The output is:
Performance counter stats for './test_openmp':
47,654.74 msec task-clock:u # 2.981 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
115,358 page-faults:u # 2.421 K/sec
159,245,881,934 cycles:u # 3.342 GHz
250,009,309,156 instructions:u # 1.57 insn per cycle
20,002,132,172 branches:u # 419.730 M/sec
117,268 branch-misses:u # 0.00% of all branches
110,002,614,320 L1-dcache-loads:u # 2.308 G/sec
10,796,435,741 L1-dcache-load-misses:u # 9.81% of all L1-dcache accesses
0 LLC-loads:u # 0.000 /sec
0 LLC-load-misses:u # 0.00% of all LL-cache accesses
15.986638336 seconds time elapsed
47.175831000 seconds user
0.414928000 seconds sys
export GOMP_CPU_AFFINITY=1,2,3,4
export PARALLEL_ENSEMBLE_THREADS=4
taskset -c $GOMP_CPU_AFFINITY perf stat -d ./test_openmp
The output is:
pid: 4118342
Performance counter stats for './test_openmp':
48,241.03 msec task-clock:u # 1.072 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
119,879 page-faults:u # 2.485 K/sec
161,605,704,451 cycles:u # 3.350 GHz
250,011,376,400 instructions:u # 1.55 insn per cycle
20,002,726,448 branches:u # 414.641 M/sec
118,657 branch-misses:u # 0.00% of all branches
110,002,938,510 L1-dcache-loads:u # 2.280 G/sec
10,796,444,713 L1-dcache-load-misses:u # 9.81% of all L1-dcache accesses
0 LLC-loads:u # 0.000 /sec
0 LLC-load-misses:u # 0.00% of all LL-cache accesses
45.012033357 seconds time elapsed
47.764469000 seconds user
0.399934000 seconds sys
My question: why, when I gave the program one more core (core 1) the second time, did the runtime get longer (15.98 s vs 45.01 s) and the CPU utilization drop (2.98 vs 1.07 CPUs utilized)?
Here is the test code I ran.
#include <iostream>
#include <cstdint>
#include <unistd.h>

constexpr int64_t N = 100000;
int m = N;
int n = N;

int main() {
    double* a = new double[N];      // result vector
    double* c = new double[N];      // input vector
    double* b = new double[N * N];  // N x N matrix, accessed column by column
    std::cout << "pid: " << getpid() << std::endl;
    // Matrix-vector product; no schedule clause, so the schedule is implementation-defined
    #pragma omp parallel for default(none) shared(m, n, a, b, c)
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += b[i + j * N] * c[j];
        a[i] = sum;
    }
    return 0;
}
When you don't specify a schedule for a worksharing loop, the schedule is implementation defined. Most implementations pick a static schedule, because it has the lowest runtime overhead for most workloads. A static schedule assigns the same number of iterations to each thread.
In your case, that static split is exactly the problem: the thread pinned to the busy core 1 gets a full quarter of the iterations but only a small slice of that core's time, so the other three threads finish early and sit idle waiting at the barrier. You specifically want to allow OpenMP to distribute the work unevenly among the threads. Try adding
schedule(dynamic)
to the parallel for directive.
You can also choose
schedule(runtime)
and control the schedule per execution through the OMP_SCHEDULE environment variable.