重复短时间相同的并行作业

Question

我有一个算法，需要多次并行重新运行相同的代码。代码很短，不到一微秒即可完成。这将运行数百万次，这在使用线程时会产生一些问题。我正在尝试使用 POSIX 线程，但由于创建新线程的开销不是一个选项，所以我需要使用某种不断运行的旋转工作线程。

下面是一个简化的案例，说明了我如何尝试运行两个同样长的作品，calc_x 和 calc_y。我需要能够并行运行它们，以便总时间就像只运行一个（或非常接近）。

#include <pthread.h>
#include <time.h>
#include <stdatomic.h>
#include <stdio.h>

static int nloops = 100;
static int x;
static int y;
static pthread_t tid;
static atomic_bool trun;
static atomic_bool texit;
static int cnt = 0;
static const int MILLION = 1000000;


void calc_x() {
    x = 0;
    for (int i = 0; i < nloops; ++i) {
        x += i;
    }
}

void calc_y() {
    y = 0;
    for (int i = 0; i < nloops; ++i) {
        y += i;
    }
}

void *worker() {
    while (1) {
        if (trun) {
            calc_x();
            trun = false;
        }
        if (texit) {
            break;
        }
    }
    return NULL;
}

float run_x() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < MILLION; ++i) {
        calc_x();
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec)*0.000000001;
}

float run_y() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < MILLION; ++i) {
        calc_y();
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec)*0.000000001;
}

float run_xy() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < MILLION; ++i) {
        calc_x();
        calc_y();
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec)*0.000000001;
}

float run_xy_parallel() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < MILLION; ++i) {
        trun = true;
        calc_y();
        while(trun) ++cnt;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec)*0.000000001;
}

int main() {
    trun = false;
    texit = false;
    pthread_create(&tid, NULL, &worker, NULL);
    printf("run_x: %.4f s\n", run_x());
    printf("run_y: %.4f s\n", run_y());
    printf("run_xy: %.4f s\n", run_xy());
    printf("run_xy_parallel: %.4f s\n", run_xy_parallel());
    printf("average nr of waiting loops: %.1f\n", (float)cnt/MILLION);
    texit = true;
    pthread_join(tid, NULL);
    return 0;
}

结果是：

run_x: 0.3049 s
run_y: 0.2559 s
run_xy: 0.5074 s
run_xy_parallel: 0.4744 s
average nr of waiting loops: 34.9

如您所见，使用并行工作线程实际上没有任何好处。我用来尝试测量等待所花费的时间的计数 (cnt) 表明 calc_x 的 35/nloops = 35% 时间用于等待其他线程完成，这似乎是正确的，因为有一个run_x 和 run_y 之间的时间差。所以我将 nloops 更改为 1000，看看是否有任何变化。我得到了这些结果：

run_x: 2.4921 s
run_y: 2.4965 s
run_xy: 5.0950 s
run_xy_parallel: 3.3477 s
average nr of waiting loops: 44.8

现在等待计数小于 calc_x 的 5%。但仍然比我预期慢了 1/3。

在我看来，设置 trun 需要很多时间，但是随着 nloops 的增加，时间增加是没有意义的。谁能给我解释一下吗？

Answer 1

不知道为什么你的问题被低估了那么多。我不得不进行这种性能分析，对于某些类型的应用程序（例如金融服务）来说，这既棘手又必要。

我最好的猜测是 NUMA 区域。大多数大小合适的服务器都有多个 CPU 插槽，因此您应该对进程进行任务设置，以确保它将在仅限于单个 NUMA 区域的 CPU 上运行。即使您的服务器只有一个 CPU 插槽，也喜欢跨核心迁移线程，这会妨碍缓存的最佳使用。理想情况下，您将明确地将每个线程固定到选择尽可能接近缓存的 CPU。

我认为“虚假分享”是一个很好的猜测。 X86 处理器都具有 64 字节缓存行，因此您绝对应该在全局变量（或至少是“热门”全局变量）之间添加一个 (64-sizeof(int)) 的字符数组。事实上，我不确定 C 是否能保证全局变量以可预测的方式打包，因此您可能希望将它们放入结构中。你增加 nloops 的想法是个好主意。仅在 100 个循环后同步线程是不够的。

重复短时间相同的并行作业

问题描述投票：0回答：1

1个回答

最新问题

重复短时间相同的并行作业

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1