Apple 的调度与 OpenMP 在使用 M3Pro 的 Apple MacBook Pro 上并行化 for 循环

Question

我正在用

编写一个程序，它接受大小为 2^N 的数组，并在索引二进制表示中的指定位置交换索引相差一位的条目。对代码进行分析，我注意到默认情况下它只部署一个线程，这使得它非常慢。我尝试使用 OpenMP 和 Apple 的 Dispatch（以前称为 Grand Central Dispatch）来并行化我的代码，但我对结果有点失望。有没有对这些框架更有经验的人可以帮助我？我应该研究金属吗？

我对代码并行化版本的想法是迭代直到 2^N-1-1 的所有整数，在右侧的指定位置插入一个零，并与索引距 2^pos 的条目交换。我使用 OpenMP 实现的版本如下所示：

void swapEntries(double complex* array, uint64_t N, uint64_t pos) {
    uint64_t swapDistance = 1 << pos        // 2^pos is the distance between entries to swap
    uint64_t indices = 1 << (N - 1)         // 2^(N-1) is the max integer missing a zero at pos

#pragma omp parallel for default(none) shared(array, indices, pos, swapDistance)
    {
        for (uint64_t i = 0; i < indices; ++i) {
            uint64_t left = (i & (~0U << pos)) << 1;    // Copy the bits left to pos and shift
                                                        // them one position to the left
            uint64_t right = (i & ~(~0U << pos));       // Copy the bits right of pos
            i = (left | right);                         // Merge the two integers (now with a
                                                        // zero at pos)

            double complex tmp = *(array + i);              // Swap the entry with the one
            *(array + i) = *(array + (i + swapDistance));   // swapDistance away
            *(array + (i + swapDistance)) = tmp;
        }
    };
}

在 N=30 的实例上，使用带有

-O2

标志的 Apple clang，它的运行速度几乎是其串行版本的两倍。考虑到我的配备 M3Pro 的 MacBook M3 有 11 个核心（即使其中 7 个是高效核心），我发现这一点相当发人深省。在下一步中，我使用 wikipedia 中给出的简单循环转换尝试了 Apple 的 Dispatch 库，并且在 docs 中也提到了。

void swapEntries(double complex* array, uint64_t N, uint64_t pos) {
    uint64_t swapDistance = 1 << pos        // 2^pos is the distance between entries to swap
    uint64_t indices = 1 << (N - 1)         // 2^(N-1) is the max integer missing a zero at pos

    /* Get a concurrent dispatch queue */
    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0)

    dispatch_apply(indices, queue, ^(size_t) {
        uint64_t left = (i & (~0U << pos)) << 1;    // Copy the bits left to pos and shift them
                                                    // one position to the left
        uint64_t right = (i & ~(~0U << pos));       // Copy the bits right of pos
        i = (left | right);                         // Merge the two integers (now with a
                                                        // zero at pos)

        double complex tmp = *(array + i);              // Swap the entry with the one
        *(array + i) = *(array + (i + swapDistance));   // swapDistance away
        *(array + (i + swapDistance)) = tmp;
    });
}

这两个函数都完全按照其预期执行；我有单元测试来检查这一点。然而，Dispatch 实现明显比串行版本差，我很确定这是我的错。你能帮我吗？

Answer 1

为了回答一般性能问题，在 GCD 中，我们通常通过“跨步”来解决这个问题，即每次迭代

dispatch_apply

（在 Swift 中称为

concurrentPerform

）做更多的工作。目前，

dispatch_apply

的每次迭代执行的工作量可以忽略不计。而且，特别是在 GCD 中，如果每次迭代的工作量很少，那么适度的线程开销可能会超过多线程带来的任何性能优势。

例如，考虑 30 的 N。这将导致 ^{dispatch_apply} 示例中的 2

N-1

上下文切换。是的，

dispatch_apply

会减轻线程爆炸的风险，但它不会为你“分块”工作。

有关“跨步”的更多信息，请参阅 Apple 并发编程指南的“循环代码的改进”部分。

Apple 的调度与 OpenMP 在使用 M3Pro 的 Apple MacBook Pro 上并行化 for 循环

问题描述投票：0回答：1

1个回答

最新问题

Apple 的调度与 OpenMP 在使用 M3Pro 的 Apple MacBook Pro 上并行化 for 循环

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1