I'm working with the matrix-multiplication implementations in the llm.c project, specifically from this file. Two implementations are provided:

- `matmul_forward_cpu`: a basic matrix-multiplication loop.
- `matmul_forward_ngc92`: an optimized version using loop unrolling and caching.

I compiled these with `gcc` and the `-Ofast` flag, and the performance improvement is impressive. The results:

- `matmul_forward_cpu`: 5.5 s
- `matmul_forward_ngc92`: 2.8 s

Without `-Ofast`, performance is far worse (over 2 minutes), but with `-Ofast` it's clear the compiler is applying heavy optimization.
**What I've tried:** I tried hand-writing AVX and NEON intrinsics to beat the `matmul_forward_ngc92` implementation, but the `-Ofast` flag already seems to apply SIMD optimizations, and I couldn't improve on the 2.8 s result.

**My question:** Does anyone have further suggestions for optimizing the `matmul_forward_ngc92` function, or are there specific CPU architecture features or other techniques that could push performance further? My goal is to beat the 2.8-second mark.
One idea I've had is using OpenMP to take advantage of multiple cores.

The code is below (from llm.c):
```c
void matmul_forward_cpu(float* out,
                        const float* inp, const float* weight, const float* bias,
                        int B, int T, int C, int OC) {
    // OC is short for "output channels"
    // inp is (B,T,C), weight is (OC, C), bias is (OC)
    // out will be (B,T,OC)
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + b * T * OC + t * OC;
            const float* inp_bt = inp + b * T * C + t * C;
            for (int o = 0; o < OC; o++) {
                float val = (bias != NULL) ? bias[o] : 0.0f;
                const float* wrow = weight + o*C;
                for (int i = 0; i < C; i++) {
                    val += inp_bt[i] * wrow[i];
                }
                out_bt[o] = val;
            }
        }
    }
}
```
```c
void matmul_forward_ngc92(float* out,
                          const float* inp, const float* weight, const float* bias,
                          int B, int T, int C, int OC) {
    // most of the running time is spent here and in matmul_backward
    // OC is short for "output channels"
    // inp is (B,T,C), weight is (OC, C), bias is (OC)
    // out will be (B,T,OC)

    // make sure the tiled loop will be correct, otherwise, fallback to slow version
    #define LOOP_UNROLL 8
    if (B * T % LOOP_UNROLL != 0) {
        printf("MUST BE A MULTIPLE OF 8"); // FIXME
        return;
    }

    // collapse the B and T loops into one and turn it into a strided loop.
    // then we can tile the inner loop, and reuse the loaded weight LOOP_UNROLL many times
    // for significant speed-ups.
    for (int obt = 0; obt < B * T; obt += LOOP_UNROLL) {
        for (int o = 0; o < OC; o++) {
            // keep LOOP_UNROLL many results in register, initialized by the bias term.
            float result[LOOP_UNROLL];
            for (int ibt = 0; ibt < LOOP_UNROLL; ++ibt) {
                result[ibt] = (bias != NULL) ? bias[o] : 0.0f;
            }
            // inner loops. Because we do LOOP_UNROLL steps of inner bt, we can cache
            // the value of weight[i + o * C] and reuse it.
            // we compile with -Ofast, so the compiler will turn the inner loop into a bunch of FMAs
            for (int i = 0; i < C; i++) {
                float w = weight[i + o * C];
                for (int ibt = 0; ibt < LOOP_UNROLL; ++ibt) {
                    int bt = obt + ibt;
                    result[ibt] += inp[bt * C + i] * w;
                }
            }
            // write back results to main memory
            for (int ibt = 0; ibt < LOOP_UNROLL; ++ibt) {
                int bt = obt + ibt;
                out[bt * OC + o] = result[ibt];
            }
        }
    }
}
```
Any guidance on squeezing out more performance would be much appreciated!
Use `restrict`.

Given a function like this:

```c
void matmul_forward_cpu(float* out,
    const float* inp, const float* weight, const float* bias,
    int B, int T, int C, int OC) {
...
```

the compiler is not allowed to assume that the data pointed to by `out` does not overlap the data pointed to by `inp`, `weight`, or `bias`. That restriction blocks certain optimizations; see the `restrict` keyword.

If the data behind these pointers never overlap, tell the compiler so it can apply additional optimizations:
```c
void matmul_forward_cpu_alt(float* restrict out,
    const float* restrict inp, const float* restrict weight, const float* restrict bias,
    int B, int T, int C, int OC) {
...
```
A deep understanding of `restrict` is not trivial, and compilers don't always make good use of it. Good luck.