I'm working with the matrix-multiplication implementations in the llm.c project, specifically from this file. Two implementations are provided:

- `matmul_forward_cpu`: a basic matrix-multiplication loop.
- `matmul_forward_ngc92`: an optimized version using loop unrolling and caching.

I compiled these with `gcc` and the `-Ofast` flag, and the performance improvement is impressive. The results:

- `matmul_forward_cpu`: 5.5 s
- `matmul_forward_ngc92`: 2.8 s

Without `-Ofast`, performance is far worse (over 2 minutes), but with `-Ofast` it's clear the compiler is applying heavy optimization.
**What I've tried:** I tried hand-writing AVX and NEON intrinsics to beat the `matmul_forward_ngc92` implementation, but the `-Ofast` flag already seems to apply SIMD optimizations, and I couldn't improve on the 2.8 s result.

**My question:** Does anyone have further suggestions for optimizing the `matmul_forward_ngc92` function, or are there specific CPU architecture features or other techniques that could push performance further? My goal is to beat the 2.8-second mark.
One idea I've had is using OpenMP to take advantage of multiple cores.

The code is below (from llm.c):
```c
void matmul_forward_cpu(float* out,
                        const float* inp, const float* weight, const float* bias,
                        int B, int T, int C, int OC) {
    // OC is short for "output channels"
    // inp is (B,T,C), weight is (OC, C), bias is (OC)
    // out will be (B,T,OC)
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + b * T * OC + t * OC;
            const float* inp_bt = inp + b * T * C + t * C;
            for (int o = 0; o < OC; o++) {
                float val = (bias != NULL) ? bias[o] : 0.0f;
                const float* wrow = weight + o*C;
                for (int i = 0; i < C; i++) {
                    val += inp_bt[i] * wrow[i];
                }
                out_bt[o] = val;
            }
        }
    }
}
```
```c
void matmul_forward_ngc92(float* out,
                          const float* inp, const float* weight, const float* bias,
                          int B, int T, int C, int OC) {
    // most of the running time is spent here and in matmul_backward
    // OC is short for "output channels"
    // inp is (B,T,C), weight is (OC, C), bias is (OC)
    // out will be (B,T,OC)

    // make sure the tiled loop will be correct, otherwise, fallback to slow version
    #define LOOP_UNROLL 8
    if (B * T % LOOP_UNROLL != 0) {
        printf("MUST BE A MULTIPLE OF 8"); // FIXME
        return;
    }

    // collapse the B and T loops into one and turn it into a strided loop.
    // then we can tile the inner loop, and reuse the loaded weight LOOP_UNROLL many times
    // for significant speed-ups.
    for (int obt = 0; obt < B * T; obt += LOOP_UNROLL) {
        for (int o = 0; o < OC; o++) {
            // keep LOOP_UNROLL many results in register, initialized by the bias term.
            float result[LOOP_UNROLL];
            for (int ibt = 0; ibt < LOOP_UNROLL; ++ibt) {
                result[ibt] = (bias != NULL) ? bias[o] : 0.0f;
            }
            // inner loops. Because we do LOOP_UNROLL steps of inner bt, we can cache
            // the value of weight[i + o * C] and reuse it.
            // we compile with -Ofast, so the compiler will turn the inner loop into a bunch of FMAs
            for (int i = 0; i < C; i++) {
                float w = weight[i + o * C];
                for (int ibt = 0; ibt < LOOP_UNROLL; ++ibt) {
                    int bt = obt + ibt;
                    result[ibt] += inp[bt * C + i] * w;
                }
            }
            // write back results to main memory
            for (int ibt = 0; ibt < LOOP_UNROLL; ++ibt) {
                int bt = obt + ibt;
                out[bt * OC + o] = result[ibt];
            }
        }
    }
}
```
Any guidance on squeezing out more performance would be much appreciated!
Use `restrict`.

Given a function like this:

```c
void matmul_forward_cpu(float* out,
    const float* inp, const float* weight, const float* bias,
    int B, int T, int C, int OC) {
...
```

the compiler is not allowed to assume that the data pointed to by `out` does not overlap the data pointed to by `inp`, `weight`, or `bias`. That restriction blocks certain optimizations; see the `restrict` keyword.

If the data behind these pointers never overlap, tell the compiler so it can apply additional optimizations:
```c
void matmul_forward_cpu_alt(float* restrict out,
    const float* restrict inp, const float* restrict weight, const float* restrict bias,
    int B, int T, int C, int OC) {
...
```
A deep understanding of `restrict` is not trivial, and compilers don't always make good use of it. Good luck.