I translated a parallel C program with OpenMP pragmas into a C program with OpenACC pragmas, removing the original OpenMP pragmas and adding OpenACC directives wherever I thought the code could be parallelized, so that it runs on the GPU. The serial version, without any OpenMP or OpenACC pragmas, works fine, and so does the OpenMP version. When I execute the OpenACC version, built with the same nvc compiler as the other two versions, I run into a strange error.
It is a large machine learning program, so I will only show the functions to which I added OpenACC pragmas:
Function 1:
void function1(double const *x, double *const *W, double *D, int num_out, int num_features, double alpha, int R)
{
    int j, k;
#pragma acc kernels loop
    for (j = 0; j < num_out; j++)
    {
        double sum = 0;
        for (k = 0; k < num_features; k++)
            sum += (W[j][k] - x[k]) * (W[j][k] - x[k]);
        D[j] = sum;
    }
    double d_min = INFINITY;
    int d_min_idx = -1;
#pragma acc parallel loop reduction(min : d_min)
    for (j = 0; j < num_out; j++)
    {
        if (D[j] < d_min)
        {
            d_min = D[j];
            d_min_idx = j;
        }
    }
    int from_node = max(0, d_min_idx - R);
    int to_node = min(num_out, d_min_idx + R + 1);
#pragma acc kernels
    for (j = from_node; j < to_node; j++)
    {
        for (k = 0; k < num_features; k++)
            W[j][k] += alpha * (x[k] - W[j][k]);
    }
}
Function 2:
void function2(double **X, double *const *W, int num_samples, int num_features, int num_out, double alpha_min)
{
    int R = num_out >> 2, iter = 0;
    double alpha = 1.f;
    double *D = (double *)malloc(num_out * sizeof(double));
#pragma acc data copyin(X[0 : num_samples][0 : num_features], W[0 : num_out][0 : num_features]) create(D[0 : num_out])
    {
        for (; alpha > alpha_min; alpha -= 0.01, iter++)
        {
            for (int sample = 0; sample < num_samples; sample++)
            {
                const double *x = X[sample];
                function1(x, W, D, num_out, num_features, alpha, R);
            }
            if (iter % 10 == 0 && R > 1)
                R--;
        }
    }
    free(D);
}
To compile the serial version of the program, without the OpenACC pragmas, I use the following command:
$ nvc -Minfo=all -o program program.c
and get the following results, as expected:
Test 1 completed in 0.01334 sec
Test 2 completed in 0.006111 sec
Test 3 completed in 0.003211 sec
(Note: Calculated times include: creating test sets, training model and writing files to disk.)
I expected to get similar results when compiling and executing the OpenACC version with the -acc flag. However, when I compile the program with the OpenACC pragmas using the following command:
$ nvc -acc=gpu -gpu=cc89 -Minfo=all -o program program.c
and run the executable, I get the following error:
Accelerator Fatal Error: This file was compiled: -acc=gpu -gpu=cc35 -gpu=cc50 -gpu=cc60 -gpu=cc60 -gpu=cc70 -gpu=cc75 -gpu=cc80 -
Rebuild this file with -gpu=cc89 to use NVIDIA Tesla GPU 0
File: /proj/build/23C/Linux_x86_64/rte/accel-uni/build/Linux_x86_64/../../src/cuda_fill.c
Function: __pgi_uacc_cuda_fill:98
Line: 44
Some context information, in case it helps:
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4070 Off | 00000000:01:00.0 On | N/A |
| 0% 29C P8 5W / 200W | 340MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1551 G /usr/lib/xorg/Xorg 106MiB |
| 0 N/A N/A 1804 C+G ...libexec/gnome-remote-desktop-daemon 154MiB |
| 0 N/A N/A 1899 G /usr/bin/gnome-shell 65MiB |
+---------------------------------------------------------------------------------------+
$ nvc --version
nvc 24.1-0 64-bit target on x86-64 Linux -tp znver4
$ nvaccelinfo
CUDA Driver Version: 12020
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.171.04 Tue Mar 19 20:30:00 UTC 2024
Device Number: 0
Device Name: NVIDIA GeForce RTX 4070
Device Revision Number: 8.9
Global Memory Size: 12568887296
Number of Multiprocessors: 46
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 2520 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 10501 MHz
Memory Bus Width: 192 bits
L2 Cache Size: 37748736 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Default Target: cc89
Output of the -Minfo=all flag from the compile command:
65, Loop carried dependence of D-> prevents parallelization
Loop carried backward dependence of D-> prevents vectorization
Complex loop carried dependence of x->,W->-> prevents parallelization
Generating NVIDIA GPU code
65, #pragma acc loop seq
68, #pragma acc loop vector(128) /* threadIdx.x */
Generating implicit reduction(+:sum)
65, Generating implicit copyout(D[:num_out]) [if not already present]
Generating implicit copyin(W[:num_out][:num_features],x[:num_features]) [if not already present]
68, Loop is parallelizable
77, Generating implicit firstprivate(num_out,d_min_idx)
Generating NVIDIA GPU code
77, #pragma acc loop seq
Generating reduction(min:d_min)
77, Generating implicit copy(d_min) [if not already present]
Generating implicit copyin(D[:num_out]) [if not already present]
82, Accelerator restriction: induction variable live-out from loop: d_min_idx
90, Complex loop carried dependence of x->,W->-> prevents parallelization
Accelerator serial kernel generated
Generating NVIDIA GPU code
90, #pragma acc loop seq
92, #pragma acc loop seq
90, Generating implicit copyin(x[:num_features]) [if not already present]
Generating implicit copy(W[.I0000:min(num_out,(R+d_min_idx)+1)-.I0000][:num_features]) [if not already present]
92, Complex loop carried dependence of x->,W->-> prevents parallelization
function2:
104, Generating copyin(X[:num_samples][:num_features],W[:num_out][:num_features]) [if not already present]
Generating create(D[:num_out]) [if not already present]
Edit: added the output of the -Minfo=all flag from the compile command, in case it contains relevant information.
How can I fix this error? Is there something I'm missing or should be looking at?
Thanks in advance for your help!
While this may not fix the runtime error you're seeing, I've updated your code so it will parallelize the loops. You need to add "independent" to the "kernels" regions to tell the compiler to ignore the dependencies. Since C allows pointers of the same type to alias the same data, the compiler must assume that they do (for example, nothing in the signature prevents a caller from passing the same buffer as both x and D), so it cannot auto-parallelize the loops.
Also, as written, "d_min_idx" may give incorrect results. Unfortunately, you can't find both the minimum value and its index in the same parallel loop, so it needs to be split into two loops. I made it find the first (lowest) index whose value matches the minimum, in case multiple indices hold the same minimum value.
#include <math.h>
void function1(double const *x, double *const *W, double *D, int num_out, int num_features, double alpha, int R)
{
    int j, k;
#pragma acc kernels loop independent
    for (j = 0; j < num_out; j++)
    {
        double sum = 0;
#pragma acc loop reduction(+:sum)
        for (k = 0; k < num_features; k++)
            sum += (W[j][k] - x[k]) * (W[j][k] - x[k]);
        D[j] = sum;
    }
    double d_min = INFINITY;
    int d_min_idx = num_out; /* not -1: the min reduction also folds in this initial value */
#pragma acc parallel loop reduction(min : d_min)
    for (j = 0; j < num_out; j++)
    {
        if (D[j] < d_min)
        {
            d_min = D[j];
        }
    }
#pragma acc parallel loop reduction(min : d_min_idx)
    for (j = 0; j < num_out; j++)
    {
        if (D[j] == d_min)
        {
            d_min_idx = j;
        }
    }
    int from_node = max(0, d_min_idx - R);
    int to_node = min(num_out, d_min_idx + R + 1);
#pragma acc kernels loop collapse(2) independent
    for (j = from_node; j < to_node; j++)
    {
        for (k = 0; k < num_features; k++)
            W[j][k] += alpha * (x[k] - W[j][k]);
    }
}
Compiler feedback messages:
% nvc -c -w -acc -Minfo=accel test.c
function1:
8, Loop is parallelizable
Generating NVIDIA GPU code
8, #pragma acc loop gang /* blockIdx.x */
12, #pragma acc loop vector(128) /* threadIdx.x */
Generating reduction(+:sum)
8, Generating implicit copyout(D[:num_out]) [if not already present]
Generating implicit copyin(W[:num_out][:num_features],x[:num_features]) [if not already present]
12, Loop is parallelizable
21, Generating implicit firstprivate(num_out)
Generating NVIDIA GPU code
21, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
Generating reduction(min:d_min)
21, Generating implicit copy(d_min) [if not already present]
Generating implicit copyin(D[:num_out]) [if not already present]
29, Generating implicit firstprivate(d_min,num_out)
Generating NVIDIA GPU code
29, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
Generating reduction(min:d_min_idx)
29, Generating implicit copy(d_min_idx) [if not already present]
Generating implicit copyin(D[:num_out]) [if not already present]
41, Loop is parallelizable
Generating implicit copyin(x[:num_features]) [if not already present]
Generating implicit copy(W[from_node:to_node-from_node][:num_features]) [if not already present]
43, Loop is parallelizable
Generating NVIDIA GPU code
41, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
43, /* blockIdx.x threadIdx.x collapsed */
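As a footnote to the aliasing point above: instead of asserting independence with the "independent" clause, C99's restrict qualifier can make the no-alias guarantee part of the function signature, so the compiler can prove the loops are parallelizable on its own. Here is a minimal sketch of what the distance loop could look like written that way (function1_restrict is a hypothetical name, and this variant is untested against your program):

/* Hypothetical variant: restrict promises the compiler that x and D do not
   alias each other or the weight matrix, so the kernels loop can be
   parallelized without the independent clause. */
void function1_restrict(const double *restrict x, double *const *restrict W,
                        double *restrict D, int num_out, int num_features)
{
    int j, k;
#pragma acc kernels loop
    for (j = 0; j < num_out; j++)
    {
        double sum = 0;
        for (k = 0; k < num_features; k++)
            sum += (W[j][k] - x[k]) * (W[j][k] - x[k]);
        D[j] = sum;
    }
}

One caveat: the row pointers W[j] are themselves plain double *, so depending on how much the compiler can infer, the nested accesses through W may still need "independent".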