MPI_Wait does not release the MPI_Ibcast request


Consider the following program:

#include <iostream>
#include <mpi.h>

int main() {
  int provided = -1;
  MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);

  if (provided != MPI_THREAD_MULTIPLE) {
    return -1;
  }

  int this_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &this_rank);

  double aze[36864]{};
  MPI_Request req = MPI_REQUEST_NULL;

  std::cout << this_rank << " starting bcast" << std::endl;
  MPI_Ibcast(aze, 36864, MPI_DOUBLE, 1, MPI_COMM_WORLD, &req);
  std::cout << this_rank << " req0 " << req << std::endl;

#pragma omp parallel
  {
    MPI_Status stat{};
    // do {
    MPI_Wait(&req, &stat);
    // } while(req != MPI_REQUEST_NULL);

    if (req != MPI_REQUEST_NULL) {
      std::cout << this_rank << " wait returned non null request: " << req
                << " vs " << MPI_REQUEST_NULL << std::endl;

      std::cout << this_rank << " MPI_SOURCE: " << stat.MPI_SOURCE << std::endl;
      std::cout << this_rank << " MPI_TAG: " << stat.MPI_TAG << std::endl;
      std::cout << this_rank << " MPI_ERROR: " << stat.MPI_ERROR << std::endl;
    }
  }

  {
    volatile int dummy = 0;
    while (dummy != 1'000'000'000) {
      dummy++;
    }
    std::cout << this_rank << " sleep done" << std::endl;
  }
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}

I am using OpenMPI 5.0.2, and I have tried recent versions of both clang and gcc. I build and run the reproducer above like this:

$ g++ -fopenmp ~/Downloads/trash/repro.mpi.cc -isystem /usr/include/openmpi-x86_64 -L /usr/lib64/openmpi/lib/ -lmpi
$ export OMP_NUM_THREADS=2
$ mpirun -n 4 ./a.out

The expected standard output is (sorted by rank):

0 starting bcast
0 req0 0x3e573f28
0 sleep done
1 starting bcast
1 req0 0x79d9298
1 sleep done
2 starting bcast
2 req0 0xc4841b8
2 sleep done
3 starting bcast
3 req0 0x2fdf2f18
3 sleep done

Note that the addresses will obviously vary from run to run.

The observed behavior looks like this (again sorted by rank):

0 starting bcast
0 req0 0x25aa6f28
0 wait returned non null request: 0x25aa6f28 vs 0x4045e0
0 MPI_SOURCE: 0
0 MPI_TAG: 0
0 MPI_ERROR: 0
0 sleep done
1 starting bcast
1 req0 0xb169298
1 sleep done
2 starting bcast
2 req0 0xc4f81b8
2 wait returned non null request: 0xc4f81b8 vs 0x4045e0
2 MPI_SOURCE: 0
2 MPI_TAG: 0
2 MPI_ERROR: 0
2 sleep done
3 starting bcast
3 req0 0x10ccbf18
3 wait returned non null request: 0x10ccbf18 vs 0x4045e0
3 MPI_SOURCE: 0
3 MPI_TAG: 0
3 MPI_ERROR: 0
3 sleep done

We observe that when MPI_Wait returns, no error is reported in the MPI_Status or in the logs, yet the MPI_Request has not been released and set to MPI_REQUEST_NULL.

According to the OpenMPI documentation and the standard:

A call to MPI_Wait returns when the operation identified by request is complete. If the communication object associated with this request was created by a nonblocking send or receive call, then the object is deallocated by the call to MPI_Wait and the request handle is set to MPI_REQUEST_NULL.
(https://docs.open-mpi.org/en/v5.0.x/man-openmpi/man3/MPI_Wait.3.html#description)

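Is the snippet above invalid? Note that if the do/while loop around MPI_Wait is uncommented, the code produces the output I expect. But then it takes on the semantics of MPI_Test (polling). The OpenMP part is key to triggering the issue.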

mpi openmp openmpi
1 answer

The relevant text is (copied from MPI 4.0, but present in other versions as well):

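Multiple threads completing the same request. A program in which two threads block, waiting on the same request, is erroneous. Similarly, the same request cannot appear in the array of requests of two concurrent MPI_{WAIT|TEST}{ANY|SOME|ALL} calls. In MPI, a request can only be completed once. Any combination of wait or test that violates this rule is erroneous.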

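The only completion function that can be called concurrently with pointers to the same request handle is MPI_Test. The important point is that all threads actually refer to the same storage for the common request handle.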
