I have this MPI_example.py file that estimates the largest eigenvalue of a matrix with power iteration:
from mpi4py import MPI
import numpy as np
from scipy.stats import ortho_group
from scipy.sparse import spdiags
def generate_matrix(dim):
    # Symmetric matrix with known eigenvalues 1..10, so the largest eigenvalue is 10
    a = ortho_group.rvs(dim, random_state=0)
    b = np.linspace(1., 10., dim)
    return a @ spdiags(b, 0, dim, dim) @ a.T

def power_iteration(A, b, num_iters):
    for _ in range(num_iters):
        b_new = A @ b
        b_new_norm = np.linalg.norm(b_new)
        b = b_new / b_new_norm
    return b

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 8
num_iters = 100

if rank == 0:
    A = generate_matrix(n)
else:
    A = None

# Distribute blocks of rows of A across the processes
rows_per_proc = n // size
local_A = np.zeros((rows_per_proc, n))
comm.Scatter(A, local_A, root=0)

if rank == 0:
    b = np.ones(n)
else:
    b = np.zeros(n)
comm.Bcast(b, root=0)

# Distributed power iteration: each rank computes its block of A @ b
for _ in range(num_iters):
    local_b_new = local_A @ b
    global_b_new = np.zeros(n)
    comm.Allreduce(local_b_new, global_b_new)
    norm = np.linalg.norm(global_b_new)
    b = global_b_new / norm

if rank == 0:
    # Rayleigh quotient gives the eigenvalue estimate for the converged vector
    estimated_eigenvalue = np.dot(b.T, A @ b) / np.dot(b.T, b)
    print(f"Estimated eigenvalue: {estimated_eigenvalue}")
    print(f"Error: {abs(10 - estimated_eigenvalue)}")

import time
start_time = time.time()
end_time = time.time()
if rank == 0:
    print(f"Time: {end_time - start_time} [s]")
I am running it in Google Colab, so I am trying this:
from google.colab import files
uploaded = files.upload() #select MPI_example.py
!ls #check .py file
!pip install mpi4py
import mpi4py
print(mpi4py.__version__)
Then I run everything with the following command:
!mpirun --allow-run-as-root --oversubscribe -n 4 python MPI_example.py
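As a side note (not part of the original workflow), a quick way to check that mpirun and mpi4py work together in the notebook before running the full script is to write a minimal test file from a cell with IPython's %%writefile magic; the file name hello_mpi.py and this snippet are only an illustrative sketch:

%%writefile hello_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Each rank prints its id, so we can see that the requested processes really started
print(f"Hello from rank {comm.Get_rank()} of {comm.Get_size()}")

and then run it the same way:

!mpirun --allow-run-as-root --oversubscribe -n 4 python hello_mpi.py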
But I get this error:
ValueError: mismatch in send count 2 and receive count 8
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[21624,1],2]
Exit code: 1
I would also like to try different values of -n in !mpirun --allow-run-as-root --oversubscribe -n 4 python MPI_example.py. How can I fix this error? What is going wrong?
Replacing Allreduce with Allgather, I get a reproducible result that is independent of the number of parallel processes (after increasing n to support a larger number of MPI processes):
# comm.Allreduce(local_b_new, global_b_new)
comm.Allgather(local_b_new, global_b_new)
Allreduce applies the default op=SUM element-wise across the vectors and distributes the result to all processes. local_b_new contains only the local partition of b, so an element-wise sum is not the correct operation, and the buffer sizes do not even match: Allreduce requires send and receive buffers of the same length, but local_b_new has n // size = 2 elements while global_b_new has 8, which is exactly what the ValueError reports. Allgather instead collects all the vectors, concatenates them in rank order, and distributes the result to all processes.
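To make the difference concrete, here is a minimal self-contained sketch (the names chunk, local, full and summed are mine, for illustration only) that can be saved to its own file and launched with the same mpirun command; it shows that Allgather concatenates the per-rank slices, while Allreduce element-wise reduces buffers that must already have the same length on every rank:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 8
chunk = n // size                                     # 2 elements per rank with -n 4
local = np.arange(chunk, dtype='d') + rank * chunk    # each rank owns one slice of 0..n-1

# Allgather: send buffer of length chunk, receive buffer of length chunk * size.
# The slices are concatenated in rank order, so every rank ends up with the full vector.
full = np.zeros(n)
comm.Allgather(local, full)

# Allreduce: send and receive buffers must have the same length; the result is an
# element-wise reduction (SUM by default) across ranks, not a concatenation.
summed = np.zeros(chunk)
comm.Allreduce(local, summed)

if rank == 0:
    print("Allgather:", full)      # [0. 1. 2. 3. 4. 5. 6. 7.]
    print("Allreduce:", summed)    # element-wise sum of the per-rank slices

Note that both the Scatter of the matrix rows and this fix assume that n is divisible by the number of processes (rows_per_proc = n // size), which is why n has to be increased when running with more MPI processes.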