I have a set of independent Boolean functions that can (in principle) be executed in parallel, and I want to call these same functions repeatedly. See the code below, where the functions' outputs ping-pong between memory locations A and B. How can I force the "IN PARALLEL" lines to actually run in parallel on an NVIDIA GPU with CUDA installed?
import torch
A = torch.tensor([True, False, True]).to('cuda') # Initial values.
B = torch.tensor([False, True, True]).to('cuda') # Values don't matter. Will write over them in the first iteration.
n_steps = 100
for step in range(n_steps):
    # Use values in A to compute new values in B.
    # How to run the three lines below IN PARALLEL?
    B[0] = torch.logical_and(torch.logical_or( A[0], A[1]), A[2]) # func1: Y0 = (X0 | X1) & X2
    B[1] = torch.logical_or( torch.logical_or( A[0], A[1]), A[2]) # func2: Y1 = X0 | X1 | X2
    B[2] = torch.logical_and(torch.logical_and(A[0], A[1]), A[2]) # func3: Y2 = X0 & X1 & X2
    # Only after the three lines above finish (and B holds new values) should the lines below run.
    # Use values in B to compute new values in A.
    # Note that the functions below are identical to the ones above (which may allow for some additional acceleration?)
    # How to run the three lines below IN PARALLEL?
    A[0] = torch.logical_and(torch.logical_or( B[0], B[1]), B[2]) # func1: Y0 = (X0 | X1) & X2
    A[1] = torch.logical_or( torch.logical_or( B[0], B[1]), B[2]) # func2: Y1 = X0 | X1 | X2
    A[2] = torch.logical_and(torch.logical_and(B[0], B[1]), B[2]) # func3: Y2 = X0 & X1 & X2
    # Only after the three lines above finish (and A holds new values) should the next iteration run.
I doubt this is the best/fastest solution, but using torch.compile does provide a speedup. I haven't yet tested scaling to thousands of Boolean functions.
import torch
from time import time
A = torch.tensor([True, False, True]).to('cuda') # Initial values.
B = torch.tensor([False, True, True]).to('cuda') # Values don't matter. Will write over them in the first iteration.
@torch.compile
def process(A, B, n_steps):
    for step in range(n_steps):
        # Use values in A to compute new values in B.
        B[0] = torch.logical_and(torch.logical_or( A[0], A[1]), A[2]) # func1: Y0 = (X0 | X1) & X2
        B[1] = torch.logical_or( torch.logical_or( A[0], A[1]), A[2]) # func2: Y1 = X0 | X1 | X2
        B[2] = torch.logical_and(torch.logical_and(A[0], A[1]), A[2]) # func3: Y2 = X0 & X1 & X2
        # Only after the three lines above finish (and B holds new values) should the lines below run.
        # Use values in B to compute new values in A (the same functions as above).
        A[0] = torch.logical_and(torch.logical_or( B[0], B[1]), B[2]) # func1: Y0 = (X0 | X1) & X2
        A[1] = torch.logical_or( torch.logical_or( B[0], B[1]), B[2]) # func2: Y1 = X0 | X1 | X2
        A[2] = torch.logical_and(torch.logical_and(B[0], B[1]), B[2]) # func3: Y2 = X0 & X1 & X2
        # Only after the three lines above finish (and A holds new values) should the next iteration run.
    return A
# First run is slow due to compilation
t_start = time()
A = process(A, B, 100)
print(f'First run time: {time()-t_start} seconds')
# Runs are faster subsequently, and can be looped over to effectively increase n_steps
t_start = time()
A = process(A, B, 100)
print(f'Second run time: {time()-t_start} seconds')
Output without the @torch.compile decorator:
First run time: 0.12589144706726074 seconds
Second run time: 0.059000492095947266 seconds
Output with the @torch.compile decorator:
First run time: 18.201257467269897 seconds
Second run time: 0.007639169692993164 seconds
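For scaling to thousands of functions, a vectorized formulation may also be worth trying: express each output as a whole-tensor Boolean op, so the GPU parallelizes across all elements inside each kernel launch, and return a fresh tensor instead of assigning elements in place (which also removes the A/B double buffer, since there is no partial overwrite). Below is only a sketch for the three example functions, with a hypothetical helper name step; it runs on CPU as written, and on a GPU you would add .to('cuda') as in the original code:

```python
import torch

def step(X: torch.Tensor) -> torch.Tensor:
    # One update of all three functions at once, with no in-place writes.
    ors  = torch.logical_or(X[0], X[1])   # shared subexpression of func1/func2
    ands = torch.logical_and(X[0], X[1])  # shared subexpression of func3
    return torch.stack([
        torch.logical_and(ors, X[2]),     # func1: Y0 = (X0 | X1) & X2
        torch.logical_or(ors, X[2]),      # func2: Y1 = X0 | X1 | X2
        torch.logical_and(ands, X[2]),    # func3: Y2 = X0 & X1 & X2
    ])

A = torch.tensor([True, False, True])  # add .to('cuda') on a GPU machine
for _ in range(100):
    A = step(A)
print(A)  # tensor([False,  True, False])
```

The same idea composes with @torch.compile; with many more functions, the per-element indexing in the original loop becomes the bottleneck, whereas whole-tensor ops keep the work in a small, fixed number of kernel launches per step.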