How can a set of repeatedly executed boolean functions be parallelized (on the GPU) in PyTorch?

Question (votes: 0, answers: 1)

I have a set of independent boolean functions that can (presumably) be executed in parallel, and I want to call these same functions repeatedly. See the code below, where the functions' outputs ping-pong between the A and B memory locations. How can I force the "IN PARALLEL" lines to run in parallel on an NVIDIA GPU with CUDA installed?

import torch

A = torch.tensor([True, False, True]).to('cuda')  # Initial values.
B = torch.tensor([False, True, True]).to('cuda')  # Values don't matter. Will write over them in the first iteration.

n_steps = 100

for step in range(n_steps):

    # Use values in A to compute new values in B.
    # How to run the three lines below IN PARALLEL?
    B[0] = torch.logical_and(torch.logical_or( A[0], A[1]), A[2])  # func1: Y0 = (X0 | X1) & X2
    B[1] = torch.logical_or( torch.logical_or( A[0], A[1]), A[2])  # func2: Y1 = X0 | X1 | X2
    B[2] = torch.logical_and(torch.logical_and(A[0], A[1]), A[2])  # func3: Y2 = X0 & X1 & X2

    # Only after the three lines above finish their computation (and B has new values) should the lines below run.


    # Use values in B to compute new values in A.
    # Note that the functions below are identical to the ones above (which may allow for some additional acceleration?)
    # How to run the three lines below IN PARALLEL?
    A[0] = torch.logical_and(torch.logical_or( B[0], B[1]), B[2])  # func1: Y0 = (X0 | X1) & X2
    A[1] = torch.logical_or( torch.logical_or( B[0], B[1]), B[2])  # func2: Y1 = X0 | X1 | X2
    A[2] = torch.logical_and(torch.logical_and(B[0], B[1]), B[2])  # func3: Y2 = X0 & X1 & X2

    # Only after the three lines above finish their computation (and A has new values) should the next loop iteration run.
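As background for the question, the usual way to expose this kind of parallelism to the GPU is to express each update as whole-tensor boolean operations rather than per-element assignments. A minimal sketch (the `step` helper is hypothetical, not from the post), shown on CPU but working identically after `.to('cuda')`:

```python
import torch

# Hypothetical helper (not from the post): evaluate func1..func3 as
# whole-tensor boolean expressions, so one step launches a few fused
# element-wise kernels instead of three per-element indexed writes.
def step(X):
    y0 = (X[0] | X[1]) & X[2]  # func1: Y0 = (X0 | X1) & X2
    y1 = X[0] | X[1] | X[2]    # func2: Y1 = X0 | X1 | X2
    y2 = X[0] & X[1] & X[2]    # func3: Y2 = X0 & X1 & X2
    # Building a new tensor also avoids the read/write aliasing that
    # in-place updates of A or B would introduce.
    return torch.stack((y0, y1, y2))

A = torch.tensor([True, False, True])  # same initial values as the question
B = step(A)
print(B.tolist())  # [True, True, False]
```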
pytorch parallel-processing cuda nvidia boolean-expression
1 Answer

0 votes

I doubt this is the best/fastest solution, but using torch.compile does provide a speedup. I have not tested how it scales to thousands of boolean functions.

import torch
from time import time

A = torch.tensor([True, False, True]).to('cuda')  # Initial values.
B = torch.tensor([False, True, True]).to('cuda')  # Values don't matter. Will write over them in the first iteration

@torch.compile
def process(A, B, n_steps):
    for step in range(n_steps):

        # Use values in A to compute new values in B.
        # How to run the three lines below IN PARALLEL?
        B[0] = torch.logical_and(torch.logical_or( A[0], A[1]), A[2])  # func1: Y0 = (X0 | X1) & X2
        B[1] = torch.logical_or( torch.logical_or( A[0], A[1]), A[2])  # func2: Y1 = X0 | X1 | X2
        B[2] = torch.logical_and(torch.logical_and(A[0], A[1]), A[2])  # func3: Y2 = X0 & X1 & X2
        # Only after the three lines above finish their computation (and B has new values) should the lines below run.

        # Use values in B to compute new values in A.
        # Note that the functions below are identical to the ones above (which may allow for some additional acceleration?)
        # How to run the three lines below IN PARALLEL?
        A[0] = torch.logical_and(torch.logical_or( B[0], B[1]), B[2])  # func1: Y0 = (X0 | X1) & X2
        A[1] = torch.logical_or( torch.logical_or( B[0], B[1]), B[2])  # func2: Y1 = X0 | X1 | X2
        A[2] = torch.logical_and(torch.logical_and(B[0], B[1]), B[2])  # func3: Y2 = X0 & X1 & X2
        # Only after the three lines above finish their computation (and A has new values) should the next loop iteration run.

    return A

# First run is slow due to compilation
t_start = time()
A = process(A, B, 100)
print(f'First run time: {time()-t_start} seconds')

# Subsequent runs are faster, and can be looped over to effectively increase n_steps
t_start = time()
A = process(A, B, 100)
print(f'Second run time: {time()-t_start} seconds')

Output without the @torch.compile decorator:

First run time: 0.12589144706726074 seconds
Second run time: 0.059000492095947266 seconds

Output with the @torch.compile decorator:

First run time: 18.201257467269897 seconds
Second run time: 0.007639169692993164 seconds