PyTorch non-blocking assignment from CUDA to CPU gives incorrect results

Question

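I am trying to assign to a tensor on the CPU using values I just fetched from the GPU, but the result is incorrect, even though the two tensors should clearly be identical:

(To head off unnecessary back-and-forth: I am aware of two other ways to copy, namely 1) the copy_ method and 2) overwriting the variable entirely. In my tests, however, the first was slower and the second incurred memory overhead. So far, slice assignment with the colon (a[:] = ...) seems to perform best, although the result is incorrect when the non_blocking option is used.)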

import torch
gpu = torch.device('cuda')
cpu = torch.device('cpu')

a = torch.rand((13223,134,4), dtype=torch.float32, device=cpu)
b = torch.rand((13223,134,4), dtype=torch.bfloat16, device=gpu)

for i in range(3):
    
    b.mul_(0.5)
    
    a[:] = b.to(device=cpu, memory_format=torch.preserve_format, dtype=torch.bfloat16, non_blocking=True)
    
    torch.cuda.synchronize()
    
    print(b[0,0], a[0,0])

tensor([0.0942, 0.1621, 0.2041, 0.1543], device='cuda:0', dtype=torch.bfloat16) tensor([0., 0., 0., 0.])
tensor([0.0471, 0.0811, 0.1021, 0.0771], device='cuda:0', dtype=torch.bfloat16) tensor([0.0942, 0.1621, 0.2041, 0.1543])
tensor([0.0236, 0.0405, 0.0510, 0.0386], device='cuda:0', dtype=torch.bfloat16) tensor([0.0471, 0.0811, 0.1021, 0.0771])

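In my application the results are even stranger, because some values turn negative (even though none of them is negative in the original tensor)! Thanks in advance for any input. The goal is to transfer data to the CPU quickly and efficiently, so if you know of any other approach I have not mentioned that could help, please comment. :)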

Answer 1

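A non-blocking copy does what its name says: it does not block execution of the next statement. The call returns before the transfer has necessarily finished.

You assume that

a[:] = b.to(device=cpu, memory_format=torch.preserve_format, dtype=torch.bfloat16, non_blocking=True)

executes as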

b_gpu -> b_cpu
b_cpu -> a

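But with non_blocking, the order of these two operations is not guaranteed, and they are allowed to run in parallel. Yes, that buys speed! However, b_cpu -> a runs faster, and because it does not wait for b_gpu -> b_cpu to finish, the effective execution order is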

b_cpu -> a
b_gpu -> b_cpu
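The ordering described in this answer can be mimicked without a GPU. The sketch below is a toy model only: ToyStream, run, and staging are hypothetical names, not PyTorch or CUDA APIs, and the model assumes that the host buffer is reused between iterations, which would be consistent with the one-step lag in the question's output. Queued work runs only when synchronize() is called, which stands in for "the transfer has not finished yet".

```python
class ToyStream:
    """Toy model of an asynchronous work queue (not a PyTorch or CUDA API)."""

    def __init__(self):
        self._pending = []

    def enqueue(self, fn):
        # Record the work and return immediately, like a non-blocking call.
        self._pending.append(fn)

    def synchronize(self):
        # Run everything that was queued, in order.
        while self._pending:
            self._pending.pop(0)()


def run(sync_before_read):
    stream = ToyStream()
    b_gpu = [8.0]      # stands in for the tensor on the GPU
    staging = [0.0]    # stands in for the CPU result of b.to(..., non_blocking=True)
    a = [0.0]          # final CPU tensor
    history = []
    for _ in range(3):
        b_gpu[0] *= 0.5
        value = b_gpu[0]
        # Asynchronous "b_gpu -> b_cpu": queued, not yet executed.
        stream.enqueue(lambda v=value: staging.__setitem__(0, v))
        if sync_before_read:
            stream.synchronize()   # wait for the transfer first
        a[0] = staging[0]          # "b_cpu -> a"
        stream.synchronize()       # too late when sync_before_read is False
        history.append((b_gpu[0], a[0]))
    return history


print(run(sync_before_read=False))  # [(4.0, 0.0), (2.0, 4.0), (1.0, 2.0)]
print(run(sync_before_read=True))   # [(4.0, 4.0), (2.0, 2.0), (1.0, 1.0)]
```

In the first run a always trails b by one iteration, the same pattern as in the question. Synchronizing before the host-side read removes the lag.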

Answer 2

I have now solved this by creating a buffer:

import torch
gpu = torch.device('cuda')
cpu = torch.device('cpu')

a = torch.rand((13223,134,4), dtype=torch.float32, device=cpu)
b = torch.rand((13223,134,4), dtype=torch.bfloat16, device=gpu)

for i in range(3):
    
    b.mul_(0.5)
    
    buffer = b.to(device=cpu, memory_format=torch.preserve_format, dtype=torch.bfloat16, non_blocking=True)
    
    torch.cuda.synchronize()
    
    a[:] = buffer        
    
    print(b[0,0], a[0,0])
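A possible variation on the buffer approach, offered as a sketch rather than a measured improvement: allocate the staging buffer once and refill it with copy_ on each iteration, instead of letting b.to(...) create a new CPU tensor every time. The names copy_to_cpu and staging are hypothetical, the smaller shape is only there to keep the example light, and the fallback to CPU when no GPU is present exists only so the snippet runs anywhere. Whether this beats the code above in speed or memory was not tested here.

```python
import torch


def copy_to_cpu(src, staging, dst):
    """Copy src (possibly on the GPU) into dst (on the CPU) via a reusable staging buffer."""
    # Enqueue the device-to-host copy. With a pinned staging buffer this call
    # can return before the data has arrived.
    staging.copy_(src, non_blocking=True)
    if src.is_cuda:
        # Wait for the transfer before the host reads the staging buffer.
        torch.cuda.current_stream(src.device).synchronize()
    # Host-side copy, including the bfloat16 -> float32 conversion.
    dst.copy_(staging)
    return dst


use_cuda = torch.cuda.is_available()
device = torch.device('cuda' if use_cuda else 'cpu')

shape = (1000, 134, 4)
a = torch.zeros(shape, dtype=torch.float32)
b = torch.rand(shape, device=device).to(torch.bfloat16)

# Allocated once and reused. Pinned memory is only requested when a GPU is present.
staging = torch.empty(shape, dtype=torch.bfloat16, device='cpu', pin_memory=use_cuda)

for i in range(3):
    b.mul_(0.5)
    copy_to_cpu(b, staging, a)
    # After the synchronized copy, a should match b exactly.
    assert torch.equal(a, b.float().cpu())
```

The key point is the same as in the code above: the host must not read the staging buffer until the stream that performs the transfer has been synchronized.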
