Using the toy script below, I get worse results on the GPU than on the CPU. I'm fairly new to GPU programming, so I don't even know how to start debugging this. From my limited research, convolution is supposed to be something a GPU can accelerate dramatically, so I assume I'm doing something wrong. Is there a bottleneck in the data being sent to and received from the GPU?
import timeit
import numpy as np
import torch

def to_torch(x, device):
    # MPS has no float64 support, so downcast to float32 before moving
    if x.dtype == 'float64':
        x = x.astype('float32')
    return torch.from_numpy(x).to(device)

def min_max_normalize(x):
    min_val = torch.min(x)
    max_val = torch.max(x)
    return (x - min_val) / (max_val - min_val)

def test_device(device):
    conv1_weights = to_torch(np.random.randn(3, 3), device)
    conv1_bias = to_torch(np.zeros((415, 415)), device)

    def work():
        data = to_torch(np.random.randn(397, 397), device)
        data = min_max_normalize(data)
        output_height = data.shape[0] - conv1_weights.shape[0] + 1
        output_width = data.shape[1] - conv1_weights.shape[1] + 1
        output = to_torch(np.zeros((output_height, output_width)), device)
        # sliding-window cross-correlation, one small tensor op per output pixel
        for i in range(output_height):
            for j in range(output_width):
                output[i, j] = torch.sum(data[i:i+conv1_weights.shape[0], j:j+conv1_weights.shape[1]] * conv1_weights)
        output += conv1_bias[:output_height, :output_width]

    return timeit.timeit(work, number=5)

print(test_device("cpu"))
print(test_device("mps"))
Output:
4.249846041202545
51.15217095799744
As you can see, the GPU computation is more than 10x slower. Am I using torch tensors incorrectly?
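To test my data-transfer theory, I also tried timing just the numpy-to-device copy on its own (a quick experiment of mine, separate from the script above; I'm assuming torch.mps.synchronize() is available on this build so the asynchronous copy is actually finished before the timer stops):

import timeit
import numpy as np
import torch

x = np.random.randn(397, 397).astype('float32')

def copy_to(device):
    t = torch.from_numpy(x).to(device)
    if device == "mps":
        torch.mps.synchronize()  # assumed available; waits for queued MPS work
    return t

print(timeit.timeit(lambda: copy_to("cpu"), number=100))
print(timeit.timeit(lambda: copy_to("mps"), number=100))

This only measures the copies, but it should at least separate transfer cost from compute cost.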
Following Nick ODell's suggestion, I changed the loop portion to the following:
for i in range(conv1_weights.shape[0]):
    for j in range(conv1_weights.shape[1]):
        # one whole-image multiply per kernel weight, accumulated with a shift,
        # instead of one tiny op per output pixel
        data_times_weight = conv1_weights[i, j] * data
        output += data_times_weight[i:i + output_height, j:j + output_width]
output += conv1_bias[:output_height, :output_width]
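
To convince myself the reordered loop computes the same result as the original per-pixel loop, I compared the two on a small random input (a quick check of my own, not part of the suggestion):

import torch

# quick sanity check: both loop orders should agree
data = torch.rand(10, 10)
w = torch.rand(3, 3)
oh = data.shape[0] - w.shape[0] + 1
ow = data.shape[1] - w.shape[1] + 1

# original formulation: one dot product per output pixel
slow = torch.zeros(oh, ow)
for i in range(oh):
    for j in range(ow):
        slow[i, j] = torch.sum(data[i:i+w.shape[0], j:j+w.shape[1]] * w)

# reordered formulation: one shifted, scaled copy of the image per weight
fast = torch.zeros(oh, ow)
for i in range(w.shape[0]):
    for j in range(w.shape[1]):
        fast += (w[i, j] * data)[i:i+oh, j:j+ow]

assert torch.allclose(slow, fast, atol=1e-5)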
This change gave a large speedup on both the GPU and the CPU:
0.028558707796037197
0.07935599982738495
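
For reference, PyTorch's built-in F.conv2d should compute the same cross-correlation as the hand-written loop once batch and channel dimensions are added; here is a sketch of what I believe is the equivalent single call (reusing the names from the script above, not benchmarked):

import torch.nn.functional as F

# presumed one-call equivalent: conv2d expects (batch, channels, height, width),
# so add two singleton dimensions to the 2-D tensors
out = F.conv2d(data[None, None], conv1_weights[None, None])[0, 0]
out = out + conv1_bias[:output_height, :output_width]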