I am comparing the inference times of PyTorch and ONNX Runtime on the same input, and I found that ONNX Runtime is actually slower on GPU, while being noticeably faster on CPU.
I tried this on Windows 10.
Relevant code -
import torch
from torchvision import models
import onnxruntime  # to inference ONNX models, we use the ONNX Runtime
import onnx
import os
import time

batch_size = 1
total_samples = 1000
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

def convert_to_onnx(resnet):
    resnet.eval()
    dummy_input = (torch.randn(batch_size, 3, 224, 224, device=device)).to(device=device)
    input_names = ['input']
    output_names = ['output']
    torch.onnx.export(resnet,
                      dummy_input,
                      "resnet18.onnx",
                      verbose=True,
                      opset_version=13,
                      input_names=input_names,
                      output_names=output_names,
                      export_params=True,
                      do_constant_folding=True,
                      dynamic_axes={
                          'input': {0: 'batch_size'},  # variable length axes
                          'output': {0: 'batch_size'}}
                      )

def infer_pytorch(resnet):
    print('Pytorch Inference')
    print('==========================')
    print()

    x = torch.randn((batch_size, 3, 224, 224))
    x = x.to(device=device)

    latency = []
    for i in range(total_samples):
        t0 = time.time()
        resnet.eval()
        with torch.no_grad():
            out = resnet(x)
        latency.append(time.time() - t0)

    print('Number of runs:', len(latency))
    print("Average PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))

def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

def infer_onnxruntime():
    print('Onnxruntime Inference')
    print('==========================')
    print()

    onnx_model = onnx.load("resnet18.onnx")
    onnx.checker.check_model(onnx_model)

    # Input
    x = torch.randn((batch_size, 3, 224, 224))
    x = x.to(device=device)
    x = to_numpy(x)

    so = onnxruntime.SessionOptions()
    so.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
    so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL

    exproviders = ['CUDAExecutionProvider', 'CPUExecutionProvider']

    model_onnx_path = os.path.join(".", "resnet18.onnx")
    ort_session = onnxruntime.InferenceSession(model_onnx_path, so, providers=exproviders)

    options = ort_session.get_provider_options()
    cuda_options = options['CUDAExecutionProvider']
    cuda_options['cudnn_conv_use_max_workspace'] = '1'
    ort_session.set_providers(['CUDAExecutionProvider'], [cuda_options])

    # IOBinding
    input_names = ort_session.get_inputs()[0].name
    output_names = ort_session.get_outputs()[0].name
    io_binding = ort_session.io_binding()
    io_binding.bind_cpu_input(input_names, x)
    io_binding.bind_output(output_names, device)

    # warm up run
    ort_session.run_with_iobinding(io_binding)
    ort_outs = io_binding.copy_outputs_to_cpu()

    latency = []
    for i in range(total_samples):
        t0 = time.time()
        ort_session.run_with_iobinding(io_binding)
        latency.append(time.time() - t0)
        ort_outs = io_binding.copy_outputs_to_cpu()

    print('Number of runs:', len(latency))
    print("Average onnxruntime {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))

if __name__ == '__main__':
    torch.cuda.empty_cache()
    resnet = (models.resnet18(pretrained=True)).to(device=device)
    convert_to_onnx(resnet)
    infer_onnxruntime()
    infer_pytorch(resnet)
Output
If run on CPU,
Average onnxruntime cpu Inference time = 18.48 ms
Average PyTorch cpu Inference time = 51.74 ms
But if run on GPU, I get
Average onnxruntime cuda Inference time = 47.89 ms
Average PyTorch cuda Inference time = 8.94 ms
If I change the graph optimization level to onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL, I see some improvement in the inference time on GPU, but it is still slower than PyTorch.
I use io-binding for the input tensor's numpy array, and the model's nodes are placed on the GPU.
Also, while onnxruntime is running, I print the device usage statistics and see this -
Using device: cuda:0
GPU Device name: Quadro M2000M
Memory Usage:
Allocated: 0.1 GB
Cached: 0.1 GB
So the GPU device is being used.
Also, I tried the resnet18.onnx model from the ModelZoo to check whether the problem was in my conversion, but I got the same results.
What am I doing wrong or missing here?
When measuring inference time, exclude from the loop all code that only needs to run once, such as resnet.eval().
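For example, the PyTorch timing loop from the question could be restructured like this (a minimal sketch, reusing resnet, x, total_samples, and time from the code above):

resnet.eval()  # call eval() once, outside the timed loop

latency = []
with torch.no_grad():
    for i in range(total_samples):
        t0 = time.time()
        out = resnet(x)
        latency.append(time.time() - t0)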
Please include the imports in your example:
import torch
from torchvision import models
import onnxruntime # to inference ONNX models, we use the ONNX Runtime
import onnx
import os
import time
After running your example on GPU only, I saw a difference of only about 2x in the timings, so the remaining speed difference is likely caused by framework specifics. For more details, explore ONNX conversion optimization.
Onnxruntime Inference
==========================
Number of runs: 1000
Average onnxruntime cuda Inference time = 4.76 ms
Pytorch Inference
==========================
Number of runs: 1000
Average PyTorch cuda Inference time = 2.27 ms
For CPU you do not need to use io-binding; it is only needed for GPU.
Also, do not change the session options: onnxruntime picks the best options by default.
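A minimal CPU-only sketch along those lines, using a plain run() call and default session options (the file name resnet18.onnx is taken from the question):

import numpy as np
import onnxruntime

# Default session options; onnxruntime chooses suitable optimizations itself
ort_session = onnxruntime.InferenceSession("resnet18.onnx",
                                           providers=['CPUExecutionProvider'])

input_name = ort_session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Plain run() - no io-binding needed on CPU
ort_outs = ort_session.run(None, {input_name: x})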
What may help speed up the GPU: the TensorRT EP, which ships with onnxruntime-gpu alongside the CUDA EP. I also suggest using the io_binding.bind_input() method instead of io_binding.bind_cpu_input(), so the input starts out in GPU memory.
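A sketch of that change, binding a torch CUDA tensor directly (this reuses batch_size, input_names, and io_binding from the question's code; staging the input with torch.randn is an assumption for illustration):

import numpy as np
import torch

# Create the input directly on the GPU instead of as a CPU numpy array
x_gpu = torch.randn(batch_size, 3, 224, 224, device='cuda').contiguous()

io_binding.bind_input(
    name=input_names,             # input name queried from the session
    device_type='cuda',
    device_id=0,
    element_type=np.float32,
    shape=tuple(x_gpu.shape),
    buffer_ptr=x_gpu.data_ptr())  # raw pointer to the CUDA buffer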
for i in range(total_samples):
    t0 = time.time()
    ort_session.run_with_iobinding(io_binding)
    latency.append(time.time() - t0)
->  ort_outs = io_binding.copy_outputs_to_cpu()

Copying the output from GPU to CPU on every one of the 1000 iterations kills performance.
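A minimal sketch of the fix: keep the timed loop pure and copy the result back only once, after the loop (same names as in the question's code):

latency = []
for i in range(total_samples):
    t0 = time.time()
    ort_session.run_with_iobinding(io_binding)
    latency.append(time.time() - t0)

# Transfer the output to the CPU a single time, after all timed runs
ort_outs = io_binding.copy_outputs_to_cpu()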
I have solved a similar problem before.
See this post: https://medium.com/neuml/debug-onnx-gpu-performance-c9290fe07459