Running a PyTorch quantized model on a CUDA GPU

Question

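I am confused about whether it is possible to run an int8 quantized model on CUDA, or whether you can only train a quantized model on CUDA with fake quantization and then deploy it on another backend such as CPU.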

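I want to run the model on CUDA with actual int8 instructions instead of fake-quantized float32 instructions, and benefit from the efficiency gains. Oddly, the PyTorch documentation is not specific about this. If it is possible to run a quantized model on CUDA with a different framework, such as TensorFlow, I would love to know.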
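To make the distinction in the question concrete, here is a minimal pure-Python sketch of real int8 quantization versus fake quantization. The helper names and the scale and zero-point values are made up for illustration; they are not PyTorch APIs and only mimic the usual affine scheme.

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an int8 code: round(x / scale) + zero_point, clamped."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Map an int8 code back to an approximate float value."""
    return (q - zero_point) * scale

def fake_quantize(x, scale, zero_point):
    """Quantize and immediately dequantize: the result is still a float."""
    return dequantize(quantize(x, scale, zero_point), scale, zero_point)

scale, zero_point = 0.05, 0          # illustrative values, not calibrated
weights = [0.12, -0.49, 1.0, 7.5]

# Real quantization stores integers, which int8 kernels can consume directly.
int8_codes = [quantize(w, scale, zero_point) for w in weights]
print(int8_codes)                    # [2, -10, 20, 127]

# Fake quantization keeps float values that carry the rounding and clamping
# error, so the arithmetic still runs in floating point.
fake_quant = [fake_quantize(w, scale, zero_point) for w in weights]
print(fake_quant)
```

The last weight is clamped to 127, so its fake-quantized value no longer matches the original 7.5. That loss is what quantization-aware training exposes to the network, while the speedup only appears when a backend actually executes on the integer codes.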

Here is the code that prepares the model for quantization (using post-training quantization). The model is a plain CNN with nn.Conv2d, nn.LeakyReLU and nn.MaxPool2d modules:

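import copy

import torch
from torch.quantization import quantize_fx
from torch.utils.data import DataLoader

# Load the trained float32 model (models_dir and net_file are defined elsewhere)
model_fp = torch.load(models_dir+net_file)

# Work on a copy in eval mode and fuse supported module patterns
model_to_quant = copy.deepcopy(model_fp)
model_to_quant.eval()
model_to_quant = quantize_fx.fuse_fx(model_to_quant)

# Use the default qnnpack qconfig for every module
qconfig_dict = {"": torch.quantization.get_default_qconfig('qnnpack')}

# Insert observers, then move the prepared model to the GPU for calibration
model_prepped = quantize_fx.prepare_fx(model_to_quant, qconfig_dict)
model_prepped.eval()
model_prepped.to(device='cuda:0')

# ImageDataset is my own Dataset class
train_data   = ImageDataset(img_dir, train_data_csv, 'cuda:0')
train_loader = DataLoader(train_data, batch_size=32, shuffle=True, pin_memory=True)

# Calibrate the observers on the first two batches
for i, (input, _) in enumerate(train_loader):
    if i > 1: break
    print('batch', i+1, end='\r')
    input = input.to('cuda:0')
    model_prepped(input)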

This is the step that actually quantizes the model:

model_quantised = quantize_fx.convert_fx(model_prepped)
model_quantised.eval()

This is my attempt to run the quantized model on CUDA. It raises a NotImplementedError, whereas the same model works fine when I run it on CPU:

model_quantised = model_quantised.to('cuda:0')
for input, _ in train_loader:
    input = input.to('cuda:0')
    out = model_quantised(input)
    print(out, out.shape)
    break

Here is the error:

Traceback (most recent call last):
  File "/home/adam/Desktop/thesis/Ship Detector/quantisation.py", line 54, in <module>
    out = model_quantised(input)
  File "/home/adam/.local/lib/python3.9/site-packages/torch/fx/graph_module.py", line 513, in wrapped_call
    raise e.with_traceback(None)
NotImplementedError: Could not run 'quantized::conv2d.new' with arguments from the 'QuantizedCUDA' backend. 
This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). 
If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 
'quantized::conv2d.new' is only available for these backends: [QuantizedCPU, BackendSelect, Named, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, UNKNOWN_TENSOR_TYPE_ID, AutogradMLC, Tracer, Autocast, Batched, VmapMode].

Answer 1

According to [this blog post][1], it seems you can't run a quantized model on a GPU.

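> Quantization in PyTorch is currently CPU-only. Quantization is not a CPU-specific technique (e.g. NVIDIA's TensorRT can be used to implement quantization on GPU). However, inference time on GPU is usually already "fast enough", and CPUs are more attractive for large-scale model server deployment (due to complex cost factors that are out of scope for this article). Consequently, as of PyTorch 1.6, only CPU backends are available in the native API.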

[1]: https://spell.ml/blog/pytorch-quantization-X8e7wBAAACIAHPhT
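In practice this means the converted model and its inputs have to stay on the CPU. The following is a rough sketch of that workflow using eager-mode static quantization rather than the FX API from the question. SmallCNN is a hypothetical stand-in for the asker's network, random tensors replace the real calibration data, and it assumes a PyTorch version that ships torch.ao.quantization with either the fbgemm or qnnpack engine available.

```python
import torch
import torch.nn as nn
from torch.ao import quantization as tq

class SmallCNN(nn.Module):
    """Hypothetical stand-in for the CNN in the question."""
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # float -> quantized at the input
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.1)
        self.pool = nn.MaxPool2d(2)
        self.dequant = tq.DeQuantStub()  # quantized -> float at the output

    def forward(self, x):
        x = self.quant(x)
        x = self.pool(self.act(self.conv(x)))
        return self.dequant(x)

# Pick a quantized engine that this build supports (assumption: one of these exists).
engine = 'fbgemm' if 'fbgemm' in torch.backends.quantized.supported_engines else 'qnnpack'
torch.backends.quantized.engine = engine

model = SmallCNN().eval()
model.qconfig = tq.get_default_qconfig(engine)
prepared = tq.prepare(model)             # inserts observers, weights stay float32

# Calibration runs in float32, so it may use the GPU when one is present.
calib_device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
prepared.to(calib_device)
with torch.no_grad():
    for _ in range(2):
        prepared(torch.randn(4, 3, 32, 32, device=calib_device))

# Move back to the CPU before converting: the int8 kernels are CPU-only.
prepared.to('cpu')
quantized = tq.convert(prepared)

with torch.no_grad():
    out = quantized(torch.randn(1, 3, 32, 32))  # input stays on the CPU
print(out.shape, out.dtype)
```

Calling quantized.to('cuda:0') at the end would be expected to lead to the same kind of NotImplementedError shown in the question once the model is run, since the quantized convolution has no QuantizedCUDA kernel.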

Answer 2

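This is an old question, but things have moved on quite a bit since then, and there has been significant progress in running high-performance quantized models with GPU backends.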
