Error when freezing batchnorm layers on the mps device on an M1 Mac

Question

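I have been struggling with this strange error for days and cannot find a solution. The code below runs perfectly on the CPU, but throws an error when using the mps device on a MacBook Pro 14. It is definitely not a memory issue, because the code runs when all parameters are set to trainable. It only breaks when the batchnorm layers are frozen and everything else is not.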

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision.models.resnet import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)

# Freeze the batchnorm parameters (matched by parameter name)
for name, param in model.named_parameters():
    if "bn" in name or "batchnorm" in name.lower():
        param.requires_grad = False

# Adding a custom classification head
num_ftrs = model.fc.in_features
model.fc = nn.Sequential(
    nn.Dropout(0.2),
    nn.Linear(num_ftrs, 1024),
    nn.ReLU(inplace=True),
    nn.Dropout(0.2),
    nn.Linear(1024, 512),
    nn.ReLU(inplace=True),
    nn.Dropout(0.2),
    nn.Linear(512, 3),
)

# Then I have a standard training loop stored in a class
# (ModelTrainer, train_loader and val_loader are not shown here)
trainer = ModelTrainer(model, train_loader, val_loader, num_epochs=1)
model = trainer.train()
history = trainer.history()
```
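As a side note on the freezing step above: matching on `"bn"` in the parameter name only catches layers whose attribute name contains `bn`. In torchvision's ResNet the shortcut BatchNorm sits inside an `nn.Sequential`, so its parameters are named `downsample.1.*` and would stay trainable. The sketch below shows a type-based alternative on a small stand-in module. `ToyBlock` and `freeze_batchnorm_params` are hypothetical names introduced here for illustration, not part of the original code, and this does not by itself address the mps crash.

```python
import torch.nn as nn

BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)


class ToyBlock(nn.Module):
    """Hypothetical stand-in that mimics the parameter naming of a ResNet block."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(8)
        # The shortcut BatchNorm lives in a Sequential, so its parameters
        # are named "downsample.1.*" with no "bn" in the name.
        self.downsample = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=1, bias=False),
            nn.BatchNorm2d(8),
        )

    def forward(self, x):
        return self.bn1(self.conv1(x)) + self.downsample(x)


def freeze_batchnorm_params(model):
    """Freeze the affine parameters of every BatchNorm module; return their names."""
    frozen = []
    for module_name, module in model.named_modules():
        if isinstance(module, BN_TYPES):
            for param_name, param in module.named_parameters(recurse=False):
                param.requires_grad = False
                frozen.append(f"{module_name}.{param_name}")
    return frozen


model = ToyBlock()
# Name-based matching, as in the snippet above: misses the shortcut BatchNorm.
matched_by_name = [n for n, _ in model.named_parameters() if "bn" in n]
# Type-based matching: finds every BatchNorm module regardless of its name.
matched_by_type = freeze_batchnorm_params(model)
print(matched_by_name)
print(matched_by_type)
```

Comparing the two printed lists on the real `resnet50` would show whether the name-based filter froze everything that was intended.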


And here is the error:

Using mps device
Epoch 1/1

Train: 0%| | 0/38 [00:00
ZN3c104impl28wrap_kernel_functor_unboxed_INS0_6detail24WrapFunctionIntoFunctor_INS_26CompileTimeFunctionPointerIFNSt3__15tupleIJN2at6TensorES8_S8_EEENS_14DispatchKeySetERKS8_SC_RKNS_8optionalIS8_EESG_SG_SG_SG_bdNS5_5arrayIbLm3EEEEXadL_ZN5torch8autograd12VariableType12_GLOBAL__N_126native_batch_norm_backwardESA_SC_SC_SG_SG_SG_SG_SG_bdSI_EEEES9_NS_4guts8typelist8typelistIJSA_SC_SC_SG_SG_SG_SG_SG_bdSI_EEEEESJ_E4callEPNS_14OperatorKernelESA_SC_SC_SG_SG_SG_SG_SG_bdSI + 2392
8 libtorch_cpu.dylib 0x00000001546324ec _ZN2at4_ops26native_batch_norm_backward4callERKNS_6TensorES4_RKN3c108optionalIS2_EES9_S9_S9_S9_bdNSt3__15arrayIbLm3EEE + 468
9 libtorch_cpu.dylib 0x00000001560bcce8 _ZN5torch8autograd9generated24NativeBatchNormBackward05applyEONSt3__16vectorIN2at6TensorENS3_9allocatorIS6_EEEE + 884
10 libtorch_cpu.dylib 0x00000001570a2a50 _ZN5torch8autograd4NodeclEONSt3__16vectorIN2at6TensorENS2_9allocatorIS5_EEEE + 120
11 libtorch_cpu.dylib 0x000000015709983c _ZN5torch8autograd6Engine17evaluate_functionERNSt3__110shared_ptrINS0_9GraphTaskEEEPNS0_4NodeERNS0_11InputBufferERKNS3_INS0_10ReadyQueueEEE + 2932
12 libtorch_cpu.dylib 0x00000001570986e0 _ZN5torch8autograd6Engine11thread_mainERKNSt3__110shared_ptrINS0_9GraphTaskEEE + 640
13 libtorch_cpu.dylib 0x00000001570973c4 _ZN5torch8autograd6Engine11thread_initEiRKNSt3__110shared_ptrINS0_10ReadyQueueEEEb + 336
14 libtorch_python.dylib 0x000000014867df38 _ZN5torch8autograd6python12PythonEngine11thread_initEiRKNSt3__110shared_ptrINS0_10ReadyQueueEEEb + 112
15 libtorch_cpu.dylib 0x00000001570a5bb0 ZNSt3__1L14__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN5torch8autograd6EngineEFviRKNS_10shared_ptrINS8_10ReadyQueueEEEaSCbEPS+ReadyQueueEEEaSCbEPS
16 libsystem_pthread.dylib 0x00000001b291e06c _pthread_start + 148
17 libsystem_pthread.dylib 0x00000001b2918e2c thread_start + 8
)
libc++abi: terminating with uncaught exception of type NSException
[1] 8858 abort /Users/dimitardimitrov/miniconda3/envs/pytorch2/bin/python
/Users/dimitardimitrov/miniconda3/envs/pytorch2/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '


This is my environment:


Versions
PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.0 (arm64)
GCC version: Could not collect
Clang version: 14.0.0 (clang-1400.0.29.202)
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.10 (main, Mar 21 2023, 13:41:05) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-13.0-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Pro

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] torch==2.0.0
[pip3] torchaudio==2.0.0
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.15.0
[conda] numpy 1.23.5 py310hb93e574_0
[conda] numpy-base 1.23.5 py310haf87e8b_0
[conda] pytorch 2.0.0 py3.10_0 pytorch
[conda] torchaudio 2.0.0 py310_cpu pytorch
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.15.0 py310_cpu pytorch



I tried running it with a batch size of 4 to make sure it is not a memory issue, and I also ran it on the CPU, which worked. Running with the full model frozen and with the full model trainable both worked as well. The error occurs only when the batchnorm layers alone are frozen.
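To narrow this down, the failing condition described above (trainable layers before a BatchNorm whose own parameters are frozen) can be isolated from ResNet and the training class. The following is a minimal sketch under that assumption. The tiny model and the helper name `backward_with_frozen_bn` are hypothetical, and whether it triggers the same crash on mps in PyTorch 2.0.0 has not been verified here.

```python
import torch
import torch.nn as nn


def backward_with_frozen_bn(device):
    """Run one forward/backward pass with only the BatchNorm parameters frozen."""
    torch.manual_seed(0)
    model = nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),  # trainable, so BatchNorm must backprop to its input
        nn.BatchNorm2d(8),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, 3),
    ).to(device)
    for param in model[1].parameters():  # freeze only the BatchNorm affine parameters
        param.requires_grad = False

    inputs = torch.randn(4, 3, 32, 32, device=device)  # batch size 4, as in the post
    targets = torch.randint(0, 3, (4,), device=device)
    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    # Map each parameter name to whether it received a gradient.
    return {name: p.grad is not None for name, p in model.named_parameters()}


print("cpu:", backward_with_frozen_bn("cpu"))
if torch.backends.mps.is_available():
    print("mps:", backward_with_frozen_bn("mps"))
```

If the mps call fails while the cpu call succeeds, the problem is independent of the dataset, the classification head, and `ModelTrainer`, which makes for a much smaller bug report.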
