Error when freezing BatchNorm layers on the MPS device on an M1 Mac


I've been struggling with this strange error for a few days and can't seem to find a solution. The code below runs perfectly on the CPU, but throws an error when using the MPS device on an MBP 14. It is definitely not a memory issue: the code runs fine when all parameters are trainable, and it only breaks when the BatchNorm layers are frozen while everything else is not.

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision.models.resnet import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
for name, param in model.named_parameters():
    if "bn" in name or "batchnorm" in name.lower():
        param.requires_grad = False
        
# Adding a custom classification head
num_ftrs = model.fc.in_features
model.fc = nn.Sequential(
    nn.Dropout(0.2),
    nn.Linear(num_ftrs, 1024),
    nn.ReLU(inplace=True),
    nn.Dropout(0.2),
    nn.Linear(1024, 512),
    nn.ReLU(inplace=True),
    nn.Dropout(0.2),
    nn.Linear(512, 3),
)

# Then I have a standard training loop stored in a class
trainer = ModelTrainer(model, train_loader, val_loader, num_epochs=1)
model = trainer.train()
history = trainer.history()
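
The loop near the top only sets requires_grad = False on parameters whose names contain "bn". For comparison, freezing BatchNorm is often done at the module level instead, switching the modules to eval() and disabling grads on their parameters. This is just a sketch of that variant (the freeze_batchnorm helper is illustrative, not part of my code), and I have not confirmed whether it avoids the MPS error:

import torch.nn as nn

# Illustrative helper: freeze every BatchNorm module instead of matching parameter names.
def freeze_batchnorm(model: nn.Module) -> None:
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()                     # stop updating running mean/var
            for param in module.parameters():
                param.requires_grad = False   # exclude weight (gamma) and bias (beta) from training

freeze_batchnorm(model)

Note that a trainer calling model.train() puts the BatchNorm modules back into training mode, so eval() would have to be re-applied after that call.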


And here is the error:

Using mps device
Epoch 1/1
Train:   0%|          | 0/38 [00:00<?, ?it/s]
...
    libtorch_cpu.dylib        _ZN3c104impl28wrap_kernel_functor_unboxed_INS0_6detail24WrapFunctionIntoFunctor_INS_26CompileTimeFunctionPointerIFNSt3__15tupleIJN2at6TensorES8_S8_EEENS_14DispatchKeySetERKS8_SC_RKNS_8optionalIS8_EESG_SG_SG_SG_bdNS5_5arrayIbLm3EEEEXadL_ZN5torch8autograd12VariableType12_GLOBAL__N_126native_batch_norm_backwardESA_SC_SC_SG_SG_SG_SG_SG_bdSI_EEEES9_NS_4guts8typelist8typelistIJSA_SC_SC_SG_SG_SG_SG_SG_bdSI_EEEEESJ_E4callEPNS_14OperatorKernelESA_SC_SC_SG_SG_SG_SG_SG_bdSI + 2392
 8  libtorch_cpu.dylib        0x00000001546324ec  _ZN2at4_ops26native_batch_norm_backward4callERKNS_6TensorES4_RKN3c108optionalIS2_EES9_S9_S9_S9_bdNSt3__15arrayIbLm3EEE + 468
 9  libtorch_cpu.dylib        0x00000001560bcce8  _ZN5torch8autograd9generated24NativeBatchNormBackward05applyEONSt3__16vectorIN2at6TensorENS3_9allocatorIS6_EEEE + 884
10  libtorch_cpu.dylib        0x00000001570a2a50  _ZN5torch8autograd4NodeclEONSt3__16vectorIN2at6TensorENS2_9allocatorIS5_EEEE + 120
11  libtorch_cpu.dylib        0x000000015709983c  _ZN5torch8autograd6Engine17evaluate_functionERNSt3__110shared_ptrINS0_9GraphTaskEEEPNS0_4NodeERNS0_11InputBufferERKNS3_INS0_10ReadyQueueEEE + 2932
12  libtorch_cpu.dylib        0x00000001570986e0  _ZN5torch8autograd6Engine11thread_mainERKNSt3__110shared_ptrINS0_9GraphTaskEEE + 640
13  libtorch_cpu.dylib        0x00000001570973c4  _ZN5torch8autograd6Engine11thread_initEiRKNSt3__110shared_ptrINS0_10ReadyQueueEEEb + 336
14  libtorch_python.dylib     0x000000014867df38  _ZN5torch8autograd6python12PythonEngine11thread_initEiRKNSt3__110shared_ptrINS0_10ReadyQueueEEEb + 112
15  libtorch_cpu.dylib        0x00000001570a5bb0  ZNSt3__1L14__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN5torch8autograd6EngineEFviRKNS_10shared_ptrINS8_10ReadyQueueEEEaSCbEPS+ReadyQueueEEEaSCbEPS
16  libsystem_pthread.dylib   0x00000001b291e06c  _pthread_start + 148
17  libsystem_pthread.dylib   0x00000001b2918e2c  thread_start + 8
)
libc++abi: terminating with uncaught exception of type NSException
[1]    8858 abort      /Users/dimitardimitrov/miniconda3/envs/pytorch2/bin/python
/Users/dimitardimitrov/miniconda3/envs/pytorch2/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Here is my environment information:

PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 13.0 (arm64)
GCC version: Could not collect
Clang version: 14.0.0 (clang-1400.0.29.202)
CMake version: Could not collect
Libc version: N/A
Python version: 3.10.10 (main, Mar 21 2023, 13:41:05) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-13.0-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU: Apple M1 Pro

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] torch==2.0.0
[pip3] torchaudio==2.0.0
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.15.0
[conda] numpy 1.23.5 py310hb93e574_0
[conda] numpy-base 1.23.5 py310haf87e8b_0
[conda] pytorch 2.0.0 py3.10_0 pytorch
[conda] torchaudio 2.0.0 py310_cpu pytorch
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.15.0 py310_cpu pytorch

I tried running it with a batch size of 4 to make sure it is not a memory issue, and I also ran it on the CPU, which worked. I ran it with the full model frozen and with the full model fully trainable, and both worked. The error occurs only when just the BatchNorm layers are frozen.
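
Based on the stack trace, the failing pattern seems to be a backward pass through a BatchNorm layer whose affine parameters have requires_grad = False while an upstream layer still needs gradients (the trace ends in native_batch_norm_backward). A minimal standalone sketch of that pattern, to compare CPU and MPS outside the full model, could look like this (the try_backward helper and tensor shapes are just illustrative):

import torch
import torch.nn as nn

def try_backward(device: str) -> None:
    # A trainable conv feeding a BatchNorm whose affine params are frozen,
    # mirroring a frozen BN inside an otherwise trainable backbone.
    conv = nn.Conv2d(8, 8, 3, padding=1).to(device)
    bn = nn.BatchNorm2d(8).to(device)
    for p in bn.parameters():
        p.requires_grad = False  # freeze gamma/beta only, as in my code above
    x = torch.randn(4, 8, 16, 16, device=device)
    loss = bn(conv(x)).sum()
    loss.backward()              # gradients must flow back through the frozen BatchNorm
    print(f"{device}: backward OK")

try_backward("cpu")
if torch.backends.mps.is_available():
    try_backward("mps")

Since the full model only fails on mps for me, a small repro like this should show whether the problem is isolated to the MPS backward of BatchNorm with frozen affine parameters.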
    
macos deep-learning pytorch mps