Jupyter Lab kernel dies before trainer.train() even starts


I am fine-tuning phi-3.5-mini. When I try to run

trainer.train()

I get the following error:

***** Running training *****
  Num examples = 647
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 4
  Total optimization steps = 60
  Number of trainable parameters = 25,165,824

  `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:600: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)

  You are not running the flash-attention implementation, expect numerical differences.
/opt/conda/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
**Error operation not supported at line 383 in file /src/csrc/pythonInterface.cpp**

The key part is: **Error operation not supported at line 383 in file /src/csrc/pythonInterface.cpp**

After that the kernel dies. These are the package versions I am using:

transformers                      4.44.2
torch                             2.4.1
torchaudio                        2.4.1
torchvision                       0.19.1
accelerate                        0.34.2
peft                              0.12.0

The conda version is

24.3.0

I tried running the same code on Google Colab, where it works, but in Jupyter Lab it does not.

python nlp conda huggingface-transformers slm-phi3
1 Answer

This was caused by the optimizer type being used: I was running a paged optimizer on an old CPU, which crashed Jupyter Lab.
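The `/src/csrc/pythonInterface.cpp` path in the error points at bitsandbytes, whose paged optimizers (e.g. `paged_adamw_8bit`) allocate optimizer state through CUDA unified memory; on older hardware that allocation can fail with "operation not supported" and take the kernel down with it. Below is a minimal sketch of the change, assuming the original `TrainingArguments` used a paged optimizer; the `optim` value is the actual fix, while the other arguments are illustrative placeholders matching the training log above.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./phi35-finetune",      # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    # Instead of a bitsandbytes paged optimizer such as
    # optim="paged_adamw_8bit" (which relies on CUDA unified memory
    # and can fail on older hardware), use the plain PyTorch AdamW:
    optim="adamw_torch",
)
```

Switching to `adamw_torch` (or another non-paged optimizer) avoids the bitsandbytes paged-memory code path entirely, at the cost of somewhat higher GPU memory usage for optimizer state.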
