I'm fine-tuning phi-3.5-mini, and when I try to run `trainer.train()` I get the following error:
```
***** Running training *****
Num examples = 647
Num Epochs = 3
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 4
Total optimization steps = 60
Number of trainable parameters = 25,165,824
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:600: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
You are not running the flash-attention implementation, expect numerical differences.
/opt/conda/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
```
**Error operation not supported at line 383 in file /src/csrc/pythonInterface.cpp**
The key part is: "Error operation not supported at line 383 in file /src/csrc/pythonInterface.cpp". After that the kernel dies. These are the package versions I'm using:
transformers 4.44.2
torch 2.4.1
torchaudio 2.4.1
torchvision 0.19.1
accelerate 0.34.2
peft 0.12.0
The conda version is 24.3.0.
I tried running the same code on Google Colab and it runs there, but it does not work in JupyterLab.
It turned out to be caused by the optimizer type: I was using a paged optimizer on an old CPU, and that is what crashed JupyterLab.
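A minimal sketch of the fix, assuming the `Trainer` was configured with a bitsandbytes paged optimizer (e.g. `optim="paged_adamw_8bit"`): switch to the pure-PyTorch AdamW via the `optim` field, and pass `use_reentrant` explicitly to silence the checkpoint warning from the log. The `pick_optimizer` helper is hypothetical, just to make the fallback explicit; the kwargs shown would be passed to `transformers.TrainingArguments`.

```python
# Hypothetical helper: choose the Trainer optimizer name depending on
# whether bitsandbytes paged optimizers are usable on this machine.
def pick_optimizer(paged_supported: bool) -> str:
    # Paged optimizers rely on bitsandbytes' paged-memory support; on
    # some older CPUs they abort with "operation not supported" in
    # /src/csrc/pythonInterface.cpp. Fall back to pure-PyTorch AdamW.
    return "paged_adamw_8bit" if paged_supported else "adamw_torch"

training_kwargs = dict(
    output_dir="out",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    optim=pick_optimizer(paged_supported=False),  # -> "adamw_torch"
    gradient_checkpointing=True,
    # Be explicit to avoid the torch.utils.checkpoint UserWarning:
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
# Then: args = transformers.TrainingArguments(**training_kwargs)
```

With `optim="adamw_torch"` the optimizer state lives in ordinary GPU/CPU tensors, so nothing touches the paged-memory code path that was crashing the kernel.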