File locking problem when loading a pretrained model with DeepSpeed and Hugging Face Transformers


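I am currently working on a project that involves the MobileVLM model, and I use the Hugging Face Transformers library to load the pretrained model. When I run my script on a SLURM cluster, I get the following error: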

Exception ignored in atexit callback: <function matmul_ext_update_autotune_table at 0x7fbb99fa4ca0>
Traceback (most recent call last):
  File "/public/home/swun-caiy2/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/public/home/swun-caiy2/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 421, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_2d_kernel", __class__._2d_kernel)
  File "/public/home/swun-caiy2/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 150, in _update_autotune_table
    cache_manager.put(autotune_table)
  File "/public/home/swun-caiy2/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 66, in put
    with FileLock(self.lock_path):
  File "/public/home/swun-caiy2/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/filelock/_api.py", line 297, in __enter__
    self.acquire()
  File "/public/home/swun-caiy2/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/filelock/_api.py", line 255, in acquire
    self._acquire()
  File "/public/home/swun-caiy2/miniconda3/envs/mobilevlm/lib/python3.10/site-packages/filelock/_unix.py", line 48, in _acquire
    raise NotImplementedError(msg) from exception
NotImplementedError: FileSystem does not appear to support flock; user SoftFileLock instead

Here is the relevant part of the script I run on the SLURM cluster:

from scripts.inference import inference_once
import torch
# model_path = "mtgv/MobileVLM-1.7B" # finetune

model_path = "/public/home/swun-caiy2/wensm/mobilevlm-v2/MobileVLM-main/mtgv"
image_file = "assets/samples/my_book.jpg"
# prompt_str = "who are you?\nIgnore the content of uploading pictures when answering questions."
prompt_str = "What is the title of this book?"
# (or) What is the title of this book?
# (or) Is this book related to Education & Teaching?

torch.cuda.set_device(0)

args = type('Args', (), {
    "model_path": model_path,
    "image_file": image_file,
    "prompt": prompt_str,
    "conv_mode": "v1",
    "temperature": 0, 
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
    "load_8bit": False,
    "load_4bit": False,
})()

inference_once(args)

How can I resolve this file locking problem on a file system that does not support flock?

huggingface-transformers slurm
1 Answer

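If you have access to a directory on a file system that supports file locking, you can point the cache directory's environment variable at that directory, for example: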

$ export HF_HOME="/path/to/directory/with/file/locking"
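
To find out which directories on a cluster node actually support flock, a small probe can help. The following is a minimal sketch, not part of the original answer: the helper names `supports_flock`, `pick_lockable_dir` and `configure_cache_env` are hypothetical, and the probe simply tries the same non-blocking `fcntl.flock` call that appears to fail in the traceback above. The environment variable has to be set before `transformers` (or `deepspeed`) is imported for it to take effect.

```python
import fcntl
import os
import tempfile


def supports_flock(directory):
    """Return True if flock() works on a temporary file inside `directory`."""
    os.makedirs(directory, exist_ok=True)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # The temp file is brand new, so the lock cannot be contended;
        # an OSError here means the file system rejects flock().
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
        os.remove(path)


def pick_lockable_dir(candidates):
    """Return the first candidate directory that supports flock(), else None."""
    for directory in candidates:
        if not directory:
            continue  # skip None or empty entries
        try:
            if supports_flock(directory):
                return directory
        except OSError:
            continue  # directory could not be created or written to
    return None


def configure_cache_env(candidates, env_vars=("HF_HOME",)):
    """Point each variable in `env_vars` at a lockable directory, if one exists.

    Which variables matter depends on which library owns the cache;
    HF_HOME is the one named in the answer above.
    """
    chosen = pick_lockable_dir(candidates)
    if chosen is not None:
        for name in env_vars:
            os.environ[name] = chosen
    return chosen


if __name__ == "__main__":
    # Node-local temp storage is only an example candidate (an assumption).
    print(configure_cache_env([tempfile.gettempdir()]))
```

Note that the traceback above originates in DeepSpeed's Triton autotune cache (`matmul_ext.py`) rather than in the Hugging Face cache, so `HF_HOME` alone may not be enough. If that is the case, the same helper could be called with whichever variable controls that directory; this is believed to be `TRITON_CACHE_DIR`, but treat that as an assumption and check it against the installed DeepSpeed version.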