Title: AssertionError in vLLM when loading multiple models onto different GPUs in the same script
Problem:
I'm trying to load the same model onto different GPUs with vLLM from a single Python script, but I hit an error while initializing the second model.

What I'm trying to do:
- Environment: vLLM 0.4.1 with Python 3.11 (full details in the log below).
- Goal: load two instances of the same model (gpt2) onto two different GPUs in the same script. CUDA_VISIBLE_DEVICES=2,5 maps physical GPUs 2 and 5 to the logical devices cuda:0 and cuda:1, and I create two LLM instances, each assigned to a different GPU.

My script:
```python
def main():
    import os
    import sys
    import socket
    print(sys.executable)
    if socket.gethostname() == 'skampere1':
        print('Hardcoding the path since we are in skampere')
        sys.path = ['', '/path/to/env/lib/python311.zip',
                    '/path/to/env/lib/python3.11',
                    '/path/to/env/lib/python3.11/lib-dynload',
                    '/path/to/env/lib/python3.11/site-packages',
                    '/path/to/py_src', '/path/to/ultimate-utils/py_src']
        print(f'{sys.path=}')

    # Clear GPU cache
    import torch
    import gc
    torch.cuda.empty_cache()
    gc.collect()

    from vllm import LLM
    model = 'gpt2'
    print('Allocating model 1 on GPU 0')
    llm1 = LLM(model=model, device='cuda:0')
    print('Allocating model 2 on GPU 1')
    llm2 = LLM(model=model, device='cuda:1')

    print('About to generate with both...')
    while True:
        prompt = "Hello from GPU 0"
        output = llm1.generate([prompt])
        print(f"Output from llm1: {output[0].outputs[0].text}")
        prompt = "Hello from GPU 1"
        output = llm2.generate([prompt])
        print(f"Output from llm2: {output[0].outputs[0].text}")


if __name__ == '__main__':
    import fire
    import time
    start = time.time()
    fire.Fire(main)
    print(f"Done! Time: {time.time()-start:.2f} sec")
```
How I'm Running the Script:
I run the script with the following command, which sets the CUDA_VISIBLE_DEVICES environment variable:

```bash
CUDA_VISIBLE_DEVICES=2,5 python script.py
```
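For reference, a quick sanity check of the remapping with plain PyTorch (a minimal sketch) should report exactly two visible devices when launched the same way:

```python
# Minimal sketch: confirm what the process actually sees under CUDA_VISIBLE_DEVICES=2,5.
# Expected: device_count() == 2, cuda:0 -> physical GPU 2, cuda:1 -> physical GPU 5.
import torch

print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")
```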
The Issue:
When I run the script, the first model initializes correctly on cuda:0 (which should correspond to physical GPU 2). However, when initializing the second LLM instance on cuda:1, I encounter the following error:
```bash
AssertionError: Error in memory profiling. This happens when the GPU memory was not properly cleaned up before initializing the vLLM instance.
```

Full error traceback:

```bash
(beyond_scale_2) brando9@skampere1~/snap-cluster-setup $ CUDA_VISIBLE_DEVICES=2,5 python ~/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/bin/python
Hardcoding the path since we are in skampere
sys.path=['', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python311.zip', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/lib-dynload', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages', '/afs/cs.stanford.edu/u/brando9/beyond-scale-2-alignment-coeff/py_src', '/afs/cs.stanford.edu/u/brando9/ultimate-utils/py_src']
allocating model 1 gpu1
INFO 09-23 12:38:36 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
INFO 09-23 12:38:37 utils.py:608] Found nccl from library /lfs/skampere1/0/brando9/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 09-23 12:38:38 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 09-23 12:38:38 selector.py:33] Using XFormers backend.
INFO 09-23 12:38:39 weight_utils.py:193] Using model weights format ['*.safetensors']
INFO 09-23 12:38:40 model_runner.py:173] Loading model weights took 0.2378 GB
INFO 09-23 12:38:40 gpu_executor.py:119] # GPU blocks: 127654, # CPU blocks: 7281
INFO 09-23 12:38:42 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-23 12:38:42 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-23 12:38:48 model_runner.py:1057] Graph capturing finished in 6 secs.
allocating model 2 gpu2
INFO 09-23 12:38:48 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:1, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
INFO 09-23 12:38:48 weight_utils.py:193] Using model weights format ['*.safetensors']
INFO 09-23 12:38:49 model_runner.py:173] Loading model weights took 0.0000 GB
Traceback (most recent call last):
File "/lfs/skampere1/0/brando9/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py", line 99, in <module>
fire.Fire(main)
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py", line 81, in main
llm2 = LLM(model=model, device=f'cuda:1')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 118, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 277, in from_engine_args
engine = cls(
^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 160, in __init__
self._initialize_kv_caches()
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 236, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 111, in determine_num_available_blocks
return self.driver_worker.determine_num_available_blocks()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/worker/worker.py", line 147, in determine_num_available_blocks
assert peak_memory > 0, (
^^^^^^^^^^^^^^^
AssertionError: Error in memory profiling. This happens when the GPU memory was not properly cleaned up before initializing the vLLM instance.
```
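Reading the traceback, the check that fires is `assert peak_memory > 0` in vLLM's worker during `determine_num_available_blocks`. Below is a rough sketch of my understanding of that check (not copied from the vLLM source); note also that the second engine reports `Loading model weights took 0.0000 GB`, which makes me suspect the two engines in one process are not measuring memory on the devices they think they are:

```python
# Sketch of my understanding of the failing check (not the actual vLLM code):
# the worker snapshots free GPU memory on *its* device at startup, runs a
# profiling forward pass, and expects that pass to have consumed some memory.
import torch

init_free_memory, _ = torch.cuda.mem_get_info()   # snapshot taken at worker init
# ... vLLM runs a profiling forward pass here ...
free_memory_after, total_memory = torch.cuda.mem_get_info()
peak_memory = init_free_memory - free_memory_after
# If the snapshot and the later measurement end up on different devices, or another
# engine in the same process changes GPU memory in between, this can come out <= 0:
assert peak_memory > 0, "Error in memory profiling ..."
```

If that reading is right, a second engine created in the same process can observe free memory that did not shrink (or even grew) relative to its own snapshot, which trips the assertion.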
What I've tried:
- Adding torch.cuda.empty_cache() and gc.collect() before creating the LLM instances.
- Specifying an explicit device ID for each instance (cuda:0 and cuda:1) to match CUDA_VISIBLE_DEVICES=2,5.
- Double-checking that CUDA_VISIBLE_DEVICES is set correctly.

Despite these attempts, the error persists when initializing the second LLM instance.
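One knob I have not tried yet, which vLLM's own log output above suggests, is giving each engine less memory headroom and skipping CUDA graph capture. A sketch only (the keyword arguments come from the log messages; the 0.45 value is an arbitrary example, and I have not verified this fixes the assertion):

```python
# Sketch: knobs suggested by vLLM's own log output (not verified to fix this error).
from vllm import LLM

llm1 = LLM(model='gpt2', device='cuda:0',
           gpu_memory_utilization=0.45,  # leave room for a second engine
           enforce_eager=True)           # skip CUDA graph capture (saves 1-3 GiB per GPU)
```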
My questions:
- How can I load multiple LLM instances onto different GPUs from the same script?
- Is this related to how vLLM handles CUDA_VISIBLE_DEVICES?

Additional information:
- I checked nvidia-smi to make sure GPU memory was free before running the script.
- Running the script with only a single LLM instance (on either cuda:0 or cuda:1) works fine.

Any insights or suggestions on how to resolve this would be greatly appreciated!
OK, this seems to work:
```python
def main2():
    import os
    import sys
    import socket
    import ray
    from ray.util.placement_group import placement_group
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
    from vllm import LLM

    print(sys.executable)
    if socket.gethostname() == 'skampere1':
        print('Hardcoding the path since we are in skampere')
        sys.path = ['', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python311.zip',
                    '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11',
                    '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/lib-dynload',
                    '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages',
                    '/afs/cs.stanford.edu/u/brando9/beyond-scale-2-alignment-coeff/py_src',
                    '/afs/cs.stanford.edu/u/brando9/ultimate-utils/py_src']
        print(f'{sys.path=}')

    # Initialize Ray
    ray.init()

    # Define the number of models (and GPUs) you want to use
    num_models = 2  # Adjust this based on your available GPUs

    # Create a placement group with one GPU and one CPU per bundle
    pg = placement_group(
        name="llm_pg",
        bundles=[{"GPU": 1, "CPU": 1} for _ in range(num_models)],
        strategy="STRICT_PACK"  # or "PACK" or "SPREAD" depending on your needs
    )

    # Wait until the placement group is ready
    ray.get(pg.ready())

    # Define the LLMActor class that will load the LLM model on the assigned GPU
    @ray.remote(num_gpus=1, num_cpus=1)
    class LLMActor:
        def __init__(self, model_name):
            import os
            import torch
            # Get the GPU IDs assigned to this actor by Ray
            gpu_ids = ray.get_gpu_ids()
            # Set CUDA_VISIBLE_DEVICES to limit the GPUs visible to this process
            os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(int(gpu_id)) for gpu_id in gpu_ids)
            # Set the default CUDA device
            torch.cuda.set_device(0)  # Since only one GPU is visible, it's cuda:0
            # Initialize the LLM model
            self.llm = LLM(model=model_name, device="cuda:0")  # Use cuda:0 since only one GPU is visible

        def generate(self, prompt):
            # Generate text using the LLM instance
            outputs = self.llm.generate([prompt])
            return outputs[0].outputs[0].text

    # Model and prompts
    model_name = "gpt2"  # Replace with your model
    prompts = ["Hello from model 1", "Greetings from model 2"]

    # Create LLMActor instances assigned to different bundles in the placement group
    actors = []
    for i in range(num_models):
        # Assign the actor to a specific bundle in the placement group
        actor = LLMActor.options(
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg,
                placement_group_bundle_index=i
            )
        ).remote(model_name)
        actors.append(actor)

    # Generate outputs using the actors
    futures = []
    for actor, prompt in zip(actors, prompts):
        future = actor.generate.remote(prompt)
        futures.append(future)

    # Retrieve and print the outputs
    outputs = ray.get(futures)
    for i, output in enumerate(outputs):
        print(f"Output from model {i+1}: {output}")


main2()
```
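For comparison, a plain-multiprocessing variant should also avoid the problem, since each child process restricts CUDA_VISIBLE_DEVICES to a single physical GPU before importing vLLM. A minimal sketch, not benchmarked (run it without setting CUDA_VISIBLE_DEVICES in the shell so the IDs below refer to physical GPUs):

```python
# Minimal sketch (assumption: one OS process per GPU is acceptable).
# Each child restricts CUDA_VISIBLE_DEVICES *before* importing vllm, so every
# engine sees exactly one GPU, which appears to it as cuda:0.
import multiprocessing as mp
import os


def run_one(physical_gpu, prompt, results):
    # Must happen before any CUDA initialization in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = physical_gpu
    from vllm import LLM  # import after the env var is set
    llm = LLM(model="gpt2")
    out = llm.generate([prompt])
    results.put((physical_gpu, out[0].outputs[0].text))


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # fresh interpreter per child, no inherited CUDA state
    results = ctx.Queue()
    procs = [ctx.Process(target=run_one, args=(gpu, f"Hello from GPU {gpu}", results))
             for gpu in ("2", "5")]  # physical GPU ids on this machine
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    while not results.empty():
        gpu, text = results.get()
        print(f"GPU {gpu}: {text}")
```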
Output:

```bash
(beyond_scale_2) brando9@skampere1~/snap-cluster-setup $ CUDA_VISIBLE_DEVICES=2,5 python ~/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/bin/python
Hardcoding the path since we are in skampere
sys.path=['', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python311.zip', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/lib-dynload', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages', '/afs/cs.stanford.edu/u/brando9/beyond-scale-2-alignment-coeff/py_src', '/afs/cs.stanford.edu/u/brando9/ultimate-utils/py_src']
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/bin/python
Hardcoding the path since we are in skampere
sys.path=['', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python311.zip', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/lib-dynload', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages', '/afs/cs.stanford.edu/u/brando9/beyond-scale-2-alignment-coeff/py_src', '/afs/cs.stanford.edu/u/brando9/ultimate-utils/py_src']
2024-09-23 12:52:58,838 INFO worker.py:1786 -- Started a local Ray instance.
(LLMActor pid=442031) INFO 09-23 12:53:02 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
(LLMActor pid=442031) /lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
(LLMActor pid=442031) warnings.warn(
(LLMActor pid=442031) INFO 09-23 12:53:02 utils.py:608] Found nccl from library /lfs/skampere1/0/brando9/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(LLMActor pid=442031) INFO 09-23 12:53:02 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
(LLMActor pid=442031) INFO 09-23 12:53:02 selector.py:33] Using XFormers backend.
(LLMActor pid=442031) INFO 09-23 12:53:04 weight_utils.py:193] Using model weights format ['*.safetensors']
(LLMActor pid=442031) INFO 09-23 12:53:05 model_runner.py:173] Loading model weights took 0.2378 GB
(LLMActor pid=442031) INFO 09-23 12:53:05 gpu_executor.py:119] # GPU blocks: 127654, # CPU blocks: 7281
(LLMActor pid=442031) INFO 09-23 12:53:08 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(LLMActor pid=442031) INFO 09-23 12:53:08 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(LLMActor pid=442030) INFO 09-23 12:53:02 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
(LLMActor pid=442030) INFO 09-23 12:53:02 utils.py:608] Found nccl from library /lfs/skampere1/0/brando9/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(LLMActor pid=442030) INFO 09-23 12:53:03 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
(LLMActor pid=442030) INFO 09-23 12:53:03 selector.py:33] Using XFormers backend.
(LLMActor pid=442031) INFO 09-23 12:53:11 model_runner.py:1057] Graph capturing finished in 3 secs.
(LLMActor pid=442030) INFO 09-23 12:53:04 weight_utils.py:193] Using model weights format ['*.safetensors']
(LLMActor pid=442030) INFO 09-23 12:53:05 model_runner.py:173] Loading model weights took 0.2378 GB
(LLMActor pid=442030) INFO 09-23 12:53:05 gpu_executor.py:119] # GPU blocks: 127654, # CPU blocks: 7281
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 24.18it/s]
(LLMActor pid=442030) /lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
(LLMActor pid=442030) warnings.warn(
Output from model 1: .103102 ...olla at 5:59 pm tomorrow with zcarc from
Output from model 2: .103. ...olla... What the heck is building with zcar? B
Done! Time: 15.06 sec, 0.25 min, 0.00 hr
(LLMActor pid=442030) INFO 09-23 12:53:08 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(LLMActor pid=442030) INFO 09-23 12:53:08 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(LLMActor pid=442030) INFO 09-23 12:53:11 model_runner.py:1057] Graph capturing finished in 3 secs.
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 25.67it/s]
(beyond_scale_2) brando9@skampere1~/snap-cluster-setup $
```
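My reading of why the Ray version works (not verified against vLLM internals): each LLMActor is its own worker process, and with num_gpus=1 Ray hands that process exactly one GPU, so every vLLM engine initializes and profiles memory on a device that no other engine in the same process has touched. The limitation therefore seems to be multiple engines per process, not multiple GPUs per machine.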