How do I load multiple models onto multiple GPUs with vLLM, using LLM objects in a single script?


I'm trying to use vLLM to load the same model onto different GPUs within a single Python script, but I hit an error when initializing the second model.


What I'm trying to do:

  • Environment:
    • vLLM version: 0.4.1
    • Python version: 3.11
    • CUDA version: (not specified)
    • PyTorch version: (not specified)
    • GPUs: physical GPUs 2 and 5 (the ones I want to use)
  • Goal:
    • Load two instances of the same model (gpt2) onto two different GPUs from the same script.
    • Use CUDA_VISIBLE_DEVICES=2,5 to map physical GPUs 2 and 5 to the logical devices cuda:0 and cuda:1 (see the small sanity check after this list).
    • Initialize two vLLM LLM instances, one per GPU.
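For reference, a quick sanity check of that mapping (a minimal sketch, not part of the failing script, assuming the process is launched with CUDA_VISIBLE_DEVICES=2,5):

```python
# With CUDA_VISIBLE_DEVICES=2,5, torch should see exactly two devices:
# cuda:0 -> physical GPU 2 and cuda:1 -> physical GPU 5.
import torch

assert torch.cuda.device_count() == 2
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")
```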

My script:

```python
def main():
    import os
    import sys
    import socket
    print(sys.executable)
    if socket.gethostname() == 'skampere1':
        print('Hardcoding the path since we are in skampere')
        sys.path = ['', '/path/to/env/lib/python311.zip',
                    '/path/to/env/lib/python3.11',
                    '/path/to/env/lib/python3.11/lib-dynload',
                    '/path/to/env/lib/python3.11/site-packages',
                    '/path/to/py_src', '/path/to/ultimate-utils/py_src']
        print(f'{sys.path=}')

    # Clear GPU cache
    import torch
    import gc
    torch.cuda.empty_cache()
    gc.collect()

    from vllm import LLM
    model = 'gpt2'

    print('Allocating model 1 on GPU 0')
    llm1 = LLM(model=model, device='cuda:0')

    print('Allocating model 2 on GPU 1')
    llm2 = LLM(model=model, device='cuda:1')

    print('About to generate with both...')
    while True:
        prompt = "Hello from GPU 0"
        output = llm1.generate([prompt])
        print(f"Output from llm1: {output[0].outputs[0].text}")

        prompt = "Hello from GPU 1"
        output = llm2.generate([prompt])
        print(f"Output from llm2: {output[0].outputs[0].text}")

if __name__ == '__main__':
    import fire
    import time
    start = time.time()
    fire.Fire(main)
    print(f"Done! Time: {time.time()-start:.2f} sec")

How I run the script:

I run the script with the following command, setting the CUDA_VISIBLE_DEVICES environment variable on the command line:

```bash
CUDA_VISIBLE_DEVICES=2,5 python script.py
```

The Issue:

When I run the script, the first model initializes correctly on cuda:0 (which should correspond to physical GPU 2). However, when initializing the second LLM instance on cuda:1, I encounter the following error:

```bash
AssertionError: Error in memory profiling. This happens when the GPU memory was not properly cleaned up before initializing the vLLM instance.
```

Full error traceback:

```bash
(beyond_scale_2) brando9@skampere1~/snap-cluster-setup $ CUDA_VISIBLE_DEVICES=2,5 python ~/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py

/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/bin/python
Hardcoding the path since we are in skampere
sys.path=['', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python311.zip', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/lib-dynload', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages', '/afs/cs.stanford.edu/u/brando9/beyond-scale-2-alignment-coeff/py_src', '/afs/cs.stanford.edu/u/brando9/ultimate-utils/py_src']
allocating model 1 gpu1
INFO 09-23 12:38:36 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
INFO 09-23 12:38:37 utils.py:608] Found nccl from library /lfs/skampere1/0/brando9/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 09-23 12:38:38 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 09-23 12:38:38 selector.py:33] Using XFormers backend.
INFO 09-23 12:38:39 weight_utils.py:193] Using model weights format ['*.safetensors']
INFO 09-23 12:38:40 model_runner.py:173] Loading model weights took 0.2378 GB
INFO 09-23 12:38:40 gpu_executor.py:119] # GPU blocks: 127654, # CPU blocks: 7281
INFO 09-23 12:38:42 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-23 12:38:42 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-23 12:38:48 model_runner.py:1057] Graph capturing finished in 6 secs.
allocating model 2 gpu2
INFO 09-23 12:38:48 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:1, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
INFO 09-23 12:38:48 weight_utils.py:193] Using model weights format ['*.safetensors']
INFO 09-23 12:38:49 model_runner.py:173] Loading model weights took 0.0000 GB
Traceback (most recent call last):
  File "/lfs/skampere1/0/brando9/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py", line 99, in <module>
    fire.Fire(main)
  File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py", line 81, in main
    llm2 = LLM(model=model, device=f'cuda:1')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 118, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 277, in from_engine_args
    engine = cls(
             ^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 160, in __init__
    self._initialize_kv_caches()
  File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 236, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 111, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/worker/worker.py", line 147, in determine_num_available_blocks
    assert peak_memory > 0, (
           ^^^^^^^^^^^^^^^
AssertionError: Error in memory profiling. This happens when the GPU memory was not properly cleaned up before initializing the vLLM instance.
```

What I've tried:

  • Clearing GPU memory:
    • Added torch.cuda.empty_cache() and gc.collect() before initializing the second LLM instance.
  • Device specification:
    • Made sure the correct device IDs (cuda:0 and cuda:1) are specified after setting CUDA_VISIBLE_DEVICES=2,5.
  • Environment variable:
    • Confirmed that CUDA_VISIBLE_DEVICES is set correctly at the very start of the script, before any CUDA-related libraries are imported (roughly the sketch shown after this list).
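For completeness, the in-script environment-variable attempt looked roughly like this (a minimal sketch; in the runs shown here the variable is actually set on the command line instead):

```python
# Sketch: setting CUDA_VISIBLE_DEVICES from inside the script.
# This only takes effect if it runs before torch / vLLM are imported
# anywhere in the process, because CUDA reads the variable once, at
# initialization time.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,5"  # physical GPUs 2 and 5

import torch          # import only after the variable is set
from vllm import LLM  # same for vLLM
```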

Despite these attempts, the error persists when initializing the second LLM instance.


Questions:

  1. Is it possible to load multiple LLM instances onto different GPUs within the same script using vLLM?
  2. Am I specifying the devices correctly after setting CUDA_VISIBLE_DEVICES?
  3. Is this issue related to how vLLM or the underlying libraries handle GPU devices within a single process?

Additional information:

  • GPU availability: Both GPUs are free and not in use by other processes.
  • Monitoring: Checked nvidia-smi to confirm GPU memory was available before running the script.
  • Single-model initialization: The script works fine when I run it with just one LLM instance (on either cuda:0 or cuda:1).

Any insight or suggestions on how to resolve this would be greatly appreciated!

machine-learning nlp cuda vllm
1 Answer

0 votes

OK, this seems to work. The key difference is that each vLLM engine is created inside its own Ray actor, i.e. its own process, so the second engine's memory profiling never sees the allocations made by the first one:

    def main2():
        import os
        import sys
        import socket
        import ray
        from ray.util.placement_group import placement_group
        from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
        from vllm import LLM

        print(sys.executable)
        if socket.gethostname() == 'skampere1':
            print('Hardcoding the path since we are in skampere')
            sys.path = ['', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python311.zip',
                        '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11',
                        '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/lib-dynload',
                        '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages',
                        '/afs/cs.stanford.edu/u/brando9/beyond-scale-2-alignment-coeff/py_src',
                        '/afs/cs.stanford.edu/u/brando9/ultimate-utils/py_src']
            print(f'{sys.path=}')

        # Initialize Ray
        ray.init()

        # Define the number of models (and GPUs) you want to use
        num_models = 2  # Adjust this based on your available GPUs

        # Create a placement group with one GPU and one CPU per bundle
        pg = placement_group(
            name="llm_pg",
            bundles=[{"GPU": 1, "CPU": 1} for _ in range(num_models)],
            strategy="STRICT_PACK"  # or "PACK" or "SPREAD" depending on your needs
        )
        # Wait until the placement group is ready
        ray.get(pg.ready())

        # Define the LLMActor class that will load the LLM model on the assigned GPU
        @ray.remote(num_gpus=1, num_cpus=1)
        class LLMActor:
            def __init__(self, model_name):
                import os
                import torch

                # Get the GPU IDs assigned to this actor by Ray
                gpu_ids = ray.get_gpu_ids()
                # Set CUDA_VISIBLE_DEVICES to limit the GPUs visible to this process
                os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(int(gpu_id)) for gpu_id in gpu_ids)
                # Set the default CUDA device
                torch.cuda.set_device(0)  # Since only one GPU is visible, it's cuda:0
                # Initialize the LLM model
                self.llm = LLM(model=model_name, device="cuda:0")  # Use cuda:0 since only one GPU is visible

            def generate(self, prompt):
                # Generate text using the LLM instance
                outputs = self.llm.generate([prompt])
                return outputs[0].outputs[0].text

        # Main function
        model_name = "gpt2"  # Replace with your model
        prompts = ["Hello from model 1", "Greetings from model 2"]

        # Create LLMActor instances assigned to different bundles in the placement group
        actors = []
        for i in range(num_models):
            # Assign the actor to a specific bundle in the placement group
            actor = LLMActor.options(
                scheduling_strategy=PlacementGroupSchedulingStrategy(
                    placement_group=pg,
                    placement_group_bundle_index=i
                )
            ).remote(model_name)
            actors.append(actor)

        # Generate outputs using the actors
        futures = []
        for actor, prompt in zip(actors, prompts):
            future = actor.generate.remote(prompt)
            futures.append(future)

        # Retrieve and print the outputs
        outputs = ray.get(futures)
        for i, output in enumerate(outputs):
            print(f"Output from model {i+1}: {output}")

    main2()

Output:

```bash
(beyond_scale_2) brando9@skampere1~/snap-cluster-setup $ CUDA_VISIBLE_DEVICES=2,5 python ~/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py

/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/bin/python
Hardcoding the path since we are in skampere
sys.path=['', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python311.zip', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/lib-dynload', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages', '/afs/cs.stanford.edu/u/brando9/beyond-scale-2-alignment-coeff/py_src', '/afs/cs.stanford.edu/u/brando9/ultimate-utils/py_src']
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/bin/python
Hardcoding the path since we are in skampere
sys.path=['', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python311.zip', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/lib-dynload', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages', '/afs/cs.stanford.edu/u/brando9/beyond-scale-2-alignment-coeff/py_src', '/afs/cs.stanford.edu/u/brando9/ultimate-utils/py_src']
2024-09-23 12:52:58,838 INFO worker.py:1786 -- Started a local Ray instance.
(LLMActor pid=442031) INFO 09-23 12:53:02 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
(LLMActor pid=442031) /lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
(LLMActor pid=442031)   warnings.warn(
(LLMActor pid=442031) INFO 09-23 12:53:02 utils.py:608] Found nccl from library /lfs/skampere1/0/brando9/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(LLMActor pid=442031) INFO 09-23 12:53:02 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
(LLMActor pid=442031) INFO 09-23 12:53:02 selector.py:33] Using XFormers backend.
(LLMActor pid=442031) INFO 09-23 12:53:04 weight_utils.py:193] Using model weights format ['*.safetensors']
(LLMActor pid=442031) INFO 09-23 12:53:05 model_runner.py:173] Loading model weights took 0.2378 GB
(LLMActor pid=442031) INFO 09-23 12:53:05 gpu_executor.py:119] # GPU blocks: 127654, # CPU blocks: 7281
(LLMActor pid=442031) INFO 09-23 12:53:08 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(LLMActor pid=442031) INFO 09-23 12:53:08 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(LLMActor pid=442030) INFO 09-23 12:53:02 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
(LLMActor pid=442030) INFO 09-23 12:53:02 utils.py:608] Found nccl from library /lfs/skampere1/0/brando9/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(LLMActor pid=442030) INFO 09-23 12:53:03 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
(LLMActor pid=442030) INFO 09-23 12:53:03 selector.py:33] Using XFormers backend.
(LLMActor pid=442031) INFO 09-23 12:53:11 model_runner.py:1057] Graph capturing finished in 3 secs.
(LLMActor pid=442030) INFO 09-23 12:53:04 weight_utils.py:193] Using model weights format ['*.safetensors']
(LLMActor pid=442030) INFO 09-23 12:53:05 model_runner.py:173] Loading model weights took 0.2378 GB
(LLMActor pid=442030) INFO 09-23 12:53:05 gpu_executor.py:119] # GPU blocks: 127654, # CPU blocks: 7281
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 24.18it/s]
(LLMActor pid=442030) /lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
(LLMActor pid=442030)   warnings.warn(
Output from model 1: .103102 ...olla at 5:59 pm tomorrow with zcarc from
Output from model 2: .103. ...olla... What the heck is building with zcar? B
Done! Time: 15.06 sec, 0.25 min, 0.00 hr
(LLMActor pid=442030) INFO 09-23 12:53:08 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(LLMActor pid=442030) INFO 09-23 12:53:08 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(LLMActor pid=442030) INFO 09-23 12:53:11 model_runner.py:1057] Graph capturing finished in 3 secs.
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 25.67it/s]
(beyond_scale_2) brando9@skampere1~/snap-cluster-setup $
```
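If you don't want to pull in Ray, the same idea (one process per GPU, so each vLLM engine only ever sees a single clean device) can be sketched with the standard library's multiprocessing. This is just a minimal sketch under the same launch command as above (CUDA_VISIBLE_DEVICES=2,5); the worker function and its names are illustrative, not part of vLLM's API:

```python
# Minimal sketch: one child process per GPU. Each child restricts itself to a
# single physical device *before* torch/vLLM are imported, so each engine
# profiles a clean GPU and sees it as cuda:0.
import multiprocessing as mp
import os

def worker(physical_gpu: str, model_name: str, prompt: str) -> None:
    # Pin this process to one physical GPU before importing vLLM/torch.
    os.environ["CUDA_VISIBLE_DEVICES"] = physical_gpu
    from vllm import LLM
    llm = LLM(model=model_name)  # only one GPU is visible here
    out = llm.generate([prompt])
    print(f"[GPU {physical_gpu}] {out[0].outputs[0].text!r}")

if __name__ == "__main__":
    # e.g. launched with CUDA_VISIBLE_DEVICES=2,5 -> gpus == ["2", "5"]
    gpus = os.environ.get("CUDA_VISIBLE_DEVICES", "0,1").split(",")
    mp.set_start_method("spawn")  # fresh interpreters, no inherited CUDA state
    procs = [mp.Process(target=worker, args=(gpu, "gpt2", f"Hello from GPU {gpu}"))
             for gpu in gpus[:2]]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Either way, the underlying fix is process isolation: a single process that already holds one vLLM engine does not give a second engine a clean GPU to profile, which appears to be what trips the assert peak_memory > 0 check in the traceback above.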