Title: AssertionError in vLLM when loading multiple models onto different GPUs in the same script
Problem:
I'm trying to load the same model onto different GPUs with vLLM from a single Python script, but I hit an error while initializing the second model.

What I'm trying to do:
- Environment: vLLM 0.4.1 with Python 3.11 (full details in the log below).
- Goal: load two instances of the same model (gpt2) onto two different GPUs in the same script. CUDA_VISIBLE_DEVICES=2,5 maps physical GPUs 2 and 5 to the logical devices cuda:0 and cuda:1, and I create two LLM instances, each assigned to a different GPU.

My script:
```python
def main():
    import os
    import sys
    import socket
    print(sys.executable)
    if socket.gethostname() == 'skampere1':
        print('Hardcoding the path since we are in skampere')
        sys.path = ['', '/path/to/env/lib/python311.zip',
                    '/path/to/env/lib/python3.11',
                    '/path/to/env/lib/python3.11/lib-dynload',
                    '/path/to/env/lib/python3.11/site-packages',
                    '/path/to/py_src', '/path/to/ultimate-utils/py_src']
        print(f'{sys.path=}')

    # Clear GPU cache
    import torch
    import gc
    torch.cuda.empty_cache()
    gc.collect()

    from vllm import LLM
    model = 'gpt2'
    print('Allocating model 1 on GPU 0')
    llm1 = LLM(model=model, device='cuda:0')
    print('Allocating model 2 on GPU 1')
    llm2 = LLM(model=model, device='cuda:1')

    print('About to generate with both...')
    while True:
        prompt = "Hello from GPU 0"
        output = llm1.generate([prompt])
        print(f"Output from llm1: {output[0].outputs[0].text}")
        prompt = "Hello from GPU 1"
        output = llm2.generate([prompt])
        print(f"Output from llm2: {output[0].outputs[0].text}")


if __name__ == '__main__':
    import fire
    import time
    start = time.time()
    fire.Fire(main)
    print(f"Done! Time: {time.time()-start:.2f} sec")
```
How I'm Running the Script:
I run the script with the following command, which sets the CUDA_VISIBLE_DEVICES environment variable:

```bash
CUDA_VISIBLE_DEVICES=2,5 python script.py
```
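For reference, a quick sanity check of the remapping with plain PyTorch (a minimal sketch) should report exactly two visible devices when launched the same way:

```python
# Minimal sketch: confirm what the process actually sees under CUDA_VISIBLE_DEVICES=2,5.
# Expected: device_count() == 2, cuda:0 -> physical GPU 2, cuda:1 -> physical GPU 5.
import torch

print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")
```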
The Issue:
When I run the script, the first model initializes correctly on cuda:0 (which should correspond to physical GPU 2). However, when initializing the second LLM instance on cuda:1, I encounter the following error:
```bash
AssertionError: Error in memory profiling. This happens when the GPU memory was not properly cleaned up before initializing the vLLM instance.
```

Full error traceback:

```bash
(beyond_scale_2) brando9@skampere1~/snap-cluster-setup $ CUDA_VISIBLE_DEVICES=2,5 python ~/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/bin/python
Hardcoding the path since we are in skampere
sys.path=['', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python311.zip', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/lib-dynload', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages', '/afs/cs.stanford.edu/u/brando9/beyond-scale-2-alignment-coeff/py_src', '/afs/cs.stanford.edu/u/brando9/ultimate-utils/py_src']
allocating model 1 gpu1
INFO 09-23 12:38:36 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
INFO 09-23 12:38:37 utils.py:608] Found nccl from library /lfs/skampere1/0/brando9/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 09-23 12:38:38 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 09-23 12:38:38 selector.py:33] Using XFormers backend.
INFO 09-23 12:38:39 weight_utils.py:193] Using model weights format ['*.safetensors']
INFO 09-23 12:38:40 model_runner.py:173] Loading model weights took 0.2378 GB
INFO 09-23 12:38:40 gpu_executor.py:119] # GPU blocks: 127654, # CPU blocks: 7281
INFO 09-23 12:38:42 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-23 12:38:42 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-23 12:38:48 model_runner.py:1057] Graph capturing finished in 6 secs.
allocating model 2 gpu2
INFO 09-23 12:38:48 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:1, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
INFO 09-23 12:38:48 weight_utils.py:193] Using model weights format ['*.safetensors']
INFO 09-23 12:38:49 model_runner.py:173] Loading model weights took 0.0000 GB
Traceback (most recent call last):
File "/lfs/skampere1/0/brando9/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py", line 99, in <module>
fire.Fire(main)
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py", line 81, in main
llm2 = LLM(model=model, device=f'cuda:1')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 118, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 277, in from_engine_args
engine = cls(
^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 160, in __init__
self._initialize_kv_caches()
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 236, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 111, in determine_num_available_blocks
return self.driver_worker.determine_num_available_blocks()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/worker/worker.py", line 147, in determine_num_available_blocks
assert peak_memory > 0, (
^^^^^^^^^^^^^^^
AssertionError: Error in memory profiling. This happens when the GPU memory was not properly cleaned up before initializing the vLLM instance.
```
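Reading the traceback, the check that fires is `assert peak_memory > 0` in vLLM's worker during `determine_num_available_blocks`. Below is a rough sketch of my understanding of that check (not copied from the vLLM source); note also that the second engine reports `Loading model weights took 0.0000 GB`, which makes me suspect the two engines in one process are not measuring memory on the devices they think they are:

```python
# Sketch of my understanding of the failing check (not the actual vLLM code):
# the worker snapshots free GPU memory on *its* device at startup, runs a
# profiling forward pass, and expects that pass to have consumed some memory.
import torch

init_free_memory, _ = torch.cuda.mem_get_info()   # snapshot taken at worker init
# ... vLLM runs a profiling forward pass here ...
free_memory_after, total_memory = torch.cuda.mem_get_info()
peak_memory = init_free_memory - free_memory_after
# If the snapshot and the later measurement end up on different devices, or another
# engine in the same process changes GPU memory in between, this can come out <= 0:
assert peak_memory > 0, "Error in memory profiling ..."
```

If that reading is right, a second engine created in the same process can observe free memory that did not shrink (or even grew) relative to its own snapshot, which trips the assertion.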
What I've tried:
- Adding torch.cuda.empty_cache() and gc.collect() before creating the LLM instances.
- Specifying an explicit device ID for each instance (cuda:0 and cuda:1) to match CUDA_VISIBLE_DEVICES=2,5.
- Double-checking that CUDA_VISIBLE_DEVICES is set correctly.

Despite these attempts, the error persists when initializing the second LLM instance.
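One knob I have not tried yet, which vLLM's own log output above suggests, is giving each engine less memory headroom and skipping CUDA graph capture. A sketch only (the keyword arguments come from the log messages; the 0.45 value is an arbitrary example, and I have not verified this fixes the assertion):

```python
# Sketch: knobs suggested by vLLM's own log output (not verified to fix this error).
from vllm import LLM

llm1 = LLM(model='gpt2', device='cuda:0',
           gpu_memory_utilization=0.45,  # leave room for a second engine
           enforce_eager=True)           # skip CUDA graph capture (saves 1-3 GiB per GPU)
```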
My questions:
- How can I load multiple LLM instances onto different GPUs from the same script?
- Is this related to how vLLM handles CUDA_VISIBLE_DEVICES?

Additional information:
- I checked nvidia-smi to make sure GPU memory was free before running the script.
- Running the script with only a single LLM instance (on either cuda:0 or cuda:1) works fine.

Any insights or suggestions on how to resolve this would be greatly appreciated!
OK, this seems to work:
```python
def main2():
    import os
    import sys
    import socket
    import ray
    from ray.util.placement_group import placement_group
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
    from vllm import LLM

    print(sys.executable)
    if socket.gethostname() == 'skampere1':
        print('Hardcoding the path since we are in skampere')
        sys.path = ['', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python311.zip',
                    '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11',
                    '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/lib-dynload',
                    '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages',
                    '/afs/cs.stanford.edu/u/brando9/beyond-scale-2-alignment-coeff/py_src',
                    '/afs/cs.stanford.edu/u/brando9/ultimate-utils/py_src']
        print(f'{sys.path=}')

    # Initialize Ray
    ray.init()

    # Define the number of models (and GPUs) you want to use
    num_models = 2  # Adjust this based on your available GPUs

    # Create a placement group with one GPU and one CPU per bundle
    pg = placement_group(
        name="llm_pg",
        bundles=[{"GPU": 1, "CPU": 1} for _ in range(num_models)],
        strategy="STRICT_PACK"  # or "PACK" or "SPREAD" depending on your needs
    )

    # Wait until the placement group is ready
    ray.get(pg.ready())

    # Define the LLMActor class that will load the LLM model on the assigned GPU
    @ray.remote(num_gpus=1, num_cpus=1)
    class LLMActor:
        def __init__(self, model_name):
            import os
            import torch
            # Get the GPU IDs assigned to this actor by Ray
            gpu_ids = ray.get_gpu_ids()
            # Set CUDA_VISIBLE_DEVICES to limit the GPUs visible to this process
            os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(int(gpu_id)) for gpu_id in gpu_ids)
            # Set the default CUDA device
            torch.cuda.set_device(0)  # Since only one GPU is visible, it's cuda:0
            # Initialize the LLM model
            self.llm = LLM(model=model_name, device="cuda:0")  # Use cuda:0 since only one GPU is visible

        def generate(self, prompt):
            # Generate text using the LLM instance
            outputs = self.llm.generate([prompt])
            return outputs[0].outputs[0].text

    # Model and prompts
    model_name = "gpt2"  # Replace with your model
    prompts = ["Hello from model 1", "Greetings from model 2"]

    # Create LLMActor instances assigned to different bundles in the placement group
    actors = []
    for i in range(num_models):
        # Assign the actor to a specific bundle in the placement group
        actor = LLMActor.options(
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg,
                placement_group_bundle_index=i
            )
        ).remote(model_name)
        actors.append(actor)

    # Generate outputs using the actors
    futures = []
    for actor, prompt in zip(actors, prompts):
        future = actor.generate.remote(prompt)
        futures.append(future)

    # Retrieve and print the outputs
    outputs = ray.get(futures)
    for i, output in enumerate(outputs):
        print(f"Output from model {i+1}: {output}")


main2()
```
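For comparison, a plain-multiprocessing variant should also avoid the problem, since each child process restricts CUDA_VISIBLE_DEVICES to a single physical GPU before importing vLLM. A minimal sketch, not benchmarked (run it without setting CUDA_VISIBLE_DEVICES in the shell so the IDs below refer to physical GPUs):

```python
# Minimal sketch (assumption: one OS process per GPU is acceptable).
# Each child restricts CUDA_VISIBLE_DEVICES *before* importing vllm, so every
# engine sees exactly one GPU, which appears to it as cuda:0.
import multiprocessing as mp
import os


def run_one(physical_gpu, prompt, results):
    # Must happen before any CUDA initialization in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = physical_gpu
    from vllm import LLM  # import after the env var is set
    llm = LLM(model="gpt2")
    out = llm.generate([prompt])
    results.put((physical_gpu, out[0].outputs[0].text))


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # fresh interpreter per child, no inherited CUDA state
    results = ctx.Queue()
    procs = [ctx.Process(target=run_one, args=(gpu, f"Hello from GPU {gpu}", results))
             for gpu in ("2", "5")]  # physical GPU ids on this machine
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    while not results.empty():
        gpu, text = results.get()
        print(f"GPU {gpu}: {text}")
```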
Output:

```bash
(beyond_scale_2) brando9@skampere1~/snap-cluster-setup $ CUDA_VISIBLE_DEVICES=2,5 python ~/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/bin/python
Hardcoding the path since we are in skampere
sys.path=['', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python311.zip', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/lib-dynload', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages', '/afs/cs.stanford.edu/u/brando9/beyond-scale-2-alignment-coeff/py_src', '/afs/cs.stanford.edu/u/brando9/ultimate-utils/py_src']
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/bin/python
Hardcoding the path since we are in skampere
sys.path=['', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python311.zip', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/lib-dynload', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages', '/afs/cs.stanford.edu/u/brando9/beyond-scale-2-alignment-coeff/py_src', '/afs/cs.stanford.edu/u/brando9/ultimate-utils/py_src']
2024-09-23 12:52:58,838 INFO worker.py:1786 -- Started a local Ray instance.
(LLMActor pid=442031) INFO 09-23 12:53:02 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
(LLMActor pid=442031) /lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
(LLMActor pid=442031) warnings.warn(
(LLMActor pid=442031) INFO 09-23 12:53:02 utils.py:608] Found nccl from library /lfs/skampere1/0/brando9/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(LLMActor pid=442031) INFO 09-23 12:53:02 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
(LLMActor pid=442031) INFO 09-23 12:53:02 selector.py:33] Using XFormers backend.
(LLMActor pid=442031) INFO 09-23 12:53:04 weight_utils.py:193] Using model weights format ['*.safetensors']
(LLMActor pid=442031) INFO 09-23 12:53:05 model_runner.py:173] Loading model weights took 0.2378 GB
(LLMActor pid=442031) INFO 09-23 12:53:05 gpu_executor.py:119] # GPU blocks: 127654, # CPU blocks: 7281
(LLMActor pid=442031) INFO 09-23 12:53:08 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(LLMActor pid=442031) INFO 09-23 12:53:08 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(LLMActor pid=442030) INFO 09-23 12:53:02 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
(LLMActor pid=442030) INFO 09-23 12:53:02 utils.py:608] Found nccl from library /lfs/skampere1/0/brando9/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(LLMActor pid=442030) INFO 09-23 12:53:03 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
(LLMActor pid=442030) INFO 09-23 12:53:03 selector.py:33] Using XFormers backend.
(LLMActor pid=442031) INFO 09-23 12:53:11 model_runner.py:1057] Graph capturing finished in 3 secs.
(LLMActor pid=442030) INFO 09-23 12:53:04 weight_utils.py:193] Using model weights format ['*.safetensors']
(LLMActor pid=442030) INFO 09-23 12:53:05 model_runner.py:173] Loading model weights took 0.2378 GB
(LLMActor pid=442030) INFO 09-23 12:53:05 gpu_executor.py:119] # GPU blocks: 127654, # CPU blocks: 7281
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 24.18it/s]
(LLMActor pid=442030) /lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
(LLMActor pid=442030) warnings.warn(
Output from model 1: .103102 ...olla at 5:59 pm tomorrow with zcarc from
Output from model 2: .103. ...olla... What the heck is building with zcar? B
Done! Time: 15.06 sec, 0.25 min, 0.00 hr
(LLMActor pid=442030) INFO 09-23 12:53:08 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(LLMActor pid=442030) INFO 09-23 12:53:08 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(LLMActor pid=442030) INFO 09-23 12:53:11 model_runner.py:1057] Graph capturing finished in 3 secs.
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 25.67it/s]
(beyond_scale_2) brando9@skampere1~/snap-cluster-setup $
```
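My reading of why the Ray version works (not verified against vLLM internals): each LLMActor is its own worker process, and with num_gpus=1 Ray hands that process exactly one GPU, so every vLLM engine initializes and profiles memory on a device that no other engine in the same process has touched. The limitation therefore seems to be multiple engines per process, not multiple GPUs per machine.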