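Like many companies these days, we run a local LLM on our own internal servers. We chose Mixtral 8x7B because it seemed to offer the best value for money among "French-speaking" models. The model itself works well, but I have to say I am quite disappointed with the performance: with 10 concurrent requests, token generation speed drops significantly, which makes the experience very unpleasant.
I am running the model with vLLM on a dual NVIDIA H100 PCIe setup. You can see the startup log below: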
INFO 07-01 06:28:12 api_server.py:177] vLLM API server version 0.5.0
INFO 07-01 06:28:12 api_server.py:178] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='safetensors', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, 
speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
/root/.cache/pypoetry/virtualenvs/code-MATOk_fk-py3.11/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
2024-07-01 06:28:16,256 INFO worker.py:1753 -- Started a local Ray instance.
INFO 07-01 06:28:17 config.py:623] Defaulting to use mp for distributed inference
INFO 07-01 06:28:17 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.SAFETENSORS, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=mistralai/Mixtral-8x7B-Instruct-v0.1)
(VllmWorkerProcess pid=3627) INFO 07-01 06:28:21 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 07-01 06:28:21 utils.py:623] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3627) INFO 07-01 06:28:21 utils.py:623] Found nccl from library libnccl.so.2
INFO 07-01 06:28:21 pynccl.py:65] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3627) INFO 07-01 06:28:21 pynccl.py:65] vLLM is using nccl==2.20.5
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/multiprocessing/resource_tracker.py", line 239, in main
    cache[rtype].remove(name)
KeyError: '/psm_7a3085ca'
INFO 07-01 06:28:22 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=3627) INFO 07-01 06:28:22 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
WARNING 07-01 06:28:22 custom_all_reduce.py:179] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=3627) WARNING 07-01 06:28:22 custom_all_reduce.py:179] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 07-01 06:28:22 weight_utils.py:218] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=3627) INFO 07-01 06:28:22 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 07-01 06:29:05 model_runner.py:159] Loading model weights took 43.5064 GB
(VllmWorkerProcess pid=3627) INFO 07-01 06:29:05 model_runner.py:159] Loading model weights took 43.5064 GB
INFO 07-01 06:29:09 distributed_gpu_executor.py:56] # GPU blocks: 23587, # CPU blocks: 4096
INFO 07-01 06:29:12 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-01 06:29:12 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=3627) INFO 07-01 06:29:12 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=3627) INFO 07-01 06:29:12 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-01 06:29:30 model_runner.py:954] Graph capturing finished in 18 secs.
(VllmWorkerProcess pid=3627) INFO 07-01 06:29:30 model_runner.py:954] Graph capturing finished in 18 secs.
INFO 07-01 06:29:30 serving_chat.py:92] Using default chat template:
INFO 07-01 06:29:30 serving_chat.py:92] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
WARNING 07-01 06:29:30 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Here is the output of the nvidia-smi command. It looks like the model is loaded into GPU memory:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               Off |   00000000:00:06.0 Off |                    0 |
| N/A   33C    P0             81W /  350W |   69810MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 PCIe               Off |   00000000:00:07.0 Off |                    0 |
| N/A   34C    P0             79W /  350W |   69712MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1791      C   ...nvs/code-MATOk_fk-py3.11/bin/python      69792MiB |
|    1   N/A  N/A      5653      C   ...nvs/code-MATOk_fk-py3.11/bin/python      69694MiB |
+-----------------------------------------------------------------------------------------+
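Is this normal, or is something wrong with my setup? Does anyone have experience running this model in production, and feedback on the hardware requirements for a given "concurrent requests" target?
I tried playing with the vLLM parameters, but it did not help.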
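I would suggest trying the AWQ format instead of splitting the model across 2 GPUs. For this to work you need to download an AWQ-quantized checkpoint such as casperhansen/mixtral-instruct-awq. Once you have done that, make sure to run your model with
--quantization awq
and
--dtype half
The dtype flag may not be necessary for your setup, YMMV.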
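As a rough sketch of what that launch could look like, the Python helper below assembles the command line for vLLM's OpenAI-compatible server with the two flags mentioned above. The helper name build_vllm_command is made up for illustration. The module path vllm.entrypoints.openai.api_server corresponds to the api_server.py seen in the startup log, and leaving out --tensor-parallel-size assumes the quantized model fits on a single GPU, which is not verified here.

```python
import shlex
import sys


def build_vllm_command(model, quantization=None, dtype=None, extra_args=()):
    """Assemble the argv list for vLLM's OpenAI-compatible API server.

    Hypothetical helper: it only builds the command, it does not start anything.
    """
    cmd = [sys.executable, "-m", "vllm.entrypoints.openai.api_server", "--model", model]
    if quantization:
        cmd += ["--quantization", quantization]
    if dtype:
        cmd += ["--dtype", dtype]
    cmd += list(extra_args)
    return cmd


cmd = build_vllm_command(
    "casperhansen/mixtral-instruct-awq",
    quantization="awq",
    dtype="half",
)
# Print the command so it can be reviewed before launching anything.
print(shlex.join(cmd))
```

To actually start the server, the list can be passed to subprocess.run(cmd, check=True). Additional flags can be supplied through extra_args.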
I also cannot tell from your logs whether you are using flashattn as the attention backend. Install it and see whether it improves your performance.
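To check whether a change like this actually helps, it is worth measuring throughput before and after. The sketch below uses only the Python standard library to send the same prompt at concurrency 1 and 10 to the /v1/completions endpoint and report aggregate tokens per second. The base URL, the prompt, and the function names are assumptions, and it relies on the non-streaming response carrying a usage.completion_tokens field.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000"  # assumption: the server from the log, reached locally
MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"


def send_completion(prompt, max_tokens=128):
    """POST one request to /v1/completions and return the number of generated tokens."""
    payload = json.dumps(
        {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    request = urllib.request.Request(
        BASE_URL + "/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return body["usage"]["completion_tokens"]


def run_load_test(send, prompts, concurrency):
    """Send all prompts using `concurrency` worker threads.

    Returns (total_generated_tokens, elapsed_seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(send, prompts))
    elapsed = time.perf_counter() - start
    return sum(token_counts), elapsed


def main():
    # Same prompt repeated so that runs are comparable.
    prompts = ["Écris un court paragraphe sur Paris."] * 10
    for concurrency in (1, 10):
        tokens, seconds = run_load_test(send_completion, prompts, concurrency)
        print(f"concurrency={concurrency}: {tokens / seconds:.1f} tokens/s aggregate")
```

Call main() once the server is up. Comparing the two lines before and after a configuration change gives a rough picture of how much batching helps. Per-request latency would need to be recorded separately.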
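Finally, I would try tweaking parameters such as
--max-num-batched-tokens
and
--max-num-seqs
You can read more about them on the vLLM documentation page for the OpenAI-compatible server.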
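When tuning these two parameters it may help to know how many tokens the KV cache can hold in the first place. The short calculation below uses the figures from the startup log (23587 GPU blocks, block_size=16, max_seq_len=32768, max_num_seqs=256) and assumes that a block stores block_size tokens, so total capacity is blocks times block size. The function names are illustrative only.

```python
# Values copied from the startup log in the question.
NUM_GPU_BLOCKS = 23587  # "# GPU blocks: 23587"
BLOCK_SIZE = 16         # block_size=16, assumed to mean tokens per KV-cache block
MAX_SEQ_LEN = 32768     # max_seq_len=32768
MAX_NUM_SEQS = 256      # max_num_seqs=256


def kv_cache_tokens(num_blocks, block_size):
    """Total number of tokens the KV cache can hold."""
    return num_blocks * block_size


def max_full_length_sequences(num_blocks, block_size, seq_len):
    """How many sequences of length seq_len fit in the cache at the same time."""
    blocks_per_sequence = -(-seq_len // block_size)  # ceiling division
    return num_blocks // blocks_per_sequence


capacity = kv_cache_tokens(NUM_GPU_BLOCKS, BLOCK_SIZE)
print(capacity)  # 377392
print(max_full_length_sequences(NUM_GPU_BLOCKS, BLOCK_SIZE, MAX_SEQ_LEN))  # 11
print(capacity // MAX_NUM_SEQS)  # 1474 tokens per sequence if all 256 slots are in use
```

On these numbers, ten chat requests of a few thousand tokens each would fit comfortably, so cache capacity alone seems unlikely to explain the slowdown. That is only a back-of-the-envelope reading and should be confirmed with measurements.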