Mixtral 8x7b, am I running it wrong?

Question

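Like many companies today, we are running a local LLM on our own internal servers. We chose Mixtral 8x7b because it seemed to offer the best value for money among models that "speak French". The model itself works well, but I have to say I am quite disappointed with the performance: with 10 concurrent requests, token generation speed drops significantly, which makes for a very unpleasant experience.

I am running the model with vllm on a dual NVIDIA H100 PCIe setup. You can see the startup logs below: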


INFO 07-01 06:28:12 api_server.py:177] vLLM API server version 0.5.0
INFO 07-01 06:28:12 api_server.py:178] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='safetensors', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, 
speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
/root/.cache/pypoetry/virtualenvs/code-MATOk_fk-py3.11/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
2024-07-01 06:28:16,256 INFO worker.py:1753 -- Started a local Ray instance.
INFO 07-01 06:28:17 config.py:623] Defaulting to use mp for distributed inference
INFO 07-01 06:28:17 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.SAFETENSORS, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=mistralai/Mixtral-8x7B-Instruct-v0.1)
(VllmWorkerProcess pid=3627) INFO 07-01 06:28:21 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 07-01 06:28:21 utils.py:623] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3627) INFO 07-01 06:28:21 utils.py:623] Found nccl from library libnccl.so.2
INFO 07-01 06:28:21 pynccl.py:65] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3627) INFO 07-01 06:28:21 pynccl.py:65] vLLM is using nccl==2.20.5
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/multiprocessing/resource_tracker.py", line 239, in main
    cache[rtype].remove(name)
KeyError: '/psm_7a3085ca'
INFO 07-01 06:28:22 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=3627) INFO 07-01 06:28:22 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
WARNING 07-01 06:28:22 custom_all_reduce.py:179] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=3627) WARNING 07-01 06:28:22 custom_all_reduce.py:179] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 07-01 06:28:22 weight_utils.py:218] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=3627) INFO 07-01 06:28:22 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 07-01 06:29:05 model_runner.py:159] Loading model weights took 43.5064 GB
(VllmWorkerProcess pid=3627) INFO 07-01 06:29:05 model_runner.py:159] Loading model weights took 43.5064 GB
INFO 07-01 06:29:09 distributed_gpu_executor.py:56] # GPU blocks: 23587, # CPU blocks: 4096
INFO 07-01 06:29:12 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-01 06:29:12 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=3627) INFO 07-01 06:29:12 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=3627) INFO 07-01 06:29:12 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-01 06:29:30 model_runner.py:954] Graph capturing finished in 18 secs.
(VllmWorkerProcess pid=3627) INFO 07-01 06:29:30 model_runner.py:954] Graph capturing finished in 18 secs.
INFO 07-01 06:29:30 serving_chat.py:92] Using default chat template:
INFO 07-01 06:29:30 serving_chat.py:92] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
WARNING 07-01 06:29:30 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Here is the output of the nvidia-smi command; it looks like the model is loaded into GPU memory:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               Off |   00000000:00:06.0 Off |                    0 |
| N/A   33C    P0             81W /  350W |   69810MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 PCIe               Off |   00000000:00:07.0 Off |                    0 |
| N/A   34C    P0             79W /  350W |   69712MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1791      C   ...nvs/code-MATOk_fk-py3.11/bin/python      69792MiB |
|    1   N/A  N/A      5653      C   ...nvs/code-MATOk_fk-py3.11/bin/python      69694MiB |
+-----------------------------------------------------------------------------------------+

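Is this normal, or is something wrong with my setup? Does anyone have experience running this model in production, and feedback on the hardware requirements for a given "concurrent requests target"?

I tried playing with the vllm parameters, but it did not help.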
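
To put numbers on the slowdown, a small load test along these lines may help. It is only a sketch: the server address, the prompt, and the helper names `send_chat`, `measure` and `main` are assumptions for illustration, and it relies on the OpenAI-compatible `/v1/chat/completions` endpoint returning a `usage.completion_tokens` field.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed values: adjust to your deployment. The model name comes from the log.
BASE_URL = "http://localhost:8000"
MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"


def send_chat(prompt, max_tokens=128):
    """Send one chat completion request and return the number of generated tokens."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    request = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["usage"]["completion_tokens"]


def measure(send, prompts):
    """Run all prompts concurrently; return (total_tokens, elapsed_s, tokens_per_s)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        counts = list(pool.map(send, prompts))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    return total, elapsed, total / elapsed


def main():
    """Compare aggregate throughput for 1 and 10 concurrent requests."""
    prompt = "Bonjour, raconte-moi une courte histoire."  # placeholder prompt
    for n in (1, 10):
        total, elapsed, rate = measure(send_chat, [prompt] * n)
        print(f"{n:>2} concurrent: {total} tokens in {elapsed:.1f}s "
              f"({rate:.1f} tok/s aggregate, {rate / n:.1f} tok/s per request)")
```

Calling `main()` against a running server prints aggregate and per-request throughput for both cases, which gives a baseline to compare against after each configuration change.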

Answer 1

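I would suggest trying the AWQ format instead of splitting the model across 2 GPUs. For that to work you need to download an AWQ-quantized model such as casperhansen/mixtral-instruct-awq. Once you have done that, make sure to run your model with

--quantization awq

and

--dtype half

The dtype flag may not be necessary for your setup; YMMV.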
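
As a rough sanity check that a 4-bit model could fit on one card, here is a back-of-the-envelope sketch using the figures from the log and the nvidia-smi output. The estimate ignores quantization scales, zero points and any layers left unquantized, and it treats the log's "GB" and nvidia-smi's MiB as comparable units, so take it as an approximation. `build_server_command` is a hypothetical helper, and the `python -m vllm.entrypoints.openai.api_server` entry point is an assumption about how the server was launched.

```python
# Figures taken from the startup log and nvidia-smi output in the question.
BF16_WEIGHTS_PER_GPU_GB = 43.5064  # "Loading model weights took 43.5064 GB"
NUM_GPUS = 2                       # tensor_parallel_size=2
GPU_MEMORY_MIB = 81559             # total memory per H100 PCIe
GPU_MEMORY_UTILIZATION = 0.9       # gpu_memory_utilization=0.9


def estimate_quantized_weights_gb(bf16_total_gb, bits=4):
    """Scale a 16-bit weight footprint down to `bits` bits per weight (rough)."""
    return bf16_total_gb * bits / 16


def build_server_command(model, quantization=None, dtype=None, tensor_parallel_size=1):
    """Assemble an argv list for the OpenAI-compatible server (hypothetical helper)."""
    cmd = ["python", "-m", "vllm.entrypoints.openai.api_server",
           "--model", model,
           "--tensor-parallel-size", str(tensor_parallel_size)]
    if quantization is not None:
        cmd += ["--quantization", quantization]
    if dtype is not None:
        cmd += ["--dtype", dtype]
    return cmd


bf16_total = BF16_WEIGHTS_PER_GPU_GB * NUM_GPUS
awq_estimate = estimate_quantized_weights_gb(bf16_total)
usable_per_gpu = GPU_MEMORY_MIB / 1024 * GPU_MEMORY_UTILIZATION

print(f"bf16 weights in total:    {bf16_total:.1f} GB")
print(f"4-bit estimate:           {awq_estimate:.1f} GB")
print(f"usable memory on one GPU: {usable_per_gpu:.1f} GB")
print(" ".join(build_server_command("casperhansen/mixtral-instruct-awq",
                                    quantization="awq", dtype="half")))
```

If the estimate holds, the quantized weights would leave most of a single card free for the KV cache and remove the inter-GPU communication step entirely. That may matter here because the log reports that custom allreduce is disabled on this platform.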

I also cannot tell from your logs whether you are using flashattn as the backend. Install it and see whether it improves your performance.
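
A quick way to see what is installed in the server's virtualenv is to check which attention-related packages can be imported. This is a minimal sketch: the import names `flash_attn`, `vllm_flash_attn` and `xformers` are my assumption about what to look for, and an importable package does not prove that vLLM actually selects it at runtime.

```python
import importlib.util


def attention_packages(names=("flash_attn", "vllm_flash_attn", "xformers")):
    """Report which attention-related packages are importable in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for package, available in attention_packages().items():
    print(f"{package}: {'installed' if available else 'not found'}")
```

Run it with the same Python interpreter that starts the server. As far as I know, vLLM releases around 0.5 also read a `VLLM_ATTENTION_BACKEND` environment variable to force a backend, but verify that against the documentation for your version.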

Finally, I would try tweaking parameters such as

--max-num-batched-tokens

and

--max-num-seqs

You can read more about them here: vllm OpenAI-compatible server
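
Before tuning these, it may be worth checking how much KV cache the current setup has. The arithmetic below uses only numbers from the startup log; `max_concurrent_sequences` is a hypothetical helper, and the calculation assumes each block holds `block_size` tokens for one sequence.

```python
# Values reported in the startup log of the question.
NUM_GPU_BLOCKS = 23587  # "# GPU blocks: 23587"
BLOCK_SIZE = 16         # block_size=16 (tokens per KV cache block)
MAX_SEQ_LEN = 32768     # max_seq_len=32768
MAX_NUM_SEQS = 256      # max_num_seqs=256


def max_concurrent_sequences(tokens_per_sequence):
    """How many sequences of the given length fit in the KV cache at once."""
    blocks_per_sequence = -(-tokens_per_sequence // BLOCK_SIZE)  # ceiling division
    return NUM_GPU_BLOCKS // blocks_per_sequence


kv_capacity_tokens = NUM_GPU_BLOCKS * BLOCK_SIZE

print(f"KV cache capacity: {kv_capacity_tokens} tokens")
print(f"full-length (32k) sequences that fit: {max_concurrent_sequences(MAX_SEQ_LEN)}")
print(f"2k-token sequences that fit: {max_concurrent_sequences(2048)}")
print(f"tokens per sequence if all {MAX_NUM_SEQS} slots are used: "
      f"{kv_capacity_tokens // MAX_NUM_SEQS}")
```

By this estimate the cache holds about 377k tokens, so 10 concurrent requests should fit unless the prompts are very long. That suggests KV cache capacity is probably not the limiting factor for this workload.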
