With the ctransformers library I can only load the dozen or so supported models. How can I run local inference on CPU (not just GPU) with any open-source LLM quantized in GGUF format (e.g. Llama 3, Mistral, Zephyr, i.e. models that ctransformers does not support)?
llama-cpp-python is my personal choice, because it is easy to use and it is usually one of the first libraries to support quantized versions of new models. To install it for CPU, just run pip install llama-cpp-python. Compiling for GPU is a bit more involved, so I won't post those instructions here since you asked specifically about CPU inference. I also recommend installing huggingface_hub (pip install huggingface_hub) to make downloading models easy.
Once llama-cpp-python and huggingface_hub are installed, you can download and use a model (e.g. mixtral-8x7b-instruct-v0.1-gguf) as follows:
## Imports
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
## Download the GGUF model
model_name = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF"
model_file = "mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf" # this is the specific model file we'll use in this example. It's a 4-bit quant, but other levels of quantization are available in the model repo if preferred
model_path = hf_hub_download(model_name, filename=model_file)
## Instantiate model from downloaded file
llm = Llama(
    model_path=model_path,
    n_ctx=16000,     # Context length to use
    n_threads=32,    # Number of CPU threads to use
    n_gpu_layers=0   # Number of model layers to offload to GPU
)
## Generation kwargs
generation_kwargs = {
    "max_tokens": 20000,
    "stop": ["</s>"],
    "echo": False,   # Whether to echo the prompt in the output
    "top_k": 1       # This is essentially greedy decoding, since the model will always return the highest-probability token. Set this value > 1 for sampling decoding
}
## Run inference
prompt = "The meaning of life is "
res = llm(prompt, **generation_kwargs)  # res (the result) is a dictionary
## Unpack the generated text from the LLM response dictionary and print it
print(res["choices"][0]["text"])
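If you would rather print tokens as they are produced instead of waiting for the full completion, llama-cpp-python also supports streaming. A minimal sketch, reusing the llm and generation_kwargs defined above:

## Stream the completion token by token
for chunk in llm(prompt, stream=True, **generation_kwargs):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()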
Keep in mind that Mixtral is a fairly large model for most laptops, requiring ~25+ GB of RAM, so if you need a smaller model, try one such as llama-2-13b-chat-gguf (model_name="TheBloke/Llama-2-13B-chat-GGUF"; model_file="llama-2-13b-chat.Q4_K_M.gguf") or mistral-7b-openorca-gguf (model_name="TheBloke/Mistral-7B-OpenOrca-GGUF"; model_file="mistral-7b-openorca.Q4_K_M.gguf").
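For example, switching to the Mistral-7B-OpenOrca GGUF only requires changing the two strings passed to hf_hub_download. The n_ctx, n_threads and max_tokens values below are illustrative assumptions, so adjust them for your machine:

## Same workflow as above, but with a ~7B model that needs far less RAM
model_name = "TheBloke/Mistral-7B-OpenOrca-GGUF"
model_file = "mistral-7b-openorca.Q4_K_M.gguf"
model_path = hf_hub_download(model_name, filename=model_file)

llm = Llama(
    model_path=model_path,
    n_ctx=8192,      # smaller context window than the Mixtral example
    n_threads=8,     # match this to your CPU core count
    n_gpu_layers=0   # CPU-only inference
)

res = llm("The meaning of life is ", max_tokens=512, stop=["</s>"], echo=False)
print(res["choices"][0]["text"])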
Transformers now supports loading models quantized in GGUF format by dequantizing them into a full-precision version, so they can be run like standard models. Note that this feature is still experimental and may change: https://huggingface.co/docs/transformers/main/en/gguf
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

# Both the tokenizer and the (dequantized) model are built from the GGUF file
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
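Once loaded this way, the model behaves like any other Transformers causal LM. A minimal CPU generation sketch (the prompt and generation settings here are just illustrative):

import torch

prompt = "The meaning of life is "
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding on CPU; adjust max_new_tokens or add sampling options as needed
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))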
My favorite tool is Ollama. It makes it extremely easy to set up quantized (GGUF-format) LLMs on CPU, and it also integrates very easily with GUIs such as Chatbot Ollama. I wrote a free blog post on how to do this: Set up a local LLM on CPU with a chat UI in 15 minutes.
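If you prefer to stay in Python rather than use the Ollama CLI, there is also an official ollama Python client (pip install ollama) that talks to a running Ollama server. A minimal sketch, assuming the server is running and a model such as mistral has already been pulled:

import ollama

# Assumes `ollama serve` is running and `ollama pull mistral` has been done
response = ollama.chat(
    model="mistral",
    messages=[{"role": "user", "content": "The meaning of life is "}],
)
print(response["message"]["content"])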
You first need to convert the model. Use this page, which is based on the llama.cpp repository:
https://huggingface.co/spaces/ggml-org/gguf-my-repo
Once converted, you can run any GGUF model with Ollama.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
# Replace 'model_name' with the name of the specific model you want to use
model_name = "model_name"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Replace 'your_input_text' with the actual text you want to classify
input_text = "your_input_text"
# Tokenize the input text and convert it to tensor
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding=True)
# Forward pass through the model
outputs = model(**inputs)
# Get the raw logits and convert them to class probabilities
logits = outputs.logits
probabilities = torch.nn.functional.softmax(logits, dim=1)
# Print the predicted class probabilities
print(probabilities)