With the ctransformers library I can only load the dozen or so supported models. How can I run local inference on CPU (not just GPU) with any open-source LLM quantized in GGUF format (e.g. Llama 3, Mistral, Zephyr, i.e. models that ctransformers does not support)?
llama-cpp-python is my personal choice, because it is easy to use and it is usually one of the first libraries to support quantized versions of new models. To install it for CPU, just run pip install llama-cpp-python. Compiling for GPU is a bit more involved, so I won't post those instructions here since you asked specifically about CPU inference. I also recommend installing huggingface_hub (pip install huggingface_hub) to make downloading models easy.
Once llama-cpp-python and huggingface_hub are installed, you can download and use a model (e.g. mixtral-8x7b-instruct-v0.1-gguf) as follows:
## Imports
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
## Download the GGUF model
model_name = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF"
model_file = "mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf" # this is the specific model file we'll use in this example. It's a 4-bit quant, but other levels of quantization are available in the model repo if preferred
model_path = hf_hub_download(model_name, filename=model_file)
## Instantiate model from downloaded file
llm = Llama(
    model_path=model_path,
    n_ctx=16000,     # Context length to use
    n_threads=32,    # Number of CPU threads to use
    n_gpu_layers=0   # Number of model layers to offload to GPU
)
## Generation kwargs
generation_kwargs = {
    "max_tokens": 20000,
    "stop": ["</s>"],
    "echo": False,   # Whether to echo the prompt in the output
    "top_k": 1       # This is essentially greedy decoding, since the model will always return the highest-probability token. Set this value > 1 for sampling decoding
}
## Run inference
prompt = "The meaning of life is "
res = llm(prompt, **generation_kwargs)  # res (the result) is a dictionary
## Unpack the generated text from the LLM response dictionary and print it
print(res["choices"][0]["text"])
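If you would rather print tokens as they are produced instead of waiting for the full completion, llama-cpp-python also supports streaming. A minimal sketch, reusing the llm and generation_kwargs defined above:

## Stream the completion token by token
for chunk in llm(prompt, stream=True, **generation_kwargs):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()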
Keep in mind that Mixtral is a fairly large model for most laptops, requiring ~25+ GB of RAM, so if you need a smaller model, try one such as llama-2-13b-chat-gguf (model_name="TheBloke/Llama-2-13B-chat-GGUF"; model_file="llama-2-13b-chat.Q4_K_M.gguf") or mistral-7b-openorca-gguf (model_name="TheBloke/Mistral-7B-OpenOrca-GGUF"; model_file="mistral-7b-openorca.Q4_K_M.gguf").
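For example, switching to the Mistral-7B-OpenOrca GGUF only requires changing the two strings passed to hf_hub_download. The n_ctx, n_threads and max_tokens values below are illustrative assumptions, so adjust them for your machine:

## Same workflow as above, but with a ~7B model that needs far less RAM
model_name = "TheBloke/Mistral-7B-OpenOrca-GGUF"
model_file = "mistral-7b-openorca.Q4_K_M.gguf"
model_path = hf_hub_download(model_name, filename=model_file)

llm = Llama(
    model_path=model_path,
    n_ctx=8192,      # smaller context window than the Mixtral example
    n_threads=8,     # match this to your CPU core count
    n_gpu_layers=0   # CPU-only inference
)

res = llm("The meaning of life is ", max_tokens=512, stop=["</s>"], echo=False)
print(res["choices"][0]["text"])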
Transformers now supports loading models quantized in GGUF format by dequantizing them into a full-precision version, so they can be run like standard models. Note that this feature is still experimental and may change: https://huggingface.co/docs/transformers/main/en/gguf
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

# Both the tokenizer and the (dequantized) model are built from the GGUF file
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
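Once loaded this way, the model behaves like any other Transformers causal LM. A minimal CPU generation sketch (the prompt and generation settings here are just illustrative):

import torch

prompt = "The meaning of life is "
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding on CPU; adjust max_new_tokens or add sampling options as needed
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))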
My favorite tool is Ollama. It makes it extremely easy to set up quantized (GGUF-format) LLMs on CPU, and it also integrates very easily with GUIs such as Chatbot Ollama. I wrote a free blog post on how to do this: Set up a local LLM on CPU with a chat UI in 15 minutes.
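If you prefer to stay in Python rather than use the Ollama CLI, there is also an official ollama Python client (pip install ollama) that talks to a running Ollama server. A minimal sketch, assuming the server is running and a model such as mistral has already been pulled:

import ollama

# Assumes `ollama serve` is running and `ollama pull mistral` has been done
response = ollama.chat(
    model="mistral",
    messages=[{"role": "user", "content": "The meaning of life is "}],
)
print(response["message"]["content"])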
You first need to convert the model. Use this page, which is based on the llama.cpp repository:
https://huggingface.co/spaces/ggml-org/gguf-my-repo
Once converted, you can run any GGUF model with Ollama.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
# Replace 'model_name' with the name of the specific model you want to use
model_name = "model_name"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Replace 'your_input_text' with the actual text you want to classify
input_text = "your_input_text"
# Tokenize the input text and convert it to tensor
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding=True)
# Forward pass through the model
outputs = model(**inputs)
# Get the raw logits and convert them to class probabilities
logits = outputs.logits
probabilities = torch.nn.functional.softmax(logits, dim=1)
# Print the predicted class probabilities
print(probabilities)