
LLaMA

LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.

Clone the repository

git clone https://github.com/facebookresearch/llama
cd llama

Download the models

Edit download.sh and set the download URL for the models (PRESIGNED_URL) and the target directory (TARGET_FOLDER):

vim download.sh
PRESIGNED_URL="https://agi.gpt4.org/llama/LLaMA/*"             # replace with presigned url from email
TARGET_FOLDER="./"             # where all files should end up
bash download.sh

llama.cpp

Build

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

Copy the downloaded LLaMA models into the ./models directory:

ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

Install the Python dependencies

pip install torch numpy sentencepiece

Convert the 7B model to ggml FP16 format

python convert-pth-to-ggml.py models/7B/ 1

Quantize the model to 4 bits

./quantize.sh 7B

Inference

./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128
main: seed = 1678918444
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

main: prompt: 'She'
main: number of tokens in prompt = 2
     1 -> ''
 13468 -> 'She'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Shepherds Bush International station was once again the destination of choice for our Sunday afternoon outing, following another successful event on Saturday. We split into two groups this time to explore what is possibly London's most exciting and vibrant district: Shepherd’s Market (it actually sounds a lot like Soho) lies just north-east from the station itself with some of its oldest buildings dating back over 200 years. The market has been in continuous existence since at least the early part of this century – when it was called 'The Pig Alley' - and it is still home to a

main: mem per token = 14434244 bytes
main:     load time =   671.87 ms
main:   sample time =   113.01 ms
main:  predict time =  5871.94 ms / 45.52 ms per token
main:    total time =  6939.42 ms
  • Memory usage: 4.1 GB

Interactive mode

./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 256 --repeat_penalty 1.0 --color -i -r "User:" -p \
"Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:"

Test details

Model  Original size  Quantized (4-bit)  Memory
7B     13G            3.9G               4.0G
13B    24G            7.6G               7.8G
30B    61G            19G                19.4G
65B    122G           38G                38.5G

Large models in GGUF format

Introduction to GGUF

GGUF is a binary format designed for fast loading and saving of models. It is the successor to the GGML, GGMF, and GGJT file formats and is unambiguous: the file contains all the information needed to load the model. It is also designed to be extensible, so new information can be added to a model without breaking compatibility. A short sketch for inspecting a GGUF file's contents follows the list below.

  • GGML (unversioned): the baseline format, with no versioning or alignment.
  • GGMF (versioned): the same as GGML, but versioned.
  • GGJT: aligns the tensors so the file can be used with mmap, which requires alignment. v1, v2, and v3 are otherwise identical, but later versions use quantization schemes that are incompatible with earlier ones.
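
To see in practice what a GGUF file carries, the gguf Python package published from llama.cpp's gguf-py directory provides a reader class; the following is a minimal sketch, assuming pip install gguf and a locally downloaded file (the filename is only an example):

# Minimal sketch: list the metadata keys and tensor names stored in a GGUF file.
# Assumes the gguf package (pip install gguf), maintained in the llama.cpp repo under gguf-py.
from gguf import GGUFReader

reader = GGUFReader("codellama-7b.Q4_K_M.gguf")   # example filename

# GGUF is self-describing: the key/value metadata needed to load the model is in the file.
for key in reader.fields:
    print(key)

# The tensors follow, each carrying its own name and shape.
for tensor in reader.tensors:
    print(tensor.name, tensor.shape)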

What is GGUF and GGML?

Download a GGUF model (from TheBloke's repositories on Hugging Face)

REPO_ID=TheBloke/CodeLlama-7B-GGUF
FILENAME=codellama-7b.Q4_K_M.gguf

huggingface-cli

pip install "huggingface_hub[cli]"
huggingface-cli download ${REPO_ID} ${FILENAME} \
    --local-dir . --local-dir-use-symlinks False

wget

wget https://huggingface.co/${REPO_ID}/resolve/main/${FILENAME}\?download\=true -O ${FILENAME}
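
The same file can also be fetched from a script with the huggingface_hub Python API instead of the CLI; a minimal sketch, reusing the REPO_ID and FILENAME values above:

# Minimal sketch: download the same GGUF file via the huggingface_hub Python API.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/CodeLlama-7B-GGUF",
    filename="codellama-7b.Q4_K_M.gguf",
    local_dir=".",   # store the file here instead of in the Hugging Face cache
)
print(path)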

Convert a model to GGUF

❶ Build llama.cpp

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

make -j

pip install -r requirements.txt

❷ Convert to a GGUF model

python convert.py vicuna-7b-v1.5 \
    --outfile vicuna-7b-v1.5.gguf \
    --outtype q8_0

Llama 2 quantization performance

Model  Measure                       F16     Q2_K    Q3_K_M  Q4_K_S  Q5_K_S  Q6_K
7B     perplexity                    5.9066  6.7764  6.1503  6.0215  5.9419  5.9110
7B     file size                     13.0G   2.67G   3.06G   3.56G   4.33G   5.15G
7B     ms/tok @ 4 threads, M2 Max    116     56      69      50      70      75
7B     ms/tok @ 8 threads, M2 Max    111     36      36      36      44      51
7B     ms/tok @ 4 threads, RTX-4080  60      15.5    17.0    15.5    16.7    18.3
7B     ms/tok @ 4 threads, Ryzen     214     57      61      68      81      93
13B    perplexity                    5.2543  5.8545  5.4498  5.3404  5.2785  5.2568
13B    file size                     25.0G   5.13G   5.88G   6.80G   8.36G   9.95G
13B    ms/tok @ 4 threads, M2 Max    216     103     148     95      132     142
13B    ms/tok @ 8 threads, M2 Max    213     67      77      68      81      95
13B    ms/tok @ 4 threads, RTX-4080  -       25.3    29.3    26.2    28.6    30.0
13B    ms/tok @ 4 threads, Ryzen     414     109     118     130     156     180

k-quants

Perplexity (PPL) is a metric for evaluating how good a language model is.

Intuitively, if we take a well-formed, high-quality document written in natural human language as the test set, then the higher the probability the model assigns to that text, the lower its perplexity, and the better the model.

Understanding perplexity as an LLM evaluation metric
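
As a concrete illustration of the formula behind the metric (not llama.cpp's own perplexity tool), perplexity is the exponential of the average negative log-likelihood per token; a minimal sketch with made-up token probabilities standing in for model output:

# Minimal sketch: perplexity = exp(-(1/N) * sum_i log p(token_i | preceding tokens)).
# The probabilities below are hypothetical values, not real model output.
import math

token_probs = [0.25, 0.60, 0.10, 0.40]
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(avg_nll)
print(f"perplexity = {ppl:.3f}")   # lower is better: the model found the text less "surprising"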

TheBloke: downloads of converted and quantized Meta Llama 2 models

Chat

./main -n 1000 -e -m llama-2-7b-chat.Q4_K_M.gguf -p "Steps for making candy"
 Steps for making candy

1. Choose quality candy: selecting high-quality candy improves its precision and baked texture.
2. Separate the candy: sort the candy into different colors by size and shape, which gives better control over how it drops and collapses.
3. Drizzle the candy: drizzle the candy onto a board, making sure each piece is properly coated, which reduces collapse and damage.
4. Distribute evenly: spread the candy evenly across the board so every piece has the same size and shape, which again gives better control over how it drops and collapses.
5. Bake: place the coated candy in the oven with the right time and temperature so it bakes through completely.
6. Freeze: freeze the baked candy in the refrigerator for storage and later use.
7. Preview: inspect the candy's color, shape, and texture to preview the result.
8. Adjust: if the color or shape is off, correct it by adjusting the baking time and temperature.

llama-cpp-python 📜

Install with Metal (MPS) support

CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
pip install fastapi uvicorn sse-starlette pydantic-settings starlette-context
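
Besides serving models, the library can be used directly from Python; a minimal sketch, assuming a local GGUF file such as llama-2-7b-chat.Q4_K_M.gguf (the path is only an example):

# Minimal sketch: load a GGUF model with llama-cpp-python and run one completion.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",   # example path to a local GGUF file
    n_ctx=2048,                                   # context window size
    n_gpu_layers=-1,                              # offload all layers to Metal when available
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])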

OpenAI-compatible web server

python -m llama_cpp.server --model llama-2-7b-chat.Q4_K_M.gguf --model_alias Llama-2-7B-chat
--model MODEL     The path to the model to use for generating completions. (default: PydanticUndefined)
--model_alias MODEL_ALIAS The alias of the model to use for generating completions.
--n_ctx N_CTX     The context size. (default: 2048)
--host HOST       Listen address (default: localhost)
--port PORT       Listen port (default: 8000)

OpenAPI documentation

  • POST /v1/completions
  • POST /v1/embeddings
  • POST /v1/chat/completions
  • GET /v1/models

More features

Calling the API with curl

POST /v1/completions

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "中国的首都是?",
        "temperature": 0.3
    }'

POST /v1/chat/completions

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{ 
        "messages": [ 
            { "role": "user", "content": "中国的首都是?" }
        ]
    }'
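
Since the server speaks the OpenAI API, the official openai Python client (v1 or later) can be pointed at it as well; a minimal sketch, assuming the server started above is listening on localhost:8000 (the api_key value is a placeholder, as the local server does not check it):

# Minimal sketch: call the llama_cpp.server endpoint with the OpenAI Python client (>= 1.0).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")   # placeholder key

resp = client.chat.completions.create(
    model="Llama-2-7B-chat",   # the --model_alias given when starting the server
    messages=[{"role": "user", "content": "What is the capital of China?"}],
)
print(resp.choices[0].message.content)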

References