A couple of months ago, I got excited hearing that MistralAI, a French (!) company, released a powerful LLM (Large Language Model) with only (welcome to the LLM world) 7 billion parameters: mistral-7b. In comparison, OpenAI’s GPTs are in the hundreds (if not thousands) of billions of parameters range. The natural point of comparison for such a “small” model is Meta’s llama-2-13b, which has 13 billion parameters, as its name suggests.
mistral-7b seems to outperform llama-2-13b on almost all tasks while having fewer parameters. This means taking up less RAM and being more compute-efficient. Of course, these models are still significantly behind GPT-3.5 and GPT-4, but keep in mind they are not playing in the same ballpark at all, with 1 to 2 orders of magnitude of difference in the number of parameters.
Performance of Mistral 7B and different Llama models on a wide range of benchmarks, from https://mistral.ai/news/announcing-mistral-7b/
Being open-source (!), Mistral models appear as a more than decent, free alternative to GPTs for personal LLM projects or proofs of concept. I like being able to run things locally and not being rate-limited, at least before scaling up. Furthermore, I was impressed by the fact that some LLM runtime libraries such as llama.cpp are specifically optimized for smaller architectures and setups, with Apple Silicon quoted as a “first-class citizen” in the GitHub project’s goals.
In this short article, we will cover:
- How to build llama.cpp
- How to use llama.cpp from the command line with quantized versions of mistral-7b
- How to run it as a web server that can be queried through an API
- How to use it through its Python wrapper (llama-cpp-python)
Some alternatives to llama.cpp include:
- OpenVINO, which I cannot recommend enough for Deep Learning inference in general if you are using an Intel CPU. Unfortunately, at the time of this writing, I did not find any quantized, OpenVINO-converted version of mistral-7b. This link explains how to do the conversion yourself from the original weights supplied by MistralAI, but it requires some nightly packages as well as quite a lot of RAM to fit the original model. Hence, I would not recommend this route for now, unless you have a powerful machine to convert the model.
- CTransformers, though I found it awfully slow on my M1 machine compared to llama.cpp (maybe only llama-2 models have Metal support, and not mistral-7b?). One of its biggest advantages is its easy interfacing with HuggingFace.
Building llama.cpp
This is fairly easy if you are using Linux or macOS (provided that you have Xcode installed).
git clone git@github.com:ggerganov/llama.cpp.git
cd llama.cpp
make
This will build a version of llama.cpp targeted at your own CPU architecture. On macOS, Metal support is enabled by default, which allows llama.cpp to make use of the Apple Silicon GPU cores!
See the README.md of the llama.cpp repository for more information on building and on the various architecture-specific accelerations.
Quantize mistral-7b weights yourself…
Quantization converts (some of) the parameters of your model from floating-point numbers (32 or 16 bits) to integers (8 or 4 bits). This makes the model take up less disk space, use less RAM when loaded, and run inference faster.
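As a rough back-of-the-envelope illustration (nominal bit widths only; real quantized GGUF files such as Q4_K_M mix several bit widths and store per-block scaling factors, so actual file sizes are somewhat larger than these estimates):
# Nominal size of a ~7-billion-parameter model at different bit widths.
# Real quantized GGUF files are somewhat larger, since they mix bit widths
# and store per-block scales and metadata.
n_params = 7.24e9  # approximate parameter count of mistral-7b

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    size_gb = n_params * bits / 8 / 1e9
    print(f"{name}: ~{size_gb:.1f} GB")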
If you want to quantize a supported model yourself, you can download the original mistral-7b weights and then use the convert.py script from the llama.cpp repository:
python3 -m pip install -r requirements.txt
# Convert the model to ggml FP16
# (If the model is not using a BPE tokenizer, remove the flag)
python3 convert.py models/mymodel/ --vocab-type bpe
# Quantize the model to 4-bit with the Q4_K_M method
./quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
More information here. Note that this requires quite a lot of RAM to load the original weights, and can take some time.
… or download the mistral-7b quantized weights
Fortunately, some amazing people have already converted a lot of models to the GGML/GGUF format and quantized them in every possible way, which lets us pick the version best suited to our use case.
Check out TheBloke’s model card for mistral-7b-instruct-v0.2 on HuggingFace. Note that this mistral-7b version is fine-tuned for instructions (solving problems and tasks), hence the name.
Under Files and versions, you will be able to download the weights you want:
- If you are short on memory, or simply want to keep it minimal, the Q4_K_M weights are perfect (4.37 GB on disk, 6.87 GB max in RAM).
- If you have a bit more space and/or compute available, Q5_K_S is also a good choice (5.00 GB on disk, 7.50 GB max in RAM).
- If you want the best quantized version of this model, go for Q5_K_M (5.13 GB on disk, 7.63 GB max in RAM).
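If you prefer scripting the download instead of clicking through the web interface, a small sketch using the huggingface_hub package (an extra dependency, not required by llama.cpp itself) could look like this:
# Download the Q5_K_S quantized weights from TheBloke's HuggingFace repository.
# Requires `pip install huggingface_hub`; the file is cached locally and the
# returned path can then be passed to llama.cpp with -m.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q5_K_S.gguf",
)
print(model_path)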
Run mistral-7b!
Now that everything is set up, this is fairly easy.
cd llama.cpp
./main -m mistral-7b-instruct-v0.2.Q5_K_S.gguf -c 4096 --temp 0.8 -t 4 --n_predict -1 --n_gpu_layers -1 -p "<s>[INST] AI is going to... [/INST]"
Let’s go through the parameters:
- -m MODEL: the path to the model weights in GGUF format.
- -c 4096: tells the LLM engine to use a 4096-token context, which is the maximum context mistral-7b was trained with. Increasing the context size allows the LLM to handle larger prompts, at the cost of higher inference time. As of now, llama.cpp does not support sliding window context. The tokenizer for Mistral models is BPE-based.
- --temp 0.8: the temperature of the LLM, to be chosen between 0 and 1. A low temperature means less variability in the answer and less room for randomness and creativity during generation.
- --n_predict -1: tells the LLM engine not to stop after a given number of predicted tokens. You can set a hard limit by changing this parameter to any positive integer, in case you want to avoid overly long answers.
- -t 4: the number of threads to use for inference. Here, we use the 4 performance cores out of the 8 cores available on Apple Silicon ARM CPUs (M1, M2, M3).
- --n_gpu_layers: the number of model layers to offload to the GPU. If set to -1, all layers are offloaded. Experiment with 0, 16, and -1 (all layers) to see how this parameter impacts CPU usage and inference time. Offloading the compute to the GPU cores decreases CPU usage and improves inference time (on the largest 4096-token prompts, I get almost 80 tokens per second for prompt processing and 7 tokens per second for token generation when offloading all layers to the GPU cores).
- -p: the parameter everybody’s excited about, aka the prompt! With the instruct version of mistral-7b, the prompt must follow this pattern by design: <s>[INST] WRITE YOUR INSTRUCTIONS HERE [/INST].
With the above command, we get the following answer:
[INST] AI is going to... [/INST] AI is an ever-evolving technology and its potential applications are constantly being researched and developed. Some of the ways in which AI may be used in the future include:
1. Medical Diagnosis: AI algorithms could potentially be trained to diagnose diseases with greater accuracy than human doctors.
(...)
8. Education: AI could be used to personalize learning experiences, providing students with targeted feedback and resources based on their individual strengths and weaknesses. [end of text]
Check in the logs that Metal is being used (“Metal” should appear a couple of times).
In this example:
- The model took 8 seconds to load
- The token generation took 28 seconds, at a rate of roughly 96 ms per token (i.e. about 10 tokens per second)
A token is around 4 characters on average, so given the length of this answer (~1500 characters), these numbers make sense.
Overall, on the user end, the speed feels almost as smooth as using ChatGPT with GPT-3.5, which is nice. You can also run the engine interactively (to mimic a chatbot experience) using the -i flag.
Do not hesitate to explore the different parameters by running ./main --help. Importantly, you may want to lower the --temp (temperature) parameter when using an instruct version of an LLM, since answers to instructions or task solving should not vary much in theory (as opposed to creative writing, for example).
Running a mistral-7b server with llama.cpp
What is great is that you can also run llama.cpp as a web server, which exposes a built-in API you can query. Start the server first:
./server -m mistral-7b-instruct-v0.2.Q5_K_S.gguf -c 4096 -t 4
Then query it with curl, or with Python, for example:
import requests

url = "http://localhost:8080/completion"
prompt = "AI is going to..."
payload = {
    "prompt": f"<s>[INST] {prompt} [/INST]",
    "n_predict": -1,
    "n_gpu_layers": -1,
    "temperature": 0.8,
}
response = requests.post(url, json=payload)
llm_answer = response.json()["content"]
print(llm_answer)
This returns:
AI is going to revolutionize the way we live and work. It has already demonstrated its potential in areas like healthcare, transportation, finance, and more. (...)
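Since the prompt formatting for the instruct model is easy to get wrong, you may want to wrap the request into a small helper. Here is a minimal sketch building only on the /completion endpoint used above (the function name and defaults are just illustrative):
import requests

# Minimal helper around the llama.cpp server's /completion endpoint, so the
# [INST] prompt formatting for the instruct model lives in a single place.
def ask_mistral(prompt, url="http://localhost:8080/completion", temperature=0.8, n_predict=-1):
    payload = {
        "prompt": f"<s>[INST] {prompt} [/INST]",
        "temperature": temperature,
        "n_predict": n_predict,
    }
    response = requests.post(url, json=payload)
    response.raise_for_status()  # fail loudly if the server is not reachable
    return response.json()["content"]

print(ask_mistral("Give me three concrete applications of AI in healthcare."))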
Using the llama-cpp-python wrapper
Finally, in case you don’t need or want to deploy a web server, there is also a direct Python wrapper for llama.cpp, called llama-cpp-python.
In order to build and use the llama-cpp-python wheel on your Apple Silicon Mac with Metal (which makes LLMs around 10 times faster than without it, according to the repository documentation), you must use an ARM64 version of Python, which you can install as follows:
# Install Python for ARM64
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
# Install llama-cpp-python
CMAKE_ARGS="-DLLAMA_METAL=on" python3 -m pip install -U llama-cpp-python --no-cache-dir
You can then use any LLM (in GGUF format) very easily in your Python scripts:
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=4096, n_gpu_layers=-1, n_threads=4)
output = llm.create_completion("<s>[INST] AI is going to... [/INST]", max_tokens=None, temperature=0.8)
print(output)
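Note that create_completion returns an OpenAI-style completion dictionary rather than a plain string; assuming the usual llama-cpp-python output format, the generated text itself can be extracted like this:
# The generated text lives under choices[0]["text"] in the returned dict.
print(output["choices"][0]["text"])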
There’s even a direct interface to HuggingFace if you don’t want to bother downloading the weights yourself:
llm = Llama.from_pretrained(repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF", filename="*Q5_K_S.gguf", verbose=False)
Bonus: use llama-cpp-python with poetry
If you want to use it with the poetry environment manager instead of conda as shown in the docs, you can simply point your poetry environment to the ARM64 Python executable you previously installed:
poetry env use /PATH/TO/MINIFORGE3/miniforge3/bin/python3
CMAKE_ARGS="-DLLAMA_METAL=on" poetry run python3 -m pip install -U llama-cpp-python --no-cache-dir
Once again, make sure Metal is used by llama-cpp-python, which you can check simply by looking at the logs when running a test script.
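As a minimal test script (assuming the GGUF file downloaded earlier sits in the current directory), something like the following is enough; with verbose=True, the llama.cpp initialization logs should mention Metal if GPU offloading is active:
# Minimal smoke test: verbose=True prints the llama.cpp initialization logs,
# which should mention Metal when the GPU backend is in use.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct-v0.2.Q5_K_S.gguf",  # assumed local path
    n_ctx=4096,
    n_gpu_layers=-1,
    verbose=True,
)
output = llm.create_completion("<s>[INST] Say hello in one sentence. [/INST]", max_tokens=32)
print(output["choices"][0]["text"])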
Conclusion
After a bit of testing on my side, MistralAI’s new model mistral-7b gives some very convincing results and is a substantial improvement over the larger llama-2-13b. My only concern as of now is LLM hallucinations, which may be far more frequent than with GPTs.
The purpose of this article was to show that such a “small” model can easily be used on CPUs, more specifically on Apple Silicon ARM64 architectures. This makes the development of personal LLM-based projects much cheaper, easier, and more fun (and it is open-source, too). Of course, such a setup will never be on par with an RTX 4090 GPU, but that’s not the point.
Congratulations to the whole MistralAI team and to the people behind llama.cpp for achieving this and offering a powerful, open-source alternative to ChatGPT!