
Running Mistral 7B LLM with llama.cpp on Apple Silicon

2024/03/04

Tags: deep learning, llm, mistral, gpt

A couple of months ago, I got excited hearing that MistralAI, a French (!) company, released a powerful LLM (Large Language Model) with only (welcome to the LLM world) 7 billion parameters: mistral-7b. In comparison, OpenAI’s GPTs are more in the hundreds (if not thousands) of billions of parameters range. The natural point of comparison for mistral-7b is Meta’s “small” model, llama-2-13b, which has 13 billion parameters as its name suggests.

mistral-7b seems to outperform llama-2-13b on almost all tasks while having fewer parameters, which means it takes up less RAM and is more compute-efficient. Of course, these models still lag significantly behind GPT-3.5 and GPT-4, but keep in mind that they are not playing in the same ballpark at all, with 1 to 2 orders of magnitude of difference in the number of parameters.

Performance of Mistral 7B and different Llama models on a wide range of benchmarks, from https://mistral.ai/news/announcing-mistral-7b/

Being open-source (!), Mistral models appear as a more than decent, free alternative to GPTs for personal LLM projects or proofs of concept. I like being able to run things locally and not being rate-limited, at least before scaling up. Furthermore, I was impressed by the fact that some LLM runtime libraries such as llama.cpp are specifically optimized for smaller architectures and setups, with Apple Silicon listed as a “first class citizen” in the GitHub project’s goals.

In this short article, we will cover:

- building llama.cpp on Apple Silicon,
- quantizing the mistral-7b weights yourself, or downloading already-quantized weights,
- running mistral-7b from the command line, as a web server, and through the llama-cpp-python wrapper.

Alternatives to llama.cpp exist (Ollama, for instance, builds on top of it), but we will stick to llama.cpp here.

Building llama.cpp

This is fairly easy if you are using Linux or macOS (provided you have the Xcode Command Line Tools installed).

git clone git@github.com:ggerganov/llama.cpp.git
cd llama.cpp
make

This will build a version of llama.cpp targeted at your own CPU architecture. On macOS, Metal support is enabled by default, which lets llama.cpp make use of the Apple Silicon GPU cores!

See the README.md in the llama.cpp repository for more information on building and on the various architecture-specific accelerations.

Quantize mistral-7b weights yourself…

Quantization converts (some of) the parameters of your model from floating-point numbers (32 or 16 bits) to integers (8 or 4 bits). The model then takes up less disk space, uses less RAM when loaded, and runs inference faster as well.
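
To build some intuition, here is a toy sketch of block-wise quantization. This is not the actual Q4_K_M scheme used by llama.cpp (which is more elaborate); it only illustrates the core idea of storing small integers plus a per-block scale instead of full floats:

import numpy as np

# Toy block-wise 4-bit quantization: NOT llama.cpp's actual scheme,
# just an illustration of "integers + per-block scale" instead of floats.
def quantize_block(weights: np.ndarray):
    scale = np.abs(weights).max() / 7  # map values into the signed 4-bit range [-7, 7]
    q = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
block = rng.normal(size=32).astype(np.float32)  # one block of 32 weights
q, scale = quantize_block(block)
print("mean absolute error:", np.abs(block - dequantize_block(q, scale)).mean())
# 32 float32 weights = 128 bytes; 32 4-bit integers + one float32 scale ~ 20 bytes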

If you want to quantize a supported model yourself, you can download the original mistral-7b weights and then use the convert.py script in the llama.cpp repository:

python3 -m pip install -r requirements.txt

# Convert the model to ggml FP16
# (If the model is not using a BPE tokenizer, remove the flag)
python3 convert.py models/mymodel/ --vocab-type bpe

# Quantize the model to 4-bit with the Q4_K_M method
./quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

More information can be found in the llama.cpp README. Note that this requires enough RAM to load the original FP16 weights (around 14 GB for a 7B model), and it takes some time.
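
As a rough sanity check of the resulting file sizes, you can estimate them from the parameter count and the average number of bits per weight. The figures below are approximations (real GGUF files also contain metadata and keep a few tensors at higher precision):

# Rough size estimates for a ~7.24B-parameter model such as mistral-7b.
# The bits-per-weight values are approximate averages for each scheme.
n_params = 7.24e9

for name, bits_per_weight in [("F16", 16), ("Q8_0", 8.5), ("Q5_K_S", 5.5), ("Q4_K_M", 4.85)]:
    gigabytes = n_params * bits_per_weight / 8 / 1e9
    print(f"{name:7s} ~ {gigabytes:.1f} GB")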

… or, download the mistral-7b quantized weights

Fortunately, some amazing people have already converted a lot of models to the GGML/GGUF format and quantized them in every possible way, so we can simply pick the variant best suited to our use case.

Check out TheBloke’s model card for mistral-7b-instruct-v0.2 on HuggingFace. Note that this mistral-7b version is fine-tuned for instructions (solving problems and tasks), hence the name.

Under Files and versions, you can download the quantized weights you want; in the rest of this article, we will use mistral-7b-instruct-v0.2.Q5_K_S.gguf.
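
If you prefer scripting the download, the huggingface_hub package (pip install huggingface-hub) can fetch a single GGUF file from the repository. A minimal sketch, assuming the Q5_K_S file name used below:

from huggingface_hub import hf_hub_download

# Download one quantized GGUF file from TheBloke's repository
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q5_K_S.gguf",
)
print(model_path)  # local path inside the HuggingFace cache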

Run mistral-7b!

Now that everything is set up, running the model is fairly easy.

cd llama.cpp
./main -m mistral-7b-instruct-v0.2.Q5_K_S.gguf -c 4096 --temp 0.8 -t 4 --n-predict -1 --n-gpu-layers -1 -p "<s>[INST] AI is going to... [/INST]"

Let’s go through the parameters:

- -m: path to the model weights (GGUF format),
- -c: context size, here 4096 tokens,
- --temp: sampling temperature,
- -t: number of CPU threads to use,
- --n-predict: number of tokens to generate (-1 means no limit, i.e. generate until an end-of-text token is produced),
- --n-gpu-layers: number of layers to offload to the GPU (here -1, i.e. offload everything to Metal),
- -p: the prompt, using the [INST] ... [/INST] instruction format expected by mistral-7b-instruct.

With the above command, we get the following answer:

[INST] AI is going to... [/INST] AI is an ever-evolving technology and its potential applications are constantly being researched and developed. Some of the ways in which AI may be used in the future include:

1. Medical Diagnosis: AI algorithms could potentially be trained to diagnose diseases with greater accuracy than human doctors.

(...)
 
 8. Education: AI could be used to personalize learning experiences, providing students with targeted feedback and resources based on their individual strengths and weaknesses. [end of text]

Check the logs to make sure Metal is being used (“Metal” should appear a couple of times).

In this example, llama.cpp’s timing logs report the number of generated tokens and the generation speed in tokens per second.

A token is around 4 characters on average, so the ~1500 characters of this answer correspond to roughly 400 tokens, which should match the count reported in the logs.

Overall, on the user end, the speed feels almost as smooth as using ChatGPT (GPT-3.5), which is nice. You can also run the engine interactively (to mimic a chatbot experience) using the -i flag.

Do not hesitate to explore the different parameters by running ./main --help. Importantly, you may want to lower the --temp (temperature) parameter when using an instruct version of an LLM, since answers to instructions or task solving should not vary much in theory (unlike creative writing, for example).
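
To see why a lower temperature gives more deterministic answers, recall that sampling draws the next token from a softmax over the logits divided by the temperature: the lower the temperature, the more the probability mass concentrates on the top token. A quick illustration with made-up logits:

import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    z = logits / temperature
    z = z - z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([4.0, 3.0, 1.0])  # made-up logits for three candidate tokens
for t in (1.5, 0.8, 0.2):
    print(t, np.round(softmax_with_temperature(logits, t), 3))
# The lower the temperature, the more probability goes to the first (top) token.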

Running a mistral-7b server with llama.cpp

What is great is that you can also run llama.cpp as a web server, which exposes a built-in HTTP API. Start the server first:

./server -m mistral-7b-instruct-v0.2.Q5_K_S.gguf -c 4096 -t 4 -ngl -1

Then query it with curl or Python, for example:

import requests

url = "http://localhost:8080/completion"

prompt = "AI is going to..."
payload = {
    "prompt": f"<s>[INST] {prompt} [/INST]",
    "n_predict": -1,
    "temperature": 0.8,
}
# GPU offloading is configured when launching the server (the -ngl flag above),
# not per request, so it is not part of the payload.

response = requests.post(url, json=payload)
llm_answer = response.json()["content"]

print(llm_answer)

This returns:

 AI is going to revolutionize the way we live and work. It has already demonstrated its potential in areas like healthcare, transportation, finance, and more. (...)
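
The server also supports streaming tokens as they are generated by passing "stream": true. A minimal sketch, assuming the server’s server-sent-events format where each chunk arrives as a data: {...} line with a content field:

import json
import requests

# Stream tokens from the llama.cpp server as they are generated.
url = "http://localhost:8080/completion"
payload = {"prompt": "<s>[INST] AI is going to... [/INST]", "n_predict": 256, "stream": True}

with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip keep-alive/empty lines
        chunk = json.loads(line[len(b"data: "):])
        print(chunk.get("content", ""), end="", flush=True)
        if chunk.get("stop"):
            break
print()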

Using the llama-cpp-python wrapper

Finally, in case you don’t need or want to deploy a web server, there is also a direct Python wrapper for llama.cpp, called llama-cpp-python.

In order to build and use the llama-cpp-python wheel on your Apple Silicon Mac with Metal (which makes LLMs around 10 times faster than running without it, according to the repository documentation), you must use an ARM64 build of Python, which you can install as follows:

# Install Python for ARM64
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
# Install llama-cpp-python
CMAKE_ARGS="-DLLAMA_METAL=on" python3 -m pip install -U llama-cpp-python --no-cache-dir
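
You can double-check that the interpreter you ended up with really is an ARM64 build (an x86_64 Python running under Rosetta will not get the Metal speed-up):

import platform

# Should print "arm64" on Apple Silicon; "x86_64" means the interpreter
# is running under Rosetta and llama-cpp-python will not use Metal.
print(platform.machine())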

You can then use any LLM (in GGUF format) very easily in your python scripts:

from llama_cpp import Llama

# Load the quantized model, offloading all layers to the GPU (Metal)
llm = Llama(model_path="./mistral-7b-instruct-v0.2.Q5_K_S.gguf", n_ctx=4096, n_gpu_layers=-1, n_threads=4)

output = llm.create_completion("<s>[INST] AI is going to... [/INST]", max_tokens=None, temperature=0.8)

print(output)
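
For instruct/chat models, llama-cpp-python can also build the prompt for you through create_chat_completion, so you don’t have to write the [INST] tags by hand. A sketch reusing the llm object above (depending on the GGUF metadata, you may need to pass chat_format="mistral-instruct" when constructing Llama):

# Let llama-cpp-python apply a chat template instead of writing [INST] tags manually
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "AI is going to..."}],
    max_tokens=None,
    temperature=0.8,
)
print(output["choices"][0]["message"]["content"])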

There’s even a direct interface to HuggingFace if you don’t want to bother downloading the weights yourself:

llm = Llama.from_pretrained(repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF", filename="*Q5_K_S.gguf", verbose=False)

Bonus: use llama-cpp-python with poetry

If you want to use it with the Poetry environment manager instead of conda as shown in the docs, you can simply point your Poetry environment to the ARM64 Python executable you previously installed:

poetry env use /PATH/TO/MINIFORGE3/miniforge3/bin/python3
CMAKE_ARGS="-DLLAMA_METAL=on" poetry run python3 -m pip install -U llama-cpp-python --no-cache-dir

Once again, make sure Metal is actually being used by llama-cpp-python, which you can check by looking at the logs when running a test script.
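
A minimal test script for that purpose: with verbose=True and n_gpu_layers=-1, the Metal initialization messages (lines mentioning Metal, such as ggml_metal_init) should appear in the logs when the model is loaded:

from llama_cpp import Llama

# Loading the model with verbose=True prints the backend initialization logs,
# where the Metal-related lines should show up on Apple Silicon.
llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q5_K_S.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    verbose=True,
)
print(llm("Hello", max_tokens=8))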

Conclusion

After a bit of testing on my side, MistralAI’s mistral-7b gives some very convincing results and a substantial improvement over the bigger llama-2-13b from Meta. My only concern as of now is LLM hallucination, which may be quite a bit more frequent than with GPTs.

The purpose of this article was to show that such a “small” model can easily be run on consumer hardware, more specifically on Apple Silicon (ARM64) machines. This makes developing personal LLM-based projects much cheaper, easier and more fun (and it is open-source). Of course, such a setup will never be on par with an RTX 4090 GPU, but that’s not the point.

Congratulations to the whole MistralAI team and to the people behind llama.cpp for achieving this and offering a powerful, open-source alternative to ChatGPT!