llama.cpp

Definition

llama.cpp is a C/C++ inference runtime for large language models (LLMs) that runs quantized models on CPUs and modest GPUs without depending on PyTorch and without requiring CUDA. It makes it possible to run 7B to 70B parameter models (Llama, Mistral, Qwen) on consumer hardware: small quantized models fit on a laptop or even a Raspberry Pi, while the largest ones need a machine with correspondingly more RAM. The trade-off is some loss of speed compared with classic GPU framework stacks.
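Whether a model fits on a given machine is mostly a question of how many bytes its weights occupy. The following is a rough back-of-the-envelope sketch, not a measurement: the helper name `weight_memory_gb` is made up for illustration, and the nominal bit widths ignore the per-block scale factors that real quantization formats store, as well as the memory needed for the context (KV cache) and activations.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal bit widths only; real quantized files are somewhat larger.
for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    int8 = weight_memory_gb(n_params, 8)
    int4 = weight_memory_gb(n_params, 4)
    print(f"{name}: 16-bit ~{fp16:.1f} GB, 8-bit ~{int8:.1f} GB, 4-bit ~{int4:.1f} GB")
```

Under these assumptions, a 7B model shrinks from about 14 GB at 16 bits to about 3.5 GB at 4 bits, which is why quantization decides what runs on a laptop.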

Mechanics

llama.cpp loads a model in the GGUF format (the successor to the earlier GGML file format; originally built around Llama, now used for many architectures), evaluates the network layer by layer in memory, and performs the matrix multiplications with optimized C/C++ kernels. Quantized weights (typically 4-bit or 8-bit) are dequantized on the fly during computation. The kernels use CPU vector instructions (AVX2, NEON) and optional accelerator backends (Metal on macOS, CUDA or ROCm on GPUs), without PyTorch having to be loaded. Token-by-token sampling runs in the same process.
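To make "dequantized on the fly" concrete, here is a deliberately simplified sketch of block-wise 4-bit quantization in pure Python. It is not the actual GGUF storage layout or any of llama.cpp's kernels: the block length of 32, the symmetric scale, and the function names `quantize_block` and `dequantize_block` are assumptions chosen for illustration.

```python
import random

BLOCK_SIZE = 32  # assumed block length for this sketch


def quantize_block(weights):
    """Map one block of floats to signed 4-bit integers plus a single scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Clamp to the signed 4-bit range [-8, 7].
    quants = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats; this is the step done during inference."""
    return [q * scale for q in quants]


random.seed(0)
block = [random.uniform(-1.0, 1.0) for _ in range(BLOCK_SIZE)]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.4f}, max abs error={max_error:.4f}")
```

Each block stores only small integers and one scale, so the reconstruction error per weight is bounded by half the scale. Real formats refine this idea with different block structures and additional per-block parameters.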

Example

# Build llama.cpp and download a quantized model
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Recent checkouts build with CMake; older ones used plain `make` and produced ./main
cmake -B build
cmake --build build --config Release

# Download a Mistral model in GGUF format
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf

# Run inference from the CLI (the binary was called ./main in older versions)
./build/bin/llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -p "Explain quantization in one line:" -n 50

Or use the Python API for integration:

pip install llama-cpp-python

import llama_cpp

# Load the model (the GGUF file must be available locally)
llm = llama_cpp.Llama(
    model_path="./mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    n_gpu_layers=35,  # layers to offload to the GPU if Metal/CUDA is available; 0 = CPU only
    n_ctx=2048,  # context window in tokens
)

# Run inference
output = llm(
    "Explain machine learning briefly:",
    max_tokens=100,
    temperature=0.7,
)
print(output["choices"][0]["text"])

llama_cpp_demo.py
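The `temperature=0.7` argument in the call above controls how sharply the sampler favors the most likely next token. The sketch below shows only that one idea in isolation; it is not llama.cpp's sampler, which combines temperature with further steps such as top-k and top-p filtering. The logits and the function names are invented for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index according to the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")

rng = random.Random(0)
token = sample_token(logits, 0.7, rng)
print("sampled token index:", token)
```

At low temperature almost all probability mass sits on the top-scoring token, so output becomes more deterministic; higher values flatten the distribution and make output more varied.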

Run locally: setup for your machine

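Prerequisites:

Python 3.10 or newer must be installed (download: python.org/downloads). In the Windows installer, enable the option "Add Python to PATH"; otherwise the console will not find `python`. Because `pip install llama-cpp-python` normally compiles llama.cpp from source, a working C/C++ compiler is also needed.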

# Windows (PowerShell)
# 1) Open a console: press Win+R, type "powershell", press Enter.
# 2) Check that Python is installed:
py --version
# If the command is not recognized -> https://python.org/downloads

# 3) Project setup: venv + dependencies
py -m venv .venv
.venv\Scripts\Activate.ps1
pip install llama-cpp-python

# 4) Save the code as llama_cpp_demo.py (copy it into an editor),
#    then run it:
python llama_cpp_demo.py

# macOS / Linux
# 1) Open a terminal:
#    macOS:  Cmd+Space -> type "Terminal" -> Enter
#    Linux:  Ctrl+Alt+T (in most distributions)
# 2) Check that Python is installed:
python3 --version
# If it is missing:
#    macOS:   brew install python   (or https://python.org/downloads)
#    Debian:  sudo apt install python3 python3-venv
#    Fedora:  sudo dnf install python3

# 3) Project setup: venv + dependencies
python3 -m venv .venv
source .venv/bin/activate
pip install llama-cpp-python

# 4) Save the code as llama_cpp_demo.py (copy it into an editor),
#    then run it:
python llama_cpp_demo.py
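Before running the demo, it can help to confirm that the environment matches the steps above. The following preflight script is a hypothetical helper, not part of llama.cpp or llama-cpp-python: it checks the Python version, whether the `llama_cpp` package can be found in the active environment, and whether the model file exists. The model path is the one assumed by the example and may differ on your machine.

```python
import importlib.util
import sys
from pathlib import Path


def preflight(model_path, min_python=(3, 10)):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info < min_python:
        found = sys.version.split()[0]
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )
    # find_spec only looks the package up; it does not import it.
    if importlib.util.find_spec("llama_cpp") is None:
        problems.append("llama-cpp-python is not installed in this environment")
    path = Path(model_path)
    if not path.is_file():
        problems.append(f"model file not found: {path}")
    elif path.suffix != ".gguf":
        problems.append(f"expected a .gguf file, got: {path.name}")
    return problems


issues = preflight("./mistral-7b-instruct-v0.1.Q4_K_M.gguf")
print("ready" if not issues else "\n".join(issues))
```

If the script reports a missing package, the virtual environment is probably not activated; if it reports a missing model file, download the GGUF file into the directory from which the demo is started.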
