experimental AI setup

I am running this setup without any GPU only with a normal CPU on a local PC at home with 64 GB of RAM (32 GB should also be ok).

If you can afford some more hardware, consider looking into NVIDIA DGX Spark. Hier also some German Heise-News about Nvidia DGX Spark.

llama.cpp

Install llama.cpp, see also https://llama-cpp.com/:

sudo apt-get update
sudo apt-get install -y pciutils build-essential cmake curl libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF #-DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

(As an alternative, look into running vllm.)

Install qwen3.6 from huggingface:

sudo apt-get update
sudo apt-get install -y python3-venv
python3 -m venv venv
. venv/bin/activate
pip3 install huggingface_hub hf_transfer
hf cache list
hf models list
hf models info unsloth/Qwen3.6-27B-MTP-GGUF
MODEL="unsloth/Qwen3.6-27B-MTP-GGUF"
#MODEL="unsloth/Qwen3.6-27B-GGUF"
#MODEL="unsloth/Qwen3.6-35B-A3B-GGUF"
hf download $MODEL --include "*mmproj-BF16*" --include "*UD-Q6_K_XL*"

See also:

Start script:

#!/bin/bash

SERVERHOST="127.0.0.1"
#SERVERHOST="0.0.0.0"
SERVERPORT="8080"

# Number of threads to run concurrently. Adjust to local hardware.
THREADS="12"

MODEL="unsloth/Qwen3.6-27B-MTP-GGUF"
#MODEL="unsloth/Qwen3.6-27B-GGUF"
#MODEL="unsloth/Qwen3.6-35B-A3B-GGUF"

./llama.cpp/llama-server \
    -hf $MODEL:UD-Q6_K_XL \
    --temp 0.6 \
    --top-k 20 \
    --top-p 0.95 \
    --min-p 0.00 \
    --presence-penalty 0.0 \
    --spec-type draft-mtp --spec-draft-n-max 2 \
    --reasoning on \
    --chat-template-kwargs '{"preserve_thinking":true}' \
    --threads $THREADS \
    --host $SERVERHOST \
    --port $SERVERPORT

    # 2>&1 | tee startup.sh.LOG.$BASHPID

#    --image-min-tokens 1024 \
#    --no-mmap --mlock \
#    --ctx-size 81920 \
#    --ctx-size 262144 \
#    -ctk q4_0 -ctv q4_0 \
#    --parallel -1 \
#    --no-mmproj \

#    --temp 1.0 \
#    --presence_penalty 1.5 \
# For precise coding tasks, change to:
#    --temp 0.6 \
#    --presence_penalty 0.0 \

# Instead of -hf param:
# --model unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-UD-Q6_K_XL.gguf \
# --mmproj unsloth/Qwen3.6-27B-GGUF/mmproj-BF16.gguf \
# --alias unsloth/Qwen3.6-27B-GGUF \
# --model unsloth/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q6_K_XL.gguf \
# --mmproj unsloth/Qwen3.6-35B-A3B-GGUF/mmproj-BF16.gguf \
# --alias unsloth/Qwen3.6-35B-A3B-GGUF \

hermes agent

https://github.com/nousresearch/hermes-agent

opencode

See https://opencode.ai/.

npm install -g @opencode/cli
opencode config set model http://localhost:8080/v1
opencode config set api-key "not-needed"
opencode

openclaw

Not running this myself, but you might want to check out: https://openclaw.ai/

sashiko

Sashiko is an agentic Linux kernel code review system.

Impressum