LMForge

Run multiple LLMs simultaneously on your local hardware.

LMForge is a persistent daemon that picks the right engine for your hardware (Apple MLX, SGLang, llama.cpp), manages VRAM, and exposes a single OpenAI-compatible REST API to every app on your machine. Closing the desktop UI never stops your models.

⚡ Always-on system daemon

🎯 Hardware-aware engine selection

🔌 OpenAI + Ollama compatible

🧠 Multi-model orchestration

🎯 Hardware-aware engine selection

Probes the host at startup and picks the best inference backend automatically — no engine config to maintain.

Apple Silicon → oMLX (Metal/MLX, OpenAI-compatible server). Linux with an NVIDIA GPU → SGLang (CUDA, high-concurrency). Windows with NVIDIA, ARM Linux, or CPU-only hosts → llama.cpp with automatic `-ngl` (GPU layer offload) tuning. The engine choice is driven by hardware rather than user configuration, so clients see one OpenAI-compatible API regardless of what runs underneath.
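The mapping above can be sketched as a small decision function. This is an illustration only: the `Host` type, its field names, and `select_engine` are hypothetical and not LMForge's actual probe code, and the sketch ignores any VRAM threshold the real selection may apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical probe result; field names are illustrative assumptions."""
    os: str            # "macos", "linux", or "windows"
    arch: str          # "arm64" or "x86_64"
    has_nvidia: bool   # True when an NVIDIA GPU was detected


def select_engine(host: Host) -> str:
    """Map a probed host to an engine name, following the rules in the text."""
    if host.os == "macos" and host.arch == "arm64":
        return "omlx"        # Apple Silicon -> oMLX (Metal/MLX)
    if host.os == "linux" and host.arch == "x86_64" and host.has_nvidia:
        return "sglang"      # Linux NVIDIA -> SGLang (CUDA)
    # Windows NVIDIA, ARM Linux, and CPU-only hosts all fall back to llama.cpp.
    return "llama.cpp"
```

For example, `select_engine(Host("windows", "x86_64", True))` falls through to `"llama.cpp"`, matching the Windows NVIDIA rule.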

🧠 Multi-model orchestration

Run chat, embedding, vision, and reranker models simultaneously — each in its own engine subprocess, each with its own keep-alive lifecycle.

Each loaded model runs in its own engine subprocess on a unique TCP port. The EngineManager owns a slot table, supervises the child processes, broadcasts state changes over SSE and Tauri IPC, and evicts the least-recently-used (LRU) slot when a new load would exceed the VRAM budget. oMLX manages its own residency, so the Rust keepalive timer is skipped for that engine.
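A minimal sketch of LRU eviction under a VRAM budget, to make the slot-table behavior concrete. The `SlotTable` class, its method names, and the MiB accounting are assumptions for illustration; the real EngineManager is written in Rust and also supervises subprocesses, which this toy omits.

```python
from collections import OrderedDict


class SlotTable:
    """Toy slot table: model name -> VRAM footprint in MiB, ordered by recency."""

    def __init__(self, vram_budget_mib: int) -> None:
        self.vram_budget_mib = vram_budget_mib
        self.slots: "OrderedDict[str, int]" = OrderedDict()

    def used_mib(self) -> int:
        return sum(self.slots.values())

    def touch(self, model: str) -> None:
        """Mark a loaded model as most recently used."""
        self.slots.move_to_end(model)

    def load(self, model: str, size_mib: int) -> list:
        """Load a model, evicting LRU slots until it fits. Returns evicted names."""
        if size_mib > self.vram_budget_mib:
            raise ValueError(f"{model} does not fit in the VRAM budget")
        if model in self.slots:
            self.touch(model)
            return []
        evicted = []
        while self.used_mib() + size_mib > self.vram_budget_mib:
            victim, _ = self.slots.popitem(last=False)  # oldest entry first
            evicted.append(victim)
        self.slots[model] = size_mib
        return evicted
```

With a 16000 MiB budget, loading `chat` (8000) and `embed` (2000), touching `chat`, and then loading `vision` (8000) evicts only `embed`, because it is the least recently used slot and removing it frees enough room.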

💭 Thinking models with budget cap

Native two-call workflow for Qwen3 / DeepSeek-R1 style reasoners — control how long the model thinks before it answers.

Call 1 streams reasoning tokens up to `thinking_budget`. When the budget is exhausted, LMForge appends the accumulated reasoning as a closed `<think>…</think>` turn and issues call 2 with `enable_thinking: false`. Reasoning deltas stream live to the client during call 1, and a `call2_prefill` SSE event marks the start of the answer phase for UI feedback.
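The hand-off between the two calls might look roughly like the sketch below. The function name `build_call2_request`, the use of an `assistant` role for the reasoning turn, and the placement of `enable_thinking` at the top level of the request are assumptions; the text only states that the reasoning is appended as a closed `<think>…</think>` turn and that thinking is disabled for call 2.

```python
import copy


def build_call2_request(call1_request: dict, reasoning: str) -> dict:
    """Build the follow-up request once call 1 has exhausted its budget."""
    request = copy.deepcopy(call1_request)  # leave the caller's request untouched
    # Append the accumulated reasoning as a closed <think> block (assumed role).
    request["messages"].append(
        {"role": "assistant", "content": f"<think>{reasoning}</think>"}
    )
    # Call 2 should produce the answer without thinking again.
    request["enable_thinking"] = False
    return request
```

Copying the request first keeps call 1's message list intact, which matters if the caller wants to log or retry it.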

🖼️ Vision + image preflight

Multimodal requests with server-side URL fetching, real User-Agent, size caps, and capability gating before the engine spins up.

Remote `image_url` references are fetched server-side and rewritten as inline `data:` URLs before they reach the engine, so hosts that block requests with an empty User-Agent (Wikimedia, several CDNs) no longer cause silent hallucinations. Sending an image to a non-vision model returns a 400 with `vision_not_supported` before any subprocess work happens. Live counters for accepted, rejected, and data-URL image inputs surface in the observability dashboard.
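The preflight steps described above could be sketched as follows. Everything named here is hypothetical: `preflight_image`, `PreflightError`, the 10 MiB cap, and the `image_too_large` code are illustrative assumptions; only the 400 status and the `vision_not_supported` code come from the text. The network fetch is injected as a callable so the sketch stays runnable offline.

```python
import base64
from typing import Callable, Tuple

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumed cap; the real limit is not stated


class PreflightError(Exception):
    """Rejection raised before any engine subprocess is involved."""

    def __init__(self, status: int, code: str) -> None:
        super().__init__(code)
        self.status = status
        self.code = code


def preflight_image(
    url: str,
    model_supports_vision: bool,
    fetch: Callable[[str], Tuple[bytes, str]],
) -> str:
    """Return a data: URL for `url`, or raise before any engine work starts."""
    if not model_supports_vision:
        raise PreflightError(400, "vision_not_supported")
    if url.startswith("data:"):
        return url  # already inline, nothing to fetch
    body, mime = fetch(url)  # fetch() is expected to send a real User-Agent
    if len(body) > MAX_IMAGE_BYTES:
        raise PreflightError(400, "image_too_large")  # hypothetical error code
    encoded = base64.b64encode(body).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Checking the model's capability first means a request to a text-only model is rejected without a network round trip.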

Chat completion

    curl http://127.0.0.1:11430/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "qwen3:8b:4bit",
        "messages": [
          {"role": "user", "content": "Why is Rust good for inference servers?"}
        ],
        "stream": true
      }'

data: { "id": "chatcmpl-...", "object": "chat.completion.chunk", "model": "qwen3:8b:4bit", "choices": [{ "index": 0, "delta": { "role": "assistant", "content": "Zero-cost" }, "finish_reason": null, "logprobs": null }] }
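A client consuming this stream needs to strip the `data:` prefix from each line and join the `delta.content` fragments. The sketch below shows one way to do that over an iterable of already-decoded lines. The `collect_content` helper is hypothetical, and the `[DONE]` sentinel follows OpenAI's streaming convention; whether LMForge emits it is an assumption.

```python
import json
from typing import Iterable


def collect_content(lines: Iterable[str]) -> str:
    """Concatenate delta.content from OpenAI-style SSE chunk lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # OpenAI's end-of-stream sentinel (assumed here)
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```

In a real client the lines would come from an HTTP response body read incrementally rather than from a list.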

Thinking model with budget cap

    curl http://127.0.0.1:11430/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "qwen3.5:4b:4bit",
        "messages": [{"role": "user", "content": "Prove the Pythagorean theorem."}],
        "think": true,
        "thinking_budget": 4096,
        "stream_reasoning_deltas": true,
        "stream": true
      }'

Embeddings (auto-batched)

    {
      "object": "list",
      "model": "qwen3-embed:0.6b:4bit",
      "data": [
        { "object": "embedding", "index": 0, "embedding": [0.012, -0.034, ...] },
        { "object": "embedding", "index": 1, "embedding": [0.045, 0.018, ...] }
      ],
      "usage": { "prompt_tokens": 6, "total_tokens": 6 }
    }
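Once a response of this shape has been parsed, a typical next step is to pull the vectors out in input order and compare them. The helpers below are a hedged sketch, not part of LMForge: `vectors_by_index` and `cosine_similarity` are hypothetical names, and the code assumes non-zero vectors of equal length.

```python
import math


def vectors_by_index(response: dict) -> list:
    """Return embedding vectors ordered by their `index` field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Sorting by `index` guards against a server returning batched items out of order, so each vector can be matched back to the input that produced it.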

Warm a model from your app's startup

    for m in qwen3.5:4b:4bit qwen2.5-vl:7b:4bit qwen3-embed:0.6b:8bit; do
      curl -sS -X POST http://127.0.0.1:11430/lf/model/switch \
        -H 'Content-Type: application/json' \
        -d "{\"model\":\"$m\"}"
    done
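An application that is not started from a shell can issue the same warm-up requests itself. The sketch below mirrors the loop above using only Python's standard library. The endpoint, port, and body come from the curl example; the helper names and the 30-second timeout are assumptions.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:11430"  # address used in the curl examples


def build_switch_request(model: str) -> urllib.request.Request:
    """Build the POST /lf/model/switch request for one model shortcut."""
    body = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/lf/model/switch",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_models(models: list, timeout: float = 30.0) -> None:
    """Send one switch request per model; the timeout value is an assumption."""
    for model in models:
        request = build_switch_request(model)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()  # drain the body so the connection closes cleanly
```

Calling `warm_models(["qwen3.5:4b:4bit", "qwen2.5-vl:7b:4bit", "qwen3-embed:0.6b:8bit"])` at startup sends the same three requests as the shell loop; it raises `urllib.error.URLError` if the daemon is not reachable.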

Where it runs

| Platform | Architecture | Engine | Core | Desktop UI |
| --- | --- | --- | --- | --- |
| macOS 13+ | Apple Silicon (arm64) | oMLX (Metal/MLX) | ✓ | ✓ DMG |
| Ubuntu 22.04+ | x86_64 | SGLang (NVIDIA, 24 GB+) / llama.cpp | ✓ | ✓ AppImage |
| Ubuntu 22.04+ | arm64 | llama.cpp | ✓ | 🔜 Planned |
| Windows 10/11 | x86_64 | llama.cpp (CPU + NVIDIA CUDA) | ✓ | ✓ NSIS |
| Windows + WSL2 | x86_64 | SGLang (CUDA via WSL) | ✓ (inside WSL) | via Linux build |

Curated shortcuts

| Shortcut | macOS (MLX) | Linux / Windows (GGUF) |
| --- | --- | --- |
| `qwen3:8b:4bit` | mlx-community/Qwen3-8B-4bit | bartowski/Qwen3-8B-GGUF |
| `qwen3.5:4b:4bit` | mlx-community/Qwen3.5-4B-4bit | Qwen/Qwen3.5-4B-GGUF |
| `qwen3.5:9b:4bit` | mlx-community/Qwen3.5-9B-OptiQ-4bit | bartowski/Qwen3.5-9B-OptiQ-GGUF |
| `gemma3:4b:4bit` | mlx-community/gemma-3-4b-it-4bit | bartowski/gemma-3-4b-it-GGUF |
| `gemma3:12b:4bit` | mlx-community/gemma-3-12b-it-4bit | bartowski/gemma-3-12b-it-GGUF |
| `llama3.1:8b:4bit` | mlx-community/Meta-Llama-3.1-8B-Instruct-4bit | bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |
| `llama4:17b:4bit:scout` | mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit | bartowski/Llama-4-Scout-17B-16E-Instruct-GGUF |
| `phi4:4b:4bit` | mlx-community/Phi-4-mini-instruct-4bit | bartowski/Phi-4-mini-instruct-GGUF |
| `deepseek_r1:8b:4bit:distill-qwen` | mlx-community/DeepSeek-R1-Distill-Qwen-8B-4bit | unsloth/DeepSeek-R1-Distill-Qwen-8B-GGUF |
