LMForge

In Progress

Hardware-Aware LLM Inference Orchestrator

Run multiple LLMs simultaneously on your local hardware.

LMForge is a persistent daemon that picks the right engine for your hardware (Apple MLX, SGLang, llama.cpp), manages VRAM, and exposes a single OpenAI-compatible REST API to every app on your machine. Closing the desktop UI never stops your models.

Always-on system daemon
🎯 Hardware-aware engine selection
🔌 OpenAI + Ollama compatible
🧠 Multi-model orchestration

// by the numbers

3 inference engines: oMLX · SGLang · llama.cpp
12+ curated chat models: Qwen3 · Gemma · Llama · Phi · DeepSeek-R1
7 embedding models: Nomic · Qwen3-embed · BGE · Snowflake
3 VLMs + 5 rerankers: Qwen2.5-VL · BGE / Jina / Qwen3 rerankers
3 API surfaces: OpenAI /v1 · Ollama /api · native /lf
3–60s cold load latency: depends on model + engine

// install

Get running in one command

1. Install core (daemon + CLI) bash
curl -fsSL https://github.com/phoenixtb/lmforge/releases/latest/download/install-core.sh | bash
2. Install desktop UI (optional) bash
curl -fsSL https://github.com/phoenixtb/lmforge/releases/latest/download/install-ui.sh | bash
3. Pull a model and chat bash
lmforge pull qwen3:8b:4bit
lmforge run qwen3:8b:4bit

The installer drops the binary in /usr/local/bin, registers a launchd agent (macOS) or systemd --user unit (Linux), and starts the daemon. No sudo required.

// features

What it does

🎯

Hardware-aware engine selection

Probes the host at startup and picks the best inference backend automatically — no engine config to maintain.

Apple Silicon → oMLX (Metal/MLX, OpenAI-compatible server). Linux NVIDIA → SGLang (CUDA, high-concurrency). Windows NVIDIA, ARM Linux, or CPU-only → llama.cpp with auto -ngl tuning. Engine choice is hardware-driven, not user-configured; users get one OpenAI API regardless of what's underneath.
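
The selection rules above can be sketched as a small decision function. This is an illustration only, not LMForge's actual probe code: the `Host` type, its field values, and the engine name strings are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical result of a hardware probe (field names are assumptions)."""
    os: str          # "macos", "linux", or "windows"
    arch: str        # "arm64" or "x86_64"
    gpu_vendor: str  # "apple", "nvidia", or "none"


def select_engine(host: Host) -> str:
    """Map a probed host to an engine name, following the rules listed above."""
    if host.os == "macos" and host.arch == "arm64":
        return "omlx"
    if host.os == "linux" and host.arch == "x86_64" and host.gpu_vendor == "nvidia":
        return "sglang"
    # Windows NVIDIA, ARM Linux, and CPU-only hosts all fall through to here.
    return "llama.cpp"
```

Keeping the decision in one pure function makes the mapping easy to test without real hardware.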

🧠

Multi-model orchestration

Run chat, embedding, vision, and reranker models simultaneously — each in its own engine subprocess, each with its own keep-alive lifecycle.

One engine subprocess per loaded model on a unique TCP port. The EngineManager owns a slot table, supervises the children, broadcasts state changes over SSE + Tauri IPC, and evicts the LRU slot when a new load would exceed the VRAM budget. oMLX manages its own residency; the Rust keepalive timer is skipped for that engine.
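
A minimal sketch of the slot-table eviction idea, assuming each model's VRAM footprint is known up front in megabytes. The `SlotTable` class and its method names are hypothetical; the real EngineManager is written in Rust and also supervises subprocesses, which this sketch leaves out.

```python
from collections import OrderedDict


class SlotTable:
    """Tracks loaded models in least-recently-used order (oldest first)."""

    def __init__(self, vram_budget_mb: int) -> None:
        self.vram_budget_mb = vram_budget_mb
        self.slots: "OrderedDict[str, int]" = OrderedDict()  # model -> VRAM in MB

    def used_mb(self) -> int:
        return sum(self.slots.values())

    def touch(self, model: str) -> None:
        """Mark a resident model as most recently used."""
        self.slots.move_to_end(model)

    def load(self, model: str, size_mb: int) -> list[str]:
        """Load a model, evicting LRU slots until it fits. Returns evicted names."""
        if model in self.slots:
            self.touch(model)
            return []
        if size_mb > self.vram_budget_mb:
            raise ValueError(f"{model} needs {size_mb} MB, budget is {self.vram_budget_mb} MB")
        evicted = []
        while self.used_mb() + size_mb > self.vram_budget_mb:
            victim, _ = self.slots.popitem(last=False)  # pop the oldest entry
            evicted.append(victim)
        self.slots[model] = size_mb
        return evicted
```

Calling `touch` on every request is what keeps frequently used models at the safe end of the queue.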

💭

Thinking models with budget cap

Native two-call workflow for Qwen3 / DeepSeek-R1 style reasoners — control how long the model thinks before it answers.

Call 1 streams reasoning tokens up to `thinking_budget`. When the budget is exhausted, LMForge appends the accumulated reasoning as a closed `<think>…</think>` turn and issues call 2 with `enable_thinking: false`. Live reasoning deltas stream to the client during call 1; a `call2_prefill` SSE event marks the start of the answer phase for UI feedback.
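
The hand-off between the two calls might look roughly like the sketch below. It assumes the frozen reasoning is replayed as an assistant message wrapped in `<think>` tags and that the client-facing fields (`think`, `thinking_budget`, `stream_reasoning_deltas`) are dropped for call 2. The function name and both of those details are assumptions; the actual wire format may differ.

```python
def build_call2_request(call1_request: dict, reasoning: str) -> dict:
    """Build the second request once the thinking budget is exhausted.

    Assumption: the accumulated reasoning is appended as a closed
    <think>...</think> assistant turn and thinking is disabled for call 2.
    """
    # Copy the message list so the caller's request is not mutated.
    messages = list(call1_request["messages"])
    messages.append({"role": "assistant", "content": f"<think>\n{reasoning}\n</think>"})

    # Drop the thinking controls that only apply to call 1.
    call2 = {
        key: value
        for key, value in call1_request.items()
        if key not in ("think", "thinking_budget", "stream_reasoning_deltas")
    }
    call2["messages"] = messages
    call2["enable_thinking"] = False
    return call2
```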

🖼️

Vision + image preflight

Multimodal requests with server-side URL fetching, real User-Agent, size caps, and capability gating before the engine spins up.

Remote `image_url` references are fetched server-side and rewritten as inline `data:` URLs before reaching the engine — hosts that block empty UAs (Wikimedia, several CDNs) no longer cause silent hallucinations. Sending an image to a non-vision model returns a 400 with `vision_not_supported` before any subprocess work happens. Live counters for accepted, rejected, and data-URL image inputs surface in the observability dashboard.
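
The two preflight checks described above can be sketched without any network code: inlining already-fetched bytes as a `data:` URL, and rejecting image input for non-vision models. The size cap value, the `PreflightError` type, and the `image_too_large` code are hypothetical; only `vision_not_supported` and the 400 status come from the description above.

```python
import base64

MAX_IMAGE_BYTES = 8 * 1024 * 1024  # hypothetical cap, not LMForge's actual limit


class PreflightError(Exception):
    """Carries the HTTP status and error code to return to the client."""

    def __init__(self, status: int, code: str) -> None:
        super().__init__(f"{status} {code}")
        self.status = status
        self.code = code


def to_data_url(image_bytes: bytes, mime_type: str) -> str:
    """Inline already-fetched image bytes as a data: URL, enforcing the size cap."""
    if len(image_bytes) > MAX_IMAGE_BYTES:
        raise PreflightError(413, "image_too_large")  # hypothetical error code
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def check_vision_capability(model_capabilities: set, has_image: bool) -> None:
    """Reject image input for non-vision models before any engine work starts."""
    if has_image and "vision" not in model_capabilities:
        raise PreflightError(400, "vision_not_supported")
```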

📊

VRAM-aware LRU eviction

Loads models up to the detected VRAM budget; evicts least-recently-used when new loads don't fit.

🔌

OpenAI + Ollama compatible

Drop-in replacement for both ecosystems. For OpenAI clients, set `OPENAI_API_BASE=http://127.0.0.1:11430/v1` and you're done.

⚙️

System service everywhere

launchd on macOS, systemd --user on Linux, Scheduled Task on Windows — all via `lmforge service install`. No root.

📡

Prometheus + SSE telemetry

`/metrics` for scrapers, `/lf/status/stream` for UIs. Per-endpoint latency, model load history, image preflight mix, auth rejections.

🔁

Idempotent CLI

`lmforge start` is safe to call from any script — no-ops if the daemon is already running.

🔒

sha256 download verification

Each download's sha256 is compared against HuggingFace's `X-Linked-Etag` header for LFS files. Corrupt downloads are auto-deleted.
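
A minimal sketch of the verify-then-delete step, assuming the expected digest has already been read from the response header. The function names are hypothetical, and the real downloader is part of the Rust daemon.

```python
import hashlib
import os


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model weights never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_sha256: str) -> bool:
    """Return True if the file matches; delete it and return False otherwise."""
    # ETag-style header values are often quoted, so strip quotes before comparing.
    expected = expected_sha256.strip().strip('"').lower()
    if sha256_of(path) == expected:
        return True
    os.remove(path)
    return False
```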

// api in action

Real requests, real responses

Chat completion

OpenAI-compatible. Set `OPENAI_API_BASE=http://127.0.0.1:11430/v1` and call from any OpenAI SDK (newer OpenAI SDK releases read `OPENAI_BASE_URL` instead).

POST /v1/chat/completions bash
curl http://127.0.0.1:11430/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3:8b:4bit",
    "messages": [
      {"role": "user", "content": "Why is Rust good for inference servers?"}
    ],
    "stream": true
  }'
First SSE chunk json
data: {
  "id": "chatcmpl-...",
  "object": "chat.completion.chunk",
  "model": "qwen3:8b:4bit",
  "choices": [{
    "index": 0,
    "delta": { "role": "assistant", "content": "Zero-cost" },
    "finish_reason": null,
    "logprobs": null
  }]
}
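
On the client side, a stream like this can be reassembled with a few lines of parsing. The sketch below assumes each event arrives as a single-line `data:` payload and that the stream ends with `data: [DONE]`, as in OpenAI's API; the chunk above is pretty-printed for readability. The function name is hypothetical.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate delta.content from OpenAI-style SSE lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```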

Thinking model with budget cap

Two-call workflow — call 1 streams reasoning tokens up to `thinking_budget`; call 2 streams the final answer with reasoning frozen.

POST /v1/chat/completions bash
curl http://127.0.0.1:11430/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3.5:4b:4bit",
    "messages": [{"role": "user", "content": "Prove the Pythagorean theorem."}],
    "think": true,
    "thinking_budget": 4096,
    "stream_reasoning_deltas": true,
    "stream": true
  }'

Embeddings (auto-batched)

Inputs over `embed_batch_size` are auto-chunked across multiple engine calls; `usage.prompt_tokens` and indices are re-merged transparently.

POST /v1/embeddings bash
curl http://127.0.0.1:11430/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3-embed:0.6b:4bit",
    "input": ["doc one", "doc two"]
  }'
Response json
{
  "object": "list",
  "model": "qwen3-embed:0.6b:4bit",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.012, -0.034, ...] },
    { "object": "embedding", "index": 1, "embedding": [0.045, 0.018, ...] }
  ],
  "usage": { "prompt_tokens": 6, "total_tokens": 6 }
}
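
The chunk-and-merge behaviour can be sketched as follows. It assumes each engine call returns batch-local indices starting at 0 and its own `usage.prompt_tokens`; `embed_batched` and the `call_engine` callback are hypothetical names used for illustration.

```python
from typing import Callable


def embed_batched(inputs: list, batch_size: int, call_engine: Callable[[list], dict]) -> dict:
    """Split inputs into batches, call the engine per batch, and merge the results."""
    data = []
    prompt_tokens = 0
    for start in range(0, len(inputs), batch_size):
        response = call_engine(inputs[start:start + batch_size])
        for item in response["data"]:
            # Engine indices are batch-local; shift them to global positions.
            data.append({**item, "index": start + item["index"]})
        prompt_tokens += response["usage"]["prompt_tokens"]
    return {
        "object": "list",
        "data": data,
        "usage": {"prompt_tokens": prompt_tokens, "total_tokens": prompt_tokens},
    }
```

From the caller's point of view the merged response has the same shape as a single-batch one, which is what makes the chunking transparent.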

Warm a model from your app's startup

Idempotent — no-op if the model is already resident. Load progress streams on `GET /lf/status/stream` (SSE).

POST /lf/model/switch bash
for m in qwen3.5:4b:4bit qwen2.5-vl:7b:4bit qwen3-embed:0.6b:8bit; do
  curl -sS -X POST http://127.0.0.1:11430/lf/model/switch \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$m\"}"
done
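
The same warm-up loop from an application's startup code, sketched with only the Python standard library. The 120-second timeout is an assumption sized above the cold-load range quoted earlier, and whether the endpoint blocks until the model is resident is not specified here, so treat the returned status codes as acknowledgements rather than proof of a finished load.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:11430"


def build_switch_request(model: str) -> urllib.request.Request:
    """Build the POST /lf/model/switch request for one model."""
    body = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/lf/model/switch",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_models(models: list, timeout_s: float = 120.0) -> dict:
    """Send one switch request per model; returns HTTP status codes by model."""
    statuses = {}
    for model in models:
        with urllib.request.urlopen(build_switch_request(model), timeout=timeout_s) as resp:
            statuses[model] = resp.status
    return statuses
```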

// architecture

Under the hood

A single Rust binary hosts an axum HTTP server, an EngineManager that supervises engine subprocesses (one per loaded model), and a ModelIndex over the on-disk model store. Clients of every flavour speak HTTP to 127.0.0.1:11430; the daemon proxies their requests to the appropriate subprocess. The diagram-as-code source lives in the LMForge repo at `docs/architecture/ARCHITECTURE.md`.

LMForge system architecture — clients call the daemon over HTTP; middleware chain enforces auth, body limits, concurrency, and metrics; orchestration core spawns and supervises engine subprocesses (oMLX, SGLang, llama.cpp); state persists under ~/.lmforge.
HTTP Surface
axum Router (axum · tower): Single HTTP entrypoint on 127.0.0.1:11430. Three API namespaces (/v1/*, /api/*, /lf/*) plus /health, /metrics, /ui.
OpenAI /v1/* (OpenAI-spec): Drop-in for the OpenAI SDK: chat completions, completions, embeddings, rerank, models.
Ollama /api/* (Ollama-spec): For tools that target Ollama: /api/chat, /api/generate, /api/tags. Streaming emits proper NDJSON.
Native /lf/* (LMForge): Status, hardware, sysinfo, model lifecycle, logs, config, shutdown. SSE streams for status and logs.
/health · /metrics (Auth-bypassed): Liveness probe + Prometheus text exposition. Always reachable, no token required.
Middleware (tower)
CORS (tower-http): Permissive CORS at the outermost layer so OPTIONS preflights bypass the rest.
Body limit (axum DefaultBodyLimit): 32 MB default; configurable via LMFORGE_MAX_BODY_MB. Sized for VLM payloads with inline base64.
Auth, Bearer + CIDR (ipnetwork crate): CIDR allowlist (`trusted_networks`) bypasses Bearer auth for loopback + RFC1918. External callers need `Authorization: Bearer <api_key>`.
Concurrency semaphore (tower-http): Caps inflight requests (default 4). Excess waits, then 503 `concurrency_limit` with Retry-After: 1.
Metrics (metrics-exporter-prometheus): Per-endpoint latency histograms, status counters, model load durations, image preflight mix.
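
The auth rule in the middleware list above (trusted networks bypass Bearer auth, everyone else needs the key) can be sketched with Python's `ipaddress` module. The network list below spells out loopback plus the RFC 1918 ranges as an assumption about what `trusted_networks` contains by default, and `is_authorized` is a hypothetical name.

```python
import hmac
import ipaddress
from typing import Optional

# Assumed default allowlist: loopback plus the RFC 1918 private ranges.
TRUSTED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("127.0.0.0/8", "::1/128", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_authorized(client_ip: str, auth_header: Optional[str], api_key: str) -> bool:
    """Trusted networks bypass Bearer auth; everyone else must present the key."""
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in TRUSTED_NETWORKS):
        return True
    expected = f"Bearer {api_key}".encode("utf-8")
    supplied = (auth_header or "").encode("utf-8")
    # Constant-time comparison avoids leaking key prefixes through timing.
    return hmac.compare_digest(supplied, expected)
```
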
Orchestration Core
Engine Manager (tokio · mpsc · broadcast): Owns the slot table. Receives ManagerCommand requests, spawns/supervises engine subprocesses, broadcasts state to SSE + Tauri.
Model Catalog & Index (Rust): Curated `family:size:quant` shortcuts → HF repos per engine. HF downloader with sha256 verify. models.json + capability cache.
Hardware Probe (sysinfo · NVML · system_profiler): OS / arch / GPU vendor / VRAM detection. Decides the engine adapter at boot.
API Helpers (Rust): Image preflight (URL → data:), thinking-budget two-call orchestrator, keepalive tracker, OpenAI-spec response normaliser.
Engines (subprocesses)
oMLX (MLX · Metal): Apple Silicon engine. Native LRU; ships an OpenAI-compatible server LMForge proxies to.
SGLang (CUDA): Linux NVIDIA engine for high-concurrency workloads (24+ GB VRAM recommended). Pinned via engines.toml.
llama.cpp (C++ · CUDA · CPU): Universal fallback. CPU + CUDA on Windows and ARM Linux. Auto -ngl tuning; mmproj loading for VLMs.
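
How the auto `-ngl` tuning works is not spelled out here. As one plausible illustration only, a heuristic could assume equal-sized layers and keep a fixed reserve for the KV cache and scratch buffers. Everything in this sketch (the function name, the 1 GiB reserve, the equal-layer assumption) is hypothetical and not LMForge's actual logic; `-ngl` itself is llama.cpp's flag for the number of layers offloaded to the GPU.

```python
def estimate_gpu_layers(
    total_layers: int,
    model_bytes: int,
    free_vram_bytes: int,
    reserve_bytes: int = 1 << 30,
) -> int:
    """Estimate a value for llama.cpp's -ngl flag.

    Hypothetical heuristic: treat layers as equal-sized and hold back a
    reserve for the KV cache and scratch buffers.
    """
    if total_layers <= 0 or model_bytes <= 0:
        return 0
    usable = max(0, free_vram_bytes - reserve_bytes)
    per_layer = model_bytes / total_layers
    # Offload as many whole layers as fit, capped at the model's layer count.
    return min(total_layers, int(usable // per_layer))
```
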

// platform support

Where it runs

| Platform | Architecture | Engine | Core / Desktop UI |
| --- | --- | --- | --- |
| macOS 13+ | Apple Silicon (arm64) | oMLX (Metal/MLX) | ✓ DMG |
| Ubuntu 22.04+ | x86_64 | SGLang (NVIDIA, 24GB+) / llama.cpp | ✓ AppImage |
| Ubuntu 22.04+ | arm64 | llama.cpp | 🔜 Planned |
| Windows 10/11 | x86_64 | llama.cpp (CPU + NVIDIA CUDA) | ✓ NSIS |
| Windows + WSL2 | x86_64 | SGLang (CUDA via WSL) | ✓ (inside WSL), via Linux build |

// model catalog

Curated shortcuts

Same shortcut, hardware-specific repo. MLX on Apple Silicon; GGUF elsewhere.

| Shortcut | macOS (MLX) | Linux / Windows (GGUF) |
| --- | --- | --- |
| qwen3:8b:4bit | mlx-community/Qwen3-8B-4bit | bartowski/Qwen3-8B-GGUF |
| qwen3.5:4b:4bit | mlx-community/Qwen3.5-4B-4bit | Qwen/Qwen3.5-4B-GGUF |
| qwen3.5:9b:4bit | mlx-community/Qwen3.5-9B-OptiQ-4bit | bartowski/Qwen3.5-9B-OptiQ-GGUF |
| gemma3:4b:4bit | mlx-community/gemma-3-4b-it-4bit | bartowski/gemma-3-4b-it-GGUF |
| gemma3:12b:4bit | mlx-community/gemma-3-12b-it-4bit | bartowski/gemma-3-12b-it-GGUF |
| llama3.1:8b:4bit | mlx-community/Meta-Llama-3.1-8B-Instruct-4bit | bartowski/Meta-Llama-3.1-8B-Instruct-GGUF |
| llama4:17b:4bit:scout | mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit | bartowski/Llama-4-Scout-17B-16E-Instruct-GGUF |
| phi4:4b:4bit | mlx-community/Phi-4-mini-instruct-4bit | bartowski/Phi-4-mini-instruct-GGUF |
| deepseek_r1:8b:4bit:distill-qwen | mlx-community/DeepSeek-R1-Distill-Qwen-8B-4bit | unsloth/DeepSeek-R1-Distill-Qwen-8B-GGUF |
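
Shortcut resolution amounts to a two-level lookup: shortcut first, then weight format by platform. The sketch below copies two rows from the catalog above; the `CATALOG` structure and `resolve_repo` name are assumptions for illustration, not the daemon's actual data model.

```python
# Two rows copied from the catalog above; the dict layout is hypothetical.
CATALOG = {
    "qwen3:8b:4bit": {
        "mlx": "mlx-community/Qwen3-8B-4bit",
        "gguf": "bartowski/Qwen3-8B-GGUF",
    },
    "gemma3:4b:4bit": {
        "mlx": "mlx-community/gemma-3-4b-it-4bit",
        "gguf": "bartowski/gemma-3-4b-it-GGUF",
    },
}


def resolve_repo(shortcut: str, on_apple_silicon: bool) -> str:
    """Resolve a shortcut to a repo: MLX on Apple Silicon, GGUF elsewhere."""
    try:
        entry = CATALOG[shortcut]
    except KeyError:
        raise KeyError(f"unknown model shortcut: {shortcut}") from None
    return entry["mlx" if on_apple_silicon else "gguf"]
```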

// api surface

Endpoints reference

GET /v1/models List available models with capabilities
GET /v1/models/{id} Single-model lookup with full metadata
POST /v1/chat/completions Chat completion — streaming + non-streaming
POST /v1/completions Text completion
POST /v1/embeddings Generate embeddings — batched, auto-chunked
POST /v1/rerank Rerank documents (llama.cpp only — 501 elsewhere)

// user journeys

How it gets used

Interactive chat from the CLI

Zero-API integration — just install and chat.

  1. Run install-core.sh (or download the Windows binary)
  2. lmforge pull qwen3:8b:4bit
  3. lmforge run qwen3:8b:4bit
  4. Chat in the REPL; the model unloads after the keep-alive TTL

Integrate from any app over HTTP

Reuse your existing OpenAI SDK code.

  1. Daemon already running as a system service
  2. export OPENAI_API_BASE=http://127.0.0.1:11430/v1
  3. export OPENAI_API_KEY=none (loopback bypasses auth)
  4. All your OpenAI SDK code just works: chat, embeddings, vision
  5. Cold-load latency paid on first call, then warm

Power a downstream service like DocIntel

Run multiple models concurrently for a RAG pipeline.

  1. Service startup: POST /lf/model/switch for chat, embed, rerank, VLM models
  2. Subscribe to GET /lf/status/stream for load progress UI
  3. Consume /v1/chat/completions + /v1/embeddings + /v1/rerank
  4. Scrape /metrics for per-model latency and load history
  5. Models stay resident under VRAM budget; LRU eviction handles overflow

// tech stack

Languages: Rust · TypeScript · Svelte 5
HTTP / async: axum · tower · tokio · reqwest · tracing
Engines: oMLX (Metal/MLX) · SGLang (CUDA) · llama.cpp (CPU+CUDA)
Desktop UI: Tauri 2 · SvelteKit · Vite
Infra: Docker (multi-stage) · launchd · systemd --user · Windows Task Scheduler
Observability: Prometheus · SSE · tracing-subscriber