Gemma 4 26B-A4B Runs Locally at 51 tok/s via LM Studio 0.4.0 Headless CLI
Summary
- LM Studio 0.4.0 adds a headless lms CLI for local inference without a GUI
- Google Gemma 4 26B-A4B MoE runs at 51 tok/s on a MacBook Pro M4 Pro (48GB)
- Scores 82.6% on MMLU Pro and 88.3% on AIME 2026, competitive with 100–600B+ parameter models
- Integrates with Claude Code locally, though the source notes significant slowdowns within Claude Code
Details
LM Studio 0.4.0 introduces llmster standalone server and lms CLI
LM Studio previously required its GUI to run; version 0.4.0 extracts llmster into a standalone inference server. The lms CLI enables headless operation in SSH sessions, CI/CD pipelines, and scripted workflows, removing the last major GUI friction for developer integration.
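A headless session might look like the sketch below. The subcommands shown (`server start`, `get`, `load`, `server status`) follow the existing lms CLI; the model identifier is an assumption for illustration, not a confirmed catalog name.

```shell
# Start the inference server headlessly (e.g. inside an SSH session or CI job).
lms server start

# Download and load the model. The identifier below is a placeholder;
# check `lms ls` or the LM Studio catalog for the actual name.
lms get google/gemma-4-26b-a4b
lms load google/gemma-4-26b-a4b

# Confirm the server is up before pointing clients at it.
lms server status
```

Once running, the server exposes the usual local endpoint that any OpenAI-compatible client can target.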
Gemma 4 26B-A4B MoE activates 8 of 128 experts (3.8B params) per token
The model has 128 experts plus 1 shared expert but activates only 8 per token (3.8B parameters). Using the MoE estimation rule sqrt(total × active), this yields roughly 10B dense-equivalent quality while keeping memory requirements near that of a 4B dense model.
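The estimation rule from the paragraph above works out as follows; this is a rough community heuristic for MoE quality, not a guarantee:

```python
import math

total_params_b = 26.0   # total parameters, in billions
active_params_b = 3.8   # active per token (8 of 128 experts + 1 shared)

# Rough MoE "dense-equivalent" heuristic: sqrt(total * active)
dense_equiv_b = math.sqrt(total_params_b * active_params_b)
print(f"~{dense_equiv_b:.1f}B dense-equivalent")  # ≈ 9.9B
```

So the model behaves roughly like a 10B dense model in quality while its per-token compute is that of a 3.8B model.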
82.6% MMLU Pro, 88.3% AIME 2026, Elo ~1441 running at 51 tok/s locally
These scores are competitive with Qwen 3.5 397B-A17B and GLM-5 (~1457 Elo), models that require 100–600B+ total parameters. The 31B dense Gemma 4 variant scores slightly higher (85.2% MMLU Pro, 89.2% AIME 2026) but requires more memory and runs more slowly.
Supports 256K context, vision, native function calling, and configurable thinking modes
The large context window, vision input, and native tool calling make the 26B-A4B viable for agentic workflows. Configurable thinking modes allow trading latency for reasoning depth depending on task requirements.
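For agentic use, a request to the local server would carry tool definitions in the standard OpenAI-compatible chat format. A minimal payload sketch follows; the model name, the endpoint, and the `get_weather` tool are assumptions for illustration, and any thinking-mode parameter name would need to be checked against the actual API.

```python
import json

# Hypothetical request payload for a local OpenAI-compatible endpoint
# (LM Studio's default is http://localhost:1234/v1/chat/completions).
payload = {
    "model": "gemma-4-26b-a4b",  # placeholder identifier
    "messages": [
        {"role": "user", "content": "What's the weather in Lisbon?"},
    ],
    # Native function calling: declare tools the model may invoke.
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for illustration
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

print(json.dumps(payload, indent=2))
```

A tool-capable model responds with a structured `tool_calls` entry rather than prose when it decides to invoke the function.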
Claude Code integration works locally but with significant slowdowns
While the model can serve as a local backend for Claude Code — enabling zero-cost, private agentic coding — the author reports significant slowdowns compared to cloud API inference when using it within Claude Code, a practical caveat for adopters.
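Wiring Claude Code to a local backend typically means redirecting its API base URL. The sketch below assumes the local server exposes an Anthropic-compatible endpoint, which is not confirmed by the source; the port and key are placeholders.

```shell
# Hypothetical configuration: point Claude Code at the local server.
# Claude Code honors ANTHROPIC_BASE_URL; whether LM Studio serves an
# Anthropic-compatible API (vs. only OpenAI-compatible) is an assumption.
export ANTHROPIC_BASE_URL="http://localhost:1234"
export ANTHROPIC_API_KEY="local-placeholder"   # local servers typically ignore the key

claude
```

This keeps all traffic on-machine, which is the source of both the privacy benefit and the reported slowdown relative to cloud inference.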
What This Means
The combination of Gemma 4's MoE efficiency and LM Studio's new headless CLI makes frontier-competitive, locally run inference genuinely practical for developers on high-end consumer hardware. A model scoring near 90% on AIME 2026, running fully offline at interactive speeds, represents a meaningful shift for cost-sensitive, privacy-sensitive, or latency-sensitive workloads. Claude Code integration is possible at zero API cost with full data locality, though developers should expect meaningful slowdowns versus cloud inference when routing through Claude Code.
