Gemma 4 26B-A4B Runs Locally at 51 tok/s via LM Studio 0.4.0 Headless CLI
Summary
- LM Studio 0.4.0 adds a headless lms CLI for local inference without a GUI
- Google Gemma 4 26B-A4B MoE runs at 51 tok/s on a MacBook Pro M4 Pro (48GB)
- Scores 82.6% on MMLU Pro and 88.3% on AIME 2026, competitive with 100–600B+ parameter models
- Integrates with Claude Code locally, though the source notes significant slowdowns within Claude Code
Details
LM Studio 0.4.0 introduces llmster standalone server and lms CLI
LM Studio previously required its GUI to run; version 0.4.0 extracts llmster into a standalone inference server. The lms CLI enables headless operation in SSH sessions, CI/CD pipelines, and scripted workflows, removing the last major GUI friction for developer integration.
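A headless session might look like the sketch below. The subcommands shown (`server start`, `get`, `load`, `server status`) follow the existing lms CLI; the model identifier is an assumption for illustration, not a confirmed catalog name.

```shell
# Start the inference server headlessly (e.g. inside an SSH session or CI job).
lms server start

# Download and load the model. The identifier below is a placeholder;
# check `lms ls` or the LM Studio catalog for the actual name.
lms get google/gemma-4-26b-a4b
lms load google/gemma-4-26b-a4b

# Confirm the server is up before pointing clients at it.
lms server status
```

Once running, the server exposes the usual local endpoint that any OpenAI-compatible client can target.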
Gemma 4 26B-A4B MoE activates 8 of 128 experts (3.8B params) per token
The model has 128 experts plus 1 shared expert but activates only 8 per token (3.8B parameters). Using the MoE estimation rule sqrt(total × active), this yields roughly 10B dense-equivalent quality while keeping memory requirements near that of a 4B dense model.
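The estimation rule from the paragraph above works out as follows; this is a rough community heuristic for MoE quality, not a guarantee:

```python
import math

total_params_b = 26.0   # total parameters, in billions
active_params_b = 3.8   # active per token (8 of 128 experts + 1 shared)

# Rough MoE "dense-equivalent" heuristic: sqrt(total * active)
dense_equiv_b = math.sqrt(total_params_b * active_params_b)
print(f"~{dense_equiv_b:.1f}B dense-equivalent")  # ≈ 9.9B
```

So the model behaves roughly like a 10B dense model in quality while its per-token compute is that of a 3.8B model.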
82.6% MMLU Pro, 88.3% AIME 2026, Elo ~1441 running at 51 tok/s locally
These scores are competitive with Qwen 3.5 397B-A17B and GLM-5 (~1457 Elo), models that require 100–600B+ total parameters. The 31B dense Gemma 4 variant scores slightly higher (85.2% MMLU Pro, 89.2% AIME 2026) but requires more memory and runs more slowly.
Supports 256K context, vision, native function calling, and configurable thinking modes
The large context window, vision input, and native tool calling make the 26B-A4B viable for agentic workflows. Configurable thinking modes allow trading latency for reasoning depth depending on task requirements.
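For agentic use, a request to the local server would carry tool definitions in the standard OpenAI-compatible chat format. A minimal payload sketch follows; the model name, the endpoint, and the `get_weather` tool are assumptions for illustration, and any thinking-mode parameter name would need to be checked against the actual API.

```python
import json

# Hypothetical request payload for a local OpenAI-compatible endpoint
# (LM Studio's default is http://localhost:1234/v1/chat/completions).
payload = {
    "model": "gemma-4-26b-a4b",  # placeholder identifier
    "messages": [
        {"role": "user", "content": "What's the weather in Lisbon?"},
    ],
    # Native function calling: declare tools the model may invoke.
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for illustration
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

print(json.dumps(payload, indent=2))
```

A tool-capable model responds with a structured `tool_calls` entry rather than prose when it decides to invoke the function.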
Claude Code integration works locally but with significant slowdowns
While the model can serve as a local backend for Claude Code — enabling zero-cost, private agentic coding — the author reports significant slowdowns compared to cloud API inference when using it within Claude Code, a practical caveat for adopters.
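Wiring Claude Code to a local backend typically means redirecting its API base URL. The sketch below assumes the local server exposes an Anthropic-compatible endpoint, which is not confirmed by the source; the port and key are placeholders.

```shell
# Hypothetical configuration: point Claude Code at the local server.
# Claude Code honors ANTHROPIC_BASE_URL; whether LM Studio serves an
# Anthropic-compatible API (vs. only OpenAI-compatible) is an assumption.
export ANTHROPIC_BASE_URL="http://localhost:1234"
export ANTHROPIC_API_KEY="local-placeholder"   # local servers typically ignore the key

claude
```

This keeps all traffic on-machine, which is the source of both the privacy benefit and the reported slowdown relative to cloud inference.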
What This Means
The combination of Gemma 4's MoE efficiency and LM Studio's new headless CLI makes frontier-competitive, locally run inference genuinely practical for developers on high-end consumer hardware. A model scoring near 90% on AIME 2026, running fully offline at interactive speeds, represents a meaningful shift for cost-sensitive, privacy-sensitive, or latency-sensitive workloads. Claude Code integration is possible at zero API cost with full data locality, though developers should expect meaningful slowdowns versus cloud inference when routing through Claude Code.
