Mistral Small 4: Unified 119B MoE Model Released Under Apache 2.0
Summary
- Mistral Small 4 unifies reasoning, multimodal, and agentic coding in one model
- 119B total parameters with MoE architecture, only 6B active per token
- 40% latency reduction and 3x throughput improvement over Mistral Small 3
- Fully open source under Apache 2.0 with 256k context window
Details
Mistral Small 4 released as unified flagship open-source model
The model merges capabilities from three previously separate Mistral models: Magistral (reasoning), Pixtral (multimodal), and Devstral (agentic coding). Released under Apache 2.0, it is free to use, fine-tune, and deploy commercially.
MoE architecture: 128 experts, 4 active per token, 119B total / 6B active parameters
The Mixture-of-Experts design activates only 4 of the 128 experts per token, keeping the active parameter count at 6B (8B including embedding and output layers). This lets the model approach the capability of dense models far larger than its active-parameter count while remaining inference-efficient.
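The routing step behind these numbers can be sketched in a few lines. This is a toy illustration of top-k expert routing under the article's stated configuration (128 experts, 4 active per token), not Mistral's actual router; the gating scheme (softmax over the selected logits) is a common MoE convention and an assumption here.

```python
import numpy as np

def route_tokens(hidden, gate_weights, k=4):
    """Toy top-k MoE router: select k of n_experts per token.

    hidden: (tokens, d_model); gate_weights: (d_model, n_experts).
    Returns expert indices and mixing weights normalized over the
    selected experts only, as many MoE routers do.
    """
    logits = hidden @ gate_weights                        # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]            # k highest-scoring experts
    topk_logits = np.take_along_axis(logits, topk, axis=-1)
    weights = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the k picks
    return topk, weights

rng = np.random.default_rng(0)
# 3 tokens, toy hidden size 16, 128 experts as in the article
idx, w = route_tokens(rng.normal(size=(3, 16)), rng.normal(size=(16, 128)), k=4)
# Each token is routed to 4 of 128 experts; mixing weights sum to 1 per token.
```

Because only the selected experts' feed-forward blocks run per token, compute scales with the 6B active parameters, even though all 119B must stay resident in memory.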
Configurable reasoning effort via reasoning_effort parameter
Users can set reasoning_effort to 'none' for fast, direct responses or 'high' for deep chain-of-thought reasoning. This gives developers flexible control over latency versus answer quality on a per-request basis.
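A per-request call might look like the sketch below. The field names and model identifier are assumptions based on the article's description of the `reasoning_effort` parameter, not a verified API specification; only the request body is built here, with no network call.

```python
import json

def build_request(prompt, effort="none"):
    """Build a hypothetical chat-completion request body with a
    reasoning_effort knob, per the article's description."""
    assert effort in ("none", "high")        # the two settings the article names
    return {
        "model": "mistral-small-4",          # placeholder model identifier
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,          # 'none' = fast/direct, 'high' = deep CoT
    }

fast = build_request("Summarize this changelog.", effort="none")
deep = build_request("Prove this invariant holds.", effort="high")
print(json.dumps(fast, indent=2))
```

In practice a latency-sensitive endpoint would default to `"none"` and escalate to `"high"` only for requests flagged as hard, trading response time for answer quality case by case.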
Native multimodality: text and image inputs supported out of the box
Unlike prior small Mistral models, Small 4 natively accepts image inputs alongside text, bringing vision capability to the same model used for reasoning and coding without requiring a separate deployment.
40% reduction in end-to-end completion time versus prior generation
Measured in a latency-optimized deployment setup. Throughput-optimized setups achieve 3x more requests per second compared to Mistral Small 3, making the upgrade significant for production API deployments.
Matches or surpasses GPT-OSS 120B on three benchmarks with shorter outputs
On AA LCR the model scores 0.72 using 1.6K characters, while Qwen requires 3.5-4x more output to reach comparable performance. On LiveCodeBench it outperforms GPT-OSS 120B while producing 20% less output.
Minimum hardware: 4x NVIDIA HGX H100, 2x HGX H200, or 1x DGX B200
These are the floor requirements for self-hosting. The model is compatible with vLLM, llama.cpp, SGLang, Transformers, and HuggingFace, covering the major open-source inference stacks.
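A back-of-envelope check shows why the floor is multi-GPU even with only 6B active parameters: every expert must be resident in memory. The sketch below assumes BF16 weights (2 bytes per parameter) and ignores KV cache, activations, and framework overhead, so real requirements are higher.

```python
# Rough VRAM estimate for self-hosting the 119B-parameter model.
TOTAL_PARAMS_B = 119      # billions of parameters; all MoE experts stay resident
BYTES_PER_PARAM = 2       # BF16 assumption

weights_gb = TOTAL_PARAMS_B * BYTES_PER_PARAM   # GB of weights alone
print(f"~{weights_gb} GB of weights")

# Compare against the listed minimum configurations (approx. per-GPU HBM):
configs = {"4x H100 (80 GB each)": 4 * 80, "2x H200 (141 GB each)": 2 * 141}
for name, hbm_gb in configs.items():
    print(name, "holds the weights:", hbm_gb >= weights_gb)
```

Both listed H100/H200 configurations clear the ~238 GB of BF16 weights with headroom left for the 256k-context KV cache, which is consistent with the article's stated hardware floor.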
Mistral joins NVIDIA Nemotron Coalition as a founding member
The partnership includes inference optimization collaboration on vLLM and SGLang. Joining as a founding member signals deeper integration with NVIDIA's enterprise AI ecosystem beyond a standard hardware dependency.
Single model replaces three separate specialized deployments
Previously, teams needing reasoning, vision, and coding support from Mistral would run distinct models. Consolidating into one model simplifies infrastructure, reduces operational overhead, and lowers the barrier for organizations wanting broad AI capability in a self-hosted setup.
What This Means
Mistral Small 4 is a meaningful step toward open-source models that compete with frontier proprietary systems across a wide range of tasks — reasoning, vision, and coding — without requiring separate deployments. The Apache 2.0 license means any organization can self-host, fine-tune, and commercialize the model freely, which matters most to enterprises with data privacy constraints or cost sensitivity. The efficiency gains from its MoE design make it practical on hardware teams may already own, and benchmark results suggest it punches above its active-parameter weight against larger dense models. This release increases competitive pressure on both proprietary API providers and other open-weight labs.
Sources
- Mistral Small 4 (Mistral)
- Introducing Mistral Small 4 (Mistral)
