Mistral Small 4: Unified 119B MoE Model Released Under Apache 2.0
Summary
- Mistral Small 4 unifies reasoning, multimodal, and agentic coding in one model
- 119B total parameters with MoE architecture, only 6B active per token
- 40% latency reduction and 3x throughput improvement over Mistral Small 3
- Fully open source under Apache 2.0 with 256k context window
Details
Mistral Small 4 released as unified flagship open-source model
The model merges capabilities from three previously separate Mistral models: Magistral (reasoning), Pixtral (multimodal), and Devstral (agentic coding). Released under Apache 2.0, it is free to use, fine-tune, and deploy commercially.
MoE architecture: 128 experts, 4 active per token, 119B total / 6B active parameters
The Mixture-of-Experts design activates only 4 of the 128 experts per token, keeping the active parameter count at 6B (8B including embedding and output layers). This lets the model approach the capability of dense models far larger than its active-parameter count while remaining inference-efficient.
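The routing step behind these numbers can be sketched in a few lines. This is a toy illustration of top-k expert routing under the article's stated configuration (128 experts, 4 active per token), not Mistral's actual router; the gating scheme (softmax over the selected logits) is a common MoE convention and an assumption here.

```python
import numpy as np

def route_tokens(hidden, gate_weights, k=4):
    """Toy top-k MoE router: select k of n_experts per token.

    hidden: (tokens, d_model); gate_weights: (d_model, n_experts).
    Returns expert indices and mixing weights normalized over the
    selected experts only, as many MoE routers do.
    """
    logits = hidden @ gate_weights                        # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]            # k highest-scoring experts
    topk_logits = np.take_along_axis(logits, topk, axis=-1)
    weights = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the k picks
    return topk, weights

rng = np.random.default_rng(0)
# 3 tokens, toy hidden size 16, 128 experts as in the article
idx, w = route_tokens(rng.normal(size=(3, 16)), rng.normal(size=(16, 128)), k=4)
# Each token is routed to 4 of 128 experts; mixing weights sum to 1 per token.
```

Because only the selected experts' feed-forward blocks run per token, compute scales with the 6B active parameters, even though all 119B must stay resident in memory.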
Configurable reasoning effort via reasoning_effort parameter
Users can set reasoning_effort to 'none' for fast, direct responses or 'high' for deep chain-of-thought reasoning. This gives developers flexible control over latency versus answer quality on a per-request basis.
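A per-request call might look like the sketch below. The field names and model identifier are assumptions based on the article's description of the `reasoning_effort` parameter, not a verified API specification; only the request body is built here, with no network call.

```python
import json

def build_request(prompt, effort="none"):
    """Build a hypothetical chat-completion request body with a
    reasoning_effort knob, per the article's description."""
    assert effort in ("none", "high")        # the two settings the article names
    return {
        "model": "mistral-small-4",          # placeholder model identifier
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,          # 'none' = fast/direct, 'high' = deep CoT
    }

fast = build_request("Summarize this changelog.", effort="none")
deep = build_request("Prove this invariant holds.", effort="high")
print(json.dumps(fast, indent=2))
```

In practice a latency-sensitive endpoint would default to `"none"` and escalate to `"high"` only for requests flagged as hard, trading response time for answer quality case by case.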
Native multimodality: text and image inputs supported out of the box
Unlike prior small Mistral models, Small 4 natively accepts image inputs alongside text, bringing vision capability to the same model used for reasoning and coding without requiring a separate deployment.
40% reduction in end-to-end completion time versus prior generation
Measured in a latency-optimized deployment setup. Throughput-optimized setups achieve 3x more requests per second compared to Mistral Small 3, making the upgrade significant for production API deployments.
Matches or surpasses GPT-OSS 120B on three benchmarks with shorter outputs
On AA LCR the model scores 0.72 using 1.6K characters, while Qwen requires 3.5-4x more output to reach comparable performance. On LiveCodeBench it outperforms GPT-OSS 120B while producing 20% less output.
Minimum hardware: 4x NVIDIA HGX H100, 2x HGX H200, or 1x DGX B200
These are the floor requirements for self-hosting. The model is compatible with vLLM, llama.cpp, SGLang, Transformers, and HuggingFace, covering the major open-source inference stacks.
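A back-of-envelope check shows why the floor is multi-GPU even with only 6B active parameters: every expert must be resident in memory. The sketch below assumes BF16 weights (2 bytes per parameter) and ignores KV cache, activations, and framework overhead, so real requirements are higher.

```python
# Rough VRAM estimate for self-hosting the 119B-parameter model.
TOTAL_PARAMS_B = 119      # billions of parameters; all MoE experts stay resident
BYTES_PER_PARAM = 2       # BF16 assumption

weights_gb = TOTAL_PARAMS_B * BYTES_PER_PARAM   # GB of weights alone
print(f"~{weights_gb} GB of weights")

# Compare against the listed minimum configurations (approx. per-GPU HBM):
configs = {"4x H100 (80 GB each)": 4 * 80, "2x H200 (141 GB each)": 2 * 141}
for name, hbm_gb in configs.items():
    print(name, "holds the weights:", hbm_gb >= weights_gb)
```

Both listed H100/H200 configurations clear the ~238 GB of BF16 weights with headroom left for the 256k-context KV cache, which is consistent with the article's stated hardware floor.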
Mistral joins NVIDIA Nemotron Coalition as a founding member
The partnership includes inference optimization collaboration on vLLM and SGLang. Joining as a founding member signals deeper integration with NVIDIA's enterprise AI ecosystem beyond a standard hardware dependency.
Single model replaces three separate specialized deployments
Previously, teams needing reasoning, vision, and coding support from Mistral would run distinct models. Consolidating into one model simplifies infrastructure, reduces operational overhead, and lowers the barrier for organizations wanting broad AI capability in a self-hosted setup.
What This Means
Mistral Small 4 is a meaningful step toward open-source models that compete with frontier proprietary systems across a wide range of tasks — reasoning, vision, and coding — without requiring separate deployments. The Apache 2.0 license means any organization can self-host, fine-tune, and commercialize the model freely, which matters most to enterprises with data privacy constraints or cost sensitivity. The efficiency gains from its MoE design make it practical on hardware teams may already own, and benchmark results suggest it punches above its active-parameter weight against larger dense models. This release increases competitive pressure on both proprietary API providers and other open-weight labs.
Sources
- Mistral Small 4 (Mistral)
- Introducing Mistral Small 4 (Mistral)
