Summary
- A curated gallery catalogs 11 notable LLMs spanning 2024 through late 2025
- Sparse MoE dominates recent releases, with dense models becoming the minority
- Qwen3 Next holds the gallery's highest sparsity ratio at roughly 27x (80B total, 3B active)
- Hybrid architectures mixing attention and state-space layers appear in two models
Details
Gallery covers 11 LLMs released between April 2024 and November 2025
The collection spans roughly 20 months of open-weight model development, ranging from the 8B Llama 3 to the 671B-total DeepSeek R1, and includes models from Meta, Google, Alibaba, OpenAI, Zhipu AI, MiniMax, Moonshot, and the Allen Institute.
Sparse MoE has become the dominant architecture among large open models
Eight of the eleven models use Sparse MoE, a sharp contrast with the cohort's three dense entries (Llama 3 8B, Gemma 3 27B, and OLMo 3 32B). MoE allows very large total parameter counts while keeping inference costs tied to a much smaller active subset per token.
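The core MoE mechanic described above (only a few experts run per token) can be sketched in a few lines of NumPy. The expert count, top-k value, and dimensions below are toy values for illustration, not taken from any model in the gallery:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 16                   # toy sizes for illustration only

W_router = rng.normal(size=(d, n_experts))       # router: scores each expert per token
experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]

def moe_forward(x):
    """Route one token to its top-k experts; only those experts' weights are used."""
    logits = x @ W_router                        # (n_experts,) routing scores
    top = np.argsort(logits)[-top_k:]            # indices of the k highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w = w / w.sum()                              # softmax over the selected experts only
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_forward(rng.normal(size=d))
active_frac = top_k / n_experts                  # 0.25: a quarter of expert params run per token
```

All experts' parameters must be stored, but per token only `top_k / n_experts` of them participate in the forward pass, which is why total and active parameter counts diverge so sharply in MoE models.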
Qwen3 Next (80B/3B active) has the gallery's highest sparsity ratio at roughly 27x
With only 3B parameters active out of 80B total, Qwen3 Next's ~27x sparsity ratio exceeds MiniMax M2's ~23x (230B/10B active). Qwen3 Next also uses a hybrid Gated DeltaNet and Gated Attention architecture with native 262k context, the longest in the gallery.
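The sparsity ratios quoted above follow directly from dividing total by active parameters, using the figures stated in the text:

```python
# Sparsity ratio = total parameters / active parameters per token.
models = {                        # (total, active) in billions, as reported above
    "Qwen3 Next": (80, 3),
    "MiniMax M2": (230, 10),
}
ratios = {name: total / active for name, (total, active) in models.items()}
# Qwen3 Next -> ~26.7x, MiniMax M2 -> 23.0x
```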
Two models adopt hybrid linear-recurrence and attention architectures
Qwen3 Next (3:1 Gated DeltaNet / Gated Attention) and Kimi Linear 48B-A3B (3:1 Kimi Delta Attention / MLA with NoPE) move beyond pure transformers. Both use a 3:1 ratio of linear layers to attention layers, prioritizing long-context efficiency over conventional full-attention designs.
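The 3:1 interleaving both models use can be expressed as a simple repeating layer schedule; the depth of 12 below is an arbitrary illustrative choice, and the layer names are generic stand-ins for Gated DeltaNet / Kimi Delta Attention and the respective attention variants:

```python
def hybrid_schedule(n_layers, linear_per_attention=3):
    """Interleave linear-recurrence layers with attention layers at a fixed ratio."""
    block = ["linear"] * linear_per_attention + ["attention"]
    return [block[i % len(block)] for i in range(n_layers)]

sched = hybrid_schedule(12)
# ['linear', 'linear', 'linear', 'attention'] repeated three times
```

Because the linear-recurrence layers have constant per-token cost, confining full attention to every fourth layer keeps long-context compute and KV-cache growth bounded by a quarter of the stack.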
GPT-OSS 120B is OpenAI's flagship open-weight release, using Sparse MoE
Released in August 2025, GPT-OSS 120B uses grouped-query attention with alternating sliding-window and global attention layers, the same pattern as its smaller sibling GPT-OSS 20B (20B total / 3.6B active). It is OpenAI's first open-weight language model release since GPT-2.
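Grouped-query attention, mentioned above, lets many query heads share a smaller set of key/value heads, shrinking the KV cache. A minimal NumPy sketch follows; the head counts and dimensions are toy values rather than GPT-OSS's real configuration, and causal / sliding-window masking is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
n_q, n_kv, t, d = 8, 2, 4, 16            # toy sizes, not GPT-OSS's actual config

def gqa(q, k, v):
    """Grouped-query attention: several query heads share one K/V head (no mask here)."""
    group = q.shape[0] // k.shape[0]      # query heads per K/V head
    out = []
    for h in range(q.shape[0]):
        kv = h // group                   # which shared K/V head this query head uses
        scores = q[h] @ k[kv].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)   # row-wise softmax
        out.append(w @ v[kv])
    return np.stack(out)

q = rng.normal(size=(n_q, t, d))
k = rng.normal(size=(n_kv, t, d))
v = rng.normal(size=(n_kv, t, d))
out = gqa(q, k, v)                        # 8 query heads served by only 2 K/V heads
```

With 8 query heads but only 2 K/V heads, the KV cache is a quarter the size of standard multi-head attention at the same query width.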
Dense models remain relevant at smaller scales as open research baselines
Llama 3 8B, Gemma 3 27B, and OLMo 3 32B are the only dense entries. Gemma 3 27B adds QK-Norm and a 5:1 sliding-window-to-global attention ratio for compute efficiency; OLMo 3 32B serves as a fully open research baseline. Dense architecture is now largely confined to sub-30B models in this cohort.
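The sliding-window layers in Gemma 3's 5:1 scheme restrict each query to a local band of recent positions, which is what saves compute relative to global attention. A minimal causal sliding-window mask, with toy sequence length and window size, could look like:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal mask where each query attends only to the `window` most recent keys."""
    i = np.arange(seq_len)[:, None]       # query positions
    j = np.arange(seq_len)[None, :]       # key positions
    return (j <= i) & (i - j < window)    # causal AND within the local window

mask = sliding_window_mask(8, 3)
# e.g. position 5 may attend to positions 3, 4, 5 but not 2 or 6
```

Per-layer attention cost then grows linearly with sequence length instead of quadratically, with the occasional global layer (1 in 6 for Gemma 3) preserving long-range information flow.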
What This Means
This gallery provides a structured snapshot of how open-weight LLM architecture has shifted over less than two years: Sparse MoE is now the default for large models, enabling massive parameter counts without proportional inference costs. The emergence of hybrid linear-recurrence architectures in models like Qwen3 Next and Kimi Linear points to a second wave of experimentation aimed at long-context efficiency beyond what standard transformers offer. For practitioners, the gallery is a useful reference for understanding the design tradeoffs (sparsity ratios, attention variants, context lengths) that distinguish today's leading open models. OpenAI's entry into open-weight releases with GPT-OSS 120B also marks a notable competitive shift in the open-model landscape.
Sources
- LLM Architecture Gallery (Sebastian Raschka)
