Summary
- A curated gallery catalogs 11 notable LLMs spanning 2024 through late 2025
- Sparse MoE dominates recent releases, with dense models becoming the minority
- Qwen3 Next holds the gallery's highest sparsity ratio at roughly 27x (80B total, 3B active)
- Hybrid architectures mixing attention and state-space layers appear in two models
Details
Gallery covers 11 LLMs released between April 2024 and November 2025
The collection spans roughly 20 months of open-weight model development, ranging from the 8B Llama 3 to the 671B-total DeepSeek R1, and includes models from Meta, Google, Alibaba, OpenAI, Zhipu AI, MiniMax, Moonshot, and the Allen Institute.
Sparse MoE has become the dominant architecture among large open models
Eight of the eleven models use Sparse MoE, a sharp contrast with the cohort's three dense entries (Llama 3 8B, Gemma 3 27B, and OLMo 3 32B). MoE allows very large total parameter counts while keeping inference costs tied to a much smaller active subset per token.
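The core MoE mechanic described above (only a few experts run per token) can be sketched in a few lines of NumPy. The expert count, top-k value, and dimensions below are toy values for illustration, not taken from any model in the gallery:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 16                   # toy sizes for illustration only

W_router = rng.normal(size=(d, n_experts))       # router: scores each expert per token
experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]

def moe_forward(x):
    """Route one token to its top-k experts; only those experts' weights are used."""
    logits = x @ W_router                        # (n_experts,) routing scores
    top = np.argsort(logits)[-top_k:]            # indices of the k highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w = w / w.sum()                              # softmax over the selected experts only
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_forward(rng.normal(size=d))
active_frac = top_k / n_experts                  # 0.25: a quarter of expert params run per token
```

All experts' parameters must be stored, but per token only `top_k / n_experts` of them participate in the forward pass, which is why total and active parameter counts diverge so sharply in MoE models.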
Qwen3 Next (80B/3B active) has the gallery's highest sparsity ratio at roughly 27x
With only 3B parameters active out of 80B total, Qwen3 Next's ~27x sparsity ratio exceeds MiniMax M2's ~23x (230B/10B active). Qwen3 Next also uses a hybrid Gated DeltaNet and Gated Attention architecture with native 262k context, the longest in the gallery.
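The sparsity ratios quoted above follow directly from dividing total by active parameters, using the figures stated in the text:

```python
# Sparsity ratio = total parameters / active parameters per token.
models = {                        # (total, active) in billions, as reported above
    "Qwen3 Next": (80, 3),
    "MiniMax M2": (230, 10),
}
ratios = {name: total / active for name, (total, active) in models.items()}
# Qwen3 Next -> ~26.7x, MiniMax M2 -> 23.0x
```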
Two models adopt hybrid linear-recurrence and attention architectures
Qwen3 Next (3:1 Gated DeltaNet / Gated Attention) and Kimi Linear 48B-A3B (3:1 Kimi Delta Attention / MLA with NoPE) move beyond pure transformers. Both use a 3:1 ratio of linear layers to attention layers, prioritizing long-context efficiency over conventional full-attention designs.
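The 3:1 interleaving both models use can be expressed as a simple repeating layer schedule; the depth of 12 below is an arbitrary illustrative choice, and the layer names are generic stand-ins for Gated DeltaNet / Kimi Delta Attention and the respective attention variants:

```python
def hybrid_schedule(n_layers, linear_per_attention=3):
    """Interleave linear-recurrence layers with attention layers at a fixed ratio."""
    block = ["linear"] * linear_per_attention + ["attention"]
    return [block[i % len(block)] for i in range(n_layers)]

sched = hybrid_schedule(12)
# ['linear', 'linear', 'linear', 'attention'] repeated three times
```

Because the linear-recurrence layers have constant per-token cost, confining full attention to every fourth layer keeps long-context compute and KV-cache growth bounded by a quarter of the stack.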
GPT-OSS 120B is OpenAI's flagship open-weight release, using Sparse MoE
Released in August 2025, GPT-OSS 120B uses grouped-query attention with alternating sliding-window and global attention layers, the same pattern as its smaller sibling GPT-OSS 20B (20B total / 3.6B active). It is OpenAI's first open-weight language model release since GPT-2.
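Grouped-query attention, mentioned above, lets many query heads share a smaller set of key/value heads, shrinking the KV cache. A minimal NumPy sketch follows; the head counts and dimensions are toy values rather than GPT-OSS's real configuration, and causal / sliding-window masking is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
n_q, n_kv, t, d = 8, 2, 4, 16            # toy sizes, not GPT-OSS's actual config

def gqa(q, k, v):
    """Grouped-query attention: several query heads share one K/V head (no mask here)."""
    group = q.shape[0] // k.shape[0]      # query heads per K/V head
    out = []
    for h in range(q.shape[0]):
        kv = h // group                   # which shared K/V head this query head uses
        scores = q[h] @ k[kv].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)   # row-wise softmax
        out.append(w @ v[kv])
    return np.stack(out)

q = rng.normal(size=(n_q, t, d))
k = rng.normal(size=(n_kv, t, d))
v = rng.normal(size=(n_kv, t, d))
out = gqa(q, k, v)                        # 8 query heads served by only 2 K/V heads
```

With 8 query heads but only 2 K/V heads, the KV cache is a quarter the size of standard multi-head attention at the same query width.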
Dense models remain relevant at smaller scales as open research baselines
Llama 3 8B, Gemma 3 27B, and OLMo 3 32B are the only dense entries. Gemma 3 27B adds QK-Norm and a 5:1 sliding-window-to-global attention ratio for compute efficiency; OLMo 3 32B serves as a fully open research baseline. Dense architecture is now largely confined to sub-30B models in this cohort.
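The sliding-window layers in Gemma 3's 5:1 scheme restrict each query to a local band of recent positions, which is what saves compute relative to global attention. A minimal causal sliding-window mask, with toy sequence length and window size, could look like:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal mask where each query attends only to the `window` most recent keys."""
    i = np.arange(seq_len)[:, None]       # query positions
    j = np.arange(seq_len)[None, :]       # key positions
    return (j <= i) & (i - j < window)    # causal AND within the local window

mask = sliding_window_mask(8, 3)
# e.g. position 5 may attend to positions 3, 4, 5 but not 2 or 6
```

Per-layer attention cost then grows linearly with sequence length instead of quadratically, with the occasional global layer (1 in 6 for Gemma 3) preserving long-range information flow.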
What This Means
This gallery provides a structured snapshot of how open-weight LLM architecture has shifted over less than two years: Sparse MoE is now the default for large models, enabling massive parameter counts without proportional inference costs. The emergence of hybrid linear-recurrence architectures in models like Qwen3 Next and Kimi Linear points to a second wave of experimentation aimed at long-context efficiency beyond what standard transformers offer. For practitioners, the gallery is a useful reference for understanding the design tradeoffs (sparsity ratios, attention variants, context lengths) that distinguish today's leading open models. OpenAI's entry into open-weight releases with GPT-OSS 120B also marks a notable competitive shift in the open-model landscape.
Sources
- LLM Architecture Gallery (Sebastian Raschka)
