New LLM Architectures Target Long-Context Efficiency: Gemma 4, DeepSeek V4, and More

Research · 1 source · May 18

research google deepseek gemma context-window memory llm

Summary

• Recent open-weight LLMs are converging on architectural tricks to cut KV-cache memory and attention costs
• Gemma 4 introduces cross-layer KV sharing, reusing key-value states across transformer layers
• DeepSeek V4, ZAYA1-8B, and Laguna XS.2 each bring distinct compressed or budgeted attention approaches
• These techniques directly enable longer context windows within the same hardware budget


Details


1. Research

Long-context efficiency is the central LLM architecture challenge of 2026

As reasoning models and agentic workflows keep more tokens in context, KV-cache size, memory bandwidth, and attention compute become primary bottlenecks. This is driving a wave of architectural innovation across open-weight model releases.
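The scale of the KV-cache bottleneck follows from simple arithmetic: every cached token stores one key and one value vector per KV head in every layer. A minimal sketch of that calculation follows; the function name and the model shape (32 layers, 8 KV heads, head width 128, 16-bit values) are hypothetical and not taken from any model named here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor 2 covers the key tensor and the value tensor.
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_value
    return batch_size * seq_len * num_layers * per_token_per_layer


# Hypothetical model shape: 32 layers, 8 KV heads of width 128, 16-bit values.
for context in (8_192, 131_072):
    gib = kv_cache_bytes(32, 8, 128, context) / 2**30
    print(f"{context:>7} tokens -> {gib:.1f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with context length, from 1 GiB at 8,192 tokens to 16 GiB at 131,072 tokens, which is why the techniques below target the layer, head, and sequence terms of this product.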

2. New Tech

Gemma 4 E2B/E4B adopt cross-layer KV sharing from NeurIPS 2024 research

Later transformer layers reuse key-value tensors computed by earlier layers instead of computing and caching their own, which shrinks the KV-cache footprint and allows longer effective context windows within the same memory budget. The technique was first described in Brandon et al., 'Reducing Transformer Key-Value Cache Size with Cross-Layer Attention' (NeurIPS 2024).
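The bookkeeping behind cross-layer sharing can be sketched in a few lines. The example below assumes the grouped pattern from the cross-layer attention paper, where adjacent layers form groups of `share_factor` and only the first layer in each group computes keys and values. The layer assignment Gemma 4 actually uses is not stated here, so the function names and the grouping are illustrative assumptions, and strings stand in for real tensors.

```python
def kv_owner_map(num_layers, share_factor):
    """Map each layer to the layer whose K/V tensors it reads."""
    return [(layer // share_factor) * share_factor for layer in range(num_layers)]


def run_layers(num_layers, share_factor):
    """Walk the layer stack, computing K/V only in owner layers."""
    owners = kv_owner_map(num_layers, share_factor)
    cache = {}   # owner layer index -> placeholder for its K/V tensors
    reused = 0
    for layer, owner in enumerate(owners):
        if layer == owner:
            # Owner layer: stands in for a real K/V projection and cache write.
            cache[layer] = f"kv_from_layer_{layer}"
        else:
            # Sharing layer: reads the earlier layer's K/V, stores nothing new.
            _ = cache[owner]
            reused += 1
    return len(cache), reused


print(kv_owner_map(6, 2))   # which layer each of the 6 layers reads K/V from
print(run_layers(6, 2))     # (layers that store K/V, layers that reuse it)
```

With a sharing factor of 2, half of the layers store no keys or values of their own, so the per-token cache is half the size of the unshared case.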

3. New Tech

DeepSeek V4 combines mHC with compressed attention as a dual efficiency strategy

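Manifold-Constrained Hyper-Connections (mHC) change how the residual stream connects layers rather than how attention heads are represented, while a separate compressed attention mechanism reduces the stored KV state and the attention compute. The pairing is a dual approach: one change targets information flow across layers, the other targets long-context memory and attention cost.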
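One generic way to shrink both the stored KV state and the attention score matrix is to compress along the sequence axis before attending. The sketch below assumes block mean-pooling with a hypothetical `block` size and plain softmax attention for a single query. It is not DeepSeek V4's mechanism, which is not described here; it only illustrates why attending over compressed entries costs less.

```python
import math


def compress_kv(vectors, block):
    """Mean-pool each run of `block` consecutive vectors into one vector."""
    pooled = []
    for start in range(0, len(vectors), block):
        group = vectors[start:start + block]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def attend(query, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)                                  # subtract max for stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


keys = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
values = keys                                           # reuse the toy data as values
small_keys = compress_kv(keys, block=2)
small_values = compress_kv(values, block=2)

print(len(keys), "->", len(small_keys), "cached entries")
print(attend([0.5, -0.5], small_keys, small_values))
```

With `block=2` the query scores two entries instead of four, so both the cache size and the number of score computations per query drop by the block size, at the price of coarser positional detail.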

4. Tech Info

Laguna XS.2 uses layer-wise attention budgeting; ZAYA1-8B uses compressed convolutional attention

Laguna XS.2 assigns different attention budgets per layer, reducing cost in layers where full attention is less critical. ZAYA1-8B applies a convolutional compression step within the attention mechanism to reduce long-sequence costs.
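The effect of per-layer budgets on cache size can be estimated with a short calculation. The sketch below assumes a budget means the maximum number of tokens a layer keeps in its KV cache, as in a sliding window, with `None` marking a full-attention layer. How Laguna XS.2 actually defines or assigns its budgets is not stated here, so the layer pattern and the numbers are hypothetical.

```python
def cached_tokens_per_layer(seq_len, budgets):
    """Tokens each layer keeps in its KV cache; None means full attention."""
    return [seq_len if budget is None else min(seq_len, budget) for budget in budgets]


# Hypothetical 8-layer stack: one full-attention layer followed by three
# layers capped at 1,024 tokens, with the pattern repeated twice.
budgets = [None, 1024, 1024, 1024] * 2
seq_len = 32_768

budgeted = sum(cached_tokens_per_layer(seq_len, budgets))
full = seq_len * len(budgets)
print(f"cached token slots: {budgeted} budgeted vs {full} with full attention")
print(f"fraction of the full cache: {budgeted / full:.2f}")
```

In this toy configuration the budgeted stack keeps 71,680 token slots instead of 262,144, and the saving grows with context length because the capped layers stop growing once the sequence exceeds their budget.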

5. Context

Google released Gemma 4 suite in early April 2026 as open-weight models

The Gemma 4 suite includes E2B and E4B (mobile/IoT), 26B MoE, and 31B dense variants. Gemma 4 is among the first widely adopted open-weight architectures to deploy cross-layer KV sharing in production.

Research = academic/technical findings, New Tech = novel architectural capability, Tech Info = design detail, Context = background framing

What This Means

The next generation of open-weight LLMs is being designed to handle longer contexts more cheaply, a direct response to AI agents and reasoning models that need to hold large amounts of information in context. Techniques like KV sharing and compressed attention are moving from research papers into production models, so developers building on Gemma 4 or DeepSeek V4 should expect a smaller KV-cache footprint, and therefore lower memory cost per token of context, when running long-context workloads.

Sources

Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention (33 minute read) · Magazine
