
Why is AI so memory-hungry?
Understanding KV Cache: The secret memory vault that makes LLMs fast.
Understanding Self-Attention
Transformer models use self-attention to understand relationships between words. Each word looks at all previous words to understand context.
Self-Attention Mechanism
How KV Cache Helps
Without a cache: for every new word, K and V are recomputed for ALL previous words, so the total work grows as O(N²).
With a cache: previously computed K and V values are stored and only the new word's K and V are computed, so the total work grows as O(N).
Since previous tokens don't change, their K and V values remain constant. We can cache them and reuse them for every new token!
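A minimal NumPy sketch of the difference (the projection matrices W_k and W_v and the toy embeddings are made up for illustration; a real transformer layer does the same projections inside attention):

```python
import numpy as np

d_model, d_head = 16, 16
rng = np.random.default_rng(0)
W_k = rng.normal(size=(d_model, d_head))  # key projection (illustrative)
W_v = rng.normal(size=(d_model, d_head))  # value projection (illustrative)

def decode_without_cache(token_embeddings):
    """Recompute K and V for every previous token at each step: O(N^2) projections total."""
    for step in range(1, len(token_embeddings) + 1):
        context = token_embeddings[:step]
        K = context @ W_k              # recomputed from scratch every step
        V = context @ W_v
    return K, V

def decode_with_cache(token_embeddings):
    """Project only the newest token and append it to the cache: O(N) projections total."""
    k_cache, v_cache = [], []
    for x in token_embeddings:         # x is the embedding of the newest token
        k_cache.append(x @ W_k)        # one new key
        v_cache.append(x @ W_v)        # one new value
    return np.stack(k_cache), np.stack(v_cache)

tokens = rng.normal(size=(8, d_model))   # 8 toy "token embeddings"
K1, V1 = decode_without_cache(tokens)
K2, V2 = decode_with_cache(tokens)
assert np.allclose(K1, K2) and np.allclose(V1, V2)  # same result, far less work
```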
Self-Attention Formula
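The standard scaled dot-product attention formula, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$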
By caching K and V, we avoid recomputing them for every new token.
How LLMs Write
Imagine you are writing a story. To write the next word, you need to remember everything that happened before.
Large Language Models (LLMs) work the same way. They predict one word (token) at a time.
Input: "The quick brown fox"
Prediction: "jumps"
The model looks at all previous words to guess the next one.
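A toy sketch of this loop (the bigram "model" below is a made-up stand-in so the example runs; a real LLM replaces predict_next with a neural network that attends to the full context):

```python
# Hypothetical toy "model": maps the last word to the next one.
BIGRAMS = {"The": "quick", "quick": "brown", "brown": "fox", "fox": "jumps"}

def predict_next(tokens):
    """A real model would attend to all previous tokens; this toy only looks at the last one."""
    return BIGRAMS.get(tokens[-1], "<end>")

def generate(prompt, max_new_tokens=4):
    tokens = prompt.split()                 # e.g. ["The", "quick", "brown", "fox"]
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)   # predict one token at a time
        if next_token == "<end>":
            break
        tokens.append(next_token)           # append and repeat
    return " ".join(tokens)

print(generate("The quick brown fox"))      # -> "The quick brown fox jumps"
```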
The Problem: No Cache
Without a cache, the model must re-compute keys and values for the entire sentence from the beginning just to generate one new word.
The Solution: KV Cache
With a KV cache, we reuse the computation already done for previous words and only compute keys and values for the new word. This saves energy and cost, and reduces the environmental footprint of inference.
The Speed Advantage
Without KV Cache, the model has to re-compute everything for every new word.
With KV Cache, it only computes the new word.
Why it matters
Without a KV cache, the cost of generating each new token grows linearly with the number of tokens that came before it, so the total cost of generation grows quadratically. For long contexts (e.g., 128k tokens), this makes generation prohibitively slow.
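A back-of-the-envelope count makes the gap concrete (the 128k figure is just the example context length from above):

```python
N = 128_000                       # context length in tokens

# Without a cache: at step t we re-project K/V for all t tokens seen so far.
recomputed = N * (N + 1) // 2     # ~8.2 billion token projections
# With a cache: each token's K/V is projected exactly once.
cached = N                        # 128 thousand token projections

print(f"without cache: {recomputed:,} K/V projections")
print(f"with cache:    {cached:,} K/V projections")
print(f"ratio: ~{recomputed // cached:,}x more work without caching")
```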
The Cost: Memory Usage
KV Cache makes generation fast, but it eats up GPU memory. Newer models use tricks like Grouped Query Attention (GQA) and Multi-head Latent Attention (MLA) to reduce this cost.
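A rough sizing sketch shows why (the 7B-class model dimensions below are illustrative assumptions, not measurements of any specific model):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2, batch=1):
    """2x because both K and V are stored for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Illustrative 7B-class model: 32 layers, 32 KV heads, head_dim 128, fp16 values.
gb = kv_cache_bytes(seq_len=128_000, n_layers=32, n_kv_heads=32, head_dim=128) / 1e9
print(f"~{gb:.0f} GB of KV cache for a single 128k-token sequence")
```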
Optimization Impact
Comparing a GQA-optimized architecture against a standard (unoptimized) multi-head attention architecture
Why this matters: Grouped Query Attention (GQA) shares key-value pairs across multiple query heads, cutting KV cache memory with minimal impact on model quality.
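A small sketch of the saving, reusing the same illustrative 7B-class dimensions as above and assuming 32 query heads grouped over 8 KV heads:

```python
def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in GB: only the number of KV heads matters, not query heads."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value / 1e9

# Illustrative model: 32 layers, head_dim 128, 128k-token context, fp16 values.
mha = kv_cache_gb(128_000, 32, n_kv_heads=32, head_dim=128)  # standard: one KV head per query head
gqa = kv_cache_gb(128_000, 32, n_kv_heads=8,  head_dim=128)  # GQA: 8 KV heads shared by 32 query heads
print(f"standard MHA: ~{mha:.0f} GB, GQA: ~{gqa:.0f} GB ({mha / gqa:.0f}x smaller)")
```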
Enterprise-Scale LLM Inference
NVIDIA Dynamo optimizes LLM serving across distributed systems with intelligent KV cache management
Prefill-Decode Disaggregation
Prefill is compute-bound while decode is memory-bandwidth-bound. Separating the two phases lets each run on resources sized for its needs, maximizing GPU utilization and throughput.
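Conceptually, disaggregation looks like the sketch below. This is an illustrative toy, not the NVIDIA Dynamo API: the worker functions and the KVCache class are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class KVCache:
    """Placeholder for a request's per-layer key/value tensors."""
    request_id: str
    blocks: list

def prefill_worker(request_id, prompt_tokens):
    """Compute-bound phase: process the whole prompt once and emit its KV cache."""
    blocks = [f"kv_block_for_{t}" for t in prompt_tokens]   # stand-in for real tensors
    return KVCache(request_id, blocks)

def decode_worker(kv_cache, max_new_tokens=3):
    """Memory-bound phase: reuse the transferred KV cache and append one token at a time."""
    generated = []
    for i in range(max_new_tokens):
        kv_cache.blocks.append(f"kv_block_for_new_token_{i}")
        generated.append(f"token_{i}")
    return generated

cache = prefill_worker("req-1", ["The", "quick", "brown", "fox"])  # runs on the prefill pool
print(decode_worker(cache))                                        # runs on the decode pool after KV transfer
```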
KV Block Manager (KVBM)
KV cache overflows to slower tiers when GPU memory is full, enabling larger context windows.
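The spill-over idea can be sketched as a simple tier-filling policy (the tier names and block capacities below are made-up illustrations, not KVBM's actual behavior):

```python
# Tiers ordered fastest -> slowest; capacities are invented block counts for illustration.
TIERS = [("GPU", 4), ("CPU", 8), ("SSD", 16), ("network", 64)]

def place_block(storage, block_id):
    """Put a KV block in the fastest tier with free space, spilling downward when full."""
    for tier, capacity in TIERS:
        if len(storage[tier]) < capacity:
            storage[tier].append(block_id)
            return tier
    raise RuntimeError("all tiers full")

storage = {tier: [] for tier, _ in TIERS}
for block in range(20):
    place_block(storage, block)

print({tier: len(blocks) for tier, blocks in storage.items()})
# -> {'GPU': 4, 'CPU': 8, 'SSD': 8, 'network': 0}
```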
NIXL: Fast Data Transfer
Seamless KV cache sharing between prefill and decode engines across different nodes.
KV Cache-Aware Routing
Routes requests to workers with highest KV cache hit rate, reducing redundant computation.
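One simple way to approximate such a policy is to route by the longest shared prefix with each worker's cached tokens (a simplification for illustration; a production router would also weigh load and capacity):

```python
def prefix_overlap(cached_tokens, prompt_tokens):
    """Length of the shared prefix between a worker's cached tokens and the new prompt."""
    n = 0
    for a, b in zip(cached_tokens, prompt_tokens):
        if a != b:
            break
        n += 1
    return n

def route(workers, prompt_tokens):
    """Send the request to the worker that can reuse the most cached KV blocks."""
    return max(workers, key=lambda w: prefix_overlap(workers[w], prompt_tokens))

workers = {
    "worker-1": ["You", "are", "a", "helpful", "assistant"],
    "worker-2": ["You", "are", "a", "pirate"],
}
print(route(workers, ["You", "are", "a", "helpful", "assistant", "who", "rhymes"]))  # -> worker-1
```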
How It All Works Together
Separate prefill (compute) and decode (memory) phases for optimal resource use
KVBM stores KV cache across GPU, CPU, SSD, and network tiers
NIXL moves KV cache between nodes with ultra-low latency
Direct requests to workers with best cache hit rates
Higher throughput, lower latency, and efficient resource utilization for serving LLMs at scale
KV Cache for Agent Swarms
In Agentic AI, context is a continuous stream from sensors and cameras, not just a 2000-token prompt
Drone Swarm: Agent-to-Agent Handoff
Drone A tracks a vehicle → Low battery → Sends only basic info to Drone B → Drone B starts from scratch, wasting time re-acquiring the target. Drone A's "memory" is lost.
Zero Context Loss
Agents instantly inherit full context from previous agents, maintaining continuous awareness
Persistent Memory
KV cache becomes long-term memory for the entire agent swarm, stored on VAST
Instant Handoff
No time wasted re-acquiring targets or rebuilding context from scratch
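As a purely conceptual sketch of a handoff (the pickle-to-shared-directory mechanism and the field names are invented for illustration; this is not how VAST or any specific runtime persists KV cache):

```python
import os
import pickle
import tempfile

def save_context(agent_id, kv_cache, store_dir):
    """Drone A persists its working memory (KV cache) to shared storage before going offline."""
    path = os.path.join(store_dir, f"{agent_id}.kv")
    with open(path, "wb") as f:
        pickle.dump(kv_cache, f)
    return path

def load_context(path):
    """Drone B loads the cache and continues with full context instead of re-acquiring the target."""
    with open(path, "rb") as f:
        return pickle.load(f)

store = tempfile.mkdtemp()
kv_cache = {"target": "blue truck", "last_position": (47.61, -122.33), "kv_blocks": ["..."]}
handoff_path = save_context("drone-A", kv_cache, store)
print(load_context(handoff_path)["target"])   # Drone B resumes tracking the blue truck
```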
Collective Intelligence Through Shared Memory
By transforming KV cache into persistent, shareable "long-term memory," agent swarms achieve true collective intelligence. Each agent builds upon the knowledge of all previous agents.