Conversation Summarization: Smart Context Management for Long Chats
Keep conversations going indefinitely with NeuroLink's summarization patterns — LLM-powered condensation, sliding window plus summary hybrids, token budget strategies, and the four-stage context compaction pipeline.
Every LLM conversation eventually hits a wall. Context windows are finite, and the longer a chat runs, the more tokens you spend re-sending old messages that no longer contribute much to the next response. Summarization is the answer: condense what happened earlier so the model keeps the gist without burning through your budget.
This tutorial walks through four progressively powerful patterns for conversation summarization with NeuroLink, culminating in the production-grade four-stage context compaction pipeline that ships with the SDK.
When to Summarize
Not every conversation needs summarization. Short one-shot interactions finish well within any model’s context window. Summarization becomes essential when conversations exhibit these characteristics:
- Turn count exceeds 15-20 exchanges. At that point the token cost of re-sending full history starts to dominate.
- Tool outputs inflate context. A single file read or API response can inject thousands of tokens into the conversation.
- Multi-session continuity is required. If you persist context across sessions, summaries keep storage and re-hydration costs manageable.
- Response quality degrades. Models tend to lose focus when the prompt grows very large, even within the technical context limit.
The following decision diagram helps you choose the right strategy for your workload.
flowchart TD
A[New user message arrives] --> B{Conversation length?}
B -- "< 15 turns" --> C[Send full history]
B -- "15-50 turns" --> D{Token usage > 70%?}
B -- "> 50 turns" --> E[Mandatory summarization]
D -- No --> C
D -- Yes --> F{Tool outputs dominate?}
F -- Yes --> G[Stage 1: Prune tool outputs]
F -- No --> H{Duplicate file reads?}
H -- Yes --> I[Stage 2: Deduplicate files]
H -- No --> J[Stage 3: LLM summarization]
G --> K{Still over budget?}
I --> K
K -- Yes --> J
K -- No --> L[Send compacted context]
J --> L
E --> G
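The branching in this diagram can be sketched as a plain decision function. This is an illustrative sketch, not NeuroLink's API: the function name, option names, and exact thresholds are assumptions you would tune for your workload.

```typescript
type Strategy =
  | 'full-history'
  | 'prune-tools'
  | 'deduplicate-files'
  | 'llm-summarize';

// Illustrative decision logic mirroring the flowchart above.
// Thresholds (15 turns, 50 turns, 70% usage) follow the diagram.
function chooseStrategy(opts: {
  turns: number;
  usageRatio: number; // estimated input tokens / context limit
  toolOutputsDominate: boolean;
  duplicateFileReads: boolean;
}): Strategy {
  if (opts.turns < 15) return 'full-history';
  if (opts.turns <= 50 && opts.usageRatio <= 0.7) return 'full-history';
  // Over 50 turns, or 15-50 turns above 70% usage: start compacting,
  // cheapest applicable stage first.
  if (opts.toolOutputsDominate) return 'prune-tools';
  if (opts.duplicateFileReads) return 'deduplicate-files';
  return 'llm-summarize';
}
```

The ordering matters: the function only reaches the LLM-summarization branch when the two free stages would not help, which is the same cost-minimizing principle the compaction pipeline later in this tutorial applies.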
LLM-Powered Condensation
The most straightforward summarization approach is to ask an LLM to condense older messages into a concise paragraph while preserving key facts, decisions, and entities. NeuroLink makes this simple through its unified generation interface.
Basic Summarizer
The following ConversationSummarizer class tracks messages and triggers summarization once the conversation exceeds a configurable threshold. Older messages are replaced with a summary while recent ones remain intact.
import { NeuroLink } from '@juspay/neurolink';
type ConversationMessage = {
role: 'user' | 'assistant' | 'system';
content: string;
timestamp: Date;
important?: boolean;
};
class ConversationSummarizer {
protected neurolink: NeuroLink;
protected messages: ConversationMessage[] = [];
private summary: string = '';
private maxMessages: number;
private summaryModel: string;
constructor(options: { maxMessages?: number; summaryModel?: string } = {}) {
this.neurolink = new NeuroLink();
this.maxMessages = options.maxMessages || 10;
this.summaryModel = options.summaryModel || 'claude-3-5-haiku-20241022';
}
addMessage(role: 'user' | 'assistant', content: string, important = false) {
this.messages.push({ role, content, timestamp: new Date(), important });
if (this.messages.length >= this.maxMessages) {
this.summarizeAsync().catch(console.error);
}
}
private async summarizeAsync() {
const importantMessages = this.messages.filter((m) => m.important);
const regularMessages = this.messages.filter((m) => !m.important);
// Split regular messages: summarize first half, keep second half
const midpoint = Math.floor(regularMessages.length / 2);
const toSummarize = regularMessages.slice(0, midpoint);
const toKeep = regularMessages.slice(midpoint);
if (toSummarize.length === 0) return;
const conversationText = toSummarize
.map((m) => `${m.role}: ${m.content}`)
.join('\n\n');
const result = await this.neurolink.generate({
input: {
text: `Summarize this conversation concisely, preserving key facts,
decisions, and context:\n\n${conversationText}`,
},
provider: 'anthropic',
model: this.summaryModel,
maxTokens: 500,
});
// Merge the new summary with any existing summary
this.summary = this.summary
? `${this.summary}\n\nRecent updates: ${result.content}`
: result.content;
this.messages = [...importantMessages, ...toKeep];
}
async getContext(): Promise<string> {
const parts: string[] = [];
if (this.summary) {
parts.push(`[Previous conversation summary: ${this.summary}]`);
}
const recentMessages = this.messages
.slice(-5)
.map((m) => `${m.role}: ${m.content}`)
.join('\n\n');
if (recentMessages) parts.push(recentMessages);
return parts.join('\n\n');
}
async chat(userMessage: string, markImportant = false): Promise<string> {
this.addMessage('user', userMessage, markImportant);
const context = await this.getContext();
const result = await this.neurolink.generate({
input: { text: context },
});
this.addMessage('assistant', result.content);
return result.content;
}
getStats() {
return {
messages: this.messages.length,
hasSummary: !!this.summary,
summaryLength: this.summary.length,
importantMessages: this.messages.filter((m) => m.important).length,
};
}
}
Key design decisions in this implementation:
- Important-message preservation. Messages flagged with important: true are never summarized. Use this for decisions, confirmations, or any fact the user explicitly asked the model to remember.
- Half-and-half splitting. Rather than summarizing everything, only the older half is condensed. This avoids an awkward gap between the summary and the most recent exchanges.
- Cheap model for summaries. The summarization call uses a small, fast model (Haiku) because it only needs to condense prose, not reason deeply. This keeps costs negligible.
Hierarchical Summaries
When a conversation runs for hundreds of turns, a single summary string grows unwieldy. Progressive or hierarchical summarization solves this by maintaining multiple summary levels.
class ProgressiveSummarizer extends ConversationSummarizer {
private detailedSummary: string = '';
private briefSummary: string = '';
async summarize() {
// Level 1: Detailed summary at 300 tokens
this.detailedSummary = await this.summarizeText(
this.messages.map(m => `${m.role}: ${m.content}`).join('\n'), 300,
);
// Level 2: Brief summary at 100 tokens (summary of the summary)
this.briefSummary = await this.summarizeText(this.detailedSummary, 100);
}
private async summarizeText(
text: string,
maxTokens: number,
): Promise<string> {
const result = await this.neurolink.generate({
input: { text: `Summarize concisely: ${text}` },
maxTokens,
});
return result.content;
}
}
The brief summary is included when context is tight, and the detailed summary is used when there is room. This two-tier approach keeps the model grounded on the full arc of the conversation without consuming excessive tokens.
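A minimal sketch of that tier selection might look like the following. The helper name, the 4-characters-per-token estimate, and the fallback behavior are assumptions for illustration, not part of the SDK:

```typescript
// Pick the richest summary tier that fits the remaining token budget.
// Assumes a rough 4-chars-per-token estimate.
function pickSummaryTier(
  detailedSummary: string,
  briefSummary: string,
  availableTokens: number,
): string {
  const estimateTokens = (text: string) => Math.ceil(text.length / 4);
  if (estimateTokens(detailedSummary) <= availableTokens) return detailedSummary;
  if (estimateTokens(briefSummary) <= availableTokens) return briefSummary;
  return ''; // No room for any summary; rely on recent messages alone.
}
```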
Sliding Window Plus Summary Hybrid
A pure sliding window (keep the last N messages, drop the rest) is cheap and simple but throws away context permanently. A pure summarization approach preserves context but requires an LLM call every time the window fills. The hybrid combines both.
flowchart LR
subgraph Conversation History
direction TB
M1[Msg 1] --> M2[Msg 2]
M2 --> M3[Msg 3]
M3 --> M4[Msg 4]
M4 --> M5[Msg 5]
M5 --> M6[Msg 6]
M6 --> M7[Msg 7]
M7 --> M8[Msg 8]
end
subgraph Compacted Context
direction TB
S["Summary of Msgs 1-4"]
R1[Msg 5]
R2[Msg 6]
R3[Msg 7]
R4[Msg 8]
end
M1 -.->|Summarized| S
M2 -.->|Summarized| S
M3 -.->|Summarized| S
M4 -.->|Summarized| S
M5 -->|Kept| R1
M6 -->|Kept| R2
M7 -->|Kept| R3
M8 -->|Kept| R4
Implementation
The ContextWindowManager below implements this pattern with automatic token tracking and progressive strategy selection based on how full the context window is.
import { NeuroLink } from '@juspay/neurolink';
type Message = {
role: 'system' | 'user' | 'assistant';
content: string;
tokens?: number;
};
class ContextWindowManager {
private neurolink: NeuroLink;
private messages: Message[] = [];
private maxTokens: number;
private systemMessage?: Message;
constructor(maxTokens: number = 8000) {
this.neurolink = new NeuroLink();
this.maxTokens = maxTokens;
}
private estimateTokens(text: string): number {
return Math.ceil(text.length / 4);
}
private calculateTotalTokens(messages: Message[]): number {
return messages.reduce(
(sum, msg) => sum + (msg.tokens || this.estimateTokens(msg.content)),
0,
);
}
setSystemMessage(content: string) {
this.systemMessage = {
role: 'system',
content,
tokens: this.estimateTokens(content),
};
}
addMessage(role: 'user' | 'assistant', content: string) {
const message: Message = {
role,
content,
tokens: this.estimateTokens(content),
};
this.messages.push(message);
this.pruneIfNeeded();
}
private pruneIfNeeded() {
const allMessages = this.systemMessage
? [this.systemMessage, ...this.messages]
: this.messages;
const totalTokens = this.calculateTotalTokens(allMessages);
if (totalTokens <= this.maxTokens) return;
// Target 80% capacity to leave headroom
const targetTokens = Math.floor(this.maxTokens * 0.8);
let currentTokens = this.systemMessage?.tokens || 0;
const keptMessages: Message[] = [];
// Work backwards from most recent
for (let i = this.messages.length - 1; i >= 0; i--) {
const msg = this.messages[i];
const msgTokens = msg.tokens || this.estimateTokens(msg.content);
if (currentTokens + msgTokens <= targetTokens) {
keptMessages.unshift(msg);
currentTokens += msgTokens;
} else {
break;
}
}
this.messages = keptMessages;
}
async summarizeOldMessages() {
if (this.messages.length < 10) return;
const totalTokens = this.calculateTotalTokens(this.messages);
if (totalTokens <= this.maxTokens * 0.7) return;
const midpoint = Math.floor(this.messages.length / 2);
const toSummarize = this.messages.slice(0, midpoint);
const toKeep = this.messages.slice(midpoint);
const conversationText = toSummarize
.map((m) => `${m.role}: ${m.content}`)
.join('\n\n');
const summary = await this.neurolink.generate({
input: {
text: `Summarize this conversation concisely, preserving key
information:\n\n${conversationText}`,
},
provider: 'anthropic',
model: 'claude-3-5-haiku-20241022',
maxTokens: 500,
});
this.messages = [
{
role: 'assistant',
content: `[Previous conversation summary: ${summary.content}]`,
tokens: this.estimateTokens(summary.content),
},
...toKeep,
];
}
getStats() {
const totalTokens = this.calculateTotalTokens(this.messages);
return {
messages: this.messages.length,
tokens: totalTokens,
capacity: this.maxTokens,
usage: ((totalTokens / this.maxTokens) * 100).toFixed(1) + '%',
};
}
}
Progressive Strategy Selection
The hybrid manager selects its strategy based on how full the context window is:
| Context Usage | Strategy | LLM Cost |
|---|---|---|
| 0-70% | No action | None |
| 70-90% | Summarize older half | One cheap LLM call |
| 90-100% | Sliding window prune | None |
| Over 100% | Aggressive truncation | None |
This graduated approach keeps LLM summarization calls to a minimum while ensuring context never overflows.
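The table above reduces to a small selector. This is an illustrative sketch of the graduated policy, with the thresholds taken directly from the table:

```typescript
type WindowAction =
  | 'none'
  | 'summarize-older-half'
  | 'sliding-window'
  | 'aggressive-truncate';

// Map context usage ratio to the cheapest adequate action.
function selectWindowAction(usageRatio: number): WindowAction {
  if (usageRatio > 1.0) return 'aggressive-truncate'; // already overflowing
  if (usageRatio >= 0.9) return 'sliding-window';     // no time for an LLM call
  if (usageRatio >= 0.7) return 'summarize-older-half';
  return 'none';
}
```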
Token Budget Strategies
Before you can summarize intelligently, you need to know how much room you have. NeuroLink’s BudgetChecker runs before every generation call and returns a detailed breakdown of where your tokens are going.
How the Budget Checker Works
The budget checker estimates total input tokens from five categories: system prompt, tool definitions, conversation history, the current user prompt, and file attachments. It compares the total against the model’s available input space and sets a shouldCompact flag when usage exceeds the configured threshold (default: 80%).
import { checkContextBudget } from '@juspay/neurolink';
const budgetResult = checkContextBudget({
provider: 'anthropic',
model: 'claude-sonnet-4-20250514',
maxTokens: 4096,
systemPrompt: 'You are a helpful assistant.',
conversationMessages: sessionHistory,
currentPrompt: userMessage,
toolDefinitions: registeredTools,
fileAttachments: uploadedFiles,
compactionThreshold: 0.8,
});
console.log(budgetResult);
// {
// withinBudget: true,
// estimatedInputTokens: 42300,
// availableInputTokens: 196000,
// usageRatio: 0.216,
// shouldCompact: false,
// breakdown: {
// systemPrompt: 45,
// conversationHistory: 38200,
// currentPrompt: 120,
// toolDefinitions: 3800,
// fileAttachments: 135
// }
// }
The breakdown tells you exactly where tokens are going. In many production applications, conversationHistory and toolDefinitions are the largest consumers. This insight lets you target compaction where it will save the most.
Token Budgets by Use Case
Different workloads need different allocations:
| Use Case | Recommended Limit | Reasoning |
|---|---|---|
| Chatbot | 4K-8K tokens | Quick responses, recent context sufficient |
| Code assistant | 16K-32K tokens | File context and diffs require space |
| Document analysis | 32K-100K tokens | Full documents need to fit |
| Long-form writing | 8K-16K tokens | Story continuity over many turns |
| Customer support | 4K tokens | Short interactions, fast resolution |
Provider-Specific Context Limits
Each model family has different context windows. NeuroLink maintains these automatically in its provider registry, but here are the key values for reference:
const CONTEXT_LIMITS: Record<string, number> = {
'gpt-4o': 128_000,
'gpt-4o-mini': 128_000,
'gpt-4.1': 1_047_576,
'o3': 200_000,
'claude-opus-4-20250514': 200_000,
'claude-sonnet-4-20250514': 200_000,
'claude-3-5-sonnet-20241022': 200_000,
'gemini-2.5-flash': 1_048_576,
'gemini-2.5-pro': 1_048_576,
};
The budget checker already reserves room for the response before computing availableInputTokens (note the gap between the 200K window and the 196K available in the earlier example), so you do not need to subtract your maxTokens manually.
The Four-Stage Compaction Pipeline
NeuroLink ships a production-grade ContextCompactor that orchestrates four stages of context reduction, running each stage only if the previous one did not bring the context within budget. This design minimizes cost: the cheapest stages run first, and the expensive LLM summarization call is deferred until cheaper options are exhausted.
flowchart TD
A[Messages over budget] --> B["Stage 1: Tool Output Pruning<br/>(No LLM call)"]
B --> C{Within budget?}
C -- Yes --> Z[Return compacted messages]
C -- No --> D["Stage 2: File Read Deduplication<br/>(No LLM call)"]
D --> E{Within budget?}
E -- Yes --> Z
E -- No --> F["Stage 3: LLM Summarization<br/>(Requires LLM call)"]
F --> G{Within budget?}
G -- Yes --> Z
G -- No --> H["Stage 4: Sliding Window Truncation<br/>(No LLM call, last resort)"]
H --> Z
Stage 1: Tool Output Pruning
The cheapest stage. Tool call results (shell outputs, API responses, file contents) are often the single largest token consumers in a coding or automation session. This stage replaces old tool outputs with compact stand-ins while protecting recent outputs and outputs from critical tools.
// From contextCompactor.ts — Stage 1 configuration
const DEFAULT_CONFIG = {
enablePrune: true,
pruneProtectTokens: 40_000, // Protect the most recent 40K tokens
pruneMinimumSavings: 20_000, // Only prune if it saves at least 20K tokens
pruneProtectedTools: ['skill'], // Never prune outputs from these tools
};
In practice, a single cat or grep result can contain 10,000+ tokens. Replacing stale tool outputs with a one-line stub like [Tool output pruned: 12,847 tokens] can free up enormous amounts of context without any LLM cost.
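The core of this stage can be sketched as a pure function. Note this is a simplified illustration of the idea, not the SDK's internal implementation: the ToolMessage shape, field names, and stub format here are assumptions.

```typescript
type ToolMessage = {
  role: 'tool';
  toolName: string;
  content: string;
  tokens: number;
};

// Replace older tool outputs with compact stubs, protecting the most
// recent `protectTokens` worth of output and any protected tools.
function pruneToolOutputs(
  messages: ToolMessage[],
  protectTokens: number,
  protectedTools: string[],
): ToolMessage[] {
  let recentTokens = 0;
  const result = [...messages];
  // Walk backwards from the newest message, counting protected tokens.
  for (let i = result.length - 1; i >= 0; i--) {
    const msg = result[i];
    if (protectedTools.includes(msg.toolName)) continue; // never pruned
    if (recentTokens + msg.tokens <= protectTokens) {
      recentTokens += msg.tokens; // still inside the protected window
      continue;
    }
    // Outside the window: swap the full output for a one-line stub.
    result[i] = {
      ...msg,
      content: `[Tool output pruned: ${msg.tokens} tokens]`,
      tokens: 10, // rough size of the stub itself
    };
  }
  return result;
}
```

Because this runs with zero LLM calls, it is safe to apply eagerly whenever the budget checker flags high usage.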
Stage 2: File Read Deduplication
When an AI agent reads the same file multiple times during a session (common when iterating on code), each read injects the full file contents into the conversation. This stage identifies duplicate file reads and keeps only the most recent version, replacing earlier reads with a reference note.
// Stage 2 runs only if Stage 1 was insufficient
const dedupResult = deduplicateFileReads(currentMessages);
if (dedupResult.deduplicated) {
currentMessages = dedupResult.messages;
stagesUsed.push('deduplicate');
}
This stage requires no LLM calls and is especially effective in long coding sessions where the same configuration files, type definitions, or test fixtures are read repeatedly.
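The deduplication idea itself is simple enough to sketch in a few lines. The SDK's deduplicateFileReads operates on full message objects; this pared-down version (with an assumed FileReadMessage shape) just shows the keep-the-latest logic:

```typescript
type FileReadMessage = { filePath: string; content: string };

// Keep only the most recent read of each file; replace earlier reads
// with a short reference note.
function dedupeFileReads(reads: FileReadMessage[]): FileReadMessage[] {
  const lastIndex = new Map<string, number>();
  reads.forEach((r, i) => lastIndex.set(r.filePath, i));
  return reads.map((r, i) =>
    lastIndex.get(r.filePath) === i
      ? r
      : { ...r, content: `[Superseded: see latest read of ${r.filePath}]` },
  );
}
```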
Stage 3: LLM Summarization
When pruning and deduplication are not enough, the compactor invokes an LLM to summarize older conversation turns. The SDK uses a fast, inexpensive model (Gemini 2.5 Flash by default) with a 120-second timeout. If the call fails, the pipeline gracefully falls through to Stage 4.
// Stage 3: structured LLM summarization
const summarizeResult = await withTimeout(
summarizeMessages(currentMessages, {
provider: config.summarizationProvider, // default: 'vertex'
model: config.summarizationModel, // default: 'gemini-2.5-flash'
keepRecentRatio: config.keepRecentRatio, // default: 0.3
memoryConfig,
targetTokens,
}),
120_000, // 2-minute timeout
'LLM summarization timed out after 120s',
);
The keepRecentRatio parameter controls how much of the conversation is left untouched. With the default of 0.3, the most recent 30% of messages remain in full while the older 70% are condensed into a structured summary.
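The split itself works like this (an illustrative helper, assuming a simple count-based ratio rather than a token-based one):

```typescript
// Split messages so the most recent `keepRecentRatio` fraction stays
// verbatim and the older remainder goes to the summarizer.
function splitByRecentRatio<T>(messages: T[], keepRecentRatio: number) {
  const keepCount = Math.ceil(messages.length * keepRecentRatio);
  return {
    toSummarize: messages.slice(0, messages.length - keepCount),
    toKeep: messages.slice(messages.length - keepCount),
  };
}
```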
Stage 4: Sliding Window Truncation
The last-resort fallback. If all prior stages have run and the context still exceeds the budget, the sliding window truncator removes the oldest non-system messages until the context fits.
const truncResult = truncateWithSlidingWindow(currentMessages, {
fraction: config.truncationFraction, // default: 0.5
currentTokens: stageTokensBefore,
targetTokens: targetTokens,
provider: provider,
adaptiveBuffer: 0.15,
maxIterations: 3,
});
The adaptiveBuffer of 15% ensures the truncation does not land exactly on the boundary, leaving room for the next few exchanges before another compaction is needed.
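One plausible reading of that buffer (the exact formula inside the SDK is an assumption here) is that truncation aims below the nominal target:

```typescript
// Undershoot the target by the adaptive buffer so the next few turns
// fit before another compaction cycle is triggered.
function effectiveTarget(targetTokens: number, adaptiveBuffer = 0.15): number {
  return Math.floor(targetTokens * (1 - adaptiveBuffer));
}
```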
Emergency Truncation
In extreme cases where even the sliding window cannot bring context within budget (for example, when tool definitions and the system prompt alone consume most of the available space), NeuroLink falls back to emergency content truncation. This truncates the content of individual messages rather than removing entire messages.
import { emergencyContentTruncation } from '@juspay/neurolink';
// Strategy: sort messages by length descending, truncate the biggest first
const truncatedHistory = emergencyContentTruncation(
messages,
availableTokens,
budgetBreakdown,
provider,
);
The algorithm is straightforward:
- Calculate how many tokens need to be freed.
- Sort messages by content length, largest first.
- Truncate each large message proportionally until the total reduction target is met.
- Skip system messages and very short messages (under 200 characters).
- If proportional truncation is still insufficient, fall back to keeping only the newest messages that fit.
This is a safety net, not a primary strategy. Well-tuned compaction thresholds should prevent you from reaching this point in normal operation.
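The proportional step of the algorithm (steps 1-4 above) can be sketched as follows. This is an illustrative reconstruction, not the SDK's code: the message shape, the 4-chars-per-token estimate, the 200-character skip threshold, and the truncation marker are assumptions, and the final keep-newest fallback (step 5) is omitted for brevity.

```typescript
type Msg = { role: 'system' | 'user' | 'assistant'; content: string };

// Truncate the largest non-system messages proportionally until the
// target number of tokens has been freed.
function emergencyTruncate(messages: Msg[], tokensToFree: number): Msg[] {
  const charsToFree = tokensToFree * 4; // ~4 chars per token
  const candidates = messages
    .map((m, i) => ({ m, i }))
    .filter(({ m }) => m.role !== 'system' && m.content.length >= 200)
    .sort((a, b) => b.m.content.length - a.m.content.length); // largest first
  const totalChars = candidates.reduce((s, c) => s + c.m.content.length, 0);
  if (totalChars === 0) return messages;
  const result = [...messages];
  for (const { m, i } of candidates) {
    // Each message gives up a share proportional to its size,
    // while keeping at least a 100-character stub behind.
    const cut = Math.min(
      m.content.length - 100,
      Math.ceil((m.content.length / totalChars) * charsToFree),
    );
    if (cut <= 0) continue;
    result[i] = {
      ...m,
      content: m.content.slice(0, m.content.length - cut) + ' […truncated]',
    };
  }
  return result;
}
```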
The File Summarization Service
File attachments deserve special treatment. A 500-line source file or a multi-page PDF can consume thousands of tokens, but the user’s question might only relate to a small section. NeuroLink’s FileSummarizationService handles this by producing context-aware summaries that focus on the parts relevant to the current prompt.
import { FileSummarizationService } from '@juspay/neurolink';
const service = new FileSummarizationService({
provider: 'vertex',
model: 'gemini-2.5-flash',
});
// Step 1: Prepare files (extract text, estimate tokens, label type)
const prepared = service.prepareFilesForSummarization(rawFiles, 'vertex');
// Step 2: Summarize files that exceed the budget
const summarized = await service.summarizeFiles(
prepared,
userPrompt, // The current user question for context-aware summaries
{
availableBudget: 10_000, // Total token budget for all files
provider: 'vertex',
},
);
for (const file of summarized) {
console.log(`${file.fileName}: ${file.originalTokens} -> ${file.summaryTokens} tokens`);
console.log(` Summarized: ${file.wasSummarized}`);
}
The service supports over 30 MIME types. Binary files (images, audio, video) receive a stub marker instead of a summary. Text-based files are decoded from UTF-8, and the type label (TypeScript File, PDF Document, JSON File, and so on) is included in the summarization prompt so the LLM understands the file format.
If the LLM summarization call fails, the service falls back to naive truncation so the request can still proceed.
Implementation Walkthrough
Bringing everything together, here is a complete example that configures NeuroLink with context compaction, uses the built-in APIs to monitor context usage, and triggers compaction when needed.
Step 1: Configure Context Compaction
import { NeuroLink } from '@juspay/neurolink';
const neurolink = new NeuroLink({
conversationMemory: {
enabled: true,
contextCompaction: {
enabled: true,
threshold: 0.8, // Compact at 80% usage
enablePruning: true, // Stage 1: tool output pruning
enableDeduplication: true, // Stage 2: file read dedup
enableSlidingWindow: true, // Stage 4: sliding window fallback
maxToolOutputBytes: 50_000, // Cap tool outputs at 50KB
maxToolOutputLines: 2000, // Cap tool output lines
fileReadBudgetPercent: 0.6, // 60% of remaining context for files
},
},
});
Step 2: Monitor Context Usage
async function checkAndCompact(sessionId: string) {
const stats = await neurolink.getContextStats(
sessionId,
'anthropic',
'claude-sonnet-4-20250514',
);
if (!stats) return;
console.log(`Messages: ${stats.messageCount}`);
console.log(`Input tokens: ${stats.estimatedInputTokens}`);
console.log(`Available: ${stats.availableInputTokens}`);
console.log(`Usage: ${(stats.usageRatio * 100).toFixed(1)}%`);
console.log(`Should compact: ${stats.shouldCompact}`);
if (stats.shouldCompact) {
const result = await neurolink.compactSession(sessionId);
if (result?.compacted) {
const saved = result.originalTokenCount - result.compactedTokenCount;
console.log(`Freed ${saved} tokens via ${result.stagesApplied.join(', ')}`);
}
}
}
Step 3: Build the Chat Loop
const sessionId = 'long-running-session';
async function chat(userMessage: string) {
// Pre-flight budget check
await checkAndCompact(sessionId);
// Generate with automatic memory
const response = await neurolink.generate({
input: { text: userMessage },
provider: 'anthropic',
model: 'claude-sonnet-4-20250514',
context: { sessionId },
});
return response.content;
}
// This loop can run indefinitely without hitting context limits
for (let i = 1; i <= 100; i++) {
const reply = await chat(`Tell me fact #${i} about distributed systems.`);
console.log(`[${i}] ${reply.slice(0, 120)}...`);
}
Monitoring Context Usage
Monitoring is essential for tuning your compaction settings. NeuroLink emits observability spans for every compaction operation, so you can track token savings, stage frequency, and compaction latency in your existing tracing infrastructure.
Key Metrics to Track
| Metric | What It Tells You |
|---|---|
| context.tokensBefore | Input tokens before compaction |
| context.tokensAfter | Input tokens after compaction |
| context.tokensSaved | Tokens freed in this compaction cycle |
| context.stage | Which stages ran (prune, deduplicate, summarize, truncate) |
| context.budgetUsage | Pre-generation usage ratio |
| Compaction duration (ms) | Wall-clock time for the compaction pipeline |
Alerting Thresholds
Set alerts on these conditions to catch compaction issues before they affect users:
// Example: log a warning when compaction saves fewer tokens than expected
function onCompactionComplete(result: CompactionResult) {
const savingsPercent =
(result.tokensSaved / result.tokensBefore) * 100;
if (result.stagesUsed.includes('truncate')) {
console.warn(
'Reached Stage 4 truncation — consider increasing compaction threshold',
);
}
if (savingsPercent < 10 && result.compacted) {
console.warn(
`Low compaction savings: ${savingsPercent.toFixed(1)}% — ` +
'review tool output sizes and conversation patterns',
);
}
// Track in your observability stack
metrics.record('compaction.savings_percent', savingsPercent);
metrics.record('compaction.stages', result.stagesUsed.length);
}
Summarization Strategy Comparison
Here is a comparison of the strategies covered in this tutorial, alongside two common alternatives (extractive key-point extraction and topic-based grouping), so you can pick the right one for your workload:
| Strategy | When to Use | Token Savings | Context Preservation | LLM Cost |
|---|---|---|---|---|
| Simple sliding window | Short conversations | ~90% | Low | None |
| LLM-powered summarization | Long conversations | ~80% | Medium | Low (cheap model) |
| Extractive key-point extraction | Factual conversations | ~60% | High | None |
| Hierarchical multi-level summaries | Very long conversations | ~85% | Medium-High | Low |
| Topic-based grouping | Multi-topic conversations | ~75% | High | Low |
| Four-stage compaction pipeline | Production workloads | ~70-90% | High | Minimal (deferred) |
Conclusion
Summarization transforms finite context windows from a hard limitation into a manageable engineering constraint. Start with the simple ConversationSummarizer for prototypes, graduate to the sliding window hybrid for moderate workloads, and enable NeuroLink’s built-in four-stage compaction pipeline when you need production-grade context management that handles tool outputs, file reads, and LLM summarization with automatic fallbacks.
The key insight is that not all context is equally valuable. Old tool outputs, duplicate file reads, and verbatim conversation history from 50 turns ago contribute far less than a well-crafted summary. By reducing context strategically rather than uniformly, you keep the model grounded on what matters while keeping costs under control.