Conversation Summarization: Smart Context Management for Long Chats

Keep conversations going indefinitely with NeuroLink's summarization patterns — LLM-powered condensation, sliding window plus summary hybrids, token budget strategies, and the four-stage context compaction pipeline.

Every LLM conversation eventually hits a wall. Context windows are finite, and the longer a chat runs, the more tokens you spend re-sending old messages that no longer contribute much to the next response. Summarization is the answer: condense what happened earlier so the model keeps the gist without burning through your budget.

This tutorial walks through four progressively powerful patterns for conversation summarization with NeuroLink, culminating in the production-grade four-stage context compaction pipeline that ships with the SDK.

When to Summarize

Not every conversation needs summarization. Short one-shot interactions finish well within any model’s context window. Summarization becomes essential when conversations exhibit these characteristics:

  • Turn count exceeds 15-20 exchanges. At that point the token cost of re-sending full history starts to dominate.
  • Tool outputs inflate context. A single file read or API response can inject thousands of tokens into the conversation.
  • Multi-session continuity is required. If you persist context across sessions, summaries keep storage and re-hydration costs manageable.
  • Response quality degrades. Models tend to lose focus when the prompt grows very large, even within the technical context limit.

The following decision diagram helps you choose the right strategy for your workload.

flowchart TD
    A[New user message arrives] --> B{Conversation length?}
    B -- "< 15 turns" --> C[Send full history]
    B -- "15-50 turns" --> D{Token usage > 70%?}
    B -- "> 50 turns" --> E[Mandatory summarization]
    D -- No --> C
    D -- Yes --> F{Tool outputs dominate?}
    F -- Yes --> G[Stage 1: Prune tool outputs]
    F -- No --> H{Duplicate file reads?}
    H -- Yes --> I[Stage 2: Deduplicate files]
    H -- No --> J[Stage 3: LLM summarization]
    G --> K{Still over budget?}
    I --> K
    K -- Yes --> J
    K -- No --> L[Send compacted context]
    J --> L
    E --> G

LLM-Powered Condensation

The most straightforward summarization approach is to ask an LLM to condense older messages into a concise paragraph while preserving key facts, decisions, and entities. NeuroLink makes this simple through its unified generation interface.

Basic Summarizer

The following ConversationSummarizer class tracks messages and triggers summarization once the conversation exceeds a configurable threshold. Older messages are replaced with a summary while recent ones remain intact.

import { NeuroLink } from '@juspay/neurolink';

type ConversationMessage = {
  role: 'user' | 'assistant' | 'system';
  content: string;
  timestamp: Date;
  important?: boolean;
};

class ConversationSummarizer {
  protected neurolink: NeuroLink;
  protected messages: ConversationMessage[] = [];
  private summary: string = '';
  private maxMessages: number;
  private summaryModel: string;

  constructor(options: { maxMessages?: number; summaryModel?: string } = {}) {
    this.neurolink = new NeuroLink();
    this.maxMessages = options.maxMessages || 10;
    this.summaryModel = options.summaryModel || 'claude-3-5-haiku-20241022';
  }

  addMessage(role: 'user' | 'assistant', content: string, important = false) {
    this.messages.push({ role, content, timestamp: new Date(), important });
    if (this.messages.length >= this.maxMessages) {
      this.summarizeAsync().catch(console.error);
    }
  }

  private async summarizeAsync() {
    const importantMessages = this.messages.filter((m) => m.important);
    const regularMessages = this.messages.filter((m) => !m.important);

    // Split regular messages: summarize first half, keep second half
    const midpoint = Math.floor(regularMessages.length / 2);
    const toSummarize = regularMessages.slice(0, midpoint);
    const toKeep = regularMessages.slice(midpoint);

    if (toSummarize.length === 0) return;

    const conversationText = toSummarize
      .map((m) => `${m.role}: ${m.content}`)
      .join('\n\n');

    const result = await this.neurolink.generate({
      input: {
        text: `Summarize this conversation concisely, preserving key facts,
decisions, and context:\n\n${conversationText}`,
      },
      provider: 'anthropic',
      model: this.summaryModel,
      maxTokens: 500,
    });

    // Merge the new summary with any existing summary
    this.summary = this.summary
      ? `${this.summary}\n\nRecent updates: ${result.content}`
      : result.content;

    this.messages = [...importantMessages, ...toKeep];
  }

  async getContext(): Promise<string> {
    const parts: string[] = [];
    if (this.summary) {
      parts.push(`[Previous conversation summary: ${this.summary}]`);
    }
    const recentMessages = this.messages
      .slice(-5)
      .map((m) => `${m.role}: ${m.content}`)
      .join('\n\n');
    if (recentMessages) parts.push(recentMessages);
    return parts.join('\n\n');
  }

  async chat(userMessage: string, markImportant = false): Promise<string> {
    this.addMessage('user', userMessage, markImportant);
    const context = await this.getContext();
    const result = await this.neurolink.generate({
      input: { text: context },
    });
    this.addMessage('assistant', result.content);
    return result.content;
  }

  getStats() {
    return {
      messages: this.messages.length,
      hasSummary: !!this.summary,
      summaryLength: this.summary.length,
      importantMessages: this.messages.filter((m) => m.important).length,
    };
  }
}

Key design decisions in this implementation:

  1. Important-message preservation. Messages flagged with important: true are never summarized. Use this for decisions, confirmations, or any fact the user explicitly asked the model to remember.
  2. Half-and-half splitting. Rather than summarizing everything, only the older half is condensed. This avoids an awkward gap between summary and the most recent exchanges.
  3. Cheap model for summaries. The summarization call uses a small, fast model (Haiku) because it only needs to condense prose, not reason deeply. This keeps costs negligible.

Hierarchical Summaries

When a conversation runs for hundreds of turns, a single summary string grows unwieldy. Progressive or hierarchical summarization solves this by maintaining multiple summary levels.

class ProgressiveSummarizer extends ConversationSummarizer {
  private detailedSummary: string = '';
  private briefSummary: string = '';

  async summarize() {
    // Level 1: Detailed summary at 300 tokens
    this.detailedSummary = await this.summarizeText(
      this.messages.map(m => `${m.role}: ${m.content}`).join('\n'), 300,
    );

    // Level 2: Brief summary at 100 tokens (summary of the summary)
    this.briefSummary = await this.summarizeText(this.detailedSummary, 100);
  }

  private async summarizeText(
    text: string,
    maxTokens: number,
  ): Promise<string> {
    const result = await this.neurolink.generate({
      input: { text: `Summarize concisely: ${text}` },
      maxTokens,
    });
    return result.content;
  }
}

The brief summary is included when context is tight, and the detailed summary is used when there is room. This two-tier approach keeps the model grounded on the full arc of the conversation without consuming excessive tokens.
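Selecting between the two tiers can be sketched as a small helper. The ~4-characters-per-token estimate matches the one used elsewhere in this post; the 150-token headroom cutoff is an illustrative assumption, not an SDK default.

```typescript
// Pick the richest summary tier that fits the remaining token budget.
function pickSummaryTier(
  detailedSummary: string,
  briefSummary: string,
  remainingTokens: number,
): string {
  const estimate = (text: string) => Math.ceil(text.length / 4);
  // Prefer the detailed tier when it fits with some headroom to spare.
  if (estimate(detailedSummary) <= remainingTokens - 150) {
    return detailedSummary;
  }
  return briefSummary;
}
```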

Sliding Window Plus Summary Hybrid

A pure sliding window (keep the last N messages, drop the rest) is cheap and simple but throws away context permanently. A pure summarization approach preserves context but requires an LLM call every time the window fills. The hybrid combines both.

flowchart LR
    subgraph Conversation History
        direction TB
        M1[Msg 1] --> M2[Msg 2]
        M2 --> M3[Msg 3]
        M3 --> M4[Msg 4]
        M4 --> M5[Msg 5]
        M5 --> M6[Msg 6]
        M6 --> M7[Msg 7]
        M7 --> M8[Msg 8]
    end

    subgraph Compacted Context
        direction TB
        S["Summary of Msgs 1-4"]
        R1[Msg 5]
        R2[Msg 6]
        R3[Msg 7]
        R4[Msg 8]
    end

    M1 -.->|Summarized| S
    M2 -.->|Summarized| S
    M3 -.->|Summarized| S
    M4 -.->|Summarized| S
    M5 -->|Kept| R1
    M6 -->|Kept| R2
    M7 -->|Kept| R3
    M8 -->|Kept| R4

Implementation

The ContextWindowManager below implements this pattern with automatic token tracking and progressive strategy selection based on how full the context window is.

import { NeuroLink } from '@juspay/neurolink';

type Message = {
  role: 'system' | 'user' | 'assistant';
  content: string;
  tokens?: number;
};

class ContextWindowManager {
  private neurolink: NeuroLink;
  private messages: Message[] = [];
  private maxTokens: number;
  private systemMessage?: Message;

  constructor(maxTokens: number = 8000) {
    this.neurolink = new NeuroLink();
    this.maxTokens = maxTokens;
  }

  private estimateTokens(text: string): number {
    return Math.ceil(text.length / 4);
  }

  private calculateTotalTokens(messages: Message[]): number {
    return messages.reduce(
      (sum, msg) => sum + (msg.tokens || this.estimateTokens(msg.content)),
      0,
    );
  }

  setSystemMessage(content: string) {
    this.systemMessage = {
      role: 'system',
      content,
      tokens: this.estimateTokens(content),
    };
  }

  addMessage(role: 'user' | 'assistant', content: string) {
    const message: Message = {
      role,
      content,
      tokens: this.estimateTokens(content),
    };
    this.messages.push(message);
    this.pruneIfNeeded();
  }

  private pruneIfNeeded() {
    const allMessages = this.systemMessage
      ? [this.systemMessage, ...this.messages]
      : this.messages;
    const totalTokens = this.calculateTotalTokens(allMessages);

    if (totalTokens <= this.maxTokens) return;

    // Target 80% capacity to leave headroom
    const targetTokens = Math.floor(this.maxTokens * 0.8);
    let currentTokens = this.systemMessage?.tokens || 0;
    const keptMessages: Message[] = [];

    // Work backwards from most recent
    for (let i = this.messages.length - 1; i >= 0; i--) {
      const msg = this.messages[i];
      const msgTokens = msg.tokens || this.estimateTokens(msg.content);
      if (currentTokens + msgTokens <= targetTokens) {
        keptMessages.unshift(msg);
        currentTokens += msgTokens;
      } else {
        break;
      }
    }

    this.messages = keptMessages;
  }

  async summarizeOldMessages() {
    if (this.messages.length < 10) return;

    const totalTokens = this.calculateTotalTokens(this.messages);
    if (totalTokens <= this.maxTokens * 0.7) return;

    const midpoint = Math.floor(this.messages.length / 2);
    const toSummarize = this.messages.slice(0, midpoint);
    const toKeep = this.messages.slice(midpoint);

    const conversationText = toSummarize
      .map((m) => `${m.role}: ${m.content}`)
      .join('\n\n');

    const summary = await this.neurolink.generate({
      input: {
        text: `Summarize this conversation concisely, preserving key
information:\n\n${conversationText}`,
      },
      provider: 'anthropic',
      model: 'claude-3-5-haiku-20241022',
      maxTokens: 500,
    });

    const summaryContent = `[Previous conversation summary: ${summary.content}]`;
    this.messages = [
      {
        role: 'assistant',
        content: summaryContent,
        // Estimate tokens for the wrapped text, not just the raw summary
        tokens: this.estimateTokens(summaryContent),
      },
      ...toKeep,
    ];
  }

  getStats() {
    const totalTokens = this.calculateTotalTokens(this.messages);
    return {
      messages: this.messages.length,
      tokens: totalTokens,
      capacity: this.maxTokens,
      usage: ((totalTokens / this.maxTokens) * 100).toFixed(1) + '%',
    };
  }
}

Progressive Strategy Selection

The hybrid manager selects its strategy based on how full the context window is:

| Context Usage | Strategy | LLM Cost |
| --- | --- | --- |
| 0-70% | No action | None |
| 70-90% | Summarize older half | One cheap LLM call |
| 90-100% | Sliding window prune | None |
| Over 100% | Aggressive truncation | None |

This graduated approach keeps LLM summarization calls to a minimum while ensuring context never overflows.
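The graduated selection above can be written as a tiny dispatcher. The thresholds mirror the table; the strategy names are illustrative rather than SDK identifiers.

```typescript
type CompactionStrategy = 'none' | 'summarize' | 'slidingWindow' | 'truncate';

// Map context usage ratio to the cheapest adequate strategy.
function selectStrategy(usageRatio: number): CompactionStrategy {
  if (usageRatio <= 0.7) return 'none';          // plenty of headroom
  if (usageRatio <= 0.9) return 'summarize';     // one cheap LLM call
  if (usageRatio <= 1.0) return 'slidingWindow'; // free, drops oldest turns
  return 'truncate';                             // aggressive last resort
}
```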

Token Budget Strategies

Before you can summarize intelligently, you need to know how much room you have. NeuroLink’s BudgetChecker runs before every generation call and returns a detailed breakdown of where your tokens are going.

How the Budget Checker Works

The budget checker estimates total input tokens from five categories: system prompt, tool definitions, conversation history, the current user prompt, and file attachments. It compares the total against the model’s available input space and sets a shouldCompact flag when usage exceeds the configured threshold (default: 80%).

import { checkContextBudget } from '@juspay/neurolink';

const budgetResult = checkContextBudget({
  provider: 'anthropic',
  model: 'claude-sonnet-4-20250514',
  maxTokens: 4096,
  systemPrompt: 'You are a helpful assistant.',
  conversationMessages: sessionHistory,
  currentPrompt: userMessage,
  toolDefinitions: registeredTools,
  fileAttachments: uploadedFiles,
  compactionThreshold: 0.8,
});

console.log(budgetResult);
// {
//   withinBudget: true,
//   estimatedInputTokens: 42300,
//   availableInputTokens: 196000,
//   usageRatio: 0.216,
//   shouldCompact: false,
//   breakdown: {
//     systemPrompt: 45,
//     conversationHistory: 38200,
//     currentPrompt: 120,
//     toolDefinitions: 3800,
//     fileAttachments: 135
//   }
// }

The breakdown tells you exactly where tokens are going. In many production applications, conversationHistory and toolDefinitions are the largest consumers. This insight lets you target compaction where it will save the most.
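For example, the breakdown can drive a simple targeting decision. This helper and its return values are a sketch, not part of the NeuroLink API.

```typescript
type BudgetBreakdown = {
  systemPrompt: number;
  conversationHistory: number;
  currentPrompt: number;
  toolDefinitions: number;
  fileAttachments: number;
};

// Return the compactable category currently consuming the most tokens.
function biggestCompactionTarget(
  b: BudgetBreakdown,
): 'conversationHistory' | 'toolDefinitions' | 'fileAttachments' {
  const candidates = [
    ['conversationHistory', b.conversationHistory],
    ['toolDefinitions', b.toolDefinitions],
    ['fileAttachments', b.fileAttachments],
  ] as const;
  // Sort descending by token count and take the top category.
  return [...candidates].sort((x, y) => y[1] - x[1])[0][0];
}
```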

Token Budgets by Use Case

Different workloads need different allocations:

| Use Case | Recommended Limit | Reasoning |
| --- | --- | --- |
| Chatbot | 4K-8K tokens | Quick responses, recent context sufficient |
| Code assistant | 16K-32K tokens | File context and diffs require space |
| Document analysis | 32K-100K tokens | Full documents need to fit |
| Long-form writing | 8K-16K tokens | Story continuity over many turns |
| Customer support | 4K tokens | Short interactions, fast resolution |

Provider-Specific Context Limits

Each model family has different context windows. NeuroLink maintains these automatically in its provider registry, but here are the key values for reference:

const CONTEXT_LIMITS: Record<string, number> = {
  'gpt-4o': 128_000,
  'gpt-4o-mini': 128_000,
  'gpt-4.1': 1_047_576,
  'o3': 200_000,
  'claude-opus-4-20250514': 200_000,
  'claude-sonnet-4-20250514': 200_000,
  'claude-3-5-sonnet-20241022': 200_000,
  'gemini-2.5-flash': 1_048_576,
  'gemini-2.5-pro': 1_048_576,
};

A 20% buffer for the response is already subtracted by the budget checker, so you do not need to account for it manually.

The Four-Stage Compaction Pipeline

NeuroLink ships a production-grade ContextCompactor that orchestrates four stages of context reduction, running each stage only if the previous one did not bring the context within budget. This design minimizes cost: the cheapest stages run first, and the expensive LLM summarization call is deferred until cheaper options are exhausted.

flowchart TD
    A[Messages over budget] --> B["Stage 1: Tool Output Pruning<br/>(No LLM call)"]
    B --> C{Within budget?}
    C -- Yes --> Z[Return compacted messages]
    C -- No --> D["Stage 2: File Read Deduplication<br/>(No LLM call)"]
    D --> E{Within budget?}
    E -- Yes --> Z
    E -- No --> F["Stage 3: LLM Summarization<br/>(Requires LLM call)"]
    F --> G{Within budget?}
    G -- Yes --> Z
    G -- No --> H["Stage 4: Sliding Window Truncation<br/>(No LLM call, last resort)"]
    H --> Z

Stage 1: Tool Output Pruning

The cheapest stage. Tool call results (shell outputs, API responses, file contents) are often the single largest token consumers in a coding or automation session. This stage replaces old tool outputs with compact stand-ins while protecting recent outputs and outputs from critical tools.

// From contextCompactor.ts — Stage 1 configuration
const DEFAULT_CONFIG = {
  enablePrune: true,
  pruneProtectTokens: 40_000,    // Protect the most recent 40K tokens
  pruneMinimumSavings: 20_000,   // Only prune if it saves at least 20K tokens
  pruneProtectedTools: ['skill'], // Never prune outputs from these tools
};

In practice, a single cat or grep result can contain 10,000+ tokens. Replacing stale tool outputs with a one-line stub like [Tool output pruned: 12,847 tokens] can free up enormous amounts of context without any LLM cost.
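A minimal sketch of the replacement step follows. The stub format mirrors the example above, while the message shape, the ~4-chars-per-token estimate, and the function name are assumptions for illustration.

```typescript
type ToolResult = { toolName: string; content: string };

// Replace a stale tool output with a one-line stub recording its size.
function pruneToolOutput(
  msg: ToolResult,
  protectedTools: string[] = ['skill'],
): ToolResult {
  if (protectedTools.includes(msg.toolName)) return msg; // never prune these
  const tokens = Math.ceil(msg.content.length / 4);
  return {
    ...msg,
    content: `[Tool output pruned: ${tokens.toLocaleString('en-US')} tokens]`,
  };
}
```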

Stage 2: File Read Deduplication

When an AI agent reads the same file multiple times during a session (common when iterating on code), each read injects the full file contents into the conversation. This stage identifies duplicate file reads and keeps only the most recent version, replacing earlier reads with a reference note.

// Stage 2 runs only if Stage 1 was insufficient
const dedupResult = deduplicateFileReads(currentMessages);
if (dedupResult.deduplicated) {
  currentMessages = dedupResult.messages;
  stagesUsed.push('deduplicate');
}

This stage requires no LLM calls and is especially effective in long coding sessions where the same configuration files, type definitions, or test fixtures are read repeatedly.
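The core idea can be sketched in a few lines. The message shape and reference-note wording here are illustrative, not the SDK's internal format.

```typescript
type FileRead = { filePath: string; content: string };

// Keep only the most recent read of each file; earlier reads become notes.
function dedupeFileReads(messages: FileRead[]): FileRead[] {
  const lastIndex = new Map<string, number>();
  messages.forEach((m, i) => lastIndex.set(m.filePath, i));
  return messages.map((m, i) =>
    lastIndex.get(m.filePath) === i
      ? m // latest read: keep full contents
      : { ...m, content: `[Superseded read of ${m.filePath}; latest version appears later]` },
  );
}
```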

Stage 3: LLM Summarization

When pruning and deduplication are not enough, the compactor invokes an LLM to summarize older conversation turns. The SDK uses a fast, inexpensive model (Gemini 2.5 Flash by default) with a 120-second timeout. If the call fails, the pipeline gracefully falls through to Stage 4.

// Stage 3: structured LLM summarization
const summarizeResult = await withTimeout(
  summarizeMessages(currentMessages, {
    provider: config.summarizationProvider,   // default: 'vertex'
    model: config.summarizationModel,         // default: 'gemini-2.5-flash'
    keepRecentRatio: config.keepRecentRatio,  // default: 0.3
    memoryConfig,
    targetTokens,
  }),
  120_000, // 2-minute timeout
  'LLM summarization timed out after 120s',
);

The keepRecentRatio parameter controls how much of the conversation is left untouched. With the default of 0.3, the most recent 30% of messages remain in full while the older 70% are condensed into a structured summary.
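The split itself is simple arithmetic; here is a sketch (the rounding choice is an assumption — the SDK may round differently).

```typescript
// With keepRecentRatio = 0.3, a 10-message history splits 7 / 3:
// the oldest 7 go to the summarizer, the newest 3 survive verbatim.
function splitForSummarization<T>(messages: T[], keepRecentRatio = 0.3) {
  const keepCount = Math.round(messages.length * keepRecentRatio);
  return {
    toSummarize: messages.slice(0, messages.length - keepCount),
    toKeep: messages.slice(messages.length - keepCount),
  };
}
```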

Stage 4: Sliding Window Truncation

The last-resort fallback. If all prior stages have run and the context still exceeds the budget, the sliding window truncator removes the oldest non-system messages until the context fits.

const truncResult = truncateWithSlidingWindow(currentMessages, {
  fraction: config.truncationFraction,   // default: 0.5
  currentTokens: stageTokensBefore,
  targetTokens: targetTokens,
  provider: provider,
  adaptiveBuffer: 0.15,
  maxIterations: 3,
});

The adaptiveBuffer of 15% ensures the truncation does not land exactly on the boundary, leaving room for the next few exchanges before another compaction is needed.
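The buffered target works out to simple arithmetic; the formula below is an assumption for illustration, not the SDK's exact computation.

```typescript
// Aim 15% below the hard target so the next few exchanges fit without
// immediately re-triggering compaction.
function bufferedTarget(targetTokens: number, adaptiveBuffer = 0.15): number {
  return Math.floor(targetTokens * (1 - adaptiveBuffer));
}
```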

Emergency Truncation

In extreme cases where even the sliding window cannot bring context within budget (for example, when tool definitions and the system prompt alone consume most of the available space), NeuroLink falls back to emergency content truncation. This truncates the content of individual messages rather than removing entire messages.

import { emergencyContentTruncation } from '@juspay/neurolink';

// Strategy: sort messages by length descending, truncate the biggest first
const truncatedHistory = emergencyContentTruncation(
  messages,
  availableTokens,
  budgetBreakdown,
  provider,
);

The algorithm is straightforward:

  1. Calculate how many tokens need to be freed.
  2. Sort messages by content length, largest first.
  3. Truncate each large message proportionally until the total reduction target is met.
  4. Skip system messages and very short messages (under 200 characters).
  5. If proportional truncation is still insufficient, fall back to keeping only the newest messages that fit.
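The proportional steps above can be sketched as follows; the message shape, the ~4-chars-per-token estimate, and the marker text are illustrative assumptions.

```typescript
type Msg = { role: string; content: string };

// Free roughly `tokensToFree` tokens by trimming the largest messages first,
// each in proportion to its share of the truncatable total.
function proportionalTruncate(messages: Msg[], tokensToFree: number): Msg[] {
  const estimate = (t: string) => Math.ceil(t.length / 4);
  // Step 4: skip system messages and very short messages (< 200 chars).
  const candidates = messages
    .map((m, i) => ({ i, tokens: estimate(m.content) }))
    .filter((c) => messages[c.i].role !== 'system' && messages[c.i].content.length >= 200)
    .sort((a, b) => b.tokens - a.tokens); // Step 2: largest first
  const pool = candidates.reduce((sum, c) => sum + c.tokens, 0);
  if (pool === 0) return messages;
  const out = messages.map((m) => ({ ...m }));
  for (const { i, tokens } of candidates) {
    // Step 3: each candidate gives up a proportional share of the target.
    const cut = Math.min(tokens, Math.ceil((tokens / pool) * tokensToFree));
    out[i].content = out[i].content.slice(0, (tokens - cut) * 4) + '\n[truncated]';
  }
  return out;
}
```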

This is a safety net, not a primary strategy. Well-tuned compaction thresholds should prevent you from reaching this point in normal operation.

The File Summarization Service

File attachments deserve special treatment. A 500-line source file or a multi-page PDF can consume thousands of tokens, but the user’s question might only relate to a small section. NeuroLink’s FileSummarizationService handles this by producing context-aware summaries that focus on the parts relevant to the current prompt.

import { FileSummarizationService } from '@juspay/neurolink';

const service = new FileSummarizationService({
  provider: 'vertex',
  model: 'gemini-2.5-flash',
});

// Step 1: Prepare files (extract text, estimate tokens, label type)
const prepared = service.prepareFilesForSummarization(rawFiles, 'vertex');

// Step 2: Summarize files that exceed the budget
const summarized = await service.summarizeFiles(
  prepared,
  userPrompt,          // The current user question for context-aware summaries
  {
    availableBudget: 10_000,  // Total token budget for all files
    provider: 'vertex',
  },
);

for (const file of summarized) {
  console.log(`${file.fileName}: ${file.originalTokens} -> ${file.summaryTokens} tokens`);
  console.log(`  Summarized: ${file.wasSummarized}`);
}

The service supports over 30 MIME types. Binary files (images, audio, video) receive a stub marker instead of a summary. Text-based files are decoded from UTF-8, and the type label (TypeScript File, PDF Document, JSON File, and so on) is included in the summarization prompt so the LLM understands the file format.
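The binary-vs-text branching can be sketched like this; the marker wording and helper name are assumptions, not the service's actual output.

```typescript
// Return a stub for binary attachments, or null when the file should go to
// LLM summarization as text.
function binaryStubOrNull(fileName: string, mimeType: string): string | null {
  const binaryPrefixes = ['image/', 'audio/', 'video/'];
  if (binaryPrefixes.some((p) => mimeType.startsWith(p))) {
    return `[Binary file: ${fileName} (${mimeType}) - not summarized]`;
  }
  return null; // text-based: decode as UTF-8 and summarize
}
```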

If the LLM summarization call fails, the service falls back to naive truncation so the request can still proceed.

Implementation Walkthrough

Bringing everything together, here is a complete example that configures NeuroLink with context compaction, uses the built-in APIs to monitor context usage, and triggers compaction when needed.

Step 1: Configure Context Compaction

import { NeuroLink } from '@juspay/neurolink';

const neurolink = new NeuroLink({
  conversationMemory: {
    enabled: true,
    contextCompaction: {
      enabled: true,
      threshold: 0.8,              // Compact at 80% usage
      enablePruning: true,         // Stage 1: tool output pruning
      enableDeduplication: true,   // Stage 2: file read dedup
      enableSlidingWindow: true,   // Stage 4: sliding window fallback
      maxToolOutputBytes: 50_000,  // Cap tool outputs at 50KB
      maxToolOutputLines: 2000,    // Cap tool output lines
      fileReadBudgetPercent: 0.6,  // 60% of remaining context for files
    },
  },
});

Step 2: Monitor Context Usage

async function checkAndCompact(sessionId: string) {
  const stats = await neurolink.getContextStats(
    sessionId,
    'anthropic',
    'claude-sonnet-4-20250514',
  );

  if (!stats) return;

  console.log(`Messages:      ${stats.messageCount}`);
  console.log(`Input tokens:  ${stats.estimatedInputTokens}`);
  console.log(`Available:     ${stats.availableInputTokens}`);
  console.log(`Usage:         ${(stats.usageRatio * 100).toFixed(1)}%`);
  console.log(`Should compact: ${stats.shouldCompact}`);

  if (stats.shouldCompact) {
    const result = await neurolink.compactSession(sessionId);
    if (result?.compacted) {
      const saved = result.originalTokenCount - result.compactedTokenCount;
      console.log(`Freed ${saved} tokens via ${result.stagesApplied.join(', ')}`);
    }
  }
}

Step 3: Build the Chat Loop

const sessionId = 'long-running-session';

async function chat(userMessage: string) {
  // Pre-flight budget check
  await checkAndCompact(sessionId);

  // Generate with automatic memory
  const response = await neurolink.generate({
    input: { text: userMessage },
    provider: 'anthropic',
    model: 'claude-sonnet-4-20250514',
    context: { sessionId },
  });

  return response.content;
}

// This loop can run indefinitely without hitting context limits
for (let i = 1; i <= 100; i++) {
  const reply = await chat(`Tell me fact #${i} about distributed systems.`);
  console.log(`[${i}] ${reply.slice(0, 120)}...`);
}

Monitoring Context Usage

Monitoring is essential for tuning your compaction settings. NeuroLink emits observability spans for every compaction operation, so you can track token savings, stage frequency, and compaction latency in your existing tracing infrastructure.

Key Metrics to Track

| Metric | What It Tells You |
| --- | --- |
| context.tokensBefore | Input tokens before compaction |
| context.tokensAfter | Input tokens after compaction |
| context.tokensSaved | Tokens freed in this compaction cycle |
| context.stage | Which stages ran (prune, deduplicate, summarize, truncate) |
| context.budgetUsage | Pre-generation usage ratio |
| Compaction duration (ms) | Wall-clock time for the compaction pipeline |

Alerting Thresholds

Set alerts on these conditions to catch compaction issues before they affect users:

// Example: log a warning when compaction saves fewer tokens than expected.
// (CompactionResult and the `metrics` client stand in for your own result
// type and telemetry setup.)
function onCompactionComplete(result: CompactionResult) {
  const savingsPercent = (result.tokensSaved / result.tokensBefore) * 100;

  if (result.stagesUsed.includes('truncate')) {
    console.warn(
      'Reached Stage 4 truncation — consider increasing compaction threshold',
    );
  }

  if (savingsPercent < 10 && result.compacted) {
    console.warn(
      `Low compaction savings: ${savingsPercent.toFixed(1)}% — ` +
      'review tool output sizes and conversation patterns',
    );
  }

  // Track in your observability stack
  metrics.record('compaction.savings_percent', savingsPercent);
  metrics.record('compaction.stages', result.stagesUsed.length);
}

Summarization Strategy Comparison

Here is a summary of all the strategies covered in this tutorial, so you can pick the right one for your workload:

| Strategy | When to Use | Token Savings | Context Preservation | LLM Cost |
| --- | --- | --- | --- | --- |
| Simple sliding window | Short conversations | ~90% | Low | None |
| LLM-powered summarization | Long conversations | ~80% | Medium | Low (cheap model) |
| Extractive key-point extraction | Factual conversations | ~60% | High | None |
| Hierarchical multi-level summaries | Very long conversations | ~85% | Medium-High | Low |
| Topic-based grouping | Multi-topic conversations | ~75% | High | Low |
| Four-stage compaction pipeline | Production workloads | ~70-90% | High | Minimal (deferred) |

Conclusion

Summarization transforms finite context windows from a hard limitation into a manageable engineering constraint. Start with the simple ConversationSummarizer for prototypes, graduate to the sliding window hybrid for moderate workloads, and enable NeuroLink’s built-in four-stage compaction pipeline when you need production-grade context management that handles tool outputs, file reads, and LLM summarization with automatic fallbacks.

The key insight is that not all context is equally valuable. Old tool outputs, duplicate file reads, and verbatim conversation history from 50 turns ago contribute far less than a well-crafted summary. By reducing context strategically rather than uniformly, you keep the model grounded on what matters while keeping costs under control.

This post is licensed under CC BY 4.0 by the author.