Caching LLM Responses: Performance Optimization with NeuroLink
Implement external caching strategies for NeuroLink SDK. Exact match, normalized, and semantic caching patterns for cost reduction.
Implementation Note: You will implement these patterns on top of NeuroLink’s core API. They are not built-in SDK features but represent recommended approaches you can build yourself.
In this guide, you will build three caching layers for NeuroLink applications: exact-match, normalized-key, and semantic caching. By the end, you will have a production-ready caching system that can reduce LLM costs by 40-70% and deliver sub-100ms response times for cached queries.
Important Note: The NeuroLink SDK does not include built-in response caching. This is intentional—caching strategies vary significantly based on application requirements, data sensitivity, and infrastructure. You will implement external caching that integrates cleanly with NeuroLink’s generate() API.
Cache Decision Flow
The following diagram illustrates how a multi-tier caching system processes LLM requests:
flowchart TB
Request[Incoming Query]
Request --> Normalize[Normalize Query]
Normalize --> L1{L1: In-Memory<br/>Exact Match?}
L1 -->|Hit| L1Response[Return Cached<br/>~1ms]
L1 -->|Miss| L2{L2: Redis<br/>Exact Match?}
L2 -->|Hit| PromoteL1[Promote to L1]
PromoteL1 --> L2Response[Return Cached<br/>~5ms]
L2 -->|Miss| Embed[Generate Query<br/>Embedding]
Embed --> L3{L3: Semantic<br/>Search}
L3 -->|Similar Found<br/>≥ 0.92| SemanticResponse[Return Similar<br/>~20ms]
L3 -->|No Match| LLM[Call NeuroLink<br/>generate]
LLM --> Response[LLM Response<br/>500-5000ms]
Response --> StoreAll[Store in All Tiers]
StoreAll --> L1Cache[("L1 Cache")]
StoreAll --> L2Cache[("L2 Cache")]
StoreAll --> L3Cache[("L3 Vector Store")]
StoreAll --> FinalResponse[Return Fresh Response]
subgraph Metrics["Cache Metrics"]
HitRate[Track Hit Rate]
CostSaved[Calculate Savings]
Latency[Measure Latency]
end
L1Response --> Metrics
L2Response --> Metrics
SemanticResponse --> Metrics
FinalResponse --> Metrics
Why Caching Matters for LLM Applications
LLM API calls are fundamentally different from traditional API calls. They’re expensive (often $0.01-0.10 per request), slow (500ms-5s latency), and frequently return identical or nearly identical responses for similar inputs.
Consider a customer support chatbot handling 10,000 queries daily. Analysis typically reveals that 30-40% of questions are variations of the same underlying query:
- “What’s your return policy?”
- “How do I return an item?”
- “Returns policy please”
- “Can I send something back?”
Without caching, each variant triggers a full LLM inference. With intelligent caching, you answer once and serve cached responses for semantically similar queries.
The Cost Mathematics
Consider representative numbers for a production application:
Daily queries: 10,000
Average cost per query: $0.02
Cache hit rate (achievable): 45%
Without caching:
- Daily cost: $200
- Monthly cost: $6,000
With caching:
- Cached queries: 4,500 x $0.00 = $0
- Fresh queries: 5,500 x $0.02 = $110
- Daily cost: $110
- Monthly cost: $3,300
- Monthly savings: $2,700
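The arithmetic above can be verified with a small script (the figures are the illustrative ones from this example, not measured data):

```typescript
// Illustrative cost model using the numbers from the example above
const dailyQueries = 10_000;
const costPerQuery = 0.02; // USD
const hitRate = 0.45;      // 45% cache hit rate

const withoutCaching = dailyQueries * costPerQuery;
const freshQueries = dailyQueries * (1 - hitRate);
const withCaching = freshQueries * costPerQuery;

console.log(`Daily without caching: $${withoutCaching}`);                      // $200
console.log(`Daily with caching:    $${withCaching}`);                         // $110
console.log(`Monthly savings:       $${(withoutCaching - withCaching) * 30}`); // $2700
```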
Beyond cost savings, cached responses return in milliseconds rather than seconds, fundamentally improving user experience.
Basic Caching Pattern with NeuroLink
Here’s the fundamental pattern for wrapping NeuroLink’s generate method with caching:
import { NeuroLink } from '@juspay/neurolink';
import type { GenerateOptions, GenerateResult } from '@juspay/neurolink';
import { createHash } from 'crypto';
// Initialize NeuroLink
const neurolink = new NeuroLink();
// Simple in-memory cache
interface CachedResponse {
result: GenerateResult;
createdAt: number;
expiresAt: number;
}
const cache = new Map<string, CachedResponse>();
// Generate a cache key from request parameters
function generateCacheKey(options: GenerateOptions): string {
const keyData = {
text: options.input.text.trim().toLowerCase(),
provider: options.provider,
model: options.model,
temperature: options.temperature,
systemPrompt: options.systemPrompt,
};
return createHash('sha256').update(JSON.stringify(keyData)).digest('hex');
}
// Cached generate function
async function cachedGenerate(
options: GenerateOptions,
ttlMs: number = 3600000 // 1 hour default
): Promise<GenerateResult & { cached: boolean }> {
const cacheKey = generateCacheKey(options);
// Check cache
const cached = cache.get(cacheKey);
if (cached && Date.now() < cached.expiresAt) {
return { ...cached.result, cached: true };
}
// Cache miss - call NeuroLink
const result = await neurolink.generate(options);
// Store in cache
cache.set(cacheKey, {
result,
createdAt: Date.now(),
expiresAt: Date.now() + ttlMs,
});
return { ...result, cached: false };
}
// Usage example
async function main() {
const response = await cachedGenerate({
input: { text: 'What is the capital of France?' },
provider: 'openai',
model: 'gpt-4o-mini',
temperature: 0, // Deterministic for better caching
});
console.log(`Response: ${response.content}`);
console.log(`From cache: ${response.cached}`);
console.log(`Tokens used: ${response.usage?.total || 'N/A'}`);
}
Normalized Key Generation
Improving hit rates requires normalizing inputs before key generation:
import { NeuroLink } from '@juspay/neurolink';
import type { GenerateOptions, GenerateResult } from '@juspay/neurolink';
import { createHash } from 'crypto';
interface CachedResponse {
result: GenerateResult;
createdAt: number;
expiresAt: number;
}
class NormalizedLLMCache {
private cache = new Map<string, CachedResponse>();
private neurolink: NeuroLink;
constructor() {
this.neurolink = new NeuroLink();
}
// Normalize prompt for better cache hit rates
private normalizePrompt(prompt: string): string {
return prompt
.toLowerCase()
.trim()
// Standardize punctuation
.replace(/[""]/g, '"')
.replace(/['']/g, "'")
// Remove common filler words that don't change meaning
.replace(/\b(please|kindly|could you|can you|would you|tell me)\b/gi, '')
// Standardize question endings
.replace(/\?+$/, '?')
// Remove trailing periods before question marks
.replace(/\.\s*\?/, '?')
// Collapse whitespace, including gaps left by the removals above
.replace(/\s+/g, ' ')
.trim();
}
private generateKey(options: GenerateOptions): string {
const normalized = this.normalizePrompt(options.input.text);
const keyData = {
prompt: normalized,
provider: options.provider,
model: options.model,
// Only include temperature if it affects output
temperature: options.temperature ?? 0.7,
};
return createHash('sha256').update(JSON.stringify(keyData)).digest('hex');
}
async generate(
options: GenerateOptions,
cacheTTL: number = 3600000
): Promise<GenerateResult & { cached: boolean; cacheKey: string }> {
const cacheKey = this.generateKey(options);
// Check cache
const cached = this.cache.get(cacheKey);
if (cached && Date.now() < cached.expiresAt) {
return { ...cached.result, cached: true, cacheKey };
}
// Generate fresh response
const result = await this.neurolink.generate(options);
// Cache the result
this.cache.set(cacheKey, {
result,
createdAt: Date.now(),
expiresAt: Date.now() + cacheTTL,
});
return { ...result, cached: false, cacheKey };
}
// Get cache statistics
getStats(): { size: number; keys: string[] } {
return {
size: this.cache.size,
keys: Array.from(this.cache.keys()),
};
}
// Clear expired entries
cleanup(): number {
const now = Date.now();
let cleared = 0;
for (const [key, value] of this.cache) {
if (now >= value.expiresAt) {
this.cache.delete(key);
cleared++;
}
}
return cleared;
}
}
// Usage
const cache = new NormalizedLLMCache();
// These will likely hit the same cache entry due to normalization
await cache.generate({
input: { text: 'What is machine learning?' },
provider: 'vertex',
model: 'gemini-2.5-flash',
});
await cache.generate({
input: { text: 'Could you please tell me what is machine learning?' },
provider: 'vertex',
model: 'gemini-2.5-flash',
});
Normalization increases hit rates by 15-25% for most applications without requiring additional infrastructure.
Redis Integration for Production Caching
Now you will move from in-memory caching to Redis for production-grade distributed caching. You will build a cache wrapper that tracks hits, misses, and cost savings automatically.
import { NeuroLink } from '@juspay/neurolink';
import type { GenerateOptions, GenerateResult } from '@juspay/neurolink';
import { createClient, RedisClientType } from 'redis';
import { createHash } from 'crypto';
interface CacheEntry {
result: GenerateResult;
createdAt: number;
provider: string;
model: string;
promptHash: string;
}
class RedisLLMCache {
private redis: RedisClientType;
private neurolink: NeuroLink;
private prefix = 'neurolink:cache:';
private statsPrefix = 'neurolink:stats:';
constructor(redisUrl: string) {
this.redis = createClient({ url: redisUrl });
this.neurolink = new NeuroLink();
}
async connect(): Promise<void> {
await this.redis.connect();
}
async disconnect(): Promise<void> {
await this.redis.quit();
}
private generateKey(options: GenerateOptions): string {
const normalized = options.input.text.trim().toLowerCase().replace(/\s+/g, ' ');
const keyData = JSON.stringify({
prompt: normalized,
provider: options.provider,
model: options.model,
temperature: options.temperature ?? 0.7,
systemPrompt: options.systemPrompt,
});
return createHash('sha256').update(keyData).digest('hex');
}
async generate(
options: GenerateOptions,
ttlSeconds: number = 3600
): Promise<GenerateResult & { cached: boolean; latencyMs: number }> {
const startTime = Date.now();
const cacheKey = `${this.prefix}${this.generateKey(options)}`;
// Try to get from cache
const cached = await this.redis.get(cacheKey);
if (cached) {
const entry: CacheEntry = JSON.parse(cached);
await this.recordHit();
return {
...entry.result,
cached: true,
latencyMs: Date.now() - startTime,
};
}
// Cache miss - call NeuroLink
await this.recordMiss();
const result = await this.neurolink.generate(options);
// Store in Redis with TTL
const entry: CacheEntry = {
result,
createdAt: Date.now(),
provider: options.provider || 'default',
model: options.model || 'default',
promptHash: this.generateKey(options),
};
await this.redis.setEx(cacheKey, ttlSeconds, JSON.stringify(entry));
// Track estimated cost savings
const estimatedCost = this.estimateCost(result);
await this.redis.incrByFloat(`${this.statsPrefix}potential_savings`, estimatedCost);
return {
...result,
cached: false,
latencyMs: Date.now() - startTime,
};
}
private estimateCost(result: GenerateResult): number {
// Rough cost estimate based on token usage
const tokens = result.usage?.total || 0;
return tokens * 0.00001; // Approximate cost per token
}
private async recordHit(): Promise<void> {
await this.redis.incr(`${this.statsPrefix}hits`);
}
private async recordMiss(): Promise<void> {
await this.redis.incr(`${this.statsPrefix}misses`);
}
async getStats(): Promise<{
hits: number;
misses: number;
hitRate: number;
potentialSavings: number;
}> {
const [hits, misses, savings] = await Promise.all([
this.redis.get(`${this.statsPrefix}hits`),
this.redis.get(`${this.statsPrefix}misses`),
this.redis.get(`${this.statsPrefix}potential_savings`),
]);
const hitsNum = parseInt(hits || '0');
const missesNum = parseInt(misses || '0');
const total = hitsNum + missesNum;
return {
hits: hitsNum,
misses: missesNum,
hitRate: total > 0 ? (hitsNum / total) * 100 : 0,
potentialSavings: parseFloat(savings || '0'),
};
}
// Note: KEYS blocks Redis while it scans the keyspace; prefer SCAN
// (e.g. redis.scanIterator) for large production datasets.
async invalidateByPattern(pattern: string): Promise<number> {
const keys = await this.redis.keys(`${this.prefix}*${pattern}*`);
if (keys.length === 0) return 0;
return await this.redis.del(keys);
}
}
// Usage example
async function main() {
const cache = new RedisLLMCache('redis://localhost:6379');
await cache.connect();
try {
// First call - cache miss
const result1 = await cache.generate({
input: { text: 'Explain quantum computing in simple terms' },
provider: 'anthropic',
model: 'claude-sonnet-4-5-20250929',
temperature: 0.3,
});
console.log(`First call - Cached: ${result1.cached}, Latency: ${result1.latencyMs}ms`);
// Second call - cache hit
const result2 = await cache.generate({
input: { text: 'Explain quantum computing in simple terms' },
provider: 'anthropic',
model: 'claude-sonnet-4-5-20250929',
temperature: 0.3,
});
console.log(`Second call - Cached: ${result2.cached}, Latency: ${result2.latencyMs}ms`);
// Check stats
const stats = await cache.getStats();
console.log(`Cache stats:`, stats);
} finally {
await cache.disconnect();
}
}
Semantic Caching with Embeddings
Next, you will implement semantic caching that understands query meaning rather than relying on exact string matches. This approach requires an embedding model and vector similarity search.
import { NeuroLink } from '@juspay/neurolink';
import type { GenerateOptions, GenerateResult } from '@juspay/neurolink';
import { createClient, RedisClientType } from 'redis';
interface SemanticCacheEntry {
prompt: string;
embedding: number[];
result: GenerateResult;
createdAt: number;
}
class SemanticLLMCache {
private redis: RedisClientType;
private neurolink: NeuroLink;
private entries: SemanticCacheEntry[] = []; // In production, use a vector DB
private similarityThreshold: number;
constructor(redisUrl: string, similarityThreshold: number = 0.92) {
this.redis = createClient({ url: redisUrl });
this.neurolink = new NeuroLink();
this.similarityThreshold = similarityThreshold;
}
async connect(): Promise<void> {
await this.redis.connect();
await this.loadEntriesFromRedis();
}
private async loadEntriesFromRedis(): Promise<void> {
const data = await this.redis.get('neurolink:semantic_cache:entries');
if (data) {
this.entries = JSON.parse(data);
}
}
private async saveEntriesToRedis(): Promise<void> {
await this.redis.set(
'neurolink:semantic_cache:entries',
JSON.stringify(this.entries)
);
}
private async generateEmbedding(text: string): Promise<number[]> {
// For production, use a proper embedding API like OpenAI's text-embedding-3-small
// via neurolink.embed() or the provider's embedding endpoint directly.
// This simplified example uses content hashing as a pseudo-embedding.
return this.textToSimpleEmbedding(text);
}
// Simple text-to-embedding for demonstration
// In production, use OpenAI embeddings, sentence-transformers, etc.
private textToSimpleEmbedding(text: string): number[] {
const normalized = text.toLowerCase().trim();
const words = normalized.split(/\s+/);
const embedding = new Array(128).fill(0);
for (let i = 0; i < words.length; i++) {
const word = words[i];
for (let j = 0; j < word.length; j++) {
const idx = (word.charCodeAt(j) + i * 7 + j * 13) % 128;
embedding[idx] += 1 / (i + 1);
}
}
// Normalize
const magnitude = Math.sqrt(embedding.reduce((sum, v) => sum + v * v, 0));
return embedding.map(v => v / (magnitude || 1));
}
// Cosine similarity between two vectors
private cosineSimilarity(a: number[], b: number[]): number {
if (a.length !== b.length) return 0;
let dotProduct = 0;
let magnitudeA = 0;
let magnitudeB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
magnitudeA += a[i] * a[i];
magnitudeB += b[i] * b[i];
}
const magnitude = Math.sqrt(magnitudeA) * Math.sqrt(magnitudeB);
return magnitude === 0 ? 0 : dotProduct / magnitude;
}
// Find most similar cached entry
private findSimilar(embedding: number[]): {
entry: SemanticCacheEntry;
similarity: number;
} | null {
let bestMatch: { entry: SemanticCacheEntry; similarity: number } | null = null;
for (const entry of this.entries) {
const similarity = this.cosineSimilarity(embedding, entry.embedding);
if (similarity >= this.similarityThreshold) {
if (!bestMatch || similarity > bestMatch.similarity) {
bestMatch = { entry, similarity };
}
}
}
return bestMatch;
}
async generate(
options: GenerateOptions,
ttlMs: number = 3600000
): Promise<GenerateResult & {
cached: boolean;
similarity?: number;
originalPrompt?: string;
}> {
const queryEmbedding = await this.generateEmbedding(options.input.text);
// Search for semantically similar cached response
const similar = this.findSimilar(queryEmbedding);
if (similar) {
console.log(`Semantic cache hit! Similarity: ${similar.similarity.toFixed(3)}`);
return {
...similar.entry.result,
cached: true,
similarity: similar.similarity,
originalPrompt: similar.entry.prompt,
};
}
// No cache hit - generate fresh response
const result = await this.neurolink.generate(options);
// Store with embedding
const newEntry: SemanticCacheEntry = {
prompt: options.input.text,
embedding: queryEmbedding,
result,
createdAt: Date.now(),
};
this.entries.push(newEntry);
// Limit cache size (simple LRU-like behavior)
if (this.entries.length > 10000) {
this.entries = this.entries.slice(-5000);
}
await this.saveEntriesToRedis();
return { ...result, cached: false };
}
async disconnect(): Promise<void> {
await this.saveEntriesToRedis();
await this.redis.quit();
}
}
// Usage - these semantically similar queries should hit cache
async function demonstrateSemanticCache() {
const cache = new SemanticLLMCache('redis://localhost:6379', 0.85);
await cache.connect();
try {
// First query
const r1 = await cache.generate({
input: { text: "What's the weather forecast for tomorrow?" },
provider: 'vertex',
model: 'gemini-2.5-flash',
});
console.log(`Query 1 - Cached: ${r1.cached}`);
// Semantically similar query - should hit cache
const r2 = await cache.generate({
input: { text: "Tell me tomorrow's weather prediction" },
provider: 'vertex',
model: 'gemini-2.5-flash',
});
console.log(`Query 2 - Cached: ${r2.cached}, Similarity: ${r2.similarity}`);
// Another similar variant
const r3 = await cache.generate({
input: { text: "What will the weather be like tomorrow?" },
provider: 'vertex',
model: 'gemini-2.5-flash',
});
console.log(`Query 3 - Cached: ${r3.cached}, Similarity: ${r3.similarity}`);
} finally {
await cache.disconnect();
}
}
Choosing Similarity Thresholds
The similarity threshold determines how “close” two queries must be to trigger a cache hit:
const thresholdConfigs = {
// Very high threshold (0.95+)
// Use for: Legal, medical, financial queries where precision is critical
// Hit rate: Lower
// Risk: Minimal false positives
highPrecision: {
threshold: 0.95,
useCase: 'Compliance-sensitive applications',
},
// Balanced threshold (0.90-0.94)
// Use for: General chatbots, customer support
// Hit rate: Moderate
// Risk: Occasional edge cases
balanced: {
threshold: 0.92,
useCase: 'Most production applications',
},
// Lower threshold (0.85-0.89)
// Use for: FAQ bots, well-defined query spaces
// Hit rate: Higher
// Risk: More false positives
highRecall: {
threshold: 0.87,
useCase: 'FAQ systems with verified question sets',
},
};
Cache Invalidation Strategies
The hardest problem in caching isn’t storage—it’s knowing when cached data becomes stale.
Time-Based Invalidation
const ttlStrategies: Record<string, number> = {
// Static content - long TTL
faq: 86400 * 7, // 7 days
productDescriptions: 86400, // 1 day
documentation: 86400 * 3, // 3 days
// Dynamic content - short TTL
pricing: 3600, // 1 hour
inventory: 900, // 15 minutes
recommendations: 1800, // 30 minutes
// Conversational - very short TTL
chat: 300, // 5 minutes
contextual: 600, // 10 minutes
};
function determineTTL(query: string, category: string): number {
let ttl = ttlStrategies[category] || 3600;
// Adjust based on query characteristics
if (/today|now|current|latest/i.test(query)) {
ttl = Math.min(ttl, 900); // Max 15 min for time-sensitive
}
if (/price|cost|\$/i.test(query)) {
ttl = Math.min(ttl, 3600); // Max 1 hour for pricing
}
return ttl;
}
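To see how the adjustments compound, here is the same logic exercised on a couple of queries (a self-contained copy with an abridged category table):

```typescript
// Abridged copy of the TTL table and logic from above
const ttlStrategies: Record<string, number> = {
  faq: 86400 * 7, // 7 days
  pricing: 3600,  // 1 hour
  chat: 300,      // 5 minutes
};

function determineTTL(query: string, category: string): number {
  let ttl = ttlStrategies[category] ?? 3600;
  if (/today|now|current|latest/i.test(query)) {
    ttl = Math.min(ttl, 900);  // cap time-sensitive queries at 15 minutes
  }
  if (/price|cost|\$/i.test(query)) {
    ttl = Math.min(ttl, 3600); // cap pricing queries at 1 hour
  }
  return ttl;
}

// An FAQ-category query with a time-sensitive word drops from 7 days to 15 minutes
console.log(determineTTL('What is the price today?', 'faq')); // 900
// A plain FAQ query keeps the long TTL
console.log(determineTTL('How do returns work?', 'faq'));     // 604800
```

The caps apply in sequence, so the shortest applicable TTL always wins regardless of category.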
Event-Based Invalidation
import { EventEmitter } from 'events';
import { createClient, RedisClientType } from 'redis';
class CacheInvalidator extends EventEmitter {
private redis: RedisClientType;
private cachePrefix = 'neurolink:cache:';
constructor(redis: RedisClientType) {
super();
this.redis = redis;
this.setupEventHandlers();
}
private setupEventHandlers(): void {
// Product updates invalidate product-related caches
this.on('product:updated', async (productId: string) => {
await this.invalidateByPattern(`product:${productId}`);
await this.invalidateByTag('product-info');
});
// Pricing changes invalidate pricing caches
this.on('pricing:changed', async () => {
await this.invalidateByTag('pricing');
// invalidateByPattern already wraps its argument in wildcards
await this.invalidateByPattern('price');
await this.invalidateByPattern('cost');
});
// Content updates invalidate documentation caches
this.on('content:published', async (contentType: string) => {
await this.invalidateByTag(contentType);
});
}
async invalidateByPattern(pattern: string): Promise<number> {
const keys = await this.redis.keys(`${this.cachePrefix}*${pattern}*`);
if (keys.length === 0) return 0;
const deleted = await this.redis.del(keys);
console.log(`Invalidated ${deleted} cache entries for pattern: ${pattern}`);
return deleted;
}
async invalidateByTag(tag: string): Promise<number> {
const tagKey = `${this.cachePrefix}tag:${tag}`;
const cachedKeys = await this.redis.sMembers(tagKey);
if (cachedKeys.length === 0) return 0;
const deleted = await this.redis.del(cachedKeys);
await this.redis.del(tagKey);
console.log(`Invalidated ${deleted} entries for tag: ${tag}`);
return deleted;
}
}
Multi-Tier Caching Architecture
Production systems benefit from multiple cache layers:
import { NeuroLink } from '@juspay/neurolink';
import type { GenerateOptions, GenerateResult } from '@juspay/neurolink';
import { createClient, RedisClientType } from 'redis';
import { createHash } from 'crypto';
interface CacheResult {
source: 'l1' | 'l2' | 'l3' | 'miss';
cached: boolean;
result?: GenerateResult;
similarity?: number;
latencyMs: number;
}
class MultiTierLLMCache {
// L1: In-memory, fastest, limited size
private l1Cache = new Map<string, { result: GenerateResult; expiresAt: number }>();
private l1MaxSize = 1000;
// L2: Redis, shared across instances
private redis: RedisClientType;
// L3: Semantic search (simplified - use vector DB in production)
private semanticEntries: Array<{
key: string;
embedding: number[];
result: GenerateResult;
}> = [];
private neurolink: NeuroLink;
constructor(redisUrl: string) {
this.redis = createClient({ url: redisUrl });
this.neurolink = new NeuroLink();
// L1 cache eviction every minute
setInterval(() => this.evictL1(), 60000);
}
async connect(): Promise<void> {
await this.redis.connect();
}
private generateKey(options: GenerateOptions): string {
const normalized = options.input.text.trim().toLowerCase().replace(/\s+/g, ' ');
return createHash('sha256').update(JSON.stringify({
prompt: normalized,
provider: options.provider,
model: options.model,
temperature: options.temperature,
})).digest('hex');
}
private evictL1(): void {
const now = Date.now();
for (const [key, value] of this.l1Cache) {
if (now >= value.expiresAt) {
this.l1Cache.delete(key);
}
}
// Also evict oldest entries if over size limit
if (this.l1Cache.size > this.l1MaxSize) {
const entries = Array.from(this.l1Cache.entries());
entries.sort((a, b) => a[1].expiresAt - b[1].expiresAt);
const toRemove = entries.slice(0, this.l1Cache.size - this.l1MaxSize);
toRemove.forEach(([key]) => this.l1Cache.delete(key));
}
}
async generate(
options: GenerateOptions,
config: { ttlSeconds?: number; semanticThreshold?: number } = {}
): Promise<CacheResult & { result: GenerateResult }> {
const startTime = Date.now();
const exactKey = this.generateKey(options);
const ttl = config.ttlSeconds || 3600;
// L1: Check in-memory exact match
const l1Entry = this.l1Cache.get(exactKey);
if (l1Entry && Date.now() < l1Entry.expiresAt) {
return {
source: 'l1',
cached: true,
result: l1Entry.result,
latencyMs: Date.now() - startTime,
};
}
// L2: Check Redis exact match
const l2Data = await this.redis.get(`neurolink:l2:${exactKey}`);
if (l2Data) {
const result = JSON.parse(l2Data) as GenerateResult;
// Promote to L1
this.l1Cache.set(exactKey, {
result,
expiresAt: Date.now() + (ttl * 1000),
});
return {
source: 'l2',
cached: true,
result,
latencyMs: Date.now() - startTime,
};
}
// L3: Semantic search (if threshold provided)
if (config.semanticThreshold) {
const queryEmbedding = this.simpleEmbedding(options.input.text);
const semanticMatch = this.findSemanticMatch(
queryEmbedding,
config.semanticThreshold
);
if (semanticMatch) {
return {
source: 'l3',
cached: true,
result: semanticMatch.result,
similarity: semanticMatch.similarity,
latencyMs: Date.now() - startTime,
};
}
}
// Cache miss - call NeuroLink
const result = await this.neurolink.generate(options);
// Store in all tiers
this.l1Cache.set(exactKey, {
result,
expiresAt: Date.now() + (ttl * 1000),
});
await this.redis.setEx(`neurolink:l2:${exactKey}`, ttl, JSON.stringify(result));
if (config.semanticThreshold) {
this.semanticEntries.push({
key: exactKey,
embedding: this.simpleEmbedding(options.input.text),
result,
});
}
return {
source: 'miss',
cached: false,
result,
latencyMs: Date.now() - startTime,
};
}
private simpleEmbedding(text: string): number[] {
// Simplified embedding - use proper embeddings in production
const normalized = text.toLowerCase();
const embedding = new Array(64).fill(0);
for (let i = 0; i < normalized.length; i++) {
embedding[i % 64] += normalized.charCodeAt(i) / 1000;
}
const mag = Math.sqrt(embedding.reduce((s, v) => s + v * v, 0));
return embedding.map(v => v / (mag || 1));
}
private findSemanticMatch(
embedding: number[],
threshold: number
): { result: GenerateResult; similarity: number } | null {
let best: { result: GenerateResult; similarity: number } | null = null;
for (const entry of this.semanticEntries) {
const similarity = this.cosineSimilarity(embedding, entry.embedding);
if (similarity >= threshold && (!best || similarity > best.similarity)) {
best = { result: entry.result, similarity };
}
}
return best;
}
private cosineSimilarity(a: number[], b: number[]): number {
let dot = 0, magA = 0, magB = 0;
for (let i = 0; i < a.length; i++) {
dot += a[i] * b[i];
magA += a[i] * a[i];
magB += b[i] * b[i];
}
return dot / (Math.sqrt(magA) * Math.sqrt(magB) || 1);
}
async disconnect(): Promise<void> {
await this.redis.quit();
}
}
// Usage
async function main() {
const cache = new MultiTierLLMCache('redis://localhost:6379');
await cache.connect();
const result = await cache.generate(
{
input: { text: 'What is machine learning?' },
provider: 'vertex',
model: 'gemini-2.5-flash',
},
{ ttlSeconds: 3600, semanticThreshold: 0.9 }
);
console.log(`Source: ${result.source}`);
console.log(`Cached: ${result.cached}`);
console.log(`Latency: ${result.latencyMs}ms`);
await cache.disconnect();
}
Measuring Cache Effectiveness
Track comprehensive cache metrics:
import { createClient, RedisClientType } from 'redis';
class CacheMetrics {
private redis: RedisClientType;
private prefix = 'neurolink:metrics:';
constructor(redis: RedisClientType) {
this.redis = redis;
}
async recordAccess(result: {
cached: boolean;
source: string;
category?: string;
latencyMs: number;
tokensSaved?: number;
}): Promise<void> {
const hourBucket = Math.floor(Date.now() / 3600000);
await Promise.all([
// Overall hit/miss counts
this.redis.hIncrBy(
`${this.prefix}totals`,
result.cached ? 'hits' : 'misses',
1
),
// Per-source metrics
this.redis.hIncrBy(
`${this.prefix}sources`,
result.source,
1
),
// Per-category metrics (if provided)
result.category && this.redis.hIncrBy(
`${this.prefix}category:${result.category}`,
result.cached ? 'hits' : 'misses',
1
),
// Hourly trends
this.redis.hIncrBy(
`${this.prefix}hourly:${hourBucket}`,
result.cached ? 'hits' : 'misses',
1
),
// Latency tracking
this.redis.lPush(
`${this.prefix}latencies:${result.source}`,
result.latencyMs.toString()
),
// Token savings (cost estimation)
result.cached && result.tokensSaved && this.redis.incrByFloat(
`${this.prefix}tokens_saved`,
result.tokensSaved
),
]);
// Trim latency lists to last 1000 entries
await this.redis.lTrim(`${this.prefix}latencies:${result.source}`, 0, 999);
}
async getSummary(): Promise<{
totalRequests: number;
cacheHits: number;
cacheMisses: number;
hitRate: number;
tokensSaved: number;
estimatedCostSavings: number;
avgLatencyBySource: Record<string, number>;
}> {
const totals = await this.redis.hGetAll(`${this.prefix}totals`);
const tokensSaved = await this.redis.get(`${this.prefix}tokens_saved`);
const sources = await this.redis.hGetAll(`${this.prefix}sources`);
const hits = parseInt(totals.hits || '0');
const misses = parseInt(totals.misses || '0');
const total = hits + misses;
const savedTokens = parseFloat(tokensSaved || '0');
// Calculate average latencies per source
const avgLatencyBySource: Record<string, number> = {};
for (const source of Object.keys(sources)) {
const latencies = await this.redis.lRange(
`${this.prefix}latencies:${source}`,
0,
-1
);
if (latencies.length > 0) {
const sum = latencies.reduce((s, l) => s + parseFloat(l), 0);
avgLatencyBySource[source] = Math.round(sum / latencies.length);
}
}
return {
totalRequests: total,
cacheHits: hits,
cacheMisses: misses,
hitRate: total > 0 ? (hits / total) * 100 : 0,
tokensSaved: savedTokens,
estimatedCostSavings: savedTokens * 0.00001, // Rough estimate
avgLatencyBySource,
};
}
}
Best Practices Summary
Start simple: Begin with exact match caching using normalized keys before implementing semantic caching.
Use deterministic settings: Set temperature: 0 for queries where consistent responses are acceptable—this maximizes cache hit potential.
Layer your caches: Combine in-memory (L1), Redis (L2), and semantic (L3) caches for optimal performance across different access patterns.
Monitor obsessively: Track hit rates, latencies, and cost savings to validate your caching strategy.
Choose TTLs wisely: Match cache duration to content volatility—static FAQs can cache for days, while dynamic content needs shorter TTLs.
Implement graceful degradation: Your application should work even if the cache fails.
Consider data sensitivity: Some responses should never be cached (personalized data, PII-containing responses).
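The graceful-degradation point can be sketched concretely: wrap cache reads and writes in try/catch so a Redis outage degrades to a direct LLM call instead of an error. This is a minimal pattern, and the `SimpleCache` interface here is a hypothetical stand-in for any of the wrappers above, not a NeuroLink API:

```typescript
// Hypothetical minimal cache interface for illustration
interface SimpleCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// Wraps any cache so failures fall through to the underlying call
async function withGracefulCache(
  cache: SimpleCache,
  key: string,
  generate: () => Promise<string>
): Promise<string> {
  try {
    const hit = await cache.get(key);
    if (hit !== null) return hit;
  } catch {
    // Cache read failed (e.g. Redis down) - fall through to the fresh call
  }
  const fresh = await generate();
  try {
    await cache.set(key, fresh);
  } catch {
    // Cache write failed - the response is still returned to the caller
  }
  return fresh;
}

// Demo: a cache that always throws still yields a response
const brokenCache: SimpleCache = {
  get: async () => { throw new Error('connection refused'); },
  set: async () => { throw new Error('connection refused'); },
};

withGracefulCache(brokenCache, 'k', async () => 'fresh response')
  .then(r => console.log(r)); // prints "fresh response"
```

The same shape applies to the Redis and multi-tier classes above: treat every cache tier as optional, and only let the LLM call itself propagate errors.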
What’s Next
You have built a complete multi-tier caching system for NeuroLink applications. Here is the recommended implementation order:
- Start with the basic caching wrapper – wrap neurolink.generate() with the exact-match pattern and measure your baseline hit rate
- Add normalization – implement the NormalizedLLMCache to capture query variations and increase hit rates by 15-25%
- Deploy Redis – move to the RedisLLMCache for distributed caching across multiple application instances
- Add semantic caching – once you understand your query patterns, implement embedding-based matching for the highest hit rates
- Monitor everything – use the CacheMetrics class to track hit rates, latencies, and cost savings
With proper implementation, you will achieve 40-70% cost reduction while delivering sub-100ms response times for cached queries.