LLM Cost Optimization: Practical Strategies to Reduce Your AI Spend
Practical strategies to reduce LLM costs - model selection, prompt optimization, external caching, and batching patterns with NeuroLink.
By the end of this guide, you will have a practical playbook for cutting your LLM costs by 40-70% without sacrificing output quality. You will implement tiered model selection, optimize prompts for token efficiency, build external caching with Redis, batch requests for throughput, and set up cost tracking that shows exactly where your money goes.
Every strategy uses NeuroLink’s multi-provider interface, so you can apply these patterns across OpenAI, Anthropic, Google, and any other provider you use.
Understanding Your LLM Cost Structure
Before optimizing, you need to understand where your money goes. LLM costs typically break down into several components.
Token-Based Pricing
Pricing Disclaimer: Figures below are approximate, based on publicly available provider pricing as of January 2026. LLM pricing changes frequently, so verify current rates on your provider's pricing page before making any cost projections.
Most LLM providers charge based on tokens—roughly 4 characters or 0.75 words per token. Costs are typically split between:
- Input tokens: The prompts and context you send to the model
- Output tokens: The responses generated by the model (usually 2-4x more expensive than input)
For example, GPT-4 Turbo charges approximately $10 per million input tokens and $30 per million output tokens. Claude Sonnet 4.5 sits at $3 per million input and $15 per million output. These differences matter significantly at scale.
Cost-Saving Tip: Consider GPT-4o as a more cost-effective alternative to GPT-4 Turbo. At $2.50 per million input tokens and $10 per million output tokens, GPT-4o offers similar capabilities at roughly 70% lower cost.
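To see how these rates compound, here is a back-of-envelope sketch (the `monthlyCost` helper is illustrative) comparing the approximate prices above for a workload of 10M input and 2M output tokens per month:

```typescript
// Rates are dollars per million tokens (approximate, January 2026 --
// verify current pricing before relying on these figures).
function monthlyCost(
  inputMillions: number,
  outputMillions: number,
  inputRate: number,   // $ per million input tokens
  outputRate: number   // $ per million output tokens
): number {
  return inputMillions * inputRate + outputMillions * outputRate;
}

const gpt4Turbo = monthlyCost(10, 2, 10, 30);   // $160
const claudeSonnet = monthlyCost(10, 2, 3, 15); // $60
const gpt4o = monthlyCost(10, 2, 2.5, 10);      // $45
console.log({ gpt4Turbo, claudeSonnet, gpt4o });
```

Even before any optimization work, choosing the right model for this workload is the difference between $45 and $160 a month.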
Hidden Cost Factors
Beyond raw token costs, consider:
- Retry overhead: Failed requests that consume tokens before erroring
- Verbose responses: Models that generate more text than necessary
- Context bloat: Sending unnecessary information in prompts
- Inefficient request patterns: Making many small requests instead of consolidated ones
Calculating Your True Cost Per Query
To optimize effectively, calculate your actual cost per meaningful business outcome:
True Cost = (Input Tokens x Input Rate + Output Tokens x Output Rate) x (1 + Retry Rate) / Success Rate
This formula accounts for retries and failures, giving you a realistic baseline for optimization efforts.
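The formula translates directly into code. A minimal sketch (the `trueCostPerQuery` name and the sample figures are illustrative; rates are dollars per million tokens):

```typescript
// retryRate and successRate are fractions, e.g. 0.05 = 5% retries.
function trueCostPerQuery(
  inputTokens: number,
  outputTokens: number,
  inputRatePerMillion: number,
  outputRatePerMillion: number,
  retryRate: number,
  successRate: number
): number {
  const rawCost =
    (inputTokens / 1_000_000) * inputRatePerMillion +
    (outputTokens / 1_000_000) * outputRatePerMillion;
  return (rawCost * (1 + retryRate)) / successRate;
}

// 1,000 input + 500 output tokens on a $3/$15 model,
// with a 5% retry rate and a 98% success rate.
const cost = trueCostPerQuery(1000, 500, 3, 15, 0.05, 0.98);
console.log(cost.toFixed(6)); // roughly $0.01125 per successful query
```

Note how retries and failures push the true cost about 7% above the naive token arithmetic here; at high retry rates the gap grows quickly.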
Strategy 1: Intelligent Model Selection
Not every request requires your most powerful—and expensive—model. Selecting the right model for each task can slash costs by 40-70% while maintaining quality where it matters.
The Tiered Approach
Tier 1 - Economy Models (Simple Tasks)
Use lightweight models like GPT-4o Mini, Claude Haiku 3.5, or Gemini Flash for:
- Simple classifications
- Data extraction from structured text
- Basic summarization
- Formatting and cleanup tasks
Cost: $0.15-1.00 per million input tokens
Tier 2 - Standard Models (Moderate Complexity)
Deploy mid-tier models like Claude Sonnet 4.5 or GPT-4o for:
- Content generation
- Code assistance
- Complex question answering
- Multi-step reasoning
Cost: $2.50-5 per million input tokens
Note: GPT-4 Turbo ($10/M input, $30/M output) is significantly more expensive than GPT-4o ($2.50/M input, $10/M output) while offering similar performance for most tasks. We recommend GPT-4o for standard workloads.
Tier 3 - Premium Models (Maximum Capability)
Reserve top-tier models like Claude Opus 4.5 (claude-opus-4-5-20251101) for:
- Critical business decisions
- Complex analysis requiring nuanced understanding
- Creative tasks demanding highest quality
- Tasks where errors carry significant consequences
Cost: Claude Opus 4.5: $5/M input, $25/M output (as of January 2026)
Note: Model names and IDs in code examples reflect versions available at time of writing. Model availability, naming conventions, and pricing change frequently. Always verify current model IDs with your provider’s documentation before deploying to production.
Implementing Model Selection with NeuroLink
With NeuroLink, switching between models and providers is straightforward:
import { NeuroLink } from '@juspay/neurolink';
const neurolink = new NeuroLink();
// Economy tier - simple classification task
async function classifyTicket(ticketText: string): Promise<string> {
const result = await neurolink.generate({
input: { text: `Classify this support ticket into one of: billing, technical, general.\n\nTicket: ${ticketText}` },
provider: 'openai',
model: 'gpt-4o-mini', // ~$0.15/M input tokens
});
return result.content;
}
// Standard tier - content generation
async function generateResponse(context: string, query: string): Promise<string> {
const result = await neurolink.generate({
input: { text: `Context: ${context}\n\nQuestion: ${query}\n\nProvide a helpful response.` },
provider: 'anthropic',
model: 'claude-sonnet-4-5-20250929', // ~$3/M input tokens
});
return result.content;
}
// Premium tier - complex analysis
async function analyzeContract(contractText: string): Promise<string> {
const result = await neurolink.generate({
input: { text: `Analyze this contract for risks and key terms:\n\n${contractText}` },
provider: 'anthropic',
model: 'claude-opus-4-5-20251101', // ~$5/M input tokens
});
return result.content;
}
Building a Simple Router
You can build routing logic based on your task requirements:
import { NeuroLink } from '@juspay/neurolink';
const neurolink = new NeuroLink();
type TaskComplexity = 'simple' | 'moderate' | 'complex';
interface ModelConfig {
provider: string;
model: string;
costPerMillionTokens: number;
}
const MODEL_TIERS: Record<TaskComplexity, ModelConfig> = {
simple: {
provider: 'openai',
model: 'gpt-4o-mini',
costPerMillionTokens: 0.15
},
moderate: {
provider: 'anthropic',
model: 'claude-sonnet-4-5-20250929',
costPerMillionTokens: 3
},
complex: {
provider: 'anthropic',
model: 'claude-opus-4-5-20251101',
costPerMillionTokens: 5
}
};
async function routedGenerate(
prompt: string,
complexity: TaskComplexity
): Promise<string> {
const config = MODEL_TIERS[complexity];
const result = await neurolink.generate({
input: { text: prompt },
provider: config.provider,
model: config.model
});
return result.content;
}
// Usage
const classification = await routedGenerate(
'Classify: "My order hasn\'t arrived"',
'simple'
);
const analysis = await routedGenerate(
'Analyze the market trends in this report...',
'complex'
);
Real-World Impact
Organizations implementing tiered model selection typically see:
- Before: 100% GPT-4 usage
- After: 60% economy, 30% standard, 10% premium
- Savings: 50-70% with minimal quality impact on appropriate tasks
Strategy 2: Prompt Optimization
The words you choose directly impact your costs. Optimizing prompts can reduce token consumption by 30-50%.
Minimize Input Tokens
Remove Redundant Context
Before (verbose):
You are a helpful assistant that specializes in customer support for our
e-commerce platform. You should always be polite and professional. When
answering questions, provide accurate information based on our policies.
Our company values customer satisfaction above all else. Please help the
following customer with their inquiry:
[500 tokens of context]
[Customer question]
After (optimized):
E-commerce support assistant. Answer based on provided policy context.
[200 tokens of essential context]
[Customer question]
Token reduction: ~40%
Use Concise System Prompts
// Cost-optimized system prompt
const systemPrompt = `Role: Support agent
Tone: Professional, helpful
Format: Bullet points for lists
Constraints: Max 3 sentences per point`;
Control Output Length
Explicit Length Instructions
import { NeuroLink } from '@juspay/neurolink';
const neurolink = new NeuroLink();
async function summarizeArticle(article: string): Promise<string> {
const result = await neurolink.generate({
input: {
text: `Summarize this article in exactly 3 bullet points, each under 20 words.
${article}`
},
provider: 'openai',
model: 'gpt-4o-mini',
maxTokens: 150 // Limit output tokens
});
return result.content;
}
Use Explicit Limits in Prompts
Instead of relying on stop sequences, include explicit instructions in your prompts:
const result = await neurolink.generate({
input: {
text: 'List exactly 3 reasons (no more):\n1.'
},
provider: 'openai',
model: 'gpt-4o-mini',
maxTokens: 200 // Hard limit as backup
});
Prompt Templates for Efficiency
Create reusable, token-efficient templates:
interface PromptTemplate {
name: string;
template: string;
estimatedInputTokens: number;
}
const TEMPLATES: Record<string, PromptTemplate> = {
classify: {
name: 'classification',
template: 'Classify into [{categories}]: {text}\nCategory:',
estimatedInputTokens: 20
},
summarize: {
name: 'summarization',
template: 'Summarize in {maxWords} words: {text}',
estimatedInputTokens: 15
},
extract: {
name: 'extraction',
template: 'Extract {fields} from: {text}\nJSON:',
estimatedInputTokens: 15
}
};
function buildPrompt(
templateName: string,
variables: Record<string, string>
): string {
const template = TEMPLATES[templateName];
if (!template) throw new Error(`Unknown template: ${templateName}`);
let prompt = template.template;
for (const [key, value] of Object.entries(variables)) {
prompt = prompt.replace(`{${key}}`, value);
}
return prompt;
}
// Usage - minimal tokens
const classifyPrompt = buildPrompt('classify', {
categories: 'billing, technical, general',
text: 'My payment failed'
});
// Result: "Classify into [billing, technical, general]: My payment failed\nCategory:"
Measuring Prompt Efficiency
Track tokens per prompt to identify optimization opportunities:
function estimateTokens(text: string): number {
// Rough estimation: ~4 characters per token
return Math.ceil(text.length / 4);
}
function analyzePromptEfficiency(
prompt: string,
response: string,
taskType: string
): void {
const inputTokens = estimateTokens(prompt);
const outputTokens = estimateTokens(response);
console.log(`Task: ${taskType}`);
console.log(`Input tokens: ${inputTokens}`);
console.log(`Output tokens: ${outputTokens}`);
console.log(`Ratio: ${(outputTokens / inputTokens).toFixed(2)}`);
}
Strategy 3: External Caching with Redis
Why pay to generate the same response twice? Implementing caching at the application level dramatically reduces redundant API calls.
⚠️ Note: This is application-level code you must implement. NeuroLink provides the SDK, not built-in caching. You’ll build caching logic yourself using Redis or similar external services as shown in the patterns below.
Simple Response Caching
import { NeuroLink } from '@juspay/neurolink';
import Redis from 'ioredis';
import crypto from 'crypto';
const neurolink = new NeuroLink();
const redis = new Redis(process.env.REDIS_URL);
const CACHE_TTL = 86400; // 24 hours in seconds
function generateCacheKey(prompt: string, model: string): string {
const hash = crypto.createHash('sha256')
.update(`${model}:${prompt}`)
.digest('hex');
return `llm:cache:${hash}`;
}
async function cachedGenerate(
prompt: string,
provider: string,
model: string
): Promise<{ text: string; cached: boolean }> {
const cacheKey = generateCacheKey(prompt, model);
// Check cache first
const cached = await redis.get(cacheKey);
if (cached) {
return { text: cached, cached: true };
}
// Cache miss - call LLM
const result = await neurolink.generate({
input: { text: prompt },
provider,
model
});
// Store in cache
await redis.setex(cacheKey, CACHE_TTL, result.content);
return { text: result.content, cached: false };
}
// Usage
const response = await cachedGenerate(
'What is the capital of France?',
'openai',
'gpt-4o-mini'
);
console.log(`Response: ${response.text}`);
console.log(`From cache: ${response.cached}`);
Implementing Similarity-Based Caching
For more sophisticated caching that handles paraphrased queries, you can use embeddings:
import { NeuroLink } from '@juspay/neurolink';
import Redis from 'ioredis';
const neurolink = new NeuroLink();
const redis = new Redis(process.env.REDIS_URL);
interface CachedResponse {
prompt: string;
response: string;
embedding: number[];
timestamp: number;
}
// Generate embedding for similarity comparison
// Note: NeuroLink's generate() is for text generation, not embeddings.
// Use your preferred embedding service directly (e.g., OpenAI's embeddings API).
async function getEmbedding(text: string): Promise<number[]> {
// Example using OpenAI embeddings API directly
const response = await fetch('https://api.openai.com/v1/embeddings', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
input: text,
model: 'text-embedding-3-small',
})
});
const data = await response.json();
return data.data[0]?.embedding || [];
}
function cosineSimilarity(a: number[], b: number[]): number {
if (a.length !== b.length) return 0;
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
async function semanticCachedGenerate(
prompt: string,
provider: string,
model: string,
similarityThreshold: number = 0.92
): Promise<{ text: string; cached: boolean; similarity?: number }> {
// Get embedding for the new prompt
const promptEmbedding = await getEmbedding(prompt);
// Check cached responses
// Production: use SCAN instead of KEYS to avoid blocking Redis on large keyspaces
const cachedKeys = await redis.keys('llm:semantic:*');
for (const key of cachedKeys) {
const cachedJson = await redis.get(key);
if (!cachedJson) continue;
const cached: CachedResponse = JSON.parse(cachedJson);
const similarity = cosineSimilarity(promptEmbedding, cached.embedding);
if (similarity >= similarityThreshold) {
return {
text: cached.response,
cached: true,
similarity
};
}
}
// Cache miss - call LLM
const result = await neurolink.generate({
input: { text: prompt },
provider,
model
});
// Store with embedding
const cacheEntry: CachedResponse = {
prompt,
response: result.content,
embedding: promptEmbedding,
timestamp: Date.now()
};
const cacheKey = `llm:semantic:${Date.now()}`;
await redis.setex(cacheKey, 86400, JSON.stringify(cacheEntry));
return { text: result.content, cached: false };
}
Cache Strategy by Use Case
Different applications need different caching strategies:
interface CacheConfig {
ttlSeconds: number;
similarityThreshold: number;
maxEntries: number;
}
const CACHE_CONFIGS: Record<string, CacheConfig> = {
// FAQ/Support - stable answers, long cache
support: {
ttlSeconds: 604800, // 7 days
similarityThreshold: 0.88,
maxEntries: 10000
},
// Content generation - fresh content, short cache
content: {
ttlSeconds: 3600, // 1 hour
similarityThreshold: 0.95,
maxEntries: 1000
},
// Code assistance - moderate caching
code: {
ttlSeconds: 86400, // 24 hours
similarityThreshold: 0.92,
maxEntries: 5000
}
};
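A small lookup helper keeps these settings from being scattered through the codebase. This sketch (the `resolveCacheConfig` name and the default values are illustrative) falls back to conservative settings, short TTL and strict similarity, for unrecognized workloads:

```typescript
interface CacheConfig {
  ttlSeconds: number;
  similarityThreshold: number;
  maxEntries: number;
}

// Abridged copy of the config table above.
const CONFIGS: Record<string, CacheConfig> = {
  support: { ttlSeconds: 604800, similarityThreshold: 0.88, maxEntries: 10000 },
  content: { ttlSeconds: 3600, similarityThreshold: 0.95, maxEntries: 1000 }
};

// Unknown use cases get a short TTL and a strict similarity threshold,
// the safe direction for both freshness and answer correctness.
const DEFAULT_CONFIG: CacheConfig = {
  ttlSeconds: 3600,
  similarityThreshold: 0.95,
  maxEntries: 1000
};

function resolveCacheConfig(useCase: string): CacheConfig {
  return CONFIGS[useCase] ?? DEFAULT_CONFIG;
}

console.log(resolveCacheConfig('support').ttlSeconds); // 604800
console.log(resolveCacheConfig('unknown').ttlSeconds); // 3600
```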
Monitoring Cache Performance
Track cache effectiveness:
interface CacheMetrics {
hits: number;
misses: number;
totalRequests: number;
estimatedSavings: number;
}
class CacheMonitor {
private metrics: CacheMetrics = {
hits: 0,
misses: 0,
totalRequests: 0,
estimatedSavings: 0
};
private costPerRequest: number;
constructor(avgCostPerRequest: number = 0.001) {
this.costPerRequest = avgCostPerRequest;
}
recordHit(): void {
this.metrics.hits++;
this.metrics.totalRequests++;
this.metrics.estimatedSavings += this.costPerRequest;
}
recordMiss(): void {
this.metrics.misses++;
this.metrics.totalRequests++;
}
getHitRate(): number {
if (this.metrics.totalRequests === 0) return 0;
return this.metrics.hits / this.metrics.totalRequests;
}
getReport(): string {
return `
Cache Performance Report
========================
Total Requests: ${this.metrics.totalRequests}
Cache Hits: ${this.metrics.hits}
Cache Misses: ${this.metrics.misses}
Hit Rate: ${(this.getHitRate() * 100).toFixed(1)}%
Estimated Savings: $${this.metrics.estimatedSavings.toFixed(2)}
`.trim();
}
}
Strategy 4: Request Batching Patterns
Processing requests individually incurs overhead. Batching similar requests reduces both costs and latency.
⚠️ Note: This is application-level code you must implement. NeuroLink provides the SDK, but batching logic must be built by your application. The patterns below show how to implement request batching yourself.
Simple Batch Processing
import { NeuroLink } from '@juspay/neurolink';
const neurolink = new NeuroLink();
interface BatchItem {
id: string;
prompt: string;
}
interface BatchResult {
id: string;
response: string;
}
async function processBatch(
items: BatchItem[],
provider: string,
model: string
): Promise<BatchResult[]> {
// Combine prompts into a single request
const combinedPrompt = items
.map((item, index) => `[${index + 1}] ${item.prompt}`)
.join('\n\n');
const systemPrompt = `Process each numbered item and respond with the same numbering format.`;
const result = await neurolink.generate({
input: {
text: `${systemPrompt}\n\n${combinedPrompt}`
},
provider,
model
});
// Parse responses (simplified - production code needs robust parsing)
const responses = result.content.split(/\[\d+\]/).filter(Boolean);
return items.map((item, index) => ({
id: item.id,
response: responses[index]?.trim() || ''
}));
}
// Usage
const items: BatchItem[] = [
{ id: '1', prompt: 'Classify: "Payment issue"' },
{ id: '2', prompt: 'Classify: "Shipping delay"' },
{ id: '3', prompt: 'Classify: "Product question"' }
];
const results = await processBatch(items, 'openai', 'gpt-4o-mini');
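The naive `split` above breaks if an answer itself contains bracketed text. A more robust sketch (the `parseNumberedResponses` name is illustrative) anchors markers at line starts and carries multi-line answers forward:

```typescript
// Parse a model response of the form "[1] answer\n[2] answer..." where
// answers may span multiple lines. Markers only count at line starts.
function parseNumberedResponses(text: string, count: number): string[] {
  const out: string[] = new Array(count).fill('');
  let current = -1;
  for (const line of text.split('\n')) {
    const m = line.match(/^\[(\d+)\]\s*(.*)$/);
    if (m) {
      current = parseInt(m[1], 10) - 1;
      if (current >= 0 && current < count) out[current] = m[2];
      continue;
    }
    // Continuation line: append to the current answer.
    if (current >= 0 && current < count) {
      out[current] = out[current] ? out[current] + '\n' + line : line;
    }
  }
  return out.map(s => s.trim());
}

const parsed = parseNumberedResponses('[1] billing\n[2] shipping\n[3] product', 3);
console.log(parsed); // [ 'billing', 'shipping', 'product' ]
```

Out-of-range markers are ignored rather than crashing the batch, so one malformed answer does not corrupt its neighbors.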
Time-Window Batching
For real-time applications, collect requests within a time window:
import { NeuroLink } from '@juspay/neurolink';
const neurolink = new NeuroLink();
interface QueuedRequest {
prompt: string;
resolve: (result: string) => void;
reject: (error: Error) => void;
}
class BatchQueue {
private queue: QueuedRequest[] = [];
private timer: NodeJS.Timeout | null = null;
private readonly windowMs: number;
private readonly maxBatchSize: number;
private readonly provider: string;
private readonly model: string;
constructor(options: {
windowMs?: number;
maxBatchSize?: number;
provider: string;
model: string;
}) {
this.windowMs = options.windowMs || 100;
this.maxBatchSize = options.maxBatchSize || 20;
this.provider = options.provider;
this.model = options.model;
}
async add(prompt: string): Promise<string> {
return new Promise((resolve, reject) => {
this.queue.push({ prompt, resolve, reject });
// Process immediately if batch is full
if (this.queue.length >= this.maxBatchSize) {
this.flush();
} else if (!this.timer) {
// Start timer for window
this.timer = setTimeout(() => this.flush(), this.windowMs);
}
});
}
private async flush(): Promise<void> {
if (this.timer) {
clearTimeout(this.timer);
this.timer = null;
}
if (this.queue.length === 0) return;
const batch = this.queue.splice(0, this.maxBatchSize);
try {
// Process batch
const combinedPrompt = batch
.map((req, i) => `[${i + 1}] ${req.prompt}`)
.join('\n');
const result = await neurolink.generate({
input: { text: `Respond to each numbered item:\n${combinedPrompt}` },
provider: this.provider,
model: this.model
});
// Parse and resolve
const responses = this.parseResponses(result.content, batch.length);
batch.forEach((req, i) => req.resolve(responses[i] || ''));
} catch (error) {
batch.forEach(req => req.reject(error as Error));
}
}
private parseResponses(text: string, count: number): string[] {
const responses: string[] = [];
const parts = text.split(/\[\d+\]/);
for (let i = 1; i <= count; i++) {
responses.push(parts[i]?.trim() || '');
}
return responses;
}
}
// Usage
const batcher = new BatchQueue({
windowMs: 50,
maxBatchSize: 10,
provider: 'openai',
model: 'gpt-4o-mini',
});
// These requests will be batched together
const [result1, result2, result3] = await Promise.all([
batcher.add('Summarize: Article 1...'),
batcher.add('Summarize: Article 2...'),
batcher.add('Summarize: Article 3...')
]);
Batch Optimization Strategies
Homogeneous Batching: Group similar task types together for better results.
type TaskType = 'classification' | 'summarization' | 'extraction';
class TypedBatchQueue {
private queues: Map<TaskType, BatchQueue> = new Map();
constructor(private provider: string, private model: string) {
const taskTypes: TaskType[] = ['classification', 'summarization', 'extraction'];
taskTypes.forEach(type => {
this.queues.set(type, new BatchQueue({
windowMs: 100,
maxBatchSize: 20,
provider,
model
}));
});
}
async process(taskType: TaskType, prompt: string): Promise<string> {
const queue = this.queues.get(taskType);
if (!queue) throw new Error(`Unknown task type: ${taskType}`);
return queue.add(prompt);
}
}
Strategy 5: Provider Failover for Cost Efficiency
Using multiple providers allows you to optimize for both cost and reliability:
import { NeuroLink } from '@juspay/neurolink';
const neurolink = new NeuroLink();
interface ProviderConfig {
provider: string;
model: string;
costPerMillionTokens: number;
priority: number;
}
const PROVIDERS: ProviderConfig[] = [
{ provider: 'anthropic', model: 'claude-sonnet-4-5-20250929', costPerMillionTokens: 3, priority: 1 },
{ provider: 'openai', model: 'gpt-4o', costPerMillionTokens: 2.5, priority: 2 }, // More cost-effective than GPT-4 Turbo
{ provider: 'openai', model: 'gpt-4o-mini', costPerMillionTokens: 0.15, priority: 3 },
{ provider: 'anthropic', model: 'claude-opus-4-5-20251101', costPerMillionTokens: 5, priority: 4 } // For premium tasks
];
async function generateWithFallback(
prompt: string,
preferCheapest: boolean = false
): Promise<{ text: string; provider: string; model: string }> {
// Sort by cost if preferring cheapest, otherwise by priority
const sorted = [...PROVIDERS].sort((a, b) =>
preferCheapest
? a.costPerMillionTokens - b.costPerMillionTokens
: a.priority - b.priority
);
let lastError: Error | null = null;
for (const config of sorted) {
try {
const result = await neurolink.generate({
input: { text: prompt },
provider: config.provider,
model: config.model
});
return {
text: result.content,
provider: config.provider,
model: config.model
};
} catch (error) {
lastError = error as Error;
console.warn(`Provider ${config.provider}/${config.model} failed, trying next...`);
}
}
throw lastError || new Error('All providers failed');
}
// Usage - prefer cheapest option
const result = await generateWithFallback(
'Classify this text...',
true // preferCheapest
);
Strategy 6: Cost Tracking and Monitoring
You cannot optimize what you do not measure. Implement cost tracking at the application level.
Simple Cost Tracker
interface UsageRecord {
timestamp: Date;
provider: string;
model: string;
inputTokens: number;
outputTokens: number;
cost: number;
tags: Record<string, string>;
}
interface CostRates {
inputPerMillion: number;
outputPerMillion: number;
}
// Pricing as of January 2026 - verify current rates at provider sites
const MODEL_COSTS: Record<string, CostRates> = {
'gpt-4-turbo': { inputPerMillion: 10, outputPerMillion: 30 },
'gpt-4o': { inputPerMillion: 2.5, outputPerMillion: 10 }, // Recommended over GPT-4 Turbo
'gpt-4o-mini': { inputPerMillion: 0.15, outputPerMillion: 0.6 },
'claude-opus-4-5-20251101': { inputPerMillion: 5, outputPerMillion: 25 }, // Claude Opus 4.5 - Jan 2026
'claude-sonnet-4-5-20250929': { inputPerMillion: 3, outputPerMillion: 15 },
'claude-3-5-haiku-20241022': { inputPerMillion: 0.8, outputPerMillion: 4 }
};
class CostTracker {
private records: UsageRecord[] = [];
calculateCost(
model: string,
inputTokens: number,
outputTokens: number
): number {
const rates = MODEL_COSTS[model];
if (!rates) return 0;
const inputCost = (inputTokens / 1_000_000) * rates.inputPerMillion;
const outputCost = (outputTokens / 1_000_000) * rates.outputPerMillion;
return inputCost + outputCost;
}
record(
provider: string,
model: string,
inputTokens: number,
outputTokens: number,
tags: Record<string, string> = {}
): void {
const cost = this.calculateCost(model, inputTokens, outputTokens);
this.records.push({
timestamp: new Date(),
provider,
model,
inputTokens,
outputTokens,
cost,
tags
});
}
getTotalCost(since?: Date): number {
return this.records
.filter(r => !since || r.timestamp >= since)
.reduce((sum, r) => sum + r.cost, 0);
}
getCostByModel(): Record<string, number> {
const costs: Record<string, number> = {};
for (const record of this.records) {
costs[record.model] = (costs[record.model] || 0) + record.cost;
}
return costs;
}
getCostByTag(tagKey: string): Record<string, number> {
const costs: Record<string, number> = {};
for (const record of this.records) {
const tagValue = record.tags[tagKey] || 'untagged';
costs[tagValue] = (costs[tagValue] || 0) + record.cost;
}
return costs;
}
getReport(): string {
const total = this.getTotalCost();
const byModel = this.getCostByModel();
let report = `
Cost Report
===========
Total Cost: $${total.toFixed(4)}
Total Requests: ${this.records.length}
By Model:
`;
for (const [model, cost] of Object.entries(byModel)) {
report += ` ${model}: $${cost.toFixed(4)}\n`;
}
return report;
}
}
// Usage with NeuroLink
import { NeuroLink } from '@juspay/neurolink';
const neurolink = new NeuroLink();
const tracker = new CostTracker();
async function trackedGenerate(
prompt: string,
provider: string,
model: string,
tags: Record<string, string> = {}
): Promise<string> {
const inputTokens = Math.ceil(prompt.length / 4); // Rough estimate
const result = await neurolink.generate({
input: { text: prompt },
provider,
model
});
const outputTokens = Math.ceil(result.content.length / 4); // Rough estimate
tracker.record(provider, model, inputTokens, outputTokens, tags);
return result.content;
}
// Track with tags for attribution
await trackedGenerate(
'Summarize this article...',
'openai',
'gpt-4o-mini',
{ team: 'marketing', feature: 'content-gen' }
);
console.log(tracker.getReport());
Setting Budget Alerts
class BudgetMonitor {
private dailyBudget: number;
private monthlyBudget: number;
private tracker: CostTracker;
private alertCallback: (message: string) => void;
constructor(options: {
dailyBudget: number;
monthlyBudget: number;
tracker: CostTracker;
onAlert: (message: string) => void;
}) {
this.dailyBudget = options.dailyBudget;
this.monthlyBudget = options.monthlyBudget;
this.tracker = options.tracker;
this.alertCallback = options.onAlert;
}
check(): void {
const now = new Date();
const startOfDay = new Date(now.getFullYear(), now.getMonth(), now.getDate());
const startOfMonth = new Date(now.getFullYear(), now.getMonth(), 1);
const dailyCost = this.tracker.getTotalCost(startOfDay);
const monthlyCost = this.tracker.getTotalCost(startOfMonth);
const dailyPercent = (dailyCost / this.dailyBudget) * 100;
const monthlyPercent = (monthlyCost / this.monthlyBudget) * 100;
if (dailyPercent >= 90) {
this.alertCallback(`CRITICAL: Daily budget at ${dailyPercent.toFixed(1)}%`);
} else if (dailyPercent >= 75) {
this.alertCallback(`WARNING: Daily budget at ${dailyPercent.toFixed(1)}%`);
}
if (monthlyPercent >= 90) {
this.alertCallback(`CRITICAL: Monthly budget at ${monthlyPercent.toFixed(1)}%`);
} else if (monthlyPercent >= 75) {
this.alertCallback(`WARNING: Monthly budget at ${monthlyPercent.toFixed(1)}%`);
}
}
}
Implementing Your Cost Optimization Strategy
Phase 1: Measure (Week 1-2)
- Implement cost tracking with the patterns shown above
- Tag all requests with attribution data (team, feature, use case)
- Establish baseline costs by application, model, and team
- Identify top cost drivers
Phase 2: Quick Wins (Week 3-4)
- Implement external caching for repetitive queries (expect 20-40% reduction)
- Optimize top 10 highest-volume prompts for token efficiency
- Switch simple tasks to economy-tier models
- Set up budget alerts
Phase 3: Deep Optimization (Month 2-3)
- Build routing logic to select models based on task complexity
- Implement batching for eligible workloads
- Fine-tune caching TTLs and similarity thresholds
- Add provider failover for reliability and cost optimization
Phase 4: Continuous Improvement (Ongoing)
- Monthly cost reviews and optimization sprints
- A/B testing for quality vs. cost tradeoffs
- Evaluate new models for cost efficiency
- Architecture reviews for systemic improvements
Measuring Success
Track these KPIs to validate your optimization efforts:
| Metric | Target | Measurement |
|---|---|---|
| Cost per query | -40% | Total spend / successful queries |
| Cache hit rate | >35% | Cache hits / total requests |
| Model tier distribution | 60/30/10 | Requests per tier |
| Quality score | Maintain baseline | User feedback, automated eval |
| P95 latency | <2s | Response time tracking |
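The first two KPIs fall straight out of counters you already track with the patterns above. A sketch (the `computeKpis` name is illustrative):

```typescript
interface KpiInputs {
  totalSpend: number;        // dollars over the measurement window
  successfulQueries: number; // queries that returned a usable result
  cacheHits: number;
  totalRequests: number;
}

// Cost per query and cache hit rate, per the KPI table above.
function computeKpis(k: KpiInputs): { costPerQuery: number; cacheHitRate: number } {
  return {
    costPerQuery: k.successfulQueries > 0 ? k.totalSpend / k.successfulQueries : 0,
    cacheHitRate: k.totalRequests > 0 ? k.cacheHits / k.totalRequests : 0
  };
}

const kpis = computeKpis({
  totalSpend: 50,
  successfulQueries: 1000,
  cacheHits: 400,
  totalRequests: 1000
});
console.log(kpis); // { costPerQuery: 0.05, cacheHitRate: 0.4 }
```

Compute these on the same window you use for budgets (daily and monthly) so trends line up with your alerts.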
Roadmap: Advanced Cost Features
While the strategies in this guide use patterns you can implement today, we are actively working on built-in cost optimization features for NeuroLink:
- Automatic model routing: ML-based task complexity detection for automatic tier selection
- Built-in semantic caching: Native caching with embedding-based similarity matching
- Cost dashboards: Real-time cost visualization and analytics
- Budget enforcement: Automatic throttling and degradation when budgets are exceeded
- Batch API support: Native integration with provider batch APIs for async workloads
Stay tuned to our changelog and documentation for updates on these features.
Conclusion
By now you have a complete cost optimization playbook: model tiering, prompt compression, response caching, request batching, provider failover, and cost tracking with budget alerts. Organizations implementing these patterns typically see 40-60% cost reduction within the first month and 50-70% within three months.
The implementation path:
- Week 1-2: Instrument cost tracking and establish baselines
- Week 3-4: Cache repetitive queries, optimize top prompts, switch simple tasks to economy models
- Month 2-3: Build routing logic, implement batching, tune caching thresholds
- Ongoing: Monthly cost reviews, A/B testing quality vs. cost, evaluate new models
Start measuring. The rest follows from the data.