
Generating 10,000 Descriptions a Day: Cost-Optimized Content Pipelines

Build a production content pipeline that generates 10K+ AI descriptions daily while keeping costs under control with NeuroLink.


We designed a content pipeline that generates 10,000 product descriptions per day at a cost under $50. This deep dive examines the model tiering strategy (which tasks warrant GPT-4 vs. Gemini Flash), the caching layer that eliminates 40% of redundant generations, the quality scoring pipeline that catches bad outputs before they reach production, and the concurrency controls that prevent rate limit storms.

The business case for high-volume AI content is clear. E-commerce catalogs with thousands of SKUs, real estate platforms with hundreds of new listings daily, job boards with continuous postings, and content localization across multiple markets all need automated content at scale. But brute-force generation fails in three ways: cost explosion from using premium models for simple content, quality variance from inconsistent prompting, and zero observability into what you are spending and why.

NeuroLink’s approach combines model registry for cost-aware model selection, tiered generation with automatic quality-based escalation, middleware for spend tracking, and auto-evaluation as a quality gate. This post shows you how to build the complete pipeline.

Architecture: Cost-Optimized Content Pipeline

The pipeline follows a tiered model with caching and quality gates:

flowchart LR
    INPUT(["Content Queue<br/>10K+ items"]) --> CACHE{"Cache<br/>Hit?"}
    CACHE -->|"Yes"| CACHED(["Cached Result"])
    CACHE -->|"No"| TIER{"Cost Tier<br/>Selection"}

    TIER --> FAST["Tier 1: Fast/Cheap<br/>GPT-4o-mini, Gemini Flash"]
    TIER --> BALANCED["Tier 2: Balanced<br/>GPT-4o, Claude Sonnet"]
    TIER --> PREMIUM["Tier 3: Premium<br/>Claude Opus, GPT-4"]

    FAST & BALANCED & PREMIUM --> EVAL["Auto-Evaluation<br/>Quality Score"]
    EVAL -->|"Score >= 7"| OUTPUT(["Published Content"])
    EVAL -->|"Score < 7"| RETRY["Retry with<br/>Higher Tier"]
    RETRY --> BALANCED
    RETRY --> PREMIUM

    style INPUT fill:#3b82f6,stroke:#2563eb,color:#fff
    style EVAL fill:#6366f1,stroke:#4f46e5,color:#fff
    style OUTPUT fill:#22c55e,stroke:#16a34a,color:#fff

The key insight: start with the cheapest model that might produce acceptable quality. If it fails the quality gate, retry with a better model. Most items pass on the first tier, so your average cost stays low even though some items require premium models.

Model Selection: Using the Model Registry for Cost Analysis

NeuroLink’s model registry gives you programmatic access to model metadata – cost per token, context window, quality ratings, and capabilities. Use it to make data-driven model selection decisions.

import { NeuroLink } from '@juspay/neurolink';

const neurolink = new NeuroLink();

// Use the models CLI to find cost-effective models
// neurolink models search --use-case creative --max-cost 0.01
// neurolink models best --creative --cost-effective
// neurolink models compare gpt-4o-mini claude-3-haiku gemini-2.0-flash

Here is a cost comparison for content generation models:

| Model | Provider | Input Cost/1M tokens | Output Cost/1M tokens | Quality Rating | Best For |
|-------|----------|----------------------|-----------------------|----------------|----------|
| GPT-4o-mini | OpenAI | $0.15 | $0.60 | Good | Simple descriptions, short-form |
| Gemini 2.0 Flash | Google AI | $0.10 | $0.40 | Good | High-volume, fast turnaround |
| GPT-4o | OpenAI | $2.50 | $10.00 | Excellent | Complex, nuanced content |
| Claude Sonnet | Anthropic | $3.00 | $15.00 | Excellent | Detailed, structured content |
| Claude Opus | Anthropic | $15.00 | $75.00 | Premium | Complex reasoning, premium quality |

For a 10K daily pipeline averaging 500 input and 500 output tokens per description, the cost difference between tiers is significant:

  • Tier 1 (GPT-4o-mini): ~$3.75/day
  • Tier 2 (GPT-4o): ~$62.50/day
  • Tier 3 (Claude Opus): ~$450/day

If 85% of items pass on Tier 1, 12% need Tier 2, and 3% need Tier 3, your blended daily cost is approximately $24 (slightly more once you count the failed lower-tier attempts) – dramatically lower than running everything on Tier 2 or Tier 3.
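The blended figure follows directly from the per-tier daily costs above. A quick back-of-the-envelope check, using the assumed pass-rate distribution:

```typescript
// Blended daily cost from tier shares and per-tier daily costs (figures from the post).
const tiers = [
  { name: 'fast', dailyCost: 3.75, share: 0.85 },     // GPT-4o-mini
  { name: 'balanced', dailyCost: 62.5, share: 0.12 }, // GPT-4o
  { name: 'premium', dailyCost: 450, share: 0.03 },   // Claude Opus
];

const blendedDailyCost = tiers.reduce((sum, t) => sum + t.dailyCost * t.share, 0);
console.log(blendedDailyCost); // ≈ $24/day
```

Note this weighted average ignores the cost of the lower-tier attempts that failed before escalation, so treat it as a floor, not an exact forecast.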

Batching Strategy: Processing Items Efficiently

Processing 10,000 items one at a time wastes time on network overhead. Batching with concurrency control maximizes throughput while respecting rate limits.

import { NeuroLink } from '@juspay/neurolink';

const neurolink = new NeuroLink();

interface Product {
  id: string;
  name: string;
  category: string;
  attributes: Record<string, string>;
}

interface Description {
  id: string;
  content: string;
  model: string;
  tokensUsed: number;
}

function buildPrompt(item: Product): string {
  return `Write a compelling, SEO-friendly product description for:
Product: ${item.name}
Category: ${item.category}
Attributes: ${JSON.stringify(item.attributes)}

Requirements:
- 150-250 words
- Include key features and benefits
- Use natural language, avoid keyword stuffing
- End with a call to action`;
}

async function generateDescriptions(items: Product[]): Promise<Description[]> {
  const results: Description[] = [];

  // Process in batches of 20 with concurrency control
  for (let i = 0; i < items.length; i += 20) {
    const batch = items.slice(i, i + 20);

    const batchResults = await Promise.all(
      batch.map(item =>
        neurolink.generate({
          input: { text: buildPrompt(item) },
          provider: 'openai',
          model: 'gpt-4o-mini',
          systemPrompt: 'You are a product copywriter. Write compelling, SEO-friendly descriptions.',
        })
      )
    );

    results.push(...batchResults.map((r, idx) => ({
      id: batch[idx].id,
      content: r.content,
      model: 'gpt-4o-mini',
      tokensUsed: r.usage.total,
    })));

    // Rate limiting: pause between batches
    await sleep(1000);
  }

  return results;
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

Note: The batch size of 20 is a starting point. Tune it based on your provider’s rate limits. OpenAI allows higher concurrency than some other providers. Monitor for rate limit errors (429 responses) and back off dynamically.
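One way to back off dynamically is a small retry wrapper around each generation call. This is a sketch: the `status`/`statusCode` check is an assumption about the error shape your provider surfaces through NeuroLink, so inspect real errors and adjust the rate-limit detection accordingly.

```typescript
// Sketch: exponential backoff with jitter for rate-limit (429) errors.
// The err.status / err.statusCode check is an assumption about the error
// shape; adapt it to what your provider actually throws.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const isRateLimit = err?.status === 429 || err?.statusCode === 429;
      if (!isRateLimit || attempt >= maxRetries) throw err;
      // Exponential backoff capped at 30s, plus jitter to avoid thundering herds
      const delay = Math.min(baseDelayMs * 2 ** attempt, 30_000) + Math.random() * 250;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```

Wrapping each `neurolink.generate(...)` call in `withBackoff` means a 429 pauses only the affected request rather than stalling the whole batch.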

Concurrency Tuning

For maximum throughput, use a concurrency limiter instead of fixed batch sizes:

async function processWithConcurrency<T, R>(
  items: T[],
  processor: (item: T) => Promise<R>,
  concurrency: number = 10,
): Promise<R[]> {
  const results: R[] = []; // note: filled in completion order, not input order
  const executing = new Set<Promise<void>>();

  for (const item of items) {
    const p = processor(item).then(result => {
      results.push(result);
      executing.delete(p);
    });
    executing.add(p);

    if (executing.size >= concurrency) {
      await Promise.race(executing);
    }
  }

  await Promise.all(executing);
  return results;
}

// Process 10K items with 10 concurrent requests
const descriptions = await processWithConcurrency(
  products,
  (product) => generateSingleDescription(product),
  10,
);

Analytics Middleware: Tracking Cost Per Item

You cannot optimize what you do not measure. NeuroLink’s analytics middleware tracks token usage, response time, and provider details for every request.

const neurolink = new NeuroLink();

// Analytics middleware is configured separately through the MiddlewareFactory:
// const middleware = new MiddlewareFactory({
//   middlewareConfig: {
//     analytics: { enabled: true },
//   },
// });

// Analytics tracks per-request:
// - Token usage (input + output)
// - Response time
// - Provider and model used
// Access via result.experimental_providerMetadata.neurolink.analytics

Build a cost dashboard from the analytics data:

interface CostReport {
  totalItems: number;
  totalTokens: number;
  totalCost: number;
  averageCostPerItem: number;
  tierBreakdown: Record<string, { count: number; cost: number }>;
}

function buildCostReport(results: Description[]): CostReport {
  const tierCosts: Record<string, number> = {
    'gpt-4o-mini': 0.00075,    // per 1K tokens (blended)
    'gpt-4o': 0.00625,
    'claude-sonnet-4-5-20250929': 0.009,
    'claude-opus-4-6': 0.045,
  };

  const tierBreakdown: Record<string, { count: number; cost: number }> = {};
  let totalTokens = 0;
  let totalCost = 0;

  for (const result of results) {
    const costPer1K = tierCosts[result.model] || 0.005;
    const cost = (result.tokensUsed / 1000) * costPer1K;

    totalTokens += result.tokensUsed;
    totalCost += cost;

    if (!tierBreakdown[result.model]) {
      tierBreakdown[result.model] = { count: 0, cost: 0 };
    }
    tierBreakdown[result.model].count++;
    tierBreakdown[result.model].cost += cost;
  }

  return {
    totalItems: results.length,
    totalTokens,
    totalCost,
    averageCostPerItem: totalCost / results.length,
    tierBreakdown,
  };
}

Set up cost alerts to catch anomalies before they become expensive surprises:

function checkCostAlerts(report: CostReport): void {
  const DAILY_BUDGET = 50; // dollars
  const MAX_COST_PER_ITEM = 0.02;

  if (report.totalCost > DAILY_BUDGET) {
    console.error(`ALERT: Daily cost $${report.totalCost.toFixed(2)} exceeds budget $${DAILY_BUDGET}`);
  }

  if (report.averageCostPerItem > MAX_COST_PER_ITEM) {
    console.warn(`WARNING: Average cost per item $${report.averageCostPerItem.toFixed(4)} is high`);
  }
}

Auto-Evaluation: Quality Gates at Scale

Auto-evaluation is the mechanism that makes tiered model selection work. Every generated description is scored for quality. If it passes, it ships. If it fails, it is retried with a better model.

// Import path for MiddlewareFactory may vary by NeuroLink version
import { NeuroLink, MiddlewareFactory } from '@juspay/neurolink';

const neurolink = new NeuroLink();

// Auto-evaluation middleware is configured separately through the MiddlewareFactory:
const middleware = new MiddlewareFactory({
  middlewareConfig: {
    autoEvaluation: {
      enabled: true,
      config: {
        threshold: 7,
        blocking: true,
        onEvaluationComplete: async (evalResult) => {
          if (!evalResult.isPassing) {
            await logLowQuality(evalResult);
          }
        },
      },
    },
  },
});

The evaluation checks multiple dimensions:

  • Relevance: Does the description match the product attributes?
  • Accuracy: Are the claims factually correct?
  • Completeness: Does it cover all key features?
  • Style: Does it match the brand voice and tone?

Set different thresholds for different content types. Product descriptions for your homepage might require a score of 8+, while bulk catalog entries might accept a 6+.
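A simple way to express this is a per-content-type threshold map; the type names here are illustrative, not part of any NeuroLink API:

```typescript
// Illustrative per-content-type quality gates; tune to your catalog.
const qualityThresholds: Record<string, number> = {
  'homepage-feature': 8, // customer-facing flagship copy
  'category-page': 7,    // standard product descriptions
  'bulk-catalog': 6,     // long-tail SKUs
};

function thresholdFor(contentType: string): number {
  return qualityThresholds[contentType] ?? 7; // default gate for unknown types
}
```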

Note: Evaluation tokens add to your total cost. For 10K items, evaluation might add 15-20% to your token usage. Factor this into your cost calculations.

Token Cost Distribution

Understanding where your tokens go helps you optimize the right part of the pipeline:

pie title Token Cost Distribution (10K items)
    "Input Tokens (Prompts)" : 35
    "Output Tokens (Descriptions)" : 45
    "Evaluation Tokens" : 15
    "Retries" : 5

Output tokens are the largest cost driver because they are typically 2-4x more expensive than input tokens. Optimize by:

  • Setting a maximum word count in your prompts
  • Using concise system prompts (fewer input tokens)
  • Caching system prompts when your provider supports it
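To see why system prompt length matters at this volume, here is a rough character-based estimate. The ~4 characters/token heuristic is approximate for English prose; real tokenizers vary, so use your provider's tokenizer for exact counts.

```typescript
// Rough heuristic: ~4 characters per token for English prose.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

const verbosePrompt =
  'You are an expert e-commerce product copywriter with deep knowledge of ' +
  'SEO best practices, consumer psychology, and brand storytelling.';
const concisePrompt =
  'You are a product copywriter. Write compelling, SEO-friendly descriptions.';

// The system prompt is charged on every request, so savings multiply by 10K/day.
const dailyTokensSaved =
  (estimateTokens(verbosePrompt) - estimateTokens(concisePrompt)) * 10_000;
```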

Tiered Retry: Escalating to Better Models

The tiered retry pattern is where the real cost savings happen. Start cheap, escalate only when needed:

async function generateWithFallback(
  item: Product,
  tier: 'fast' | 'balanced' | 'premium' = 'fast'
): Promise<string> {
  const models = {
    fast: { provider: 'openai', model: 'gpt-4o-mini' },
    balanced: { provider: 'anthropic', model: 'claude-sonnet-4-5-20250929' },
    premium: { provider: 'anthropic', model: 'claude-opus-4-6' },
  };

  const config = models[tier];
  const result = await neurolink.generate({
    input: { text: buildPrompt(item) },
    provider: config.provider,
    model: config.model,
  });

  // Check quality score; a missing evaluation is treated as passing
  const score = result.evaluationResult?.finalScore;
  if (score !== undefined && score < 7 && tier !== 'premium') {
    const nextTier = tier === 'fast' ? 'balanced' : 'premium';
    console.log(`Item ${item.id}: Score ${score}, escalating to ${nextTier}`);
    return generateWithFallback(item, nextTier);
  }

  return result.content;
}

In practice, the distribution typically looks like:

  • 85% Tier 1: Simple products with clear attributes (shoes, basic electronics, standard apparel)
  • 12% Tier 2: Products needing nuanced descriptions (luxury goods, technical equipment)
  • 3% Tier 3: Complex products requiring detailed reasoning (industrial machinery, specialized tools)

This distribution means your effective cost is heavily weighted toward Tier 1 pricing.

Caching: Avoiding Duplicate Generation

If your catalog has similar products – variations in color, size, or minor attributes – caching can significantly reduce generation volume.

import crypto from 'crypto';

class ContentCache {
  private cache = new Map<string, string>();

  private hashPrompt(prompt: string): string {
    return crypto.createHash('sha256').update(prompt).digest('hex');
  }

  get(prompt: string): string | undefined {
    return this.cache.get(this.hashPrompt(prompt));
  }

  set(prompt: string, content: string): void {
    this.cache.set(this.hashPrompt(prompt), content);
  }

  get size(): number {
    return this.cache.size;
  }
}

const cache = new ContentCache();

async function generateWithCache(item: Product): Promise<string> {
  const prompt = buildPrompt(item);
  const cached = cache.get(prompt);

  if (cached) {
    console.log(`Cache hit for item ${item.id}`);
    return cached;
  }

  const content = await generateWithFallback(item);
  cache.set(prompt, content);
  return content;
}

NeuroLink supports multiple caching backends depending on your scale:

  • Memory cache: Simple in-process cache, good for single-run pipelines
  • File cache: Persistent across runs, good for development and testing
  • Redis cache: Distributed, good for production with multiple workers

For a 10K daily pipeline, even a 10% cache hit rate saves 1,000 API calls per day.

Observability: OpenTelemetry Integration

At scale, you need deep observability into your pipeline’s behavior. NeuroLink integrates with OpenTelemetry for metrics and tracing:

// Enable telemetry for pipeline monitoring
// Set NEUROLINK_TELEMETRY_ENABLED=true
// Set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

// TelemetryService tracks:
// - ai_requests_total (counter)
// - ai_request_duration_ms (histogram)
// - ai_tokens_used_total (counter)
// - ai_provider_errors_total (counter)

Key metrics to monitor in your pipeline dashboard:

| Metric | What It Tells You | Alert Threshold |
|--------|-------------------|-----------------|
| ai_requests_total by tier | Tier distribution drift | Tier 3 > 10% of total |
| ai_request_duration_ms p95 | Latency bottlenecks | > 5 seconds |
| ai_tokens_used_total daily | Cost tracking | > 150% of budget |
| ai_provider_errors_total | Provider health | > 1% error rate |
| Items processed per hour | Pipeline throughput | < 80% of target |
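The tier-drift alert from the first row can be computed directly from per-tier request counts. A sketch, where the tier names mirror the post's fast/balanced/premium naming:

```typescript
// Alert when Tier 3 (premium) exceeds 10% of total requests,
// matching the dashboard threshold above.
function tierDriftAlert(requestsByTier: Record<string, number>): boolean {
  const total = Object.values(requestsByTier).reduce((sum, n) => sum + n, 0);
  if (total === 0) return false;
  return (requestsByTier['premium'] ?? 0) / total > 0.10;
}
```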

Cost Optimization Checklist

Before deploying your pipeline, walk through this checklist:

  1. Use the cheapest model that meets quality thresholds. Do not default to GPT-4o for everything. Test GPT-4o-mini and Gemini Flash first.
  2. Cache aggressively for similar products. Hash your prompts and skip generation for duplicates.
  3. Batch requests to amortize overhead. Network latency per request adds up at 10K scale.
  4. Monitor token usage per content type. Some product categories may need longer descriptions (more tokens) than others.
  5. Set up alerts for cost anomalies. A prompt change that increases output length by 50% doubles your output token cost.
  6. Use shorter system prompts to reduce input tokens. Every token in your system prompt is charged on every request.
  7. Run evaluation in non-blocking mode for lower tiers. Only block on evaluation for Tier 2 and above.
  8. Schedule heavy processing during off-peak hours. Some providers offer better latency during off-peak times.

What’s Next

The architecture decisions we have described represent trade-offs that worked for our scale and constraints. The key engineering insights to take away: start with the simplest design that handles your current load, instrument everything so you can identify bottlenecks before they become outages, and resist premature abstraction until you have at least three concrete use cases demanding it. The implementation details will differ for your system, but the underlying constraints – latency budgets, failure domains, resource contention – are universal.



This post is licensed under CC BY 4.0 by the author.