
Why Most LLM Pipelines Fail in Production

Demos are easy. Production is not. After shipping multiple LLM systems, here's the real failure taxonomy — and how to fix each one before it hits your users.

Egor Dultsev
Senior Engineer
7 min read

Most LLM integrations work flawlessly in demos. The dev environment is small, the prompts are tuned, and you control the input. Then you ship to production. Then the real fun starts.

After building several LLM-powered systems — an onboarding automation layer, a content generation pipeline, and a document intelligence tool — I’ve developed a taxonomy of how these things actually fail.

Failure Mode 1: Context Explosion

The most common issue I see in LLM codebases is unbounded context growth. Teams serialize the full conversation history on every request. Works fine for 5 messages. Costs $0.10+ per request after 20.

In a high-frequency system this is not a budget problem — it’s a latency problem. GPT-4-turbo with 50K tokens in context is not fast.

The fix:

interface Turn {
  role: 'user' | 'assistant' | 'system';
  content: string;
}

interface AgentMemory {
  // Verbatim last N turns — recency matters
  recentTurns: Turn[];
  // Compressed summary of everything older
  historySummary: string;
  // Structured state — never passed to the model raw
  workingState: Record<string, unknown>;
}

function buildContext(memory: AgentMemory, tokenBudget: number): string {
  // Always reserve 20% of the budget for the response
  const available = Math.floor(tokenBudget * 0.8);
  // ... trim recentTurns and compress historySummary to fit `available`
}

Track token count per request from day one. You’ll need the data.
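To make that budget concrete, here is a minimal sketch of recency-based trimming. The chars/4 token estimate is a crude stand-in for a real tokenizer (such as tiktoken), and the `Turn` shape is illustrative; swap in real counting before relying on the numbers.

```typescript
// Budget-aware trimming: keep the most recent turns that fit, drop the oldest.
// estimateTokens uses a rough ~4 chars/token heuristic — replace with a real
// tokenizer in production.
interface Turn {
  role: 'user' | 'assistant';
  content: string;
}

function estimateTokens(text: string): number {
  // Rough heuristic: ~4 characters per token for English text
  return Math.ceil(text.length / 4);
}

function trimToBudget(turns: Turn[], tokenBudget: number): Turn[] {
  const kept: Turn[] = [];
  let used = 0;
  // Walk backwards so recency wins; stop at the first turn that overflows
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = estimateTokens(turns[i].content);
    if (used + cost > tokenBudget) break;
    kept.unshift(turns[i]);
    used += cost;
  }
  return kept;
}
```

Turns that fall off the window shouldn't vanish: feed them into the compressed history summary instead.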

Failure Mode 2: Trusting Structured Output

LLMs will hallucinate fields, return null where you expect a string, and occasionally produce valid-looking JSON with keys that don’t exist in your schema.

Every LLM response that enters a downstream system needs validation:

import { z } from 'zod';

const OnboardingOutput = z.object({
  intent: z.enum(['exchange', 'inquiry', 'support']),
  urgency: z.number().min(0).max(1),
  suggestedRoute: z.string().optional(),
});

type OnboardingOutput = z.infer<typeof OnboardingOutput>;

async function parseWithRetry(raw: string, maxRetries = 2): Promise<OnboardingOutput> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (parseError) {
    // Malformed JSON is a validation failure too — don't let it crash the caller
    if (maxRetries > 0) {
      const correction = await llm.correct(raw, String(parseError));
      return parseWithRetry(correction, maxRetries - 1);
    }
    throw new Error('LLM output was not valid JSON after retries');
  }

  const result = OnboardingOutput.safeParse(parsed);

  if (result.success) return result.data;

  if (maxRetries > 0) {
    // Feed the validation error back to the model
    const correction = await llm.correct(raw, result.error.message);
    return parseWithRetry(correction, maxRetries - 1);
  }

  throw new Error('LLM output failed validation after retries');
}

This pattern — validate, feed error back, retry — recovers ~85% of malformed outputs automatically.

Failure Mode 3: No Observability

An LLM call is a black box by default. If something goes wrong in production, you’re debugging blind. Before you ship anything:

  • Log full prompts (sanitized for PII) with a trace ID
  • Log input and output token counts per call
  • Log latency breakdown by step
  • Wire these to your existing APM (Datadog, Grafana, whatever you have)

Without this, every bug is a multi-hour investigation.
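Those four bullets can be collapsed into one structured record per call. The field names and the email-redaction regex below are illustrative assumptions, not a standard; adapt them to whatever your APM expects.

```typescript
// One log record per LLM call: trace ID, sanitized prompt, token counts,
// latency. Emit this to your APM as a single structured event.
interface LlmCallLog {
  traceId: string;
  prompt: string; // sanitized before logging
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
}

function redactPii(text: string): string {
  // Minimal example: mask email addresses. Real PII scrubbing needs more rules.
  return text.replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, '[email]');
}

function buildCallLog(
  traceId: string,
  prompt: string,
  usage: { inputTokens: number; outputTokens: number },
  latencyMs: number
): LlmCallLog {
  return {
    traceId,
    prompt: redactPii(prompt),
    inputTokens: usage.inputTokens,
    outputTokens: usage.outputTokens,
    latencyMs,
  };
}
```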

Failure Mode 4: Synchronous Inference in the Critical Path

An LLM call takes 1–8 seconds. If that call is blocking your API response, your p99 latency is 8+ seconds. That’s unacceptable for most user-facing features.

Solutions, in order of preference:

  1. Stream the response — send tokens as they arrive, update the UI progressively
  2. Move to async — trigger inference via a job queue (BullMQ), poll or push results via WebSocket
  3. Pre-compute — run inference before the user asks (if patterns are predictable)
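Option 1 can be sketched in a few lines. Here `fakeStream` stands in for a provider SDK's async-iterable response (an assumption, not a real API); the shape of the loop is the point: forward each chunk immediately, accumulate the full text for logging and persistence.

```typescript
// Stand-in for a provider SDK's streaming response (hypothetical).
async function* fakeStream(text: string): AsyncGenerator<string> {
  for (const word of text.split(' ')) {
    yield word + ' ';
  }
}

// Forward chunks to the client as they arrive; return the complete text.
async function streamToClient(
  stream: AsyncIterable<string>,
  send: (chunk: string) => void
): Promise<string> {
  let full = '';
  for await (const chunk of stream) {
    send(chunk); // e.g. res.write(chunk) for SSE, or a WebSocket emit
    full += chunk;
  }
  return full.trimEnd(); // keep the full response for logging/persistence
}
```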

In the OTP exchange platform I built, LLM-driven onboarding runs asynchronously via BullMQ. The user sees the UI immediately; the intelligent responses arrive within 2–3 seconds and update in real-time via WebSocket. No blocking.

Failure Mode 5: No Fallback Strategy

LLM APIs have outages. Rate limits get hit. Model versions change and break your prompts overnight.

Design every integration with a degraded path:

async function getOnboardingResponse(ctx: UserContext): Promise<string> {
  try {
    return await primaryModel.generate(ctx);
  } catch (primaryError) {
    logger.warn('Primary model failed', { error: primaryError, ctx });
    try {
      return await fallbackModel.generate(ctx);
    } catch (fallbackError) {
      logger.error('Fallback model also failed', { error: fallbackError, ctx });
      // Always have a deterministic last resort
      return getCannedResponse(ctx.intent);
    }
  }
}

The degraded path must return something useful. Users forgive slow. They don’t forgive silent failure.

The Honest Summary

LLM pipelines are distributed systems. They have the same failure modes: latency, cost, observability, fallback. The teams that ship reliable AI features aren’t the ones with the best prompts — they’re the ones who applied distributed systems discipline to the problem.

Treat your LLM calls like external HTTP calls to an unreliable third party. Because that’s exactly what they are.