
LLM Cost Engineering: Saving 90% on Tokens

AI Systems Architect · Featured Guide · 18 min read

$0.03 adds up fast.

A naive RAG implementation re-sends the same 2,000-token System Prompt with every user query.

LLM Cost Engineering is the art of caching prompt prefixes, using smaller "Router Models" (like Haiku/Flash) to triage requests, and compressing context.
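The "Router Model" pattern can be sketched as tiered dispatch. This is a minimal illustration, not any provider's API: `classify` here is a keyword-based stand-in for what would really be a call to a cheap model like Haiku or Flash.

```typescript
// Triage: a cheap classifier labels the query, and only hard queries
// ever reach the expensive frontier model.
type Tier = "cheap" | "frontier";

// Stand-in for a Haiku/Flash classification call (illustrative heuristic).
function classify(query: string): Tier {
  const hardSignals = ["analyze", "compare", "write code", "multi-step"];
  return hardSignals.some(s => query.toLowerCase().includes(s))
    ? "frontier"
    : "cheap";
}

// Stubbed model calls so the sketch is self-contained; in production
// these wrap your actual SDK clients.
async function route(query: string): Promise<string> {
  const tier = classify(query);
  return tier === "cheap"
    ? `cheap-model answer to: ${query}`
    : `frontier-model answer to: ${query}`;
}
```

Most support-desk traffic is simple; routing even 70% of it to a cheap tier cuts the blended per-query cost dramatically.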

01. Deep Dive: Prompt Caching

Providers such as Anthropic (Claude), OpenAI, and DeepSeek now support Prompt Caching. If your System Prompt + RAG Context (the "Prefix") is identical across requests, the API provider caches the KV states on the GPU.
Impact: up to 90% Cost Reduction and 80% Latency Reduction for the cached portion. Always structure your prompt so static content comes first.
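The "static content first" rule is easy to enforce in code. A minimal sketch (the assembly format here is illustrative, not any provider's API):

```typescript
// Assemble the prompt so the cacheable prefix (system prompt + RAG
// context) is byte-identical across requests; only the user query varies.
const SYSTEM_PROMPT = "You are a support assistant for Acme Corp.";

function buildPrompt(ragContext: string, userQuery: string): string {
  // Static, cache-friendly content first...
  const prefix = `${SYSTEM_PROMPT}\n\n<context>\n${ragContext}\n</context>`;
  // ...volatile content last, so it never invalidates the cached prefix.
  return `${prefix}\n\nUser: ${userQuery}`;
}

const a = buildPrompt("Docs v1", "Who is the CEO?");
const b = buildPrompt("Docs v1", "Who runs the company?");
// a and b share an identical prefix, so the provider's KV cache hits.
```

If you interleave per-user data (timestamps, user IDs) into the system prompt, the prefix changes on every call and the cache never hits.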

02. Don't Pay Twice

If User A asks "Who is the CEO?" and User B asks "Who runs the company?", the LLM shouldn't run twice. Use Semantic Caching (Redis + Vectors) to serve the cached answer for semantically similar queries.

// middleware.ts
const vector = await embed(userQuery);
const cached = await redis.similaritySearch(vector, 0.95);

if (cached) {
  return cached.response; // Cost: $0.00
}

const reply = await callLLM(); // Cost: $0.01
await redis.save(vector, reply);
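The 0.95 passed to the similarity search is a cosine-similarity threshold. Here is a self-contained sketch of what that comparison does under the hood (a real system delegates this to a vector store rather than a linear scan):

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// A cache hit is any stored vector whose similarity clears the threshold.
function isCacheHit(query: number[], stored: number[][], threshold = 0.95): boolean {
  return stored.some(v => cosine(query, v) >= threshold);
}
```

Tune the threshold to your domain: too low and users get stale answers to different questions; too high and paraphrases stop hitting the cache.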

03. The Senior Engineer's Take

JSON Verbosity

When asking for JSON, every character in the key name counts.

Don't ask for { "customer_shipping_address": "..." }. Ask for { "addr": "..." } and map it in your code. You can save 20% on output tokens just by shortening keys.
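A sketch of the short-key pattern: the model emits terse keys, and a thin mapping layer restores readable names after parsing (the key names here are illustrative):

```typescript
// The model is instructed to output compact keys...
const KEY_MAP: Record<string, string> = {
  addr: "customer_shipping_address",
  qty: "quantity_ordered",
};

// ...and we expand them back to domain names in our own code, for free.
function expandKeys(raw: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(raw).map(([k, v]) => [KEY_MAP[k] ?? k, v])
  );
}

const modelOutput = '{"addr": "1 Main St", "qty": 2}';
const expanded = expandKeys(JSON.parse(modelOutput));
// expanded.customer_shipping_address === "1 Main St"
```

The mapping runs on your CPU, where characters are free; the long key names only ever existed where tokens cost money.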

Context Stuffing vs RAG

Just because Gemini 1.5 Pro has a 2M context window doesn't mean you should dump your whole DB into it.
1. It costs $10 per call.
2. Latency is 60+ seconds.
RAG is still essential for Latency and Cost control, even if capacity exists.
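Back-of-envelope math makes the point. Assuming an illustrative input price of $5 per million tokens (check your provider's current rate card, these change often):

```typescript
// Cost of stuffing the full context window vs. a RAG-trimmed prompt.
const PRICE_PER_MILLION_INPUT = 5.0; // illustrative, not a quoted rate

function inputCost(tokens: number): number {
  return (tokens / 1_000_000) * PRICE_PER_MILLION_INPUT;
}

const stuffed = inputCost(2_000_000); // whole DB in the window -> $10.00
const trimmed = inputCost(4_000);     // top-k retrieved chunks  -> $0.02
```

A 500x per-call difference, before you even count the latency of prefilling two million tokens.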

⚡ Interactive Playground

import React, { useState } from 'react';

// 💰 Cost Savings Visualizer

export default function CostDemo() {
    const [queries, setQueries] = useState([]);
    const [balance, setBalance] = useState(10.00);
    const [savings, setSavings] = useState(0);

    const ask = (text) => {
        let cost = 0.05;
        let isHit = false;

        const normalized = text.toLowerCase();

        // Simulated semantic cache: these query pairs are treated as
        // "embedding close enough" to clear the similarity threshold.
        const similar = queries.find(q => {
            if (normalized === "reset password" && q.text === "forgot password") return true;
            if (normalized === "pricing" && q.text === "how much is it") return true;
            return q.text === text; // Exact match
        });

        if (similar) {
            cost = 0;
            isHit = true;
            setSavings(s => s + 0.05);
        } else {
            setBalance(b => b - cost);
        }

        setQueries(prev => [{ id: Date.now(), text, isHit, cost }, ...prev].slice(0, 5));
    };

    return (
        <div className="bg-slate-50 dark:bg-slate-950 p-8 rounded-3xl border border-slate-200 dark:border-slate-800 shadow-xl">
             <div className="flex justify-between items-center mb-10">
                <h3 className="text-2xl font-black text-gray-900 dark:text-white flex items-center gap-3">
                    <span className="text-yellow-500">💰</span> Token Optimizer
                </h3>
                <div className="text-right">
                    <div className="text-2xl font-bold text-gray-800 dark:text-white">${balance.toFixed(2)}</div>
                    <div className="text-xs text-green-500 font-bold">Saved: ${savings.toFixed(2)}</div>
                </div>
            </div>

            <div className="flex flex-col md:flex-row gap-8">
                
                {/* Input Controls */}
                <div className="w-full md:w-1/3 space-y-4">
                     <p className="text-sm text-gray-500 mb-2">Simulate User Queries:</p>
                     
                     <button onClick={() => ask("forgot password")} className="w-full p-3 bg-white dark:bg-slate-900 border border-slate-200 dark:border-slate-800 rounded-xl text-left hover:bg-gray-50 dark:hover:bg-slate-800 transition">
                        "Forgot password"
                     </button>
                     <button onClick={() => ask("reset password")} className="w-full p-3 bg-white dark:bg-slate-900 border border-slate-200 dark:border-slate-800 rounded-xl text-left hover:bg-gray-50 dark:hover:bg-slate-800 transition font-bold text-blue-500">
                        "Reset password" (Similar)
                     </button>
                     
                     <div className="h-4"></div>

                     <button onClick={() => ask("how much is it")} className="w-full p-3 bg-white dark:bg-slate-900 border border-slate-200 dark:border-slate-800 rounded-xl text-left hover:bg-gray-50 dark:hover:bg-slate-800 transition">
                        "How much is it?"
                     </button>
                     <button onClick={() => ask("pricing")} className="w-full p-3 bg-white dark:bg-slate-900 border border-slate-200 dark:border-slate-800 rounded-xl text-left hover:bg-gray-50 dark:hover:bg-slate-800 transition font-bold text-blue-500">
                        "Pricing" (Similar)
                     </button>
                </div>

                {/* Request Log */}
                <div className="flex-1 bg-slate-200 dark:bg-slate-900/50 rounded-2xl p-6 relative min-h-[300px]">
                    <div className="absolute top-4 right-4 flex items-center gap-2 text-xs font-bold text-gray-500 uppercase">
                        <span>🧊</span> Redis Semantic Cache
                    </div>

                    <div className="space-y-3 mt-8">
                        {queries.length === 0 && <div className="text-center text-gray-400 mt-10">Waiting for requests...</div>}
                        
                        {queries.map(q => (
                            <div key={q.id} className="flex items-center justify-between p-4 bg-white dark:bg-slate-900 rounded-xl shadow-sm border border-slate-200 dark:border-slate-800 animate-in slide-in-from-right-4">
                                <div>
                                    <div className="font-bold text-gray-800 dark:text-white">"{q.text}"</div>
                                    <div className="text-xs text-gray-400">{q.isHit ? 'Served from Cache' : 'Sent to LLM API'}</div>
                                </div>
                                <div className={`px-3 py-1 rounded-full text-xs font-bold flex items-center gap-2 ${q.isHit ? 'bg-green-100 text-green-700 dark:bg-green-900/30 dark:text-green-400' : 'bg-red-100 text-red-700 dark:bg-red-900/30 dark:text-red-400'}`}>
                                    {q.isHit ? <span>⚡</span> : <span>💸</span>}
                                    {q.isHit ? 'HIT ($0.00)' : 'MISS ($0.05)'}
                                </div>
                            </div>
                        ))}
                    </div>
                </div>

            </div>
        </div>
    );
}