$0.03 per call adds up fast.
A naive RAG implementation sends the same 2,000-token System Prompt for every user query.
LLM Cost Engineering is the art of caching prompt prefixes, using smaller "Router Models" (like Haiku/Flash) to triage requests, and compressing context.
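The routing layer can be as small as one triage call. Below is a minimal sketch, not a prescribed implementation: `complete(model, prompt)` is a hypothetical helper wrapping whatever SDK you use, and the model names are placeholders.

```python
# Minimal router sketch. `complete(model, prompt)` is a hypothetical helper
# that wraps your SDK of choice and returns the model's text output.

CHEAP_MODEL = "haiku"        # placeholder: your small triage model
EXPENSIVE_MODEL = "sonnet"   # placeholder: your full-strength model

TRIAGE_PROMPT = (
    "Classify the user request as SIMPLE (lookup, short factual answer) "
    "or COMPLEX (multi-step reasoning, long synthesis). Reply with one word.\n\n"
    "Request: {query}"
)

def route(query: str, complete) -> str:
    """Triage with the cheap model; escalate to the expensive one only when needed."""
    label = complete(CHEAP_MODEL, TRIAGE_PROMPT.format(query=query)).strip().upper()
    model = CHEAP_MODEL if label.startswith("SIMPLE") else EXPENSIVE_MODEL
    return complete(model, query)
```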
Deep Dive: Prompt Caching
Providers like Anthropic (Haiku) and DeepSeek now support Prompt Caching. If your System Prompt + RAG Context (the "Prefix") is identical across requests, the API provider caches the KV states on the GPU.
Impact: up to 90% Cost Reduction and roughly 80% Latency Reduction for the cached portion. Always structure your prompt so static content comes first.
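As a concrete sketch, Anthropic's Messages API lets you mark the static prefix as cacheable with a `cache_control` block (DeepSeek applies prefix caching automatically on matching prefixes). The model name and placeholder strings below are illustrative, not requirements.

```python
import anthropic

# Static content first: system prompt + shared RAG context form the cacheable prefix.
STATIC_SYSTEM_PROMPT = "You are a support assistant for Acme Corp. Follow the policies below..."
RAG_CONTEXT = "<shared knowledge-base documents go here>"
user_query = "How do I reset my password?"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-haiku-latest",  # illustrative; any cache-capable model works
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT + "\n\n" + RAG_CONTEXT,
            "cache_control": {"type": "ephemeral"},  # mark the static prefix as cacheable
        }
    ],
    messages=[{"role": "user", "content": user_query}],  # dynamic part comes last
)
```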
02. Don't Pay Twice
If User A asks "Who is the CEO?" and User B asks "Who runs the company?", the LLM shouldn't run twice. Use Semantic Caching (Redis + Vectors) to serve the cached answer for semantically similar queries.
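Here is a minimal in-memory sketch of the idea; a production setup would back it with Redis vector search, `embed` stands in for whatever embedding function you use, and the similarity threshold is an assumption to tune per domain.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.92  # assumption: tune per domain and embedding model

class SemanticCache:
    """In-memory sketch; production versions back this with Redis vector search."""

    def __init__(self, embed):
        self.embed = embed   # hypothetical: text -> np.ndarray embedding
        self.entries = []    # list of (embedding, cached answer) pairs

    def get(self, query: str):
        q = self.embed(query)
        for vec, answer in self.entries:
            sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
            if sim >= SIMILARITY_THRESHOLD:
                return answer  # "Who runs the company?" hits the cached CEO answer
        return None            # cache miss: call the LLM, then put() the result

    def put(self, query: str, answer: str):
        self.entries.append((self.embed(query), answer))
```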
04. The Senior Engineer's Take
JSON Verbosity
When asking for JSON, every character in the key name counts.
Don't ask for { "customer_shipping_address": "..." }.
Ask for { "addr": "..." } and map it back in your code. Shortening keys alone can cut output tokens by roughly 20%.
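A small sketch of the mapping step; the keys beyond `addr` are made up for illustration.

```python
# Map terse keys from the model back to descriptive names in your code.
KEY_MAP = {
    "addr": "customer_shipping_address",
    "qty": "quantity",
    "sku": "product_sku",
}

def expand_keys(compact: dict) -> dict:
    """Rename short model-facing keys to the long names the rest of the codebase expects."""
    return {KEY_MAP.get(k, k): v for k, v in compact.items()}

model_output = {"addr": "221B Baker St", "qty": 2, "sku": "A-1138"}
order = expand_keys(model_output)
# {'customer_shipping_address': '221B Baker St', 'quantity': 2, 'product_sku': 'A-1138'}
```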
Context Stuffing vs RAG
Just because Gemini 1.5 Pro has a 2M context window doesn't mean you should dump your whole DB into it.
1. It costs $10 per call.
2. Latency is 60+ seconds.
RAG is still essential for latency and cost control, even when the raw capacity exists.
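A back-of-envelope sketch of why: the price and token counts below are assumptions, not any provider's rate card, but the ratio between the two approaches is the point.

```python
# Back-of-envelope comparison; numbers are assumptions, not quoted prices.
PRICE_PER_M_INPUT_TOKENS = 2.50   # USD, assumed long-context input rate
FULL_DUMP_TOKENS = 2_000_000      # the entire knowledge base stuffed into context
RAG_TOKENS = 4_000                # system prompt + top-k retrieved chunks

full_dump_cost = FULL_DUMP_TOKENS / 1_000_000 * PRICE_PER_M_INPUT_TOKENS
rag_cost = RAG_TOKENS / 1_000_000 * PRICE_PER_M_INPUT_TOKENS

print(f"Context stuffing: ${full_dump_cost:.2f} per call (input tokens only)")
print(f"RAG:              ${rag_cost:.4f} per call (input tokens only)")
print(f"Ratio:            {full_dump_cost / rag_cost:.0f}x")
```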