$0.03 per call adds up fast.
A naive RAG implementation sends the same 2,000-token System Prompt for every user query.
LLM Cost Engineering is the art of caching prompt prefixes, using smaller "Router Models" (like Haiku/Flash) to triage requests, and compressing context.
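The routing layer can be as small as one triage call. Below is a minimal sketch, not a prescribed implementation: `complete(model, prompt)` is a hypothetical helper wrapping whatever SDK you use, and the model names are placeholders.

```python
# Minimal router sketch. `complete(model, prompt)` is a hypothetical helper
# that wraps your SDK of choice and returns the model's text output.

CHEAP_MODEL = "haiku"        # placeholder: your small triage model
EXPENSIVE_MODEL = "sonnet"   # placeholder: your full-strength model

TRIAGE_PROMPT = (
    "Classify the user request as SIMPLE (lookup, short factual answer) "
    "or COMPLEX (multi-step reasoning, long synthesis). Reply with one word.\n\n"
    "Request: {query}"
)

def route(query: str, complete) -> str:
    """Triage with the cheap model; escalate to the expensive one only when needed."""
    label = complete(CHEAP_MODEL, TRIAGE_PROMPT.format(query=query)).strip().upper()
    model = CHEAP_MODEL if label.startswith("SIMPLE") else EXPENSIVE_MODEL
    return complete(model, query)
```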
Deep Dive: Prompt Caching
Providers like Anthropic (Haiku) and DeepSeek now support Prompt Caching. If your System Prompt + RAG Context (the "Prefix") is identical across requests, the API provider caches the KV states on the GPU.
Impact: up to 90% Cost Reduction and roughly 80% Latency Reduction for the cached portion. Always structure your prompt so static content comes first.
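As a concrete sketch, Anthropic's Messages API lets you mark the static prefix as cacheable with a `cache_control` block (DeepSeek applies prefix caching automatically on matching prefixes). The model name and placeholder strings below are illustrative, not requirements.

```python
import anthropic

# Static content first: system prompt + shared RAG context form the cacheable prefix.
STATIC_SYSTEM_PROMPT = "You are a support assistant for Acme Corp. Follow the policies below..."
RAG_CONTEXT = "<shared knowledge-base documents go here>"
user_query = "How do I reset my password?"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-haiku-latest",  # illustrative; any cache-capable model works
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT + "\n\n" + RAG_CONTEXT,
            "cache_control": {"type": "ephemeral"},  # mark the static prefix as cacheable
        }
    ],
    messages=[{"role": "user", "content": user_query}],  # dynamic part comes last
)
```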
02. Don't Pay Twice
If User A asks "Who is the CEO?" and User B asks "Who runs the company?", the LLM shouldn't run twice. Use Semantic Caching (Redis + Vectors) to serve the cached answer for semantically similar queries.
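Here is a minimal in-memory sketch of the idea; a production setup would back it with Redis vector search, `embed` stands in for whatever embedding function you use, and the similarity threshold is an assumption to tune per domain.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.92  # assumption: tune per domain and embedding model

class SemanticCache:
    """In-memory sketch; production versions back this with Redis vector search."""

    def __init__(self, embed):
        self.embed = embed   # hypothetical: text -> np.ndarray embedding
        self.entries = []    # list of (embedding, cached answer) pairs

    def get(self, query: str):
        q = self.embed(query)
        for vec, answer in self.entries:
            sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
            if sim >= SIMILARITY_THRESHOLD:
                return answer  # "Who runs the company?" hits the cached CEO answer
        return None            # cache miss: call the LLM, then put() the result

    def put(self, query: str, answer: str):
        self.entries.append((self.embed(query), answer))
```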
04. The Senior Engineer's Take
JSON Verbosity
When asking for JSON, every character in the key name counts.
Don't ask for { "customer_shipping_address": "..." }.
Ask for { "addr": "..." } and map it back in your code. Shortening keys alone can cut output tokens by roughly 20%.
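A small sketch of the mapping step; the keys beyond `addr` are made up for illustration.

```python
# Map terse keys from the model back to descriptive names in your code.
KEY_MAP = {
    "addr": "customer_shipping_address",
    "qty": "quantity",
    "sku": "product_sku",
}

def expand_keys(compact: dict) -> dict:
    """Rename short model-facing keys to the long names the rest of the codebase expects."""
    return {KEY_MAP.get(k, k): v for k, v in compact.items()}

model_output = {"addr": "221B Baker St", "qty": 2, "sku": "A-1138"}
order = expand_keys(model_output)
# {'customer_shipping_address': '221B Baker St', 'quantity': 2, 'product_sku': 'A-1138'}
```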
Context Stuffing vs RAG
Just because Gemini 1.5 Pro has a 2M context window doesn't mean you should dump your whole DB into it.
1. It costs $10 per call.
2. Latency is 60+ seconds.
RAG is still essential for latency and cost control, even when the raw capacity exists.
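A back-of-envelope sketch of why: the price and token counts below are assumptions, not any provider's rate card, but the ratio between the two approaches is the point.

```python
# Back-of-envelope comparison; numbers are assumptions, not quoted prices.
PRICE_PER_M_INPUT_TOKENS = 2.50   # USD, assumed long-context input rate
FULL_DUMP_TOKENS = 2_000_000      # the entire knowledge base stuffed into context
RAG_TOKENS = 4_000                # system prompt + top-k retrieved chunks

full_dump_cost = FULL_DUMP_TOKENS / 1_000_000 * PRICE_PER_M_INPUT_TOKENS
rag_cost = RAG_TOKENS / 1_000_000 * PRICE_PER_M_INPUT_TOKENS

print(f"Context stuffing: ${full_dump_cost:.2f} per call (input tokens only)")
print(f"RAG:              ${rag_cost:.4f} per call (input tokens only)")
print(f"Ratio:            {full_dump_cost / rag_cost:.0f}x")
```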