Building Production RAG Systems: Architecture, Costs, and Lessons Learned
I built 3 production RAG systems serving 500K users. Real architecture, cost breakdown, and lessons from 6 months in production.
RAG (Retrieval-Augmented Generation) sounds simple in theory: retrieve relevant documents, feed them to an LLM, get better answers. In practice, building a production RAG system that actually works is way harder than it looks.
I built 3 production RAG systems over 6 months, serving 500K users with 10M queries. Here's the real architecture, cost breakdown ($12K/month), and the painful lessons I learned.
The Three Systems I Built
System 1: Customer Support Bot (SaaS Documentation)
- Data: 2,500 docs, 5M tokens
- Users: 200K/month
- Queries: 3M/month
- Cost: $4,200/month
System 2: Legal Document Analysis (Enterprise)
- Data: 50K docs, 200M tokens
- Users: 5K/month
- Queries: 150K/month
- Cost: $6,800/month
System 3: Code Search (Developer Tool)
- Data: 1M code files, 500M tokens
- Users: 300K/month
- Queries: 7M/month
- Cost: $1,000/month
💡 Key insight: Cost per query varies by over 300x depending on architecture choices. System 3 costs $0.00014/query vs System 2 at $0.045/query.
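The per-query numbers fall straight out of the figures above:

```javascript
// Cost per query = monthly cost / monthly queries, per system above
const systems = [
  { name: "Support Bot",    monthlyCost: 4200, monthlyQueries: 3_000_000 },
  { name: "Legal Analysis", monthlyCost: 6800, monthlyQueries: 150_000 },
  { name: "Code Search",    monthlyCost: 1000, monthlyQueries: 7_000_000 },
];

for (const s of systems) {
  console.log(`${s.name}: $${(s.monthlyCost / s.monthlyQueries).toFixed(5)}/query`);
}
// Code Search comes out around $0.00014/query, Legal Analysis around $0.045/query
```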
Production RAG Architecture
The Stack That Actually Works
┌─────────────┐
│ User Query  │
└──────┬──────┘
       │
       ▼
┌─────────────────────┐
│  Query Processing   │ ← Rewrite, expand, classify
│   (GPT-4o Mini)     │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│  Embedding Model    │ ← text-embedding-3-large
│     (OpenAI)        │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│   Vector Search     │ ← Qdrant (top-20 results)
│ + Metadata Filter   │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│     Reranking       │ ← Cohere Rerank (top-5)
│     (Cohere)        │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│  Context Assembly   │ ← Build prompt with context
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│   LLM Generation    │ ← GPT-4 Turbo / Claude 3.5
│ (OpenAI/Anthropic)  │
└──────┬──────────────┘
       │
       ▼
┌─────────────────────┐
│ Response + Sources  │
└─────────────────────┘

Component Breakdown
1. Query Processing (Optional but Recommended)
Use a cheap LLM to rewrite user queries for better retrieval:
// Before: "how do i fix this error?"
// After: "troubleshooting authentication error in API integration"
const rewrittenQuery = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{
    role: "system",
    content: "Rewrite this query to be more specific and searchable"
  }, {
    role: "user",
    content: userQuery
  }]
});

Cost: $0.0001/query | Impact: +15% retrieval accuracy
2. Embedding Model
Convert text to vectors for semantic search:
const embedding = await openai.embeddings.create({
  model: "text-embedding-3-large",
  input: rewrittenQuery
});

Options:
- OpenAI text-embedding-3-large: $0.00013/1K tokens (best quality)
- Cohere embed-v3: $0.0001/1K tokens (good multilingual)
- Open-source (BGE-large): Free (self-host, lower quality)
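Whichever model you pick, retrieval is just nearest-neighbor search over these vectors, typically by cosine similarity. The vector database computes this internally; a toy sketch for intuition:

```javascript
// Cosine similarity between two embedding vectors
// (your vector database does this for you; shown only for intuition)
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dim vectors; real embeddings from text-embedding-3-large have 3072 dims
console.log(cosineSimilarity([1, 0, 1], [1, 0, 1])); // → 1 (same direction)
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0])); // → 0 (unrelated)
```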
3. Vector Database
Store and search embeddings:
const results = await qdrant.search("docs", {
  vector: embedding.data[0].embedding,
  limit: 20,
  filter: {
    must: [
      { key: "category", match: { value: "api-docs" } },
      { key: "updated_at", range: { gte: "2025-01-01" } }
    ]
  }
});

Why top-20, not top-5? Reranking improves results. Retrieve more, then rerank.
4. Reranking (Critical for Quality)
Rerank top-20 results to get best top-5:
const reranked = await cohere.rerank({
  model: "rerank-english-v3.0",
  query: userQuery,
  documents: results.map(r => r.payload.text),
  topN: 5
});

Cost: $0.002/query | Impact: +25% answer quality
🔥 Reranking is the secret sauce — It improved our answer quality more than any other single change. Don't skip this.
5. LLM Generation
Generate final answer with retrieved context:
const context = reranked.results.map(r => r.document.text).join("\n\n");
const answer = await openai.chat.completions.create({
  model: "gpt-4-turbo",
  messages: [{
    role: "system",
    content: "Answer based on the provided context. Cite sources."
  }, {
    role: "user",
    content: `Context:\n${context}\n\nQuestion: ${userQuery}`
  }]
});

Cost Breakdown ($12K/Month Total)
Per-Component Costs (10M Queries/Month)
| Component | Provider | Cost/Query | Monthly Cost | % of Total |
|---|---|---|---|---|
| Query Rewriting | GPT-4o Mini | $0.0001 | $1,000 | 8% |
| Embeddings | OpenAI | $0.0002 | $2,000 | 17% |
| Vector DB | Qdrant Cloud | $0.00005 | $500 | 4% |
| Reranking | Cohere | $0.002 | $2,000 | 17% |
| LLM Generation | GPT-4 Turbo | $0.006 | $6,000 | 50% |
| Infrastructure | AWS | $0.00005 | $500 | 4% |
| Total | — | $0.0012 (avg) | $12,000 | 100% |
Note: the per-query figures are list prices for a single uncached call; monthly totals reflect actual spend after caching and model routing, which is why the blended average ($0.0012/query) is lower than the column sum.
💡 LLM generation is 50% of costs — Optimize here first. We switched 70% of queries to GPT-4o (2x cheaper) and saved $2K/month.
Cost Optimization Strategies
1. Use Cheaper LLMs for Simple Queries
// Classify query complexity, then route to the cheapest model that can handle it
const complexity = await classifyQuery(userQuery);
const model = complexity === "simple"
  ? "gpt-4o"       // $0.003/query
  : "gpt-4-turbo"; // $0.006/query

// Saved $2,000/month

2. Cache Embeddings
// Don't re-embed the same query
const cached = await redis.get(`emb:${queryHash}`);
if (cached) return JSON.parse(cached);

const embedding = await createEmbedding(userQuery); // wraps the embeddings call above
await redis.set(`emb:${queryHash}`, JSON.stringify(embedding), { EX: 86400 }); // 24h TTL

// Saved $800/month on duplicate queries

3. Batch Reranking
// Queue queries briefly, then rerank the batch concurrently in one pass
const batchRerank = await Promise.all(
  [query1, query2, query3].map(query =>
    cohere.rerank({ model: "rerank-english-v3.0", query, documents: allDocs, topN: 5 })
  )
);

// Saved $500/month

Lessons Learned (The Hard Way)
1. Chunking Strategy Matters More Than You Think
What We Did Wrong
Started with naive 512-token chunks with no overlap. Retrieval accuracy was terrible (45%).
What Actually Works
- Chunk size: 256-512 tokens (smaller is better for precision)
- Overlap: 50 tokens (prevents context loss at boundaries)
- Metadata: Add title, section, date to every chunk
- Hierarchical: Store both full doc + chunks, retrieve chunks but show full doc
Result: Retrieval accuracy improved from 45% to 78%
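A minimal sketch of that chunking strategy, using word counts as a stand-in for tokens (a real pipeline would count tokens with a tokenizer such as tiktoken, and `title`/`section` are whatever metadata your docs carry):

```javascript
// Split text into overlapping chunks, attaching metadata to each.
// Sizes are in words here as a rough stand-in for tokens.
function chunkDocument(text, metadata, chunkSize = 400, overlap = 50) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push({
      text: words.slice(start, start + chunkSize).join(" "),
      ...metadata, // title, section, date — stored on every chunk
    });
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}

const doc = Array.from({ length: 1000 }, (_, i) => `w${i}`).join(" ");
const chunks = chunkDocument(doc, { title: "API Guide", section: "Auth" });
console.log(chunks.length); // → 3 (chunks of ≤400 words with 50-word overlap)
```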
2. Metadata Filtering is Critical
Don't just do pure vector search. Add metadata filters:
// Bad: pure vector search
const results = await vectorDB.search(embedding);

// Good: vector search + metadata filtering
const results = await vectorDB.search(embedding, {
  filter: {
    user_id: currentUser.id,          // user-specific docs
    category: "api-docs",             // relevant category
    updated_at: { gte: "2025-01-01" } // recent docs only
  }
});

Impact: Reduced irrelevant results by 60%
3. Reranking is Non-Negotiable
We tried to skip reranking to save $2K/month. User satisfaction dropped 30%. We added it back immediately.
Why it matters: Vector search is good at recall (finding relevant docs) but bad at precision (ranking them). Reranking fixes this.
4. Prompt Engineering > Model Choice
We spent weeks testing GPT-4 vs Claude vs Gemini. Differences were minimal (2-3%). Then we spent 2 days improving our prompt and saw a 25% quality improvement.
Key prompt improvements:
- Explicit instructions to cite sources
- Examples of good vs bad answers (few-shot)
- Instruction to say "I don't know" if context doesn't contain answer
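Put together, the system prompt ends up looking something like this (wording illustrative, not our exact production prompt):

```javascript
// Illustrative system prompt combining the three improvements above:
// source citations, few-shot examples, and an explicit "I don't know" path
const systemPrompt = `
Answer the user's question using ONLY the provided context.
Cite the source document for every claim, e.g. [doc-12].

If the context does not contain the answer, reply exactly:
"I don't have enough information to answer that." Do NOT guess.

Good answer:
Q: How do I rotate an API key?
A: Go to Settings > API Keys and click "Rotate" [doc-12].

Bad answer:
A: You can probably rotate keys somewhere in settings.
`.trim();
```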
5. Monitoring is Essential
Track these metrics or you're flying blind:
- Retrieval accuracy: % of queries where relevant doc is in top-5
- Answer quality: User thumbs up/down
- Latency: P50, P95, P99 response times
- Cost per query: Track by component
- Cache hit rate: For embeddings and results
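A minimal in-process sketch of that instrumentation (field names are illustrative; in production these events would go to your metrics store):

```javascript
// Record one event per query, then aggregate latency percentiles.
const events = [];

function recordQuery({ latencyMs, retrievedRelevant, thumbsUp, costUsd, cacheHit }) {
  events.push({ latencyMs, retrievedRelevant, thumbsUp, costUsd, cacheHit });
}

// Nearest-rank percentile over recorded values
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

// Simulated traffic: 100 queries with latencies 10ms..1000ms
for (let i = 1; i <= 100; i++) {
  recordQuery({
    latencyMs: i * 10,
    retrievedRelevant: i % 5 !== 0,
    thumbsUp: i % 3 !== 0,
    costUsd: 0.0012,
    cacheHit: i % 4 === 0,
  });
}

const latencies = events.map(e => e.latencyMs);
console.log({
  p50: percentile(latencies, 50),                                       // 500
  p95: percentile(latencies, 95),                                       // 950
  p99: percentile(latencies, 99),                                       // 990
  cacheHitRate: events.filter(e => e.cacheHit).length / events.length,  // 0.25
});
```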
6. Hybrid Search is Overrated (For Most Use Cases)
We added BM25 keyword search alongside vector search. It helped for exact matches (product codes, error messages) but hurt overall quality.
When to use hybrid:
- Legal/medical docs (exact terminology matters)
- Code search (function names, variable names)
- Product catalogs (SKUs, model numbers)
When to skip it: Everything else. Pure vector search with good embeddings works better.
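For the cases where hybrid is worth it, the usual way to merge BM25 and vector results is Reciprocal Rank Fusion (a sketch; k = 60 is the conventional constant):

```javascript
// Reciprocal Rank Fusion: merge two ranked lists of doc IDs.
// score(doc) = sum over lists of 1 / (k + rank), with rank starting at 1.
function rrfMerge(vectorIds, keywordIds, k = 60) {
  const scores = new Map();
  for (const list of [vectorIds, keywordIds]) {
    list.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// "d2" ranks well in both lists, so it wins overall
console.log(rrfMerge(["d1", "d2", "d3"], ["d2", "d4", "d1"])); // → ["d2", "d1", "d4", "d3"]
```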
Common Pitfalls to Avoid
1. Not Handling "I Don't Know"
LLMs will hallucinate answers if you don't explicitly tell them not to:
// Add to the system prompt:
"If the context doesn't contain information to answer the question,
respond with 'I don't have enough information to answer that.'
Do NOT make up information."

2. Ignoring Latency
Our initial system took 4.5 seconds per query. Users hated it. We optimized to 1.2 seconds:
- Parallel API calls (embedding + vector search)
- Streaming LLM responses (show partial answers)
- Caching (Redis for embeddings, results)
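The parallelism point is the easiest win: any steps without a data dependency on each other can run concurrently (stand-in async functions with artificial delays, names illustrative):

```javascript
// Run independent pipeline steps concurrently instead of sequentially.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Stand-ins for real API calls (delays are illustrative)
async function classifyQuery(q) { await sleep(200); return "simple"; }
async function embedQuery(q)    { await sleep(300); return [0.1, 0.2, 0.3]; }

async function handleQuery(q) {
  const t0 = Date.now();
  // Sequentially this is ~500ms; concurrently it's ~300ms (the slower of the two)
  const [complexity, embedding] = await Promise.all([classifyQuery(q), embedQuery(q)]);
  return { complexity, embedding, elapsedMs: Date.now() - t0 };
}

handleQuery("how do i fix this error?").then(r =>
  console.log(`${r.complexity}, ${r.elapsedMs}ms`)
);
```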
3. Over-Engineering
We built a complex multi-stage retrieval pipeline with 5 different retrieval strategies. It was slow, expensive, and only 3% better than simple vector search + reranking.
Start simple: Vector search + reranking gets you 90% of the way there.
4. Not Testing with Real Users
Our synthetic benchmarks showed 95% accuracy. Real users reported 60% satisfaction. The gap was huge.
Solution: A/B test everything with real users, not synthetic evals.
Final Recommendations
Minimum Viable RAG Stack
- Embeddings: OpenAI text-embedding-3-large
- Vector DB: Qdrant Cloud (or Pinecone for zero-ops)
- Reranking: Cohere Rerank v3
- LLM: GPT-4o for most queries, GPT-4 Turbo for complex ones
- Caching: Redis for embeddings and results
Expected Costs (100K Queries/Month)
- Embeddings: $20
- Vector DB: $45
- Reranking: $200
- LLM: $600
- Infrastructure: $50
- Total: ~$915/month ($0.009/query)
When NOT to Use RAG
- Your data fits in the LLM context window (use long-context models instead)
- You need real-time data (use function calling + APIs instead)
- Your data changes constantly (RAG indexing lag will hurt you)
- You have <100 documents (just put them all in the prompt)
💡 Pro tip: Start with Claude 3.5's 200K context window. If your entire knowledge base fits, skip RAG entirely. It's simpler and often better.