RAG, End to End: Building Reliable AI with Your Own Data
Why naive prompting fails, and how to architect a retrieval pipeline that actually works in production
Large language models are trained on public data up to a cutoff date. They have no knowledge of your internal documentation, your product database, or last week's incident report. The naive fix — pasting everything into the prompt — breaks down quickly: context windows have hard limits, long contexts dilute attention, and you pay per token. Retrieval-Augmented Generation (RAG) solves this by finding only the relevant pieces of your data at query time and passing those pieces to the model.
The RAG Pipeline at a Glance
A production RAG system has five stages:
Chunk your source documents
Embed each chunk as a dense vector
Store vectors in a vector database
At query time: embed the user's question, retrieve the top-k chunks
Pass retrieved chunks as context and generate the answer
Simple in theory. In practice, most production failures live in stages 1 and 4: chunking and query-time retrieval.
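To make the data flow between the five stages concrete, here is a minimal wiring sketch. It is not a real library: `RagPipeline` and its field names are hypothetical, and every stage is an injected callable so that any chunker, embedding model, vector store, or LLM client can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable

Vector = list[float]


@dataclass
class RagPipeline:
    """Hypothetical container that wires the five stages together."""
    chunk: Callable[[str], list[str]]                  # stage 1
    embed: Callable[[list[str]], list[Vector]]         # stage 2 (also used for queries)
    store: Callable[[list[str], list[Vector]], None]   # stage 3
    retrieve: Callable[[Vector, int], list[str]]       # stage 4
    generate: Callable[[str, list[str]], str]          # stage 5

    def index(self, documents: list[str]) -> int:
        """Offline path: chunk, embed, and store each document. Returns the chunk count."""
        total = 0
        for doc in documents:
            chunks = self.chunk(doc)
            if chunks:
                self.store(chunks, self.embed(chunks))
            total += len(chunks)
        return total

    def answer(self, question: str, k: int = 5) -> str:
        """Online path: embed the question, fetch the top-k chunks, generate."""
        query_vec = self.embed([question])[0]
        context = self.retrieve(query_vec, k)
        return self.generate(question, context)
```

Keeping the stages as separate callables also makes it possible to swap or test one stage at a time, which matters once evaluation points at a specific weak stage.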
Step 1: Chunking — the Foundation Nobody Gets Right
Chunking is splitting your source documents into pieces small enough to embed meaningfully but large enough to preserve context.
Fixed-size chunking
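Splits the text every N tokens with some overlap between neighbouring chunks. Fast and predictable, but it severs sentences and context at arbitrary boundaries. The example approximates tokens with whitespace-separated words; use your embedding model's tokenizer if you need exact counts.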
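    def chunk_fixed(text: str, size: int = 512, overlap: int = 64) -> list[str]:
        if not 0 <= overlap < size:
            raise ValueError("overlap must be non-negative and smaller than size")
        # Whitespace-separated words stand in for model tokens here.
        tokens = text.split()
        chunks = []
        for i in range(0, len(tokens), size - overlap):
            chunks.append(" ".join(tokens[i : i + size]))
            if i + size >= len(tokens):
                break  # this window already reached the end; avoid an overlap-only tail
        return chunks

Semantic chunking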
Splits at natural sentence or section boundaries, preserving more context per chunk.
    import re

    def chunk_semantic(text: str, max_tokens: int = 512) -> list[str]:
        # Split after sentence-ending punctuation; drop empty pieces.
        sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
        chunks, current, count = [], [], 0
        for s in sentences:
            words = len(s.split())  # word count approximates token count
            # Close the current chunk before it would exceed the budget.
            # A single sentence longer than max_tokens still becomes its own chunk.
            if count + words > max_tokens and current:
                chunks.append(" ".join(current))
                current, count = [], 0
            current.append(s)
            count += words
        if current:
            chunks.append(" ".join(current))
        return chunks

Rule of thumb: for prose, semantic chunking with 300–600 token chunks and 10–20% overlap is a reliable starting point. For code, split at function or class boundaries.
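For the code case, one possible approach for Python sources is to let the standard-library `ast` module find the boundaries. This is a sketch under assumptions: `chunk_python_source` is a hypothetical helper, it keeps only top-level functions and classes, and it drops other module-level statements such as imports.

```python
import ast


def chunk_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class, decorators included."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # node.lineno points at the def/class line, so look at the
            # decorators to find where the definition really starts.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append("\n".join(lines[start - 1 : node.end_lineno]))
    return chunks
```

Very large classes may still exceed the chunk budget; splitting those further at method boundaries would follow the same pattern on `node.body`.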
Step 2: Embeddings — Turning Text into Vectors
An embedding model converts a chunk of text into a dense vector (typically 768–3072 floats). Chunks with similar meaning end up close in vector space.
Popular embedding models:
text-embedding-3-small (OpenAI) — cheap, fast, 1536-dim, solid general baseline
text-embedding-3-large — higher accuracy, 3072-dim
nomic-embed-text — open-source, strong on retrieval benchmarks
voyage-3 (Voyage AI) — state-of-the-art on code and technical content
Embed at index time (batch) and at query time (single call). Always use the same model for both — cross-model cosine similarity is meaningless.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def embed(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
        # One request embeds the whole batch; results come back in input order.
        response = client.embeddings.create(input=texts, model=model)
        return [d.embedding for d in response.data]

Step 3: Vector Search
Store your chunk embeddings in a vector database. At query time, embed the user's question and find the k nearest chunks by cosine similarity.
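Before reaching for a database, it can help to see what the search itself computes. The following is a minimal brute-force sketch in plain Python; the names `cosine_similarity` and `top_k` are illustrative and not taken from any library. It scans every vector, so it is only practical for small collections.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[tuple[int, float]]:
    """Return (chunk index, similarity) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A vector database performs the same ranking, but uses an approximate index to avoid comparing the query against every stored vector.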
With Supabase and pgvector:
    -- Enable the pgvector extension
    create extension if not exists vector;

    -- Chunks table with an embedding column
    create table chunks (
      id bigserial primary key,
      text text not null,
      emb vector(1536)  -- must match the embedding model's output dimension
    );

    -- Approximate nearest-neighbour index. HNSW usually gives a better
    -- speed/recall trade-off than IVFFlat, at the cost of slower builds and more memory.
    create index on chunks using hnsw (emb vector_cosine_ops);

    -- Retrieval function: returns the k most similar chunks.
    -- <=> is cosine distance, so 1 - distance is cosine similarity.
    create or replace function match_chunks(query_emb vector, k int)
    returns table (id bigint, text text, similarity float)
    language sql stable as $$
      select c.id, c.text, 1 - (c.emb <=> query_emb) as similarity
      from chunks c
      order by c.emb <=> query_emb
      limit k;
    $$;

Step 4: Reranking
Vector search returns the k chunks whose embeddings are closest to the query. Embedding similarity is only a coarse proxy for relevance: it can miss exact lexical matches and the intent behind the question. A cross-encoder reranker reads the query and each chunk together and scores the pair directly, which is slower per pair but more accurate. Use it to shrink a top-50 vector result down to the top 5 you actually send to the LLM.
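    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
        if not chunks:
            return []
        # Score every (query, chunk) pair jointly; higher means more relevant.
        scores = reranker.predict([(query, c) for c in chunks])
        # Sort on the score alone so ties never fall back to comparing chunk text.
        ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in ranked[:top_n]]

Step 5: Generation with Context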
Pass the reranked chunks as context. Be explicit about the source boundary and instruct the model not to hallucinate beyond it.
    import anthropic

    SYSTEM = """You are a helpful assistant. Answer using ONLY the provided context.
    If the answer is not in the context, say "I don't have that information."
    Cite the relevant section when you can."""

    def generate(query: str, context_chunks: list[str]) -> str:
        # Separate chunks visibly so the model can tell where one source ends.
        context = "\n\n---\n\n".join(context_chunks)
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        response = client.messages.create(
            model="claude-opus-4-8",  # substitute a model ID available to your account
            max_tokens=1024,
            system=SYSTEM,
            messages=[{
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}"
            }]
        )
        # The first content block holds the text of the reply.
        return response.content[0].text

Evaluating Your RAG System
Without evaluation you are flying blind. Three metrics that matter:
Retrieval recall@k — of all chunks that contain the answer, how many did you retrieve? Low recall means bad chunking or the wrong embedding model.
Context precision — of the chunks you retrieved, how many are actually relevant? Low precision means noisy context that confuses the generator.
Answer faithfulness — does the generated answer stick to the context, or does the model hallucinate beyond it?
Build a golden evaluation set: (question, expected_answer, relevant_chunk_ids) triples from your domain. Run your pipeline against it on every significant change.
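The two retrieval metrics can be computed directly from such a golden set. The sketch below is one possible shape, not a standard API: `GoldenExample`, `evaluate_retrieval`, and the `retrieve` callable (question and k in, ranked chunk ids out) are hypothetical names, and context precision is taken here as the plain fraction of retrieved chunks that are relevant, following the definition above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenExample:
    question: str
    expected_answer: str
    relevant_chunk_ids: set[int]


def recall_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def context_precision(retrieved: list[int], relevant: set[int]) -> float:
    """Fraction of the retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for cid in retrieved if cid in relevant) / len(retrieved)


def evaluate_retrieval(
    golden: list[GoldenExample],
    retrieve: Callable[[str, int], list[int]],
    k: int = 5,
) -> dict[str, float]:
    """Average both metrics over the golden set."""
    if not golden:
        return {"recall@k": 0.0, "context_precision": 0.0}
    recalls, precisions = [], []
    for ex in golden:
        retrieved = retrieve(ex.question, k)[:k]
        recalls.append(recall_at_k(retrieved, ex.relevant_chunk_ids, k))
        precisions.append(context_precision(retrieved, ex.relevant_chunk_ids))
    return {
        "recall@k": sum(recalls) / len(recalls),
        "context_precision": sum(precisions) / len(precisions),
    }
```

Answer faithfulness is not covered here because it needs a judgment about generated text, typically from a human reviewer or a second model.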
Common Failure Modes
Correct chunks retrieved but wrong answer: the context is too large and the model loses focus. Rerank and reduce to the top 3.
Relevant chunks missed: there is a semantic gap between query and chunk phrasing. Add hybrid search (BM25 + vector) or HyDE, which embeds an LLM-generated hypothetical answer instead of the raw query.
Hallucinated citations: no grounding instruction in the system prompt. Add an explicit constraint.
Slow at query time: no ANN index. Use HNSW, and parallelize only independent work, such as the BM25 and vector legs of a hybrid search; embedding the query, retrieving, and generating depend on each other and must run in sequence.
Costs too high: re-embedding the full corpus on every change. Embed once, store the vectors, and re-embed only changed or new documents.
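For the hybrid search fix, the BM25 and vector results have to be merged into one ranking. The article does not prescribe a fusion method; one common choice is reciprocal rank fusion (RRF), sketched here as an assumption. `reciprocal_rank_fusion` is a hypothetical helper that takes ranked lists of chunk ids, and k = 60 is the smoothing constant commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[int]], k: int = 60, top_n: int = 10
) -> list[int]:
    """Merge several ranked id lists; each list contributes 1 / (k + rank) per id."""
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ids found by both retrievers rise to the top.
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)[:top_n]
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.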
RAG is not a single technique — it is an architecture. The chunking strategy, embedding model, retrieval mechanism, and reranker are all levers. Start with the simplest version that works, measure it, then tune the stage your evaluation identifies as the weakest.