Building a Production RAG Pipeline with Next.js
A practical walkthrough of retrieval-augmented generation: chunking, embeddings, vector search, and serving answers from a Next.js API without leaking your source documents.
Why RAG beats stuffing the whole document in the prompt
Large language models have a fixed context window, so a large corpus will not fit in one prompt, and every extra token adds latency and cost. Retrieval-augmented generation (RAG) sends only the most relevant chunks to the model, which keeps both down and grounds answers in your source text.
Note: RAG is not a substitute for evaluation. You still need golden questions and human review on critical paths.
What you will build
Ingest documents (PDF, MD, HTML)
Chunk and embed text
Store vectors in a database
Retrieve top-k chunks per question
Generate an answer with citations
Architecture at a glance
Your app has three layers: ingestion, retrieval, and generation. Keep them separate so you can re-embed documents without redeploying the UI.
Ingestion pipeline
Batch jobs should be idempotent. If a file hash is unchanged, skip re-chunking.
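One way to make the skip rule concrete is to keep a content hash per source and compare before re-chunking. The sketch below assumes a Node.js runtime and an in-memory `Map` as the hash store; `HashStore`, `needsIngest`, and `markIngested` are hypothetical names, and a real job would likely persist the hashes in the database.

```typescript
import { createHash } from "node:crypto"

// Hypothetical store of the last ingested hash per source: source_id -> hash.
type HashStore = Map<string, string>

// SHA-256 of the raw file contents, hex-encoded.
export function contentHash(contents: string | Uint8Array): string {
  return createHash("sha256").update(contents).digest("hex")
}

// True when the file is new or its contents changed since the last ingest.
export function needsIngest(
  store: HashStore,
  sourceId: string,
  contents: string | Uint8Array
): boolean {
  return store.get(sourceId) !== contentHash(contents)
}

// Call only after chunking, embedding, and indexing have all succeeded,
// so a failed run is retried instead of being skipped next time.
export function markIngested(
  store: HashStore,
  sourceId: string,
  contents: string | Uint8Array
): void {
  store.set(sourceId, contentHash(contents))
}
```

Recording the hash only after the whole pipeline succeeds is what keeps a rerun safe: an interrupted batch leaves the old hash in place and the file is picked up again.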
| Stage | Input | Output |
|---|---|---|
| Parse | Raw files | Plain text per page |
| Chunk | Text | 512–1024 token segments |
| Embed | Chunks | Vector + metadata |
| Index | Vectors | Searchable store |
Chunking strategies that actually work
Fixed-size chunks are the default. Overlapping adjacent chunks by 10–20% makes it less likely that a sentence is split at a chunk boundary and lost to retrieval.
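A minimal sketch of fixed-size chunking with overlap follows. It approximates tokens with whitespace-separated words, which is an assumption made for brevity; a real pipeline would count tokens with the embedding model's tokenizer. The function name and defaults (512 tokens, 64 overlap, about 12.5%) are illustrative.

```typescript
// Splits text into windows of `size` tokens where consecutive windows share
// `overlap` tokens. Tokens are approximated by whitespace-separated words.
export function chunkWithOverlap(text: string, size = 512, overlap = 64): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("require size > 0 and 0 <= overlap < size")
  }
  const tokens = text.split(/\s+/).filter(Boolean)
  const step = size - overlap // how far each window advances
  const chunks: string[] = []
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size).join(" "))
    if (start + size >= tokens.length) break // this window reached the end
  }
  return chunks
}
```

For example, ten words with `size = 4` and `overlap = 1` produce three chunks, each starting with the last word of the previous one.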
Semantic chunking
Split on headings or paragraphs when structure is reliable. For API docs, respect h2 / h3 boundaries.
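For Markdown sources, heading-based splitting could look like the sketch below. It assumes h2 and h3 are written as `## ` and `### ` and does not detect headings-like lines inside fenced code blocks, so it is a starting point rather than a full parser. `Section` and `splitOnHeadings` are hypothetical names.

```typescript
type Section = { heading: string; body: string }

// Splits Markdown into sections at "## " and "### " headings.
// Text before the first heading is kept under an empty heading.
export function splitOnHeadings(markdown: string): Section[] {
  const sections: Section[] = []
  let current: Section = { heading: "", body: "" }
  const flush = () => {
    if (current.heading !== "" || current.body.trim() !== "") sections.push(current)
  }
  for (const line of markdown.split("\n")) {
    const match = /^(#{2,3})\s+(.*)$/.exec(line)
    if (match) {
      flush() // close the section that was open before this heading
      current = { heading: (match[2] ?? "").trim(), body: "" }
    } else {
      current.body += line + "\n"
    }
  }
  flush()
  return sections.map((s) => ({ heading: s.heading, body: s.body.trim() }))
}
```

Sections that come out longer than the chunk size can still be passed through a fixed-size splitter afterwards, keeping the heading as metadata.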
Metadata to store with every chunk
source_id (file or URL)
page or section
updated_at
optional access_level for multi-tenant apps
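The fields above might be typed as in the sketch below, together with a filter that is applied before ranking so a tenant never sees another tenant's chunks. The field types, `StoredChunk`, and `visibleChunks` are assumptions for illustration; in particular, hiding untagged chunks by default is a design choice made here, not something the list prescribes.

```typescript
type ChunkMetadata = {
  source_id: string // file path or URL
  page?: number
  section?: string
  updated_at: string // ISO 8601 timestamp (assumed format)
  access_level?: string // hypothetical tenant or role label
}

type StoredChunk = {
  id: string
  text: string
  embedding: number[]
  metadata: ChunkMetadata
}

// Keeps only chunks the caller may read. Chunks without an access_level are
// hidden unless allowUntagged is set, so a missing tag fails closed.
export function visibleChunks(
  chunks: StoredChunk[],
  allowedLevels: ReadonlySet<string>,
  allowUntagged = false
): StoredChunk[] {
  return chunks.filter((chunk) => {
    const level = chunk.metadata.access_level
    if (level === undefined) return allowUntagged
    return allowedLevels.has(level)
  })
}
```

In a single-tenant app where no chunk carries a tag, passing `allowUntagged = true` restores the unfiltered behavior.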
A minimal retrieval function
This TypeScript example scores cosine similarity in application code; in production you would use pgvector, Pinecone, or similar.
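    type Chunk = {
      id: string
      text: string
      embedding: number[]
    }

    // Cosine similarity of two equal-length vectors; returns 0 for a zero vector.
    function cosineSimilarity(a: number[], b: number[]): number {
      if (a.length !== b.length) {
        throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`)
      }
      let dot = 0
      let normA = 0
      let normB = 0
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
      }
      const denom = Math.sqrt(normA) * Math.sqrt(normB)
      return denom === 0 ? 0 : dot / denom // avoid NaN from 0 / 0
    }

    // Returns the k chunks most similar to the query, best match first.
    export function topKChunks(
      queryEmbedding: number[],
      chunks: Chunk[],
      k = 5
    ): Chunk[] {
      return [...chunks]
        .map((c) => ({ c, score: cosineSimilarity(queryEmbedding, c.embedding) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, k)
        .map(({ c }) => c)
    }

Serving answers from Next.js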
Expose a Route Handler that embeds the question, retrieves chunks, builds a prompt, and streams the response.
    # Example env vars
    OPENAI_API_KEY=sk-...
    DATABASE_URL=postgresql://...
    EMBEDDING_MODEL=text-embedding-3-small

Use streaming (ReadableStream or the Vercel AI SDK) so the UI shows tokens as they arrive. Always return citation links to chunk sources in the JSON payload.
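The sketch below shows one possible shape for such a handler using only Web-standard `Request`, `Response`, and `ReadableStream`, which is what App Router Route Handlers receive and return. Everything else is an assumption: `makeAskHandler`, the injected `embed` / `retrieve` / `generate` functions, and the newline-delimited JSON event format (citations first, then tokens) are hypothetical choices, not a fixed API.

```typescript
type Source = { id: string; url: string }
type Hit = { text: string; source: Source }

// Hypothetical dependencies, injected so the handler can be exercised
// without a real embedding model, vector store, or LLM.
type Deps = {
  embed: (text: string) => Promise<number[]>
  retrieve: (embedding: number[], k: number) => Promise<Hit[]>
  generate: (prompt: string) => AsyncIterable<string>
}

const encoder = new TextEncoder()
// One JSON object per line (NDJSON), encoded as UTF-8 bytes.
const ndjson = (event: object) => encoder.encode(JSON.stringify(event) + "\n")

export function makeAskHandler(deps: Deps) {
  return async function POST(request: Request): Promise<Response> {
    const body = await request.json().catch(() => null)
    const question = typeof body?.question === "string" ? body.question.trim() : ""
    if (question === "") {
      return new Response(JSON.stringify({ error: "question is required" }), {
        status: 400,
        headers: { "Content-Type": "application/json" },
      })
    }

    const hits = await deps.retrieve(await deps.embed(question), 5)
    const context = hits.map((hit) => hit.text).join("\n---\n")
    const prompt = `Context:\n${context}\n\nQuestion:\n${question}`

    const stream = new ReadableStream<Uint8Array>({
      async start(controller) {
        try {
          // Send citations first so the UI can render sources immediately.
          controller.enqueue(ndjson({ type: "citations", sources: hits.map((hit) => hit.source) }))
          for await (const token of deps.generate(prompt)) {
            controller.enqueue(ndjson({ type: "token", text: token }))
          }
          controller.close()
        } catch (error) {
          controller.error(error) // surface generation failures to the client
        }
      },
    })

    return new Response(stream, {
      headers: { "Content-Type": "application/x-ndjson; charset=utf-8" },
    })
  }
}
```

In a Next.js project this would typically be wired up in a route file such as `app/api/ask/route.ts` (path assumed) with `export const POST = makeAskHandler({ ... })`, passing real implementations of the three dependencies.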
Prompt template (sketch)
    You are a technical assistant. Answer using ONLY the context below.
    If the context is insufficient, say you don't know.

    Context:
    {{retrieved_chunks}}

    Question:
    {{user_question}}

Observability and failure modes
Empty retrieval: widen k or lower similarity threshold; never hallucinate sources.
Stale index: version embeddings when the model changes.
PII in chunks: redact at ingest time.
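The empty-retrieval case can be handled by trying a strict similarity threshold first and relaxing it once before giving up. The following is a sketch under assumed values: the thresholds `0.8` and `0.6` are placeholders to tune against your own data, and `Scored` and `selectHits` are hypothetical names.

```typescript
type Scored<T> = { item: T; score: number }

// Returns up to k items at or above the first threshold that yields any hits.
// Thresholds are tried in order (strict first); an empty result means the
// caller should answer "I don't know" and return no sources.
export function selectHits<T>(
  scored: Scored<T>[],
  k = 5,
  thresholds: number[] = [0.8, 0.6]
): Scored<T>[] {
  const ranked = [...scored].sort((a, b) => b.score - a.score)
  for (const minScore of thresholds) {
    const hits = ranked.filter((s) => s.score >= minScore).slice(0, k)
    if (hits.length > 0) return hits
  }
  return []
}
```

Returning an empty array instead of the best low-scoring chunks is deliberate: it gives the generation step a clear signal to decline rather than cite weak matches as sources.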
Checklist before launch
Golden-set accuracy on 50+ questions
p95 latency under your SLA
Rate limits on the public API
Audit log of queries (no raw secrets)
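The first two checklist items can be computed with a few lines. This sketch uses the nearest-rank definition of a percentile and a deliberately crude grading rule (an answer passes if it contains every required phrase); both are assumptions, and the names `percentile`, `GoldenCase`, and `goldenAccuracy` are hypothetical. Phrase matching does not replace human review on critical paths.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of all
// samples at or below it. p is in (0, 100].
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new RangeError("no samples")
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]")
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p * sorted.length) / 100)
  return sorted[rank - 1]
}

type GoldenCase = { question: string; mustInclude: string[] }

// Fraction of golden questions whose answer contains every required phrase
// (case-insensitive). Questions with no recorded answer count as failures.
export function goldenAccuracy(
  cases: GoldenCase[],
  answers: Map<string, string>
): number {
  if (cases.length === 0) return 0
  const passed = cases.filter((c) => {
    const answer = (answers.get(c.question) ?? "").toLowerCase()
    return c.mustInclude.every((phrase) => answer.includes(phrase.toLowerCase()))
  })
  return passed.length / cases.length
}
```

A launch gate might then compare `percentile(latenciesMs, 95)` against the SLA and `goldenAccuracy(...)` against a target agreed with reviewers.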
Further reading
Thanks for reading — leave a comment if you want a follow-up on pgvector + Prisma.