Building a Production RAG Pipeline with Next.js

A practical walkthrough of retrieval-augmented generation: chunking, embeddings, vector search, and serving answers from a Next.js API without leaking your source documents.

Why RAG beats stuffing the whole document in the prompt

Large language models have a fixed context window, and every token you send adds latency and cost. Retrieval-augmented generation (RAG) sends only the chunks most relevant to the question, which keeps both down and grounds the answer in your source text.

Note: RAG is not a substitute for evaluation. You still need golden questions and human review on critical paths.

What you will build

Ingest documents (PDF, MD, HTML)
Chunk and embed text
Store vectors in a database
Retrieve top-k chunks per question
Generate an answer with citations

Architecture at a glance

Your app has three layers: ingestion, retrieval, and generation. Keep them separate so you can re-embed documents without redeploying the UI.

Ingestion pipeline

Batch jobs should be idempotent. If a file hash is unchanged, skip re-chunking.
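The skip-if-unchanged rule can be sketched as a content hash compared against a ledger of previously ingested files. This is a minimal sketch, not a full job runner: the `IngestLedger` map and the function names are hypothetical, and in practice the ledger would live in your database rather than in memory.

```typescript
import { createHash } from "node:crypto"

// Hypothetical ledger: source_id -> hash of the contents ingested last time.
type IngestLedger = Map<string, string>

// SHA-256 of the raw file contents, hex-encoded (64 characters).
export function contentHash(contents: string | Uint8Array): string {
  return createHash("sha256").update(contents).digest("hex")
}

// True when the file is new or its contents changed since the last run.
export function shouldIngest(
  ledger: IngestLedger,
  sourceId: string,
  hash: string
): boolean {
  return ledger.get(sourceId) !== hash
}
```

Record the hash only after chunking, embedding, and indexing have all succeeded; that way a job that fails halfway retries the file on the next run instead of skipping it.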

Stage	Input	Output
Parse	Raw files	Plain text per page
Chunk	Text	512–1024 token segments
Embed	Chunks	Vector + metadata
Index	Vectors	Searchable store

flowchart LR
  ingest[Ingest docs] --> chunk[Chunk text]
  chunk --> embed[Embed]
  embed --> index[(Vector DB)]
  query[User question] --> retrieve[Top-k search]
  index --> retrieve
  retrieve --> llm[LLM + prompt]
  llm --> answer[Streamed answer]

Chunking strategies that actually work

Fixed-size chunks are the default. Overlap (10–20%) reduces boundary cuts through sentences.
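Here is one way fixed-size chunking with overlap might look. As a simplifying assumption, the sketch counts whitespace-separated words instead of model tokens; a real pipeline would count with the tokenizer of its embedding model. The function name and defaults are illustrative (64 of 512 is a 12.5% overlap).

```typescript
// Splits text into windows of `size` words; each window repeats the last
// `overlap` words of the previous one so sentences cut at a boundary
// still appear whole in one of the two chunks.
export function chunkWithOverlap(
  text: string,
  size = 512,
  overlap = 64
): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("need size > 0 and 0 <= overlap < size")
  }
  const words = text.split(/\s+/).filter(Boolean)
  const step = size - overlap
  const chunks: string[] = []
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "))
    if (start + size >= words.length) break // this window reached the end
  }
  return chunks
}
```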

Semantic chunking

Split on headings or paragraphs when structure is reliable. For API docs, respect h2 / h3 boundaries.
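For Markdown sources, heading-based splitting could be sketched as below. It assumes ATX-style headings (`##` and `###`) and does not look inside fenced code blocks, so a `## ` line within a code sample would be treated as a heading; the `Section` type and function name are made up for illustration.

```typescript
type Section = { heading: string; body: string }

// Splits Markdown into sections at h2/h3 headings.
// Text before the first heading is kept under an empty heading.
export function splitOnHeadings(markdown: string): Section[] {
  const sections: Section[] = []
  let current: Section = { heading: "", body: "" }
  const flush = () => {
    if (current.heading || current.body.trim()) sections.push(current)
  }
  for (const line of markdown.split("\n")) {
    const match = /^(#{2,3})\s+(.*)$/.exec(line)
    if (match) {
      flush()
      current = { heading: (match[2] ?? "").trim(), body: "" }
    } else {
      current.body += line + "\n"
    }
  }
  flush()
  return sections.map((s) => ({ heading: s.heading, body: s.body.trim() }))
}
```

Sections that come out longer than your chunk size can then be passed through a fixed-size splitter, keeping the heading as metadata.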

Metadata to store with every chunk

source_id (file or URL)
page or section
updated_at
optional access_level for multi-tenant apps
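The fields above map naturally onto a type, and `access_level` onto a filter that runs before any chunk reaches the prompt. The three level names and their ordering are assumptions for this sketch (the list only names the field), as is the choice to treat a missing level as the most restrictive one.

```typescript
// Hypothetical levels, ordered from least to most restricted.
type AccessLevel = "public" | "internal" | "restricted"

type ChunkMetadata = {
  source_id: string // file path or URL
  section: string // page number or section heading
  updated_at: string // ISO 8601 timestamp
  access_level?: AccessLevel // optional; missing is treated as "restricted"
}

type StoredChunk = { id: string; text: string; metadata: ChunkMetadata }

const rank: Record<AccessLevel, number> = { public: 0, internal: 1, restricted: 2 }

// Keeps only the chunks a caller with the given clearance may read.
export function visibleTo(
  chunks: StoredChunk[],
  clearance: AccessLevel
): StoredChunk[] {
  return chunks.filter(
    (c) => rank[c.metadata.access_level ?? "restricted"] <= rank[clearance]
  )
}
```

With a real vector store you would usually push this filter into the search query itself, so restricted chunks never take up one of the top-k slots.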

A minimal retrieval function

This TypeScript example scores cosine similarity in application code; in production you would use pgvector, Pinecone, or similar.

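// A chunk of source text together with its embedding vector.
type Chunk = {
  id: string
  text: string
  embedding: number[]
}

// Cosine similarity in [-1, 1]. Returns 0 when either vector is all zeros,
// so a degenerate embedding cannot produce NaN and corrupt the sort.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Embedding dimensions differ: ${a.length} vs ${b.length}`)
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB)
  return denom === 0 ? 0 : dot / denom
}

// Returns the k chunks most similar to the query, best match first.
export function topKChunks(
  queryEmbedding: number[],
  chunks: Chunk[],
  k = 5
): Chunk[] {
  return chunks
    .map((c) => ({ c, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ c }) => c)
}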
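If you move the search into pgvector, the same top-k lookup becomes a single query. The sketch below only builds the parameterized query; the `chunks` table and its column names are hypothetical. `<=>` is pgvector's cosine-distance operator, so `1 - distance` gives the similarity score, and the `{ text, values }` shape is the query config object that node-postgres accepts.

```typescript
// Builds a top-k similarity query for a hypothetical table:
//   chunks(id text, text text, embedding vector)
export function topKQuery(
  queryEmbedding: number[],
  k = 5
): { text: string; values: [string, number] } {
  if (!Number.isInteger(k) || k <= 0) {
    throw new RangeError("k must be a positive integer")
  }
  if (!queryEmbedding.every((x) => Number.isFinite(x))) {
    throw new RangeError("embedding must contain only finite numbers")
  }
  // pgvector accepts vectors in the text form "[0.1,0.2,...]".
  const vectorLiteral = `[${queryEmbedding.join(",")}]`
  return {
    text:
      "SELECT id, text, 1 - (embedding <=> $1::vector) AS score " +
      "FROM chunks ORDER BY embedding <=> $1::vector LIMIT $2",
    values: [vectorLiteral, k],
  }
}
```

Ordering by the raw distance expression, rather than by the computed `score`, is what lets pgvector use an approximate index if you have created one.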

Serving answers from Next.js

Expose a Route Handler that embeds the question, retrieves chunks, builds a prompt, and streams the response.

# Example env vars
OPENAI_API_KEY=sk-...
DATABASE_URL=postgresql://...
EMBEDDING_MODEL=text-embedding-3-small

Use streaming (ReadableStream or the Vercel AI SDK) so the UI shows tokens as they arrive. Always return citation links to chunk sources in the JSON payload.
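A Route Handler along these lines might look like the sketch below. It relies only on the web-standard `Request`, `Response`, and `ReadableStream`, which Route Handlers are built on. Everything else is an assumption made for illustration: the file path, the injected `embed`, `retrieve`, and `generate` functions, and the newline-delimited JSON event format, chosen here so tokens and citations can travel in one streamed response.

```typescript
// app/api/ask/route.ts (hypothetical path)
type Retrieved = { id: string; text: string; source: string }

// Hypothetical dependencies; wire in your embedding client, vector store, and LLM.
type Deps = {
  embed: (text: string) => Promise<number[]>
  retrieve: (embedding: number[], k: number) => Promise<Retrieved[]>
  generate: (prompt: string) => AsyncIterable<string>
}

export function makePostHandler(deps: Deps) {
  return async function POST(req: Request): Promise<Response> {
    const body = (await req.json().catch(() => null)) as { question?: unknown } | null
    const question = body?.question
    if (typeof question !== "string" || question.trim() === "") {
      return new Response(JSON.stringify({ error: "question is required" }), {
        status: 400,
        headers: { "Content-Type": "application/json" },
      })
    }

    const chunks = await deps.retrieve(await deps.embed(question), 5)
    const context = chunks.map((c) => `[${c.id}] ${c.text}`).join("\n\n")
    const prompt = `Context:\n${context}\n\nQuestion:\n${question}`

    const encoder = new TextEncoder()
    const stream = new ReadableStream<Uint8Array>({
      async start(controller) {
        // Each event is one JSON object followed by a newline.
        const send = (event: object) =>
          controller.enqueue(encoder.encode(JSON.stringify(event) + "\n"))
        try {
          for await (const token of deps.generate(prompt)) {
            send({ type: "token", value: token })
          }
          // Citations go last so the client can render them under the answer.
          send({ type: "citations", sources: chunks.map((c) => c.source) })
          controller.close()
        } catch (err) {
          controller.error(err)
        }
      },
    })
    return new Response(stream, {
      headers: { "Content-Type": "application/x-ndjson" },
    })
  }
}
```

In the route file you would then write something like `export const POST = makePostHandler({ embed, retrieve, generate })`. Injecting the dependencies also keeps the handler testable without network calls.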

Prompt template (sketch)

You are a technical assistant. Answer using ONLY the context below.
If the context is insufficient, say you don't know.

Context:
{{retrieved_chunks}}

Question:
{{user_question}}
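Filling the template is a small function, but it is worth doing in a single pass so placeholder-like text inside a retrieved chunk is never substituted. The sketch below assumes a hypothetical `ContextChunk` shape and numbers each chunk so the model can refer to its sources.

```typescript
type ContextChunk = { id: string; text: string; source: string }

const TEMPLATE = `You are a technical assistant. Answer using ONLY the context below.
If the context is insufficient, say you don't know.

Context:
{{retrieved_chunks}}

Question:
{{user_question}}`

export function buildPrompt(chunks: ContextChunk[], question: string): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.source})\n${c.text}`)
    .join("\n\n")
  // One pass over the template: inserted text is not scanned again, and the
  // function replacer keeps "$&"-style sequences in user input literal.
  return TEMPLATE.replace(
    /\{\{(retrieved_chunks|user_question)\}\}/g,
    (_match: string, key: string) =>
      key === "retrieved_chunks" ? context : question.trim()
  )
}
```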

Observability and failure modes

Empty retrieval: widen k or lower the similarity threshold; if nothing relevant comes back, say so rather than cite invented sources.
Stale index: version embeddings when the model changes.
PII in chunks: redact at ingest time.
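The empty-retrieval case can be made explicit in the return type, so the caller has to handle it. In this sketch the two thresholds are placeholder values, not recommendations, and the names are hypothetical; tune the floors against your own golden set.

```typescript
type Scored = { id: string; score: number }

type RetrievalResult =
  | { kind: "ok"; hits: Scored[] }
  | { kind: "empty" } // caller should answer "I don't know" with no citations

// Keeps candidates above a similarity floor, relaxing it once before giving up.
export function selectHits(
  candidates: Scored[],
  k = 5,
  threshold = 0.75,
  relaxedThreshold = 0.6
): RetrievalResult {
  const ranked = [...candidates].sort((a, b) => b.score - a.score)
  for (const floor of [threshold, relaxedThreshold]) {
    const hits = ranked.filter((c) => c.score >= floor).slice(0, k)
    if (hits.length > 0) return { kind: "ok", hits }
  }
  return { kind: "empty" }
}
```

Logging how often the relaxed floor or the empty branch is hit gives you an early signal that the index is stale or that users are asking about content you have not ingested.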

Checklist before launch

Golden-set accuracy on 50+ questions
p95 latency under your SLA
Rate limits on the public API
Audit log of queries (no raw secrets)
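The first two checklist items can be measured with a few lines of code. The sketch below uses retrieval hit rate as a rough proxy for golden-set accuracy (it checks that the expected source was retrieved, not that the final answer is right) and a nearest-rank percentile for latency. `GoldenCase` and the synchronous `retrieve` callback are hypothetical stand-ins for your own test data and search call.

```typescript
type GoldenCase = { question: string; expectedSourceId: string }

// Fraction of golden questions whose expected source is among the
// source_ids returned by `retrieve`.
export function hitRate(
  cases: GoldenCase[],
  retrieve: (question: string) => string[]
): number {
  if (cases.length === 0) return 0
  const hits = cases.filter((c) =>
    retrieve(c.question).includes(c.expectedSourceId)
  ).length
  return hits / cases.length
}

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
export function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new RangeError("no samples")
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]")
  const sorted = [...samplesMs].sort((a, b) => a - b)
  const rank = Math.ceil((p * sorted.length) / 100)
  return sorted[rank - 1]
}
```

Running both in CI against a fixed golden set makes regressions visible whenever you change the chunk size, the embedding model, or k.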
