Retrieval-Augmented Generation (RAG) for AI Applications
Comprehensive guide to Retrieval-Augmented Generation, covering architecture, embeddings, vector databases, document indexing, retrieval strategies, and best practices for building production-ready RAG systems.
RAG (Retrieval-Augmented Generation)
RAG is becoming the default architecture for AI products.
It allows LLMs to:
- access private knowledge
- reduce hallucinations
- stay up-to-date
But building it reliably in production requires careful engineering.
What is RAG?
RAG combines two components:
- Retriever
- Generator (LLM)
Pipeline:
Instead of asking the LLM directly:
User → LLM → Answer
we do:

```mermaid
flowchart TD
    Q[User question] --> R1[Retrieve relevant documents]
    R1 --> R2[Insert retrieved context into prompt]
    R2 --> LLM[LLM generates answer]
    LLM --> A[Grounded response]
```

In more detail:
User Query
↓
Embedding Model
↓
Vector Database
↓
Top-K Documents
↓
Prompt + Context
↓
LLM
↓
Answer
The LLM now answers grounded in retrieved knowledge.
Building a Production RAG System Step-by-Step
Large Language Models are powerful, but they have one major limitation: they don't know your private data.
If you ask a model about your company docs, support tickets, or internal knowledge base, it will hallucinate or say it doesn't know.
Retrieval-Augmented Generation (RAG) solves this.
Instead of relying only on the model's training data, we retrieve relevant documents at query time and inject them into the prompt.
In this post we’ll walk through how to build a production RAG system step-by-step, including architecture, scaling concerns, and engineering tradeoffs.
Step 1 — Data Collection
Your RAG system is only as good as the documents you feed it.
Typical sources:
- PDFs
- Notion pages
- Confluence
- Slack threads
- support tickets
- GitHub repos
- product docs
Example pipeline:
Data Sources
↓
Document Loader
↓
Text Cleaning
↓
Chunking
Python example:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/architecture.pdf")
documents = loader.load()
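The Text Cleaning stage in the pipeline above gets no code in this post's stack, so here is a minimal, dependency-free sketch; the specific rules are illustrative, not exhaustive:

```python
import re

def clean_text(text: str) -> str:
    """Minimal cleanup for text extracted from PDFs and web pages."""
    text = text.replace("\x00", "")         # strip null bytes from bad PDF extraction
    text = re.sub(r"-\n(?=\w)", "", text)   # re-join words hyphenated across line breaks
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse blank-line runs into one paragraph break
    return text.strip()

cleaned = clean_text("Intro-\nduction  to   RAG\n\n\n\nNext  section")
```

Real corpora usually need source-specific rules on top of this (boilerplate headers, navigation menus, code blocks), but the shape stays the same: a pure function applied between loading and chunking.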
Step 2 — Chunking Documents
LLMs have context limits.
Instead of embedding an entire document, we split it into chunks.
Example:
| Chunk Size | Tradeoff |
|---|---|
| Small (~200 tokens) | More precise matches, but less context in each retrieved chunk |
| Large (~1000 tokens) | More context per chunk, but retrieval is noisier |
A common heuristic: chunks of roughly 300 tokens with ~50 tokens of overlap, so that a sentence cut at a chunk boundary still appears intact in at least one chunk.
Example:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,   # max characters per chunk (character-based by default)
    chunk_overlap=50  # characters shared between consecutive chunks
)
chunks = splitter.split_documents(documents)
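To see what the overlap actually does, here is a dependency-free sketch of the same splitting idea, character-based for simplicity (real splitters like the one above also try to break on paragraph and sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows; each window repeats the last
    `overlap` characters of the previous one, so content cut at a boundary
    survives in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 700, chunk_size=300, overlap=50)
# windows start at offsets 0, 250, 500
```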
Step 3 — Embedding the Data
Embeddings convert text into vectors.
Example embedding:
"What is Kubernetes?"
→ [0.12, -0.44, 0.88, ...]
Similar meaning → similar vectors.
Example code:
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="What is Kubernetes?"
)
vector = response.data[0].embedding  # the embedding itself: a list of floats
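"Similar meaning → similar vectors" can be made concrete with cosine similarity. The three vectors below are invented for illustration, not real embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product of the vectors divided by their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

k8s     = [0.12, -0.44, 0.88]   # "What is Kubernetes?"
k8s_alt = [0.10, -0.40, 0.90]   # "Explain Kubernetes" — nearly the same direction
cooking = [0.90,  0.30, -0.20]  # "How do I bake bread?" — unrelated topic

# The paraphrase scores far higher than the unrelated question.
paraphrase_score = cosine(k8s, k8s_alt)
unrelated_score = cosine(k8s, cooking)
```

With a real embedding model the vectors have hundreds or thousands of dimensions, but the comparison works exactly the same way.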
Step 4 — Store in a Vector Database
Embeddings must be stored in a vector index.
Popular options:
| Database | Use Case |
|---|---|
| Pinecone | fully managed service |
| Weaviate | hybrid keyword + vector search |
| FAISS | local, in-process library |
| Qdrant | open source, self-hostable |
Example architecture:
Chunks
↓
Embedding Model
↓
Vector DB
Illustrative write call (the exact API varies by database):

vector_db.add(
    ids=[chunk_id],
    embeddings=[embedding],
    metadata={"source": "docs"}
)
Step 5 — Query Time Retrieval
At query time, the user's question is embedded with the same model used for the documents, then matched against the index:
User Query
↓
Embedding
↓
Vector Similarity Search
↓
Top-K Documents
Mathematically we search using cosine similarity:

cos(q, d) = (q · d) / (‖q‖ ‖d‖)

A score near 1 means the stored chunk points in almost the same direction as the query vector.
Illustrative search call (again, the exact API varies by database):

results = vector_db.search(
    query_embedding,
    k=5
)
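Steps 4 and 5 can be sketched end-to-end with a tiny in-memory index. NumPy stands in for the vector databases above, and the `add`/`search` interface mirrors the pseudocode rather than any specific product:

```python
import numpy as np

class TinyVectorDB:
    """In-memory store: brute-force cosine search over all stored vectors."""

    def __init__(self):
        self.ids, self.vectors, self.metadata = [], [], []

    def add(self, ids, embeddings, metadata=None):
        for i, vec in zip(ids, embeddings):
            v = np.asarray(vec, dtype=float)
            # Normalize at insert time so search reduces to a dot product.
            self.ids.append(i)
            self.vectors.append(v / np.linalg.norm(v))
            self.metadata.append(metadata or {})

    def search(self, query_embedding, k=5):
        q = np.asarray(query_embedding, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.vectors) @ q   # cosine similarity per stored vector
        top = np.argsort(sims)[::-1][:k]    # indices of the k highest scores
        return [(self.ids[i], float(sims[i])) for i in top]

db = TinyVectorDB()
db.add(ids=["a", "b"], embeddings=[[1.0, 0.0], [0.0, 1.0]], metadata={"source": "docs"})
results = db.search([0.9, 0.1], k=1)  # closest to the x-axis vector "a"
```

Brute force is fine up to a few hundred thousand vectors; beyond that, real vector databases switch to approximate nearest-neighbor indexes (e.g. HNSW) to keep search fast.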
Step 6 — Prompt Construction
Now we inject retrieved documents into the prompt.
Example prompt template:
You are a helpful assistant.
Use the context below to answer the question.
Context:
{retrieved_docs}
Question:
{user_query}
Example:
prompt = f"""
Answer the question using the context below.
Context:
{docs}
Question:
{query}
"""
Step 7 — Generate Answer with LLM
Now the LLM generates the answer grounded in retrieved knowledge.
Prompt + Context
↓
LLM
↓
Answer
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": prompt}]
)
answer = response.choices[0].message.content
Production Architecture
A scalable RAG architecture looks like this:
┌─────────────┐
│ User App │
└──────┬──────┘
│
▼
┌─────────────┐
│ API Server │
└──────┬──────┘
│
┌─────────┴─────────┐
▼ ▼
Vector Database LLM API
(Retrieval) (Generation)
│ │
└───────┬───────────┘
▼
Response
| Layer | Tools |
|---|---|
| Ingestion | Airflow |
| Embeddings | OpenAI |
| Vector DB | Pinecone |
| Orchestration | LangChain |
| API | FastAPI |
