AI Engineering Learning: RAG Implementation
Table of Contents
- Introduction to RAG
- Core Concepts & Theory
- Week 2: RAG Fundamentals
- Week 3: Advanced RAG Implementation
- Practical Projects
- Debugging & Optimization
Introduction to RAG
What is Retrieval Augmented Generation (RAG)?
Retrieval Augmented Generation (RAG) is a technique that enhances Large Language Models by integrating them with external data sources. Instead of relying only on the knowledge the model was trained on, RAG allows models to:
- Retrieve relevant information from external knowledge bases
- Augment the user’s query with this retrieved context
- Generate accurate responses using both the retrieved context and the LLM’s training knowledge
Think of RAG as giving your AI assistant access to a reference library. When you ask a question, the assistant first looks up the relevant books (retrieval), then uses that information (augmentation) to answer your question (generation).
Why RAG Matters
Traditional LLMs suffer from several limitations that RAG solves:
| Problem | RAG Solution |
|---|---|
| Hallucinations: Making up false information | Retrieved documents provide grounded truth |
| Outdated Information: Knowledge cutoff in training data | Can query latest data from knowledge bases |
| Private Data: Can’t access company-specific documents | Can retrieve from private/internal sources |
| Context Window Limits: Can’t process very large documents | Retrieves only relevant chunks |
| Lack of Citations: Hard to trace where answers come from | Can cite source documents |
Real-World Use Cases
Document Processing Systems
- Company policies chatbots
- Technical manual Q&A systems
- Legal document analysis
Customer Support
- Support ticket auto-responses using company docs
- Product knowledge bases
- FAQ systems
Enterprise Knowledge Management
- Internal documentation search
- Research paper analysis
- Medical record analysis
Core Concepts & Theory
1. Vector Embeddings - The Foundation
What Are Embeddings?
Embeddings are numerical representations (vectors) of text that capture semantic meaning. Instead of storing text as raw strings, we convert it to arrays of numbers that machines can process.
Simple Example:
Text: "dog"
Embedding: [0.25, 0.89, -0.12, 0.45, 0.78, ...] (typically 300-1536 dimensions)
Text: "puppy"
Embedding: [0.27, 0.85, -0.10, 0.48, 0.75, ...] (similar to "dog")
Key Properties
- Semantic Similarity: Words with similar meanings have embeddings close together in vector space
- “king” and “queen” are close
- “dog” and “cat” are close
- “dog” and “car” are far apart
- Dimensionality:
- Smaller models: 300 dimensions
- Modern models (OpenAI): 1,536 dimensions
- Large models: Up to 4,096 dimensions
- Standardization: Must use the same embedding model for both documents and queries
Embedding Models to Know
| Model | Dimensions | Pros | Cons |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1,536 | High quality, production-ready API | Paid API |
| HuggingFace BAAI/bge-small-en-v1.5 | 384 | Free, fast, good quality | Must be run/hosted yourself |
| Sentence-BERT | 384 | Fast, lightweight | Less accurate than large models |
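To make the "dog"/"puppy" example above concrete, here is a minimal sketch (assuming an OPENAI_API_KEY is set in the environment) that generates embeddings with text-embedding-3-small and inspects their dimensionality:
from langchain_openai import OpenAIEmbeddings

# Assumes OPENAI_API_KEY is set in the environment
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

dog_vec = embeddings.embed_query("dog")      # plain Python list of floats
puppy_vec = embeddings.embed_query("puppy")

print(len(dog_vec))   # 1536 dimensions for text-embedding-3-small
print(dog_vec[:5])    # first few values of the vector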
2. Vector Similarity Metrics
Cosine Similarity (Most Common)
Cosine Similarity measures the angle between two vectors. It ranges from -1 to 1:
- 1.0 = vectors point in same direction (identical meaning)
- 0.0 = vectors are perpendicular (no relation)
- -1.0 = vectors point in opposite directions
Formula:
similarity = (A · B) / (||A|| × ||B||)
Where:
A · B = dot product (multiply matching dimensions, add results)
||A|| and ||B|| = magnitude of vectors
Why It’s Perfect for RAG:
- Computationally efficient
- Works well with high-dimensional data
- Only cares about direction, not magnitude
- Ideal for semantic search
Example:
Query: "How do I reset my password?"
Document 1: "Password reset instructions: Click forgot password..." → Similarity: 0.92 ✓
Document 2: "Account security best practices..." → Similarity: 0.45 ✗
Document 3: "Password history and policies..." → Similarity: 0.68 ✓
System retrieves documents 1 and 3 (top matches)
Other Metrics
Euclidean Distance (L2)
- Measures straight-line distance between vectors
- Considers both direction AND magnitude
- Better when vector magnitude matters
Dot Product
- Fast computation
- Works well with normalized vectors
- Less interpretable than cosine
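All three metrics can be computed directly with NumPy. A small sketch on toy vectors (the numbers are illustrative, not real embeddings):
import numpy as np

a = np.array([0.25, 0.89, -0.12, 0.45, 0.78])   # e.g. "dog"
b = np.array([0.27, 0.85, -0.10, 0.48, 0.75])   # e.g. "puppy"

# Cosine similarity: (A · B) / (||A|| × ||B||)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean (L2) distance: straight-line distance, lower = more similar
euclidean = np.linalg.norm(a - b)

# Dot product: fast, and equivalent to cosine when vectors are normalized
dot = np.dot(a, b)

print(f"cosine={cosine:.3f}, euclidean={euclidean:.3f}, dot={dot:.3f}")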
3. Document Chunking
Why Chunk Documents?
Raw documents (100+ pages) are too large for:
- Embedding models (context window limits)
- LLM context windows
- Semantic search accuracy
Solution: Split into smaller, manageable chunks while maintaining semantic meaning.
Chunking Strategies
Strategy 1: Fixed-Size Chunking
Chunk Size: 1000 characters
Overlap: 100 characters
Original: "Machine learning is... [very long text]"
Chunk 1: "Machine learning is... [first 1000 chars]"
Chunk 2: "[last 100 chars of chunk 1]...next content..."
Chunk 3: "[last 100 chars of chunk 2]...final content"
- Pros: Simple, fast
- Cons: Cuts mid-sentence, loses semantic meaning
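A minimal pure-Python sketch of fixed-size chunking with overlap (the helper name is hypothetical, just to show the sliding-window logic):
def fixed_size_chunks(text, chunk_size=1000, overlap=100):
    # Each new chunk starts `overlap` characters before the previous chunk ended
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# chunks = fixed_size_chunks(open("sample_document.txt").read())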
Strategy 2: Recursive Chunking (Recommended)
Try separators in order:
1. Double newline (\n\n) → Paragraphs
2. Single newline (\n) → Sentences
3. Space ( ) → Words
4. Character level
Stop when chunk_size is reached
- Pros: Respects document structure, maintains semantics
- Cons: Slightly slower than fixed-size
Strategy 3: Semantic Chunking (Advanced)
1. Break text into sentences
2. Create embeddings for each sentence
3. Detect semantic breakpoints (where topic changes)
4. Group sentences before breakpoint into chunks
- Pros: Most semantically coherent
- Cons: Expensive (requires many embeddings)
Best Practices
- Chunk Size: 250-1000 tokens (document and use case dependent)
- Overlap: 10-50 tokens (prevents losing context at boundaries)
- Document Type: Use specialized splitters for code, Markdown, HTML
4. The RAG Pipeline
Three Main Phases
Phase 1: Indexing (Offline)
Raw Documents
↓
[Load Documents]
↓
[Split into Chunks]
↓
[Generate Embeddings] ← Use embedding model
↓
[Store in Vector DB] ← FAISS, Pinecone, etc.
Phase 2: Retrieval (Runtime)
User Query
↓
[Generate Query Embedding] ← Same embedding model as indexing
↓
[Vector Similarity Search] ← Find top-k similar chunks
↓
[Retrieved Chunks]
Phase 3: Generation (Runtime)
User Query + Retrieved Chunks
↓
[Augment Prompt] ← Combine query with context
↓
[Send to LLM]
↓
[Generate Response]
↓
Response (grounded in retrieved documents)
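Put together, the three phases can be summarized in a few lines of pseudocode-style Python (the helper names are placeholders, not a specific library's API):
# --- Indexing (offline) ---
documents = load_documents("docs/")            # load raw files
chunks = split_into_chunks(documents)          # chunking strategy of choice
vector_db = store([embed(c) for c in chunks])  # FAISS, Pinecone, Chroma, ...

# --- Retrieval + Generation (runtime) ---
def answer(query):
    query_vector = embed(query)                       # same embedding model as indexing
    top_chunks = vector_db.search(query_vector, k=3)  # vector similarity search
    prompt = build_prompt(query, top_chunks)          # augment query with context
    return llm(prompt)                                # grounded generation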
Week 2: RAG Fundamentals
1. Vector Databases Comparison
FAISS (Facebook AI Similarity Search)
What it is: Open-source library for efficient similarity search
Best for: Learning, local development, small projects
Pros:
- Free and open-source
- Blazing fast in-memory search
- Great for experimentation
- No setup complexity
Cons:
- No persistence (manual save/load)
- Single machine only
- Not production-ready for distributed systems
Install:
pip install faiss-cpu
# Or GPU version: pip install faiss-gpu
Pinecone
What it is: Fully managed cloud vector database
Best for: Production applications, enterprise use
Pros:
- Automatic persistence and backups
- Seamless scaling
- Beautiful dashboard
- Enterprise security (SOC 2, HIPAA)
- Real-time index updates
Cons:
- Paid service (free tier limited)
- Internet dependency
- API rate limits
Install:
pip install pinecone-client
Chroma
What it is: Open-source vector database designed for LLM applications
Best for: Easy prototyping, local development
Pros:
- Simple API
- Local-first approach
- SQLite persistence
- Perfect for small projects
Cons:
- Less scalable than Pinecone
- Limited indexing options
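Chroma has no install snippet above, so here is a minimal, self-contained sketch of using it through LangChain (module paths follow the same style as the FAISS examples later in this guide):
# pip install chromadb

from langchain.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

docs = [
    Document(page_content="Machine learning is a subfield of AI."),
    Document(page_content="Vector databases store embeddings for similarity search."),
]

# persist_directory makes Chroma write its SQLite-backed index to disk
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory="./chroma_db")

results = vectorstore.similarity_search("What is machine learning?", k=1)
print(results[0].page_content)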
When to Use Each:
- FAISS: Jupyter notebooks, prototyping, learning
- Chroma: Local projects with persistence needs
- Pinecone: Production systems, enterprise apps
2. Setting Up Your First RAG System
Step 1: Install Dependencies
pip install langchain==0.1.0
pip install langchain-openai==0.0.2
pip install langchain-community==0.0.12
pip install faiss-cpu
pip install openai==1.7.2
pip install python-dotenv
pip install pypdf # For PDF loading
Step 2: Document Preparation
Create prepare_documents.py:
import os
from dotenv import load_dotenv
from langchain.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
load_dotenv()
# Step 1: Load document
loader = TextLoader("sample_document.txt")
documents = loader.load()
# Step 2: Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Max 1000 characters per chunk
chunk_overlap=100, # 100 characters overlap between chunks
length_function=len,
is_separator_regex=False
)
splits = text_splitter.split_documents(documents)
print(f"Total chunks created: {len(splits)}")
print(f"\nFirst chunk:")
print(splits[0].page_content[:200])
print(f"Metadata: {splits[0].metadata}")
Step 3: Create Embeddings
Create create_embeddings.py:
import os
from dotenv import load_dotenv
from langchain.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
load_dotenv()
# Load and split documents (same as before)
loader = TextLoader("sample_document.txt")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=100
)
splits = text_splitter.split_documents(documents)
# Step 1: Create embedding model
embeddings = OpenAIEmbeddings(
model="text-embedding-3-small",
api_key=os.getenv("OPENAI_API_KEY")
)
# Step 2: Create vector store
vectorstore = FAISS.from_documents(splits, embeddings)
# Step 3: Save vector store for later use
vectorstore.save_local("faiss_index")
print("✓ Vectorstore created and saved!")
print(f"Total vectors stored: {len(splits)}")
Step 4: Query the Vector Store
Create query_vectorstore.py:
import os
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
load_dotenv()
# Load the saved vectorstore
embeddings = OpenAIEmbeddings(
model="text-embedding-3-small",
api_key=os.getenv("OPENAI_API_KEY")
)
vectorstore = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
# Step 1: Query the vectorstore
query = "What is machine learning?"
results = vectorstore.similarity_search_with_score(query, k=3)
# Step 2: Display results
print(f"Query: {query}\n")
for i, (doc, score) in enumerate(results, 1):
    print(f"\n--- Result {i} ---")
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Score: {score:.4f}")  # FAISS returns L2 distance: lower means more similar
3. Building Your First RAG Chain
Create first_rag_chain.py:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.output_parsers import StrOutputParser
load_dotenv()
# Step 1: Load vector store and create retriever
embeddings = OpenAIEmbeddings(
model="text-embedding-3-small",
api_key=os.getenv("OPENAI_API_KEY")
)
vectorstore = FAISS.load_local(
"faiss_index",
embeddings,
allow_dangerous_deserialization=True
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# Step 2: Define prompt template
system_prompt = """You are a helpful assistant answering questions based on provided context.
Use the following context to answer the question:
{context}
If the context doesn't contain relevant information, say "I don't have this information in my knowledge base."
"""
prompt = ChatPromptTemplate.from_messages([
("system", system_prompt),
("human", "{input}")
])
# Step 3: Create LLM
llm = ChatOpenAI(
model="gpt-3.5-turbo-0125",
temperature=0,
api_key=os.getenv("OPENAI_API_KEY")
)
# Step 4: Create document chain (combines docs with prompt)
document_chain = create_stuff_documents_chain(llm, prompt)
# Step 5: Create retrieval chain (combines retriever with document chain)
retrieval_chain = create_retrieval_chain(retriever, document_chain)
# Step 6: Run the chain
query = "What is the main topic of the document?"
response = retrieval_chain.invoke({"input": query})
print(f"Question: {query}\n")
print(f"Answer: {response['answer']}\n")
print(" Retrieved Context ")
for i, doc in enumerate(response['context'], 1):
    print(f"\nSource {i}:")
    print(doc.page_content[:300] + "...")
Understanding the Chain Components
┌─────────────────────────────────────────────────────────┐
│ User Query: "What is X?" │
└────────────────────┬────────────────────────────────────┘
│
┌───────────▼───────────┐
│ RETRIEVER │ (Uses vector similarity)
│ - Vectorizes query │
│ - Searches top-k docs │
└───────────┬───────────┘
│
┌───────────▼───────────────────────┐
│ DOCUMENT CHAIN │
│ - Formats retrieved docs │
│ - Creates augmented prompt │
│ - Sends to LLM │
└───────────┬───────────────────────┘
│
┌───────────▼───────────┐
│ LLM (GPT-3.5-turbo) │
│ - Generates response │
│ - Uses context │
└───────────┬───────────┘
│
┌──────────────▼──────────────┐
│ Response + Source Citations │
└─────────────────────────────┘
4. Understanding Context Augmentation
What is Context Augmentation?
Context Augmentation is the process of enhancing the user’s query with retrieved documents before sending it to the LLM.
Before Augmentation:
Prompt to LLM: "What is quantum computing?"
(LLM uses only its training knowledge)
After Augmentation (RAG):
Prompt to LLM:
"Use the following context to answer the question:
CONTEXT:
Quantum computing is a form of computation that harnesses quantum
mechanical phenomena such as superposition and entanglement...
[more retrieved text]
QUESTION: What is quantum computing?
"
(LLM uses both training knowledge AND provided context)
How LangChain Does This
The create_stuff_documents_chain function automatically:
- Takes retrieved documents
- Combines them into a single context string
- Inserts into prompt template
- Sends to LLM
# This happens automatically:
augmented_prompt = f"""You are a helpful assistant.
Context:
{retrieved_doc_1.page_content}
{retrieved_doc_2.page_content}
{retrieved_doc_3.page_content}
User Question: {query}
Answer:"""
Week 3: Advanced RAG Implementation
1. Advanced Document Loading
Loading Different File Types
from langchain.document_loaders import (
PyPDFLoader,
TextLoader,
CSVLoader,
DirectoryLoader
)
# Load PDF
pdf_loader = PyPDFLoader("document.pdf")
pdf_docs = pdf_loader.load()
# Load multiple files from directory
directory_loader = DirectoryLoader(
"./documents",
glob="**/*.txt",
loader_cls=TextLoader
)
all_docs = directory_loader.load()
# Load CSV
csv_loader = CSVLoader(
file_path="data.csv",
source_column="source"
)
csv_docs = csv_loader.load()
2. Optimizing Retrieval
Retriever Types
# Basic similarity search
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 3} # Top 3 results
)
# Maximum marginal relevance (diverse results)
retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={
"k": 3,
"fetch_k": 10 # Get 10, return 3 most diverse
}
)
# Similarity with threshold
retriever = vectorstore.as_retriever(
search_type="similarity_score_threshold",
search_kwargs={
"score_threshold": 0.8,
"k": 3
}
)
3. Hybrid Search
Combine multiple search strategies for better results:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# BM25 (keyword-based)
bm25_retriever = BM25Retriever.from_documents(documents)
# Semantic (vector-based)
semantic_retriever = vectorstore.as_retriever()
# Combine both
ensemble_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, semantic_retriever],
weights=[0.3, 0.7] # 30% BM25, 70% semantic
)
4. Advanced Prompt Engineering for RAG
Few-Shot Prompting
from langchain.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate

examples = [
    {
        "input": "How do I reset my password?",
        "output": "To reset your password: 1) Click 'Forgot Password' 2) Enter email..."
    },
    {
        "input": "What's your refund policy?",
        "output": "Our refund policy allows 30 days for returns..."
    }
]

# Each example is rendered as a human/AI message pair
example_prompt = ChatPromptTemplate.from_messages([
    ("human", "{input}"),
    ("ai", "{output}"),
])

few_shot_prompt = FewShotChatMessagePromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
)

# Compose the examples into the final prompt together with the real question
final_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a support assistant. Answer in the same style as the examples."),
    few_shot_prompt,
    ("human", "{input}"),
])
Chain-of-Thought Prompting
system_prompt = """You are a helpful assistant that answers questions step-by-step.
Follow this process:
1. Identify the key information needed
2. Find relevant context in the provided documents
3. Reason through the answer
4. Provide a clear, well-sourced response
Context:
{context}
Think step-by-step when answering."""
5. Handling LLM Hallucinations
Problem: Hallucinations
The model generates plausible-sounding but false information not in the retrieved context.
Solutions
Solution 1: Strict Context Adherence
system_prompt = """You MUST only answer based on the provided context.
If the context doesn't contain the answer, respond with:
"This information is not available in my knowledge base."
Do NOT use your general knowledge if it contradicts the context.
Context:
{context}
Question: {input}"""
Solution 2: Source Citation
system_prompt = """Always cite your sources using [Document X] notation.
Context:
[Document 1] Machine learning basics...
[Document 2] Deep learning applications...
Example response:
"Machine learning can be applied to image recognition [Document 2].
The basic concepts include supervised learning [Document 1]."
Question: {input}"""
Solution 3: Confidence Scoring
from langchain_core.runnables import RunnablePassthrough

# Add a confidence check step
chain = (
    {
        "context": retriever,
        "input": RunnablePassthrough()
    }
    | prompt
    | llm
    | confidence_output_parser  # Custom parser that adds confidence (sketch below)
)
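confidence_output_parser above is not a built-in LangChain component. A hedged sketch of what such a custom parser could look like (the hedging-phrase heuristic is illustrative only; a real system might instead ask the LLM to self-rate or inspect retrieval scores):
from langchain_core.output_parsers import BaseOutputParser

HEDGING_PHRASES = ("not available in my knowledge base", "i'm not sure", "i don't know")

class ConfidenceOutputParser(BaseOutputParser[dict]):
    """Illustrative parser: flags low confidence when the model hedges."""

    def parse(self, text: str) -> dict:
        lowered = text.lower()
        low_confidence = any(phrase in lowered for phrase in HEDGING_PHRASES)
        return {"answer": text, "confidence": "low" if low_confidence else "high"}

confidence_output_parser = ConfidenceOutputParser()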
6. Conversational RAG
Maintain conversation history:
from langchain.chains import create_history_aware_retriever
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
# Contextualize the query using chat history
contextualize_q_system_prompt = """Given a chat history and the latest user question
which might reference context in the chat history, formulate a standalone question
which can be understood without the chat history. Don't answer it, just reformulate it."""
contextualize_q_prompt = ChatPromptTemplate.from_messages([
("system", contextualize_q_system_prompt),
MessagesPlaceholder(variable_name="chat_history"),
("human", "{input}"),
])
# History-aware retriever
history_aware_retriever = create_history_aware_retriever(
llm, retriever, contextualize_q_prompt
)
# Question-answering prompt
system_prompt = """Use the retrieved context to answer the question."""
qa_prompt = ChatPromptTemplate.from_messages([
("system", system_prompt),
MessagesPlaceholder(variable_name="chat_history"),
("human", "{input}"),
])
# Create the full chain
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)
rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)
# Usage
chat_history = []
while True:
    user_input = input("You: ")
    result = rag_chain.invoke({
        "input": user_input,
        "chat_history": chat_history
    })

    # Add to history
    chat_history.append(("human", user_input))
    chat_history.append(("ai", result["answer"]))

    print(f"Assistant: {result['answer']}")
Practical Projects
Project 1: Company Documentation Chatbot
Objective: Create a chatbot that answers questions about company policies
import os
from dotenv import load_dotenv
from pathlib import Path
from langchain.document_loaders import TextLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
load_dotenv()
# Step 1: Load company documents
company_docs_path = "./company_docs"
loader = DirectoryLoader(company_docs_path, glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()
# Step 2: Split documents
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=100
)
splits = splitter.split_documents(documents)
# Step 3: Create vectorstore
embeddings = OpenAIEmbeddings(api_key=os.getenv("OPENAI_API_KEY"))
vectorstore = FAISS.from_documents(splits, embeddings)
retriever = vectorstore.as_retriever()
# Step 4: Create RAG chain
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", api_key=os.getenv("OPENAI_API_KEY"))
prompt = ChatPromptTemplate.from_template("""
You are a helpful HR assistant. Answer questions about company policies.
Context from company documents:
{context}
Question: {input}
If the information is not in the provided documents, say:
"This information is not available in our company documentation."
""")
document_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, document_chain)
# Step 5: Use the chatbot
while True:
    question = input("\nQuestion: ")
    if question.lower() in ['exit', 'quit']:
        break
    result = chain.invoke({"input": question})
    print(f"\nAnswer: {result['answer']}")
Project 2: Research Paper Q&A System
from langchain.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
# Load PDF paper
pdf_loader = PyPDFLoader("research_paper.pdf")
documents = pdf_loader.load()
# Split with larger chunks for papers
splitter = RecursiveCharacterTextSplitter(
chunk_size=2000,
chunk_overlap=200
)
splits = splitter.split_documents(documents)
# Create vectorstore
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(splits, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Create chain with academic prompt
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
academic_prompt = ChatPromptTemplate.from_template("""
You are an academic research assistant analyzing a research paper.

Paper content:
{context}

Research Question: {input}

Provide a detailed, scholarly response grounded in the paper's content.""")

chain = (
    academic_prompt
    | llm
    | StrOutputParser()
)

# Query
questions = [
    "What is the main contribution of this paper?",
    "What methodology did the authors use?",
    "What are the limitations discussed?"
]

for q in questions:
    print(f"Q: {q}")
    # Retrieve relevant chunks and join them into a single context string
    docs = retriever.get_relevant_documents(q)
    context = "\n\n".join(doc.page_content for doc in docs)
    result = chain.invoke({"input": q, "context": context})
    print(f"A: {result}\n")
Debugging & Optimization
1. Enable Debug Mode for RAG
from langchain.globals import set_debug
# Enable full debugging
set_debug(True)
# Run your chain
result = chain.invoke({"input": "Your question"})
# This will show:
# - Retrieval results
# - Formatted prompt sent to LLM
# - LLM response
# - Token usage
set_debug(False)
2. Evaluate Retrieval Quality
from langchain.evaluation import load_evaluator

# Check whether retrieved documents are relevant to the query
# (the "criteria" evaluator uses an LLM as the judge, no reference answer needed)
relevance_evaluator = load_evaluator("criteria", criteria="relevance")

# Evaluate for specific query
query = "What is machine learning?"
retrieved_docs = retriever.get_relevant_documents(query)

evaluation = relevance_evaluator.evaluate_strings(
    input=query,
    prediction=retrieved_docs[0].page_content if retrieved_docs else "No results"
)
print(f"Relevance score: {evaluation['score']}")
3. Common Issues & Solutions
Issue 1: Poor Retrieval Results
# Solution: Adjust chunk size
splitter = RecursiveCharacterTextSplitter(
chunk_size=500, # Smaller chunks = more specific results
chunk_overlap=50
)
# Or try different retriever settings
retriever = vectorstore.as_retriever(
search_type="mmr", # Diverse results
search_kwargs={"k": 5} # Get more results
)
Issue 2: LLM Ignoring Context
# Solution: Make instructions clearer
system_prompt = """
IMPORTANT: You MUST answer ONLY based on the provided context.
You CANNOT use external knowledge.
If information is not in the context, respond:
"I don't have this information in the provided documents."
Context:
{context}
Question: {input}
Answer:"""
Issue 3: Too Many False Positives
# Solution: Use similarity threshold
retriever = vectorstore.as_retriever(
search_type="similarity_score_threshold",
search_kwargs={
"score_threshold": 0.85, # Only return high-confidence matches
"k": 3
}
)
4. Performance Optimization
Batch Processing
queries = [
"Question 1?",
"Question 2?",
"Question 3?"
]
# Batch multiple queries
results = chain.batch([{"input": q} for q in queries])
for query, result in zip(queries, results):
    print(f"Q: {query}")
    print(f"A: {result['answer']}\n")
Caching LLM Responses and Embeddings
from langchain.cache import InMemoryCache
from langchain.globals import set_llm_cache

set_llm_cache(InMemoryCache())

# Now repeated identical LLM calls are served from the in-memory cache
result1 = chain.invoke({"input": "What is X?"})
result2 = chain.invoke({"input": "What is X?"})  # Uses cached LLM response
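The snippet above caches LLM responses. To cache the embeddings themselves (so re-indexing the same chunks does not re-call the embedding API), LangChain provides CacheBackedEmbeddings; a minimal sketch:
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_openai import OpenAIEmbeddings

underlying = OpenAIEmbeddings(model="text-embedding-3-small")
store = LocalFileStore("./embedding_cache")

cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying,
    store,
    namespace=underlying.model,  # keep caches from different models separate
)

# Use cached_embeddings anywhere an embeddings object is expected, e.g.:
# vectorstore = FAISS.from_documents(splits, cached_embeddings)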
Summary: Week 2-3 Learning Path
Week 2 Accomplishments
✅ Understood embeddings and semantic similarity
✅ Learned vector database concepts
✅ Set up FAISS vector store
✅ Built your first RAG chain
✅ Learned about context augmentation
Week 3 Accomplishments
✅ Advanced document loading techniques
✅ Optimized retrieval strategies
✅ Implemented hybrid search
✅ Added conversational context
✅ Built production-ready chatbots
✅ Mastered hallucination prevention
Real-World RAG Use Cases in Action
Case 1: Technical Support Chatbot
A software company uses RAG to:
- Index all support documentation and FAQs
- When customers ask “How do I fix error 404?”, the system retrieves relevant troubleshooting guides
- Returns citations so users can read the original docs
- Reduces support tickets by 40%
Case 2: Legal Document Analysis
Law firms use RAG to:
- Index thousands of case documents and contracts
- Query: “What are precedents for breach of contract?”
- System retrieves relevant cases and sections
- Lawyers save days of manual document review
Case 3: Medical Information System
Hospitals use RAG to:
- Index patient records and medical literature
- Query: “What treatments are recommended for condition X?”
- System retrieves patient data + latest research
- Ensures answers are grounded in real data, not hallucinations
Next Steps (Week 4+)
After mastering RAG fundamentals:
- Advanced Retrieval: Implement reranking, query expansion
- Multiple Data Sources: Combine databases, SQL, APIs
- Real-Time Updates: Handle dynamic data sources
- Evaluation Frameworks: RAGAS, LangSmith
- Fine-Tuning: Custom embedding models
- Scalability: Production deployment with Pinecone/Weaviate
Key Takeaways
- RAG solves hallucinations: By providing grounded context
- Embeddings are fundamental: They enable semantic search
- Chunking matters: Right chunk size = better retrieval
- Vector databases are essential: They enable fast similarity search
- Retrieval quality first: Improving the retriever usually helps more than swapping the LLM
- Context is key: The augmented prompt is where the magic happens
- Test and iterate: RAG systems require evaluation and tuning
Resources
- Official LangChain RAG Guide
- Pinecone RAG Guide
- Deep Learning AI - RAG Course
- FAISS Documentation
- Weaviate Vector Database
Congratulations! You now have a solid foundation in RAG implementation. The next step is to start building projects and experimenting with your own data. Happy coding! 🚀
