
Retrieval Augmented Generation (RAG): A Comprehensive Guide



Introduction to Retrieval Augmented Generation (RAG)

This chapter provides a comprehensive introduction to Retrieval Augmented Generation (RAG), a powerful technique for enhancing Large Language Models (LLMs) with external knowledge. We will begin by defining RAG, exploring its motivation, and outlining its core components. This will lay the foundation for understanding how RAG systems can be built and utilized effectively.

What is Retrieval Augmented Generation (RAG)?

Retrieval Augmented Generation, or RAG, is a framework that enhances the capabilities of Large Language Models (LLMs) by enabling them to access and incorporate information from external sources during the generation process.

Retrieval Augmented Generation (RAG): A framework that combines retrieval-based systems and generation-based models. It enhances the accuracy and contextual relevance of generated responses by grounding them in information retrieved from external knowledge sources.

In simpler terms, RAG empowers LLMs to “chat with your documents” or leverage your specific data to provide more informed and relevant responses.

Motivation Behind RAG

Large Language Models are trained on vast datasets, enabling them to generate human-quality text, translate languages, and answer a wide range of questions. However, LLMs have limitations:

  • Limited Knowledge of Specific Data: LLMs are trained on general knowledge and lack access to private, specific, or real-time data. For example, an LLM might know the capital of France but not the name of your first pet.
  • Knowledge Cut-off: LLMs have a knowledge cut-off point, meaning they are unaware of events or information created after their training data was compiled.
  • Hallucinations: LLMs can sometimes generate factually incorrect or nonsensical information, often referred to as “hallucinations,” as they rely solely on their internal knowledge.

RAG addresses these limitations by:

  • Injecting External Knowledge: RAG systems allow you to “inject” your own data – documents, databases, text files, or any unstructured data – into the LLM’s process.
  • Contextualizing Responses: By accessing your data during response generation, the LLM can provide answers grounded in your specific context, overcoming the limitations of its pre-trained knowledge.
  • Improving Accuracy and Relevance: RAG enhances the accuracy and relevance of LLM responses by ensuring they are informed by up-to-date or domain-specific information.

Advantages of RAG

  • Customization: RAG provides an efficient way to customize LLMs with your own data without requiring extensive retraining of the model itself.
  • Contextual Relevance: RAG ensures that LLM responses are contextually relevant to the user’s specific information and queries.
  • Accuracy Enhancement: By grounding responses in retrieved documents, RAG reduces the likelihood of hallucinations and improves factual accuracy.
  • Access to Up-to-date Information: RAG can be connected to live data sources, enabling LLMs to provide responses based on the most current information.

Setting Up Your Development Environment

To effectively learn about and implement RAG systems, a properly configured development environment is essential. This section outlines the necessary software and accounts you will need to follow along with practical examples.

Essential Software and Accounts

  • Python: Python is the primary programming language used for developing RAG systems. Ensure you have Python installed on your machine. Instructions for installation across different operating systems (Windows, macOS, Linux) are readily available online.

    Python: A high-level, interpreted, general-purpose programming language. It is widely used in data science, machine learning, and web development due to its readability and extensive libraries.

  • Code Editor (VS Code Recommended): A code editor is crucial for writing and managing your Python code. Visual Studio Code (VS Code) is a highly recommended, free, and feature-rich editor. While not strictly required, using VS Code will align with the examples and demonstrations presented.

    VS Code (Visual Studio Code): A free source-code editor made by Microsoft for Windows, Linux and macOS. It includes support for debugging, embedded Git control, syntax highlighting, intelligent code completion, snippets, and code refactoring.

  • OpenAI Account and API Key: To interact with powerful LLMs like those from OpenAI, you will need an OpenAI account and an API key. This key acts as your authentication and allows your code to access OpenAI’s models and services. You can create an account and generate an API key on the OpenAI website.

    API Key (Application Programming Interface Key): A code used to identify and authenticate an application or user when calling an API. It is used to track and control how the API is being used, for example, to prevent abuse or unauthorized access.

Setting up Python and OpenAI Account

  • Python Installation: If you do not have Python installed, follow the instructions provided at the official Python website (python.org) to download and install the appropriate version for your operating system.

  • OpenAI Account and API Key Creation:

    1. Visit the OpenAI website (openai.com).
    2. Sign up for an account or log in if you already have one.
    3. Navigate to your account settings and find the “API keys” section.
    4. Create a new API key. Important: Keep your API key secure and do not share it publicly.
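As a quick sanity check, you can confirm that Python runs and that your key is visible to your code. The snippet below is a minimal sketch that assumes you have already exported OPENAI_API_KEY in your shell (or loaded it from a .env file, as shown later in this chapter):

import os
import sys

print(f"Python version: {sys.version.split()[0]}")
if os.getenv("OPENAI_API_KEY"):
    print("OPENAI_API_KEY found - you are ready to call the OpenAI API.")
else:
    print("OPENAI_API_KEY not set - create a key on the OpenAI website and export it first.")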

With these components set up, you will be ready to delve deeper into the workings of RAG and build your own RAG-based applications.

Deep Dive into Retrieval Augmented Generation

This section provides a more detailed exploration of RAG, breaking down its components and illustrating the process with a practical example of “Naive RAG.”

Core Components of RAG

RAG systems fundamentally consist of two main components:

  • Retriever: The retriever’s role is to identify and fetch relevant documents from a knowledge source (your data) based on the user’s query. This component is responsible for searching and retrieving information that is pertinent to the question being asked.

  • Generator: The generator, typically an LLM, takes the retrieved documents and the original user query as input. It then generates a coherent and contextually relevant response, leveraging both the user’s question and the information provided by the retriever.

The synergy between these two components is what defines RAG. The retriever provides the necessary context, and the generator synthesizes a meaningful answer based on that context.
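A bare-bones way to picture this split is as two functions, sketched below with purely illustrative placeholder names and logic (a real retriever would use embeddings and a vector store, and a real generator would call an LLM):

def retrieve(query, knowledge_base, k=1):
    # Toy retriever: rank chunks by how many words they share with the query
    query_words = set(query.lower().split())
    ranked = sorted(knowledge_base, key=lambda chunk: len(query_words & set(chunk.lower().split())), reverse=True)
    return ranked[:k]

def generate(query, context):
    # Toy generator stand-in: a real RAG system would pass the query and context to an LLM here
    return f"Answer to '{query}', grounded in: {context}"

knowledge_base = ["Paris is the capital of France.", "Python is a programming language."]
question = "What is the capital of France?"
print(generate(question, retrieve(question, knowledge_base)))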

Defining RAG: A Framework for Enhanced Responses

Synthesizing the components, we can define RAG more formally as:

Retrieval Augmented Generation (RAG): A framework that combines the strengths of retrieval-based systems and generation-based models to produce more accurate and contextually relevant responses.

The key objective of RAG is to achieve contextually relevant responses. This means providing answers that are not only accurate but also directly address the user’s query within the context of their specific data.

RAG in Action: Customizing LLMs with Your Data

At its core, RAG offers an efficient method to customize an LLM with your own data. LLMs are trained on vast amounts of public data, which is both their strength and limitation. RAG overcomes this limitation by:

  • Injecting Your Data: RAG allows you to feed your specific data into the LLM’s process, expanding its knowledge base beyond its pre-training data.
  • Enhanced Knowledge: The LLM now possesses knowledge of your specific, contextual data in addition to its general knowledge.
  • Targeted Question Answering: This enhanced knowledge enables the LLM to answer questions related to your specific data accurately and effectively.

Overview of the RAG Workflow

The typical RAG workflow can be visualized as follows:

  1. Document Preparation: Your documents are processed and divided into smaller, manageable chunks. This process is known as chunking.

    Chunking: The process of dividing large documents or text into smaller segments or chunks. This is often done to manage the context window limitations of Large Language Models and to improve retrieval efficiency.

  2. Embedding Generation: Each document chunk is passed through an embedding model, typically another LLM, to create embeddings. Embeddings are numerical representations of text, capturing their semantic meaning in a vector space.

    Embedding Model: A machine learning model that converts text or other data into numerical vectors called embeddings. These vectors represent the semantic meaning of the input data and are used for tasks like similarity search and clustering.

    Embeddings: Numerical representations of data, such as text or images, in a vector space. Embeddings capture the semantic meaning and relationships between data points, allowing for efficient similarity comparisons.

  3. Vector Storage: These embeddings are stored in a vector database or vector store. A vector database is optimized for efficient storage and retrieval of vector embeddings, enabling rapid similarity searches.

    Vector Database (Vector Store): A database specifically designed for storing and querying vector embeddings. Vector databases excel at similarity searches, allowing for efficient retrieval of vectors that are semantically similar to a query vector.

  4. Query Processing: When a user asks a question (the query), it undergoes the same embedding process as the documents, transforming it into a query embedding.

    Query: In the context of information retrieval, a query is the user’s question or search input used to retrieve relevant information from a knowledge base.

  5. Retrieval: The query embedding is used to search the vector database for the most similar document embeddings. This similarity search identifies document chunks that are semantically related to the user’s query.

    Similarity Search: The process of finding data points (e.g., vector embeddings) that are most similar to a given query point in a vector space. This is often used to retrieve relevant documents or information based on semantic similarity.

  6. Augmentation: The retrieved documents are combined with the original query. This process is called augmentation, as it enriches the query with relevant contextual information.

    Augmentation: In RAG, augmentation refers to the process of adding retrieved context (relevant documents) to the user’s query before passing it to the Large Language Model for response generation.

  7. Response Generation: The augmented query (original query + retrieved documents) is fed into the LLM (the generator). The LLM uses this combined input to generate a final, contextually informed response for the user.

    Generator: In RAG, the generator is typically a Large Language Model (LLM) that takes the augmented query (user query + retrieved documents) as input and generates a coherent and contextually relevant response.

This workflow illustrates how RAG effectively bridges the gap between the vast general knowledge of LLMs and the specific data relevant to a user’s needs.
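To make steps 2 through 7 concrete, here is a toy sketch that uses invented three-dimensional vectors in place of a real embedding model and vector database; the numbers exist purely to illustrate the mechanics of similarity search and augmentation:

import numpy as np

chunks = [
    "The Eiffel Tower is in Paris.",
    "Python is a popular programming language.",
    "Paris is the capital of France.",
]
# Pretend embeddings (a real system would get these from an embedding model)
chunk_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.2, 0.1],
])

query = "What is the capital of France?"
query_embedding = np.array([0.8, 0.2, 0.1])  # pretend query embedding

# Similarity search: cosine similarity between the query and every chunk
scores = chunk_embeddings @ query_embedding / (
    np.linalg.norm(chunk_embeddings, axis=1) * np.linalg.norm(query_embedding)
)
top_chunk = chunks[int(np.argmax(scores))]

# Augmentation: combine the retrieved chunk with the query into a prompt for the generator
prompt = f"Context: {top_chunk}\n\nQuestion: {query}\nAnswer:"
print(prompt)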

Naive RAG: A Step-by-Step Breakdown

To further clarify the RAG process, let’s examine a “Naive RAG” implementation in detail. “Naive RAG” refers to a basic, straightforward implementation of RAG, highlighting the core steps without advanced optimizations.

Indexing Phase:

  1. Documents: You begin with your collection of documents.

  2. Parsing and Pre-processing: Documents are parsed (analyzed and structured) and pre-processed (cleaned and prepared for further processing). This includes:

    • Chunking: Dividing documents into smaller chunks.

    • Text cleaning (removing noise, formatting).

    Parsing: The process of analyzing and structuring text or data into a format that can be easily understood and processed by a computer program.

    Pre-processing: The stage of preparing raw data before it is used in a machine learning model or other data processing tasks. This can include cleaning, formatting, and transforming the data.

  3. Embedding Model: The chunks are passed through an embedding model to vectorize them, creating vector embeddings.

    Vectorizing: The process of converting text or other data into numerical vectors. In the context of RAG, vectorizing is used to create embeddings of document chunks and user queries.

  4. Vector Store: The generated embeddings are saved into a vector store or vector database. This completes the indexing phase, making your documents searchable.

    Indexing: In RAG, indexing refers to the process of preparing documents for efficient retrieval. This involves chunking documents, generating embeddings, and storing them in a vector database.

Query Phase:

  1. User Query: A user poses a question or query.

  2. Embedding Model: The user query is also passed through the same embedding model to create a query embedding.

  3. Vector Database Search: The query embedding is used to perform a similarity search in the vector database. This retrieves the most relevant document chunks based on vector similarity.

  4. Augmentation Phase: The retrieved documents are combined with the user’s query. This might involve simply concatenating them or using a more structured approach to create a prompt. A prompt is a carefully crafted input to the LLM that includes both the user’s question and the relevant context from the retrieved documents.

    Prompt: In the context of Large Language Models, a prompt is the input text provided to the model to guide its response generation. In RAG, prompts often include the user query and retrieved context to instruct the model on how to generate a relevant answer.

  5. Large Language Model (Generator): The augmented prompt is passed to the LLM (generator).

  6. Response Generation: The LLM generates a coherent and contextually relevant response based on the augmented prompt.

    Coherent: In the context of language generation, coherence refers to the logical flow and consistency of the generated text. A coherent response is one that is well-structured, makes sense, and is easy to follow.

    Contextually Relevant Response: A response that is appropriate and pertinent to the context of the user’s query and the information provided. In RAG, contextual relevance is achieved by grounding responses in retrieved documents.

  7. User Response: The generated response is returned to the user.

This detailed breakdown of Naive RAG provides a solid foundation for understanding the fundamental principles behind Retrieval Augmented Generation. In the subsequent sections, we will explore more advanced techniques and address the limitations of this basic approach.

Hands-On with Naive RAG: Building a Document Chat System

This section guides you through a practical, hands-on example of building a Naive RAG system. We will create a system that can chat with a collection of news articles, demonstrating the core concepts of RAG in action.

Project Setup

  1. Project Directory: Create a new directory for your RAG project, for example, “rag_intro.”

  2. Virtual Environment: It is highly recommended to create a virtual environment for your project to manage dependencies in isolation.

    Virtual Environment: A self-contained directory that holds a specific Python installation and packages, isolated from the system-wide Python installation. This helps manage dependencies and avoid conflicts between projects.

    • Navigate to your project directory in the terminal.
    • Create a virtual environment using: python -m venv venv (or python3 -m venv venv on some systems).
    • Activate the virtual environment:
      • Windows: venv\Scripts\activate
      • macOS/Linux: source venv/bin/activate
  3. .env File: Create a file named .env in your project directory. This file will store your OpenAI API key securely. Add the following line to your .env file, replacing YOUR_OPENAI_API_KEY with your actual API key:

    OPENAI_API_KEY=YOUR_OPENAI_API_KEY
  4. News Articles: Download or prepare a collection of news articles in .txt format. Create a subdirectory named “news_articles” within your project directory and place the .txt article files inside it.

  5. app.py File: Create a Python file named app.py in your project directory. This file will contain the code for our RAG system.

Installing Dependencies

Open your terminal in the project directory (with the virtual environment activated) and install the necessary Python packages using pip:

pip install python-dotenv openai chromadb
  • python-dotenv: Loads environment variables from the .env file.

  • openai: The official OpenAI Python library for interacting with OpenAI models.

  • chromadb: A lightweight, open-source vector database (usable in memory or with on-disk persistence) that is easy to use for prototyping RAG systems.

    Chroma DB: An open-source embedding database. Chroma is used to store embeddings and query them. It is designed to be developer-friendly and easy to integrate into applications.
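Before writing the full application, you can verify that chromadb installed correctly with a minimal sketch like the one below; it uses Chroma's built-in default embedding function, so no OpenAI key is needed for this check:

import chromadb

client = chromadb.Client()  # in-memory client; nothing is written to disk
collection = client.get_or_create_collection(name="smoke_test")

collection.add(
    documents=["Chroma stores embeddings.", "Paris is the capital of France."],
    ids=["doc_0", "doc_1"],
)

results = collection.query(query_texts=["What is the capital of France?"], n_results=1)
print(results["documents"][0][0])  # expected: the sentence about Paris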

Code Implementation (app.py)

import os
from dotenv import load_dotenv
import openai
import chromadb
from chromadb.utils import embedding_functions

# Load environment variables
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialize the OpenAI client with the key loaded from .env
openai_client = openai.OpenAI(api_key=openai_api_key)

# Initialize embedding function (OpenAI)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=openai_api_key,
    model_name="text-embedding-3-small" # Or another embedding model
)

# Initialize Chroma client and collection
chroma_client = chromadb.PersistentClient(path="chroma_persistent_storage")
collection_name = "news_articles_collection"
chroma_collection = chroma_client.get_or_create_collection(
    name=collection_name,
    embedding_function=openai_ef
)

# Function to load documents from directory
def load_documents_from_directory(directory_path):
    documents = []
    print(f"Loading documents from: {directory_path}")
    for filename in os.listdir(directory_path):
        if filename.endswith(".txt"):
            filepath = os.path.join(directory_path, filename)
            with open(filepath, "r", encoding="utf-8") as f:
                documents.append(f.read())
    return documents

# Function to split text into chunks
def split_text(text, chunk_size=1000, chunk_overlap=20):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - chunk_overlap
    return chunks

# Function to generate OpenAI embeddings
def get_openai_embeddings(text_chunks):
    embeddings = []
    for chunk in text_chunks:
        response = openai_client.embeddings.create(
            input=chunk,
            model="text-embedding-3-small"
        )
        embeddings.append(response.data[0].embedding)
    return embeddings

# Function to query documents
def query_documents(query_text, num_results=5):
    results = chroma_collection.query(
        query_texts=[query_text],
        n_results=num_results
    )
    relevant_chunks = results['documents'][0]
    return relevant_chunks

# Function to generate response using OpenAI Chat API
def generate_response(query_text, relevant_chunks):
    context = "\n".join(relevant_chunks)
    prompt = f"""You are a helpful assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say "I don't know."
    Context:
    {context}
    Question: {query_text}
    Answer:"""

    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo", # Or another chat model
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=200  # Adjust as needed
    )
    return response.choices[0].message.content

# --- Main Execution ---

# Load documents
document_directory = "news_articles"
documents = load_documents_from_directory(document_directory)

# Split documents into chunks
split_documents_chunks = []
for doc in documents:
    split_documents_chunks.extend(split_text(doc))

print(f"Loaded and split {len(documents)} documents into {len(split_documents_chunks)} chunks.")

# Generate embeddings for chunks
document_embeddings = get_openai_embeddings(split_documents_chunks)

# Add chunks and embeddings to ChromaDB collection
chroma_collection.add(
    embeddings=document_embeddings,
    documents=split_documents_chunks,
    ids=[f"doc_{i}" for i in range(len(split_documents_chunks))]
)

print(f"Indexed {chroma_collection.count()} document chunks in ChromaDB.")

# Example query
example_question = "Tell me about AI replacing TV writers in the strike."

# Query documents
relevant_document_chunks = query_documents(example_question)
print("\nRelevant document chunks:")
for chunk in relevant_document_chunks:
    print(f"- {chunk[:100]}...") # Print first 100 characters of each chunk

# Generate response
answer = generate_response(example_question, relevant_document_chunks)
print(f"\nAnswer: {answer}")

# Another example query
example_question_2 = "Tell me about Databricks."
relevant_document_chunks_2 = query_documents(example_question_2)
answer_2 = generate_response(example_question_2, relevant_document_chunks_2)
print(f"\nQuestion: {example_question_2}")
print(f"Answer: {answer_2}")

Running the Application

  1. Open your terminal in the project directory (with the virtual environment activated).
  2. Run the app.py script: python app.py

The script will:

  • Load news articles from the “news_articles” directory.
  • Split the articles into chunks.
  • Generate embeddings for each chunk using OpenAI’s embedding model.
  • Store the chunks and embeddings in a ChromaDB vector database.
  • Query the database with example questions.
  • Generate answers using OpenAI’s chat model, based on the retrieved document chunks.

You should see output in your terminal showing the loaded documents, indexed chunks, relevant document chunks retrieved for your queries, and the generated answers.

Experiment and Explore

  • Ask Different Questions: Modify the example_question and example_question_2 variables in app.py to ask different questions related to the content of your news articles. Observe how the RAG system retrieves relevant chunks and generates answers.
  • Explore Different Documents: Replace the news articles in the “news_articles” directory with your own documents (e.g., personal notes, company documents, research papers). Re-run the script and test the system with queries relevant to your new documents.
  • Adjust Chunk Size and Overlap: Experiment with different values for chunk_size and chunk_overlap in the split_text function. Observe how these parameters affect retrieval and response quality.
  • Try Different Embedding Models: Explore other embedding models offered by OpenAI or other providers. Change the model_name parameter in the OpenAIEmbeddingFunction initialization to experiment with different models and see how they impact performance.
  • Examine the ChromaDB Database: After running the script, you will find a “chroma_persistent_storage” directory created in your project. This directory contains the persisted ChromaDB database. While you won’t directly interact with this database in this basic example, it’s important to understand that your indexed data is stored there.

This hands-on exercise provides a practical understanding of how Naive RAG works and allows you to experiment with key parameters and components.
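For the chunk-size experiment in the list above, a small snippet like this (added to app.py after split_text is defined, with an invented 2,500-character stand-in document) shows how the two parameters change the number of chunks produced:

sample_text = "A" * 2500  # invented stand-in for a real document

for size, overlap in [(1000, 20), (500, 50), (250, 0)]:
    chunks = split_text(sample_text, chunk_size=size, chunk_overlap=overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap} -> {len(chunks)} chunks")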

Pitfalls of Naive RAG: Challenges and Drawbacks

While Naive RAG provides a functional starting point, it suffers from several limitations and challenges. Understanding these pitfalls is crucial for appreciating the need for more advanced RAG techniques.

Limited Contextual Understanding

Naive RAG often struggles with queries that require a deeper understanding of context and relationships between different pieces of information.

  • Keyword Matching Limitations: Naive RAG relies heavily on keyword matching or basic semantic similarity for retrieval. This can lead to retrieving documents that contain keywords from the query but are not truly relevant to the user’s intent or the broader context of the question.

    • Example: If a user asks, “What is the impact of climate change on polar bears?” a Naive RAG system might retrieve documents that broadly discuss “climate change” and separate documents about “polar bears.” However, it might fail to find the most relevant documents that specifically discuss the impact of climate change on polar bears in context.
  • Irrelevant or Partially Relevant Documents: Due to the limitations of basic retrieval methods, Naive RAG can retrieve documents that are only partially relevant or even completely irrelevant to the user’s query.

Inconsistent Relevance and Quality of Retrieved Documents

The quality and relevance of retrieved documents can vary significantly in Naive RAG.

  • Ineffective Document Ranking: Naive RAG models may not effectively rank retrieved documents by relevance. This can lead to poor-quality input for the generative model, as less relevant documents might be given undue weight.
  • Outdated or Less Credible Resources: Without sophisticated ranking and filtering mechanisms, Naive RAG might retrieve outdated or less credible resources alongside more valuable ones.

Poor Integration Between Retrieval and Generation

In Naive RAG, the retriever and generator components often operate independently, without optimized interaction.

  • Lack of Synergy: This lack of synergy can lead to suboptimal performance. The generative model might not fully leverage the context provided by the retrieved documents.
  • Ignoring Critical Context: The generative model might generate a response that ignores crucial context from the retrieved documents, resulting in generic or off-topic answers.

Inefficient Handling of Large-Scale Data

Naive RAG systems can struggle to scale effectively to large datasets due to inefficient retrieval mechanisms.

  • Slow Response Times: In large knowledge bases, a naive retriever might take too long to find relevant documents, leading to slower response times.
  • Missing Critical Information: Inadequate indexing and search strategies in Naive RAG can cause it to miss critical information within large datasets.

Lack of Robustness and Adaptability

Naive RAG models often lack mechanisms to handle ambiguous or complex queries robustly.

  • Ambiguous Queries: They may struggle with user queries that are vague, multifaceted, or contain implicit assumptions.

  • Limited Adaptability: Naive RAG systems are typically not adaptable to changing contexts or user needs without significant manual intervention.

    • Example: If a user query is “Tell me more about index funds and anything related to finances,” a Naive RAG system might retrieve documents about “index funds” and separate documents about “finances” but fail to provide a coherent and comprehensive answer that connects these concepts in a meaningful way.

Summary of Naive RAG Pitfalls

The pitfalls of Naive RAG can be broadly categorized into:

  • Retrieval Challenges: Leading to the selection of misaligned or irrelevant chunks, potentially missing crucial information.
  • Generative Challenges: The model might struggle with hallucinations, relevance, toxicity, or bias in its outputs, even with retrieved context, due to the limitations of Naive RAG in providing truly relevant and high-quality context.

Understanding these limitations is essential for motivating the exploration of advanced RAG techniques, which aim to address these shortcomings and create more robust and effective RAG systems.

Advanced RAG Techniques: Solutions and Enhancements

To overcome the limitations of Naive RAG, various advanced RAG techniques have been developed. These techniques focus on improving both the retrieval and generation stages of the RAG pipeline.

Benefits of Advanced RAG

Advanced RAG techniques introduce specific improvements to address the pitfalls of Naive RAG, primarily focusing on enhancing retrieval quality. These improvements can be broadly categorized into two stages:

  • Pre-retrieval: Techniques applied before the actual document retrieval process to improve indexing structures and user queries.

    • Improved Indexing Structure: Organizing and structuring indexed data more effectively (e.g., adding metadata, hierarchical indexing).
    • Enhanced User Query: Refining user queries to be more precise and informative for retrieval (e.g., query expansion, query rewriting).
  • Post-retrieval: Techniques applied after the initial document retrieval to refine the retrieved context before feeding it to the generator.

    • Reranking: Reranking retrieved documents to highlight the most relevant content based on more sophisticated relevance scoring.
    • Contextualization and Filtering: Further processing and filtering of retrieved documents to extract the most pertinent information and remove noise.

Many advanced RAG techniques exist, and their specific implementations can overlap. However, the overarching goal is to create a more efficient and accurate RAG workflow.
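As a concrete illustration of the post-retrieval reranking idea, here is a minimal sketch using a cross-encoder from the sentence-transformers library; the query and chunks are invented for the example, and this is not the implementation used later in the chapter:

from sentence_transformers import CrossEncoder

query = "What drove revenue growth this year?"
retrieved_chunks = [
    "Revenue increased, driven by growth in cloud services.",
    "The company opened three new offices in Europe.",
    "Operating expenses grew due to increased headcount.",
]

# Score each (query, chunk) pair; higher scores indicate higher relevance
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])

# Sort chunks by score so only the most relevant ones go into the prompt
reranked_chunks = [chunk for _, chunk in sorted(zip(scores, retrieved_chunks), reverse=True)]
print(reranked_chunks[0])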

Query Expansion with Generated Answers: Refining Retrieval with LLMs

One powerful advanced RAG technique is query expansion with generated answers. This technique leverages LLMs to generate potential answers to the user’s query before retrieval. These generated “hallucinated” answers are then used to refine the query and improve retrieval relevance.

Query Expansion with Generated Answers: An advanced RAG technique that uses a Large Language Model to generate potential answers to a user query. These generated answers are then used to expand or refine the original query, improving the relevance of retrieved documents.

Workflow:

  1. Original Query: The user provides an initial query.
  2. LLM-Generated Answer (Hallucinated): The original query is passed to an LLM to generate a potential answer. This answer is “hallucinated” because it is generated without external knowledge retrieval, relying solely on the LLM’s pre-trained knowledge.
  3. Query and Answer Concatenation: The generated answer is concatenated with the original query to create a new, expanded query.
  4. Vector Database Retrieval: The expanded query is used to perform a similarity search in the vector database, retrieving relevant documents.
  5. Response Generation (with Retrieval): The retrieved documents, along with the original query (or expanded query), are passed to an LLM to generate the final, contextually informed response.

Diagram:

[Original Query] --> [Large Language Model] --> [Generated Answer (Hallucinated)]

[Original Query + Generated Answer] --> [Vector Database (Retrieval System)] --> [Retrieved Documents]

[Original Query + Retrieved Documents] --> [Large Language Model] --> [Final Answer]

Use Cases for Query Expansion with Generated Answers:

  • Information Retrieval: Enhancing the effectiveness of search engines by providing more comprehensive and relevant search results.
  • Question Answering Systems: Improving the retrieval of relevant documents or passages to help answer user queries more accurately.
  • E-commerce Search: Increasing the accuracy and relevance of product searches by expanding user queries with related terms and concepts.
  • Academic Research: Helping researchers find more relevant papers by expanding search queries with related scientific terms and concepts.

Hands-On: Query Expansion with Generated Answers

To demonstrate query expansion with generated answers, we will modify our previous Naive RAG example. We will focus on implementing the query expansion technique and visualizing its impact on retrieval results.

Project Setup:

You can reuse the project directory and virtual environment from the Naive RAG example. Ensure you have the required dependencies installed (python-dotenv, openai, chromadb, pandas, pypdf, sentence-transformers, langchain, umap-learn, matplotlib).

Code Implementation (expansion_answer.py):

Create a new Python file named expansion_answer.py in your project directory.

import os
import pandas as pd
import chromadb
import openai
from dotenv import load_dotenv
from chromadb.utils import embedding_functions
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
import umap
import matplotlib.pyplot as plt
import numpy as np


# Helper functions for embedding projection and visualization (utils.py -  create this file in the same directory)
from utils import project_embeddings

# Load environment variables and initialize the OpenAI client
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
openai_client = openai.OpenAI(api_key=openai_api_key)

# Embedding function (Sentence Transformers, runs locally)
sentence_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

# Chroma client and collection
chroma_client = chromadb.PersistentClient(path="chroma_persistent_storage")
collection_name = "microsoft_annual_report_collection"
chroma_collection = chroma_client.get_or_create_collection(
    name=collection_name,
    embedding_function=sentence_ef
)

# --- Data Loading and Indexing (Same as before - for brevity, assume functions like pdf_to_text, split_text, etc. are defined or reused from previous example) ---
# ... (Code for loading PDF, splitting text, generating embeddings, and adding to ChromaDB) ...

# --- Query Expansion with Generated Answer Function ---
def augment_query_generated_answer(query):
    augmented_query_prompt = f"""You are a helpful expert financial research assistant.
    Please provide an example answer to the given question, that might be found in a document like an annual report.

    Question: {query}
    Answer:"""

    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": augmented_query_prompt}
        ]
    )
    augmented_answer = response.choices[0].message.content
    return augmented_answer

# --- Main Execution ---

# ... (Code for loading PDF, splitting text, indexing data - same as before) ...

# Original query
original_query = "What was the total profit for the year and how does it compare to the previous year?"

# Generate hypothetical answer using LLM
hypothetical_answer = augment_query_generated_answer(original_query)
print(f"Hypothetical Answer: {hypothetical_answer}")

# Joint query (original query + hypothetical answer)
joint_query = f"{original_query} {hypothetical_answer}"
print(f"Joint Query: {joint_query}")

# Query ChromaDB with joint query
query_results = chroma_collection.query(
    query_texts=[joint_query],
    n_results=5,
    include=["documents", "embeddings"]
)

retrieved_documents = query_results['documents'][0]
print("\nRetrieved Documents:")
for doc in retrieved_documents:
    print(f"- {doc[:100]}...")

# --- Visualization (UMAP and Matplotlib) ---
# ... (Code for projecting embeddings and plotting - similar to previous example, using project_embeddings function from utils.py) ...
# ... (Refer to previous example for visualization code) ...

# Get embeddings from ChromaDB results
retrieved_embeddings = query_results['embeddings'][0]

# Get embeddings for original and augmented queries
embedding_function = sentence_ef  # Use the same embedding function as for documents
original_query_embedding = embedding_function([original_query])[0]
augmented_query_embedding = embedding_function([joint_query])[0]

# Fit UMAP once on the full set of dataset embeddings, then project everything
# with the same fitted reducer so all points share a single 2D space
dataset_embeddings = np.array(chroma_collection.get(include=["embeddings"])['embeddings'])
umap_reducer = umap.UMAP(n_components=2, random_state=42).fit(dataset_embeddings)
projected_dataset_embedding = project_embeddings(dataset_embeddings, umap_reducer)
projected_original_query_embedding = project_embeddings([original_query_embedding], umap_reducer)
projected_augmented_query_embedding = project_embeddings([augmented_query_embedding], umap_reducer)
projected_retrieved_embedding = project_embeddings(retrieved_embeddings, umap_reducer)


# Plotting
plt.figure(figsize=(8, 8))
plt.scatter(*projected_dataset_embedding.T, s=2, label='Dataset Embeddings', color='gray')
plt.scatter(*projected_original_query_embedding.T, s=100, color='red', marker='x', label='Original Query')
plt.scatter(*projected_augmented_query_embedding.T, s=100, color='orange', marker='x', label='Augmented Query')
plt.scatter(*projected_retrieved_embedding.T, s=100, color='green', marker='o', label='Retrieved Documents')


plt.legend()
plt.title('2D Embedding Space: Query Expansion with Generated Answer')
plt.show()

utils.py (Helper functions for embedding projection - create this file in the same directory as expansion_answer.py):

import numpy as np

def project_embeddings(embeddings, umap_reducer):
    """Projects embeddings to 2D with an already-fitted UMAP reducer, so that
    queries, retrieved chunks, and the full dataset share the same 2D space."""
    return umap_reducer.transform(np.array(embeddings))

Running the Application:

  1. Run the expansion_answer.py script from your terminal: python expansion_answer.py

The script will execute the query expansion technique, retrieve documents based on the expanded query, and generate a 2D visualization of the embedding space using UMAP and Matplotlib.

Analyzing the Visualization:

Examine the generated 2D plot. You should observe:

  • Original Query (Red ‘x’): Represents the embedding of your original query.
  • Augmented Query (Orange ‘x’): Represents the embedding of the expanded query (original query + generated answer).
  • Dataset Embeddings (Gray dots): Represent the embeddings of all document chunks in your dataset.
  • Retrieved Documents (Green circles): Represent the embeddings of the documents retrieved using the expanded query.

Ideally, you should see that the “Retrieved Documents” (green circles) are clustered closer to the “Augmented Query” (orange ‘x’) in the embedding space compared to the “Original Query” (red ‘x’). This visually demonstrates how query expansion with generated answers can improve retrieval relevance.

Experiment and Challenge:

  • Ask Different Queries: Modify the original_query variable to ask different questions about the Microsoft annual report. Observe how the visualization changes and whether the query expansion technique consistently improves retrieval.
  • Analyze Different Prompts: Experiment with different prompts in the augment_query_generated_answer function. Change the prompt to guide the LLM to generate different types of hypothetical answers. Observe how prompt variations affect retrieval and visualization.
  • Compare to Naive RAG: Comment out the query expansion code in expansion_answer.py and run the script with just the original query (effectively reverting to Naive RAG). Compare the visualization to the query expansion results. Does query expansion noticeably improve retrieval relevance in the visualization?
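To make the comparison in the last bullet concrete, you can append a short snippet like this to expansion_answer.py; it reuses the chroma_collection, original_query, and joint_query objects already defined in that script:

# Compare what the original query and the expanded (joint) query each retrieve
naive_results = chroma_collection.query(query_texts=[original_query], n_results=5)
expanded_results = chroma_collection.query(query_texts=[joint_query], n_results=5)

naive_docs = set(naive_results["documents"][0])
expanded_docs = set(expanded_results["documents"][0])
print(f"Chunks retrieved by both queries: {len(naive_docs & expanded_docs)} of 5")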

This hands-on exercise allows you to experience the benefits of query expansion with generated answers and visualize its impact on retrieval performance.

Query Expansion with Multiple Queries: Generating Diverse Perspectives

Another advanced RAG technique for query expansion is query expansion with multiple queries. Instead of generating a single answer, this technique uses an LLM to generate multiple related queries based on the original user query. These multiple queries are then used to retrieve a broader and more diverse set of relevant documents.

Query Expansion with Multiple Queries: An advanced RAG technique that uses a Large Language Model to generate multiple related queries from an original user query. These multiple queries are then used to retrieve a more diverse and comprehensive set of relevant documents.

Workflow:

  1. Original Query: The user provides an initial query.
  2. LLM-Generated Multiple Queries: The original query is passed to an LLM, instructed to generate several (e.g., up to five) related or sub-queries that explore different facets of the original query.
  3. Vector Database Retrieval (Multiple Queries): Each of the generated queries (including the original query) is used to perform a separate similarity search in the vector database.
  4. Document Aggregation and Sanitization: The documents retrieved for each query are aggregated into a single list. Duplicate documents are removed to sanitize the list and avoid redundancy.
  5. Response Generation (with Aggregated Retrieval): The aggregated and sanitized list of retrieved documents, along with the original query, are passed to an LLM to generate the final, contextually informed response.

Diagram:

[Original Query] --> [Large Language Model (Generate Multiple Queries)] --> [Query 1, Query 2, Query 3, ...]

[Vector Database (Retrieval System) - Original Query] --> [Results 0]
[Vector Database (Retrieval System) - Query 1] --> [Results 1]
[Vector Database (Retrieval System) - Query 2] --> [Results 2]
...

[Aggregate & Sanitize Results] --> [Unique Retrieved Documents]

[Original Query + Unique Retrieved Documents] --> [Large Language Model] --> [Final Answer]

Use Cases for Query Expansion with Multiple Queries:

  • Exploratory Data Analysis: Helping analysts explore different facets of data by generating varied sub-queries.
  • Academic Research: Providing researchers with different angles on a research question by generating multiple search queries.
  • Customer Support: Covering all aspects of a user’s query by breaking it down into specific sub-queries.
  • Healthcare Information Systems: Retrieving comprehensive medical information by expanding queries to include related symptoms, treatments, or diagnoses.

Hands-On: Query Expansion with Multiple Queries

To demonstrate query expansion with multiple queries, we will create another modified version of our RAG example. This time, we will implement the multiple query generation technique and again visualize its impact.

Code Implementation (expansion_queries.py):

Create a new Python file named expansion_queries.py in your project directory.

import os
import pandas as pd
import chromadb
import openai
from dotenv import load_dotenv
from chromadb.utils import embedding_functions
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
import umap
import matplotlib.pyplot as plt
import numpy as np

# Helper functions for embedding projection and visualization (utils.py -  create this file in the same directory)
from utils import project_embeddings


# Load environment variables and initialize the OpenAI client
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
openai_client = openai.OpenAI(api_key=openai_api_key)


# Embedding function (Sentence Transformers, runs locally)
sentence_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

# Chroma client and collection
chroma_client = chromadb.PersistentClient(path="chroma_persistent_storage")
collection_name = "microsoft_annual_report_collection"
chroma_collection = chroma_client.get_or_create_collection(
    name=collection_name,
    embedding_function=sentence_ef
)


# --- Data Loading and Indexing (Same as before - for brevity, assume functions like pdf_to_text, split_text, etc. are defined or reused from previous example) ---
# ... (Code for loading PDF, splitting text, generating embeddings, and adding to ChromaDB) ...

# --- Multi-Query Generation Function ---
def generate_multi_query(query):
    multi_query_prompt = f"""You are a knowledgeable financial research assistant.
    Your users are asking about an annual report.
    For the given question, please propose up to five relevant and related questions to assist the user in finding the information they need.
    Ensure that the questions are distinct and cover different aspects of the topic.

    Original Question: {query}

    Proposed Questions:"""

    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": multi_query_prompt}
        ]
    )
    # Split the model's answer into one query per line, dropping blank lines
    augmented_queries = [q.strip() for q in response.choices[0].message.content.split("\n") if q.strip()]
    return augmented_queries


# --- Main Execution ---

# ... (Code for loading PDF, splitting text, indexing data - same as before) ...

# Original query
original_query = "What details can you provide about the factors that led to revenue growth?"

# Generate multiple augmented queries
augmented_queries = generate_multi_query(original_query)
print(f"Augmented Queries: {augmented_queries}")

# Join original query with augmented queries for retrieval
joint_query_texts = [original_query] + augmented_queries
print(f"Joint Query Texts: {joint_query_texts}")

# Query ChromaDB with joint queries
query_results = chroma_collection.query(
    query_texts=joint_query_texts,
    n_results=5,
    include=["documents", "embeddings"]
)


# --- Sanitize and Aggregate Retrieved Documents ---
retrieved_documents_per_query = query_results['documents']
print("\nRetrieved Documents per Query:")
for i, docs in enumerate(retrieved_documents_per_query):
    print(f"Query {i+1}:")
    for doc in docs:
        print(f"- {doc[:100]}...")

# Flatten list of lists to get all retrieved documents
all_retrieved_documents = [doc for docs in retrieved_documents_per_query for doc in docs]

# Remove duplicates (optional, but can be helpful)
unique_retrieved_documents = list(set(all_retrieved_documents))
print(f"\nUnique Retrieved Documents Count: {len(unique_retrieved_documents)}")


# --- Visualization (UMAP and Matplotlib) ---
# ... (Code for projecting embeddings and plotting - similar to previous example, using project_embeddings function from utils.py) ...
# ... (Refer to previous example for visualization code, adapting for multiple queries) ...


# Get embeddings from ChromaDB results (flattened for visualization)
retrieved_embeddings_per_query = query_results['embeddings']
all_retrieved_embeddings = [embed for embeds in retrieved_embeddings_per_query for embed in embeds]


# Get embeddings for original and augmented queries
embedding_function = sentence_ef  # Use the same embedding function as for documents
original_query_embedding = embedding_function([original_query])[0]


# Fit UMAP once on the full set of dataset embeddings, then project everything
# with the same fitted reducer so all points share a single 2D space
dataset_embeddings = np.array(chroma_collection.get(include=["embeddings"])['embeddings'])
umap_reducer = umap.UMAP(n_components=2, random_state=42).fit(dataset_embeddings)
projected_dataset_embedding = project_embeddings(dataset_embeddings, umap_reducer)
projected_original_query_embedding = project_embeddings([original_query_embedding], umap_reducer)
projected_augmented_query_embeddings = project_embeddings(embedding_function(augmented_queries), umap_reducer)  # augmented queries
projected_retrieved_embedding = project_embeddings(all_retrieved_embeddings, umap_reducer)  # flattened retrieved embeddings


# Plotting
plt.figure(figsize=(8, 8))
plt.scatter(*projected_dataset_embedding.T, s=2, label='Dataset Embeddings', color='gray')
plt.scatter(*projected_original_query_embedding.T, s=100, color='red', marker='x', label='Original Query')
plt.scatter(*projected_augmented_query_embeddings.T, s=100, color='orange', marker='x', label='Augmented Queries') # Plot augmented queries
plt.scatter(*projected_retrieved_embedding.T, s=100, color='green', marker='o', label='Retrieved Documents')


plt.legend()
plt.title('2D Embedding Space: Query Expansion with Multiple Queries')
plt.show()

Running the Application:

  1. Run the expansion_queries.py script from your terminal: python expansion_queries.py

The script will execute the query expansion with multiple queries technique, retrieve documents based on the expanded queries, and generate a 2D visualization.

Analyzing the Visualization:

Examine the generated 2D plot. You should observe:

  • Original Query (Red ‘x’): Represents the embedding of your original query.
  • Augmented Queries (Orange ‘x’s): Represent the embeddings of the multiple queries generated by the LLM.
  • Dataset Embeddings (Gray dots): Represent the embeddings of all document chunks.
  • Retrieved Documents (Green circles): Represent the embeddings of the documents retrieved using all the generated queries.

Ideally, you should see that the “Retrieved Documents” (green circles) are broadly distributed around the cluster of “Augmented Queries” (orange ‘x’s), indicating that the multiple queries helped retrieve a more diverse set of relevant documents compared to using just the original query (red ‘x’).

Experiment and Challenge:

  • Modify the Original Query: Change the original_query variable to explore different aspects of the Microsoft annual report. Observe how the generated multiple queries and the visualization change.
  • Refine the Prompt: Experiment with the prompt in the generate_multi_query function. Try different prompts to guide the LLM to generate more targeted or diverse sets of sub-queries. How does prompt engineering influence the quality and diversity of generated queries and the retrieved documents?
  • Compare to Single Query Expansion: Compare the visualization from expansion_queries.py to the visualization from expansion_answer.py (query expansion with generated answer). Does query expansion with multiple queries retrieve a broader range of documents compared to single query expansion?

This hands-on exercise demonstrates the benefits of query expansion with multiple queries in retrieving a more comprehensive set of relevant documents by exploring different facets of the user’s original query.

Conclusion: Building Effective RAG Systems

This chapter has provided a comprehensive introduction to Retrieval Augmented Generation (RAG), starting with its fundamental concepts and components and progressing through practical examples of Naive RAG and advanced RAG techniques like query expansion.

Key Takeaways from this Chapter

  • RAG Definition and Motivation: RAG is a powerful framework for enhancing Large Language Models by grounding their responses in external knowledge. It addresses the limitations of LLMs by enabling them to access and incorporate specific, up-to-date, or private data.
  • Naive RAG Workflow: We explored the basic workflow of Naive RAG, including document indexing, query processing, retrieval, augmentation, and response generation.
  • Pitfalls of Naive RAG: We identified several challenges and limitations of Naive RAG, including limited contextual understanding, inconsistent relevance, poor integration between retrieval and generation, inefficiency with large datasets, and lack of robustness.
  • Advanced RAG Techniques: We introduced advanced RAG techniques to overcome the pitfalls of Naive RAG, focusing on query expansion with generated answers and query expansion with multiple queries.
  • Hands-On Implementation: Through practical code examples and visualizations, we demonstrated how to build and experiment with both Naive RAG and advanced RAG techniques.

Next Steps and Further Exploration

  • Explore More Advanced RAG Techniques: This chapter only scratched the surface of advanced RAG. Further exploration should include techniques like:

    • Reranking Algorithms: Implementing more sophisticated algorithms (e.g., Sentence-BERT reranking, Cohere reranking) to improve the relevance ranking of retrieved documents.
    • Context Compression: Techniques to compress and filter retrieved context to reduce noise and improve the generator’s focus on the most pertinent information.
    • Knowledge Graph Integration: Incorporating knowledge graphs as external knowledge sources for RAG systems.
    • Iterative RAG: Developing RAG systems that can iteratively refine retrieval and generation processes based on user feedback or intermediate results.
  • Prompt Engineering: Mastering prompt engineering is crucial for effective RAG systems. Experiment with different prompt strategies to optimize the generator’s behavior and response quality.

  • Vector Database Selection: Explore different vector database options beyond ChromaDB, such as Pinecone, Weaviate, or FAISS, to find the best solution for your specific needs in terms of scalability, performance, and features.

  • Evaluation Metrics: Learn about and implement appropriate evaluation metrics to quantitatively assess the performance of your RAG systems (e.g., relevance, accuracy, faithfulness).

  • Real-World Applications: Consider real-world applications for RAG in your domain. How can RAG be used to solve practical problems, enhance existing workflows, or create new and innovative solutions?

Final Thoughts

Retrieval Augmented Generation is a rapidly evolving field with immense potential for enhancing Large Language Models and building intelligent applications. By understanding the fundamentals of RAG, its limitations, and advanced techniques, you are well-equipped to embark on your journey of building effective and impactful RAG systems.

Remember that the key to successful RAG implementation lies in continuous experimentation, refinement, and adaptation to specific use cases and data. As you delve deeper into RAG, you will discover even more sophisticated techniques and strategies to unlock the full potential of this powerful framework.