Building a RAG Chatbot from Your Website Data using OpenAI and Langchain (Hands-On)

Imagine a tireless assistant on your website, ready to answer customer questions 24/7. That’s the power of a chatbot! In this post, we’ll guide you through building a custom chatbot specifically trained on your website’s data using OpenAI and Langchain. Let’s dive in and create this helpful conversational AI!

If you’d like to follow along hands-on rather than just reading, check out our project on the same topic, Building a RAG Chatbot from Your Website Data using OpenAI and Langchain. You will also receive a project completion certificate that you can use to showcase your Generative AI skills.

Step 1: Grabbing Valuable Content from Your Website

We first need the gold mine of information – the content from your website! To achieve this, we’ll build a web crawler using Python’s requests library and Beautiful Soup. This script will act like a smart visitor, fetching the text content from each webpage on your website.

Here’s what our web_crawler.py script will do:

  1. Fetch the Webpage: It’ll send a request to retrieve the HTML content of a given website URL.
  2. Check for Success: The script will ensure the server responds positively (think status code 200) before proceeding.
  3. Parse the HTML Structure: Using Beautiful Soup, it will analyze the downloaded HTML to understand how the webpage is built.
  4. Clean Up the Mess: It will discard unnecessary elements like scripts and styles that don’t contribute to the core content you want for the chatbot.
  5. Extract the Text: After that, it will convert the cleaned HTML into plain text format, making it easier to process later.
  6. Grab Extra Info (Optional): The script can optionally extract metadata like page titles and descriptions for better organization.

Imagine this script as a virtual visitor browsing your website and collecting the text content, leaving behind the fancy formatting for now.

Let’s code!

import requests
from bs4 import BeautifulSoup
import html2text
from urllib.parse import urlparse


def get_data_from_website(url):
    """
    Retrieve text content and metadata from a given URL.

    Args:
        url (str): The URL to fetch content from.

    Returns:
        tuple: A tuple containing the text content (str) and metadata (dict).
    """
    # Get response from the server
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to fetch the page, status code: {response.status_code}")
        return
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Removing js and css code
    for script in soup(["script", "style"]):
        script.extract()

    # Extract text in markdown format
    html = str(soup)
    html2text_instance = html2text.HTML2Text()
    html2text_instance.images_to_alt = True
    html2text_instance.body_width = 0
    html2text_instance.single_line_break = True
    text = html2text_instance.handle(html)

    # Extract page metadata
    try:
        page_title = soup.title.string.strip()
    except AttributeError:
        # Fall back to a slug built from the URL path when the page has no <title>
        page_title = urlparse(url).path[1:].replace("/", "-")
    meta_description = soup.find("meta", attrs={"name": "description"})
    meta_keywords = soup.find("meta", attrs={"name": "keywords"})
    if meta_description:
        description = meta_description.get("content")
    else:
        description = page_title
    if meta_keywords:
        meta_keywords = meta_keywords.get("content")
    else:
        meta_keywords = ""

    metadata = {'title': page_title,
                'url': url,
                'description': description,
                'keywords': meta_keywords}

    return text, metadata

Explanation:

The get_data_from_website function takes a website URL and returns the extracted text content along with its metadata (title, URL, description, and keywords). Explore the code further to see how it performs each step mentioned!
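
If you want to try the crawler on its own, here’s a minimal sketch (the URL is just a placeholder) that prints the collected metadata and a preview of the extracted text:

from web_crawler import get_data_from_website

# Placeholder URL -- replace it with a page from your own website
text, metadata = get_data_from_website("https://example.com/")
print(metadata)
print(text[:500])  # preview the first 500 characters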

Step 2: Cleaning Up the Raw Text

We have the raw text content from our website, but it might contain inconsistencies like extra spaces, irregular newlines, or broken formatting. This can affect how well our future chatbot understands the information. Let’s refine this text for a smoother process.

Here’s where our text_to_doc.py script comes in:

  1. Define Cleaning Functions: This script will have special functions designed to tackle these text quirks. For instance, one function might merge hyphenated words that got split across lines, another might fix inconsistent newline characters, and another might remove unnecessary consecutive newlines.
  2. Scrub the Text Clean: The script will meticulously go through each cleaning function one by one, polishing the raw text and ensuring a consistent format.
  3. Break Down the Text: Large chunks of text can be overwhelming for some Natural Language Processing (NLP) tasks. This script uses a text splitter from Langchain to chop the cleaned text into smaller, more manageable segments.
  4. Create Knowledge Nuggets (Documents): Each text segment will be transformed into a “document” object. This document will hold the refined text content and the corresponding metadata (like page titles) extracted earlier from the webpage.

Think of this script as an editor meticulously reviewing and refining the raw text, making it easier to understand for our future chatbot.

Let’s see the code snippet!

import re
from langchain.text_splitter import MarkdownTextSplitter
from langchain.docstore.document import Document


# Data Cleaning functions

def merge_hyphenated_words(text):
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def fix_newlines(text):
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)


def remove_multiple_newlines(text):
    return re.sub(r"\n{2,}", "\n", text)


def clean_text(text):
    """
    Cleans the text by passing it through a list of cleaning functions.

    Args:
        text (str): Text to be cleaned

    Returns:
        str: Cleaned text
    """
    cleaning_functions = [merge_hyphenated_words, fix_newlines, remove_multiple_newlines]
    for cleaning_function in cleaning_functions:
        text = cleaning_function(text)
    return text


def text_to_docs(text, metadata):
    """
    Converts input text to a list of Documents with metadata.

    Args:
        text (str): A string of text.
        metadata (dict): A dictionary containing the metadata.

    Returns:
        List[Document]: List of documents.
    """
    doc_chunks = []
    text_splitter = MarkdownTextSplitter(chunk_size=2048, chunk_overlap=128)
    chunks = text_splitter.split_text(text)
    for chunk in chunks:
        doc = Document(page_content=chunk, metadata=metadata)
        doc_chunks.append(doc)
    return doc_chunks


def get_doc_chunks(text, metadata):
    """
    Processes the input text and metadata to generate document chunks.

    This function takes the raw text content and associated metadata, cleans the text,
    and divides it into document chunks.

    Args:
        text (str): The raw text content to be processed.
        metadata (dict): Metadata associated with the text content.

    Returns:
        List[Document]: List of documents.
    """
    text = clean_text(text)
    doc_chunks = text_to_docs(text, metadata)
    return doc_chunks
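
To sanity-check the pipeline so far, you can chain the crawler and the chunker together. A quick sketch (the URL is again a placeholder):

from web_crawler import get_data_from_website
from text_to_doc import get_doc_chunks

text, metadata = get_data_from_website("https://example.com/")
docs = get_doc_chunks(text, metadata)
print(len(docs), "chunks created")
print(docs[0].page_content[:200])  # peek at the first chunk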

By the end of this step, we’ll have a collection of clean, well-organized text snippets ready for the next stage of building our website chatbot!

Step 3: Storing the Knowledge for Retrieval

We’ve gathered valuable content from our website and meticulously cleaned it. Now it’s time to organize this knowledge in a way our chatbot can easily access it later. Here’s where vector stores and Langchain come into play.

The Plan:

  1. Store the Documents: We’ll use a vector store called Chroma to store the documents (text snippets) we created in the previous step. Chroma acts like a special database designed to efficiently store and retrieve text information.

Our utils.py script will handle these tasks:

  1. Create a Chroma Client: This function will establish a connection to our Chroma vector store, allowing us to interact with it and store our documents.
  2. Store Extracted Documents: The script will take a website URL and use the functions from the previous steps (web_crawler.py and text_to_doc.py) to extract text and create documents. These documents will then be uploaded to the Chroma vector store.

Think of this step as creating a knowledge base for your chatbot. The vector store acts like a library, carefully storing the processed website content for easy retrieval later.

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from text_to_doc import get_doc_chunks
from web_crawler import get_data_from_website

def get_chroma_client():
    """
    Returns a chroma vector store instance.

    Returns:
        langchain_community.vectorstores.Chroma: ChromaDB vector store instance.
    """
    embedding_function = OpenAIEmbeddings()
    return Chroma(
        collection_name="website_data",
        embedding_function=embedding_function,
        persist_directory="data/chroma")


def store_docs(url):
    """
    Retrieves data from a website, processes it into document chunks, and stores them in a vector store.

    Args:
        url (str): The URL of the website to retrieve data from.

    Returns:
        None
    """
    text, metadata = get_data_from_website(url)
    docs = get_doc_chunks(text, metadata)
    vector_store = get_chroma_client()
    vector_store.add_documents(docs)
    vector_store.persist()

The get_chroma_client function sets up the connection to the vector store. The store_docs function demonstrates how the script retrieves data from a website, processes it into documents, and stores them in the Chroma vector store.
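
To confirm that the documents made it into the vector store, you can run a quick similarity search against it. This is just a sketch; the query text is arbitrary:

from utils import get_chroma_client

vector_store = get_chroma_client()
results = vector_store.similarity_search("What courses are offered?", k=2)
for doc in results:
    print(doc.metadata["title"], "->", doc.page_content[:100])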

This step lays the foundation for our chatbot’s knowledge storage. In the next steps, we’ll explore how to connect this knowledge base with a powerful language model to create our intelligent conversational AI.

Step 4: Guiding the Conversation

Now it’s time to refine the conversation flow by defining a clear “prompt” for the LLM.

The Role of the Prompt:

The prompt is the input given to the Large Language Model (LLM): the specific text or instructions you provide so that the model produces the output you want.

In the case of a chatbot, the prompt acts as a specific instruction for the LLM, guiding it on how to use the retrieved website content (documents) to answer the user’s query effectively. A well-crafted prompt can significantly improve the quality and relevance of the chatbot’s responses.

Crafting an Effective Prompt:

Here are some key elements to consider when writing your prompt in prompt.py:

  1. Introduce the User Query: Clearly state the user’s question within the prompt.
  2. Reference the Retrieved Documents: Indicate that the LLM should consider the documents retrieved from the Chroma vector store based on the user’s query.
  3. Guide the Response Generation: Provide instructions on how the LLM should utilize the documents to answer the user. This could involve summarizing relevant information, extracting key points, or directly answering the question using the retrieved content.

Here’s a basic example prompt structure:

"The user asked '{user_query}'.  Based on the retrieved documents {documents}, please provide a comprehensive and informative answer to their question."

Fine-Tuning the Prompt:

You can experiment with different prompt structures and wording to achieve the desired conversational style and response quality for your chatbot.

Let’s see how we can define the prompt in prompt.py!

from langchain.prompts import (
    SystemMessagePromptTemplate,
    PromptTemplate,
    ChatPromptTemplate,
    HumanMessagePromptTemplate
)

system_prompt = """You are an expert support agent at {organization_name}. {organization_info}

Your task is to answer customer queries related to {organization_name}. You should always speak positively about {organization_name}, presenting it as the best in the industry and assuring the customer that purchasing its products is a great decision. You should never talk about any other company/website/resources/books/tools or any product that is not related to {organization_name}. You should always promote {organization_name}'s products. If you don't know an answer, don't try to make one up. Just say that you don't know and ask the user to contact company support.
The ways to contact company support are: {contact_info}.
Don't be overconfident and don't hallucinate. Ask follow-up questions if necessary or if there are several offerings related to the user's query. Provide answers with complete details in a properly formatted manner, with working links and resources wherever applicable within the company's website. Never provide wrong links.


Use the following pieces of context to answer the user's question.

----------------

{context}
{chat_history}
Follow up question: """


def get_prompt():
    """
    Generates prompt.

    Returns:
        ChatPromptTemplate: Prompt.
    """
    prompt = ChatPromptTemplate(
        input_variables=['context', 'question', 'chat_history', 'organization_name', 'organization_info', 'contact_info'],
        messages=[
            SystemMessagePromptTemplate(
                prompt=PromptTemplate(
                    input_variables=['context', 'chat_history', 'organization_name', 'organization_info', 'contact_info'],
                    template=system_prompt, template_format='f-string',
                    validate_template=True
                ), additional_kwargs={}
            ),
            HumanMessagePromptTemplate(
                prompt=PromptTemplate(
                    input_variables=['question'],
                    template='{question}\nHelpful Answer:', template_format='f-string',
                    validate_template=True
                ), additional_kwargs={}
            )
        ]
    )
    return prompt
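
To see what the final messages look like, you can format the template with sample values. A quick sketch (every value below is a placeholder):

from prompt import get_prompt

messages = get_prompt().format_messages(
    context="Sample context retrieved from the vector store.",
    question="How long is the course?",
    chat_history="",
    organization_name="Acme Corp",
    organization_info="Acme Corp sells widgets.",
    contact_info="support@example.com"
)
for message in messages:
    print(type(message).__name__, ":", message.content)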

By defining the prompt, we’ve completed the core development of our website chatbot.

Step 5: Building the Brains of the Chatbot

We’ve come a long way! We’ve extracted valuable content from our website, cleaned and organized it, and stored it in a way our chatbot can access it. Now it’s time to create the core functionality – allowing the chatbot to understand user queries and respond using the knowledge it has gathered.

Here’s what we’ll add:

  1. The Power of OpenAI: We’ll leverage the capabilities of OpenAI’s large language models (LLMs) such as GPT-3.5 Turbo. These LLMs are trained on massive amounts of text data, allowing them to understand and respond to human language in a remarkably capable way.
  2. Connecting the Dots with Langchain: Langchain comes back into play! We’ll use it to create a chain of functionalities. This chain will:
    • Take a user query as input.
    • Use the Chroma vector store to retrieve the documents most relevant to the query, based on their embeddings.
    • Pass the retrieved documents and the user query to the OpenAI LLM.
    • The LLM will then analyze the documents and the query, allowing it to generate a response tailored to the user’s specific question.

Imagine this script as the brain of the chatbot. It takes user input, retrieves relevant information from the knowledge base, and uses the power of OpenAI’s LLM to craft informative responses.

So we’ll add two more functions in our utils.py:

from langchain_openai import ChatOpenAI
from prompt import get_prompt
from langchain.chains import ConversationalRetrievalChain

def make_chain():
    """
    Creates a chain of langchain components.

    Returns:
        langchain.chains.ConversationalRetrievalChain: ConversationalRetrievalChain instance.
    """
    model = ChatOpenAI(
            model_name="gpt-3.5-turbo",
            temperature=0.0,
            verbose=True
        )
    vector_store = get_chroma_client()
    prompt = get_prompt()

    retriever = vector_store.as_retriever(search_type="mmr", verbose=True)

    chain = ConversationalRetrievalChain.from_llm(
        model,
        retriever=retriever,
        return_source_documents=True,
        combine_docs_chain_kwargs=dict(prompt=prompt),
        verbose=True,
        rephrase_question=False,
    )
    return chain


def get_response(question, organization_name, organization_info, contact_info):
    """
    Generates a response based on the input question.

    Args:
        question (str): The input question to generate a response for.
        organization_name (str): The name of the organization.
        organization_info (str): Information about the organization.
        contact_info (str): Contact information for the organization.

    Returns:
        str: The response generated by the chain model.
    """
    chat_history = ""
    chain = make_chain()
    response = chain({"question": question, "chat_history": chat_history,
                      "organization_name": organization_name, "contact_info": contact_info,
                      "organization_info": organization_info})
    return response['answer']

This step completes the core functionality of our website chatbot. In the next and final step, we’ll explore how to integrate this powerful tool into your website and enable it to interact with your visitors!

Step 6: Putting Your Chatbot to Work

We’ve built a powerful website chatbot! Now it’s time to unleash its potential and allow it to interact with visitors on your website.

Here’s a step-by-step guide using a sample Python script:

1. Load Your OpenAI API Key:

  • Explanation: We’ll use an OpenAI API key to interact with OpenAI’s services. It’s crucial to keep this key secure. Here, we’re loading it from a separate env.sh file using the python-dotenv library.
# Load the OPENAI KEY from the env.sh file

from dotenv import load_dotenv

load_dotenv('../env.sh')  # Specify path to your env file

2. Import Necessary Functions:

  • Explanation: We’ll import pre-defined functions from our utils.py script. These functions handle storing website content, retrieving chatbot responses, and connecting to the vector store.
# Import chatbot functions

from utils import store_docs, get_response, get_chroma_client

3. Storing Website Content:

  • Explanation: We’ll use the store_docs function to demonstrate how the chatbot can ingest content from a specific webpage URL. This process extracts text content, cleans it, and stores it in the vector store for future retrieval.
# Storing webpage content into vector store
# Make sure not to store the same documents twice

store_docs("https://cloudxlab.com/course/204/certificate-course-on-generative-ai")

4. Setting Up Organization Information:

  • Explanation: Next, provide some information about your organization. These values are injected into the prompt template so the chatbot knows who it represents and how users can reach support. In this example, we’ve defined variables for organization_name, organization_info, and contact_info.
# Setting up organization information

organization_name = "CloudxLab"
organization_info = "Cloudxlab is known for providing courses on several technologies such as Machine Learning, Big Data, Deep Learning, Artificial Intelligence, DevOps, MLOps, etc. It's main perk is the gamified cloud lab where users can practice all the tools related to the mentioned technologies which are pre-installed in parallel to learning."
contact_info = """Email: reachus@cloudxlab.com 
India Phone: +9108049202224
International Phone: +1 (412) 568-3901.
Raise a query: https://cloudxlab.com/reach-us-queries
Forum: https://discuss.cloudxlab.com/"""

5. Get a Response from the Chatbot:

  • Explanation: The core interaction! We’ll use the get_response function to simulate a user query. This function interacts with the chatbot chain, retrieves relevant information based on the query and stored website content, and generates a response using the OpenAI LLM guided by the defined prompt.
# Get response

response = get_response("What is the duration of this course", organization_name, organization_info, contact_info)
print("Answer:", response)

Congratulations! You’ve built a website chatbot powered by your website’s content and OpenAI’s LLMs. With some additional development for deployment, you can empower your website with this intelligent conversational tool to enhance user experience and engagement.

Want to see this in action? Check out this video tutorial for a more visual explanation.

For the code, you can explore the project on GitHub at https://github.com/cloudxlab/RAG-Chatbot-from-web-data.

Ready to take a deep dive into generative AI? Consider enrolling in our course, Hands-on Generative AI with Langchain and Python on CloudxLab. This course will equip you with the skills to build powerful generative models using Python and Langchain!