Imagine a tireless assistant on your website, ready to answer customer questions 24/7. That’s the power of a chatbot! In this post, we’ll guide you through building a custom chatbot specifically trained on your website’s data using OpenAI and Langchain. Let’s dive in and create this helpful conversational AI!
If you’d like to follow the steps hands-on rather than just reading, check out our project Building a RAG Chatbot from Your Website Data using OpenAI and Langchain. You will also receive a project completion certificate that you can use to showcase your Generative AI skills.
Step 1: Grabbing Valuable Content from Your Website
We first need the gold mine of information: the content from your website! To collect it, we’ll build a web crawler using Python’s requests library and Beautiful Soup. This script will act like a smart visitor, fetching the text content from each webpage on your website.
Here’s what our web_crawler.py script will do:
- Fetch the Webpage: It’ll send a request to retrieve the HTML content of a given website URL.
- Check for Success: The script will make sure the server responded successfully (HTTP status code 200) before proceeding.
- Parse the HTML Structure: Using Beautiful Soup, it will analyze the downloaded HTML to understand how the webpage is built.
- Clean Up the Mess: It will discard unnecessary elements like scripts and styles that don’t contribute to the core content you want for the chatbot.
- Extract the Text: After that, it will convert the cleaned HTML into plain text format, making it easier to process later.
- Grab Extra Info (Optional): The script can optionally extract metadata like page titles and descriptions for better organization.
Imagine this script as a virtual visitor browsing your website and collecting the text content, leaving behind the fancy formatting for now.
Let’s code!
```python
import requests
import html2text
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def get_data_from_website(url):
    """
    Retrieve text content and metadata from a given URL.

    Args:
        url (str): The URL to fetch content from.

    Returns:
        tuple: A tuple containing the text content (str) and metadata (dict),
        or None if the request fails.
    """
    # Get response from the server
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Request failed with status code {response.status_code}")
        return None

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Remove JavaScript and CSS code
    for script in soup(["script", "style"]):
        script.extract()

    # Extract text in Markdown format
    html = str(soup)
    html2text_instance = html2text.HTML2Text()
    html2text_instance.images_to_alt = True
    html2text_instance.body_width = 0
    html2text_instance.single_line_break = True
    text = html2text_instance.handle(html)

    # Extract page metadata, falling back to the URL path if there is no <title>
    try:
        page_title = soup.title.string.strip()
    except AttributeError:
        page_title = urlparse(url).path[1:].replace("/", "-")

    meta_description = soup.find("meta", attrs={"name": "description"})
    meta_keywords = soup.find("meta", attrs={"name": "keywords"})
    description = meta_description.get("content") if meta_description else page_title
    keywords = meta_keywords.get("content") if meta_keywords else ""

    metadata = {
        'title': page_title,
        'url': url,
        'description': description,
        'keywords': keywords,
    }

    return text, metadata
```
Explanation:
The get_data_from_website function takes a website URL and returns the extracted text content along with the page metadata. Explore the code further to see how it performs each step mentioned!
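Since the goal is to collect content from each webpage on your site, you’ll also need a way to discover the pages to visit. One common approach (a sketch, not part of the script above) is to extract the internal links from each fetched page and feed them back into the crawl. The helper below, a hypothetical `extract_internal_links`, shows the link-collection part on its own:

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def extract_internal_links(html, base_url):
    """Collect links on a page that stay within the same domain as base_url."""
    soup = BeautifulSoup(html, 'html.parser')
    base_domain = urlparse(base_url).netloc
    links = set()
    for anchor in soup.find_all("a", href=True):
        # Resolve relative links ("/about") against the page URL
        absolute = urljoin(base_url, anchor["href"])
        parsed = urlparse(absolute)
        if parsed.netloc == base_domain and parsed.scheme in ("http", "https"):
            # Strip fragments so /page and /page#section count as one URL
            links.add(parsed._replace(fragment="").geturl())
    return links
```

In a full crawl you would maintain a visited set, pop URLs from a queue, call get_data_from_website on each one, and enqueue any newly discovered internal links until the queue is empty.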