Project - Building a RAG Chatbot from Your Website Data using OpenAI and LangChain

Step 1: Collecting Data - Writing Web Scraper

A web scraper works in several steps. First, it fetches the HTML content of a page. Then it checks for a successful response and parses the HTML to understand the page's structure. After removing unnecessary elements, it extracts the text content and, optionally, gathers metadata for better organization. The extracted data will be the foundation of your RAG chatbot's knowledge base.

We extract the text in markdown format instead of plain text. Using markdown format preserves the structure, formatting, and semantic information of web content, including headings, hyperlinks, and text styling like bold or italic. LLMs can interpret markdown, allowing them to better understand and utilize these cues for text generation and analysis. This approach enhances the quality and context of input data for language models.

We will also extract the page's metadata (title, URL, description, and keywords), because vector stores keep it alongside the text.

INSTRUCTIONS
  1. Importing Libraries:

    import requests
    from bs4 import BeautifulSoup
    import html2text
    
  2. Defining a function get_data_from_website that takes url as an input and extracts the content from the webpage.

    def get_data_from_website(url):
    
  3. Now write further code inside this function.

  4. Get response from the server:

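    response = requests.get(url)
    # Any 4xx or 5xx status means the page could not be fetched
    if response.status_code >= 400:
        print(f"Request failed with status code {response.status_code}")
        return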
    
  5. Parse the HTML content using BeautifulSoup:

    soup = BeautifulSoup(response.content, 'html.parser')
    
  6. Remove JavaScript and CSS code, since it is not part of the page's readable content:

    for script in soup(["script", "style"]):
        script.extract()
    
  7. Extract text in markdown format:

    html = str(soup)
    html2text_instance = html2text.HTML2Text()
    html2text_instance.images_to_alt = True
    html2text_instance.body_width = 0
    html2text_instance.single_line_break = True
    text = html2text_instance.handle(html)
    
  8. Extract page metadata:

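    # url is a plain string, so parse it to read its path (this import can also go in step 1)
    from urllib.parse import urlparse

    try:
        page_title = soup.title.string.strip()
    except AttributeError:
        # No usable <title>: fall back to the URL path, e.g. "/blog/post" becomes "blog-post"
        page_title = urlparse(url).path[1:].replace("/", "-")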
    
    meta_description = soup.find("meta", attrs={"name": "description"})
    meta_keywords = soup.find("meta", attrs={"name": "keywords"})
    
    if meta_description:
        description = meta_description.get("content")
    else:
        description = page_title
    
    if meta_keywords:
        # Read the keywords from the keywords tag, not the description tag
        meta_keywords = meta_keywords.get("content")
    else:
        meta_keywords = ""
    
    metadata = {'title': page_title,
            'url': url,
            'description': description,
            'keywords': meta_keywords}
    
  9. Return extracted text and metadata:

    return text, metadata
    