Building a web scraper involves several steps: first, it fetches the HTML content of a website. Then, it checks for a successful response and parses the HTML structure to understand the webpage's layout. After cleaning up unnecessary elements, it extracts the text content and optionally gathers metadata for better organization. This extracted data will be the foundation for your RAG chatbot's knowledge base.
We extract the text in markdown format instead of plain text. Using markdown format preserves the structure, formatting, and semantic information of web content, including headings, hyperlinks, and text styling like bold or italic. LLMs can interpret markdown, allowing them to better understand and utilize these cues for text generation and analysis. This approach enhances the quality and context of input data for language models.
We will also extract the website's metadata, since it is stored alongside the text in the vector store.
Importing Libraries:
import requests
from bs4 import BeautifulSoup
import html2text
Defining a function get_data_from_website that takes url as an input and extracts the content from the webpage.
def get_data_from_website(url):
Now write further code inside this function.
Get response from the server:
    response = requests.get(url)
    if response.status_code != 200:
        print("Failed to fetch the page, status code:", response.status_code)
        return
Parse the HTML content using BeautifulSoup:
    soup = BeautifulSoup(response.content, 'html.parser')
Removing JavaScript and CSS code, as they are not required:
    for script in soup(["script", "style"]):
        script.extract()
Extract text in markdown format:
    html = str(soup)
    html2text_instance = html2text.HTML2Text()
    html2text_instance.images_to_alt = True      # replace images with their alt text
    html2text_instance.body_width = 0            # disable line wrapping
    html2text_instance.single_line_break = True  # use single newlines between blocks
    text = html2text_instance.handle(html)
Extract page metadata:
    try:
        page_title = soup.title.string.strip()
    except AttributeError:
        # Fall back to a title derived from the URL path if the page has no <title>
        from urllib.parse import urlparse
        page_title = urlparse(url).path[1:].replace("/", "-")
    meta_description = soup.find("meta", attrs={"name": "description"})
    meta_keywords = soup.find("meta", attrs={"name": "keywords"})
    if meta_description:
        description = meta_description.get("content")
    else:
        description = page_title
    if meta_keywords:
        meta_keywords = meta_keywords.get("content")
    else:
        meta_keywords = ""
    metadata = {'title': page_title,
                'url': url,
                'description': description,
                'keywords': meta_keywords}
Return extracted text and metadata:
    return text, metadata