Building a web scraper involves several steps: first, it fetches the HTML content of a website. Then, it checks for a successful response and parses the HTML structure to understand the webpage's layout. After cleaning up unnecessary elements, it extracts the text content and optionally gathers metadata for better organization. This extracted data will be the foundation for your RAG chatbot's knowledge base.
We extract the text in markdown format instead of plain text. Using markdown format preserves the structure, formatting, and semantic information of web content, including headings, hyperlinks, and text styling like bold or italic. LLMs can interpret markdown, allowing them to better understand and utilize these cues for text generation and analysis. This approach enhances the quality and context of input data for language models.
We will also extract the website's metadata, since it is stored alongside the text in the vector store.
Importing Libraries:
import requests
from bs4 import BeautifulSoup
import html2text
Defining a function get_data_from_website that takes url as an input and extracts the content from the webpage.
def get_data_from_website(url):
Now write further code inside this function.
Get response from the server:
    response = requests.get(url)
    if response.status_code != 200:
        print("Failed to fetch the page, status code:", response.status_code)
        return
Parse the HTML content using BeautifulSoup:
    soup = BeautifulSoup(response.content, 'html.parser')
Removing JavaScript and CSS code, as they are not required:
    for script in soup(["script", "style"]):
        script.extract()
Extract text in markdown format:
    html = str(soup)
    html2text_instance = html2text.HTML2Text()
    html2text_instance.images_to_alt = True      # replace images with their alt text
    html2text_instance.body_width = 0            # disable line wrapping
    html2text_instance.single_line_break = True  # use single newlines between blocks
    text = html2text_instance.handle(html)
Extract page metadata:
    try:
        page_title = soup.title.string.strip()
    except AttributeError:
        # Fall back to a title derived from the URL path if the page has no <title>
        from urllib.parse import urlparse
        page_title = urlparse(url).path[1:].replace("/", "-")
    meta_description = soup.find("meta", attrs={"name": "description"})
    meta_keywords = soup.find("meta", attrs={"name": "keywords"})
    if meta_description:
        description = meta_description.get("content")
    else:
        description = page_title
    if meta_keywords:
        meta_keywords = meta_keywords.get("content")
    else:
        meta_keywords = ""
    metadata = {'title': page_title,
                'url': url,
                'description': description,
                'keywords': meta_keywords}
Return extracted text and metadata:
    return text, metadata