Shubh Tripathi, Author at CloudxLab Blog

Building a RAG Chatbot from Your Website Data using OpenAI and Langchain (Hands-On)

Imagine a tireless assistant on your website, ready to answer customer questions 24/7. That’s the power of a chatbot! In this post, we’ll guide you through building a custom chatbot that answers from your website’s own data using OpenAI and Langchain. Let’s dive in and create this helpful conversational AI!

If you would rather follow the steps hands-on than just read, check out our project at https://cloudxlab.com/assessment/playlist-intro/3101/building-a-rag-chatbot-from-your-website-data-usin. You will also receive a project completion certificate, which you can use to showcase your Generative AI skills.

Step 1: Grabbing Valuable Content from Your Website

We first need the gold mine of information – the content from your website! To achieve this, we’ll build a web crawler using Python’s requests library and Beautiful Soup. This script will act like a smart visitor, fetching the text content from each webpage on your website.

Here’s what our web_crawler.py script will do:

Fetch the Webpage: It’ll send a request to retrieve the HTML content of a given website URL.
Check for Success: The script will ensure the server responds positively (think status code 200) before proceeding.
Parse the HTML Structure: Using Beautiful Soup, it will analyze the downloaded HTML to understand how the webpage is built.
Clean Up the Mess: It will discard unnecessary elements like scripts and styles that don’t contribute to the core content you want for the chatbot.
Extract the Text: After that, it will convert the cleaned HTML into plain text format, making it easier to process later.
Grab Extra Info (Optional): The script can optionally extract metadata like page titles and descriptions for better organization.

Imagine this script as a virtual visitor browsing your website and collecting the text content, leaving behind the fancy formatting for now.

Let’s code!

from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup
import html2text


def get_data_from_website(url):
    """
    Retrieve text content and metadata from a given URL.

    Args:
        url (str): The URL to fetch content from.

    Returns:
        tuple: A tuple containing the text content (str) and metadata (dict),
            or None if the server does not respond with status code 200.
    """
    # Get response from the server and stop unless it reports success
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Request failed with status code {response.status_code}")
        return None
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Remove JavaScript and CSS code
    for script in soup(["script", "style"]):
        script.extract()

    # Extract text in markdown format
    html = str(soup)
    html2text_instance = html2text.HTML2Text()
    html2text_instance.images_to_alt = True
    html2text_instance.body_width = 0
    html2text_instance.single_line_break = True
    text = html2text_instance.handle(html)

    # Extract page metadata
    try:
        page_title = soup.title.string.strip()
    except AttributeError:
        # No usable <title> tag: build a title from the URL path instead
        page_title = urlparse(url).path[1:].replace("/", "-")
    meta_description = soup.find("meta", attrs={"name": "description"})
    meta_keywords = soup.find("meta", attrs={"name": "keywords"})
    if meta_description:
        description = meta_description.get("content")
    else:
        description = page_title
    if meta_keywords:
        # Read the keywords from the keywords tag, not the description tag
        meta_keywords = meta_keywords.get("content")
    else:
        meta_keywords = ""

    metadata = {'title': page_title,
                'url': url,
                'description': description,
                'keywords': meta_keywords}

    return text, metadata

Explanation:

The get_data_from_website function takes a website URL and returns the extracted text content along with any optional metadata. Explore the code further to see how it performs each step mentioned!
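To see the "clean up" and "extract the text" steps in isolation, here is a minimal sketch that runs without network access. It deliberately swaps in Python's built-in html.parser module instead of Beautiful Soup and html2text, so its output is plain text rather than markdown. The names TextExtractor, extract_text, and sample_html are made up for this illustration and are not part of web_crawler.py.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page used only for this demonstration
sample_html = """
<html>
  <head><title>Demo page</title><style>body { color: red; }</style></head>
  <body>
    <h1>Welcome</h1>
    <script>console.log("tracking");</script>
    <p>Our support team is available 24/7.</p>
  </body>
</html>
"""

print(extract_text(sample_html))
```

The script and style contents never reach the output, which is the same effect the crawler gets by calling extract() on those tags before converting the page.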

Step 2: Cleaning Up the Raw Text

Building your own ChatGPT from scratch

In a world where technology constantly pushes the boundaries of human imagination, one phenomenon stands out: ChatGPT. You’ve probably experienced its magic, admired how it can chat meaningfully, and maybe even wondered how it all works inside. ChatGPT is more than just a program; it’s a gateway to the realms of artificial intelligence, showcasing the amazing progress we’ve made in machine learning.

At its core, ChatGPT is built on a technology called Generative Pre-trained Transformer (GPT). But what does that really mean? Let’s understand in this blog.

In this blog, we’ll explore the fundamentals of machine learning, including how machines generate words. We’ll delve into the transformer architecture and its attention mechanisms. Then, we’ll demystify GPT and its role in AI. Finally, we’ll embark on coding our own GPT from scratch, bridging theory and practice in artificial intelligence.

How does a Machine learn?

Imagine a network of interconnected knobs—this is a neural network, inspired by our own brains. In this network, information flows through nodes, just like thoughts in our minds. Each node processes information and passes it along to the next, making decisions as it goes.

Each knob represents a neuron, a fundamental unit of processing. As information flows through this network, these neurons spring to action, analyzing, interpreting, and transmitting data. It’s similar to how thoughts travel through your mind—constantly interacting and influencing one another to form a coherent understanding of the world around you. In a neural network, these interactions pave the way for learning, adaptation, and intelligent decision-making, mirroring the complex dynamics of the human mind in the digital realm.
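As a rough sketch of what a single "knob" does, the snippet below models one neuron as a weighted sum of its inputs followed by an activation function, and a layer as several such neurons reading the same inputs. The sigmoid activation, the function names neuron and layer, and all the numbers are assumptions chosen for illustration, not something prescribed by the post.

```python
import math


def neuron(inputs, weights, bias):
    """One neuron: weighted sum of the inputs plus a bias, squashed by a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weight_rows, biases):
    """A layer: every neuron sees the same inputs but has its own weights and bias."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Illustrative values only
inputs = [0.5, -1.0]
hidden = layer(inputs, weight_rows=[[1.0, 2.0], [-1.0, 0.5]], biases=[0.0, 0.1])
output = neuron(hidden, weights=[0.7, -0.3], bias=0.0)

print(hidden, output)
```

Learning then amounts to adjusting the weights and biases so that the final output moves closer to the desired answer.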

GPT-4 and its advancements over GPT-3

The field of natural language processing has witnessed remarkable advancements over the years, with the development of cutting-edge language models such as GPT-3 and the recent release of GPT-4. These models have revolutionized the way we interact with language and have opened up new possibilities for applications in various domains, including chatbots, virtual assistants, and automated content creation.

What is GPT?

GPT is a natural language processing (NLP) model developed by OpenAI that is built on the transformer architecture. The transformer is a type of deep learning model, best known for its ability to process sequential data, such as text, by attending to different parts of the input sequence and using this information to generate context-aware representations of the text.

What makes transformers special is that they can understand the meaning of the text, instead of just recognizing patterns in the words. They can do this by “attending” to different parts of the text and figuring out which parts are most important to understanding the meaning of the whole.

For example, imagine you’re reading a book and come across the sentence “The cat sat on the mat.” A transformer would be able to understand that this sentence is about a cat and a mat and that the cat is sitting on the mat. It would also be able to use this understanding to generate new sentences that are related to the original one.
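The idea of "attending" can be sketched with a toy version of scaled dot-product attention. The two-dimensional word vectors below are invented for illustration (real models learn much larger ones), and the helper names softmax and attend are hypothetical. The sketch scores how relevant each word is to "cat", turns the scores into weights that sum to 1, and blends the word vectors using those weights.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    blended = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, blended


# Made-up embeddings for three words of "The cat sat on the mat"
tokens = ["cat", "sat", "mat"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2]]

# Use the vector for "cat" as the query and all three words as keys and values
weights, blended = attend(vectors[0], vectors, vectors)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.3f}")
```

With these toy vectors, "cat" attends most strongly to itself and more to "mat" than to "sat", because the "mat" vector points in a similar direction.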

GPT is pre-trained on a large dataset, which consists of:

Starting Machine Learning with an End-to-End Project

When you are learning about Machine Learning, it is best to experiment with real-world data alongside learning concepts. It is even more beneficial to start Machine Learning with a project including end-to-end model building, rather than going for conceptual knowledge first.

Benefits of Project-Based Learning

You get to know real-world projects, which in a way prepares you for real-world jobs.
Encourages critical thinking and problem-solving skills in learners.
Gives an idea of the end-to-end process of building a project.
Gives an idea of tools and technologies used in the industry.
Learners get an in-depth understanding of the concepts which directly boosts their self-confidence.
It is a more fun way to learn things rather than traditional methods of learning.

What is an End-to-End project?

End-to-end refers to a full process from start to finish. In an ML end-to-end project, you have to perform every task from first to last by yourself. That includes getting the data, processing it, preparing data for the model, building the model, and at last finalizing it.
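To make these stages concrete, here is a deliberately tiny sketch that walks through all of them using only the Python standard library. The synthetic data, the straight-line model, and the name fit_line are assumptions made for this example; a real project would use a real dataset and usually a library such as scikit-learn.

```python
import random

# 1. Get the data: synthetic points scattered around y = 2x + 1 (made up for this sketch)
rng = random.Random(0)
raw = [(i * 0.2, 2 * (i * 0.2) + 1 + rng.gauss(0, 0.1)) for i in range(50)]
raw.append((5.0, None))  # a record with a missing label

# 2. Process it: drop records with missing values
clean = [(x, y) for x, y in raw if y is not None]

# 3. Prepare it for the model: shuffle, then hold out 20% for testing
rng.shuffle(clean)
split = int(len(clean) * 0.8)
train, test = clean[:split], clean[split:]


# 4. Build the model: fit y = slope * x + intercept by ordinary least squares
def fit_line(points):
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train)

# 5. Finalize it: check the error on data the model has not seen
mse = sum((slope * x + intercept - y) ** 2 for x, y in test) / len(test)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MSE={mse:.4f}")
```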

The idea behind starting with an End-to-End project

It is much more beneficial to start learning Machine Learning with an end-to-end project rather than diving down deep into the vast ocean of Machine Learning concepts. But, what will be the benefit of practicing concepts without even understanding them properly? How to implement concepts when we don’t understand them properly?

There are not one but several benefits of starting your ML journey with a project. Some of them are:

How to Crack Machine Learning Interviews with Top Interview Questions (2024)

Machine Learning is the most rapidly growing domain in the software industry. More and more sectors are using concepts of Machine Learning to enhance their businesses. It is now not an add-on but has become a necessity for businesses to use ML algorithms for optimizing their businesses and to offer a personalised user experience.

This demand for Machine Learning in the industry has directly increased the demand for Machine Learning Engineers, the ones who turn this magic into reality. According to a survey conducted by LinkedIn, Machine Learning Engineer is the top emerging job role in the current industry, with nearly 10 times growth.

But even this high demand doesn’t make getting a job in ML any easier. ML interviews are tough regardless of your seniority level. With the right knowledge and preparation, however, interviews become a lot easier to crack.

In this blog, I will walk you through the interview process for an ML job role and will pass on some tips and tactics on how to crack one. We will also discuss the skills required in accordance with each round of the process.

How to Interact with Apache Zookeeper using Python?

In the Hadoop ecosystem, Apache Zookeeper plays an important role in coordination amongst distributed resources. Apart from being an important component of Hadoop, it is also a very good concept to learn for a system design interview.

What is Apache Zookeeper?

Apache ZooKeeper is a coordination tool that makes it easier to build distributed systems. In very simple words, it is a central data store of key-value pairs through which distributed systems can coordinate. Since it needs to be able to handle the load, ZooKeeper itself runs on many machines.

ZooKeeper provides a simple set of primitives and is very easy to program against.

It is used for:

synchronization
locking
maintaining configuration
failover management.

It does not suffer from race conditions or deadlocks.

Bucketing- CLUSTERED BY and CLUSTER BY

Bucketing in Hive is a data-organising technique. It is used to decompose data into more manageable parts, known as buckets, which in turn improves query performance. It is similar to partitioning, but adds a hashing technique.

Introduction

Bucketing, a.k.a. clustering, is a technique to decompose data into buckets. In bucketing, Hive splits the data into a fixed number of buckets, according to a hash function over some set of columns. Hive ensures that all rows that have the same hash will be stored in the same bucket. However, a single bucket may contain multiple such groups.

For example, bucketing the data in 3 buckets will look like-
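The assignment rule can be sketched in a few lines of Python. This is only an illustration of the idea, not Hive's implementation: it assumes an integer column and uses the value itself, made non-negative, as a stand-in for Hive's hash function. The names bucket_for, bucketize, and the sample rows are hypothetical.

```python
from collections import defaultdict


def bucket_for(key, num_buckets):
    """Map an integer key to a bucket: non-negative hash modulo the bucket count."""
    return (key & 0x7FFFFFFF) % num_buckets


def bucketize(rows, column, num_buckets):
    """Group rows by the bucket of their value in the given column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[column], num_buckets)].append(row)
    return dict(buckets)


# Made-up rows; user 4 appears twice
rows = [
    {"user_id": 1, "city": "Pune"},
    {"user_id": 2, "city": "Delhi"},
    {"user_id": 3, "city": "Mumbai"},
    {"user_id": 4, "city": "Pune"},
    {"user_id": 4, "city": "Goa"},
    {"user_id": 7, "city": "Delhi"},
]

for bucket, members in sorted(bucketize(rows, "user_id", 3).items()):
    print(bucket, [row["user_id"] for row in members])
```

Both rows for user 4 land in the same bucket, and that bucket also holds users 1 and 7, which shows how one bucket can contain several groups of equal keys.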

The Era of Software Engineering and how to become one

Today’s world is also known as the world of software, and its builders are known as software engineers. It is thanks to them that we can interact with each other today: the webpage on which you are reading this blog, the web browser displaying it, and the operating system running the browser were all made by software engineers.

In today’s blog, we will start by introducing software engineering and discuss its history, scope, and types. Then we will compare different types of software engineers on the basis of their demand in the industry. After that, we will discuss the full-stack developer job role and responsibilities, along with the key skills and the hiring process for a full-stack developer, in detail.

Classification metrics and their Use Cases

In this blog, we will discuss commonly used classification metrics. We will cover Accuracy Score, Confusion Matrix, Precision, Recall, F-Score, and ROC-AUC, and then learn how to extend them to multi-class classification. We will also discuss which metric is most suitable in which scenario.

First, let’s understand some important terms used throughout the blog:

True Positive (TP): When you predict an observation belongs to a class and it actually does belong to that class.

True Negative (TN): When you predict an observation does not belong to a class and it actually does not belong to that class.

False Positive (FP): When you predict an observation belongs to a class and it actually does not belong to that class.

False Negative (FN): When you predict an observation does not belong to a class and it actually does belong to that class.

All classification metrics work on these four terms. Let’s start understanding classification metrics:
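As a minimal sketch of how these four terms are counted for a binary problem, and how three common metrics follow from them, consider the snippet below. The labels are invented, class 1 is assumed to be the positive class, and confusion_counts is a hypothetical helper name.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted != positive and actual != positive:
            tn += 1
        elif predicted == positive and actual != positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


# Made-up ground truth and predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that were found

print(tp, tn, fp, fn, accuracy, precision, recall)
```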