Project - Building a RAG Chatbot from Your Website Data using OpenAI and Langchain


Step 2: Cleaning and Processing Data - Cleaning Functions

We will define functions to handle common quirks in website text, such as words hyphenated across line breaks and inconsistent newline characters. Running the raw text through these functions gives the chatbot's knowledge base a cleaner, more consistent format.

INSTRUCTIONS
  1. Import the re module, which provides regular expression support:

    import re
    
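  2. Define a function merge_hyphenated_words that takes text as input and rejoins words that were split by a hyphen at a line break (for example, "chat-" at the end of one line and "bot" at the start of the next become "chatbot").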

    def merge_hyphenated_words(text):
        # Match word char, hyphen, newline, word char; keep only the two word chars.
        return re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    
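  3. Define a function fix_newlines that takes text as input and replaces each isolated newline (one that is not next to another newline) with a space, so hard-wrapped lines are joined while blank-line breaks are left intact.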

    def fix_newlines(text):
        # The lookbehind and lookahead ensure only a lone newline is matched.
        return re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    
  4. Define a function remove_multiple_newlines that takes text as input and collapses every run of two or more consecutive newlines into a single newline.

    def remove_multiple_newlines(text):
        # \n{2,} matches two or more newlines in a row.
        return re.sub(r"\n{2,}", "\n", text)
    
  5. Now that the cleaning functions are in place, write a function clean_text that takes text as input and passes it through each of them in order.

    def clean_text(text):
        # Apply the functions in this order; each one works on the previous result.
        cleaning_functions = [merge_hyphenated_words, fix_newlines, remove_multiple_newlines]
        for cleaning_function in cleaning_functions:
            text = cleaning_function(text)
        return text
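
As a quick check of the steps above, the following minimal sketch runs the cleaning functions on a made-up string. The sample text and the names sample and cleaned are illustrative assumptions, not part of the exercise. The functions are repeated here only so the snippet runs on its own.

```python
import re


def merge_hyphenated_words(text):
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def fix_newlines(text):
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)


def remove_multiple_newlines(text):
    return re.sub(r"\n{2,}", "\n", text)


def clean_text(text):
    cleaning_functions = [merge_hyphenated_words, fix_newlines, remove_multiple_newlines]
    for cleaning_function in cleaning_functions:
        text = cleaning_function(text)
    return text


# Hypothetical scraped text: a word split by a hyphen at a line break,
# a hard-wrapped line, and a run of three newlines between paragraphs.
sample = "Our chat-\nbot answers questions\nabout the site.\n\n\nContact us today."

cleaned = clean_text(sample)
print(repr(cleaned))
# 'Our chatbot answers questions about the site.\nContact us today.'
```

The order of the functions matters. If fix_newlines ran before merge_hyphenated_words, "chat-\nbot" would become "chat- bot" and the hyphen pattern would no longer match. If remove_multiple_newlines ran before fix_newlines, the paragraph break would first shrink to a single newline and then be replaced by a space.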
    