We will define functions that handle common quirks of text taken from websites, such as words hyphenated across line breaks and inconsistent newline characters. Passing the raw text through these functions gives the chatbot's knowledge base a consistent, predictable format.
Importing Libraries:
import re
Define a function merge_hyphenated_words that takes "text" as input and rejoins words that were split with a hyphen at a line break.
def merge_hyphenated_words(text):
    # Join a word split as "word-<newline>rest" by dropping the hyphen and the newline.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)
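As a quick check of the pattern, the sketch below runs merge_hyphenated_words on two made-up strings. The sample text is an assumption for illustration, not data from the exercise. The second call shows a limitation worth knowing: a genuine compound that happens to wrap at its hyphen loses the hyphen as well.

```python
import re

def merge_hyphenated_words(text):
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

# Hypothetical sample: "knowledge" was split across two lines.
sample = "The chatbot's know-\nledge base"
merged = merge_hyphenated_words(sample)
print(merged)  # The chatbot's knowledge base

# Limitation: a real compound that wraps at its hyphen is merged too.
compound = merge_hyphenated_words("a well-\nknown site")
print(compound)  # a wellknown site
```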
Define a function fix_newlines that takes "text" as input and replaces each single newline with a space, leaving blank-line paragraph breaks untouched.
def fix_newlines(text):
    # Replace a newline with a space only when it has no newline on either side.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)
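The lookbehind and lookahead in this pattern are easier to follow on a concrete string. The minimal sketch below uses a made-up sample (an assumption, not text from the exercise) with one wrapped sentence followed by a blank line and a second paragraph.

```python
import re

def fix_newlines(text):
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)

# Hypothetical sample: a wrapped sentence, a blank line, then a new paragraph.
sample = "First line of a\nwrapped sentence.\n\nSecond paragraph."
fixed = fix_newlines(sample)
print(repr(fixed))
# 'First line of a wrapped sentence.\n\nSecond paragraph.'
```

Only the lone newline inside the sentence becomes a space. The double newline survives because each of its two characters has another newline next to it.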
Define a function remove_multiple_newlines that takes "text" as input and collapses each run of two or more newlines into a single newline.
def remove_multiple_newlines(text):
    # Collapse any run of two or more newlines into one.
    return re.sub(r"\n{2,}", "\n", text)
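A short sketch of what the quantifier `{2,}` does here, using a made-up sample string (an assumption for illustration): runs of two or more newlines shrink to one, while a newline that is already single is left alone.

```python
import re

def remove_multiple_newlines(text):
    return re.sub(r"\n{2,}", "\n", text)

# Hypothetical sample with runs of four, two, and one newline.
sample = "Heading\n\n\n\nBody text\n\nFooter\nLast line"
collapsed = remove_multiple_newlines(sample)
print(repr(collapsed))
# 'Heading\nBody text\nFooter\nLast line'
```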
Now that our cleaning functions are in place, let's write a function clean_text that takes text as input and passes it through each of them in order.
def clean_text(text):
    # Order matters: merge hyphenated words before single newlines become spaces.
    cleaning_functions = [merge_hyphenated_words, fix_newlines, remove_multiple_newlines]
    for cleaning_function in cleaning_functions:
        text = cleaning_function(text)
    return text
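To see why the functions run in this order, the sketch below applies the same loop to a made-up string twice, once in the original order and once with the first two steps swapped. The helper run_pipeline and the sample text are assumptions introduced for illustration and are not part of the exercise.

```python
import re

def merge_hyphenated_words(text):
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def fix_newlines(text):
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)

def remove_multiple_newlines(text):
    return re.sub(r"\n{2,}", "\n", text)

def run_pipeline(text, cleaning_functions):
    # Hypothetical helper: the same loop as clean_text, with the order passed in.
    for cleaning_function in cleaning_functions:
        text = cleaning_function(text)
    return text

# Hypothetical raw text: a split word, a wrapped line, and extra blank lines.
raw = "Chatbots need clean da-\nta to work\nwell.\n\n\nNext paragraph."

in_order = run_pipeline(
    raw, [merge_hyphenated_words, fix_newlines, remove_multiple_newlines]
)
print(repr(in_order))
# 'Chatbots need clean data to work well.\nNext paragraph.'

swapped = run_pipeline(
    raw, [fix_newlines, merge_hyphenated_words, remove_multiple_newlines]
)
print(repr(swapped))
# 'Chatbots need clean da- ta to work well.\nNext paragraph.'
```

When fix_newlines runs first, the newline after the hyphen has already become a space, so the hyphen pattern no longer matches and the split word stays broken.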