Project - Building Spam Classifier

15 / 27

Spam Classifier - Create a Preprocessing Function

Now let us create a preprocessing function.


The following function first drops the section, then converts all <a> tags to the word HYPERLINK, then it gets rid of all HTML tags, leaving only the plain text. For readability, it also replaces multiple newlines with single newlines, and finally it unescapes html entities:

import re
from html import unescape

def html_to_plain_text(html):
    text = re.sub('<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = re.sub('<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub('<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)

Here, the re module provides regular expression matching operations similar to those found in Perl. With the help of html.unescape method, we can convert the ascii string into html script by replacing ascii characters with special characters by using html.escape method. re.sub returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed.

Get Hint See Answer

Loading comments...