Project - Building Spam Classifier

Spam Classifier - Test HTML to Plain Text Function

Now we will test the HTML to Plain Text function we created in the previous step.

INSTRUCTIONS
  • First, let's collect the spam emails whose structure is "text/html", store one of them in a variable, and print the first 1,000 characters of its content:

    html_spam_emails = [email for email in X_train[y_train==1]
                        if get_email_structure(email) == "text/html"]
    sample_html_spam = html_spam_emails[7]
    print(sample_html_spam.get_content().strip()[:1000], "...")
    

    This is what the spam email looks like in its original HTML form.

  • Now let's convert it to plain text by passing the content of the variable created above to the html_to_plain_text function we wrote in the previous step:

    print(<< your code goes here >>(sample_html_spam.get_content())[:1000], "...")
    
  • Finally, let's write a function that takes an email as input and returns its content as plain text, whatever its format is:

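    def email_to_text(email):
        html = None
        # Visit every part of the (possibly multipart) email
        for part in email.walk():
            ctype = part.get_content_type()
            # Skip attachments, images, and multipart containers
            if ctype not in ("text/plain", "text/html"):
                continue
            try:
                content = part.get_content()
            except Exception:  # in case of encoding issues
                content = str(part.get_payload())
            # Prefer plain text: return the first text/plain part found
            if ctype == "text/plain":
                return content
            else:
                html = content
        # No plain-text part: fall back to the last HTML part, if any
        if html:
            return html_to_plain_text(html)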
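
To see how email_to_text behaves for different message layouts, here is a minimal sketch that runs it on three synthetic emails built with the standard library's email package. The html_to_plain_text below is only a simplified, hypothetical stand-in for the function written in the previous step (it strips tags with a regular expression), and the sample messages are made up for illustration; they are not taken from the course dataset.

```python
import re
from email.message import EmailMessage
from html import unescape


def html_to_plain_text(html):
    """Hypothetical stand-in for the converter from the previous step."""
    text = re.sub(r"<[^>]+>", " ", html)      # replace every tag with a space
    text = unescape(text)                     # decode entities such as &amp;
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


def email_to_text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        try:
            content = part.get_content()
        except Exception:  # in case of encoding issues
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content
        html = content
    if html:
        return html_to_plain_text(html)


# Case 1: an HTML-only message (made-up sample)
html_only = EmailMessage()
html_only.set_content(
    "<html><body><h1>Win big</h1><p>Click &amp; claim</p></body></html>",
    subtype="html",
)

# Case 2: a multipart/alternative message with plain and HTML versions
both = EmailMessage()
both.set_content("Plain version")
both.add_alternative("<p>HTML version</p>", subtype="html")

# Case 3: a message with no text part at all
binary_only = EmailMessage()
binary_only.set_content(b"\x00\x01", maintype="application",
                        subtype="octet-stream")

html_text = email_to_text(html_only)    # converted from HTML
both_text = email_to_text(both)         # the plain-text part wins
none_text = email_to_text(binary_only)  # nothing to extract

print(repr(html_text))
print(repr(both_text.strip()))
print(repr(none_text))
```

Note that the function returns None when an email contains neither a text/plain nor a text/html part, so code that calls it may want to substitute an empty string (for example, email_to_text(email) or "") before further processing.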
    