Project - Building Spam Classifier


Spam Classifier - Parse the Emails

In this step, we will parse the emails we downloaded.

We can use Python's email package to parse these emails; it takes care of headers, encodings, and so on. The email package is a library for parsing and building email messages; it does not send them. First, we will import the email and email.policy modules.

The behavior of the email package is controlled by policy objects, which are defined in the email.policy module. Every EmailMessage, every generator, and every parser has an associated policy object that controls its behavior. Usually an application only needs to specify the policy when an EmailMessage is created, either by instantiating an EmailMessage directly to create a new email or by parsing an input stream with a parser. The policy can, however, be changed when the message is serialized with a generator. This allows, for example, a generic email message to be parsed from disk and then serialized with standard SMTP settings when it is sent to an email server.
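To make the role of the policy concrete, here is a minimal sketch that parses a message with the default policy and then serializes it with the SMTP policy. The raw message and its addresses are made up for illustration; they are not part of the dataset.

```python
import email.message
import email.parser
import email.policy

# A tiny hand-written message (hypothetical sample, not from the dataset)
raw = b"From: alice@example.com\nTo: bob@example.com\nSubject: Hello\n\nHi Bob\n"

# Parse with the default policy, which yields an EmailMessage object
msg = email.parser.BytesParser(policy=email.policy.default).parsebytes(raw)
subject = msg["Subject"]

# Serialize the same message with the SMTP policy, which uses \r\n line endings
smtp_bytes = msg.as_bytes(policy=email.policy.SMTP)

print(type(msg).__name__)
print(subject)
print(smtp_bytes)
```

The message is parsed once, and only the serialization step picks a different policy, so the parsed object itself does not have to be rebuilt.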

  • First, let us import the required modules:

    import os
    import email
    import email.parser
    import email.policy
  • Next, we will define a function load_email, which opens a single email file and parses it into an EmailMessage object:

    def << your code goes here >>(is_spam, filename, spam_path=SPAM_PATH):
        directory = "spam" if is_spam else "easy_ham"
        with open(os.path.join(spam_path, directory, filename), "rb") as f:
            return email.parser.BytesParser(policy=email.policy.default).parse(f)
  • Finally, we will load and parse the emails whose filenames we collected in the previous step:

    ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filenames]
    spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filenames]
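Each element of these lists is a parsed message object. As a rough sketch of what such an object offers, the snippet below parses a small hand-written message (a hypothetical sample, not one of the downloaded emails) and reads its headers, content type, and body.

```python
import email.parser
import email.policy

# Hypothetical sample message standing in for one of the downloaded files
raw = (
    b"From: sender@example.com\n"
    b"Subject: Test message\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"\n"
    b"This is the body.\n"
)

msg = email.parser.BytesParser(policy=email.policy.default).parsebytes(raw)

sender = msg["From"]                   # header lookup is case-insensitive
subject = msg["Subject"]
content_type = msg.get_content_type()  # MIME type of the message
body = msg.get_content()               # decoded text of a non-multipart message

print(sender, subject, content_type)
print(body.strip())
```

Real emails in the dataset may be multipart or use unusual encodings, so calling get_content() directly on them may need extra handling.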