Project - Building Spam Classifier


Spam Classifier - Fetch the Dataset

In this step, we will fetch the dataset from the SpamAssassin website.

INSTRUCTIONS
  • In the first step, we will import 3 modules; namely os, tarfile, and urllib.request:

    import os
    import tarfile
    import urllib.request
    

    The os module provides a portable way of using operating-system-dependent functionality, such as reading and writing files and manipulating paths. The tarfile module is used to read and write tar archives, a form of archive file similar to Zip. urllib is a package that collects several modules for working with URLs; we import its urllib.request submodule, which is what actually performs downloads.
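    As a quick, self-contained illustration of these modules (no download involved), here is a sketch that builds the same path and URL strings this step uses; urllib.parse appears here only for demonstration:

    ```python
    import os
    import tarfile
    import urllib.parse

    # os.path.join builds a platform-appropriate local path,
    # just as the lesson does for "datasets/spam".
    spam_path = os.path.join("datasets", "spam")
    print(spam_path)

    # urllib.parse.urljoin composes the same URL that the lesson
    # builds by plain string concatenation.
    url = urllib.parse.urljoin(
        "http://spamassassin.apache.org/old/publiccorpus/",
        "20030228_spam.tar.bz2",
    )
    print(url)

    # tarfile can check whether a local file is a readable tar archive.
    print(callable(tarfile.is_tarfile))
    ```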

  • Next, we will set the paths for downloading the ham and spam datasets:

    DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
    HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
    SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
    SPAM_PATH = os.path.join("datasets", "spam")
    
  • Now we will define a function called fetch_spam_data, which takes 2 inputs: the spam URL and the spam path that we set above:

    def << your code goes here >>(spam_url=SPAM_URL, spam_path=SPAM_PATH):
        if not os.path.isdir(spam_path):
            os.makedirs(spam_path)
        for filename, url in (("ham.tar.bz2", HAM_URL), ("spam.tar.bz2", SPAM_URL)):
            path = os.path.join(spam_path, filename)
            if not os.path.isfile(path):
                urllib.request.urlretrieve(url, path)
            tar_bz2_file = tarfile.open(path)
            tar_bz2_file.extractall(path=spam_path)
            tar_bz2_file.close()
    

    The above function does 2 things. First, it checks if the directory datasets/spam exists; if it does not, it creates that directory. Second, it fetches the data from the URLs given above using the urllib.request module, and extracts the archives into the directory created in the previous step using the tarfile module.
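    The same open/extract steps can be exercised offline with a tiny throwaway archive. This sketch (file and directory names invented for illustration) packs one fake email into a .tar.bz2 and then extracts it exactly as the function does:

    ```python
    import os
    import tarfile
    import tempfile

    # Offline sketch of the extract step: build a tiny .tar.bz2 archive,
    # then extract it the same way fetch_spam_data does.
    with tempfile.TemporaryDirectory() as spam_path:
        # Create a sample "email" file and pack it into a bz2-compressed tar.
        sample = os.path.join(spam_path, "sample_email.txt")
        with open(sample, "w") as f:
            f.write("Subject: hello\n")
        archive = os.path.join(spam_path, "spam.tar.bz2")
        with tarfile.open(archive, "w:bz2") as tar:
            tar.add(sample, arcname="easy_ham/sample_email.txt")

        # Extract exactly as in the lesson's function.
        with tarfile.open(archive) as tar_bz2_file:
            tar_bz2_file.extractall(path=spam_path)

        extracted = os.path.join(spam_path, "easy_ham", "sample_email.txt")
        print(os.path.isfile(extracted))
    ```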





9 Comments

import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
SPAM_PATH = os.path.join("datasets", "spam")

def fetch_spam_data(spam_url=SPAM_URL, spam_path=SPAM_PATH):
    if not os.path.isdir(spam_path):
        os.makedirs(spam_path)
    for filename, url in (("ham.tar.bz2", HAM_URL), ("spam.tar.bz2", SPAM_URL)):
        path = os.path.join(spam_path, filename)
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
        with tarfile.open(path) as tar_bz2_file:
            tar_bz2_file.extractall(path=spam_path)

fetch_spam_data()


    if not os.path.isfile(path):
        urllib.request.urlretrieve(url, path)

The above if statement checks whether the file already exists; if it does not, it fetches the content from the URL and stores it in the file at the given path.

Am I right?


Yes, you are right.
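To see this caching behaviour without any network access, here is a minimal sketch (the temporary directory and file name are made up for illustration); the second pass finds the file already present and skips the stand-in download:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "ham.tar.bz2")  # hypothetical target file
    downloads = 0
    for _ in range(2):
        if not os.path.isfile(path):
            # stand-in for urllib.request.urlretrieve(url, path)
            open(path, "wb").close()
            downloads += 1
    print(downloads)  # the file is only "downloaded" once
```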


When the defined function is used to fetch spam data, why are we passing the ham URL in the for loop?


You can try printing the values of filename and url inside the loop. You may find your answer by doing that.
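A minimal sketch of that suggestion (URLs copied from the lesson): the loop always iterates over both (filename, url) pairs, so a single call processes ham as well as spam.

```python
HAM_URL = "http://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2"
SPAM_URL = "http://spamassassin.apache.org/old/publiccorpus/20030228_spam.tar.bz2"

# Print each pair the for loop in fetch_spam_data visits.
for filename, url in (("ham.tar.bz2", HAM_URL), ("spam.tar.bz2", SPAM_URL)):
    print(filename, "->", url)
```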


I could not download the data; when I call the function, I get a read error.


Hi,

Please take a hint or look at the answer if you are stuck.

Thanks.


Please explain the ham URL. I am not aware of these terminologies. Thanks.


Hi,

HAM is the opposite of SPAM: it refers to legitimate (non-spam) email.

Thanks.
