Spam Classifier - Fetch the Dataset

In this step, we will fetch the dataset from the SpamAssassin website.

INSTRUCTIONS
  • In the first step, we will import three modules: os, tarfile, and urllib.request:

    import os
    import tarfile
    import urllib.request
    

    The os module provides a portable way to use operating-system-dependent functionality such as reading or writing files and manipulating paths. The tarfile module reads and writes tar archives, an archive format similar to Zip, including bzip2-compressed ones. The urllib package collects several modules for working with URLs; here we use its urllib.request submodule to download files.

  • Next, we will set the URLs of the ham and spam archives and the local directory where they will be stored:

    DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
    HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
    SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
    SPAM_PATH = os.path.join("datasets", "spam")
    
  • Now we will define a function called fetch_spam_data, which takes two arguments, spam_url and spam_path, whose defaults are the values set above:

    def << your code goes here >>>(spam_url=SPAM_URL, spam_path=SPAM_PATH):
        if not os.path.isdir(spam_path):
            os.makedirs(spam_path)
        for filename, url in (("ham.tar.bz2", HAM_URL), ("spam.tar.bz2", SPAM_URL)):
            path = os.path.join(spam_path, filename)
            if not os.path.isfile(path):
                urllib.request.urlretrieve(url, path)
            tar_bz2_file = tarfile.open(path)
            tar_bz2_file.extractall(path=spam_path)
            tar_bz2_file.close()
    

    The above function does two things. First, it checks whether the directory datasets/spam exists and creates it if it does not. Second, for each of the two archives, it downloads the file from the URL given above with urllib.request (skipping the download if the file is already present) and extracts it into that directory with the tarfile module.
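
The create-directory-then-extract steps described above can be tried without network access. The following is only a minimal sketch under stated assumptions: instead of downloading from SpamAssassin, it builds a tiny .tar.bz2 archive locally in a temporary directory, and the helper names make_sample_archive and extract_archive, the file name message.txt, and its contents are hypothetical and not part of the exercise.

```python
import os
import tarfile
import tempfile


def make_sample_archive(archive_path, member_name, text):
    """Build a small .tar.bz2 archive holding one text file.

    Hypothetical stand-in for the downloaded SpamAssassin archives.
    """
    with tempfile.TemporaryDirectory() as scratch:
        source = os.path.join(scratch, member_name)
        with open(source, "w", encoding="utf-8") as handle:
            handle.write(text)
        # "w:bz2" writes a bzip2-compressed tar archive.
        with tarfile.open(archive_path, "w:bz2") as archive:
            archive.add(source, arcname=member_name)


def extract_archive(archive_path, target_dir):
    """Create target_dir if needed, extract the archive into it, list the result."""
    os.makedirs(target_dir, exist_ok=True)
    # tarfile.open detects the compression automatically when reading.
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=target_dir)
    return sorted(os.listdir(target_dir))


# Work inside a temporary directory so nothing is written to the project folder.
work_dir = tempfile.mkdtemp()
archive_path = os.path.join(work_dir, "sample.tar.bz2")
make_sample_archive(archive_path, "message.txt", "Subject: hello\n")

target_dir = os.path.join(work_dir, "datasets", "spam")
extracted = extract_archive(archive_path, target_dir)
print(extracted)
```

Using os.makedirs with exist_ok=True folds the "check, then create" step into one call, and the with blocks close each archive even if extraction raises an error.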

