In this step, we will fetch the dataset from the SpamAssassin website.
First, we will import three libraries, namely os, tarfile, and urllib:
import os
import tarfile
import urllib.request  # the request submodule provides urlretrieve
The os module provides a portable way of using operating-system-dependent functionality such as reading or writing files and manipulating paths. The tarfile module is used to read and write tar archives, an archive format comparable to Zip. urllib is a package that collects several modules for working with URLs.
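To make the roles of these modules concrete, here is a minimal sketch that does not touch the network. It builds a small .tar.bz2 archive in a temporary directory and lists its contents, then parses the corpus URL. The names hello.txt and demo.tar.bz2 are illustrative assumptions, not part of the exercise.

```python
import os
import tarfile
import tempfile
import urllib.parse

# Work inside a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    # os: build a path portably and write a small file (hypothetical name).
    text_path = os.path.join(tmp_dir, "hello.txt")
    with open(text_path, "w") as f:
        f.write("hello")

    # tarfile: pack the file into a bzip2-compressed tar archive ...
    archive_path = os.path.join(tmp_dir, "demo.tar.bz2")
    with tarfile.open(archive_path, "w:bz2") as archive:
        archive.add(text_path, arcname="hello.txt")

    # ... and list what the archive contains.
    with tarfile.open(archive_path) as archive:
        member_names = archive.getnames()

# urllib: urllib.parse is one of the modules collected in the package.
parts = urllib.parse.urlparse("http://spamassassin.apache.org/old/publiccorpus/")

print(member_names)
print(parts.netloc, parts.path)
```

The exercise itself uses a different module from the same package, urllib.request, to download the files.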
Next, we will set the download URLs for the ham and spam datasets and the local path where they will be stored:
DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
SPAM_PATH = os.path.join("datasets", "spam")
Now we will define a function called fetch_spam_data that takes two arguments, spam_url and spam_path, whose default values are the SPAM_URL and SPAM_PATH set above:
def << your code goes here >>>(spam_url=SPAM_URL, spam_path=SPAM_PATH):
    # Create the target directory if it does not exist yet.
    if not os.path.isdir(spam_path):
        os.makedirs(spam_path)
    for filename, url in (("ham.tar.bz2", HAM_URL), ("spam.tar.bz2", SPAM_URL)):
        path = os.path.join(spam_path, filename)
        # Download the archive only if it is not already on disk.
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
        # Extract the archive into the same directory.
        tar_bz2_file = tarfile.open(path)
        tar_bz2_file.extractall(path=spam_path)
        tar_bz2_file.close()
The above function does two things. First, it checks whether the directory datasets/spam exists and creates it if it does not. Second, for both the ham and the spam archive, it downloads the file from the URL given above using urllib.request.urlretrieve (unless the file is already present) and extracts it into the directory created in the previous step using the tarfile module.
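The same check, download, and extract sequence can be tried without network access. The sketch below is an illustration under stated assumptions, not the exercise solution: fetch_archive is a hypothetical helper, and a small locally built archive served through a file:// URL stands in for the SpamAssassin download.

```python
import os
import pathlib
import tarfile
import tempfile
import urllib.request


def fetch_archive(url, target_dir, filename):
    """Hypothetical helper: download url into target_dir and extract it."""
    os.makedirs(target_dir, exist_ok=True)   # step 1: make sure the directory exists
    path = os.path.join(target_dir, filename)
    if not os.path.isfile(path):             # skip the download if already present
        urllib.request.urlretrieve(url, path)
    with tarfile.open(path) as archive:      # step 2: extract next to the archive
        archive.extractall(path=target_dir)
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a tiny stand-in archive that plays the role of the remote file.
    source_dir = os.path.join(tmp_dir, "source")
    os.makedirs(source_dir)
    message_path = os.path.join(source_dir, "message.txt")
    with open(message_path, "w") as f:
        f.write("Subject: test")
    archive_path = os.path.join(source_dir, "sample.tar.bz2")
    with tarfile.open(archive_path, "w:bz2") as archive:
        archive.add(message_path, arcname="sample/message.txt")

    # A file:// URL lets urlretrieve run without any network access.
    local_url = pathlib.Path(archive_path).as_uri()
    target_dir = os.path.join(tmp_dir, "datasets", "spam")
    fetch_archive(local_url, target_dir, "sample.tar.bz2")

    extracted_ok = os.path.isfile(os.path.join(target_dir, "sample", "message.txt"))

print(extracted_ok)
```

Because of the os.path.isfile check, calling the helper a second time would reuse the file already on disk and only repeat the extraction.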
9 Comments
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
SPAM_PATH = os.path.join("datasets", "spam")

def fetch_spam_data(spam_url=SPAM_URL, spam_path=SPAM_PATH):
    if not os.path.isdir(spam_path):
        os.makedirs(spam_path)
    for filename, url in (("ham.tar.bz2", HAM_URL), ("spam.tar.bz2", SPAM_URL)):
        path = os.path.join(spam_path, filename)
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
        with tarfile.open(path) as tar_bz2_file:
            tar_bz2_file.extractall(path=spam_path)

fetch_spam_data()

if not os.path.isfile(path):
    urllib.request.urlretrieve(url, path)
The above if statement checks whether the file exists and, if it does not, fetches the content from the URL and stores it in the file given by path.
Am I right?

Yes, you are right.

If the function is meant to fetch the spam data, why are we passing the ham URL in the for loop?

You can try printing the values of filename and url inside the loop. You may find your answer by doing that.
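
As a sketch of that suggestion, the loop can be run on its own with a print call and no downloading. The constants are copied from the exercise; the seen list is only an illustrative addition.

```python
DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"

seen = []  # illustrative: records which archives the loop visits
for filename, url in (("ham.tar.bz2", HAM_URL), ("spam.tar.bz2", SPAM_URL)):
    print(filename, url)
    seen.append(filename)
```

The loop runs twice, once for the ham archive and once for the spam archive, so despite its name the function fetches both datasets. The spam_url argument is not used inside the loop.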

I could not download the data. When I call the function, I get a read error.

Hi,
Please take a hint or look at the answer if you are stuck.
Thanks.
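
The thread does not say what caused this read error. As a hedged sketch, assuming it came from a failed or interrupted download, a wrapper such as the hypothetical try_download below reports the failure and removes any partial file, so that the os.path.isfile check does not skip a later retry. A file:// URL to a missing file is used here only to trigger a failure without network access.

```python
import os
import pathlib
import tempfile
import urllib.request


def try_download(url, path):
    """Hypothetical helper: return True if url was saved to path, else False."""
    try:
        urllib.request.urlretrieve(url, path)
        return True
    except OSError as error:
        # URLError, HTTPError and ContentTooShortError are all OSError subclasses.
        print("Download failed:", error)
        # Remove a partial file so a later isfile() check does not skip the retry.
        if os.path.isfile(path):
            os.remove(path)
        return False


with tempfile.TemporaryDirectory() as tmp_dir:
    # This file does not exist, so the download is expected to fail.
    missing_url = pathlib.Path(tmp_dir, "missing.tar.bz2").as_uri()
    target = os.path.join(tmp_dir, "spam.tar.bz2")
    ok = try_download(missing_url, target)
    leftover = os.path.exists(target)

print(ok, leftover)
```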

Please explain what the ham URL is. I am not aware of these terminologies. Thanks.

Hi,
Ham is the opposite of spam, that is, legitimate (non-spam) email.
Thanks.