Enrollments closing soon for Post Graduate Certificate Program in Applied Data Science & AI By IIT Roorkee | 3 Seats Left
Apply NowLogin using Social Account
     Continue with GoogleLogin using your credentials
Say, we have an array of strings that might be loaded from a file on the driver. Here it is defined inside the code. This array of strings contains the common words.
commonWords = ["a", "an", "the", "of", "at", "is", "am","are","this","that","at", "in", "or", "and", "or", "not", "be", "for", "to", "it"]
Let's convert this array into a Map. Looking up for a word in a map becomes really fast as compared to an array.
commonWordsMap = {}
for word in commonWords:
commonWordsMap[word] = 1
Then we create a broadcast variable using sc.broadcast
method of spark context and passing our map as the argument.
commonWordsBC = sc.broadcast(commonWordsMap)
Next, we create our rdd using a file located in HDFS. Here each record will be a string because a textinputformat is being used by default.
file = sc.textFile("/data/mr/wordcount/input/big.txt")
Afterward, we define a function toWords
which breaks down a line into an array of words that are not in the common words map accessed via the broadcast variable.
def toWords(line):
words = line.split(" ")
output = [];
for word in words:
cleanWord = word.lower().strip().replace("[^a-z]","")
if not cleanWord in commonWordsBC.value:
output.append(cleanWord)
return output
Then we call this function using a flatMap resulting in RDD of uncommon words.
uncommonWords = file.flatMap(toWords)
We can check the result using an action such as take()
to list a few records from the RDD. You would see that the common words have been removed.
uncommonWords.take(100)
Taking you to the next exercise in seconds...
Want to create exercises like this yourself? Click here.
No hints are availble for this assesment
Answer is not availble for this assesment
Loading comments...