Adv Spark Programming - Broadcast Variables Example (Python)

Say, we have an array of strings that might be loaded from a file on the driver. Here it is defined inside the code. This array of strings contains the common words.

commonWords = ["a", "an", "the", "of", "at", "is", "am","are","this","that","at", "in", "or", "and", "or", "not", "be", "for", "to", "it"]

Let's convert this array into a Map. Looking up for a word in a map becomes really fast as compared to an array.

commonWordsMap = {}

for word in commonWords:
    commonWordsMap[word] = 1

Then we create a broadcast variable using sc.broadcast method of spark context and passing our map as the argument.

commonWordsBC = sc.broadcast(commonWordsMap)

Next, we create our rdd using a file located in HDFS. Here each record will be a string because a textinputformat is being used by default.

file = sc.textFile("/data/mr/wordcount/input/big.txt")

Afterward, we define a function toWords which breaks down a line into an array of words that are not in the common words map accessed via the broadcast variable.

def toWords(line):  
    words = line.split(" ")
    output = [];
    for word in words:
        cleanWord = word.lower().strip().replace("[^a-z]","")
        if not cleanWord in commonWordsBC.value:
            output.append(cleanWord)
    return output

Then we call this function using a flatMap resulting in RDD of uncommon words.