Adv Spark Programming

30 / 52

Adv Spark Programming - Broadcast Variables

Slides - Adv Spark Programming (2)


Say, we a list of common words such as "a", "an", "the", "of" and "at" etc. If we need to remove these common words from our word count example, what do we need to do? One way is to create a local variable inside driver logic and use it inside our transformations and actions methods. Is it inefficient?

Yes, using local variables having huge data is inefficient because spark sends all of the referenced variables to all of the workers. The default task launching mechanism is optimized for small task sizes. If used multiple times, Spark will be sending it again to all nodes. So, instead of just referring to the variables holding huge data, we should use broadcast variables instead.

Broadcast variables are great for sharing read-only data with the spark application. If data is really huge and can not fit in memory, you should put it in HDFS and use it by the way of RDD. It is very similar to Hadoop MapReduce distributed cache. Broadcast variables allow the program to efficiently send a large, read-only value that can fit in memory to all the worker nodes.

Let’s take a few examples:

  1. Say we have a lookup table that we need to use in our transformations or actions. We can create a broadcast variable to hold such a lookup table.

  2. Another example use case could be that we have a large feature vector that we need to use in a machine learning algorithm being run by map transformation.

It is like a distributed cache of Hadoop. Spark distributes broadcast variables efficiently to reduce communication costs. Broadcast variables are useful when tasks across multiple stages need the same data and caching the data in the deserialized form is important.

Let's try to use broadcast variables to remove the common words or stop words from a file located in big.txt. This can be further extended to word count.

No hints are availble for this assesment

Answer is not availble for this assesment

Loading comments...