Spark On Cluster


Apache Spark - Running On Cluster - Local Mode

Depending on the resource manager, Spark can run in two modes: local mode and cluster mode.

We specify the resource manager using a command-line option called --master.

Local mode, also known as Spark in-process, is the default mode of Spark. It does not require any resource manager; it runs everything on the same machine. Because of local mode, we can simply download Spark and run it without having to install any resource manager.

With local mode, we can utilize multiple cores of a CPU for processing. Essentially, it is good for parallel computing.

Since the smallest unit of parallelization is a partition, the number of partitions is generally kept less than or equal to the number of CPU cores available. Keeping more partitions than cores does not give any additional advantage with respect to parallelization.
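As a sketch of this idea, inside a spark-shell session (where sc is already provided) we can create an RDD whose partition count matches the parallelism Spark was given; defaultParallelism equals the number of local threads in local mode:

```scala
// Inside spark-shell; sc is provided by the shell (a sketch, not a standalone program).
// In local[N] mode, sc.defaultParallelism is N, so the RDD gets one partition per thread.
val rdd = sc.parallelize(1 to 100, sc.defaultParallelism)
rdd.getNumPartitions  // equals the number of threads given to Spark
```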

The local mode is also quite useful while testing a Spark application.

So, how do you run Spark in local mode? It is very simple.

When we do not specify any --master flag to the command spark-shell, pyspark, spark-submit, or any other binary, it runs in local mode.

Alternatively, we can pass local as the argument to the --master option, which defaults to a single thread.

We can specify the number of threads in square brackets after local. So, spark-shell --master local[2] runs with two threads.

A better way is to use an asterisk instead of specifying the number of threads: local[*] uses as many threads as there are processors available to the Java Virtual Machine.

When we do not provide any --master option on the command line, it defaults to local[*].
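The launch forms described above can be summarized as follows (a sketch of a session on a machine where Spark is installed, such as the CloudxLab console):

```shell
# All of these run Spark in local (in-process) mode:
spark-shell                        # no --master: defaults to local[*]
spark-shell --master local         # local mode with a single thread
spark-shell --master "local[2]"    # local mode with two threads
spark-shell --master "local[*]"    # one thread per processor available to the JVM
```

The same --master values work with pyspark and spark-submit as well.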

The Spark context, sc, has a flag isLocal. If this flag is true, Spark is running in local mode; otherwise it is running in cluster mode.

The other way to check the mode is the variable master, which carries the URL of the master. To know which resource manager we are using, we can print the value of sc.master.
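The two checks look like this inside spark-shell (a session fragment; sc is provided by the shell):

```scala
// Inside spark-shell:
sc.isLocal  // true in local mode, false in cluster mode
sc.master   // the master URL, e.g. "local[*]" in the default local mode
```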

Let us do a quick hands-on to check the option master.

Let's first log in to the CloudxLab console, or connect via SSH. First, we launch spark-shell without any arguments and wait for the Scala prompt to appear. It might take a while.

Once the prompt appears, you can check whether it is running in local mode using sc.isLocal. As you can see, it is running in local mode. Next, we check sc.master, which returns local[*], meaning that by default Spark uses local mode with as many threads as are available to the Java Virtual Machine.

Now exit the Spark Scala shell by pressing Ctrl+D. Then relaunch it with spark-shell --master local. Once it is up, we can again check that it is running in local mode. Also, note that sc.master now prints local, not local[*].

