Spark On Cluster


Apache Spark - Running On Cluster - Local Mode

Depending on the resource manager, Spark can run in two modes: local mode and cluster mode.

We specify the resource manager with a command-line option called --master.

Local mode, also known as Spark in-process, is the default mode of Spark. It does not require any resource manager and runs everything on the same machine. Because of local mode, we can simply download Spark and run it without having to install any resource manager.

With local mode, we can utilize multiple cores of the CPU for processing. Essentially, it is still good for parallel computing.

Since the smallest unit of parallelization is a partition, the number of partitions is generally kept less than or equal to the number of CPU cores available. Keeping more partitions than cores does not give any additional advantage with respect to parallelization.
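For example, in the spark-shell we could match the number of partitions to the cores the JVM can see. This is only a rough sketch; the range 1 to 1000 and the variable names are illustrative:

val cores = Runtime.getRuntime.availableProcessors   // cores visible to the JVM
val rdd = sc.parallelize(1 to 1000, numSlices = cores)   // one partition per core
rdd.getNumPartitions   // equals cores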

The local mode is also quite useful while testing a Spark application.

So, how do you run Spark in local mode? It is very simple.

When we do not specify any --master flag to spark-shell, pyspark, spark-submit, or any other binary, it runs in local mode.

Or we can specify the --master option with local as the argument, which defaults to a single thread.

We can specify the number of threads in square brackets after local. So, spark-shell --master local[2] runs the shell with two threads.

A better way is to use an asterisk instead of specifying the number of threads: local[*] uses as many threads as the number of processors available to the Java virtual machine.

When we do not provide any master option on the command line, it defaults to local[*].
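The same choice can also be made programmatically when we build a standalone application instead of using the shell. A minimal sketch (the application name local-mode-demo is just an example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("local-mode-demo")   // illustrative name
  .master("local[*]")           // same effect as --master local[*]
  .getOrCreate()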

sc, the SparkContext, has a flag isLocal. If this flag is true, Spark is running in local mode; otherwise it is running in cluster mode.

The other way to check the mode is the variable master, which carries the URL of the master. To know which resource manager we are using, we can print the value of sc.master.
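For example, in the shell (a rough sketch; the actual values depend on the --master option used at launch):

sc.isLocal   // true in local mode, false when running on a cluster manager such as YARN
sc.master    // e.g. "local[*]" in local mode, or "yarn" when running on YARN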

Let us do a quick hands-on to check the master option.

Let's first log in to the CloudxLab console or connect via ssh. Then we launch spark-shell without any arguments and wait for the Scala prompt to appear. It might take a while.

Once the prompt appears, you can check whether it is running in local mode by using sc.isLocal. As you can see, it is running in local mode. Next, we check sc.master, which returns local[*]. This means that by default Spark uses local mode with as many threads as the Java virtual machine has available.

Now exit the Spark Scala shell by pressing Ctrl+D and relaunch it with spark-shell --master local. Once it is up, we can again check if it is running in local mode. Also, note that sc.master now prints local, not local[*].
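Put together, the two runs look roughly like this (the res numbers and exact output are illustrative; yours may differ):

spark-shell
scala> sc.isLocal
res0: Boolean = true
scala> sc.master
res1: String = local[*]

spark-shell --master local
scala> sc.master
res0: String = local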

Apache Spark - Running On Cluster



18 Comments

Can you please tell me where to check submitted jobs, like in the screenshot below, in Cloudera?

 


When we run a job through spark-shell --master yarn in cluster mode, how can we see the job status and error details under the application manager, as in the screenshot below?

 


Hi,

You can check that using the "yarn application" command. To list all the applications, you can use the command:

yarn application -list

To print the status of a particular application, you can use the following command:

yarn application -status <<Application_ID>>

To see more of what the command can do, please refer to: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YarnCommands.html


Hi,

Hi, I am not able to access the Spark UI. I am running Spark in a Jupyter notebook.

http://localhost:4040/


Hi,

Can you please elaborate on your issue?


hi,

   I want to access the Spark UI to check the progress of a Spark job. How do I access the Spark UI?


Hi Sachin,

First you need to find the cluster which you are using. It can be either 'e' or 'f'. You can find it in the URL while opening the Jupyter notebook.

Then you need to find the port number. You can do that by running the following code:

spark.sparkContext.uiWebUrl.split(":")[-1]

Now, suppose you are using the 'f' cluster, then the URL will be http://f.cloudxlab.com:port-number. If you are using the 'e' cluster, then just replace 'f' with 'e'. Also, remember the protocol should be 'http' and not 'https'.

So suppose you are using the 'f' cluster and the port number comes out as 4043, then the URL will be: http://f.cloudxlab.com:4043


Thanks, I am able to access the UI.


Hi,

   How do I run Spark in cluster mode from a Jupyter notebook in Python?

I tried this code but it is not working:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("yarn").appName("first_app").getOrCreate()
print(spark.sparkContext.master)


You can refer to the following blog for that: https://cloudxlab.com/blog/running-pyspark-jupyter-notebook/


https://stackoverflow.com/questions/32356143/what-does-setmaster-local-mean-in-spark


I guess some of the questions are asked in advance while they are elaborated in a later exercise. Any reason?


I'm getting the following error while running the spark-shell --master yarn command. Is this correct?


Hi Sandeep,

When I am trying to run Spark in YARN mode, I am getting the below error:

Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 69 more
<console>:14: error: not found: value spark
import spark.implicits._
^
<console>:14: error: not found: value spark
import spark.sql
^
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.0

----------------------------

I am using the below commands:

1.) spark-shell --master yarn
2.) spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class com.util.Utility --master yarn --deploy-mode cluster jar/cloudx-0.0.1-SNAPSHOT.jar /user/kapilltyagi3562/my_result/part-00000 /user/kapilltyagi3562/clustermaster

Please resolve this issue.

Thanks
Kapil


This is because this version of YARN does not yet support Spark 2. We are working on it.


Hi,
We are getting a green screen towards the end of the video.
Please fix it.


Hi Manoj,

This has been fixed. Thanks to you.

Regards,
Sandeep Giri
