Spark On Cluster


Apache Spark - Running On Cluster - Local Mode

Depending on the resource manager, Spark can run in two modes: local mode and cluster mode.

We specify the resource manager with a command-line option called --master.

Local mode, also known as Spark in-process, is the default mode of Spark. It does not require any resource manager and runs everything on the same machine. Because of local mode, we can simply download Spark and run it without having to install any resource manager.

With local mode, we can utilize multiple cores of the CPU for processing. Essentially, it is still good for parallel computing.

Since the smallest unit of parallelization is a partition, the number of partitions is generally kept less than or equal to the number of CPU cores available. Keeping more partitions than cores does not give any additional advantage with respect to parallelization.
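For example, in the spark-shell we could match the number of partitions to the cores the JVM can see. This is only a rough sketch; the range 1 to 1000 and the variable names are illustrative:

val cores = Runtime.getRuntime.availableProcessors   // cores visible to the JVM
val rdd = sc.parallelize(1 to 1000, numSlices = cores)   // one partition per core
rdd.getNumPartitions   // equals cores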

The local mode is also quite useful while testing a Spark application.

So, how do you run Spark in local mode? It is very simple.

When we do not specify any --master flag to spark-shell, pyspark, spark-submit, or any other binary, it runs in local mode.

Or we can specify the --master option with local as the argument, which defaults to a single thread.

We can specify the number of threads in square brackets after local. So, spark-shell --master local[2] runs the shell with two threads.

A better way is to use an asterisk instead of specifying the number of threads: local[*] uses as many threads as the number of processors available to the Java virtual machine.

When we do not provide any master option on the command line, it defaults to local[*].
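The same choice can also be made programmatically when we build a standalone application instead of using the shell. A minimal sketch (the application name local-mode-demo is just an example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("local-mode-demo")   // illustrative name
  .master("local[*]")           // same effect as --master local[*]
  .getOrCreate()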

sc, the SparkContext, has a flag isLocal. If this flag is true, Spark is running in local mode; otherwise it is running in cluster mode.

The other way to check the mode is the variable master, which carries the URL of the master. To know which resource manager we are using, we can print the value of sc.master.
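For example, in the shell (a rough sketch; the actual values depend on the --master option used at launch):

sc.isLocal   // true in local mode, false when running on a cluster manager such as YARN
sc.master    // e.g. "local[*]" in local mode, or "yarn" when running on YARN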

Let us do a quick hands-on to check the master option.

Let's first log in to the CloudxLab console or connect via ssh. Then we launch spark-shell without any arguments and wait for the Scala prompt to appear. It might take a while.

Once the prompt appears, you can check whether it is running in local mode by using sc.isLocal. As you can see, it is running in local mode. Next, we check sc.master, which returns local[*]. This means that by default Spark uses local mode with as many threads as the Java virtual machine has available.

Now exit the Spark Scala shell by pressing Ctrl+D and relaunch it with spark-shell --master local. Once it is up, we can again check if it is running in local mode. Also, note that sc.master now prints local, not local[*].
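Put together, the two runs look roughly like this (the res numbers and exact output are illustrative; yours may differ):

spark-shell
scala> sc.isLocal
res0: Boolean = true
scala> sc.master
res1: String = local[*]

spark-shell --master local
scala> sc.master
res0: String = local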

Apache Spark - Running On Cluster



18 Comments

Can you please tell me where to check submitted jobs, like in the screenshot below, in Cloudera?

 


When we run a job through spark-shell --master yarn in cluster mode, how can we see the job status and error details under the application manager, as in the screenshot below?

 


Hi,

You can check that using the "yarn application" command. To list all the applications, you can use the command:

yarn application -list

To print the status of a particular application, you can use the following command:

yarn application -status <<Application_ID>>

To see more of what the command can do, please refer to: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YarnCommands.html


Hi,

Hi, I am not able to access the Spark UI. I am running Spark in a Jupyter notebook.

http://localhost:4040/


Hi,

Can you please elaborate on your issue?


hi,

   I want to access the Spark UI to check the progress of a Spark job. How do I access the Spark UI?


Hi Sachin,

First you need to find the cluster which you are using. It can be either 'e' or 'f'. You can find it in the URL while opening the Jupyter notebook.

Then you need to find the port number. You can do that by running the following code:

spark.sparkContext.uiWebUrl.split(":")[-1]

Now, suppose you are using the 'f' cluster, then the URL will be http://f.cloudxlab.com:port-number. If you are using the 'e' cluster, then just replace 'f' with 'e'. Also, remember the protocol should be 'http' and not 'https'.

So suppose you are using the 'f' cluster and the port number comes out as 4043, then the URL will be: http://f.cloudxlab.com:4043


Thanks, I am able to access the UI.


Hi,

   How do I run Spark in cluster mode from a Jupyter notebook in Python?

I tried this code but it is not working:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("yarn").appName("first_app").getOrCreate()
print(spark.sparkContext.master)


You can refer to the following blog for that: https://cloudxlab.com/blog/running-pyspark-jupyter-notebook/


https://stackoverflow.com/questions/32356143/what-does-setmaster-local-mean-in-spark


I guess some of the questions are asked in advance while they are elaborated in a later exercise. Any reason?


I'm getting the following error while running the spark-shell --master yarn command. Is this correct?


Hi Sandeep,

When I am trying to run Spark in YARN mode, I am getting the below error:

Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 69 more
<console>:14: error: not found: value spark
import spark.implicits._
^
<console>:14: error: not found: value spark
import spark.sql
^
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.0

----------------------------

I am using the below commands:

1.) spark-shell --master yarn
2.) spark-2.3.0-bin-hadoop2.7/bin/spark-submit --class com.util.Utility --master yarn --deploy-mode cluster jar/cloudx-0.0.1-SNAPSHOT.jar /user/kapilltyagi3562/my_result/part-00000 /user/kapilltyagi3562/clustermaster

Please resolve this issue.

Thanks
Kapil


This is because this version of YARN does not yet support Spark 2. We are working on it.


Hi,
We are getting a green screen towards the end of the video.
Please fix it.


Hi Manoj,

This has been fixed. Thanks to you.

Regards,
Sandeep Giri
