Spark On Cluster


Apache Spark - Running On Cluster - Cluster Mode - YARN

If you already have a Hadoop cluster, you can pass yarn to the --master option.

Please note that the Spark application's tasks run inside YARN containers on the various node managers.

Before launching Spark in YARN cluster mode, you must set the two environment variables YARN_CONF_DIR and HADOOP_CONF_DIR to the location of the Hadoop configuration directory. On CloudxLab or the Hortonworks Data Platform, this location is /etc/hadoop/conf/

Let's take a look at how to launch a Spark application on YARN. For this, we are going to run the example application shipped with Spark. This application estimates the value of pi using the Monte Carlo method: it generates random points inside a square, counts how many fall within the inscribed circle, and uses the ratio of the two counts to approximate pi (the circle's area divided by the square's area is pi/4).
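The same Monte Carlo idea can be sketched in plain Python, without Spark (a toy illustration of the math, not the SparkPi source code):

```python
import random

random.seed(42)  # fixed seed so the estimate is reproducible

# Sample points uniformly in the unit square. The quarter circle of
# radius 1 has area pi/4, so (inside / total) * 4 approximates pi.
n = 100_000
inside = sum(
    1 for _ in range(n)
    if random.random() ** 2 + random.random() ** 2 <= 1.0
)
pi_estimate = 4.0 * inside / n
print(pi_estimate)
```

SparkPi applies exactly this sampling, but distributes the counting across YARN containers with a map over partitions followed by a reduce.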

To do this, we first export the two variables and then execute the command.

Let us take a look. First, log in to the CloudxLab web console or via SSH, then export the two variables and run the spark-submit command. As you can see, the value of pi is roughly 3.142344.
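Putting the steps together, the full sequence looks like this (a sketch; the examples jar path follows the HDP layout used on CloudxLab and may differ on other clusters):

```shell
# Point Spark's YARN client at the Hadoop configuration
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/

# Submit the bundled SparkPi example to YARN with 10 partitions
spark-submit --master yarn --class org.apache.spark.examples.SparkPi \
  /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10
```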

We can also take a look in Hue; the Spark job should appear among our YARN jobs. Log in to Hue with your username and password and open the Job Browser. You can see that there is only one job so far, named Spark Pi. So, our last job was indeed executed via YARN.


42 Comments

How can I access Spark UI? Please help.


Hi Ajeet,

This discussion will help you.


Hue is not available. How can I check the submitted Spark job's name?


Hi Sairam,

You can check the job in Spark UI. 

Also, if you are using YARN as the master, you can check the job status with the yarn application command.

Hope this helps.
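As a sketch, the relevant CLI calls look like this (run on the cluster; the application id shown is a placeholder, yours appears in the spark-submit log):

```shell
# List applications known to YARN, in any state
yarn application -list -appStates ALL

# Show the status of a single application; replace the id with your own
yarn application -status application_1600000000000_0001
```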


Hi Abhinav,

With YARN as the master, I am able to get information using the "yarn application" command.

If I am running in local mode, I am unable to get job information in the Spark UI. I copied the Spark UI URL and tried to access it in a web browser, but I am getting a "Connection Timeout" error.

Please help me resolve this.

Hi Sairam,

For the Spark UI, this discussion will help you.

For YARN master, you also have to pass the application id to the yarn application command. Please check the YARN documentation for the yarn application command; the application id is displayed in the log itself.

Hope this answers your query.

Thanks


Hi Abhinav,

For the Spark UI, you shared a "discussion" link. Here is what I did:

1. Logged in to e.cloudxlab.com.
2. Submitted a Spark job:
    [sairamsrinivasv5522@cxln4 spark]$ spark-submit sparkPartition02.py
3. The Spark UI started on port 4046 (please see the screenshot below).
4. Then I tried to open the link below to see the submitted job:
    http://e.cloudxlab.com:4046/jobs/

At step 4, I got the error "This site can't be reached".

Can you tell me where I am going wrong?

According to the link, I tried to open the Spark UI link http://

Maybe your firewall or network is blocking the port.


Did you try to open "http://e.cloudxlab.com:4046"?


My pi value is not showing. Please help me.


You should use this command:

spark-submit --master yarn --class org.apache.spark.examples.SparkPi /usr/hdp/current/spark-client/lib/spark-examples-1.6.3.2.6.2.0-205-hadoop2.7.3.2.6.2.0-205.jar 10

Executed the YARN script with 2>/dev/null appended to suppress the standard-error warnings.

[punitnb7985@cxln5 ~]$ spark-submit --master yarn --class org.apache.spark.examples.SparkPi /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10 2>/dev/null

Pi is roughly 3.1434751434751433


Your tutorials are really good. I can say so for sure because I have seen a lot of videos and two other tutorials.


Hi,

Thank you for your appreciation. Do reach out to us if you need any help.

Thanks.


I have a query; please answer it if possible. I have feature vectors for face similarity search. When I get a new query, how can I use Spark with other big data tools to speed up the search? The only operation is to take the cosine similarity of the query vector with all the other feature vectors and return the maximum similarity.

Thanks in advance.


Imagine that you have stored all of the face embedding vectors in files in HDFS, and that there are millions of such faces.

face_embd_txt = sc.textFile("folder_containing_vectors")

my_embd = [0.1, ...] # This is what you are looking for

# assuming a helper to_array() that parses a line into a numeric vector
similarities = face_embd_txt.map(lambda x: to_array(x).dot(my_embd))

Now, find the max:

print(similarities.max())

Since the process runs in multiple threads on multiple computers, it is going to be faster, but loading the embeddings from the files will take time.
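For the cosine-similarity step specifically, here is a minimal local sketch in plain Python (toy data and a hypothetical query; in the Spark version each comparison would run inside the map):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0]                              # hypothetical query embedding
vectors = [[0.0, 1.0], [0.6, 0.8], [2.0, 0.0]]  # stored embeddings

# Pick the stored vector most similar (by cosine) to the query
best = max(vectors, key=lambda v: cosine_similarity(query, v))
print(best)  # -> [2.0, 0.0], which points in the same direction as the query
```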


How can I check the Job Browser now? Hue no longer exists. Is there a way to check from the command line?


Hi,

I am trying to run code on YARN through the steps mentioned in the video, but after running spark-submit it has been continuously retrying for 10-15 minutes.

Retry message (sample snippet):

nternal/10.142.1.2:8050. Already tried 45 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
20/09/24 03:34:25 INFO Client: Retrying connect to server: cxln2.c.thelab-240901.internal/10.142.1.2:8050. Already tried 46 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
20/09/24 03:34:26 INFO Client: Retrying connect to server: cxln2.c.thelab-240901.internal/10.142.1.2:8050. Already tried 47 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
20/09/24 03:34:27 INFO Client: Retrying connect to server: cxln2.c.thelab-240901.internal/10.142.1.2:8050. Already tried 48 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
20/09/24 03:34:28 INFO Client: Retrying connect to server: cxln2.c.thelab-240901.internal/10.142.1.2:8050. Already tried 49 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)


Hi,

Are you still facing this issue?

Thanks.


Done perfectly


Error


Unable to log in to Hue - getting the below error.



@disqus_XTh3bUKOBh:disqus why am I getting the following error with the same command shown in the video?


Unable to execute the spark-submit command for YARN; getting the below error.

[csg1473524@cxln4 ~]$ spark-submit --master yarn --class org.apache.spark.example.sparkPi /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10
SPARK_MAJOR_VERSION is set to 2, using Spark2
java.lang.ClassNotFoundException: org.apache.spark.example.sparkPi
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:229)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:708)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[csg1473524@cxln4 ~]$


Hi, Chandrashekar.

In your spark-submit command the class name should be org.apache.spark.examples.SparkPi: note that examples is plural and the S in SparkPi must be capital.

Kindly run the below commands in your web-console, it should work fine.

export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
spark-submit --master yarn --class org.apache.spark.examples.SparkPi /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10

All the best.


Thanks :)


I exported both the YARN and Hadoop variables, but the spark-submit command for the calculation of pi did not execute. It is also showing 0 jobs in the Hue Job Browser. Screenshot attached. Please check where I am going wrong.


Hi, Sanjay.

The last spark-submit command is not correct.
Kindly try again; it should work fine.

export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
spark-submit --master yarn --class org.apache.spark.examples.SparkPi /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10

All the best!


[sarthak33987884@cxln4 ~]$ spark-submit --master yarn --class org.apache.spark.examples.SparkPi /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10
SPARK_MAJOR_VERSION is set to 2, using Spark2
20/01/21 14:06:23 INFO SparkContext: Running Spark version 2.1.1.2.6.2.0-205
20/01/21 14:06:24 INFO SecurityManager: Changing view acls to: sarthak33987884
20/01/21 14:06:24 INFO SecurityManager: Changing modify acls to: sarthak33987884
20/01/21 14:06:24 INFO SecurityManager: Changing view acls groups to:
20/01/21 14:06:24 INFO SecurityManager: Changing modify acls groups to:
20/01/21 14:06:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sarthak33987884); groups with view permissions: Set(); users with modify permissions: Set(sarthak33987884); groups with modify permissions: Set()
20/01/21 14:06:24 INFO Utils: Successfully started service 'sparkDriver' on port 42959.
20/01/21 14:06:24 INFO SparkEnv: Registering MapOutputTracker
20/01/21 14:06:24 INFO SparkEnv: Registering BlockManagerMaster
20/01/21 14:06:24 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/01/21 14:06:24 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/01/21 14:06:24 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-122b3317-b45c-4d8d-b25e-c0c591e436ac
20/01/21 14:06:24 INFO MemoryStore: MemoryStore started with capacity 114.6 MB
20/01/21 14:06:24 INFO SparkEnv: Registering OutputCommitCoordinator
20/01/21 14:06:24 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/01/21 14:06:25 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.142.1.4:4040
20/01/21 14:06:25 INFO SparkContext: Added JAR file:/usr/hdp/current/spark-client/lib/spark-examples-1.6.3.2.6.2.0-205-hadoop2.7.3.2.6.2.0-205.jar at spark://10.142.1.4:42959/jars/spark-examples-1.6.3.2.6.2.0-205-hadoop2.7.3.2.6.2.0-205.jar with timestamp 1579615585045
20/01/21 14:06:25 INFO RMProxy: Connecting to ResourceManager at cxln2.c.thelab-240901.internal/10.142.1.2:8050

Execution seems to stop after this. I have to forcefully terminate it.


Hi, Sarthak.

Did you export the YARN and Hadoop configuration paths?
Please refer to the tutorial again; you should be able to do it.

All the best!


I am getting the following error when running the "export YARN_CONF_DIR = /etc/hadoop/conf/" command. How do I fix this?
-bash: export: `=': not a valid identifier
-bash: export: `/etc/hadoop/conf': not a valid identifier

Abhinav Singh

Hi @techshaj:disqus ,

No spaces can appear between the variable, the equal sign, and the value.

Hope this helps.

Thanks
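A quick illustration of the difference (sketch):

```shell
# Wrong: with spaces, bash treats '=' and the path as separate
# arguments to export, giving "export: `=': not a valid identifier"
#   export YARN_CONF_DIR = /etc/hadoop/conf/

# Right: no spaces between the name, '=', and the value
export YARN_CONF_DIR=/etc/hadoop/conf/
echo "$YARN_CONF_DIR"
```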


Hi, the video is not complete. We can hear it only till "you can". Please fix it and add the steps for how to view the job in the Job Browser in Hue. Thanks.


Hi Pallavi,
Thank you for bringing it to our attention. I am looking into it.

Regards,
Sandeep Giri


Hi Pallavi,

We've fixed the video. Please check it now.

Thanks


Hi
This video is also not working properly.
Please fix it.


Hi Manoj,

Thank you for letting us know. I have located the problem. There is a green screen.

Let me try to fix it.

Regards,
Sandeep Giri


This has been fixed. Thank you, Manoj.


Dear sir,

I am unable to log in to Hue.

Please solve the problem.
