Spark On Cluster


Apache Spark - Running On Cluster - Cluster Mode - YARN

If you already have a Hadoop cluster, you can pass yarn to the --master option.

Please note that the Spark application's tasks run inside YARN containers on the various node managers.

Before launching Spark in YARN cluster mode, you must set the two environment variables YARN_CONF_DIR and HADOOP_CONF_DIR to the location of the Hadoop configuration directory. On CloudxLab or the Hortonworks Data Platform, this location is /etc/hadoop/conf/

Let's take a look at how to launch a Spark application on YARN. For this, we are going to run the example application shipped with Spark. This application estimates the value of pi using the Monte Carlo method: it generates random points inside a square, counts how many fall within the inscribed circle, and uses the ratio of the two counts to approximate pi (the circle's area divided by the square's area is pi/4).
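The same Monte Carlo idea can be sketched in plain Python, without Spark (a toy illustration of the math, not the SparkPi source code):

```python
import random

random.seed(42)  # fixed seed so the estimate is reproducible

# Sample points uniformly in the unit square. The quarter circle of
# radius 1 has area pi/4, so (inside / total) * 4 approximates pi.
n = 100_000
inside = sum(
    1 for _ in range(n)
    if random.random() ** 2 + random.random() ** 2 <= 1.0
)
pi_estimate = 4.0 * inside / n
print(pi_estimate)
```

SparkPi applies exactly this sampling, but distributes the counting across YARN containers with a map over partitions followed by a reduce.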

To do this, we first export the two variables and then execute the command.

Let us take a look. First, log in to the CloudxLab web console or via SSH, then export the two variables and run the spark-submit command. As you can see, the value of pi is roughly 3.142344.
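Putting the steps together, the full sequence looks like this (a sketch; the examples jar path follows the HDP layout used on CloudxLab and may differ on other clusters):

```shell
# Point Spark's YARN client at the Hadoop configuration
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/

# Submit the bundled SparkPi example to YARN with 10 partitions
spark-submit --master yarn --class org.apache.spark.examples.SparkPi \
  /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10
```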

We can also take a look in Hue; the Spark job should appear among our YARN jobs. Log in to Hue with your username and password and open the Job Browser. You can see that there is only one job so far, named Spark Pi. So, our last job was indeed executed via YARN.


42 Comments

How can I access Spark UI? Please help.


Hi Ajeet,

This discussion will help you.


Hue is not available. How can I check the submitted Spark job's name?


Hi Sairam,

You can check the job in Spark UI. 

Also, if you are using YARN as the master, you can check the job status with the yarn application command.

Hope this helps.
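As a sketch, the relevant CLI calls look like this (run on the cluster; the application id shown is a placeholder, yours appears in the spark-submit log):

```shell
# List applications known to YARN, in any state
yarn application -list -appStates ALL

# Show the status of a single application; replace the id with your own
yarn application -status application_1600000000000_0001
```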


Hi Abhinav,

With YARN as the master, I am able to get information using the "yarn application" command.

If I am running in local mode, I am unable to get job information in the Spark UI. I copied the Spark UI URL and tried to access it in a web browser, but I am getting a "Connection Timeout" error.

Please help me resolve this.

Hi Sairam,

For the Spark UI, this discussion will help you.

For YARN master, you also have to pass the application id to the yarn application command. Please check the YARN documentation for the yarn application command; the application id is displayed in the log itself.

Hope this answers your query.

Thanks


Hi Abhinav,

For the Spark UI, you shared a "discussion" link. Here is what I did:

1. Logged in to e.cloudxlab.com.
2. Submitted a Spark job:
    [sairamsrinivasv5522@cxln4 spark]$ spark-submit sparkPartition02.py
3. The Spark UI started on port 4046 (please see the screenshot below).
4. Then I tried to open the link below to see the submitted job:
    http://e.cloudxlab.com:4046/jobs/

At step 4, I got the error "This site can't be reached".

Can you tell me where I am going wrong?

According to the link, I tried to open the Spark UI link http://

Maybe your firewall or network is blocking the port.


Did you try to open "http://e.cloudxlab.com:4046"?


My pi value is not showing. Please help me.


You should use this command:

spark-submit --master yarn --class org.apache.spark.examples.SparkPi /usr/hdp/current/spark-client/lib/spark-examples-1.6.3.2.6.2.0-205-hadoop2.7.3.2.6.2.0-205.jar 10

Executed the YARN script with 2>/dev/null appended to suppress the standard-error warnings.

[punitnb7985@cxln5 ~]$ spark-submit --master yarn --class org.apache.spark.examples.SparkPi /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10 2>/dev/null

Pi is roughly 3.1434751434751433


Your tutorials are really good. I can say so for sure because I have seen a lot of videos and two other tutorials.


Hi,

Thank you for your appreciation. Do reach out to us if you need any help.

Thanks.


I have a query; please answer it if possible. I have feature vectors for face similarity search. When I get a new query, how can I use Spark with other big data tools to speed up the search? The only operation is to take the cosine similarity of the query vector with all the other feature vectors and return the maximum similarity.

Thanks in advance.


Imagine that you have stored all of the face embedding vectors in files in HDFS, and that there are millions of such faces.

face_embd_txt = sc.textFile("folder_containing_vectors")

my_embd = [0.1, ...] # This is what you are looking for

# assuming a helper to_array() that parses a line into a numeric vector
similarities = face_embd_txt.map(lambda x: to_array(x).dot(my_embd))

Now, find the max:

print(similarities.max())

Since the process runs in multiple threads on multiple computers, it is going to be faster, but loading the embeddings from the files will take time.
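For the cosine-similarity step specifically, here is a minimal local sketch in plain Python (toy data and a hypothetical query; in the Spark version each comparison would run inside the map):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0]                              # hypothetical query embedding
vectors = [[0.0, 1.0], [0.6, 0.8], [2.0, 0.0]]  # stored embeddings

# Pick the stored vector most similar (by cosine) to the query
best = max(vectors, key=lambda v: cosine_similarity(query, v))
print(best)  # -> [2.0, 0.0], which points in the same direction as the query
```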


How can I check the Job Browser now? Hue no longer exists. Is there a way to check from the command line?


Hi,

I am trying to run code on YARN through the steps mentioned in the video, but after running spark-submit it has been continuously retrying for 10-15 minutes.

Retry message (sample snippet):

nternal/10.142.1.2:8050. Already tried 45 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
20/09/24 03:34:25 INFO Client: Retrying connect to server: cxln2.c.thelab-240901.internal/10.142.1.2:8050. Already tried 46 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
20/09/24 03:34:26 INFO Client: Retrying connect to server: cxln2.c.thelab-240901.internal/10.142.1.2:8050. Already tried 47 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
20/09/24 03:34:27 INFO Client: Retrying connect to server: cxln2.c.thelab-240901.internal/10.142.1.2:8050. Already tried 48 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
20/09/24 03:34:28 INFO Client: Retrying connect to server: cxln2.c.thelab-240901.internal/10.142.1.2:8050. Already tried 49 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)


Hi,

Are you still facing this issue?

Thanks.


Done perfectly


Error


Unable to log in to Hue - getting the below error.



@disqus_XTh3bUKOBh:disqus why am I getting the following error with the same command shown in the video?


Unable to execute the spark-submit command for YARN; getting the below error.

[csg1473524@cxln4 ~]$ spark-submit --master yarn --class org.apache.spark.example.sparkPi /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10
SPARK_MAJOR_VERSION is set to 2, using Spark2
java.lang.ClassNotFoundException: org.apache.spark.example.sparkPi
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:229)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:708)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[csg1473524@cxln4 ~]$


Hi, Chandrashekar.

In your spark-submit command the class name should be org.apache.spark.examples.SparkPi: note that examples is plural and the S in SparkPi must be capital.

Kindly run the below commands in your web-console, it should work fine.

export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
spark-submit --master yarn --class org.apache.spark.examples.SparkPi /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10

All the best.


Thanks :)


I exported both the YARN and Hadoop variables, but the spark-submit command for the calculation of pi did not execute. It is also showing 0 jobs in the Hue Job Browser. Screenshot attached. Please check where I am going wrong.


Hi, Sanjay.

The last spark-submit command is not correct.
Kindly try again; it should work fine.

export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
spark-submit --master yarn --class org.apache.spark.examples.SparkPi /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10

All the best!


[sarthak33987884@cxln4 ~]$ spark-submit --master yarn --class org.apache.spark.examples.SparkPi /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10
SPARK_MAJOR_VERSION is set to 2, using Spark2
20/01/21 14:06:23 INFO SparkContext: Running Spark version 2.1.1.2.6.2.0-205
20/01/21 14:06:24 INFO SecurityManager: Changing view acls to: sarthak33987884
20/01/21 14:06:24 INFO SecurityManager: Changing modify acls to: sarthak33987884
20/01/21 14:06:24 INFO SecurityManager: Changing view acls groups to:
20/01/21 14:06:24 INFO SecurityManager: Changing modify acls groups to:
20/01/21 14:06:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sarthak33987884); groups with view permissions: Set(); users with modify permissions: Set(sarthak33987884); groups with modify permissions: Set()
20/01/21 14:06:24 INFO Utils: Successfully started service 'sparkDriver' on port 42959.
20/01/21 14:06:24 INFO SparkEnv: Registering MapOutputTracker
20/01/21 14:06:24 INFO SparkEnv: Registering BlockManagerMaster
20/01/21 14:06:24 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/01/21 14:06:24 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/01/21 14:06:24 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-122b3317-b45c-4d8d-b25e-c0c591e436ac
20/01/21 14:06:24 INFO MemoryStore: MemoryStore started with capacity 114.6 MB
20/01/21 14:06:24 INFO SparkEnv: Registering OutputCommitCoordinator
20/01/21 14:06:24 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/01/21 14:06:25 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.142.1.4:4040
20/01/21 14:06:25 INFO SparkContext: Added JAR file:/usr/hdp/current/spark-client/lib/spark-examples-1.6.3.2.6.2.0-205-hadoop2.7.3.2.6.2.0-205.jar at spark://10.142.1.4:42959/jars/spark-examples-1.6.3.2.6.2.0-205-hadoop2.7.3.2.6.2.0-205.jar with timestamp 1579615585045
20/01/21 14:06:25 INFO RMProxy: Connecting to ResourceManager at cxln2.c.thelab-240901.internal/10.142.1.2:8050

Execution seems to stop after this. I have to forcefully terminate it.


Hi, Sarthak.

Did you export the YARN and Hadoop configuration paths?
Please refer to the tutorial again; you should be able to do it.

All the best!


I am getting the following error when running the "export YARN_CONF_DIR = /etc/hadoop/conf/" command. How do I fix this?
-bash: export: `=': not a valid identifier
-bash: export: `/etc/hadoop/conf': not a valid identifier

Abhinav Singh

Hi @techshaj:disqus ,

No spaces can appear between the variable, the equal sign, and the value.

Hope this helps.

Thanks
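A quick illustration of the difference (sketch):

```shell
# Wrong: with spaces, bash treats '=' and the path as separate
# arguments to export, giving "export: `=': not a valid identifier"
#   export YARN_CONF_DIR = /etc/hadoop/conf/

# Right: no spaces between the name, '=', and the value
export YARN_CONF_DIR=/etc/hadoop/conf/
echo "$YARN_CONF_DIR"
```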


Hi, the video is not complete. We can hear it only till "you can". Please fix it and add the steps for how to view the job in the Job Browser in Hue. Thanks.


Hi Pallavi,
Thank you for bringing it to our attention. I am looking into it.

Regards,
Sandeep Giri


Hi Pallavi,

We've fixed the video. Please check it now.

Thanks


Hi
This video is also not working properly.
Please fix it.


Hi Manoj,

Thank you for letting us know. I have located the problem. There is a green screen.

Let me try to fix it.

Regards,
Sandeep Giri


This has been fixed. Thank you, Manoj.


Dear sir,

I am unable to log in to Hue.

Please solve the problem.
